Secure and Efficient Regression Analysis Using a Hybrid Cryptographic Framework: Development and Evaluation

Background: Machine learning is an effective data-driven tool that is widely used to extract valuable patterns and insights from data. In particular, predictive machine learning models are very important in health care for clinical data analysis. The machine learning algorithms that generate predictive models often require pooling data from different sources to discover statistical patterns or correlations among different attributes of the input data. The primary challenge is to preserve the privacy of individuals while discovering knowledge from the data.
Objective: Our objective was to develop a hybrid cryptographic framework for performing regression analysis over distributed data in a secure and efficient way.
Methods: Existing secure computation schemes are not suitable for processing the large-scale data used in cutting-edge machine learning applications. We designed, developed, and evaluated a hybrid cryptographic framework that can securely perform regression analysis, a fundamental machine learning algorithm, using somewhat homomorphic encryption together with a newly introduced secure hardware component, Intel Software Guard Extensions (Intel SGX), to ensure both privacy and efficiency at the same time.
Results: Experimental results demonstrate that our proposed method provides a better trade-off between security and efficiency than solely secure hardware-based methods. Moreover, there is no approximation error: the computed model parameters are identical to the plaintext results.
Conclusions: To the best of our knowledge, a secure computation model of this kind, using a hybrid cryptographic framework that leverages both somewhat homomorphic encryption and Intel SGX, has not been proposed or evaluated to date. Our proposed framework ensures data security and computational efficiency at the same time.


Garbled Circuit
In the mid-1980s, Yao proposed garbled circuits [17] in the context of secure two-party computation. A garbled circuit can compute a function f on an input x without exposing anything about f or x, so a malicious party cannot learn anything about the function f or the input x other than the result f(x). Note that the term circuit in this context means a Boolean circuit. Valeria et al. [13] implemented an evaluator for computing regression coefficients that uses linear homomorphism in the first phase to perform all the linear operations. In the second phase, it uses a garbled circuit for the non-linear computations, since garbled circuits are much more efficient than homomorphic encryption for this purpose. However, garbled circuits have some critical issues:
1. First, standard garbled circuits suffer from one limitation: they offer no security if used on more than one input. In other words, garbled circuits are not reusable. Consequently, evaluating the circuit on a new input requires a completely new garbling of the circuit.
2. Another problem with garbled circuits is that the communication complexity is proportional to the size of the circuit, which makes them inefficient from a communication perspective [18, p. 22]. With homomorphic encryption, the communication complexity is much lower. For instance, consider a scenario where encrypted clinical data are stored in the cloud and a researcher executes private prediction queries on this massive clinical data set. With garbled circuits, the communication complexity of a private query is extremely high, since the size of the garbled circuit that represents the query is proportional to the size of the data set. In contrast, the communication complexity of such a query in a homomorphic encryption scheme is proportional to the size of the encrypted response to the query.
3. Finally, garbled circuit-based techniques need complex circuit design and optimization for each particular computation. Thus, they are not very flexible.
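
The first two issues can be made concrete with a minimal sketch of a single garbled AND gate. This is a textbook-style illustration, not the construction used in [13] or [17]: the function names `garble_and_gate` and `evaluate_gate`, the use of SHA-256 as the row cipher, and the zero-byte tag that marks the correct row are all assumptions made for this example. In a real protocol the evaluator would obtain its input labels through oblivious transfer, which is omitted here.

```python
import hashlib
import os
import secrets

LABEL_LEN = 16  # bytes per wire label (assumed size for this sketch)


def _pad(label_a: bytes, label_b: bytes) -> bytes:
    """Derive a 32-byte one-time pad from the two input wire labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))


def garble_and_gate():
    """Garble one AND gate: returns the secret wire labels and the public table."""
    # Each wire gets two random labels: index 0 encodes bit 0, index 1 encodes bit 1.
    labels = {w: (os.urandom(LABEL_LEN), os.urandom(LABEL_LEN))
              for w in ("a", "b", "out")}
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][bit_a & bit_b]
            # Encrypt the output label plus a zero tag under the two input labels.
            plaintext = out_label + bytes(LABEL_LEN)
            table.append(_xor(_pad(labels["a"][bit_a], labels["b"][bit_b]), plaintext))
    # Shuffle so that row position does not reveal the input bits.
    secrets.SystemRandom().shuffle(table)
    return labels, table


def evaluate_gate(table, label_a: bytes, label_b: bytes) -> bytes:
    """Given one label per input wire, recover exactly one output label."""
    pad = _pad(label_a, label_b)
    for row in table:
        candidate = _xor(pad, row)
        if candidate[LABEL_LEN:] == bytes(LABEL_LEN):  # zero tag marks the right row
            return candidate[:LABEL_LEN]
    raise ValueError("no row decrypted correctly")


if __name__ == "__main__":
    labels, table = garble_and_gate()
    out = evaluate_gate(table, labels["a"][1], labels["b"][1])
    print("output encodes bit:", labels["out"].index(out))
```

Every gate contributes four ciphertexts that must be sent to the evaluator, so the transferred data grows linearly with the number of gates. An evaluator who obtained both labels of an input wire, as would happen if the same garbling were evaluated on two different inputs, could decrypt more than one row, which is why a fresh garbling is needed for each input.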

Differential Privacy
Solutions based on differential privacy [19] add noise to the data to preserve individual privacy.
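
As a rough illustration of noise-based protection, the sketch below applies the Laplace mechanism, one common way to achieve differential privacy, to a mean query. It is not taken from [19]: the function name `private_mean`, the clipping bounds, and the example values are hypothetical, and the sensitivity formula assumes that neighboring data sets differ in the value of one record while the number of records stays fixed.

```python
import numpy as np


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with epsilon-differential privacy (sketch).

    Assumes every record is clipped to [lower, upper] and that neighboring
    data sets have the same size and differ in one record.
    """
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    # Changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    ages = [34, 51, 47, 62, 29, 75, 58]  # made-up example values
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_mean(ages, 0, 100, eps, rng))
```

Smaller values of `epsilon` give stronger privacy but a larger noise scale, which is the utility cost discussed in this section.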
There is also some work on differentially private regression analysis [20,9,10,11]. The solution proposed by Chaudhuri et al. [20,9] is applicable only for linear regression. Lei [10] proposed another technique that first generates a noisy histogram from the input data, then generates synthetic data from the noisy histogram while preserving the statistical properties of the histogram, and finally uses the synthetic data to compute the regression results. Finally, Zhang et al. [11] proposed a solution based on the functional mechanism: instead of perturbing the results, they perturb the objective function (cost function) of the regression analysis.
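
The idea of perturbing the objective function rather than the result can be sketched for least-squares regression as follows. This is a simplified illustration and not the full method of [11]: the function name `functional_mechanism_linreg` is hypothetical, the sensitivity bound 2(d+1)^2 is assumed to hold when each feature vector has Euclidean norm at most 1 and each target lies in [-1, 1], and the ridge term is a simple stand-in for the regularization that keeps the noisy objective well behaved.

```python
import numpy as np


def functional_mechanism_linreg(X, y, epsilon, ridge=1.0, seed=None):
    """Fit w by minimizing a noise-perturbed least-squares objective (sketch).

    The objective sum_i (y_i - x_i^T w)^2 is a polynomial in w with
    quadratic coefficients X^T X and linear coefficients -2 X^T y.
    Laplace noise is added to those coefficients, not to the fitted w.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    quad = X.T @ X          # coefficients of the degree-2 terms
    lin = -2.0 * (X.T @ y)  # coefficients of the degree-1 terms

    # Assumed sensitivity bound for normalized data (see lead-in).
    scale = 2.0 * (d + 1) ** 2 / epsilon

    # Draw noise for the upper triangle and mirror it to keep the matrix symmetric.
    upper = np.triu(rng.laplace(0.0, scale, size=(d, d)))
    quad_noisy = quad + upper + np.triu(upper, 1).T + ridge * np.eye(d)
    lin_noisy = lin + rng.laplace(0.0, scale, size=d)

    # Minimizer of w^T Q w + l^T w satisfies 2 Q w + l = 0.
    return np.linalg.solve(quad_noisy, -0.5 * lin_noisy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(500, 2)) / np.sqrt(2)  # row norms at most 1
    y = X @ np.array([0.5, -0.3])                        # targets within [-1, 1]
    print(functional_mechanism_linreg(X, y, epsilon=1.0, seed=1))
```

Because the noise enters the coefficients of the objective, the fitted parameters can drift far from the non-private solution when `epsilon` is small or the data set is small.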
The noise added by differentially private techniques reduces data utility and makes statistical analysis very difficult. Differential privacy also requires a trusted entity that can access the integrated data set. In addition, in a client-server architecture, where a client executes queries on a database stored on the server, differential privacy is not applicable to several types of queries [21].

Secure Hardware
Intel Software Guard Extensions (Intel SGX) [22,23] is a set of extensions to the Intel architecture that provides support for running an application inside a protected execution area of the processor. Among the state-of-the-art secure computation schemes, Intel SGX is the most efficient. For example, an SGX-based MapReduce framework [24] demonstrates an overhead of only 8% to achieve read/write integrity. This is a significant benefit of SGX in comparison with other secure computation techniques such as garbled circuits and homomorphic encryption, which generally increase the computational cost several times over.
To the best of our knowledge, there are no secure hardware-based techniques that target regression analysis. However, Ohrimenko et al. [16] worked on some machine learning algorithms using Intel SGX.
Although SGX is very efficient in terms of computation and storage, its security guarantees are yet to be fully established because of some recently proposed side-channel attacks against SGX [25,26,27].