This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

Machine learning is an effective data-driven tool that is being widely used to extract valuable patterns and insights from data. Specifically, predictive machine learning models are very important in health care for clinical data analysis. The machine learning algorithms that generate predictive models often require pooling data from different sources to discover statistical patterns or correlations among different attributes of the input data. The primary challenge is to fulfill one major objective: preserving the privacy of individuals while discovering knowledge from data.

Our objective was to develop a hybrid cryptographic framework for performing regression analysis over distributed data in a secure and efficient way.

Existing secure computation schemes are not suitable for processing the large-scale data that are used in cutting-edge machine learning applications. We designed, developed, and evaluated a hybrid cryptographic framework that securely performs regression analysis, a fundamental machine learning algorithm, using somewhat homomorphic encryption together with Intel Software Guard Extensions (Intel SGX), a recently introduced secure hardware component, to ensure both privacy and efficiency.

Experimental results demonstrate that our proposed method provides a better trade-off between security and efficiency than solely secure hardware-based methods. In addition, there is no approximation error: the computed model parameters are identical to the plaintext results.

To the best of our knowledge, a secure computation model using a hybrid cryptographic framework that leverages both somewhat homomorphic encryption and Intel SGX has not been proposed or evaluated to date. Our proposed framework ensures data security and computational efficiency at the same time.

Machine learning algorithms are now widely used in many applications to uncover deep and predictive insights from datasets that are large scale and diverse. For instance, building predictive models from biomedical data is very important in biomedical science. Such predictive models can identify genetic risk factors for a specific disease under study and can guide medical treatment. As one example, Tabaei and Hermana formulated a predictive equation to screen for diabetes [

Machine learning thrives on growing datasets. In most cases, the more data that are fed into a machine learning system, the more it can learn and the more accurate its predictions can become. This is often summarized as “data never hurt in machine learning,” since insufficient information cannot lead to powerful learning systems. In the context of health care, building an accurate predictive model depends on the quality and quantity of aggregate clinical data, which come from different hospitals or health care institutions. Consequently, in a real-world scenario, machine learning applications use data from several sources, including genetic and genomic, clinical, and sensor data. Many new sources of data are also becoming available, for instance, data from cell phones [

The data collection, storage, and processing power of a single institution are not always adequate to handle the large-scale data used in cutting-edge machine learning applications. For rare diseases, individual institutions often do not have enough data to fit a model with sufficient statistical power. Therefore, data sharing among multiple institutions is required. However, sharing sensitive biomedical data (clinical or genomic) exposes many security and privacy threats [

In this paper, we concentrate on secure and efficient computation for a fundamental technique used in numerous learning algorithms called

To ensure the security and privacy of the sensitive data used in learning algorithms, different techniques (eg, garbled circuit [

Wu et al developed a framework, grid binary logistic regression (GLORE) [

Later, Shi et al incorporated secure multiparty computation in GLORE. Their proposed framework, secure multiparty computation framework for grid logistic regression (SMAC-GLORE) [

There are two obvious but suboptimal solutions in terms of security and efficiency. Existing fully homomorphic encryption (FHE) techniques [

In contrast, Software Guard Extensions (SGX; Intel)-based solutions are very efficient but have some security concerns resulting from the recent discovery of side-channel attacks against SGX [

Our proposed hybrid framework uses both techniques and provides a good trade-off in terms of security and efficiency.

In this paper, we propose a hybrid cryptographic framework for secure and efficient regression analysis (both linear and logistic). Our proposed framework leverages the best features of two secure computation schemes: somewhat homomorphic encryption (SWHE) and secure hardware (Intel SGX). In this framework, data reside at the data owner’s end. We assume that data are horizontally partitioned, that is, all the records share the same attributes. Inspired by GLORE [

We summarize our contributions as follows: (1) We address the limitations of existing secure computation schemes and propose a hybrid secure computation model for performing regression analysis over distributed data, which is more efficient and robust. (2) We designed the framework in such a way that homomorphic multiplication, an expensive operation, is never necessary. In addition, we do not need any bootstrapping or relinearization operation. (3) In our proposed approach, a significant portion of the computation is performed at the data owner’s end on plaintext. At the central server, after the homomorphic addition operations, further computation is performed inside secure hardware on plaintext. Since most of the operations are performed on plaintext, our proposed approach is very efficient. In addition, because it avoids any kind of approximation technique, our proposed method does not introduce any precision loss in the final output.

In

The idea of an encryption scheme that is capable of performing arbitrary computation on encrypted data was first proposed by Rivest et al [

Developing an encryption scheme that supports an arbitrary number of additions and multiplications was an open problem until 2009. Since addition and multiplication operations over integer ring $\mathbb{Z}_2$ form a complete set of operations, this type of encryption scheme supports any polynomial time computation on ciphertext. In 2009, Gentry showed the first construction of an FHE scheme [

To explain FHE, say ciphertext $c_i$ is the encrypted form of plaintext $m_i$, where $c_i$ and $m_i$ are elements of a ring (the operations of the ring are addition and multiplication). In FHE, if a function $f$ is evaluated on the ciphertexts, then $\mathrm{Decrypt}(f(c_1, c_2, \ldots, c_n)) = f(m_1, m_2, \ldots, m_n)$. Generally,

In the existing FHE schemes, a certain amount of noise needs to be introduced in the ciphertexts to ensure data confidentiality. This noise grows while performing homomorphic operations on ciphertexts. In particular, a homomorphic multiplication operation increases the size of the ciphertext abruptly. For instance, if 2 input ciphertexts have size

In use cases where only a predetermined number of computational operations needs to be done, the costly bootstrapping process can be avoided by using an SWHE scheme [

Partial list of homomorphic encryption schemes.

Cryptosystem | Homomorphism |
Goldwasser and Micali [ | Additive |
Rivest et al [ | Multiplicative |
Boneh et al [ | Both |
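As an illustration of what an additive homomorphism means in practice, the following is a minimal sketch using a toy Paillier cryptosystem. Paillier is not listed in the table above and is not the scheme used in this paper (the framework uses SWHE); it is chosen here only because it fits in a few lines. The tiny primes and the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are assumptions made for this example and offer no real security.

```python
import math
import random

def keygen(p=61, q=53):
    """Toy Paillier key generation from two tiny primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) with generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)

n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 1200), encrypt(n, 345))
total = decrypt(n, priv, c_sum)  # equals 1200 + 345 as long as the sum is below n
```

The party that multiplies the two ciphertexts never sees 1200 or 345; only the holder of the private key can decrypt the sum.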

Intel SGX is a collection of extensions to the Intel architecture that mostly concentrates on the issue of running applications on a remote machine managed by an untrusted party. SGX enables parts of an application to run within secure portions of the central processing unit called

SGX enclaves are generated without privacy-sensitive information. Privacy-sensitive information is provisioned after the enclave has been appropriately instantiated. This process of demonstrating that an application has been correctly instantiated within an enclave is called

Once an enclave is instantiated, SGX protects its data as long as they are kept within the enclave. However, when the enclave process terminates, the enclave is destroyed and all related data are lost. Therefore, data that will be needed later should be stored outside the enclave.

Our proposed framework has three main entities (

Data owners are geographically distributed parties that possess databases. Data can come from a variety of sources, including cell phones, wearable sensors, and relational databases. Data owners send encrypted intermediary results to the central server so that it can analyze the combined dataset.

The key manager generates and distributes the cryptographic keys that are used for data encryption and decryption in different stages of our proposed framework. Each data owner gets a public key from the key manager and uses it for encrypting data.

The central server maintains communication with all the other entities of the framework. It receives data from the data owners and computes the final result using SWHE and secure hardware.

In proposing this framework, our goal was to guarantee the confidentiality of data provided by different data owners. We assume that the central server is a semihonest party (also referred to as honest-but-curious), that is, it obeys the system protocol but may try to infer sensitive information by analyzing the system logs or received information [

We assume that the computation runs in an SGX-enabled central server. SGX architecture enables the central server to perform any computation securely on data provided by different data owners. We assume that the processor of the central server works properly and is not compromised. We trust the design and implementation of SGX and all cryptographic operations performed by it.

In general, side-channel attacks against SGX can be classified into two categories: physical attacks (where the attacker has physical access to the machine) and software attacks (these are launched by any malicious software running in the same machine) [

Block diagram of the system architecture. SGX: Software Guard Extensions.

There is another type of well-known software attack, which is called a

We did not consider the aspects of adversarial machine learning through obtained outputs. Adversarial parties may try to infer sensitive attributes of data by model inversion attacks [

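Suppose we are given a set of paired observations $(x_i, y_i)$ for $i = 1, 2, \ldots, n$, and we model them with the line $y = \beta_1 + \beta_2 x$, which has unknown parameters $\beta_1, \beta_2$. The purpose is to explain the correlation between variable $x$ and variable $y$ through $\beta_1 + \beta_2 x$. With an error term $\varepsilon_i$ for each observation, we have $y_1 = \beta_1 + \beta_2 x_1 + \varepsilon_1$, $y_2 = \beta_1 + \beta_2 x_2 + \varepsilon_2$, ..., and $y_n = \beta_1 + \beta_2 x_n + \varepsilon_n$.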

We can formulate this regression model using the matrix in

Equations used in developing the framework.

In this way, the simple linear regression function can be represented by a short and simple equation: $Y = X\beta + \varepsilon$ (Equation 1).

The linear regression model with several explanatory variables is known as

Here, $x_{1i} = 1$, for

It is noteworthy that Equation 1 is also applicable for multiple linear regression.

Using the ordinary least squares estimation technique, we can show that $\beta = (X^{T}X)^{-1}X^{T}Y$.

For secure linear regression over distributed data, each data owner $D_i$ computes $X_i^{T}X_i$ and $X_i^{T}Y_i$ locally on plaintext. $D_i$ then encrypts $X_i^{T}X_i$ and $X_i^{T}Y_i$ using homomorphic encryption. After receiving these intermediary results from all of the data owners, the central server then adds these using homomorphic addition operations to construct $X^{T}X$ and $X^{T}Y$.
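The reason that addition alone suffices at the central server can be written out explicitly. The following short derivation assumes horizontally partitioned data held by $m$ data owners (the symbol $m$ and the block notation are introduced here for illustration), so that the pooled design matrix and response are the row-wise stacks of the local ones.

```latex
X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{bmatrix},
\qquad
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{bmatrix}
\quad\Longrightarrow\quad
X^{T}X = \sum_{i=1}^{m} X_i^{T}X_i,
\qquad
X^{T}Y = \sum_{i=1}^{m} X_i^{T}Y_i,

\beta = (X^{T}X)^{-1}X^{T}Y
      = \Bigl(\sum_{i=1}^{m} X_i^{T}X_i\Bigr)^{-1}
        \Bigl(\sum_{i=1}^{m} X_i^{T}Y_i\Bigr).
```

Only the two sums have to be formed over ciphertexts; the matrix inversion and the final product can be carried out on plaintext.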

Each data owner $D_i$ provides encrypted $X_i^{T}X_i$ and $X_i^{T}Y_i$.

Perform homomorphic addition over $X_i^{T}X_i$ for each data owner $D_i$.

Perform homomorphic addition over $X_i^{T}Y_i$ for each data owner $D_i$.

Send encrypted $X^{T}X$ and $X^{T}Y$ to the enclave.

Inside enclave, decrypt encrypted $X^{T}X$ and $X^{T}Y$.

Inside enclave, compute $(X^{T}X)^{-1}$.

Finally, compute β inside enclave.
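The steps above can be sketched in plain Python. This is a minimal illustration of the data flow only, under stated assumptions: encryption, homomorphic addition, and the enclave are replaced by ordinary plaintext arithmetic, the function names (`local_summaries`, `aggregate`, `solve`) and the toy data are hypothetical, and the linear system is solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def local_summaries(X, y):
    """Data-owner step: compute X_i^T X_i and X_i^T Y_i on local plaintext."""
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return xtx, xty

def aggregate(summaries):
    """Server step: element-wise sums (stand-in for homomorphic addition)."""
    k = len(summaries[0][1])
    xtx = [[sum(s[0][a][b] for s in summaries) for b in range(k)] for a in range(k)]
    xty = [sum(s[1][a] for s in summaries) for a in range(k)]
    return xtx, xty

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Two hypothetical data owners; the first column of X is the constant 1.
X1, y1 = [[1, 1], [1, 2], [1, 3]], [3, 5, 7]
X2, y2 = [[1, 4], [1, 5]], [9, 11]

xtx, xty = aggregate([local_summaries(X1, y1), local_summaries(X2, y2)])
beta = solve(xtx, xty)  # "enclave" step, done here on plaintext

# Reference: the same fit on the pooled data.
beta_pooled = solve(*local_summaries(X1 + X2, y1 + y2))
```

Because the toy data satisfy y = 1 + 2x exactly, both `beta` and `beta_pooled` recover the intercept 1 and the slope 2 up to floating-point error.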

Sequence diagram of our proposed framework. Ack: acknowledge; SGX: Software Guard Extensions.

Logistic regression extends the principles of multiple linear regression to the case where the dependent variable

Instead of modeling the dependent variable directly, logistic regression models the probability of the dependent variable. Logistic regression uses the linear regression equation (Equation 2). However, in that equation, the value of the dependent variable can fall outside [0, 1]. Therefore, a nonlinear transformation is used, which is called the logistic function: $p(x_1, x_2, \ldots, x_k) = [\exp(\beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k)]/[1 + \exp(\beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k)]$, where $\beta_1, \beta_2, \ldots, \beta_k$ are unknown constants analogous to the multiple linear regression model. $p(x_1, x_2, \ldots, x_k)$ denotes the probability that input $(x_1, x_2, \ldots, x_k)$ belongs to default class (

Logistic regression models are generally fit by maximum likelihood by using the conditional probability of

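Let $Y$ denote the vector of $y_i$ values, $X$ the matrix of $x_i$ values, $p$ the vector of fitted probabilities with $i$th element $p(x_i;\beta^{old})$, and $W$ a diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{old})(1 - p(x_i;\beta^{old}))$. Then a Newton step is as follows:

$$\beta^{new} = \beta^{old} + (X^{T}WX)^{-1}X^{T}(Y - p) = (X^{T}WX)^{-1}X^{T}W\left(X\beta^{old} + W^{-1}(Y - p)\right) = (X^{T}WX)^{-1}X^{T}Wz$$

In the second and third steps, the Newton step is expressed as a weighted least squares step, with the response $z = X\beta^{old} + W^{-1}(Y - p)$.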

In practice, the

For secure logistic regression over distributed data, each data owner $D_i$ computes $X_i^{T}W_iX_i$ and $X_i^{T}(Y_i - p_i)$ locally on plaintext. $D_i$ then encrypts $X_i^{T}W_iX_i$ and $X_i^{T}(Y_i - p_i)$ using homomorphic encryption. After receiving these intermediary results from all the data owners, the central server then adds these using homomorphic addition operations to construct $X^{T}WX$ and $X^{T}(Y - p)$. The central server updates β inside the enclave and sends the new β to the data owners. Each data owner then recomputes $X_i^{T}W_iX_i$ and $X_i^{T}(Y_i - p_i)$ using the new β (received from the central server) and sends these intermediary results to the central server. The central server then updates β using the newly received $X_i^{T}W_iX_i$ and $X_i^{T}(Y_i - p_i)$. In this way, iterations continue until the parameters converge.

We developed our proposed framework using C++. For SWHE, we used the Simple Encrypted Arithmetic Library (SEAL) [

Each data owner $D_i$ provides encrypted $X_i^{T}W_iX_i$ and $X_i^{T}(Y_i - p_i)$, and β is initialized to an all-zero vector.

1. Receive encrypted $X_i^{T}W_iX_i$ and $X_i^{T}(Y_i - p_i)$ from each data owner $D_i$.

2. Perform homomorphic addition over $X_i^{T}W_iX_i$ for each data owner $D_i$.

3. Perform homomorphic addition over $X_i^{T}(Y_i - p_i)$ for each data owner $D_i$.

4. Send encrypted $X^{T}WX$ and $X^{T}(Y - p)$ to the enclave.

5. Inside enclave, decrypt $X^{T}WX$ and $X^{T}(Y - p)$.

6. Update $\beta^{new} = \beta^{old} + (X^{T}WX)^{-1}X^{T}(Y - p)$.

7. If the stopping criteria are satisfied, then stop; otherwise, send β to each data owner and go to step 1.
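The iteration above can be sketched as follows. As with any plaintext illustration of this protocol, the sketch makes several assumptions: encryption, homomorphic addition, and the enclave are replaced by ordinary arithmetic, the function names (`local_stats`, `solve`, `fit`), the stopping rule (largest change in β below a tolerance), and the toy data are all hypothetical, and the Newton update is obtained by solving a linear system instead of inverting the matrix.

```python
import math

def local_stats(X, y, beta):
    """Data-owner step: X_i^T W_i X_i and X_i^T (Y_i - p_i) for the current beta."""
    k = len(beta)
    xtwx = [[0.0] * k for _ in range(k)]
    xtr = [0.0] * k
    for row, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        p = 1.0 / (1.0 + math.exp(-z))  # logistic function
        w = p * (1.0 - p)               # diagonal weight
        for a in range(k):
            xtr[a] += row[a] * (yi - p)
            for b in range(k):
                xtwx[a][b] += w * row[a] * row[b]
    return xtwx, xtr

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit(owners, tol=1e-10, max_iter=25):
    """Server loop: sum the local statistics, take a Newton step, repeat."""
    k = len(owners[0][0][0])
    beta = [0.0] * k  # all-zero initialization
    for _ in range(max_iter):
        stats = [local_stats(X, y, beta) for X, y in owners]
        H = [[sum(s[0][a][b] for s in stats) for b in range(k)] for a in range(k)]
        g = [sum(s[1][a] for s in stats) for a in range(k)]
        step = solve(H, g)  # (X^T W X)^{-1} X^T (Y - p)
        beta = [bi + si for bi, si in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Two hypothetical data owners; the first column of X is the constant 1.
X1, y1 = [[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0]], [0, 0, 1, 0]
X2, y2 = [[1, 2.5], [1, 3.0], [1, 3.5], [1, 4.0]], [1, 0, 1, 1]

beta = fit([(X1, y1), (X2, y2)])
beta_pooled = fit([(X1 + X2, y1 + y2)])
```

In every round each owner shares only two small aggregates, and the distributed fit agrees with the fit on the pooled records up to floating-point rounding.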

Parameters used for the Simple Encrypted Arithmetic Library.

Parameters | Value |
Polynomial modulus | $x^{1024}+1$ |
Plaintext modulus | 1<<8 |
Decomposition bit count | 32 |
No. of coefficients reserved for fractional part | 64 |

Size of datasets used for experiments.

Records | Haberman | Low Birth Weight Study |
No. of instances | 270 | 488 |
No. of features | 3 | 8 |

We performed experiments on a machine with an Intel Core i7-6700 (3.40 GHz) processor (Intel Corporation, Santa Clara, CA, USA) and 8 GB of memory. We used Intel SGX software development kit version 1.7. We simulated 2 data owners and the central server on this machine.

We performed experiments using Haberman’s survival dataset from the University of California, Irvine, Machine Learning Repository [

Experimental results for computation time.

Regression analysis and method | Haberman | Low Birth Weight Study |
Linear regression | | |
Plaintext (ms) | 6 | 25 |
Proposed method (s) | 8.991 | 39.382 |
Secure hardware (SWHE^{a}) (s) | 259.908 | 880.228 |
Secure hardware (AES^{b}) (s) | 4.30 | 8.54 |
Logistic regression | | |
Plaintext (ms) | 171 | 886 |
Proposed method (s) | 27.037 | 162.544 |
Secure hardware (SWHE) (s) | 264.669 | 904.718 |
Secure hardware (AES) (s) | 4.65 | 8.64 |

^{a}SWHE: somewhat homomorphic encryption.

^{b}AES: Advanced Encryption Standard.

Storage overhead for the secure hardware approach.

Overhead before and after encryption | Haberman | Low Birth Weight Study |
Before encryption (kB) | 3.8 | 28 |
After encryption (SWHE^{a}) (MB) | 30.3 | 123 |
After encryption (AES^{b}) (kB) | 36 | 143 |

^{a}SWHE: somewhat homomorphic encryption.

^{b}AES: Advanced Encryption Standard.

We want to emphasize that, although the secure hardware (Advanced Encryption Standard [AES]) method is faster, state-of-the-art attack models targeting SGX show that solely secure hardware-based approaches might expose data from participating institutions to potential attackers (as explained above). Our method, although slower (by a factor of roughly 2 to 19 in our experiments), preserves such institutional privacy by combining the local inputs without decrypting them; therefore, it offers a stronger security guarantee without imposing too much computation or storage cost. In this way, our proposed hybrid model provides a good trade-off in terms of security and efficiency.
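The trade-off can be made concrete with a little arithmetic on the reported computation times. The sketch below copies the timings (in seconds) from the computation time table; the dictionary layout and variable names are hypothetical, and labeling the first and second blocks of that table as linear and logistic regression is an assumption based on the order in which the paper presents the two analyses.

```python
# Computation times in seconds, copied from the computation time table.
# Keys: (analysis, dataset). The "linear"/"logistic" labels are an assumption.
times = {
    ("linear", "Haberman"): {"proposed": 8.991, "sgx_swhe": 259.908, "sgx_aes": 4.30},
    ("linear", "Low Birth Weight"): {"proposed": 39.382, "sgx_swhe": 880.228, "sgx_aes": 8.54},
    ("logistic", "Haberman"): {"proposed": 27.037, "sgx_swhe": 264.669, "sgx_aes": 4.65},
    ("logistic", "Low Birth Weight"): {"proposed": 162.544, "sgx_swhe": 904.718, "sgx_aes": 8.64},
}

# How many times slower the proposed method is than the AES-based enclave approach.
slowdown_vs_aes = {k: t["proposed"] / t["sgx_aes"] for k, t in times.items()}

# How many times faster the proposed method is than running SWHE entirely in the enclave.
speedup_vs_swhe = {k: t["sgx_swhe"] / t["proposed"] for k, t in times.items()}

for key in times:
    print(key, round(slowdown_vs_aes[key], 2), round(speedup_vs_swhe[key], 2))
```

On these numbers, the proposed method is between about 2 and 19 times slower than the AES-based approach and between about 6 and 29 times faster than the approach that runs SWHE entirely inside the enclave.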

There is a homomorphic encryption-based implementation of linear regression [

Hall et al [^{–3}. Precision can be slightly improved by choosing greater values for the 2 constants used by their method. However, this would require a larger public key, which would introduce significant computation overhead. In contrast, in our proposed method, there is no approximation error: the regression coefficients are completely identical to the plaintext results.

In the Methods (Threat Model subsection), we discussed the security of SGX, specifically different side-channel attacks on SGX, and how we treat those attacks in our proposed framework. Addressing these attacks, we developed our framework in such a way that it can protect institutional privacy by combining the local inputs of participating institutions without decrypting them. This approach provides a higher layer of security without imposing too much computational cost.

In our proposed method, only intermediate values (eg, ^{T }^{T }

A symmetric cryptosystem like AES requires

There are some limitations of our proposed framework.

First, we did not consider the issue of model privacy. Several works based on differential privacy have addressed inference attacks (eg, model privacy [

Second, the central server of our proposed method must be SGX enabled; that is, it must use an Intel processor of sixth generation or later.

Third, since computing coefficients for logistic regression requires multiple iterations, all parties must stay synchronized until the coefficients converge. Linear regression, however, does not require multiple iterations, so in that case parties can go offline right after sending their intermediary results.

Others have addressed training machine learning models (eg, support vector machines [

The Intel SGX feature is available in all Intel Skylake and Kaby Lake processors. The price of an Intel Skylake or Kaby Lake processor is identical to that of processors from other vendors (having similar configuration). Price ranges from US $42 to US $1207 depending on configuration [

In this age of big data, data need to be analyzed to uncover valuable insights and patterns. But this kind of analysis poses a threat to individual privacy, since data often contain sensitive information. In this paper, we address this data security and privacy issue and propose a hybrid cryptographic framework to overcome the limitations of the existing cryptographic techniques. We think that secure hardware–assisted predictive analysis of biomedical data is very promising for health care and medical research.

In future work, we will investigate the applicability of our proposed method to other learning algorithms such as neural networks, support vector machines, and decision trees.

Related works.

AES: Advanced Encryption Standard

FHE: fully homomorphic encryption

GLORE: grid binary logistic regression

HIPAA: Health Insurance Portability and Accountability Act

PIPEDA: Personal Information Protection and Electronic Documents Act

SEAL: Simple Encrypted Arithmetic Library

SGX: Software Guard Extensions

SWHE: somewhat homomorphic encryption

This work was funded in part by the National Human Genome Research Institute (R00HG008175) and the National Institute of Biomedical Imaging and Bioengineering (U01EB023685), the Natural Sciences and Engineering Research Council of Canada Discovery Grants (RGPIN-2015-04147), the National Institute of General Medical Sciences (R01GM118574 and R01GM114612), and the University Research Grants Program from the University of Manitoba, Winnipeg, Manitoba, Canada.

None declared.