This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

Learning a model without accessing raw data has been an intriguing idea to security and machine learning researchers for years. In an ideal setting, we want to encrypt sensitive data, store them on a commercial cloud, and run certain analyses without ever decrypting the data, thereby preserving privacy. Homomorphic encryption is a promising candidate for secure data outsourcing, but supporting real-world machine learning tasks with it remains very challenging. Existing frameworks can only handle simplified cases with low-degree polynomials, such as the linear means classifier and linear discriminative analysis.

The goal of this study is to provide practical support for mainstream learning models (eg, logistic regression).

We adapted a novel homomorphic encryption scheme optimized for computation on real numbers. We devised (1) a least squares approximation of the logistic function for accuracy and efficiency (ie, reduced computation cost) and (2) new packing and parallelization techniques.

Using real-world datasets, we evaluated the performance of our model and demonstrated its feasibility in speed and memory consumption. For example, it took approximately 116 minutes to obtain the training model from the homomorphically encrypted Edinburgh dataset. In addition, it gives fairly accurate predictions on the testing dataset.

We present the first homomorphically encrypted logistic regression outsourcing model based on the critical observation that the precision loss of classification models is sufficiently small that the decision plane stays essentially the same.

Biomedical data are highly sensitive and often contain important personal information about individuals. In the United States, health care data sharing is protected by the Health Insurance Portability and Accountability Act [

An intuitive solution is to train a model without accessing the data and only obtain the estimated model parameters in a global manner. Assuming summary statistics can be shared, this can be done in a joint manner and we have developed the “grid logistic regression” [

Many medical decision-making systems rely on the logistic regression model [

Graepel et al [

In the current literature, most similar to our work are Aono et al [

Two secure models: (a) secure storage and computation outsourcing and (b) secure model outsourcing.

Research works in secure analysis.

Reference | Problem | Techniques |

Graepel et al [ | Linear means classifier/discriminative analysis | Homomorphic encryption |

Bos et al [ | Prediction using a learned logistic regression model | Homomorphic encryption |

Dowlin et al [ | Prediction using learned neural networks | Homomorphic encryption |

Aono et al [ | Logistic regression | Additive homomorphic encryption |

Mohassel et al [ | Logistic regression | Multiparty computation |

This work | Logistic regression | Homomorphic encryption |

This method takes a completely different approach (garbled circuit and secret sharing vs homomorphic encryption), and its assumptions differ widely from ours (secure multiparty computation vs secure outsourcing). There are several prominent challenges related to scalability and efficiency. Traditional methods cannot handle many iterations of multiplications, which leads to a deep circuit and exponential growth in the computational cost and storage size of the ciphertext. On the other hand, it is a nontrivial task to approximate certain critical functions in machine learning models using only low-degree polynomials. Naive approximation may lead to large errors and make the solutions intractable. Our framework proposes novel methods to handle these challenges and makes it possible to learn a logistic regression model on encrypted data based entirely on homomorphic encryption.

Logistic regression is a widely used learning model in biomedicine [. The training data consist of n samples (x_i, y_i) of a vector of covariates x_i = (x_{i1}, ..., x_{id}) and a class label y_i, where y_i = +1 / −1 for binary classification. The model looks like:

Pr(y_i | x_i, β) = σ(y_i ⋅ β^T (1, x_i))

for the sigmoid function σ(t) = 1 / (1 + e^{−t}), where β = (β_0, β_1, ..., β_d) are the model parameters to be estimated. Training methods of logistic regression aim to find the optimal parameters β that maximize the likelihood of the observed data.
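As a plaintext reference point, the model above can be sketched in a few lines of Python (an illustrative sketch only; the function and variable names are ours):

```python
import math

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + math.exp(-t))

def predict_prob(beta, x):
    # Pr(y = +1 | x, beta) = sigma(beta^T (1, x)),
    # with beta = (beta_0, ..., beta_d) and x = (x_1, ..., x_d)
    t = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(t)
```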

Homomorphic encryption is an encryption technique that allows computations on ciphertexts and generates encrypted results that match those of plaintext computation. We adopted a special cryptosystem developed by Cheon et al [

A unique property of this cryptosystem is the following rescaling procedure, which plays an important role in controlling the magnitude of messages and, therefore, achieving the efficiency of approximate computation. The rescaling procedure converts an encryption of a message m with ciphertext modulus q into an encryption of p^{-1} ⋅ m under the same secret key but a smaller modulus p^{-1} ⋅ q.
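The effect of rescaling can be mimicked with plain fixed-point arithmetic (a minimal sketch of the idea only, not the actual cryptosystem; the scale factor value and all names are our assumptions):

```python
# Fixed-point mimic of rescaling (RS): multiplying two values encoded at
# scale p gives a result at scale p^2; dividing by p (with rounding)
# returns to scale p at the cost of a tiny rounding error.
P = 1 << 20  # scale factor p (an assumed value for illustration)

def encode(x):
    return round(x * P)

def mult_and_rescale(a, b):
    # the product of two scale-p encodings sits at scale p^2; rescale to p
    return round(a * b / P)

def decode(a):
    return a / P

prod = mult_and_rescale(encode(1.5), encode(-2.25))
```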

Unlike linear regression, logistic regression does not have a closed-form solution in most cases. As a result, we need to use nonlinear optimization methods to find the maximum likelihood estimators of the regression parameters. The Newton-Raphson [

Let (x_i, y_i) be the supervised learning samples for 1 ≤ i ≤ n. Writing z_i = y_i ⋅ (1, x_i), the cost function for logistic regression is defined by:

J(β) = (1/n) ⋅ Σ_{1≤i≤n} log(1 + exp(−z_i^T β))

Its gradient with respect to β is ∇J(β) = −(1/n) ⋅ Σ_{1≤i≤n} σ(−z_i^T β) ⋅ z_i. To find a local minimum point, the gradient descent method updates the regression parameters using the following formula until convergence:

β ← β + (α/n) ⋅ Σ_{1≤i≤n} σ(−z_i^T β) ⋅ z_i

where α is the learning rate.
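In the clear, one iteration of this update rule amounts to the following (a plain-Python sketch; the names are ours, with each row of Z being z_i = y_i ⋅ (1, x_i) as in the text):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gd_step(beta, Z, alpha):
    # one gradient descent update:
    #   beta <- beta + (alpha / n) * sum_i sigmoid(-z_i^T beta) * z_i
    n = len(Z)
    grad = [0.0] * len(beta)
    for z in Z:
        s = sigmoid(-sum(zj * bj for zj, bj in zip(z, beta)))
        for j in range(len(beta)):
            grad[j] += s * z[j]
    return [bj + (alpha / n) * gj for bj, gj in zip(beta, grad)]
```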

Although the gradient descent method seems better suited than other training methods for homomorphic evaluation, some technical problems remain for implementation. In the preceding update formula, the sigmoid function is the biggest obstacle for evaluation, since the existing homomorphic encryption schemes only allow evaluation of polynomial functions. Hence, the Taylor polynomials T_d(x) = Σ_{0≤k≤d} (σ^{(k)}(0) / k!) ⋅ x^k have been commonly used for approximation of the sigmoid function [

However, we observed that the input values z_i^T β of the sigmoid function do not stay near zero during the iterations, and the Taylor polynomial is a bad approximation away from the origin: for example, T_9(4) ≈ 4.44, T_9(6) ≈ 31.23, and T_9(8) ≈ 138.12, whereas the sigmoid function is bounded by 1. In addition, we would have to use an even higher degree Taylor polynomial to guarantee the accuracy of regression, but that requires too many homomorphic multiplications to be practically implemented. In summary, the Taylor polynomial is not a good candidate for approximation because it is a local approximation near a certain point. Therefore, we adopted a global approximation method that minimizes the mean squared error (MSE). For an integrable function f and a polynomial g over an interval I, the MSE is defined as (1/|I|) ⋅ ∫_I (f(x) − g(x))^2 dx, and the least squares approximation of f is the polynomial of a given degree that minimizes this quantity.

In our implementation, we used the degree 3 and 7 least squares approximations of the sigmoid function over the interval [–8,8], which contains all of the input values z_i^T β:

g_3(x) = 0.5 + a_1 ⋅ (x/8) + a_3 ⋅ (x/8)^3

g_7(x) = 0.5 + a_1 ⋅ (x/8) + a_3 ⋅ (x/8)^3 + a_5 ⋅ (x/8)^5 + a_7 ⋅ (x/8)^7

where the coefficient vectors are (a_{1}, a_{3}) ≈ (1.20096,–0.81562) and (a_{1}, a_{3}, a_{5}, a_{7}) ≈ (1.73496,–4.19407, 5.43402,–2.50739). The degree 3 least squares approximation requires a smaller depth for evaluation, whereas the degree 7 polynomial has better precision (see
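The quality of the degree 3 approximation can be checked numerically with the coefficients above (we assume the arguments are scaled by 1/8 to match the interval [−8,8]; this scaling and the helper names are our assumptions):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def g3(x):
    # degree 3 least squares approximation of the sigmoid over [-8, 8];
    # coefficients (a1, a3) from the text, the (x/8) scaling is assumed
    s = x / 8.0
    return 0.5 + 1.20096 * s - 0.81562 * s ** 3

# maximum absolute error on a grid over [-8, 8]: it stays bounded,
# unlike the Taylor expansion, which blows up away from 0
max_err = max(abs(g3(x / 10.0) - sigmoid(x / 10.0)) for x in range(-80, 81))
```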

We now describe how to encode data and how to perform logistic regression on encrypted data. To speed up the computation, we use the packing mechanism to batch many values into a single ciphertext and process them in parallel.

We start with a useful aggregation operation across plaintext slots from the literature [. Given a ciphertext encrypting a plaintext vector (m_1, m_2, ..., m_k), we introduce an algorithm (denoted by AllSum) that generates a ciphertext representing the value Σ_{1≤i≤k} m_i in each plaintext slot. Assume that k is a power of two. A cyclic rotation by one slot produces a ciphertext encrypting the vector (m_2, ..., m_k, m_1). Then an encryption of the vector (m_1 + m_2, m_2 + m_3, ..., m_k + m_1) is obtained by adding it to the original ciphertext. We repeatedly apply this rotate-and-add method (log k times, doubling the rotation amount each time) to get a ciphertext encrypting Σ_{1≤i≤k} m_i in every slot. The AllSum algorithm is explicitly described in

Let us assume that we are given _{i} with (

where

Because our cryptosystem only supports integer computation, all the elements are scaled by a factor of an integer p and rounded to the nearest integers, that is, we compute ⌊p ⋅ z_i⌉ from z_i for all i. The ciphertexts encrypting the columns (z_{1j}, ..., z_{nj}) of the scaled dataset for j = 0, ..., d are sent to the server for the computation of gradient descent.

Graphs of (a) sigmoid function and Taylor polynomials and (b) sigmoid function and least squares approximations.

0: Input: a ciphertext ct encrypting the plaintext vector (m_{1}, m_{2},..., m_{k}).

1: for i = 0, 1, ..., log k − 1 do

2: Compute ct ← Add(ct, Rot(ct; 2^{i }))

3: end for

4: Output: ct, encrypting Σ_{1≤ i≤k } m_{i } in each plaintext slot
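On plaintext slots, the rotate-and-add loop behaves as follows (a list-based simulation; `rotate` stands in for the homomorphic slot rotation, and the names are ours):

```python
def rotate(v, r):
    # cyclic left rotation by r slots (models the homomorphic rotation)
    return v[r:] + v[:r]

def all_sum(v):
    # assumes len(v) is a power of two; after log2(k) rotate-and-add
    # rounds with doubling shifts, every slot holds the total sum
    k = len(v)
    i = 1
    while i < k:
        v = [a + b for a, b in zip(v, rotate(v, i))]
        i *= 2
    return v
```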

0: Input: ciphertexts {ct_{z_j}}_{0≤j≤d} of the data columns, a polynomial g(x) approximating the sigmoid function

1: for j = 0, 1, ..., d do

2: ct_{β_j} ← an encryption of zero

3: end for

4: for each iteration do

5: ct_{ip} ← Σ_{0≤ j≤d}Mult(ct_{β_j}, ct_{z_j});

6: ct_g ← an evaluation of g at RS(ct_{ip})

7: for j = 0, 1, ..., d do

8: ct_{g_j} ←RS(Mult(ct_g, ct_{z_j}));

9: ct_{g_j} ←RS(AllSum(ct_{g_j}); ⌊n / α⌉)

10: ct_{β_j} ←Add(ct_{β_j}, ct_{g_j})

11: end for

12: end for

13: Output: {ct_{β_j}}_{0≤ j≤d }, encrypting the learned model parameters

The public server generates the initial ciphertexts (ct_{β_0},..., ct_{β_d}) as zero polynomials in R_{q } (the residue ring of R = Z[X] / (X^{N } + 1) modulo an integer q). At each iteration, the server homomorphically multiplies ct_{z_j} and ct_{β_j} for each j, obtaining a ciphertext encrypting the plaintext vector p^{2} ⋅ (z_{1j} ⋅ β_{j},..., z_{nj} ⋅ β_{j}); summing these ciphertexts over 0 ≤ j ≤ d yields an encryption of the (scaled) inner products z_{1}^{T}β, ..., z_{n}^{T}β in the n plaintext slots.

For the evaluation of the least squares polynomial at the inner product values z_{i}^{T}β, the server performs homomorphic multiplications with the column ciphertexts ct_{z_j}, the AllSum procedure, and rescaling, producing ciphertexts ct_{g_0},..., ct_{g_d} corresponding to the entries of the gradient vector weighted by the learning rate and the sample size. Then it only needs to perform an addition with the current model parameters, yielding a ciphertext ct_{β_j} that encrypts an approximation of the updated model parameter.
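The data flow of one iteration can be traced on plaintext vectors (a sketch mirroring the multiply, sigmoid-evaluation, AllSum, and update steps, with slot-wise list operations standing in for homomorphic Mult/Add/AllSum; all names and the (x/8) scaling in g3 are our assumptions):

```python
def g3(x):
    # degree 3 least squares approximation of the sigmoid over [-8, 8]
    # (coefficients from the text; the (x/8) scaling is our assumption)
    s = x / 8.0
    return 0.5 + 1.20096 * s - 0.81562 * s ** 3

def iteration(cols, beta, alpha):
    # cols[j] packs the j-th data column (z_{1j}, ..., z_{nj}) into n slots,
    # as the ciphertext ct_{z_j} does; beta[j] plays the role of ct_{beta_j}
    n = len(cols[0])
    # inner products z_i^T beta, computed slot-wise (Mult, then Add over j)
    ip = [sum(cols[j][i] * beta[j] for j in range(len(beta))) for i in range(n)]
    # evaluate the sigmoid approximation at -z_i^T beta in every slot
    g = [g3(-t) for t in ip]
    new_beta = []
    for j in range(len(beta)):
        # multiply by the column, then AllSum collapses the n slots
        gj = sum(g[i] * cols[j][i] for i in range(n))
        # weighted update: beta_j <- beta_j + (alpha / n) * gj
        new_beta.append(beta[j] + (alpha / n) * gj)
    return new_beta
```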

Our solution can compute the gradient descent algorithm securely; however, a direct implementation is not efficient and requires a large total ciphertext modulus. We therefore normalized the inner product values z_{i}^{T}β into the approximation interval and reorganized the computation so that each iteration consumes only about 3 ⋅ log p bits of the ciphertext modulus when evaluating the degree 3 polynomial g_{3}(x), with a modest increase for g_{7}(x).

All experiments were performed on an m4.4xlarge AWS EC2 instance with an Intel Xeon processor running at 2.3 GHz (16 cores) and 64 GB of RAM. In our implementation, we used a variant of the fixed-point homomorphic encryption scheme of Cheon et al [

We developed our approximation algorithm using the Myocardial Infarction dataset from Edinburgh [

We assumed that all inputs had log _{0} = log _{7}(_{3}(_{3}(_{7}(

In

We used a popular metric, the area under the receiver operating characteristic curve (AUC), which plots the true positive rate against the false positive rate at various thresholds, to measure the model’s classification performance.
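For reference, AUC can be computed directly from classifier scores via the rank (Mann-Whitney) formulation (an illustrative helper, not the evaluation code used in this study):

```python
def auc(scores, labels):
    # AUC equals the probability that a randomly chosen positive
    # example is scored higher than a randomly chosen negative one,
    # counting ties as half a win
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```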

The training converges to the optimum within a small number of iterations (20-25), which makes it very promising to train a homomorphically encrypted logistic regression model and thereby mitigate privacy concerns.

In

Description of datasets.

Dataset | Number of observations | Number of features |

Edinburgh Myocardial Infarction | 1253 | 10 |

Low Birth Weight Study | 189 | 10 |

Nhanes III | 15,649 | 16 |

Prostate Cancer Study | 379 | 10 |

Umaru Impact Study | 575 | 9 |

Experiment results of our homomorphic encryption-based logistic regression algorithm.

Dataset | Degree of g | Encryption (sec) | Evaluation (min) | Decryption (sec) | Storage (GB) |

Edinburgh Myocardial Infarction | 3 | 12 | 131 | 6.3 | 0.69 |

Edinburgh Myocardial Infarction | 7 | 12 | 116 | 6.0 | 0.71 |

Low Birth Weight Study | 3 | 11 | 101 | 4.9 | 0.67 |

Low Birth Weight Study | 7 | 11 | 100 | 4.5 | 0.70 |

Nhanes III | 3 | 21 | 265 | 12 | 1.15 |

Nhanes III | 7 | 21 | 240 | 13 | 1.17 |

Prostate Cancer Study | 3 | 11 | 119 | 4.4 | 0.68 |

Prostate Cancer Study | 7 | 11 | 100 | 4.5 | 0.70 |

Umaru Impact Study | 3 | 10 | 109 | 5.1 | 0.61 |

Umaru Impact Study | 7 | 10 | 94 | 4.3 | 0.63 |

Average AUC of encrypted logistic regression. FPR: false positive rate; TPR: true positive rate.

Comparison of encrypted/unencrypted logistic regression. AUC: area under the receiver operating characteristic curve; MSE: mean squared error; NMSE: normalized mean squared error.

Dataset | Iterations | Degree of g | Encrypted accuracy | Encrypted AUC | Unencrypted accuracy | Unencrypted AUC | MSE | NMSE |

Edinburgh Myocardial Infarction | 25 | 3 | 86.03% | 0.956 | 88.43% | 0.956 | 0.0259 | 0.0261 |

Edinburgh Myocardial Infarction | 20 | 7 | 86.19% | 0.954 | 86.19% | 0.954 | 0.0007 | 0.0012 |

Low Birth Weight Study | 25 | 3 | 69.30% | 0.665 | 68.25% | 0.668 | 0.0083 | 0.0698 |

Low Birth Weight Study | 20 | 7 | 69.29% | 0.678 | 69.29% | 0.678 | 0.0003 | 0.0049 |

Nhanes III | 25 | 3 | 79.23% | 0.732 | 79.26% | 0.751 | 0.0033 | 0.0269 |

Nhanes III | 20 | 7 | 79.23% | 0.737 | 79.23% | 0.737 | 0.0002 | 0.0034 |

Prostate Cancer Study | 25 | 3 | 68.85% | 0.742 | 68.86% | 0.750 | 0.0085 | 0.0449 |

Prostate Cancer Study | 20 | 7 | 69.12% | 0.750 | 69.12% | 0.752 | 0.0002 | 0.0018 |

Umaru Impact Study | 25 | 3 | 74.43% | 0.585 | 74.43% | 0.587 | 0.0074 | 0.0829 |

Umaru Impact Study | 20 | 7 | 75.43% | 0.617 | 74.43% | 0.619 | 0.0004 | 0.0077 |

Our implementation shows that the evaluation of the gradient descent algorithm with the degree 7 least squares polynomial yields better accuracy and AUC than the degree 3 one. The result is quite close to that of unencrypted logistic regression using the original sigmoid function with the same number of iterations; for example, on the Edinburgh training dataset, we could obtain the model parameters

which can reach 86.19% accuracy and 0.954 AUC on the testing dataset. When using the sigmoid function on the same training dataset, the model parameters

which give the same accuracy and AUC. On the other hand, as shown in

One of the inherent properties of our underlying homomorphic encryption scheme is that the errors inserted for security may grow after homomorphic operations. Hence, the size of the error and the precision loss should be analyzed carefully to guarantee the correctness of the resulting value. On the other hand, the gradient descent method provides negative feedback on computational error: because we use the gradient at the current weight vector β to move it closer to the optimal point of minimized cost, the effect of noise fades after some iterations. Therefore, there is no need to manage the precision of messages to confirm the correctness of the resulting value, because the noise is not amplified during evaluation. In our experiments on the Edinburgh dataset, for instance, the difference between the model parameters obtained from the encrypted and unencrypted evaluations was less than 2^{-11}. This means that we can precisely compute at least the 11 most significant bits after the radix point of the model parameters, and this approximate vector is accurate enough to achieve good performance on testing data samples.

There are still a number of limitations in applying our evaluation model to an arbitrary dataset. First, the use of homomorphic encryption incurs overheads in computation and storage, so the size of the dataset should be limited for practical evaluation. This is not a fundamental problem, however, because the existing homomorphic encryption schemes keep improving significantly; further development of homomorphic encryption technology will yield much better practical performance for our protocol.

Another issue arises from the polynomial approximation. We suggested the least squares method on a fixed interval [–8,8], but the precision of the result could be improved by managing the approximation error for inputs from a wider range. Finally, our model is based on fixed hyperparameters that must be decided before the evaluation starts. It would be highly beneficial if we could detect convergence of the loss function during training and support early stopping instead.

This paper presents the first effective methodology to evaluate the learning phase of logistic regression using the gradient descent method based on homomorphic encryption. We have demonstrated the capability of our model across the experiments with different biological datasets. In particular, our solution can be applied to a large-scale dataset, which shows the feasibility of our approach.

Homomorphic encryption for approximate arithmetic.

Further optimization of secure logistic regression algorithm.

How to set parameters.

AUC: area under the curve

MSE: mean squared error

NMSE: normalized mean squared error

The authors would like to thank Kristin Lauter for helpful discussions and suggestions. The authors would also like to thank Andrey Kim for extensive assistance with the code for the homomorphic encryption scheme.

This research of MK, SW, and XJ was supported in part by the National Institute of Health under award numbers R00HG008175, R01GM118574, R01GM118609, and U01EB023685. YS was supported by the National Research Foundation of Korea grant funded by the Korean Government (No: 2017R1A5A1015626).

MK led the algorithm development and the writing of the methodology. YS, YX, SW, and XJ contributed to the approximation algorithm and evaluation. YS also developed the parallelization for the proposed protocol. XJ and SW motivated the study and blended novel algorithms and new homomorphic schemes to enable secure learning. All authors carefully reviewed and edited the paper.

None declared.