This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

Biomedical research often requires large cohorts and necessitates the sharing of biomedical data with researchers around the world, which raises many privacy, ethical, and legal concerns. In the face of these concerns, privacy experts are exploring approaches for analyzing distributed data while protecting privacy. Many of these approaches are based on secure multiparty computation (SMC). SMC is an attractive approach that allows multiple parties to collectively carry out calculations on their datasets without having to reveal their own raw data; however, it incurs heavy computation time and requires extensive communication between the involved parties.

This study aimed to develop usable and efficient SMC applications that meet the needs of the potential end-users and to raise general awareness about SMC as a tool that supports data sharing.

We have introduced distributed statistical computing (DSC) into the design of secure multiparty protocols, which allows us to conduct computations on each of the parties’ sites independently and then combine these computations to form 1 estimator for the collective dataset, thus limiting communication to the final step and reducing complexity. The effectiveness of our privacy-preserving model is demonstrated through a linear regression application.

Our secure linear regression algorithm was tested for accuracy and performance using real and synthetic datasets. The results showed no loss of accuracy (over nonsecure regression) and very good performance (20 min for 100 million records).

We used DSC to securely calculate a linear regression model over multiple datasets. Our experiments showed very good performance (in terms of the number of records it can handle). We plan to extend our method to other estimators such as logistic regression.

Human genome research promises to transform health care through personalized medicine. It enables the determination of an individual’s unique molecular characteristics, which can be used to diagnose diseases, select individualized treatments (with a higher success rate), and reduce possible adverse reactions [

Traditionally, researchers would strip the data of identifying information (such as names and identity card numbers) and apply privacy-protection techniques (such as generalization or noise addition) before sharing them with each other. However, recent studies have shown that it is possible to deduce the identity of research participants from clinical data that were considered anonymized. DNA sequencing aggravates this problem, as the genome is unique to every individual and can be used to predict future ailments (such as Alzheimer disease or schizophrenia) for individuals and their blood relatives. Such information could be used to deny individuals employment or to isolate them socially [

Despite the mathematical proofs demonstrating the ability of SMC protocols to protect data, they are still not widely used. This may be because awareness of their capabilities is still limited, they tend to have complex solutions that are not accessible without domain knowledge, they require coordinating analyses among the different sites, or they are not efficient in every setting. In fact, one of the main problems with SMC protocols is efficiency: communication between the different parties is the main factor driving the inefficiency of SMC protocols [

Our goal with this line of research is to develop usable and efficient SMC applications that meet the needs of the potential end users and to raise general awareness about SMC as a tool that supports data sharing. Thus, we propose a divergence from current research efforts: instead of lowering the security requirements, we introduce distributed statistical computing (DSC) into the design of secure protocols. Through DSC, we study the possibility of conducting computations on each of the parties’ sites independently and then combining these (local) computations to form 1 (accurate) estimator for the collective dataset, thus limiting communication to the final step and significantly reducing complexity.

The main contribution of this study is a model for privacy-preserving distributed data mining in which local models are produced separately and SMC is used to aggregate the results privately. The study applies these novel ideas to linear regression and introduces the first secure linear regression model that performs model selection and parameter estimation efficiently (all previous secure multiparty algorithms perform parameter estimation only). The paper then presents experiments on real and synthetic datasets to demonstrate the accuracy and efficiency of the algorithm.

The paper is organized as follows: the next section defines our problem formally and introduces the theory behind DSC; the following section demonstrates the effectiveness of our privacy-preserving model through a linear regression application; and finally, the paper is concluded with a discussion of the results and limitations and a proposal for future research directions.

A researcher wants to estimate a population parameter θ using datasets d_{1},...,d_{m} owned by different parties S_{1},...,S_{m}.

Traditional secure computation approach. Double-sided arrows indicate required communication channels. All communication should be secure, and no party (including the third party) should learn anything about other parties' input aside from the final estimation of θ.

Interinstitutional data sharing is generally motivated by multiple needs, such as (1) increasing the accuracy of results and lowering bias, (2) performing benchmarking studies, or (3) attaining the cohort size required for a study. In what follows, we illustrate each with a scenario:

Hospitals in a given geographical area are interested in calculating the average rate of hospital-acquired bacterial infection (across all the hospitals in the said area) for the purpose of self-evaluation. In this case, the hospitals have an additional incentive against data sharing as it may implicate them negatively.

Monogenic diseases are very rare genetic disorders associated with single-gene variations, observed in a few subjects per 1000 to 10,000 individuals. Some are well characterized, such as cystic fibrosis (disease frequency 1:2000), sickle cell anemia, phenylketonuria (disease frequency 1:8000), and some primary immunodeficiency diseases [

Many protocols have been developed for the above problem in the area of SMC. The most efficient protocols are based on secret sharing [

A common approach in statistical learning with big data is to split the data (along observations) into multiple subsets. The required computation is carried out on each subset completely independently, and the final result is then obtained by combining these local computations. Thus, communication (and sharing of information) is reduced to the final step only, which significantly reduces complexity and yields simpler algorithms. The problem is illustrated in
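To make this split-and-merge idea concrete, the following minimal sketch (a hypothetical illustration with made-up data, not the study's implementation) simulates several sites that each fit an ordinary least squares model on their own data; averaging the local estimates recovers essentially the same coefficients as a centralized fit on the pooled data:

```python
import numpy as np

rng = np.random.default_rng(0)

# m sites, each holding n local observations of the same linear model
m, n, p = 5, 2000, 3
beta = np.array([1.5, -2.0, 0.5])          # true parameter vector

local_estimates, pooled_X, pooled_y = [], [], []
for _ in range(m):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=0.1, size=n)
    # local ordinary least squares estimate on this site's data only
    local_estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
    pooled_X.append(X)
    pooled_y.append(y)

# split-and-merge estimator: simple average of the local estimates
beta_avg = np.mean(local_estimates, axis=0)

# centralized estimator: least squares on the pooled data
beta_central = np.linalg.lstsq(np.vstack(pooled_X), np.hstack(pooled_y),
                               rcond=None)[0]

print(np.max(np.abs(beta_avg - beta_central)))  # very small discrepancy
```

In the secure setting, only the final averaging step requires communication between the parties; the local fits never leave their sites.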

A researcher is interested in estimating a population parameter _{1},...,d_{m}_{i}=f_{i}^{#}=g_{1},...,θ_{m}

This is similar in spirit to the work of McDonald et al, who advocated for this split-and-merge approach.

The split and merge approach. The one-sided arrows indicate message passing. All communication should be secure, and no party (including the optional third party) should learn anything about other parties' input d_{i} aside from the final estimator θ^{#}.

The split-and-merge approach raises several questions:

What is the error of the averaged estimator versus the centralized one?

What affects the optimality of the averaged estimator?

How many sites should be included in a given study, and how many samples should be included from these sites?

The accuracy of averaging depends on the relationship between the number of observations (n), the number of sites (m), and the number of features (p). Under suitable conditions, the mean square error (MSE) of the averaged estimator (ie, E||θ^{#}−θ||^{2}) is the same as the MSE of the centralized one (ie, E||θ̂−θ||^{2}). In [

For situations where ^{#}

For small and medium

For situations where

The authors presented the exact expression of the estimator errors in all situations and confirmed the results through a series of experiments.

In [^{1/2}

Battey et al specified in [^{2}

As a general insight into the questions raised above and to summarize the above results, we say that when the number of samples per site

The assumptions across the DSC literature are that (1) the

An equivalent assumption to (1) that applies to our SMC scenarios is that of

The (second) assumption of evenness can be easily circumvented by presetting the number of samples to consider from each site. However, that is not needed, as the theoretical results presented in the previous section still apply provided that every site satisfies the required assumptions (ie, the assumptions hold for each d_{i}).

However, the first assumption is not always realistic and can be overly restrictive for some applications. For instance, if the data are already owned and collected by the different sites, then they may exhibit systematic differences across these sites. For example, if 2 hospitals are considering an analysis involving their cancer patients and 1 of the hospitals is located in a heavily polluted area whereas the other is not, then the distributions of the local populations from which the sites’ data are sampled could differ significantly. Pooling the data and redistributing them randomly along the different sites may not be realistic or feasible, as the data may be big or private [

Going back to the question of the number of sites to include in a study when ^{3} if ^{6},

We distinguish between these 2 strategies when analyzing the performance of our algorithm.

In many applications, certain inferences require 2 or more estimations. For example, inference for regression typically requires 2 components—feature selection and parameter estimation. When conducting feature selection, the median probability model has been recommended [
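As a small illustration of the median probability model (with hypothetical numbers chosen only for this sketch), each site reports a binary vector marking the features its local selection step retained, and a feature enters the aggregate model when the coordinate-wise median, that is, a majority of sites, selects it:

```python
import numpy as np

# Binary feature-selection votes from 3 hypothetical sites (1 = selected)
local_selections = np.array([
    [1, 1, 0, 1, 0],   # site 1
    [1, 0, 0, 1, 0],   # site 2
    [1, 1, 1, 0, 0],   # site 3
])

# Median probability model: keep a feature when the median vote is 1,
# ie, when a majority of sites selected it
aggregate = np.median(local_selections, axis=0).astype(int)
print(aggregate)  # → [1 1 0 1 0]
```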

In [^{1/2}

In the next section, we demonstrate the effectiveness of the privacy-preserving model through a linear regression application. We restrict our application to the models with the best theoretical results, that is, linear models with more records than features in every site and with a high number of records relative to the number of sites (n≫m).

We introduce the classical setting of a linear regression problem. Let X be the N×p matrix of features with entries x_{i,j}, and let Y=(y_{1},...,y_{N})^{T} be the N×1 vector of responses. The linear model assumes Y=Xβ+ε, where β is the p×1 vector of unknown parameters and ε is a vector of random noise.

Despite the simplicity of linear regression, it is widely used in various biomedical applications [

As linear regression is one of the most commonly used statistical tools in data analysis, there are many attempts in the literature at obtaining secure linear regression protocols over distributed databases. Many of these protocols do not offer adequate privacy guarantees [. One of the reported secure protocols handles 10^{8} records with 20 features at a cost of 270 MB of communication. In 2015, Cock et al presented a method to calculate the parameters of the linear regression by securely computing (X^{T}X)^{-1}X^{T}Y.

It is very important to note that all cited secure regression algorithms do not perform feature selection. In other words, they use the supplied feature set as-is to fit the model [

The goal of distributed statistical learning algorithms is to scale up computations by distributing the data over multiple machines. The underlying assumption is that all data are owned by the same organization. Thus, sharing of intermediate and local results between the different machines is allowed.

In our setting, the dataset is horizontally distributed among m sites S_{1},...,S_{m}. Let n_{min} be the number of records in the smallest site; we assume that n_{min}≫p and n_{min}≫m.

Formally, the data are distributed as (X^{1},Y^{1}),...,(X^{m},Y^{m}), where X^{i}=(X^{i}_{1},...,X^{i}_{p}) denotes the feature matrix held by site S_{i} (each X^{i}_{j} is an n_{i}×1 column vector) and Y^{i}=(y^{i}_{1},...,y^{i}_{ni})^{T} is the corresponding n_{i}×1 response vector.

Each site calculates its local feature selection vector privately, and the local vectors are aggregated securely using a secure median algorithm (in other words, the parties jointly perform the median on their data and obtain the result), without any party revealing any information about their selected features (aside from what can be deduced from the final median output).

Next, each site uses the shared selected features to calculate the model parameters locally. These local parameters are then securely averaged using a secure average protocol. Our algorithm is presented in
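A non-secure, single-machine simulation of these two phases might look as follows (a sketch with made-up data and an illustrative lasso penalty; the actual algorithm replaces the median and averaging steps with their secure counterparts):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)

# Simulated horizontal partition: m sites, n records each, p features,
# with only 3 truly active features
m, n, p = 4, 1000, 8
beta = np.zeros(p)
beta[[0, 2, 5]] = [2.0, -1.0, 0.5]

sites = []
for _ in range(m):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

# Phase 1: each site runs lasso locally; the binary selection vectors
# are combined with a coordinate-wise median (majority vote)
votes = np.array([
    (np.abs(Lasso(alpha=0.05).fit(X, y).coef_) > 1e-6).astype(int)
    for X, y in sites
])
selected = np.where(np.median(votes, axis=0) >= 1)[0]

# Phase 2: each site fits least squares on the agreed features; the
# final parameters are the average of the local fits
local_params = [LinearRegression().fit(X[:, selected], y).coef_
                for X, y in sites]
beta_hat = np.mean(local_params, axis=0)

print(selected, beta_hat)
```

In the secure version, the median and average in the two aggregation steps are the only computations performed jointly; everything else stays local.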

We evaluated our secure multiparty linear regression algorithm (SMA) by implementing it and analyzing the results using real and synthetic datasets. The real datasets were used to test the accuracy of the algorithm, whereas the synthetic datasets were used to analyze its performance. To analyze the accuracy of the algorithm, we needed real datasets that originated from multiple sources (different data owners); the sources’ IDs had to be included to inform the actual division of the data along the different sites. For the synthetic experiment, data were generated and randomly allocated along the different sites, as the purpose was solely to evaluate the efficiency of the algorithm for various numbers of records and features. We used Python 3 as our programming language, augmented with the Scikit-learn, numpy, pandas, gmpy2, and phe libraries for functionality such as socket programming, homomorphic encryption, and arithmetic over negative and real floating-point values. We built our system on top of 10 Linux machines, each with an Intel Core i5-4570 CPU (4 cores, 3.20 GHz) and 8 GB RAM. To increase the number of possible sites to 20, we installed 2 Linux virtual machines, with 4 GB of memory each, on every physical machine (note that the Paillier encryption library handles real-number values with arbitrarily high precision).

To test our SMA, we compared its performance with the central algorithm (CA). The CA performs linear regression on 1 machine using the same approach as the SMA (ie, it uses lasso for feature selection and the linear least squares method for parameter calculation [

To test the accuracy of our algorithm, we collected real datasets that include information about the original collection site and treated each site as an independent data owner. We found 4 suitable real datasets: 3 public datasets contained within the University of California Irvine repository (student performance in Portuguese, student performance in Math, and automobile fuel consumption data) and 1 from the Cerner clinical database (the Diabetes dataset, for which the number of sites included was varied between 3, 6, and 12, and the weight variable was excluded in some experiments because of excessive missing values). The datasets are presented in detail in

In the experiments on real datasets, we randomly divided each dataset into a training set (70%) and a testing set (30%). A regression model was trained on the training set and used to predict the outcome variable in the testing set. The mean square prediction error (MSE) over the testing set was used to evaluate the model. Each experiment was repeated 50 times.
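This evaluation protocol can be sketched on toy data as follows (hypothetical data standing in for one of the real datasets; the split ratio matches the one described above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Toy regression data with known coefficients and noise std 0.2
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=500)

# 0.7/0.3 split, fit on the training set, score MSE on the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)  # close to the noise variance (0.04)
```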

Performance results for the 4 datasets used.

Dataset | Mean (SD) | Sites^{a} | Features^{b} | Records^{c} | MSE^{d}, CA^{f} | MSE, SMA^{g} | R^{2}, CA | R^{2}, SMA | MSE ratio^{e}
Student performance in Portuguese^{h} | 11.91 (3.23) | 2 | 30 | 649 (423, 226) | 3.364 | 3.417 | 0.68 | 0.675 | 0.984
Student performance in Math^{h} | 10.41 (4.58) | 2 | 30 | 395 (349, 46) | 7.554 | 7.719 | 0.627 | 0.62 | 0.978
AutoMPG^{i} | 23.45 (7.80) | 3 | 6 | 392 (245, 68, 79) | 13.56 | 17.563 | 0.778 | 0.711 | 0.77
Diabetes (with weight)^{j} | 4.848 (3.11) | 3 | 39 | 267 (129, 72, 66) | 8.801 | 8.59 | 0.09 | 0.108 | 1.025
Diabetes (with weight)^{j} | 4.41 (3.02) | 6 | 39 | 456 (68, 130, 57, 73, 55, 73) | 7.443 | 7.733 | 0.19 | 0.157 | 0.962
Diabetes (without weight)^{j} | 4.39 (3.01) | 3 | 38 | 8567 (2478, 3936, 2153) | 5.558 | 5.612 | 0.309 | 0.303 | 0.99
Diabetes (without weight)^{j} | 4.42 (3.00) | 6 | 38 | 13626 (2478, 3936, 1480, 2153, 2108, 1471) | 5.708 | 5.798 | 0.345 | 0.335 | 0.984
Diabetes (without weight)^{j} | 4.39 (2.97) | 12 | 38 | 21205 (2478, 3936, 1480, 2153, 2108, 1471, 1160, 1323, 1524, 1425, 1024, 1122) | 5.705 | 5.87 | 0.345 | 0.336 | 0.971

^{a}Number of sites.

^{b}Number of features.

^{c}Total number of records (per-site counts in parentheses).

^{d}MSE: mean square error.

^{e}MSE ratio=MSE CA/MSE SMA.

^{f}CA: central algorithm.

^{g}SMA: secure multiparty linear regression algorithm.

^{h}Outcome variable: grade out of 20.

^{i}Outcome variable: fuel consumption (miles per gallon).

^{j}Outcome variable: length of stay (days).

Using synthetic data, we performed a scalability analysis to evaluate the time performance of the proposed solution as the data size and the number of parties increase. The synthetic datasets were generated in Python using sklearn.datasets.make_regression. The number of records was varied between 10^{5} and 10^{8}, the number of features between 2 and 50, and the number of sites between 2 and 20. The records were always divided equally between the sites. We distinguished between 2 testing strategies:

For the fixed n strategy, the number of records per site is kept constant, so the total number of records grows with the number of sites.

For the fixed N strategy, the total number of records is kept constant and divided equally among the participating sites.
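At a much smaller scale than the experiments, the data generation step named above can be reproduced as follows (the parameter values here are illustrative, not those of the study):

```python
from sklearn.datasets import make_regression

# Generate a synthetic linear regression problem, as in the experiments
# (the study varied the number of records up to 10^8 and features up to 50)
X, y = make_regression(n_samples=1000, n_features=20, noise=1.0,
                       random_state=0)
print(X.shape, y.shape)  # → (1000, 20) (1000,)
```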

Time performance for central algorithm versus secure multiparty linear regression algorithm (SMA) with 100,000 records per site and varying feature set. As the number of sites increases, the number of records also increases. For small datasets, the time taken by SMA is more than that taken by the central algorithm (CA). This is due to the encrypted communication required by the algorithm. As the number of records and features increases, the time taken by the CA increases rapidly (at 1,500,000 records and 100 features).

Time performance for central algorithm versus secure multiparty linear regression algorithm (SMA) with 1,000,000 records per site and varying feature set. The time taken by the central algorithm (CA) is greater than that taken by the SMA. For 10 million records, the SMA algorithm takes almost 30 seconds, whereas the CA takes around 18 minutes.

Time performance for secure multiparty linear regression algorithm (SMA) with 10 million records per site and varying feature set. The panel on the left shows the time as a function of the number of features, while the panel on the right shows the time as a function of the number of sites. Note that for

Time performance for central algorithm versus secure multiparty linear regression algorithm (SMA) with total number of records (

This study introduced a model for privacy-preserving distributed data mining in which local models are produced separately and SMC is used to aggregate the results privately. Theoretical results from statistical theory were used to design the first secure multiparty linear regression model that does model selection and parameter estimation.

In general, theoretical results from statistical computing say that the averaged local estimates are as accurate as the centralized estimates when

The experiments on synthetic data showed very good time performance. When the number of records reaches 100 million (10^{8}), the algorithm completes model selection and parameter estimation in under 20 min (the algorithm of [

Much of the existing theoretical work in DSC assumes a uniform and random distribution of samples across the different sites or that the

This assumption certainly facilitates the mathematical analysis but may not be realistic for some applications. In the SMC applications, data are collected and owned by the different sites and may thus have systematic differences across these different sites, in which case, the assumption could be overly restrictive. Redistributing the samples randomly across the different sites is not an option due to data privacy issues. However, it is worth noting that our experiments on real data (although limited due to lack of access to real data) showed high accuracy compared with the central case. In the future, we intend to relax these assumptions and study their theoretical effect on the accuracy of the results.

Another limitation is the assumption of horizontal data distribution among the different sites, which should be generalized to vertical partitions (or a mix of both).

Moreover, the study demonstrated the theoretical results using a linear model. We plan to extend our results to other estimators such as ridge regression and logistic regression.

Secure multiparty algorithm.

Real datasets.

CA: central algorithm

DSC: distributed statistical computing

MSE: mean square error

SMA: secure multiparty linear regression algorithm

SMC: secure multiparty computation

This work is supported by the United Arab Emirates University research grant #31T084 Research Start-up. The authors would like to thank the anonymous reviewers for their helpful comments and suggestions on earlier drafts of the manuscript.

None declared.