Published in Vol 8, No 6 (2020): June

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/15073.
Distributed Regression Analysis Application in Large Distributed Data Networks: Analysis of Precision and Operational Performance


Original Paper

1Harvard Medical School, Harvard Pilgrim Health Care Institute, Boston, MA, United States

2Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, United States

3Center for Observational Research and Data Science, Bristol-Myers Squibb, Lawrenceville, NJ, United States

4Division of Research, Kaiser Permanente Northern California, Oakland, CA, United States

5Health Research Institute, Kaiser Permanente Washington, Seattle, WA, United States

Corresponding Author:

Qoua Her, MSc, PharmD, MSPharmD

Harvard Medical School

Harvard Pilgrim Health Care Institute

401 Park Drive, 4th Floor East

Boston, MA, 02215

United States

Phone: 1 617 867 4885

Email: qouaher@gmail.com


Background: A distributed data network approach combined with distributed regression analysis (DRA) can reduce the risk of disclosing sensitive individual and institutional information in multicenter studies. However, software that facilitates large-scale and efficient implementation of DRA is limited.

Objective: This study aimed to assess the precision and operational performance of a DRA application comprising a SAS-based DRA package and a file transfer workflow developed within the open-source distributed networking software PopMedNet in a horizontally partitioned distributed data network.

Methods: We executed the SAS-based DRA package to perform distributed linear, logistic, and Cox proportional hazards regression analysis on a real-world test case with 3 data partners. We used PopMedNet to iteratively and automatically transfer highly summarized information between the data partners and the analysis center. We compared the DRA results with the results from standard SAS procedures executed on the pooled individual-level dataset to evaluate the precision of the SAS-based DRA package. We computed the execution time of each step in the workflow to evaluate the operational performance of the PopMedNet-driven file transfer workflow.

Results: All DRA results were precise (<10^−12), and DRA model fit curves were identical or similar to those obtained from the corresponding pooled individual-level data analyses. All regression models required less than 20 min for full end-to-end execution.

Conclusions: We integrated a SAS-based DRA package with PopMedNet and successfully tested the new capability within an active distributed data network. The study demonstrated the validity and feasibility of using DRA to enable more privacy-protecting analysis in multicenter studies.

JMIR Med Inform 2020;8(6):e15073

doi:10.2196/15073



Background and Significance

Distributed regression analysis (DRA) is a suite of methods that perform multivariable regression analysis in multicenter studies without the need for pooling individual-level data [1,2]. Data partners compute highly summarized intermediate statistics (eg, sums of squares and cross products matrices) of their individual-level data and share these statistics with a trusted third-party or analysis center (Figure 1). The analysis center aggregates the intermediate statistics, assesses model convergence, and computes the regression parameter estimates. DRA is mathematically equivalent to the conventional regression analysis of pooled individual-level data. It achieves the same level of statistical sophistication using only summary-level information, thereby offering better protection for individual and institutional privacy without jeopardizing the scientific rigor of the analysis.
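The equivalence with pooled analysis is easiest to see for linear regression. The following is a minimal sketch, not the SAS-based DRA package itself: each hypothetical site shares only its sums of squares and cross products (X'X and X'y), and the analysis center adds them and solves the normal equations. The function names, the toy data, and the small Gaussian elimination solver are assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def local_sscp(X: Matrix, y: List[float]) -> Tuple[Matrix, List[float]]:
    """Run at a data partner: return X'X and X'y, never the rows themselves."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return xtx, xty

def solve(A: Matrix, b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def aggregate_and_fit(summaries) -> List[float]:
    """Run at the analysis center: sum the site summaries, then solve X'X b = X'y."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][i][j] for s in summaries) for j in range(p)] for i in range(p)]
    xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return solve(xtx, xty)

# Toy data for three hypothetical sites (intercept column plus one covariate).
sites = [
    ([[1, 0.0], [1, 1.0], [1, 2.0]], [1.1, 2.9, 5.2]),
    ([[1, 3.0], [1, 4.0]], [6.8, 9.1]),
    ([[1, 5.0], [1, 6.0], [1, 7.0]], [11.0, 12.9, 15.2]),
]
beta_dra = aggregate_and_fit([local_sscp(X, y) for X, y in sites])

# Pooled fit on the stacked rows, for comparison.
X_all = [r for X, _ in sites for r in X]
y_all = [v for _, y in sites for v in y]
beta_pooled = aggregate_and_fit([local_sscp(X_all, y_all)])
```

Because the pooled X'X and X'y are exactly the sums of the site-level matrices, `beta_dra` and `beta_pooled` should agree up to floating-point rounding.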

Figure 1. Distributed regression analysis with horizontally partitioned data.

However, DRA is not widely used in practice due to the operational challenges in implementing the approach [3]. The modeling process of common regression analyses (eg, logistic regression, Cox proportional hazards regression) is iterative and requires multiple exchanges of highly summarized intermediate statistics between the data partners and the analysis center. Manual execution of DRA is labor-intensive and highly susceptible to human errors (eg, transfer of incorrect files). There have been efforts to develop capabilities that coordinate and automate the iterative computation and file transfer process of DRA to make it a more practical analytical option in real-world multicenter studies [4-11]. These efforts have focused primarily on the programming language R and specially designed applications (eg, Java applets) to facilitate semiautomated or fully automated file transfers between the data partners and the analysis center [7-11]. The performance of these capabilities has largely been tested in simulated or relatively well-controlled environments [4-8], and no DRA application has been developed in SAS, another commonly used statistical software.

In our previous work, we enhanced PopMedNet, an open-source distributed networking software currently used by several large national distributed data networks (DDNs), to enable an automatable and iterative file transfer workflow for routine implementation of DRA [3]. This workflow coordinates and automates the iterative transfer of files between the data partners and the analysis center. We also created a SAS-based DRA package to conduct distributed linear, logistic, and Cox proportional hazards regression analysis in horizontally partitioned DDNs [12,13], environments where each data partner holds information about distinct individuals [14,15]. We integrated the PopMedNet workflow with the SAS-based DRA package to create a DRA application.

Objectives

Despite the appealing theoretical properties of DRA, applications designed to perform the analysis can still be inoperable or produce biased results in real-world settings due to unappreciated factors (eg, human errors in execution, incompatible or different software versions, network or firewall restrictions, and network conditions). Evaluating the precision of DRA applications compared with the pooled individual-level data analysis and the feasibility of performing the analysis in reasonable execution times in real-world settings is needed to demonstrate DRA as a practical and valid analytical method. In this study, we demonstrate the feasibility of using the SAS-based DRA package and PopMedNet-driven file transfer workflow to perform DRA in a real-world horizontally partitioned DDN. Specifically, we quantify the precision of the SAS-based DRA package and the operational performance of the PopMedNet-driven file transfer workflow.


Study Setting: The Sentinel System

Funded by the US Food and Drug Administration, the Sentinel System is an active surveillance system designed to monitor the safety of approved medical products using longitudinal, regularly updated electronic health data from a network of 18 health plans and health care delivery systems [16,17]. Sentinel data partners transform their data into a common data model [18], which enables analytical programs and tools to be centrally developed and executed across data partners with minimal modifications. Over the years, the system has developed a suite of version-controlled, customizable, and freely available modular programs to rapidly query the transformed data across the DDN [19]. Among the tools is the Cohort Identification and Descriptive Analysis (CIDA) tool, a SAS program that assembles cohorts of individuals according to user-specified study parameters (eg, exposures, outcomes, inclusion and exclusion criteria) using established coding systems (eg, International Classification of Diseases, Ninth or Tenth Revision, Clinical Modification; National Drug Codes). The CIDA tool can generate a harmonized (ie, with the same covariates and covariate names) individual-level dataset at each data partner. Users can employ other tools (eg, Propensity Score Analysis Tool) or develop ad hoc analytical programs to query these datasets behind the data partner’s firewall for complex inferential analyses.

Sentinel uses PopMedNet to facilitate file transfers between the data partners and the Sentinel Operations Center [20]. The Sentinel Operations Center, which serves as the analysis center for all Sentinel queries, uses a Web-based portal to create and securely distribute queries to data partners via PopMedNet. The data partners use a locally installed Microsoft Windows application, known as the DataMart Client, to retrieve the query and return the requested dataset, usually in aggregate-level format, to the Sentinel Operations Center. All file transfers between data partners and the Sentinel Operations Center are accomplished through secure HTTPS, secure sockets layer, or transport layer security connections. PopMedNet security and authentication requirements ensure that only approved queries are submitted to and responses returned by prespecified and approved data partners. In addition, the PopMedNet workflow is agnostic to query types and file formats (eg, .RData, .sas, .docx) and can transfer individual files of up to 2 GB.

SAS-Based Distributed Regression Analysis Application

There are numerous algorithms (eg, secure data integration, secure summation) for DRA in horizontally partitioned DDNs, environments where each data partner holds information about distinct patient cohorts [21,22]. In our previous work, we created a SAS-based DRA package comprising 2 interlinked SAS packages (one executed at the data partners and the other at the analysis center) using 2 algorithms: (1) distributed iteratively reweighted least squares to perform distributed linear and logistic regression analysis [12], and (2) distributed Newton-Raphson algorithm to perform distributed Cox proportional hazards regression analysis using the Efron or Breslow approximation for tied event times [13]. Both algorithms utilize a semitrusted third-party as the analysis center to aggregate the highly summarized intermediate statistics (eg, sums of squares and cross products matrices) and compute regression parameter estimates and SEs. We define a semitrusted third-party as a party that data partners trust with their summary-level information but not with their individual-level data. This party does not share data from any data partner with other data partners without consent, does not attempt to derive the individual-level data from the intermediate statistics, does not collude with data partners to derive any information about other data partners’ individual-level data, and follows the DRA algorithms [23].

We provide a brief overview of the distributed iteratively reweighted least squares and the Newton-Raphson algorithms used to implement the SAS-based DRA package for distributed linear, logistic, and Cox proportional hazards regression analysis using the Sentinel Operations Center as the analysis center in Multimedia Appendix 1. A detailed description of these algorithms is available elsewhere [12,13].
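As a rough illustration of the Newton-Raphson side, the sketch below fits a Cox model with a single covariate using the Breslow approximation for ties. It is not the algorithm of the SAS-based DRA package, which is described in [13]. Here each hypothetical site reports, for every event time, risk-set sums and event totals evaluated at the current coefficient, and the analysis center sums them to form the score and information. The record layout, function names, toy data, and the way the union of event times is assembled are assumptions.

```python
import math
from typing import List, Tuple

Record = Tuple[float, int, float]  # (follow-up time, event indicator, covariate)

def site_summaries(data: List[Record], event_times: List[float], beta: float):
    """Data partner: per event time, return (S0, S1, S2, events, sum of x among events)."""
    out = []
    for t in event_times:
        at_risk = [(math.exp(beta * x), x) for (time, _, x) in data if time >= t]
        s0 = sum(r for r, _ in at_risk)
        s1 = sum(r * x for r, x in at_risk)
        s2 = sum(r * x * x for r, x in at_risk)
        d = sum(1 for (time, e, _) in data if e == 1 and time == t)
        sx = sum(x for (time, e, x) in data if e == 1 and time == t)
        out.append((s0, s1, s2, d, sx))
    return out

def fit_cox(sites: List[List[Record]], tol: float = 1e-10, max_iter: int = 25):
    """Analysis center: Newton-Raphson on the Breslow partial likelihood."""
    # Assumption: the union of event times is built directly here for brevity;
    # in a real network each site would report its own event times.
    event_times = sorted({t for data in sites for (t, e, _) in data if e == 1})
    beta = 0.0
    for iteration in range(1, max_iter + 1):
        per_site = [site_summaries(d, event_times, beta) for d in sites]
        score, info = 0.0, 0.0
        for k in range(len(event_times)):
            # Sum each summary component across sites for this event time.
            s0, s1, s2, d, sx = (sum(s[k][m] for s in per_site) for m in range(5))
            mean = s1 / s0
            score += sx - d * mean
            info += d * (s2 / s0 - mean * mean)
        step = score / info
        beta += step
        if abs(step) < tol:
            return beta, iteration
    return beta, max_iter

# Toy records for three hypothetical sites, with a binary covariate.
sites = [
    [(5.0, 1, 1.0), (8.0, 0, 0.0), (12.0, 1, 0.0), (3.0, 1, 0.0)],
    [(7.0, 1, 1.0), (10.0, 1, 0.0), (15.0, 0, 1.0)],
    [(4.0, 1, 1.0), (9.0, 1, 1.0), (12.0, 1, 1.0), (20.0, 0, 0.0)],
]
beta_dra, n_iter = fit_cox(sites)
beta_pooled, _ = fit_cox([[rec for site in sites for rec in site]])
```

Running the same routine on the stacked records gives the pooled estimate, so the two coefficients should match up to rounding.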

PopMedNet Enhancements to Enable Automatable Distributed Regression Analysis

Both the distributed iteratively reweighted least squares and Newton-Raphson algorithms in the SAS-based DRA package utilize a master-worker process, where the analysis center directs the iterative DRA computations and the data partners execute these computations on their individual-level data with input (eg, updated regression parameter estimates) from the analysis center. Thus, an iterative file transfer workflow is required to transfer the highly summarized intermediate statistics and the updated regression parameter estimates between the data partners and the analysis center until the model converges or the analysis reaches a prespecified maximum number of iterations.
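The master-worker loop described above can be sketched for logistic regression as follows. This is an illustration under stated assumptions rather than the package's code: the iteratively reweighted least squares update is written in its equivalent Newton form, sites return X'WX and X'(y − p) at the current estimates, and convergence is judged by the largest absolute change in a coefficient. The tolerance, function names, and toy data are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

def site_step(X: Matrix, y: List[int], beta: List[float]) -> Tuple[Matrix, List[float]]:
    """Data partner: return X'WX and X'(y - p) evaluated at the current beta."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    score = [0.0] * k
    for row, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
        w = p * (1.0 - p)
        for i in range(k):
            score[i] += row[i] * (yi - p)
            for j in range(k):
                info[i][j] += w * row[i] * row[j]
    return info, score

def solve(A: Matrix, b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit(sites, tol: float = 1e-8, max_iter: int = 25):
    """Analysis center: iterate until the update is small or max_iter is reached."""
    k = len(sites[0][0][0])
    beta = [0.0] * k
    for iteration in range(1, max_iter + 1):
        parts = [site_step(X, y, beta) for X, y in sites]  # one round of file exchange
        info = [[sum(p[0][i][j] for p in parts) for j in range(k)] for i in range(k)]
        score = [sum(p[1][i] for p in parts) for i in range(k)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            return beta, iteration
    return beta, max_iter

# Toy data for three hypothetical sites (intercept column plus one covariate).
sites = [
    ([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0]], [0, 0, 1, 0]),
    ([[1, 2.5], [1, 3.0], [1, 3.5]], [1, 0, 1]),
    ([[1, 4.0], [1, 4.5], [1, 5.0]], [1, 1, 0]),
]
beta_dra, n_iter = fit(sites)
X_all = [r for X, _ in sites for r in X]
y_all = [v for _, y in sites for v in y]
beta_pooled, _ = fit([(X_all, y_all)])
```

Each pass through the loop corresponds to one exchange of intermediate statistics and updated estimates, which is why the number of iterations drives the total transfer time.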

We previously enhanced PopMedNet to create an iterative and automatable file transfer workflow to facilitate routine DRA [3]. In brief, we built a back-end component, referred to as the DRA-adapter, into PopMedNet to allow the DataMart Client to upload files automatically and iteratively from and download files to prespecified folders at the data partners and the analysis center. We also developed functionalities for folder monitoring and trigger file creation and deletion in the DataMart Client to integrate the PopMedNet workflow with the two interlinked SAS packages of our SAS-based DRA package. A full description of the PopMedNet workflow and its integration with the SAS-based DRA package is available elsewhere [12,13]. We collectively refer to the integration of the SAS-based DRA package and the PopMedNet-driven file transfer workflow as the DRA application hereafter.

Distributed Regression Analysis: A 3-Step Process

A typical DRA includes 3 major steps [3]. Step 1 involves the assembly of a harmonized individual-level analytical dataset at each data partner. In step 2, the analysis center and each data partner execute a DRA algorithm locally. Step 3 involves the iterative transfer of the DRA algorithm outputs between the data partners and the analysis center until the regression model converges or the process reaches a prespecified maximum number of iterations. We used this 3-step process to guide our execution and evaluation of the DRA application with 3 Sentinel data partners, with the Sentinel Operations Center serving as the analysis center (Figure 2).

Figure 2. Three-step process to conduct distributed regression analysis with PopMedNet. CIDA: Cohort Identification and Descriptive Analysis Tool; DRA: Distributed Regression Analysis; SOC: Sentinel Operations Center.
Step 1: Assemble a Harmonized Individual-Level Analytical Dataset at Each Data Partner

We used the CIDA tool (version 3.3.6) to assemble a harmonized individual-level analytical dataset of adult patients aged 18-79 years who received sleeve gastrectomy or Roux-en-Y gastric bypass in any care setting between January 1, 2010 and September 30, 2015 at 3 Sentinel data partners. To be eligible for cohort inclusion, patients had to be continuously enrolled in a health plan with medical and drug coverage for at least 1 year before the index bariatric surgery, have at least one weight and height measurement that corresponded to a BMI ≥35 kg/m² in the year before surgery, and have at least one height and weight measurement in the year after surgery. We excluded patients with any bariatric procedure during the 1-year period before the index bariatric surgery. We also excluded patients with gastrointestinal cancer or a revised bariatric surgery procedure on the day of surgery. For each regression analysis, follow-up started on the day of the index bariatric surgery and continued until the occurrence of the outcome of interest (see below), death, end of health plan enrollment, or end of the study period. For distributed linear regression analysis, the outcome was a change in BMI within 1-year postsurgery, defined by subtracting the BMI measurement closest to the end of the 1-year postsurgery date from the last BMI measurement before surgery. For logistic regression, we created a binary outcome variable indicating 1 if the patient had weight loss ≥20% within 1-year postsurgery, and 0 if otherwise. For Cox regression analysis, we computed the time to weight loss ≥20% within the 1-year postsurgery period (Table 1).
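The outcome definitions above amount to simple arithmetic, sketched here for a single patient. The function names are hypothetical, and expressing weight loss as a percentage of the presurgery weight is an assumption, because the text does not state the denominator.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def outcomes(pre_weight_kg: float, post_weight_kg: float, height_m: float):
    """Return (change in BMI, indicator of weight loss >= 20%).

    Change in BMI is the last presurgery BMI minus the postsurgery BMI,
    so a positive value means BMI decreased.
    Assumption: percentage weight loss is relative to presurgery weight.
    """
    change_in_bmi = bmi(pre_weight_kg, height_m) - bmi(post_weight_kg, height_m)
    pct_loss = 100.0 * (pre_weight_kg - post_weight_kg) / pre_weight_kg
    return change_in_bmi, int(pct_loss >= 20.0)

# Hypothetical patient: 120 kg before surgery, 90 kg one year after, 1.70 m tall.
change, lost_20_pct = outcomes(120.0, 90.0, 1.70)
```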

Table 1. Analytical datasets and variables.
| Regression model type | Outcome variable (within 1-year postsurgery) | Variables (exposure and confounders) |
|---|---|---|
| Linear | Change in BMI | Bariatric surgery exposure, age at surgery, sex, race and ethnicity, combined Charlson-Elixhauser comorbidity score, number of ambulatory visits, number of other ambulatory visits, number of inpatient stays, number of nonacute institutional stays, number of emergency department visits, BMI before bariatric surgery, number of days between last weight and height measurement and bariatric surgery, and data partner |
| Logistic | Weight loss ≥20% | Same as above |
| Cox | Time to weight loss ≥20% | Same as above |

Step 2: Locally Execute the Distributed Regression Analysis Application at Each Data Partner and the Analysis Center

We assembled 3 separate SAS-based DRA packages to perform distributed linear, logistic, or Cox regression analyses and assessed the association between bariatric procedure (sleeve gastrectomy vs Roux-en-Y gastric bypass) and weight loss within 1-year postsurgery, adjusting for prespecified confounders (Table 1). For Cox regression analysis, we used the Efron approximation to handle tied event times. To be consistent with the standard SAS regression procedures, we prespecified a convergence criterion of <0.01 and a maximum of 25 iterations for distributed logistic and Cox regression analyses.

We distributed each SAS-based DRA package to the 3 data partners through PopMedNet (version 6.7). We instructed the data partners to (1) initiate the automated PopMedNet workflow, allowing the DataMart Client (version 6.7) to automatically download and unzip the SAS-based DRA package to a prespecified local directory, (2) manually place the individual-level analytical dataset created in step 1 in a prespecified local folder, (3) specify the file path to the SAS-based DRA package, and (4) execute the SAS-based DRA package in batch mode. Similarly, we instructed the Sentinel Operations Center to (1) initiate the automated PopMedNet workflow, (2) manually place the SAS-based DRA package for the analysis center in a prespecified local directory, (3) specify the file path to the SAS-based DRA package, and (4) execute the SAS-based DRA package in batch mode. Full details of these packages and examples of their execution have been previously described [12,13].

Step 3: Iteratively Transfer Distributed Regression Analysis Files Between the Data Partners and the Analysis Center

Once the data partners and the analysis center executed their SAS-based DRA package, the package ran continuously, awaiting input files (eg, updated regression parameter estimates or intermediate statistics) and DRA computation directions (eg, compute intermediate statistics, residuals, and SEs) from the Sentinel Operations Center. We used the PopMedNet workflow to transfer input files and computation directions iteratively and automatically between the data partners and the Sentinel Operations Center.

Evaluation of Precision and Operational Performance

We requested all data partners to securely transfer their deidentified individual-level analytical datasets to the Sentinel Operations Center. We assessed the precision of the SAS-based DRA package by comparing the DRA parameter estimates and SEs to those obtained from the pooled individual-level data analyses using standard SAS procedures. For distributed linear regression, we compared the model fit statistics R², Akaike information criterion (AIC), Sawa’s Bayesian information criterion (BIC), and Schwarz BIC to the statistics obtained from a PROC REG run with the pooled individual-level data. For distributed logistic regression, we compared the model fit statistics log-likelihood, AIC, and Sawa’s BIC to the statistics obtained from a PROC LOGISTIC run with the pooled individual-level data. For distributed Cox proportional hazards regression, we compared the model fit statistics log-likelihood, AIC, and Schwarz BIC to the statistics obtained from a PROC PHREG run with the pooled individual-level data. We considered the integration successful if the DRA parameter estimates, SEs, and model fit statistics agreed with the results from the corresponding pooled individual-level data analyses to within 10^−6.
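The success criterion can be expressed as a check on the largest absolute difference between the two sets of estimates. The sketch below is illustrative only: the function names are hypothetical, and the example values are rounded estimates with an artificial offset, not output from either analysis.

```python
from typing import Dict

def max_abs_difference(dra: Dict[str, float], pooled: Dict[str, float]) -> float:
    """Largest absolute difference across parameters with matching names."""
    return max(abs(dra[name] - pooled[name]) for name in pooled)

def is_precise(dra: Dict[str, float], pooled: Dict[str, float], tol: float = 1e-6) -> bool:
    """True when both analyses report the same parameters and agree within tol."""
    return dra.keys() == pooled.keys() and max_abs_difference(dra, pooled) < tol

# Illustrative values only: rounded estimates with an artificial 1e-12 offset.
pooled = {"intercept": 34.03935, "exposure": 2.04714, "age": -0.03334}
dra = {name: value + 1e-12 for name, value in pooled.items()}
agrees = is_precise(dra, pooled)
```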

For distributed logistic regression, we also compared the receiver operating characteristic (ROC) curve and the area under the ROC curve with the corresponding curve and area obtained from a PROC LOGISTIC run with the pooled individual-level data. We considered the integration successful if the ROC curves were similar in shape and if the areas under the curves were comparable. To offer better privacy protection, we summarized individual-level predicted values for the distributed logistic regression analysis in bins of 6. Full details of this approximation method can be found elsewhere [12]. For distributed Cox proportional hazards analysis, we also compared the survival function curve with the curve obtained from a PROC PHREG run with the pooled individual-level data. We considered the integration successful if the survival function curves were similar in shape and if the median times to weight loss ≥20% were equivalent.
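One way to picture the binning is sketched below. The actual approximation is described in [12]; this sketch assumes that a site sorts its predicted values, groups them into consecutive bins of 6, and shares only each bin's mean prediction and its counts of events and nonevents. The analysis center then computes the area under the curve, treating every observation in a bin as tied. All function names and data are hypothetical.

```python
from typing import List, Tuple

Bin = Tuple[float, int, int]  # (mean predicted value, events, nonevents)

def bin_predictions(preds: List[float], labels: List[int], bin_size: int = 6) -> List[Bin]:
    """Data partner: summarize sorted predictions in consecutive bins."""
    pairs = sorted(zip(preds, labels))
    bins = []
    for i in range(0, len(pairs), bin_size):
        chunk = pairs[i:i + bin_size]
        mean = sum(p for p, _ in chunk) / len(chunk)
        events = sum(label for _, label in chunk)
        bins.append((mean, events, len(chunk) - events))
    return bins

def auc_from_bins(bins: List[Bin]) -> float:
    """Analysis center: AUC with observations in the same bin counted as ties.

    Bins from different sites with identical means are not merged in this sketch.
    """
    ordered = sorted(bins)
    pos = sum(b[1] for b in ordered)
    neg = sum(b[2] for b in ordered)
    concordant, neg_below = 0.0, 0
    for _, events, nonevents in ordered:
        concordant += events * (neg_below + 0.5 * nonevents)
        neg_below += nonevents
    return concordant / (pos * neg)

preds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9, 0.95]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
auc_exact = auc_from_bins(bin_predictions(preds, labels, bin_size=1))
auc_binned = auc_from_bins(bin_predictions(preds, labels, bin_size=6))
```

With only 12 observations, two bins of 6 discard much of the ordering, so the binned value differs noticeably from the exact one. The loss shrinks as the number of bins grows relative to the bin size.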

We extracted time stamps of status changes from PopMedNet and computed the average download, upload, SAS execution, and transfer time at the data partners and the analysis center to evaluate the operational performance of the DRA application. We also reported the average iteration time for each regression model type, and the time required to perform an end-to-end DRA in our test case.

We executed all SAS-based DRA packages in SAS versions 9.3 or 9.4, on a Windows desktop or server routinely used to perform Sentinel queries. All machines used to execute the SAS-based DRA packages and DataMart Client instance operated on a Windows 7 platform, with multiple Intel core processors ranging from 2.3 to 3.4 GHz, and 8 to 16 GB of RAM (Multimedia Appendix 2).


Overview

We identified 5452 eligible patients among the 3 participating data partners (n1=1706, n2=2728, and n3=1018). Of these, 981 patients received sleeve gastrectomy, whereas 4471 patients received Roux-en-Y gastric bypass during the study period. Within 1-year postsurgery, the BMI decreased on average by 9.8 kg/m² in sleeve gastrectomy patients and 18.7 kg/m² in Roux-en-Y gastric bypass patients. Of the 981 patients who had undergone sleeve gastrectomy, 582 (59.3%) had a weight loss ≥20% within the 1-year postsurgery period, as did 3617 of the 4471 (80.9%) patients who had undergone Roux-en-Y gastric bypass. The median time to a weight loss ≥20% was 223.9 days for patients who had undergone sleeve gastrectomy and 196.2 days for patients who had undergone Roux-en-Y gastric bypass.

Precision

Tables 2-4 summarize the precision of distributed linear, logistic, and Cox proportional hazards regression analyses. Table 5 shows the model fit statistics of the 3 regression models. All DRA parameter estimates, SEs, and model fit statistics were highly comparable to the estimates obtained from the pooled individual-level analyses that used standard SAS regression procedures. The ROC curve in distributed logistic regression (Figure 3) and the survival function in distributed Cox regression (Figure 4) were similar to those obtained from the pooled individual-level data analyses. The area under the curve (AUC) was 0.6591 for distributed logistic regression vs 0.6592 for the pooled individual-level data analysis. The median time to weight loss ≥20% was 184 days in both the distributed Cox proportional hazards analysis and the pooled individual-level data analysis.

Table 2. Distributed linear regression vs pooled individual-level linear regression.
| Covariates | DRA parameter estimate | DRA SE | Pooled parameter estimate | Pooled SE | Difference in parameter estimate | Difference in SE |
|---|---|---|---|---|---|---|
| Intercept | 34.03935 | 0.61075 | 34.03935 | 0.61075 | 3.66 × 10^−12 | −9.14 × 10^−13 |
| Exposure | 2.04714 | 0.28723 | 2.04714 | 0.28723 | −4.15 × 10^−13 | −4.30 × 10^−13 |
| Age | −0.03334 | 0.00837 | −0.03334 | 0.00837 | −3.68 × 10^−14 | −1.25 × 10^−14 |
| Preindex BMI | −0.99983 | 0.00050 | −0.99983 | 0.00050 | −6.00 × 10^−15 | −7.44 × 10^−16 |
| Combined comorbidity score | 0.04388 | 0.06949 | 0.04388 | 0.06949 | 3.59 × 10^−15 | −1.04 × 10^−13 |
| Number of ambulatory visits | −0.03068 | 0.01008 | −0.03068 | 0.01008 | −6.59 × 10^−17 | −1.51 × 10^−14 |
| Number of emergency department visits | 0.10329 | 0.08749 | 0.10329 | 0.08749 | −2.79 × 10^−14 | −1.31 × 10^−13 |
| Number of inpatient visits | 0.88725 | 0.25976 | 0.88725 | 0.25976 | −6.51 × 10^−13 | −3.89 × 10^−13 |
| Number of nonacute institutional stay | 1.32338 | 1.79056 | 1.32338 | 1.79056 | 4.21 × 10^−13 | −2.68 × 10^−12 |
| Number of other ambulatory visits | 0.02159 | 0.00873 | 0.02159 | 0.00873 | 1.22 × 10^−14 | −1.31 × 10^−14 |
| Days between BMI measurement and index procedure | 0.01207 | 0.00567 | 0.01207 | 0.00567 | 3.92 × 10^−15 | −8.48 × 10^−15 |
| Race^a | | | | | | |
| Unknown | 0.94212 | 0.26841 | 0.94212 | 0.26841 | −4.16 × 10^−13 | −4.02 × 10^−13 |
| American Indian or Alaska Native | −0.30948 | 0.69817 | −0.30948 | 0.69817 | −2.39 × 10^−13 | −1.04 × 10^−12 |
| Asian | −0.16853 | 0.63001 | −0.16853 | 0.63001 | −4.52 × 10^−13 | −9.42 × 10^−13 |
| Black or African American | 1.51961 | 0.29206 | 1.51961 | 0.29206 | −9.95 × 10^−14 | −4.37 × 10^−13 |
| Native Hawaiian or other Pacific Islander | −1.22315 | 1.04973 | −1.22315 | 1.04973 | −4.11 × 10^−13 | −1.57 × 10^−12 |
| Female | −1.22366 | 0.23205 | −1.22366 | 0.23205 | −5.33 × 10^−13 | −3.47 × 10^−13 |
| Surgery year^a | | | | | | |
| 2011 | 0.15150 | 0.30361 | 0.15150 | 0.30361 | −5.94 × 10^−13 | −4.54 × 10^−13 |
| 2012 | −0.24904 | 0.30372 | −0.24904 | 0.30372 | −6.47 × 10^−13 | −4.54 × 10^−13 |
| 2013 | −0.02308 | 0.30223 | −0.02308 | 0.30223 | −6.08 × 10^−13 | −4.52 × 10^−13 |
| 2014 | 0.32767 | 0.30609 | 0.32767 | 0.30609 | −5.93 × 10^−13 | −4.58 × 10^−13 |
| 2015 | −0.25767 | 0.33352 | −0.25767 | 0.33352 | −6.18 × 10^−13 | −4.99 × 10^−13 |
| Data partner site^a | | | | | | |
| 2 | −1.10559 | 0.31373 | −1.10559 | 0.31373 | 2.89 × 10^−15 | −4.69 × 10^−13 |
| 3 | −0.10990 | 0.30341 | −0.10990 | 0.30341 | −2.07 × 10^−13 | −4.54 × 10^−13 |

^a Reference groups: race (white), surgery year (2010), and data partner site (1).

Table 3. Distributed logistic regression vs pooled individual-level logistic regression.
| Covariates | DRA parameter estimate | DRA SE | Pooled parameter estimate | Pooled SE | Difference in parameter estimate | Difference in SE |
|---|---|---|---|---|---|---|
| Intercept | 2.11573 | 0.22833 | 2.11573 | 0.22833 | −6.22 × 10^−15 | −1.00 × 10^−14 |
| Exposure | 1.06711 | 0.09895 | −1.06711 | 0.09895 | −2.00 × 10^−15 | −1.80 × 10^−16 |
| Age | −0.01606 | 0.00316 | −0.01607 | 0.00316 | −4.51 × 10^−17 | −1.57 × 10^−16 |
| Preindex BMI | 0.00003 | 0.00020 | 0.00003 | 0.00020 | 6.51 × 10^−19 | 2.44 × 10^−19 |
| Combined comorbidity score | −0.02623 | 0.02561 | −0.02623 | 0.02561 | −6.97 × 10^−16 | −3.12 × 10^−17 |
| Number of ambulatory visits | 0.01155 | 0.00447 | 0.01155 | 0.00447 | 6.25 × 10^−17 | 1.13 × 10^−17 |
| Number of emergency department visits | −0.06230 | 0.03132 | −0.06230 | 0.03133 | 3.05 × 10^−16 | 1.39 × 10^−17 |
| Number of inpatient visits | −0.12098 | 0.08940 | −0.12098 | 0.08940 | 1.75 × 10^−15 | −2.36 × 10^−16 |
| Number of nonacute institutional stay | 0.42510 | 0.78809 | 0.42510 | 0.78809 | −2.00 × 10^−15 | −3.33 × 10^−16 |
| Number of other ambulatory visits | 0.00381 | 0.00340 | 0.00381 | 0.00340 | 3.17 × 10^−17 | −2.91 × 10^−17 |
| Days between BMI measurement and index procedure | −0.00266 | 0.00201 | −0.00266 | 0.00201 | 3.90 × 10^−17 | −4.77 × 10^−18 |
| Race^a | | | | | | |
| Unknown | −0.39685 | 0.09485 | −0.39685 | 0.09485 | 0.00 × 10^+00 | −2.50 × 10^−16 |
| American Indian or Alaska Native | −0.13938 | 0.26230 | −0.13938 | 0.26230 | −1.11 × 10^−16 | 5.55 × 10^−17 |
| Asian | −0.37257 | 0.22341 | −0.37257 | 0.22341 | −3.04 × 10^−14 | 2.78 × 10^−17 |
| Black or African American | −0.29617 | 0.10507 | −0.29617 | 0.10507 | −3.33 × 10^−16 | −9.71 × 10^−17 |
| Native Hawaiian or Other Pacific Islander | −0.02910 | 0.40543 | −0.02910 | 0.40543 | −6.14 × 10^−16 | 0.00 × 10^+00 |
| Female | 0.19993 | 0.08422 | 0.19993 | 0.08422 | −1.80 × 10^−15 | −3.61 × 10^−16 |
| Surgery year^a | | | | | | |
| 2011 | −0.10269 | 0.11683 | −0.10269 | 0.11684 | 6.37 × 10^−15 | −5.55 × 10^−17 |
| 2012 | 0.05547 | 0.11897 | 0.05547 | 0.11897 | 5.45 × 10^−15 | −1.67 × 10^−16 |
| 2013 | −0.11956 | 0.11382 | −0.11956 | 0.11382 | 6.80 × 10^−15 | −1.94 × 10^−16 |
| 2014 | −0.10956 | 0.11617 | −0.10956 | 0.11617 | 4.36 × 10^−15 | −1.80 × 10^−16 |
| 2015 | 0.03701 | 0.12798 | 0.03701 | 0.12798 | 6.47 × 10^−15 | −2.50 × 10^−16 |
| Data partner site^a | | | | | | |
| 2 | −0.10433 | 0.11751 | −0.10433 | 0.11751 | 4.51 × 10^−15 | −9.99 × 10^−16 |
| 3 | 0.75506 | 0.12577 | 0.75506 | 0.12577 | 2.11 × 10^−15 | −2.50 × 10^−16 |

^a Reference groups: race (white), surgery year (2010), and data partner site (1).

Table 4. Distributed Cox proportional hazards regression vs pooled individual-level Cox proportional hazards regression.
| Covariates | DRA parameter estimate | DRA SE | Pooled parameter estimate | Pooled SE | Difference in parameter estimate | Difference in SE |
|---|---|---|---|---|---|---|
| Exposure | −0.58160 | 0.05275 | −0.58160 | 0.05275 | 6.66 × 10^−16 | −8.33 × 10^−17 |
| Age | −0.01107 | 0.00146 | −0.01107 | 0.00146 | 1.39 × 10^−17 | −9.11 × 10^−18 |
| Preindex BMI | −0.00006 | 0.00009 | −0.00006 | 0.00009 | 2.85 × 10^−19 | −1.49 × 10^−19 |
| Combined comorbidity score | −0.00787 | 0.01205 | −0.00787 | 0.01205 | −3.64 × 10^−17 | −1.04 × 10^−17 |
| Number of ambulatory visits | 0.00584 | 0.00158 | 0.00584 | 0.00158 | −2.95 × 10^−17 | 1.08 × 10^−18 |
| Number of emergency department visits | −0.01873 | 0.01679 | −0.01873 | 0.00158 | 1.56 × 10^−16 | −2.43 × 10^−17 |
| Number of inpatient visits | −0.08587 | 0.04580 | −0.08587 | 0.04580 | −9.58 × 10^−16 | −1.25 × 10^−16 |
| Number of nonacute institutional stay | 0.06626 | 0.29266 | 0.06626 | 0.29266 | 3.75 × 10^−16 | −3.33 × 10^−16 |
| Number of other ambulatory visits | 0.00279 | 0.00134 | 0.00279 | 0.00134 | 4.03 × 10^−17 | −1.52 × 10^−18 |
| Days between BMI measurement and index procedure | −0.00221 | 0.00096 | −0.00221 | 0.00096 | 2.39 × 10^−17 | −2.17 × 10^−18 |
| Race^a | | | | | | |
| Unknown | −0.18898 | 0.04765 | −0.18898 | 0.04765 | 5.27 × 10^−16 | 0.00 × 10^+00 |
| American Indian or Alaska Native | −0.07476 | 0.12019 | −0.07476 | 0.12019 | 1.25 × 10^−16 | 2.78 × 10^−17 |
| Asian | −0.22309 | 0.10933 | −0.22309 | 0.10933 | −2.78 × 10^−17 | 6.94 × 10^−17 |
| Black or African American | −0.18457 | 0.05116 | −0.18457 | 0.05116 | 1.94 × 10^−16 | −1.39 × 10^−17 |
| Native Hawaiian or Other Pacific Islander | −0.19748 | 0.17333 | −0.19748 | 0.17333 | 1.42 × 10^−15 | 2.78 × 10^−17 |
| Female | −0.00887 | 0.04052 | −0.00887 | 0.04052 | −1.24 × 10^−15 | −3.47 × 10^−17 |
| Surgery year^a | | | | | | |
| 2011 | −0.08021 | 0.05176 | −0.08021 | 0.05176 | 8.60 × 10^−16 | 1.11 × 10^−16 |
| 2012 | −0.02547 | 0.05136 | −0.02547 | 0.05136 | 4.61 × 10^−16 | 7.63 × 10^−17 |
| 2013 | −0.09519 | 0.05195 | −0.09519 | 0.05195 | 1.17 × 10^−15 | 4.86 × 10^−17 |
| 2014 | −0.16866 | 0.05235 | −0.16866 | 0.05235 | 8.60 × 10^−16 | 1.18 × 10^−16 |
| 2015 | 0.24763 | 0.05640 | 0.24763 | 0.05640 | 3.89 × 10^−16 | 1.04 × 10^−16 |
| Data partner site^a | | | | | | |
| 2 | −0.15270 | 0.05188 | −0.15270 | 0.05188 | 2.11 × 10^−15 | −6.94 × 10^−18 |
| 3 | 0.33440 | 0.05161 | 0.33440 | 0.05161 | 8.33 × 10^−16 | 2.08 × 10^−17 |

^a Reference groups: race (white), surgery year (2010), and data partner site (1).

Table 5. Comparison of model fit statistics between distributed regression and pooled individual-level data analysis.
| Regression model type and statistic or test | Distributed regression analysis | Pooled individual-level data analysis | Difference in model fit statistics |
|---|---|---|---|
| Linear | | | |
| R² | 0.9987 | 0.9987 | 3.89 × 10^−15 |
| Akaike information criterion | 20089.6538 | 20089.6538 | −1.59 × 10^−08 |
| Sawa's Bayesian information criterion | 20091.8710 | 20091.8710 | −1.59 × 10^−08 |
| Schwarz's Bayesian information criterion | 20247.5868 | 20247.5868 | −1.59 × 10^−08 |
| Logistic | | | |
| −2 log-likelihood | 5423.2491 | 5423.2491 | 1.36 × 10^−11 |
| Akaike information criterion | 5471.2491 | 5471.2491 | 1.36 × 10^−11 |
| Sawa's Bayesian information criterion | 5629.5265 | 5629.5265 | 1.36 × 10^−11 |
| Area under the ROC^a curve | 0.6591 | 0.6592 | −1.00 × 10^−04 |
| Hosmer-Lemeshow (chi-square statistics) | 1.3405 | 1.5596 | −2.19 × 10^−01 |
| Hosmer-Lemeshow, P value (df) | .995 (8) | .991 (8) | 3.38 × 10^−03 |
| Cox | | | |
| −2 log-likelihood | 66217.7270 | 66217.7270 | 1.46 × 10^−11 |
| Akaike information criterion | 66263.7270 | 66263.7270 | 1.46 × 10^−11 |
| Schwarz's Bayesian information criterion | 66409.6070 | 66409.6070 | 1.46 × 10^−11 |
| Median time to event (days) | 184 | 184 | 0 |

^a ROC: receiver operating characteristic.

Figure 3. Comparison of receiver operating characteristic curves between distributed logistic regression (left) and pooled individual-level logistic regression (right). To offer better privacy protection, individual-level predicted values were summarized in bins of 6 and transferred to the analysis center for aggregation in the distributed logistic regression analysis. The size of the bin is user-specified. ROC: receiver operating characteristic.

Figure 4. Comparison of survival functions between distributed Cox proportional hazards regression (left) and pooled individual-level Cox proportional hazards regression (right). The survival curves were evaluated at the mean value of covariates for patients with events.

Operational Performance

As expected, the closed-form solution of distributed linear regression analysis required only two iterations, one for computing the regression parameter estimates and SEs and the other for computing the model fit statistics. Both logistic and Cox proportional hazards regression analyses required 6 iterations for model convergence in our test case. Each file transfer process transferred between 3 and 10 files with sizes of 1 to 800 KB.

We extracted 111, 271, and 271 time stamps of status changes from PopMedNet for distributed linear, logistic, and Cox analysis, respectively. Table 6 summarizes the operational performance of the DRA application. It took an average of 102.4 s to complete one DRA iteration across all regression model types. The file transfer workflow (file upload, download, and transfer to the reciprocal party) accounted for 89% of the iteration time. Downloading and uploading the DRA files at the Sentinel Operations Center required an average of 28.6 and 9.8 s, respectively. File transfer from the Sentinel Operations Center to the data partners took an average of 9.4 s. Downloading and uploading the DRA files at the data partners required an average of 10.1 and 15.5 s, respectively. File transfer from the data partners to the Sentinel Operations Center took an average of 22.1 s. Computing the intermediate statistics at the data partners required an average of 8.0 s, whereas computing the updated regression parameters took an average of 3.8 s at the Sentinel Operations Center.
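The 89% share can be roughly checked from the overall step averages reported above. The sketch below is an arithmetic consistency check, not the authors' calculation; the dictionary keys are hypothetical labels. The component means add up to about 107 s, slightly more than the reported 102.4 s mean iteration time, so the result is approximate.

```python
# Overall mean step times in seconds, as reported in the text and Table 6.
transfer_steps = {
    "analysis_center_download": 28.6,
    "analysis_center_upload": 9.8,
    "transfer_to_data_partners": 9.4,
    "data_partner_download": 10.1,
    "data_partner_upload": 15.5,
    "transfer_to_analysis_center": 22.1,
}
compute_steps = {
    "data_partner_computation": 8.0,
    "analysis_center_computation": 3.8,
}

transfer_total = sum(transfer_steps.values())
compute_total = sum(compute_steps.values())

# Fraction of the summed step time spent moving files rather than computing.
transfer_share = transfer_total / (transfer_total + compute_total)
print(f"File transfer share: {transfer_share:.1%}")
```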

The distributed Cox regression required the greatest amount of iteration time (113.5 s), followed by logistic regression (95.0 s) and linear regression (91.5 s). Overall, distributed linear regression analysis with our bariatric surgery test case required 440.7 s to complete, whereas logistic and Cox proportional hazards regression analysis required 925.5 and 1016.0 s, respectively.

Table 6. Operational performance of the distributed regression analysis application. All times are in seconds.

| Performance metric | Linear | Logistic | Cox | Overall |
| --- | --- | --- | --- | --- |
| Required number of iterations for model convergence | 2 | 6 | 6 | N/A |
| Total run time | 440.7 | 925.5 | 1,016.0 | |
| Average iteration time, mean (SE) | 91.5 (10.5) | 95 (3.1) | 113.5 (5.2) | 102.4 (3.8) |
| Sentinel Operations Center (analysis center) | | | | |
| Average download time, mean (SE) | 20.5 (5.4) | 20.6 (1.3) | 39.4 (4) | 28.6 (3.2) |
| Average computation time, mean (SE) | 4.3 (2.6) | 3 (1.1) | 4.4 (0.4) | 3.8 (0.6) |
| Average upload time, mean (SE) | 8.4 (1.1) | 10.2 (0.7) | 9.9 (0.6) | 9.8 (0.4) |
| Average file transfer time (to data partners), mean (SE) | 10.5 (0.4) | 9.1 (0.5) | 9.4 (0.5) | 9.4 (0.3) |
| Data partners | | | | |
| Average download time, mean (SE) | 8.6 (1.2) | 10.3 (0.6) | 10.3 (0.8) | 10.1 (0.4) |
| Average computation time, mean (SE) | 8.2 (0.8) | 7.9 (0.4) | 8 (0.3) | 8 (0.2) |
| Average upload time, mean (SE) | 15.6 (1.2) | 15.9 (0.6) | 15.1 (0.3) | 15.5 (0.3) |
| Average file transfer time (to analysis center), mean (SE) | 20 (0.8) | 21.8 (1.9) | 23.1 (1.2) | 22.1 (1.0) |

N/A: not applicable.


Principal Findings

We have successfully integrated a SAS-based DRA package with PopMedNet, an open-source distributed networking software, and performed DRA in select data partners within a real-world DDN. Our application was able to compute regression parameters, SEs, model fit statistics, and model fit graphics of 3 regression model types (linear, logistic, and Cox proportional hazards) that were within machine precision of, or closely similar to, those produced using standard SAS regression procedures, without the need to share any individual-level data, in under 20 min. The study demonstrated the feasibility and validity of performing multivariable regression analysis in a multicenter setting while limiting the risk of disclosing sensitive individual or institutional information.

Previous Studies

Previous studies have used simulated or relatively well-controlled distributed environments to demonstrate the ability to perform DRA with only summary-level information [4-8]. These studies have consistently reported that DRA produced precise (generally <10^−12) results compared with the results from the pooled individual-level data analysis. However, information on the operational performance (computation and file transfer time) of DRA algorithms or workflows is scarce. The closest experience to our DRA application is a Web-based DRA software developed by the SCAlable National Network for Effectiveness Research (SCANNER) [11]. This software is composed of a network portal with a set of Web services and virtual machines that host data from data-contributing sites and several libraries of analytical programs. At the time of our analysis, 3 method libraries were available in the SCANNER software: a cohort discovery tool, an algorithm to perform meta-analyses with distributed data, and an algorithm to perform distributed logistic regression analysis (Grid Binary LOgistic REgression, GLORE) [6]. The authors reported that GLORE produced results equivalent to those from the pooled individual-level data analysis, and software response times of 0.015 s with a dataset of 580 records (with a binary outcome variable, a treatment indicator variable, and 24 covariates) and 27.02 s with a dataset of 10,000 records (with a binary outcome variable and 5 covariates) partitioned among 3 different institutions.

Our DRA application required significantly more time for model convergence than the SCANNER software. However, this additional time for model convergence may be considered marginal in practice, where other aspects of a multicenter study are typically more time-consuming. For example, developing a study protocol and analysis plan or assembling an analytical dataset at each participating data partner for DRA may require considerably more time than the time required to perform DRA. There are also several key differences between our application and the SCANNER software that may explain the difference in operational performance. Specifically, the SCANNER software requires users to install a virtual machine and open ports to the master node hosting the SCANNER hub. This design may have shorter file upload, transfer, and download times between the execution nodes, as files are only transferred between homogeneous virtual machines on the server and not subject to impediments such as firewall security protocols, additional workload, and upload, transfer, and download speeds.

The operational performance of the SCANNER software makes it a desirable option for DRA in networks that are amenable to installing the required software and applications. We previously found that most Sentinel data partners were unwilling to install new software or make modifications to their existing hardware configurations to perform DRA [3]. We chose to develop the DRA application using SAS and PopMedNet because all Sentinel data partners already have both software packages in their systems. In addition, several other large DDNs, including the National Patient-Centered Clinical Research Network [24] and the National Institutes of Health’s Health Care Systems Research Collaboratory [25], use PopMedNet as their file transfer software. In other words, our DRA application requires no new software installation or modifications to existing hardware configurations in DDNs that employ SAS as their statistical software and PopMedNet as their file transfer software. The 3 data partners that participated in this project are also members of numerous PopMedNet-based DDNs. Therefore, the successful integration of our SAS-based DRA package with PopMedNet and execution of DRA with these data partners have the potential to extend DRA beyond the Sentinel System.

Limitations

Our study is not without limitations. First, DRA requires infrastructure and processes beyond the algorithms and technology described in this paper. For example, DRA with our application requires harmonized individual-level datasets. Since its inception, Sentinel has continuously enhanced its common data model, routine analytical tools, and data quality assurance processes. Thus, Sentinel data partners can rapidly create harmonized analytical datasets for DRA. Research networks and investigators without the same infrastructure may not be able to perform DRA with our application as easily, even if data partners are willing to use PopMedNet as their data-sharing software.

Second, we tested the DRA application with only 3 Sentinel data partners, and all tests were completed in a Windows version of SAS (desktop or server). It is possible that different hardware configurations not found at these data partners or different versions of SAS (Linux or Unix) could change the precision and operational performance or even inhibit the execution of our DRA application. However, we previously found only 3 different configurations of the required hardware components (DataMart Client, SAS software, and the common folder structure) among Sentinel data partners [3]. All 3 hardware configurations were represented among the 3 data partners in this study. We also found the reconfiguration of these components to be relatively straightforward. Therefore, it may be possible to have data partners with other configurations make minor changes to implement our DRA application. During the development of the DRA application, we were able to successfully execute our application on a Linux server with a fourth data partner, by placing the application on a Linux server directory accessible to the DataMart Client as a mapped Windows network drive. This allowed the DataMart Client to access the same file system as the DRA application. Overall, additional testing with more data partners with different hardware configurations and different versions of SAS is needed to fully ensure that our DRA application is operable across different DDNs, research networks, operating systems, and environments.

Third, our precision and operational performance results were based on a small sample of successful end-to-end executions of our DRA application. These executions were limited to regression models with 23 variables and analytical datasets ranging from 1000 to 3000 patients distributed among 3 data partners. Future work should include more end-to-end executions, regression models with more variables, datasets of larger sample sizes, and more data partners. However, we found that 89% of the iteration time was attributed to file transfer time, which was largely driven by the number of files, size of the files transferred, and network conditions (upload, download, and transfer speeds, firewall security protocols, and workload). Because the files contain highly summarized information, increasing the number of variables or patients will not increase the number of files or substantially increase the size of the files to be transferred. In this study, each file transfer process transferred files that were less than 1 MB. Our internal testing of analyses with more variables, patients, and data partners did not result in file sizes larger than a few MBs or increase the iteration time. Thus, we do not anticipate DRA with more variables, patients, and data partners in a real-world setting to have a considerable impact on the operational performance of our DRA application. In addition, network conditions at each data partner can vary depending on the workload. We could not vary network conditions at each data partner to formally analyze its impact on the operational performance. However, we did complete our experiments with 3 Sentinel data partners, with machines that are routinely used to fulfill Sentinel query requests. Thus, our results on precision and operational performance likely represent what potential users of DRA will experience in practice.
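The reason summary files stay small can be illustrated with a back-of-the-envelope sketch. It assumes, hypothetically, that a data partner returns one gradient vector and one Hessian matrix per iteration, stored as 8-byte numbers, for a model with 23 covariates plus an intercept. The actual DRA files are SAS datasets with their own overhead and additional contents, so this is only a lower-bound illustration of how the payload scales.

```python
def summary_payload_bytes(n_covariates: int, bytes_per_number: int = 8) -> int:
    """Raw size of one gradient (p values) plus one Hessian (p x p values).

    The patient count does not appear: the summaries have the same
    dimensions whether a site holds 1000 or 1,000,000 patients.
    """
    p = n_covariates + 1  # assumed: covariates plus an intercept
    return (p + p * p) * bytes_per_number


payload_23 = summary_payload_bytes(23)   # 600 numbers
payload_46 = summary_payload_bytes(46)   # doubling the covariates
print(payload_23, payload_46)
```

Under these assumptions the payload grows with the square of the number of model terms but is unaffected by the number of patients, which is consistent with the small file sizes observed in the study.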

Fourth, our bariatric surgery test case was relatively simplistic and not as sophisticated as an actual clinical or epidemiologic study. For example, we did not include all the potential confounders. Therefore, the results of our analysis did not have any causal interpretation.

Finally, although DRA uses the intermediate statistics at each data partner to perform multivariable regression analysis, the risk of reidentifying specific individuals is not zero. Under certain conditions (eg, uncommon individual attributes coded with indicator variables), there could be leakage of personal information that could be used to infer or identify specific individuals [26]. To further protect privacy, DRA can be performed using more secure algorithms, such as those that encrypt or perturb the intermediate statistics. Future work should explore the integration of these more secure DRA algorithms into our DRA application.
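One generic way to perturb intermediate statistics is pairwise additive masking, in which random masks hide each site's contribution but cancel in the total. The sketch below illustrates that general idea only; it is not part of the DRA application and is not the protocol of reference 26. The function names are hypothetical, the statistics are assumed to be integer-coded (for example, counts or fixed-point values), and all masks are generated in one process for brevity, whereas a real deployment would have each pair of sites agree on its mask privately and use a cryptographically secure generator such as Python's `secrets` module.

```python
import random
from typing import List

MODULUS = 2**61 - 1  # assumed: any modulus larger than the true total works


def mask_values(values: List[int], rng: random.Random) -> List[int]:
    """Return one masked value per site; the pairwise masks cancel in the sum."""
    masked = [v % MODULUS for v in values]
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.randrange(MODULUS)   # mask shared by sites i and j
            masked[i] = (masked[i] + r) % MODULUS
            masked[j] = (masked[j] - r) % MODULUS
    return masked


def aggregate(masked: List[int]) -> int:
    """Analysis center: sum the masked values to recover the true total."""
    return sum(masked) % MODULUS


# Hypothetical integer summaries from 3 data partners.
site_values = [120, 45, 310]
masked_values = mask_values(site_values, random.Random(2020))
total = aggregate(masked_values)
```

The analysis center sees only the masked values, yet `total` equals the sum of the original site values because every mask is added once and subtracted once.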

Conclusions

We have successfully developed and integrated a SAS-based DRA package with an iterative and automatable PopMedNet-driven file transfer workflow to create a DRA application and conduct DRA in select data partners within a real-world DDN. The application produced results that were within machine precision of the results from the pooled individual-level data analyses using standard SAS regression procedures. The end-to-end execution times were reasonable, demonstrating that DRA can be a practical and valid analytical method in real-world settings.

Acknowledgments

This work was supported by the Office of the Assistant Secretary for Planning and Evaluation and the Food and Drug Administration (HHSF223201400030I/HHSF22301006T).

Conflicts of Interest

ST is the principal investigator of projects funded by the National Institutes of Health (U01EB023683) and the Agency for Healthcare Research and Quality (R01HS026214).

Multimedia Appendix 1

Distributed regression analysis algorithms.

DOCX File, 42 KB

Multimedia Appendix 2

Analysis center and data partner hardware description.

DOCX File, 23 KB

  1. Karr AF, Lin X, Sanil AP, Reiter JP. Secure regression on distributed databases. J Comput Graph Stat 2005;14(2):263-279.
  2. Fienberg SE, Fulp WJ, Slavkovic AB, Wrobel TA. 'Secure' log-linear and logistic regression analysis of distributed databases. In: Domingo-Ferrer J, Franconi L, editors. Privacy in Statistical Databases. Berlin, Heidelberg: Springer; 2006:277-290.
  3. Her QL, Malenfant JM, Malek S, Vilk Y, Young J, Li L, et al. A query workflow design to perform automatable distributed regression analysis in large distributed data networks. EGEMS (Wash DC) 2018 May 25;6(1):11.
  4. Jiang X, Wu Y, Marsolo K, Ohno-Machado L. Development of a web service for analysis in a distributed network. EGEMS (Wash DC) 2014;2(1):1053.
  5. Wolfson M, Wallace SE, Masca N, Rowe G, Sheehan NA, Ferretti V, et al. DataSHIELD: resolving a conflict in contemporary bioscience--performing a pooled analysis of individual-level data without sharing the data. Int J Epidemiol 2010 Oct;39(5):1372-1382.
  6. Wu Y, Jiang X, Kim J, Ohno-Machado L. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. J Am Med Inform Assoc 2012;19(5):758-764.
  7. Lu CL, Wang S, Ji Z, Wu Y, Xiong L, Jiang X, et al. WebDISCO: a web service for distributed cox model learning without patient-level data sharing. J Am Med Inform Assoc 2015 Nov;22(6):1212-1219.
  8. Jiang W, Li P, Wang S, Wu Y, Xue M, Ohno-Machado L, et al. WebGLORE: a web service for Grid LOgistic REgression. Bioinformatics 2013 Dec 15;29(24):3238-3240.
  9. Narasimhan B, Rubin DL, Gross SM, Bendersky M, Lavori PW. Software for distributed computation on medical databases: a demonstration project. J Stat Soft 2017;77(13).
  10. OBiBa Opal Documentation. 2019. R DataSHIELD Introduction URL: http://opaldoc.obiba.org/en/latest/r-user-guide/index.html [accessed 2018-02-18]
  11. Meeker D, Jiang X, Matheny ME, Farcas C, D'Arcy M, Pearlman L, et al. A system to build distributed multivariate models and manage disparate data sharing policies: implementation in the scalable national network for effectiveness research. J Am Med Inform Assoc 2015 Nov;22(6):1187-1195.
  12. Her QL, Vilk Y, Young J, Zhang Z, Malenfant J, Malek S, et al. 2018 Aug. A Distributed Regression Analysis Application Based on SAS Software. Part I: Linear and Logistic Regression URL: https://ui.adsabs.harvard.edu/#abs/2018arXiv180802387H [accessed 2019-04-15]
  13. Vilk Y, Zhang Z, Young J, Her Q, Malenfant J, Malek S, et al. 2018 Aug. A Distributed Regression Analysis Application Based on SAS Software Part II: Cox Proportional Hazards Regression URL: https://ui.adsabs.harvard.edu/#abs/2018arXiv180802392V [accessed 2019-04-15]
  14. Karr AF, Feng J, Lin X, Sanil AP, Young SS, Reiter JP. Secure analysis of distributed chemical databases without data integration. J Comput Aided Mol Des 2005;19(9-10):739-747.
  15. Li Y, Jiang X, Wang S, Xiong H, Ohno-Machado L. VERTIcal Grid lOgistic regression (VERTIGO). J Am Med Inform Assoc 2016 May;23(3):570-579.
  16. Platt R, Carnahan RM, Brown JS, Chrischilles E, Curtis LH, Hennessy S, et al. The US Food and Drug Administration's Mini-Sentinel program: status and direction. Pharmacoepidemiol Drug Saf 2012 Jan;21(Suppl 1):1-8.
  17. Ball R, Robb M, Anderson SA, Dal Pan G. The FDA's sentinel initiative--A comprehensive approach to medical product surveillance. Clin Pharmacol Ther 2016 Mar;99(3):265-268.
  18. Curtis LH, Weiner MG, Boudreau DM, Cooper WO, Daniel GW, Nair VP, et al. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidemiol Drug Saf 2012 Jan;21(Suppl 1):23-31.
  19. Sentinel System. Routine Querying Tools URL: https://www.sentinelsystem.org/sentinel/surveillance-tools/routine-querying-tools [accessed 2016-02-11]
  20. Davies M, Erickson K, Wyner Z, Malenfant J, Rosen R, Brown J. Software-enabled distributed network governance: the PopMedNet experience. EGEMS (Wash DC) 2016;4(2):1213.
  21. Karr AF, Lin X, Sanil AP, Reiter JP. Privacy-preserving analysis of vertically partitioned data using secure matrix products. J Off Stat 2009;25(1):125-138.
  22. Dankar FK. Privacy preserving linear regression on distributed databases. Trans Data Privacy 2015;8(1):3-28.
  23. Du W, Han YS, Chen S. Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification. In: Proceedings of the 2004 SIAM International Conference on Data Mining. 2004 Presented at: SIAM'04; April 22-24, 2004; Florida, USA p. 222-233.
  24. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc 2014;21(4):578-582.
  25. Richesson RL, Hammond WE, Nahm M, Wixted D, Simon GE, Robinson JG, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. J Am Med Inform Assoc 2013 Dec;20(e2):e226-e231.
  26. El Emam K, Samet S, Arbuckle L, Tamblyn R, Earle C, Kantarcioglu M. A secure distributed logistic regression protocol for the detection of rare adverse drug events. J Am Med Inform Assoc 2013 May 1;20(3):453-461.


AIC: Akaike information criterion
AUC: area under the curve
BIC: Bayesian information criterion
CIDA: Cohort Identification and Descriptive Analysis
DDN: distributed data network
DRA: distributed regression analysis
GLORE: Grid Binary LOgistic REgression
ROC: receiver operating characteristic
SCANNER: SCAlable National Network for Effectiveness Research


Edited by G Eysenbach; submitted 17.06.19; peer-reviewed by W Camelo Castillo, S Wang; comments to author 12.09.19; revised version received 05.11.19; accepted 04.02.20; published 04.06.20

Copyright

©Qoua Her, Jessica Malenfant, Zilu Zhang, Yury Vilk, Jessica Young, David Tabano, Jack Hamilton, Ron Johnson, Marsha Raebel, Denise Boudreau, Sengwee Toh. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 04.06.2020.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.