This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
A distributed data network approach combined with distributed regression analysis (DRA) can reduce the risk of disclosing sensitive individual and institutional information in multicenter studies. However, software that facilitates large-scale and efficient implementation of DRA is limited.
This study aimed to assess the precision and operational performance of a DRA application comprising a SAS-based DRA package and a file transfer workflow developed within the open-source distributed networking software PopMedNet in a horizontally partitioned distributed data network.
We executed the SAS-based DRA package to perform distributed linear, logistic, and Cox proportional hazards regression analysis on a real-world test case with 3 data partners. We used PopMedNet to iteratively and automatically transfer highly summarized information between the data partners and the analysis center. We compared the DRA results with the results from standard SAS procedures executed on the pooled individual-level dataset to evaluate the precision of the SAS-based DRA package. We computed the execution time of each step in the workflow to evaluate the operational performance of the PopMedNet-driven file transfer workflow.
All DRA results were precise (<10^{−12}), and DRA model fit curves were identical or similar to those obtained from the corresponding pooled individual-level data analyses. All regression models required less than 20 min for full end-to-end execution.
We integrated a SAS-based DRA package with PopMedNet and successfully tested the new capability within an active distributed data network. The study demonstrated the validity and feasibility of using DRA to enable more privacy-protecting analysis in multicenter studies.
Distributed regression analysis (DRA) is a suite of methods that perform multivariable regression analysis in multicenter studies without the need for pooling individual-level data [
Distributed regression analysis with horizontally partitioned data.
However, DRA is not widely used in practice due to the operational challenges in implementing the approach [
In our previous work, we enhanced PopMedNet, an open-source distributed networking software currently used by several large national distributed data networks (DDNs), to enable an automatable and iterative file transfer workflow for routine implementation of DRA [
Despite the appealing theoretical properties of DRA, applications designed to perform the analysis can still be inoperable or produce biased results in real-world settings due to unappreciated factors (eg, human errors in execution, incompatible or different software versions, network or firewall restrictions, and network conditions). Evaluating the precision of DRA applications compared with the pooled individual-level data analysis and the feasibility of performing the analysis in reasonable execution times in real-world settings is needed to demonstrate DRA as a practical and valid analytical method. In this study, we demonstrate the feasibility of using the SAS-based DRA package and PopMedNet-driven file transfer workflow to perform DRA in a real-world horizontally partitioned DDN. Specifically, we quantify the precision of the SAS-based DRA package and the operational performance of the PopMedNet-driven file transfer workflow.
Funded by the US Food and Drug Administration, the Sentinel System is an active surveillance system designed to monitor the safety of approved medical products using longitudinal, regularly updated electronic health data from a network of 18 health plans and health care delivery systems [
Sentinel uses PopMedNet to facilitate file transfers between the data partners and the Sentinel Operations Center [
There are numerous algorithms (eg, secure data integration, secure summation) for DRA in horizontally partitioned DDNs, environments where each data partner holds information about distinct patient cohorts [
We provide a brief overview of the distributed iteratively reweighted least squares and the Newton-Raphson algorithms used to implement the SAS-based DRA package for distributed linear, logistic, and Cox proportional hazards regression analysis using the Sentinel Operations Center as the analysis center in
Both the distributed iteratively reweighted least squares and Newton-Raphson algorithms in the SAS-based DRA package utilize a master-worker process, where the analysis center directs the iterative DRA computations and the data partners execute these computations on their individual-level data with input (eg, updated regression parameter estimates) from the analysis center. Thus, an iterative file transfer workflow is required to transfer the highly summarized intermediate statistics and the updated regression parameter estimates between the data partners and the analysis center until the model converges or the analysis reaches a prespecified maximum number of iterations.
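The master-worker iteration described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SAS-based package: it assumes a logistic model with an intercept and a single covariate, made-up toy data for 3 partners, and hypothetical function names (`local_summaries`, `newton_step`, `fit`). Each partner returns only its score vector and information matrix; the analysis center sums them and takes a Newton-Raphson step.

```python
import math

def local_summaries(rows, beta):
    """Data partner step: score vector and information matrix from local rows only."""
    g = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        xi = (1.0, x)  # intercept and one covariate
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        for a in range(2):
            g[a] += xi[a] * (y - p)
            for b in range(2):
                h[a][b] += xi[a] * xi[b] * p * (1.0 - p)
    return g, h

def newton_step(beta, summaries):
    """Analysis center step: sum the partner summaries and update beta."""
    g = [sum(s[0][a] for s in summaries) for a in range(2)]
    h = [[sum(s[1][a][b] for s in summaries) for b in range(2)] for a in range(2)]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    # Solve the 2x2 system h * step = g with the explicit inverse.
    step = [(h[1][1] * g[0] - h[0][1] * g[1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]
    return [beta[0] + step[0], beta[1] + step[1]], max(abs(s) for s in step)

def fit(partners, max_iter=25, tol=1e-10):
    """Iterate until the largest parameter change is below tol or max_iter is hit."""
    beta = [0.0, 0.0]
    for iteration in range(1, max_iter + 1):
        summaries = [local_summaries(rows, beta) for rows in partners]
        beta, change = newton_step(beta, summaries)
        if change < tol:
            break
    return beta, iteration

# Made-up (covariate, outcome) rows held by 3 hypothetical data partners.
p1 = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 0)]
p2 = [(1.5, 1), (2.5, 1), (0.5, 0), (3.5, 1)]
p3 = [(4.0, 1), (0.2, 1), (2.2, 0)]

distributed_beta, n_iter = fit([p1, p2, p3])
pooled_beta, _ = fit([p1 + p2 + p3])  # same algorithm on the pooled rows
print(distributed_beta, pooled_beta, n_iter)
```

Because the score and information are sums over patients, adding the per-partner pieces gives the same update as the pooled data, up to floating-point rounding.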
We previously enhanced PopMedNet to create an iterative and automatable file transfer workflow to facilitate routine DRA [
A typical DRA includes 3 major steps [
Three-step process to conduct distributed regression analysis with PopMedNet. CIDA: Cohort Identification and Descriptive Analysis Tool; DRA: Distributed Regression Analysis; SOC: Sentinel Operations Center.
We used the CIDA tool (version 3.3.6) to assemble a harmonized individual-level analytical dataset of adult patients aged 18-79 years who received sleeve gastrectomy or Roux-en-Y gastric bypass in any care setting between January 1, 2010, and September 30, 2015, at 3 Sentinel data partners. To be eligible for cohort inclusion, patients had to be continuously enrolled in a health plan with medical and drug coverage for at least 1 year before the index bariatric surgery, have at least one weight and height measurement that corresponded to a BMI ≥35 kg/m^{2} in the year before surgery, and have at least one height and weight measurement in the year after surgery. We excluded patients with any bariatric procedure during the 1-year period before the index bariatric surgery. We also excluded patients with gastrointestinal cancer or a revised bariatric surgery procedure on the day of surgery. For each regression analysis, follow-up started on the day of the index bariatric surgery and continued until the occurrence of the outcome of interest (see below), death, end of health plan enrollment, or end of the study period. For distributed linear regression analysis, the outcome was a change in BMI within 1 year post-surgery, defined by subtracting the BMI measurement closest to the end of the 1-year post-surgery date from the last BMI measurement before surgery. For logistic regression, we created a binary outcome variable indicating
Analytical datasets and variables.
Regression model type  Outcome variable (within 1 year post-surgery)  Variables (exposure and confounders)
Linear  Change in BMI  Bariatric surgery exposure, age at surgery, sex, race and ethnicity, combined Charlson-Elixhauser comorbidity score, number of ambulatory visits, number of other ambulatory visits, number of inpatient stays, number of nonacute institutional stays, number of emergency department visits, BMI before bariatric surgery, number of days between last weight and height measurement and bariatric surgery, and data partner
Logistic  Weight loss ≥20%  Same as above
Cox  Time to weight loss ≥20%  Same as above
We assembled 3 separate SAS-based DRA packages to perform distributed linear, logistic, or Cox regression analyses and assessed the association between bariatric procedure (sleeve gastrectomy vs Roux-en-Y gastric bypass) and weight loss within 1 year post-surgery, adjusting for prespecified confounders (
We distributed each SAS-based DRA package to the 3 data partners through PopMedNet (version 6.7). We instructed the data partners to (1) initiate the automated PopMedNet workflow, allowing the DataMart Client (version 6.7) to automatically download and unzip the SAS-based DRA package to a prespecified local directory, (2) manually place the individual-level analytical dataset created in step 1 in a prespecified local folder
Once the data partners and the analysis center executed their SAS-based DRA package, the package ran continuously, awaiting input files (eg, updated regression parameter estimates or intermediate statistics) and DRA computation directions (eg, compute intermediate statistics, residuals, and SEs) from the Sentinel Operations Center. We used the PopMedNet workflow to transfer input files and computation directions iteratively and automatically between the data partners and the Sentinel Operations Center.
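The "run continuously, awaiting input files" behavior can be pictured as a polling loop. The Python sketch below is only an illustration of that idea, not the SAS package or the DataMart Client: the function name `wait_for_file`, the timeout values, and the file name are all hypothetical.

```python
import os
import time

def wait_for_file(path, timeout_s=5.0, poll_s=0.1):
    """Poll until `path` exists, then return its text; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                return fh.read()
        time.sleep(poll_s)  # idle between checks instead of spinning
    raise TimeoutError(f"no input file at {path} after {timeout_s} s")

# Example use (hypothetical file name written by the transfer software):
# direction = wait_for_file("inbox/direction.txt", timeout_s=600, poll_s=5)
```

A production version would also need to guard against reading a file that is still being written, for example by waiting for a separate completion marker.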
We requested all data partners to securely transfer their de-identified individual-level analytical datasets to the Sentinel Operations Center. We assessed the precision of the SAS-based DRA package by comparing the DRA parameter estimates and SEs to those obtained from the pooled individual-level data analyses using standard SAS procedures. For distributed linear regression, we compared the model fit statistics
For distributed logistic regression, we also compared the receiver operating characteristic (ROC) curve and the area under the ROC curve with the corresponding curve and area obtained from a PROC LOGISTIC run with the pooled individual-level data. We considered the integration successful if the ROC curves were visually similar and if the areas under the curves were comparable. To offer better privacy protection, we summarized individual-level predicted values for the distributed logistic regression analysis in bins of 6. Full details of this approximation method can be found elsewhere [
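One plausible reading of the binning step is sketched below in Python. The details are assumptions, not the published approximation method: predicted values are sorted, grouped into consecutive bins of 6, and only each bin's mean score and event counts are released; the analysis center then treats each bin as a single tied score. The function names and the deterministic toy data are made up for illustration.

```python
def exact_auc(scores, labels):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bin_predictions(scores, labels, bin_size=6):
    """Data partner step: release (mean score, events, non-events) per bin only."""
    ordered = sorted(zip(scores, labels))
    bins = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        events = sum(y for _, y in chunk)
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        bins.append((mean_score, events, len(chunk) - events))
    return bins

def binned_auc(bins):
    """Analysis center step: AUC with every member of a bin sharing one score."""
    total_pos = sum(b[1] for b in bins)
    total_neg = sum(b[2] for b in bins)
    wins, neg_below = 0.0, 0
    for _, events, nonevents in sorted(bins):
        wins += events * (neg_below + 0.5 * nonevents)
        neg_below += nonevents
    return wins / (total_pos * total_neg)

# Made-up toy data: 600 distinct scores, outcomes more likely at higher scores.
n = 600
scores = [(i + 0.5) / n for i in range(n)]
labels = [1 if (119 * i) % n < i else 0 for i in range(n)]

bins = bin_predictions(scores, labels, bin_size=6)
print(exact_auc(scores, labels), binned_auc(bins))
```

In this single-site sketch the two areas can differ only through pairs that fall in the same bin, so the gap is small. Larger bins reveal less about individuals and give a coarser curve.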
We extracted time stamps of status changes from PopMedNet and computed the average download, upload, SAS execution, and transfer time at the data partners and the analysis center to evaluate the operational performance of the DRA application. We also reported the average iteration time for each regression model type, and the time required to perform an end-to-end DRA in our test case.
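As a minimal sketch of this kind of summary, the Python below pairs start and end time stamps and reports a mean and SE. The status names (`download_started`, `download_completed`), the time stamps, and the function names are made up; the actual PopMedNet status labels are not given in the text.

```python
from datetime import datetime
from math import sqrt
from statistics import mean, stdev

def durations_s(events, start_status, end_status):
    """Pair each start status with the next end status; return elapsed seconds."""
    out, started = [], None
    for stamp, status in sorted(events):
        if status == start_status:
            started = stamp
        elif status == end_status and started is not None:
            out.append((stamp - started).total_seconds())
            started = None
    return out

def mean_and_se(values):
    """Mean and standard error (sample SD divided by the square root of n)."""
    return mean(values), stdev(values) / sqrt(len(values))

# Made-up status-change log for three iterations at one site.
log = [
    (datetime(2018, 1, 1, 9, 0, 0), "download_started"),
    (datetime(2018, 1, 1, 9, 0, 20), "download_completed"),
    (datetime(2018, 1, 1, 9, 2, 0), "download_started"),
    (datetime(2018, 1, 1, 9, 2, 22), "download_completed"),
    (datetime(2018, 1, 1, 9, 4, 0), "download_started"),
    (datetime(2018, 1, 1, 9, 4, 18), "download_completed"),
]

download_times = durations_s(log, "download_started", "download_completed")
avg, se = mean_and_se(download_times)
print(f"{avg:.1f} ({se:.1f})")
```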
We executed all SAS-based DRA packages in SAS versions 9.3 or 9.4, on a Windows desktop or server routinely used to perform Sentinel queries. All machines used to execute the SAS-based DRA packages and DataMart Client instance operated on a Windows 7 platform, with multiple Intel Core processors ranging from 2.3 to 3.4 GHz, and 8 to 16 GB of RAM (
We identified 5452 eligible patients among the 3 participating data partners (n_{1}=1706, n_{2}=2728, and n_{3}=1018). Of these, 981 patients received sleeve gastrectomy, whereas 4471 patients received Roux-en-Y gastric bypass during the study period. Within 1 year post-surgery, the BMI decreased on average by 9.8 kg/m^{2} in sleeve gastrectomy patients and 18.7 kg/m^{2} in Roux-en-Y gastric bypass patients. Of the 981 patients who had undergone sleeve gastrectomy, 582 (59.3%) had a weight loss ≥20% within the 1-year post-surgery period, as did 3617 of the 4471 (80.9%) patients who had undergone Roux-en-Y gastric bypass. The median time to a weight loss ≥20% was 223.9 days for patients who had undergone sleeve gastrectomy and 196.2 days for patients who had undergone Roux-en-Y gastric bypass.
Distributed linear regression vs pooled individual-level linear regression.
Covariates  Distributed regression: parameter  Distributed regression: SE  Pooled individual-level: parameter  Pooled individual-level: SE  Difference in parameter estimate  Difference in SE
Intercept  34.03935  0.61075  34.03935  0.61075  3.66 x 10^{−12}  −9.14 x 10^{−13}
Exposure  2.04714  0.28723  2.04714  0.28723  −4.15 x 10^{−13}  −4.30 x 10^{−13}
Age  −0.03334  0.00837  −0.03334  0.00837  −3.68 x 10^{−14}  −1.25 x 10^{−14}
Pre-index BMI  −0.99983  0.00050  −0.99983  0.00050  −6.00 x 10^{−15}  −7.44 x 10^{−16}
Combined comorbidity score  0.04388  0.06949  0.04388  0.06949  3.59 x 10^{−15}  −1.04 x 10^{−13}
Number of ambulatory visits  −0.03068  0.01008  −0.03068  0.01008  −6.59 x 10^{−17}  −1.51 x 10^{−14}
Number of emergency department visits  0.10329  0.08749  0.10329  0.08749  −2.79 x 10^{−14}  −1.31 x 10^{−13}
Number of inpatient visits  0.88725  0.25976  0.88725  0.25976  −6.51 x 10^{−13}  −3.89 x 10^{−13}
Number of nonacute institutional stays  1.32338  1.79056  1.32338  1.79056  4.21 x 10^{−13}  −2.68 x 10^{−12}
Number of other ambulatory visits  0.02159  0.00873  0.02159  0.00873  1.22 x 10^{−14}  −1.31 x 10^{−14}
Days between BMI measurement and index procedure  0.01207  0.00567  0.01207  0.00567  3.92 x 10^{−15}  −8.48 x 10^{−15}
Race^{a}
Unknown  0.94212  0.26841  0.94212  0.26841  −4.16 x 10^{−13}  −4.02 x 10^{−13}
American Indian or Alaska Native  −0.30948  0.69817  −0.30948  0.69817  −2.39 x 10^{−13}  −1.04 x 10^{−12}
Asian  −0.16853  0.63001  −0.16853  0.63001  −4.52 x 10^{−13}  −9.42 x 10^{−13}
Black or African American  1.51961  0.29206  1.51961  0.29206  −9.95 x 10^{−14}  −4.37 x 10^{−13}
Native Hawaiian or Other Pacific Islander  −1.22315  1.04973  −1.22315  1.04973  −4.11 x 10^{−13}  −1.57 x 10^{−12}
Female  −1.22366  0.23205  −1.22366  0.23205  −5.33 x 10^{−13}  −3.47 x 10^{−13}
Surgery year^{a}
2011  0.15150  0.30361  0.15150  0.30361  −5.94 x 10^{−13}  −4.54 x 10^{−13}
2012  −0.24904  0.30372  −0.24904  0.30372  −6.47 x 10^{−13}  −4.54 x 10^{−13}
2013  −0.02308  0.30223  −0.02308  0.30223  −6.08 x 10^{−13}  −4.52 x 10^{−13}
2014  0.32767  0.30609  0.32767  0.30609  −5.93 x 10^{−13}  −4.58 x 10^{−13}
2015  −0.25767  0.33352  −0.25767  0.33352  −6.18 x 10^{−13}  −4.99 x 10^{−13}
Data partner site^{a}
2  −1.10559  0.31373  −1.10559  0.31373  2.89 x 10^{−15}  −4.69 x 10^{−13}
3  −0.10990  0.30341  −0.10990  0.30341  −2.07 x 10^{−13}  −4.54 x 10^{−13}
^{a}Reference groups: race (white), surgery year (2010), and data partner site (1).
Distributed logistic regression vs pooled individual-level logistic regression.
Covariates  Distributed regression analysis: parameter  Distributed regression analysis: SE  Pooled individual-level analysis: parameter  Pooled individual-level analysis: SE  Difference in parameter estimate  Difference in SE
Intercept  2.11573  0.22833  2.11573  0.22833  −6.22 x 10^{−15}  −1.00 x 10^{−14}
Exposure  −1.06711  0.09895  −1.06711  0.09895  −2.00 x 10^{−15}  −1.80 x 10^{−16}
Age  −0.01606  0.00316  −0.01607  0.00316  −4.51 x 10^{−17}  −1.57 x 10^{−16}
Pre-index BMI  0.00003  0.00020  0.00003  0.00020  6.51 x 10^{−19}  2.44 x 10^{−19}
Combined comorbidity score  −0.02623  0.02561  −0.02623  0.02561  −6.97 x 10^{−16}  −3.12 x 10^{−17}
Number of ambulatory visits  0.01155  0.00447  0.01155  0.00447  6.25 x 10^{−17}  1.13 x 10^{−17}
Number of emergency department visits  −0.06230  0.03132  −0.06230  0.03133  3.05 x 10^{−16}  1.39 x 10^{−17}
Number of inpatient visits  −0.12098  0.08940  −0.12098  0.08940  1.75 x 10^{−15}  −2.36 x 10^{−16}
Number of nonacute institutional stays  0.42510  0.78809  0.42510  0.78809  −2.00 x 10^{−15}  −3.33 x 10^{−16}
Number of other ambulatory visits  0.00381  0.00340  0.00381  0.00340  3.17 x 10^{−17}  −2.91 x 10^{−17}
Days between BMI measurement and index procedure  −0.00266  0.00201  −0.00266  0.00201  3.90 x 10^{−17}  −4.77 x 10^{−18}
Race^{a}
Unknown  −0.39685  0.09485  −0.39685  0.09485  0.00 x 10^{+00}  −2.50 x 10^{−16}
American Indian or Alaska Native  −0.13938  0.26230  −0.13938  0.26230  −1.11 x 10^{−16}  5.55 x 10^{−17}
Asian  −0.37257  0.22341  −0.37257  0.22341  −3.04 x 10^{−14}  2.78 x 10^{−17}
Black or African American  −0.29617  0.10507  −0.29617  0.10507  −3.33 x 10^{−16}  −9.71 x 10^{−17}
Native Hawaiian or Other Pacific Islander  −0.02910  0.40543  −0.02910  0.40543  −6.14 x 10^{−16}  0.00 x 10^{+00}
Female  0.19993  0.08422  0.19993  0.08422  −1.80 x 10^{−15}  −3.61 x 10^{−16}
Surgery year^{a}
2011  −0.10269  0.11683  −0.10269  0.11684  6.37 x 10^{−15}  −5.55 x 10^{−17}
2012  0.05547  0.11897  0.05547  0.11897  5.45 x 10^{−15}  −1.67 x 10^{−16}
2013  −0.11956  0.11382  −0.11956  0.11382  6.80 x 10^{−15}  −1.94 x 10^{−16}
2014  −0.10956  0.11617  −0.10956  0.11617  4.36 x 10^{−15}  −1.80 x 10^{−16}
2015  0.03701  0.12798  0.03701  0.12798  6.47 x 10^{−15}  −2.50 x 10^{−16}
Data partner site^{a}
2  −0.10433  0.11751  −0.10433  0.11751  4.51 x 10^{−15}  −9.99 x 10^{−16}
3  0.75506  0.12577  0.75506  0.12577  2.11 x 10^{−15}  −2.50 x 10^{−16}
^{a}Reference groups: race (white), surgery year (2010), and data partner site (1).
Distributed Cox proportional hazards regression vs pooled individual-level Cox proportional hazards regression.
Covariates  Distributed regression analysis: parameter  Distributed regression analysis: SE  Pooled individual-level analysis: parameter  Pooled individual-level analysis: SE  Difference in parameter estimate  Difference in SE
Exposure  −0.58160  0.05275  −0.58160  0.05275  6.66 x 10^{−16}  −8.33 x 10^{−17}
Age  −0.01107  0.00146  −0.01107  0.00146  1.39 x 10^{−17}  −9.11 x 10^{−18}
Pre-index BMI  −0.00006  0.00009  −0.00006  0.00009  2.85 x 10^{−19}  −1.49 x 10^{−19}
Combined comorbidity score  −0.00787  0.01205  −0.00787  0.01205  −3.64 x 10^{−17}  −1.04 x 10^{−17}
Number of ambulatory visits  0.00584  0.00158  0.00584  0.00158  −2.95 x 10^{−17}  1.08 x 10^{−18}
Number of emergency department visits  −0.01873  0.01679  −0.01873  0.01679  1.56 x 10^{−16}  −2.43 x 10^{−17}
Number of inpatient visits  −0.08587  0.04580  −0.08587  0.04580  −9.58 x 10^{−16}  −1.25 x 10^{−16}
Number of nonacute institutional stays  0.06626  0.29266  0.06626  0.29266  3.75 x 10^{−16}  −3.33 x 10^{−16}
Number of other ambulatory visits  0.00279  0.00134  0.00279  0.00134  4.03 x 10^{−17}  −1.52 x 10^{−18}
Days between BMI measurement and index procedure  −0.00221  0.00096  −0.00221  0.00096  2.39 x 10^{−17}  −2.17 x 10^{−18}
Race^{a}
Unknown  −0.18898  0.04765  −0.18898  0.04765  5.27 x 10^{−16}  0.00 x 10^{+00}
American Indian or Alaska Native  −0.07476  0.12019  −0.07476  0.12019  1.25 x 10^{−16}  2.78 x 10^{−17}
Asian  −0.22309  0.10933  −0.22309  0.10933  −2.78 x 10^{−17}  6.94 x 10^{−17}
Black or African American  −0.18457  0.05116  −0.18457  0.05116  1.94 x 10^{−16}  −1.39 x 10^{−17}
Native Hawaiian or Other Pacific Islander  −0.19748  0.17333  −0.19748  0.17333  1.42 x 10^{−15}  2.78 x 10^{−17}
Female  −0.00887  0.04052  −0.00887  0.04052  −1.24 x 10^{−15}  −3.47 x 10^{−17}
Surgery year^{a}
2011  −0.08021  0.05176  −0.08021  0.05176  8.60 x 10^{−16}  1.11 x 10^{−16}
2012  −0.02547  0.05136  −0.02547  0.05136  4.61 x 10^{−16}  7.63 x 10^{−17}
2013  −0.09519  0.05195  −0.09519  0.05195  1.17 x 10^{−15}  4.86 x 10^{−17}
2014  −0.16866  0.05235  −0.16866  0.05235  8.60 x 10^{−16}  1.18 x 10^{−16}
2015  0.24763  0.05640  0.24763  0.05640  3.89 x 10^{−16}  1.04 x 10^{−16}
Data partner site^{a}
2  −0.15270  0.05188  −0.15270  0.05188  2.11 x 10^{−15}  6.94 x 10^{−18}
3  0.33440  0.05161  0.33440  0.05161  8.33 x 10^{−16}  2.08 x 10^{−17}
^{a}Reference groups: race (white), surgery year (2010), and data partner site (1).
Comparison of model fit statistics between distributed regression and pooled individual-level data analysis.
Regression model type and statistic or test  Distributed regression analysis  Pooled individual-level data analysis  Difference in model fit statistics
Linear
0.9987  0.9987  3.89 x 10^{−15}
Akaike information criterion  20089.6538  20089.6538  −1.59 x 10^{−08}
Sawa's Bayesian information criterion  20091.8710  20091.8710  −1.59 x 10^{−08}
Schwarz's Bayesian information criterion  20247.5868  20247.5868  −1.59 x 10^{−08}
Logistic
−2 log-likelihood  5423.2491  5423.2491  1.36 x 10^{−11}
Akaike information criterion  5471.2491  5471.2491  1.36 x 10^{−11}
Sawa's Bayesian information criterion  5629.5265  5629.5265  1.36 x 10^{−11}
Area under the ROC^{a} curve  0.6591  0.6592  −1.00 x 10^{−04}
Hosmer-Lemeshow (chi-square statistic)  1.3405  1.5596  −2.19 x 10^{−01}
Hosmer-Lemeshow, P value (df)  .995 (8)  .991 (8)  3.38 x 10^{−03}
Cox
−2 log-likelihood  66217.7270  66217.7270  1.46 x 10^{−11}
Akaike information criterion  66263.7270  66263.7270  1.46 x 10^{−11}
Schwarz's Bayesian information criterion  66409.6070  66409.6070  1.46 x 10^{−11}
Median time to event (days)  184  184  0
^{a}ROC: receiver operating characteristic.
Comparison of receiver operating characteristic curves between distributed logistic regression (left) and pooled individual-level logistic regression (right). To offer better privacy protection, individual-level predicted values were summarized in bins of 6 and transferred to the analysis center for aggregation in the distributed logistic regression analysis. The size of the bin is user-specified. ROC: receiver operating characteristic.
Comparison of survival functions between distributed Cox proportional hazards regression (left) and pooled individual-level Cox proportional hazards regression (right). The survival curves were evaluated at the mean value of covariates for patients with events.
As expected, the closed-form solution of distributed linear regression analysis required only 2 iterations: one for computing the regression parameter estimates and SEs and the other for computing the model fit statistics. Both logistic and Cox proportional hazards regression analyses required 6 iterations for model convergence in our test case. Each file transfer process transferred between 3 and 10 files with sizes of 1 to 800 KB.
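The reason a single pass can yield both the estimates and the SEs for linear regression can be seen from the standard least-squares identities below. The notation is an assumption introduced here, not taken from the package: partner k holds design matrix X_k and outcome vector y_k, K is the number of partners, n is the total number of patients, and p is the number of parameters. The package's actual intermediate statistics may differ.

```latex
\hat{\beta} = \Bigl(\sum_{k=1}^{K} X_k^{\top} X_k\Bigr)^{-1} \sum_{k=1}^{K} X_k^{\top} y_k,
\qquad
\hat{\sigma}^{2} = \frac{\sum_{k=1}^{K} y_k^{\top} y_k \;-\; \hat{\beta}^{\top} \sum_{k=1}^{K} X_k^{\top} y_k}{n - p},
\qquad
\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^{2} \Bigl(\sum_{k=1}^{K} X_k^{\top} X_k\Bigr)^{-1}
```

Each partner only needs to share the summaries X_k^T X_k, X_k^T y_k, y_k^T y_k, and its row count. No iteration toward convergence is required because the estimate has a closed form.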
We extracted 111, 271, and 271 time stamps of status changes from PopMedNet for distributed linear, logistic, and Cox analysis, respectively.
The distributed Cox regression required the greatest amount of iteration time (113.5 s), followed by logistic regression (95.0 s) and linear regression (91.5 s). Overall, distributed linear regression analysis with our bariatric surgery test case required 440.7 s to complete, whereas logistic and Cox proportional hazards regression analysis required 925.5 and 1016.0 s, respectively.
Operational performance of the distributed regression analysis application.
Performance metric  Linear  Logistic  Cox  Overall
Required number of iterations for model convergence  2  6  6  —^{a}
Total run time (s)  440.7  925.5  1016.0  —
Average iteration time (s), mean (SE)  91.5 (10.5)  95 (3.1)  113.5 (5.2)  102.4 (3.8)

Average download time (s), mean (SE)  20.5 (5.4)  20.6 (1.3)  39.4 (4)  28.6 (3.2)
Average computation time (s), mean (SE)  4.3 (2.6)  3 (1.1)  4.4 (0.4)  3.8 (0.6)
Average upload time (s), mean (SE)  8.4 (1.1)  10.2 (0.7)  9.9 (0.6)  9.8 (0.4)
Average file transfer time (to data partners) (s), mean (SE)  10.5 (0.4)  9.1 (0.5)  9.4 (0.5)  9.4 (0.3)

Average download time (s), mean (SE)  8.6 (1.2)  10.3 (0.6)  10.3 (0.8)  10.1 (0.4)
Average computation time (s), mean (SE)  8.2 (0.8)  7.9 (0.4)  8 (0.3)  8 (0.2)
Average upload time (s), mean (SE)  15.6 (1.2)  15.9 (0.6)  15.1 (0.3)  15.5 (0.3)
Average file transfer time (to analysis center) (s), mean (SE)  20 (0.8)  21.8 (1.9)  23.1 (1.2)  22.1 (1.0)
^{a}N/A: not applicable.
We have successfully integrated a SAS-based DRA package with PopMedNet, an open-source distributed networking software, and performed DRA in select data partners within a real-world DDN. Our application was able to compute regression parameters, SEs, model fit statistics, and model fit graphics of 3 regression model types (linear, logistic, and Cox proportional hazards) that were within machine precision of, or visually similar to, those produced using standard SAS regression procedures, without the need to share any individual-level data, in under 20 min. The study demonstrated the feasibility and validity of performing multivariable regression analysis in a multicenter setting while limiting the risk of disclosing sensitive individual or institutional information.
Previous studies have used simulated or relatively well-controlled distributed environments to demonstrate the ability to perform DRA with only summary-level information [
Our DRA application required significantly more time for model convergence than the SCANNER software. However, this additional time for model convergence may be considered marginal in practice, where other aspects of a multicenter study are typically more time-consuming. For example, developing a study protocol and analysis plan or assembling an analytical dataset at each participating data partner for DRA may require considerably more time than the time required to perform DRA. There are also several key differences between our application and the SCANNER software that may explain the difference in operational performance. Specifically, the SCANNER software requires users to install a virtual machine and open ports to the master node hosting the SCANNER hub. This design may have shorter file upload, transfer, and download times between the execution nodes, as files are only transferred between homogeneous virtual machines on the server and not subject to impediments such as firewall security protocols, additional workload, and upload, transfer, and download speeds.
The operational performance of the SCANNER software makes it a desirable option for DRA in networks that are amenable to installing the required software and applications. We previously found that most Sentinel data partners were unwilling to install new software or make modifications to their existing hardware configurations to perform DRA [
Our study is not without limitations. First, DRA requires infrastructure and processes beyond the algorithms and technology described in this paper. For example, DRA with our application requires harmonized individual-level datasets. Since its inception, Sentinel has continuously enhanced its common data model, routine analytical tools, and data quality assurance processes. Thus, Sentinel data partners can rapidly create harmonized analytical datasets for DRA. Research networks and investigators without the same infrastructure may not be able to perform DRA with our application as easily, even if data partners are willing to use PopMedNet as their data-sharing software.
Second, we tested the DRA application with only 3 Sentinel data partners, and all tests were completed in a Windows version of SAS (desktop or server). It is possible that different hardware configurations not found at these data partners or different versions of SAS (Linux or Unix) could change the precision and operational performance or even inhibit the execution of our DRA application. However, we previously found only 3 different configurations of the required hardware components (DataMart Client, SAS software, and the common folder structure) among Sentinel data partners [
Third, our precision and operational performance results were based on a small sample of successful end-to-end executions of our DRA application. These executions were limited to regression models with 23 variables and analytical datasets ranging from 1000 to 3000 patients distributed among 3 data partners. Future work should include more end-to-end executions, regression models with more variables, datasets of larger sample sizes, and more data partners. However, we found that 89% of the iteration time was attributed to file transfer time, which was largely driven by the number of files, size of the files transferred, and network conditions (upload, download, and transfer speeds, firewall security protocols, and workload). Because the files contain highly summarized information, increasing the number of variables or patients will not increase the number of files or substantially increase the size of the files to be transferred. In this study, each file transfer process transferred files that were less than 1 MB. Our internal testing of analyses with more variables, patients, and data partners did not produce files larger than a few MB or increase the iteration time. Thus, we do not anticipate that DRA with more variables, patients, and data partners in a real-world setting will considerably affect the operational performance of our DRA application. In addition, network conditions at each data partner can vary depending on the workload. We could not vary network conditions at each data partner to formally analyze their impact on the operational performance. However, we did complete our experiments with 3 Sentinel data partners, with machines that are routinely used to fulfill Sentinel query requests. Thus, our results on precision and operational performance likely represent what potential users of DRA will experience in practice.
Fourth, our bariatric surgery test case was relatively simple and not as sophisticated as an actual clinical or epidemiologic study. For example, we did not include all the potential confounders. Therefore, the results of our analysis should not be interpreted causally.
Finally, although DRA uses the intermediate statistics at each data partner to perform multivariable regression analysis, the risk of reidentifying specific individuals is not 0. Under certain conditions (eg, uncommon individual attributes coded with indicator variables), there could be leakage of personal information that could be used to infer or identify specific individuals [
We have successfully developed and integrated a SAS-based DRA package with an iterative and automatable PopMedNet-driven file transfer workflow to create a DRA application and conduct DRA in select data partners within a real-world DDN. The application produced results that were within machine precision of the results from the pooled individual-level data analyses using standard SAS regression procedures. The end-to-end execution times were reasonable, demonstrating that DRA can be a practical and valid analytical method in real-world settings.
Distributed regression analysis algorithms.
Analysis center and data partner hardware description.
AIC: Akaike information criterion
AUC: area under the curve
BIC: Bayesian information criterion
CIDA: Cohort Identification and Descriptive Analysis
DDN: distributed data network
DRA: distributed regression analysis
GLORE: Grid Binary LOgistic REgression
ROC: receiver operating characteristic
SCANNER: SCAlable National Network for Effectiveness Research
This work was supported by the Office of the Assistant Secretary for Planning and Evaluation and the Food and Drug Administration (HHSF223201400030I/HHSF22301006T).
ST is the principal investigator of projects funded by the National Institutes of Health (U01EB023683) and the Agency for Healthcare Research and Quality (R01HS026214).