This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Patient body weight is a frequently used measure in biomedical studies, yet there are no standard methods for processing and cleaning weight data. Conflicting documentation on constructing body weight measurements presents challenges for research and program evaluation.
In this study, we aim to describe and compare methods for extracting and cleaning weight data from electronic health record databases to develop guidelines for standardized approaches that promote reproducibility.
We conducted a systematic review of studies published from 2008 to 2018 that used Veterans Health Administration electronic health record weight data and documented the algorithms for constructing patient weight. We applied these algorithms to a cohort of veterans with at least one primary care visit in 2016. The resulting weight measures were compared at the patient and site levels.
We identified 496 studies and included 62 (12.5%) that used weight as an outcome. Approximately 44% (27/62) included a replicable algorithm. Algorithms varied from cutoffs of implausible weights to complex models using measures within patients over time. We found differences in the number of weight values after applying the algorithms (71,961/1,175,995, 6.12% to 1,175,177/1,175,995, 99.93% of raw data) but little difference in average weights across methods (93.3, SD 21.0 kg to 94.8, SD 21.8 kg). The percentage of patients with at least 5% weight loss over 1 year ranged from 9.37% (4933/52,642) to 13.99% (3355/23,987).
Contrasting algorithms provide similar results and, in some cases, the results are not different from using raw, unprocessed data despite algorithm complexity. Studies using point estimates of weight may benefit from a simple cleaning rule based on cutoffs of implausible values; however, research questions involving weight trajectories and other, more complex scenarios may benefit from a more nuanced algorithm that considers all available weight data.
The use of electronic health records (EHRs) by health care systems has rapidly increased during the last 2 decades [
Obesity is associated with increased risk of a wide range of medical problems, including diabetes, hypertension, high blood cholesterol, cardiovascular events, bone and joint problems, and sleep apnea [
Despite being a common clinical measure, there is no standard for processing and cleaning EHR weight data for use in research and evaluation studies. Researchers are often left to select and replicate a method described by others or develop their own algorithms to define weight measures for analyses, resulting in many different definitions in the published literature [
The objectives of this study include (1) comparing algorithms for extracting and processing clinical weight measures from EHR databases and (2) providing recommendations for the use of algorithms. We used measures of patient weight from the Corporate Data Warehouse (CDW) of the Veterans Health Administration (VHA) to accomplish these objectives. The VHA includes a network of medical centers that rely on a system-wide integrated EHR system. Patient data are extracted from EHR records nightly and uploaded to a centralized CDW, which comprises relational data tables that can be accessed by data analysts, including researchers. Users extract data from the CDW and typically perform simple data checks to verify accuracy. More complex algorithms may be used, especially in research; for example, to ensure that the amount of missing data does not exceed a prespecified threshold [
We included cohorts of VHA patients based on two calendar year periods: 2008 and 2016. Previous work suggests that data quality for some CDW data fields has improved over time in terms of cleanliness and data capture [
We collected all weight and height measurements from the CDW vital sign table during the collection period. If a patient had more than one height measurement during the 4 years, we used the modal value to determine a single measure of height for each patient. If an individual had only 2 distinct recorded height values (so no modal value existed), we used the more recently collected value. We calculated BMI by dividing weight in kilograms by height in meters squared. All weight and height data were cleansed of any non-numeric characters, converting commas to decimals where appropriate.
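The height and BMI rules above can be sketched in a few lines of Python (the study's actual processing used R and SAS; the function names and the (date, height) record layout here are hypothetical):

```python
from collections import Counter

def select_height(records):
    """Pick one height per patient: the modal value; when values tie
    (eg, exactly 2 distinct heights), take the most recently collected.
    `records` is a hypothetical list of (ISO date string, height_m)."""
    counts = Counter(h for _, h in records)
    max_count = max(counts.values())
    candidates = {h for h, c in counts.items() if c == max_count}
    # scan in descending date order and return the first tied value
    for _, h in sorted(records, reverse=True):
        if h in candidates:
            return h

def bmi(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2
```

With exactly 2 distinct heights, the later one wins; with a clear mode, the mode wins regardless of date.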
Previously, our team conducted a systematic literature review to identify studies that used patient weight outcome measures from the VHA CDW [
For comparison, we divided the 12 algorithms into two conceptual groups: (1) those that included all weight measurements during a specified time frame and (2) those that were period-specific. A brief description of the key differences between algorithms by group is shown in
All algorithms were recreated from the methods sections described in the relevant publications and translated into pseudocode and then into R (version 3.6.1; R Foundation for Statistical Computing) or SAS (version 9.4; SAS Institute) code (
Conceptual description of main exclusions after applying each algorithm.^{a}

Conceptual group | Exclusions based on algorithm

Algorithms using all weight measurements:
Buta et al [ ] | Patients with ≤1 weight value; BMI <11 or >70
Chan and Raffa [ ] | Weights <23 kg or >340 kg; weights >3 SD from the mean
Maguen et al [ ] | Weights <32 kg or >318 kg; weights where the absolute value of the conditional residual from a linear mixed model was ≥10
Breland et al [ ] | Weights <34 kg or >318 kg; weight values that fell outside of specific ratios calculated within patients over time
Maciejewski et al [ ] | Weight values associated with large SDs calculated on a rolling basis
Littman et al [ ] | Weights <34 kg or >272 kg; weights where the difference from the mean was >SD; weights where the SD was >10% of the mean

Period-specific algorithms:
Rosenberger et al [ ] | Patients with <K weight measures, where K is chosen by the researcher; weights outside of 6-month time points
Noël et al [ ] | Weights ≤32 kg or ≥318 kg; patients with too few values to compute a median within fiscal quarters
Kazerooni and Lim [ ] | Weights outside of windows around 3 periods; patients missing data in any of the 3 periods
Jackson et al [ ] | Weights <34 kg or >318 kg; weights outside of a 90-day window of each time point
Goodrich et al [ ] | Weights <36 kg or >227 kg; patients with >45 kg change between periods (baseline, 6 months, and 12 months); weights outside of a 30-day window of each time point
Janney et al [ ] | Weights <41 kg or >272 kg at baseline; weights outside of a 30-day window of baseline and a 60-day window of the 6- and 12-month periods; weights resulting in >45 kg change during the study

^{a}Details of each algorithm, including code, excerpts from published methods, and pseudocode, can be found in
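To make the flavor of these exclusion rules concrete, the following is a generic two-step cleaning sketch in Python: population-level plausibility cutoffs followed by a within-patient screen. The thresholds are illustrative defaults echoing the table above, not a faithful reimplementation of any one published algorithm:

```python
def clean_weights(weights, low=34.0, high=272.0, sd_frac=0.10):
    """Stage 1: drop implausible values outside [low, high] kg.
    Stage 2: if the patient's remaining weights are highly variable
    (SD greater than sd_frac of the mean), drop weights more than
    1 SD from the mean. Thresholds are illustrative, not prescriptive."""
    kept = [w for w in weights if low <= w <= high]
    if len(kept) < 2:
        return kept
    mean = sum(kept) / len(kept)
    sd = (sum((w - mean) ** 2 for w in kept) / (len(kept) - 1)) ** 0.5
    if sd > sd_frac * mean:
        kept = [w for w in kept if abs(w - mean) <= sd]
    return kept
```

For example, `clean_weights([90, 91, 900, 92])` drops the 900 kg entry at stage 1 and leaves the remaining values untouched at stage 2.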
All algorithms were applied to the data for both cohorts and compared based on descriptive statistics, including the number of weight measures and patients retained and the mean, SD, median, and range of weight values. For comparison, we also included descriptive statistics based on the raw, unprocessed weight data during the study time frame.
Weight is often used as a risk factor or covariate in statistical models to predict health outcomes. We present an example showing the association between baseline weight and
A common metric used in weight loss evaluation studies involves
Weight is frequently measured, often resulting in several weight measures per patient over time. Researchers may be interested in assessing weight trajectories within patients over time and potentially classifying patients according to their trajectory or examining whether types of patients respond differentially to interventions. Algorithm choice may affect the trajectory of individuals and their measurements collected over time, especially for algorithms that severely reduce the number of measurements left to analyze. Instead of aggregating patient weight over a specific period, studies analyzing weight measures within patients over time use repeated-measure designs such as (generalized) linear mixed models (LMMs), analysis of variance, or analysis of covariance for estimation. To compare algorithms in this context, we used a latent class LMM that assumes the population is heterogeneous and composed of some selected number of latent classes characterized by specific trajectories.
The latent class mixed models implemented through the R package
Researchers and evaluators are often interested in comparing facilities according to the percentage of patients who meet a metric of interest. To examine this application, we calculated the percentage of patients with 1-year
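A minimal Python sketch of this facility-level metric (the flat (facility, baseline, follow-up) data layout and the exact threshold handling are assumptions for illustration):

```python
def pct_change(baseline_kg, followup_kg):
    """Percentage weight change from baseline."""
    return (followup_kg - baseline_kg) / baseline_kg * 100.0

def facility_weight_loss_rate(patients):
    """patients: iterable of (facility_id, baseline_kg, followup_kg).
    Returns {facility_id: % of that facility's patients with >=5% loss}."""
    totals, losses = {}, {}
    for facility, baseline, followup in patients:
        totals[facility] = totals.get(facility, 0) + 1
        if pct_change(baseline, followup) <= -5.0:
            losses[facility] = losses.get(facility, 0) + 1
    return {f: 100.0 * losses.get(f, 0) / n for f, n in totals.items()}
```

The same scaffold yields the ≥5% weight gain metric by flipping the comparison to `>= 5.0`.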
Both cohorts included approximately 100,000 patients (n=98,786 in 2008 and n=99,958 in 2016;
Using the raw data from the 2016 cohort, each veteran had a mean of 12.2 (SD 24.9) weights recorded over the 4-year collection period, and 1 patient had 4981 measurements (web-based supplement [
Aside from the difference in average weight between the 2 cohorts, the results did not reveal major differences in the number of weight measurements per patient or weight distributions. Therefore, the remainder of the results will focus on the 2016 cohort. The results from the 2008 cohort are included in
Descriptive statistics for the raw data and each of the 12 algorithms are shown in
Weight processing by algorithm and type of algorithm.

Item | Patients retained, n (%) | Weight measurements retained, n (% of raw weights) | Weight (kg), mean (SD; range) | Weight (kg), median (IQR)

Raw weights | 99,958 (100) | 1,175,995 (100) | 94.3 (22.0; 0–674.0) | 91.8 (27.4)

Algorithms using all weight measurements:
Buta et al [ ] | 90,159 (90.2) | 1,131,996 (96.3) | 94.3 (21.9; 12.3–111.1) | 91.9 (27.3)
Chan and Raffa [ ] | 96,132 (96.2) | 1,170,114 (99.5) | 94.3 (21.9; 24.5–330.0) | 91.8 (27.4)
Maguen et al [ ] | 98,352 (98.4) | 1,037,293 (88.2) | 93.3 (21.0; 31.9–245.4) | 91.0 (26.4)
Breland et al [ ] | 99,958 (100) | 1,175,177 (99.9) | 94.3 (21.9; 34.0–315.0) | 91.8 (27.4)
Maciejewski et al [ ] | 99,958 (100) | 1,146,995 (97.5) | 94.4 (21.8; 28.1–247.7) | 91.9 (27.2)
Littman et al [ ] | 96,130 (96.2) | 1,161,661 (98.8) | 94.3 (21.8; 34.0–247.7) | 91.9 (27.2)

Period-specific algorithms:
Rosenberger et al [ ] | 63,405 (63.4) | 227,215 (19.3) | 94.3 (21.0; 0–596.2) | 92.0 (26.3)
Kazerooni and Lim [ ] | 23,987 (24) | 71,961 (6.1) | 94.8 (21.8; 0–559.6) | 92.5 (27.2)
Goodrich et al [ ] | 95,748 (95.8) | 199,830 (17) | 93.5 (20.6; 36.3–226.8) | 91.2 (25.7)
Janney et al [ ] | 95,742 (95.8) | 199,830 (17) | 93.5 (20.6; 35.6–247.7) | 91.2 (25.7)
Jackson et al [ ] | 96,559 (96.6) | 251,501 (21.4) | 93.6 (20.6; 27.4–259.0) | 91.2 (25.9)
Noël et al [ ] | 99,958 (100) | 683,008 (58.1) | 94.0 (20.9; 31.8–267.1) | 91.6 (26.1)

^{a}These algorithms differ from the other period-specific algorithms as they first use all available data and then proceed to aggregate measures by the mean or median within select periods.
The raw, unprocessed data contained implausible values ranging from 0 kg to 674 kg. Although most algorithms involved removing outlying values—often as the first step—some did not. Most notably, data processed by two of the algorithms (Kazerooni and Lim [
Algorithms designed to use all available weights retained the bulk of the measurements (1,037,293/1,175,995, 88.21% to 1,175,177/1,175,995, 99.93%) and resulted in a similar average weight (mean 93.3–94.4, SD 21.0–22.0 kg). The SD did not decrease after applying the algorithms except for the algorithm by Maguen et al [
For the periodspecific algorithms, only 1 retained >50% of the raw weight measurements (Noël et al [
Although the mean weight did not change appreciably between the 12 algorithms, there were noticeable differences in the resulting distributions of weight. To explore these differences, we implemented a bootstrap procedure for the mean and variance by sampling 1000 patients, with replacement—thus each patient could be in each sampling iteration more than once—then evaluating the sample data with all 12 algorithms, and repeating this procedure 100 times. Each algorithm is designed to
Bootstrapped 95% CI of the SD by algorithm and algorithm type. The midpoint represents the median SD, the thick gray line represents the 80% quantile interval, and the black line represents the 95% quantile interval [
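The bootstrap described above resamples patients, not individual weights, so that all of a patient's measurements enter or leave a replicate together. A sketch under that assumption (applied here to a single cleaned data set rather than to each of the 12 algorithms):

```python
import random
import statistics

def bootstrap_sd_interval(weights_by_patient, n_patients=1000,
                          n_reps=100, seed=1):
    """Draw n_patients patients with replacement, pool their weights,
    and record the SD; repeat n_reps times and return the 95% quantile
    interval of the bootstrapped SDs."""
    rng = random.Random(seed)
    patients = list(weights_by_patient.values())
    sds = []
    for _ in range(n_reps):
        # a patient drawn twice contributes all of their weights twice
        pooled = [w for _ in range(n_patients)
                  for w in rng.choice(patients)]
        sds.append(statistics.stdev(pooled))
    sds.sort()
    return sds[int(0.025 * n_reps)], sds[int(0.975 * n_reps)]
```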
A total of 13 individual logistic regressions were computed to predict the occurrence of new-onset diabetes as a function of weight. The reported OR and 95% CI varied little between algorithms, and all ORs were slightly >1.00 (
Odds ratio (OR; 95% CI) from 13 separate logistic regressions predicting new-onset diabetes as a function of weight [
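As a stand-in for the study's statistical software, a pure-Python logistic regression fit by gradient descent illustrates how an OR per kilogram is obtained from a binary outcome (toy procedure; the learning rate and iteration count are illustrative, and this is not the study's implementation):

```python
import math

def logistic_or_per_kg(weights_kg, outcomes, lr=1e-4, epochs=20000):
    """Fit outcome ~ weight by gradient descent on the logistic
    log-likelihood and return exp(slope), the odds ratio for a
    1 kg increase in weight."""
    mean_w = sum(weights_kg) / len(weights_kg)
    x = [w - mean_w for w in weights_kg]  # center for numerical stability
    b0 = b1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, outcomes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += p - yi
            g1 += (p - yi) * xi
        b0 -= lr * g0
        b1 -= lr * g1
    # centering shifts the intercept but leaves the slope (and OR) intact
    return math.exp(b1)
```

With outcomes that become more frequent at higher weights, the returned OR exceeds 1, mirroring the direction of the associations reported above.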
The average weight change was slightly <0, ranging from −0.13 kg to −0.43 kg, with the largest discrepancy resulting from the Maguen et al [
Comparing weight loss metrics by algorithm: common measures of weight loss and gain ≥5% and average weight change from baseline.

Item | Patients retained,^{a} n (%) | Weight loss ≥5% from baseline, n (%) | Weight gain ≥5% from baseline, n (%) | Average weight change from baseline (kg), mean (SD; range)

Raw weights | 60,286 (60.3) | 8162 (13.5) | 6977 (11.6) | −0.13 (7.3; −456 to 485)

Algorithms using all weight measurements:
Buta et al [ ] | 57,014 (57) | 7762 (13.6) | 6642 (11.6) | −0.27 (5.4; −111 to 126)
Chan and Raffa [ ] | 60,175 (60.2) | 8069 (13.4) | 6902 (11.5) | −0.26 (5.4; −231 to 126)
Maguen et al [ ] | 52,642 (52.7) | 4933 (9.4) | 4088 (7.8) | −0.17 (3.5; −33 to 44)
Breland et al [ ] | 60,225 (60.3) | 8124 (13.5) | 6936 (11.5) | −0.27 (5.2; −117 to 94)
Maciejewski et al [ ] | 58,457 (58.5) | 7985 (13.7) | 6810 (11.6) | −0.28 (5.1; −53 to 88)
Littman et al [ ] | 59,773 (59.8) | 7851 (13.1) | 6787 (11.4) | −0.22 (4.9; −54 to 49)

Period-specific algorithms:
Rosenberger et al [ ] | 38,875 (38.9) | 5425 (14) | 4725 (12.2) | −0.31 (6.4; −454 to 135)
Kazerooni and Lim [ ] | 23,987 (24) | 3355 (14) | 2503 (10.4) | −0.43 (5.6; −242 to 136)
Goodrich et al [ ] | 58,142 (58.2) | 7828 (13.5) | 6688 (11.5) | −0.27 (5.2; −53 to 93)
Janney et al [ ] | 58,171 (58.2) | 7842 (13.5) | 6679 (11.5) | −0.28 (5.4; −132 to 127)
Jackson et al [ ] | 59,770 (59.8) | 7973 (13.3) | 6494 (10.9) | −0.32 (5.1; −111 to 104)
Noël et al [ ] | 58,525 (58.5) | 7786 (13.3) | 6624 (11.3) | −0.26 (5.2; −111 to 88)

^{a}Number of patients retained after applying the algorithm. N=99,958 (number of veterans in the 2016 cohort).
For each algorithm, the individual trajectories were modeled using a random slope and intercept. Latent class membership represents a choice by the statistical modeler; here, for both conceptual and parsimonious reasons, a 3class model was chosen for analysis.
The choice of algorithm can affect predicted weight loss and weight gain within 1 year. Each algorithm produced a slightly different slope and intercept for each class (eg, for the raw data,
Group-based trajectory modeling by algorithm [
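A latent class mixed model is too involved for a short listing, but a crude stand-in conveys the idea of classifying patients by trajectory: fit each patient's least-squares slope and bin it. The three labels and the 0.5 kg/month cut are illustrative only and are not the model used in the study:

```python
def ols_slope(times, weights):
    """Least-squares slope of weight (kg) against time."""
    n = len(times)
    mt, mw = sum(times) / n, sum(weights) / n
    num = sum((t - mt) * (w - mw) for t, w in zip(times, weights))
    den = sum((t - mt) ** 2 for t in times)
    return num / den

def classify_trajectory(times_months, weights_kg, cut=0.5):
    """Bin a patient into 'loss', 'stable', or 'gain' by fitted slope."""
    slope = ols_slope(times_months, weights_kg)
    if slope <= -cut:
        return "loss"
    if slope >= cut:
        return "gain"
    return "stable"
```

Unlike a latent class model, this two-stage shortcut ignores uncertainty in each patient's slope, which is one reason the study fit the classes and trajectories jointly.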
The percentage of patients with ≥5% weight loss and gain was calculated for each of the 130 facilities using the raw weight data and the weight data as processed by each algorithm. Using the raw data, the percentage of patients with ≥5% weight loss ranged from 2% (1/44) to 19.7% (78/395) across facilities, with an average of 13.5% (SD 2.6%). Across algorithms, the percentage of patients who met the metric ranged from a minimum of 2% (1/44) to a maximum of 26% (13/50). For weight gain, the percentage of patients with ≥5% weight gain ranged from 6% (14/234) to 20% (9/44) across facilities using the raw data, with an average of 11.6% (SD 2.3%); across algorithms, the percentage of patients ranged from 3.1% (12/386) to 27% (14/51).
Facility-level percentage of patients with ≥5% weight loss by algorithm. Facilities are ranked along the x-axis according to the percentage of patients who met the metric using raw data, with higher-ranking facilities having greater rates of patients meeting the metric. The percentage of patients who met the metric calculated by each algorithm is displayed for each facility [
For many applications, the differences between weight-processing algorithms are minor, implying that a simpler algorithm design may be accurate and computationally more efficient in many scenarios. Furthermore, in some cases, the results are not appreciably different from using raw, unprocessed data.
There are subtleties between each algorithm and algorithm type that appear to be more appropriate for specific applications. For example, if it is assumed within a cohort that weight will be lost or gained linearly (eg, weight loss programs or patients with terminal cancer), the Maguen et al [
Studies using point estimates of weight (descriptive statistics and weight as a predictor) and weight change may benefit from a simple cleaning rule based on cutoffs of implausible values, such as excluding weights <34 kg or >318 kg. However, we also recommend examining the computed
Among the algorithms that used all weight measures, most removed outliers within patients, often using some variation of
Studies examining weight trajectories and facility-level metrics may benefit from a more nuanced algorithm that considers all available weight data. With respect to trajectory analyses, Kazerooni and Lim [
As an example of a recommendation, based on preliminary findings, we used a 2-stage algorithm to derive and clean a weight outcome for the study by Miech et al [
These data can be stratified in many ways; for brevity, we chose to display the main results assuming homogeneity of the sample. However, because stratifying by demographic or clinical factors had the potential to change our results and conclusions, we also repeated our analysis stratified by patient sex and by categories of weight, namely underweight, overweight, and obese (web-based supplement [
Similar to the choice of data, the methods we chose to address the impact of algorithms were tested on a small selection of analytic approaches while disregarding others that researchers may wish to use. Chiefly, we did not examine the impact on a broader set of machine learning or artificial or computational intelligence approaches common in big data analytics. Further combining machine learning, missing data imputation, and the impact of algorithm choice could prove to be an invaluable resource for the clinical research community.
Our data lack a gold standard; thus, we cannot establish that a presumed outlier is in fact implausible, and it is possible that some individuals experienced drastic weight changes that were not considered. Patients who were pregnant during the period were excluded; however, other diseases or conditions may be associated with dramatic weight shifts, and amputation in patients with diabetes could also be considered. We did examine the impact of including weight measures from the inpatient setting as well as bariatric surgery patients and found only 2 individuals with implausible weight change values (web-based supplement [
In addition, many algorithms were designed using a specific cohort of patients or an analytic approach, which may not transfer to a general patient cohort. The Maciejewski et al [
Our conclusion that applying a simple algorithm or filter may be enough to
Finally, all algorithms were reconstructed from the published methods and supplemental material, and there was potential for misinterpretation. In the era of big data analytics and use of patient EHR data for research and evaluation, it is essential that details surrounding data processing and measure creation are included in supplemental materials or shared code (eg, GitHub, Bitbucket, or Docker) to facilitate reproducibility and replication efforts.
In this paper, we presented several applications of algorithms to process weight measurements obtained from EHRs and attempted to provide recommendations for common research scenarios. Different algorithms result in generally similar results. In some cases, the results are not different from using raw, unprocessed data, despite algorithm complexity. Studies using point estimates of weight (descriptive statistics and weight as a predictor) and weight change may benefit from a simple cleaning rule based on cutoffs of implausible values. Research questions involving weight trajectories and facilitylevel metrics may benefit from a more nuanced algorithm that considers all available weight data.
Supplementary tables and figures.
CDW: Corporate Data Warehouse
EHR: electronic health record
LMM: linear mixed model
OR: odds ratio
VHA: Veterans Health Administration
The authors would like to thank their colleagues Eugene Oddone, Michael Goldstein, Jane Kim, Sophia Califano, Margaret Dundon, Sophia Hurley, and Felicia McCant for ongoing support for this work, including reviewing drafts of this manuscript and providing valuable feedback. This work was supported by the National Center for Health Promotion and Disease Prevention of the Veterans Health Administration.
None declared.