This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Artificial intelligence–enabled electronic health record (EHR) analysis can revolutionize medical practice from the diagnosis and prediction of complex diseases to making recommendations in patient care, especially for chronic conditions such as chronic kidney disease (CKD), which is one of the most frequent complications in patients with diabetes and is associated with substantial morbidity and mortality.
The longitudinal prediction of health outcomes requires effective representation of temporal data in the EHR. In this study, we proposed a novel temporal-enhanced gradient boosting machine (GBM) model that dynamically updates and ensembles learners based on new events in patient timelines to improve the prediction accuracy of CKD among patients with diabetes.
Using a broad spectrum of deidentified EHR data on a retrospective cohort of 14,039 adult patients with type 2 diabetes and GBM as the base learner, we validated our proposed Landmark-Boosting model against three state-of-the-art temporal models for rolling predictions of 1-year CKD risk.
The proposed model uniformly outperformed other models, achieving an area under receiver operating curve of 0.83 (95% CI 0.76-0.85), 0.78 (95% CI 0.75-0.82), and 0.82 (95% CI 0.78-0.86) in predicting CKD risk with automatic accumulation of new data in later years (years 2, 3, and 4 since diabetes mellitus onset, respectively). The Landmark-Boosting model also maintained the best calibration across moderate- and high-risk groups and over time. The experimental results demonstrated that the proposed temporal model can not only accurately predict 1-year CKD risk but also improve performance over time with additionally accumulated data, which is essential for clinical use to improve renal management of patients with diabetes.
Incorporation of temporal information in EHR data can significantly improve predictive model performance and will particularly benefit patients who follow up with their physicians as recommended.
With the rapid digitization of health care data, modern electronic health records (EHRs) hold considerable promise for driving scientific advances in various aspects of biomedicine through the use of machine learning techniques. EHRs contain not only diverse clinical data elements that can better describe a patient’s overall health status but also rich longitudinal data that serve as a critical source for understanding the evolution of disease and the management of chronic conditions. Developing accurate risk prediction models to drive timely initiation of appropriate therapies and monitoring is of paramount importance for conditions that have a substantial public health impact and can benefit greatly from early intervention.
Chronic kidney disease (CKD), especially CKD attributed to diabetes, that is, diabetic kidney disease (DKD), certainly falls within this category [
The effective use of temporal EHR data for predictive modeling remains a challenge owing to its highly variable sampling rates across different groups of patients (eg, patients may not follow the annual check protocol and only visit the hospital for critical health events) and distinct data types (eg, vital signs are noted hourly during inpatient encounters, whereas laboratory tests and medications are recorded when clinicians order them, and demographic data are more stable). Attempts have been made to handle temporal information in a variety of clinical applications. One approach involves representing the time series of clinical features with a single heuristic value (eg, taking the latest value or the trend [
In the prediction of kidney-related events, single-value abstraction is the most popular approach for its simplicity but at the expense of reduced temporal granularity. For example, in the ADVANCE prospective study for diabetic nephropathy, only baseline values of selected labs and vitals are used in a Cox proportional survival model [
In this study, we propose a new approach for incorporating the temporal information in medical history of patients with diabetes to further improve the predictive model for evaluating their risk of renal complication in the next year. Because of its robustness, efficiency, and established efficacy in the prediction of kidney events [
We adopted the Surveillance, Prevention, and Management of Diabetes Mellitus definition of diabetes in this study. Diabetes was defined based on the following: (1) the use of glucose-lowering medications (insulin or oral hypoglycemic medications); or (2) an HbA1c level of 6.5% or greater, random glucose of 200 mg/dL or greater, or fasting glucose of 126 mg/dL or greater on at least two different dates within 2 years; or (3) any two type 1 or type 2 DM diagnoses given on 2 different days within 2 years; or (4) any two distinct types of events among (1), (2), or (3); and (5) excluding any gestational diabetes (temporary glucose rise during pregnancy) [
DKD was defined as diabetes with the presence of microalbuminuria or proteinuria, impaired glomerular filtration rate (GFR), or both [
The study constructed a retrospective cohort using deidentified EHR and billing data from November 2007 to December 2017 in the University of Kansas Medical Center’s integrated clinical data repository Healthcare Enterprise Repository for Ontological Narration (HERON) [
Study cohort inclusion and exclusion. Note that the exclusion counts do not necessarily add up to the difference between the initial and final population, as 1 patient could satisfy multiple exclusion criteria. ACR: albumin-to-creatinine ratio; DKD: diabetic kidney disease; DM: diabetes mellitus; eGFR: estimated glomerular filtration rate.
According to our data, the heuristic time between 2 adjacent outpatient eGFR or ACR labs is on average 1 year per patient. Thus, for a patient
Each patient was then represented by collecting 15 common types of clinical observations from HERON [
Integrated data repository data domain categories.
Domain | Descriptions | Data type | Number of eligible featuresa | Patientsb, n (%) |
Alerts | Includes drug interactions, dose warnings, medication administration warnings, and best practice alerts | Binary | 531 | 11,848 (84.39) |
Allergy | Includes documented allergies and reactions | Binary | 49 | 5044 (35.93) |
Demographics | Basic demographics such as age, gender, race, etc, as well as their reachability, and some geographical information | Binary/numeric | 10 | 14,039 (100.00) |
Diagnoses | Organized using ICDc-9 and ICD-10 hierarchies. Intelligent Medical Objects interface terms are grouped to ICD-9 and ICD-10 levels. Diagnosis resources are further separated by source of the assignment (eg, EMRd, professional billing, technical billing, and registry). | Binary | 1186 | 12,616 (89.86) |
History | Contains family, social (ie, smoking), and surgical history from the EMR, as well as engineered features such as number of distinct clinical facts and clinical fact increments since last collection point | Binary/numeric | 155 | 12,178 (86.74) |
Laboratory tests | Results of a variety of laboratory tests, including cardiology and microbiology findings. Note that the actual laboratory values are used in modeling, if available. | Binary/numeric | 685 | 11,990 (85.40) |
Medications | Includes dispensing, administration, prescriptions, as well as home medication reconciliation at the University of Kansas Hospital grouped at Semantic Clinical Drug Form or Semantic Clinical Brand Form level. Medication resources are further separated by types of medication activity. | Binary | 1205 | 8295 (59.09) |
Procedures | Includes Current Procedural Terminology professional services and inpatient ICD-9 billing procedure codes. | Binary | 560 | 12,460 (88.75) |
Orders | Includes physician orders for nonmedications, such as culture and imaging orders from the EMR. | Binary | 1053 | 12,460 (88.75) |
Vizient (billing) | Formerly the University HealthSystem Consortium. Includes billing classifications such as Diagnosis Related Groups, comorbidities, discharge placement, length of stay, and national quality metrics. | Binary | 657 | 3619 (25.78) |
Visit details | Includes visit types, vital signs collected at the visit, discharge disposition, and clinical services providing care from both EMR and billing. | Binary/numeric | 474 | 13,671 (97.38) |
aThis does not include all distinct concepts from the entire Healthcare Enterprise Repository for Ontological Narration system; it only includes the total number of distinct features that had ever been recorded for at least one patient in the study cohort.
bThis is the number of patients who have at least one observation during any time window recorded from the corresponding data domain.
cICD: International Classification of Diseases.
dEMR: electronic medical record.
In
Clinical feature densities across data types. Each row corresponds to the average number of distinct clinical facts per patient for a certain type of clinical data over 5 years before and after DM onset. The darker the region is, the more distinct facts have been recorded for patients on average within the corresponding time window. DM: diabetes mellitus; UHC: University HealthSystem Consortium.
In
Clinical observation intensity.
Data typea | Mean time lapse (days) | Within-patient standard deviation (days) | Between-patient standard deviation (days) | P value |
Alerts | 67 | 93 | 146 | <.001 |
Allergy | 169 | 158 | 214 | <.001 |
Diagnoses | 87 | 105 | 133 | <.001 |
History | 184 | 230 | 872 | <.001 |
Laboratory tests | 107 | 122 | 175 | <.001 |
Medications | 70 | 70 | 137 | <.001 |
Procedures | 74 | 99 | 132 | <.001 |
Orders | 81 | 95 | 127 | <.001 |
Vizient | 228 | 189 | 304 | <.001 |
Visit details | 36 | 61 | 70 | <.001 |
aDemographics are not included as they are unique at the patient level.
For the clinical task of predicting DKD risk over the next year, we first randomly divided the 14,039 patients into a training set (80%) for model development and a held-out validation set (20%; the testing set in what follows) for performance evaluation. To simulate a more realistic clinical scenario and account for the bias caused by varying degrees of health care exposure over time, we stepped forward through each patient’s time course and built a prediction model at each landmark time, that is, every full year since DM onset, for rolling predictions of 1-year DKD risk. As such, an individual may contribute to or be tested by one or more prediction models, depending on eligibility at each landmark time.
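The split and the rolling landmark cohorts described above can be sketched as follows. This is a minimal illustration, not the study's code: the record fields `dkd_year` and `last_followup_year` (both in years since DM onset), the function names, and the simplified eligibility rule (no prior DKD, and an observable 1-year outcome) are assumptions made for the example. The additional requirement of a valid eGFR or ACR reading in the current window is omitted.

```python
import random

def split_patients(patient_ids, train_frac=0.8, seed=0):
    """Random patient-level split into training and testing ids."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(train_frac * len(ids)))
    return ids[:cut], ids[cut:]

def landmark_cohorts(patients, landmarks=(0, 1, 2, 3, 4)):
    """Build one (patient id, label) list per landmark time.

    patients: dict id -> {"dkd_year": float or None, "last_followup_year": float}
    (hypothetical fields). Label is 1 if DKD onset falls in (t, t + 1].
    """
    cohorts = {}
    for t in landmarks:
        rows = []
        for pid, rec in patients.items():
            dkd = rec["dkd_year"]
            if dkd is not None and dkd <= t:
                continue  # already developed DKD before this landmark
            if dkd is None and rec["last_followup_year"] < t + 1:
                continue  # censored before the 1-year outcome can be observed
            rows.append((pid, int(dkd is not None and t < dkd <= t + 1)))
        cohorts[t] = rows
    return cohorts

patients = {
    "a": {"dkd_year": 0.5, "last_followup_year": 0.5},
    "b": {"dkd_year": None, "last_followup_year": 2.3},
    "c": {"dkd_year": 2.5, "last_followup_year": 2.5},
}
cohorts = landmark_cohorts(patients, landmarks=(0, 1, 2))
train_ids, test_ids = split_patients(range(10))
```

A patient such as `"c"` contributes a negative example at landmarks 0 and 1 and a positive example at landmark 2, which is how one individual can appear in several landmark models.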
We chose GBMs as the baseline training model, which were then combined with four different approaches to incorporate temporal data. GBM is a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications [
Missing values were handled in the following fashion: for categorical data, a value of 0 was set for missing, whereas for numerical data, a
We used the AUROC and the area under the precision-recall curve (AUPRC) to compare overall prediction performance, with the latter known to be more robust to imbalanced datasets. In addition, we characterized calibration by the observed-to-expected outcome ratio (O:E), which measures agreement between the predicted and observed risk on average across observations. To make fair performance comparisons among the different methods, we treated testing examples with a predicted probability of outcome in the top 40th percentile as positive cases and examined each model’s ability to distinguish positive from negative cases by reporting the sensitivity, specificity, positive predictive values (PPVs), and negative predictive values.
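As a rough sketch of these evaluation metrics, the snippet below computes a rank-based AUROC, the O:E ratio, and the threshold-dependent metrics. It assumes that "top 40th percentile" means flagging the 40% of testing examples with the highest predicted risk; the function names and toy data are illustrative only, and AUPRC is left out for brevity.

```python
def auroc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def observed_to_expected(y_true, y_prob):
    """O:E ratio: observed event count over the sum of predicted risks."""
    return sum(y_true) / sum(y_prob)

def top_fraction_metrics(y_true, y_prob, fraction=0.4):
    """Flag the highest-risk `fraction` of cases as positive (assumed reading)."""
    n_flag = max(1, round(fraction * len(y_prob)))
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i], reverse=True)
    flagged = set(order[:n_flag])
    tp = sum(1 for i in flagged if y_true[i] == 1)
    fp = len(flagged) - tp
    fn = sum(y_true) - tp
    tn = len(y_true) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy data (not from the study)
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
metrics = top_fraction_metrics(y, p)
auc = auroc(y, p)
oe = observed_to_expected(y, p)
```

Fixing the flagged fraction, rather than a probability cutoff, keeps the number of predicted positives identical across models, so sensitivity and PPV can be compared directly.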
In this approach, we simply collect the last observed value before each landmark time for each predictor across all time windows (
Given the variables for all time windows T, the Stack-Temporal approach concatenates the variables from all windows to represent patient
The Discrete-Survival approach simulates a discrete-time survival framework by separating the full course of patient’s medical history into
To build the continuous learning mechanism, we developed a new method by extending the classical GBM to ensemble learners over time, that is, from one landmark time to the next (
where
Illustration of the temporal approaches: Latest-Value, Stack-Temporal, Discrete-Survival, and Landmark-Boosting, from top to bottom. Circles of different colors represent different types of clinical data. Red triangles represent the true outcome (ie, diabetic kidney disease [DKD] or non-DKD in the following prediction window). Blue triangles represent the outcome predicted from the clinical features present in the previous observation window. $X_{t_i}$ denotes all available clinical features collected strictly before landmark time $t_i$ (ie, the number of full years since DM onset). $y_{t_i}$ denotes the true label of DKD onset within the prediction window $(t_i, t_{i+1})$. DM: diabetes mellitus.
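To make the first two feature representations concrete, the sketch below builds Latest-Value and Stack-Temporal feature dictionaries from a list of timestamped observations. It assumes 1-year windows indexed by full years since DM onset and keeps the last value within each window; the tuple layout `(year, feature, value)` and the function names are hypothetical.

```python
import math

def latest_value(observations, landmark):
    """Latest-Value: most recent value of each feature strictly before the landmark."""
    latest = {}
    for year, feature, value in sorted(observations, key=lambda o: o[0]):
        if year < landmark:
            latest[feature] = value  # later observations overwrite earlier ones
    return latest

def stack_temporal(observations, landmark):
    """Stack-Temporal: one column per (feature, yearly window) before the landmark."""
    stacked = {}
    for year, feature, value in sorted(observations, key=lambda o: o[0]):
        if year < landmark:
            window = math.floor(year)  # window index = full years since DM onset
            stacked[(feature, window)] = value  # last value within the window
    return stacked

# Toy timeline for one patient: (years since DM onset, feature, value)
obs = [
    (0.2, "hba1c", 7.1),
    (0.8, "hba1c", 7.9),
    (1.4, "hba1c", 8.3),
    (1.6, "sbp", 140),
]
lv = latest_value(obs, landmark=2)
st = stack_temporal(obs, landmark=2)
```

Latest-Value keeps the feature space fixed but discards history, whereas Stack-Temporal preserves the per-window trajectory at the cost of a feature count that grows with every additional window.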
Pseudocode for landmark boosting algorithm. In this experiment, Mt (the number of trees at each iteration is set to 1000), α (learning rate), and Ω(hMt) (levels of each tree) are hyperparameters tuned by 10-fold cross-validation on the training dataset at each iteration.
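The following is a simplified sketch of one way to carry a boosted ensemble from one landmark time to the next, and is not the authors' implementation. It assumes that each patient's log-odds score from the previous landmark model is used as the starting point for boosting at the current landmark, that patients seen for the first time start from the cohort's prior log-odds, and that depth-1 trees (stumps) stand in for full trees. The function names, toy data, and hyperparameter values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares decision stump; returns (feature, threshold, left, right) or None."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return None if best is None else best[1:]

def boost(X, y, init_scores, n_trees, lr):
    """Gradient boosting on logistic loss, starting from the given log-odds scores."""
    scores, stumps = list(init_scores), []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]  # negative gradient
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, thr, lm, rm = stump
        stumps.append(stump)
        scores = [s + lr * (lm if row[j] <= thr else rm) for s, row in zip(scores, X)]
    return stumps, scores

def landmark_boost(landmarks, n_trees=20, lr=0.3):
    """landmarks: time-ordered list of (patient ids, feature rows, labels)."""
    carried, models = {}, []
    for ids, X, y in landmarks:
        rate = (sum(y) + 0.5) / (len(y) + 1.0)  # smoothed event rate
        prior = math.log(rate / (1.0 - rate))
        # Assumption: warm-start each patient from the previous landmark's score.
        init = [carried.get(pid, prior) for pid in ids]
        stumps, scores = boost(X, y, init, n_trees, lr)
        models.append(stumps)
        carried = dict(zip(ids, scores))
    return models, carried

# Toy data: p3 and p4 develop DKD in year 1 and drop out of the next landmark.
landmarks = [
    (["p1", "p2", "p3", "p4"], [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]),
    (["p1", "p2"], [[0.5], [2.5]], [0, 1]),
]
models, final_scores = landmark_boost(landmarks)
risks = {pid: sigmoid(s) for pid, s in final_scores.items()}
```

In this toy run, `p2` starts the second landmark with a low carried-over score and is pushed upward by the new trees once fresh evidence arrives, which is the kind of self-adjustment over time that the method aims for.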
At each landmark time, a patient was eligible if a valid eGFR or ACR reading was present in the current time window and the patient had neither developed DKD nor been censored in the previous time windows. As shown in
Mean age and the proportion of white patients decreased mildly over the landmark times. In addition, we compared such case-mix shifts between training and testing sets and found no significant differences (
Case-mix shift over landmark time.
Landmark time (number of years since DMa onset) | Eligible, n (%) | DKDb, n (%) | Age (years), mean (SD) | Sex (male), n (%) | Race (white), n (%) |
0 | 10,705 (76.25) | 1673 (15.63) | 58 (13) | 5229 (48.84) | 7221 (67.45) |
1 | 7755 (72.44) | 1467 (18.92) | 58 (13) | 3782 (48.77) | 5185 (66.86) |
2 | 5689 (73.36) | 1163 (20.44) | 57 (13) | 2734 (48.06) | 3715 (65.30) |
3 | 4113 (72.30) | 914 (22.22) | 56 (12) | 2002 (48.67) | 2671 (64.94) |
4 | 3006 (73.09) | 740 (25.73) | 56 (12) | 1480 (49.23) | 1941 (64.57) |
aDM: diabetes mellitus.
bDKD: diabetic kidney disease.
Case-mix shift in training and testing sets.
Landmark time (number of years since DMa onset) | Training (n=11,184) | Testing (n=2855) | P value |
Eligible, n | | | |
0 | 8524 | 2181 | —c |
1 | 6174 | 1581 | — |
2 | 4537 | 1152 | — |
3 | 3254 | 859 | — |
4 | 2366 | 640 | — |
DKD, n (%) | | | |
0 | 1352 (15.86) | 321 (14.72) | .19 |
1 | 1174 (19.02) | 293 (18.53) | .66 |
2 | 952 (20.98) | 211 (18.32) | .05 |
3 | 732 (22.50) | 182 (21.19) | .41 |
4 | 586 (24.77) | 154 (24.06) | .71 |
Age (years), mean (SD) | | | |
0 | 57.8 (13.1) | 57.4 (13.1) | .98 |
1 | 57.6 (12.8) | 57.3 (12.7) | .98 |
2 | 57.0 (12.6) | 56.9 (13.1) | >.99 |
3 | 56.4 (12.6) | 57.1 (12.0) | .96 |
4 | 56.1 (12.3) | 56.7 (11.7) | .99 |
Sex (male), n (%) | | | |
0 | 4183 (49.07) | 1046 (47.96) | .98 |
1 | 3023 (48.96) | 759 (48.01) | .98 |
2 | 2208 (48.67) | 526 (45.66) | .95 |
3 | 1593 (48.96) | 409 (47.61) | .98 |
4 | 1173 (49.58) | 307 (47.97) | .97 |
Race (white), n (%) | | | |
0 | 5776 (67.76) | 1445 (66.25) | .97 |
1 | 4145 (67.14) | 1040 (65.78) | .97 |
2 | 2975 (65.57) | 740 (64.24) | .97 |
3 | 2123 (65.24) | 548 (63.79) | .95 |
4 | 1541 (65.13) | 400 (62.50) | .89 |
aDM: diabetes mellitus.
b
cThe two-sample test is not applicable for the corresponding comparison.
Overall, the prediction results in
Performance comparisons among temporal approaches over landmark time. Area under receiver operating curve (AUROC) and area under the precision-recall curve (AUPRC) are reported first. For fair comparisons, sensitivity, specificity, positive predictive value, and negative predictive value are calculated by treating testing examples with predicted probability of outcome in the top 40th percentile as positive cases. The 95% bootstrap confidence intervals are reported for each metric at each landmark time (ie, full year since diabetes mellitus [DM] onset). The bootstrap confidence intervals are based on 30 bootstrapped samples, using the 2.5th, 50th, and 97.5th percentiles of each metric.
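A percentile bootstrap of this kind can be sketched as below. The resampling unit (testing examples drawn with replacement), the linear-interpolation percentile, and the rule of redrawing any resample that contains only one outcome class are assumptions made for the example; the helper names and toy data are illustrative.

```python
import random

def auroc(y_true, y_prob):
    """Rank-based AUROC, used here only as an example metric."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def bootstrap_ci(y_true, y_prob, metric, n_boot=30, seed=0):
    """Return the (2.5th, 50th, 97.5th) percentiles of the metric over resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when only one class is drawn
        stats.append(metric(ys, [y_prob[i] for i in idx]))
    stats.sort()
    return percentile(stats, 2.5), percentile(stats, 50), percentile(stats, 97.5)

# Toy data (not from the study): perfectly separated scores
y = [0, 0, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ci_low, ci_mid, ci_high = bootstrap_ci(y, p, auroc)
```

With only 30 resamples, the 2.5th and 97.5th percentiles are determined almost entirely by the most extreme draws, so intervals built this way are coarse and a larger number of resamples would give more stable bounds.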
Calibration comparisons among temporal approaches over landmark time. Regions of calibration across the range of predicted probabilities, scaled by the proportion of observations in each region and shaded by the magnitude of the within-region observed-to-expected ratio (O:E), with green suggesting underprediction (ie, O:E significantly less than 1) and red suggesting overprediction (ie, O:E significantly larger than 1). Pearson correlation coefficients between predicted and actual values over landmark times for each temporal model are included in the table below (the closer the coefficient is to 1, the stronger the linear relationship between predicted and actual values). DM: diabetes mellitus.
To closely examine the change in predictions over time, we extracted a subset of 111 testing cases that were eligible at all five landmark times (ie, whose outcome sequence was either [0,0,0,0,0] or [0,0,0,0,1]) and plotted their predicted probability percentiles over the years (
A visualization of predicted diabetic kidney disease (DKD) risk over landmark time. Risk percentiles (ie, normalized risk scores) are plotted against landmark time for a sample of patients. Each red line represents a patient who eventually progressed to DKD, whereas each green line represents a patient who did not. DM: diabetes mellitus.
The study results suggested that exploiting historical temporal EHR data in predictive models would significantly improve prediction performance, especially with our proposed Landmark-Boosting model. As demonstrated in
Our proposed temporal model will benefit patients with longitudinal data, and the longer we follow up, the better the model can predict the next-year DKD risk by self-adjustment with respect to both the individual’s medical history and population shift over time. The study has three important implications. First, our investigation confirmed that temporal EHR and billing data carry critical information depicting the progression of the patient’s condition, and it is important to choose the appropriate method for incorporating longitudinal data to promote the
Our model can continually calculate kidney disease risk for patients with diabetes with automatic collection of new EHR data and improve prediction over time. The ability to precisely stratify patients with diabetes by their renal complication risk in the coming year would merit a variety of potential intervention designs: (1)
There are several limitations to our work. Disease diagnosis sequence is not necessarily the same as the disease manifestation sequence, which may lead to the underestimation of false-negative rates for DKD in this study. For example, our exclusion criteria may have excluded patients with DKD who visited our hospital for their kidney disease but have not had their diabetes-related information recorded in our EHR yet. In addition, the current design of our model is not robust against population drift because of changes in practice over time or differences in clinical vocabulary and workflow implemented across institutions. To further investigate the generalizability of our model, it is necessary to perform external validations and adequate recalibration based on patients from different sites as well as over calendar years to capture the general population shift and practice change.
Although not the focus of this paper, we further examined the factors that potentially contributed to the superiority of the Landmark-Boosting model. In
This study addressed the problem of underutilization of temporal information in EHR-based predictive models. We proposed a new approach in leveraging the temporal dynamics in EHR to improve DKD prediction and validated it against three state-of-the-art models using the idea of
Variable importance ranking across models and over time.
ACR: albumin-to-creatinine ratio
AUPRC: area under precision recall curve
AUROC: area under receiver operating curve
CKD: chronic kidney disease
DKD: diabetic kidney disease
DM: diabetes mellitus
eGFR: estimated glomerular filtration rate
EHR: electronic health record
ESRD: end-stage renal disease
GBM: gradient boosting machine
GFR: glomerular filtration rate
HbA1c: glycated hemoglobin
HERON: Healthcare Enterprise Repository for Ontological Narration
PPV: positive predictive value
YH is supported by the Major Research Plan of the National Natural Science Foundation of China (Key Program, grant number 91746204) and a grant award from the Science and Technology Department in Guangdong Province (Major Projects of Advanced and Key Techniques Innovation, grant number 2017B030308008). The dataset used for the analysis described in this study was obtained from the University of Kansas Medical Center’s HERON clinical data repository, which is supported by institutional funding and by the University of Kansas Medical Center Clinical and Translational Science Award grant UL1TR002366 from the National Center for Advancing Translational Sciences.
None declared.