This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Sepsis is a life-threatening condition that can rapidly lead to organ damage and death. Existing risk scores predict outcomes for patients who have already become acutely ill.
We aimed to develop a model for identifying patients at risk of getting sepsis within 2 years in order to support the reduction of sepsis morbidity and mortality.
Machine learning was applied to 2,683,049 electronic health records (EHRs) with over 64 million encounters across five states to develop models for predicting a patient’s risk of getting sepsis within 2 years. Features were selected to be easily obtainable from a patient’s chart in real time during ambulatory encounters.
The models showed consistent prediction scores, with the highest area under the receiver operating characteristic curve of 0.82 and a positive likelihood ratio of 2.9 achieved with gradient boosting on all features combined. Predictive features included age, sex, ethnicity, average ambulatory heart rate, standard deviation of BMI, and the number of prior medical conditions and procedures. The findings identified both known and potential new risk factors for long-term sepsis. Model variations also illustrated trade-offs between incrementally higher accuracy, implementability, and interpretability.
Accurate, implementable models were developed to predict the 2-year risk of sepsis, using EHR data that are easy to obtain from ambulatory encounters. These results help advance the understanding of sepsis and provide a foundation for future trials of risk-informed preventive care.
Sepsis is a life-threatening condition characterized by a systemic immunological response to infection. Each year, more than 1.7 million adults in the United States develop sepsis, and nearly 16% of them die [
Better risk models are needed to support community-acquired sepsis prevention. In 2016, Wang et al were the first to develop a risk score for long-term sepsis [
Given the increased adoption of electronic health records (EHRs) in ambulatory care [
In this study, we developed EHR-based models using supervised machine learning methods to predict the long-term risk of sepsis, investigating both time-invariant and temporal synopsis features. For each model, we reported results for both performance and feature importance, and discussed trade-offs between accuracy, interpretability, implementability, and biomedical relevance. This research investigated the potential to predict long-term sepsis risk in ways that can inform clinical decisions and lead to a better understanding of the disease.
Providence St. Joseph Health (PSJH) is a community health system that includes over 51 hospitals and 1085 clinics. This retrospective study used clinical data from PSJH EHRs for patients who presented for health care at Providence, Swedish, or Kadlec sites in Alaska, California, Montana, Oregon, and Washington. Research was conducted within a Health Insurance Portability and Accountability Act (HIPAA)-secure data platform, after date shifting had been applied to reduce the risk of reidentification. Dates were shifted using a randomly selected offset per patient of up to ±365 days. All time windows below were defined on postshifted dates. Procedures were approved by the Institutional Review Board (IRB) at PSJH (IRB Study Number STUDY2019000389). Records were included for patients who presented for health care at least one time between 2017 and 2019. Our prediction model used records from patients over 18 years of age during a 3-year observation window starting in 2014 to predict sepsis in a 2-year window, starting in 2017. Patient age was calculated for the prediction window start date. Patients with no valid birth date or no encounters prior to 2014 were excluded. Our final study cohort consisted of 2,683,049 patients, including 1,558,851 (58.1%) women and 1,124,198 (41.9%) men, and the median age was 51.36 years. Over 64,000,000 encounters were collected from the cohort patients for feature extraction.
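The date shifting and time windows described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the exact window boundary dates, the seeding scheme, and the assumption that birth dates are shifted by the same offset as encounters are all hypothetical, and only the ±365-day range and the 2014 and 2017 window starts come from the text.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 365  # stated in the text: offset of up to +/-365 days

# Assumed boundaries: the text gives only the starting years and window lengths.
OBS_START = date(2014, 1, 1)   # start of the 3-year observation window
PRED_START = date(2017, 1, 1)  # start of the 2-year prediction window
PRED_END = date(2019, 1, 1)    # exclusive upper bound of the prediction window


def patient_offset(patient_id: str, seed: int = 0) -> timedelta:
    """Return one fixed offset per patient so all of their dates move together."""
    rng = random.Random(f"{seed}:{patient_id}")  # hypothetical seeding scheme
    return timedelta(days=rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS))


def window_of(shifted: date) -> str:
    """Classify an already shifted date into a study window."""
    if OBS_START <= shifted < PRED_START:
        return "observation"
    if PRED_START <= shifted < PRED_END:
        return "prediction"
    return "outside"


def age_at_prediction_start(birth_date: date) -> float:
    """Age in years at the start of the prediction window."""
    return (PRED_START - birth_date).days / 365.25


offset = patient_offset("patient-001")
encounter = date(2015, 6, 1) + offset  # shifted encounter date
print(offset.days, window_of(encounter))
```

Because the offset is constant within a patient, intervals between that patient's events are preserved, which is what allows windows to be defined on shifted dates.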
Features represent information about the data used as model inputs, and the label is the outcome that the model is trained to predict. In this study, we selected features that can be easily obtained from EHRs, including previously reported long-term risk factors for sepsis [
The following features were extracted from the observation window: sex, age, ethnicity, race, height, weight, BMI, ambulatory vital signs, history of medical conditions, hospital length of stay, encounters, problem list entries, medical history entries, medication orders, and procedures. Medical conditions were considered present if the SNOMED CT parent concept or any of its descendant concepts were found in the problem list during the observation window. The sepsis feature was included to investigate whether having a history of sepsis is a risk factor for developing sepsis in the future. Ratio features with repeated observations (eg, BMI, vital signs, and hospital length of stay) were transformed through statistical aggregation (minimum, maximum, mean, and standard deviation). All features are defined in
Definitions of features used for models in the study for the observation window.
Category | Definition
Basic |
Sex | Male (1), female (0), missing (−1)
Age | Age calculated at the start of the prediction window
Race | Native Hawaiian/Pacific Islander, American Indian/Alaska Native, Asian, Black/African American (1); White (0); other/missing (−1)
Ethnicity | Hispanic/Latino (1), not Hispanic/Latino (0), missing (−1)
Height | Last observed height
Weight | Last observed weight
Std_BMI | Standard deviation of BMI
Vital signs |
BP_sys | Average and standard deviation of systolic blood pressure
BP_dia | Average and standard deviation of diastolic blood pressure
BT | Average and standard deviation of body temperature
HR | Average and standard deviation of heart rate
RR | Average and standard deviation of respiratory rate
Medical history |
Sepsis | Sepsis (SCTIDa 91302008)
Pneumonia | Pneumonia (SCTID 233604007)
Bacterial infection | Bacterial infectious disease (SCTID 87628006)
Fungal infection | Mycosis (SCTID 3218000)
Protein-energy malnutrition | Deficiency of macronutrients (SCTID 238107002)
Cancer | Malignant neoplastic disease (SCTID 363346000)
COPDb | Chronic obstructive lung disease (SCTID 13645005)
Diabetes | Diabetes mellitus (SCTID 73211009)
Chronic kidney disease | Chronic kidney disease (SCTID 709044004)
Hypertension | Hypertensive disorder, systemic arterial (SCTID 38341003)
Deep vein thrombosis | Deep venous thrombosis (SCTID 128053003)
Arteriosclerosis | Arteriosclerotic vascular disease (SCTID 72092001)
Peripheral artery disease | Peripheral arterial occlusive disease (SCTID 399957001)
Coronary artery disease | Coronary arteriosclerosis (SCTID 53741008)
Heart attack | Myocardial infarction (SCTID 22298006)
Atrial fibrillation | Atrial fibrillation (SCTID 49436004)
Stroke | Cerebrovascular accident (SCTID 230690007)
Heart failure | Heart failure (SCTID 84114007)
Health care delivery data |
n_encounter | Total count of clinical encounters
n_hospitalization | Total count of hospitalizations
LOS | Average, minimum, maximum, and standard deviation of length of hospital stay
n_problem | Total count of problem list entries
u_problem | Number of unique problem list entries
n_medical_hx | Total count of medical history entries
u_medical_hx | Number of unique medical history entries
n_medication | Total count of prescription medication orders
u_medication | Number of unique prescription medication orders
n_procedure | Total count of ordered medical procedures
u_procedure | Number of unique ordered medical procedures
aSCTID: Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) identifier.
bCOPD: chronic obstructive pulmonary disease.
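Two of the feature-extraction steps described above lend themselves to a short sketch: collapsing repeated measurements into summary statistics, and treating a condition as present when its SNOMED CT parent concept or any descendant appears in the problem list. The code below is illustrative only. The function names, the toy hierarchy, and the choice of sample (rather than population) standard deviation are assumptions, and the child identifiers are placeholders rather than real SNOMED CT codes.

```python
from statistics import mean, stdev


def summarize(values):
    """Collapse repeated observations into min, max, mean, and standard deviation."""
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
        # Sample standard deviation is an assumption; a single reading gets 0.0.
        "std": stdev(values) if len(values) > 1 else 0.0,
    }


def has_condition(problem_codes, parent, children_of):
    """True if the parent concept or any of its descendants is in the problem list."""
    stack, seen = [parent], set()
    while stack:
        code = stack.pop()
        if code in seen:
            continue
        seen.add(code)
        stack.extend(children_of.get(code, []))
    return not seen.isdisjoint(problem_codes)


# Toy inputs: heart rate readings per patient within the observation window.
heart_rates = {"patient_a": [72, 80, 76], "patient_b": [64]}
hr_features = {pid: summarize(v) for pid, v in heart_rates.items()}

# Toy is-a hierarchy rooted at the sepsis concept from the table (91302008).
# The child identifiers are placeholders, not real SNOMED CT codes.
toy_children = {"91302008": ["toy-child-1"], "toy-child-1": ["toy-grandchild-1"]}
print(hr_features["patient_a"])
print(has_condition({"toy-grandchild-1"}, "91302008", toy_children))
```

Walking the hierarchy once per parent concept and caching the resulting descendant set would avoid repeating the traversal for every patient.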
Data preprocessing and cleaning were conducted as follows. Missing data in categorical features (sex, race, and ethnicity) were assigned to be −1. Missing data in height, weight, and vital signs were imputed using the carry-forward method if previous observations were available; otherwise, median imputation was used. Outliers in height and weight were detected by calculating the modified z-score based on median absolute deviation (MAD) [
where MAD is the median absolute deviation and x̃ is the median of x.
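A minimal sketch of MAD-based outlier detection, assuming the conventional form of the modified z-score, M = 0.6745 (x − x̃) / MAD, and the commonly used cutoff of |M| > 3.5. Neither the constant nor the cutoff is stated here, so both are assumptions, as is the handling of the MAD = 0 case.

```python
from statistics import median

SCALE = 0.6745   # conventional constant for the modified z-score (assumed)
CUTOFF = 3.5     # commonly used outlier threshold (assumed)


def modified_z_scores(values):
    """Return the modified z-score of each value, based on the median and MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values equal the median; treat none as outliers.
        return [0.0 for _ in values]
    return [SCALE * (v - med) / mad for v in values]


# Toy heights in centimeters; 17 is a plausible data entry error.
heights_cm = [160, 165, 170, 172, 175, 168, 17]
scores = modified_z_scores(heights_cm)
outliers = [h for h, z in zip(heights_cm, scores) if abs(z) > CUTOFF]
print(outliers)
```

Unlike the ordinary z-score, this statistic uses the median and MAD, so a single extreme value does not inflate the scale estimate and mask itself.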
Patients diagnosed with sepsis accounted for only about 0.8% of the cohort, leading to extremely imbalanced data. To preserve a valid evaluation while addressing the class imbalance, we reserved 20% of the original data as a test set and undersampled the remaining 80% by randomly selecting as many patients from the majority class (no sepsis) as there were in the minority class (sepsis), yielding a balanced training set. The train/test split process is shown in
Training, validation, and test split for modeling of the long-term risk of sepsis.
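The split described above can be sketched as follows: hold out 20% of patients with their natural class ratio, then undersample the majority class in the remaining 80% to match the minority class. This is an illustration on toy labels. Whether the original split was stratified is not stated, so a simple random split is assumed, and the function name and seed are hypothetical.

```python
import random


def split_and_undersample(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx): balanced training set, untouched test set."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    n_test = int(len(indices) * test_fraction)
    test_idx, rest = indices[:n_test], indices[n_test:]

    positives = [i for i in rest if labels[i] == 1]
    negatives = [i for i in rest if labels[i] == 0]
    # Keep every positive and an equal-sized random sample of negatives.
    train_idx = positives + rng.sample(negatives, len(positives))
    rng.shuffle(train_idx)
    return train_idx, test_idx


# Toy cohort: 10,000 patients with a 0.8% positive rate, as in the text.
labels = [1] * 80 + [0] * 9920
train_idx, test_idx = split_and_undersample(labels)
print(len(train_idx), len(test_idx))
```

Because only the training portion is rebalanced, the test set keeps the real prevalence, which is why precision on the test set is far lower than in cross-validation on the balanced data.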
All classification models were built using scikit-learn [
Ten-fold cross-validation results on the training set.
Model and classifier | Precision | Sensitivity | Specificity | AUROCa | Ten-fold error (%)
Model 1 (basic) |
GBb | 0.6727 | 0.6725 | 0.6725 | 0.7349 | 0.29%
SVMc | 0.6607 | 0.6606 | 0.6606 | 0.7167 | 0.27%
LRd | 0.6569 | 0.6565 | 0.6565 | 0.7134 | 0.29%
Model 2 (basic + VSe) |
GB | 0.6947 | 0.6946 | 0.6946 | 0.7595 | 0.28%
SVM | 0.6812 | 0.6811 | 0.6811 | 0.7425 | 0.29%
LR | 0.6776 | 0.6775 | 0.6775 | 0.7399 | 0.26%
Model 3 (basic + VS + MHXf) |
GB | 0.7008 | 0.7006 | 0.7006 | 0.7671 | 0.20%
SVM | 0.6897 | 0.6868 | 0.6868 | 0.7502 | 0.17%
LR | 0.6893 | 0.6891 | 0.6891 | 0.7523 | 0.18%
Model 4 (basic + VS + MHX + HCDg) |
GB | 0.7483 | 0.7481 | 0.7481 | 0.8216 | 0.27%
SVM | 0.7191 | 0.7169 | 0.7169 | 0.7910 | 0.26%
LR | 0.7185 | 0.7175 | 0.7175 | 0.7835 | 0.19%
aAUROC: area under the receiver operating characteristic curve.
bGB: gradient boosting.
cSVM: support vector machine.
dLR: logistic regression.
eVS: vital signs.
fMHX: medical history.
gHCD: health care delivery data.
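A ten-fold cross-validation comparison of the three classifier families in the table can be sketched with scikit-learn as follows. The data are synthetic, and the default hyperparameters, feature scaling, stratified folds, and random seeds are assumptions made for illustration rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, balanced stand-in for the undersampled training set.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

models = {
    "GB": GradientBoostingClassifier(random_state=0),
    # Scaling before SVM and LR is an assumption, not stated in the text.
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auroc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, score in auroc.items():
    print(f"{name}: mean AUROC = {score:.4f}")
```

Reusing one `StratifiedKFold` object across models means every classifier is scored on the same folds, so differences in mean AUROC are not an artifact of different partitions.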
Prediction results and 95% confidence intervals for the test set using the trained gradient boosting model.
Model | Precision, value (95% CI) | Sensitivity, value (95% CI) | Specificity, value (95% CI) | LR+a, value (95% CI) | LR−b, value (95% CI) | DORc |
Model 1 (basic) | 0.0165 (0.0159-0.0171) | 0.6552 (0.6407-0.6694) | 0.6900 (0.6887-0.6912) | 2.1135 (2.0670-2.1611) | 0.4997 (0.4793-0.5209) | 4 |
Model 2 (basic + VSd) | 0.0177 (0.0171-0.0184) | 0.6862 (0.6721-0.7001) | 0.6980 (0.6968-0.6993) | 2.2724 (2.2256-2.3202) | 0.4495 (0.4299-0.4701) | 5 |
Model 3 (basic + VS + MHXe) | 0.0184 (0.0177-0.0190) | 0.6874 (0.6733-0.7012) | 0.7084 (0.7071-0.7096) | 2.3570 (2.3086-2.4065) | 0.4413 (0.4220-0.4615) | 5 |
Model 4 (basic + VS + MHX + HCDf) | 0.0224 (0.0217-0.0231) | 0.7653 (0.7523-0.7779) | 0.7352 (0.7340-0.7363) | 2.8897 (2.8401-2.9401) | 0.3192 (0.3023-0.3371) | 9 |
aLR+: positive likelihood ratio.
bLR−: negative likelihood ratio.
cDOR: diagnostic odds ratio.
dVS: vital signs.
eMHX: medical history.
fHCD: health care delivery data.
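The likelihood ratios and diagnostic odds ratio in the table follow directly from sensitivity and specificity, and the low precision follows from the low prevalence. The sketch below recomputes them for model 4 from the reported sensitivity (0.7653) and specificity (0.7352). The prevalence of 0.008 is taken from the approximate cohort figure of 0.8%, so the resulting precision is an approximation, and the function name is hypothetical.

```python
def diagnostic_summary(sensitivity, specificity, prevalence):
    """Derive LR+, LR-, DOR, and precision (PPV) from basic test characteristics."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    return {"lr_pos": lr_pos, "lr_neg": lr_neg, "dor": dor, "ppv": ppv}


# Model 4 values from the table; prevalence is the approximate cohort rate.
model4 = diagnostic_summary(sensitivity=0.7653, specificity=0.7352, prevalence=0.008)
for name, value in model4.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the likelihood ratios agree with the table to within rounding, and the precision comes out near 2%, which shows that the low precision reflects the rarity of sepsis in the unbalanced test set rather than poor discrimination.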
To ensure the stability and reliability of the model, SHAP and permutation testing methods were implemented on the GB model. These methods improve the interpretability of the black box model and give a reasonable explanation for the prediction of each outcome. The results for the SHAP algorithm are shown in
The Shapley Additive exPlanations (SHAP) algorithm results for long-term sepsis risk in model 1. (A) The influence of higher and lower values of the feature on the patient's outcome. The left side of this graph represents reduced risk of developing sepsis, and the right side of the graph represents increased risk of developing sepsis. Red dots represent higher values of the feature, and blue dots represent lower values of the feature. Nominal classes are binary (0,1). (B) The ranking of feature importance indicated by SHAP.
The Shapley Additive exPlanations (SHAP) algorithm results for long-term sepsis risk in model 2. (A) The influence of higher and lower values of the feature on the patient's outcome. The left side of this graph represents reduced risk of developing sepsis, and the right side of the graph represents increased risk of developing sepsis. Red dots represent higher values of the feature, and blue dots represent lower values of the feature. Nominal classes are binary (0,1). (B) The ranking of feature importance indicated by SHAP. BP: blood pressure; BT: body temperature; HR: heart rate; RR: respiratory rate.
The Shapley Additive exPlanations (SHAP) algorithm results for long-term sepsis risk in model 3. (A) The influence of higher and lower values of the feature on the patient's outcome. The left side of this graph represents reduced risk of developing sepsis, and the right side of the graph represents increased risk of developing sepsis. Red dots represent higher values of the feature, and blue dots represent lower values of the feature. Nominal classes are binary (0,1). (B) The ranking of feature importance indicated by SHAP. BP: blood pressure; BT: body temperature; COPD: chronic obstructive pulmonary disease; HR: heart rate; RR: respiratory rate.
The Shapley Additive exPlanations (SHAP) algorithm results for long-term sepsis risk in model 4. (A) The influence of higher and lower values of the feature on the patient's outcome. The left side of this graph represents reduced risk of developing sepsis, and the right side of the graph represents increased risk of developing sepsis. Red dots represent higher values of the feature, and blue dots represent lower values of the feature. Nominal classes are binary (0,1). (B) The ranking of feature importance indicated by SHAP. BP: blood pressure; BT: body temperature; COPD: chronic obstructive pulmonary disease; HR: heart rate; LOS: length of hospital stay; RR: respiratory rate.
The L1-regularized logistic regression (L1-LR) algorithm results (A) and permutation testing results for long-term sepsis risk in model 4 (B). BP: blood pressure; BT: body temperature; COPD: chronic obstructive pulmonary disease; HR: heart rate; LOS: length of hospital stay; RR: respiratory rate.
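Permutation testing of feature importance, as referenced in the caption above, can be sketched with scikit-learn's `permutation_importance`. The example uses synthetic data in which only the first three columns carry signal. The feature names, scoring metric, number of repeats, and model settings are assumptions for illustration, and SHAP values would require the separate `shap` package, which is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; with shuffle=False the first 3 columns are informative.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and record the drop in held-out AUROC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.4f}")
```

Permutation importance measures how much a fitted model relies on each feature, so correlated features can share or dilute importance; this is consistent with treating such rankings as exploratory.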
In this study, we constructed four interpretable, implementable EHR-based models to predict the 2-year risk of sepsis in adults. Each model performed well, considering the complexity of the features included. As expected, model 4 with all 49 features outperformed the others, with an AUROC of 0.8216 achieved by the GB algorithm in ten-fold cross-validation on the training set. Because sepsis outcomes had a low prevalence in the 20% test set, precision was low in all models. However, the positive likelihood ratio of 2.8897 and negative likelihood ratio of 0.3192 achieved by model 4 showed that the model can identify patients at higher risk of sepsis. The dominant features in this model, accounting for more than half of the feature importance, were the numbers of unique and total medical history entries (u_medical_hx and n_medical_hx), and age. Medical history features suggest an increased burden of underlying health conditions, and aging is the most substantial risk factor for multimorbidity [
Age, ethnicity, sex, average heart rate (avg_HR), and standard deviation of BMI (std_BMI) were the most important features in models 2 and 3. In addition to increasing the risk of multimorbidity, age is a known independent risk factor for sepsis incidence, severity, and outcomes [
Although the highest performance was achieved with the health care delivery data features set, it has limited usefulness for discovering potential risk factors given its reliance on aggregated features, such as the number of medical history entries. Inclusion of these aggregated features weakens other predictors that are potentially more biomedically informative, including medical conditions and biomarkers, such as vital signs. The second-best performing model (model 3) identified a subset of biomarkers as strong predictors, including the standard deviation of BMI and average resting heart rate.
In models 3 and 4 that incorporated medical history, the conditions with greater importance for long-term sepsis risk were history of sepsis, heart failure, chronic kidney disease, pneumonia, COPD, and diabetes. In contrast, the most impactful chronic diseases in the REGARDS 10-year prediction score were chronic lung disease, followed by diabetes and peripheral artery disease [
The primary goal of this study was to investigate whether readily available EHR data can be used in real time during ambulatory visits to predict the long-term risk of developing sepsis. Such predictions could also be useful for assessing population health. Interpretability was a secondary concern, and the feature importance estimates discussed above should be taken as exploratory. Relationships identified in the models reflected shared information content, but not necessarily biomedical relevance or causality. However, the feature importance analyses suggested new insights on potential risk factors for sepsis that merit further investigation.
The studied population may have sample bias toward patients with continuous care within one health care system. There are also many common issues with structured EHR data that hamper the extraction of accurate information, including missing data, erroneous data, differences in EHR conventions among providers, and changes in how data are stored in EHRs over time [
Using EHR diagnostic codes to identify sepsis patients also has limitations. First, it may miss cases where patients had sepsis at a different health care system. Second, because there is no confirmatory diagnostic test for sepsis, this model included patients who were treated empirically for sepsis but might not have had it. Third, variations in sepsis diagnosis, documentation, and coding practices could lead to missing sepsis labels [
Future models can take advantage of the Adult Sepsis Event surveillance definition optimized for EHRs, which was recently released by the CDC [
Strategies for long-term sepsis risk prediction are needed to advance the understanding of the disease and guide efforts for prevention. We used retrospective EHR data from 2,683,049 adults across five US states to develop models for predicting adult patients’ long-term risk of sepsis. Our models achieved a high AUROC and suggested new insights into potential long-term risk factors, including changes in BMI and a higher mean heart rate in ambulatory settings. These models could be implemented at a low cost, requiring only information that is easy to obtain from EHRs in real time. Ambulatory patients at the highest risk for sepsis could benefit from personalized preventative approaches, including increased emphasis on immunization, and education on reducing the risk of infection and recognizing early symptoms of sepsis. This implementable model provides a path toward clinical trials of risk-informed interventions for long-term sepsis prevention.
AUROC: area under the receiver operating characteristic curve
COPD: chronic obstructive pulmonary disease
EHR: electronic health record
GB: gradient boosting
L1-LR: L1-regularized logistic regression
LR: logistic regression
MAD: median absolute deviation
PSJH: Providence St. Joseph Health
SCTID: Systematized Nomenclature of Medicine-Clinical Terms identifier
SHAP: Shapley Additive exPlanations
SNOMED CT: Systematized Nomenclature of Medicine-Clinical Terms
SVM: support vector machine
This work was funded in part by the Washington Research Foundation. We thank Ryan T Roper and Venkata R Duvvuri for their design and implementation assistance for biomedical concept extraction. We are grateful to Providence St. Joseph Health for sharing their data, data engineering expertise, and computational resources. We appreciate the technical assistance of Mark Premo, Jennifer Jones, and Andrey Dubovoy. We would also like to acknowledge SNOMED International for developing and maintaining SNOMED CT.
None declared.