This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
COVID-19 has led to an unprecedented strain on health care facilities across the United States. Accurately identifying patients at an increased risk of deterioration may help hospitals manage their resources while improving the quality of patient care. Here, we present the results of an analytical model, Predicting Intensive Care Transfers and Other Unforeseen Events (PICTURE), to identify patients at high risk for imminent intensive care unit transfer, respiratory failure, or death, with the intention to improve the prediction of deterioration due to COVID-19.
This study aims to validate the PICTURE model’s ability to predict unexpected deterioration in general ward and COVID-19 patients, and to compare its performance with the Epic Deterioration Index (EDI), an existing model that has recently been assessed for use in patients with COVID-19.
The PICTURE model was trained and validated on a cohort of hospitalized non–COVID-19 patients using electronic health record data from 2014 to 2018. It was then applied to two holdout test sets: non–COVID-19 patients from 2019 and patients testing positive for COVID-19 in 2020. PICTURE results were aligned to EDI and NEWS scores for head-to-head comparison via area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve. We compared the models’ ability to predict an adverse event (defined as intensive care unit transfer, mechanical ventilation use, or death). Shapley values were used to provide explanations for PICTURE predictions.
In non–COVID-19 general ward patients, PICTURE achieved an AUROC of 0.819 (95% CI 0.805-0.834) per observation, compared to the EDI’s AUROC of 0.763 (95% CI 0.746-0.781; n=21,740; P<.001). In patients testing positive for COVID-19, PICTURE achieved an AUROC of 0.849 (95% CI 0.820-0.878) per observation, compared to the EDI’s AUROC of 0.803 (95% CI 0.772-0.838; n=607; P<.001).
The PICTURE model is more accurate in predicting adverse patient outcomes for both general ward patients and COVID-19 positive patients in our cohorts compared to the EDI. The ability to consistently anticipate these events may be especially valuable when considering potential incipient waves of COVID-19 infections. The generalizability of the model will require testing in other health care systems for validation.
The effect of COVID-19 on the US health care system is difficult to overstate. It has led to unprecedented clinical strain in hospitals nationwide, prompting the expansion of intensive care unit (ICU) capacity and the opening of lower-acuity field hospitals to accommodate the increased patient load. A predictive early warning system capable of identifying patients at increased risk of deterioration could assist hospitals in maintaining a high level of patient care while distributing their thinly stretched resources more efficiently. However, a recent review has illustrated that high-quality, validated models of deterioration in patients with COVID-19 are lacking [
Early warning systems have been and continue to be applied in hospital settings prior to the COVID-19 pandemic to predict patient deterioration events before they occur, giving health care providers time to intervene [
One model that has been assessed in patients with COVID-19 is the Epic Deterioration Index (EDI; Epic Systems Inc) [
In this study, we have applied our previously described model, Predicting Intensive Care Transfers and Other Unforeseen Events (PICTURE), to a cohort of patients testing positive for COVID-19 [
The study protocol was approved by the University of Michigan’s Institutional Review Board (HUM00092309). Electronic health record (EHR) data were collected from a large tertiary academic medical system (Michigan Medicine) from January 1, 2014, to November 11, 2020. The first 5 years of data (2014-2018; n=131,546 encounters) were used to train and validate the model, while 2019 data were reserved as a holdout test set (n=33,472 encounters). Training, validation, and test populations were segmented to prevent overlap of multiple hospital encounters between sets. Patients were eligible for these three cohorts if they were 18 years or older and hospitalized (inpatient or observation status) in a general ward. We excluded patients who were discharged to hospice and those whose ICU transfer originated from a floor other than a general ward (eg, an operating or interventional radiology unit) to exclude planned ICU transfers. We also excluded patients with a left ventricular assist device to avoid artifactual blood pressure readings.
To be included in the COVID-19 cohort (n=637 encounters), patients must have been admitted to the hospital with a COVID-19 diagnosis and have received a positive COVID-19 test from Michigan Medicine during their encounter. These patients were then filtered using the same criteria used in the 2019 test set, with the exception of the hospice distinction. Only discharged patients or those who already experienced an adverse event were included.
Study population.a
| Characteristic | Training 2014-2018 | Validation 2014-2018 | Testing 2019 (non–COVID-19) | Testing 2020 (COVID-19) | P value |
|---|---|---|---|---|---|
| Encounters, n | 105,457 | 26,089 | 33,472 | 637 | N/Ac |
| Patients, n | 62,392 | 15,597 | 23,368 | 600 | N/A |
| Age (years), median (IQR) | 60.2 (46.5-70.8) | 60.4 (46.7-71.2) | 61.0 (47.0-71.5) | 61.8 (49.6-72.0) | .02 |
| Race, n (%) | | | | | |
| White | 86,522 (82.0) | 21,647 (83.0) | 27,036 (80.8) | 329 (51.6) | <.001 |
| Black | 12,344 (11.7) | 2861 (11.0) | 4214 (12.6) | 220 (34.5) | <.001 |
| Asian | 2145 (2.0) | 504 (1.9) | 686 (2.0) | 29 (4.6) | <.001 |
| Otherd | 4446 (4.2) | 1077 (4.1) | 1536 (4.6) | 59 (9.3) | <.001 |
| Female sex, n (%) | 53,225 (50.5) | 13,048 (50.0) | 16,760 (50.1) | 282 (44.3) | .003 |
| Event ratee, n (%) | 4236 (4.0) | 1007 (3.9) | 1337 (4.0) | 155 (24.3) | <.001 |
| Death | 920 (0.9) | 232 (0.9) | 277 (0.8) | 16 (2.5) | <.001 |
| ICUf transfer | 2979 (2.8) | 717 (2.7) | 1000 (3.0) | 139 (21.8) | <.001 |
| Mechanical ventilation | 1330 (1.3) | 299 (1.1) | 352 (1.1) | 49 (7.7) | <.001 |
| Cardiac arrestg | 143 (0.1) | 37 (0.1) | 56 (0.2) | N/A | N/A |
aPatients were subset into one of four study cohorts: a training set for learning model parameters, a validation set for model structure and hyperparameter tuning, a holdout test set for evaluation, and a final test set composed of patients testing positive for COVID-19. Values are based on individual hospital encounters.
cN/A: not applicable.
dOther races comprising less than 1% of the population each were incorporated under the “Other” heading.
eThe event rate represents a composite outcome indicating that one of the following events occurred: death, ICU transfer, mechanical ventilation, or cardiac arrest. The individual frequencies of these adverse events are also reported and represent the number of cases where each particular outcome was the first to occur. Please see the section Outcomes for the procedure used to calculate these targets.
fICU: intensive care unit.
gCardiac arrest was not used as a target in the COVID-19 positive population, as the manually adjudicated data is not yet available at the time of writing.
The variables used as predictors were collected from the EHR and broadly included vital signs and physiologic observations, laboratory and metabolic values, and demographics. We selected specific features based on previous analysis [
The primary outcomes in the training, validation, and non–COVID-19 test cohorts (data collected from 2014 through 2019) were death, cardiac arrest (as defined by the American Heart Association’s
To verify the accuracy of our automatically generated labels, a clinician (author MRM) manually reviewed the patient charts for 20 encounters to determine whether the patient was infected with COVID-19, whether the recorded event truly took place, and whether the event was unplanned. To do so, we randomly sampled two encounters (one positive, the other negative if available) from each patient service with eight or more encounters to ensure the accuracy of the labels across all services. The result was a sample of 20 encounters, 11 of which were positive. The recorded event of interest for each encounter was reviewed by the clinician to determine whether the event took place and whether it was emergent (not planned). For the patients that were labeled as negative, the clinician reviewed the entire patient chart to ensure that no adverse events occurred during the encounter. The results indicate that all 20 patients were infected with COVID-19, all the labels and the event times were accurate, and all the events were unplanned. This provides evidence that the automatically generated outcomes accurately identify unplanned adverse events.
To train and evaluate the PICTURE model, we partitioned our data into four folds: a training and validation set using data from 2014 to 2018, a test set using 2019 data, and a fourth set consisting of data from patients who are COVID-19 positive. We partitioned the sets such that multiple hospital encounters from the same individual were restricted to one cohort, preventing patient-level overlap between cohorts. Encounters with an admission date from January 1, 2014, to December 31, 2018, were used for training and validation and hyperparameter tuning (n=131,546 encounters). These patients were further divided between training and validation sets using an 80%/20% split. Those patients with an admission date between January 1 and December 31, 2019, were reserved as a holdout test set (n=33,472 encounters). Lastly, patients testing positive for COVID-19 from March 1 to September 11, 2020, were reserved as a separate set (n=637 encounters).
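The patient-level partitioning described above can be sketched in pandas. This is a minimal illustration, not the study code: the column names (`patient_id`, `admit_date`) and cohort labels are hypothetical, and the real pipeline additionally splits the 2014-2018 data 80%/20% into training and validation sets.

```python
import pandas as pd

def assign_cohorts(encounters: pd.DataFrame) -> pd.DataFrame:
    """Assign each encounter to a cohort by admission year, then force all
    of a patient's encounters into a single cohort to prevent overlap."""
    df = encounters.copy()
    df["cohort"] = "train_val"
    df.loc[df["admit_date"].dt.year >= 2019, "cohort"] = "test_2019"
    # A patient whose encounters straddle the boundary keeps only the later
    # encounters (mirroring the figure's "keep the later encounters" rule).
    straddlers = df.groupby("patient_id")["cohort"].transform("nunique") > 1
    return df[~(straddlers & (df["cohort"] == "train_val"))]
```

Dropping the earlier encounters (rather than the later ones) keeps the holdout sets intact at the cost of slightly less training data.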
PICTURE training and validation framework. The electronic health record data is split into COVID-19 and non–COVID-19 patients. Encounters with an admission date between January 1, 2014, and December 31, 2018, were set aside for training (80%) and validation (20%) subsets. Encounters with an admission date between January 1 and December 31, 2019, were used as a non–COVID-19 test set. Encounters from 2020 that tested positive for COVID-19 were held out as a separate test set. In the case that a given patient has multiple encounters that overlap these boundaries, only the later encounters were considered to remove patient overlap between the cohorts. EDI: Epic Deterioration Index; NEWS: National Early Warning Score; PICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events; XGBoost: extreme gradient boosting.
As the EHR stores data in a long format (with each new row corresponding to a new measurement at a new time point), it was first converted to a wide structure such that each observation represented all features at a given time point for a given patient. The training and validation sets were grouped into 8-hour windows to ensure that each encounter would have the same number of observations for a given amount of time in the hospital, avoiding emphasis on patients who receive more frequent updates while training the model, as described in Gillies et al [
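The long-to-wide conversion and 8-hour windowing might look like the following sketch, assuming a hypothetical long table with `encounter_id`, `charttime`, `feature`, and `value` columns (the paper's actual feature set and aggregation rules are not reproduced here):

```python
import pandas as pd

def to_wide_8h(long_df: pd.DataFrame) -> pd.DataFrame:
    """Pivot long EHR rows (one measurement per row) into a wide table,
    then collapse to one row per encounter per 8-hour window so every
    encounter contributes the same number of observations for a given
    length of stay."""
    wide = long_df.pivot_table(
        index=["encounter_id", "charttime"],
        columns="feature",
        values="value",
        aggfunc="last",
    ).reset_index()
    return (
        wide.groupby(["encounter_id", pd.Grouper(key="charttime", freq="8h")])
        .last()  # carry the last observed value within each window
        .reset_index()
    )
```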
The EDI is a proprietary model developed by Epic Systems Corporation. Michigan Medicine uses Epic as its electronic medical record system and has access to the EDI tool. Similar to PICTURE, it uses clinical data that are commonly available in the EHR to make predictions regarding patient deterioration. It was trained using a similar composite outcome including death, ICU transfer, and resuscitation as adverse events [
NEWS, developed by the Royal College of Physicians, is a second index used to detect patients at an increased risk of deterioration events such as cardiac arrest, ICU transfer, and death [
We first assessed the performance of the PICTURE model on all 33,472 encounters in the holdout test set comprising patients from 2019. Another early warning aggregate score, NEWS, was used for comparison in this preliminary analysis [
Since the EDI makes a prediction every 15 minutes, we simulated how the PICTURE score, calculated at irregular intervals each time a new data point arrives, would align with the EDI. This limited the available number of encounters to 21,740 in the 2019 test set and 607 encounters in the COVID-19 cohort. The PICTURE scores were merged onto EDI values by taking the most recent PICTURE prediction before each EDI prediction; this direction was chosen to give the EDI any potential advantage in the alignment procedure.
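One way to implement this alignment is a backward as-of merge, as pandas provides; the schema below (`encounter_id`, `time`, `score`) is a placeholder for illustration, not the study's actual one:

```python
import pandas as pd

def align_to_edi(picture: pd.DataFrame, edi: pd.DataFrame) -> pd.DataFrame:
    """For each 15-minute EDI prediction, attach the most recent PICTURE
    score generated at or before that time (backward as-of merge)."""
    picture = picture.rename(
        columns={"score": "picture_score", "time": "picture_time"}
    ).sort_values("picture_time")
    edi = edi.sort_values("time")
    return pd.merge_asof(
        edi,
        picture,
        left_on="time",
        right_on="picture_time",
        by="encounter_id",
        direction="backward",
    )
```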
Alignment of PICTURE predictions to EDI scores. Although the PICTURE system outputs predictions each time a new observation (eg, a new vital sign) is entered into the system, the EDI score is generated every 15 minutes. To give the EDI any potential advantage, PICTURE scores are aligned to EDI scores by selecting the most recent PICTURE score before each EDI prediction. In both cases, observations occurring within 30 minutes before the target or after it are excluded (red). For patients who did not experience an adverse event, the maximum score was calculated across the entire encounter. EDI: Epic Deterioration Index; PICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
AUROC and AUPRC were used as the primary criteria for comparison between the models. AUROC can be interpreted as the probability that two randomly chosen observations (one with a positive target, the other negative) are ranked in the correct order by the model prediction score. AUPRC describes the average positive predictive value (PPV) across the range of sensitivities. We also calculated 95% CIs for encounter-level statistics with a bootstrap method using 1000 replications to compute pivotal CIs. For observation-level statistics, block bootstrapping was used to ensure randomization between encounters and within the observations of an encounter.
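A minimal sketch of the block (encounter-level) bootstrap for an observation-level AUROC confidence interval follows. For brevity it reports a percentile interval, whereas the paper reports pivotal CIs, and the variable names are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def block_bootstrap_auroc(enc_ids, y_true, y_score, n_boot=1000, seed=0):
    """95% CI for observation-level AUROC, resampling whole encounters
    (blocks) so correlated observations within an encounter stay together."""
    rng = np.random.default_rng(seed)
    enc_ids = np.asarray(enc_ids)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    unique_encs = np.unique(enc_ids)
    idx_by_enc = {e: np.flatnonzero(enc_ids == e) for e in unique_encs}
    stats = []
    while len(stats) < n_boot:
        sample = rng.choice(unique_encs, size=len(unique_encs), replace=True)
        idx = np.concatenate([idx_by_enc[e] for e in sample])
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():  # AUROC needs both classes; redraw
            continue
        stats.append(roc_auc_score(yt, ys))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return lo, hi
```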
Despite the many benefits yielded by increasingly advanced machine learning models, their use in the medical field has lagged behind other fields. One contributing factor is their complexity, which makes the resulting predictions difficult to interpret and, in turn, makes it difficult to build clinician trust [
Neither PICTURE nor the EDI produces a calibrated score: even though their outputs range from 0 to 1 (or 0 to 100 in the case of the EDI), these values do not reflect a probability of deterioration [
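A quantile-versus-observed-risk comparison of the kind plotted later in the paper can be sketched as below; the binning choices here are illustrative, not the paper's exact procedure:

```python
import numpy as np
import pandas as pd

def observed_risk_by_quantile(scores, labels, quantiles=np.arange(0.1, 1.0, 0.1)):
    """Empirical check of (mis)calibration: bin observations at score
    quantiles, then compare each bin's mean score to its event rate."""
    df = pd.DataFrame({"score": scores, "label": labels})
    edges = np.quantile(df["score"], quantiles)
    df["bin"] = np.searchsorted(edges, df["score"])
    return df.groupby("bin").agg(
        mean_score=("score", "mean"),
        observed_risk=("label", "mean"),
        n=("label", "size"),
    )
```

For a calibrated score, `mean_score` and `observed_risk` track each other; a systematic gap between them is what motivates fixed alert thresholds instead of probability readouts.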
The ability of the PICTURE model to accurately predict the composite target was first assessed using the 33,472 encounters in the holdout test set from 2019. To provide a baseline for comparison, NEWS scores were calculated alongside each PICTURE prediction output. The observation-level and encounter-level AUROC and AUPRC are presented with 95% CIs in
Evaluation of PICTURE (performance in a non–COVID-19 cohort).
| Granularity and analytic | AUROCa (95% CIb) | P value | AUPRCd (95% CI) | P value | Event rate (%) |
|---|---|---|---|---|---|
| Observation level | | <.001 | | <.001 | 1.01 |
| PICTUREe | 0.821 (0.810-0.832) | | 0.099 (0.085-0.110) | | |
| NEWSf,g | 0.753 (0.741-0.765) | | 0.058 (0.049-0.064) | | |
| Encounter level | | <.001 | | <.001 | 3.99 |
| PICTURE | 0.846 (0.834-0.858) | | 0.326 (0.301-0.351) | | |
| NEWS | 0.782 (0.768-0.795) | | 0.185 (0.165-0.203) | | |
aAUROC: area under the receiver operating characteristic curve.
b95% CIs were calculated using a block bootstrap with 1000 replicates. In the case of the observation level, this bootstrap was blocked on the encounter level.
dAUPRC: area under the precision-recall curve.
ePICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
fNEWS: National Early Warning Score.
gNEWS is used as a baseline for comparison.
PICTURE was then compared to the EDI model on non–COVID-19 patients in the same holdout test set from 2019. Due to limitations in available EDI scores, the number of encounters was restricted to 21,740. These time-matched scores were again evaluated using AUROC and AUPRC on the observation and encounter levels (
Comparison of PICTURE and the EDI in a non–COVID-19 cohort.
| Granularity and analytic | AUROCa (95% CI) | P value | AUPRCc (95% CI) | P value | Event rate (%) |
|---|---|---|---|---|---|
| Observation level | | | | | 0.77 |
| PICTUREd | 0.819 (0.805-0.834) | vs EDIe: <.001; vs NEWSf: <.001 | 0.115 (0.096-0.130) | vs EDI: <.001; vs NEWS: <.001 | |
| EDI | 0.763 (0.746-0.781) | vs NEWS: .01 | 0.081 (0.066-0.094) | vs NEWS: <.001 | |
| NEWS | 0.745 (0.729-0.761) | N/Ag | 0.062 (0.051-0.072) | N/A | |
| Encounter level | | | | | 4.21 |
| PICTURE | 0.859 (0.846-0.873) | vs EDI: <.001; vs NEWS: <.001 | 0.368 (0.335-0.400) | vs EDI: <.001; vs NEWS: <.001 | |
| EDI | 0.803 (0.788-0.821) | vs NEWS: .15 | 0.274 (0.244-0.301) | vs NEWS: <.001 | |
| NEWS | 0.797 (0.781-0.814) | N/A | 0.229 (0.204-0.254) | N/A | |
aAUROC: area under the receiver operating characteristic curve.
cAUPRC: area under the precision-recall curve.
dPICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
eEDI: Epic Deterioration Index.
fNEWS: National Early Warning Score.
gN/A: not applicable.
Comparison of PICTURE and the EDI. Panel A: receiver operating characteristic (ROC) curves for PICTURE, EDI, and NEWS models in the non–COVID-19 cohort. PICTURE area under the curve (AUC): 0.819; EDI AUC: 0.763; NEWS AUC: 0.745. Panel B: Precision-recall (PR) curves for the two models in the non–COVID-19 cohort. PICTURE AUC: 0.115; EDI AUC: 0.081; NEWS AUC: 0.062. Panel C: ROC curves for PICTURE, EDI, and NEWS models in the COVID-19 cohort. PICTURE AUC: 0.849; EDI AUC: 0.803; NEWS AUC: 0.746. Panel D: PR curves for the two models. PICTURE AUC: 0.173; EDI AUC: 0.131; NEWS AUC: 0.098 in the COVID-19 cohort. All curves represent observation-level analysis. EDI: Epic Deterioration Index; FPR: false-positive rate; NEWS: National Early Warning Score; PICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events; TPR: true-positive rate.
In addition to classification performance, lead time represents another critical component of a predictive analytic’s utility. Lead time refers to the interval between the alert and the actual event, and it determines how much time clinicians have to act on the model’s recommendations. We assessed the model’s relative performance at different lead times in a threshold-independent manner by excluding data occurring 0.5 hours, 1 hour, 2 hours, 6 hours, 12 hours, and 24 hours before an adverse event and calculating encounter-level performance (
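The lead-time censoring can be sketched as follows, assuming hypothetical `obs_time` and `event_time` columns; encounters without an event keep all observations and are scored by their maximum prediction:

```python
import pandas as pd

def max_score_with_lead(df: pd.DataFrame, lead_hours: float) -> pd.DataFrame:
    """Encounter-level score at a given lead time: drop observations within
    `lead_hours` of the adverse event, then take each encounter's maximum
    score. Encounters with no event (event_time is NaT) keep their whole stay."""
    cutoff = df["event_time"] - pd.to_timedelta(lead_hours, unit="h")
    keep = df["event_time"].isna() | (df["obs_time"] <= cutoff)
    return (
        df[keep]
        .groupby("encounter_id")
        .agg(max_score=("score", "max"), label=("label", "max"))
        .reset_index()
    )
```

The resulting per-encounter maxima can then be fed into the same AUROC/AUPRC machinery at each lead-time cutoff.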
Lead time analysis in non–COVID-19 cohort.a
| Lead time (hours) | PICTUREd AUROCb (95% CI) | EDIe AUROC (95% CI) | PICTURE AUPRCc (95% CI) | EDI AUPRC (95% CI) | Event rate (%) | Sample size, n |
|---|---|---|---|---|---|---|
| 0.5 | 0.859 (0.846-0.873) | 0.803 (0.787-0.820) | 0.368 (0.336-0.400) | 0.274 (0.244-0.302) | 4.21 | 21,636 |
| 1 | 0.850 (0.835-0.864) | 0.795 (0.778-0.811) | 0.346 (0.315-0.379) | 0.254 (0.227-0.280) | 4.18 | 21,636 |
| 2 | 0.838 (0.823-0.853) | 0.784 (0.767-0.802) | 0.321 (0.292-0.352) | 0.238 (0.210-0.265) | 4.14 | 21,622 |
| 6 | 0.825 (0.810-0.840) | 0.768 (0.750-0.787) | 0.280 (0.249-0.310) | 0.210 (0.184-0.237) | 3.92 | 21,572 |
| 12 | 0.817 (0.801-0.832) | 0.767 (0.749-0.786) | 0.247 (0.215-0.275) | 0.183 (0.159-0.207) | 3.67 | 21,515 |
| 24 | 0.808 (0.790-0.826) | 0.759 (0.740-0.779) | 0.205 (0.172-0.230) | 0.144 (0.121-0.164) | 3.24 | 21,419 |
aThe performance of the two models (encounter level) at various lead times was assessed by evaluating the maximum prediction score prior to x hours before the given event, with x ranging in progressively greater intervals from 0.5 to 24.
bAUROC: area under the receiver operating characteristic curve.
cAUPRC: area under the precision-recall curve.
dPICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
eEDI: Epic Deterioration Index.
When applied to patients testing positive for COVID-19, PICTURE performed similarly well. PICTURE scores were again aligned to EDI scores using the process outlined in the section Comparison of PICTURE and EDI. This resulted in the inclusion of 607 encounters.
As with the non–COVID-19 cohort, a similar lead time analysis was then performed to assess the performance of PICTURE and EDI when making predictions further in advance. Thresholds were again set at 0.5 hours, 1 hour, 2 hours, 6 hours, 12 hours, and 24 hours before the event, and observations occurring after this cutoff were excluded. In our cohort, PICTURE again outperformed the EDI even when making predictions 24 hours in advance (
Comparison of PICTURE and the EDI in patients testing positive for COVID-19.
| Granularity and analytic | AUROCa (95% CI) | P value | AUPRCb (95% CI) | P value | Event rate (%) |
|---|---|---|---|---|---|
| Observation level | | | | | 3.20 |
| PICTUREc | 0.849 (0.820-0.878) | vs EDId: <.001; vs NEWSe: <.001 | 0.173 (0.116-0.211) | vs EDI: .002; vs NEWS: <.001 | |
| EDI | 0.803 (0.772-0.838) | vs NEWS: <.001 | 0.131 (0.087-0.163) | vs NEWS: .002 | |
| NEWS | 0.746 (0.708-0.783) | N/Af | 0.098 (0.066-0.122) | N/A | |
| Encounter level | | | | | 20.6 |
| PICTURE | 0.895 (0.868-0.928) | vs EDI: <.001; vs NEWS: <.001 | 0.665 (0.590-0.743) | vs EDI: <.001; vs NEWS: <.001 | |
| EDI | 0.802 (0.762-0.848) | vs NEWS: .05 | 0.510 (0.438-0.588) | vs NEWS: .02 | |
| NEWS | 0.773 (0.732-0.818) | N/A | 0.441 (0.364-0.510) | N/A | |
aAUROC: area under the receiver operating characteristic curve.
bAUPRC: area under the precision-recall curve.
cPICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
dEDI: Epic Deterioration Index.
eNEWS: National Early Warning Score.
fN/A: not applicable.
Lead time analysis in COVID-19 cohort.a
| Lead time (hours) | PICTUREd AUROCb (95% CI) | EDIe AUROC (95% CI) | PICTURE AUPRCc (95% CI) | EDI AUPRC (95% CI) | Event rate (%) | Sample size, n |
|---|---|---|---|---|---|---|
| 0.5 | 0.895 (0.867-0.926) | 0.802 (0.761-0.842) | 0.665 (0.586-0.739) | 0.510 (0.436-0.587) | 20.6 | 607 |
| 1 | 0.887 (0.860-0.918) | 0.793 (0.753-0.836) | 0.631 (0.553-0.710) | 0.491 (0.418-0.570) | 20.5 | 606 |
| 2 | 0.870 (0.840-0.901) | 0.794 (0.754-0.833) | 0.598 (0.518-0.675) | 0.478 (0.400-0.555) | 20.1 | 603 |
| 6 | 0.847 (0.813-0.885) | 0.769 (0.729-0.813) | 0.552 (0.474-0.639) | 0.435 (0.354-0.517) | 19.3 | 597 |
| 12 | 0.821 (0.783-0.863) | 0.752 (0.708-0.798) | 0.497 (0.411-0.577) | 0.403 (0.333-0.480) | 17.9 | 587 |
| 24 | 0.808 (0.767-0.856) | 0.740 (0.690-0.796) | 0.443 (0.344-0.529) | 0.370 (0.289-0.459) | 16.0 | 574 |
aThe performance of the two models (encounter level) at various lead times was again assessed by evaluating the maximum prediction score prior to x hours before the given event, with x ranging in progressively greater intervals from 0.5 to 24. On this cohort of patients testing positive for COVID-19, PICTURE consistently outperformed the EDI. At each level of censoring, the
bAUROC: area under the receiver operating characteristic curve.
cAUPRC: area under the precision-recall curve.
dPICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
eEDI: Epic Deterioration Index.
To provide clinicians with a description of factors influencing a given PICTURE score, we used Shapley values computed at each observation.
Shapley summary plots. Panel A depicts an aggregated summary plot of the Shapley values from the 2019 test set, while panel B corresponds to COVID-19 positive patients. The 20 most influential features are ranked from top to bottom, and the distribution of Shapley values across all predictions are plotted. The magnitude of the Shapley value is displayed on the horizontal axis, while the value of the feature itself is represented by color. For example, a large amount of oxygen support over 24 hours (red) in panel A was associated with a highly positive influence on the model, while low to no oxygen support (blue) pushed the model back toward 0. BUN: blood urea nitrogen; GCS: Glasgow Coma Scale; INR: international normalized ratio; SHAP: Shapley; WBC: white blood cells.
Both PICTURE and the EDI return scores indicating a patient’s risk of deterioration; however, neither score is calibrated as a probability. Therefore, alert thresholds may provide a convenient mechanism for deciding whether to alert a clinician that their patient is at increased risk. A previous study assessing the use of the EDI in patients with COVID-19 found an EDI score of 64.8 or greater to be an actionable threshold for identifying patients at increased risk [
Distribution of scores and calibration curve. Panel A presents a KDE of the distribution of PICTURE and EDI scores. In addition to raw PICTURE scores, logit-transformed scores are also included. Panel B depicts quantiles of PICTURE and EDI scores (0.1, 0.2, 0.3,...0.9) against observed risk. Neither PICTURE nor the EDI are calibrated as probabilities, and as such, the use of set alarm thresholds may be useful to help alert clinicians when their patient is at an increased risk. EDI: Epic Deterioration Index; KDE: kernel density estimate; PICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
To simulate when a clinician might receive an alert from the PICTURE system, four thresholds were selected, aligned based on the observed sensitivity, specificity, PPV, and NPV of the EDI score using the 64.8 value posed by Singh et al [
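The metrics used for this threshold alignment (sensitivity, specificity, PPV, NPV, the workup-to-detection ratio as 1/PPV, and the F1 score as the harmonic mean of PPV and sensitivity) can be computed from a confusion matrix at a fixed threshold. A minimal sketch:

```python
import numpy as np

def threshold_metrics(y_true, y_score, threshold):
    """Confusion-matrix summary at a fixed alert threshold."""
    y_true = np.asarray(y_true).astype(bool)
    alert = np.asarray(y_score) >= threshold
    tp = np.sum(alert & y_true)
    fp = np.sum(alert & ~y_true)
    fn = np.sum(~alert & y_true)
    tn = np.sum(~alert & ~y_true)
    sens = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "sensitivity": sens,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "wdr": 1 / ppv,                       # patients flagged per true detection
        "f1": 2 * ppv * sens / (ppv + sens),  # harmonic mean of PPV and sensitivity
    }
```

Sweeping `threshold` until one of these metrics matches the EDI's value at 64.8 reproduces the alignment idea behind the table that follows.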
Alert thresholds and median lead time.a
| Score | Threshold source | Threshold value | Sensitivity | Specificity | PPVb | NPVc | WDRd | F1 scoree | Lead timef (h:min), median (IQR) |
|---|---|---|---|---|---|---|---|---|---|
| EDIg | Singh et al [ | 64.8 | 0.448 | 0.917 | 0.583 | 0.865 | 1.71 | 0.507 | 32:26 (4:37-66:08) |
| PICTUREh | Align by sensitivity | 0.165 | N/Ai | 0.946 | 0.683 | 0.869 | 1.46 | 0.541 | 40:14 (7:51-67:50) |
| PICTURE | Align by specificity | 0.097 | 0.616 | N/A | 0.658 | 0.902 | 1.52 | 0.636 | 40:04 (7:44-91:00) |
| PICTURE | Align by PPV | 0.048 | 0.792 | 0.851 | N/A | 0.940 | N/A | 0.668 | 54:10 (29:26-115:50) |
| PICTURE | Align by NPV | 0.173 | 0.432 | 0.946 | 0.675 | N/A | 1.48 | 0.527 | 41:40 (7:31-68:30) |
aSensitivity, specificity, PPV, and NPV were calculated for the EDI at a threshold of 64.8 as suggested in Singh et al [
bPPV: positive predictive value.
cNPV: negative predicative value.
dWDR: workup to detection ratio.
eF1 scores were calculated as the harmonic mean between PPV and sensitivity.
fLead times were determined using the intersection of true positives between PICTURE and the EDI, and were calculated as the time between a patient first crossing the threshold and their first deterioration event.
gEDI: Epic Deterioration Index.
hPICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events.
iN/A: not applicable.
PICTURE makes a prediction at every observation for the features included. A natural starting point for the assessment of PICTURE’s performance is at this level of granularity. Using the general structure outlined in Gillies et al [
The EDI does not make predictions at every feature observation; instead, it makes predictions every 15 minutes. To provide a direct comparison to the EDI, we subset the PICTURE scores and time-matched them to the EDI scores as described in the section Performance Measures. PICTURE significantly outperformed the EDI on this cohort of non–COVID-19 patients, with an observation-level AUROC of 0.819 compared to the EDI’s AUROC of 0.763. This performance gap extended across multiple lead times: even when restricted to data collected 24 hours or more before the adverse event, PICTURE’s performance remained high with an AUROC of 0.808, while the EDI’s AUROC dropped to 0.759. These results suggest that using PICTURE instead of the EDI at the University of Michigan hospital would lead to fewer false alarms. PICTURE maintained this performance improvement even as the models were forced to make predictions further in advance of the adverse event.
As the EDI has increasingly been investigated as a feasible metric to gauge deterioration risk in patients with COVID-19 [
One important point of discussion is the considerably higher rate of deterioration observed in patients with COVID-19 (20.6% vs 4.21% of encounters). This is likely due to a combination of the severity of the virus when compared to a general ward population and the aggressive treatment regimen endorsed by clinicians facing a disease that, during the early phases of the pandemic, represented many unknowns. Therefore, the threshold selection presented in the section Calibration and Alert Thresholds may differ between COVID-19 and general ward patients. The performance of the PICTURE analytic (as measured by AUROC) increased slightly (though with overlapping 95% CIs) when applied to patients with COVID-19 versus the general test set, indicating that patients with COVID-19 may represent a slightly easier classification task. This is supported by the fact that the EDI also performed better on the COVID-19 cohort when measured by observation-level AUROC (0.763 vs 0.803), though this increase was not sustained in the encounter-level results (AUROC 0.803 vs 0.802).
One key feature of the PICTURE model is its use of Shapley values to help explain individual predictions to clinicians. These explanations help add interpretability to the model, allowing clinicians to evaluate individual model scores and identify potential next steps, follow-up tests, or treatment plans.
Simulated alert thresholds were calculated based on the derived sensitivity, specificity, PPV, and NPV of the EDI threshold posited by Singh et al [
As a demonstration of the potential utility of PICTURE, an individual hospital encounter was selected, and the trajectories of PICTURE and the EDI are visualized in
To simulate what a clinician receiving an alert from PICTURE might encounter, the Shapley values explaining the PICTURE predictions at both alert thresholds are recorded in
Sample trajectory of one patient. Panel A depicts the PICTURE predictions over 27 hours before the patient is eventually transferred to an ICU level of care (green bar). Two possible alert thresholds are noted: one (red: 0.165) based on the EDI’s sensitivity at a threshold of 64.8 (as suggested by Singh et al [
Sample Predicting Intensive Care Transfers and Other Unforeseen Events explanations.
| Rank and feature namea | Value | Median (IQR)b | Shapley score |
|---|---|---|---|
| Crossing the PPVc-aligned threshold | | | |
| 1. Oxygen supplementation (rolling 24 h max) | 7 L/min | 2.0 (0.0-3.0) | 1.06 |
| 2. SpO2d (rolling 24 h min) | 85% | 92.0 (90.0-94.0) | 0.93 |
| 3. Respiratory rate | 26 bpm | 20.0 (18.0-20.0) | 0.76 |
| 4. Temperature | 39.1 °C | 36.9 (36.8-37.2) | 0.32 |
| 5. Protein level | 5.7 | 6.0 (5.6-6.4) | 0.13 |
| Crossing the sensitivity-aligned threshold | | | |
| 1. Oxygen supplementation (rolling 24 h max) | 35 L/min | 2.0 (0.0-3.0) | 1.93 |
| 2. SpO2 (rolling 24 h min) | 85% | 92.0 (90.0-94.0) | 1.09 |
| 3. Respiratory rate | 24 bpm | 20.0 (18.0-20.0) | 0.73 |
| 4. Heart ratee | 124 bpm | 83.0 (74.0-92.0) | 0.71 |
| 5. Temperature | 39.1 °C | 36.9 (36.8-37.2) | 0.32 |
aThe top 5 features corresponding to Predicting Intensive Care Transfers and Other Unforeseen Events predictions as it crosses the PPV-aligned threshold and the sensitivity-aligned threshold as noted in
bThe median and IQR are included for comparison, and are calculated using the COVID-19 data set.
cPPV: positive predictive value.
dSpO2: oxygen saturation as measured by pulse oximetry.
eHeart rate represented the primary difference between these two time points. When the Predicting Intensive Care Transfers and Other Unforeseen Events score first exceeded the PPV threshold 12.5 hours before the intensive care unit transfer, the heart rate remained at 65 bpm and was not among the top features as measured by Shapley. At 11 hours before the event, when the Predicting Intensive Care Transfers and Other Unforeseen Events score was at its highest, the heart rate had jumped to 124 bpm and was the fourth-most influential feature as measured by Shapley values.
This analysis is limited to a single academic medical center, and its generalizability to other health care systems will require future study. Our sample of patients with COVID-19 was also limited in size, limiting our power to detect differences between PICTURE and the EDI. Lastly, the thresholds presented in the section Calibration and Alert Thresholds may be different from those used in the general population due to the increased event rate. The thresholds may also require future tuning to suit the needs of individual units.
The PICTURE early warning system accurately predicts adverse patient outcomes including ICU transfer, mechanical ventilation, and death at Michigan Medicine. The ability to consistently anticipate these events may be especially valuable when considering a potential impending second wave of COVID-19 infections. The EDI is a widespread deterioration model, which has recently been assessed in a COVID-19 population. Both PICTURE and the EDI were trained using approximately 130,000 non–COVID-19 encounters for general deterioration and thus are not overfit to the COVID-19 population [
Supplementary material.
AUPRC: area under the precision-recall curve
AUROC: area under the receiver operating characteristic curve
BUN: blood urea nitrogen
EDI: Epic Deterioration Index
EHR: electronic health record
GCS: Glasgow Coma Scale
ICU: intensive care unit
NEWS: National Early Warning Score
NPV: negative predictive value
PICTURE: Predicting Intensive Care Transfers and Other Unforeseen Events
PODS: Propelling Original Data Science
PPV: positive predictive value
PR: precision-recall
ROC: receiver operating characteristic
SpO2: oxygen saturation as measured by pulse oximetry
This study was supported in part by the Michigan Institute for Data Science “Propelling Original Data Science (PODS) Mini-Grants for COVID-19 Research” award. AJA has received funding from NIH/NHLBI (F32HL149337).
CEG, RPM Jr, and KRW have submitted a patent regarding our machine learning methodologies presented in this paper through the University of Michigan’s Office of Technology Transfer.