This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
With a large-scale electronic health record repository, it is feasible to build a customized patient outcome prediction model specifically for a given patient. This approach involves identifying past patients who are similar to the present patient and using their data to train a personalized predictive model. Our previous work investigated a cosine-similarity patient similarity metric (PSM) for such patient-specific predictive modeling.
The objective of the study is to investigate the random forest (RF) proximity measure as a PSM in the context of personalized mortality prediction for intensive care unit (ICU) patients.
A total of 17,152 ICU admissions were extracted from the Multiparameter Intelligent Monitoring in Intensive Care II database. A number of predictor variables were extracted from the first 24 hours in the ICU. Outcome to be predicted was 30-day mortality. A patient-specific predictive model was trained for each ICU admission using an RF PSM inspired by the RF proximity measure. Death counting, logistic regression, decision tree, and RF models were studied with a hard threshold applied to RF PSM values to only include the M most similar patients in model training, where M was varied. In addition, case-specific random forests (CSRFs), which uses RF proximity for weighted bootstrapping, were trained.
Compared to our previous study that investigated a cosine similarity PSM, the RF PSM resulted in superior or comparable predictive performance. RF and CSRF exhibited the best performances (in terms of mean area under the receiver operating characteristic curve [95% confidence interval], RF: 0.839 [0.835-0.844]; CSRF: 0.832 [0.821-0.843]). RF and CSRF did not benefit from personalization via the use of the RF PSM, while the other models did.
The RF PSM led to good mortality prediction performance for several predictive models, although it failed to induce improved performance in RF and CSRF. The distinction between predictor and similarity variables is an important issue arising from the present study. RFs present a promising method for patient-specific outcome prediction.
Harnessing the information contained in health data from various sources toward personalized medicine has been a research topic of interest and discussed by a number of highly regarded health researchers recently [
A key to personalized patient outcome prediction is how patient similarity is defined. Various similarity measures have been investigated in the context of EHR-based outcome prediction, including distance-based [
Outside of health, several domains have employed similarity approaches in machine learning and predictive analytics. One prime example is product recommendation in e-commerce where purchase histories of similar consumers are leveraged to recommend products to a given customer [
Our previous studies in this line of research investigated a cosine similarity PSM for personalized mortality prediction in intensive care unit (ICU) patients with hard thresholding [
One interesting way to define a PSM is to use the random forest (RF) proximity measure, which represents the likelihood of 2 cases falling in the same terminal node in the trees of an RF [
The objective of this study was to evaluate the effectiveness of RF patient similarity on improving mortality prediction performance in the critically ill.
Patient data were extracted from the public ICU database Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) version 2.6 [
List of predictor variables.
Category | Predictor variables |
Demographics | Age, gender |
Administrative information | Admission type (elective, urgent, emergency), ICUa service type (MICUb, SICUc, CCUd, CSRUe) |
Vital signs (min. and max. every 6 hours during the first 24 hours in the ICU) | Heart rate, mean blood pressure, systolic blood pressure, SpO2, spontaneous respiratory rate, body temperature |
Labs (min. and max. from the first 24 hours in the ICU | Hematocrit, white blood cell count, serum glucose, serum HCO3, serum potassium, serum sodium, blood urea nitrogen, and serum creatinine |
Intervention (yes/no during the first 24 hours in the ICU) | Vasopressor therapy, mechanical ventilation, or continuous positive airway pressure |
Others (from the first 24 hours in the ICU) | Worst Glasgow Coma Scale score, total urinary output every 6 hours |
aICU: intensive care unit.
bMICU: medical intensive care unit.
cSICU: surgical intensive care unit.
dCCU: coronary care unit.
eCSRU: cardiac surgery recovery unit.
In MIMIC-II, there are 1.24 ICU admissions per patient on average which indicates that most patients have only one ICU admission. Other admissions from the same patient are likely to contain useful and relevant information since they represent the most similar “patients” (determined by the PSM) for the index patient. This is akin to incorporating past patient history in most prognostic scoring systems. Most importantly, if the data from other admissions from the same patient are available at the time of prediction, it would be a waste of data if they are not taken into account.
Out of 29,149 adult ICU admissions in MIMIC-II, 17,152 (58.84%) had complete data and were included in the present study.
In the data set, the overall 30-day mortality rate was 15.10% (4,401/29,149), while 56.70% (16,527/29,149) were male. The average age was 64.5 years with a standard deviation of 17.0. The percentages of elective, urgent, and emergency admissions were 18.00% (5,247/29,149), 3.70% (1,078/29,149), and 78.30% (22,824/29,149), respectively. More detailed descriptive statistics of the data set can be found elsewhere [
Following how bootstrap sampling weights were calculated in CSRF [
Random forest patient similarity metric formula.
where
For every index patient, the M most similar patients in the training data (ie, hard thresholding on RF PSM) were used to train a customized predictive model. A total of 4 different models were evaluated with hard thresholding: death counting (DC, predicted mortality risk is equal to the empirical probability of death among the M most similar patients), logistic regression (LR), decision tree (DT), and RF.
The range of M values varied depending on the predictive model. For DC, M ranged from 10 to 15,000 with a step size of 10, whereas for LR and RF the range was from 4000 to 15,000 with a step size of 1000. The M range for DT was from 5000 to 15,000 with a step size of 1000. This variation in M range accounted for computational burden and lack of variability in categorical variables (either predictor or outcome) among the M most similar patients when M was sufficiently small (only 1 category remained in some categorical variables when the M patients were too homogenous). Moreover, for LR, DT, and RF, training data size had to be sufficiently large (at least 1000), given that there were 75 predictor variables (
Note that we did not attempt to select the optimal M value for each patient, since our objective was to investigate the effects of the RF PSM on prediction performance as a function of M, as was done in our previous work [
As a fifth predictive model, CSRF was investigated. The entire data set was used for training CSRF models since the RF PSM was used as bootstrap sampling weights instead. Since this is a soft thresholding method, applying a range of M values was not applicable to CSRF.
Note that while DC, LR, and DT were investigated previously [
Predictive performance was evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A 10-fold cross-validation was conducted for all models to avoid overfitting. In each iteration of the cross-validation, the RF PSM between each patient in the test data and each patient in the training data was computed. Then, for each patient in the test data, the predictive models described above were trained using the M most similar cases in the training data (except CSRF), where M was varied as explained above. Once all patients in the test data were predicted, the AUROC and AUPRC for that fold were computed. Hence, the 10 iterations of the cross-validation yielded 10 AUROCs and 10 AUPRCs.
All computation was conducted in R version 3.3.1 (R Foundation). In particular, the
DC and DT showed statistically significant improvement (via 2-sided
Best predictive performance from each random forest patient similarity metric (PSM) model in terms of mean area under the receiver operating characteristic curve and area under the precision-recall curve in comparison with cosine PSM and traditional models with no PSM. All cosine PSM results are from Lee et al [
Number of similar patients at best predictive performance | Best predictive performance, mean (95% CI) | |||||||||
AUROCa | AUPRCb | AUROC | AUPRC | |||||||
RFc PSMd | Cosine PSM | RF PSM | Cosine PSM | RF PSM | Cosine PSM | No PSM | RF PSM | Cosine PSM | No PSM | |
DCe | 260 | 100 | 230 | 60 | 0.801 (0.792-0.811) | 0.797 (0.791-0.803) | 0.693 (0.679-0.707) | 0.429 (0.409-0.449) | 0.393 (0.378-0.407) | 0.280 (0.263-0.297) |
LRf | 5000 | 6000 | 9000 | 6000 | 0.824 (0.815-0.832) | 0.830 (0.825-0.836) | 0.811 (0.799-0.821) | 0.460 (0.437-0.484) | 0.474 (0.460-0.488) | 0.449 (0.430-0.468) |
DTg | 5000 | 2000 | 7000 | 4000 | 0.779 (0.775-0.784) | 0.753 (0.742-0.764) | 0.721 (0.705-0.738) | 0.352 (0.337-0.367) | 0.347 (0.335-0.358) | 0.339 (0.324-0.353) |
RF | 15000 | — | 4000 | — | 0.839 (0.835-0.844) | — | 0.839 (0.835-0.844) | 0.507 (0.527-0.486) | — | 0.505 (0.487-0.523) |
CSRFh | — | — | — | — | 0.832 (0.821-0.843) | — | — | 0.496 (0.520-0.471) | — | — |
aAUROC: area under the receiver operating characteristic curve.
bAUPRC: area under the precision-recall curve.
cRF: random forest.
dPSM: patient similarity metric.
eDC: death counting.
fLR: logistic regression.
gDT: decision tree.
hCSRF: case-specific random forest.
Mortality prediction performance measured in area under the receiver operating characteristic curve as a function of the number of similar patients. Mean and 95% confidence interval from 10-fold cross-validation are shown.
Mortality prediction performance measured in area under the precision-recall curve as a function of the number of similar patients. Mean and 95% confidence interval from 10-fold cross-validation are shown.
Box plot comparing area under the receiver operating characteristic curves (AUROCs) from all 5 models. For death counting, logistic regression, decision tree, and random forest, the performance from the number of similar patients corresponding to the maximum mean AUROC is shown.
Box plot comparing area under the precision-recall curves (AUPRCs) from all 5 models. For death counting, logistic regression, decision tree, and random forest, the performance from the number of similar patients corresponding to the maximum mean AUPRC is shown.
The conventional doctrine in machine learning is that it is always beneficial to collect more training data. This is true if collected data represent the same underlying phenomenon, but it is often difficult to make this assumption in medicine largely due to enormous variability in patient/clinical characteristics, as well as our limited understanding of complex human health and disease pathways. In the era of big data, we can now afford to be more selective regarding which cases should be included in predictive modeling. Data-driven patient similarity matching, via a PSM, leads to objective training data selection and uses hidden patterns in multidimensional data that are difficult for human clinicians to identify.
In comparison with conventional, one-size-fits-all predictive models such as those from the Framingham Heart Study [
In this study, RF and CSRF, which were not studied in our previous work, outperformed DC, LR, and DT, while the difference between RF and CSRF was statistically insignificant. The comparison between RF with CSRF is interesting because the difference is essentially hard thresholding versus weighted bootstrapping. Also, CSRF did not improve upon the conventional RF that used all data as training data. The comparable performances from RF and CSRF, and the negligible effects of the number of similar patients on RF performance (
An appropriate method to circumvent the lack of variability in categorical variables when M is too small should be investigated in future work. One simple solution is to exclude such problematic categorical variables in model training after “using them up” for similarity matching. For example, gender may be used for PSM calculation but if the training data subsequently only include 1 gender (same as that of the index patient), then gender can be dropped from model training to avoid the computational issue. This solution will not work for the outcome variable, however. This issue is closely related to the important topic of how best to use the variables in a given data set in personalized predictive analytics: should they be used as predictor or similarity variables, or both? This topic requires further research involving both real life and simulated data.
Although this study only investigated prediction performance as a function of M without attempting to select optimal M values for individual patients, optimal M selection will become important when the results of this study are taken to practice. However, in the context of patient similarity, selecting the optimal M for a given patient is not trivial because it may vary across different types of patient. This implies that optimal M values should not be selected based on
Instead of selecting the M most similar patients from training data, it is also feasible to threshold RF PSM values so that all patients with an RF PSM above this threshold are included in model training. One advantage of this approach is that the quality of similarity in the training data can be ensured and controlled, whereas a fixed M would force M patients to be used for model training regardless of how similar they are to the index patient in absolute sense. However, a challenge with this approach is that a good understanding of the magnitude of the RF PSM is required, which could be the subject of another research study. Moreover, the threshold is likely to vary across patients.
In addition to the future work mentioned above, future PSM research directions should involve the following: other PSMs especially those that can capture temporal patterns, other health data sets, patient outcomes other than mortality, other predictor variables (such as diagnosis, medications other than vasopressors, information from free-text clinician notes, etc), and other predictive models.
In comparison with the DC, LR, and DT results from our previous work that studied a cosine similarity PSM [
However, the number of similar patients associated with the best performance was different between the 2 studies. The RF PSM tended to require more similar patients to be included in the training data to achieve peak performance; the cosine similarity PSM results peaked with respect to AUROC and AUPRC at 100 and 60 for DC, 6000 and 6000 for LR, and 2000 and 4000 for DT. This finding suggests that the RF and cosine similarity PSMs quantified patient similarity somewhat differently.
This study has limitations. First, MIMIC-II is a single-center database and hence the results may not be generalizable to other ICU data. Second, no extensive investigation of various parameter values was conducted for the predictive models in order to keep the number of models that had to be trained at a reasonable level. Although the default parameter settings from the widely used R packages are reasonable choices, the effects of the parameter values on predictive performance could be investigated further in future work.
PSM-driven predictive analytics is an exciting topic, as accurate, tailor-made patient outcome prediction can greatly inform risk stratification and resource allocation. The results from this study that stem from the use of an RF PSM corroborate the utility of PSMs in enhancing predictive performance at the patient level. The superior predictive performances of RF and CSRF, as well as the fact that the RF PSM outperformed the cosine similarity PSM in some models, indicate that RFs are well suited for patient-specific predictive modeling.
area under the precision-recall curve
area under the receiver operating characteristic curve
coronary care unit
case-specific random forest
cardiac surgery recovery unit
death counting
decision tree
electronic health record
intensive care unit
logistic regression
medical intensive care unit
Multiparameter Intelligent Monitoring in Intensive Care II
patient similarity metric
random forest
surgical intensive care unit
JL conducted the entire research study including study design, data extraction, and predictive modeling. JL also wrote, revised, and approved the entire manuscript. This work was supported by a Discovery Grant (RGPIN-2014-04743) from the Natural Sciences and Engineering Research Council of Canada (NSERC). The author would like to thank Professor Dan Nettleton and Dr Ruo Xu for sharing their CSRF code.
None declared.