This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Physicians and health policy makers must make predictions as part of their decision making across a wide range of medical problems. Many advances have been made in predictive modeling toward outcome prediction, but these innovations target an average patient and are insufficiently adjustable for individual patients. One developing idea in this field is individualized predictive analytics based on patient similarity. The goal of this approach is to identify patients who are similar to an index patient and derive insights from the records of similar patients to provide personalized predictions.
The aim is to summarize and review published studies describing computer-based approaches for predicting patients’ future health status based on health data and patient similarity, identify gaps, and provide a starting point for related future research.
The review involved (1) performing automated searches in Scopus, PubMed, and ISI Web of Science, and selecting relevant studies by first screening titles and abstracts and then analyzing full texts; and (2) extracting publication details and information on context, predictors, missing data, modeling algorithm, outcome, and evaluation methods into a matrix table, synthesizing the data, and reporting the results.
After duplicate removal, the titles and abstracts of 1339 articles were screened and 67 articles were selected for full-text review. In total, 22 articles met the inclusion criteria. Within the included articles, hospitals were the main source of data (n=10). Cardiovascular disease (n=7) and diabetes (n=4) were the dominant patient diseases. Most studies (n=18) used neighborhood-based approaches in devising prediction models. Two studies showed that patient similarity-based modeling outperformed population-based predictive methods.
Interest in patient similarity-based predictive modeling for diagnosis and prognosis has been growing. In addition to raw/coded health data, wavelet transform and term frequency-inverse document frequency methods were employed to extract predictors. Selecting predictors with potential to highlight special cases and defining new patient similarity metrics were among the gaps identified in the existing literature that provide starting points for future work. Patient status prediction models based on patient similarity and health data offer exciting potential for personalizing and ultimately improving health care, leading to better patient outcomes.
Medicine is largely reactive—a disease is treated only after it is observed [
Many studies have analyzed large populations to answer a wide range of health-related questions, including the study that developed Acute Physiology and Chronic Health Evaluation II (APACHE-II) [
One developing idea in this field is personalized predictive modeling based on patient similarity. The goal of this approach is to identify patients who are similar to an index patient and derive insights from the records of similar patients to provide personalized predictions. Employing patient similarity helps identify a precision cohort for an index patient, which will then be used to train a personalized model. Compared to conventional models trained on all patients, this approach has the potential to provide customized prediction. This approach has been widely used for personalized predictions in other fields, including music [
Although the concept of patient similarity is not new—blood typing has been used for blood transfusion for more than a century [
In which context (applications) have patient health prediction models based on health data and patient similarity been used?
Which modeling techniques have been considered in the literature?
How do patient similarity-based models affect health predictions in comparison to conventional models?
We hope the results could also contribute to the broad field of case-based reasoning (CBR)—with the core component of similarity assessment—to meet the challenges in medical applications [
A systematic search approach in line with guidelines of Kitchenham et al [
Studies included in this review had to be journal articles or conference proceedings written in English. They had to focus on prediction in the health domain, devise a model for prediction, embed explicit patient similarity analytics, and utilize health data for training their model. Studies were excluded if (1) they entirely relied on human input for predictions or similarity assessment, (2) the model was tested on seen data—the part of the data used for training the algorithm, and (3) the algorithm was trained using only genomic data. If the same study appeared in multiple publications, only the most comprehensive and latest version was included.
The literature search was finalized in December 2015. Scopus, PubMed, and ISI Web of Science, all databases covering health-related publications, were searched for peer-reviewed studies with keywords related to “prediction,” “health data,” and “patient similarity.” The search strings used in each of these search engines are given in
Data from included articles were extracted into a matrix table and analyzed with respect to the following criteria: publication information, context, predictors (or features), missing data, modeling algorithms, performance measures, and outcomes. The context was further examined from two points of view: data source and application area. The employed patient similarity-based modeling algorithms were also synthesized in three categories: neighborhood-based, clustering-based, and other algorithms, with the majority falling in the first category. Because measuring predictive performance is essential to model development (model selection/model tuning)—and can also be used to compare a given model with other methods (performance estimation)—evaluation metrics along with validation techniques used in the reviewed studies were also extracted.
A total of 22 articles were included in the review (
Summary of the reviewed articles in terms of data type, data origin, number of predictors, and number of instances (N=22).
| Authors | Data type | Data origin^a | Predictors, n^b | Instances, n^c |
|---|---|---|---|---|
| Jurisica et al [ | Cross-sectional | NR | 55 | 788 |
| Bobrowski [ | Cross-sectional | The Gastroenterological Clinic of the Institute of Food and Feeding in Warsaw [ | 40 | 511 |
| Park et al [ | Cross-sectional | UCI repository [ | 35 | 350 |
| | Cross-sectional | UCI repository [ | 13 | 270 |
| | Cross-sectional | UCI repository [ | 31 | 560 |
| | Cross-sectional | UCI repository [ | 8 | 760 |
| | Cross-sectional | UCI repository [ | 7 | 340 |
| Saeed et al [ | Longitudinal | MIMIC-II [ | 50 | 377 |
| Chattopadhyay et al [ | Cross-sectional | History of suicidal attempts and committed suicides collected from hospital records | 15 | 50 |
| Sun et al [ | Longitudinal | MIMIC-II [ | 50 | 74 |
| Sun et al [ | Longitudinal | MIMIC-II [ | 10 | 1500 |
| David et al [ | Cross-sectional | Laboratory results generated by two Beckman-Coulter Gen-S analyzers at an acute care facility in Brooklyn | NR | 4900 |
| Houeland [ | Cross-sectional | A dataset focused on palliative care for cancer patients | 55 | 1486 |
| Wang et al [ | Cross-sectional | UCI repository [ | 31 | 560 |
| | Cross-sectional | UCI repository [ | 8 | 760 |
| | Cross-sectional | A real-world EHR data warehouse of a health network consisting of data from 135K patients over a year | NR | 135K |
| Wang et al [ | Cross-sectional | A real-world EHR data warehouse of a health network consisting of data from 135K patients over a year | 2388 | 3946 |
| Campillo-Gimenez et al [ | Cross-sectional | French Renal Epidemiology and Information Network (REIN) registry [ | 19 | 1137 |
| Gottlieb et al [ | Cross-sectional and longitudinal | Hospital dataset, Stanford Medical Center, USA | 16 | 9974 |
| | Cross-sectional and longitudinal | Hospital dataset, Rabin Medical Center, Israel | 16 | 5513 |
| Lowsky et al [ | Cross-sectional | A dataset by the United States Renal Data System (USRDS) consisting of all kidney transplant procedures from 1969 to 1999 | 13 | 51,088 |
| Hielscher et al [ | Cross-sectional | The Study of Health in Pomerania (SHIP) [ | 65/57 | 578 |
| Zhang et al [ | Longitudinal | A 3-year longitudinal EHR dataset of 110,157 patients | NR | 1219 |
| Henriques et al [ | Longitudinal | myHeart home telemonitoring study [ | NR | 41 |
| Lee et al [ | Cross-sectional and longitudinal | MIMIC-II [ | 76 | 17,152 |
| Ng et al [ | Cross-sectional and longitudinal | A longitudinal medical claims database consisting of data from over 300,000 patients during four years | 8500 | 15,038 |
| Panahiazar et al [ | Cross-sectional | The Mayo Clinic | 33 | 1386 |
| Wang [ | Cross-sectional | UCI repository [ | 31 | 560 |
| | Cross-sectional | UCI repository [ | 8 | 760 |
| | Cross-sectional | A real-world EHR data warehouse | NR | 135K |
| Wang et al [ | Cross-sectional | A real-world EHR data warehouse | 127 | 3946 |
a NR: not reported.
b Predictors: the total number of predictors.
c Instances: the total number of data points used in each study, including the training and test sets.
Summary of reviewed articles in terms of outcome, evaluation metrics, and comparing methods (N=22).
| Authors | Outcome^a | Evaluation metrics^b | Compared against^c |
|---|---|---|---|
| Jurisica et al [ | Suggesting hormonal therapy (day of human chorionic gonadotrophin administration and the number of ampoules of human menopausal gonadotrophin) after in vitro fertilization and predicting pregnancy outcome (pregnancy, abortion, ectopic pregnancy, and ovarian hyperstimulation syndrome) | Accuracy | NR |
| Bobrowski [ | Four types of liver disease (cirrhosis hepatis biliaris primaria, cirrhosis hepatis decompensata, hepatitis chronica activa, and hepatitis chronica steatosis) | Accuracy | Classic |
| Park et al [ | (1) Six types of dermatology diseases (psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, pityriasis rubra pilaris); (2) diagnosis of heart disease (angiographic disease status); (3) diagnosis of a breast tumor as malignant or benign; (4) diagnosis of diabetes; (5) diagnosis of liver disorder | Accuracy; sensitivity; specificity | LR; C5.0; CART; neural network; conventional CBR ( |
| Saeed et al [ | Hemodynamic stability or instability of an episode | Sensitivity; positive predictive value | NR |
| Chattopadhyay et al [ | Suicidal risk levels (level 1: suicidal plans or thoughts; level 2: single suicidal attempt; level 3: multiple suicidal attempts) | NR | NR |
| Sun et al [ | Occurrence of acute hypotensive episode within the forecast window of an hour | Accuracy | Human expert’s idea based on the Euclidean [ |
| Sun et al [ | Occurrence of acute hypotensive episode within the forecast window of an hour | Accuracy | Human expert’s idea based on the Euclidean [ |
| David et al [ | Seven disease diagnoses (microcytic anemia, normocytic anemia, mild SIRS, thrombocytopenia, leukocytopenia, moderate/severe SIRS, normal) | Accuracy | Human expert’s idea |
| Houeland [ | Pain levels | Error rate (1 - accuracy) | Random retrieval; |
| Wang et al [ | (1) Diagnosis of a breast tumor as malignant or benign; (2) diagnosis of diabetes; (3) diagnosis of dementia without complications (HCC352) or diabetes with no or unspecified complications (HCC019) | Accuracy; sensitivity; precision; F-measure | PCA; LDA [ |
| Wang et al [ | Diagnosis of CHF 6 months later | Accuracy; sensitivity; precision; F-measure | LLE; LE; PCA; Euclidean distance |
| Campillo-Gimenez et al [ | Registration on the renal transplant waiting list: yes/no | ROC curve | |
| Gottlieb et al [ | Patient discharge diagnosis | ROC curve; F-measure | NR |
| Lowsky et al [ | Graft survival probability | IPEC | Cox model; RSF [ |
| Hielscher et al [ | Three levels of liver fat concentration measured by magnetic resonance tomography: (1) fat concentration <10%; (2) fat concentration of 10%-25%; (3) fat concentration ≥25% | Accuracy; sensitivity; specificity | Multiple variants of the |
| Zhang et al [ | Four effective drugs for hypercholesterolemia treatment: atorvastatin, lovastatin, pravastatin, and simvastatin | ROC curve | Patient similarity; patient similarity with drug structure similarity; patient similarity with drug target similarity |
| Henriques et al [ | Early detection of heart failure: decompensation or normal condition | Sensitivity; specificity; F-measure; G-measure | Coefficients’ distance; linear correlation of signals; Euclidean distance |
| Lee et al [ | 30-day in-hospital mortality | Area under ROC curve; area under precision-recall curve | Population-based and personalized versions of: majority vote; LR; DT |
| Ng et al [ | The risk of diabetes disease onset | ROC curve | Global LR; |
| Panahiazar et al [ | Medication plans for heart-failure patients (angiotensin-converting enzyme, angiotensin receptor blockers, β-adrenoceptor antagonists, statins, and calcium channel blocker) | Sensitivity; specificity; F-measure; accuracy | K-means; hierarchical clustering |
| Wang [ | (1) Diagnosis of a breast tumor as malignant or benign; (2) diagnosis of diabetes; (3) occurrence of CHF within 6 months | Precision; F-measure; sensitivity; accuracy | |
| Wang et al [ | Occurrence of CHF within 6 months | Precision; F-measure; sensitivity; accuracy | PCA; Laplacian regularized metric learning [ |
a CHF: congestive heart failure;
b IPEC: integrated prediction error curve; NR: not reported; ROC: receiver operating characteristic; SIRS: systemic inflammatory response syndrome.
c CART: classification and regression tree; CBR: case-based reasoning; DT: decision tree;
Flow diagram of article selection procedure.
The level of interest can be gauged by the increase in publications on this topic in recent years (
Distribution of publications by year.
Although a considerable number of articles did not clearly state their data source—and some articles used more than one dataset—hospitals were the most frequently named source (10/22); within hospitals, intensive care units (ICUs) were the main sources of data (5/10). In addition, one study [
Focused application areas of studies. Some studies featured more than a single application area and were counted more than once.
Raw health data can be in various formats, including narrative/textual data (eg, history of a present illness), numerical measurements (eg, laboratory results, vital signs, and measurements), recorded signals (eg, electrocardiograms), and pictures (eg, radiologic images). Numerical measurements and recorded signals were the format used most in the reviewed articles. Three main approaches were used for extracting predictors from raw health data. First, for some variables, including age and gender, the exact/coded value was used as a predictor. Second, in articles employing recorded signals and/or longitudinal numerical measurements, numeric variables, including wavelet coefficients, minimums, maximums, means, and variances, were extracted from within particular time windows [
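As a concrete illustration of the second approach, summary statistics can be extracted from fixed time windows of a recorded signal or longitudinal measurement series. The following sketch (data and window size invented for illustration, not taken from any reviewed study) computes the mean, variance, minimum, and maximum per window:

```python
from statistics import mean, pvariance

def window_features(signal, window_size):
    """Split a signal into consecutive fixed-size windows and extract
    summary statistics (mean, variance, min, max) from each window."""
    features = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        window = signal[start:start + window_size]
        features.append({
            "mean": mean(window),
            "variance": pvariance(window),
            "min": min(window),
            "max": max(window),
        })
    return features

# Illustrative heart-rate readings summarized in windows of 4 samples
hr = [72, 75, 71, 78, 95, 99, 102, 97]
feats = window_features(hr, 4)
```

Wavelet coefficients, also mentioned above, would replace or augment these per-window statistics in the studies that used them.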
Although predictor extraction affects the performance of the model [
One common challenge in using health data in predictive analytics is missing data. Most of the modeling techniques cannot handle an incomplete data matrix. Nevertheless, all studies that mentioned this challenge [
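A minimal sketch of one common remedy, mean imputation, is shown below; the data matrix is invented for illustration and the reviewed studies may have used other imputation strategies:

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values
    in the same column of the data matrix."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

# Two missing entries, one per column
data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
completed = impute_column_means(data)
```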
The neighborhood-based category comprises studies in which a group of patients similar to an index patient is retrieved and a prediction is produced by a model trained on the similar patients’ data. This category is comparable to memory-based techniques in collaborative filtering [
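The neighborhood-based idea can be sketched as follows: rank the training cohort by distance to the index patient (here, Euclidean distance) and predict from the k nearest records, in this sketch by majority vote. The cohort, features, and k are invented for illustration; the reviewed studies trained various models on the retrieved neighbors.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_from_neighbors(train, index_patient, k):
    """train: list of (feature_vector, outcome) pairs.
    Retrieve the k most similar patients and vote on their outcomes."""
    ranked = sorted(train, key=lambda rec: euclidean(rec[0], index_patient))
    votes = Counter(outcome for _, outcome in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature cohort with a binary outcome
cohort = [([1.0, 1.0], "stable"), ([1.2, 0.9], "stable"),
          ([5.0, 5.1], "unstable"), ([4.8, 5.3], "unstable")]
label = predict_from_neighbors(cohort, [1.1, 1.0], k=3)
```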
Twelve studies (of 18) used various types of distance-based similarity. One study [
Five studies [
David et al [
Six studies utilized the Mahalanobis distance [
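Unlike the Euclidean distance, the Mahalanobis distance scales feature differences by the inverse covariance matrix, so high-variance or correlated features do not dominate. A minimal two-dimensional sketch (the covariance values are invented for illustration):

```python
import math

def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T cov^-1 (x - y)) for 2-dimensional vectors."""
    diff = [x[0] - y[0], x[1] - y[1]]
    inv = inverse_2x2(cov)
    tmp = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
           inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    return math.sqrt(diff[0] * tmp[0] + diff[1] * tmp[1])

cov = [[4.0, 0.0], [0.0, 1.0]]  # first feature has 4x the variance
d = mahalanobis([2.0, 0.0], [0.0, 0.0], cov)
```

Here the raw (Euclidean) difference of 2.0 in the first feature shrinks to a Mahalanobis distance of 1.0 because that feature's variance is larger.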
Wang et al [
Wang [
Lowsky et al [
Ng et al [
Saeed et al [
Lee et al [
Four studies used other similarity metrics. One of the earliest proposed methods [
Campillo-Gimenez et al [
Cluster-based algorithms group patients in a training set based on their profiles and relationships; a new patient is then assigned to a predefined cluster based on his/her similarity to each cluster. These methods involve a trade-off between prediction performance and scalability for large datasets. Only one study [
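The assignment step of the cluster-based approach can be sketched as follows, assuming cluster centroids have already been computed (eg, by k-means); the centroids and patient vector are invented for illustration:

```python
import math

def assign_cluster(patient, centroids):
    """Return the index of the centroid closest to the patient,
    measured by Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)),
               key=lambda i: dist(patient, centroids[i]))

centroids = [[0.0, 0.0], [10.0, 10.0]]  # hypothetical cluster centers
cluster = assign_cluster([9.0, 11.0], centroids)
```

The prediction for the new patient would then be derived from the model or outcome statistics of the assigned cluster.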
Gottlieb et al [
Zhang et al [
Wang [
The outcomes of prediction models normally take six forms: continuous, binary, categorical (but not ordered), ordinal, count, and survival. The studies reviewed targeted continuous outcomes [
Evaluation metrics are widely used to tune the parameters of a model and compare the model with other methods.
A confusion matrix is a cross-tabulation representation of observed and predicted classes. Various evaluation metrics extracted from a confusion matrix—including accuracy, sensitivity, specificity, F-measure, G-measure, precision, and positive predictive value—were used in the included articles.
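For a binary outcome, these metrics follow directly from the four cells of the confusion matrix; the counts below are made up for illustration:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute common metrics from 2x2 confusion matrix counts:
    true/false positives (tp, fp) and false/true negatives (fn, tn)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)            # positive predictive value
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f_measure": f_measure}

m = confusion_metrics(tp=40, fp=10, fn=10, tn=40)
```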
Five articles [
When a model generates a continuous outcome, a common performance measure is the mean squared error. This metric is based on model residuals, which are the difference between the observed and predicted responses, and can be calculated by taking the average of squared model residuals. One study [
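The calculation described above is direct; the observed and predicted values below are illustrative:

```python
def mean_squared_error(observed, predicted):
    """Average of squared residuals (observed minus predicted)."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return sum(r ** 2 for r in residuals) / len(residuals)

mse = mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
```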
Validation techniques can generally be grouped into two categories: internal and external [
Internal validation techniques randomly split the available dataset into a training set and a test set, using various approaches. Seven studies [
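One common internal validation scheme, k-fold cross-validation, can be sketched by generating the index splits directly (the fold assignment below is a simple round-robin; stratified or shuffled variants are also common):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds
    over n instances, assigning instances round-robin to folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

splits = list(k_fold_indices(n=6, k=3))
```

Each instance appears in exactly one test fold, so every record contributes to both training and evaluation across the k rounds.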
External validation means assessing the performance of the prediction model in other scenarios or settings (eg, assessing the geographic or temporal transportability of the model). Only one study [
Over the period of 1989 to 2015, we found 22 articles that focused on patient similarity in predictive modeling using EHR data, with an increase in the number of these studies over time. Overall, three main approaches were employed in these studies to leverage patient similarity: neighborhood-based modeling, clustering modeling, and other algorithms. This section discusses the results from this review study to address the research questions then identifies gaps and future research directions.
This study showed that patient similarity-based predictive modeling has been widely used on hospital data, which sheds light on the need for patient similarity-based predictive modeling in tackling big data. In addition, further analysis revealed that ICUs are the central focus in hospitals. ICUs treat patients with severe and life-threatening illnesses that require continuous monitoring. Thus, ICU patients are surrounded by equipment that constantly generates a large amount of data. However, this large volume usually overwhelms clinicians and highlights the need for a computerized system. In addition, the critical health status of the patients in ICUs requires more proactive (rather than reactive), precise, and personalized care. Therefore, ICUs are a suitable environment for personalized prediction models.
Furthermore, chronic disease prognosis was one of the common application areas for personalized predictive modeling. Such analytics can help in improving patient health status if used in planning new therapies or interventions to prevent further complications. Patient similarity analytics can also be used for predicting a patient’s risk of developing further complications or disease. In particular, patient similarity analytics can overcome the challenge of comorbidities in chronic disease risk stratification and provide customized plans for a given patient. It is worth mentioning that cardiovascular diseases and diabetes were common application domains among the reviewed studies.
Most of the studies focused on neighborhood-based modeling. These models are easy to implement and they typically perform well. However, their performance depends greatly on the chosen patient similarity metric. Although there are a variety of similarity metrics in data mining [
Cluster-based methods exhibit better scalability than neighborhood-based modeling, but there is a trade-off between prediction accuracy and scalability. These methods may not satisfactorily address the prediction for patients with rare conditions because they work based on predefined clusters. Especially in hierarchical clustering methods, in which final clusters are derived based on merging smaller clusters [
Four studies embedded patient similarity analytics in their modeling approach even though they did not explicitly compute a patient similarity metric [
Only two studies [
One of the factors that strongly affects predictive performance is the choice of predictors. Results show that researchers are searching for reliable predictors to enhance the performance of patient similarity-based models. In the context of personalized prediction models, the best possible predictors should have at least two characteristics: (1) be capable of capturing the progression of a patient’s health status, and (2) be as discriminative as possible. Applying the TF-IDF technique could help boost the accuracy of similarity assessment for patients with rare conditions [
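The intuition behind applying TF-IDF here is that a diagnosis code shared by few patients should count more toward similarity than a ubiquitous one. A hedged sketch using the inverse-document-frequency weight, with hypothetical diagnosis codes:

```python
import math

def idf_weights(patient_records):
    """patient_records: list of sets of diagnosis codes.
    Return an IDF weight per code: log(N / number of patients
    whose record contains the code). Rare codes get high weights."""
    n = len(patient_records)
    codes = set().union(*patient_records)
    return {c: math.log(n / sum(1 for rec in patient_records if c in rec))
            for c in codes}

# A code present in every record receives weight 0; a rare one does not
records = [{"hypertension"}, {"hypertension"}, {"hypertension", "rare_dx"}]
w = idf_weights(records)
```

A similarity metric would then weight each shared code by these values, so matching on the rare condition dominates matching on the common one.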
As observed in several studies, some values for a given patient may be missing. Although imputation methods can help deal with missing data, it is important to determine why the values are missing. Sometimes, associations exist between patterns of missing data and the outcomes. This type of information gap is referred to as informative missingness [
As mentioned previously, a wide variety of techniques have been employed in efforts to achieve personalized prediction. Neighborhood-based methods are among the most popular techniques. However, abundant room remains for progress in defining new patient similarity metrics. In addition, as suggested by Gottlieb et al [
There are some limitations to this review. First, although the article selection protocol was devised by all reviewers, there could have been a bias in selecting articles because title and abstract screening was done by only one reviewer. Second, the search process focused on the more generic terms covering the concept of EHR, and it might have excluded articles in which domain-specific words (eg, “diabetes data”) were used to describe the data source. Finally, due to inaccessibility to some EHR data in the included studies, data quality assessment was infeasible and all the studies received equal importance in the interpretation of the findings, which might have caused a bias in the results.
Personalized medicine has the potential to facilitate predictive medicine, provide tailored prognoses/diagnoses, and prescribe more effective treatments. Interest is increasing in the use of personalized predictive modeling and various patient similarity-based models using EHRs have been described in the literature. This review has demonstrated the value of patient similarity-based models in critical health problems and noted the results of two studies [
Search strings used to search databases.
Summary of the reviewed articles in terms of methodology (N=22).
Acute Physiology and Chronic Health Evaluation II
adaptive semisupervised recursive tree partitioning
case-based reasoning
decision tree
electronic health data
intensive care unit
k dimensional tree
k-nearest neighbor
logistic regression
locality sensitive discriminant analysis
local spline regression
principal component analysis
receiver operating characteristic
random survival forest
term frequency-inverse document frequency
The authors would like to thank Rebecca Hutchinson, statistics and computer science research librarian at the University of Waterloo, for her advice on the systematic literature search. The authors would also like to thank Dr David Maslove, clinician scientist with the Department of Medicine at Queen’s University, for his comments on this study. AS and JL were supported by NSERC Discovery Grant (RGPIN-2014-04743). JAD was partially supported by NSERC Discovery Grant (RGPIN-2014-05911).
Study conception and review design were conducted by AS, JAD, and JL. AS screened titles and abstracts of references identified in the databases. AS, JAD, and JL designed the data extraction instrument. AS extracted data from the original studies and prepared the initial results. Interpretation of the results was provided by all authors and study supervision was provided by JAD and JL. All authors contributed in writing the manuscript and approved the final version of the review.
None declared.