This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Asthma affects 9% of Americans and incurs US $56 billion in costs, 439,000 hospitalizations, and 1.8 million emergency room visits annually. A small fraction of asthma patients with high vulnerability, severe disease, or great barriers to care consumes most health care costs and resources. An effective approach is urgently needed to identify these high-risk patients and intervene early to improve outcomes and to reduce costs and resource use. Care management is widely used to implement tailored care plans for this purpose, but it is expensive and has limited service capacity. To maximize benefit, we should enroll only the patients anticipated to incur the highest costs or have the worst prognosis. Effective care management requires correctly identifying high-risk patients, but current patient identification approaches have major limitations. This paper pinpoints these limitations and outlines multiple machine learning techniques to address them, providing a roadmap for future research.
Asthma affects 9% of Americans [
Almost all private health plans provide, and most major employers purchase, care management services that implement tailored care plans with early interventions for high-risk patients to avoid high costs and health status degradation [
Although widely used, care management has costs of its own and can require more than US $5000 per patient per year [
Current predictive models for individual patient costs and health outcomes exhibit poor accuracy, causing misclassification, and need improvement. When projecting individual patient cost, the
Existing predictive models have low accuracy for multiple reasons, which include the following:
Although several dozen risk factors for adverse outcomes in asthma are known [
As with many diseases, many features (also known as independent variables that include both raw and transformed variables) predictive of adverse outcomes in asthma have likely not been identified. For instance, using a data-driven approach to find new predictive features from many variables in EMRs, Sun et al [
Existing models mainly use patient features only, presuming that each patient’s cost and health outcomes relate only to the patient’s characteristics and are unassociated with characteristics of the health care system (eg, the treating physician and facility). However, system features are known to be influential, have larger impacts on patients with the worst outcomes, and can account for up to half of the variance in their outcomes in certain cases [
Applying care management to a patient tends to improve the patient’s health outcomes and reduce the patient’s cost, excluding the cost of care management itself. Yet existing models omit care management enrollment as a factor.
A health care system often has limited training data, whereas a model’s accuracy generally increases with more training data. Different systems have differing data distributions [
Unlike physicians, who see patients regularly, care managers often have had no prior interaction with a patient and are unfamiliar with the patient’s medical history when they need to make enrollment decisions. They need to understand why a patient is forecasted to be at high risk before allocating the patient to care management and creating a tailored care plan, but they have limited time to review extensive patient records with many variables, possibly accumulated over a long time and often spanning hundreds of pages [
Existing predictive models provide limited help in creating tailored care plans. An intervention targeting the reason underlying the high risk is likely to be more effective than nonspecific ones. A patient can be at high risk for several reasons. A care manager may develop a tailored care plan for a patient using variable and subjective clinical judgment, but may miss certain suitable interventions for the following reasons. First, many features exist. As is true for any human, a typical care manager can process no more than 9 information items at once [
An asthma patient’s risk changes over time, whereas a care management program can enroll only a limited number of patients. To make the best use of the program, all patients remaining in the health plan are periodically reevaluated for risk, for example, annually. Patients who are in the program but are now predicted to be at low risk are moved off the program to make room for those who were previously at low risk but are now at high risk. Doing this properly requires answering intervention queries via causal inference [
New techniques are needed to identify more high-risk asthma patients and provide appropriate care. Besides those proposed in our paper [
Many risk factors for adverse asthma outcomes are known [
Many features predictive of adverse outcomes in asthma have not yet been identified. To find new predictive features, we use data from many asthma patients and a data-driven approach to explore many standard patient features listed in the clinical predictive modeling literature [
To consider their impact, we include health care system features in building models for predicting individual patient costs and health outcomes. For each health care system level, such as physician or facility, we construct a profile containing its own information (eg, facility hours) and aggregated historical data of its patients (omitting the index patient) extracted from the provider’s administrative and EMR systems. The count of the physician’s asthma patients [
Some system features are computed using only system profile variables. Our paper [
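As a concrete illustration of the aggregation described above, the sketch below computes one hypothetical physician-level profile feature, the mean emergency department visits of the physician's patients, omitting the index patient (a leave-one-out aggregate). The table layout and column names are assumptions for illustration, not the paper's actual schema.

```python
import pandas as pd

# Toy visit-history table; the column names are hypothetical, not the
# paper's actual schema.
df = pd.DataFrame({
    "patient_id":   [1, 2, 3, 4, 5],
    "physician_id": ["A", "A", "A", "B", "B"],
    "ed_visits":    [2, 0, 4, 1, 3],
})

grp = df.groupby("physician_id")["ed_visits"]
total = grp.transform("sum")    # per-physician total ED visits
count = grp.transform("count")  # per-physician patient count

# Leave-one-out mean: aggregate over the physician's other patients,
# omitting the index patient, as the profile construction requires.
df["physician_mean_ed_visits_loo"] = (total - df["ed_visits"]) / (count - 1)
```

The same subtract-self pattern extends to any aggregate that decomposes over patients (counts, sums, rates) and to facility-level profiles.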
The standard approach for predicting individual patient costs or health outcomes in asthma is to build a model using only asthma patient data. In the presence of many features, we may not have enough asthma patients to train the model and to obtain high prediction accuracy. To address this issue, we add a binary indicator feature for asthma and train the model on all patients, not just asthma patients. Asthma patients and other patients share many features in common. We can better tune these features’ coefficients in the model by using all patients.
To consider care management’s impact on costs and health outcomes, we add a binary indicator feature for care management enrollment when building models for predicting individual patient costs and health outcomes [
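The two indicator-feature ideas above, a binary asthma indicator with the model trained on all patients and a binary care management enrollment indicator, can be sketched as follows with scikit-learn. The features, effect sizes, and simulated outcome are hypothetical; the point is only how the indicators enter the model alongside shared features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Shared features present for all patients, asthma or not.
age = rng.normal(50, 15, n)
prior_ed_visits = rng.poisson(1.0, n)

# Binary indicator features: asthma diagnosis and care management enrollment.
is_asthma = rng.integers(0, 2, n)
on_care_mgmt = rng.integers(0, 2, n)

# Hypothetical outcome: high risk driven by the shared features, raised by
# asthma, and lowered by care management enrollment.
logit = -3 + 0.02 * age + 0.8 * prior_ed_visits + 1.0 * is_asthma - 0.7 * on_care_mgmt
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, prior_ed_visits, is_asthma, on_care_mgmt])
model = LogisticRegression(max_iter=1000).fit(X, y)
```

Training on all patients lets every record help fit the shared-feature coefficients, while the indicators' coefficients capture the asthma-specific and care-management-specific effects.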
To address limited training data and improve model accuracy for the target health care system, we perform transfer learning using trained models from other source systems. Organizations are usually more willing to share their trained models than their raw data. Publications often describe trained models in detail. A model trained using a source system’s data encodes much information useful for the prediction task on the target system, particularly if the source system has abundant training data. Our transfer learning approach can handle all kinds of features, prediction targets, and models used in the source and target systems. Our approach can potentially improve model accuracy regardless of the amount of training data available at the target system. Even if the target system has enough training data overall, it may not have enough training data for a particular pattern. A trained model from a source system can help compensate if the source system contains enough training data for that pattern.
Different health care systems use differing schemas, medical coding systems, and medical terminologies. To enable the application of a model trained using a source system’s data to the target system’s data, we convert the datasets of every source system and the target system into the same common data model (eg, OMOP [
An illustration of our transfer learning approach.
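The transfer learning approach is described here only at a high level, so the sketch below shows one generic scheme consistent with it, not the authors' exact method: the source system shares a trained model rather than raw data, and the target model uses the source model's predicted probability as an extra feature. The cohorts, features, and distribution shift are all simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n, shift):
    """Toy cohort; `shift` mimics the distribution difference across systems."""
    X = rng.normal(shift, 1.0, size=(n, 3))
    logit = X @ np.array([1.0, -0.5, 0.8])
    y = rng.random(n) < 1 / (1 + np.exp(-logit))
    return X, y

# The source system has ample data; the target system has little.
X_src, y_src = make_data(5000, shift=0.0)
X_tgt, y_tgt = make_data(200, shift=0.3)

# The source system shares its trained model, not its raw data.
source_model = LogisticRegression(max_iter=1000).fit(X_src, y_src)

# One simple transfer scheme: feed the source model's predicted
# probability to the target model as an extra feature.
extra = source_model.predict_proba(X_tgt)[:, 1].reshape(-1, 1)
X_tgt_aug = np.hstack([X_tgt, extra])
target_model = LogisticRegression(max_iter=1000).fit(X_tgt_aug, y_tgt)
```

If the source and target distributions are similar, the extra feature carries most of the signal and the small target dataset only needs to learn a correction.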
To improve prediction accuracy, it is desirable to use machine learning to construct models for predicting individual patient costs and health outcomes [
Each patient has the same set of patient and health care system features and is marked as high or not high risk. Our method mines from historical data class-based association rules related to high risk. Each rule’s left hand side is the conjunction of one or more feature-value pairs. An example rule is as follows: the patient had two or more urgent care visits for asthma last year AND lives 15 miles or more away from the patient’s physician → high risk. Through discussion and consensus, the clinicians in the automatic explanation function’s design team check mined rules and drop those making little or no clinical sense. For every rule kept, the clinicians enumerate zero or more interventions targeting the reason the rule shows. At prediction time, for each patient the predictive model identifies as high risk, we find and display all rules of which the patient fulfills the left hand side conditions. Every rule shows a reason why the patient is projected to be at high risk.
Our method can find a new type of risk factor, termed conditional risk factors, which raise a patient’s risk only when certain other variables are also present and which can be used to design tailored interventions. This broadens the scope of risk factors, as ordinary risk factors raise risk regardless of other variables. Our method can automatically find appropriate cut-off thresholds for numerical variables and inform new interventions based on objective data. For instance, for the aforementioned association rule, our method would automatically find the cut-off thresholds of two in the number of urgent care visits and 15 miles in distance. Then we map all the patients who satisfy the rule’s left hand side conditions and have adverse outcomes in the next year. For the intervention of opening new primary care clinics, this informs the new clinics’ locations by maximizing the number of these patients living less than 15 miles away. A cost-benefit analysis can determine whether adopting this intervention is worthwhile.
For each association rule related to high risk, the proportion of all patients who are at high risk and satisfy the rule’s left hand side is called the rule’s support, which shows the rule’s coverage. Among all patients fulfilling the rule’s left hand side, the proportion at high risk is called the rule’s confidence, which shows the rule’s accuracy. Our method discretizes each numerical feature into a categorical one and mines rules exceeding some predefined minimum support
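A minimal sketch of mining class-based association rules under these support and confidence definitions, using toy feature-value pairs modeled on the example rule above; the records, thresholds, and pair names are illustrative:

```python
from itertools import combinations

# Toy records: each patient is a set of (already discretized) feature-value
# pairs plus a high-risk label; names are illustrative.
patients = [
    ({"urgent_visits>=2", "distance>=15mi"}, True),
    ({"urgent_visits>=2", "distance>=15mi"}, True),
    ({"urgent_visits>=2"}, True),
    ({"distance>=15mi"}, False),
    ({"urgent_visits>=2", "distance>=15mi"}, False),
    (set(), False),
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.2, 0.6
n_total = len(patients)
items = sorted(set().union(*(fv for fv, _ in patients)))

rules = []
for size in (1, 2):
    for lhs in combinations(items, size):
        lhs = set(lhs)
        match = [hr for fv, hr in patients if lhs <= fv]
        hits = sum(match)                  # patients matching lhs AND at high risk
        support = hits / n_total           # rule coverage
        confidence = hits / len(match) if match else 0.0  # rule accuracy
        if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
            rules.append((lhs, support, confidence))
```

In this toy dataset, the rule with left hand side {distance>=15mi} alone is dropped for low confidence, while the rules involving two or more urgent care visits survive both thresholds.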
Consider all of the association rules related to high risk and satisfying all conditions above except for the last one. If a feature-value pair is specified by the automatic explanation function’s designers as unrelated to high risk but appears in many of these rules, the designers can examine the pair in detail and determine the following [
Whether the pair is associated with a surrogate condition related to high risk. This helps us understand the subtleties in the data and how they affect machine learning. Sometimes, we can use the information to design new interventions targeting the surrogate condition. For instance, suppose the pair is that the patient had two outpatient visits for asthma last year and the associated surrogate condition is noncompliance coupled with high vulnerability, for example, because of genetics or working environment. For each rule related to high risk and whose left hand side contains the pair and indicates the surrogate condition (eg, by mentioning that the patient had at least two hospitalizations for asthma last year), we keep the rule, inspect the patients satisfying the rule’s left hand side, and arrange regular phone checks for some of them.
Whether the feature is uninformative. Retraining the predictive model after dropping the feature can possibly serve as a new way to improve model accuracy and make the model generalize better to other health care systems beyond the one in which it was developed. Ribeiro et al [
A health care system often has limited training data, which degrades model accuracy. To improve model accuracy, we can enlarge the training set by generating synthetic data instances:
Using historical data from the target or other source systems, we mine another set
The clinicians specify some rules related to high risk based on medical knowledge. Each rule is used to generate multiple synthetic data instances in a way similar to above. Alternatively, we can use these rules and the predictive model together at prediction time. We use these rules to identify a subset of high-risk patients and apply the predictive model to the other patients not satisfying the left hand side of any of these rules.
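One simple way to realize the synthetic-instance idea is sketched below, under the assumption that a kept high-risk rule fixes some feature values and the remaining features are sampled from real high-risk patients; the record fields and values are hypothetical:

```python
import random

random.seed(0)

# A kept high-risk rule: its left hand side fixes two feature values
# (a simplification; real rules use conditions such as ">= 2 visits").
rule_lhs = {"urgent_visits_last_year": 3, "distance_to_physician_mi": 20}

# Real high-risk patients (hypothetical records) supply realistic values
# for the remaining features.
real_high_risk = [
    {"urgent_visits_last_year": 2, "distance_to_physician_mi": 16, "age": 34, "num_meds": 3},
    {"urgent_visits_last_year": 4, "distance_to_physician_mi": 25, "age": 61, "num_meds": 5},
]

def make_synthetic(n):
    """Generate n synthetic high-risk instances: left-hand-side features come
    from the rule; the rest are copied from sampled real high-risk patients."""
    out = []
    for _ in range(n):
        donor = random.choice(real_high_risk)
        rec = dict(donor)
        rec.update(rule_lhs)       # enforce the rule's conditions
        rec["high_risk"] = True    # label implied by the rule
        out.append(rec)
    return out

synthetic = make_synthetic(5)
```

The synthetic instances are then appended to the training set before the predictive model is retrained.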
Using synthetic data instances to improve model accuracy has been done before, for example, for images [
The mined association rules
More specifically, during association rule mining, some rules are found and then removed because they fall below the predefined minimum support
Unlike the rules in the set
To provide causal inference capability, we need to estimate the impact of care management on a patient’s cost or health outcome. We use this estimate to adjust the cost or health outcome threshold for deciding whether a patient on care management should be moved off care management. Propensity score matching is one technique for doing this on observational data [
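Propensity score matching can be sketched as follows on simulated observational data: fit a model of enrollment given confounders, match each enrolled patient to the unenrolled patient with the nearest propensity score, and average the outcome differences. The 1:1 greedy matching with replacement shown here is one of several common variants, and all data-generating numbers are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000

# Toy observational data: X are confounders, t is care management
# enrollment, y is next-year cost (lowered by enrollment in this simulation).
X = rng.normal(size=(n, 2))
t = rng.random(n) < 1 / (1 + np.exp(-(X @ np.array([1.0, -0.5]))))
y = 5000 + 2000 * X[:, 0] - 1500 * t + rng.normal(0, 200, n)

# Step 1: propensity score = P(enrollment | confounders).
ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# Step 2: match each enrolled patient to the unenrolled patient with the
# closest propensity score (greedy 1:1 nearest neighbor, with replacement).
treated, control = np.where(t)[0], np.where(~t)[0]
matches = control[np.abs(ps[control][None, :] - ps[treated][:, None]).argmin(axis=1)]

# Step 3: estimated effect of care management on cost for the enrolled
# (the average treatment effect on the treated).
att = (y[treated] - y[matches]).mean()
```

The resulting estimate can then feed the threshold adjustment described above for deciding whether a patient should stay on care management.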
Care management is broadly used to improve asthma outcomes and cut costs, but current high-risk patient identification methods have major limitations that degrade its effectiveness. This paper pinpoints these limitations and outlines multiple machine learning techniques to address them, offering a roadmap for future research. Beyond asthma, care management is also broadly adopted in managing patients with diabetes, heart disease, or chronic obstructive pulmonary disease [
EMR: electronic medical record
The authors would like to thank Bryan L Stone, Flory L Nkoy, Corinna Koebnick, Xiaoming Sheng, Maureen A Murtaugh, Sean D Mooney, Adam B Wilcox, Peter Tarczy-Hornoch, Rosemarie Fernandez, Philip J Brewster, and David E Jones for helpful discussions. Gang Luo was partially supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health under Award Number R21HL128875. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
GL was mainly responsible for the paper. He performed the literature review, conceptualized the presentation approach, and drafted the manuscript. KAS gave feedback on various medical issues and revised the manuscript. Both authors read and approved the final manuscript.
None declared.