A Roadmap for Boosting Model Generalizability for Predicting Hospital Encounters for Asthma

In the United States, ~9% of people have asthma. Each year, asthma incurs high healthcare cost and many hospital encounters covering 1.8 million emergency room visits and 439,000 hospitalizations. A small percentage of patients with asthma use most healthcare resources. To improve outcomes and cut resource use, many healthcare systems use predictive models to prospectively find high-risk patients and enroll them in care management for preventive care. For maximal benefit from costly care management with limited service capacity, only patients at the highest risk should be enrolled. Yet, prior models built by others miss >50% of true highest-risk patients and mislabel many low-risk patients as high risk, leading to suboptimal care and wasted resources. To address this issue, we recently built three site-specific models to predict hospital encounters for asthma and gained up to 11%+ better performance. But, these models do not generalize well across sites and patient subgroups, creating two gaps before translating these models into clinical use. This paper points out these two gaps and outlines two corresponding solutions: a) a new machine learning technique to create cross-site generalizable predictive models to accurately find high-risk patients, and b) a new machine learning technique to automatically raise model performance for poorly performing subgroups while maintaining model performance on other subgroups. This gives a roadmap for future research.


Introduction
Asthma care management and our prior work on predictive modeling
In the United States, ~9% of people have asthma [1][2][3]. Each year, asthma incurs $56 billion in healthcare costs [4] and many hospital encounters, including 1.8 million emergency room visits and 439,000 hospitalizations [1]. As is the case with many chronic diseases, a small percentage of patients with asthma use most healthcare resources [5,6]. The top 1% of patients account for 25% of healthcare costs, and the top 20% account for 80% [5,7]. An effective approach is urgently needed to prospectively identify high-risk patients and intervene early to avoid health decline, improve outcomes, and cut resource use. Most major employers purchase and nearly all private health plans offer care management services for preventive care [8][9][10]. Care management is a collaborative process to assess, coordinate, plan, implement, evaluate, and monitor the services and options to meet the health and service needs of a patient [11]. A care management program employs care managers to call patients regularly to assess their status, arrange doctor appointments, and coordinate health-related services. Proper use of care management can reduce hospital encounters by up to 40% [10,12][13][14][15][16][17]; lower healthcare costs by up to 15% [13][14][15][16][17][18]; and improve patient satisfaction, quality of life, and adherence to treatment by 30-60% [12]. Care management can cost >$5,000 per patient per year [13] and normally enrolls no more than 3% of patients [7] due to resource limits.
Correctly finding high-risk patients to enroll is crucial for effective care management. Currently, the best method to identify high-risk patients is to use models to predict each patient's risk [19]. Many health plans, such as those in 9 of 12 metropolitan communities [20], and many healthcare systems [21] use this method for care management. For patients predicted to have the highest risk, care managers manually review patients' medical records, consider factors like social dimensions, and make enrollment decisions. Yet, prior models built by others miss >50% of true highest-risk patients and mislabel many low-risk patients as high risk [5,12,22][23][24][25][26][27][28][29][30][31][32][33][34][35][36]. This makes enrollment align poorly with patients who would benefit most from care management [12], leading to suboptimal care and higher costs. As the patient population is large, a small boost in model performance will benefit many patients and produce a large positive impact. Of the top 1% of patients with asthma who would incur the highest costs, for every 1% more whom we could find and enroll, we could save up to $21 million more in asthma care every year as well as improve outcomes [5,26,27].
To address the issue of low model performance, we recently built three site-specific models to predict whether a patient with asthma would incur any hospital encounter for asthma in the subsequent 12 months, one model for each of three healthcare systems: the University of Washington Medicine (UWM), Intermountain Healthcare (IH), and Kaiser Permanente Southern California (KPSC) [21,37,38]. Each prior model that others built for a comparable outcome [5,26][27][28][29][30][31][32][33][34] had an area under the receiver operating characteristic curve (AUC) that was ≤0.79 and a sensitivity that was ≤49%. Our models raised the AUC to 0.9 and the sensitivity to 70% on the UWM data [21], the AUC to 0.86 and the sensitivity to 54% on the IH data [37], and the AUC to 0.82 and the sensitivity to 52% on the KPSC data [38].
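To make the two reported metrics concrete, the following minimal sketch computes an AUC and a sensitivity for a list of predicted risk scores. It is an illustration only, not the evaluation code used in the cited studies. The function names and the rule of flagging a fixed top fraction of patients as high risk are assumptions made for this example, since this paper does not state the cutoff used.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_at_top_fraction(labels, scores, fraction):
    """Flag the top `fraction` of patients by predicted risk (an assumed rule)
    and return the share of true positives that were flagged."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive instance")
    n_flag = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = sum(labels[i] for i in ranked[:n_flag])
    return true_pos / total_pos


# Toy data: 1 = the patient had a hospital encounter for asthma, 0 = not.
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05, 0.5]
print(auc(labels, scores))
print(sensitivity_at_top_fraction(labels, scores, 0.3))
```

The pairwise AUC is quadratic in the number of patients, which is acceptable for a small illustration but not for a full cohort.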
Our eventual goal is to translate our models into clinical use. Yet, despite major progress, our models do not generalize well across sites and patient subgroups, and two gaps remain.
Gap 1: The site-specific models have suboptimal generalizability when applied to the other sites
Each of our models was built for one site. As is typical in predictive modeling [39,40], when applied to the other sites, the site-specific model had AUC drops of up to 4.1% [38], potentially degrading care management enrollment decisions. One can do transfer learning using other source healthcare systems' raw data to boost model performance for the target healthcare system [41][42][43][44][45], but healthcare systems are seldom willing to share raw data. Research networks [46][47][48] mitigate the problem but do not solve it: many healthcare systems are not in any network, and healthcare systems within a network share raw data on only a limited set of attributes. Our prior model-based transfer learning approach [49] requires no raw data from other healthcare systems. However, it does not control the number of features (independent variables) used in the final model for the target site, which makes it difficult to build the final model for clinical use. Consequently, it has never been implemented in computer code.
Gap 2: The models exhibit large performance gaps when applied to specific patient subgroups
Our models performed up to 8% worse on black patients. This is a typical barrier in machine learning, where many models exhibit large subgroup performance gaps, e.g., of up to 38% [50][51][52][53][54][55][56][57]. No existing tool for auditing model bias and fairness [58,59] has been applied to our models. Currently, it is unknown how our models perform on key patient subgroups defined by independent variables such as race, ethnicity, and insurance type. Large performance gaps among patient subgroups can lead to care inequity and should be avoided.
Many methods to improve fairness in machine learning exist [50][51][52]. These methods usually boost model performance on some subgroups at the price of lowering both model performance on others and the overall model performance [50][51][52]. Lowering the overall model performance is undesired [51,57]. Due to the large patient population, even a 1% drop in the overall model performance could potentially degrade many patients' outcomes. Chen et al. [57] cut model performance gaps among subgroups by collecting more training data and adding additional features, both of which are often difficult or infeasible to do. For classifying images via machine learning, Goel et al.'s method [55] raised the overall model performance and also cut model performance gaps among subgroups of a value of the dependent variable, not among subgroups defined by independent variables. The dependent variable is also known as the outcome or the prediction target. An example of the dependent variable is whether a patient with asthma will incur any hospital encounter for asthma in the subsequent 12 months. The independent variables are also known as features. Race, ethnicity, and insurance type are three examples of independent variables. Many machine learning techniques to handle imbalanced classes exist [60,61]. There, subgroups are defined by the dependent variable rather than by independent variables.

Contributions of this paper
To fill the two gaps on suboptimal model generalizability and let more high-risk patients obtain appropriate and equitable preventive care, the paper makes two contributions giving a roadmap for future research: 1) To address the first gap, we outline a new machine learning technique to create cross-site generalizable predictive models to accurately find high-risk patients. This is to cut model performance drop across sites. 2) To address the second gap, we outline a new machine learning technique to automatically raise model performance for poorly performing subgroups while maintaining model performance on other subgroups. This is to cut model performance gaps among patient subgroups and reduce care inequity. In the following, we describe the main ideas of our proposed new machine learning techniques.

Machine Learning Technique for Creating Cross-site Generalizable Predictive Models to Accurately Find High-risk Patients
Our prior models
In our prior work [21,37,38], for each of the three healthcare systems (sites) KPSC, IH, and the UWM, we checked >200 candidate features and used the site's data to build a full site-specific extreme gradient boosting (XGBoost) model to predict hospital encounters for asthma. XGBoost [62] automatically chose from the candidate features the features to be used in the model, computed their importance values, and ranked them in descending order of these values. The top (~20) features with importance values ≥1% have nearly all of the predictive power of all (on average ~140) features used in the model [21,37,38]. Although some lower-ranked features are unavailable at other sites, each top feature, such as the number of the patient's asthma-related emergency room visits in the prior 12 months, is computed using attributes (e.g., diagnosis and encounter attributes) routinely collected by almost every American healthcare system that uses electronic medical records. Using the top features and the site's data, we built a simplified XGBoost model. Unlike the full model, the simplified model can be applied to other sites. At the site where it was built, the simplified model performed similarly to the full model. However, when applied to another site, even after being re-trained on that site's data, the simplified model performed up to 4.1% worse than the full model built specifically for that site, as distinct sites have only partially overlapping top features [21,37,38].
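The selection of top features by an importance threshold can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that importance values are expressed as shares of the total importance, so that the ≥1% threshold becomes a share of at least 0.01. The function name, the feature names, and the raw importance values are hypothetical.

```python
def top_features(importances, min_share=0.01):
    """Rank features by importance share and keep those at or above min_share.

    `importances` maps a feature name to a non-negative raw importance value,
    e.g., as reported by a trained gradient boosting model.
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importance values must sum to a positive number")
    shares = {name: value / total for name, value in importances.items()}
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [(name, share) for name, share in ranked if share >= min_share]


# Hypothetical raw importance values for four candidate features.
raw = {
    "asthma_ed_visits_12m": 40.0,
    "age": 30.0,
    "num_asthma_medications": 29.5,
    "rare_diagnosis_code": 0.5,
}
for name, share in top_features(raw):
    print(name, round(share, 3))
```

In this toy example, the last feature has a share of 0.5% and is excluded from the simplified model's feature set.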

Building cross-site generalizable models
To ensure that the same variable has the same name at different sites and that its content is recorded in the same way across these sites, we convert the data sets at all source sites and the target site into the Observational Medical Outcomes Partnership (OMOP) common data model [63] and its linked standardized terminologies [64]. If needed, we extend the data model to cover variables that exist in the data sets but are not included in the original data model.
Our goal is to build cross-site generalizable models fulfilling two conditions. First, the model uses a moderate number of features. Controlling the number of features used in the model would ease future clinical deployment of the model. Second, a separate component or copy of the model is initially built at each source site. When applied to the target site and possibly after being re-trained on its data, the model performs similarly to the full model built specifically for it.
To reach our goal for the case of IH and the UWM being the source sites and KPSC being the target site, we proceed in two steps (Figure 1). In step 1, we combine the top features found at each source site. For each source site, we use the combined top features, its data, and the machine learning algorithm used to build its full model to build an expanded simplified model. Compared with the original simplified model built for the site, the expanded simplified model uses more features with predictive power and tends to generalize better across sites. In step 2, we do model-based transfer learning to further boost model performance. For each data instance of the target site, we apply each source site's expanded simplified model to the data instance, compute a prediction result, and use it as a new feature. For the target site, we use its data, the combined top features found at the source sites, and the new features to build its final model.
To reach our goal for the case that IH or the UWM is the target site and KPSC is one of the source sites, we need to address the issue that the claim-based features used at KPSC [38] are unavailable at IH, the UWM, and many other healthcare systems with no claim data. At KPSC, we drop these features and use the other candidate features to build a site-specific model and re-compute the top features. This helps ensure that the top features found at each of KPSC, IH, and the UWM are available at all three sites and at almost every other American healthcare system that uses electronic medical record systems. In the unlikely case that any re-computed top feature at KPSC violates this, we skip the feature when building cross-site generalizable models.
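The two steps described above can be sketched in a few lines. This is a minimal illustration, not an implementation of the proposed technique: each source site's expanded simplified model is represented by an arbitrary callable that maps a feature dictionary to a predicted risk, and the toy models, feature names, and function names are all hypothetical. Training the final model on the augmented rows is left to whichever learning algorithm the target site uses.

```python
def combine_top_features(top_by_site):
    """Step 1: take the union of the top features found at the source sites."""
    combined = []
    for site in sorted(top_by_site):
        for feature in top_by_site[site]:
            if feature not in combined:
                combined.append(feature)
    return combined


def add_source_predictions(target_rows, source_models, combined):
    """Step 2: apply each source site's model to every target-site instance
    and attach its prediction as a new feature."""
    augmented = []
    for row in target_rows:
        base = {f: row[f] for f in combined}  # combined top features only
        new_row = dict(base)
        for site in sorted(source_models):
            new_row["pred_" + site] = source_models[site](base)
        augmented.append(new_row)
    return augmented


# Hypothetical top features and toy stand-ins for the source sites' models.
top_by_site = {
    "IH": ["ed_visits_12m", "age"],
    "UWM": ["ed_visits_12m", "num_meds"],
}
source_models = {
    "IH": lambda x: min(1.0, 0.25 * x["ed_visits_12m"]),
    "UWM": lambda x: min(1.0, 0.125 * x["num_meds"]),
}
target_rows = [{"ed_visits_12m": 2, "age": 40, "num_meds": 4, "site_only": 1}]

combined = combine_top_features(top_by_site)
print(add_source_predictions(target_rows, source_models, combined))
```

Only trained models cross site boundaries in this sketch. The target site never sees the source sites' raw data, which is the property the approach relies on.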
Our method to build cross-site generalizable models can handle all kinds of prediction targets, features, and models used at the source and target sites. Given a distinct prediction target, if some top features found at a source site are unavailable at many American healthcare systems using electronic medical record systems, we can use the drop→re-compute→skip approach shown above to handle these features. Also, at any source site, if the machine learning algorithm used to build the full site-specific model is like XGBoost [62] or random forest that automatically computes feature importance values, we can use the top features with the highest importance values. Otherwise, if the algorithm used to build the full model does not automatically compute feature importance values, we can use an automatic feature selection method [65] like the information gain method to choose the top features. Alternatively, we can use XGBoost or random forest to build a model, automatically compute feature importance values, and choose the top features with the highest importance values.
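The drop→re-compute→skip approach can be sketched as a small function. This is an illustration under assumptions, not the authors' code: `rank_features` is a hypothetical callback that stands in for re-training a site-specific model on the kept features and returning each feature's importance share, and the feature names, the availability set, and the 0.01 threshold are made up for the example.

```python
def portable_top_features(candidates, dropped, widely_available,
                          rank_features, min_share=0.01):
    """Return the top features that can be used at other sites.

    candidates:       all candidate features at the source site
    dropped:          features known up front to be unavailable elsewhere
                      (e.g., claim-based features)
    widely_available: features available at most other sites
    rank_features:    hypothetical callback; given a feature list, it
                      re-trains a model and returns {feature: share}
    """
    # Drop: remove features known to be unavailable elsewhere.
    kept = [f for f in candidates if f not in dropped]
    # Re-compute: re-train on the kept features and re-rank them.
    shares = rank_features(kept)
    ranked = sorted(shares, key=shares.get, reverse=True)
    top = [f for f in ranked if shares[f] >= min_share]
    # Skip: leave out any re-computed top feature that is still not portable.
    return [f for f in top if f in widely_available]


def toy_rank(features):
    """Toy stand-in for model re-training with fixed importance shares."""
    weights = {"ed_visits_12m": 0.5, "age": 0.3,
               "local_program_flag": 0.195, "noise": 0.005}
    return {f: weights[f] for f in features}


candidates = ["ed_visits_12m", "claim_total_cost", "age",
              "local_program_flag", "noise"]
print(portable_top_features(
    candidates,
    dropped={"claim_total_cost"},
    widely_available={"ed_visits_12m", "age", "noise"},
    rank_features=toy_rank,
))
```

The skip step does not trigger another re-training in this sketch. Whether re-training after a skip would be worthwhile is a design choice the text leaves open.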
Our new model-based transfer learning approach waives the need for source sites' raw data. Healthcare systems are more willing to share with others trained models than raw data. A model trained using the data of a source site contains much information that is useful for the prediction task at the target site. This information offers much value when the target site has insufficient data for model training. If the target site is large, this information can still be valuable. Distinct sites have differing data pattern distributions. A pattern that matches a small percentage of patients and is difficult to identify at the target site could match a larger percentage of patients and be easier to identify at one of the source sites. In this case, its expanded simplified model could incorporate the pattern through model training to better predict the outcomes of certain types of patients, which is difficult to do using only the information from the target site but no information from the source sites. Thus, we expect that compared with just re-training a source site's expanded simplified model on the target site's data, doing model-based transfer learning in step 2 could lead to a better performing final model for the target site. When the target site goes beyond IH, the UWM, and KPSC, we can use IH, the UWM, and KPSC as the source sites and have more top features to combine. This would make our cross-site models generalize even better.

Machine Learning Technique for Automatically Raising Model Performance for Poorly Performing Patient Subgroups while Maintaining Model Performance on Other Subgroups to Reduce Care Inequity
We ask several clinical experts to identify several patient subgroups of great interest to clinicians (e.g., by race, ethnicity, or insurance type) through discussion. These subgroups are not necessarily mutually exclusive of each other. Each subgroup is defined by one or more attribute values. Given a predictive model built on a training set, we compute and show model performance on each subgroup on the test set [58,59]. Machine learning needs enough training data to work well. Often, the model performs much worse on a small subgroup than on a large subgroup [50,52]. After identifying one or more target subgroups where the model performs much worse than on other subgroups [51], we use a new dual-model approach to raise model performance on the target subgroups while maintaining model performance on other subgroups.
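Computing model performance on each subgroup can be sketched as follows. This is a minimal illustration and not one of the cited auditing tools [58,59]. It assumes that each subgroup is defined by a dictionary of required attribute values, that the AUC is the performance measure, and that the attribute names and values are hypothetical. Subgroups may overlap, as noted above.

```python
def auc_or_none(labels, scores):
    """Pairwise AUC; None if the subgroup lacks positives or negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def in_subgroup(row, definition):
    """A patient is in a subgroup if every listed attribute value matches."""
    return all(row.get(attr) == value for attr, value in definition.items())


def audit(rows, labels, scores, subgroups):
    """Report the size and the test-set AUC of each subgroup."""
    report = {}
    for name, definition in subgroups.items():
        idx = [i for i, row in enumerate(rows) if in_subgroup(row, definition)]
        report[name] = {
            "n": len(idx),
            "auc": auc_or_none([labels[i] for i in idx], [scores[i] for i in idx]),
        }
    return report


# Toy test set with hypothetical attribute values.
rows = [
    {"insurance": "public", "race": "A"}, {"insurance": "public", "race": "A"},
    {"insurance": "public", "race": "B"}, {"insurance": "public", "race": "B"},
    {"insurance": "private", "race": "A"}, {"insurance": "private", "race": "A"},
    {"insurance": "private", "race": "B"}, {"insurance": "private", "race": "B"},
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1, 0.7, 0.3]
subgroups = {"public": {"insurance": "public"}, "race_B": {"race": "B"}}
print(audit(rows, labels, scores, subgroups))
```

Reporting the subgroup size next to the AUC matters because a poor AUC on a very small subgroup may reflect noise rather than a true performance gap.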
More specifically, given n target patient subgroups, we sort them as Gi (1≤i≤n) in ascending order of size and oversample them based on n integers ri (1≤i≤n) satisfying r1≥r2≥…≥rn>1. As Figure 2 shows, for each training instance in G1, we make r1 copies of it including itself. For each training instance in $G_j - \bigcup_{i=1}^{j-1} G_i$ (2≤j≤n), we make rj copies of it including itself. Intuitively, the smaller the i (1≤i≤n) and thus Gi, the more aggressive oversampling is needed on Gi for machine learning to work well on it. The sorting ensures that if a training instance appears in ≥2 target subgroups, copies are made for it based on the largest ri of these subgroups. If needed, we could use one set of ri's for training instances with bad outcomes, and another set of ri's for training instances with good outcomes [66]. Let $G = \bigcup_{i=1}^{n} G_i$ denote the union of the n target subgroups. Using the training instances outside G, the copies made for the training instances in G, and an automatic machine learning model selection method like our formerly developed one [67], we optimize the AUC on G, automatically select the values of ri (1≤i≤n), and train a second model. As is typical in using oversampling to improve fairness in machine learning, compared with the original model, the second model tends to perform better on G and worse on the patients outside G [51,66] because oversampling increases the percentage of training instances in G and decreases the percentage of training instances outside G. To avoid running into the case of having insufficient data for model training, no undersampling is performed on the training instances outside G. We use the original model to make predictions on the patients outside G, and the second model to make predictions on the patients in G.
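The oversampling rule above can be sketched as follows. This is a minimal illustration under assumptions, not an implementation from the paper: each target subgroup is represented as a set of training-instance indices, the oversampling ratios are supplied by the caller rather than selected automatically, and all names are hypothetical.

```python
def copy_counts(n_instances, subgroups, ratios):
    """Return how many copies (including the original) each instance gets.

    subgroups: list of sets of instance indices, one set per target subgroup
    ratios:    integers r_1 >= r_2 >= ... >= r_n > 1, where r_1 is applied
               to the smallest subgroup
    """
    if not ratios or len(ratios) != len(subgroups):
        raise ValueError("need exactly one ratio per subgroup")
    if any(a < b for a, b in zip(ratios, ratios[1:])) or ratios[-1] <= 1:
        raise ValueError("ratios must satisfy r_1 >= ... >= r_n > 1")
    ordered = sorted(subgroups, key=len)  # ascending order of size
    counts = [1] * n_instances            # instances outside G keep one copy
    assigned = set()
    for members, r in zip(ordered, ratios):
        # Instances in G_j but in no smaller subgroup get r_j copies.
        for i in members - assigned:
            counts[i] = r
        assigned |= members
    return counts


def oversample(instances, counts):
    """Expand the training set according to the copy counts."""
    return [x for x, c in zip(instances, counts) for _ in range(c)]


# Ten training instances and two overlapping target subgroups.
small_group = {0, 1}
large_group = {1, 2, 3, 4}
counts = copy_counts(10, [large_group, small_group], ratios=[3, 2])
print(counts)
print(len(oversample(list(range(10)), counts)))
```

Instance 1 belongs to both subgroups and receives the larger ratio, which matches the stated effect of sorting the subgroups by size.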
In this way, we can raise model performance on G without lowering either model performance on the patients outside G or the overall model performance. We use all patients' data instead of only the training instances in G to train the second model. Otherwise, the second model may perform poorly on G due to insufficient training data in G [51]. For a similar reason, we choose not to use decoupled classifiers on the target subgroups [57]. In decoupled classifiers, a separate classifier is trained for each subgroup using only that subgroup's data [51].
The above discussion focuses on the case that the original model is built on one site's data without using any other site's information. When the original model is a cross-site generalizable model built for the target site using the method in the "Building cross-site generalizable models" section and models trained at the source sites, to raise model performance on the target patient subgroups, we change the way to build the second model for the target site by proceeding in two steps (Figure 3). In step 1, we combine the top features found at each source site. Recall that G is the union of the n target subgroups. For each source site, we oversample the target subgroups in the way mentioned above; optimize the AUC on G at the source site; and use its data both in and outside G, the combined top features, and the machine learning algorithm used to build its full model to build a second expanded simplified model. In step 2, we do model-based transfer learning to incorporate useful information from the source sites. For each data instance of the target site, we apply each source site's second expanded simplified model to the data instance, compute a prediction result, and use it as a new feature. For the target site, we oversample the target subgroups in the way mentioned above; optimize the AUC on G at the target site; and use its data both in and outside G, the combined top features found at the source sites, and the new features to build the second model for it. For each i (1≤i≤n), each of the source and target sites could use a distinct oversampling ratio ri.
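At prediction time, the dual-model approach described in this section routes each patient to one of two models. The sketch below illustrates that routing rule only. It assumes that target subgroups are defined by dictionaries of attribute values and that both models are callables returning a predicted risk. The toy models, attribute names, and function names are hypothetical.

```python
def in_target_union(row, target_subgroups):
    """True if the patient belongs to G, the union of the target subgroups."""
    return any(
        all(row.get(attr) == value for attr, value in definition.items())
        for definition in target_subgroups
    )


def dual_model_predict(row, target_subgroups, original_model, second_model):
    """Use the second model for patients in G and the original model otherwise."""
    if in_target_union(row, target_subgroups):
        return second_model(row)
    return original_model(row)


# Toy stand-ins for the two trained models.
original_model = lambda row: 0.2
second_model = lambda row: 0.6
target_subgroups = [{"insurance": "public"}, {"race": "B"}]

patients = [
    {"insurance": "public", "race": "A"},   # in G through insurance type
    {"insurance": "private", "race": "B"},  # in G through race
    {"insurance": "private", "race": "A"},  # outside G
]
for patient in patients:
    print(dual_model_predict(patient, target_subgroups,
                             original_model, second_model))
```

Because patients outside G are always scored by the original model, its performance on them is unchanged by construction.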

Discussion
Predictive models differ by diseases and other factors. Yet, our proposed machine learning techniques are general and depend on no specific disease, patient cohort, or healthcare system. Given a new data set with a differing prediction target, disease, patient cohort, set of healthcare systems, or set of variables, one can use our proposed machine learning techniques to improve model generalizability across sites, as well as to boost model performance on poorly performing patient subgroups while maintaining model performance on others. For instance, we can use our proposed machine learning techniques to improve model performance for predicting other outcomes such as adherence to treatment [68] and no show [69]. This will help target resources, such as interventions to improve adherence to treatment [68] and reminders by phone calls to reduce no shows [69]. Care management is widely adopted to manage patients with chronic obstructive pulmonary disease, patients with diabetes, and patients with heart disease [6], where our proposed machine learning techniques can also be used. Our proposed predictive models are based on the OMOP common data model [63] and its linked standardized terminologies [64], which standardize administrative and clinical variables from at least 10 large healthcare systems in the United States [47,70]. Our proposed predictive models apply to those healthcare systems and others using OMOP.

Conclusions
To better identify patients likely to benefit most from asthma care management, we recently built the most accurate models to date to predict hospital encounters for asthma. But, these models do not generalize well across sites and patient subgroups, creating two gaps before translating these models into clinical use. This paper points out these two gaps and outlines two corresponding solutions, giving a roadmap for future research. The principles of our proposed machine learning techniques generalize to many other clinical predictive modeling tasks.