Can we trust the prediction model? Illustrating the importance of external validation by implementing the COVID-19 Vulnerability (C-19) Index across an international network of observational healthcare datasets

Abstract

Background: SARS-CoV-2 is straining healthcare systems globally. The burden on hospitals during the pandemic could be reduced by implementing prediction models that can discriminate between patients who require hospitalization and those who do not. The COVID-19 vulnerability (C-19) index, a model that predicts which patients will be admitted to hospital for treatment of pneumonia or pneumonia proxies, has been developed and proposed as a valuable tool for decision making during the pandemic. However, the model is at high risk of bias according to the Prediction model Risk Of Bias ASsessment Tool and has not been externally validated.

Objective: To externally validate the C-19 index across a range of healthcare settings and determine how well it predicts hospitalization due to pneumonia in COVID-19 cases.

Methods: We followed the OHDSI framework for external validation to assess the reliability of the C-19 model. We evaluated the model on two target populations: i) 41,381 patients with SARS-CoV-2 at an outpatient or emergency room visit and ii) 9,429,285 patients with influenza or related symptoms during an outpatient or emergency room visit, predicting their risk of hospitalization with pneumonia during the following 0 to 30 days. In total, we validated the model across a network of 14 databases spanning the US, Europe, Australia, and Asia.

Results: The internal validation performance of the C-19 index was a c-statistic of 0.73; calibration was not reported by the authors. When we externally validated it by transporting it to SARS-CoV-2 data, the model obtained c-statistics of 0.36, 0.53 (0.473-0.584), and 0.56 (0.488-0.636) on Spanish, US, and South Korean datasets, respectively. Calibration was poor, with the model under-estimating risk. When validated on 12 datasets containing influenza patients across the OHDSI network, the c-statistics ranged between 0.40 and 0.68.

Conclusions: The discriminative performance of the C-19 model is low for influenza cohorts and even worse among COVID-19 patients in the US, Spain, and South Korea. These results suggest that the C-19 index should not be used to aid decision making during the COVID-19 pandemic. Our findings highlight the importance of performing external validation across a range of settings, especially when a prediction model is being extrapolated to a different population. In the field of prediction, extensive validation is required to create appropriate trust in a model.

DOI: https://doi.org/10.2196/preprints.21547



Introduction

Background and Objectives
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the novel virus that causes COVID-19, is spreading quickly throughout the world and burdening healthcare systems [1]. Numerous prediction models have been developed and released to the public to aid decision making during the pandemic [2]. Many of these models aim to inform people of their risk of developing severe outcomes due to COVID-19 [3][4][5]. A recent systematic review found that all then-published models suffer from a high risk of bias along with one or more limitations, including small development datasets and a lack of external validation [2]. The COVID-19 vulnerability (C-19) index [5] is an example of a prognostic model developed to identify people susceptible to severe outcomes during COVID-19 infection. The model is potentially valuable because it aims to predict hospitalization risk in the general population [2]. The model publication is currently available as a preprint, and the model is publicly available at http://c19survey.closedloop.ai. The C-19 index aims to predict which patients will require hospitalization due to pneumonia (or proxies for pneumonia) within 3 months. The model was developed using retrospectively collected Medicare data (patients aged 65 or older) that do not contain COVID-19 patients.
In this paper we aim to show the importance of external validation and to demonstrate the feasibility, during times of urgency, of using a collaborative network for this purpose. We chose to demonstrate this with the C-19 index because it was made available to the public as a commercial product, prior to peer review, as a model that can predict COVID-19 severity, yet has not undergone any external validation. It is unknown whether it is currently being used for medical decision making, but it has been advertised as a decision-making tool. However, the process illustrated in this paper and the lessons learned are applicable to any COVID-19 prediction model. Furthermore, the C-19 model was developed using non-COVID-19 data, and there is no guarantee that a model trained on non-COVID-19 Medicare patients will perform similarly, or even adequately, in COVID-19 patients. Research has shown that a model that lacks external validation is at high risk of bias [6]. In addition, it is recommended that a model's reproducibility and transportability be assessed before it is used clinically [7]. Models must be reliable, as poor predictions can hurt decision making [2].
The Observational Health Data Sciences and Informatics (OHDSI) collaboration is a group of researchers working together to develop best practices for analyzing observational healthcare data [8]. OHDSI has developed a framework that enables timely validation of prediction models across a large number of datasets from around the globe [9]. The OHDSI network currently contains large COVID-19 cohorts from the US, Europe, and Asia. In this study we aim to demonstrate the importance of performing external validation before a model's predictions can be trusted. As a case study, we investigate the predictive performance of the C-19 index when applied to COVID-19 data from across the world. This study can inform us about the suitability of using the C-19 model to aid decision making during the COVID-19 pandemic.

Existing C-19 Models
Three models were developed in the C-19 index publication [5]. The simplest was a logistic regression with a limited number of predictors: age, sex, hospital usage, 11 comorbidities, and their age interactions. The other two models were less parsimonious gradient boosting machines with more than 500 variables, only one of which was reported.
Withholding a model makes it non-compliant with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [10] and makes external validation impossible. In this paper we chose to evaluate the simple logistic regression model, recognizing that COVID-19 prediction models are urgently needed worldwide, and parsimonious models are more readily implemented across healthcare settings.
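Based on the description above, the simple logistic regression model takes the general form below. This is a reconstruction for illustration only; the exact predictors and coefficient values are those reported in [5]:

$$\hat{p} = \operatorname{logit}^{-1}\!\left(\beta_0 + \beta_{a}\,\mathrm{age} + \beta_{s}\,\mathrm{sex} + \beta_{h}\,\mathrm{hosp} + \sum_{i=1}^{11}\left(\beta_i\,c_i + \gamma_i\,\mathrm{age}\cdot c_i\right)\right)$$

where $c_i$ indicates the presence of the $i$-th comorbidity, $\mathrm{hosp}$ denotes prior hospital usage, and $\gamma_i$ are the age interaction coefficients. As discussed later, the age main effect $\beta_a$ is negative, so for patients with no recorded comorbidities the predicted risk can decrease with age.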

Source of Data
Electronic medical record (EMR) and administrative claims databases from primary and secondary care, containing patients from Australia, Japan, the Netherlands, Spain, South Korea, and the US, were analyzed in a distributed network; they are detailed in Appendix 1, Table S1. Five datasets contained COVID-19 cases and nine did not. All datasets used in this paper were mapped into the OHDSI Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) [11]. The OMOP-CDM was developed to give researchers with diverse datasets a standard database structure, which enables analysis code and software to be shared among researchers and thereby facilitates external validation of prediction models. De-identified or pseudonymized data were obtained from routinely collected records from clinical practice. All analyses were conducted locally in a distributed network, where analysis code was sent to participating sites and only aggregate summary statistics were returned, with no sharing of patient-level data between organizations.

Consent to Publish
Each site either obtained institutional review board (IRB) approval for the study or used de-identified data, in which case the study was determined not to be human subjects research. Informed consent was not necessary at any site.

Participants
The purpose of the C-19 index is to identify which COVID-19 patients are more likely to require hospitalization due to severe complications. Because the C-19 model was developed using non-COVID data, we validated it both on patients presenting with SARS-CoV-2 and, as a proxy, on patients presenting with influenza or related symptoms; the cohort definitions are described under Development vs. Validation below.

Missing Data
The prediction models used a cohort design that included any patient satisfying the inclusion criteria. We did not exclude patients lost to follow-up during the 30-day period after the valid outpatient or emergency room (OP/ER) visit.

Statistical Analysis Methods
The model performance was evaluated using the standard discriminative metrics: area under the receiver operating characteristic (AUROC) curve (equivalent to the c-statistic) and area under the precision-recall curve (AUPRC). The latter is a useful addition to the AUROC when assessing rare outcomes [12]. An AUROC of 1 corresponds to a model that always assigns a higher risk to patients who will experience the outcome compared with those who will not; an AUROC of 0.5 corresponds to a model that discriminates no better than random guessing. The precision-recall curve shows the trade-off between precision and recall at different thresholds. The AUPRC must be interpreted relative to how rare the outcome is: an AUPRC greater than the fraction of the population with the outcome means the model is discriminating, and the greater the value (closer to 1), the better the discrimination. The AUPRC also gives some insight into the false positive rate; a low AUPRC means the model will produce many false positives. Calibration was assessed by creating deciles based on the predicted risk and plotting the mean predicted risk against the observed risk in each decile. If a model is well calibrated, the mean predicted risk will be approximately equal to the observed risk in each decile.
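To make these metrics concrete, the following minimal sketch (not the OHDSI tooling used in this study; the data and variable names are simulated stand-ins) shows how the AUROC, AUPRC, and decile-based calibration could be computed in Python:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

# Simulated stand-ins for per-patient results (hypothetical names):
# y_true = 1 if hospitalized with pneumonia in days 0-30, else 0.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=10_000)               # rare outcome (~5%)
y_pred = np.clip(0.03 + 0.05 * rng.random(10_000) + 0.02 * y_true, 0, 1)

auroc = roc_auc_score(y_true, y_pred)                     # c-statistic
auprc = average_precision_score(y_true, y_pred)           # area under PR curve

# Decile-based calibration: mean predicted vs. observed risk per risk decile.
df = pd.DataFrame({"pred": y_pred, "obs": y_true})
df["decile"] = pd.qcut(df["pred"], 10, labels=False, duplicates="drop")
calibration = df.groupby("decile").agg(
    mean_predicted=("pred", "mean"), observed_risk=("obs", "mean"))
print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}")
print(calibration)
```

On real data, `y_pred` would be the C-19 risk for each patient in the target cohort and `y_true` the observed 30-day hospitalization outcome.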
We followed the TRIPOD statement guidelines [10].

Development vs. Validation
The differences between the C-19 model's development settings and our validation settings include a different target population and different datasets. Our validation settings were chosen to mimic the situation during COVID-19 when a clinician needs to decide whether to admit a patient. Importantly, we validated the C-19 model on COVID-19 patients.
The C-19 index was developed using a cohort design that entered adult patients into the cohort on September 30, 2016 and predicted whether they would be hospitalized for pneumonia or proxies (influenza, acute bronchitis, or other specified upper respiratory infections) in the following 3 months. Patients must have been in the data for 6 or more months, and patients who left the database within 3 months of index without a recorded death were excluded. In our external validation we also used a cohort design, but entered adult patients into the cohort when they had an initial OP/ER visit for influenza (or COVID-19) rather than on a fixed date, and predicted hospitalization due to pneumonia within 30 days rather than 3 months. We excluded patients with influenza or pneumonia within the 60 days prior to index to restrict to initial visits. This mimics the situation during the COVID-19 pandemic where clinicians need to decide whether to hospitalize a patient initially presenting with COVID-19. We required 12 months of prior observation and did not exclude patients who left the database within 3 months of index.
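The validation cohort logic can be summarized in the following sketch (hypothetical table layout and column names; the actual study expressed these criteria against the OMOP-CDM using the OHDSI framework):

```python
import pandas as pd

def build_validation_cohort(visits: pd.DataFrame) -> pd.DataFrame:
    """One row per OP/ER visit for influenza or COVID-19, with columns:
    person_id, visit_date, observation_start, last_flu_or_pneumonia_date,
    pneumonia_hospitalization_date (NaT when the event never occurred)."""
    v = visits.copy()
    # Require at least 12 months (365 days) of prior observation at index.
    v = v[(v["visit_date"] - v["observation_start"]).dt.days >= 365]
    # Restrict to initial visits: no influenza/pneumonia in the prior 60 days.
    days_since_prior = (v["visit_date"] - v["last_flu_or_pneumonia_date"]).dt.days
    v = v[days_since_prior.isna() | (days_since_prior > 60)]
    # Outcome: hospitalization with pneumonia during days 0-30 after index.
    days_to_outcome = (v["pneumonia_hospitalization_date"] - v["visit_date"]).dt.days
    v["outcome"] = days_to_outcome.between(0, 30)
    return v
```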
The C-19 index was developed using a subset of patients from the Medicare database prior to the pandemic. This is a US claims database containing patients aged 65 or older. In this study we were able to externally evaluate the C-19 model on COVID-19 data, including adult patients under 65 years of age, from South Korea, Spain and the US.

Online Results
The complete results are available as an interactive app at http://evidence.ohdsi.org/C19validation. The characteristics of the MDCR data (the same data source as the development data but a different patient subset) and the HIRA, SIDIAP, and VA data (COVID-19 patients) are displayed in Table 1. The characteristics of all datasets used in the study are available in Appendix 4.

Model Performance
When the C-19 index was transported to COVID-19 patients it achieved AUROCs between 0.36 and 0.56; full details are available in Table 2. The AUROC and calibration plots are presented in Figure 1. Full results are presented in Table 3, with the corresponding AUROC and calibration plots in Appendix 5.
As a sensitivity analysis we also validated the C-19 index on a target population consisting of patients with COVID-19 or symptoms during 2020, the results were similar and are presented in Appendix 1, Table S2.

Discussion
The C-19 index is available online as a tool to predict severity in patients with COVID-19, despite lacking validation in this population. Our validation across the three datasets with sufficient COVID-19 data showed poor discriminative performance (AUROCs <0.6) and poor calibration. We observed similarly poor performance when validating across twelve datasets of influenza patients, with the best AUROCs <0.70.

Interpretation
The key finding of this study is the performance of the C-19 model when transported to COVID-19 patients. The model performance was poor (AUROCs 0.36-0.56) across the COVID-19 datasets. The performance was worse than random guessing in the SIDIAP data, which is consistent with the poor performance seen when applied to European patients with influenza. The calibration plots show that the C-19 index consistently underestimated risk in the COVID-19 patients.
The datasets used to perform the validation had very different patient populations. MDCR had the oldest patient population, and many of its patients had comorbidities. Compared with MDCR, the CCAE and JMDC datasets presented healthier and younger patients (mean age around 40) in the target population. While MDCD had younger patients, these patients often had comorbidities (e.g., 20% had COPD, 11% had heart failure, and 17% had a history of pneumonia). The rate of hospitalization ranged greatly across sites, with values between 0.1% in JMDC and 12.4% in MDCR. The rate of the outcome in the dataset used to develop the C-19 index was 0.23%, much lower than in the MDCR data used to validate the model in this study. This is because our study restricted to patients at the point they had an OP/ER visit due to influenza or COVID-19. Although five datasets contained COVID-19 patients, only four had sufficient data (VA, HIRA, SIDIAP, and CUIMC) for external validation. The result of the C-19 index when applied to COVID-19 patients in CUIMC was poor (<0.5 AUROC); however, this dataset consisted mostly of hospitalized patients and therefore was not suitable for validating a model that predicts hospitalization.
We chose a target population of symptomatic patients because this resembles the situation in which COVID-19 prediction models may be clinically implemented during the pandemic: clinicians would be unlikely to admit asymptomatic patients. This suggests that the internal C-19 AUROC estimate, which was evaluated in the general population rather than in those with symptoms, may be optimistic compared with a realistic clinical setting, due to the inclusion of many healthy patients in the model development data. When applied to predict hospitalization in influenza patients across US data, the discriminative performance ranged between 0.58 and 0.68. The performance was worse on the CCAE database with its younger patients, likely because age is a key predictor in the model. When the C-19 index was transported to non-US datasets, discrimination was poor to reasonable in the Australian and Asian data (0.52-0.64) and poor in the European data (0.40-0.49). The European data are extracted from general practice (GP) settings, whereas the C-19 model was developed using US claims data. Given the differences in clinical settings, it is not surprising that the performances were poor; this highlights that models often do not transport to different healthcare settings. The AUROC of 0.36 when the C-19 model was validated in SIDIAP is worse than random guessing, and inverting the predicted risks would yield an AUROC of 0.64. This may be a result of the C-19 index including age interaction terms that leave the age main-effect coefficient negative. Table 1 shows that in SIDIAP the model's age-interacting comorbidities are recorded less often than in the other databases. This may have resulted in younger patients being assigned higher risks than older patients in SIDIAP.
The calibration was poor when applying the C-19 index to COVID-19 data. This is not unexpected, as COVID-19 patients are known to have a higher risk of hospitalization due to pneumonia than the general COVID-19-free population. The calibration could likely be improved by performing recalibration using a sample of COVID-19 patient data, as sketched below.
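One common recalibration approach (not performed in this study; a sketch under the assumption that predicted risks and observed outcomes are available for a COVID-19 sample) is logistic recalibration, which refits the intercept and slope on the logit of the original predicted risk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_recalibration(p_old: np.ndarray, y: np.ndarray):
    """Fit y ~ logit(p_old), updating the model's intercept and slope."""
    logit = np.log(p_old / (1 - p_old)).reshape(-1, 1)
    model = LogisticRegression().fit(logit, y)
    return lambda p: model.predict_proba(
        np.log(p / (1 - p)).reshape(-1, 1))[:, 1]

# Toy demonstration: systematically under-estimated risks get shifted upward.
rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.4, 5_000)
y = rng.binomial(1, true_p)            # simulated 30-day outcomes
p_old = true_p / 4                     # model under-estimates risk
recalibrate = logistic_recalibration(p_old, y)
print(p_old.mean(), recalibrate(p_old).mean(), y.mean())
```

Note that recalibration improves calibration only; because it preserves the ranking of patients, it cannot fix the poor discrimination observed here.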

Implications
The results provide extensive insight into the performance of the logistic regression C-19 index when used for COVID-19. The external validation uncovered that the logistic regression C-19 model is unreliable when predicting hospitalization risk for COVID-19 patients. Given this result, we do not recommend using the logistic regression C-19 index to aid decision making during the COVID-19 pandemic. The model did not appear to transport to COVID-19 patients, highlighting the importance of externally validating models, especially models whose target population differs from the development population.
There are numerous potential reasons why the logistic regression C-19 model failed to predict hospitalization due to pneumonia in the COVID-19 patients investigated. The first may be that the model was developed on patients aged 65 or older but applied to patients aged 18 or older; age had a negative coefficient in the model, which may have caused issues when the model was applied to younger patients (illustrated below). A second reason may be incorrect phenotyping of the predictors. We matched the SNOMED codes to the CCSR ICD-10 codes provided, but the predictors may require database-specific phenotypes due to coding differences across datasets and healthcare settings; this may explain the poor performance in the European datasets, which may record things differently than the US. A third reason is the study design: the C-19 index was developed to predict hospitalization from a set date in 2016, but we validated it in a target cohort of symptomatic patients with an OP/ER visit, as this more closely matches the clinical use case of the model. This means we are likely to have a sicker population in which discrimination may be more difficult. A fourth potential reason is that the C-19 model was developed using data prior to 2017 but was validated on data from 2020: temporal changes and concept drift may negatively impact performance. Although we do not know the reason for the unreliability of the C-19 model in COVID-19 patients, we were able to quantify it by large-scale external validation across a network of datasets. In future work it would be beneficial to develop techniques that can identify reasons for poor external validation performance, as this may inform new best practices for model development.
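To make the first reason concrete, consider the marginal effect of age when the model includes age-by-comorbidity interactions. The coefficient values below are made up purely for illustration; the actual values are those in [5]:

```python
# Hypothetical coefficients: with interaction terms, the effective age slope is
# b_age + sum(g_i * c_i). If b_age < 0 and comorbidities are absent (or simply
# under-recorded, as observed for SIDIAP in Table 1), risk decreases with age.
b_age, g_copd, g_chf = -0.02, 0.04, 0.03   # illustrative values only

def effective_age_slope(copd: int, chf: int) -> float:
    return b_age + g_copd * copd + g_chf * chf

print(effective_age_slope(copd=1, chf=1))  #  0.05: older patients ranked higher
print(effective_age_slope(copd=0, chf=0))  # -0.02: older patients ranked lower
```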
This study highlights the importance of performing extensive external validation across different settings. During times of uncertainty, such as pandemics, medical staff under pressure to make important decisions could benefit from implementing vetted prediction models. However, it is important to obtain an unbiased and reliable evaluation of a model's performance across numerous patient populations before the model is used. Internal validation can be biased (e.g., when the population used to develop the model does not match the intended target population) and can provide optimistic performance estimates (e.g., a poor design or small dataset may result in overestimated discriminative performance). The approach used by the OHDSI collaboration enables efficient external validation of models across multiple datasets, which is a valuable resource when urgency is required.

Limitations
A common issue when using observational healthcare data, especially across a network of databases, is the difficulty of developing phenotypes that are valid in all datasets. In this study we used the predictor definitions given by the researchers who developed the model. However, these definitions may not transport to all the datasets and may account for some of the decrease in performance. We were also limited to validating the less complex C-19 model, due to the large number of variables in, and lack of transparency of, the more complex models.
The C-19 model used in this paper to demonstrate the importance of external validation may have limited use for medical decision making. Other COVID-19 models, such as those including physiological measurements, may have more clinical impact. However, we chose the C-19 model because it was available early in the pandemic and was being advertised to the public as a useful tool while being reported only in a preprint with no formal peer review.

Conclusions
We have demonstrated the importance of performing external validation across multiple datasets to determine the reliability of prediction models. We picked a newly developed model, the C-19 index, that aimed to predict which COVID-19 patients are at risk of severe complications due to the virus. The model reported an internal AUC of 0.73 but was deemed to be at high risk of bias [2]. The C-19 index addresses an important issue that could have greatly aided decision making during the COVID-19 pandemic, but its performance in COVID-19 patients was unknown. Our results show that the C-19 index performs poorly when applied to newly diagnosed COVID-19 patients in Asia, Europe, and the US. Overall, we suggest that the model currently only be used to predict hospitalization due to pneumonia in older patients in the US. The results of this study demonstrate that internal validation performance should be considered an optimistic estimate, and that a prediction model requires validation across multiple datasets in the target population where it will be used (or a close proxy) before it should be trusted.