Candidemia Risk Prediction (CanDETEC) Model for Patients With Malignancy: Model Development and Validation in a Single-Center Retrospective Study

Background: Appropriate empirical treatment for candidemia is associated with reduced mortality; however, the timely diagnosis of candidemia in patients with sepsis remains poor. Objective: We aimed to use machine learning algorithms to develop and validate a candidemia prediction model for patients with cancer. Methods: We conducted a single-center retrospective study using the cancer registry of a tertiary academic hospital. Adult patients diagnosed with malignancies between January 2010 and December 2018 were included. Our study outcome was the prediction of candidemia events. A stratified undersampling method was used to extract control data for algorithm learning. Multiple models were developed—a combination of 4 variable groups and 5 algorithms (auto-machine learning, deep neural network, gradient boosting, logistic regression, and random forest). The model with the largest area under the receiver operating characteristic curve (AUROC) was selected as the Candida detection (CanDETEC) model after comparing its performance indexes with those of the Candida Score Model. Results: From a total of 273,380 blood cultures from 186,404 registered patients with cancer, we extracted 501 records of candidemia events and 2000 records as control data. Performance among the different models varied (AUROC 0.7710.889), with all models demonstrating superior performance to that of the Candida Score (AUROC 0.677). The random forest model performed the best (AUROC 0.889, 95% CI 0.888-0.889); therefore, it was selected as the CanDETEC model. Conclusions: The CanDETEC model predicted candidemia in patients with cancer with high discriminative power. This algorithm could be used for the timely diagnosis and appropriate empirical treatment of candidemia. (JMIR Med Inform 2021;9(7):e24651) doi: 10.2196/24651 JMIR Med Inform 2021 | vol. 9 | iss. 7 | e24651 | p. 1 https://medinform.jmir.org/2021/7/e24651 (page number not for citation purposes) Yoo et al JMIR MEDICAL INFORMATICS


Introduction
Candidemia is a representative nosocomial bloodstream infection that contributes to the mortality of immunocompromised patients; it has been shown to occur in 3% of patients in intensive care and 20% of immunosuppressed patients [1]. In addition, owing to a compromised immunity from chemotherapy or malignancy itself, patients with cancer have been reported as the most vulnerable hosts to candidemia [2][3][4].
Significant mortality has been reported over several decades. In studies from the 1980s, the mortality rate in patients with cancer found to have candidemia exceeded 50% [5][6][7]. High mortality rates, ranging from 30% to 51%, have also been reported in studies after 2010 [3,4,8]. The mortality rate of candidemia was significantly higher than that of bacteremia [3].
Early empirical treatment is important for patients with candidemia. A retrospective study [8] showed that patients with candidemia whose antifungal treatment was initiated 12 hours after onset of candidemia had a hospital mortality rate twice that of patients whose antifungal treatment was initiated within 12 hours of onset. Despite evidence on the need for early treatment of candidemia, only a small number of patients, receive timely antifungal treatment because of the difficulty of early diagnosis [9].
There has been an unmet clinical need-the coexistence of timeliness, high reliability, and cost-effectiveness in candidemia diagnosis. Blood culture is the reference standard for the candidemia diagnosis [10]. Due to its inherent nature, obtaining results can take a median of 2 to 3 days. Thus, this delay constitutes one challenge in the problem of timely diagnosis [11]. Multiple statistical models, such as the Candida Score, have been developed for the early prediction of candidemia [12][13][14]. However, such models have neither been tested on unseen data sets during development nor shown consistent performance in subsequent external validation studies [15][16][17].
The new T2Candida molecular test, which combines magnetic resonance with molecular diagnostics, is useful for the detection of candidemia in a very short amount of time and with high accuracy [18]; however, the high cost of the T2Candida molecular test is a barrier to its wide application in clinical practice [19,20].
Electronic health records allow efficient extraction and integration of clinical data, and the development of machine learning algorithms for critical care has been vigorously researched [21,22]. We aimed to develop a candidemia prediction model for patients with cancer using machine learning algorithms.

Study Population
This study was conducted in a 1950-bed tertiary academic single hospital in Seoul, Republic of Korea. The study population included adult patients (≥18 years old) diagnosed with a malignancy and from whom blood cultures had been obtained between January 2010 and December 2018 after diagnosis. The data set used in this analysis was extracted from the cancer registry and clinical data warehouse of the study site. The selection process is shown in Figure 1.

Outcome
We defined candidemia event as a positive culture for any Candida species in more than one blood sample. If candidemia was detected in a follow-up culture within 7 days, the subsequent event was merged with previous candidemia events. Because the algorithm was designed to predict candidemia events at the time of blood culture extraction, data were processed at the level of each event rather than at the patient level.
A data set with a low candidemia prevalence can cause difficulties in model training [23,24]. To solve this problem, stratified undersampling was conducted at a 1:4 ratio in 4 different subsets (used as a control data): (1) events of bacteremia in patients who experienced candidemia; (2) events of negative blood culture in patients who experienced candidemia; (3) events of bacteremia in patients who had not experienced candidemia; and (4) events of negative blood culture in patients who had not experienced candidemia. In the control subsets with patients who had experienced candidemia, a 60-day wash-out period was imposed from the day of onset of the candidemia event.

Stage 1: Feature Selection and Preprocessing
Upon review of clinical-domain literature on candidemia risk factors, we identified 210 variables [2,13,14,[25][26][27] that have been widely used in the development of machine learning algorithms in other clinical fields. We extracted data for these variables from the electronic health records of the study site. Each variable was classified into 4 groups based on variable importance, clinical importance, auto-extractability, and missing rate ( Figure 2 and Multimedia Appendix 1). We gradually eliminated features: variable group 4 had 210 variables, whereas variable group 1 (higher variable importance, higher clinical importance, better extractability, and less missing values) had only 30 variables. In order to prevent the algorithm from learning from postdiagnostic data, we removed candidemia diagnostic code and antifungal agent prescription information from the input data. To impute missing data, we used 2 serial methods. We used the carry-forward method to fill empty bins with the most recent value. This method reflects the workflow of the clinician in recording new data as the patients' condition changes; it is also easy and simple. In the case of missing values that were not imputed by the carry-forward method, average values were used. All numerical variables were normalized.

Stage 2: Data Partitioning
The data set was divided into training and test sets (7:3 ratio) using stratified sampling, matched by case-control group, binned age, Charlson comorbidity index, and sex.

Stage 3: Model Development
A total of 20 models were developed using a combination of 5 algorithms (logistic regression, deep neural network, random forest, gradient boosting, and automated machine learning) algorithms and 4 variable groups. For each model, 100 different development trials were conducted by changing random seeds to prevent selective performance reporting in the subsequent model evaluation stage. We used 2 methods for model selection and parameter optimization-an automated machine learning tool called the tree-based pipeline optimization tool [28], which helps identify the best prediction model and the best parameters for each variable group by using genetic algorithms and cross-validated performance on the training set (Multimedia Appendix 2), and a simple grid search, which is used to pick the parameters with the best performance.

Stage 4. Model Evaluation
Each model was evaluated using a test set (unseen data). Area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, positive predictive value, negative predictive value, and F1 score were used to measure performance. Under the assumption that the algorithm was to be used as a screening tool, sensitivity >0.90 with the highest F1 score, was determined as the threshold for classifying the risk group. One-way analysis of variance was conducted to compare AUROC values. Statistical significance was set at .05. Subsequently, the Bonferroni test was conducted with α=.0025 for each of the 20 outcomes combining the algorithm types and variable groups.
The model with the highest AUROC was selected as the Candida detection (CanDETEC) algorithm. To examine how well the predicted risk correlated with the observed risk, we generated a linear regression model using 10 bins based on the predicted risk of the CanDETEC model. In general, if the coefficients and intercept of the linear regression model were close to 1 and 0, respectively, the model was considered well calibrated.
Candida Score [13], a traditional statistical candidemia prediction model, was used as the standard. We compared the CanDETEC model and the rounded Candida Score: 1 × (total parenteral nutrition) + 1 × (surgery) + 1 × (multifocal candida colonization) + 2 × (severe sepsis) [13]. Performance indexes were calculated for Candida Score >3. We also employed a net benefit index to compare both models as well as to determine whether the chosen threshold could have beneficial clinical implications.

Statistical Programming
We used R (version 3.6.2; The R Project) and Python (version 3.6.8) for data preprocessing, statistical analysis, visualization, development, and validation of machine learning algorithms [29,30]. The sample preprocessed data set (20 records) and code for developing our models are available [31]. Hematologic malignancy was the most common malignancy (Table 1). Several clinical factors were significantly associated with case events. Patients with candidemia received longer treatment with steroids, parenteral nutrition, and antibiotics within 30 days before candidemia. Furthermore, antianaerobic therapy (mean 10.1 days) was twice as long (P<.001) in those with candidemia than that in controls (5.5 days). In addition, patients with candidemia had higher C-reactive protein, bilirubin, fasting glucose, blood urea nitrogen, and lactic acid levels.

Model Evaluation
The best model was the random forest model trained using variable group 1, which was selected as the CanDETEC algorithm (Table 3; Multimedia Appendix 4); the lowest performing model was the logistic regression model trained using variable group 4. Receiver operating characteristic curves are shown in Figure 3. The tree-based pipeline optimization tool algorithm returned an extra-tree model, which showed the highest performance at the specified cut-off. Among the 30 auto-extractable variables, 5 variables showed the highest significance (P<.001) in the prediction of candidemia: blood urea nitrogen level, 7-day variance of respiratory rate, total bilirubin level, 7-day variance of systolic blood pressure, and body weight (Multimedia Appendix 5).

CanDETEC Model
The threshold for categorizing the candidemia risk group was set at 0.216 by a predefined condition (sensitivity >0.90 and highest F1 score). At this cut-off point, the calculated net benefit was 0.121 ( Figure 4). This threshold not only had a greater net benefit than that of treating all or none of the patients but was also close to the point showing the most significant net benefit. The coefficients, intercept, and R 2 of the linear regression model for evaluating calibration were 1.393, 0.078, and 0.93, respectively. These indexes show that the CanDETEC model was acceptably calibrated ( Figure 5).

Principal Results
We (Multimedia Appendix 5). Therefore, this model could also be used in real time.

Comparison With Prior Work
The CanDETEC algorithm can be used to support clinicians' decision-making process in providing appropriate empirical treatment for candidemia in patients with cancer. Although candidemia contributes to the mortality of patients diagnosed with malignancies, the timely diagnosis of candidemia remains clinically challenging [8]. Given its low prevalence, busy clinicians often overlook candidemia as a possible cause of infection. Only 14.0% of patients (70/501) in our data set with candidemia had received empirical antifungal agents, and the 30-day all-cause mortality rate exceeded 50% (264/501, 52.7%), which is consistent with outcomes in previous studies [3,4,8,9]. Furthermore, infectious disease diagnostic processes that are currently used have a high cognitive burden as they require the collection and calculation of several complicated patient information variables [32]. Because the CanDETEC algorithm can be used without the clinician's subjective judgment, our model has the potential to support physicians' decision-making process with minimal additional workload.
Current tools for predicting candidemia have major limitations.
Although the Candida Score is the most widely used tool to predict invasive candidemia, its sensitivity was reported to be only 37% in a recent external validation study [16], which was lower than the 81% in the original study [13]. This might have been a result of differences in the patient population. Static variables may also hinder the appropriate prediction of candidemia because they cannot appropriately reflect changes in the clinical status of patients over time. Not surprisingly, the Candida Score showed relatively low diagnostic performance in our study (AUROC 0.677).

Limitations
Our study has several limitations. First, CanDETEC was designed to predict candidemia when blood cultures were performed; however, the Candida Score was originally designed to predict invasive candidiasis, including candidemia. Thus, a comparison of diagnostic performance between our model and should be interpreted with caution. Second, this was a single-center retrospective study. Further multicenter prospective studies are required for external validation and to prove the clinical efficacy of the CanDETEC model. Third, although we developed a model with clinically acceptable performance, we only applied basic machine learning, such as random forest and gradient boosting. Recently, more complex ensemble models have been developed, and they have presented better performance compared to that of basic machine learning models in other medical domains [33,34]. Therefore, a follow-up study employing a state-of-the-art model should be conducted to examine whether the performance of the CanDETEC model could be improved.

Conclusions
Our CanDETEC model, to predict candidemia in patients with cancer, is expected to reduce the mortality of patients with malignancy by helping the clinician with decision making for timely antifungal treatment.