An Artificial Neural Network–Based Pediatric Mortality Risk Score: Development and Performance Evaluation Using Data From a Large North American Registry

Background: In the pediatric intensive care unit (PICU), quantifying illness severity can be guided by risk models to enable timely identification and appropriate intervention. Logistic regression models, including the pediatric index of mortality 2 (PIM-2) and pediatric risk of mortality III (PRISM-III), produce a mortality risk score using data that are routinely available at PICU admission. Artificial neural networks (ANNs) outperform regression models in some medical fields.
Objective: In light of this potential, we aim to examine ANN performance, compared to that of logistic regression, for mortality risk estimation in the PICU.
Methods: The analyzed data set included patients from North American PICUs whose discharge diagnostic codes indicated evidence of infection and included the data used for the PIM-2 and PRISM-III calculations and their corresponding scores. We stratified the data set into training and test sets, with approximately equal mortality rates, in an effort to replicate real-world data. Data preprocessing included imputing missing data through simple substitution and normalizing data into binary variables using PRISM-III thresholds. A 2-layer ANN model was built to predict pediatric mortality, along with a simple logistic regression model for comparison. Both models used the same features required by PIM-2 and PRISM-III. Alternative ANN models using single-layer or unnormalized data were also evaluated. Model performance was compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC) and their empirical 95% CIs.
Results: Data from 102,945 patients (including 4068 deaths) were included in the analysis. The highest performing ANN (AUROC 0.871, 95% CI 0.862-0.880; AUPRC 0.372, 95% CI 0.345-0.396) that used normalized data performed better than PIM-2 (AUROC 0.805, 95% CI 0.801-0.816; AUPRC 0.234, 95% CI 0.213-0.255) and PRISM-III (AUROC 0.844, 95% CI 0.841-0.855; AUPRC 0.348, 95% CI 0.322-0.367). The performance of this ANN was also significantly better than that of the logistic regression model (AUROC 0.862, 95% CI 0.852-0.872; AUPRC 0.329, 95% CI 0.304-0.351). The performance of the ANN that used unnormalized data (AUROC 0.865, 95% CI 0.856-0.874) was slightly inferior to our highest performing ANN; the single-layer ANN architecture performed poorly and was not investigated further.
Conclusions: A simple ANN model performed slightly better than the benchmark PIM-2 and PRISM-III scores and a traditional logistic regression model trained on the same data set. The small performance gains achieved by this two-layer ANN model may not offer clinically significant improvement; however, further research with other or more sophisticated model designs and better imputation of missing data may be warranted.
Ghanad Poor et al. JMIR Medical Informatics 2021;9(8):e24079. https://medinform.jmir.org/2021/8/e24079


Background
The use of risk models in medicine enables timely and more targeted interventions for a given patient and facilitates benchmarking quality of care and conduct of clinical studies [1]. It is often necessary to quantify the severity of illness in the pediatric intensive care unit (PICU). Estimating the probability of mortality or expected length of stay from early admission data with such risk models is mainly used for quality improvement and benchmarking; however, it might enable a clinician to make objective medical decisions regarding the state of the patient, the necessary level of care, possible treatments, discharge plans, or expected costs [2][3][4].
PICUs are data-rich environments with a wide range of physiological variables that are responsive to interventions over short periods and outcomes that are well-defined and generally quantifiable [5]. Thus, the PICU provides fertile ground to develop and test prediction models of risks and outcomes. A score, which is quick and pragmatic to use, can enable the timely identification of adverse conditions and may be used to tailor appropriate interventions [6]. Two commonly encountered pediatric risk scores are the pediatric index of mortality 2 (PIM-2) [2] and pediatric risk of mortality III (PRISM-III) [1]. Both are derived from logistic regression models, which estimate mortality risk and have been validated with respective areas under the receiver operating characteristic curves (AUROCs) of 0.90 and 0.89 [1,7].
Increased computing capabilities, big data, and machine learning algorithms enable the application of artificial intelligence (AI) for clinical decision support [8]. Artificial neural networks (ANNs), a subtype of AI, can be used in different medical areas and have been shown to outperform physicians in diagnosis based on medical imaging or data from electronic medical records [9][10][11][12]. A recurrent neural network is a type of ANN that is most commonly used for sequential data. An ANN-based cardiac risk score, which used the recurrent neural network approach, was able to detect small changes in an electrocardiogram segment, which cannot be found by visual inspection [11]; another was used to classify clinical time series data for pediatric patients in critical care [12].
The clinical adoption of ANN-based risk models relies on gaining physicians' trust in the use of AI [13,14], which may include, but is not limited to, demonstrating better performance than traditional regression approaches.

Objectives
The primary aim of this study is to examine the performance of an ANN-based approach compared to that of traditional approaches based on logistic regression models when applied to estimating the risk of mortality in children admitted to the PICU with suspected sepsis. We developed an ANN model using features required in the PIM-2 and PRISM-III models to predict mortality outcomes (died or survived) in a large North American registry data set and evaluated the ANN's performance using the AUROC. We compared its performance with the benchmark PIM-2 and PRISM-III scores, as well as a logistic regression model, trained on the study data set, which used the same features as PIM-2 and PRISM-III.

Study Design and Approval
In this study, we used data from a North American PICU registry to compare the performance of an ANN model with PIM-2 and PRISM-III scores. The data set was obtained from Virtual Pediatric Systems (VPS), LLC, a registry of prospectively collected records from 130 PICUs in the United States and Canada. This is a secondary analysis of data obtained for a different purpose: to develop a simple risk stratification score for children with sepsis [6]. Ethical approval for the study was obtained from the University of British Columbia/Children's and Women's Health Centre of British Columbia Research Ethics Board (H15-01398). The requirement for written informed consent was waived by the research ethics board, as this study was a secondary analysis of registry data. This manuscript has been prepared in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guidelines.
As a sepsis diagnosis might not be known or documented at the time of admission to the PICU, we identified all children in the VPS data set whose diagnostic codes at discharge indicated evidence of an infection. Combined with their admission to the PICU, this evidence provides a reasonably strong indication of sepsis. This allowed us to create a representative data set of children with a high likelihood of sepsis.

Data Available for Analysis
The analyzed data set included data on PICU admissions between January 1, 2009, and December 31, 2014. Data were available from 102,945 children, of whom 4068 died (mortality rate 3.95%). Each entry included a variety of vital signs, laboratory tests, and other clinical information, including the variables required to calculate the PIM-2 and PRISM-III scores. The clinical data used in this analysis were solely from early admission to the PICU. Hence, the longer the length of stay, the less associated these predictors were with the outcome under investigation: mortality or survival at PICU discharge.
Although the variables for PIM-2 and PRISM-III were collected from the same source, these models captured data from different sampling windows. For any given PICU admission, the VPS data set provides a single measurement for each variable used by these 2 risk scores as required for their respective calculations.

PRISM-III Variables and Sampling Window
PRISM-III uses the highest or lowest values of systolic blood pressure, heart rate, temperature, mental status, pupillary reflexes, acidosis, pH, PCO2, total carbon dioxide (CO2), PaO2, glucose, potassium, creatinine, blood urea nitrogen, white blood cell count, platelet count, and prothrombin time or partial thromboplastin time [1]. Values included were measured in the first 12 hours of PICU care; laboratory variables were also considered up to 2 hours before PICU admission.

PIM-2 Variables and Sampling Window
PIM-2 uses the first recorded values of systolic blood pressure, pupillary reaction to light, PaO2, base excess, early mechanical ventilation (yes or no), elective PICU admission (yes or no), admission following surgery (yes or no), admission following cardiopulmonary bypass, high-risk diagnoses (nine options: cardiac arrest preceding intensive care unit [ICU] admission, severe combined immune deficiency, leukemia or lymphoma after first induction, spontaneous cerebral hemorrhage, cardiomyopathy or myocarditis, hypoplastic left heart syndrome, HIV infection, liver failure as the main reason for ICU admission, or neurodegenerative disorder), and low-risk diagnoses (five options: main reason for ICU admission of asthma, bronchiolitis, croup, obstructive sleep apnea, or diabetic ketoacidosis) [2]. Values included were measured in the first hour of PICU care starting at the time of the first face-to-face meeting of the patient with a PICU team member.
Not all vital signs were collected routinely for every patient, so the data set was only sparsely populated, and the vital signs used for calculating PIM-2 and PRISM-III scores were incomplete in some cases. For example, the Glasgow Coma Score (mental status) was missing from 60.2% (61,976/102,945) of cases. In the calculation of both PIM-2 and PRISM-III scores, missing vital signs are treated as normal (ie, healthy), on the assumption that the corresponding tests were not ordered or performed by the PICU team [1,2]. For example, a missing Glasgow Coma Score is interpreted as indicating a normal mental status and is input to the model as such. This assumption is discussed further in the Limitations section.

Preprocessing
Preprocessing was performed in Python (v3.8.5; Python Software Foundation) and comprised three tasks: (1) generating the training and test sets, (2) addressing missing values in the data set, and (3) generating new variables through data transformation.

Generation of Training and Test Sets
The total data set was initially divided into training and test sets using a stratified approach to ensure that the class ratio for mortality remained approximately equal for the training, test, and full data sets. ANN and logistic regression models were built on the training sets and evaluated on the test sets, and the results were compared against the PIM-2 and PRISM-III models. The overall data set was bootstrapped 100 times to generate the training and test sets.
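The stratified splitting described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the toy labels, and the use of repeated seeded random splits are assumptions, and the 30% test fraction is inferred from the reported set sizes (30,884 of 102,945).

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with roughly equal class ratios in both sets."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then cut that class at the same fraction.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (1 = died, 0 = survived) with 4% mortality; illustrative, not study data.
labels = [1] * 40 + [0] * 960

# One split per seed stands in for the 100 resamples mentioned in the text.
splits = [stratified_split(labels, test_fraction=0.3, seed=s) for s in range(100)]

train_idx, test_idx = splits[0]
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because each class is cut at the same fraction, the mortality rate in the training and test sets stays close to that of the full data set regardless of the seed.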

Addressing Data Missingness Through Simple Substitution
The data set was examined for missing entries, and the missing values were imputed based on the feature type. Specifically, the missing values in categorical features, such as pupillary reaction and coma status, were imputed using the most common value (mode). The missing values in numerical features, such as glucose or PCO2, were imputed using the median value, as most of these features did not follow a normal distribution. Median and mode approaches were used to build imputation models and fill the missing values in the training set, and these imputation models were then applied separately to the test set to avoid data leakage.
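A minimal sketch of this fit-on-training, apply-to-test pattern follows. The helper names (`fit_imputer`, `apply_imputer`), the column names, and the toy rows are hypothetical; missing entries are assumed to be represented as `None`.

```python
from statistics import median, mode


def fit_imputer(rows, numeric_cols, categorical_cols):
    """Learn one fill value per column from the training rows only."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return fill


def apply_imputer(rows, fill):
    """Return copies of rows with missing (None) entries replaced by the learned values."""
    return [
        {col: (fill[col] if value is None else value) for col, value in row.items()}
        for row in rows
    ]


# Toy rows; the values are illustrative, not study data.
train = [
    {"glucose": 5.0, "pupils": "reactive"},
    {"glucose": None, "pupils": "reactive"},
    {"glucose": 9.0, "pupils": None},
    {"glucose": 6.0, "pupils": "fixed"},
]
test = [{"glucose": None, "pupils": None}]

fill = fit_imputer(train, numeric_cols=["glucose"], categorical_cols=["pupils"])
train_filled = apply_imputer(train, fill)
test_filled = apply_imputer(test, fill)  # uses training statistics, so no leakage
print(fill, test_filled)
```

The test rows never contribute to the median or mode, which is what keeps information from the test set out of the preprocessing step.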

Generation of New Variables Through Data Transformation
We performed minimum-maximum normalization to normalize numerical data for the ANN and logistic regression models. The minimum and maximum values of each feature from the training set were used to normalize the data in the training and test sets to avoid data leakage. Dummy encoding was performed on categorical features that contained more than 2 distinct values, such as pupillary reaction, but all categorical features with only 2 distinct values were dichotomized to accommodate them in the machine learning models. We used thresholds defined by PRISM-III to define normal and abnormal values. PIM-2 does not have defined thresholds; instead, it continuously penalizes any deviation of a vital sign from its normal value.
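The two transformations can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the heart rate values are invented, and the cutoff of 150 is a placeholder chosen for the example, not a PRISM-III threshold.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training set only."""
    return min(train_values), max(train_values)


def min_max_scale(values, lo, hi):
    """Map values to [0, 1] relative to the training range."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def dichotomize(values, threshold):
    """Encode a value as 1 (abnormal) if it exceeds the threshold, else 0 (normal)."""
    return [1 if v > threshold else 0 for v in values]


# Toy heart rates; illustrative, not study data.
train_hr = [80, 100, 120, 160]
test_hr = [90, 170]

lo, hi = fit_min_max(train_hr)
train_scaled = min_max_scale(train_hr, lo, hi)
test_scaled = min_max_scale(test_hr, lo, hi)

# Placeholder cutoff for illustration only; not taken from PRISM-III.
test_abnormal = dichotomize(test_hr, threshold=150)
print(train_scaled, test_scaled, test_abnormal)
```

Note that a test value outside the training range (170 here) scales to a value above 1; this is expected when the range is learned from the training set alone.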

Model Training
We built an ANN model using the Keras framework on top of TensorFlow (Google Brain Team) in Python (Python Software Foundation), with training conducted in Jupyter notebook (IPython). The Python code files that were used to build the models and generate results are available in Multimedia Appendix 1. We used a grid approach to determine the optimum configuration while designing the neural network. We tested configurations with 1 to 3 hidden layers and 8 to 32 neurons per hidden layer, following a rule of thumb that limits the number of neurons in each hidden layer to at most the number of neurons in the input layer. Through experimentation, we identified that a 2-layer ANN, with 32 nodes in the first hidden layer and 16 nodes in the second hidden layer, performed better than the other configurations we tested. Our final model consisted of 32 input features (consisting of the variables used in the PRISM-III [1] and PIM-2 [2] models; see the Study Data Set section) and the 2 hidden layers, with each node using rectified linear unit activation functions; finally, a sigmoid-activated dense layer was used to predict the mortality for each instance (Figure 1). The model was compiled using the Adam optimizer with a binary cross-entropy loss function. While keeping the main network the same, we also evaluated the model with unnormalized data as well as a model with only a single hidden layer. We conducted training with a batch size of 32 and observed that the loss remained constant after 100 epochs.
Figure 1. Artificial neural network architecture with two hidden layers: the node in input layer "iXy" processes data from pediatric intensive care unit admission "X" with feature "y" (such as age, length of stay, pupillary reaction, etc). The total number of features in the data set is denoted by "n." The first and second hidden layers are represented by h1 and h2, respectively, with a subscript to denote the node number. The output layer has a single node (o), which shows probability of mortality for patient "X." ReLU: rectified linear unit.
The ANN model was trained with features used in the PIM-2 and PRISM-III models to predict the outcome (died or survived); AUROC was used as an evaluation metric while training the model. Finally, we developed a logistic regression model for comparison using the same features from PIM-2 and PRISM-III.
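To make the described architecture concrete, the following sketch runs a forward pass through a 32-32-16-1 network in plain Python. It is not the authors' Keras code (which is in Multimedia Appendix 1): the weights are random and untrained, the Adam training loop is omitted, and all function names and the initialization range are assumptions made for illustration.

```python
import math
import random

# Inputs, hidden layer 1, hidden layer 2, output, as described in the text.
LAYER_SIZES = [32, 32, 16, 1]


def init_params(sizes, seed=0):
    """Random small weights and zero biases for each dense layer (untrained)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, features):
    """ReLU in the hidden layers, sigmoid in the single output node."""
    activations = features
    for layer, (weights, biases) in enumerate(params):
        act = sigmoid if layer == len(params) - 1 else relu
        activations = [
            act(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one patient; p is the predicted probability of death."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


params = init_params(LAYER_SIZES)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)
risk = forward(params, [0.5] * 32)  # one hypothetical, already normalized patient
print(n_params, risk, binary_cross_entropy(1, risk))
```

With these layer sizes the network has 1601 trainable parameters (1056 + 528 + 17), which gives a sense of how small the model is relative to the 72,061 training admissions.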

Model Evaluation
The empirical range of AUROC scores was computed for each test set (obtained from bootstrap) using functions from the sklearn.metrics module in Python. The test set that resulted in the median AUROC value was used to determine the optimum Youden index value. This threshold was then used to calculate the false positive rate (FPR) and false negative rate (FNR) for each test set, and the 95% empirical CIs were reported by pooling the results from all the test sets [15][16][17]; median and ranges of pooled results were reported for all other indices. We also reported the area under the precision recall curve (AUPRC) and its empirical 95% CI for each model. A Welch 2-sided t test was used to compare AUROC and AUPRC for model pairs.
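The quantities named above can be computed from first principles, which may help clarify what each one measures. The sketch below is a plain-Python illustration rather than the sklearn.metrics calls used in the study; the function names, the toy labels and scores, and the nearest-rank percentile rule are assumptions.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random death outranks a random survivor."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count concordant pairs; tied pairs count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_fnr(y_true, scores, threshold):
    """FPR and FNR when every score >= threshold is classified as died."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fp / n_neg, (n_pos - tp) / n_pos


def youden_threshold(y_true, scores):
    """Candidate threshold that maximizes J = TPR - FPR = 1 - FNR - FPR."""
    return max(sorted(set(scores)),
               key=lambda t: 1 - sum(fpr_fnr(y_true, scores, t)))


def percentile_interval(values, lower=2.5, upper=97.5):
    """Nearest-rank empirical interval over pooled per-resample results."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return (ordered[round(lower / 100 * last)],
            ordered[round(upper / 100 * last)])


# Toy test set (1 = died); illustrative, not study data.
y = [0, 0, 0, 0, 1, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]

area = auroc(y, s)
cutoff = youden_threshold(y, s)
fpr, fnr = fpr_fnr(y, s, cutoff)
print(area, cutoff, fpr, fnr)
```

In the workflow described above, the cutoff would be chosen once on the test set with the median AUROC and then applied unchanged to every other test set before pooling the resulting FPR and FNR values.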
To compare how the models performed at specific true positive rate (TPR) and FPR levels, we fixed the TPR values at 95%, 90%, and 85% and computed the corresponding median FPR values (from all the test sets) for ANN, logistic regression, PIM-2, and PRISM-III. Similarly, we also reported the median TPR results by fixing the FPR at 5%, 10%, and 15%.
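Fixing one rate and reading off the other can be sketched as a search over candidate thresholds. The function names and toy data below are hypothetical, and the study may have interpolated along the ROC curve instead of selecting among observed thresholds.

```python
def _rates(y_true, scores, threshold):
    """TPR and FPR when every score >= threshold is classified as died."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / n_pos, fp / n_neg


def fpr_at_tpr(y_true, scores, target_tpr):
    """Smallest FPR among thresholds whose TPR is at least target_tpr."""
    best = 1.0  # classifying everyone as died always reaches TPR = 1 at FPR = 1
    for t in set(scores):
        tpr, fpr = _rates(y_true, scores, t)
        if tpr >= target_tpr:
            best = min(best, fpr)
    return best


def tpr_at_fpr(y_true, scores, max_fpr):
    """Largest TPR among thresholds whose FPR is at most max_fpr."""
    best = 0.0  # classifying no one as died always reaches FPR = 0 at TPR = 0
    for t in set(scores):
        tpr, fpr = _rates(y_true, scores, t)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best


# Toy test set (1 = died); illustrative, not study data.
y = [0, 0, 0, 0, 1, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]

fpr_at_95 = fpr_at_tpr(y, s, 0.95)
tpr_at_05 = tpr_at_fpr(y, s, 0.05)
print(fpr_at_95, tpr_at_05)
```

Repeating this for each test set and taking the median would give one value per model and operating point, in the spirit of the comparison described above.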

Data Set Characteristics
The data set included 102,945 children with infection admitted between 2009 and 2014, of whom 4068 died (3.95% mortality rate). The training sets contained 72,061 children, of whom a median of 2852 (range 2790-2903) died, equivalent to a 3.96% mortality rate; the test sets contained 30,884 children, of whom a median of 1216 (range 1165-1278) died, equivalent to a 3.94% mortality rate (Table 1).  As is commonly encountered in large clinical registries using clinical availability of routinely collected data, between 0.41% (424/102,945)

Model Performance: ANN Trained Using Imputed and Normalized Data
With the ANN trained on normalized data, the median FPR was 18.4% (range 12.5%-30.8%) and the median FNR was 24% (range 12.7%-33.2%; Table 2), with a median accuracy of 81.3% (range 69.9%-86.7%) on the test sets. Similar results were observed using AUPRC, which indicated that the ANN (AUPRC 0.372, 95% CI 0.345-0.396) performed better than PIM-2 (AUPRC 0.234, 95% CI 0.213-0.255; P<.001) and PRISM-III (AUPRC 0.348, 95% CI 0.322-0.367; P<.001; Figure 3). The ANN achieved the highest TPR compared with the logistic regression, PIM-2, and PRISM-III when FPR was fixed at 5%, 10%, or 15%. Similarly, FPR was lowest for the ANN when TPR was fixed at 85% or 90% (Table 3). However, the logistic regression model showed the smallest FPR when TPR was fixed at 95%. The lowest FPR observed at the Youden-optimized threshold point for any of the models evaluated was 3.5% using PRISM-III, with a corresponding FNR of 54.3% (Table 2). If we target an FPR of 3.5%, the corresponding FNRs for the other models were 68.7% for PIM-2, 55.8% for the logistic regression, and 54% for the ANN.
Although the 95% CIs of the AUROCs of the ANN and logistic regression overlap, the ANN performed significantly better than logistic regression (P<.001).
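The Welch test named in the Model Evaluation section compares two sets of per-resample AUROC values without assuming equal variances. The sketch below computes only the test statistic and its degrees of freedom; the function name and the two short AUROC lists are invented for illustration and are not the study's bootstrap results. Converting the statistic to a P value requires the t distribution, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Invented per-resample AUROC values for two hypothetical models.
model_a = [0.870, 0.872, 0.869, 0.873, 0.871]
model_b = [0.861, 0.863, 0.860, 0.864, 0.862]

t_stat, dof = welch_t(model_a, model_b)
print(t_stat, dof)
```

Because the standard error shrinks with the number of resamples, a small but consistent difference in mean AUROC can produce a large test statistic.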

Summary of Results
We created an ANN-based pediatric risk prediction score using the features included in PIM-2 and PRISM-III scores, which we trained on patients from a large North American multicenter pediatric cohort with presumed sepsis as identified by a discharge diagnosis of infection. The overall performance of the ANN model with binary cross-entropy loss was better than the PIM-2 and PRISM-III scores, with median AUROCs of 0.871 (ANN) versus 0.805 (PIM-2; P<.001) and 0.844 (PRISM-III; P<.001). It also performed better than a traditional logistic regression model that used the same features required by PIM-2 and PRISM-III. However, these performance gains may not represent a clinically significant improvement. Our evaluation of the ANN approach with a single hidden layer and nonnormalized data returned poorer results than the other models evaluated.

Improved Performance, but Is It Relevant?
Our highest performing ANN was significantly better, statistically, than PIM-2 and PRISM-III using the AUROC and AUPRC measures of performance. The ANN missed fewer cases than PIM-2, PRISM-III, and the logistic regression model (ie, the ANN had a lower FNR; Table 2) at their respective ideal thresholds, as determined by optimizing their respective Youden indices; however, its rate of false positive detections was higher than that of PRISM-III and marginally higher than that of the logistic regression model (ie, the ANN had a higher FPR) at these Youden-optimized thresholds. This may suggest an opportunity for further optimization and evaluation, but it should be noted that the ANN did not miss more cases than PRISM-III (ie, the ANN had an equivalent FNR) when the FPR was fixed at the value of 3.5% (PRISM-III's Youden-optimized threshold). A direct comparison between models is challenging given that model selection will depend to a large extent on the clinical context; in some settings, a single objective (eg, to minimize FPR) may be the overriding concern, whereas in other cases, a balance of multiple objectives may be required (eg, to minimize both FPR and FNR).
Although the ANN offered limited performance gains and increased robustness, the improvement may not be clinically relevant and is unlikely to overcome the initial concerns that physicians might have about the new model. The limited performance gains were not surprising. Although studies have proposed that ANNs outperform logistic regression models [12,18] or offer at least partially better performance [19], a recent systematic review of 71 studies found no superior performance of ANN over logistic regression models [20]. However, ANN-based models allow for the tuning of performance characteristics, which offers a potential advantage.

Trust Issues as a Barrier to ANN Use in Risk Modeling
The successful acceptance of AI-based risk models requires physicians' willingness to accept AI models and the interpretability of those models. Although clinically improved performance might help this case, trust is a key element in acceptance, which is built (or lost) in a dynamic and evolving process [13,21]. Our failure to demonstrate a significant improvement in clinical performance will not help overcome the barriers to adoption.
Future AI-based risk models may need to become more interpretable to find acceptance [14], and the higher the risk, the more interpretability is needed to earn the trust. Including clinicians and patients in the development of AI models may be a step toward promoting acceptability and interpretability. Certification and licensure for AI models might also help build trust in model-based risk scores [22,23]. Finally, it may be useful to assure the user that the model is a tool and not a replacement for the clinician [13].

Challenges With Skewed Data
The working data set was skewed: only 3.95% (4068/102,945) of instances had the outcome as died. Local minima are a problem frequently associated with imbalanced data sets, and customized learning algorithms, cost functions, or external approaches (ie, resampling the data set) can be used to help overcome this problem [24]. Some ANNs tended to predict (mostly) everyone as a survivor; given the overall mortality rate of the population (4068/102,945, 3.95%), even assuming every patient will survive results in an accuracy of approximately 96%, but with an FPR of zero and an FNR of one. A traditional experimental setup with accuracy as an evaluation metric fails when building models with skewed data, as the models tend to be biased toward the majority class (here survived) [25]. This challenge can be addressed by modifying the cost function to maximize the AUROC of the model [25].
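The arithmetic behind this point can be checked directly from the counts reported in the text. The sketch below treats "died" as the positive class and evaluates a trivial classifier that labels every patient as a survivor; the variable names are illustrative.

```python
n_total, n_died = 102_945, 4_068
n_survived = n_total - n_died

# Trivial classifier: predict "survived" for everyone ("died" is the positive class).
tp, fn = 0, n_died      # every death is missed
tn, fp = n_survived, 0  # every survivor is labeled correctly

accuracy = (tp + tn) / n_total
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)
print(f"accuracy={accuracy:.4f}, FPR={fpr:.1f}, FNR={fnr:.1f}")
```

The accuracy of roughly 96% says nothing about the model's ability to identify deaths, which is why threshold-independent metrics such as AUROC and AUPRC are more informative for this data set.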

Limitations
The main limitation of this work is the fact that out of several ANN-based models evaluated, only 1 type learned to discriminate between survival and death of patients effectively. Despite attempts to address the root cause (imbalance of outcomes in the data set), this suggests that the approaches selected may not have been optimal and that further network types and designs should be considered in future approaches. Following the initial positive outcomes with this model, secondary training on a data set can be used to fine-tune the ANN model.
The information included in the new models was limited to risk factors from PIM-2 and PRISM-III. By creating new features such as vital sign combinations or ratios [26], which in principle can be emulated by adding hidden layers, one might be able to provide another significant performance boost to the model. However, this did not seem to be the case in a recent sepsis prediction competition [27], where novel methods or applications seemed to be more promising than the creation of new features.
Another limitation was the relatively low number of complete patient entries in the VPS data set. Given that VPS is a curated data set, the potential reasons for this likely stem from local practices, such as tests not being required for clinical management in particular cases or it being generally decided that recording the results of these tests is optional. Although it makes the creation and use of some modeling techniques more difficult, this is an unavoidable feature of real-world clinical data. Characterizing the missingness to inform modeling might offer a valuable approach, but such features may not be generalizable because they represent local patterns of practice. To use models without the filtering layer, simple imputation approaches were used; however, data were likely not missing at random, which invalidates some of the (median or mode imputation) approaches used. More sophisticated approaches for handling data missingness, such as multivariate imputation by chained equations, may yield better performance [28,29], as the substituted values are likely closer to specific cases than the overall population. Importantly, physicians should inform the treatment of missing values, which might boost confidence in the methods used. It might be possible to use a complete time series in an ANN instead of extreme values observed in a certain window, which could improve performance.
This study explored only a limited range of ANN design techniques. For example, we used rectified linear unit activation in the hidden layers but did not evaluate the effect of other activation functions on model performance; similarly, we used the Adam optimizer to identify the optimal ANN architecture but did not evaluate alternative optimizers. Thus, more exhaustive experimentation may yield improved performance results. Similarly, the Youden index was used as a pragmatic approach to identify the optimal cutoff by maximizing the sum of the models' true positive and true negative rates. However, selecting the appropriate operating point for clinical implementation should consider alternative approaches to finding the optimal threshold and would also require a more nuanced evaluation of clinical priorities, which might, for example, penalize missed cases more heavily than false positives.
A major limitation to the development of a new risk score is the lack of recognized clinically acceptable performance criteria to assess the utility of integrating ANN-based risk scores into daily clinical routines. In their absence, it is difficult to make a clear statement on the clinical utility of models with slightly better performance compared with existing risk scores.

Conclusions
This study examined the performance of ANN models compared with logistic regression-based models in estimating the risk of mortality in the PICU. A simple 2-layer ANN demonstrated better performance than traditional logistic regression, PIM-2, and PRISM-III; however, the statistically significant improvement in performance may not be clinically significant. Further work, including involvement of physicians in defining performance thresholds, better handling of data missingness, and possibly the use of more sophisticated ANN-modeling methods, will be required to achieve meaningful advances to guide decision-making in the care of critically ill children.