This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients’ chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early identification models for diseases.
This study aims to propose a systematic approach to deal with the problems of missing, imbalanced, and sparse features when developing sudden-death prediction models using emergency medicine (or ED) data.
We proposed a 3-step approach to deal with data quality issues: a random forest (RF) for missing values, k-means for imbalanced data, and principal component analysis (PCA) for sparse features. For continuous and discrete variables, the coefficient of determination R^{2} and the κ coefficient, respectively, were used to evaluate interpolation performance. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to estimate the model’s performance. To further evaluate the proposed approach, we carried out a case study using an ED data set obtained from the Hainan Hospital of Chinese PLA General Hospital. A logistic regression (LR) prediction model for patient condition worsening was built.
A total of 1085 patients with rescue records and 17,959 patients without rescue records were selected; the 2 groups were significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R^{2} of the RF continuous variable interpolation was 0.623 (IQR 0.647), and the median of the κ coefficient for discrete variable interpolation was 0.444 (IQR 0.285). The LR model constructed using the initial diagnostic data showed poor performance and variable separation, which was reflected in the abnormally high odds ratio (OR) values of the 2 variables of cardiac arrest and respiratory arrest (201568034532 and 1211118945, respectively) and an abnormal 95% CI. Using processed data, the recall of the model reached 0.746, the
The proposed systematic approach is valid for building a prediction model for emergency patients.
In the emergency department (ED), early identification of high-risk patients can improve clinical decisions, avoid waste of resources, and lead to better patient prognosis [
Prediction models for high-risk patients in EDs can greatly support caregivers [
Missing values, imbalanced data, and sparse features are 3 common problems of electronic medical record (EMR) data. Missing values arise when insufficient data are collected because of improper use of the hospital information system or for other reasons [
To solve the aforementioned 3 problems, we propose a series of machine learning (ML) approaches to increase fitting ability and generalization ability. Using this approach, we developed a sudden-death prediction model. The risk factors related to sudden death obtained through the logistic regression (LR) model were consistent with the results reported in the earlier literature on the analysis of risk factors of in-hospital death. These results show that our data-preprocessing approach can effectively preserve the rich information contained in emergency data and provide a reliable data source for the development of a sudden-death prediction model.
Our methods of data preprocessing consisted of 5 steps, as shown in
Workflow of ED data preprocessing and evaluation. ED: emergency department; EMR: electronic medical record.
Data for ED patient prediction model development are summarized in
Close investigation of each data table is required to locate the content of interest. For instance, data regarding a patient’s basic information are stored in the emg_visit table. Lab test items and results are stored in the lab_result and lab_master tables. The clinical record field in the emg_order table can be used to determine whether a sudden-death event occurred. One lab test (eg, blood test) can be performed multiple times to observe the patient status closely. Based on clinical experts’ opinions, only the last result is considered meaningful.
Description of the data tables involved in the query process.
Table name  Data description 
emg_drug_detail  The patient's medication record, including the prescription number, drug name, dosage, drug specification, administration time, and administration route during the treatment period 
emg_drug_master  Master record form of patient medication recording patient ID and prescription number 
emg_order  Doctor’s order record form used to record the medication, inspection, diagnosis, treatment, and other doctor’s orders of the patient during treatment 
emg_visit  Patient visit information table, including the patient's basic personal information, diagnosis of the current visit, triage, and other information 
lab_test_master  Patient’s laboratory test master record form recording the patient’s age and gender information, laboratory test items made during the visit, and the corresponding doctor’s order ID 
lab_result  Laboratory test results of patients 
The number of variables obtained from the data collection was large, so screening of important variables facilitated final analysis. Two approaches can be adopted. One is based on statistical significance. The other is based on the specific research objective, opinions of medical experts, or authoritative literature [
Missing values affect the effectiveness of ML models. Missing data show 3 different patterns: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that the missingness is completely random and does not depend on observed or unobserved values [
The goal of all interpolation methods is to reasonably estimate missing values and improve the quality of data. Interpolation methods are mainly divided into single interpolation and multiple interpolation. Multiple interpolation is commonly used and generally performs better. It generates multiple possible estimates for missing data and uses statistical inference to interpolate the final value. This method can reflect the randomness of missing data, and the interpolation error is smaller [
Altogether, the following steps are proposed.
1. For variable i, the patient samples without missing values for that variable serve as training samples, and the patient samples with missing values serve as test samples.
2. If other variables in the 2 sample sets are missing, the mean (continuous variable) or mode (discrete variable) is temporarily interpolated to form complete samples.
3. The training samples are used to train an RF model, and the model is then applied to the test samples to predict the missing values.
4. For the next variable, steps 1, 2, and 3 are repeated until all variables of the whole sample are interpolated.
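The steps above can be sketched in Python with scikit-learn as follows. This is a minimal illustration, not the authors' released code: the function name `rf_impute`, the `is_discrete` flag list, and the forest settings are assumptions, and the sketch assumes that the temporarily filled values (rather than earlier RF predictions) are used as predictors for every variable, a detail the steps leave open. It also assumes at least 2 columns and numerically coded categories.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def rf_impute(X, is_discrete, random_state=0):
    """Fill NaN entries column by column with a random forest (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 2: temporary fill (mean for continuous, mode for discrete) so that
    # the predictor columns are complete while one column is being modelled.
    filled = X.copy()
    for j in range(X.shape[1]):
        observed = X[~missing[:, j], j]
        if observed.size == 0:
            raise ValueError(f"column {j} has no observed values")
        if is_discrete[j]:
            values, counts = np.unique(observed, return_counts=True)
            fill = values[np.argmax(counts)]
        else:
            fill = observed.mean()
        filled[missing[:, j], j] = fill

    result = filled.copy()
    for j in range(X.shape[1]):  # Step 4: repeat for every variable
        rows_missing = missing[:, j]
        if not rows_missing.any():
            continue
        others = np.delete(filled, j, axis=1)
        # Step 1: rows with an observed value form the training samples.
        Model = RandomForestClassifier if is_discrete[j] else RandomForestRegressor
        model = Model(n_estimators=100, random_state=random_state)
        model.fit(others[~rows_missing], X[~rows_missing, j])
        # Step 3: predict the missing entries of this variable.
        result[rows_missing, j] = model.predict(others[rows_missing])
    return result
```

Observed values are returned unchanged; only the entries that were originally missing are replaced by RF predictions.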
Imbalanced data refer to the imbalanced distribution of negative and positive samples. For example, in the classification of rare diseases and credit predictions, there could be far more negative samples than positive ones. Because most ML algorithms assume that categories (eg, positive or negative) of samples are evenly distributed, classification models trained with imbalanced data are more likely to classify a new sample into the majority category [
Basic solutions for imbalanced data are to use under- or oversampling to make the data balanced, such as random oversampling [
To avoid the loss of important samples, we adopted k-means based on the Euclidean distance to cluster samples. Clustering groups the samples so that samples in the same cluster have similar characteristics and samples in different clusters are distinct. The centroid of a cluster represents the overall characteristics of the whole cluster. In this way, important features are not discarded. Since the centroid of a cluster is calculated as the average of the samples in the cluster, the centroid is not necessarily a real sample. Therefore, we selected the real samples with the smallest distance from the centroid.
Sparse features mean that the number of feature dimensions is much larger than the actual number of nonzero features. In total, there were 891 different types of diagnosis in our data set. However, for a single patient, the number of diagnoses was very small. This produced the sparse-feature phenomenon.
When sparse features occur, the sample is prone to having the problem of variable separation and multicollinearity. That is, a single variable or a linear combination of multiple variables can perfectly predict outcome events. However, this only works for small-size samples. It also leads to the situation in which the model gives an abnormally large weight to the variables and the results are unreliable [
The processing of sparse features can be considered from both the model and the data themselves. From the point of view of the model, the parameter estimation bias of high-dimensional sparse data can be reduced through the optimization of the algorithm. For example, Firth regression [
PCA has been widely used in analysis with high-dimensional sparse features [
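The centroid-based undersampling described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper name `kmeans_undersample` is hypothetical, and the sketch assumes that the number of clusters equals the number of majority-class samples to keep, with one real sample retained per centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def kmeans_undersample(X_major, n_keep, random_state=0):
    """Return indices of real majority samples nearest to the k-means centroids."""
    X_major = np.asarray(X_major, dtype=float)
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=random_state)
    km.fit(X_major)
    # Centroids are cluster means, not real patients, so each centroid is
    # mapped to the closest real sample by Euclidean distance.
    nearest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_major)
    # Two centroids can share a nearest sample; duplicates are dropped.
    return np.unique(nearest)
```

For a target ratio of 1:r with n_pos positive samples, `n_keep` would be set to r × n_pos. Because duplicates are removed, the function can return slightly fewer than `n_keep` indices.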
In detail, new data can replace the original data as the input source for regression or classification models. Suppose
As C is a real symmetric matrix of order m, it must have m orthonormal eigenvectors. That is,
Take the first k columns of V as the basis for transforming m-dimensional features into k-dimensional features and record it as
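For reference, the derivation above corresponds to the standard PCA formulation, which can be written as follows. The symbols C, V, m, and k follow the text; the data matrix X, the sample count n, the eigenvalue matrix Λ, the basis name V_k, and the projected data Y are notational assumptions introduced here, and the 1/n scaling of the covariance matrix is one common convention.

```latex
\begin{aligned}
X &\in \mathbb{R}^{m \times n} && \text{(each row is one centered feature, each column one sample)} \\
C &= \frac{1}{n} X X^{\top} \in \mathbb{R}^{m \times m} \\
C &= V \Lambda V^{\top}, \quad V = [v_1, \dots, v_m], \quad V^{\top} V = I, \quad
     \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_m), \quad \lambda_1 \ge \dots \ge \lambda_m \ge 0 \\
V_k &= [v_1, \dots, v_k] \in \mathbb{R}^{m \times k} \\
Y &= V_k^{\top} X \in \mathbb{R}^{k \times n} \\
\text{explained variance} &= \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{m} \lambda_i}
\end{aligned}
```

The last line gives the share of the total variance retained by the first k principal components, which is the quantity used to choose k.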
First, we manually merged similar diagnostic nouns according to prior knowledge, reducing the number of diagnoses from 891 to 405. However, the data were still obviously separated and sparse. For instance, none of the negative samples had a sudden cardiac arrest or sudden respiratory arrest diagnosis. Next, we only kept the diagnoses that appeared in more than 5% of the population. Finally, PCA was applied to the remaining variables. The first 17 PCs, which could explain 98.2% of the variance of the original sample, were selected. Regression analysis was carried out on the samples after dimensionality reduction. The variables were interpreted by examining the weight of the original variables on each PC.
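The two selection rules in this paragraph, the 5% prevalence filter and the choice of the number of PCs by cumulative explained variance, can be sketched in plain Python. The function names and the input layout (one list of diagnosis names per patient, and a list of per-component explained-variance ratios such as scikit-learn's `PCA.explained_variance_ratio_`) are assumptions made for illustration.

```python
def frequent_diagnoses(patient_diagnoses, min_share=0.05):
    """Keep diagnoses recorded in more than `min_share` of patients."""
    n = len(patient_diagnoses)
    counts = {}
    for diagnoses in patient_diagnoses:
        for name in set(diagnoses):  # count each diagnosis once per patient
            counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, c in counts.items() if c / n > min_share)


def n_components_for(explained_ratios, target=0.982):
    """Smallest number of leading PCs whose cumulative explained variance reaches `target`."""
    total = 0.0
    for k, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= target:
            return k
    return len(explained_ratios)  # target not reached: keep every component
```

The default `target=0.982` mirrors the variance share reported for the 17 selected PCs; whether the authors fixed the threshold first or the number of PCs first is not stated.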
After preliminary review, the project was found to be in line with relevant medical ethics requirements. If it is funded by the Hainan Major Science and Technology Program in 2020, the Hainan Medical Ethics Committee will perform its duties and strictly abide by relevant regulations and requirements for medical ethics and informed consent of patients to ensure ethical supervision and review during the implementation of the project (reference number: 00824482406).
A comprehensive evaluation was carried out on the ED data set of the Hainan Hospital of Chinese PLA General Hospital. We developed a set of Python programs to implement our methods. Specifically, the programs were developed on Microsoft Windows 10 (Intel(R) Core(TM) i5-9500 CPU, 3 GHz). All data preprocessing and model building were completed in Python (Python 3.8, Anaconda) using multiple Python data science libraries, mainly NumPy, pandas, Matplotlib, and scikit-learn. In addition, the code for data interpolation, imbalance correction, and PC regression is currently available on GitHub [
We collected the data of patients who visited the ED of the Hainan Hospital of Chinese PLA General Hospital from July 27, 2017, to May 6, 2021. In the sudden-death group, the data of 1085 patients were collected. In the nonsudden-death group, the data of 17,959 patients were collected. For the analysis of laboratory test data, we excluded patients who did not have any laboratory test records before sudden death. A total of 108 (10%) patients were excluded, and 977 (90%) patients with sudden death were used for the analysis of laboratory test data. For diagnostic data, we excluded patients who were missing diagnostic data from the visit. Finally, there were 1083 patients with sudden death and 615 patients with nonsudden death. We compiled statistics on the baseline data of all patients, as shown in Supplementary Table S1 in
In the first group, there were 741 males (68.4%) and 342 females (31.6%), and 2 (0.2%) patients lacked gender information (
Distribution of the gender of patients with sudden death.
Distribution of the gender of patients without sudden death.
Distribution of age of patients with sudden death.
Distribution of age of patients without sudden death.
To perform variable screening, that is, filtering out insignificant variables, we counted how often each variable appeared and how often it was missing. The second row of
There were 275 variables in the lab test category. For a given variable, not every patient (sample) had a value; the absent entries are missing values. The missing ratio of a variable is the number of patients with a missing value for that variable divided by the total number of patients. The average ratio was 79.8%, as shown in the third row of
For diagnosis, 891 different types of diagnosis were obtained after the initial data collection. Because the diagnosis is recorded in the form of free text, 1 diagnosis item could have several different synonyms. By merging these texts into a unified name via manual review, we obtained 405 variables. The number of confirmed patients for each diagnostic variable was counted. Instead of an 80% threshold, a 5% threshold was applied. Considering both positive and negative samples, 18 diagnostic variables were kept. Among them, 11 (61.1%) variables were shared by both. The 18 retained variables were myocardial infarction, chest distress, sudden cardiac arrest, fever, rib fracture, renal dysfunction, chest pain, diabetes, abdominal pain, pulmonary infection, respiratory arrest, trauma, atrial fibrillation, disturbance of consciousness, cerebral hemorrhage, cerebral infarction, coronary heart disease, and hypertension.
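The missing ratio defined above is a simple proportion. A minimal sketch, assuming missing entries are represented as `None` (the function name and data layout are illustrative):

```python
def missing_ratio(values):
    """Share of patients whose value for one variable is missing (None)."""
    return sum(v is None for v in values) / len(values)


# Worked check against the reported laboratory-test average:
# 866 of 1085 patients without a value.
average_lab_ratio = 866 / 1085
print(f"{average_lab_ratio:.1%}")  # prints 79.8%
```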
Missing value ratios of variables of patients with sudden death.

Laboratory tests (275 variables)  Medications (402 variables)  Diagnosis (891 variables) 
Patients without data, n (%)  108 (10%)  287 (26.4%)  2 (0.18%) 
Average ratio of missing values  79.8% (866/1085)  72.4% (786/1085)  99% (1080/1085) 
Maximum ratio of missing values  90% (977/1085)  73.5% (797/1085)  100% (1085/1085) 
Minimum ratio of missing values  25.8% (280/1085)  48.5% (526/1085)  58.4% (634/1085) 
Except for age and gender, we used an RF to interpolate the missing values for each of the remaining variables. Nonmissing patient data were used as a training set to train the model to interpolate missing values. The training set was further split into training data (80%) and validation data (20%). The coefficient of determination R^{2} and the κ coefficient were used to test the consistency of the imputation results of continuous variables and categorical variables, respectively. In the interpolation process, the median of R^{2} was 0.623 (IQR 0.647) and the median of the κ coefficient was 0.444 (IQR 0.285).
Due to the extreme imbalance of our original data, the number of patients with sudden death only accounted for 5% (977/18,936) of the total sample size. We generated 4 different data ratios (1:10, 1:5, 1:2, and 1:1) through k-means to achieve undersampling. These data sets, together with the data at the original ratio, were used to evaluate models trained at different data ratios and thereby verify the rationality of our sampling method.
We constructed an LR model to analyze the patients’ laboratory test variables using a data set with a data ratio of 1:1 as the data source to filter variables. To reflect the degree of correlation between variables, continuous variables were treated as ordinal categorical variables. Taking the normal index range of the variables as a reference point, the test results of the patients were mapped into 3 categories: L (index is lower than the normal value), N (index is normal), and H (index is higher than the normal value). To determine the significant factors affecting the sudden death of patients and avoid a negative effect on the final analysis results, we first performed the chi-square test to filter out the variables and then excluded variables when
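The two agreement measures used on the validation data can be written out directly. The sketch below assumes that the κ coefficient refers to Cohen's κ, which the text does not state explicitly; in scikit-learn the equivalent functions are `r2_score` and `cohen_kappa_score`.

```python
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement for both measures; a κ of 0 means the imputed categories agree with the held-out categories no better than chance.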
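As a worked check of these figures, the class share and the majority-class size needed for each ratio can be computed from the reported counts. The sketch applies the ratios to the full laboratory-test cohort purely for illustration; in the study the undersampling was applied to the training split, so the actual counts differ.

```python
n_positive = 977      # patients with sudden death and laboratory data
n_negative = 17_959   # patients without sudden death

positive_share = n_positive / (n_positive + n_negative)

# Majority-class samples to retain for each positive:negative ratio 1:r.
targets = {f"1:{r}": n_positive * r for r in (10, 5, 2, 1)}
print(f"{positive_share:.1%}", targets)
```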
Group 1: qualitative test of creatinine, serum uric acid, urine protein
Group 2: γ-glutamyl transferase, alanine aminotransferase, total bilirubin
Group 3: international normalized ratio, platelet count, plasma prothrombin time
Group 4: potassium, creatine kinase
Group 5: urine specific gravity, chloride, hematocrit, sodium, magnesium, lactate dehydrogenase, urine ketone body test, red blood cell count, serum albumin
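The 3-level coding of laboratory results described before the variable groups can be sketched as follows. The reference ranges for potassium and sodium are those quoted in the article's tables; the function name, the dictionary layout, and the treatment of values exactly on a boundary as normal are assumptions.

```python
def categorize(value, low, high):
    """Map a laboratory result to L, N, or H relative to its reference range."""
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"  # boundaries are treated as normal (assumption)


# Reference ranges in mmol/L; the patient values below are made up.
REFERENCE = {"potassium": (3.5, 5.1), "sodium": (135.0, 145.0)}
patient = {"potassium": 5.8, "sodium": 131.0}
coded = {name: categorize(value, *REFERENCE[name]) for name, value in patient.items()}
```

Here `coded` labels the potassium value as H and the sodium value as L, which is the form in which the variables enter the chi-square test and the LR model.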
For each group, 500-fold bootstrapping was used for model training and evaluation [
After determining the patient features for analysis, we split the original-scale data into a training set (70%) and a test set (30%). For the training set, 4 different categories of data sets (1:1, 1:2, 1:5, 1:10) were formed by undersampling to train the model. Finally, the performance of the model was evaluated on the test set. The mean and 95% CI (500-fold bootstrapping) of the final AUROC, AUPRC,
In general, as the data ratio tended to balance, the performance of the model gradually improved.
Statistics of variables filtered by the chi-square test.
Variable  Chi-square (df)  P value 
Monocytes  5.433 (6)  .49 
Basophil  0.705 (4)  .95 
Eosinophils  0.977 (4)  .91 
Urine specific gravity determination  0 (2)  .99 
Urine tube type  1.25 (4)  .87 
Urine tube type (microscopic examination)  6.863 (8)  .98 
Qualitative test of urinary bilirubin  13.185 (4)  .21 
Mean erythrocyte hemoglobin concentration  7.828 (6)  .25 
Chloride  4.649 (6)  .59 
Erythrocyte volume distribution width measurement coefficient of variation (CV)  1.148 (4)  .89 
Hematocrit assay  4.982 (6)  .55 
Sodium  7.915 (6)  .24 
Magnesium  10.22 (6)  .12 
Statistics of variables screened by LR^{a} univariate analysis.
Variable  Reference range  OR^{b} (95% CI)  P value 
Lactate dehydrogenase  50.0-150.0 U/L  1.029 (0.94-1.127)  .53 
Urine ketone body test  N/A^{c}  0.912 (0.769-1.081)  .29 
Red blood cell count  3.5-5.9 ×10^{12}/L  0.827 (0.642-1.065)  .14 
Serum albumin  35.0-50.0 g/L  0.893 (0.689-1.157)  .39 
High-density lipoprotein cholesterol  1.0-1.6 mmol/L  0.961 (0.749-1.232)  .75 
^{a}LR: logistic regression.
^{b}OR: odds ratio.
^{c}N/A: not applicable.
Comparing the performance of 5 groups of variables.
Group  Recall  AUROC^{a} (95% CI)  
1  0.478  0.6  0.683 (0.681-0.684) 
2  0.801  0.835  0.843 (0.842-0.844) 
3  0.606  0.687  0.725 (0.724-0.727) 
4  0.484  0.605  0.686 (0.685-0.687) 
5  0.852  0.651  0.562 (0.561-0.564) 
^{a}AUROC: area under the receiver operating characteristic curve.
LR^{a} multivariate analysis.
Variable  Reference range  OR^{b} (95% CI) 
γ-Glutamyl transferase  0.0-50.0 U/L  0.225 (0.222-0.228) 
Alanine aminotransferase  5.0-40.0 U/L  1.828 (1.804-1.852) 
Total bilirubin  0.0-21.0 μmol/L  19.954 (19.7-20.2) 
Creatinine  30.0-110.0 μmol/L  1.352 (1.331-1.372) 
Serum uric acid  104.0-444.0 μmol/L  1.346 (1.334-1.359) 
International normalized ratio  0.8-1.2  2.23 (2.188-2.272) 
Creatine kinase  24.0-320.0 U/L  2.457 (2.431-2.483) 
Platelet count  100.0-300.0 ×10^{9}/L  0.623 (0.617-0.629) 
Potassium  3.5-5.1 mmol/L  1.057 (1.043-1.07) 
Gender  Female  0.183 (0.182-0.184) 
Sodium  135-145 mmol/L  2.182 (2.102-2.262) 
Magnesium  0.8-1.0 mmol/L  4.807 (4.587-5.027) 
Chloride  96.00-106.00 mmol/L  0.615 (0.603-0.627) 
Serum albumin  35-51 g/L  1.284 (1.268-1.3) 
^{a}LR: logistic regression.
^{b}OR: odds ratio.
ROC curves of different data ratios. AUC: area under the curve; ROC: receiver operating characteristic.
PR curves of different data ratios. AUPRC: area under the precision-recall curve; PR: precision-recall.
Visualization of logistic regression coefficients.
We used interpolated and undersampled data (data ratio 1:1) to train several other ML models and evaluate their performance. The training models included an RF [
The final sample included 1083 patients with sudden death and 615 patients with nonsudden death.
We used 500-fold bootstrapping for internal validation of the model. For each bootstrap, 70% of the samples were randomly selected as the training set and 30% as the test set to evaluate the model. The final reported model performance was the mean and 95% CI of 500 results [
The first 17 PCs that could explain 98.2% of the variance of the original sample were selected as new variables for analysis. To observe the role of PCA, we compared the 2 schemes: the LR model using the original data and the LR model after dimensionality reduction using PCA. The LR model trained with the original data obtained a recall rate of 0.445 (95% CI 0.443-0.448), an
To determine the impact of various diagnostic variables on the sudden death of emergency patients, we statistically analyzed the results of multivariate analysis on 17 PCs input into the LR model. The ORs of PC4, PC5, and PC6 were 3.044, 2.859, and 3.931, respectively, showing a significant correlation with sudden-death events (
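The validation loop described here, repeated random 70/30 splits followed by a mean and 95% CI over the resulting scores, can be sketched with the Python standard library. The function name and the `evaluate(train, test)` callback are hypothetical, and the interval is computed as a normal-approximation CI of the mean, which is an assumption because the text does not say how the CI was derived.

```python
import random
import statistics


def repeated_split_ci(samples, evaluate, n_rounds=500, train_share=0.7, seed=0):
    """Mean and 95% CI of a score over repeated random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_share)
        # `evaluate` trains on the first part and scores on the held-out part.
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half_width, mean + half_width)
```

In practice, `evaluate` would fit the LR model on the training part and return, for example, the AUROC on the test part. With 500 rounds the interval around the mean becomes narrow, which is consistent with the tight CIs reported in the tables.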
Statistics of people diagnosed.
Variable  People with sudden death diagnosed, n (%)/people with nonsudden death diagnosed, n (%) 
Myocardial infarction  57 (5.26)/23 (3.74) 
Chest tightness  8 (0.74)/35 (5.69) 
Cardiac arrest  120 (11.08)/0 
Fever  50 (4.62)/43 (6.99) 
Rib fracture  58 (5.36)/3 (0.49) 
Abnormal renal function  42 (3.88)/35 (5.69) 
Chest pain  18 (1.66)/38 (6.18) 
Diabetes  65 (6.00)/66 (10.73) 
Abdominal pain  30 (2.77)/45 (7.32) 
Pulmonary infection  85 (7.85)/64 (10.41) 
Respiratory arrest  106 (9.79)/0 
Trauma  58 (5.36)/16 (2.60) 
Atrial fibrillation  39 (3.60)/33 (5.37) 
Disturbance of consciousness  82 (7.57)/17 (2.76) 
Cerebral hemorrhage  77 (7.11)/26 (4.23) 
Cerebral infarction  75 (6.93)/71 (11.54) 
Coronary heart disease  29 (2.68)/39 (6.34) 
Hypertension  65 (6.00)/106 (17.24) 
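The zero counts for cardiac arrest and respiratory arrest in the nonsudden-death group are what produce variable separation: the sample odds ratio is unbounded, so an unpenalized LR coefficient has no finite estimate. The sketch below computes the odds ratio from the counts in the table above, using 1083 and 615 as the group sizes. The 0.5 continuity correction (Haldane-Anscombe) is shown only as a standard illustration of a finite estimate and is not the method used in this study.

```python
def odds_ratio(exposed_cases, total_cases, exposed_controls, total_controls, correction=0.0):
    """Sample odds ratio from a 2x2 table; `correction` is added to every cell."""
    a = exposed_cases + correction
    b = total_cases - exposed_cases + correction
    c = exposed_controls + correction
    d = total_controls - exposed_controls + correction
    if b * c == 0:
        return float("inf")  # an empty cell in the denominator: separation
    return (a * d) / (b * c)


cardiac_arrest = odds_ratio(120, 1083, 0, 615)                    # unbounded
cardiac_arrest_corrected = odds_ratio(120, 1083, 0, 615, 0.5)     # finite
hypertension = odds_ratio(65, 1083, 106, 615)                     # no empty cell
```

Hypertension, which occurs in both groups, yields an ordinary finite odds ratio, whereas cardiac arrest does not unless a correction or a dimensionality reduction such as PCA is applied.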
ROC curves of 2 models. AUC: area under the curve; LR: logistic regression; PCA: principal component analysis; ROC: receiver operating characteristic.
PC^{a} regression results
PC  OR^{b} (95% CI) 
1  0.239 (0.235-0.242) 
2  2.429 (2.383-2.476) 
3  1.19 (1.126-1.253) 
4  3.044 (2.948-3.141) 
5  2.859 (2.687-3.031) 
6  3.931 (3.714-4.148) 
7  1.49 (1.405-1.575) 
8  1.699 (1.562-1.836) 
9  2.104 (1.949-2.259) 
10  2.153 (2.016-2.289) 
11  2.451 (2.191-2.711) 
12  2.031 (1.855-2.206) 
13  1.457 (1.339-1.575) 
14  0.949 (0.863-1.034) 
15  1.423 (1.231-1.614) 
16  2.546 (2.221-2.871) 
17  0.182 (0.164-0.201) 
^{a}PC: principal component.
^{b}OR: odds ratio.
In this paper, 3 ML schemes were proposed to deal with missing, imbalanced, and sparse features in the process of developing sudden-death prediction models using emergency medicine data, which improved the performance of the developed model. To solve the problem of missing data, we proposed an RF method that uses real data to interpolate missing data. In the interpolation process, the consistency of the interpolation results was checked using the coefficient of determination R^{2} and the κ coefficient. From the interpolation results, the method shows the ability to correctly interpolate missing data. Imbalanced data are not conducive to obtaining accurate analysis results, and the model will be more inclined to predict new samples as patients with nonsudden death [
At present, there are many studies on the prediction of sudden death. Yu et al [
This work also has some limitations. On the one hand, we only considered a single ML algorithm for data interpolation and did not discuss or compare other ML algorithms that could be applied to interpolation, so we may have overlooked better-performing methods. For example, because a large proportion of our categorical variables were missing and seriously imbalanced, the κ coefficient remained poor even though it improved to a certain extent when we trained the model on a relatively balanced data set. Therefore, ML methods that can handle a large number of missing and unbalanced categories, or more reasonable feature processing, may achieve better imputation results. Imbalance correction can improve the sensitivity and specificity of the model and avoid biased errors. However, this correction also weakens the clinical application value of the model because it lowers the calibration ability of the model and makes it unable to accurately estimate the risk probability of patients. For the prediction model, the calibration ability was not high, even on the original-scale data set. Model calibration is another important characteristic for evaluating the clinical significance of prediction models. A well-calibrated model can provide more useful information for clinical decisions [
Our work proposes to use ML methods to deal with data quality issues, such as missing data, data imbalance, and sparse features in emergency data, so as to improve data availability. In addition, the risk factors of sudden death in emergency patients are obtained from our model analysis. As a preliminary analysis, this result also provides a basis for feature selection and data analysis when ML algorithms are later used to build a prediction model of sudden death in emergency patients.
Supplementary Tables and Figures.
area under the curve
area under the precision-recall curve
area under the receiver operating characteristic curve
emergency department
electronic medical record
early real-time early warning system
gradient boosting machine
least absolute shrinkage and selection operator
linear discriminant analysis
missing at random
missing completely at random
machine learning
missing not at random
logistic regression
odds ratio
principal component
principal component analysis
precision-recall
random forest
receiver operating characteristic
sudden cardiac death
support vector machine
The authors would like to show their appreciation to the engineers working in the Information Centre (ICT) department of the Hainan Hospital of Chinese PLA General Hospital for their help with data preparation.
The publication of this paper was funded by grants from the National Natural Science Foundation of China (no. 82102187) and the Hainan Natural Science Foundation Youth Fund (no. 620QN380).
The data sets used and analyzed during this study are available from the first author upon reasonable request.
XC carried out the methodological study and drafted the manuscript. HC collected and processed the data and drafted the manuscript. SN made the conceptual design and made critical revisions to the manuscript. XK reviewed the methodology and reviewed the manuscript. HD also reviewed the manuscript. HD conceptualized the study and performed a critical review.
None declared.