This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Redundancy in laboratory blood tests is common in intensive care units (ICUs), affecting patients’ health and increasing health care expenses. Medical communities have made recommendations to order laboratory tests more judiciously. Wise selection can rely on modern data-driven approaches that have been shown to help identify low-yield laboratory blood tests in ICUs. However, although conditional entropy and conditional probability distribution have shown the potential to measure the uncertainty of yielding an abnormal test, no previous studies have adapted these techniques to include them in machine learning models for predicting abnormal laboratory test results.
This study aimed to address the limitations of previous reports by adapting conditional entropy and conditional probability to extract features for predicting abnormal laboratory blood test results.
We used an ICU data set collected across Alberta, Canada, which included 55,689 ICU admissions from 48,672 patients. We investigated the features of conditional entropy and conditional probability by comparing the performances of 2 machine learning approaches for predicting normal and abnormal results for 18 blood laboratory tests. Approach 1 used patients’ vitals, age, sex, and admission diagnosis as features. Approach 2 used the same features plus the new conditional entropy–based and conditional probability–based features. Both approaches used 4 different machine learning models (fuzzy model, logistic regression, random forest, and gradient boosting trees) and 10 metrics (sensitivity, specificity, accuracy, precision, negative predictive value [NPV], F_{1} score, area under the curve [AUC], precision-recall AUC, mean G, and index balanced accuracy) to assess the performance of the approaches.
Approach 1 achieved an average AUC of 0.86 for all 18 laboratory tests across the 4 models (sensitivity 78%, specificity 84%, precision 82%, NPV 75%, F_{1} score 79%, and mean G 81%), whereas approach 2 achieved an average AUC of 0.89 (sensitivity 84%, specificity 84%, precision 83%, NPV 81%, F_{1} score 83%, and mean G 84%). We found that the inclusion of the new features resulted in significant differences for most of the metrics in favor of approach 2. Sensitivity significantly improved for 8 and 15 laboratory tests across the different classifiers (minimum
The findings suggest that conditional entropy–based features and pretest probability improve the capacity to discriminate between normal and abnormal laboratory test results. Detecting the next laboratory test result is an intermediate step toward developing guidelines for reducing overtesting in the ICU.
Redundancy in laboratory blood tests is common in health care [
One of the areas greatly experiencing laboratory blood test redundancy is ICUs, in which daily blood tests are performed to monitor physiological functions and define clinical management strategies. Previous reports have underscored overtesting in ICUs. In a study conducted in an ICU of a tertiary hospital in Ontario, Canada, physicians retrospectively analyzed 694 blood tests performed over 4 weeks and concluded that only 48.7% of those tests were essential [
To reduce redundancy in the ICU, the Choosing Wisely campaign has made recommendations to order laboratory tests judiciously [
Modern data-driven approaches can help identify redundant laboratory blood tests in ICUs [
In addition to using data-driven approaches to describe redundancy in the ICU, other reports have used electronic medical record (EMR) data collected during the ICU stay to predict whether ordering a new blood test would provide new information. Cismondi et al [
More complex models based on deep learning have also been used to recommend laboratory reduction strategies. Yu et al [
Although previous approaches have proven effective in identifying unnecessary blood tests, none have used conditional entropy and pretest probability [
In this study, we adapted conditional entropy and pretest probability techniques to derive features to predict normal and abnormal laboratory test results. Our rationale is that by identifying whether the next laboratory test would yield a normal or abnormal result, medical professionals could decide on the necessity of such a test based on their experience and the patient’s diagnosis and disease severity. To evaluate the effect of the inclusion of new types of features, we compared the performance of 2 machine learning approaches for predicting normal or abnormal laboratory test results on largescale ICU data from Alberta, Canada. The difference between the 2 approaches was that only the second approach included new features based on conditional entropy and conditional probability.
This retrospective study was conducted using the Alberta ICU data set collected from 17 ICUs, comprising 55,689 ICU admissions from 48,672 deidentified unique patients admitted between February 2012 and December 2019. The primary data source was eCritical, an EMR-based data repository containing the device and laboratory data in use in all ICUs across Alberta.
The use of the ICU data set was approved by the Conjoint Health Research Ethics Board at the University of Calgary (reference number REB170389).
We focused on 18 laboratory blood tests that are common and critical in the ICU (
Blood laboratory tests and reference normal ranges for tests selected for analysis.^{a}
Laboratory test  Normal range  Total records, N
Potential of hydrogen: arterial (pH)  7.20-7.40  668,388
PaO_{2}^{b} (mm Hg)  70-90  668,130
PCO_{2}^{c} (mm Hg)  35-45  667,889
Blood potassium (mmol/L)  3.5-5.0  400,306

If male  140-175  398,436
If female  123-153  398,436

If age (years) <90  136-145  396,431
If age (years) ≥90  132-146  396,431

If male  0.42-0.50  395,046
If female  0.36-0.45  395,046

White blood cells (E+9 units/L)  4.5-11.0  394,809

If age (years) ≤60  23-29  390,906
If age (years) >60 and ≤90  23-31  390,906
If age (years) >90  20-29  390,906

If male and age (years) <60  80-115  370,361
If male and age (years) ≥60  71-115  370,361
If female and age (years) <60  53-97  370,361
If female and age (years) ≥60  53-106  370,361

If male and age (years) ≤55  3.0-9.0  295,445
If male and age (years) >55  3.0-8.0  295,445
If female and age (years) ≤55  3.0-8.0  295,445
If female and age (years) >55  2.0-7.0  295,445

Random glucose (mmol/L)  3.3-11.0  225,627

If male  0-60  136,552
If female  0-40  136,552

Total bilirubin (µmol/L)  1.71-20.5  133,806
Alkaline phosphatase (U/L)  40-120  128,773
Blood albumin (g/L)  30.0-45.0  102,923

If male  10-40  98,399
If female  9-32  98,399

If male  0-80  36,095
If female  0-50  36,095
^{a}For some laboratory tests, the reference values depend on patients’ sex and age [
^{b}PaO_{2}: partial pressure of oxygen (arterial).
^{c}PCO_{2}: partial pressure of carbon dioxide (arterial).
This study compared 2 approaches to predict normal and abnormal blood laboratory tests performed in the ICU. The prediction was performed for every laboratory test except the first one performed each day, whose value was used as a feature in both approaches.
A framework to compare the 2 redundancy detection approaches. AUC: area under the curve; CV: cross-validation; IBA: index balanced accuracy; ICU: intensive care unit; ML: machine learning; NPV: negative predictive value; PRAUC: precision-recall area under the curve.
ICU admissions that met the following inclusion criteria were included in the study: patient age >18 years; at least one measurement of each of heart rate, respiration rate, blood pressure, temperature, oxygen saturation, and urine output; and ≥2 orders for at least one of the 18 laboratory blood tests in
ICU admissions satisfying these inclusion criteria were split into 10 folds. Laboratory tests from the same admission were assigned to the same fold, ensuring that ICU admissions were mutually exclusive among the folds.
In approach 1, the variables used to predict the abnormal results of the next laboratory blood test were heart rate (beats per minute), oxygen saturation (%), respiration rate (breaths per minute), temperature (°C), blood pressure (mm Hg), and total urine output (mL). These measurements were selected because bedside monitors commonly collect them in large quantities regardless of patients’ admission diagnosis. We also included patients’ sex, age, and admission diagnosis. Age and sex were included as they affect the normality of the laboratory test results (
Owing to different sampling rates, laboratory blood tests and patients’ vital measurements do not always occur simultaneously. We corrected the misalignment between laboratory tests and patients’ vitals following the steps in the study by Cismondi et al [
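The alignment idea can be sketched with pandas’ `merge_asof`, which pairs each laboratory draw with the most recent vital-sign measurement taken at or before it. The frames and column names below are hypothetical, and the exact matching rules of Cismondi et al may differ.

```python
import pandas as pd

# Hypothetical example: two lab draws and a sparser stream of vital signs.
labs = pd.DataFrame({
    "charttime": pd.to_datetime(["2019-01-01 08:05", "2019-01-01 14:10"]),
    "lab_value": [4.1, 5.6],
}).sort_values("charttime")

vitals = pd.DataFrame({
    "charttime": pd.to_datetime(
        ["2019-01-01 07:00", "2019-01-01 08:00", "2019-01-01 13:00"]),
    "heart_rate": [88, 92, 101],
}).sort_values("charttime")

# For each lab draw, take the most recent vital measurement at or before it.
aligned = pd.merge_asof(labs, vitals, on="charttime", direction="backward")
```

Here the 08:05 draw is paired with the 08:00 heart rate and the 14:10 draw with the 13:00 heart rate.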
The pretest probability was calculated as the conditional probability of yielding a normal value, given that a specific number of previous consecutive laboratory tests were normal. This probability was calculated on the training admissions by following the procedure presented by Roy et al [
Here,
The pretest probability distribution was calculated using only ICU admissions from the training set. The feature values for the held-out fold were calculated using the pretest probability distribution obtained with the training folds.
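As a sketch of the idea (using a simplified 0/1 encoding of abnormal/normal rather than the exact procedure of Roy et al), the pretest probability for each streak length k can be estimated by counting, across training admissions, how often a normal result follows k consecutive normal results:

```python
from collections import defaultdict

def pretest_probability(sequences):
    """Estimate P(next result normal | k previous consecutive normal results).

    `sequences` is a list of per-admission result sequences, where each
    result is 1 for normal and 0 for abnormal (an illustrative encoding).
    """
    counts = defaultdict(lambda: [0, 0])  # k -> [normal_next, total]
    for seq in sequences:
        run = 0  # length of the current streak of normal results
        for result in seq:
            counts[run][1] += 1
            if result == 1:
                counts[run][0] += 1
                run += 1
            else:
                run = 0
    return {k: normal / total for k, (normal, total) in counts.items()}

probs = pretest_probability([[1, 1, 0, 1], [1, 1, 1, 1]])
```

At prediction time, the feature value for a new test would be looked up from this dictionary using the patient’s current streak of normal results.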
Entropy measures the expected amount of information in a random variable. Conditional entropy measures the expected amount of information that remains in a random variable once the value of a second random variable is known, described as follows:
Here,
We adapted conditional entropy to measure the expected amount of information of a test result if a patient’s features were already known. This conditional entropy was calculated for all the features of approach 1. The conditional entropy for each feature was calculated as follows:
Here,
To estimate the conditional probability distribution for each patient’s feature, we grouped each feature into a histogram with a bin width defined by the Freedman-Diaconis rule [
Here, IQR(
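The Freedman-Diaconis bin width, h = 2·IQR(x)/n^(1/3), can be computed directly; NumPy also provides it via `np.histogram_bin_edges(x, bins="fd")`. A minimal version:

```python
import numpy as np

def fd_bin_width(x):
    """Freedman-Diaconis bin width: h = 2 * IQR(x) / n**(1/3)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) / len(x) ** (1.0 / 3.0)

# For x = 1..8: IQR = 6.25 - 2.75 = 3.5 and n**(1/3) = 2, so h = 3.5.
h = fd_bin_width(range(1, 9))
```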
Similar to the pretest probability, the conditional entropy distribution was calculated using only ICU admissions from the training folds. For the held-out fold, values were obtained from the distribution derived from the training folds.
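A minimal empirical version of the conditional entropy computation, assuming the feature has already been discretized into histogram bins and the test result is binary, could look like:

```python
import numpy as np

def conditional_entropy(x_bins, y):
    """Empirical H(Y | X) in bits, where `x_bins` holds discretized feature
    values (e.g., histogram bin indices) and `y` is the binary test result."""
    x_bins = np.asarray(x_bins)
    y = np.asarray(y)
    n = len(y)
    h = 0.0
    for xv in np.unique(x_bins):
        mask = x_bins == xv
        p_x = mask.sum() / n                      # P(X = xv)
        _, counts = np.unique(y[mask], return_counts=True)
        p_y_given_x = counts / counts.sum()       # P(Y | X = xv)
        h -= p_x * np.sum(p_y_given_x * np.log2(p_y_given_x))
    return h
```

When the feature fully determines the result, H(Y|X) is 0; when the result is uniformly random regardless of the feature, H(Y|X) is 1 bit.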
We used 4 different classifiers to perform the comparison between approaches 1 and 2: (1) fuzzy modeling, (2) logistic regression (LR), (3) random forest (RF), and (4) gradient boosting (GB) trees.
For all classifiers, the features of the training folds and the held-out fold were standardized before training the models using minimum-maximum (min-max) normalization to avoid any impact of feature scale on performance. Normalization was performed using the maximum and minimum values from the training set as a reference.
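The key point of this step is that the scaling parameters come from the training folds only, so no information leaks into the held-out fold. A small sketch with made-up values:

```python
import numpy as np

# Hypothetical feature matrix: columns could be heart rate and temperature.
X_train = np.array([[60.0, 36.5], [120.0, 39.0], [90.0, 37.0]])
X_test = np.array([[100.0, 38.0]])

# Min-max parameters estimated on the training folds only.
mins = X_train.min(axis=0)
maxs = X_train.max(axis=0)

# Apply the same transform to both sets; test values may fall outside [0, 1].
X_train_s = (X_train - mins) / (maxs - mins)
X_test_s = (X_test - mins) / (maxs - mins)
```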
Fuzzy models are classifiers that define rules to establish nonlinear relationships between a set of features and a response variable. In this study, we used the Takagi-Sugeno model [
Here,
As multiple rules are derived for the data, they are aggregated for the final output using their degree of activation. The degree of activation of the
Here,
More details about fuzzy modeling can be found in the study by Takagi and Sugeno [
The number of features included in each rule was selected using a wrapper feature selection method that iteratively evaluated whether adding a new feature improved the model classification performance [
Here,
Machine learning classifiers included the LR, RF, and GB tree models. The model parameters were tuned using nested cross-validation with a grid search, defined as follows:
For LR, the grid search for the inverse of regularization strength (
For RF, the grid search for the number of trees was defined as {300, 500, 800}, the number of maximum splits (tree height) was defined as {8, 15, 25}, the number of minimum samples to split was defined as {5, 10}, the number of maximum samples in leaves was defined as {2, 5}, and the number of maximum features was defined as {
For the GB tree, the grid search for the learning rate was defined as {0.01, 0.05, 0.10}, the number of trees was defined as {300, 500, 800}, and the number of maximum features was defined as {
The best parameters were used to retrain a model using all data from the training folds and then test the held-out fold. The models were implemented using the
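The nested tuning loop can be sketched with scikit-learn as follows; the data are synthetic, the grids are deliberately smaller than those listed above to keep the example fast (the paper used, e.g., {300, 500, 800} trees), and the AUC scoring choice is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the ICU feature matrix and normal/abnormal labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "lr": (LogisticRegression(max_iter=1000),
           {"C": [0.01, 0.1, 1.0]}),            # illustrative C values
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100],          # reduced from {300, 500, 800}
            "max_depth": [8, 15]}),             # from the quoted grid
}

results = {}
for name, (model, grid) in models.items():
    # Inner loop tunes hyperparameters; the outer loop estimates performance
    # on folds never seen during tuning (nested cross-validation).
    inner = GridSearchCV(model, grid, cv=3, scoring="roc_auc")
    results[name] = cross_val_score(inner, X, y, cv=3, scoring="roc_auc").mean()
```

After this, the best parameters found on the full training folds would be used to refit a final model before scoring the held-out fold.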
The metrics also allow the comparison of the approaches from a medical perspective. Sensitivity indicates the proportion of actual abnormal laboratory tests that were correctly classified, whereas specificity indicates the proportion of actual normal laboratory tests that were correctly classified. These 2 metrics are related to precision (positive predictive value) and negative predictive value. When the number of false positives (normal tests predicted as abnormal) increases, the specificity and precision metrics decrease. The same occurs with the sensitivity and negative predictive value metrics when the number of false negatives increases.
Metrics used to measure the performance of approaches 1 and 2.
Metric  Equation  Description 
Specificity  TN^{a}/(FP^{b} + TN)  The proportion of actual normal laboratory tests that were correctly classified 
Sensitivity (or recall)  TP^{c}/(FN^{d} + TP)  The proportion of actual abnormal laboratory tests that were correctly classified 
Accuracy  (TP + TN)/(FN + FP + TP + TN)  The proportion of laboratory tests that were correctly classified 
Precision (positive predictive value)  TP/(FP + TP)  The proportion of laboratory tests predicted as abnormal that, in fact, were abnormal 
Negative predictive value  TN/(FN + TN)  The proportion of laboratory tests predicted as normal that, in fact, were normal 
F_{1} score  2 × (precision × recall)/(precision + recall)  Weighted mean of precision and recall 
Area under the receiver operating characteristic curve    The balance between the true positive rate and true negative rate of the predictions
Area under the precision-recall curve    The balance between the precision and recall of the predictions
Mean G  √(sensitivity × specificity)  The balance between the performance of majority and minority classes 
Index balanced accuracy [  (mean G)^{2} × (1 + [sensitivity – specificity])  Imbalanced index of the overall accuracy
^{a}TN: true negative.
^{b}FP: false positive.
^{c}TP: true positive.
^{d}FN: false negative.
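The threshold-based metrics in the table can be computed directly from the confusion counts; the AUC and precision-recall AUC need the underlying prediction scores and are omitted here. The counts below are made up for illustration.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Confusion-matrix metrics matching the table definitions."""
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)      # accuracy
    prec = tp / (tp + fp)                      # precision (PPV)
    npv = tn / (tn + fn)                       # negative predictive value
    f1 = 2 * prec * sens / (prec + sens)       # F1 score
    mean_g = math.sqrt(sens * spec)            # mean G
    iba = mean_g ** 2 * (1 + (sens - spec))    # index balanced accuracy
    return {"sensitivity": sens, "specificity": spec, "accuracy": acc,
            "precision": prec, "npv": npv, "f1": f1,
            "mean_g": mean_g, "iba": iba}

m = evaluate(tp=80, fp=20, tn=70, fn=30)
```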
The sets of metrics for each approach were compared pairwise using a 2-sided Wilcoxon rank-sum hypothesis test. The null hypothesis was that there was no difference between the metrics obtained using the 2 approaches, whereas the alternative hypothesis was that there was a difference. As 720 comparisons were conducted for the 18 laboratory tests, 4 classifiers, and 10 metrics, we used the Benjamini-Hochberg correction with the false-positive rate set at 0.05.
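A sketch of the testing procedure, using SciPy’s rank-sum test and a hand-rolled Benjamini-Hochberg step; the fold-level metric values and p-values below are made up for illustration.

```python
from scipy.stats import ranksums

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean list marking which p-values are significant
    under the Benjamini-Hochberg procedure at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    threshold_rank = -1
    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            threshold_rank = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= threshold_rank:
            significant[i] = True
    return significant

# Compare per-fold AUCs of the 2 approaches for one laboratory test.
stat, p = ranksums([0.84, 0.86, 0.85], [0.80, 0.79, 0.81])

# Correct a family of p-values from many such comparisons.
flags = benjamini_hochberg([0.001, 0.02, 0.04, 0.30])
```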
In addition to comparing the performances of the approaches, we explored the most relevant features for classification. For each iteration of the 10-fold cross-validation, we stored the relevance of each feature for the trained model.
For each classifier, features were ranked based on their relevance values. For the fuzzy model, relevance was given by the wrapper feature selection method used to derive the antecedent of the fuzzy rules. For LR, the relevance was given by the absolute value of the coefficient associated with each feature. For the RF and GB tree, the relevance was calculated using the mean of the impurity reduction within each tree of the fitted models.
After performing the 10-fold cross-validation, a total of 10 ranking feature sets were obtained for each laboratory blood test and each classifier. We aggregated these ranking feature sets by averaging the rank of each feature, which is an aggregation strategy used in the medical domain [
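The rank-averaging step can be sketched as follows; the relevance scores are hypothetical, and ties are broken arbitrarily by the sort.

```python
import numpy as np

# Hypothetical relevance scores (higher = more relevant) for 4 features
# across 3 cross-validation folds.
relevance = np.array([
    [0.40, 0.10, 0.30, 0.20],
    [0.35, 0.05, 0.40, 0.20],
    [0.50, 0.10, 0.25, 0.15],
])

# Rank features within each fold (rank 1 = most relevant), then average
# the ranks across folds to obtain the aggregated ordering.
ranks = relevance.shape[1] - relevance.argsort(axis=1).argsort(axis=1)
mean_rank = ranks.mean(axis=0)
order = mean_rank.argsort()  # feature indices, best average rank first
```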
To compare the performance obtained with each new feature, we compared approach 1 with 2 alternative approaches. The first alternative used the features of approach 1 plus the pretest probability feature, whereas the second used the features of approach 1 plus the entropy-based features. These alternative approaches were trained and compared using the same methodology as approaches 1 and 2.
Performance distribution of approaches 1 and 2 across the laboratory blood tests. The first quantile, median, and third quantile are displayed inside each distribution (dashed lines). AUC: area under the curve; FM: fuzzy model; GB: gradient boosting; IBA: index balanced accuracy; LR: logistic regression; PRAUC: precision-recall area under the curve; RF: random forest.
The detailed performance of the approaches for each laboratory test, metric, and machine learning classifier is presented in
Among the classifiers, LR benefited the most from the inclusion of the new features, achieving improvements in at least 8 metrics for 10 of the 18 laboratory blood tests. The RF and GB models obtained fewer significant improvements across the metrics than the fuzzy and LR models.
Percentage change for the 10-fold mean metric values between approaches 1 and 2. The asterisk indicates a statistically significant difference (2-sided Wilcoxon rank-sum hypothesis tests adjusted via Benjamini-Hochberg correction using a false-positive rate set at 0.05). ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; AUC: area under the curve; FM: fuzzy model; GB: gradient boosting; GGT: gamma-glutamyl transferase; IBA: index balanced accuracy; LR: logistic regression; PaO_{2}: partial pressure of oxygen (arterial); PCO_{2}: partial pressure of carbon dioxide (arterial); PRAUC: precision-recall area under the curve; RF: random forest; WBC: white blood cell.
Finally, to visualize how the features relate to the prediction of abnormal test results, the fuzzy predictive rules obtained by retraining a fuzzy model on the data set are presented in
The top 5 ranking of the features selected across the machine learning classifiers for each of the laboratory tests for approach 2. Light blue and light red boxes correspond to the vital features and diagnoses, respectively, shared with approach 1. The light green boxes correspond to the pretest probability feature, and the light gray boxes correspond to the entropy-based features. ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; GGT: gamma-glutamyl transferase; PaO_{2}: partial pressure of oxygen (arterial); PCO_{2}: partial pressure of carbon dioxide (arterial); SPO2: oxygen saturation; WBC: white blood cell.
Cubic root of the percentage change between the 10-fold means of approaches 1 and 2 (blue bars), approach 1 plus the pretest probability (yellow bars), and approach 1 plus the entropy-based features for the fuzzy model. The asterisk indicates a statistically significant difference (2-sided Wilcoxon rank-sum hypothesis tests adjusted via Benjamini-Hochberg correction using a false-positive rate set at 0.05). ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; AUC: area under the curve; GGT: gamma-glutamyl transferase; IBA: index balanced accuracy; PaO_{2}: partial pressure of oxygen (arterial); PCO_{2}: partial pressure of carbon dioxide (arterial); PRAUC: precision-recall area under the curve; WBC: white blood cell.
Cubic root of the percentage change between the 10-fold means of approaches 1 and 2 (blue bars), approach 1 plus the pretest probability (yellow bars), and approach 1 plus the entropy-based features for the logistic regression. The asterisk indicates a statistically significant difference (2-sided Wilcoxon rank-sum hypothesis tests adjusted via Benjamini-Hochberg correction using a false-positive rate set at 0.05). ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; AUC: area under the curve; GGT: gamma-glutamyl transferase; IBA: index balanced accuracy; PaO_{2}: partial pressure of oxygen (arterial); PCO_{2}: partial pressure of carbon dioxide (arterial); PRAUC: precision-recall area under the curve; WBC: white blood cell.
Cubic root of the percentage change between the 10-fold means of approaches 1 and 2 (blue bars), approach 1 plus the pretest probability (yellow bars), and approach 1 plus the entropy-based features for the random forest model. The asterisk indicates a statistically significant difference (2-sided Wilcoxon rank-sum hypothesis tests adjusted via Benjamini-Hochberg correction using a false-positive rate set at 0.05). ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; AUC: area under the curve; GGT: gamma-glutamyl transferase; IBA: index balanced accuracy; PaO_{2}: partial pressure of oxygen (arterial); PCO_{2}: partial pressure of carbon dioxide (arterial); PRAUC: precision-recall area under the curve; WBC: white blood cell.
Cubic root of the percentage change between the 10-fold means of approaches 1 and 2 (blue bars), approach 1 plus the pretest probability (yellow bars), and approach 1 plus the entropy-based features for the gradient boosting model. The asterisk indicates a statistically significant difference (2-sided Wilcoxon rank-sum hypothesis tests adjusted via Benjamini-Hochberg correction using a false-positive rate set at 0.05). ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; AUC: area under the curve; GGT: gamma-glutamyl transferase; IBA: index balanced accuracy; PaO_{2}: partial pressure of oxygen (arterial); PCO_{2}: partial pressure of carbon dioxide (arterial); PRAUC: precision-recall area under the curve; WBC: white blood cell.
We found that the inclusion of the conditional entropy–based features and pretest probability significantly improved the capacity to predict abnormal results of a new laboratory test. Notably, the inclusion of these features improved the detection of actual abnormal tests (sensitivity) for at least half of the laboratory blood tests across the 4 classifiers (
The most relevant feature analysis revealed that the pretest probability was the most relevant of the 2 new types of features. In fact, the models strongly relied on the pretest probability to discriminate between normal and abnormal laboratory blood tests (
The classifiers that improved the most were the LR and fuzzy models. A possible reason for this difference is that the LR and fuzzy models used all the features to fit their models. In contrast, the ensemble models built individual trees by randomly selecting a subset of the features, thereby excluding the pretest probability or entropy-based features from some trees. Nevertheless, the RF and GB tree also improved under approach 2, achieving significant improvements in the sensitivity, F_{1} score, and IBA metrics.
The inclusion of the new features improved sensitivity and negative predictive value but decreased specificity and precision. This tradeoff is beneficial in the medical context: although ordering extra blood tests when they may not be necessary (more false positives) can increase medical expenses, patients’ safety is preserved (fewer false negatives). The new features also improved balanced metrics such as the F_{1} score, AUC, mean G, and IBA, showing the benefit of including such features to improve the capacity for discriminating between normal and abnormal test results. For instance, approach 2 improved the aforementioned metrics for blood gas tests (potential of hydrogen and PCO_{2}), which are among the most expensive laboratory blood tests ordered in the ICU [
However, we note that predicting normal and abnormal blood test results is an intermediate step toward detecting redundant tests. Deciding whether to order a new test should be based on more than a predicted normal laboratory result, as the situation and severity of each patient in the ICU differ. We included vitals and admission diagnosis to mitigate these factors; however, human interpretation still plays a crucial role in deciding whether ordering a new laboratory test is clinically meaningful. Normal laboratory test results can help measure trends, validate required thresholds, and assess treatments. Therefore, predicting the result of a new test as normal does not by itself imply its relevance or redundancy. However, redundancy guidelines can be established by analyzing predictions together with prior consecutive results. For instance, if one or more previous tests yielded normal results and the prediction for the new test is again normal, the new laboratory blood test may be redundant. In contrast, if the prediction is abnormal, the new test may be relevant, as it can inform medical decisions.
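As an illustration only (not a validated clinical guideline), such a screening rule could be expressed as a simple function; the string labels and the streak length k are assumptions for this sketch.

```python
def flag_redundant(previous_results, predicted_normal, k=2):
    """Illustrative rule: flag a new test as potentially redundant if the
    last k results were normal AND the model also predicts a normal result.
    `previous_results` is a chronological list of "normal"/"abnormal"."""
    recent = previous_results[-k:]
    return (len(recent) == k
            and all(r == "normal" for r in recent)
            and predicted_normal)
```

The final ordering decision would still rest with the clinician, who weighs the patient’s diagnosis and disease severity.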
The relevance of the new features is consistent with prior literature [
In comparison with the studies by Roy et al [
We note that this study used an ICU data set collected in Alberta, Canada. As ethnic and racial subgroups have different distributions for laboratory tests [
We also note that our inclusion criteria excluded patients who did not have >1 sample of the target laboratory blood test or who lacked any measurement of heart rate, respiration rate, temperature, oxygen saturation, blood pressure, or urine output. This limits the applicability of our work, as it was not designed to predict abnormal results for the first laboratory test performed on a given day or when the patient’s vitals are missing. Future work should explore how to predict abnormal results of a new test in such cases.
This study introduced new types of features to predict abnormal or normal results in laboratory blood tests in the ICU. The new features were extracted from historical data to describe the chances of yielding a normal test if previous sequential tests were normal (pretest probability) and the expected uncertainty of an abnormal yield if a patient’s vitals were already known (conditional entropy). These historical data combined with patients’ data are suitable indicators to predict the abnormal results of performing an additional laboratory blood test. Therefore, this study provides tools that can help develop guidelines to reduce overtesting in the ICU.
Performance of approach 1 using the 10-fold cross-validation for each blood laboratory test and machine learning classifier. For each classifier, the means and SDs of metrics across the 10 folds are presented. The best result for each metric and laboratory test is in bold.
Performance of approach 2 using the 10-fold cross-validation for each blood laboratory test and machine learning classifier. For each classifier, the means and SDs of metrics across the 10 folds are presented. The best result for each metric and laboratory test is in bold.
Percentage change for the 10-fold mean metric values between approach 1 and approach 2. The asterisk indicates a statistically significant difference (2-sided Wilcoxon rank-sum hypothesis tests adjusted with Benjamini-Hochberg correction using a false-positive rate set at 0.05). The comparison was conducted for each classifier and each metric.
Predictive rules for the fuzzy model. "Entropy" indicates that the feature is the conditional entropy–based version.
AUC: area under the curve
EMR: electronic medical record
GB: gradient boosting
IBA: index balanced accuracy
ICU: intensive care unit
LR: logistic regression
RF: random forest
CEV and JL contributed to the development of the techniques and analysis tools. JL, HTS, and DJN coordinated the data collection process of the study and acquired funding. CEV implemented the methods presented herein and prepared the manuscript. CEV, DJN, HTS, and JL provided oversight throughout the study and proofread the manuscript.
None declared.