This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Laboratory tests are an essential part of patient safety, as patients' screening, diagnosis, and follow-up depend heavily on them. A diagnosis can be wrong, missed, or delayed if laboratory tests are performed erroneously. However, the value of correct laboratory test ordering remains underappreciated by policymakers and clinicians. Artificial intelligence methods such as machine learning and deep learning (DL) are now extensively used as powerful tools for pattern recognition in large data sets. Therefore, developing an automated laboratory test recommendation tool using available data from electronic health records (EHRs) could support current clinical practice.
The objective of this study was to develop an artificial intelligence–based automated model that can provide laboratory test recommendations based on simple variables available in EHRs.
A retrospective analysis of the National Health Insurance database between January 1, 2013, and December 31, 2013, was performed. We reviewed the records of all patients who visited the cardiology department at least once and were prescribed laboratory tests. The data set was split into training and testing sets (80:20) to develop the DL model. For internal validation, 25% of the data were randomly selected from the training set to evaluate the performance of the model.
We used the area under the receiver operating characteristic curve (AUROC), precision, recall, and Hamming loss as comparative measures. A total of 129,938 prescriptions were used in our model. The DL-based automated recommendation system for laboratory tests achieved good discrimination (AUROC_{macro}=0.76 and AUROC_{micro}=0.87). Using a low cutoff, the model identified appropriate laboratory tests with 99% sensitivity.
The developed artificial intelligence model based on DL exhibited good discriminative capability for predicting laboratory tests using routinely collected EHR data. Utilization of DL approaches can facilitate optimal laboratory test selection for patients, which may in turn improve patient safety. However, future studies are recommended to assess the cost-effectiveness of implementing this model in real-world clinical settings.
Laboratory tests are key components of the health care system and patient safety [
Inappropriate testing can take several forms. The first is overutilization (overordering), which refers to tests that are ordered without any clinical indication. The second is underutilization, which refers to laboratory tests that are indicated but not ordered. Overutilization can result in unnecessary blood draws and other sample-collection procedures [
Deep learning (DL), a subset of machine learning, is being used in many areas, including health care, and has already shown promise in various domains. This success can be attributed to increased computational power and the availability of massive data sets [
We collected data from the Taiwanese National Health Insurance Research and Development (NHIRD) database, which contains all medication claims and diagnosis data for 23 million Taiwanese residents (approximately 99.9% of the total population of Taiwan). The database includes patients' demographic information, number of prescriptions, brand and generic drug names, prescription dates, medication dosages, and diagnoses. The quality and completeness of this database are excellent, and it has been used to conduct high-quality research [
In this study, we retrieved prescription information for patients who visited the cardiology department at least once from the data of 2 million randomly selected patients in the NHIRD database between January 1, 2013, and December 31, 2013.
We collected EHR data available at the time of ordering laboratory tests to develop the predictive model; these data included patients' demographics, visit date, department ID, diagnoses, medications, and laboratory tests. We considered the first 3 digits of the ICD-9-CM code to retrieve information about comorbidities; ICD-9-CM codes generally range from 001 to 999 and V01 to V82. Furthermore, we considered the first 5 characters of the ATC code, which cover almost every medication in a single category. For example, the 5-character ATC code C09AA (ACE inhibitors, plain) includes all plain ACE inhibitors such as C09AA01 (captopril) and C09AA02 (enalapril). However, 7 characters (e.g., R06AX12) were considered for drugs with "X" as the fifth character, because "X" usually denotes other agents in the ATC code. The overall data set retrieved included 328 types of laboratory tests. Most laboratory tests were not ordered frequently, which can worsen prediction performance. We therefore calculated the percentage of all laboratory tests and applied an inclusion threshold of 0.5%. This narrowed the laboratory tests down to 35, each contributing at least 0.5% of all tests in the study period; together, these 35 tests accounted for more than 90% of total tests (see Table S2 in
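The code-truncation rules described above can be sketched in a few lines. This is a minimal illustration, assuming codes are stored as plain strings; the function names are ours, not from the study:

```python
def truncate_atc(code: str) -> str:
    """Truncate an ATC code to its 5-character subgroup level, but keep the
    full 7-character code when the 5th character is 'X' ('other agents')."""
    if len(code) >= 5 and code[4] == "X":
        return code[:7]
    return code[:5]

def truncate_icd9(code: str) -> str:
    """Keep the first 3 digits of an ICD-9-CM code (e.g., '4280' -> '428')."""
    return code[:3]
```

For example, `truncate_atc("C09AA01")` yields the subgroup `"C09AA"`, whereas `truncate_atc("R06AX12")` is kept whole.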
In this study, 80% of data were assigned to the training set, and 20% to the testing set. In the internal validation, we randomly selected 25% of the data from the training set and evaluated model performance (
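The 80:20 split with a further 25% internal-validation holdout can be sketched as follows. This is an illustrative implementation using only the standard library; the actual pipeline and seed used in the study are not specified:

```python
import random

def split_data(records, test_frac=0.20, val_frac=0.25, seed=42):
    """Shuffle and split records 80:20 into training and testing sets,
    then hold out 25% of the training set for internal validation."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    n_val = int(len(train) * val_frac)
    val, train = train[:n_val], train[n_val:]
    return train, val, test

train, val, test = split_data(list(range(100)))
# 100 records -> 20 test, and the 80 training records split into 20 validation + 60 training
```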
Overall study design.
Architecture of the proposed deep learning model.
The activation function is an integral part of a neural network and performs a nonlinear transformation (ie, it describes the input–output relation in a nonlinear way). It is this nonlinearity that allows higher flexibility and the ability to perform complex tasks during the learning process. Several activation functions, such as the sigmoid and ReLU, are commonly used in practice.
This function takes a real-valued input and converts it into a range between 0 and 1. The sigmoid function is defined as follows:

σ(y) = 1/(1 + e^{–y})

It is clear that the output lies between 0 and 1 as the input varies in (–∞, ∞). A neuron can use the sigmoid to compute the nonlinear function σ(y), with y = wx + b. If y = wx + b is very large and positive, then e^{–y} → 0, so σ(y) → 1, whereas if y = wx + b is very large and negative, then e^{–y} → ∞, so σ(y) → 0.
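The limiting behavior above is easy to verify numerically. A minimal sketch (the function name is illustrative):

```python
import math

def sigmoid(y: float) -> float:
    """Logistic sigmoid: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

# sigmoid(0) = 0.5; large positive inputs approach 1, large negative approach 0
```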
The rectified linear unit takes a real-valued input and thresholds it at zero (ie, it replaces negative values with zero). The ReLU function is defined as follows:
f(x) = max (0, x)
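In code, the definition above is a one-liner (a minimal sketch; the function name is illustrative):

```python
def relu(x: float) -> float:
    """Rectified linear unit: f(x) = max(0, x), replacing negative values with zero."""
    return max(0.0, x)
```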
These algorithms are generally used to minimize the loss and achieve better, faster convergence by iteratively updating model parameters such as the weights and biases.
The process of gradient descent.
The cost function C starts at an initial value, and the desired point is its minimum C_{min}. The starting weight is w_{0}, and each step is denoted r. The gradient ∂C/∂w represents the direction of maximum increase, so at step r the weight is updated in the opposite direction, –(∂C/∂w)(w_{r}). The most commonly used optimizers are Momentum, Adagrad, AdaDelta, and Adam.
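The plain gradient-descent update described above (before any of the momentum- or adaptivity-based optimizers are layered on top) can be sketched as follows. The example cost function C(w) = (w − 3)^2 and learning rate are our own illustrative choices:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Vanilla gradient descent: at each step r, move the weight opposite
    to the gradient, w_{r+1} = w_r - lr * (dC/dw)(w_r)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize C(w) = (w - 3)^2, whose gradient is 2*(w - 3); the minimum is at w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```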
We assessed the performance of the DNN model on the validation set for laboratory test recommendations using the following metrics.
This metric averages over the whole prediction matrix. S_{micro} corresponds to the set of correctly ordered quadruples. The formula for calculating the micro–area under the curve (micro-AUC) is:

micro-AUC = |S_{micro}| / [(Σ^{m}_{i=1} |Y^{+}_{i·}|)·(Σ^{m}_{i=1} |Y^{–}_{i·}|)]

S_{micro} = {(a,b,i,j) | (a,b) ∈ Y^{+}_{·i} × Y^{–}_{·j}, f_{i}(x_{a}) ≥ f_{j}(x_{b})}
This metric averages over each label. S^{j}_{macro} corresponds to the set of correctly ordered instance pairs on label j. The formula for calculating the macro-AUC is:

macro-AUC = (1/l) Σ^{l}_{j=1} |S^{j}_{macro}| / (|Y^{+}_{·j}|·|Y^{–}_{·j}|)

S^{j}_{macro} = {(a,b) | (a,b) ∈ Y^{+}_{·j} × Y^{–}_{·j}, f_{j}(x_{a}) ≥ f_{j}(x_{b})}
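The micro and macro averaging strategies defined above can be computed directly from the pairwise definitions. This is a minimal sketch for small matrices (Y is the binary ground truth, F the score matrix; function names are illustrative):

```python
def micro_auc(Y, F):
    """Micro-AUC: fraction of (relevant, irrelevant) instance-label pairs,
    pooled over all instances and labels, that the scores order correctly."""
    pos = [(a, i) for a, row in enumerate(Y) for i, y in enumerate(row) if y == 1]
    neg = [(b, j) for b, row in enumerate(Y) for j, y in enumerate(row) if y == 0]
    correct = sum(1 for (a, i) in pos for (b, j) in neg if F[a][i] >= F[b][j])
    return correct / (len(pos) * len(neg))

def macro_auc(Y, F):
    """Macro-AUC: the same pairwise comparison computed per label, then
    averaged over the labels that have both positive and negative instances."""
    aucs = []
    for j in range(len(Y[0])):
        pos = [a for a in range(len(Y)) if Y[a][j] == 1]
        neg = [b for b in range(len(Y)) if Y[b][j] == 0]
        if pos and neg:
            aucs.append(sum(1 for a in pos for b in neg if F[a][j] >= F[b][j])
                        / (len(pos) * len(neg)))
    return sum(aucs) / len(aucs)
```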
This metric averages over the prediction matrix, and is calculated as follows:

micro-F1 = (2 Σ^{l}_{j=1} Σ^{m}_{i=1} y_{ij}h_{ij}) / (Σ^{l}_{j=1} Σ^{m}_{i=1} y_{ij} + Σ^{l}_{j=1} Σ^{m}_{i=1} h_{ij})
It averages each label, and is calculated as follows:
MacroF1 = (1/l)Σ^{l}_{j=1}(2Σ^{m}_{i=1}y_{ij}h_{ij})/(Σ^{m}_{i=1}y_{ij} + Σ^{m}_{i=1}h_{ij})
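Both F1 variants follow directly from the two formulas above: micro-F1 pools the counts before dividing, macro-F1 divides per label and then averages. A minimal sketch (Y is the binary ground truth, H the binary predictions; function names are illustrative):

```python
def micro_f1(Y, H):
    """Micro-F1: pool true and predicted labels over all instances and labels."""
    tp = sum(y * h for ry, rh in zip(Y, H) for y, h in zip(ry, rh))
    total = sum(map(sum, Y)) + sum(map(sum, H))
    return 2 * tp / total

def macro_f1(Y, H):
    """Macro-F1: compute F1 per label, then average over labels."""
    m, l = len(Y), len(Y[0])
    scores = []
    for j in range(l):
        tp = sum(Y[i][j] * H[i][j] for i in range(m))
        denom = sum(Y[i][j] for i in range(m)) + sum(H[i][j] for i in range(m))
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / l
```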
This reflects, for each relevant label, the average fraction of relevant labels ranked at or above it, and is calculated as follows:

Average precision = (1/m) Σ^{m}_{i=1} (1/|Y^{+}_{i·}|) Σ_{j∈Y^{+}_{i·}} [|S^{ij}_{precision}| / rank_{F}(x_{i},j)]

S^{ij}_{precision} = {k ∈ Y^{+}_{i·} | rank_{F}(x_{i},k) ≤ rank_{F}(x_{i},j)}
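The definition above can be translated line by line into code. A minimal sketch for small matrices (Y is the binary ground truth, F the score matrix; the function name is illustrative):

```python
def average_precision(Y, F):
    """Label-ranking average precision: for each instance, the mean over its
    relevant labels j of (# relevant labels ranked at or above j) / rank(j)."""
    total = 0.0
    for y_row, f_row in zip(Y, F):
        relevant = [j for j, y in enumerate(y_row) if y == 1]
        # rank_F(x, j): 1-based position of label j when sorted by descending score
        rank = lambda j: 1 + sum(1 for k in range(len(f_row)) if f_row[k] > f_row[j])
        total += sum(
            sum(1 for k in relevant if rank(k) <= rank(j)) / rank(j)
            for j in relevant
        ) / len(relevant)
    return total / len(Y)
```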
This is one of the most commonly used metrics to evaluate the performance of a multilabel classifier. It is the average symmetric difference between the set of true labels and the set of predicted labels over the data set. Its formula is as follows:

hloss (H) = (1/ml) Σ^{m}_{i=1} Σ^{l}_{j=1} [[h_{ij} ≠ y_{ij}]]

The Hamming loss (HL) value ranges from 0 to 1; a lower value indicates a better classifier.
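The Hamming loss is simply the fraction of misclassified instance-label entries. A minimal sketch (the function name is illustrative):

```python
def hamming_loss(Y, H):
    """Hamming loss: fraction of the m*l instance-label pairs where the
    prediction h_ij disagrees with the true label y_ij."""
    m, l = len(Y), len(Y[0])
    return sum(h != y for ry, rh in zip(Y, H) for y, h in zip(ry, rh)) / (m * l)
```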
In this study, we considered all patients who visited the cardiology department. A total of 37,890 patients visited the department at least once between January 1, 2013, and December 31, 2013. The number of male patients was higher than the number of female patients (51.11% [19,366/37,890] vs 48.89% [18,524/37,890]) and the age of patients ranged from 4 to 102 years. A total of 129,938 prescriptions with laboratory tests were ordered in the cardiology department (
Characteristics of patients and clinical variables.
Variables  Values
Total number of prescriptions  129,938
Total number of patients  37,890
Age (years), range  4–102
Gender
  Male, n (%)  19,366 (51.11)
  Female, n (%)  18,524 (48.89)
Number of drug inputs  416
Number of disease inputs  714
Number of laboratory tests  35
A total of 1132 input variables were used to predict the 35 types of laboratory tests. The DL model was applied to data from the cardiology department to predict laboratory tests accurately; the model achieved good discrimination (AUROC_{macro}=0.76 and AUROC_{micro}=0.87).
Receiver operating characteristic (ROC) curves of the deep learning model for predicting laboratory tests.
The DL model’s precision, recall, F1 score, and HL based on varying cutoffs for clinical laboratory test prediction are presented in
Recall, precision, F1 score, and Hamming loss of the model based on varying cutoffs for clinical laboratory test prediction.
Cutoffs  Recall^{a}  Precision^{a}  F1 score^{a}  Hamming loss
0.01  0.99  0.24  0.36  0.46 
0.05  0.94  0.33  0.45  0.39 
0.10  0.89  0.40  0.50  0.29 
0.15  0.85  0.44  0.52  0.24 
0.20  0.80  0.47  0.54  0.21 
0.25  0.76  0.51  0.55  0.19 
0.30  0.71  0.54  0.55  0.17 
0.35  0.67  0.56  0.55  0.16 
^{a}Overall (micro- and macro-averaged) results presented.
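The cutoff in the table is simply the probability threshold at which a predicted test becomes a recommendation: lowering it trades precision for recall. A minimal sketch of that thresholding step, with hypothetical test names and probabilities of our own:

```python
def recommend_tests(probabilities, test_names, cutoff=0.10):
    """Turn per-test predicted probabilities into a recommendation list:
    every test whose probability meets the cutoff is recommended.
    Lower cutoffs increase recall (sensitivity) at the cost of precision."""
    return [name for name, p in zip(test_names, probabilities) if p >= cutoff]

# Illustrative (hypothetical) model output for three tests
probs = [0.85, 0.12, 0.03]
names = ["glucose", "creatinine", "lipid panel"]
```

At a cutoff of 0.10 both glucose and creatinine would be recommended; at 0.30 only glucose remains.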
In this study, we developed and validated a DL-based automated model to recommend laboratory tests based on individual patient clinical history. To our knowledge, this is the first study to evaluate the performance of a DL algorithm for recommending laboratory tests; the model achieved good performance, suggesting it could be used in real-world clinical settings. The main advantage of this model is that it requires minimal input data, such as gender, age, disease, and drug information, and thus can be easily integrated into EHR systems. Most importantly, the cutoff value can be adjusted according to physician needs. Moreover, physicians can select the required laboratory tests for an individual patient from a provided list. This would support quick and accurate test ordering. The model showed high discriminative capacity; hence, its implementation could promote appropriate laboratory testing, improve patient safety, and reduce unnecessary costs associated with wrong orders.
Previously, Wright et al [
Health care budgets worldwide are facing increasing pressure to minimize costs while maintaining quality care and ensuring patients’ safety [
Several groups of researchers have proposed ways to control inappropriate laboratory test ordering, but it remains unclear which is the most effective or how to integrate these approaches with other systems designed to control laboratory costs. Some have suggested reducing the reimbursement rate to control expenditures on laboratory services; although this approach can be effective in the short run, it has several fundamental flaws. A second approach is linked to medical necessity: laboratory costs can be reduced by decreasing the utilization of tests that are not medically necessary; however, it is very difficult to define the appropriate use of laboratory tests, and although significant progress has already been made, much work remains in this area. A third approach is active management of test utilization by laboratory staff. This approach has been used mostly in academic medical centers, often integrated as part of training for residents and fellows [
Our study has several strengths worth noting. (1) This is the first study to evaluate and utilize a DL model to recommend laboratory tests using variables available in EHRs; it can therefore serve as a benchmark for future studies. (2) Our model is accurate and allows the cutoff value to be adjusted according to physician demand. (3) Our evaluation of the DL algorithm was rigorous, used relatively few variables, and the model was developed from daily clinical practice data.
Our study also has several limitations. (1) Our model was developed using data from the cardiology department only; however, it can be extended to other departments using their own data. (2) The prediction model used only 35 laboratory tests; however, these covered more than 90% of total tests ordered. (3) Our model has not yet been tested on an external data set; however, we used internal validation to evaluate model performance. Performance may deviate somewhat when the model is validated on other data sets.
Our next step is to extend this work to other departments and to include unstructured data such as progress notes and operative notes; we believe that including these data could improve model performance. Moreover, we will use 10 years of data to improve performance, although this would be computationally expensive. We also plan to include procedures in our system, as this would add further value in real-world clinical settings. Because our model showed high sensitivity and a low false-positive rate, we will integrate it with EHRs to improve clinical decisions and reduce laboratory error rates. Although this would be quite powerful, it remains challenging for several reasons, including gold-standard evaluation and the acceptability of our model in clinical settings. However, one potential benefit of implementing this model in real-world clinical settings would be allowing individual physicians to select tests from a provided list of laboratory tests ranked by probability (
Proposed infographic of the deep learning (DL)–based laboratory test recommendation tool.
Using commonly available clinical variables, we developed and validated a DL algorithm that predicts laboratory tests with high accuracy and recommends clinically relevant laboratory tests at the time of ordering. To our knowledge, this is the first study to evaluate the performance of a DL algorithm for laboratory test recommendation, and this predictive algorithm can serve as a clinical decision-support tool. Most importantly, our model could help reduce unnecessary laboratory test ordering and health care costs. The integration of this model into daily clinical practice may facilitate optimal laboratory test selection based on appropriate thresholds. However, further research is necessary to assess the system's workflow and to weigh the benefits to patients and physicians when implementing the model as a recommendation tool in clinical practice.
Artificial intelligence–based automated recommendation system for clinical laboratory tests.
ACE: angiotensin-converting enzyme
AUC: area under the curve
AUROC: area under the receiver operating characteristic curve
DL: deep learning
DNN: deep neural network
EHR: electronic health record
HL: Hamming loss
NHIRD: National Health Insurance Research and Development
NHS: National Health Service
This research was funded in part by the Ministry of Education (MOE) under grants MOE 1096604001400 and DP21092112101A01, and by the Ministry of Science and Technology (MOST) under grant MOST 10928238038004. We thank our colleagues who edited our manuscript.
None declared.