This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Liquid biopsies based on blood samples have been widely accepted as a diagnostic and monitoring tool for cancers, but extremely high sensitivity is frequently needed due to the very low levels of the specially selected DNA, RNA, or protein biomarkers that are released into blood. However, routine blood indices tests are frequently ordered by physicians, as they are easy to perform and are cost effective. In addition, machine learning is broadly accepted for its ability to decipher complicated connections between multiple sets of test data and diseases.
The aim of this study is to discover the potential association between lung cancer and routine blood indices and thereby help clinicians and patients to identify lung cancer based on these routine tests.
The Random Forest machine learning method was used to build a model that identifies lung cancer from routine blood indices and thereby tests whether the two are linked. Ten-fold cross-validation and further tests on a held-out set were used to evaluate the reliability of the identification model.
In total, 277 patients with 49 types of routine blood indices were included in this study: 183 patients with lung cancer and 94 patients without lung cancer. A combination of 19 types of routine blood indices was found to be correlated with lung cancer. Lung cancer patients could be distinguished from other patients, especially those with tuberculosis (which usually has clinical symptoms similar to those of lung cancer), with a sensitivity, specificity, and total accuracy of 96.3%, 94.97%, and 95.7%, respectively, in cross-validation. This identification method is called the routine blood indices model for lung cancer, and it promises to help both clinicians and patients identify lung cancer based on routine blood indices.
Lung cancer can be identified based on a combination of 19 types of routine blood indices. This implies that artificial intelligence can find connections between a disease and the fundamental indices of blood, which could reduce the need for costly, elaborate blood test techniques for this purpose. Combinations of multiple indices obtained from routine blood tests may be connected to other diseases as well.
Using liquid biopsies based on blood tests is a promising method to achieve noninvasive diagnosis of cancers, but it is also currently a challenge in oncology [
Cancers that can be detected with liquid biopsy methods include breast [
Meanwhile, misdiagnosis of lung cancer and tuberculosis occurs frequently in clinical situations [
In this work, we tried to find the connection between the results of routine blood examinations and serious diseases. We were inspired by two observations: multi-analyte blood tests can capture complicated relationships more fully than any single analyte, and considering multiple factors together may mitigate the effects of variation between individual patients. Although no single blood test item proved to be a sole indicator of lung cancer, a combination of 19 routine blood biochemical indices was found to be highly related to lung cancer, based on the Random Forest method [
Data from routine blood tests were collected from the Second Hospital of Lanzhou University. A total of 277 patients with 49 types of routine blood indices were included in this study, including 183 patients whose lung cancer was diagnosed by tissue biopsies as positive samples and another 94 patients, without lung cancer, as negative samples. These patients ranged from 20 to 81 years of age, and general information about their data sets can be accessed in
After collection, the data were randomly split into a training set and a test set at a ratio of about 4 to 1. The training set comprised 222 patients (149 lung cancer samples, 37 tuberculosis samples, and 36 other samples), and the remaining 55 samples were assigned to the test set.
General demographic information on the test set and the training set (N=277).
Characteristic | Training set: lung cancer (n=149) | Training set: tuberculosis (n=37) | Training set: other (n=36) | Test set: lung cancer (n=34) | Test set: tuberculosis (n=14) | Test set: other (n=7)
Male, n | 110 | 37 | 12 | 22 | 5 | 5
Female, n | 39 | 20 | 24 | 12 | 9 | 2
Median age in years (range) | 60 (27-81) | 46 (20-79) | 55 (30-78) | 58 (38-79) | 52 (20-78) | 62 (49-68)
Smokers, n | 44 | 2 | 2 | 5 | 0 | 1
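The random split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual procedure: the function name `split_train_test`, the seed, and the use of integer sample IDs are all assumptions made for the example.

```python
import random

def split_train_test(sample_ids, n_test, seed=0):
    """Randomly partition sample IDs into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # The first n_test shuffled IDs are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# 277 patients in total; holding out 55 gives the roughly 4:1 ratio in the text.
train_ids, test_ids = split_train_test(range(277), n_test=55)
print(len(train_ids), len(test_ids))
```

A plain shuffle does not control the mix of lung cancer, tuberculosis, and other samples in each set; a stratified split would be needed for that.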
The Random Forest method (RF) [
Both internal cross-validation and further tests were adopted to obtain a reliable classifier for lung cancer. The entire modelling process, including feature ranking, RF parameter adjustment, and final model selection, was performed on the training set only, using 10-fold cross-validation. The presplit test set, reserved for further testing of the built model, was not involved in any of these model-building steps, as emphasized by Smialowski et al [
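To make the 10-fold procedure concrete, the sketch below generates the fold indices for a training set of 222 samples. It is an assumption-laden illustration rather than the authors' code: `k_fold_indices` is a hypothetical helper, and the shuffling seed is arbitrary. The key point it shows is that every training sample is used for validation exactly once, while the held-out test set never enters the loop.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Only the 222 training samples are folded; the 55 test samples stay untouched.
folds = list(k_fold_indices(222, k=10))
for train_idx, val_idx in folds:
    pass  # fit a model on train_idx, evaluate on val_idx, then average the scores
```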
Five frequently used indicators were adopted here to evaluate the final performance of the routine blood indices model for lung cancer (RBLC) method, including sensitivity (Sens), specificity (Spec), accuracy (ACC), Matthews correlation coefficient (MCC), and the area under the curve (AUC), where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
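The four count-based indicators can be computed directly from TP, TN, FP, and FN using their standard definitions. The sketch below is illustrative: the function name and the example counts are hypothetical and are not taken from the study's results.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return Sens, Spec, ACC, and MCC from confusion-matrix counts."""
    sens = tp / (tp + fn)                    # fraction of true positives found
    spec = tn / (tn + fp)                    # fraction of true negatives found
    acc = (tp + tn) / (tp + tn + fp + fn)    # fraction of all samples correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sens": sens, "spec": spec, "acc": acc, "mcc": mcc}

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=45, fp=5, fn=10)
print(metrics)
```

The function assumes at least one positive and one negative sample; otherwise Sens or Spec is undefined.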
The receiver operating characteristic (ROC) curve is a composite indicator that plots Sens against Spec as the decision threshold varies, with Sens on the y-axis and 1–Spec on the x-axis. One characteristic of the ROC curve is that it remains essentially unchanged when the positive and negative samples in the test set are out of balance.
AUC is the area under the ROC curve, and it ranges from 0 to 1. The closer the AUC is to 1, the better the model's performance in predicting lung cancer. It is one of the main evaluation indices for a binary classifier system.
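One way to see why the AUC is insensitive to class balance is to compute it as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The sketch below uses made-up labels and scores; `roc_auc` is a hypothetical helper, not part of the RBLC software.

```python
def roc_auc(labels, scores):
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 positive and 2 negative samples.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)

# Duplicating every negative sample changes the class balance but not the AUC.
auc_unbalanced = roc_auc(labels + [0, 0], scores + [0.5, 0.1])
print(auc, auc_unbalanced)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for data sets of a few hundred patients.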
Routine blood tests listed in
To answer that question, we used routine blood and biochemical test data, which can be measured by common chemistry analyzers at a cost of approximately $10-20 per sample, and examined their correlation with lung cancer. Surprisingly, a simple Random Forest (RF) model found a positive correlation, and 19 blood indices were enough to establish it. With the data set we used, an MCC of 91.36%, ACC of 95.7% (
In fact, 19 indices are equivalent to a critical point (
The detailed forest structure for the RBLC model is illustrated in
Classification performance of the RBLC model. (A) Cross-validation results of models which were built on top ranking features. (B) ROC curves and the corresponding AUCs for the cross-validation on the training set and for the test set. RBLC: routine blood indices model for lung cancer; ROC: receiver operating characteristic; AUC: area under the curve; ACC: accuracy; MCC: Matthews correlation coefficient.
Top-ranking blood indices for the identification of lung cancer.
Rank | Index | Reference range |
1 | Basophil ratio | 0.00-0.01 |
2 | Creatine kinase isoenzymes (U/L) | 0.0-25.0 |
3 | Platelet large cell ratio (%) | 17.0-45.0 |
4 | Albumin (g/L) | 30.0-55.0 |
5 | Platelet distribution width (fL) | 9.0-17.0 |
6 | Neutrophilic granulocytes (10^9/L) | 2.00-7.00 |
7 | White blood cell count (10^9/L) | 4.00-10.00 |
8 | Albumin/Globulin ratio | 1.10-2.50 |
9 | Monocytes (10^9/L) | 0.12-1.20 |
10 | Monocyte ratio | 0.03-0.08 |
11 | Lymphocyte ratio | 0.20-0.40 |
12 | Neutrophil granulocyte ratio | 0.50-0.70 |
13 | Lactate dehydrogenase (U/L) | 0.0-240.0 |
14 | Carbamide (mmol/L) | 1.80-8.00 |
15 | Eosinophil cells (10^9/L) | 0.02-0.50 |
16 | Mean corpuscular volume (fL) | 80.0-100.0 |
17 | Alkaline phosphatase (U/L) | 0.0-120.0 |
18 | Mean corpuscular hemoglobin (pg) | 27.0-34.0 |
19 | Creatine kinase (U/L) | 0-195 |
The detailed forest structure for the RBLC model. (A) The general structure of the voting strategy of the RBLC model. (B) The independent decision rulings for different blood indices for the first tree (T1) in (A). T: tree; WBC: white blood cell count; NE%: neutrophil granulocyte ratio; LY%: lymphocyte ratio; MO%: monocyte ratio; BA%: basophil ratio; NE#: neutrophilic granulocytes; MO#: monocytes; EO#: eosinophil cells; MCV: mean corpuscular volume; MCH: mean corpuscular hemoglobin; PDW: platelet distribution width; P-LCR: platelet large cell ratio; UREA: carbamide; ALP: alkaline phosphatase; ALB: albumin; A/G: albumin/globulin; CK: creatine kinase; CK-MB: creatine kinase isoenzymes; LDH: lactate dehydrogenase.
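The voting strategy of a random forest classifier can be sketched as a majority vote over the individual trees. The example below is purely illustrative: the three single-split "trees" and their thresholds are placeholders invented for the sketch and are not the decision rules of the RBLC model, which combines many trees over 19 indices.

```python
from collections import Counter

def forest_vote(trees, sample):
    """Return the class predicted by the majority of trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-split trees; the thresholds are placeholders, NOT model rules.
toy_trees = [
    lambda s: "lung cancer" if s["LDH"] > 240.0 else "other",
    lambda s: "lung cancer" if s["WBC"] > 10.0 else "other",
    lambda s: "lung cancer" if s["ALB"] < 30.0 else "other",
]

# Made-up sample: two of the three toy trees vote "lung cancer".
sample = {"LDH": 300.0, "WBC": 12.0, "ALB": 40.0}
prediction = forest_vote(toy_trees, sample)
print(prediction)
```

Using an odd number of trees avoids tied votes in a two-class problem.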
To confirm the efficiency, reliability, and repeatability of the RBLC model, 34 serial blood samples from 15 additional patients were also included in the study (detailed information, including the patients’ sex, age, smoking status, cancer stage and blood data, is listed in
A user-friendly web server is available online to use the RBLC method [
Web page of the RBLC tool for convenient usage online. RBLC: routine blood indices model for lung cancer; ALB/GLB: albumin/globulin.
The performance of the RBLC method was compared with that of other commonly used methods for identifying lung cancer, and the comparison was favorable. The association of the selected key routine blood indices with lung cancer was then analyzed and further confirmed.
With regard to other identification methods, computed tomography (CT) scans are a common tool for the detection of lung cancer. For instance, the National Lung Screening Trial (NLST) recommends the use of CT scans to help diagnose patients at high risk for lung cancer. The NLST also demonstrated that mortality could be reduced by 20% using CT screening, with a specificity of 72.6% [
Comparison of the performance of different methods for predicting lung cancer on cross-validation.
Prediction method | Sample size | Sensitivity, % | Specificity, % | Area under the curve |
RBLC^a | 226 | 96.30 | 94.97 | 0.99 |
Protein biomarker [ | 143 | 93.00 | 45.00 | N/A^b |
RNA biomarker [ | 310 | 93.00 | 90.00 | 0.97 |
DNA biomarker [ | 318 | 79.20 | 67.30 | 0.75 |
Computed tomography scans [ | N/A | 94.40 | 72.60 | N/A |
^a RBLC: routine blood indices model for lung cancer.
^b N/A: not applicable.
Currently, biomarker analysis is another prevalent technique for detecting lung cancer in high-risk populations. Different lung cancer–related components are ideal biomarkers for the detection of lung cancer. The protein, DNA, and RNA referenced in
Detailed information on the selected key indices for the RBLC model is shown in
Among these key indices, the relationship between lactate dehydrogenase (LDH) and lung cancer has been discussed extensively [
Feature comparison of lung cancer and other samples.
Feature | Negative sample | Positive sample (lung cancer) | P value |
White blood cell count | 0.1986 | 0.3088 | <.001 |
Neutrophil-granulocyte ratio | 0.4257 | 0.6502 | <.001 |
Lymphocyte ratio | 0.5298 | 0.3232 | <.001 |
Monocyte ratio | 0.4319 | 0.3970 | .20 |
Basophil ratio | 0.2555 | 0.1242 | <.001 |
Neutrophilic granulocytes | 0.1839 | 0.2808 | <.001 |
Monocytes | 0.2795 | 0.384 | <.001 |
Eosinophil cells | 0.3236 | 0.0833 | <.001 |
Mean corpuscular volume | 0.6808 | 0.5453 | <.001 |
Mean corpuscular hemoglobin | 0.6545 | 0.5983 | .008 |
Platelet distribution width | 0.5765 | 0.6337 | .03 |
Platelet large cell ratio | 0.5081 | 0.4010 | <.001 |
Carbamide | 0.4181 | 0.3197 | <.001 |
Alkaline phosphatase | 0.4138 | 0.1366 | <.001 |
Albumin | 0.5757 | 0.5574 | .52 |
Albumin/globulin | 0.3917 | 0.4155 | .46 |
Creatine kinase | 0.1103 | 0.0867 | .19 |
Creatine kinase Isoenzymes | 0.3557 | 0.2014 | <.001 |
Lactate dehydrogenase | 0.5441 | 0.1462 | <.001 |
In addition, white blood cell count (WBC) is one of the most commonly used, nonspecific markers of inflammation [
Research on eosinophil cells (EO#) associated with lung cancer is rarely reported. The significant difference in the EO# between lung cancer samples (
Although creatine kinase isoenzymes (CK-MB) have a good specificity for diagnosis of myocardial infarction, related reports have indicated that the presence of malignant tumors can cause a significant distinction in CK-MB levels [
All of the above results demonstrate that the blood indices we selected were related to lung cancer to some extent, but none of them alone exhibits a connection clear enough to be used for diagnostic purposes. With the aid of machine learning, which combines multiple test items and captures the complicated patterns among these blood indices, specific diseases may be distinguished. The identification performance of the RBLC model for lung cancer is rather encouraging, as shown in
Detailed information of the samples for RBLC modeling and validation.
Detailed information of the samples for further clinical relevance evaluation.
ACC: accuracy
ALP: alkaline phosphatase
AUC: area under the curve
CK-MB: creatine kinase isoenzymes
ctDNA: circulating tumor DNA
CT: computed tomography
EO#: eosinophil cells
FN: false negative
FP: false positive
LDH: lactate dehydrogenase
MCC: Matthews correlation coefficient
mtry: number of randomly selected features to split at each node
NE%: neutrophil granulocyte ratio
NLST: National Lung Screening Trial
ntree: tree number
RBLC: routine blood indices model for lung cancer
RF: Random Forest method
ROC: receiver operating characteristic
Sens: sensitivity
Spec: specificity
TN: true negative
TP: true positive
WBC: white blood cell count
We thank Professor Qiaosheng Pu for his critical advice on the composition and improvement of this paper. We also thank Professor Jianxi Xiao at Lanzhou University, Professor Chongge You at Lanzhou University Second Hospital, Deputy Chief Examiner Juan Li at Lanzhou University First Hospital, and Deputy Chief Examiner Yonghong Li at Gansu Provincial Hospital for their valuable advice on this work. This work was supported by the National Natural Science Foundation of China (#21405068 to SL) and the Fundamental Research Funds for the Central Universities of China (#lzujbky-2017-104 to SL).
None declared.