This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Fatty liver disease (FLD) arises from the accumulation of fat in the liver and may cause liver inflammation, which, if not well controlled, may develop into liver fibrosis, cirrhosis, or even hepatocellular carcinoma.
We describe the construction of machine-learning models for current-visit prediction (CVP), which can help physicians obtain more information for accurate diagnosis, and next-visit prediction (NVP), which can help physicians provide potential high-risk patients with advice to effectively prevent FLD.
The large-scale and high-dimensional dataset used in this study comes from Taipei MJ Health Research Foundation in Taiwan. We used one-pass ranking and sequential forward selection (SFS) for feature selection in FLD prediction. For CVP, we explored multiple models, including k-nearest-neighbor classifier (KNNC), Adaboost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian naïve Bayes (GNB), decision trees C4.5 (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) and several of its variants as sequence classifiers that use various input sets for prediction. Model performance was evaluated based on two criteria: the accuracy on the test set and the intersection over union/coverage between the features selected by one-pass ranking/SFS and by domain experts. The accuracy, precision, recall, F-measure, and area under the receiver operating characteristic curve were calculated for both CVP and NVP, separately for males and females.
After data cleaning, the dataset included 34,856 and 31,394 unique visits respectively for males and females for the period 2009-2016. The test accuracy of CVP using KNNC, Adaboost, SVM, LR, RF, GNB, C4.5, and CART was respectively 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, and 75.53%. The test accuracy of NVP using LSTM, bidirectional LSTM (biLSTM), Stack-LSTM, Stack-biLSTM, and Attention-LSTM was respectively 76.54%, 76.66%, 77.23%, 76.84%, and 77.31% for fixed-interval features, and was 79.29%, 79.12%, 79.32%, 79.29%, and 78.36%, respectively, for variable-interval features.
This study explored a large-scale FLD dataset with high dimensionality. We developed FLD prediction models for CVP and NVP. We also implemented efficient feature selection schemes for current- and next-visit prediction to compare the automatically selected features with expert-selected features. In particular, NVP emerged as more valuable from the viewpoint of preventive medicine. For NVP, we propose use of feature set 2 (with variable intervals), which is more compact and flexible. We have also tested several variants of LSTM in combination with two feature sets to identify the best match for male and female FLD prediction. More specifically, the best model for males was Stack-LSTM using feature set 2 (with 79.32% accuracy), whereas the best model for females was LSTM using feature set 1 (with 81.90% accuracy).
Prior research on the use of machine learning for early disease prediction has focused on diabetes, fatty liver disease (FLD), hypotension, and other metabolic syndromes [
Recently, machine learning has been used extensively in medicine and health care. Dealing with large datasets with many features requires efficient methods to reduce the computing time. We adopted one-pass ranking (OPR) for automatic feature selection, which achieved accuracy similar to that of the features selected by sequential forward selection (SFS). OPR enables finding good features for current-visit prediction (CVP) and next-visit prediction (NVP). The contributions of this paper can be summarized as follows. First, we compared the performance of OPR and SFS for automatic feature selection, demonstrating that OPR offers high efficiency with comparable accuracy when dealing with a high-dimensional dataset. Second, in addition to CVP, we propose the task of NVP, which is much more important for practicing preventive medicine. To our knowledge, this is the first attempt to perform NVP on FLD. Third, we modeled NVP as a sequence classification problem and proposed two feature sets with fixed or variable intervals for the long short-term memory (LSTM) classifier and some of its variants. Before describing the study, we first provide a review of some important prior work on FLD prediction along with a brief overview of automatic feature selection in machine learning.
Comparison of prior research and this study for fatty liver disease (FLD) prediction.
Reference | Sample size | Years of study | Feature selection | FLD type | Gender | Next-visit prediction | Data source |
Birjandi et al [ | <1700 | 2012 | Yes | NAFLDa | Male/Female | No | Health screening centers |
Jamali et al [ | <100 | 2012-2014 | No | NAFLD | Male | No | Hospital |
Yip et al [ | <1000 | 2015 | Yes | NAFLD | Male | No | Hospital |
Islam et al [ | <1000 | 2012-2013 | Yes | NAFLD/AFLDb | Male/Female | No | Hospital |
Ma et al [ | <11,000 | 2010 | Yes | NAFLD | Male/Female | No | Hospital |
Wu et al [ | <600 | 2009 | No | NAFLD/AFLD | Male | No | Hospital |
This study | >150,000 | 2009-2016 | Yes | NAFLD/AFLD | Male/Female | Yes | Health screening dataset |
aNAFLD: nonalcoholic fatty liver disease.
bAFLD: alcoholic fatty liver disease.
In various application domains, LSTM has proven to be the state-of-the-art sequence classifier that can achieve better results than classical methods. For instance, Kim et al [
Automatic feature selection is an important step in machine learning, since it can identify a feature subset to construct a better model while requiring less computing time for training and testing. Automatic feature selection methods can be divided into three categories: wrappers, filters, and embedded methods. Wrapper methods use a classifier to score the feature subsets, which produces accurate results but is time-consuming. Filter methods use a proxy measure instead of accuracy to score a feature subset, which is efficient but does not always produce a good model since the proxy measure does not always relate to classification accuracy [
Not all approaches covered in the literature use wrapper methods for feature selection. For example, as shown in
This study used different conventional classifiers for CVP, including Adaboost [
This study explored feature selection schemes for CVP and NVP, and proposes two feature sets for NVP using LSTM.
Flowchart of current-visit prediction and next-visit prediction for fatty liver disease (FLD).
Flowchart of current-visit prediction (CVP) and next-visit prediction (NVP) for fatty liver disease (FLD) with different classifiers. OPR: one-pass ranking; SFS: sequential forward selection; KNNC: k-nearest neighbor classifier; SVM: support vector machine; LSTM: long short-term memory; biLSTM: bidirectional long short-term memory.
Although fatty liver has no specific symptoms, there is a certain chance that steatohepatitis will develop in the long term, and it may progress to serious liver diseases such as cirrhosis, liver failure, and even liver cancer [
For this task, CVP uses a classifier with all important information (including lab and questionnaire results) at the current visit as inputs to predict whether or not the patient currently has FLD. Correct execution of CVP with selected features can help the doctor better understand which features are more likely to contribute to FLD. Sufficiently high CVP accuracy allows patients with a low FLD risk to forego a time-consuming and costly abdominal ultrasound. That is, CVP can be used for rapid screening at medical clinics that do not have the equipment or specialists needed to manually diagnose FLD. This can effectively reduce staff and equipment requirements at clinics and hospitals, which is of particular importance in the era of the COVID-19 pandemic.
For CVP feature selection, we used two wrapper-based methods, OPR and SFS, with KNNC as a simple classifier and leave-one-out (LOO) cross-validation for performance evaluation. Following this rapid feature selection, we used the selected features for model training and evaluation with other advanced classifiers, including Adaboost, SVM, LR, RF, GNB, decision trees C4.5, and CART.
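The two wrapper schemes can be illustrated with a small, dependency-free sketch. It assumes that OPR scores each feature on its own, ranks the features by that score, and then evaluates only the nested prefixes of the ranked list, whereas SFS re-scores every remaining feature at each step. A 1-nearest-neighbor classifier with leave-one-out accuracy stands in for the KNNC/LOO setup; the function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def loo_accuracy_1nn(X: Sequence[Sequence[float]], y: Sequence[int],
                     cols: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on the given columns."""
    hits = 0
    for i in range(len(X)):
        best_j, best_d = -1, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # the held-out sample cannot be its own neighbor
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        hits += int(y[best_j] == y[i])
    return hits / len(X)


def one_pass_ranking(X, y, score=loo_accuracy_1nn) -> Tuple[List[int], float]:
    """Assumed OPR: rank features by solo score, keep the best prefix of the ranking."""
    n_feat = len(X[0])
    solo = [score(X, y, [c]) for c in range(n_feat)]
    ranked = sorted(range(n_feat), key=lambda c: -solo[c])
    best_cols, best_acc = [], -1.0
    for k in range(1, n_feat + 1):
        acc = score(X, y, ranked[:k])
        if acc > best_acc:
            best_cols, best_acc = ranked[:k], acc
    return best_cols, best_acc


def sequential_forward_selection(X, y, score=loo_accuracy_1nn) -> Tuple[List[int], float]:
    """SFS: greedily add the most helpful feature; stop when accuracy stops improving."""
    remaining = list(range(len(X[0])))
    chosen, best_acc = [], -1.0
    while remaining:
        # Ties are broken in favor of the lowest feature index.
        acc, neg_col = max((score(X, y, chosen + [c]), -c) for c in remaining)
        if acc <= best_acc:
            break
        chosen.append(-neg_col)
        remaining.remove(-neg_col)
        best_acc = acc
    return chosen, best_acc


# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X_demo = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y_demo = [0, 0, 0, 1, 1, 1]
print(one_pass_ranking(X_demo, y_demo))
print(sequential_forward_selection(X_demo, y_demo))
```

Under these assumptions, OPR needs roughly 2n classifier evaluations for n features, whereas SFS needs on the order of n² in the worst case, which is the efficiency gap discussed in this paper.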
Early prediction also plays an essential role in disease prevention, especially for chronic diseases. With NVP, our system can even predict the next visit result, allowing physicians to arrange abdominal ultrasound examinations or other appropriate interventions for patients with a high future risk of FLD. For this task, we used a sequence classifier with all historical information (up to the current visit) as inputs to predict whether or not the patient will be diagnosed with FLD at the next visit. NVP is more important than CVP from the perspective of preventive medicine. If the patient is predicted to have a high probability of FLD risk at the next visit, the physician can suggest lifestyle changes (eg, diet, smoking, alcohol consumption) to effectively modify the key features that contribute to FLD in NVP, along with other appropriate interventions, including abdominal ultrasound at the next health check.
For feature selection in NVP, we used OPR with the LSTM classifier and a hold-out test (ie, training and testing) for performance evaluation. Note that we could not use SFS for feature selection since it is too time-consuming for LSTM. To create equally spaced features for each month between two visits for LSTM, we need to perform interpolation between these two visits for each subject. For lab test features (with continuous numerical values), this is achieved by spline interpolation with the piecewise cubic method. For questionnaire features (with categorical values of integers), this is achieved by linear interpolation with rounding to the nearest label, as shown in
Interpolation for the questionnaire features between any two medical checkups.
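The questionnaire case can be sketched as follows. This is a minimal illustration that assumes visits are indexed by integer month and that ties are rounded half up; neither detail is stated in the text, and the function name is hypothetical. The piecewise cubic interpolation used for lab test features is not shown.

```python
import math
from typing import List, Tuple


def interpolate_categorical(month_a: int, label_a: int,
                            month_b: int, label_b: int) -> List[Tuple[int, int]]:
    """Monthly labels strictly between two visits: linear interpolation, then rounding."""
    if month_b <= month_a:
        raise ValueError("the second visit must come after the first")
    span = month_b - month_a
    monthly = []
    for month in range(month_a + 1, month_b):
        t = (month - month_a) / span
        value = label_a + t * (label_b - label_a)
        # Round half up (an assumption); Python's round() would round half to even.
        monthly.append((month, math.floor(value + 0.5)))
    return monthly


# A questionnaire answer moving from label 1 (month 0) to label 3 (month 4).
print(interpolate_categorical(0, 1, 4, 3))
```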
As mentioned above, there are three categories of feature selection methods: wrappers, filters, and embedded methods [
Conceptual diagram of wrappers that interact with a given classifier to select critical features.
This study is primarily related to the MJ-FLD dataset [
Visit counts for males (blue) and females (red) per year in the MJ-FLD dataset and statistics of no fatty liver disease (NFLD) and fatty liver disease (FLD) per year. The drop from 2013 to 2014 is likely due to the implementation of Taiwan’s Personal Data Protection Act.
Between 2013 and 2014, the male count fell from 11,184 to 6770 (a 39.47% decrease, to 60.53% of the 2013 count), and the female count fell from 8896 to 4958 (a 44.27% decrease, to 55.73% of the 2013 count). Furthermore, over this 8-year period, the class size ratio of no fatty liver disease (NFLD) vs FLD was 0.66 (34,885 vs 53,171) for males and 2.02 (48,574 vs 23,990) for females. For each year from 2009 to 2016, the class size ratios of NFLD vs FLD were 0.69, 0.67, 0.63, 0.63, 0.66, 0.68, 0.64, and 0.63 for males, and 2.09, 2.16, 2.0, 1.93, 1.94, 2.1, 1.89, and 1.96 for females, respectively (
Another characteristic of the dataset is its high ratio of missing values, as shown in
The ratio of missing values for all features and for the top 20 features in the MJ-FLD dataset.
Histograms of important features of the MJ-FLD dataset for males (blue) and females (red). NFL: no fatty liver; FL: fatty liver; FAT: body fat; WC: waist circumference; WHR: waist-to-hip ratio; WEI: weight; DM_FG: diabetes for fasting glucose; TG: triglyceride; CHOL: total cholesterol; HDLC: high-density lipoprotein cholesterol; LDLC: low-density lipoprotein cholesterol; GPT: serum glutamic-pyruvic transaminase; DRINKALCGRAM: alcohol per gram; METAEQUI: metabolic equivalent for exercise per week; GGR: serum glutamic-oxaloacetic transaminase to glutamic-pyruvic transaminase ratio.
Some features such as BMI are strong indicators of FLD.
Progression of yearly average BMI over 8 years, broken down by [FLD, NFLD] x [male, female, overall] into 6 curves. FLD: fatty liver disease; NFLD: no fatty liver disease.
Our dataset is based on health screening results from individuals, some of whom underwent multiple screenings at different intervals with different sets of screening items. As a result, there are many missing values in the dataset that needed to be imputed before further processing. Moreover, the questionnaires changed over the 8 years during which the dataset was compiled; therefore, we needed to consolidate the answers to different questionnaires of the same type.
To perform missing-value imputation in our dataset, we used the mean for numerical features and the mode for questionnaire features. This is a quick but crude method, which is practical for such a large dataset. Missing-value imputation could be accomplished using other more complicated methods such as MICE (Multivariate Imputation by Chained Equations) [
To consolidate the answers to different questionnaires of the same type in the dataset, we needed to use some heuristics to derive consistent numerical values as features for machine learning. For instance, “grams of alcohol” represents the average weekly alcohol intake in grams [
In summary, the steps involved in data preprocessing were performed as follows:
Deletion of useless features: Our first step in data preprocessing was to drop features that are apparently not related to FLD, such as “cervical cancer,” “prostate cancer,” “other forms of cancer,” “other hereditary diseases,” “Chinese medicine,” and “has your mother or sister had breast cancer, ovarian cancer, or endometrial cancer?”
Missing value handling: Missing values in the dataset were replaced by the average for numerical features and by the mode for categorical features.
Feature conversion: To create consistent features from questionnaires, we consolidated highly related questionnaires and expressed the corresponding responses in numeric terms. For example, the feature “grams of alcohol consumption” was derived from responses to the questionnaire items “type of drink,” “amount of drink,” “drink or not,” and “alcohol density.” Similarly, the feature “weekly exercise metabolic equivalent” was derived from responses to the questionnaire items “type of sport,” “frequency of sport,” and “time for sport.”
Deletion of redundant features: Some highly redundant features were deleted from the dataset, such as “BMI,” “systolic/diastolic blood pressure while lying down left arm,” and “systolic/diastolic blood pressure while lying down right arm.”
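Feature-wise normalization: This was achieved by z-score normalization to have a zero mean and unit variance for each feature:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the original feature value, and $\mu$ and $\sigma$ are the mean and standard deviation of that feature, respectively.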
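The imputation and normalization steps above can be sketched in a few lines. This is a minimal illustration rather than the study's code: missing entries are assumed to be encoded as `None`, the population standard deviation is assumed for the z-score, and the function names are hypothetical.

```python
from statistics import mean, mode, pstdev
from typing import List, Optional


def impute(values: List[Optional[float]], kind: str) -> List[float]:
    """Replace None with the mean (numerical) or the mode (categorical) of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "numerical" else mode(observed)
    return [fill if v is None else v for v in values]


def zscore(values: List[float]) -> List[float]:
    """Rescale one feature to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mu) / sigma for v in values]


lab_values = impute([1.0, None, 3.0], "numerical")   # numerical feature
answers = impute([1, 1, None, 2], "categorical")     # questionnaire feature
print(lab_values, answers, zscore(lab_values))
```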
All experiments were performed on a 64-bit Windows 10 server with an Intel Xeon Silver 4116 CPU at 2.10 GHz, two NVIDIA Quadro GV100 GPUs, 256 GB RAM, and a 1-TB hard disk, using Matlab R2020b (9.8.0.1538559), Python 3.8.2, scikit-learn 0.24.1, and TensorFlow-GPU 2.4.1.
All of the models in this study were constructed based on the MJ-FLD dataset [
To investigate the effectiveness of different feature selection methods, we compared the computer-selected features with expert-suggested FLD features. All of the expert-suggested features are listed in
Features of fatty liver disease, including those suggested by domain experts or selected by one-pass ranking (OPR) and sequential forward selection (SFS) for current-visit prediction and next-visit prediction.
Features | Explanation | Suggested by experts | OPR: Selected | OPR: Matcha | SFS: Selected | SFS: Match | OPR (Feature set 1): Selected | OPR (Feature set 1): Match | OPR (Feature set 2): Selected | OPR (Feature set 2): Match |
age | Age | | | | ✓ | | ✓ | | ✓ | |
blood type | Blood type | | | | ✓ | | | | | |
bmd | Bone mineral density | | | | ✓ | | ✓ | | ✓ | |
bmi | Body mass index | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
cc (cm) | Chest circumference | | ✓ | | | | ✓ | | | |
cci (cm) | Chest circumference during inspiration | | ✓ | | | | ✓ | | ✓ | |
cea (ng/ml) | Carcinoembryonic antigen | | | | ✓ | | | | | |
ch | The ratio of chol/hdlc | ✓ | ✓ | ✓ | | | ✓ | ✓ | ✓ | ✓ |
chol (mg/dl) | Total cholesterol | ✓ | | | ✓ | ✓ | | | | |
diastolic | Diastolic blood pressure | | | | | | ✓ | | | |
drinkalcgram (g) | Alcohol per gram | ✓ | | | | | | | | |
drinkyear | How many years have you been drinking? | ✓ | | | | | | | | |
e (%) | Eosinophils | | | | | | | | ✓ | |
ery (10^6/µl) | Red blood cells | | | | ✓ | | | | | |
fat (g) | Body fat | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
fg (mg/dl) | Diabetes mellitus fasting glucose | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
food18 | How many servings of bread do you eat? | ✓ | | | | | | | | |
food19 | Do you add jam or honey to your food? | ✓ | | | | | | | | |
food20 | Do you add sugar to your coffee, tea, cola/soda, fruit juices, or other beverages? | ✓ | | | | | | | | |
food21 | How many servings of your food intake are fried in oil? | ✓ | | | ✓ | ✓ | | | | |
ggr | The ratio of got/gpt | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
ggt (IU/L) | Gamma-glutamyl transferase | | ✓ | | ✓ | | ✓ | | ✓ | |
got (IU/L) | Serum glutamic-oxaloacetic transaminase (sGOT) | | ✓ | | ✓ | | ✓ | | ✓ | |
gpt (IU/L) | Serum glutamic-pyruvic transaminase (sGPT) | ✓ | ✓ | ✓ | | | ✓ | ✓ | ✓ | ✓ |
hc (cm) | Hip circumference | | ✓ | | ✓ | | ✓ | | ✓ | |
hdlc (mg/dl) | High-density lipoprotein cholesterol | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
hei (cm) | Height | | | | ✓ | | | | ✓ | |
hema (%) | Hematocrit | | | | | | | | ✓ | |
Ldlc (mg/dl) | Low-density lipoprotein cholesterol | ✓ | | | | | ✓ | ✓ | | |
leu (10^3/ml) | White blood cells | | | | | | ✓ | | ✓ | |
mcv (fl) | Mean corpuscular volume | | | | ✓ | | | | | |
mdrug10 | Steroids | ✓ | | | ✓ | ✓ | | | | |
mdrug8 | Medicine for asthma | ✓ | | | ✓ | ✓ | | | | |
metaequi | Metabolic equivalent for exercise per week | ✓ | | | | | | | ✓ | ✓ |
n (%) | Neutrophils | | | | | | | | ✓ | |
p (mg/dl) | Phosphorus | | ✓ | | | | | | | |
pul (beats/min) | Pulse rate | | | | | | ✓ | | ✓ | |
relate33b | In the last 3 months, have you lost weight by more than 4 kg? | ✓ | | | | | | | | |
relate17a | Have your defecation habits changed? | | ✓ | | | | | | | |
sdephi (/HPF) | Sediment epithelial cells high | | ✓ | | | | | | | |
sdrhi (/HPF) | Sediment red blood cells high | | ✓ | | | | | | | |
sdwhi (/HPF) | Sediment white blood cells high | | ✓ | | | | | | | |
sg | Specific gravity | | ✓ | | | | | | | |
smokeornot | Have you ever smoked? | ✓ | | | | | | | | |
systolic | Systolic blood pressure | | | | | | ✓ | | | |
tb (mg/dl) | Total bilirubin | | | | | | | | ✓ | |
tg (mg/dl) | Triglyceride | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
tp (g/dl) | Total protein | | | | | | ✓ | | | |
tsh (µIU/ml) | Thyroid stimulating hormone | | ✓ | | ✓ | | | | | |
ua (mg/dl) | Uric acid | | | | | | ✓ | | ✓ | |
vanl | Visual acuity (naked left eye) | | | | ✓ | | | | | |
wc (cm) | Waist circumference | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
wei (kg) | Weight | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
Whr | Waist-to-hip ratio | ✓ | ✓ | ✓ | | | ✓ | ✓ | | |
Workstreng | What is your level of activity at work? | | ✓ | | | | | | | |
aIndicates a match with the features selected by domain experts based on the literature.
To evaluate the similarity between the feature sets manually selected by human experts (set S1) and automatically selected by OPR/SFS (set S2), we used two similarity indices, intersection over union (IoU) and coverage, defined as follows:

$$\mathrm{IoU} = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}, \qquad \mathrm{Coverage} = \frac{|S_1 \cap S_2|}{|S_1|}$$
Both similarity indices range from 0 to 1, and a higher value indicates higher similarity.
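Both indices are straightforward to compute with set operations. The sketch below assumes that coverage is the fraction of expert-selected features that the automatic method recovers; the feature names are placeholders chosen only so that the set sizes (24 expert features, 24 selected features, 11 in common) match one of the comparisons reported in this paper.

```python
from typing import Set


def iou(expert: Set[str], selected: Set[str]) -> float:
    """Intersection over union of two feature sets."""
    union = expert | selected
    return len(expert & selected) / len(union) if union else 0.0


def coverage(expert: Set[str], selected: Set[str]) -> float:
    """Share of expert features that also appear in the selected set (assumed definition)."""
    return len(expert & selected) / len(expert) if expert else 0.0


# Placeholder names: 24 expert features, 24 selected features, 11 shared.
expert_set = {f"expert_{i}" for i in range(24)}
selected_set = {f"expert_{i}" for i in range(11)} | {f"other_{i}" for i in range(13)}

print(f"IoU = {iou(expert_set, selected_set):.2%}")            # 11/37
print(f"Coverage = {coverage(expert_set, selected_set):.2%}")  # 11/24
```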
Given the size of the dataset, we can explore it in different directions. First, we needed to confirm the modeling accuracy of CVP across years, which was achieved using the previous year data for training and the current year data for testing. The test accuracy for each year is shown in
Next, we wanted to further explore the optimum duration in years considered for modeling in feature selection. In general, using a long period of historical data for modeling may result in mismatching with the test data since the optimum model may change over time. However, a short period of historical data may not be sufficient for stable model construction. As a result, we needed to identify the optimum duration in years where the training data are obtained for predicting the data in 2016. More specifically, we defined seven subtasks for training data in intervals (2015, 2014-2015, 2013-2015, 2012-2015, 2011-2015, 2010-2015, 2009-2015), and the test data were from 2016. This arrangement is illustrated in
The result is shown in
Test accuracy for each year using the previous year data for training and the current year data for testing for both males and females.
The models of seven subtasks for training in intervals (2015, 2014-2015, 2013-2015, 2012-2015, 2011-2015, 2010-2015, 2009-2015) and 2016 for testing, for males.
The best year interval for the model of male fatty liver disease prediction is 2012-2015.
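The seven training intervals can be generated programmatically, as in the sketch below. The `evaluate` callback is a hypothetical stand-in for training a model on a window and returning its test accuracy on the 2016 data; the scoring function used in the usage lines is a dummy for illustration, not a reported result.

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int]  # (first training year, last training year), inclusive


def training_windows(first_year: int, test_year: int) -> List[Window]:
    """Windows that all end the year before test_year and grow backwards in time."""
    last_train = test_year - 1
    return [(start, last_train) for start in range(last_train, first_year - 1, -1)]


def best_window(windows: List[Window], evaluate: Callable[[Window], float]) -> Window:
    """Return the window with the highest score according to evaluate."""
    return max(windows, key=evaluate)


windows = training_windows(2009, 2016)
print(windows)  # (2015, 2015), (2014, 2015), ..., (2009, 2015)

# Dummy scoring function that happens to peak at 2012; replace with real training code.
print(best_window(windows, lambda w: -abs(w[0] - 2012)))
```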
Features selected by one-pass ranking based on the standard set, in descending order of recognition rate.
For easy reference, we refer to the training set of the interval 2012-2015 and the test set from 2016 as the “standard set.” Based on the standard set, we applied OPR and SFS, as shown in
Finally, we tested other classifiers on the standard set, including KNNC, Adaboost, SVM, LR, RF, GNB, decision trees C4.5, and CART, as shown in
Comparison of one-pass ranking (OPR) and sequential forward selection (SFS) in terms of feature selection and classification.
Metric | OPR | SFS |
Intersection over union | 29.73% (11/37) | 33.33% (12/36) |
Coverage | 45.83% (11/24) | 50.00% (12/24) |
Classification accuracy | 80.32% | 80.92% |
Performance of various classifiers on the standard set. KNNC: k-nearest neighbor classifier; SVM: support vector machine; LR: logistic regression; RF: random forest; GNB: Gaussian naive Bayes; CART: classification and regression trees.
Performance metrics for eight different classifiers.
Classifier | Sex | AUROCa | Precision | Recall | F1 score | Accuracy |
KNNCb | Males | 0.80 | 0.77 | 0.82 | 0.79 | 80.00% |
KNNC | Females | 0.87 | 0.77 | 0.68 | 0.72 | 82.45% |
Adaboost | Males | 0.85 | 0.80 | 0.85 | 0.82 | 77.51% |
Adaboost | Females | 0.90 | 0.77 | 0.70 | 0.74 | 83.07% |
SVMc | Males | 0.85 | 0.80 | 0.86 | 0.83 | 77.97% |
SVM | Females | 0.90 | 0.80 | 0.69 | 0.74 | 83.44% |
LRd | Males | 0.85 | 0.83 | 0.78 | 0.81 | 76.94% |
LR | Females | 0.90 | 0.71 | 0.82 | 0.76 | 82.59% |
RFe | Males | 0.85 | 0.83 | 0.79 | 0.81 | 77.01% |
RF | Females | 0.90 | 0.72 | 0.80 | 0.76 | 82.90% |
GNBf | Males | 0.79 | 0.83 | 0.70 | 0.76 | 72.53% |
GNB | Females | 0.88 | 0.77 | 0.67 | 0.72 | 82.23% |
DTg | Males | 0.83 | 0.83 | 0.76 | 0.80 | 75.95% |
DT | Females | 0.87 | 0.67 | 0.78 | 0.72 | 79.30% |
CARTh | Males | 0.73 | 0.79 | 0.78 | 0.79 | 73.85% |
CART | Females | 0.76 | 0.63 | 0.75 | 0.68 | 76.72% |
aAUROC: area under the receiver operating characteristic curve.
bKNNC: k-nearest-neighbor classifier.
cSVM: support vector machine.
dLR: logistic regression.
eRF: random forest.
fGNB: Gaussian naïve Bayes.
gDT: decision tree.
hCART: classification and regression trees.
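For reference, the threshold-based metrics in this table follow the standard definitions, sketched below. The confusion-matrix counts in the usage lines are hypothetical; only the last line uses values from the table (precision 0.77 and recall 0.82, as in the first Males row) to show that they combine to the reported F1 score of 0.79.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
print(classification_metrics(tp=77, fp=23, fn=17, tn=83))

# Precision and recall taken from the table.
print(round(f1_score(0.77, 0.82), 2))
```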
As shown in
Investigation of hormonal influence, assuming menopause/andropause occurs at ages 53, 54, 55, 56, and 57, respectively. The upper plot is for males and the lower plot is for females. Each yellow-purple bar pair indicates the accuracy before and after menopause at a specific age. The dataset used for this analysis corresponds to the years 2009-2016.
For females, sex hormones will be affected not only by the lifestyle habits an individual engages in to maintain a good figure but also by factors such as dieting and drugs. To achieve a slim figure, many women try various types of diets that have several side effects, which may affect specific biochemical tests related to FLD. In addition, some women may resort to the ingestion of nutritional supplements or other forms of “diet pills” to lose weight. However, many of these drugs contain unknown ingredients or illegal substances that could significantly affect the results of tests associated with FLD.
In this experiment, we used LSTM with various setups for NVP. LSTM is a well-known sequence classifier that can use information from historical visits, with no length limit, to predict the possibility of FLD at the patient’s next clinic visit. As explained earlier, from the perspective of preventive medicine, NVP is much more important than CVP. The specifications for feature selection of NVP are as follows: dataset, male subjects in the MJ-FLD dataset; classifier, LSTM; feature selection, OPR with 3-fold cross-validation to select the most important 24 features.
In general, clinic visits do not always occur at regular intervals. For a given visit pattern of length N, we can extract N – 1 input-output pairs for NVP modeling using LSTM, as shown in
For feature set 1 with fixed-interval data, the dataset included the number of input/output pairs for males (13,315) and for females (10,998). The mean input sequence length for males and females was 42.03 (SD 21.25, range 5-96) and 41.44 (SD 20.84, range 4-96), respectively. For feature set 2 with variable-interval data, there were 16,081 input/output pairs for males with a mean input sequence length of 3.32 (SD 1.46, range 2-13), and 13,364 input/output pairs for females with a mean input sequence length of 3.15 (SD 1.35, range 2-15).
Feature set 2 with input data from variable intervals showed three major advantages: (1) the unfolded LSTM network has considerably fewer stages, resulting in much shorter training and prediction times; (2) the dataset is used directly with no need to perform extra interpolation in advance, thus reducing time requirements and increasing precision; and (3) it can directly produce a prediction for any future time point.
A typical visit pattern and the extracted input/output pairs for training long short-term memory (LSTM). If the visit pattern is denoted by [v1, v2, v3, v4, v5], then we can extract 4 input/output pairs for training LSTM: {v1⇒v2}, {v1,v2⇒v3}, {v1, v2, v3⇒v4}, {v1,v2,v3,v4⇒v5}. Note that patients with only a single visit are discarded in this next-visit prediction task.
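The extraction of input/output pairs from a visit pattern can be written in one line, as in this minimal sketch (the function name is hypothetical, and visits are represented here by simple labels rather than feature vectors):

```python
from typing import List, Sequence, Tuple, TypeVar

Visit = TypeVar("Visit")


def next_visit_pairs(visits: Sequence[Visit]) -> List[Tuple[List[Visit], Visit]]:
    """[v1, ..., vN] -> N - 1 pairs; each input is the full history before the target visit."""
    return [(list(visits[:k]), visits[k]) for k in range(1, len(visits))]


for history, target in next_visit_pairs(["v1", "v2", "v3", "v4", "v5"]):
    print(history, "=>", target)
```

A patient with a single visit yields an empty list, which matches the exclusion rule stated in the caption.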
To create fixed-interval data for feature set 1, we need to perform interpolation on the input/output parts. For this case, the input part is interpolated to have a fixed interval of 1 month and the output part is interpolated to have a time distance of 12 months from the nearest time of the input.
To create variable-interval data for feature set 2, we need to add two extra inputs to long short-term memory, including the time span from the previous visit and the time span to the future point at which the prediction occurs.
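One possible encoding of the two extra inputs is sketched below. It assumes that time is measured in months, that the first visit has a time span of 0 from its (nonexistent) predecessor, and that the time span to the prediction point is measured from each input visit; these conventions and the function name are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def variable_interval_sequence(visits: Sequence[Tuple[int, Sequence[float]]],
                               target_month: int) -> List[List[float]]:
    """Append [months since previous visit, months until prediction point] to each visit.

    visits: (month, feature values) tuples sorted by month.
    """
    rows = []
    prev_month = None
    for month, features in visits:
        since_prev = 0 if prev_month is None else month - prev_month
        to_target = target_month - month
        rows.append(list(features) + [since_prev, to_target])
        prev_month = month
    return rows


# Two visits 14 months apart, with a prediction requested for month 26.
history = [(0, [24.1]), (14, [25.0])]
print(variable_interval_sequence(history, target_month=26))
```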
First, the input/output pairs used to train feature set 1 (with 24 features for males in the dataset) were prepared as follows. All patients with only a single visit were removed from the dataset, reducing the total number of males from 34,856 to 22,972. From the historical data for each patient, we interpolated the data between any two consecutive visits to monthly values. For a specific visit (excluding the last one), the 12 months of interpolated data immediately before the visit were used as the feature set 1 input, while the interpolated output at 12 months after the visit was used as the output. The input-output data pairs were then collected using moving windows with a stride of 1 month.
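The moving-window extraction can be sketched as follows, given a patient's series that has already been interpolated to monthly values. The sketch assumes that the 12-month input window ends at the anchor month and that the output is the label 12 months after that anchor; the exact alignment is not spelled out in the text, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def fixed_interval_pairs(monthly_features: Sequence[Sequence[float]],
                         monthly_labels: Sequence[int],
                         history: int = 12, horizon: int = 12,
                         stride: int = 1) -> List[Tuple[List[Sequence[float]], int]]:
    """Slide a window over a monthly series and pair it with the label `horizon` months later."""
    pairs = []
    last_anchor = len(monthly_features) - horizon - 1
    for anchor in range(history - 1, last_anchor + 1, stride):
        window = list(monthly_features[anchor - history + 1: anchor + 1])
        pairs.append((window, monthly_labels[anchor + horizon]))
    return pairs


# A 30-month toy series in which each label equals its month index.
features = [[float(m)] for m in range(30)]
labels = list(range(30))
pairs = fixed_interval_pairs(features, labels)
print(len(pairs), pairs[0][1], pairs[-1][1])
```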
The final count of input-output data pairs for training with feature set 1 (24 features) was 469,159. These data pairs were divided into 70% used for training (10% of which was used for validation) and 30% used for testing, all with stratified partitioning. All training options and parameters for LSTM are listed in
Based on the above process, we then performed OPR on top of feature set 1 to derive 24 features. As shown in
The accuracy (upper plot) and loss (lower plot) for training and validation during the training of feature set 1 for male subjects of the MJ-FLD dataset. The best model was selected at epoch 93 where the validation accuracy reached its maximum of 81.72%.
Features selected by one-pass ranking based on feature set 1, ranked by accuracy.
Comparison of intersection over union (IoU), coverage, and accuracy of the features selected by one-pass ranking (OPR) and domain experts in the two feature sets.
Metric | OPR (Feature set 1) | OPR (Feature set 2) | Experts (Feature set 1) | Experts (Feature set 2) |
IoU | 29.73% (11/37) | 23.08% (9/39) | N/Aa | N/A |
Coverage | 45.83% (11/24) | 37.50% (9/24) | N/A | N/A |
Accuracy | 75.91% | 77.32% | 75.40% | 74.95% |
Computing time (seconds) | 5875 | 1452 | N/A | N/A |
aN/A: not applicable.
Comparison of performance, computing time, and error reduction rate with five long short-term memory (LSTM)-based classifiers.
Classifier | Feature set | Sex | AUROCa | Precision | Recall | F1 score | Accuracy | Computing time (s) | Error reduction rate |
LSTM | FS1b | Males | 0.83 | 0.75 | 0.74 | 0.75 | 76.54% | 2713 | 5.33% |
LSTM | FS1 | Females | 0.88 | 0.80 | 0.78 | 0.79 | 81.90% | 2485 | 30.86% |
LSTM | FS2c | Males | 0.86 | 0.78 | 0.77 | 0.78 | 79.29% | 1466 | 16.42% |
LSTM | FS2 | Females | 0.87 | 0.79 | 0.77 | 0.78 | 80.81% | 1469 | 26.70% |
biLSTMd | FS1 | Males | 0.83 | 0.75 | 0.74 | 0.75 | 76.66% | 3380 | 5.81% |
biLSTM | FS1 | Females | 0.88 | 0.81 | 0.78 | 0.79 | 81.70% | 3155 | 30.10% |
biLSTM | FS2 | Males | 0.87 | 0.78 | 0.77 | 0.78 | 79.12% | 1789 | 15.74% |
biLSTM | FS2 | Females | 0.88 | 0.79 | 0.77 | 0.78 | 80.79% | 1800 | 26.63% |
Stack-LSTM | FS1 | Males | 0.84 | 0.76 | 0.75 | 0.75 | 77.23% | 3764 | 8.11% |
Stack-LSTM | FS1 | Females | 0.88 | 0.80 | 0.78 | 0.79 | 81.87% | 3524 | 30.75% |
Stack-LSTM | FS2 | Males | 0.87 | 0.78 | 0.77 | 0.78 | 79.32% | 1952 | 16.55% |
Stack-LSTM | FS2 | Females | 0.87 | 0.79 | 0.77 | 0.78 | 80.51% | 2016 | 25.55% |
Stack-biLSTM | FS1 | Males | 0.84 | 0.76 | 0.75 | 0.75 | 76.84% | 6085 | 6.54% |
Stack-biLSTM | FS1 | Females | 0.88 | 0.80 | 0.78 | 0.79 | 81.77% | 5429 | 30.37% |
Stack-biLSTM | FS2 | Males | 0.87 | 0.78 | 0.77 | 0.78 | 79.29% | 2714 | 16.42% |
Stack-biLSTM | FS2 | Females | 0.88 | 0.79 | 0.77 | 0.78 | 80.78% | 2802 | 26.59% |
Attention-LSTM | FS1 | Males | 0.84 | 0.83 | 0.80 | 0.81 | 77.31% | N/Ae | 8.43% |
Attention-LSTM | FS1 | Females | 0.89 | 0.69 | 0.79 | 0.74 | 80.81% | N/A | 26.70% |
Attention-LSTM | FS2 | Males | 0.87 | 0.87 | 0.77 | 0.82 | 78.36% | N/A | 12.67% |
Attention-LSTM | FS2 | Females | 0.89 | 0.70 | 0.81 | 0.75 | 81.46% | N/A | 29.18% |
aAUROC: area under the receiver operating characteristic curve.
bFS1: feature set 1.
cFS2: feature set 2.
dbiLSTM: bidirectional long short-term memory.
eN/A: not applicable.
We next compared the performances of feature sets 1 and 2 to two baseline models, as shown in
The test accuracy of NVP using feature set 1 (with fixed intervals) and feature set 2 (with variable intervals) for males was 77.31% with Attention-LSTM (8.43% error reduction) and 79.32% with Stack-LSTM (16.55% error reduction), respectively. The error reduction rates were computed relative to a baseline model of simple inference. For females, the corresponding values were 81.90% with LSTM (30.86% error reduction) and 81.46% with Attention-LSTM (29.18% error reduction). The error reduction rates of the five classifiers for males and females are listed in
Accuracy for two baseline models and 10 long short-term memory (LSTM) models for males and females. biLSTM: bidirectional LSTM.
For feature set 2, we discarded patients with a single visit to obtain 76,172 input-output pairs; therefore, the number of male patient visits dropped from 34,856 to 22,972. The results of OPR-selected features are listed in
The computing time of OPR was much lower than that of SFS; nevertheless, OPR achieved performance comparable to that of SFS (in terms of the overlap between the automatically selected features and the manually selected features), especially when dealing with a large-scale dataset with high-dimensional features. The best model for CVP was KNNC for males (80.00%) and SVM for females (83.44%). The best model for NVP was Stack-LSTM using feature set 2 (79.32%) for males and LSTM using feature set 1 (81.90%) for females.
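The overlap between automatically and manually selected features can be illustrated with a minimal sketch. The feature names below are hypothetical, and the definition of coverage used here (the fraction of expert-selected features that the automatic method also selects) is an assumption rather than a definition taken from this section.

```python
def intersection_over_union(auto_selected, expert_selected):
    """Jaccard overlap between two collections of feature names."""
    auto, expert = set(auto_selected), set(expert_selected)
    union = auto | expert
    return len(auto & expert) / len(union) if union else 0.0


def coverage(auto_selected, expert_selected):
    """Fraction of expert-selected features recovered automatically (assumed definition)."""
    auto, expert = set(auto_selected), set(expert_selected)
    return len(auto & expert) / len(expert) if expert else 0.0


# Hypothetical feature names, for illustration only.
auto_features = ["bmi", "waist", "alt", "age"]
expert_features = ["bmi", "alt", "ggt"]

iou = intersection_over_union(auto_features, expert_features)
cov = coverage(auto_features, expert_features)
print(f"IoU = {iou:.2f}, coverage = {cov:.2f}")
```

IoU penalizes extra automatically selected features, whereas coverage as assumed here does not, so the two measures can diverge when the automatic selection is much larger than the expert list.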
For NVP, the proposed feature set 2 is highly flexible and can achieve results comparable to those obtained with feature set 1; moreover, the computing time is much shorter, and the prediction can be derived at any time in the future. Both feature sets 1 and 2 outperformed a simple inference model (baseline 2), achieving an error reduction of 16.55% (Stack-LSTM) for males and 30.86% (LSTM) for females.
As shown in
For CVP, the influence of hormones for females was more intense than that for males, leading to difficulty in FLD prediction for females after menopause, as shown in
For males in
In
It should be noted that by using Attention-LSTM with feature set 2, the accuracy dropped by only 0.44 percentage points for female FLD prediction (81.46% vs 81.90%) and by 0.96 percentage points for male FLD prediction (78.36% vs 79.32%) relative to the best models. The advantages of using feature set 2 include better efficiency in training/evaluation and more flexible prediction at any future time. Thus, if efficiency and flexibility are major concerns, a small loss in accuracy can be accepted in exchange for them.
This study explored the use of a large health checkup dataset for FLD prediction in terms of current-visit and next-visit predictions. We used OPR and SFS for feature selection in CVP and then compared the results against expert-selected features. In our experiment with CVP, OPR was more efficient than SFS and provided comparable results in terms of both classification accuracy and the similarity between the automatically selected features and the expert-selected features.
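The efficiency gap between the two selection strategies can be illustrated with a toy sketch. It assumes that OPR scores each feature once on its own and keeps the top-ranked ones, whereas SFS greedily re-scores every remaining candidate at each step; the scoring function, the feature names, and the weights are hypothetical stand-ins for training a classifier and measuring its accuracy.

```python
def one_pass_ranking(features, score, k):
    """Score each feature alone once, then keep the k best (assumed reading of OPR)."""
    ranked = sorted(features, key=lambda f: score([f]), reverse=True)
    return ranked[:k]


def sequential_forward_selection(features, score, k):
    """Greedily add the feature that most improves the score of the current subset."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


class CountingScore:
    """Toy stand-in for 'train a classifier and return its validation accuracy'."""

    WEIGHTS = {"bmi": 0.50, "waist": 0.45, "alt": 0.30, "age": 0.10}

    def __init__(self):
        self.calls = 0

    def __call__(self, subset):
        self.calls += 1
        total = sum(self.WEIGHTS[f] for f in subset)
        # bmi and waist are treated as largely redundant in this toy example.
        if "bmi" in subset and "waist" in subset:
            total -= 0.40
        return total


features = list(CountingScore.WEIGHTS)
opr_score, sfs_score = CountingScore(), CountingScore()
opr_pick = one_pass_ranking(features, opr_score, k=2)
sfs_pick = sequential_forward_selection(features, sfs_score, k=2)
print("OPR:", opr_pick, "evaluations:", opr_score.calls)
print("SFS:", sfs_pick, "evaluations:", sfs_score.calls)
```

In this sketch, OPR needs one evaluation per feature, while the number of SFS evaluations grows with both the number of features and the number of selected features. The toy redundancy penalty also shows the trade-off: ranking features individually cannot detect that two highly ranked features carry overlapping information.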
For NVP, we proposed two feature sets (feature sets 1 and 2) for various LSTM models. For females, the best accuracy of 81.90% was obtained when using feature set 1 with LSTM. For males, the best accuracy of 79.32% was obtained when using feature set 2 with Stack-LSTM. This indicates that the best models and best features are gender-dependent. However, it should be noted that feature set 2 is a much more compact representation; thus, it requires less time for training/evaluation, and there is no need for prior feature interpolation. Moreover, the model trained with feature set 2 is more flexible, allowing for FLD prediction at any time in the future.
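One way to picture a variable-interval representation is sketched below. The exact encoding used for feature set 2 is not given in this section, so the sketch assumes that each past visit is augmented with the elapsed time (in years) to the visit being predicted; the function name, the tuple layout, and the feature values are hypothetical. Patients with a single visit yield no input-output pairs, which matches the exclusion described earlier.

```python
from datetime import date


def build_next_visit_pairs(visits):
    """Turn one patient's visits into (history, next-visit label) pairs.

    visits: list of (visit_date, feature_list, has_fld) tuples sorted by date.
    Each history row is the visit's features plus the years until the target visit.
    """
    pairs = []
    for k in range(1, len(visits)):
        target_date, _, target_label = visits[k]
        history = []
        for visit_date, features, _ in visits[:k]:
            years_ahead = (target_date - visit_date).days / 365.25
            history.append(list(features) + [years_ahead])
        pairs.append((history, target_label))
    return pairs


# Hypothetical patient with three irregularly spaced visits and two features per visit.
patient = [
    (date(2010, 3, 1), [24.1, 31.0], 0),
    (date(2011, 9, 15), [25.3, 38.0], 0),
    (date(2014, 1, 10), [27.0, 52.0], 1),
]
pairs = build_next_visit_pairs(patient)
single_visit_pairs = build_next_visit_pairs(patient[:1])
print(len(pairs), "pairs from three visits;", len(single_visit_pairs), "from one visit")
```

Because the interval is an explicit input in this sketch, no interpolation onto a fixed time grid is needed, and a prediction for an arbitrary future date could be requested by changing the interval value.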
In practice, NVP is much more valuable from the perspective of preventive medicine because, whenever a positive prediction occurs, the physician can suggest lifestyle changes to prevent FLD from developing by the next visit. To our knowledge, this is the first use of machine learning for NVP using a large-scale dataset.
Our immediate future work will focus on extending our LSTM-based NVP system to develop a comprehensive recommendation system, in which precise and personal recommendations will be given to prevent the potential future development of FLD, such as reduction in alcohol consumption, weight loss, and increased exercise. Such precise, personalized recommendations can be made based on patient clustering according to influential features. In general, such a system for preventive treatment can also be extended to other chronic or metabolic syndrome diseases, as long as we have a large dataset that covers many years for longitudinal studies.
Parameters of training options for the several variants of long short-term memory (LSTM).
AFLD: alcohol-related fatty liver disease
AUROC: area under the receiver operating characteristic curve
biLSTM: bidirectional long short-term memory
CART: classification and regression trees
CVP: current-visit prediction
FLD: fatty liver disease
GNB: Gaussian naive Bayes
IoU: intersection over union
KNNC: k-nearest neighbor classification
LOO: leave one out
LR: logistic regression
LSTM: long short-term memory
NAFLD: nonalcoholic fatty liver disease
no fatty liver disease
NVP: next-visit prediction
OPR: one-pass ranking
RF: random forest
SFS: sequential forward selection
SVM: support vector machine
All data used in this study were authorized by and received from MJ Health Research Foundation (authorization code MJHRF2019014C). Any interpretations or conclusions described in this paper are those of the authors and do not represent the views of MJ Health Research Foundation. The work presented herein was partly supported by the Ministry of Science and Technology, Taiwan (grant MOST 110-2634-F-002-032).
None declared.