Model-Based Reasoning of Clinical Diagnosis in Integrative Medicine: Real-World Methodological Study of Electronic Medical Records and Natural Language Processing Methods

Background: Integrative medicine is a form of medicine that combines practices and treatments from alternative medicine with conventional medicine. The diagnosis in integrative medicine involves the clinical diagnosis based on modern medicine and syndrome pattern diagnosis. Electronic medical records (EMRs) are the systematized collection of patients' health information stored in a digital format that can be shared across different health care settings. Although syndrome and sign information or related information can be extracted from the EMR and content texts can be mapped to computability vectors using natural language processing techniques, application of artificial intelligence techniques to support physicians in medical practices remains a major challenge.
Objective: The purpose of this study was to investigate model-based reasoning (MBR) algorithms for the clinical diagnosis in integrative medicine based on EMRs and natural language processing. We also estimated the associations among the factors of sample size, number of syndrome pattern types, and diagnosis in modern medicine using the MBR algorithms.
Methods: A total of 14,075 medical records of clinical cases were extracted from the EMRs as the development data set, and an external test data set consisting of 1000 medical records of clinical cases was extracted from independent EMRs. MBR methods based on word embedding, machine learning, and deep learning algorithms were developed for the automatic diagnosis of syndrome pattern in integrative medicine. MBR algorithms combining rule-based reasoning (RBR) were also developed. Standard evaluation metrics consisting of accuracy, precision, recall, and F1 score were used to estimate the performance of the methods. The association analyses were conducted on the sample size, number of syndrome pattern types, and diagnosis of lung diseases with the best algorithms.
Results: The Word2Vec convolutional neural network (CNN) MBR algorithms showed high performance (accuracy of 0.9586 in the test data set) in the syndrome pattern diagnosis of lung diseases. The Word2Vec CNN MBR combined with RBR also showed high performance (accuracy of 0.9229 in the test data set). The diagnosis of lung diseases could enhance the performance of the Word2Vec CNN MBR algorithms. Each group's sample size and the number of syndrome pattern types affected the performance of these algorithms.
Conclusions: The MBR methods based on Word2Vec and CNN showed high performance in the syndrome pattern diagnosis of lung diseases in integrative medicine. The parameters of each group's sample size, syndrome pattern type, and diagnosis of lung diseases were associated with the performance of the methods.
Trial Registration: ClinicalTrials.gov NCT03274908; https://clinicaltrials.gov/ct2/show/NCT03274908


Introduction
Integrative medicine is a form of medicine that combines practices and treatments from alternative medicine with conventional medicine [1][2][3]. In China, integrative medicine combines traditional Chinese medicine (TCM) and modern medicine for clinical practice [1][2][3]. The diagnosis in integrative medicine comprises the clinical diagnosis based on modern medicine and syndrome pattern diagnosis [4]. Syndrome pattern based on TCM theory is an outcome of the analysis of TCM information by the TCM practitioner, and TCM treatments rely on this concept [4]. A syndrome pattern can be defined as a categorized pattern of symptoms and signs in a patient at a specific stage during the course of a disease. Syndrome elements are the smaller units of syndrome classification and the basic elements of a syndrome pattern [5]. The correct combination of syndrome elements can infer an appropriate syndrome pattern. Syndrome elements are also derived from the syndrome and signs from the patient [5,6]. Generally, practitioners of integrative medicine making diagnosis decisions need to combine syndrome pattern diagnosis and the diagnosis in modern medicine [5,6]. As TCM treatments rely on syndrome pattern diagnosis, the treatment combined with the therapies of TCM and modern medicine is expected to be more efficient for patients. Therefore, syndrome pattern for the diagnosis in integrative medicine is an essential part of diagnosis.
Electronic medical records (EMRs) are the systematized collection of patients' and the population's electronically stored health information in a digital format that can be shared across different health care settings [7,8]. In China, EMRs are a collection of diagnoses of syndrome patterns and modern medicine as well as syndromes and signs in the TCM format [7,8]. Natural language processing (NLP) is a field of artificial intelligence and computational linguistics concerned with the interactions between computers and human natural languages [9,10]. Currently, NLP techniques combined with EMRs have been widely applied to medical data mining and medical decision support systems [9,10]. Word embedding, one of the techniques in NLP, maps each word in a dictionary to a vector of real numbers in a low-dimensional space [11,12]. Converting medical texts to vectors is important in EMR data mining and in artificial intelligence applications in medicine because computers can process medical texts only through such computability vectors.
Applying artificial intelligence techniques to support physicians in medical practices is a major challenge. The challenge stems mainly from the processing of uncertain information, of which syndrome and sign information is a classic example. The artificial neural network (ANN) can successfully and efficiently handle syndrome and sign information with uncertainty [13]. ANN is a computational model based on the structure and functions of biological neural networks [14]. The remarkable information processing characteristics of the ANN in terms of nonlinearity, fault and noise tolerance, high parallelism, and learning and generalization capabilities contribute to uncertain information processing and quantitative analysis. Furthermore, model-based reasoning (MBR) methods based on machine learning or ANN can process syndrome and sign information with uncertainty to support a precise and accurate diagnosis in integrative medicine.
As mentioned previously, syndrome and sign information or related information can be extracted from the EMRs, and content texts can be mapped to computability vectors using NLP techniques. Furthermore, MBR methods can be used to create a computer-aided system to support the diagnosis in integrative medicine. However, only a few studies have been conducted on MBR methods with EMRs and NLP to support the diagnosis in integrative medicine. Our previous work analyzed syndrome patterns and syndrome elements in lung diseases based on real-world EMR data [5]. This study aimed to explore MBR algorithms in the diagnosis in integrative medicine based on EMRs and NLP techniques applied on lung disease data sets. We also estimated the associations among the factors of sample size, number of syndrome pattern types, and diagnosis in modern medicine using the MBR algorithms.

Analysis of Workflow
The workflow of the analysis of the MBR methods in the diagnosis in integrative medicine based on EMRs and NLP is illustrated in Figure 1. The EMRs on lung diseases were exported from the hospital information system, and the syndrome and sign information and related information were extracted in text format. The corresponding syndrome pattern diagnosis, clinical diagnosis in modern medicine, and syndrome elements were extracted and saved to the database as structured data, keyed by each patient's unique code. The content texts of the syndrome and sign information were mapped to computability vectors through word embedding. Classification models that relate the vectors of syndrome and sign information to syndrome patterns or syndrome elements were developed using machine learning or neural network methods. MBR algorithms were developed on the basis of the classification models for syndrome patterns, and the combined model-based and rule-based reasoning algorithms were developed using the classification models for syndrome elements together with rule knowledge on how syndrome elements combine into syndrome patterns.
The performances of the MBR methods in the diagnosis of lung diseases in integrative medicine were evaluated and compared (for the main program codes for the module, please see [15]).

Data Collection and Processing
In our previous real-world study on the syndrome patterns and syndrome elements of lung disease, EMRs were collected from lung disease wards in 5 hospitals [5]. A data set consisting of 14,075 medical records of clinical cases from 4 hospitals was assigned as the development data set, and it was divided into the training data set and the test data set at a ratio of 4:1. Another independent data set comprising 1000 medical records of clinical cases from the remaining hospital was set as the external test data set. The information comprised patients' identity number, ward number, admission time, admission notes, first medical records, general medical records, discharge note, diagnosis of syndrome pattern, and diagnosis in modern medicine. In this work, we selected 10 common syndrome pattern types and 8 common lung diseases in the lung disease wards. Nine syndrome element types were generated and combined with the corresponding 10 syndrome pattern types.
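The 4:1 division of the development data set can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `split_development_set`, the fixed seed, and the use of a plain random shuffle (instead of, for example, stratifying by syndrome pattern) are all assumptions.

```python
import random

def split_development_set(records, train_parts=4, test_parts=1, seed=0):
    """Shuffle records and split them into train/test at train_parts:test_parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # For a 4:1 ratio, the training share is 4/5 of all records.
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]

# Integer identifiers stand in for the 14,075 development records.
train, test = split_development_set(range(14075))
```

With 14,075 records, this yields 11,260 training records and 2815 test records.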

Medical Information Extraction
The Chinese text information on the chief complaints, syndromes, and positive signs in the chest, tongue, and pulse was extracted from the admission notes, first medical records, and discharge records (Figure 2). The extracted Chinese text information was combined into contexts called "four diagnoses in TCM." The contexts of the syndromes and signs underwent a word-cutting process to split them into tokens. In this work, the first corpus included the context of syndrome and sign information. In the analysis of the diagnosis in modern medicine and syndrome pattern diagnosis, another corpus included an additional token of diagnosis in modern medicine.
Figure 2. The Chinese text information on the chief complaints, syndromes, and positive signs in the chest, tongue, and pulse that was extracted from the admission notes, first medical records, and discharge records. TCM: traditional Chinese medicine.

Word2Vec
Word embedding is an NLP feature-learning technique in which words are mapped to vectors of real numbers [16]. Word embedding involves mathematical embedding from a space with 1 dimension per word to a continuous vector space with a much lower number of dimensions. The Word2Vec model is an NLP system used to produce word embeddings: it takes a large corpus of text as its input and produces a vector space in which each unique word in the corpus is assigned a corresponding vector [16]. In this study, the corpus from a Chinese language Wikipedia dump, which is available at [17], was used to pretrain the word vector model. The Word2Vec model was configured to produce 256-dimensional vectors, with a context window of 5 and a minimum word count of 10. The Word2Vec model was implemented using the Gensim Python library [18].
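To make the window and minimum-count parameters concrete, the sketch below shows, in plain Python, how a vocabulary is pruned by word frequency and how context pairs are drawn around each word. It is a simplified illustration, not the Gensim implementation: the helper names `filter_vocabulary` and `context_pairs` are hypothetical, the English tokens stand in for segmented Chinese words, and reading the minimum count as a corpus-wide frequency threshold is an assumption.

```python
from collections import Counter

def filter_vocabulary(sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Yield (target, context) pairs within `window` positions on either side."""
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sentence[j]

# Toy tokenized corpus (hypothetical tokens).
corpus = [["cough", "yellow", "phlegm", "fever"], ["cough", "white", "phlegm"]]
vocab = filter_vocabulary(corpus, min_count=2)
pairs = list(context_pairs(corpus[1], window=1))
```

In the study's setting the corresponding values would be a window of 5 and a minimum count of 10; the toy values here are chosen only so that the output is easy to inspect.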

Doc2Vec
The Doc2Vec model is an extension of Word2Vec that constructs embeddings from entire documents or sentences (instead of individual words) by learning a randomly initialized vector for the document (or sentence) along with the word vectors [19]. The Doc2Vec model modifies the Word2Vec algorithm into an unsupervised learning algorithm that produces continuous representations for large blocks of texts, such as sentences, paragraphs, or entire documents. In this work, Doc2Vec was used to produce vectors for texts. The corpus from a Chinese language Wikipedia dump was again used to pretrain the Doc2Vec model. The Doc2Vec model was configured to produce 192-dimensional vectors, with a context window of 5 and a minimum word count of 10. The Doc2Vec model was also implemented using the Gensim Python library.

Machine Learning
In this work, 4 different machine learning classifier algorithms, namely, random forest (RF), extreme gradient boosting (XGBoost), support vector machine (SVM), and K-nearest neighbor (KNN), were used to develop MBR [20][21][22]. These 4 algorithms are classic machine learning algorithms that are well suited to classification tasks.
RF, a classic machine learning classifier, is composed of tree predictors, with each tree depending on the values of a random vector sampled independently and having the same distribution for all trees in the forest [23]. RF aims to reduce the tree correlation issue by choosing only a subsample of the feature space at each split. In this work, RF was used on 1000 trees in the forest, and it was implemented using the scikit-learn Python library.
XGBoost is an optimized distributed gradient-boosting system designed to be highly efficient, flexible, and portable [24]. It implements machine learning algorithms under the gradient boosting framework, which attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. XGBoost can also be implemented using the scikit-learn Python library.
SVM is a well-known supervised learning model with associated learning algorithms that analyze data for classification and regression analysis [25]. SVM is useful in text-based classification tasks and is not prone to errors in high-dimensional data sets. In this work, SVM was used with a linear kernel and implemented using the scikit-learn Python library.
The KNN classifier, one of the most popular machine learning algorithms, is based on the Euclidean distance between a test sample and the specified training samples [26]. It is used for data classification that attempts to determine in which group a data point is included by examining the data points around it. In this study, KNN was implemented using the scikit-learn Python library.
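The KNN decision rule described above can be sketched in a few lines of plain Python. This is an illustration of the technique only, not the scikit-learn implementation used in the study; the function name `knn_predict`, the value k=3, and the 2-dimensional toy vectors are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    # Rank training samples by Euclidean distance to the query vector.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-dimensional vectors standing in for embedded medical texts.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["cold wheezing", "cold wheezing", "hot wheezing", "hot wheezing"]
prediction = knn_predict(train_vectors, train_labels, (0.2, 0.1), k=3)
```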

Artificial Neural Network
ANNs, one of the main tools used in machine learning, are a group of models inspired by biological neural networks and used for estimating functions that depend on a large number of inputs [13]. Two ANN classifiers were used in this work: multilayer perceptron (MLP) and convolutional neural network (CNN). MLP is a feed-forward ANN model that maps sets of input data onto a set of appropriate outputs [27]. It consists of multiple layers of nodes with a nonlinear activation function in a directed graph, with each layer fully connected to the next one. Back-propagation is used as a supervised learning technique in MLP. In this work, the MLP had 6 hidden layers, with the number of nodes per layer varying from 64 to 1024. It was also implemented using the scikit-learn Python library.
CNN is one of the most popular algorithms for deep learning [28]. It is a category of ANN in which a model learns to perform classification tasks directly from images, text, or sound, and it has been proven effective in the areas of text classification and image recognition. CNN comprises one or more convolutional layers with a subsampling step, followed by one or more fully connected layers as in a standard multilayer neural network [29]. In this work, CNN consisted of an embedding layer, a convolutional layer, a max pooling layer, and 2 fully connected layers, and it was implemented using the Keras Python library.
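The core operation of a text CNN, a convolution over consecutive word vectors followed by max pooling, can be illustrated without any deep learning library. The sketch below is not the Keras model used in the study: the helper names, the single filter of width 2, the ReLU activation, the pooling over the whole sequence, and the 2-dimensional toy embeddings are all assumptions made for readability.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide a (width, dim) kernel over a (length, dim) sequence, then apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias + sum(
            w * x
            for kernel_row, vector in zip(kernel, window)
            for w, x in zip(kernel_row, vector)
        )
        outputs.append(max(0.0, total))
    return outputs

def global_max_pool(feature_map):
    """Keep the strongest activation of one filter across all positions."""
    return max(feature_map)

# Four tokens with 2-dimensional toy embeddings (the study used 256 dimensions).
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]  # one filter spanning 2 consecutive tokens
feature_map = conv1d_relu(embedded, kernel)
pooled = global_max_pool(feature_map)
```

In a full model, many such filters would each contribute one pooled value, and the resulting vector would be passed to the fully connected layers for classification.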

MBR
In this study, the development of MBR was based on word embedding and machine learning classifiers for syndrome pattern [30,31]. A total of 11 MBR algorithms were used: Word2Vec RF, Word2Vec XGBoost, Word2Vec SVM, Word2Vec KNN, Word2Vec MLP, Word2Vec CNN, Doc2Vec RF, Doc2Vec XGBoost, Doc2Vec SVM, Doc2Vec KNN, and Doc2Vec MLP. These models with multiclass outputs were consistent with the syndrome pattern types. A comparison of the performance of the 11 MBR algorithms was conducted.

MBR Combined With Rule-Based Reasoning
Here, MBR was based on word embedding and machine learning classifiers for syndrome elements. Nine MBR algorithms were used: Word2Vec RF, Word2Vec XGBoost, Word2Vec KNN, Word2Vec MLP, Word2Vec CNN, Doc2Vec RF, Doc2Vec XGBoost, Doc2Vec KNN, and Doc2Vec MLP. These models had multilabel outputs corresponding to the syndrome element types. The syndrome patterns were then generated by combining the predicted syndrome elements according to a rule knowledge base that relates syndrome elements to syndrome patterns. The performance of the 9 MBR combined with rule-based reasoning (RBR) algorithms was compared. The rules for combining TCM elements into TCM syndromes are presented in Multimedia Appendix 1.
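The rule-based step can be pictured as a lookup from a set of predicted syndrome elements to a syndrome pattern. The sketch below is hypothetical: the rules shown are guessed from the pattern names for illustration only (the actual rules are those in Multimedia Appendix 1), and the exact-match lookup in `infer_pattern` is an assumed, simplified matching strategy.

```python
# Hypothetical rule base: each syndrome pattern is keyed by the set of syndrome
# elements that must be predicted together. The actual rules are given in
# Multimedia Appendix 1 and may differ.
RULES = {
    frozenset({"qi-deficiency", "lung", "spleen"}): "qi-deficiency of lung and spleen",
    frozenset({"qi-deficiency", "lung", "kidney"}): "qi-deficiency of lung and kidney",
    frozenset({"yin-deficiency", "lung"}): "yin-deficiency of lung",
    frozenset({"wind", "cold", "lung"}): "wind-cold attacking lung",
    frozenset({"wind", "heat", "lung"}): "wind-heat attacking lung",
}

def infer_pattern(predicted_elements, rules=RULES):
    """Map a multilabel element prediction to a pattern, or None if no rule matches."""
    return rules.get(frozenset(predicted_elements))

pattern = infer_pattern(["lung", "qi-deficiency", "spleen"])
```

Because the pattern is derived from named elements, a clinician can inspect which elements led to the result, which is the interpretability benefit of the hybrid approach.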

Evaluation
The performances of the MBR algorithms in syndrome pattern were evaluated in the test data set and the external data set using standard metrics, which included accuracy, precision, recall, and F1 score [32]. Moreover, the performances of the Word2Vec CNN MBR algorithms in each syndrome pattern and each syndrome element were evaluated in the test data set using standard metrics. A fivefold cross validation was conducted 20 times on the train data set for each algorithm to estimate the 95% CI for the performance parameters.
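For reference, the four metrics can be computed from true and predicted labels as sketched below. The text does not state how per-class precision, recall, and F1 score were averaged across syndrome patterns, so the macro averaging used here is an assumption, as are the function name `classification_metrics` and the toy labels.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels for two syndrome patterns.
y_true = ["cold wheezing", "cold wheezing", "hot wheezing", "hot wheezing"]
y_pred = ["cold wheezing", "hot wheezing", "hot wheezing", "hot wheezing"]
metrics = classification_metrics(y_true, y_pred)
```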
The accuracy of the Word2Vec CNN MBR algorithms in corpus 1 and corpus 2 was compared at different proportions of the sample size of the development data set. In the accuracy analysis of the data set, each group's sample size was set as a proportion of the total sample size, and the syndrome pattern types were selected randomly. Linear regression analyses were conducted to evaluate the associations between each group's sample size and the number of syndrome pattern types at accuracies of 0.90 and 0.95.

Ethics Approval and Consent to Participate
The study was approved by the Ethics Committee of the Huashan Hospital and performed in accordance with the Declaration of Helsinki.

Availability of Data and Material
The data sets generated or analyzed during this study are not publicly available due to private information but are available from the corresponding author on reasonable request. Data sets are from the study whose authors may be contacted at the Center of Bioinformatics and Biostatistics, Institutes of Integrative Medicine, Fudan University. The data concerning external test data set and an example of development data set are available online [15].

Development and External Data Sets
The characteristics of the data set are shown in Figure 3. The development data set consisted of 14,075 medical records of clinical cases, and the external data set had 1000 medical records of clinical cases. Eight common lung diseases were found in the development data set: lung cancer (18.42%), pulmonary infection (18.59%), acute bronchitis (8.39%), interstitial pneumonia (1.66%), chronic bronchitis (9.78%), chronic obstructive pulmonary disease (25.98%), bronchiectasis (4.31%), and asthma (12.88%; Figure 3A). The same common lung diseases with the same proportions were also found in the external data set (Figure 3B). Ten common syndrome pattern types were found in the development data set: qi-deficiency of lung and spleen, qi-deficiency of lung and kidney, yin-deficiency of lung, wind-cold attacking lung, wind-heat attacking lung, cold wheezing, deficiency of qi and yin, hot wheezing, phlegm-heat obstruction in lung, and phlegm obstruction in lung (Figure 3C). The same 10 syndrome pattern types with the same proportions were also found in the external data set (Figure 3D). The development data set had 35,992 syndrome elements for 14,075 syndrome patterns, and a syndrome pattern consisted of 2.56 syndrome elements on average. The development data set included 9 syndrome element types: phlegm, wind, cold, heat, qi-deficiency, yin-deficiency, lung, spleen, and kidney (Figure 3E). A total of 2602 syndrome elements with the same 9 types were found in 1000 syndrome patterns (Figure 3F).
The performance analysis of the MBR based on Doc2Vec to identify syndrome patterns in the test data set showed the highest average accuracy of 0.8840 (95% CI 0.8730-0.8970) in the Doc2Vec CNN model (Table 2). The parameters of precision, recall, and F1 score were 0.8876 (95% CI 0.8776-0.8976), 0.8840 (95% CI 0.8710-0.8932), and 0.8843 (95% CI 0.8753-0.8973) in the Doc2Vec CNN model, respectively. Similar performance values were found in the corresponding external data set.
Table 2. Performance analysis of model-based reasoning methods applied for syndrome pattern diagnosis of lung disease based on Doc2Vec in the test and external data sets.

MBR Combined With RBR
The performance analysis of the MBR combined with RBR based on Word2Vec in the test data set indicated that the highest average accuracy was 0.9229 (95% CI 0.9099-0.9319) in the Word2Vec CNN model (Table 3). The parameters of precision, recall, and F1 score were 0.9884 (95% CI 0.9744-0.9964), 0.9679 (95% CI 0.9589-0.9809), and 0.9778 (95% CI 0.9698-0.9888) in the Word2Vec CNN model, respectively. Similar performance values were found in the corresponding external data set. The performance analysis of the MBR combined with RBR based on Doc2Vec showed that the highest average accuracy was 0.8190 (95% CI 0.8082-0.8281) in the Doc2Vec CNN model (Table 4). The parameters of precision, recall, and F1 score were 0.9550 (95% CI 0.9441-0.9673), 0.9507 (95% CI 0.9387-0.9597), and 0.9524 (95% CI 0.9444-0.9654) in the Doc2Vec CNN model, respectively. Similar performance values were found in the corresponding external data set.
Table 4. Performance analysis of model-based reasoning methods in combination with rule-based reasoning methods applied for syndrome pattern diagnosis of lung disease based on Doc2Vec in the test and external data sets.

Word2Vec CNN MBR in Corpus 1 and Corpus 2
Corpus 1 included the syndrome and sign information without a clinical diagnosis of lung disease, whereas corpus 2 included the syndrome and sign information with a clinical diagnosis of lung disease. A higher average accuracy (0.9584; 95% CI 0.9510-0.9655) was found in the Word2Vec CNN model for syndrome pattern diagnosis in corpus 2 than in corpus 1 (0.9471; 95% CI 0.9382-0.9549) in the test data set (Table 5). Moreover, higher performance parameter values of precision, recall, and F1 score were found in the Word2Vec CNN model for each syndrome pattern diagnosis in corpus 2 than in corpus 1 (Table 5). Similar results were found in the Word2Vec CNN method combined with the RBR model for syndrome pattern diagnosis in corpus 2 in comparison with the model in corpus 1 in the test data set with a full sample size (Table 6). A higher average accuracy of the Word2Vec CNN model was found for syndrome pattern diagnosis in the test data set with different sample sizes in corpus 2 than in corpus 1 (Figure 4).

Association of Accuracy and Sample Size With Syndrome Pattern Type
We performed an average accuracy analysis in the development data set classified by the number of syndrome pattern types and each group's sample size. The results showed that the average accuracy increased as each group's sample size increased and decreased as the number of syndrome pattern types increased (Table 7). The linear regression analysis showed that each group's sample size (Y) was significantly associated with the number of syndrome pattern types (X) at an accuracy of 0.90 (Y = 34.39 × X + 109.43; P<.001) and at an accuracy of 0.95 (Y = 48.55 × X + 296.78; P<.001; Figure 5).
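The two fitted lines can be evaluated directly to estimate the per-group sample size for a given number of syndrome pattern types. In the sketch below, the coefficients are those reported above; the names `FITS` and `required_group_size` are hypothetical, and values far outside the range studied (up to 10 pattern types) are extrapolations.

```python
# Fitted (slope, intercept) pairs reported in the text, keyed by target accuracy.
FITS = {0.90: (34.39, 109.43), 0.95: (48.55, 296.78)}

def required_group_size(n_pattern_types, target_accuracy):
    """Evaluate Y = slope * X + intercept for the reported linear fit."""
    slope, intercept = FITS[target_accuracy]
    return slope * n_pattern_types + intercept

size_10_types_090 = required_group_size(10, 0.90)  # about 453 records per group
size_10_types_095 = required_group_size(10, 0.95)  # about 782 records per group
```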

Principal Findings
We developed MBR methods for diagnosis of lung diseases in integrative medicine based on a real-world EMR data set with NLP. In our previous studies, we accumulated large-scale real-world data for artificial intelligence on integrative medicine. In this work, real-world medical records of clinical cases were used to develop models, and medical texts were mapped to vectors of real numbers that a computer could process. CNN approaches can automatically extract features from word vectors, thus contributing to the high performance of MBR methods in syndrome pattern diagnosis for diagnosis of lung diseases in integrative medicine. To the best of our knowledge, this study is the first to investigate MBR methods for diagnosis in integrative medicine on a large real-world data set using NLP and deep learning methods in China. These MBR methods can be recommended for a clinical decision-making system and can also provide a novel approach for diagnosis in integrative medicine. This work would be of significance for applications of artificial intelligence on integrative medicine.
An interesting finding is the high performance of the MBR methods for syndrome pattern diagnosis in integrative medicine. The best Word2Vec CNN MBR method for syndrome pattern diagnosis in integrative medicine had an accuracy of 0.9471 and 0.9250 in the development and external data sets, respectively. Word embedding and CNN contributed to the high performance. Word embedding techniques map texts to computability vectors, which makes quantitative text analysis possible. CNN can automatically extract features from medical texts, contributing substantially to the performance of the MBR. Additionally, adding the diagnosis information of modern medicine to the corpus enhanced the accuracy of the syndrome pattern diagnosis in integrative medicine, which suggests that physicians can make a syndrome pattern diagnosis more efficiently after determining the diagnosis in modern medicine.
We performed an association analysis to evaluate the relationship between the number of syndrome pattern type and each group's sample size for the accuracy of MBR algorithms. Moreover, we conducted a linear regression analysis to estimate the linear function of each group's sample size and syndrome pattern type at an accuracy of 0.95. Only a few studies reported on the quantitative associations. In the Word2Vec CNN MBR algorithms at an accuracy of 0.95, the smallest group sample size was 300 for 2 syndrome pattern types, and for each group the sample size was at least 800 for 10 syndrome pattern types. According to the linear model, the Word2Vec CNN MBR method based on each group's sample size of at least 1200 showed high performance in syndrome pattern with 20 types. A total of 400 common syndrome pattern types were grouped into 20 systems in integrative internal medicine. A total of 25,000 medical records of clinical cases could satisfy the Word2Vec CNN MBR methods in syndrome pattern diagnosis in an integrative system at an accuracy of 0.95. A total of 500,000 medical records of clinical cases could satisfy the Word2Vec CNN MBR methods in the diagnosis of 400 syndrome patterns in the entire integrative internal medicine at an accuracy of 0.95. We could thus combine data-driven artificial intelligence and knowledge-driven artificial intelligence for developing an intelligent clinical decision system on integrative medicine.
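The sample size estimates in this paragraph can be checked against the fitted line for an accuracy of 0.95 reported in the Results (Y = 48.55 × X + 296.78). The worked arithmetic below is a reading of the text, not a calculation given by the authors: the per-type figure of 1250 is an assumption obtained as 25,000 / 20, which lies between the stated minimum of 1200 and the fitted value of about 1268.

```latex
\begin{align*}
Y(10) &= 48.55 \times 10 + 296.78 = 782.28 \approx 800 \\
Y(20) &= 48.55 \times 20 + 296.78 = 1267.78 \\
20 \text{ pattern types} \times 1250 \text{ records per type} &= 25{,}000 \text{ records per system} \\
20 \text{ systems} \times 25{,}000 \text{ records per system} &= 500{,}000 \text{ records}
\end{align*}
```

The value for 20 pattern types lies outside the range of 2 to 10 types examined in the study, so it is an extrapolation of the linear fit.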
Interestingly, the combination of MBR and RBR methods applied for syndrome pattern diagnosis in integrative medicine showed high performance. Specifically, Word2Vec CNN MBR combined with RBR methods had an accuracy of 0.9559 in syndrome pattern diagnosis in corpus 2 with additional information on modern medicine diagnosis. This reasoning method gave physicians a clearer and more understandable account of lung diseases than the Word2Vec CNN MBR methods alone. Moreover, it was more suitable for users of or physicians practicing integrative medicine. Generally, hybrid reasoning is more suitable for application in clinical practice. Data- and knowledge-driven artificial intelligence both contribute to hybrid reasoning, which has the advantages of high-performance reasoning and of being explainable to clinicians. In clinical practice, reasoning over TCM elements could be used for TCM diagnosis or differentiation.
Although this study used novel methods to develop MBR in syndrome pattern diagnosis in integrative medicine, it has several limitations. First, we selected only 10 of the 20 common syndrome pattern types in lung diseases, partly because the other 10 syndrome pattern types did not have enough medical records of clinical cases. Therefore, future studies should use comprehensive syndrome patterns in lung diseases or other systems. Second, the corpus used for the pretrained word vectors was not large enough to cover all Chinese words or specialized terms on lung diseases.

Conclusion
MBR methods based on Word2Vec CNN showed high performance in syndrome pattern diagnosis of lung diseases in integrative medicine. The parameters of each group's sample size, syndrome pattern type, and clinical diagnosis of lung diseases were associated with the performance of the methods. Hybrid reasoning with data- and knowledge-driven artificial intelligence could contribute to the development of medical artificial intelligence in integrative medicine. In the near future, we aim to develop a clinical diagnosis or decision-making model with a knowledge graph and hybrid reasoning to better combine data- and knowledge-driven artificial intelligence in integrative medicine.