This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Electronic health records (EHRs) are important data resources for clinical studies and applications. Physicians or clinicians describe patients’ disorders or treatment procedures in EHRs using free text (unstructured) clinical notes. The narrative information plays an important role in patient treatment and clinical research. However, it is challenging to make machines understand the clinical narratives.
This study aimed to automatically identify Chinese clinical entities from free text in EHRs and make machines semantically understand diagnoses, tests, body parts, symptoms, treatments, and so on.
The dataset we used for this study is the benchmark dataset with human-annotated Chinese EHRs, released by the China Conference on Knowledge Graph and Semantic Computing 2017 clinical named entity recognition challenge task. Overall, 2 machine learning models, the conditional random fields (CRF) method and bidirectional long short-term memory (LSTM)-CRF, were applied to recognize clinical entities from Chinese EHR data. To train the CRF–based model, we selected features such as bag of Chinese characters, part-of-speech tags, character types, and the position of characters. For the bidirectional LSTM-CRF–based model, character embeddings and segmentation information were used as features. We also employed a dictionary-based approach as the baseline for performance evaluation. Precision, recall, and the harmonic mean of precision and recall (F1 score) were used to evaluate the performance of the methods.
Experiments on the test set showed that our methods were able to automatically identify types of Chinese clinical entities such as diagnosis, test, symptom, body part, and treatment simultaneously. With regard to overall performance, CRF and bidirectional LSTM-CRF achieved a precision of 0.9203 and 0.9112, recall of 0.8709 and 0.8974, and F1 score of 0.8949 and 0.9043, respectively. The results also indicated that our methods performed well in recognizing each type of clinical entity, in which the “symptom” type achieved the best F1 score of over 0.96. Moreover, as the number of features increased, the F1 score of the CRF model increased from 0.8547 to 0.8949.
In this study, we employed two computational methods to simultaneously identify types of Chinese clinical entities from free text in EHRs. With training, these methods can effectively identify various types of clinical entities (eg, symptom and treatment) with high accuracy. The deep learning model, bidirectional LSTM-CRF, can achieve better performance than the CRF model with little feature engineering. This study contributed to translating human-readable health information into machine-readable information.
Electronic health records (EHRs) comprise individuals’ health information such as laboratory test results, diagnosis, and medications. This information includes various data types, from structured information such as laboratory test results consisting of test items and the corresponding values, to unstructured data such as clinical narratives in discharge notes [
Early named entity recognition (NER) systems often use rule-based approaches that rely on various dictionary resources. More recently, machine learning (ML)–based approaches have been applied to NER, such as maximum entropy (ME), conditional random fields (CRF), support vector machines (SVM), structural support vector machines (SSVM), and multiple deep learning methods [
Traditional ML-based approaches such as CRF can achieve good performance on sequence-labeling tasks but usually rely heavily on hand-engineered features and medical knowledge. In contrast, deep learning methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can achieve state-of-the-art performance with little feature engineering. Wu et al [
Most previous studies on CNER focus on English clinical texts. Various ML models have shown strong performance on CNER in English EHRs. Compared with English CNER, Chinese CNER faces more obstacles and remains a challenge, which may be due to the following reasons: (1) few open access Chinese EHR corpora; (2) a small number of Chinese medical dictionaries and ontology libraries; and (3) complicated properties of the Chinese language, such as the lack of word boundaries, the complex composition forms, and word forms remaining unchanged in all kinds of tense or POS [
In this study, we investigate 2 automatic methods, bidirectional LSTM-CRF and the CRF model, for simultaneously identifying 5 types of clinical entities from Chinese EHR data. Experimental results indicate that the 2 ML models performed well across the entity types, demonstrating their effectiveness in recognizing multiple types of clinical entities for further data-driven medical research. Our bidirectional LSTM-CRF model can capture not only the past and future input features through the bidirectional LSTM layer but also the sentence-level tag information via the CRF layer. Its performance is comparable with that of the Top 1 system (F1 score 0.9043 vs 0.9102) in the CCKS 2017 CNER challenge task and better than that of each of the 4 individual models of the Top 1 hybrid system, which requires considerable effort for feature engineering and model construction. The bidirectional LSTM-CRF model achieves state-of-the-art performance by utilizing only the character and segmentation information, which greatly reduces the human work involved in feature engineering.
A total of 2 datasets were used in this study. The first is an annotated corpus used for training and testing, whereas the second, regarded as the development set, is an unlabeled corpus used for learning character embeddings. All data are derived from the progress notes and examination results of in-patients’ EHRs released by the CCKS 2017 CNER challenge task [
The second dataset includes 2605 patients’ unlabeled EHR data, which was used for learning character embeddings. Character embeddings were trained using individual Chinese characters.
Traditionally, dictionary-based CNER approaches utilize medical dictionary resources such as the Unified Medical Language System, Medical Subject Headings, and RxNorm. For Chinese clinical entity recognition, we constructed a new dictionary on the basis of the Chinese Unified Medical Language System (CUMLS) [
CNER is generally converted into a sequence-labeling problem or a classification problem. In a sequence-labeling problem, given a sequence of input tokens $A=(a_1,\ldots,a_n)$ and a predefined set of labels $L$, the goal is to determine the sequence of labels $B=(b_1,\ldots,b_n)$ with the largest joint probability for the sequence of input tokens [
1、患儿为4岁儿童,起病急,病程短。2、以咳嗽,发热为主症。3、查体:咽部稍充血,双扁桃体稍肿大。双肺呼吸音粗,可闻及中小水泡音,结合胸片故诊断为:支气管肺炎。给予静点头孢哌酮、炎琥宁联合抗感染、雾化吸入布地奈德、沙丁胺醇减轻气道高反应 (1. The patient is a 4-year-old child with acute onset and short disease duration. 2. The main symptoms are cough and fever. 3. Examination: the throat is slightly congested and the double tonsils are slightly swollen. Breath sounds in both lungs are coarse, and medium and small bubbling rales can be heard. Combined with the chest x-ray, the diagnosis is bronchopneumonia. Intravenous cefoperazone combined with andrographolide was given for anti-infection, and aerosolized budesonide and salbutamol were given to reduce airway hyperresponsiveness.)
An example of the manually annotated gold standard.
Entity | pos_b^a | pos_e^b | Entity type |
咳嗽 (cough) | 21 | 22 | Symptom |
发热 (fever) | 24 | 25 | Symptom |
查体 (examination) | 32 | 33 | Test |
咽部 (throat) | 35 | 36 | Body part |
充血 (congestion) | 38 | 39 | Symptom |
双扁桃体 (double tonsils) | 41 | 44 | Body part |
肿大 (swollen) | 46 | 47 | Symptom |
双肺 (lung) | 49 | 50 | Body part |
呼吸音 (breath sound) | 51 | 53 | Test |
胸片 (chest x-ray) | 67 | 68 | Test |
支气管肺炎 (bronchopneumonia) | 74 | 78 | Diagnosis |
头孢哌酮 (cefoperazone) | 84 | 87 | Treatment |
炎琥宁 (andrographolide) | 89 | 91 | Treatment |
布地奈德 (budesonide) | 102 | 105 | Treatment |
沙丁胺醇 (salbutamol) | 107 | 110 | Treatment |
气道 (airway) | 113 | 114 | Body part |
^a pos_b: start position.
^b pos_e: end position.
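The annotation format above can be turned into per-character labels for sequence labeling. The following is a minimal sketch, assuming 0-based inclusive character offsets (which is consistent with the positions in the table) and a BIO tagging scheme; the function name `to_bio_tags` and the lowercase tag names are illustrative and are not taken from the challenge tooling.

```python
from typing import List, Tuple

# (mention, pos_b, pos_e, entity type); offsets are 0-based and inclusive,
# as in the annotated example above.
Annotation = Tuple[str, int, int, str]


def to_bio_tags(text: str, annotations: List[Annotation]) -> List[str]:
    """Convert span annotations into one BIO tag per character."""
    tags = ["O"] * len(text)
    for mention, pos_b, pos_e, entity_type in annotations:
        # Guard against misaligned offsets before writing any tags.
        if text[pos_b:pos_e + 1] != mention:
            raise ValueError(f"offsets {pos_b}-{pos_e} do not match {mention!r}")
        tags[pos_b] = f"B-{entity_type}"
        for i in range(pos_b + 1, pos_e + 1):
            tags[i] = f"I-{entity_type}"
    return tags


# First two sentences of the example note and their two annotated entities.
text = "1、患儿为4岁儿童,起病急,病程短。2、以咳嗽,发热为主症。"
annotations = [("咳嗽", 21, 22, "symptom"), ("发热", 24, 25, "symptom")]
tags = to_bio_tags(text, annotations)
for char, tag in zip(text[20:26], tags[20:26]):
    print(char, tag)
```

The offset check makes a misaligned annotation fail loudly instead of silently producing wrong training labels.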
Distribution of entities among the training set and the test set.
Dataset | Number of patients | Body part | Diagnosis | Symptom | Test | Treatment | Total |
Training set | 300 | 10,719 | 722 | 7831 | 9546 | 1048 | 29,866 |
Test set | 100 | 3021 | 553 | 2311 | 3143 | 465 | 9493 |
All | 400 | 13,740 | 1275 | 10,142 | 12,689 | 1513 | 39,359 |
CRF is a probabilistic undirected graphical model, which was first proposed by Lafferty et al in 2001 [
Let P(Y|X) be a linear-chain conditional random field. Given that the random variable X takes the value x (eg, “患者左腹压痛; patients with left abdominal pressing pain”), the conditional probability that the random variable Y takes the value y (eg, “O, O, B-body, I-body, B-symptom, and I-symptom”) is defined as:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i) \right)$$
in which $Z(x)$ denotes the normalization factor, $y_i$ (eg, I-symptom) is the label of $x_i$ (eg, “痛; pain” in the “患者左腹压痛; patients with left abdominal pressing pain”), and $y_{i-1}$ (B-symptom) is the label of $x_{i-1}$ “压 (pressing)”. $t_k$ (which depends on the current label $y_i$ and the former label $y_{i-1}$) and $s_l$ (which depends on the current label $y_i$) denote the feature functions, and $\lambda_k$ and $\mu_l$ denote their corresponding weights. Once the corresponding weights are learned, the labels of a new input sequence can be predicted according to P(y|x). The prediction can be carried out efficiently using the Viterbi algorithm. In this study, we use the CRF++ package [
Recently, multiple deep neural architectures have been exploited for NER tasks [
where σ denotes the element-wise sigmoid function; • denotes the element-wise product; and $b_i$, $b_f$, $b_c$, and $b_o$ denote the bias vectors. $W$ denotes the weight matrix, $x_t$ denotes the input vector corresponding to the current Chinese character, and $h_t$ denotes the output vector of the LSTM, which represents the context information of the current Chinese character.
For many sequence-labeling tasks, it is beneficial to have access to both past (left) and future (right) contexts. However, the LSTM’s hidden state $h_t$ takes information only from the past. A bidirectional LSTM model presents each sequence forward and backward to 2 separate hidden states to capture past and future information, respectively. The 2 hidden states are then concatenated to form the final output. The bidirectional LSTM-CRF model, which takes advantage of both bidirectional LSTM and CRF, can simultaneously utilize the past and future input features through the forward and backward LSTM layers and the sentence-level tag information via the CRF layer. The architecture of the bidirectional LSTM-CRF is shown in
When predicting the tags of Chinese characters, first, given a sentence $S=(c_1,\ldots,c_n)$, each character $c_t$ ($1 \le t \le n$) is represented by a vector $x_t$ (the concatenation of the character embeddings and the segmentation information) generated in the input layer. Second, the forward and the backward LSTM layers take the sequence of character representations $X=(x_1,\ldots,x_n)$ as input and generate the representation of the left context $h^l=(h^l_1,\ldots,h^l_n)$ and the right context $h^r=(h^r_1,\ldots,h^r_n)$ for each character, respectively. Third, the sequence of overall context representations is $h=(h_1,\ldots,h_n)$, where $h_t$ is the concatenation of $h^l_t$ and $h^r_t$. Finally, the sequence of overall context representations is taken as input for the CRF layer to predict the output label sequence $L=(l_1,\ldots,l_n)$.
A long short-term memory unit. $i_t$: input gate; $f_t$: forget gate; $c_t$: memory cell; $o_t$: output gate; $h_t$: output vector of the LSTM.
Architecture of the bidirectional long short-term memory-conditional random fields. LSTM: long short-term memory; CRF: conditional random fields; B-dis: B-diagnosis; I-dis: I-diagnosis.
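For reference, the symbols above fit the standard LSTM update. The following is a sketch under the assumption of the common formulation without peephole connections; the subscripts on the weight matrices are illustrative labels, and $\odot$ stands for the element-wise product written as • above.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```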
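Predicting the label sequence in the CRF layer amounts to finding the highest-scoring tag path, which the Viterbi algorithm does efficiently. The following is a minimal plain-Python sketch; the emission and transition scores are made-up numbers for illustration, whereas in the model they would come from the bidirectional LSTM outputs and the learned CRF transition parameters.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the tag path maximizing the sum of emission and transition scores."""
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-symptom", "I-symptom"]
# Hypothetical per-character scores for the two characters of "压痛".
emissions = [
    {"O": 0.1, "B-symptom": 1.0, "I-symptom": 0.2},
    {"O": 0.9, "B-symptom": 0.1, "I-symptom": 0.8},
]
# transitions[previous][current]; O -> I-symptom is heavily penalized.
transitions = {
    "O": {"O": 0.0, "B-symptom": 0.0, "I-symptom": -10.0},
    "B-symptom": {"O": 0.0, "B-symptom": -1.0, "I-symptom": 0.5},
    "I-symptom": {"O": 0.0, "B-symptom": 0.0, "I-symptom": 0.2},
}
decoded = viterbi(emissions, transitions, tags)
print(decoded)
```

In this toy input, choosing the best tag per character independently would label the second character as O, whereas the transition scores lead the decoder to keep the two characters together as one symptom entity.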
For training the CRF model, we select 4 types of features: bag-of-characters (BOC), part-of-speech (POS) tags, character types (CT), and the position of the character in the sentence (POCIS). NLPIR Chinese word segmentation system Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS)-2016 [
For training the bidirectional LSTM-CRF model, we employ the character embeddings and segmentation information as our features. Character embeddings are learned through Google’s word2vec [
We conducted an experimental study to compare ML-based CNER with the dictionary-based approach. First, we divided the first dataset into 2 parts: the first part, containing 300 patients’ EHR data, was used for training, and the second part, containing 100 patients’ EHR data, was used for testing.
In the CRF model, the context window size is set to 5 for extracting character features, covering the 2 preceding characters, the current character, and the 2 following characters. Different combinations of features were tried to train the CRF model: (1) BOC; (2) BOC+POS tags; (3) BOC+POS tags+CT; and (4) BOC+POS tags+CT+POCIS. In addition, we applied 10-fold cross validation for tuning model parameters: the training set was randomly divided into 10 parts, and each time 1 part was used as the test set and the remaining 9 parts as the training set. Finally, we used the average F1 score of the 10 experiments to estimate the accuracy of the model and tune the parameters.
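The 5-character window can be sketched as follows. The feature-name scheme (`char[-2]` to `char[2]`) and the padding symbol `<PAD>` are assumptions made for illustration; CRF++ itself expresses such windows through feature template files rather than Python code.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def window_features(chars: List[str], i: int, size: int = 5) -> Dict[str, str]:
    """Character features for position i from a centered window of `size` characters."""
    half = size // 2
    features = {}
    for offset in range(-half, half + 1):
        j = i + offset
        features[f"char[{offset}]"] = chars[j] if 0 <= j < len(chars) else PAD
    return features


chars = list("患者左腹压痛")
first = window_features(chars, 0)   # window runs past the left edge
middle = window_features(chars, 3)  # window fully inside the sentence
print(first)
print(middle)
```

Further feature types such as POS tags or character types could be added to the same dictionary under their own keys.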
As for the deep learning model, we fix the learning rate at 0.0004, the dropout at 0.5, and the character embedding dimension at 100. The number of hidden units in bidirectional LSTM-CRF is set to 100, and the optimizer is set to Adam.
As for dictionary-based CNER, which is regarded as the baseline approach, maximum forward matching based on our dictionary is adopted for extracting clinical entities.
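Maximum forward matching scans the text from left to right and, at each position, takes the longest dictionary entry that starts there. The following is a minimal sketch, assuming a dictionary that maps entity strings to entity types; the entries, the input text, and the `max_len` cap are illustrative and do not reproduce the dictionary built in this study.

```python
from typing import Dict, List, Tuple


def forward_max_match(text: str, dictionary: Dict[str, str],
                      max_len: int = 10) -> List[Tuple[str, int, int, str]]:
    """Greedy longest-match lookup; returns (mention, pos_b, pos_e, type) tuples."""
    entities = []
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                entities.append((candidate, i, i + length - 1, dictionary[candidate]))
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return entities


toy_dictionary = {"咳嗽": "symptom", "发热": "symptom",
                  "肺炎": "diagnosis", "支气管肺炎": "diagnosis"}
result = forward_max_match("以咳嗽,发热为主症,诊断为:支气管肺炎", toy_dictionary)
print(result)
```

Because the longest candidate is tried first, "支气管肺炎" is returned as a single entity rather than the shorter entry "肺炎".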
The evaluation for this CNER challenge task is implemented through the algorithm provided by CCKS 2017 organizers, which reports the Precision (P), Recall (R), and F1 score for all clinical entities using exact matching methods [
Here, mention represents the content of the entity, pos_b and pos_e separately denote the start and end position of the entity in the EHR text, and category represents the entity type. On the basis of the above equivalence relation, strict evaluation metrics are implemented as follows:
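Under this equivalence relation, a predicted entity counts as correct only if all 4 fields match a gold-standard entity. The following is a minimal sketch of the strict metrics for a single document, assuming entities are represented as (mention, pos_b, pos_e, category) tuples; it is not the organizers' evaluation script, and the example entities are illustrative.

```python
from typing import Iterable, Tuple

Entity = Tuple[str, int, int, str]  # (mention, pos_b, pos_e, category)


def strict_prf(gold: Iterable[Entity],
               predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 score under exact matching of all four fields."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = [("咳嗽", 21, 22, "symptom"), ("发热", 24, 25, "symptom"), ("查体", 32, 33, "test")]
# The second prediction has the right span but the wrong category, so it is not correct.
predicted = [("咳嗽", 21, 22, "symptom"), ("发热", 24, 25, "diagnosis")]
precision, recall, f1 = strict_prf(gold, predicted)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```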
To validate the effectiveness of the ML models in simultaneously identifying various types of clinical entities from Chinese EHRs, we carried out comparative experiments on the CCKS CNER corpus.
As shown in
Besides the overall performance,
Overall performance of the bidirectional long short-term memory-conditional random fields model, conditional random fields–based models with different feature combinations, and the dictionary-based model.
Model | Precision | Recall | F1 score |
Dictionary-based model | 0.5215 | 0.6855 | 0.5924 |
CRF^a model+BOC^b | 0.8792 | 0.8316 | 0.8547 |
CRF model+BOC+POS^c tags | 0.9065 | 0.8529 | 0.8789 |
CRF model+BOC+POS tags+CT^d | 0.9144 | 0.8658 | 0.8895 |
CRF model+BOC+POS tags+CT+POCIS^e | 0.9203 | 0.8709 | 0.8949 |
Bidirectional LSTM-CRF^f model | 0.9112 | 0.8974 | 0.9043 |
^a CRF: conditional random fields.
^b BOC: bag-of-characters.
^c POS: part-of-speech.
^d CT: character types.
^e POCIS: position of the character in the sentence.
^f LSTM-CRF: long short-term memory-conditional random fields.
Detailed performance of the bidirectional long short-term memory-conditional random fields–based, conditional random fields–based, and dictionary-based clinical named entity recognition approaches.
Entity type | Bidirectional LSTM-CRF^a | | | CRF^b (all features) | | | Dictionary-based approach | | |
 | Precision | Recall | F1 score | Precision | Recall | F1 score | Precision | Recall | F1 score |
Body part | 0.8873 | 0.8444 | 0.8653 | 0.8909 | 0.8186 | 0.8532 | 0.6081 | 0.6452 | 0.6261 |
Diagnosis | 0.8086 | 0.7486 | 0.7775 | 0.8148 | 0.6763 | 0.7391 | 0.3545 | 0.6058 | 0.4473 |
Symptom | 0.9584 | 0.9675 | 0.9630 | 0.9715 | 0.9580 | 0.9647 | 0.7591 | 0.7594 | 0.7592 |
Test | 0.9314 | 0.9510 | 0.9411 | 0.9459 | 0.9233 | 0.9345 | 0.7093 | 0.6949 | 0.7020 |
Treatment | 0.7833 | 0.7075 | 0.7435 | 0.7581 | 0.6538 | 0.7021 | 0.2240 | 0.6108 | 0.3278 |
Total | 0.9112 | 0.8974 | 0.9043 | 0.9203 | 0.8709 | 0.8949 | 0.5215 | 0.6855 | 0.5924 |
^a LSTM-CRF: long short-term memory-conditional random fields.
^b CRF: conditional random fields.
Comparison of F1 scores between dictionary-based approach and machine learning–based approaches among 5 entity types; LSTM-CRF: long short-term memory-conditional random fields; CRF: conditional random fields.
Essentially, recognizing various types of clinical entities allows extraction of the structured information of patients, which can be further exploited for data-driven medical research, clinical decision making, and health management. Compared with previous studies in CNER, ML-based methods can simultaneously extract 5 types of entities. Moreover, using only character embeddings and segmentation information, the proposed bidirectional LSTM-CRF model achieves performance comparable with that of the Top 1 system in the CNER challenge, an ensemble that incorporates 4 models (a rule-based model, a CRF model, and 2 RNN models). It therefore considerably reduces the effort required for feature engineering and model construction.
Experiments on the CCKS 2017 CNER challenge corpus show that ML-based models (bidirectional LSTM-CRF and CRF) achieve remarkably better performance than the dictionary-based method. Unlike the maximum forward matching of the dictionary-based CNER, ML methods can exploit the context information (eg, bag of Chinese characters and context representation information derived from LSTM), syntactic information (eg, POS tags), and structure information (eg, the position of the Chinese character in the sentence), which accounts for their better performance. Furthermore, the performance of the ML models is comparable with that of the Top 1 system in the CNER challenge, which achieved an overall F1 score of 0.9102, validating the effectiveness of the 2 ML-based methods in simultaneously recognizing multiple types of clinical entities for further data-driven medical studies.
The bidirectional LSTM-CRF model achieves the best overall performance (see
Furthermore, by comparing the results of CRF models and the bidirectional LSTM-CRF model in
Despite the impressive overall performance, the ML models do not show superiority over all the 5 types of clinical entities. As shown in
An error analysis of our 2 ML-based models shows that many errors occur when predicting tags on long entities with composite structures. For example, “高血压病腔隙性脑梗死” (hypertension lacunar cerebral infarction), which is annotated as a “diagnosis” type entity in the gold standard, is automatically annotated as 2 entities “高血压” (hypertension) and “腔隙性脑梗死” (lacunar infarction) by our ML models. In particular, we find that, in the EHR text, a “body part” type of entity is often followed by a “symptom” or a “diagnosis” type of entity, which makes it difficult to identify the border between the 2 entities. For instance, in EHR text “股骨骨折 (femoral fracture),” the “body part” type of entity “股骨 (femur)” is followed by a “symptom” type of entity “骨折 (fracture).” Incorporating domain knowledge and medical dictionaries, combining active learning methods with the current ML models, and increasing the scale of the datasets may help reduce these errors.
Furthermore, taking the CRF model based on all features (BOC, POS tags, CT, and POCIS) as an example, we conduct an in-depth error analysis of its results to explore the effectiveness and limitations of the ML models on Chinese CNER from both a statistical and a clinical view.
As for “GT-P” type of errors, only 1.51% (143/9493) of the entities in the test set are missed by the CRF model, which demonstrates its effectiveness in Chinese CNER. Further analysis of the “GT-P” errors from a medical view shows that some of the missed entities contain punctuation that is not part of the entity, which suggests that the ground truth is not always accurate. For example, the ground truth “肿,(swollen,)” should be “肿 (swollen)” without the trailing punctuation “,”. Moreover, some entities such as “对称 (symmetry)” do not belong to any type of clinical entity from the clinical view and should not appear in the ground truth. That these entities are not recognized by the CRF model is a problem of the ground truth rather than of the model. With more accurate ground truth, our results could be better. In addition, some new entities of the “Symptoms and signs” type, such as “活动障碍” (activity disorder), “听力下降” (hearing loss), and “功能障碍” (dysfunction), were not recognized by CRF, which may be because they never appear in the training set. Without sufficient training examples, it is challenging for supervised ML models to effectively identify clinical entities, especially unseen ones. Some studies [
In addition, through the analysis of “P-GT” type of errors, we find that most of the entities in this category are clinically meaningful, such as “2型糖尿病 (type 2 diabetes)” and “冠心病 (coronary disease).” These entities recognized by the CRF model should be part of the ground truth rather than counted as errors. This may be due to missing annotations when the ground truth was manually built. Thus, these types of “errors” should be regarded as an advantage of our models, which can maintain high efficiency and accuracy during CNER, rather than as errors. Moreover, some entities such as “腔隙性脑梗 (lacunar clog)” are new entities that never appeared in the ground truth. These entities are meaningful to clinicians and should be recognized. This suggests that our model can identify some new clinical entities from Chinese EHRs. However, some entities recognized by our models, such as “比重 (proportion),” do not make any sense.
Finally, the in-depth analysis of “INTERSECT” type of errors shows that most of the errors are due to the different granularities between our results and the ground truth. For example, the ground truth for clinical text “患者于去年诊断为脑水肿 (the patient was diagnosed with cerebral edema last year)” is “水肿 (edema)” and our result is “脑水肿 (cerebral edema).” This is a limitation of ML models, which cannot always identify entities at the appropriate granularity. However, many entities appear to be annotated at different granularities in different EHR documents in the ground truth. For example, text “脑梗死 (cerebral infarction)” is sometimes annotated as “脑 (brain)” and sometimes annotated as “脑梗死 (cerebral infarction),” and text “右侧丘脑腔隙性脑梗死 (right thalamic lacunar infarction)” is sometimes annotated as “右侧丘脑 (right thalamus)” and “腔隙性脑梗死 (lacunar infarction),” whereas it is sometimes annotated as “脑梗死 (cerebral infarction).” The ambiguity of the granularities in the ground truth makes it harder for the ML models to extract clinical entities at appropriate granularities. Specific annotation rules on annotation granularities as well as high-quality datasets could be constructed to further improve the performance of ML models on Chinese CNER.
In the future, we will not only develop new ML methods to enhance the accuracy of CNER but will also try to collect the recognized entities and standardize them against standard medical lexicons. Different types of entities have different distributions in different fields of the EHR; for instance, entities of the “treatment” type are often concentrated in the “diagnosis and treatment” field and rarely appear in the “general items” field. Separately building ML-based models on each type of field data rather than on all EHR data may therefore be a worthwhile study. As the amount of Chinese EHR data is limited, combining active learning methods with ML models may be a possible future direction. Furthermore, when such structural patient information is used for data-driven medical studies, the time order of the clinical entities as well as their modifications are usually required. Therefore, a future direction is to identify more details of the clinical entities.
Distribution of different types of errors in the results of the conditional random fields model based on all the 4 types of features (N=1386).
GT-P^a (N=143) | P-GT^b (N=604) | INTERSECT^c (GT vs P; N=639) |
尿蛋白- (urinary protein-) | 2型糖尿病 (type 2 diabetes) | 右侧丘脑腔隙性脑梗死 versus 右侧丘脑 + 腔隙性脑梗死 (right thalamic lacunar infarction vs right thalamic+lacunar infarction) |
低血糖 (hypoglycemia) | 冠心病 (coronary disease) | 胃肠 versus 急性胃肠炎 (stomach and intestine vs acute gastroenteritis) |
对称 (symmetry) | 胸 (chest) | 糖尿病肾病 versus 糖尿病 + 肾病 (diabetic nephropathy vs diabetes+nephropathy) |
瞳孔 (pupil) | 腔隙性脑梗 (lacunar clog) | 右下后 versus 右下后牙 (right lower back vs lower right posterior teeth) |
冠心病 (coronary disease) | 脂肪肝 (fatty liver) | 水肿 versus 脑水肿 (edema vs brain edema) |
肿, (swollen,) | 比重 (proportion) | 氨溴索注射液祛痰 versus 氨溴索注射液 (ambroxol injection to remove phlegm vs ambroxol injection) |
胃粘膜 (gastric mucosa) | 角膜 (cornea) | 皮肤、粘膜 versus 皮肤 + 黏膜 (skin、mucous membrane vs skin+mucous membrane) |
寒战 (chill) | 脑萎缩 (encephalatrophy) | 脂肪肝 versus 肝 (fatty liver vs liver) |
无力 (faintness) | 峰值 (peak value) | 尼群地平药物 versus 尼群地平 (nitrendipine drug vs nitrendipine) |
皮肤 (skin) | 活动障碍 (activity disorder) | 脑 versus 脑梗死 (brain vs cerebral infarction) |
^a GT-P: Entities in the ground truth that were not identified by CRF.
^b P-GT: Entities recognized by CRF that are not in the ground truth.
^c INTERSECT: For each entity, there is an intersection between the ground truth and the entity predicted by CRF.
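The 3 error categories can be derived by comparing gold-standard and predicted spans. The following is a minimal sketch under the assumption that overlap is judged on character positions only, regardless of category; the exact counting procedure behind the table is not specified here, and the offsets and categories in the example are made up for illustration.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, int, int, str]  # (mention, pos_b, pos_e, category)


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two inclusive character spans share at least one position."""
    return a[1] <= b[2] and b[1] <= a[2]


def categorize_errors(gold: List[Entity], predicted: List[Entity]) -> Dict[str, list]:
    """Split non-exact matches into GT-P, P-GT, and INTERSECT errors."""
    exact = set(gold) & set(predicted)
    gold_err = [g for g in gold if g not in exact]
    pred_err = [p for p in predicted if p not in exact]
    errors: Dict[str, list] = {"GT-P": [], "P-GT": [], "INTERSECT": []}
    for g in gold_err:
        partners = [p for p in pred_err if overlaps(g, p)]
        if partners:
            errors["INTERSECT"].append((g, partners))  # partial overlap
        else:
            errors["GT-P"].append(g)                   # missed entirely
    for p in pred_err:
        # Predictions that overlap no unmatched gold entity are spurious.
        if not any(overlaps(g, p) for g in gold_err):
            errors["P-GT"].append(p)
    return errors


gold = [("水肿", 9, 10, "symptom"), ("对称", 20, 21, "symptom")]
predicted = [("脑水肿", 8, 10, "symptom"), ("冠心病", 30, 32, "diagnosis")]
errors = categorize_errors(gold, predicted)
print(errors)
```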
CNER is one of the fundamental tasks of data-driven medical research. However, previous studies usually focused on recognizing a single type of clinical entity. In this study, we implemented 2 ML methods, the bidirectional LSTM-CRF and the CRF models, for simultaneously recognizing 5 types of clinical entities from the Chinese EHR corpus provided by the CNER challenge of CCKS 2017. The ML methods perform remarkably better than the baseline dictionary-based approach. Moreover, the deep learning model bidirectional LSTM-CRF, which outperforms the traditional CRF model in the overall result, achieves state-of-the-art performance on the basis of the character and segmentation information alone, which greatly reduces the human work involved in feature engineering.
BOC: bag-of-characters
CCKS: China Conference on Knowledge Graph and Semantic Computing
CNER: clinical named entity recognition
CNN: Convolutional Neural Network
CRF: conditional random fields
CT: character types
CUMLS: Chinese Unified Medical Language System
EHR: electronic health records
ICTCLAS: Institute of Computing Technology, Chinese Lexical Analysis System
LSTM: long short-term memory
ME: maximum entropy
ML: machine learning
NER: named entity recognition
POCIS: position of the character in the sentence
POS: part-of-speech
RNN: Recurrent Neural Networks
SSVM: structural support vector machines
SVM: support vector machines
The authors would like to thank the CCKS 2017 CNER challenge organizers for providing the training, development, and test corpora. The authors would also like to thank Kesheng Liu for the help with program development. This research is funded by the National Natural Science Foundation of China (Grant No. 81601573), the National Key Research and Development Program of China (Grant No. 2016YFC0901901), the National Population and Health Scientific Data Sharing Program of China, the Knowledge Centre for Engineering Sciences and Technology (Medical Centre), the Key Laboratory of Medical Information Intelligent Technology Chinese Academy of Medical Sciences, the Key Laboratory of Knowledge Technology for Medical Integrative Publishing, and the Chinese Academy of Medical Sciences (Grant No. 2016RC330005) and supported by the Fundamental Research Funds for the Central Universities (Grant No. 3332018153).
None declared.