This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Clinical electronic medical records (EMRs) contain important information on patients’ anatomy, symptoms, examinations, diagnoses, and medications. Large-scale mining of the rich medical information in EMRs can provide notable reference value for medical research. Owing to the complexity of Chinese grammar and the blurred boundaries of Chinese words, Chinese clinical named entity recognition (CNER) remains a notable challenge. Follow-up tasks such as medical entity structuring, medical entity standardization, medical entity relationship extraction, and medical knowledge graph construction largely depend on the performance of medical named entity recognition. A promising CNER result would provide reliable support for building domain knowledge graphs, knowledge bases, and knowledge retrieval systems. Furthermore, it would provide research ideas for scientists, offer medical decision-making references for doctors, and even guide patients on disease and health management. Therefore, obtaining excellent CNER results is essential.
We aimed to propose a Chinese CNER method that learns semantics-enriched representations from multisemantic features to comprehensively enhance machines’ understanding of the deep semantic information in EMRs, thereby making medical information more readable and understandable.
First, we used Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking (RoBERTa-wwm) with dynamic fusion and Chinese character features, including 5-stroke code, Zheng code, phonological code, and stroke code, extracted by 1-dimensional convolutional neural networks (CNNs) to obtain fine-grained semantic features of Chinese characters. Subsequently, we converted Chinese characters into square images to obtain Chinese character image features from another modality by using a 2-dimensional CNN. Finally, we input multisemantic features into Bidirectional Long Short-Term Memory with Conditional Random Fields to achieve Chinese CNER. The performance of our model was compared with that of the baseline and existing research models, and the features involved in the model were ablated and analyzed to verify its effectiveness.
We collected 1379 Yidu-S4K EMRs containing 23,655 entities in 6 categories and 2007 self-annotated EMRs containing 118,643 entities in 7 categories. The experiments showed that our model outperformed the comparison models, with F1-scores of 89.28% on the Yidu-S4K data set and 84.61% on the self-annotated data set.
Our proposed CNER method mines richer deep semantic information in EMRs through multisemantic embedding with RoBERTa-wwm and CNNs. It enhances the semantic recognition of characters at different granularity levels and improves the generalization capability of the method by achieving information complementarity among different semantic features, thus enabling the machine to semantically understand EMRs and improving the accuracy of the CNER task.
Abundant medical data have been accumulated since the development of the hospital information system, among which the electronic medical records (EMRs) contain information closely related to patients’ diagnosis and treatment processes [
Clinical NER (CNER) refers to the recognition of entities such as anatomy, disease, symptoms, clinical examination, medication, surgical procedure, and so on from EMRs [
Recently, the features of radicals for Chinese characters have been widely used to improve the efficiency of different Chinese natural language processing tasks [
The contributions of this study are as follows: (1) using pretrained language model (PLM) Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking (RoBERTa-wwm) with a dynamic fusion transformer layer to obtain the semantic features of Chinese characters; (2) using CNNs for extracting the radicals and picto-phonetic features of Chinese characters through the 5-stroke code, Zheng code, phonological code, and stroke code; (3) converting Chinese characters into square images, extracting Chinese character image features from another modality by CNNs, and deeply capturing the pictographic characteristics of Chinese characters; and (4) improving the semantic recognition ability of the model at different levels of granularity, achieving information complementarity between different semantic features, and improving the effect and generalization ability of the model based on multisemantic features.
Over recent decades, medical NER has remained a research focus. Medical NER research has passed through 3 main development stages: methods based on dictionaries and rules, methods based on statistical machine learning, and methods based on deep learning.
The dictionary-based [
With the continuous development of deep learning, Cocos et al [
PLMs are pretrained on a large-scale corpus to obtain prior semantic knowledge from unlabeled text and improve the effectiveness of different downstream tasks. The word vector generated by a bidirectional language model BERT with stacked transformer substructures contains not only the preliminary information from the corpus training but also the encoded contextual information. Some robust versions of BERT have been constructed since BERT was proposed in 2018. For example, the RoBERTa model [
The Yidu-S4K data set, shared publicly by YiduCloud, is derived from the Chinese EMRs entity recognition task of the China Conference on Knowledge Graph and Semantic Computing 2019 [
Self-annotated EMR data, collected from publicly desensitized Chinese EMR websites [
The ratio of the training set to the test set of the EMRs was 7:3. The Yidu-S4K data set was preprovided with 1000 EMRs as the training data set (1000/1379, 72.52%) and 379 EMRs as the test data set (379/1379, 27.48%). The self-annotated data set was randomly divided into 1401 EMRs as the training data set (1401/2007, 69.81%) and 606 EMRs as the test data set (606/2007, 30.19%).
The statistics of different types of entities in the 2 electronic medical record data sets.

Data set and entity type | Training set, n | Test set, n
Yidu-S4K data set | |
Disease | 4212 | 1323
Anatomy | 8426 | 3094
Laboratory | 1195 | 590
Image | 969 | 348
Medicine | 1822 | 485
Operation | 1029 | 162
All entities | 17,653 | 6002
Self-annotated data set | |
Disease | 9470 | 4504
Symptoms | 26,334 | 11,065
Anatomy | 17,877 | 7588
Examination | 19,664 | 8746
Instrument | 1244 | 560
Medicine | 5314 | 2566
Operation | 2578 | 1133
All entities | 82,481 | 36,162
Ethics approval was not required because the patients’ private information had been masked by the website.
In this study, all the experiments were conducted in Python [
Parameter settings.
Parameter | Value
Dropout | 0.5
Epoch |
Optimization | AdamW
Learning rate | 0.0001
Batch size | 32
BiLSTMa hidden layer | 768
Max_len | 510
RoBERTa-wwmb feature dimension | 768
Semantic feature dimension | 124
Image feature dimension | 128
aBiLSTM: Bidirectional Long Short-Term Memory.
bRoBERTa-wwm: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking.
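For reference, these settings can be collected into a plain configuration dictionary, as in the illustrative sketch below; the variable name CONFIG is ours, and the epoch count is left unset because it is not reported in the table.

```python
# Illustrative training configuration assembled from the parameter table above.
# The name CONFIG and the framework-agnostic layout are our own choices;
# the number of epochs is left unset because it is not reported here.
CONFIG = {
    "dropout": 0.5,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "bilstm_hidden_size": 768,
    "max_len": 510,                  # maximum input sequence length
    "roberta_wwm_feature_dim": 768,  # RoBERTa-wwm hidden size
    "semantic_feature_dim": 124,     # fine-grained character-code features
    "image_feature_dim": 128,        # 2D CNN character-image features
}
```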
The experiments used precision, recall, and F1-score as evaluation metrics. Precision is the proportion of true positive samples among all samples predicted to be positive; recall is the proportion of true positive samples among all actually positive samples; and F1-score is the harmonic mean of precision and recall.
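In terms of true positives (TP), false positives (FP), and false negatives (FN), these standard definitions can be written as follows:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_{1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```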
In this study, we proposed a CNER model based on multisemantic features, as shown in
The main architecture of our model. 1D CNN: 1D convolutional neural network; 2D CNN: 2D convolutional neural network; B-DIS: beginning of disease entity; CRF: conditional random fields; h: embedding of output character; I-DIS: inside of disease entity; LSTM: long short-term memory; O: other type; RoBERTa-wwm: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking; x: embedding of input character.
Many Chinese characters have retained their original connotations, as they originated from pictographic characters in ancient times. Moreover, the inherent fine-grained information contained in Chinese characters often implies additional semantic information. Accordingly, we obtained the 5-stroke code, Zheng code, phonological code, and stroke code information, as shown in
As shown in
Example of Chinese characters’ coded information from ZDIC.
Character | 5-stroke code | Zheng code | Phonological code | Stroke code |
呕 (vomit) | kaqy | jhos | ǒu | 2511345
吐 (vomit) | kfg | jbvv | tù | 251121
肿 (swelling) | ekhh | qji | zhǒng | 35112512
胀 (swelling) | etay | qch | zhàng | 35113154
心 (heart) | nyny | wz | xīn | 4544 |
手 (hand) | rtgh | md | shǒu | 3112 |
The process of obtaining Chinese character multisemantic features by convolutional neural network. ReLU: Rectified Linear Unit function; Conv 1: first convolutional layer; Conv 2: second convolutional layer; Max pooling 1: first max pooling layer; Max pooling 2: second max pooling layer; Dense: dense layer.
When RoBERTa-wwm is pretrained, the Chinese Wikipedia corpus is segmented with the language technology platform established by the Harbin Institute of Technology, which provides the word boundaries needed to achieve wwm. As shown in
The encoder of each transformer layer of the BERT model outputs a different abstract representation of the grammar, semantics, and factual knowledge in sentences. Studies on 12 natural language processing tasks have confirmed that each layer of the BERT model represents text information differently [
The transformer structure of the RoBERTa-wwm model is consistent with that of the BERT model. To make full use of the representation information of each transformer layer, we used the RoBERTa-wwm model with dynamic fusion [
Mask process of Bidirectional Encoder Representation from Transformers (BERT) and Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking (RoBERTa-wwm).
Coding representation of Transformer with 12 layers of Bidirectional Encoder Representation from Transformers model.
Assume that the text input sequence
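As an illustration of the dynamic fusion idea (a minimal sketch, not the authors’ released implementation), the hidden states of all 12 transformer layers can be combined through learnable, softmax-normalized weights; the module name, the PyTorch framework, and the final linear projection are our assumptions.

```python
import torch
import torch.nn as nn


class DynamicLayerFusion(nn.Module):
    """Fuse the hidden states of all transformer layers with learnable weights.

    Illustrative sketch of dynamic fusion, not the original implementation.
    """

    def __init__(self, num_layers: int = 12, hidden_size: int = 768):
        super().__init__()
        # One learnable scalar weight per transformer layer (uniform after softmax at init).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.dense = nn.Linear(hidden_size, hidden_size)

    def forward(self, all_hidden_states):
        # all_hidden_states: sequence of num_layers tensors,
        # each of shape (batch, seq_len, hidden_size).
        stacked = torch.stack(tuple(all_hidden_states), dim=0)
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (weights * stacked).sum(dim=0)  # weighted sum over layers
        return self.dense(fused)


# Usage with Hugging Face transformers (output_hidden_states=True returns the
# embedding layer plus all 12 encoder layers; only the encoder layers are kept):
# outputs = roberta(input_ids, attention_mask=mask, output_hidden_states=True)
# fused = DynamicLayerFusion()(outputs.hidden_states[1:])
```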
The 5-stroke code is a typical semantic code, which encodes Chinese characters according to strokes and structures. Currently, it is widely used to code Chinese characters. The expression of the 5-stroke code may inevitably repeat with the phonological code, for example, the 5-stroke code for “亦 (also)” is “you,” while the phonological code for “亦 (also)” is also “you” [
where
The Zheng code was created by renowned Chinese scholars according to the strokes and radicals of Chinese characters, following in-depth research on the patterns and structures of Chinese characters. Early Chinese versions of the Microsoft operating system adopted the Zheng code as a built-in code, indicating that the Zheng code is a scientific coding of Chinese characters. Chinese characters with similar codes may contain related semantic information; hence, the potential semantic relationships in text may be found by mining the structural information of Chinese characters using the Zheng code. The Zheng code was vectorized in the same way as the 5-stroke code and has the following formulas:
where
Over 90% of Chinese characters are picto-phonetic characters [
where
Chinese characters with similar strokes may have similar meanings. The strokes of each Chinese character were encoded in ZDIC [
where
To deeply extract the fine-grained semantic features of Chinese characters, we trained the character features using CNNs. The character features were trained by 2 convolutions with a kernel of 3, 2 max pooling layers, a ReLU activation, and a dense layer.
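A minimal sketch of such an extractor is given below, assuming PyTorch and a small symbol vocabulary for the code strings; the embedding size, padded code length, and per-code output size of 31 (so that 4 codes concatenate to the 124-dimensional semantic feature listed in the parameter table) are illustrative assumptions rather than the authors’ exact settings.

```python
import torch
import torch.nn as nn


class CharCodeCNN(nn.Module):
    """1D CNN over a character's code string (e.g., 5-stroke or Zheng code).

    Illustrative sketch only: the symbol vocabulary, embedding size, and
    padded code length are assumptions, not the authors' exact settings.
    """

    def __init__(self, vocab_size: int, emb_dim: int = 32,
                 code_len: int = 8, out_dim: int = 31):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv1 = nn.Conv1d(emb_dim, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(64, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.relu = nn.ReLU()
        self.dense = nn.Linear(64 * (code_len // 4), out_dim)

    def forward(self, code_ids):
        # code_ids: (batch, code_len) integer indices of code symbols.
        x = self.embed(code_ids).transpose(1, 2)    # (batch, emb_dim, code_len)
        x = self.pool(self.relu(self.conv1(x)))     # first convolution + max pooling
        x = self.pool(self.relu(self.conv2(x)))     # second convolution + max pooling
        return self.dense(x.flatten(start_dim=1))   # (batch, out_dim)


# Concatenating 4 such extractors (5-stroke, Zheng, phonological, and stroke codes)
# with out_dim=31 would yield a 124-dimensional vector, consistent with the
# semantic feature dimension listed in the parameter table.
```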
Chinese characters have been derived from pictographic symbols since ancient times, and characters with similar symbolic appearances have similar image features. However, the fonts of Chinese characters have changed considerably over time, and simplified characters have lost much of the pictographic information of traditional characters. Therefore, Cui et al [
where
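The sketch below illustrates one plausible realization of this step: each character is rendered onto a small square grayscale image with Pillow and passed through a 2-layer 2D CNN that outputs a 128-dimensional vector (matching the image feature dimension in the parameter table); the font file path, image size, and layer shapes are assumptions for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont


def char_to_image(char: str, size: int = 32,
                  font_path: str = "simhei.ttf") -> torch.Tensor:
    """Render a single Chinese character as a size x size grayscale tensor.

    The font path is a placeholder; any font covering CJK characters works.
    """
    img = Image.new("L", (size, size), color=0)            # black background
    font = ImageFont.truetype(font_path, size - 4)
    ImageDraw.Draw(img).text((2, 0), char, fill=255, font=font)
    return torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)


class CharImageCNN(nn.Module):
    """2D CNN mapping a 32x32 character image to a 128-dimensional feature vector."""

    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.dense = nn.Linear(32 * 8 * 8, out_dim)

    def forward(self, images):
        # images: (batch, 1, 32, 32) rendered character images.
        return self.dense(self.features(images).flatten(start_dim=1))


# Example: image feature vector for the character "呕".
# img = char_to_image("呕").unsqueeze(0).unsqueeze(0)   # (1, 1, 32, 32)
# vec = CharImageCNN()(img)                             # (1, 128)
```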
Finally, the multisemantic features
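Assuming the three feature groups are simply concatenated per character, which is one common way to combine such features (the authors’ exact fusion is given by their equation), a minimal sketch with the dimensions taken from the parameter table looks as follows.

```python
import torch


def fuse_features(roberta_out, semantic_out, image_out):
    """Concatenate the three per-character feature groups into one embedding.

    roberta_out:  (batch, seq_len, 768) dynamically fused RoBERTa-wwm representations
    semantic_out: (batch, seq_len, 124) fine-grained character-code features (1D CNN)
    image_out:    (batch, seq_len, 128) character-image features (2D CNN)
    Returns a (batch, seq_len, 1020) tensor that is fed to the BiLSTM.
    """
    return torch.cat([roberta_out, semantic_out, image_out], dim=-1)
```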
The role of BiLSTM [
where
The BiLSTM can be used to handle contextual relationships. However, it cannot consider the dependencies between tags. Therefore, it is necessary to add a constraint relation for the final predicted label by using the CRF [
where
Finally, we predicted the best label sequences by using the Viterbi algorithm [
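For concreteness, the following NumPy sketch shows Viterbi decoding over per-token emission scores and label-transition scores; it is an illustrative simplification (start and end transitions are omitted) rather than the authors’ implementation.

```python
import numpy as np


def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Return the highest-scoring label sequence for one sentence.

    emissions:   (seq_len, num_labels) per-token label scores from the BiLSTM.
    transitions: (num_labels, num_labels) CRF scores for moving from label i to label j.
    Illustrative sketch only; start/end transition scores are omitted for brevity.
    """
    seq_len, num_labels = emissions.shape
    score = emissions[0].copy()                    # best score ending in each label
    backpointers = np.zeros((seq_len, num_labels), dtype=int)

    for t in range(1, seq_len):
        # score[i] + transitions[i, j] for every previous label i and current label j.
        total = score[:, None] + transitions
        backpointers[t] = total.argmax(axis=0)
        score = total.max(axis=0) + emissions[t]

    # Follow the back-pointers from the best final label.
    best_last = int(score.argmax())
    path = [best_last]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]                              # label indices, e.g., B-DIS, I-DIS, O
```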
To obtain reliable experimental results, we ran each model 5 times and calculated the average precision, average recall, and average F1-score.
To verify the validity of the model, we compared our model with the existing ensemble models BiLSTM-CRF, ELMo-lattice-LSTM-CRF, ELMo-BiLSTM-CRF, the all convolutional neural network (ACNN), ELMo-encoder from transformer (ET)-CRF, and the multigranularity semantic dictionary and multimodal tree-NER (MSD_DT_NER) on the Yidu-S4K and self-annotated data sets, and the results are shown in
Performance comparison of ensemble models on the Yidu-S4K and self-annotated data sets.
Data set and model | Precision (%) | Recall (%) | F1-score (%)
Yidu-S4K data set | | |
BiLSTM-CRFa [ | 69.43 | 72.58 | 70.97
ACNNb [ | 83.07 | 87.29 | 85.13
ELMoc-lattice-LSTM-CRF [ | 84.69 | 85.35 | 85.02
ELMo-BiLSTM-CRF [ | —d | — | 85.16
ELMo-ETe-CRF [ | 82.08 | 86.12 | 85.59
MSD_DT_NERf [ | 86.09 | 87.29 | 86.69
Our model | 90.37 | 88.22 | 89.28
Self-annotated data set | | |
BiLSTM-CRF | 81.98 | 77.10 | 79.47
Our model | 84.24 | 84.99 | 84.61
aBiLSTM-CRF: Bidirectional Long Short-Term Memory-conditional random fields.
bACNN: all convolutional neural network.
cELMo: Embeddings from Language Models.
dNot available.
eET: encoder from transformer
fMSD_DT_NER: multigranularity semantic dictionary and multimodal named entity recognition.
The PLM BERT is a milestone in natural language processing. To verify the validity of the RoBERTa-wwm model, a robust variant of BERT, we compared our model with the existing ensemble models BiLSTM-CRF, BERT-BiLSTM-CRF, and RoBERTa-wwm-BiLSTM-CRF on the Yidu-S4K and self-annotated data sets, and the results are shown in
Performance comparison of PLMsa on the Yidu-S4K and self-annotated data sets.
Data set and model | Precision (%) | Recall (%) | F1-score (%)
Yidu-S4K data set | | |
BiLSTMb-CRFc [ | 69.43 | 72.58 | 70.97
BERTd-BiLSTM-CRF | 89.07 | 83.67 | 86.29
RoBERTa-wwme-BiLSTM-CRF | 90.08 | 86.90 | 88.46
Our model | 90.37 | 88.22 | 89.28
Self-annotated data set | | |
BiLSTM-CRF | 81.98 | 77.10 | 79.47
BERT-BiLSTM-CRF | 82.48 | 80.86 | 81.66
RoBERTa-wwm-BiLSTM-CRF | 84.23 | 82.86 | 83.54
Our model | 84.24 | 84.99 | 84.61
aPLM: pretrained language model.
bBiLSTM: Bidirectional Long Short-Term Memory.
cCRF: conditional random fields.
dBERT: Bidirectional Encoder Representation from Transformers.
eRoBERTa-wwm: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking.
To comprehensively evaluate our model, we calculated the F1-score of each entity category on the 2 data sets.
Performance comparison (F1-score, %) of each entity category on the Yidu-S4K data set.

Model | All | Disease | Anatomy | Image | Laboratory | Medicine | Operation
ELMoa-BiLSTMb-CRFc [ | 85.16 | 82.81 | 85.99 | 88.01 | 75.65 | 94.49 | 86.79
BERTd-BiLSTM-CRF | 86.29 | 87.14 | 86.36 | 83.43 | 77.98 | 89.46 | 93.11
BERT-wwme-BiLSTM-CRF | 87.12 | 86.18 | 85.47 | 81.52 | 79.69 | 90.14 | 92.49
RoBERTaf-wwm-BiLSTM-CRF | 88.46 | 87.71 | 87.01 | 86.69 | 82.36 | 93.22 | 92.87
Our model | 89.28 | 87.91 | 87.47 | 87.66 | 83.25 | 94.98 | 94.33
aELMo: Embeddings from Language Models.
bBiLSTM: Bidirectional Long Short-Term Memory.
cCRF: conditional random fields.
dBERT: Bidirectional Encoder Representation from Transformers.
ewwm: Whole Word Masking.
fRoBERTa: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach.
Performance comparison (F1-score, %) of each entity category on the self-annotated data set.

Model | All | Disease | Symptoms | Anatomy | Examination | Instrument | Medicine | Operation
BERTa-BiLSTMb-CRFc | 81.66 | 81.33 | 85.87 | 83.86 | 90.36 | 60.38 | 89.72 | 79.75
BERT-wwmd-BiLSTM-CRF | 81.58 | 74.91 | 83.89 | 81.23 | 88.84 | 54.76 | 85.63 | 68.49
RoBERTae-wwm-BiLSTM-CRF | 83.54 | 81.99 | 86.69 | 84.68 | 91.21 | 66.01 | 91.04 | 81.17
Our model | 84.61 | 82.34 | 86.93 | 85.62 | 91.30 | 69.25 | 91.28 | 82.49
aBERT: Bidirectional Encoder Representation from Transformers.
bBiLSTM: Bidirectional Long Short-Term Memory.
cCRF: conditional random fields.
dwwm: Whole Word Masking.
eRoBERTa: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach.
To verify that the fine-grained semantic features of Chinese characters, the image features of Chinese characters, and dynamic fusion were effective, we used the RoBERTa-wwm-BiLSTM-CRF model as the baseline to perform ablation experiments on the 2 EMR data sets, and the results are shown in
The performance of the model was significantly improved with the dynamic fusion of RoBERTa-wwm. After incorporating the semantic features of Chinese characters into the model alone, the overall performance was not as high as that after dynamic fusion; however, the performance on both data sets was still superior to that of the baseline. The performance of the model was unstable when the image features of Chinese characters were added alone: on the Yidu-S4K data set, the model’s performance was inferior to that of the baseline, whereas on the self-annotated data set, it improved only slightly. After adding both the semantic and image features of Chinese characters to the model, the performance on the Yidu-S4K data set was superior to that of the baseline and better than that of the model with semantic or image features alone; the performance on the self-annotated data set was superior to that of the baseline and better than that of the model with image features alone. When the model combined dynamic fusion with the semantic and image features of Chinese characters, its performance was significantly improved on both data sets. Dynamic fusion with the image features of Chinese characters showed the best comprehensive performance on the Yidu-S4K data set, whereas dynamic fusion with the semantic features of Chinese characters achieved the best comprehensive performance on the self-annotated data set. After combining the semantic and image features of Chinese characters with dynamic fusion, the performance of the model was superior to that of the baseline. Because the quality of the self-annotated EMRs is inferior to that of the public Chinese EMR corpus and the self-annotated data set covers a wider range of departments, the comprehensive effect on the self-annotated data set is lower than that on the Yidu-S4K data set in
The results of ablation experiments for mutisemantic features on the Yidu-S4K and self-annotated data sets. BiLSTM: Bidirectional Long Short-Term Memory; CRF: Conditional Random Fields; RoBERTa-wwm: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking.
The fine-grained semantic features of Chinese characters used in this study included the 5-stroke code, Zheng code, phonological code, and stroke code. To verify the effectiveness of these features, we used the RoBERTa-wwm-BiLSTM-CRF model as the baseline to perform ablation experiments for the 4 features on the 2 EMR data sets, and the results are shown in
The results of ablation experiments for fine-grained semantic features on the Yidu-S4K data set. BiLSTM: Bidirectional Long Short-Term Memory; CRF: conditional random fields; RoBERTa-wwm: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking.
The results of ablation experiments for fine-grained semantic features on the self-annotated data set. BiLSTM: Bidirectional Long Short-Term Memory; CRF: conditional random fields; RoBERTa-wwm: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach Whole Word Masking.
The
From
We strictly controlled the annotation quality of both data sets; hence, the probability of causes (1-3) was relatively low. Causes (4-6) were more likely to occur, and cause (7) mainly occurred for entities that were less common or had fewer training samples.
Different types of errors on the 2 data sets.

Types of errors | Example
1. Some manually annotated entities contained punctuation marks unrelated to the entities. | For instance, some Laboratory entities, such as “PLTa,” “NEUTb,” and “CAEc,” in the Yidu-S4K data set contained commas; our model correctly recognized them as “PLT,” “NEUT,” and “CAE.”
2. A few entity categories were confused. | For example, “PET-CTd” was manually annotated as a Laboratory entity in the Yidu-S4K data set, but our model correctly predicted it as an Image entity.
Inconsistent annotation will affect the accuracy of machine learning. | In the Yidu-S4K data set, the character “下 (below)” expressing the orientation in “剑突下 (below xiphoid)” was not annotated, and the character “部 (part)” expressing the part in “咽喉部 (hypopharynx)” was also not annotated, although most characters expressing specific locations were annotated.
Missing annotated entities will also affect the overall performance of the model. | The Disease entity “窦性心律 (sinus rhythm)” was not annotated in the Yidu-S4K data set, and the Medicine entity “生理盐水 (normal saline)” was not annotated in the self-annotated data set.
Figures, letters, and other symbols cannot provide as many semantic features as Chinese characters; hence, entities containing non-Chinese symbols may be difficult to recognize in a Chinese corpus. | Non-Chinese character entities, such as the Laboratory entity “AFPe” in the Yidu-S4K data set and the Examination entity “HCGf” in the self-annotated data set, failed to be recognized; similarly, the Medicine entity “VPg-16” was recognized as “VP-,” and “50%葡萄糖 (50% glucose)” was recognized as “葡萄糖 (glucose)” in the Yidu-S4K data set.
In the Yidu-S4K data set, Disease and Image entities might contain Anatomy entities. | For example, the Disease entity “二尖瓣后叶钙化 (posterior mitral valve leaflet calcification)” was recognized as the Anatomy entity “二尖瓣 (bicuspid),” and the Image entity “腹部彩超 (abdominal color Doppler ultrasound)” was recognized as the Anatomy entity “腹部 (abdominal).”
In the self-annotated data set, entity nesting is more severe: Disease, Examination, and Instrument entities might contain Anatomy entities, and Instrument entities might contain Operation entities. | For example, the Disease entity “内踝骨折 (ankle fracture)” was recognized as the Anatomy entity “内踝 (medial malleolus),” the Examination entity “骨髓组织病理 (bone marrow histopathology)” was recognized as the Anatomy entity “骨髓 (bone marrow),” the Instrument entity “胸部支具 (chest brace)” was recognized as the Anatomy entity “胸 (chest),” and the Instrument entity “左胸引流管 (left thoracic drainage tube)” was recognized as the Operation entity “左胸引流 (left thoracic drainage).”
When entity composition is more complex, mixed representations occur more often. | The Medicine entity “奥沙利铂 (乐沙定) (Oxaliplatin [Eloxatin])” in the Yidu-S4K data set was recognized as “奥沙利铂 (Oxaliplatin)” and “乐沙定 (Eloxatin)” separately; the Disease entity “CD5h+弥漫大B细胞淋巴瘤 (白血病期)” in the self-annotated data set was recognized as “CD” and “弥漫大B细胞淋巴瘤 (白血病期) (diffuse large B-cell lymphoma [leukemia stage])”; and the Examination entity “肥达、外斐反应 (Widal and Weil-Felix reaction)” in the self-annotated data set was recognized as “肥达 (Widal)” and “外斐反应 (Weil-Felix reaction)” separately.
When training samples are insufficient, the machine cannot fully learn the features of such entities and fails to recognize many of them. | In the self-annotated data set, the number of Instrument entities was smaller than that of the other categories (Table 2), accounting for only 1.52% of the total; some of those entities might never appear in the training data set, such as “针筒 (syringe),” “微导管 (microcatheter),” and “550px碳钢钻头 (550px carbon steel drill bit).”
aPLT: platelet count.
bNEUT: neutrophil count.
cCAE: carcinoembryonic antigen.
dPET-CT: positron emission tomography-computed tomography.
eAFP: alpha fetoprotein.
fHCG: human chorionic gonadotropin.
gVP: etoposide.
hCD5: a differentiation antigen, cluster of differentiation 5.
In this study, we developed a Chinese CNER method based on multisemantic features. The method extracted the semantic features of text using the RoBERTa-wwm model with dynamic fusion, extracted the fine-grained semantic features of Chinese characters by 1D CNN, and converted Chinese characters into square images to extract the image features of simplified Chinese characters from another modality by 2D CNN. We conducted a series of experiments to evaluate the model’s performance on the Yidu-S4K data set and the self-annotated data set; the results showed that the F1-scores of our method reached 89.28% and 84.61%, respectively.
Compared with the ensemble models, the BiLSTM-CRF model obtained the representation information of characters with the help of a vector look-up table. However, the information obtained in this way was too simple to capture the semantic meaning of the text or to resolve problems such as polysemy; hence, the model did not perform well. Kong et al [
Among the PLMs related to BERT, both the BERT-BiLSTM-CRF and BiLSTM-CRF models could obtain word-level vector representations. However, the word-level vector obtained by BERT contained rich contextual characteristics, including morphology, syntax, semantics, location, and other important semantic information, which can directly improve the task performance by replacing the lattice structure and complicated text representation methods in
In addition, the 2 ablation experiments showed that different features and methods led to different degrees of improvement in the semantic comprehension ability of the model. Multisemantic features could help the machine obtain richer semantic information, whereas dynamic fusion could fully recognize and use the representation information, so that the model performance could be comprehensively improved. Considering the heterogeneity among data, using 1 method alone or both methods may affect the generalization ability of the model. In this study, the model combining the fine-grained semantic features and image features of Chinese characters with dynamic fusion might not show the best performance; however, it was more universal and maintained performance at a relatively high level compared with the other experimental models. Furthermore, introducing more feature engineering was conducive to fully mining the semantic information of the text with the help of the fine-grained semantic information contained in Chinese characters and to improving the performance of the model on different data sets in a relatively stable manner through the cross-complementarity of different features.
To reduce the error rate of entity recognition, specifically for human-caused errors, we could try to avoid these problems by further improving the annotation quality. For errors caused by special data characteristics or data defects, the error rate might be reduced by incorporating medical knowledge, medical dictionaries, and some rules, even when training data are lacking.
The limitation of this study was that the ratio of the 6 entity types on the Yidu-S4K data set did not exactly follow 7:3, such that the ratio of the training set to test set for disease entities is approximately 0.7610:0.2390; the ratio of the training set to test set for medicine entities is approximately 0.7898:0.2102; and the ratio of the training set to test set for all entities is approximately 0.7463:0.2537. The unbalanced data of different entity types in the training and test sets caused a performance bias. Although the ratio of the training set to the test set of the EMRs was 7:3, we could not ensure that the number of entities of each type in each EMR in the training set and test set remained at a similar ratio.
In the future, we may focus on the recognition of a specific entity type in EMRs to improve the CNER performance. In addition, we will incorporate other prior medical knowledge or assign different weights to the Chinese character semantic features and image features, such as using the attention mechanism to capture important features, to improve the performance of the model.
This study proposes a Chinese CNER method to learn a semantics-enriched representation of Chinese character features in EMRs to enhance the specificity and diversity of feature representations. The results showed that the model had state-of-the-art performance on 2 Chinese CNER data sets compared with existing models. We demonstrated that multisemantic features could provide richer and more fine-grained semantic information for Chinese CNER through the cross-complementarity of different semantic features. This enabled the model to learn a better feature representation and improve its generalization ability.
BERT: Bidirectional Encoder Representation from Transformers
BiLSTM: Bidirectional Long Short-Term Memory
CNER: clinical named entity recognition
CNN: convolutional neural network
CRF: conditional random fields
ELMo: Embeddings from Language Models
EMR: electronic medical record
FN: false negative
FP: false positive
LSTM: long short-term memory
NER: named entity recognition
PLM: pretrained language model
RoBERTa: Robustly Optimized Bidirectional Encoder Representation from Transformers Pretraining Approach
TP: true positive
Word2Vec: Word to Vector
wwm: Whole Word Masking
The authors would like to thank the YiduCloud for providing the Yidu-S4K corpora. This work was supported by the Science and Technology Innovation 2030 “New Generation Artificial Intelligence” major project “Research on the construction and application of knowledge system for Medical Artificial intelligence Services,” China (grant 2020AAA0104901).
None declared.