This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
The automatic coding of clinical text documents by using the International Classification of Diseases, Tenth Revision (ICD-10) can reduce the time and labor of manual coding. However, training an accurate model requires large amounts of clinical text, and the privacy of clinical documents must be protected when data from multiple hospitals are combined.
This study aims to train a classification model via federated learning for ICD-10 multilabel classification.
Text data from discharge notes in electronic medical records were collected from the following three medical centers: Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital. After comparing the performance of different variants of bidirectional encoder representations from transformers (BERT), PubMedBERT was chosen for the word embeddings. With regard to preprocessing, the nonalphanumeric characters were retained because the model’s performance decreased after the removal of these characters. To explain the outputs of our model, we added a label attention mechanism to the model architecture. The model was trained with data from each of the three hospitals separately and via federated learning. The models trained via federated learning and the models trained with local data were compared on a testing set that was composed of data from the three hospitals. The micro F1 score was used to evaluate and compare model performance.
The federated learning model outperformed each locally trained model when tested on data from the other hospitals, and its weighted average F1 score across the three hospitals’ testing sets was higher than that of any locally trained model.
Federated learning was used to train the ICD-10 classification model on multicenter clinical text while protecting data privacy. The model’s performance was better than that of models that were trained locally.
The World Health Organization published a unified classification system for diagnoses of diseases called the International Classification of Diseases (ICD). The 10th revision (ICD-10) and its Clinical Modification (ICD-10-CM) are used for diagnosis coding; each ICD-10-CM code consists of 3 to 7 alphanumeric characters, in which the leading characters identify the disease category and the remaining characters specify details such as etiology, anatomic site, and severity. For example, in the code E11.9, “E11” denotes type 2 diabetes mellitus and “.9” indicates that no complications are specified.
Structure of an ICD-10-CM code. ICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification.
In hospitals, diagnoses for each patient are first written as free-text descriptions in the electronic health record, and a coder then reads these records to assign ICD codes to the diagnoses. Because the free text is often ambiguous, coding is difficult and time-consuming; a discharge record may contain 1 to 20 codes, and one trial estimated that coders spent an average of 20 minutes assigning codes to each patient [
Recently, deep learning and natural language processing (NLP) models have been developed to turn plain text into vectors, making it possible to automatically classify them. Shi et al [
Federated learning has achieved impressive results in the medical field; it can be used to train models on multicenter data while keeping the data private. Federated learning is widely used in medical image and signal analyses, such as brain imaging analysis [
Previously, we applied a Word2Vec model with a bidirectional gated recurrent unit to classify ICD-10-CM codes from electronic medical records [
This study aims to further improve the performance of the ICD-10 classification model and enable the model’s use across hospitals. In this study, we investigated the effect of federated learning on the performance of a model that was trained on medical text requiring ICD-10 classification.
The study protocol was approved by the institutional review boards of Far Eastern Memorial Hospital (FEMH; approval number: 109086-F), National Taiwan University Hospital (NTUH; approval number: 201709015RINC), and Taipei Veterans General Hospital (VGHTPE; approval number: 2022-11-005AC), and the study adhered to the tenets of the Declaration of Helsinki. Informed consent was not applicable due to the use of deidentified data.
Our data were acquired from electronic health records at FEMH (data recorded between January 2018 and December 2020), NTUH (data recorded between January 2016 and July 2018), and VGHTPE (data recorded between January 2018 and December 2020). The data contained the text of discharge notes and ICD-10-CM codes. Coders in each hospital annotated the ground truth ICD-10 codes.
After duplicate records were removed, our data set contained 100,334, 239,592, and 283,535 discharge notes from FEMH, NTUH, and VGHTPE, respectively. Each record contained between 1 and 20 ICD-10-CM labels. The distribution of labels for each chapter is shown in
The text in the data set contained alphabetic characters, punctuation, and a few Chinese characters. The punctuation count and the top 10 Chinese characters are shown in
Counts of ICD-10-CM labels for 22 chapters from (A) Far Eastern Memorial Hospital, (B) National Taiwan University Hospital, and (C) Taipei Veterans General Hospital. ICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification.
We first removed duplicate medical records from the data set. We then transformed all full-width characters into half-width characters and all alphabetic characters into lowercase letters. Records shorter than 5 characters were removed, as these usually contained meaningless words, such as “nil” and “none.” We also removed layout characters, that is, newlines, carriage returns, horizontal tabs, and form feed characters (“\n,” “\r,” “\t,” and “\f,” respectively). Finally, all text fields were concatenated.
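A minimal sketch of this preprocessing pipeline is shown below; the study’s code is not published, so the field handling and normalization details are assumptions based on the description above.

```python
import re
import unicodedata

def preprocess_note(fields):
    """Clean and concatenate the free-text fields of one discharge note.

    `fields` is a list of strings (the record's text fields); the actual field
    names and order used in the study are not published, so this is only a sketch.
    """
    cleaned = []
    for text in fields:
        # Full-width characters -> half-width (NFKC normalization covers this).
        text = unicodedata.normalize("NFKC", text)
        # Alphabetic characters -> lowercase.
        text = text.lower()
        # Remove newlines, carriage returns, horizontal tabs, and form feeds.
        text = re.sub(r"[\n\r\t\f]", " ", text)
        cleaned.append(text.strip())
    # Concatenate all text fields of the record into one string.
    note = " ".join(cleaned)
    # Records shorter than 5 characters (e.g., "nil", "none") are dropped.
    return note if len(note) >= 5 else None
```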
To choose a method for managing punctuation and Chinese characters during the preprocessing stage, we evaluated model performance on the FEMH data, which contain both types of characters. Each experiment used 2 versions of the data: in the first version, these specific characters were retained, and in the second, they were removed. Experiment P investigated the effect of punctuation, experiment C investigated the effect of Chinese characters, and experiment PC investigated the effects of both punctuation and Chinese characters. Another method of retaining the information carried by Chinese characters is to use English translations. Therefore, we also compared the model’s performance when Chinese characters were retained to its performance when Google Translate was used to obtain English translations.
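The character-removal variants could be produced along the following lines; the exact punctuation and Chinese character sets used in the study are not specified, so the regular expressions below are assumptions.

```python
import re

# Assumed character classes; the exact sets used in the study are not published.
PUNCTUATION = re.compile(r"[^\w\s\u4e00-\u9fff]")  # punctuation and symbols
CHINESE = re.compile(r"[\u4e00-\u9fff]")           # CJK unified ideographs

def make_variant(text, remove_punctuation=False, remove_chinese=False):
    """Create the data variants compared in experiments P, C, and PC."""
    if remove_punctuation:
        text = PUNCTUATION.sub(" ", text)
    if remove_chinese:
        text = CHINESE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```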
One-hot encoding was used for the labels. Of the 69,823 available ICD-10-CM codes, 17,745 appeared in our combined data set, resulting in a one-hot encoding vector length of 17,745. The final cohort comprised 100,334, 239,592, and 283,535 records from FEMH, NTUH, and VGHTPE, respectively; 20% (FEMH: 20,067/100,334; NTUH: 47,918/239,592; VGHTPE: 56,707/283,535) of the records were randomly selected for the testing set, and the remaining records were used as the training set.
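A minimal sketch of this label encoding and data split, assuming the labels are available as one list of ICD-10-CM codes per record (variable names and the random seed are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# `notes` is a list of preprocessed note strings; `codes` is a list of lists of
# ICD-10-CM codes, one inner list per note (illustrative variable names).
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(codes)  # multi-hot matrix; 17,745 columns for this data set

# 20% of the records are held out as the testing set.
train_notes, test_notes, train_labels, test_labels = train_test_split(
    notes, labels, test_size=0.2, random_state=42  # random_state is illustrative
)
```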
We compared the performance of different variants of BERT, including PubMedBERT [
For our comparison, the text was first fed into the BERT tokenizer, which transformed strings into tokens. The token sequence was then truncated to 512 tokens to meet the models’ input length limit. A linear layer connected the word embeddings produced by the models to the output layer of one-hot–encoded multilabels; its output size was 17,745, matching the length of the one-hot encoding vector. Binary cross-entropy was used to calculate the model loss. We trained each model for 100 epochs with a learning rate of 0.00005. These models were fine-tuned for our ICD-10-CM multilabel classification task to compare their performance.
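A minimal sketch of this fine-tuning setup with the Hugging Face transformers and PyTorch libraries is shown below; the checkpoint name, optimizer, and batching details are assumptions, since only the loss function, epoch count, and learning rate are reported above.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed
NUM_LABELS = 17745

class Icd10Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = AutoModel.from_pretrained(CHECKPOINT)
        # Linear layer mapping the [CLS] embedding to the one-hot-encoded labels.
        self.classifier = nn.Linear(self.bert.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.classifier(cls_embedding)            # raw logits, one per label

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = Icd10Classifier()
loss_fn = nn.BCEWithLogitsLoss()                         # binary cross-entropy
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(texts, labels):
    # Tokenize and truncate each note to the 512-token input limit.
    batch = tokenizer(texts, truncation=True, max_length=512,
                      padding=True, return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits, labels.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```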
Summary of the vocabulary and corpus sources for the various bidirectional encoder representations from transformers (BERT) models.
Models | Vocabulary sources | Corpus sources (training data) |
PubMedBERT | PubMed | PubMed |
RoBERTaa | The BookCorpus, CC-Newsb, and OpenWebText data sets | The BookCorpus, CC-News, and OpenWebText data sets |
ClinicalBERT | English Wikipedia and the BookCorpus data set | The MIMIC-IIIc data set |
BioBERTd | English Wikipedia and the BookCorpus data set | PubMed |
aRoBERTa: Robustly Optimized BERT Pretraining Approach.
bCC-News: CommonCrawl News.
cMIMIC-III: Medical Information Mart for Intensive Care III.
dBioBERT: BERT for Biomedical Text Mining.
Model architecture and processing flowchart. CLS: class token; ICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification.
With federated learning, a model can be trained without sharing data [
Flower is an open-source federated learning framework for researchers [
Clients were set up in the three hospitals, where the model was trained on local data. The weights from each client were transferred to the server, where they were averaged to produce the global model (
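This averaging step corresponds to the standard federated averaging (FedAvg) update; the equation is not reproduced in the text, so the following is the usual formulation, with each client weighted by the size of its local training set:

$$ w^{(t+1)} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{(t+1)}, \qquad n = \sum_{k=1}^{K} n_k, $$

where $w_k^{(t+1)}$ denotes the weights returned by client (hospital) $k$ after local training in round $t$, and $n_k$ is the number of local training records.

A minimal sketch of this client-server setup with Flower is shown below; it follows the Flower 1.x NumPyClient interface, and the training and evaluation helpers, server address, and number of rounds are placeholders rather than the study’s actual configuration.

```python
import flwr as fl
import torch

# --- Client side (run at each hospital) ---------------------------------
class HospitalClient(fl.client.NumPyClient):
    """Wraps the local ICD-10 classifier for federated training."""

    def __init__(self, model, train_fn, evaluate_fn, num_examples):
        self.model = model                # the PubMedBERT-based classifier
        self.train_fn = train_fn          # trains on the hospital's local data
        self.evaluate_fn = evaluate_fn    # returns (loss, f1) on local validation data
        self.num_examples = num_examples  # size of the local training set

    def get_parameters(self, config):
        return [v.cpu().numpy() for v in self.model.state_dict().values()]

    def set_parameters(self, parameters):
        keys = self.model.state_dict().keys()
        state = {k: torch.tensor(v) for k, v in zip(keys, parameters)}
        self.model.load_state_dict(state, strict=True)

    def fit(self, parameters, config):
        self.set_parameters(parameters)   # load the current global weights
        self.train_fn(self.model)         # one round of local training
        return self.get_parameters(config), self.num_examples, {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, f1 = self.evaluate_fn(self.model)
        return float(loss), self.num_examples, {"f1": float(f1)}

# fl.client.start_numpy_client(server_address="<server>:8080",
#                              client=HospitalClient(model, train_fn, eval_fn, n))

# --- Server side ---------------------------------------------------------
# FedAvg averages the weights returned by the three hospital clients each round.
strategy = fl.server.strategy.FedAvg(min_fit_clients=3, min_available_clients=3)
# fl.server.start_server(server_address="0.0.0.0:8080",
#                        config=fl.server.ServerConfig(num_rounds=100),  # illustrative
#                        strategy=strategy)
```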
Federated learning architecture. FEMH: Far Eastern Memorial Hospital; NTUH: National Taiwan University Hospital; VGHTPE: Taipei Veterans General Hospital.
To explain the outputs of our model, we added a label attention architecture [
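Because the exact label attention formulation is cited rather than reproduced here, the sketch below shows one common variant (per-label attention over the BERT token embeddings, as used in CAML-style models); the hidden size, label count, and initialization are assumptions.

```python
import torch
from torch import nn

class LabelAttention(nn.Module):
    """Per-label attention over token embeddings (one attention vector per ICD code)."""

    def __init__(self, hidden_size=768, num_labels=17745):
        super().__init__()
        # One learnable query vector per label, used to score every token.
        self.label_queries = nn.Linear(hidden_size, num_labels, bias=False)
        # One output weight vector per label for the label-specific document vectors.
        self.output = nn.Parameter(torch.empty(num_labels, hidden_size))
        nn.init.xavier_uniform_(self.output)

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        scores = self.label_queries(token_embeddings)                 # (batch, seq_len, labels)
        scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
        attn = torch.softmax(scores, dim=1)                           # attention over tokens
        # Label-specific document representations: (batch, labels, hidden)
        label_docs = torch.einsum("bsl,bsh->blh", attn, token_embeddings)
        # One logit per label from its own document vector.
        logits = (label_docs * self.output.unsqueeze(0)).sum(dim=-1)  # (batch, labels)
        return logits, attn  # attn can be visualized to highlight input words
```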
Our model architecture with label attention. BERT: bidirectional encoder representations from transformers.
We used the micro F1 score to evaluate model performance:

$$ F_{1}^{\text{micro}} = \frac{2 \times \text{Precision}^{\text{micro}} \times \text{Recall}^{\text{micro}}}{\text{Precision}^{\text{micro}} + \text{Recall}^{\text{micro}}}, $$

where

$$ \text{Precision}^{\text{micro}} = \frac{\sum_{i} \mathrm{TP}_i}{\sum_{i} (\mathrm{TP}_i + \mathrm{FP}_i)} $$

and

$$ \text{Recall}^{\text{micro}} = \frac{\sum_{i} \mathrm{TP}_i}{\sum_{i} (\mathrm{TP}_i + \mathrm{FN}_i)}, $$

in which TP_i, FP_i, and FN_i are the numbers of true-positive, false-positive, and false-negative predictions for label i, summed over all ICD-10-CM labels.
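For reference, these micro-averaged metrics can be computed directly from the multi-hot label matrices, for example with scikit-learn (a sketch, not the study’s evaluation code):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# y_true and y_pred are binary arrays of shape (n_records, n_labels), with one
# column per ICD-10-CM code (illustrative variable names).
micro_precision = precision_score(y_true, y_pred, average="micro", zero_division=0)
micro_recall = recall_score(y_true, y_pred, average="micro", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
```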
The F1 score of PubMedBERT (0.735) was higher than that of RoBERTa (0.692), ClinicalBERT (0.711), and BioBERT (0.721), as shown in the table below.
Performance of different bidirectional encoder representations from transformers (BERT) models.
Models | F1 score | Precision | Recall |
PubMedBERT | 0.735 | 0.756 | 0.715 |
RoBERTaa | 0.692 | 0.719 | 0.666 |
ClinicalBERT | 0.711 | 0.735 | 0.689 |
BioBERTb | 0.721 | 0.754 | 0.691 |
aRoBERTa: Robustly Optimized BERT Pretraining Approach.
bBioBERT: BERT for Biomedical Text Mining.
Mean number of data tokens for retaining or removing punctuation or Chinese characters.
Experiment | Mean number of tokens |
Removed punctuation and Chinese characters (baseline) | 52.9 |
Retained punctuation | 65.0 |
Retained Chinese characters | 53.1 |
Retained punctuation and Chinese characters | 65.1 |
F1 scores for retaining or removing punctuation or Chinese characters.
Experiment | F1 score | Absolute increase (percentage) |
Removed punctuation and Chinese characters (baseline) | 0.7875 | N/Aa |
Retained punctuation | 0.8049 | 0.0174 (2.21%) |
Retained Chinese characters | 0.7984 | 0.0109 (1.38%) |
Retained punctuation and Chinese characters | 0.8120 | 0.0245 (3.11%) |
aN/A: not applicable.
In the experiment where we translated Chinese into English, the
Performance of the models that were trained in the three hospitals for ICD-10-CM classification.
Hospitals | Validation F1 score | Testing F1 score | Weighted average testing F1 score |
FEMHa | 0.7802 | 0.7412 (FEMH), 0.5116 (NTUHb), 0.1596 (VGHTPEc) | 0.4472 |
NTUH | 0.7718 | 0.5583 (FEMH), 0.7710 (NTUH), 0.1592 (VGHTPE) | 0.5353 |
VGHTPE | 0.6151 | 0.1081 (FEMH), 0.1058 (NTUH), 0.5692 (VGHTPE) | 0.2522 |
aFEMH: Far Eastern Memorial Hospital.
bNTUH: National Taiwan University Hospital.
cVGHTPE: Taipei Veterans General Hospital.
The federated learning model’s performance in the three hospitals.
Data | Validation F1 score | Testing F1 score |
FEMHb data | 0.7464 | 0.7103 |
NTUHc data | 0.6511 | 0.6135 |
VGHTPEd data | 0.5979 | 0.5536 |
aThe weighted average testing
bFEMH: Far Eastern Memorial Hospital.
cNTUH: National Taiwan University Hospital.
dVGHTPE: Taipei Veterans General Hospital.
The label attention mechanism highlighted the words in the input text that contributed most to each predicted ICD-10-CM code; an example is shown in the figure below.
Attention for an example ICD-10-CM code prediction, with the contributing input words highlighted.
The federated learning model outperformed each local model when tested on external data, and its weighted average testing F1 score was higher than that of any locally trained model.
Federated learning improves model performance on external data and can therefore be used to build an ICD coding system for use across hospitals. However, the training time required for federated learning is longer than that required for local deep learning: in our setting, federated learning took approximately 1 week, whereas local training took approximately 2 days. There are 2 reasons for this. First, the communication between the server and the clients takes longer when the model is large; the size of our model is approximately 859 MB. Second, different clients may have different computing power, and the slowest client becomes a bottleneck [
The performance of PubMedBERT was better than that of BioBERT, ClinicalBERT, and RoBERTa.
In most scenarios, nonalphanumeric characters are removed because they are considered useless to the models [
For experiments P, C, and PC, all models performed better when additional characters were retained (
In our previous study, we introduced an attention mechanism to visualize the attention given to the input text for ICD-10 definitions [
The
Our study has several limitations. First, our data were acquired from 3 tertiary hospitals in Taiwan. The extrapolation of our results to hospitals in other areas should be studied in the future. Second, although our results suggest that model performance is better when punctuation and Chinese characters are retained, this effect may be restricted to specific note types. This finding should be further examined in the context of classifying other types of clinical text. Third, the translated text in our last experiment may not be as accurate as translations by a native speaker. However, it is difficult to manually translate large amounts of data. As such, we could only automatically translate the text by using Google Translate.
It should be noted that each discharge note has a primary diagnosis code and secondary diagnosis codes. Although the choice of the primary code affects reimbursement, the model proposed in this study did not identify primary codes. To make our model capable of identifying a primary code, we proposed a sequence-to-sequence model in our previous work [
Federated learning was used to train the ICD-10 classification model on multicenter clinical text while protecting data privacy. The model’s performance was better than that of models that were trained locally. We provided explainable predictions by highlighting input words via a label attention architecture. We also found that the PubMedBERT model can exploit the meanings of punctuation and non-English characters; this finding demonstrates that changing the preprocessing method for ICD-10 multilabel classification can improve model performance.
Counts of ICD-10-CM labels from the three hospitals. (A) Ranking of counts of labels in a medical record. (B) Ranking of counts of ICD-10-CM codes. ICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification.
The punctuation count and the top 10 Chinese characters.
BERT: bidirectional encoder representations from transformers
BioBERT: Bidirectional Encoder Representations From Transformers for Biomedical Text Mining
CC-News: CommonCrawl News
FEMH: Far Eastern Memorial Hospital
gRPC: Google Remote Procedure Call
MIMIC-III: Medical Information Mart for Intensive Care III
NLP: natural language processing
NTUH: National Taiwan University Hospital
RoBERTa: Robustly Optimized Bidirectional Encoder Representations From Transformers Pretraining Approach
VGHTPE: Taipei Veterans General Hospital
This study was supported by grants from the Ministry of Science and Technology, Taiwan (grants MOST 110-2320-B-075-004-MY and MOST 110-2634-F-002-032-); Far Eastern Memorial Hospital, Taiwan (grant FEMH-2022-C-058); and Taipei Veterans General Hospital (grants V111E-002 and V111E-005-2). The sponsors had no role in the study design, data collection and analysis, publication decision, or manuscript drafting.
None declared.