This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Doctors must care for many patients simultaneously, and finding and examining every patient’s medical history is time-consuming. Discharge diagnoses provide hospital staff with enough information to manage multiple patients; however, the excessive number of words in diagnostic sentences poses a problem. Deep learning may be an effective solution to this problem, but such a heavy model may also pose an obstacle for systems with limited computing resources.
We aimed to build a diagnoses-extractive summarization model for hospital information systems and provide a service that can be operated even with limited computing resources.
We used a Bidirectional Encoder Representations from Transformers (BERT)-based structure with a two-stage training method based on 258,050 discharge diagnoses obtained from the National Taiwan University Hospital Integrated Medical Database; extractive summaries highlighted by experienced doctors served as labels. The model size was reduced by using character-level tokens, which decreased the number of parameters from 108,523,714 to 963,496, and the model was pretrained by randomly masking characters in the discharge diagnoses and International Statistical Classification of Diseases and Related Health Problems sets. We then fine-tuned the model using the summary labels and cleaned up the prediction results by averaging the probabilities of all characters within each word to prevent the fragmented words that character-level tokens can induce. Model performance was evaluated against the existing BERT, BioBERT, and Long Short-Term Memory (LSTM) models using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) L score, and a questionnaire website was built to collect feedback from more doctors on each summary proposal.
The area under the receiver operating characteristic curve values of the summary proposals were 0.928, 0.941, 0.899, and 0.947 for BERT, BioBERT, LSTM, and the proposed model (AlphaBERT), respectively. The ROUGE-L scores were 0.697, 0.711, 0.648, and 0.693 for BERT, BioBERT, LSTM, and AlphaBERT, respectively. The mean (SD) critique scores from doctors were 2.232 (0.832), 2.134 (0.877), 2.207 (0.844), 1.927 (0.910), and 2.126 (0.874) for reference-by-doctor labels, BERT, BioBERT, LSTM, and AlphaBERT, respectively. Based on the paired t test, there was a statistically significant difference between the LSTM and the reference (P<.001).
Use of character-level tokens in a BERT model can greatly decrease the model size without significantly reducing performance for diagnoses summarization. A well-developed deep-learning model will enhance doctors’ abilities to manage patients and promote medical studies by providing the capability to use extensive unstructured free-text notes.
Medical centers are the last line of defense for public health and are responsible for educating medical talent. The number of patients in the emergency department of such medical centers is particularly large, and these patients tend to have more severe conditions than those admitted to lower-tier hospitals. For staff, the emergency department can be an overloaded work environment [
There are several available methods to accomplish a text summarization task, ranging from traditional natural language processing (NLP) to deep-learning language models [
Although EHRs provide useful information, the majority of this information is recorded as free text, making it challenging to analyze along with other structured data [
The Transformer is a state-of-the-art model that was introduced for translation tasks and improves on the efficiency of Long Short-Term Memory (LSTM) [
The automatic text summarization task has two branches: extractive and abstractive [
The extractive summarization model can be simplified to a regression problem that outputs, for each token, the probability of selecting it or not. Taking a single character as the token unit, this problem is similar to the segmentation problem in computer vision [
BERT is a state-of-the-art language model for many NLP tasks that is pretrained with unsupervised learning, including “masked language modeling” and “next-sentence prediction.” BERT is pretrained on several corpus datasets and then transferred to supervised learning tasks [
Because English is not the native language in Taiwan, free-text medical records contain various typos and spelling errors. Use of the word-level method [
In EHRs, medical terms, abbreviations, dates, and some treatment counts are rarely found in general corpus datasets, and this results in poor model performance. BioBERT, which is based on the BERT model and uses the same tokenizer, was obtained through advanced training on a biomedical corpus [
Our goal was to build a diagnoses-extractive summarization model that performs well while running on the limited computing resources of hospital information systems. Therefore, we present AlphaBERT, a BERT-based model that uses the English alphabet (character level) as the token unit. We compared the performance and parameter count of AlphaBERT with those of the other existing models described above.
A dataset of 258,050 discharge diagnoses was obtained from the National Taiwan University Hospital Integrated Medical Database (NTUH-iMD). The discharge diagnoses originated from the following departments (in descending order): surgery, internal medicine, obstetrics and gynecology, pediatrics, oncology, orthopedic surgery, urology, otolaryngology, ophthalmology, traumatology, dentistry, neurology, family medicine, psychiatry, physical medicine and rehabilitation, dermatology, emergency medicine, and geriatrics and gerontology. This study was approved by Research Ethics Committee B, National Taiwan University Hospital (201710066RINB).
In the pretraining stage, 71,704 diagnoses collected from the International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD-10) [
Pretrained validation dataset. ICD: International Statistical Classification of Diseases and Related Health Problems.
The hardware used for implementation was an Intel Core i7-5960X CPU with 60 GB of RAM and two Nvidia GTX 1080 Ti GPUs. The software used was Ubuntu 18.04 [
We created a diagnosis-label tool that prints a discharge diagnosis from the dataset in a textbox. Doctors highlighted the discharge diagnoses by selecting the words they considered most relevant, and the tool recorded the highlighted character positions, which were labeled 1 while all others were labeled 0. For example, “1.Bladder cancer with” was labeled “001111111111111110000” and stored in the label dataset. We encouraged doctors to skip short diagnoses because the summarization service is more useful for longer diagnoses. Therefore, only longer diagnoses were labeled and collected in the fine-tuning set.
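As an illustration of this labeling scheme, the following minimal Python sketch (the function name is ours, not the actual tool’s) converts a highlighted span into a character-level 0/1 label:

```python
def highlight_to_label(diagnosis: str, start: int, end: int) -> str:
    """Mark characters in [start, end) as 1 and all others as 0."""
    return "".join(
        "1" if start <= i < end else "0" for i in range(len(diagnosis))
    )

# Example from the text: highlighting "Bladder cancer " in
# "1.Bladder cancer with" yields "001111111111111110000".
print(highlight_to_label("1.Bladder cancer with", 2, 17))
```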
In this study, the pretraining dataset was smaller than the datasets used for the pretrained models of BERT and its extensions [
Because there was also a significant shortage of fine-tuning data, the same data augmentation strategy was used to extend the fine-tuning dataset. To provide greater tolerance for typos, we also randomly replaced 0.1% of the characters in the diagnoses during the fine-tuning stage.
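A minimal sketch of these two augmentation steps is shown below. The stitching of a random number of diagnoses and the 0.1% character replacement follow the description above, while the function names, the stitch count, and the 1350-character cap are illustrative assumptions:

```python
import random

def stitch(diagnoses, labels, max_n=5, max_len=1350):
    """Concatenate a random number of (diagnosis, label) pairs."""
    n = random.randint(1, max_n)
    idx = random.sample(range(len(diagnoses)), n)
    text = " ".join(diagnoses[i] for i in idx)[:max_len]
    label = "0".join(labels[i] for i in idx)[:max_len]  # joining space -> label 0
    return text, label

def perturb(text, rate=0.001, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Randomly replace about 0.1% of letters to build typo tolerance."""
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < rate:
            chars[i] = random.choice(alphabet)
    return "".join(chars)
```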
We retained only 100 symbols, including letters, numbers, and some punctuation. All free-text diagnoses were preprocessed by filters, and symbols outside of the reserved list were replaced with spaces. Original letter cases (uppercase and lowercase) were retained for analysis.
The preprocessing of diagnoses then converted the symbols (letters, numbers, and punctuation) into numbers with a one-to-one correspondence. For example, “1.Bladder cancer with” was converted to the array “14, 11, 31, 68, 57, 60, 60, 61, 74, 0, 59, 57, 70, 59, 61, 74, 0, 79, 65, 76, 64.”
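The following sketch illustrates this preprocessing under an assumed reserved-symbol list; the actual 100-symbol list and the resulting integer IDs (such as those in the example above) depend on the authors’ own symbol ordering, which we do not reproduce here:

```python
import string

# Assumed reserved list; the authors kept 100 symbols (letters, digits,
# and some punctuation), but the exact list and ordering are theirs.
RESERVED = " " + string.digits + string.punctuation + \
           string.ascii_uppercase + string.ascii_lowercase

CHAR2ID = {c: i for i, c in enumerate(RESERVED)}

def encode(diagnosis):
    """Replace out-of-list symbols with spaces, then map characters to IDs."""
    cleaned = "".join(c if c in CHAR2ID else " " for c in diagnosis)
    return [CHAR2ID[c] for c in cleaned]

print(encode("1.Bladder cancer with"))  # IDs differ from the paper's example
```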
The architecture of AlphaBERT follows that of BERT, and our implementation is based on the PyTorch adaptation released by the HuggingFace team [
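As a rough sense of scale, the sketch below builds a comparably small character-level BERT in the HuggingFace transformers style. The hidden size, depth, head count, and intermediate size are our assumptions, chosen only so that a ~100-symbol vocabulary and 1350-character inputs land near the reported 963,496 parameters; the authors’ exact hyperparameters may differ.

```python
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=100,                # character-level vocabulary
    hidden_size=128,               # assumption
    num_hidden_layers=4,           # assumption
    num_attention_heads=4,         # assumption
    intermediate_size=512,         # assumption
    max_position_embeddings=1350,  # maximum character length used here
)
model = BertModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 1 million parameters
```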
Deep-learning model architecture.
The two-stage learning approach of BERT [
Because the free-text diagnoses contained dates, chemotherapy cycles, cancer staging indexes, and punctuation marks, these tokens are unpredictable, nongeneric, and change from record to record. Even experienced doctors cannot recover hidden dates or cycle numbers without prompts; therefore, letters were replaced with other letters, numbers with other numbers, and punctuation marks with other punctuation marks (and positions were still randomly selected to be masked with “^”).
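A sketch of this type-preserving corruption is given below; the replacement and masking rates, and the decision of which positions receive the “^” mask, are illustrative assumptions:

```python
import random
import string

def corrupt(text, replace_rate=0.15, mask_rate=0.5):
    """Replace characters within their own class; mask some with '^'."""
    pools = {"alpha": string.ascii_letters,
             "digit": string.digits,
             "punct": string.punctuation}
    out = []
    for c in text:
        kind = ("alpha" if c.isalpha() else
                "digit" if c.isdigit() else
                "punct" if c in string.punctuation else None)
        if kind is not None and random.random() < replace_rate:
            # Letters become other letters, digits other digits, etc.
            c = random.choice(pools[kind])
            if random.random() < mask_rate:
                c = "^"  # some corrupted positions are masked outright
        out.append(c)
    return "".join(out)
```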
In the masked language model used in this study, the BERT model was connected to a fully connected network decoder:

$$\hat{y}_i = \mathrm{softmax}(W_d h_i + b_d),$$

where $h_i$ is the hidden state output by the BERT encoder for character $i$, and $W_d$ and $b_d$ are the weight matrix and bias of the decoder.
Another fully connected network, $\hat{p}_i = \sigma(W_s h_i + b_s)$, was used in the fine-tuning stage to output the probability of selecting each character, where $\sigma$ denotes the sigmoid function and $W_s$ and $b_s$ are the weight matrix and bias of the summarization head.
When we evaluated our model, the probability of each word was represented by the mean probability of the characters in the word. In this method, we split the character list into words $w_1, \dots, w_m$ and computed

$$P(w_j) = \frac{1}{|w_j|} \sum_{i \in w_j} \hat{p}_i,$$

where $\hat{p}_i$ is the predicted probability of character $i$ and $|w_j|$ is the number of characters in word $w_j$.
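In code, this cleanup amounts to averaging the character probabilities within each whitespace-delimited word and keeping or dropping words as wholes; the sketch below (with an assumed 0.5 threshold) illustrates the idea:

```python
def word_probabilities(text, char_probs):
    """Average per-character probabilities over whitespace-split words."""
    results, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        results.append((word, sum(char_probs[start:end]) / len(word)))
        pos = end
    return results

def extract_summary(text, char_probs, threshold=0.5):
    """Keep whole words whose mean character probability clears the bar."""
    return " ".join(w for w, p in word_probabilities(text, char_probs)
                    if p >= threshold)
```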
We also compared the state-of-the-art models, adjusted to fit the target task; these models were not designed for summarization, and no well-established fine-tuned model is available for this purpose. Based on the word-pieces BPE method [, the probability of each word was analogously computed as the mean probability of its subword tokens:

$$P(w_j) = \frac{1}{|T_j|} \sum_{t \in T_j} \hat{p}_t,$$

where $T_j$ is the set of BPE tokens that compose word $w_j$.
We also used the LSTM model [
We used Adam optimization [
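A minimal fine-tuning step consistent with this setup might look as follows, continuing the configuration sketch above; the learning rate, the linear head, and the use of BCEWithLogitsLoss are our assumptions rather than the authors’ reported settings:

```python
import torch

head = torch.nn.Linear(config.hidden_size, 1)  # per-character score
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(head.parameters()), lr=1e-4
)
loss_fn = torch.nn.BCEWithLogitsLoss()  # applies the sigmoid internally

def fine_tune_step(input_ids, labels):
    """One Adam step on a batch of encoded diagnoses and 0/1 labels."""
    hidden = model(input_ids).last_hidden_state  # (batch, length, hidden)
    logits = head(hidden).squeeze(-1)            # (batch, length)
    loss = loss_fn(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```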
We measured the performance of the various models using the ROC curve, an area under the ROC curve (AUROC) analysis, and ROUGE F1 scores.
Data were analyzed using the statistical package RStudio (version 1.2.5019) based on R (version 3.6.1; R Foundation for Statistical Computing, Vienna, Austria). For group comparisons, we performed pairwise paired t tests.
The discharge diagnoses dataset included 57,960 lowercase English words. The maximum number of words in a diagnosis was 654 (3654 characters), with a mean of 55 (SD 51) words corresponding to 355 (SD 318) characters. In the fine-tuning dataset, the mean number of words in the diagnoses and summary were 78 (SD 56) and 12 (SD 7), respectively. The retention ratio [
Our proposed model, AlphaBERT, demonstrated the highest performance among all compared models with an area under the ROC curve (AUROC) of 0.947, and the LSTM demonstrated the worst performance with an AUROC of 0.899 (
Model receiver operating characteristic (ROC) curves.
BioBERT achieved the highest ROUGE scores (
We collected 246 critique scores from the 14 doctors who responded to the questionnaire. Statistically significant differences (based on the paired t test) were observed between the LSTM and the reference labels, as well as between the LSTM and each of the other models.
We built the service on a website [
Model parameters and ROUGEa F1 results.
| Model | Metric | Dr A (n=250) | Dr B (n=248) | Dr C (n=91) | Mean |
|---|---|---|---|---|---|
| BERTb | ROUGE-1c | 0.761 | 0.693 | 0.648 | 0.715 |
| | ROUGE-2d | 0.612 | 0.513 | 0.473 | 0.549 |
| | ROUGE-Le | 0.748 | 0.671 | 0.627 | 0.697 |
| BioBERTf | ROUGE-1 | 0.788 | 0.697 | 0.647 | 0.728 |
| | ROUGE-2 | 0.642 | 0.523 | 0.464 | 0.565 |
| | ROUGE-L | 0.773 | 0.678 | 0.629 | 0.711 |
| LSTMg | ROUGE-1 | 0.701 | 0.647 | 0.618 | 0.666 |
| | ROUGE-2 | 0.531 | 0.468 | 0.459 | 0.494 |
| | ROUGE-L | 0.684 | 0.629 | 0.602 | 0.648 |
| Proposed model (AlphaBERT) | ROUGE-1 | 0.769 | 0.678 | 0.647 | 0.712 |
| | ROUGE-2 | 0.610 | 0.482 | 0.463 | 0.533 |
| | ROUGE-L | 0.751 | 0.656 | 0.632 | 0.693 |
aROUGE: Recall-Oriented Understudy for Gisting Evaluation.
bBERT: Bidirectional Encoder Representations from Transformers.
cROUGE-1: Recall-Oriented Understudy for Gisting Evaluation with unigram overlap.
dROUGE-2: Recall-Oriented Understudy for Gisting Evaluation with bigram overlap.
eROUGE-L: Recall-Oriented Understudy for Gisting Evaluation for the longest common subsequence. The n values represent the number of reference labels.
fBioBERT: Bidirectional Encoder Representations from Transformers trained on a biomedical corpus.
gLSTM: Long Short-Term Memory.
ROUGEa F1 results of diagnoses with incorrect words.
| ROUGE-Lb | BERTc | BioBERTd | LSTMe | Proposed model |
|---|---|---|---|---|
| Diagnoses without error words (n=451)f | 0.704 | 0.717 | 0.651 | 0.698 |
| Diagnoses with incorrect words (n=138) | 0.676 | 0.692 | 0.640 | 0.674 |
aROUGE: Recall-Oriented Understudy for Gisting Evaluation.
bROUGE-L: ROUGE for the longest common subsequence.
cBERT: Bidirectional Encoder Representations from Transformers.
dBioBERT: Bidirectional Encoder Representations from Transformers trained on a biomedical corpus.
eLSTM: Long Short-Term Memory.
fn represents the number of reference labels.
Critique scores of models from doctors (N=246).
| Model | Score, mean (SD) | BERTa | BioBERTb | LSTMc | Proposed model |
|---|---|---|---|---|---|
| Reference | 2.232 (0.832) | .11 | .66 | <.001 | .10 |
| BERT | 2.134 (0.877) | | .10 | .001 | .89 |
| BioBERT | 2.207 (0.844) | | | <.001 | .19 |
| LSTM | 1.927 (0.910) | | | | .002 |
| Proposed | 2.126 (0.874) | | | | |

The values in the last four columns are P values from pairwise paired t tests between the row and column models.
aBERT: Bidirectional Encoder Representations from Transformers.
bBioBERT: Bidirectional Encoder Representations from Transformers trained on a biomedical corpus.
cLSTM: Long Short-Term Memory.
AlphaBERT effectively performed the extractive summarization task on medical clinical notes and is much smaller than BERT, with the character-level tokenizer reducing the number of parameters from 108,523,714 to 963,496. AlphaBERT showed performance similar to that of BERT and BioBERT on this extractive summarization task. Despite their size, BERT and BioBERT proved to be excellent models, well suited to several tasks (including the primary task of this study) with small adjustments. These pretrained models can therefore be used in a straightforward manner to rapidly build new applications in the medical field. Because of the well-pretrained NLP feature extraction model, a small labeled dataset (the fine-tuning training set includes only 2530 cases) is sufficient for supervised learning to achieve the goal.
In this study, we obtained high ROUGE
The ICD-10 is a well-classified system with more than 70,000 codes, but it is often too simple to fully capture the complex context of a patient’s record. The treatments during a patient’s previous hospitalization are also important to consider, and these are often recorded as a free-text diagnosis when the patient revisits a hospital in critical condition. For example, if a patient has cancer, the previous chemotherapy course is important information when the patient is seriously ill in the emergency department. Furthermore, it is difficult for doctors to accurately find the correct codes; thus, simply obtaining the ICD-10 code from the EHR is insufficient to represent a patient’s condition. However, the ICD-10 codes can be used to extend the pretraining set by random stitching.
Combining a random number of diagnoses not only extends the training dataset but also improves the performance of the model. The average number of characters in a diagnosis was 355, but the variation was large (SD 318). Without augmentation, the position embeddings and self-attention heads were trained mostly on the front of the sequence and performed worse toward the back. Augmentation combines several diagnoses to lengthen the input embeddings, which trains the self-attention heads to consider all 1350 characters equally.
In the prediction phase, we obtained the probability of each character. Since a word is split into a sequence of characters, the raw result can be fragmented: only some characters in a word are selected by the prediction, which yields nonsense phrases and poor results. Accordingly, we proposed a cleanup method that selects an entire word based on the probabilities of all of the characters in the word. This concept is derived from the segmentation task in computer vision, in which each pixel is classified independently and predictions may therefore be noncontiguous. In the field of computer vision, contour-based superpixels are chosen, and all superpixels are selected by a majority vote [
Since the summarization task is subjective, properly evaluating model performance is an important consideration. The lack of adequate medical labels is a key issue because labels from qualified physicians are rare and difficult to collect. Although the ROUGE score [
Owing to the lack of doctors capable of labeling the reference summaries, all of the models evaluated in this study were fine-tuned only on Doctor A’s labels. We could shuffle and randomly split the three doctors’ labels into training, validation, and testing sets, but we did not have reference labels from other doctors to confirm whether individual variation exists. Even if all three doctors’ labels had been used, the same problem would arise when gathering labels from yet another doctor.
To assess the differences between doctors, the models were fine-tuned using only one doctor’s labels, with the other doctors’ labels used as test sets. The results revealed a difference according to the ROUGE scores (
To evaluate the models and the reference labels more objectively, we established a website where doctors could easily critique both the reference labels and the predictions from the models. We used a double-blind method to collect scores: the system randomly chose a diagnosis and displayed the corresponding summary proposals in random order, so the reviewer was blinded to the method used for each prediction. This analysis yielded results similar to the ROUGE scores. Moreover, the LSTM was consistently the lowest-performing model, whereas the manually labeled references achieved the highest average score, followed by BioBERT.
Although AlphaBERT’s performance was not the best, there was no statistically significant difference between the performances of BERT, BioBERT, and AlphaBERT. The advantage of AlphaBERT is its character-level prediction probability and its one-to-one correspondence with the original document. The predicted keywords can be highlighted directly on the original document and easily edited by users. For example, although AlphaBERT’s predicted proposal had a ROUGE-L score of 0.701, it recognized the important words, which is perhaps more informative than the doctor’s reference label (
Illustration of the performance of AlphaBERT.
Due to the subjective nature of the text summarization task, the predicted summary may omit some relevant information. The proposed model helps hospital staff to quickly view information for a large number of patients at the beginning of a shift; however, staff will still need to read all of the collected information in the EHRs during ward rounds.
Typos and misspellings remain a problem in NLP. However, the character-level and word-pieces BPE methods not only reduce the vocabulary size but also handle typos well enough to maintain noninferior results (
This was a pilot study of deep learning–based medical text summarization. In the near future, we plan to establish a website that offers this service and allows users to edit the suggestions and provide feedback, thereby collecting volunteer labels and addressing individual variability.
AlphaBERT, using character-level tokens in a BERT-based model, can greatly decrease model size without significantly reducing performance on text summarization tasks. The proposed model provides a way to further mine the unstructured free-text portions of EHRs and thereby obtain an abundance of health data. As we enter the artificial intelligence era, NLP deep-learning models are developing rapidly. With our model, all medical free-text data can be transformed into meaningful embeddings, which will enhance medical studies and strengthen doctors’ capabilities.
Input embedding.
Flowchart to determine the hyperparameters and measure the model’s performance.
Error statistics (strong and weak).
Error statistics (typos, misspellings, or incorrect words).
AUROC: area under the receiver operating characteristic curve
BERT: Bidirectional Encoder Representations from Transformers
BPE: byte-pair encoding
EHR: electronic health record
ICD-10: International Statistical Classification of Diseases and Related Health Problems 10th Revision
LSTM: long short-term memory
NLP: natural language processing
NTUH-iMD: National Taiwan University Hospital Integrated Medical Database
ROC: receiver operating characteristic
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
We would like to thank the Ministry of Science and Technology, Taiwan, for financially supporting this research (grant MOST 109-2634-F-002-029). We would also like to thank Yun-Nung Chen for providing useful comments on this work and Hugging Face for providing several excellent deep-learning codebases. We are grateful to GitHub for providing the code repository used for AlphaBERT.
None declared.