This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Natural Language Understanding enables automatic extraction of relevant information from clinical text data, which are acquired every day in hospitals. In 2018, the language model Bidirectional Encoder Representations from Transformers (BERT) was introduced, generating new state-of-the-art results on several downstream tasks. The National NLP Clinical Challenges (n2c2) is an initiative that strives to tackle such downstream tasks on domain-specific clinical data. In this paper, we present the results of our participation in the 2019 n2c2 and related work completed thereafter.
The objective of this study was to optimally leverage BERT for the task of assessing the semantic textual similarity of clinical text data.
We used BERT as an initial baseline and analyzed its results. Using these results as a starting point, we developed 3 different approaches in which we (1) added additional, handcrafted sentence similarity features to the classifier token of BERT and combined the results with more features in multiple regression estimators, (2) incorporated a built-in ensembling method, M-Heads, into BERT, and (3) developed a graph-based similarity approach for medication sentences that extrapolates the knowledge of the training set to unseen sentence pairs.
We improved the performance of BERT on the test dataset from a Pearson correlation coefficient of 0.859 to 0.883 using a combination of the M-Heads method and the graph-based similarity approach. We also show differences between the test and training dataset and how the two datasets influenced the results.
We found that using a graph-based similarity approach has the potential to extrapolate domain-specific knowledge to unseen sentences. We also observed that results on the test dataset can easily be deceptive, especially when the distribution of the data samples differs between the training and test datasets.
Every day, hospitals acquire large amounts of textual data which contain valuable information for medical decision processes, research projects, and many other medical applications [
In the clinical domain, Semantic Textual Similarity has the potential to ease clinical decision processes (eg, by highlighting crucial text snippets in a report), query databases for similar reports, assess the quality of reports, or be used in question answering applications [
State-of-the-art Natural Language Processing (NLP) methods for assessing the Semantic Textual Similarity of nonclinical data are developed and benchmarked based on the Semantic Textual Similarity benchmark, which comprises the SemEval Semantic Textual Similarity tasks from 2012 to 2017 [
The winners of Track 1: ‘n2c2/OHNLP 2018 Track on Clinical Semantic Textual Similarity’ [
In recent years, the general Natural Language Processing domain took a major step forward with the breakthrough of transfer learning, which allows semantic knowledge to be leveraged from huge amounts of unlabeled text data. That is, a model can be pretrained on enormous amounts of unlabeled text with multiple unsupervised tasks. The trained model captures a universal language representation and can be effectively fine-tuned on different downstream tasks. For example, the language model Bidirectional Encoder Representations from Transformers (BERT), introduced in 2018, is a multilayer bidirectional Transformer trained on a massive amount of text with two unsupervised tasks: (1) next sentence prediction and (2) masked word prediction. To use the model for further downstream tasks, it is usually enough to add a linear layer on top of the pretrained model to achieve state-of-the-art performance for the desired downstream tasks [
The application of pretrained models like BERT to clinical data raises the question of whether the model can handle domain-specific nuances. One proposed approach to handling domain-specific nuances is to use transfer learning to adapt the model to clinical data [
To summarize our contributions (see
a simple modification of the BERT architecture by adding additional similarity features and employing a built-in ensembling method.
a graph-based similarity approach for a subset of structured sentences in which the knowledge of the training set is extrapolated to unseen sentence pairs of the test set.
Additionally, we show that statistically analyzing the data reveals differences between the training and test datasets. This analysis made the process of interpreting the results easier.
The code to reproduce the results of this paper is available online [
Overview of our pipeline for the different approaches. Blue boxes denote feature sets (see
Our methods were developed and tested on the data of the ClinicalSTS shared task, which consists of a collection of electronic health records from the Mayo Clinic’s dataset [
We started by applying ClinicalBERT [
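As an illustration of this baseline step, the following sketch shows how a pretrained clinical BERT checkpoint can be fine-tuned for sentence-pair score regression with the Hugging Face transformers library. The checkpoint name, the example sentences, and the training details are placeholders and not necessarily those used in this work.

```python
# Minimal sketch (not the exact training code used in this work) of fine-tuning a
# clinical BERT checkpoint for sentence-pair regression with Hugging Face transformers.
# The checkpoint name is an assumption; any BERT-style clinical model could be used.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1  # single output neuron -> regression on the similarity score
)

# One hypothetical sentence pair with a ClinicalSTS-style score in [0, 5].
sent_a = "ibuprofen 600 mg tablet 1 tablet by mouth three times a day"
sent_b = "acetaminophen 500 mg tablet 2 tablets by mouth every 6 hours"
encoding = tokenizer(sent_a, sent_b, truncation=True, padding=True, return_tensors="pt")
labels = torch.tensor([1.5])

# With num_labels=1, the model computes a mean squared error loss internally.
outputs = model(**encoding, labels=labels)
outputs.loss.backward()  # a full training loop would follow in practice
```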
In the following section, we describe the approaches shown in our pipeline (
Clustering of the sentences to reveal BERT’s weaknesses. Each point represents a sentence pair from the training set, and the corresponding absolute score difference is visualized as opacity, or, in other words, the more opaque a point is, the higher the deviation from the ground truth. The points are the t-SNE projected InferSent embeddings of all sentence pairs. For each cluster, the average absolute deviation from the ground truth as well as the distribution of the differences is shown in the legend. Best viewed in colour. BERT: Bidirectional Encoder Representations from Transformers. t-SNE: t-distributed stochastic neighbor embedding.
Box plot showing the absolute score differences for each cluster emphasizing the opacity information from Figure 2. The number below the bold cluster index is the cluster size. For each box plot, the following information is depicted: the box ranges from the lower to the upper quartiles with the notch at the median position. The whiskers extend up to 1.5 times the interquartile range. Remaining points (outliers) are not shown. The white square denotes the mean value.
The motivation behind this approach is to enhance BERT with additional information that BERT might not be able to capture in its model. On a token level, BERT uses a predefined tokenizer based on a set of rules; however, it might be valuable to compare arbitrary tokens based on character n-grams. On a sentence level, BERT does have a classifier token, [CLS], to compare two sentences. However, the [CLS] token was not designed to be a sentence embedding [
In this approach, we used two kinds of similarity measures: (1) token-based and (2) sentence embedding–based. For a token-based similarity measure, character n-grams are created and then compared with each other. For example, the Jaccard similarity compares the size of the intersection with the size of the union of the n-gram sets of two input sentences. For a sentence embedding–based similarity measure, the embeddings of two sentences are compared, for example, by taking the cosine similarity between the embeddings of the two input sentences. The similarity measures were inspired by Chen et al [
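To make the two feature types concrete, here is a small, illustrative sketch; the function names, the n-gram size, and the embedding source are our choices and not necessarily those used in the original feature sets.

```python
# Illustrative sketch of the two feature types described above.
import numpy as np

def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of a sentence."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_similarity(sent_a: str, sent_b: str, n: int = 3) -> float:
    """Token-based measure: overlap of character n-gram sets."""
    a, b = char_ngrams(sent_a, n), char_ngrams(sent_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Embedding-based measure: cosine between two sentence embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Example usage; the embeddings could come from InferSent or any other sentence encoder.
features = [
    jaccard_similarity("aspirin 81 mg tablet", "aspirin 81 mg capsule"),
    # cosine_similarity(encoder.encode(sent_a), encoder.encode(sent_b)),
]
```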
We combined BERT with two feature sets of similarity measures at two different positions in our pipeline (
Ensembling methods have been very popular in recent machine learning challenges [
We took up this point and decided to include a simple ensembling method directly in the architecture of BERT. More concretely, we duplicated the final linear layer (the head), which receives the last [CLS] token from the BERT model and is responsible for calculating the regression (score prediction). We initialized each head layer with different weights to allow different solutions per head. We employed a loss scaling that enforces specialization of the different heads, similar to methods seen in other research [
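A minimal PyTorch sketch of this idea follows. It assumes the [CLS] representation from the last hidden layer as input; the particular loss-scaling scheme shown (a softmax weighting that favors the currently best head) is only one possible choice, and the exact formulation used in this work is given in the multimedia appendix on M-Heads.

```python
# Minimal PyTorch sketch of the M-Heads idea: several regression heads on top of the
# [CLS] representation, each initialized differently. The loss scaling shown here is
# an assumption, not necessarily the scheme used in the paper.
import torch
import torch.nn as nn

class MHeads(nn.Module):
    def __init__(self, hidden_size: int = 768, num_heads: int = 5):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(num_heads)])
        for head in self.heads:  # different initialization per head
            nn.init.normal_(head.weight, std=0.02)
            nn.init.zeros_(head.bias)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: (batch, hidden_size) -> (batch, num_heads) score predictions
        return torch.cat([head(cls_embedding) for head in self.heads], dim=-1)

def m_heads_loss(predictions: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Scale per-head losses so that heads can specialize on different samples."""
    per_head = (predictions - target.unsqueeze(-1)) ** 2   # (batch, num_heads)
    weights = torch.softmax(-per_head.detach(), dim=-1)    # best head receives the largest weight
    return (weights * per_head).sum(dim=-1).mean()
```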
In this approach, we focused on a subset of the sentence pairs which we named “medication sentences,” for example “ibuprofen 150 mg tablet 2 tablets by mouth every 7 hours as needed.” Further examples are listed in the discussion. These sentences are fairly structured and can be compared by analyzing individual entities. We used the MedEx-UIMA system [
Our general idea was to infer the similarity of unknown active agent pairs from the similarities of known active agent pairs. That is, we assumed that the similarities between active agents A and B as well as between B and C also contain information about the similarity between A and C. We generalized this process by constructing a graph containing all active agents as nodes, with the corresponding similarities assigned to the edges, and used the shortest path between arbitrary active agents as a foundation to predict a similarity score, which could then be further modified by the remaining entities (ie, every entity except the active agent).
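The following sketch, built with the networkx library, illustrates this idea on a toy graph. The edge values, the similarity-to-cost conversion, and the aggregation along the path are illustrative assumptions; the actual aggregation in this work involves a resistance-style calculation described in the steps below.

```python
# Conceptual sketch (not the paper's exact formulation) of the medication graph:
# nodes are active agents, edges carry a similarity learned from the training set,
# and unseen agent pairs are scored via the shortest path between their nodes.
import networkx as nx

graph = nx.Graph()
# Hypothetical edge similarities on the ClinicalSTS 0-5 scale.
graph.add_edge("ibuprofen", "naproxen", similarity=3.5)
graph.add_edge("naproxen", "acetaminophen", similarity=2.5)
graph.add_edge("acetaminophen", "ondansetron", similarity=1.0)

# Use (5 - similarity) as an edge cost so that similar agents are "close".
for _, _, data in graph.edges(data=True):
    data["cost"] = 5.0 - data["similarity"]

def agent_similarity(agent_a: str, agent_b: str) -> float:
    """Predict a similarity for an agent pair that has no direct edge."""
    if graph.has_edge(agent_a, agent_b):
        return graph[agent_a][agent_b]["similarity"]
    path = nx.shortest_path(graph, agent_a, agent_b, weight="cost")
    sims = [graph[u][v]["similarity"] for u, v in zip(path, path[1:])]
    return min(sims)  # illustrative aggregation: a chain is at most as similar as its weakest link

print(agent_similarity("ibuprofen", "acetaminophen"))  # -> 2.5 in this toy graph
```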
In the following section, we describe how we dealt with the remaining entities, how we constructed the graph of all active agents, and how we used this information to predict similarity scores for new sentence pairs.
Even though we considered the active agents as the central part regarding sentence similarity, we still did not want to neglect other influences and, hence, we constructed a set of additional features per sentence pair, which reflect the similarity of everything except the active agents. More concretely, we constructed a set of similarity features Δk and compared the entity value of the first sentence
and for ratio-scaled entity types, we used the squared difference
For entities like “strength” (eg, “4 mg”), we first separated the unit (“mg”) from the number (“4”), used the nominal approach to compare the unit, and applied the squared difference equation on the number part. This differentiation gives us k=1 , … ,
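A short sketch of these entity-level features follows, under the assumption that nominal entity types are compared with an equality indicator (the exact comparison is truncated above) and that ratio-scaled types use the squared difference as stated.

```python
# Sketch of the per-entity similarity features; the exact feature definitions and the
# handling of missing entities follow the paper only loosely.
import re

def nominal_feature(value_a: str, value_b: str) -> float:
    """Nominal entity types (eg, form or route): 1 if equal, 0 otherwise (assumed indicator)."""
    return 1.0 if value_a == value_b else 0.0

def ratio_feature(value_a: float, value_b: float) -> float:
    """Ratio-scaled entity types: squared difference of the numeric values."""
    return (value_a - value_b) ** 2

def strength_features(strength_a: str, strength_b: str) -> tuple:
    """Split a strength like '4 mg' into number and unit and compare both parts."""
    num_a, unit_a = re.match(r"([\d.]+)\s*(\w+)", strength_a).groups()
    num_b, unit_b = re.match(r"([\d.]+)\s*(\w+)", strength_b).groups()
    return ratio_feature(float(num_a), float(num_b)), nominal_feature(unit_a, unit_b)

print(strength_features("4 mg", "500 mg"))  # -> (246016.0, 1.0)
```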
We used all medication sentences S = (
which models the modified similarity score
The intuition here is that the weights,
The goal of the inference phase was to calculate a sentence similarity score,
Step 1: In its simplest form, the similarity between two active agents is just the weight of the edge between the two corresponding active agent nodes. For example,
where
with the final resistance,
Step 2: The weight,
to retrieve a similarity,
The sentences may contain additional information that we do not cover in our approach, such as additional words, the relations between words, etc. For this reason, we combined the similarity score,
Excerpt of the medication graph, which models similarities between active agent pairs. The modified similarity score is shown on the edges. The full graph is available as an online widget, which provides further information and shows the graph calculations between arbitrary active agent nodes.
The parameters λ
For a more stable evaluation, we split the training data into 10 folds, built a graph based on each training split, and evaluated the graph performance on the corresponding test split. For evaluation, we calculated the mean squared error between the prediction scores and the ground truth. We did not use the Pearson correlation coefficient here because a high correlation on a subset does not necessarily carry over to the complete dataset, whereas the mean squared error directly enforces closeness to the ground truth.
Let λ = (λ0, λ1, …) denote the vector of graph weights. In each update step, we perturbed λ via a sample from a standard normal distribution so that we obtained a new weight vector λ', which we evaluated again on the graphs from all folds, keeping the change if MSE(λ') < MSE(λ).
We repeated this process for two iterations, alternating with hyperparameter tuning of the SVR model, until we observed no further improvements. For each random walk process, we applied 50 update steps. During development, we found that this setting was sufficient and that the resulting weights tended to remain unchanged after these updates. For the SVR model, we applied a grid search to find values for the hyperparameters
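A compact sketch of this random-walk search is shown below; evaluate_mse is a placeholder standing in for the function that rebuilds the medication graph for all 10 folds with the candidate weights and returns their mean squared error.

```python
# Sketch of the random-walk parameter search described above: perturb the weight vector
# with standard-normal noise and keep the change only if the fold-averaged MSE improves.
import numpy as np

rng = np.random.default_rng(seed=0)

def random_walk(weights: np.ndarray, evaluate_mse, steps: int = 50) -> np.ndarray:
    best_weights = weights.copy()
    best_mse = evaluate_mse(best_weights)
    for _ in range(steps):
        candidate = best_weights + rng.standard_normal(best_weights.shape)
        candidate_mse = evaluate_mse(candidate)
        if candidate_mse < best_mse:  # keep the change only on improvement
            best_weights, best_mse = candidate, candidate_mse
    return best_weights
```

The alternating grid search over the SVR hyperparameters could, for example, be implemented with scikit-learn's GridSearchCV.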
To help with the interpretation of our results in the next section, we applied a basic statistical analysis to the training and test sets, which revealed some imbalances. On average, the similarity score of the sentences in the training set (approximately 2.79) was higher than in the test set (approximately 1.76), whereas the standard deviation was slightly higher in the test set (approximately 1.52) than in the training set (approximately 1.39). This is also indicated by the left histogram chart of
Histogram of the label distribution and word lengths of the training and test set.
The right histogram chart of
Finally, we calculated InferSent embeddings of the sentences in the training and test dataset and visualized them in a t-SNE (t-distributed stochastic neighbor embedding) plot (
t-SNE projected InferSent embeddings of the sentences in the training and test dataset. Different groups of points correspond to different sentence types. For example, the group on the left upper side corresponds to the medication sentences. t-SNE: t-distributed stochastic neighbor embedding.
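For reference, a projection like the one shown in the figure can be produced with scikit-learn's t-SNE implementation; the embedding matrices below are random placeholders standing in for InferSent sentence embeddings, and the dataset sizes are arbitrary.

```python
# Sketch of the projection behind the plot above: sentence embeddings reduced to two
# dimensions with t-SNE and colored by dataset membership.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: (n_sentences, embedding_dim) matrices, eg, InferSent's 4096-dim vectors.
train_embeddings = np.random.rand(100, 4096)
test_embeddings = np.random.rand(40, 4096)

projected = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([train_embeddings, test_embeddings])
)

plt.scatter(projected[:100, 0], projected[:100, 1], label="training set", s=10)
plt.scatter(projected[100:, 0], projected[100:, 1], label="test set", s=10)
plt.legend()
plt.show()
```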
We evaluated all runs on 3 different sets. Firstly, we used the training set with
Secondly, for the evaluation of the test set, we employed an additional ensembling technique: we used the model from each fold to calculate a prediction for a sentence pair and then averaged all predictions.
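As a sketch, this fold-level ensembling amounts to averaging the per-fold predictions; the model interface below is a placeholder.

```python
# Sketch of the fold averaging described above: every cross-validation fold yields one
# trained model, and the test prediction is the mean of all fold predictions.
import numpy as np

def ensemble_predict(fold_models, encoded_pair) -> float:
    """fold_models: list of per-fold models exposing a predict() method (placeholder API)."""
    return float(np.mean([model.predict(encoded_pair) for model in fold_models]))
```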
Summary of the different approaches and their results. Training and test Pearson correlation coefficient scores are rounded to 3 decimal places.
Approach | Training set | Test set
ClinicalBERT | 0.850 | 0.859
Enhanced BERT | 0.851 | 0.859
Voting Regression | 0.860 | 0.849a
Enhanced BERT with M-Heads | 0.853 | 0.876a,b
Enhanced BERT with M-Heads + Med. Graph | 0.853 | 0.883c
Voting Regression + Med. Graph | 0.862 | 0.862a
aOur submissions.
bWe submitted a score of 0.869 for this setting because we were able to use only 10
cOur best result of the test set.
Our 3 approaches performed differently on the two datasets. In the following sections, we discuss the results in more detail and give our thoughts.
Evaluating the pure ClinicalBERT model, we see that the Pearson correlation coefficient is slightly higher for the test set than for the training set. The Enhanced BERT architecture led to an almost negligible improvement on the training set and to no improvement at all on the test set. This indicates that, in this case, the additional features do not provide more information than what is already contained in the [CLS] token from the last hidden layer of BERT.
The Voting Regression approach showed an improvement of the Pearson correlation coefficient of the training set; however, for the test set, the performance decreased. These results might be traced back to overfitting of the training set. However, the decrease in the test set might also be explained by the imbalances between the training and test set.
Adding
Replacing the scores of the sentence subset which prescribes medications (cluster 3) with the medication graph scores led, in both cases (approaches 1 and 2), to an improvement for the test set. For the training set, however, we saw only marginal improvements, such as 0.860 to 0.862 from approach 1 to approach 3. This might be due to the Pearson correlation coefficient metric. In our experiments, we also evaluated our approaches with the Mean Squared Error between the predictions and the ground truth only on the subset of medication sentences. Without applying the medication graph (approach 1), we obtained a Mean Squared Error of 0.70, and with the medication graph (approach 3) a Mean Squared Error of 0.58. Combining the
Why did the medication graph perform better for the test set than for the training set? First, we observe that the test set contained more low-ranked sentences (see
Comparison of the Pearson correlation coefficient scores predicted by the Voting Regression (approach 1, A1) and the medication graph (approach 3, A3) via randomly selected example sentences. T denotes the ground truth score for the corresponding sentence pair. This table shows only the relevant entities from the original example sentences.
Sentence a | Sentence b | T | A1 | A3 | Set |
Ondansetron, 4 mg, 1 tablet, three times a day | Amoxicillin, 500 mg, 2 capsules, three times a day | 3.0 | 1.68 | 1.70 | Training |
Prozac, 20 mg, 3 capsules, one time daily | Aleve, 220 mg, 1 tablet, two times a day | 0.5 | 2.02 | 1.68 | Training |
Hydrochlorothiazide, 25 mg, one-half tablet, every morning | Ibuprofen, 600 mg, 1 tablet, four times a day | 1.5 | 1.59 | 1.70 | Training |
Aleve, 220 mg, 1 tablet, two times a day | Acetaminophen, 500 mg, 2 tablets, three times a day | 1.5 | 2.74 | 1.68 | Test |
Lisinopril, 10 mg, 2 tablets, one time daily | Naproxen, 500 mg, 1 tablet, two times a day | 1.0 | 2.29 | 1.69 | Test |
To tackle the problem of semantic textual similarity of medical data, we developed 3 different approaches. We proposed adding additional features to BERT and weighting different regression models based on the BERT result and other features. Moreover, we proposed the application of
Preprocessing.
Implementation Details.
Detailed description of Feature Set I and Feature Set II.
M-Heads: training and prediction.
Bidirectional Encoder Representations from Transformers
Natural Language Processing
National NLP Clinical Challenges
Open Health Natural Language Processing
This research was supported by the German Cancer Consortium (DKTK), the German Cancer Research Center (DKFZ), and the Helmholtz Association under the joint research school HIDSS4Health (Helmholtz Information and Data Science School for Health).
None declared.