This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Most current methods for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which a relationship may cross sentence boundaries. Hence, some approaches extract relations by splitting document-level datasets through heuristic rules and learning methods. However, these approaches may introduce additional noise and do not fundamentally solve the problem of intersentence relation extraction. It remains challenging to avoid this noise while extracting cross-sentence relations.
This study aimed to avoid the errors introduced by dividing the document-level dataset, to verify that a self-attention structure can extract biomedical relations in a document with long-distance dependencies and complex semantics, and to discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction.
This paper proposes a new data preprocessing method and applies a pretrained self-attention structure to document-level biomedical relation extraction, together with an entity replacement method, to capture very long-distance dependencies and complex semantics.
Compared with state-of-the-art approaches, our method greatly improved precision and also increased the F1 score. Through experiments on biomedical entity pretreatments, we found that a model using the entity replacement method achieves better performance.
When all target entity pairs in the document-level dataset are considered as a whole, a pretrained self-attention structure is suitable for capturing very long-distance dependencies and for learning the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially document-level relation extraction.
A large number of biomedical entity relations exist in the biomedical literature. Automatically and accurately extracting these relations and forming structured knowledge benefits the development of the biomedical field. Some biomedical datasets have been proposed for extracting biomedical relations, such as drug-drug interactions (DDI) [
Most approaches [
Currently, for intersentence relation extraction, some studies [
For example, the chemical-disease relation (CDR) dataset is a document-level corpus designed to extract CID relations from biomedical literature [
According to heuristic rules [
Therefore, when constructing relation instances from a document-level dataset, it is necessary to consider sentences with multiple mentions of the target entities across the entire document. While treating all target entities in a document as a whole brings benefits, the challenges are very long-distance dependencies and complex semantics, from which traditional neural networks such as CNNs or RNNs cannot accurately extract document-level relations. Recently, pretrained self-attention structures, such as SciBERT [
To address these problems, this paper proposes a pretrained self-attention mechanism combined with an entity replacement method to extract document-level relationships. This paper makes several contributions. First, to avoid the errors caused by dividing the document-level dataset, we propose a new data preprocessing method that treats a target entity pair, together with the relevant sentences of a document, as one instance. Second, to better focus on the target entity pairs and their context, a replacement method is proposed that replaces biomedical entity pairs with uniform words. Compared with other entity preprocessing for biomedical entity pairs, the replacement method is more effective for biomedical relation extraction. Third, to handle long-distance dependencies and learn complex semantics, a pretrained self-attention structure is applied to document-level relation extraction and achieves performance superior to state-of-the-art approaches. Through analysis and visualization of the model structure, the effectiveness of the self-attention structure for document-level datasets is demonstrated.
As already mentioned, splitting the document-level corpus increases noise and may lose useful information. To address this problem, the sentences in which the target entity pair is located, together with the sentences between them, are constructed into an instance. This approach has the following benefits. First, it does not introduce labeling errors: the sentences do not need to be relabeled after segmentation of the dataset, and the marked relation pairs in the document correspond one to one with the constructed instances. Second, it discards useless information that is not related to the relationship of the target entities. Some sentences are unrelated to the sentences in which the target entities are located; hence, they are noise for relation extraction, and discarding them focuses the model on the sentences in which the entity pair is located. Third, it retains much useful information, such as contextual information about the entities and their relationship.
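A minimal sketch of this instance construction is shown below. The document representation (a list of sentences plus entity mentions carrying the indices of the sentences in which they appear) and all field names are illustrative assumptions, not the authors' exact implementation.

```python
from itertools import product

def build_instances(sentences, chemical_mentions, disease_mentions):
    """Construct one instance per chemical-disease pair: the sentences containing
    the two target entities and every sentence between them.

    `sentences` is a list of sentence strings; each mention is a dict with an
    entity "id" and a set "sent_ids" of sentence indices where it is mentioned.
    """
    instances = []
    for chem, dis in product(chemical_mentions, disease_mentions):
        covered = chem["sent_ids"] | dis["sent_ids"]
        lo, hi = min(covered), max(covered)
        # Keep the span from the first to the last mention of the pair;
        # unrelated sentences outside this span are discarded as noise.
        text = " ".join(sentences[lo:hi + 1])
        instances.append({"chemical": chem["id"], "disease": dis["id"], "text": text})
    return instances
```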
As shown in
An example of document-level relation instance construction.
In this document, there are 2 chemical entities, “amisulpride” (C1) and “calcium gluconate” (C2), and 4 disease entities: “overdose” (D1), “prolonged QT syndrome/QT prolongation” (D2), “poisoning” (D3), and “hypocalcemia” (D4). It should be noted that C1, C2, D1, D2, D3, and D4 are added to the document to indicate which are chemical entities and which are disease entities. Hence, the document can be constructed into 8 instances, of which the 2 instances for C1-D2 and C1-D4 have CID relations. Sentences a), b), c), and d) compose the instance of C1 and D2; sentences a), b), and c) compose the instance of C1 and D4.
Semantically, both the intrasentence-level sentence a) and the intersentence-level sentences b) and c) express the CID relationship between C1 and D2. However, according to heuristic rules, b) and c) would be discarded because only entities that are not involved in any intrasentence-level instance are considered at the intersentence level. In addition, the instances are full of contextual information about the chemical and disease entities, which is conducive to document-level relation extraction when exploited well.
There are many biomedical entities in a document. When constructing the instances of the target entity pairs, it is inevitable that the same instance is tagged with different labels, resulting in incorrect classification. For example, as mentioned in the Methods section, the instances of C1-D2 and C2-D2 are the same but tagged with different labels. To solve this problem, entity pretreatment methods are presented.
There are 2 different biomedical entity pretreatments, as shown in
An instance with two different biomedical entity pretreatments.
As shown in
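The two pretreatments can be sketched as follows. The uniform words "chemical" and "disease" follow the instance text shown later in the visualization section, while the tag strings used by the addition method below are hypothetical placeholders rather than the authors' exact markers.

```python
def replace_entities(text, chem_mentions, dis_mentions):
    """Replacement method: substitute every mention of the target chemical and
    disease entities with the uniform words "chemical" and "disease"."""
    for m in chem_mentions:
        text = text.replace(m, "chemical")
    for m in dis_mentions:
        text = text.replace(m, "disease")
    return text

def add_entity_tags(text, chem_mentions, dis_mentions):
    """Addition method: keep the original mention and add extra tags on its left
    and right sides (the tag strings here are illustrative)."""
    for m in chem_mentions:
        text = text.replace(m, f"<chem> {m} </chem>")
    for m in dis_mentions:
        text = text.replace(m, f"<dis> {m} </dis>")
    return text

# Example:
# replace_entities("QT prolongation after amisulpride overdose.",
#                  ["amisulpride"], ["QT prolongation"])
# -> "disease after chemical overdose."
```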
A self-attention structure can directly calculate the similarity between any two words, so that the distance between words is 1, which intuitively solves long-distance dependencies. As demonstrated by Tang et al [
However, for document-level relationship extraction, according to our preprocessing method, the length of instances is longer than that of the experimental data in the paper by Tang et al [
Based on the structure of BERT, SciBERT built a new vocabulary, called SCIVOCAB, and was trained on a scientific corpus consisting of computer science and biomedical papers. Following SciBERT, we employ the same input representation, constructed by the summation of token embedding, segment embedding, and position embedding. The tokens “[CLS]” and “[SEP]” are added at the beginning and end, respectively, of each instance. In addition, when tokenizing words, WordPiece embedding [
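As an illustration, this input representation can be produced with the Hugging Face tokenizer for the uncased SciBERT model; the model identifier and the maximum length below are assumptions of this sketch, not details taken from the original implementation.

```python
from transformers import AutoTokenizer

# Uncased SciBERT with the SCIVOCAB vocabulary (Hugging Face model id assumed).
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

instance = "drug-related disease are most often associated with chemical."
encoded = tokenizer(instance, max_length=512, truncation=True, return_tensors="pt")

# "[CLS]" and "[SEP]" are added automatically; out-of-vocabulary words are split
# into WordPiece sub-tokens.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```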
SciBERT is made up of N transformer stacks. Transformer stack
where
The multihead applies self-attention, or scaled dot-product attention, multiple times. Through the mapping of the query
The architecture of the model.
Instead of applying a single scaled dot-product attention, multihead attention applies query
where the projections are parameter matrices
Then, the outputs of the individual attention heads are merged, denoted as
Where
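The scaled dot-product attention and its multihead combination described above follow the standard transformer formulation; a minimal PyTorch sketch (not the authors' code) is given below.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Project x into per-head queries, keys, and values, attend within each
    head, then merge the heads with the output projection W_o."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    heads = scaled_dot_product_attention(split(x @ W_q), split(x @ W_k), split(x @ W_v))
    merged = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
    return merged @ W_o
```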
The second component of the transformer stack is 2 layers of pointwise FFN. Alternatively, it can be described as 2 convolutions with kernel size 1.
where
The final layer is an FFN that serves as the relation classifier. Its input is the final transformer output of the token “[CLS]”.
Where
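A sketch of the pointwise feed-forward sublayer and of the relation classifier over the “[CLS]” representation is given below; the hidden sizes (768 and 3072, as in BERT-base) and the activation function are assumptions of this illustration.

```python
import torch.nn as nn

class PointwiseFFN(nn.Module):
    """Two position-wise linear layers (equivalently, two convolutions with
    kernel size 1), applied independently at every token position."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class RelationClassifier(nn.Module):
    """Feed-forward classifier over the final "[CLS]" representation."""
    def __init__(self, d_model=768, num_labels=2):
        super().__init__()
        self.linear = nn.Linear(d_model, num_labels)

    def forward(self, hidden_states):
        cls = hidden_states[:, 0]   # "[CLS]" is the first token of each instance
        return self.linear(cls)     # relation logits (eg, CID vs no relation)
```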
In this section, we first describe the experimental datasets and provide the experiment settings. Then, we compare the performance of SciBERT with that of existing methods and validate the effectiveness of the pretrained self-attention structure on the document-level dataset through visualization of the multihead attention. Finally, experimenting on different datasets, including 2 sentence-level corpora and a document-level corpus, we compare various biomedical entity pretreatments and analyze which preprocessing is better for the self-attention structure.
Descriptions of the relation extraction datasets.
Dataset, Types | Training set | Development set | Test set |
CDRa | | | |
Documents | 500 | 500 | 500 |
Positive | 1038 | 1012 | 1066 |
Negative | 4324 | 4134 | 4374 |
PPImb | | | |
Documents | 597 | N/Ac | 635 |
Positive | 750 | N/Ac | 869 |
Negative | 1401 | N/Ac | 1717 |
DDId | | | |
Sentences | 18,872 | N/Ac | 3843 |
Positive | 3964 | N/Ac | 970 |
Negative | 14,908 | N/Ac | 2873 |
Int | 183 | N/Ac | 96 |
Advice | 815 | N/Ac | 219 |
Effect | 1654 | N/Ac | 357 |
Mechanism | 1312 | N/Ac | 298 |
CPRe | | | |
Sentences | 6437 | 3558 | 5744 |
Positive | 4172 | 2427 | 3469 |
Negative | 2265 | 1131 | 2275 |
CPR:3 | 777 | 552 | 667 |
CPR:4 | 2260 | 1103 | 1667 |
CPR:5 | 173 | 116 | 198 |
CPR:6 | 235 | 199 | 293 |
CPR:9 | 727 | 457 | 644 |
aCDR: chemical-disease relation.
bPPIm: protein-protein interaction affected by mutations.
cDevelopment sets do not exist in the PPIm and DDI datasets.
dDDI: drug-drug interaction.
eCPR: chemical-protein relation.
The CDR dataset is used to extract CID relations and is a 2-label classification task. The PPIm dataset was released to extract protein-protein interactions affected by genetic mutations, which is also a 2-label classification task. Aimed at extracting drug-drug interactions, the DDI dataset requires classification into 5 relation types: the int, advice, effect, mechanism, and negative types. For the DDI dataset, we adopted some rules to filter negative sentences, as described by Quan et al [
Due to the size of the CDR dataset, we merged the training and development sets to construct the training set. After preprocessing the CDR and PPIm datasets, we counted the average number of sentences per instance, average number of tokens per instance, and average number of tokens per sentence in the constructed instance set.
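These statistics can be computed with a small helper such as the following; the whitespace tokenization and the instance fields are assumptions of this sketch.

```python
def instance_statistics(instances):
    """Average sentences per instance, tokens per instance, and tokens per
    sentence for a constructed instance set (whitespace tokenization assumed)."""
    n_sents = sum(len(inst["sentences"]) for inst in instances)
    n_tokens = sum(len(s.split()) for inst in instances for s in inst["sentences"])
    return {
        "sentences_per_instance": n_sents / len(instances),
        "tokens_per_instance": n_tokens / len(instances),
        "tokens_per_sentence": n_tokens / n_sents,
    }
```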
We employed the parameters of the uncased SciBERT with the vocabulary SCIVOCAB and fine-tuned on the CDR datasets. The model parameters are described as: SciBERTuncased:
Due to the distinction of the length of instances in the dataset, the input dimensions of the corresponding model for each dataset are different. For the CDR and PPIm datasets, the length of the input sequence is set to 512, and the batch size is set to 6. For the DDI dataset, the length of the input sequence is set to 150, and the batch size is set to 32. For the CPR dataset, the length of the input sequence is set to 200, and the batch size is set to 23. The epoch of all model training is set to 3. All results are averaged across 5 runs. For consistency of comparisons, we merged the training and development sets to train the models.
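Under these settings, a fine-tuning setup might look like the sketch below. The Hugging Face model identifier, the use of a generic sequence-classification head, and the assumed number of CPR labels (5 positive classes plus a negative class) are illustrative assumptions; the authors' training loop may differ.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Input length and batch size per dataset, as reported in the experiment settings.
# Training runs for 3 epochs; reported results are averaged over 5 runs.
CONFIGS = {
    "CDR":  {"max_len": 512, "batch_size": 6,  "num_labels": 2},
    "PPIm": {"max_len": 512, "batch_size": 6,  "num_labels": 2},
    "DDI":  {"max_len": 150, "batch_size": 32, "num_labels": 5},
    "CPR":  {"max_len": 200, "batch_size": 23, "num_labels": 6},  # assumed label count
}

def build_model(dataset="CDR"):
    cfg = CONFIGS[dataset]
    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "allenai/scibert_scivocab_uncased", num_labels=cfg["num_labels"])
    return tokenizer, model, cfg
```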
Statistics of the constructed instance sets of the chemical-disease relation (CDR) and protein-protein interaction affected by mutations (PPIm) datasets.
Dataset, Types | Training set | Test set |
CDR | | |
Instances | 10,407 | 5418 |
Positive | 1947 | 1042 |
Negative | 8460 | 4376 |
Sentences per instance | 11.1 | 12.1 |
Tokens per instance | 161.5 | 168.9 |
Tokens per sentence | 14.6 | 14.0 |
PPIm | | |
Instances | 2151 | 2586 |
Positive | 750 | 869 |
Negative | 1401 | 1717 |
Sentences per instance | 9.0 | 8.8 |
Tokens per instance | 169.6 | 186.6 |
Tokens per sentence | 18.7 | 21.2 |
For the CDR dataset, we compared our method with 6 state-of-the-art models without any knowledge bases. Zhou et al [
As shown in
Performance of the chemical-disease relation (CDR) and protein-protein interactions affected by mutations (PPIm) test datasets compared with state-of-the-art methods.
Dataset, Model | Pa, % | Rb, % | F1, % |
CDR | | | |
LSTMc [ | 55.6 | 68.4 | 61.3 |
CNNd [ | 55.7 | 68.1 | 61.3 |
RPCNNe [ | 55.2 | 63.6 | 59.1 |
BRANf [ | 55.6 | 70.8 | 62.1 |
GCNNg [ | 52.8 | 66.0 | 58.6 |
Our method | 65.5 | 62.6 | 64.0 |
PPIm | | | |
SVMh [ | 32.0 | 34.0 | 33.0 |
CNN (without KBi) [ | 38.2 | 37.3 | 37.8 |
MNMj [ | 40.3 | 32.3 | 35.9 |
MNM+Rule [ | 38.0 | 37.0 | 37.5 |
Our method | 83.5 | 90.4 | 86.8 |
aP: precision.
bR: recall.
cLSTM: long short-term memory.
dCNN: convolutional neural network.
eRPCNN: recurrent piecewise convolutional neural network.
fBRAN: bi-affine relation attention network.
gGCNN: graph convolutional neural network.
hSVM: support vector machine.
iKB: knowledge base.
jMNM: memory neural network.
As described in the Methods section, there are 2 entity pretreatment methods. The first is the replacement method, which replaces biomedical entities with uniform words. The second is the addition method, which adds extra tags on the left and right sides of biomedical entities. We conducted experiments with the CDR, PPIm, DDI, and CPR datasets. The comparison of the 2 pretreatments for biomedical entities is shown in
For each dataset, the recall and F1 score obtained with the replacement method were higher than those obtained with the addition method, especially for the CDR dataset. The reason is that biomedical entities are complicated, and most are compound words. For the pretrained self-attention structure, the word embeddings of biomedical entities are hard to learn from small biomedical datasets. Consequently, replacing the target entities with uniform words helps the model understand the target entities and pay more attention to their context.
Comparison of 2 pretreatments (addition and replacement) for biomedical entities using our method.
Dataset, Types | Addition method | Replacement method |
| Pa, % | Rb, % | F1, % | P, % | R, % | F1, % |
CDRc | | | | | | |
Positive | 67.4 | 54.8 | 60.4 | 65.5 | 62.6 | 64.0 |
PPImd | | | | | | |
Positive | 79.3 | 91.5 | 84.8 | 83.5 | 90.4 | 86.8 |
DDIe | | | | | | |
Int | 74.8 | 46.2 | 57.1 | 76.2 | 46.9 | 58.0 |
Advise | 87.2 | 84.7 | 85.9 | 88.6 | 89.0 | 88.8 |
Effect | 77.2 | 82.1 | 79.5 | 77.0 | 82.6 | 79.7 |
Mechanism | 84.8 | 80.4 | 82.5 | 82.1 | 86.0 | 84.0 |
All | 81.6 | 78.6 | 80.0 | 81.2 | 81.3 | 81.4 |
CPRf | | | | | | |
CPR:3 | 73.5 | 80.3 | 76.7 | 75.4 | 79.5 | 77.4 |
CPR:4 | 84.4 | 88.8 | 86.6 | 83.7 | 90.4 | 86.9 |
CPR:5 | 80.7 | 82.0 | 81.3 | 81.2 | 86.5 | 83.7 |
CPR:6 | 84.0 | 89.4 | 86.7 | 86.5 | 88.2 | 87.3 |
CPR:9 | 76.2 | 86.9 | 81.2 | 79.5 | 90.1 | 84.5 |
All | 80.4 | 86.5 | 83.3 | 81.4 | 87.9 | 84.5 |
aP: precision.
bR: recall.
cCDR: chemical-disease relation.
dPPIm: protein-protein interaction affected by mutations.
eDDI: drug-drug interaction.
fCPR: chemical-protein relation.
BERT and SciBERT are pretrained models with the same self-attention structure. The difference between the two is that BERT is pretrained on a general-domain wiki corpus, whereas SciBERT is pretrained on a large quantity of scientific papers from the computer science and biomedical domains.
Comparison of different pretrained models using our method.
Dataset, Type | BERT | SciBERT |
| Pa, % | Rb, % | F1, % | P, % | R, % | F1, % |
CDRc | | | | | | |
Positive | 62.9 | 58.3 | 60.5 | 65.5 | 62.6 | 64.0 |
PPImd | | | | | | |
Positive | 79.0 | 92.2 | 85.1 | 83.5 | 90.4 | 86.8 |
DDIe | | | | | | |
Int | 69.8 | 42.7 | 52.8 | 76.2 | 46.9 | 58.0 |
Advise | 91.3 | 89.0 | 90.1 | 88.6 | 89.0 | 88.8 |
Effect | 74.1 | 77.6 | 75.7 | 77.0 | 82.6 | 79.7 |
Mechanism | 78.5 | 80.1 | 79.3 | 82.1 | 86.0 | 84.0 |
All | 79.0 | 77.5 | 78.2 | 81.2 | 81.3 | 81.4 |
CPRf | | | | | | |
CPR:3 | 73.8 | 76.5 | 75.1 | 75.4 | 79.5 | 77.4 |
CPR:4 | 81.7 | 89.7 | 85.5 | 83.7 | 90.4 | 86.9 |
CPR:5 | 79.3 | 80.7 | 79.9 | 81.2 | 86.5 | 83.7 |
CPR:6 | 80.2 | 84.6 | 82.2 | 86.5 | 88.2 | 87.3 |
CPR:9 | 76.5 | 88.2 | 81.9 | 79.5 | 90.1 | 84.5 |
All | 79.0 | 85.9 | 82.3 | 81.4 | 87.9 | 84.5 |
aP: precision.
bR: recall.
cCDR: chemical-disease relation.
dPPIm: protein-protein interaction affected by mutations.
eDDI: drug-drug interaction.
fCPR: chemical-protein relation.
Data preprocessing (DP) and the pretraining means (PTM) are important components of our method; DP aims to alleviate noise, and the PTM is designed to handle long-distance dependencies. We compared the importance of each component of our method on the CDR dataset.
Performance changes by removing different parts of our model.
CDRa dataset | Pb, % | Rc, % | F1, % | Change, % |
Baseline | 65.5 | 62.6 | 64.0 | N/A |
Remove DPd | 67.0 | 54.3 | 60.0 | –6.3 |
Remove PTMe | 46.1 | 39.5 | 42.6 | –33.4 |
Remove DP and PTM | 48.9 | 31.2 | 38.1 | –40.5 |
aCDR: chemical-disease relation.
bP: precision.
cR: recall.
dDP: data preprocessing.
ePTM: pretraining means.
To fully illustrate that our model can handle long-distance dependencies, we counted the numbers of positive and negative instances in the CDR test set in intervals of 50 tokens of instance length, as shown in
We calculated the precision, recall, and F1 score for each interval length in the test set. The results are shown in
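A sketch of this per-interval evaluation follows, assuming each test instance is available as a token count, a gold label, and a predicted label; the bucketing and metric code is illustrative rather than the authors' evaluation script.

```python
from collections import defaultdict

def metrics_by_length(examples, interval=50):
    """Group (n_tokens, gold, pred) triples into length intervals of `interval`
    tokens and compute precision, recall, and F1 for the positive class."""
    buckets = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for n_tokens, gold, pred in examples:
        b = buckets[(n_tokens - 1) // interval]
        if pred and gold:
            b["tp"] += 1
        elif pred and not gold:
            b["fp"] += 1
        elif gold and not pred:
            b["fn"] += 1
    results = {}
    for key, c in sorted(buckets.items()):
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        results[f"{key * interval + 1}-{(key + 1) * interval}"] = (p, r, f1)
    return results
```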
Quantity distribution of the chemical-disease relation test set.
Interval length | Positive | Negative | Sum |
0-50 | 70 | 344 | 414 |
51-100 | 181 | 884 | 1065 |
101-150 | 200 | 845 | 1045 |
151-200 | 160 | 756 | 916 |
201-250 | 158 | 663 | 821 |
251-300 | 177 | 571 | 748 |
301-350 | 56 | 216 | 272 |
351-400 | 34 | 64 | 98 |
>400 | 6 | 33 | 39 |
Results of each interval length in the test set using our replacement method.
Interval length | Pa, % | Rb, % | F1, % |
0-50 | 54.2 | 64.3 | 58.8 |
51-100 | 57.5 | 57.5 | 57.5 |
101-150 | 67.7 | 64.0 | 65.8 |
151-200 | 64.4 | 71.2 | 67.7 |
201-250 | 66.2 | 54.4 | 59.7 |
251-300 | 69.9 | 69.5 | 69.7 |
301-350 | 70.8 | 60.7 | 65.4 |
351-400 | 80.0 | 82.4 | 81.2 |
>400 | 100.0 | 66.7 | 80.0 |
aP: precision.
bR: recall.
To verify that the pretrained self-attention mechanism works as we expect, that is, that it can take advantage of the textual context and capture very long-range dependencies to understand the complex semantics of biomedical text, we visualized the output of the token “[CLS]” in the multihead attention of the final transformer stack, as shown in
As seen from the token colors, the token “[CLS]” is related to the following tokens: “chemical,” “disease,” “drug,” “related,” “bilateral,” “[CLS],” and “[SEP]”. The 12 different colors correspond to the 12 attention heads. The more numerous and darker the colors, the more relevant the token. Lines between two tokens denote a correlation between them, and their opacity reflects the attention weights of the heads.
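The attention weights behind such a visualization can be extracted directly from the model. The sketch below (model identifier assumed; the authors may have used a dedicated visualization tool) prints, for each head of the final layer, the tokens that “[CLS]” attends to most strongly.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased",
                                  output_attentions=True)

text = "drug-related disease are most often associated with chemical."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each of shape (batch, heads, seq, seq).
last_layer = outputs.attentions[-1][0]      # (heads, seq, seq)
cls_attention = last_layer[:, 0, :]         # what "[CLS]" attends to, per head
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for head_weights in cls_attention:
    top = head_weights.topk(5).indices.tolist()
    print([tokens[i] for i in top])
```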
Visualization of the output of token “[CLS]” in the multihead attention of the final transformer stack.
From the perspective of semantic analysis, there are 2 places in this example reflecting a relationship between a disease and chemical: “Drug-related disease are most often associated with chemical.” and “Bilateral disease after the use of entity, without concurrent chemical use, have never been reported.” In the first sentence, the relation between chemical and disease is mainly determined by the following tokens: “associated,” “chemical,” “disease,” “related,” and “drug.” In the second sentence, the relation between chemical and disease is mainly determined by the following tokens: “reported,” “never,” “chemical,” “disease,” “without,” and “concurrent.” Token “[CLS]” is related to the most keywords in both sentences. Therefore, the pretrained self-attention structure can take advantage of the textual context and capture very long-range dependencies from document-level instances. On the other hand, the distribution of the different colors shows that multihead attention can form diverse representation subspaces to learn more complicated semantics.
However, from the gradation of the colors, the relationship between token “[CLS]” and the keywords is not strong enough. Token “[CLS]” is not highly correlated with token “disease” in this instance. We visualized the output of tokens “chemical” and “disease” in the final multihead attention, as shown in
Visualization of the output of token “disease” in the final multihead attention.
Visualization of the output of token “chemical” in the final multihead attention.
For a document-level annotated dataset, instead of dividing the dataset, we considered all target entity pairs as a whole and applied a pretrained self-attention structure to extract biomedical relations. The results and analysis show that the pretrained self-attention structure extracted relations of multiple entity pairs in a document. Through the visualization of the transformer, we verified that the pretrained self-attention structure can capture long-distance dependencies and learn complicated semantics. Furthermore, we conclude that replacement of biomedical entities benefits biomedical relation extraction, especially for document-level relation extraction.
However, this method still has some issues. In future work, we plan to design a more effective network to capture local relations between biomedical entities and improve our method.
BRAN: bi-affine relation attention network.
CDR: chemical-disease relation.
CID: chemical-induced disease.
CNN: convolutional neural network.
CPR: chemical-protein relation.
DDI: drug-drug interaction.
DP: data preprocessing.
FFN: feed-forward network.
GCNN: graph convolutional neural network.
KB: knowledge base.
MNM: memory neural network.
P: precision.
PPIm: protein-protein interactions affected by mutations.
PTM: pretraining means.
R: recall.
RNN: recurrent neural network.
RPCNN: recurrent piecewise convolutional neural network.
SVM: support vector machine.
This study was funded by the Natural Science Foundation of Guangdong Province of China (2015A030308017), National Natural Science Foundation of China (61976239), and Innovation Foundation of High-end Scientific Research Institutions of Zhongshan City of China (2019AG031).
None declared.