Published in Vol 8, No 5 (2020): May

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/17644.
Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study


Authors of this article:

Xiaofeng Liu1; Jianye Fan1; Shoubin Dong1

Original Paper

Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China

Corresponding Author:

Shoubin Dong, PhD

Communication and Computer Network Key Laboratory of Guangdong

School of Computer Science and Engineering

South China University of Technology

No. 381, Wushan Road

Tianhe District, Guangdong

Guangzhou, 510610

China

Phone: 86 15625125397

Email: sbdong@scut.edu.cn


Background: Most current methods for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which the relationship may cross sentence boundaries. Hence, some approaches have been proposed to extract relations by splitting document-level datasets through heuristic rules and learning methods. However, these approaches may introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid noise and extract cross-sentence relations.

Objective: This study aimed to avoid errors by dividing the document-level dataset, verify that a self-attention structure can extract biomedical relations in a document with long-distance dependencies and complex semantics, and discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction.

Methods: This paper proposes a new data preprocessing method and applies a pretrained self-attention structure, together with an entity replacement method, to document-level biomedical relation extraction in order to capture very long-distance dependencies and complex semantics.

Results: Compared with state-of-the-art approaches, our method greatly improved precision and increased the F1 score. Through experiments with different biomedical entity pretreatments, we found that a model using the entity replacement method achieves better performance.

Conclusions: When considering all target entity pairs as a whole in the document-level dataset, a pretrained self-attention structure is suitable to capture very long-distance dependencies and learn the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially to document-level relation extraction.

JMIR Med Inform 2020;8(5):e17644

doi:10.2196/17644



Introduction
A large number of biomedical entity relations exist in the biomedical literature. Automatically and accurately extracting these relations and forming structured knowledge is beneficial for the development of biomedical fields. Several biomedical datasets have been proposed for extracting biomedical relations, such as drug-drug interactions (DDI) [1], chemical-protein relations (CPR) [2], and chemical-induced diseases (CID) [3]. The former 2 are sentence-level annotated datasets, in which relations are extracted from a single sentence containing a single entity-pair mention; the latter is a document-level annotated dataset, in which it is uncertain whether relations are asserted within sentences or across sentence boundaries.

Most approaches [4-7] have focused on single sentences containing biomedical relations. For example, Zhang et al [4] presented a hierarchical recurrent neural network (RNN) to combine raw sentences with their short dependency paths for a DDI task. To deal with long and complicated sentences, Sun et al [5] separated sequences into short context subsequences and proposed a hierarchical recurrent convolutional neural network (CNN). Because these approaches cannot be directly applied to document-level datasets, some existing methods [8,9] divided the document-level dataset into 2 parts and trained an intrasentence model and an intersentence model. Nevertheless, because of long-distance dependencies and co-references, their methods cannot be adapted to cross-sentence relation extraction. Furthermore, splitting the dataset resulted in noise and rule-based mistakes.

Currently, for intersentence relation extraction, some studies [10-12] generate dependency syntax trees within and across sentences and employ a graph neural network to capture dependencies. However, building dependency syntax trees is costly. In addition, few studies, except those by Li et al [13] and Verga et al [14], have considered the influence of noisy data caused by segmenting datasets or have taken advantage of the textual context. For a document-level annotated corpus, simply assuming that an entity-pair mention within a sentence or across sentences expresses a biomedical relationship will undoubtedly cause errors, and it may ignore plenty of useful information, because many sentences with co-occurring or co-referential medical entity mentions refer to biomedical relations.

For example, the chemical-disease relation (CDR) dataset is a document-level corpus designed to extract CID relations from the biomedical literature [15]. For CID relation extraction, most current methods [8,16,17] divide the CDR dataset into intrasentence-level and intersentence-level relation instances using heuristic rules. Although these heuristic rules are effective, they inevitably generate noisy instances of CID relations or ignore some useful information. For example, the following sentence expresses a CID relation between the chemical amitriptyline and the disease blurred vision: “The overall incidence of side effects and the frequency and severity of blurred vision, dry mouth, and drowsiness were significantly less with dothiepin than with amitriptyline.”

According to heuristic rules [8], the token distance between two mentions in an intrasentence-level instance should be <10. The token distance between the chemical amitriptyline and the disease blurred vision in this example is 12; therefore, this sentence is discarded. However, this sentence is in fact the only sentence in the document [18] that describes the CID relation between the chemical amitriptyline and the disease blurred vision. Obviously, heuristic rules cannot precisely partition the CDR dataset, and they can induce wrong classifications by models, even though multi-instance learning is used to reduce these errors.
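To make the effect of such a rule concrete, the following sketch (our own illustrative code with naive whitespace tokenization, so the exact count differs slightly from the 12 reported above) shows a token-distance filter discarding this sentence:

```python
# Illustrative sketch of a token-distance heuristic of this kind (our own code,
# not from the cited work): an intrasentence candidate pair is kept only if
# fewer than 10 tokens separate the two mentions.
sentence = ("The overall incidence of side effects and the frequency and "
            "severity of blurred vision , dry mouth , and drowsiness were "
            "significantly less with dothiepin than with amitriptyline .").split()

def token_gap(tokens, end_of_first_mention, start_of_second_mention):
    """Count the tokens strictly between the two mention boundaries."""
    lo = tokens.index(end_of_first_mention)
    hi = tokens.index(start_of_second_mention)
    return hi - lo - 1

gap = token_gap(sentence, "vision", "amitriptyline")
print(gap, "kept" if gap < 10 else "discarded")   # this pair is discarded
```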

Therefore, when constructing relation instances from a document-level dataset, it is necessary to consider sentences with multiple mentions of the target entities in the entire document. While treating all target entities in a document as a whole brings benefits, the challenges are very long-distance dependencies and complex semantics, from which traditional neural networks such as CNNs or RNNs cannot accurately extract document-level relations. Recently, pretrained self-attention structures, such as SciBERT [19] and BERT [20], have been proposed; self-attention is not necessarily better than an RNN at capturing long-range dependencies, but it performs better as the number of attention heads increases [21]. A pretrained transformer has already learned rich semantic features and performs well for sentence-level relation extraction; however, it had not been applied to document-level relation extraction.

To address these problems, this paper applies a pretrained self-attention mechanism with an entity replacement method to extract document-level relationships. This paper makes several contributions. First, to avoid the errors introduced by dividing the document-level dataset, it proposes a new data preprocessing method that treats the target entity pair, together with the relevant sentences of a document, as an instance. Second, to better focus on the target entity pairs and their context, a replacement method is proposed that replaces biomedical entity pairs with uniform words; compared with other entity preprocessing for biomedical entity pairs, the replacement method is more effective for biomedical relation extraction. Third, to solve the problem of long-distance dependencies and to learn complex semantics, a pretrained self-attention structure is applied to document-level relation extraction and achieves performance superior to state-of-the-art approaches. Through analysis and visualization of the model structure, the effectiveness of the self-attention structure on document-level datasets is demonstrated.

Methods
Data Preprocessing for the Document-Level Corpus

As already mentioned, splitting the document-level corpus increases noise and may lose useful information. To address this problem, the sentences in which the target entity pair is located, together with the sentences between them, are constructed into an instance. This approach has the following benefits. First, it does not introduce errors: the sentences do not need to be relabeled after segmentation of the dataset, and the annotated relation pairs in the document correspond one to one with the constructed instances. Second, it discards useless information that is not related to the relationship of the target entities. Some sentences are unrelated to the sentences in which the target entities are located; hence, they are noise for relation extraction, and discarding them focuses the model on the sentences in which the entity pair is located. Third, it retains plenty of useful information, such as contextual information about the entities and their relationship.
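A minimal sketch of this construction, under one plausible reading of the description above (the function name and the assumption that mention sentence indices come from the gold annotations are ours, not the authors' released code):

```python
# Sketch of document-level instance construction (illustrative, not the
# authors' code): for a target chemical-disease pair, keep every sentence that
# mentions either entity, together with the sentences lying between the first
# and last such sentence; everything before and after is discarded.
from typing import List, Set

def build_instance(sentences: List[str],
                   chem_sent_ids: Set[int],
                   dis_sent_ids: Set[int]) -> str:
    """sentences: the document split into sentences.
    chem_sent_ids / dis_sent_ids: indices of sentences mentioning the target
    chemical / disease (assumed to come from the dataset's annotations)."""
    hits = sorted(chem_sent_ids | dis_sent_ids)
    first, last = hits[0], hits[-1]
    return " ".join(sentences[first:last + 1])
```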

As shown in Figure 1, a document [22] in the CDR dataset is constructed into biomedical relation instances. All chemical and disease entities are shown in bold.

Figure 1. An example of document-level relation instance construction.

In this document, there are 2 chemical entities, “amisulpride” (C1) and “calcium gluconate” (C2), and 4 disease entities: “overdose” (D1), “prolonged QT syndrome/QT prolongation” (D2), “poisoning” (D3), and “hypocalcemia” (D4). It should be noted that C1, C2, D1, D2, D3, and D4 are added to the document to indicate which are chemical entities and which are disease entities. Hence, the document can be constructed into 8 instances, of which the 2 instances C1-D2 and C1-D4 have CID relations. Sentences a), b), c), and d) form the instance of C1 and D2; sentences a), b), and c) form the instance of C1 and D4.

Semantically, both the intrasentence-level sentence a) and the intersentence-level sentences b) and c) express the CID relationship of C1 and D2. However, according to the heuristic rules, b) and c) would be discarded because only entities that are not involved in any intrasentence-level instance are considered at the intersentence level. Moreover, the instances are full of contextual information about the chemical and disease entities, which is conducive to document-level relation extraction when it is exploited well.

A document contains many biomedical entities. When constructing the instances of target entity pairs, it is inevitable that the same instance is tagged with different labels, resulting in incorrect classification. For example, in the document shown in Figure 1, the instances of C1-D2 and C2-D2 are the same but are tagged with different labels. To solve this problem, entity pretreatment methods are presented.

There are 2 different biomedical entity pretreatments, as shown in Figure 2. In the first pretreatment, called the replacement method, the target chemical and disease entities are replaced with “chemical” and “disease,” respectively. For example, in the instance of C1 and D2, sentence a) is processed into “Two cases of chemical entity: a cause for disease.” The second data preprocessing method is called the addition method: different marks are added at the boundaries of the chemical and disease entities of the relation instance. For instance, sentence a) is processed into “Two cases of [[ amisulpride ]] overdose: a cause for << prolonged QT syndrome >>”. In the Results section, we describe the advantages and disadvantages of the 2 pretreatment methods.
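Both pretreatments can be sketched as follows (illustrative code under our assumptions: mention character offsets come from the gold annotations, and in the replacement method non-target annotated mentions become the generic token “entity”, matching the processed examples above):

```python
# Sketch of the two entity pretreatments (not the authors' code).
def apply_pretreatment(text, mentions, method="replacement"):
    """mentions: (start, end, kind) character spans; kind is "chemical" or
    "disease" for the target pair and "entity" for other annotated mentions."""
    out, prev = [], 0
    for start, end, kind in sorted(mentions):
        out.append(text[prev:start])
        surface = text[start:end]
        if method == "replacement":
            out.append(kind)                       # uniform replacement word
        elif kind == "chemical":
            out.append("[[ " + surface + " ]]")    # addition method: mark chemicals
        elif kind == "disease":
            out.append("<< " + surface + " >>")    # addition method: mark diseases
        else:
            out.append(surface)                    # non-target mentions stay unchanged
        prev = end
    out.append(text[prev:])
    return "".join(out)

sent = "Two cases of amisulpride overdose: a cause for prolonged QT syndrome"
mentions = [(13, 24, "chemical"),   # amisulpride (C1, target chemical)
            (25, 33, "entity"),     # overdose (D1, not the target disease)
            (47, 68, "disease")]    # prolonged QT syndrome (D2, target disease)
print(apply_pretreatment(sent, mentions, "replacement"))
print(apply_pretreatment(sent, mentions, "addition"))
```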

Figure 2. An instance with two different biomedical entity pretreatments.

Model Architecture

As shown in Figure 3, when this data preprocessing is adopted, most instances are very long, which results in very long-distance dependencies and complex semantics.

A self-attention structure directly calculates similarities between words, so that the distance between any two words is 1, which intuitively addresses long-distance dependencies. As demonstrated by Tang et al [21], the transformer, a combined self-attention structure, is capable of semantic feature extraction far exceeding that of RNNs and CNNs and performs better as the number of attention heads increases. Therefore, a pretrained self-attention structure, namely a pretrained transformer, is applied to these problems.

However, for document-level relation extraction with our preprocessing method, the instances are longer than the experimental data in the paper by Tang et al [21], and the semantics are more complicated. There are multiple target entity pairs in the instances; some reflect the correct relationship, and some do not. Therefore, the transformer structure must have a certain reasoning ability. To verify the validity of a pretrained self-attention structure for document-level relation extraction, we adopted the structure of SciBERT, which was pretrained on the scientific literature, and added a feed-forward network (FFN) as a classifier. A visual model architecture is provided in Figure 3. We fine-tuned the model on the preprocessed CDR dataset. The structure of the model is described in detail in the following paragraphs.

Based on the structure of BERT, SciBERT built a new vocabulary, called SCIVOCAB, and was trained on a scientific corpus consisting of papers from the computer science and biomedical domains. Following SciBERT, we employ the same input representation, constructed by summing the token embedding, segment embedding, and position embedding. The tokens “[CLS]” and “[SEP]” are added at the beginning and end, respectively, of each instance. In addition, when tokenizing words, WordPiece embedding [23] is used with SCIVOCAB to separate words and to mark split word pieces with “##”.
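As an illustration of this input construction, the sketch below uses the Hugging Face transformers library and the publicly released allenai/scibert_scivocab_uncased checkpoint (our choice of tooling, not necessarily the authors' implementation):

```python
# Sketch of tokenization and input construction with WordPiece and SCIVOCAB.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

instance = "Two cases of chemical entity: a cause for disease"
encoded = tokenizer(instance,
                    max_length=512,          # document-level instances are long
                    padding="max_length",
                    truncation=True,
                    return_tensors="pt")

print(tokenizer.tokenize(instance))          # WordPiece splits are marked with "##"
print(encoded["input_ids"].shape)            # (1, 512): [CLS] ... [SEP] plus padding
```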

SciBERT is made up of $N$ transformer stacks. Transformer stack $k$, denoted $\mathrm{Transformer}_k$, has its own parameters and consists of 2 components: multihead attention and a feed-forward network (FFN).

$S_k = \mathrm{Transformer}_k(S_{k-1})$ (1)

where $S_k \in \mathbb{R}^{n \times d}$ is the output of transformer stack $k$, $S_0$ is the input representation of the text sequence $X \in \mathbb{R}^{n \times d}$, $n$ is the length of the text sequence, and $d$ is the dimension of the input representation. All positions in the text sequence share the same transformer parameters.

Multihead attention applies self-attention, or scaled dot-product attention, multiple times. Through mappings of the query $Q$, key $K$, and value $V$, scaled dot-product attention obtains a weighted sum of the values. In the self-attention computation, $Q, K, V \in \mathbb{R}^{n \times d}$ are the same matrix, namely the input of the transformer stack.

Figure 3. The architecture of the model.

Instead of applying scaled dot-product attention once, multihead attention linearly projects the query $Q$, key $K$, and value $V$ $h$ times with different learned linear projections to $n \times l$ dimensions, where $l = d/h$ and $h$ is the number of heads. The reason is that multihead attention can form different representation subspaces at different positions, learn more semantic information, and better capture long-distance dependencies.

$o_i = \mathrm{softmax}\left(\frac{Q W_i^Q (K W_i^K)^\top}{\sqrt{d_h}}\right) V W_i^V$ (2)

where the projections are parameter matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times l}$, $o_i \in \mathbb{R}^{n \times l}$ is the output of head $i$, and $\sqrt{d_h}$ (with $d_h = d/h = l$) is a scale factor that prevents the result of the dot-product attention from becoming too large.

Then, the outputs of the individual attention heads are concatenated, denoted $O \in \mathbb{R}^{n \times d}$. The input and output of the multihead attention are connected by a residual connection, and layer normalization, denoted $\mathrm{LN}$, is applied to the output of the residual connection.

$O = [o_1; \ldots; o_h]$ (3)

$M = \mathrm{LN}(S_{k-1} + O)$ (4)

where $M \in \mathbb{R}^{n \times d}$.
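Equations (2) to (4) can be sketched in PyTorch as follows (a simplified illustration: the $h$ projections are packed into single linear layers, and the output projection used in the released BERT/SciBERT implementations is omitted):

```python
# Minimal PyTorch sketch of equations (2)-(4): h scaled dot-product heads,
# concatenation, a residual connection, and layer normalization.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0
        self.h, self.l = h, d // h                    # l = d / h
        self.w_q = nn.Linear(d, d, bias=False)        # packs the h matrices W_i^Q
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)

    def forward(self, s):                             # s: (n, d), output of the previous stack
        n, d = s.shape
        def split(x):                                 # (n, d) -> (h, n, l)
            return x.view(n, self.h, self.l).transpose(0, 1)
        q, k, v = split(self.w_q(s)), split(self.w_k(s)), split(self.w_v(s))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.l)   # (h, n, n)
        o = F.softmax(scores, dim=-1) @ v             # equation (2): one o_i per head
        o = o.transpose(0, 1).reshape(n, d)           # equation (3): concatenate the heads
        return self.norm(s + o)                       # equation (4): M = LN(S_{k-1} + O)

attn = MultiHeadSelfAttention(d=768, h=12)            # SciBERT sizes: l = 64 per head
print(attn(torch.randn(512, 768)).shape)              # torch.Size([512, 768])
```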

The second component of the transformer stack is a 2-layer position-wise FFN. Equivalently, it can be described as 2 convolutions with kernel size 1.

$S_k = \mathrm{ReLU}(M W_1 + b_1) W_2 + b_2$ (5)

where $W_1 \in \mathbb{R}^{d \times m}$, $b_1 \in \mathbb{R}^{n \times m}$, $W_2 \in \mathbb{R}^{m \times d}$, and $b_2 \in \mathbb{R}^{n \times d}$. Each row of $b_1$ and of $b_2$ is the same (the biases are shared across positions), and $m = 4d$.
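The equivalence mentioned above between the position-wise FFN of equation (5) and 2 convolutions with kernel size 1 can be illustrated as follows (the two modules below are initialized independently, so they compute the same kind of mapping rather than numerically identical outputs):

```python
# Sketch of the position-wise FFN of equation (5) and its convolutional view.
import torch
import torch.nn as nn

d, m, n = 768, 4 * 768, 512                 # m = 4d, as stated above
ffn_linear = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, d))
ffn_conv = nn.Sequential(nn.Conv1d(d, m, kernel_size=1), nn.ReLU(),
                         nn.Conv1d(m, d, kernel_size=1))

M = torch.randn(n, d)                       # output of the attention sublayer
out_linear = ffn_linear(M)                  # (n, d); the same weights act on every position
out_conv = ffn_conv(M.t().unsqueeze(0)).squeeze(0).t()   # same shape via the convolution view
print(out_linear.shape, out_conv.shape)     # torch.Size([512, 768]) twice
```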

The final layer is an FFN that acts as the relation classifier. Its input is the final transformer output at the token “[CLS]”.

$c = W_{\mathrm{pred}} \, s_1$ (6)

where $W_{\mathrm{pred}} \in \mathbb{R}^{o \times d}$ is the weight matrix, $o$ is the number of relation labels, and $s_1 \in \mathbb{R}^{d}$ is the final output at the token “[CLS]”.
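Putting the pieces together, a hedged sketch of the classifier of equation (6) on top of the pretrained encoder might look like this (assuming the Hugging Face transformers API and the allenai/scibert_scivocab_uncased checkpoint; the class name and wiring are ours, not the authors' released code):

```python
# Sketch of the classification head of equation (6): a linear layer (W_pred)
# applied to the final hidden state of the "[CLS]" token.
import torch
import torch.nn as nn
from transformers import AutoModel

class DocRelationClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)  # W_pred

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        s1 = hidden[:, 0]                    # final output at the "[CLS]" position
        return self.classifier(s1)           # logits c; cross-entropy loss during fine-tuning
```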

Results
Overview

In this section, we first describe the experimental datasets and provide the experiment settings. Then, we compare the performance of SciBERT with that of existing methods and validate the effectiveness of the pretrained self-attention structure on the document-level dataset through visualization of the multihead attention. Finally, experimenting on different datasets, including sentence-level and document-level corpora, we compare various biomedical entity pretreatments and analyze which preprocessing is better for the self-attention structure.

Datasets

Table 1 shows the statistics of the CDR [3], protein-protein interactions affected by mutations (PPIm) [24], DDI [1], and CPR [2] datasets. The CDR and PPIm datasets are document-level annotated corpora, and the DDI and CPR datasets are sentence-level annotated corpora, which are only used to discuss the advantages and disadvantages of different biomedical entity pretreatments.

Table 1. Descriptions of the datasets.
Dataset, Types              Training set    Development set    Test set
CDRa
    Documents               500             500                500
    Positive                1038            1012               1066
    Negative                4324            4134               4374
PPImb
    Documents               597             N/Ac               635
    Positive                750             N/Ac               869
    Negative                1401            N/Ac               1717
DDId
    Sentences               18,872          N/Ac               3843
    Positive                3964            N/Ac               970
    Negative                14,908          N/Ac               2873
    Int                     183             N/Ac               96
    Advice                  815             N/Ac               219
    Effect                  1654            N/Ac               357
    Mechanism               1312            N/Ac               298
CPRe
    Sentences               6437            3558               5744
    Positive                4172            2427               3469
    Negative                2265            1131               2275
    CPR:3                   777             552                667
    CPR:4                   2260            1103               1667
    CPR:5                   173             116                198
    CPR:6                   235             199                293
    CPR:9                   727             457                644
aCDR: chemical-disease relation.

bPPIm: protein-protein interaction affected by mutations.

cDevelopment sets do not exist in the PPIm and DDI datasets.

dDDI: drug-drug interaction.

eCPR: chemical-protein relation.

The CDR dataset is used to extract CID relations and is a 2-label classification task. The PPIm dataset was released to extract protein-protein interactions affected by genetic mutations, which is also a 2-label classification task. Aimed at extracting drug-drug interactions, the DDI task classifies instances into 5 relation types: the int, advice, effect, mechanism, and negative types. For the DDI dataset, we adopted some rules to filter out negative sentences, as described by Quan et al [25]. With the purpose of extracting chemical-protein relations, the CPR dataset is labeled with 10 types of chemical-protein relations, 5 of which are used for evaluation; including the negative class, the relations are thus classified into 6 categories.

Due to the size of the CDR dataset, we merged the training and development sets to construct the training set. After preprocessing the CDR and PPIm datasets, we counted the average number of sentences per instance, average number of tokens per instance, and average number of tokens per sentence in the constructed instance set. Table 2 shows the statistics of the constructed instance set.

Experiment Setup

We employed the parameters of the uncased SciBERT model with the vocabulary SCIVOCAB and fine-tuned the model on the preprocessed datasets. The model parameters are as follows: SciBERT-uncased: k = 12 transformer stacks, h = 12 attention heads, d = 768, and m = 3072.

Because the length of instances differs across datasets, the input dimensions of the corresponding model for each dataset are different. For the CDR and PPIm datasets, the length of the input sequence is set to 512, and the batch size is set to 6. For the DDI dataset, the length of the input sequence is set to 150, and the batch size is set to 32. For the CPR dataset, the length of the input sequence is set to 200, and the batch size is set to 23. The number of training epochs for all models is set to 3. All results are averaged across 5 runs. For consistency of comparisons, we merged the training and development sets to train the models.
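For clarity, the settings stated above can be collected into a single configuration sketch (ours, not the authors' released code):

```python
# Per-dataset settings as stated in the text above.
CONFIGS = {
    "CDR":  {"max_seq_length": 512, "batch_size": 6,  "epochs": 3},
    "PPIm": {"max_seq_length": 512, "batch_size": 6,  "epochs": 3},
    "DDI":  {"max_seq_length": 150, "batch_size": 32, "epochs": 3},
    "CPR":  {"max_seq_length": 200, "batch_size": 23, "epochs": 3},
}
MODEL = {"stacks": 12, "heads": 12, "hidden_size": 768, "ffn_size": 3072}  # uncased SciBERT
N_RUNS = 5   # reported results are averaged across 5 runs
```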

Table 2. Statistics of the constructed instance sets of the chemical-disease relation (CDR) and protein-protein interaction affected by mutations (PPIm) datasets.
Dataset, Types                  Training set    Test set
CDR with preprocessing
    Instances                   10,407          5418
    Positive                    1947            1042
    Negative                    8460            4376
    Sentences per instance      11.1            12.1
    Tokens per instance         161.5           168.9
    Tokens per sentence         14.6            14.0
PPIm with preprocessing
    Instances                   2151            2586
    Positive                    750             869
    Negative                    1401            1717
    Sentences per instance      9.0             8.8
    Tokens per instance         169.6           186.6
    Tokens per sentence         18.7            21.2

Comparison of the Pretrained Self-Attention Structure With Other Methods

For the CDR dataset, we compared our method with 5 state-of-the-art models that do not use any knowledge bases. Zhou et al [9] proposed a method based on feature engineering and long short-term memory. Gu et al [8] combined a CNN with maximum entropy. Li et al [13] proposed a recurrent piecewise CNN. The bi-affine relation attention network [14] incorporated an attention network, multi-instance learning, and multitask learning. A graph CNN with labeled edges [12] was applied to document-level dependency graphs. For the PPIm dataset, we compared our method with 4 models; because few studies have focused on the PPIm dataset, these 4 models are not necessarily state of the art. Table 3 shows the results of the comparisons.

As shown in Table 3, compared with other approaches, our method with the replacement pretreatment greatly improved the precision. The F1 score is 1.9% higher than the best previous result, from Verga et al [14], on the CDR test set. Our method also performs strongly on the PPIm dataset. This shows that a pretrained self-attention structure is suitable for a document-level dataset.

Table 3. Performance of the chemical-disease relation (CDR) and protein-protein interactions affected by mutations (PPIm) test datasets compared with state-of-the-art methods.
Dataset, Model                  Pa, %    Rb, %    F1, %
CDR
    LSTMc [9]                   55.6     68.4     61.3
    CNNd [8]                    55.7     68.1     61.3
    RPCNNe [13]                 55.2     63.6     59.1
    BRANf [14]                  55.6     70.8     62.1
    GCNNg [12]                  52.8     66.0     58.6
    Our method                  65.5     62.6     64.0
PPIm
    SVMh [26]                   32.0     34.0     33.0
    CNN (without KBi) [27]      38.2     37.3     37.8
    MNMj [28]                   40.3     32.3     35.9
    MNM+Rule [28]               38.0     37.0     37.5
    Our method                  83.5     90.4     86.8

aP: precision.

bR: recall.

cLSTM: long short-term memory.

dCNN: convolutional neural network.

eRPCNN: recurrent piecewise convolutional neural network.

fBRAN: bi-affine relation attention network.

gGCNN: graph convolutional neural network.

hSVM: support vector machine.

iKB: knowledge base.

jMNM: memory neural network.

Effects of Pretreatment Methods for Biomedical Entities

As described earlier, there are 2 pretreatment methods: the replacement method, which replaces biomedical entities with uniform words, and the addition method, which adds extra tags on the left and right sides of biomedical entities. We conducted experiments with the CDR, PPIm, DDI, and CPR datasets. The comparison of the 2 pretreatments for biomedical entities is shown in Table 4.

For each dataset, the recall and F1 score obtained with our model using the replacement method were higher than those obtained with the addition method, especially for the CDR dataset. The reason is that biomedical entities are complicated, and most are compound words. For the pretrained self-attention structure, the word embeddings of biomedical entities are hard to learn from small biomedical datasets. As a consequence, replacing the target entities with uniform words helps the model understand the target entities and pay more attention to their context.

Table 4. Comparison of 2 pretreatments (addition and replacement) for biomedical entities using our method.
Dataset, Types      Addition method                 Replacement method
                    Pa, %    Rb, %    F1, %         P, %     R, %     F1, %
CDRc
    Positive        67.4     54.8     60.4          65.5     62.6     64.0
PPImd
    Positive        79.3     91.5     84.8          83.5     90.4     86.8
DDIe
    Int             74.8     46.2     57.1          76.2     46.9     58.0
    Advise          87.2     84.7     85.9          88.6     89.0     88.8
    Effect          77.2     82.1     79.5          77.0     82.6     79.7
    Mechanism       84.8     80.4     82.5          82.1     86.0     84.0
    All             81.6     78.6     80.0          81.2     81.3     81.4
CPRf
    CPR:3           73.5     80.3     76.7          75.4     79.5     77.4
    CPR:4           84.4     88.8     86.6          83.7     90.4     86.9
    CPR:5           80.7     82.0     81.3          81.2     86.5     83.7
    CPR:6           84.0     89.4     86.7          86.5     88.2     87.3
    CPR:9           76.2     86.9     81.2          79.5     90.1     84.5
    All             80.4     86.5     83.3          81.4     87.9     84.5

aP: precision.

bR: recall.

cCDR: chemical-disease relation.

dPPIm: protein-protein interaction affected by mutations.

eDDI: drug-drug interaction.

fCPR: chemical-protein relation.

Comparison of Different Pretrained Models

BERT and SciBERT are pretrained models that have the same self-attention structure. The difference between the two is that BERT is pretrained on general-domain text such as Wikipedia, whereas SciBERT is pretrained on a large quantity of scientific papers from the computer science and biomedical domains. Table 5 presents the comparison of BERT and SciBERT on the 4 biomedical datasets. As shown in Table 5, SciBERT performs better than BERT; in particular, the F1 score improved by 3.5% on the CDR dataset. Therefore, a model pretrained on a biomedical corpus is beneficial for extracting biomedical relations.

Table 5. Comparison of different pretrained models using our method.
Dataset, Type       BERT                            SciBERT
                    Pa, %    Rb, %    F1, %         P, %     R, %     F1, %
CDRc
    Positive        62.9     58.3     60.5          65.5     62.6     64.0
PPImd
    Positive        79.0     92.2     85.1          83.5     90.4     86.8
DDIe
    Int             69.8     42.7     52.8          76.2     46.9     58.0
    Advise          91.3     89.0     90.1          88.6     89.0     88.8
    Effect          74.1     77.6     75.7          77.0     82.6     79.7
    Mechanism       78.5     80.1     79.3          82.1     86.0     84.0
    All             79.0     77.5     78.2          81.2     81.3     81.4
CPRf
    CPR:3           73.8     76.5     75.1          75.4     79.5     77.4
    CPR:4           81.7     89.7     85.5          83.7     90.4     86.9
    CPR:5           79.3     80.7     79.9          81.2     86.5     83.7
    CPR:6           80.2     84.6     82.2          86.5     88.2     87.3
    CPR:9           76.5     88.2     81.9          79.5     90.1     84.5
    All             79.0     85.9     82.3          81.4     87.9     84.5

aP: precision.

bR: recall.

cCDR: chemical-disease relation.

dPPIm: protein-protein interaction affected by mutations.

eDDI: drug-drug interaction.

fCPR: chemical-protein relation.

Analysis of Each Component of the Method

Data preprocessing (DP) and pretraining means (PTM) are important components of our method; DP aims to alleviate noise, and PTM is designed to solve the long-distance dependencies. We compared the importance of each component of our method with the CDR dataset. Table 6 shows the changes in performance on the CDR dataset when DP and PTM are removed. Removing PTM caused a greater performance drop than removing DP.

Table 6. Performance changes by removing different parts of our model.
CDRa dataset                Pb, %    Rc, %    F1, %    Change, %
Baseline                    65.5     62.6     64.0     N/A
Remove DPd                  67.0     54.3     60.0     –6.3
Remove PTMe                 46.1     39.5     42.6     –33.4
Remove DP and PTM           48.9     31.2     38.1     –40.5

aCDR: chemical-disease relation.

bP: precision.

cR: recall.

dDP: data preprocessing.

ePTM: pretraining means.

Discussion
Principal Findings

To fully illustrate that our model can solve the problem of long-distance dependencies, we used 50 tokens as the unit of instance length and counted the number of positive and negative instances of the CDR test set in each interval, as shown in Table 7. As can be seen from the table, the instance lengths of the test set are concentrated in the range of 50 to 300.

We calculated the precision, recall, and F1 score for each interval of instance length in the test set. The results are shown in Table 8. As can be seen from the table, the model performs well when the instance length is longer than 100, except for instances with lengths of 201 to 250. Therefore, our model can capture long-distance dependencies.
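The length-bucketed evaluation described here can be sketched as follows (a hypothetical helper using scikit-learn; the original analysis may have been computed differently):

```python
# Sketch of computing per-length-interval precision, recall, and F1.
from collections import defaultdict
from sklearn.metrics import precision_recall_fscore_support

def metrics_by_length(lengths, y_true, y_pred, bin_size=50):
    """lengths: token count of each test instance; y_true/y_pred: 0/1 labels."""
    buckets = defaultdict(list)
    for n_tokens, t, p in zip(lengths, y_true, y_pred):
        buckets[(n_tokens - 1) // bin_size].append((t, p))
    for b in sorted(buckets):
        true, pred = zip(*buckets[b])
        prec, rec, f1, _ = precision_recall_fscore_support(
            true, pred, average="binary", zero_division=0)
        print(f"{b * bin_size + 1}-{(b + 1) * bin_size}: "
              f"P={prec:.1%} R={rec:.1%} F1={f1:.1%}")
```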

Table 7. Quantity distribution of the chemical-disease relation test set.
Interval length     Positive    Negative    Sum
0-50                70          344         414
51-100              181         884         1065
101-150             200         845         1045
151-200             160         756         916
201-250             158         663         821
251-300             177         571         748
301-350             56          216         272
351-400             34          64          98
>400                6           33          39
Table 8. Results of each interval length in the test set using our replacement method.
Interval length     Pa, %    Rb, %    F1, %
0-50                54.2     64.3     58.8
51-100              57.5     57.5     57.5
101-150             67.7     64.0     65.8
151-200             64.4     71.2     67.7
201-250             66.2     54.4     59.7
251-300             69.9     69.5     69.7
301-350             70.8     60.7     65.4
351-400             80.0     82.4     81.2
>400                100.0    66.7     80.0

aP: precision.

bR: recall.

To verify that the pretrained self-attention mechanism works as we believe, namely that it can take advantage of the textual context and capture very long-range dependencies to understand the complex semantics of biomedical text, we visualized the output of the token “[CLS]” in the multihead attention of the final transformer stack, as shown in Figure 4.
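A sketch of how such attention weights can be extracted for visualization, assuming the Hugging Face transformers API (output_attentions=True) and the pretrained SciBERT checkpoint rather than the fine-tuned model used in the paper:

```python
# Extract the attention from "[CLS]" to all tokens in the final transformer stack.
import torch
from transformers import AutoModel, AutoTokenizer

name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

enc = tokenizer("Drug-related disease are most often associated with chemical.",
                return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

last_stack = out.attentions[-1][0]            # (heads, seq_len, seq_len), final stack
cls_to_tokens = last_stack[:, 0, :]           # each head's attention from "[CLS]"
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for head in cls_to_tokens:                    # one row of weights per attention head
    top = head.topk(5).indices
    print([tokens[i] for i in top])           # the tokens "[CLS]" attends to most
```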

As seen from the token colors, the token “[CLS]” is related to the following tokens: “chemical,” “disease,” “drug,” “related,” “bilateral,” “[CLS],” and “[SEP]”. The 12 different colors refer to the different attention heads. The more colors a token has and the darker they are, the more relevant the token is. Lines between two tokens denote a correlation between them; their visibility depends on the attention weights of the heads.

Figure 4. Visualization of the output of token “[CLS]” in the multihead attention of the final transformer stack.

From the perspective of semantic analysis, there are 2 places in this example that reflect a relationship between a disease and a chemical: “Drug-related disease are most often associated with chemical.” and “Bilateral disease after the use of entity, without concurrent chemical use, have never been reported.” In the first sentence, the relation between chemical and disease is mainly determined by the following tokens: “associated,” “chemical,” “disease,” “related,” and “drug.” In the second sentence, the relation between chemical and disease is mainly determined by the following tokens: “reported,” “never,” “chemical,” “disease,” “without,” and “concurrent.” The token “[CLS]” is related to most of these keywords in both sentences. Therefore, the pretrained self-attention structure can take advantage of the textual context and capture very long-range dependencies from document-level instances. On the other hand, the distribution of the different colors shows that multihead attention can form diverse representation subspaces to learn more complicated semantics.

However, from the gradation of the colors, the relationship between the token “[CLS]” and the keywords is not strong enough; the token “[CLS]” is not highly correlated with the token “disease” in this instance. We visualized the output of the tokens “chemical” and “disease” in the final multihead attention, as shown in Figures 5 and 6. As seen in these figures, the tokens “chemical” and “disease” in the sentences capture more local information compared with the token “[CLS].” It may be inferred that, for document-level relation extraction, designing a special network on top of the final layer of the pretrained self-attention structure to capture the relationships between different target entities would be better than applying a dense layer.

Figure 5. Visualization of the output of token “disease” in the final multihead attention.
Figure 6. Visualization of the output of token “chemical” in the final multihead attention.

Conclusions

For a document-level annotated dataset, instead of dividing the dataset, we considered all target entity pairs as a whole and applied a pretrained self-attention structure to extract biomedical relations. The results and analysis show that the pretrained self-attention structure extracted relations of multiple entity pairs in a document. Through the visualization of the transformer, we verified that the pretrained self-attention structure can capture long-distance dependencies and learn complicated semantics. Furthermore, we conclude that replacement of biomedical entities benefits biomedical relation extraction, especially for document-level relation extraction.

However, this method still has some issues. In future work, we plan to design a more effective network to capture local relations between biomedical entities and improve our method.

Acknowledgments

This study was funded by the Natural Science Foundation of Guangdong Province of China (2015A030308017), National Natural Science Foundation of China (61976239), and Innovation Foundation of High-end Scientific Research Institutions of Zhongshan City of China (2019AG031).

Conflicts of Interest

None declared.

  1. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res 2014 Jan;42(Database issue):D1091-D1097 [FREE Full text] [CrossRef] [Medline]
  2. Krallinger M, Rabal O, Akhondi S. Overview of the BioCreative VI chemical-protein interaction track. 2017 Presented at: Proceedings of the Sixth BioCreative Challenge Evaluation Workshop; 2017; Bethesda, MD p. 141-146.
  3. Li J, Sun Y, Johnson RJ, Sciaky D, Wei C, Leaman R, et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford) 2016;2016 [FREE Full text] [CrossRef] [Medline]
  4. Zhang Y, Zheng W, Lin H, Wang J, Yang Z, Dumontier M. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics 2018 Mar 01;34(5):828-835 [FREE Full text] [CrossRef] [Medline]
  5. Sun C, Yang Z, Wang L. Hierarchical Recurrent Convolutional Neural Network for Chemical-protein Relation Extraction from Biomedical Literature. 2018 Mar Presented at: IEEE International Conference on Bioinformatics and Biomedicine (BIBM); December 2018; Madrid, Spain p. 3-6. [CrossRef]
  6. Lim S, Kang J. Chemical-gene relation extraction using recursive neural network. Database (Oxford) 2018 Jan 01;2018 [FREE Full text] [CrossRef] [Medline]
  7. Peng Y, Rios A, Kavuluru R, Lu Z. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018 Jan 01;2018 [FREE Full text] [CrossRef] [Medline]
  8. Gu J, Sun F, Qian L, Zhou G. Chemical-induced disease relation extraction via convolutional neural network. Database (Oxford) 2017 Jan 01;2017(1) [FREE Full text] [CrossRef] [Medline]
  9. Zhou H, Deng H, Chen L, Yang Y, Jia C, Huang D. Exploiting syntactic and semantics information for chemical-disease relation extraction. Database (Oxford) 2016;2016 [FREE Full text] [CrossRef] [Medline]
  10. Peng N, Poon H, Quirk C, Toutanova K, Yih W. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 2017;5:101-115.
  11. Song L, Zhang Y, Wang Z, Gildea D. N-ary relation extraction using graph-state LSTM. 2018 Oct Presented at: Proceedings of the Conference on Empirical Methods in Natural Language Processing; 2018; Brussels, Belgium.
  12. Sahu SK, Christopoulou F, Miwa M, Ananiadou S. Inter-sentence relation extraction with document-level graph convolutional neural network. 2019 Presented at: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; July 2019; Florence, Italy. [CrossRef]
  13. Li H, Yang M, Chen Q, Tang B, Wang X, Yan J. Chemical-induced disease extraction via recurrent piecewise convolutional neural networks. BMC Med Inform Decis Mak 2018 Jul 23;18(Suppl 2):60 [FREE Full text] [CrossRef] [Medline]
  14. Verga P, Strubell E, McCallum A. Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).: Association for Computational Linguistics; 2018 Presented at: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; June 6, 2018; New Orleans, LA p. 872-884. [CrossRef]
  15. Wei C, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford) 2016;2016 [FREE Full text] [CrossRef] [Medline]
  16. Gu J, Qian L, Zhou G. Chemical-induced disease relation extraction with various linguistic features. Database (Oxford) 2016;2016 [FREE Full text] [CrossRef] [Medline]
  17. Panyam NC, Verspoor K, Cohn T, Ramamohanarao K. Exploiting graph kernels for high performance biomedical relation extraction. J Biomed Semantics 2018 Jan 30;9(1):7 [FREE Full text] [CrossRef] [Medline]
  18. Stratas NE. A double-blind study of the efficacy and safety of dothiepin hydrochloride in the treatment of major depressive disorder. J Clin Psychiatry 1984 Nov;45(11):466-469. [Medline]
  19. Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. 2019 Presented at: 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing; 2019; Hong Kong. [CrossRef]
  20. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
  21. Tang G, Müller M, Rios A, Sennrich R. Why self-attention? a targeted evaluation of neural machine translation architectures. 2018 Nov Presented at: Conference on Empirical Methods in Natural Language Processing; 2018; Brussels, Belgium.
  22. Ward DI. Two cases of amisulpride overdose: a cause for prolonged QT syndrome. Emerg Med Australas 2005 Jun;17(3):274-276. [CrossRef] [Medline]
  23. Wu Y, Schuster M, Chen Z, et al. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  24. Dogan R, Chatr-aryamontri A. BioCreative VI Precision Medicine Track: creating a training corpus for mining protein-protein interactions affected by mutations. 2017 Presented at: BioNLP 2017 Workshop; 2017; Vancouver, Canada. [CrossRef]
  25. Quan C, Hua L, Sun X, Bai W. Multichannel Convolutional Neural Network for Biological Relation Extraction. Biomed Res Int 2016;2016:1850404 [FREE Full text] [CrossRef] [Medline]
  26. Fan Z, Soldaini L, Cohan A, Goharian N. Relation Extraction for Protein-protein Interactions Affected by Mutations. 2018 Aug Presented at: BCB '18th ACM International Conference on Bioinformatics, Computational Biology Health Informatics; 2018; Washington DC USA. [CrossRef]
  27. Tran T, Kavuluru R. An end-to-end deep learning architecture for extracting protein-protein interactions affected by genetic mutations. Database (Oxford) 2018 Jan 01;2018:1-13 [FREE Full text] [CrossRef] [Medline]
  28. Zhou H, Liu Z, Ning S, Yang Y, Lang C, Lin Y, et al. Leveraging prior knowledge for protein-protein interaction extraction with memory network. Database (Oxford) 2018 Jan 01;2018 [FREE Full text] [CrossRef] [Medline]


BRAN: bi-affine relation attention network.
CDR: chemical-disease relation.
CID: chemical-induced disease.
CNN: convolutional neural network.
CPR: chemical-protein relation.
DDI: drug-drug interaction.
DP: data preprocessing.
FFN: feed-forward network.
GCNN: graph convolutional neural network.
KB: knowledge base.
MNM: memory neural network.
P: precision.
PPIm: protein-protein interactions affected by mutations.
PTM: pretraining means.
R: recall.
RNN: recurrent neural network.
RPCNN: recurrent piecewise convolutional neural network.
SVM: support vector machine.


Edited by T Hao, B Tang, Z Huang; submitted 30.12.19; peer-reviewed by Z Zhao; comments to author 14.02.20; revised version received 02.03.20; accepted 19.03.20; published 29.05.20

Copyright

©Xiaofeng Liu, Jianye Fan, Shoubin Dong. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 29.05.2020.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.