This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Family history information, including family members, the side of the family of each member, the living status of family members, and observations of family members, plays an important role in disease diagnosis and treatment. Family member information extraction aims to extract family history information from semistructured/unstructured text in electronic health records (EHRs). It is a challenging task involving named entity recognition (NER) and relation extraction (RE), where named entities refer to family members, living status, and observations, and relations link family members to their living status and to their observations.
This study aimed to introduce the system we developed for the 2019 n2c2/OHNLP track on family history extraction, which can jointly extract entities and relations about family history information from clinical text.
We proposed a novel graph-based model with biaffine attention for family history extraction from clinical text. In this model, we first designed a graph to represent family history information, that is, to represent NER and RE regarding family history in a unified way, and then introduced a biaffine attention mechanism to extract family history information from clinical text. A convolutional neural network (CNN)-Bidirectional Long Short Term Memory network (BiLSTM) and Bidirectional Encoder Representation from Transformers (BERT) were used to encode the input sentence, and a biaffine classifier was used to extract family history information. In addition, we developed a postprocessing module to adjust the results. A system based on the proposed method was developed for the 2019 n2c2/OHNLP shared task track on family history information extraction.
Our system ranked first in the challenge, with F1 scores of 0.8745 on the NER subtask and 0.6810 on the RE subtask. After the challenge, we further fine-tuned the parameters and improved the F1 scores of the two subtasks to 0.8823 and 0.7048, respectively.
The experimental results showed that the system based on the proposed method can extract family history information from clinical text effectively.
Family history information plays an important role in the diagnosis and treatment of diseases, especially genetic disorders. Family history information is always embedded in electronic health records (EHRs) in a semistructured/unstructured format, which needs to be unlocked by natural language processing (NLP) technology.
In order to promote research on family history information extraction, Harvard Medical School and Mayo Clinic organized national NLP challenges on family history information extraction in 2018 and 2019. The family history information extraction task includes the following two subtasks: (1) recognizing family members, living status, and observations and (2) determining which family members the recognized living status and observations belong to. These correspond to two fundamental NLP tasks, namely named entity recognition (NER) and relation extraction (RE). The NER task is usually regarded as a sequence labeling task, while the RE task is treated as a subsequent classification task, and the two are typically tackled by pipeline methods.
For the NER task, traditional machine learning methods, such as hidden Markov model (HMM), conditional random field (CRF) [
A number of joint learning methods have been proposed [
In this study, we proposed a novel graph-based model with biaffine attention. Inspired by the dependency parsing task [
There were two subtasks in the 2019 n2c2/OHNLP challenge on family history information extraction. For subtask 1, we needed to recognize family members with the side of the family, living status mentions in clinical text, and observations in the family history. All family members can be normalized to the standard forms listed in the table below.
For subtask 2, we needed to extract the relations between family members, observations, and living status. Living status represents the health status of family members and has two properties, “Alive” and “Healthy.” Each property was measured by a real-valued score (yes: 2, NA: 1, and no: 0). The total living status score of a family member was the alive score multiplied by the healthy score. We also needed to predict the negation information (Negated and Non_Negated) for each observation, that is, to judge whether the family member had a certain disease or not.
Normalized family member names.
Degree | Normalized family member names |
1 | Father, Mother, Parent, Sister, Brother, Daughter, Son, and Child |
2 | Grandmother, Grandfather, Grandparent, Cousin, Sibling, Aunt, and Uncle |
We conducted experiments on the corpus provided by the 2018 and 2019 n2c2/OHNLP shared task tracks on family history information extraction. The training set of the 2019 n2c2/OHNLP shared task together with the test set of the 2018 BioCreative/OHNLP shared task was used as the final training set of 149 EHRs for model training. The test set of the 2019 n2c2/OHNLP shared task, including 117 EHRs, was used for model testing. During model training, we randomly selected a development set of 14 EHRs from the training set for parameter optimization. The statistics of the corpus used in this study are shown in the table below.
Detailed data set statistics.
Item | Training set, n | Development set, n | Test set, n |
Document | 149 | 14 | 117 |
Sentence | 770 | 71 | 644 |
FMa: overall | 1128 | 94 | —b |
FM: NAc | 631 | 55 | — |
FM: maternal | 272 | 24 | — |
FM: paternal | 225 | 15 | — |
OBd | 1439 | 127 | — |
LSe | 596 | 52 | — |
FM-OB: overall | 1064 | 97 | — |
FM-OB: NA-OB | 575 | 57 | — |
FM-OB: maternal-OB | 265 | 23 | — |
FM-OB: paternal-OB | 224 | 17 | — |
FM-LS: overall | 605 | 53 | — |
FM-LS: NA-LS | 334 | 29 | — |
FM-LS: maternal-LS | 145 | 12 | — |
FM-LS: paternal-LS | 126 | 12 | — |
aFM: family member.
bNot available.
cNA: not applicable.
dOB: observation.
eLS: living status.
Similar to the dependency parsing task, where each token has a head token, we transformed the family history information extraction task into a dependency parsing problem: a dummy root (denoted by “ROOT”) was prepended to each sentence, and arcs denoted links between two tokens. In the “dependency parsing tree” of a sentence, the tokens in each entity were connected by “app” arcs from right to left, two entities with a relation were connected by linking their rightmost tokens with an arc labeled with the entity type, and tokens not in any entity were connected to the “ROOT” node by “NULL” arcs.
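This encoding can be sketched as follows. This is a minimal illustration with hypothetical data structures (entity spans as 1-based inclusive token indices, with index 0 reserved for ROOT; relations as pairs of entity indices), not the authors' implementation:

```python
def build_graph(tokens, entities, relations):
    """Encode entities and relations as one (head, label) pair per token.
    tokens: list of words (1-based positions; 0 is the dummy ROOT);
    entities: list of (start, end, type) spans, end inclusive;
    relations: list of (head_entity_idx, dep_entity_idx) pairs."""
    n = len(tokens)
    heads = [0] * (n + 1)         # default: attach every token to ROOT
    labels = ["NULL"] * (n + 1)   # default label for non-entity tokens
    for start, end, etype in entities:
        for i in range(start, end):      # intra-entity "app" arcs, head on the right
            heads[i], labels[i] = i + 1, "app"
        heads[end], labels[end] = 0, etype   # rightmost token carries the entity type
    for head_ent, dep_ent in relations:
        # related entities: link their rightmost tokens; the arc keeps the type label
        heads[entities[dep_ent][1]] = entities[head_ent][1]
    return heads[1:], labels[1:]
```

For the sentence "His father has diabetes" with the family member "father" related to the observation "diabetes," every token receives exactly one head and one label, so NER and RE are predicted in one unified structure.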
Example of using a graph-based schema to represent family history information.
As shown in
Overview architecture of our model.
Given a sentence
where CNN [
Considering the
where
Assume that
Thereafter, the best head of token
For each unlabeled arc, we need to determine its label. Assume that
where
We also adopted the softmax function to compute the probability distribution
Thereafter, the best label of the arc from token
The total loss function was set as
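The biaffine arc scoring and head selection described above can be illustrated with a minimal NumPy sketch in the style of deep biaffine attention (Dozat and Manning). The parameter names, shapes, and tanh projections are assumptions for illustration, not the exact model:

```python
import numpy as np

def biaffine_arc_scores(H, W_dep, W_head, U, u):
    """Score every (dependent i, candidate head j) pair.
    H: (n, d) encoder outputs with ROOT at row 0; W_dep/W_head project
    tokens into dependent/head spaces; U is the bilinear term and u a
    head-only bias. All names and shapes are illustrative assumptions."""
    D = np.tanh(H @ W_dep)             # (n, m) dependent representations
    E = np.tanh(H @ W_head)            # (n, m) head representations
    return D @ U @ E.T + E @ u         # (n, n): bilinear score + head bias

def predict_heads(scores):
    # softmax over candidate heads per dependent, then take the argmax
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

A second biaffine classifier with the same shape of computation would then score the label of each selected arc.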
We designed a rule-based postprocessing module to adjust the outputs of our model. It included the following five parts:
1. Converting the output to entities and relations.
(1) Combining all tokens connected by “app” arcs to form entities and assigning them the label of their last token.
(2) If there was an arc between two entities, but not an “app” arc, there was a relation between them.
2. Normalizing family members.
(1) Converting family member entities into normalized forms as shown in
(2) Excluding unnecessary family members. For example, a patient’s nonblood relatives, such as “father” in the section “partner’s father,” should be removed. If the family member “father” belonged to the section “partner’s father,” we removed “father” since father-in-law is not among the normalized family member names.
3. Determining the side of family members when using the strategy of three types of entities.
(1) If a family member was a first-degree relative, the side of the family was set as “NA.”
(2) If a family member was in the section “maternal family history” or “paternal family history,” the side of the family was set as maternal or paternal.
(3) If there was an indicator (“maternal” or “paternal”) near a family member, the side of the family was determined by the indicator.
(4) Otherwise, the side of the family of a family member was set as “NA.”
4. Determining the living status score of family members following the work of Shi et al [
(1) Determining the scores of the properties “Alive” and “Healthy” of a family member by searching for the keywords listed in the table below.
(2) The total living status score was determined according to the alive score and healthy score. For a relative with “Alive=Yes” and “Healthy=Yes,” for example, the living status score should be 4.
5. Determining the negation information of observations.
(1) Determining the negation information of an observation through searching keywords (no, never, not, none, negative, neither, nor, unremarkable, and deny) from the observation’s context. If the context of an observation contained a keyword mentioned above, we set its negation information as “Negated;” otherwise, it was set as “Non_Negated.”
(2) Reversing the negation information of an observation if there were specific phrases, such as “apart from” and “except for,” in the observation’s context. For example, the negation information of the observation entity “Meniere disease” in “there is no history of hearing loss apart from the father's history of Meniere disease” was set as “Non_Negated” rather than “Negated.”
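Rule 5 can be sketched roughly as follows; the context is simplified to a plain string, the keyword matching ignores inflection (e.g., "denies" would not match "deny"), and the function name is hypothetical:

```python
NEG_KEYWORDS = {"no", "never", "not", "none", "negative",
                "neither", "nor", "unremarkable", "deny"}
REVERSAL_PHRASES = ("apart from", "except for")

def negation_status(context):
    """Return 'Negated' or 'Non_Negated' for an observation given its
    context window; a simplified sketch of the two rules above."""
    text = context.lower()
    negated = any(w in text.split() for w in NEG_KEYWORDS)
    # phrases like "apart from" flip the polarity of the negation cue
    if negated and any(p in text for p in REVERSAL_PHRASES):
        return "Non_Negated"
    return "Negated" if negated else "Non_Negated"
```

On the example from rule (2), the cue "no" fires but "apart from" reverses it, so "Meniere disease" stays "Non_Negated."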
Keywords used to determine the properties “Alive” and “Healthy.”
Property | Keywords |
Alive: Yes=2 | Alive and living |
Alive: No=0 | Dead, die, deceased, death, died, stillborn, and passed away |
Healthy: Yes=2 | Good, health, without problems, healthy, and well |
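Under the scoring scheme above (yes: 2, NA: 1, no: 0; total score = alive score × healthy score), a rough keyword-matching sketch might look like this. The bare substring matching and the yes-before-no precedence are simplifications of the actual rules:

```python
ALIVE_YES = ("alive", "living")
ALIVE_NO = ("dead", "die", "deceased", "death", "died", "stillborn", "passed away")
HEALTHY_YES = ("good", "health", "without problems", "healthy", "well")

def property_score(text, yes_words, no_words=()):
    """yes: 2, no: 0, otherwise NA: 1 (per the task's scoring scheme)."""
    t = text.lower()
    if any(w in t for w in yes_words):
        return 2
    if any(w in t for w in no_words):
        return 0
    return 1

def living_status_score(text):
    """Total living status score = alive score * healthy score."""
    return property_score(text, ALIVE_YES, ALIVE_NO) * property_score(text, HEALTHY_YES)
```

A relative described as "alive and healthy" scores 2 × 2 = 4, while a deceased relative scores 0 regardless of the healthy score.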
The hyperparameters used in our experiments are listed in the table below.
We first investigated our model in the following two settings: (1) a pipeline model that tackled unlabeled arc prediction and arc label prediction separately and (2) a joint model that tackled unlabeled arc prediction and arc label prediction simultaneously. The joint model predicted the arc and label of each token jointly. The pipeline model first trained one model to predict the head of each token and then trained another model to predict the label of each arc according to the predicted heads. Thereafter, we compared our model with the BERT-based model using the same architecture as that of the model by Shi et al [
where TP denotes the number of true-positive samples, FP denotes the number of false-positive samples, and FN denotes the number of false-negative samples. We used the tool provided by the organizers [
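The metric definitions above can be computed directly (a standard formulation for illustration, not the organizers' evaluation tool):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positives (tp),
    false positives (fp), and false negatives (fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```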
Major hyperparameters.
Parameter | Value |
BiLSTMa size | 256 |
Arc MLPb size | 500 |
Label MLP size | 100 |
BERTc size | 768 |
Char embedding size | 25 |
CNNd kernel size | (3, 4, 5) |
Char-level CNN size | 50 |
Dropout | 0.5 |
Optimizer | Adam |
Learning rate | 2e-5 |
Batch size | 32 |
Max epoch | 100 |
aBiLSTM: Bidirectional Long Short Term Memory network.
bMLP: multilayer perceptron.
cBERT: Bidirectional Encoder Representation from Transformers.
dCNN: convolutional neural network.
As shown in
Compared to the pipeline model, the joint model performed better on both the NER and RE subtasks. For example, when considering five types of entities, the joint model outperformed the pipeline model by 1.21% in the F1 score on the NER subtask and by 1.97% on the RE subtask, indicating that error propagation was partially alleviated in our joint model. When considering five types of entities, the joint model also achieved higher F1 scores than BERT-2BiLSTM on the NER and RE subtasks, by 1.18% and 0.39%, respectively.
Performance of different models.
Subtask | Model | Three types of entities | Five types of entities |
Precision | Recall | F1 score | Precision | Recall | F1 score |
NERa | Pipeline | 0.9254 | 0.8062 | 0.8617 | 0.9241 | 0.8223 | 0.8702 |
NER | Joint | 0.9012 | 0.8415 | 0.8703 | 0.9154 | 0.8514 | 0.8823 |
NER | BERTb-2BiLSTMc | —d | — | — | 0.9096 | 0.8347 | 0.8705 |
REe | Pipeline | 0.7909 | 0.6005 | 0.6827 | 0.7895 | 0.6051 | 0.6851 |
RE | Joint | 0.7679 | 0.6200 | 0.6861 | 0.7717 | 0.6487 | 0.7048 |
RE | BERT-2BiLSTM | — | — | — | 0.7686 | 0.6441 | 0.7009 |
aNER: named entity recognition.
bBERT: Bidirectional Encoder Representation from Transformers.
cBiLSTM: Bidirectional Long Short Term Memory network.
dNot available.
eRE: relation extraction.
The performance of our best model on each type of family member information and relation (except living status, which was not provided in the test set) is listed in the table below.
Performance of the best model on each type of family member information.
Subtask | Type | Precision | Recall | F1 score |
NERa | FMb: overall | 0.8814 | 0.8386 | 0.8594 |
NER | FM: NAc | 0.8699 | 0.8515 | 0.8606 |
NER | FM: maternal | 0.9185 | 0.8267 | 0.8702 |
NER | FM: paternal | 0.8738 | 0.8108 | 0.8411 |
NER | OBd | 0.9385 | 0.8598 | 0.8974 |
NER | LSe | —f | — | — |
NER | Overall | 0.9154 | 0.8514 | 0.8823 |
REg | FM-OB: overall | 0.7843 | 0.6397 | 0.7047 |
RE | FM-OB: NA-OB | 0.8595 | 0.6098 | 0.7134 |
RE | FM-OB: maternal-OB | 0.7067 | 0.6601 | 0.6826 |
RE | FM-OB: paternal-OB | 0.7077 | 0.7150 | 0.7113 |
RE | FM-LS: overall | 0.7627 | 0.6553 | 0.7050 |
RE | FM-LS: NA-LS | 0.7627 | 0.6553 | 0.7050 |
RE | FM-LS: maternal-LS | 0.7108 | 0.7375 | 0.7239 |
RE | FM-LS: paternal-LS | 0.6825 | 0.6825 | 0.6825 |
RE | Overall | 0.7717 | 0.6487 | 0.7048 |
aNER: named entity recognition.
bFM: family member.
cNA: not applicable.
dOB: observation.
eLS: living status.
fNot available.
gRE: relation extraction.
As shown in
Performance of our model with different data.
Subtask | Data set | Three types of entities | Five types of entities | |||||
Precision | Recall | F1 score | Precision | Recall | F1 score | |||
NERa | 2019 | 0.8767 | 0.8409 | 0.8584 | 0.8847 | 0.8458 | 0.8648 | |
NER | 2018+2019b | —c | — | — | 0.9154 | 0.8372 | 0.8745 | |
NER | 2018+2019d | 0.9012 | 0.8415 | 0.8703 | 0.9154 | 0.8514 | 0.8823 | |
REe | 2019 | 0.7240 | 0.5973 | 0.6545 | 0.7270 | 0.6064 | 0.6612 | |
RE | 2018+2019b | — | — | — | 0.7459 | 0.6265 | 0.6810 | |
RE | 2018+2019d | 0.7679 | 0.6200 | 0.6861 | 0.7717 | 0.6487 | 0.7048 |
aNER: named entity recognition.
b2018+2019: the challenge submission performances of our model.
cNot available.
d2018+2019: the performances of our best model after challenge.
eRE: relation extraction.
In order to investigate the effect of the CNN-BiLSTM-based sentence representation on our model, we evaluated the model without this representation and obtained F1 scores of 0.8802 on the NER subtask and 0.7059 on the RE subtask when considering five types of entities. The CNN-BiLSTM-based sentence representation thus brought an improvement on the NER subtask but a small loss on the RE subtask. Sharing only BERT between NER and RE may therefore be a direction for further improvement.
Traditional approaches regarded the NER task as a sequence labeling task, in which each token was assigned a combined label of entity boundary and type. The entity boundaries were represented by the BIO schema, where “B” indicates the beginning of an entity, “I” indicates the inside of an entity, and “O” indicates the outside of an entity. Using a graph schema, we can also convert NER into a graph in the following way: (1) connect all tokens to “ROOT,” that is, set the heads of all tokens to 0, and (2) set the label of each nonentity token to “NULL,” the label of the last token in an entity to the entity type, and the labels of the remaining tokens in an entity to “app.”
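This BIO-to-graph conversion can be sketched as follows (a hypothetical helper, assuming well-formed BIO tags of the form `B-TYPE`/`I-TYPE`/`O`):

```python
def bio_to_graph(bio_tags):
    """Convert BIO labels to the NER-only graph encoding described above:
    every head points to ROOT (0); the last token of each entity gets the
    entity type, earlier entity tokens get "app", and the rest get "NULL"."""
    n = len(bio_tags)
    heads = [0] * n                # step (1): all tokens attach to ROOT
    labels = ["NULL"] * n
    for i, tag in enumerate(bio_tags):
        if tag == "O":
            continue
        etype = tag.split("-", 1)[1]
        # the token is the last of its entity if the next tag does not continue it
        is_last = i + 1 == n or bio_tags[i + 1] != "I-" + etype
        labels[i] = etype if is_last else "app"
    return heads, labels
```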
We compared different decoders, that is, CRF for sequence labeling, biaffine for NER only (biaffine-NER), and biaffine for joint NER and RE (biaffine-Joint). As shown in
Comparison of different decoders on the named entity recognition subtask.
Decoder | Three types of entities | Five types of entities | ||||
Precision | Recall | F1 score | Precision | Recall | F1 score | |
CRFa | 0.8989 | 0.8316 | 0.8639 | 0.9070 | 0.8390 | 0.8717 |
Biaffine-NERb | 0.9001 | 0.8310 | 0.8641 | 0.8895 | 0.8570 | 0.8729 |
Biaffine-Joint | 0.9012 | 0.8415 | 0.8703 | 0.9154 | 0.8514 | 0.8823 |
aCRF: conditional random field.
bNER: named entity recognition.
We performed error analysis on our model considering five types of entities on the development data set. For the NER subtask, 88.24% of the errors were boundary errors caused by wrong “app” arc prediction, while the remaining 11.76% were type errors with a correct boundary but a wrong entity type. For example, in the sentence “The paternal grandmother, age 53, has wind sucking attributed to not having intestinal during her life,” the paternal entity “grandmother” with the observation entity “wind sucking” was wrongly recognized as a family member entity. In the RE subtask, all errors were caused by incorrect entities. For example, in the sentence “The patient’s father is 43 years old and healthy. His father is 72 years old and was diagnosed with esophageal cancer at age 70,” the relation between the family member entity “grandfather” and the observation entity “esophageal cancer” was wrongly extracted as a relation between the family member entity “father” and the observation entity “esophageal cancer,” as our model could not understand that “his” refers to “the patient’s father,” which requires indirect relative reasoning.
The rule-based postprocessing module in our system cannot handle all cases properly, as shown by the example in the error analysis section. In future work, we will try to solve indirect relative reasoning for further improvement.
In this study, we proposed a novel graph-based model with biaffine attention, where a graph-based schema was designed to represent entities and relations regarding family history in a unified way, and deep biaffine attention was adopted to extract the entities and relations from clinical text. Our system based on the proposed model achieved the highest F1 score of the challenge to date.
BERT: Bidirectional Encoder Representation from Transformers
BiLSTM: Bidirectional Long Short Term Memory
CNN: convolutional neural network
CRF: conditional random field
EHR: electronic health record
NER: named entity recognition
NLP: natural language processing
RE: relation extraction
This paper was supported in part by the following grants: National Natural Science Foundations of China (U1813215, 61876052, and 61573118), Special Foundation for Technology Research Program of Guangdong Province (2015B010131010), National Natural Science Foundations of Guangdong, China (2019A1515011158), Guangdong Province Covid-19 Pandemic Control Research Fund (2020KZDZX1222), Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20180306172232154 and JCYJ20170307150528934), and Innovation Fund of Harbin Institute of Technology (HIT.NSRIF.2017052).
None declared.