Background: Drug prescriptions are often recorded in free-text clinical narratives; making this information available in a structured form is important to support many health-related tasks. Although several natural language processing (NLP) methods have been proposed to extract such information, many challenges remain.
Objective: This study evaluates the feasibility of using NLP and deep learning approaches for extracting and linking drug names and associated attributes identified in clinical free-text notes and presents an extensive error analysis of different methods. This study originated from our participation in the 2018 National NLP Clinical Challenges (n2c2) shared task on adverse drug events and medication extraction.
Methods: The proposed system (DrugEx) consists of a named entity recognizer (NER) to identify drugs and associated attributes and a relation extraction (RE) method to identify the relations between them. For NER, we explored deep learning-based approaches (ie, bidirectional long-short term memory with conditional random fields [BiLSTM-CRFs]) with various embeddings (ie, word embedding, character embedding [CE], and semantic-feature embedding) to investigate how different embeddings influence the performance. A rule-based method was implemented for RE and compared with a context-aware long-short term memory (LSTM) model. The methods were trained and evaluated using the 2018 n2c2 shared task data.
Results: The experiments showed that the best model (BiLSTM-CRFs with pretrained word embeddings [PWE] and CE) achieved lenient micro F-scores of 0.921 for NER, 0.927 for RE, and 0.855 for the end-to-end system. NER, which relies on the pretrained word and semantic embeddings, performed better on most individual entity types, but NER with PWE and CE had the highest classification efficiency among the proposed approaches. Extracting relations using the rule-based method achieved higher accuracy than the context-aware LSTM for most relations. Interestingly, the LSTM model performed notably better in the reason-drug relations, the most challenging relation type.
Conclusions: The proposed end-to-end system achieved encouraging results and demonstrated the feasibility of using deep learning methods to extract medication information from free-text data.
Electronic health records (EHRs) are a valuable source of routinely collected health data that can be used for secondary purposes, including clinical and epidemiological research. They typically contain information on consultations, admissions, symptoms, clinical examinations, test results, diagnoses, treatments, and outcomes. Medication prescriptions are a key source for understanding the effects of patient treatment. In some settings (eg, general practitioners’ practices), they might be recorded in a structured fashion through prescribing software and would comprise, apart from drug names, medication attributes such as dosage, frequency, and duration. Still, there are often additional, free-text sources of prescription information, such as clinic letters or discharge summaries, particularly in secondary care. Extracting information from free-text is challenging because much of the information is provided in a narrative manner, and the text is often written in haste and under considerable time pressure. There has been strong interest among researchers in the use of natural language processing (NLP) to extract information from clinical free-text notes on a large scale, including a number of shared tasks and benchmark data sets to assess and advance the state-of-the-art in this domain, such as challenges in medication extraction; chemical and drug named entity recognition (NER); drug-drug interaction extraction; and extraction of medications, indications, and adverse drug events (ADEs).
Medication prescription instructions are a specific clinical sublanguage, where expressions are often abbreviated (eg, od for once a day) and may contain spelling errors (eg, 20 mcg evry othr wk). Existing approaches for extracting drugs and associated attributes from clinical text are diverse, including dictionary lookup (ie, searching for matches from existing drug dictionaries), rule-based approaches (manually designed patterns, eg, regular expressions that can be searched in free-text), machine learning approaches (training models on example data), and hybrid approaches that combine different methods. Recently, methods based on deep learning and neural networks, such as convolutional neural networks and recurrent neural networks, have been shown to be state-of-the-art in drug attribute extraction tasks. Deep learning methods take relevant features (eg, orthographic and lexical features) as inputs and produce labels as outputs. These manually constructed feature vectors can then be replaced with, for example, word embeddings (WE), character embeddings (CEs), and feature embeddings. Embeddings are representations of tokens in an n-dimensional space, typically learned over large collections of unlabeled data through an unsupervised process (eg, word2vec, Global Vectors for Word Representation [GloVe], and fastText). Recently, more advanced embedding methods and representations (eg, Embeddings from Language Models [ELMo] and Bidirectional Encoder Representations from Transformers [BERT]) have further advanced state-of-the-art clinical NLP.
Although deep learning methods have been extensively used in medication information extraction, the effects of various architectures and token representations have not been widely discussed. The purpose of this study is to provide a comprehensive comparison of various representations used for drug information extraction within the same settings. The main contributions of our work are as follows:
- An investigation of the effect of various token representations (ie, CE, WE, and semantic-feature embeddings [SFEs]) on extracting medication information
- The comparison between a rule-based method and deep learning approaches for identifying relations between drugs and associated attributes.
The DrugEx system proposed here is composed of (1) an NER method for extracting mentions of drug names and drug-associated attributes and (2) a relation extraction (RE) method for identifying relations between drugs and their associated attributes. The NER task involves extracting 8 types of entities: drug, strength, duration, route, form, dosage, frequency, and reason of administration (see the list below for definitions and examples of the extracted entities).
Definitions and examples of entity types extracted by the DrugEx system.
- Drug: The chemical name of a drug or the advertised brand name under which a drug is sold (eg, aspirin)
- Dosage: The amount of medicine that the patient takes or should take (eg, 2 tablets, 5 mL)
- Strength: The amount of drug in a given dosage (eg, 200 mg)
- Frequency: The rate at which medication was taken or is repeated over a particular period (eg, daily, every 4 hours)
- Duration: The period of continuous medication taking (eg, pro re nata, for 5 days)
- Route: The path by which medication is taken into the body or the location at which it is applied (eg, topical, per os)
- Form: The form in which a medication is marketed for use (eg, tablet)
- Reason: The reason for medication administration (eg, for pain)
The scope of these entity types and the data sets that were used for training and evaluation were provided as part of the 2018 National NLP Clinical Challenges (n2c2) shared task track 2: ADEs and medication extraction in EHR challenge. The data set consists of discharge summaries drawn from the Medical Information Mart for Intensive Care III (MIMIC-III) clinical care database. It comprises 505 documents, of which 303 documents were used as the training set, and the remaining 202 documents were used as the test set. These data were annotated by 7 domain experts, consisting of 4 physician-assistant students and 3 nurses. Annotations included drug, strength, dosage, frequency, duration, form, route, reason, and ADEs; ADE annotations have been omitted here as they are beyond the scope of this study.
The annotations also included relations between drugs and other attributes. The first table below shows the descriptive statistics for the associated drug attributes in the n2c2 data set and how often each of them was linked to more than 1 drug. Noticeably, 17% (1421/8579) of the reason-drug links involved reason entities associated with more than one drug; the maximum number of drugs associated with a single reason was 10. For example, in “START: Guaifensin with codeine QHS and Benzonatate as needed for cough,” the reason cough is associated with 2 drugs: guaifenesin (with codeine) and benzonatate. The second table below shows the number of drug entities participating in each link and the ratio of drugs with more than one link. From a total of 11,028 form-drug relations, 4517 (41%) involved drugs that have more than one association with the form attribute (ie, multiple forms reported for a single drug entity). For example, in “Bisacodyl 5 mg Tablet Sig: 1-2 Tablets PO once a day as needed for constipation,” both mentions of tablets were annotated as form, and both were associated with the drug bisacodyl.
| Entity types | Entities, n (%) | Links to 1 drug, n (%) | Links to multiple drugs, n (%) | Maximum number of drug associations |
| --- | --- | --- | --- | --- |
| Form | 11,010 (13.38) | 10,980 (99.56) | 48 (<1) | 2 |
| Strength | 10,921 (13.27) | 10,913 (99.70) | 33 (<1) | 3 |
| Frequency | 10,293 (12.51) | 10,281 (99.39) | 63 (1) | 4 |
| Route | 8989 (10.92) | 9000 (99.08) | 84 (1) | 4 |
| Dosage | 6902 (8.39) | 6877 (99.38) | 43 (1) | 4 |
| Reason | 6400 (7.78) | 7158 (83.44) | 1421 (16.56) | 10 |
| Duration | 970 (1.2) | 991 (92.7) | 78 (7) | 4 |

| Relation type | Relations, n (%) | Drugs with 1 link, n (%) | Drugs with more than 1 link, n (%) |
| --- | --- | --- | --- |
| Strength-drug | 10,946 (18.88) | 10,639 (97.20) | 307 (2.8) |
| Frequency-drug | 10,344 (17.84) | 10,054 (97.20) | 290 (2.8) |
| Route-drug | 9084 (15.67) | 8903 (98.01) | 181 (1.99) |
| Reason-drug | 8579 (14.80) | 7704 (89.80) | 875 (10.2) |
| Dosage-drug | 6920 (11.94) | 6765 (97.76) | 155 (2.2) |
| Form-drug | 11,028 (19.02) | 6511 (59.04) | 4517 (40.96) |
| Duration-drug | 1069 (1.84) | 1021 (95.51) | 48 (5) |
All NER models rely on the bidirectional long-short term memory with conditional random fields (BiLSTM-CRF) architecture, which is composed of 3 different layers: embedding layer, bidirectional long-short term memory (BiLSTM) layer, and conditional random fields (CRFs) layer.
The data were first tokenized using spaCy, an open-source library for NLP, with support for various languages. Then, as target entities differ in length and may contain more than one token, each token was annotated using the BIOES (Begin, Inside, Outside, End, Single) tagging scheme to capture information about the sequence of tokens. We further processed the discharge summaries using the Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP) and the Clinical Text Analysis and Knowledge Extraction System (cTAKES) to extract token-level clinical semantic tags (eg, medication, disease disorder, and procedure; see the section Embedding Layer for details), which were used for SFEs.
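The BIOES scheme described above can be illustrated with a minimal sketch. It assumes that tokens are already available (here a hand-split fragment of a prescription, not spaCy output) and that entity spans are given as hypothetical `(start, end_exclusive, type)` token indices; the function name `bioes_tags` is ours and is not part of the DrugEx code.

```python
from typing import List, Tuple

def bioes_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign one BIOES tag per token.

    spans holds (start, end_exclusive, entity_type) token indices
    (an assumed input format, chosen for illustration).
    """
    tags = ["O"] * n_tokens  # tokens outside any entity stay "O"
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"  # single-token entity
        else:
            tags[start] = f"B-{etype}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"  # inner tokens
            tags[end - 1] = f"E-{etype}"  # last token
    return tags

# Hand-tokenized illustrative fragment; spans are hypothetical gold annotations.
tokens = ["Bisacodyl", "5", "mg", "Tablet", "once", "a", "day"]
spans = [(0, 1, "Drug"), (1, 3, "Strength"), (3, 4, "Form"), (4, 7, "Frequency")]
tags = bioes_tags(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with a plain BIO scheme, the explicit E- and S- tags mark entity ends and single-token entities, which gives the sequence model more boundary information.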
The embedding layer maps tokens into vectors of numbers that represent their meanings. WEs provide dense representations that make them capable of representing many aspects of similarities between words, such as semantic relations and morphological properties. Several methods can be used to initialize the values in WEs at the beginning of neural network training. We examined the randomly initialized word embeddings (RIWE) and the pretrained word embeddings (PWE), where the latter has been pretrained on data from the clinical (ie, target) domain.
Although WEs can capture tokens’ semantics, they might still be affected by data sparsity and, therefore, cannot remediate synonyms, out-of-vocabulary tokens, and misspellings. WEs may also fail to capture morphemes (such as prefixes and suffixes) derived from classic Latin and ancient Greek roots, which are often included in drug names and drug attributes. Thus, we addressed these issues by using CEs in addition to WEs. The concatenation of the PWE with the CEs allows the model to learn subtoken patterns such as morphemes and roots, thereby aiming to capture out-of-vocabulary tokens, different forms, and any other information not captured by WEs.
We also considered representations beyond tokens, aiming to add clinical semantics to words. Specifically, the concatenation of the PWE and SFEs was used to represent the clinical categories of entities identified in the text, such as medical problems, tests, or temporal information. Note that in this study, we did not evaluate SFE without PWEs. Some entity types (such as frequency or route) are not present among the semantic tags we used, whereas other semantic tags (such as signs, symptoms, disease, and disorder) are more frequent. Therefore, the representations of semantic tags were learned simultaneously with word representations and concatenated together to form the final token representations. We used CLAMP to extract semantic tags (ie, problem, treatment, and temporal entities) with associated assertion tag attributes (ie, present or absent). We also used the default clinical pipelines in cTAKES to tag tokens with other semantic categories (ie, Medication, DiseaseDisorder, and SignSymptom). In each pipeline, tokens were tagged with the corresponding semantic features and attributes (if available); otherwise, they were tagged with the outside (ie, O) tag. Token-level semantic tags from both pipelines were then mapped and merged based on their types to create a set of semantic features.
The BiLSTM layer takes the sequence of vectors (ie, token representations) corresponding to a sequence of tokens (the output from the embedding layer) and calculates the hidden states by processing the sequence of token representations forward and backward (ie, left-to-right and right-to-left) to learn important token-level features. It then outputs the sequence of vectors, including the probability of each label for each corresponding token. The labels were either 1 of the 8 entity types or none. The label assigned to the token is the label with the highest probability from the predicted labels’ sequence (output from the BiLSTM layer).
The BiLSTM output does not consider the dependencies between neighboring labels when predicting the current label. For example, it may be more likely to have a token labeled as a drug name followed by a token labeled as strength than any other entity type. Thus, to learn these dependencies, we added a CRF layer that uses past and future labels to optimize predictions and obtain the most probable sequence of predicted labels. Finally, the labels (BIOES tags) were combined into named entities by merging consecutive labeled B-, I-, E-, or S-tags of the same class.
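The final merging step can be sketched as follows. This is a minimal illustration rather than the DrugEx implementation: the function name `merge_bioes` is hypothetical, and the choice to discard ill-formed tag sequences (eg, a B- tag that is never closed by an E- tag) is our assumption, as the text does not state how such cases were handled.

```python
from typing import List, Tuple

def merge_bioes(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Merge BIOES tags into (start, end_exclusive, entity_type) spans."""
    entities = []
    start, current = None, None  # start index and type of the open entity, if any
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            entities.append((i, i + 1, etype))
            start, current = None, None
        elif prefix == "B":
            start, current = i, etype
        elif prefix == "I" and current == etype:
            continue  # still inside the open entity
        elif prefix == "E" and current == etype:
            entities.append((start, i + 1, etype))
            start, current = None, None
        else:
            # "O" tag or an inconsistent sequence: drop any open entity (assumption)
            start, current = None, None
    return entities

predicted = ["S-Drug", "B-Strength", "E-Strength", "O",
             "B-Frequency", "I-Frequency", "E-Frequency"]
print(merge_bioes(predicted))
```

A more lenient variant could instead close an open entity at the last consistent token; which behavior is preferable depends on whether strict or lenient matching is used for evaluation.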
NER Models Training and Tuning of Hyperparameters
We used the standard data split established by the n2c2 organizers: the training set was used for fitting the models and tuning their hyperparameters, and the test set was used for evaluating our best models. As there is no official development set, we randomly selected 9.9% (30/303) of the training documents for validation. This validation set was used to optimize the models’ hyperparameters.
We trained all neural network models using stochastic gradient descent, with a learning rate of 0.005. In the baseline model (RIWE), we used randomly initialized 100-dimensional WEs. In other models, we used pretrained 600-dimensional WEs, which were trained on approximately 2 million discharge summaries drawn from the MIMIC-III data using the word2vec continuous bag-of-words method. CEs were 25-dimensional vectors, whereas SFEs were 50-dimensional vectors. The number of hidden states was set to 300 for the BiLSTM over the token representations and to 25 for the BiLSTM that learns the CEs. We also applied dropout to the token embeddings at a rate of 0.5 to avoid overfitting. The number of epochs was determined by an early stopping criterion (ie, after 10 epochs with no improvement) on the validation set, with the maximum number of epochs set to 100. Finally, the batch size was set to 32. These hyperparameters were optimized through a random search on the validation set. We tested WEs with dimensions ranging from 100 to 600, CE and SFEs with 25, 50, and 100 dimensions, and the dropout rate with values in the range between 0 and 0.75.
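The tuning procedure can be sketched as a random search combined with an early stopping rule. The sketch below is illustrative only: the step of 100 for the WE dimension, the uniform sampling of the dropout rate, the number of trials, and the `toy_score` function (a stand-in for training a model and scoring it on the validation set) are all assumptions and are not taken from the study.

```python
import random

# Search space following the ranges reported in the text; the WE grid step is assumed.
SEARCH_SPACE = {
    "word_dim": [100, 200, 300, 400, 500, 600],
    "char_dim": [25, 50, 100],
    "sfe_dim": [25, 50, 100],
}

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration at random."""
    cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    cfg["dropout"] = round(rng.uniform(0.0, 0.75), 2)
    return cfg

def random_search(score_fn, n_trials: int = 20, seed: int = 0):
    """Return the sampled configuration with the highest validation score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def stopping_epoch(val_scores, patience: int = 10, max_epochs: int = 100) -> int:
    """Epoch at which training stops: after `patience` epochs without improvement."""
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return min(len(val_scores), max_epochs)

def toy_score(cfg: dict) -> float:
    """Hypothetical placeholder for a validation F-score; not a real model."""
    return -abs(cfg["dropout"] - 0.5) - abs(cfg["char_dim"] - 25) / 100

best_cfg, best_score = random_search(toy_score)
print(best_cfg, best_score)
print(stopping_epoch([0.5, 0.6] + [0.55] * 20))
```

In practice, `score_fn` would train a BiLSTM-CRF with the sampled configuration and return its F-score on the 30 validation documents.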
Once drugs and attributes are extracted, the subsequent step is to link drug names to the corresponding attributes. For this task, we experimented with a rule-based method engineered for the task and a context-aware long-short term memory (LSTM) model, where the positions of the involved entities were encoded using marker embeddings.
We used a context-aware LSTM that considers other relations in the sentential context while predicting the target relation. It uses an LSTM-based encoder to jointly learn representations for all relations in the text. Thus, the representation of the target relation and representations of the context relations are combined to make the final prediction. The LSTM model for RE consists of an embedding layer, an LSTM layer, and a softmax layer. The embedding layer maps a portion of the text that contains a target entity pair into a high-level representation vector. First, each token in the text is mapped to its WE vector. Second, every 2 entities (ie, a drug and its associated attribute) in the text are paired as candidate entities for a possible relation. All other tokens are then marked as either belonging to a drug (as the main actor of all relations) or not. Afterward, each token’s marker embeddings are concatenated to the WEs to generate a single vector. This vector is then passed to the LSTM, which calculates the hidden states by processing the sequence of token representations. Finally, the LSTM layer’s output is routed into the softmax layer to map the nonnormalized output to the final output vector that contains the probability for each relation type.
In this approach, we examined patterns of prescription information in discharge summaries in the training set and manually implemented a set of rules using regular expressions. These regular expressions were designed and implemented in the General Architecture of Text Engineering environment. First, the discharge summaries were split into sentences. For sentences that include only one drug name D, all drug attributes found in that sentence are linked to drug D. However, sentences that include multiple drug names are split into several segments, where each new segment starts at the beginning of the next drug name.
If a sentence does not include a drug name but contains other entities, then the previous 2 sentences are checked. If they contain a drug name, then the attributes are linked to the closest drug name. For example, consider “Patient will be on Topiramate 25mg PO BID until 22/3 PM. Then increase to 50mg po BID for seven days. Then increase to 75mg ongoing.” All the attributes in the second and third sentences are linked to the drug topiramate that appears in the first sentence.
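The linking heuristics described above can be sketched in a few lines. This is a simplified illustration, not the regular expression rules implemented in the General Architecture of Text Engineering environment: entities are assumed to be already extracted and given per sentence in order of appearance, the function name `link_attributes` is hypothetical, and attaching attributes that precede the first drug of a sentence to that drug is our assumption.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity_type, text), in order of appearance

def link_attributes(sentences: List[List[Entity]]) -> List[Tuple[str, str, str]]:
    """Return (relation_type, attribute_text, drug_text) triples."""
    relations = []
    for idx, sent in enumerate(sentences):
        drugs_here = [text for etype, text in sent if etype == "Drug"]
        if drugs_here:
            # Each drug name opens a new segment; attributes link to the
            # drug of their segment. Attributes before the first drug are
            # attached to that first drug (assumption).
            current = drugs_here[0]
            for etype, text in sent:
                if etype == "Drug":
                    current = text
                else:
                    relations.append((f"{etype}-Drug", text, current))
        else:
            # No drug in this sentence: look back at most 2 sentences
            # and use the closest preceding drug name.
            anchor = None
            for prev in reversed(sentences[max(0, idx - 2):idx]):
                prev_drugs = [t for e, t in prev if e == "Drug"]
                if prev_drugs:
                    anchor = prev_drugs[-1]
                    break
            if anchor is not None:
                for etype, text in sent:
                    relations.append((f"{etype}-Drug", text, anchor))
    return relations

# Entities from the topiramate example, entered by hand for illustration.
example = [
    [("Drug", "Topiramate"), ("Strength", "25mg"), ("Route", "PO"), ("Frequency", "BID")],
    [("Strength", "50mg"), ("Route", "po"), ("Frequency", "BID"), ("Duration", "for seven days")],
    [("Strength", "75mg")],
]
relations = link_attributes(example)
for relation in relations:
    print(relation)
```

Because the look-back window is limited to 2 sentences, attributes that appear further away from their drug would remain unlinked under this sketch.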
RE Model Training and Tuning of Hyperparameters
We trained the LSTM model for RE using the same procedure and the same hyperparameter settings that we used for the NER models. Marker embeddings were 10-dimensional vectors.
The regular expressions in the RE rule-based method were implemented based on manual observation of the training set, followed by an initial evaluation of the validation set. The regular expressions were then refined based on an error analysis of the output from the validation process, and the final evaluation was performed on the official test set.
We considered the available annotations in the corpus as the gold standard when evaluating the models. To assess the performance of the proposed models, we performed hold-out validation (using the predefined training and test sets) and used the official n2c2 evaluation script provided with the data. It uses standard evaluation methods in information retrieval (ie, precision, recall, and F-score). We report the lenient micro- and macroaveraged scores for each NER experiment. Lenient matches refer to cases where overlapping boundaries between the gold standard and the system’s predictions are accepted. Macroaveraging calculates the metrics on a per-document basis and then averages the results. Microaveraging, on the other hand, refers to the pooling of the results of all classified instances into a single contingency table.
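The difference between the two averaging schemes can be made concrete with a small sketch. It assumes that per-document counts of true positives, false positives, and false negatives are already available; the counts below are invented for illustration, and the official n2c2 script may differ in details such as how the macroaveraged F-score is derived.

```python
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-score from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_average(docs: List[Counts]) -> Tuple[float, float, float]:
    """Pool all counts into one contingency table, then compute the metrics."""
    tp = sum(d[0] for d in docs)
    fp = sum(d[1] for d in docs)
    fn = sum(d[2] for d in docs)
    return prf(tp, fp, fn)

def macro_average(docs: List[Counts]) -> Tuple[float, float, float]:
    """Compute the metrics per document, then average them."""
    scores = [prf(*d) for d in docs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Invented counts: one long, well-handled document and one short, poorly handled one.
docs = [(90, 10, 10), (1, 1, 3)]
micro = micro_average(docs)
macro = macro_average(docs)
print("micro P/R/F:", micro)
print("macro P/R/F:", macro)
```

In this example, the microaverage is dominated by the long document, whereas the macroaverage gives both documents equal weight and is therefore pulled down by the short one.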
In addition, we evaluated the performance of the NER models with the best-performing RE model as an end-to-end system. This allows us to measure the effect of missing entities in the NER models on the RE task. As shown in the data set statistics, attributes could be associated with more than one drug. Thus, when an NER model fails to recognize an entity (either drug or attribute), all of its semantic relations (ie, associations) will also be missed. Finally, the best-performing end-to-end system was chosen for our DrugEx system.
We computed the lenient precision, recall, and F-score for all models in the NER task. The best result in the NER task was achieved by PWE+CE embeddings (micro F-score of 0.921). Interestingly, NER (PWE), which ranked second in F-score, achieved a slightly higher precision, and NER (PWE+SFE) achieved a higher recall than any other model. NER (PWE+SFE) also yielded a better balance between precision and recall. Concerning individual F-scores, PWE performed better than the baseline (RIWE) for every entity type. The SFEs with the PWEs in NER (PWE+SFE) allow the model to perform better than others on some individual entity types, especially frequency, duration, and reason. An analysis at the per-entity type level shows that most entity types (ie, drugs, strength, form, dosage, frequency, and route) are associated with excellent performance (F-scores above 0.90). Duration and reason, however, are associated with lower performance. This might be amplified by the fact that there were few examples of duration and reason entities in the training data.
aRIWE: bidirectional long-short term memory with conditional random fields with random word embeddings.
bPWE: bidirectional long-short term memory with conditional random fields with pretrained word embeddings.
c(PWE+CE): bidirectional long-short term memory with conditional random fields with pretrained word embeddings and character embeddings.
d(PWE+SFE): bidirectional long-short term memory with conditional random fields with pretrained word embeddings and semantic-feature embeddings.
eThe best results for each metric are italicized.
To explore the complementarity of the methods, we created an ensemble model using the outputs of all the proposed NER models. The ensemble output for each task was generated using a majority voting scheme, in which the entire named entity phrase, together with its type, is taken as 1 prediction instance. The ensemble model showed precision, recall, and F-scores of 0.961, 0.884, and 0.921, respectively. As expected, the ensemble showed performance gains in precision when compared with the best individual models. This indicates that the 3 models did not learn the same patterns from the data set. However, the difference in recall and F-score is not evident, even for specific attributes.
a(PWE+CE): bidirectional long-short term memory with conditional random fields with pretrained word embeddings and character embeddings.
b(PWE+SFE): bidirectional long-short term memory with conditional random fields with pretrained word embeddings and semantic-feature embeddings.
cThe best results for each metric are italicized.
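The majority voting scheme used for the ensemble can be sketched as follows. The sketch assumes that each model’s output is a collection of `(start, end, type)` predictions and that a prediction is kept when more than half of the models agree on both the span and the type; this threshold, the function name `majority_vote`, and the example predictions are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Prediction = Tuple[int, int, str]  # (start offset, end offset, entity type)

def majority_vote(model_outputs: List[Iterable[Prediction]]) -> Set[Prediction]:
    """Keep predictions made by more than half of the models.

    The whole entity phrase together with its type counts as 1 instance,
    so models must agree on both the span and the type.
    """
    votes = Counter()
    for predictions in model_outputs:
        votes.update(set(predictions))  # each model votes once per instance
    threshold = len(model_outputs) / 2
    return {pred for pred, n in votes.items() if n > threshold}

# Invented outputs of 3 models on the same text.
model_a = [(0, 9, "Drug"), (10, 15, "Strength"), (30, 35, "Reason")]
model_b = [(0, 9, "Drug"), (10, 15, "Dosage")]
model_c = [(0, 9, "Drug"), (10, 15, "Strength")]
ensemble = majority_vote([model_a, model_b, model_c])
print(sorted(ensemble))
```

Under this scheme, an entity found by only 1 model is discarded, which is consistent with a gain in precision without a corresponding gain in recall.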
We further conducted paired t tests to determine whether the differences between the models were statistically significant. Differences were considered significant if the P value was <.05. The samples used in this test were the microaverage F-scores from each document in the test set (ie, document-level NER performance). The post hoc analysis of variance for the NER task is summarized in the table below. The statistical significance test showed that there were no statistically significant differences between any of the models (PWE, PWE+CE, and PWE+SFE), even though some of the models use apparently important but computationally expensive clinical information, such as the type of entities (ie, problems, signs, and symptoms). However, the 3 models (PWE, PWE+CE, and PWE+SFE) were statistically significantly different from the baseline (ie, RIWE), where random embeddings were used. This means that pretraining embeddings on the target domain (ie, discharge summaries from MIMIC-III) helped in comparison with the random initialization of WEs.
|Named entity recognition||PWEb, P value||PWE+CEc, P value||PWE+SFEd, P value|
aRIWE is significantly worse than the rest of the models. At the same time, there is no statistically significant difference between PWE, PWE+CE, and PWE+SFE.
bPWE: pretrained word embeddings.
cCE: character embedding.
dSFE: semantic-feature embeddings.
eRIWE: randomly initialized word embeddings.
fN/A: not applicable.
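The paired t test on document-level scores can be sketched with the Python standard library. The sketch computes only the t statistic and the degrees of freedom; the F-scores below are invented, and obtaining the P value requires the t distribution (eg, via `scipy.stats.ttest_rel`, which we mention as one existing option and do not use here).

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must be paired and contain at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]  # per-document score differences
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample SD of the differences
    return mean(diffs) / standard_error, n - 1

# Invented document-level micro F-scores for 2 models on the same 4 documents.
model_1 = [0.92, 0.90, 0.95, 0.88]
model_2 = [0.90, 0.89, 0.91, 0.87]
t_stat, dof = paired_t(model_1, model_2)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Pairing by document matters because it removes the between-document variability (some discharge summaries are harder than others) from the comparison.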
We evaluated the RE models both on the gold-standard entities and on the output from the NER models (end-to-end). Using the gold-standard entities and using the output from the best NER model (end-to-end), we achieved micro F-scores of 0.927 for rules and 0.855 for (PWE+CE)+rules, respectively. Thus, the traditional rule-based method performed surprisingly well relative to the context-aware LSTM for this task. The form-drug and frequency-drug relations are examples of such success: there was at least a 4% improvement in F-score over the LSTM model. The microaverage F-score for the end-to-end task was notably lower than that for the NER tasks and RE using gold-standard entities. This was expected because prediction in the end-to-end setting compounded the errors in both the NER and RE steps. A major factor behind the low score is the reason-drug relation type, which was often not recognized because the NER did not recognize the reason attribute. However, the prediction of this relation itself (ie, reason-drug) is also challenging, as evidenced by the F-score of 0.734 in the RE task (rules) on the gold-standard entities. This might be because the text span between 2 entities in this relation is often relatively long; thus, none of the methods explored in this study could capture this.
aLSTM: long-short term memory method.
bRules: rule-based method.
cThe best results for each metric are italicized.
The statistical significance test for the RE task showed that the differences between the LSTM and rule-based models were not statistically significant (P=.41). For the end-to-end task, similar to the NER task, there was no statistically significant difference between any of the models (PWE, PWE+CE, and PWE+SFE); however, the 3 models were statistically significantly different from the RIWE. Accordingly, the best-performing end-to-end system, (PWE+CE)+rules, was chosen for our DrugEx system.
|End-to-end models||PWEa+rules, P value||(PWE+CEb)+rules, P value||(PWE+SFEc)+rules, P value|
aPWE: pretrained word embeddings.
bCE: character embedding.
cSFE: semantic-feature embeddings.
dRIWE: randomly initialized word embeddings.
eN/A: not applicable.
The models explored in this study demonstrated high F-scores of 0.921 for NER, 0.927 for RE, and 0.855 for the end-to-end approach. The overall highest F-scores (achieved by different teams) in the n2c2 challenge in the NER, RE, and end-to-end tasks were 0.942, 0.963, and 0.891, respectively. The top-ranked NER system used a BiLSTM-CRF with the ELMo language model, CEs, and normalized section titles as features. The top-ranked system in the RE and end-to-end tasks was a joint concept-relation extraction system that uses 2 layers of BiLSTM-CRFs.
The results for our NER models showed that PWE+CE had the highest micro F-score, followed by PWE and PWE+SFE, which had similar scores among themselves and above the baseline. The results for the RE models showed that the rule-based method achieved higher accuracy than the context-aware LSTM for most relation types, although the overall difference was not statistically significant. Interestingly, the LSTM model performed notably better in the reason-drug relations, which were missed more than all other relation types.
We observed that external resources (ie, SFEs) contributed to the attribute extraction. Presumably, the complementary information from these external resources, together with the plentiful labeled data already available, was helpful for performance. Nevertheless, simpler methods, such as PWE and rule-based methods, can match these sophisticated and expensive methods.
We further analyzed false positives and false negatives from the NER to obtain deeper insights into the common classification errors. Note that the error analysis focused on the NER only, as it appears to be the main factor behind the relatively low F-score in the end-to-end RE.
To gain an insight into where errors are made and how models can be improved, we manually reviewed false negatives (entities identified in the gold standard but incorrectly rejected, ie, missed, by the models) and false positives (entities identified by the models when they are not in the gold standard) in the best-performing model. Errors were then grouped into different categories based on their causes, including (1) context error: when an entity is captured as one of the drug-related attributes, although it is not, or when an entity is missing because of the context; (2) type error: when an attribute is extracted but with an incorrect annotation type; and (3) gold-standard error: possible error in the gold standard. We also generated a confusion matrix to subdivide the errors made by the method based on which type of mistake was made.
Context error was a major category of errors. These mostly resulted from previously unseen information (eg, “He was given a loading dose of amiodarone,” where the dosage loading dose was missed), atypical expression formats (eg, “One (1) Tablet,” where dosage one (1) was missing because of the parentheses), and abbreviations (eg, “Dig level 2.1,” where drug dig—which should be digoxin—was missed). Context errors may also result from the complexity of language expressions; for example, 200 units in the phrase “was started on a 7d course of DRUG 200 units daily” could be a dosage when considered as a single phrase, or it could be 2 concepts: 200 (a strength) and unit (a form). Gold annotation preferred the latter, whereas our method identified the former.
Another interesting cause of error is the ambiguity between attributes, where an attribute is recognized, but the type is incorrect. The confusion matrix for the BiLSTM-CRF (PWE+CE) indicates how often each entity type is predicted as another. The confusion of dosage for strength and strength for dosage is the most frequent type of error, accounting for 28% ([66+126]/693) of the errors. The following example illustrates this type of error: “Meropenem 500 mg Intravenous every eight (8) hours.” The dosage 500 mg is wrongly predicted as strength; usually, the mg unit is associated with strength. The substitution of dosage for strength is a common error, and these entities are often mislabeled as each other because both are often numeric quantities used in similar contexts. A common solution for this issue is to merge these 2 types into 1 annotation type. However, extracting them separately may be important for some applications.
The second most frequent type of this error, which accounts for 16% ([48+65]/693) of the errors, is the confusion of form for route and route for form. These entities are often annotated in the gold standard in various ways. For example, the word injection is sometimes annotated as a form and sometimes as a route; in the training set, it is annotated as a form 68 times and as a route 53 times, which makes learning from these examples challenging.
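Label inconsistency of this kind can be surfaced by counting, for each surface form, how often it receives each annotation type. The sketch below is illustrative only: the helper names are hypothetical, and the annotation list is rebuilt from the 2 counts reported above for injection rather than read from the training data.

```python
from collections import Counter, defaultdict

def label_distribution(annotations):
    """Map each lowercased surface form to a Counter of its annotation types.

    annotations: iterable of (surface_text, entity_type) pairs.
    """
    by_surface = defaultdict(Counter)
    for text, label in annotations:
        by_surface[text.lower()][label] += 1
    return by_surface

def majority_share(counter):
    """Fraction of mentions that carry the most frequent label."""
    return max(counter.values()) / sum(counter.values())

# Rebuilt from the counts in the text: form 68 times, route 53 times.
annotations = [("injection", "Form")] * 68 + [("injection", "Route")] * 53

distribution = label_distribution(annotations)
share = majority_share(distribution["injection"])
print(dict(distribution["injection"]), round(share, 3))
```

A majority share close to 0.5, as here (68/121), signals that the surface form alone gives the model little evidence for choosing between the 2 types.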
The confusion of drugs with general words is another source of error. We found several causes of this confusion: (1) generic drug names (eg, glucose, IVF, blood, D5W, and chemo) that correspond to prescribed medications but do not occur in expected contexts; (2) words such as pressor, fluids, agents, or medication that may be considered underspecified but should be extracted, at least in this data set; (3) some classes of drugs (eg, antiinflammatory drugs and hypertension medications) missing from the training sources; (4) new drug names that did not occur within expected contexts or semantic patterns (eg, Dig level 2.5) and so were not extracted by the NER methods; and (5) abbreviations (eg, aspirin325 and ABX).
The analysis also showed a few potential omissions and inconsistencies in human annotations. Gold-standard errors fall into 2 categories: entities missing in the gold standard and potential problems in the gold standard. The more common is the former, where the method annotates entities that are not annotated in the data set. For example, four weeks in the phrase, “adding DRUG cover for the first four weeks of treatment,” is not annotated as a duration in the gold standard, whereas it appears to be a potentially correct attribute. Inconsistency may also appear in annotation spans; for example, dosage or strength and form were sometimes annotated separately and sometimes jointly.
In this study, we constructed an end-to-end system (DrugEx) composed of bidirectional LSTM, CRF, and rule-based methods for extracting drug-related information from free-text discharge summaries. We studied various token representations (ie, WE, CE, and SFE) for extracting drug attributes and proposed a rule-based method for identifying relations between drugs and attributes, which we compared with a context-aware deep learning method. The results showed that the proposed system can be used successfully for extracting and linking drug attributes in discharge summaries, although some attributes (ie, reason and duration) are still challenging. The results also showed that domain-tailored embeddings (ie, PWE) perform better than random embeddings (RIWE) in this task. Concatenating PWE with either CE or SE yielded comparable overall performance. NER (PWE+CE) achieved the best F-score among the proposed models; however, NER (PWE+SE) performed better on some individual entity types, especially frequency, duration, and reason. Semantic embeddings also yielded a better balance between precision and recall. However, a simpler method (eg, WE and CE) can match these more sophisticated and expensive methods. Incorporating external knowledge (eg, of a drug’s reason, proposed treatment, and a drug’s reactions) and incorporating a larger context may improve performance.
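The idea of concatenating a word embedding with a character-level representation can be illustrated with a toy example. This is a sketch under strong assumptions and not the architecture evaluated here: the dimensions, the tiny vocabulary, and the random vectors are hypothetical, and the character vector is a plain average of character embeddings, used only as a stand-in for a learned character encoder.

```python
import random

random.seed(0)
WORD_DIM, CHAR_DIM = 4, 3  # hypothetical sizes, far smaller than in practice

def random_vector(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

# Stand-ins for a pretrained word table and a character table.
word_table = {w: random_vector(WORD_DIM) for w in ["aspirin", "daily", "tablet"]}
UNKNOWN_WORD = [0.0] * WORD_DIM
char_table = {c: random_vector(CHAR_DIM)
              for c in "abcdefghijklmnopqrstuvwxyz0123456789"}

def char_vector(token):
    """Average the embeddings of the known characters in the token."""
    vectors = [char_table[c] for c in token.lower() if c in char_table]
    if not vectors:
        return [0.0] * CHAR_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def token_representation(token):
    """Concatenate the word-level and character-level parts."""
    word_part = word_table.get(token.lower(), UNKNOWN_WORD)
    return word_part + char_vector(token)

seen = token_representation("aspirin")
unseen = token_representation("aspirin325")  # out-of-vocabulary token
print(len(seen), len(unseen))
```

For an out-of-vocabulary token such as aspirin325, the word part falls back to the unknown vector, whereas the character part still carries information about the token's spelling.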
Concerning RE, the rule-based method achieved higher accuracy than the context-aware LSTM for most relations. Interestingly, the LSTM model performed notably better on some of the most challenging relations (eg, reason-drug).
In future work, we aim to investigate contextual embeddings, such as ELMo and BERT, which have been shown to provide considerable improvements in other tasks that involve complex language structures, ambiguous word use, and words unseen in training. We also plan to assess the performance and transferability of the models across different biomedical data sets and tasks.
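The rules used in this study are not restated in this section. As a generic illustration of what a distance-based linking rule can look like, the sketch below attaches each attribute to the closest preceding drug mention in the same sentence and falls back to the closest following one. The rule, the function name `link_attributes`, the example sentence, and the offsets are all assumptions made for illustration.

```python
def link_attributes(entities):
    """Link each attribute to a drug mention in the same sentence.

    entities: list of (text, type, start, end) tuples.
    Hypothetical rule: prefer the closest drug that ends before the attribute
    starts; otherwise use the closest drug that starts after it ends.
    """
    drugs = [e for e in entities if e[1] == "Drug"]
    relations = []
    for text, etype, start, end in entities:
        if etype == "Drug":
            continue
        preceding = [d for d in drugs if d[3] <= start]
        following = [d for d in drugs if d[2] >= end]
        if preceding:
            drug = max(preceding, key=lambda d: d[3])
        elif following:
            drug = min(following, key=lambda d: d[2])
        else:
            continue  # no drug in the sentence, nothing to link
        relations.append((f"{etype}-Drug", text, drug[0]))
    return relations

# Made-up sentence: "aspirin 325 mg daily and digoxin 0.25 mg"
sentence = [
    ("aspirin", "Drug", 0, 7),
    ("325 mg", "Strength", 8, 14),
    ("daily", "Frequency", 15, 20),
    ("digoxin", "Drug", 25, 32),
    ("0.25 mg", "Strength", 33, 40),
]

relations = link_attributes(sentence)
for relation in relations:
    print(relation)
```

A rule of this kind has no access to the wording between the 2 mentions, which is one plausible reason a learned, context-aware model could do better on relations such as reason-drug.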
Finally, the medication NER and RE tasks are important not only from a research perspective but also because they have applications as steps in practical information extraction pipelines. The current level of performance indicates that these models should be good enough for large-scale statistical and epidemiological studies. However, applications that require patient-specific information may need NER systems with even higher recall and precision, ensemble or multiple-step systems (ie, systems that combine the output of multiple classifiers), or semiautomated verification of the output.
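One simple way to combine the output of multiple classifiers is per-token majority voting. The following sketch is an illustration only and was not evaluated in this study; the function name `majority_vote` and the 3 model outputs are hypothetical, and plain type labels are used instead of a BIOES scheme for brevity.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label sequences from several classifiers, token by token.

    predictions: list of label sequences of equal length, one per classifier.
    Ties go to the label seen first, ie, to the earliest classifier in the list.
    """
    voted = []
    for labels in zip(*predictions):
        label, _ = Counter(labels).most_common(1)[0]
        voted.append(label)
    return voted

tokens = ["Meropenem", "500", "mg", "Intravenous"]
model_a = ["Drug", "Strength", "Strength", "Route"]
model_b = ["Drug", "Dosage", "Dosage", "Route"]
model_c = ["Drug", "Dosage", "Dosage", "Form"]

combined = majority_vote([model_a, model_b, model_c])
print(list(zip(tokens, combined)))
```

With a BIOES tagging scheme, token-level voting can produce invalid label sequences, so a practical ensemble would likely vote on whole entity spans or repair the sequence afterward.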
This work was partially supported by the Saudi Arabian Ministry of Education, the Saudi Arabian Cultural Bureau in London, and the Healthcare Text Analytics Network (Heal-tex, grant EP/N027280/1, funded by the UK Engineering and Physical Sciences Research Council). The authors would like to thank Sumithra Velupillai and Natalia Viani (King’s College London) for their discussions on the error analysis. The authors would also like to acknowledge the help from Haifa Alrdahi and Nikola Milosevic (University of Manchester) during their participation in the n2c2 shared task.
GA and MB conducted the experiments and analyzed their output. GA drafted the manuscript. NP and GN revised the manuscript. All authors read and approved the final version of the manuscript. GN and NP supervised all steps of the work.
Conflicts of Interest
- Abhyankar S, Demner-Fushman D, Callaghan FM, McDonald CJ. Combining structured and unstructured data to identify a cohort of ICU patients who received dialysis. J Am Med Inform Assoc 2014;21(5):801-807 [FREE Full text] [CrossRef] [Medline]
- Evans DA, Brownlow ND, Hersh WR, Campbell EM. Automating concept identification in the electronic medical record: an experiment in extracting dosage information. Proc AMIA Annu Fall Symp 1996:388-392 [FREE Full text] [Medline]
- Karystianis G. Extraction and representation of key characteristics from epidemiological literature. The University of Manchester. 2014. URL: https://tinyurl.com/bv927sft [accessed 2021-03-31]
- MacKinlay AD, Verspoor KM. Extracting structured information from free-text medication prescriptions using dependencies. In: Proceedings of the ACM sixth international workshop on Data and text mining in biomedical informatics. 2012 Presented at: CIKM'12: 21st ACM International Conference on Information and Knowledge Management; October, 2012; Maui Hawaii USA p. 35-40. [CrossRef]
- Sohn S, Clark C, Halgrim SR, Murphy SP, Chute CG, Liu H. MedXN: an open source medication extraction and normalization tool for clinical text. J Am Med Inform Assoc 2014;21(5):858-865 [FREE Full text] [CrossRef] [Medline]
- Spasic I, Sarafraz F, Keane JA, Nenadic G. Medication information extraction with linguistic pattern matching and semantic rules. J Am Med Inform Assoc 2010;17(5):532-535 [FREE Full text] [CrossRef] [Medline]
- Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc 2010;17(5):514-518 [FREE Full text] [CrossRef] [Medline]
- Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc 2010;17(1):19-24 [FREE Full text] [CrossRef] [Medline]
- Yang H. Automatic extraction of medication information from medical discharge summaries. J Am Med Inform Assoc 2010;17(5):545-548 [FREE Full text] [CrossRef] [Medline]
- Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A. CHEMDNER: the drugs and chemical names extraction challenge. J Cheminform 2015 Jan 19;7(S1). [CrossRef]
- Segura-Bedmar I, Martínez P, Herrero-Zazo M. SemEval-2013 Task 9: extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) and Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2. 2013 Presented at: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2 and Seventh International Workshop on Semantic Evaluation (SemEval 2013); June, 2013; Atlanta, Georgia, USA p. 341-350.
- Jagannatha A, Liu F, Liu W, Yu H. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Saf 2019 Jan;42(1):99-111 [FREE Full text] [CrossRef] [Medline]
- Henry S, Buchan K, Filannino M, Stubbs A, Uzuner O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J Am Med Inform Assoc 2020 Jan 01;27(1):3-12 [FREE Full text] [CrossRef] [Medline]
- Karystianis G, Sheppard T, Dixon WG, Nenadic G. Modelling and extraction of variability in free-text medication prescriptions from an anonymised primary care electronic medical record research database. BMC Med Inform Decis Mak 2016 Mar 09;16:18 [FREE Full text] [CrossRef] [Medline]
- Leaman R, Khare R, Lu Z. Challenges in clinical natural language processing for automated disorder normalization. J Biomed Inform 2015 Oct;57:28-37 [FREE Full text] [CrossRef] [Medline]
- Kolárik C, Hofmann-Apitius M, Zimmermann M, Fluck J. Identification of new drug classification terms in textual resources. Bioinformatics 2007 Jul 01;23(13):264-272. [CrossRef] [Medline]
- Chhieng D, Day T, Gordon G, Hicks J. Use of natural language programming to extract medication from unstructured electronic medical records. AMIA Annu Symp Proc 2007 Oct 11:908. [Medline]
- Sirohi E, Peissig P. Study of effect of drug lexicons on medication extraction from electronic medical records. Pac Symp Biocomput 2005:308-318 [FREE Full text] [CrossRef] [Medline]
- Lowe DM, Sayle RA. LeadMine: a grammar and dictionary driven approach to entity recognition. J Cheminform 2015 Jan 19;7(S1). [CrossRef]
- Gold S, Elhadad N, Zhu X, Cimino JJ, Hripcsak G. Extracting structured medication event information from discharge summaries. AMIA Annu Symp Proc 2008 Nov 06:237-241 [FREE Full text] [Medline]
- Hamon T, Grabar N. Linguistic approach for identification of medication names and related information in clinical narratives. J Am Med Inform Assoc 2010;17(5):549-554 [FREE Full text] [CrossRef] [Medline]
- Xu R, Morgan A, Das AK, Garber A. Investigation of unsupervised pattern learning techniques for bootstrap construction of a medical treatment lexicon. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. 2009 Presented at: BioNLP '09: Workshop on Current Trends in Biomedical Natural Language Processing; June 4-5, 2009; Boulder, Colorado p. 63-70. [CrossRef]
- Patrick J, Li M. High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge. J Am Med Inform Assoc 2010;17(5):524-527 [FREE Full text] [CrossRef] [Medline]
- Leaman R, Wei C, Lu Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform 2015 Jan 19;7(S1). [CrossRef]
- Lu Y, Ji D, Yao X, Wei X, Liang X. CHEMDNER system with mixed conditional random fields and multi-scale word clustering. J Cheminform 2015 Jan 19;7(S1). [CrossRef]
- Campos D, Matos S, Oliveira JL. A document processing pipeline for annotating chemical entities in scientific documents. J Cheminform 2015 Jan 19;7(S1). [CrossRef]
- Lamurias A, Grego T, Couto FM. Chemical compound and drug name recognition using CRFs and semantic similarity based on ChEBI. Washington, DC USA: BioCreative challenge evaluation workshop, vol. 2; 2013. URL: https://biocreative.bioinformatics.udel.edu/media/store/files/2013/bc4_v2_9.pdf [accessed 2021-03-31]
- Sikdar UK, Ekbal A, Saha S. Domain-independent model for chemical compound and drug name recognition. Washington, DC USA: BioCreative Challenge Evaluation Workshop. Vol 2; 2013. URL: https://biocreative.bioinformatics.udel.edu/media/store/files/2013/bc4_v2_22.pdf [accessed 2021-03-31]
- Akhondi SA, Hettne KM, van der Horst E, van Mulligen EM, Kors JA. Recognition of chemical entities: combining dictionary-based and grammar-based approaches. J Cheminform 2015 Jan 19;7(S1). [CrossRef]
- He L, Yang Z, Lin H, Li Y. Drug name recognition in biomedical texts: a machine-learning-based method. Drug Discov Today 2014 May;19(5):610-617. [CrossRef] [Medline]
- Tikk D, Solt I. Improving textual medication extraction using combined conditional random fields and rule-based systems. J Am Med Inform Assoc 2010;17(5):540-544 [FREE Full text] [CrossRef] [Medline]
- Korkontzelos I, Piliouras D, Dowsey AW, Ananiadou S. Boosting drug named entity recognition using an aggregate classifier. Artif Intell Med 2015 Oct;65(2):145-153 [FREE Full text] [CrossRef] [Medline]
- Liu Z, Yang M, Wang X, Chen Q, Tang B, Wang Z, et al. Entity recognition from clinical texts via recurrent neural network. BMC Med Inform Decis Mak 2017 Jul 05;17(Suppl 2):67 [FREE Full text] [CrossRef] [Medline]
- Jagannatha AN, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. Proc Conf Empir Methods Nat Lang Process 2016 Nov;2016:856 [FREE Full text] [CrossRef] [Medline]
- Yang X, Bian J, Fang R, Bjarnadottir RI, Hogan WR, Wu Y. Identifying relations of medications with adverse drug events using recurrent convolutional neural networks and gradient boosting. J Am Med Inform Assoc 2020 Jan 01;27(1):65-72 [FREE Full text] [CrossRef] [Medline]
- Wei Q, Ji Z, Li Z, Du J, Wang J, Xu J, et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J Am Med Inform Assoc 2020 Jan 01;27(1):13-21 [FREE Full text] [CrossRef] [Medline]
- Ju M, Nguyen NT, Miwa M, Ananiadou S. An ensemble of neural models for nested adverse drug events and medication extraction with subwords. J Am Med Inform Assoc 2020 Jan 01;27(1):22-30 [FREE Full text] [CrossRef] [Medline]
- Dai HJ, Su CH, Wu CS. Adverse drug event and medication extraction in electronic health records via a cascading architecture with different sequence labeling models and word embeddings. J Am Med Inform Assoc 2020 Jan 01;27(1):47-55 [FREE Full text] [CrossRef] [Medline]
- Oleynik M, Kugic A, Kasáč Z, Kreuzthaler M. Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification. J Am Med Inform Assoc 2019 Nov 01;26(11):1247-1254 [FREE Full text] [CrossRef] [Medline]
- Christopoulou F, Tran TT, Sahu SK, Miwa M, Ananiadou S. Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods. J Am Med Inform Assoc 2020 Jan 01;27(1):39-46 [FREE Full text] [CrossRef] [Medline]
- Kim Y, Meystre SM. Ensemble method-based extraction of medication and related information from clinical texts. J Am Med Inform Assoc 2020 Jan 01;27(1):31-38 [FREE Full text] [CrossRef] [Medline]
- Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. 2013 Presented at: 26th International Conference on Neural Information Processing Systems - Volume 2; December 2013; Lake Tahoe, Nevada, United States p. 3111-3119 URL: http://dl.acm.org/citation.cfm?id=2999792.2999959
- Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014 Presented at: Conference on Empirical Methods in Natural Language Processing (EMNLP); October, 2014; Doha, Qatar. [CrossRef]
- Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Linguistics 2017 Dec;5:135-146. [CrossRef]
- Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018 Presented at: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); June, 2018; New Orleans, Louisiana p. 2227-2237. [CrossRef]
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. 2018. URL: https://arxiv.org/abs/1810.04805 [accessed 2021-03-31]
- n2c2 NLP research data sets. Harvard Medical School. 2018. URL: https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ [accessed 2021-03-31]
- Johnson AE, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016 May 24;3 [FREE Full text] [CrossRef] [Medline]
- Soysal E, Wang J, Jiang M, Wu Y, Pakhomov S, Liu H, et al. CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc 2018 Mar 01;25(3):331-336 [FREE Full text] [CrossRef] [Medline]
- Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17(5):507-513 [FREE Full text] [CrossRef] [Medline]
- Kocmi T, Bojar O. SubGram: extending skip-gram word representation with substrings. In: Text, Speech, and Dialogue. Switzerland: Springer; 2016:182-189.
- Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv. 2013. URL: https://arxiv.org/abs/1301.3781 [accessed 2021-03-31]
- Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc 2017 May 01;24(3):596-606 [FREE Full text] [CrossRef] [Medline]
- Luo Y, Cheng Y, Uzuner O, Szolovits P, Starren J. Segment convolutional neural networks (Seg-CNNs) for classifying relations in clinical notes. J Am Med Inform Assoc 2018 Jan 01;25(1):93-98 [FREE Full text] [CrossRef] [Medline]
- Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012. URL: https://www.jmlr.org/papers/v13/bergstra12a.html [accessed 2021-03-31]
- Sorokin D, Gurevych I. Context-aware representations for knowledge base relation extraction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017 Presented at: Conference on Empirical Methods in Natural Language Processing; September, 2017; Copenhagen, Denmark p. 1784-1789. [CrossRef]
- Cunningham H, Tablan V, Roberts A, Bontcheva K. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics. PLoS Comput Biol 2013;9(2) [FREE Full text] [CrossRef] [Medline]
- Xu J, Lee H, Ji Z, Wang J, Wei Q, Xu H. UTH_CCB system for adverse drug reaction extraction from drug labels at TAC-ADR. 2017. URL: https://tinyurl.com/645sksnd [accessed 2021-03-31]
- Demner-Fushman D, Mork JG, Rogers WJ, Shooshan SE, Rodriguez L, Aronson AR. Finding medication doses in the literature. AMIA Annu Symp Proc 2018;2018:368-376 [FREE Full text] [Medline]
ADE: adverse drug event
BERT: Bidirectional Encoder Representations from Transformers
BiLSTM: bidirectional long-short term memory
BiLSTM-CRF: bidirectional long-short term memory with conditional random field
BIOES: Begin, Inside, Outside, End, Single
CE: character embedding
CLAMP: Clinical Language Annotation, Modeling, and Processing Toolkit
CRF: conditional random field
cTAKES: Clinical Text Analysis and Knowledge Extraction System
EHR: electronic health record
ELMo: Embeddings from Language Models
LSTM: long-short term memory
MIMIC-III: Medical Information Mart for Intensive Care III
n2c2: National NLP Clinical Challenges
NER: named entity recognition
NLP: natural language processing
PWE: pretrained word embedding
RE: relation extraction
RIWE: randomly initialized word embedding
SFE: semantic-feature embedding
WE: word embedding
Edited by C Lovis; submitted 30.09.20; peer-reviewed by S Fu, M Torii; comments to author 03.11.20; revised version received 15.02.21; accepted 20.02.21; published 05.05.21
Copyright
©Ghada Alfattni, Maksim Belousov, Niels Peek, Goran Nenadic. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 05.05.2021.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.