This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Many drugs do not work the same way for everyone owing to genetic differences between individuals. Pharmacogenomics (PGx) aims to understand how genetic variants influence drug efficacy and toxicity. It is often considered one of the most actionable areas of the personalized medicine paradigm. However, little prior work has included in-depth explorations and descriptions of drug usage, dosage adjustment, and related details.
We present a pharmacogenomics knowledge model to discover the hidden relationships between PGx entities such as drugs, genes, and diseases, with particular attention to the details of precise medication.
PGx open data such as DrugBank and RxNorm were integrated in this study, as well as drug labels published by the US Food and Drug Administration. We annotated 190 drug labels manually for entities and relationships. Based on the annotation results, we trained 3 different natural language processing models to complete entity recognition. Finally, the pharmacogenomics knowledge model was described in detail.
In entity recognition tasks, the Bidirectional Encoder Representations from Transformers–conditional random field model achieved better performance with micro-F1 score of 85.12%. The pharmacogenomics knowledge model in our study included 5 semantic types: drug, gene, disease, precise medication (population, daily dose, dose form, frequency, etc), and adverse reaction. Meanwhile, 26 semantic relationships were defined in detail. Taking melanoma caused by a
We highlighted the pharmacogenomics knowledge model as a scalable framework for clinicians and clinical pharmacists to adjust drug dosage according to patient-specific genetic variation, and for pharmaceutical researchers to develop new drugs. In the future, a series of other antitumor drugs and automatic relation extractions will be taken into consideration to further enhance our framework with more PGx linked data.
The field of pharmacogenomics (PGx) has developed rapidly since the initial scientific discoveries of genetic characteristics affecting individual response to drugs or other agents [
As of June 2019, more than 190 drugs [
Named entity recognition (NER) is a basic tool for natural language processing (NLP) tasks such as information extraction, question answering system, syntactic analysis, and machine translation. Its main goal is identifying entities with specific meaning in the text, mainly including people’s names, place names, organization names, proper nouns, etc. It is the foundation of identifying semantic relationships between entities and filling a knowledge base.
The common statistical models of NER mainly include the Hidden Markov Model [
In 2018, Devlin et al [
The Knowledge Representation Model can be understood as a structured set of directed graphs, in which the nodes represent entities or concepts and the edges represent the semantic relationships between them. In the development of knowledge representation, semantic networks, ontologies, and knowledge graphs/models have been most commonly used in the field of biomedical science.
A semantic network [
An ontology is a formal explicit description of concepts in a domain, properties of each concept, various features and attributes, and restrictions on these properties [
A knowledge graph/model emphasizes data cleaning and knowledge fusion, and its essence is a semantic network, which allows access to knowledge inference. Since this concept was put forward by Google in 2012 [
Overall, the knowledge graph/model technology provides a means to extract structured knowledge from massive texts and images. It has broad applications in the biomedical field and can promote intelligent semantic retrieval, medical question answering, clinical decision support, and many other scenarios.
With the rapid growth and accumulation of massive PGx data, there is an increasing need for scientific data collecting, organizing, modeling, and mining. These data reflect a hierarchy of relationships and detailed information between biomedical entities. Currently, the semantic types and relationships involved in PGx knowledge representation are usually limited to drug, gene, and disease.
Drug2Gene [
Bo et al [
Dalleau et al [
Kim et al developed DigSee [
However, no in-depth explorations and descriptions of personalized medication, such as drug usage, dosage adjustment, and applicable population, currently exist. Applying the knowledge model to the field of PGx is therefore worthwhile, as it will assist clinicians and clinical pharmacists in precise medication.
In this study, we proposed the following 2 objectives:
We aimed to present a pharmacogenomics knowledge model consisting of 5 semantic types related to PGx and precision medication, and also give definitions of relationships between these entities. The model mostly focuses on anticancer drugs, drug usage, and adjustments of daily dosage.
We aimed to semiautomatically construct PGx corpora, which are relatively rare in the existing research, and make them open access. The NLP algorithms for PGx NER were also trained for facilitating corpus annotation.
There are 3 main steps in our study (
Data preparation: Data related to PGx were collected from DailyMed, DrugBank, and RxNorm.
Data processing: Manual annotation of PGx entities and relationships was applied to drug labels in PDF/XML format from DailyMed. The BERT–CRF model was trained for entity recognition in this study. Data from DrugBank and RxNorm were also downloaded, parsed, and extracted for more drug attributes and relationships.
Model construction: The PGx knowledge model was described based on the extracted entities and relationships. Melanoma was also used as an example to verify the accuracy and validity of our model.
The framework of our study.
Data related to PGx, which are currently stored in DrugBank, PharmGKB, the Comparative Toxicogenomics Database (CTD), RxNorm, and other databases, needed to be collected and integrated in this study. Based on the pharmacogenomics knowledge model built in our study, we chose the following 3 data sources for data crawling and data preparation.
The text of drug labels was obtained from DailyMed, which is a free drug information resource [
DrugBank is a unique bioinformatics and cheminformatics resource that combines detailed drug (ie, chemical, pharmacological, and pharmaceutical) data with comprehensive drug target (ie, sequence, structure, and pathway) information [
RxNorm [
We recruited 3 annotators, all of whom had a medical training background and curation experience. Each drug label was annotated independently by 2 annotators (ie, double annotation). Differences were resolved by a third and senior annotator. Besides this, we measured agreement of relationship annotations using the
Because all 190 drug labels in the FDA table of PGx biomarkers [
The main tasks involved in the annotation stage were the recognition of semantic types and semantic relationships from drug label sections, including “Indications and Usage,” “Dosage and Administration,” “Use in Specific Populations,” “Warnings and Precautions,” and “Adverse Reactions.” For semantic types, different highlight colors represented different entities according to the frame of the PGx knowledge model. In this work, drug was annotated in yellow, gene in red, disease in gray, dosage and dose form in green, adverse reaction in purple, and population in blue. For semantic relationships, the more important and difficult part, annotators read the drug labels and manually recorded the relation descriptions between diseases and drugs, diseases and genes, diseases and diseases, drugs and genes, drugs and drugs, and drugs and dosage. This formed the basis of relationship definition in the follow-up work. Before annotation, we also indicated the annotation guidelines, see in
An example of drug label annotation is shown in
Annotation guidelines.
Annotation example of MEKINIST.
After the annotation of entities, we applied the BERT–CRF model for NER. The CRF model and BERT–Bi-LSTM–CRF model were also trained in our study as a comparison.
The BERT–CRF architecture was composed of 4 sections: the input layer, the pretraining model, the fully connected layer, and the CRF layer, which assigns a tag to each word based on its context in the output (
The BERT-Base Multilingual model, which has 110M parameters, was used in this NER task. We set the training batch size to 32, the max_seq to 80, and the learning rate to 0.00001. A total of 10 epochs were trained in each iteration to ensure model convergence. Other parameters related to BERT were set to default values. The dropout rate was set to 0.9 in the fully connected layers to prevent overfitting. The transfer matrix in the CRF layer was learned by the model itself. For comparison, a Bi-LSTM layer was also added before feeding the token-level representation into the CRF layer, so that the performance of BERT–CRF with and without Bi-LSTM could be compared.
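The role of the learned transfer (transition) matrix in the CRF layer can be illustrated with a minimal Viterbi decoding sketch in plain Python. This is not the implementation used in the study: the tag set, the per-token scores, and the transition scores below are made-up values chosen only to show how the transition matrix can override a locally preferred but inconsistent tag.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[i][t]   : score of tag t at token i (from the encoder).
    transitions[a][b] : score of moving from tag a to tag b (CRF parameters).
    """
    if not emissions:
        return []
    # best[t] = best score of any path that ends in tag t at the current token
    best = {t: emissions[0][t] for t in tags}
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[p][cur])
            new_best[cur] = best[prev] + transitions[prev][cur] + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example with made-up scores (not values from the study).
TAGS = ["O", "B-Drug", "I-Drug"]
TRANSITIONS = {
    "O": {"O": 0.0, "B-Drug": 0.0, "I-Drug": -10.0},  # I-Drug should not follow O
    "B-Drug": {"O": 0.0, "B-Drug": -1.0, "I-Drug": 1.0},
    "I-Drug": {"O": 0.0, "B-Drug": -1.0, "I-Drug": 0.5},
}
EMISSIONS = [
    {"O": 2.0, "B-Drug": 0.1, "I-Drug": 0.0},  # "take"
    {"O": 0.5, "B-Drug": 1.0, "I-Drug": 1.2},  # "trametinib"
    {"O": 2.0, "B-Drug": 0.0, "I-Drug": 0.1},  # "daily"
]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(decoded)
```

In this toy input the second token scores highest as I-Drug when considered alone, but the penalty on the O to I-Drug transition makes B-Drug the better choice for the sequence as a whole.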
BERT–CRF architecture. BERT: Bidirectional Encoder Representations from Transformers; CRF: Conditional Random Field.
We extended the semantic types of our model from 3 common types of drug, gene, and disease to 5 types: drug, gene (gene name, gene mutation), disease (disease name, position, etc), precise medication (population, daily dose, dose form, frequency, take time for, take with a meal or not, etc), and adverse reaction.
All the semantic types and attributes covered in pharmacogenomics knowledge model are shown in
The entity model in the pharmacogenomics knowledge model was defined as follows, where EID represents the unique identifier for entities:
Entity={EID*,TERM*,Source,SEMANTICType*} (1)
The relationship model in the pharmacogenomics knowledge model was defined as follows, where RID represents the unique identifier for relationships:
Relation={RID*,Relationship*,Domain*,Range*,Definition,TreeNumber*} (2)
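The entity and relationship definitions in equations (1) and (2) can be sketched as Python data classes. This is an illustrative reading only: we assume here that the asterisk marks required fields, and the identifiers and tree number in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Mirrors Entity={EID*,TERM*,Source,SEMANTICType*}."""
    eid: str                      # EID*: unique identifier of the entity
    term: str                     # TERM*
    semantic_type: str            # SEMANTICType*, e.g. "Drug", "Gene", "Disease"
    source: Optional[str] = None  # Source (assumed optional: no asterisk)


@dataclass(frozen=True)
class Relation:
    """Mirrors Relation={RID*,Relationship*,Domain*,Range*,Definition,TreeNumber*}."""
    rid: str                          # RID*: unique identifier of the relationship
    relationship: str                 # Relationship*, e.g. "Treats"
    domain: str                       # Domain*: semantic type of the subject
    range: str                        # Range*: semantic type of the object
    tree_number: str                  # TreeNumber*
    definition: Optional[str] = None  # Definition (assumed optional: no asterisk)


# Hypothetical identifiers, for illustration only.
drug = Entity(eid="E0001", term="trametinib", semantic_type="Drug", source="DrugBank")
treats = Relation(rid="R0001", relationship="Treats", domain="Drug",
                  range="Disease", tree_number="1")
```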
The whole pharmacogenomics knowledge model can be represented as the risk factors of precision medication for cancers. In this model, disease (C, especially cancer in this paper) is usually caused by gene mutations (G), and the disease and mutation together determine the target drug (Dr) for treatment.
Dr = F(C,G) (3)
During treatment, the routine dosage/dose form (Ds) is already provided by the FDA drug labels. However, it differs when the patient has an adverse reaction (A) or belongs to a special population (P), such as pregnant, lactating, pediatric, or geriatric patients. Assuming that the 4 factors are independent in some cases, each factor can affect dosage/dose form separately.
Ds = F(Dr,G,A,P) (4)
Overall, gene mutation, disease, adverse reaction, and patient population are the factors in the pharmacogenomics knowledge model that determine the drug to be used and, in particular, the suitable dosage and dose form.
Dr, Ds=F(C,G,A,P) (5)
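One way to read equations (3) to (5) is as lookups over the knowledge model. The sketch below is a minimal illustration under that assumption: the function names are ours, and every key and value in the tables is a placeholder rather than a clinical recommendation or a value taken from the drug labels.

```python
from typing import Dict, List, Optional, Tuple

# Placeholder tables; none of these entries come from the study.
TARGET_DRUG: Dict[Tuple[str, str], str] = {
    ("cancer_A", "mutation_M"): "drug_X",  # Dr = F(C, G)
}
ROUTINE_DOSAGE: Dict[str, str] = {"drug_X": "routine dosage"}
# (drug, factor kind, factor value) -> dosage adjustment
ADJUSTMENTS: Dict[Tuple[str, str, str], str] = {
    ("drug_X", "gene", "mutation_N"): "adjusted dosage for mutation_N",
    ("drug_X", "adverse_reaction", "reaction_R"): "reduced dosage",
    ("drug_X", "population", "population_P"): "dosage for population_P",
}


def select_drug(cancer: str, gene: str) -> Optional[str]:
    """Equation (3): the target drug depends on the cancer and the gene mutation."""
    return TARGET_DRUG.get((cancer, gene))


def select_dosage(
    drug: str,
    gene: Optional[str] = None,
    adverse_reaction: Optional[str] = None,
    population: Optional[str] = None,
) -> List[str]:
    """Equation (4): each factor is checked separately (independence assumption).

    Returns every adjustment that applies, or the routine dosage if none does.
    Raises KeyError for a drug that has no routine dosage in the table.
    """
    factors = (
        ("gene", gene),
        ("adverse_reaction", adverse_reaction),
        ("population", population),
    )
    adjustments = [
        ADJUSTMENTS[(drug, kind, value)]
        for kind, value in factors
        if value is not None and (drug, kind, value) in ADJUSTMENTS
    ]
    return adjustments or [ROUTINE_DOSAGE[drug]]


chosen = select_drug("cancer_A", "mutation_M")
print(chosen, select_dosage(chosen, adverse_reaction="reaction_R"))
```

Returning all applicable adjustments, rather than picking one, avoids implying a precedence between the factors that the model does not state.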
Semantic types and attributes in the knowledge model.
Semantic Type | Entity/Attribute |
Drug | Drug Name, Description, Chemical Formula, Molecular Weight, Drug Approval Status, CASa, UNIIb, Pharmacology Indication |
Gene | Gene name, Mutation |
Disease | Disease Name, Position |
Adverse Reaction | N/Ac |
Population | Pediatric Use Population, Applicable Population, Gender, Age, Race |
Drug Use | Daily dose, Dose form, Frequency, Take time for, Take with a meal or not, etc |
aCAS: Chemical Abstracts Service Number.
bUNII: Unique Ingredient Identifier.
cN/A: not available.
In this paper, we have collected 4067 drug labels in XML format downloaded from DailyMed as pretraining data for the BERT–CRF architecture, and 190 drug labels after annotation for model representation in which 90% (n=171) form the training set and 10% (n=19) form the test set, randomly assigned. Statistics-annotated corpus are presented in
Number of entities in training and test sets.
Entity | Number of entities in the training set | Number of entities in the test set |
Drug | 76 | 31 |
Gene | 60 | 26 |
Disease | 94 | 33 |
Body_Part | 23 | 7 |
Daily_Dose | 99 | 27 |
Dose_Form | 16 | 8 |
Frequency | 32 | 12 |
Adverse_Reaction | 372 | 77 |
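The random 90%/10% assignment of the 190 annotated drug labels can be sketched as follows. The label identifiers and the random seed are hypothetical; the study does not report how the random assignment was implemented.

```python
import random
from typing import List, Sequence, Tuple


def split_labels(
    label_ids: Sequence[str], test_fraction: float = 0.10, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the labels and return (training set, test set)."""
    ids = list(label_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical identifiers standing in for the 190 annotated drug labels.
labels = [f"label_{i:03d}" for i in range(190)]
train_set, test_set = split_labels(labels)
print(len(train_set), len(test_set))
```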
Three basic models are compared, with the specific results shown in
Performance of the models.
Model | Precision (%) | Recall (%) | F1 (%) |
CRFa | 88.03 | 73.57 | 80.16 |
BERT–CRFb | 85.12 | 85.12 | 85.12 |
BERT–Bi-LSTM–CRFc | 85.22 | 81.00 | 83.05 |
aCRF: Conditional Random Field.
bBERT: Bidirectional Encoder Representations from Transformers
cBi-LSTM: Bidirectional Long Short-Term Memory.
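As a quick consistency check, the F1 values in the table can be recomputed from the reported precision and recall, assuming F1 is the harmonic mean of the two. The recomputed values agree with the reported ones to within about 0.01 percentage points, which is what rounding of the published precision and recall would be expected to produce.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) taken from the performance table.
REPORTED = {
    "CRF": (88.03, 73.57, 80.16),
    "BERT-CRF": (85.12, 85.12, 85.12),
    "BERT-Bi-LSTM-CRF": (85.22, 81.00, 83.05),
}

for model, (p, r, reported_f1) in REPORTED.items():
    print(f"{model}: recomputed F1 = {f1_score(p, r):.2f}, reported F1 = {reported_f1:.2f}")
```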
Performance of the semantic type.
Semantic type | CRFa F1 (%) | BERT–Bi-LSTM–CRFb,c F1 (%) | BERT–CRF F1 (%) |
Drug | 94.12 | 94.12 | 100.00 |
Gene | 66.67 | 80.00 | 71.43 |
Disease | 61.54 | 66.67 | 57.14 |
Body_Part | 57.14 | 57.15 | 85.71 |
Daily_Dose | 31.58 | 31.58 | 42.11 |
Dose_Form | 100.00 | 100.00 | 100.00 |
Frequency | 62.50 | 75.00 | 75.00 |
Adverse Reaction | 68.15 | 79.00 | 73.74 |
aCRF: Conditional Random Field.
bBERT: Bidirectional Encoder Representations from Transformers
cBi-LSTM: Bidirectional Long Short-Term Memory.
Because this study required a high accuracy of relationship extraction, we adopted a manual method in this task. Descriptions of semantic relationships were normalized at the same time during annotation, such as “in combination with” = “synergized by,” “recommended dosage” = “routine dosage.” The normalized descriptions are presented in
In the end, 26 kinds of semantic relationships were extracted, and the consistency of the entity relationship annotation was 78.55%. Among them, there were 14 first-level semantic relationships and 12 second-level semantic relationships. Each kind of semantic relationship has been defined in detail, as shown in the accessory document.
Examples of semantic relationship–normalized description.
Normalized description | Expressions in drug labels |
Treats | for the prevention of, for relief of the signs and symptoms, for the treatment of, as monotherapy of |
Synergized by | in combination with, coadministered with |
Antagonized by | avoid concurrent administration of, avoid concomitant use of |
Have dosage | total daily doses, recommended dosage |
Have mutation | with *** mutation, the presence of *** mutation, be homozygous for |
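The normalization in the table above was performed manually during annotation. Purely as an illustration, the same mapping could be expressed as a small pattern lookup; the function name is ours, and treating `***` as a single-word wildcard is an assumption about how the placeholder is meant to be read.

```python
import re
from typing import Dict, List, Optional, Tuple

# Expressions copied from the table of normalized descriptions.
EXPRESSIONS: Dict[str, List[str]] = {
    "Treats": ["for the prevention of", "for relief of the signs and symptoms",
               "for the treatment of", "as monotherapy of"],
    "Synergized by": ["in combination with", "coadministered with"],
    "Antagonized by": ["avoid concurrent administration of", "avoid concomitant use of"],
    "Have dosage": ["total daily doses", "recommended dosage"],
    "Have mutation": ["with *** mutation", "the presence of *** mutation",
                      "be homozygous for"],
}


def _compile(expression: str) -> "re.Pattern":
    # "***" is treated as a wildcard for one word (an assumption).
    parts = [re.escape(part) for part in expression.split("***")]
    return re.compile(r"\S+".join(parts), re.IGNORECASE)


PATTERNS: List[Tuple["re.Pattern", str]] = [
    (_compile(expression), normalized)
    for normalized, expressions in EXPRESSIONS.items()
    for expression in expressions
]


def normalize_relation(text: str) -> Optional[str]:
    """Return the first normalized description whose expression occurs in text."""
    for pattern, normalized in PATTERNS:
        if pattern.search(text):
            return normalized
    return None


print(normalize_relation("indicated for the treatment of melanoma"))
```

A lookup like this returns only the first match, so a sentence that expresses several relationships would still need manual review.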
Based on the entity recognition and relationship definitions mentioned above, the pharmacogenomics knowledge model is presented as
Overview of pharmacogenomics knowledge model.
Melanoma is a malignant neoplasm derived from cells that are capable of forming melanin, and it may occur in the skin of any part of the body. It frequently metastasizes widely, and the regional lymph nodes, liver, lungs, and brain are likely to be involved. The incidence of malignant skin melanomas is rising rapidly in all parts of the world. Therefore, melanoma, which is caused by
Seven drugs were included in the cases: binimetinib, cobimetinib, dabrafenib, encorafenib, nivolumab, trametinib, and vemurafenib. Most were newly indicated for the treatment of unresectable or metastatic melanoma with
By studying the 7 drugs, 4846 triples were established in the pharmacogenomics knowledge model of melanoma: 4713 were drug–drug relationships, 41 were drug–adverse reaction, 30 were drug–dosage, 24 were adverse reaction–dosage, 22 were drug–disease, 7 were drug–gene, 4 were drug–population, 2 were gene–mutation, and 3 were gene–disease. An example of data visualization of trametinib can be seen in
An example of pharmacogenomics knowledge model data visualization.
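The triple counts reported for the 7 melanoma drugs can be tallied with a short script, which also confirms that the per-relationship counts add up to the stated total of 4846. The counts are those given in the text; the variable names are ours.

```python
from collections import Counter

# Number of triples per relationship group, as reported for the melanoma case.
TRIPLE_COUNTS = Counter({
    "drug-drug": 4713,
    "drug-adverse reaction": 41,
    "drug-dosage": 30,
    "adverse reaction-dosage": 24,
    "drug-disease": 22,
    "drug-gene": 7,
    "drug-population": 4,
    "gene-disease": 3,
    "gene-mutation": 2,
})

total_triples = sum(TRIPLE_COUNTS.values())
drug_drug_share = TRIPLE_COUNTS["drug-drug"] / total_triples

print(f"total triples: {total_triples}")
print(f"drug-drug share: {drug_drug_share:.1%}")
```

The tally makes visible that drug–drug relationships account for the large majority of the triples, so the remaining 8 groups together contribute 133 triples.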
We provided a user-friendly interface [
User interface of pharmacogenomics knowledge model data set.
The pharmacogenomics knowledge model constructed in this paper reveals hidden relationships between drug, gene, disease, precise medication, and adverse reaction. Trametinib is used as an example, which is a kinase inhibitor indicated as a single agent for the treatment of BRAF-inhibitor treatment-naïve patients with unresectable or metastatic melanoma with
The pharmacogenomics knowledge model included 9 groups of PGx relationships, which can present more potential information than other relevant data sources such as DrugBank, PharmGKB, CTD, and RxNorm, as shown in
Comparison between pharmacogenomics data sources.
Relationships | DrugBank | PharmGKBd | CTDe | RxNormf | PGxKMg |
Drug–Gene | √a | √ | √ | — | √ |
Drug–Drug | √ | √ | — | — | √ |
Gene–Disease | —c | | √ | — | √ |
Gene–Mutation | — | √ | — | — | √ |
Drug–Disease | √ | √ | √ | — | √ |
Drug–Adverse Reaction | — | — | — | — | √ |
Drug–Dosage | √ | — | — | √ | √ |
Drug–Population | — | — | — | — | √ |
Adverse Reaction–Dosage | — | — | — | — | √ |
aHas structured data that can be downloaded from the website.
bHas information (unstructured data) for such relationships on the website.
cHas no information for such relationships on the website.
dPharmGKB: Pharmacogenomics Knowledge Base.
eCTD: Comparative Toxicogenomics Database.
fRxNorm: drug data interaction standard in American Clinical Information System
gPGxKM: pharmacogenomics knowledge model.
However, there are still some limitations in our study. First, this study aimed to build a pharmacogenomics knowledge model and semiautomatically annotate the corpus using the existing NLP tools. However, we did not validate the feasibility of NLP tools or compare the NLP performance using a benchmark data set, such as clinical records from the Third i2b2 Workshop on NLP Challenges [
In future studies, we plan the following work to improve our research. First, a series of other antitumor drugs, such as ceritinib and afatinib for non–small-cell lung cancer, will be taken into consideration to extend our framework. Second, linked data can also be extended to other sources, such as CTD, PharmGKB, and DisGeNET. We hope that this knowledge model for PGx interactions can serve as a framework and a resource for future drug research and development.
A pharmacogenomics knowledge model was constructed for precision medication in our research, which reflects the multidimensional relationships between drug, gene, and disease, as well as associations from gene to drug to dosage or frequency. PGx entities were extracted using the BERT–CRF model, with an F1 score of 85.12%. Our pharmacogenomics knowledge model contained 5 semantic types (drug, gene, disease, precise medication, and adverse reaction), and 26 semantic relationships were defined in detail. Using melanoma caused by
ATC: anaplastic thyroid cancer
BERT: Bidirectional Encoder Representations from Transformers
Bi-LSTM: bidirectional long short-term memory
CRF: conditional random field
CTD: the Comparative Toxicogenomics Database
FDA: the US Food and Drug Administration
NLM: the US National Library of Medicine
PGx: pharmacogenomics
PharmGKB: Pharmacogenomics Knowledge Base
This work is supported by the Special Research Fund for Central Universities-Peking Union Medical College (Grant No. 3332020049), the National Key Research and Development Program of China (Grant No. 2016YFC0901901), the National Natural Science Foundation of China (Grant No. 81601573), National Engineering Laboratory for Internet Medical Systems and Applications (Grant No. NELIMSA2018P02), the Key Laboratory of Knowledge Technology for Medical Integrative Publishing of China, the program of China Knowledge Center for Engineering Sciences and Technology (Medical Knowledge Service System; Grant No. CKCEST-2019-1-10).
HK designed the model, performed the experiments, and wrote this paper. The study was originally conceived of by JL, who also improved the experiments and made modifications to this paper. HK, MW, and LS designed the annotation framework, made the rules of annotation, and analyzed the results. LH guided the study and made modifications to this paper. All the authors wrote and revised the manuscript, and all the authors have read and approved the final manuscript.
None declared.