This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. The mining of medical concepts is complicated by the broad use of synonyms and nonstandard terms in medical documents.
We present a machine learning model for concept recognition in large unstructured text, which optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.
We present a neural dictionary model that can be used to predict if a phrase is synonymous to a concept in a reference ontology. Our model, called the Neural Concept Recognizer (NCR), uses a convolutional neural network to encode input phrases into a vector space and then rank medical concepts based on their similarity in that space. It uses the hierarchical structure provided by the biomedical ontology as an implicit prior to better learn the embeddings of various terms. We trained our model on two biomedical ontologies: the Human Phenotype Ontology (HPO) and Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT).
We tested our model trained on HPO by using two different data sets: 228 annotated PubMed abstracts and 39 clinical reports. We achieved 1.7%-3% higher F1-scores than those for our strongest manually engineered rule-based baselines (
Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. In addition, most machine learning methods typically require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to obtain for biomedical ontologies. Without relying on large-scale labeled training data or requiring any custom training, our model can be efficiently generalized to new synonyms and performs as well as or better than state-of-the-art methods custom built for specific ontologies.
Automatic recognition of medical concepts in unstructured text is a key component of biomedical information retrieval systems. Its applications include analysis of unstructured text in electronic health records (EHR) [
Many concept recognition and text annotation tools have been developed for biomedical text. Examples of popular tools for general purpose are the NCBO (National Center for Biomedical Ontology) annotator [
On the other hand, in the more general domain of natural language processing, many machine learning–based text classification and NER tools have been recently introduced [
Although these methods address a similar problem, they cannot be used directly for concept recognition, as the number of named entity classes is typically much lower than that of the concepts in medical ontologies. For instance, CoNLL-2003 [
In this paper, we develop a hybrid approach, called Neural Concept Recognizer (NCR), by introducing a neural dictionary model that learns to generalize to novel synonyms for concepts. Our model is trained on the information provided by the ontology, including the concept names, synonyms, and taxonomic relations between the concepts, and can be used to rank the concepts that a given phrase can match as a synonym. Our model consists of two main components: an encoder, which maps an input phrase to a vector representation, and an embedding table, which consists of the vector representations learned for the ontology concepts. The classification is performed based on the similarity between the phrase vector and the concept vectors. To allow for the use of our model to also detect concepts from longer texts, we scan the input text with fixed-size windows and report a phrase as matching a concept if its matching score is above a threshold that is chosen from an appropriate validation data set.
Our work introduces a novel machine learning–based method for automatic concept recognition of medical terms in clinical text, and we have provided empirical results to demonstrate the accuracy of our methods in several settings. We trained our neural dictionary model on the HPO and used it to recognize concepts from 228 PubMed abstracts and 39 clinical reports of patients with rare genetic diseases. Additionally, we used a subset of concepts from SNOMED-CT that have matching terms in ICD-9 and experimented on 2000 Intensive Care Unit (ICU) discharge summaries from a Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III) data set [
Recently, several machine learning methods have been used in biomedical NER or concept recognition. Habibi et al [
Curating such an annotated corpus is more difficult for typical biomedical ontologies, as the corpus has to cover thousands of classes. For example, the HPO contains 11,442 concepts (classes), while, to the best of our knowledge, the only publicly available corpus hand annotated with HPO concepts [
The concepts in an ontology often have a hierarchical structure (ie, a taxonomy), which can be utilized in representation learning. Hierarchies have been utilized in several recent machine learning approaches. Deng et al [
In this section, we first describe the neural dictionary model that computes the likelihood that a given phrase matches each concept from an ontology, and then demonstrate how to apply the model to larger text fragments such as a full sentence, which may have multiple (or no) terms.
The neural dictionary model receives a word or a phrase as input and finds the probability of the concepts in the ontology matching it. The model consists of a text encoder, which is a neural network that maps the query phrase into a vector representation, and an embedding matrix with rows corresponding to the ontology concepts (
Architecture of the neural dictionary model. The encoder is shown at the top, and the procedure for computing the embedding for a concept is illustrated at the bottom. Encoder: a query phrase is first represented by its word vectors, which are then projected by a convolution layer into a new space. Then, a max-over-time pooling layer is used to aggregate the set of vectors into a single one. Thereafter, a fully connected layer maps this vector into the final representation of the phrase. Concept embedding: a matrix of raw embeddings is learned, where each row represents one concept. The final embedding of a concept is retrieved by summing the raw embeddings for that concept and all of its ancestors in the ontology. FC: fully connected.
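The encoder path described in this caption can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `encode_phrase`, the kernel width of one word, the ReLU after the convolution, the linear output layer, and all dimensions and random weights are assumptions made for the example.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x, b):
    """Return W @ x + b for a row-major weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode_phrase(word_vectors, W_conv, b_conv, W_fc, b_fc):
    # Convolution with kernel width 1 (assumed): project each word vector independently.
    projected = [relu(matvec(W_conv, w, b_conv)) for w in word_vectors]
    # Max-over-time pooling: element-wise maximum across the words of the phrase.
    pooled = [max(column) for column in zip(*projected)]
    # Fully connected layer (assumed linear) maps the pooled vector to the phrase representation.
    return matvec(W_fc, pooled, b_fc)

# Toy demo with hypothetical sizes: 3 words, 4-dim word vectors, 5 filters, 2-dim output.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

words = rand_matrix(3, 4)
W_conv, b_conv = rand_matrix(5, 4), [0.0] * 5
W_fc, b_fc = rand_matrix(2, 5), [0.0] * 2
phrase_vec = encode_phrase(words, W_conv, b_conv, W_fc, b_fc)
```

Under the width-1 assumption the pooled vector does not depend on word order; a wider kernel would make the encoder sensitive to local word order.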
We represent the input words with word embeddings learned in a preprocessing step by running fastText [
Inspired by the work of Kim et al [
After the first layer of projection, the output vectors were aggregated into a single vector (
Our model includes a component that learns representations for concepts and measures the similarity between an input phrase and the concepts by computing the dot product between these representations and the encoded phrase
We denote these representations by the matrix
Each element of the ancestry matrix
The final embedding of a concept would be the final embedding of its parent (or the average of its parents, in cases of multi-inheritance) plus its own raw embedding (ie,
This has two major advantages. First, it incorporates the taxonomic structure as implicit prior information on the geometry of the concept embeddings. Second, by binding the embeddings of the concepts, training becomes more efficient, as for each concept, it is sufficient to learn only the local location with respect to its parent, rather than learning the absolute location from scratch. Furthermore, when the location of a concept gets updated, both its descendants and ancestors will also get updated, even if they do not have samples present in the mini-batch. More specifically, as a concept gets updated, the global locations provided to all its descendants are automatically updated as well, while the actual raw embedding of its ancestors will get updated through the backpropagation process. The results of our experiments quantitatively and qualitatively show the advantage of this approach in our task.
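The binding of embeddings along the taxonomy can be illustrated with a small sketch. It follows the recursive description given above (a concept's final embedding is its own raw embedding plus the final embedding of its parent, or the average over parents under multiple inheritance). The function name `final_embeddings`, the toy concept names, and the two-dimensional vectors are hypothetical and chosen only for illustration.

```python
def final_embeddings(raw, parents):
    """raw: concept -> raw embedding (list of floats).
    parents: concept -> list of direct parents (missing or empty for roots)."""
    cache = {}

    def embed(concept):
        if concept not in cache:
            vec = list(raw[concept])
            direct = parents.get(concept, [])
            if direct:
                parent_vecs = [embed(p) for p in direct]
                # Add the average of the parents' final embeddings.
                for i in range(len(vec)):
                    vec[i] += sum(pv[i] for pv in parent_vecs) / len(direct)
            cache[concept] = vec
        return cache[concept]

    return {concept: embed(concept) for concept in raw}

# Toy taxonomy (hypothetical): root -> heart_defect -> conotruncal_heart_defect,
# plus one concept with two parents.
raw = {
    "root": [1.0, 0.0],
    "heart_defect": [0.0, 1.0],
    "conotruncal_heart_defect": [0.5, 0.5],
    "two_parent_concept": [0.0, 0.0],
}
parents = {
    "heart_defect": ["root"],
    "conotruncal_heart_defect": ["heart_defect"],
    "two_parent_concept": ["root", "heart_defect"],
}
final = final_embeddings(raw, parents)
```

Because each final embedding is built from ancestor embeddings, changing the raw vector of `root` in this toy example shifts every descendant, which mirrors the knowledge sharing described in the text.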
Finally, the classification is done by computing the dot product (plus a bias term) followed by a softmax layer as follows:
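A minimal sketch of this scoring step, assuming the encoded phrase and the final concept embeddings are already available as plain lists of floats. The names `classify` and `softmax` and the toy numbers are illustrative assumptions, not the authors' code.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(phrase_vec, concept_vecs, bias):
    # Dot product between the phrase and each concept embedding, plus a per-concept bias.
    logits = [
        sum(p * h for p, h in zip(phrase_vec, vec)) + b
        for vec, b in zip(concept_vecs, bias)
    ]
    return softmax(logits)

# Three toy concepts in a 2-dimensional space.
probs = classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
best = max(range(len(probs)), key=probs.__getitem__)
```

The softmax output is what the later sections refer to as the matching score of a phrase for a concept.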
The taxonomy information can be ignored by setting
Training is performed on the names and synonyms provided by the ontology. If a concept has multiple synonyms, each synonym-concept pair is considered as a separate training example. The parameters learned during the training are the encoder parameters
To use our neural dictionary model to recognize concepts in a sentence or larger text, we extract all n-grams of one to seven words in the text and use the neural dictionary model to match each n-gram to a concept. We filter irrelevant n-grams by removing the candidates whose matching score (the softmax probability provided by the neural dictionary model) is lower than a threshold. This threshold is chosen based on the performance of the method (F-measure) on a validation set.
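The n-gram scan and threshold filter can be sketched as follows. The scoring function is a stand-in for the neural dictionary model: `toy_score`, the concept identifiers `TOY:1` and `TOY:2`, the scores, and the threshold of 0.8 are all hypothetical, and whitespace tokenization is an assumption of the example.

```python
def extract_ngrams(tokens, max_len=7):
    """Yield (start, end, phrase) for every n-gram of 1..max_len words (end exclusive)."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])

def candidate_matches(tokens, score_fn, threshold):
    """Keep n-grams whose best concept score is not lower than the threshold."""
    matches = []
    for start, end, phrase in extract_ngrams(tokens):
        concept, score = score_fn(phrase)
        # concept is None when the stand-in model matches nothing.
        if concept is not None and score >= threshold:
            matches.append((start, end, concept, score))
    return matches

# Hypothetical stand-in for the neural dictionary model.
toy_scores = {"heart defect": ("TOY:1", 0.9), "heart": ("TOY:2", 0.4)}

def toy_score(phrase):
    return toy_scores.get(phrase, (None, 0.0))

tokens = "the patient has heart defect".split()
matches = candidate_matches(tokens, toy_score, threshold=0.8)
```

In this toy run only "heart defect" survives the filter; "heart" is scored but falls below the threshold.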
We also use random n-grams from an unrelated corpus (in our case Wikipedia) as negative examples labeled with a dummy
After all the n-grams satisfying the conditions are captured, a postprocessing step is performed to ensure that the results are consistent. For every pair of overlapping captured n-grams, if both n-grams match the same concept, we retain the smaller n-gram. Otherwise, if they are matched to different concepts, we choose the longer n-gram, as this reduces the chances of choosing shorter general concepts in the presence of a more specific, longer, concept. For example, when annotating the sentence “The patient was diagnosed with conotruncal heart defect,” our method will favor choosing the longer, more specific concept “conotruncal heart defect” rather than the more general concept “heart defect.”
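The two overlap rules can be sketched as a pairwise pass over the captured n-grams. The function name `resolve_overlaps`, the processing order, and the tie-breaking for equal-length spans (the earlier match is kept) are assumptions of this example; the text does not specify them.

```python
def overlaps(a, b):
    # Spans are (start, end, concept) with end exclusive.
    return a[0] < b[1] and b[0] < a[1]

def resolve_overlaps(matches):
    dropped = set()
    for i, a in enumerate(matches):
        for j in range(i + 1, len(matches)):
            b = matches[j]
            if i in dropped or j in dropped or not overlaps(a, b):
                continue
            len_a, len_b = a[1] - a[0], b[1] - b[0]
            if a[2] == b[2]:
                # Same concept: retain the smaller n-gram.
                loser = i if len_a > len_b else j
            else:
                # Different concepts: retain the longer n-gram.
                loser = i if len_a < len_b else j
            dropped.add(loser)
    return [m for k, m in enumerate(matches) if k not in dropped]

# "The patient was diagnosed with conotruncal heart defect" (token indices 0..7).
captured = [(5, 8, "conotruncal heart defect"), (6, 8, "heart defect")]
resolved = resolve_overlaps(captured)

same_concept = resolve_overlaps([(0, 2, "X"), (0, 1, "X")])
```

For the example sentence, the longer and more specific match is kept, in line with the behavior described above.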
To evaluate our model, we trained the model on the HPO and SNOMED-CT and applied it to a number of medical texts. We evaluated the model on two different tasks. In the first task, the model ranks concepts matching an input isolated phrase (synonym classification) and in the second task, concepts are recognized and classified from a document (concept recognition).
To assess the effectiveness of the techniques used in our model, we trained four variations of the model as follows:
NCR: The full model, with the same architecture as described in the section Overview of the Neural Dictionary Model. The training data for this model includes negative examples.
NCR-H: In this version, the model ignores the taxonomic relations by setting the ancestry matrix
NCR-N: Similar to the original NCR, this version utilizes the taxonomic relations. However, this model has not been trained on negative samples.
NCR-HN: A variation that ignores the taxonomy and has not been trained on negative examples.
To improve stability, we trained 10 different versions of our model, varying the random initialization of the model parameters and randomly reshuffling the training data across minibatches at the beginning of each training epoch. We created an ensemble of these 10 models by averaging their prediction probabilities for any given query and used this ensemble in all experiments.
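Averaging the prediction probabilities of several models is straightforward; a minimal sketch follows. The function name `ensemble_average` is hypothetical, and only two toy models are shown for brevity where the experiments used 10.

```python
def ensemble_average(prob_lists):
    """Average per-concept probabilities across models for one query."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]

# Two toy models scoring the same query over two concepts.
avg = ensemble_average([[0.5, 0.5], [1.0, 0.0]])
```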
In most of our experiments, we used the HPO to train the neural dictionary model. To maintain consistency with previous work, we used the 2016 release of the HPO, which contains a total of 11,442 clinical phenotypic abnormalities seen in human disease and provides a total of 19,202 names and synonyms for them, yielding an average of 1.67 names per concept.
We evaluated the accuracy of our model trained on the HPO on two different data sets:
PubMed: This data set contains 228 PubMed article abstracts, gathered and manually annotated with HPO concepts by Groza et al [
Undiagnosed Diseases Program (UDP): This data set includes 39 clinical reports provided by the National Institutes of Health UDP [
In order to examine the effectiveness of our model on different ontologies, we also trained the model on a subset of SNOMED-CT, which is a comprehensive collection of medical concepts that includes their synonyms and taxonomy. We evaluated the trained model for concept recognition using a subset of 2000 ICU discharge summaries from MIMIC-III. The discharge summaries are composed of unstructured text and are accompanied by a list of disease diagnosis terms in the form of ICD-9 codes.
Since SNOMED-CT provides a more sophisticated hierarchy than ICD-9 and a mapping between the two exists, we used a subset of SNOMED-CT concepts that include the ICD-9 concepts. We considered the 1292 most frequent ICD-9 concepts that have a minimum of 50 occurrences in MIMIC-III. These were filtered to 1134 concepts that also have at least one mapping SNOMED-CT concept, which were mapped to a total of 8405 SNOMED-CT concepts (more SNOMED-CT concepts because of one-to-many mappings). To have a single connected hierarchy of concepts, we also added all missing ancestors of these SNOMED-CT terms, resulting in a total of 11,551 SNOMED-CT concepts. To these additional 3146 SNOMED-CT concepts, we assigned the ICD-9 code mapped to the original SNOMED-CT term that had induced them (ie, their descendant). We trained NCR using these 11,551 SNOMED-CT concepts and the 21,550 names and synonyms associated with them.
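The ancestor-closure step can be sketched as a graph traversal. This is an illustration under stated assumptions: the function name `add_missing_ancestors`, the toy concept and code names, and the choice to give an added ancestor the union of codes when several mapped descendants induce it are not taken from the text.

```python
def add_missing_ancestors(mapped, parents):
    """mapped: concept -> set of codes for concepts that have a direct mapping.
    parents: concept -> list of direct parents.
    Returns mapped concepts plus all their ancestors; each added ancestor
    inherits the codes of the mapped descendants that induced it."""
    closed = {concept: set(codes) for concept, codes in mapped.items()}
    for concept, codes in mapped.items():
        stack = list(parents.get(concept, []))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            if ancestor not in mapped:
                closed.setdefault(ancestor, set()).update(codes)
            stack.extend(parents.get(ancestor, []))
    return closed

# Toy hierarchy: C -> B -> A and D -> A; only C and D have mapped codes.
parents = {"C": ["B"], "B": ["A"], "D": ["A"]}
closed = add_missing_ancestors({"C": {"code1"}, "D": {"code2"}}, parents)

# Consistency check on the counts reported above.
assert 8405 + 3146 == 11551
```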
In this experiment, we evaluated our method’s performance in matching isolated phrases with ontology concepts. For this purpose, we extracted 607 unique phenotypic phrases that did not have an exact match among the names and synonyms in the HPO from the 228 annotated PubMed abstracts. We used our model to classify HPO concepts for these phrases and ranked them by their score.
In addition to the four variations of our model, we compared our method with one based on Apache Solr, customized to suggest HPO terms for phenotypic queries. This tool is currently in use as a component of the phenotyping software PhenoTips [
An example phrase from this data set is “reduced retinal pigment,” labeled as HP:0007894. In our version of the HPO, there are four names/synonyms for this phrase: “hypopigmentation of the fundus,” “decreased retinal pigmentation,” “retinal depigmentation,” and “retinal hypopigmentation.” NCR correctly identified this concept as its top match. In contrast, the correct concept was not in the top 10 concepts reported by PhenoTips; the top reported concept was “retinal pigment epithelial mottling.”
Synonym classification experiments on 607 phenotypic phrases extracted from 228 PubMed abstracts. Largest values for each category are italicized.
Method | R@1a accuracy (%) | R@5b accuracy (%)
PhenoTips | 28.9 | 49.3
NCRc | 51.6 |
NCR-Hd | 45.5 | 69.8
NCR-Ne | | 78.2
NCR-HNf | 50.2 | 71.8
aR@1: recall using top 1 result from each method.
bR@5: recall using top 5 results from each method.
cNCR: Neural Concept Recognizer.
dNCR-H: variation of the NCR model that ignores taxonomic relations.
eNCR-N: variation of the NCR model that has not been trained on negative samples.
fNCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples.
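The R@1 and R@5 measures defined in the footnotes can be computed as in the following sketch. The function name `recall_at_k` and the toy rankings are hypothetical; each query phrase is assumed to have exactly one gold concept.

```python
def recall_at_k(ranked_predictions, gold_labels, k):
    """Fraction of queries whose gold concept appears among the top k ranked concepts."""
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels) if gold in ranked[:k]
    )
    return hits / len(gold_labels)

# Four toy queries, each with gold concept "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["x", "y", "a"]]
gold = ["a", "a", "a", "a"]
r_at_1 = recall_at_k(ranked, gold, 1)
r_at_3 = recall_at_k(ranked, gold, 3)
```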
We evaluated the four versions of NCR for concept recognition and compared them with four rule-based methods: NCBO annotator [
In order to choose a score threshold for filtering irrelevant concepts, we used 40 random PubMed abstracts as a validation set and compared the micro F1-score with different threshold values. The selected thresholds were 0.85, 0.8, 0.8, and 0.75 for NCR, NCR-H, NCR-N, and NCR-HN, respectively. Since the UDP data set contained fewer reports (39 in total), we did not choose a separate UDP validation set and used the same threshold determined for the PubMed abstracts. We tested our methods on the remaining 188 PubMed abstracts and the 39 UDP reports and calculated micro and macro versions of precision, recall, and F1-score, as shown in the following equations:
In these equations,
We also calculated a less strict version of accuracy measurements that takes the taxonomic relations of the concepts into consideration. For this, we extended the reported set and the label set for each document to include all their ancestor concepts, which we notate by
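A sketch of the strict micro and macro measures, under assumptions stated here: each document is represented by a set of reported concepts and a set of label concepts, and the macro F1-score is taken as the harmonic mean of macro precision and macro recall. That choice agrees with the reported rows (for example, BioLarK macro values of 76.6 and 66.0 give 70.9), but the function names are hypothetical. The helper `extend_with_ancestors` only builds the ancestor-extended sets mentioned above; the exact extended and Jaccard formulas are not reproduced here.

```python
def prf(true_pos, n_reported, n_labeled):
    p = true_pos / n_reported if n_reported else 0.0
    r = true_pos / n_labeled if n_labeled else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_scores(reported, labels):
    # Pool counts over all documents before dividing.
    tp = sum(len(r & l) for r, l in zip(reported, labels))
    return prf(tp, sum(map(len, reported)), sum(map(len, labels)))

def macro_scores(reported, labels):
    # Average per-document precision and recall, then combine into F1.
    per_doc = [prf(len(r & l), len(r), len(l)) for r, l in zip(reported, labels)]
    p = sum(d[0] for d in per_doc) / len(per_doc)
    r = sum(d[1] for d in per_doc) / len(per_doc)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def extend_with_ancestors(concepts, parents):
    """Return the concept set together with all ancestors of its members."""
    extended, stack = set(concepts), list(concepts)
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in extended:
                extended.add(parent)
                stack.append(parent)
    return extended

# Two toy documents.
reported = [{"a", "b"}, {"c"}]
labels = [{"a"}, {"c", "d"}]
micro = micro_scores(reported, labels)
macro = macro_scores(reported, labels)
```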
The measured micro and macro accuracies are provided in
To verify the statistical significance of NCR’s superiority to the baselines, we aggregated both the abstract and UDP data sets for a total of 227 documents and calculated the F1-score for each document separately. This method is different from that used to calculate the F1-score presented in
To evaluate the effectiveness of the techniques employed in NCR on a different ontology, we trained the four variations of our model on the SNOMED-CT subset, using 200 MIMIC reports as the validation set and the remaining 1800 reports as a test set. We mapped each reported SNOMED-CT concept to the corresponding ICD-9 code and calculated the accuracy measurements (
The results show that using the hierarchy information improved both micro and macro F1-scores. Since the labels were only available as ICD-9 codes, which, unlike HPO and SNOMED-CT, lack a sufficiently rich hierarchical structure, the Jaccard index and the extended accuracy measurements were less meaningful and were not calculated. We also ran the original cTAKES, which is optimized for SNOMED-CT concepts, on the 1800 test documents and filtered its reported SNOMED-CT results to ones that have a corresponding ICD-9 code. Although cTAKES had a high recall, the overall F1-scores were lower than those for NCR. Furthermore, using a method similar to the one used to calculate the statistical significance for the improvement relative to BioLark in the section above, we compared NCR with cTAKES and found that NCR performed statistically significantly better (
Micro and macro measurements for concept recognition experiments on 188 PubMed abstracts. Neural Concept Recognizer models were trained on Human Phenotype Ontology. Largest values for each category are italicized.
Method | Micro (%) | Macro (%) | ||||
Precision | Recall | F1-score | Precision | Recall | F1-score | |
BioLarK | 78.5 | 60.5 | 68.3 | 76.6 | 66.0 | 70.9 |
cTAKESa | 72.2 | 55.6 | 62.8 | 74.0 | 61.4 | 67.1 |
OBOb | 78.3 | 53.7 | 63.7 | 79.5 | 58.6 | 67.5 |
NCBOc | 44.0 | 57.2 | 79.5 | 48.7 | 60.4 | |
NCRd | 80.3 | 62.4 | 68.2 | |||
NCR-He | 74.4 | 61.5 | 67.3 | 72.2 | 67.1 | 69.6 |
NCR-Nf | 78.1 | 69.4 | 76.6 | 72.2 | ||
NCR-HNg | 77.1 | 57.2 | 65.7 | 76.5 | 63.4 | 69.3 |
acTAKES: Clinical Text Analysis and Knowledge Extraction System.
bOBO: Open Biological and Biomedical Ontologies
cNCBO: National Center for Biomedical Ontology.
dNCR: Neural Concept Recognizer.
eNCR-H: variation of the NCR model that ignores taxonomic relations.
fNCR-N: variation of the NCR model that has not been trained on negative samples.
gNCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples.
Micro and macro measurements for concept recognition experiments on 39 Undiagnosed Diseases Program clinical notes. Neural Concept Recognizer models were trained on Human Phenotype Ontology. Largest values for each category are italicized.
Method | Micro (%) | Macro (%) | ||||
Precision | Recall | F1-score | Precision | Recall | F1-score | |
BioLarK | 27.6 | 21.0 | 23.9 | 28.7 | 21.6 | 24.6 |
cTAKESa | 31.5 | 18.9 | 23.6 | 20.2 | 26.2 | |
OBOb | 26.8 | 20.5 | 23.2 | 28.8 | 20.1 | 23.7 |
NCBOc | 16.9 | 22.5 | 37.1 | 19.9 | 25.9 | |
NCRd | 24.5 | 27.2 | 25.8 | 26.5 | 27.6 | 27.0 |
NCR-He | 25.1 | 26.8 | 25.9 | 26.2 | 27.0 | 26.6 |
NCR-Nf | 24.3 | 26.2 | 27.0 | |||
NCR-HNg | 25.5 | 27.2 | 27.4 | 27.7 | 27.6 |
acTAKES: Clinical Text Analysis and Knowledge Extraction System.
bOBO: Open Biological and Biomedical Ontologies
cNCBO: National Center for Biomedical Ontology.
dNCR: Neural Concept Recognizer.
eNCR-H: variation of the NCR model that ignores taxonomic relations.
fNCR-N: variation of the NCR model that has not been trained on negative samples.
gNCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples.
Extended measurements for concept recognition experiments on 188 PubMed abstracts. Neural Concept Recognizer models were trained on Human Phenotype Ontology. Largest values for each category are italicized.
Method | Extended value (%) | Jaccard value (%) | ||
Precision | Recall | F1-score | ||
BioLarK | 91.5 | 80.8 | 85.8 | 76.9 |
cTAKESa | 95.6 | 73.9 | 83.3 | 72.1 |
OBOb | 92.4 | 77.9 | 84.5 | 74.4 |
NCBOc | 65.4 | 77.7 | 64.3 | |
NCRd | 93.3 | 82.1 | ||
NCR-He | 86.5 | 85.1 | 76.7 | |
NCR-Nf | 90.6 | 83.1 | 86.7 | 78.2 |
NCR-HNg | 89.7 | 78.9 | 83.9 | 73.2 |
acTAKES: Clinical Text Analysis and Knowledge Extraction System.
bOBO: Open Biological and Biomedical Ontologies
cNCBO: National Center for Biomedical Ontology.
dNCR: Neural Concept Recognizer.
eNCR-H: variation of the NCR model that ignores taxonomic relations.
fNCR-N: variation of the NCR model that has not been trained on negative samples.
gNCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples.
Extended measurements for concept recognition experiments on 39 Undiagnosed Diseases Program clinical notes. Neural Concept Recognizer models were trained on Human Phenotype Ontology. Largest values for each category are italicized.
Method | Extended value (%) | Jaccard index (%) | ||
Precision | Recall | F1-score | ||
BioLarK | 58.9 | 42.6 | 49.5 | 29.5 |
cTAKESa | 68.5 | 36.7 | 47.8 | 27.3 |
OBOb | 59.2 | 46.4 | 52.0 | 31.3 |
NCBOc | 37.2 | 48.5 | 27.2 | |
NCRd | 57.1 | 49.4 | ||
NCR-He | 54.0 | 49.4 | 51.6 | 30.5 |
NCR-Nf | 54.7 | 52.5 | 31.4 | |
NCR-HNg | 56.5 | 49.0 | 52.5 | 31.3 |
acTAKES: Clinical Text Analysis and Knowledge Extraction System.
bOBO: Open Biological and Biomedical Ontologies
cNCBO: National Center for Biomedical Ontology.
dNCR: Neural Concept Recognizer.
eNCR-H: variation of the NCR model that ignores taxonomic relations.
fNCR-N: variation of the NCR model that has not been trained on negative samples.
gNCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples.
Results for concept recognition experiments on 1800 Multiparameter Intelligent Monitoring in Intensive Care documents. The Neural Concept Recognizer models were trained on a subset of the Systematized Nomenclature of Medicine - Clinical Terms ontology. Largest values for each category are italicized.
Method | Micro (%) | Macro (%) | ||||
Precision | Recall | F1-score | Precision | Recall | F1-score | |
cTAKESa | 9.1 | 14.6 | 8.7 | 14.1 | ||
NCRb | 10.9 | 26.7 | 10.6 | 26.9 | 15.2 | |
NCR-Hc | 10.0 | 30.6 | 15.1 | 9.6 | 30.4 | 14.6 |
NCR-Nd | 24.8 | 15.4 | 25.3 | |||
NCR-HNe | 9.6 | 28.6 | 14.4 | 9.2 | 28.9 | 13.9 |
acTAKES: Clinical Text Analysis and Knowledge Extraction System.
bNCR: Neural Concept Recognizer.
cNCR-H: variation of the NCR model that ignores taxonomic relations.
dNCR-N: variation of the NCR model that has not been trained on negative samples.
eNCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples.
To better understand how utilizing the hierarchy information affects our model, we used t-SNE (t-distributed stochastic neighbor embedding) to embed and visualize the learned concept representations for the rows of matrix
Interestingly, in the representations learned for NCR-N, concepts in categories that share children with many other categories, such as “Neoplasm” (dark grey), are located in the center of the plot, close to various other categories, while a category like “Abnormality of ear” (orange) forms its own cluster far from center and is separated from other categories.
To better understand the false positives reported by NCR, we manually reviewed those from three clinical reports randomly chosen from the UDP data set. We considered false positives under the extended version of the evaluation, that is, concepts reported by our method for which neither the concept nor any of its descendants was in the label set. This yielded a total of 73 unique false positives for the three documents. Based on a manual analysis of these terms conducted by a medical expert on rare genetic diseases (coauthor DA), 47.9% of the reported false positives were actually correctly adding more information to the closest phenotype reported in the label set. One such example is “Congenital hypothyroidism on newborn screening.” Although our method correctly recognized “Congenital hypothyroidism,” the closest concept in the extended label set was “Abnormality of the endocrine system.” In an additional 8.2% of cases, our model correctly reported a more specific concept than that presented in the patient record, but the concept was sufficiently close to a specified phenotype for it not to be considered a novel finding. Furthermore, 16.4% of the reported false positives were, in fact, mentioned in the text, albeit as negations, such as “Group fiber atrophy was not seen.” In 6.8% of these cases, the reported phenotype was mentioned but not confidently diagnosed, such as “possible esophagitis and gastric outlet delay.”
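As a quick arithmetic check, the reported percentages can be converted back into approximate counts out of the 73 false positives. The counts below are back-calculated by rounding and are not reported in the text; the category labels are shorthand introduced for this example.

```python
total = 73
shares = {
    "adds information to closest label": 47.9,
    "more specific than the record": 8.2,
    "mentioned as a negation": 16.4,
    "mentioned but not confidently diagnosed": 6.8,
}
# Back-calculated counts (an inference from the percentages, not reported figures).
counts = {name: round(total * pct / 100) for name, pct in shares.items()}
explained = sum(counts.values())
remaining = total - explained
```

Under this reading, 58 of the 73 false positives fall into one of the four categories, leaving 15 that are not covered by them.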
Visualization of the representations learned for Human Phenotype Ontology concepts. The representations are embedded into two dimensions using t-SNE. The colors denote the high-level ancestors of the concepts. The plot on the left shows the representations learned in NCR-N, where the taxonomy information was used in training, and the plot on the right shows representations learned for NCR-HN, where the taxonomy was ignored. NCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples; NCR-N: variation of the NCR model that has not been trained on negative samples; t-SNE: t-distributed stochastic neighbor embedding.
Our experiments showed the high accuracy of NCR compared to the baselines in both synonym classification and concept recognition, where NCR consistently achieved higher F1-scores across different data sets. Furthermore, we showed that NCR’s use of the hierarchical information contributes to its higher performance.
In the synonym classification task, as evident in
In concept recognition experiments, NCR had a better F1-score and Jaccard index than BioLarK and cTAKES on PubMed abstracts (
Among different variations of NCR, use of the hierarchy information always led to a higher F1-score and Jaccard index. Having negative samples during training also generally improved accuracy; however, in some cases, this difference was small, and in some cases, NCR-N showed slightly better results.
Although the PubMed abstracts were manually annotated with HPO concepts by Groza et al [
The experiments on MIMIC data, where the model was trained on SNOMED-CT, resulted in a much lower accuracy than the two experiments performed using the HPO. In addition to the problem of implicit correspondence between labels and actual mentions in the text, in this experiment, we used a mapping between ICD-9 and SNOMED-CT terms, which can introduce further inconsistencies. On the other hand, for the sake of evaluating the techniques employed in our model on another ontology, use of the SNOMED-CT hierarchy, similar to the case with the HPO, improves the F1-scores (
In addition to the quantitative results showing the advantage of using the hierarchy information, our visualization of the concept representations in
NCR has already been used in several applications in practice. Currently, a version of NCR trained on the HPO is deployed as a component of PhenoTips software [
In this paper, we presented a neural dictionary model that ranks matching concepts for a query phrase and can be used for concept recognition in larger text. Unlike other machine learning–based concept recognition tools, our training is solely performed on the ontology data (except the unsupervised learning of the word vectors) and does not require any annotated corpus. Another novelty of our model is our approach to using the taxonomic relations between concepts that, based on our experiments, improve synonym classification. Use of these taxonomic relations makes the training of our model easier by sharing knowledge between different concepts and providing implicit prior information on the similarity between concepts for the model. Furthermore, using multiple sources of information can improve the robustness of the model to potential errors in the input ontologies (eg, due to a mislabeled synonym).
NCR uses convolutional neural networks to encode query phrases into vector representations and computes their similarity to embeddings learned for ontology concepts. The model benefits from knowledge transfer between child and parent concepts by summing the raw embeddings of a concept’s ancestors to compute its final embedding. We tested our neural dictionary model by classifying 607 phenotypic phrases, and our model achieved a considerably higher accuracy than another method designed for this task and baseline versions of our model that do not use the taxonomy information. We also tested our method for concept recognition on full text using three data sets. In one setting, we trained our model on the HPO and tested it on two data sets, including 188 PubMed paper abstracts and 39 UDP clinical records, while in another setting, we trained the model on a subset of SNOMED-CT medical concepts and tested it on 1800 MIMIC ICU discharge notes. Our results showed the effectiveness of our methods in both settings.
One major challenge for the concept recognition task is to filter candidates that do not match any class in the ontology. In our experiments, we approached this challenge by adding negative samples from Wikipedia in the training. Although this improved the results, it did not fully solve the problem, as there can be many relevant medical terms in a clinical text that are neither in an ontology nor available in any negative examples.
Although our experiments have shown the high accuracy of our model in classifying synonyms, we believe there is much more room for improvement in the overall concept recognition method, especially the way that n-grams are selected and filtered. Limitations of NCR include its slower speed relative to several dictionary-based and rule-based methods and its limited ability to utilize contextual information for concept recognition. An interesting direction for future work is to investigate the possibility of using unsupervised methods for encoding phrases, such as skip-thought vectors [
BERT: Bidirectional Encoder Representations from Transformers
CRF: conditional random field
cTAKES: Clinical Text Analysis and Knowledge Extraction System
EHR: electronic health records
HPO: Human Phenotype Ontology
ICD-9: International Classification of Diseases - Ninth Revision
ICU: Intensive Care Unit
IHP: Identifying Human Phenotypes
LSTM: long short-term memory
MIMIC: Multiparameter Intelligent Monitoring in Intensive Care
NER: named entity recognizer
NCBO: National Center for Biomedical Ontology
NCR: Neural Concept Recognizer
NCR-H: variation of the NCR model that ignores taxonomic relations
NCR-HN: variation of the NCR model that ignores the taxonomy and has not been trained on negative examples
NCR-N: variation of the NCR model that has not been trained on negative samples
OBO: Open Biological and Biomedical Ontologies
R@1: recall using top 1 result from each method
R@5: recall using top 5 results from each method
ReLU: rectified linear unit
SNOMED-CT: Systematized Nomenclature of Medicine - Clinical Terms
t-SNE: t-distributed stochastic neighbor embedding
UDP: Undiagnosed Diseases Program
We thank Michael Glueck for his valuable comments and discussions, and Mia Husic for help in improving the manuscript. We also thank Tudor Groza for his helpful comments and for providing us the BioLarK API used for the experiments. This work was partially supported by an NSERC Discovery grant to MB.
Conflicts of interest: None declared.