This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Medical terms are a major obstacle to patients' comprehension of their electronic health record (EHR) notes. Clinical natural language processing (NLP) systems that link EHR terms to lay terms or definitions allow patients to easily access helpful information when reading through their EHR notes, and have been shown to improve patient EHR comprehension. However, high-quality lay language resources for EHR terms are very limited in the public domain. Because expanding and curating such a resource is a costly process, it is beneficial and even necessary to first identify the terms that are important for patient EHR comprehension.
We aimed to develop an NLP system, called adapted distant supervision (ADS), to rank candidate terms mined from EHR corpora. EHR terms ranked highly by ADS are given priority for lay language annotation (ie, creating lay definitions for these terms).
Adapted distant supervision uses distant supervision from consumer health vocabulary and transfer learning to adapt itself to solve the problem of ranking EHR terms in the target domain. We investigated 2 state-of-the-art transfer learning algorithms (ie, feature space augmentation and supervised distant supervision) and designed 5 types of learning features, including distributed word representations learned from large EHR data for ADS. For evaluating ADS, we asked domain experts to annotate 6038 candidate terms as important or nonimportant for EHR comprehension. We then randomly divided these data into the target-domain training data (1000 examples) and the evaluation data (5038 examples). We compared ADS with 2 strong baselines, including standard supervised learning, on the evaluation data.
The ADS system using feature space augmentation achieved the best average precision, 0.850, on the evaluation set when using 1000 target-domain training examples. The ADS system using supervised distant supervision achieved the best average precision, 0.819, on the evaluation set when using only 100 target-domain training examples. The 2 ADS systems both performed significantly better than the baseline systems (
ADS can effectively rank terms mined from EHRs. Transfer learning improved ADS’s performance even with a small number of target-domain training examples. EHR terms prioritized by ADS were used to expand a lay language resource that supports patient EHR comprehension. The top 10,000 EHR terms ranked by ADS are available upon request.
Online patient portals have been widely adopted in the United States in a nationwide effort to promote patient-centered care [
There has been long-standing research interest in developing health information technologies that promote health literacy and consumer-centered communication of health information [
However, high-quality lay language resources—the cornerstone of such interventions—are very limited in the public domain. The readability levels of health educational materials on the Internet often exceed the level that is easily understood by the average patient [
The consumer health vocabulary (CHV) [
We are building a lay language resource for EHR comprehension by including medical terms from EHRs and creating lay definitions for those terms. This is a time-consuming process that involves collecting candidate definitions from authorized health educational resources, and curating and simplifying these definitions by domain experts. Since the number of candidate terms mined from EHRs is large (hundreds of thousands of terms), we ranked candidate terms based on how important they are for patients’ comprehension of EHRs, and therefore prioritized the annotation effort of lexical entries based on those important terms.
The goal of this study was to develop an NLP system to automate the process of lexical entry selection. This task was challenging because the distinctions between important and nonimportant EHR terms in our task were more subtle than those between medical and nonmedical terms (detailed below in the Important Terms for Electronic Health Record Comprehension subsection). To achieve this goal, we developed a new NLP system, called adapted distant supervision (ADS), which uses distant supervision from the CHV and uses transfer learning to adapt itself to the target domain to rank terms from EHRs. We aimed to empirically show that ADS is effective in ranking EHR terms at the corpus level and outperforms supervised learning.
Previous studies have used both unsupervised and supervised learning methods to prioritize terms for inclusion in biomedical and health knowledge resources [
Our work is also related to previous studies that have used distributional semantics for lexicon expansion [
We previously developed NLP systems to rank and identify important terms from each EHR note of individual patients [
Our ADS system uses distant supervision from the CHV. Distant supervision refers to the learning framework that uses information from knowledge bases to create labeled data to train machine learning models [
Transfer learning is a learning framework that transfers knowledge from the source domain to the target domain, so that a model can perform well on the target task even when target-domain training data are limited.
We used 7839 discharge summary notes (5.4 million words) from the University of Pittsburgh NLP Repository (using these data requires a license) [
Overview of development of the adapted distant supervision (ADS) natural language processing system to rank candidate terms mined from electronic health record (EHR) corpora: data extraction (steps 1 and 2), ADS (step 3), and evaluation (step 4). CHV: consumer health vocabulary.
CHV was developed by collaborative research to address vocabulary discrepancies between lay people and health care professionals [
We defined important terms as those terms that, if understood by the patients, would significantly improve their EHR comprehension. In practice, we used 4 criteria: unithood, termhood, unfamiliarity, and quality of compound term (defined with examples in
Except for unithood, which is a general criterion for lexical entry selection, the other 3 criteria all measure term importance from the perspective of patient EHR comprehension (details in
We used CHV to select positive examples to train ADS (see step 2 in
Despite the aforementioned merits, CHV is not perfect for labeling the training data. First, there is no clear boundary between familiar and unfamiliar terms when their CHV familiarity scores are close to 0.6. For example, “congestive heart failure” and “atypical migraine” have familiarity scores of 0.64 and 0.61, respectively; therefore, they would be labeled as negative examples by CHV. However, these 2 terms were judged by domain experts as important terms that need lay definitions. Second, some compound terms in CHV (eg, “knee osteoarthritis,” “brain MRI,” “aspirin allergy”), although labeled as positive examples by CHV, were judged by domain experts as not being high-quality compound terms from the perspective of efficiently expanding a lay language resource and thus did not require the immediate creation of lay definitions.
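As a minimal sketch of this distant labeling step, the rule below treats a candidate term as positive when it is a CHV term with a familiarity score below the 0.6 cutoff discussed above. The `chv` mapping and the exact rule are illustrative assumptions; the actual pipeline labels terms by EHR-CHV membership.

```python
# Hypothetical CHV lookup: term -> familiarity score (lower = less familiar).
FAMILIARITY_CUTOFF = 0.6  # scores at or above this are treated as "familiar"

def distant_label(term, chv):
    """Label a candidate term with CHV alone: positive (1, important) if the
    term is in CHV and its familiarity score is below the cutoff, else 0."""
    score = chv.get(term)
    if score is None:           # not a CHV term -> negative
        return 0
    return 1 if score < FAMILIARITY_CUTOFF else 0

chv = {"myasthenia gravis": 0.21, "congestive heart failure": 0.64}
distant_label("myasthenia gravis", chv)         # 1: unfamiliar CHV term
distant_label("congestive heart failure", chv)  # 0: the mislabeling noted above
```

This also makes the noise source concrete: “congestive heart failure” falls on the wrong side of the cutoff despite being judged important by domain experts.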
Since CHV-labeled training data are noisy, we used transfer learning to adapt the system distantly supervised by CHV to the target-domain task. More formally, we defined the training data derived from CHV as the source-domain data
In this study, we investigated 2 state-of-the-art transfer learning methods: feature space augmentation (FSA) and supervised distant supervision (SDS).
FSA [
This approach assumes that the source and target domains share some predictive features while others are domain specific; the augmented feature space lets the learner assign separate weights to the shared and domain-specific versions of each feature.
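A minimal sketch of the feature mapping, following the common formulation of feature space augmentation (each example is projected into general, source-specific, and target-specific blocks; source examples fill the general and source blocks, target examples the general and target blocks):

```python
import numpy as np

def augment(x, domain):
    """Feature space augmentation: triple the feature space into
    <general, source-specific, target-specific> blocks."""
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros])
    return np.concatenate([x, zeros, x])

x = np.array([0.5, 1.0])
augment(x, "source")  # [0.5, 1.0, 0.5, 1.0, 0.0, 0.0]
augment(x, "target")  # [0.5, 1.0, 0.0, 0.0, 0.5, 1.0]
```

A single model trained on the augmented data can then learn, per feature, whether its behavior transfers across domains or is specific to one of them.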
Equations for feature mapping functions used in feature space augmentation (1), objective function used in supervised distant supervision (2), and average precision (3).
SDS is an extension of the algorithm recently proposed by Wallace et al [
Our algorithm differs from that of Wallace et al [
We implemented 2 versions of the ADS system, ADS-fsa and ADS-sds, by incorporating the 2 transfer learning algorithms. We used the log-linear model as the base of all the models (including the baseline models introduced in the Baseline Systems subsection) and used L2 regularization for model training. The output of the log-linear models is the probability that a candidate term is a positive example, which can be used to rank candidate terms directly. We used grid search and cross-validation on the target-domain training data to set the hyperparameters
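The setup described here (an L2-regularized log-linear model, hyperparameters tuned by grid search with cross-validation, and candidates ranked by predicted probability) can be sketched with scikit-learn; the synthetic data and grid values below are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                           # stand-in term feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in labels

# Grid search over C, the inverse L2 regularization strength.
search = GridSearchCV(LogisticRegression(penalty="l2", solver="liblinear"),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

# Rank candidate terms by the probability of being a positive (important) term.
probs = search.predict_proba(X)[:, 1]
ranking = np.argsort(-probs)   # indices of terms, most to least important
```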
We derived the training and evaluation datasets from the 106,108 candidate terms extracted from EHR-Pittsburgh as follows.
First, 3 people with a postgraduate level of education in biology, public health, and biomedical informatics reviewed candidate terms among the terms ranked as high by the nonadapted distant supervision model (ie, among the top 10,000 terms) or by the term recognition algorithm C-value [
Each term was annotated by 1 primary reviewer and then reviewed by another reviewer based on the 4 criteria introduced in the subsection Important Terms for Electronic Health Record Comprehension (details in
We used 1000 examples randomly sampled from the 6038 annotated terms as the target-domain training set and used the remaining 5038 terms as the evaluation set. We did not use stratified sampling because in practice we did not know the class distribution of the target-domain data or the test data. In transfer learning, the target-domain training data are critical to system performance. Therefore, we repeated the above procedure 100 times to obtain 100 pairs of <target training set, evaluation set> for system evaluation to take into account the variance of the target training set. To test the effects of the size of the target-domain training data, we reported system performance by using
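The repeated-splits protocol above can be sketched as follows (function names, the toy term list, and the seed are illustrative):

```python
import random

def make_splits(terms, n_train=1000, n_repeats=100, seed=0):
    """Repeatedly split the annotated terms into a random target-domain
    training set and an evaluation set, as in the evaluation protocol."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        shuffled = terms[:]        # copy, then shuffle in place
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

terms = [f"term_{i}" for i in range(6038)]
splits = make_splits(terms)        # 100 pairs of <train (1000), eval (5038)>
```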
We first obtained 100,070 terms by removing the 6038 manually labeled terms from the 106,108 candidate terms. We then automatically labeled the 100,070 terms based on whether a term was an EHR-CHV medical term (ie, positive term) or not (ie, negative term). In this way, we obtained 4166 positive terms and 95,904 negative terms. Because we did not know the distribution of the target-domain data, we randomly sampled 3000 positive and 3000 negative terms from these data to form a balanced source-domain training set. We set the size of the source training set to 6000 by following previous work [
We employed 2 baselines commonly used to evaluate transfer learning methods [
Word embedding is the distributed vector representation of words. It has emerged as a powerful technique for word representation and proved beneficial in a variety of biomedical and clinical NLP tasks. We used word2vec software to create the skip-gram word embeddings [
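One simple way to turn per-word embeddings into features for a multi-word candidate term is to average the component-word vectors. The averaging scheme and the toy vectors below are assumptions for illustration, not necessarily the exact featurization used in this work:

```python
import numpy as np

word_vecs = {                      # toy stand-in for skip-gram embeddings
    "myasthenia": np.array([0.2, 0.9, 0.1]),
    "gravis": np.array([0.4, 0.7, 0.3]),
}

def term_embedding(term, vecs, dim=3):
    """Embed a (possibly multi-word) term as the mean of its word vectors;
    fully out-of-vocabulary terms get a zero vector."""
    vectors = [vecs[w] for w in term.split() if w in vecs]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

term_embedding("myasthenia gravis", word_vecs)  # ≈ [0.3, 0.8, 0.2]
```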
We mapped candidate terms to UMLS concepts and included semantic types for those concepts that had an exact match or a head-noun match as features. Each semantic type is a 0-1 binary feature. This type of feature has been used to identify domain-specific medical terms [
We used the confidence scores from 2 term-recognition algorithms: corpus-level term frequency-inverse document frequency [
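One of the term-recognition scores referenced in this work is C-value, which weights a candidate term's frequency by its length and discounts occurrences nested inside longer candidate terms. The sketch below follows the standard formulation, except that it uses a log2(length + 1) weight so single-word terms keep a nonzero score (a common variant, assumed here rather than taken from the paper):

```python
import math

def c_value(term, freq, nesting):
    """C-value term-recognition score.
    freq: term -> corpus frequency.
    nesting: term -> list of longer candidate terms that contain it."""
    length = len(term.split())
    weight = math.log2(length + 1)   # +1: keep single-word terms nonzero
    longer = nesting.get(term, [])
    if not longer:                   # term never nested in a longer candidate
        return weight * freq[term]
    discount = sum(freq[t] for t in longer) / len(longer)
    return weight * (freq[term] - discount)

freq = {"heart failure": 30, "congestive heart failure": 20}
nesting = {"heart failure": ["congestive heart failure"]}
c_value("congestive heart failure", freq, nesting)  # log2(4) * 20 = 40.0
c_value("heart failure", freq, nesting)             # log2(3) * (30 - 20)
```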
We generated 4 features from the Google Ngram corpus [
Term length is the number of words in a term. Because a long candidate term may not be a good compound term but rather a simple concatenation of shorter terms (eg, “left heart cardiac catheterization”), this feature may help the ADS system to identify and rank as low the low-quality compound terms.
This metric averages the precision values computed at the rank position of each relevant (ie, positive) term in the ranked list.
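As a minimal sketch, average precision over a ranked list is the mean of the precision values at the ranks that hold positive terms:

```python
def average_precision(labels):
    """Average precision for a ranked list, where `labels` is the 0/1
    relevance of each term in ranked order (best-ranked first)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2 ≈ 0.833
```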
The area under the receiver operating characteristic curve (AUC-ROC) is computed from the curve that plots the true positive rate (y-axis) against the false positive rate (x-axis) at various threshold settings.
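AUC-ROC also has a direct ranking interpretation, which the following sketch computes: the probability that a randomly chosen positive term is scored above a randomly chosen negative term (ties count half):

```python
def auc_roc(scores, labels):
    """AUC-ROC via its pairwise ranking interpretation (equivalent to the
    normalized Mann-Whitney U statistic)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_roc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])  # 3 of 4 pairs correct -> 0.75
```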
Recall that we have 100 pairs of <target training set, evaluation set> randomly sampled from the 6038 labeled terms. When evaluating a system, we averaged its performance scores over the 100 pairs of datasets and reported the averaged values.
We used sklearn.metrics to compute the average precision and AUC-ROC scores. Scikit-learn is an open source Python library widely used for machine learning [
We used the paired-samples t test to assess the statistical significance of performance differences between systems across the 100 pairs of datasets.
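For illustration, the paired-samples t statistic over matched per-split scores can be computed by hand (the score lists below are made up; `scipy.stats.ttest_rel` gives the same statistic):

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired-samples t statistic: mean of the per-pair differences divided
    by the standard error of those differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

a = [0.82, 0.84, 0.83, 0.85]   # e.g. ADS scores per split (illustrative)
b = [0.80, 0.81, 0.80, 0.82]   # e.g. baseline scores per split (illustrative)
paired_t(a, b)                 # ≈ 11.0
```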
Performance of different natural language processing systems on the evaluation set under 4 conditions using 100, 200, 500, and 1000 target-domain training examples^a.

| System | AUC-ROC^b | | | | Average precision | | | |
| | 100 | 200 | 500 | 1000 | 100 | 200 | 500 | 1000 |
| SourceOnly | 0.739 | 0.739 | 0.739 | 0.739 | 0.811 | 0.811 | 0.811 | 0.811 |
| TargetOnly | 0.728 | 0.749 | 0.769 | 0.782 | 0.799 | 0.816 | 0.833 | 0.844 |
| ADS-fsa^c | 0.746 | 0.756 | | | 0.815 | 0.823 | | *0.850* |
| ADS-sds^d | *0.751* | *0.759* | 0.775 | 0.786 | *0.819* | *0.826* | 0.838 | 0.847 |
| t^e | 4.25 | 2.79 | 8.78 | 3.81 | 3.04 | 11.58 | | |
| P | <.001 | .01 | <.001 | <.001 | .003 | <.001 | | |

^a The highest performance scores are italicized.
^b AUC-ROC: area under the receiver operating characteristic curve.
^c ADS-fsa: adapted distant supervision-feature space augmentation.
^d ADS-sds: adapted distant supervision-supervised distant supervision.
^e The
The average familiarity level or score of top-ranked terms measures one important aspect of ranking quality. However, because many terms in the evaluation set did not have CHV familiarity scores, we could not compute this value directly. A manual review of the top 500 terms ranked by the best system—that is, ADS-fsa trained using 1000 target-domain training examples—did find many unfamiliar medical terms, including “autoimmune enteropathy,” “ileostomy,” “myasthenia gravis,” “nifedipine,” “parathyroid hormone,” and “phototherapy.”
In addition to evaluating system performance, we tested the contribution of each individual feature to system performance by using feature ablation experiments.
Performance of different ADS-sds^a systems implemented by using all types of features or by dropping each individual type of feature, under 4 conditions using 100, 200, 500, and 1000 target-domain training examples^b.

| ADS-sds system | AUC-ROC^c | | | | Average precision | | | |
| | 100 | 200 | 500 | 1000 | 100 | 200 | 500 | 1000 |
| ADS-sds-ALL^d | 0.751 | 0.759 | 0.775 | 0.786 | 0.819 | 0.826 | 0.838 | 0.847 |
| ADS-sds-woWE^e | 0.711 | 0.718 | 0.726 | 0.733 | 0.780 | 0.785 | 0.793 | 0.799 |
| t | 30.37 | 32.74 | 59.92 | 112.25 | 36.61 | 39.63 | 81.04 | 124.15 |
| P | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 |
| ADS-sds-woSem^f | 0.753 | 0.760 | 0.772 | 0.782 | 0.823 | 0.829 | 0.838 | 0.845 |
| t | 4.63 | 12.28 | 3.18 | 4.00 | 4.55 | | | |
| P | <.001 | <.001 | .002 | <.001 | <.001 | | | |
| ADS-sds-woATR^g | 0.751 | 0.759 | 0.774 | 0.786 | 0.819 | 0.826 | 0.838 | 0.847 |
| ADS-sds-woGTF^h | 0.740 | 0.749 | 0.765 | 0.777 | 0.813 | 0.821 | 0.833 | 0.842 |
| t | 13.04 | 9.50 | 14.85 | 22.55 | 8.12 | 6.49 | 11.52 | 23.07 |
| P | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 |
| ADS-sds-woTL^i | 0.741 | 0.751 | 0.767 | 0.778 | 0.807 | 0.815 | 0.829 | 0.838 |
| t | 11.21 | 10.81 | 19.78 | 25.58 | 16.43 | 17.15 | 34.50 | 41.72 |
| P | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 |

^a ADS-sds: adapted distant supervision-supervised distant supervision.
^b We report the
^c AUC-ROC: area under the receiver operating characteristic curve.
^d ADS-sds-ALL: ADS-sds with all types of features.
^e ADS-sds-woWE: ADS-sds without word embedding.
^f ADS-sds-woSem: ADS-sds without semantic features.
^g ADS-sds-woATR: ADS-sds without features derived from automatic term recognition.
^h ADS-sds-woGTF: ADS-sds without general-domain term frequency.
^i ADS-sds-woTL: ADS-sds without term length.
In an effort to build a lexical resource that provides lay definitions for medical terms in EHRs, we developed the ADS system to rank candidate terms mined from an EHR corpus and prioritized our efforts to collect and curate lay definitions for top-ranked terms. Given only 100 labeled target training examples, the best ADS system, ADS-sds, achieved 0.751 AUC-ROC and 0.819 average precision on the evaluation set, which are significantly better (
Our evaluation set was challenging, because terms included in this set had been prefiltered (ie, ranked as high) by 2 term-ranking methods (details in the Training and Evaluation Datasets subsection). In other words, we evaluated ADS on a set of candidate terms that had higher quality than the average candidate terms mined from EHRs, for which the boundaries between positive and negative examples were more subtle. For example, some candidate terms (eg, “metastatic carcinoid tumor,” “normal serum calcium,” and “acute cardiac ischemia”), although registered as medical terms in UMLS, were judged nonimportant or nonurgent for lay definition creation because their meanings could be easily inferred from their component words.
The evaluation results on this dataset suggest that our ADS system is effective in ranking EHR terms and can be used to facilitate the expansion of lexical resources that support EHR comprehension. In particular, it can be used to alleviate the data sparseness problem when there are very few target-domain training data and can be used to boost the performance of supervised learning when the size of the training data increases.
Our evaluation results also suggested that using more target-domain training data is beneficial for system performance (rows 2-4 in
The results of our feature ablation experiment (
Although ADS-fsa and ADS-sds were both effective in ranking EHR terms (
We identified 3 major types of errors through an error analysis of the top-ranked and low-ranked terms (using 300 as the rank threshold) produced by the ADS-sds system that used 1000 target-domain training examples for transfer learning. Error analysis for ADS-fsa showed similar results. First, we found that most errors were caused by compound terms. Specifically, ADS-sds ranked some terms (such as “malignant cell,” “chronic rhinitis,” and “viral bronchitis”) as high, even though their meanings could be easily inferred from their component words. It also ranked certain good compound terms (eg, “community-acquired pneumonia,” “end-stage kidney failure,” and “left ventricular ejection fraction”) as low when these terms contained familiar words. This suggests that advanced features generated by a compound term detector may improve the system’s performance, which we may explore in the future. Second, ADS-sds missed certain terms that are lay terms in the general domain but bear unfamiliar clinical meanings (eg, “baseline,” “vehicle,” and “family history”). Third, ADS-sds ranked some common medical terms (eg, “aspirin,” “vitamin,” and “nerve”) as high, although these terms are likely to be already known by the average patient. The second and third types of errors may be reduced by including domain-specific knowledge about term familiarity as additional features, which we will study in the future.
We report a novel ADS system for ranking and identifying medical terms important for patient EHR comprehension. We empirically show that the ADS system outperforms strong baselines, including supervised learning, and that transfer learning can effectively boost its performance even with only 100 target-domain training examples. The EHR terms prioritized by our model have been used to expand a comprehensive lay language lexical resource that supports patient EHR comprehension. The top 10,000 EHR terms ranked by ADS are available upon request.
Analysis results of consumer health vocabulary’s coverage of terms in electronic health record notes.
Criteria used for manual selection of terms important for patient comprehension of electronic health record notes.
Effects of features on performance of adapted distant supervision.
Effects of increasing target-domain training data on system performance.
ADS: adapted distant supervision
AUC-ROC: area under the receiver operating characteristic curve
CHV: consumer health vocabulary
EHR: electronic health record
FSA: feature space augmentation
JATE: Java Automatic Term Extraction
NLP: natural language processing
SDS: supervised distant supervision
UMLS: Unified Medical Language System
This work was supported by the Institutional National Research Service Award (T32) 5T32HL120823-02 from the US National Institutes of Health (NIH) and the Health Services Research & Development Service of the US Department of Veterans Affairs Investigator Initiated Research (1I01HX001457-01). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH, the US Department of Veterans Affairs, or the US Government.
We thank Weisong Liu, Elaine Freund, Emily Druhl, and Victoria Wang for technical support in data collection. We also thank the anonymous reviewers for their constructive comments and suggestions.
HY and JC designed the study. JC and ANJ collected the data. JC designed and developed the ADS system, conducted the experiments, and drafted the manuscript. ANJ contributed substantially to feature generation for ADS. HY and SJF provided important intellectual input into system evaluation and content organization. All authors contributed to the writing and revision of the manuscript.
None declared.