This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Clinical trials are an important step in introducing new interventions into clinical practice by generating data on their safety and efficacy. Clinical trials need to ensure that participants are similar so that the findings can be attributed to the interventions studied and not to some other factors. Therefore, each clinical trial defines eligibility criteria, which describe characteristics that must be shared by the participants. Unfortunately, the complexities of eligibility criteria may not allow them to be translated directly into readily executable database queries. Instead, they may require careful analysis of the narrative sections of medical records. Manual screening of medical records is time-consuming and thus delays the recruitment process.
Track 1 of the 2018 National Natural Language Processing Clinical Challenge focused on the task of cohort selection for clinical trials, aiming to answer the following question: Can natural language processing be applied to narrative medical records to identify patients who meet eligibility criteria for clinical trials? The task required the participating systems to analyze longitudinal patient records to determine if the corresponding patients met the given eligibility criteria. We aimed to describe a system developed to address this task.
Our system consisted of 13 classifiers, one for each eligibility criterion. All classifiers used a bag-of-words document representation model. To prevent the loss of relevant contextual information associated with such representation, a pattern-matching approach was used to extract context-sensitive features. They were embedded back into the text as lexically distinguishable tokens, which were consequently featured in the bag-of-words representation. Supervised machine learning was chosen wherever a sufficient number of both positive and negative instances was available to learn from. A rule-based approach focusing on a small set of relevant features was chosen for the remaining criteria.
The system was evaluated using microaveraged F measure. In total, 4 machine learning algorithms (support vector machine, logistic regression, naïve Bayesian classifier, and gradient tree boosting [GTB]) were evaluated on the training data using 10-fold cross-validation. Overall, GTB demonstrated the most consistent performance. Its performance peaked when oversampling was used to balance the training data. The final evaluation was performed on previously unseen test data. On average, the F measure of 89.04% was comparable to 3 of the top ranked performances in the shared task (91.11%, 90.28%, and 90.21%). With an F measure of 88.14%, we significantly outperformed these systems (81.03%, 78.50%, and 70.81%) in identifying patients with advanced coronary artery disease.
The holdout evaluation provides evidence that our system was able to identify eligible patients for the given clinical trial with high accuracy. Our approach demonstrates how rule-based knowledge infusion can improve the performance of machine learning algorithms even when trained on a relatively small dataset.
Clinical trials are medical research studies focusing on a specific health intervention. They involve human participants to generate data on safety and efficacy as any new health intervention needs to comply with the Hippocratic Oath: “First, do no harm!” With this principle in mind, clinical trials leading up to regulatory approval are typically divided into 3 phases, each involving a significantly higher number of patients (see
Patient recruitment is universally recognized as a key determinant of success for clinical trials, yet they commonly fail to reach their recruitment goals [
The three premarketing phases of a clinical trial.
The problem of matching the eligibility criteria against their electronic medical records (EMRs) can be framed using a variety of natural language processing (NLP) tasks depending on the type and level of automation expected. In the context of decision making, automation can be applied to 4 classes of functions: information acquisition, information analysis, decision selection, and decision implementation [
IR can be applied to both structured and unstructured components of the EMRs to retrieve relevant records or their parts. The usability of any IR system depends on two key factors: system effectiveness and user utility [
The complexity of clinical sublanguage may require new language modeling approaches to be able to formulate multilayered queries and customize the level of linguistic granularity [
However, the technical feasibility of the IE process does not mean that all relevant attributes are necessarily documented in a single source as the previous example illustrates. For example, a study on case-finding algorithms for hepatocellular cancer discovered significant differences in performance between 2 types of documents (pathology and radiology reports) [
The analysis of the strengths and weaknesses of both data sources together with practical experiments has led to a consensus that clinical narratives should be used in combination with structured data for eligibility screening [
We have thus far discussed the role of IR and IE in the context of information acquisition. The clinician is still expected to review the retrieved information to decide who satisfies the eligibility criteria. Text mining can be used to support this process by automating information analysis and decision selection by means of feature extraction and text classification, respectively. Two NLP systems tailored to the clinical domain are most often used to extract rich linguistic and semantic features from the narrative found in EMRs: Medical Language Extraction and Encoding (MedLEE [
Once the pertinent features have been extracted, they can be exploited by rule-based or machine learning approaches. A review of approaches to identifying patient cohorts using EMRs revealed that out of 97 studies, 24 described rule-based systems; 41 used statistical analyses, data mining, or machine learning; and 22 described hybrid systems [
A variety of supervised machine learning approaches have been used to support cohort selection, including support vector machines (SVMs) [
Our review of related work illustrates the ways in which the eligibility screening process can be automated. One study reported that the time for cohort identification was reduced significantly from a few weeks to a few seconds [
In this paper, we describe Cardiff Cohort Selection System (c2s2) [
For the majority of criteria, a record needs to contain the supporting evidence for the corresponding patient to meet a given criterion, otherwise the criterion is considered
The input to the system is a longitudinal patient record distributed as a single UTF-encoded text file, which contains multiple records generated across various health care encounters. Each individual record represents either a discharge summary or a correspondence between health care professionals [
Description of the eligibility criteria, as provided in the annotation guidelines used for the National Natural Language Processing Clinical Challenge shared task.
ID | Criterion | Time period | Default |
ABDOMINAL | Intra-abdominal surgery, small or large intestine resection, or small bowel obstruction | Any | Not met |
ADVANCED-CAD | Advanced artery disease | Present | Not met |
ALCOHOL-ABUSE | Alcohol use exceeds weekly recommended limits | Present | Not met |
ASP-FOR-MI | Use of aspirin to prevent myocardial infarction | Any | Not met |
CREATININE | Serum creatinine is above the upper limit of normal | Any | Not met |
DIETSUPP-2MOS | Use of dietary supplements (excluding vitamin D) | Past 2 months | Not met |
DRUG-ABUSE | Drug abuse | Any | Not met |
ENGLISH | Speaks English | Any | Met |
HBA1c | Glycated hemoglobin value is between 6.5 and 9.5 | Any | Not met |
KETO-1YR | Diagnosed with ketoacidosis | Past year | Not met |
MAJOR-DIABETES | Major diabetes-related complication | Any | Not met |
MAKES-DECISIONS | Able to make decisions for themselves | Present | Met |
MI-6MOS | Myocardial infarction | Past 6 months | Not met |
System architecture.
In addition to standard preprocessing operations (see
A selection of rule-based punctuation removal examples.
Rule target | Input | Output |
Prescription | q. a.m.; q. Sunday; tab. | qam; q Sunday; tab |
Vitamin | vit. D; MVit. | vit D; MVit |
Personal title | Dr. Harold Nutter; Harold Nutter, Ph.D. | Dr Harold Nutter; Harold Nutter, PhD |
Shorthand x | hx. of migraines; sx. of depression; Rx. for cpap | hx of migraines; sx of depression; Rx for cpap |
Species name | E. coli; C. diff; H. pylori | E coli; C diff; H pylori |
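The punctuation removal rules illustrated in the table above could be approximated as follows. This is a minimal sketch under stated assumptions: the abbreviation list and both patterns are illustrative choices made for this example and are not the rules distributed with c2s2.

```python
import re

# Hypothetical abbreviation list, taken from the examples in the table above.
ABBREVIATIONS = ["dr", "vit", "hx", "sx", "rx", "tab"]

# A period that directly follows one of the known abbreviations.
_ABBR = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\.", re.IGNORECASE)

# A period after a single capital letter followed by a lowercase word,
# as in abbreviated genus names (assumed heuristic).
_SPECIES = re.compile(r"\b([A-Z])\.(?=\s+[a-z])")


def strip_abbreviation_periods(text):
    """Remove periods that mark abbreviations rather than sentence ends."""
    text = _ABBR.sub(r"\1", text)
    text = _SPECIES.sub(r"\1", text)
    return text


cleaned = strip_abbreviation_periods("hx. of migraines sx. of depression Rx. for cpap")
```

Removing such periods before sentence splitting reduces the chance that an abbreviation is mistaken for a sentence boundary. The sketch does not cover multi-part forms such as Ph.D. or q. a.m., which would need dedicated rules.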
Clinical narratives also feature prevalent use of short formulaic statements such as field:value combinations (eg,
Such statements are not commonly terminated by means of punctuation. When used consecutively, this can often result in independent statements being incorrectly grouped together in a single sentence. Their intrasentential co-occurrence may later be easily confused with relatedness. Consider, for instance, amalgamating the above itemized list into a continuous sequence “
Finally, to streamline subsequent text analysis, we use pattern-matching rules to fully expand enclitics and special characters. For example,
Text normalization is performed with a similar intent: to simplify subsequent text analysis. It involves mapping of a selected subset of words and phrases onto their representatives, which can be either a preferred synonym or a hypernym (see
Examples of text normalization.
Example | Surface forms | Normalized form | Relevance |
1 | mom, father, sister | family member | filtering |
2 | FH, FHx, FamHx | family history | filtering |
3 | whiskey, vodka, beer | alcohol | ALCOHOL-ABUSE |
4 | Lantus, Humalog, NPH | insulin | MAJOR-DIABETES |
5 | DM2, DMII, NIDDM | diabetes mellitus 2 | MAJOR-DIABETES |
6 | CRRT, CRRTX | continuous renal replacement therapy | MAJOR-DIABETES |
7 | ARF | acute renal failure | MAJOR-DIABETES |
8 | CKD | chronic kidney disease | MAJOR-DIABETES |
9 | BB, bblocker, betablocker | beta blocker | ADVANCED-CAD |
10 | ECG, EKG | electrocardiogram | ADVANCED-CAD |
11 | ICD | implantable cardioverter defibrillator | ADVANCED-CAD |
12 | CVD | cardiovascular disease | ADVANCED-CAD |
13 | MI, heart attack | myocardial infarction | MI-6MOS, ASP-FOR-MI, ADVANCED-CAD |
14 | STEMI | ST elevation myocardial infarction | MI-6MOS, ASP-FOR-MI, ADVANCED-CAD |
15 | ASA, ECASA | aspirin | ASP-FOR-MI |
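The mapping shown in the table above can be sketched as a dictionary lookup driven by a single regular expression. The entries below are a small subset copied from the table; the function name build_normalizer and the choice of case-sensitive, whole-word matching are assumptions made for illustration only.

```python
import re

# Normalized form -> surface forms (subset of the table above).
NORMALIZATION = {
    "family member": ["mom", "father", "sister"],
    "alcohol": ["whiskey", "vodka", "beer"],
    "electrocardiogram": ["ECG", "EKG"],
    "myocardial infarction": ["MI", "heart attack"],
    "aspirin": ["ASA", "ECASA"],
}


def build_normalizer(mapping):
    """Return a function that replaces surface forms with normalized forms."""
    lookup = {form: target for target, forms in mapping.items() for form in forms}
    # Longer forms first, so that multiword phrases win over shorter overlaps.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(f) for f in ordered) + r")\b")
    return lambda text: pattern.sub(lambda m: lookup[m.group(0)], text)


normalize = build_normalizer(NORMALIZATION)
result = normalize("EKG after heart attack, on ASA")
```

Matching is deliberately case sensitive here so that short acronyms such as MI are not confused with ordinary lowercase strings; a production lexicon would likely need a more careful treatment of case and ambiguity.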
Other acronyms and abbreviations of interest are then expanded using a bespoke lexicon (>500 entries) developed specifically for this task. To bootstrap the lexicon construction, the raw training data were used to analyze frequently occurring words. Orthographic features (uppercase typeset, eg,
The only acronym exempt from expansion was
To illustrate the extent to which text normalization can simplify its subsequent analysis, we can use examples provided in
By filtering out references to family members, we are effectively removing the mentions of
Once the text has been regularized by means of preprocessing and normalization, information not directly relevant to the given classification tasks is filtered out. We focus on 4 types of such information:
negation, for example,
family history, for example,
allergies, for example,
time window, for example,
Removal of such information simplifies subsequent classification by allowing the use of a BoW approach. For example, by not considering the first 2 examples, the risk of misclassifying a patient as having a
We used a set of regular expressions, which are available from the c2s2 GitHub repository [
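As a rough illustration of this filtering step, the sketch below drops every sentence that contains a negation, family history, or allergy trigger. The trigger patterns and the sentence-level granularity are assumptions made for this example; they are much simpler than, and should not be confused with, the regular expressions in the c2s2 repository. Time window filtering is not covered.

```python
import re

# Illustrative trigger patterns (assumed, not taken from c2s2).
TRIGGERS = {
    "negation": re.compile(r"\b(?:no|not|denies|without|negative for)\b", re.IGNORECASE),
    "family history": re.compile(r"\bfamily (?:history|member)\b", re.IGNORECASE),
    "allergies": re.compile(r"\ballerg(?:y|ies|ic)\b", re.IGNORECASE),
}


def filter_sentences(text):
    """Keep only the sentences that match none of the trigger patterns."""
    sentences = re.split(r"(?<=[.;])\s+", text.strip())
    kept = [s for s in sentences
            if not any(p.search(s) for p in TRIGGERS.values())]
    return " ".join(kept)


note = ("Patient denies alcohol use. History of hypertension. "
        "Family history of diabetes mellitus 2. Allergic to penicillin.")
filtered = filter_sentences(note)
```

Dropping whole sentences is a coarse choice: a scoped approach that removes only the negated or attributed phrase would retain more of the remaining context.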
Thus far, we have reduced the noise and lexical variability in the data by means of filtering and normalization. This is expected to improve the performance of a supervised classifier. Another action that stands to improve the classification performance when training on a relatively small dataset is reducing the dimensionality of the BoW representation by aggregating related features into a single representative. In its simplest form, feature aggregation can be achieved by abstracting words into semantic classes. Where a domain ontology is available, such abstraction can be automated by exploiting its taxonomic structure. The Semantic Network of the UMLS can be used to automatically abstract words into semantic types. However, as examples given in
Examples of word abstraction.
Example | Surface forms | Semantic type | Abstraction | Relevance |
1 | marijuana, heroin, ecstasy | Pharmacologic substance | Illicit drug | DRUG-ABUSE |
2 | beta blocker, nitroglycerin, CCB | Pharmacologic substance | Heart medication | ADVANCED-CAD |
3 | crestor, advicor, compactin | Pharmacologic substance | Statin | ADVANCED-CAD |
4 | vitamin C, calcium, primrose oil | Pharmacologic substance | Supplement | DIETSUPP-2MOS |
5 | turmeric, green tea, cinnamon | Food | Supplement | DIETSUPP-2MOS |
6 | vodka, beer, wine | Food | Alcohol | ALCOHOL-ABUSE |
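The effect of word abstraction on the dimensionality of a BoW representation can be seen in a toy example. The sketch below is an assumption-laden illustration: the abstraction dictionary holds a few entries from the table above, the single-token class names (eg, illicit_drug) are invented for this example, and tokenization is plain whitespace splitting.

```python
from collections import Counter

# Surface form -> abstraction (subset of the table above; token names assumed).
ABSTRACTION = {
    "vodka": "alcohol", "beer": "alcohol", "wine": "alcohol",
    "marijuana": "illicit_drug", "heroin": "illicit_drug", "ecstasy": "illicit_drug",
}


def bag_of_words(text, abstraction=None):
    """Count tokens, optionally replacing each token with its abstraction."""
    tokens = text.lower().split()
    if abstraction:
        tokens = [abstraction.get(t, t) for t in tokens]
    return Counter(tokens)


docs = ["drinks vodka daily", "drinks beer daily", "drinks wine daily"]

# Vocabulary (set of distinct features) with and without abstraction.
raw_vocab = set().union(*(bag_of_words(d) for d in docs))
abstract_vocab = set().union(*(bag_of_words(d, ABSTRACTION) for d in docs))
```

Without abstraction, the three documents share only 2 of 5 features; with abstraction, they become identical over 3 features, which is the kind of aggregation that helps when few training instances are available.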
Rule-based feature extraction.
Tag | Feature | Extractiona | Examplesb |
MEDRX | Prescription instructions | Regular expressions | po q4h prn |
KIDMED | Kidney medication | Lexicon (221 entries)c | Thymoglobulin |
BRPMED | Blood pressure medication | —d | Avapro |
HRTMED | Heart medication | — | Plavix |
HRTTRT | Heart treatment | Regular expressions | Re |
HRTISC | Heart ischemia | Regular expressions | Electro |
HRTANG | Angina | Regular expressions | Chest wall heaviness |
HRTCAD | Any of the HRT tags above + explicit references to CAD | Regular expressions | Given his extensive cardiac history |
ASPFMI | Aspirin for heart problems | Regular expressions | Start on heparin |
SPLMNT | Supplement (strong evidence) | Lexicon (67 entries) + regular expressions | Ibuprofen 800 mg |
DFCNCY | Supplement (weak evidence) | Lexicon (27 entries) + regular expressions | |
MNTCAP | Mental capacity | Regular expressions | Increasing |
DRGADD | Substance abuse | Lexicon (17 entries) + regular expressions | History of |
NOENGL | Does not speak English | Lexicon (66 entries) + regular expressions | An |
ALCABS | Alcohol abuse | Lexicon (7 entries) + regular expressions | |
ALCSTP | Stopped drinking alcohol | Regular expressions | Alcoholism 10 |
KETACD | Ketoacidosis | Regular expressions | Ketones positive |
KIDDAM | Kidney problems | Regular expressions | |
DMCMPL | Diabetic complications | Regular expressions | |
ABDMNL | Abdominal surgery or small bowel obstruction | Regular expressions | Gastric |
HIGHCRT | High creatinine | Regular expressions + information extraction | Blood urea nitrogen/ |
GLYHMG | Glycated hemoglobin in a given interval | Information extraction |
aAll lexicons and regular expressions are available from the c2s2 GitHub repository [
bItalic typeset is used to indicate the types of text features targeted by lexicons and regular expressions.
cKIDMED, BRPMED, HRTMED are organized into a single lexicon of 221 entries.
dNot applicable.
Once the BoW representation is passed onto a supervised classifier, the context of individual words will be lost. For instance, blood tests frequently feature essential minerals such as calcium, potassium, and iron, which can also be prescribed under the same names as supplements. The BoW approach will take these names out of context, keeping their frequency as the only information about them. Conversely, simple pattern analysis can be used to differentiate between the 2 types of context. For example, we can model prescription instructions using regular expressions (see
Regular expressions can be used to model categorical references to information relevant to the given eligibility criteria. For example, regular expressions can be used to link the word
Overall, a total of 22 tags described in
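The idea of embedding a context-sensitive feature back into the text as a lexically distinguishable token can be sketched as follows. The MEDRX tag name comes from the feature table; the regular expression is a deliberately small, assumed approximation of prescription instructions, and appending the tag after the match (rather than replacing it) is also an assumption.

```python
import re
from collections import Counter

# Assumed, simplified pattern for prescription instructions such as "po q4h prn".
MEDRX = re.compile(r"\b(?:po|iv|sc)\s+(?:q\d+h|qd|bid|tid|qid)(?:\s+prn)?\b",
                   re.IGNORECASE)


def embed_tags(text):
    """Append the MEDRX token after each prescription instruction found."""
    return MEDRX.sub(lambda m: m.group(0) + " MEDRX", text)


prescribed = embed_tags("calcium 500 mg po q4h prn")
lab_result = embed_tags("serum calcium 9.1 mg/dl")

# The tag survives tokenization and becomes an ordinary BoW feature.
features = Counter(prescribed.split())
```

Both sentences mention calcium, but only the first one carries the MEDRX token, so a BoW classifier can distinguish a prescribed supplement from a blood test result even though word order is discarded.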
This module consists of 13 binary classifiers, 1 for each eligibility criterion (see
Distribution of class labels.
Features used in rule-based classification.
ID | Features |
ALCOHOL-ABUSE | ALCABS, ALCSTP |
DRUG-ABUSE | DRGADD |
ENGLISH | NOENGL |
KETO-1YR | KETACD |
MAKES-DECISIONS | MNTCAP |
MI-6MOS | BRPMED, HRTMED, HRTTRT, HRTISC, HRTANG, HRTCAD, ASPFMI |
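A rule-based classifier over the embedded tags can be very small. The sketch below covers 3 of the criteria from the table above and assumes the simplest possible decision rule: any occurrence of the relevant tag flips the default label given in the table of eligibility criteria. The actual decision logic used by c2s2, in particular for criteria that combine several tags, is not reproduced here.

```python
# Default labels as listed in the table of eligibility criteria.
DEFAULTS = {"DRUG-ABUSE": "not met", "KETO-1YR": "not met", "ENGLISH": "met"}

# Tags used per criterion, as listed in the table above.
EVIDENCE_TAGS = {
    "DRUG-ABUSE": {"DRGADD"},
    "KETO-1YR": {"KETACD"},
    "ENGLISH": {"NOENGL"},
}


def classify(criterion, tagged_text):
    """Flip the default label if any relevant tag occurs (assumed rule)."""
    tokens = set(tagged_text.split())
    default = DEFAULTS[criterion]
    if tokens & EVIDENCE_TAGS[criterion]:
        return "not met" if default == "met" else "met"
    return default


label = classify("ENGLISH", "requires interpreter NOENGL")
```

Because the tags are inserted upstream by pattern matching, the classifier itself needs no training data, which suits criteria with very few positive or negative instances.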
Note that the numerical values used in criteria CREATININE and HBA1c were also extracted using a rule-based approach. However, in a longitudinal report, different values may be reported at different time points. In the absence of clear guidelines, we used machine learning on top of IE to determine automatically from the training data how to deal with such cases.
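Extraction of numerical values of this kind can be sketched with a single regular expression. The pattern below is an assumed approximation written for this example; it collects every glycated hemoglobin value mentioned in a record and flags whether each falls within the interval given for the HBA1c criterion (6.5 to 9.5), leaving the question of how to reconcile multiple values to a downstream step.

```python
import re

# Assumed pattern: a test name followed, within a short window of
# non-digit characters, by a 1- or 2-digit value with optional decimals.
HBA1C = re.compile(
    r"\b(?:hba1c|a1c|glycated hemoglobin)\b\D{0,20}?(\d{1,2}(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_hba1c(text):
    """Return all glycated hemoglobin values found in the text, in order."""
    return [float(v) for v in HBA1C.findall(text)]


def in_range(value, low=6.5, high=9.5):
    """Check a value against the interval used by the HBA1c criterion."""
    return low <= value <= high


values = extract_hba1c("HbA1c 7.2 in March; A1c of 10.1 in June")
flags = [in_range(v) for v in values]
```

A pattern this simple can be misled by intervening numbers, such as a date placed between the test name and the value, so it should be read as a starting point rather than a robust extractor.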
A machine learning approach was used for all other criteria. According to the
We trained all classifiers using single words and/or bigrams as features with and without feature selection based on L1 regularized linear SVM. The overall performance was statistically indistinguishable across different types of features used. Therefore, we opted for a simple BoW approach with feature selection for efficiency reasons. To evaluate the impact of the class imbalance on the classification performance, we balanced the training data using random undersampling and oversampling with default parameters from scikit-learn [
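Random oversampling, one of the two balancing strategies mentioned here, can be illustrated without any library support. The function below is a plain Python sketch written for this example (its name and seed parameter are assumptions); it duplicates randomly chosen instances of each smaller class until all classes match the size of the largest one.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class instances at random until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Instances of this class that may be duplicated.
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


X = list(range(10))
y = [1] * 2 + [0] * 8
X_bal, y_bal = random_oversample(X, y)
```

When combined with cross-validation, resampling of this kind should be applied to the training folds only, so that duplicated instances never appear in both the training and the held-out portion of a fold.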
Summary of cross-validation results. SVM: support vector machines; LR: logistic regression; NB: naïve Bayesian; GTB: gradient tree boosting; HBA1c: glycated hemoglobin.
The results of classification experiments on previously unseen test data are summarized in
Detailed holdout test results.
ID | Meta | Not meta | Overall | Baselineb | c2s2c | ||||||||
Pd (%) | Re (%) | Ff (%) | P (%) | R (%) | F (%) | F (%) | F (%) | System | Rank | ||||
ABDOMINAL | 64.86 | 80.00 | 71.64 | 87.76 | 76.79 | 81.90 | 76.77 | 90.64 | Rules | 4 | |||
ADVANCED-CAD | 83.02 | 97.78 | 89.80 | 96.97 | 78.05 | 86.49 | 88.14 | 88.14 | c2s2 | 1 | |||
ALCOHOL-ABUSE | 22.22 | 66.67 | 33.33 | 98.70 | 91.57 | 95.00 | 64.17 | | Hybrid | 2 | |||
ASP-FOR-MI | 87.67 | 94.12 | 90.78 | 69.23 | 50.00 | 58.06 | 74.42 | 77.34 | HNNg | 2 | |||
CREATININE | 80.00 | 83.33 | 81.63 | 93.44 | 91.94 | 92.68 | 87.16 | 89.75 | Rules | 2 | |||
DIETSUPP-2MOS | 78.85 | 93.18 | 85.42 | 91.18 | 73.81 | 81.58 | 83.50 | 89.53 | Hybrid | 4 | |||
DRUG-ABUSE | 40.00 | 66.67 | 50.00 | 98.77 | 96.39 | 97.56 | 73.78 | | Hybrid | 2 | |||
ENGLISH | 91.25 | 100.00 | 95.42 | 100.00 | 46.15 | 63.16 | 79.29 | 97.66 | Hybrid | 4 | |||
HBA1c | 100.00 | 82.86 | 90.62 | 89.47 | 100.00 | 94.44 | 92.53 | 93.82 | Rules | 2 | |||
KETO-1YR | 0.00 | 0.00 | 0.00 | 100.00 | 100.00 | 100.00 | 50.00 | | All | 1 | |||
MAJOR-DIABETES | 85.00 | 79.07 | 81.93 | 80.43 | 86.05 | 83.15 | 82.54 | 86.02 | Hybrid | 2 | |||
MAKES-DECISIONS | 97.62 | 98.80 | 98.20 | 50.00 | 33.33 | 40.00 | 69.10 | | HNN | 2 | |||
MI-6MOS | 33.33 | 50.00 | 40.00 | 94.59 | 89.74 | 92.11 | 66.05 | | Rules | 4 | |||
Overallh (microaveraged) | 83.97 | 91.29 | 87.47 | 93.54 | 87.86 | 90.61 | 89.04 | 91.11 | Hybrid | 4 |
aThe binary classification task involves 2 classes (
bThe best results from 3 related studies are used as the baseline. They are named after the approach they used: rules [
cc2s2: Cardiff Cohort Selection System.
dP: precision.
eR: recall.
fF:
gHNN: hierarchical neural network.
hThe overall values provided in the bottom row have been microaveraged across the 13 classifiers.
The best results marked with an asterisk in
All 4 systems achieved similar performance for HBA1c and ASP-FOR-MI. On the
The rule-based approach performed best against the following eligibility criteria: ABDOMINAL and CREATININE. For ABDOMINAL, recall was in the 80s on the
Conversely, broader eligibility criteria, which require some reasoning over multiple references made across the discourse, may require a machine learning approach to model the complexities of target classification problems. MAJOR-DIABETES is one such example where major complications may not be restricted to a finite class of signs and symptoms. In addition, such complications may be mentioned without an explicit reference to diabetes. This requires complex analysis of the wider context. Neural networks can be used to model nonlinearity in text. Not surprisingly, the HNN approach achieved the best results in this case. In particular, the robustness of this approach is reflected in achieving a recall of over 90% on the
Another example of this type of problem is ADVANCED-CAD. As expected, both machine learning approaches performed better than the other 2, with overall
Detailed holdout test results for ADVANCED-CAD.
System | Met | Not met | Overall | ||||
Pa (%) | Rb (%) | Fc (%) | P (%) | R (%) | F (%) | F (%) | |
c2s2d | 83.02 | 97.78 | 89.80 | 96.97 | 78.05 | 86.49 | 88.14 |
Hybrid | 74.55 | 91.11 | 82.00 | 87.10 | 65.85 | 75.00 | 78.50 |
Rules | 67.80 | 88.89 | 76.92 | 81.48 | 53.66 | 64.71 | 70.81 |
HNNe | 77.36 | 91.11 | 83.67 | 87.88 | 70.73 | 78.38 | 81.03 |
aP: precision.
bR: recall.
cF:
dc2s2: Cardiff Cohort Selection System.
eHNN: hierarchical neural network.
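The figures in the table above can be checked with a few lines of arithmetic. The sketch below assumes that F is the harmonic mean of precision and recall and that the overall F for a criterion is the mean of the F values of the 2 classes (met and not met); this reading is inferred from the reported numbers rather than stated explicitly in the text.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_f(p_met, r_met, p_not, r_not):
    """Mean of the F measures of the 'met' and 'not met' classes (assumed)."""
    return (f_measure(p_met, r_met) + f_measure(p_not, r_not)) / 2


# c2s2 row for ADVANCED-CAD from the table above.
f_met = f_measure(83.02, 97.78)
f_not = f_measure(96.97, 78.05)
f_all = overall_f(83.02, 97.78, 96.97, 78.05)
```

Under these assumptions, the computed values agree with the reported 89.80, 86.49, and 88.14 to within rounding. The same calculation reproduces the overall value of 50.00 for KETO-1YR, where one class has precision and recall of 0.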
Ideally, supervised learning performs best when large training datasets with a reasonable class balance are available to extrapolate a classification model while minimizing overfitting. As we can see from the data (see
BoW: bag-of-words
c2s2: Cardiff Cohort Selection System
cTAKES: clinical Text Analysis and Knowledge Extraction System
EEG: electroencephalography
EMR: electronic medical record
GTB: gradient tree boosting
HBA1c: glycated hemoglobin
HNN: hierarchical neural network
ICD-9: International Classification of Diseases, Ninth Revision
IE: information extraction
IR: information retrieval
LR: logistic regression
MedLEE: Medical Language Extraction and Encoding
n2c2: National Natural Language Processing Clinical Challenge
NB: naïve Bayesian
NLP: natural language processing
SVM: support vector machine
UMLS: Unified Medical Language System
The authors gratefully thank Nikola Cihoric, MD, for sharing his medical expertise, which partly informed the development of the preprocessing module.
IS designed the system. IS and PC implemented the following modules: preprocessing, normalization, filtering, and feature extraction. DK and AB implemented the classification module. All authors were involved in the evaluation and interpretation of the results. IS drafted the manuscript. All authors reviewed and approved the manuscript for publication.
None declared.