This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research.
We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria.
We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification.
Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task.
We designed a short text classification model for clinical trial screening criteria based on multimodel ensemble learning. Through experiments, we concluded that the model ensemble improved performance significantly compared to any single model, and that introducing focal loss reduced the impact of class imbalance, yielding better performance.
Clinical trials are experiments or observations conducted on human volunteers, who are also referred to as subjects in clinical research. Eligibility criteria are the main indicators developed by those conducting the clinical trial to identify whether a subject should be enrolled in a clinical trial [
Text classification is an essential research topic in text information processing. It associates a given text with one or more categories based on characteristics of the text (content, attributes, or features), under a predefined classification taxonomy. Effective feature selection is crucial to the efficiency and accuracy of text classification tasks [
However, unlike open domains, the complexity of medical texts makes it extremely difficult to classify them. First, the complexity of medical texts mainly comes from a large number of domain-specific terms. Different categories of texts correspond to medical terms of disease names, drug names, body part names, and other information, which presents difficulties in text segmentation and subsequent text feature extraction [
With the rapid development of deep learning [
To address the difficulties (eg, in feature extraction) caused by the large number of domain-specific disease, medicine, and body part names and other terminology, this paper proposes a character-level short text classification model. For word embedding, 4 character-level models were selected: BERT, A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, and Enhanced Representation through Knowledge Integration (ERNIE). We used pretrained models based on a Chinese corpus to accelerate convergence. To reduce the data imbalance problem, focal loss was introduced into the training process to make training more stable. Finally, LightGBM was used to ensemble the 4 models to improve overall performance.
The main contributions of this paper are as follows: (1) a character-level ensemble learning model created by integrating BERT, RoBERTa, XLNet, and ERNIE was proposed for eligibility criteria text classification. (2) The focal loss as a loss function was leveraged to solve the problem of data imbalance among different categories. (3) The evaluation results showed that our ensemble learning model outperformed several baseline methods, demonstrating its effectiveness in the eligibility criteria text classification task.
Our data set comes from the evaluation task of the China Health Information Processing Conference (CHIP) 2019. There are three evaluation tasks. The first task is the standardization of clinical terms [
The data set contains 38341 clinical trial eligibility criteria texts and has been manually annotated by human experts.
The data set contains 44 categories of clinical trial eligibility criteria in total, including “Disease,” “Multiple,” “Therapy or Surgery,” etc. The data set is further divided into a training set, a validation set, and a test set. The training set contains 22962 eligibility criteria texts, while the validation and test sets contain 7682 and 7697 texts, respectively.
Examples of eligibility criteria texts and corresponding annotated categories.
Eligibility criteria text | Annotated category |
年龄>80岁 (Age>80) | Age |
近期颅内或椎管内手术史 (Recent intracranial or spinal canal surgery) | Therapy or surgery |
血糖<2.7 mmol/L (Blood glucose<2.7 mmol/L) | Laboratory examinations |
2)性别不限,年龄18~70岁 (Unlimited gender, aged 18-70 years) | Multiple |
合并造血系统或恶性肿瘤等严重原发性疾病 (A serious primary disease, such as one involving the hematopoietic system or a malignant tumor) | Disease |
其他研究者认为不适合参加本研究的患者 (Other patients considered unsuitable for this study by the investigators) | Researcher decision |
预期生存超过12周 (Expected survival over 12 weeks) | Life expectancy |
男、女不限 (Male or female) | Gender |
The overall framework of our proposed model is shown in
Most existing text representation methods are based on words, phrases, sentences, or analysis of semantic and grammatical structure in texts. However, existing word segmentation techniques are not suitable in the medical field due to complex grammatical structures. Therefore, we use character-level textual representations to avoid these problems. Accordingly, our model is based on the mainstream character-level text models described below.
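As a minimal illustration of this character-level approach (a sketch only; the actual pretrained models use their own vocabularies and tokenizers, and `char_tokenize` is a hypothetical helper, not code from the paper):

```python
def char_tokenize(text, max_len=64, cls_token="[CLS]", sep_token="[SEP]"):
    """Split a (Chinese) criterion text into single characters and wrap it
    with the special tokens used by BERT-style models. Character-level
    tokenization sidesteps Chinese word segmentation, which is unreliable
    for medical terminology. Note: real tokenizers also handle whitespace
    and unknown characters; this sketch does not."""
    chars = list(text)[: max_len - 2]  # reserve room for the 2 special tokens
    return [cls_token] + chars + [sep_token]

tokens = char_tokenize("血糖<2.7 mmol/L")
```

Because every Chinese character becomes its own token, no domain-specific segmentation dictionary is needed for terms such as disease or drug names.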
The framework of the proposed model that contains two layers: a word embedding layer consisting of 4 pretrained models (BERT, XLNet, ERNIE, and RoBERTa); and a model ensemble layer containing LightGBM, used to learn information by combining the outputs of the 4 pretrained models. BERT: Bidirectional Encoder Representations from Transformers; ERNIE: Enhanced Representation through Knowledge Integration; LightGBM: Light Gradient Boosting Machine; RoBERTa: A Robustly Optimized BERT Pretraining Approach.
BERT [
Moreover, RoBERTa [
In this paper, we use a pretrained model based on Chinese BERT and RoBERTa with a Whole Word Masking (WWM) version [
In natural language processing tasks, data preprocessing often greatly impacts the final result. The purpose of data preprocessing is to improve the quality of extracted text features [
Based on BERT, ERNIE [
We used a Chinese corpus–based pretrained ERNIE model. During preprocessing, a “[CLS]” symbol is prepended to the input text, and features are extracted through transformer layers with unshared weights; global information is encoded into “[CLS].” Finally, we take the output of the highest hidden layer at “[CLS]” as a sentence-level feature for text classification by a fully connected layer.
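The “[CLS]”-based classification head can be sketched in pure Python as follows (a toy illustration with a hidden size of 8 rather than the 768 used by BERT-base–style models; the weights here are random placeholders, not trained parameters):

```python
import math
import random

HIDDEN, NUM_CLASSES = 8, 44  # toy hidden size; 44 eligibility criteria categories

def cls_classify(hidden_states, W, b):
    """hidden_states: list of per-token hidden vectors for one text.
    Take the top-layer vector at position 0 (the "[CLS]" token), apply a
    fully connected layer, and return softmax class probabilities."""
    h = hidden_states[0]                       # sentence-level feature at [CLS]
    logits = [sum(w_i * x for w_i, x in zip(row, h)) + b_j
              for row, b_j in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # numerically stable softmax
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES
states = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(5)]
probs = cls_classify(states, W, b)
```

In the actual model the fully connected layer is trained jointly with the pretrained encoder; this sketch only shows the data flow from the “[CLS]” vector to the 44-class probability distribution.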
XLNet is an autoregressive language model created by Google Brain and Carnegie Mellon University. It avoids two shortcomings of BERT: the pretrain-finetune discrepancy caused by mask tokens that do not appear in real text, and the neglected dependency among masked words during prediction.
We used the XLNet [
In the last layer of our model, after obtaining the outputs of BERT, ERNIE, XLNet, and RoBERTa, we applied softmax to obtain each submodel's predicted probabilities over the 44 labels for each text. Let the probability of each model output be
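The construction of the meta-features fed to the LightGBM ensemble layer can be sketched as follows (a minimal illustration of stacking the submodels' softmax outputs; `stack_features` is a hypothetical helper, and the LightGBM training step itself is omitted):

```python
def stack_features(model_probs):
    """model_probs: list of K probability vectors (one per pretrained model,
    each of length 44 in the actual task, produced by softmax). Concatenate
    them into one K*44-dimensional meta-feature vector; LightGBM is then
    trained on these vectors with the gold labels as targets."""
    feats = []
    for p in model_probs:
        assert abs(sum(p) - 1.0) < 1e-6, "each submodel output must be a distribution"
        feats.extend(p)
    return feats

# Toy example with 3 classes instead of 44.
bert    = [0.7, 0.2, 0.1]
ernie   = [0.6, 0.3, 0.1]
xlnet   = [0.8, 0.1, 0.1]
roberta = [0.5, 0.4, 0.1]
x = stack_features([bert, ernie, xlnet, roberta])
```

Because the four encoders disagree in characteristic ways, the gradient-boosted trees can learn which submodel to trust for which kinds of input, rather than simply averaging or voting.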
We calculated the statistical characteristics of the training, validation, and test sets, and identified that there is a data imbalance issue.
To solve the problem of data imbalance, we applied focal loss [
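For one example with true-class probability p_t, focal loss takes the form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t); a minimal sketch (standard formulation, not the authors' code, with illustrative gamma and alpha values):

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=1.0):
    """Focal loss for one example, where p_t is the predicted probability
    of the true class. With gamma = 0 this reduces to plain cross-entropy;
    larger gamma down-weights easy, well-classified examples so that rare
    or hard classes contribute relatively more to training."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

loss_easy = focal_loss(0.95)  # confident, correct prediction: heavily down-weighted
loss_hard = focal_loss(0.3)   # uncertain prediction: kept close to its CE value
```

The modulating factor (1 - p_t)^gamma is what counters class imbalance: abundant, easily classified categories quickly reach high p_t and stop dominating the total loss.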
Histogram distributions of the training set, validation set, and test set. The y-axis represents different labels, and the x-axis represents quantity.
To ensure the reproducibility of the results and to facilitate experimental comparison of different methods, we fixed the random seed to 0 and the batch size to 128; other model parameters were kept the same, with the learning rate set to 2×10⁻⁵.
Training was performed on an NVIDIA 2080Ti graphics card with 11 GB of video memory. Due to the limited video memory, BERT, XLNet, ERNIE, and RoBERTa were trained separately on the training set (22962 texts) and evaluated on the validation set (7682 texts) and test set (7697 texts). The learning rate was 2×10⁻⁵, and 30 rounds of training were conducted for each model using Adam as the optimizer.
Our model was implemented using Python, based on the open source framework of PyTorch and open source pretraining parameters. To make the model converge faster and obtain better performance, we used open source parameters trained with a large amount of Chinese texts for different models for transfer learning.
To evaluate our model, we applied four metrics commonly used in machine learning classification tasks: accuracy, precision, recall, and F1 score. F1, which combines precision and recall, is the standard metric for this task, and macro F1 best reflects the effectiveness and stability of the model. According to the task requirements of CHIP 2019, we applied macro averaging to these four metrics. The four metrics are calculated as shown in Equations 4-7:
TP (true positive) is the number of categories
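These macro-averaged metrics follow the standard definitions from per-class TP, FP, and FN counts; a minimal pure-Python sketch (an illustration of the formulas, not the authors' evaluation code):

```python
def macro_metrics(y_true, y_pred, labels):
    """Compute overall accuracy plus macro-averaged precision, recall, and
    F1: per-class scores from TP/FP/FN counts, then an unweighted mean over
    classes, so rare categories weigh as much as frequent ones."""
    precs, recs, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    n = len(labels)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy example with 2 classes instead of 44.
acc, p, r, f1 = macro_metrics(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
```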
We ran experiments with each of the 4 single models, trained on the full training set, to provide baselines for comparison.
By studying the loss function on the training set, we found that for data sets with unbalanced categories, a single model using focal loss performed significantly better than one using cross-entropy (CE) loss.
Due to differences in model structure and parameters, the output probability distributions of the models differed from each other. For a classification task, the final parameter distributions of the models varied, and the results for different inputs had different confidences. After model ensembling, a more accurate prediction for the input sample [
The performance of our model and baseline models using the full training data set.
Model | Accuracy | Precision | Recall | Macro F1 |
BERTa | 0.836 | 0.779 | 0.802 | 0.788 |
XLNet | 0.844 | 0.790 | 0.811 | 0.795 |
ERNIEb | 0.836 | 0.786 | 0.795 | 0.783 |
RoBERTac | 0.840 | 0.791 | 0.800 | 0.792 |
Ensemble (Voting) | 0.846 | 0.800 | 0.812 | 0.802 |
Our model | 0.846 | 0.803 | 0.817 | 0.808 |
aBERT: Bidirectional Encoder Representations from Transformers.
bERNIE: Enhanced Representation through Knowledge Integration.
cRoBERTa: A Robustly Optimized BERT Pretraining Approach.
The performance of the 6 models using the reduced training data set.
Model | Accuracy | Precision | Recall | Macro F1 |
BERTa | 0.831 | 0.781 | 0.776 | 0.771 |
XLNet | 0.839 | 0.797 | 0.759 | 0.773 |
ERNIEb | 0.822 | 0.754 | 0.765 | 0.751 |
RoBERTac | 0.832 | 0.795 | 0.770 | 0.776 |
Ensemble (Voting) | 0.832 | 0.795 | 0.770 | 0.776 |
Our model | 0.834 | 0.790 | 0.785 | 0.780 |
aBERT: Bidirectional Encoder Representations from Transformers.
bERNIE: Enhanced Representation through Knowledge Integration.
cRoBERTa: A Robustly Optimized BERT Pretraining Approach.
The proposed method had a limitation. Compared with the performance of the model using the complete data set (
In the future, we believe that two aspects of our model could be improved: the data and the model. Short text has the characteristic of having fewer words, and may not be able to provide enough information [
The classification of clinical trial eligibility criteria texts is a fundamental and critical step in clinical target population recruitment. This research proposed an ensemble learning method that integrates the current cutting-edge deep learning models BERT, ERNIE, XLNet, and RoBERTa. Through model ensemble in two layers, we trained our model and compared it with a list of baseline deep learning models on a publicly available standard data set. The results demonstrated that our proposed ensemble learning method outperformed the baseline methods by 2.35% on average.
Bidirectional Encoder Representations from Transformers.
Bidirectional Long Short-Term Memory.
China Health Information Processing Conference.
convolutional neural network.
Enhanced Representation through Knowledge Integration.
Light Gradient Boosting Machine.
Long Short-Term Memory.
Natural Language Processing.
Next Sentence Prediction.
recurrent neural network.
A Robustly Optimized BERT Pretraining Approach.
Whole Word Masking.
The work was supported by funding from the National Science Foundation Grant of China (U1711266), the Science and Technology Plan of Guangzhou (201804010296), and the Natural Science Foundation of Guangdong Province (2018A030310051).
None declared.