This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Under the paradigm of precision medicine (PM), patients with the same disease can receive different personalized therapies according to their clinical and genetic features. These therapies are determined by the totality of available clinical evidence, including results from case reports, clinical trials, and systematic reviews. However, it is increasingly difficult for physicians to find such evidence in scientific publications, whose volume is growing at an unprecedented pace.
In this work, we propose the PM-Search system to facilitate the retrieval of clinical literature that contains critical evidence for or against giving specific therapies to certain cancer patients.
The PM-Search system combines a baseline retriever that selects document candidates at a large scale and an evidence reranker that finely reorders the candidates based on their evidence quality. The baseline retriever uses query expansion and keyword matching with the ElasticSearch retrieval engine, and the evidence reranker fits pretrained language models to expert annotations that are derived from an active learning strategy.
The PM-Search system achieved the best performance in the retrieval of high-quality clinical evidence at the Text Retrieval Conference PM Track 2020, outperforming the second-ranking systems by large margins (0.4780 vs 0.4238 for standard normalized discounted cumulative gain at rank 30 and 0.4519 vs 0.4193 for exponential normalized discounted cumulative gain at rank 30).
We present PM-Search, a state-of-the-art search engine to assist the practice of evidence-based PM. PM-Search uses a novel Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT)–based active learning strategy that models evidence quality and improves the model performance. Our analyses show that evidence quality is a distinct aspect from general relevance, and specific modeling of evidence quality beyond general relevance is required for a PM search engine.
Traditionally, patients with the same diseases are treated with the same therapies. However, the treatment effects can be highly heterogeneous, that is, the benefits and risks may differ substantially among patient subgroups [
PM practices should be guided by the principles of evidence-based medicine [
To facilitate information retrieval (IR) research for PM, the Text Retrieval Conference (TREC) has held the PM Track annually since 2017. From 2017 to 2019, the TREC PM focused on finding relevant academic papers or clinical trials for patient topics specified by their demographics, diseases, and gene mutations [
Traditional IR systems are mostly based on term frequency–inverse document frequency and its derivatives, which essentially rank the documents by their bag-of-words similarity to the input query. However, biomedical concepts are often referred to by various synonyms, and multiple studies have shown the importance of expanding query concepts to their synonyms before sending them to IR systems [
In this work, we propose the PM-Search model that tackles the aforementioned problems of traditional search engines to assist the practice of PM. The PM-Search system has two main components: (1) a baseline retriever using query expansion and keyword matching with the ElasticSearch engine; and (2) an evidence reranker that reorders the initial documents returned by ElasticSearch based on their evidence quality. The reranking uses article features as well as pretrained language models under an expert-in-the-loop active learning strategy, where the biomedical language model Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) [
In summary, the contributions of this work are threefold:
We present PM-Search, which is an integrated IR system specifically designed to assist precision medicine. PM-Search achieved state-of-the-art performance in the TREC PM Track.
We used an expert-in-the-loop active learning strategy based on BioBERT to efficiently derive annotations and improve model performance. To the best of our knowledge, this is the first precision medicine search engine that combines active learning and pretrained language models.
We thoroughly analyzed the importance of each system feature with a full set of ablation studies, where we found that the most important features included publication types and active learning. We hope the experiments can provide some insights into the potential future directions of PM search engines.
The TREC 2020 PM Track provided 40 topics for evaluation. Each topic represented a PM query that contains three key elements of a specific patient population: (1) the disease, that is, the type of cancer; (2) the genetic variant, that is, the gene mutation; and (3) the tentative treatment. The topics were synthetically generated by biomedical experts, and several examples are shown in Table 1.
The evaluation of the task followed the standard TREC procedure for ad hoc retrieval, where participants submitted up to 5 different runs, each containing a maximum of 1000 ranked articles per topic. The assessments were divided into 2 phases: phase 1 was "Relevance Assessment," judging the relevance of each article, and phase 2 was "Evidence Assessment," judging the quality of the evidence provided by the article.
Phase 1 assessment was a general IR assessment that only considered relevance, where the assessors first judged whether the returned article was relevant to the patient topic. The main evaluation metrics were the inferred normalized discounted cumulative gain (infNDCG), the precision at the top 10 positions (P@10), and the R-precision (R-prec). R-precision is computed by:

$$\text{R-prec} = \frac{r}{R}$$

where $R$ is the number of relevant articles for the query and $r$ is the number of relevant articles among the top $R$ retrieved articles. NDCG is computed by:

$$\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}$$

where $\mathrm{rel}_i$ is the relevance gain of the article at rank $i$, and IDCG@$k$ is the DCG@$k$ of the ideal ranking that sorts all judged articles by decreasing relevance.
In the phase 2 assessment, the assessors scored the relevant papers from the phase 1 assessment using a 5-point scale. For example, the tier 4 results should be "randomized controlled trial with >200 patients and single drug, or meta-analysis" and tier 0 should be "Not Relevant" for topic 16. The scale was tailored for each topic to adjust for the differences in the disease, genetic variant, and treatment. The main evaluation metric for phase 2 assessment was NDCG@30. NDCG values at this phase are exact since all articles in the top 30 ranks are judged. Two sets of relevance values were used to compute NDCG: the standard gains (std-gains) and the exponential gains (exp-gains). Standard gains use the tier scores directly (ie, rel_i is the tier of the article at rank i, ranging from 0 to 4), whereas exponential gains assign exponentially larger values to higher tiers, placing more emphasis on the articles with the highest evidence quality.
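To make the metric concrete, the following minimal sketch computes NDCG@30 under both gain settings; the 2^tier − 1 exponential mapping is an illustrative assumption, as the track defines its own tier-to-gain mapping.

```python
import math

def ndcg_at_k(gains, k=30):
    """NDCG@k for a ranked list of graded gain values."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)          # ideal reordering of judgments
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

tiers = [4, 0, 3, 2, 0, 1]                       # tier judgments by rank (illustrative)
ndcg_std = ndcg_at_k(tiers)                      # standard gains: gain = tier
ndcg_exp = ndcg_at_k([2 ** t - 1 for t in tiers])  # assumed exponential mapping
```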
Examples of the Text Retrieval Conference Precision Medicine 2020 topics.
Topic | Disease | Gene | Treatment |
1 | Colorectal cancer | ABL proto-oncogene 1 | Regorafenib |
11 | Breast cancer | Cyclin dependent kinase 4 | Abemaciclib |
21 | Differentiated thyroid carcinoma | Fibroblast growth factor receptor 2 | Lenvatinib |
31 | Hepatocellular carcinoma | Neurotrophic receptor tyrosine kinase 2 | Sorafenib |
As shown in Figure 1, the PM-Search system consists of two components: a baseline retriever that selects candidate articles from PubMed at a large scale, and an evidence reranker that reorders the candidates based on their evidence quality.
The architecture of PM-Search. EBM: evidence-based medicine; PM: Precision Medicine.
We indexed the titles and abstracts of all articles from the PubMed 2019 baseline provided by the TREC organizers using ElasticSearch, a Lucene-based search engine. The synonyms of the disease, gene, and treatment in each topic were collected to expand the query before retrieval.
For each synonym $s$ of a topic concept, we assigned a boosting weight based on its normalized document frequency:

$$w(s) = \frac{\mathrm{df}(s)}{\max_{s' \in S} \mathrm{df}(s')}$$

where $\mathrm{df}(s)$ is the number of indexed documents containing $s$, and $S$ is the set of synonyms of the same concept.
We used the normalized document frequency to lower the ranks of rare terms.
We performed the retrieval in ElasticSearch, which ranks the documents based on their word-level relevance to the input query using the Okapi BM25 algorithm [
TREC PM allowed a maximum of 1000 documents per topic in the submission, whereas we set the maximum number of documents retrieved by the baseline retriever to 10,000 per topic for reranking. On average, the baseline retriever returned 1589 candidates for each topic.
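As an illustration, the following sketch issues such a boosted synonym query, assuming the elasticsearch-py 8.x client; the index name, field names, and synonym weights are hypothetical, with weights precomputed by the normalized document frequency described above.

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

def build_query(expanded_topic):
    """expanded_topic maps each concept to (synonym, weight) pairs."""
    should = []
    for synonyms in expanded_topic.values():
        for synonym, weight in synonyms:
            for field in ("title", "abstract"):
                # Boost each synonym match by its normalized document frequency
                should.append({"match": {field: {"query": synonym, "boost": weight}}})
    # BM25 is ElasticSearch's default similarity; "should" clause scores sum up
    return {"bool": {"should": should}}

topic = {  # topic 1 of Table 1, with illustrative synonym weights
    "disease": [("colorectal cancer", 1.0), ("colorectal carcinoma", 0.8)],
    "gene": [("ABL1", 1.0), ("ABL proto-oncogene 1", 0.3)],
    "treatment": [("regorafenib", 1.0)],
}
hits = client.search(index="pubmed2019", query=build_query(topic), size=10000)
```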
The Evidence Re-ranker scores a given candidate article $d$ for a topic $q$ by a weighted linear combination of its feature values:

$$\mathrm{score}(q, d) = \sum_{f} w_f \cdot f(q, d)$$

where $f$ ranges over the features described below and $w_f$ denotes the weight of feature $f$; the weights used in each system setting are listed in Table 3.
BioBERT [ ] is a BERT model pretrained on large-scale biomedical corpora. We used it to score topic-article pairs: the topic and the article title and abstract are concatenated into a single sequence and encoded by BioBERT:

$$\mathbf{h} = \mathrm{BioBERT}(\mathrm{[CLS]}\ q\ \mathrm{[SEP]}\ d\ \mathrm{[SEP]})$$

where $\mathbf{h}$ denotes the final hidden state of the [CLS] token. The relevance score is then computed by:

$$\hat{y} = \sigma(\mathbf{w}^{\top} \mathbf{h} + b)$$

where σ denotes the sigmoid function, and $\mathbf{w}$ and $b$ are the learnable parameters of the scoring layer.
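A minimal sketch of this scoring scheme with the Hugging Face transformers library; the checkpoint name is one public BioBERT release, and the untrained linear head stands in for the parameters w and b learned during fine-tuning.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
head = torch.nn.Linear(encoder.config.hidden_size, 1)  # stands in for w and b

@torch.no_grad()
def score_pair(topic: str, article: str) -> float:
    # Encodes [CLS] topic [SEP] article [SEP], truncated to BERT's 512-token limit
    inputs = tokenizer(topic, article, truncation=True, max_length=512,
                       return_tensors="pt")
    h = encoder(**inputs).last_hidden_state[:, 0]  # final [CLS] hidden state
    return torch.sigmoid(head(h)).item()           # relevance score in (0, 1)

print(score_pair("colorectal cancer ABL1 regorafenib",
                 "Efficacy and safety of regorafenib ..."))
```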
We show the expert-in-the-loop active learning procedure in Figure 2 and the expert annotation pipeline in Figure 3. In each iteration, the current model reranks the candidate articles, biomedical experts annotate the evidence quality of a batch of highly ranked articles, and BioBERT is fine-tuned on all annotations collected so far before the next iteration.
The architecture of our expert-in-the-loop active learning strategy. BioBERT: Bidirectional Encoder Representations from Transformers for Biomedical Text Mining; Y: yes; N: no.
The expert annotation pipeline.
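A schematic sketch of this loop; `expert_annotate`, `model.score`, and `model.fine_tune` are hypothetical interfaces, and sending the top-ranked articles for annotation is one possible selection criterion.

```python
def active_learning_loop(model, candidates, n_rounds=5, batch_size=50):
    """Expert-in-the-loop training: the model proposes articles, biomedical
    experts annotate their evidence quality, and the model is refit."""
    annotated, labels = [], []
    for _ in range(n_rounds):
        pool = [d for d in candidates if d not in annotated]
        # Rank the remaining pool with the current model and pick the top batch
        batch = sorted(pool, key=model.score, reverse=True)[:batch_size]
        labels.extend(expert_annotate(batch))  # hypothetical annotation interface
        annotated.extend(batch)
        model.fine_tune(annotated, labels)     # refit BioBERT on all annotations
    return model, annotated, labels
```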
We used the expert annotations to train a simple linear regression model using the following features:
es: the relevance scores returned by ElasticSearch;
pb: the relevance scores predicted by a pretrained BioBERT. We used the annotations from the previous TREC PM challenges to fine-tune the BioBERT. Specifically, we collected 54,500 topic-document relevance annotations from the previous tracks (2017-2019);
ty: the publication type score. PubMed also indexes each article with publication types, such as journal article, review, and clinical trial. We manually rated the score of each publication type based on the judgment of its evidence quality. Our publication type and score mapping is shown in Table 2;
ct: the citation count score. We ranked the citation count of all PubMed articles and used the quantile of a specific article’s citation count as a feature. Similar to but simpler than PageRank [
The linear regression was implemented using the
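A minimal sketch of fitting such a regressor over the 4 features, assuming a scikit-learn implementation; the feature values and expert labels below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per annotated topic-article pair: [es, pb, ty, ct] feature values
X = np.array([
    [12.3, 0.91,  2, 0.95],
    [10.8, 0.67,  0, 0.40],
    [ 9.5, 0.72,  1, 0.78],
    [ 7.1, 0.33, -1, 0.15],
])
y = np.array([4, 1, 2, 0])  # expert-annotated evidence quality tiers (0-4)

reg = LinearRegression().fit(X, y)
lr_feature = reg.predict(X)  # the LR score used by the evidence reranker
```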
Mappings between publication types and clinical evidence quality scores.
Publication type | Score |
Comment | –1 |
Editorial | –1 |
Published erratum | –2 |
Retraction of publication | –2 |
English abstract | 0 |
Journal article | 0 |
Letter | 0 |
Review | 0 |
Case reports | 1 |
Observational study | 1 |
Clinical trial | 2 |
Meta-analysis | 2 |
Systematic review | 2 |
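In code, the mapping of Table 2 can be applied directly; taking the maximum score over an article's publication types is an assumption, as the handling of articles with multiple types is not specified above.

```python
# Publication type scores from Table 2, keyed by MEDLINE publication type strings
PUB_TYPE_SCORES = {
    "Comment": -1, "Editorial": -1,
    "Published Erratum": -2, "Retraction of Publication": -2,
    "English Abstract": 0, "Journal Article": 0, "Letter": 0, "Review": 0,
    "Case Reports": 1, "Observational Study": 1,
    "Clinical Trial": 2, "Meta-Analysis": 2, "Systematic Review": 2,
}

def ty_score(pub_types):
    # Assumption: take the maximum score when an article has several types
    return max((PUB_TYPE_SCORES.get(t, 0) for t in pub_types), default=0)

print(ty_score(["Journal Article", "Clinical Trial"]))  # -> 2
```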
We compared our PM-Search submissions to TREC PM 2020 with models submitted by other teams. We used 5 settings in the challenge, namely PM-Search-auto-1, PM-Search-auto-2, PM-Search-full-1, PM-Search-full-2, and PM-Search-full-3, whose TREC run IDs and feature weights are listed in Table 3.
We also used the following ablation settings to study the contribution of each feature: Retriever + pb, Retriever + ty, Retriever + ct, the linear regressor alone (LR), and the fine-tuned BioBERT alone (FB). Each system computes the final ranking score as:

$$\mathrm{score} = w_{\mathrm{es}} \cdot \mathrm{es} + w_{\mathrm{pb}} \cdot \mathrm{pb} + w_{\mathrm{ty}} \cdot \mathrm{ty} + w_{\mathrm{ct}} \cdot \mathrm{ct} + w_{\mathrm{LR}} \cdot \mathrm{LR} + w_{\mathrm{FB}} \cdot \mathrm{FB}$$

where es denotes the ElasticSearch score, pb the score of the pretrained BioBERT, ty the publication type score, ct the citation count score, LR the output of the linear regressor, and FB the score of the fine-tuned BioBERT.
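For example, the final score under the PM-Search-full-3 weights from Table 3 can be computed as follows; the feature values are assumed to be precomputed per article.

```python
# Feature weights of the PM-Search-full-3 run (Table 3)
WEIGHTS = {"es": -0.465, "pb": -0.141, "ty": -0.617, "ct": -0.005,
           "LR": 1.0, "FB": 5.0}

def final_score(features):
    """features maps each feature name to its precomputed value for an article."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

print(final_score({"es": 12.3, "pb": 0.91, "ty": 2, "ct": 0.95,
                   "LR": 2.6, "FB": 0.83}))  # illustrative values
```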
Feature weights in different systems. TREC run ID denotes the run name submitted to the Text Retrieval Conference (TREC) Precision Medicine (PM) track.
System | TREC run ID | esa | pbb | tyc | ctd | LRe | FBf |
TREC PM 2020 submissions
PM-Search-auto-1 | damoespb1 | 1.0 | 0.5 | 1.5 | 0.0 | —g | — |
PM-Search-auto-2 | damoespb2 | 1.0 | 0.5 | 1.0 | 0.0 | — | — |
PM-Search-full-1 | damoespcbh1 | –0.465 | –0.141 | –0.617 | –0.005 | 1.0 | 1.0 |
PM-Search-full-2 | damoespcbh2 | –0.465 | –0.141 | –0.617 | –0.005 | 1.0 | 2.0 |
PM-Search-full-3 | damoespcbh3 | –0.465 | –0.141 | –0.617 | –0.005 | 1.0 | 5.0 |
Ablation settings
Retriever + pb | N/Ah | 1.0 | 1.0 | 0.0 | 0.0 | — | — |
Retriever + ty | N/A | 1.0 | 0.0 | 1.0 | 0.0 | — | — |
Retriever + ct | N/A | 1.0 | 0.0 | 0.0 | 1.0 | — | — |
LR | N/A | –0.465 | –0.141 | –0.617 | –0.005 | 1.0 | 0.0 |
FB | N/A | –0.465 | –0.141 | –0.617 | –0.005 | 0.0 | 1.0 |
aes: ElasticSearch score.
bpb: pretrained BioBERT.
cty: publication type.
dct: citation count.
eLR: linear regressor.
fFB: fine-tuned BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining).
gNot available.
hN/A: not applicable.
The main results of our participating systems in the TREC PM 2020, compared with the other top-ranking systems, are shown in Table 4.
Topic-wise averaged performance of different settings in the evaluation. All numbers are percentages. Other top-ranking Text Retrieval Conference (TREC) submissions listed in the table include the systems of BIT.UA [
| Evidence quality (phase 2) | General relevance (phase 1) |
System | NDCG@30a, exponential | NDCG@30, standard | infNDCGb | P@10c | R-precd |
Top-ranking TREC submissions
First | 45.19 (ours) | 47.80 (ours) | 53.25 [ | 56.45 [ | 43.58 [ |
Second | 41.93* [ | 42.38* [ | 53.03 [ | 55.16 [ | 42.07 [ |
Median | 28.57 | 25.29 | 43.16 | 46.45 | 32.59 |
Our submissions
PM-Search-full-3 | 45.19 | 47.80 | 44.24 | 47.42 | 34.72 |
PM-Search-full-1 | 44.97 | 47.30 | 43.04 | 47.42 | 34.10 |
PM-Search-full-2 | 44.95 | 47.46 | 43.84 | 47.10 | 34.14 |
PM-Search-auto-1 | 42.55 | 44.17* | 45.33 | 47.42 | 35.93 |
PM-Search-auto-2 | 42.54 | 44.60* | 41.12 | 44.52 | 32.37 |
Ablation settings
Retriever + pbe | 32.36* | 37.04* | 52.26 | 53.87 | 41.21 |
Retriever + tyf | 41.46* | 43.26* | 37.80 | 40.32 | 29.37 |
Retriever + ctg | 35.55* | 38.40* | 42.20 | 44.84 | 32.52 |
Linear regressor | 42.86* | 44.86* | 37.65 | 46.13 | 30.74 |
Linear regressor, leave-one-out | 42.08* | 43.81* | 37.06 | 46.45 | 30.58 |
Fine-tuned BioBERTh | 44.40* | 47.01* | 44.59 | 47.42 | 34.87 |
Fine-tuned BioBERT, leave-one-out | 44.15* | 46.58* | 43.83* | 46.45* | 33.81* |
aNDCG@30: normalized discounted cumulative gain at rank 30.
binfNDCG: inferred NDCG.
cP@10: precision at rank 10.
dR-prec: R-precision.
epb: pretrained BioBERT.
fty: publication type.
gct: citation count.
hBioBERT: Bidirectional Encoder Representations from Transformers for Biomedical Text Mining.
*Significant differences from PM-Search-full-3. Significance is defined as
Our submissions scored higher than the topic-wise median submission, but the best submission (infNDCG: 0.5325, P@10: 0.5645, R-prec: 0.4358) outperformed our submissions (infNDCG: 0.4533, P@10: 0.4742, R-prec: 0.3593). Our PM-Search runs (Table 4) were tuned toward evidence quality rather than general relevance, which partly accounts for this gap.
Our PM-Search system achieved the best performance among all TREC PM 2020 submissions in the phase 2 evidence quality assessment, outperforming the second-best system by large margins in both the standard and exponential NDCG@30 (Table 4).
We also experimented with different settings and studied the importance of PM-Search components, including the baseline retriever, active learning, and the reranking features.
In Table 5, we show the ablation results of different baseline retriever settings. Removing query expansion significantly reduced the recall in both assessment phases, whereas removing keyword matching had almost no effect on recall.
Ablation results of different baseline retriever settings (in percentages).
| Evidence quality (phase 2) | General relevance (phase 1) |
Method | R@0.5ka | R@1kb | R@10kc | R@0.5k | R@1k | R@10k |
Baseline retriever | 68.99 | 75.96 | 81.00 | 65.51 | 72.30 | 77.71 | |
Baseline retriever without query expansion | 66.84* | 72.61* | 76.94* | 61.85* | 67.21* | 72.90* | |
Baseline retriever without keyword matching | 68.85 | 76.06 | 81.00 | 65.65 | 72.33 | 77.71 |
aR@0.5k: recall at the top 500 positions.
bR@1k: recall at the top 1000 positions.
cR@10k: recall at the top 10,000 positions.
*Significant differences from the original retrieval. Significance is defined as
In Figure 4, we show the inferred NDCG@30 and the average annotated relevance at each iteration of the active learning procedure.
InfNDCG@30 and average annotated relevance at each iteration in active learning. InfNDCG@30: inferred normalized discounted cumulative gain at rank 30.
To analyze the importance of the used features, we show the ablation experiments in Table 4 and the Pearson correlations between each feature and the official scores in Table 6.
General relevance (phase 1): The BioBERT fine-tuned on the annotations of the previous TREC PM tracks (pb) had the highest correlation (0.5771) with the phase 1 scores, and the baseline retriever combined with the pretrained BioBERT had the highest performance (infNDCG: 52.26%) in our ablation experiments. This is probably because the evaluations of the previous tasks were also based on general relevance. The ElasticSearch score (es) achieved the second highest correlation of 0.3892, and the BioBERT fine-tuned by active learning (FB) had a Pearson correlation of 0.3733. However, our expert annotations for evidence quality had a Pearson correlation of only 0.2157 with the general relevance scores, which indicates that generally relevant papers might not have high evidence quality. In addition, the publication type (ty) and citation count (ct) features, which are designed for evidence quality ranking and are positively correlated with evidence quality, were negatively correlated with the general relevance scores.
Evidence quality (phase 2): The trends of the ablation results and of the correlations between features and evidence quality scores were similar for both the standard and exponential gains. The most important features in the evidence quality evaluation included publication types and active learning. Interestingly, using only the publication type and the baseline retriever already achieved performance comparable to the second-best system in TREC PM (0.4146 vs 0.4193 for exponential NDCG@30). The BioBERT fine-tuned on the expert annotations (FB) had the highest performance in the ablation experiments (exponential NDCG@30: 0.4440), and its correlation to the official annotations was close to that of our expert annotations (0.3309 vs 0.2937 for exponential gains; 0.2847 vs 0.3073 for standard gains). In addition, the fine-tuned BioBERT outperformed the expert annotations by a large margin (0.3733 vs 0.2157) in the phase 1 assessment, indicating that it can rerank the documents by evidence quality while retaining the original general relevance ranks to some extent. The features most correlated with phase 1, that is, the pretrained BioBERT (pb) and the ElasticSearch score (es), had the lowest correlations with the phase 2 scores, which further confirms that the evidence quality assessment is distinct from the general relevance assessment.
In summary, the 2 assessment phases might have opposite considerations since features that are highly related to the score of one phase tended to be much less related to the score of the other phase, with the exception of the fine-tuned BioBERT. As a result, specific modeling of evidence quality beyond general relevance is required for a PM search engine.
Feature correlations to the official scores.
Features | esa | pbb | tyc | ctd | LRe | FBf | Expert annotation |
General relevance | 0.3892 | 0.5771 | –0.0621 | –0.0435 | 0.1341 | 0.3733 | 0.2157 |
Evidence quality, standard gains | 0.0752 | 0.0621 | 0.2564 | 0.0696 | 0.2728 | 0.2847 | 0.3073 |
Evidence quality, exponential gains | 0.0474 | 0.0338 | 0.2772 | 0.0806 | 0.2816 | 0.3309 | 0.2937 |
aes: ElasticSearch score.
bpb: pretrained Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT).
cty: publication type.
dct: citation count.
eLR: linear regressor.
fFB: fine-tuned BioBERT.
Each instance used to train the PM-Search reranker contained a topic-article pair and its relevance score. The main results show that PM-Search is generalizable at the instance level, that is, to unseen topic-article pairs from topics seen during training.
Here, we analyze how PM-Search generalizes to unseen topics using a leave-one-out evaluation strategy. Each time, we used the official annotations of only one topic to evaluate a model trained on our expert annotations of all other topics. The results with each topic as the evaluation topic were calculated, and the averaged performance is shown in Table 4 (the leave-one-out settings), which drops only slightly below that of the corresponding models trained on all topics.
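A schematic sketch of this leave-one-out protocol; `train_fn` and `eval_fn` are hypothetical stand-ins for fine-tuning on expert annotations and computing the official metrics.

```python
def leave_one_out(topics, annotations, train_fn, eval_fn):
    """Evaluate generalization to unseen topics: for each topic, train on the
    expert annotations of all other topics and test on the held-out one."""
    scores = []
    for held_out in topics:
        train_set = [a for a in annotations if a["topic"] != held_out]
        model = train_fn(train_set)               # eg, fine-tune BioBERT
        scores.append(eval_fn(model, held_out))   # eg, NDCG@30 on official labels
    return sum(scores) / len(scores)
```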
We show several typical error cases in Table 7 and discuss them below.
Typical error cases in the evidence quality assessment. Topics are shown in Table 1.
Case | Topic | Article | Official, rank (normalized relevance) | PMa-Search, rank (normalized relevance) | Error type |
1 | 1 | PMIDb: 23177515; Title: Efficacy and safety of regorafenib for advanced gastrointestinal stromal tumours after failure of imatinib and sunitinib (GRID): an international, multicentre, randomised, placebo-controlled, phase 3 trial | 1 (1.00) | N/Ac | Concept recognition |
2 | 1 | PMID: 24150533; Title: Risk of hypertension with regorafenib in cancer patients: a systematic review and meta-analysis | 1 (1.00) | 148 (0.47) | Different understanding |
3 | 1 | PMID: 25213161; Title: Randomized phase III trial of regorafenib in metastatic colorectal cancer: analysis of the CORRECT Japanese and non-Japanese subpopulations | 1 (1.00) | 297 (0.29) | Unclassified |
4 | 11 | PMID: 29147869; Title: Hematological adverse effects in breast cancer patients treated with cyclin-dependent kinase 4 and 6 inhibitors: a systematic review and meta-analysis | 1 (1.00) | N/A | Full article visibility |
5 | 11 | PMID: 28540640; Title: A Population Pharmacokinetic and Pharmacodynamic Analysis of Abemaciclib in a Phase I Clinical Trial in Cancer Patients | 1 (1.00) | 53 (0.50) | Full article visibility |
6 | 11 | PMID: 29700711; Title: Cyclin-dependent kinase 4/6 inhibitors in hormone receptor-positive early breast cancer: preliminary results and ongoing studies | 61 (0.25) | 6 (0.71) | Different understanding |
aPM: precision medicine.
bPMID: PubMed IDentifier.
cN/A: not applicable.
The PM-Search system can only access the titles and abstracts of PubMed articles. However, vital article information (eg, detailed gene variant types and treatments) might only appear in the full article, especially for meta-analyses and systematic reviews, whose abstracts tend to use more general concepts. For example, PM-Search failed to rank the Case 5 article highly because the queried disease "breast cancer" is mentioned only in the full article, not in the abstract. For this, future models can use the full article information from PubMed Central to better retrieve and rank relevant papers.
In some cases, our understanding of the clinical significance of the evidence provided by an article differs from that of the official assessors. For example, the article "Risk of hypertension with regorafenib in cancer patients: a systematic review and meta-analysis" in Case 2 focuses on the hypertension side effect of the therapy rather than its therapeutic effects, which we consider less clinically significant. However, it was given the highest score in the official evaluation but was ranked much lower in the PM-Search prediction. This issue should be addressed by community efforts to develop annotation standards.
The baseline retriever of PM-Search uses query expansion to recognize relevant concepts in the article. However, this step is error prone since biomedical terms are highly variable and thus cannot always be captured by a list of synonyms. For example, in Case 1, the "colorectal cancer" in the query appears as "gastrointestinal stromal tumours" in the article, which was missed in the query expansion step of PM-Search. As a result, this article was not returned by PM-Search but was ranked the highest in the official assessment. Improving similar concept recognition, such as using distributed representations of concepts, remains an important direction to explore.
Many IR systems for precision medicine have been proposed in the TREC PM tracks [
In this paper, we present PM-Search, a search engine for PM that achieved state-of-the-art performance in TREC PM 2020. PM-Search uses an ElasticSearch-based baseline retriever with query expansion and keyword matching, and an evidence reranker based on a BioBERT fine-tuned under an active learning strategy. Our analyses show that evidence quality is a distinct aspect from general relevance, and thus specific modeling of evidence quality is necessary to assist the practice of evidence-based PM.
The deployment and evaluation of PM-Search in real clinical settings remains a clear future direction. It is also worth exploring the use of dense vectors for baseline retrieval and incorporating full-text information into the ranking process.
BERT: Bidirectional Encoder Representations from Transformers
BioBERT: Bidirectional Encoder Representations from Transformers for Biomedical Text Mining
ct: citation count
es: ElasticSearch score
FB: fine-tuned BioBERT
infNDCG: inferred normalized discounted cumulative gain
IR: information retrieval
LR: linear regressor
NDCG: normalized discounted cumulative gain
NDCG@30: normalized discounted cumulative gain at rank 30
P@10: precision at rank 10
pb: pretrained BioBERT
PM: precision medicine
R-prec: R-precision
TREC: Text Retrieval Conference
ty: publication type
We thank the organizers of Text Retrieval Conference (TREC) Precision Medicine (PM) 2020 for their efforts in holding this task and manual evaluations of the submitted systems.
None declared.