This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Systematic reviews and their implementation in practice provide high-quality evidence for clinical practice but are both time and labor intensive due to the large number of articles that must be screened. Automatic text classification has proven instrumental in identifying relevant articles for systematic reviews. Existing approaches use machine learning model training to generate classification algorithms for the article screening process but have limitations.
We applied a network approach to assist in the article screening process for systematic reviews using predetermined article relationships (similarity). The article similarity metric is calculated using the MEDLINE elements title (TI), abstract (AB), medical subject heading (MH), author (AU), and publication type (PT). We used an article network to illustrate the concept of article relationships. Using the concept, each article can be modeled as a node in the network and the relationship between 2 articles is modeled as an edge connecting them. The purpose of our study was to use the article relationship to facilitate an interactive article recommendation process.
We used 15 completed systematic reviews produced by the Drug Effectiveness Review Project and demonstrated the use of article networks to assist article recommendation. We evaluated the predictive performance of MEDLINE elements and compared our approach with existing machine learning model training approaches. Performance was measured by work saved over sampling at 95% recall (WSS95) and the F-measure (F1). We also used repeated-measures analysis of variance and Hommel’s multiple comparison adjustment to assess statistical significance.
We found that although there is no significant difference across elements (except for AU), TI and AB have better predictive capability in general. Combining elements brings performance improvement in both F1 and WSS95. With our approach, a simple combination of TI+AB+PT achieved a WSS95 of 37%, which is competitive with traditional machine learning model training approaches (23%-41% WSS95).
We demonstrated a new approach to assist in labor-intensive systematic reviews and explored the predictive ability of different elements (both single and composite). Without model training, we established a generalizable method that achieves competitive performance.
Systematic reviews provide summaries of evidence from high quality studies to answer specific research questions. They are regularly used in health care [
MEDLINE is a biomedical literature database that stores and indexes a variety of relevant publications and is a primary resource for identifying studies for systematic reviews targeting the health sciences. However, the size of MEDLINE increases at a rate of over 12,000 articles per week, including reports related to over 300 randomized trials [
A systematic review is commonly conducted by domain experts who are able to draft systematic review scopes, retrieve relevant citations, assess study quality, and synthesize evidence. The process can be broken down into 15 steps [
In this paper, we proposed using predetermined article relationships (similarity) and incorporating the concept of active machine learning to iteratively recommend articles and receive feedback from human reviewers. Although the idea of integrating human judgment sounds similar to the active learning approach implemented in Wallace’s work [
The predetermined relationships between articles can be conceptualized as an article network, which is different from the traditional citation network. A traditional citation network uses the
During our preliminary experiments, we found that a similarity score composed of all MEDLINE elements does not work well for every systematic review. We suspected that some elements (eg, title, abstract, publication type, MeSH, author) are better predictors for recommendations than others. Therefore, the purpose of our study was to answer two research questions. When an article is classified as included, what element(s) are better to use to calculate the similarity score to predict the next relevant article? Since every element plays a different role and should be weighted accordingly, what combinations and weights of elements are better to predict the next relevant article?
Illustrated article network.
To evaluate our approach, we used 15 publicly available completed systematic review samples produced by the Drug Effectiveness Review Project (DERP) (coordinated by the Center for Evidence-Based Policy at Oregon Health and Science University) [
For instance, the review for ACE Inhibitors has a total of 2544 citations. Based on the abstracts, 183 (7.19%) were included; after full-text reading, 41 (1.61%) were included in the ACE Inhibitor systematic review. The final inclusion rates range from 0.78% to 27.04%. The 15 systematic reviews are also the same test collection previously used and made publicly available by Cohen et al [
Total article numbers and rates of inclusion.
SR report topic | Total | Abstract, n (%) | Full text, n (%)
ACE inhibitors | 2544 | 183 (7.19) | 41 (1.61)
ADHD | 851 | 84 (9.87) | 20 (2.35) |
Antihistamines | 310 | 92 (29.68) | 16 (5.16) |
Atypical antipsychotics | 1120 | 363 (32.41) | 146 (13.04) |
Beta blockers | 2072 | 302 (14.58) | 42 (2.03) |
Calcium channel blockers | 1218 | 279 (22.91) | 100 (8.21) |
Estrogens | 368 | 80 (21.74) | 80 (21.74) |
NSAIDS | 393 | 88 (22.39) | 41 (10.43) |
Opioids | 1915 | 48 (2.51) | 15 (0.78) |
Oral hypoglycemics | 503 | 139 (27.63) | 136 (27.04) |
Proton pump inhibitors | 1333 | 238 (17.85) | 51 (3.83) |
Skeletal muscle relaxants | 1643 | 34 (2.07) | 9 (0.55) |
Statins | 3465 | 173 (4.99) | 85 (2.45) |
Triptans | 671 | 218 (32.49) | 24 (3.58) |
Urinary incontinence | 327 | 78 (23.85) | 40 (12.23) |
MEDLINE elements are the fields in the MEDLINE format that document the major pieces of information of a publication (article) [
We calculate the similarity score using cosine similarity [
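As an illustration of the element-wise similarity computation, cosine similarity over term-frequency vectors can be sketched as below. This is a minimal sketch, not the authors' implementation, which may use different tokenization or weighting (eg, tf-idf); the example titles are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts over simple term-frequency vectors."""
    # Naive whitespace tokenization; the actual system may use stemming,
    # stop-word removal, or tf-idf weighting instead.
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical titles: shared terms push the score toward 1.
s = cosine_similarity("ACE inhibitors in hypertension",
                      "hypertension treated with ACE inhibitors")
```

In the full system, one such score would be computed per MEDLINE element (TI, AB, MH, AU, PT) and the scores combined.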
In this study, there is no human reviewer in this experimental process. The interactive recommendation process is simulated using the 15 completed DERP systematic reviews.
After identifying a list of articles to be screened for a systematic review, the recommendation process starts with calculating the similarity scores of any pairs of articles. This process constructs the relationship of the articles and builds a conceptualized article network. The first recommended article is selected based on key questions and search strategies of the systematic review. Once a recommended article is classified as included (IN) or excluded (EX), an IN list and an EX list are created (in this study, we used completed systematic reviews, which have predetermined decisions to simulate this step). We then iteratively recommend relevant articles based on the similarity to the IN. Assuming V is the set of all articles and U is the set of articles that have never been recommended, U is defined as U=V−IN−EX. Therefore, the sum of similarity scores represents the similarity between an article v with article(s) x in IN (see
Calculation of the similarity between articles.
In the formula,
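The iterative recommendation loop described above can be sketched as follows. The function names and data structures are ours, and the similarity scores are assumed to be precomputed from the article network; `decide` plays the role of the human reviewer, here backed by a completed review's known decisions.

```python
def recommend_next(unscreened, included, sim):
    """Pick the unscreened article with the largest summed similarity to IN.

    unscreened: set of article ids in U = V - IN - EX
    included:   list of article ids classified as included (IN)
    sim:        dict mapping frozenset({a, b}) -> precomputed similarity
    """
    return max(unscreened,
               key=lambda v: sum(sim.get(frozenset({v, x}), 0.0) for x in included))

def simulate(articles, first, decide, sim):
    """Simulated screening: classify one article per iteration, then
    recommend the next one by similarity to the current IN list."""
    included, excluded = [], []
    unscreened, article = set(articles), first
    while True:
        unscreened.discard(article)
        (included if decide(article) else excluded).append(article)
        if not unscreened:
            return included, excluded
        article = recommend_next(unscreened, included, sim)
```

With four toy articles where "a" and "b" are relevant and highly similar, the loop recommends "b" immediately after "a" is included.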
To evaluate our performance, we used two performance measures: work saved over sampling at 95% recall (WSS95) and F-measure. These measures are commonly used for evaluating similar work [
Simulated interactive recommendation process.
WSS95 is a performance measure first proposed by Cohen [
TP is the number of true positives (relevant articles correctly included), TN is the number of true negatives (irrelevant articles correctly excluded), FN is the number of false negatives (relevant articles incorrectly excluded), and N is the total number of articles in each report.
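Using these counts, WSS can be computed as (TN + FN)/N − (1 − recall), ie, the fraction of articles the reviewer does not need to screen minus what random sampling would save at the same recall. A minimal sketch (function names are ours):

```python
def wss(tn, fn, n, recall):
    """WSS = (TN + FN) / N - (1 - recall)."""
    return (tn + fn) / n - (1.0 - recall)

def wss95(tn, fn, n):
    """WSS evaluated at the conventional 95% recall threshold."""
    return wss(tn, fn, n, 0.95)
```

For example, if 805 of 1000 articles (TN + FN) need not be screened to reach 95% recall, WSS95 is 0.755.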
Formulas of precision, recall, and F1.
F-measure is a measure of information retrieval accuracy. It considers both precision and recall and commonly combines them into a weighted harmonic mean. When they are weighted equally, the balanced F-measure is also called F1, where it reaches its best value at 1 and the worst value at 0. As a general measure of accuracy, F1 has been widely used in previous works for the evaluation of classification performance, such as Cohen 2006 [
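The balanced F-measure can be computed directly from the confusion-matrix counts; a minimal sketch (the zero-division guards are a defensive choice of ours, not part of the source):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their balanced harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```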
Since F1 changes dynamically over the course of the recommendation process, we can detect the highest F1 from the steepness of the performance curve. That is, if higher F1 scores occur during the early stage of the recommendation process (ie, before 50% of articles are screened), we are more likely to save more workload (high accuracy). We use F1 to evaluate how accurately and how quickly we can recommend the relevant articles.
Formulas of WSS and WSS95.
The single MEDLINE element performance results are shown in
Single element WSS95 performance.
SR report topic | TI | AB | PT | AU | MH |
ACE inhibitors | 76.49 | 71.07 | 33.22 | 0 | 47.37 |
ADHD | 80.26 | 65.10 | 22.56 | 0 | 47.00 |
Antihistamines | 13.55 | 15.81 | 32.58 | 0 | 2.58 |
Atypical antipsychotics | 17.23 | 20.54 | 19.64 | 0 | 9.46 |
Beta blockers | 44.74 | 49.95 | 43.77 | 0 | 28.67 |
Calcium channel blockers | 19.38 | 16.34 | 18.64 | 0 | 20.94 |
Estrogens | 29.35 | 29.08 | 17.93 | 0 | 38.59 |
NSAIDS | 63.36 | 66.67 | 58.27 | 0 | 33.84 |
Opioids | 8.30 | 9.82 | 37.23 | 0 | 6.48 |
Oral hypoglycemics | 11.73 | 12.13 | 22.27 | 0 | 7.55 |
Proton pump inhibitors | 43.74 | 15.60 | 35.48 | 0 | 20.56 |
Skeletal muscle relaxants | 0 | 36.03 | 74.68 | 0 | 42.85 |
Statins | 25.52 | 30.17 | 13.31 | 0 | 13.68 |
Triptans | 45.60 | 42.47 | 28.17 | 0 | 33.23 |
Urinary incontinence | 30.89 | 18.65 | 43.43 | 0 | 26.91 |
Average WSS95 | 34.01 | 33.30 | 33.41 | 0 | 25.31 |
Single element F1 performance; percentage of articles screened at F1.
SR report topic | TI | AB | PT | AU | MH
ACE inhibitors | 0.3444 (4) | 0.3121 (4) | 0.2182 (<1) | 0.1872 (6) | 0.2368 (1)
ADHD | 0.2885 (10) | 0.3824 (6) | 0.2963 (<1) | 0.0909 (<1) | 0.5556 (4)
Antihistamines | 0.2593 (12) | 0.4000 (3) | 0.2759 (<1) | 0.1111 (<1) | 0.3333 (3)
Atypical antipsychotics | 0.3447 (26) | 0.4248 (14) | 0.4363 (5, 12a) | 0.0135 (<1) | 0.3113 (40)
Beta blockers | 0.1972 (1) | 0.2710 (5) | 0.2105 (<1) | 0.0417 (<1) | 0.0957 (19)
Calcium channel blockers | 0.2026 (10) | 0.2672 (11) | 0.2662 (15) | 0.1261 (9) | 0.2579 (2)
Estrogens | 0.5140 (36) | 0.5612 (29) | 0.4937 (18) | 0.0244 (<1) | 0.5536 (39)
NSAIDS | 0.4368 (34) | 0.5870 (13) | 0.6761 (8) | 0.4853 (24) | 0.3650 (24)
Opioids | 0.2727 (<1) | 0.1429 (<1) | 0.2222 (<1) | 0.1111 (<1) | 0.2500 (<1)
Oral hypoglycemics | 0.4509 (88) | 0.4603 (76) | 0.5019 (78) | 0.0145 (<1) | 0.4527 (53)
Proton pump inhibitors | 0.3333 (1) | 0.3860 (5) | 0.1299 (42) | 0.0377 (<1) | 0.1775 (25)
Skeletal muscle relaxants | 0.1429 (<1) | 0.1981 (<1) | 0.2286 (2) | 0.1429 (<1) | 0.2222 (<1)
Statins | 0.2278 (6) | 0.2479 (1) | 0.4019 (4) | 0.1484 (12) | 0.1563 (1)
Triptans | 0.1739 (10) | 0.3600 (4) | 0.2569 (13) | 0.0690 (<1) | 0.2750 (8)
Urinary incontinence | 0.3697 (24) | 0.5243 (19) | 0.5405 (10) | 0.4444 (13) | 0.4317 (30)
Averageb | 0.3039 (18) | 0.3683 (13) | 0.3437 (14) | 0.1365 (5) | 0.3116 (17)
aBoth 5% and 12% have F1 = 0.4363. The average of 5% and 12% (8.5%) is taken to calculate the average value on the last row of the table.
b<1% is considered as 1% for calculating the average percentage.
Different MEDLINE elements play different roles in the systematic review process, and their corresponding performance varied greatly, as described above. To further explore their predictive abilities, we examined their collaborative performance. In total, we examined 22 combinations and selected the 6 with the best WSS95 performance (see
We also conducted statistical analysis with repeated-measures ANOVA on the composite element performance. For WSS95, the results show no statistically significant difference in WSS95 performance across the 6 combinations (
In summary, we found that the predictive ability of MEDLINE elements varies according to systematic review topics. Overall, TI and PT have better WSS95 performance on average but are not statistically different. AB has the best average F1 scores and is statistically better than TI.
WSS95 of the top 6 combinations.
SR report topic | TI+AB | TI+AB+MH | TI+AB+PT | TI+AB+AU | TI+AB+PT+AU | TI+AB+MH+PT+AU
ACE inhibitors | 76.38 | 76.85 | 74.29 | 75.79 | 73.70 | 75.08 |
ADHD | 80.38 | 79.79 | 67.92 | 80.14 | 67.92 | 56.17 |
Antihistamines | 16.13 | 10.65 | 24.52 | 16.13 | 24.52 | 18.39 |
Atypical antipsychotics | 20.89 | 14.20 | 17.95 | 20.63 | 17.77 | 14.38 |
Beta blockers | 60.14 | 60.09 | 65.01 | 60.96 | 64.72 | 65.21 |
Calcium channel blockers | 18.23 | 18.64 | 17.32 | 18.39 | 17.49 | 22.82 |
Estrogens | 33.42 | 36.14 | 22.55 | 33.97 | 22.55 | 29.08 |
NSAIDS | 72.26 | 75.57 | 77.35 | 70.48 | 76.34 | 77.86 |
Opioids | 6.01 | 11.75 | 8.98 | 5.95 | 8.98 | 12.17 |
Oral hypoglycemics | 11.33 | 13.12 | 13.52 | 11.13 | 13.52 | 12.72 |
Proton pump inhibitors | 19.20 | 21.31 | 19.65 | 19.05 | 19.65 | 20.11 |
Skeletal muscle relaxants | 41.94 | 46.44 | 58.55 | 41.87 | 58.49 | 60.01 |
Statins | 29.10 | 27.11 | 27.80 | 30.96 | 27.71 | 26.07 |
Triptans | 48.29 | 51.71 | 39.64 | 50.52 | 39.79 | 40.98 |
Urinary incontinence | 12.84 | 11.01 | 20.80 | 12.84 | 20.80 | 14.37 |
Average | 36.44 | 36.96 | 37.06 | 36.59 | 36.93 | 36.35 |
F1 of the top 6 combinations.
SR report topic | TI+AB | TI+AB+MH | TI+AB+PT | TI+AB+AU | TI+AB+PT+AU | TI+AB+MH+PT+AU
ACE inhibitors | 0.4156 (1) | 0.4000 (2) | 0.4051 (1) | 0.3902 (2) | 0.3971 (4) | 0.3774 (3)
ADHD | 0.4000 (3) | 0.4688 (5) | 0.5455 (4) | 0.4286 (6) | 0.5306 (3) | 0.5818 (4)
Antihistamines | 0.3226 (5) | 0.3333 (10) | 0.2903 (15) | 0.3226 (5) | 0.2903 (15) | 0.2813 (15)
Atypical antipsychotics | 0.4364 (16) | 0.4241 (15) | 0.4887 (15) | 0.4411 (17) | 0.4856 (15) | 0.4606 (15)
Beta blockers | 0.2800 (3) | 0.3043 (2) | 0.3590 (2) | 0.2667 (3) | 0.3596 (2) | 0.3333 (3)
Calcium channel blockers | 0.2335 (8) | 0.2620 (11) | 0.2804 (9) | 0.2323 (8) | 0.2816 (9) | 0.2995 (9)
Estrogens | 0.6000 (30) | 0.6237 (29) | 0.6047 (25) | 0.5979 (31) | 0.6118 (24) | 0.6171 (26)
NSAIDS | 0.6667 (16) | 0.6154 (16) | 0.6966 (12) | 0.6471 (16) | 0.6809 (13) | 0.6667 (15)
Opioids | 0.3000 (0) | 0.3158 (<1) | 0.3000 (<1) | 0.3000 (<1) | 0.3000 (<1) | 0.3158 (<1)
Oral hypoglycemics | 0.4497 (90) | 0.4541 (88) | 0.4553 (86) | 0.4489 (92) | 0.4561 (75) | 0.4635 (82)
Proton pump inhibitors | 0.4384 (7) | 0.4737 (5) | 0.5172 (5) | 0.4552 (7) | 0.5455 (5) | 0.5079 (6)
Skeletal muscle relaxants | 0.2222 (1) | 0.2353 (<1) | 0.2500 (<1) | 0.2222 (1) | 0.2500 (<1) | 0.2667 (<1)
Statins | 0.2994 (2) | 0.3281 (1) | 0.3382 (1) | 0.2959 (2) | 0.3358 (2) | 0.3465 (1)
Triptans | 0.3636 (3) | 0.3913 (3) | 0.3556 (3) | 0.3556 (3) | 0.3529 (4) | 0.3913 (3)
Urinary incontinence | 0.5063 (12) | 0.5347 (19) | 0.5505 (21) | 0.5263 (11) | 0.5507 (9) | 0.5843 (15)
Averagea | 0.3956 (13) | 0.4110 (14) | 0.4291 (14) | 0.3954 (14) | 0.4286 (12) | 0.4329 (13)
a<1% is considered as 1% for calculating the average percentage.
Here we also compared our WSS95 performance with existing machine learning model training approaches (we were not able to compare F1 performance, as it was not reported). Since TI+AB+PT is the simplest combination whose performance is equivalent to or better than that of the others, we chose TI+AB+PT (weight setting = 1:1:1) to compare against existing machine learning model training approaches, including the voting perceptron-based automated citation classification system (VP), factorized complement naïve Bayes with weight engineering (FCNB/WE), and the support vector machine (SVM) (
WSS95 comparison with the Cohen and Matwin systems across 15 SR topics.
SR report topic | Cohen 2006 (VP) [ | Cohen 2008 (FCNB/WE) [ | Matwin 2010 (SVM) [ | Our study (TI+AB+PT)
ACE inhibitors | 56.61 | 73.30 | 52.30 | 74.29 |
ADHD | 67.95 | 52.60 | 62.20 | 67.92 |
Antihistamines | 0 | 23.60 | 14.90 | 24.52 |
Atypical antipsychotics | 14.11 | 17.00 | 20.60 | 17.95 |
Beta blockers | 28.44 | 46.50 | 36.70 | 65.01 |
Calcium channel blockers | 12.21 | 43.00 | 23.40 | 17.32 |
Estrogens | 18.34 | 41.40 | 37.50 | 22.55 |
NSAIDS | 49.67 | 67.20 | 52.80 | 77.35 |
Opioids | 13.32 | 36.40 | 55.40 | 8.98 |
Oral hypoglycemics | 8.96 | 13.60 | 8.50 | 13.52 |
Proton pump inhibitors | 27.68 | 32.80 | 22.90 | 19.65 |
Skeletal muscle relaxants | 0 | 37.40 | 26.50 | 58.55 |
Statins | 24.71 | 49.10 | 31.50 | 27.80 |
Triptans | 3.37 | 34.60 | 27.40 | 39.64 |
Urinary incontinence | 26.14 | 43.20 | 29.60 | 20.80 |
Average | 23.43 | 40.80 | 33.50 | 37.06 |
aVP: voting Perceptron-based automated citation classification system
bFCNB/WE: factorized complement naïve Bayes with weight engineering
cSVM: support vector machine
The repeated-measures ANOVA test shows significant differences across the four studies (
We were not able to compare side by side with the Wallace group [
The P values of pairwise comparison of four studies.
Study | Cohen 2006 | Cohen 2008 | Matwin 2010 | Our study (TI+AB+PT)
Cohen 2006 | — | 0.0012 | 0.0433 | 0.0475 |
Cohen 2008 | 0.0012 | — | 0.0649 | 0.4979 |
Matwin 2010 | 0.0433 | 0.0649 | — | 0.4979 |
Our study | 0.0475 | 0.4979 | 0.4979 | — |
Since different systematic reviews have diverse scopes (for example, one may require sufficient study information from an AB while another may have strict criteria on PT), we were interested in whether different weight parameters would alter the performance. We conducted experiments with different weight settings (eg, TI:PT:AU=3:1:2, TI:PT:AU=2:2:1, TI:PT:AU=3:2:1). The results revealed that when one element’s weight was increased to achieve higher performance on some reports, other reports suffered performance degradation. Overall, we could not find a universal weight setting that benefited all reports, which may be explained in part by the diverse scopes captured in different systematic reviews. In addition, although some weighted combinations bring better global performance (ie, average WSS95 among the 15 reports), the enhancement over the baseline (all elements in the combination equally weighted) is limited. For example, consider the combination TI+PT+AU: the baseline performance (TI:PT:AU=1:1:1), evaluated by average WSS95, is 35.45%, while the weighted setting (TI:PT:AU=3:1:2) reaches 37.30%, a gain of less than 2 percentage points.
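A weighted combination of per-element similarities can be sketched as below. The dictionary shapes and the normalization by the weight total are illustrative choices of ours (normalizing makes proportional settings such as 1:1:1 and 2:2:2 behave identically); the source does not specify this exact form.

```python
def combined_similarity(element_sims, weights):
    """Weighted combination of per-element similarity scores.

    element_sims: per-element scores, eg {"TI": 0.42, "PT": 0.10, "AU": 0.0}
    weights:      weight setting, eg {"TI": 3, "PT": 1, "AU": 2}  # TI:PT:AU = 3:1:2
    """
    total = sum(weights.values())
    # Elements missing from element_sims contribute 0 to the weighted sum.
    return sum(w * element_sims.get(e, 0.0) for e, w in weights.items()) / total
```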
During our experiments, we also discovered inconsistencies in performance with respect to WSS95 and F1. For example, some combinations had high F1 performance with low WSS95 and vice versa. We examined the recall performance during the entire recommendation process.
Proton Pump Inhibitors recall performance during the recommendation process using two different element combinations.
Because different systematic reviews have different scopes, we could not identify one universal weight setting that could be successfully applied to every systematic review. A similar idea was mentioned in Matwin’s work [
To date, work on biomedical text classification for reducing systematic review workload has mainly used machine learning model training approaches. Naïve Bayes and SVM are two widely applied machine learning algorithms. Although these approaches provide excellent text classification performance on specific systematic review topics, applying the trained models to new systematic review topics is a challenge. Constructing training models can also be time consuming [
To overcome the limitations mentioned above, we provide a generalizable approach that can be easily deployed to facilitate any systematic review. Moreover, because the article network supplies precomputed similarity relationships between articles, the iterative interactive recommendation step takes almost no time. Constructing an article network currently takes several seconds to several minutes for 300 to 3500 articles, while the recommendation step itself is real time. This processing time is reasonable given that building the network requires computing pairwise similarities, a polynomial-time (rather than linear-time) operation. In our study, the backend programs that compute the similarity matrices are written in C/C++ for efficiency. We also plan to improve response times for larger systematic reviews that may contain 10,000 articles or more. Most importantly, our approach can be applied to any systematic review topic, and nontechnical human reviewers can use it with ease.
This study only uses 15 DERP reports for evaluation. Although it is our assumption that our approach will be applicable globally, datasets from other systematic review teams are needed to further demonstrate our hypothesis. Our future plans include collaborating with other systematic review teams.
As discussed in the Methods section, different article elements have different predictive abilities under the WSS95 and F1 evaluation schemes. With better F1 scores but lower WSS95, combinations containing AB or MH are more likely to perform well at the beginning but have difficulty reaching 100% recall. On the other hand, although the combination of TI, PT, and AU achieves better overall workload savings, its recall rises slowly (low accuracy) at the beginning of the recommendation process. This inspires us to use multiple weight settings and take advantage of different article element strengths during different recommendation phases (early, mid, and late). We plan to implement automatic detection and adjustment for when the information from the current elements has been exhausted, which signals the time to alter the combination of elements and weight parameters. For instance, when a series of N recommended articles is classified as excluded by human reviewers, we take it as a signal for adjustment because the current setting can no longer provide good recommendations. Another example is to first apply the combination of AB and MH, as they provide high accuracy in the early recommendation stage, and then automatically switch to the combination of TI, PT, and AU in the subsequent phase. Further research is also necessary to investigate proper adjustment of weight parameters under different conditions.
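The exhaustion signal described above, a run of N consecutive exclusions, could be sketched as follows; the function name and the default threshold are hypothetical.

```python
def should_adjust(recent_included_flags, n=5):
    """Signal a change of element combination or weights once the last n
    recommended articles were all excluded (n is a tunable threshold).

    recent_included_flags: list of booleans, True if the article was included,
    in the order the articles were recommended.
    """
    recent = recent_included_flags[-n:]
    return len(recent) == n and not any(recent)
```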
In the near future, we will also provide visualized article networks in which relationships between articles can be intuitively represented and comprehended by humans. Network-based analysis will be conducted, and network metrics such as graph diameter, centrality, and module classes (by community detection) will be reported. Such visualizations have the potential to enable the identification of clusters of articles and knowledge gaps in a targeted area. Lower density in such network visualizations could also indicate fewer related published articles, and higher density the opposite.
We demonstrated a new approach to assist the systematic review article screening process. We established article networks based on article similarity that facilitate the process of interactive article recommendation. We calculated article similarities using MEDLINE elements and examined the predictive ability of the MEDLINE element(s). We found that TI and PT have the best WSS95 performance, and AB and PT provide the best F1 scores during the early stage of the recommendation process. However, no statistical difference was found.
Using our approach, we are able to achieve an average of 37% WSS95 with an equally weighted combination of TI, AB, and PT. The statistical analysis also demonstrated that it is competitive with existing approaches. Based on findings and lessons learned from this study, we are currently deploying the approach into a prototype public online system, ArticleNet, to assist the article screening process.
AB: abstract
ANOVA: analysis of variance
AU: author
DERP: Drug Effectiveness Review Project
EBM: evidence-based medicine
EX: excluded
F1: F-measure
FCNB/WE: factorized complement naïve Bayes with weight engineering
IN: included
MH: MeSH, or medical subject heading
PMID: PubMed Identifier
PT: publication type
SVM: support vector machine
TI: title
VP: voting perceptron-based automated citation classification system
WSS95: work saved over sampling at 95% recall
Special thanks to Dr Soledad Fernandez who helped verify our statistical results. The authors also thank Dr Marian McDonagh for her suggestions for the project.
None declared.