Abstract
Background: Over the last decade, natural language processing (NLP) has provided various solutions for information extraction (IE) from textual clinical data. In recent years, the use of NLP in cancer research has gained considerable attention, with numerous studies exploring the effectiveness of various NLP techniques for identifying and extracting cancer-related entities from clinical text data.
Objective: We aimed to summarize the performance differences between various NLP models for IE within the context of cancer to provide an overview of the relative performance of existing models.
Methods: This systematic literature review was conducted using 3 databases (PubMed, Scopus, and Web of Science) to search for articles extracting cancer-related entities from clinical texts. In total, 33 articles were eligible for inclusion. We extracted NLP models and their performance by F1-scores. Each model was categorized into the following categories: rule-based, traditional machine learning, conditional random field-based, neural network, and bidirectional transformer (BT). The average of the performance difference for each combination of categorizations was calculated across all articles.
Results: The articles covered various scenarios, with the best performance for each article ranging from 0.355 to 0.985 in F1-score. Examining the overall relative performances, the BT category outperformed every other category (average F1-score differences ranging from 0.0439 to 0.2335). The percentage of articles implementing BTs has increased over the years.
Conclusions: NLP has demonstrated the ability to identify and extract cancer-related entities from unstructured textual data. Generally, more advanced models outperform less advanced ones. The BT category performed the best.
doi:10.2196/68707
Introduction
Electronic health records (EHRs) are increasingly being adopted by health care providers worldwide, as they offer numerous benefits [ ]. This has led to an increase in the quantity of data stored in EHRs, consisting of both structured and unstructured data (eg, text, images, and time series). Unstructured textual data from discharge summaries, radiology reports, clinical notes, and patient histories provide valuable information about patients that may not be captured by structured data alone [ ]. The extraction of clinical parameters from unstructured textual data, also known as information extraction (IE), has proven valuable in health care, for example, in clinical research (eg, epidemiology) and decision support systems [ , ].

However, working with unstructured textual data presents several challenges to health care providers and researchers. The volume of free text makes manual extraction and analysis time-consuming and resource-intensive, limiting its utility and requiring automated solutions. Moreover, the lack of standardization and consistency in formatting and terminology makes it difficult to accurately identify and extract the relevant information in an automated manner. Furthermore, free text is prone to spelling errors, resulting in inaccurate or harder-to-find patient information for methods that rely on keyword extraction or other exact-match search techniques.
Natural language processing (NLP) techniques are well suited for extracting information from free text because of their ability to process, comprehend, and generate human language in a manner that allows for the automatic extraction of structured information. In recent years, NLP has gained considerable attention, with numerous studies exploring the effectiveness of various NLP techniques, notably for identifying and extracting cancer-related entities, such as smoking history [ ], toxicities [ ], and Gleason scores [ ], which are only recorded as free text in clinical notes. These techniques are known as named entity recognition or, more generally, IE [ - ].

A variety of techniques and pipelines have been developed for IE from medical free text, ranging from simple rule-based solutions to advanced machine learning approaches. Rule-based solutions allow domain experts to define a set of linguistic rules and patterns that identify and extract relevant information from clinical notes and medical free-form texts. Studies have shown that rule-based approaches can outperform machine learning models [ - ]. However, rule-based approaches are custom-made for specific datasets and use cases, require manual specification of rules by medical experts, and are therefore difficult to generalize [ ]. In contrast, machine learning offers multiple methodologies for different IE problems, solved by training traditional machine learning models such as support vector machines, decision trees, and neural networks (NNs). Recently, bidirectional transformers (BTs), such as large language models (LLMs), have been identified as a possible tool for IE because of their strengths in pattern recognition, text summarization, and generation [ ]. LLMs are pretrained on large text corpora, which enables them to learn linguistic patterns applicable to different IE tasks, and they show promising results on specific tasks after domain-specific fine-tuning of the pretrained models [ ].

Over the years, numerous models have been developed for various IE tasks, but their relative performance across different datasets remains largely unknown. Both rule-based and machine learning approaches often exhibit limited generalizability between tasks, and their results are frequently inconsistent when applied to different datasets. This inconsistency highlights the need for studies that investigate and compare the relative performance of these models.
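To illustrate the rule-based approach described above, the following is a minimal sketch of pattern-based extraction of Gleason scores (one of the cancer-related entities mentioned earlier). The regular expression, function name, and report text are hypothetical; rule sets in the reviewed studies are far more extensive and tuned to local reporting conventions.

```python
import re

# Hypothetical pattern for Gleason score mentions such as
# "Gleason score 3+4=7" or "Gleason 4+3"; real rule sets cover
# many more phrasings and local report conventions.
GLEASON_PATTERN = re.compile(
    r"gleason(?:\s+score)?\s*:?\s*(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)

def extract_gleason(report_text: str) -> list[dict]:
    """Extract primary/secondary Gleason patterns and the total score."""
    entities = []
    for match in GLEASON_PATTERN.finditer(report_text):
        primary, secondary = int(match.group(1)), int(match.group(2))
        total = int(match.group(3)) if match.group(3) else primary + secondary
        entities.append({"primary": primary, "secondary": secondary, "total": total})
    return entities

print(extract_gleason("Biopsy: Gleason score 3+4=7, prostatic adenocarcinoma."))
# [{'primary': 3, 'secondary': 4, 'total': 7}]
```

The same sketch also shows the fragility noted above: a misspelling such as "Gleeson" or an unanticipated phrasing would silently escape the pattern, which is one reason rule-based systems are hard to generalize.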
To the best of our knowledge, no review has been conducted that summarizes the differences in the performance of various types of NLP models for IE within the context of cancer. This review provides an overview of the various NLP methods used for IE and compares them in terms of their performance.
Methods
This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
We searched 3 databases—PubMed, Scopus, and Web of Science—for relevant literature published between January 1, 2014, and April 19, 2024 (the full search strategy is provided in the multimedia appendix). The following search criteria were used for titles and abstracts:

(“information extraction” OR “natural language processing” OR “nlp”) AND (EHR OR notes OR reports) AND (cancer OR tumor OR oncology)
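As an illustration only, the following sketch shows how the PubMed arm of such a search could be executed programmatically through NCBI's E-utilities via Biopython. The field tags, date parameters, and email address are our assumptions for demonstration, not the authors' actual retrieval procedure.

```python
from Bio import Entrez  # Biopython

Entrez.email = "reviewer@example.org"  # placeholder; NCBI requires a contact address

# Approximate translation of the review's title/abstract search string.
query = (
    '("information extraction"[Title/Abstract] OR "natural language processing"[Title/Abstract] '
    'OR "nlp"[Title/Abstract]) '
    'AND (EHR[Title/Abstract] OR notes[Title/Abstract] OR reports[Title/Abstract]) '
    'AND (cancer[Title/Abstract] OR tumor[Title/Abstract] OR oncology[Title/Abstract])'
)

handle = Entrez.esearch(
    db="pubmed", term=query,
    datetype="pdat", mindate="2014/01/01", maxdate="2024/04/19",
    retmax=0,  # we only need the hit count here
)
print(Entrez.read(handle)["Count"])  # number of matching PubMed records
```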
The inclusion criterion for articles was the application of 2 or more NLP models to extract identical cancer-related entities from the same unstructured medical text in EHRs. Articles were excluded during title and abstract screening if they were (1) reviews; (2) articles that did not compare 2 or more NLP models; (3) articles whose purpose was not to extract information from free text in EHRs; or (4) articles whose purpose was not to develop an NLP cancer-related application.

During the full-text screening, articles were further excluded if they (1) were available as abstract only; (2) performed text classification without cancer entity extraction; (3) reported results that were not comparable; (4) developed no NLP application; (5) were not related to cancer IE from EHRs; or (6) contained no comparison with other NLP methods within the article.
Using the exclusion criteria, one author (SCD) performed 2 rounds of article selection: title and abstract screening, followed by a full-text review. A second reviewer (CV) was consulted for unclear cases during the screening.
Data from each of the included articles were extracted by 2 authors (SCD and CV). Both authors independently categorized the NLP models and extracted their performance metrics. Any discrepancies in categorization were resolved through consensus guided by consideration of the primary architectural components of the model. Each model was categorized into the following groups: rule-based, traditional machine learning (ML), conditional random field (CRF)–based, NN, and BT.
The rule-based category includes IE models that use regular expressions (Regex), keywords, and dictionary matching. The CRF-based category includes linear CRFs, except bidirectional long short-term memory (BiLSTM)-CRF, which is in the NN category. The NN category includes NNs, except for BTs, which belong to the BT category. Ensemble models are categorized according to their most advanced component; for example, a rule-based model combined with a BT is categorized as a BT model (a minimal sketch of this precedence rule follows the table below). For articles that included both strict and relaxed keyword matching, the strict F1-scores were extracted as the performance metric. For articles presenting both macro- and micro-averaged F1-scores, macro-averaged F1-scores were extracted.

Category | Included models | Articles using category | Total number of models implemented
Rule-based | | [ ] (n=11) | 12
CRF-baseda | | [ ] (n=8) | 26
Bidirectional transformer | | [ ] (n=16) | 60
Neural network | | [ ] (n=25) | 83
Traditional machine learning | | [ ] (n=14) | 39
a CRF: conditional random field.
b BERT: Bidirectional Encoder Representations from Transformers.
c BiLSTM: bidirectional long short-term memory.
d SVM: support vector machine.
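The ensemble rule described before the table can be expressed as a simple precedence lookup. The following is a minimal sketch; the exact ordering of the lower categories for ensemble purposes is our assumption, as the text only states that the most advanced component determines the category (eg, rule-based plus BT becomes BT, and BiLSTM-CRF becomes NN).

```python
# Assumed precedence from least to most advanced; only the relative
# positions stated in the text (rule-based < BT, CRF-based < NN) are
# taken from the review itself.
PRECEDENCE = ["rule-based", "ML", "CRF-based", "NN", "BT"]

def categorize_ensemble(components: list[str]) -> str:
    """Assign an ensemble the category of its most advanced component."""
    return max(components, key=PRECEDENCE.index)

print(categorize_ensemble(["rule-based", "BT"]))  # BT
print(categorize_ensemble(["CRF-based", "NN"]))   # NN (the BiLSTM-CRF case)
```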
To calculate the performance differences across the included articles, the following steps were executed for each category.
The best-performing model for each category within each article was selected. The best-performing F1-score within category $c$ for article $a$ is given by

$$P^{\text{best}}_{a,c} = \max_{m \in \{1,\dots,n\}} P_{a,c,m}$$

where $P_{a,c,m}$ is the F1 performance score of method $m$ within category $c$ for article $a$, and $n$ is the number of methods within category $c$.

Having the best-performing model of each category within an article allows for the calculation of the category difference for each combination of categories. The category difference between categories $c_1$ and $c_2$ within article $a$ is given as

$$D_{a,c_1,c_2} = P^{\text{best}}_{a,c_1} - P^{\text{best}}_{a,c_2}$$

where $P^{\text{best}}_{a,c}$ is the best-performing model of category $c$ in article $a$.

All performance differences of the same combination of categories were averaged across all articles. The average category difference over all articles with combination $c_1$ and $c_2$ is given as

$$\bar{D}_{c_1,c_2} = \frac{1}{n} \sum_{a=1}^{n} D_{a,c_1,c_2}$$

where $D_{a,c_1,c_2}$ is the category difference between $c_1$ and $c_2$ in article $a$, and $n$ is the number of articles with that specific category combination.
Statistical significance between categories for each category combination was determined using a t test (P<.05).
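As an illustration of these three steps and the significance test, the following is a minimal sketch with invented toy records; the one-sample t test of per-article differences against zero is our reading of the testing described above, as the actual analysis code was not published with the review.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

from scipy.stats import ttest_1samp

# Toy (article, category, F1) records standing in for the extracted data.
records = [
    ("A1", "NN", 0.88), ("A1", "NN", 0.84), ("A1", "BT", 0.93),
    ("A2", "BT", 0.91), ("A2", "rule-based", 0.73),
    ("A3", "NN", 0.80), ("A3", "BT", 0.86),
]

# Step 1: best-performing model per category within each article.
best = defaultdict(dict)
for article, category, f1 in records:
    best[article][category] = max(best[article].get(category, 0.0), f1)

# Steps 2 and 3: per-article pairwise differences, then averages.
diffs = defaultdict(list)
for scores in best.values():
    for c1, c2 in combinations(sorted(scores), 2):
        diffs[(c1, c2)].append(scores[c1] - scores[c2])

for (c1, c2), d in sorted(diffs.items()):
    # One-sample t test of the per-article differences against zero.
    p = ttest_1samp(d, 0.0).pvalue if len(d) > 1 else float("nan")
    print(f"{c1} vs {c2}: mean diff={mean(d):.4f}, n={len(d)}, P={p:.3f}")
```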
Results
Overview
The article selection process is detailed in a PRISMA flowchart. A total of 2032 articles were identified through searches in Web of Science, Scopus, and PubMed.
In total, 33 articles were included in this review. The articles were published between 2018 and 2024. They compared at least 2 NLP models for cancer-related IE from unstructured medical texts in EHRs. The articles contained a total of 220 implementations of NLP models; selecting only the best-performing model within each category of each article yields 74 implementations.
Models
We categorized each NLP model as rule-based, CRF-based, BT, NN, or ML. The table above shows the categorization of the models, the articles containing each category, and the total number of implemented models within each category.
The most frequently used category was NN, appearing in 25 articles, followed by BT with 16, ML with 14, and rule-based with 11 articles. NN was also the most frequently implemented category, with 83 implementations. The distribution of unique categorizations per year shows the variety of models that have been used throughout the years. Notably, the percentage of articles implementing BTs has increased over the years.
Performance
The performance varied considerably across specific use cases. Inspecting the best-performing models of each article shows that a total of 5 rule-based models performed best in their articles, with F1-scores in the range of 0.73-0.932. ML performed best in only 1 article despite being compared in 14 articles with a total of 39 different model implementations, and CRF-based never performed best. In total, 14 articles showed that NN performed the best, with F1-scores ranging from 0.3539 to 0.985. BT performed the best in 13 articles, with F1-scores ranging from 0.6023 to 0.97. Looking at the raw F1-scores, more advanced models outperformed less advanced ones.

Article | Year | Title | Number of tested models (best-performing F1-score within category)
AAlAbdulsalam et al [ ] | 2018 | Automated extraction and classification of cancer stage mentions from unstructured text fields in a central cancer registry | Rule-based=1 (0.887)a; CRF-based=1 (0.882)
Alawad et al [ ] | 2018 | Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports | ML=1 (0.626); NN=2 (0.752)a
Miao et al [ ] | 2018 | Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches | Rule-based=1 (0.848); CRF-based=1 (0.881); NN=2 (0.904)a
Qiu et al [ ] | 2018 | Deep learning for automated extraction of primary sites from cancer pathology reports | ML=3 (0.640); NN=3 (0.701)a
Chen et al [ ] | 2019 | Using natural language processing to extract clinically useful information from Chinese electronic medical records | Rule-based=1 (0.83)a; CRF-based=1 (0.8)
Coquet et al [ ] | 2019 | Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients | Rule-based=1 (0.897); NN=3 (0.918)a
Dubey et al [ ] | 2019 | Inverse regression for extraction of tumor site from cancer pathology reports | ML=5 (0.759)a; NN=2 (0.701)
Kim et al [ ] | 2019 | A study of medical problem extraction for better disease management | Rule-based=2 (0.883); CRF-based=4 (0.926); NN=5 (0.929)a
Thompson et al [ ] | 2019 | Relevant word order vectorization for improved natural language processing in electronic health records | ML=7 (0.788); NN=7 (0.858)a
Zhang et al [ ] | 2019 | Extracting comprehensive clinical information for breast cancer using deep learning methods | Rule-based=1 (0.484); NN=1 (0.887); CRF-based=1 (0.837); BT=1 (0.935)a
Alawad et al [ ] | 2020 | Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks | ML=2 (0.615); NN=3 (0.752)a
Odisho et al [ ] | 2020 | Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation | ML=4 (0.948); NN=2 (0.972)a
Osborne et al [ ] | 2020 | Identification of cancer entities in clinical text combining transformers with dictionary features | Rule-based=1 (0.73)a; BT=7 (0.7)
Wu et al [ ] | 2020 | Structured information extraction of pathology reports with attention-based graph convolutional network | ML=1 (0.74); NN=6 (0.803)a
Hu et al [ ] | 2021 | Automatic extraction of lung cancer staging information from computed tomography reports: deep learning approach | NN=2 (0.773); BT=1 (0.81)a
Liu et al [ ] | 2021 | Use of BERT (Bidirectional Encoder Representations from Transformers)-based deep learning method for extracting evidences in Chinese radiology reports: development of a computer-aided liver cancer diagnosis framework | CRF-based=1 (0.729); NN=1 (0.832); BT=1 (0.857)a
López-García et al [ ] | 2021 | Detection of tumor morphology mentions in clinical reports in Spanish using transformers | CRF-based=1 (0.794); BT=18 (0.89)a
Lu et al [ ] | 2021 | Natural language processing and machine learning methods to characterize unstructured patient-reported outcomes: validation study | ML=2 (0.365); BT=1 (0.602)a
Park et al [ ] | 2021 | Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity | ML=4 (0.484); NN=5 (0.502)a
Rios et al [ ] | 2021 | Assigning ICD-O-3 codes to pathology reports using neural multi-task training with hierarchical regularization | ML=3 (0.276); NN=12 (0.355)a
Wu et al [ ] | 2021 | BioIE: biomedical information extraction with multi-head attention enhanced graph convolutional network | ML=1 (0.444); NN=5 (0.613)a
Yu et al [ ] | 2021 | A study of social and behavioral determinants of health in lung cancer patients using transformers-based natural language processing models | BT=4 (0.879)a; NN=2 (0.844)
Bozkurt et al [ ] | 2022 | Expanding the secondary use of prostate cancer real world data: automated classifiers for clinical and pathological stage | Rule-based=1 (0.87)a; ML=1 (0.723)
Fang et al [ ] | 2022 | Extracting clinical named entity for pituitary adenomas from Chinese electronic medical records | Rule-based=1 (0.431); CRF-based=16 (0.904); NN=1 (0.899); BT=1 (0.913)a
Hu et al [ ] | 2022 | Using natural language processing and machine learning to preoperatively predict lymph node metastasis for non-small cell lung cancer with electronic medical records: development and validation study | NN=1 (0.701); BT=2 (0.948)a
Pabón et al [ ] | 2022 | Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach | NN=2 (0.788); BT=1 (0.823)a
Zhou et al [ ] | 2022 | CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records | NN=1 (0.834); BT=8 (0.876)a
Ansoborlo et al [ ] | 2023 | Prescreening in oncology trials using medical records. Natural language processing applied on lung cancer multidisciplinary team meeting reports | Rule-based=1 (0.932)a; ML=1 (0.68)
Rohanian et al [ ] | 2023 | Using bottleneck adapters to identify cancer in clinical notes under low-resource constraints | NN=3 (0.83); BT=8 (0.97)a
Seong et al [ ] | 2023 | Deep learning approach to detection of colonoscopic information from unstructured reports | NN=3 (0.985)a; BT=2 (0.982)
Zitu et al [ ] | 2023 | Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records | ML=1 (0.69); NN=2 (0.763); BT=2 (0.778)a
Martín-Noguerol et al [ ] | 2024 | Natural language processing deep learning models for the differential between high-grade gliomas and metastasis: what if the key is how we report them? | NN=3 (0.872)a; BT=1 (0.766)
Hu et al [ ] | 2024 | Zero-shot information extraction from radiological reports using ChatGPT | Rule-based=1 (0.926); BT=2 (0.957)a
a This is the best F1-score for the article.
b CRF: conditional random field.
c ML: traditional machine learning.
d NN: neural network.
e BT: bidirectional transformer.
The table above shows each article and the number of models in each category within the article; parentheses show the best F1-score for each category, and the best F1-score in each article is marked with footnote a.
Some variations between the average F1-score performance differences were observed. Our results show that more advanced models outperform less advanced ones. The largest difference between category F1-scores was observed between the BT category and the rule-based category: BT models were compared with rule-based models in 4 studies, yielding an average performance difference of 0.2335 in F1-score. BT was the best-performing category. NN outperformed the CRF-based, ML, and rule-based categories; CRF-based outperformed rule-based; and rule-based outperformed ML. The only statistically significant difference between categories was observed when comparing rule-based and ML (see the multimedia appendix for P values).
Discussion
Principal Findings
This study provides an overview of the models used for IE in cancer and their performance in terms of the F1-score. By including only articles with 2 or more NLP models for IE, we were able to evaluate the relative performance of each NLP model within the categories: rule-based, CRF-based, BT, NN, and ML.
The search string for this review combined keywords for techniques (IE and NLP), data sources (EHR, notes, and reports), and the domain (cancer, tumor, and oncology) using Boolean operators to limit irrelevant results. The initial yield of 2032 articles suggests a reasonable balance, considering the stringent inclusion criteria. The “AND” clauses effectively narrow the search while still admitting the relevant articles for the screening process. Although our search strategy included articles published from January 1, 2014, no articles published before 2018 were included in the analysis. The reason for this discrepancy was not addressed within the scope of this review, which focused on quantifying performance differences between our categories. Notably, the most frequent reason for full-text exclusion was “no comparison with other NLP methods within the article” (185 articles), which argues for common benchmark testing of implemented NLP models.
Without considering specific datasets or extraction entities, our results show that BT is the best-performing category, followed by NN, CRF-based, rule-based, and ML, in that order. We observed an increasing number of transformer-based models developed in recent years, with promising results. Our results highlight a pivotal moment in which BTs, such as language models, are on the verge of demonstrating their full potential in IE. Although transformers [ ] and BERT [ ] were introduced in 2017 and 2018, respectively, our literature review includes no articles using these technologies until 2019. This delay may reflect the time required for these models to become integrated into clinical research workflows. Surprisingly, rule-based solutions performed better than machine learning [ , ]. One explanation could be that rule-based solutions allow for the implementation of expert knowledge. The lowest-performing articles in terms of F1-score do not aim to demonstrate the best possible extraction method, but rather how F1-scores increase using hierarchical regularization when extracting ICD-O-3 codes [ ]. Similarly, the study by Park et al [ ] aims to show how to increase the F1-score using transfer learning and zero-shot string similarity when the number of annotated pathology reports is limited.

Multiple reviews have been conducted within the scope of NLP in a clinical context, with different aims. The review by Kreimeyer et al [ ] identifies NLP systems capable of processing clinical free text and generating structured output, thereby compiling a list of NLP solutions in use. The review by Datta et al [ ] defines relevant linguistic terms by organizing unstructured clinical text related to cancer into structured data using frame semantics. The review by Bilal et al [ ] examines the current state-of-the-art literature on NLP applications in analyzing EHRs and clinical notes for cancer research, quantifying the number of studies for each cancer type and outlining research challenges and future directions. However, no review has compared the performance of NLP models for IE of cancer-related entities from clinical text, a gap relevant to clinical informatics and crucial for improving the accuracy of cancer-related data extraction within EHRs. This is the first review to summarize and compare the performance of NLP models for IE of different cancer entities from unstructured text, offering insights for clinical researchers focused on leveraging EHR data for cancer care and research.

Strengths
One strength of our study is that it overcomes the challenge posed by low-performing models. By including only articles with 2 or more categories, we could determine the relative performance within each paper while disregarding low-performing models from papers that did not aim to beat state-of-the-art F1-scores. Our review shows how models can be categorized and how the categories perform relative to each other across different datasets and extraction entities. The performance differences observed in the included articles highlight the importance of selecting the appropriate NLP model for each health care application. Our categorization allows all models to be included, even ensemble and hybrid models. Furthermore, our performance calculation uses the best-performing model for each category reported within each included article. This approach allows for the addition of multiple new categories to support the desired level of model performance granularity.
Limitations
A categorization strategy was required to categorize all models. Most models fall into well-defined and distinct categories; however, some could be assigned to multiple categories, notably BiLSTM-CRF models. To present intelligible results, the number of categories had to be kept relatively low, neglecting model specificities. Increasing the number of categories would reduce the number of models per category, making the results too anecdotal, whereas decreasing it would increase the number of times each category pair was compared, making the averaged F1-score differences less distinctive. Ideally, multiple studies would implement the same set of models and categorizations, so that every category is compared with every other and no combination of categories occurs only once.
We selected the F1-score as the performance metric; precision or recall could also have been used, and extracting the specific numbers from the confusion matrix can provide deeper insights. The included studies reported F1-scores as a measure of performance. Although this is a practical way to generate a single performance metric, its use has limitations. In medical IE, one could argue that false negatives are worse than false positives, as they can lead to missed diagnoses or inappropriate treatment decisions, a distinction the F1-score does not capture. While metrics such as the AUC-ROC, the precision-recall tradeoff, or specificity offer complementary insights, their calculation was limited by the inconsistent reporting of the necessary data. Furthermore, given the sensitive nature of EHR data and the need for clinical trust, future research should also prioritize evaluating the interpretability of IE models alongside traditional performance measures, allowing clinicians to understand how cancer-related entities are extracted and validated from EHR data.
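To make the argument about false negatives concrete, the weighting can be expressed with the more general F-beta score, of which the F1-score is the beta=1 special case; beta=2 (F2) weights recall, and thus missed entities, more heavily. The sketch below uses illustrative numbers only; none of the reviewed studies reported F-beta variants.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall (missed entities) more heavily."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustration: a precise but low-recall extractor looks better under F1
# than under the recall-weighted F2.
p, r = 0.90, 0.60
print(f"F1={f_beta(p, r):.3f}, F2={f_beta(p, r, beta=2):.3f}")
# F1=0.720, F2=0.643
```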
Furthermore, the included studies neglected to address the handling of negation and spelling errors. Giorgia et al [ ] showed that negations account for 66% of the errors. Another study stated that BERT fails completely to show a generalizable understanding of negation, raising questions about the aptitude of language models to learn this type of meaning [ ]. In this study, BTs performed well; one could wish for a general approach to analyzing the errors of each model instead of relying on the overall performance derived from the confusion matrix. Negation errors pose a significant challenge in EHR data and are critical in oncology, as a misidentified negated symptom or finding can alter clinical interpretation, treatment planning, and patient care.

Perspectives
The field of IE has evolved rapidly, and models such as LLMs have been successfully applied in the context of cancer IE, both in terms of model performance and operational efficiency [ ]. LLMs could allow for enhanced transferability and utility across different IE tasks on unstructured textual data. Using LLMs for IE on unstructured textual data seems feasible because of the variety of available pretrained models in different versions; some might perform well out of the box or with minor domain-specific fine-tuning [ ]. Generally, the evaluation of LLMs is challenging because of the lack of clarity regarding whether a public benchmark dataset has been used for training. However, when using data from EHRs, it is certain that they have not been used for training a public model.

Conclusions
NLP has demonstrated the ability to identify and extract cancer-related entities from unstructured medical textual data. Generally, most of the reviewed models showed excellent performance in terms of the F1-score, and more advanced models outperformed less advanced ones. The BT category performed the best, followed by NN. The use of BTs has increased in recent years. Rule-based applications for IE remain competitive in terms of performance in this specific context.
Acknowledgments
This study was supported solely by the institutional resources of the Center for Clinical Data Science at Aalborg University Hospital. We acknowledge the use of Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia) for automatic identification and removal of duplicate articles.
Data Availability
The datasets used and analyzed in this study are available upon reasonable request from the corresponding author.
Authors' Contributions
All authors were involved in the conception and design of the study. SCD contributed to the collection and assembly of data. SCD and CV were involved in data analysis and interpretation. All authors contributed to manuscript writing, and all authors approved the final version of the manuscript.
Conflicts of Interest
None declared.
Search strategy.
DOCX File, 12 KB

Statistical significance of t test results.
DOCX File, 14 KB

PRISMA 2020 checklist.
PDF File, 67 KB

References
- Evans RS. Electronic health records: then, now, and in the future. Yearb Med Inform. May 20, 2016;Suppl 1(Suppl 1):S48-S61. [CrossRef] [Medline]
- Ruckdeschel JC, Riley M, Parsatharathy S, et al. Unstructured data are superior to structured data for eliciting quantitative smoking history from the electronic health record. JCO Clin Cancer Inform. Feb 2023;7:e2200155. [CrossRef] [Medline]
- Landolsi MY, Hlaoua L, Ben Romdhane L. Information extraction from electronic medical documents: state of the art and future research directions. Knowl Inf Syst. 2023;65(2):463-516. [CrossRef] [Medline]
- Spasić I, Livsey J, Keane JA, Nenadić G. Text mining of cancer-related information: review of current status and future directions. Int J Med Inform. Sep 2014;83(9):605-623. [CrossRef] [Medline]
- Hong JC, Fairchild AT, Tanksley JP, Palta M, Tenenbaum JD. Natural language processing for abstraction of cancer treatment toxicities: accuracy versus human experts. JAMIA Open. Feb 15, 2021;3(4):513-517. [CrossRef]
- Yu S, Le A, Feld E, et al. A natural language processing-assisted extraction system for Gleason scores: development and usability study. JMIR Cancer. Jul 2, 2021;7(3):e27970. [CrossRef] [Medline]
- Alkaitis MS, Agrawal MN, Riely GJ, Razavi P, Sontag D. Automated NLP extraction of clinical rationale for treatment discontinuation in breast cancer. JCO Clin Cancer Inform. May 2021;5(5):550-560. [CrossRef] [Medline]
- Benson R, Winterton C, Winn M, et al. Leveraging natural language processing to extract features of colorectal polyps from pathology reports for epidemiologic study. JCO Clin Cancer Inform. Jan 2023;7:e2200131. [CrossRef] [Medline]
- Si Y, Roberts K. A frame-based NLP system for cancer-related information extraction. AMIA Annu Symp Proc. 2018;2018:1524-1533. [Medline]
- AAlAbdulsalam AK, Garvin JH, Redd A, Carter ME, Sweeny C, Meystre SM. Automated extraction and classification of cancer stage mentions from unstructured text fields in a central cancer registry. AMIA Jt Summits Transl Sci Proc. 2018;2017:16-25. [Medline]
- Osborne JD, O’Leary T, Monte JD, Sasse K, Liang WH. Identification of cancer entities in clinical text combining transformers with dictionary features. In: CEUR Workshop Proceedings. Vol 2664. 2020:458-467.
- Chen L, Song L, Shao Y, Li D, Ding K. Using natural language processing to extract clinically useful information from Chinese electronic medical records. Int J Med Inform. Apr 2019;124:6-12. [CrossRef] [Medline]
- Bozkurt S, Magnani CJ, Seneviratne MG, Brooks JD, Hernandez-Boussard T. Expanding the secondary use of prostate cancer real world data: automated classifiers for clinical and pathological stage. Front Digit Health. 2022;4:793316. [CrossRef] [Medline]
- Iannantuono GM, Bracken-Clarke D, Floudas CS, Roselli M, Gulley JL, Karzai F. Applications of large language models in cancer care: current evidence and future perspectives. Front Oncol. 2023;13:1268915. [CrossRef] [Medline]
- Zhou S, Wang N, Wang L, Liu H, Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J Am Med Inform Assoc. Jun 14, 2022;29(7):1208-1216. [CrossRef] [Medline]
- Fang A, Hu J, Zhao W, et al. Extracting clinical named entity for pituitary adenomas from Chinese electronic medical records. BMC Med Inform Decis Mak. Mar 23, 2022;22(1):72. [CrossRef] [Medline]
- Zhang X, Zhang Y, Zhang Q, et al. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int J Med Inform. Dec 2019;132:103985. [CrossRef] [Medline]
- Kim Y, Meystre SM. A study of medical problem extraction for better disease management. Stud Health Technol Inform. Aug 21, 2019;264:193-197. [CrossRef] [Medline]
- Coquet J, Bozkurt S, Kan KM, et al. Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients. J Biomed Inform. Jun 2019;94:103184. [CrossRef] [Medline]
- Miao S, Xu T, Wu Y, et al. Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches. Int J Med Inform. Nov 2018;119:17-21. [CrossRef] [Medline]
- Hu D, Liu B, Zhu X, Lu X, Wu N. Zero-shot information extraction from radiological reports using ChatGPT. Int J Med Inform. Mar 2024;183:105321. [CrossRef] [Medline]
- Ansoborlo M, Gaborit C, Grammatico-Guillon L, Cuggia M, Bouzille G. Prescreening in oncology trials using medical records. Natural language processing applied on lung cancer multidisciplinary team meeting reports. Health Informatics J. 2023;29(1):146045822211467. [CrossRef] [Medline]
- López-García G, Jerez JM, Ribelles N, Alba E, Veredas FJ. Detection of tumor morphology mentions in clinical reports in Spanish using transformers. In: Advances in Computational Intelligence. 2021:24-35. [CrossRef] ISBN: 9783030850296
- Liu H, Zhang Z, Xu Y, et al. Use of BERT (Bidirectional Encoder Representations from Transformers)-based deep learning method for extracting evidences in Chinese radiology reports: development of a computer-aided liver cancer diagnosis framework. J Med Internet Res. Jan 12, 2021;23(1):e19689. [CrossRef] [Medline]
- Solarte Pabón O, Montenegro O, Torrente M, Rodríguez González A, Provencio M, Menasalvas E. Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach. PeerJ Comput Sci. 2022;8:e913. [CrossRef] [Medline]
- Yu Z, Yang X, Dang C, et al. A study of social and behavioral determinants of health in lung cancer patients using transformers-based natural language processing models. AMIA Annu Symp Proc. 2021;2021:1225-1233. [Medline]
- Lu Z, Sim JA, Wang JX, et al. Natural language processing and machine learning methods to characterize unstructured patient-reported outcomes: validation study. J Med Internet Res. Nov 3, 2021;23(11):e26777. [CrossRef] [Medline]
- Hu D, Zhang H, Li S, Wang Y, Wu N, Lu X. Automatic extraction of lung cancer staging information from computed tomography reports: deep learning approach. JMIR Med Inform. Jul 21, 2021;9(7):e27955. [CrossRef] [Medline]
- Hu D, Li S, Zhang H, Wu N, Lu X. Using natural language processing and machine learning to preoperatively predict lymph node metastasis for non-small cell lung cancer with electronic medical records: development and validation study. JMIR Med Inform. Apr 25, 2022;10(4):e35475. [CrossRef] [Medline]
- Zitu MM, Zhang S, Owen DH, Chiang C, Li L. Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records. Front Pharmacol. 2023;14:1218679. [CrossRef] [Medline]
- Seong D, Choi YH, Shin SY, Yi BK. Deep learning approach to detection of colonoscopic information from unstructured reports. BMC Med Inform Decis Mak. Feb 7, 2023;23(1):28. [CrossRef] [Medline]
- Rohanian O, Jauncey H, Nouriborji M, et al. Using bottleneck adapters to identify cancer in clinical notes under low-resource constraints. In: Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. Association for Computational Linguistics; 2023:62-78. [CrossRef]
- Martín-Noguerol T, López-Úbeda P, Pons-Escoda A, Luna A. Natural language processing deep learning models for the differential between high-grade gliomas and metastasis: what if the key is how we report them? Eur Radiol. Mar 2024;34(3):2113-2120. [CrossRef] [Medline]
- Thompson J, Hu J, Mudaranthakam DP, et al. Relevant word order vectorization for improved natural language processing in electronic health records. Sci Rep. Jun 25, 2019;9(1):9253. [CrossRef] [Medline]
- Qiu JX, Yoon HJ, Fearn PA, Tourassi GD. Deep learning for automated extraction of primary sites from cancer pathology reports. IEEE J Biomed Health Inform. Jan 2018;22(1):244-251. [CrossRef] [Medline]
- Alawad M, Yoon HJ, Tourassi GD. Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports. Presented at: 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI); Mar 4-7, 2018; Las Vegas, NV, USA. [CrossRef]
- Dubey AK, Yoon HJ, Tourassi GD. Inverse regression for extraction of tumor site from cancer pathology reports. Presented at: 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI); May 19-22, 2019; Chicago, IL, USA. [CrossRef]
- Wu J, Tang K, Zhang H, Wang C, Li C. Structured information extraction of pathology reports with attention-based graph convolutional network. Presented at: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Dec 16-19, 2020; Seoul, South Korea. [CrossRef]
- Odisho AY, Park B, Altieri N, et al. Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation. JAMIA Open. Oct 2020;3(3):431-438. [CrossRef] [Medline]
- Alawad M, Gao S, Qiu JX, et al. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. J Am Med Inform Assoc. Jan 1, 2020;27(1):89-98. [CrossRef] [Medline]
- Wu J, Zhang R, Gong T, Liu Y, Wang C, Li C. BioIE: biomedical information extraction with multi-head attention enhanced graph convolutional network. Presented at: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Dec 9-12, 2021; Houston, TX, USA. [CrossRef]
- Rios A, Durbin EB, Hands I, Kavuluru R. Assigning ICD-O-3 codes to pathology reports using neural multi-task training with hierarchical regularization. Presented at: BCB '21; 2021; Gainesville, FL, USA. ACM; 2021:1-10. [CrossRef]
- Park B, Altieri N, DeNero J, Odisho AY, Yu B. Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity. JAMIA Open. Jul 2021;4(3):ooab085. [CrossRef] [Medline]
- Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. arXiv. Preprint posted online on Jun 12, 2017. [CrossRef]
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. Preprint posted online on Oct 11, 2018. [CrossRef]
- Kreimeyer K, Foster M, Pandey A, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. Sep 2017;73:14-29. [CrossRef] [Medline]
- Datta S, Bernstam EV, Roberts K. A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. J Biomed Inform. Dec 2019;100:103301. [CrossRef] [Medline]
- Bilal M, Hamza A, Malik N. NLP for analyzing electronic health records and clinical notes in cancer research: a review. J Pain Symptom Manage. May 2025;69(5):e374-e394. [CrossRef] [Medline]
- Giorgia T, Johannes CS, Gerasimos S. A study of BERT’s processing of negations to determine sentiment. BNAIC/BeneLearn; 2021.
- Ettinger A. What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Trans Assoc Comput Linguist. Dec 2020;8:34-48. [CrossRef]
- Choi HS, Song JY, Shin KH, Chang JH, Jang BS. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat Oncol J. Sep 2023;41(3):209-216. [CrossRef] [Medline]
Abbreviations
BERT: Bidirectional Encoder Representations from Transformers
BT: bidirectional transformer
CRF: conditional random field
EHR: electronic health record
IE: information extraction
LLM: large language model
ML: machine learning
NLP: natural language processing
NN: neural network
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
Edited by Alexandre Castonguay; submitted 12.11.24; peer-reviewed by Akhil Chaturvedi, Dillon Chrimes; final revised version received 16.06.25; accepted 17.06.25; published 12.09.25.
Copyright© Simon Dahl, Martin Bøgsted, Tomer Sagi, Charles Vesteghem. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 12.9.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.