Abstract
Background: Prediction models have demonstrated a range of applications across medicine, including the use of electronic health record (EHR) data to identify patients at risk of hospital readmission and mortality. Large language models (LLMs) can transform unstructured EHR text into structured features, which can then be integrated into statistical prediction models, helping ensure that the results are both clinically meaningful and interpretable.
Objective: This study aims to compare the classification decisions made by clinical experts with those generated by a state-of-the-art LLM, using terms extracted from a large EHR data set of individuals with mental health disorders seen in emergency departments (EDs).
Methods: Using a data set from the EHR systems of more than 50 health care provider organizations in the United States from 2016 to 2021, we extracted all clinical terms that appeared in at least 1000 records of individuals admitted to the ED for a mental health–related problem from a source population of over 6 million ED episodes. Two experienced mental health clinicians (one medically trained psychiatrist and one clinical psychologist) reached consensus on the classification of EHR terms and diagnostic codes into categories. We evaluated an LLM’s agreement with clinical judgment across 3 classification tasks: (1) classify terms into “mental health” or “physical health”, (2) classify mental health terms into 1 of 42 prespecified categories, and (3) classify physical health terms into 1 of 19 prespecified broad categories.
Results: There was high agreement between the LLM and clinical experts when categorizing 4553 terms as “mental health” or “physical health” (κ=0.77, 95% CI 0.75-0.80). However, there was still considerable variability in LLM-clinician agreement on the classification of mental health terms (κ=0.62, 95% CI 0.59‐0.66) and physical health terms (κ=0.69, 95% CI 0.67‐0.70).
Conclusions: The LLM displayed high agreement with clinical experts when classifying EHR terms into certain mental health or physical health categories. However, agreement with clinical experts varied considerably within both the mental and physical health category sets. Importantly, LLMs offer an alternative to manual human coding, with great potential to create interpretable features for prediction models.
doi:10.2196/65454
Introduction
Recent advances in health informatics have led to the development of machine learning models that are trained on data from electronic health records (EHRs). These models have proven effective across a range of health domains, including predicting the spread of disease [ ], hospital readmission rates [ ], and suicide risk [ , ]. Predictive models have been implemented in EHR systems to identify high-risk patients and alert clinicians to critical health events [ ].

EHR systems are filled with unstructured text data, including clinical notes and discharge summaries, which are not easily categorized into clinically interpretable groupings for use in predictive models. Although the use of these data can greatly enhance prediction model performance and the interpretability of decision-support tools [ , ], the coding process is labor intensive and requires expert consultation and extensive training [ ]. These challenges hinder the development and scalability of clinical prediction models that incorporate unstructured EHR data [ , ].

Large language models (LLMs), like OpenAI’s GPT models, can streamline the classification and coding of unstructured EHR text due to their massive training data sets and advanced text processing [ , ]. LLMs have been used to categorize unstructured text from EHR systems [ ], assist with qualitative analysis [ , ], and perform deductive coding with and without context [ ]. Preliminary evidence shows that LLMs outperform crowd workers in the annotation of health texts [ , ].

The reliability of LLMs in replicating clinical judgment for coding classification tasks in mental health remains uncertain, particularly given the inherent complexities of mental health disorders [ , ]. Prior research highlights that while LLMs can process large volumes of text, their ability to discern subtle differences in clinical presentations, such as differentiating between comorbid conditions like depression and anxiety, is still unproven. This challenge is exacerbated by the frequent overlap of symptoms across diagnoses, which complicates classification efforts [ ]. Patients with mental health disorders may present with unique clinical characteristics that challenge an LLM’s ability to accurately identify and code physical and mental health symptoms [ , , ].

We used a large EHR data set of individuals admitted to the emergency department (ED) for a mental health disorder to assess the ability of a state-of-the-art LLM to classify EHR terms into categories defined by experienced mental health clinicians. We assessed the extent to which an LLM replicates clinical judgment and the practicality of using an LLM to assist in creating clinically interpretable features for prediction models.
Methods
Data
We extracted de-identified EHR data from the Optum Labs Data Warehouse, a longitudinal, real-world data asset drawn from more than 50 US health care provider organizations that encompass more than 700 hospitals. We included individuals aged ≥10 years who were admitted to the ED from 2016 to 2021 and had an International Classification of Diseases, Ninth or Tenth Revision (ICD-9 or ICD-10) code for a mental health diagnosis, suicidal ideation, or self-harm, resulting in approximately 6.2 million unique patient episodes. A natural language processing (NLP) algorithm integrated into the Optum Labs Data Warehouse extracted clinical terms for signs, symptoms, and diseases from unstructured free-text fields in the EHR, based on the National Library of Medicine’s Unified Medical Language System dictionary. We identified physical and mental health terms that appeared in at least 1000 unique patient episodes.
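As a minimal sketch of this frequency filter, assuming the NLP output is available as one row per term mention keyed by a patient-episode identifier (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical export of the NLP extraction step: one row per term mention,
# with the episode it came from. Column names are illustrative only.
mentions = pd.read_csv("term_mentions.csv")  # columns: episode_id, term

# Count the number of unique patient episodes in which each term appears.
episodes_per_term = mentions.groupby("term")["episode_id"].nunique()

# Retain only terms mentioned in at least 1000 unique episodes.
frequent_terms = sorted(episodes_per_term[episodes_per_term >= 1000].index)
print(f"{len(frequent_terms)} terms retained for classification")
```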
Coding
A board-certified psychiatrist and a licensed clinical psychologist categorized each EHR term into 1 of 61 categories: 42 mental health–related categories generated from the Clinical Classifications Software Refined [ ] and 19 physical health–related categories generated from the International Classification of Diseases, Tenth Revision (ICD-10) diagnosis coding system. Coding each EHR term involved (1) initial classification by 1 clinician coder, (2) a review of all coding decisions by a second clinician coder with suggestions for revisions, and (3) a final consensus reconciliation involving both coders. The coding of physical health terms was supported by an LLM, which suggested coding decisions that were refined and reconciled (5% of terms required reconciliation) by the 2 clinician coders. All study procedures were approved by the Institutional Review Board of the University of Pennsylvania.
Classification Tasks
We used the Python module “openai” [ ] to run the GPT-4 LLM in a Python environment. We used the most sophisticated GPT-4 model that was publicly available at the time (“gpt-4-turbo-2024-04-09”) and set model parameters to maximize output consistency (eg, temperature=0).

We prompted the model with 3 “zero-shot” classification tasks, wherein the model is given the coding task without any examples: (1) classify all (n=4553) EHR terms as either “mental health” or “physical health,” (2) classify each of the (n=846) mental health terms into 1 of the 42 mental health categories, and (3) classify each of the (n=3707) physical health terms into 1 of the 19 physical health categories. The prompt described the task, listed the possible categories, and provided the EHR terms. The model then confirmed that the predicted category was among the list of possible categories. For full reproducibility, the complete prompt provided to the model, including the task description and category list, is detailed in the multimedia appendix. In task 2, for example, the model was given an unstructured clinical term from an EHR, such as “depressive symptoms.” The prompt then described the classification task and provided the list of 42 possible mental health categories (eg, “depression,” “anxiety,” “eating disorder symptoms,” and “substance use”). The process was repeated for all 846 mental health terms in task 2 and, similarly, for the 3707 physical health terms in task 3.
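As a minimal sketch of one zero-shot call in task 2, assuming the current “openai” Python client; the prompt wording, helper name, and abbreviated category list are illustrative rather than the study’s exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Abbreviated stand-in for the study's 42-category scheme (hypothetical subset).
MENTAL_HEALTH_CATEGORIES = [
    "depression",
    "anxiety",
    "eating disorder symptoms",
    "substance use",
]

def classify_term(term: str) -> str:
    """Zero-shot classification: no examples or definitions are provided."""
    prompt = (
        "Classify the following clinical term from an electronic health record "
        "into exactly one of these categories: "
        + "; ".join(MENTAL_HEALTH_CATEGORIES)
        + f". Respond with the category name only.\n\nTerm: {term}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        temperature=0,  # maximize output consistency, per the study setup
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip()
    # Mirror the paper's check that the prediction is among the listed categories.
    if label not in MENTAL_HEALTH_CATEGORIES:
        raise ValueError(f"Model returned an unlisted category: {label!r}")
    return label

print(classify_term("depressive symptoms"))
```

With temperature set to 0, repeated calls with the same prompt return near-identical outputs, which supports the consistency requirement described above.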
Performance Metrics
We compared GPT-4’s predicted categories with the categories determined by clinical judgment using the “metrics” module of the Python library scikit-learn [ ]. For each task, we report the overall Cohen κ and the weighted averages of precision, recall, and F1-score, accounting for label imbalance. We computed 95% CIs for Cohen κ, precision, recall, and F1-score using a bootstrap procedure with 1000 resamples [ ].
Ethical Considerations
Ethical approval for this study was waived by the University of Pennsylvania Institutional Review Board (IRB protocol #848806) under 45 CFR 46.104, category 4.
Results
Overview
EHR terms (n=4553) were categorized by GPT-4 as “mental health” or “physical health.” Overall, classification performance was strong, with a κ of 0.77 (95% CI 0.75-0.80), precision of 0.93 (95% CI 0.92‐0.94), recall of 0.93 (95% CI 0.92‐0.94), and F1-score of 0.93 (95% CI 0.92‐0.94). GPT-4 classified 18.3% (n=833) of the EHR terms as “mental health” and 81.7% (n=3720) as “physical health” (Table 1). The clinician coders and the model disagreed on the categorization of 164 (19.7%) mental health terms (eg, “gunshot wound,” “chronic fatigue syndrome,” and “IV drug use”) and 149 (4%) physical health terms (eg, “activity issues,” “lethargic,” and “food issues”).
Health domain (n) | Recall (95% CI) | F1-score (95% CI) | Total mentions in data set (thousands) |
Physical health (n=3707) | 0.96 (0.95‐0.97) | 0.96 (0.95‐0.96) | 255,573 |
Mental health (n=846) | 0.81 (0.78‐0.83) | 0.81 (0.79‐0.83) | 85,081 |
aRecall indicates the proportion of terms in a clinician-coded category that were classified by the model as belonging to that category.
Mental Health
Mental health terms (n=846) were classified into 42 categories with a κ of 0.62 (95% CI 0.59-0.66), precision of 0.71 (95% CI 0.68‐0.74), recall of 0.64 (95% CI 0.61‐0.68), and F1-score of 0.65 (95% CI 0.62‐0.69). Table 2 includes category-wise recall, F1-score, and the most frequent categories into which terms from each clinician-coded category were misclassified.
Term category (n) | Recall (95% CI) | F1-score (95% CI) | Total mentions in data set (thousands) | Misclassifications (n) |
Eating disorder or symptoms (n=16) | 1 (0.81‐1) | 0.91 (0.80‐1) | 582 | None |
Living situation (n=11) | 1 (0.74‐1) | 1 (1‐1) | 1259 | None |
ADHD spectrum (n=11) | 1 (0.74‐1) | 0.73 (0.52-0.88) | 810 | None |
OCD symptoms or disorder (n=10) | 1 (0.72‐1) | 0.87 (0.67‐1) | 207 | None |
Somatization symptoms (n=6) | 1 (0.61‐1) | 0.86 (0.57‐1) | 62 | None |
Neurocognitive disorders (n=20) | 0.95 (0.76‐0.99) | 0.62 (0.47‐0.75) | 1225 | Neurocognitive symptoms (n=1) |
Sleep wake symptoms or disorder (n=37) | 0.95 (0.82‐0.99) | 0.86 (0.78‐0.94) | 1833 | Miscellaneous psychiatric symptoms (n=1) and depressive symptoms (n=1) |
Substance-related symptoms or disorder (n=90) | 0.92 (0.85‐0.96) | 0.95 (0.91‐0.98) | 8783 | Neurocognitive disorders (n=4), neurocognitive symptoms (n=1), and psychotic symptoms or disorder (n=1) |
Abusive behavior (n=26) | 0.89 (0.71‐0.96) | 0.84 (0.71‐0.93) | 3053 | Aggressive symptoms (n=1), miscellaneous psychiatric symptoms (n=1), and personality disorder (n=1) |
Unipolar depressive disorder (n=8) | 0.88 (0.53‐0.98) | 0.78 (0.50‐0.96) | 944 | Mood disorder (n=1) |
Autism spectrum disorder (n=7) | 0.86 (0.49‐0.97) | 0.71 (0.38‐0.92) | 132 | Mood disorder (n=1) |
Impulsive behavior (n=6) | 0.83 (0.44‐0.97) | 0.83 (0.50‐1) | 414 | Aggressive symptoms (n=1) |
Personality disorder (n=5) | 0.80 (0.38‐0.96) | 0.47 (0.11‐0.73) | 158 | OCD symptoms or disorder (n=1) |
Injury (n=76) | 0.78 (0.67‐0.88) | 0.84 (0.77‐0.90) | 10,470 | Self harm (n=8), miscellaneous psychiatric symptoms (n=3), and stress-related symptoms or disorder (n=2) |
Psychotic symptoms or disorder (n=50) | 0.76 (0.63‐0.86) | 0.76 (0.66‐0.85) | 6074 | Miscellaneous psychiatric symptoms (n=5), neurocognitive symptoms (n=2), and impulsive behavior (n=1) |
Stress-related symptoms or disorder (n=11) | 0.73 (0.43‐0.90) | 0.57 (0.32‐0.77) | 480 | Stressor symptoms (n=2) and anxiety symptoms (n=1) |
Anxiety disorder (n=14) | 0.71 (0.45‐0.88) | 0.71 (0.50‐0.90) | 683 | Anxiety symptoms (n=1), social situation (n=1), and somatization symptoms (n=1) |
Suicidal symptoms (n=12) | 0.67 (0.39‐0.86) | 0.73 (0.46‐0.92) | 6167 | Self-harm (n=3) and psychotic symptoms or disorder (n=1) |
Self-harm (n=12) | 0.67 (0.39‐0.86) | 0.47 (0.23‐0.67) | 2126 | Abusive behavior (n=3) and suicidal symptoms (n=1) |
Anxiety symptoms (n=22) | 0.64 (0.43‐0.80) | 0.54 (0.36‐0.69) | 7481 | Stress-related symptoms or disorder (n=2), sensory disturbances (n=2), and anxiety disorder (n=2) |
Neurocognitive symptoms (n=74) | 0.61 (0.49‐0.71) | 0.61 (0.50‐0.69) | 1802 | Neurocognitive disorders (n=10), miscellaneous psychiatric symptoms (n=8), and ADHD spectrum (n=6) |
Aggressive symptoms (n=24) | 0.58 (0.40‐0.76) | 0.58 (0.40‐0.74) | 4275 | Anxiety symptoms (n=4), mood symptoms (n=4), and miscellaneous psychiatric symptoms (n=2) |
Depressive symptoms (n=39) | 0.56 (0.41‐0.71) | 0.68 (0.54‐0.80) | 6381 | Mood symptoms (n=5), miscellaneous psychiatric symptoms (n=3), and unipolar depressive disorder (n=2) |
Pharm symptoms (n=7) | 0.43 (0.16‐0.75) | 0.33 (0‐0.59) | 699 | Sensory disturbances (n=2), psych ADE (n=1), and miscellaneous psychiatric symptoms (n=1) |
Bipolar spectrum (n=36) | 0.42 (0.27‐0.58) | 0.59 (0.40‐0.74) | 2290 | Mood symptoms (n=18), psychotic symptoms or disorder (n=2), and miscellaneous psychiatric symptoms (n=1) |
Miscellaneous psychiatric symptoms (n=156) | 0.29 (0.22‐0.36) | 0.39 (0.30‐0.46) | 9554 | Neurocognitive symptoms (n=17), antisocial behavior (n=10), and mood symptoms (n=10) |
Suicidal behavior (n=12) | 0.25 (0.09‐0.53) | 0.38 (0‐0.67) | 1164 | Injury (n=3), miscellaneous psychiatric symptoms (n=1), and overdose (n=1) |
Antisocial behavior (n=10) | 0.20 (0.06‐0.51) | 0.17 (0‐0.37) | 1666 | Personality disorder (n=3), aggressive symptoms (n=2), and miscellaneous psychiatric symptoms (n=2) |
Sensory disturbances (n=6) | 0.17 (0.03‐0.56) | 0.09 (0‐0.27) | 387 | Psychotic symptoms or disorder (n=3) and miscellaneous psychiatric symptoms (n=2) |
Stressor symptoms (n=5) | 0 (0‐0.43) | 0 (0‐0) | 34 | Sensory disturbances (n=2), personality disorder (n=2), and miscellaneous psychiatric symptoms (n=1) |
Psych ADE (n=11) | 0 (0‐0.26) | 0 (0‐0) | 151 | Neurocognitive symptoms (n=6) and pharm symptoms (n=5) |
aRecall indicates the proportion of terms in a clinician-coded category that were classified by the model as belonging to that category.
bADHD: attention-deficit/hyperactivity disorder.
cOCD: obsessive-compulsive disorder.
dpsych ADE: psychiatric adverse drug events.
The model exhibited the best classification performance for the categories “living situation” (F1-score=1, n=11 terms), “substance-related symptoms or disorder” (F1-score=0.95, n=90 terms), “eating disorder or symptoms” (F1-score=0.91, n=16 terms), “OCD symptoms or disorder” (F1-score=0.87, n=10 terms), and “sleep wake symptoms or disorder” (F1-score=0.86, n=37 terms). Conversely, the model performed poorly on “miscellaneous psychiatric symptoms” (F1-score=0.39, n=156 terms), “antisocial behavior” (F1-score=0.17, n=10 terms), “sensory disturbances” (F1-score=0.09, n=6 terms), “psychiatric adverse drug events” (F1-score=0, n=11 terms), and “stressor symptoms” (F1-score=0, n=5 terms).
The most frequently mislabeled mental health terms were in the “psychiatric adverse drug events” category, which the model classified as “neurocognitive symptoms” (n=6 misclassifications) or “pharmacological symptoms” (n=5 misclassifications). The model also commonly mislabeled terms in the “miscellaneous psychiatric symptoms” category: 111 terms from this category were misclassified across 28 of the 41 other categories.
Physical Health
Physical health terms (n=3707) were classified into 19 categories with a κ of 0.69 (95% CI 0.67-0.70), precision of 0.76 (95% CI 0.74‐0.77), recall of 0.71 (95% CI 0.70‐0.73), and F1-score of 0.72 (95% CI 0.70‐0.73). Table 3 includes category-wise recall, F1-score, and the most frequent categories into which terms from each clinician-coded category were misclassified.
Term category (n) | Recall (95% CI) | F1-score (95% CI) | Total mentions in data set (thousands) | Most frequent misclassifications (n) |
Oncological conditions (n=45) | 0.91 (0.79‐0.96) | 0.61 (0.51‐0.70) | 4549 | Autoimmune and inflammatory conditions (n=1), gastrointestinal symptoms (n=1), and other physical symptoms and conditions (n=1) |
Sensory problems (n=41) | 0.90 (0.78‐0.96) | 0.35 (0.27‐0.43) | 3113 | Neurological symptoms (n=4) |
Cardiovascular symptoms (n=401) | 0.88 (0.85‐0.91) | 0.88 (0.85‐0.90) | 30,930 | Other physical symptoms and conditions (n=14), neurological symptoms (n=9), and respiratory symptoms (n=8) |
Respiratory symptoms (n=139) | 0.84 (0.77‐0.89) | 0.72 (0.66‐0.77) | 27,775 | Sensory problems (n=6), gastrointestinal symptoms (n=5), and other physical symptoms and conditions (n=4) |
Infectious symptoms (n=145) | 0.84 (0.77‐0.89) | 0.63 (0.57‐0.68) | 15,079 | Hepatobiliary conditions (n=7), sensory problems (n=3), and skin and soft tissue disorders (n=3) |
Metabolic disorders (n=63) | 0.84 (0.73‐0.91) | 0.68 (0.59‐0.76) | 3136 | Hepatobiliary conditions (n=7), endocrine symptoms (n=1), and other physical symptoms and conditions (n=1) |
Hematological symptoms (n=122) | 0.83 (0.75‐0.89) | 0.81 (0.75‐0.86) | 6321 | Oncological conditions (n=11), gastrointestinal symptoms (n=3), and hepatobiliary conditions (n=3) |
Neurological symptoms (n=413) | 0.82 (0.78‐0.85) | 0.79 (0.76‐0.82) | 22,540 | Sensory problems (n=38), other physical symptoms and conditions (n=8), and infectious symptoms (n=5) |
Gastrointestinal symptoms (n=279) | 0.81 (0.76‐0.85) | 0.77 (0.72‐0.81) | 24,878 | Hepatobiliary conditions (n=18), autoimmune and inflammatory conditions (n=10), and infectious symptoms (n=9) |
Skin and soft tissue disorders (n=314) | 0.78 (0.73‐0.82) | 0.80 (0.76‐0.83) | 15,212 | Infectious symptoms (n=26), other physical symptoms and conditions (n=13), and gastrointestinal symptoms (n=9) |
Genitourinary symptoms (n=201) | 0.77 (0.71‐0.82) | 0.81 (0.76‐0.85) | 8571 | Gastrointestinal symptoms (n=12), infectious symptoms (n=11), and other physical symptoms and conditions (n=7) |
Renal disorders (n=52) | 0.75 (0.62‐0.85) | 0.76 (0.65‐0.84) | 2221 | Infectious symptoms (n=5), genitourinary symptoms (n=4), and cardiovascular symptoms (n=3) |
Endocrine symptoms (n=98) | 0.67 (0.58‐0.76) | 0.71 (0.63‐0.78) | 4942 | Metabolic disorders (n=16), sensory problems (n=4), and autoimmune and inflammatory conditions (n=3) |
Musculoskeletal symptoms (n=480) | 0.67 (0.63‐0.71) | 0.79 (0.75‐0.82) | 21,785 | Other physical symptoms and conditions (n=62), neurological symptoms (n=39), and autoimmune and inflammatory conditions (n=13) |
Pain symptoms (n=59) | 0.59 (0.47‐0.71) | 0.61 (0.51‐0.71) | 18,045 | Other physical symptoms and conditions (n=6), neurological symptoms (n=5), and gastrointestinal symptoms (n=4) |
Autoimmune and inflammatory conditions (n=68) | 0.54 (0.43‐0.66) | 0.50 (0.40‐0.60) | 6234 | Infectious symptoms (n=9), other physical symptoms and conditions (n=9), and skin and soft tissue disorders (n=4) |
Hepatobiliary conditions (n=54) | 0.54 (0.41‐0.66) | 0.45 (0.33‐0.56) | 1970 | Gastrointestinal symptoms (n=11), cardiovascular symptoms (n=4), and other physical symptoms and conditions (n=3) |
Other physical symptoms and conditions (n=559) | 0.47 (0.42‐0.51) | 0.54 (0.50‐0.58) | 31,151 | Sensory problems (n=68), neurological symptoms (n=39), and skin and soft tissue disorders (n=29) |
Respiratory disorders (n=173) | 0.40 (0.33‐0.47) | 0.55 (0.48‐0.63) | 7120 | Respiratory symptoms (n=50), infectious symptoms (n=28), and other physical symptoms and conditions (n=10) |
aRecall indicates the proportion of terms in a clinician-coded category that were classified by the model as belonging to that category.
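The “most frequent misclassifications” columns in Tables 2 and 3 can be recovered from a confusion matrix of clinician-coded versus model-predicted labels; a minimal sketch with hypothetical toy labels:

```python
import pandas as pd

# Hypothetical toy labels; in the study these would be the 846 mental health
# or 3707 physical health terms' clinician-coded and model-predicted categories.
clinician = pd.Series(["self-harm", "injury", "self-harm", "injury"], name="clinician")
model = pd.Series(["injury", "injury", "abusive behavior", "injury"], name="model")

# Cross-tabulate clinician-coded (rows) against model-predicted (columns).
confusion = pd.crosstab(clinician, model)

# For each clinician-coded category, rank the off-diagonal counts to recover
# the "most frequent misclassifications" column.
for true_cat, row in confusion.iterrows():
    errors = row.drop(true_cat, errors="ignore")
    errors = errors[errors > 0].sort_values(ascending=False)
    if not errors.empty:
        top = ", ".join(f"{cat} (n={count})" for cat, count in errors.head(3).items())
        print(f"{true_cat}: {top}")
```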
The model exhibited the best classification performance for the categories “cardiovascular symptoms” (n=401 terms), “hematological symptoms” (n=122 terms), and “genitourinary symptoms” (n=201 terms), each with an F1-score >0.80. Conversely, the model performed poorly on “sensory problems” (F1-score=0.35, n=41 terms), “hepatobiliary conditions” (F1-score=0.45, n=54 terms), and “other physical symptoms and conditions” (F1-score=0.54, n=559 terms).
The model commonly predicted the category “sensory problems” for terms that clinicians coded as “other physical symptoms and conditions” (n=68 misclassifications) or “neurological symptoms” (n=38 misclassifications). The model also commonly mislabeled terms in the “other physical symptoms and conditions” category: 299 terms from this category were misclassified across the 18 other categories.
Discussion
Principal Findings
We investigated GPT-4’s ability to replicate clinical judgment when classifying EHR terms from a data set of mental health patients into interpretable clinical categories. A recent review of NLP studies found the agreement of human coding of EHR data to range from 0.72 to 0.94 (Cohen κ) [ ]. Against this benchmark, GPT-4 showed human-like agreement with clinical experts when classifying EHR terms as either mental or physical health. Yet, GPT-4’s classification performance varied widely across mental health and physical health categories, with high error rates for certain categories (eg, “sensory problems” and “stressor symptoms”). Misclassifications highlighted GPT-4’s biases, such as the tendency for broad categories (eg, “other physical symptoms and conditions”) to be underselected; terms from these categories were instead allocated to more specific categories (eg, “cutting” was allocated to “injury” instead of “self-harm”).

Nevertheless, GPT-4 was able to rapidly transform a feature set of 4553 individual EHR terms into 61 clinically valid groups, which can be readily implemented into prediction models. State-of-the-art LLMs have already been used alongside traditional NLP methods, such as named entity recognition, text clustering, and supervised machine learning models trained on text data [ - ]. Additionally, LLMs can explain categorization decisions, providing valuable insights for end users of integrated clinical tools.
Limitations
LLMs occasionally “hallucinate,” generating outputs that are off-task, nonsensical, or contradictory. Although we prompted the model to validate the output and correct for hallucinations, as the creativity and complexity of tasks increase, so does the risk of aberrant outputs [ ]. Moreover, recent studies have found that LLM performance on certain clinical tasks can substantially improve when given 1 or more examples per code, a process known as “few-shot” learning [ , ]. In contrast, our study used “zero-shot” learning, where GPT-4 was asked to classify clinical terms without being provided with any specific examples or definitions for the coding system. This method was chosen to assess the model’s baseline classification performance without introducing task-specific bias.

However, we recognize that because the coding system was developed by only 2 clinicians, bias may be introduced by their unique sets of clinical experiences, institutional practices, and personal preferences. The LLM may be biased as well: an ad hoc analysis indicated a tendency for the model to underuse “other” categories (eg, “other physical symptoms and conditions” and “miscellaneous psychiatric symptoms”) relative to the clinician coders. Nonetheless, we acknowledge that many clinical terms in EHRs are inherently ambiguous and may be classified under multiple categories depending on the context. Without knowing that the sample comprises people hospitalized with a mental health disturbance, it is not necessarily a misclassification for GPT-4 to label “gunshot wound” as a physical injury rather than an indicator of suicide risk. The task of assigning a single, mutually exclusive label may limit the ability to capture the full complexity of a clinical term.

While this study provides a preliminary framework for exploring the feasibility of using LLMs for unstructured EHR classification, future research should involve a varied set of coding methods, classification approaches (eg, multi-label classification), and a larger cohort of clinician coders to enhance generalizability. Finally, we note that several categories in the mental health domain had too few terms (<5) to yield stable estimates of agreement and were removed from the analysis.
Implications
The accuracy of clinical term classification is essential for downstream predictive models that rely on structured data, as inaccuracies can propagate through the model pipeline. Understanding the sensitivity of these models to variations in input labels is key, especially when distinguishing between random errors and systematic misclassifications. Systematic errors, where specific categories are consistently mislabeled, may affect the robustness of models trained on such data more than random error (ie, noise) [ - ].

Moreover, the assumption that accurate categorization of clinical terms is a necessary intermediate step is worth reconsidering. As LLMs advance, these models may bypass the traditional 2-stage process and make predictions directly from unstructured text [ ]. Future research is needed to determine whether bypassing the intermediate categorization step entirely might enhance or hinder model performance, depending on the specific clinical application.
Conclusion
As LLMs continue to advance, the time and human resources required to distill a large corpus of EHR terms into clinically meaningful groups can be greatly reduced. LLMs have the potential to be integrated into EHR systems to create text-based features for prediction models in real time. This study found that a state-of-the-art LLM achieved high agreement with classifications of experienced clinicians across terms from numerous physical and mental health categories.
Acknowledgments
This work was supported by the National Institutes of Health (R01MH126895). We appreciate the contribution of Ming Xie in dataset preparation and extensive analytical support.
Conflicts of Interest
None declared.
References
- Hossain ME, Khan A, Moni MA, Uddin S. Use of electronic health data for disease prediction: a comprehensive literature review. IEEE/ACM Trans Comput Biol Bioinform. 2021;18(2):745-758. [CrossRef]
- Mahmoudi E, Kamdar N, Kim N, Gonzales G, Singh K, Waljee AK. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: systematic review. BMJ. Apr 8, 2020;369:m958. [CrossRef] [Medline]
- Boudreaux ED, Haskins BL, Larkin C, et al. Emergency department safety assessment and follow-up evaluation 2: an implementation trial to improve suicide prevention. Contemp Clin Trials. Aug 2020;95:106075. [CrossRef] [Medline]
- Boudreaux ED, Rundensteiner E, Liu F, et al. Applying machine learning approaches to suicide prediction using healthcare data: overview and future directions. Front Psychiatry. 2021;12:707916. [CrossRef] [Medline]
- Lee TC, Shah NU, Haack A, Baxter SL. Clinical implementation of predictive models embedded within electronic health record systems: a systematic review. Informatics (MDPI). Sep 2020;7(3):25. [CrossRef] [Medline]
- Bayramli I, Castro V, Barak-Corren Y, et al. Predictive structured-unstructured interactions in EHR models: a case study of suicide prediction. NPJ Digit Med. Jan 27, 2022;5(1):15. [CrossRef] [Medline]
- Mahajan SM, Ghani R. Combining structured and unstructured data for predicting risk of readmission for heart failure patients. Stud Health Technol Inform. Aug 21, 2019;264:238-242. [CrossRef] [Medline]
- Coiera E. Guide to Health Informatics. 3rd ed. CRC Press; 2015.
- Akbilgic O, Homayouni R, Heinrich K, Langham M, Davis R. Unstructured text in EMR improves prediction of death after surgery in children. Informatics (MDPI). 2019;6(1):4. [CrossRef]
- Marafino BJ, Park M, Davies JM, et al. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw Open. Dec 7, 2018;1(8):e185097. [CrossRef] [Medline]
- Bousselham H, Nfaoui EH, Mourhir A. Fine-tuning GPT on biomedical NLP tasks: an empirical evaluation. Presented at: 2024 International Conference on Computer, Electrical & Communication Engineering (ICCECE); Feb 2-3, 2024; Kolkata, India. [CrossRef]
- OpenAI, Achiam J, Adler S, et al. GPT-4 technical report. arXiv. Preprint posted online on Mar 15, 2023. [CrossRef]
- Shekhar S, Tiwari S, Rensink TC, Eskander R, Salloum W. Coupling symbolic reasoning with language modeling for efficient longitudinal understanding of unstructured electronic medical records. arXiv. Preprint posted online on Aug 7, 2023. [CrossRef]
- Chew R, Bollenbacher J, Wenger M, Speer J, Kim A. LLM-assisted content analysis: using large language models to support deductive coding. arXiv. Preprint posted online on Jun 23, 2023. [CrossRef]
- Xiao Z, Yuan X, Liao QV, Abdelghani R, Oudeyer PY. Supporting qualitative analysis with large language models: combining codebook with GPT-3 for deductive coding. Presented at: IUI ’23: 28th International Conference on Intelligent User Interfaces; Mar 27-31, 2023; Sydney, Australia. [CrossRef]
- Hou C, Zhu G, Zheng J, et al. Prompt-based and fine-tuned GPT models for context-dependent and -independent deductive coding in social annotation. Presented at: LAK ’24: The 14th Learning Analytics and Knowledge Conference; Mar 18-22, 2024; Kyoto, Japan. [CrossRef]
- Gilardi F, Alizadeh M, Kubli M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc Natl Acad Sci U S A. Jul 25, 2023;120(30):e2305016120. [CrossRef] [Medline]
- Törnberg P. ChatGPT-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. arXiv. Preprint posted online on Apr 13, 2023. [CrossRef]
- Hua Y, Liu F, Yang K, et al. Large language models in mental health care: a scoping review. arXiv. Preprint posted online on Jan 1, 2024. [CrossRef]
- Kumar A, Sharma R, Bedi P. Towards optimal NLP solutions: analyzing GPT and LLaMA-2 models across model scale, dataset size, and task diversity. Eng Technol Appl Sci Res. 2024;14(3):14219-14224. [CrossRef]
- Yanagita Y, Yokokawa D, Fukuzawa F, Uchida S, Uehara T, Ikusaka M. Assessing the ability of GPT to generate illness scripts: an evaluation study. medRxiv. Preprint posted online on Dec 27, 2023. [CrossRef]
- Moradi M, Blagec K, Haberl F, Samwald M. GPT-3 models are poor few-shot learners in the biomedical domain. arXiv. Preprint posted online on Sep 6, 2021. [CrossRef]
- Rao A, Pang M, Kim J, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv. Preprint posted online on Feb 26, 2023. [CrossRef]
- User guide: clinical classifications software refined (CCSR). Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project (HCUP). 2019. URL: https://hcup-us.ahrq.gov/toolssoftware/ccsr/DXCCSR-User-Guide-v2019-1.pdf [Accessed 2025-01-15]
- openai/openai-python: the official Python library for the OpenAI API. GitHub. 2024. URL: https://github.com/openai/openai-python [Accessed 2025-01-15]
- Kramer O. Scikit-Learn: Machine Learning for Evolution Strategies. Springer International Publishing; 2016:45-53. URL: http://link.springer.com/10.1007/978-3-319-33383-0_5 [Accessed 2025-01-15]
- Fung KP, Lee J. Bootstrap estimate of the variance and confidence interval of kappa. Br J Ind Med. Jul 1991;48(7):503-504. [CrossRef] [Medline]
- Scharp D, Hobensack M, Davoudi A, Topaz M. Natural language processing applied to clinical documentation in post-acute care settings: a scoping review. J Am Med Dir Assoc. Jan 2024;25(1):69-83. [CrossRef] [Medline]
- Garrido-Merchan EC, Gozalo-Brizuela R, Gonzalez-Carvajal S. Comparing BERT against traditional machine learning models in text classification. J Comput Cogn Eng. 2023;2:352-356.
- Li L, Zhou J, Gao Z, et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). arXiv. 2024. URL: https://arxiv.org/abs/2405.03066
- Sushil M, Zack T, Mandair D, et al. A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification. arXiv. Preprint posted online on Jan 25, 2024. [CrossRef]
- Lee M. A mathematical investigation of hallucination and creativity in GPT models. Mathematics. 2023;11(10):2320. [CrossRef]
- Labrak Y, Rouvier M, Dufour R. A zero-shot and few-shot study of instruction-finetuned large language models applied to clinical and biomedical tasks. arXiv. Preprint posted online on Jul 22, 2023. [CrossRef]
- Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med Inform. Apr 8, 2024;12:e55318. [CrossRef] [Medline]
- Shah D, Schwartz HA, Hovy D. Predictive biases in natural language processing models: a conceptual framework and overview. arXiv. Preprint posted online on Nov 9, 2019. [CrossRef]
- Lin Z, Guan S, Zhang W, Zhang H, Li Y, Zhang H. Towards trustworthy LLMs: a review on debiasing and dehallucinating in large language models. Artif Intell Rev. 2024;57(9):243. [CrossRef]
- Frenay B, Verleysen M. Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learning Syst. 2014;25(5):845-869. [CrossRef]
Abbreviations
ED: emergency department
EHR: electronic health record
LLM: large language model
NLP: natural language processing
Edited by Christian Lovis; submitted 15.08.24; peer-reviewed by Bill Thies, Philip Chung; final revised version received 25.11.24; accepted 30.11.24; published 21.01.25.
Copyright © Nicholas C Cardamone, Mark Olfson, Timothy Schmutte, Lyle Ungar, Tony Liu, Sara W Cullen, Nathaniel J Williams, Steven C Marcus. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 21.1.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.