TY - JOUR AU - Yun, Sun Hye AU - Bickmore, Timothy PY - 2025/3/31 TI - Online Health Information-Seeking in the Era of Large Language Models: Cross-Sectional Web-Based Survey Study JO - J Med Internet Res SP - e68560 VL - 27 KW - online health information-seeking KW - large language models KW - eHealth KW - internet KW - consumer health information N2 - Background: As large language model (LLM)-based chatbots such as ChatGPT (OpenAI) grow in popularity, it is essential to understand their role in delivering online health information compared to other resources. These chatbots often generate inaccurate content, posing potential safety risks. This motivates the need to examine how users perceive and act on health information provided by LLM-based chatbots. Objective: This study investigates the patterns, perceptions, and actions of users seeking health information online, including from LLM-based chatbots. The relationships between online health information-seeking behaviors and important sociodemographic characteristics are examined as well. Methods: A web-based survey of crowd workers was conducted via Prolific. The questionnaire covered sociodemographic information, trust in health care providers, eHealth literacy, artificial intelligence (AI) attitudes, chronic health condition status, online health information source types, perceptions, and actions, such as cross-checking or adherence. Quantitative and qualitative analyses were applied. Results: Most participants consulted search engines (291/297, 98%) and health-related websites (203/297, 68.4%) for their health information, while 21.2% (63/297) used LLM-based chatbots, with ChatGPT and Microsoft Copilot being the most popular. Most participants (268/297, 90.2%) sought information on health conditions, with fewer seeking advice on medication (179/297, 60.3%), treatments (137/297, 46.1%), and self-diagnosis (62/297, 23.2%). Perceived information quality and trust varied little across source types. The preferred source for validating information from the internet was consulting health care professionals (40/132, 30.3%), while only a very small percentage of participants (5/214, 2.3%) consulted AI tools to cross-check information from search engines and health-related websites. For information obtained from LLM-based chatbots, 19.4% (12/63) of participants cross-checked the information, while 48.4% (30/63) of participants followed the advice. Both of these rates were lower than those for information from search engines, health-related websites, forums, or social media. Furthermore, use of LLM-based chatbots for health information was negatively correlated with age (ρ=−0.16, P=.006). In contrast, attitudes surrounding AI for medicine had significant positive correlations with the number of source types consulted for health advice (ρ=0.14, P=.01), use of LLM-based chatbots for health information (ρ=0.31, P<.001), and number of health topics searched (ρ=0.19, P<.001). Conclusions: Although traditional online sources remain dominant, LLM-based chatbots are emerging as a resource for health information for some users, specifically those who are younger and have a higher trust in AI. The perceived quality and trustworthiness of health information varied little across source types. However, adherence to health information from LLM-based chatbots seemed more cautious than adherence to information from search engines or health-related websites. 
As LLMs continue to evolve, enhancing their accuracy and transparency will be essential in mitigating any potential risks by supporting responsible information-seeking while maximizing the potential of AI in health contexts. UR - https://www.jmir.org/2025/1/e68560 UR - http://dx.doi.org/10.2196/68560 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/68560 ER - TY - JOUR AU - Schaye, Verity AU - DiTullio, David AU - Guzman, Vincent Benedict AU - Vennemeyer, Scott AU - Shih, Hanniel AU - Reinstein, Ilan AU - Weber, E. Danielle AU - Goodman, Abbie AU - Wu, Y. Danny T. AU - Sartori, J. Daniel AU - Santen, A. Sally AU - Gruppen, Larry AU - Aphinyanaphongs, Yindalon AU - Burk-Rafel, Jesse PY - 2025/3/21 TI - Large Language Model-Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study JO - J Med Internet Res SP - e67967 VL - 27 KW - large language models KW - artificial intelligence KW - clinical reasoning KW - documentation KW - assessment KW - feedback KW - electronic health record N2 - Background: Clinical reasoning (CR) is an essential skill, yet physicians often receive limited feedback. Artificial intelligence holds promise to fill this gap. Objective: We report the development of named entity recognition (NER), logic-based and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]). Methods: The note corpus consisted of internal medicine resident admission notes (retrospective set: July 2020-December 2021, n=700 NYU and 450 UC notes; prospective validation set: July 2023-December 2023, n=155 NYU and 92 UC notes). Clinicians rated CR documentation quality in each note using a previously validated tool (Revised-IDEA) on 3-point scales across 2 domains: differential diagnosis (D0, D1, and D2) and explanation of reasoning (EA0, EA1, and EA2). At NYU, the retrospective set was annotated for NER for 5 entities (diagnosis, diagnostic category, prioritization of diagnosis language, data, and linkage terms). Models were developed using different artificial intelligence approaches: (1) an NER, logic-based model, a large word vector model (scispaCy en_core_sci_lg) with model weights adjusted with backpropagation from annotations, developed at NYU with external validation at UC; (2) NYUTron LLM, an NYU internal 110 million parameter LLM pretrained on 7.25 million clinical notes, only validated at NYU; and (3) GatorTron LLM, an open source 345 million parameter LLM pretrained on 82 billion words of clinical text, fine-tuned on NYU retrospective sets, then externally validated and further fine-tuned at UC. Model performance was assessed in the prospective sets with F1-scores for the NER, logic-based model and area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs. Results: At NYU, the NYUTron LLM performed best: the D0 and D2 models had AUROC/AUPRC 0.87/0.79 and 0.89/0.86, respectively. The D1, EA0, and EA1 models had insufficient performance for implementation (AUROC range 0.57-0.80, AUPRC range 0.33-0.63). For the D1 classification, the approach pivoted to a stepwise approach taking advantage of the more performant D0 and D2 models. 
For the EA model, the approach pivoted to a binary EA2 model (ie, EA2 vs not EA2) with excellent performance, AUROC/AUPRC 0.85/0.80. At UC, the NER, D-logic-based model was the best-performing D model (F1-scores 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively). The GatorTron LLM performed best for EA2 scores, with AUROC/AUPRC 0.75/0.69. Conclusions: This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach. UR - https://www.jmir.org/2025/1/e67967 UR - http://dx.doi.org/10.2196/67967 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/67967 ER - TY - JOUR AU - Hou, Yu AU - Bishop, R. Jeffrey AU - Liu, Hongfang AU - Zhang, Rui PY - 2025/3/19 TI - Improving Dietary Supplement Information Retrieval: Development of a Retrieval-Augmented Generation System With Large Language Models JO - J Med Internet Res SP - e67677 VL - 27 KW - dietary supplements KW - knowledge representation KW - knowledge graph KW - retrieval-augmented generation KW - large language model KW - user interface N2 - Background: Dietary supplements (DSs) are widely used to improve health and nutrition, but challenges related to misinformation, safety, and efficacy persist due to less stringent regulations compared with pharmaceuticals. Accurate and reliable DS information is critical for both consumers and health care providers to make informed decisions. Objective: This study aimed to enhance DS-related question answering by integrating an advanced retrieval-augmented generation (RAG) system with the integrated Dietary Supplement Knowledgebase 2.0 (iDISK2.0), a dietary supplement knowledge base, to improve accuracy and reliability. Methods: We developed iDISK2.0 by integrating updated data from authoritative sources, including the Natural Medicines Comprehensive Database, the Memorial Sloan Kettering Cancer Center database, Dietary Supplement Label Database, and Licensed Natural Health Products Database, and applied advanced data cleaning and standardization techniques to reduce noise. The RAG system combined the retrieval power of a biomedical knowledge graph with the generative capabilities of large language models (LLMs) to address limitations of stand-alone LLMs, such as hallucination. The system retrieves contextually relevant subgraphs from iDISK2.0 based on user queries, enabling accurate and evidence-based responses through a user-friendly interface. We evaluated the system using true-or-false and multiple-choice questions derived from the Memorial Sloan Kettering Cancer Center database and compared its performance with stand-alone LLMs. Results: iDISK2.0 integrates 174,317 entities across 7 categories, including 8091 dietary supplement ingredients; 163,806 dietary supplement products; 786 diseases; and 625 drugs, along with 6 types of relationships. The RAG system achieved an accuracy of 99% (990/1000) for true-or-false questions on DS effectiveness and 95% (948/1000) for multiple-choice questions on DS-drug interactions, substantially outperforming stand-alone LLMs like GPT-4o (OpenAI), which scored 62% (618/1000) and 52% (517/1000) on these respective tasks. The user interface enabled efficient interaction, supporting free-form text input and providing accurate responses. Integration strategies minimized data noise, ensuring access to up-to-date, DS-related information. 
Conclusions: By integrating a robust knowledge graph with RAG and LLM technologies, iDISK2.0 addresses the critical limitations of stand-alone LLMs in DS information retrieval. This study highlights the importance of combining structured data with advanced artificial intelligence methods to improve accuracy and reduce misinformation in health care applications. Future work includes extending the framework to broader biomedical domains and improving evaluation with real-world, open-ended queries. UR - https://www.jmir.org/2025/1/e67677 UR - http://dx.doi.org/10.2196/67677 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/67677 ER - TY - JOUR AU - Mansoor, Masab AU - Ibrahim, F. Andrew AU - Grindem, David AU - Baig, Asad PY - 2025/3/19 TI - Large Language Models for Pediatric Differential Diagnoses in Rural Health Care: Multicenter Retrospective Cohort Study Comparing GPT-3 With Pediatrician Performance JO - JMIRx Med SP - e65263 VL - 6 KW - natural language processing KW - NLP KW - machine learning KW - ML KW - artificial intelligence KW - language model KW - large language model KW - LLM KW - generative pretrained transformer KW - GPT KW - pediatrics N2 - Background: Rural health care providers face unique challenges such as limited specialist access and high patient volumes, making accurate diagnostic support tools essential. Large language models like GPT-3 have demonstrated potential in clinical decision support but remain understudied in pediatric differential diagnosis. Objective: This study aims to evaluate the diagnostic accuracy and reliability of a fine-tuned GPT-3 model compared to board-certified pediatricians in rural health care settings. Methods: This multicenter retrospective cohort study analyzed 500 pediatric encounters (ages 0-18 years; n=261, 52.2% female) from rural health care organizations in Central Louisiana between January 2020 and December 2021. The GPT-3 model (DaVinci version) was fine-tuned using the OpenAI application programming interface and trained on 350 encounters, with 150 reserved for testing. Five board-certified pediatricians (mean experience: 12, SD 5.8 years) provided reference standard diagnoses. Model performance was assessed using accuracy, sensitivity, specificity, and subgroup analyses. Results: The GPT-3 model achieved an accuracy of 87.3% (131/150 cases), sensitivity of 85% (95% CI 82%-88%), and specificity of 90% (95% CI 87%-93%), comparable to pediatricians' accuracy of 91.3% (137/150 cases; P=.47). Performance was consistent across age groups (0-5 years: 54/62, 87%; 6-12 years: 47/53, 89%; 13-18 years: 30/35, 86%) and common complaints (fever: 36/39, 92%; abdominal pain: 20/23, 87%). For rare diagnoses (n=20), accuracy was slightly lower (16/20, 80%) but comparable to pediatricians (17/20, 85%; P=.62). Conclusions: This study demonstrates that a fine-tuned GPT-3 model can provide diagnostic support comparable to pediatricians, particularly for common presentations, in rural health care. Further validation in diverse populations is necessary before clinical implementation. 
UR - https://xmed.jmir.org/2025/1/e65263 UR - http://dx.doi.org/10.2196/65263 ID - info:doi/10.2196/65263 ER - TY - JOUR AU - Li, Jiajia AU - Wang, Zikai AU - Yu, Longxuan AU - Liu, Hui AU - Song, Haitao PY - 2025/3/19 TI - Synthetic Data-Driven Approaches for Chinese Medical Abstract Sentence Classification: Computational Study JO - JMIR Form Res SP - e54803 VL - 9 KW - medical abstract sentence classification KW - large language models KW - synthetic datasets KW - deep learning KW - Chinese medical KW - dataset KW - traditional Chinese medicine KW - global medical research KW - algorithm KW - robustness KW - efficiency KW - accuracy N2 - Background: Medical abstract sentence classification is crucial for enhancing medical database searches, literature reviews, and generating new abstracts. However, Chinese medical abstract classification research is hindered by a lack of suitable datasets. Given the vastness of Chinese medical literature and the unique value of traditional Chinese medicine, precise classification of these abstracts is vital for advancing global medical research. Objective: This study aims to address the data scarcity issue by generating a large volume of labeled Chinese abstract sentences without manual annotation, thereby creating new training datasets. Additionally, we seek to develop more accurate text classification algorithms to improve the precision of Chinese medical abstract classification. Methods: We developed 3 training datasets (dataset #1, dataset #2, and dataset #3) and a test dataset to evaluate our model. Dataset #1 contains 15,000 abstract sentences translated from the PubMed dataset into Chinese. Datasets #2 and #3, each with 15,000 sentences, were generated using GPT-3.5 from 40,000 Chinese medical abstracts in the CSL database. Dataset #2 used titles and keywords for pseudolabeling, while dataset #3 aligned abstracts with category labels. The test dataset includes 87,000 sentences from 20,000 abstracts. We used SBERT embeddings for deeper semantic analysis and evaluated our model using clustering (SBERT-DocSCAN) and supervised methods (SBERT-MEC). Extensive ablation studies and feature analyses were conducted to validate the model's effectiveness and robustness. Results: Our experiments involved training both clustering and supervised models on the 3 datasets, followed by comprehensive evaluation using the test dataset. The outcomes demonstrated that our models outperformed the baseline metrics. Specifically, when trained on dataset #1, the SBERT-DocSCAN model registered an impressive accuracy and F1-score of 89.85% on the test dataset. Concurrently, the SBERT-MEC algorithm exhibited comparable performance with an accuracy of 89.38% and an identical F1-score. Training on dataset #2 yielded similarly positive results for the SBERT-DocSCAN model, achieving an accuracy and F1-score of 89.83%, while the SBERT-MEC algorithm recorded an accuracy of 86.73% and an F1-score of 86.51%. Notably, training with dataset #3 allowed the SBERT-DocSCAN model to attain the best results, with an accuracy and F1-score of 91.30%, whereas the SBERT-MEC algorithm also showed robust performance, obtaining an accuracy of 90.39% and an F1-score of 90.35%. Ablation analysis highlighted the critical role of integrated features and methodologies in improving classification efficiency. Conclusions: Our approach addresses the challenge of limited datasets for Chinese medical abstract classification by generating novel datasets. 
The deployment of SBERT-DocSCAN and SBERT-MEC models significantly enhances the precision of classifying Chinese medical abstracts, even when using synthetic datasets with pseudolabels. UR - https://formative.jmir.org/2025/1/e54803 UR - http://dx.doi.org/10.2196/54803 ID - info:doi/10.2196/54803 ER - TY - JOUR AU - Nazar, Wojciech AU - Nazar, Grzegorz AU - Kamińska, Aleksandra AU - Danilowicz-Szymanowicz, Ludmila PY - 2025/3/18 TI - How to Design, Create, and Evaluate an Instruction-Tuning Dataset for Large Language Model Training in Health Care: Tutorial From a Clinical Perspective JO - J Med Internet Res SP - e70481 VL - 27 KW - generative artificial intelligence KW - large language models KW - instruction-tuning datasets KW - tutorials KW - evaluation framework KW - health care UR - https://www.jmir.org/2025/1/e70481 UR - http://dx.doi.org/10.2196/70481 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/70481 ER - TY - JOUR AU - Oami, Takehiko AU - Okada, Yohei AU - Nakada, Taka-aki PY - 2025/3/12 TI - GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews JO - JMIR Med Inform SP - e64682 VL - 13 KW - large language models KW - citation screening KW - systematic review KW - clinical practice guidelines KW - artificial intelligence KW - sepsis KW - AI KW - review KW - GPT KW - screening KW - citations KW - critical care KW - Japan KW - Japanese KW - accuracy KW - efficiency KW - reliability KW - LLM UR - https://medinform.jmir.org/2025/1/e64682 UR - http://dx.doi.org/10.2196/64682 ID - info:doi/10.2196/64682 ER - TY - JOUR AU - Park, Adam AU - Jung, Young Se AU - Yune, Ilha AU - Lee, Ho-Young PY - 2025/3/7 TI - Applying Robotic Process Automation to Monitor Business Processes in Hospital Information Systems: Mixed Method Approach JO - JMIR Med Inform SP - e59801 VL - 13 KW - robotic process automation KW - RPA KW - electronic medical records KW - EMR KW - system monitoring KW - health care information systems KW - user-centric monitoring KW - performance evaluation KW - business process management KW - BPM KW - healthcare technology KW - mixed methods research KW - process automation in health care N2 - Background: Electronic medical records (EMRs) have undergone significant changes due to advancements in technology, including artificial intelligence, the Internet of Things, and cloud services. The increasing complexity within health care systems necessitates enhanced process reengineering and system monitoring approaches. Robotic process automation (RPA) provides a user-centric approach to monitoring system complexity by mimicking end user interactions, thus presenting potential improvements in system performance and monitoring. Objective: This study aimed to explore the application of RPA in monitoring the complexities of EMR systems within a hospital environment, focusing on RPA's ability to perform end-to-end performance monitoring that closely reflects real-time user experiences. Methods: The research was conducted at Seoul National University Bundang Hospital using a mixed methods approach. It included the iterative development and integration of RPA bots programmed to simulate and monitor typical user interactions with the hospital's EMR system. Quantitative data from RPA process outputs and qualitative insights from interviews with system engineers and managers were used to evaluate the effectiveness of RPA in system monitoring. 
Results: RPA bots effectively identified and reported system inefficiencies and failures, providing a bridge between end user experiences and engineering assessments. The bots were particularly useful in detecting delays and errors immediately following system updates or interactions with external services. Over 3 years, RPA monitoring highlighted discrepancies between user-reported experiences and traditional engineering metrics, with the bots frequently identifying critical system issues that were not evident from standard component-level monitoring. Conclusions: RPA enhances system monitoring by providing insights that reflect true end user experiences, which are often overlooked by traditional monitoring methods. The study confirms the potential of RPA to act as a comprehensive monitoring tool within complex health care systems, suggesting that RPA can significantly contribute to the maintenance and improvement of EMR systems by providing a more accurate and timely reflection of system performance and user satisfaction. UR - https://medinform.jmir.org/2025/1/e59801 UR - http://dx.doi.org/10.2196/59801 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053771 ID - info:doi/10.2196/59801 ER - TY - JOUR AU - Guo, XiaoRui AU - Xiao, Liang AU - Liu, Xinyu AU - Chen, Jianxia AU - Tong, Zefang AU - Liu, Ziji PY - 2025/3/4 TI - Enhancing Doctor-Patient Shared Decision-Making: Design of a Novel Collaborative Decision Description Language JO - J Med Internet Res SP - e55341 VL - 27 KW - shared decision-making KW - speech acts KW - agent KW - argumentation KW - interaction protocol N2 - Background: Effective shared decision-making between patients and physicians is crucial for enhancing health care quality and reducing medical errors. The literature shows that the absence of effective methods to facilitate shared decision-making can result in poor patient engagement and unfavorable decision outcomes. Objective: In this paper, we propose a Collaborative Decision Description Language (CoDeL) to model shared decision-making between patients and physicians, offering a theoretical foundation for studying various shared decision scenarios. Methods: CoDeL is based on an extension of the interaction protocol language of Lightweight Social Calculus. The language utilizes speech acts to represent the attitudes of shared decision-makers toward decision propositions, as well as their semantic relationships within dialogues. It supports interactive argumentation among decision makers by embedding clinical evidence into each segment of decision protocols. Furthermore, CoDeL enables personalized decision-making, allowing for the demonstration of characteristics such as persistence, critical thinking, and openness. Results: The feasibility of the approach is demonstrated through a case study of shared decision-making in the disease domain of atrial fibrillation. Our experimental results show that integrating the proposed language with GPT can further enhance its capabilities in interactive decision-making, improving interpretability. Conclusions: The proposed novel CoDeL can enhance doctor-patient shared decision-making in a rational, personalized, and interpretable manner. 
UR - https://www.jmir.org/2025/1/e55341 UR - http://dx.doi.org/10.2196/55341 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053763 ID - info:doi/10.2196/55341 ER - TY - JOUR AU - Ohno, Yukiko AU - Aomori, Tohru AU - Nishiyama, Tomohiro AU - Kato, Riri AU - Fujiki, Reina AU - Ishikawa, Haruki AU - Kiyomiya, Keisuke AU - Isawa, Minae AU - Mochizuki, Mayumi AU - Aramaki, Eiji AU - Ohtani, Hisakazu PY - 2025/3/4 TI - Performance Improvement of a Natural Language Processing Tool for Extracting Patient Narratives Related to Medical States From Japanese Pharmaceutical Care Records by Increasing the Amount of Training Data: Natural Language Processing Analysis and Validation Study JO - JMIR Med Inform SP - e68863 VL - 13 KW - natural language processing KW - NLP KW - named entity recognition KW - NER KW - deep learning KW - pharmaceutical care record KW - electronic medical record KW - EMR KW - Japanese N2 - Background: Patients' oral expressions serve as valuable sources of clinical information to improve pharmacotherapy. Natural language processing (NLP) is a useful approach for analyzing unstructured text data, such as patient narratives. However, few studies have focused on using NLP for narratives in the Japanese language. Objective: We aimed to develop a high-performance NLP system for extracting clinical information from patient narratives by examining the performance progression with a gradual increase in the amount of training data. Methods: We used subjective texts from the pharmaceutical care records of Keio University Hospital from April 1, 2018, to March 31, 2019, comprising 12,004 records from 6559 cases. After preprocessing, we annotated diseases and symptoms within the texts. We then trained and evaluated a deep learning model (bidirectional encoder representations from transformers combined with a conditional random field [BERT-CRF]) through 10-fold cross-validation. The annotated data were divided into 10 subsets, and the amount of training data was progressively increased over 10 steps. We also analyzed the causes of errors. Finally, we applied the developed system to the analysis of case report texts to evaluate its usability for texts from other sources. Results: The F1-score of the system improved from 0.67 to 0.82 as the amount of training data increased from 1200 to 12,004 records. The F1-score reached 0.78 with 3600 records and was largely similar thereafter. As performance improved, errors from incorrect extractions decreased significantly, which resulted in an increase in precision. For case reports, the F1-score also increased from 0.34 to 0.41 as the training dataset expanded from 1200 to 12,004 records. Performance was lower for extracting symptoms from case report texts compared with pharmaceutical care records, suggesting that this system is more specialized for analyzing subjective data from pharmaceutical care records. Conclusions: We successfully developed a high-performance system specialized in analyzing subjective data from pharmaceutical care records by training on a large dataset, with near-complete saturation of system performance at about 3600 training records. This system will be useful for monitoring symptoms, offering benefits for both clinical practice and research. UR - https://medinform.jmir.org/2025/1/e68863 UR - http://dx.doi.org/10.2196/68863 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053805 ID - info:doi/10.2196/68863 ER - TY - JOUR AU - Park, Katie Jinkyung AU - Singh, K. 
Vivek AU - Wisniewski, Pamela PY - 2025/2/28 TI - Current Landscape and Future Directions for Mental Health Conversational Agents for Youth: Scoping Review JO - JMIR Med Inform SP - e62758 VL - 13 KW - conversational agent KW - chatbot KW - mental health KW - youth KW - adolescent KW - scoping review KW - Preferred Reporting Items for Systematic Reviews and Meta-Analyses KW - artificial intelligence N2 - Background: Conversational agents (CAs; chatbots) are systems with the ability to interact with users using natural human dialogue. They are increasingly used to support interactive knowledge discovery of sensitive topics such as mental health topics. While much of the research on CAs for mental health has focused on adult populations, the insights from such research may not apply to CAs for youth. Objective: This study aimed to comprehensively evaluate the state-of-the-art research on mental health CAs for youth. Methods: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, we identified 39 peer-reviewed studies specific to mental health CAs designed for youth across 4 databases, including ProQuest, Scopus, Web of Science, and PubMed. We conducted a scoping review of the literature to evaluate the characteristics of research on mental health CAs designed for youth, the design and computational considerations of mental health CAs for youth, and the evaluation outcomes reported in the research on mental health CAs for youth. Results: We found that many mental health CAs (11/39, 28%) were designed as older peers to provide therapeutic or educational content to promote youth mental well-being. All CAs were designed based on expert knowledge, with a few that incorporated inputs from youth. The technical maturity of CAs was in its infancy, focusing on building prototypes with rule-based models to deliver prewritten content, with limited safety features to respond to imminent risk. Research findings suggest that while youth appreciate the 24/7 availability of friendly or empathetic conversation on sensitive topics with CAs, they found the content provided by CAs to be limited. Finally, we found that most (35/39, 90%) of the reviewed studies did not address the ethical aspects of mental health CAs, while youth were concerned about the privacy and confidentiality of their sensitive conversation data. Conclusions: Our study highlights the need for researchers to continue to work together to align evidence-based research on mental health CAs for youth with lessons learned on how to best deliver these technologies to youth. Our review brings to light mental health CAs needing further development and evaluation. The new trend of large language model-based CAs can make such technologies more feasible. However, the privacy and safety of the systems should be prioritized. Although preliminary evidence shows positive trends in mental health CAs, long-term evaluative research with larger sample sizes and robust research designs is needed to validate their efficacy. More importantly, collaboration between youth and clinical experts is essential from the early design stages through to the final evaluation to develop safe, effective, and youth-centered mental health chatbots. Finally, best practices for risk mitigation and ethical development of CAs with and for youth are needed to promote their mental well-being. 
UR - https://medinform.jmir.org/2025/1/e62758 UR - http://dx.doi.org/10.2196/62758 UR - http://www.ncbi.nlm.nih.gov/pubmed/40053735 ID - info:doi/10.2196/62758 ER - TY - JOUR AU - Zhong, Jinjia AU - Zhu, Ting AU - Huang, Yafang PY - 2025/2/25 TI - Reporting Quality of AI Intervention in Randomized Controlled Trials in Primary Care: Systematic Review and Meta-Epidemiological Study JO - J Med Internet Res SP - e56774 VL - 27 KW - artificial intelligence KW - randomized controlled trial KW - reporting quality KW - primary care KW - meta-epidemiological study N2 - Background: Despite the surge in artificial intelligence (AI) interventions in primary care trials, their reporting quality has not been studied. Objective: This study aimed to systematically evaluate the reporting quality of both published randomized controlled trials (RCTs) and protocols for RCTs that investigated AI interventions in primary care. Methods: PubMed, Embase, Cochrane Library, MEDLINE, Web of Science, and CINAHL databases were searched for RCTs and protocols on AI interventions in primary care until November 2024. Eligible studies were published RCTs or full protocols for RCTs exploring AI interventions in primary care. The reporting quality was assessed using CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) and SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials-Artificial Intelligence) checklists, focusing on AI intervention-related items. Results: A total of 11,711 records were identified. In total, 19 published RCTs and 21 RCT protocols for 35 trials were included. The overall proportion of adequately reported items was 65% (172/266; 95% CI 59%-70%) and 68% (214/315; 95% CI 62%-73%) for RCTs and protocols, respectively. The percentage of RCTs and protocols that reported a specific item ranged from 11% (2/19) to 100% (19/19) and from 10% (2/21) to 100% (21/21), respectively. The reporting of both RCTs and protocols exhibited similar characteristics and trends. Both lacked transparency and completeness in three respects: they did not provide adequate information regarding the input data, did not mention the methods for identifying and analyzing performance errors, and did not state whether and how the AI intervention and its code can be accessed. Conclusions: The reporting quality could be improved in both RCTs and protocols. This study helps promote the transparent and complete reporting of trials with AI interventions in primary care. UR - https://www.jmir.org/2025/1/e56774 UR - http://dx.doi.org/10.2196/56774 UR - http://www.ncbi.nlm.nih.gov/pubmed/39998876 ID - info:doi/10.2196/56774 ER - TY - JOUR AU - Steele, Brian AU - Fairie, Paul AU - Kemp, Kyle AU - D'Souza, G. Adam AU - Wilms, Matthias AU - Santana, Jose Maria PY - 2025/2/24 TI - Identifying Patient-Reported Care Experiences in Free-Text Survey Comments: Topic Modeling Study JO - JMIR Med Inform SP - e63466 VL - 13 KW - natural language processing KW - patient-reported experience KW - topic models KW - inpatient KW - artificial intelligence KW - AI KW - patient reported KW - feedback KW - survey KW - patient experiences KW - bidirectional encoder representations from transformers KW - BERT KW - sentiment analysis KW - pediatric caregivers KW - patient safety KW - safety N2 - Background: Patient-reported experience surveys allow administrators, clinicians, and researchers to quantify and improve health care by receiving feedback directly from patients. 
Existing research has focused primarily on quantitative analysis of survey items, but these measures may collect optional free-text comments. These comments can provide insights for health systems but may not be analyzed due to limited resources and the complexity of traditional textual analysis. However, advances in machine learning-based natural language processing provide opportunities to learn from this traditionally underused data source. Objective: This study aimed to apply natural language processing to model topics found in free-text comments of patient-reported experience surveys. Methods: Consumer Assessment of Healthcare Providers and Systems-derived patient experience surveys were collected and linked to administrative inpatient records by the provincial health services organization responsible for inpatient care. Unsupervised topic modeling with automated labeling was performed with BERTopic. Sentiment analysis was performed to further assist in topic description. Results: Between April 2016 and February 2020, 43.4% (43,522/100,272) of adult patients and 46.9% (3501/7464) of pediatric caregivers included free-text responses on completed patient experience surveys. Topic models identified 86 topics among adult survey responses and 35 topics among pediatric responses that included elements of care not currently surveyed by existing questionnaires. Frequent topics were generally positive. Conclusions: We found that with limited tuning, BERTopic identified care experience topics with interpretable automated labeling. Results are discussed in the context of person-centered care, patient safety, and health care quality improvement. Furthermore, we note the opportunity for the identification of temporal and site-specific trends as a method to identify patient care and safety concerns. As the use of patient experience measurement increases in health care, we discuss how machine learning can be leveraged to provide additional insight on patient experiences. UR - https://medinform.jmir.org/2025/1/e63466 UR - http://dx.doi.org/10.2196/63466 ID - info:doi/10.2196/63466 ER - TY - JOUR AU - Shi, Qiming AU - Luzuriaga, Katherine AU - Allison, J. Jeroan AU - Oztekin, Asil AU - Faro, M. Jamie AU - Lee, L. Joy AU - Hafer, Nathaniel AU - McManus, Margaret AU - Zai, H. Adrian PY - 2025/2/13 TI - Transforming Informed Consent Generation Using Large Language Models: Mixed Methods Study JO - JMIR Med Inform SP - e68139 VL - 13 KW - informed consent form KW - ICF KW - large language models KW - LLMs KW - clinical trials KW - readability KW - health informatics KW - artificial intelligence KW - AI KW - AI in health care N2 - Background: Informed consent forms (ICFs) for clinical trials have become increasingly complex, often hindering participant comprehension and engagement due to legal jargon and lengthy content. The recent advances in large language models (LLMs) present an opportunity to streamline the ICF creation process while improving readability, understandability, and actionability. Objectives: This study aims to evaluate the performance of the Mistral 8x22B LLM in generating ICFs with improved readability, understandability, and actionability. Specifically, we evaluate the model's effectiveness in generating ICFs that are readable, understandable, and actionable while maintaining accuracy and completeness. Methods: We processed 4 clinical trial protocols from the institutional review board of UMass Chan Medical School using the Mistral 8x22B model to generate key information sections of ICFs. 
A multidisciplinary team of 8 evaluators, including clinical researchers and health informaticians, assessed the generated ICFs against human-generated counterparts for completeness, accuracy, readability, understandability, and actionability. Readability, Understandability, and Actionability of Key Information indicators, which include 18 binary-scored items, were used to evaluate these aspects, with higher scores indicating greater accessibility, comprehensibility, and actionability of the information. Statistical analysis, including Wilcoxon rank sum tests and intraclass correlation coefficient calculations, was used to compare outputs. Results: LLM-generated ICFs demonstrated comparable performance to human-generated versions across key sections, with no significant differences in accuracy and completeness (P>.10). The LLM outperformed human-generated ICFs in readability (Readability, Understandability, and Actionability of Key Information score of 76.39% vs 66.67%; Flesch-Kincaid grade level of 7.95 vs 8.38) and understandability (90.63% vs 67.19%; P=.02). The LLM-generated content achieved a perfect score in actionability compared with the human-generated version (100% vs 0%; P<.001). Intraclass correlation coefficient for evaluator consistency was high at 0.83 (95% CI 0.64-1.03), indicating good reliability across assessments. Conclusions: The Mistral 8x22B LLM showed promising capabilities in enhancing the readability, understandability, and actionability of ICFs without sacrificing accuracy or completeness. LLMs present a scalable, efficient solution for ICF generation, potentially enhancing participant comprehension and consent in clinical trials. UR - https://medinform.jmir.org/2025/1/e68139 UR - http://dx.doi.org/10.2196/68139 ID - info:doi/10.2196/68139 ER - TY - JOUR AU - Seo, Sujeong AU - Kim, Kyuli AU - Yang, Heyoung PY - 2025/2/12 TI - Performance Assessment of Large Language Models in Medical Consultation: Comparative Study JO - JMIR Med Inform SP - e64318 VL - 13 KW - artificial intelligence KW - biomedical KW - large language model KW - depression KW - similarity measurement KW - text validity N2 - Background: The recent introduction of generative artificial intelligence (AI) as an interactive consultant has sparked interest in evaluating its applicability in medical discussions and consultations, particularly within the domain of depression. Objective: This study evaluates the capability of large language models (LLMs) in AI to generate responses to depression-related queries. Methods: Using the PubMedQA and QuoraQA data sets, we compared various LLMs, including BioGPT, PMC-LLaMA, GPT-3.5, and Llama2, and measured the similarity between the generated and original answers. Results: The latest general LLMs, GPT-3.5 and Llama2, exhibited superior performance, particularly in generating responses to medical inquiries from the PubMedQA data set. Conclusions: Considering the rapid advancements in LLM development in recent years, it is hypothesized that version upgrades of general LLMs offer greater potential for enhancing their ability to generate "knowledge text" in the biomedical domain compared with fine-tuning for the biomedical field. These findings are expected to contribute significantly to the evolution of AI-based medical counseling systems. UR - https://medinform.jmir.org/2025/1/e64318 UR - http://dx.doi.org/10.2196/64318 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/64318 ER - TY - JOUR AU - Puts, Sander AU - Zegers, L. Catharina M. 
AU - Dekker, Andre AU - Bermejo, Iñigo PY - 2025/2/11 TI - Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection JO - JMIR Form Res SP - e60095 VL - 9 KW - International Classification of Diseases KW - ICD-10 KW - computer-assisted-coding KW - GPT-4 KW - coding KW - term extraction KW - code analysis KW - computer assisted coding KW - transformer model KW - artificial intelligence KW - AI automation KW - retrieval-augmented generation KW - RAG KW - large language model KW - LLM KW - Bidirectional Encoder Representations from Transformers KW - Robustly Optimized BERT Pretraining Approach KW - RoBERTa KW - named entity recognition KW - NER N2 - Background: The International Classification of Diseases (ICD), developed by the World Health Organization, standardizes health condition coding to support health care policy, research, and billing, but artificial intelligence automation, while promising, still underperforms compared with human accuracy and lacks the explainability needed for adoption in medical settings. Objective: The potential of large language models for assisting medical coders in ICD-10 coding was explored through the development of a computer-assisted coding system. This study aimed to augment human coding by initially identifying lead terms and using retrieval-augmented generation (RAG)-based methods for computer-assisted coding enhancement. Methods: The explainability dataset from the CodiEsp challenge (CodiEsp-X) was used, featuring 1000 Spanish clinical cases annotated with ICD-10 codes. A new dataset, CodiEsp-X-lead, was generated using GPT-4 to replace full-textual evidence annotations with lead term annotations. A Robustly Optimized BERT (Bidirectional Encoder Representations from Transformers) Pretraining Approach transformer model was fine-tuned for named entity recognition to extract lead terms. GPT-4 was subsequently employed to generate code descriptions from the extracted textual evidence. Using a RAG approach, ICD codes were assigned to the lead terms by querying a vector database of ICD code descriptions with OpenAI's text-embedding-ada-002 model. Results: The fine-tuned Robustly Optimized BERT Pretraining Approach achieved an overall F1-score of 0.80 for ICD lead term extraction on the new CodiEsp-X-lead dataset. GPT-4-generated code descriptions reduced retrieval failures in the RAG approach by approximately 5% for both diagnoses and procedures. However, the overall explainability F1-score for the CodiEsp-X task was limited to 0.305, significantly lower than the state-of-the-art F1-score of 0.633. The diminished performance was partly due to the reliance on code descriptions, as some ICD codes lacked descriptions, and the approach did not fully align with the medical coder's workflow. Conclusions: While lead term extraction showed promising results, the subsequent RAG-based code assignment using GPT-4 and code descriptions was less effective. Future research should focus on refining the approach to more closely mimic the medical coder's workflow, potentially integrating the alphabetic index and official coding guidelines, rather than relying solely on code descriptions. This alignment may enhance system accuracy and better support medical coders in practice. 
UR - https://formative.jmir.org/2025/1/e60095 UR - http://dx.doi.org/10.2196/60095 ID - info:doi/10.2196/60095 ER - TY - JOUR AU - Selcuk, Yesim AU - Kim, Eunhui AU - Ahn, Insung PY - 2025/2/10 TI - InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis JO - JMIR Med Inform SP - e63881 VL - 13 KW - large language model KW - Arabic large language models KW - AceGPT KW - multilingual large language model KW - infectious disease monitoring KW - public health N2 - Background: Infectious diseases have consistently been a significant concern in public health, requiring proactive measures to safeguard societal well-being. In this regard, regular monitoring activities play a crucial role in mitigating the adverse effects of diseases on society. To monitor disease trends, various organizations, such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC), collect diverse surveillance data and make them publicly accessible. However, these platforms primarily present surveillance data in English, which creates language barriers for non-English-speaking individuals and global public health efforts to accurately observe disease trends. This challenge is particularly noticeable in regions such as the Middle East, where specific infectious diseases, such as Middle East respiratory syndrome coronavirus (MERS-CoV), have seen a dramatic increase. For such regions, it is essential to develop tools that can overcome language barriers and reach more individuals to alleviate the negative impacts of these diseases. Objective: This study aims to address these issues; therefore, we propose InfectA-Chat, a cutting-edge large language model (LLM) specifically designed for the Arabic language but also incorporating English for question and answer (Q&A) tasks. InfectA-Chat leverages its deep understanding of the language to provide users with information on the latest trends in infectious diseases based on their queries. Methods: This comprehensive study was achieved by instruction tuning the AceGPT-7B and AceGPT-7B-Chat models on a Q&A task, using a dataset of 55,400 Arabic and English domain-specific instruction-following data. The performance of these fine-tuned models was evaluated using 2770 domain-specific Arabic and English instruction-following data, using the GPT-4 evaluation method. A comparative analysis was then performed against Arabic LLMs and state-of-the-art models, including AceGPT-13B-Chat, Jais-13B-Chat, Gemini, GPT-3.5, and GPT-4. Furthermore, to ensure the model had access to the latest information on infectious diseases by regularly updating the data without additional fine-tuning, we used the retrieval-augmented generation (RAG) method. Results: InfectA-Chat demonstrated good performance in answering questions about infectious diseases by the GPT-4 evaluation method. Our comparative analysis revealed that it outperforms the AceGPT-7B-Chat and InfectA-Chat (based on AceGPT-7B) models by a margin of 43.52%. It also surpassed other Arabic LLMs such as AceGPT-13B-Chat and Jais-13B-Chat by 48.61%. Among the state-of-the-art models, InfectA-Chat achieved a leading performance of 23.78%, competing closely with the GPT-4 model. Furthermore, the RAG method in InfectA-Chat significantly improved document retrieval accuracy. Notably, RAG retrieved more accurate documents based on queries when the top-k parameter value was increased. 
Conclusions: Our findings highlight the shortcomings of general Arabic LLMs in providing up-to-date information about infectious diseases. With this study, we aim to empower individuals and public health efforts by offering a bilingual Q&A system for infectious disease monitoring. UR - https://medinform.jmir.org/2025/1/e63881 UR - http://dx.doi.org/10.2196/63881 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63881 ER - TY - JOUR AU - Rządeczka, Marcin AU - Sterna, Anna AU - Stolińska, Julia AU - Kaczyńska, Paulina AU - Moskalewicz, Marcin PY - 2025/2/7 TI - The Efficacy of Conversational AI in Rectifying the Theory-of-Mind and Autonomy Biases: Comparative Analysis JO - JMIR Ment Health SP - e64396 VL - 12 KW - cognitive bias KW - conversational artificial intelligence KW - artificial intelligence KW - AI KW - chatbots KW - digital mental health KW - bias rectification KW - affect recognition N2 - Background: The increasing deployment of conversational artificial intelligence (AI) in mental health interventions necessitates an evaluation of their efficacy in rectifying cognitive biases and recognizing affect in human-AI interactions. These biases are particularly relevant in mental health contexts as they can exacerbate conditions such as depression and anxiety by reinforcing maladaptive thought patterns or unrealistic expectations in human-AI interactions. Objective: This study aimed to assess the effectiveness of therapeutic chatbots (Wysa and Youper) versus general-purpose language models (GPT-3.5, GPT-4, and Gemini Pro) in identifying and rectifying cognitive biases and recognizing affect in user interactions. Methods: This study used constructed case scenarios simulating typical user-bot interactions to examine how effectively chatbots address selected cognitive biases. The cognitive biases assessed included theory-of-mind biases (anthropomorphism, overtrust, and attribution) and autonomy biases (illusion of control, fundamental attribution error, and just-world hypothesis). Each chatbot response was evaluated based on accuracy, therapeutic quality, and adherence to cognitive behavioral therapy principles using an ordinal scale to ensure consistency in scoring. To enhance reliability, responses underwent a double review process by 2 cognitive scientists, followed by a secondary review by a clinical psychologist specializing in cognitive behavioral therapy, ensuring a robust assessment across interdisciplinary perspectives. Results: This study revealed that general-purpose chatbots outperformed therapeutic chatbots in rectifying cognitive biases, particularly in overtrust bias, fundamental attribution error, and just-world hypothesis. GPT-4 achieved the highest scores across all biases, whereas the therapeutic bot Wysa scored the lowest. Notably, general-purpose bots showed more consistent accuracy and adaptability in recognizing and addressing bias-related cues across different contexts, suggesting a broader flexibility in handling complex cognitive patterns. In addition, in affect recognition tasks, general-purpose chatbots not only excelled but also demonstrated quicker adaptation to subtle emotional nuances, outperforming therapeutic bots in 67% (4/6) of the tested biases. Conclusions: This study shows that, while therapeutic chatbots hold promise for mental health support and cognitive bias intervention, their current capabilities are limited. 
Addressing cognitive biases in AI-human interactions requires systems that can both rectify and analyze biases as integral to human cognition, promoting precision and simulating empathy. The findings reveal the need for improved simulated emotional intelligence in chatbot design to provide adaptive, personalized responses that reduce overreliance and encourage independent coping skills. Future research should focus on enhancing affective response mechanisms and addressing ethical concerns such as bias mitigation and data privacy to ensure safe, effective AI-based mental health support. UR - https://mental.jmir.org/2025/1/e64396 UR - http://dx.doi.org/10.2196/64396 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/64396 ER - TY - JOUR AU - Yang, Zhichao AU - Yao, Zonghai AU - Tasmin, Mahbuba AU - Vashisht, Parth AU - Jang, Seok Won AU - Ouyang, Feiyun AU - Wang, Beining AU - McManus, David AU - Berlowitz, Dan AU - Yu, Hong PY - 2025/2/7 TI - Unveiling GPT-4V's Hidden Challenges Behind High Accuracy on USMLE Questions: Observational Study JO - J Med Internet Res SP - e65146 VL - 27 KW - artificial intelligence KW - natural language processing KW - large language model KW - LLM KW - ChatGPT KW - GPT KW - GPT-4V KW - USMLE KW - Medical License Exam KW - medical image interpretation KW - United States Medical Licensing Examination KW - NLP N2 - Background: Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) exams and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored. Objective: This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), specifically focusing on GPT-4V's newly introduced image-understanding feature. By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings. Methods: This cross-sectional study tested GPT-4V, GPT-4, and ChatGPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 clinical knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE) (n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V's accuracy was compared with 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. The quality of the explanations was evaluated by choosing human preference between an explanation by GPT-4V (without hint), an explanation by an expert, or a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V's explanation ability, we modified a patient case report to resemble a typical "curbside consultation" between physicians. Results: For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% in Step 1, Step 2 clinical knowledge, Step 3 of USMLE, and DRQCE, respectively. It outperformed GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). 
When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers often had poor explanation quality: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) had inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V accuracy improved with hints, maintaining stable performance across difficulty levels, while medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately. Conclusions: GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, which could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings. UR - https://www.jmir.org/2025/1/e65146 UR - http://dx.doi.org/10.2196/65146 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/65146 ER - TY - JOUR AU - Bhak, Youngmin AU - Lee, Ho Yu AU - Kim, Joonhyung AU - Lee, Kiwon AU - Lee, Daehwan AU - Jang, Chan Eun AU - Jang, Eunjeong AU - Lee, Seungkyu Christopher AU - Kang, Seok Eun AU - Park, Sehee AU - Han, Wook Hyun AU - Nam, Min Sang PY - 2025/2/7 TI - Diagnosis of Chronic Kidney Disease Using Retinal Imaging and Urine Dipstick Data: Multimodal Deep Learning Approach JO - JMIR Med Inform SP - e55825 VL - 13 KW - multimodal deep learning KW - chronic kidney disease KW - fundus image KW - saliency map KW - urine dipstick N2 - Background: Chronic kidney disease (CKD) is a prevalent condition with significant global health implications. Early detection and management are critical to prevent disease progression and complications. Deep learning (DL) models using retinal images have emerged as potential noninvasive screening tools for CKD, though their performance may be limited, especially in identifying individuals with proteinuria and in specific subgroups. Objective: We aim to evaluate the efficacy of integrating retinal images and urine dipstick data into DL models for enhanced CKD diagnosis. Methods: Three models were developed and validated: eGFR-RIDL (estimated glomerular filtration rate-retinal image deep learning), eGFR-UDLR (logistic regression using urine dipstick data), and eGFR-MMDL (multimodal deep learning combining retinal images and urine dipstick data). All models were trained to predict an eGFR<60 mL/min/1.73 m², a key indicator of CKD, calculated using the 2009 CKD-EPI (Chronic Kidney Disease Epidemiology Collaboration) equation. This study used a multicenter dataset of participants aged 20-79 years, including a development set (65,082 people) and an external validation set (58,284 people). Wide Residual Networks were used for DL, and saliency maps were used to visualize model attention. Sensitivity analyses assessed the impact of numerical variables. Results: eGFR-MMDL outperformed eGFR-RIDL in both the test and external validation sets, with areas under the curve of 0.94 versus 0.90 and 0.88 versus 0.77 (P<.001 for both, DeLong test). 
eGFR-UDLR outperformed eGFR-RIDL and was comparable to eGFR-MMDL, particularly in the external validation. However, in the subgroup analysis, eGFR-MMDL showed improvement across all subgroups, while eGFR-UDLR demonstrated no such gains. This suggested that the enhanced performance of eGFR-MMDL was not due to urine data alone, but rather from the synergistic integration of both retinal images and urine data. The eGFR-MMDL model demonstrated the best performance in individuals younger than 65 years or those with proteinuria. Age and proteinuria were identified as critical factors influencing model performance. Saliency maps indicated that urine data and retinal images provide complementary information, with urine offering insights into retinal abnormalities and retinal images, particularly the arcade vessels, being key for predicting kidney function. Conclusions: The MMDL model integrating retinal images and urine dipstick data shows significant promise for noninvasive CKD screening, outperforming the retinal image-only model. However, routine blood tests are still recommended for individuals aged 65 years and older due to the model's limited performance in this age group. UR - https://medinform.jmir.org/2025/1/e55825 UR - http://dx.doi.org/10.2196/55825 ID - info:doi/10.2196/55825 ER - TY - JOUR AU - Roustan, Dimitri AU - Bastardot, François PY - 2025/1/28 TI - The Clinicians' Guide to Large Language Models: A General Perspective With a Focus on Hallucinations JO - Interact J Med Res SP - e59823 VL - 14 KW - medical informatics KW - large language model KW - clinical informatics KW - decision-making KW - computer assisted KW - decision support techniques KW - decision support KW - decision KW - AI KW - artificial intelligence KW - artificial intelligence tool KW - LLM KW - electronic data system KW - hallucinations KW - false information KW - technical framework UR - https://www.i-jmr.org/2025/1/e59823 UR - http://dx.doi.org/10.2196/59823 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59823 ER - TY - JOUR AU - Cardamone, C. Nicholas AU - Olfson, Mark AU - Schmutte, Timothy AU - Ungar, Lyle AU - Liu, Tony AU - Cullen, W. Sara AU - Williams, J. Nathaniel AU - Marcus, C. Steven PY - 2025/1/21 TI - Classifying Unstructured Text in Electronic Health Records for Mental Health Prediction Models: Large Language Model Evaluation Study JO - JMIR Med Inform SP - e65454 VL - 13 KW - artificial intelligence KW - AI KW - machine learning KW - ML KW - natural language processing KW - NLP KW - large language model KW - LLM KW - ChatGPT KW - predictive modeling KW - mental health KW - health informatics KW - electronic health record KW - EHR KW - EHR system KW - text KW - dataset KW - mental health disorder KW - emergency department KW - physical health N2 - Background: Prediction models have demonstrated a range of applications across medicine, including using electronic health record (EHR) data to identify hospital readmission and mortality risk. Large language models (LLMs) can transform unstructured EHR text into structured features, which can then be integrated into statistical prediction models, ensuring that the results are both clinically meaningful and interpretable. Objective: This study aims to compare the classification decisions made by clinical experts with those generated by a state-of-the-art LLM, using terms extracted from a large EHR data set of individuals with mental health disorders seen in emergency departments (EDs).
Methods: Using a dataset from the EHR systems of more than 50 health care provider organizations in the United States from 2016 to 2021, we extracted all clinical terms that appeared in at least 1000 records of individuals admitted to the ED for a mental health-related problem from a source population of over 6 million ED episodes. Two experienced mental health clinicians (one medically trained psychiatrist and one clinical psychologist) reached consensus on the classification of EHR terms and diagnostic codes into categories. We evaluated an LLM's agreement with clinical judgment across three classification tasks as follows: (1) classify terms into "mental health" or "physical health", (2) classify mental health terms into 1 of 42 prespecified categories, and (3) classify physical health terms into 1 of 19 prespecified broad categories. Results: There was high agreement between the LLM and clinical experts when categorizing 4553 terms as "mental health" or "physical health" (κ=0.77, 95% CI 0.75-0.80). However, there was still considerable variability in LLM-clinician agreement on the classification of mental health terms (κ=0.62, 95% CI 0.59-0.66) and physical health terms (κ=0.69, 95% CI 0.67-0.70). Conclusions: The LLM displayed high agreement with clinical experts when classifying EHR terms into certain mental health or physical health term categories. However, agreement with clinical experts varied considerably within both sets of mental and physical health term categories. Importantly, the use of LLMs presents an alternative to manual human coding, offering great potential to create interpretable features for prediction models. UR - https://medinform.jmir.org/2025/1/e65454 UR - http://dx.doi.org/10.2196/65454 ID - info:doi/10.2196/65454 ER - TY - JOUR AU - Yang, Xingwei AU - Li, Guang PY - 2025/1/20 TI - Psychological and Behavioral Insights From Social Media Users: Natural Language Processing-Based Quantitative Study on Mental Well-Being JO - JMIR Form Res SP - e60286 VL - 9 KW - social media KW - natural language processing KW - social interaction KW - decision support system KW - depression detection N2 - Background: Depression significantly impacts an individual's thoughts, emotions, behaviors, and moods; this prevalent mental health condition affects millions globally. Traditional approaches to detecting and treating depression rely on questionnaires and personal interviews, which can be time consuming and potentially inefficient. As social media has permanently shifted the pattern of our daily communications, social media postings can offer new perspectives in understanding mental illness in individuals because they provide an unbiased exploration of their language use and behavioral patterns. Objective: This study aimed to develop and evaluate a methodological language framework that integrates psychological patterns, contextual information, and social interactions using natural language processing and machine learning techniques. The goal was to enhance intelligent decision-making for detecting depression at the user level. Methods: We extracted language patterns via natural language processing approaches that facilitate understanding contextual and psychological factors, such as affective patterns and personality traits linked with depression. Then, we extracted social interaction influence features. The resultant social interaction influence that users have within their online social group is derived based on users'
emotions, psychological states, and context of communication extracted from status updates and the social network structure. We empirically evaluated the effectiveness of our framework by applying machine learning models to detect depression, reporting accuracy, recall, precision, and F1-score using social media status updates from 1047 users along with their associated depression diagnosis questionnaire scores. These datasets also include user postings, network connections, and personality responses. Results: The proposed framework demonstrates accurate and effective detection of depression, improving performance compared to traditional baselines with an average improvement of 6% in accuracy and 10% in F1-score. It also shows competitive performance relative to state-of-the-art models. The inclusion of social interaction features demonstrates strong performance. By using all influence features (affective influence features, contextual influence features, and personality influence features), the model achieved an accuracy of 77% and a precision of 80%. Using affective features and affective influence features also showed strong performance, achieving 81% precision and an F1-score of 79%. Conclusions: The developed framework offers practical applications, such as accelerating hospital diagnoses, improving prediction accuracy, facilitating timely referrals, and providing actionable insights for early interventions in mental health treatment plans. UR - https://formative.jmir.org/2025/1/e60286 UR - http://dx.doi.org/10.2196/60286 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60286 ER - TY - JOUR AU - Kim, JaeYong AU - Vajravelu, Narayan Bathri PY - 2025/1/16 TI - Assessing the Current Limitations of Large Language Models in Advancing Health Care Education JO - JMIR Form Res SP - e51319 VL - 9 KW - large language model KW - generative pretrained transformer KW - health care education KW - health care delivery KW - artificial intelligence KW - LLM KW - ChatGPT KW - AI UR - https://formative.jmir.org/2025/1/e51319 UR - http://dx.doi.org/10.2196/51319 ID - info:doi/10.2196/51319 ER - TY - JOUR AU - Fukushima, Takuya AU - Manabe, Masae AU - Yada, Shuntaro AU - Wakamiya, Shoko AU - Yoshida, Akiko AU - Urakawa, Yusaku AU - Maeda, Akiko AU - Kan, Shigeyuki AU - Takahashi, Masayo AU - Aramaki, Eiji PY - 2025/1/16 TI - Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset JO - JMIR Med Inform SP - e65047 VL - 13 KW - large language models KW - genetic counseling KW - medical KW - health KW - artificial intelligence KW - machine learning KW - domain adaptation KW - retrieval-augmented generation KW - instruction tuning KW - prompt engineering KW - question-answer KW - dialogue KW - ethics KW - safety KW - low-rank adaptation KW - Japanese KW - expert evaluation N2 - Background: Advances in genetics have underscored a strong association between genetic factors and health outcomes, leading to an increased demand for genetic counseling services. However, a shortage of qualified genetic counselors poses a significant challenge. Large language models (LLMs) have emerged as a potential solution for augmenting support in genetic counseling tasks. Despite the potential, Japanese genetic counseling LLMs (JGCLLMs) are underexplored. To advance a JGCLLM-based dialogue system for genetic counseling, effective domain adaptation methods require investigation. 
Objective: This study aims to evaluate the current capabilities and identify challenges in developing a JGCLLM-based dialogue system for genetic counseling. The primary focus is to assess the effectiveness of prompt engineering, retrieval-augmented generation (RAG), and instruction tuning within the context of genetic counseling. Furthermore, we will establish an expert-evaluated dataset of responses generated by LLMs adapted to Japanese genetic counseling for the future development of JGCLLMs. Methods: Two primary datasets were used in this study: (1) a question-answer (QA) dataset for LLM adaptation and (2) a genetic counseling question dataset for evaluation. The QA dataset included 899 QA pairs covering medical and genetic counseling topics, while the evaluation dataset contained 120 curated questions across 6 genetic counseling categories. Three LLM enhancement techniques (instruction tuning, RAG, and prompt engineering) were applied to a lightweight Japanese LLM to enhance its ability for genetic counseling. The performance of the adapted LLM was evaluated on the 120-question dataset by 2 certified genetic counselors and 1 ophthalmologist (SK, YU, and AY). Evaluation focused on four metrics: (1) inappropriateness of information, (2) sufficiency of information, (3) severity of harm, and (4) alignment with medical consensus. Results: The evaluation by certified genetic counselors and an ophthalmologist revealed varied outcomes across different methods. RAG showed potential, particularly in enhancing critical aspects of genetic counseling. In contrast, instruction tuning and prompt engineering produced less favorable outcomes. This evaluation process facilitated the creation of an expert-evaluated dataset of responses generated by LLMs adapted with different combinations of these methods. Error analysis identified key ethical concerns, including inappropriate promotion of prenatal testing, criticism of relatives, and inaccurate probability statements. Conclusions: RAG demonstrated notable improvements across all evaluation metrics, suggesting potential for further enhancement through the expansion of RAG data. The expert-evaluated dataset developed in this study provides valuable insights for future optimization efforts. However, the ethical issues observed in JGCLLM responses underscore the critical need for ongoing refinement and thorough ethical evaluation before these systems can be implemented in health care settings. UR - https://medinform.jmir.org/2025/1/e65047 UR - http://dx.doi.org/10.2196/65047 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/65047 ER - TY - JOUR AU - Zhu, Shiben AU - Hu, Wanqin AU - Yang, Zhi AU - Yan, Jiani AU - Zhang, Fang PY - 2025/1/10 TI - Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study JO - JMIR Med Inform SP - e63731 VL - 13 KW - large language models KW - LLMs KW - Chinese National Nursing Licensing Examination KW - ChatGPT KW - Qwen-2.5 KW - multiple-choice questions N2 - Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored.
Objective: This study aims to evaluate the accuracy of 7 LLMs, including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and to show that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training. UR - https://medinform.jmir.org/2025/1/e63731 UR - http://dx.doi.org/10.2196/63731 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63731 ER - TY - JOUR AU - Kang, Boyoung AU - Hong, Munpyo PY - 2025/1/3 TI - Development and Evaluation of a Mental Health Chatbot Using ChatGPT 4.0: Mixed Methods User Experience Study With Korean Users JO - JMIR Med Inform SP - e63538 VL - 13 KW - mental health chatbot KW - Dr. CareSam KW - HoMemeTown KW - ChatGPT 4.0 KW - large language model KW - LLM KW - cross-lingual KW - pilot testing KW - cultural sensitivity KW - localization KW - Korean students N2 - Background: Mental health chatbots have emerged as promising tools for providing accessible and convenient support to individuals in need. Building on our previous research on digital interventions for loneliness and depression among Korean college students, this study addresses the limitations identified and explores more advanced artificial intelligence-driven solutions. Objective: This study aimed to develop and evaluate the performance of HoMemeTown Dr. CareSam, an advanced cross-lingual chatbot using ChatGPT 4.0 (OpenAI) to provide seamless support in both English and Korean contexts. The chatbot was designed to address the need for more personalized and culturally sensitive mental health support identified in our previous work while providing an accessible and user-friendly interface for Korean young adults.
Methods: We conducted a mixed methods pilot study with 20 Korean young adults aged 18 to 27 years (mean 23.3, SD 1.96). The HoMemeTown Dr CareSam chatbot was developed using the GPT application programming interface, incorporating features such as a gratitude journal and risk detection. User satisfaction and chatbot performance were evaluated using quantitative surveys and qualitative feedback, with triangulation used to ensure the validity and robustness of findings through cross-verification of data sources. Comparative analyses were conducted with other large language model chatbots and existing digital therapy tools (Woebot [Woebot Health Inc] and Happify [Twill Inc]). Results: Users generally expressed positive views towards the chatbot, with positivity and support receiving the highest score on a 10-point scale (mean 9.0, SD 1.2), followed by empathy (mean 8.7, SD 1.6) and active listening (mean 8.0, SD 1.8). However, areas for improvement were noted in professionalism (mean 7.0, SD 2.0), complexity of content (mean 7.4, SD 2.0), and personalization (mean 7.4, SD 2.4). The chatbot demonstrated statistically significant performance differences compared with other large language model chatbots (F=3.27; P=.047), with more pronounced differences compared with Woebot and Happify (F=12.94; P<.001). Qualitative feedback highlighted the chatbot's strengths in providing empathetic responses and a user-friendly interface, while areas for improvement included response speed and the naturalness of Korean language responses. Conclusions: The HoMemeTown Dr CareSam chatbot shows potential as a cross-lingual mental health support tool, achieving high user satisfaction and demonstrating comparative advantages over existing digital interventions. However, the study's limited sample size and short-term nature necessitate further research. Future studies should include larger-scale clinical trials, enhanced risk detection features, and integration with existing health care systems to fully realize its potential in supporting mental well-being across different linguistic and cultural contexts. UR - https://medinform.jmir.org/2025/1/e63538 UR - http://dx.doi.org/10.2196/63538 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63538 ER - TY - JOUR AU - Wals Zurita, Jesus Amadeo AU - Miras del Rio, Hector AU - Ugarte Ruiz de Aguirre, Nerea AU - Nebrera Navarro, Cristina AU - Rubio Jimenez, Maria AU - Muñoz Carmona, David AU - Miguez Sanchez, Carlos PY - 2025/1/2 TI - The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: Content Analysis JO - JMIR Med Inform SP - e58457 VL - 13 KW - electronic health record KW - EHR KW - oncology KW - radiotherapy KW - data mining KW - ChatGPT KW - large language models KW - LLMs N2 - Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. Objective: We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.
Methods: We implemented a script using the OpenAI application programming interface to extract structured information in JavaScript object notation format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology. We compared the results using metrics such as sensitivity, specificity, precision, accuracy, F-value, κ index, and the McNemar test, in addition to examining the common causes of errors in both humans and generative pretrained transformer (GPT) models. Results: The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred nonexplicit comorbidities, sometimes correctly, though this also resulted in more false positives. Conclusions: This study demonstrates that, with well-designed prompts, the large language models examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation. UR - https://medinform.jmir.org/2025/1/e58457 UR - http://dx.doi.org/10.2196/58457 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58457 ER - TY - JOUR AU - Selim, Rania AU - Basu, Arunima AU - Anto, Ailin AU - Foscht, Thomas AU - Eisingerich, Benedikt Andreas PY - 2024/12/27 TI - Effects of Large Language Model-Based Offerings on the Well-Being of Students: Qualitative Study JO - JMIR Form Res SP - e64081 VL - 8 KW - large language models KW - ChatGPT KW - functional support KW - escapism KW - fantasy fulfillment KW - angst KW - despair KW - anxiety KW - deskilling KW - pessimism about the future N2 - Background: In recent years, the adoption of large language model (LLM) applications, such as ChatGPT, has seen a significant surge, particularly among students. These artificial intelligence-driven tools offer unprecedented access to information and conversational assistance, which is reshaping the way students engage with academic content and manage the learning process. Despite the growing prevalence of LLMs and reliance on these technologies, there remains a notable gap in qualitative in-depth research examining the emotional and psychological effects of LLMs on users' mental well-being. Objective: In order to address these emerging and critical issues, this study explores the role of LLM-based offerings, such as ChatGPT, in students' lives, namely, how postgraduate students use such offerings and how they make students feel, and examines the impact on students' well-being.
Methods: To address the aims of this study, we employed an exploratory approach, using in-depth, semistructured, qualitative, face-to-face interviews with 23 users (13 female and 10 male users; mean age 23 years, SD 1.55 years) of ChatGPT-4o, who were also university students at the time (inclusion criteria). Interviewees were invited to reflect upon how they use ChatGPT, how it makes them feel, and how it may influence their lives. Results: The current findings from the exploratory qualitative interviews showed that users appreciate the functional support (8/23, 35%), escapism (8/23, 35%), and fantasy fulfillment (7/23, 30%) they receive from LLM-based offerings, such as ChatGPT, but at the same time, such usage is seen as a ?double-edged sword,? with respondents indicating anxiety (8/23, 35%), dependence (11/23, 48%), concerns about deskilling (12/23, 52%), and angst or pessimism about the future (11/23, 48%). Conclusions: This study employed exploratory in-depth interviews to examine how the usage of LLM-based offerings, such as ChatGPT, makes users feel and assess the effects of using LLM-based offerings on mental well-being. The findings of this study show that students used ChatGPT to make their lives easier and felt a sense of cognitive escapism and even fantasy fulfillment, but this came at the cost of feeling anxious and pessimistic about the future. UR - https://formative.jmir.org/2024/1/e64081 UR - http://dx.doi.org/10.2196/64081 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/64081 ER - TY - JOUR AU - Sprint, Gina AU - Schmitter-Edgecombe, Maureen AU - Cook, Diane PY - 2024/12/23 TI - Building a Human Digital Twin (HDTwin) Using Large Language Models for Cognitive Diagnosis: Algorithm Development and Validation JO - JMIR Form Res SP - e63866 VL - 8 KW - human digital twin KW - cognitive health KW - cognitive diagnosis KW - large language models KW - artificial intelligence KW - machine learning KW - digital behavior marker KW - interview marker KW - health information KW - chatbot KW - digital twin KW - smartwatch N2 - Background: Human digital twins have the potential to change the practice of personalizing cognitive health diagnosis because these systems can integrate multiple sources of health information and influence into a unified model. Cognitive health is multifaceted, yet researchers and clinical professionals struggle to align diverse sources of information into a single model. Objective: This study aims to introduce a method called HDTwin, for unifying heterogeneous data using large language models. HDTwin is designed to predict cognitive diagnoses and offer explanations for its inferences. Methods: HDTwin integrates cognitive health data from multiple sources, including demographic, behavioral, ecological momentary assessment, n-back test, speech, and baseline experimenter testing session markers. Data are converted into text prompts for a large language model. The system then combines these inputs with relevant external knowledge from scientific literature to construct a predictive model. The model?s performance is validated using data from 3 studies involving 124 participants, comparing its diagnostic accuracy with baseline machine learning classifiers. Results: HDTwin achieves a peak accuracy of 0.81 based on the automated selection of markers, significantly outperforming baseline classifiers. On average, HDTwin yielded accuracy=0.77, precision=0.88, recall=0.63, and Matthews correlation coefficient=0.57. 
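The HDTwin summary statistics just quoted (accuracy, precision, recall, and Matthews correlation coefficient) are standard binary-classification metrics; the minimal sketch below uses made-up diagnosis labels and is not the study's pipeline.

# Sketch: the four summary metrics reported above, computed on hypothetical
# binary labels (1 = cognitive impairment, 0 = healthy).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, matthews_corrcoef)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0]   # clinician-assigned labels (made up)
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1]   # model predictions (made up)

print("accuracy :", round(accuracy_score(y_true, y_pred), 2))
print("precision:", round(precision_score(y_true, y_pred), 2))
print("recall   :", round(recall_score(y_true, y_pred), 2))
print("MCC      :", round(matthews_corrcoef(y_true, y_pred), 2))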
In comparison, the baseline classifiers yielded average accuracy=0.65, precision=0.86, recall=0.35, and Matthews correlation coefficient=0.36. The experiments also reveal that HDTwin yields superior predictive accuracy when information sources are fused compared to single sources. HDTwin?s chatbot interface provides interactive dialogues, aiding in diagnosis interpretation and allowing further exploration of patient data. Conclusions: HDTwin integrates diverse cognitive health data, enhancing the accuracy and explainability of cognitive diagnoses. This approach outperforms traditional models and provides an interface for navigating patient information. The approach shows promise for improving early detection and intervention strategies in cognitive health. UR - https://formative.jmir.org/2024/1/e63866 UR - http://dx.doi.org/10.2196/63866 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63866 ER - TY - JOUR AU - Daluwatte, Chathuri AU - Khromava, Alena AU - Chen, Yuning AU - Serradell, Laurence AU - Chabanon, Anne-Laure AU - Chan-Ou-Teung, Anthony AU - Molony, Cliona AU - Juhaeri, Juhaeri PY - 2024/12/20 TI - Application of a Language Model Tool for COVID-19 Vaccine Adverse Event Monitoring Using Web and Social Media Content: Algorithm Development and Validation Study JO - JMIR Infodemiology SP - e53424 VL - 4 KW - adverse event KW - COVID-19 KW - detection KW - large language model KW - mass vaccination KW - natural language processing KW - pharmacovigilance KW - safety KW - social media KW - vaccine N2 - Background: Spontaneous pharmacovigilance reporting systems are the main data source for signal detection for vaccines. However, there is a large time lag between the occurrence of an adverse event (AE) and the availability for analysis. With global mass COVID-19 vaccination campaigns, social media, and web content, there is an opportunity for real-time, faster monitoring of AEs potentially related to COVID-19 vaccine use. Our work aims to detect AEs from social media to augment those from spontaneous reporting systems. Objective: This study aims to monitor AEs shared in social media and online support groups using medical context-aware natural language processing language models. Methods: We developed a language model?based web app to analyze social media, patient blogs, and forums (from 190 countries in 61 languages) around COVID-19 vaccine?related keywords. Following machine translation to English, lay language safety terms (ie, AEs) were observed using the PubmedBERT-based named-entity recognition model (precision=0.76 and recall=0.82) and mapped to Medical Dictionary for Regulatory Activities (MedDRA) terms using knowledge graphs (MedDRA terminology is an internationally used set of terms relating to medical conditions, medicines, and medical devices that are developed and registered under the auspices of the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use). Weekly and cumulative aggregated AE counts, proportions, and ratios were displayed via visual analytics, such as word clouds. Results: Most AEs were identified in 2021, with fewer in 2022. AEs observed using the web app were consistent with AEs communicated by health authorities shortly before or within the same period. Conclusions: Monitoring the web and social media provides opportunities to observe AEs that may be related to the use of COVID-19 vaccines. 
The presented analysis demonstrates the ability to use web content and social media as a data source that could contribute to the early observation of AEs and enhance postmarketing surveillance. It could help to adjust signal detection strategies and communication with external stakeholders, contributing to increased confidence in vaccine safety monitoring. UR - https://infodemiology.jmir.org/2024/1/e53424 UR - http://dx.doi.org/10.2196/53424 UR - http://www.ncbi.nlm.nih.gov/pubmed/39705077 ID - info:doi/10.2196/53424 ER - TY - JOUR AU - Cao, Lang AU - Sun, Jimeng AU - Cross, Adam PY - 2024/12/18 TI - An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study JO - JMIR Med Inform SP - e60665 VL - 12 KW - rare disease KW - clinical informatics KW - LLM KW - natural language processing KW - machine learning KW - artificial intelligence KW - large language models KW - data extraction KW - ontologies KW - knowledge graphs KW - text mining N2 - Background: Rare diseases affect millions worldwide but sometimes face limited research focus individually due to low prevalence. Many rare diseases do not have specific International Classification of Diseases, Ninth Edition (ICD-9) and Tenth Edition (ICD-10), codes and therefore cannot be reliably extracted from granular fields like ?Diagnosis? and ?Problem List? entries, which complicates tasks that require identification of patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making it unsuitable for these tasks. Objective: Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease?related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD?s performance, aiming to surpass common LLMs and traditional methods. Methods: AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluated our system?s performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology. Results: On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1% and a relation extraction F1-score of 38.6%, marking a 14.4% improvement over the baseline LLM. 
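The entity- and relation-extraction F1-scores reported for AutoRD are conventionally computed by comparing predicted (mention, type) pairs with gold annotations; the tiny sketch below uses invented spans rather than the RareDis2023 scorer.

# Sketch: micro precision/recall/F1 over extracted (text, type) entity pairs,
# using small hypothetical gold and predicted sets.
gold = {("cystic fibrosis", "rare_disease"), ("chronic cough", "sign"),
        ("CFTR", "gene"), ("pancreatic insufficiency", "sign")}
pred = {("cystic fibrosis", "rare_disease"), ("chronic cough", "sign"),
        ("sweat test", "sign")}

tp = len(gold & pred)
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")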
Notably, the F1-score for rare disease entity extraction reached 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information. Conclusions: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts. UR - https://medinform.jmir.org/2024/1/e60665 UR - http://dx.doi.org/10.2196/60665 ID - info:doi/10.2196/60665 ER - TY - JOUR AU - Oh, Soyeon Sarah AU - Kang, Bada AU - Hong, Dahye AU - Kim, Ivy Jennifer AU - Jeong, Hyewon AU - Song, Jinyeop AU - Jeon, Minkyu PY - 2024/11/22 TI - A Multivariable Prediction Model for Mild Cognitive Impairment and Dementia: Algorithm Development and Validation JO - JMIR Med Inform SP - e59396 VL - 12 KW - mild cognitive impairment KW - machine learning algorithms KW - sociodemographic factors KW - gerontology KW - geriatrics KW - older people KW - aging KW - MCI KW - dementia KW - Alzheimer KW - cognitive KW - machine learning KW - prediction KW - algorithm N2 - Background: Mild cognitive impairment (MCI) poses significant challenges in early diagnosis and timely intervention. Underdiagnosis, coupled with the economic and social burden of dementia, necessitates more precise detection methods. Machine learning (ML) algorithms show promise in managing complex data for MCI and dementia prediction. Objective: This study assessed the predictive accuracy of ML models in identifying the onset of MCI and dementia using the Korean Longitudinal Study of Aging (KLoSA) dataset. Methods: This study used data from the KLoSA, a comprehensive biennial survey that tracks the demographic, health, and socioeconomic aspects of middle-aged and older Korean adults from 2018 to 2020. Among the 6171 initial households, 4975 eligible older adult participants aged 60 years or older were selected after excluding individuals based on age and missing data. The identification of MCI and dementia relied on self-reported diagnoses, with sociodemographic and health-related variables serving as key covariates. The dataset was categorized into training and test sets to predict MCI and dementia by using multiple models, including logistic regression, light gradient-boosting machine, XGBoost (extreme gradient boosting), CatBoost, random forest, gradient boosting, AdaBoost, support vector classifier, and k-nearest neighbors, and the training and test sets were used to evaluate predictive performance. The performance was assessed using the area under the receiver operating characteristic curve (AUC). Class imbalances were addressed via weights. Shapley additive explanation values were used to determine the contribution of each feature to the prediction rate. 
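The feature-attribution step described in the Methods above (Shapley additive explanation values over gradient-boosted classifiers, with class imbalance handled by weights) follows a common pattern; the sketch below uses hypothetical tabular features and XGBoost, and is an illustration under those assumptions rather than the KLoSA pipeline.

# Sketch: fit a gradient-boosted classifier on hypothetical tabular features
# and rank features by mean absolute SHAP value.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(2)
n = 2000
X = pd.DataFrame({
    "age": rng.integers(60, 90, n),
    "pain_everyday_life": rng.integers(0, 2, n),
    "lives_alone": rng.integers(0, 2, n),
    "exercises": rng.integers(0, 2, n),
    "education_years": rng.integers(0, 17, n),
})
# Hypothetical outcome loosely tied to two features, for demonstration only.
p = 0.05 + 0.10 * X["pain_everyday_life"] + 0.07 * X["lives_alone"]
y = (rng.random(n) < p).astype(int)

# Class imbalance addressed via a positive-class weight.
model = XGBClassifier(n_estimators=150, max_depth=3,
                      scale_pos_weight=(y == 0).sum() / max((y == 1).sum(), 1))
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # (n_samples, n_features) for binary XGBoost
importance = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name:>20s}  mean|SHAP| = {score:.4f}")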
Results: Among the 4975 participants, the best model for predicting MCI onset was random forest, with a median AUC of 0.6729 (IQR 0.3883-0.8152), followed by k-nearest neighbors with a median AUC of 0.5576 (IQR 0.4555-0.6761) and support vector classifier with a median AUC of 0.5067 (IQR 0.3755-0.6389). For dementia onset prediction, the best model was XGBoost, achieving a median AUC of 0.8185 (IQR 0.8085-0.8285), closely followed by light gradient-boosting machine with a median AUC of 0.8069 (IQR 0.7969-0.8169) and AdaBoost with a median AUC of 0.8007 (IQR 0.7907-0.8107). The Shapley values highlighted pain in everyday life, being widowed, living alone, exercising, and living with a partner as the strongest predictors of MCI. For dementia, the most predictive features were other contributing factors, education at the high school level, education at the middle school level, exercising, and monthly social engagement. Conclusions: ML algorithms, especially XGBoost, exhibited the potential for predicting MCI onset using KLoSA data. However, no model has demonstrated robust accuracy in predicting MCI and dementia. Sociodemographic and health-related factors are crucial for initiating cognitive conditions, emphasizing the need for multifaceted predictive models for early identification and intervention. These findings underscore the potential and limitations of ML in predicting cognitive impairment in community-dwelling older adults. UR - https://medinform.jmir.org/2024/1/e59396 UR - http://dx.doi.org/10.2196/59396 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59396 ER - TY - JOUR AU - Chang, Annie AU - Young, Jade AU - Para, Andrew AU - Lamb, Angela AU - Gulati, Nicholas PY - 2024/11/20 TI - Efficacy of ChatGPT in Educating Patients and Clinicians About Skin Toxicities Associated With Cancer Treatment JO - JMIR Dermatol SP - e54919 VL - 7 KW - artificial intelligence KW - ChatGPT KW - oncodermatology KW - cancer therapy KW - language learning model UR - https://derma.jmir.org/2024/1/e54919 UR - http://dx.doi.org/10.2196/54919 ID - info:doi/10.2196/54919 ER - TY - JOUR AU - Ehrett, Carl AU - Hegde, Sudeep AU - Andre, Kwame AU - Liu, Dixizi AU - Wilson, Timothy PY - 2024/11/19 TI - Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study JO - JMIR Med Educ SP - e51433 VL - 10 KW - data augmentation KW - large language models KW - medical education KW - natural language processing KW - data security KW - ethics KW - AI KW - artificial intelligence KW - data privacy KW - medical staff N2 - Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, particularly for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI?s ChatGPT. Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. 
A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results: The overall best-performing combination of LLMs, temperature, classifier, and number of synthetic data cases is via augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic (AUC) curve of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers? performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field. UR - https://mededu.jmir.org/2024/1/e51433 UR - http://dx.doi.org/10.2196/51433 ID - info:doi/10.2196/51433 ER - TY - JOUR AU - Cho, Na Ha AU - Jun, Joon Tae AU - Kim, Young-Hak AU - Kang, Heejun AU - Ahn, Imjin AU - Gwon, Hansle AU - Kim, Yunha AU - Seo, Jiahn AU - Choi, Heejung AU - Kim, Minkyoung AU - Han, Jiye AU - Kee, Gaeun AU - Park, Seohyun AU - Ko, Soyoung PY - 2024/11/18 TI - Task-Specific Transformer-Based Language Models in Health Care: Scoping Review JO - JMIR Med Inform SP - e49724 VL - 12 KW - transformer-based language models KW - medicine KW - health care KW - medical language model N2 - Background: Transformer-based language models have shown great potential to revolutionize health care by advancing clinical decision support, patient interaction, and disease prediction. However, despite their rapid development, the implementation of transformer-based language models in health care settings remains limited. This is partly due to the lack of a comprehensive review, which hinders a systematic understanding of their applications and limitations. Without clear guidelines and consolidated information, both researchers and physicians face difficulties in using these models effectively, resulting in inefficient research efforts and slow integration into clinical workflows. Objective: This scoping review addresses this gap by examining studies on medical transformer-based language models and categorizing them into 6 tasks: dialogue generation, question answering, summarization, text classification, sentiment analysis, and named entity recognition. Methods: We conducted a scoping review following the Cochrane scoping review protocol. A comprehensive literature search was performed across databases, including Google Scholar and PubMed, covering publications from January 2017 to September 2024. Studies involving transformer-derived models in medical tasks were included. Data were categorized into 6 key tasks. Results: Our key findings revealed both advancements and critical challenges in applying transformer-based models to health care tasks. For example, models like MedPIR involving dialogue generation show promise but face privacy and ethical concerns, while question-answering models like BioBERT improve accuracy but struggle with the complexity of medical terminology. 
The BioBERTSum summarization model aids clinicians by condensing medical texts but needs better handling of long sequences. Conclusions: This review attempted to provide a consolidated understanding of the role of transformer-based language models in health care and to guide future research directions. By addressing current challenges and exploring the potential for real-world applications, we envision significant improvements in health care informatics. Addressing the identified challenges and implementing proposed solutions can enable transformer-based language models to significantly improve health care delivery and patient outcomes. Our review provides valuable insights for future research and practical applications, setting the stage for transformative advancements in medical informatics. UR - https://medinform.jmir.org/2024/1/e49724 UR - http://dx.doi.org/10.2196/49724 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/49724 ER - TY - JOUR AU - Li, Xingyuan AU - Liu, Ke AU - Lang, Yanlin AU - Chai, Zhonglin AU - Liu, Fang PY - 2024/11/15 TI - Exploring the Potential of Claude 3 Opus in Renal Pathological Diagnosis: Performance Evaluation JO - JMIR Med Inform SP - e65033 VL - 12 KW - artificial intelligence KW - Claude 3 Opus KW - renal pathology KW - diagnostic performance KW - large language model KW - LLM KW - performance evaluation KW - medical diagnosis KW - AI language model KW - diagnosis KW - pathology images KW - pathologist KW - clinical relevance KW - accuracy KW - language fluency KW - pathological diagnosis N2 - Background: Artificial intelligence (AI) has shown great promise in assisting medical diagnosis, but its application in renal pathology remains limited. Objective: We evaluated the performance of an advanced AI language model, Claude 3 Opus (Anthropic), in generating diagnostic descriptions for renal pathological images. Methods: We carefully curated a dataset of 100 representative renal pathological images from the Diagnostic Atlas of Renal Pathology (3rd edition). The image selection aimed to cover a wide spectrum of common renal diseases, ensuring a balanced and comprehensive dataset. Claude 3 Opus generated diagnostic descriptions for each image, which were scored by 2 pathologists on clinical relevance, accuracy, fluency, completeness, and overall value. Results: Claude 3 Opus achieved a high mean score in language fluency (3.86) but lower scores in clinical relevance (1.75), accuracy (1.55), completeness (2.01), and overall value (1.75). Performance varied across disease types. Interrater agreement was substantial for relevance (κ=0.627) and overall value (κ=0.589) and moderate for accuracy (κ=0.485) and completeness (κ=0.458). Conclusions: Claude 3 Opus shows potential in generating fluent renal pathology descriptions but needs improvement in accuracy and clinical value. The AI's performance varied across disease types. Addressing the limitations of single-source data and incorporating comparative analyses with other AI approaches are essential steps for future research. Further optimization and validation are needed for clinical applications. UR - https://medinform.jmir.org/2024/1/e65033 UR - http://dx.doi.org/10.2196/65033 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/65033 ER - TY - JOUR AU - Nagarajan, Radha AU - Kondo, Midori AU - Salas, Franz AU - Sezgin, Emre AU - Yao, Yuan AU - Klotzman, Vanessa AU - Godambe, A.
Sandip AU - Khan, Naqi AU - Limon, Alfonso AU - Stephenson, Graham AU - Taraman, Sharief AU - Walton, Nephi AU - Ehwerhemuepha, Louis AU - Pandit, Jay AU - Pandita, Deepti AU - Weiss, Michael AU - Golden, Charles AU - Gold, Adam AU - Henderson, John AU - Shippy, Angela AU - Celi, Anthony Leo AU - Hogan, R. William AU - Oermann, K. Eric AU - Sanger, Terence AU - Martel, Steven PY - 2024/11/14 TI - Economics and Equity of Large Language Models: Health Care Perspective JO - J Med Internet Res SP - e64226 VL - 26 KW - large language model KW - LLM KW - health care KW - economics KW - equity KW - cloud service providers KW - cloud KW - health outcome KW - implementation KW - democratization UR - https://www.jmir.org/2024/1/e64226 UR - http://dx.doi.org/10.2196/64226 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/64226 ER - TY - JOUR AU - Gray, Magnus AU - Milanova, Mariofanna AU - Wu, Leihong PY - 2024/11/12 TI - Enhancing Bias Assessment for Complex Term Groups in Language Embedding Models: Quantitative Comparison of Methods JO - JMIR Med Inform SP - e60272 VL - 12 KW - bias KW - bias measurement KW - natural language processing KW - language models KW - artificial intelligence KW - input embeddings KW - AI KW - assessment KW - decision-making KW - AI-powered tool KW - NLP KW - application KW - AI language models N2 - Background: Artificial intelligence (AI) is rapidly being adopted to build products and aid in the decision-making process across industries. However, AI systems have been shown to exhibit and even amplify biases, causing a growing concern among people worldwide. Thus, investigating methods of measuring and mitigating bias within these AI-powered tools is necessary. Objective: In natural language processing applications, the word embedding association test (WEAT) is a popular method of measuring bias in input embeddings, a common area of measure bias in AI. However, certain limitations of the WEAT have been identified (ie, their nonrobust measure of bias and their reliance on predefined and limited groups of words or sentences), which may lead to inadequate measurements and evaluations of bias. Thus, this study takes a new approach at modifying this popular measure of bias, with a focus on making it more robust and applicable in other domains. Methods: In this study, we introduce the SD-WEAT, which is a modified version of the WEAT that uses the SD of multiple permutations of the WEATs to calculate bias in input embeddings. With the SD-WEAT, we evaluated the biases and stability of several language embedding models, including Global Vectors for Word Representation (GloVe), Word2Vec, and bidirectional encoder representations from transformers (BERT). Results: This method produces results comparable to those of the WEAT, with strong correlations between the methods? bias scores or effect sizes (r=0.786) and P values (r=0.776), while addressing some of its largest limitations. More specifically, the SD-WEAT is more accessible, as it removes the need to predefine attribute groups, and because the SD-WEAT measures bias over multiple runs rather than one, it reduces the impact of outliers and sample size. Furthermore, the SD-WEAT was found to be more consistent and reliable than its predecessor. Conclusions: Thus, the SD-WEAT shows promise for robustly measuring bias in the input embeddings fed to AI language models. 
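The SD-WEAT summarized above extends the classic WEAT effect size by taking the SD over repeated permutations of the test; the sketch below uses toy random word vectors and random halves of the attribute sets as the permutations, which is an assumption about the exact scheme rather than the authors' implementation.

# Sketch: WEAT effect size plus an SD over repeated randomized attribute
# subsets, echoing the SD-WEAT idea. Vectors here are random toy embeddings;
# real use would load GloVe, Word2Vec, or BERT embeddings.
import numpy as np

rng = np.random.default_rng(3)
dim = 50
words = ["nurse", "doctor", "engineer", "teacher",      # target set X
         "violin", "guitar", "piano", "drums",          # target set Y
         "he", "him", "his", "man",                     # attribute set A
         "she", "her", "hers", "woman"]                 # attribute set B
vocab = {w: rng.normal(size=dim) for w in words}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    return np.mean([cos(vocab[w], vocab[a]) for a in A]) - \
           np.mean([cos(vocab[w], vocab[b]) for b in B])

def weat_effect(X, Y, A, B):
    s = {w: assoc(w, A, B) for w in X + Y}
    pooled = np.std([s[w] for w in X + Y], ddof=1)
    return (np.mean([s[x] for x in X]) - np.mean([s[y] for y in Y])) / pooled

X, Y = words[0:4], words[4:8]
A, B = words[8:12], words[12:16]

effects = []
for _ in range(200):                      # repeated WEATs on random attribute halves
    A_sub = list(rng.choice(A, size=2, replace=False))
    B_sub = list(rng.choice(B, size=2, replace=False))
    effects.append(weat_effect(X, Y, A_sub, B_sub))
print(f"mean effect={np.mean(effects):.3f}, SD-WEAT (SD of effects)={np.std(effects):.3f}")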
UR - https://medinform.jmir.org/2024/1/e60272 UR - http://dx.doi.org/10.2196/60272 ID - info:doi/10.2196/60272 ER - TY - JOUR AU - Yau, Yi-Shin Jonathan AU - Saadat, Soheil AU - Hsu, Edmund AU - Murphy, Suk-Ling Linda AU - Roh, S. Jennifer AU - Suchard, Jeffrey AU - Tapia, Antonio AU - Wiechmann, Warren AU - Langdorf, I. Mark PY - 2024/11/4 TI - Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study JO - J Med Internet Res SP - e60291 VL - 26 KW - artificial intelligence KW - AI KW - chatbots KW - generative AI KW - natural language processing KW - consumer health information KW - patient education KW - literacy KW - emergency care information KW - chatbot KW - misinformation KW - health care KW - medical consultation N2 - Background: Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice. Objective: We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions. Methods: We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty for 8 domains of percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to the 10 questions from reputable and scholarly emergency medical references. These were compiled by an EM resident physician. For the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response from readability statistics embedded in Microsoft Word. Differences between chatbots were determined by the chi-square test. Results: Each of the 4 chatbots' responses to the 10 clinical questions was scored across 8 domains by 5 EM faculty, for 400 assessments for each chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Chatbots contained dangerous information in 5% to 35% of responses, with no statistical difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI had similar performances across 6 out of 8 domains. Only Bing AI performed better with more identified or relevant sources (40%; the others had 0%-10%). The Flesch-Kincaid reading level was grade 7.7-8.9 for all chatbots except ChatGPT (grade 10.8), all of which were too advanced for average emergency patients. Responses included both dangerous (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise). Conclusions: AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance.
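The Flesch-Kincaid Grade Level used for the readability results above is a closed-form function of words per sentence and syllables per word; the sketch below uses a naive vowel-group syllable counter, whereas the study relied on Microsoft Word's built-in readability statistics.

# Sketch: Flesch-Kincaid Grade Level with a rough syllable heuristic.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, drop a trailing silent 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def fk_grade_level(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

advice = ("If chest pain lasts more than a few minutes, call emergency services. "
          "Do not drive yourself to the hospital.")
print(round(fk_grade_level(advice), 1))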
Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes. UR - https://www.jmir.org/2024/1/e60291 UR - http://dx.doi.org/10.2196/60291 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60291 ER - TY - JOUR AU - Zeng, Jiaqi AU - Zou, Xiaoyi AU - Li, Shirong AU - Tang, Yao AU - Teng, Sisi AU - Li, Huanhuan AU - Wang, Changyu AU - Wu, Yuxuan AU - Zhang, Luyao AU - Zhong, Yunheng AU - Liu, Jialin AU - Liu, Siru PY - 2024/10/31 TI - Assessing the Role of the Generative Pretrained Transformer (GPT) in Alzheimer?s Disease Management: Comparative Study of Neurologist- and Artificial Intelligence?Generated Responses JO - J Med Internet Res SP - e51095 VL - 26 KW - Alzheimer's disease KW - artificial intelligence KW - AI KW - large language model KW - LLM KW - Generative Pretrained Transformer KW - GPT KW - ChatGPT KW - patient information N2 - Background: Alzheimer?s disease (AD) is a progressive neurodegenerative disorder posing challenges to patients, caregivers, and society. Accessible and accurate information is crucial for effective AD management. Objective: This study aimed to evaluate the accuracy, comprehensibility, clarity, and usefulness of the Generative Pretrained Transformer?s (GPT) answers concerning the management and caregiving of patients with AD. Methods: In total, 14 questions related to the prevention, treatment, and care of AD were identified and posed to GPT-3.5 and GPT-4 in Chinese and English, respectively, and 4 respondent neurologists were asked to answer them. We generated 8 sets of responses (total 112) and randomly coded them in answer sheets. Next, 5 evaluator neurologists and 5 family members of patients were asked to rate the 112 responses using separate 5-point Likert scales. We evaluated the quality of the responses using a set of 8 questions rated on a 5-point Likert scale. To gauge comprehensibility and participant satisfaction, we included 3 questions dedicated to each aspect within the same set of 8 questions. Results: As of April 10, 2023, the 5 evaluator neurologists and 5 family members of patients with AD rated the 112 responses: GPT-3.5: n=28, 25%, responses; GPT-4: n=28, 25%, responses; respondent neurologists: 56 (50%) responses. The top 5 (4.5%) responses rated by evaluator neurologists had 4 (80%) GPT (GPT-3.5+GPT-4) responses and 1 (20%) respondent neurologist?s response. For the top 5 (4.5%) responses rated by patients? family members, all but the third response were GPT responses. Based on the evaluation by neurologists, the neurologist-generated responses achieved a mean score of 3.9 (SD 0.7), while the GPT-generated responses scored significantly higher (mean 4.4, SD 0.6; P<.001). Language and model analyses revealed no significant differences in response quality between the GPT-3.5 and GPT-4 models (GPT-3.5: mean 4.3, SD 0.7; GPT-4: mean 4.4, SD 0.5; P=.51). However, English responses outperformed Chinese responses in terms of comprehensibility (Chinese responses: mean 4.1, SD 0.7; English responses: mean 4.6, SD 0.5; P=.005) and participant satisfaction (Chinese responses: mean 4.2, SD 0.8; English responses: mean 4.5, SD 0.5; P=.04). 
According to the evaluator neurologists' review, Chinese responses had a mean score of 4.4 (SD 0.6), whereas English responses had a mean score of 4.5 (SD 0.5; P=.002). As for the family members of patients with AD, no significant differences were observed between GPT and neurologists, GPT-3.5 and GPT-4, or Chinese and English responses. Conclusions: GPT can provide patient education materials on AD for patients, their families and caregivers, nurses, and neurologists. This capability can contribute to the effective health care management of patients with AD, leading to enhanced patient outcomes. UR - https://www.jmir.org/2024/1/e51095 UR - http://dx.doi.org/10.2196/51095 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/51095 ER - TY - JOUR AU - Abbott, E. Ethan AU - Apakama, Donald AU - Richardson, D. Lynne AU - Chan, Lili AU - Nadkarni, N. Girish PY - 2024/10/30 TI - Leveraging Artificial Intelligence and Data Science for Integration of Social Determinants of Health in Emergency Medicine: Scoping Review JO - JMIR Med Inform SP - e57124 VL - 12 KW - data science KW - social determinants of health KW - natural language processing KW - artificial intelligence KW - NLP KW - machine learning KW - review methods KW - review methodology KW - scoping review KW - emergency medicine KW - PRISMA N2 - Background: Social determinants of health (SDOH) are critical drivers of health disparities and patient outcomes. However, accessing and collecting patient-level SDOH data can be operationally challenging in the emergency department (ED) clinical setting, requiring innovative approaches. Objective: This scoping review examines the potential of AI and data science for modeling, extraction, and incorporation of SDOH data specifically within EDs, further identifying areas for advancement and investigation. Methods: We conducted a standardized search for studies published between 2015 and 2022, across Medline (Ovid), Embase (Ovid), CINAHL, Web of Science, and ERIC databases. We focused on identifying studies using AI or data science related to SDOH within emergency care contexts or conditions. Two specialized reviewers in emergency medicine (EM) and clinical informatics independently assessed each article, resolving discrepancies through iterative reviews and discussion. We then extracted data covering study details, methodologies, patient demographics, care settings, and principal outcomes. Results: Of the 1047 studies screened, 26 met the inclusion criteria. Notably, 9 out of 26 (35%) studies were solely concentrated on ED patients. Conditions studied spanned broad EM complaints and included sepsis, acute myocardial infarction, and asthma. The majority of studies (n=16) explored multiple SDOH domains, with homelessness/housing insecurity and neighborhood/built environment predominating. Machine learning (ML) techniques were used in 23 of 26 studies, with natural language processing (NLP) being the most commonly used approach (n=11). Rule-based NLP (n=5), deep learning (n=2), and pattern matching (n=4) were the most commonly used NLP techniques. NLP models in the reviewed studies displayed significant predictive performance with outcomes, with F1-scores ranging between 0.40 and 0.75 and specificities nearing 95.9%. Conclusions: Although in its infancy, the convergence of AI and data science techniques, especially ML and NLP, with SDOH in EM offers transformative possibilities for better usage and integration of social data into clinical care and research.
With a significant focus on the ED and notable NLP model performance, there is an imperative to standardize SDOH data collection, refine algorithms for diverse patient groups, and champion interdisciplinary synergies. These efforts aim to harness SDOH data optimally, enhancing patient care and mitigating health disparities. Our research underscores the vital need for continued investigation in this domain. UR - https://medinform.jmir.org/2024/1/e57124 UR - http://dx.doi.org/10.2196/57124 ID - info:doi/10.2196/57124 ER - TY - JOUR AU - Joshi, Saubhagya AU - Ha, Eunbin AU - Amaya, Andee AU - Mendoza, Melissa AU - Rivera, Yonaira AU - Singh, K. Vivek PY - 2024/10/30 TI - Ensuring Accuracy and Equity in Vaccination Information From ChatGPT and CDC: Mixed-Methods Cross-Language Evaluation JO - JMIR Form Res SP - e60939 VL - 8 KW - vaccination KW - health equity KW - multilingualism KW - language equity KW - health literacy KW - online health information KW - conversational agents KW - artificial intelligence KW - large language models KW - health information KW - public health N2 - Background: In the digital age, large language models (LLMs) like ChatGPT have emerged as important sources of health care information. Their interactive capabilities offer promise for enhancing health access, particularly for groups facing traditional barriers such as insurance and language constraints. Despite their growing public health use, with millions of medical queries processed weekly, the quality of LLM-provided information remains inconsistent. Previous studies have predominantly assessed ChatGPT's English responses, overlooking the needs of non-English speakers in the United States. This study addresses this gap by evaluating the quality and linguistic parity of vaccination information from ChatGPT and the Centers for Disease Control and Prevention (CDC), emphasizing health equity. Objective: This study aims to assess the quality and language equity of vaccination information provided by ChatGPT and the CDC in English and Spanish. It highlights the critical need for cross-language evaluation to ensure equitable health information access for all linguistic groups. Methods: We conducted a comparative analysis of ChatGPT's and CDC's responses to frequently asked vaccination-related questions in both languages. The evaluation encompassed quantitative and qualitative assessments of accuracy, readability, and understandability. Accuracy was gauged by the perceived level of misinformation; readability, by the Flesch-Kincaid grade level and readability score; and understandability, by items from the National Institutes of Health's Patient Education Materials Assessment Tool (PEMAT) instrument. Results: The study found that both ChatGPT and CDC provided mostly accurate and understandable (eg, scores over 95 out of 100) responses. However, Flesch-Kincaid grade levels often exceeded the American Medical Association's recommended levels, particularly in English (eg, average grade level in English for ChatGPT=12.84, Spanish=7.93, recommended=6). CDC responses outperformed ChatGPT in readability across both languages. Notably, some Spanish responses appeared to be direct translations from English, leading to unnatural phrasing. The findings underscore the potential and challenges of using ChatGPT for health care access. Conclusions: ChatGPT holds potential as a health information resource but requires improvements in readability and linguistic equity to be truly effective for diverse populations.
Crucially, the default user experience with ChatGPT, typically encountered by those without advanced language and prompting skills, can significantly shape health perceptions. This is vital from a public health standpoint, as the majority of users will interact with LLMs in their most accessible form. Ensuring that default responses are accurate, understandable, and equitable is imperative for fostering informed health decisions across diverse communities. UR - https://formative.jmir.org/2024/1/e60939 UR - http://dx.doi.org/10.2196/60939 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60939 ER - TY - JOUR AU - Kim, Kyungmo AU - Park, Seongkeun AU - Min, Jeongwon AU - Park, Sumin AU - Kim, Yeon Ju AU - Eun, Jinsu AU - Jung, Kyuha AU - Park, Elyson Yoobin AU - Kim, Esther AU - Lee, Young Eun AU - Lee, Joonhwan AU - Choi, Jinwook PY - 2024/10/30 TI - Multifaceted Natural Language Processing Task-Based Evaluation of Bidirectional Encoder Representations From Transformers Models for Bilingual (Korean and English) Clinical Notes: Algorithm Development and Validation JO - JMIR Med Inform SP - e52897 VL - 12 KW - natural language processing KW - NLP KW - natural language inference KW - reading comprehension KW - large language models KW - transformer N2 - Background: The bidirectional encoder representations from transformers (BERT) model has attracted considerable attention in clinical applications, such as patient classification and disease prediction. However, current studies have typically progressed to application development without a thorough assessment of the model's comprehension of clinical context. Furthermore, limited comparative studies have been conducted on BERT models using medical documents from non-English-speaking countries. Therefore, the applicability of BERT models trained on English clinical notes to non-English contexts is yet to be confirmed. To address these gaps in the literature, this study focused on identifying the most effective BERT model for non-English clinical notes. Objective: In this study, we evaluated the contextual understanding abilities of various BERT models applied to mixed Korean and English clinical notes. The objective of this study was to identify the BERT model that excels in understanding the context of such documents. Methods: Using data from 164,460 patients in a South Korean tertiary hospital, we pretrained BERT-base, BERT for Biomedical Text Mining (BioBERT), Korean BERT (KoBERT), and Multilingual BERT (M-BERT) to improve their contextual comprehension capabilities and subsequently compared their performances in 7 fine-tuning tasks. Results: The model performance varied based on the task and token usage. First, BERT-base and BioBERT excelled in tasks using classification ([CLS]) token embeddings, such as document classification. BioBERT achieved the highest F1-score of 89.32. Both BERT-base and BioBERT demonstrated their effectiveness in document pattern recognition, even with limited Korean tokens in the dictionary. Second, M-BERT exhibited a superior performance in reading comprehension tasks, achieving an F1-score of 93.77. Better results were obtained when fewer words were replaced with unknown ([UNK]) tokens. Third, M-BERT excelled in the knowledge inference task in which correct disease names were inferred from 63 candidate disease names in a document with disease names replaced with [MASK] tokens. M-BERT achieved the highest hit@10 score of 95.41.
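As an illustrative aside to the knowledge inference task just described, a minimal sketch of scoring candidate disease names at a [MASK] position and checking hit@10 is shown below; the model choice, note text, and candidate list are hypothetical, and real disease names are usually multi-token, which would need extra handling beyond this single-token simplification.

# Minimal sketch: rank candidate tokens at a [MASK] position and compute hit@10.
# Assumes the Hugging Face `transformers` package; all inputs are invented.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

note = "Progressive memory loss and disorientation; suspected diagnosis: [MASK]."
candidates = ["dementia", "pneumonia", "asthma", "appendicitis", "anemia"]

# Score only the candidate tokens and keep the 10 best-scored ones.
ranked = fill_mask(note, targets=candidates, top_k=10)
top_10 = [r["token_str"] for r in ranked]

gold = "dementia"
print("hit@10:", gold in top_10)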
Conclusions: This study highlighted the effectiveness of various BERT models in a multilingual clinical domain. The findings can be used as a reference in clinical and language-based applications. UR - https://medinform.jmir.org/2024/1/e52897 UR - http://dx.doi.org/10.2196/52897 ID - info:doi/10.2196/52897 ER - TY - JOUR AU - Ball Dunlap, A. Patricia AU - Michalowski, Martin PY - 2024/10/25 TI - Advancing AI Data Ethics in Nursing: Future Directions for Nursing Practice, Research, and Education JO - JMIR Nursing SP - e62678 VL - 7 KW - artificial intelligence KW - AI data ethics KW - data-centric AI KW - nurses KW - nursing informatics KW - machine learning KW - data literacy KW - health care AI KW - responsible AI UR - https://nursing.jmir.org/2024/1/e62678 UR - http://dx.doi.org/10.2196/62678 ID - info:doi/10.2196/62678 ER - TY - JOUR AU - So, Jae-hee AU - Chang, Joonhwan AU - Kim, Eunji AU - Na, Junho AU - Choi, JiYeon AU - Sohn, Jy-yong AU - Kim, Byung-Hoon AU - Chu, Hui Sang PY - 2024/10/24 TI - Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study JO - JMIR Form Res SP - e58418 VL - 8 KW - large language model KW - psychiatric interview KW - interview summarization KW - symptom delineation N2 - Background: Recent advancements in large language models (LLMs) have accelerated their use across various domains. Psychiatric interviews, which are goal-oriented and structured, represent a significantly underexplored area where LLMs can provide substantial value. In this study, we explore the application of LLMs to enhance psychiatric interviews by analyzing counseling data from North Korean defectors who have experienced traumatic events and mental health issues. Objective: This study aims to investigate whether LLMs can (1) delineate parts of the conversation that suggest psychiatric symptoms and identify those symptoms, and (2) summarize stressors and symptoms based on the interview dialogue transcript. Methods: Given the interview transcripts, we align the LLMs to perform 3 tasks: (1) extracting stressors from the transcripts, (2) delineating symptoms and their indicative sections, and (3) summarizing the patients based on the extracted stressors and symptoms. These 3 tasks address the 2 objectives, where delineating symptoms is based on the output from the second task, and generating the summary of the interview incorporates the outputs from all 3 tasks. In this context, the transcript data were labeled by mental health experts for the training and evaluation of the LLMs. Results: First, we present the performance of LLMs in estimating (1) the transcript sections related to psychiatric symptoms and (2) the names of the corresponding symptoms. In the zero-shot inference setting using the GPT-4 Turbo model, 73 out of 102 transcript segments demonstrated a recall mid-token distance d<20 for estimating the sections associated with the symptoms. For evaluating the names of the corresponding symptoms, the fine-tuning method demonstrates a performance advantage over the zero-shot inference setting of the GPT-4 Turbo model. On average, the fine-tuning method achieves an accuracy of 0.82, a precision of 0.83, a recall of 0.82, and an F1-score of 0.82. Second, the transcripts are used to generate summaries for each interviewee using LLMs. This generative task was evaluated using metrics such as Generative Evaluation (G-Eval) and Bidirectional Encoder Representations from Transformers Score (BERTScore). 
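Since the preceding record evaluates generated summaries with G-Eval and BERTScore, a minimal sketch of a BERTScore comparison follows; it assumes the open-source bert-score package, and the candidate and reference texts are invented placeholders rather than study data (G-Eval, an LLM-based rubric approach, is not shown).

# Minimal sketch: BERTScore between a generated summary and a reference summary.
# Assumes `pip install bert-score`; texts are hypothetical, not study material.
from bert_score import score

candidates = ["The patient reports insomnia and intrusive memories after resettlement."]
references = ["The interviewee describes sleep problems and recurring traumatic memories."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")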
The summaries generated by the GPT-4 Turbo model, utilizing both symptom and stressor information, achieve high average G-Eval scores: coherence of 4.66, consistency of 4.73, fluency of 2.16, and relevance of 4.67. Furthermore, it is noted that the use of retrieval-augmented generation did not lead to a significant improvement in performance. Conclusions: LLMs, using either (1) appropriate prompting techniques or (2) fine-tuning methods with data labeled by mental health experts, achieved an accuracy of over 0.8 for the symptom delineation task when measured across all segments in the transcript. Additionally, they attained a G-Eval score of over 4.6 for coherence in the summarization task. This research contributes to the emerging field of applying LLMs in psychiatric interviews and demonstrates their potential effectiveness in assisting mental health practitioners. UR - https://formative.jmir.org/2024/1/e58418 UR - http://dx.doi.org/10.2196/58418 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58418 ER - TY - JOUR AU - Achtari, Margaux AU - Salihu, Adil AU - Muller, Olivier AU - Abbé, Emmanuel AU - Clair, Carole AU - Schwarz, Joëlle AU - Fournier, Stephane PY - 2024/10/22 TI - Gender Bias in AI's Perception of Cardiovascular Risk JO - J Med Internet Res SP - e54242 VL - 26 KW - artificial intelligence KW - gender equity KW - coronary artery disease KW - AI KW - cardiovascular KW - risk KW - CAD KW - artery KW - coronary KW - chatbot: health care KW - men: women KW - gender bias KW - gender UR - https://www.jmir.org/2024/1/e54242 UR - http://dx.doi.org/10.2196/54242 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54242 ER - TY - JOUR AU - Nunes, Miguel AU - Bone, Joao AU - Ferreira, C. Joao AU - Elvas, B. Luis PY - 2024/10/21 TI - Health Care Language Models and Their Fine-Tuning for Information Extraction: Scoping Review JO - JMIR Med Inform SP - e60164 VL - 12 KW - language model KW - information extraction KW - healthcare KW - PRISMA-ScR KW - scoping literature review KW - transformers KW - natural language processing KW - European Portuguese N2 - Background: In response to the intricate language, specialized terminology outside everyday life, and the frequent presence of abbreviations and acronyms inherent in health care text data, domain adaptation techniques have emerged as crucial to transformer-based models. This refinement in the knowledge of the language models (LMs) allows for a better understanding of the medical textual data, which results in an improvement in medical downstream tasks, such as information extraction (IE). We have identified a gap in the literature regarding health care LMs. Therefore, this study presents a scoping literature review investigating domain adaptation methods for transformers in health care, differentiating between English and non-English languages, focusing on Portuguese. More specifically, we investigated the development of health care LMs, with the aim of comparing Portuguese with other more developed languages to guide the path of a non-English language with fewer resources. Objective: This study aimed to research health care IE models, regardless of language, to understand the efficacy of transformers and which medical entities are most commonly extracted. Methods: This scoping review was conducted using the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) methodology on Scopus and Web of Science Core Collection databases.
Only studies that mentioned the creation of health care LMs or health care IE models were included, while large language models (LLMs) were excluded. The latter were not included since we wanted to research LMs and not LLMs, which are architecturally different and have distinct purposes. Results: Our search query retrieved 137 studies, 60 of which met the inclusion criteria, and none of them were systematic literature reviews. English and Chinese are the languages with the most health care LMs developed. These languages already have disease-specific LMs, while others only have general health care LMs. European Portuguese does not have any public health care LM and should take examples from other languages to develop, first, general health care LMs and then, in an advanced phase, disease-specific LMs. Regarding IE models, transformers were the most commonly used method, and named entity recognition was the most popular topic, with only a few studies mentioning Assertion Status or addressing medical lexical problems. The most extracted entities were diagnosis, posology, and symptoms. Conclusions: The findings indicate that domain adaptation is beneficial, achieving better results in downstream tasks. Our analysis allowed us to understand that the use of transformers is more developed for the English and Chinese languages. European Portuguese lacks relevant studies and should draw examples from other non-English languages to develop these models and drive progress in AI. Health care professionals could benefit from highlighting medically relevant information and optimizing the reading of the textual data, or this information could be used to create patient medical timelines, allowing for profiling. UR - https://medinform.jmir.org/2024/1/e60164 UR - http://dx.doi.org/10.2196/60164 UR - http://www.ncbi.nlm.nih.gov/pubmed/39432345 ID - info:doi/10.2196/60164 ER - TY - JOUR AU - Liu, Shengyu AU - Wang, Anran AU - Xiu, Xiaolei AU - Zhong, Ming AU - Wu, Sizhu PY - 2024/10/17 TI - Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study JO - JMIR Med Inform SP - e59782 VL - 12 KW - natural language processing KW - NLP KW - model evaluation KW - macrofactors KW - medical named entity recognition models N2 - Background: Named entity recognition (NER) models are essential for extracting structured information from unstructured medical texts by identifying entities such as diseases, treatments, and conditions, enhancing clinical decision-making and research. Innovations in machine learning, particularly those involving Bidirectional Encoder Representations From Transformers (BERT)-based deep learning and large language models, have significantly advanced NER capabilities. However, their performance varies across medical datasets due to the complexity and diversity of medical terminology. Previous studies have often focused on overall performance, neglecting specific challenges in medical contexts and the impact of macrofactors like lexical composition on prediction accuracy. These gaps hinder the development of optimized NER models for medical applications. Objective: This study aims to meticulously evaluate the performance of various NER models in the context of medical text analysis, focusing on how complex medical terminology affects entity recognition accuracy. Additionally, we explored the influence of macrofactors on model performance, seeking to provide insights for refining NER models and enhancing their reliability for medical applications.
Methods: This study comprehensively evaluated 7 NER models (hidden Markov models, conditional random fields, BERT for Biomedical Text Mining, Big Transformer Models for Efficient Long-Sequence Attention, Decoding-enhanced BERT with Disentangled Attention, Robustly Optimized BERT Pretraining Approach, and Gemma) across 3 medical datasets: Revised Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), BioCreative V CDR, and Anatomical Entity Mention (AnatEM). The evaluation focused on prediction accuracy, resource use (eg, central processing unit and graphics processing unit use), and the impact of fine-tuning hyperparameters. The macrofactors affecting model performance were also screened using the multilevel factor elimination algorithm. Results: The fine-tuned BERT for Biomedical Text Mining, with balanced resource use, generally achieved the highest prediction accuracy across the Revised JNLPBA and AnatEM datasets, with microaverage (AVG_MICRO) scores of 0.932 and 0.8494, respectively, highlighting its superior proficiency in identifying medical entities. Gemma, fine-tuned using the low-rank adaptation technique, achieved the highest accuracy on the BioCreative V CDR dataset with an AVG_MICRO score of 0.9962 but exhibited variability across the other datasets (AVG_MICRO scores of 0.9088 on the Revised JNLPBA and 0.8029 on AnatEM), indicating a need for further optimization. In addition, our analysis revealed that 2 macrofactors, entity phrase length and the number of entity words in each entity phrase, significantly influenced model performance. Conclusions: This study highlights the essential role of NER models in medical informatics, emphasizing the imperative for model optimization via precise data targeting and fine-tuning. The insights from this study will notably improve clinical decision-making and facilitate the creation of more sophisticated and effective medical NER models. UR - https://medinform.jmir.org/2024/1/e59782 UR - http://dx.doi.org/10.2196/59782 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59782 ER - TY - JOUR AU - Elyoseph, Zohar AU - Gur, Tamar AU - Haber, Yuval AU - Simon, Tomer AU - Angert, Tal AU - Navon, Yuval AU - Tal, Amir AU - Asman, Oren PY - 2024/10/17 TI - An Ethical Perspective on the Democratization of Mental Health With Generative AI JO - JMIR Ment Health SP - e58011 VL - 11 KW - ethics KW - generative artificial intelligence KW - generative AI KW - mental health KW - ChatGPT KW - large language model KW - LLM KW - digital mental health KW - machine learning KW - AI KW - technology KW - accessibility KW - knowledge KW - GenAI UR - https://mental.jmir.org/2024/1/e58011 UR - http://dx.doi.org/10.2196/58011 ID - info:doi/10.2196/58011 ER - TY - JOUR AU - Hung, W. Tony K. AU - Kuperman, J. Gilad AU - Sherman, J. Eric AU - Ho, L. Alan AU - Weng, Chunhua AU - Pfister, G. David AU - Mao, J.
Jun PY - 2024/10/15 TI - Performance of Retrieval-Augmented Large Language Models to Recommend Head and Neck Cancer Clinical Trials JO - J Med Internet Res SP - e60695 VL - 26 KW - large language model KW - LLM KW - ChatGPT KW - GPT-4 KW - artificial intelligence KW - AI KW - clinical trials KW - decision support KW - LookUpTrials KW - cancer care delivery KW - head and neck oncology KW - head and neck cancer KW - retrieval augmented generation UR - https://www.jmir.org/2024/1/e60695 UR - http://dx.doi.org/10.2196/60695 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60695 ER - TY - JOUR AU - Held, Philip AU - Pridgen, A. Sarah AU - Chen, Yaozhong AU - Akhtar, Zuhaib AU - Amin, Darpan AU - Pohorence, Sean PY - 2024/10/10 TI - A Novel Cognitive Behavioral Therapy-Based Generative AI Tool (Socrates 2.0) to Facilitate Socratic Dialogue: Protocol for a Mixed Methods Feasibility Study JO - JMIR Res Protoc SP - e58195 VL - 13 KW - generative artificial intelligence KW - mental health KW - feasibility KW - cognitive restructuring KW - Socratic dialogue KW - mobile phone N2 - Background: Digital mental health tools, designed to augment traditional mental health treatments, are becoming increasingly important due to a wide range of barriers to accessing mental health care, including a growing shortage of clinicians. Most existing tools use rule-based algorithms, often leading to interactions that feel unnatural compared with human therapists. Large language models (LLMs) offer a solution for the development of more natural, engaging digital tools. In this paper, we detail the development of Socrates 2.0, which was designed to engage users in Socratic dialogue surrounding unrealistic or unhelpful beliefs, a core technique in cognitive behavioral therapies. The multiagent LLM-based tool features an artificial intelligence (AI) therapist, Socrates, which receives automated feedback from an AI supervisor and an AI rater. The combination of multiple agents appeared to help address common LLM issues such as looping, and it improved the overall dialogue experience. Initial user feedback from individuals with lived experiences of mental health problems as well as cognitive behavioral therapists has been positive. Moreover, tests in approximately 500 scenarios showed that Socrates 2.0 engaged in harmful responses in under 1% of cases, with the AI supervisor promptly correcting the dialogue each time. However, formal feasibility studies with potential end users are needed. Objective: This mixed methods study examines the feasibility of Socrates 2.0. Methods: On the basis of the initial data, we devised a formal feasibility study of Socrates 2.0 to gather qualitative and quantitative data about users' and clinicians' experience of interacting with the tool. Using a mixed method approach, the goal is to gather feasibility and acceptability data from 100 users and 50 clinicians to inform the eventual implementation of generative AI tools, such as Socrates 2.0, in mental health treatment. We designed this study to better understand how users and clinicians interact with the tool, including the frequency, length, and time of interactions, users' satisfaction with the tool overall, quality of each dialogue and individual responses, as well as ways in which the tool should be improved before it is used in efficacy trials. Descriptive and inferential analyses will be performed on data from validated usability measures. Thematic analysis will be performed on the qualitative data.
Results: Recruitment will begin in February 2024 and is expected to conclude by February 2025. As of September 25, 2024, overall, 55 participants have been recruited. Conclusions: The development of Socrates 2.0 and the outlined feasibility study are important first steps in applying generative AI to mental health treatment delivery and lay the foundation for formal feasibility studies. International Registered Report Identifier (IRRID): DERR1-10.2196/58195 UR - https://www.researchprotocols.org/2024/1/e58195 UR - http://dx.doi.org/10.2196/58195 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58195 ER - TY - JOUR AU - Chang, Eunsuk AU - Sung, Sumi PY - 2024/10/7 TI - Use of SNOMED CT in Large Language Models: Scoping Review JO - JMIR Med Inform SP - e62924 VL - 12 KW - SNOMED CT KW - ontology KW - knowledge graph KW - large language models KW - natural language processing KW - language models N2 - Background: Large language models (LLMs) have substantially advanced natural language processing (NLP) capabilities but often struggle with knowledge-driven tasks in specialized domains such as biomedicine. Integrating biomedical knowledge sources such as SNOMED CT into LLMs may enhance their performance on biomedical tasks. However, the methodologies and effectiveness of incorporating SNOMED CT into LLMs have not been systematically reviewed. Objective: This scoping review aims to examine how SNOMED CT is integrated into LLMs, focusing on (1) the types and components of LLMs being integrated with SNOMED CT, (2) which contents of SNOMED CT are being integrated, and (3) whether this integration improves LLM performance on NLP tasks. Methods: Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines, we searched ACM Digital Library, ACL Anthology, IEEE Xplore, PubMed, and Embase for relevant studies published from 2018 to 2023. Studies were included if they incorporated SNOMED CT into LLM pipelines for natural language understanding or generation tasks. Data on LLM types, SNOMED CT integration methods, end tasks, and performance metrics were extracted and synthesized. Results: The review included 37 studies. Bidirectional Encoder Representations from Transformers and its biomedical variants were the most commonly used LLMs. Three main approaches for integrating SNOMED CT were identified: (1) incorporating SNOMED CT into LLM inputs (28/37, 76%), primarily using concept descriptions to expand training corpora; (2) integrating SNOMED CT into additional fusion modules (5/37, 14%); and (3) using SNOMED CT as an external knowledge retriever during inference (5/37, 14%). The most frequent end task was medical concept normalization (15/37, 41%), followed by entity extraction or typing and classification. While most studies (17/19, 89%) reported performance improvements after SNOMED CT integration, only a small fraction (19/37, 51%) provided direct comparisons. The reported gains varied widely across different metrics and tasks, ranging from 0.87% to 131.66%. However, some studies showed either no improvement or a decline in certain performance metrics. Conclusions: This review demonstrates diverse approaches for integrating SNOMED CT into LLMs, with a focus on using concept descriptions to enhance biomedical language understanding and generation. 
While the results suggest potential benefits of SNOMED CT integration, the lack of standardized evaluation methods and comprehensive performance reporting hinders definitive conclusions about its effectiveness. Future research should prioritize consistent reporting of performance comparisons and explore more sophisticated methods for incorporating SNOMED CT's relational structure into LLMs. In addition, the biomedical NLP community should develop standardized evaluation frameworks to better assess the impact of ontology integration on LLM performance. UR - https://medinform.jmir.org/2024/1/e62924 UR - http://dx.doi.org/10.2196/62924 UR - http://www.ncbi.nlm.nih.gov/pubmed/39374057 ID - info:doi/10.2196/62924 ER - TY - JOUR AU - Seth, Puneet AU - Carretas, Romina AU - Rudzicz, Frank PY - 2024/10/4 TI - The Utility and Implications of Ambient Scribes in Primary Care JO - JMIR AI SP - e57673 VL - 3 KW - artificial intelligence KW - AI KW - large language model KW - LLM KW - digital scribe KW - ambient scribe KW - organizational efficiency KW - electronic health record KW - documentation burden KW - administrative burden UR - https://ai.jmir.org/2024/1/e57673 UR - http://dx.doi.org/10.2196/57673 UR - http://www.ncbi.nlm.nih.gov/pubmed/39365655 ID - info:doi/10.2196/57673 ER - TY - JOUR AU - Li, Xingang AU - Guo, Heng AU - Li, Dandan AU - Zheng, Yingming PY - 2024/10/4 TI - Engine of Innovation in Hospital Pharmacy: Applications and Reflections of ChatGPT JO - J Med Internet Res SP - e51635 VL - 26 KW - ChatGPT KW - hospital pharmacy KW - natural language processing KW - drug information KW - drug therapy KW - drug interaction KW - scientific research KW - innovation KW - pharmacy KW - quality KW - safety KW - pharmaceutical care KW - tool KW - medical care quality UR - https://www.jmir.org/2024/1/e51635 UR - http://dx.doi.org/10.2196/51635 UR - http://www.ncbi.nlm.nih.gov/pubmed/39365643 ID - info:doi/10.2196/51635 ER - TY - JOUR AU - Yang, Rui AU - Zeng, Qingcheng AU - You, Keen AU - Qiao, Yujie AU - Huang, Lucas AU - Hsieh, Chia-Chun AU - Rosand, Benjamin AU - Goldwasser, Jeremy AU - Dave, Amisha AU - Keenan, Tiarnan AU - Ke, Yuhe AU - Hong, Chuan AU - Liu, Nan AU - Chew, Emily AU - Radev, Dragomir AU - Lu, Zhiyong AU - Xu, Hua AU - Chen, Qingyu AU - Li, Irene PY - 2024/10/3 TI - Ascle - A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study JO - J Med Internet Res SP - e60601 VL - 26 KW - natural language processing KW - machine learning KW - deep learning KW - generative artificial intelligence KW - large language models KW - retrieval-augmented generation KW - healthcare N2 - Background: Medical texts present significant domain-specific challenges, and manually curating these texts is a time-consuming and labor-intensive process. To address this, natural language processing (NLP) algorithms have been developed to automate text processing. In the biomedical field, various toolkits for text processing exist, which have greatly improved the efficiency of handling unstructured text. However, these existing toolkits tend to emphasize different perspectives, and none of them offer generation capabilities, leaving a significant gap in the current offerings. Objective: This study aims to describe the development and preliminary evaluation of Ascle. Ascle is tailored for biomedical researchers and clinical staff with an easy-to-use, all-in-one solution that requires minimal programming expertise.
For the first time, Ascle provides 4 advanced and challenging generative functions: question-answering, text summarization, text simplification, and machine translation. In addition, Ascle integrates 12 essential NLP functions, along with query and search capabilities for clinical databases. Methods: We fine-tuned 32 domain-specific language models and evaluated them thoroughly on 27 established benchmarks. In addition, for the question-answering task, we developed a retrieval-augmented generation (RAG) framework for large language models that incorporated a medical knowledge graph with ranking techniques to enhance the reliability of generated answers. Additionally, we conducted a physician validation to assess the quality of generated content beyond automated metrics. Results: The fine-tuned models and RAG framework consistently enhanced text generation tasks. For example, the fine-tuned models improved the machine translation task by 20.27 in terms of BLEU score. In the question-answering task, the RAG framework raised the ROUGE-L score by 18% over the vanilla models. Physician validation of generated answers showed high scores for readability (4.95/5) and relevancy (4.43/5), with lower scores for accuracy (3.90/5) and completeness (3.31/5). Conclusions: This study introduces the development and evaluation of Ascle, a user-friendly NLP toolkit designed for medical text generation. All code is publicly available through the Ascle GitHub repository. All fine-tuned language models can be accessed through Hugging Face. UR - https://www.jmir.org/2024/1/e60601 UR - http://dx.doi.org/10.2196/60601 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/60601 ER - TY - JOUR AU - Hirosawa, Takanobu AU - Harada, Yukinori AU - Tokumasu, Kazuki AU - Ito, Takahiro AU - Suzuki, Tomoharu AU - Shimizu, Taro PY - 2024/10/2 TI - Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study JO - JMIR Med Inform SP - e63010 VL - 12 KW - artificial intelligence KW - clinical decision support KW - diagnostic excellence KW - generative artificial intelligence KW - large language models KW - natural language processing N2 - Background: Generative artificial intelligence (GAI) systems by Google have recently been updated from Bard to Gemini and Gemini Advanced as of December 2023. Gemini is a basic, free-to-use model after a user's login, while Gemini Advanced operates on a more advanced model requiring a fee-based subscription. These systems have the potential to enhance medical diagnostics. However, the impact of these updates on comprehensive diagnostic accuracy remains unknown. Objective: This study aimed to compare the accuracy of the differential diagnosis lists generated by Gemini Advanced, Gemini, and Bard across comprehensive medical fields using case report series. Methods: We identified a case report series with relevant final diagnoses published in the American Journal of Case Reports from January 2022 to March 2023. After excluding nondiagnostic cases and patients aged 10 years and younger, we included the remaining case reports. After refining the case parts as case descriptions, we input the same case descriptions into Gemini Advanced, Gemini, and Bard to generate the top 10 differential diagnosis lists. In total, 2 expert physicians independently evaluated whether the final diagnosis was included in the lists and its ranking.
Any discrepancies were resolved by another expert physician. Bonferroni correction was applied to adjust the P values for the number of comparisons among 3 GAI systems, setting the corrected significance level at P value <.02. Results: In total, 392 case reports were included. The inclusion rates of the final diagnosis within the top 10 differential diagnosis lists were 73% (286/392) for Gemini Advanced, 76.5% (300/392) for Gemini, and 68.6% (269/392) for Bard. The top diagnoses matched the final diagnoses in 31.6% (124/392) for Gemini Advanced, 42.6% (167/392) for Gemini, and 31.4% (123/392) for Bard. Gemini demonstrated higher diagnostic accuracy than Bard both within the top 10 differential diagnosis lists (P=.02) and as the top diagnosis (P=.001). In addition, Gemini Advanced achieved significantly lower accuracy than Gemini in identifying the most probable diagnosis (P=.002). Conclusions: The results of this study suggest that Gemini outperformed Bard in diagnostic accuracy following the model update. However, Gemini Advanced requires further refinement to optimize its performance for future artificial intelligence-enhanced diagnostics. These findings should be interpreted cautiously and considered primarily for research purposes, as these GAI systems have not been adjusted for medical diagnostics nor approved for clinical use. UR - https://medinform.jmir.org/2024/1/e63010 UR - http://dx.doi.org/10.2196/63010 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/63010 ER - TY - JOUR AU - Choi, K. Yong AU - Lin, Shih-Yin AU - Fick, Marie Donna AU - Shulman, W. Richard AU - Lee, Sangil AU - Shrestha, Priyanka AU - Santoso, Kate PY - 2024/10/1 TI - Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study JO - JMIR Form Res SP - e51383 VL - 8 KW - generative artificial intelligence KW - generative AI KW - large language models KW - ChatGPT KW - delirium detection KW - Sour Seven Questionnaire KW - prompt engineering KW - clinical vignettes KW - medical education KW - caregiver education N2 - Background: Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general-purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpreting assessment results strictly following credible, published scoring criteria, has not been thoroughly studied. Objective: This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization.
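As an aside, the kind of structured prompting described in the objectives above (and elaborated in the Methods that follow) might look roughly like the sketch below; it assumes the openai Python client, and the prompt wording, model choice, and vignette are hypothetical rather than the study's actual materials.

# Minimal sketch of structured prompting in the spirit of the record above:
# instruct the model to answer each questionnaire item strictly "Yes" or "No"
# and to report the result as a small table. All inputs here are invented.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = (
    "You are scoring an informant-based delirium questionnaire. "
    "For each item, answer only 'Yes' or 'No', then output a two-column "
    "table (item | answer) followed by the total score."
)
vignette = (
    "Caregiver reports the patient became suddenly confused overnight "
    "and keeps dozing off mid-sentence."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": vignette},
    ],
)
print(response.choices[0].message.content)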
Methods: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. Results: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive "Yes" or "No" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. Conclusions: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research. UR - https://formative.jmir.org/2024/1/e51383 UR - http://dx.doi.org/10.2196/51383 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/51383 ER - TY - JOUR AU - Sung, Sheng-Feng AU - Hu, Ya-Han AU - Chen, Chong-Yan PY - 2024/10/1 TI - Disambiguating Clinical Abbreviations by One-to-All Classification: Algorithm Development and Validation Study JO - JMIR Med Inform SP - e56955 VL - 12 KW - word sense disambiguation KW - electronic medical records KW - abbreviation expansion KW - text mining KW - natural language processing N2 - Background: Electronic medical records store extensive patient data and serve as a comprehensive repository, including textual medical records like surgical and imaging reports. Their utility in clinical decision support systems is substantial, but the widespread use of ambiguous and unstandardized abbreviations in clinical documents poses challenges for natural language processing in clinical decision support systems. Efficient abbreviation disambiguation methods are needed for effective information extraction. Objective: This study aims to enhance the one-to-all (OTA) framework for clinical abbreviation expansion, which uses a single model to predict multiple abbreviation meanings. The objective is to improve OTA by developing context-candidate pairs and optimizing word embeddings in Bidirectional Encoder Representations From Transformers (BERT), evaluating the model's efficacy in expanding clinical abbreviations using real data. Methods: Three datasets were used: Medical Subject Headings Word Sense Disambiguation, University of Minnesota, and Chia-Yi Christian Hospital from Ditmanson Medical Foundation Chia-Yi Christian Hospital.
Texts containing polysemous abbreviations were preprocessed and formatted for BERT. The study involved fine-tuning pretrained models, ClinicalBERT and BlueBERT, generating dataset pairs for training and testing based on Huang et al's method. Results: BlueBERT achieved macro- and microaccuracies of 95.41% and 95.16%, respectively, on the Medical Subject Headings Word Sense Disambiguation dataset. It improved macroaccuracy by 0.54%-1.53% compared to two baselines, long short-term memory and deepBioWSD with random embedding. On the University of Minnesota dataset, BlueBERT recorded macro- and microaccuracies of 98.40% and 98.22%, respectively. Against the baselines of Word2Vec + support vector machine and BioWordVec + support vector machine, BlueBERT demonstrated a macroaccuracy improvement of 2.61%-4.13%. Conclusions: This research preliminarily validated the effectiveness of the OTA method for abbreviation disambiguation in medical texts, demonstrating the potential to enhance both clinical staff efficiency and research effectiveness. UR - https://medinform.jmir.org/2024/1/e56955 UR - http://dx.doi.org/10.2196/56955 ID - info:doi/10.2196/56955 ER - TY - JOUR AU - Armbruster, Jonas AU - Bussmann, Florian AU - Rothhaas, Catharina AU - Titze, Nadine AU - Grützner, Alfred Paul AU - Freischmidt, Holger PY - 2024/10/1 TI - "Doctor ChatGPT, Can You Help Me?" The Patient's Perspective: Cross-Sectional Study JO - J Med Internet Res SP - e58831 VL - 26 KW - artificial intelligence KW - AI KW - large language models KW - LLM KW - ChatGPT KW - patient education KW - patient information KW - patient perceptions KW - chatbot KW - chatbots KW - empathy N2 - Background: Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether potential risks, such as harmful suggestions, are detected by patients. Objective: This study aims to clarify whether patients can get useful and safe health care advice from an artificial intelligence chatbot assistant. Methods: This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine) from a web-based platform for patients. Responses generated by ChatGPT-4.0 and by an expert panel (EP) of experienced physicians from the aforementioned web-based platform were packed into 10 sets consisting of 10 questions each. The blinded evaluation was carried out by patients regarding empathy and usefulness (assessed through the question: "Would this answer have helped you?") on a scale from 1 to 5. As a control, evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked about the potential harm of the response and its correctness. Results: In total, 200 sets of questions were submitted by 64 patients (mean 45.7, SD 15.9 years; 29/64, 45.3% male), resulting in 2000 evaluated answers of ChatGPT and the EP each. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small bias in terms of levels of empathy given by women in comparison with men (4.46 vs 4.14; P=.049).
Ratings of ChatGPT were high regardless of the participant's age. The same highly significant results were observed in the evaluation of the respective specialist physicians. ChatGPT significantly outperformed the EP in correctness (4.51 vs 3.55; P<.001). Specialists rated the usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower in potentially harmful responses from ChatGPT (P<.001). This was not the case among patients. Conclusions: The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT's responses had a lower percentage of potentially harmful advice than the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers. UR - https://www.jmir.org/2024/1/e58831 UR - http://dx.doi.org/10.2196/58831 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58831 ER - TY - JOUR AU - Franc, Micheal Jeffrey AU - Hertelendy, Julius Attila AU - Cheng, Lenard AU - Hata, Ryan AU - Verde, Manuela PY - 2024/9/30 TI - Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study JO - J Med Internet Res SP - e55648 VL - 26 KW - disaster medicine KW - large language models KW - triage KW - disaster KW - emergency KW - disasters KW - emergencies KW - LLM KW - LLMs KW - GPT KW - ChatGPT KW - language model KW - language models KW - NLP KW - natural language processing KW - artificial intelligence KW - repeatability KW - reproducibility KW - accuracy KW - accurate KW - reproducible KW - repeatable N2 - Background: The release of ChatGPT (OpenAI) in November 2022 drastically reduced the barrier to using artificial intelligence by allowing a simple web-based text interface to a large language model (LLM). One use case where ChatGPT could be useful is in triaging patients at the site of a disaster using the Simple Triage and Rapid Treatment (START) protocol. However, LLMs experience several common errors including hallucinations (also called confabulations) and prompt dependency. Objective: This study addresses the research problem: "Can ChatGPT adequately triage simulated disaster patients using the START protocol?" by measuring three outcomes: repeatability, reproducibility, and accuracy. Methods: Nine prompts were developed by 5 disaster medicine physicians. A Python script queried ChatGPT Version 4 for each prompt combined with 391 validated simulated patient vignettes. Ten repetitions of each combination were performed for a total of 35,190 simulated triages. A reference standard START triage code for each simulated case was assigned by 2 disaster medicine specialists (JMF and MV), with a third specialist (LC) added if the first two did not agree. Results were evaluated using a gage repeatability and reproducibility study (gage R and R). Repeatability was defined as variation due to repeated use of the same prompt. Reproducibility was defined as variation due to the use of different prompts on the same patient vignette. Accuracy was defined as agreement with the reference standard. Results: Although 35,102 (99.7%) queries returned a valid START score, there was considerable variability.
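As an aside, the repeatability and reproducibility definitions just given can be illustrated with a crude variance split on synthetic data; the sketch below is only a rough approximation of the idea behind a gage R and R analysis (which properly uses ANOVA variance components), and every number in it is invented.

# Crude sketch (synthetic data): split the spread of repeated triage scores into
# a repeat-to-repeat component (repeatability) and a between-prompt component
# (reproducibility). Not a full gage R and R analysis.
import pandas as pd

df = pd.DataFrame({
    "prompt":   ["p1"] * 6 + ["p2"] * 6,
    "vignette": ["v1", "v1", "v1", "v2", "v2", "v2"] * 2,
    "score":    [2, 2, 3, 4, 4, 4, 2, 3, 3, 4, 3, 4],  # hypothetical START codes
})

total_var = df["score"].var(ddof=0)
repeatability = df.groupby(["prompt", "vignette"])["score"].var(ddof=0).mean()
reproducibility = df.groupby("prompt")["score"].mean().var(ddof=0)

print(f"repeatability share of variation:   {repeatability / total_var:.1%}")
print(f"reproducibility share of variation: {reproducibility / total_var:.1%}")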
Repeatability (use of the same prompt repeatedly) was 14% of the overall variation. Reproducibility (use of different prompts) was 4.1% of the overall variation. The accuracy of ChatGPT for START was 63.9% with a 32.9% overtriage rate and a 3.1% undertriage rate. Accuracy varied by prompt with a maximum of 71.8% and a minimum of 46.7%. Conclusions: This study indicates that ChatGPT version 4 is insufficient to triage simulated disaster patients via the START protocol. It demonstrated suboptimal repeatability and reproducibility. The overall accuracy of triage was only 63.9%. Health care professionals are advised to exercise caution while using commercial LLMs for vital medical determinations, given that these tools may commonly produce inaccurate data, colloquially referred to as hallucinations or confabulations. Artificial intelligence-guided tools should undergo rigorous statistical evaluation, using methods such as gage R and R, before implementation into clinical settings. UR - https://www.jmir.org/2024/1/e55648 UR - http://dx.doi.org/10.2196/55648 UR - http://www.ncbi.nlm.nih.gov/pubmed/39348189 ID - info:doi/10.2196/55648 ER - TY - JOUR AU - Kumar, Tanuj Ash AU - Wang, Cindy AU - Dong, Alec AU - Rose, Jonathan PY - 2024/9/26 TI - Generation of Backward-Looking Complex Reflections for a Motivational Interviewing-Based Smoking Cessation Chatbot Using GPT-4: Algorithm Development and Validation JO - JMIR Ment Health SP - e53778 VL - 11 KW - motivational interviewing KW - smoking cessation KW - therapy KW - automated therapy KW - natural language processing KW - large language models KW - GPT-4 KW - chatbot KW - dialogue agent KW - reflections KW - reflection generation KW - smoking KW - cessation KW - ChatGPT KW - smokers KW - smoker KW - effectiveness KW - messages N2 - Background: Motivational interviewing (MI) is a therapeutic technique that has been successful in helping smokers reduce smoking but has limited accessibility due to the high cost and low availability of clinicians. To address this, the MIBot project has sought to develop a chatbot that emulates an MI session with a client with the specific goal of moving an ambivalent smoker toward the direction of quitting. One key element of an MI conversation is reflective listening, where a therapist expresses their understanding of what the client has said by uttering a reflection that encourages the client to continue their thought process. Complex reflections link the client's responses to relevant ideas and facts to enhance this contemplation. Backward-looking complex reflections (BLCRs) link the client's most recent response to a relevant selection of the client's previous statements. Our current chatbot can generate complex reflections, but not BLCRs, using large language models (LLMs) such as GPT-2, which allows the generation of unique, human-like messages customized to client responses. Recent advancements in these models, such as the introduction of GPT-4, provide a novel way to generate complex text by feeding the models instructions and conversational history directly, making this a promising approach to generate BLCRs. Objective: This study aims to develop a method to generate BLCRs for an MI-based smoking cessation chatbot and to measure the method's effectiveness. Methods: LLMs such as GPT-4 can be stimulated to produce specific types of responses to their inputs by "asking" them with an English-based description of the desired output.
These descriptions are called prompts, and the goal of writing a description that causes an LLM to generate the required output is termed prompt engineering. We evolved an instruction to prompt GPT-4 to generate a BLCR, given the portions of the transcript of the conversation up to the point where the reflection was needed. The approach was tested on 50 previously collected MIBot transcripts of conversations with smokers and was used to generate a total of 150 reflections. The quality of the reflections was rated on a 4-point scale by 3 independent raters to determine whether they met specific criteria for acceptability. Results: Of the 150 generated reflections, 132 (88%) met the level of acceptability. The remaining 18 (12%) had one or more flaws that made them inappropriate as BLCRs. The 3 raters had pairwise agreement on 80% to 88% of these scores. Conclusions: The method presented to generate BLCRs is good enough to be used as one source of reflections in an MI-style conversation but would need an automatic checker to eliminate the unacceptable ones. This work illustrates the power of the new LLMs to generate therapeutic client-specific responses under the command of a language-based specification. UR - https://mental.jmir.org/2024/1/e53778 UR - http://dx.doi.org/10.2196/53778 ID - info:doi/10.2196/53778 ER - TY - JOUR AU - Shen, Jocelyn AU - DiPaola, Daniella AU - Ali, Safinah AU - Sap, Maarten AU - Park, Won Hae AU - Breazeal, Cynthia PY - 2024/9/25 TI - Empathy Toward Artificial Intelligence Versus Human Experiences and the Role of Transparency in Mental Health and Social Support Chatbot Design: Comparative Study JO - JMIR Ment Health SP - e62679 VL - 11 KW - empathy KW - large language models KW - ethics KW - transparency KW - crowdsourcing KW - human-computer interaction N2 - Background: Empathy is a driving force in our connection to others, our mental well-being, and resilience to challenges. With the rise of generative artificial intelligence (AI) systems, mental health chatbots, and AI social support companions, it is important to understand how empathy unfolds toward stories from human versus AI narrators and how transparency plays a role in user emotions. Objective: We aim to understand how empathy shifts across human-written versus AI-written stories, and how these findings inform ethical implications and human-centered design of using mental health chatbots as objects of empathy. Methods: We conducted crowd-sourced studies with 985 participants who each wrote a personal story and then rated empathy toward 2 retrieved stories, where one was written by a language model, and another was written by a human. Our studies varied disclosing whether a story was written by a human or an AI system to see how transparent author information affects empathy toward the narrator. We conducted mixed methods analyses: through statistical tests, we compared users' self-reported state empathy toward the stories across different conditions. In addition, we qualitatively coded open-ended feedback about reactions to the stories to understand how and why transparency affects empathy toward human versus AI storytellers. Results: We found that participants significantly empathized with human-written over AI-written stories in almost all conditions, regardless of whether they are aware (t196=7.07, P<.001, Cohen d=0.60) or not aware (t298=3.46, P<.001, Cohen d=0.24) that an AI system wrote the story.
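As an aside to the statistical comparison just reported (t statistics with Cohen d effect sizes), a minimal sketch of that kind of calculation on synthetic ratings follows; the numbers are invented and the study's exact design (eg, pairing of observations) may differ.

# Minimal sketch: independent-samples t test and Cohen d on synthetic empathy
# ratings for human- vs AI-written stories. Assumes NumPy and SciPy; all data
# here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
human = rng.normal(3.8, 0.9, 100)  # hypothetical ratings, human-written stories
ai = rng.normal(3.4, 0.9, 100)     # hypothetical ratings, AI-written stories

t, p = stats.ttest_ind(human, ai)
pooled_sd = np.sqrt((human.var(ddof=1) + ai.var(ddof=1)) / 2)
cohen_d = (human.mean() - ai.mean()) / pooled_sd
print(f"t={t:.2f}, P={p:.3f}, Cohen d={cohen_d:.2f}")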
We also found that participants reported greater willingness to empathize with AI-written stories when there was transparency about the story author (t494=−5.49, P<.001, Cohen d=0.36). Conclusions: Our work sheds light on how empathy toward AI or human narrators is tied to the way the text is presented, thus informing ethical considerations of empathetic artificial social support or mental health chatbots. UR - https://mental.jmir.org/2024/1/e62679 UR - http://dx.doi.org/10.2196/62679 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/62679 ER - TY - JOUR AU - AlSaad, Rawan AU - Abd-alrazaq, Alaa AU - Boughorbel, Sabri AU - Ahmed, Arfan AU - Renault, Max-Antoine AU - Damseh, Rafat AU - Sheikh, Javaid PY - 2024/9/25 TI - Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook JO - J Med Internet Res SP - e59505 VL - 26 KW - artificial intelligence KW - large language models KW - multimodal large language models KW - multimodality KW - multimodal generative artificial intelligence KW - multimodal generative AI KW - generative artificial intelligence KW - generative AI KW - health care UR - https://www.jmir.org/2024/1/e59505 UR - http://dx.doi.org/10.2196/59505 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59505 ER - TY - JOUR AU - Zaghir, Jamil AU - Naguib, Marco AU - Bjelogrlic, Mina AU - Névéol, Aurélie AU - Tannier, Xavier AU - Lovis, Christian PY - 2024/9/10 TI - Prompt Engineering Paradigms for Medical Applications: Scoping Review JO - J Med Internet Res SP - e60501 VL - 26 KW - prompt engineering KW - prompt design KW - prompt learning KW - prompt tuning KW - large language models KW - LLMs KW - scoping review KW - clinical natural language processing KW - natural language processing KW - NLP KW - medical texts KW - medical application KW - medical applications KW - clinical practice KW - privacy KW - medicine KW - computer science KW - medical informatics N2 - Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering.
We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research. Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field. UR - https://www.jmir.org/2024/1/e60501 UR - http://dx.doi.org/10.2196/60501 UR - http://www.ncbi.nlm.nih.gov/pubmed/39255030 ID - info:doi/10.2196/60501 ER - TY - JOUR AU - Sanjeewa, Ruvini AU - Iyer, Ravi AU - Apputhurai, Pragalathan AU - Wickramasinghe, Nilmini AU - Meyer, Denny PY - 2024/9/9 TI - Empathic Conversational Agent Platform Designs and Their Evaluation in the Context of Mental Health: Systematic Review JO - JMIR Ment Health SP - e58974 VL - 11 KW - conversational agents KW - chatbots KW - virtual assistants KW - empathy KW - emotionally aware KW - mental health KW - mental well-being N2 - Background: The demand for mental health (MH) services in the community continues to exceed supply. At the same time, technological developments make the use of artificial intelligence-empowered conversational agents (CAs) a real possibility to help fill this gap. Objective: The objective of this review was to identify existing empathic CA design architectures within the MH care sector and to assess their technical performance in detecting and responding to user emotions in terms of classification accuracy. In addition, the approaches used to evaluate empathic CAs within the MH care sector in terms of their acceptability to users were considered. Finally, this review aimed to identify limitations and future directions for empathic CAs in MH care. Methods: A systematic literature search was conducted across 6 academic databases to identify journal articles and conference proceedings using search terms covering 3 topics: "conversational agents," "mental health," and "empathy." Only studies discussing CA interventions for the MH care domain were eligible for this review, with both textual and vocal characteristics considered as possible data inputs. Quality was assessed using appropriate risk of bias and quality tools. Results: A total of 19 articles met all inclusion criteria. Most (12/19, 63%) of these empathic CA designs in MH care were machine learning (ML) based, with 26% (5/19) hybrid engines and 11% (2/19) rule-based systems.
Among the ML-based CAs, 47% (9/19) used neural networks, with transformer-based architectures being well represented (7/19, 37%). The remaining 16% (3/19) of the ML models were unspecified. Technical assessments of these CAs focused on response accuracies and their ability to recognize, predict, and classify user emotions. While single-engine CAs demonstrated good accuracy, the hybrid engines achieved higher accuracy and provided more nuanced responses. Of the 19 studies, human evaluations were conducted in 16 (84%), with only 5 (26%) focusing directly on the CA's empathic features. All these papers used self-reports for measuring empathy, including single or multiple (scale) ratings or qualitative feedback from in-depth interviews. Only 1 (5%) paper included evaluations by both CA users and experts, adding more value to the process. Conclusions: The integration of CA design and its evaluation is crucial to produce empathic CAs. Future studies should focus on using a clear definition of empathy and standardized scales for empathy measurement, ideally including expert assessment. In addition, the diversity in measures used for technical assessment and evaluation poses a challenge for comparing CA performances, which future research should also address. However, CAs with good technical and empathic performance are already available to users of MH care services, showing promise for new applications, such as helpline services. UR - https://mental.jmir.org/2024/1/e58974 UR - http://dx.doi.org/10.2196/58974 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58974 ER - TY - JOUR AU - Reis, Florian AU - Lenz, Christian AU - Gossen, Manfred AU - Volk, Hans-Dieter AU - Drzeniek, Michael Norman PY - 2024/9/5 TI - Practical Applications of Large Language Models for Health Care Professionals and Scientists JO - JMIR Med Inform SP - e58478 VL - 12 KW - artificial intelligence KW - healthcare KW - chatGPT KW - large language model KW - prompting KW - LLM KW - applications KW - AI KW - scientists KW - physicians KW - health care UR - https://medinform.jmir.org/2024/1/e58478 UR - http://dx.doi.org/10.2196/58478 ID - info:doi/10.2196/58478 ER - TY - JOUR AU - Akyon, Handan Seyma AU - Akyon, Cagatay Fatih AU - Camyar, Sefa Ahmet AU - Hızlı, Fatih AU - Sari, Talha AU - Hızlı, Şamil PY - 2024/9/4 TI - Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study JO - JMIR Med Inform SP - e59258 VL - 12 KW - large language models KW - LLM KW - LLMs KW - ChatGPT KW - artificial intelligence KW - AI KW - natural language processing KW - medicine KW - health care KW - GPT KW - machine learning KW - language model KW - language models KW - generative KW - research paper KW - research papers KW - scientific research KW - answer KW - answers KW - response KW - responses KW - comprehension KW - STROBE KW - Strengthening the Reporting of Observational Studies in Epidemiology N2 - Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed.
Objective: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies. Methods: The study is a methodological type of research. The study aims to evaluate the understanding capabilities of new generative artificial intelligence tools in medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed LLMs' understanding of different sections of a research paper. Results: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed statistically significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. LLMs showcased distinct performances for each question across different parts of a scholarly paper, with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding. Conclusions: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models. UR - https://medinform.jmir.org/2024/1/e59258 UR - http://dx.doi.org/10.2196/59258 UR - http://www.ncbi.nlm.nih.gov/pubmed/39230947 ID - info:doi/10.2196/59258 ER - TY - JOUR AU - Heilmeyer, Felix AU - Böhringer, Daniel AU - Reinhard, Thomas AU - Arens, Sebastian AU - Lyssenko, Lisa AU - Haverkamp, Christian PY - 2024/8/28 TI - Viability of Open Large Language Models for Clinical Documentation in German Health Care: Real-World Model Evaluation Study JO - JMIR Med Inform SP - e59617 VL - 12 KW - machine learning KW - ML KW - artificial intelligence KW - AI KW - large language model KW - large language models KW - LLM KW - LLMs KW - natural language processing KW - NLP KW - deep learning KW - algorithm KW - algorithms KW - model KW - models KW - analytics KW - practical model KW - practical models KW - medical documentation KW - writing assistance KW - medical administration KW - writing assistance for physicians N2 - Background: The use of large language models (LLMs) as writing assistance for medical professionals is a promising approach to reduce the time required for documentation, but there may be practical, ethical, and legal challenges in many jurisdictions complicating the use of the most powerful commercial LLM solutions. Objective: In this study, we assessed the feasibility of using nonproprietary LLMs of the GPT variety as writing assistance for medical professionals in an on-premise setting with restricted compute resources, generating German medical text.
Methods: We trained four 7-billion-parameter models with 3 different architectures for our task and evaluated their performance using a powerful commercial LLM, namely Anthropic's Claude-v2, as a rater. Based on this, we selected the best-performing model and evaluated its practical usability with 2 independent human raters on real-world data. Results: In the automated evaluation with Claude-v2, BLOOM-CLP-German, a model trained from scratch on German text, achieved the best results. In the manual evaluation by human experts, 95 (93.1%) of the 102 reports generated by that model were evaluated as usable as is or with only minor changes by both human raters. Conclusions: The results show that even with restricted compute resources, it is possible to generate medical texts that are suitable for documentation in routine clinical practice. However, the target language should be considered in the model selection when processing non-English text. UR - https://medinform.jmir.org/2024/1/e59617 UR - http://dx.doi.org/10.2196/59617 ID - info:doi/10.2196/59617 ER - TY - JOUR AU - Shah-Mohammadi, Fatemeh AU - Finkelstein, Joseph PY - 2024/8/19 TI - Extraction of Substance Use Information From Clinical Notes: Generative Pretrained Transformer-Based Investigation JO - JMIR Med Inform SP - e56243 VL - 12 KW - substance use KW - natural language processing KW - GPT KW - prompt engineering KW - zero-shot learning KW - few-shot learning N2 - Background: Understanding the multifaceted nature of health outcomes requires a comprehensive examination of the social, economic, and environmental determinants that shape individual well-being. Among these determinants, behavioral factors play a crucial role, particularly the consumption patterns of psychoactive substances, which have important implications for public health. The Global Burden of Disease Study shows a growing impact in disability-adjusted life years due to substance use. The successful identification of patients' substance use information equips clinical care teams to address substance-related issues more effectively, enabling targeted support and ultimately improving patient outcomes. Objective: Traditional natural language processing methods face limitations in accurately parsing diverse clinical language associated with substance use. Large language models offer promise in overcoming these challenges by adapting to diverse language patterns. This study investigates the application of the generative pretrained transformer (GPT) model, specifically GPT-3.5, for extracting tobacco, alcohol, and substance use information from patient discharge summaries in zero-shot and few-shot learning settings. This study contributes to the evolving landscape of health care informatics by showcasing the potential of advanced language models in extracting nuanced information critical for enhancing patient care. Methods: The main data source for analysis in this paper is the Medical Information Mart for Intensive Care III data set. Among all notes in this data set, we focused on discharge summaries. Prompt engineering was undertaken, involving an iterative exploration of diverse prompts. Leveraging carefully curated examples and refined prompts, we investigate the model's proficiency through zero-shot as well as few-shot prompting strategies. Results: Results show GPT's varying effectiveness in identifying mentions of tobacco, alcohol, and substance use across learning scenarios.
Zero-shot learning showed high accuracy in identifying substance use, whereas few-shot learning reduced accuracy but improved in identifying substance use status, enhancing recall and F1-score at the expense of lower precision. Conclusions: The excellence of zero-shot learning in precisely extracting text spans mentioning substance use demonstrates its effectiveness in situations in which comprehensive recall is important. Conversely, few-shot learning offers advantages when accurately determining the status of substance use is the primary focus, even if it involves a trade-off in precision. The results contribute to the enhancement of early detection and intervention strategies, help tailor treatment plans with greater precision, and ultimately contribute to a holistic understanding of patient health profiles. By integrating these artificial intelligence-driven methods into electronic health record systems, clinicians can gain immediate, comprehensive insights into substance use that results in shaping interventions that are not only timely but also more personalized and effective. UR - https://medinform.jmir.org/2024/1/e56243 UR - http://dx.doi.org/10.2196/56243 UR - http://www.ncbi.nlm.nih.gov/pubmed/39037700 ID - info:doi/10.2196/56243 ER - TY - JOUR AU - Szumilas, Dawid AU - Ochmann, Anna AU - Zięba, Katarzyna AU - Bartoszewicz, Bartłomiej AU - Kubrak, Anna AU - Makuch, Sebastian AU - Agrawal, Siddarth AU - Mazur, Grzegorz AU - Chudek, Jerzy PY - 2024/8/14 TI - Evaluation of AI-Driven LabTest Checker for Diagnostic Accuracy and Safety: Prospective Cohort Study JO - JMIR Med Inform SP - e57162 VL - 12 KW - LabTest Checker KW - CDSS KW - symptom checker KW - laboratory testing KW - AI KW - assessment KW - accuracy KW - artificial intelligence KW - health care KW - medical fields KW - clinical decision support systems KW - application KW - applications KW - diagnoses KW - patients KW - patient KW - medical history KW - tool KW - tools N2 - Background: In recent years, the implementation of artificial intelligence (AI) in health care is progressively transforming medical fields, with the use of clinical decision support systems (CDSSs) as a notable application. Laboratory tests are vital for accurate diagnoses, but the increasing reliance on them presents challenges. The need for effective strategies for managing laboratory test interpretation is evident from the millions of monthly searches on test results' significance. As the potential role of CDSSs in laboratory diagnostics gains significance, however, more research is needed to explore this area. Objective: The primary objective of our study was to assess the accuracy and safety of LabTest Checker (LTC), a CDSS designed to support medical diagnoses by analyzing both laboratory test results and patients' medical histories. Methods: This cohort study embraced a prospective data collection approach. A total of 101 patients aged ≥18 years, in stable condition, and requiring comprehensive diagnosis were enrolled. A panel of blood laboratory tests was conducted for each participant. Participants used LTC for test result interpretation. The accuracy and safety of the tool were assessed by comparing AI-generated suggestions to experienced doctor (consultant) recommendations, which are considered the gold standard. Results: The system achieved a 74.3% accuracy and 100% sensitivity for emergency safety and 92.3% sensitivity for urgent cases.
It potentially reduced unnecessary medical visits by 41.6% (42/101) and achieved an 82.9% accuracy in identifying underlying pathologies. Conclusions: This study underscores the transformative potential of AI-based CDSSs in laboratory diagnostics, contributing to enhanced patient care, efficient health care systems, and improved medical outcomes. LTC's performance evaluation highlights the advancements in AI's role in laboratory medicine. Trial Registration: ClinicalTrials.gov NCT05813938; https://clinicaltrials.gov/study/NCT05813938 UR - https://medinform.jmir.org/2024/1/e57162 UR - http://dx.doi.org/10.2196/57162 ID - info:doi/10.2196/57162 ER - TY - JOUR AU - Wang, Yijie AU - Chen, Yining AU - Sheng, Jifang PY - 2024/8/8 TI - Assessing ChatGPT as a Medical Consultation Assistant for Chronic Hepatitis B: Cross-Language Study of English and Chinese JO - JMIR Med Inform SP - e56426 VL - 12 KW - chronic hepatitis B KW - artificial intelligence KW - large language models KW - chatbots KW - medical consultation KW - AI in health care KW - cross-linguistic study N2 - Background: Chronic hepatitis B (CHB) imposes substantial economic and social burdens globally. The management of CHB involves intricate monitoring and adherence challenges, particularly in regions like China, where a high prevalence of CHB intersects with health care resource limitations. This study explores the potential of ChatGPT-3.5, an emerging artificial intelligence (AI) assistant, to address these complexities. With notable capabilities in medical education and practice, ChatGPT-3.5's role is examined in managing CHB, particularly in regions with distinct health care landscapes. Objective: This study aimed to uncover insights into ChatGPT-3.5's potential and limitations in delivering personalized medical consultation assistance for CHB patients across diverse linguistic contexts. Methods: Questions sourced from published guidelines, online CHB communities, and search engines in English and Chinese were refined, translated, and compiled into 96 inquiries. Subsequently, these questions were presented to both ChatGPT-3.5 and ChatGPT-4.0 in independent dialogues. The responses were then evaluated by senior physicians, focusing on informativeness, emotional management, consistency across repeated inquiries, and cautionary statements regarding medical advice. Additionally, a true-or-false questionnaire was employed to further discern the variance in information accuracy for closed questions between ChatGPT-3.5 and ChatGPT-4.0. Results: Over half of the responses (228/370, 61.6%) from ChatGPT-3.5 were considered comprehensive. In contrast, ChatGPT-4.0 exhibited a higher percentage at 74.5% (172/222; P<.001). Notably, superior performance was evident in English, particularly in terms of informativeness and consistency across repeated queries. However, deficiencies were identified in emotional management guidance, with only 3.2% (6/186) in ChatGPT-3.5 and 8.1% (15/154) in ChatGPT-4.0 (P=.04). ChatGPT-3.5 included a disclaimer in 10.8% (24/222) of responses, while ChatGPT-4.0 included a disclaimer in 13.1% (29/222) of responses (P=.46). When responding to true-or-false questions, ChatGPT-4.0 achieved an accuracy rate of 93.3% (168/180), significantly surpassing ChatGPT-3.5's accuracy rate of 65.0% (117/180) (P<.001). Conclusions: In this study, ChatGPT demonstrated basic capabilities as a medical consultation assistant for CHB management.
The choice of working language for ChatGPT-3.5 was considered a potential factor influencing its performance, particularly in the use of terminology and colloquial language, and this potentially affects its applicability within specific target populations. However, as an updated model, ChatGPT-4.0 exhibits improved information processing capabilities, overcoming the language impact on information accuracy. This suggests that the implications of model advancement on applications need to be considered when selecting large language models as medical consultation assistants. Given that both models performed inadequately in emotional guidance management, this study highlights the importance of providing specific language training and emotional management strategies when deploying ChatGPT for medical purposes. Furthermore, the tendency of these models to use disclaimers in conversations should be further investigated to understand the impact on patients' experiences in practical applications. UR - https://medinform.jmir.org/2024/1/e56426 UR - http://dx.doi.org/10.2196/56426 UR - http://www.ncbi.nlm.nih.gov/pubmed/39115930 ID - info:doi/10.2196/56426 ER - TY - JOUR AU - Liu, Xu AU - Duan, Chaoli AU - Kim, Min-kyu AU - Zhang, Lu AU - Jee, Eunjin AU - Maharjan, Beenu AU - Huang, Yuwei AU - Du, Dan AU - Jiang, Xian PY - 2024/8/6 TI - Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis JO - JMIR Med Inform SP - e59273 VL - 12 KW - artificial intelligence KW - AI KW - large language model KW - LLM KW - Claude KW - ChatGPT KW - dermatologist N2 - Background: Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown potential in medical fields, including dermatology. With the introduction of image analysis capabilities in LLMs, their application in dermatological diagnostics has garnered significant interest. These capabilities are enabled by the integration of computer vision techniques into the underlying architecture of LLMs. Objective: This study aimed to compare the diagnostic performance of Claude 3 Opus and ChatGPT with GPT-4 in analyzing dermoscopic images for melanoma detection, providing insights into their strengths and limitations. Methods: We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process. The ISIC archive was chosen due to its comprehensive and well-annotated collection of dermoscopic images, ensuring a diverse and representative sample. Images were included if they were dermoscopic images of melanocytic lesions with histopathologically confirmed diagnoses. Each model was given the same prompt, instructing it to provide the top 3 differential diagnoses for each image, ranked by likelihood. Primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability were assessed. The McNemar test was chosen to compare the diagnostic performance of the 2 models, as it is suitable for analyzing paired nominal data. Results: In the primary diagnosis, Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08%-65.37%), 57.14% specificity (95% CI 46.31%-67.46%), and 56% accuracy (95% CI 46.22%-65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99%-67.21%), 38.78% specificity (95% CI 28.77%-49.59%), and 48% accuracy (95% CI 38.37%-57.75%).
The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33%-83.77%) and 78% (95% CI 68.46%-85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared to 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685-9.263) in discriminating malignancy, while ChatGPT-4 had an odds ratio of 0.616 (95% CI 0.297-1.278). Conclusions: Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. These findings underscore the need for developing robust, transparent, and clinically validated AI models through collaborative efforts between AI researchers, dermatologists, and other health care professionals. While AI can provide valuable insights, it cannot yet replace the expertise of trained clinicians. UR - https://medinform.jmir.org/2024/1/e59273 UR - http://dx.doi.org/10.2196/59273 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59273 ER - TY - JOUR AU - Zhang, Boyan AU - Du, Yueqi AU - Duan, Wanru AU - Chen, Zan PY - 2024/8/5 TI - Benchmarking Large Language Models for Cervical Spondylosis JO - JMIR Form Res SP - e55577 VL - 8 KW - cervical spondylosis KW - large language model KW - LLM KW - patient KW - ChatGPT UR - https://formative.jmir.org/2024/1/e55577 UR - http://dx.doi.org/10.2196/55577 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/55577 ER - TY - JOUR AU - Dingel, Julius AU - Kleine, Anne-Kathrin AU - Cecil, Julia AU - Sigl, Leonie Anna AU - Lermer, Eva AU - Gaube, Susanne PY - 2024/8/5 TI - Predictors of Health Care Practitioners' Intention to Use AI-Enabled Clinical Decision Support Systems: Meta-Analysis Based on the Unified Theory of Acceptance and Use of Technology JO - J Med Internet Res SP - e57224 VL - 26 KW - Unified Theory of Acceptance and Use of Technology KW - UTAUT KW - artificial intelligence-enabled clinical decision support systems KW - AI-CDSSs KW - meta-analysis KW - health care practitioners N2 - Background: Artificial intelligence-enabled clinical decision support systems (AI-CDSSs) offer potential for improving health care outcomes, but their adoption among health care practitioners remains limited. Objective: This meta-analysis identified predictors influencing health care practitioners' intention to use AI-CDSSs based on the Unified Theory of Acceptance and Use of Technology (UTAUT). Additional predictors were examined based on existing empirical evidence. Methods: The literature search using electronic databases, forward searches, conference programs, and personal correspondence yielded 7731 results, of which 17 (0.22%) studies met the inclusion criteria. Random-effects meta-analysis, relative weight analyses, and meta-analytic moderation and mediation analyses were used to examine the relationships between relevant predictor variables and the intention to use AI-CDSSs. Results: The meta-analysis results supported the application of the UTAUT to the context of the intention to use AI-CDSSs.
The results showed that performance expectancy (r=0.66), effort expectancy (r=0.55), social influence (r=0.66), and facilitating conditions (r=0.66) were positively associated with the intention to use AI-CDSSs, in line with the predictions of the UTAUT. The meta-analysis further identified positive attitude (r=0.63), trust (r=0.73), anxiety (r=−0.41), perceived risk (r=−0.21), and innovativeness (r=0.54) as additional relevant predictors. Trust emerged as the most influential predictor overall. The results of the moderation analyses show that the relationship between social influence and use intention becomes weaker with increasing age. In addition, the relationship between effort expectancy and use intention was stronger for diagnostic AI-CDSSs than for devices that combined diagnostic and treatment recommendations. Finally, the relationship between facilitating conditions and use intention was mediated through performance and effort expectancy. Conclusions: This meta-analysis contributes to the understanding of the predictors of intention to use AI-CDSSs based on an extended UTAUT model. More research is needed to substantiate the identified relationships and explain the observed variations in effect sizes by identifying relevant moderating factors. The research findings bear important implications for the design and implementation of training programs for health care practitioners to ease the adoption of AI-CDSSs into their practice. UR - https://www.jmir.org/2024/1/e57224 UR - http://dx.doi.org/10.2196/57224 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/57224 ER - TY - JOUR AU - Zhui, Li AU - Fenghe, Li AU - Xuehu, Wang AU - Qining, Fu AU - Wei, Ren PY - 2024/8/1 TI - Ethical Considerations and Fundamental Principles of Large Language Models in Medical Education: Viewpoint JO - J Med Internet Res SP - e60083 VL - 26 KW - medical education KW - artificial intelligence KW - large language models KW - medical ethics KW - AI KW - LLMs KW - ethics KW - academic integrity KW - privacy and data risks KW - data security KW - data protection KW - intellectual property rights KW - educational research UR - https://www.jmir.org/2024/1/e60083 UR - http://dx.doi.org/10.2196/60083 UR - http://www.ncbi.nlm.nih.gov/pubmed/38971715 ID - info:doi/10.2196/60083 ER - TY - JOUR AU - Cherif, Hela AU - Moussa, Chirine AU - Missaoui, Mouhaymen Abdel AU - Salouage, Issam AU - Mokaddem, Salma AU - Dhahri, Besma PY - 2024/7/23 TI - Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination JO - JMIR Med Educ SP - e52818 VL - 10 KW - medical education KW - ChatGPT KW - GPT KW - artificial intelligence KW - natural language processing KW - NLP KW - pulmonary medicine KW - pulmonary KW - lung KW - lungs KW - respiratory KW - respiration KW - pneumology KW - comparative analysis KW - large language models KW - LLMs KW - LLM KW - language model KW - generative AI KW - generative artificial intelligence KW - generative KW - exams KW - exam KW - examinations KW - examination N2 - Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students.
Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple-choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 achieved examination success, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. UR - https://mededu.jmir.org/2024/1/e52818 UR - http://dx.doi.org/10.2196/52818 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/52818 ER - TY - JOUR AU - Adhikary, Kumar Prottay AU - Srivastava, Aseem AU - Kumar, Shivani AU - Singh, Michael Salam AU - Manuja, Puneet AU - Gopinath, K. Jini AU - Krishnan, Vijay AU - Gupta, Kedia Swati AU - Deb, Sinha Koushik AU - Chakraborty, Tanmoy PY - 2024/7/23 TI - Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study JO - JMIR Ment Health SP - e57306 VL - 11 KW - mental health KW - counseling summarization KW - large language models KW - digital health KW - artificial intelligence KW - AI N2 - Background: Comprehensive session summaries enable effective continuity in mental health counseling, facilitating informed therapy planning. However, manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. Leveraging advances in automatic summarization to streamline the summarization process addresses this issue because this enables mental health professionals to access concise summaries of lengthy therapy sessions, thereby increasing their efficiency. However, existing approaches often overlook the nuanced intricacies inherent in counseling interactions. Objective: This study evaluates the effectiveness of state-of-the-art large language models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance.
Methods: We first created Mental Health Counseling-Component-Guided Dialogue Summaries, a benchmarking data set that consists of 191 counseling sessions with summaries focused on 3 distinct counseling components (also known as counseling aspects). Next, we assessed the capabilities of 11 state-of-the-art LLMs in addressing the task of counseling-component-guided summarization. The generated summaries were evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals. Results: Our findings demonstrated the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART evaluated using standard quantitative metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and Bidirectional Encoder Representations from Transformers Score across all aspects of the counseling components. Furthermore, expert evaluation revealed that Mistral superseded both MentalLlama and MentalBART across 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models exhibit a common weakness in terms of room for improvement in the opportunity costs and perceived effectiveness metrics. Conclusions: While LLMs fine-tuned specifically on mental health domain data display better performance based on automatic evaluation scores, expert assessments indicate that these models are not yet reliable for clinical application. Further refinement and validation are necessary before their implementation in practice. UR - https://mental.jmir.org/2024/1/e57306 UR - http://dx.doi.org/10.2196/57306 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/57306 ER - TY - JOUR AU - Laymouna, Moustafa AU - Ma, Yuanchao AU - Lessard, David AU - Schuster, Tibor AU - Engler, Kim AU - Lebouché, Bertrand PY - 2024/7/23 TI - Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review JO - J Med Internet Res SP - e56930 VL - 26 KW - chatbot KW - conversational agent KW - conversational assistant KW - user-computer interface KW - digital health KW - mobile health KW - electronic health KW - telehealth KW - artificial intelligence KW - AI KW - health information technology N2 - Background: Chatbots, or conversational agents, have emerged as significant tools in health care, driven by advancements in artificial intelligence and digital technology. These programs are designed to simulate human conversations, addressing various health care needs. However, no comprehensive synthesis of health care chatbots' roles, users, benefits, and limitations is available to inform future research and application in the field. Objective: This review aims to describe health care chatbots' characteristics, focusing on their diverse roles in the health care pathway, user groups, benefits, and limitations. Methods: A rapid review of published literature from 2017 to 2023 was performed with a search strategy developed in collaboration with a health sciences librarian and implemented in the MEDLINE and Embase databases. Primary research studies reporting on chatbot roles or benefits in health care were included. Two reviewers dual-screened the search results. Extracted data on chatbot roles, users, benefits, and limitations were subjected to content analysis.
Results: The review categorized chatbot roles into 2 themes: delivery of remote health services (including patient support, care management, education, skills building, and health behavior promotion) and provision of administrative assistance to health care providers. User groups spanned patients with chronic conditions as well as patients with cancer; individuals focused on lifestyle improvements; and various demographic groups such as women, families, and older adults. Professionals and students in health care also emerged as significant users, alongside groups seeking mental health support, behavioral change, and educational enhancement. The benefits of health care chatbots were also classified into 2 themes: improvement of health care quality and efficiency and cost-effectiveness in health care delivery. The identified limitations encompassed ethical challenges, medicolegal and safety concerns, technical difficulties, user experience issues, and societal and economic impacts. Conclusions: Health care chatbots offer a wide spectrum of applications, potentially impacting various aspects of health care. While they are promising tools for improving health care efficiency and quality, their integration into the health care system must be approached with consideration of their limitations to ensure optimal, safe, and equitable use. UR - https://www.jmir.org/2024/1/e56930 UR - http://dx.doi.org/10.2196/56930 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/56930 ER - TY - JOUR AU - Knitza, Johannes AU - Tascilar, Koray AU - Fuchs, Franziska AU - Mohn, Jacob AU - Kuhn, Sebastian AU - Bohr, Daniela AU - Muehlensiepen, Felix AU - Bergmann, Christina AU - Labinsky, Hannah AU - Morf, Harriet AU - Araujo, Elizabeth AU - Englbrecht, Matthias AU - Vorbrüggen, Wolfgang AU - von der Decken, Cay-Benedict AU - Kleinert, Stefan AU - Ramming, Andreas AU - Distler, W. Jörg H. AU - Bartz-Bazzanella, Peter AU - Vuillerme, Nicolas AU - Schett, Georg AU - Welcker, Martin AU - Hueber, Axel PY - 2024/7/23 TI - Diagnostic Accuracy of a Mobile AI-Based Symptom Checker and a Web-Based Self-Referral Tool in Rheumatology: Multicenter Randomized Controlled Trial JO - J Med Internet Res SP - e55542 VL - 26 KW - symptom checker KW - artificial intelligence KW - eHealth KW - diagnostic decision support system KW - rheumatology KW - decision support KW - decision KW - diagnostic KW - tool KW - rheumatologists KW - symptom assessment KW - resources KW - randomized controlled trial KW - diagnosis KW - decision support system KW - support system KW - support N2 - Background: The diagnosis of inflammatory rheumatic diseases (IRDs) is often delayed due to unspecific symptoms and a shortage of rheumatologists. Digital diagnostic decision support systems (DDSSs) have the potential to expedite diagnosis and help patients navigate the health care system more efficiently. Objective: The aim of this study was to assess the diagnostic accuracy of a mobile artificial intelligence (AI)-based symptom checker (Ada) and a web-based self-referral tool (Rheport) regarding IRDs. Methods: A prospective, multicenter, open-label, crossover randomized controlled trial was conducted with patients newly presenting to 3 rheumatology centers. Participants were randomly assigned to complete a symptom assessment using either Ada or Rheport.
The primary outcome was the correct identification of IRDs by the DDSSs, defined as the presence of any IRD in the list of suggested diagnoses by Ada or achieving a prespecified threshold score with Rheport. The gold standard was the diagnosis made by rheumatologists. Results: A total of 600 patients were included, among whom 214 (35.7%) were diagnosed with an IRD. The most frequent IRD was rheumatoid arthritis, with 69 (11.5%) patients. Rheport's disease suggestion and Ada's top 1 (D1) and top 5 (D5) disease suggestions demonstrated overall diagnostic accuracies of 52%, 63%, and 58%, respectively, for IRDs. Rheport showed a sensitivity of 62% and a specificity of 47% for IRDs. Ada's D1 and D5 disease suggestions showed a sensitivity of 52% and 66%, respectively, and a specificity of 68% and 54%, respectively, concerning IRDs. Ada's diagnostic accuracy regarding individual diagnoses was heterogeneous, and Ada performed considerably better in identifying rheumatoid arthritis in comparison to other diagnoses (D1: 42%; D5: 64%). The Cohen κ statistic of Rheport for agreement on any rheumatic disease diagnosis with Ada D1 was 0.15 (95% CI 0.08-0.18) and with Ada D5 was 0.08 (95% CI 0.00-0.16), indicating poor agreement for the presence of any rheumatic disease between the 2 DDSSs. Conclusions: To our knowledge, this is the largest comparative DDSS trial with actual use of DDSSs by patients. The diagnostic accuracies of both DDSSs for IRDs were not promising in this high-prevalence patient population. DDSSs may lead to a misuse of scarce health care resources. Our results underscore the need for stringent regulation and drastic improvements to ensure the safety and efficacy of DDSSs. Trial Registration: German Register of Clinical Trials DRKS00017642; https://drks.de/search/en/trial/DRKS00017642 UR - https://www.jmir.org/2024/1/e55542 UR - http://dx.doi.org/10.2196/55542 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/55542 ER - TY - JOUR AU - Kizaki, Hayato AU - Satoh, Hiroki AU - Ebara, Sayaka AU - Watabe, Satoshi AU - Sawada, Yasufumi AU - Imai, Shungo AU - Hori, Satoko PY - 2024/7/23 TI - Construction of a Multi-Label Classifier for Extracting Multiple Incident Factors From Medication Incident Reports in Residential Care Facilities: Natural Language Processing Approach JO - JMIR Med Inform SP - e58141 VL - 12 KW - residential facilities KW - incidents KW - non-medical staff KW - natural language processing KW - risk management N2 - Background: Medication safety in residential care facilities is a critical concern, particularly when nonmedical staff provide medication assistance. The complex nature of medication-related incidents in these settings, coupled with the psychological impact on health care providers, underscores the need for effective incident analysis and preventive strategies. A thorough understanding of the root causes, typically through incident-report analysis, is essential for mitigating medication-related incidents. Objective: We aimed to develop and evaluate a multilabel classifier using natural language processing to identify factors contributing to medication-related incidents using incident report descriptions from residential care facilities, with a focus on incidents involving nonmedical staff. Methods: We analyzed 2143 incident reports, comprising 7121 sentences, from residential care facilities in Japan between April 1, 2015, and March 31, 2016.
The incident factors were annotated using sentences based on an established organizational factor model and previous research findings. The following 9 factors were defined: procedure adherence, medicine, resident, resident family, nonmedical staff, medical staff, team, environment, and organizational management. To assess the label criteria, 2 researchers with relevant medical knowledge annotated a subset of 50 reports; the interannotator agreement was measured using Cohen κ. The entire data set was subsequently annotated by 1 researcher. Multiple labels were assigned to each sentence. A multilabel classifier was developed using deep learning models, including 2 Bidirectional Encoder Representations From Transformers (BERT)-type models (Tohoku-BERT and a University of Tokyo Hospital BERT pretrained with Japanese clinical text: UTH-BERT) and an Efficiently Learning Encoder That Classifies Token Replacements Accurately (ELECTRA), pretrained on Japanese text. Both sentence- and report-level training were performed; the performance was evaluated by the F1-score and exact match accuracy through 5-fold cross-validation. Results: Among all 7121 sentences, 1167, 694, 2455, 23, 1905, 46, 195, 1104, and 195 included "procedure adherence," "medicine," "resident," "resident family," "nonmedical staff," "medical staff," "team," "environment," and "organizational management," respectively. Owing to limited labels, "resident family" and "medical staff" were omitted from the model development process. The interannotator agreement values were higher than 0.6 for each label. A total of 10, 278, and 1855 reports contained no, 1, and multiple labels, respectively. The models trained using the report data outperformed those trained using sentences, with macro F1-scores of 0.744, 0.675, and 0.735 for Tohoku-BERT, UTH-BERT, and ELECTRA, respectively. The report-trained models also demonstrated better exact match accuracy, with 0.411, 0.389, and 0.399 for Tohoku-BERT, UTH-BERT, and ELECTRA, respectively. Notably, the accuracy was consistent even when the analysis was confined to reports containing multiple labels. Conclusions: The multilabel classifier developed in our study demonstrated potential for identifying various factors associated with medication-related incidents using incident reports from residential care facilities. Thus, this classifier can facilitate prompt analysis of incident factors, thereby contributing to risk management and the development of preventive strategies. UR - https://medinform.jmir.org/2024/1/e58141 UR - http://dx.doi.org/10.2196/58141 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58141 ER - TY - JOUR AU - Levinson, T. Rebecca AU - Paul, Cinara AU - Meid, D. Andreas AU - Schultz, Jobst-Hendrik AU - Wild, Beate PY - 2024/7/23 TI - Identifying Predictors of Heart Failure Readmission in Patients From a Statutory Health Insurance Database: Retrospective Machine Learning Study JO - JMIR Cardio SP - e54994 VL - 8 KW - statutory health insurance KW - readmission KW - machine learning KW - heart failure KW - heart KW - cardiology KW - cardiac KW - hospitalization KW - insurance KW - predict KW - predictive KW - prediction KW - predictions KW - predictor KW - predictors KW - all cause N2 - Background: Patients with heart failure (HF) are the most commonly readmitted group of adult patients in Germany. Most patients with HF are readmitted for noncardiovascular reasons.
Understanding the relevance of HF management outside the hospital setting is critical to understanding HF and factors that lead to readmission. Application of machine learning (ML) on data from statutory health insurance (SHI) allows the evaluation of large longitudinal data sets representative of the general population to support clinical decision-making. Objective: This study aims to evaluate the ability of ML methods to predict 1-year all-cause and HF-specific readmission after initial HF-related admission of patients with HF in outpatient SHI data and identify important predictors. Methods: We identified individuals with HF using outpatient data from 2012 to 2018 from the AOK Baden-Württemberg SHI in Germany. We then trained and applied regression and ML algorithms to predict the first all-cause and HF-specific readmission in the year after the first admission for HF. We fitted a random forest, an elastic net, a stepwise regression, and a logistic regression to predict readmission by using diagnosis codes, drug exposures, demographics (age, sex, nationality, and type of coverage within SHI), degree of rurality for residence, and participation in disease management programs for common chronic conditions (diabetes mellitus type 1 and 2, breast cancer, chronic obstructive pulmonary disease, and coronary heart disease). We then evaluated the predictors of HF readmission according to their importance and direction to predict readmission. Results: Our final data set consisted of 97,529 individuals with HF, and 78,044 (80%) were readmitted within the observation period. Of the tested modeling approaches, the random forest approach best predicted 1-year all-cause and HF-specific readmission with a C-statistic of 0.68 and 0.69, respectively. Important predictors for 1-year all-cause readmission included prescription of pantoprazole, chronic obstructive pulmonary disease, atherosclerosis, sex, rurality, and participation in disease management programs for type 2 diabetes mellitus and coronary heart disease. Relevant features for HF-specific readmission included a large number of canonical HF comorbidities. Conclusions: While many of the predictors we identified were known to be relevant comorbidities for HF, we also uncovered several novel associations. Disease management programs have widely been shown to be effective at managing chronic disease; however, our results indicate that in the short term they may be useful for targeting patients with HF with comorbidity at increased risk of readmission. Our results also show that living in a more rural location increases the risk of readmission. Overall, factors beyond comorbid disease were relevant for risk of HF readmission. This finding may impact how outpatient physicians identify and monitor patients at risk of HF readmission. UR - https://cardio.jmir.org/2024/1/e54994 UR - http://dx.doi.org/10.2196/54994 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54994 ER - TY - JOUR AU - Bienefeld, Nadine AU - Keller, Emanuela AU - Grote, Gudela PY - 2024/7/22 TI - Human-AI Teaming in Critical Care: A Comparative Analysis of Data Scientists' and Clinicians'
Perspectives on AI Augmentation and Automation JO - J Med Internet Res SP - e50130 VL - 26 KW - AI in health care KW - human-AI teaming KW - sociotechnical systems KW - intensive care KW - ICU KW - AI adoption KW - AI implementation KW - augmentation KW - automation, health care policy and regulatory foresight KW - explainable AI KW - explainable KW - human-AI KW - human-computer KW - human-machine KW - ethical implications of AI in health care KW - ethical KW - ethic KW - ethics KW - artificial intelligence KW - policy KW - foresight KW - policies KW - recommendation KW - recommendations KW - policy maker KW - policy makers KW - Delphi KW - sociotechnical N2 - Background: Artificial intelligence (AI) holds immense potential for enhancing clinical and administrative health care tasks. However, slow adoption and implementation challenges highlight the need to consider how humans can effectively collaborate with AI within broader socio-technical systems in health care. Objective: In the example of intensive care units (ICUs), we compare data scientists' and clinicians' assessments of the optimal utilization of human and AI capabilities by determining suitable levels of human-AI teaming for safely and meaningfully augmenting or automating 6 core tasks. The goal is to provide actionable recommendations for policy makers and health care practitioners regarding AI design and implementation. Methods: In this multimethod study, we combine a systematic task analysis across 6 ICUs with an international Delphi survey involving 19 health data scientists from the industry and academia and 61 ICU clinicians (25 physicians and 36 nurses) to define and assess optimal levels of human-AI teaming (level 1=no performance benefits; level 2=AI augments human performance; level 3=humans augment AI performance; level 4=AI performs without human input). Stakeholder groups also considered ethical and social implications. Results: Both stakeholder groups chose level 2 and 3 human-AI teaming for 4 out of 6 core tasks in the ICU. For one task (monitoring), level 4 was the preferred design choice. For the task of patient interactions, both data scientists and clinicians agreed that AI should not be used regardless of technological feasibility due to the importance of the physician-patient and nurse-patient relationship and ethical concerns. Human-AI design choices rely on interpretability, predictability, and control over AI systems. If these conditions are not met and AI performs below human-level reliability, a reduction to level 1 or shifting accountability away from human end users is advised. If AI performs at or beyond human-level reliability and these conditions are not met, shifting to level 4 automation should be considered to ensure safe and efficient human-AI teaming. Conclusions: By considering the sociotechnical system and determining appropriate levels of human-AI teaming, our study showcases the potential for improving the safety and effectiveness of AI usage in ICUs and broader health care settings. Regulatory measures should prioritize interpretability, predictability, and control if clinicians hold full accountability. Ethical and social implications must be carefully evaluated to ensure effective collaboration between humans and AI, particularly considering the most recent advancements in generative AI.
UR - https://www.jmir.org/2024/1/e50130 UR - http://dx.doi.org/10.2196/50130 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/50130 ER - TY - JOUR AU - Chen, Xi AU - Wang, Li AU - You, MingKe AU - Liu, WeiZhi AU - Fu, Yu AU - Xu, Jie AU - Zhang, Shaoting AU - Chen, Gang AU - Li, Kang AU - Li, Jian PY - 2024/7/22 TI - Evaluating and Enhancing Large Language Models' Performance in Domain-Specific Medicine: Development and Usability Study With DocOA JO - J Med Internet Res SP - e58158 VL - 26 KW - large language model KW - retrieval-augmented generation KW - domain-specific benchmark framework KW - osteoarthritis management N2 - Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. Objective: This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study. Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM designed for OA management integrating retrieval-augmented generation and instructional prompts, was developed. It can identify the clinical evidence upon which its answers are based through retrieval-augmented generation, thereby demonstrating the explainability of those answers. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations. Results: Results showed that general LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. Conclusions: This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs. UR - https://www.jmir.org/2024/1/e58158 UR - http://dx.doi.org/10.2196/58158 UR - http://www.ncbi.nlm.nih.gov/pubmed/38833165 ID - info:doi/10.2196/58158 ER - TY - JOUR AU - Wu, Qingxia AU - Li, Huali AU - Wang, Yan AU - Bai, Yan AU - Wu, Yaping AU - Yu, Xuan AU - Li, Xiaodong AU - Dong, Pei AU - Xue, Jon AU - Shen, Dinggang AU - Wang, Meiyun PY - 2024/7/17 TI - Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study JO - JMIR Med Inform SP - e55799 VL - 12 KW - Radiology Reporting and Data Systems KW - LI-RADS KW - Lung-RADS KW - O-RADS KW - large language model KW - ChatGPT KW - chatbot KW - chatbots KW - categorization KW - recommendation KW - recommendations KW - accuracy N2 - Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored. Objective: This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and assess the impact of different prompting strategies. Methods: This cross-sectional study compared 3 chatbots using 30 radiology reports (10 per RADS criteria) and a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts.
The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses at patient-level RADS categorization and overall ratings. The agreement across repetitions was assessed using Fleiss κ. Results: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018. Conclusions: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria. UR - https://medinform.jmir.org/2024/1/e55799 UR - http://dx.doi.org/10.2196/55799 UR - http://www.ncbi.nlm.nih.gov/pubmed/39018102 ID - info:doi/10.2196/55799 ER - TY - JOUR AU - Herman Bernardim Andrade, Gabriel AU - Yada, Shuntaro AU - Aramaki, Eiji PY - 2024/7/2 TI - Is Boundary Annotation Necessary? Evaluating Boundary-Free Approaches to Improve Clinical Named Entity Annotation Efficiency: Case Study JO - JMIR Med Inform SP - e59680 VL - 12 KW - natural language processing KW - named entity recognition KW - information extraction KW - text annotation KW - entity boundaries KW - lenient annotation KW - case reports KW - annotation KW - case study KW - medical case report KW - efficiency KW - model KW - model performance KW - dataset KW - Japan KW - Japanese KW - entity KW - clinical domain KW - clinical N2 - Background: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreements between annotators due to questions such as whether modifiers or peripheral words should be annotated. If unresolved, these can induce inconsistency in the produced corpora, yet, on the other hand, strict guidelines or adjudication sessions can further prolong an already slow and convoluted process. Objective: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, aiming to mitigate the difficulty of precisely determining entity boundaries.
Methods: We evaluate their effects through an annotation case study on a Japanese medical case report data set. We compare annotation time, annotator agreement, and the quality of the produced labeling and assess the impact on the performance of an NER system trained on the annotated corpus. Results: We saw significant improvements in the labeling process efficiency, with up to a 25% reduction in overall annotation time and even a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best-performing NER model showed some drop in performance compared to the traditional annotation methodology. Conclusions: Our findings demonstrate a balance between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in the annotator's workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers. UR - https://medinform.jmir.org/2024/1/e59680 UR - http://dx.doi.org/10.2196/59680 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59680 ER - TY - JOUR AU - Xu, Jie AU - Lu, Lu AU - Peng, Xinwei AU - Pang, Jiali AU - Ding, Jinru AU - Yang, Lingrui AU - Song, Huan AU - Li, Kang AU - Sun, Xin AU - Zhang, Shaoting PY - 2024/6/28 TI - Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation JO - JMIR Med Inform SP - e57674 VL - 12 KW - ChatGPT KW - LLM KW - assessment KW - data set KW - benchmark KW - medicine N2 - Background: Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to perceive and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and build a systematic evaluation. Objective: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks. Methods: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, existing candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets to interact with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by chatbots based on LLMs were recorded for blind evaluations by 5 licensed medical experts. The evaluation criteria that were obtained covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory. Results: Dr PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogues and case report scenarios.
Dr PJ also outperformed ChatGPT in the semantic consistency rate and complete error rate category, indicating better robustness. However, Dr PJ had slightly lower scores in medical professional capabilities compared with ChatGPT in the multiple-turn dialogue scenario. Conclusions: MedGPTEval provides comprehensive criteria to evaluate chatbots by LLMs in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, such an assessment system can be easily adopted by researchers in this community to augment an open-source data set. UR - https://medinform.jmir.org/2024/1/e57674 UR - http://dx.doi.org/10.2196/57674 ID - info:doi/10.2196/57674 ER - TY - JOUR AU - Gwon, Nam Yong AU - Kim, Heon Jae AU - Chung, Soo Hyun AU - Jung, Jee Eun AU - Chun, Joey AU - Lee, Serin AU - Shim, Ryul Sung PY - 2024/5/14 TI - The Use of Generative AI for Scientific Literature Searches for Systematic Reviews: ChatGPT and Microsoft Bing AI Performance Evaluation JO - JMIR Med Inform SP - e51187 VL - 12 KW - artificial intelligence KW - search engine KW - systematic review KW - evidence-based medicine KW - ChatGPT KW - language model KW - education KW - tool KW - clinical decision support system KW - decision support KW - support KW - treatment N2 - Background: A large language model is a type of artificial intelligence (AI) model that opens up great possibilities for health care practice, research, and education, although scholars have emphasized the need to proactively address the issue of unvalidated and inaccurate information regarding its use. One of the best-known large language models is ChatGPT (OpenAI). It is believed to be of great help to medical research, as it facilitates more efficient data set analysis, code generation, and literature review, allowing researchers to focus on experimental design as well as drug discovery and development. Objective: This study aims to explore the potential of ChatGPT as a real-time literature search tool for systematic reviews and clinical decision support systems, to enhance their efficiency and accuracy in health care settings. Methods: The search results of a published systematic review by human experts on the treatment of Peyronie disease were selected as a benchmark, and the literature search formula of the study was applied to ChatGPT and Microsoft Bing AI as a comparison to human researchers. Peyronie disease typically presents with discomfort, curvature, or deformity of the penis in association with palpable plaques and erectile dysfunction. To evaluate the quality of individual studies derived from AI answers, we created a structured rating system based on bibliographic information related to the publications. We classified its answers into 4 grades if the title existed: A, B, C, and F. No grade was given for a fake title or no answer. Results: From ChatGPT, 7 (0.5%) out of 1287 identified studies were directly relevant, whereas Bing AI resulted in 19 (40%) relevant studies out of 48, compared to the human benchmark of 24 studies. In the qualitative evaluation, ChatGPT had 7 grade A, 18 grade B, 167 grade C, and 211 grade F studies, and Bing AI had 19 grade A and 28 grade C studies. Conclusions: This is the first study to compare AI and conventional human systematic review methods as a real-time literature collection tool for evidence-based medicine. 
The results suggest that the use of ChatGPT as a tool for real-time evidence generation is not yet accurate and feasible. Therefore, researchers should be cautious about using such AI. The limitations of this study, which used the generative pre-trained transformer model, are that the search for research topics was not diverse and that it did not prevent the hallucination of generative AI. However, this study will serve as a standard for future studies by providing an index to verify the reliability and consistency of generative AI from a user's point of view. If the reliability and consistency of AI literature search services are verified, then the use of these technologies will help medical research greatly. UR - https://medinform.jmir.org/2024/1/e51187 UR - http://dx.doi.org/10.2196/51187 ID - info:doi/10.2196/51187 ER - TY - JOUR AU - Preiksaitis, Carl AU - Ashenburg, Nicholas AU - Bunney, Gabrielle AU - Chu, Andrew AU - Kabeer, Rana AU - Riley, Fran AU - Ribeira, Ryan AU - Rose, Christian PY - 2024/5/10 TI - The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review JO - JMIR Med Inform SP - e53787 VL - 12 KW - large language model KW - LLM KW - emergency medicine KW - clinical decision support KW - workflow efficiency KW - medical education KW - artificial intelligence KW - AI KW - natural language processing KW - NLP KW - AI literacy KW - ChatGPT KW - Bard KW - Pathways Language Model KW - Med-PaLM KW - Bidirectional Encoder Representations from Transformers KW - BERT KW - generative pretrained transformer KW - GPT KW - United States KW - US KW - China KW - scoping review KW - Preferred Reporting Items for Systematic Reviews and Meta-Analyses KW - PRISMA KW - decision support KW - risk KW - ethics KW - education KW - communication KW - medical training KW - physician KW - health literacy KW - emergency care N2 - Background: Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. Objective: Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. Methods: Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. Results: A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China.
We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. Conclusions: LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied. UR - https://medinform.jmir.org/2024/1/e53787 UR - http://dx.doi.org/10.2196/53787 UR - http://www.ncbi.nlm.nih.gov/pubmed/38728687 ID - info:doi/10.2196/53787 ER - TY - JOUR AU - Wang, Lei AU - Ma, Yinyao AU - Bi, Wenshuai AU - Lv, Hanlin AU - Li, Yuxiang PY - 2024/3/29 TI - An Entity Extraction Pipeline for Medical Text Records Using Large Language Models: Analytical Study JO - J Med Internet Res SP - e54580 VL - 26 KW - clinical data extraction KW - large language models KW - feature hallucination KW - modular approach KW - unstructured data processing N2 - Background: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature hallucination issues of LLMs require further attention. Objective: This study aimed to introduce a novel modular LLM pipeline, which could semantically extract features from textual patient admission records. Methods: The pipeline was designed to process a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, which was tested via 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People's Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation with the help of a local expert's annotation. The pipeline was evaluated with the metrics of accuracy and precision, null ratio, and time consumption. Additionally, we evaluated its performance via a quantized version of Qwen-14B-Chat on a consumer-grade GPU.
Results: The pipeline demonstrates a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52% and 92.93%, respectively) and Baichuan2-13B-Chat (95.86% and 90.08%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. The INT4-quantized version of QWEN delivered an enhanced performance with 97.28% accuracy and a 0% null ratio. Conclusions: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records. UR - https://www.jmir.org/2024/1/e54580 UR - http://dx.doi.org/10.2196/54580 UR - http://www.ncbi.nlm.nih.gov/pubmed/38551633 ID - info:doi/10.2196/54580 ER - TY - JOUR AU - Castonguay, Alexandre AU - Lovis, Christian PY - 2023/12/21 TI - Introducing the "AI Language Models in Health Care" Section: Actionable Strategies for Targeted and Wide-Scale Deployment JO - JMIR Med Inform SP - e53785 VL - 11 KW - generative AI KW - health care digitalization KW - AI in health care KW - digital health standards KW - AI implementation KW - artificial intelligence UR - https://medinform.jmir.org/2023/1/e53785 UR - http://dx.doi.org/10.2196/53785 UR - http://www.ncbi.nlm.nih.gov/pubmed/38127431 ID - info:doi/10.2196/53785 ER -
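
Several records above report chance-corrected agreement as Fleiss κ (eg, the interrun agreement of repeated chatbot RADS categorizations). As a minimal illustration of how that statistic is computed, the following Python sketch implements Fleiss κ from scratch; the ratings matrix is hypothetical example data, not data from any of the cited studies.

# Minimal sketch: Fleiss' kappa for agreement across repeated runs or raters.
# counts[i][j] = number of runs that assigned subject i (eg, a radiology report)
# to category j (eg, a RADS class). All rows must sum to the same number of runs.

def fleiss_kappa(counts):
    N = len(counts)             # number of subjects
    n = sum(counts[0])          # ratings per subject (eg, 6 repeated runs)
    k = len(counts[0])          # number of categories
    # marginal proportion of all ratings falling in each category
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # per-subject observed agreement, then its mean
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # expected chance agreement
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

if __name__ == "__main__":
    example = [   # 5 hypothetical reports, 6 runs each, 3 candidate categories
        [6, 0, 0],
        [4, 2, 0],
        [0, 5, 1],
        [1, 1, 4],
        [0, 0, 6],
    ]
    print(f"Fleiss' kappa = {fleiss_kappa(example):.2f}")  # 0.56 for this example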