Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

doi:10.2196/64963

Published on 25.Apr.2025 in Vol 13 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/64963, first published 31.Jul.2024.



Guxue Shan¹; Xiaonan Chen¹; Chen Wang¹; Li Liu²; Yuanjing Gu³; Huiping Jiang⁴; Tingqi Shi⁵



Citation

Please cite as:

Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T
Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis
JMIR Med Inform 2025;13:e64963
doi: 10.2196/64963 PMID: 40279517 PMCID: 12047852


This paper is in the following e-collection/theme issue:

Reviews in Medical Informatics; Digital Health Reviews; Clinical Information and Decision Making; Generative Language Models Including ChatGPT; AI Language Models in Health Care; Foundation Models and Their Applications in AI
