Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

doi:10.2196/66917

Journals

Omar M, Hijazi K, Omar M, Nadkarni G, Klang E. Performance of large language models on family medicine licensing exams. Family Practice 2025;42(4)
Omar M, Glicksberg B, Nadkarni G, Klang E. Refining LLMs outputs with iterative consensus ensemble (ICE). Computers in Biology and Medicine 2025;196:110731
Huang Y, Yang G, Shen Y, Chen H, Wu W, Li X, Wu Y, Zhang K, Xu J, Zhang J. Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study. JMIR Medical Informatics 2025;13:e73941
Fujita W, Sakamoto A, Sato E, Kaneko T, Kagiyama N. Transformative Impact of Artificial Intelligence on Internal Medicine: Current Applications, Challenges, and Future Horizons for Urban Health. Juntendo Medical Journal 2025;71(6):389
Akinniranye O, Akinniranye O. Performance of Large Language Models and Top-Decile Doctors on an Undergraduate Ophthalmology Examination. Cureus 2025
Thelwall M, Yang Y. Implicit and explicit research quality score probabilities from ChatGPT. Quantitative Science Studies 2025;6:1271
Kaczmarczyk R, Pieroh P, Koob S, Fröschen F, Scheidt S, Welle K, Martin R, Roos J. Application of Vision-Language Models in the Automatic Recognition of Bone Tumors on Radiographs: A Retrospective Study. AI 2025;6(12):327
Li P, Xu Y, Liu X, Shen Z, Wang Y, Lv X, Lu Z, Wu H, Zhuang J, Chen Y. Large Language Models in Patient Health Communication for Atherosclerotic Cardiovascular Disease: Pilot Cross-Sectional Comparative Analysis. JMIR Medical Informatics 2026;14:e81422
Callens S. Effective prompt design for large language models in clinical practice. Acta Clinica Belgica 2026:1
Jaworski W, Dolata T, Sawina P, Latkowska A, Olender M, Wielochowska A, Boczkowski D, Radej D, Kowalczyk A, Majchrowicz W, Loson M, Kruplewicz M, Stachowicz A, Kubiak M, Dadynska P. Comparison of GPT-5 and GPT-4o in Solving the Polish Centre for Medical Examinations (CEM) Gastroenterology Examination. Cureus 2026
Chang Y, Ju P, Hsieh M, Chang C. Impact of authoritative and subjective cues on large language model reliability for clinical inquiries: an experimental study. Scientific Reports 2026;16(1)
Naderi N, Safavi-Naini S, Savage T, Khalafi M, Lewis P, Atf Z, Nadkarni G, Soroush A. Across generations, sizes, and types, large language models poorly report self-confidence in gastroenterology clinical reasoning tasks. npj Gut and Liver 2026;3(1)
Berkowitz J, Patock J, Nawaz A, Gonzalez-Hernandez G, Tatonetti N. A crisis of overconfidence: Why confidence, not accuracy, is the real risk in clinical AI. BioData Mining 2026;19(1)
Wang Q, Zou H, Zhang H, Huang Y, Tian J, Cheng W. A Survey on Medical Competence Evaluation Benchmarks for Large Language Models. Health Care Science 2026

Conference Proceedings

Meena Y, Mondal S, Potta M. Proceedings of the 16th International Conference of Human-Computer Interaction (HCI) Design & Research. Muteract: Interactive and Iterative Prompt Mutation Interface for LLM Developers and Evaluators
