Published in Vol 13 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/66917.
Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

Journals

  1. Omar M, Hijazi K, Omar M, Nadkarni G, Klang E. Performance of large language models on family medicine licensing exams. Family Practice 2025;42(4)
  2. Omar M, Glicksberg B, Nadkarni G, Klang E. Refining LLMs outputs with iterative consensus ensemble (ICE). Computers in Biology and Medicine 2025;196:110731
  3. Huang Y, Yang G, Shen Y, Chen H, Wu W, Li X, Wu Y, Zhang K, Xu J, Zhang J. Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study. JMIR Medical Informatics 2025;13:e73941
  4. Fujita W, Sakamoto A, Sato E, Kaneko T, Kagiyama N. Transformative Impact of Artificial Intelligence on Internal Medicine: Current Applications, Challenges, and Future Horizons for Urban Health. Juntendo Medical Journal 2025;71(6):389
  5. Akinniranye O, Akinniranye O. Performance of Large Language Models and Top-Decile Doctors on an Undergraduate Ophthalmology Examination. Cureus 2025
  6. Thelwall M, Yang Y. Implicit and explicit research quality score probabilities from ChatGPT. Quantitative Science Studies 2025;6:1271
  7. Kaczmarczyk R, Pieroh P, Koob S, Fröschen F, Scheidt S, Welle K, Martin R, Roos J. Application of Vision-Language Models in the Automatic Recognition of Bone Tumors on Radiographs: A Retrospective Study. AI 2025;6(12):327
  8. Li P, Xu Y, Liu X, Shen Z, Wang Y, Lv X, Lu Z, Wu H, Zhuang J, Chen Y. Large Language Models in Patient Health Communication for Atherosclerotic Cardiovascular Disease: Pilot Cross-Sectional Comparative Analysis. JMIR Medical Informatics 2026;14:e81422
  9. Callens S. Effective prompt design for large language models in clinical practice. Acta Clinica Belgica 2026:1
  10. Jaworski W, Dolata T, Sawina P, Latkowska A, Olender M, Wielochowska A, Boczkowski D, Radej D, Kowalczyk A, Majchrowicz W, Loson M, Kruplewicz M, Stachowicz A, Kubiak M, Dadynska P. Comparison of GPT-5 and GPT-4o in Solving the Polish Centre for Medical Examinations (CEM) Gastroenterology Examination. Cureus 2026

Conference Proceedings

  1. Meena Y, Mondal S, Potta M. Proceedings of the 16th International Conference of Human-Computer Interaction (HCI) Design & Research. Muteract: Interactive and Iterative Prompt Mutation Interface for LLM Developers and Evaluators