Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation

Rewthamrongsris P, Burapacheep J, Trachoo V, Porntaveetus T. Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures. International Dental Journal 2025;75(1):206
Andrew A, Tizzard E. Large language models for improving cancer diagnosis and management in primary health care settings. Journal of Medicine, Surgery, and Public Health 2024:100157
Chang Y, Yin J, Li J, Liu C, Cao L, Lin S. Applications and Future Prospects of Medical LLMs: A Survey Based on the M-KAT Conceptual Framework. Journal of Medical Systems 2024;48(1)
Kreso A, Boban Z, Kabic S, Rada F, Batistic D, Barun I, Znaor L, Kumric M, Bozic J, Vrdoljak J. Using large language models as decision support tools in emergency ophthalmology. International Journal of Medical Informatics 2025;199:105886
Wei B, Yao L, Hu X, Hu Y, Rao J, Ji Y, Dong Z, Duan Y, Wu X. Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study. Journal of Medical Internet Research 2025;27:e67883
Liu Y, Shi C, Wu L, Lin X, Chen X, Zhu Y, Tan H, Zhang W. Development and Validation of a Large Language Model–Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency. JMIR Medical Education 2025;11:e73419
Torous J, Ledley K, Gorban C, Strudwick G, Schwarz J, Choudhary S, Emerson M, Patriquin M, Dempsey A, Bantjes J, Ospina-Pinillos L, Hornick J, Kochhar S. Accelerating Digital Mental Health: The Society of Digital Psychiatry’s Three-Pronged Road Map for Education, Digital Navigators, and AI. JMIR Mental Health 2025;12:e84501
Dwyer B, Flathers M, Sano A, Dempsey A, Cipriani A, Gazi A, Hill B, Gorban C, Rodriguez C, Stromeyer C, King D, Rozenblit E, Strudwick G, Linardon J, Cheong J, Firth J, Herpertz J, Schwarz J, Truong K, Emerson M, Paulus M, Patriquin M, Hua Y, Choudhary S, Siddals S, Pinillos L, Bantjes J, Schueller S, Xu X, Duckworth K, Gillison D, Wood M, Torous J. Mindbench.ai: an actionable platform to evaluate the profile and performance of large language models in a mental healthcare context. NPP—Digital Psychiatry and Neuroscience 2025;3(1)
Chang Q, Chen F, Chen Y, Cheng L, Dong D, Dong J, Feng X, Ge J, He J, He Y, He Z, Ji H, Jiang X, Jiang Z, Li N, Li P, Li Y, Liu B, Liu J, Lyu H, Min D, Qi W, Shen X, Sheng B, Sun J, Sun Y, Tian B, Wang K, Wang L, Wang L, Wang W, Wang Y, Wang Y, Wang Z, Weng J, Wei J, Wu G, Wu X, Xiao Y, Xu Y, Yan P, Ye Z, Yin W, Zhang C, Zhang D, Zhang P, Zhang W, Zhang X, Zhao S, Zhao Y, Zhou S, Zhou X, Zhu B, Zhu L, Zhu Z. 2025 Expert consensus on retrospective evaluation of large language model applications in clinical scenarios. Intelligent Medicine 2025;5(4):318
Singh S, Alyakin A, Alber D, Stryker J, Tong A, Sangwon K, Goff N, De La Paz M, Hernandez-Rovira M, Park K, Leuthardt E, Oermann E. The pitfalls of multiple-choice questions in generative AI and medical education. Scientific Reports 2025;15(1)
Seidl P, Szep M, Breden S, Charitou F, Mogler C, Schüffler P, von Eisenhart-Rothe R, Lazic I, Hinterwimmer F. LLM-gestützte Extraktion klinischer Daten: Potenziale und Herausforderungen. Die Orthopädie 2026;55(1):17
Liao W, Li M, Ma C, Han Y, Wang D, Liu H, Wang Y, Feng Z, Wang H, Guan Y. Developing a Quality Evaluation Index System for Health Conversational Artificial Intelligence: Mixed Methods Study. Journal of Medical Internet Research 2026;28:e83188
Wang Q, Zou H, Zhang H, Huang Y, Tian J, Cheng W. A Survey on Medical Competence Evaluation Benchmarks for Large Language Models. Health Care Science 2026;5(1):4
Ji X, Sun N, Wang A, Dong J, Hu J, Zhu J, Huang F, Zhang Z, Li K, Teng D, Li T. Beyond generalist LLMs: building and validating domain-specific models with the SpAMCQA benchmark. Artificial Intelligence Surgery 2026;6(1):80
Cafferty O, Jeffries S, Pelletier E, Tu Z, Sinha A, Hemmerling T. Contextualizing AI Evaluation in Anesthesiology: Interpreting Large Language Models and Computer Vision Metrics Across Clinical Use Cases—An Expert Statement from the Society of Technology in Anesthesia. Anesthesia & Analgesia 2026
Hack S, Craig J, Lin C, Fu C, Kwiatkowska M, Kocum P, Allevi F, Saibene A. Retrieval-augmented generative AI enhances clinical reasoning in odontogenic sinusitis versus maxillary sinus mucositis. European Archives of Oto-Rhino-Laryngology 2026;283(4):2353
Chen J, Luo M, Chen J, Luo G, Li G, Lei C, Chen D, Yu J, Gu K. Construction and evaluation of the knowledge graph and large model question-answering system for Jin San Zhen therapy: a tool study for primary care and general practice. Frontiers in Medicine 2026;13

Books/Policy Documents

Xu H, Xue T, Liu D, Zhang F, Westin C, Kikinis R, O’Donnell L, Cai W. Foundation Models for General Medical AI.
Geathers J, Hicke Y, Chan C, Rajashekar N, Young S, Sewell J, Cornes S, Kizilcec R, Shung D. Artificial Intelligence in Education.

Conference Proceedings

Anusha M, Bhavani S, Jahnavi A, Likitha M. 2026 9th International Conference on Inventive Computation Technologies (ICICT). Offline LLMs: Enabling Secure, Real-Time Human-AI Conversations without Internet using On-Device Language Models

Citation

Please cite as:

Xu J, Lu L, Peng X, Pang J, Ding J, Yang L, Song H, Li K, Sun X, Zhang S
Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation
JMIR Med Inform 2024;12:e57674
doi: 10.2196/57674 PMID: 38952020 PMCID: 11225096

This paper is in the following e-collection/theme issue:

Natural Language Processing (1251); Formative Evaluation of Digital Health Interventions (5021); Chatbots and Conversational Agents (1150); Artificial Intelligence (4625); Generative Language Models Including ChatGPT (1455); AI Language Models in Health Care (714)
