Letter to the Editor
Comment in: http://medinform.jmir.org/2025/1/e82057/
doi:10.2196/80987
This letter concerns the recently published article “Clinical Performance and Communication Skills of ChatGPT Versus Physicians in Emergency Medicine: Simulated Patient Study” by Park et al [1]. The study makes a significant contribution to the growing field of artificial intelligence (AI) evaluation in medicine, and I congratulate the authors on their valuable work. However, I would like to highlight a potential methodological limitation in the written examination portion of the study.

The authors state that their examination questions were taken from a 2018 textbook, 100 Cases in Emergency Medicine and Critical Care [2]. The AI model they tested, ChatGPT (OpenAI), was trained on vast amounts of publicly available text from the internet, which may well have included this textbook. ChatGPT may therefore have encountered exactly the same questions and answers during its training.

This problem is known as “data contamination.” If the AI has already seen the test questions, its high scores may reflect good memory rather than good medical reasoning. This also makes the comparison with the physicians, who were seeing the questions for the first time, unfair. The study found that ChatGPT performed much better than the physicians on the written test, but this result could be due to this methodological limitation.
Other researchers in the field are aware of this problem and take steps to avoid it. For example, a study by Busch et al [3] on radiology used private, members-only cases that were unlikely to be in the AI’s training data, precisely to minimize this risk. Another study, by Noda et al [4], on the Japanese otolaryngology board certification examination used questions administered after the AI’s training data cutoff date. These studies show the importance of using new and unseen questions when testing AI.

Because the study by Park et al [1] did not take this approach, I believe the results of its written examination should be interpreted with caution. Future studies should use methods like those of Busch et al [3] and Noda et al [4] to ensure a fair and valid test of AI’s abilities.

Acknowledgments
Google Gemini was used for language editing.
Conflicts of Interest
None declared.
References
1. Park C, An MH, Hwang G, Park RW, An J. Clinical performance and communication skills of ChatGPT versus physicians in emergency medicine: simulated patient study. JMIR Med Inform. Jul 17, 2025;13:e68409. [FREE Full text] [CrossRef] [Medline]
2. Shamil E, Ravi P, Mistry D. 100 Cases in Emergency Medicine and Critical Care. Boca Raton, FL: CRC Press; 2018.
3. Busch F, Han T, Makowski MR, Truhn D, Bressem KK, Adams L. Integrating text and image analysis: exploring GPT-4V’s capabilities in advanced radiological applications across subspecialties. J Med Internet Res. May 01, 2024;26:e54948. [CrossRef] [Medline]
4. Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, et al. Performance of GPT-4V in answering the Japanese otolaryngology board certification examination questions: evaluation study. JMIR Med Educ. Mar 28, 2024;10:e57054. [FREE Full text] [CrossRef] [Medline]
Abbreviations
AI: artificial intelligence
Edited by A Iannaccio; this is a non–peer-reviewed article. Submitted 20.Jul.2025; accepted 20.Aug.2025; published 29.Sep.2025.
Copyright © Alaeddin Acar. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 29.Sep.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.