Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations

doi:10.2196/69485

Published on 27.Jun.2025 in Vol 13 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/69485, first published 02.Dec.2024.

Zhong Yao^{1,2,3}; Liantan Duan^{4}; Shuo Xu^{1,2,3}; Lingyi Chi^{1,2,3}; Dongfang Sheng^{4}
