TY  - JOUR
AU  - Nair, Rakhi Asokkumar Subjagouri
AU  - Hartung, Matthias
AU  - Heinisch, Philipp
AU  - Jaskolski, Janik
AU  - Starke-Knäusel, Cornelius
AU  - Veríssimo, Susana
AU  - Schmidt, David Maria
AU  - Cimiano, Philipp
PY  - 2025
DA  - 2025/4/14
TI  - Summarizing Online Patient Conversations Using Generative Language Models: Experimental and Comparative Study
JO  - JMIR Med Inform
SP  - e62909
VL  - 13
KW  - patient experience
KW  - online communities
KW  - summarizing
KW  - large language models
AB  - Background: Social media is acknowledged by regulatory bodies (eg, the Food and Drug Administration) as an important source of patient experience data for learning about patients’ unmet needs, priorities, and preferences. However, current methods either rely on manual analysis, which does not scale, or on automatic processing, which yields mainly quantitative insights; methods that automatically summarize texts and yield qualitative insights at scale are missing. Objective: The objective of this study was to evaluate to what extent state-of-the-art large language models can appropriately summarize posts shared by patients in web-based forums and health communities. Specifically, the goal was to compare the performance of different language models and prompting strategies on the task of summarizing documents reflecting the experiences of individual patients. Methods: In our experimental and comparative study, we applied 3 different language models (Flan-T5, Generative Pretrained Transformer 3 [GPT-3], and GPT-3.5) in combination with various prompting strategies to the task of summarizing posts from patients in online communities. The generated summaries were evaluated against 124 manually created summaries as a ground-truth reference. As evaluation metrics, we used 2 standard metrics from the field of text generation, namely Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and BERTScore, to compare the automatically generated summaries to the manually created reference summaries. Results: Among the large language models investigated with zero-shot prompting, GPT-3.5 performed better than the other models with respect to both the ROUGE metrics and BERTScore. While zero-shot prompting is a reasonable strategy, GPT-3.5 in combination with directional stimulus prompting in a 3-shot setting achieved the best overall results on these metrics. A manual investigation of the summaries produced by the best-performing method showed that they were accurate and plausible compared to the manual summaries. Conclusions: Taken together, our results suggest that state-of-the-art pretrained language models are a valuable tool for providing qualitative insights into the patient experience: they can help to better understand unmet needs, patient priorities, and how a disease impacts daily functioning and quality of life, informing efforts to improve health care delivery and to ensure that drug development focuses on the actual priorities and unmet needs of patients. The key limitations of our work are the small data sample and the fact that the manual summaries were created by only 1 annotator. Furthermore, the results hold only for the examined models and prompting strategies and may not generalize to other models and strategies.
SN  - 2291-9694
UR  - https://medinform.jmir.org/2025/1/e62909
UR  - https://doi.org/10.2196/62909
DO  - 10.2196/62909
ID  - info:doi/10.2196/62909
ER  -