
Journal Description

JMIR Medical Informatics (JMI, ISSN 2291-9694, Journal Impact Factor 3.8) (Editor-in-chief: Arriel Benis, PhD, FIAHSI) is an open-access journal that focuses on the challenges and impacts of clinical informatics, digitalization of care processes, clinical and health data pipelines from acquisition to reuse, including semantics, natural language processing, natural interactions, meaningful analytics and decision support, electronic health records, infrastructures, implementation, and evaluation (see Focus and Scope).

JMIR Medical Informatics adheres to rigorous quality standards, involving a rapid and thorough peer-review process, professional copyediting, and professional production of PDF, XHTML, and XML proofs.

The journal is indexed in MEDLINE, PubMed, PubMed Central, DOAJ, Scopus, and the Science Citation Index Expanded (SCIE).

JMIR Medical Informatics received a Journal Impact Factor of 3.8 (Source: Journal Citation Reports 2025 from Clarivate).

JMIR Medical Informatics received a Scopus CiteScore of 7.7 (2024), placing it in the 79th percentile (#32 of 153) as a Q1 journal in the field of Health Informatics.

 

Recent Articles:

  • Source: Freepik; Copyright: pressfoto; URL: https://www.freepik.com/free-photo/physiotherapist-palpating-leg_5535730.htm; License: Licensed by JMIR.

    Development of Venous Thromboembolism Risk Prediction Models Based on Whole Blood Gene Expression Profiling Using 20 Machine Learning Algorithms:...

    Abstract:

    Background: There is a lack of venous thromboembolism (VTE) risk prediction models based on gene expression information. Objective: This study aims to develop a VTE risk prediction model using whole blood gene expression profiles by conducting a comprehensive evaluation and comparison of 20 machine learning algorithms. Methods: Two transcriptome datasets containing patients with VTE and healthy controls were obtained by searching the GEO database and used as the training and validation sets, respectively. Feature selection was performed on the training set using LASSO and random forest, and the intersection of the features chosen by the two methods was retained. Recursive feature elimination was then applied to further refine the selected features, which were used to construct models with 20 machine learning algorithms. Model performance was evaluated using methods such as receiver operating characteristic (ROC) curve and confusion matrix analyses, and the validation set was used for external model validation. Results: All algorithm models except K-nearest neighbor exhibited good performance in VTE prediction. In the external validation data, nine algorithm models had an AUC greater than 0.75, and confusion matrix analysis revealed that the models maintained high specificity in the external validation cohort. Conclusions: This study used 20 machine learning algorithms to construct VTE prediction models based on whole blood gene expression information, and nine of these models demonstrated good diagnostic performance in external validation cohorts. Used in conjunction with D-dimer, these models may provide a more valuable reference for VTE diagnosis.
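A minimal sketch of the feature-selection pipeline this abstract describes (LASSO and random-forest selection, their intersection, then recursive feature elimination), using scikit-learn on synthetic data. This illustrates the general technique only; it is not the authors' code, and the dataset size, forest settings, and feature counts are invented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV, LogisticRegression

# Synthetic stand-in for a gene expression matrix: 200 samples x 50 "genes".
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# Step 1: LASSO keeps features with nonzero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_idx = set(np.flatnonzero(lasso.coef_))

# Step 2: random forest keeps the top features by impurity importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_idx = set(np.argsort(rf.feature_importances_)[-20:])

# Step 3: intersect the two selections, then refine with RFE.
shared = sorted(lasso_idx & rf_idx)
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=min(5, len(shared))).fit(X[:, shared], y)
selected = [shared[i] for i in np.flatnonzero(rfe.support_)]
print(selected)
```

Intersecting the two selectors keeps only the features both methods agree on, which tends to make the final RFE refinement more stable than running RFE on all features.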

  • AI-generated illustrative image depicting clinicians reviewing structured clinical data for Parkinson’s disease diagnosis, with subtle digital elements indicating AI-assisted analysis. The image conceptually represents the use of large language models to support clinical decision-making based on structured datasets. Source: DALL·E by OpenAI; Copyright: N/A (AI-generated image); URL: https://medinform.jmir.org/2026/1/e77561/; License: Public Domain (CC0).

    Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset

    Abstract:

    Background: Parkinson disease (PD) presents diagnostic challenges due to its heterogeneous motor and nonmotor manifestations. Traditional machine learning (ML) approaches have been evaluated on structured clinical variables. However, the diagnostic utility of large language models (LLMs) using natural language representations of structured clinical data remains underexplored. Objective: This study aimed to evaluate the diagnostic classification performance of multiple LLMs using natural language prompts derived from structured clinical data and to compare their performance with traditional ML baselines. Methods: We reformatted structured clinical variables from the Parkinson’s Progression Markers Initiative (PPMI) dataset into natural language prompts and used them as inputs for several LLMs. Variables with high multicollinearity were removed, and the top 10 features were selected using Shapley additive explanations (SHAP)–based feature ranking. LLM performance was examined across few-shot prompting, dual-output prompting that additionally generated post hoc explanatory text as an exploratory component, and supervised fine-tuning. Logistic regression (LR) and support vector machine (SVM) classifiers served as ML baselines. Model performance was evaluated using F1-scores on both the test set and a temporally independent validation set (temporal validation set) of limited size, and repeated output generation was carried out to assess stability. Results: On the test set of 122 participants, LR and SVM trained on the 10 SHAP-selected clinical variables each achieved a macro-averaged F1-score of 0.960 (accuracy 0.975). LLMs receiving natural language prompts derived from the same variables reached comparable performance, with the best few-shot configurations achieving macro-averaged F1-scores of 0.987 (accuracy 0.992). 
In the temporal validation set of 31 participants, LR maintained a macro-averaged F1-score of 0.903, whereas SVM showed substantial performance degradation. In contrast, multiple LLMs sustained high diagnostic performance, reaching macro-averaged F1-scores up to 0.968 and high recall for PD. Repeated output generation across LLM conditions produced generally stable predictions, with rare variability observed across runs. Under dual-output prompting, diagnostic performance showed a reduction relative to few-shot prompting while remaining generally stable. Supervised fine-tuning of lightweight models improved stability and enabled GPT-4o-mini to achieve a macro-averaged F1-score of 0.987 on the test set, with uniformly correct predictions observed in the small temporal validation set, which should be interpreted cautiously given the limited sample size and exploratory nature of the evaluation. Conclusions: This study provides an exploratory benchmark of how modern LLMs process structured clinical variables in natural language form. While several models achieved diagnostic performance comparable to LR across both the test and temporal validation datasets, their outputs were sensitive to prompting formats, model choice, and class distributions. Occasional variability across repeated output generations reflected the stochastic nature of LLMs, and lightweight models required supervised fine-tuning for stable generalization. These findings highlight the capabilities and limitations of current LLMs in handling tabular clinical information and underscore the need for cautious application and further investigation.
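The study's central preprocessing step, reformatting one structured clinical record into a natural-language prompt for an LLM, can be sketched as follows. The field names and prompt wording below are invented placeholders, not the actual SHAP-selected PPMI variables or the authors' template.

```python
# Hypothetical sketch: serialize a structured record into a prompt.
def row_to_prompt(record: dict) -> str:
    # One bullet per clinical variable, with underscores made readable.
    lines = [f"- {name.replace('_', ' ')}: {value}"
             for name, value in record.items()]
    return (
        "Given the following clinical measurements, classify the participant "
        "as 'PD' or 'control'.\n" + "\n".join(lines) + "\nAnswer:"
    )

# Invented example values for illustration only.
example = {"updrs_total": 32, "smell_test_score": 18, "age": 63}
print(row_to_prompt(example))
```

In a few-shot setup, several labeled serializations like this would be prepended to the test record before it is sent to the model.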

  • AI-generated image, in response to the request "create an abstract image for title 'Mild Cognitive Impairment Detection System Based on Unstructured Spontaneous Speech: Longitudinal Dual-modal Framework'" (Generator: Gemini/Nano Banana Pro, Dec 3, 2025). Source: Created with Gemini; Copyright: N/A (AI-generated image); URL: https://medinform.jmir.org/2026/1/e80883; License: Public Domain (CC0).

    Mild Cognitive Impairment Detection System Based on Unstructured Spontaneous Speech: Longitudinal Dual-Modal Framework

    Abstract:

    Background: In recent years, the incidence of cognitive diseases has risen with the significant increase in population aging. Among these diseases, Alzheimer's disease constitutes a substantial proportion, placing a high cost burden on healthcare systems. To enable early treatment and slow the progression of patient deterioration, it is crucial to diagnose mild cognitive impairment (MCI), a transitional stage. Objective: In this study, we use autobiographical memory (AM) test speech data to establish a dual-modal longitudinal cognitive detection system for MCI. The AM test is a psychological assessment method that evaluates the cognitive status of subjects as they freely narrate important life experiences. Methods: Identifying hidden disease-related information in unstructured, spontaneous speech is more difficult than in structured speech. To improve this process, we use both speech and text data, which provide more clues about a person's cognitive state. In addition, to track how cognition changes over time in spontaneous speech, we introduce an aging trajectory module. This module uses local and global alignment loss functions to better learn time-related features by aligning cognitive changes across different time points. Results: In our experiments, the longitudinal model incorporating the aging trajectory module achieved AUROCs of 84.81% and 88.59% on two Chinese datasets, respectively, a significant improvement over cross-sectional, single-time-point models. We also conducted ablation studies to verify the necessity of the proposed aging trajectory module. To confirm that the model applies beyond autobiographical memory test data, we used part of the model on the ADReSSo dataset, a single-time-point, semistructured dataset, for validation, with results showing an accuracy exceeding 88.05%. Conclusions: This study presents a noninvasive and scalable approach for early MCI detection by leveraging autobiographical memory speech data across multiple time points. Through dual-modal analysis and the introduction of an aging trajectory module, our system effectively captures cognitive decline trends over time. Experimental results demonstrate the method's robustness and generalizability, highlighting its potential for real-world, long-term cognitive monitoring.

  • Source: freepik; Copyright: freepik; URL: https://www.freepik.com/free-photo/female-doctor-with-stethoscope-around-her-neck-using-laptop-desk_4435674.htm; License: Licensed by JMIR.

    Developing a Suicide Risk Prediction Algorithm Using Electronic Health Record Data in Mental Health Care: Real-World Case Study

    Abstract:

    Background: Artificial intelligence (AI) offers potential solutions to the challenges faced by a strained mental healthcare system, such as increasing demand for care, staff shortages, and pressured accessibility. While developing AI-based tools for clinical practice is technically feasible and has the potential to produce real-world impact, only a few are actually implemented in clinical practice. Implementation starts at the algorithm development phase, as this phase bridges theoretical innovation and practical application. The design and the way an AI tool is developed may either facilitate or hinder later implementation and use. Objective: This is a qualitative case study of the development process of a suicide risk prediction algorithm using real-world electronic health record (EHR) data for clinical use in mental health care. It explores which challenges the development team encountered in creating the algorithm and how they addressed these challenges. The study identifies key considerations for the integration of technical and clinical perspectives in algorithm development, facilitating the evolution of mental health organizations toward data-driven practice. The studied algorithm remains exploratory and has not yet been implemented in clinical practice. Methods: An exploratory, multimethod qualitative case study was conducted, employing a hybrid approach with both inductive and deductive analysis. Data were collected through desk research, reflective team meetings, and iterative feedback sessions with the development team. Thematic analysis was used to identify development challenges and the team's responses. Based on these findings, key considerations for future algorithm development were derived. Results: Key challenges included defining, operationalizing, and measuring suicide incidents within EHRs due to issues such as missing data, underreporting, and differences between data sources. Predictive factors were identified by consulting clinical experts; however, psychosocial variables had to be constructed, as they could not be extracted directly from EHR data. A risk of bias arose when traditional suicide prevention questionnaires, unequally distributed across patients, were used as input. Analyzing unstructured data with natural language processing (NLP) was challenging due to data noise but ultimately enabled successful sentiment analysis, which provided dynamic, clinically relevant information for the algorithm. A complex model enhanced predictive accuracy but posed challenges for understandability, which was highly valued by clinicians. Conclusions: To advance mental healthcare as a data-driven field, several critical considerations must be addressed: ensuring robust data governance and quality, fostering cultural shifts in data documentation practices, establishing mechanisms for continuous monitoring of AI tool usage, mitigating risks of bias, balancing predictive performance with explainability, and maintaining a clinician-in-the-loop approach. Future research should prioritize sociotechnical aspects related to the development, implementation, and daily use of AI in mental healthcare practice.

  • AI-generated image, in response to the request "Physicians Apply Radiomics Technology to Predict Pain Progression in Knee Osteoarthritis" (Generator doubao APP Dec 28, 2025; Requestor: Yingwei Sun). Source: doubao APP; Copyright: N/A (AI-Generated image); URL: https://medinform.jmir.org/2026/1/e78338/; License: Public Domain (CC0).

    Nomograms Based on X-Ray Radiomics for Predicting Pain Progression in Knee Osteoarthritis Using Data From the Foundation for the National Institutes of...

    Abstract:

    Background: Knee osteoarthritis (KOA) is one of the most prevalent chronic musculoskeletal disorders among the older adult population. Screening populations at risk of rapid progression of osteoarthritis and implementing appropriate early intervention strategies is advantageous for the treatment and prognosis of affected patients. Objective: This study aimed to construct and validate a nomogram model based on x-ray radiomics to effectively identify individuals experiencing progression of KOA pain. Methods: The Foundation for the National Institutes of Health Biomarkers Consortium included a total of 600 participants, classified as pain progressors (n=297, 49.5%) and non–pain progressors (n=303, 50.5%) according to an increase in the Western Ontario and McMaster Universities Osteoarthritis Index pain score of ≥9 points (on a scale from 0 to 100) during the follow-up period of 24 to 48 months. X-rays that lacked defined spacing in the DICOM image were excluded. Subchondral bone regions on the inner and outer edges of the tibia and femur were selected fully automatically as regions of interest, and radiomics features were extracted for different combinations of regions of interest. Least absolute shrinkage and selection operator regression was used to select features and generate a radiomics score, with Shapley additive explanations used for interpretability. The radiomics score, along with clinical indicators, was incorporated into nomograms using a multivariable logistic regression model. The subgroup analysis focused solely on cases of pain progression and cases with no progression at all. The receiver operating characteristic curve, along with calibration and decision curves, was used to assess discriminative performance. Results: A total of 450 participants were included in the study. Shapley additive explanations analysis identified Wavelet-HH_gldm_HighGrayLevelEmphasis as the primary radiomics feature. Nomogram 1 and nomogram 2 for predicting KOA pain progression achieved area under the curve values of 0.766 and 0.753, respectively, with mean absolute errors of 0.012 and 0.008 in the calibration curves. Decision curve analysis showed a positive net benefit across a range of threshold probabilities. In subgroup analyses, nomogram 3 and nomogram 4 yielded areas under the curve of 0.795 and 0.740, respectively. Conclusions: The nomograms based on x-ray radiomics demonstrated excellent predictive capability and accuracy in forecasting the progression of KOA pain.

  • Source: Freepik; Copyright: freepik; URL: https://www.freepik.com/free-photo/accompaniment-abortion-process_31260212.htm; License: Licensed by JMIR.

    Large Language Models for Psychiatric Diagnosis Based on Multicenter Real-World Clinical Records: Comparative Study

    Abstract:

    Background: Psychiatric disorders are diagnostically challenging and often rely on subjective clinical judgment, particularly in resource-limited settings. Large language models (LLMs) have demonstrated potential in supporting psychiatric diagnosis; however, robust evidence from large-scale, real-world clinical data remains limited. Objective: This study aimed to evaluate and compare the diagnostic performance of multiple LLMs for psychiatric disorders using multicenter real-world electronic health records (EHRs). Methods: We retrospectively analyzed 9923 inpatient EHRs collected from 6 psychiatric centers across China, encompassing all ICD-10 (International Statistical Classification of Diseases, Tenth Revision) psychiatric categories. In total, 3 LLMs—GPT-4.0 (OpenAI), GPT-3.5 (OpenAI), and GLM-4-Plus (Zhipu AI)—were evaluated against physician-confirmed discharge diagnoses. Diagnostic performance was assessed using strict accuracy criteria and lenient classification metrics, with subgroup analyses conducted across diagnostic categories and age groups. Results: GPT-4.0 achieved the highest overall strict diagnostic accuracy (71.7%) and the highest weighted F1-score under lenient evaluation (0.881), particularly for high-prevalence disorders, such as mood disorders and schizophrenia spectrum disorders. Diagnostic performance varied across age groups, with the highest accuracy observed in older adult patients (up to 79.5%) and lower accuracy in adolescents. Across centers, model performance remained stable, with no significant intercenter differences. Conclusions: LLMs—especially GPT-4.0—demonstrate promising capability in supporting psychiatric diagnosis using real-world EHRs. However, diagnostic performance varies by age group and disorder category. LLMs should be regarded as assistive tools rather than replacements for clinical judgment, and further validation is needed before routine clinical implementation.
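The distinction between "strict accuracy criteria" and "lenient classification metrics" can be sketched as below. The abstract does not state the exact criteria, so this is an assumed interpretation: strict requires an exact ICD-10 code match, while lenient accepts a match at the three-character category level; the example codes are invented.

```python
# Hypothetical strict vs lenient diagnostic scoring for ICD-10 codes.
def strict_match(pred: str, truth: str) -> bool:
    return pred == truth  # exact code, e.g. "F32.1" == "F32.1"

def lenient_match(pred: str, truth: str) -> bool:
    # Compare only the three-character category, e.g. "F32" from "F32.1".
    return pred.split(".")[0] == truth.split(".")[0]

preds = ["F32.1", "F20.0", "F41.1"]   # model outputs (invented)
truths = ["F32.2", "F20.0", "F43.1"]  # discharge diagnoses (invented)
strict = sum(strict_match(p, t) for p, t in zip(preds, truths)) / len(preds)
lenient = sum(lenient_match(p, t) for p, t in zip(preds, truths)) / len(preds)
print(strict, lenient)  # lenient accuracy is never below strict accuracy
```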

  • AI-generated image based on the prompt: "Physician's office setting, where the physician faces a computer screen displaying glioma imaging findings, hematological test parameters, and machine learning-related plots" (Generator: ByteDance; Date:2025-11-29. Requestor: Congcong Zhu). Source: Doubao (ByteDance); Copyright: N/A (AI-generated image); URL: https://medinform.jmir.org/2026/1/e79945/; License: Public Domain (CC0).

    Neutrophil Percentage–to-Albumin Ratio as a Novel Prognostic Biomarker in Adult Diffuse Gliomas: Retrospective Study Integrating 3 Machine Learning Models...

    Abstract:

    Background: Adult-type diffuse glioma (ADG) is the most common primary malignant tumor of the central nervous system. Its highly invasive nature, marked heterogeneity, and resistance to therapy contribute to a high risk of recurrence and poor prognosis. At present, the lack of reliable prognostic tools poses a significant barrier to the development of individualized treatment strategies. Objective: This study aimed to develop an effective prognostic model for ADG by integrating multiple machine learning algorithms, in order to enhance the precision of individualized clinical decision-making. Methods: In this retrospective study, 160 newly diagnosed patients with ADG who underwent surgical resection and histopathological confirmation at our institution between June 2019 and September 2021 were included. A total of 32 variables, including clinical characteristics, molecular biomarkers, and preoperative hematological indicators, were collected. Overall survival (OS) and progression-free survival (PFS) were defined as the study endpoints. Feature selection was performed using least absolute shrinkage and selection operator regression, extreme gradient boosting, and random forest algorithms. Kaplan-Meier survival curves and log-rank tests were used for survival analysis. Multivariate Cox proportional hazards models were constructed to identify independent prognostic factors, and nomograms were developed accordingly. The model’s discriminative ability, calibration, and clinical utility were evaluated using the concordance index, area under the receiver operating characteristic curve (area under the curve), calibration plots, and Kaplan-Meier analysis. Results: Age, neutrophil percentage–to-albumin ratio (NPAR), and platelet-to-mean platelet volume ratio were identified as independent prognostic factors for OS, while age and NPAR were independent predictors for PFS (all P<.001). 
The prognostic models based on these variables demonstrated good predictive performance, with concordance index values of 0.731 and 0.763 for the training and validation cohorts in the OS model, respectively. The PFS model also showed robust performance. Area under the curve values and calibration curves further supported the models’ accuracy and stability. Risk stratification analysis revealed clear survival differences between risk groups (all P<.05), indicating strong clinical applicability. Conclusions: This study is the first to identify preoperative NPAR as a significant prognostic biomarker for ADG using machine learning approaches. The prognostic model incorporating NPAR, platelet-to-mean platelet volume ratio, and age demonstrated favorable predictive performance, offering a novel perspective for accurate risk stratification and personalized treatment in patients with ADG.
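The headline biomarker is a simple ratio of two routine preoperative laboratory values. The abstract does not give the formula or units, so the convention below (neutrophil percentage in percentage points divided by serum albumin in g/dL) is an assumption based on common usage of NPAR, and the example values are invented.

```python
# Assumed NPAR convention: neutrophil percentage (%) / albumin (g/dL).
def npar(neutrophil_pct: float, albumin_g_dl: float) -> float:
    if albumin_g_dl <= 0:
        raise ValueError("albumin must be positive")
    return neutrophil_pct / albumin_g_dl

# Invented example: 65% neutrophils, albumin 4.0 g/dL.
print(round(npar(65.0, 4.0), 2))
```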

  • Source: freepik; Copyright: freepik; URL: https://www.freepik.com/free-photo/doctor-writing-about-routine-medical-checkup_22894399.htm; License: Licensed by JMIR.

    Exploring Factors Associated With the Stalled Implementation of a Ground-Up Electronic Health Record System in South Africa: Qualitative Insights From the...

    Abstract:

    Background: Electronic health records (EHRs) have the potential to improve service delivery through record keeping and monitoring of health outcomes. As countries move toward universal health coverage, digital health tools such as EHRs are essential for achieving this goal. However, EHR implementation in middle-income countries like South Africa faces obstacles. Objective: This study explores the reasons behind the stalled implementation of the E-tick system (an electronic version of a paper primary health care register used to record services provided), using the Consolidated Framework for Implementation Research (CFIR). Methods: Using a qualitative design, in-depth interviews were conducted with 38 participants to explore their perceptions and experiences and the factors surrounding the E-tick's success and stalling. Participants included managers, stakeholders, implementers, and end users from the 3 implementation clinics. Data were collected using semistructured interview guides. Thematic and CFIR framework analyses (innovation, inner setting, individual characteristics, implementation process, and outer setting) were applied. Results: The E-tick system was designed to improve data quality in paper health registers, addressing inaccuracies in reporting to district and provincial health departments (innovation domain). Implementers iteratively developed the system through user input from managers and clinicians and through stakeholder engagement of software developers, funders, health managers, and decision-makers from the provincial health department (individual characteristics). Although the system was initially well adopted by end users, it stalled primarily due to outer setting factors, which included a change of developers, funding cuts, and limited support at the provincial health department level due to capacity gaps, political appointments, and mistrust stemming from corruption and abuse of the tender system. Moreover, resistance to leveraging lessons from locally developed small-scale systems further constrained institutional support for the E-tick. Conclusions: Although successful implementation of EHRs can be facilitated by strong user engagement and co-design, outer setting factors such as governance, funding, and policy alignment can pose significant threats to sustainability. This underscores the importance of effective synergy between top-down and bottom-up processes for successful implementation.

  • Source: Freepik; Copyright: Freepik; URL: https://www.freepik.com/free-photo/business-people-working-together_12162860.htm; License: Licensed by JMIR.

    Large Language Model–Enabled Editing of Patient Audio Interviews From “This Is My Story” Conversations: Comparative Study

    Abstract:

    Background: This Is My Story (TIMS) was started by Chaplain Elizabeth Tracey to promote a humanistic approach to medicine. Patients in the TIMS program are the subject of a guided conversation in which a chaplain interviews either the patient or a loved one about the patient. The interviewer asks four questions to elicit clinically actionable information that has been shown to improve communication between the narrator and the medical providers and to increase empathy on the part of the medical team. The original recorded conversation is edited into a condensed audio file approximately 1.5 minutes in length and placed in the electronic health record, where it is easily accessible to all clinicians caring for the patient. Objective: TIMS is active at the Johns Hopkins Hospital and has shown value in assisting with clinician empathy and communication. As the program expands, the limited time and resources needed to manually edit audio conversations into a more condensed format pose a barrier to adoption. To address this, we propose an automated solution using a large language model (LLM) to create meaningful and concise audio summaries. Methods: We analyzed 24 TIMS audio interviews and created three edited versions of each: (1) expert-edited, (2) AI-edited using a fully automated LLM pipeline, and (3) novice-edited by two medical students trained by the expert. All versions were evaluated in a within-subjects design by a second expert who was blinded to both the editor and the order in which each audio was presented. This expert rated all interviews, scoring audio quality and content quality on 5-point Likert scales. We quantified transcript similarity to the expert-edited reference using lexical and semantic similarity metrics and qualitatively assessed important information omitted relative to the expert-edited interview. Results: Audio quality (flow, pacing, clarity) and content quality (coherence, relevance, nuance) were each rated on 5-point Likert scales. Expert-edited interviews received the highest mean ratings for both audio quality (4.84) and content quality (4.83). Novice-edited interviews scored moderately (3.84 audio, 3.63 content), while AI-edited interviews scored slightly lower (3.49 audio, 3.20 content). Novice and AI edits were rated significantly lower than expert edits (P<.001) but were not significantly different from each other. AI- and novice-edited interview transcripts had comparable overlap with the expert reference transcript, while qualitative review found frequent omissions of patient identity, actionable insights, and overall context in both. AI editing was fully automated and significantly reduced editing time compared with both human editors. Conclusions: An AI-based editing pipeline can generate TIMS audio summaries with content and audio quality comparable to those of novice human editors given one hour of training. AI significantly reduces editing time and removes the need for manual training, offering a way to scale TIMS to larger organizations or to settings where expert editors are not readily available. Clinical Trial: Not applicable.
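One simple way to quantify "lexical similarity" between an edited transcript and the expert reference is token-set overlap. The abstract does not name its exact metrics, so the Jaccard index below is an illustrative stand-in, and the transcripts are invented.

```python
# Illustrative lexical-overlap metric: Jaccard index over word sets.
def jaccard(reference: str, candidate: str) -> float:
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    if not ref and not cand:
        return 1.0  # two empty transcripts are identical by convention
    return len(ref & cand) / len(ref | cand)

print(jaccard("the patient loves gardening", "patient loves music"))
```

Semantic similarity, by contrast, would typically compare sentence embeddings rather than surface tokens, so paraphrases are not penalized.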

  • Source: freepik; Copyright: DC Studio; URL: https://www.freepik.com/free-photo/medical-team-nurses-working-tablet-modern-professional-medical-office-smiling-healthcare-employee-comparing-data-with-african-american-coworker-hospital-workplace_51112340.htm; License: Licensed by JMIR.

    Ethical Imperatives for Retrieval-Augmented Generation in Clinical Nursing: Viewpoint on Responsible AI Use

    Abstract:

    Retrieval-augmented generation (RAG) systems have emerged as a powerful technique for optimizing general large language models in specialized domains and are being increasingly adopted by researchers in the medical field. This article acknowledges the significant potential of RAG to enhance clinical decision-making. However, it argues that researchers and practitioners must proactively address the ethical risks associated with RAG implementation in healthcare. Key considerations, discussed through a structured analysis, include ensuring accuracy, fairness, transparency, and accountability, as well as maintaining essential human oversight. We propose that robust data governance, explainable AI techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. Ultimately, realizing the benefits of RAG while mitigating ethical concerns requires collaboration among healthcare professionals, AI developers, and policymakers, fostering a future where AI supports patient safety, reduces disparities, and improves the quality of nursing care.

  • Source: Freepik; Copyright: DC Studio; URL: https://www.freepik.com/free-photo/female-applicant-looking-cv-files-waiting-attend-hiring-meeting-preparing-job-interview-career-opportunity-woman-queue-feeling-nervous-about-candidate-selection_25700346.htm; License: Licensed by JMIR.

    Applicability of Existing Gender Scores for German Clinical Research Data: Scoping Review and Data Mapping

    Abstract:

    Background: Considering sex and gender improves research quality, innovation, and social equity, while ignoring them leads to inaccuracies and inefficiency in study results. Despite increasing attention to sex- and gender-sensitive medicine, challenges remain in accurately representing gender due to its dynamic and context-specific nature. Objective: This work aims to contribute to the implementation of a standard for collecting and assessing gender-specific data in German university hospitals and associated research facilities. Methods: We carried out a scoping review to identify and categorize state-of-the-art gender scores. In total, 22 publications were systematically assessed regarding the applicability and practicability of their proposed gender scores. Specifically, we evaluated the use of these gender scores on German research data from clinical routine, using the Medical Informatics Initiative core dataset (MII CDS). Results: Different methods for assessing gender have been proposed, but no standardized and validated gender score is available for health research. Most gender scores target epidemiological or public health research, where questions about social aspects and life habits are already part of the questionnaires. However, it is challenging to apply concepts for gender scoring to clinical data. The MII CDS, for example, lacks all of the variables currently recorded in gender scores. Some of the required variables are indeed present in clinical routine data but would need to become part of the MII CDS. Conclusions: To enable gender-specific retrospective analysis of clinical routine data, we recommend updating and expanding the MII CDS to include more gender-relevant information. For this purpose, we provide concrete action steps on how gender-related variables can be captured in clinical routine and represented in a machine-readable way. Clinical Trial: Not applicable.

  • Source: Freepik; Copyright: wavebreakmedia_micro; URL: https://www.freepik.com/free-photo/surgeons-wearing-surgical-loupes-while-performing-operation_8402437.htm; License: Licensed by JMIR.

    Deep Learning for Dynamic Prognostic Prediction in Minimally Invasive Surgery for Intracerebral Hemorrhage: Model Development and Validation Study

    Abstract:

    Background: The pathological and physiological state of patients with intracerebral hemorrhage (ICH) after minimally invasive surgery (MIS) evolves dynamically, and traditional models cannot predict prognosis dynamically. Clinical data collected at multiple time points often differ in category and number and contain missing values, and existing models lack methods for handling such imbalanced data. Objective: This study aims to develop and validate a dynamic prognostic model using multi–time point data from patients with ICH undergoing MIS to predict survival and functional outcomes. Methods: Data from 287 patients who underwent MIS for ICH were retrospectively collected on the day of surgery; on days 1, 3, 7, and 14 after surgery; and on the day of drainage tube removal. General information, vital signs, laboratory test findings, neurological function scores, head hematoma volume, and MIS-related indicators were collected. This study proposes a multistep attention model, the MultiStep Transformer, which simultaneously outputs three prediction probabilities: 30-day survival, 180-day survival, and 180-day favorable functional outcome (modified Rankin Scale [mRS] 0-3). Five-fold cross-validation was used to evaluate the performance of the model and compare it with mainstream models and traditional scores. The main evaluation metrics included accuracy, precision, recall, and F1-score. The predictive performance of the model was evaluated using receiver operating characteristic (ROC) curves; its calibration was assessed via calibration curves; and its clinical utility was examined using decision curve analysis (DCA). Feature attribution analysis was conducted to identify the key predictive features. Results: The 30-day survival rate, 180-day survival rate, and 180-day favorable functional outcome rate among the 287 patients were 92.3%, 88.8%, and 52.3%, respectively. In predicting survival and functional outcomes, the MultiStep Transformer model was markedly superior to traditional scoring systems and other deep learning models. For the three outcomes, the model achieved areas under the receiver operating characteristic curve (AUROCs) of 0.87 (95% CI 0.82-0.92), 0.85 (95% CI 0.77-0.93), and 0.75 (95% CI 0.72-0.78), with corresponding Brier scores of 0.1041, 0.1115, and 0.231. DCA confirmed that the model provided a definite clinical net benefit at threshold probabilities within 0.06-0.26, 0.04-0.50, and 0.21-0.71, respectively. Conclusions: The MultiStep Transformer model proposed in this study can be constructed effectively from imbalanced data. It shows good dynamic prediction ability for short- and long-term survival and functional outcomes of patients with ICH undergoing MIS, providing a novel tool for individualized prognostic assessment in this population.
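The two headline metrics in this abstract, AUROC (discrimination) and the Brier score (calibration), are simple to compute. The pure-Python sketch below is illustrative only, on toy data; it is not the authors' evaluation code.

```python
# Toy implementations of the two metrics reported in the abstract.

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auroc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative
    (ties count half), which equals the area under the ROC curve."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
print(round(auroc(y, p), 3))        # 1.0: every positive outranks every negative
print(round(brier_score(y, p), 3))  # 0.078
```

In a 5-fold cross-validation, these would be computed per fold on the held-out split and then summarized across folds.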


Latest Submissions Open for Peer-Review:

  • A Conceptual Model for Ambient AI Adoption: Perspectives from Academia and Industry

    Date Submitted: Jan 12, 2026

    Open Peer Review Period: Jan 15, 2026 - Mar 12, 2026

    Ambient AI technologies are increasingly marketed as solutions to reduce clinician burden and improve care efficiency, yet real-world performance varies widely across clinical settings. Healthcare provider organizations face challenges in determining which aspects of ambient AI performance matter most and how to obtain meaningful information about those aspects from vendors or through internal evaluation. This article presents a shared mental model to guide health system leaders in conceptualizing ambient AI performance across three interdependent dimensions: technical, interface, and system-level. For each dimension, we outline the types of information relevant to assessment, what vendors should reasonably be expected to provide, and how healthcare provider organizations can conduct their own evaluations to contextualize, verify, or supplement vendor claims. By integrating both vendor and health-system perspectives, this work offers a grounded, practical structure to support organizations of all sizes in understanding and making informed decisions about ambient AI technologies.

  • Predicting Response to Exercise Therapy in Adolescents With Spinal Curvature Abnormalities: A Randomized Controlled Trial Using Machine Learning

    Date Submitted: Jan 6, 2026

    Open Peer Review Period: Jan 14, 2026 - Mar 11, 2026

    Background: Adolescence is a critical period for spinal and neuromuscular development, during which abnormal spinal curvature may progress rapidly and lead to long-term musculoskeletal dysfunction. Exercise therapy is widely recommended as a non-surgical intervention; however, substantial individual variability in treatment response limits its clinical effectiveness. Although multidimensional data on body composition and spinal function are routinely collected in schools and rehabilitation clinics, these data are rarely integrated into intervention decision-making. Current screening and treatment selection still rely largely on visual assessment and simple angular measurements, and validated tools for identifying adolescents most likely to benefit from specific exercise therapies are lacking. Objective: This study aimed to evaluate the effects of 12 weeks of spiral muscle chain training (SPS) and of combined exercise therapy incorporating proprioceptive neuromuscular facilitation (PNF), and to develop an interpretable machine learning–based predictive model to support personalized exercise therapy planning for adolescents with abnormal spinal curvature. Methods: The data for this study were derived from a 12-week randomized controlled trial of exercise therapy. A total of 125 middle and high school students with abnormal spinal curvature were recruited from schools and randomly assigned to a spiral muscle chain training group (n=61) or a combined exercise therapy group (n=64). All interventions were delivered in person. Baseline and post-intervention assessments of body composition and spinal health were performed using standardized clinical measurements. Singular value decomposition–based principal component analysis (SVD-PCA) was applied to extract principal components representing spinal mobility and balance. These components, together with demographic and clinical indicators, were used to construct predictive models with four machine learning algorithms: K-nearest neighbors (KNN), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). Model performance was evaluated, and SHapley Additive exPlanations (SHAP) were used to interpret the optimal model. Results: Both exercise therapies significantly improved spinal curvature, spinal mobility, and head, shoulder, and pelvic balance, with combined exercise therapy demonstrating superior efficacy. The reduction in angle of trunk inclination (ATI) was greater in the combined therapy group (P<0.001). SVD-PCA extracted three mobility-related principal components and one balance-related component from 21 spinal indicators, explaining 86.37% of the total variance. Among all models, the RF model achieved the best predictive performance (AUC=0.950, F1=0.857, BS=0.120). SHAP analysis identified exercise therapy type, kyphotic angle (KA), ATI, and spinal function–related principal components as the most influential predictors. Conclusions: Both SPS and combined exercise therapy effectively improve adolescent spinal curvature abnormalities, with SPS showing particular value for mild to moderate cases. Machine learning–based predictive models can integrate multidimensional spinal health data to provide interpretable and individualized predictions, supporting precision assessment and personalized intervention strategies for adolescents with abnormal spinal curvature. Clinical Trial: ClinicalTrials.gov NCT07319702; https://clinicaltrials.gov/ct2/show/NCT07319702
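SVD-based PCA, the dimensionality-reduction step described above, centers the data matrix and takes its singular value decomposition; the left singular vectors scaled by the singular values give the component scores, and the squared singular values give the variance explained. The sketch below uses random stand-in data (assuming NumPy), not the study's 21 spinal indicators.

```python
import numpy as np

def svd_pca(X, n_components):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                          # center each indicator
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # component scores
    explained = (S ** 2) / (S ** 2).sum()            # variance ratios, descending
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(125, 21))   # toy stand-in: 125 participants, 21 indicators
scores, ratio = svd_pca(X, 4)
print(scores.shape)              # (125, 4)
```

On the study's real data the leading four components explained 86.37% of the variance; on this random stand-in there is no such dominant structure.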

  • The Clinical Generalizability Gap in AI-based Alzheimer's Diagnosis: A Systematic Analysis of Deficits and Proposed Practical Solutions

    Date Submitted: Dec 31, 2025

    Open Peer Review Period: Jan 8, 2026 - Mar 5, 2026

    Background: Despite the high potential of artificial intelligence (AI) in diagnosing Alzheimer's disease, a profound gap exists between reported accuracy under ideal conditions and models' reliable performance in real-world clinical settings. Objective: This systematic analysis aimed to identify the root causes of this gap and propose practical solutions. Methods: We conducted a systematic analysis in accordance with PRISMA 2020, analyzing 56 studies (2013-2023). A qualitative content analysis was performed around four pillars: (1) data repository characteristics, (2) data preprocessing and model design, (3) technical implementation frameworks, and (4) performance evaluation protocols. Results: Results indicate a methodological transition toward standardized data repositories and modern AI frameworks. However, rapid algorithm development has outpaced the maturity required for clinical generalizability. Four key deficits were identified: (1) data limitations due to reliance on restricted, low-diversity datasets (63% of studies used ADNI exclusively); (2) insufficient standardization in preprocessing and modeling, prioritizing 'convenience' over 'generalizability'; (3) a disconnect between technical capabilities and critical clinical needs (only 7% focused on the crucial sMCI/pMCI distinction); and (4) deficiencies in evaluation protocols, notably scarce multi-center validation (only 7%) and inadequate reporting of comprehensive metrics (96% relied solely on accuracy). Practical solutions to address these deficits across the data, modeling, and evaluation domains are proposed. Conclusions: Transitioning from 'accuracy under ideal conditions' to 'reliability in real-world settings' is an unavoidable necessity. This requires investment in multi-center data repositories, alignment of models with clinical needs, and institutionalized comprehensive evaluation. The findings and recommendations are generalizable to other domains of AI-based disease diagnosis.

  • The SPHN Metadata Catalog: A platform for health data discovery and exploration based on FAIR principles

    Date Submitted: Dec 23, 2025

    Open Peer Review Period: Jan 4, 2026 - Mar 1, 2026

    Background: The Swiss Personalized Health Network facilitates the interoperability and secure sharing of health-related data for research in Switzerland, in line with the FAIR principles. Since medical datasets can be highly sensitive, access is often governed by complex legal and regulatory requirements. Enabling researchers to discover, understand, and evaluate datasets through rich, well-structured metadata is therefore essential to support informed decisions about data suitability and reuse. Objective: This study describes the design and functionality of the SPHN Metadata Catalog and its role in supporting the discovery, exploration, and reuse assessment of health-related datasets. Methods: The SPHN Metadata Catalog is a FAIR Data Point-compliant infrastructure that provides rich, structured metadata in both human and machine-readable form. Dataset descriptions are based on HealthDCAT, ensuring a standardized representation of health data catalogs. Beyond the descriptive metadata typically offered by other catalogs, the SPHN Metadata Catalog includes extensive dataset-level statistics expressed using the Vocabulary of Interlinked Datasets. An interactive visualization component further enables users to explore graph-based schemas and datasets, including entities, attributes, relationships, and their relative abundances. Results: The SPHN Metadata Catalog enables users to explore the semantic structure of graph schemas and statistics of datasets prior to requesting access. Researchers can examine data structures, relationships, attributes, and the abundances of individual data elements. This functionality supports feasibility assessments and informed evaluations of dataset suitability and reuse conditions. Conclusions: By combining HealthDCAT-based descriptions with rich statistical metadata and interactive exploration capabilities, the SPHN Metadata Catalog enhances dataset discoverability and supports FAIR-compliant data reuse. 
As a key component of Switzerland’s health data research infrastructure, the SPHN Metadata Catalog provides a foundation for future interoperability initiatives, including potential alignment with emerging frameworks such as the European Health Data Space.
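The dataset-level statistics described above, counts of entities, attributes, and their relative abundances in the spirit of VoID, can be illustrated with a toy aggregation. The records below are invented for illustration and do not reflect any SPHN dataset or the actual HealthDCAT/VoID serialization.

```python
# Toy abundance statistics over invented graph-like records.
from collections import Counter

records = [
    {"type": "Patient", "attrs": ["birth_date", "sex"]},
    {"type": "Patient", "attrs": ["birth_date"]},
    {"type": "LabResult", "attrs": ["code", "value", "unit"]},
]
entity_counts = Counter(r["type"] for r in records)
attr_counts = Counter(a for r in records for a in r["attrs"])
print(entity_counts["Patient"])   # 2
print(attr_counts["birth_date"])  # 2
```

Publishing such counts alongside descriptive metadata is what lets researchers judge a dataset's suitability before requesting access.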

  • Optimizing Clinical Temporal Relation Extraction with Large Language Models: Comparative Analysis

    Date Submitted: Dec 19, 2025

    Open Peer Review Period: Jan 4, 2026 - Mar 1, 2026

    Background: Clinical Temporal Relation Extraction (CTRE) is essential for reconstructing patient timelines from unstructured Electronic Health Records (EHRs). However, the linguistic complexity of clinical notes and the high cost of expert annotation impede the development of large-scale training corpora. While Large Language Models (LLMs) have transformed general Natural Language Processing, their application to CTRE remains underexplored. Objective: This study aims to determine the optimal adaptation strategy for CTRE by conducting a comprehensive benchmarking of LLM architectures and fine-tuning methodologies in both data-rich and limited-data regimes. Methods: We evaluated four LLMs representing two distinct architectures: Transformer Encoders (GatorTron-Base, GatorTron-Large) and Transformer Decoders (LLaMA 3.1-8B, MeLLaMA-13B). We compared four adaptation strategies: (1) Standard Fine-Tuning, (2) Hard-Prompting, (3) Soft-Prompting, and (4) Low-Rank Adaptation (LoRA). Experiments were conducted on the 2012 i2b2 CTRE benchmark in both full-supervision and 1-shot scenarios. Results: We achieved results that exceed the current state-of-the-art (SOTA) on the 2012 i2b2 dataset. Comparative analysis reveals that hard-prompting consistently yields superior efficacy compared to standard fine-tuning. Regarding Parameter-Efficient Fine-Tuning (PEFT) strategies, Low-Rank Adaptation (LoRA) targeting query and value layers emerged as the optimal configuration. Conversely, soft-prompting demonstrated suboptimal performance, likely due to constraints on representational capacity. Architecturally, we observed a performance dichotomy based on data availability: Encoder-based models (GatorTron) exhibited superior stability and accuracy in few-shot scenarios, whereas Decoder-based models (LLaMA 3.1, MeLLaMA) demonstrated dominant performance in data-rich regimes. Conclusions: This study provides a rigorous roadmap for adapting LLMs to clinical extraction tasks. 
Based on our empirical findings, we recommend hard-prompting to maximize predictive accuracy and identify specific LoRA configurations (targeting query and value layers) as the preferred approach when computational efficiency is paramount. Furthermore, our findings suggest that while generative Decoders excel with abundant data, domain-specific Encoders remain the robust choice for few-shot clinical applications.
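The LoRA configuration recommended above updates only the query and value projections: each frozen weight W is augmented by a trainable low-rank product, so the effective weight is W + (alpha/r)·BA. The NumPy sketch below illustrates that update on a single toy projection; shapes and names are illustrative, not the paper's configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply x @ (W + (alpha/r) * B @ A).T, with the rank r taken from A."""
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)      # rank-r weight update, shape (d_out, d_in)
    return x @ (W + delta).T

d_in, d_out, r = 16, 16, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen base weight (e.g. a query projection)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # zero-init: training starts exactly at W
x = rng.normal(size=(2, d_in))
y0 = lora_forward(x, W, A, B, alpha=8) # with B = 0, identical to the base layer
```

Only A and B (2·16·4 = 128 values each here) are trained, which is why LoRA is attractive when computational efficiency is paramount.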

  • Identification of Main Biomarkers in NSCLC Through a Multi-Platform Genomic Network Analysis

    Date Submitted: Dec 16, 2025

    Open Peer Review Period: Jan 3, 2026 - Feb 28, 2026

    Background: Non-Small Cell Lung Cancer (NSCLC) remains the leading cause of cancer-related mortality worldwide. The identification and prioritization of molecular biomarkers involved in NSCLC pathogenesis are essential for advancing early diagnostic strategies and optimizing therapeutic interventions. Objective: This study aimed to utilize genomic network approaches and bioinformatics tools to prioritize clinically relevant biomarkers associated with NSCLC. Methods: NSCLC-associated genes were compiled from three major genomic repositories: DisGeNET, the GWAS Catalog, and cBioPortal. Subsequent analyses included gene ontology enrichment, pathway enrichment, and protein–protein interaction (PPI) network construction, followed by network-based prioritization of hub genes. Results: Data integration across the three repositories yielded 1,317 NSCLC-associated genes. Network-based prioritization identified ten key hub genes: TP53, MYC, PTEN, CTNNB1, ACTB, STAT3, CCND1, AKT1, ESR1, and HIF1A, with TP53, MYC, PTEN, and CTNNB1 as the most prominent biomarkers according to CytoHubba scoring. Conclusions: This study presents a genomic network-based framework for identifying and prioritizing potential NSCLC biomarkers, offering critical insights into the molecular underpinnings of NSCLC pathogenesis.
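Network-based hub prioritization of the kind CytoHubba performs can be illustrated with its simplest centrality measure, node degree: genes with the most interaction partners rank highest. The toy edge list below is invented for illustration and is not the study's PPI network.

```python
# Degree-based hub ranking over an invented toy PPI edge list.
from collections import Counter

edges = [("TP53", "MYC"), ("TP53", "PTEN"), ("TP53", "CTNNB1"),
         ("MYC", "CCND1"), ("PTEN", "AKT1"), ("TP53", "AKT1")]
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
hubs = [gene for gene, _ in degree.most_common(3)]
print(hubs[0])  # TP53: the highest-degree node in this toy network (degree 4)
```

CytoHubba supplements plain degree with other topological scores (e.g. MCC, closeness), but the principle, ranking nodes by network centrality, is the same.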