
Journal Description

JMIR Medical Informatics (JMI, ISSN 2291-9694, Journal Impact Factor 3.8) (Editor-in-chief: Arriel Benis, PhD, FIAHSI) is an open-access journal that focuses on the challenges and impacts of clinical informatics, digitalization of care processes, clinical and health data pipelines from acquisition to reuse, including semantics, natural language processing, natural interactions, meaningful analytics and decision support, electronic health records, infrastructures, implementation, and evaluation (see Focus and Scope).

JMIR Medical Informatics adheres to rigorous quality standards, involving a rapid and thorough peer-review process, professional copyediting, and professional production of PDF, XHTML, and XML proofs.

The journal is indexed in MEDLINE, PubMed, PubMed Central, DOAJ, Scopus, and the Science Citation Index Expanded (SCIE).

JMIR Medical Informatics received a Journal Impact Factor of 3.8 (Source: Journal Citation Reports 2025 from Clarivate).

JMIR Medical Informatics received a Scopus CiteScore of 7.7 (2024), placing it in the 79th percentile (#32 of 153) as a Q1 journal in the field of Health Informatics.


Recent Articles:

  • Research on the Prediction of Coal Workers’ Pneumoconiosis Based on Easily Detectable Clinical Data: Machine Learning Model Development and Validation Study

    Abstract:

    Background: Coal workers’ pneumoconiosis (CWP) is the most prevalent occupational disease that causes irreversible lung damage. Early prediction of CWP is the key to blocking the irreversible process of pulmonary fibrosis. The prediction of CWP based on imaging data and biomarker detection is constrained by high cost and poor convenience. Objective: We aimed to construct a prediction model for CWP from easily detectable clinical data using machine learning (ML) methods. Methods: A prediction framework was established using a moderate-sized dataset and multidimensional clinical features, including occupational information, lung function parameters, and blood indicators. Six ML algorithms (LightGBM, random forest, XGBoost, CatBoost, support vector machine, and logistic regression) were trained and evaluated using stratified 5-fold cross-validation and a held-out test set. Hyperparameter optimization was performed using a unified Optuna-based strategy to ensure fair comparison across models. Model interpretability was assessed using SHAP (Shapley additive explanations) on the top-performing models. In addition, an ablation analysis was conducted by retraining models after excluding Job-Type to assess the independent predictive value of clinical biomarkers. Results: All six models achieved consistently high predictive performance, and the differences among the top-performing models were small on the test set. After Optuna-based optimization, LightGBM and XGBoost achieved high test-set AUC values (0.974 and 0.975, respectively), while random forest (RF) achieved the highest recall (0.926) and F1 score (0.952). Compared with the baseline models, hyperparameter optimization resulted in only minor performance changes, indicating robust prediction under the current feature set and evaluation protocol. SHAP analysis consistently identified Age, FEV1/FVC, and platelet count (PLT) as key contributors to CWP risk prediction. The ablation analysis further showed that model performance remained strong after removing Job-Type, supporting the independent predictive value of clinical features beyond occupational history. Conclusions: These results confirm the potential of combining simple multidimensional features with ML algorithms to predict CWP and suggest new avenues for the early diagnosis of and intervention in CWP.
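
As a rough illustration of the tuning pipeline this abstract describes, the sketch below pairs Optuna with LightGBM under stratified 5-fold cross-validation. The search ranges, trial count, and data variables (X_train, y_train) are assumptions for illustration, not the authors' configuration.

```python
import optuna
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial, X, y):
    # Illustrative search space; the paper's actual ranges are not reproduced.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
    }
    model = lgb.LGBMClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # Mean cross-validated AUC is the quantity Optuna maximizes.
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

# study = optuna.create_study(direction="maximize")
# study.optimize(lambda t: objective(t, X_train, y_train), n_trials=100)
```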

  • Machine Learning for Predicting Venous Thromboembolism After Joint Arthroplasty: Systematic Review of Clinical Applicability and Model Performance

    Abstract:

    Background: There is increasing research on machine learning in predicting venous thromboembolism after joint arthroplasty, but the quality and clinical applicability of these models are unclear. Objective: This systematic review and meta-analysis aims to evaluate the predictive performance and methodological quality of machine learning models for venous thromboembolism risk after joint replacement surgery, and to provide insights for further clinical application. Methods: Web of Science, Embase, Scopus, CNKI, Wanfang, Vipro, and PubMed were searched until December 15, 2024. The risk of bias and applicability were evaluated using the Prediction model Risk Of Bias ASsessment Tool (PROBAST) checklist. Quantitative synthesis and meta-analysis included models reporting AUC values with 95% confidence intervals. Results: A total of 34 prediction models from 9 studies were included; the most commonly used machine learning models were extreme gradient boosting and logistic regression. Of these, 24 models with reported confidence intervals were incorporated into the meta-analysis, and the pooled area under the curve was 0.826 (95% CI 0.775-0.876). All studies indicated a high risk of bias and considerable heterogeneity. Age, gender, diabetes, and hypertension were the most frequently used predictive factors. Conclusions: The predictive performance of machine learning models varies greatly, although the reported AUC values indicate that most models have good discriminative ability. These models carry a high risk of bias, which must be taken into account when they are used in clinical practice. Future studies should adopt a prospective study design, ensure appropriate data handling, and use external validation to improve model robustness and applicability. Clinical Trial: The protocol for this study is registered with PROSPERO (registration number: CRD42024625842).
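
For readers unfamiliar with how AUCs with 95% CIs are pooled in such a meta-analysis, the sketch below shows a standard DerSimonian-Laird random-effects pooling. It is a generic reconstruction; the review's actual synthesis method and extracted values are not reproduced here.

```python
import numpy as np

def pool_auc(aucs, lowers, uppers):
    """Pool AUCs given their 95% CI bounds (DerSimonian-Laird random effects)."""
    aucs, lowers, uppers = map(np.asarray, (aucs, lowers, uppers))
    se = (uppers - lowers) / (2 * 1.96)        # SE recovered from each 95% CI
    w = 1 / se**2                              # fixed-effect inverse-variance weights
    fixed = np.sum(w * aucs) / np.sum(w)
    q = np.sum(w * (aucs - fixed) ** 2)        # Cochran's Q heterogeneity statistic
    k = len(aucs)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (se**2 + tau2)                  # random-effects weights
    pooled = np.sum(w_re * aucs) / np.sum(w_re)
    pooled_se = np.sqrt(1 / np.sum(w_re))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
```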

  • Chatbot Outreach in Value-Based Preventive Care: Retrospective Analysis

    Abstract:

    Background: As health care delivery shifts toward value-based care, proactive strategies to close preventive care gaps are essential. However, patient engagement remains suboptimal due to logistical, behavioral, and socioeconomic barriers. Traditional outreach methods, such as phone calls, emails, and postal mail, have long been used, but emerging digital approaches, such as chatbot-based messaging, offer potential advantages in scalability and personalization. Their comparative effectiveness, however, remains underexplored. Objective: This study aimed to evaluate the effectiveness of chatbot outreach compared with traditional communication methods (phone, email, mail, and multichannel) in promoting compliance with preventive screenings and wellness visits defined by the Healthcare Effectiveness Data and Information Set and the Centers for Medicare & Medicaid Services guidelines. Methods: This retrospective study evaluated patient outreach campaigns conducted from 2021 to 2023 across an integrated health system in New York. The final analytic sample included 50,145 care gaps from 41,959 eligible participants, predominantly female (29,989/50,145, 60%), White (31,857/50,145, 64%), with mean age ranging from 49.36 to 72.81 years over the study period. All participants were residents of New York state, and 81% (40,553/50,145) maintained an active relationship with a primary care provider during the participation year. Outreach modalities included automated chatbot SMS text messages, nonautomated phone calls, and organization-led email or mail campaigns. Participant data were enriched with social vulnerability scores to account for community-level disadvantages. Exposure was defined as the outreach method (chatbot, phone, email, mail, or multichannel), with assignment based on engagement history and operational protocols. The primary outcome was care-gap closure or compliance with identified measure gaps annually. Logistic regression and chi-square analyses examined associations between outreach method, patient demographics, primary care physician relationship, social vulnerability index (SVI), and compliance. Results: Phone outreach consistently achieved higher compliance than chatbot or multichannel outreach across most groups and years. Chatbot messages outperformed phone calls only in diabetes care in 2023 (odds ratio [OR] 1.81, 95% CI 1.48-2.21; P<.001). Primary care physician continuity remained a strong predictor of gap closure, especially in primary care (ORs ranged 1.36-2.61; P<.001). Higher SVI quartiles were associated with lower compliance in blood pressure, cancer care, and diabetes care groups; however, primary care outcomes showed higher odds of compliance in the third quartile of SVI, contradicting the typical linear-deprivation narrative. Women, Hispanic or Latino individuals, and Asian patients demonstrated higher odds of compliance in some groups and years. Conclusions: Outreach modality is an important, modifiable factor in preventive care adherence. While phone-based outreach remains the most effective overall approach, chatbot-based strategies may have targeted applications in digitally engaged populations such as the diabetic group. Segmented, equity-informed outreach strategies that integrate technology, patient preferences, and primary care continuity are essential to achieving high-impact, scalable outcomes in value-based care settings.
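
A minimal sketch of the kind of logistic regression reported above, estimating odds ratios of care-gap closure by outreach modality; the data frame columns (modality, gap_closed) are hypothetical names, and phone is arbitrarily taken as the reference category.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def outreach_odds_ratios(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode modality, dropping one level as the reference category.
    X = pd.get_dummies(df["modality"], drop_first=True).astype(float)
    X = sm.add_constant(X)
    res = sm.Logit(df["gap_closed"], X).fit(disp=0)
    ors = np.exp(res.params)                   # coefficients -> odds ratios
    ci = np.exp(res.conf_int())                # 95% CIs on the OR scale
    return pd.DataFrame({"OR": ors, "CI_low": ci[0], "CI_high": ci[1]})
```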

  • Trust in AI-Supported Screening in General Practice Among Urban and Rural Citizens: Cross-Sectional Study

    Abstract:

    Background: The early detection of diseases is one of the tasks of general practice. Artificial intelligence (AI)-based technologies could be useful for identifying diseases at an early stage in general practices. As roughly 90% of the population consult a general practitioner (GP) at least once a year, such technologies could increase the percentage of citizens who take part in meaningful screening measures. Objective: Considering these factors, the aim of the study was to evaluate the level of trust of citizens in rural and urban areas in AI-supported early detection measures in general practice. Methods: This cross-sectional study, which covered early detection measures with AI in general practice care among other topics, was conducted in the federal state of Schleswig-Holstein, Germany, from November to December 2023. For this purpose, 5000 adult residents of rural areas (Ostholstein, Pinneberg, Nordfriesland) and urban areas (Kiel City) were invited to take part in the survey. Data analysis was carried out using descriptive statistics, subgroup analyses, and linear and stepwise regressions to identify the factors that influence trust in AI-based diagnoses. Results: The majority of respondents (55%, n=787) considered the introduction of an AI-based screening measure to be a sign of modern medicine. Moreover, 27% (n=388) of respondents feared that the introduction of such services could lead to a deterioration in the doctor-patient relationship. The role of AI in future care was rated as (very) important by 35% (n=634). The stepwise regression analysis showed that a positive attitude toward AI in medicine was the strongest predictor (β=0.420) of trust in AI-based diagnoses. In contrast, trust in physician diagnoses was associated with lower age (β=-0.111) and shorter waiting times for test results (β=0.077). Conclusions: Trust in a GP-based diagnosis was around 6 times greater than trust in AI applications. Despite concerns about their impact on the doctor-patient relationship, a good third of participants believe that the role of AI in healthcare will grow.
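
The abstract's stepwise regression can be sketched as a simple forward-selection loop over candidate predictors; the entry criterion (P<.05) and the OLS setup below are assumptions for illustration, not the authors' exact procedure.

```python
import statsmodels.api as sm

def forward_stepwise(X, y, alpha=0.05):
    """Forward stepwise OLS: add the best-P-value predictor until none qualify."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = model.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:       # stop once no candidate meets the criterion
            break
        selected.append(best)
        remaining.remove(best)
    return sm.OLS(y, sm.add_constant(X[selected])).fit()
```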

  • Early Prediction of Delirium in Postcardiac Surgery Patients: Machine Learning Model Development and External Validation

    Abstract:

    Background: Delirium is a frequent postoperative complication among patients who have undergone cardiac surgery and is associated with prolonged hospitalization, cognitive decline, and increased mortality. Early prediction of delirium is therefore critical for initiating timely interventions. Objective: This study proposes the development and validation of a machine learning–based model to predict postoperative delirium in patients undergoing cardiac surgery during intensive care unit (ICU) care, facilitating the early detection of individuals at high risk of delirium and supporting clinicians in the deployment of targeted preventive strategies. Methods: This study extracted data on postoperative cardiac surgery patients who remained in the ICU for more than 24 hours from the Medical Information Mart for Intensive Care IV version 2.0 (MIMIC-IV 2.0) database and the eICU Collaborative Research Database (eICU-CRD). The MIMIC-IV 2.0 cohort was randomly divided into a training set and an internal validation set in a 7:3 ratio, whereas the eICU-CRD functioned as an independent validation cohort. We used data from the first 24 hours of ICU monitoring to model the likelihood of delirium over the entire ICU admission period. Delirium was identified by a positive Confusion Assessment Method for the Intensive Care Unit evaluation (ie, score ≥4). We built predictive models by using logistic regression, support vector classifier, extreme gradient boosting (XGB), and random forest classifiers. Their performance was assessed via the area under the receiver operating characteristic curve, accuracy, sensitivity, positive predictive value, negative predictive value, and F1-score. Results: The analysis involved 2124 patients from the MIMIC-IV 2.0 database and 2406 from the eICU-CRD. A set of 57 variables was selected to construct the predictive models. Among the various machine learning models tested, the XGB model demonstrated the best performance for delirium prediction during internal validation. As for external validation, the model achieved an area under the receiver operating characteristic curve of 0.75, indicating strong discriminatory ability. The most important predictive features identified by the model included hospital length of stay, minimum Glasgow Coma Scale score, mean blood pressure, Sequential Organ Failure Assessment score, weight, urine output, heart rate, and age. Conclusions: The XGB model with strong predictive capability for ICU delirium after cardiac surgery was developed and externally validated. This model offers essential technical support for building real-time delirium alert systems and enables ongoing risk stratification and evidence-based decision-making within the ICU environment.
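
A compact sketch of the 7:3 split and gradient-boosted classifier evaluation described above; X and y stand in for the extracted MIMIC-IV features and delirium labels, and the hyperparameters are illustrative defaults rather than the study's tuned values.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# X, y: placeholder feature matrix and binary delirium labels (not provided here).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(X_train, y_train)
# Internal-validation AUC on the held-out 30%.
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
```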

  • Exploring Self-Management–Based Mobile Health User Typologies and Associations Between User Types and Satisfaction With Key Mobile Health Functions:...

    Abstract:

    Background: Exploring user satisfaction is crucial for enhancing and ensuring the sustainable development of mobile health (mHealth) apps, particularly in the fitness and weight management sectors. Analyzing user types and developing user profiles are valuable for understanding differences in satisfaction. However, prior research lacks a classification of user types based on self-management characteristics and an analysis of satisfaction disparities among these types. Objective: This study analyzes user heterogeneity from a self-management perspective among fitness and weight management app users by identifying user types and constructing profiles. It further explores differences in satisfaction with the functional design of these mHealth apps across user types. Methods: First, 8 feature indicators were selected based on the Health Belief Model and the Behavior Change Wheel to evaluate users’ levels of health knowledge and beliefs, as well as self-regulation related to self-management. Existing research was integrated to categorize mHealth app functional design into 5 categories: health guidance, health education, health monitoring, social features, and gamification. Second, a questionnaire survey was used to collect data on users’ 8 health management characteristics and their satisfaction with the 5 functional design categories. A total of 2518 responses were collected, of which 1025 were included in the analysis. Cluster analysis was conducted to classify users into distinct types based on the 8 health management characteristics, and user profiles were constructed according to the distribution of these characteristics within each type. Finally, the Kruskal-Wallis test was used to analyze differences in satisfaction across user types with respect to the 5 functional design categories of mHealth apps. Results: Cluster analysis revealed that users could be categorized into 6 types based on the 8 self-management characteristics: positively proactive energizers, proactive intenders, negatively proactive energizers, low health management demanders, potential health management demanders, and passive attitude holders. Significant differences were observed across all 8 health management characteristics among the 6 user types (all P<.001). The Kruskal-Wallis test indicated significant variations in user satisfaction with the 5 functional designs of mHealth apps: H(4)=445.388 (P<.001). Overall, users reported the highest satisfaction with health guidance and health monitoring (median 4.00, IQR 1.00) and the lowest satisfaction with gamification (median 3.00, IQR 1.00). Positively proactive energizers, proactive intenders, and negatively proactive energizers demonstrated the highest satisfaction with health education and health guidance (median 4.00). Potential health management demanders, proactive intenders, positively proactive energizers, and negatively proactive energizers reported the highest satisfaction with health monitoring (median 4.00). Proactive intenders reported the highest satisfaction with social features and gamification (median 4.00). Conclusions: Users of mHealth apps exhibit diverse types, with significant differences in health management characteristics and satisfaction with the 5 functional designs of fitness and weight management apps. This study clarifies individual-level differences in user satisfaction with mHealth apps.
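
The cluster-then-compare analysis might look like the sketch below; the abstract does not name its clustering algorithm, so k-means on standardized indicators is assumed, followed by a Kruskal-Wallis H test on satisfaction scores.

```python
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_and_compare(features, satisfaction, n_clusters=6):
    """features: (n_users, 8) array of self-management indicators;
    satisfaction: (n_users,) array of satisfaction scores for one function."""
    # Standardize the 8 indicators, then assign each user to a type.
    z = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(z)
    # Kruskal-Wallis H test on satisfaction across the user types.
    groups = [satisfaction[labels == k] for k in range(n_clusters)]
    h, p = stats.kruskal(*groups)
    return labels, h, p
```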

  • Enhancing Anesthetic Depth Assessment via Unsupervised Machine Learning in Processed Electroencephalography Analysis: Novel Methodological Study

    Abstract:

    Background: General anesthesia induces temporary loss of consciousness, and electroencephalography (EEG)-based monitoring is crucial for tracking this state. However, EEG-based indices that are used to assess the depth of anesthesia can be influenced by various factors, potentially leading to misleading outputs. Objective: This study aimed to explore the feasibility of using unsupervised machine learning on processed EEG data to enhance anesthetic depth assessment. Methods: Over 16,000 data points were collected from patients who underwent elective lumbar spine surgery. The EEG data were processed using a bandpass filter and fast Fourier transform for power spectral density estimation. Unsupervised machine learning with fuzzy c-means clustering was applied to categorize anesthesia depth into three clusters: slight, proper, and deep. Results: Fuzzy c-means clustering identified distinct anesthesia depth groups based on delta, alpha, theta, and beta band power ratios. Visual representations validated the clustering results, which were consistent across individual patient data. The figures demonstrate the application of clustering to EEG data, revealing detailed anesthesia depth estimations. Conclusions: This study developed a machine learning-based methodology for anesthesia depth assessment, demonstrating feasibility and providing preliminary insights into classification, visualization, and patient-specific management. By applying fuzzy c-means clustering to processed EEG data, this approach enhances anesthesia depth understanding and integrates with existing monitoring modalities.
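
A minimal sketch of the pipeline described above, assuming Welch periodograms for band power and scikit-fuzzy's c-means implementation; the sampling rate, band edges, and fuzzifier m=2 are illustrative assumptions.

```python
import numpy as np
import skfuzzy as fuzz
from scipy.signal import welch

def band_power(eeg, fs, lo, hi):
    """eeg: (n_epochs, n_samples); returns the per-epoch power ratio in [lo, hi) Hz."""
    f, psd = welch(eeg, fs=fs, nperseg=fs * 2)
    band = psd[:, (f >= lo) & (f < hi)].sum(axis=1)
    return band / psd.sum(axis=1)

def cluster_depth(eeg_epochs, fs=128):
    # Feature matrix: delta, theta, alpha, beta ratios, shaped (features, samples)
    # as scikit-fuzzy's cmeans expects.
    feats = np.vstack([band_power(eeg_epochs, fs, lo, hi)
                       for lo, hi in [(0.5, 4), (4, 8), (8, 13), (13, 30)]])
    cntr, u, *_ = fuzz.cluster.cmeans(feats, c=3, m=2.0, error=1e-4, maxiter=1000)
    return u.argmax(axis=0)    # hard label per epoch: slight / proper / deep
```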

  • AI Scribes: Are We Measuring What Matters?

    Abstract:

    AI scribes, software that can convert speech into concise clinical documents, have achieved remarkable clinical adoption at a pace rarely seen for digital technologies in healthcare. The reasons for this are understandable: the technology works well enough, it addresses a genuine pain point for clinicians, and it has largely sidestepped regulatory requirements. In many ways, clinical adoption of AI scribes has also occurred well ahead of robust evidence of their safety and efficacy. The papers in this theme issue demonstrate real progress in the technology and evidence of its benefit: documentation times are reported to decrease when using scribes, clinicians report feeling less burdened, and the notes produced are often of reasonable quality. Yet as we survey the emerging evidence base, there remains one outstanding and urgent unanswered question: Are AI scribes safe? We need to know the clinical outcomes achievable when scribes are used compared to other forms of note taking.

  • Iterative Large Language Model–Guided Sampling and Expert-Annotated Benchmark Corpus for Harmful Suicide Content Detection: Development and Validation Study

    Abstract:

    Background: Harmful suicide content on the internet poses significant risks, as it can induce suicidal thoughts and behaviors, particularly among vulnerable populations. Despite global efforts, existing moderation approaches remain insufficient, especially in high-risk regions like South Korea, which has the highest suicide rate among OECD countries. Previous research has primarily focused on assessing the suicide risk of the authors who wrote the content rather than the harmfulness of the content itself, which can potentially lead readers to self-harm or suicide, highlighting a critical gap in current approaches. Our study addresses this gap by shifting the focus from assessing the suicide risk of content authors to evaluating the harmfulness of the content itself and its potential to induce suicide risk among readers. Objective: In this study, we aimed to develop an AI-driven system for classifying online suicide-related content into five levels: illegal, harmful, potentially harmful, harmless, and non-suicide related. Additionally, we constructed a multimodal benchmark dataset with expert annotations to improve content moderation and assist AI models in detecting and regulating harmful content more effectively. Methods: We collected 43,244 user-generated posts from various online sources, including social media, Q&A platforms, and online communities. To reduce the workload on human annotators, GPT-4 was used for preannotation, filtering, and categorization of content before manual review by medical professionals. A task description document ensured consistency in classification. Ultimately, a benchmark dataset of 452 manually labeled entries was developed, including both Korean and English versions, to support AI-based moderation. The study also evaluated zero-shot and few-shot learning to determine the best AI approach for detecting harmful content. Results: On the multimodal benchmark dataset, GPT-4 achieved the highest F1 scores (66.46 for illegal and 77.09 for harmful content detection). Image descriptions improved classification accuracy, while directly using raw images slightly decreased performance. Few-shot learning significantly enhanced detection, demonstrating that small but high-quality datasets can improve AI-driven moderation. However, translation challenges were observed, particularly with suicide-related slang and abbreviations, which were sometimes inaccurately conveyed in the English benchmark. Conclusions: This study provides a high-quality benchmark for AI-based suicide content detection, showing that large language models (LLMs) can effectively assist in content moderation while reducing the burden on human moderators. Future work will focus on enhancing real-time detection and improving the handling of subtle or disguised harmful content.
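
A hedged sketch of the few-shot setup and per-class F1 evaluation; the prompt format is invented for illustration, and the actual GPT-4 call is omitted.

```python
from sklearn.metrics import f1_score

# The five harmfulness levels from the abstract.
LEVELS = ["illegal", "harmful", "potentially harmful", "harmless",
          "non-suicide related"]

def build_prompt(few_shot_examples, post):
    """Assemble a few-shot classification prompt (illustrative format only)."""
    lines = ["Classify the post into one of: " + ", ".join(LEVELS) + "."]
    for text, label in few_shot_examples:
        lines.append(f"Post: {text}\nLabel: {label}")
    lines.append(f"Post: {post}\nLabel:")
    return "\n\n".join(lines)

def per_class_f1(y_true, y_pred):
    """Per-level F1 scores, as reported for illegal and harmful content."""
    return dict(zip(LEVELS, f1_score(y_true, y_pred, labels=LEVELS, average=None)))
```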

  • Linking Electronic Health Records for Multiple Sclerosis Research: Comparative Study of Deterministic, Probabilistic, and Machine Learning Linkage Methods

    Abstract:

    Background: Data linkage in pharmacoepidemiological research is commonly employed to ascertain exposure and outcomes, or to obtain more information about confounding variables. However, unique patient identifiers are usually not provided in order to protect patient confidentiality, which makes data linkage between various sources challenging. The Saudi Real-Evidence Researches Network (RERN) aggregates electronic health records (EHRs) from various hospitals, which may require a robust linkage technique. Objective: To evaluate and compare the performance of deterministic, probabilistic, and machine learning approaches for linking de-identified multiple sclerosis (MS) patient data from the RERN and Ministry of National Guard Health Affairs (MNGHA) EHR systems. Methods: We applied a simulation-based validation framework before linking real-world data sources. Deterministic linkage was based on predefined rules, while probabilistic linkage was based on similarity-score matching. We applied both similarity-score and classification approaches in machine learning, with models including neural networks (NN), logistic regression (LR), and random forest (RF). The performance of each approach was assessed using confusion matrices, focusing on sensitivity, positive predictive value (PPV), F1-score, and computational efficiency. Results: The study included linked data of 2,247 MS patients (spanning from 2016 to 2023). The deterministic approach resulted in an average F1-score of 97.2% in the simulation and demonstrated varying match rates in real-world linkage: 1,046 out of 2,247 (46.6%) to 1,946 out of 2,247 (86.6%). This linkage was computationally efficient with a run time of <1 second per rule. The probabilistic approach provided an average F1-score of 93.9% in the simulation, with real-world match rates ranging from 1,472 out of 2,247 (65.5%) to 2,144 out of 2,247 (95.4%), and processing times ranging from ~0.1 to ~5 seconds per rule. Although the machine learning approaches achieved high performance (F1-score reached 99.8%), they were computationally expensive. Processing time ranged from approximately 13 to 16,936 seconds for the classification approach and from approximately 13 to 7,467 seconds for the similarity-score approach. Real-world match rates from the machine learning models were highly variable depending on the method used; the similarity-score approach identified 789 out of 2,247 (35.1%) matched pairs, whereas the classification approach identified 2,014 out of 2,247 (89.6%). Conclusions: Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods, proving to be a flexible and efficient method, especially in real-world scenarios where unique identifiers are lacking. Probabilistic linkage achieved a good balance between recall and precision, enabling better integration of various data sources that could be useful in MS research.
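
A toy version of probabilistic (similarity-score) linkage over quasi-identifiers, using only the standard library; the fields, weights, and threshold are illustrative assumptions, not the study's rules.

```python
from difflib import SequenceMatcher

# Hypothetical quasi-identifier fields and weights (must sum to 1).
WEIGHTS = {"birth_year": 0.4, "sex": 0.2, "city": 0.4}

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity across the quasi-identifier fields."""
    return sum(w * similarity(str(rec_a[f]), str(rec_b[f]))
               for f, w in WEIGHTS.items())

def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    return match_score(rec_a, rec_b) >= threshold
```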

  • Prediction of First and Multiple Antiretroviral Therapy Interruptions in People Living With HIV: Comparative Survival Analysis Using Cox and Explainable...

    Abstract:

    Background: The Cox proportional hazards (CPH) model is a common choice for analyzing time to treatment interruptions in patients on antiretroviral therapy (ART), valued for its straightforward interpretability and flexibility in handling time-dependent covariates. Machine learning (ML) models have increasingly been adapted for handling temporal data, with the added advantages of handling complex, non-linear relationships and large datasets while providing clear practical interpretations. Objective: This study aims to compare the predictive performance of the traditional CPH model and ML models in predicting treatment interruptions among patients on ART, while also providing both global and individual-level explanations to support personalized, data-driven interventions for improving treatment retention. Methods: Using data from 621,115 patients who started ART between 2017 and 2023 in Kenya, we compared the performance of the CPH model with 6 ML models (gradient boosting machine, extreme gradient boosting, regularized generalized linear models [ridge, lasso, and elastic net], and recursive partitioning) in predicting first and multiple treatment interruptions. A model-agnostic explainable surrogate technique was applied to interpret the best-performing model's predictions globally, using variable importance and partial dependence profiles (PDP), and at the individual level, using break-down additive (BD), Shapley additive explanations (SHAP), and ceteris paribus (CP) profiles. Results: The recursive partitioning (RP) model achieved the best performance with a predictive concordance index (C-index) of 0.81 for first treatment interruptions and 0.89 for multiple interruptions, outperforming the CPH model, which scored 0.78 and 0.87 for the same scenarios, respectively. RP’s performance can be attributed to its ability to model non-linear relationships and automatically detect complex interactions. The global model-agnostic explanations aligned closely with the interpretations offered by hazard ratios in the CPH model, while offering additional insights into the impact of specific features on the model's predictions. The BD and SHAP explainers demonstrated how different variables contribute to the predicted risk at the individual patient level. The CP profiles further explored the time-varying model to illustrate how changes in a patient’s covariates over time could impact their predicted risk of treatment interruption. Conclusions: Our results highlight the superior predictive performance of ML models and their ability to provide patient-specific risk predictions and insights that can support targeted interventions to reduce treatment interruptions in antiretroviral therapy care.
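
The C-index comparison could be reproduced along these lines with the lifelines package; the column names are hypothetical, and the negation reflects that higher predicted hazard should correspond to shorter time to interruption.

```python
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def cox_c_index(df):
    """df: covariates plus hypothetical duration and event columns."""
    cph = CoxPHFitter()
    cph.fit(df, duration_col="months_to_interruption", event_col="interrupted")
    risk = cph.predict_partial_hazard(df).values.ravel()
    # Higher predicted hazard should pair with shorter time-to-event,
    # hence the negation before scoring concordance.
    return concordance_index(df["months_to_interruption"], -risk,
                             df["interrupted"])
```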

  • Ranking-Aware Multiple Instance Learning for Histopathology Slide Classification: Development and Validation Study

    Abstract:

    Background: Multiple instance learning (MIL) is widely used for slide-level classification in digital pathology without requiring expert annotations. However, even partial expert annotations offer valuable supervision; few studies have effectively leveraged this information within MIL frameworks. Objective: This study aims to develop and evaluate a ranking-aware MIL framework, called rank induction, that effectively incorporates partial expert annotations to improve slide-level classification performance under realistic annotation constraints. Methods: We developed rank induction, a MIL approach that incorporates expert annotations using a pairwise rank loss inspired by RankNet. The method encourages the model to assign higher attention scores to annotated regions than to unannotated ones, guiding it to focus on diagnostically relevant patches. We evaluated rank induction on 2 public datasets (Camelyon16 and DigestPath2019) and an in-house dataset (Seegene Medical Foundation-stomach; SMF-stomach) and tested its robustness under 3 real-world conditions: low-data regimes, coarse within-slide annotations, and sparse slide-level annotations. Results: Rank induction outperformed existing methodologies, achieving an area under the receiver operating characteristic curve (AUROC) of 0.839 on Camelyon16, 0.995 on DigestPath2019, and 0.875 on SMF-stomach. It remained robust under low-data conditions, maintaining an AUROC of 0.761 with only 60.2% (130/216) of the training data. When using coarse annotations (with 2240-pixel padding), performance slightly declined to 0.823. Remarkably, annotating just 20% (18/89) of the slides was enough to reach near-saturated performance (AUROC of 0.806, vs 0.839 with full annotations). Conclusions: Incorporating expert annotations through ranking-based supervision improves MIL-based classification. Rank induction remains robust even with limited, coarse, or sparsely available annotations, demonstrating its practicality in real-world scenarios.
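
The pairwise rank loss described above can be sketched in a few lines of PyTorch; this is a generic RankNet-style reconstruction from the abstract, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def rank_induction_loss(scores: torch.Tensor, annotated: torch.Tensor) -> torch.Tensor:
    """scores: (n_patches,) attention logits; annotated: (n_patches,) bool mask
    marking expert-annotated patches within a slide."""
    pos, neg = scores[annotated], scores[~annotated]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())    # nothing to rank for this slide
    # Pairwise differences s_i - s_j over every (annotated, unannotated) pair;
    # BCE-with-logits on the differences, with target 1, is the RankNet
    # objective pushing annotated patches above unannotated ones.
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)
    return F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))
```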


Latest Submissions Open for Peer-Review:

  • Development and Validation of an Interpretable Machine Learning Model for Predicting Lateral Neck Lymph Node Metastasis in Papillary Thyroid Carcinoma Based on Ultrasound Data: A Retrospective Study

    Date Submitted: Feb 4, 2026

    Open Peer Review Period: Feb 11, 2026 - Apr 8, 2026

    Background: Lateral neck lymph node metastasis (LLNM) is a major determinant of recurrence risk and surgical strategy in papillary thyroid carcinoma (PTC). However, accurate preoperative identification of LLNM remains challenging, as conventional imaging assessment is limited by operator dependency and variable diagnostic performance. Although several predictive models have been proposed, many suffer from limited generalizability or poor interpretability, hindering their integration into clinical decision-making. Objective: Accurate preoperative prediction of LLNM in PTC remains challenging, and existing models have limitations such as poor interpretability or restricted applicability. This study aimed to develop and validate an interpretable machine learning (ML) model based on routine clinical and ultrasound data to predict LLNM risk in PTC patients. Methods: A retrospective cohort study enrolled 816 PTC patients (June 2022-May 2024), randomly split into training (n=571) and internal validation (n=245) sets at a 7:3 ratio, with an independent external validation cohort of 178 patients (June 2024-May 2025). Clinical, laboratory, and routine ultrasound data were collected. Feature selection employed a three-step approach: (1) univariate and multivariate logistic regression (LR) analysis, (2) the Boruta-SHAP algorithm for importance ranking, and (3) clinical expert validation to ensure clinical relevance. Nine ML models were developed, with hyperparameter tuning via grid search and 10-fold cross-validation. Model performance was evaluated using metrics such as the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and F1-score. The SHapley Additive exPlanations (SHAP) method was used for model interpretation. Results: Eight independent risk factors were identified: gender, multifocality, age, tumor diameter, tumor location, capsular invasion, central lymph node metastasis, and uneven lateral cervical lymph node hilum echo. The Gradient Boosting Machine (GBM) model demonstrated optimal performance with an AUC of 0.905 (95% CI: 0.868-0.942), sensitivity of 0.831, specificity of 0.840, and F1-score of 0.764 in internal validation. External validation confirmed robust generalizability (AUC: 0.887, 95% CI: 0.840-0.934). SHAP analysis revealed that tumor size, gender, lateral cervical lymph node echo, central lymph node metastasis, and capsular invasion were the top five contributors to high LLNM risk, and provided individualized risk interpretation. Conclusions: This interpretable GBM model, based on routinely accessible clinical and ultrasound data, enables accurate preoperative LLNM risk stratification, supporting personalized decisions on the extent of lymph node dissection and potentially reducing unnecessary prophylactic surgery while ensuring adequate treatment for high-risk patients.
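
A sketch of the grid search with 10-fold cross-validation mentioned above, using scikit-learn's GradientBoostingClassifier; the parameter grid is an assumption for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid; the submission's actual search space is not reproduced.
param_grid = {"n_estimators": [100, 300, 500],
              "learning_rate": [0.01, 0.05, 0.1],
              "max_depth": [2, 3, 4]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
# search.fit(X_train, y_train); the tuned GBM is then search.best_estimator_
```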

  • Design and Implementation of a Cloud-Native Infrastructure for Non–Medical Health Factors Data: A Framework for Geospatial Analytics and Education

    Date Submitted: Feb 3, 2026

    Open Peer Review Period: Feb 10, 2026 - Apr 7, 2026

    Background: Non-medical health factors (NMHF), including education, income, housing, transportation, and neighborhood infrastructure, are crucial to understanding health outcomes and health equity. However, integration of these factors into research and teaching has been challenged by fragmented data sources, heterogeneous data schemas, and inconsistent geographic units. Objective: To design and evaluate a cloud-native, geospatially standardized NMHF data infrastructure that supports end-to-end data acquisition, harmonization, analytics, and visualization for research and education. Methods: We implemented a serverless architecture on Google Cloud Platform, centered on BigQuery for scalable storage and geospatial analytics, while incorporating an improved Extract–Transform–Load (ETL) pipeline for data collection and storage. This cloud-native architecture also integrated Tableau for live interactive dashboards. Reproducible SQL pipelines standardize schemas and harmonize geographies via population-weighted crosswalks between ZIP Code Tabulation Areas (ZCTAs), census tracts, counties, and states. Users access the platform through parameterized SQL queries, Python notebooks, or optional serverless APIs. We evaluated the resulting data coverage, query performance, user adoption, and educational utility of the platform. Results: The platform harmonized data for over 40 NMHF databases across deprivation, vulnerability, opportunity, instability, demographics, and outcomes from widely used public sources at the census tract and ZCTA levels. Over 50 users, including students participating in courses, capstone projects, and workshops, actively engaged with the platform’s notebooks and dashboards. The publicly accessible dashboards accrued over 1,000 unique views. The platform demonstrated support for exploratory analyses linking NMHF indicators with health outcomes, illustrating its value for hypothesis generation and geospatial storytelling. Conclusions: This geospatially standardized, education-oriented NMHF infrastructure minimizes operational friction and shortens time-to-insight for students and researchers. It provides a pragmatic foundation for future efforts in clinical integration of social risk data, scalable federated analytics, and fairness-aware health modeling.
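
The population-weighted crosswalk idea translates directly to code; the pandas sketch below is an illustrative stand-in for the platform's BigQuery SQL pipelines, with hypothetical table and column names.

```python
import pandas as pd

def reweight_to_tract(zcta_values: pd.DataFrame, crosswalk: pd.DataFrame) -> pd.Series:
    """zcta_values: columns [zcta, value] holding one NMHF indicator;
    crosswalk: columns [zcta, tract, pop_weight] (population overlap weights)."""
    merged = crosswalk.merge(zcta_values, on="zcta", how="inner")
    merged["weighted"] = merged["value"] * merged["pop_weight"]
    # Population-weighted mean of the ZCTA-level indicator within each tract.
    return merged.groupby("tract").apply(
        lambda g: g["weighted"].sum() / g["pop_weight"].sum())
```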

  • Noise-Robust Atrial Fibrillation Detection from Garment-Type Wearable Holter Electrocardiogram Monitoring Using R–R Interval-Based Deep Learning: Algorithm Development and Validation Study

    Date Submitted: Jan 22, 2026

    Open Peer Review Period: Feb 3, 2026 - Mar 31, 2026

    Background: Atrial fibrillation (AF) is a significant contributor to cardioembolic stroke, necessitating early and precise detection of AF to mitigate associated risks. Long-term Holter electrocardiography (ECG) monitoring using garment-type wearable devices produces large volumes of single-lead data with various noise artifacts. Deep learning has achieved high performance in AF detection from ECG data; however, many deep learning studies report strong performance on curated datasets or noise-controlled recordings. Comparatively fewer approaches have been developed and evaluated with an explicit strategy to maintain diagnostic accuracy in noise-included real-world wearable Holter ECG data. An alternative representation using the R–R interval (RRI) time series may reduce the dependence on waveform morphology and provide a computationally efficient pathway for robust AF screening in noisy recordings. Objective: This study aims to develop a computationally efficient, noise-robust deep learning model that leverages the irregularity of the RRI in noisy wearable monitoring environments. We evaluated the impact of the analysis window length on model performance. Methods: Single-lead Holter ECG data from 117 patients at the University of Osaka Hospital were analyzed, excluding those with atrial tachycardia/flutter. The RRIs were extracted, segmented into 1.5-, 3-, and 6-min windows, and transformed into two-dimensional histogram images. A ResNet-34–based two-dimensional convolutional neural network (2D-CNN) was trained for three-class classification. The model performance was evaluated using five-fold inter-patient cross-validation and externally validated using the MIT-BIH AF Database. Patient-level AF burden was defined as the proportion of AF duration relative to total analyzable recording time per patient; agreement between cardiologist-derived and model-estimated AF burden was assessed using Pearson’s correlation coefficient and linear regression. Results: Of 129 monitored patients (Feb 1, 2023–Nov 20, 2025), 117 were analyzed. In the internal validation, the 3-min window had superior performance (accuracy, 96.9%; AF sensitivity, 97.0%; AF specificity, 98.2%). External validation corroborated this balance (accuracy, 96.1%; AF sensitivity, 93.3%; and AF specificity, 98.7%). The 3-min model exhibited an exceptionally high correlation with the reference AF burden (r = 0.988, R² = 0.976). Conclusions: The RRI-based 2D-CNN achieved high AF classification accuracy and excellent agreement with AF burden. By utilizing RRI features and a noise-adaptive training strategy, a 3-min RRI window has emerged as a practical solution for efficient AF screening in a garment-type Holter ECG.
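
Converting an R-R interval window into a 2D histogram image might look like the sketch below; a Poincaré-style lag plot (RRI[n] vs RRI[n+1]) and the bin range are assumptions, since the abstract does not specify the exact axes.

```python
import numpy as np

def rri_to_histogram(rri_ms: np.ndarray, bins: int = 64,
                     lo: float = 300.0, hi: float = 1800.0) -> np.ndarray:
    """rri_ms: 1D array of R-R intervals (ms) within one analysis window.
    Returns a (bins, bins) image of successive-interval pairs."""
    h, _, _ = np.histogram2d(rri_ms[:-1], rri_ms[1:], bins=bins,
                             range=[[lo, hi], [lo, hi]])
    return h / max(h.max(), 1.0)    # normalize to [0, 1] before feeding the CNN
```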

  • Machine Learning for Predicting Patient Revisits and Future Diagnoses Using Electronic Health Claims Data: A Retrospective Cohort Study from Ghana

    Date Submitted: Jan 23, 2026

    Open Peer Review Period: Feb 3, 2026 - Mar 31, 2026

    Background: Health facilities globally face increasing operational pressure from rising communicable and noncommunicable disease burdens, with low- and middle-income countries experiencing the greatest challenges. To improve operational efficiency, the timely identification of healthcare use patterns and recurring care needs is essential. Objective: This study aimed to develop machine learning (ML) models that predict (1) patient revisits within 30, 90, and 180 days and (2) the most likely diagnosis at revisit, using longitudinal National Health Insurance Scheme (NHIS) claims data from a medical facility in Ghana. Methods: We conducted a retrospective cohort study using electronic health records (EHR) spanning January 2015 to August 2025. The analytical dataset comprised 111,488 visits from 34,486 unique patients. We compared five machine learning approaches: logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), and TabM (a recent parameter-efficient ensemble architecture for tabular data). Patient-level data splitting prevented information leakage between training and evaluation sets. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC-ROC), accuracy, and top-3 accuracy for multiclass disease prediction (31-54 categories depending on horizon). Feature importance was assessed using Shapley Additive exPlanations (SHAP) analysis for XGBoost and permutation importance for TabM. Results: For revisit prediction, TabM achieved the highest AUC-ROC across all horizons (0.891 at 30 days, 0.942 at 90 days, 0.973 at 180 days), followed closely by XGBoost (0.884, 0.927, 0.964). Disease prediction proved more challenging given the multiclass nature of the task; TabM achieved the highest top-3 accuracy (0.420 at 30 days, 0.626 at 90 days, and 0.635 at 180 days) and the highest standard accuracy at 90 and 180 days (0.494 and 0.492, respectively), while XGBoost achieved the highest AUC-ROC (0.666, 0.710, and 0.690). Feature importance analysis revealed that clinical visit pattern features (total visits, visit frequency) dominated revisit prediction, while demographic features (age) and current diagnosis drove disease prediction. Conclusions: Machine learning models using NHIS claims data can effectively predict hospital revisits and narrow diagnostic possibilities to clinically useful shortlists in a resource-limited hospital setting. TabM, a recent tabular deep learning architecture, demonstrated competitive or superior performance compared with gradient boosting methods, challenging assumptions about the limitations of neural networks on tabular healthcare data. These findings support the feasibility of deploying predictive analytics in Sub-Saharan African health systems with modest data infrastructure.
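
Patient-level splitting, as described above, can be done with grouped splits so that no patient appears in both sets; X, y, and patient_ids are placeholders for the claims-derived arrays, and the XGBoost settings are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# X, y, patient_ids: placeholder arrays (features, revisit labels, patient IDs).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# Grouping by patient ID keeps all of a patient's visits on one side of the split.
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
model = XGBClassifier(n_estimators=400, max_depth=5, eval_metric="logloss")
model.fit(X[train_idx], y[train_idx])
auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
```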

  • Navigating Collective Bargaining Barriers to the Implementation of AI Scribe Technology Among Ambulatory Advanced Practice Providers

    Date Submitted: Jan 29, 2026

    Open Peer Review Period: Jan 28, 2026 - Mar 25, 2026

    Background: Advanced Practice Providers (APPs) face rising documentation demands driven by increased patient access needs and productivity pressures, contributing to burnout and reduced face-to-face time with patients. AI scribe tools have been reported to reduce documentation time and to improve clinician well-being and patient interaction. Objective: To describe the process, challenges, and outcomes of implementing an AI scribe within a unionized ambulatory academic setting, focusing on collective bargaining considerations and pilot efficiency metrics among APPs. Methods: Following formal notices to labor unions and a meet-and-confer process consistent with California public employer obligations, our institution conducted a ten-week pilot (June 16, 2025 to August 31, 2025) of the Abridge AI scribe among 15 Primary Care APPs (12 nurse practitioners and 3 physician assistants). Training modules covered consent, privacy, and documentation verification. We tracked utilization, effort reduction, time spent in notes, and same-day encounter closures; aggregated results were reported to leadership. Results: Across 8100 APP encounters, the AI scribe was used in 5403 (67%) notes. Individual utilization ranged from 30% to 89%. Average effort reduction was 78% (range 20%-93%). Mean time in the note writer was 6 minutes for pilot APPs versus 11 minutes for non-pilot APPs. Same-day encounter closure improved from 66% (non-pilot) to 90% (pilot). No formal or informal union concerns were raised post-pilot. Conclusions: AI scribe technology was successfully adopted among represented APPs in compliance with collective bargaining agreements. The pilot demonstrated notable efficiency gains and improved documentation timeliness; broader deployment proceeded under the same training and regulatory constraints. Future work will examine variability in utilization and effort reduction across APPs.

  • Medical Device Integration: A Practitioner Framework for Failure Mode Analysis

    Date Submitted: Jan 21, 2026

    Open Peer Review Period: Jan 28, 2026 - Mar 25, 2026

    Modern hospitals require stable connections between medical devices and electronic health record systems for optimal patient care. Devices including fetal monitors, anesthesia machines, infusion pumps, and cardiac implants must reliably transmit patient data to clinical documentation systems. Technical or infrastructure failures affecting these connections force clinicians to document manually and lose real-time data access. Research attributes 22.5% of EHR safety events to health IT failures, often originating from interface errors. This viewpoint presents an engineering framework, derived from hands-on operational experience with device connectivity systems in varied healthcare settings, for analyzing medical device integration failures. The analysis combines firsthand experience with targeted literature review to identify common failure modes in fetal monitoring, anesthesia integration, infusion pump connectivity, and cardiac device data transfer. The framework identifies five main architectural layers vulnerable to failure: the medical device layer, data aggregation layer, interface/translation layer, EHR integration layer, and clinical presentation layer. Recurring failure patterns include full system outages, application errors, and degraded performance, with system outages predominating. A significant proportion of failures self-resolve, suggesting underlying system instability requiring investigation. Solutions range from restarting services to advanced reconfiguration and vendor support. Legacy system dependencies, inadequate monitoring, and gaps between system design and actual clinical workflow drive integration failures. Healthcare organizations should consistently monitor device feeds, establish alternate data pathways where feasible, and maintain clear downtime procedures to manage failures effectively.