
Journal Description

JMIR Medical Informatics (JMI, ISSN 2291-9694, Impact Factor: 3.2) (Editor-in-chief: Christian Lovis, MD, MPH, FACMI) is an open-access PubMed/SCIE-indexed journal that focuses on the challenges and impacts of clinical informatics, digitalization of care processes, clinical and health data pipelines from acquisition to reuse, including semantics, natural language processing, natural interactions, meaningful analytics and decision support, electronic health records, infrastructures, implementation, and evaluation (see Focus and Scope).

JMIR Medical Informatics adheres to rigorous quality standards, involving a rapid and thorough peer-review process, professional copyediting, and professional production of PDF, XHTML, and XML proofs. The journal is indexed in PubMed, PubMed Central, DOAJ, SCOPUS, and SCIE (Clarivate). In 2023, JMI received a Journal Impact Factor™ of 3.2 (5-Year Journal Impact Factor: 3.6) (Source: Journal Citation Reports™ from Clarivate, 2023).


Recent Articles:

  • A visual abstract summarizing the key findings of the review article titled “Impact of Electronic Health Record Use on Cognitive Load and Burnout Among Clinicians: Narrative Review” published in JMIR Medical Informatics in 2024. The study presents potential strategies to mitigate burnout among clinicians, such as using new technologies to improve electronic health record design, user interface, and data visualization to reduce documentation burden while enhancing patient care. Source: Image created by JMIR Publications/Authors; Copyright: JMIR Publications; URL: https://medinform.jmir.org/2024/1/e55499; License: Creative Commons Attribution (CC-BY).

    Impact of Electronic Health Record Use on Cognitive Load and Burnout Among Clinicians: Narrative Review

    Abstract:

    Cognitive load theory suggests that completing a task relies on the interplay between sensory input, working memory, and long-term memory. Cognitive overload occurs when the working memory’s limited capacity is exceeded due to excessive information processing. In health care, clinicians face increasing cognitive load as the complexity of patient care has risen, leading to potential burnout. Electronic health records (EHRs) have become a common feature in modern health care, offering improved access to data and the ability to provide better patient care. They have been added to the electronic ecosystem alongside emails and other resources, such as guidelines and literature searches. Concerns have arisen in recent years that despite many benefits, the use of EHRs may lead to cognitive overload, which can impact the performance and well-being of clinicians. We aimed to review the impact of EHR use on cognitive load and how it correlates with physician burnout. Additionally, we wanted to identify potential strategies recommended in the literature that could be implemented to decrease the cognitive burden associated with the use of EHRs, with the goal of reducing clinician burnout. Using a comprehensive literature review on the topic, we have explored the link between EHR use, cognitive load, and burnout among health care professionals. We have also noted key factors that can help reduce EHR-related cognitive load, which may help reduce clinician burnout. The research findings suggest that cognitive overload and burnout can arise when large amounts of clinical data are presented in ways that do not let users control their cognitive burden, and when complex user interfaces add extra “work” to tasks; these effects call for mitigation strategies. 
Several factors, such as the presentation of information in the EHR, the specialty, the health care setting, and the time spent completing documentation and navigating systems, can contribute to this excess cognitive load and result in burnout. Potential strategies to mitigate this might include improving user interfaces, streamlining information, and reducing documentation burden requirements for clinicians. New technologies may facilitate these strategies. The review highlights the importance of addressing cognitive overload as one of the unintended consequences of EHR adoption and potential strategies for mitigation, identifying gaps in the current literature that require further exploration.

  • Source: Pixabay; Copyright: RAEng_Publications; URL: https://pixabay.com/photos/woman-work-office-face-eyes-4941164/; License: Licensed by JMIR.

    Evaluating ChatGPT-4’s Diagnostic Accuracy: Impact of Visual Data Integration

    Abstract:

    Background: In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. Objective: This study aims to assess the impact of adding image data on ChatGPT-4’s diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. Methods: We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. Results: The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V’s performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). 
Additionally, ChatGPT-4’s self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. Conclusions: Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
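
The top-diagnosis comparison above is a standard Pearson χ2 test on a 2×2 table. A minimal, stdlib-only Python sketch (the `chi2_2x2` helper and its counts-as-arguments interface are ours, not the authors'):

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the 2x2 table
    [[a, b], [c, d]]; returns (statistic, two-sided P value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df, the chi-square survival function reduces to erfc(sqrt(x/2)).
    return stat, erfc(sqrt(stat / 2))

# Top diagnosis identified: 161/363 (ChatGPT-4V) vs 203/363 (text only).
stat, p = chi2_2x2(161, 363 - 161, 203, 363 - 203)
print(round(stat, 1), round(p, 3))  # 9.7 0.002, consistent with the reported P=.002
```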

  • AI-generated image, in response to the request "Thumbnail image for a project on prompt engineering for clinical information extraction" (Generator: Copilot/Microsoft Bing, March 10, 2024; Requestor: Sonish Sivarajkumar). Source: Created with Copilot/Microsoft Bing, an AI system; Copyright: N/A (AI-generated image); URL: https://www.bing.com/chat; License: Public Domain (CC0).

    An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and...

    Abstract:

    Background: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. Objective: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models. Methods: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches. Results: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs in zero-shot clinical NLP. GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts in clinical sense disambiguation and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. 
GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types. Conclusions: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.
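
An ensemble prompt in the sense used above can be approximated by querying one model with several prompt styles and majority-voting the answers. A minimal sketch, assuming a hypothetical `llm` callable and made-up prompt templates (not the study's actual prompts):

```python
from collections import Counter

# Hypothetical prompt templates; {text} is the clinical snippet to label.
PROMPTS = [
    "Extract the medication status (active/discontinued) from: {text}",
    "{text}\nThe medication status is ___.",
    "{text}\nLet's reason step by step, then state the medication status.",
]

def ensemble_extract(text, llm):
    """Query one model with several prompt styles and majority-vote the answers.
    `llm` is any callable mapping a prompt string to a short answer string."""
    answers = [llm(tpl.format(text=text)).strip().lower() for tpl in PROMPTS]
    return Counter(answers).most_common(1)[0][0]

# Stub model for illustration: two prompt styles agree, one dissents.
replies = iter(["active", "active", "discontinued"])
print(ensemble_extract("Pt continues metformin 500 mg BID.", lambda _: next(replies)))  # active
```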

  • Source: Freepik; Copyright: DCStudio; URL: https://www.freepik.com/free-photo/general-practitioner-taking-notes-digital-tablet-maternity-checkup-with-pregnant-patient-male-physician-having-conversation-about-pregnancy-with-future-mother-medical-office_29168461.htm#; License: Licensed by JMIR.

    Effect of Performance-Based Nonfinancial Incentives on Data Quality in Individual Medical Records of Institutional Births: Quasi-Experimental Study

    Abstract:

    Background: Despite the potential of routine health information systems in tackling persistent maternal deaths stemming from poor service quality at health facilities during and around childbirth, research has demonstrated their suboptimal performance, evident from the incomplete and inaccurate data unfit for practical use. There is a consensus that nonfinancial incentives can enhance health care providers’ commitment toward achieving the desired health care quality. However, there is limited evidence regarding the effectiveness of nonfinancial incentives in improving the data quality of institutional birth services in Ethiopia. Objective: This study aimed to evaluate the effect of performance-based nonfinancial incentives on the completeness and consistency of data in the individual medical records of women who availed institutional birth services in northwest Ethiopia. Methods: We used a quasi-experimental design with a comparator group in the pre-post period, using a sample of 1969 women’s medical records. The study was conducted in the “Wegera” and “Tach-armacheho” districts, which served as the intervention and comparator districts, respectively. The intervention comprised a multicomponent nonfinancial incentive, including smartphones, flash disks, power banks, certificates, and scholarships. Personal records of women who gave birth within 6 months before (April to September 2020) and after (February to July 2021) the intervention were included. Three distinct women’s birth records were examined: the integrated card, integrated individual folder, and delivery register. The completeness of the data was determined by examining the presence of data elements, whereas the consistency check involved evaluating the agreement of data elements among women’s birth records. The average treatment effect on the treated (ATET), with 95% CIs, was computed using a difference-in-differences model. 
Results: In the intervention district, data completeness in women’s personal records was nearly 4 times higher (ATET 3.8, 95% CI 2.2-5.5; P=.02), and consistency was approximately 12 times more likely (ATET 11.6, 95% CI 4.18-19; P=.03) than in the comparator district. Conclusions: This study indicates that performance-based nonfinancial incentives enhance data quality in the personal records of institutional births. Health care planners can adapt these incentives to improve the data quality of comparable medical records, particularly pregnancy-related data within health care facilities. Future research is needed to assess the effectiveness of nonfinancial incentives across diverse contexts to support successful scale-up. Trial Registration:
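
The difference-in-differences logic behind the ATET estimate can be illustrated in a few lines; the numbers below are invented for illustration and are not the study's data:

```python
def diff_in_diff(pre_treat, post_treat, pre_comp, post_comp):
    """Difference-in-differences: the pre-to-post change in the intervention
    district minus the pre-to-post change in the comparator district."""
    return (post_treat - pre_treat) - (post_comp - pre_comp)

# Invented completeness percentages, before and after the incentive.
print(diff_in_diff(pre_treat=60.0, post_treat=85.0, pre_comp=58.0, post_comp=62.0))  # 21.0
```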

  • Source: Freepik; Copyright: freepik; URL: https://www.freepik.com/free-photo/side-view-scientist-holding-tablet_65670588.htm; License: Licensed by JMIR.

    Impact of Translation on Biomedical Information Extraction: Experiment on Real-Life Clinical Notes

    Abstract:

    Background: Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical data sets remains a challenge. Objective: The aim of our study is to determine whether the use of English tools to extract and normalize French medical concepts on translations provides comparable performance to that of French models trained on a set of annotated French clinical notes. Methods: We compared 2 methods: one involving French-language models and one involving English-language models. For the native French method, the named entity recognition (NER) and normalization steps were performed separately. For the translated English method, after the first translation step, we compared a 2-step method and a terminology-oriented method that performs extraction and normalization at the same time. We used French, English, and bilingual annotated data sets to evaluate all stages (NER, normalization, and translation) of our algorithms. Results: The native French method outperformed the translated English method, with an overall F1-score of 0.51 [0.47; 0.55], compared with 0.39 [0.34; 0.44] and 0.38 [0.36; 0.40] for the 2 English methods tested. Conclusions: Despite recent improvements in translation models, there is a significant difference in performance between the 2 approaches in favor of the native French method, which is more effective on French medical texts, even with few annotated documents.
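
Entity-level F1, the metric reported above, can be computed as a micro-average over per-note annotation sets. A sketch with invented annotations (the mention strings and concept codes are illustrative only):

```python
def micro_f1(gold, pred):
    """Micro-averaged entity-level F1 over parallel lists of per-note
    annotation sets, each annotation being a (mention, concept ID) pair."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented annotations: the second note misses one gold concept.
gold = [{("diabète", "C0011849")}, {("HTA", "C0020538"), ("toux", "C0010200")}]
pred = [{("diabète", "C0011849")}, {("HTA", "C0020538")}]
print(round(micro_f1(gold, pred), 2))  # 0.8
```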

  • Source: Freepik; Copyright: Freepik; URL: https://www.freepik.com/free-photo/close-up-sportswoman-consulting-her-watch_1052523.htm#; License: Licensed by JMIR.

    Scalable Approach to Consumer Wearable Postmarket Surveillance: Development and Validation Study

    Abstract:

    Background: With the capability to render pre-diagnosis, consumer wearables have the potential to affect subsequent diagnosis and the level of care in the health care delivery setting. Despite this, post-market surveillance of consumer wearables has been hindered by the lack of codified terms in electronic health records (EHRs) to capture wearable use. Objective: We sought to develop a weak supervision-based approach to demonstrate the feasibility and efficacy of EHR-based post-market surveillance on consumer wearables that render atrial fibrillation (AF) pre-diagnosis. Methods: We applied data programming, in which labeling heuristics are expressed as code-based labeling functions, to detect AF pre-diagnosis incidents. A labeler model was then learned from the labeling function predictions using the Snorkel framework. Running the labeler model on the clinical notes probabilistically labeled the notes, which were then used as a training set to fine-tune a classifier called Clinical-Longformer. The resulting classifier identified patients with AF pre-diagnosis mentions. A retrospective cohort study was conducted, in which the baseline characteristics and subsequent care patterns of patients identified by the classifier were compared against those who did not receive a pre-diagnosis. Results: The labeler model learned from the labeling functions showed high accuracy (0.92; F1-score 0.77) on the training set. The classifier learned on the probabilistically labeled notes accurately identified patients with AF pre-diagnosis (0.95; F1-score 0.83). The cohort study conducted using the constructed system carried enough statistical power to verify the key findings of the Apple Heart Study, which enrolled a much larger number of participants; patients who received a pre-diagnosis tended to be older, male, and White, with higher CHA2DS2-VASc scores (P<.001). We also made a novel discovery that patients with a pre-diagnosis were enriched for anticoagulation (50.63% vs 35.85%) and eventual diagnosis of AF (13.38% vs 1.38%). 
At the index diagnosis, existence of pre-diagnosis did not stratify patients on clinical characteristics, but did correlate with anticoagulant prescription (P = .004 for apixaban and P = .01 for rivaroxaban). Conclusions: Our work establishes the feasibility and efficacy of an EHR-based surveillance system for consumer wearables that render AF pre-diagnosis. Further work is necessary to generalize these findings for patient populations at other sites.
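
The data-programming step described above can be sketched with hand-written labeling functions and a simple majority vote standing in for Snorkel's learned labeler model. The regex heuristics below are invented for illustration, not the study's actual labeling functions:

```python
import re

ABSTAIN, NEG, POS = -1, 0, 1

# Invented labeling functions: each votes POS (note mentions a wearable AF
# pre-diagnosis), NEG, or ABSTAIN.
def lf_watch_alert(note):
    return POS if re.search(r"(apple\s*watch|wearable).{0,40}(afib|atrial fib)", note, re.I) else ABSTAIN

def lf_negated(note):
    return NEG if re.search(r"denies|no (alert|notification)", note, re.I) else ABSTAIN

def lf_rhythm_alert(note):
    return POS if re.search(r"irregular (rhythm|pulse) notification", note, re.I) else ABSTAIN

def majority_label(note, lfs):
    """Stand-in for Snorkel's learned labeler: majority vote over the
    labeling functions that do not abstain."""
    votes = [v for v in (lf(note) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POS if votes.count(POS) >= votes.count(NEG) else NEG

note = "Pt presents after Apple Watch AFib alert; irregular rhythm notification x2."
print(majority_label(note, [lf_watch_alert, lf_negated, lf_rhythm_alert]))  # 1 (POS)
```

In the study itself, these probabilistic labels then train a downstream text classifier rather than being used directly.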

  • Source: Freepik; Copyright: Drazen Zigic; URL: https://www.freepik.com/free-photo/closeup-woman-greeting-her-doctor-while-using-smart-phone-having-video-call-focus-is-female-doctor-touchscreen_26144009.htm#; License: Licensed by JMIR.

    Use of Video in Telephone Triage in Out-of-Hours Primary Care: Register-Based Study

    Abstract:

    Background: Out-of-hours primary care (OOH-PC) is challenged by high workload, workforce shortage, and long waiting and transportation time for patients. Use of video enables triage professionals to visually assess the patients, potentially allowing more contacts to end in telephone triage instead of being referred to more resource-demanding clinic consultations or home visits. Thereby, video use may help reduce health care resources in OOH-PC. Objective: This study aimed to investigate video use in telephone triage contacts to OOH-PC in Denmark by studying the usage rate and the potential associations between video use and patient- and contact-related characteristics and between video use and triage outcomes and follow-up contacts. We hypothesized that video use could serve to reduce health care resources in OOH-PC. Methods: This register-based study included all telephone triage contacts to OOH-PC in 4 of the 5 Danish regions from March 15, 2020, to December 1, 2021. We linked data from the OOH-PC electronic registration systems to national registers and identified telephone triage contacts with video use (video contacts) and without video use (telephone contacts). Calculating crude and adjusted incidence rate ratios (IRRs), we investigated the association between patient- and contact-related characteristics and video contacts and measured the frequency of different triage outcomes and follow-up contacts after video contacts compared to telephone contacts. Results: Of 2,900,566 identified telephone triage contacts to OOH-PC, 9.5% were conducted as video contacts. The frequency of video contacts was unevenly distributed across patient- and contact-related characteristics; video was used more often for employed young patients without comorbidities who contacted OOH-PC more than 4 hours before the opening hours of daytime general practice. Compared to telephone contacts, significantly more video contacts ended with advice and self-care (adjusted IRR 1.21, 95% CI 1.21-1.21) and no follow-up contact (adjusted IRR 1.08, 95% CI 1.08-1.09). Conclusions: This study supports our hypothesis that video contacts could reduce health care resources in OOH-PC. Video use was associated with a lower frequency of referrals to a clinic consultation or a home visit and a lower frequency of follow-up contacts. However, the results could be biased due to confounding by indication, reflecting that triage GPs use video for a specific set of reasons for encounter.
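
A crude incidence rate ratio, the starting point for the adjusted IRRs reported above, is simply the ratio of two rates. A sketch with invented counts (not the study's data):

```python
def crude_irr(events_a, denom_a, events_b, denom_b):
    """Crude incidence rate ratio: the event rate in group A divided by
    the event rate in group B."""
    return (events_a / denom_a) / (events_b / denom_b)

# Invented counts: contacts ending with self-care advice per 1000 video
# contacts vs per 1000 telephone triage contacts.
print(round(crude_irr(520, 1000, 430, 1000), 2))  # 1.21
```

The adjusted IRRs in the abstract additionally control for patient- and contact-related characteristics via regression, which this sketch does not attempt.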

  • AI-generated image, in response to the request "Create a realistic image of a pharmacist entering a report on a computer, focusing on her hands while hiding the screen's content and her face, with a clean table set up in a professional pharmacy environment." (Generator: DALL-E 2/OpenAI, February 29, 2024; Requestor: Sim Mei Choo). Source: Created with DALL-E, an AI system by OpenAI; Copyright: N/A (AI-generated image); URL: https://medinform.jmir.org/2024/1/e49643/; License: Public Domain (CC0).

    Data-Driven Identification of Factors That Influence the Quality of Adverse Event Reports: 15-Year Interpretable Machine Learning and Time-Series Analyses of...

    Abstract:

    Background: The completeness of adverse event (AE) reports, crucial for assessing putative causal relationships, is measured using the vigiGrade completeness score in VigiBase, the World Health Organization global database of reported potential AEs. Malaysian reports have surpassed the global average score (approximately 0.44), achieving a 5-year average of 0.79 (SD 0.23) as of 2019 and approaching the benchmark for well-documented reports (0.80). However, the contributing factors to this relatively high report completeness score remain unexplored. Objective: This study aims to explore the main drivers influencing the completeness of Malaysian AE reports in VigiBase over a 15-year period using vigiGrade. A secondary objective was to understand the strategic measures taken by the Malaysian authorities leading to enhanced report completeness across different time frames. Methods: We analyzed 132,738 Malaysian reports (2005-2019) recorded in VigiBase up to February 2021, split into historical International Drug Information System (INTDIS; n=63,943, 48.17% in 2005-2016) and newer E2B (n=68,795, 51.83% in 2015-2019) format subsets. For machine learning analyses, we performed a 2-stage feature selection followed by a random forest classifier to identify the top features predicting well-documented reports. We subsequently applied tree Shapley additive explanations to examine the magnitude, prevalence, and direction of feature effects. In addition, we conducted time-series analyses to evaluate chronological trends and potential influences of key interventions on reporting quality. Results: Among the analyzed reports, 42.84% (56,877/132,738) were well documented, with an increase of 65.37% (53,929/82,497) since 2015. Over two-thirds (46,186/68,795, 67.14%) of the Malaysian E2B reports were well documented compared to INTDIS reports at 16.72% (10,691/63,943). 
For INTDIS reports, higher pharmacovigilance center staffing was the primary feature positively associated with being well documented. In recent E2B reports, the top positive features included reaction abated upon drug dechallenge, reaction onset or drug use duration of <1 week, dosing interval of <1 day, reports from public specialist hospitals, reports by pharmacists, and reaction duration between 1 and 6 days. In contrast, reports from product registration holders and other health care professionals and reactions involving product substitution issues negatively affected the quality of E2B reports. Multifaceted strategies and interventions comprising policy changes, continuity of education, and human resource development laid the groundwork for AE reporting in Malaysia, whereas advancements in technological infrastructure, pharmacovigilance databases, and reporting tools concurred with increases in both the quantity and quality of AE reports. Conclusions: Through interpretable machine learning and time-series analyses, this study identified key features that positively or negatively influence the completeness of Malaysian AE reports and unveiled how Malaysia has developed its pharmacovigilance capacity via multifaceted strategies and interventions. These findings will guide future work in enhancing pharmacovigilance and public health.
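
A vigiGrade-style completeness score multiplies in a penalty for each missing documentation dimension, with scores above 0.80 counting as well documented. The dimensions and penalty weights below are illustrative only, not the published vigiGrade values:

```python
# Invented penalty weights, applied multiplicatively for each missing
# documentation dimension (the real vigiGrade weights differ).
PENALTIES = {
    "time_to_onset": 0.5,
    "indication": 0.7,
    "outcome": 0.7,
    "sex": 0.7,
    "age": 0.7,
    "dose": 0.7,
    "country": 0.7,
    "free_text": 0.7,
}

def completeness(report):
    """vigiGrade-style score: start at 1.0 and multiply in a penalty for
    each empty or absent dimension; >0.80 counts as well documented."""
    score = 1.0
    for field, penalty in PENALTIES.items():
        if not report.get(field):
            score *= penalty
    return score

report = {"time_to_onset": "2 days", "indication": "hypertension",
          "outcome": "recovered", "sex": "F", "age": 54, "dose": "10 mg od",
          "country": "MY", "free_text": ""}
print(round(completeness(report), 2))  # 0.7 (one missing dimension)
```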

  • Source: pxhere; Copyright: pxhere; URL: https://pxhere.com/en/photo/1179849; License: Public Domain (CC0).

    Toward Fairness, Accountability, Transparency, and Ethics in AI for Social Media and Health Care: Scoping Review

    Abstract:

    Background: The use of social media for disseminating health care information has become increasingly prevalent, making the expanding role of artificial intelligence (AI) and machine learning in this process both significant and inevitable. This development raises numerous ethical concerns. This study explored the ethical use of AI and machine learning in the context of health care information on social media platforms (SMPs). It critically examined these technologies from the perspectives of fairness, accountability, transparency, and ethics (FATE), emphasizing computational and methodological approaches that ensure their responsible application. Objective: This study aims to identify, compare, and synthesize existing solutions that address the components of FATE in AI applications in health care on SMPs. Through an in-depth exploration of computational methods, approaches, and evaluation metrics used in various initiatives, we sought to elucidate the current state of the art and identify existing gaps. Furthermore, we assessed the strength of the evidence supporting each identified solution and discussed the implications of our findings for future research and practice. In doing so, we made a unique contribution to the field by highlighting areas that require further exploration and innovation. Methods: Our research methodology involved a comprehensive literature search across PubMed, Web of Science, and Google Scholar. We used strategic searches through specific filters to identify relevant research papers published since 2012 focusing on the intersection and union of different literature sets. The inclusion criteria were centered on studies that primarily addressed FATE in health care discussions on SMPs; those presenting empirical results; and those covering definitions, computational methods, approaches, and evaluation metrics. 
Results: Our findings present a nuanced breakdown of the FATE principles, aligning them where applicable with the American Medical Informatics Association ethical guidelines. By dividing these principles into dedicated sections, we detailed specific computational methods and conceptual approaches tailored to enforcing FATE in AI-driven health care on SMPs. This segmentation facilitated a deeper understanding of the intricate relationship among the FATE principles and highlighted the practical challenges encountered in their application. It underscored the pioneering contributions of our study to the discourse on ethical AI in health care on SMPs, emphasizing the complex interplay and the limitations faced in implementing these principles effectively. Conclusions: Despite the existence of diverse approaches and metrics to address FATE issues in AI for health care on SMPs, challenges persist. The application of these approaches often intersects with additional ethical considerations, occasionally leading to conflicts. Our review highlights the lack of a unified, comprehensive solution for fully and effectively integrating FATE principles in this domain. This gap necessitates careful consideration of the ethical trade-offs involved in deploying existing methods and underscores the need for ongoing research.

  • AI-generated image, in response to the request "Image for a project on physical rehabilitation exercise information extraction from clinical notes" (Generator: Dall-E/OpenAI March 10, 2024; Requestor: Sonish Sivarajkumar). Source: Created with DALL-E, an AI system by OpenAI; Copyright: N/A (AI-Generated image); URL: https://medinform.jmir.org/2024/1/e52289/; License: Public Domain (CC0).

    Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study

    Abstract:

    Background: The rehabilitation of a patient who had a stroke requires precise, personalized treatment plans. Natural language processing (NLP) offers the potential to extract valuable exercise information from clinical notes, aiding in the development of more effective rehabilitation strategies. Objective: This study aims to develop and evaluate a variety of NLP algorithms to extract and categorize physical rehabilitation exercise information from the clinical notes of patients who had a stroke treated at the University of Pittsburgh Medical Center. Methods: A cohort of 13,605 patients diagnosed with stroke was identified, and their clinical notes containing rehabilitation therapy notes were retrieved. A comprehensive clinical ontology was created to represent various aspects of physical rehabilitation exercises. State-of-the-art NLP algorithms were then developed and compared, including rule-based, machine learning–based algorithms (support vector machine, logistic regression, gradient boosting, and AdaBoost) and large language model (LLM)–based algorithms (ChatGPT [OpenAI]). The study focused on key performance metrics, particularly F1-scores, to evaluate algorithm effectiveness. Results: The analysis was conducted on a data set comprising 23,724 notes with detailed demographic and clinical characteristics. The rule-based NLP algorithm demonstrated superior performance in most areas, particularly in detecting the “Right Side” location with an F1-score of 0.975, outperforming gradient boosting by 0.063. Gradient boosting excelled in “Lower Extremity” location detection (F1-score: 0.978), surpassing rule-based NLP by 0.023. It also showed notable performance in the “Passive Range of Motion” detection with an F1-score of 0.970, a 0.032 improvement over rule-based NLP. The rule-based algorithm efficiently handled “Duration,” “Sets,” and “Reps” with F1-scores up to 0.65. 
LLM-based NLP, particularly ChatGPT with few-shot prompts, achieved high recall but generally lower precision and F1-scores. However, it notably excelled in “Backward Plane” motion detection, achieving an F1-score of 0.846, surpassing the rule-based algorithm’s 0.720. Conclusions: The study successfully developed and evaluated multiple NLP algorithms, revealing the strengths and weaknesses of each in extracting physical rehabilitation exercise information from clinical notes. The detailed ontology and the robust performance of the rule-based and gradient boosting algorithms demonstrate significant potential for enhancing precision rehabilitation. These findings contribute to the ongoing efforts to integrate advanced NLP techniques into health care, moving toward predictive models that can recommend personalized rehabilitation treatments for optimal patient outcomes.
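
A rule-based extractor for attributes such as “Sets” and “Reps” can be as simple as a few regular expressions; the patterns below are illustrative, not the study's actual rules:

```python
import re

# Illustrative rule-based patterns for exercise attributes in therapy notes.
PATTERNS = {
    "sets": re.compile(r"(\d+)\s*sets?", re.I),
    "reps": re.compile(r"(\d+)\s*(?:reps?|repetitions?)", re.I),
}

def extract_exercise_attrs(note):
    """Return a dict of the first matched value per attribute."""
    out = {}
    for attr, pat in PATTERNS.items():
        m = pat.search(note)
        if m:
            out[attr] = int(m.group(1))
    return out

note = "Pt performed sit-to-stand, 3 sets of 10 reps, min assist."
print(extract_exercise_attrs(note))  # {'sets': 3, 'reps': 10}
```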

  • NTU Emergency Department. Source: The Authors; Copyright: The Authors; URL: https://medinform.jmir.org/2024/1/e48862/; License: Creative Commons Attribution (CC-BY).

    Interpretable Deep Learning System for Identifying Critical Patients Through the Prediction of Triage Level, Hospitalization, and Length of Stay: Prospective...

    Abstract:

Background: Triage is the process of accurately assessing patients’ symptoms and providing them with proper clinical treatment in the emergency department (ED). While many countries have developed their triage process to stratify patients’ clinical severity and thus distribute medical resources, there are still some limitations of the current triage process. Since the triage level is mainly identified by experienced nurses based on a mix of subjective and objective criteria, mis-triage often occurs in the ED. It can not only cause adverse effects on patients, but also impose an undue burden on the health care delivery system. Objective: Our study aimed to design a prediction system based on triage information, including demographics, vital signs, and chief complaints. The proposed system can not only handle heterogeneous data, including tabular data and free-text data, but also provide interpretability for better acceptance by the ED staff in the hospital. Methods: In this study, we proposed a system comprising 3 subsystems, with each of them handling a single task, including triage level prediction, hospitalization prediction, and length of stay prediction. We used a large amount of retrospective data to pretrain the model, and then, we fine-tuned the model on a prospective data set with gold-standard labels. The proposed deep learning framework was built with TabNet and MacBERT (Chinese version of bidirectional encoder representations from transformers [BERT]). Results: The performance of our proposed model was evaluated on data collected from the National Taiwan University Hospital (901 patients were included). The model achieved promising results on the collected data set, with accuracy values of 63%, 82%, and 71% for triage level prediction, hospitalization prediction, and length of stay prediction, respectively. Conclusions: Our system improved the prediction of 3 different medical outcomes when compared with other machine learning methods.
With the pretrained vital sign encoder and the MacBERT encoder further pretrained with masked language modeling, our multimodal model can provide deeper insight into the characteristics of electronic health records. Additionally, by providing interpretability, we believe the proposed system can assist nursing staff and physicians in making appropriate medical decisions.
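The multimodal design the abstract describes — a tabular encoder (TabNet in the paper) and a text encoder (MacBERT) whose embeddings are combined before a prediction head — can be sketched conceptually. The stand-in "encoders," embedding dimensions, and vital-sign values below are placeholders, not the authors' architecture.

```python
import numpy as np

# Conceptual sketch of multimodal fusion: one embedding from structured
# triage data, one from free-text chief complaints, concatenated before
# a downstream classifier. Random projections stand in for the trained
# TabNet and MacBERT encoders used in the paper.
rng = np.random.default_rng(0)

def tabular_encoder(vitals: np.ndarray) -> np.ndarray:
    """Placeholder for TabNet: project vital signs to a 16-dim embedding."""
    W = rng.normal(size=(len(vitals), 16))
    return np.tanh(vitals @ W)

def text_encoder(chief_complaint: str) -> np.ndarray:
    """Placeholder for MacBERT's pooled output: a 32-dim embedding."""
    return rng.normal(size=32)

vitals = np.array([36.8, 98.0, 120.0, 80.0, 18.0])  # temp, HR, SBP, DBP, RR
fused = np.concatenate([tabular_encoder(vitals), text_encoder("chest pain")])
print(fused.shape)  # (48,)
```

In the actual system, the fused representation would feed task-specific heads for triage level, hospitalization, and length-of-stay prediction.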

  • Source: Freepik; Copyright: katemangostar; URL: https://www.freepik.com/free-photo/close-up-girl-sweater-viewing-online-information_3798993.htm; License: Licensed by JMIR.

    A Mobile App (Concerto) to Empower Hospitalized Patients in a Swiss University Hospital: Development, Design, and Implementation Report

    Abstract:

Background: Patient empowerment can be associated with better health outcomes, especially in the management of chronic diseases. Digital health has the potential to promote patient empowerment. Objective: Concerto is a mobile app designed to promote patient empowerment in an in-patient setting. This implementation report focuses on the lessons learned during its implementation. Methods: The app was conceptualized and prototyped during a hackathon. Concerto uses hospital information system (HIS) data to offer the following key functionalities: a care schedule, targeted medical information, practical information, information about the on-duty care team, and a medical round preparation module. Funding was obtained following a feasibility study, and the app was developed and implemented in four pilot divisions of a Swiss University Hospital using institution-owned tablets. Results: The project lasted for 2 years, with effective implementation in the four pilot divisions, and remained within budget. The workload induced on caregivers impaired project sustainability and warranted a change in our implementation strategy. The presence of a “killer feature” would have facilitated deployment. Furthermore, our experience is in line with the well-accepted need for both high-quality user training and a suitable selection of superusers. Finally, by presenting HIS data directly to the patient, Concerto highlighted data that were not fit for purpose and triggered data curation and standardization initiatives. Conclusions: This implementation report presents a real-world example of designing, developing, and implementing a patient-empowering mobile app in a university hospital in-patient setting, with a particular focus on the lessons learned. One limitation of the study is the lack of a predefined “key success” indicator.


Latest Submissions Open for Peer-Review:

  • Machine learning-based model stacking and multi-key biomarker association for rapid differentiation of patients with acute chest pain: a multicenter study with subgroup bias evaluation

    Date Submitted: Apr 4, 2024

    Open Peer Review Period: Apr 4, 2024 - May 30, 2024

    Despite significant advancements in cardiovascular disease research, the rapid diagnosis of acute high-risk chest pain diseases such as acute myocardial infarction (AMI), pulmonary embolism (PE), and aortic dissection (AD) remains a challenge in the emergency setting. We developed a multicenter risk prediction model (Stacking-Optuna) that rapidly and accurately distinguishes between AMI, PE, and AD. This model underscores the importance of biomarkers such as cardiac troponin, brain natriuretic peptide, and D-dimer while addressing the limitations of current diagnostic methods, especially in terms of considering patient age, gender, and the combined use of different indicators. The model integrates three large-scale databases for training and validation. The results demonstrate that Stacking-Optuna exhibits exceptional discriminative ability for all three acute chest pain diseases (AMI group: AUC=0.9380 [95%CI 0.9160-0.9480], PE group: AUC=0.9480 [95%CI 0.9220-0.9560], AD group: AUC=0.9540 [95%CI 0.9260-0.9640]). Additionally, the interpretability analysis and bias evaluation further reveal the model's consistency and generalizability in a clinical context (the median AUC across bias evaluations covering fifteen subgroups was above 0.83). This clinical knowledge-based fusion and data-driven approach supports rapid and accurate risk assessment of patients with emergency chest pain in the ICU, providing a more effective diagnostic tool for patients with cardiovascular emergencies.
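    Model stacking, as named in this submission, combines out-of-fold predictions from several base learners through a meta-learner. The sketch below shows the generic technique with scikit-learn; the synthetic data, choice of base learners, and omission of the Optuna hyperparameter search are all assumptions, not the authors' actual pipeline.

```python
# Minimal stacking sketch: base learners' out-of-fold predictions feed a
# logistic-regression meta-learner. Synthetic features stand in for the
# biomarkers (troponin, BNP, D-dimer) named in the submission; the real
# Stacking-Optuna model also tunes hyperparameters with Optuna.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)  # 3 classes ~ AMI/PE/AD
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # cross-validated predictions avoid leaking training labels
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 2))
```

The `cv=5` argument is what makes this stacking rather than simple blending: the meta-learner only ever sees predictions made on data the base learners were not fitted on.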

  • Does facilitating trust calibration for artificial-intelligence-driven differential diagnoses list improve physicians' diagnostic accuracy?: A quasi-experimental study

    Date Submitted: Mar 22, 2024

    Open Peer Review Period: Apr 1, 2024 - May 27, 2024

    Background: Despite the usefulness of artificial intelligence (AI)-based diagnostic decision-support systems, the over-reliance of physicians on AI-generated diagnoses may lead to diagnostic errors. Objective: We investigated the safe use of AI-based diagnostic-support systems with trust calibration, adjusting trust levels to the AI’s actual reliability. Methods: A quasi-experimental study was conducted at Dokkyo Medical University, Japan, with physicians allocated (1:1) to the intervention and control groups. The participants reviewed medical histories of 20 clinical cases generated by an AI-driven automated medical history-taking system, together with an AI-generated list of 10 differential diagnoses, and provided one to three possible diagnoses. In the intervention group, physicians were asked to consider whether the final diagnosis was included in the AI-generated list of 10 differential diagnoses, which served as trust calibration. We analyzed the diagnostic accuracy of physicians and the correctness of trust calibration in the intervention group. Results: Among the 20 physicians assigned to the intervention (n=10) and control (n=10) groups, diagnostic accuracy was 41.5% and 46.0%, respectively, with no significant difference (odds ratio 1.20, 95% confidence interval [CI] 0.81–1.78, P=.42). The overall accuracy of the trust calibration was only 61.5%, and even with correct calibration, diagnostic accuracy was only 54.5%. Conclusions: Trust calibration did not significantly improve physicians’ diagnostic accuracy when considering differential diagnoses generated by reading medical histories and the differential diagnosis lists of an AI-driven automated medical history-taking system. This study underscores the limitations of the extant trust-calibration system and highlights the need for supportive measures alongside trust calibration rather than relying on trust calibration alone.

  • Do patients still need radiologists? Quality assessment of explanations of radiological reports using seven large language models from points of view of radiologists, clinicians, and patients

    Date Submitted: Apr 1, 2024

    Open Peer Review Period: Apr 1, 2024 - May 27, 2024

    Background: Patients have started to use large language models (LLMs) to explain their medical reports. It is unclear whether radiologists are still necessary. Objective: To assess the quality of explanations of radiological reports generated by LLMs from the points of view of radiologists, clinicians, and patients. Methods: In this preliminary case study, we created a fictitious oncological radiology report and generated explanations using three prompt styles via seven LLMs, twice each. In a questionnaire, three radiologists, three clinicians, and three volunteers were asked to evaluate the quality of the forty-two explanations on a Likert scale, to mark inappropriate text, and to add comments if they wished. We quantitatively analyzed the Likert ratings and inappropriate-text markings and conducted an inductive free-text analysis. Results: The radiologists agreed that some of the explanations were factually correct (61%) and not potentially harmful to the patient (55%), but were not complete enough (38%). The clinicians considered that some of the explanations could make the content easier to understand (48%), improve their communication with patients and radiologists (60%), and potentially guide subsequent treatment (52%). The volunteers thought that only a fraction of the explanations could help them understand the content (40%), make them feel cared for (37%), and let them know what to do next (40%). The mean number of characters (percentage) of inappropriate text per explanation was 4.1 (1.2%). The explanations included obviously irrelevant content, hallucination of nonexistent lesions, and incorrect descriptions of malignant lesions as benign. Conclusions: LLMs have the potential to improve patient-centered care in radiology, but the explanations of radiological reports have not yet satisfied stakeholders and still require supervision from medical experts. Clinical Trial: None

  • Performance of an Electronic Health Record-based automated Pulmonary Embolism Severity Index score calculator: a cohort study in the Emergency Department

    Date Submitted: Mar 25, 2024

    Open Peer Review Period: Mar 25, 2024 - May 20, 2024

    Background: Studies suggest that less than 4% of patients with pulmonary embolisms (PE) are managed in the outpatient setting. Strong evidence and multiple guidelines support the use of the Pulmonary Embolism Severity Index (PESI) for identification of acute PE patients appropriate for outpatient management. However, calculating the PESI score can be inconvenient in a busy emergency department (ED). To facilitate integration into ED workflow, we created a 2023 Epic™-compatible clinical decision support (CDS) tool that automatically calculates the PESI score in real time from patients’ electronic health data (ePESI). Objective: The primary objectives of this study were to determine the overall accuracy of ePESI and its ability to correctly distinguish high- and low-risk PESI scores within the Epic™ 2023 software. The secondary objective was to identify variables that impact ePESI accuracy. Methods: We collected ePESI scores on 500 consecutive patients at least 18 years old who underwent a computed tomography pulmonary embolism (CT-PE) scan in the ED of our tertiary, academic health center between January 3 and February 15, 2023. We compared ePESI results to a PESI score calculated by two independent, medically trained abstractors blinded to the ePESI and each other’s results. ePESI accuracy was calculated with a binomial test. Odds ratio (OR) was calculated with logistic regression. Results: Overall, 203 (40.6%) and 297 (59.4%) patients had low- and high-risk PESI scores, respectively. The ePESI exactly matched the calculated PESI in 394/500 cases, for an accuracy of 78.8% (95% CI 74.9-82.3%), and correctly identified low- vs. high-risk in 477/500 (95.4%) cases. The accuracy of the ePESI was higher for low-risk scores (OR 2.96, P<.0001) and lower when patients had no prior encounters in the health system (OR 0.42, P=.0075). Conclusions: In this single-center study, the ePESI was highly accurate in discriminating between low- and high-risk scores.
The CDS should facilitate real-time identification of patients who may be candidates for outpatient PE management.
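    The PESI score the CDS automates is itself a simple additive rule over eleven clinical variables. A plain-Python version of the published scoring rule (Aujesky et al., 2005) is sketched below; the function names and the example patient are illustrative, and the study's actual Epic-integrated implementation is not shown here.

```python
def pesi_score(age, male, cancer, heart_failure, chronic_lung_disease,
               pulse, sbp, resp_rate, temp_c, altered_mental_status, sao2):
    """Original Pulmonary Embolism Severity Index: age in years plus
    fixed point values for each adverse predictor present."""
    score = age                                        # 1 point per year of age
    score += 10 if male else 0
    score += 30 if cancer else 0
    score += 10 if heart_failure else 0
    score += 10 if chronic_lung_disease else 0
    score += 20 if pulse >= 110 else 0                 # beats/min
    score += 30 if sbp < 100 else 0                    # systolic BP, mm Hg
    score += 20 if resp_rate >= 30 else 0              # breaths/min
    score += 20 if temp_c < 36.0 else 0
    score += 60 if altered_mental_status else 0
    score += 20 if sao2 < 90 else 0                    # O2 saturation, %
    return score

def pesi_risk(score):
    # Classes I-II (score <= 85) are conventionally considered low risk,
    # matching the low-/high-risk split used in the abstract.
    return "low" if score <= 85 else "high"

s = pesi_score(age=54, male=True, cancer=False, heart_failure=False,
               chronic_lung_disease=False, pulse=92, sbp=128, resp_rate=18,
               temp_c=37.1, altered_mental_status=False, sao2=96)
print(s, pesi_risk(s))  # 64 low
```

Because every input comes from routinely charted fields, the rule is well suited to automated calculation from the electronic health record, which is exactly what the ePESI tool does.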

  • Best Practices for Implementation and Development of Data Models and Structures based on FHIR: A Systematic Scoping Review

    Date Submitted: Mar 17, 2024

    Open Peer Review Period: Mar 20, 2024 - May 15, 2024

    Background: Data models play a crucial role in facilitating clinical research and taking full advantage of clinical data stored in medical systems; the data, as well as the relationships between them, are expected to be in a standardized format to enable reproducible research. Using the Fast Healthcare Interoperability Resources (FHIR) standard for clinical data representation is a practical methodology to enhance and accelerate interoperability and data availability for research. Objective: To investigate data models utilizing the FHIR standard and to offer a comprehensive overview of best practices for developing and implementing these data models, as well as a summary of tools, mappings, limitations, and other important details of the selected models. Methods: To ensure the extraction of reliable results, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist. We analyzed articles indexed in PubMed, Scopus, Web of Science, IEEE Xplore, the ACM Digital Library, and Google Scholar, using Boolean operators to merge relevant keywords and related terms. Results: We categorized the reviewed articles into two main groups: pipeline-based data models and non-linear data models. We summarized each included article and extracted information about FHIR resources, technologies and standards, and mappings. We additionally extracted and summarized the limitations of each study to provide a comprehensive view of the potential challenges that future researchers may face. Conclusions: Based on the results of our review, FHIR can be a very promising standard for developing interoperable data models and infrastructures, despite presenting some challenges in the development phase.
    Policymakers and health care specialists can apply this standard in fields such as health care, research, administration, and finance. Additionally, when developing data models, this standard can be integrated with other health-related standards to propose more interoperable solutions.
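    To make the standardized representation FHIR provides concrete, the snippet below builds a minimal FHIR R4 Patient resource as JSON. It is a generic illustration of the resource format, not an excerpt from any data model reviewed in the article; the identifier system URL and patient details are made up.

```python
import json

# A minimal, hand-written FHIR R4 Patient resource. Field names follow the
# FHIR specification; the identifier system, MRN, and demographics below
# are hypothetical placeholders.
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "identifier": [{
        "system": "http://hospital.example.org/mrn",  # hypothetical MRN namespace
        "value": "12345",
    }],
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-02",
}

print(json.dumps(patient, indent=2))
```

    Because every conformant system reads and writes resources with this same shape, mapping local data models onto FHIR resources is the interoperability step the reviewed articles repeatedly describe.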