JMIR Publications

JMIR Medical Informatics

Clinical informatics, decision support for health professionals, electronic health records, and ehealth infrastructures.


Journal Description

JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals.

Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2015: 4.532), JMIR Med Inform has a slightly different scope: it emphasizes applications for clinicians and health professionals rather than for consumers and citizens (the focus of JMIR), publishes even faster, and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.

JMIR Medical Informatics journal features a rapid and thorough peer-review process, professional copyediting, professional production of PDF, XHTML, and XML proofs (ready for deposit in PubMed Central/PubMed). The site is optimized for mobile and iPad use.

JMIR Medical Informatics adheres to the same quality standards as JMIR, and all articles published here are also cross-listed in the Table of Contents of JMIR, the world's leading medical journal in health sciences / health services research and health informatics.


Recent Articles:

  • Source: Flickr; Copyright: NEC Corporation of America; License: Creative Commons Attribution (CC-BY).

    A Software Framework for Remote Patient Monitoring by Using Multi-Agent Systems Support


    Background: Although there have been significant advances in network, hardware, and software technologies, the health care environment has not taken advantage of these developments to solve many of its inherent problems. Research activities in these 3 areas make it possible to apply advanced technologies to address many of these issues such as real-time monitoring of a large number of patients, particularly where a timely response is critical. Objective: The objective of this research was to design and develop innovative technological solutions to offer a more proactive and reliable medical care environment. The short-term and primary goal was to construct IoT4Health, a flexible software framework to generate a range of Internet of things (IoT) applications, containing components such as multi-agent systems that are designed to perform Remote Patient Monitoring (RPM) activities autonomously. An investigation into its full potential to conduct such patient monitoring activities in a more proactive way is an expected future step. Methods: A framework methodology was selected to evaluate whether the RPM domain had the potential to generate customized applications that could achieve the stated goal of being responsive and flexible within the RPM domain. As a proof of concept of the software framework’s flexibility, 3 applications were developed with different implementations for each framework hot spot to demonstrate potential. Agents4Health was selected to illustrate the instantiation process and IoT4Health’s operation. To develop more concrete indicators of the responsiveness of the simulated care environment, an experiment was conducted while Agents4Health was operating, to measure the number of delays incurred in monitoring the tasks performed by agents. Results: IoT4Health’s construction can be highlighted as our contribution to the development of eHealth solutions. 
As a software framework, IoT4Health offers extensibility points for the generation of applications. Applications can extend the framework in the following ways: identification, collection, storage, recovery, visualization, monitoring, anomalies detection, resource notification, and dynamic reconfiguration. Based on other outcomes involving observation of the resulting applications, it was noted that its design contributed toward more proactive patient monitoring. Through these experimental systems, anomalies were detected in real time, with agents sending notifications instantly to the health providers. Conclusions: We conclude that the cost-benefit of the construction of a more generic and complex system instead of a custom-made software system demonstrated the worth of the approach, making it possible to generate applications in this domain in a more timely fashion.

  • Image of Telehealth monitoring devices. Copyright: Cathy Soreny via Optical Jukebox; License: Licensed by the authors.

    Does Telehealth Monitoring Identify Exacerbations of Chronic Obstructive Pulmonary Disease and Reduce Hospitalisations? An Analysis of System Data


    Background: The increasing prevalence and associated cost of treating chronic obstructive pulmonary disease (COPD) is unsustainable. Health care organizations are focusing on ways to support self-management and prevent hospital admissions, including telehealth-monitoring services capturing physiological and health status data. This paper reports on data captured during a pilot randomized controlled trial of telehealth-supported care within a community-based service for patients discharged from hospital following an exacerbation of their COPD. Objective: The aim was to undertake the first analysis of system data to determine whether telehealth monitoring can identify an exacerbation of COPD, providing clinicians with an opportunity to intervene with timely treatment and prevent hospital readmission. Methods: A total of 23 participants received a telehealth-supported intervention. This paper reports on the analysis of data from a telehealth monitoring system that captured data from two sources: (1) data uploaded both manually and using Bluetooth peripheral devices by the 23 participants and (2) clinical records entered as nursing notes by the clinicians. Rules embedded in the telehealth monitoring system triggered system alerts to be reviewed by remote clinicians who determined whether clinical intervention was required. We also analyzed data on the frequency and length (bed days) of hospital admissions, frequency of hospital Accident and Emergency visits that did not lead to hospital admission, and frequency and type of community health care service contacts—other than the COPD discharge service—for all participants for the duration of the intervention and 6 months postintervention. Results: Patients generated 512 alerts, 451 of which occurred during the first 42 days that all participants used the equipment. Patients generated fewer alerts over time with typically seven alerts per day within the first 10 days and four alerts per day thereafter. 
They also had three times more days without alerts than with alerts. Alerts were most commonly triggered by reports of being more tired, having difficulty with self-care, and blood pressure being out of range. During the 8-week intervention and the 6-month follow-up, eight of the 23 patients were hospitalized. Hospital readmission rates (2/23, 9%) in the first 28 days of service were lower than the 20% UK norm. Conclusions: It seems that the clinical team can identify exacerbations based on both an increase in alerts and the types of system-generated alerts, as evidenced by their efforts to provide treatment interventions. There was some indication that telehealth monitoring potentially delayed hospitalizations until after patients had been discharged from the service. We suggest that telehealth-supported care can fulfill an important role in enabling patients with COPD to better manage their condition and remain out of hospital, but adequate resourcing and timely response to alerts are critical factors in supporting patients to remain at home. Trial Registration: International Standard Randomized Controlled Trial Number (ISRCTN): 68856013.

  • Patient and Medical Record. TOC picture created by authors from two images. Source: Pixabay. Medical Record Health Patient form, Author vjohns1580; and Hospital Labor Delivery Mom, Author Parentingupstream. Public Domain. Licensed under a CC0.

    Patient Similarity in Prediction Models Based on Health Data: A Scoping Review


    Background: Physicians and health policy makers are required to make predictions during their decision making in various medical problems. Many advances have been made in predictive modeling toward outcome prediction, but these innovations target an average patient and are insufficiently adjustable for individual patients. One developing idea in this field is individualized predictive analytics based on patient similarity. The goal of this approach is to identify patients who are similar to an index patient and derive insights from the records of similar patients to provide personalized predictions. Objective: The aim is to summarize and review published studies describing computer-based approaches for predicting patients’ future health status based on health data and patient similarity, identify gaps, and provide a starting point for related future research. Methods: The method involved (1) conducting the review by performing automated searches in Scopus, PubMed, and ISI Web of Science, selecting relevant studies by first screening titles and abstracts then analyzing full-texts, and (2) documenting by extracting publication details and information on context, predictors, missing data, modeling algorithm, outcome, and evaluation methods into a matrix table, synthesizing data, and reporting results. Results: After duplicate removal, 1339 articles were screened by title and abstract and 67 were selected for full-text review. In total, 22 articles met the inclusion criteria. Within included articles, hospitals were the main source of data (n=10). Cardiovascular disease (n=7) and diabetes (n=4) were the dominant patient diseases. Most studies (n=18) used neighborhood-based approaches in devising prediction models. Two studies showed that patient similarity-based modeling outperformed population-based predictive methods. Conclusions: Interest in patient similarity-based predictive modeling for diagnosis and prognosis has been growing. 
In addition to raw/coded health data, wavelet transform and term frequency-inverse document frequency methods were employed to extract predictors. Selecting predictors with potential to highlight special cases and defining new patient similarity metrics were among the gaps identified in the existing literature that provide starting points for future work. Patient status prediction models based on patient similarity and health data offer exciting potential for personalizing and ultimately improving health care, leading to better patient outcomes.
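The abstract above mentions term frequency-inverse document frequency (TF-IDF) as one way predictors were extracted from health data. As a rough illustration only, the following sketch computes a textbook TF-IDF weighting over tokenized notes. The function name `tf_idf` and the toy notes are assumptions made for this example and are not taken from any of the reviewed studies.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (textbook TF-IDF).

    documents: list of token lists, e.g. one tokenized clinical note each.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy notes, already tokenized.
notes = [["chest", "pain"], ["chest", "cough"], ["fever"]]
note_weights = tf_idf(notes)
```

In this toy input, "pain" occurs in one note and "chest" in two, so "pain" receives the larger weight in the first note. That is the sense in which TF-IDF highlights terms that distinguish one record from the rest.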

  • Source: Getty images. IStock. License purchased by the author.

    The State of Open Source Electronic Health Record Projects: A Software Anthropology Study


    Background: Electronic health records (EHR) are a key tool in managing and storing patients’ information. Currently, there are over 50 open source EHR systems available. Functionality and usability are important factors for determining the success of any system. These factors are often a direct reflection of the domain knowledge and developers’ motivations. However, few published studies have focused on the characteristics of free and open source software (F/OSS) EHR systems and none to date have discussed the motivation, knowledge background, and demographic characteristics of the developers involved in open source EHR projects. Objective: This study analyzed the characteristics of prevailing F/OSS EHR systems and aimed to provide an understanding of the motivation, knowledge background, and characteristics of the developers. Methods: This study identified F/OSS EHR projects on SourceForge and other websites from May to July 2014. Projects were classified and characterized by license type, downloads, programming languages, spoken languages, project age, development status, supporting materials, top downloads by country, and whether they were “certified” EHRs. Health care F/OSS developers were also surveyed using an online survey. Results: At the time of the assessment, we uncovered 54 open source EHR projects, but only four of them had been successfully certified under the Office of the National Coordinator for Health Information Technology (ONC Health IT) Certification Program. In the majority of cases, the open source EHR software was downloaded by users in the United States (64.07%, 148,666/232,034), underscoring that there is a significant interest in EHR open source applications in the United States. A survey of EHR open source developers was conducted and a total of 103 developers responded to the online questionnaire. 
The majority of EHR F/OSS developers (65.3%, 66/101) are participating in F/OSS projects as part of a paid activity and only 25.7% (26/101) of EHR F/OSS developers are, or have been, health care providers in their careers. In addition, 45% (45/99) of developers do not work in the health care field. Conclusion: The research presented in this study highlights some challenges that may be hindering the future of health care F/OSS. A minority of developers have been health care professionals, and only 55% (54/99) work in the health care field. This undoubtedly limits the ability of functional design of F/OSS EHR systems from being a competitive advantage over prevailing commercial EHR systems. Open source software seems to be a significant interest to many; however, given that only four F/OSS EHR systems are ONC-certified, this interest is unlikely to yield significant adoption of these systems in the United States. Although the Health Information Technology for Economic and Clinical Health (HITECH) act was responsible for a substantial infusion of capital into the EHR marketplace, the lack of a corporate entity in most F/OSS EHR projects translates to a marginal capacity to market the respective F/OSS system and to navigate certification. This likely has further disadvantaged F/OSS EHR adoption in the United States.

  • Clinician and patient viewing EMR data (Adobe stock photo).

    Progress in the Enhanced Use of Electronic Medical Records: Data From the Ontario Experience


    Background: This paper describes a change management strategy, including a self-assessment survey tool and electronic medical record (EMR) maturity model (EMM), developed to support the adoption and implementation of EMRs among community-based physicians in the province of Ontario, Canada. Objective: The aim of our study was to present an analysis of progress in EMR use in the province of Ontario based on data from surveys completed by over 4000 EMR users. Methods: The EMM and the EMR progress report (EPR) survey tool clarify levels of capability and expected benefits of improved use. Maturity is assessed on a 6-point scale (0-5) for 25 functions, across 7 functional areas, ranging from basic to more advanced. A total of 4214 clinicians completed EPR surveys between April 2013 and March 2016. Univariate and multivariate descriptive statistics were calculated to describe the survey results. Results: Physicians reported continual improvement over years of use, perceiving that the longer they used their EMR, the better patient care they provided. Those with at least two years of experience reported the greatest progress. Conclusions: From our analyses at this stage we identified: (1) a direct correlation between years of EMR use and EMR maturity as measured in our model, (2) a similar positive correlation between years of EMR use and the perception that these systems improve clinical care in at least four patient-centered areas, and (3) evidence of ongoing improvement even in advanced years of use. Future analyses will be supplemented by qualitative and quantitative data collected from field staff engagements as part of the new EMR practice enhancement program (EPEP).

  • Image sourced from and owned by the authors.

    Checking Questionable Entry of Personally Identifiable Information Encrypted by One-Way Hash Transformation


    Background: As one of the several effective solutions for personal privacy protection, a global unique identifier (GUID) is linked with hash codes that are generated from combinations of personally identifiable information (PII) by a one-way hash algorithm. On the GUID server, no PII is permitted to be stored, and only GUID and hash codes are allowed. The quality of PII entry is critical to the GUID system. Objective: The goal of our study was to explore a method of checking questionable entry of PII in this context without using or sending any portion of PII while registering a subject. Methods: According to the principle of GUID system, all possible combination patterns of PII fields were analyzed and used to generate hash codes, which were stored on the GUID server. Based on the matching rules of the GUID system, an error-checking algorithm was developed using set theory to check PII entry errors. We selected 200,000 simulated individuals with randomly-planted errors to evaluate the proposed algorithm. These errors were placed in the required PII fields or optional PII fields. The performance of the proposed algorithm was also tested in the registering system of study subjects. Results: There are 127,700 error-planted subjects, of which 114,464 (89.64%) can still be identified as the previous one and remaining 13,236 (10.36%, 13,236/127,700) are discriminated as new subjects. As expected, 100% of nonidentified subjects had errors within the required PII fields. The possibility that a subject is identified is related to the count and the type of incorrect PII field. For all identified subjects, their errors can be found by the proposed algorithm. The scope of questionable PII fields is also associated with the count and the type of the incorrect PII field. The best situation is to precisely find the exact incorrect PII fields, and the worst situation is to shrink the questionable scope only to a set of 13 PII fields. 
In the application, the proposed algorithm can give a hint of questionable PII entry and perform as an effective tool. Conclusions: The GUID system has high error tolerance and may correctly identify and associate a subject even with few PII field errors. Correct data entry, especially required PII fields, is critical to avoiding false splits. In the context of one-way hash transformation, the questionable input of PII may be identified by applying set theory operators based on the hash codes. The count and the type of incorrect PII fields play an important role in identifying a subject and locating questionable PII fields.
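To make the idea of locating questionable fields from hash codes alone more concrete, here is a minimal sketch under stated assumptions. It is not the authors' algorithm: the field names, the use of SHA-256, the pairwise combination size, and the function names `hash_codes` and `questionable_fields` are all hypothetical choices for illustration. The sketch hashes every pair of PII fields, compares the entered record's hashes against the stored set, and uses a set difference to isolate fields that appear only in non-matching combinations.

```python
import hashlib
from itertools import combinations

# Hypothetical PII fields; the real GUID system defines its own.
FIELDS = ("first_name", "last_name", "birth_date", "birth_city")


def hash_codes(record, size=2):
    """Map each combination of `size` fields to a one-way hash code."""
    codes = {}
    for combo in combinations(FIELDS, size):
        # Include field names so different combinations cannot collide.
        joined = "|".join(f"{f}={record[f].strip().lower()}" for f in combo)
        codes[combo] = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return codes


def questionable_fields(stored_codes, entered_record, size=2):
    """Fields seen in non-matching combinations but in no matching one."""
    matched, unmatched = set(), set()
    for combo, code in hash_codes(entered_record, size).items():
        (matched if code in stored_codes else unmatched).update(combo)
    return unmatched - matched


# The server side would hold only the hash codes, never the PII itself.
original = {"first_name": "Ana", "last_name": "Silva",
            "birth_date": "1980-02-01", "birth_city": "Porto"}
stored = set(hash_codes(original).values())
typo = dict(original, birth_date="1980-02-10")
suspects = questionable_fields(stored, typo)
```

With a single mistyped field, every combination containing that field fails to match while the others still match, so the set difference narrows the suspects to that one field. With more errors the questionable scope widens, which is in line with the abstract's remark that the scope depends on the count and type of incorrect fields.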

  • OVERT-MED visual interface.

    Ontology-Driven Search and Triage: Design of a Web-Based Visual Interface for MEDLINE


    Background: Diverse users need to search health and medical literature to satisfy open-ended goals such as making evidence-based decisions and updating their knowledge. However, doing so is challenging due to at least two major difficulties: (1) articulating information needs using accurate vocabulary and (2) dealing with large document sets returned from searches. Common search interfaces such as PubMed do not provide adequate support for exploratory search tasks. Objective: Our objective was to improve support for exploratory search tasks by combining two strategies in the design of an interactive visual interface by (1) using a formal ontology to help users build domain-specific knowledge and vocabulary and (2) providing multi-stage triaging support to help mitigate the information overload problem. Methods: We developed a Web-based tool, Ontology-Driven Visual Search and Triage Interface for MEDLINE (OVERT-MED), to test our design ideas. We implemented a custom searchable index of MEDLINE, which comprises approximately 25 million document citations. We chose a popular biomedical ontology, the Human Phenotype Ontology (HPO), to test our solution to the vocabulary problem. We implemented multistage triaging support in OVERT-MED, with the aid of interactive visualization techniques, to help users deal with large document sets returned from searches. Results: Formative evaluation suggests that the design features in OVERT-MED are helpful in addressing the two major difficulties described above. Using a formal ontology seems to help users articulate their information needs with more accurate vocabulary. In addition, multistage triaging combined with interactive visualizations shows promise in mitigating the information overload problem. Conclusions: Our strategies appear to be valuable in addressing the two major problems in exploratory search. 
Although we tested OVERT-MED with a particular ontology and document collection, we anticipate that our strategies can be transferred successfully to other contexts.

  • Predictive analytics. Image author: geralt. Copyright: CC0 Public Domain.

    Patient-Specific Predictive Modeling Using Random Forests: An Observational Study for the Critically Ill


    Background: With a large-scale electronic health record repository, it is feasible to build a customized patient outcome prediction model specifically for a given patient. This approach involves identifying past patients who are similar to the present patient and using their data to train a personalized predictive model. Our previous work investigated a cosine-similarity patient similarity metric (PSM) for such patient-specific predictive modeling. Objective: The objective of the study is to investigate the random forest (RF) proximity measure as a PSM in the context of personalized mortality prediction for intensive care unit (ICU) patients. Methods: A total of 17,152 ICU admissions were extracted from the Multiparameter Intelligent Monitoring in Intensive Care II database. A number of predictor variables were extracted from the first 24 hours in the ICU. Outcome to be predicted was 30-day mortality. A patient-specific predictive model was trained for each ICU admission using an RF PSM inspired by the RF proximity measure. Death counting, logistic regression, decision tree, and RF models were studied with a hard threshold applied to RF PSM values to only include the M most similar patients in model training, where M was varied. In addition, case-specific random forests (CSRFs), which uses RF proximity for weighted bootstrapping, were trained. Results: Compared to our previous study that investigated a cosine similarity PSM, the RF PSM resulted in superior or comparable predictive performance. RF and CSRF exhibited the best performances (in terms of mean area under the receiver operating characteristic curve [95% confidence interval], RF: 0.839 [0.835-0.844]; CSRF: 0.832 [0.821-0.843]). RF and CSRF did not benefit from personalization via the use of the RF PSM, while the other models did. Conclusions: The RF PSM led to good mortality prediction performance for several predictive models, although it failed to induce improved performance in RF and CSRF. 
The distinction between predictor and similarity variables is an important issue arising from the present study. RFs present a promising method for patient-specific outcome prediction.
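The "death counting" model with a hard cutoff on the M most similar patients can be sketched in a few lines. This is an illustrative assumption-laden example, not the study's code: it swaps in cosine similarity (the metric from the authors' earlier work) instead of the random forest proximity measure, and the function names and toy cohort are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def death_counting(index_patient, cohort, m):
    """Predicted mortality = share of deaths among the m most similar patients.

    cohort: non-empty list of (feature_vector, died) pairs, died being 0 or 1.
    m: number of most similar past patients to keep (m >= 1).
    """
    ranked = sorted(
        cohort,
        key=lambda record: cosine_similarity(index_patient, record[0]),
        reverse=True,
    )
    nearest = ranked[:m]
    return sum(died for _, died in nearest) / len(nearest)


# Hypothetical toy cohort with two features per patient.
cohort = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
risk_m2 = death_counting([1.0, 0.05], cohort, m=2)
risk_m4 = death_counting([1.0, 0.05], cohort, m=4)
```

Varying `m` changes how personalized the estimate is: a small `m` uses only the closest patients, while `m` equal to the cohort size reduces to the overall death rate.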

  • Health care professionals. Image sourced and copyright owned by authors.

    The Value of Electronic Medical Record Implementation in Mental Health Care: A Case Study


    Background: Electronic medical records (EMR) have been implemented in many organizations to improve the quality of care. Evidence supporting the value added to a recovery-oriented mental health facility is lacking. Objective: The goal of this project was to implement and customize a fully integrated EMR system in a specialized, recovery-oriented mental health care facility. This evaluation examined the outcomes of quality improvement initiatives driven by the EMR to determine the value that the EMR brought to the organization. Methods: The setting was a tertiary-level mental health facility in Ontario, Canada. Clinical informatics and decision support worked closely with point-of-care staff to develop workflows and documentation tools in the EMR. The primary initiatives were implementation of modules for closed loop medication administration, collaborative plan of care, clinical practice guidelines for schizophrenia, restraint minimization, the infection prevention and control surveillance status board, drug of abuse screening, and business intelligence. Results: Medication and patient scan rates have been greater than 95% since April 2014, mitigating the adverse effects of medication errors. Specifically, between April 2014 and March 2015, only 1 moderately severe and 0 severe adverse drug events occurred. The number of restraint incidents decreased 19.7%, which resulted in cost savings of more than Can $1.4 million (US $1.0 million) over 2 years. Implementation of clinical practice guidelines for schizophrenia increased adherence to evidence-based practices, standardizing care across the facility. Improved infection prevention and control surveillance reduced the number of outbreak days from 47 in the year preceding implementation of the status board to 7 days in the year following. Decision support to encourage preferential use of the cost-effective drug of abuse screen when clinically indicated resulted in organizational cost savings. 
Conclusions: EMR implementation allowed Ontario Shores Centre for Mental Health Sciences to use data analytics to identify and select appropriate quality improvement initiatives, supporting patient-centered, recovery-oriented practices and providing value at the clinical, organizational, and societal levels.

  • Hand Holding A Stethoscope. Image author: Petr Kratochvil. Copyright: Public Domain.

    Email Between Patient and Provider: Assessing the Attitudes and Perspectives of 624 Primary Health Care Patients


    Background: Email between patients and their health care providers can serve as a continuous and collaborative forum to improve access to care, enhance convenience of communication, reduce administrative costs and missed appointments, and improve satisfaction with the patient-provider relationship. Objective: The main objective of this study was to investigate the attitudes of patients aged 16 years and older toward receiving email communication for health-related purposes from an academic inner-city family health team in Southern Ontario. In addition to exploring the proportion of patients with a functioning email address and interest in email communication with their health care provider, we also examined patient-level predictors of interest in email communication. Methods: A cross-sectional study was conducted using a self-administered, 1-page survey of attitudes toward electronic communication for health purposes. Participants were recruited from attending patients at the McMaster Family Practice in Hamilton, Ontario, Canada. These patients were aged 16 years and older and were approached consecutively to complete the self-administered survey (N=624). Descriptive analyses were conducted using the Pearson chi-square test to examine correlations between variables. A logistic regression analysis was conducted to determine statistically significant predictors of interest in email communication (yes or no). Results: The majority of respondents (73.2%, 457/624) reported that they would be willing to have their health care provider (from the McMaster Family Practice) contact them via email to communicate health-related information. Those respondents who checked their personal email more frequently were less likely to want to engage in this electronic communication. Among respondents who check their email less frequently (fewer than every 3 days), 46% (37/81) preferred to communicate with the McMaster Family Practice via email. 
Conclusions: Online applications, including email, are emerging as a viable avenue for patient communication. With increasing utility of mobile devices in the general population, the proportion of patients interested in email communication with their health care providers may continue to increase. When following best practices and appropriate guidelines, health care providers can use this resource to enhance patient-provider communication in their clinical work, ultimately leading to improved health outcomes and satisfaction with care among their patients.

  • EHR and FOCUS. Image sourced and copyright owned by authors.

    Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations


    Background: Many health organizations allow patients to access their own electronic health record (EHR) notes through online patient portals as a way to enhance patient-centered care. However, EHR notes are typically long and contain abundant medical jargon that can be difficult for patients to understand. In addition, many medical terms in patients’ notes are not directly related to their health care needs. One way to help patients better comprehend their own notes is to reduce information overload and help them focus on medical terms that matter most to them. Interventions can then be developed by giving them targeted education to improve their EHR comprehension and the quality of care. Objective: We aimed to develop a supervised natural language processing (NLP) system called Finding impOrtant medical Concepts most Useful to patientS (FOCUS) that automatically identifies and ranks medical terms in EHR notes based on their importance to the patients. Methods: First, we built an expert-annotated corpus. For each EHR note, 2 physicians independently identified medical terms important to the patient. Using the physicians’ agreement as the gold standard, we developed and evaluated FOCUS. FOCUS first identifies candidate terms from each EHR note using MetaMap and then ranks the terms using a support vector machine-based learn-to-rank algorithm. We explored rich learning features, including distributed word representation, Unified Medical Language System semantic type, topic features, and features derived from consumer health vocabulary. We compared FOCUS with 2 strong baseline NLP systems. Results: Physicians annotated 90 EHR notes and identified a mean of 9 (SD 5) important terms per note. The Cohen’s kappa annotation agreement was .51. The 10-fold cross-validation results show that FOCUS achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.940 for ranking candidate terms from EHR notes to identify important terms. 
When including term identification, the performance of FOCUS for identifying important terms from EHR notes was 0.866 AUC-ROC. Both performance scores significantly exceeded the corresponding baseline system scores (P<.001). Rich learning features contributed to FOCUS’s performance substantially. Conclusions: FOCUS can automatically rank terms from EHR notes based on their importance to patients. It may help develop future interventions that improve quality of care.
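The annotation agreement reported above is a Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following sketch shows the standard two-rater calculation. The function name and the toy labels are assumptions for illustration and have no connection to the study's annotations.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = "important to the patient", 0 = "not important".
physician_1 = [1, 1, 0, 0, 1, 0, 1, 0]
physician_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(physician_1, physician_2)
```

In this toy case the raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5.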

  • Creative abstract healthcare, medicine and cardiology tool concept: laptop or notebook computer with medical cardiologic diagnostic test software on screen and stethoscope isolated on white background. Image author: Scanrail1. Image purchased by authors.

    A Predictive Model for Medical Events Based on Contextual Embedding of Temporal Sequences


    Background: Medical concepts are inherently ambiguous and error-prone due to human fallibility, which makes it hard for them to be fully used by classical machine learning methods (eg, for tasks like early stage disease prediction). Objective: Our aim was to create a new machine-friendly representation that resembles the semantics of medical concepts. We then developed a sequential predictive model for medical events based on this new representation. Methods: We developed novel contextual embedding techniques to combine different medical events (eg, diagnoses, prescriptions, and lab tests). Each medical event is converted into a numerical vector that resembles its “semantics,” via which the similarity between medical events can be easily measured. We developed simple and effective predictive models based on these vectors to predict novel diagnoses. Results: We evaluated our sequential prediction model (and standard learning methods) in estimating the risk of potential diseases based on our contextual embedding representation. Our model achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.79 on chronic systolic heart failure and an average AUC of 0.67 (over the 80 most common diagnoses) using the Medical Information Mart for Intensive Care III (MIMIC-III) dataset. Conclusions: We propose a general early prognosis predictor for 80 different diagnoses. Our method computes a numeric representation for each medical event to uncover the potential meaning of those events. Our results demonstrate the efficiency of the proposed method, which will benefit patients and physicians by offering more accurate diagnosis.
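Several abstracts on this page report the area under the ROC curve (AUC). One way to read that number is as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes AUC directly from that definition. The function name and toy scores are assumptions for illustration, and the pairwise loop is meant for small examples, not large datasets.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    scores: predicted risk per case; labels: 1 for positive, 0 for negative.
    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores for four patients, two of whom had the diagnosis.
example_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

Here three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. A value of 0.5 would mean the scores rank cases no better than chance.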

Latest Submissions Open for Peer-Review:

  • What patients can tell us: topic analysis for social media on breast cancer

    Date Submitted: Apr 5, 2017

    Open Peer Review Period: Apr 6, 2017 - Jun 1, 2017

    Background: Internet health forums are increasingly used by patients and health professionals. They are rich textual resources with content generated through free exchange between patients. We propose a method to tackle the problem of retrieving clinically relevant information from such forums in order to analyse the quality of life of patients suffering from breast cancer. Objective: Our aim is to detect the different topics discussed by patients on social media and to relate them to the functional and symptomatic dimensions assessed in the internationally standardized auto-questionnaires used in cancer clinical trials (EORTC QLQ-C30 and EORTC QLQ-BR23). Methods: First, we applied a classic text mining technique, Latent Dirichlet Allocation (LDA), to detect the different topics discussed on social media dealing with breast cancer. We applied the LDA model to two data sets composed of messages extracted from public Facebook groups and from a public health forum (“”) with relevant preprocessing. Second, we applied a customized Jaccard coefficient to automatically compute the similarity between the topics detected with LDA and the questions in the auto-questionnaires used to study quality of life. Results: Among the 23 topics present in the auto-questionnaires, 22 match topics discussed by patients on social media. Interestingly, these topics correspond to 95% of the forum and 86% of the Facebook groups. These figures underline that topics related to quality of life are an important concern for patients. However, 5 topics from social media had no counterpart in the auto-questionnaires, which therefore do not cover all the concerns of the patients. Two of these 5 topics could be used in the auto-questionnaires; together they corresponded to 4.3% of topics in the “” corpus and 3.1% of the Facebook corpus. Conclusions: This work demonstrates a good correspondence between the topics detected on social media and the topics present in auto-questionnaires, which substantiates the sound construction of such auto-questionnaires. We detected new emerging topics from social media that can be used to complete current auto-questionnaires. Moreover, we confirmed that social media mining is an important source of information for complementary analysis of the quality of life.
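
The matching step described in the Methods can be illustrated with the standard Jaccard coefficient, the size of the intersection of two word sets divided by the size of their union. The sketch below is a hedged illustration only: the abstract mentions a customized coefficient whose details are not given, so the plain definition is used here, and the topic words, questionnaire item labels, and the helper `best_match` are hypothetical.

```python
def jaccard(a, b):
    """Standard Jaccard coefficient |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical top words per LDA topic (not taken from the study).
topic_words = {
    "topic_fatigue": ["tired", "fatigue", "sleep", "energy"],
    "topic_hair": ["hair", "wig", "loss", "chemo"],
}

# Hypothetical keyword sets standing in for questionnaire items.
questionnaire_items = {
    "item_tiredness": ["tired", "fatigue"],
    "item_hair_loss": ["hair", "loss"],
}

def best_match(words, items):
    """Return the questionnaire item whose keywords overlap most with `words`."""
    return max(items, key=lambda item: jaccard(words, items[item]))

for topic, words in topic_words.items():
    item = best_match(words, questionnaire_items)
    print(topic, "->", item, jaccard(words, questionnaire_items[item]))
```

A topic whose best score stays near zero for every item would be a candidate for one of the social media topics with no counterpart in the questionnaires.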