TY - JOUR AU - Herman Bernardim Andrade, Gabriel AU - Yada, Shuntaro AU - Aramaki, Eiji PY - 2024/7/2 TI - Is Boundary Annotation Necessary? Evaluating Boundary-Free Approaches to Improve Clinical Named Entity Annotation Efficiency: Case Study JO - JMIR Med Inform SP - e59680 VL - 12 KW - natural language processing KW - named entity recognition KW - information extraction KW - text annotation KW - entity boundaries KW - lenient annotation KW - case reports KW - annotation KW - case study KW - medical case report KW - efficiency KW - model KW - model performance KW - dataset KW - Japan KW - Japanese KW - entity KW - clinical domain KW - clinical N2 - Background: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreements between annotators due to questions such as whether modifiers or peripheral words should be annotated. If unresolved, these can induce inconsistency in the produced corpora, yet, on the other hand, strict guidelines or adjudication sessions can further prolong an already slow and convoluted process. Objective: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, aiming to mitigate the difficulty of precisely determining entity boundaries. Methods: We evaluate their effects through an annotation case study on a Japanese medical case report data set. We compare annotation time, annotator agreement, and the quality of the produced labeling and assess the impact on the performance of an NER system trained on the annotated corpus. 
Results: We saw significant improvements in the labeling process efficiency, with up to a 25% reduction in overall annotation time and even a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best-achieved NER model presented some drop in performance compared to the traditional annotation methodology. Conclusions: Our findings demonstrate a balance between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in the annotator's workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers. UR - https://medinform.jmir.org/2024/1/e59680 UR - http://dx.doi.org/10.2196/59680 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/59680 ER - TY - JOUR AU - Hao, Tianyong AU - Huang, Zhengxing AU - Liang, Likeng AU - Weng, Heng AU - Tang, Buzhou PY - 2021/10/21 TI - Health Natural Language Processing: Methodology Development and Applications JO - JMIR Med Inform SP - e23898 VL - 9 IS - 10 KW - health care KW - unstructured text KW - natural language processing KW - methodology KW - application UR - https://medinform.jmir.org/2021/10/e23898 UR - http://dx.doi.org/10.2196/23898 UR - http://www.ncbi.nlm.nih.gov/pubmed/34673533 ID - info:doi/10.2196/23898 ER - TY - JOUR AU - Wang, Jian AU - Chen, Xiaoyu AU - Zhang, Yu AU - Zhang, Yijia AU - Wen, Jiabin AU - Lin, Hongfei AU - Yang, Zhihao AU - Wang, Xin PY - 2020/7/31 TI - Document-Level Biomedical Relation Extraction Using Graph Convolutional Network and Multihead Attention: Algorithm Development and Validation JO - JMIR Med Inform SP - e17638 VL - 8 IS - 7 KW - biomedical relation extraction KW - dependency graph KW - multihead attention KW - graph convolutional network N2 - Background: Automatically
extracting relations between chemicals and diseases plays an important role in biomedical text mining. Chemical-disease relation (CDR) extraction aims at extracting complex semantic relationships between entities in documents, which contain intrasentence and intersentence relations. Most previous methods did not consider dependency syntactic information across the sentences, which is very valuable for the relation extraction task, in particular, for extracting the intersentence relations accurately. Objective: In this paper, we propose a novel end-to-end neural network based on the graph convolutional network (GCN) and multihead attention, which makes use of the dependency syntactic information across the sentences to improve the CDR extraction task. Methods: To improve the performance of intersentence relation extraction, we constructed a document-level dependency graph to capture the dependency syntactic information across sentences. GCN is applied to capture the feature representation of the document-level dependency graph. The multihead attention mechanism is employed to learn the relatively important context features from different semantic subspaces. To enhance the input representation, the deep context representation is used in our model instead of traditional word embedding. Results: We evaluate our method on the CDR corpus. The experimental results show that our method achieves an F-measure of 63.5%, which is superior to other state-of-the-art methods. At the intrasentence level, our method achieves a precision, recall, and F-measure of 59.1%, 81.5%, and 68.5%, respectively. At the intersentence level, our method achieves a precision, recall, and F-measure of 47.8%, 52.2%, and 49.9%, respectively. Conclusions: The GCN model can effectively exploit the cross-sentence dependency information to improve the performance of intersentence CDR extraction. Both the deep context representation and multihead attention are helpful in the CDR extraction task.
UR - https://medinform.jmir.org/2020/7/e17638 UR - http://dx.doi.org/10.2196/17638 UR - http://www.ncbi.nlm.nih.gov/pubmed/32459636 ID - info:doi/10.2196/17638 ER - TY - JOUR AU - Pan, Xiaoyi AU - Chen, Boyu AU - Weng, Heng AU - Gong, Yongyi AU - Qu, Yingying PY - 2020/7/27 TI - Temporal Expression Classification and Normalization From Chinese Narrative Clinical Texts: Pattern Learning Approach JO - JMIR Med Inform SP - e17652 VL - 8 IS - 7 KW - Temporal expression extraction KW - Temporal expression normalization KW - Machine learning KW - Heuristic rule KW - Pattern learning KW - Clinical text N2 - Background: Temporal information frequently exists in the representation of the disease progress, prescription, medication, surgery progress, or discharge summary in narrative clinical text. The accurate extraction and normalization of temporal expressions can positively boost the analysis and understanding of narrative clinical texts to promote clinical research and practice. Objective: The goal of the study was to propose a novel approach for extracting and normalizing temporal expressions from Chinese narrative clinical text. Methods: TNorm, a rule-based and pattern learning-based approach, has been developed for automatic temporal expression extraction and normalization from unstructured Chinese clinical text data. TNorm consists of three stages: extraction, classification, and normalization. It applies a set of heuristic rules and automatically generated patterns for temporal expression identification and extraction from clinical texts. Then, it collects the features of extracted temporal expressions for temporal type prediction and classification by using machine learning algorithms. Finally, the features are combined with the rule-based and pattern learning-based approaches to normalize the extracted temporal expressions.
Results: The evaluation dataset is a set of narrative clinical texts in Chinese containing 1459 discharge summaries of a domestic Grade A Class 3 hospital. The results show that TNorm, combined with temporal expression extraction and temporal type prediction, achieves a precision of 0.8491, a recall of 0.8328, and an F1 score of 0.8409 in temporal expression normalization. Conclusions: This study illustrates an automatic approach, TNorm, that extracts and normalizes temporal expressions from Chinese narrative clinical texts. TNorm was evaluated on the basis of discharge summary data, and results demonstrate its effectiveness on temporal expression normalization. UR - https://medinform.jmir.org/2020/7/e17652 UR - http://dx.doi.org/10.2196/17652 UR - http://www.ncbi.nlm.nih.gov/pubmed/32716307 ID - info:doi/10.2196/17652 ER - TY - JOUR AU - Sun, Haixia AU - Xiao, Jin AU - Zhu, Wei AU - He, Yilong AU - Zhang, Sheng AU - Xu, Xiaowei AU - Hou, Li AU - Li, Jiao AU - Ni, Yuan AU - Xie, Guotong PY - 2020/7/23 TI - Medical Knowledge Graph to Enhance Fraud, Waste, and Abuse Detection on Claim Data: Model Development and Performance Evaluation JO - JMIR Med Inform SP - e17653 VL - 8 IS - 7 KW - medical knowledge graph KW - FWA detection N2 - Background: Fraud, Waste, and Abuse (FWA) detection is a significant yet challenging problem in the health insurance industry. An essential step in FWA detection is to check whether the medication is clinically reasonable with respect to the diagnosis. Currently, human experts with sufficient medical knowledge are required to perform this task. To reduce the cost, insurance inspectors tend to build an intelligent system to detect suspicious claims with inappropriate diagnoses/medications automatically. Objective: The aim of this study was to develop an automated method for making use of a medical knowledge graph to identify clinically suspected claims for FWA detection.
Methods: First, we identified the medical knowledge that is required to assess the clinical rationality of the claims. We then searched for data sources that contain information to build such knowledge. In this study, we focused on Chinese medical knowledge. Second, we constructed a medical knowledge graph using unstructured knowledge. We used a deep learning-based method to extract the entities and relationships from the knowledge sources and developed a multilevel similarity matching approach to conduct the entity linking. To guarantee the quality of the medical knowledge graph, we involved human experts to review the entities and relationships with lower confidence. These reviewed results could be used to further improve the machine-learning models. Finally, we developed the rules to identify the suspected claims by reasoning according to the medical knowledge graph. Results: We collected 185,796 drug labels from the China Food and Drug Administration, 3390 types of disease information from medical textbooks (eg, symptoms, diagnosis, treatment, and prognosis), and information from 5272 examinations as the knowledge sources. The final medical knowledge graph includes 1,616,549 nodes and 5,963,444 edges. We designed three knowledge graph reasoning rules to identify three kinds of inappropriate diagnoses/medications. The experimental results showed that the medical knowledge graph helps to detect 70% of the suspected claims. Conclusions: The medical knowledge graph-based method successfully identified suspected cases of FWA (such as fraud diagnosis, excess prescription, and irrational prescription) from the claim documents, which helped to improve the efficiency of claim processing.
UR - http://medinform.jmir.org/2020/7/e17653/ UR - http://dx.doi.org/10.2196/17653 UR - http://www.ncbi.nlm.nih.gov/pubmed/32706714 ID - info:doi/10.2196/17653 ER - TY - JOUR AU - Zeng, Kun AU - Pan, Zhiwei AU - Xu, Yibin AU - Qu, Yingying PY - 2020/7/1 TI - An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation JO - JMIR Med Inform SP - e17832 VL - 8 IS - 7 KW - Deep learning KW - Text classification KW - Ensemble learning KW - Eligibility criteria KW - Clinical trial N2 - Background: Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. Objective: We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. Methods: We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. Results: Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. 
Conclusions: We designed a model for short text eligibility criteria classification for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance. UR - https://medinform.jmir.org/2020/7/e17832 UR - http://dx.doi.org/10.2196/17832 UR - http://www.ncbi.nlm.nih.gov/pubmed/32609092 ID - info:doi/10.2196/17832 ER - TY - JOUR AU - Li, Genghao AU - Li, Bing AU - Huang, Langlin AU - Hou, Sibing PY - 2020/6/23 TI - Automatic Construction of a Depression-Domain Lexicon Based on Microblogs: Text Mining Study JO - JMIR Med Inform SP - e17650 VL - 8 IS - 6 KW - depression detection KW - depression diagnosis KW - social media KW - automatic construction KW - domain-specific lexicon KW - depression lexicon KW - label propagation N2 - Background: According to a World Health Organization report in 2017, there was almost one patient with depression among every 20 people in China. However, the diagnosis of depression is usually difficult in terms of clinical detection owing to slow observation, high cost, and patient resistance. Meanwhile, with the rapid emergence of social networking sites, people tend to share their daily life and disclose inner feelings online frequently, making it possible to effectively identify mental conditions using the rich text information. There are many achievements regarding an English web-based corpus, but for research in China so far, the extraction of language features from web-related depression signals is still in a relatively primary stage. Objective: The purpose of this study was to propose an effective approach for constructing a depression-domain lexicon. This lexicon will contain language features that could help identify social media users who potentially have depression.
Our study also compared the performance of detection with and without our lexicon. Methods: We autoconstructed a depression-domain lexicon using Word2Vec, a semantic relationship graph, and the label propagation algorithm. These two methods combined performed well in a specific corpus during construction. The lexicon was obtained based on 111,052 Weibo microblogs from 1868 users who were depressed or nondepressed. During depression detection, we considered six features, and we used five classification methods to test the detection performance. Results: The experiment results showed that in terms of the F1 value, our autoconstruction method performed 1% to 6% better than baseline approaches and was more effective and steadier. When applied to detection models like logistic regression and support vector machine, our lexicon helped the models outperform by 2% to 9% and was able to improve the final accuracy of potential depression detection. Conclusions: Our depression-domain lexicon was proven to be a meaningful input for classification algorithms, providing linguistic insights on the depressive status of test subjects. We believe that this lexicon will enhance early depression detection in people on social media. Future work will need to be carried out on a larger corpus and with more complex methods. 
UR - http://medinform.jmir.org/2020/6/e17650/ UR - http://dx.doi.org/10.2196/17650 UR - http://www.ncbi.nlm.nih.gov/pubmed/32574151 ID - info:doi/10.2196/17650 ER - TY - JOUR AU - Su, Longxiang AU - Liu, Chun AU - Li, Dongkai AU - He, Jie AU - Zheng, Fanglan AU - Jiang, Huizhen AU - Wang, Hao AU - Gong, Mengchun AU - Hong, Na AU - Zhu, Weiguo AU - Long, Yun PY - 2020/6/22 TI - Toward Optimal Heparin Dosing by Comparing Multiple Machine Learning Methods: Retrospective Study JO - JMIR Med Inform SP - e17648 VL - 8 IS - 6 KW - heparin KW - dosing KW - machine learning KW - optimization KW - intensive care unit N2 - Background: Heparin is one of the most commonly used medications in intensive care units. In clinical practice, the use of a weight-based heparin dosing nomogram is standard practice for the treatment of thrombosis. Recently, machine learning techniques have dramatically improved the ability of computers to provide clinical decision support and have allowed for the possibility of computer-generated, algorithm-based heparin dosing recommendations. Objective: The objective of this study was to predict the effects of heparin treatment using machine learning methods to optimize heparin dosing in intensive care units based on the predictions. Patient state predictions were based upon activated partial thromboplastin time in 3 different ranges: subtherapeutic, normal therapeutic, and supratherapeutic. Methods: Retrospective data from 2 intensive care unit research databases (Multiparameter Intelligent Monitoring in Intensive Care III, MIMIC-III; e-Intensive Care Unit Collaborative Research Database, eICU) were used for the analysis. Candidate machine learning models (random forest, support vector machine, adaptive boosting, extreme gradient boosting, and shallow neural network) were compared in 3 patient groups to evaluate the classification performance for predicting the subtherapeutic, normal therapeutic, and supratherapeutic patient states.
The model results were evaluated using precision, recall, F1 score, and accuracy. Results: Data from the MIMIC-III database (n=2789 patients) and from the eICU database (n=575 patients) were used. In 3-class classification, the shallow neural network algorithm performed the best (F1 scores of 87.26%, 85.98%, and 87.55% for data sets 1, 2, and 3, respectively). The shallow neural network algorithm achieved the highest F1 scores within the patient therapeutic state groups: subtherapeutic (data set 1: 79.35%; data set 2: 83.67%; data set 3: 83.33%), normal therapeutic (data set 1: 93.15%; data set 2: 87.76%; data set 3: 84.62%), and supratherapeutic (data set 1: 88.00%; data set 2: 86.54%; data set 3: 95.45%). Conclusions: The most appropriate model for predicting the effects of heparin treatment was found by comparing multiple machine learning models and can be used to further guide optimal heparin dosing. Using multicenter intensive care unit data, our study demonstrates the feasibility of predicting the outcomes of heparin treatment using data-driven methods, and thus, how machine learning-based models can be used to optimize and personalize heparin dosing to improve patient safety. Manual analysis and validation suggested that the model outperformed standard practice heparin treatment dosing.
UR - http://medinform.jmir.org/2020/6/e17648/ UR - http://dx.doi.org/10.2196/17648 UR - http://www.ncbi.nlm.nih.gov/pubmed/32568089 ID - info:doi/10.2196/17648 ER - TY - JOUR AU - Liu, Ziqing AU - He, Haiyang AU - Yan, Shixing AU - Wang, Yong AU - Yang, Tao AU - Li, Guo-Zheng PY - 2020/6/16 TI - End-to-End Models to Imitate Traditional Chinese Medicine Syndrome Differentiation in Lung Cancer Diagnosis: Model Development and Validation JO - JMIR Med Inform SP - e17821 VL - 8 IS - 6 KW - traditional Chinese medicine KW - syndrome differentiation KW - lung cancer KW - medical record KW - deep learning KW - model fusion N2 - Background: Traditional Chinese medicine (TCM) has been shown to be an efficient mode to manage advanced lung cancer, and accurate syndrome differentiation is crucial to treatment. Documented evidence of TCM treatment cases and the progress of artificial intelligence technology are enabling the development of intelligent TCM syndrome differentiation models. This is expected to expand the benefits of TCM to lung cancer patients. Objective: The objective of this work was to establish end-to-end TCM diagnostic models to imitate lung cancer syndrome differentiation. The proposed models used unstructured medical records as inputs to capitalize on data collected for practical TCM treatment cases by lung cancer experts. The resulting models were expected to be more efficient than approaches that leverage structured TCM datasets. Methods: We approached lung cancer TCM syndrome differentiation as a multilabel text classification problem. First, entity representation was conducted with Bidirectional Encoder Representations from Transformers and conditional random fields models. Then, five deep learning-based text classification models were applied to the construction of a medical record multilabel classifier, during which two data augmentation strategies were adopted to address overfitting issues.
Finally, a fusion model approach was used to elevate the performance of the models. Results: The F1 score of the recurrent convolutional neural network (RCNN) model with augmentation was 0.8650, a 2.41% improvement over the unaugmented model. The Hamming loss for RCNN with augmentation was 0.0987, which is 1.8% lower than that of the same model without augmentation. Among the models, the text-hierarchical attention network (Text-HAN) model achieved the highest F1 scores of 0.8676 and 0.8751. The mean average precision for the word encoding-based RCNN was 10% higher than that of the character encoding-based representation. A fusion model of the text-convolutional neural network, text-recurrent neural network, and Text-HAN models achieved an F1 score of 0.8884, which showed the best performance among the models. Conclusions: Medical records could be used more productively by constructing end-to-end models to facilitate TCM diagnosis. With the aid of entity-level representation, data augmentation, and model fusion, deep learning-based multilabel classification approaches can better imitate TCM syndrome differentiation in complex cases such as advanced lung cancer.
UR - https://medinform.jmir.org/2020/6/e17821 UR - http://dx.doi.org/10.2196/17821 UR - http://www.ncbi.nlm.nih.gov/pubmed/32543445 ID - info:doi/10.2196/17821 ER - TY - JOUR AU - Zhang, Hong AU - Ni, Wandong AU - Li, Jing AU - Zhang, Jiajun PY - 2020/6/15 TI - Artificial Intelligence-Based Traditional Chinese Medicine Assistive Diagnostic System: Validation Study JO - JMIR Med Inform SP - e17608 VL - 8 IS - 6 KW - traditional Chinese medicine KW - TCM KW - disease diagnosis KW - syndrome prediction KW - syndrome differentiation KW - natural language processing KW - NLP KW - artificial intelligence KW - AI KW - assistive diagnostic system KW - convolutional neural network KW - CNN KW - machine learning KW - ML KW - BiLSTM-CRF N2 - Background: Artificial intelligence-based assistive diagnostic systems imitate the deductive reasoning process of a human physician in biomedical disease diagnosis and treatment decision making. While impressive progress in this area has been reported, most of the reported successes are applications of artificial intelligence in Western medicine. The application of artificial intelligence in traditional Chinese medicine has lagged mainly because traditional Chinese medicine practitioners need to perform syndrome differentiation as well as biomedical disease diagnosis before a treatment decision can be made. Syndrome, a concept unique to traditional Chinese medicine, is an abstraction of a variety of signs and symptoms. The fact that the relationship between diseases and syndromes is not one-to-one but rather many-to-many makes it very challenging for a machine to perform syndrome predictions. So far, only a handful of artificial intelligence-based assistive traditional Chinese medicine diagnostic models have been reported, and they are limited in application to a single disease-type.
Objective: The objective was to develop an artificial intelligence-based assistive diagnostic system capable of diagnosing multiple types of diseases that are common in traditional Chinese medicine, given a patient's electronic health record notes. The system was designed to simultaneously diagnose the disease and produce a list of corresponding syndromes. Methods: Unstructured freestyle electronic health record notes were processed by natural language processing techniques to extract clinical information such as signs and symptoms which were represented by named entities. Natural language processing used a recurrent neural network model called bidirectional long short-term memory network-conditional random field (BiLSTM-CRF). A convolutional neural network was then used to predict the disease-type out of 187 diseases in traditional Chinese medicine. A novel traditional Chinese medicine syndrome prediction method, an integrated learning model, was used to produce a corresponding list of probable syndromes. By following a majority-rule voting method, the integrated learning model for syndrome prediction can take advantage of four existing prediction methods (back propagation, random forest, extreme gradient boosting, and support vector classifier) while avoiding their respective weaknesses, which resulted in a consistently high prediction accuracy. Results: A data set consisting of 22,984 electronic health records from Guanganmen Hospital of the China Academy of Chinese Medical Sciences that were collected between January 1, 2017 and September 7, 2018 was used. The data set contained a total of 187 diseases that are commonly diagnosed in traditional Chinese medicine. The diagnostic system was designed to be able to detect any one of the 187 disease-types. The data set was partitioned into a training set, a validation set, and a testing set in a ratio of 8:1:1. Test results suggested that the proposed system had a good diagnostic accuracy and a strong capability for generalization.
The disease-type prediction accuracies of the top one, top three, and top five were 80.5%, 91.6%, and 94.2%, respectively. Conclusions: The main contributions of the artificial intelligence-based traditional Chinese medicine assistive diagnostic system proposed in this paper are that 187 commonly known traditional Chinese medicine diseases can be diagnosed and a novel prediction method called an integrated learning model is demonstrated. This new prediction method outperformed all four existing methods in our preliminary experimental results. With further improvement of the algorithms and the availability of additional electronic health record data, it is expected that a wider range of traditional Chinese medicine disease-types could be diagnosed and that better diagnostic accuracies could be achieved. UR - http://medinform.jmir.org/2020/6/e17608/ UR - http://dx.doi.org/10.2196/17608 UR - http://www.ncbi.nlm.nih.gov/pubmed/32538797 ID - info:doi/10.2196/17608 ER - TY - JOUR AU - Hane, A. Christopher AU - Nori, S. Vijay AU - Crown, H. William AU - Sanghavi, M. Darshak AU - Bleicher, Paul PY - 2020/6/3 TI - Predicting Onset of Dementia Using Clinical Notes and Machine Learning: Case-Control Study JO - JMIR Med Inform SP - e17819 VL - 8 IS - 6 KW - Alzheimer disease KW - dementia KW - health information systems KW - machine learning KW - natural language processing KW - health information interoperability N2 - Background: Clinical trials need efficient tools to assist in recruiting patients at risk of Alzheimer disease and related dementias (ADRD). Early detection can also assist patients with financial planning for long-term care. Clinical notes are an important, underutilized source of information in machine learning models because of the cost of collection and complexity of analysis.
Objective: This study aimed to investigate the use of deidentified clinical notes from multiple hospital systems collected over 10 years to augment retrospective machine learning models of the risk of developing ADRD. Methods: We used 2 years of data to predict the future outcome of ADRD onset. Clinical notes are provided in a deidentified format with specific terms and sentiments. Terms in clinical notes are embedded into a 100-dimensional vector space to identify clusters of related terms and abbreviations that differ across hospital systems and individual clinicians. Results: When using clinical notes, the area under the curve (AUC) improved from 0.85 to 0.94, and positive predictive value (PPV) increased from 45.07% (25,245/56,018) to 68.32% (14,153/20,717) in the model at disease onset. Models with clinical notes improved in both AUC and PPV in years 3-6 when notes' volume was largest; results are mixed in years 7 and 8 with the smallest cohorts. Conclusions: Although clinical notes helped in the short term, the presence of ADRD symptomatic terms years earlier than onset adds evidence to other studies that clinicians undercode diagnoses of ADRD. Deidentified clinical notes increase the accuracy of risk models. Clinical notes collected across multiple hospital systems via natural language processing can be merged using postprocessing techniques to aid model accuracy.
UR - https://medinform.jmir.org/2020/6/e17819 UR - http://dx.doi.org/10.2196/17819 UR - http://www.ncbi.nlm.nih.gov/pubmed/32490841 ID - info:doi/10.2196/17819 ER - TY - JOUR AU - Liu, Xiaofeng AU - Fan, Jianye AU - Dong, Shoubin PY - 2020/5/29 TI - Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study JO - JMIR Med Inform SP - e17644 VL - 8 IS - 5 KW - self-attention KW - document-level KW - relation extraction KW - biomedical entity pretreatment N2 - Background: The most current methods applied for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which the relationship may cross sentence boundaries. Hence, some approaches have been proposed to extract relations by splitting the document-level datasets through heuristic rules and learning methods. However, these approaches may introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid noise and extract cross-sentence relations. Objective: This study aimed to avoid errors by dividing the document-level dataset, verify that a self-attention structure can extract biomedical relations in a document with long-distance dependencies and complex semantics, and discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction. Methods: This paper proposes a new data preprocessing method and attempts to apply a pretrained self-attention structure for document biomedical relation extraction with an entity replacement method to capture very long-distance dependencies and complex semantics. Results: Compared with state-of-the-art approaches, our method greatly improved the precision. The results show that our approach increases the F1 value, compared with state-of-the-art methods. 
Through experiments of biomedical entity pretreatments, we found that a model using an entity replacement method can improve performance. Conclusions: When considering all target entity pairs as a whole in the document-level dataset, a pretrained self-attention structure is suitable to capture very long-distance dependencies and learn the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially to document-level relation extraction. UR - http://medinform.jmir.org/2020/5/e17644/ UR - http://dx.doi.org/10.2196/17644 UR - http://www.ncbi.nlm.nih.gov/pubmed/32469325 ID - info:doi/10.2196/17644 ER - TY - JOUR AU - Li, Linfeng AU - Wang, Peng AU - Wang, Yao AU - Wang, Shenghui AU - Yan, Jun AU - Jiang, Jinpeng AU - Tang, Buzhou AU - Wang, Chengliang AU - Liu, Yuting PY - 2020/5/21 TI - A Method to Learn Embedding of a Probabilistic Medical Knowledge Graph: Algorithm Development JO - JMIR Med Inform SP - e17645 VL - 8 IS - 5 KW - probabilistic medical knowledge graph KW - representation learning KW - graph embedding KW - PrTransX KW - decision support systems, clinical KW - knowledge graph KW - medical informatics KW - electronic health records KW - natural language processing N2 - Background: Knowledge graph embedding is an effective semantic representation method for entities and relations in knowledge graphs. Several translation-based algorithms, including TransE, TransH, TransR, TransD, and TranSparse, have been proposed to learn effective embedding vectors from typical knowledge graphs in which the relations between head and tail entities are deterministic. However, in medical knowledge graphs, the relations between head and tail entities are inherently probabilistic. This difference introduces a challenge in embedding medical knowledge graphs. 
Objective: We aimed to address the challenge of how to learn the probability values of triplets into representation vectors by making enhancements to existing TransX (where X is E, H, R, D, or Sparse) algorithms, including the following: (1) constructing a mapping function between the score value and the probability, and (2) introducing probability-based loss of triplets into the original margin-based loss function. Methods: We performed the proposed PrTransX algorithm on a medical knowledge graph that we built from large-scale real-world electronic medical records data. We evaluated the embeddings using a link prediction task. Results: Compared with the corresponding TransX algorithms, the proposed PrTransX performed better than the TransX model in all evaluation indicators, achieving a higher proportion of corrected entities ranked in the top 10 and normalized discounted cumulative gain of the top 10 predicted tail entities, and lower mean rank. Conclusions: The proposed PrTransX successfully incorporated the uncertainty of the knowledge triplets into the embedding vectors. UR - https://medinform.jmir.org/2020/5/e17645 UR - http://dx.doi.org/10.2196/17645 UR - http://www.ncbi.nlm.nih.gov/pubmed/32436854 ID - info:doi/10.2196/17645 ER - TY - JOUR AU - Wang, Erniu AU - Wang, Fan AU - Yang, Zhihao AU - Wang, Lei AU - Zhang, Yin AU - Lin, Hongfei AU - Wang, Jian PY - 2020/5/19 TI - A Graph Convolutional Network-Based Method for Chemical-Protein Interaction Extraction: Algorithm Development JO - JMIR Med Inform SP - e17643 VL - 8 IS - 5 KW - chemical-protein interaction KW - graph convolutional network KW - long-range syntactic KW - dependency structure N2 - Background: Extracting the interactions between chemicals and proteins from the biomedical literature is important for many biomedical tasks such as drug discovery, precision medicine, and knowledge graph construction.
Several computational methods have been proposed for automatic chemical-protein interaction (CPI) extraction. However, the majority of these proposed models cannot effectively learn semantic and syntactic information from complex sentences in biomedical texts. Objective: To relieve this problem, we propose a method to effectively encode syntactic information from long text for CPI extraction. Methods: Since syntactic information can be captured from dependency graphs, graph convolutional networks (GCNs) have recently drawn increasing attention in natural language processing. To investigate the performance of a GCN on CPI extraction, this paper proposes a novel GCN-based model. The model can effectively capture sequential information and long-range syntactic relations between words by using the dependency structure of input sentences. Results: We evaluated our model on the ChemProt corpus released by BioCreative VI; it achieved an F-score of 65.17%, which is 1.07% higher than that of the state-of-the-art system proposed by Peng et al. As indicated by the significance test (P<.001), the improvement is significant. It indicates that our model is effective in extracting CPIs. The GCN-based model can better capture the semantic and syntactic information of the sentence compared to other models, therefore alleviating the problems associated with the complexity of biomedical literature. Conclusions: Our model can obtain more information from the dependency graph than previously proposed models. Experimental results suggest that it is competitive to state-of-the-art methods and significantly outperforms other methods on the ChemProt corpus, which is the benchmark data set for CPI extraction. 
UR - http://medinform.jmir.org/2020/5/e17643/ UR - http://dx.doi.org/10.2196/17643 UR - http://www.ncbi.nlm.nih.gov/pubmed/32348257 ID - info:doi/10.2196/17643 ER - TY - JOUR AU - Zhang, Zhichang AU - Zhu, Lin AU - Yu, Peilin PY - 2020/5/4 TI - Multi-Level Representation Learning for Chinese Medical Entity Recognition: Model Development and Validation JO - JMIR Med Inform SP - e17637 VL - 8 IS - 5 KW - medical entity recognition KW - multi-level representation learning KW - Chinese KW - natural language processing KW - electronic medical records KW - multi-head attention mechanism N2 - Background: Medical entity recognition is a key technology that supports the development of smart medicine. Existing methods on English medical entity recognition have undergone great development, but their progress in the Chinese language has been slow. Because of limitations due to the complexity of the Chinese language and annotated corpora, these methods are based on simple neural networks, which cannot effectively extract the deep semantic representations of electronic medical records (EMRs) and be used on the scarce medical corpora. We thus developed a new Chinese EMR (CEMR) dataset with six types of entities and proposed a multi-level representation learning model based on Bidirectional Encoder Representation from Transformers (BERT) for Chinese medical entity recognition. Objective: This study aimed to improve the performance of the language model by having it learn multi-level representation and recognize Chinese medical entities. Methods: In this paper, the pretraining language representation model was investigated; utilizing information not only from the final layer but from intermediate layers was found to affect the performance of the Chinese medical entity recognition task. Therefore, we proposed a multi-level representation learning model for entity recognition in Chinese EMRs. Specifically, we first used the BERT language model to extract semantic representations. 
Then, the multi-head attention mechanism was leveraged to automatically extract deeper semantic information from each layer. Finally, semantic representations from multi-level representation extraction were utilized as the final semantic context embedding for each token and we used softmax to predict the entity tags. Results: The best F1 score reached by the experiment was 82.11% when using the CEMR dataset, and the F1 score when using the CCKS (China Conference on Knowledge Graph and Semantic Computing) 2018 benchmark dataset further increased to 83.18%. Various comparative experiments showed that our proposed method outperforms methods from previous work and performs as a new state-of-the-art method. Conclusions: The multi-level representation learning model is proposed as a method to perform the Chinese EMRs entity recognition task. Experiments on two clinical datasets demonstrate the usefulness of using the multi-head attention mechanism to extract multi-level representation as part of the language model. UR - https://medinform.jmir.org/2020/5/e17637 UR - http://dx.doi.org/10.2196/17637 UR - http://www.ncbi.nlm.nih.gov/pubmed/32364514 ID - info:doi/10.2196/17637 ER - TY - JOUR AU - Zhao, Zhenyu AU - Yang, Muyun AU - Tang, Buzhou AU - Zhao, Tiejun PY - 2020/4/30 TI - Re-examination of Rule-Based Methods in Deidentification of Electronic Health Records: Algorithm Development and Validation JO - JMIR Med Inform SP - e17622 VL - 8 IS - 4 KW - ensemble learning KW - deidentification KW - transformation-based error-driven rule learner N2 - Background: Deidentification of clinical records is a critical step before their publication. This is usually treated as a type of sequence labeling task, and ensemble learning is one of the best performing solutions. Under the framework of multi-learner ensemble, the significance of a candidate rule-based learner remains an open issue. 
Objective: The aim of this study is to investigate whether a rule-based learner is useful in a hybrid deidentification system and offer suggestions on how to build and integrate a rule-based learner. Methods: We chose a data-driven rule-learner named transformation-based error-driven learning (TBED) and integrated it into the best performing hybrid system in this task. Results: On the popular Informatics for Integrating Biology and the Bedside (i2b2) deidentification data set, experiments showed that TBED can offer high performance with its generated rules, and integrating the rule-based model into an ensemble framework, which reached an F1 score of 96.76%, achieved the best performance reported in the community. Conclusions: We proved the rule-based method offers an effective contribution to the current ensemble learning approach for the deidentification of clinical records. Such a rule system could be automatically learned by TBED, avoiding the high cost and low reliability of manual rule composition. In particular, we boosted the ensemble model with rules to create the best performance of the deidentification of clinical records. UR - http://medinform.jmir.org/2020/4/e17622/ UR - http://dx.doi.org/10.2196/17622 UR - http://www.ncbi.nlm.nih.gov/pubmed/32352384 ID - info:doi/10.2196/17622 ER - TY - JOUR AU - Hu, Fang AU - Li, Liuhuan AU - Huang, Xiaoyu AU - Yan, Xingyu AU - Huang, Panpan PY - 2020/4/16 TI - Symptom Distribution Regularity of Insomnia: Network and Spectral Clustering Analysis JO - JMIR Med Inform SP - e16749 VL - 8 IS - 4 KW - insomnia KW - core symptom KW - symptom community KW - symptom embedding representation KW - spectral clustering algorithm N2 - Background: Recent research in machine-learning techniques has led to significant progress in various research fields. In particular, knowledge discovery using this method has become a hot topic in traditional Chinese medicine.
As the key clinical manifestations of patients, symptoms play a significant role in clinical diagnosis and treatment, which evidently have their underlying traditional Chinese medicine mechanisms. Objective: We aimed to explore the core symptoms and potential regularity of symptoms for diagnosing insomnia to reveal the key symptoms, hidden relationships underlying the symptoms, and their corresponding syndromes. Methods: An insomnia dataset with 807 samples was extracted from real-world electronic medical records. After cleaning and selecting the theme data referring to the syndromes and symptoms, the symptom network analysis model was constructed using complex network theory. We used four evaluation metrics of node centrality to discover the core symptom nodes from multiple aspects. To explore the hidden relationships among symptoms, we trained each symptom node in the network to obtain the symptom embedding representation using the Skip-Gram model and node embedding theory. After acquiring the symptom vocabulary in a digital vector format, we calculated the similarities between any two symptom embeddings, and clustered these symptom embeddings into five communities using the spectral clustering algorithm. Results: The top five core symptoms of insomnia diagnosis, including difficulty falling asleep, easy to wake up at night, dysphoria and irascibility, forgetful, and spiritlessness and weakness, were identified using evaluation metrics of node centrality. The symptom embeddings with hidden relationships were constructed, which can be considered as the basic dataset for future insomnia research. The symptom network was divided into five communities, and these symptoms were accurately categorized into their corresponding syndromes. Conclusions: These results highlight that network and clustering analyses can objectively and effectively find the key symptoms and relationships among symptoms.
Identification of the symptom distribution and symptom clusters of insomnia further provides valuable guidance for clinical diagnosis and treatment. UR - http://medinform.jmir.org/2020/4/e16749/ UR - http://dx.doi.org/10.2196/16749 UR - http://www.ncbi.nlm.nih.gov/pubmed/32297869 ID - info:doi/10.2196/16749 ER -