Original Paper
Abstract
Background: The detection of infectious diseases through the analysis of free text on electronic health reports (EHRs) can provide prompt and accurate background information for the implementation of preventative measures, such as advertising and monitoring the effectiveness of vaccination campaigns.
Objective: The purpose of this paper is to compare machine learning techniques in their application to EHR analysis for disease detection.
Methods: The Pedianet database was used as a data source for a real-world scenario on the identification of cases of varicella. The models’ training and test sets were based on two different Italian regions’ (Veneto and Sicilia) data sets of 7631 patients and 1,230,355 records, and 2347 patients and 569,926 records, respectively, for whom a gold standard of varicella diagnosis was available. Elastic-net regularized generalized linear model (GLMNet), maximum entropy (MAXENT), and LogitBoost (boosting) algorithms were implemented in a supervised environment and 5-fold cross-validated. The document-term matrix generated by the training set involves a dictionary of 1,871,532 tokens. The analysis was conducted on a subset of 29,096 tokens, corresponding to a matrix with no more than a 99% sparsity ratio.
Results: The highest predictive values were achieved through boosting (positive predictive value [PPV] 63.1%, 95% CI 42.7%-83.5% and negative predictive value [NPV] 98.8%, 95% CI 98.3%-99.3%). GLMNet delivered superior predictive capability compared to MAXENT (PPV 24.5% and NPV 98.3% vs PPV 11.0% and NPV 98.0%). MAXENT and GLMNet predictions weakly agree with each other (agreement coefficient 1 [AC1]=0.60, 95% CI 0.58-0.62), as well as with LogitBoost (MAXENT: AC1=0.64, 95% CI 0.63-0.66 and GLMNet: AC1=0.53, 95% CI 0.51-0.55).
Conclusions: Boosting has demonstrated promising performance in large-scale EHR-based infectious disease identification.
doi:10.2196/14330
Introduction
Improving the predictive capability of infectious disease detection at the population level is an important public health issue that can provide the background information necessary for the implementation of effective control strategies, such as advertising and monitoring the effectiveness of vaccination campaigns [ ].
The need for fast, cost-effective, and accurate detection of infection rates has been widely investigated in recent literature [ ]. In particular, the combination of increased electronic health report (EHR) implementation in primary care, the growing availability of digital information within the EHR, and the development of data mining techniques offers great promise for accelerating pediatric infectious disease research [ ].
Although EHR data are collected prospectively in real time at the point of health care delivery, observational studies intended to retrospectively assess the impact of clinical decisions are likely the most common type of EHR-enabled research [ ].
Among the high-impact diseases, the prompt identification of varicella zoster viral infections is of key interest due to the debate around the need and cost-benefit dynamics of a mass-vaccination program for young children [ , ].
Challenges in this context arise from both the unique epidemiological characteristics of varicella zoster with respect to information extraction, such as age-specific consultation rates, seasonality, force of infection, hospitalization rates, and inpatient days [ ], and from the way that medical records are organized, often in free-format and uncoded fields [ ]. A critical step is to transform this large amount of health care data into knowledge.
Data extraction from free text for disease detection at the individual level can be based on manual, in-depth examinations of individual medical records or, to contain costs and time, on automatic coding. Machine learning techniques (MLTs) are the most commonly used approaches [ ] and show good overall performance [ , ]. Nevertheless, little guidance is currently available on the most appropriate technique to use, and comparative evidence on the performance of each available technique is still lacking [ ] in the field of pediatric infectious disease research.
In recent years, generalized linear model (GLM)-based techniques have been largely used for the text mining of EHRs, both as a technique of choice [ ] and as a benchmark [ ]. The performance of GLMs, especially multinomial or, in the simplest cases, logistic regression, has been indicated as unsatisfactory [ ] because they are prone to overfitting and are sensitive to outliers. Enhancements to GLMs have been proposed recently in the form of the lasso and elastic-net regularized GLM [ ] (GLMNet), multinomial logistic regression (maximum entropy [MAXENT]), and the boosting approach implemented in the LogitBoost algorithm [ ] to overcome the limitations of naïve GLMs. Nevertheless, to the best of our knowledge, no comparisons have been made among these techniques to determine to what extent improvements are needed.
The purpose of this study is to compare enhanced GLM techniques in the setting of automatic disease detection [ ]. In particular, these methods will be assessed on their ability to identify cases of varicella from a large set of EHRs.
Methods
Electronic Medical Record Database
The Italian Pedianet database [ ] collects anonymized clinical data from more than 300 pediatricians throughout the country. This database focuses on children 0-14 years of age [ - ] and records the reasons for accessing health care, diagnoses, and clinical details. The sources of those data are primary care records written in Italian, which are filled in by pediatricians with clinical details about diagnoses and prescriptions; they also contain details about any hospitalizations and specialist referrals.
For the purpose of this study, we were allowed to access only two subsets of the Pedianet database, corresponding to the data collected between 2004 and 2014 in the Italian regions of Veneto (northern Italy) and Sicilia (southern Italy). Since the Veneto region data set was larger, it was used to train the models. The data set of the Sicilia region provided an independent data set for testing them. The main characteristics of the two data sets are reported in the first table below. It is worth noting that the proportion of positive cases of varicella differs between the two data sets. Interpreting differences in prevalence between regions is beyond the purpose of this study; nevertheless, given the smaller prevalence, a lower positive predictive value (PPV) and a higher negative predictive value (NPV) are expected on the test set.
The Pedianet source data include five different tables, which are briefly described in the second table below.
Characteristic | Train | Test |
Database | Pedianet | Pedianet |
Language | Italian | Italian |
Italian Region | Veneto | Sicilia |
Date span | January 2, 2004-December 31, 2014 | January 7, 2004-December 30, 2014 |
Records, n | 1,230,355 | 569,926 |
Children, n | 7631 | 2347 |
Pediatricians, n | 46 | 13 |
Positive cases, n (%) | 3481 (45.6%) | 128 (5.4%) |
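The expected shift in predictive values follows from Bayes' rule: for fixed sensitivity and specificity, PPV falls and NPV rises as prevalence decreases. The following is a minimal sketch of that arithmetic. The sensitivity and specificity values (0.80 and 0.95) are hypothetical and chosen only for illustration; the two prevalences are taken from the table above.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by Bayes' rule at a given prevalence."""
    tp = sensitivity * prevalence                  # true positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false positive fraction
    tn = specificity * (1.0 - prevalence)          # true negative fraction
    fn = (1.0 - sensitivity) * prevalence          # false negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point, held fixed across the two regions.
SENSITIVITY, SPECIFICITY = 0.80, 0.95

# Prevalences from the table: 3481/7631 (Veneto) and 128/2347 (Sicilia).
ppv_train, npv_train = predictive_values(SENSITIVITY, SPECIFICITY, 3481 / 7631)
ppv_test, npv_test = predictive_values(SENSITIVITY, SPECIFICITY, 128 / 2347)

print(f"train: PPV={ppv_train:.3f} NPV={npv_train:.3f}")
print(f"test:  PPV={ppv_test:.3f} NPV={npv_test:.3f}")
```

Under these assumed operating characteristics, the same classifier yields a lower PPV and a higher NPV on the lower-prevalence data set, which is the direction anticipated for the test set.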
Table topic | Content | Type of data | Example |
Accessing | Reasons for accessing the pediatrician and diagnoses | Free text (including codes) | |
Diaries | Pediatrician’s free-text diaries | Free text | |
Hospitalizations | Details on hospital admissions, diagnoses, and lengths of stay | Free text | |
SOAPa | Symptoms, objectivity, diagnosis, or prescriptions | Free text (including codes) | |
Specialist visits | Visit type and its diagnosis | Free text (including codes) | |
aSOAP: symptoms, objectivity, diagnosis, or prescriptions.
bFor tables with multiple fields, field names are reported in italics.
All the tables can be linked at the individual level (ie, each row of all the tables contains the fields for reporting information on dates, the assisting pediatrician’s anonymous identifier, and the patients’ anonymous identifier, which constitutes the linking key).
Case Definition
The case definition comes directly from the gold standard provided, and the training set for machine learning was created using those dichotomous labels (ie, 0=noncase, that is not a varicella case; and 1=case, that is a varicella case).
Training and Test Sets for Machine Learning
Linking by patient ID, pediatrician ID, and reporting date, we merged the five tables into a single table in which each entry represents a visit or evaluation of a patient carried out by a pediatrician on a specific day. At this step, the information (excluding patient ID, pediatrician ID, and reporting date) is contained in 15 columns of free text mixed with coded text, which we also treated as free text. Finally, all remaining columns of the table were merged into a single corpus (ie, a body of text). This process was applied separately to the 1,230,355 entries used to train the models (database of the Veneto region) and to the 569,926 entries used to test them (database of the Sicilia region).
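The merging step can be illustrated with a small sketch. It is not the authors' code (the analysis was carried out in R); the field names `patient_id`, `pediatrician_id`, and `date`, the table names, and the toy rows are all hypothetical. Every non-key field is treated as free text and concatenated per visit.

```python
from collections import defaultdict

# Hypothetical names for the three linking fields shared by all tables.
KEY_FIELDS = ("patient_id", "pediatrician_id", "date")


def merge_tables(tables):
    """Merge rows from several tables into one text entry per visit key."""
    visits = defaultdict(list)
    for rows in tables.values():
        for row in rows:
            key = tuple(row[field] for field in KEY_FIELDS)
            # Every non-key, non-empty field is treated as free text.
            texts = [str(v) for k, v in row.items() if k not in KEY_FIELDS and v]
            visits[key].extend(texts)
    return {key: " ".join(parts) for key, parts in visits.items()}


# Toy data: two of the five tables, with invented content.
tables = {
    "accessing": [
        {"patient_id": "p1", "pediatrician_id": "d1", "date": "2010-03-01",
         "reason": "febbre", "diagnosis": "varicella"},
    ],
    "diaries": [
        {"patient_id": "p1", "pediatrician_id": "d1", "date": "2010-03-01",
         "note": "vescicole diffuse"},
        {"patient_id": "p2", "pediatrician_id": "d1", "date": "2010-03-02",
         "note": "controllo"},
    ],
}

corpus = merge_tables(tables)
print(corpus[("p1", "d1", "2010-03-01")])
```

Keying on the full triple keeps one entry per patient, pediatrician, and day, so text from different tables that refers to the same visit ends up in the same document.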
Preprocessing
Text analysis by a computer program is possible only after establishing a way to convert text (ie, readable to humans) into numbers (ie, readable to computers). This process is called preprocessing, and it is the first [ ] and probably the most important step in data mining [ ]. To process the corpus of Pedianet EHRs included in the training set, we used the following strategy. First, we converted all fields to a text type; lowercased the content; and cleared it of symbols, punctuation, numbers, and extra white spaces. Second, we stemmed the words (ie, reduced them to their basic form, or “root”), which is recognized as one of the most important procedures to perform [ ], and constructed 2-gram tokens, which have been shown to be the optimal rank for gram tokenization [ ]. Third, we removed all the (stemmed) stop words (ie, common and nonmeaningful words such as articles or conjunctions) from the set of tokens as well as all bigrams containing any of them. We chose this strategy after exploring different approaches described in [ ]. Fourth, we created the document-term matrix (DTM) as a patient-token matrix. To consider both the importance of a token within a patient (ie, one row of the DTM) and its discrimination power between patients’ records (ie, the rows of the DTM), we computed the TF-iDF (term frequencies–inverse document frequencies) weights. TF-iDF weights help to adjust for the presence of words that are more frequent but less meaningful [ ]. The TF-iDF entry for the i-th document and the j-th token is the product of the frequency of the j-th token in the i-th document and the logarithm of the inverse of the proportion of documents that contain that token. That is, the more frequently a token appears in a document, the more its weight rises for that document, and the more documents that contain the j-th token, the more its weight shrinks across all the documents [ ]. In the initial DTM there were 1,871,532 tokens that appeared at least once, with a nonsparse/sparse entries ratio of 18,951,304/14,262,709,388. We decided to reduce it to achieve a maximum of 99% overall sparsity. Filtering out the tokens that did not appear in at least 1% of the documents reduced the sparsity to 94% (ie, 29,096 tokens that appear at least once, for a nonsparse/sparse entries ratio of 13,140,370/208,891,206). The choice of a 99% level of sparsity was a tradeoff between the need to retain as many tokens as possible and the computational effort.
The corpus of Pedianet EHRs in the test set went through the same text preprocessing strategy in the same order, and then the DTM was created with the initial TF weighting scheme. It was then adapted to the tokens retained in the training phase (ie, adding the missing tokens with zero weight and removing the ones not included in the training DTM) and was finally reweighted with the TF-iDF scheme using the iDF weights computed on the whole training data set. Those steps are necessary to guarantee that the two feature spaces are the same and that the trained models can be evaluated on the test set.
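The weighting and feature-space alignment described above can be sketched as follows. This is an illustration under stated assumptions, not the R code used in the study: the natural logarithm, the function names, the toy documents, and the 50% document-frequency threshold (used here instead of 1% so that the tiny example filters something) are all assumptions.

```python
import math
from collections import Counter


def fit_tfidf(train_docs, min_doc_fraction):
    """Select tokens by document frequency and compute their iDF on the training set."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    vocab = sorted(t for t, df in doc_freq.items() if df / n_docs >= min_doc_fraction)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    return vocab, idf


def transform(docs, vocab, idf):
    """Build TF-iDF rows over the training vocabulary, reusing the training iDF."""
    rows = []
    for tokens in docs:
        tf = Counter(tokens)  # missing tokens count as 0; unseen tokens are dropped
        rows.append([tf[t] * idf[t] for t in vocab])
    return rows


def sparsity(rows):
    """Fraction of zero entries in a dense list-of-lists matrix."""
    cells = [value for row in rows for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


train_docs = [["varicell", "febbr"], ["febbr", "toss"], ["febbr"], ["varicell", "varicell"]]
test_docs = [["varicell", "esantem"]]  # "esantem" was never seen in training

vocab, idf = fit_tfidf(train_docs, min_doc_fraction=0.5)
train_rows = transform(train_docs, vocab, idf)
test_rows = transform(test_docs, vocab, idf)
print(vocab, sparsity(train_rows), test_rows)
```

Because the test rows are built over `vocab` with the training `idf`, both matrices share the same columns and weights, which is what allows a model trained on one to be evaluated on the other.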
Machine Learning Techniques
Enhancements of GLMs for carrying out text mining on EHRs have been proposed in the form of the lasso and GLMNet [ ], multinomial logistic regression (MAXENT), and the boosting approach (LogitBoost) [ ].
GLMNet is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods, applied together with a link function and a variance function to overcome linear model limitations (eg, the assumptions of constant variance and normality of the data). The binomial family was selected (ie, the model fitted was a regularized logistic regression for the log odds), while the amount of regularization was automatically selected by the algorithm through an exploration of 100 values between the minimum value that reduced all the coefficients to zero and its 0.01 fraction.
MAXENT is an implementation of (multinomial) logistic regression aimed at minimizing the memory load on large data sets in R (R Foundation for Statistical Computing) and is primarily designed to work with the sparse DTM provided by the R package [ ]. It has been proven to provide results mathematically equivalent to a Poisson GLM [ ].
Boosting is a general approach for improving the predictive capability of any given learning algorithm. We used the adaptations of Tuszynski [ ] to the original algorithm (ie, LogitBoost [ , ]), which are aimed at making the entire process more efficient when applied to large data sets. The standard boosting technique [ ] is applied to the sequential use of a decision stump classification algorithm as a weak learner (ie, a one-level binary decision tree). The number of stumps considered equals the number of columns in the training set.
Those techniques were chosen among computationally tractable algorithms for use with large data sets [ ]. GLMNet and MAXENT represent classical benchmark approaches to linear and logistic classification, respectively, in contrast to LogitBoost, which is a modern boosted tree-based machine learning approach [ , ]. Moreover, LogitBoost generalizes the classical logistic models by fitting a logistic model at each node [ ] and offers an alternative point of view with regard to models such as the GLMs, for which the structure of the learner must be chosen a priori [ ].
Training and Testing
We addressed the issue of internal validation by performing cross-validation on the training set comprising records from the Veneto region. We dealt with external validation by accessing a truly external sample of Pedianet EHRs from another Italian region, Sicily. This accomplishes two tasks: preserving precision in the training phase and complementing study findings with external validation results using data that were not available when the predictive tool was developed.
We used a 5-fold cross-validation approach to validate each of the three MLTs on the DTM with the corresponding (by row) “case/noncase” labels attached. All MLTs were fitted on the same set of folds to ensure a proper comparison between techniques. Values of k=10 or k=5 (especially for large data sets) have been shown empirically to yield acceptable error rates in terms of bias-variance trade-off [ , ]. The choice of 5 folds was thus driven by computational complexity: the fewer the folds, the lower the complexity.
As measures of performance, we calculated point estimates and 95% CIs for the following, where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
- PPV or Precision: $\mathrm{PPV} = TP/(TP+FP)$, that is the fraction of positively identified cases that are true positives
- NPV: $\mathrm{NPV} = TN/(TN+FN)$, that is the fraction of identified noncases that are true negatives
- Sensitivity or Recall: $\mathrm{Sensitivity} = TP/(TP+FN)$, that is the true positive rate
- Specificity: $\mathrm{Specificity} = TN/(TN+FP)$, that is the true negative rate
- F score: $F = 2 \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}/(\mathrm{PPV} + \mathrm{Sensitivity})$, the harmonic mean of the PPV (Precision) and Sensitivity (Recall)
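The measures listed above can be computed from the four cells of a confusion matrix. The sketch below is a plain illustration with invented labels (1=case, 0=noncase); it returns point estimates only and does not reproduce the 95% CIs reported in this study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 1=case, 0=noncase."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Point estimates of the five measures; assumes every denominator is nonzero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"ppv": ppv, "npv": npv, "sensitivity": sensitivity,
            "specificity": specificity, "f_score": f_score}


# Invented gold-standard labels and predictions for eight records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```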
The Gwet agreement coefficient 1 (AC1) statistics of agreement [ , ] between the techniques were computed and reported, along with their corresponding 95% CIs. Given that A=the number of times both models classify a record as noncase, D=the number of times both models classify a record as a case, and N=the total sample size, then $\mathrm{AC1} = (p - e_\gamma)/(1 - e_\gamma)$, where $p = (A + D)/N$, and $e_\gamma$ is the agreement probability by chance and is equal to $2q(1 - q)$, where $q = (A_1 + B_1)/(2N)$; $A_1$ is the number of records classified as noncase by model 1, and $B_1$ is the number of records classified as noncase by model 2. AC1 was used because it is only weakly affected by marginal probabilities, and it was therefore chosen to manage unbalanced data [ ].
All the analyses were implemented in the R system [ ] with the computing facilities of the Unit of Biostatistics, Epidemiology and Public Health. The R packages used were: SnowballC (to stem the words) and RWeka (to create n-grams) for the preprocessing step; Matrix and SparseM to manage sparse matrices; glmnet, maxent, and caTools for the GLMNet, MAXENT, and LogitBoost MLT implementation; caret to create and evaluate the cross-validation folds; ROCR to estimate the performance; and the tidyverse bundle of packages for data management, functional programming, and plots. A git repository of the analysis code is available [ ].
Results
The flow chart, from data acquisition to preprocessing, is shown in the accompanying figure. In the training set, 29,096 terms out of the initial 1,871,532 were retained by the sparsity reduction step. Boosting significantly outperformed all other MLTs on the training set, with the highest F score and PPV. The GLMNet predictor delivered a superior F score and greater PPV compared to MAXENT (first table below). The same results held on the test set (second table below), and agreement between MLT predictions on the training set was good as measured by AC1 statistics (third table below).
Technique | Sensitivity (%), mean (95% CI) | PPVa (%), mean (95% CI) | NPVb (%), mean (95% CI) | Specificity (%), mean (95% CI) | F score (%), mean (95% CI) |
GLMNetc | 80.2 (77.7-82.7) | 73.2 (70.9-75.6) | 90.9 (89.6-92.2) | 87.1 (85.6-88.7) | 76.5 (75.6-77.5) |
MAXENTd | 68.8 (66.8-70.7) | 66.0 (62.5-69.5) | 86.1 (85.2-86.9) | 84.5 (82.7-86.3) | 67.4 (64.7-70.0) |
Boosting | 86.6 (82.1-91.1) | 95.8 (93.2-98.5) | 94.4 (92.4-96.3) | 98.3 (97.0-99.6) | 90.9 (89.7-92.1) |
aPPV: positive predictive value.
bNPV: negative predictive value.
cGLMNet: elastic-net regularized generalized linear model.
dMAXENT: maximum entropy.
Technique | Sensitivity (%), mean (95% CI) | PPVa (%), mean (95% CI) | NPVb (%), mean (95% CI) | Specificity (%), mean (95% CI) | F score (%), mean (95% CI) |
GLMNetc | 72.3 (66.4-78.1) | 24.5 (21.0-28.0) | 98.3 (97.9-98.6) | 87.4 (85.4-89.5) | 36.5 (32.2-40.8) |
MAXENTd | 74.8 (62.2-87.5) | 11.0 (9.5-12.5) | 98.0 (97.3-98.6) | 65.5 (54.7-76.2) | 19.1 (17.2-20.9) |
Boosting | 79.2 (69.7-88.7) | 63.1 (42.7-83.5) | 98.8 (98.3-99.3) | 96.9 (94.2-99.6) | 68.5 (59.3-77.7) |
aPPV: positive predictive value.
bNPV: negative predictive value.
cGLMNet: elastic-net regularized generalized linear model.
dMAXENT: maximum entropy.
Technique | Wrongly agreea, n | Correctly agreeb, n | Disagreec, n | Gwet AC1d,e (95% CI) |
GLMNetf vs MAXENTg | 669 | 5609 | 1353 | 0.68 (0.67-0.70) |
GLMNet vs boosting | 195 | 6269 | 1146 | 0.74 (0.72-0.75) |
MAXENT vs boosting | 224 | 5895 | 1491 | 0.66 (0.65-0.68) |
aThe “Wrongly agree” column refers to the number of records misclassified by both techniques.
bThe “Correctly agree” column states the number of records correctly classified by both techniques.
cThe “Disagree” column lists the number of records for which the techniques disagree in the classification.
dAC1: agreement coefficient 1.
eGwet AC1 represents the index of agreement between the identified techniques. Legend for AC1 is: AC1<0=disagreement; AC1 0.00-0.40=poor; AC1 0.41-0.60=fair; AC1 0.61-0.80=good; AC1 0.81-1.00=optimal.
fGLMNet: elastic-net regularized generalized linear model.
gMAXENT: maximum entropy.
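For two techniques and two categories, the Gwet AC1 defined in the Methods can be computed directly from the paired predictions. The sketch below uses invented prediction vectors (0=noncase, 1=case) and returns the point estimate only; it does not reproduce the 95% CIs shown in the agreement table above.

```python
def gwet_ac1(pred_1, pred_2):
    """Gwet AC1 for two binary classifiers (0=noncase, 1=case)."""
    n = len(pred_1)
    both_noncase = sum(1 for a, b in zip(pred_1, pred_2) if a == 0 and b == 0)  # A
    both_case = sum(1 for a, b in zip(pred_1, pred_2) if a == 1 and b == 1)     # D
    observed = (both_noncase + both_case) / n                                   # p
    noncase_1 = sum(1 for a in pred_1 if a == 0)                                # A1
    noncase_2 = sum(1 for b in pred_2 if b == 0)                                # B1
    q = (noncase_1 + noncase_2) / (2 * n)
    chance = 2 * q * (1 - q)                                                    # e_gamma
    return (observed - chance) / (1 - chance)


# Invented predictions from two hypothetical models on ten records.
model_1 = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
model_2 = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
ac1 = gwet_ac1(model_1, model_2)
print(round(ac1, 3))
```

In this toy case the two models agree on 8 of 10 records, and the chance-agreement term is 0.42, so the coefficient is (0.80 − 0.42)/(1 − 0.42).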
To analyze the most relevant errors, we explored whether any records were wrongly classified by all the techniques. There were 3 such records: 1 wrongly classified as positive and 2 wrongly classified as negative by all the MLTs.
Discussion
Principal Findings
The application of MLTs to EHRs constitutes the analytical component of an emerging research paradigm that rests on the capture and preprocessing of massive amounts of clinical data to gain clinical insights and ideally to complement the decision-making process at different levels, from individual treatment to definition of national public health policies. As acknowledged by others [ ], the development and application of big data analysis methods on EHRs may help create a continually learning health care system [ ].
This study trains and compares three different machine learning approaches to infectious disease detection at the population level based on clinical data collected in primary care EHRs. In line with the recommended paradigm for model validation [ ], the MLTs’ performance underwent internal validation through cross-validation and external validation on an independent set of EHRs.
The predictive capabilities of the developed MLTs are promising even if quite different from each other (eg, validation F scores range from 67% to 91% and test F scores range from 19% to 69%). Findings on the better performance reached by LogitBoost are in line with recent evidence that shows an improvement in general classification problems moving from MAXENT algorithms to LogitBoost-based ones [ ]. LogitBoost is thus confirmed to be a useful technique for solving health-related classification problems [ ].
Only three records were wrongly classified by all the models. The first one was wrongly classified as positive, probably because the text entry was “vaccini:varicella e mpr” (ie, vaccines: varicella and MMR), and after the preprocessing, the bigram “vaccin varicell” was removed because its TF-iDF weight was low. Thus, the relationship between varicella and vaccine was lost, and only the token “varicell” remained.
The other two records were wrongly classified as negative. For one of them, the misclassification was probably due to an issue in the tokenization: an anomalous sequence of dashes (“-”) and blanks led to the token “- varicella”, which was removed from the feature space, leaving no reference to the disease. The second record misclassified as negative referred to a child who was vaccinated for measles, mumps, rubella, and varicella (quadrivalent vaccine). The pediatrician wrote “vaccinazione morbillo parotite rosolia varicella” (ie, vaccination, measles, mumps, rubella, varicella). The bigram “rosol varicell” (ie, “rubell varicell”) was weighted 0.361 and, hence, was retained in the feature space and was considered by all the MLTs a pattern of noninfection.
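These errors depend on which unigrams and bigrams survive preprocessing, so it may help to see how such tokens are formed. The sketch below is a simplified illustration, not the pipeline used in this study: stemming is omitted (so it yields “vaccini varicella” rather than “vaccin varicell”), punctuation is replaced by a blank, and the stop-word list is a tiny hypothetical one.

```python
import re

# Tiny illustrative stop-word list (hypothetical; a real Italian list is much longer).
STOP_WORDS = {"e", "di", "il", "la", "per"}


def clean(text):
    """Lowercase, replace symbols, punctuation, and digits by blanks, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"[^a-zàèéìòù\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokens_with_bigrams(text):
    """Unigrams without stop words, plus bigrams that contain no stop word."""
    words = clean(text).split()
    unigrams = [w for w in words if w not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])
               if a not in STOP_WORDS and b not in STOP_WORDS]
    return unigrams + bigrams


example_tokens = tokens_with_bigrams("vaccini:varicella e mpr")
print(example_tokens)
```

In this simplified version, the only bigram linking the vaccine to varicella is “vaccini varicella”; if a later filtering step discards it, the unigram for varicella is all that remains, which mirrors the false-positive record described above.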
The strength of tree-based models such as LogitBoost also lies in their high scalability. In fact, their computational complexity (ie, the asymptotic time needed for a complete run) grows linearly with the sample size and quadratically with the number of features used (ie, the number of tokens considered) [ ]. Assuming that the richness of the pediatric EHRs’ vocabulary is limited (ie, the number of tokens reaches a plateau as data accumulate over time), an increase in computational time will depend only linearly on the number of patients.
Any attempt to use EHRs to identify patients with a specific disease would depend on the algorithm, the database, the language, and the true prevalence of the disease. As to the generalization of these models to other contexts, we hypothesize that they could also be successfully applied in public health systems with EHR charting in other languages [ ].
We acknowledge that one metric (ie, sensitivity, specificity, PPV, or NPV) may be more important than another, depending on the intended use of the classification algorithm. Thus, the LogitBoost model is adequate for ascertaining varicella cases, with a preference for case identification with good sensitivity and excellent specificity.
If the aim of using MLTs is to help create a gold standard for databases, the limited agreement between the MLTs reported in the Results suggests that these classification algorithms are not reliable as a set of annotators.
Limitations
Some limitations must be acknowledged. First, text preprocessing is a crucial step. The way free text is converted into numbers and numbers into features has one of the biggest impacts on the results [ ]. For this reason, we decided to follow a standard preprocessing procedure, without searching for the best one, to obtain results that are as independent as possible of human tuning.
Second, we set the number of boosting iterations equal to the number of features considered. This is suboptimal in computational time because the same performance can be reached with fewer iterations [ ]. Nevertheless, we aimed to reach an upper-bound value for the performance estimated in an optimal situation.
Third, the large difference in disease prevalence between the training and the test data sets should be noted. The boosting approach seems to deal with this issue in a satisfactory way, but a potential impact on model prediction cannot be excluded.
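To make the role of the number of boosting iterations concrete, a two-class LogitBoost with decision stumps can be sketched as follows. This is a simplified reading of the additive logistic regression algorithm of Friedman et al, not the caTools implementation used in this study; the toy data, the clamping of the working response, the weight floor, and all function names are assumptions. As in our analysis, the number of rounds is set equal to the number of features.

```python
import math


def weighted_mean(values, weights, idx):
    """Weighted mean of values[i] over the indices in idx."""
    total = sum(weights[i] for i in idx)
    return sum(weights[i] * values[i] for i in idx) / total


def fit_stump(X, z, w):
    """Weighted least-squares stump: (feature, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        levels = sorted({row[j] for row in X})
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2.0
            left = [i for i, row in enumerate(X) if row[j] <= thr]
            right = [i for i, row in enumerate(X) if row[j] > thr]
            m_left, m_right = weighted_mean(z, w, left), weighted_mean(z, w, right)
            sse = sum(w[i] * (z[i] - m_left) ** 2 for i in left)
            sse += sum(w[i] * (z[i] - m_right) ** 2 for i in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, m_left, m_right)
    if best is None:
        raise ValueError("no feature takes two distinct values")
    return best[1:]


def logitboost(X, y, n_rounds):
    """Fit n_rounds stumps on the working response of the logistic loss."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        p = [1.0 / (1.0 + math.exp(-2.0 * f)) for f in scores]
        w = [max(pi * (1.0 - pi), 1e-10) for pi in p]         # floor avoids division by 0
        z = [max(-4.0, min(4.0, (yi - pi) / wi))              # clamped working response
             for yi, pi, wi in zip(y, p, w)]
        j, thr, m_left, m_right = fit_stump(X, z, w)
        stumps.append((j, thr, m_left, m_right))
        scores = [f + 0.5 * (m_left if row[j] <= thr else m_right)
                  for f, row in zip(scores, X)]
    return stumps


def predict(stumps, row):
    """Classify as case (1) when the summed stump score is positive."""
    score = sum(0.5 * (ml if row[j] <= thr else mr) for j, thr, ml, mr in stumps)
    return 1 if score > 0 else 0


# Toy data: the label equals the first feature; the second feature is noise.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
stumps = logitboost(X, y, n_rounds=len(X[0]))
print([predict(stumps, row) for row in X])
```

Each round adds one stump, so the cost of a run grows with the number of rounds; stopping earlier trades some fit for computing time.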
Conclusions
Given their promising performance in identifying varicella cases, LogitBoost, and MLTs in general, could be effectively used for large-scale surveillance, minimizing time and cost in a scalable and reproducible manner.
Acknowledgments
The data that support the findings of this study are available from Pedianet, but restrictions apply to the availability of these data, which were used under license for this study and are not publicly available. Data are, however, available from the authors upon reasonable request and with the permission of Pedianet.
Authors' Contributions
CL, CG, and DG designed the study. CL and PB performed the analysis. CL, PB, IB, and GL wrote the manuscript. IB and DG interpreted the statistical results. GL and CG interpreted the clinical results. LT, AS, and LG handled data management.
Conflicts of Interest
None declared.
References
- Magill SS, Dumyati G, Ray SM, Fridkin SK. Evaluating epidemiology and improving surveillance of infections associated with health care, United States. Emerg Infect Dis 2015 Sep;21(9):1537-1542.
- Lloyd-Smith JO, Funk S, McLean AR, Riley S, Wood JL. Nine challenges in modelling the emergence of novel pathogens. Epidemics 2015 Mar;10:35-39.
- Sutherland SM, Kaelber DC, Downing NL, Goel VV, Longhurst CA. Electronic health record-enabled research in children using the electronic health record for clinical discovery. Pediatr Clin North Am 2016 Apr;63(2):251-268.
- Baracco G, Eisert S, Saavedra S, Hirsch P, Marin M, Ortega-Sanchez I. Clinical and economic impact of various strategies for varicella immunity screening and vaccination of health care personnel. Am J Infect Control 2015 Oct 01;43(10):1053-1060.
- Damm O, Ultsch B, Horn J, Mikolajczyk RT, Greiner W, Wichmann O. Systematic review of models assessing the economic value of routine varicella and herpes zoster vaccination in high-income countries. BMC Public Health 2015 Jun 05;15:533.
- Kawai K, Gebremeskel BG, Acosta CJ. Systematic review of incidence and complications of herpes zoster: towards a global perspective. BMJ Open 2014 Jun 10;4(6):e004833.
- Pierik JG, Gumbs PD, Fortanier SA, Van Steenwijk PC, Postma MJ. Epidemiological characteristics and societal burden of varicella zoster virus in the Netherlands. BMC Infect Dis 2012 May 10;12:110.
- Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet 2012 May 02;13(6):395-405.
- Afzal Z, Schuemie MJ, van Blijderveen JC, Sen EF, Sturkenboom MC, Kors JA. Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records. BMC Med Inform Decis Mak 2013 Mar 02;13:30.
- Wang Z, Shah AD, Tate AR, Denaxas S, Shawe-Taylor J, Hemingway H. Extracting diagnoses and investigation results from unstructured text in electronic health records by semi-supervised machine learning. PLoS One 2012;7(1):e30412.
- Kavuluru R, Rios A, Lu Y. An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artif Intell Med 2015 Oct;65(2):155-166.
- Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc 2016 Sep;23(5):1007-1015.
- Zheng T, Xie W, Xu L, He X, Zhang Y, You M, et al. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform 2017 Jan;97:120-127.
- Wu PY, Cheng CW, Kaddi CD, Venugopalan J, Hoffman R, Wang MD. -Omic and electronic health record big data analytics for precision medicine. IEEE Trans Biomed Eng 2017 Feb;64(2):263-273.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33(1):1-22.
- Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors). Ann Statist 2000 Apr;28(2):337-407.
- Mani S, Chen Y, Arlinghaus L, Li X, Chakravarthy A, Bhave S, et al. Early prediction of the response of breast tumors to neoadjuvant chemotherapy using quantitative MRI and machine learning. AMIA Annu Symp Proc 2011;2011:868-877.
- Pedianet. URL: http://www.pedianet.it/en [accessed 2019-04-09]
- Nicolosi A, Sturkenboom M, Mannino S, Arpinelli F, Cantarutti L, Giaquinto C. The incidence of varicella: correction of a common error. Epidemiology 2003 Jan;14(1):99-102.
- Cantarutti A, Donà D, Visentin F, Borgia E, Scamarcia A, Cantarutti L, Pedianet. Epidemiology of frequently occurring skin diseases in Italian children from 2006 to 2012: a retrospective, population-based study. Pediatr Dermatol 2015;32(5):668-678.
- Donà D, Mozzo E, Scamarcia A, Picelli G, Villa M, Cantarutti L, et al. Community-acquired rotavirus gastroenteritis compared with adenovirus and norovirus gastroenteritis in Italian children: a Pedianet study. Int J Pediatr 2016;2016:5236243.
- Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv 2002 Mar;34(1):1-47.
- Denny MJ, Spirling A. Text preprocessing for unsupervised learning: why it matters, when it misleads, and what to do about it. Polit Anal 2018 Mar 19;26(2):168-189.
- Liu M, Hu Y, Tang B. Role of text mining in early identification of potential drug safety issues. Methods Mol Biol 2014;1159:227-251.
- Marafino B, Davies J, Bardach N, Dean M, Dudley RA. N-gram support vector machines for scalable procedure and diagnosis classification, with applications to clinical free text data from the intensive care unit. J Am Med Inform Assoc 2014;21(5):871-875.
- Gregori D, Paola B, Soriani N, Baldi I, Lanera C. Maximizing text mining performance: the impact of pre-processing. In: JSM Proceedings, Section on Statistical Learning and Data Science. 2016 Presented at: ASA Joint Statistical Meeting; 2016; Chicago, IL p. 3265-3270.
- Wu HC, Luk RWP, Wong KF, Kwok KL. Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 2008 Jun 01;26(3):1-37.
- Goodall CR. Data mining of massive datasets in healthcare. Journal of Computational and Graphical Statistics 1999 Sep;8(3):620-634.
- Jurka T. maxent: an R package for low-memory multinomial logistic regression with support for semi-automated text classification. The R Journal 2012;4(1):56.
- Renner IW, Warton DI. Equivalence of MAXENT and Poisson point process models for species distribution modeling in ecology. Biometrics 2013 Mar;69(1):274-281.
- Tuszynski J. R-project. 2019. caTools: Tools: Moving Window Statistics, GIF, Base64, ROC AUC, etc URL: https://cran.r-project.org/package=caTools
- Dettling M, Bühlmann P. Boosting for tumor classification with gene expression data. Bioinformatics 2003 Jun 12;19(9):1061-1069.
- Freund Y, Schapire RE. Experiments with a new boosting algorithm. San Francisco, CA: Morgan Kaufmann Publishers Inc; 1996 Jul Presented at: Thirteenth International Conference on Machine Learning; 1996; Bari, Italy p. E URL: https://cseweb.ucsd.edu/~yfreund/papers/boostingexperiments.pdf
- Boughorbel S, Al-Ali R, Elkum N. Model comparison for breast cancer prognosis based on clinical data. PLoS One 2016;11(1):e0146413.
- Andrews P, Sleeman D, Statham P, McQuatt A, Corruble V, Jones P, et al. Predicting recovery in patients suffering from traumatic brain injury by using admission variables and physiological data: a comparison between decision tree analysis and logistic regression. J Neurosurg 2002 Aug;97(2):326-336.
- Landwehr N, Hall M, Frank E. Logistic model trees. Mach Learn 2005 May;59(1-2):161-205.
- Abeare S. LSU Master's Theses. 2009. Comparisons of boosted regression tree, GLM and GAM performance in the standardization of yellowfin tuna catch-rate data from the Gulf of Mexico lonline [sic] fishery URL: https://digitalcommons.lsu.edu/gradschool_theses/2880/ [accessed 2020-04-01]
- Hastie T, Tibshirani R, Friedman J. The Elements Of Statistical Learning. Berlin, Germany: Springer Science & Business Media; 2009.
- Borra S, Di Ciaccio A. Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis 2010 Dec;54(12):2976-2989. [CrossRef]
- Gwet K. Handbook Of Inter-rater Reliability: The Definitive Guide To Measuring The Extent Of Agreement Among Raters. Piedmont, Ca: Advanced Analytics, Llc; 2014.
- Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen's Kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol 2013 Apr 29;13:61 [FREE Full text] [CrossRef] [Medline]
- Zec S, Soriani N, Comoretto R, Baldi I. High agreement and high prevalence: the paradox of Cohen's kappa. Open Nurs J 2017;11:211-218 [FREE Full text] [CrossRef] [Medline]
- R Foundation for Statistical Computing. 2016. R: A Language Environment for Statistical Computing URL: https://www.r-project.org/ [accessed 2020-04-01]
- GitHub. mltzostercode URL: https://github.com/UBESP-DCTV/mltzostercode
- Ross M, Wei W, Ohno-Machado L. "Big data" and the electronic health record. Yearb Med Inform 2014 Aug 15;9:97-104 [FREE Full text] [CrossRef] [Medline]
- Wiens J, Shenoy ES. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin Infect Dis 2018 Jan 06;66(1):149-153 [FREE Full text] [CrossRef] [Medline]
- Xing C, Geng X, Xue H. Logistic boosting regression for label distribution learning. 2016 Presented at: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016; Las Vegas, NV p. 4489-4497 URL: http://openaccess.thecvf.com/content_cvpr_2016/papers/Xing_Logistic_Boosting_Regression_CVPR_2016_paper.pdf [CrossRef]
- Lorenzoni G, Bressan S, Lanera C, Azzolina D, Da Dalt L, Gregori D. Analysis of unstructured text-based data using machine learning techniques: the case of pediatric emergency department records in Nicaragua. In: Med Care Res Rev. 2019 Apr 29 Presented at: APHA 2017 Annual Meeting & Expo; November 4-8; Atlanta, GA p. 1077558719844123. [CrossRef]
Abbreviations
AC1: agreement coefficient 1
DTM: document-term matrix
EHR: electronic health report
GLM: generalized linear model
GLMNet: elastic-net regularized generalized linear model
MAXENT: maximum entropy
MLT: machine learning technique
NPV: negative predictive value
PPV: positive predictive value
TF-IDF: term frequency–inverse document frequency
Edited by N Bruining; submitted 10.04.19; peer-reviewed by R Bajpai, M Torii, B Polepalli Ramesh; comments to author 20.06.19; revised version received 28.08.19; accepted 16.12.19; published 05.05.20
Copyright © Corrado Lanera, Paola Berchialla, Ileana Baldi, Giulia Lorenzoni, Lara Tramontan, Antonio Scamarcia, Luigi Cantarutti, Carlo Giaquinto, Dario Gregori. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 05.05.2020.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.