This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Cardiac dysrhythmia is currently an extremely common disease. Severe arrhythmias often cause a series of complications, including congestive heart failure, fainting or syncope, stroke, and sudden death.
The aim of this study was to predict incident arrhythmia prospectively within a 1-year period to provide early warning of impending arrhythmia.
Retrospective (1,033,856 individuals enrolled between October 1, 2016, and October 1, 2017) and prospective (1,040,767 individuals enrolled between October 1, 2017, and October 1, 2018) cohorts were constructed from integrated electronic health records in Maine, United States. An ensemble learning workflow was built through multiple machine learning algorithms. Differentiating features, including acute and chronic diseases, procedures, health status, laboratory tests, prescriptions, clinical utilization indicators, and socioeconomic determinants, were compiled for incident arrhythmia assessment. The predictive model was retrospectively trained and calibrated using an isotonic regression method and was prospectively validated. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC).
The cardiac dysrhythmia case-finding algorithm (retrospective: AUROC 0.854; prospective: AUROC 0.827) stratified the population into 5 risk groups: 53.35% (555,233/1,040,767), 44.83% (466,594/1,040,767), 1.76% (18,290/1,040,767), 0.06% (623/1,040,767), and 0.003% (27/1,040,767) were in the very low-risk, low-risk, medium-risk, high-risk, and very high-risk groups, respectively; 51.85% (14/27) of patients in the very high-risk subgroup were confirmed to have incident cardiac dysrhythmia within the subsequent 1 year.
Our case-finding algorithm is promising for prospectively predicting 1-year incident cardiac dysrhythmias in a general population. We believe that it can serve as an early warning system that enables statewide population-level screening and surveillance to improve cardiac dysrhythmia care.
Cardiac dysrhythmia is a series of conditions in which the heartbeat is irregular, too fast, or too slow. There are many types of dysrhythmias, and most are mild; however, some severe arrhythmias increase the risk of serious or even life-threatening complications, such as congestive heart failure, syncope, stroke, and sudden death. More than 850,000 people are hospitalized for arrhythmias each year in the United States [
A few arrhythmia risk prediction tools have been applied in programs for screening, prevention of life-threatening arrhythmias [
With the widespread use of electronic health records (EHRs) in hospitals and clinics, an individual's physical and mental condition can be assessed to potentially improve the effectiveness of health management [
The purpose of this study was to retrospectively develop and prospectively validate our case-finding algorithm for patients at risk of 1-year incident cardiac dysrhythmia in Maine, United States.
Protected personal health information was deidentified for model development. Due to the nature of the development with deidentified data, this study was exempted from ethics review by the Stanford University Institutional Review Board (October 16, 2017).
A complete workflow (data collection, exclusion, and application) is presented in
A workflow diagram that describes data collection, data classification, model building, and model evaluation. AUC: area under the curve; EHR: electronic health record; HIE: health information exchange; PPV: positive predictive value.
The detailed modeling process with the ensemble learning method.
Nearly 95% of the population in Maine was included in the study. Clinical variables were collected, including demographic information, socioeconomic status, laboratory and radiographic tests coded according to Logical Observation Identifier Names and Codes, outpatient medication prescriptions coded according to the National Drug Code, and primary and secondary diagnoses and procedures, which were coded using International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM).
Cardiac dysrhythmia was defined according to ICD-10-CM, including paroxysmal supraventricular tachycardia, paroxysmal ventricular tachycardia, atrial fibrillation, atrial flutter, premature beats, sinoatrial node dysfunction, and other cardiac dysrhythmias (ICD-10-CM diagnosis codes from I47 to I49).
The individuals included in this study were patients who visited any medical institutions in the Maine health information exchange network from October 1, 2015, to October 1, 2018. The retrospective timeframe was from October 1, 2016, to October 1, 2017. The prospective timeframe was from October 1, 2017, to October 1, 2018. Individuals were excluded if they died during the study period or were diagnosed with cardiac dysrhythmia before October 1, 2016, for the retrospective analysis and before October 1, 2017, for the prospective analysis.
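The inclusion and exclusion rules above can be illustrated with a minimal sketch. The `Patient` record, its field names, and the treatment of window boundaries (start inclusive, end exclusive for the outcome window) are assumptions made for illustration; the source does not describe the data schema or how boundary dates were handled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Dates taken from the text; the record layout below is hypothetical.
STUDY_START = date(2015, 10, 1)
STUDY_END = date(2018, 10, 1)
RETRO_START, RETRO_END = date(2016, 10, 1), date(2017, 10, 1)
PROSP_START, PROSP_END = date(2017, 10, 1), date(2018, 10, 1)


@dataclass
class Patient:
    patient_id: str
    first_dysrhythmia_dx: Optional[date]  # earliest ICD-10-CM I47-I49 diagnosis, if any
    death_date: Optional[date]


def eligible(p: Patient, cohort_start: date) -> bool:
    """Apply the two exclusion rules for a cohort that starts at cohort_start."""
    died_in_study = (
        p.death_date is not None and STUDY_START <= p.death_date <= STUDY_END
    )
    prior_dx = (
        p.first_dysrhythmia_dx is not None and p.first_dysrhythmia_dx < cohort_start
    )
    return not (died_in_study or prior_dx)


def incident_case(p: Patient, cohort_start: date, cohort_end: date) -> bool:
    """Outcome label: the first dysrhythmia diagnosis falls inside the cohort window."""
    dx = p.first_dysrhythmia_dx
    return dx is not None and cohort_start <= dx < cohort_end
```

Under these assumptions, an individual first diagnosed in March 2017 is eligible for the retrospective cohort (and counts as an incident case there) but is excluded from the prospective cohort.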
Information regarding medical history, diagnoses, medications, treatment plans, immunization dates, allergies, radiology images, and laboratory test results was extracted from EHRs. Relevant socioeconomic variables were extracted from the US Census and US Department of Agriculture websites [
Missing values in the data matrix for machine learning most likely arose because a test was never ordered or because the absence of a relevant comorbidity was not coded. Each entry of the data matrix recorded either a binary outcome (0 or 1) or a utilization count. Therefore, entries left missing after data matrix consolidation were filled with 0 (no outcome or zero count). Because stratification factors can confound the relationship between features and outcome in such a large sample, the Cochran-Mantel-Haenszel test was applied to analyze the relationship between each feature and the corresponding outcome within age-group strata [
Multihypothesis test correction was conducted to ensure the false discovery rate of the remaining features was in an acceptable range [
To investigate associations between feature categories, we built correlation networks among the features based on Spearman correlations. In these networks, vertices correspond to features, and an edge existed between 2 vertices if and only if a correlation (absolute value of the Spearman coefficient) >0.1 between 2 features was observed. In real clinical settings, these features are most likely not independent of each other, and more complicated causative or associative relationships may exist among these significant feature categories.
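The network construction described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names are hypothetical, and features with no variation are treated as having no edges, which is an assumption. Only the edge rule (absolute Spearman coefficient >0.1) comes from the text.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # constant feature: correlation undefined, treated as no edge
    return cov / sqrt(vx * vy)


def correlation_network(features, threshold=0.1):
    """features: dict name -> list of values. Returns edges with |rho| > threshold."""
    edges = []
    for a, b in combinations(sorted(features), 2):
        rho = spearman(features[a], features[b])
        if abs(rho) > threshold:
            edges.append((a, b, rho))
    return edges
```

Each returned tuple is one edge of the network; the vertices are the keys of the `features` dictionary.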
The retrospective cohort was divided into 2 parts: two-thirds of the data in the retrospective cohort were used for training, and the remaining one-third was used for model calibration. For training, multiple algorithms were applied, including least absolute shrinkage and selection operator (LASSO) [
where
in which each hypothesis is also multiplied by the prior probability of that hypothesis and where
The isotonic regression method was used to calibrate the model [
To find individuals at different risk levels, the population was assigned to 5 risk bins: very low risk, low risk, medium risk, high risk, and very high risk. Model performance was evaluated using sensitivity, specificity, and PPV. AUROC values were used to summarize the trade-off between sensitivity and specificity.
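As an illustration of score calibration and risk binning, the sketch below fits a step-function isotonic regression with the pool-adjacent-violators algorithm, maps a calibrated probability to one of the 5 bins, and computes the PPV per bin. It is a simplified sketch under stated assumptions: the cutoff values are hypothetical (the source does not list the thresholds), and the authors' calibration may differ in detail, for example by interpolating between knots.

```python
from bisect import bisect_left, bisect_right

RISK_BINS = ["very low", "low", "medium", "high", "very high"]


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing map: raw score -> event rate."""
    agg = {}  # pool tied scores first: score -> (sum of labels, count)
    for s, y in zip(scores, labels):
        total, count = agg.get(s, (0.0, 0))
        agg[s] = (total + y, count + 1)
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # merge while the previous block's mean exceeds the current block's mean
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            t, c, top = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = top
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(score, uppers, means):
    """Step-function lookup; scores above the last knot get the last block's rate."""
    idx = min(bisect_left(uppers, score), len(means) - 1)
    return means[idx]


def assign_bin(prob, cutoffs):
    """cutoffs: 4 ascending thresholds (hypothetical values chosen by the user)."""
    return RISK_BINS[bisect_right(cutoffs, prob)]


def ppv_by_bin(probs, labels, cutoffs):
    """PPV per bin = confirmed cases / individuals assigned to that bin."""
    counts = {name: [0, 0] for name in RISK_BINS}
    for p, y in zip(probs, labels):
        c = counts[assign_bin(p, cutoffs)]
        c[0] += y
        c[1] += 1
    return {name: (cases / n if n else None) for name, (cases, n) in counts.items()}
```

For example, a bin containing 27 individuals of whom 14 are later confirmed as cases has a PPV of 14/27 (51.85%), matching the very high-risk group reported in this study.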
Survival analysis was applied to track the timing of arrhythmia diagnosis in the different risk bins. Kaplan-Meier curves were plotted for each risk level to compare the time to a new arrhythmia diagnosis across bins. The Cox proportional hazards regression method was used for multivariate analysis.
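The Kaplan-Meier estimate for a single risk bin can be sketched as follows. This is a minimal product-limit estimator written for illustration; the function name and input layout are assumptions, and a real analysis would typically rely on an established survival analysis package.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the diagnosis-free probability S(t).

    times:  follow-up time per individual (e.g., months since cohort start)
    events: 1 if arrhythmia was diagnosed at that time, 0 if censored
    Returns a list of (t, S(t)) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        diagnosed = 0  # events observed exactly at time t
        leaving = 0    # individuals leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            diagnosed += data[i][1]
            leaving += 1
            i += 1
        if diagnosed > 0:
            survival *= 1 - diagnosed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

As a simplified usage example, if the event times of the very high-risk group are collapsed to the two checkpoints reported in this study (8 of 27 diagnosed by month 4, 14 of 27 by month 12, the rest censored at month 12), the estimator returns S(4)=19/27 and S(12)=13/27, that is, cumulative incidences of 29.63% and 51.85%. The exact diagnosis dates are not given in the source, so this collapse is an assumption.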
The cohort baseline characteristics are shown in
Baseline characteristics of the retrospective and prospective cohorts.
Characteristic | Retrospective (n=1,033,856), n (%) | Prospective (n=1,040,767), n (%) |
Age (years) | | |
<35 | 391,613 (37.90) | 399,545 (38.40) |
35-50 | 178,348 (17.20) | 176,995 (17.00) |
50-65 | 245,580 (23.80) | 243,161 (23.40) |
65-75 | 132,444 (12.80) | 135,600 (13.00) |
>75 | 85,871 (8.30) | 85,466 (8.20) |
Gender | | |
Male | 464,796 (45.00) | 571,821 (54.90) |
Female | 569,060 (55.00) | 468,946 (45.10) |
Chronic disease | | |
Cardiovascular disease^a | 215,059 (20.80) | 228,757 (22.00) |
Chronic obstructive pulmonary disease | 38,029 (3.70) | 42,778 (4.10) |
Chronic kidney disease | 21,277 (2.70) | 23,653 (2.30) |
Type 2 diabetes | 83,387 (8.10) | 84,649 (8.10) |
Disorder of metabolism | 214,606 (20.80) | 221,168 (21.30) |
Hypothyroidism | 73,319 (7.10) | 74,495 (7.20) |
Acute disease | | |
Pain in throat and chest | 48,507 (4.69) | 55,368 (5.30) |
Anemia | 10,921 (1.06) | 11,099 (1.10) |
Edema | 17,096 (1.65) | 17,768 (1.70) |
Syncope and collapse | 10,662 (1.03) | 11,855 (1.10) |
Malaise and fatigue | 47,793 (4.62) | 46,834 (4.50) |
Health status | | |
Long term (current) drug therapy | 86,501 (8.40) | 107,834 (10.40) |
Personal history of other diseases and conditions | 104,282 (10.10) | 130,955 (12.60) |
BMI>33.0 | 3055 (0.30) | 3842 (0.40) |
Laboratory test | | |
Glomerular filtration rate | 391,613 (37.90) | 16,726 (1.60) |
Coagulation assay | 12,197 (1.20) | 9616 (0.90) |
Carboxyhemoglobin in blood | 35,773 (3.50) | 20,850 (2.00) |
Medication | | |
Beta-adrenergic blocker | 66,750 (6.50) | 69,572 (6.70) |
Proton pump inhibitor | 68,123 (6.60) | 69,278 (6.70) |
Vitamin K antagonist | 6858 (0.70) | 5541 (0.50) |
^a Cardiovascular diseases included heart failure, rheumatic mitral valve diseases, atrioventricular and left bundle-branch block, cardiomyopathy, nonrheumatic aortic, tricuspid and mitral valve disorders, atherosclerosis, and other disorders of arteries and arterioles.
The original features (n=17,865) were extracted from the EHRs and socioeconomic databases. The model building process identified 307 features with contributing weights, including 2 demographic features, 18 socioeconomic characteristics, 101 chronic disease diagnostics, 147 confirmed acute disease and disease events, 7 procedures, 5 utilization variables, 9 factors of the health status, 9 medication prescriptions, and 9 laboratory tests. The top 60 important features and their odds ratios in the model are tabulated in
We built correlation networks among these 307 features based on Spearman correlations. The integral correlation networks contain 307 vertices and 325 edges. The majority of edges involved diagnostic diseases (n=206) and demographic features (n=34), with an additional 28 edges involving clinical medications, 27 involving laboratory tests, 18 involving health status, 15 involving procedures, and 10 involving socioeconomic characteristics, as well as all the utilization variables. The community structure of the 153 impactful features in different types and their correlation networks containing 127 edges, as an example, is shown in
The community structure of the 153 impactful features and their correlation networks (absolute value of the correlation coefficient >0.1).
Arrhythmia is an important group of cardiovascular diseases and is associated with other cardiovascular diseases. Heart disease–related features were revealed in our community structures and correlational networks, including acute myocardial infarction, aortic dissection, chronic ischemic heart disease, cardiomyopathy, and atherosclerosis caused by chronic obstructive pulmonary disease. These features imply a potential causative relationship with arrhythmia. Electrolyte imbalance, heart enlargement, heart failure, and myocardial ischemia may be related to the pathogenesis of arrhythmia and may be complications of chronic kidney disease, metabolic syndrome (type 2 diabetes), and hypertension. The associative relationship between arrhythmia and these chronic diseases is shown in
The important network structure of the predictive diagnostic features.
The AUROC results from our predictive methods, including LASSO, feed-forward neural network, random forest, boosting, XGBoost, naïve Bayes, k-nearest neighbor, and ensemble learning, were compared to demonstrate the effectiveness of the cardiac dysrhythmia risk prediction (
Model performance.
Model | AUROC^a (95% CI) |
Ensemble learning | 0.827 (0.824-0.830) |
Least absolute shrinkage and selection operator | 0.819 (0.816-0.822) |
Extreme gradient boosting | 0.808 (0.805-0.811) |
Feed-forward neural network | 0.807 (0.804-0.810) |
Boosting | 0.775 (0.771-0.778) |
Random forest | 0.695 (0.691-0.699) |
k-nearest neighbor | 0.631 (0.627-0.635) |
Naïve Bayes | 0.611 (0.607-0.614) |
^a AUROC: area under the receiver operating characteristic curve.
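For readers unfamiliar with the metric, the AUROC values in the table can be read as the probability that a randomly chosen individual with incident arrhythmia receives a higher score than a randomly chosen individual without it. The sketch below computes it with the rank-sum (Mann-Whitney) identity. It is an illustration only: the function name is hypothetical, and it does not reproduce the confidence intervals, whose computation method is not described in this section.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum identity; tied scores contribute 0.5.

    scores: model outputs (higher = higher predicted risk)
    labels: 1 for incident arrhythmia, 0 otherwise
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        rank_sum_pos += avg_rank * sum(labels[order[k]] for k in range(i, j + 1))
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one non-case")
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases.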
Patients in the prospective cohort were divided into 5 risk categories (very low, low, medium, high, and very high;
Survival curve analysis was applied to quantify the effectiveness of the 5-risk bin stratification for future cardiac dysrhythmia events within 1 year (
The proportion of patients and their positive predictive values with different prediction scores on the prospective cohort.
Survival curves of the 5 risk groups. HR: hazard ratio.
Age distributions at 5 different risk levels are shown in
Many patients in the high-risk and the very high-risk groups had comorbidity with chronic (cardiovascular disease: 436/650, 67.07%; metabolism disorder: 304/650, 46.77%; type 2 diabetes: 183/650, 28.15%; chronic obstructive pulmonary disease: 159/650, 24.46%; chronic kidney disease: 146/650, 22.46%) or acute diseases (cardiovascular disease: 220/650, 33.85%; syncope and collapse: 28/650, 4.31%; dizziness and giddiness: 47/650, 7.23%; pain in throat and chest: 81/650, 12.46%; breathing abnormalities: 151/650, 23.23%;
Socioeconomic features reflected the social disparities of individuals’ living environments and living conditions. The Spearman rank method was used to study the correlation between socioeconomic factors and arrhythmia (
In this study, we developed a case-finding tool to identify individuals in the general population at risk of future cardiac dysrhythmia events within 1 year, using EHR data sets aggregated by the Maine statewide health information exchange. The predictive model was trained retrospectively (AUROC 0.854) and validated prospectively (AUROC 0.827). Our model was capable of prospectively stratifying the general population into 5 risk bins (very low, low, medium, high, and very high) of incident arrhythmia. This stratification can support targeted care plans for managing patients in different risk categories.
Our case-finding tool is different from previous efforts in terms of the targeted population and predictive timeframe. Other models [
The high-risk or very high-risk bin individuals in a prospective cohort are likely to have higher disease burdens, given the confirmative diagnosis of multiple chronic diseases or acute disease events as well as other major medical histories (
Patients' average clinical costs in the past 12 months with respect to the average number of chronic diseases; 17 common diseases are presented in the low- and very low–risk group (green) and the high- and very high–risk group (red). CKD: chronic kidney disease; COPD: chronic obstructive pulmonary disease; CVD: cardiovascular disease.
Studies have shown that certain social factors were indirect causes of arrhythmia. The predisposing factors of arrhythmia may involve various aspects of the psychosocial environment related to social status [
Our study established a model for predicting the probability of arrhythmia within the subsequent year. By continuously tracking the influencing factors, the accuracy and applicability of the predictions could be further improved. The predictive weights of the different kinds of factors can provide insight into the mechanisms by which arrhythmia develops, the analysis of predisposing factors, and research on preventive measures.
Our model will benefit physicians and health care organizations as well as patients. Model predictions and risk scores can be used as an auxiliary tool for physicians during diagnosis and as a reference for treatment planning. Stratifying the patient population by risk also contributes to medical budget planning and targeted intervention. In the very high-risk category, 29.63% (8/27) of patients were diagnosed with arrhythmia within the first 4 months of the subsequent year, and this proportion gradually increased to 51.85% (14/27) within 1 year. Therefore, for those in the very high-risk group, it is necessary to formulate appropriate personalized intervention programs according to medical history, health status, living environment, and other conditions to prevent or delay the development of arrhythmia.
Given that our study showed that individuals at high risk of developing arrhythmias often have multiple chronic conditions, such as cardiovascular disease, metabolism disorders, type 2 diabetes, chronic obstructive pulmonary disease, and chronic kidney disease, aggressive interventions, including early routine testing and treatment of related chronic diseases, are needed for those patients. Arrhythmias after surgery are common and can lead to serious complications [
Our research has some limitations that could be addressed in future work. First, some information was missing from our data set. Lifestyle information (such as eating habits and amount of daily exercise) was not fully documented in the EHR data. Second, when there are too many missing variables, data preprocessing with the k-nearest neighbor method may introduce bias, resulting in inaccurate estimates. Third, arrhythmia is very common and can be triggered under many circumstances, and some occurrences are not dangerous. Although arrhythmia was defined in detail in our model, it was only stratified by the probability of its occurrence, not by its severity. If the patient population can be further subdivided by arrhythmia severity, more accurate reference information for arrhythmia diagnosis and intervention can be provided.
A risk prediction model of 1-year incidence of cardiac dysrhythmia was developed and prospectively validated using EHR data from 1.5 million people in Maine. The model was able to classify patients according to the predicted scores. The model had a good predictive performance (AUROC 0.827) in a prospective test. Age, gender, cardiovascular disease, chronic kidney disease, chest pain, pleural effusion, and socioeconomic factors were found to be related to new arrhythmia. For patients at high risk, early intervention should be carried out in a timely manner. Patients with low and moderate risk should maintain good living and eating habits, pay more attention to their physical condition, and exercise more often to avoid the occurrence of serious arrhythmias. This prediction model and analysis will ultimately benefit patient families, clinicians, and social health care institutions.
List of social determinant variables detailed in the data source and mapping method.
List of the top 60 important features and their odds ratios in the model.
Receiver operating characteristic curve comparisons of the model performance.
Receiver operating characteristic curves derived from the prospective cohorts.
Performance of the 1-year arrhythmia risk prediction model in the prospective cohort.
Distribution of age and gender in the 5 risk categories.
Distribution of top risk predictors across the 5 risk categories.
Time-to-arrhythmia diagnosis curves of the chronic disease group in the low-risk/very low-risk and high-risk/very high-risk populations of the prospective cohort.
(a) The feature importance score bar chart of health status, laboratory test, and medication. (b) Distributions of health status, laboratory test, and medication in the low-risk/very low-risk and high-risk/very high-risk categories.
Spearman rank correlation between socioeconomic features and prospective arrhythmia risk score.
Patients' average clinical costs and average number of chronic diseases in the past 12 months in the low-risk/very low-risk and the high-risk/very high-risk subgroups. The points presented 17 common diseases.
AUROC: area under the receiver operating characteristic curve
EHR: electronic health record
ICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification
LASSO: least absolute shrinkage and selection operator
PPV: positive predictive value
XGBoost: extreme gradient boosting
The authors gratefully acknowledge all the members involved in the project and the support of the research group in the Department of Surgery, Stanford University.
None declared.