Evaluating the Representativeness of US Centricity Electronic Medical Records With Reports From the Centers for Disease Control and Prevention: Comparative Study on Office Visits and Cardiometabolic Conditions

Background Electronic medical record (EMR)–based clinical and epidemiological research has dramatically increased over the last decade, although establishing the generalizability of such big databases for conducting epidemiological studies has been an ongoing challenge. To draw meaningful inferences from such studies, it is essential to fully understand the characteristics of the underlying population and potential biases in EMRs. Objective This study aimed to assess the generalizability and representativity of the widely used US Centricity Electronic Medical Record (CEMR), a primary and ambulatory care EMR for population health research, using data from the National Ambulatory Medical Care Surveys (NAMCS) and the National Health and Nutrition Examination Surveys (NHANES). Methods The number of office visits reported in the NAMCS, designed to meet the need for objective and reliable information about the provision and the use of ambulatory medical care services, was compared with similar data from the CEMR. The distribution of major cardiometabolic diseases in the NHANES, designed to assess the health and nutritional status of adults and children in the United States, was compared with similar data from the CEMR. Results Gender and ethnicity distributions were similar between the NAMCS and the CEMR. Younger patients (aged <15 years) were underrepresented in the CEMR compared with the NAMCS. The number of office visits per 100 persons per year was similar: 277.9 (95% CI 259.3-296.5) in the NAMCS and 284.6 (95% CI 284.4-284.7) in the CEMR. However, the number of visits for males was significantly higher in the CEMR (CEMR: 270.8 and NAMCS: 239.0). West and South regions were underrepresented and overrepresented, respectively, in the CEMR. 
The overall prevalence of diabetes along with age and gender distribution was similar in the CEMR and the NHANES: overall prevalence, 10.1% and 9.7%; male, 11.5% and 10.8%; female, 9.1% and 8.8%; age 20 to 40 years, 2.5% and 1.8%; and age 40 to 60 years, 9.4% and 11.1%, respectively. The prevalence of obesity was similar: 42.1% and 39.6%, with similar age and female distribution (41.5% and 41.1%) but different male distribution (42.7% and 37.9%). The overall prevalence of high cholesterol along with age and female distribution was similar in the CEMR and the NHANES: overall prevalence, 12.4% and 12.4%; and female, 14.8% and 13.2%, respectively. The overall prevalence of hypertension was significantly higher in the CEMR (33.5%) than in the NHANES (95% CI: 27.0%-31.0%). Conclusions The distribution of major cardiometabolic diseases in the CEMR is comparable with the national survey results. The CEMR represents the general US population well in terms of office visits and major chronic conditions; however, age and gender distributions and prevalence may differ in some subgroups, and these potential differences should be carefully accounted for in future studies.


Background
Large national surveys and registry data provide epidemiological and population-level health information. Although such studies will remain the gold standard for evaluating health at the population level, the more recent development of large real-world data (RWD) from electronic medical records (EMRs) and claims data for therapeutic management and population-level safety evaluations provides additional and unique opportunities to expand our understanding of a broad class of clinical, epidemiological, and public health-related questions [1-6].
EMR data are collected during routine medical care, offering the opportunity to investigate clinical questions from a real-world perspective. Although randomized clinical trials (RCTs) allow the evaluation of the safety and efficacy of interventions in a design-led population, the EMR-based studies allow for comparative effectiveness and safety studies, apart from revolutionizing the approach to efficient pharmacovigilance. RWD-based studies also provide opportunities to explore clinical questions in populations that are often excluded from RCTs, such as pregnant, older, or comorbid patients. Furthermore, real-world studies allow us to investigate questions that may be unethical for testing in RCTs. EMRs are also used to track how clinical guidelines are implemented in real-world practices and to research the quality of clinical care.
The epidemiological value of EMR-based research directly depends on the size of the EMR network. Several EMR systems have been implemented at the national level; the most familiar examples include databases from the United States, the United Kingdom, Sweden, Norway, and Denmark [7-10]. The representativeness of some of these databases in terms of demographics and chronic and rare diseases has been shown in some studies [8,9,11-14].
Apart from health research based on data from individual practices, pharmacies, insurers, claims, or prescriptions, the MarketScan Commercial Claims and Encounters Database, owned by Truven Health Analytics, is one of the most commonly used data sources for health research in the United States [15,16]. The Veterans Affairs integrated health care system is another widely used data source in the United States [17,18]. One of the oldest primary and ambulatory EMR systems in the United States is the Centricity Electronic Medical Record (CEMR), owned by General Electric, which provides an opportunity for research using deidentified data on more than 45 million patients from all states of the United States [19,20].
To the best of our knowledge, only two studies have investigated the representativeness of the CEMR database [14,45]. Brixner et al [14]

Aims
Given the significant increase in CEMR coverage since the last report was published and exponentially increasing volume of RWD-based research, we aimed to repeat and expand the exploration of the generalizability and representativity of the CEMR database with two of the most widely used and relevant survey results from the United States. Specifically, the goals of this study were to compare (1) patient demographics in the CEMR with the NAMCS and (2) the prevalence of obesity, hypertension, high total cholesterol, and diabetes in the CEMR with the respective reports based on the National Health and Nutrition Examination Surveys (NHANES).

Centricity Electronic Medical Record
The CEMR incorporates patient-level data from independent physician practices, academic medical centers, hospitals, and large integrated delivery networks in the United States. The Medical Quality Improvement Consortium is a rapidly growing community that contributes deidentified clinical data to the CEMR research database to enable quality improvement, benchmarking, and population-based medical research [40,46]. With an average follow-up of 4.5 years, the CEMR research database covers more than 35,000 health care providers from all states of the United States, where approximately 70% are primary care providers. Longitudinal EMRs were available for more than 45 million individuals from 1995 to September 2018, with comprehensive patient-level information on demographics, anthropometric measures, disease events, medications, and clinical and laboratory measures. The database has been extensively used in academic research [14,44,45].

The National Ambulatory Medical Care Survey
A report based on the 2016 NAMCS data was used in this study [47]. The excerpts from the Centers for Disease Control and Prevention (CDC) website are presented in the following two paragraphs [47,48].
The NAMCS is a national survey designed to meet the need for objective, reliable information about the provision and the use of ambulatory medical care services in the United States. The findings were based on a sample of visits to nonfederally employed office-based physicians who are primarily engaged in direct patient care. Physicians specializing in anesthesiology, pathology, and radiology were excluded from the survey. Each physician was randomly assigned to a 1-week reporting period. During this period, data for a systematic random sample of visits were recorded by Census interviewers using an automated patient record form (PRF) developed for that purpose.
The 2016 NAMCS sampling design used a stratified two-stage sample, with physicians selected in the first stage and visits in the second stage. The 2016 NAMCS sample included 3699 physicians. Of the 2080 in-scope (eligible) physicians, 677 completed PRFs in the study. Of the 677 physicians who completed PRFs, 536 participated fully or adequately (ie, at least one half of the expected PRFs were submitted, based on the total number of visits during the reporting week) and 141 participated minimally (ie, fewer than half of the expected number of PRFs were submitted). Within physician practices, data were abstracted from medical records for up to 30 sampled visits during a randomly assigned 1-week reporting period. In total, 13,165 PRFs were submitted. The participation rate (the percentage of in-scope physicians for whom at least one PRF was completed) was 39.3%. The response rate (the percentage of in-scope physicians for whom at least one half of their expected number of PRFs was completed) was 32.7%. Among the 4 census regions, response rates ranged from 24.6% to 40.0%.

The National Health and Nutrition Examination Survey
The NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States [49]. The survey consists of interviews conducted in participants' homes, standardized physical examinations in mobile examination centers, and laboratory tests on blood and other specimens.
Individual reports produced by the CDC based on the NHANES 2015-2016 data were used to compare the prevalence of obesity [50], hypertension [51], and high total cholesterol [52]. The latest CDC report for diabetes prevalence was based on the NHANES 2013-2016 data [53].

Methods
The methods used to derive the CEMR estimates were matched to those of the individual CDC reports as closely as possible.

Office Visits
Data on the percent distribution of office visits and the number of office visits per 100 patients by various subgroups from the NAMCS were compared with similar data from the CEMR. All office visits in 2016 for patients with nonmissing age and sex from the CEMR were aggregated to match the NAMCS report as closely as possible.

Obesity
In the NHANES, obesity was defined as a BMI of 30 kg/m² or greater for adults aged 20 years and older [50].
In the CEMR, the proportion of obese people was estimated among people aged older than 20 years and with at least one BMI measure (direct or estimated using weight and height) during the years 2015-2016. Women who had pregnancy-related records before or within the estimated time frame were excluded.
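The classification described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function and parameter names are hypothetical, and the preference for a directly recorded BMI over one estimated from weight and height is an assumption, because the text does not state which takes priority.

```python
from typing import Optional

OBESITY_BMI_THRESHOLD = 30.0  # kg/m^2, the NHANES definition cited in the text


def bmi_from_measures(weight_kg: float, height_m: float) -> float:
    """Estimate BMI from weight (kg) and height (m)."""
    return weight_kg / height_m ** 2


def is_obese(
    bmi_direct: Optional[float],
    weight_kg: Optional[float],
    height_m: Optional[float],
) -> Optional[bool]:
    """Return True/False for obesity, or None when no BMI can be derived.

    Assumption: a directly recorded BMI is used when present; otherwise
    BMI is estimated from weight and height.
    """
    if bmi_direct is not None:
        bmi = bmi_direct
    elif weight_kg is not None and height_m:
        bmi = bmi_from_measures(weight_kg, height_m)
    else:
        return None  # patient lacks a usable BMI measure
    return bmi >= OBESITY_BMI_THRESHOLD
```

Patients for whom the function returns None would not enter the denominator, mirroring the requirement of at least one BMI measure.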

Hypertension
In the NHANES, hypertension was defined for people aged 18 years and older as systolic blood pressure (SBP) of 140 mm Hg or greater, diastolic blood pressure (DBP) of 90 mm Hg or greater, or current use of medication to lower high blood pressure [51].
In the CEMR, the proportion of patients with hypertension during the years 2015-2016 was estimated among people aged older than 18 years. On average, patients had 4 blood pressure measures during the 2-year time frame. Patients were considered to have hypertension if the average of their available SBP measures was 140 mm Hg or greater, if the average of their available DBP measures was 90 mm Hg or greater, or if they were taking medication to lower high blood pressure during the respective time frame. Blood pressure-lowering medications included diuretics, peripheral vasodilators, beta blockers, calcium channel blockers, angiotensin-converting enzyme inhibitors, angiotensin II receptor blockers, and other agents acting on the renin-angiotensin system. Within these drug classes, only medications indicated to lower blood pressure were retained.
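The CEMR hypertension rule can be expressed compactly. The sketch below assumes that the blood pressure readings for the 2-year window and a medication flag have already been extracted for each patient; the function and parameter names are hypothetical and are not taken from the study.

```python
from statistics import mean
from typing import Sequence

SBP_THRESHOLD = 140  # mm Hg
DBP_THRESHOLD = 90   # mm Hg


def has_hypertension(
    sbp_readings: Sequence[float],
    dbp_readings: Sequence[float],
    on_bp_medication: bool,
) -> bool:
    """Apply the three criteria: mean SBP, mean DBP, or medication use."""
    # An empty list of readings cannot satisfy a blood pressure criterion.
    high_sbp = bool(sbp_readings) and mean(sbp_readings) >= SBP_THRESHOLD
    high_dbp = bool(dbp_readings) and mean(dbp_readings) >= DBP_THRESHOLD
    return high_sbp or high_dbp or on_bp_medication
```

Averaging the available readings, rather than flagging any single elevated value, makes the rule less sensitive to one-off high measurements.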

High Total Cholesterol
In the NHANES, proportions of participants aged 20 years and older with high total cholesterol (≥240 mg/dL) were reported [52]. In the CEMR, among people aged older than 20 years and with at least one available cholesterol measure, the proportions of those with total cholesterol of 240 mg/dL or greater were estimated during the years 2015-2016.

Diabetes
In the NHANES, participants were classified as having diagnosed diabetes if they answered "yes" to the question, "Other than during pregnancy, have you ever been told by a doctor or health professional that you have diabetes or sugar diabetes?" [53]. Participants were classified as having undiagnosed diabetes if they did not report a diagnosis of diabetes by a health care provider and their fasting (8-24 hours) plasma glucose level was 126 mg/dL or greater or their hemoglobin A1c (HbA1c) level was 6.5% or greater. Participants were randomly assigned to a morning, afternoon, or evening examination. Fasting plasma glucose data from the morning examination (after an 8- to 24-hour fast) were used to define total and undiagnosed diabetes.
In the CEMR, an algorithm to identify patients with diabetes was developed on the basis of (1) diabetes diagnostic codes (International Classification of Diseases and SNOMED), (2) antidiabetic medication prescription patterns, (3) availability of 2 measurements of HbA1c level of 6.5% or greater or fasting blood glucose level of 126 mg/dL or greater or random blood glucose level of 200 mg/dL or greater within 1 year, and (4) keyword searching procedures for diabetes-related terms from the clinical notes of every patient. The algorithm was developed on the basis of clinical guidelines and machine learning suggestions described by Adjah et al [54] for a database from the United Kingdom. Patients who were prescribed metformin for polycystic ovary syndrome were detected and excluded. When the diabetes subtype was not definite, a patient's age and insulin and noninsulin prescription patterns were used to distinguish subtypes. The off-label use of antidiabetic drugs was not explored. For analyses in this study, patients with prediabetes and gestational diabetes were excluded. The proportion of patients with coded and noncoded diabetes was estimated among adults aged 20 years and older who were active in the CEMR during the years 2013-2016.
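As an illustration of the laboratory component only (criterion 3), the following sketch checks whether a patient has 2 elevated measurements within 1 year. It is not the algorithm of Adjah et al [54]. Several details are assumptions made for the example: the test labels are hypothetical, "within 1 year" is taken to mean at most 365 days apart, and the 2 elevated measurements are allowed to come from different test types.

```python
from datetime import date
from typing import List, Tuple

# Thresholds from the text: HbA1c in %, glucose in mg/dL.
THRESHOLDS = {
    "hba1c": 6.5,
    "fasting_glucose": 126.0,
    "random_glucose": 200.0,
}


def meets_lab_criterion(labs: List[Tuple[date, str, float]]) -> bool:
    """Return True if 2 elevated measurements fall within 365 days."""
    elevated_dates = sorted(
        when
        for when, test, value in labs
        if test in THRESHOLDS and value >= THRESHOLDS[test]
    )
    # If any two elevated dates are within 365 days of each other, then
    # some pair of neighbours in sorted order is too, so checking
    # consecutive pairs is sufficient.
    return any(
        (later - earlier).days <= 365
        for earlier, later in zip(elevated_dates, elevated_dates[1:])
    )
```

In a full pipeline this flag would be only one input, combined with diagnostic codes, prescription patterns, and keyword matches in clinical notes.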

Statistical Methods
Proportional distributions between the CEMR and the NAMCS and NHANES were compared using the chi-square test, where appropriate. Office visit estimates in the NAMCS report are based on sample data weighted to produce annual national estimates and include SEs. All estimates in the NHANES reports were age adjusted using the 2000 US Census population. For the NAMCS and the NHANES estimates, 95% CIs were calculated using the available SE estimates from the reports.
Crude estimates from the CEMR data were calculated and presented in this study. The 95% CIs for percentages were calculated under a binomial distribution assumption, and the 95% CIs for the number of visits per 100 persons per year were calculated assuming a Poisson distribution.
Statistical equivalence for the pairwise comparisons of proportional distributions, where appropriate, was evaluated using the two one-sided tests (TOST) procedure for equivalence [55,56] with ±2.5, ±5, and ±7.5 percentage point equivalence margins. Using population summary statistics on mean, SD, and total number of office visits, the TOST procedure available in SAS 9.4 (SAS Institute Inc) was employed.
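The three kinds of interval described in this section can be sketched as follows. The normal approximation is used throughout as a simplifying assumption; the text does not state whether exact or approximate binomial and Poisson intervals were computed. All function names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% standard normal quantile


def ci_from_se(estimate: float, se: float) -> tuple:
    """95% CI from a published estimate and its SE (survey reports)."""
    half = Z_95 * se
    return estimate - half, estimate + half


def binomial_ci_percent(cases: int, n: int) -> tuple:
    """Normal-approximation 95% CI for a crude percentage."""
    p = cases / n
    half = Z_95 * math.sqrt(p * (1 - p) / n)
    return 100 * (p - half), 100 * (p + half)


def poisson_rate_ci_per_100(visits: int, persons: int) -> tuple:
    """Normal-approximation 95% CI for visits per 100 persons,
    treating the visit count as Poisson (variance equals the count)."""
    rate = 100 * visits / persons
    half = 100 * Z_95 * math.sqrt(visits) / persons
    return rate - half, rate + half
```

With a hypothetical denominator of 10 million patients and a prevalence of 33.5%, `binomial_ci_percent(3_350_000, 10_000_000)` has a half-width of about 0.03 percentage points, so the interval rounds to the point estimate. This is consistent with the very narrow CEMR intervals reported in this study.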
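To make the logic of the equivalence test concrete, the sketch below implements a normal-approximation TOST for the difference between two proportions. It is not the SAS procedure used in this study. The function name, the unpooled standard error, and the sample sizes in the usage example are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def tost_two_proportions(p1, n1, p2, n2, margin, alpha=0.05):
    """Two one-sided z tests for equivalence of two proportions.

    margin is on the proportion scale (0.025 = 2.5 percentage points).
    Returns (p_value, equivalent), where p_value is the larger of the
    two one-sided p values.
    """
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    norm = NormalDist()
    # Test 1: H0 diff <= -margin versus H1 diff > -margin
    p_lower = 1 - norm.cdf((diff + margin) / se)
    # Test 2: H0 diff >= +margin versus H1 diff < +margin
    p_upper = norm.cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical example: 30% versus 31%, 5000 persons in each source.
for margin in (0.005, 0.05):
    p_value, equivalent = tost_two_proportions(0.30, 5000, 0.31, 5000, margin)
    print(f"margin ±{100 * margin:g} points: p={p_value:.4f}, equivalent={equivalent}")
```

Equivalence is declared only when both one-sided tests reject, that is, when the difference is shown to lie inside the margin. With very large samples, a difference can therefore be statistically significant and still equivalent within a chosen margin.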

Office Visits
The

Hypertension
During the years 2015-2016, hypertension prevalence in adults was higher in the CEMR than in the NHANES: 33.5% (95% CI 33.5-33.5) versus 29.0% (95% CI 27.0-31.0; Table 2; not equivalent at 2.5 percentage point margin and equivalent at 5 percentage point margin). However, in the CEMR, the prevalence was significantly lower for older patients (aged 60+ years), not equivalent at 7.5 percentage point margin.

High Total Cholesterol
The proportions of adults with high total cholesterol were similar across the NHANES and the CEMR, with a total of 12.4% (Table 2). Although the 95% CI of the proportion of males with high total cholesterol indicated a significant difference between the NHANES (95% CI 9.6%-13.2%) and the CEMR (95% CI 9.3%-9.3%), the proportions appeared to be equivalent at 2.5 percentage point margin based on the TOST.

Diabetes
In the NHANES, the proportion of adults with diabetes during the years 2013-2016 was reported to be 14.0% (95% CI 12.8-15.2), comprising 9.7% (95% CI 8.7%-10.7%) with diagnosed diabetes and 4.3% (95% CI 3.5%-5.1%) with undiagnosed diabetes (Table 3). In the CEMR, 10.1% of adults were estimated to have diabetes: 8.1% of adults had a diagnostic code, and 2% of adults were identified without one (false negatives of the coding). When total diabetes estimates in the CEMR were compared with diagnosed diabetes in the NHANES, the total and gender-specific prevalence was similar in both data sources (equivalent at 2.5 percentage point margin), and there were fewer seniors (aged 60+ years) with estimated diabetes in the CEMR compared with the NHANES: 16.4% (95% CI 16.4-16.4) versus 21.0% (95% CI 18.5-23.5; not equivalent at 2.5 percentage point margin and equivalent at 5 percentage point margin).

Principal Findings
In this study, we compared the CEMR ambulatory and primary care database with federal reports based on the NAMCS and the NHANES. Although the CEMR and the CDC reports cannot be compared directly because of differences in the nature of data collection and the methodologies applied, we observed that the CEMR is a good source for population health research with regard to cardiometabolic conditions. Specifically, we observed that (1) on average, there were approximately 3 office visits per patient per year in both the NAMCS and the CEMR; (2) the distribution of age at office visits in the CEMR is biased toward the older population; (3) although the proportional shares of all visits by males and females were similar, females had more visits and males had fewer visits in the NAMCS than in the CEMR; (4) the distribution of office visits was similar for ethnic groups; (5) the West region is underrepresented and the South region is overrepresented in the CEMR compared with the NAMCS; (6) compared with the CDC reports based on the NHANES data, the prevalence of obesity and high total cholesterol is similar in the CEMR, whereas hypertension prevalence is approximately 5 percentage points higher; and (7) [45] reported that the CEMR had higher proportions of visits by younger patients and by females, compared with the NAMCS. Maintaining similar methods, we observed a reversed trend in age and no difference in the distribution of visits by gender. The overall prevalence in adults with obesity, high cholesterol, and diabetes was similar between the CEMR and the NHANES reports, and the proportion of those with hypertension was higher in the CEMR than in the NHANES. A possible explanation for this result is the ability to track longitudinal information in EMRs, which is especially prominent in chronic conditions. This feature of EMRs provides exceptional opportunities in terms of extending and modifying the classical epidemiologic theory [57].
Closed systems in the West, such as Kaiser Permanente [58], may explain the finding of lower rates for office visits in the West region in the CEMR. Whereas the CDC reports were adjusted and weighted with US Census data, the crude CEMR prevalence estimates of chronic conditions reported in this study were not and may therefore be biased by geographic region. This issue should be carefully addressed in future population health research based on the CEMR data.

Strengths and Limitations
As with any survey, the results in the NAMCS and the NHANES are subject to sampling and nonsampling errors. Nonsampling errors include reporting and processing errors as well as biases because of nonresponse and incomplete response. In the NAMCS, ethnicity data were missing for 25% of visits [47,48], whereas in the CEMR, unknown race accounted for only 9% of visits.
It is important to highlight that the population that seeks medical care differs systematically from the general population, and healthy individuals will be underrepresented in any EMR database. For this reason, rather than comparing the CEMR with US Census data, we compared the demographics at the time of office visits in the CEMR with the NAMCS, which is carefully weighted and adjusted for the US population. Because the CEMR research database draws on the subset of CEMR users participating in the Medical Quality Improvement Consortium, albeit a significant one, it is biased toward these practices. The participation and response rates in the NAMCS of less than 40% also introduce a selection bias, although the NAMCS estimates were corrected [47,48].
The limitations of this study include the nonavailability of provider specialty and insurance data in the CEMR for deeper comparisons with the NAMCS. Certain specialties might adopt EMRs at a slower rate, and insured individuals might be overrepresented in commercial databases, compared with the general population. Owing to large cohort sizes in the CEMR, the reported CIs are very narrow, and we believe that differences judged significant on the basis of these CIs do not necessarily reflect meaningful differences. Adopting TOST with 2.5, 5.0, and 7.5 percentage point equivalence margins for data comparison provides a complementary view of the equivalence of the data sources. As mentioned earlier, CEMR and survey methods are not directly comparable, and we have done our best to match the data as closely as possible. The cardiometabolic prevalence estimates should be interpreted carefully, in light of methodological and regional differences. However, we believe that the results of this study demonstrate the ability of the CEMR to reflect population health quite well.

Conclusions
To conclude, epidemiological and population health findings based on the CEMR database might reflect trends in the general US population; however, the possible region, age, and gender biases presented in this study should be treated and interpreted carefully.