This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Health data collected during routine care have important potential for reuse for other purposes, especially as part of a learning health system to advance the quality of care. Many sources of bias have been identified through the lifecycle of health data that could compromise the scientific integrity of these data. New data protection legislation requires research facilities to improve safety measures and, thus, ensure privacy.
This study addresses how health data from various sources, recorded in multiple systems, can be transferred to a centralized platform called Healthdata.be while ensuring the accuracy, validity, safety, and privacy of the data. In addition, the study demonstrates how these processes can be used in various research designs relevant for learning health systems.
The Healthdata.be platform urges uniformity of the data registration at the primary source through the use of detailed clinical models. Data retrieval and transfer are organized through end-to-end encrypted electronic health channels, and data are encoded using token keys. In addition, patient identifiers are pseudonymized so that health data from the same patient collected across various sources can still be linked without compromising the deidentification.
The Healthdata.be platform currently collects data for >150 clinical registries in Belgium. We demonstrated how the data collection for the Belgian primary care morbidity register INTEGO is organized and how the Healthdata.be platform can be used for a cluster randomized trial.
Collecting health data in various sources and linking these data to a single patient is a promising feature that can potentially address important concerns on the validity and quality of health data. Safe methods of data transfer without compromising privacy are capable of transporting these data from the primary data provider or clinician to a research facility. More research is required to demonstrate that these methods improve the quality of data collection, allowing researchers to rely on electronic health records as a valid source for scientific data.
More than a decade ago, the Institute of Medicine introduced the “learning health system” (LHS) in response to the challenges on how to generate and apply the best evidence to guide health care choices [
Even though the type of data recorded for research and the data stored in EHRs are similar, the use of health data poses some important problems. Concerns regarding the data quality and validity, completeness of data capture, and lack of interoperability have been identified as important barriers to the use of EHRs for clinical research [
Perhaps, one of the most important challenges to the use of health data in clinical research is the persistent divide between clinicians (data providers) and researchers (data scientists) [
In Belgium, >150 clinical registries actively collect health data from multiple sources such as primary care facilities, laboratories, hospitals, and radiology centers. Moreover, multiple information systems or EHRs are available for each of these sources. For example, for primary care practices alone, at least 8 different EHRs are available. In 2012, the Scientific Institute of Public Health was charged with centralizing and improving these clinical registries as part of the national electronic health (eHealth) action plan in a new platform named Healthdata.be. The challenge was to develop a system that integrates data from diverse sources, recorded by clinicians in multiple systems during routine care, while ensuring the accuracy, validity, safety, and privacy of the data. This study addresses the following questions:
How can health data be transferred from various original sources of entry to a centralized platform for reuse, and what can be done to limit sources of bias?
How can health data within the LHS be used for various research designs?
This study will describe elements of the Healthdata.be project designed for data extraction, data transfer, and data processing. Subsequently, we will demonstrate how Healthdata.be was used in the INTEGO primary care morbidity registry [
Health data are at the core of both EHRs and clinical research registries. However, to be collected in a meaningful manner, these data must have the same structure, use interoperable terminologies, and be documented using a detailed clinical model (DCM) [
The detailed clinical model for the root concept blood pressure. CD: coded descriptor; PQ: physical quantity; TS: timestamp; ST: string (free) text. Source: https://www.healthdata.be/doc/cbb/index.php5/Be.en.hd.BloodPressure.
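To make the data types in the caption concrete, the following is a minimal sketch of how a DCM-style blood pressure record could be represented in code. The class and field names (`BloodPressure`, `cuff_position`, and so on) are assumptions for illustration and are not taken from the published Healthdata.be model; only the type abbreviations (CD, PQ, TS, ST) come from the caption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PhysicalQuantity:
    """PQ: a numeric value together with its unit."""
    value: float
    unit: str


@dataclass(frozen=True)
class CodedDescriptor:
    """CD: a code drawn from a named terminology."""
    system: str  # hypothetical terminology label
    code: str


@dataclass(frozen=True)
class BloodPressure:
    """Illustrative record; field names are assumptions, not the published DCM."""
    systolic: PhysicalQuantity                        # PQ
    diastolic: PhysicalQuantity                       # PQ
    measured_at: datetime                             # TS
    cuff_position: Optional[CodedDescriptor] = None   # CD (hypothetical attribute)
    comment: Optional[str] = None                     # ST

    def __post_init__(self) -> None:
        # A shared structure lets every source be checked against the same rule.
        for pq in (self.systolic, self.diastolic):
            if pq.unit != "mm[Hg]":
                raise ValueError(f"unexpected unit: {pq.unit}")


reading = BloodPressure(
    systolic=PhysicalQuantity(128.0, "mm[Hg]"),
    diastolic=PhysicalQuantity(82.0, "mm[Hg]"),
    measured_at=datetime(2018, 4, 17, 9, 30),
    comment="seated, after 5 minutes of rest",
)
```

Because every data provider would fill the same typed fields, a registry can compare readings across EHRs without per-system parsing.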
The content standardization of scientific data collections, using a DCM, contributes to the enhanced data quality and correct interpretation of data for research [
Systems used by data providers are being urged to comply with these DCMs, and these elements are being included in local certification standards. When data providers or researchers are confronted with a concept for which no DCM exists, they can apply for the development of one to enable the automated provisioning of registries.
When shaping the principles of LHSs, the Institute of Medicine reiterated the need to reflect on the burden data collection can be on health care professionals and the importance of limiting this burden to the issues most important to patient care and knowledge generation [
Data capture, data transfer, data encryption, and data reception through the Healthdata.be platform. HD4DP: Healthdata for data providers; HD4RES: Healthdata for research. Source: https://healthdata.wiv-isp.be/en/services.
The transfer of sensitive health data is technically challenging and raises safety and privacy concerns. In Belgium, data transfer between health care professionals is organized through existing eHealth channels using end-to-end (E2E) encryption [
An important feature in enabling the linkage of health data extracted from different systems or settings is the ability to identify data from the same patient. When health data are sent from one health care provider to another in the context of clinical care, the content of the message is encrypted, but the identity of the patient remains known. For research purposes, however, the content of the message is encrypted and the identity of the patient must also be blinded; this poses an important challenge when health data for a single patient are collected across sources. To enable this linkage without unblinding the coded data, an extra step is introduced in the data transfer. Where masking of identifiers is required, the national eHealth services act as a trusted third party and use an algorithm to pseudonymize this data element [
Illustration of the data encryption, coding and decryption steps; SSIN: social security identification number; HD4DP: Healthdata for data providers; HD4RES: Healthdata for researchers; CSV: comma separated value; ETK: eHealth token key; E2E: end-to-end. Source: https://healthdata.wiv-isp.be/en/services.
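The linkage property described above can be illustrated with a short sketch. The algorithm used by the eHealth services is not specified here, so the example substitutes a generic keyed hash (HMAC-SHA256) as a stand-in; the key, the function name `pseudonymize`, and the identifiers are all made up. A keyed hash is one-way, so a trusted third party that must later decode a pseudonym would need a reversible scheme or a stored mapping instead.

```python
import hashlib
import hmac


def pseudonymize(ssin: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier (keyed hash, HMAC-SHA256)."""
    # Normalise formatting so the same number always yields the same pseudonym.
    digits = "".join(ch for ch in ssin if ch.isdigit())
    return hmac.new(secret_key, digits.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, assumed to be held only by the trusted third party.
TTP_KEY = b"example-key-held-by-trusted-third-party"

# The same made-up identifier arriving from two different sources.
from_gp_record = pseudonymize("00.01.01-000.01", TTP_KEY)
from_lab_record = pseudonymize("00010100001", TTP_KEY)

# A different made-up identifier.
other_patient = pseudonymize("00010100002", TTP_KEY)
```

Records carrying `from_gp_record` and `from_lab_record` can be joined on the pseudonym, while the researcher never sees the underlying identifier.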
Healthdata.be can process the data collection for very diverse specialties or research facilities in health care. To ensure that the requested data are in accordance with the research question or aim of the project, each project is screened thoroughly. Each project submits a research protocol, including a list of the specific data variables to be collected. An internal steering committee, an ethics committee associated with a research center, and the National Privacy Commission’s Sector Committee for eHealth review this submission. Data collection can commence only after all of these bodies have approved the project.
The Healthdata.be platform not only enables safe data transfer but also provides a secure environment for data handling and data analysis for research purposes. Coded data are received by the HD4RES (Healthdata for research) service, which shows the data as sent by the data provider. The interface of the HD4RES is almost identical to that of the HD4DP, except that identification details are coded. Upon arrival in the HD4RES, the data are not yet stored in the datawarehouse (DWH) of Healthdata.be. The DWH has 3 separate entities (the validation environment, the analysis environment, and the reporting environment) and uses SAS Enterprise Guide (SAS Institute Inc) to visualize and process the data. Incoming data are first stored in a validation table where data quality is controlled. Healthdata.be allows for semiautomated processes so that the validation of continuous data capture can be operationalized. Once validated, data are promoted to the analysis environment of Healthdata.be. Access to the HD4RES and the separate environments of the DWH is secured through 2-factor authentication and can be restricted depending on the needs of the researcher. Furthermore, data processing and reporting can be operationalized to accommodate a continuous data flow in ongoing registers.
A pitfall of accepting data from various sources is the possibility of missing or erroneous data. Erroneous data can be prevented by restricting the permitted options, ranges, or syntaxes for the data transferred through the HD4DP. For example, validation rules that detect out-of-range data, missing data, or alphanumeric results for a numeric value can already prevent the transfer of these errors at the site of the data provider. However, an aberrant value that needs correction may still reach the HD4RES. To allow for this correction by the data provider, a feedback loop has been designed. This feedback loop uses the same channels and encryption methods for data transfer and includes decoding of the SSIN by the trusted third party of eHealth so that the primary data provider can identify the person for whom a corrected data variable is requested.
The illustration of the feedback loop in case of missing or erroneous data using the eHealth services. HD4DP: Healthdata for data providers; CSV: comma separated value; HD4RES: Healthdata for researchers. Source: https://healthdata.wiv-isp.be/en/services.
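The three kinds of validation rule mentioned above (missing data, alphanumeric results for a numeric value, and out-of-range data) can be sketched as a single check. This is an illustration under stated assumptions rather than the HD4DP implementation: the function name `validate_numeric`, the error labels, and the range bounds are hypothetical, and the bounds are not clinical guidance.

```python
from typing import Optional


def validate_numeric(raw: Optional[str], low: float, high: float) -> Optional[str]:
    """Return an error label for a raw field, or None when the value passes."""
    if raw is None or raw.strip() == "":
        return "missing"
    try:
        value = float(raw)
    except ValueError:
        return "not numeric"
    if not (low <= value <= high):
        return "out of range"
    return None


# Illustrative bounds for a systolic blood pressure field (assumption).
SYSTOLIC_RANGE = (40.0, 300.0)

# "12O" contains the letter O instead of a zero; "1280" is a likely typing error.
results = {
    raw: validate_numeric(raw, *SYSTOLIC_RANGE)
    for raw in ["128", "", "12O", "1280"]
}
```

Values flagged at this stage can be corrected by the data provider before transfer; anything that still slips through would be handled by the feedback loop.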
INTEGO is a primary care morbidity registry, which was founded over 20 years ago [
INTEGO does not require any data collection beyond what is recorded for daily clinical practice. However, the collected data are expected to be of high quality. The data transferred to INTEGO not only include basic concepts, such as diagnoses or problems, procedures, prescriptions, laboratory tests, parameters, or vital signs and personal information, but also include intricate attributes such as longitudinal care for the same problem (problem-oriented medical registration), causal relationships between diagnoses and prescriptions, or the evolution of a health issue from a symptom into a diagnosis over time. Although many aspects of this registry were already described in existing DCMs, many of these attributes required additional coding and mapping to maintain their meaningfulness. The validity of the recorded data from the original INTEGO database has been studied through comparison with other existing continuous morbidity registries and found to be comparable [
The INTEGO procedures were approved by the KU Leuven Ethics Committee (nr. ML1723) and by the National Privacy Commission’s Sector Committee for eHealth (decision nr. 13.026 of March 19, 2013). The procedures to collect data by Healthdata.be were approved by the Belgian Privacy Commission on April 17, 2018.
To date, almost all GPs have migrated to CareConnect, and the first data export is being prepared and tested. On the one hand, a “core INTEGO” will be constructed, based on the original eligibility criteria to participate in the INTEGO network, to perform epidemiological research. On the other hand, an “extended INTEGO” will be constructed, without eligibility criteria to participate, to perform research on the quality of registration, quality of care, and impact of audit and feedback.
The Electronic Laboratory Medicine ordering with evidence-based Order sets in primary care (ELMO) trial is a practical cluster randomized trial investigating the effects of decision support on the quantity and quality of laboratory test ordering behavior by GPs [
In addition, we collected patient-specific data directly from the GPs. GP investigators used several different EHR software packages for the registration of clinical practice. To ensure uniformity in the data collection, we designed a clinical report form (CRF), detailing the exact information we wished to extract from the EHR and which data would need to be added manually. To facilitate data extraction, we designed the CRF so that >60% of data would be automatically extracted from the EHR, meaning that these data already complied with one or more existing DCMs as defined by Healthdata.be and in use by most EHRs. The CRF was programmed and distributed to all EHRs through an app named Healthdata for Primary Care (HD4PrC), a tool that extracts the requested data directly from the EHR and populates the CRF with these data. Only data requests that could not be mapped to a DCM needed to be added manually. Examples of the requested patient-specific data were diagnoses or problems (including international classification for primary care codes and date of diagnosis), procedures performed or ordered, referrals to specialist care, pre- and posttest probabilities of disease, and diagnostic error. These data were then sent to the Healthdata.be platform through the described eHealth channels.
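The split between automatically extracted and manually added CRF fields can be sketched as follows. This is not the HD4PrC implementation: the function `prefill_crf`, the field names, the DCM path strings, and the choice of which fields have a DCM mapping are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def prefill_crf(
    crf_fields: Dict[str, Optional[str]],
    ehr_record: Dict[str, Any],
) -> Tuple[Dict[str, Any], List[str]]:
    """Fill CRF fields that map to a DCM element; list the rest for manual entry."""
    prefilled: Dict[str, Any] = {}
    manual: List[str] = []
    for field, dcm_path in crf_fields.items():
        if dcm_path is not None and dcm_path in ehr_record:
            prefilled[field] = ehr_record[dcm_path]
        else:
            # No DCM mapping, or the EHR holds no value: ask the GP.
            manual.append(field)
    return prefilled, manual


# Hypothetical CRF definition: field name -> DCM path (None = no DCM exists).
CRF_FIELDS: Dict[str, Optional[str]] = {
    "diagnosis_icpc_code": "problem.code",
    "diagnosis_date": "problem.onset",
    "referral": "referral.target",
    "pretest_probability": None,
    "diagnostic_error": None,
}

# Hypothetical extract from one patient record in an EHR.
ehr_record = {"problem.code": "T90", "problem.onset": "2017-11-02"}

prefilled, manual = prefill_crf(CRF_FIELDS, ehr_record)
```

In this sketch, `manual` holds the fields the GP would still complete by hand, which is the part of the CRF the design aimed to keep small.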
Finally, for a subset of patients, data were obtained directly from patients. A similar CRF was designed, which surveyed patients on data similar to the data requested from investigating GPs. Additional information on socioeconomic status was requested. To ensure uniformity and avoid technical issues, the CRF was not sent directly to participating patients; instead, a research assistant conducted a telephone interview and completed the CRF based on patients’ responses.
All these data were collected in separate SAS datasets on the Healthdata.be platform, which was accessible through a secured server. Access to various parts of the datasets was dependent on the role of the investigator, where data managers had access to the staging datasets, and statisticians had access to the analytics datasets. The chief investigator had access to all datasets and managed the authorizations of the entire team.
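The role-dependent access described above amounts to a small permission table. The sketch below assumes two dataset layers, staging and analytics, as named in the text; the role identifiers and the function `can_access` are hypothetical, and the actual authorization mechanism of the platform is not described here.

```python
from typing import Dict, Set

# Role -> dataset layers that role may read (identifiers are assumptions).
ROLE_ACCESS: Dict[str, Set[str]] = {
    "data_manager": {"staging"},
    "statistician": {"analytics"},
    "chief_investigator": {"staging", "analytics"},
}


def can_access(role: str, dataset_layer: str) -> bool:
    """Return True when the role is authorized for the dataset layer."""
    # Unknown roles get an empty permission set, so access is denied by default.
    return dataset_layer in ROLE_ACCESS.get(role, set())
```

Denying unknown roles by default keeps a misconfigured account from reading either layer.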
Healthdata.be has successfully connected a myriad of data providers on a centralized platform through a secure and private method of interoperable data transfer across settings and systems. This was done by enabling interoperable data collection, encrypted data transfer, and coded data collection, while still allowing data from the same patient collected from multiple sources to be linked through a system of pseudonymization. Healthdata.be has largely been able to bridge the disconnect between clinicians and researchers. In addition, Healthdata.be has been able to centralize >100 clinical registries governed by various research facilities, most of which are continuously collecting new data. A list of current registries being hosted on the Healthdata.be platform is available from the website (www.healthdata.be). Alongside the centralization of clinical registries, the platform can also be used in clinical trials or studies using routinely collected data at the point of care. Additional data that are specific to the trial or study and not defined through a DCM can be added manually to the data collection tool. These features make Healthdata.be an important facilitator of the LHS and help drive quality improvement in health care.
Facilitating access to reliable health data may be crucial to LHSs, but several situations have illustrated that there may be boundaries to this easier access. The Danish General Practice Database [
An important limitation of the Healthdata.be platform is rooted in the decentralized data provision. Despite the efforts to standardize data collection using DCMs, variability in data collection at the point of care is inevitable. Even though DCMs may clearly define a clinical concept, there may still be variations in its use in documenting daily clinical practice. To be fully interoperable, DCMs must also be integrated into a conceptual model such as problem-oriented medical registration. These conceptual models include interoperable standards not only for individual concepts but also for the relations between concepts. Moreover, even when a concept is well defined and documented within the same conceptual model, interrater differences persist; this is common to the way the narrative of a patient is translated into EHR documentation. Recording guidelines are required, and training on how to put these into practice is imperative, but harmonizing system designs and user interfaces may prove to be crucial. Many of the robust registries, such as the British General Practice Research Database [
The reuse of health data collected as part of routine clinical care can further research and improve health care. By ensuring semantic interoperability, safe data transfer, and trustworthy data handling, important sources of bias can be avoided. Concerns about data quality and validity can be addressed by collecting data from those sources where the data capture is bound to be most complete and linking these data from multiple sources through pseudonymization. Further research is required to assess whether these methods truly address concerns about data quality. To date, patients have only limited access to their own patient record and cannot add or change health data in it. When these features become more widespread, it would be interesting to evaluate how they influence data quality and validity.
CRF: clinical report form
DCM: detailed clinical model
DWH: datawarehouse
EDC: electronic data capture
eHealth: electronic health
EHR: electronic health record
ELMO: Electronic Laboratory Medicine ordering with evidence-based Order sets in primary care
GP: general practitioner
LHS: learning health system
LOINC: logical observation identifiers names and codes
SSIN: social security identification number
Healthdata.be is a project within the Scientific Institute of Public Health, currently known as Sciensano. Healthdata.be is funded by the National Institute for Health and Disability Insurance. The INTEGO continuous morbidity register is funded by the Flemish Health Ministry. The ELMO study is funded by the Belgian Health Care Knowledge Centre Clinical Trials Programme under research agreement KCE16011.
None declared.