This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Data science offers an unparalleled opportunity to identify new insights into many aspects of human life, and recent advances have brought it into health care. However, using data science in digital health raises significant challenges regarding data privacy, transparency, and trustworthiness. Recent regulations enforce the need for a clear legal basis for collecting, processing, and sharing data, for example, the European Union’s General Data Protection Regulation (2016) and the United Kingdom’s Data Protection Act (2018). For health care providers, legal use of the electronic health record (EHR) is permitted only in clinical care cases. Any other use of the data requires thoughtful consideration of the legal context and direct patient consent. Identifiable personal and sensitive information must be sufficiently anonymized. Raw data are commonly anonymized for research purposes, with an assessment of reidentification risk and utility. Although health care organizations have internal policies defined for information governance, there is a significant lack of practical tools and intuitive guidance about the use of data for research and modeling. Off-the-shelf data anonymization tools are developed frequently, but their privacy-related functionalities are often incomparable across problem domains. In addition, tools to weigh the reidentification risk of the anonymized data against its usefulness exist, but there are question marks over their efficacy.
In this systematic literature mapping study, we aim to alleviate the aforementioned issues by reviewing the landscape of data anonymization for digital health care.
We used Google Scholar, Web of Science, Elsevier Scopus, and PubMed to retrieve academic studies published in English up to June 2020. Noteworthy gray literature was also used to initialize the search. We focused on review questions covering 5 bottom-up aspects: basic anonymization operations, privacy models, reidentification risk and usability metrics, off-the-shelf anonymization tools, and the lawful basis for EHR data anonymization.
We identified 239 eligible studies, of which 60 were chosen for general background information; 16 were selected for 7 basic anonymization operations; 104 covered 72 conventional and machine learning–based privacy models; 4 and 19 papers included 7 and 15 metrics, respectively, for measuring the reidentification risk and degree of usability; and 36 explored 20 data anonymization software tools. In addition, we also evaluated the practical feasibility of performing anonymization on EHR data with reference to their usability in medical decision-making. Furthermore, we summarized the lawful basis for delivering guidance on practical EHR data anonymization.
This systematic literature mapping study indicates that anonymization of EHR data is theoretically achievable; yet, it requires more research efforts in practical implementations to balance privacy preservation and usability to ensure more reliable health care applications.
Digital health [
The United Kingdom’s Human Rights Act 1998 defines privacy as “everyone has the right to respect for [their] private and family life, [their] home and [their] correspondence” in Article 8 [
On January 30, 2020, a declaration [
Currently, the lack of more intuitive guidance and of a deeper understanding of how to feasibly anonymize personally identifiable information in EHRs (data from wearables, smart home sensors, pictures, videos, and audio files, as well as the combination of EHR and social media data, are out of the scope of this study), while ensuring an approach acceptable to both patients and the public, leaves the data controller and data processor susceptible to breaches of privacy. Although several diligent survey papers [
The curse of anonymization. Blue hue indicates an increase in data anonymity, which, in turn, reveals the decrease in usability of the anonymized data, very likely reaching minimum usability before reaching full anonymization (red hue).
In line with the updated scope of the GDPR and its associated Article 9 [
Relational data [
How the attributes are handled during the anonymization process depends on their categorization [
Direct identifiers
Indirect identifiers
Sensitive attributes
Nonsensitive attributes
Among the 4 categories of identifiers, it is particularly difficult to differentiate between direct identifiers
Logical flow of distinguishing direct identifiers I from quasi-identifiers Q. F: false; T: true.
In
Given the definition in Recital 26 [
It should be noted therefore that the relationship between data anonymization and pseudonymization techniques is characterized as follows:
Anonymized data are not identifiable, whereas pseudonymized data are identifiable.
Pseudonymized data remain personal based on Recital 26 of the GDPR and the conclusion [
Solving the problem of data anonymization necessarily means solving pseudonymization.
Concretely, given an anonymization function
In a real-world scenario, efficient data anonymization is challenging because it is usually problem dependent (ie, solutions vary across problem domains) and requires substantial domain expertise (eg, to specify the direct and indirect identifiers present in raw data) and effort (eg, user involvement in specifying the privacy model before the data anonymization process). Fundamentally, it is very challenging and nontrivial to define what
To minimize bias and deliver up-to-date studies related to data anonymization for health care, we organized this survey in a systematic literature mapping (SLM) manner. In general, there are 2 main approaches to conducting literature reviews: systematic literature review (SLR) and SLM [
Our overall objective is to alleviate the issues introduced toward the end of the previous section by reviewing the landscape of data anonymization for digital health care, to benefit practitioners aiming to achieve appropriate trade-offs between the reidentification risk and the usability of anonymized health care data. In other words, we evaluate the evidence regarding the effectiveness and practicality of data anonymization operations, models, and tools in secondary care from the perspective of data processors.
The aims of the study are to evaluate the potential of preserving privacy using data anonymization techniques in secondary care. Concretely, we, as data processors, are highly motivated to investigate the best possible way of anonymizing real-world EHRs by balancing the privacy and usability concerns visualized in
RQ 1: Do best practices exist for the anonymization of realistic EHR data?
RQ 2: What are the most frequently applied data anonymization operations, and how can these operations be applied?
RQ 3: What are the existing conventional and machine learning–based privacy models for measuring the level of anonymity? Are they practically useful in handling real-world health care data? Are there any new trends?
RQ 4: What metrics could be adopted to measure the reidentification risk and usability of the anonymized data?
RQ 5: What are the off-the-shelf data anonymization tools based on conventional privacy models and machine learning?
The knowledge generated from this SLM, especially the answer to our driving question, RQ 1, will inform the future development of data anonymization toolkits for data processors and the companies and organizations in which they are situated. The evidence gained may also contribute to our understanding of how data anonymization tools are implemented and of their applicability to anonymizing real-world health care data. Finally, we intend to identify the major facilitators of and barriers to data anonymization in secondary care in relation to reidentification risk and utility.
In keeping with our RQs, we built up our search strategy using
Articles were eligible for inclusion based on the criteria defined in
Related to anonymization or privacy-preserving techniques
Related to privacy-preserving techniques in health care
Presented privacy concerns in health care
Proposed methods for privacy preservation in electronic health records
Proposed methods for using private information, for example, biometric data
Proposed methods partially related to protected health care
Written in language other than English
Without
About other health care issues, for example, clinical trials
Insights not suitable for European Union or United Kingdom
Out of our research scope
Duplicate articles (case dependent)
Article selection (
Systematic literature mapping process for articles. GL+D: gray literature and duplicates.
As mentioned at the beginning of this section, selecting data anonymization software tools is difficult because few such tools appear in the existing studies. Thus, the initially included tools were selected from the qualified articles without considering whether their source code was publicly accessible, maintainable, and extensible. The only criterion was whether the tool could be downloaded and executed. Furthermore, to make the selection process less biased, we decided that in each of the 2 categories of privacy-preserving software tools (ie, privacy model–based and machine learning–based), the number of tools chosen from outside of the selected articles would be no more than 30% of the total.
In keeping with the five-phase article selection strategy described in the previous section, ZZ and MW independently selected articles for eligibility in phase 2. Articles were moved forward to the
Number of selected articles during the study selection process.
An overview of the 239 selected research articles grouped in categorical order.
Category | Selected research articles, n (%) |
Background knowledge | 60 (25.1) |
Data anonymization operations | 16 (6.7) |
Privacy models | 104 (43.5) |
Risk metrics | 4 (1.7) |
Utility metrics | 19 (7.9) |
Data anonymization tools | 36 (15.1) |
An overview of the 239 selected research articles grouped in chronological order.
In accordance with the strategy of selecting qualified privacy-preserving software tools described in the previous section, there were 5 out of a total of 15 privacy model–based data anonymization tools that were not derived from the qualified (ie, selected) articles. Of these 5 tools, 3 (
To add structure to this SLM, we grouped the results of the reviewed articles into 4 categories:
This technique is implemented by modifying the original data in a way that is not statistically significant. As described in the code of practice [
This technique can be achieved through, for instance, microaggregation [
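As a concrete illustration, the following is a minimal sketch of fixed-size univariate microaggregation; the group size k, the example ages, and the use of the mean as the centroid are illustrative assumptions rather than a prescribed configuration.

```python
import numpy as np

def microaggregate(values: np.ndarray, k: int) -> np.ndarray:
    """Fixed-size univariate microaggregation: sort the values, partition
    them into consecutive groups of at least k records (the remainder joins
    the last group), and replace every value with its group centroid."""
    order = np.argsort(values)
    anonymized = np.empty_like(values, dtype=float)
    n = len(values)
    n_groups = max(n // k, 1)
    for g in range(n_groups):
        start = g * k
        end = (g + 1) * k if g < n_groups - 1 else n
        idx = order[start:end]
        anonymized[idx] = values[idx].mean()  # centroid substitution
    return anonymized

ages = np.array([23, 25, 24, 61, 59, 60, 41, 40])
print(microaggregate(ages, k=3))  # every released value is shared by >= 3 records
```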
Generalization [
Full postcode → street → city or town → county (optional) → country
with a possible instance being:
DH1 3LE → South Road → Durham → UK
which may be generalized to:
DH → Durham → UK
Typically, generalization is used to reduce the specificity of the data and thereby the probability of information disclosure. Given the examples above, the degree of generalization is fully controlled by the granularity defined in the hierarchy.
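Following the postcode hierarchy above, a minimal sketch of hierarchy-based generalization might look as follows; the lookup table and level indices are illustrative assumptions, not a real reference hierarchy.

```python
# Level 0 is the raw value; higher levels are progressively less specific.
HIERARCHY = {
    "DH1 3LE": ["DH1 3LE", "South Road", "Durham", "UK"],
    "NE1 7RU": ["NE1 7RU", "Kings Road", "Newcastle", "UK"],
}

def generalize(value: str, level: int) -> str:
    """Map a value to its ancestor at the requested generalization level."""
    path = HIERARCHY[value]
    return path[min(level, len(path) - 1)]

print(generalize("DH1 3LE", 0))  # DH1 3LE (no generalization)
print(generalize("DH1 3LE", 2))  # Durham  (city or town level)
print(generalize("DH1 3LE", 3))  # UK      (country level)
```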
Suppression [
Data masking [
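Although masking techniques vary, a minimal character-masking sketch conveys the idea; the mask policy and the example values below are assumptions for illustration.

```python
def mask(value: str, keep: int = 2, symbol: str = "*") -> str:
    """Keep the first `keep` characters and mask the remainder."""
    return value[:keep] + symbol * max(len(value) - keep, 0)

print(mask("4394765919"))        # 43********
print(mask("DH1 3LE", keep=2))   # DH*****
```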
Differential privacy (DP) [
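The canonical mechanism behind DP for counting queries adds Laplace noise calibrated to the query sensitivity; the following minimal sketch assumes a counting query with sensitivity 1 and illustrative epsilon values.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a counting query under epsilon-differential privacy.
    A count has global sensitivity 1 (one person joining or leaving the
    data changes it by at most 1), so Laplace noise with scale 1/epsilon
    satisfies epsilon-DP."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 134  # e.g., patients with a given diagnosis in an EHR extract
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_count(true_count, eps), 1))
# Smaller epsilon -> more noise -> stronger privacy but lower usability:
# the trade-off visualized in the figure above.
```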
Homomorphic encryption (HE) [
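To make the homomorphic property concrete, the following is a toy sketch of the Paillier cryptosystem, which is additively homomorphic (fully homomorphic schemes extend this to arbitrary computations); the small primes are purely illustrative and insecure.

```python
import random
from math import gcd

# Tiny primes for illustration only; real deployments use keys of
# thousands of bits.
p, q = 1000003, 1000033
n, n2 = p * q, (p * q) ** 2
g = n + 1                                      # standard generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)

def L(u: int) -> int:
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)            # modular inverse (Python 3.8+)

def encrypt(m: int) -> int:
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 120, 80                      # e.g., two clinical readings
c = (encrypt(a) * encrypt(b)) % n2  # multiplying ciphertexts adds plaintexts
assert decrypt(c) == a + b
print(decrypt(c))                   # 200, computed without seeing a or b
```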
Compressive privacy (CP) [
The objective of satisfying different levels of anonymity is usually achieved through 2 consecutive steps: measurement and evaluation. The former refers to the use of either conventional or machine learning–based privacy models to perform data anonymization, and the latter is the process of evaluating the reidentification risk and degree of usability of the anonymized data. The anonymization operations are usually adopted by conventional privacy models or machine learning–based models.
Categorizations of measurements and evaluations for achieving different levels of anonymity. ML: machine learning; RT: relational-transactional privacy model.
The attributes contained in a table are usually divided into direct identifiers
Given a class of records
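To make the notion of equivalence classes concrete, consider the following minimal sketch: records sharing the same quasi-identifier values form one class, and a table is k-anonymous if and only if its smallest class contains at least k records. The field names and values are illustrative.

```python
from collections import Counter

records = [
    {"age": "20-29", "postcode": "DH", "diagnosis": "asthma"},
    {"age": "20-29", "postcode": "DH", "diagnosis": "diabetes"},
    {"age": "30-39", "postcode": "NE", "diagnosis": "asthma"},
    {"age": "30-39", "postcode": "NE", "diagnosis": "flu"},
    {"age": "30-39", "postcode": "NE", "diagnosis": "asthma"},
]
quasi_identifiers = ("age", "postcode")

# Group records by their quasi-identifier tuple to obtain class sizes.
classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(k: int) -> bool:
    """k-anonymous iff the smallest equivalence class has >= k rows."""
    return min(classes.values()) >= k

print(classes)            # class sizes: ('30-39', 'NE') -> 3, ('20-29', 'DH') -> 2
print(is_k_anonymous(2))  # True
print(is_k_anonymous(3))  # False: the ('20-29', 'DH') class has only 2 rows
```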
During the process of data anonymization, interpretable and realistically feasible measurements (ie, privacy models [
General pipeline for existing privacy model–based data anonymization tools. F: false; T: true.
A summary of privacy models for relational electronic health record data with respect to parameter interval and degree of privacy of data.
Category | Privacy model | Section in Multimedia Appendix 4 | Parameter interval | Privacy level
Relational | k-anonymity | 1.1 | [1, |X|] | High
Relational | (α, k)-anonymity | 1.1.1 | α ∈ [0, 1], k ∈ [1, |X|] | α: low, k: high
Relational | k-map | 1.1.2 | [1, |X|] | Low
Relational | — | 1.1.3 | [0, +∞] | High
Relational | (…) | 1.1.4 | [0, +∞] | High
Relational | (…) | 1.1.5 | — | High
Relational | Multirelational k-anonymity | 1.1.6 | [0, +∞] | High
Relational | Strict average risk | 1.1.7 | N/Aa | Low
Relational | ℓ-diversity | 1.2 | [0, +∞] | High
Relational | — | 1.2.1 | — | High
Relational | t-closeness | 1.3 | [0, +∞] | Low
Relational | Stochastic t-closeness | 1.3.1 | — | Low
Relational | (n, t)-closeness | 1.3.2 | [0, +∞] | High
Relational | β-likeness and enhanced β-likeness | 1.3.3 | [0, +∞] | High
Relational | Differential privacy | 1.4 | [0, +∞] | Low
Relational | (…) | 1.4.1 | [0, +∞] | High
Relational | (ε, δ)-anonymity | 1.4.2 | ε ∈ [0, +∞], δ ∈ [0, +∞] | ε: low, δ: low
Relational | (ε, …) | 1.4.3 | ε ∈ [0, 1], … | ε: high, …
Relational | Distributed differential privacy | 1.4.4 | [0, +∞] | Low
Relational | Distributional differential privacy | 1.4.5 | ε ∈ [0, +∞], δ ∈ [0, +∞] | ε: low, δ: low
Relational | — | 1.4.6 | [0, +∞] | Low
Relational | Joint differential privacy | 1.4.7 | ε ∈ [0, +∞], δ ∈ [0, +∞] | ε: low, δ: low
Relational | (…) | 1.5.1 | [0, 1] | Low
Relational | Normalized variance | 1.5.2 | [0, 1] | High
Relational | δ-disclosure privacy | 1.5.3 | [0, +∞] | High
Relational | (…) | 1.5.4 | [0, 1] | Low
Relational | δ-presence | 1.5.5 | [0, 1] | Low
Relational | Population and sample uniqueness | 1.5.6 or 1.5.7 | N/A | N/A
Relational | Profitability | 1.5.8 | N/A | N/A
Transactional | kᵐ-anonymity | 2 | N/A | N/A
Relational-transactional | (k, kᵐ)-anonymity | 3 | N/A | N/A
Graph | — | 4.1 | N/A | N/A
Graph | — | 4.2 | N/A | N/A
Graph | — | 4.3 | N/A | N/A
Graph | (…) | 4.4 | N/A | N/A
Geolocational | Historical k-anonymity | 5 | N/A | N/A
aN/A: not applicable.
In light of machine learning and its derived subset, deep learning, there has been an upsurge of interest in machine learning– or deep learning–based privacy models for anonymizing patient or general data; we explore these approaches in this section. We divided related machine learning–based privacy models into 2 categories in accordance with the type of data used: raw or synthetic. Of late, the use of synthetic data has become more popular because these generated data are both anonymous and realistic; therefore, consent from data owners is not required [
In the study by D’Acquisto and Naldi [
Many similar PCA techniques rely on results derived from random matrix theory [
Despite the great success achieved by PCA and its variants in data anonymization, traditional clustering algorithms have also been adopted to deal with the same problem; 𝑘-means [
In the study by Choi et al [
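To illustrate the general idea behind GAN-based synthesis (a generic sketch, not the specific architecture of any reviewed model), the following minimal PyTorch example trains a generator to produce synthetic binary feature vectors; all dimensions, data, and training settings are arbitrary stand-ins.

```python
import torch
import torch.nn as nn

n_features, noise_dim = 32, 8

# Generator maps random noise to synthetic binary feature vectors;
# discriminator scores how "real" a record looks.
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(),
                  nn.Linear(64, n_features), nn.Sigmoid())
D = nn.Sequential(nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = (torch.rand(512, n_features) < 0.1).float()  # stand-in for real records

for step in range(200):
    batch = real_data[torch.randint(0, 512, (64,))]
    fake = G(torch.randn(64, noise_dim))

    # Discriminator step: real records -> 1, synthetic records -> 0.
    opt_d.zero_grad()
    loss_d = bce(D(batch), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator into scoring synthetic rows as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

# Synthetic records, which can be shared in place of the raw ones.
print((G(torch.randn(5, noise_dim)) > 0.5).int())
```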
Given the conventional and machine learning–based privacy models, a disclosure risk assessment is usually conducted to measure the reidentification risk of the anonymized EHR data. In practice, risk values from different combinations of privacy models could be used when deciding which version of the anonymized data should be used for data analysis and possible machine learning tasks such as EHR classification with respect to treatment planning or distance recurrence identification.
Concretely, there are 3 major types of disclosure that may occur during the process of data anonymization: identity, attribute, and membership disclosure (
Categorization of data reidentification risk metrics for electronic health record data.
Disclosure type | Metric | Section in Multimedia Appendix 6 | Reference
Identity | Average risk | 1 | N/Aa
Identity | Overall risk | 1 | N/A
Identity | β-Likeness | 1 | —
Identity | Distance-linked disclosure | 2 | —
Attribute | Probabilistic linkage disclosure | 2 | —
Attribute | Interval disclosure | 2 | —
Membership | Log-linear models | 3 | —
aN/A: not applicable.
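As a concrete illustration of identity disclosure measurement, the following sketch derives record-level risk from equivalence class sizes. The exact definitions of average and overall risk vary between tools; this follows the common convention that a record's reidentification risk is the reciprocal of its equivalence class size.

```python
# Sizes of the equivalence classes in an anonymized table, e.g., the two
# classes from the k-anonymity sketch earlier in this article.
class_sizes = [3, 2]
n_records = sum(class_sizes)

# A record in a class of size s is reidentified with probability 1/s,
# so the smallest class dominates the highest (overall) risk...
highest_risk = 1 / min(class_sizes)          # 0.5
# ...while the per-record average simplifies to classes / records, because
# each class of size s contributes s * (1/s) = 1 to the sum of risks.
average_risk = len(class_sizes) / n_records  # 0.4

print(highest_risk, average_risk)
```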
The metrics used for measuring the usefulness of the anonymized data can be treated as an on-demand component of a data anonymization system. We revisit the proposed quantitative metrics in this section, although this important indicator is usually not fully covered in the off-the-shelf privacy model–based data anonymization tools. In addition, qualitative metrics are not covered in this study. This is due to the varied objectives of different data anonymization activities, including the evaluation of anonymization quality that is performed by health care professionals.
Categorization of data usability metrics.
Data type | Metric | Section in Multimedia Appendix 5
Tabular | Information loss and its variants | 1.1
Tabular | Privacy gain | 1.2
Tabular | Discernibility | 1.3
Tabular | Average equivalence class size | 1.4
Tabular | Matrix norm | 1.5
Tabular | Correlation | 1.6
Tabular | Divergence | 1.7
Imagea | Mean squared error and its variants | 2.1
Imagea | Peak signal-to-noise ratio | 2.2
Imagea | Structural similarity index | 2.3
aAny type of raw and anonymized electronic health record data that can be converted into an image.
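For illustration, two of the tabulated metrics can be computed directly from equivalence class sizes. The formulas below follow their common forms in the anonymization literature; individual tools may differ in details such as suppression penalties.

```python
class_sizes = [3, 2]      # equivalence class sizes of an anonymized table
n = sum(class_sizes)
k = 2                     # the k used for k-anonymity

# Discernibility: each record in a class of size s incurs a penalty of s,
# so a class contributes s * s; lower totals mean more usable data.
discernibility = sum(s * s for s in class_sizes)   # 3*3 + 2*2 = 13

# Normalized average equivalence class size C_AVG = (n / #classes) / k;
# values close to 1 indicate classes barely larger than k requires.
c_avg = (n / len(class_sizes)) / k                 # (5 / 2) / 2 = 1.25

print(discernibility, c_avg)
```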
In this section, several off-the-shelf data anonymization tools based on conventional privacy models and operations are detailed. These tools are commonly adopted for anonymizing tabular data. It should be noted that EHRs are usually organized in the tabular data format and that the real difficulties of anonymizing tabular data lie in the inherent bias and presumption of the availability of limited forecast-linkable data. Therefore, we investigated 15 data anonymization toolboxes, all of which share a similar workflow (summarized in
Overall results of the systematic literature mapping study. This mapping consists of 4 contextually consecutive parts (from bottom to top): basic anonymization operations, existing privacy models, metrics proposed to measure reidentification risk and degree of usability of the anonymized data, and off-the-shelf data anonymization software tools. ADS-GAN: anonymization through data synthesis using generative adversarial networks; AP: affinity propagation; BL: β-likeness; CIAGAN: conditional identity anonymization generative adversarial network; CPGAN: compressive privacy generative adversarial network; DBSCAN: density-based spatial clustering of applications with noise; DP: differential privacy; DPPCA: differentially private principal component analysis; FCM: fuzzy c-means; G: graph; GAN: generative adversarial network; GL: geolocational; GMM: Gaussian mixture model; HE: homomorphic encryption; IL: information loss; ILPG: ratio of information loss to privacy gain; KA:
Comparison of the off-the-shelf privacy model–based data anonymization tools in terms of available development options, anonymization functionality, and risk metrics.
Tool | Last release | Open source | Public APIa | Extensibility | Cross-platform | Programming language | Anonymization | Risk assessment
ARX | November 2019 | ✓b | ✓ | ✓ | ✓ | Java | ✓ | ✓
Amnesia | October 2019 | ✓ | ✓ | ✓ | ✓ | Java | ✓ |
μ-ANTc | August 2019 | ✓ | ✓ | ✓ | ✓ | Java | ✓ |
Anonimatron | August 2019 | ✓ | ✓ | ✓ | ✓ | Java | |
SECRETAd | June 2019 | | | | ✓ | C++ | ✓ |
sdcMicro | May 2019 | ✓ | ✓ | Poorly supported | ✓ | R | ✓ | ✓
Aircloak Insights | April 2019 | | | | ✓ | Ruby | |
NLMe Scrubber | April 2019 | | | | ✓ | Perl | |
Anonymizer | March 2019 | ✓ | ✓ | ✓ | ✓ | Ruby | |
Shiny Anonymizer | February 2019 | ✓ | ✓ | ✓ | ✓ | R | ✓ |
μ-ARGUS | March 2018 | | | | | C++ | ✓ | ✓
UTDf Toolbox | April 2010 | ✓ | | Poorly supported | ✓ | Java | ✓ |
OpenPseudonymiser | November 2011 | ✓ | | | ✓ | Java | |
TIAMATg | 2009 | | | | ✓ | Java | ✓ |
Cornell Toolkit | 2009 | ✓ | | Poorly supported | ✓ | C++ | ✓ | Poorly supported
aAPI: application programming interface.
bFeature present.
cμ-ANT: microaggregation-based anonymization tool.
dSECRETA: System for Evaluating and Comparing RElational and Transaction Anonymization.
eNLM: National Library of Medicine.
fUTD: University of Texas at Dallas.
gTIAMAT: Tool for Interactive Analysis of Microdata Anonymization Techniques.
Comparison of the off-the-shelf privacy model–based data anonymization tools with respect to the supported privacy models.
Tool | Last release | k-anonymity | ℓ-diversity | t-closeness | δ-presence | (…) | (…) | (ε, δ)-anonymity | (…) | kᵐ-anonymity | (k, kᵐ)-anonymity
ARX | November 2019 | ✓a | ✓ | ✓ | ✓ | ✓ | | | ✓ | |
Amnesia | October 2019 | ✓ | | | | | | | | |
μ-ANTb | August 2019 | ✓ | | ✓ | | | | | | |
Anonimatron | August 2019 | | | | | | | | | |
SECRETAc | June 2019 | ✓ | | | | | | | | ✓ | ✓
sdcMicro | May 2019 | ✓ | ✓ | | | | | | | |
Aircloak Insights | April 2019 | | | | | | | | | |
NLMd Scrubber | April 2019 | | | | | | | | | |
Anonymizer | March 2019 | | | | | | | | | |
Shiny Anonymizer | February 2019 | | | | | | | | | |
μ-ARGUS | March 2018 | ✓ | | | | | | | | |
UTDe Toolbox | April 2010 | ✓ | ✓ | ✓ | | | | | | |
OpenPseudonymiser | November 2011 | | | | | | | | | |
TIAMATf | 2009 | ✓ | ✓ | ✓ | | | | | | |
Cornell Toolkit | 2009 | | ✓ | ✓ | | | | | | |
aFeature present.
bμ-ANT: microaggregation-based anonymization tool.
cSECRETA: System for Evaluating and Comparing RElational and Transaction Anonymization.
dNLM: National Library of Medicine.
eUTD: University of Texas at Dallas.
fTIAMAT: Tool for Interactive Analysis of Microdata Anonymization Techniques.
Amnesia [
Anonimatron [
ARX [
sdcMicro [
SECRETA (System for Evaluating and Comparing RElational and Transaction Anonymization) [
Aircloak Insights [
National Library of Medicine-Scrubber [
OpenPseudonymiser [
The Cornell Toolkit [
Recently, in response to the GDPR and DPA regulations, efforts were made by the machine learning and cryptography communities to develop privacy-preserving machine learning methods. We define privacy-preserving methods as any machine learning method or tool that has been designed with data privacy as a fundamental concept (usually in the form of data encryption) and that can typically be divided into 2 classes: those that use SMPC and those that use fully homomorphic encryption (FHE). All the investigated machine learning–based data anonymization tools are compared in
Comparison of existing machine learning–based data anonymization tools. The Largest model tested column reports the number of parameters in the largest model shown in the respective tool’s original paper (when reported); CrypTFlow has been shown to work efficiently on much larger machine learning models than the other available privacy-preserving machine learning tools.
Tool | SMPCa | FHEb | Differential privacy | Federated learning | Reidentification risk assessment | Usability measurement | Supports training | Malicious security | Largest model tested
CrypTen | ✓c | | | | | | ✓ | | N/Ad
TF Encrypted | ✓ | ✓ | | ✓ | | | ✓ | | 419,720
PySyft | ✓ | | ✓ | ✓ | | | ✓ | | N/A
CrypTFlow | ✓ | | | | | | | ✓ | 65×10⁶
CHET | | ✓ | | | | | | | 421,098
aSMPC: secure multiparty computation.
bFHE: fully homomorphic encryption.
cFeature present.
dN/A: not applicable.
SMPC involves a problem in which
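To make the underlying arithmetic concrete, the following is a minimal sketch of additive secret sharing over a finite field, a building block of many SMPC protocols; the field modulus, party count, and inputs are illustrative.

```python
import random

P = 2**61 - 1  # a Mersenne prime defining the finite field

def share(secret: int, n_parties: int):
    """Split a secret into n additive shares that sum to it modulo P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

alice_age, bob_age = 34, 51
alice_shares = share(alice_age, 3)
bob_shares = share(bob_age, 3)

# Each of the 3 parties adds the shares it holds, locally...
local_sums = [(a + b) % P for a, b in zip(alice_shares, bob_shares)]
# ...and only combining all parties' results reveals the output: no single
# share (or local sum) leaks anything about either input.
assert sum(local_sums) % P == alice_age + bob_age
print(sum(local_sums) % P)  # 85, computed without exposing either age
```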
There are several different practical implementations of this type of protocol, although none are ready for use in production environments. CrypTen [
PySyft [
CrypTFlow [
An example of how SMPC protocols and SMPC-supporting machine learning libraries can be used is shown in the study by Hong et al [
CrypTen, TF Encrypted, and PySyft all have the advantage that they work closely with commonly used machine learning libraries (PyTorch, TF, and both PyTorch and TF, respectively), meaning that there is less of a learning curve required to make the existing models privacy preserving compared with tools such as CrypTFlow. This ease of use comes at the cost of efficiency, however, because more complex tools such as CrypTFlow are able to work at a lower level and perform more optimizations, allowing larger models to be encrypted.
HE is a type of encryption wherein the result of computations on the encrypted data, when decrypted, mirrors the result of the same computations carried out on the plaintext data. Specifically, FHE is an encryption protocol that supports any computation on the ciphertext. Attempts have been made to apply FHE to machine learning [
Applying FHE to privacy-preserving machine learning is a relatively new area of research, and thus there are few tools that tie the 2 concepts together, with most research focusing on specific model implementations rather than on creating a general framework for FHE machine learning. One such tool, however, is CHET [
Although general data protection laws such as the GDPR and DPA and health care–specific guidelines have been in place for some time, data anonymization practitioners still lack a combined and intuitive reference list to consult. In this discussion, we tentatively construct a policy base by collecting and sorting the available guidance provided by 4 lawful aspects in an effort to benefit future intelligent data anonymization for health care.
The policy base was constructed by considering the documentation provided in accordance with legal frameworks and guidelines proposed by government-accountable institutions, that is, the GDPR, particularly Article 5 [
From the NHS perspective, pseudonyms should be used on a one-off and consistent basis. In terms of the best practice recommendations, they recommend adopting cryptographic hash functions (eg, MD5, SHA-1, and SHA-2) to create a fixed-length hash. We argue that the encrypted data might be less useful for possible later data analysis and explainability research. We summarize the suggestions provided by the 4 aforementioned entities in
Accuracy
Accountability
Storage limitation
Purpose limitation
Data minimization
Integrity and confidentiality
Lawfulness, fairness, and transparency
Notify any personal data breach
Settle system interruption or restoration
Implement disclosure-risk measures
Define legal basis for data processing
Establish precise details of any processing
Prevent unauthorized processing and inference
Conduct data protection impact assessment
Test anonymization effectiveness through reidentification
No intent to threaten or cause damage through reidentification
Ensure data integrity when malfunctions occur
Remove high-risk records
Remove high-risk attributes
Use average value of each group
Use the week to replace the exact date
Swap values of attributes with high risk
Use partial postcode instead of full address
Define a threshold and suppress the minority
Probabilistically perturb categorical attributes
Aggregate multiple variables into new classes
Use city instead of postcode, house number, and street
Recode specific values into less-specific range
Use secret key to link back (data owner only)
Add noise to numerical data with low variations
Round off the totals
Swap data attributes
Use identifier ranges
Mask part of the data
Use age rather than date of birth
Change the sort sequence
Use the first part of the postcode
Remove direct identifiers (National Health Service number)
Risk assessment of indirect identifiers
Provide only a sample of the population
Provide range banding rather than exact data
If aggregate totals are less than 5, use pseudonyms
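To make the hashing recommendation above concrete, the following is a minimal sketch of consistent pseudonymization using a keyed hash (HMAC with SHA-256, a member of the SHA-2 family). The key and NHS number are placeholders; a keyed construction is shown because an unkeyed hash of an enumerable identifier is vulnerable to dictionary attacks.

```python
import hashlib
import hmac

# The secret key acts as the salt: without it, anyone who can enumerate
# all valid NHS numbers could rebuild the mapping by brute force.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(nhs_number: str) -> str:
    digest = hmac.new(SECRET_KEY, nhs_number.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # fixed-length pseudonym

# The same input always yields the same pseudonym (a consistent basis),
# so records remain linkable across tables without exposing identity.
print(pseudonymize("943 476 5919"))
print(pseudonymize("943 476 5919") == pseudonymize("943 476 5919"))  # True
```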
Here we present the results of the 5 defined RQs (
Review question (RQ) 1: Do best practices exist for the anonymization of realistic electronic health record (EHR) data?
As the leading question of this systematic literature mapping study, we answer it by drawing on the answers to the other 4 RQs: anonymization of realistic EHR data is theoretically feasible but practically challenging. On the basis of the answers to the remaining 4 questions, the available theoretical operations, privacy models, and reidentification risk and usability measurements are sufficient. Despite this, anonymization remains practically difficult for 2 main reasons: (1) there is a knowledge gap between health care professionals and privacy law (bridging it usually requires substantial collaborative effort across clinical science, law, and data science), although we have summarized the lawful bases in the following subsection; and (2) automatic anonymization of EHR data is nontrivial and very case dependent.
RQ 2: What are the most frequently applied data anonymization operations, and how can these operations be applied?
We investigated 7 categories of basic data anonymization operations in 16 articles, most of which are summarized in
RQ 3: What are the existing conventional and machine learning–based privacy models for measuring the level of anonymity? Are they practically useful in handling real-world health care data? Are there any new trends?
We presented 40 conventional (a taxonomy for relational data is summarized as part of
RQ 4: What metrics could be adopted to measure the reidentification risk and usability of the anonymized data?
We investigated 7 (from 4 articles) and 15 (from 19 articles) metrics to quantify the risk of reidentification and degree of usability of the anonymized data. Measuring reidentification risk requires a pair of raw and anonymized data records in which the original data are treated as an object of reference and compared with the anonymized data in terms of statistical difference. Such a difference may not sufficiently reveal the true risk of reidentification. To further investigate this issue, we combined the privacy models for discussing the trade-offs between these 2 privacy aspects. In contrast, more usability metrics were proposed because of the wider availability of performance indicators.
RQ 5: What are the off-the-shelf data anonymization tools based on conventional privacy models and machine learning?
We investigated and compared 19 data anonymization tools (reported in 36 articles), of which 15 are based on privacy models (compared in Tables
The most important question to consider when data anonymization is required in the health care sector is the trade-off between the level of privacy and the degree of usability. In
It should be noted that the privacy models, reidentification risk measurements, and data usability metrics reviewed in this study are relatively easy to understand, with equations provided along with adequate descriptions. However, these concepts are difficult to deploy in real-world data anonymization tools. Even given the intensive investigations summarized above, the utility of the anonymized data may not be easily measurable through the proposed reidentification risk and utility metrics.
Given this discrepancy observed from the ablation study we conducted (
SMPC and FHE share some disadvantages. They both use encryption methods that work over finite fields, and thus they cannot natively work with floating-point numbers. All practical implementations instead use a fixed-precision representation, but this adds computational overhead, and the level of precision used can affect the accuracy of the results.
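The following minimal sketch illustrates such a fixed-precision representation; the modulus and number of fractional bits are illustrative choices, not values taken from any particular framework.

```python
P = 2**61 - 1   # field modulus
F = 16          # fractional bits of precision

def encode(x: float) -> int:
    """Scale by 2**F, round to an integer, and reduce into the field."""
    return round(x * (1 << F)) % P

def decode(v: int) -> float:
    if v > P // 2:           # upper half of the field encodes negatives
        v -= P
    return v / (1 << F)

x, y = 3.14159, -2.71828
# Addition is exact on encodings, but each encoding itself is rounded
# (and multiplication would additionally require rescaling by 2**F).
print(decode((encode(x) + encode(y)) % P))  # ~0.42331, small rounding error
print(decode(encode(0.1)) - 0.1)            # the representation error for 0.1
```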
Another important issue is that of the trade-off between interpretability and privacy [
Encrypted, trained models are also still vulnerable to reverse-engineering attacks (regardless of the encryption method used) [
It is also important to remember that, as noted previously, any type of encryption is regarded as a form of pseudonymization rather than anonymization because the encrypted data can be decrypted by anyone with access to the encryption key. However, we note that much of the current guidance on viewing encryption techniques as anonymization or pseudonymization is ambiguous; for example, ICO guidance [
Overall, privacy-preserving machine learning is a promising area of research, although more work needs to be undertaken to ensure that such methods are ready for use in industrial applications; many of the tools currently available are only suitable for research rather than practical application. There also needs to be some consideration over which privacy-preserving methods best suit the needs of the application. SMPC currently offers a more viable approach than FHE because of its ability to run (and, more importantly, train) larger models, although the need to have multiple trusted parties may mean that it is seen as less secure than FHE. Meanwhile, FHE for privacy-preserving machine learning is still an emerging field, and it is encouraging to see research being undertaken by both the machine learning and cryptographic communities to improve the practicality of FHE methods by improving the running time of encrypted models and reducing the level of cryptographic knowledge needed to create efficient, encrypted models using FHE.
In this SLM study, we presented a comprehensive overview of data anonymization research for health care by investigating both conventional and emerging privacy-preserving techniques. Given the results and discussions regarding the 5 proposed RQs, privacy-preserving data anonymization for health care is a promising domain, although more studies are required to ensure more reliable industrial applications.
Math notation system.
Typical direct identifiers and quasi-identifiers in UK electronic health record data.
Examples of fundamental data anonymization operations.
Conventional privacy models.
Usability metrics for privacy models.
Reidentification risk metrics for privacy models.
Ablation study for privacy-usability trade-offs and practical feasibility.
CP: compressive privacy
DP: differential privacy
DPA: Data Protection Act
EHR: electronic health record
FHE: fully homomorphic encryption
GAN: generative adversarial network
GDPR: General Data Protection Regulation
GP: general practitioner
HE: homomorphic encryption
HIPAA: Health Insurance Portability and Accountability Act
ICO: Information Commissioner’s Office
medGAN: medical generative adversarial network
NHS: National Health Service
PCA: principal component analysis
PRAM: postrandomization method
QID: quasi-identifier
RQ: review question
SECRETA: System for Evaluating and Comparing RElational and Transaction Anonymization
SLM: systematic literature mapping
SLR: systematic literature review
SMPC: secure multiparty computation
TF: TensorFlow
μ-ANT: microaggregation-based anonymization tool
This study was sponsored by the UK Research and Innovation fund (project 312409) and Cievert Ltd.
None declared.