This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general, nor for the purpose of comparing SDG methods.
This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research.
We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets generated from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the absolute difference in the area under the receiver operating characteristic curve and the area under the precision-recall curve between logistic regression prediction models trained on synthetic data and those trained on real data.
The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
Interest in synthetic data generation (SDG) has recently grown. Synthetic data are deemed to have low privacy risks in practice because there is no one-to-one mapping between synthetic records and real people [
Utility metrics can be defined as narrow or broad [
Researchers and analysts need to rank SDG methods. For example, a developer of an SDG method may use an ensemble of techniques and then select the one with the highest utility as the final result, or analysts may evaluate multiple SDG methods available in the marketplace to select one for their own projects. However, not all workloads are typically known in advance. Therefore, researchers and analysts cannot evaluate the narrow utility of the SDG methods directly. Instead, they need to use broad utility metrics during the SDG construction and evaluation process. A key requirement is that broad utility metrics are predictive of narrow utility metrics for plausible analytic workloads.
Some studies utilized broad metrics, for example, to compare and improve SDG methods [
The realistic decision-making scenario that we are considering here is the comparison and ranking of SDG methods. Finding the best SDG method is becoming a more common need in the literature, and we need reliable metrics to be able to draw valid conclusions from these comparisons. Furthermore, in practice, users of SDG methods need to have good metrics to select among a number of these methods that may be available to them.
Utility metrics can be classified in a different way, which is relevant for our purposes. They can pertain to a specific synthetic data set or to the generative model (“data set–specific” and “model-specific” utility metrics). Because SDG is stochastic, the utility of synthetic data sets generated from the same generative model will vary each time the generative model is run, and sometimes that variation can be substantial. Data set–specific utility metrics are useful when one wants to communicate how good the particular generated data set is to a data user. However, these utility metrics are not necessarily useful, for example, for comparing different generative models because of the stochasticity. A model-specific utility metric reflects the utility of the generated synthetic data sets on average, across many data sets that are generated from the same model. Such a metric is more useful in our context, where we want to compare and rank SDG methods.
Our focus in the current study is to perform a validation study of broad model-specific utility metrics for structured (tabular) health data sets. While there have been evaluations of generative model utility metrics in the past, these have focused on images rather than structured data [
We build on this previous work by considering other types of broad model-specific utility metrics beyond pMSE and adjust the methodology to more closely model a practical decision-making scenario of an analyst selecting among multiple SDG methods to identify the one with higher narrow utility on logistic regression prediction tasks. This type of prediction task is used often in health research.
The protocol for this study was approved by the CHEO Research Institute Research Ethics Board (number CHEOREB# 21/144X). Our objective was to answer the following question: Which broad model-specific utility metrics can be used to rank SDG methods in terms of the similarity of prediction performance between real and synthetic data? In the following sections we describe the methods that were followed.
For our analysis, we used the 30 health data sets that are summarized in Appendix S1 in
Broad utility metrics compare the joint distributions of the real and synthetic data sets. Many metrics have been proposed to compare joint distributions [
The maximum mean discrepancy metric is one way to test whether samples are from different distributions [
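To make the computation concrete, the following is a minimal sketch of an MMD estimator between a real and a synthetic sample. The radial basis function kernel, its bandwidth, and the biased (V-statistic) form of the estimator are illustrative assumptions, not necessarily the configuration used in this study.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Pairwise RBF (Gaussian) kernel matrix between the rows of A and B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd_squared(real, synth, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy.

    A value near 0 suggests the real and synthetic samples come from
    similar distributions; larger values indicate dissimilarity. The
    bandwidth sigma is an assumption (the median heuristic is a common
    data-driven choice)."""
    kxx = rbf_kernel(real, real, sigma)
    kyy = rbf_kernel(synth, synth, sigma)
    kxy = rbf_kernel(real, synth, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()
```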
The Hellinger distance [
The Hellinger distance can be derived from the multivariate normal Bhattacharyya distance and has the advantage that it is bounded between 0 and 1 and hence is more interpretable [
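As an illustration of how such a metric can be computed, the sketch below maps each variable to normal scores (a Gaussian copula representation), estimates the correlation matrices of the real and synthetic data on that scale, and applies the closed-form Bhattacharyya distance for multivariate normals. The encoding of categorical variables and other estimation details are assumptions and may differ from the implementation used in this study.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores(X):
    """Map each column of X to standard normal scores through its
    empirical CDF, yielding the Gaussian copula representation."""
    n = X.shape[0]
    U = np.apply_along_axis(rankdata, 0, X) / (n + 1)  # ranks scaled into (0, 1)
    return norm.ppf(U)

def copula_hellinger(real, synth):
    """Multivariate Hellinger distance between the Gaussian copula
    representations of the real and synthetic data, bounded in [0, 1]."""
    R1 = np.corrcoef(normal_scores(real), rowvar=False)
    R2 = np.corrcoef(normal_scores(synth), rowvar=False)
    Rm = (R1 + R2) / 2
    # Bhattacharyya distance between N(0, R1) and N(0, R2); the
    # mean-difference term vanishes because both means are 0.
    _, logdet_m = np.linalg.slogdet(Rm)
    _, logdet_1 = np.linalg.slogdet(R1)
    _, logdet_2 = np.linalg.slogdet(R2)
    bd = 0.5 * (logdet_m - 0.5 * (logdet_1 + logdet_2))
    return np.sqrt(1 - np.exp(-bd))  # Hellinger distance in [0, 1]
```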
The Wasserstein distance is another metric that has been used to compare the distributions of real and synthetic data [
While GANs have been used extensively as an SDG technique, they often still have trouble capturing the temporal dependencies in the joint probability distributions of time-series data. The conditional sig-Wasserstein GAN proposed for time-series generation is aimed at addressing this problem [
The original cluster metric [
\[ U_c = \frac{1}{G}\sum_{j=1}^{G}\left(\frac{n_{jR}}{n_j} - c\right)^2 \]
where G is the number of clusters formed on the merged real and synthetic data, n_j is the number of records in cluster j, n_jR is the number of real records in cluster j, and c is the proportion of real records in the merged data set. A large U_c value indicates that the real and synthetic records are not distributed similarly across the clusters, that is, lower utility; values close to 0 indicate that the two data sets mix well and hence have higher utility.
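A minimal sketch of this computation is shown below, assuming k-means as the clustering algorithm and an arbitrary number of clusters; both choices are illustrative, and the clustering configuration used in this study may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_metric(real, synth, n_clusters=20, seed=0):
    """Cluster-analysis utility metric U_c on the merged data.

    Pools the real and synthetic records, clusters them, and measures
    how far the share of real records in each cluster deviates from
    the overall share c. A value near 0 indicates the records mix well."""
    merged = np.vstack([real, synth])
    is_real = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    c = is_real.mean()  # overall proportion of real records
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(merged)
    score = 0.0
    for j in range(n_clusters):
        in_j = labels == j
        if in_j.any():  # skip empty clusters
            score += (is_real[in_j].mean() - c) ** 2
    return score / n_clusters
```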
These broad metrics are based on the idea of training a binary classifier that can discriminate between a real and synthetic record [
A propensity mean square error metric has been proposed to evaluate the similarity of real and synthetic data sets [
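The sketch below illustrates the pMSE computation, assuming a logistic regression propensity model (classification trees are another common choice in the literature); for brevity, the model is fit and scored on the full merged data set rather than with the cross-validated variants that have also been proposed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real, synth):
    """Propensity mean squared error between real and synthetic data.

    Fits a classifier to distinguish synthetic from real records and
    compares each propensity score with c, the score expected when the
    two data sets are indistinguishable. A value of 0 is ideal."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])  # 1 = synthetic
    c = y.mean()  # equals 0.5 when the data sets are the same size
    propensities = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    return np.mean((propensities - c) ** 2)
```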
Another related approach that has been used to evaluate the utility of synthetic data is to take a prediction perspective rather than a propensity perspective. This has been applied with “human discriminators” by asking a domain expert to manually classify sample records as real or synthetic [
The use of human discriminators is not scalable; instead, machine learning algorithms can be trained on a training data set and then make predictions on a holdout test data set. This approach mimics the subjective evaluations described above. We will refer to this metric as the prediction mean squared error.
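One plausible formulation of this discriminator-based metric is sketched below; the choice of a gradient boosting classifier, the 70/30 split, and scoring the holdout probabilities against the true labels are assumptions made for illustration. With balanced classes, a value near 0.25 means the discriminator performs at chance level, suggesting high utility, whereas values near 0 indicate that real and synthetic records are easy to tell apart.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def prediction_mse(real, synth, seed=0):
    """Mean squared error of a real-vs-synthetic discriminator's
    probability predictions on a holdout split of the merged data."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])  # 1 = synthetic
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    p = clf.predict_proba(X_test)[:, 1]
    return np.mean((p - y_test) ** 2)  # 0.25 ~= chance level for balanced classes
```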
To assess whether the utility metrics are useful, we evaluated whether they can accurately rank SDG methods on workload-aware metrics. This section describes these workload-aware metrics.
We built a logistic regression (LR) model for each data set. LR is common in health research, and a recent systematic review has shown that its performance is comparable to that of machine learning models for clinical prediction workloads [
We evaluated the prediction accuracy using 3-fold cross-validation. Accuracy was measured using the area under the receiver operating characteristic curve (AUROC) [
To assess the similarity between the AUROC and AUPRC for the real and synthetic data sets, we computed the absolute difference between them. This provides a measure of how similar the real results are to the synthetic results.
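A sketch of this computation follows. Pooling the cross-validated predictions before scoring (rather than averaging per-fold scores) and using average precision as the AUPRC estimate are simplifying assumptions; the study's exact cross-validation protocol may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def cv_auroc_auprc(X, y, seed=0):
    """3-fold cross-validated AUROC and AUPRC of a logistic regression model."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    p = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, p), average_precision_score(y, p)

# Workload-aware utility: absolute differences between the real-data and
# synthetic-data models (X_real/y_real and X_synth/y_synth are
# placeholder names for the two training sets).
# auroc_r, auprc_r = cv_auroc_auprc(X_real, y_real)
# auroc_s, auprc_s = cv_auroc_auprc(X_synth, y_synth)
# auroc_diff = abs(auroc_r - auroc_s)
# auprc_diff = abs(auprc_r - auprc_s)
```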
For each of the 30 real data sets, we generated 20 synthetic data sets. The utility metrics and the absolute AUROC difference and absolute AUPRC difference were computed on each of the 20 synthetic data sets, and each of these was averaged. Therefore, for each of the data sets, we had 1 average utility metric value for each of the 6 utility metrics, 1 average AUROC difference value, and 1 average AUPRC difference value. These values are tabulated in Appendix S3 and S4 in
The main hypothesis that we wanted to test was whether the utility metrics can be used to rank the SDG methods by their AUROC and AUPRC differences. The SDG methods were chosen to achieve representativeness, applicability, and variation.
Representativeness. The methods should reflect those that are often used in the community of practice and by researchers.
Applicability. The methods are those that an analyst would likely want to compare and select from to be consistent with our motivating use case.
Variation. The utility results among the chosen SDG methods should have variation sufficient for utility metrics to detect differences.
Three generative models were used: conditional GAN [
We used the Page test to determine whether the utility metric prediction was correct [
The null hypotheses we were testing are therefore that:
H0,AUROC: median(AUROC_Diff_L) = median(AUROC_Diff_M) = median(AUROC_Diff_H)
H0,AUPRC: median(AUPRC_Diff_L) = median(AUPRC_Diff_M) = median(AUPRC_Diff_H)
where the subscript indicates the group (L, M, and H denote the groups with the lowest, middle, and highest predicted utility, respectively). These null hypotheses were tested against the alternatives:
H1,AUROC: median(AUROC_Diff_L) ≥ median(AUROC_Diff_M) ≥ median(AUROC_Diff_H)
H1,AUPRC: median(AUPRC_Diff_L) ≥ median(AUPRC_Diff_M) ≥ median(AUPRC_Diff_H)
where at least one of the inequalities is strict. To compute the test statistic,
Because of the relatively small sample size, we used an exact test of statistical significance. The exact test also makes no distributional assumptions about the data and, for the number of data sets we have, gives us a high-powered test.
If the test is significant, then the broad utility metric can be used to correctly rank SDG techniques based on their workload (narrow) metrics. Since we were comparing multiple utility metrics, a Bonferroni adjustment was made to the α level of .05 to account for multiple testing.
The maximum attainable value of the L statistic with 30 data sets and 3 SDG methods is 420, which occurs when the observed ordering matches the hypothesized ordering in every data set.
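The sketch below shows how the Page L statistic can be computed from the observed differences; the exact P value calculation is omitted, and the column-ordering convention is an assumption chosen to match the alternative hypotheses above.

```python
import numpy as np
from scipy.stats import rankdata

def page_L(diffs):
    """Page L statistic for an (n_datasets, 3) array of observed AUROC
    or AUPRC differences. Columns must follow the hypothesized
    increasing order [H, M, L]: the H group (highest predicted utility)
    is expected to have the smallest difference and the L group the
    largest. Perfect agreement across 30 data sets yields L = 420."""
    ranks = rankdata(diffs, axis=1)    # within-data-set ranks, 1..3
    rank_sums = ranks.sum(axis=0)      # one rank sum per group
    positions = np.arange(1, diffs.shape[1] + 1)
    return float((positions * rank_sums).sum())
```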
Because each utility metric is expected to rank the SDG methods differently, we wanted to test whether an aggregate ranking would provide a better result than any of the individual utility metric rankings. We hoped to find an “ideal” ranking that has minimal distance to each of the individual rankings on the utility metrics. This can be performed for each data set separately, and then the ideal rankings across all the data sets would be evaluated on the Page test. The result would give us the performance of the aggregate ranking, and we can contrast that with the quality of individual utility metric rankings.
The distance we used is the Spearman footrule [
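Because only 3 SDG methods are being ranked, the footrule-optimal aggregate ranking can be found for each data set by brute force over the 6 candidate permutations, as in the sketch below (the function and variable names are illustrative).

```python
from itertools import permutations

def aggregate_ranking(rankings):
    """Footrule-optimal aggregate of several rankings of 3 SDG methods.

    Each ranking is a length-3 tuple; for example, (2, 1, 3) means the
    first method is ranked 2nd by that utility metric. Returns the
    candidate ranking minimizing the total Spearman footrule distance
    (the sum of absolute rank displacements) to all input rankings."""
    best, best_cost = None, float("inf")
    for cand in permutations((1, 2, 3)):
        cost = sum(sum(abs(c - r) for c, r in zip(cand, rk)) for rk in rankings)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

# Example: aggregate the rankings produced by the 6 utility metrics on one data set.
# aggregate_ranking([(1, 2, 3), (1, 3, 2), (2, 1, 3), (1, 2, 3), (1, 2, 3), (2, 1, 3)])
```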
Given that the
The results of the ranking of the SDG methods are shown in
The test statistic, the
Page test results for each of the utility metrics and prediction accuracy.

| Utility metric | AUROCa difference: test statistic | AUROCa difference: P value | AUPRCb difference: test statistic | AUPRCb difference: P value |
| --- | --- | --- | --- | --- |
| Maximum mean discrepancy | 384 | .00104c | 392 | <.001c |
| Hellinger distanced | 398 | <.001c | 409 | <.001c |
| Wasserstein distance | 392 | <.001c | 403 | <.001c |
| Cluster analysis | 396 | <.001c | 405 | <.001c |
| Propensity mean squared error | 390 | <.001c | 394 | <.001c |
| Prediction mean squared error | 396 | <.001c | 397 | <.001c |
| Aggregated | 400 | <.001c | 408 | <.001c |

aAUROC: area under the receiver operating characteristic curve.
bAUPRC: area under the precision-recall curve.
cStatistically significant at a Bonferroni-adjusted α level of .05.
dHighest metric on the test statistic.
The boxplots in
The relationship between the Hellinger distance versus the AUROC and AUPRC. The 3 SDG methods were ordered based on their relative Hellinger distance values into the “H,” “M,” and “L” groups. AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve; SDG: synthetic data generation.
The results for the aggregate ranking are shown in
The relationship between the aggregate ranking versus the AUROC and AUPRC. AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve.
The purpose of our study was to identify the most useful, broad generative model utility metrics. These are different from utility metrics calculated for a particular synthetic data set. Generative model utility characterizes the average utility across synthetic data sets that are produced from a generative model. Given the stochasticity of SDG, such utility metrics are more appropriate for evaluating, comparing, and selecting among SDG models on the same real data set. Single synthetic data set utility metrics, on the other hand, are useful for communicating synthetic data utility to a data user because these pertain to the particular synthetic data set that is being shared.
We performed our analysis using 3 types of generative models: a conditional GAN, a Bayesian network, and sequential decision trees. These 3 cover a broad cross-section of types of techniques that are used in practice, which would enhance the applicability and generalizability of the results.
In this study, we evaluated 6 different model-specific utility metrics to determine whether they can be used to rank SDG methods. This is a practical use case that reflects a decision that an analyst using SDG methods would want to make. For example, there are multiple SDG techniques that have been published in the literature, and our ranking results can help an analyst determine the one that would work best on their real data sets.
We defined workload-aware utility as the ability to develop binary or multicategory prediction models that have similar prediction accuracy, measured by the AUROC and the AUPRC, between the real and synthetic data sets. The construction of binary or multicategory prediction models is an often-used analytical workload for health data sets. We used logistic regression to compute the absolute AUROC and AUPRC differences on real and synthetic data sets.
Our results, based on an evaluation on 30 heterogeneous health data sets, indicated that all 6 evaluated utility metrics can rank SDG methods well. However, the multivariate Hellinger distance computed over the Gaussian copula has a slight advantage in that it provides a better utility ordering. Further examination of an aggregate ranking using multiple utility metrics showed only a negligible difference from the results of the Hellinger distance for the AUROC metric, and therefore the simplicity of a single utility metric would be preferred.
Our results would allow a researcher or analyst to select the SDG method with the highest utility defined in a narrow sense. However, maximum utility does not imply that the privacy risks are acceptably low. As there is a trade-off between utility and privacy, higher utility will increase the privacy risks as well. Therefore, when evaluating SDG methods, it is important to also consider the privacy risks.
Now that we have validation evidence for a broad utility metric, it can be combined with a privacy metric to provide an overall ranking of SDG methods. For example, membership disclosure metrics for generative models [
An analyst may need to make other kinds of decisions, such as evaluating different SDG models for the purpose of hyperparameter tuning. Our study did not evaluate that specific use case, and therefore we cannot make broader claims that the Hellinger distance metric is suitable for other use cases.
Our study was performed by averaging the broad and narrow utility across 20 synthetic data sets (iterations). A larger number of iterations was evaluated (50 and 100), and we noted that the differences were not material. We opted to present the smaller number of iterations as these still give us meaningful results and would be faster computationally for others applying these results.
Detailed SDG method descriptions, dataset descriptions, and detailed analysis results. SDG: synthetic data generation.
AUPRC: area under the precision-recall curve
AUROC: area under the receiver operating characteristic curve
GAN: Generative Adversarial Network
LR: logistic regression
pMSE: propensity mean squared error
SDG: synthetic data generation
This study uses information obtained from www.projectdatasphere.org, which is maintained by Project Data Sphere, LLC. Neither Project Data Sphere, LLC nor the owner(s) of any information from the website has contributed to or approved or is in any way responsible for the contents of this study. This research was enabled in part by support provided by Compute Ontario (computeontario.ca) and Compute Canada (
This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the Children’s Hospital of Eastern Ontario Research Institute. KEE is co-founder and has equity in this company. LM and XF are data scientists employed by Replica Analytics Ltd.