1. Introduction
Health care research is often a collaborative, or even international, effort. This may require the exchange of data, so that an analyst can combine or compare data from different sources. When these data contain personal and sensitive information, privacy regulations and concerns may restrict data sharing, which can limit research opportunities. An increasingly popular privacy-enhancing solution is to share a synthetic version of the original data [
1]. By analyzing artificial data instead of the original health care records, the privacy of the included individuals can be protected [
2]. Synthetic data can be generated for a variety of purposes, such as machine learning, software testing, education, and data retention (see [
3,
4] for an overview of use cases).
Typically, synthetic data generation involves creating a
model of the data, which describes the associations and patterns that should be reproduced with artificial individuals. These individuals are generated by sampling records from the model. Multiple approaches have been proposed; frequently used methods include Conditional Tabular Generative Adversarial Networks (CTGAN) [
5], Synthpop [
6], and PrivBayes [
7] (see [
8] for an overview).
Most synthetic data approaches have been evaluated using publicly available, real data (such as data from the UCI Machine Learning repository [
9]) and several have been applied to use cases as well [
10]. For example, [
11] demonstrate the data quality (
utility) of synthetic COVID-19 data generated with sequential trees. Although some of these applications show that analyses can be reproduced, it remains challenging to gain insight into privacy risks [
3]. Assessing the magnitude of these risks is critical because synthetic data methods can leak sensitive information with the generated data [
12].
Differential privacy [
13] is a frequently used approach in addressing this topic, which has been implemented in combination with synthetic data generators (e.g., [
14,
15]). The differential privacy framework can be used to bound the impact that one individual has on the outcome distribution of the model (i.e., synthetic data) by introducing noise. As a result, strong privacy guarantees can be made, which is one of the reasons that the approach has gained considerable popularity [
16]. It remains up for debate, however, how much noise is sufficient to preserve privacy [
17]. Although setting an appropriate threshold is a common challenge for privacy frameworks, this is a particular complication for differential privacy, because the parameters are known to be difficult to interpret [
18]. In addition, several articles (e.g., [
19,
20]) show that differentially private noise can result in detrimental effects on synthetic data quality or may not succeed in avoiding privacy inference attacks.
Methods without differential privacy do not impose explicit privacy-preserving techniques on the model; instead, the synthetic data are expected to preserve privacy because the generated records do not correspond to real people. Empirical privacy metrics can be used to quantify any remaining risks, such as attribute inference and membership inference [
19,
21,
22]. These types of privacy measures do not directly assess whether the model has memorized specific, personal information, but make an approximation by considering the generated data. As a result, the outcome is subject to randomness in the synthetic data and there is a risk of misestimating privacy leakage.
Recently, Cluster-Based Synthetic Data Generation (CBSDG) has been introduced, which sets itself apart from other available approaches by (1) incorporating interpretable privacy-preserving techniques into the model and (2) generating a tractable model with which privacy risks can be measured [
23]. CBSDG creates clusters of similar individuals, typically containing 25 individuals, and removes all relations between the variables within these clusters. Similar to anonymization techniques such as
k-anonymity [
24] or
l-diversity [
25], this causes individuals to become indistinguishable from one another within a group of peers. At the same time, relations between variables can be preserved with the number, size and location of clusters, e.g., a cluster with females taller than 2 m will typically be smaller than a cluster with males over 2 m.
CBSDG has been subject to extensive experiments with simulated data, which demonstrated that the algorithm can generate high quality, privacy-preserving data [
23]. In this controlled setting, several aspects of the data could be varied (types of data and sample size), but complex data features found in real data were not investigated. Contrary to other synthetic data generators, CBSDG has not been tested on real data, and further analysis is required to assess its practical potential.
Considering that CBSDG is unique in generating synthetic data with tractable privacy preservation, it is meaningful to research whether the performance observed on simulated data also holds for real data. The current work investigates this by replicating a published blood-transfusion analysis by Vinkenoog et al. (2022) [
26]. The data used reflect a wider variety of challenges, such as heavily skewed binary variables and multimodal continuous distributions. Furthermore, we take a deeper look into the quality of synthetic data than in [
23], by assessing the preservation of intricate relations between variables. The objective of this study is to evaluate the practical potential of CBSDG by assessing whether a prediction model trained with synthetic data has the same predictive power as the model trained with real, personal donor data, and to what extent privacy is preserved.
2. Materials and Methods
To evaluate the CBSDG on real, personal data, an analysis of blood-transfusion data published by Vinkenoog et al. (2022) [
26] was replicated with synthetic data. We describe the blood-transfusion analysis, research question and data used in the first subsection. In the second subsection, we describe how the data were synthesized. The third subsection outlines the measures used to quantify how the original results by Vinkenoog et al. [
26] differ from the replication with synthetic data as a measure of utility. In the final subsection, the privacy measures used are described.
2.1. Original Analysis by Vinkenoog et al. (2022): Predicting Deferral of Blood Donors
The Dutch blood bank tests donor hemoglobin (Hb) before each donation, so that donors with low Hb levels can be deferred. This protects donor health and ensures the quality of the provided transfusions. Because on-site donor deferral is a demotivating experience for the donor as well as an unnecessary waste of resources, Vinkenoog et al. [
26] trained machine learning models (specifically support vector machines, SVMs, see below) to predict whether donors will be deferred. Using these predictions, donors with a high probability of having a low Hb can be deferred on paper and requested to return at a later time, when their Hb is expected to be sufficiently high.
Vinkenoog et al. [
26] performed the analysis with 250,729 donation records from the Dutch blood bank. The donation records contained the variables Sex, Age (years), Time of day of the donation (hours), Month of donation, Last ferritin level (a protein that indicates iron storage, measured in ng/mL), Time since last ferritin measurement (days), Previous Hb level (mmol/L), and Time since previous Hb measurement (days). Additionally, the donation records contained the binary variable Deferral Status, i.e., whether a donor's Hb was sufficiently high for donation or not. The Dutch blood bank defines a sufficiently high Hb as above 7.8 mmol/L for females and 8.4 mmol/L for males.
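These sex-specific thresholds amount to a simple deferral rule; a minimal sketch (the function name and sex coding are our own illustration, not taken from the original analysis code):

```python
def is_deferred(sex: str, hb_mmol_per_l: float) -> bool:
    """Deferral rule used by the Dutch blood bank: Hb is sufficiently high
    only when above 7.8 mmol/L for females and 8.4 mmol/L for males."""
    threshold = 7.8 if sex == "F" else 8.4
    return hb_mmol_per_l <= threshold  # not above the threshold -> deferred

is_deferred("F", 7.5)  # → True: below the 7.8 mmol/L female threshold
```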
Predictions were made with two SVMs: one for male donors and one for female donors. Donations from 2017 up to 2020 served as the
training data. The SVMs predict Deferral Status, based on the predictors Time, Age, Month, Last ferritin, Time since last ferritin measurement, Previous Hb level, and Time since previous Hb measurement. Their analysis code is available online (
https://zenodo.org/record/6938113#.Y67At3bMJsY, accessed on 19 January 2022). A separate
test data set was used to assess whether donor Hb levels for donations in 2021 could be correctly predicted. This data set contained the same variables for 183,867 donations. Predictive performance was quantified by computing precision and recall scores. Precision is the proportion of deferred donors among donors predicted to be deferred, while recall is the proportion of donors correctly predicted to be deferred among donors who were in fact deferred. The SVMs achieved precision scores of 0.997 and 0.987 and recall scores of 0.739 and 0.637 for males and females, respectively.
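The precision and recall definitions above can be computed directly from binary deferral labels; a minimal sketch (the function name is our own):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = deferred).

    precision = TP / (TP + FP): truly deferred among those predicted deferred.
    recall    = TP / (TP + FN): predicted deferred among the truly deferred.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])  # → (1.0, 0.666…)
```

With heavily imbalanced deferral data, recall is the harder score to achieve: a model that never predicts deferral has undefined precision and zero recall.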
2.2. Cluster-Based Synthetic Data Generation
The CBSDG algorithm previously described by Kroes et al. (2023) [
23] was used to generate synthetic versions of the donor data. By constructing a two-layered MSPN, clusters of similar individuals are created, while assuming that the variables are independent within clusters [
23]. That is, the probability distribution within a cluster is modeled with separate histograms, one for each variable. As a result, the information regarding which values from different variables originally belonged to the same record is removed from the model. This impedes opportunities to extract sensitive information using background information (i.e., known values on other variables). At the same time, the relations between variables can still be modeled, because more likely value combinations occur in more, or in larger, clusters.
Synthetic data are generated by sampling from the MSPN. Sampling a synthetic record can be seen as a two-step process: (1) sampling one of the clusters (where the probability of sampling a cluster is proportional to its size), and (2) sampling a value for each variable by sampling from each histogram for the sampled cluster.
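The two-step sampling procedure can be sketched as follows. The cluster weights and per-variable histograms below are made-up toy values, not the donor data or the actual MSPN data structures:

```python
import random

# Toy model: each cluster has a weight (its relative size) and, per variable,
# a histogram of (value, probability) pairs. Variables are independent within
# a cluster, so each is sampled from its own histogram.
clusters = [
    {"weight": 0.7,
     "hist": {"sex": [("M", 0.9), ("F", 0.1)],
              "age": [(30, 0.5), (40, 0.5)]}},
    {"weight": 0.3,
     "hist": {"sex": [("F", 1.0)],
              "age": [(50, 1.0)]}},
]

def sample_record(rng: random.Random) -> dict:
    # Step 1: sample a cluster proportional to its size.
    cluster = rng.choices(clusters, weights=[c["weight"] for c in clusters])[0]
    # Step 2: sample each variable independently from the cluster's histogram.
    record = {}
    for var, hist in cluster["hist"].items():
        values, probs = zip(*hist)
        record[var] = rng.choices(values, weights=probs)[0]
    return record

rng = random.Random(0)
synthetic = [sample_record(rng) for _ in range(5)]
```

Note how the sex-age relation survives despite within-cluster independence: the second cluster pairs "F" predominantly with age 50 simply because that value combination has its own cluster.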
We created synthetic records based on the training data (i.e., records corresponding to donations between 2017 and 2020) using the MSPN_privacy source code by Kroes et al. (2023) [
23] in Python 3.8, which includes an altered version of the spflow package by Molina et al. (2019) [
27,
With the function anonymize_data, a two-layered MSPN was constructed with 4000 clusters. We chose this number to achieve a sizable number of donors per cluster (62 on average). We generated 50 synthetic data sets by sampling records from the MSPN. One synthetic data set was generated in 6 min on a machine with 1.5 TB RAM and 16 Intel Xeon E5-2630v3 CPUs @ 2.40 GHz. All code is available at
https://github.com/ShannonKroes/CBSDG_application (accessed on 11 November 2023).
2.3. Utility Evaluation
The process for utility evaluation is visualized in
Figure 1. A synthetic version of the training data (2017 up to 2020) is generated with CBSDG. Using the synthetic data set, the analyses that Vinkenoog et al. [
26] performed on the real training data (i.e., the original data) are performed with the same hyperparameter choices as in the original analyses. Hb deferral predictions are made for donations from 2021 using SVMs based on the original data and based on the synthetic data. Utility is evaluated by comparing these predictions. The donations from 2021 were not used to train the SVMs, nor to construct the MSPN. The process is repeated 50 times (leading to 50 MSPNs, 50 synthetic data sets, and 50 corresponding sets of SVMs). The objective is to assess the extent to which predictions made by SVMs are expected to differ when using synthetic instead of real, personal donor data.
For both sets of predictions, predictive performance is quantified by computing precision and recall scores. The latter is particularly important, because less than 2.5% of donors are deferred for low Hb. Because the analyses are performed 50 times (once for each synthetic data set), the distributions of precision and recall demonstrate the stability of predictive performance over different synthetic data sets.
To convey in greater detail to what extent the variable relations were represented by the synthetic data, we also evaluate how the separate predictors impacted the modeled probability of deferral. Vinkenoog et al. (2022) [
26] used SHAP values (SHapley Additive exPlanations [
29]) for 100 donation records from 2021, to quantify and visualize the impact of every predictor, using the Python package shap. This analysis is repeated for the SVMs for all synthetic data sets. These SHAP values indicate, for each of these 100 donors, how much the predictors impacted their probability of Hb deferral, on a scale of −1 to 1. Specifically, the SHAP value pertaining to a particular predictor and donation record approximates how the probability of deferral would differ if the value on this predictor had been missing, averaged over all possible scenarios in which the value could be missing [
29].
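This averaging over all scenarios in which a value could be missing is the Shapley value from cooperative game theory. The sketch below computes exact Shapley values for a hypothetical additive three-predictor "model"; it illustrates the definition only and is not how the shap package approximates SHAP values for SVMs:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's marginal contribution to
    value_fn, averaged over all subsets of the other features with the
    classic weight |S|! (n-|S|-1)! / n!."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(set(subset) | {f})
                                   - value_fn(set(subset)))
        phi[f] = total
    return phi

# Hypothetical additive model: deferral probability rises by a fixed amount
# for each predictor whose value is "known" (contributions are made up).
contrib = {"prev_hb": 0.30, "ferritin": 0.15, "age": 0.05}
def value_fn(present):
    return 0.02 + sum(contrib[f] for f in present)

phi = shapley_values(list(contrib), value_fn)
```

For an additive model like this, each Shapley value equals the feature's own contribution, and the values sum to the difference between the full-model and empty-model outputs (the efficiency property).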
2.4. Privacy Evaluation
We compute attribute disclosure risk, using the measures presented in Kroes et al. (2021) and Kroes et al. (2023) [
22,
23]. We assume a scenario where an attacker has background information on eight out of nine variables. In this scenario, the attacker is interested in the ninth variable, referred to as the sensitive variable. Every variable is iteratively considered to be sensitive. The extent to which the attacker can accurately infer the true value of the sensitive variable, using both their background information and the synthetic data set, is used as a measure of privacy. The attacker is assumed to target (and have information on) one specific donor: the donor of interest.
The attacker identifies all individuals who match the background information. These individuals are referred to as peers. For example, to evaluate privacy for the variable Time, it is assumed that the attacker will search for donors with the same month of donation, age, sex, last ferritin and Hb levels, time of these measurements, and deferral status as the donor of interest. Among these peers, the attacker will assess which sensitive values occur, i.e., the distribution of Time among peers. If this distribution only contains one value, or a very narrow set of values, the attacker is able to draw precise conclusions about the sensitive value of the donor of interest and privacy could be breached.
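The peer-identification step can be sketched with toy records (all variable names and values are illustrative):

```python
def peer_values(records, background, sensitive):
    """Values of the sensitive variable among records that match the
    attacker's background information on all other variables (the peers)."""
    return [r[sensitive] for r in records
            if all(r[k] == v for k, v in background.items())]

# Toy synthetic records; the attacker knows everything except `time`.
records = [
    {"sex": "F", "age": 34, "deferred": 0, "time": 9},
    {"sex": "F", "age": 34, "deferred": 0, "time": 17},
    {"sex": "M", "age": 52, "deferred": 1, "time": 11},
]
background = {"sex": "F", "age": 34, "deferred": 0}
peer_values(records, background, "time")  # → [9, 17]
```

Here the peer distribution contains two distinct times, so the attacker cannot pin down the donor's donation time; had all peers shared one value, privacy could be breached.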
For unordered variables (Sex and current Hb), we compute the Proportion of Alternatives Considered (PoAC) [
22]. This assesses whether (synthetic) peers all have the same sensitive information (in which case the PoAC equals 0) or whether both values are probable (in which case the PoAC equals 1). The PoAC is either 0 or 1 because both unordered variables are binary; a value of 1 indicates that both sexes or both Hb deferral statuses are probable.
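A toy reading of the PoAC for a binary variable (an illustration only, not the formal definition from [22]):

```python
def poac_binary(peer_values, domain=(0, 1)):
    """Toy PoAC: the proportion of alternative values that remain probable
    among peers. 0 means only one value occurs and the attacker can infer
    the sensitive value; 1 means both binary values remain probable."""
    observed = set(peer_values) & set(domain)
    return (len(observed) - 1) / (len(domain) - 1)

poac_binary([0, 0, 0])  # → 0: every peer is not deferred, inference succeeds
poac_binary([0, 1, 0])  # → 1: both deferral statuses occur among peers
```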
For ordered variables (all seven remaining variables) we use the Expected Deviation (ED) to quantify how far off the
distribution of sensitive values among peers is from the true sensitive value of the donor of interest. Similar to the PoAC, a value of 0 indicates that all probable values equal the true sensitive value and privacy can be breached. An important note is that the ED is on the scale of the variable under evaluation. For example, if the average ED (among all donors) equals 5 for age, it means that when an attacker attempts to extract the age of a donor, on average, the inferred age will differ by 5 years from the true age of the donor. For more details on these measures, please refer to Kroes et al. (2023) [
23].
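One plausible reading of the ED, on the variable's own scale, is the average absolute difference between the sensitive values found among peers and the donor's true value; the sketch below uses this simplified reading, not the exact formula from [23]:

```python
def expected_deviation(peer_values, true_value):
    """Toy ED: mean absolute difference between the peers' sensitive values
    and the donor's true value. 0 indicates a possible privacy breach;
    larger values mean the attacker's inference is further off, on the
    scale of the variable itself."""
    return sum(abs(v - true_value) for v in peer_values) / len(peer_values)

expected_deviation([30, 35, 40], 35)  # ≈ 3.33 years off, on average
```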
The measures can be applied to both original and synthetic data. In the original data, it is straightforward to identify peers of the donor of interest; every donor has at least one peer (themselves). In the synthetic data, the specific combination of background-information values of the donor of interest may well not occur in the data set. This does not mean that it is impossible to generate this combination of values, however. Furthermore, if the combination of values does occur, it is of interest to know whether this is due to chance or due to overfitting by the model [
12].
Because the MSPN is a tractable model, the privacy measures can be computed for the model, instead of for a single synthetic data set. For the PoAC we can directly compute the probability of sensitive values, given the background information, using the log_likelihood function from the Python package spflow [
28]. To compute the ED, synthetic peers can be generated with conditional sampling, so that the distribution of the sensitive variable among them can be approximated.
We selected the first donation of 8000 donors. For every donor, we compute the PoAC or ED value, depending on whether the variable under consideration is unordered or ordered. If a value is 0, privacy can be breached, whereas values above zero are considered to indicate protection against privacy breaches. The analyses were performed for one MSPN chosen at random.
4. Discussion
CBSDG is a synthetic data approach that is unique in providing tractable and interpretable privacy preservation built into the model, which makes its practical potential worthwhile to study. To this end, a published blood-transfusion data analysis was replicated with synthetic data in this article. The analyses performed complement the previously published experiments in [
23] by considering real data with features on which the method had not yet been evaluated.
Blood-transfusion records were synthesized using data from the Dutch national blood bank. CBSDG produced similar univariate distributions, including multimodal continuous variables and unbalanced binary data. To evaluate the similarity of variable relations, the prediction of Hb deferral of blood donors by Vinkenoog et al. (2022) [
26] was replicated with the synthetic data. The predictions were largely identical to the results with the original data, and precision and recall scores were highly similar as well.
Furthermore, visualization of variable importance showed that the results with synthetic and original data were alike, also when differentiating between sexes. This indicates that both the predictions themselves and the reasoning behind these predictions were comparable. Additionally, variation in precision and recall was low, indicating that the approach consistently performed well across different synthetic data sets. With respect to privacy, for values on a continuous scale, an attacker is not expected to be able to infer sensitive values with 100% accuracy. For categorical values, the probability of correct inference was not larger for donors who were used to generate the synthetic data than for donors who were not part of the synthetic-data-generation process.
The synthetic data would have led to predominantly the same deferral predictions with minimal to no loss in accuracy. Furthermore, with the synthetic data, similar insights could have been acquired regarding associations between Hb deferral and specific donor characteristics, also when differentiating between male and female donors. This shows that the synthetic data sets have potential to substitute sensitive data in machine learning analyses, which are increasingly being used for blood-transfusion research [
32].
Compared to other synthetic data generators, CBSDG has several advantages. First, the model (a restricted MSPN) requires little knowledge from users and no parametric distributions are assumed (as opposed to e.g., [
33]), and no (hyper)parameter settings are required other than the number of clusters. In addition, the way in which clusters yield privacy is explainable, which is an increasingly important factor in the development of privacy-enhancing technologies [
34]. This allowed for a tractable privacy analysis that evaluates which relations have been learnt by the model.
The privacy analyses demonstrated that the specified attack scenario would not be successful for continuous variables. For categorical variables, however, attribute information could be inferred by an attacker with sufficient background information. An additional analysis with donors who were not part of the training set yielded similar results for these variables (Deferral Status and Sex). In other words, being part of the training data for synthetic data generation was not associated with a significantly increased risk of leaking sensitive information for categorical variables.
This demonstrates that opportunities for attribute linkage can be explained (at least in part) by properties of the data and do not necessarily result from overfitting by CBSDG. One such property could be the imbalance in Deferral Status, where guessing a negative deferral status is already accurate for 98% of the donors (even for attackers without background information). Furthermore, several variables were highly correlated, e.g., it is well-known that there is a strong relation between sex and ferritin level [
35]. Men have a higher ferritin level on average and information on this variable alone could be enough to infer a donor’s sex. This also relates to the low expected deviation for Previous Hb, where an attacker’s guess on a donor’s Hb will be off by ‘only’ 0.69 mmol/L.
Apart from data characteristics there remain other factors that can lead to differences between privacy risks for categorical and continuous values. Categorical variables generally take on fewer unique values, which already increases the probability of correct inference. In addition, with CBSDG, histograms are used to model the continuous distributions within clusters, through which different values are grouped together in histogram bins. This is a form of
generalization: an anonymization technique used to increase privacy [
24]. For categorical variables, the probability of each value is described separately and privacy is not increased with generalization.
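The generalization effect of histogram binning can be illustrated with a small sketch: once continuous values are reduced to equal-width bins, values within the same bin become indistinguishable (the binning scheme here is our own simplification, not the exact one used by the MSPN):

```python
def bin_values(values, n_bins):
    """Replace each continuous value by the (low, high) edges of its
    equal-width histogram bin; within a bin, values are indistinguishable,
    which generalizes the data and increases privacy."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max into the last bin
        bins.append((lo + i * width, lo + (i + 1) * width))
    return bins

# Hypothetical Hb values: 7.9 and 8.1 fall in one bin, 8.6 and 9.0 in another.
bin_values([7.9, 8.1, 8.6, 9.0], n_bins=2)
```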
In the privacy analyses, it was assumed that adversaries had information on all eight remaining variables to mimic a strong attacker. This scenario was selected in order to generate generalizable analysis results that can be related to situations with less powerful attackers as well. On the other hand, this may have resulted in a scenario with an unrealistically knowledgeable attacker. For example, donors generally do not know their exact ferritin value, and therefore an attacker is unlikely to have this background information as well.
Nevertheless, there may be a desire to further increase privacy. One solution could be to modify the clustering such that more dissimilar individuals are combined. Future research is needed to determine how this impacts utility, as clustering more dissimilar individuals may decrease the extent to which correlations are represented. Explicit parameterization of this property (i.e., the similarity of individuals within clusters) would enable users to navigate the privacy-utility trade-off more effectively. Furthermore, there are attack scenarios that are not covered by the analysis (such as membership inference attacks), which require further investigation.
Compared to the simulations previously performed by Kroes et al. (2023) [
23], an important additional step for practical implementation is data selection. Deciding which version of the data will be synthesized can be an impactful choice, which may require either experience or guidance by an expert. Essentially, the data set goes through different forms during pre-processing, and these can differ in the extent to which they are suitable for synthetic data generation. In general, we recommend synthesizing as few variables as possible, considering that the number of inter-variable associations increases factorially with the number of variables. Moreover, synthesizing more variables creates more opportunities for potential attackers, which may be associated with unnecessary and unjustifiable privacy risks. Naturally, a suitable selection of variables and records remains dependent on the analyses to be performed with the data.
Finally, we want to point out that the donation data are well-matched with the cluster-based synthetic data generator, since they contain few variables and a sizable set of records. CBSDG is expected to be less suitable for high-dimensional data, since modeling many variable dependencies would require a large number of clusters. In these cases, the model may leak private information because the number of clusters required is disproportionately high. Similar experiments can be performed with other types of data and (machine learning) analyses to gain more insight into the expected utility of cluster-based synthetic data in different applications. Although the performed analysis is not generalizable to all types of data, the experiments do show how CBSDG performs for a wider assortment of data types, with more thorough utility measures than previously evaluated.