3.2.1. Marginal Histogram Release
Differentially private marginal histograms are a privacy-preserving method for publishing and sharing data. They allow users to access statistical information about the distribution of the data while protecting the privacy of individual records. A marginal histogram is a graphical representation of a data distribution that shows the frequency or probability of the data over a range of values. The goal of publishing differentially private marginal histograms is to provide a rough picture of the data distribution without exposing any specific individual's data [9], as described by Figure 2 below.
The specific steps are as follows:
Step 1. Select four features: AGE, BMI, LIVER SIZE, and GFR.
Step 2. Divide the data into 30 brackets, then calculate and draw the marginal histogram of each feature.
Step 3. Use the Laplace mechanism to add noise to the frequency of each feature’s marginal histogram, and use an average allocation strategy for privacy budgeting.
Step 4. Draw the original marginal histogram and differential privacy protection histogram under different privacy budgets for comparison.
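Steps 1–3 can be sketched as follows. This is a minimal illustration, not the study's code: the feature values are simulated stand-ins, and the total privacy budget of 0.4 is an assumed value chosen only to make the even split yield ε = 0.1 per feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_marginal_histogram(values, bins=30, epsilon=0.1):
    """Release one feature's marginal histogram under epsilon-DP.

    Each record falls into exactly one bin, so the L1 sensitivity of the
    bin-count vector is 1 and Laplace(1/epsilon) noise per bin suffices.
    """
    counts, edges = np.histogram(values, bins=bins)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise, edges

# Hypothetical data standing in for the four selected features.
features = {
    "AGE": rng.uniform(1, 18, 1000),
    "BMI": rng.normal(20, 5, 1000),
    "LIVER SIZE": rng.normal(10, 2, 1000),
    "GFR": rng.normal(100, 20, 1000),
}

# Average-allocation strategy: split the total budget evenly across features,
# since the four releases compose sequentially over the same records.
total_epsilon = 0.4
eps_per_feature = total_epsilon / len(features)
releases = {name: noisy_marginal_histogram(v, bins=30, epsilon=eps_per_feature)
            for name, v in features.items()}
```

The even split is the simplest budgeting strategy; unequal allocations could favor attributes where accuracy matters most.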
The histograms in Figure 3 and Figure 4 are obtained according to the above steps.
Similar to the analysis in the previous section, Figure 3 and Figure 4 indicate that when ε = 0.1, the added noise disturbs the data to some extent, while the distribution of the differentially private data remains basically consistent with the original distribution, effectively protecting individual privacy.
To evaluate the quality of the publication, the mean squared error (MSE) formula [10],
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\tilde{h}_i - h_i\right)^2, \quad (3)$$
where $h_i$ and $\tilde{h}_i$ denote the original and noisy counts of the $i$-th bin, is used to measure the difference between the differentially private histogram and the original histogram, with the results shown in Table 4 below.
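As a hedged sketch of how such MSE values can be computed, the following compares one simulated feature's bin counts before and after Laplace noise at ε = 0.1 (the bin counts here are random placeholders, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(1)

def histogram_mse(original_counts, noisy_counts):
    """Mean squared error between original and privatized bin counts."""
    original = np.asarray(original_counts, dtype=float)
    noisy = np.asarray(noisy_counts, dtype=float)
    return np.mean((original - noisy) ** 2)

# Toy example: one feature, 30 bins, Laplace noise at epsilon = 0.1.
counts = rng.integers(0, 100, size=30)
noisy = counts + rng.laplace(scale=1.0 / 0.1, size=30)
print(f"MSE = {histogram_mse(counts, noisy):.3f}")
```

Since each bin's noise has variance 2/ε², the expected MSE here is 2/0.1² = 200, independent of the data itself.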
From Table 4, the AGE attribute has the lowest MSE, and the GFR attribute also has a relatively low MSE. This indicates that, under the same privacy budget, these attributes can be published with lower error while remaining protected. However, the MSEs of the BMI and LIVER SIZE attributes are slightly higher under the same privacy budget, requiring a careful balance between privacy protection and data accuracy.
3.2.2. Composite Dataset Publishing
Synthetic data publishing infers the underlying distribution structure of the raw data through generative models and generates data with similar statistical characteristics. Compared with differentially private histogram publishing, which may distort the data, synthetic data publishing can more effectively protect the privacy of data subjects and capture the relationships between different attributes, improving the utility of the data. The key to this release is to construct a reasonable range of synthetic data queries. We adopt the approach of constructing a composite dataset that follows the data distribution of the original dataset to protect privacy [11,12]. The specific steps are as follows:
Step 1. Design a synthetic representation of answer range queries for a column of raw data showing the data distribution with a histogram.
Step 2. The count results in the histogram are used as a synthetic representation of the raw data.
Step 3. Laplace noise (2) is added individually to each count value in the histogram.
Step 4. The synthetic representation is used as a probability distribution function for sampling to generate the synthetic dataset.
Step 5. A new sample is generated based on the probability values.
Step 6. The generated distribution is compared with the distribution of the original dataset.
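Steps 1–5 above can be sketched as follows. This is a minimal illustration under assumed parameters (the BMI column is simulated and ε = 1.0 is an arbitrary example budget), not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def synthesize(values, bins, epsilon, n_samples):
    """Generate a synthetic column by sampling from a noisy histogram.

    Histogram counts serve as the synthetic representation; Laplace noise
    is added to each count; the noisy counts are normalized into a
    probability distribution; bin centers are then sampled from it.
    """
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)        # probabilities must be >= 0
    probs = noisy / noisy.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    return rng.choice(centers, size=n_samples, p=probs)

# Hypothetical BMI column standing in for the children's health dataset.
bmi = rng.normal(20, 5, 2000)
synthetic_bmi = synthesize(bmi, bins=30, epsilon=1.0, n_samples=2000)
```

Clipping negative noisy counts to zero before normalizing is one simple post-processing choice; since post-processing cannot weaken differential privacy, the sampled dataset inherits the ε-DP guarantee of the noisy counts.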
The results of these specific queries are as follows.
Firstly, range queries under the BMI and systolic blood pressure (SBP) attributes are designed to obtain the number of people suffering from obesity, with a BMI of 28–60 kg/m², and hypertension, with an SBP of 140–200 mmHg.
Defining the count value of each BMI between 28 and 60 as a histogram query and applying the range query to calculate the number of people per BMI value yields Figure 5. Similarly, defining the count value of each SBP between 140 and 200 as a histogram query and applying the range query to calculate the number of people per SBP value yields Figure 5.
Secondly, to answer each range query, the BMI and SBP counts of all people who fall within the corresponding range are summed. The results show that 466 and 17 people meet the above ranges for BMI and SBP, respectively.
Thirdly, according to the parallel composition property of differential privacy and Equation (1), the synthetic representation satisfies ε-differential privacy, and the results after noise addition are 465.990 and 16.524, respectively.
Fourthly, the composite representation is treated as a probability distribution function that estimates the potential distribution of the original data, and this distribution is then sampled to obtain a composite dataset. Comparing Figure 5 and Figure 6, the probability distributions of the BMI and SBP estimates are similar to the probability distribution function of the original dataset.
The composite dataset is obtained, as shown in Table 5.
The original dataset is then replaced with the generated synthetic dataset, and queries are answered based on it. The histograms of BMI and SBP for the synthetic dataset have the same shape as those of the original dataset, as shown in Figure 7.
Next, to evaluate the quality of the synthesized data, the MSE Formula (3) is used, and the results are shown in Table 6.
Table 6 clearly shows that the synthetic data are more usable while protecting data privacy. Thus, the MSE can guide the choice of the synthetic data generation mechanism and control the cost of privacy protection.
Further, to evaluate query accuracy, two error rates are defined, as follows:
$$E_{\mathrm{Lap}} = \frac{|\tilde{q} - q|}{q},$$
where $E_{\mathrm{Lap}}$ denotes the Laplace mechanism error rate, $q$ denotes the true query answer, and $\tilde{q}$ denotes the Laplace-noised answer;
$$E_{\mathrm{syn}} = \frac{|s - o|}{o},$$
where $E_{\mathrm{syn}}$ denotes the synthetic representation error rate, $s$ denotes the statistical characteristics of the synthesized data, and $o$ denotes the statistical characteristics of the original data.
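A minimal sketch of the two error rates, assuming they are relative errors against the true statistic (the function names are mine, and the counts plugged in are the ones reported in the text):

```python
def laplace_error_rate(noisy_answer, true_answer):
    """Relative error of a Laplace-noised range-query answer."""
    return abs(noisy_answer - true_answer) / true_answer

def synthetic_error_rate(s, o):
    """Relative deviation of a synthetic-data statistic s from the
    original-data statistic o."""
    return abs(s - o) / o

# Counts reported in the text: 466 people in the BMI range (noisy: 465.990)
# and 17 people in the SBP range (noisy: 16.524).
print(laplace_error_rate(465.990, 466))  # small relative error for the large count
print(laplace_error_rate(16.524, 17))    # larger relative error for the small count
```

The two printed values illustrate why the same noise scale hurts small counts more: the absolute Laplace noise is comparable, but the relative error grows as the true count shrinks.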
In Table 7, a smaller error rate indicates more accurate queries, and the Laplace mechanism has mostly smaller errors. However, for queries with a larger range, the query cost increases and the error grows significantly. In practical applications, Laplace queries are typically used for small samples and budget-constrained queries, while composite-data queries are used for large samples and large-scale queries.
3.2.3. Differential Privacy Protection for Machine-Learning Algorithms
This section explores the application of the logistic regression model with differential privacy to the children's health dataset. In 2011, Chaudhuri et al. [7] established a differential privacy theory (CDP for short) based on objective perturbation, adding linear noise to the loss function, and applied it to the logistic regression model. Inspired by their study, Mi [13] added quadratic noise to the loss function to obtain a differential privacy protection model (MDP for short), then applied it to the classification task of children with fatty liver based on the BMI attribute of the dataset.
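The objective-perturbation idea behind CDP can be sketched as follows: a random linear term b·w/n is added to the regularized logistic loss, so the minimizer itself is randomized. This is an illustrative sketch only, with simulated two-feature data, assumed hyperparameters (λ, learning rate, noise scale), and plain gradient descent; it is not the CDP or MDP implementation evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def dp_logreg_objective_perturbation(X, y, epsilon, lam=0.1, lr=0.1, steps=500):
    """Objective-perturbation sketch for logistic regression.

    Assumes rows of X have norm <= 1 and labels y are in {-1, +1}.
    The noise vector b has a uniform random direction and a
    Gamma-distributed norm, following the objective-perturbation recipe.
    """
    n, d = X.shape
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    b = direction * rng.gamma(shape=d, scale=2.0 / epsilon)

    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of mean logistic loss + L2 regularizer + linear noise term.
        grad = (-(X * (y * sigmoid(-margins))[:, None]).mean(axis=0)
                + lam * w + b / n)
        w -= lr * grad
    return w

# Hypothetical two-feature toy data standing in for the BMI-based task.
X = rng.normal(size=(200, 2))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ||x|| <= 1
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = dp_logreg_objective_perturbation(X, y, epsilon=1.0)
```

Because the noise enters the objective rather than the output, the perturbed solution remains a proper minimizer of a smooth loss, which is what makes this approach attractive for model training.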
The dataset is divided into a training set and a testing set, which are used to train the model and evaluate the testing accuracy, respectively. The accuracy formula is as follows [14]:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative examples, respectively. The experimental results of CDP and MDP are shown in Figure 8 and Figure 9, where ε takes 100 different values over the tested range. The benchmark accuracy of the logistic regression model without privacy protection is about 0.7577.
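The accuracy metric above is straightforward to compute from a confusion matrix; the counts below are hypothetical and chosen only for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion counts for illustration only.
print(accuracy(tp=60, tn=55, fp=20, fn=17))
```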
By comparing the accuracy curves in Figure 8 and Figure 9, it is found that the MDP model performs more stably and achieves higher accuracy, as shown in Figure 10. This indicates that, under the same privacy budget, the MDP model provides better privacy protection and higher model accuracy. Meanwhile, as the privacy budget increases, both models converge to the benchmark accuracy of the privacy-free model.