Article

A Histogram Publishing Method under Differential Privacy That Involves Balancing Small-Bin Availability First

1 College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
2 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
3 College of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(7), 293; https://doi.org/10.3390/a17070293
Submission received: 13 May 2024 / Revised: 15 June 2024 / Accepted: 27 June 2024 / Published: 4 July 2024
(This article belongs to the Section Randomized, Online, and Approximation Algorithms)

Abstract:
Differential privacy, a cornerstone of privacy-preserving techniques, plays an indispensable role in ensuring the secure handling and sharing of sensitive data across domains such as census, healthcare, and social networks. Histograms, serving as a visually compelling tool for presenting analytical outcomes, are widely employed in these sectors. Numerous algorithms for publishing histograms under differential privacy have been developed, striving to balance privacy protection with the provision of useful data. Nonetheless, the pivotal challenge of effectively enhancing the precision of small bins (intervals that are narrowly defined or contain relatively few data points) within histograms has yet to receive adequate attention and in-depth investigation. In standard DP histogram publishing, adding noise without regard for bin size can leave small bins disproportionately influenced by noise, potentially severely impairing the overall accuracy of the histogram. In response to this challenge, this paper introduces the SReB_GCA sanitization algorithm, designed to enhance the accuracy of small bins in DP histograms. The SReB_GCA approach sorts the bins from smallest to largest and applies a greedy grouping strategy, with a predefined lower bound on the mean relative error required for a bin to be included in a group. Our theoretical analysis reveals that sorting bins in ascending order prior to grouping effectively prioritizes the accuracy of smaller bins. SReB_GCA ensures strict ϵ-DP compliance and strikes a careful balance between reconstruction error and noise error, thereby not only improving the accuracy of small bins first but also approximately optimizing the mean relative error of the entire histogram.
To validate the efficiency of our proposed SReB_GCA method, we conducted extensive experiments using four diverse datasets, including two real-life datasets and two synthetic ones. The experimental results, quantified by the Kullback–Leibler Divergence (KLD), show that the SReB_GCA algorithm achieves substantial performance enhancement compared to the baseline method (DP_BASE) and several other established approaches for differential privacy histogram publication.

1. Introduction

In today's digital era, concerns about the privacy and security of user-sensitive data coexist with the progression of data mining technologies. These privacy concerns can surface at every phase of the data mining and analysis procedure, encompassing data collection, data publication, data transmission, and data analysis. Recently, there has been extensive research [1,2,3] aimed at effectively mitigating privacy threats to sensitive information during data mining and analysis, with significant contributions including differential privacy techniques [4,5,6].
Differential privacy is a privacy-preserving paradigm defined from a probabilistic statistical standpoint [4,5,6]. In the context of database queries, it involves adding noise drawn from a certain probability distribution to query results, such that the maximum impact on the result caused by either deleting (or inserting) a single record or modifying the value of a single record is confined within a predetermined range. This ensures that an attacker, with a certain probability, cannot infer whether a particular record exists or determine its true value from the query output, thereby providing privacy protection for the presence or magnitude of each individual record’s information [6]. Due to its rigorous quantification of privacy guarantees and independence from an attacker’s background knowledge, differential privacy is well-suited for safeguarding privacy in big data scenarios [6].
Differential privacy (DP) is considered a de facto standard privacy definition and is applied in many privacy protection scenarios [7,8,9,10,11,12,13,14,15,16,17,18,19,20], of which the histogram [21] is a very typical one. The histogram adopts a binning technique that divides a database table into disjoint areas according to one or more column attributes and expresses each area with a statistical value, so as to reveal the distribution of various types of data. Therein, the statistical value of one disjoint area is called a bin. Histograms can be used for population censuses, road condition information, disease detection, product testing, and daily personal activities, including living expenses, web browsing, and app usage statistics. However, if an original histogram is published directly, it may leak personal privacy. For example, suppose a hospital carries out a statistical analysis of the disease types of patients seen this month, among whom three patients are found to have diabetes. If the counts for the other disease types and the fact that two particular patients have diabetes are known, the identity of the third diabetic patient can easily be inferred. Therefore, privacy protection is needed before publishing the histogram.
In DP histogram publishing, deleting a record from or adding a record to a database table affects at most one count in the publication, so the sensitivity is very small. Therefore, DP histogram publishing is of great concern in DP research [21]. However, the uneven distribution of the original data leads to variously sized statistics (bins) in histograms constructed from the data. When bins of varying sizes comply with ϵ-differential privacy, small bins often exhibit lower relative accuracy than large bins, sometimes to a significantly greater extent. In the literature on publishing differentially private histograms [13,14,15], the majority of studies overlook the relatively lower accuracy of small bins under non-uniform data distributions. While some studies have indeed taken note of data skewness, their focus remains confined to the overall accuracy of all bins or predominantly larger bins, occasionally sacrificing the accuracy of small bins to marginally enhance the absolute accuracy of the aggregate or of larger bins. Indeed, it is a well-established principle that the equitable analysis of data yields more dependable outcomes. Hence, even in the presence of noisy statistical information, it is imperative to direct attention towards data analysis conducted under a comparably uniform level of noise.
In the context of hospital patient data analysis using histograms, health information is sensitive, especially regarding serious illnesses, which typically involve fewer cases and are vital for the in-depth analysis of such diseases. Consider a hypothetical histogram of patient data with counts of 2, 3, 10, 9, 8, 1, and 1, wherein 1, 1, 2, and 3 denote the respective counts for severe disease categories. Following the guidance of study [5], if noise from a Laplace mechanism with scale parameter λ were added to each count to enforce (1/λ)-differential privacy, the expected absolute magnitude of the noise added to each count would be λ. This addition, particularly relative to the counts of 1, 1, 2, and 3 for severe diseases, introduces a significant relative error. Studies [13,15,19,22] introduce grouping methodologies that can moderately enhance the overall accuracy of the data; however, these approaches necessitate balancing the noise error induced by random perturbation against the reconstruction error due to grouping. Current methodologies, while attempting to weigh absolute accuracy against both types of errors, still encounter the challenge of excessive noise on low counts.
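To make the impact concrete, the following short simulation (an illustrative sketch, not the paper's code; the scale λ = 2 and the random seed are arbitrary choices of ours) adds Laplace noise to the example counts and estimates each bin's mean relative error, which concentrates near λ/count:

```python
import random

def sample_laplace(scale, rng):
    """Draw one sample from Lap(0, scale): |Lap(scale)| is Exp(1/scale)."""
    mag = rng.expovariate(1.0 / scale)
    return mag if rng.random() < 0.5 else -mag

def mean_relative_error(count, scale, trials, rng):
    """Monte Carlo estimate of E[|noise|] / count for one histogram bin."""
    total = sum(abs(sample_laplace(scale, rng)) for _ in range(trials))
    return (total / trials) / count

rng = random.Random(0)
counts = [2, 3, 10, 9, 8, 1, 1]   # hospital histogram from the text
lam = 2.0                          # illustrative Laplace scale (eps = 1/lam = 0.5)
errs = {c: mean_relative_error(c, lam, 20000, rng) for c in sorted(set(counts))}
# Since E[|Lap(lam)|] = lam, a bin of size c suffers relative error ~ lam / c:
# the severe-disease bins (counts 1, 2, 3) are hit far harder than counts 8-10.
```

With these settings the estimated relative error of a count-1 bin is about 2.0, against about 0.2 for a count-10 bin, a tenfold gap.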
Specifically, in Section 4 of this paper, a formula-based analysis of ϵ-DP histogram publishing is carried out, and it is concluded that, under uneven data distributions, small bins tend to bear much more relative noise than large bins. Yet small bins are sometimes just as important as, or even more crucial than, large bins. Therefore, this paper considers how to prioritize balancing the accuracy of small bins while minimizing the mean relative error of the histogram under ϵ-DP.
The main contributions of this paper are summarized as follows.
  • A sanitization algorithm, SReB_GCA, is introduced, which adopts relative error as its accuracy metric. This approach diverges from numerous methodologies documented in studies [15,19], wherein the mean squared error or absolute error is typically employed and the histogram is assessed collectively to optimize a cumulative mean squared or absolute error; consequently, the significance of smaller bins may be inadvertently disregarded.
  • The grouping rules under relative error are theoretically deduced and analyzed. There are two types of errors in DP histogram publishing, including the reconstruction error due to grouping and the noise error due to Laplace noise being injected. From the analysis of their relative error forms, it is concluded that sorting from small to large is favorable for improving the utility of small bins first, a perspective not thoroughly explored in the previous literature. Additionally, a lower bound in the greedy grouping process of SReB_GCA is theoretically deduced, which facilitates maximizing the number of bins grouped and approximately optimizing the mean relative error of the histogram.
Moreover, we conduct extensive experiments on four real-life and synthetic datasets to evaluate the performance of our approach in comparison with a benchmark method (DP_BASE) across single-value queries, range queries, and analyses of data distribution. Additionally, we compare this performance with several other existing methods that leverage data distribution characteristics.
The remainder of this paper is organized as follows: Section 2 introduces the related work on DP histogram publishing. Section 3 presents the relevant definitions of DP. Section 4 details the problem and presents initial approaches. Section 5 formally defines the problem and proposes a sanitization algorithm, SReB_GCA, encompassing the grouping rules and an approximately greedy optimal grouping strategy, theoretically derived and analyzed. Section 6 presents the experimental evaluation of SReB_GCA. Finally, we draw conclusions in Section 7.

2. Related Work

Researchers have derived numerous valuable solutions concerning DP histogram publishing [5,13,14,15,17,18,19,22,23,24,25,26,27]. Dwork et al. [5] introduced a method applicable to equal-width histograms for the first time. They augmented each bin in the equal-width histogram with Laplace noise and utilized the resulting noisy bins for range queries. When executing large-range queries, this approach results in a substantial accumulation of noise within the query outcomes, ultimately compromising the data availability of query results. In [14], a methodology known as Boost1 is presented, which relies on the optimal linear unbiased estimator. In practice, however, this method fails to efficiently mitigate the noise associated with each bin in the histogram. To address the challenge posed by a wide range of queries, Xu et al. [19] put forward two techniques: NoiseFirst and StructureFirst. With NoiseFirst, dynamic programming technology is employed to reconstitute the output of an equal-width histogram by consolidating adjacent, similar bins. Conversely, StructureFirst utilizes the exponential mechanism [5] to determine the boundaries of output bins, thereby forming a non-equiwidth histogram, and subsequently incorporates appropriate noise into the adjusted results prior to publication. While StructureFirst effectively mitigates the error induced by perturbation (i.e., noise error), it inadvertently introduces a novel error (i.e., reconstruction error). Furthermore, during the reconstruction process, the number of bins (k) is manually specified, rendering it incapable of effectively balancing noise error against reconstruction error.
In order to enhance the accuracy of sanitized data (i.e., data with added noise conforming to a specific distribution) and strike a superior balance between noise and reconstruction errors, the study [13] introduces PHPartition. This method employs a dynamic search via hierarchical clustering techniques to determine the optimal value k, rendering it an improvement over StructureFirst. Additionally, the study [17] puts forth an adaptive sampling methodology for achieving differential privacy in the histogram publishing of dynamically evolving datasets. Furthermore, the work presented in [24] proposes DPDR, a grouping technique grounded in the ranking methodology of the exponential mechanism. Nonetheless, these methodologies primarily focus on devising reconstruction strategies that enhance the overall accuracy of histograms or the precision of larger bins, inadvertently overlooking the accuracy of smaller bins. In some instances, they even compromise the precision of small bins to augment the absolute accuracy for broader bins or the histogram as a whole [19]. While large bins typically garner significant attention, the significance of small bins, as exemplified by phenomena like the “Black Swan Event”, cannot be discounted.
As far as we know, there is almost no research focusing specifically on prioritizing the accuracy of small bins, despite some studies employing a relative error metric. In [28], through analyzing the correlation between the cumulative probability of noise and the privacy level in the Laplace mechanism, and integrating the relative error metric, an algorithm was proposed to inject varying levels of noise proportional to query preferences. However, the algorithm's implementation necessitates prior knowledge of these query preferences, which may themselves contain sensitive information, thereby posing a risk of privacy leakage. The relative error metric is also considered in study [29], where a DP publishing algorithm for batch queries based on gradual iterations is proposed to mitigate noise error; specifically, the privacy budget is allocated according to intermediate noisy results, with the overarching goal of minimizing the overall relative error while efficiently managing the consumption of the privacy budget. Through iterations, queries are adaptively adjusted to achieve the best cost-performance ratio. The algorithm proposed in this paper, distinct from that of Reference [29], starts by prioritizing the precision of smaller bins under the relative error criterion and subsequently optimizes the error between large and small bins to augment the overall usability of the dataset, whereas [29] primarily focuses on enhancing the accuracy of larger bins under differential privacy constraints before tuning the discrepancy between large and small bins to boost aggregate data precision. Moreover, the datasets utilized in our research exhibit distributions that starkly differ from those employed in Reference [29].
Furthermore, when delving into the relative accuracy of bins in the realm of DP histogram publishing, we note that several grouping principles resemble yet differ from those in prior research [22] (see Section 5). Nonetheless, given the divergent utility metrics and distinct research aims we consider, the methodology employed in the previous work cannot be straightforwardly adopted to address our present challenge.

3. Preliminaries

3.1. Differential Privacy

The concept of differential privacy (DP) originates from the indistinguishability principle in cryptographic semantic security, implying that an adversary cannot discern between the encrypted outcomes of distinct plaintexts. For example, the inclusion or exclusion of a single record has a negligible impact on query outcomes under DP, thereby safeguarding the record’s privacy.
Definition 1
(ϵ-DP [5]). Let K be a randomized algorithm and let S be any subset of the possible outputs of K. K is said to satisfy ϵ-DP if Formula (1) holds for any two neighboring datasets D and D′ that differ in at most one user.
Pr[K(D) ∈ S] ≤ exp(ϵ) × Pr[K(D′) ∈ S],
where Pr[·] denotes the probability of an event, quantifies the risk of privacy disclosure, and is controlled by the randomness of algorithm K.
In Definition 1, the parameter ϵ is referred to as the privacy budget, quantifying the level of privacy protection. Specifically, a smaller value of ϵ signifies stronger privacy guarantees, whereas a larger ϵ implies weaker privacy protection.
Definition 2
(Global sensitivity [5]). For any function f : D → R^d, the global sensitivity of f is defined as
Δf = max_{D,D′} ‖f(D) − f(D′)‖_p,
where R represents the real number space, d represents the query dimension of function f, and p represents the norm distance used by Δf, which is usually the L_1 norm.

3.2. Laplace Mechanism

The Laplace mechanism is usually used to add noise to numerical data to satisfy ϵ-DP. The definition is given below.
Definition 3
(Laplace mechanism [5]). For any function f : D → R^d with global sensitivity Δf, the algorithm K(D) = f(D) + (n_1, n_2, …, n_d) ensures ϵ-DP, where each n_i ~ Lap(Δf/ϵ) for i ∈ {1, 2, …, d} is an independently drawn Laplace random variable with scale parameter λ = Δf/ϵ. The probability density function of this Laplace distribution is given by
p(x) = (1/(2λ)) exp(−|x|/λ).
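As a quick sanity check (a minimal numerical sketch of ours, not part of the paper), the density above integrates to 1 and has mean absolute value λ, which is why each bin's expected absolute error under the Laplace mechanism equals the scale parameter:

```python
import math

def laplace_pdf(x, lam):
    """Density (1 / (2*lam)) * exp(-|x| / lam) from Definition 3."""
    return math.exp(-abs(x) / lam) / (2.0 * lam)

def integrate(f, a, b, n=200000):
    """Midpoint-rule quadrature of f on [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

lam = 1.5
total = integrate(lambda x: laplace_pdf(x, lam), -60.0, 60.0)              # ~= 1.0
mean_abs = integrate(lambda x: abs(x) * laplace_pdf(x, lam), -60.0, 60.0)  # ~= lam
```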

3.3. Composition Properties

DP has the following two composition properties that are often used in the design of its mechanisms.
Property 1
(The parallel composition property [30]). Given randomized algorithms K_1, K_2, …, K_n with privacy budgets ϵ_1, ϵ_2, …, ϵ_n, respectively, applied to disjoint datasets D_1, D_2, …, D_n, the combined algorithm K(K_1(D_1), K_2(D_2), …, K_n(D_n)) satisfies max(ϵ_i)-DP; the privacy protection level provided depends on the largest privacy budget.
Property 2
(The sequence composition property [30]). Given a dataset D and randomized algorithms K_1, K_2, …, K_n with privacy budgets ϵ_1, ϵ_2, …, ϵ_n, respectively, the combined algorithm K(K_1(D), K_2(D), …, K_n(D)) satisfies (∑_i ϵ_i)-DP.
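The two properties reduce to a max and a sum over the per-mechanism budgets; a trivial helper (illustrative only, not from the paper) makes the contrast explicit:

```python
def parallel_budget(budgets):
    """Property 1: mechanisms run on disjoint datasets give max(eps_i)-DP."""
    return max(budgets)

def sequential_budget(budgets):
    """Property 2: mechanisms run on the same dataset give sum(eps_i)-DP."""
    return sum(budgets)

eps = [0.1, 0.3, 0.2]
p = parallel_budget(eps)    # 0.3: queries on disjoint histogram bins compose for free
s = sequential_budget(eps)  # ~0.6: repeated queries on the same data add up
```

This is why perturbing each bin of a histogram with Lap(1/ϵ) noise yields ϵ-DP overall: the bins partition the dataset, so Property 1 applies.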

3.4. Relative Error

In this paper, the relative error [28,29,31,32] used to measure the quality of a published result is as follows:
err(r′) = |r′ − r| / max(r, δ),
where r′ is the published result corresponding to the original result r, and the parameter δ is a constant that avoids the situation where r = 0 or r is too small [28,29,31,32].
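A direct implementation of Formula (4) (an illustrative sketch; the default δ = 1 echoes the convention used later in Section 5):

```python
def relative_error(published, original, delta=1.0):
    """Relative error of a published result: |r' - r| / max(r, delta).

    The constant delta guards against the original count r being 0 or too small.
    """
    return abs(published - original) / max(original, delta)

a = relative_error(12.0, 10.0)  # 0.2
b = relative_error(3.0, 0.0)    # 3.0: delta = 1 prevents division by zero
```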

4. Problem Statement and First-Cut Method

4.1. Problem Statement

Given a sequence of original histogram bins H = {H_1, H_2, …, H_n}, assume without loss of generality that the bin proportions are p = {p_1, p_2, …, p_n}, where 0 < p_1 ≤ p_2 ≤ ⋯ ≤ p_n < 1 and ∑_{i=1}^{n} p_i = 1. The task is to publish a histogram H̃ = {H̃_1, H̃_2, …, H̃_n} satisfying ϵ-DP, where H̃_j = H_j + n_j with n_j ~ Lap(λ) for j ∈ {1, 2, …, n}, and Lap(λ) is the Laplace distribution with scale parameter λ = 1/ϵ.
Obviously, the expected absolute error of each bin in the histogram is E[|H̃_j − H_j|] = λ. Note that the mean squared error (MSE) is E[(H̃_j − H_j)^2] = 2λ^2. So, the expected relative error of each bin is
E[|H̃_j − H_j| / H_j] = λ / H_j.
Given that p_1 ≤ p_2 ≤ ⋯ ≤ p_n, it follows that H_1 ≤ H_2 ≤ ⋯ ≤ H_n. Consequently, Formula (6) holds:
E[|H̃_1 − H_1| / H_1] ≥ E[|H̃_2 − H_2| / H_2] ≥ ⋯ ≥ E[|H̃_n − H_n| / H_n].
According to Formula (6), the relative error of each bin with different proportions is different. If the data distribution is very uneven, the relative errors of bins with smaller counts (small bins) can be disproportionately higher compared to those with larger counts. This disparity may be detrimental in application scenarios where critical decisions rely on accurate representations of these smaller bins.

4.2. First-Cut Method

From the previous subsection, we know that a histogram with an uneven data distribution may often fail to meet the relative accuracy requirements of small bins when the same absolute noise is added to all bins to satisfy ϵ-DP. Can two straightforward solutions be used instead? One is to add equal relative noise to each bin of the histogram so that every bin meets its accuracy requirement; the other is to take the scale parameter that meets the relative accuracy requirement of the smallest bin as the overall scale parameter. The answer is no: although these two solutions are simple and brute-force, they are not optimal and may lead to a privacy disaster in some extreme cases.
Firstly, assume the same relative noise is γ, with γ = λ_j / H_j for j ∈ [1, n], where H_j is perturbed by Laplace noise with scale parameter λ_j. Since H_1 ≤ H_2 ≤ ⋯ ≤ H_n, we have λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n. Because λ_j = 1/ϵ_j, this gives ϵ_1 ≥ ϵ_2 ≥ ⋯ ≥ ϵ_n. Then, according to Property 1 of DP, the scheme satisfies ϵ = max(ϵ_j) = ϵ_1-DP, where j ∈ [1, n]. In the extreme case H_1 ≪ H_n, we get ϵ_1 ≫ ϵ_n; this may cause a privacy disaster.
Secondly, in the histogram publishing method that adds the same noise to each bin to satisfy ϵ-DP, the relative noise of each bin is γ_j = λ / H_j, where λ = 1/ϵ and j ∈ [1, n]. In the extreme case where the accuracy requirement of the smallest bin is γ′ ≪ γ_1, the perturbation must use a scale parameter λ′ = H_1 γ′ ≪ H_1 γ_1 = λ in order to meet that requirement. From λ′ = 1/ϵ′ ≪ λ = 1/ϵ, we get ϵ′ ≫ ϵ, which may also lead to a privacy disaster.
Although these two solutions directly meet the availability requirements of small bins, they may seriously violate ϵ-DP. Furthermore, the availability of small bins also impacts the overall availability of the histogram. Consequently, while still satisfying ϵ-DP, is it feasible to prioritize enhancing the availability of small bins while concurrently considering the availability of the whole histogram (its mean relative error)? In the next section, we propose a sanitization algorithm based on this question.
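The blow-up in the first scheme is easy to quantify: fixing the relative noise γ forces per-bin budgets ϵ_j = 1/(γ H_j), and by Property 1 the effective budget is the largest of these. A small sketch (with illustrative numbers of our choosing):

```python
def equal_relative_noise_budgets(bins, gamma):
    """First-cut scheme 1: per-bin Laplace scale lam_j = gamma * H_j,
    so eps_j = 1 / (gamma * H_j); the histogram satisfies max(eps_j)-DP."""
    return [1.0 / (gamma * h) for h in bins]

bins = [1, 2, 3, 10000]        # a highly skewed histogram
eps = equal_relative_noise_budgets(bins, gamma=0.5)
effective = max(eps)            # 2.0, dictated entirely by the smallest bin
# effective / min(eps) == H_n / H_1 == 10000: the smallest bin inflates the
# effective budget by the full skew ratio, i.e. the "privacy disaster" above.
```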

5. Sanitization Algorithm

In this section, we introduce our sanitization algorithm. We meticulously derive and analyze the grouping rules under the framework of relative error, wherein we propose an approximately greedy optimal grouping strategy (GGS).

5.1. Grouping Rules

According to studies [13,15,19,22], grouping constitutes a prevalent technique in differential privacy histogram publication for enhancing accuracy. The formal definition of histogram grouping is outlined as follows.
Definition 4
(Histogram grouping). Given an original bin sequence H = {H_1, H_2, …, H_n} of the histogram, and assuming that k groups C = {C_1, C_2, …, C_k} are formed after grouping, with each group C_i covering |C_i| bins, the average count C̄_i of group C_i is expressed in Formula (7):
C̄_i = (∑_{H_j ∈ C_i} H_j) / |C_i|.
Assume that a sanitized histogram H̃ = {H̃_1, H̃_2, …, H̃_n} satisfies ϵ-DP, where each H̃_j represents the sanitized version of H_j ∈ H. For H_j ∈ C_i, we have
H̃_j = C̄_i + x / |C_i|,
where x ~ Lap(λ) and λ = 1/ϵ. Then, the relative error between H̃_j and H_j can be expressed as
err(H_j) = |H̃_j − H_j| / max(H_j, δ),
where δ > 0 serves as a small positive constant to avoid division by zero when H_j = 0.
When H_j ≥ 1 and δ = 1, the mean relative error between C̃_i and C_i is given by
err(C_i) = (∑_{H_j ∈ C_i} err(H_j)) / |C_i| = (∑_{H_j ∈ C_i} |H̃_j − H_j| / H_j) / |C_i|.
For any H j C i , the expected relative error between H ˜ j and H j can be expressed as
E[err(H_j)] = E[|H̃_j − H_j| / H_j] = (1/H_j) E[|H̃_j − H_j|],
where
E[|H̃_j − H_j|] = ∫_{−∞}^{+∞} |C̄_i + x/|C_i| − H_j| (1/(2λ)) exp(−|x|/λ) dx
and x ~ Lap(λ).
Consequently, the expected mean relative error between C ˜ i and C i is given by Formula (12).
E[err(C_i)] = E[(∑_{H_j ∈ C_i} |H̃_j − H_j| / H_j) / |C_i|] = (∑_{H_j ∈ C_i} (1/H_j) E[|H̃_j − H_j|]) / |C_i|.
Regarding our problem, we initially present Theorem 1, which offers an approximate characterization of the relative error between sanitized and original bins.
Theorem 1.
Given C_i, C̄_i, |C_i|, H_j ∈ C_i, H̃_j, and λ, then E[err(H_j)] ≈ (1/H_j)(|H_j − C̄_i| + λ/|C_i|), where |H_j − C̄_i| / H_j is the Relative Reconstruction Error (RRE) and λ/(|C_i| H_j) is the Relative Noise Error (RNE).
Proof. 
In order to obtain E[err(H_j)], we need to calculate E[|H̃_j − H_j|]. According to the relationship between H_j and C̄_i, the discussion can be divided into the following two cases.
(1)
When H_j ≥ C̄_i, the integrand |C̄_i − H_j + x/|C_i|| changes sign at x = (H_j − C̄_i)|C_i| ≥ 0, so
E[|H̃_j − H_j|] = ∫_{−∞}^{+∞} |C̄_i − H_j + x/|C_i|| (1/(2λ)) exp(−|x|/λ) dx = −∫_{−∞}^{0} (C̄_i − H_j + x/|C_i|) (1/(2λ)) exp(x/λ) dx − ∫_{0}^{(H_j − C̄_i)|C_i|} (C̄_i − H_j + x/|C_i|) (1/(2λ)) exp(−x/λ) dx + ∫_{(H_j − C̄_i)|C_i|}^{+∞} (C̄_i − H_j + x/|C_i|) (1/(2λ)) exp(−x/λ) dx = H_j − C̄_i + (λ/|C_i|) exp(−(H_j − C̄_i)|C_i|/λ).
(2)
When H_j < C̄_i, the sign change occurs at x = (H_j − C̄_i)|C_i| < 0, so
E[|H̃_j − H_j|] = ∫_{−∞}^{+∞} |C̄_i − H_j + x/|C_i|| (1/(2λ)) exp(−|x|/λ) dx = −∫_{−∞}^{(H_j − C̄_i)|C_i|} (C̄_i − H_j + x/|C_i|) (1/(2λ)) exp(x/λ) dx + ∫_{(H_j − C̄_i)|C_i|}^{0} (C̄_i − H_j + x/|C_i|) (1/(2λ)) exp(x/λ) dx + ∫_{0}^{+∞} (C̄_i − H_j + x/|C_i|) (1/(2λ)) exp(−x/λ) dx = C̄_i − H_j + (λ/|C_i|) exp(−(C̄_i − H_j)|C_i|/λ).
Based on (13) and (14), it can be seen that
E[|H̃_j − H_j|] = |C̄_i − H_j| + (λ/|C_i|) exp(−|H_j − C̄_i| |C_i|/λ).
Therefore, E[|H̃_j − H_j|] is composed of two parts: the part |C̄_i − H_j| is caused by grouping, and the part (λ/|C_i|) exp(−|H_j − C̄_i| |C_i|/λ) is caused by both grouping and the injected Laplace noise. It is easy to see that (λ/|C_i|) exp(−|H_j − C̄_i| |C_i|/λ) ≤ λ/|C_i|.
Approximately, we use |H_j − C̄_i| + λ/|C_i| to represent the expected absolute error between H̃_j and H_j. Therefore, E[err(H_j)] ≈ (1/H_j)(|H_j − C̄_i| + λ/|C_i|), where |H_j − C̄_i| / H_j is the Relative Reconstruction Error (RRE) and λ/(|C_i| H_j) is the Relative Noise Error (RNE). From the above discussion, the result follows. □
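Formula (15) can be checked numerically; the following Monte Carlo sketch (with illustrative parameters of our choosing, not from the paper) compares the closed form against direct simulation of H̃_j = C̄_i + Lap(λ)/|C_i|:

```python
import math
import random

def sample_laplace(scale, rng):
    """|Lap(scale)| is Exp(1/scale); attach a random sign."""
    mag = rng.expovariate(1.0 / scale)
    return mag if rng.random() < 0.5 else -mag

def closed_form(h, c_bar, c_size, lam):
    """Formula (15): E|H~_j - H_j| for a bin h in a group with mean c_bar."""
    return abs(c_bar - h) + (lam / c_size) * math.exp(-abs(h - c_bar) * c_size / lam)

def monte_carlo(h, c_bar, c_size, lam, trials, rng):
    """Estimate E|c_bar + Lap(lam)/|C_i| - h| by simulation."""
    total = sum(abs(c_bar + sample_laplace(lam, rng) / c_size - h)
                for _ in range(trials))
    return total / trials

rng = random.Random(1)
h, c_bar, c_size, lam = 4.0, 10.0, 5, 20.0    # assumed example values
exact = closed_form(h, c_bar, c_size, lam)     # 6 + 4 * exp(-1.5) ~= 6.89
approx = monte_carlo(h, c_bar, c_size, lam, 200000, rng)
```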
Hence, the expectation of the mean relative error of C_i is
E[err(C_i)] ≈ (∑_{H_j ∈ C_i} (1/H_j)(|H_j − C̄_i| + λ/|C_i|)) / |C_i|.
Theorem 2.
Assume that H_j ∈ C_i and H_j ≤ C̄_i, and suppose that two bins H_l and H_l′ exist such that H_j ≤ H_l ≤ H_l′. Then, the relative error incurred by grouping H_l into C_i is no greater than that resulting from grouping H_l′ into C_i.
Proof. 
From H_j ∈ C_i, we have E[err(H_j)] = (1/H_j)(|H_j − C̄_i| + λ/|C_i|), where RRE = |H_j − C̄_i| / H_j and RNE = λ/(|C_i| H_j). If we add H_l to C_i, then
E[err(H_j)] = (1/H_j)(|H_j − (C̄_i |C_i| + H_l)/(|C_i| + 1)| + λ/(|C_i| + 1)),
where RRE_1 = (1/H_j) |H_j − (C̄_i |C_i| + H_l)/(|C_i| + 1)| and RNE_1 = (1/H_j) λ/(|C_i| + 1). If instead we add H_l′ to C_i, then
E[err(H_j)] = (1/H_j)(|H_j − (C̄_i |C_i| + H_l′)/(|C_i| + 1)| + λ/(|C_i| + 1)),
where RRE_2 = (1/H_j) |H_j − (C̄_i |C_i| + H_l′)/(|C_i| + 1)| and RNE_2 = (1/H_j) λ/(|C_i| + 1). Since H_j ≤ C̄_i and H_j ≤ H_l ≤ H_l′, it holds that
RRE_1 = (1/H_j) ((C̄_i |C_i| + H_l)/(|C_i| + 1) − H_j)
and
RRE_2 = (1/H_j) ((C̄_i |C_i| + H_l′)/(|C_i| + 1) − H_j) = RRE_1 + (H_l′ − H_l)/(H_j (|C_i| + 1)) ≥ RRE_1.
Because RNE_1 = RNE_2, the result follows. □
Theorem 2 shows that, for a bin below the average of the current group, the relative error is smaller when the group absorbs a new bin that is closer to it than when it absorbs one that is farther away.
Theorem 3.
When |C_i| gets bigger, RNE gets smaller.
Proof. 
Assuming that H_j ∈ C_i, we have E[err(H_j)] = (1/H_j)(|H_j − C̄_i| + λ/|C_i|) with RNE = λ/(H_j |C_i|), which decreases as |C_i| increases. Therefore, the result follows. □
Theorem 3 demonstrates that the more bins a group contains, the smaller the Relative Noise Error ( R N E ) for each individual bin within that group. Theorems 2 and 3 collectively facilitate enhancing the accuracy of smaller bins during the grouping process of a new bin. Proceeding further in this section, we delve into analyzing the relative error associated with a solitary bin situated within a fixed group, leading us to the derivation of Theorem 4.
Theorem 4.
Given C_i, C̄_i, |C_i|, H_j ∈ C_i, and λ, the following cases hold.
(1) 
When |C_i| C̄_i < λ, E[err(H_j)] decreases as H_j increases;
(2) 
When |C_i| C̄_i ≥ λ, two cases exist as follows.
(i) 
If H_j ≤ C̄_i, then E[err(H_j)] decreases as H_j increases;
(ii) 
If H_j > C̄_i, then E[err(H_j)] increases as H_j increases.
Proof. 
Let f(x) = (1/x)(|x − C̄_i| + λ/|C_i|), where x > 0 is the variable standing for H_j. The derivative f′(x) is
f′(x) = −C̄_i/x² − λ/(x² |C_i|) for 0 < x ≤ C̄_i, and f′(x) = C̄_i/x² − λ/(x² |C_i|) for x > C̄_i,
i.e.,
f′(x) < 0 for 0 < x ≤ C̄_i; f′(x) ≥ 0 for x > C̄_i when |C_i| C̄_i ≥ λ; and f′(x) < 0 for x > C̄_i when |C_i| C̄_i < λ.
Thus, the result follows. □
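The case analysis can be verified by evaluating the Theorem 1 approximation on a grid (an illustrative check of ours; the parameters mirror the two regimes of the theorem):

```python
def f(x, c_bar, c_size, lam):
    """Theorem 1 approximation of E[err(H_j)] as a function of the count x."""
    return (abs(x - c_bar) + lam / c_size) / x

def monotone(vals, decreasing):
    """True if the sequence is non-increasing (or non-decreasing)."""
    return all(a >= b if decreasing else a <= b for a, b in zip(vals, vals[1:]))

c_bar, c_size = 100.0, 10
below = [float(x) for x in range(1, 101)]      # 0 < x <= c_bar
above = [float(x) for x in range(100, 1001)]   # x >= c_bar

# Case (1): |C_i| * c_bar = 1000 < lam = 2000, so f decreases everywhere.
case1 = monotone([f(x, c_bar, c_size, 2000.0) for x in below + above], True)

# Case (2): |C_i| * c_bar = 1000 >= lam = 100: decreasing below, increasing above.
case2i = monotone([f(x, c_bar, c_size, 100.0) for x in below], True)
case2ii = monotone([f(x, c_bar, c_size, 100.0) for x in above], False)
```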
Example 1.
When ϵ = 0.0005, λ = 1/ϵ = 2000, C̄_i = 100, and |C_i| = 10, we have |C_i| C̄_i < λ. Assuming that H_j ∈ C_i, the change of ln(E[err(H_j)]) with respect to H_j is illustrated in Figure 1. When ϵ = 0.01, λ = 1/ϵ = 100, C̄_i = 100, and |C_i| = 10, we have |C_i| C̄_i ≥ λ. Assuming that H_j ∈ C_i, the behavior of ln(E[err(H_j)]) as H_j varies is shown in Figure 2. The conclusions of Theorem 4 are thus intuitively demonstrated by Figure 1 and Figure 2.
Given H = {H_1, H_2, …, H_n} in ascending order, Theorem 4 shows that the maximum relative error within each group depends on the bins at the two ends of the group.
Theorem 5.
Given C_i, C̄_i, |C_i|, H_j ∈ C_i, and λ with λ = 1/ϵ, E[err(H_j)] increases as λ increases (i.e., as ϵ decreases) and decreases otherwise.
Proof. 
From E[err(H_j)] ≈ (1/H_j)(|H_j − C̄_i| + λ/|C_i|), E[err(H_j)] increases as λ increases and decreases otherwise. Since λ = 1/ϵ, E[err(H_j)] increases as ϵ decreases and decreases otherwise. □
Theorem 6.
Given C_i, C̄_i, |C_i|, H_j ∈ C_i, and λ, the monotonicity of E[err(C_i)] in H_j is the same as the monotonicity of E[err(H_j)] in H_j.
Proof. 
Since
E[err(C_i)] = E[∑_{H_j ∈ C_i} err(H_j) / |C_i|] = ∑_{H_j ∈ C_i} E[err(H_j)] / |C_i| ≈ ∑_{H_j ∈ C_i} (|H_j − C̄_i| + λ/|C_i|) / (H_j |C_i|),
Theorem 6 can be proved by the same method as Theorem 4, and the monotonicity of E[err(C_i)] in H_j is consistent with that of E[err(H_j)] in H_j. □
From Theorems 1–3, at least the following two grouping rules are needed to reasonably solve the problem considered in this paper.
  • Sorting the bins from small to large helps improve the availability of small bins first.
  • Bins with close counts should be placed in the same group as much as possible to reduce the relative error.
Based on the two grouping rules above, we formulate the following optimization problem, referred to as OP.
OP: Given a histogram $H = \{H_1, H_2, \ldots, H_n\}$ sorted in ascending order and a privacy budget $\epsilon$, the goal is to partition $H$ into $k$ non-overlapping groups $C = \{C_1, C_2, \ldots, C_k\}$, such that $\bigcup_i C_i = H$ and $C_i \cap C_j = \emptyset$ for all $i \ne j$ when $k \ge 2$. Define the average error as $err_{Avg} = \frac{\sum_{C_i \in C} \sum_{H_j \in C_i} E[err(H_j)]}{n}$; the objective is to find an optimal grouping $C = \{C_1, C_2, \ldots, C_k\}$ that minimizes $err_{Avg}$, i.e., $C^{*} = \arg\min_C err_{Avg}$. Here, the error of each bin $H_j$ is $err(H_j) = \frac{|\tilde{H}_j - H_j|}{H_j}$, with $\tilde{H}_j = \bar{C}_i + \frac{Lap(\lambda)}{|C_i|}$ and noise scale $\lambda = \frac{1}{\epsilon}$.
Solving the OP with the dynamic programming technique introduced in [19] incurs a complexity of $O(kn^2)$ when $k$ is predefined; when $k$ is not specified, determining the optimal grouping escalates to a complexity of $O(n^4)$. This approach is therefore impractical for large real-world datasets. To address this, this paper introduces a greedy optimal grouping strategy, designated GP (Greedy Problem), which does not predetermine $k$. Incorporating Theorems 4–6, the formal description of the GP is as follows.
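For intuition, the exact optimization that the OP defines can be sketched with a brute-force dynamic program over contiguous groupings of the sorted histogram (a toy illustration with our own function names; recomputing each group cost from scratch makes this $O(n^3)$, whereas [19] achieves $O(kn^2)$ with additional bookkeeping):

```python
def group_cost(h, lam):
    """Total expected relative error of one group: sum over bins of
    (|H_j - mean| + lam/|group|) / H_j."""
    m = sum(h) / len(h)
    return sum((abs(x - m) + lam / len(h)) / x for x in h)

def optimal_grouping(hist, lam):
    """Dynamic program over contiguous groups of the sorted histogram.
    best[i] = minimal total error of the first i bins."""
    hist = sorted(hist)
    n = len(hist)
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            c = best[j] + group_cost(hist[j:i], lam)
            if c < best[i]:
                best[i], cut[i] = c, j
    # recover the partition from the stored cut points
    groups, i = [], n
    while i > 0:
        groups.append(hist[cut[i]:i])
        i = cut[i]
    return best[n], groups[::-1]

# Small bins cluster together, and so do the large ones
err, groups = optimal_grouping([2, 3, 5, 40, 42, 45], lam=10.0)
```

With this data the optimum keeps the small bins {2, 3, 5} away from the large ones, which is exactly the behavior the two grouping rules above aim for.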
The GP: Given a histogram $H = \{H_1, H_2, \ldots, H_n\}$ sorted in ascending order and a privacy parameter $\epsilon$, the task is to divide $H$ into $k$ non-overlapping groups $C = \{C_1, C_2, \ldots, C_k\}$ with $1 \le k \le n$, such that $\bigcup_{i=1}^{k} C_i = H$ and, for any pair of distinct groups $C_i$ and $C_q$, $C_i \cap C_q = \emptyset$ (applicable when $k \ge 2$). The objective is to identify a grouping configuration $C = \{C_1, C_2, \ldots, C_k\}$ that adheres to Equation (24):

$$\max\left\{|C_i| \,\middle|\, E[err(C_i \cup H_j)] < \frac{E[err(C_i)] \cdot |C_i| + E[err(H_j)]}{|C_i| + 1}\right\}, \tag{24}$$

where

$$E[err(C_i)] = \frac{1}{|C_i|}\sum_{H_r \in C_i} \frac{1}{H_r}\left(|H_r - \bar{C}_i| + \frac{\lambda}{|C_i|}\right), \tag{25}$$

$$E[err(C_i \cup H_j)] = \frac{1}{|C_i| + 1}\sum_{H_r \in (C_i \cup H_j)} \frac{1}{H_r}\left(|H_r - \bar{C}_i'| + \frac{\lambda}{|C_i| + 1}\right), \tag{26}$$

$$\bar{C}_i' = \frac{\bar{C}_i |C_i| + H_j}{|C_i| + 1}, \tag{27}$$

$$E[err(H_j)] = \min_{j+1 \le e \le n}\left(E[err(H_j)] \,\middle|\, H_j \in C\{H_j, \ldots, H_e\}\right), \tag{28}$$

and

$$\lambda = \frac{1}{\epsilon}. \tag{29}$$

Therein, Formula (28) uses $C\{H_j, \ldots, H_e\}$ to denote a group comprising bins $H_j, \ldots, H_e$.
From Formula (24), it is evident that, apart from computing $E[err(C_i \cup H_j)]$ and $E[err(C_i)]$, it is also necessary to calculate

$$E[err(H_j)] = \min_{j+1 \le e \le n}\left(E[err(H_j)] \,\middle|\, H_j \in C\{H_j, \ldots, H_e\}\right). \tag{30}$$
Formula (30) must be re-evaluated every time the composition of the groups changes. This differs from operations involving mean square error, which can often be simplified by decomposition into a sum of squares and the square of a sum, enhancing computational efficiency. Consequently, when $k$ is not predetermined, the operational efficiency can be low for large real-world datasets. To address this, we propose the following approximately greedy optimal grouping strategy.

5.2. GGS

Definition 5
($GGS$). Assume that $H = \{H_1, H_2, \ldots, H_n\}$ is sorted in ascending order and grouped in that order, and that the number of groups is $k$ with $1 \le k \le n$. The $i$th ($1 \le i \le k-1$) dividing point is

$$d_i = \max\left\{j \,\middle|\, E[err(C_i \cup H_j)] < \frac{E[err(C_i)] \cdot |C_i| + \frac{\lambda}{(n - j + 1)H_j}}{|C_i| + 1}\right\};$$

the grouping strategy formed by $\{d_1, d_2, \ldots, d_{k-1}\}$ is then called the approximately greedy optimal grouping strategy and denoted $GGS$.
Specifically, the approximate greedy bound defined in Definition 5 is derived from Theorem 7 and can be utilized directly in the greedy grouping procedure.
Theorem 7.
$E[err(H_j)] \ge \dfrac{\lambda}{(n - j + 1)H_j}$.
Proof. 
From

$$E[err(H_j)] = \min_{j+1 \le e \le n}\left(E[err(H_j)] \,\middle|\, H_j \in C\{H_j, \ldots, H_e\}\right) = \min(RRE + NRE) \ge \min(RRE) + \min(NRE) \ge \frac{\lambda}{(n - j + 1)H_j},$$
the result follows. □
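Theorem 7's lower bound can be verified by brute force over all forward groups $C\{H_j, \ldots, H_e\}$ (an illustrative sketch; the helper names and the sample histogram are ours):

```python
def err_in_group(hj, group, lam):
    """Expected relative error of bin hj when placed in `group`."""
    m = sum(group) / len(group)
    return (abs(hj - m) + lam / len(group)) / hj

def best_forward_err(hist, j, lam):
    """Smallest expected relative error H_j can achieve over all forward
    groups C{H_j, ..., H_e} (Formula (28)); 0-based index j."""
    n = len(hist)
    return min(err_in_group(hist[j], hist[j:e + 1], lam)
               for e in range(j + 1, n))

hist = sorted([1, 2, 4, 7, 15, 30, 80])
lam = 100.0
n = len(hist)
for j in range(n - 1):
    # Theorem 7: the largest possible forward group {H_j, ..., H_n} has
    # n - j bins here (0-based), i.e., n - j + 1 in the paper's 1-based index.
    bound = lam / ((n - j) * hist[j])
    assert best_forward_err(hist, j, lam) >= bound - 1e-12
```

The bound holds for any data, since the reconstruction term is nonnegative and the noise term $\lambda/(|C| \cdot H_j)$ is smallest for the largest admissible group.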

5.3. SReB_GCA

In the preceding subsection, the GGS method initially requires sorting. However, directly sorting the original histogram H would violate ϵ -differential privacy. Consequently, a fraction of the total privacy budget, designated as ϵ 1 , is allocated to perform the sorting of H in ascending order in compliance with ϵ 1 -DP. Subsequently, the SReB_GCA histogram publishing algorithm is designed, incorporating the grouping principles discussed, along with the approximate greedy optimization strategy outlined in GGS. Detailed descriptions of the algorithm can be found in Algorithms 1 and 2.
Algorithm 1 SReB_GCA
Require: Histogram $H = \{H_1, H_2, \ldots, H_n\}$ and privacy budget $\epsilon$
Ensure: Sanitized histogram $\tilde{H}$
1: $\epsilon = \epsilon_1 + \epsilon_2$, $\lambda_1 = \frac{1}{\epsilon_1}$ and $\lambda_2 = \frac{1}{\epsilon_2}$;
2: $\hat{H} = H + Lap(\lambda_1)^n$;
3: $\hat{H} = AscendingSort(\hat{H})$;
4: $S = GGS(\hat{H}, \lambda_2)$; // $S$ stands for a grouping strategy, i.e., a set of segmentation points
5: $C = Grouping(H, S)$;
6: for $i = 1$ to $|C|$ do
7:  $\bar{C}_i = \frac{\sum_{H_j \in C_i} H_j}{|C_i|}$;
8: end for
9: for each $H_j \in H$ do
10:  $\tilde{H}_j = \bar{C}_i + \frac{Lap(\lambda_2)}{|C_i|}$, where $H_j \in C_i$;
11: end for
12: return $\tilde{H} = \{\tilde{H}_1, \tilde{H}_2, \ldots, \tilde{H}_n\}$
Algorithm 2 GGS($\hat{H}$, $\lambda_2$)
Require: Histogram $\hat{H}$ sorted in ascending order, and scale parameter $\lambda_2$
Ensure: Grouping strategy $S$
1: $S = \emptyset$;
2: $l = 1$;
3: $C_i = \{\hat{H}_1\}$; $d = 1$;
4: for $r = l + 1$ to $n$ do
5:  $min = E[err(C_i \cup \hat{H}_r)]$;
6:  $tmp = \left(E[err(C_i)] \cdot |C_i| + \frac{\lambda_2}{(n - r + 1)\hat{H}_r}\right) / (|C_i| + 1)$;
7:  if $min < tmp$ then
8:   $d = r$;
9:   $C_i = C_i \cup \{\hat{H}_r\}$;
10:  else
11:   $S = S \cup \{d\}$;
12:   $l = d + 1$;
13:   $C_i = \{\hat{H}_l\}$; $d = l$;
14:  end if
15: end for
16: return $S$
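For concreteness, Algorithms 1 and 2 can be sketched in pure Python as follows. This is a simplified, illustrative implementation, not the authors' code: the even $\epsilon_1/\epsilon_2$ split, the mapping of noisy-sorted groups back to the original bin order, and the clamping of noisy counts used as denominators are our assumptions.

```python
import math
import random

rng = random.Random(0)

def lap(scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def group_err(group, lam):
    """Mean expected relative error E[err(C_i)] of a candidate group.
    Noisy counts can be near zero or negative, so denominators are clamped."""
    m = sum(group) / len(group)
    return sum((abs(h - m) + lam / len(group)) / max(h, 1e-9)
               for h in group) / len(group)

def ggs(h_sorted, lam2):
    """Algorithm 2 (GGS): 0-based indices of the last bin of each closed group."""
    n = len(h_sorted)
    s, d = [], 0
    cur = [h_sorted[0]]
    for r in range(1, n):
        lhs = group_err(cur + [h_sorted[r]], lam2)
        bound = lam2 / ((n - r) * max(h_sorted[r], 1e-9))  # Theorem 7 lower bound
        rhs = (group_err(cur, lam2) * len(cur) + bound) / (len(cur) + 1)
        if lhs < rhs:
            d = r
            cur.append(h_sorted[r])
        else:
            s.append(d)          # close the current group at its last bin
            d = d + 1
            cur = [h_sorted[d]]  # the rejected bin starts a new group
    return s

def sreb_gca(hist, eps, split=0.5):
    """Algorithm 1 (SReB_GCA) with eps = eps1 + eps2 (even split assumed)."""
    eps1, eps2 = eps * split, eps * (1 - split)
    lam1, lam2 = 1 / eps1, 1 / eps2
    noisy = [h + lap(lam1) for h in hist]            # eps1: private sorting key
    order = sorted(range(len(hist)), key=lambda i: noisy[i])
    cuts = ggs([noisy[i] for i in order], lam2)
    out = [0.0] * len(hist)
    start = 0
    for end in cuts + [len(hist) - 1]:               # close the final group
        idx = order[start:end + 1]
        mean = sum(hist[i] for i in idx) / len(idx)
        noise = lap(lam2) / len(idx)                 # eps2: one noisy mean per group
        for i in idx:
            out[i] = mean + noise
        start = end + 1
    return out

sanitized = sreb_gca([1, 2, 2, 3, 50, 52, 55, 200], eps=1.0)
```

As a sanity check, with a moderately large scale ($\lambda_2 = 100$) the greedy rule separates small bins from large ones, e.g., `ggs([1, 2, 3, 100, 110, 120], 100.0)` splits after the third bin; with a small scale the cheap noise makes fine-grained groups affordable.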
Moreover, we can obtain two corollaries related to SReB_GCA, Corollaries 1 and 2, from the perspectives of privacy and relative error, respectively.
Corollary 1.
SReB_GCA in Algorithm 1 satisfies ϵ-DP.
Proof. 
SReB_GCA in Algorithm 1 divides the privacy budget $\epsilon$ into $\epsilon_1$ and $\epsilon_2$: $\epsilon_1$ is used to perturb the original histogram in Line 2, and $\epsilon_2$ is used to perturb the mean value of each group in Lines 9–11. According to Property 2, SReB_GCA satisfies $(\epsilon_1 + \epsilon_2) = \epsilon$-DP. □
Corollary 2.
Given a group $C_i$ obtained by SReB_GCA in Algorithm 1 with mean $\bar{C}_i$ and $|C_i|$ bins, let $\hat{H}_j \in C_i$ be a bin in the group and $\tilde{H}_j$ the sanitization of $\hat{H}_j$. Then the relative error $E[err(\hat{H}_j)]$ is approximately $\frac{1}{\hat{H}_j}\left(|\hat{H}_j - \bar{C}_i| + \frac{\lambda_2}{|C_i|}\right)$, where $\frac{|\hat{H}_j - \bar{C}_i|}{\hat{H}_j}$ is the $RRE$ and $\frac{\lambda_2}{\hat{H}_j |C_i|}$ is the $NRE$, with $\lambda_2 = \frac{1}{\epsilon_2}$.
Proof. 
From Theorem 1, the result follows by taking $\lambda = \lambda_2$. □

6. Experimental Evaluation

To the best of our knowledge, little attention has been paid to DP histogram publishing for small bins; prior studies [28,29,31,32] consider only the average relative error over the whole histogram. Therefore, this paper sets up a benchmark method, denoted DP_BASE, which uses the Laplace mechanism to perturb the histogram directly and then publishes it. Under the same ϵ-DP, the availability of our SReB_GCA is compared with that of DP_BASE using single-value queries, range queries, and the data distribution. In addition, we compare our method with AHP [22], one of the classical methods in DP histogram publishing, in terms of the availability of the data distribution.

6.1. Experimental Settings

(1) Datasets
Waitakere, Search Log, NetTrace, and Social Network are used for experiments in this paper, which are commonly employed in histogram publishing [13,14,19]. Among them, Waitakere is a semi-synthetic dataset derived from the 2006 Census Meshblock Dataset of New Zealand, comprising a total population of 186,471 spread across 1340 meshblocks. Search Log is a synthetic dataset of search query logs featuring the keyword “Obama”, covering the period from 1 January 2004 to 9 August 2009. This timeframe is divided into 32,768 intervals, each lasting 90 min. Each interval represents a bin of the histogram, used to tally the number of queries within those 90-min segments. NetTrace documents IP-level network traces garnered from a university’s intranet, where every internal host aligns with a histogram bin to enumerate the connecting external hosts. The Social Network dataset archives the friendships of 11,000 social networking platform users, with each user profile detailing their friend count. We consolidate the attributes of these datasets in Table 1. Moreover, Figure 3 illustrates that the prevalence of smaller bins is significantly high across all these datasets. Throughout the graphical representations from Figure 3, Figure 4, Figure 5 and Figure 6, the y-axis employs a logarithmic scale with a base of 10.
(2) Utility metrics and parameter settings
In line with popular data analysis tasks, which involve checking data distributions and answering single-value or range queries [13], we use the mean relative error (MRE) and the Kullback–Leibler divergence (KLD) as metrics to assess the utility of the released histograms.
(a) Single-value query. The performance of the two methods on single-value queries under different privacy parameters ($\epsilon = 1, 0.1, 0.01$) is explored; the mean relative error over bins with the same true count is used to verify the accuracy of their reports and is denoted $MRE_{single}$:

$$MRE_{single} = \frac{1}{|Q_k|}\sum_{H_j \in Q_k} \frac{|H_j - \tilde{H}_j|}{\max(H_j, \delta)},$$

where $Q_k = \{H_j \mid H_j = k,\ H_j \in H\}$, $k \in [\min(H), \max(H)]$, and $\delta = 1$ in this paper.
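The single-value metric above can be sketched as a short helper (an illustrative implementation; the function name `mre_single` and the toy data are ours):

```python
def mre_single(H, H_tilde, k, delta=1.0):
    """Mean relative error over all bins whose true count equals k."""
    Q = [(h, ht) for h, ht in zip(H, H_tilde) if h == k]
    return sum(abs(h - ht) / max(h, delta) for h, ht in Q) / len(Q)

H       = [0, 0, 1, 1, 5]
H_tilde = [0.4, -0.2, 1.5, 0.5, 6.0]
assert abs(mre_single(H, H_tilde, 0) - 0.3) < 1e-9  # (0.4 + 0.2)/2, max(0, 1) = 1
assert abs(mre_single(H, H_tilde, 1) - 0.5) < 1e-9  # (0.5 + 0.5)/2
```

The `max(h, delta)` denominator is the paper's guard against division by zero for empty bins.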
(b) Range query. The performance of the two methods on range queries of varying lengths under different privacy parameters ($\epsilon = 1, 0.1, 0.01$) is explored; the mean relative error within the query range is used to validate the accuracy of their responses and is denoted $MRE_{range}$:

$$MRE_{range} = \frac{1}{|Q_i|}\sum_{q_j \in Q_i} \frac{|q_j - \tilde{q}_j|}{\max(q_j, \delta)},$$

where $Q_i$ denotes the set of all queries of length $i$, $q_j$ represents one query of length $i$, $\tilde{q}_j$ is the sanitized version of $q_j$, $i \in \{1, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500\}$, $j \in [1, |H| - i + 1]$, and $\delta = 1$ in this paper.
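A range query of length $i$ is a sum over a sliding window, so the metric can be sketched as follows (an illustrative implementation; the function name `mre_range` and the toy data are ours):

```python
def mre_range(H, H_tilde, i, delta=1.0):
    """Mean relative error over all length-i range (window-sum) queries."""
    n = len(H)
    total = 0.0
    for start in range(n - i + 1):
        q  = sum(H[start:start + i])         # true range count
        qt = sum(H_tilde[start:start + i])   # sanitized range count
        total += abs(q - qt) / max(q, delta)
    return total / (n - i + 1)

H       = [1, 2, 3, 4]
H_tilde = [1.0, 2.5, 2.5, 4.0]
# windows of length 2: sums 3, 5, 7 vs 3.5, 5.0, 6.5 -> (0.5/3 + 0 + 0.5/7)/3
assert abs(mre_range(H, H_tilde, 2) - 5.0 / 63.0) < 1e-9
```

Because per-bin noise partially cancels inside a window sum, longer ranges typically have smaller relative error, which matches the trend reported below.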
(c) Data distribution. The Kullback–Leibler divergence (KLD), which quantifies the difference between two distributions (a smaller KLD indicates higher similarity and a larger value greater dissimilarity), is investigated for the two methods under distinct privacy parameters ($\epsilon = 1, 0.1, 0.01$):

$$KLD = \sum_{j=1}^{|H|} H_j \log \frac{H_j}{\tilde{H}_j}.$$
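The KLD formula can be sketched directly (an illustrative implementation with our own conventions: bins with $H_j = 0$ are skipped under the usual $0 \log 0 = 0$ convention, and $\tilde{H}_j$ is assumed positive; in practice counts may also be normalized to probabilities first):

```python
import math

def kld(H, H_tilde):
    """KLD on counts as written above; zero true bins contribute nothing."""
    return sum(h * math.log(h / ht) for h, ht in zip(H, H_tilde) if h > 0)

H       = [4, 0, 2]
H_tilde = [2.0, 1.0, 4.0]
# 4*ln(4/2) + 2*ln(2/4) = 4*ln 2 - 2*ln 2 = 2*ln 2
assert abs(kld(H, H_tilde) - 2 * math.log(2)) < 1e-9
```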

6.2. Experimental Results

(1) Single-value query. Figure 4 illustrates the behavior of the SReB_GCA and DP_BASE methods as they handle single-value queries across various privacy parameter settings. Notably, our proposed SReB_GCA method demonstrates a more consistent mean relative error performance compared to the DP_BASE method across different bin sizes. Specifically, it significantly enhances the accuracy for smaller bins without imposing a substantial detriment to the error rates of larger bins. The experiments conducted on the four datasets, depicted in Figure 4a–d, collectively confirm the efficacy of the SReB_GCA method in achieving a balanced reduction in relative errors, particularly for bins with lower counts.
(2) Range query. Figure 5 illustrates the performance of the SReB_GCA and DP_BASE methods in response to range queries for varying privacy parameters. It indicates that the SReB_GCA method achieves higher accuracy across different range queries compared to the DP_BASE method. Furthermore, the experimental results from the four datasets presented in Figure 5a–d reveal an improving trend in range query accuracy as the privacy parameter ϵ increases.
(3) Data distribution. Based on the KLD metric, Figure 6 illustrates that the distribution generated by the SReB_GCA method is closer to the true data distribution than that of the DP_BASE method. Moreover, Table 2 shows that our SReB_GCA method achieves smaller KLD values than the AHP method of [22], indicating a distribution closer to the genuine data. AHP [22] is designed around absolute error, so its optimization objective differs from ours, but the two are comparable in terms of KLD for evaluating data distribution. Considering that [22] reports smaller KLD values across a range of privacy parameters ϵ than other studies [13,15,19], the SReB_GCA method introduced herein also excels in KLD performance over these preceding works, thereby producing sanitized data of superior quality.

7. Conclusions

To address the poor relative accuracy of small bins in histograms under ϵ-DP, we propose the SReB_GCA sanitization algorithm. SReB_GCA comprises a sorting stage and a greedy grouping stage: sorting from smallest to largest prioritizes the optimization of smaller bins, and the grouping incorporates a lower bound on the mean relative error to inform an effective grouping strategy. Experiments on four real-life and synthetic datasets demonstrate that SReB_GCA effectively balances the reconstruction and noise errors of small bins, striking a good trade-off between the utility of small and large bins. Together with the KLD results, this validates that SReB_GCA outperforms the baseline method (DP_BASE) and several classical DP histogram publishing techniques in terms of data utility. The feasibility of the proposed centralized differential privacy histogram publishing model has been validated on the Waitakere, Search Log, NetTrace, and Social Network datasets. However, the algorithm presented herein is tailored to static histogram publication; its generality under streaming release conditions warrants further in-depth investigation. Moreover, given the heightened sensitivity of medical and health data, we anticipate that the proposed algorithm will find suitable application scenarios in medical and health data privacy protection, thereby further propelling advancements in this field. Lastly, adapting the proposed algorithm to local differential privacy (LDP) scenarios, such as simple, high-dimensional, streaming, or dynamic LDP histogram publishing as seen in [33,34,35,36], presents an intriguing avenue for further study.

Author Contributions

The problem was conceived by J.C., Y.C., and Z.X. The theoretical analysis and verification were performed by J.C., S.Z., J.Q., Y.X., B.Z., W.F., Y.C., X.C., and Y.H. J.C., S.Z., J.Q., and Y.C. wrote the paper. Y.C. reviewed the grammar and structure of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China (No. 41971407); the Natural Science Foundation of Fujian Province, China (Nos. 2020J01571, 2016J01281); the Science and Technology Innovation Special Fund of Fujian Agriculture and Forestry University (No. CXZX2019119S); the Undergraduate Innovation and Entrepreneurship Training Program of Fujian Agricultural and Forestry University (No. 202410389320); and the Research Fund of Fujian University of Technology (No. GY-Z23210).

Data Availability Statement

The datasets used to support the results of this paper are cited in relevant places as references [13,14,19]; the Search Log, NetTrace, and Social Network datasets can be downloaded at https://github.com/michaelghay/vldb2010data (accessed on 1 September 2020), and the Waitakere dataset can be downloaded at https://cdm20045.contentdm.oclc.org/digital/collection/p20045coll20/id/466/rec/3 (accessed on 1 September 2020).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Torra, V. Data Privacy: Foundations, New Developments and the Big Data Challenge; Springer Press: Cham, Switzerland, 2017; pp. 1–21. [Google Scholar]
  2. Fung, B.C.M.; Wang, K.; Chen, R.; Yu, P.S. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv. 2010, 42, 1–53. [Google Scholar] [CrossRef]
  3. Sweeney, L.A. K-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzz. Knowl. Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
  4. Dwork, C. Differential privacy: A survey of results. In Proceedings of the International Conference on Theory and Applications of Models of Computation, Xi’an, China, 25–29 April 2008; pp. 1–19. [Google Scholar]
  5. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Theory of Cryptography Conference (TCC), New York, NY, USA, 4–7 March 2006; pp. 265–284. [Google Scholar]
  6. Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar]
  7. Huang, H.; Zhang, D.; Xiao, F.; Wang, K.; Gu, J.; Wang, R.; Wang, R. Privacy-preserving approach PBCN in social network with differential privacy. IEEE Trans. Netw. Serv. Manag. 2020, 17, 931–945. [Google Scholar] [CrossRef]
  8. Ou, L.; Qin, Z.; Liao, S.; Hong, Y.; Jia, X. Releasing correlated trajectories: Towards high utility and optimal differential privacy. IEEE Trans. Dependable Secur. Comput. 2020, 17, 1109–1123. [Google Scholar] [CrossRef]
  9. Ying, C.; Jin, H.; Wang, X.; Luo, Y. Double insurance: Incentivized federated learning with differential privacy in mobile crowdsensing. In Proceedings of the 2020 International Symposium on Reliable Distributed Systems (SRDS), Shanghai, China, 21–24 September 2020; pp. 81–90. [Google Scholar]
  10. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar]
  11. Bonomi, L.; Xiong, L. Mining frequent patterns with differential privacy. Proc. VLDB Endow. 2013, 6, 1422–1427. [Google Scholar] [CrossRef]
  12. Barak, B.; Chaudhuri, K.; Dwork, C.; Kale, S.; McSherry, F.; Talwar, K. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the Symposium on Principles of Database Systems (PODS), Beijing, China, 11–13 June 2007; pp. 273–282. [Google Scholar]
  13. Acs, G.; Castelluccia, C.; Chen, R. Differentially private histogram publishing through lossy compression. In Proceedings of the International Conference on Data Mining (ICDM), Washington, DC, USA, 10–13 December 2012; pp. 1–10. [Google Scholar]
  14. Hay, M.; Rastogi, V.; Miklau, G.; Suciu, D. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow. 2010, 3, 1021–1032. [Google Scholar] [CrossRef]
  15. Kellaris, G.; Papadopoulos, S. Practical differential privacy via grouping and smoothing. Proc. VLDB Endow. 2013, 6, 301–312. [Google Scholar] [CrossRef]
  16. Li, C.; Hay, M.; Miklau, G.; McGregor, A. Optimizing linear counting queries under differential privacy. In Proceedings of the Symposium on Principles of Database Systems (PODS), Indianapolis, IN, USA, 6–11 June 2010; pp. 123–134. [Google Scholar]
  17. Rastogi, V.; Nath, S. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the International Conference on Management of Data (SIGMOD), Indianapolis, IN, USA, 6–10 June 2010; pp. 735–746. [Google Scholar]
  18. Xiao, X.; Wang, G.; Gehrke, J. Differential privacy via wavelet transform. In Proceedings of the International Conference on Data Engineering (ICDE), Long Beach, CA, USA, 1–6 March 2010; pp. 225–236. [Google Scholar]
  19. Xu, J.; Zhang, Z.; Xiao, X.; Yu, G. Differentially private histogram publication. In Proceedings of the International Conference on Data Engineering (ICDE), Arlington, VA, USA, 1–5 April 2012; pp. 32–43. [Google Scholar]
  20. Yuan, G.; Zhang, Z.; Winslett, M.; Xiao, X.; Yang, Y.; Hao, Z. Low-rank mechanism: Optimizing batch queries under differential privacy. Proc. VLDB Endow. 2012, 5, 1352–1363. [Google Scholar] [CrossRef]
  21. Nelson, B.; Reuben, J. SoK: Chasing accuracy and privacy, and catching both in differentially private histogram publication. Trans. Data Priv. 2020, 13, 201–245. [Google Scholar]
  22. Zhang, X.; Chen, R.; Xu, J.; Meng, X.; Xie, Y. Towards accurate histogram publication under differential privacy. In Proceedings of the International Conference on Data Mining (SDM), Philadelphia, PA, USA, 24–26 April 2014; pp. 587–595. [Google Scholar]
  23. Ligett, K.; Neel, S.; Roth, A.; Waggoner, B.; Wu, Z.S. Accuracy first: Selecting a differential privacy level for accuracy-constrained ERM. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Red Hook, NY, USA, 4–9 December 2017; pp. 2563–2573. [Google Scholar]
  24. Tao, T.; Li, S.; Huang, J.; Hou, S.; Gong, H. A Symmetry Histogram Publishing Method Based on Differential Privacy. Symmetry 2023, 15, 1099. [Google Scholar] [CrossRef]
  25. Chen, Q.; Ni, Z.; Zhu, X.; Xia, P. Differential privacy histogram publishing method based on dynamic sliding window. Front. Comput. Sci. 2023, 17, 174809. [Google Scholar] [CrossRef]
  26. Zou, Y.; Shan, C. Delay-tolerant privacy-preserving continuous histogram publishing method. In Proceedings of the 7th International Conference on Big Data and Computing, Shenzhen, China, 27–29 May 2022; pp. 88–95. [Google Scholar]
  27. Lei, H.; Li, S.; Wang, H. A weighted social network publishing method based on diffusion wavelets transform and differential privacy. Multimed. Tools Appl. 2022, 81, 20311–20328. [Google Scholar] [CrossRef]
  28. Shoaran, M.; Thomo, A.; Weber, J. Differential privacy in practice. In Proceedings of the Workshop on Secure Data Management (SDM), Istanbul, Turkey, 27 August 2012; pp. 14–24. [Google Scholar]
  29. Xiao, X.; Bender, G.; Hay, M.; Gehrke, J. iReduct: Differential privacy with reduced relative errors. In Proceedings of the International Conference on Management of Data (SIGMOD), Athens, Greece, 12–16 June 2011; pp. 229–240. [Google Scholar]
  30. McSherry, F. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the International Conference on Management of Data (SIGMOD), Providence, RI, USA, 29 June–2 July 2009; pp. 19–30. [Google Scholar]
  31. Liu, H.; Wu, Z.; Peng, C.; Tian, F.; Lu, H. Adaptive Gaussian mechanism based on expected data utility under conditional filtering noise. KSII Trans. Internet Inf. Syst. 2018, 12, 3497–3515. [Google Scholar]
  32. Chen, Y.; Xu, Z.; Chen, J.; Jia, S. B-DP: Dynamic collection and publishing of continuous check-in data with best-effort differential privacy. Entropy 2022, 24, 404. [Google Scholar] [CrossRef] [PubMed]
  33. Bassily, R.; Smith, A. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), Portland, OR, USA, 14–17 June 2015; pp. 127–135. [Google Scholar]
  34. Ren, X.; Yu, C.M.; Yu, W.; Yang, S.; Yang, X.; McCann, J.A.; Yu, P.S. LoPub: High-dimensional crowdsourced data publication with local differential privacy. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2151–2166. [Google Scholar] [CrossRef]
  35. Li, H.; Xiong, L.; Jiang, X.; Liu, J. Differentially private histogram publication for dynamic datasets: An adaptive sampling approach. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), Melbourne, Australia, 19–23 October 2015; pp. 1001–1010. [Google Scholar]
  36. Ren, X.; Shi, L.; Yu, W.; Yang, S.; Zhao, C.; Xu, Z. LDP-IDS: Local differential privacy for infinite data streams. In Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data (SIGMOD), Philadelphia, PA, USA, 12–17 June 2022; pp. 1064–1077. [Google Scholar]
Figure 1. $\ln(E[err(H_j)])$ changes with $H_j$ when $|C_i|\bar{C}_i < \lambda$, where A represents the point at which $RRE = 0$ and $\ln(NRE) = 0.693$.
Figure 2. $\ln(E[err(H_j)])$ changes with $H_j$ when $|C_i|\bar{C}_i \ge \lambda$, where B represents the point at which $RRE = 0$ and $\ln(NRE) = -2.303$.
Figure 3. The ratio of bins.
Figure 4. Single-value query on four datasets. (a) Waitakere; (b) Search Log; (c) NetTrace; and (d) Social Network.
Figure 5. Range query on four datasets. (a) Waitakere; (b) Search Log; (c) NetTrace; and (d) Social Network.
Figure 6. KLD on four datasets.
Table 1. Experimental dataset characteristics.

Dataset         |H|      Mean    Variance   Count Range
Waitakere       7725     24.13   4764.57    [0, 467]
Search Log      32,768   10.25   577.31     [0, 496]
NetTrace        65,536   0.39    91.01      [0, 1423]
Social Network  11,342   59.49   2995       [1, 1678]
Table 2. KLD with SReB_GCA and AHP.

Dataset         Method     ϵ = 1    ϵ = 0.1   ϵ = 0.01
Waitakere       AHP        0.018    0.203     0.467
                SReB_GCA   0.0004   0.056     0.171
Search Log      AHP        0.054    0.103     0.189
                SReB_GCA   0.0001   0.009     0.099
NetTrace        AHP        0.153    0.572     1.229
                SReB_GCA   0.004    0.092     0.252
Social Network  AHP        0.071    0.309     0.825
                SReB_GCA   0.0001   0.003     0.099

Chen, J.; Zhou, S.; Qiu, J.; Xu, Y.; Zeng, B.; Fang, W.; Chen, X.; Huang, Y.; Xu, Z.; Chen, Y. A Histogram Publishing Method under Differential Privacy That Involves Balancing Small-Bin Availability First. Algorithms 2024, 17, 293. https://doi.org/10.3390/a17070293