A Range Query Scheme for Spatial Data with Shuffled Differential Privacy

Li, Kaixuan; Zhang, Hua; Xu, Yanxin; Liu, Zhenyan

doi:10.3390/math12131934


by Kaixuan Li †, Hua Zhang *,†, Yanxin Xu and Zhenyan Liu

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2024, 12(13), 1934; https://doi.org/10.3390/math12131934

Submission received: 24 April 2024 / Revised: 19 June 2024 / Accepted: 19 June 2024 / Published: 21 June 2024

(This article belongs to the Section Computational and Applied Mathematics)


Abstract: Existing high-dimensional or multi-dimensional geographic spatial datasets are large. When third-party servers collect and publish them, privacy protection is required to prevent sensitive information from being leaked. Local differential privacy can be used to protect location-sensitive information during range queries. However, the accuracy of a range query based on local differential privacy is affected by the distribution and density of the spatial data. To account for these distribution and density characteristics, we designed a $_{dp}KD$ tree that supports high-precision range queries with shuffled differential privacy, and designed an algorithm, KDRQ, for range queries based on shuffled differential privacy. First, we employed the $_{dp}KD$ tree to divide the data. Then, we shuffled the data based on $SRR_{Q}$ and reconstructed the tree. Finally, we used the SDRQ algorithm to respond to the range query. The experimental results show that the query accuracy of the KDRQ algorithm was 1–4.5 times higher than that of the existing algorithms RAPPOR, PSDA and GT-R under the same privacy budget.

Keywords: shuffled differential privacy; high-dimensional spatial data; range query

MSC: 68P27

1. Introduction

With the development of big data, massive amounts of spatial data are generated, and users’ sensitive location information may be leaked when third-party servers collect and publish it. For example, when a server collects and publishes users’ location information, it may be compromised by an attacker, or it may itself learn the users’ entertainment, shopping, social and other daily behavior patterns, so that their personal privacy is leaked. Recently, a Telegram bot was suspected of having leaked 4.5 billion pieces of express delivery information, including users’ names, telephone numbers and home addresses [1]. Therefore, it is necessary to protect users’ spatial data privacy when publishing spatial data.

Differential privacy [2] was proposed by Dwork et al. in 2006 to protect data privacy. Central differential privacy [3] adds noise to the statistical results of the data. It assumes that the third-party servers are fully trusted; it offers high data availability, but its privacy protection is weaker than that of the other models. In practical applications, it is difficult to find fully trusted third-party servers, and thus, local differential privacy [4] came into being: users perturb their data locally before sending it to the third-party servers. Compared with central differential privacy, the degree of privacy protection is higher. Shuffled differential privacy [5] lies between the two. It introduces a shuffler that breaks the correspondence between the data and the user ID, which can improve the data availability with the same intensity of privacy protection as local differential privacy.

High-dimensional spatial data mainly refers to data with more than two dimensions, including user location information, attribute data, relationship information and time update information. Similar to relational data [6], high-dimensional spatial data face the “curse of dimensionality” [7]: the communication cost increases with the distribution and scale of the data points, and the signal-to-noise ratio decreases, both of which reduce the data availability. Therefore, it is necessary to construct an efficient index structure for dimensionality reduction. One spatial index structure, the UB-tree, is a combination of a Z-order and a $B^{*}$ data index structure proposed by Bayer et al. [8]. Li et al. [9] proposed an adaptive cloaking region division to resist threats under location-based services. However, actual spatial location data have different distribution densities. Different from the previous work based on the B-tree, Wang et al. [10] proposed a new domain decomposition mechanism, privNUD, with local differential privacy, but it only performs adaptive range queries on one-dimensional data.

At present, work on range queries with differential privacy protection mainly focuses on one-dimensional or two-dimensional spatial data. Cormode et al. [11] proposed private spatial decompositions (PSDs) based on central differential privacy, which divide the spatial data, report the number of points in each region and then answer the range query. Zhang et al. [12] proposed PrivTree, which uses a quadtree for hierarchical decomposition and introduces a bias to weaken the influence of noise. Different from the above spatial range query work with central differential privacy, Kim et al. [13] were the first to use local differential privacy in indoor positioning systems to estimate the density of a region. Kulkarni et al. [14] proposed and analyzed different methods to answer one-dimensional range queries with local differential privacy. Both of these works share a problem: applying the schemes directly to multi-dimensional scenarios reduces the statistical accuracy because the density of the data is not considered. Du et al. [15] proposed an adaptive hierarchical decomposition protocol (AHEAD), which controls the added noise by adaptively adjusting the established tree structure. Ahuja et al. [16] designed a method to provide accurate queries based on variational auto-encoders. The scheme improves the data availability for range queries in multi-dimensional scenarios, but this inevitably reduces the privacy protection intensity. The above are the works on range queries with local or central differential privacy.

Compared with the range query schemes with local differential privacy above, research on shuffled differential privacy currently focuses on frequency and mean statistics. Shuffled differential privacy was proposed by Bittau et al. [17] in 2017 and achieves a good tradeoff between the degree of privacy protection and data availability. Balle et al. [5] proposed the real-sum protocol and analyzed the privacy amplification effect in the shuffled model. Wang et al. [18] studied the shuffle model from the two aspects of algorithm and model security, and proposed a more secure PEOS model. Cheu et al. [19] presented a protocol to reduce the message complexity of histograms. We compared our range query scheme based on the $_{dp}KD$ tree (KDRQ) with related work by other researchers, as shown in Table 1. These works do not involve range queries oriented to spatial data, and thus, we used an index structure to perform range queries on high-dimensional spatial data with shuffled differential privacy. If an existing scheme is applied directly to the shuffled differential privacy model, the data distribution law is not retained, which leads to low query accuracy.

With shuffled differential privacy, we designed a new index structure to improve the query accuracy and could achieve adaptive range queries. Compared with the existing indexing schemes for multi-dimensional spatial data, we constructed an index structure that supports range queries for high-dimensional spatial data, which is more in line with the distribution characteristics of data and can support fast high-dimensional range queries. The specific contributions are as follows:

(1): A secure range query scheme for high-dimensional geographic spatial data with shuffled differential privacy is proposed. It solves the problem of low query accuracy and low availability that results from directly extending local differential privacy to the shuffled differential privacy scenario.
(2): An index that supports high-precision range queries is built. The index can adaptively divide the spatial region and improve the accuracy of range queries on spatial data. The algorithm can also be used to solve other range query problems with shuffled differential privacy.
(3): Security analysis showed that our KDRQ scheme satisfies the requirements of shuffled differential privacy. Experimental analysis on the Landmark and Checkin datasets showed that the KDRQ method was superior to the existing methods in terms of range query accuracy.

The organization of this paper is as indicated below. In Section 2, we introduce the related work. In Section 3, we introduce the preliminaries used in this paper. In Section 4, we present the system model of the scheme. Then, Section 5 provides a detailed introduction to the scheme we have proposed. In Section 6, we present experimental results to show the query accuracy of KDRQ. Finally, we summarize this paper in Section 7.

2. Related Work

Most of the existing publishing schemes for spatial data with differential privacy build grids [20], and then users perturb the units in the grid based on random responses. However, because the scheme perturbs each location in the same way, it does not provide different optimization schemes for different spatial data. Cormode et al. [11] proposed a privacy space decomposition scheme with central differential privacy, which divides the spatial data into smaller regions, reports the statistical number of points in each region, and answers the query. Zhang et al. [12] proposed a hierarchical decomposition differential privacy algorithm PrivTree based on quadtree to divide regions. However, when faced with non-uniform data sets, the query results are not ideal due to the unbalanced data points. Chen et al. [21] studied the aggregation of spatial data with local differential privacy, but they did not design an effective index structure. Wang et al. [22] proposed a segmentation mechanism to collect and publish high-dimensional spatial data, and used noise distribution to disturb points. However, due to the influence of the segmentation mechanism on data at a single location, the error was large. This study solved the above problems and designed an index of high-precision range query with shuffled differential privacy to achieve adaptive range queries.

3. Preliminaries

3.1. Range Query

A range query mainly refers to counting the number of users whose data fall within a given range. The following parts describe one-dimensional, two-dimensional, three-dimensional and, more generally, n-dimensional range queries.

One-dimensional data: The core idea of a range query on one-dimensional data based on differential privacy is to count the user data points in the range $RQ = [a, b]$. Figure 1 is a simplified instance of a range query on one-dimensional data. Suppose that three one-dimensional data points, $p_{1}$, $p_{2}$ and $p_{3}$, are stored in the cloud server; the cloud server can answer the range query $[3, 4]$ as well as the larger range queries $[2, 5]$ and $[1, 6]$. For a given range query $[a, b]$, the cloud server returns the count of data points within $[a, b]$.
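As a minimal sketch of the one-dimensional case (without any privacy noise), the count for $[a, b]$ can be computed with two binary searches over the sorted points. The point values below are hypothetical stand-ins for $p_{1}$, $p_{2}$ and $p_{3}$, and the function name is an assumption made for illustration.

```python
from bisect import bisect_left, bisect_right


def range_count_1d(points, a, b):
    """Count the points p with a <= p <= b."""
    ordered = sorted(points)
    return bisect_right(ordered, b) - bisect_left(ordered, a)


# Hypothetical values for p1, p2 and p3 (not taken from Figure 1).
points = [3.5, 2.4, 5.5]
counts = [range_count_1d(points, lo, hi) for lo, hi in [(3, 4), (2, 5), (1, 6)]]
print(counts)
```

Widening the query interval can only keep or increase the count, which matches the nested queries $[3, 4]$, $[2, 5]$ and $[1, 6]$ in the text.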

Two-dimensional data: The core idea of a range query on two-dimensional data based on differential privacy is to determine whether the query point q lies in $RQ = [a, b] \times [c, d]$. Figure 2 is a simplified instance of a range query on two-dimensional data. Assume that two orthogonal vectors $v_{1} \perp v_{2}$ are randomly generated, and the projection of each point is calculated according to $Pro_{v_{i}}(p) = \frac{\vec{p} \cdot \vec{v_{i}}}{|\vec{v_{i}}|}$, $i = 1, 2$. Given the range query $RQ = [a, b] \times [c, d]$, the cloud server determines whether the point is within $RQ$, that is, whether $p_{v_{1}} \in [a, b]$ and $p_{v_{2}} \in [c, d]$. Then, the query result of $RQ$ can be expressed as

$$RQ_{[a, b] \times [c, d]} = \sum_{i = 1}^{n} I_{a \leq x_{i} \leq b,\, c \leq y_{i} \leq d}, \tag{1}$$

where I is the indicator function: a value of 1 means that the spatial location of the i-th user is within $RQ$, and a value of 0 means that it is not.

Three-dimensional data and higher-dimensional data: The core idea of a range query on three-dimensional data based on differential privacy is to determine whether the query point q lies in $RQ = [a, b] \times [c, d] \times \cdots \times [k, l]$. Figure 3 is a simplified instance of a range query on three-dimensional data. Suppose that three mutually orthogonal vectors $v_{1} \perp v_{2} \perp v_{3}$ are randomly generated, and the projection of each point onto them is calculated according to $Pro_{v_{i}}(p) = \frac{\vec{p} \cdot \vec{v_{i}}}{|\vec{v_{i}}|}$, $i = 1, 2, 3$. Given the range query $RQ = [a, b] \times [c, d] \times [e, f]$, the cloud server determines whether the point is within $RQ$, that is, whether $p_{v_{1}} \in [a, b]$, $p_{v_{2}} \in [c, d]$ and $p_{v_{3}} \in [e, f]$. Similarly, for a query $RQ$ (hypercube) on higher-dimensional data, the query result can be expressed as follows:

$$RQ_{[a, b] \times [c, d] \times \cdots \times [k, l]} = \sum_{i = 1}^{n} I_{a \leq x_{i} \leq b,\, c \leq y_{i} \leq d}. \tag{2}$$
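The projection and indicator-sum steps can be sketched as follows for any number of dimensions. This is an illustration under stated assumptions rather than the paper's implementation: the helper names, the sample points and the choice of axis vectors are all hypothetical, and no privacy noise is added.

```python
import math


def project(p, v):
    """Scalar projection of point p onto vector v: (p . v) / |v|."""
    dot = sum(pi * vi for pi, vi in zip(p, v))
    return dot / math.sqrt(sum(vi * vi for vi in v))


def range_count(points, axes, ranges):
    """Sum of indicator values: 1 for each point whose projections all fall in range."""
    count = 0
    for p in points:
        coords = [project(p, v) for v in axes]
        if all(lo <= c <= hi for c, (lo, hi) in zip(coords, ranges)):
            count += 1
    return count


# Hypothetical 2-D data with orthogonal (not necessarily unit-length) axes.
points = [(1, 1), (2, 3), (4, 4)]
axes = [(2, 0), (0, 2)]
result = range_count(points, axes, [(0, 2), (0, 3)])
print(result)
```

Because the projection divides by $|\vec{v_{i}}|$, the length of each axis vector does not affect the result; only its direction matters. The same function handles three or more dimensions when given one axis and one interval per dimension.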

3.2. Local Differential Privacy

Under local differential privacy, each user perturbs their data locally before sending it to the third-party server. The specific mathematical definition of local differential privacy is given as follows:

Definition 1 (($\epsilon, \delta$)-local differential privacy). Given a user’s spatial location data and a randomized algorithm M, for any two spatial data $g_{i}$ and $g_{i}^{'}$, the randomized algorithm M satisfies ($\epsilon, \delta$)-local differential privacy if and only if all possible outputs y satisfy the following inequality:

$$\Pr(M(g_{i}) = y) \leq e^{\epsilon} \cdot \Pr(M(g_{i}^{'}) = y) + \delta. \tag{3}$$
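To make the inequality concrete, the sketch below uses binary randomized response, the basic mechanism that RAPPOR-style protocols build on. It is not the paper's $LT$ algorithm; the function names are assumptions. For this mechanism the worst-case ratio of output probabilities equals $e^{\epsilon}$, so the definition holds with $\delta = 0$.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true bit."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)


def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (e^eps + 1), otherwise flip it."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def worst_case_ratio(epsilon):
    """Largest ratio Pr[M(g) = y] / Pr[M(g') = y] over inputs g, g' and outputs y."""
    p = keep_probability(epsilon)
    return p / (1.0 - p)


print(worst_case_ratio(0.5), math.exp(0.5))
```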

The existing local differential privacy protocols are mainly based on the randomized aggregatable privacy-preserving ordinal response (RAPPOR) [23] to implement local perturbations. The local transformation algorithm in this paper is based on our previous work [24].

3.3. Shuffled Differential Privacy

Compared with central differential privacy and local differential privacy, shuffled differential privacy [17,25] consists of three stages: local perturbation, shuffling and analysis. The specific mathematical definition of shuffled differential privacy is as follows:

Definition 2 (($\epsilon, \delta$)-shuffled differential privacy). Given a user’s spatial location data, a random shuffling algorithm $Sh$ and a local perturbation algorithm M, for any two spatial data $g_{i}$ and $g_{i}^{'}$, the algorithm $S = Sh \circ M$ satisfies ($\epsilon, \delta$)-shuffled differential privacy if and only if all possible shuffled outputs y satisfy the following inequality:

$$\Pr(S(g_{i}) = y) \leq e^{\epsilon} \cdot \Pr(S(g_{i}^{'}) = y) + \delta. \tag{4}$$

The data analysis under shuffled differential privacy is shown in Figure 4: each user perturbs their spatial data with a random oracle to obtain a report, giving $r_{1}, r_{2}, \cdots, r_{n}$. Then, multiple shufflers shuffle the reports and send them to the server. The server computes the query results and finally sends them to the data analyst.
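The three stages (local perturbation, shuffling, analysis) can be sketched end to end for a single binary attribute. This is an illustrative pipeline under simplifying assumptions: one shuffler instead of several, randomized response in place of the paper's local transformation, and hypothetical function names.

```python
import math
import random


def perturb(bit, epsilon, rng):
    """Local stage: randomized response on one bit."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p else 1 - bit


def shuffle_reports(reports, rng):
    """Shuffler stage: a uniformly random permutation hides who sent which report."""
    shuffled = list(reports)
    rng.shuffle(shuffled)
    return shuffled


def estimate_count(reports, epsilon):
    """Analysis stage: unbiased estimate of how many users hold bit 1."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    n = len(reports)
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700
reports = [perturb(b, 1.0, rng) for b in true_bits]
shuffled = shuffle_reports(reports, rng)
print(estimate_count(shuffled, 1.0))
```

Shuffling does not change the estimate, because the estimator depends only on the multiset of reports; what it removes is the link between a report and its sender.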

4. System Model

4.1. System Model

As shown in Figure 5, the system model involves four participants: query user, data owner, shuffler and cloud server.

Query users: They are generally the data owner themself or other legitimate users who have some completely trusted internal connection with the data owner. The query user hopes to search for the content that they are interested in within the entire datasets provided by all data owners and is not willing to disclose the relevant query results. In order to achieve this goal, the authorized user sends the query to the cloud server through the secure channel in this scheme. After receiving the query request, the server runs the query algorithm on the index tree and returns the query result to the user.

Data owners: The n data owners are willing to outsource their data to the cloud server under differential privacy protection, while allowing authorized query users to query all data. In this scheme, each data owner has their own spatial data $g_{i}$. They first locally perturb their location data and then send the message to the shuffler.

Shuffler: The shuffler is an independent semi-trusted server that performs secure shuffle operations without learning the data. After receiving the users’ locally perturbed data, it shuffles the received data.

Cloud server: The cloud server is untrusted. The cloud server builds an empty $_{dp}KD$ tree and shares the index with the data owners. Once the data released by the shuffler are received, the tree is reconstructed and the final statistical results are obtained. Finally, the query results are returned to the query user. In addition, when a data owner sends update information, the cloud server should update its stored index synchronously.

The cloud server in this scheme is untrusted. From the user’s view, other parties in the system model, including the shuffler (auxiliary server), the cloud server and other data owners, may be adversaries. We assume that all participants have the same level of background knowledge, and we need to consider the consequences of collusion between different parties. There are three important adversarial cases: the server acting alone; collusion between the server and other users; and collusion between the server and the shuffler. In particular, if the server colludes with the auxiliary server or other users, the model reduces to local differential privacy.

4.2. Problem Definition

Given the spatial data $g_{i}$ of n data owners, the shufflers and the cloud server, each data owner (user) perturbs their data via the $LT$ algorithm and sends $LT(g_{i})$ to the shuffler. The shuffler shuffles the received n inputs and outputs the results. Formally, the shuffled mechanism is defined as $Sh(D) = SRR_{Q}(LT(g_{1}), \cdots, LT(g_{n}))$. The server performs statistical analysis based on the shuffled data, that is, the range query algorithm is $SDRQ: RQ(v_{i}) \to \tilde{RQ}$. The problem solved in this study was to design a spatial range query scheme that satisfies shuffled differential privacy, obtains query results with as high an accuracy as possible and supports adaptive partitioning.

4.3. The Design Goal of the Spatial Query Scheme

In order to achieve accurate and secure range queries over privacy-preserving cloud data by multiple data owners, our scheme should meet the following goals and security requirements:

(1): Support multiple data owners: multiple data owners can outsource their data safely and conveniently, and the complexity of users to create queries is almost unaffected by the number of data owners.
(2): Ideal query results: cloud servers should help users to query high-precision query results in outsourced datasets by all data owners.
(3): Efficiency and timeliness: The cloud server receives a massive number of geospatial data queries, and the spatial data are frequently updated. Given these characteristics, it is necessary to construct an index that supports efficient queries and updates. This will not only improve the user’s query experience but also provide results closer to reality.
(4): Security: aiming at the range query requirements of outsourcing geographic data with shuffled differential privacy protection, we analyzed the potential attacks and the security of the scheme, which is divided into the following points:

Data confidentiality: the content of the outsourced datasets by the data owner should be protected and should not be known to the cloud server or other unauthorized query users.

Query privacy: The query content in the user’s query is sent to the cloud server through a secure channel and will not be leaked to any adversary. Query privacy is therefore not considered further in this scheme.

5. Range Query over Spatial Data

5.1. Overview

With the protection of shuffled differential privacy, we propose a hierarchical adaptive division method that divides spatial data according to the law of data distribution (Algorithm 1). The specific steps are as follows:

(1): Data partitioning (constructing the $_{dp}KD$ tree): Considering sparse and dense regions, the $_{dp}KD$ tree is constructed to divide the spatial datasets. Algorithms 2 and 4 segment and index the spatial data based on the $KD$ tree and the quadtree, respectively.
(2): Shuffling the data and reconstructing the $_{dp}KD$ tree: Each user, as a data owner, first traverses the $_{dp}KD$ tree to find the path containing their own location. After finding the leaf node, the weights on the path from the leaf node to the root node are set to 1, and the weights on all other paths are set to 0. Then, the user randomly samples a layer and generates a vector. After locally perturbing their location data through the $LT$ method, the user sends it to the shufflers. The shufflers shuffle the data and send them to the server. Finally, the server reconstructs the $_{dp}KD$ tree T after collecting the reports of all users.
(3): Response range query: T is traversed, a range query is performed based on the $SDRQ$ algorithm and the final range query result is obtained.
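The user-side part of step (2) can be sketched as below: the user marks the nodes whose cell contains their location with weight 1, samples one layer at random, and keeps that layer's 0/1 vector for perturbation. The `Node` class, the half-open cell convention and the two-level example tree are assumptions made for illustration; the perturbation by $LT$ is omitted.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    box: tuple                      # ((x_lo, x_hi), (y_lo, y_hi)), half-open cells
    children: list = field(default_factory=list)


def contains(box, point):
    return all(lo <= x < hi for x, (lo, hi) in zip(point, box))


def layers(root):
    """Nodes of the tree grouped by depth, root layer first."""
    out, level = [], [root]
    while level:
        out.append(level)
        level = [child for node in level for child in node.children]
    return out


def encode(root, point, rng):
    """Sample a layer and return (layer index, weight vector for that layer)."""
    tree_layers = layers(root)
    depth = rng.randrange(len(tree_layers))
    weights = [1 if contains(node.box, point) else 0 for node in tree_layers[depth]]
    return depth, weights


# Hypothetical tree: the root cell split once at x = 5.
root = Node(((0, 10), (0, 10)), [Node(((0, 5), (0, 10))), Node(((5, 10), (0, 10)))])
depth, weights = encode(root, (7, 3), random.Random(0))
print(depth, weights)
```

In a balanced tree whose cells partition the space, exactly one entry of the sampled vector is 1, which is what makes the vector suitable for a frequency-style local perturbation.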

Algorithm 1: $KDRQ$

Require: The dataset $D = d_{1}, d_{2}, \cdots, d_{n}$, range query $RQ$, privacy budget $\epsilon$
Ensure: The query results $\tilde{RQ}$
1: // Step 1: Partition
2: Map the data points of the original dataset to the data space
3: $P \leftarrow {}_{dp}KD(D)$
4: Construct the $_{dp}KD(D)$ tree T
5: // Step 2: Shuffle and reconstruct the tree
6: Initialize the dataset
7: for each user $g_{i}$ in T do
8:     $(l_{i}, y_{i}) \leftarrow SRR_{Q}(g_{i}, \epsilon)$
9: end for
10: if $l_{i} = l$ then
11:     for node $v_{i}$ in T do
12:         $N^{*}(v_{i}) \leftarrow y_{i} + N(v_{i})$
13:     end for
14: else
15:     if $N(v_{i}) > ds$ then
16:         $T \leftarrow QP(v_{i})$
17:     end if
18: end if
19: return T
20: // Step 3: Range query
21: $T \leftarrow \text{non-negativity}(T)$
22: Traverse T from top to bottom
23: $\tilde{RQ} \leftarrow SRQ(RQ, v)$
24: return $\tilde{RQ}$

5.2. User Data Division

The data are divided by constructing a $_{dp}KD$ tree. First, the important geographic location information $g_{i}$ in the user spatial data of the dataset is selected for division. Compared with the quadtree, the dimension of the spatial data can be effectively reduced. Then, each dimension in $g_{i}$ is divided according to the selected dividing standard $ds$, in which sparse and dense spaces are considered and the density is calculated. The dividing standard is selected according to the structure of the $_{dp}KD$ tree. Finally, the division proceeds recursively until each leaf node contains only one point, at which point the partition terminates. The construction of the $_{dp}KD$ tree is illustrated as follows (Figure 6): there are six points: (1.2,4), (4.5,2.2), (5,8), (6,10), (8.6,9) and (12,11). The dimension with the largest variance is selected to divide the space, that is, y = 8, and the space is then divided recursively with the dividing standards x = 1.2 and x = 8.6 until the number of points in each subspace is 1. In particular, to determine whether the spatial data contain sparse and dense spaces, a density threshold is calculated; a space whose density is less than the threshold is classified as a sparse space, and then the space continues to be divided. The specific steps are shown in Algorithm 2.
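A generic KD-style split can be sketched as follows: pick the dimension with the largest variance, split at a median value, and recurse until each cell holds one point. This is a plain KD-tree sketch under assumptions, not the $_{dp}KD$ construction itself: it ignores the density threshold, uses the lower median as the dividing value, and runs on hypothetical points rather than those of Figure 6.

```python
from statistics import median_low, pvariance


def choose_split(points):
    """Return (axis, value): the largest-variance axis and its lower median."""
    dims = len(points[0])
    variances = [pvariance([p[d] for p in points]) for d in range(dims)]
    axis = max(range(dims), key=lambda d: variances[d])
    return axis, median_low([p[axis] for p in points])


def build(points):
    """Recursively split until a cell holds at most one point."""
    if len(points) <= 1:
        return {"points": points}
    axis, value = choose_split(points)
    left = [p for p in points if p[axis] <= value]
    right = [p for p in points if p[axis] > value]
    if not right:  # remaining points share the split coordinate; stop here
        return {"points": points}
    return {"axis": axis, "value": value, "left": build(left), "right": build(right)}


# Hypothetical points: the y coordinates vary far more than the x coordinates.
sample = [(0, 0), (0, 10), (1, 20), (1, 30)]
tree = build(sample)
print(tree["axis"], tree["value"])
```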

Algorithm 2: $_{dp}KD$

Require: The dataset D, user’s location $g_{i}$
Ensure: Preliminary data division P
1: if $g_{i} \neq \emptyset$ then
2:     Build cells as the root nodes
3:     Divide the space $g_{i}$ according to the selected dividing standard $ds$
4:     for $P_{i} \in P_{1}, P_{2}, \cdots, P_{d}$ do
5:         Count the amount of data space $C_{max}$
6:         if $C_{max} \neq 1$ then
7:             $F = \emptyset$ ▹ F is the sparse space
8:             Count the density $den_{Axis}$ of the data space ▹ $den_{Axis} = C_{Axis}$
9:             if $\rho \cdot Ave_{Axis-X_{i}} > den_{Axis-X_{i}}$ then
10:                 $F = F + \{Axis-X_{i}\}$ ▹ Take the $Axis-X$ as an example
11:             While $rd(Int_{i}, Int_{i-1}) < 1$ ▹ $rd(Int_{i}, Int_{i-1}) = \frac{\| den_{i} - den_{i-1} \|}{den_{i} + den_{i-1}}$
12:                 Merge intervals $Int_{i}, Int_{i-1}$
13:             Recursive partitioning P
14:         else
15:             return
16:     end for
17: else
18:     return None
19: return P
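Lines 8 to 12 of Algorithm 2 can be illustrated with a small sketch that flags sparse intervals and merges neighbouring intervals with similar density. The merge policy (comparing each new interval against the running average of the merged group) and the `threshold` parameter are assumptions made here for illustration; Algorithm 2 itself compares the relative difference against 1.

```python
def relative_difference(den_a, den_b):
    """rd = |den_a - den_b| / (den_a + den_b); defined as 0 when both are empty."""
    total = den_a + den_b
    return abs(den_a - den_b) / total if total else 0.0


def sparse_intervals(densities, rho):
    """Indices of intervals whose density is below rho times the average density."""
    average = sum(densities) / len(densities)
    return [i for i, den in enumerate(densities) if rho * average > den]


def merge_similar(densities, threshold):
    """Merge adjacent intervals while their relative difference stays below threshold.

    Returns [total count, number of merged intervals] for each resulting group.
    """
    merged = [[densities[0], 1]]
    for den in densities[1:]:
        total, width = merged[-1]
        if relative_difference(total / width, den) < threshold:
            merged[-1] = [total + den, width + 1]
        else:
            merged.append([den, 1])
    return merged


# Hypothetical per-interval counts along one axis.
densities = [10, 12, 1, 0]
print(sparse_intervals(densities, rho=0.5))
print(merge_similar(densities, threshold=0.2))
```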

5.3. Shuffle and Reconstruct the Tree

After building an empty $_{dp}KD$ tree in the cloud server and sharing it with the users, each user traverses the $_{dp}KD$ tree to find the leaf node containing their spatial location and the path from it to the root node, and assigns 1 to the nodes on this path. Then, a layer is randomly selected, and the report is generated by the local transformation. Finally, it is sent to the shufflers for shuffling. The local transformation comes from our previous work [17], the shuffled protocol based on multiple shufflers comes from [18] and the shuffled algorithm is shown in Algorithm 3.

Algorithm 3: $SRR_{Q}$

Require: The user’s important location $g_{i}$, privacy budget $\epsilon$
Ensure: The shuffled value $(l_{i}^{'}, y_{i})$
1: The server shares T with each user
2: Each user traverses T to determine the path p
3: if node $v_{i}$ not in T then
4:     $weight(v_{i}) = 0$
5: else
6:     $weight(v_{i}) = 1$
7: end if
8: Randomly select a layer in T to generate a vector $B_{i}$
9: for $weight(v_{j})$ in vector $B_{i}$ do
10:     $(l_{i}^{'}, r_{i}) \leftarrow LT(weight(v_{j}, \epsilon))$ ▹ $l_{i}^{'}$ is based on a randomly selected tree layer of T
11:     Each user sends $(l_{i}^{'}, r_{i})$ to the shufflers $Sh_{i}$
12: end for
13: The shufflers $Sh_{i}$ shuffle all the data and send them to the server
14: return $(l_{i}^{'}, y_{i})$

Next, the server reconstructs the $_{dp}KD$ tree as in lines 10–17 of Algorithm 1. Then, the quadtree partition standard is used to continue dividing the space. The quadtree partition algorithm is shown in Algorithm 4. First, the data are divided and the number of data points in the data space is counted. Then, if $P_{n} \leq C_{max}$, the cell is constructed as the root node. Otherwise, according to the quadtree partitioning strategy, four sub-rectangles are constructed and recursively partitioned.

Algorithm 4: $QP$

Require: The dataset D, user’s location $g_{i}$
Ensure: Preliminary data division P
1: Count the amount of data space $C_{max}$
2: if $P_{n} \leq C_{max}$ then
3:     Build cells as the root nodes
4: else
5:     Construct four sub-rectangles $P_{1}, P_{2}, P_{3}, P_{4}$ ▹ According to the quadtree partition strategy
6:     for $P_{i} \in P_{1}, P_{2}, P_{3}, P_{4}$ do
7:         Recursive partitioning P
8:     end for
9: end if
10: return P ▹ The root nodes represent the space P
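The quadtree partition can be sketched as a short recursive function. The dictionary layout, the `capacity` parameter (playing the role of $C_{max}$), the half-open cells and the `max_depth` guard are assumptions of this sketch; the points are the six example points from Section 5.2, placed in a hypothetical bounding box.

```python
def quad_partition(points, box, capacity, max_depth=16):
    """Split box into four sub-rectangles until each cell holds <= capacity points."""
    (x_lo, x_hi), (y_lo, y_hi) = box
    inside = [p for p in points if x_lo <= p[0] < x_hi and y_lo <= p[1] < y_hi]
    node = {"box": box, "count": len(inside), "children": []}
    if len(inside) <= capacity or max_depth == 0:
        return node  # this cell becomes a leaf
    x_mid, y_mid = (x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0
    quadrants = [
        ((x_lo, x_mid), (y_lo, y_mid)), ((x_mid, x_hi), (y_lo, y_mid)),
        ((x_lo, x_mid), (y_mid, y_hi)), ((x_mid, x_hi), (y_mid, y_hi)),
    ]
    node["children"] = [
        quad_partition(inside, q, capacity, max_depth - 1) for q in quadrants
    ]
    return node


points = [(1.2, 4), (4.5, 2.2), (5, 8), (6, 10), (8.6, 9), (12, 11)]
tree = quad_partition(points, ((0, 16), (0, 16)), capacity=2)
print([child["count"] for child in tree["children"]])
```

The depth guard is there only to stop the recursion when more than `capacity` points share the same coordinates.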

5.4. Range Query

The range query process on the $_{dp}KD$ index is described for the query rectangle $RQ = [a, b] \times [c, d]$, where $a, b$ and $c, d$ are the minimum and maximum boundary values on each dimension, respectively. The range query is performed after post-processing. First, the reconstructed $_{dp}KD$ tree is traversed; then each node is processed according to its relationship with the range query $RQ$; and finally, the query result $\tilde{RQ}$ that conforms to the query rectangle is obtained. The specific algorithm is as follows (Algorithm 5).

Algorithm 5: $SDRQ$

Require: The $_{dp}KD$ tree, range query $RQ$
Ensure: The query result $\tilde{RQ}$
1: Traverse T from top to bottom ▹ $untraveled(v_{i}) = 0$
2: if node $v_{i} \subset RQ$ then
3:     $\tilde{RQ} \leftarrow RQ \cup v_{i}$
4: if node $v_{i}$ is disjoint from $RQ$ then
5:     return
6: else
7:     if node $v_{i}$ is leaf node then
8:         Add the intersecting records to candidates
9:         $\tilde{RQ} \leftarrow RQ \cup Can(v_{i}, RQ)$
10:     else
11:         if node $v_{i}$ is not leaf node then
12:             Record the child node of $v_{i}$ ▹ $untraveled(v_{i}) = 1$
13:             $\tilde{RQ} \leftarrow SDRQ(v, RQ)$
14:         end if
15:     end if
16: end if
17: return $\tilde{RQ}$
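The three cases of the traversal (node contained in the query, node disjoint from it, node partially overlapping) can be sketched over a tree of per-cell counts. One assumption differs from Algorithm 5 and is worth stating: for a partially overlapping leaf, this sketch scales the leaf count by the overlapping area fraction, assuming points are uniform within the cell, instead of collecting individual candidate records. The tree layout and counts are hypothetical.

```python
def contained(box, query):
    return all(q_lo <= lo and hi <= q_hi for (lo, hi), (q_lo, q_hi) in zip(box, query))


def disjoint(box, query):
    return any(hi <= q_lo or q_hi <= lo for (lo, hi), (q_lo, q_hi) in zip(box, query))


def overlap_fraction(box, query):
    """Fraction of the cell covered by the query (cells must have positive width)."""
    frac = 1.0
    for (lo, hi), (q_lo, q_hi) in zip(box, query):
        frac *= (min(hi, q_hi) - max(lo, q_lo)) / (hi - lo)
    return frac


def range_query(node, query):
    box = node["box"]
    if disjoint(box, query):
        return 0.0
    if contained(box, query):
        return float(node["count"])
    if not node["children"]:
        return node["count"] * overlap_fraction(box, query)  # uniformity assumption
    return sum(range_query(child, query) for child in node["children"])


def leaf(box, count):
    return {"box": box, "count": count, "children": []}


tree = {"box": ((0, 4), (0, 4)), "count": 10, "children": [
    leaf(((0, 2), (0, 2)), 4), leaf(((2, 4), (0, 2)), 1),
    leaf(((0, 2), (2, 4)), 2), leaf(((2, 4), (2, 4)), 3)]}
estimate = range_query(tree, ((0, 2), (0, 3)))
print(estimate)
```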

5.5. Security Analysis

Our scheme can achieve range queries with a high accuracy with shuffled differential privacy. In the following, we theoretically confirm that our algorithm satisfies shuffled differential privacy.

Theorem 1. The $KDRQ$ algorithm satisfies $(\epsilon, \delta)$-shuffled differential privacy.

Proof of Theorem 1. From the sequential composition of shuffled differential privacy, it can be seen that each shuffled algorithm $SRR_{Q}$ on the given spatial location dataset D satisfies shuffled differential privacy. According to the privacy amplification effect proposed by Balle et al. [5], each data owner’s local perturbation with the $LT$ algorithm satisfies $(\epsilon, \delta)$-local differential privacy, where $\epsilon = \sqrt{14 \ln(2/\delta) \frac{e^{\epsilon} + \Omega - 1}{n - 1}}$. The domain $\omega$ is then shuffled through the shuffling algorithm, where $\epsilon \leq \sqrt{14 \ln(2/\delta) \frac{e^{\epsilon} + \omega - 1}{n/k - 1}}$ ($n/k$ is the number of users participating in the shuffling). □
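The bound in the proof can be evaluated numerically. The sketch below rests on an interpretation that the text does not spell out: the $\epsilon$ inside the exponent is read as the local privacy budget, and the $\epsilon$ on the left-hand side as the budget after shuffling. The function and parameter names are hypothetical.

```python
import math


def amplified_epsilon(eps_local, delta, domain_size, n_users):
    """sqrt(14 * ln(2/delta) * (e^eps_local + domain_size - 1) / (n_users - 1))."""
    numerator = 14.0 * math.log(2.0 / delta) * (math.exp(eps_local) + domain_size - 1.0)
    return math.sqrt(numerator / (n_users - 1.0))


# Hypothetical settings: the bound tightens as more users take part in the shuffle.
for n in (10_000, 100_000, 1_000_000):
    print(n, amplified_epsilon(eps_local=1.0, delta=1e-6, domain_size=16, n_users=n))
```

Under this reading, splitting the n users into k groups replaces `n_users` with n/k, which loosens the bound, in line with the second inequality in the proof.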

6. Performance Evaluation

6.1. Experiment Settings

Our experimental platform was a Windows 10 system with an Intel i5-10505 CPU (3.2 GHz) and 16 GB of memory, and Python was used to implement the algorithm on the Landmark and Checkin datasets. The Landmark dataset includes 870,000 user geolocation records from the United States and is available from the infochimps platform. The Checkin dataset records 1 million locations from the social networking platform Gowalla, as shown in Table 2.

Experimental metrics: We used the relative error $RE = \frac{\tilde{RQ}(D) - RQ(D)}{RQ(D)}$ [26] to measure the range query accuracy of $RAPPOR$ [23], private spatial data aggregation ($PSDA$) [21], grid and quadtree of regions ($GT$-R) [26], and our algorithm $KDRQ$. We set the privacy budget parameter values to 0.1, 0.3 and 0.5.
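The metric is straightforward to compute. In the sketch below, taking the absolute value of the numerator is an assumption (the formula as printed has no absolute value), made so that over- and under-estimates are penalized equally; the function name is hypothetical.

```python
def relative_error(estimated, true):
    """RE = |estimated - true| / true, for a non-zero true count."""
    if true == 0:
        raise ValueError("relative error is undefined for a true count of zero")
    return abs(estimated - true) / true


print(relative_error(110.0, 100.0), relative_error(90.0, 100.0))
```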

6.2. Evaluation

6.2.1. Landmark-Based $RE$ Comparison

Figures 7, 8 and 9 show the $RE$ values of the $RAPPOR$, $PSDA$, $GT$-R and $KDRQ$ algorithms for query ranges of [5%,20%], [20%,40%] and [40%,60%] on the Landmark dataset. From Figure 7, it can be seen that as the privacy budget increased from 0.1 to 0.3 and 0.5, the query accuracy of our algorithm $KDRQ$ in the range of [5%,20%] gradually improved, and its query accuracy was higher than that of the other three algorithms. Figures 8 and 9 show similar trends. In addition, as the query range became larger, the accuracy of the algorithms improved, because a larger query area reduces the error.

6.2.2. Checkin-Based $RE$ Comparison

Figures 10, 11 and 12 show the $RE$ values of the $RAPPOR$, $PSDA$, $GT$-R and $KDRQ$ algorithms on the Checkin dataset for query ranges of [5%,20%], [20%,40%] and [40%,60%]. From Figure 10, we can see that as the privacy budget increased from 0.1 to 0.3 and 0.5, the query accuracy of our algorithm $KDRQ$ in the range of [5%,20%] gradually improved, and its query accuracy was higher than that of the other three algorithms. Figures 11 and 12 show similar trends. Because the Checkin dataset contains both sparse and dense regions, the query accuracy can be improved by our $KDRQ$ algorithm. In particular, when the privacy budget changed from 0.5 to 0.3, the $RE$ values of $GT$-R and $KDRQ$ were almost the same in the range of [40%,60%]. That is, the query accuracies were almost the same, which also shows that we improved the query accuracy while increasing the intensity of privacy protection. In general, taking the Checkin dataset as an example, when the privacy budgets were 0.1, 0.3 and 0.5, the ratios of $RE$ values ranged from 1.63 to 4.52 across all query ranges.

7. Conclusions

In this paper, we addressed the problem of publishing users' spatial locations with differential privacy protection. Considering the shortcomings of existing range query methods, we proposed a spatial data publishing method based on $_{dp}KD$ and the quadtree index that improves the query accuracy through adaptive partitioning. By the definition of shuffled differential privacy, $KDRQ$ satisfies $(\epsilon, \delta)$-shuffled differential privacy. Finally, the range query accuracy of the $KDRQ$ method was verified on two real large-scale datasets. The experimental results show that $KDRQ$ was clearly superior to the existing similar methods. Our work on range queries targets spatial data and still needs to be extended to other data types. In future work, we will study graph data publishing with differential privacy.

Author Contributions

Conceptualization, K.L., H.Z. and Y.X.; methodology, K.L.; software, K.L.; validation, H.Z., K.L. and Z.L.; formal analysis, K.L. and H.Z.; investigation, H.Z.; resources, K.L.; writing—original draft preparation, K.L.; writing—review and editing, H.Z.; visualization, Y.X. and K.L.; supervision, Z.L. and K.L.; project administration, H.Z.; funding acquisition, H.Z. All authors read and agreed to the published version of this manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (grant nos. 62072051 and 62032004).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

n	The number of users
D	Spatial dataset
$g_{i}$	The spatial data of the user
$L T$	Local transform algorithm
l	The layer of tree
$N (v_{i})$	The true value of user locations
$N^{*} (v_{i})$	The estimation value of user locations
$d s$	Dividing standard
$ρ$	Adjustable density parameter
$r d$	The relative density difference of the sparse interval
$Ω$	The domain of local algorithm
$ω$	The domain of shuffled algorithm

References

1. Available online: https://www.chinanews.com.cn/cj/2023/02-16/9955018.shtmlww (accessed on 1 February 2023).
2. Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Venice, Italy, 10–14 July 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12.
3. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, 4–7 March 2006; Proceedings 3. Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284.
4. Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Berkeley, CA, USA, 26–29 October 2013; IEEE: New York City, NY, USA, 2013; pp. 429–438.
5. Balle, B.; Bell, J.; Gascón, A.; Nissim, K. The privacy blanket of the shuffle model. In Proceedings of the Advances in Cryptology–CRYPTO 2019: 39th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 2019; Proceedings, Part II 39. Springer: Berlin/Heidelberg, Germany, 2019; pp. 638–667.
6. Cai, K.; Xiao, X.; Cormode, G. Privlava: Synthesizing relational data with foreign keys under differential privacy. Proc. ACM Manag. Data 2023, 1, 1–25.
7. Zhang, J.; Cormode, G.; Procopiuc, C.M.; Srivastava, D.; Xiao, X. Privbayes: Private data release via bayesian networks. ACM Trans. Database Syst. (TODS) 2017, 42, 1–41.
8. Bayer, R. The universal B-tree for multidimensional indexing: General concepts. In Proceedings of the Worldwide Computing and Its Applications: International Conference, WWCA’97, Tsukuba, Japan, 10–11 March 1997; Springer: Berlin/Heidelberg, Germany, 1997; pp. 198–209.
9. Li, D.; Yang, Q.; An, D.; Zhang, Y. A Location Privacy Aware Taxi-Hailing System: Adaptive Differential Privacy-based Dynamic Incentive Method. IEEE Internet Things J. 2023, 11, 914–930.
10. Wang, N.; Wang, Y.; Wang, Z.; Nie, J.; Wei, Z.; Tang, P.; Gu, Y.; Yu, G. PrivNUD: Effective Range Query Processing under Local Differential Privacy. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; IEEE: New York City, NY, USA, 2023; pp. 2660–2672.
11. Cormode, G.; Procopiuc, C.; Srivastava, D.; Shen, E.; Yu, T. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; IEEE: New York City, NY, USA, 2012; pp. 20–31.
12. Zhang, J.; Xiao, X.; Xie, X. Privtree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 155–170.
13. Kim, J.W.; Kim, D.H.; Jang, B. Application of local differential privacy to collection of indoor positioning data. IEEE Access 2018, 6, 4276–4286.
14. Kulkarni, T. Answering range queries under local differential privacy. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1832–1834.
15. Du, L.; Zhang, Z.; Bai, S.; Liu, C.; Ji, S.; Cheng, P.; Chen, J. AHEAD: Adaptive hierarchical decomposition for range query under local differential privacy. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 1266–1288.
16. Ahuja, R.; Zeighami, S.; Ghinita, G.; Shahabi, C. A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy. Proc. ACM Manag. Data 2023, 1, 1–25.
17. Bittau, A.; Erlingsson, Ú.; Maniatis, P.; Mironov, I.; Raghunathan, A.; Lie, D.; Rudominer, M.; Kode, U.; Tinnes, J.; Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; pp. 441–459.
18. Wang, T.; Ding, B.; Xu, M.; Huang, Z.; Hong, C.; Zhou, J.; Li, N.; Jha, S. Improving utility and security of the shuffler-based differential privacy. arXiv 2019, arXiv:1908.11515.
19. Cheu, A.; Zhilyaev, M. Differentially private histograms in the shuffle model from fake users. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; IEEE: New York City, NY, USA, 2022; pp. 440–457.
20. Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Optimal geo-indistinguishable mechanisms for location privacy. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 251–262.
21. Chen, R.; Li, H.; Qin, A.K.; Kasiviswanathan, S.P.; Jin, H. Private spatial data aggregation in the local setting. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 16–20 May 2016; IEEE: New York City, NY, USA, 2016; pp. 289–300.
22. Wang, N.; Xiao, X.; Yang, Y.; Zhao, J.; Hui, S.C.; Shin, H.; Shin, J.; Yu, G. Collecting and analyzing multidimensional data with local differential privacy. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, Macao, 8–11 April 2019; IEEE: New York City, NY, USA, 2019; pp. 638–649.
23. Erlingsson, Ú.; Pihur, V.; Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 1054–1067.
24. Zhang, H.; Li, K.; Huang, T.; Zhang, X.; Li, W.; Jin, Z.; Gao, F.; Gao, M. Publishing locally private high-dimensional synthetic data efficiently. Inf. Sci. 2023, 633, 343–356.
25. Li, X.; Liu, W.; Feng, H.; Huang, K.; Hu, Y.; Liu, J.; Ren, K.; Qin, Z. Privacy enhancement via dummy points in the shuffle model. IEEE Trans. Dependable Secur. Comput. 2023, 1001–1016.
26. Zhang, X.; Fu, N.; Meng, X. Spatial Range Query Method based on Local differential privacy. J. Comput. Res. Dev. 2020, 57, 847–858.

Figure 1. Range queries of 1-dimensional data.

Figure 2. Range queries of 2-dimensional data.

Figure 3. Range queries of 3-dimensional data.

Figure 4. Data analysis with shuffled differential privacy.

Figure 5. The framework of system.

Figure 6. The example of $KD$ tree index.

Figure 7. The RE in the range [5%,20%].

Figure 8. The RE in the range [20%,40%].

Figure 9. The RE in the range [40%,60%].

Figure 10. The RE in the range [5%, 20%].

Figure 11. The RE in the range [20%, 40%].

Figure 12. The RE in the range [40%, 60%].

Table 1. Comparison with related works.

Scheme	SDP Protection	High Dimension	Spatial Data	Data Density
AHEAD [13]	×	√	√	√
PSDA [18]	×	×	√	×
HaarHRR [12]	×	√	×	×
SM [19]	×	×	√	×
Our Work	√	√	√	√

Table 2. Datasets.

Datasets	Dataset Size	Position Domain	Sample Size
Landmark	$8.7 \times 10^{5}$	$[- 124.4, - 67] \times [24.6, 49]$	$6 \times 10^{5}$
Checkin	$10^{6}$	$[- 176.3, 177.5] \times [- 48.2, 90]$	$6 \times 10^{5}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, K.; Zhang, H.; Xu, Y.; Liu, Z. A Range Query Scheme for Spatial Data with Shuffled Differential Privacy. Mathematics 2024, 12, 1934. https://doi.org/10.3390/math12131934

