1. Introduction
With the advent of the big data era, massive amounts of spatial data are being generated, and users’ sensitive location information may be leaked when third-party servers collect and publish it. For example, a server that collects and publishes users’ location information may be compromised by an attacker, or may itself learn users’ entertainment, shopping, social and other daily behavior patterns, so that their personal privacy is leaked. Recently, a Telegram bot was suspected of having leaked 4.5 billion records of express delivery information, including users’ names, telephone numbers and home addresses [1]. Therefore, it is necessary to protect users’ spatial data privacy when publishing spatial data.
Differential privacy [2] was proposed by Dwork et al. in 2006 to protect data privacy. Central differential privacy [3] adds noise to the statistical results of the data. It assumes that the third-party server is fully trusted; it offers high data availability, but its degree of privacy protection is lower than that of the other models. In practical applications, it is difficult to find a fully trusted third-party server, and thus local differential privacy [4] came into being: users perturb their data locally before sending them to the third-party server. Compared with central differential privacy, the degree of privacy protection is higher. Shuffled differential privacy [5] lies between the two. It introduces a shuffler that breaks the correspondence between the data and the user ID, which can improve data availability at the same intensity of privacy protection as local differential privacy.
High-dimensional spatial data here mainly refers to data with more than two dimensions, including user location information, attribute data, relationship information and time update information. Like relational data [6], high-dimensional spatial data face the “curse of dimensionality” [7]: the communication cost increases with the distribution and scale of the data points, and the signal-to-noise ratio decreases, both of which reduce data availability. Therefore, it is necessary to construct an efficient index structure for dimensionality reduction. A spatial index structure, such as a UB-tree, is a combination of a Z-order and data index structure proposed by Bayer et al. [8]. Li et al. [9] proposed an adaptive cloaking region division to resist threats under location-based services. However, actual spatial location data have different distribution densities. Different from the previous work under the B-tree, Wang et al. [10] proposed privNUD, a new domain decomposition mechanism with local differential privacy, but it performs adaptive range queries only on one-dimensional data.
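As a rough illustration of how a Z-order curve reduces dimensionality, the sketch below interleaves the bits of integer grid coordinates into a single key, so that a one-dimensional index can order multi-dimensional cells. The function names `z_order_key` and `z_order_decode`, and the assumption that coordinates have already been quantized to non-negative integers of `bits` bits, are illustrative choices and are not taken from the cited works.

```python
def z_order_key(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer key.

    Assumption for this sketch: every coordinate is a non-negative integer
    that fits in `bits` bits (i.e. the space was quantized to a grid first).
    """
    dims = len(coords)
    key = 0
    for b in range(bits):                 # bit position inside each coordinate
        for i, c in enumerate(coords):    # dimension index
            key |= ((c >> b) & 1) << (b * dims + i)
    return key


def z_order_decode(key, dims, bits):
    """Inverse of z_order_key: recover the grid coordinates from a key."""
    coords = [0] * dims
    for b in range(bits):
        for i in range(dims):
            coords[i] |= ((key >> (b * dims + i)) & 1) << b
    return tuple(coords)
```

For instance, with two dimensions and one bit per coordinate, the cells (0, 0), (1, 0), (0, 1) and (1, 1) receive keys 0, 1, 2 and 3, which is the Z-shaped visiting order that gives the curve its name. The same functions work unchanged for three or more dimensions.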
At present, work on range queries with differential privacy protection mainly focuses on one- or two-dimensional spatial data. Cormode et al. [11] proposed private spatial decompositions (PSDs) based on central differential privacy, which divide the spatial data, report the number of points in each region and then answer the range query. Zhang et al. [12] proposed PrivTree, which uses a quadtree for hierarchical decomposition and introduces a bias to weaken the influence of noise. Different from these spatial range query works with central differential privacy, Kim et al. [13] were the first to use local differential privacy in indoor positioning systems to estimate the density of a region. Kulkarni et al. [14] proposed and analyzed different methods to answer one-dimensional range queries with local differential privacy. Both of these works share the problem that applying the schemes directly to multi-dimensional scenarios reduces the statistical accuracy because the density of the data is not considered. Du et al. [15] proposed an adaptive hierarchical decomposition protocol (AHEAD), which controls the added noise by adaptively and dynamically adjusting the tree structure. Ahuja et al. [16] designed a method to provide accurate queries based on variational auto-encoders. The scheme improves the data availability for range queries in multi-dimensional scenarios, but this inevitably reduces the privacy protection intensity. The above are the works on range queries with local or central differential privacy.
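To make the idea of answering a range query from a hierarchical decomposition concrete, the following is a minimal sketch of a quadtree over two-dimensional points. It is not the algorithm of PSD, PrivTree or AHEAD: the counts stored here are exact, whereas a private scheme would store noisy counts in the `count` field, and the names `Node`, `build` and `range_count`, the stopping rule (`max_depth`, `min_points`) and the uniformity assumption inside partially covered leaves are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    box: tuple                      # (x0, y0, x1, y1), half-open on the upper sides
    count: float                    # exact here; a private scheme would add noise
    children: list = field(default_factory=list)


def build(points, box, max_depth, min_points):
    """Recursively split `box` into four quadrants while it is dense enough."""
    x0, y0, x1, y1 = box
    inside = [(x, y) for x, y in points if x0 <= x < x1 and y0 <= y < y1]
    node = Node(box, float(len(inside)))
    if max_depth > 0 and len(inside) > min_points:
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        for sub in ((x0, y0, mx, my), (mx, y0, x1, my),
                    (x0, my, mx, y1), (mx, my, x1, y1)):
            node.children.append(build(inside, sub, max_depth - 1, min_points))
    return node


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def range_count(node, query):
    """Estimate how many points fall inside the rectangle `query`."""
    bx0, by0, bx1, by1 = node.box
    if query[0] <= bx0 and query[1] <= by0 and bx1 <= query[2] and by1 <= query[3]:
        return node.count                       # node fully covered by the query
    ov = overlap_area(node.box, query)
    if ov == 0.0:
        return 0.0                              # node disjoint from the query
    if not node.children:
        area = (bx1 - bx0) * (by1 - by0)
        return node.count * ov / area           # assume points are uniform in a leaf
    return sum(range_count(child, query) for child in node.children)
```

Because dense regions are split further than sparse ones, fully covered nodes contribute a single count each, and only the leaves cut by the query boundary rely on the uniformity assumption.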
In contrast to these range query schemes with local differential privacy, research on shuffled differential privacy currently focuses on frequency and mean statistics. Shuffled differential privacy was proposed by Bittau et al. [17] in 2017 and achieves a good tradeoff between the degree of privacy protection and data availability. Balle et al. [5] proposed the real-sum protocol and the privacy amplification effect in the shuffled model. Wang et al. [18] studied the shuffle model from the two aspects of algorithm and model security, and proposed a more secure PEOS model. Cheu et al. [19] presented a protocol to reduce the message complexity of histograms. We compared the range queries containing the (KDRQ) scheme with related work by other researchers, as shown in Table 1. These works do not involve range queries oriented to spatial data, and thus, we used the index structure to perform range queries on high-dimensional spatial data with shuffled differential privacy. If an existing scheme is applied directly to the shuffled differential privacy model, the distribution of the data is not retained, which leads to low query accuracy.
With shuffled differential privacy, we designed a new index structure that improves the query accuracy and supports adaptive range queries. Compared with the existing indexing schemes for multi-dimensional spatial data, our index structure supports range queries over high-dimensional spatial data, better matches the distribution characteristics of the data and supports fast high-dimensional range queries. The specific contributions are as follows:
A secure range query scheme for high-dimensional geographic spatial data with shuffled differential privacy is proposed. It solves the problem of low query accuracy and low availability resulting from the direct extension of local differential privacy to the shuffled differential privacy scenario.
An index that supports high-precision range queries is built. The index can adaptively divide the spatial region and improve the accuracy of range queries on spatial data. The algorithm can also be used to solve other range query problems with shuffled differential privacy.
Security analysis showed that our KDRQ scheme satisfied the requirements for shuffled differential privacy. Experimental analysis on the Landmark and Checkin datasets showed that the KDRQ method was superior to the existing methods in terms of range query accuracy.
The remainder of this paper is organized as follows. In Section 2, we introduce the related work. In Section 3, we introduce the preliminaries used in this paper. In Section 4, we present the system model of the scheme. Then, Section 5 provides a detailed introduction to the scheme we have proposed. In Section 6, we present experimental results to show the query accuracy of KDRQ. Finally, we summarize this paper in Section 7.
2. Related Work
Most existing schemes for publishing spatial data with differential privacy build grids [20], and users then perturb the cells of the grid using randomized response. However, because such a scheme perturbs every location in the same way, it does not adapt to different spatial data. Cormode et al. [11] proposed a private spatial decomposition scheme with central differential privacy, which divides the spatial data into smaller regions, reports the number of points in each region, and answers the query. Zhang et al. [12] proposed PrivTree, a hierarchical decomposition algorithm with differential privacy that uses a quadtree to divide regions. However, for non-uniform datasets, the query results are not ideal because the data points are unbalanced. Chen et al. [21] studied the aggregation of spatial data with local differential privacy, but they did not design an effective index structure. Wang et al. [22] proposed a segmentation mechanism to collect and publish high-dimensional spatial data, and used noise distribution to disturb points. However, due to the influence of the segmentation mechanism on data at a single location, the error was large. This study addresses the above problems and designs an index for high-precision range queries with shuffled differential privacy to achieve adaptive range queries.
4. System Model
4.1. System Model
As shown in Figure 5, the system model involves four participants: query user, data owner, shuffler and cloud server.
Query users: They are generally the data owners themselves or other legitimate users who have a fully trusted internal connection with the data owner. The query user wishes to search for content of interest within the entire dataset provided by all data owners and is not willing to disclose the relevant query results. To achieve this goal, the authorized user sends the query to the cloud server through a secure channel in this scheme. After receiving the query request, the server runs the query algorithm on the index tree and returns the query result to the user.
Data owners: The n data owners are willing to outsource their data to the cloud server in the form of differential privacy protection, while allowing authorized query users to query all data. In this scheme, each data owner has their own spatial data . They first locally perturb their location data and then send the message to the shuffler.
Shuffler: The shuffler is an independent semi-trusted server that performs secure shuffling operations without knowing the data. After receiving the locally perturbed data from the users, it shuffles them.
Cloud server: The cloud server is untrusted. The cloud server can build an empty tree and share the index with the data owner. Once the data released by the shuffler is received, the tree is reconstructed and the final statistical results are obtained. Finally, the query results are returned to the query user. In addition, when the data owner sends the update information, the cloud server should update its stored index synchronously.
The cloud server in this scheme is untrusted. From the user’s view, other parties in the system model, including the shuffler (auxiliary server), cloud server and other data owners, may be adversaries. We assume that all participants have the same level of background knowledge and need to consider the consequences of collusion between different parties. There are three important adversary settings: the server acting alone; collusion between the server and other users; and collusion between the server and the shuffler. In particular, when the server colludes with the auxiliary server or other users, the model reduces to local differential privacy.
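The flow between data owners, shuffler and cloud server can be sketched as follows. This is only an illustration under stated assumptions: the local randomizer used here is generalized randomized response over `k` grid cells, which is a common choice but is not confirmed by this section as the scheme's randomizer, and the function names `grr_perturb`, `shuffle_reports` and `estimate_counts` are hypothetical. The index tree and the privacy amplification analysis are deliberately left out.

```python
import math
import random


def grr_perturb(value, k, eps, rng):
    """Data owner: report the true cell with probability p, else another cell.

    Assumed randomizer: generalized randomized response over cells 0..k-1.
    """
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(k - 1)          # uniform over the k-1 other cells
    return other if other < value else other + 1


def shuffle_reports(reports, rng):
    """Shuffler: permute the reports so they can no longer be tied to a user."""
    shuffled = list(reports)
    rng.shuffle(shuffled)
    return shuffled


def estimate_counts(reports, k, eps):
    """Cloud server: debias the observed histogram into unbiased cell counts."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1 - p) / (k - 1)
    raw = [0] * k
    for r in reports:
        raw[r] += 1
    return [(c - n * q) / (p - q) for c in raw]
```

A run would have each owner call `grr_perturb` on their own cell, pass all reports through `shuffle_reports`, and hand the result to `estimate_counts`. The server's estimate depends only on the multiset of reports, so shuffling changes nothing for the statistics while removing the link between a report and its sender.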
4.2. Problem Definition
Given the spatial data of n data owners, shufflers and the cloud server, each data owner (user) perturbs the data via an algorithm and sends to the shuffler. The shuffler shuffles the received n inputs and outputs the results. In order to formalize it, the shuffled mechanism is defined as . The server performs statistical analysis based on the shuffled data, that is, the range query algorithm is . The problem that was solved in this study was to design a spatial range query scheme that satisfies the shuffled differential privacy, obtains query results that are as accurate as possible, and supports adaptive partitioning.
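One common way to write down such a pipeline, given here only as a hedged sketch, is shown below. The symbols $\mathcal{R}$ (local randomizer), $\mathcal{S}$ (shuffler), $\mathcal{M}$ (shuffled mechanism), $\mathcal{A}$ (server-side range query algorithm), $x_i$ (the spatial data of owner $i$), $\varepsilon$ and $\delta$ are notation assumed for this illustration; the symbols used by the scheme itself may differ.

```latex
\[
\mathcal{M}(x_1,\dots,x_n) \;=\; \mathcal{S}\bigl(\mathcal{R}(x_1),\dots,\mathcal{R}(x_n)\bigr),
\qquad
\text{answer} \;=\; \mathcal{A}\bigl(\mathcal{M}(x_1,\dots,x_n)\bigr),
\]
where $\mathcal{S}$ applies a uniformly random permutation to its $n$ inputs.
The mechanism $\mathcal{M}$ satisfies $(\varepsilon,\delta)$-differential privacy if,
for all datasets $D, D'$ that differ in one owner's data and all sets of outputs $T$,
\[
\Pr[\mathcal{M}(D)\in T] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D')\in T] + \delta .
\]
```

Under this reading, the privacy guarantee is stated for the shuffled output seen by the server, while the range query algorithm only post-processes that output.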
4.3. The Design Goal of the Spatial Query Scheme
In order to achieve accurate and secure range queries over privacy-preserving cloud data by multiple data owners, our scheme should meet the following goals and security requirements:
- (1) Support multiple data owners: multiple data owners can outsource their data safely and conveniently, and the complexity of creating a query is almost unaffected by the number of data owners.
- (2) Ideal query results: the cloud server should help users obtain high-precision query results over the datasets outsourced by all data owners.
- (3) Efficiency and timeliness: the cloud server receives a massive number of geospatial data queries, and the spatial data are frequently updated. Given these characteristics, it is necessary to construct an index that supports efficient queries and updates. This not only improves the user’s query experience but also provides results closer to reality.
- (4) Security: aiming at the range query requirements of outsourcing geographic data with shuffled differential privacy protection, we analyzed the potential attacks and the security of the scheme, which is divided into the following points:
Data confidentiality: the content of the outsourced datasets by the data owner should be protected and should not be known to the cloud server or other unauthorized query users.
Query privacy: The query content in the user’s query is sent to the cloud server through a secure channel and will not be leaked to any adversary. Query privacy is therefore not considered further in this scheme.