1. Introduction
With the rapid development of smart terminal, positioning, and mobile Internet technologies, location-based services (LBSs) are increasingly penetrating all aspects of life, such as finding nearby restaurants and hotels, and bring great convenience to people. In practice, people access location-based services over the network. Given the current network security situation, they inevitably face serious cyber risks, e.g., malware [1], spyware [2], encryption for malicious purposes [3], and zero-day attacks [4]. Privacy leakage, as one of these cyber risks, is drawing more and more attention.
At present, the intrinsic privacy leakage of LBSs is becoming a major challenge for protecting users’ privacy. To use an LBS, a user submits a request, which contains locations and query content, to the LBS server. However, users’ locations and query content are often leaked owing to collusion between the LBS provider (LP) and attackers. As a result, users are exposed to privacy leakage when using an LBS. For example, attackers can infer a user’s private information (e.g., where the user works, goes for dinner, or stays overnight) from locations and query content (e.g., “which hospitals are nearby”). Even so, users still need to use LBSs in many cases. Thus, a location privacy-preserving technology must solve these problems so that neither locations nor query content are disclosed. In this work, we investigate how to protect users’ privacy in the scenario of a continuous query.
In this scenario, to protect users’ privacy, researchers have presented a variety of methods such as privacy policy [5,6], distortion, and encryption [7,8,9,10,11]. Among these methods, distortion techniques, such as pseudonym, obfuscation, and dummy techniques, are the most widely used. Pseudonym techniques [12,13] protect users’ privacy by deleting users’ identity identifiers or replacing them with a false or temporary pseudonym. Although users can use pseudonyms for privacy protection, adversaries can still infer a user’s real identity by analyzing the spatial-temporal correlation of continuous queries [13]. Obfuscation techniques [14,15,16] protect users’ privacy by generalizing or perturbing the time and locations in LBS queries so that users’ precise locations cannot be recognized. However, submitting inaccurate locations reduces the quality of service. An effective approach should maintain the quality of service while ensuring that users’ privacy is not leaked. Dummy techniques [13,17,18,19,20,21,22,23,24,25,26,27] protect a user’s privacy by adding dummy queries to the real query and do not reduce the quality of service, so they are feasible for our work. In this paper, we focus on how to hide a user’s locations or query content among dummies in the scenario of a continuous query.
While existing dummy techniques address privacy leakage in LBS queries, they mainly focus on the neighbor query scenario, in which a user sends queries from different locations, and they neglect the scenario in which the same user sends a short-time continuous query from the same location (called the ’scenario of this paper’ for simplicity).
Figure 1 shows an example of privacy leakage in the scenario of this paper.
In the example, the real user sends three queries at location L to query nearby points of interest (POIs). For each query, three dummy users are selected to form an anonymous group. However, existing techniques, e.g., the distortion technique, hide the location and query content for only one of the three queries (we assume the first query), because they neglect the scenario of this paper. We therefore assume that no user other than the real user sends all three queries in the example. As for the four pieces of query content sent by the users in each query, we assume that the first query covers three query categories, whereas the second and third queries each cover only one query category.
As shown in Figure 1, by intersecting the users in different anonymous groups and combining background knowledge [26], the attacker could find that only one user appeared in all three queries and then infer that this user was the real user who sent the queries at L (only the same user is able to send queries continuously within a short period of time, see [13]). Moreover, attackers can deduce the user’s sensitive locations, e.g., home address or workplace, by collecting and analyzing the user’s long-term historical query data [13]. As for the query content, although the attacker cannot tell which piece of query content belongs to this user in the first query, he is able to identify it in the second and third queries. We call this kind of attack, which aims to obtain sensitive location and query content privacy through intersection inference attacks (see [27]), the intersection inference attack for location and query content (I2LC).
Furthermore, existing methods do not consider the hierarchical structure of the address (HSA), which may reveal a user’s organization information. A region is composed of regions at different levels, and these levels are called the HSA. For example, the HSA of a location in Beijing is composed of four levels: city (Beijing), district (Haidian District), street (Xi Tucheng Road), and organization (Beijing University of Posts and Telecommunications (BUPT)). If we do not consider the HSA, the locations selected for the anonymous group may lie in the same organization as the real location. For example, the k locations in the anonymous group may all be located in BUPT. Although this does not directly expose the user’s exact location, it reveals that the user is located in BUPT. For some users, such as confidential personnel and anti-drug police, the organization information also needs to be kept secret; once leaked, it can be used by criminals to infer their occupation and may threaten their lives.
In this paper, we present a dummy generation scheme considering the HSA, called DGS-HSA, to generate dummy locations and query content. In this scheme, for each query in a continuous query, we select k locations from historical locations as dummy locations (in our scheme, the user does not send her real location to the LBS server, see Section 4.4). At each dummy location, a user sends a dummy query selected from the basic service set. In the DGS-HSA, we first divide the city area where users are located into grids with a two-level structure. Each grid represents an organization, and all historical locations are clustered into these grids according to their organizations. Based on these grids, we then propose a degree of privacy protection and use it to select dummy locations and query content that protect users’ location privacy, organization information, and query privacy in the scenario of this paper. For each query in a continuous query, we select from the historical locations k locations distributed across s different organizations, and we select query content such that the queries in a continuous query cover at least l query categories. This protects a user’s organization information and query privacy, since the k locations are not all located in the same organization and the query content covers at least l query categories. Across all queries in a continuous query, we ensure that the locations selected from the historical locations are evenly distributed across the different organizations. Our scheme can therefore resist I2LC and protect users’ location privacy. Using theoretical analysis, we also prove that our scheme meets this degree of privacy protection and resists I2LC. However, the scheme is feasible only if the privacy protection level and the system overhead reach an equilibrium, so we evaluate the feasibility and effectiveness of our scheme. The results show that our scheme can reach an equilibrium between privacy protection level and system overhead and can protect users’ location privacy, organization information, and query privacy better than the schemes in [13,27].
The DGS-HSA can protect users’ organization information and query privacy. Compared with [13,27], the DGS-HSA considers the HSA and divides the city area where users are located into grids with a two-level structure. Based on this, for each query in a continuous query, the DGS-HSA selects k dummy locations distributed across s different organizations and k pieces of query content such that the continuous query covers at least l query categories. As a result, attackers cannot distinguish the user’s real organization and real query content.
Compared with [13,27], the DGS-HSA can also resist I2LC and protect users’ location privacy. For a continuous query, our scheme selects dummy locations evenly from different organizations, so that all dummy locations are evenly distributed across these organizations; that is, each location is equally likely to fall in any of the organizations. As a result, attackers cannot distinguish the user’s real location by launching I2LC. Specifically, the major contributions of this paper are as follows:
To protect a user’s location privacy, organization information, and query privacy in the scenario of this paper, we present the DGS-HSA to generate dummy locations and query content. The DGS-HSA considers the HSA and can resist I2LC.
Considering the HSA, we introduce a novel meshing method that divides the city area where users are located into grids with a two-level structure. Using this method, our scheme ensures that, for each query in a continuous query, the k dummy locations are not all selected from the same organization. In addition, for the query content, each query category occurs with the same probability across all queries in a continuous query. Thus, our scheme protects users’ organization information and query privacy better than the schemes in [13,27].
To protect a user’s location privacy, we propose a method to resist I2LC. Using this method, our scheme ensures that the selected dummy locations are evenly distributed across different organizations in a continuous query. In addition, we replace the real user’s location with a historical location that is in the same organization, which prevents the user’s exact location from being sent to the server. Thus, our scheme can resist I2LC and protects users’ location privacy better than the schemes in [13,27].
We evaluate the feasibility and effectiveness of our scheme. The results show that our scheme can reach an equilibrium between privacy protection level and system overhead, and we give a recommended configuration of the system parameters. They also show that our scheme protects users’ location privacy, organization information, and query privacy better than the schemes in [13,27].
The rest of the paper is organized as follows. We first review the related research in Section 2. Then, in Section 3, we introduce the adversary model, the motivation and basic idea, the privacy protection model, and the system architecture of this paper. We describe our algorithms in detail in Section 4. Section 5 and Section 6 evaluate the effectiveness of the presented scheme from theoretical and experimental aspects, respectively, including a feasibility analysis of the scheme. Finally, in Section 7, we summarize the research work and its results and point out open problems and future work.
3. Preliminary
In this section, we first analyze the privacy that an attacker intends to obtain (adversary model). Next, we formulate how user privacy leaks (motivation) and give the basic idea for solving this problem in the scenario of this paper (basic idea). We also give the criteria for measuring whether a user’s privacy has been leaked (privacy protection model) and the structure needed to protect the user’s privacy (system architecture).
3.1. Adversary Model
In this paper, we assume that adversaries are honest but curious and simply collect all the user data that they can access. Their goal is to infer the user’s privacy, including identity, location, organization, and query content, by analyzing these data. An adversary may be an LBS provider, an LBS user, or a malicious hacker. We mainly consider the LBS provider, which owns users’ full information, as an active adversary. To achieve his goal, he can obtain global information on all users’ current and historical data. He can also use statistical inference methods, combined with side information, to infer a specific user’s privacy. Here, side information mainly refers to a city’s region information, e.g., administrative divisions, streets, and the distribution of organizations. In this paper, we only consider protecting the user’s location, organization, and query content. We do not protect the user’s identity for two reasons. LBSs that only provide a query service do not involve the user’s identity. For LBSs that require logging in to an account, real-name Internet access is used in China, so once the user logs in, the service provider knows the user’s identity. In both cases, protecting the user’s identity is of little significance.
3.2. Motivation and Basic Idea
To protect users’ privacy in the scenario of this paper, an effective approach is to add dummies. Considering the scenario of this paper, we assume that a user sends m queries in a continuous query and that each query carries k locations, namely the real location and $k-1$ dummy locations. Using dummy techniques, there will then be $m(k-1)$ selected dummy locations (some dummy locations may be the same) and m real locations. Assume there are c kinds of locations among the $mk$ locations and $n_i$ is the number of locations of the ith kind, so that $\sum_{i=1}^{c} n_i = mk$. From the perspective of adversaries, the probability $p_i$ that the ith kind of location is estimated as the real location is as follows:
$$p_i = \frac{n_i}{mk} \qquad (1)$$
In particular, for the real location r (assume r is the jth kind of location), the probability $p_j$ that it is estimated as the real location is given by Equation (2):
$$p_j = \frac{n_j}{mk} = \frac{m}{mk} = \frac{1}{k} \qquad (2)$$
Considering the I2LC, we assume that each dummy location appears once in the m queries, whereas the real location r appears in every query. That is, for each kind of dummy location, $n_i = 1$. Then we get:
$$\lim_{m \to \infty} p_i = \lim_{m \to \infty} \frac{1}{mk} = 0 \qquad (3)$$
That is, when the user sends an unlimited number of queries, the probability that the real location is estimated as the real location is far larger than that of any dummy location. This means that the real location can be easily distinguished.
Hence, if a user uses dummy techniques for privacy protection, the attacker can infer her real identity by launching the I2LC. As shown in Figure 1, in the scenario of this paper, the attacker could infer that the user who appears most frequently is likely the real user by intersecting the users in different anonymous groups.
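The frequency argument above can be illustrated with a small sketch. It assumes a continuous query of m anonymous groups with k locations each, in which every dummy appears only once; the function names and the location labels (`"L"`, `"d1"`, ...) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def appearance_shares(queries):
    """Share n_i / (m * k) of each location among all submitted locations."""
    counts = Counter(loc for group in queries for loc in group)
    total = sum(counts.values())
    return {loc: n / total for loc, n in counts.items()}


def intersect_groups(queries):
    """Locations that occur in every anonymous group of the continuous query."""
    common = set(queries[0])
    for group in queries[1:]:
        common &= set(group)
    return common


# Three queries (m = 3) with four locations each (k = 4); "L" is the real location.
queries = [
    ["L", "d1", "d2", "d3"],
    ["L", "d4", "d5", "d6"],
    ["L", "d7", "d8", "d9"],
]
shares = appearance_shares(queries)
suspects = intersect_groups(queries)
```

In this toy input the real location holds a share of 1/k, each dummy holds 1/(mk), and the intersection of the groups contains only the real location, which is the leak the I2LC exploits.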
In addition, a basic assumption commonly used in existing methods is that all locations are geographic coordinates. For example, user Alice is assumed to be at a location described by a time, a longitude, and a latitude, and the attacker cannot obtain Alice’s location privacy as long as he cannot infer that Alice is at this coordinate. However, existing methods ignore the HSA. In this paper, we divide the HSA into six levels based on common administrative divisions, one of which corresponds to the organization name and another to a specific location, as shown in Figure 2. If we do not consider the HSA, we may select dummies that are located in the same organization as the real location. In particular, in TTP-free methods that achieve anonymity collaboratively through a P2P network, the k users are close to each other and easily belong to the same organization, which reveals the users’ organization information.
Assume that the attacker obtains k locations, which are distributed across s different organizations and come from one of the m queries in a continuous query. Then the probability that the attacker can distinguish the user’s real organization is $1/s$. If the k users (k locations) are so close that they are almost all in the same organization, s will be close to 1, and this probability approaches 1.
That is to say, if we select dummies that are located in the same organization as the real location, the attacker can easily distinguish the user’s real organization. In fact, the attacker can easily infer that different anonymous groups are located in the same organization, as shown in Figure 3. In Figure 3, the rectangle represents an area containing four organizations, and the red circles represent three different anonymous groups. In the example, although two of the anonymous groups contain different users, the attacker can easily infer that the users in both groups are in the same organization. The problem is more serious in densely populated and larger organizations. According to actual need, this paper only considers privacy protection at the organization and location levels of the HSA.
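The organization-level leak can be made concrete with a minimal sketch. It assumes that an attacker who sees an anonymous group spread over s organizations guesses uniformly among them, so the chance of naming the real organization is 1/s; the function name and the organization labels other than BUPT are hypothetical.

```python
def organization_guess_probability(group_orgs):
    """Chance 1/s of guessing the real organization, where s is the number
    of distinct organizations that the locations of one group fall into."""
    if not group_orgs:
        raise ValueError("the anonymous group must contain at least one location")
    s = len(set(group_orgs))
    return 1.0 / s


# All four locations lie in the same organization: the organization is exposed.
same_org = organization_guess_probability(["BUPT", "BUPT", "BUPT", "BUPT"])

# The same group size spread over three organizations lowers the chance.
spread = organization_guess_probability(["BUPT", "OrgA", "OrgB", "BUPT"])
```

A group confined to one organization gives the attacker certainty regardless of k, which is why the scheme requires the k locations to span several organizations.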
The above formulas reveal the mechanism by which the attacker obtains users’ privacy by using the I2LC and the HSA: (1) location privacy leaks when locations appear in a continuous query with different frequencies; (2) organization information leaks when the users (locations) in an anonymous group are concentrated in the same organization. Hence, the basic idea of our scheme is that each location appears evenly in a continuous query and that the locations in an anonymous group are distributed across two or more different organizations. In addition, we use l-diversity to protect the query content. That is, across all queries in a continuous query, each query category occurs with the same probability, so every piece of query content is equally likely to be identified as the user’s.
To implement our basic idea, we consider two main k-anonymity methods: cloaking and adding dummies. The cloaking method submits to the LBS server a cloaking area that contains k users, including the user’s real location. However, it is difficult for cloaking to guarantee the basic idea. One reason is that users do not store information about the users who have participated in anonymity, so it is difficult to determine whether a user appears too frequently in a continuous query. For example, in the cloaking method, users are randomly selected to participate in anonymity, which makes it impossible to control their participation. Another reason is that users cannot know the areas of all organizations. For example, in cloaking, randomly selecting users to participate in anonymity causes the users’ locations to be randomly distributed across organizations. In this paper, adding dummies is an ideal solution because it is not constrained by the real users’ locations, and the dummy locations can be generated flexibly according to given conditions. However, although some methods, such as randomly generating dummies or selecting dummies from historical queries or historical data, can to a certain extent distribute the users in an anonymous group across specified organizations, they cannot guarantee that the users are not all in the same organization.
In this paper, we add dummies to protect the user’s location, organization information, and query privacy in the scenario of this paper. All dummy locations and query content are selected from the historical locations and the basic service set. We first divide the city area where users are located into grids with a two-level structure, in which different grids represent different organizations, as shown in Figure 4. Each historical location is located in a grid. Then, we can select k dummy locations evenly from s different grids. That is, we can ensure that the k locations in each query are different from each other and distributed across s different organizations. This protects users’ privacy at the organization level. However, the LBS provider can identify the real user by comparing the submitted locations with the historical locations, since he has all historical locations and the real user’s location is not among them. Therefore, we replace the real user’s location with a historical location that is in the same organization as the real user’s location. This ensures that the locations in each query are different. In addition, for the query content, our idea is to ensure that each query category occurs with the same probability across all queries, so every piece of query content is equally likely to be identified as the user’s.
3.3. Privacy Protection Model
To implement the above basic idea, the adding-dummy method in our paper needs to meet two conditions: (1) the LBS server needs to store a historical location dataset, which makes it possible, by controlling the participation of locations, to guarantee that each location appears evenly in a continuous query and that the k dummy locations are not all located in the same organization; (2) the LBS server needs to provide a service category set, which makes it possible to guarantee that each query category occurs with the same probability across all queries.
Considering the above conditions, the LBS system needs to provide M different service categories, e.g., restaurant and hospital. Let C denote the set of these service categories; C is called the basic service set. In addition, the LBS server needs to store a historical location dataset G, which users use to generate dummy locations. The area in which the historical locations in G are located is divided into N different first-level grids, so G can be written as the collection of these N first-level grids. Each first-level grid is further divided into second-level grids (organizations), and each second-level grid contains the historical locations L that fall within it. To protect users’ privacy, when a user sends the ith query from her current location to the LBS server, she submits a query request that contains k locations together with their query content. After receiving the query request, the LBS server computes the results and generates a response message. To implement the basic idea, the locations and query content submitted in a continuous query should be constrained. To address this issue, we define the privacy protection model, called the ’degree of privacy protection’, as follows:
Definition 1: In a continuous query, if the degree of privacy protection is to protect a user’s location privacy, organization information, and query privacy, the locations and query content in every query request must satisfy the following conditions:
(1) The k locations in the query request are different from each other;
(2) The k locations are located in s different second-level grids;
(3) The query content attached to the k locations contains at least l different service categories. If a user sends n queries from her current location, each service category in C is selected approximately the same number of times. That is, each service category in C is selected with equal probability.
The privacy protection model means that, if a user needs to protect her location, organization information, and query content, she must send k different locations to the LBS server in each query request, these locations must be located in s different second-level grids, and the k pieces of query content attached to these locations must contain at least l different service categories. The first two conditions ensure that the k locations are different from each other and located in s different second-level grids, which requires selecting the k dummies from s different grids in a query request. Condition (3) guarantees that the k pieces of query content in a query request are not all the same, which is necessary to protect the query content. Hence, the degree of privacy protection can protect a user’s location privacy, organization information, and query privacy.
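The three conditions of the privacy protection model can be checked mechanically. The sketch below is one possible reading, not the paper’s notation: a query request is assumed to be a list of (location id, second-level grid id, service category) tuples, and the balance test over several requests uses a hypothetical `tolerance` parameter for "approximately equal".

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Assumed representation: (location id, second-level grid id, service category).
Entry = Tuple[str, str, str]


def satisfies_per_query_conditions(request: List[Entry], k: int, s: int, l: int) -> bool:
    """Check conditions (1)-(3) for a single query request."""
    locations = [loc for loc, _, _ in request]
    grids = {grid for _, grid, _ in request}
    categories = {cat for _, _, cat in request}
    return (
        len(locations) == k
        and len(set(locations)) == k   # (1) k pairwise-different locations
        and len(grids) == s            # (2) spread over s second-level grids
        and len(categories) >= l       # (3) at least l service categories
    )


def categories_balanced(requests: Sequence[List[Entry]],
                        basic_service_set: Sequence[str],
                        tolerance: int = 1) -> bool:
    """Check that every category of C is selected about equally often overall."""
    counts = Counter(cat for request in requests for _, _, cat in request)
    per_category = [counts.get(cat, 0) for cat in basic_service_set]
    return max(per_category) - min(per_category) <= tolerance


request = [
    ("p1", "g1", "restaurant"),
    ("p2", "g1", "hospital"),
    ("p3", "g2", "restaurant"),
    ("p4", "g2", "hotel"),
]
ok = satisfies_per_query_conditions(request, k=4, s=2, l=3)
```

Condition (2) is read here as "exactly s grids"; a looser "at least s" reading would only change the comparison operator.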
3.4. System Architecture
Given the adversary model, the basic idea, and the privacy protection model, the remaining problem is which system architecture to use for selecting dummy locations and query content. Considering the drawbacks of the TTP and TTP-free methods, in this paper we adopt a dummy technique that uses historical locations to generate dummy locations and protects the user’s privacy, including the organization information and query content. As shown in Figure 5, the system architecture of the dummy technique is a typical independent architecture that avoids the disadvantages of centralized and distributed architectures.
In the LBS system, there are three participants: the user, the mobile terminal, and the LBS server. The privacy protection module of the mobile terminal is the core module of the entire system and consists of a range generation module, a dummy location generation module, and a dummy content generation module. The LBS server stores the historical location dataset G, provides a historical location sub-dataset from which the user generates dummy locations, and calculates query results for the user. The whole process is described as follows:
System initialization: This mainly refers to obtaining the user’s current location and initializing the historical location dataset G on the server.
Initiating a service request: In this stage, the user inputs the initial parameters that serve as the input of the privacy protection module.
Range generation: The range is a range of latitude and longitude that the user sends to the LBS server to indicate which part of the historical location dataset she wants to download. It is determined by the user’s current location and the initial parameters. After the user inputs the initial parameters, the range generation module derives the range from the current location and s and sends it to the LBS server in order to download the historical location sub-dataset.
Historical location data acquisition: The LBS server sends the user the historical location sub-dataset that meets the requirements of the range.
Query request construction: After generating dummy locations and dummy query content, the privacy protection module constructs the query request message in a specific format.
Service response: According to the dummy locations and dummy query content in the query request, the server calculates the query results, constructs the response message, and returns it to the user.
4. Dummy Generation Scheme
In this section, we describe our dummy generation scheme in two parts: the algorithm framework and several key algorithms.
4.1. Algorithm Framework
In our scheme, generating dummy locations and query content must take two conditions into account: (1) the LBS provider, which owns users’ full information, is an active adversary; (2) in our independent architecture, dummy locations and query content are selected from the historical location dataset and the basic service set stored on the mobile terminal, which has limited storage. Together, these conditions mean that the client cannot submit exact locations to the server and can store only a small amount of data. The process of generating dummy locations and query content is shown in Figure 6. The main steps are as follows:
Step 1: The user submits the privacy protection parameters and her current location to the server. To ensure that the LBS provider cannot identify the current location, we use a generation algorithm to produce a latitude and longitude range that contains the current location, and the user submits this range rather than the exact location.
Step 2: After receiving the range, the server generates a historical location sub-dataset (a sub-dataset of G) and sends it to the mobile terminal. The server generates it from the range with the historical location sub-dataset generation algorithm. The sub-dataset is small and can be stored on the mobile terminal.
Step 3: The mobile terminal constructs a query request and submits it to the server. In this step, the mobile terminal uses the dummy location generation algorithm to select dummy locations and the dummy query content generation algorithm to select dummy query content. The two algorithms ensure that the constructed query request meets the privacy protection model.
Step 4: The server calculates the query results and sends them to the mobile terminal. To ensure that the user knows the query results and the server does not, the results are calculated for the request built by the user query request construction algorithm.
4.2. Generation Algorithm
To ensure that the LBS provider cannot identify the user’s current location, we need to blur it into an area that consists of different organizations. This area is described by s, the number of different second-level grids it must contain, and by the ranges of longitude and latitude that bound it. The area is determined by the user’s current location and by initial parameters that are given by the user. We design an algorithm to generate this area; Algorithm 1 gives its formal description. First, it checks whether the area contains at least s second-level grids. If so, the server provides the corresponding historical location sub-dataset. Otherwise, the values of the initial parameters are increased and the check is repeated until the area contains no fewer than s second-level grids.
Generally speaking, the higher the user’s location privacy protection requirements are and the more grids and historical location data meet the requirements, the better the privacy protection effect is, and of course the higher the storage and communication overhead of generating the dummy locations.
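Algorithm 1: Range generation.
Input: the user’s current location, s, and the initial longitude and latitude parameters given by the user
Output: a latitude and longitude range that contains no fewer than s second-level grids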
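The expand-until-covered loop described for Algorithm 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the original listing: second-level grids are represented by a hypothetical `grid_centers` mapping from grid id to a (longitude, latitude) centre, a grid counts as covered when its centre lies inside the range, and `step` and `max_iter` are illustrative parameters.

```python
def generate_range(location, s, half_lon, half_lat, step, grid_centers, max_iter=100):
    """Grow a longitude/latitude box around `location` until it covers
    at least s second-level grids, then return the box."""
    lon, lat = location
    for _ in range(max_iter):
        lon_min, lon_max = lon - half_lon, lon + half_lon
        lat_min, lat_max = lat - half_lat, lat + half_lat
        covered = [
            grid for grid, (x, y) in grid_centers.items()
            if lon_min <= x <= lon_max and lat_min <= y <= lat_max
        ]
        if len(covered) >= s:
            return (lon_min, lon_max, lat_min, lat_max)
        # Not enough organizations yet: enlarge the initial parameters and retry.
        half_lon += step
        half_lat += step
    raise ValueError("no range with at least s second-level grids was found")


# Hypothetical grid centres around a user located at (0.0, 0.0).
grid_centers = {"g1": (0.0, 0.0), "g2": (0.015, 0.0), "g3": (0.0, 0.025), "g4": (0.1, 0.1)}
box = generate_range((0.0, 0.0), s=3, half_lon=0.01, half_lat=0.01,
                     step=0.01, grid_centers=grid_centers)
```

The box is centred on the current location here for simplicity; an offset centre would hide the location better and fits the same loop.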
4.3. Historical Location Sub-Dataset Generation Algorithm
In our scheme, the mobile terminal carries out the selection of dummy locations and query content, so it needs to store the historical location dataset and the basic service set. Because its storage is limited, the server must send a smaller historical location sub-dataset to the mobile terminal. Here, we only consider the sub-dataset, since the basic service set is itself small and can be stored on the mobile terminal. We design an algorithm to generate the sub-dataset; Algorithm 2 gives its formal description. First, Algorithm 2 determines whether the historical locations within the submitted range are distributed across s different second-level grids. If so, they meet the requirement, and the algorithm returns the sub-dataset, which consists of the historical locations in the given range. Otherwise, the range is extended by a given increment to obtain a new range, and the above process is repeated until the range contains s different second-level grids. In general, one iteration is enough to meet the requirement.
Algorithm 2: Historical location sub-dataset generation.
Input: the submitted range (with its parameters) and G
Output: the historical location sub-dataset
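A server-side sketch of the filtering step described for Algorithm 2 is given below. The data layout is an assumption: G is taken to be a list of (longitude, latitude, second-level grid id) tuples, and `delta` stands for the extension increment, whose actual value and symbol are not recoverable from the text.

```python
def generate_sub_dataset(box, s, G, delta=0.01, max_iter=100):
    """Return the historical locations of G inside `box`, widening the box
    by `delta` on every side until they span at least s second-level grids."""
    lon_min, lon_max, lat_min, lat_max = box
    for _ in range(max_iter):
        subset = [
            (x, y, grid) for (x, y, grid) in G
            if lon_min <= x <= lon_max and lat_min <= y <= lat_max
        ]
        if len({grid for _, _, grid in subset}) >= s:
            return subset
        # Too few organizations in range: extend the range and filter again.
        lon_min, lon_max = lon_min - delta, lon_max + delta
        lat_min, lat_max = lat_min - delta, lat_max + delta
    raise ValueError("G does not cover s second-level grids near the range")


# Hypothetical historical locations: (longitude, latitude, grid id).
G = [
    (0.0, 0.0, "g1"), (0.005, 0.0, "g1"),
    (0.015, 0.0, "g2"), (0.0, 0.025, "g3"), (0.1, 0.1, "g4"),
]
sub_dataset = generate_sub_dataset((-0.01, 0.01, -0.01, 0.01), s=2, G=G)
```

With this input one extension suffices, which matches the remark that a single iteration usually meets the requirement.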
4.4. Dummy Location Generation Algorithm
To meet the degree of privacy protection, we need to achieve two goals: (1) each location appears evenly in a continuous query; (2) in each query, the k dummy locations are not all located in the same organization. We take two measures to achieve these goals. First, the user’s real location is not submitted to the LBS server but is replaced by a historical location in the same organization. Second, we add an identifier to each location in the historical location sub-dataset to ensure that the dummy locations selected for each query request are different. The identifier of a location is set to 1 each time the location is selected. When the identifiers of all locations in the sub-dataset are 1, which means that all locations have been traversed, all identifiers are reset to 0 to start a new round of selection. Replacing the user’s real location with a historical location protects the real location from being identified, and because the two locations are in the same organization, the loss of service quality is small.
Based on the above goals and measures, we design an algorithm to generate dummy locations; Algorithm 3 gives its formal description. First, we determine the second-level grid to which the user’s current location belongs and, using this grid as the starting point, randomly select $s-1$ further second-level grids from near to far. Together with the user’s own grid, they form a set of s second-level grids. Then, we randomly select one location from each of these grids and obtain s locations that are located in s different second-level grids, and we set the identifier of each selected historical location to 1. Second, we randomly select the next s locations from the remaining historical locations in the same s grids and set the identifiers of the selected historical locations to 1. We repeat this procedure until the remaining locations are selected. Finally, k locations are selected and evenly distributed across the s second-level grids. Among them, the location selected from the user’s own grid replaces the user’s real location, which prevents the server from distinguishing the real location from the dummy locations by comparing them with the historical location data it owns. For simplicity, we assume that the historical location dataset contains enough historical locations (in practice, enough historical locations can be obtained) for users to choose different historical locations each time. The specific algorithm is described as follows.
Algorithm 3: Dummy locations generation. |
Input: , s, k, Output: anonymous group
|
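The selection procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 3: the function name, the dictionary layout of the sub-dataset, and the use of a set to hold the locations whose identifier is 1 are all assumptions made for illustration, and the "near to far" ordering of grids is simplified to uniform random sampling.

```python
import random


def generate_dummy_locations(sub_dataset, user_grid, k, s, used, rng=random):
    """Pick k historical locations spread round-robin over s second-level grids.

    sub_dataset: dict mapping a grid id to a list of hashable locations (assumed layout).
    used: set of locations whose identifier is 1; it is updated in place.
    The first returned location comes from the user's own grid and stands in
    for the real location, which is never submitted.
    """
    if user_grid not in sub_dataset or not 1 <= s <= len(sub_dataset):
        raise ValueError("the sub-dataset must contain the user's grid and at least s grids")
    others = [g for g in sub_dataset if g != user_grid]
    # Simplification: the other s - 1 grids are sampled uniformly, not by distance.
    chosen = [user_grid] + rng.sample(others, s - 1)
    pool = [loc for g in chosen for loc in sub_dataset[g]]
    picked = []
    for i in range(k):
        if all(loc in used for loc in pool):
            used.clear()  # every identifier is 1: reset to 0 and start a new round
        grid = chosen[i % s]  # round-robin keeps the k locations evenly distributed
        candidates = [loc for loc in sub_dataset[grid]
                      if loc not in used and loc not in picked]
        if not candidates:
            # The text assumes enough historical locations; this sketch fails loudly otherwise.
            raise ValueError(f"grid {grid!r} has no unused historical location")
        loc = rng.choice(candidates)
        used.add(loc)  # set the identifier of the selected location to 1
        picked.append(loc)
    return picked
```

Passing the same `used` set to successive calls is what keeps the locations of consecutive query requests different from each other until a new round starts.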
4.5. Dummy Query Content Generation Algorithm
The number of locations participating in anonymity is k, so the number of corresponding anonymous query content items is also k. Assuming that the user has queried n times in a row, the number of locations and the number of query content items participating in the anonymity are both nk. Over n continuous queries, the idea of protecting the user’s query content is that each service category in the basic service set C is selected with equal probability. In other words, each service category is selected approximately the same number of times, so the attacker cannot guess the user’s query privacy from the distribution characteristics. For each service category in C, we record the number of times it is selected; these counts should meet the following constraints:
each query contains at least l different service categories.
Then, .
The specific algorithm is described as follows (Algorithm 4).
Algorithm 4: Query content generation. |
Input: k, n, l, C Output:
|
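The homogenization idea behind the query content generation can be sketched as follows. This is an illustrative greedy strategy, not the authors' Algorithm 4: the function name and the running `counts` table are assumptions, and the pairing of query content with individual locations is not modelled.

```python
from collections import Counter


def generate_query_contents(real_category, k, l, categories, counts):
    """Return k service categories for one query, including the real one.

    counts is a running Counter shared across the n continuous queries.
    Always taking the least-selected category keeps the counts close to equal,
    and the first l picks are forced to be distinct to reach l-diversity.
    """
    if real_category not in categories:
        raise ValueError("the real category must belong to the basic service set")
    if not 1 <= l <= min(k, len(categories)):
        raise ValueError("l must satisfy 1 <= l <= min(k, number of categories)")
    chosen = [real_category]
    counts[real_category] += 1
    while len(chosen) < k:
        distinct = set(chosen)
        if len(distinct) < l:
            # Diversity not reached yet: only categories absent from this query qualify.
            pool = [c for c in categories if c not in distinct]
        else:
            pool = list(categories)
        category = min(pool, key=lambda c: counts[c])  # least-selected category so far
        counts[category] += 1
        chosen.append(category)
    return chosen


# Hypothetical usage: six categories, k = 4, l = 2, one query per real category.
categories = ["food", "hotel", "hospital", "bank", "park", "school"]
counts = Counter()
for real in categories:
    generate_query_contents(real, 4, 2, categories, counts)
```

In this usage every category ends up selected the same number of times, so the frequency distribution alone does not reveal which category the user actually queried. The real category is placed first in the returned list for readability only; a deployment would need to hide its position.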
4.6. User Query Request Construction Algorithm
After generating k dummy locations and k dummy query content that satisfy the privacy protection requirement , the user’s query request can be expressed as , .
4.7. Response Message Generation Algorithm
After receiving the user query request , the server calculates and generates a response message . Finally, the user receives the response message and obtains the query results.
5. System Analysis
In this section, we analyze the security and feasibility of our scheme. Specifically, we prove the feasibility of the scheme and address the optimal solution problem. We also examine whether our scheme achieves the desired security and privacy requirements.
5.1. Existence of Solutions and Optimal Solution Problems
The ideal privacy protection scheme maximizes the privacy protection level, minimizes system overhead, and balances privacy protection and overhead in specific scenarios. This section proves the feasibility of the scheme by proving the existence of a solution to the multi-objective optimization problem, and solves the problem of balancing the privacy protection level against the system overhead.
In this paper, the security goal is to achieve privacy protection level . We use the probability of organization information being identified to measure and use the probability of query content being identified to measure . The smaller and are, the higher the privacy protection level is. To achieve the expected security goal, we select k dummy locations from and make k dummy locations be located in s second-level grids (i.e., organizations). is related to (the number of locations in ) and s. The larger and s are, the larger the is, and the larger the corresponding communication overhead and storage overhead are. The equilibrium problem of privacy protection level and system overhead can be described as minimizing , , and .
The objective functions of the above problem are denoted as , , and signifies the system overhead. Here, we consider the storage overhead and communication overhead related to for the following two reasons. On the one hand, compared with other methods, the increase in the cost of our method chiefly comes from the increase of . The cost of other data involved in the service process is much smaller than . On the other hand, mostly affects the computational cost of selecting s organizations. Once s organizations are determined, the computational cost of selecting k dummy locations is basically unchanged.
Constraints are described as follows.
locations for each organization correspond to
query content, each query content corresponds to a different service category, and
is no greater than
M (in this paper, we set
(see
Section 6),
; here, let
be reasonable; otherwise, it will cause
k to be too large, which will greatly increase the system overhead). There is at least one solution that can make the
query content corresponding to
locations in each organization different, and the number of times each query category selected after
n times queries equal.
We use a binary indicator variable to indicate whether the category corresponding to the jth location in the ith organization in the zth query is a given service category in C. One value of the indicator means that the category is selected; the other means that it is not selected. The constraints are expressed as follows.
In System of Linear equations (6), and are rounded up for the rigor of logic; ensures the number of dummy locations and dummy query content selected are both k; indicates the number of locations belonging to the ith organization in the zth times query; represents the number of times that the service category is selected in the zth times query. denotes that the number of times that any of the basic service categories is selected in the times queries are equal, which can be derived from .
The objective function having a solution means that, given k, when s changes from 2 to k, there is at least one scheme for each value of s that satisfies the constraints. Thus k and s can be regarded as fixed values in a specific solving process, and the question of whether the objective function has a solution is transformed into whether the equations in the constraint conditions have a solution.
Conclusion 1: The System of Linear equations (6) with integer coefficients has multiple integer solutions.
Proof. In a certain solving process, the number of variables is
, and the coefficient matrix and the augmented matrix of System of Linear equations (6) are, respectively, denoted as
B and
. Let
and
, then
is expressed as follows.
B is composed of the first
columns in
. We can work out the invariant factors of the matrix
B and the augmented matrix
through matrix elementary transformation. The invariant factors of
B and
are both
. In addition, the ranks of
B and
are both
. The number of effective equations in the constraint conditions is
and
. According to references [34, 35], System of Linear equations (6) has multiple integer solutions. Hence, there must be a scheme that makes the objective function reach a local or global optimum, and the presented scheme is therefore feasible. □
5.2. Security Analysis
5.2.1. User’s Privacy Protection Requirements
User’s privacy protection requirements in a single query: In a single query, the user submits a query request to the LBS server. If the user wants the attacker to identify the user’s real location with a probability that is not greater than , and the user’s organization name is identified with a probability that is no greater than , the user’s location privacy protection requirement is called . If the user wants the attacker to identify the user’s actual query content with a probability that is not greater than , the user’s query privacy protection requirement is called .
User’s location privacy protection requirements in continuous queries in the same location: The user submits queries continuously in the same location. If the following conditions are met:
- (i)
;
- (ii)
;
- (iii)
The proposed scheme meets the user’s privacy requirements in a continuous query request scenario in the same location.
5.2.2. Security Analysis of the Presented Scheme
Conclusion 2: The presented scheme can achieve privacy protection requirements .
Proof. In a single query, the presented scheme is clearly able to meet the user’s privacy protection requirements . In n times continuous queries, the k dummy locations included in each are selected from s different second-level grids, so the number of organizations that dummy locations belong to is s and then the probability that the user’s real organization is recognized is no greater than . Thus, the presented scheme satisfies . Moreover, The number of each basic service category selected is times, so the probability of each basic service category being recognized is , which satisfies . □
Conclusion 3: The presented scheme can resist LSA.
Proof. In this paper, the core idea for location privacy protection is to make the locations submitted to the LBS server as scattered as possible. The locations in each are different. As the number of query times n increases, the locations involved in the n query requests are always different from each other. locations disperse in no less than s different second-level grids. The distribution characteristics of the locations are consistent with the locations distribution characteristics of the locations in , and there will be no situation where the user’s locations are concentrated in some specific places. Therefore, the attacker cannot identify where the user often appears. In addition, because the user’s real location is replaced by the other location in the same second-level grid that the real one is located in, the probability that the real location is recognized is 0. As to the query content protection, the idea of homogenization is adopted to ensure the number of times that each basic service category is selected in each query is the same, then the probabilities of each service category being selected are all . From the above, it can be proved that the presented scheme can resist LSA attacks. □
Conclusion 4: The presented scheme can resist RSA.
Proof. It can be seen from the proof of Conclusion 3 that the mutually different locations in the n queries are scattered in s different second-level grids, and there is no case where some specific locations are highly concentrated. No query submits the user’s real location, and the query content protection reaches l-diversity. The attacker therefore cannot obtain the user’s real location and query content through association analysis. In summary, the presented scheme can resist RSA attacks. □
5.3. Performance Analysis
Utility: In the same LBS, the accuracy of the query results is determined by whether or not the user’s real location and real query content are submitted to the LBS server. Both REGP and L2P2 submit the user’s real location and real query content, so their query results are unaffected. In this paper, we do not submit the user’s real location; instead, we replace it with another historical location in the same second-level grid. This replacement causes a loss in utility, but the loss is acceptable for the following reason. In actual applications, the results of a neighbor query lie basically around the organization where the real location is located, and the historical location that replaces the real location is in the same organization. The loss in utility is therefore small. In addition, each query submits the real query content together with the dummy query content, so the obtained query results contain the results for the real query content.
Communication overhead: The communication overhead of the presented method primarily includes four aspects: (i) The user submits to the LBS server. The communication overhead of submitting is . (ii) The LBS server sends to the user. contains no less than s different second-level grids, and each second-level grid contains approximately n records. The communication overhead of sending is . (iii) As in the existing methods, each query request submitted to the LBS server by the user contains k locations and k query content. This part of the communication overhead is . (iv) Assuming that each query returns m POIs, the LBS server needs to return query results to the user, so the communication overhead of the query results is . Compared with REGP and L2P2, our method has more communication overhead about and , which provides the historical locations for dummy generation.
Computational cost: The computational cost mostly includes three aspects: (i) After receiving , the LBS server generates a historical location sub-dataset , and the computational cost of this part is ; (ii) the user generates dummy locations and dummy query content according to the parameter , then the computational cost of generating dummies is ; (iii) The LBS server calculates and returns query results to the user, and the computational cost about query results is . In terms of computational cost, our method is approximately equal to REGP and L2P2.
Storage overhead: In this paper, the client stores a historical location sub-dataset to generate dummies, which contains approximately records, so the storage overhead is . We mainly consider the storage overhead on the client side. Besides, the LBS server stores the historical location dataset G, which is about 1.6 GB and is negligible relative to the storage space of the server. Similarly, to defend against RSA, REGP needs to obtain and store the PLs on the map. The storage overhead is , whereas L2P2 does not need additional storage overhead.
6. Experiment
L2P2 studies privacy protection for users who stay in the same location, in both single and continuous requests, a scenario similar to ours. In addition, REGP aims at resisting RSA and LSA attacks, which is the same problem we try to solve. We therefore compare DGS-HSA with L2P2 and REGP in terms of privacy protection effect and system overhead to evaluate the effectiveness of the presented scheme, DGS-HSA. The privacy protection level is measured by the probability that the user’s real organization is recognized and the probability that the user’s real query content is recognized. The former reflects that, by considering the hierarchical structure of the address, the scheme protects not only the specific location but also the organization information corresponding to the location. The latter reflects that, by adopting l-diversity, the scheme can protect the user’s query content. The smaller the two probabilities are, the better the privacy protection effect is. The system overhead primarily refers to the storage overhead and communication overhead, which are related to the historical location sub-dataset.
Below, we first describe the dataset and the experimental setup and then present the experimental results and analysis obtained from extensive simulations.
6.1. Dataset
We choose the Geolife Trajectories 1.3 dataset of Microsoft Research Asia [36, 37, 38] as the historical location dataset and use the POI dataset of Amap [39] to provide the query results. Amap is a free map product in China, and also a very comprehensive and informative location-based map application.
REGP divided the map into grids of . We preprocess and mesh the original Geolife Trajectories 1.3 dataset according to the administrative region division. First, we delete the sequence number and time fields in the dataset and retain the longitude and latitude of each location. Then we add the district-level administrative region name (district name for short) and the organization name to each location of the dataset by using the open developer interface of Amap. Finally, we obtain G, which contains records corresponding to the historical locations and is used to generate dummies. Each record of G has four fields: longitude, latitude, district name, and organization name. More specifically, we sort G by district name, and the records that share a district name are regarded as the same first-level grid. Then we sort each first-level grid by organization name, and the records that share an organization name are regarded as the same second-level grid. After these two rounds of sorting and meshing, we obtain a grid dataset with a two-level structure. The grid dataset is stored on the LBS server and maintained by the LBS provider, using a quadtree for data indexing.
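The two-level meshing can be illustrated with a short Python sketch. It groups records with dictionaries instead of sorting them twice, which yields the same first-level (district) and second-level (organization) grids; the function name and the nested-dictionary layout are assumptions, and the quadtree index is not reproduced here.

```python
from collections import defaultdict


def build_two_level_grid(records):
    """Group (longitude, latitude, district, organization) records into two levels.

    Returns {district: {organization: [(longitude, latitude), ...]}}, with
    districts and organizations in sorted order.
    """
    grid = defaultdict(lambda: defaultdict(list))
    for lon, lat, district, organization in records:
        grid[district][organization].append((lon, lat))
    # Convert to plain dicts so the result is easy to inspect and serialize.
    return {district: dict(sorted(orgs.items()))
            for district, orgs in sorted(grid.items())}
```

Each inner list then plays the role of one second-level grid from which historical locations are drawn when dummies are generated.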
6.2. Experimental Setup
In the experiment of DGS-HSA, we select the trajectory data within three kilometers of BUPT in the Geolife Trajectories 1.3 dataset as the historical location dataset G. The processed historical location dataset has three first-level grids () and approximately 1300 second-level grids (). One hundred users are randomly distributed in different second-level grids. The basic service categories of Amap are used as the basic service set of the experiments, which has about 21 categories (). Set (), then (). Each user sends 100 query requests at a frequency of once per minute.
In the experiment of L2P2, the second-level grids where the residential quarters and hospitals are located are set as sensitive areas. In order to ensure the privacy protection effect, we set the PID exchange probability and .
In the experiment of REGP, we divide all second-level grids into four PLs. The second-level grids where the confidential organizations are located are set as the first-level privacy zone , such as the military and scientific research departments. The second-level grids where the hospital and residential community are located are set as the second-level privacy zone . In , personal privacy is easily leaked. The second-level grids where the parks are located have relatively fewer people and are set as the third-level privacy zone . The second-level grids that the malls and the schools are located in are densely populated, and are set as the fourth-level privacy zone . We also set the parameter and .
6.3. Experimental Results
6.3.1. The Probability that the User’s Real Organization is Recognized
The presented scheme does not submit the user’s real location, so the probability that the real location is recognized is zero. Therefore, this section discusses the relationship between the probability that the user’s real organization is recognized and k. The results are plotted in Figure 7. As k increases, this probability shows a downward trend for all three methods: the larger k is, the more organizations the k locations are spread over, and the smaller the probability is. Among the three methods, DGS-HSA has the best privacy protection effect, and L2P2 has the worst. This is because DGS-HSA considers the hierarchical structure of the address: when we select the dummies, the k locations must be located in s different organizations. L2P2, however, realizes anonymity through an ad hoc network in which the communication distance is limited, so the distribution of the k locations is relatively concentrated. In other words, s is smaller, particularly in extreme cases such as densely populated areas. REGP selects dummy locations based on the historical query probability and scatters the locations as much as possible, so s can increase to some extent. Compared with L2P2, REGP has a larger s, so it has a better privacy protection effect. However, when the same historical query probability occurs frequently in some areas, there is no guarantee that the k locations will not fall into fewer organizations (i.e., s is relatively smaller), and the privacy protection effect will be weakened.
6.3.2. The Probability that the User’s Real Query Content is Recognized
It can be seen from Figure 8 that, for different k, the probability that the user’s real query content is recognized is larger for REGP than for L2P2 and DGS-HSA, indicating that the privacy protection effect of REGP is the worst. This is because both L2P2 and DGS-HSA use l-diversity to protect the query content, while REGP does not. In addition, DGS-HSA requires the k locations to be evenly distributed among s different organizations, and the categories of query content in each organization are as different as possible, so DGS-HSA has a better privacy protection effect than L2P2.
6.3.3. The Equilibrium Problem of Privacy Protection and System Overhead
Next, we discuss the equilibrium problem of privacy protection and system overhead for DGS-HSA. From the attacker’s point of view, when the attacker receives a query request , and . Given k, then is also determined. Hence, we mainly discuss, when , with the change of k, how the value of s can balance the privacy protection and system overhead. When we calculate system overhead, for simplicity, each second-level grid contains about 3000 historical locations.
After normalization, the privacy protection measure and the system overhead can be represented in the same coordinate system. As shown in Figure 9, the two curves intersect, i.e., the proposed scheme achieves the balance between privacy protection and system overhead at the intersection. At this point, the storage overhead and communication overhead generated by each submission of the query are about 1.6 Mb. Moreover, Figure 10 shows how the value of s at the equilibrium point of privacy protection and system overhead changes as k changes: for some values of k the equilibrium is reached, and for others a local optimum is reached. The above results show that the presented scheme is feasible.
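The normalize-and-intersect step can be sketched in Python as follows. The helper names are hypothetical, and the two curves in the usage example are assumptions chosen only to make the sketch runnable: a recognition probability of 1/s and an overhead proportional to the 3000 historical locations per second-level grid mentioned above. They are not the measured curves of Figure 9.

```python
def min_max(values):
    """Scale a sequence to [0, 1] so that two quantities share one axis."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def balance_point(s_values, risk, overhead):
    """Return the s at which the normalized risk and overhead curves are closest."""
    norm_risk, norm_overhead = min_max(risk), min_max(overhead)
    best = min(range(len(s_values)),
               key=lambda i: abs(norm_risk[i] - norm_overhead[i]))
    return s_values[best]


# Hypothetical curves for k = 20: risk falls as 1/s, overhead grows as 3000 * s.
k = 20
s_values = list(range(2, k + 1))
risk = [1 / s for s in s_values]
overhead = [3000 * s for s in s_values]
print(balance_point(s_values, risk, overhead))
```

Because s only takes integer values, the sketch reports the integer s nearest to the crossing of the two normalized curves rather than an exact intersection.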