1. Introduction
Enabled by the maturity of 5G technologies, location-based services (LBSs) have become popular in daily life [1]. However, LBS providers store users’ trajectory data [2], which contain a large amount of sensitive information, such as shopping habits, home address, workplace, and frequently visited places [3]. If a service provider suffers a security breach, or if the data flow is exploited by attackers, unprotected trajectory data may be leaked directly, exposing the user’s sensitive information. Therefore, it is necessary to find a way to protect users’ trajectory data.
In response to this need, researchers have worked extensively on trajectory privacy protection technologies [4]. k-anonymity is one of the important techniques recently used to protect a user’s trajectory. The k-anonymity set is formed from similar trajectories and sent to the service provider [5], where k denotes the anonymity degree. Nevertheless, constructing a good k-anonymity set is challenging because the attacker may exploit side information and data mining techniques to distinguish the dummy trajectories.
For constructing the k-anonymity set, most existing approaches consider the direction similarity between trajectories [6,7,8,9,10,11,12,13,14,15,16]. However, these methods ignore the fact that different users have different attributes and movement patterns, and that trajectories generated by users with different attributes differ considerably.
In this paper, Spatiotemporal Mobility (SM) denotes the user’s attributes in terms of the number of stopovers and the average moving speed. Stopovers include supermarkets, parks, communities, or any other locations the user may visit. Even when the trajectories in an anonymity set have similar directions, the attacker can still distinguish them through the SM.
Figure 1 shows two users’ moving trajectories in one day. The trajectories colored in red and green belong to Alice and Bob, respectively. Alice’s stopovers are distributed over multiple locations in the region, and her average moving speed is comparatively high, which suggests that her daily movement pattern is irregular. In contrast, Bob’s stopovers are distributed over only two locations; he has fewer stopovers than Alice, and his average moving speed is lower, which suggests that his daily movement pattern is more regular. Hence, Alice’s mobility is higher than Bob’s. If the trajectory k-anonymity set submitted by Bob contains a trajectory generated by Alice, then once the attacker learns through data mining that Bob is an employee of a company, this high-mobility trajectory will be easily filtered out of the anonymity set.
Motivated by the above, this paper explores how to construct a k-anonymity set. Two issues must be considered: how to measure the similarity between trajectories, and how to make the trajectories in the k-anonymity set more similar. To address these two issues, a novel trajectory privacy-preserving algorithm is proposed. The main contributions of this paper are as follows:
The SM is defined based on the number of stopovers and the average moving speed of the user’s trajectory. Furthermore, the SM is used to measure the similarity between trajectories to form the trajectory k-anonymity set.
The trajectory graph is constructed to model the relationship between trajectories. The analysis of the relationship between trajectories is thereby transformed into the study of graph features.
The Spatiotemporal Mobility-based Trajectory Privacy-Preserving Algorithm (MTPPA) is proposed. The k-anonymity set is constructed from historical trajectories with the simulated annealing algorithm, which effectively improves the similarity between the trajectories of the anonymity set.
The performance is analyzed on a real dataset [6]. The results show that the k-anonymity set constructed by MTPPA has a lower trajectory privacy disclosure probability than existing algorithms while ensuring the quality of service.
The remainder of this article is organized as follows. Related works are discussed in Section 2. The problem formulation is explained in Section 3. The proposed MTPPA is presented in Section 4. The experimental results and analysis are reported in Section 5. Finally, the conclusion is given in Section 6.
2. Related Works
As one of the most important trajectory privacy protection technologies, the k-anonymity method was proposed by Gruteser et al. [7] in 2003. The anonymity set is constructed with k similar trajectories, so that the probability of an attacker distinguishing a particular user is no greater than 1/k. There are three kinds of approaches based on the k-anonymity method: the dummy trajectory method, the suppression method, and the generalization method.
The dummy trajectory method generates similar dummy trajectories to form the k-anonymity set. When generating similar trajectories, Liu et al. [8] select the final anonymity set according to three aspects: time reachability, direction similarity, and in-degree/out-degree. Wang et al. [9] rotate the user’s real trajectory at a selected rotation point to generate dummy trajectories. Shaham et al. [10] select dummy locations with the same posterior probability as the real location, such that the transfer probability of each location to the next k-anonymity set is equal. They generate multiple dummy location sets, divide them into several subsets, and then select the anonymity set with the largest entropy. However, the above methods do not meet real geographical constraints in most cases.
The suppression method constructs the k-anonymity set by removing highly sensitive locations from the trajectory collection. Zhao et al. [11] locally suppress whole problematic trajectories according to the trajectory frequency and the relationship between privacy relevance and data utility. To construct the k-anonymity set, Gramaglia et al. [12] suppress the sampling points so that the spatiotemporal granularity of the data is minimized. Li et al. [13] use a hidden Markov model to formulate the user’s mobile status and visited locations; a probability vector of the user’s moving direction is used as the decision variable to determine whether to reveal the user’s trajectory details. However, these methods lead to excessive trajectory information loss.
The generalization method generalizes a trajectory into a k-anonymity set, in which each location record at a timestamp is a generalized region. Based on the traditional generalization method, Xu et al. [14] consider four characteristics, namely direction, speed, time, and space, as the basis for measuring the similarity of trajectories. Xin et al. [15] use the Gibbs sampling clustering method to detect representative regions, which are then further generalized according to the rationality of equivalence classes. Zhang et al. [16] propose a trilateral Stackelberg game model based on community structure and design an optimization method that constructs the k-anonymity set by backward induction. However, when the road network is too sparse, the anonymous region produced by the above methods is also large.
To generate dummy trajectories that match real geographical constraints, this paper uses historical trajectories to construct the k-anonymity set. Furthermore, the SM is used to measure the similarity between trajectories. Trajectories with similar mobility levels are more difficult for the attacker to distinguish.
4. Spatiotemporal Mobility (SM) based Trajectory Privacy-Preserving Algorithm (MTPPA)
An overview of the proposed MTPPA algorithm is shown in Figure 2. There are three stages in MTPPA. In stage I, the trajectories are pre-processed: the equivalence classes are formed and the stopovers are detected. In stage II, the initial trajectory candidates are selected and the trajectory graph is constructed. In stage III, an optimal trajectory k-anonymity set is selected by the simulated annealing algorithm. After all three stages, the constructed anonymity set can protect the user’s trajectory privacy while meeting the requirements of high-quality services.
4.1. Trajectory Pre-Processing
The operations in stage I are similar to the processing of trajectories in equivalence classes in Huo et al.’s method [20]. The pre-processing includes a process for detecting the stopovers in the trajectory. Unlike Huo et al.’s method, this work protects trajectory privacy by hiding the stopovers in the trajectory.
To guarantee that the fake trajectories formed by the remaining dummy stopovers are reachable within the given request time interval [8], the equivalent trajectory time interval is generated according to the initial sampling time and the last sampling time. Moreover, an initial timestamp is selected according to the timestamp of the historical trajectory before an end timestamp is selected such that:
To keep the computation simple, all the sampling points of the historical trajectory are replaced by the real trajectory timestamps generated according to the user’s speed. Thus, the equivalence class is formed by both the real trajectory and the historical trajectory data.
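For the equivalence class formed by the real trajectory and the historical trajectory, if there is a sampling time in the real trajectory but not in the historical trajectory, a new sampling point is inserted into the historical trajectory at that time. Conversely, if there is a sampling time in the historical trajectory but not in the real trajectory, the corresponding sampling point is removed from the historical trajectory.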
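The synchronization step can be illustrated with a minimal Python sketch, assuming that the historical trajectory is aligned to the sampling times of the real trajectory. The `(t, x, y)` tuple layout, the function name `synchronize`, and the use of linear interpolation to place an inserted point are assumptions made for illustration; the text does not specify how the position of an inserted point is obtained.

```python
from bisect import bisect_left

def synchronize(real_times, historical):
    """Resample `historical` onto the sampling times of the real trajectory.

    `historical` is a non-empty list of (t, x, y) tuples with strictly
    increasing t. Points whose time is not in `real_times` are dropped.
    Missing times are filled by linear interpolation (an assumption),
    clamped to the first or last sample outside the recorded time range.
    """
    times = [t for t, _, _ in historical]
    synced = []
    for t in real_times:
        i = bisect_left(times, t)
        if i < len(times) and times[i] == t:
            synced.append(historical[i])               # time already present
        elif i == 0:
            synced.append((t,) + historical[0][1:])    # before the first sample
        elif i == len(times):
            synced.append((t,) + historical[-1][1:])   # after the last sample
        else:
            (t0, x0, y0), (t1, x1, y1) = historical[i - 1], historical[i]
            r = (t - t0) / (t1 - t0)
            synced.append((t, x0 + r * (x1 - x0), y0 + r * (y1 - y0)))
    return synced
```

After this step both trajectories share the same timestamps, so they can be compared point by point within the equivalence class.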
After synchronizing the trajectories, a detection process is used to find the stopovers in the trajectory equivalence class [21]. In practice, the DBSCAN algorithm, a popular unsupervised clustering algorithm, is used to detect the stopovers. The user needs to predefine the radius of the stopover. The radius not only determines the number of stopovers of the trajectory, but also affects the SM of the user.
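To illustrate how the radius influences the number of detected stopovers, the following is a compact, pure-Python DBSCAN sketch. It is not the authors' implementation: planar `(x, y)` coordinates, Euclidean distance, treating the stopover radius as the DBSCAN neighbourhood radius `eps`, and the `min_pts` density parameter (which the text does not mention) are all assumptions.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reached from a core point: border point
            if labels[j] is not None:
                continue                # already assigned, do not expand again
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reach)
    return labels

def count_stopovers(points, eps, min_pts=3):
    """Number of clusters found, interpreted here as the number of stopovers."""
    return len({label for label in dbscan(points, eps, min_pts) if label != -1})
```

With the same sampling points, a very small radius tends to leave every point as noise, while a very large radius merges all points into a single stopover, which is why the choice of radius also changes the SM.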
4.2. Initial Trajectory Candidates Selection
In this stage, the user sets a trajectory similarity threshold according to his/her privacy tolerance. The algorithm then selects trajectories from the trajectory database such that the SM difference between any selected trajectory and the real trajectory is not greater than the threshold. The selected trajectories and the real trajectory form the initial trajectory candidates set. A weighted undirected trajectory graph model is used to represent the relationship between the trajectories in this set. The procedure for constructing the trajectory graph is given in Algorithm 1.
Algorithm 1. Trajectory Graph Construction (TGC)
Input: Initial trajectory candidates set , trajectory similarity threshold .
Output: Trajectory graph
1: ;
2: ;
3: while do
4:   for each vertex in do
5:     for each vertex in do
6:       if then
7:         ;
8:         ;
9:         ;
10:        ;
11:        ;
12:      end if
13:    end for
14:  end for
15: end while
16: return ;
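The idea behind the trajectory graph can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: each trajectory is represented only by a precomputed SM value, the SM difference is taken as an absolute difference, and the function name and the adjacency-dictionary representation are hypothetical.

```python
from itertools import combinations

def build_trajectory_graph(sm, threshold):
    """Return {vertex: {neighbour: weight}} for an undirected weighted graph.

    `sm` maps a trajectory id to its SM value. Two trajectories are joined
    when their SM difference is not greater than `threshold`; the edge weight
    is that difference (taken here as an absolute difference, an assumption).
    """
    graph = {v: {} for v in sm}
    for u, v in combinations(sm, 2):
        diff = abs(sm[u] - sm[v])
        if diff <= threshold:
            graph[u][v] = diff   # store the edge in both directions
            graph[v][u] = diff
    return graph
```

Storing only the existing edges avoids any ambiguity between a missing edge and an edge whose weight happens to be zero.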
4.3. Optimal Anonymization Set Selection
The following process is used to select the optimal k-anonymity set from the initial trajectory candidates. The privacy protection performance of a trajectory anonymity set can be measured by the similarity of its trajectories: the more similar the k trajectories are to each other, that is, the smaller the sum of the mobility differences between every two trajectories, the better the privacy-preserving performance. Therefore, the problem of finding the optimal k-anonymity set is transformed into the problem of finding a k-clique of an undirected weighted graph [22,23], which is NP-hard. The process is divided into two parts, explained as follows.
The first part is to search for the maximum clique that contains the vertex of the user’s real trajectory in the trajectory graph. The number of vertices of this clique should be at least k. To find it, a greedy algorithm is designed in this paper. It starts with the real trajectory vertex and grows the current clique one vertex at a time by looping through the remaining vertices of the graph. For each vertex examined by this loop, if the vertex is adjacent to every vertex already in the clique, it is added to the clique; otherwise, it is discarded. The process is shown in Algorithm 2.
Algorithm 2. Search for Maximum Clique (SMC)
Input: Initial Trajectory graph
Output: Maximum clique
1: ;
2: ;
3: for each vertex in do
4:   for each vertex in do
5:     if is adjacent to then
6:       ;
7:       ;
8:       ;
9:       ;
10:    end if
11:  end for
12: end for
13: return
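A minimal Python sketch of the greedy search described above is given below. The adjacency-dictionary graph format and the function name are assumptions. Note that a single greedy pass of this kind yields a maximal clique containing the start vertex (one that cannot be extended further); the result depends on the order in which vertices are scanned and is not guaranteed to be the largest clique in the graph.

```python
def greedy_clique(graph, start):
    """Grow a clique from `start` by scanning the remaining vertices once.

    `graph` is {vertex: {neighbour: weight}}. A vertex is added only if it
    is adjacent to every vertex already in the clique; otherwise it is skipped.
    """
    clique = [start]
    for v in graph:
        if v == start:
            continue
        if all(v in graph[u] for u in clique):
            clique.append(v)
    return clique
```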
The second part is to select k vertices with a small sum of weights from the maximum clique. This process can be treated as a combinatorial optimization problem whose objective function is the sum of the SM differences between the trajectory pairs in the selected set.
is the decision variable of
.
. The mathematical model of the objective function is defined as follows:
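A formulation consistent with this description can be written as follows. The notation is introduced here for illustration and is an assumption rather than the source's own: C is the vertex set of the clique found in the first part, w_{ij} is the SM difference (edge weight) between trajectories i and j, x_i indicates whether trajectory i is selected, and r is the real trajectory, which is assumed to be always kept.

```latex
\min_{x} \; f(x) = \sum_{\substack{i, j \in C \\ i < j}} w_{ij}\, x_i x_j
\quad \text{s.t.} \quad \sum_{i \in C} x_i = k, \qquad x_r = 1, \qquad x_i \in \{0, 1\} \;\; \forall i \in C .
```

Each product x_i x_j equals 1 only when both trajectories are selected, so f(x) sums the weights of exactly the pairs inside the chosen k-anonymity set.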
When the problem size is very large, it is hard for conventional algorithms to find the optimal solution in polynomial time. Nevertheless, metaheuristic algorithms are capable of solving the problem with satisfactory efficiency [24]. One classical metaheuristic for such combinatorial optimization problems is the simulated annealing algorithm [25], which finds an approximately optimal solution quickly and has strong global search ability. Hence, the simulated annealing algorithm is used to solve the optimization problem (see Algorithm 3).
Algorithm 3. Search for k-anonymity set (SKAS)
Input: Maximum clique , , initial temperature , minimum temperature , number of inner iterations at each temperature .
Output: k-anonymity set
1: Select trajectories randomly from , set to ;
2: ;
3: Calculate ;
4: ;
5: while do
6:   for do
7:     Select trajectories randomly from , set to ;
8:     Calculate ;
9:     if then
10:      ;
11:    else
12:      ;
13:      if then
14:        ;
15:      end if
16:    end if
17:  end for
18:  ;
19:  ;
20: end while
21: ;
22: return
Algorithm 3 can be summarized in four steps:
- (1) Set the initial high temperature, the minimum temperature, and the number of iterations for each temperature.
- (2) Select an initial solution randomly from the maximum clique. Let it be the optimal solution and calculate its objective value.
- (3) Repeat the iterations for each temperature. In each iteration, generate a new solution. If its objective value is smaller than that of the current optimal solution, accept the new solution as the optimal solution. Otherwise, accept it with a probability that follows the Metropolis criterion and decreases as the temperature decreases. The criterion is shown as follows.
- (4) Gradually reduce the temperature and end the process once it is less than the minimum temperature. Then return the optimal solution. The temperature reduction mode is as follows.
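The four steps can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the acceptance probability is written in the standard Metropolis form exp(−Δf / T), the cooling schedule is assumed to be geometric (T ← cooling · T), the real trajectory is assumed to be kept in every candidate set, and all names and default parameter values are placeholders rather than the settings used in the experiments.

```python
import math
import random
from itertools import combinations

def objective(graph, subset):
    """Sum of edge weights (SM differences) over all pairs in `subset`."""
    return sum(graph[u][v] for u, v in combinations(subset, 2))

def anneal_k_set(graph, clique, real, k, t_init=1.0, t_min=0.1,
                 cooling=0.9, inner_iters=50, seed=None):
    """Pick k vertices of `clique` (always including `real`) with a small weight sum."""
    rng = random.Random(seed)
    others = [v for v in clique if v != real]
    current = [real] + rng.sample(others, k - 1)      # step (2): random initial solution
    f_cur = objective(graph, current)
    best, f_best = current, f_cur
    temp = t_init                                     # step (1): initial temperature
    while temp > t_min:
        for _ in range(inner_iters):                  # step (3): inner iterations
            candidate = [real] + rng.sample(others, k - 1)
            f_new = objective(graph, candidate)
            delta = f_new - f_cur
            # Metropolis criterion: always accept improvements, otherwise
            # accept with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, f_cur = candidate, f_new
                if f_cur < f_best:
                    best, f_best = current, f_cur
        temp *= cooling                               # step (4): geometric cooling (assumed)
    return best, f_best
```

Keeping track of the best solution seen so far separately from the currently accepted one means that occasionally accepting a worse candidate, which is what lets the search escape local optima, never degrades the returned set.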
5. Experiment
The implementation details, the feasibility analysis, the data availability analysis, and the security analysis are reported in this section. The implementation details are described in Section 5.1. A case study demonstrating MTPPA is provided in the feasibility analysis in Section 5.2. The data availability analysis is discussed in Section 5.3, and the security analysis in Section 5.4.
5.1. Implementation Details
The experiment was implemented with PyCharm in Python 3.8 on a Windows 10 operating system with an Intel(R) Core(TM) i3-7100U CPU @ 2.40 GHz and 4 GB RAM. The algorithms were repeated 50 times to ensure that the results obtained with different variable values were stable. The experiment uses the user trajectories obtained from Microsoft’s GeoLife Trajectories 1.3 [6] as the historical trajectories. This dataset contains 17,621 trajectories, recording a wide range of outdoor activities in users’ daily life, such as going home, going to work, shopping, and dining. The travel modes include driving, taking a bus or train, cycling, and walking. This dataset has been applied to mobility pattern mining, location-based social networks, location privacy, and location recommendation. After trajectory pre-processing, each trajectory of the dataset contains a sequence of 20 sampling points. A total of 3200 trajectories are selected randomly to form a trajectory equivalence class as the experimental data. After the trajectory pre-processing, trajectories with exactly one stopover account for the largest proportion when
.
To verify the effectiveness of the proposed algorithm,
and
were set as 0.9 and 0.1, respectively.
,
,
. The data availability was measured by the information loss: less information loss means better data availability. The information loss was estimated by the size of the cloaking area, similar to Hu et al.’s method [26]. The trajectory privacy disclosure probability of the algorithm was also analyzed. The proposed MTPPA algorithm was compared with the DTI algorithm [26] and the Random algorithm [8]. DTI-1 denotes the DTI algorithm when it considers only data utility, and DTI-2 denotes the DTI algorithm when it considers only trajectory privacy. The Random algorithm selects the k-anonymity set randomly from the trajectory candidates set.
5.2. Feasibility Analysis
To explain the application of MTPPA more clearly, a simple case study is given to describe the selection process of trajectory anonymity set when , , , .
As shown in Figure 3a, the initial trajectory candidates map is constructed from a real trajectory and seven historical trajectories. The number of stopovers and the average moving speed of these trajectories are listed in Table 2, where the SM of each trajectory is computed by Equation (2). The parameters are set as
,
,
. Equation (3) is used to calculate the SM difference between trajectories. The weight matrix of the eight trajectories is obtained as follows:
where a weight of 0 between two trajectories means that the two trajectories are not similar.
Figure 3b is the initial trajectory candidates graph constructed from the weight matrix of the eight trajectories. The maximum clique is obtained by Algorithm 2 and contains six trajectories. By using Algorithm 3, the four trajectories with the smallest sum of weights are selected from the maximum clique to form the optimal anonymity set.
5.3. Data Availability Analysis
In this subsection, the algorithms are compared in terms of information loss for different values of k.
Figure 4a compares the information loss of the four algorithms as k increases when
,
,
. As shown in Figure 4a, the information loss of the four algorithms decreases as k increases. For the same k, the information loss of the Random algorithm and the DTI-2 algorithm is relatively high. The information loss of the DTI-1 algorithm and the MTPPA algorithm is similar, and both are lower than that of the Random algorithm and the DTI-2 algorithm. This is because the DTI-1 algorithm essentially does not consider the similarity between trajectories. The k-anonymity set generated by the MTPPA algorithm contains the user’s real trajectory and k − 1 historical trajectories, so the query results always include the results for the user’s real location. Therefore, the MTPPA algorithm and the DTI-1 algorithm have lower information loss, which results in better data availability.
5.4. Security Analysis
In this subsection, the algorithms are compared in terms of trajectory privacy disclosure probability under different values of k and of three other parameters, each examined in turn.
5.4.1. Comparison of Algorithms in Terms of Trajectory Privacy Disclosure Probability under Different k
Figure 4b compares the trajectory privacy disclosure probability of the four algorithms as k increases when
,
,
. As shown in Figure 4b, the trajectory privacy disclosure probability of each of the four algorithms remains unchanged as k varies. For the same k, the trajectory privacy disclosure probabilities of the Random algorithm and the DTI-1 algorithm are relatively high. That of the DTI-2 algorithm is lower, and that of the MTPPA algorithm is 37% lower than that of the DTI-2 algorithm. This is because the Random algorithm and the DTI-1 algorithm essentially do not consider the similarity between trajectories and therefore have a high probability of privacy disclosure. The DTI-2 algorithm considers the similarity between trajectories, but it does not guarantee that the final k-anonymity set is a set of similar trajectories. The MTPPA algorithm does guarantee this and therefore has the lowest trajectory privacy disclosure probability.
5.4.2. Comparison of Algorithms in Terms of Trajectory Privacy Disclosure Probability under Different
Figure 4c compares the trajectory privacy disclosure probability of the four algorithms as
increases when
,
,
. It can be observed from Figure 4c that as
increases, the trajectory privacy disclosure probability of the DTI algorithm and the Random algorithm increases slightly. For any
value, the proposed MTPPA algorithm still has the lowest trajectory privacy disclosure probability, which is 42% lower than that of the DTI-2 algorithm. All four algorithms have the lowest trajectory privacy disclosure probability when
. This is because, in the selected experimental trajectories, trajectories with one stopover account for the largest proportion in this setting. When selecting the initial trajectory candidates, the probability of selecting these trajectories is higher. Thus, the trajectories of the final k-anonymity set are more similar.
5.4.3. Comparison of Algorithms in Terms of Trajectory Privacy Disclosure Probability under Different
Figure 4d compares the trajectory privacy disclosure probability of the four algorithms as
increases when
,
,
. It can be observed from Figure 4d that the trajectory privacy disclosure probability of the four algorithms decreases as
increases. The trajectory privacy disclosure probability of the Random algorithm is the highest, while that of the MTPPA algorithm is the lowest. When
the trajectory privacy disclosure probability of the four algorithms is 1, because from the attacker’s point of view all the trajectories of the k-anonymity set are dissimilar. When
, the trajectory privacy disclosure probability of the MTPPA algorithm is 0. This is because at this time,
. From the attacker’s point of view, the trajectories of the k-anonymity set generated by the MTPPA algorithm are similar to each other, whereas the trajectory privacy of the k-anonymity sets generated by the other algorithms is still at risk of disclosure.
5.4.4. Comparison of Algorithms in Terms of Trajectory Privacy Disclosure Probability under Different
Figure 4e compares the trajectory privacy disclosure probability of the four algorithms as
increases when
,
,
. It can be observed from Figure 4e that the MTPPA algorithm has the lowest trajectory privacy disclosure probability. When
is smaller than
, the trajectory privacy disclosure probability of the MTPPA algorithm is 0. When
is greater than
, the trajectory privacy disclosure probability of the MTPPA algorithm gradually increases as
increases. Although the trajectory similarity threshold
of the maximum clique generated by the MTPPA algorithm is very close to
, most of the SM differences between trajectories are far less than
. When
continues to increase, fewer and fewer of the SM differences between trajectories are smaller than
. As a result, the trajectory privacy disclosure probability increases.
6. Conclusions
In this paper, spatiotemporal mobility (SM) is defined to measure the similarity between trajectories, and the relationship between the SM and the anonymity set is identified. A mathematical model is constructed to represent the relationship between trajectories. Based on the SM and trajectory graph modeling, the MTPPA algorithm is proposed. The problem of finding the optimal k-anonymity set is transformed into the k-clique problem of an undirected weighted graph, and the simulated annealing algorithm is used to find an approximately optimal k-anonymity set. This effectively improves the similarity between the trajectories of the anonymity set while maintaining the same quality of service. Experimental results show that the trajectory privacy disclosure probability of the k-anonymity set generated by this algorithm is about 40% lower than that of existing algorithms.
This study considers the privacy protection effect when the historical trajectories are sufficient, but not the case in which they are sparse. Future studies may concentrate on the following aspects: (1) The privacy protection effect of this algorithm will be examined when the historical trajectories are sparse. (2) Based on the SM, the semantic information of the stopovers will be considered to achieve semantically secure anonymity. (3) The model and algorithm designed in this paper can be applied to popular services such as online car-hailing to match the best vehicle for the users without disclosing sensitive information of the users and the drivers.