Effective and Privacy-Preserving Estimation of the Density Distribution of LBS Users under Geo-Indistinguishability

Kim, Jongwook; Lim, Byungjin

doi:10.3390/electronics12040917

by Jongwook Kim * and Byungjin Lim

Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2023, 12(4), 917; https://doi.org/10.3390/electronics12040917

Submission received: 27 December 2022 / Revised: 8 February 2023 / Accepted: 10 February 2023 / Published: 12 February 2023

(This article belongs to the Special Issue Smart Sensing, Monitoring, and Control in Industry 4.0)

Abstract

With the widespread use of mobile devices, location-based services (LBSs), which provide useful services adjusted to users’ locations, have become indispensable to our daily lives. However, along with their benefits, LBSs also create problems for users because, to use LBSs, users are required to disclose their sensitive location information to the service providers. Hence, several studies have focused on protecting the location privacy of individual users when using LBSs. Geo-indistinguishability (Geo-I), which is based on the well-known differential privacy, has recently emerged as a de facto privacy definition for the protection of location data in LBSs. However, LBS providers require aggregate statistics, such as the user density distribution, to improve their service quality, and deriving them accurately from the location dataset received from users is difficult owing to the data perturbation of Geo-I. Thus, in this study, we investigated two different approaches, one based on the expectation-maximization (EM) algorithm and one based on deep learning, with the aim of precisely computing the density distribution of LBS users while preserving the privacy of location datasets. The evaluation results show that the deep learning approach significantly outperforms the other alternatives at all privacy protection levels. Furthermore, when a low level of privacy protection is sufficient, the approach based on the EM algorithm shows performance similar to that of the deep learning solution. Thus, it can be used instead of the deep learning approach, particularly when training datasets are not available.

Keywords: location privacy; density distribution; differential privacy; geo-indistinguishability

1. Introduction

Location-based services (LBSs) provide services that are adjusted to users’ current locations. With the widespread use of mobile devices, LBSs have become indispensable to our daily lives. In their early days, LBSs provided simple services such as directions and localized advertisements. However, with time they evolved to provide users with more complex services such as ride-sharing [1], mobile crowdsensing [2,3], mobile games [4], and disaster alarm services [5].

The demand for LBSs that can provide various services based on user location information is expected to continue to increase. However, similar to many other innovative services, LBSs also come with their own problems. The main source of these problems is the fact that, to use LBSs, users are required to disclose their location information to the LBS servers. Location data can easily be linked to sensitive information that individual users do not wish to disclose [6]. That is, by tracking users’ location information, it is possible to infer sensitive information such as home and work addresses, hospital visit records, and behavioral patterns such as offline shopping preferences. Thus, without proper protection of sensitive location data, there is an inevitable risk that LBS users’ sensitive information may be exposed to unauthorized parties [6,7].

Extensive studies have focused on protecting the location privacy of individual users when using LBSs. The most common approaches include anonymization techniques [8,9], dummy location techniques [10], and cryptography mechanisms [11,12]. Differential privacy (DP) [13] has emerged as the de-facto standard for privacy-preserving computations, and several approaches have sought to apply the notion of DP to the protection of location data. Among them, geo-indistinguishability (Geo-I) has recently emerged as the de-facto privacy definition for protecting location data [14,15,16]. As Geo-I can guarantee rigorous location privacy against adversaries with arbitrary background knowledge, it has been widely adopted in many LBS applications [17,18,19,20].

For LBS providers, the location datasets collected from their service users are valuable assets that can be used to obtain aggregate statistics, which, in turn, can be exploited to improve the quality of the service they provide. For example, in mobile crowdsensing applications, information on worker distribution obtained from the collection of historical location data can be exploited in worker selection processes [3,19,20]. In point-of-interest recommendation applications, the transition pattern between two consecutive points of interest, which can be extracted by analyzing historical location datasets, is used to recommend the next point-of-interest candidate to users [21,22,23]. Furthermore, the location datasets collected in LBSs can be published to external third parties for data analysis [24,25].

1.1. Motivation

Geo-I is a perturbation-based technique and thus guarantees that the true locations of individual users are not disclosed to their LBS providers. This alleviates user concerns regarding privacy leaks. However, for LBS providers, such a location-privacy-preserving mechanism makes utilizing location datasets collected from users difficult.

Consider the motivational example shown in Figure 1. In this example, owing to privacy concerns, each user first perturbs their true location using Geo-I and then sends the perturbed location to the LBS server. The LBS server then provides the requested service to the user based on their perturbed location. Meanwhile, for the purpose of improving the service quality, the service provider wishes to know the aggregate statistics, such as the user density distribution in each specified time period. However, because of the data perturbation mechanism of Geo-I, deriving such statistics accurately from the collection of location data received from users is difficult.

1.2. Contributions

In this study, we aim to develop a novel method to effectively compute aggregate statistics, particularly the user density distribution, from the perturbed location data of LBS users collected using Geo-I. Information regarding the density distribution of users is commonly used in diverse LBS applications. For example, one of the most representative LBS applications that exploit information about user-density distribution is a digitized map service, such as Apple Maps [26], Google Maps [27], and Waze [28]. Digitized map services currently leverage information on the density distribution of users (i.e., vehicle drivers) on roads, which is collected from map users, to provide traffic information and driving directions to users. Another application is the epidemic map, which provides information about areas wherein there is a high risk of infection. During the COVID-19 outbreak, numerous studies on privacy-preserving COVID-19 vulnerability map construction were carried out. For example, Chen et al. [29] established a COVID-19 vulnerability map based on information on the density distribution of voluntary participants with COVID-19 symptoms, in which Geo-I was used to collect the location of participants in a privacy-preserving manner.

The main contributions of this study are summarized as follows:

We developed a privacy-preserving framework for effectively computing the density distribution of LBS users based on the collection of users’ location information that has been obfuscated by the perturbation mechanism of Geo-I.
For an accurate estimation of the density distribution of LBS users with perturbed location datasets, we explored two different approaches. The first approach leverages the functionality of the expectation-maximization (EM) algorithm to estimate the hidden information precisely (i.e., users’ true location) from observed data (i.e., users’ perturbed location). In the second approach, which is the deep learning–based approach, a generative adversarial network is first trained using the available training datasets, and then the trained generative network is used to generate the true density distribution from the perturbed location information.
We evaluated the performance of the proposed algorithms using real and synthetic data. The evaluation results demonstrated that the proposed EM algorithm-based and deep learning-based approaches significantly outperform the existing approaches. Furthermore, based on the evaluation results, we analyzed the features of the proposed approaches.

The remainder of this paper is organized as follows. Section 2 presents related work. In Section 3, we provide preliminary information. Section 4 presents two approaches to estimate the density distribution of LBS users using perturbed location datasets. In Section 5, we present the experimental evaluation of the proposed approaches using real and synthetic datasets. Finally, the conclusions are presented in Section 6.

2. Related Work

Geo-I has attracted considerable attention for diverse LBS applications. Here, we briefly summarize several key applications of Geo-I. Mobile crowdsourcing (MC) is a framework in which a large group of people (e.g., workers) voluntarily participate in the collection and sharing of data using mobile devices. MC platforms require workers to share their locations with MC servers so that sensing tasks can be allocated to the closest workers. This violates workers’ location privacy and is thus perceived as a deterrent factor in their willingness to participate in sensing tasks. Hence, several studies have proposed the use of Geo-I to protect worker location privacy by obfuscating their true locations [19,30,31]. Social network services, which provide users with a list of people nearby, are increasingly popular. Location-aware social network services can help people get connected, but they can create a location privacy problem at the same time. Accordingly, several studies have aimed at leveraging Geo-I to protect the location privacy of users in location-aware social network services [18,32]. In ride-sharing services such as Uber, Waze, and Lyft, users are required to share their location information (e.g., current and destination locations) to benefit from the service. In this process, there is the risk that the service user’s sensitive location data may be exposed. To address the location privacy challenge in ride-sharing services, Tong et al. [17] and Shi et al. [33] proposed scheduling schemes that employ Geo-I to protect the location information of ride-sharing users.

With the recent development of the Internet of Things (IoT), we are now regularly encountering various IoT-based services, such as healthcare services and smart grids, in our daily lives [34]. With the development of IoT, vehicles are equipped with various sensors and Internet access, which can then be used to gather traffic data and road information [35]. For example, vehicles can collect information on their road environments using cameras [36]. Accordingly, several studies have suggested the use of vehicles as workers on MC platforms. In [20], Geo-I was used to protect the locations of vehicles participating in MC. The method proposed in [20] first models a road map as a weighted directed graph and represents the locations of tasks and vehicles as points on the graph. Then, the location perturbation of Geo-I was achieved using a probabilistic distribution over the graph. To improve the quality of service in vehicle networks, Zhou et al. [37] developed edge-assisted vehicle networks in which Geo-I was deployed at the edge nodes to protect the actual location of the vehicle.

Although extensive studies have been conducted on Geo-I, most have focused on protecting individual users’ location privacy in LBSs; only a few recent studies have addressed the problem of computing aggregate statistics using Geo-I in LBSs. EGeoIndis [38], which is based on Geo-I, is a vehicle location privacy protection framework used to estimate traffic density. It applies Geo-I to protect the privacy of vehicle location during traffic density estimation. Chen et al. [29] relied on Geo-I to collect the locations of voluntary participants with COVID-19 symptoms in a privacy-preserving manner; the location data collected under Geo-I were used to construct a COVID-19 vulnerability map.

Several studies have been conducted on privacy-preserving density estimation in diverse application domains. Wu et al. [39] presented a privacy-preserving traffic density estimation system that uses a pseudonym server and location anonymization server to prevent the risk of privacy disclosure by vehicle drivers. Huang et al. [40] proposed a privacy-preserving traffic density estimation model that combined homomorphic encryption with the Laplace mechanism to protect users’ location privacy. Furthermore, to collect track data from users in a privacy-preserving manner, a sampling method is employed to reduce the spatial and temporal continuity of location information. Kim and Jang [41] proposed a method to collect location data from users in indoor environments. They exploited local DP, which is a local version of DP, to preserve the location privacy of users when collecting their indoor location information. The indoor location data collected under the local DP were then used to compute the user density in indoor environments. Wang et al. [42] presented a location protection method for mobile crowdsensing to preserve worker privacy. They leveraged the local DP to collect the location information of workers, which was then used to estimate the distribution of workers for mobile crowdsensing. Yang et al. [43] proposed a privacy framework for mobile crowdsensing in which the cell service provider collects locations from workers and releases worker density in a sanitized form, which is obtained using DP, to the crowdsensing server.

3. Preliminary

In this section, we first introduce the preliminaries used in this study, formally define the problem, and present baseline approaches.

3.1. Geo-Indistinguishability

Geo-I is an extended version of the well-known DP. To protect users’ location information against adversaries with arbitrary background knowledge, Geo-I introduces the distance metric between objects to the concept of DP. Formally, $\epsilon$-Geo-I is defined as follows:

Definition 1 ($\epsilon$-Geo-I). Let us assume that $X$ represents a set of possible user locations. Let us further assume that a randomized mechanism, $K$, probabilistically generates a perturbed location from a user’s true location. Let $d(x, x')$ be the distance between $x, x' \in X$, such as the Euclidean distance between $x$ and $x'$. Then, $K$ satisfies $\epsilon$-Geo-I if and only if, for (1) all $x, x' \in X$ and (2) any output location $y \in X$ of $K$, the following is satisfied [6,14,15,16]:

$$K(x)(y) \leq e^{\epsilon \cdot d(x, x')} \times K(x')(y). \tag{1}$$

The above definition indicates that if $x$ and $x'$ are located close to each other, the distributions of locations (i.e., $K(x)(y)$ and $K(x')(y)$) generated by the randomized mechanism $K$ are similar to each other; thus, it is hard to determine whether any output location $y$ was generated from $x$ or from $x'$. In contrast, if they are located far from each other, the distributions of locations generated by $K$ can be relatively dissimilar; thus, an output location generated from $x$ can be more easily distinguished from one generated from $x'$.
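The inequality in Equation (1) can be checked numerically for any mechanism defined over a finite set of locations. The following is a minimal sketch under stated assumptions: the one-dimensional grid, the helper names `build_exponential_mechanism` and `satisfies_geo_i`, and the exponential-style mechanism itself are introduced here purely for illustration and are not the mechanisms used in this study.

```python
import math

def build_exponential_mechanism(points, eps):
    """Stand-in mechanism (an assumption, not the paper's mechanism):
    M[x][y] is proportional to exp(-eps * d(x, y) / 2), normalized per row."""
    matrix = []
    for x in points:
        weights = [math.exp(-eps * abs(x - y) / 2.0) for y in points]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix

def satisfies_geo_i(matrix, points, eps, tol=1e-12):
    """Check K(x)(y) <= exp(eps * d(x, x')) * K(x')(y) for all x, x', y."""
    n = len(points)
    for i in range(n):
        for j in range(n):
            bound = math.exp(eps * abs(points[i] - points[j]))
            for k in range(n):
                if matrix[i][k] > bound * matrix[j][k] + tol:
                    return False
    return True

points = [0.0, 1.0, 2.0, 3.0]          # hypothetical 1-D grid centers
M = build_exponential_mechanism(points, eps=0.5)
print(satisfies_geo_i(M, points, eps=0.5))
```

A deterministic mechanism that always reports the true location (an identity matrix) fails this check, because a zero entry can never bound a nonzero one.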

The parameter $\epsilon$, which is a privacy budget, controls a tradeoff between the level of data utility and privacy. A large value of $\epsilon$ provides a weak privacy guarantee, resulting in higher data utility. On the contrary, a small value of $\epsilon$ provides strong privacy protection for location information, resulting in lower data utility.
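As a worked illustration of this tradeoff, consider two locations at unit distance, $d(x, x') = 1$ (the unit distance is a hypothetical choice made only for this example). Equation (1) then bounds the ratio $K(x)(y) / K(x')(y)$ by $e^{\epsilon}$:

$$\epsilon = 0.1: \; e^{0.1} \approx 1.105, \qquad \epsilon = 1.5: \; e^{1.5} \approx 4.482$$

With $\epsilon = 0.1$, the two output distributions may differ by at most about 10.5% at any output location, so the two true locations are nearly indistinguishable; with $\epsilon = 1.5$, they may differ by a factor of about 4.5, which allows far less perturbation.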

There are two common approaches to achieve $\epsilon$-Geo-I: the Laplace mechanism [14] and the optimization mechanism [15,16]. In terms of data utility, it is well known that the optimization mechanism can produce perturbed locations with a higher utility than the Laplace mechanism. In the optimization mechanism, the LBS server first computes the perturbation matrix, $M$, by solving the following linear programming problem:

$$\begin{array}{lll} \min: & \sum_{x, y \in X} \pi_{x} \cdot M[x, y] \cdot d(x, y) & \\ \text{s.t.}: & M[x, y] \leq e^{\epsilon \cdot d(x, x')} \times M[x', y] & x, x', y \in X \\ & \sum_{y \in X} M[x, y] = 1 & x \in X \\ & M[x, y] > 0 & x, y \in X \end{array} \tag{2}$$

Here, $\pi$ is the prior probability distribution of user locations and can be defined using the available historical data. $M[x, y]$ denotes the probability that a perturbed location $y$ is randomly generated from a true location $x$ (i.e., $M[x, y] = K(x)(y)$).
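To make the objective and the row constraints of Equation (2) concrete, the sketch below evaluates them for a candidate matrix. It does not solve the linear program (that would require an LP solver); the function names, the three-cell prior, and the candidate matrix are hypothetical values chosen for illustration.

```python
def expected_distortion(prior, matrix, dist):
    """Objective of Equation (2): sum over x, y of pi_x * M[x][y] * d(x, y)."""
    m = len(prior)
    return sum(prior[x] * matrix[x][y] * dist(x, y)
               for x in range(m) for y in range(m))

def rows_are_distributions(matrix, tol=1e-9):
    """Row-sum and positivity constraints of Equation (2)."""
    return all(abs(sum(row) - 1.0) <= tol and all(p > 0 for p in row)
               for row in matrix)

# Hypothetical 3-cell example; cell indices double as 1-D coordinates.
prior = [0.5, 0.3, 0.2]
candidate = [[0.80, 0.15, 0.05],
             [0.10, 0.80, 0.10],
             [0.05, 0.15, 0.80]]

def dist(x, y):
    return abs(x - y)

print(expected_distortion(prior, candidate, dist))
print(rows_are_distributions(candidate))
```

For this candidate the expected distortion is approximately 0.235. A solver would search over all matrices satisfying the constraints, including the Geo-I inequality, for the one with the smallest such value.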

Once the perturbation matrix $M$ is computed, the LBS server distributes it to LBS users. After receiving $M$, each user perturbs their true location according to the probabilities encoded in $M$ and sends the perturbed location to the LBS server along with a service request. Then, the LBS server provides the requested service to the user based on their perturbed location. We note that during this process, the true location of each user is not disclosed because the perturbation of location data is performed on the user’s own mobile device.
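The user-side perturbation step can be sketched as a weighted random draw from the row of the perturbation matrix that corresponds to the user's true grid. The function name `perturb_location`, the zero-based grid indices, and the example matrix are assumptions made for illustration.

```python
import random

def perturb_location(true_index, matrix, rng=random):
    """Sample a perturbed grid index from row `true_index` of M."""
    row = matrix[true_index]
    return rng.choices(range(len(row)), weights=row, k=1)[0]

# Hypothetical perturbation matrix for three grids; each row sums to one.
M = [[0.80, 0.15, 0.05],
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]

rng = random.Random(0)                 # seeded only to make the run repeatable
reports = [perturb_location(0, M, rng) for _ in range(1000)]
print(sum(1 for r in reports if r == 0) / len(reports))
```

Only the sampled index leaves the device; the server sees `reports`, never `true_index`.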

3.2. Problem Definition

In this subsection, the problem addressed in this study is defined. Let $U = \{u_1, u_2, \dots, u_n\}$ be a set of LBS users. In this study, we assumed that the spatial domain was partitioned into disjoint grids. Let $X = \{x_1, x_2, \dots, x_m\}$ be the set of disjoint grids. The location of a user is then represented by the grid to which the user’s current location belongs geographically. Let $x_{u_k} \in X$ be the location of the $k$-th user and $x_{u_k}^{p} \in X$ be the corresponding perturbed location obtained using the perturbation matrix $M$.

Let $DB_s^{p} = \{x_{u_1}^{p}, x_{u_2}^{p}, \dots, x_{u_n}^{p}\}$ be the collection of perturbed locations maintained by the LBS server and received from $n$ users in the $s$-th time period. The objective of this study is to compute the density distribution, $p(x_i)$ (where $1 \leq i \leq m$), of LBS users in the $s$-th time period using the collection of perturbed locations $DB_s^{p}$.
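One simple way to represent a location by its grid is sketched below. It assumes a rectangular domain split into equally sized square cells numbered row by row from zero; the text does not prescribe a particular partition, so the function name `cell_index` and all parameters are hypothetical.

```python
def cell_index(x, y, x_min, y_min, cell_size, n_cols, n_rows):
    """Map a planar coordinate to the zero-based index of its grid cell."""
    col = int((x - x_min) // cell_size)
    row = int((y - y_min) // cell_size)
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        raise ValueError("location lies outside the spatial domain")
    return row * n_cols + col

# Hypothetical 3 x 3 domain with unit cells, so m = 9 grids.
print(cell_index(2.5, 2.5, 0.0, 0.0, 1.0, 3, 3))
```

Because every coordinate falls into exactly one cell, the resulting grids are disjoint, as the problem definition requires.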

3.3. Baseline Approaches

In this subsection, we present the baseline approaches. Given $DB_s^{p}$, the first baseline solution is to compute $p(x_i)$ directly based on $DB_s^{p}$:

$$p(x_i) = \frac{\mathrm{count}(x_i, DB_s^{p})}{n} \tag{3}$$

Here, $\mathrm{count}(x_i, DB_s^{p})$ represents the number of times location $x_i$ appears in $DB_s^{p}$. However, this straightforward approach cannot accurately estimate the user density distribution because it does not consider the effect of the Geo-I location perturbation mechanism.
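Equation (3) amounts to a normalized histogram of the received reports. A minimal sketch, assuming grids are encoded as zero-based integer indices (the function name `baseline_direct` and the sample reports are hypothetical):

```python
from collections import Counter

def baseline_direct(perturbed, m):
    """Equation (3): p(x_i) = count(x_i, DB_s^p) / n for grids 0..m-1."""
    n = len(perturbed)
    counts = Counter(perturbed)
    return [counts.get(i, 0) / n for i in range(m)]

db_p = [0, 0, 1, 2, 2, 2, 1, 0]        # hypothetical perturbed reports
print(baseline_direct(db_p, m=3))
```

For these eight reports the estimate is 3/8, 2/8, and 3/8 for the three grids, which describes where the perturbed reports fell rather than where the users actually were.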

A better solution to this problem is to exploit the probabilistic mapping information between the true and perturbed locations encoded in the perturbation matrix $M$. Given that $M$ is distributed to each user by the LBS server, $M[x_i, x_k]$ denotes the probability that a perturbed location $x_k$ (corresponding to the location sent to the LBS server by a user) is randomly generated from the user’s true location $x_i$. The second baseline approach then considers the mapping probability information between a perturbed and true location when computing $p(x_i)$:

$$p(x_i) = \frac{\sum_{1 \leq k \leq m} \left( M[x_i, x_k] \times \mathrm{count}(x_k, DB_s^{p}) \right)}{n} \tag{4}$$

Note that by the definition of $M$, $\sum_{1 \leq k \leq m} M[x_i, x_k] = 1$; that is, when computing $p(x_i)$, the above approach considers the probabilities encoded in the perturbation matrix $M$ that the perturbed location $x_k$ (where $1 \leq k \leq m$) is generated randomly from the true location $x_i$.
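The second baseline in Equation (4) can be sketched as a matrix-weighted sum of the report counts. The function name `baseline_matrix_weighted`, the zero-based grid indices, and the example matrix are assumptions for illustration.

```python
def baseline_matrix_weighted(perturbed, matrix):
    """Equation (4): p(x_i) = sum_k M[x_i][x_k] * count(x_k, DB_s^p) / n."""
    m = len(matrix)
    n = len(perturbed)
    counts = [0] * m
    for k in perturbed:
        counts[k] += 1
    return [sum(matrix[i][k] * counts[k] for k in range(m)) / n
            for i in range(m)]

# Hypothetical perturbation matrix and perturbed reports.
M = [[0.80, 0.15, 0.05],
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]
db_p = [0, 0, 1, 2, 2, 2, 1, 0]
print(baseline_matrix_weighted(db_p, M))
```

With this example the result is approximately 0.356, 0.275, and 0.356. These values do not sum exactly to one, because only the rows of $M$, not its columns, are guaranteed to sum to one.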

4. Effective Privacy-Preserving Estimation of the Density Distribution of LBS Users

In this section, we present two approaches for accurately estimating the true density distribution of LBS users based on perturbed location datasets collected under the Geo-I setting.

4.1. Expectation-Maximization Algorithm-Based Approach

In this subsection, we describe the details of the EM algorithm approach to compute the density distribution of LBS users with the perturbed location datasets. The EM algorithm, which is frequently used to estimate a set of hidden parameters based on observed data, is an iterative method that sequentially runs the E-step (expectation) and M-step (maximization). The E-step computes the expected value of the likelihood based on the current parameters and observed variables, and the M-step re-estimates the parameters to maximize the likelihood function.

The process of the EM algorithm to compute the density distribution of LBS users is defined as follows.

Initialization: In the initialization phase, the initial parameter $\theta^{(0)}$ is defined. $\pi_i^{(0)}$ is initialized with a uniform value, such as $\pi_i^{(0)} = p(x_i) = \frac{1}{m}$. We note that $\pi_i^{(0)}$ can alternatively be defined using the available historical data.

E-Step: Let $p^{(h)}(x_i \mid x_{u_t}^{p})$ be the posterior probability that, given the perturbed location $x_{u_t}^{p}$ (which is received from the $t$-th user $u_t$), the true location is $x_i$ under the current parameter $\theta^{(h)}$. Then, $p^{(h)}(x_i \mid x_{u_t}^{p})$ can be computed by using Bayes’ theorem as follows:

$$\begin{aligned} p^{(h)}(x_i \mid x_{u_t}^{p}) &= P(x_i \mid x_{u_t}^{p}, \theta^{(h)}) \\ &= \frac{P(x_i \mid \theta^{(h)}) \times P(x_{u_t}^{p} \mid x_i, \theta^{(h)})}{P(x_{u_t}^{p} \mid \theta^{(h)})} \\ &= \frac{P(x_i \mid \theta^{(h)}) \times P(x_{u_t}^{p} \mid x_i, \theta^{(h)})}{\sum_{j=1}^{m} P(x_j \mid \theta^{(h)}) \times P(x_{u_t}^{p} \mid x_j, \theta^{(h)})} \\ &= \frac{\pi_i^{(h)} \times M[x_i, x_{u_t}^{p}]}{\sum_{j=1}^{m} \pi_j^{(h)} \times M[x_j, x_{u_t}^{p}]} \end{aligned} \tag{5}$$

We note that by the definition of the perturbation matrix, $M[x_i, x_{u_t}^{p}]$ is equal to $p(x_{u_t}^{p} \mid x_i)$.

By using Equation (5), for each received perturbed location $x_{u_t}^{p} \in DB_s^{p}$ and each grid $x_i \in X$, the E-step computes the posterior probability $p^{(h)}(x_i \mid x_{u_t}^{p})$ with the current parameter $\theta^{(h)}$.

M-Step: In this step, the parameters are recomputed based on the current posterior probabilities computed in the previous E-Step:

$$\pi_i^{(h+1)} = \frac{\sum_{t=1}^{n} p^{(h)}(x_i \mid x_{u_t}^{p})}{n} \tag{6}$$

The abovementioned E-step and M-step are repeated until the values of the parameters converge towards a stable value. Let us assume that the values of the parameters of the EM algorithm converge after $h$ iterations. Then, $p(x_i)$ is computed as $p(x_i) = \pi_i^{(h+1)}$.
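The iteration described by Equations (5) and (6) can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: grids are zero-based indices, all entries of $M$ are positive, identical reports are grouped by count (which gives the same sums as iterating over users), and the names `em_density`, `max_iter`, and `tol` are introduced here.

```python
def em_density(perturbed, matrix, max_iter=1000, tol=1e-9):
    """Estimate pi_i = p(x_i) from perturbed grid indices with EM.

    matrix[i][k] is M[x_i, x_k]: probability of reporting k from true grid i.
    """
    m = len(matrix)
    n = len(perturbed)
    counts = [0] * m
    for k in perturbed:
        counts[k] += 1
    pi = [1.0 / m] * m                      # uniform initialization
    for _ in range(max_iter):
        new_pi = [0.0] * m
        for k in range(m):
            if counts[k] == 0:
                continue
            # E-step (Equation (5)): posterior over true grids given report k.
            denom = sum(pi[j] * matrix[j][k] for j in range(m))
            for i in range(m):
                new_pi[i] += counts[k] * pi[i] * matrix[i][k] / denom
        # M-step (Equation (6)): average the posteriors over all n reports.
        new_pi = [v / n for v in new_pi]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi

# Hypothetical check: reports whose frequencies match a known true density.
M = [[0.80, 0.15, 0.05],
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]
db_p = [0] * 440 + [1] * 345 + [2] * 215
print(em_density(db_p, M))
```

The report frequencies 0.44, 0.345, and 0.215 are exactly what a true density of 0.5, 0.3, and 0.2 would produce on average under this matrix, so the estimate should settle close to those true values, whereas Equation (3) would return the perturbed frequencies unchanged.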

4.2. Deep Learning–Based Approach

In this subsection, we present a deep learning approach to compute the density distribution of LBS users from perturbed location datasets collected under the Geo-I setting. The proposed method first trains a conditional generative adversarial network (cGAN) [44] such that it can generate the true density distribution from perturbed location datasets, and then exploits the well-trained generator of cGAN to estimate the density distribution from the given perturbed location data (Figure 2).

4.2.1. Training of Conditional Generative Adversarial Network

Let $DB = \{DB_1, DB_2, \dots, DB_v\}$ be the collection of training datasets in which $DB_r \in DB$ corresponds to a set of true locations of LBS users in the $r$-th time period. Let $DB_r^{p}$ be a set of perturbed locations generated from $DB_r$ by using the perturbation matrix $M$ of Geo-I. Let us assume that $p_{true}(x_i)$ is the true density of the grid $x_i \in X$ computed from $DB_r$. Similarly, $p(x_i)$ is defined as the perturbed density computed from $DB_r^{p}$ by using Equation (3). The proposed method first trains the generator of the cGAN such that it can generate the true density distributions $p_{true}(x_1), p_{true}(x_2), \dots, p_{true}(x_m)$ from the perturbed density distributions $p(x_1), p(x_2), \dots, p(x_m)$.

Figure 2 shows the cGAN structure utilized to estimate the density distribution of LBS users from perturbed location datasets. The cGAN structure comprises a generator (G) and discriminator (D). Given the perturbed density distribution and temporal information, such as daily and hourly information, the generator aims to generate density distributions that are similar to the true distributions. By contrast, the discriminator attempts to distinguish between the true density distributions and those from the generator.

The input to the generator comprises the following two parts: a noise vector $\vec{z}$ and condition vector $\vec{y}$. The condition vector $\vec{y}$ consists of the following two parts: the perturbed density distribution vector $\vec{p}$ and the temporal information vector $\vec{t}$. The perturbed density distribution, $p(x_1), p(x_2), \dots, p(x_m)$, is encoded into the $m$-dimensional perturbed density distribution vector $\vec{p} = [p(x_1), p(x_2), \dots, p(x_m)]$ such that the value of the $i$-th element corresponds to $p(x_i)$. For the temporal information, a one-hot encoder was used to represent the attributes as high-dimensional vectors. In this study, both daily and hourly information were used as temporal information. The daily information attribute was encoded into 7-dimensional binary vectors $\vec{t}_d$, and the hourly information was encoded into 24-dimensional binary vectors $\vec{t}_h$. These two vectors $\vec{t}_d$ and $\vec{t}_h$ are concatenated into $\vec{t}$. Subsequently, the perturbed density distribution vector $\vec{p}$ and temporal information vector $\vec{t}$ are concatenated into $\vec{y}$.

The noise vector $\vec{z}$ is defined as an $m$-dimensional vector. Note that $m$ denotes the grid set size. The value of each element in $\vec{z}$ corresponds to a value randomly sampled from the Gaussian distribution $N(\mu, \sigma)$. The generator takes the noise vector $\vec{z}$ and condition vector $\vec{y}$ as the input and generates the synthetic density distribution vector $\vec{d}_{syn}$, which is represented in $m$-dimensional space.
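The construction of the condition vector and noise vector described above can be sketched as follows. Several details are assumptions made for illustration: the concatenation order, the convention that day 0 is the first day of the week, the default values $\mu = 0$ and $\sigma = 1$, and feeding the generator the concatenation of $\vec{z}$ and $\vec{y}$. The helper names are hypothetical.

```python
import random

def one_hot(index, size):
    """Binary vector of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def condition_vector(perturbed_density, day_of_week, hour):
    """y = [p ; t_d ; t_h] with a 7-dim day code and a 24-dim hour code."""
    t_d = one_hot(day_of_week, 7)      # 0..6; which weekday is 0 is assumed
    t_h = one_hot(hour, 24)            # 0..23
    return list(perturbed_density) + t_d + t_h

def sample_noise(m, mu=0.0, sigma=1.0, rng=random):
    """m-dimensional noise vector with Gaussian entries."""
    return [rng.gauss(mu, sigma) for _ in range(m)]

p = [0.375, 0.25, 0.375]               # hypothetical perturbed densities, m = 3
y = condition_vector(p, day_of_week=2, hour=13)
z = sample_noise(len(p), rng=random.Random(0))
generator_input = z + y                # assumed way of combining z and y
print(len(y), len(generator_input))
```

With $m$ grids the condition vector has $m + 31$ entries, so the combined input in this sketch has $2m + 31$ entries.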

The discriminator of the cGAN is responsible for distinguishing between fake and real data instances. The discriminator input contains two parts: a density distribution vector and a condition vector. The density distribution vector $\vec{d}_{true}$ of real data instances is generated with the true density distribution, $p_{true}(x_1), p_{true}(x_2), \dots, p_{true}(x_m)$, whereas that of fake data instances corresponds to the output vector of the generator (i.e., $G(\vec{z}, \vec{y}) = \vec{d}_{syn}$). The discriminator outputs a scalar value (i.e., 0 or 1) indicating whether the input density distribution vector is real or fake.

During cGAN training, the generator attempts to fool the discriminator into believing that the data synthetically generated by the generator are real, whereas the discriminator attempts to distinguish between the real data and the synthetic data generated by the generator. This can be represented using a min–max optimization problem as follows [44]:

$$\min_{G} \max_{D} V(G, D) = \mathbb{E}_{\vec{d}_{true} \sim p_{data}(\vec{d}_{true})} \left[ \log D(\vec{d}_{true}, \vec{y}) \right] + \mathbb{E}_{\vec{z} \sim p(\vec{z})} \left[ \log \left( 1 - D(G(\vec{z}, \vec{y}), \vec{y}) \right) \right] \tag{7}$$

Here, $p_{data}(\vec{d}_{true})$ is the true density distribution of the training data and $p(\vec{z})$ corresponds to a Gaussian distribution $N(\mu, \sigma)$. The cGAN was trained by alternately updating the generator and discriminator parameters.
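To make Equation (7) concrete, the value function can be estimated from a batch of discriminator outputs by replacing each expectation with a sample mean. The sketch below assumes the discriminator outputs are probabilities strictly between 0 and 1; the function name `value_function` and the sample values are hypothetical.

```python
import math

def value_function(d_real, d_fake):
    """Sample-mean estimate of V(G, D) in Equation (7).

    d_real: outputs D(d_true, y) on real instances, each in (0, 1)
    d_fake: outputs D(G(z, y), y) on generated instances, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
print(value_function([0.5, 0.5], [0.5, 0.5]))
# A discriminator that separates them well yields a larger value.
print(value_function([0.9, 0.8], [0.1, 0.2]))
```

The discriminator update pushes this quantity up, while the generator update pushes the second term down by making generated instances look real.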

4.2.2. Estimating Density Distributions with Trained cGAN

Once training of the cGAN is complete, its well-trained generator can be used to infer the current true density distribution from the perturbed location data received from users under Geo-I. As shown in Figure 2, the input to the trained generator comprises the following two parts: a noise vector $\vec{z}$ and a condition vector $\vec{y}$ containing both the current perturbed densities and temporal information. The trained generator then generates an $m$-dimensional synthetic density distribution vector $\vec{d}_{syn}$, which corresponds to the estimated density distribution vector. That is, the synthetic density distribution vector $\vec{d}_{syn}$ is decoded such that the value of the $i$-th element of $\vec{d}_{syn}$ (i.e., $\vec{d}_{syn}[i]$) is the estimated density of the grid $x_i \in X$.

5. Experiments

In this section, we first describe the experimental and implementation setup and then discuss the experimental results.

5.1. Experimental Setup

We report the results for the following alternatives: the first baseline approach ($BA_1$), described in Section 3.3; the second baseline approach ($BA_2$), which leverages probabilistic mapping information between true and perturbed locations, presented in Section 3.3; the EM algorithm-based approach ($EA$), described in Section 4.1; and the deep learning-based approach ($DA$), discussed in Section 4.2. We also report the results for the Laplace mechanism approach ($LA$) used in [29], in which an obfuscated location is obtained by adding noise to the original location and is then mapped to the closest grid. In the experiment, we used the mean absolute error (MAE) to compare these alternative approaches:

$$MAE = \frac{1}{m} \times \sum_{i=1}^{m} \left| p(x_i) - p_{true}(x_i) \right| \tag{8}$$

Here, $p_{true}(x_i)$ represents the true density of grid $x_i$ and $p(x_i)$ denotes the density estimated using the perturbed location datasets collected under Geo-I.
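Equation (8) translates directly into code. A minimal sketch, with a hypothetical function name and example vectors:

```python
def mean_absolute_error(estimated, true):
    """Equation (8): average absolute difference over the m grids."""
    if len(estimated) != len(true):
        raise ValueError("density vectors must have the same length")
    m = len(true)
    return sum(abs(p - q) for p, q in zip(estimated, true)) / m

p_est = [0.375, 0.25, 0.375]           # hypothetical estimated densities
p_true = [0.5, 0.25, 0.25]             # hypothetical true densities
print(mean_absolute_error(p_est, p_true))
```

Here the per-grid errors are 0.125, 0, and 0.125, so the MAE is 0.25 / 3, or roughly 0.083.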

For evaluation, we used two datasets.

Seoul Metro datasets [45], which correspond to real datasets, contain information on the number of passengers using each metro station each hour. For our experiments, we selected 114 stations, each of which is regarded as a grid. Then, we extracted the number of passengers using each station for each hour, which is considered as the number of LBS users in each grid. We divided the entire datasets into training and testing datasets. The training datasets, which contained 10,500 hourly datasets, were used for training the cGAN structure of $DA$, as discussed in Section 4.2, whereas the testing datasets, containing the remaining 2658 hourly datasets, were used for evaluating the alternative approaches. We note that four approaches, $BA_1$, $BA_2$, $EA$, and $LA$, did not require the training datasets.
We also synthetically generated datasets for evaluation purposes. These datasets contained 225 grids. For each grid, the number of LBS users was randomly generated under the Gaussian distribution. For our experiments, we generated 20,000 training datasets, which were used for training the cGAN structure of $DA$, and 3000 testing datasets, which were used to evaluate the alternative approaches. For each dataset, a Gaussian distribution with a different standard deviation, randomly selected between 1.0 and 10, was used.

In the experiment, we used the following cGAN structure for $DA$. The generator of the cGAN consisted of four fully connected layers. In each intermediate hidden layer, the ReLU activation was used, whereas the sigmoid activation was used in the final layer. The discriminator of the cGAN has three fully connected layers, in which each hidden layer uses ReLU activation, except for the final layer, in which sigmoid activation is used. The cGAN model was implemented using PyTorch 1.13 with two Intel Xeon 5220R CPUs and an Nvidia RTX A6000 GPU.
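The shape of the generator described above (four fully connected layers, ReLU in the hidden layers, sigmoid at the output) can be illustrated with a framework-free forward pass. This is only a sketch: the study used PyTorch, and the hidden widths, the weight initialization, and the input layout below are hypothetical choices, since the text does not state them. The weights are random and untrained.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def generator_forward(z, y, layers):
    """ReLU after every layer except the last, which uses sigmoid."""
    h = list(z) + list(y)
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(h, weights, biases))

m = 4                                   # hypothetical number of grids
rng = random.Random(0)
sizes = [m + (m + 31), 16, 16, 16, m]   # hypothetical widths, four layers
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
z = [rng.gauss(0.0, 1.0) for _ in range(m)]
y = [0.25] * m + [0.0] * 31             # placeholder condition vector
d_syn = generator_forward(z, y, layers)
print(len(d_syn))
```

The sigmoid output keeps every element of the synthetic density vector between 0 and 1, which matches the range of a per-grid density.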

5.2. Results

Figure 3 shows the effect of the privacy budget $\epsilon$ on the MAE. In the experiments, $\epsilon$ varied from 0.1 to 1.5. For all methods, the MAE decreased as the privacy budget increased from 0.1 to 1.5. This is because a smaller value of $\epsilon$ for Geo-I ensures stronger privacy protection by introducing larger perturbations into the true location. This, in turn, reduces the utility of the collected data and increases the MAE. In contrast, a larger value of $\epsilon$ introduces a smaller perturbation to the true location and thus provides weaker privacy protection. Hence, with a larger value of $\epsilon$, the collected data retain high utility, which leads to a low MAE.
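The relationship between $\epsilon$ and the perturbation size can be illustrated with a short sketch. It assumes the standard planar Laplace mechanism for Geo-I in continuous space, whose noise radius follows a Gamma distribution with shape 2 and scale $1/\epsilon$, so the expected displacement is $2/\epsilon$. The mechanism used in the experiments may be a discretized, grid-based variant, and the function names here are illustrative.

```python
import math
import random


def planar_laplace(x, y, epsilon, rng=random):
    """Perturb (x, y) with planar Laplace noise (assumed continuous Geo-I mechanism)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)    # direction is uniform
    # Radial density is eps^2 * r * exp(-eps * r), i.e. Gamma(shape=2, scale=1/eps).
    r = rng.gammavariate(2.0, 1.0 / epsilon)
    return x + r * math.cos(theta), y + r * math.sin(theta)


def mean_displacement(epsilon, n=20_000, seed=0):
    """Empirical mean distance between the true and the reported location."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        px, py = planar_laplace(0.0, 0.0, epsilon, rng)
        total += math.hypot(px, py)
    return total / n


# Distances are in whatever length unit epsilon is defined against.
strong_privacy = mean_displacement(0.1)   # close to 2 / 0.1 = 20
weak_privacy = mean_displacement(1.5)     # close to 2 / 1.5, about 1.33
```

Under this assumption, moving from $\epsilon = 1.5$ to $\epsilon = 0.1$ increases the expected displacement by a factor of 15, which is consistent with the higher MAE observed at small privacy budgets.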

As shown in the figure, $D A$, which is based on cGAN, outperforms the other approaches. The performance gap between $D A$ and the other approaches (i.e., $B A_{1}$, $B A_{2}$, $E A$, and $L A$) increases significantly as $\epsilon$ decreases. Conversely, as $\epsilon$ increases, and the level of privacy protection therefore decreases, this gap narrows. In particular, when $\epsilon$ is set to 1.5, $E A$ achieves an MAE similar to that of $D A$. Among the four alternatives that do not require training datasets, $E A$, which is based on the EM algorithm, significantly outperforms $B A_{1}$, $B A_{2}$, and $L A$ at all privacy budget levels. These evaluation results demonstrate that the proposed approaches, $D A$ and $E A$, achieve significant performance gains over the baseline schemes.

To further investigate the performance gap between $E A$ and $D A$, we plotted the MAE for each individual testing dataset in Figure 4. Here, the x-axis represents the index of each testing dataset in the Seoul Metro datasets, and the y-axis represents the MAE. Note that the Seoul Metro testing set contained 2658 datasets. The figure also reports the standard deviation of the MAE for each of the two approaches to show the variation in the results. As shown in the figure, in most cases, $D A$ yields better results than $E A$, which is consistent with the results in Figure 3. In addition, $D A$ produces more stable results than $E A$ when a high level of privacy protection is required. In contrast, when the privacy protection level is low (i.e., $\epsilon = 1.5$), $E A$ shows slightly more stable results than $D A$. This is confirmed by the standard deviation values reported in the figure. When $\epsilon$ was set to 1.5, the standard deviation of $E A$ was slightly lower than that of $D A$. However, when $\epsilon$ was set to 0.1, 0.5, and 1.0, the standard deviation values of $D A$ were significantly lower than those of $E A$.
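The stability comparison above amounts to summarizing the per-dataset MAE values of each approach by their mean and standard deviation. A minimal sketch follows; it assumes the population standard deviation is used, and the MAE lists are made-up toy values for illustration, not experimental results.

```python
import statistics


def summarize(per_dataset_mae):
    """Return (mean, population standard deviation) of per-dataset MAE values."""
    mean = statistics.fmean(per_dataset_mae)
    std = statistics.pstdev(per_dataset_mae)
    return mean, std


# Toy values only: one MAE per testing dataset for two hypothetical approaches.
steady = [0.030, 0.031, 0.029, 0.030]
volatile = [0.020, 0.040, 0.030, 0.030]

steady_mean, steady_std = summarize(steady)
volatile_mean, volatile_std = summarize(volatile)
# Both lists have the same mean, but the second varies more across datasets.
```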

Figure 5 shows the impact of the size of the training datasets on the performance of $D A$. In the experiments, Seoul Metro datasets were used, and the size of the training datasets varied from 1000 to 10,500. Four privacy budgets (0.1, 0.5, 1.0, and 1.5) were used. For comparison, we also plotted $E A$, which shows the best performance among the approaches that do not require training datasets. As expected, the MAE of $D A$ decreased as the size of the training datasets increased. As can be observed in the figure, when the privacy protection level is high (i.e., $\epsilon = 0.1$ and $\epsilon = 0.5$), $D A$ outperforms $E A$ even when only a small amount (i.e., 1000) of training data is used to train the cGAN of $D A$. These results indicate that, with sufficiently large training datasets, the proposed $D A$ can achieve a good estimation of the density distribution of LBS users.

We also compared the execution times of $D A$ and $E A$. The execution time of $D A$ comprises the training and estimation times. Table 1 lists the overall training times of $D A$ for various training dataset sizes. In the experiments, Seoul Metro datasets were used, and the size of the training datasets varied from 1000 to 10,500. The number of epochs was set to 200, and the privacy budget was fixed at 1.0. The table shows that the training time increases approximately linearly with the training dataset size. The density distribution estimation time of $D A$ is approximately 0.165 s. In the case of $E A$, the execution time corresponds to the estimation time because $E A$ does not require model training. The density distribution estimation time of $E A$ is approximately 0.347 s.
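The linear trend in training time can be checked with an ordinary least-squares fit to the values reported in Table 1. The sketch below uses only those six reported points; the helper name `least_squares_line` is illustrative.

```python
# Training dataset sizes and training times (s) as reported in Table 1.
SIZES = [1000, 2000, 4000, 6000, 8000, 10500]
TIMES_S = [38, 65, 127, 195, 259, 324]


def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept and return (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = least_squares_line(SIZES, TIMES_S)
```

The fit gives a slope of roughly 0.031 s per additional training dataset (for 200 epochs) with a coefficient of determination above 0.99, which supports the description of the growth as linear over the measured range.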

5.3. Discussion

Based on the experimental results, we discuss the features of the two proposed methods, $E A$ and $D A$. The evaluation results verified that the deep learning approach, $D A$, significantly outperformed the other alternatives, particularly when a high level of privacy protection was required. However, $D A$ requires training datasets to train the underlying cGAN model. Thus, $D A$ is suitable for cases where protecting the location privacy of LBS users is the highest priority and sufficient training datasets are available.

In contrast, when a low level of privacy protection is sufficient, the EM algorithm approach, $E A$, shows a performance similar to that of $D A$. Hence, $E A$ is recommended for cases where a low level of privacy protection for LBS users is sufficient. In addition, unlike $D A$, $E A$ does not require any training datasets; thus, it is suitable for cases in which training datasets are not available.

6. Conclusions

We presented an effective privacy-preserving framework to estimate the density distribution of LBS users based on a collection of user location information obfuscated by the perturbation mechanism of Geo-I. In particular, we explored two approaches: the EM algorithm and deep learning approaches. The evaluation results with real and synthetic data demonstrated that if there are sufficient training datasets, the deep learning approach significantly outperforms the other alternatives at all privacy protection levels. In the case where training datasets are not available, a significant reduction in error rates can be achieved using the approach based on the EM algorithm compared with the baseline approaches.

Author Contributions

Conceptualization, J.K.; methodology, J.K. and B.L.; software, B.L.; validation, J.K. and B.L.; formal analysis, J.K.; investigation, J.K.; resources, J.K.; data curation, J.K. and B.L.; writing—original draft preparation, J.K.; writing—review and editing, J.K.; visualization, J.K. and B.L.; supervision, J.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a 2022 research Grant from Sangmyung University.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Dong, Y.; Wang, S.; Li, L.; Zhang, Z. An empirical study on travel patterns of internet based ride-sharing. Transp. Res. Part C Emerg. Technol. 2018, 86, 1–22.
2. Liu, Y.; Kong, L.; Chen, G. Data-oriented mobile crowdsensing: A comprehensive survey. IEEE Commun. Surv. Tutor. 2019, 21, 2849–2885.
3. Kim, J.W.; Edemacu, K.; Jang, B. Privacy-preserving mechanisms for location privacy in mobile crowdsensing: A survey. J. Netw. Comput. Appl. 2022, 200, 103315.
4. Chen, C.-S.; Lu, H.-P.; Luor, T. A new flow of location based service mobile games: Non-stickiness on Pokemon Go. Comput. Hum. Behav. 2018, 89, 182–190.
5. Bopp, E.; Douvinet, J. Spatial performance of location-based alerts in France. Int. J. Disaster Risk Reduct. 2020, 50, 101909.
6. Chatzikokolakis, K.; ElSalamouny, E.; Palamidessi, C. Efficient utility improvement for location privacy. In Proceedings of the Privacy Enhancing Technologies, Minneapolis, MN, USA, 18–21 July 2017; pp. 210–231.
7. Kim, J.W.; Edemacu, K.; Kim, J.S.; Chung, Y.D.; Jang, B. A survey of differential privacy-based techniques and their applicability to location-based services. Comput. Secur. 2021, 111, 102464.
8. Gruteser, M.O.; Grunwald, D. Anonymous usage of location-based services through spatial and temporal cloaking. In Proceedings of the International Conference on Mobile Systems, Applications and Services, San Francisco, CA, USA, 5–8 May 2003; pp. 31–42.
9. Beresford, A.R.; Stajano, F. Location privacy in pervasive computing. IEEE Pervasive Comput. 2003, 2, 46–55.
10. Kido, H.; Yanagisawa, Y.; Satoh, T. Protection of location privacy using dummies for location-based services. In Proceedings of the International Conference on Data Engineering Workshops, Tokyo, Japan, 5–8 April 2005.
11. Mascetti, S.; Freni, D.; Bettini, C.; Wang, X.; Jajodia, S. Privacy in geo-social networks: Proximity notification with untrusted service providers and curious buddies. Int. J. Very Large Data Bases 2011, 20, 541–566.
12. Popa, R.A.; Blumberg, A.J.; Balakrishnan, H.; Li, F.H. Privacy and accountability for location-based aggregate statistics. In Proceedings of the ACM Conference on Computer and Communications Security, Chicago, IL, USA, 17–21 October 2011; pp. 653–666.
13. Dwork, C. Differential privacy. In Proceedings of the International Conference on Automata, Languages and Programming, Venice, Italy, 10–14 July 2006.
14. Andres, M.E.; Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 4–8 November 2013; pp. 901–914.
15. Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Optimal geo-indistinguishable mechanisms for location privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 251–262.
16. Ahuja, R.; Ghinita, G.; Shahabi, C. A utility-preserving and scalable technique for protecting location data with geo-indistinguishability. In Proceedings of the International Conference on Extending Database Technology, Lisbon, Portugal, 26–29 March 2019; pp. 210–231.
17. Tong, W.; Hua, J.; Zhong, S. A jointly differentially private scheduling protocol for ridesharing services. IEEE Trans. Inf. Forensics Secur. 2017, 12, 2444–2456.
18. Ma, C.; Chen, C.W. Nearby friend discovery with geo-indistinguishability to stalkers. Procedia Comput. Sci. 2014, 34, 352–359.
19. Wang, Z.; Hu, J.; Lv, R.; Wei, J.; Wang, Q. Personalized privacy-preserving task allocation for mobile crowdsensing. IEEE Trans. Mob. Comput. 2018, 18, 1330–1341.
20. Qiu, C.; Squicciarini, A.C. Location privacy protection in vehicle-based spatial crowdsourcing via geo-indistinguishability. In Proceedings of the IEEE International Conference on Distributed Computing Systems, Dallas, TX, USA, 7–9 July 2019; pp. 1061–1071.
21. Kim, J.S.; Kim, J.W.; Chung, Y.D. Successive point-of-interest recommendation with local differential privacy. IEEE Access 2021, 9, 66371–66386.
22. Feng, S.; Li, X.; Zeng, Y.; Cong, G.; Chee, Y.M. Personalized ranking metric embedding for next new poi recommendation. In Proceedings of the International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 2069–2075.
23. Zhang, H.; Chen, Z.; Liu, Z.; Zhu, Y.; Wu, C. Location prediction based on transition probability matrices constructing from sequential rules for spatial-temporal k-Anonymity dataset. PLoS ONE 2019, 11, e0160629.
24. Li, M.; Zhu, L.; Zhang, Z.; Xu, R. Differentially private publication scheme for trajectory data. In Proceedings of the IEEE International Conference on Data Science in Cyberspace (DSC), Changsha, China, 13–16 June 2016.
25. Li, M.; Zhu, L.; Zhang, Z.; Xu, R. Achieving differential privacy of trajectory data publishing in participatory sensing. Inf. Sci. 2017, 400–401, 1–13.
26. Apple Maps. Available online: https://www.apple.com/maps (accessed on 30 January 2023).
27. Google Maps. Available online: https://www.google.com/maps (accessed on 30 January 2023).
28. Waze. Available online: https://www.waze.com (accessed on 30 January 2023).
29. Chen, R.; Li, L.; Chen, J.J.; Hou, R.; Gong, Y.; Guo, Y.; Pan, M. COVID-19 vulnerability map construction via location privacy preserving mobile crowdsourcing. In Proceedings of the IEEE Conference and Exhibition on Global Telecommunications, Taipei, Taiwan, 7–11 December 2020.
30. Yan, K.; Luo, G.; Zheng, X.; Tian, L.; Sai, A.M.V.V. A comprehensive location-privacy-awareness task selection mechanism in mobile crowd-sensing. IEEE Access 2019, 7, 77541–77554.
31. Jin, W.; Xiao, M.; Guo, L.; Yang, L.; Li, M. ULPT: A user-centric location privacy trading framework for mobile crowd sensing. IEEE Trans. Mob. Comput. 2022, 21, 3789–3806.
32. Huang, C.; Lu, R.; Zhu, H.; Shao, J.; Alamer, A.; Lin, X. EPPD: Efficient and privacy-preserving proximity testing with differential privacy techniques. In Proceedings of the IEEE International Conference on Communications, Kuala Lumpur, Malaysia, 23–27 May 2016; pp. 1–6.
33. Shi, D.; Ding, J.; Errapotu, S.M.; Yue, H.; Xu, W.; Zhou, X.; Pan, M. Deep Q-network-based route scheduling for TNC vehicles with passengers’ location differential privacy. IEEE Internet Things J. 2019, 6, 7681–7692.
34. Gao, H.; Qiu, B.; Barroso, R.J.D.; Hussain, W.; Xu, Y.; Wang, X. TSMAE: A novel anomaly detection approach for internet of things time series data using memory-augmented autoencoder. IEEE Trans. Netw. Sci. Eng. 2022.
35. Pudar, N.J.; Schwinke, S.P.; Tengler, S.C. Method of Using Vehicle Location Information with a Wireless Mobile Device. U.S. Patent 8,744,745, 8 June 2014.
36. Gao, H.; Fang, D.; Xiao, J.; Hussain, W.; Kim, J.Y. CAMRL: A joint method of channel attention and multidimensional regression loss for 3D object detection in automated vehicles. IEEE Trans. Intell. Transp. Syst. 2022.
37. Zhou, L.; Yu, L.; Du, S.; Zhu, H.; Chen, C. Achieving differentially private location privacy in edge-assistant connected vehicles. IEEE Internet Things J. 2019, 6, 4472–4481.
38. Ren, W.; Tang, S. EGeoIndis: An effective and efficient location privacy protection framework in traffic density detection. Veh. Commun. 2020, 21, 100187.
39. Wu, L.; Wei, X.; Meng, L.; Zhao, S.; Wang, H. Privacy-preserving location-based traffic density monitoring. Connect. Sci. 2022, 34, 874–894.
40. Huang, Y.; Tian, Y.; Liu, Z.; Jin, X.; Liu, Y.; Zhao, S.; Tian, D. A traffic density estimation model based on crowdsourcing privacy protection. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–18.
41. Kim, J.W.; Jang, B. Workload-aware indoor positioning data collection via local differential privacy. IEEE Commun. Lett. 2019, 23, 1352–1356.
42. Wang, J.; Wang, Y.; Zhao, G.; Zhao, Z. Location protection method for mobile crowd sensing based on local differential privacy preference. Peer-to-Peer Netw. Appl. 2019, 12, 1097–1109.
43. Yang, M.; Zhu, T.; Xiang, Y.; Zhou, W. Density-based location preservation for mobile crowdsensing with differential privacy. IEEE Access 2018, 6, 14779–14789.
44. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
45. Seoul Metro Dataset. 2022. Available online: https://data.seoul.go.kr/dataList/OA-12252/S/1/datasetView.do (accessed on 30 June 2022).

Figure 1. High level architecture of the proposed framework.

Figure 2. The overall neural network structure for estimating the density distribution of LBS users under Geo-I setting.

Figure 3. MAE for varying $\epsilon$. (a) Seoul Metro dataset; (b) Synthetic dataset.

Figure 4. MAE of each individual testing dataset. (a) $\epsilon = 0.1$; (b) $\epsilon = 0.5$; (c) $\epsilon = 1.0$; (d) $\epsilon = 1.5$.

Figure 5. MAE for varying sizes of the training datasets. (a) $\epsilon = 0.1$; (b) $\epsilon = 0.5$; (c) $\epsilon = 1.0$; (d) $\epsilon = 1.5$.

Table 1. Training time of $D A$ for varying training dataset sizes.

Training data size	1000	2000	4000	6000	8000	10,500
Training time (s)	38	65	127	195	259	324

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kim, J.; Lim, B. Effective and Privacy-Preserving Estimation of the Density Distribution of LBS Users under Geo-Indistinguishability. Electronics 2023, 12, 917. https://doi.org/10.3390/electronics12040917

