Improving Data Utility in Privacy-Preserving Location Data Collection via Adaptive Grid Partitioning

Kim, Jongwook

doi:10.3390/electronics13153073

by Jongwook Kim

Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea

Electronics 2024, 13(15), 3073; https://doi.org/10.3390/electronics13153073

Submission received: 27 June 2024 / Revised: 27 July 2024 / Accepted: 29 July 2024 / Published: 3 August 2024

(This article belongs to the Special Issue Cryptography in Network Security)

Abstract

The widespread availability of GPS-enabled devices and advances in positioning technologies have significantly facilitated collecting user location data, making it an invaluable asset across various industries. As a result, there is an increasing demand for the collection and sharing of these data. Given the sensitive nature of user location information, considerable efforts have been made to ensure privacy, with differential privacy (DP)-based schemes emerging as the most preferred approach. However, these methods typically represent user locations on uniformly partitioned grids, which often do not accurately reflect the true distribution of users within a space. Therefore, in this paper, we introduce a novel method that adaptively adjusts the grid in real-time during data collection, thereby representing users on these dynamically partitioned grids to enhance the utility of the collected data. Specifically, our method directly captures user distribution during the data collection process, eliminating the need to rely on pre-existing user distribution data. Experimental results with real datasets show that the proposed scheme significantly enhances the utility of the collected location data compared to the existing method.

Keywords: location privacy; density distribution; differential privacy; geo-indistinguishability

1. Introduction

The proliferation of GPS-enabled devices and recent advances in positioning technologies have made it easier to collect user location data, making them a valuable asset for various sectors. These data play an important role in areas such as personalized marketing, real-time traffic analysis, recommendations, etc. For example, real-time traffic analysis utilizes location data to optimize traffic flow, reduce congestion, and improve navigation systems for more efficient travel [1,2]. Additionally, location-based recommendations for services, restaurants, and events provide users with relevant and timely suggestions, enhancing their overall experience [3,4]. As a result, the demand for collecting and sharing user location data continues to increase.

User location data are sensitive because they contain personal information, such as home or company addresses, hospital visit records, and even political affiliations [5,6,7]. For example, by collecting and analyzing the positioning information of visitors in a large indoor shopping mall, it is possible to infer sensitive details, such as their shopping patterns. In addition, location data can be cross-referenced with other datasets to draw even more precise conclusions about an individual’s lifestyle and choices [8]. For instance, frequent visits to certain types of businesses or locations can indicate specific health conditions, hobbies, or even religious practices. The indiscriminate collection of location data therefore raises significant privacy concerns, and considerable effort has been devoted to protecting users’ location privacy when such data are handled.

As differential privacy (DP) [9,10] has become the de facto standard for handling sensitive personal data, significant efforts have been made to apply it to location data. As a result, numerous DP-based methods have been proposed to collect, process, and analyze location data while preserving privacy. Many of these approaches represent user location data using grids, where the entire domain is uniformly partitioned into disjoint grids, and a user’s location is represented by the grid in which his or her actual position lies [11,12,13,14]. Although representing user location using uniformly partitioned grids is straightforward, it does not account for the actual distribution of users within the space. This approach often results in lower utility of collected location data, as it assumes that users are evenly and uniformly distributed throughout the area. However, this is not true in most real-world scenarios, where some areas are denser than others. For example, in an urban environment, the city center may have a high concentration of users, while the suburbs have a lower density. This discrepancy can negatively impact the accuracy and effectiveness of subsequent analyses. Therefore, more sophisticated grid representations that align with actual user distributions are necessary to improve data utilization and enhance the performance of these applications.

Existing solutions assume the existence of prior information about the distribution of users, typically obtained from historical data. However, such historical data may not always be available for many applications. More importantly, prior information about the distribution of users derived from historical data may not match the current distribution as it may change over time or in response to special social events. For example, major events such as festivals can dramatically change user movement patterns and densities, rendering historical data obsolete or misleading [15]. Therefore, it is preferable to instantaneously extract information about the distribution of users during data collection and adaptively adjust the grid accordingly.

In this paper, we propose a novel method that simultaneously extracts the distribution of users and adaptively adjusts the grid in real-time during location data collection. The contributions of this work can be summarized as follows:

First, we introduce a method to effectively compute the distribution of users during DP-based location data collection. This approach is able to effectively capture user distribution in real-time and adapt to dynamic changes in user behavior.
Then, we propose a method to adaptively adjust the grid to maximize the utility of the collected location data under DP. This adaptive grid adjustment is designed to improve the granularity and relevance of the data, ensuring that the most significant and densely populated areas are prioritized, thus improving the overall quality and applicability of the data.
We evaluated the performance of the proposed algorithms using real-world datasets. The evaluation results demonstrated that the proposed scheme significantly enhances the utility of the collected location data compared to existing methods.

The rest of this paper is organized as follows: Section 2 reviews related work. In Section 3, we provide background information. In Section 4, we introduce a novel method that simultaneously extracts user distribution and adaptively adjusts the grid in real-time during location data collection. In Section 5, we experimentally evaluate the proposed approach with real datasets. Finally, Section 6 presents our conclusions.

2. Related Work

Numerous DP-based methods have been developed to collect, process, and analyze location data while preserving privacy. In this section, we provide a brief overview of these methods.

Local differential privacy (LDP) is a variant of DP in which each user individually perturbs his own sensitive data before reporting it to the server. Kim and Jang [12] propose an LDP-based data aggregation approach designed for workload-aware collection of indoor positioning data, while ensuring user privacy. Their method identifies an optimal data encoding and perturbation strategy within the LDP framework to minimize the overall estimation error for the given workload. LDPTrace [14] is designed to synthetically generate locally differentially private trajectory data. In this method, user location information is collected using LDP to ensure privacy, and these perturbed data are then used to generate synthetic trajectories. Kim et al. [3] present a method for recommending the next point-of-interest, utilizing location data collected under LDP.

Metric differential privacy (MDP) extends the standard differential privacy framework to handle data with inherent metric or distance measures [16]. This extension is particularly useful for location-based data. Geo-indistinguishability (Geo-Ind) is a specific application of MDP designed for location-based services [11,17,18]. Mobile Crowdsensing (MCS) frameworks often use Geo-Ind to collect location information from workers and assign tasks in a privacy-preserving manner. Wang et al. [19] were the first to use Geo-Ind to protect the location privacy of workers in the MCS process. Their proposed framework includes three steps: First, the MCS server generates a function that satisfies Geo-Ind. Next, each worker downloads this function, obfuscates their true location, and uploads the obfuscated location to the server. Finally, the MCS server assigns tasks to workers based on the obfuscated location information. In [20], location privacy protection in vehicle-based MCS is investigated, where the roadmap is modeled as a weighted directed graph with task and worker locations as points on the graph. The authors propose an optimization mechanism-based obfuscation scheme that achieves location obfuscation through a probabilistic distribution over the graph that satisfies Geo-Ind. Jin et al. [21] propose a user-centric location privacy trading framework for MCS. Following the notion of Geo-Ind, they design a location obfuscation mechanism that allows each worker to probabilistically obfuscate their true location using their own privacy budget. Zhang et al. [13] introduce an obfuscation method that satisfies Geo-Ind to collect location information from workers in MCS. Huang et al. [22] propose a privacy-aware scheme for MCS-based noise monitoring, where the server publishes tasks and workers report perturbed locations and noise levels under DP. Each worker collaborates with a master, carefully selected from the workers in the same group, to achieve group-level Geo-Ind.
Zhao et al. [23] explored the privacy protection of individuals’ locations in the context of analyzing the geographic directional distribution of the community. They defined community information using a covariance matrix and integrated it into the proposed geo-ellipse indistinguishability based on Geo-Ind. This geo-ellipse indistinguishability provides quantifiable privacy guarantees for locations within Mahalanobis space. Yu et al. [24] highlighted the weaknesses of current Geo-Ind-based location obfuscation mechanisms, especially when users consistently share their locations with multiple LBS providers over a long period of time. To address this issue, they introduced PrivLocAd, a system that uses location profiling to generate obfuscated locations, thereby protecting user privacy against multi-platform adversarial attacks. Zhao et al. [25] introduced a novel privacy concept called vector-indistinguishability, which builds on Geo-Ind to provide a privacy guarantee for location-dependent relations. They have developed four mechanisms to achieve vector-indistinguishability, using both Laplace and uniform distributions. Mendes et al. [26] utilized user velocity and report frequency to measure the correlation between locations. They extended Geo-Ind to enhance privacy preservation in continuous online reporting scenarios. Specifically, they introduced a velocity-aware Geo-Ind that automatically balances privacy and utility based on the user’s velocity and frequency of location reports.

EGeoIndis [27] is a vehicle location privacy protection framework designed for traffic density estimation. It leverages Geo-Ind to protect vehicle location privacy during the traffic density estimation process. In [28], the authors proposed a deep learning-based method to estimate the density distribution using location data collected under Geo-Ind. Chen et al. [29] develop a method to create a COVID-19 vulnerability map using the density distribution of volunteer participants with COVID-19 symptoms. They exploit Geo-Ind to collect participants’ locations in a privacy-preserving manner to ensure the confidentiality of sensitive health information. Fathalizadeh et al. [30] present a framework for implementing Geo-Ind for indoor environments. The proposed framework considers two scenarios for applying Geo-Ind, reporting an obfuscated point to the location service provider that satisfies DP.

The proposed approach in this paper leverages Geo-Ind, which is a representative model in the domain of privacy-preserving location data collection. However, the proposed approach differs from other Geo-Ind-based methods in several ways. First, existing methods typically rely on the availability of historical data to infer user distribution, which presents significant challenges. Historical data may not always be accessible or current, resulting in inaccurate inferences of user distribution. Second, most existing methods use static grid structures, resulting in a fixed representation that does not account for the dynamic movement of users. In contrast, the proposed method addresses these limitations by adaptively adjusting grids in real-time during data collection. This dynamic adjustment captures the current user distribution without relying on historical data, thereby enhancing the utility of the collected data.

3. Background and Problem Statement

In this section, we provide the necessary background for this paper and state the problem addressed in this paper.

3.1. Background

Recently, DP has emerged as the de facto standard for privacy-preserving data processing. DP is based on a formal mathematical definition that provides a probabilistic privacy guarantee against attackers with arbitrary background knowledge [9]. It ensures that an attacker cannot determine with high confidence whether a given individual is included in the disseminated data. DP is formally defined as [9,10]:

Definition 1. (ϵ-DP) A randomized algorithm $A$ satisfies ϵ-DP if and only if, for (1) any two neighboring datasets, $D_{1}$ and $D_{2}$, and (2) any output $O$ of $A$, the following is satisfied:

$\Pr[A(D_{1}) = O] \leq e^{ϵ} \times \Pr[A(D_{2}) = O].$

(1)

Two datasets, $D_{1}$ and $D_{2}$, are considered neighboring if they differ by only one record. The above definition indicates that, for any output of $A$, an adversary with any amount of background knowledge cannot reliably determine whether $D_{1}$ or $D_{2}$ was the input to $A$. The parameter $ϵ$, known as the privacy budget, regulates the privacy level: smaller $ϵ$ values provide stronger privacy protection but add more noise to the result, whereas larger $ϵ$ values offer weaker privacy protection with less noise.
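As a concrete illustration of Definition 1, the following minimal sketch checks the inequality in Equation (1) for binary randomized response, a textbook ϵ-DP mechanism that is not part of the method proposed in this paper. It assumes that each dataset holds a single bit, so that the two possible datasets are neighbors; the function names are hypothetical.

```python
import math


def randomized_response(epsilon: float) -> dict:
    """Return Pr[A(D) = O] for binary randomized response.

    Assumption: a dataset D is a single bit, reported truthfully with
    probability e^eps / (1 + e^eps) and flipped otherwise.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {
        0: {0: p_truth, 1: 1.0 - p_truth},
        1: {0: 1.0 - p_truth, 1: p_truth},
    }


def satisfies_dp(probs: dict, epsilon: float, tol: float = 1e-12) -> bool:
    """Check Pr[A(D1) = O] <= e^eps * Pr[A(D2) = O] for all D1, D2, O."""
    bound = math.exp(epsilon)
    return all(
        probs[d1][o] <= bound * probs[d2][o] + tol
        for d1 in probs
        for d2 in probs
        for o in probs[d1]
    )


if __name__ == "__main__":
    mechanism = randomized_response(epsilon=0.5)
    # Checked against its own budget, then against a stricter one.
    print(satisfies_dp(mechanism, epsilon=0.5))
    print(satisfies_dp(mechanism, epsilon=0.1))
```

A mechanism calibrated for a given budget passes the check at that budget but fails it at a smaller one, which mirrors the trade-off described above: a smaller $ϵ$ demands output distributions that are closer together.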

There have been several proposals to apply the concept of DP to the protection of location data. In this paper, we use Geo-Ind, a concept based on the well-established DP framework and recognized as the standard privacy definition for protecting location data in location-based services [11,17,18]. In addition to location data, Geo-Ind is also used to collect other types of data, such as text microdata, in a DP-compliant manner [31,32]. Geo-Ind is formally defined as follows:

Definition 2. (ϵ-Geo-Ind) Consider $X$ as the set of possible user locations and $Y$ as the set of reported locations, which are typically assumed to be equal. Let $K$ be a randomized mechanism that generates a perturbed location from a user’s true location, and let $K(x)(y)$ denote the probability that $K$ reports $y$ when the true location is $x$. The mechanism $K$ satisfies ϵ-Geo-Ind if and only if the following condition holds for (1) all $x_{1}, x_{2} \in X$ and (2) any output location $y \in Y$:

$K(x_{1})(y) \leq e^{ϵ \cdot d(x_{1}, x_{2})} \times K(x_{2})(y),$

(2)

where $d(x_{1}, x_{2})$ corresponds to the distance between $x_{1}$ and $x_{2}$.
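The condition in Equation (2) can be verified exhaustively when $X = Y$ is a small finite set. The sketch below is one way to do so, assuming Euclidean distance and a mechanism given as a row-stochastic table; the two example tables and all names are hypothetical.

```python
import math


def satisfies_geo_ind(K: list, locations: list, epsilon: float,
                      tol: float = 1e-12) -> bool:
    """Check K(x1)(y) <= exp(eps * d(x1, x2)) * K(x2)(y) for all x1, x2, y.

    K[i][j] is the probability of reporting locations[j] when the true
    location is locations[i]. Distance is assumed to be Euclidean.
    """
    n = len(locations)
    for a in range(n):
        for b in range(n):
            bound = math.exp(epsilon * math.dist(locations[a], locations[b]))
            for y in range(n):
                if K[a][y] > bound * K[b][y] + tol:
                    return False
    return True


if __name__ == "__main__":
    locations = [(0.0, 0.0), (1.0, 0.0)]   # two locations at distance 1
    mild = [[0.6, 0.4], [0.4, 0.6]]        # probability ratio 1.5
    sharp = [[0.9, 0.1], [0.1, 0.9]]       # probability ratio 9
    print(satisfies_geo_ind(mild, locations, epsilon=1.0))
    print(satisfies_geo_ind(sharp, locations, epsilon=1.0))
```

Unlike Definition 1, the bound here grows with the distance between the two true locations, so nearby locations must be nearly indistinguishable while distant ones may be told apart more easily.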

There are two primary methods for implementing Geo-Ind: the Laplace mechanism and the matrix-based mechanism. It is well known that the matrix-based mechanism is more effective than the Laplace method, given prior information about the distribution of users which can be obtained from available historical data [11]. This increased effectiveness is due to the fact that the matrix-based mechanism incorporates prior distribution information when perturbing the true locations of users. As a result, the distribution of perturbed locations collected using the matrix-based mechanism more closely approximates the true distribution than the distribution collected using the Laplace mechanism.

In the matrix-based mechanism, the space is first partitioned into a set of grids, and then the data collection server computes an obfuscation matrix, $M$, over these grids that satisfies $ϵ$-Geo-Ind. This matrix is then distributed to users. Subsequently, users perturb their location data according to the probabilities embedded in $M$ and report the perturbed location to the server instead of their true data. Several approaches for computing an obfuscation matrix that satisfies $ϵ$-Geo-Ind have been proposed in the literature [11,13,14,32,33]. We note, however, that the method proposed in this paper is general enough to be applied to any matrix-based mechanism.

3.2. Problem Statement

Let $U = \{u_{1}, u_{2}, \dots, u_{k}\}$ be a set of users who agree to provide their location information to the server. However, users do not fully trust the server; thus, instead of providing true location information, each user provides perturbed (and thus privacy-preserved) location information that satisfies $ϵ$-Geo-Ind. Let us assume that the entire area is divided into disjoint grids, and let $G$ be the set of these grids. Each user’s location is then represented by the grid in $G$ to which his/her true location belongs.

The problem addressed in this paper is to collect high-utility location data while protecting users’ location privacy with $ϵ$-Geo-Ind. Existing methods either use static grid partitioning that does not adapt to real-time changes in user distribution or rely on pre-existing user distribution data, which may not be available or accurate in real-time scenarios. In order to address these gaps, we propose a novel adaptive grid partitioning method that dynamically adjusts the grid during the process of location data collection. In particular, the proposed method directly captures user distribution during data collection, eliminating the need for pre-existing distribution information.

4. Proposed Method

Figure 1 provides an overview of the proposed location data collection scheme using $ϵ$-Geo-Ind, which consists of the following four steps:

Collection of perturbed location data from sampled users: The server first computes the obfuscation matrix, M, over uniformly partitioned grids and distributes it to the sampled users, who then send their perturbed locations back to the server.
Estimation of the distribution of users: The server estimates the distribution of users based on the perturbed location data collected from the sampled users.
Computation of adaptively partitioned grids: The server uses the estimated distribution of users to compute adaptively partitioned grids.
Collection of perturbed location data from remaining users using adaptively partitioned grids: A new obfuscation matrix is computed using the adaptively partitioned grids. This new obfuscation matrix is then used to collect location data from the remaining users.

In the following subsections, we explain each step in detail.

4.1. Collecting Perturbed Location Data from Sampled Users

Let us assume that the entire area is uniformly partitioned into $m$ grids, $G = \{g_{1}, g_{2}, \dots, g_{m}\}$. The data collection server computes an obfuscation matrix, $M$, over $G$. There are various approaches for computing an $M$ that satisfies $ϵ$-Geo-Ind. In this paper, we use the method proposed in [13], where the obfuscation matrix is defined as an $m \times m$ matrix. Each element $M[i, j]$, which represents the probability that a perturbed location $g_{j}$ is randomly generated from the true location $g_{i}$, is defined as follows:

$M[i, j] = \frac{e^{-\frac{ϵ}{2} \cdot d(g_{i}, g_{j})}}{\sum_{g_{k} \in G} e^{-\frac{ϵ}{2} \cdot d(g_{i}, g_{k})}}$

(3)

Once the obfuscation matrix $M$ is computed, it is distributed to the sampled users, who then perturb their true locations according to the probabilities encoded in $M$ and report the perturbed locations to the server.
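The construction in Equation (3) and the user-side perturbation can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: it assumes that $d(g_{i}, g_{j})$ is the Euclidean distance between grid centers measured in units of the cell width, and the helper names are hypothetical.

```python
import math
import random


def grid_centers(rows: int, cols: int, cell: float = 1.0) -> list:
    """Centers of a uniform rows x cols grid, listed row by row."""
    return [((c + 0.5) * cell, (r + 0.5) * cell)
            for r in range(rows) for c in range(cols)]


def obfuscation_matrix(centers: list, epsilon: float) -> list:
    """Build M per Equation (3); M[i][j] = Pr[report g_j | true g_i]."""
    matrix = []
    for ci in centers:
        # Unnormalized weights decay exponentially with distance from g_i.
        weights = [math.exp(-0.5 * epsilon * math.dist(ci, cj))
                   for cj in centers]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def perturb(true_index: int, matrix: list, rng: random.Random) -> int:
    """User side: draw a reported grid index from row `true_index` of M."""
    return rng.choices(range(len(matrix)), weights=matrix[true_index], k=1)[0]


if __name__ == "__main__":
    centers = grid_centers(rows=3, cols=3)
    M = obfuscation_matrix(centers, epsilon=1.0)
    rng = random.Random(0)
    # Ten reports from a user whose true location is the middle grid.
    print([perturb(4, M, rng) for _ in range(10)])
```

Each row of the matrix is a probability distribution whose mode is the true grid, and the factor $ϵ/2$ in the exponent leaves room for the row-dependent normalizing constant, which can itself differ between two rows by a factor of at most $e^{\frac{ϵ}{2} \cdot d}$.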

4.2. Estimating Probability Distribution

After collecting perturbed location data from sampled users, the next step is to estimate the distribution of users based on these data. For each grid $g_{i} \in G$, let $P(g_{i})$ be the probability that a user is located at $g_{i}$. In this subsection, we estimate $P(g_{i})$ for all $g_{i} \in G$ from the sampled perturbed location data.

Let $g_{j}^{'} \in G$ be the perturbed data that the server receives from a sampled user. For the sake of explanation, we will use $g_{j}^{'}$ to denote the perturbed location and $g_{j}$ to denote the true location. The probability that this perturbed location is randomly generated from the true location $g_{i} \in G$ can be computed as follows:

$P(g_{i} | g_{j}^{'}) = \frac{P(g_{i}) P(g_{j}^{'} | g_{i})}{P(g_{j}^{'})} = \frac{P(g_{i}) P(g_{j}^{'} | g_{i})}{\sum_{g_{k} \in G} P(g_{k}) P(g_{j}^{'} | g_{k})} = \frac{P(g_{i}) M[i, j]}{\sum_{g_{k} \in G} P(g_{k}) M[k, j]}$

(4)

Note that $P(g_{j}^{'} | g_{i}) = M[i, j]$ by the definition of the obfuscation matrix. Since the prior probability, $P(g_{i})$, is unknown and cannot be computed directly from the above equation, we need to approximate it. There are several methods to approximate the prior probability, including variational inference [34], Markov chain Monte Carlo [35], and expectation propagation [36]. In this paper, we use the Expectation-Maximization (EM) algorithm [37] to estimate $P(g_{i})$. The EM algorithm is particularly effective when the likelihood is well defined, which in our case corresponds to the obfuscation matrix, $M$.

Let $DB$ be a bag of perturbed location data from sampled users. The EM process for estimating $P(g_{i})$ for all $g_{i} \in G$ from $DB$ is as follows.

Initialization: The parameter (i.e., prior probability) is initialized as follows:

$P^{(0)} (g_{1}) = P^{(0)} (g_{2}) = \dots = P^{(0)} (g_{m}) = \frac{1}{m}$

(5)
E-step: The posterior probability is calculated based on the current parameters as follows:

$P (g_{i} | g_{j}^{'}) = \frac{P^{(t)} (g_{i}) M [i, j]}{\sum_{g_{k} \in G} P^{(t)} (g_{k}) M [k, j]}$

(6)
M-step: The parameter is updated using the current posterior probabilities calculated in the previous E-step:

$P^{(t + 1)} (g_{i}) = \frac{\sum_{g_{j}^{'} \in D B} P (g_{i} | g_{j}^{'})}{| DB |}$

(7)

Here, $|DB|$ represents the number of data items in $DB$. After updating the prior probabilities, we perform a normalization step to ensure that the sum of all prior probabilities equals 1, as follows:

$P^{(t + 1)} (g_{i}) = \frac{P^{(t + 1)} (g_{i})}{\sum_{g_{k} \in G} P^{(t + 1)} (g_{k})}$

(8)

The above E-step and M-step are iterated until the parameter converges to a stable value or the number of iterations reaches a predefined threshold.
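The EM iteration in Equations (5)–(8) can be sketched in a few lines. The sketch below assumes that reports are given as grid indices, that every entry of the obfuscation matrix is positive (as is the case for Equation (3)), and that convergence is declared when the largest change in any $P(g_{i})$ falls below a tolerance; the function name, tolerance, and iteration cap are hypothetical choices.

```python
def estimate_prior(reports: list, matrix: list,
                   max_iter: int = 500, tol: float = 1e-10) -> list:
    """Estimate P(g_i) from perturbed grid indices via EM (Eqs. (5)-(8)).

    reports: perturbed grid indices j received from the sampled users (DB).
    matrix:  obfuscation matrix with matrix[i][j] = P(g'_j | g_i).
    """
    m = len(matrix)
    counts = [0] * m                      # how often each g'_j was reported
    for j in reports:
        counts[j] += 1
    prior = [1.0 / m] * m                 # Equation (5): uniform start

    for _ in range(max_iter):
        # Denominator of Equation (6) for every possible report g'_j.
        evidence = [sum(prior[k] * matrix[k][j] for k in range(m))
                    for j in range(m)]
        # E-step and M-step together: average the posteriors over DB,
        # grouping identical reports by their count (Equations (6)-(7)).
        updated = [
            sum(counts[j] * prior[i] * matrix[i][j] / evidence[j]
                for j in range(m) if counts[j]) / len(reports)
            for i in range(m)
        ]
        total = sum(updated)              # Equation (8): renormalize
        updated = [p / total for p in updated]
        change = max(abs(a - b) for a, b in zip(updated, prior))
        prior = updated
        if change < tol:                  # assumed stopping rule
            break
    return prior


if __name__ == "__main__":
    M = [[0.8, 0.2], [0.2, 0.8]]          # toy two-grid obfuscation matrix
    DB = [0] * 65 + [1] * 35              # 65 reports of g_1, 35 of g_2
    print(estimate_prior(DB, M))
```

In this toy case, the observed report frequencies (0.65, 0.35) are explained by a prior of roughly (0.75, 0.25), since $0.8 \cdot 0.75 + 0.2 \cdot 0.25 = 0.65$; the estimate thus undoes the flattening introduced by the perturbation.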

4.3. Computing Adaptively Partitioned Grids

In this subsection, we introduce a method that adaptively partitions grids based on the probability distribution (i.e., $P(g_{i})$ for all $g_{i} \in G$) computed in the previous phase. Initially, the proposed method treats all grids in $G$ as a single grid cluster, and then iteratively partitions this cluster in a top-down manner using a greedy algorithm.

Let $GC_{v} = \{C_{1}, C_{2}, \dots, C_{|GC_{v}|}\}$ represent the set of grid clusters after the $v$-th partition. Assume that, for each $C_{k} \in GC_{v}$, $\mathit{grid}(C_{k}) \subset G$ denotes the set of grids that belong to the cluster $C_{k}$. Let $n$ be the total number of users from whom the server collects location data. Then, the expected number of users located in grid $g_{i}$ is computed as $\mathit{Cnt}(g_{i}) = n \cdot P(g_{i})$.

Furthermore, let $M_{GC_{v}}$ be a $|GC_{v}| \times |GC_{v}|$ obfuscation matrix, satisfying $ϵ$-Geo-Ind, constructed over the elements of $GC_{v}$ using Equation (3). The distance between any two clusters, which is needed to compute $M_{GC_{v}}$, is determined using the centroids of the grids belonging to each cluster. Then, assuming that users perturb their locations according to the probabilities encoded in $M_{GC_{v}}$, the expected number of perturbed location data corresponding to grids belonging to $C_{k}$ that the server receives from $n$ users is computed as follows:

$\mathit{Cnt}_{pert}(C_{k}) = \sum_{C_{j} \in GC_{v}} \sum_{g_{i} \in \mathit{grid}(C_{j})} \mathit{Cnt}(g_{i}) \times M_{GC_{v}}[j, k]$

(9)

Let $\mathit{Clus}()$ be a function that takes a grid as input and outputs the cluster to which that grid belongs. Assuming that users are evenly distributed across the grids within each cluster, the expected error due to Geo-Ind with $GC_{v}$ is computed as follows:

$\mathit{Err}_{GC_{v}} = \sum_{g_{i} \in G} \left| \mathit{Cnt}(g_{i}) - \frac{\mathit{Cnt}_{pert}(\mathit{Clus}(g_{i}))}{\mathit{size}(\mathit{Clus}(g_{i}))} \right|$

(10)

Here, $\mathit{size}(C_{k})$ denotes the number of grids that belong to the cluster $C_{k}$.

Let us assume that in the next, $(v + 1)$-th partition, $C_{h} \in GC_{v}$ is selected to be divided into subclusters. In this paper, we partition $C_{h}$ into four equal-sized subclusters by dividing the associated region horizontally and vertically. Let $GC_{v + 1}^{h}$ represent the set of grid clusters obtained by subdividing $C_{h}$. Using the method described above, we can similarly estimate the expected error, $\mathit{Err}_{GC_{v + 1}^{h}}$, caused by Geo-Ind with $GC_{v + 1}^{h}$. The set of grid clusters for the $(v + 1)$-th iteration is then determined as follows:

$GC_{v + 1} = GC_{v + 1}^{h^{*}}, \quad h^{*} = \underset{1 \leq h \leq |GC_{v}|}{\operatorname{argmax}} \left( \mathit{Err}_{GC_{v}} - \mathit{Err}_{GC_{v + 1}^{h}} \right)$

(11)

Here, the difference $\mathit{Err}_{GC_{v}} - \mathit{Err}_{GC_{v + 1}^{h}}$ is the error reduction gain obtained by subdividing $C_{h}$. In other words, the grid cluster $C_{h}$ that provides the maximum error reduction gain is selected for partitioning in the next iteration.

Algorithm 1 presents pseudocode for adaptively partitioning grids using the probability distribution of users. The algorithm takes as input a set of grids, $G$, and the probability distribution, $P(g_{1}), \dots, P(g_{m})$, and outputs a set of grid clusters, $GC$. In line 1, $GC_{cur}$ is initialized to contain a single cluster that includes all grids in $G$. Then, in line 3, $\mathit{Err}_{GC_{cur}}$ is computed using $GC_{cur}$. Between lines 4 and 12, the algorithm identifies the cluster $C_{h} \in GC_{cur}$ that yields the maximum error reduction gain. This process is repeated as long as the error reduction gain is greater than 0. Finally, the algorithm returns $GC_{cur}$.

Algorithm 1: Pseudo-code for adaptive grid partition
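Since the body of Algorithm 1 is not reproduced in the text, the following Python sketch reconstructs the greedy procedure from the prose description only and should not be read as the author's exact pseudocode. It assumes that clusters are rectangular blocks `(r0, r1, c0, c1)` of cells with half-open row and column ranges, that distances are Euclidean distances between block centroids in cell units, and that a block is split at its midpoints; all function names are hypothetical.

```python
import math


def cluster_matrix(clusters: list, epsilon: float) -> list:
    """Equation (3) over clusters, using block centroids for distances."""
    cents = [((c0 + c1) / 2.0, (r0 + r1) / 2.0)
             for r0, r1, c0, c1 in clusters]
    rows = []
    for a in cents:
        w = [math.exp(-0.5 * epsilon * math.dist(a, b)) for b in cents]
        total = sum(w)
        rows.append([x / total for x in w])
    return rows


def expected_error(clusters: list, cnt: list, epsilon: float) -> float:
    """Expected error of a clustering, following Equations (9) and (10)."""
    M = cluster_matrix(clusters, epsilon)
    cells = [[(r, c) for r in range(r0, r1) for c in range(c0, c1)]
             for r0, r1, c0, c1 in clusters]
    mass = [sum(cnt[r][c] for r, c in cs) for cs in cells]
    err = 0.0
    for k, cs in enumerate(cells):
        # Equation (9): expected number of reports that land in cluster k.
        pert = sum(mass[j] * M[j][k] for j in range(len(clusters)))
        # Equation (10): reports are spread evenly over the cluster's cells.
        err += sum(abs(cnt[r][c] - pert / len(cs)) for r, c in cs)
    return err


def split(cluster: tuple) -> list:
    """Split a block into four quadrants; return [] if it is too small."""
    r0, r1, c0, c1 = cluster
    if r1 - r0 < 2 or c1 - c0 < 2:
        return []
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    return [(r0, rm, c0, cm), (r0, rm, cm, c1),
            (rm, r1, c0, cm), (rm, r1, cm, c1)]


def adaptive_partition(cnt: list, epsilon: float) -> list:
    """Greedy top-down partitioning: apply the best split while gain > 0."""
    current = [(0, len(cnt), 0, len(cnt[0]))]     # one cluster with all grids
    err_cur = expected_error(current, cnt, epsilon)
    while True:
        best_gain, best_set, best_err = 0.0, None, err_cur
        for h, cluster in enumerate(current):
            parts = split(cluster)
            if not parts:
                continue
            candidate = current[:h] + current[h + 1:] + parts
            err_new = expected_error(candidate, cnt, epsilon)
            gain = err_cur - err_new              # error reduction gain
            if gain > best_gain:
                best_gain, best_set, best_err = gain, candidate, err_new
        if best_set is None:                      # no split reduces the error
            return current
        current, err_cur = best_set, best_err


if __name__ == "__main__":
    # Expected counts Cnt(g_i) on a 4 x 4 grid with one dense corner cell.
    cnt = [[100.0, 0.0, 0.0, 0.0]] + [[0.0] * 4 for _ in range(3)]
    print(adaptive_partition(cnt, epsilon=1.0))
```

Re-evaluating the full error for every candidate split keeps the sketch short but costs one matrix construction per candidate; an implementation intended for 10,000 grids would likely cache cluster masses and update the error incrementally.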

The proposed adaptive grid partitioning method in this subsection relies on the probability distribution of users estimated from sampled data. Thus, as with other sampling-based methods, there is a possibility that the sampled data may be biased. Such biases can result in a non-representative user distribution being used for grid partitioning, which can lead to ineffective partitioning. This, in turn, can adversely affect the overall utility of the collected location data, because the adaptive grids may not accurately reflect the true user density. In order to mitigate potential biases and inaccuracies in capturing real-time user distribution, spatial variations can be considered in the sampling process. One effective method is to use stratified sampling [38], which involves dividing the entire region into disjoint subregions. By ensuring that each subregion is proportionally represented in the sample, stratified sampling helps reduce sampling bias and provides a more accurate estimate of the user distribution.

4.4. Collecting Perturbed Location Data from Remaining Users Using Adaptively Partitioned Grids

A new obfuscation matrix, $M$, is computed using the adaptively partitioned grid set, $GC$, computed in the previous phase. The adaptively partitioned grids allow for a more precise and relevant obfuscation process by capturing the dynamic nature of the user distribution more effectively than static grids. Once the new obfuscation matrix is computed, it is distributed to the remaining users. These users then use the updated matrix to perturb their true location data, ensuring that their privacy is preserved according to the principles of Geo-Ind. The obfuscated location data are then sent back to the server, where they are integrated with the previously collected data from the sampled users.

5. Experiments

In this section, we first describe the experimental setup. Then, we discuss the experimental results.

5.1. Experimental Setup

For our experiments, we used the Porto taxi trajectories dataset [39], which consists of taxi trajectories, each composed of a series of GPS coordinates, recorded from 442 taxis operating in the city of Porto, Portugal. We randomly extracted 50,000 location data points from these trajectories, of which 10,000 were treated as the location data of the sampled users. We varied the number of grids from 400 (i.e., 20-by-20 grids) to 10,000 (i.e., 100-by-100 grids). Results are reported for the following alternatives: the existing non-adaptive grid ($NG$) method in [13] and the adaptive grid ($AG$) method introduced in this paper. We use the following metrics for evaluation:

Data-level metric measures the similarity between the true location dataset and the perturbed location dataset collected under $ϵ$-Geo-Ind. For the data-level evaluation, we use both the average count error (ACE) and the density error. The average count error quantifies the difference between the actual number of users, $\mathit{num}_{true}(g_{i})$, and the number derived from the perturbed dataset, $\mathit{num}_{pert}(g_{i})$, for each grid. It is calculated as

$\text{Average count error} = \sum_{1 \leq i \leq m} \frac{|\mathit{num}_{true}(g_{i}) - \mathit{num}_{pert}(g_{i})|}{\max(\mathit{num}_{true}(g_{i}), 1)}$

(12)

The density error measures the difference between the actual density distribution of users and the perturbed version computed from the datasets collected under $ϵ$-Geo-Ind. This error is measured as

$\text{Density error} = JSD(D(OD), D(PD))$

(13)

Here, $JSD()$ represents the Jensen–Shannon divergence between the distribution computed from the original location dataset, $D(OD)$, and the distribution computed from the perturbed location dataset, $D(PD)$.
Application-level metric evaluates the utility of the collected data from the perspective of the applications that use it. We use the range query error for this metric, a widely recognized measure for evaluating the effectiveness of location data [14]. In the experiment, we generate a range query, $Q_{R}$, with a random region $R$, and compare the number of results from the original location dataset, $Q_{R}(OD)$, with the number from the perturbed location dataset, $Q_{R}(PD)$. It is calculated as

$\text{Range query error} = \frac{|Q_{R}(OD) - Q_{R}(PD)|}{\max(Q_{R}(OD), 1)}$

(14)

In the experiments, we generated 200 range queries and reported the average range query error.
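The three metrics in Equations (12)–(14) can be computed as in the sketch below. It follows the equations as printed and makes two assumptions that the text does not state: the Jensen–Shannon divergence uses base-2 logarithms, and a range query region is an axis-aligned rectangle `(x0, y0, x1, y1)` with inclusive bounds. The function names are hypothetical.

```python
import math


def average_count_error(true_counts: list, pert_counts: list) -> float:
    """Equation (12) as printed: sum of per-grid relative count errors."""
    return sum(abs(t - p) / max(t, 1)
               for t, p in zip(true_counts, pert_counts))


def density(counts: list) -> list:
    """Turn per-grid counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def jsd(p: list, q: list) -> float:
    """Jensen-Shannon divergence (Equation (13)), base-2 logs assumed."""
    def kl(a: list, b: list) -> float:
        return sum(x * math.log(x / y, 2) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def range_query_error(true_points: list, pert_points: list,
                      region: tuple) -> float:
    """Equation (14) for one rectangular region (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region

    def count(points: list) -> int:
        return sum(1 for x, y in points if x0 <= x <= x1 and y0 <= y <= y1)

    n_true, n_pert = count(true_points), count(pert_points)
    return abs(n_true - n_pert) / max(n_true, 1)


if __name__ == "__main__":
    true_counts, pert_counts = [10, 0, 5], [8, 2, 5]
    print(average_count_error(true_counts, pert_counts))
    print(jsd(density(true_counts), density(pert_counts)))
    print(range_query_error([(1, 1), (2, 2), (5, 5)],
                            [(1, 1), (6, 6), (5, 5)], (0, 0, 3, 3)))
```

The `max(…, 1)` term in Equations (12) and (14) guards against division by zero for empty grids and empty query results, at the cost of treating them as if they contained one user.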

In the experiments, various privacy budgets ($ϵ$) ranging from 0.2 to 2.0 were used. A privacy budget of less than 2 is typically considered acceptable in practical applications [14]. We implemented both $NG$ and $AG$ using Python 3.8, and all experiments were conducted in an environment equipped with Intel Xeon 5220R CPUs and 64 MB of memory.

5.2. Results

In this subsection, we first present the results of evaluating the data level, and then present the results of evaluating the application level.

5.2.1. Data-Level Evaluation

Figure 2 shows the effect of the privacy budget on both the average count error and the density error. In this experiment, the privacy budget varies from 0.2 to 2.0, while the number of grids is fixed at 400. As $ϵ$ decreases, both errors increase. This is because a smaller $ϵ$ increases the degree of perturbation caused by Geo-Ind, leading to an increased error, which is commonly observed with DP-based methods. As shown in the figures, the proposed method (AG) consistently outperforms the existing method (NG) across all privacy budget levels. Furthermore, the performance gap between the two methods widens as the privacy budget decreases, that is, as the level of privacy increases. This shows that the proposed method is more advantageous for applications that require a high level of privacy, which is typical for most applications that handle location data.
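
To make the link between $ϵ$ and the perturbation magnitude concrete, the sketch below assumes the planar Laplace mechanism, a common way to achieve Geo-Ind in continuous space. The mechanism used in this paper operates on grids and may differ, so this is an illustration of the general trend only; the function names are hypothetical. Under planar Laplace, the displacement radius follows a Gamma distribution with shape 2 and scale $1/ϵ$, so the expected displacement is $2/ϵ$.

```python
import math
import random
from typing import Tuple


def planar_laplace(x: float, y: float, epsilon: float,
                   rng: random.Random) -> Tuple[float, float]:
    """Perturb (x, y) with planar Laplace noise (assumed mechanism, see lead-in)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)      # direction: uniform angle
    radius = rng.gammavariate(2.0, 1.0 / epsilon)  # distance: Gamma(shape=2, scale=1/eps)
    return x + radius * math.cos(theta), y + radius * math.sin(theta)


def mean_displacement(epsilon: float, n: int = 5000, seed: int = 0) -> float:
    """Empirical mean distance between true and reported locations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        px, py = planar_laplace(0.0, 0.0, epsilon, rng)
        total += math.hypot(px, py)
    return total / n


# Smaller privacy budgets should yield larger average displacement (about 2 / eps).
displacement_low_eps = mean_displacement(0.2)
displacement_high_eps = mean_displacement(2.0)
```

With distances expressed in the same unit as $1/ϵ$, moving from $ϵ = 2.0$ to $ϵ = 0.2$ increases the expected displacement roughly tenfold, which is consistent with the error growth reported for small budgets.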

Figure 3 shows the effect of the number of grids on the average count error and the density error. In this experiment, the number of grids varies from 400 to 10,000, while the privacy budget is fixed at 0.6. The proposed method consistently outperforms the existing method across all numbers of grids. More specifically, the average count error decreases as the number of grids increases. Because the total number of users is fixed, the number of users per grid decreases as the number of grids increases. This, in turn, reduces the average count error, which is based on the absolute difference between the user counts obtained from the actual location data and from the perturbed location data. On the other hand, the density error, which measures the Jensen–Shannon divergence between the distributions of the original and perturbed location datasets, increases with the number of grids. This is because the Jensen–Shannon divergence measures the relative difference between distributions and is therefore not affected by the number of users per grid. As the grid becomes finer, the perturbations in the data have a more pronounced effect on the distribution, resulting in a higher divergence between the original and perturbed datasets.

The results shown in Figure 2 and Figure 3 confirm that the proposed method enables the collection of location data that are more similar to the original data under Geo-Ind than the existing method. These results highlight the significant advantages of our approach in privacy-preserving location data collection. By dynamically adjusting grids based on real-time user distribution under Geo-Ind, we achieve higher data accuracy and utility.

5.2.2. Application-Level Evaluation

Figure 4 illustrates the effect of the privacy budget and the number of grids on the range query error. In Figure 4a, the number of grids is fixed at 400, while in Figure 4b, the privacy budget is fixed at 0.6. As shown in the figure, as $ϵ$ decreases, the error associated with the existing method increases dramatically, while the error for the proposed method increases only marginally. This occurs because the proposed method is able to collect location datasets that are closer to the original datasets, as verified by the data-level evaluation. The robustness of our approach under varying privacy budgets highlights its effectiveness in balancing privacy and accuracy. As $ϵ$ becomes smaller, indicating stronger privacy guarantees, the proposed method still manages to preserve the utility of the data, making them more reliable for applications that require precise location information.

Moreover, the proposed method consistently outperforms the existing method across all grid sizes. This verifies that our proposed method is robust regardless of the number of grids. The ability to maintain low query error rates across different grid configurations demonstrates the adaptability and effectiveness of our approach. This robustness is crucial for practical applications that require different granularity in representing user location (i.e., number of grids) depending on the application requirements.

These experimental results indicate that the proposed method can be used for a wide range of location-based services and applications requiring different privacy levels and granularity in representing user location.

5.2.3. Evaluation of Network Variability and Grid Adaptation Effects

All experimental results in the previous subsections were obtained under the assumption that the network conditions are stable. However, in real-world scenarios, network conditions can be highly variable and unpredictable. The proposed method adaptively partitions the grid based on the sampled location data collected during the location data collection process. Unstable network conditions, such as high network latency, packet loss, and low bandwidth, can delay the timely collection of these sampled data. As a result, some sampled data may not be available for the computation of adaptive grid partitioning, which may lead to less accurate grid partitioning.

In this subsection, to address the challenges of unstable networks, we evaluate the effectiveness of the proposed method in real network scenarios. The experiment shown in Table 1 considers a scenario where some of the sampled location data needed to estimate the user distribution in Section 4.2 (which is then used to adaptively compute the grid partitions in Section 4.3) is either lost or not received in time due to unstable network conditions such as high latency, packet loss, and low bandwidth. In this experiment, the loss rate of sampled location data varies from 1% to 20%, covering a range from typical to severe network conditions. In the experiment, the number of grids is set to 400 and the privacy budget is set to 0.6. As shown in Table 1, even as the loss rate of sampled location data increases from 1% to 20%, both errors remain stable with only a very small increase. In particular, even under severe network conditions with a 20% loss rate, the proposed adaptive grid (AG) method significantly outperforms the non-adaptive grid (NG) method. These results verify the robustness and effectiveness of the proposed method in maintaining data utility under real network scenarios with varying levels of data loss due to unstable network conditions.
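
One simple way to emulate such a loss scenario is sketched below. It assumes that each sampled report is lost independently with probability equal to the loss rate, and it uses a plain empirical histogram as a stand-in for the distribution estimation of Section 4.2, which is not reproduced here. The paper does not describe how loss was simulated, so both choices and all names are illustrative assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence


def drop_samples(cell_ids: Sequence[int], loss_rate: float,
                 rng: random.Random) -> List[int]:
    """Keep each sampled report with probability 1 - loss_rate (independent loss)."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be in [0, 1]")
    return [c for c in cell_ids if rng.random() >= loss_rate]


def empirical_distribution(cell_ids: Sequence[int], num_grids: int) -> List[float]:
    """Histogram over grid cells; falls back to uniform if nothing was received."""
    if not cell_ids:
        return [1.0 / num_grids] * num_grids
    counts = Counter(cell_ids)
    return [counts[i] / len(cell_ids) for i in range(num_grids)]


# Hypothetical setup: 10,000 sampled reports over 400 grids, 20% loss.
rng = random.Random(42)
samples = [rng.randrange(400) for _ in range(10_000)]
received = drop_samples(samples, loss_rate=0.20, rng=rng)
dist = empirical_distribution(received, num_grids=400)
```

Under independent loss, the surviving samples remain an unbiased (if smaller) sample of the same distribution, which is one plausible reason the errors in Table 1 grow only slightly with the loss rate.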

The proposed method adaptively adjusts the grid during the location data collection process, which introduces additional latency due to the computational overhead of computing the adaptive grid. Hence, we experimentally evaluate the latency introduced by the proposed method. Table 2 shows the latency results caused by the adaptive grid computation of the proposed method. In this experiment, the number of grids varies from 400 to 10,000, while the privacy budget is fixed at 0.6. As shown in the table, the latency increases as the grid size increases. This is because as the number of grids increases, the number of iterations required to adaptively partition the grids also increases. As a result, larger grids require more computational resources, which leads to higher latency.

We note that although the computation of the adaptive grid introduces additional latency, as shown in Table 2, it is a one-time process within the overall location data collection procedure. Therefore, the impact of this overhead on the overall processing time of the location data collection is limited. Furthermore, this additional latency can be mitigated by parallel processing techniques. In particular, distributed computing frameworks [40,41] and GPU acceleration can significantly reduce the computation time. By distributing the workload across multiple processors, these approaches can improve the efficiency of the adaptive grid partitioning process and help keep the collected data available for analysis in real-time applications.

6. Conclusions and Future Work

Recently, there has been an increasing demand for the collection and sharing of location data. Given the sensitive nature of user location information, considerable efforts have been made to ensure privacy, with differential privacy-based schemes emerging as the preferred approach. However, these schemes typically represent user locations on uniformly partitioned grids, which often do not accurately reflect the true distribution of users within a space. In this paper, we presented a novel approach that dynamically adjusts the grid in real-time during location data collection using Geo-Ind to enhance the utility of the collected data. The proposed method captures the user distribution directly during data collection, eliminating the reliance on pre-existing distribution information. Experimental results on real data confirmed that the proposed scheme significantly improves the utility of the collected location data at both the data and application levels. Specifically, the results showed that compared to the existing solution, the proposed method can reduce the error rate by up to 52% in data-level experiments and by up to 75% in application-level experiments.

Despite the promising results, the proposed method has the following limitations. Since the proposed method adaptively adjusts the grid in real-time during data collection, there is an additional computational overhead associated with computing the adaptive grid. This overhead is especially significant when the number of grids is large. Thus, future work will focus on improving the efficiency of the adaptive grid computation, especially for large numbers of grids and users. This can be achieved by parallelizing the adaptive grid partitioning process to reduce the computational overhead. We will explore various parallel processing techniques, including distributed computing frameworks such as Apache Hadoop [40] or Apache Spark [41], which distribute data and computation across a cluster of machines. In addition, multi-threading within a single machine and GPU acceleration can be considered to increase efficiency. Furthermore, integrating cloud computing services will improve scalability by providing a dynamic and scalable infrastructure for performing adaptive grid partitioning on large datasets.

Another future research direction is to theoretically analyze the impact of adaptive grid partitioning on the data utility of collected location data. This analysis will elucidate the underlying principles of adaptive grid partitioning and its impact on data utility, thus providing more robust theoretical support for the proposed method. In addition, the privacy-utility tradeoff can be further investigated to optimize the balance between privacy and utility. This will include the development of models and metrics to quantitatively assess this trade-off under various conditions.

Funding

This research was funded by a 2023 Research Grant from Sangmyung University (2023-A000-0119).

Data Availability Statement

The original data presented in the study are openly available in Kaggle at https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i (accessed on 20 June 2024).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DP	Differential Privacy
LDP	Local Differential Privacy
MDP	Metric Differential Privacy
Geo-Ind	Geo-Indistinguishability
MCS	Mobile Crowdsensing
EM	Expectation-Maximization
JSD	Jensen–Shannon Divergence

References

1. Wang, X.; Ma, Y.; Wang, Y.; Jin, W.; Wang, X.; Tang, J.; Jia, C.; Yu, J. Traffic flow prediction via spatial temporal graph neural network. In Proceedings of the Web Conference, Taipei, Taiwan, 20–24 April 2020; pp. 1082–1092.
2. Pan, Z.; Liang, Y.; Wang, W.; Yu, Y.; Zheng, Y.; Zhang, J. Urban traffic prediction from spatio-temporal data using deep meta learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 April 2019; pp. 1720–1730.
3. Kim, J.S.; Kim, J.W.; Chung, Y.D. Successive Point-of-Interest Recommendation With Local Differential Privacy. IEEE Access 2021, 9, 66371–66386.
4. An, J.; Li, G.; Jiang, W. NRDL: Decentralized user preference learning for privacy-preserving next POI recommendation. Expert Syst. Appl. 2024, 239, 122421.
5. Primault, V.; Boutet, A.; Mokhtar, S.B.; Brunie, L. The long road to computational location privacy: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 2772–2793.
6. Liu, B.; Zhou, W.; Zhu, T.; Gao, L.; Xiang, Y. Location privacy and its applications: A systematic study. IEEE Access 2019, 6, 17606–17624.
7. Alharthi, R.; Banihani, A.; Alzahrani, A.; Alshehri, A.; Alshahrani, H.; Fu, H.; Liu, A.; Zhu, Y. Location privacy challenges in spatial crowdsourcing. In Proceedings of the IEEE International Conference on Electro/Information Technology, Rochester, MI, USA, 3–5 May 2018.
8. Henriksen-Bulmer, J.; Jeary, S. Re-identification attacks—A systematic literature review. Int. J. Inf. Manag. 2016, 36, 1184–1192.
9. Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Venice, Italy, 10–14 July 2006; pp. 1–12.
10. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third conference on Theory of Cryptography, New York, NY, USA, 4–7 March 2006.
11. Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Optimal geo-indistinguishable mechanisms for location privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 3–7 November 2014; pp. 251–262.
12. Kim, J.; Jang, B. Workload-aware indoor positioning data collection via local differential privacy. IEEE Commun. Lett. 2019, 23, 1352–1356.
13. Zhang, P.; Cheng, X.; Su, S.; Wang, N. Area coverage-based worker recruitment under geo-indistinguishability. Comput. Netw. 2022, 217, 109340.
14. Du, Y.; Hu, Y.; Zhang, Z.; Fang, Z.; Chen, L.; Zheng, B.; Gao, Y. LDPTrace: Locally differentially private trajectory synthesis. In Proceedings of the VLDB Endowment, Vancouver, BC, Canada, 28 August–1 September 2023; pp. 1897–1909.
15. Ghaemi, Z.; Farnaghi, M. A Varied Density-based Clustering Approach for Event Detection from Heterogeneous Twitter Data. ISPRS Int. J. Geo-Inf. 2019, 8, 82.
16. Alvim, M.; Chatzikokolakis, K.; Palamidessi, C.; Pazii, A. Local Differential Privacy on Metric Spaces: Optimizing the Trade-Off with Utility. In Proceedings of the IEEE Computer Security Foundations Symposium, Oxford, UK, 9–12 July 2018.
17. Andres, M.E.; Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 4–8 November 2013; pp. 901–914.
18. Chatzikokolakis, K.; Palamidessi, C.; Stronati, M. Geo-indistinguishability: A principled approach to location privacy. In Proceedings of the International Conference on Distributed Computing and Internet Technology, Bhubaneswar, India, 5–8 February 2015; pp. 49–72.
19. Wang, L.; Yang, D.; Han, X.; Wang, T.; Zhang, D.; Ma, X. Location privacy-preserving task allocation for mobile crowdsensing with differential geo-obfuscation. In Proceedings of the International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 627–636.
20. Qiu, C.; Squicciarini, A.C. Location privacy protection in vehicle-based spatial crowdsourcing via geo-indistinguishability. In Proceedings of the IEEE International Conference on Distributed Computing Systems, Dallas, TX, USA, 7–10 July 2019; pp. 1061–1071.
21. Jin, W.; Xiao, M.; Guo, L.; Yang, L.; Li, M. ULPT: A user-centric location privacy trading framework for mobile crowd sensing. IEEE Trans. Mob. Comput. 2022, 21, 3789–3806.
22. Huang, P.; Zhang, X.; Guo, L.; Li, M. Incentivizing crowdsensing-based noise monitoring with differentially-private locations. IEEE Trans. Mob. Comput. 2021, 20, 519–532.
23. Zhao, Y.; Yuan, D.; Du, J.T.; Chen, J. Geo-Ellipse-Indistinguishability: Community-aware location privacy protection for directional distribution. IEEE Trans. Knowl. Data Eng. 2023, 35, 6957–6967.
24. Yu, L.; Zhang, S.; Meng, Y.; Du, S.; Chen, Y.; Ren, Y.; Zhu, H. Privacy-preserving location-based advertising via longitudinal geo-indistinguishability. IEEE Trans. Mob. Comput. 2024, 23, 8256–8273.
25. Zhao, Y.; Chen, J. Vector-indistinguishability: Location dependency based privacy protection for successive location data. IEEE Trans. Comput. 2024, 73, 970–979.
26. Mendes, R.; Cunha, M.; Vilela, J.P. Velocity-aware geo-indistinguishability. In Proceedings of the ACM Conference on Data and Application Security and Privacy, Charlotte, NC, USA, 24–26 April 2023; pp. 141–152.
27. Ren, W.; Tang, S. EGeoIndis: An effective and efficient location privacy protection framework in traffic density detection. Veh. Commun. 2020, 21, 100187.
28. Kim, J.W.; Lim, B. Effective and Privacy-Preserving Estimation of the Density Distribution of LBS Users under Geo-Indistinguishability. Electronics 2023, 12, 917.
29. Chen, R.; Li, L.; Ma, Y.; Gong, Y.; Guo, Y.; Ohtsuki, T.; Pan, M. Constructing Mobile Crowdsourced COVID-19 Vulnerability Map With Geo-Indistinguishability. IEEE Internet Things J. 2022, 9, 17403–17416.
30. Fathalizadeh, A.; Moghtadaiee, V.; Alishahi, M. Indoor Geo-Indistinguishability: Adopting Differential Privacy for Indoor Location Data Protection. IEEE Trans. Emerg. Top. Comput. 2023, 12, 293–306.
31. Feyisetan, O.; Balle, B.; Drake, T.; Diethe, T. Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; pp. 178–186.
32. Song, S.; Kim, J.W. Adapting Geo-Indistinguishability for Privacy-Preserving Collection of Medical Microdata. Electronics 2023, 12, 2793.
33. Ahuja, R.; Ghinita, G.; Shahabi, C. A utility-preserving and scalable technique for protecting location data with geo-indistinguishability. In Proceedings of the International Conference on Extending Database Technology, Lisbon, Portugal, 26–29 March 2019; pp. 210–231.
34. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
35. Chib, S. Markov Chain Monte Carlo Methods: Computation and Inference. Handb. Econom. 2001, 5, 3569–3649.
36. Li, Y.; Hernandez-Lobato, J.M.; Turner, R.E. Stochastic Expectation Propagation. arXiv 2018, arXiv:1506.04132.
37. Sammaknejad, N.; Zhao, Y.; Huang, B. A review of the Expectation Maximization algorithm in data-driven process identification. J. Process Control 2019, 73, 123–136.
38. Howell, C.R.; Su, W.; Nassel, A.F.; Agne, A.A.; Cherrington, A.L. Area based stratified random sampling using geospatial technology in a community-based survey. BMC Public Health 2020, 20, 1678.
39. Moreira-Matias, L.; Gama, J.; Ferreira, M.; Mendes-Moreira, J.; Damas, L. Predicting taxi–passenger demand using streaming data. IEEE Trans. Intell. Transp. Syst. 2013, 14, 1393–1402.
40. Apache Hadoop. Available online: https://hadoop.apache.org/ (accessed on 22 July 2024).
41. Apache Spark—Unified Engine for Large-Scale Data Analytics. Available online: https://spark.apache.org/ (accessed on 22 July 2024).

Figure 1. An overview of the proposed privacy-preserving location data collection scheme.

Figure 2. Effect of the privacy budget on (a) the average count error and (b) the density error for the existing non-adaptive grid (

N G