Enhancing Real-Time Traffic Data Sharing: A Differential Privacy-Based Scheme with Spatial Correlation

Le, Junqing; Xing, Bowen; Zhang, Di; Qiao, Dewen

doi:10.3390/math12111722


by Junqing Le, Bowen Xing, Di Zhang * and Dewen Qiao

College of Computer Science, Chongqing University, Chongqing 400044, China

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(11), 1722; https://doi.org/10.3390/math12111722

Submission received: 5 May 2024 / Revised: 27 May 2024 / Accepted: 29 May 2024 / Published: 31 May 2024

(This article belongs to the Special Issue Privacy-Preserving Techniques in AI, Blockchain and Cloud Systems with Formal Mathematical Analysis)


Abstract:

The real-time sharing of traffic data can offer improved services to users and enable timely responses to environmental changes. However, these data often involve individuals' sensitive information, raising substantial privacy concerns. It is imperative to find ways to protect the privacy of the shared traffic data while maintaining their utility. In this paper, a Differential Privacy-based scheme with Spatial Correlation for Real-time traffic data (named DP-SCR) is proposed. DP-SCR not only ensures high data utility of the shared traffic data, but also provides strong privacy protection. Specifically, DP-SCR is designed to adhere to w-event ε-differential privacy, ensuring a high level of privacy protection. Subsequently, a novel adaptive allocation based on spatial correlation prediction is proposed to optimize the privacy budget allocation in differential privacy. In addition, a feasible dynamic clustering algorithm is developed to minimize the relative perturbation error, which further improves the quality of the shared data. Finally, the analyses demonstrate that DP-SCR provides w-event privacy for the shared data of each section, and that spatial correlation is a more pronounced characteristic of traffic data than other characteristics. Meanwhile, experiments conducted on real-world data show that the MAE and MRE of the predicted data in DP-SCR are smaller than those in other baseline DP-based schemes. This indicates that the proposed DP-SCR scheme can provide more accurate shared data.

Keywords:

traffic data sharing; privacy protection; differential privacy; adaptive allocation; spatial correlation

MSC:

68P27

1. Introduction

As science and technology advance, various sensors can collect traffic flows (i.e., a kind of traffic statistic) accurately and in real time [1,2,3,4,5,6]. Real-time traffic data record the time-sequence information of the road and can describe the traffic status in more detail. These data can be shared with other companies and organizations and subsequently used in intelligent transportation systems (ITS), such as traffic light control [7], route planning [8], autonomous driving [9,10], and forecasts of electric vehicle energy consumption [11], so that these applications can provide more personalized services and respond to environmental changes in a timely manner. However, traffic statistics often contain sensitive individual information [12], e.g., location information and vehicle status, which poses considerable threats to individual privacy. For example, because individuals' mobility traces are unique [13], an adversary can use outside information to link published trace information back to individuals' IDs, and then match those IDs with sensitive information to compromise individuals' privacy.

To address privacy leakage in data sharing, a privacy protection model with strong theoretical support, called ε-differential privacy, was proposed in [14]. It ensures that the outcomes of any analysis on neighboring datasets (i.e., two datasets that differ in only one record) are difficult to distinguish. Many variant schemes based on differential privacy have been proposed for privacy protection [15,16,17,18,19,20,21,22]. However, most of them focus on either user-level privacy on finite streams or event-level privacy on infinite streams, and applying them directly to real-time data often leads to inadequate protection or a notable reduction in data utility.

In view of this, Kellaris et al. [23] proposed a novel model of differential privacy named w-event ε-differential privacy (w-event privacy for short). The w-event privacy model fills the gap between event-level and user-level privacy: it protects all events that happen within any w successive timestamps without sacrificing too much data utility. Using w-event privacy to protect real-time data is therefore a favorable option. The authors of [23] designed two schemes based on w-event privacy, called budget distribution (BD) and budget absorption (BA), to protect any event sequence occurring within any w successive timestamps (i.e., a sliding window of size w).

Currently, numerous improved privacy-preserving schemes based on w-event privacy for real-time data sharing have been proposed [24,25,26,27,28,29]. To improve the accuracy of the shared traffic flows, Wang et al. [24] proposed two privacy protection schemes, RescueDP and E-RescueDP, which take data dynamics into account and adaptively allocate privacy budgets for each section through proportional-integral-derivative (PID) control or a recurrent neural network (RNN). Huo et al. [25] proposed an adaptive w-event privacy scheme for fog computing, which optimizes the prediction of E-RescueDP by using a long short-term memory (LSTM) network. In contrast to the centralized differential privacy schemes mentioned above, some w-event privacy schemes for real-time data release are based on local differential privacy (LDP) and do not need a trusted server, as discussed in [27,28,29].

In this paper, we focus on sharing real-time traffic flows that are used to serve the intelligent transportation system continually. However, if the real-time traffic flows of each road section are shared with the public directly, it can cause serious privacy issues, such as the disclosure of whereabouts. To ensure the shared traffic flows are protected by strong privacy, enabling the sharing of traffic flows that adhere to w-event privacy is essential.

1.1. Motivation

Nevertheless, the prior schemes that focus on real-time data sharing under the protection of w-event privacy have shown limitations in data utility, specifically in terms of the quality of the shared data. Data utility is a vital metric for assessing the quality of the shared data.

First, privacy protection for raw data significantly reduces data utility. The BD and BA schemes [23] allocate an equal privacy budget to the traffic flows of each section. The local differential privacy (LDP)-based work in [29] divides the privacy budget among different processing steps to satisfy more limited privacy guarantees. However, these schemes tend to allocate low privacy budgets to traffic flows, which in turn introduces excessive noise into the shared traffic flows. A reasonable allocation of the privacy budget is a promising way to solve this issue. RescueDP and E-RescueDP [24] propose an adaptive allocation, in which the current raw traffic flows are replaced by predicted traffic flows that do not consume any privacy budget. In this setting, the more accurate the predicted traffic flows are, the more accurate the shared traffic flows are. However, the predicted data in [24] are calculated from the temporal correlation between data, which may not be the best way to predict traffic flows.

Second, differences in the privacy budget allocated to each section can introduce a large relative perturbation error. Sections with small traffic flows suffer a large relative perturbation error in w-event privacy schemes, where the perturbation error is introduced by Laplace noise. To reduce the perturbation error, the dynamic grouping mechanism in [24] partitions the sections with small traffic flows into different groups based on the similarity of their traffic flows. Furthermore, to satisfy w-event privacy, it uses the smallest privacy budget among the sections of a group as the privacy budget of all sections in that group. However, if sections with widely differing privacy budgets are partitioned into the same group, all sections of the group incur a large perturbation error, which reduces the accuracy of the shared traffic flows.

1.2. Contributions

Motivated by the above discussions, a scheme named DP-SCR is proposed in this paper to enhance the real-time traffic data sharing. In DP-SCR, we design an adaptive allocation of a privacy budget by using spatial correlation. Then, a novel dynamic clustering method based on k-means algorithm is developed, which takes the traffic flows and the difference in privacy budget into account. Finally, the proposed DP-SCR is proved to satisfy w-event privacy, providing a high level of privacy protection. This means that even if the attacker has background information about the user, they cannot obtain any additional information from the shared data.

Compared with the existing schemes that also satisfy w-event privacy, DP-SCR has the following three contributions:

In DP-SCR, we prove that the spatial characteristic of traffic flows provides a more remarkable correlation than other characteristics of traffic flows. Then, the designed spatial correlation prediction in DP-SCR is used to adaptively allocate the privacy budget for traffic flows. It significantly improves the accuracy of the shared traffic flows;
We design a novel dynamic clustering algorithm to aggregate the sections with similar traffic flow and privacy budget. It further improves the accuracy of the shared traffic flows by reducing the relative perturbation error caused by the small traffic flows;
The experimental results with real-world traffic datasets demonstrate that DP-SCR outperforms baseline w-event privacy schemes in terms of data utility for real-time data release. Also, these experiments validate that DP-SCR is robust to changes of $ε$ and w.

1.3. Organization

The rest of this paper is organized as follows. In Section 2, some preliminary knowledge for the proposed scheme is described. Then, the main problems of sharing real-time traffic flows are stated in Section 3. The construction of DP-SCR is presented in Section 4, consisting of the adaptive allocation of privacy budgets, dynamic clustering, and approximation and perturbation. In Section 5, we analyze the related performance of DP-SCR. Experiments that verify the high data utility of DP-SCR are reported in Section 6. Finally, the conclusions and future work are presented in Section 7.

2. Preliminaries

In this section, we review some basic preliminaries that are necessary for the rest of this paper, mainly including differential privacy, w-event privacy and the characteristics of traffic flows. Some mathematical notations are summarized in Table 1.

2.1. Differential Privacy

Let $\mathcal{D}$ denote a set of datasets, and let $Q$ be the query function. $M$ represents the Laplace mechanism, and the set $R$ denotes the range of $M(\cdot)$.

Definition 1 (Neighboring datasets [30]). For two datasets $D \in \mathcal{D}$ and $D' \in \mathcal{D}$, if $D'$ can be obtained from $D$ by removing or adding any single record, $D$ and $D'$ are neighboring.

Definition 2 (Sensitivity [30]). Assume $Q: \mathcal{D} \to \mathbb{R}^{d}$; then the sensitivity of $Q$ with regard to $\mathcal{D}$ is

$$Δ(Q) = \max_{D, D'} \| Q(D) - Q(D') \|_{1},$$

where $D$ and $D'$ represent any pair of neighboring datasets in $\mathcal{D}$.

Definition 3 (Laplace mechanism [30]). For $Q: \mathcal{D} \to \mathbb{R}^{d}$, $M$ adds noise to the result of $Q(D)$, where the noise conforms to the Laplace distribution. Formally, for any dataset $D \in \mathcal{D}$,

$$M(D) = Q(D) + \langle \mathrm{Lap}(Δ(Q) / ε) \rangle^{d},$$

where $ε$ denotes the privacy budget, which indicates the privacy level of mechanism $M$.

Definition 4 (Differential privacy [31]). For any neighboring datasets $D$ and $D'$, and the set $R$, if

$$\Pr[M(D) \in R] \leq e^{ε} \Pr[M(D') \in R],$$

the mechanism $M$ satisfies ε-differential privacy ($ε > 0$).

Take the traffic flows of section k at timestamp i as an example. The queried result of the traffic flows from $D_{i}$ is represented as $Q_{k}(D_{i}) = f_{i}^{k}$, and the shared traffic flows after the processing of differential privacy can be written as $r_{i}^{k} = f_{i}^{k} + \langle \mathrm{Lap}(Δ(Q_{k}) / ε) \rangle$, where $D_{i}$ is the raw traffic data at timestamp i and $Q_{k}$ is the query function for section k.
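The example above can be illustrated with a minimal Python sketch of the scalar Laplace mechanism. It is not the authors' implementation; the function names `laplace_noise`, `share_flow`, and `mean_abs_noise` are hypothetical, and the sensitivity is assumed to default to 1. The sketch also shows empirically that the expected noise magnitude is Δ(Q)/ε, so a smaller budget means more noise.

```python
import random
from typing import Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with location 0 and scale `scale`.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def share_flow(raw_flow: float, epsilon: float, sensitivity: float = 1.0,
               rng: Optional[random.Random] = None) -> float:
    """Return r = f + Lap(sensitivity / epsilon) for one section and timestamp."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return raw_flow + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_noise(epsilon: float, trials: int = 20000, seed: int = 0) -> float:
    """Empirical mean of |noise|; its theoretical value is sensitivity / epsilon."""
    rng = random.Random(seed)
    return sum(abs(share_flow(0.0, epsilon, rng=rng)) for _ in range(trials)) / trials
```

Calling `mean_abs_noise(0.5)` and `mean_abs_noise(2.0)` should give values close to 2 and 0.5, respectively, which matches the observation that a smaller ε produces larger noise.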

Theorem 1 (Sequential composition [32]). Assume that $M$ includes a sequence of sub-mechanisms $M_{1}, M_{2}, \dots, M_{r}$ and each $M_{i}$ adds independent random noise. If each mechanism $M_{i}$ satisfies $ε_{i}$-differential privacy, the mechanism $M$ satisfies ($\sum_{i = 1}^{r} ε_{i}$)-differential privacy.

According to the above definitions and Theorem 1, the smaller $ε$ or the higher $Δ(Q)$ is, the larger the introduced noise is. The portions of the privacy budget $ε$ of $M$ assigned to different sub-mechanisms may differ.

2.2. w-Event Privacy

w-event privacy can protect all events that happen within any w successive timestamps. To describe w-event privacy precisely, the traffic data are denoted as an infinite tuple $S = (D_{1}, D_{2}, \dots)$, where $D_{t}$ represents the raw traffic data at timestamp t, and $S[i]$ is the i-th element of $S$. Then a stream prefix of $S$ at timestamp t is denoted as $S_{t} = (D_{1}, D_{2}, \dots, D_{t})$.

Definition 5 (w-neighboring [23]). Two stream prefixes $S_{t}$ and $S_{t}'$ are w-neighboring, where w is a positive integer, if they satisfy the following conditions:

1. For each $S_{t}[i]$, $S_{t}'[i]$ with $i \in [t]$ and $S_{t}[i] \neq S_{t}'[i]$, it holds that $S_{t}[i]$ and $S_{t}'[i]$ are neighboring;
2. For each $S_{t}[i_{1}]$, $S_{t}[i_{2}]$, $S_{t}'[i_{1}]$, $S_{t}'[i_{2}]$ with $i_{1} < i_{2}$, $S_{t}[i_{1}] \neq S_{t}'[i_{1}]$ and $S_{t}[i_{2}] \neq S_{t}'[i_{2}]$, it holds that $i_{2} - i_{1} + 1 \leq w$.

Definition 6 (w-event privacy [23]). Let $S_{t}[i] = D_{i} \in \mathcal{D}$ and let $R \subset Range(M)$. For all w-neighboring stream prefixes $S_{t}$, $S_{t}'$ and all t, if

$$\Pr[M(S_{t}) \in R] \leq e^{ε} \Pr[M(S_{t}') \in R],$$

the mechanism $M$ satisfies w-event privacy.

Theorem 2. Let the stream prefix $S_{t}$ denote the input of $M$, and let the output of $M$ be $\{R_{1}, R_{2}, \dots, R_{t}\} \subset Range(M)$. Suppose that the mechanism $M$ includes t mechanisms $M_{1}, M_{2}, \dots, M_{t}$, and each $M_{i}(S_{t}[i])$ achieves $ε_{i}$-differential privacy. Then the mechanism $M$ satisfies w-event privacy if

$$\forall i \in [t], \quad \sum_{k = i - w + 1}^{i} ε_{k} \leq ε.$$
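The condition of Theorem 2 can be checked mechanically for a given sequence of per-timestamp budgets. The following Python sketch is only an illustration: the function name `satisfies_w_event` and the tolerance `tol` are hypothetical, and windows that start before the first timestamp are assumed to contain only the timestamps that exist.

```python
def satisfies_w_event(budgets, w, epsilon, tol=1e-9):
    """Return True if every window of w consecutive timestamps spends at most epsilon.

    budgets[i] is the budget eps_i spent at the i-th timestamp (0 for non-sampled points).
    """
    if w < 1:
        raise ValueError("w must be a positive integer")
    window_sum = 0.0
    for i, eps_i in enumerate(budgets):
        window_sum += eps_i                 # timestamp i enters the window
        if i >= w:
            window_sum -= budgets[i - w]    # timestamp i - w leaves the window
        if window_sum > epsilon + tol:
            return False
    return True


# Hypothetical budget sequence: sampled points spend 0.5, skipped points spend 0.
budgets = [0.5, 0.0, 0.0, 0.5, 0.5]
print(satisfies_w_event(budgets, w=3, epsilon=1.0))
print(satisfies_w_event(budgets, w=5, epsilon=1.0))
```

With these hypothetical values, the sequence respects the constraint for w = 3 but violates it for w = 5, because all three sampled points then fall into one window.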

2.3. Characteristics of Traffic Flows

Based on the analyses in Section 1, the proposed DP-SCR mainly considers the spatial correlation between traffic flows.

Definition 7 (Spatial correlation [33]). A road network consists of multiple sections, and there exists a spatial correlation between sections. Formally, the algorithm $SC$ denotes the spatial correlation between the traffic flows. The predicted traffic flow $\hat{f}_{t + 1}^{i}$ can be calculated by $\hat{f}_{t + 1}^{i} = SC(tp_{t}^{i})$, where $tp_{t}^{i}$ is the traffic parameter of section i at timestamp t. If the linked sections of section i are section j and section k, then $tp_{t}^{i} = \{tp_{t}^{i, j}, tp_{t}^{i, k}\}$, where $tp_{t}^{i, k}$ consists of the shared traffic flow $r_{t}^{i, k}$ and some prior knowledge, including the maximal speed limit $v_{max}^{i}$, the predefined sampling period, and the road networks $rn_{k, i}$.

The road networks link all the sections and show the flow correlation between different sections. Figure 1 shows a part of the road networks, where each numeric value represents the probability that the traffic flow of one section enters one of its linked (adjacent) sections.

For example, half of the traffic flow of Section 5 enters Section 4, so the probability is 0.5 and is denoted as $rn_{5, 4}$ in this paper. The traffic on any section will either stay in the same section or enter another section, so the probabilities of section i satisfy the following relationship:

$$\sum_{j \in A_{i}} rn_{i, j} = 1,$$

where $A_{i}$ is the set that includes section i and its linked sections. The flow correlation is represented as $RN = \{rn_{i, j} \mid i, j \in Sec\}$, where $Sec$ is the set of all sections.
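As a small illustration of this constraint, the Python sketch below stores $RN$ as a nested dictionary and reports the sections whose outgoing probabilities do not sum to 1. The data layout and the function name `invalid_sections` are assumptions made for this example. Only $rn_{5,4} = 0.5$ comes from the text; every other probability is invented for illustration.

```python
def invalid_sections(rn, tol=1e-9):
    """Return the sections i whose probabilities rn[i][j], j in A_i, are not a distribution.

    rn maps a section i to a dict {j: rn_ij}, where the keys j form A_i
    (section i itself plus its linked sections).
    """
    bad = []
    for i, row in rn.items():
        out_of_range = any(p < 0.0 or p > 1.0 for p in row.values())
        if out_of_range or abs(sum(row.values()) - 1.0) > tol:
            bad.append(i)
    return bad


# rn[5][4] = 0.5 follows the example in the text; all other values are hypothetical.
rn = {
    5: {5: 0.2, 4: 0.5, 6: 0.3},   # sums to 1, valid
    4: {4: 0.6, 3: 0.5},           # sums to 1.1, invalid
}
print(invalid_sections(rn))
```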

Mathematical notations: The mathematical notations and their semantic meanings used in this paper are summarized in Table 1.

3. Problem Statement

When real-time traffic flows are shared with the public, they may cause serious privacy issues. Therefore, to give the shared traffic flows strong privacy protection, each section is required to satisfy w-event privacy. As shown in Figure 2, the traffic data are collected by various sensors and stored in the database. Then, the traffic flows that serve an intelligent transportation system are processed to satisfy w-event privacy, so that the shared flows cannot leak the privacy of users. To be more specific, let $F_{k}$ be the traffic flows at timestamp k and $D_{k}$ be the raw traffic data at timestamp k. Then, we have $F_{k} = Q(D_{k}) = (f_{k}^{1}, f_{k}^{2}, \dots, f_{k}^{n})$, where n is the total number of sections at timestamp k, and $f_{k}^{i}$ is defined as the traffic flow of section i at timestamp k. To ensure that the traffic flows are shared securely, the sanitized version of $f_{k}^{i}$, denoted by $r_{k}^{i}$, is used to replace $f_{k}^{i}$. Thus, the sanitized version of the infinite series of traffic flows at section i is denoted as $R^{i} = (r_{1}^{i}, r_{2}^{i}, \dots, r_{k}^{i}, \dots)$.

In this paper, the problem concerning privacy protection is formally stated as follows.

Given an infinite time series of traffic flows $F = (F_{1}, F_{2}, \dots, F_{k}, \dots)$, denote its sanitized version as $R = (R_{1}, R_{2}, \dots, R_{k}, \dots)$. The goal is to design a scheme such that the sanitized infinite series of every section in $R$, denoted as $R^{i} = (r_{1}^{i}, r_{2}^{i}, \dots, r_{k}^{i}, \dots)$, satisfies w-event privacy.

Since data utility is the main criterion for measuring the quality of a scheme, designing a mechanism that improves the data utility of the shared traffic flows is very meaningful. In this paper, the allocation of the privacy budget and the perturbation error greatly affect the accuracy of the shared traffic flows. Therefore, the problem of data utility can be described as follows.

(a) How to allocate the privacy budget reasonably. (b) How to reduce the mean absolute error (MAE) and mean relative error (MRE) of the shared traffic flows $R$, where MAE and MRE represent the perturbation error.
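This section does not give formulas for the two metrics, so the Python sketch below uses the usual definitions as an assumption: MAE averages $|r - f|$ and MRE averages $|r - f| / \max(f, floor)$. The parameter `floor`, which avoids division by zero for empty sections, and both function names are hypothetical and not taken from the paper.

```python
def mean_absolute_error(shared, raw):
    """Average of |r - f| over all sections (assumed definition of MAE)."""
    return sum(abs(r - f) for r, f in zip(shared, raw)) / len(raw)


def mean_relative_error(shared, raw, floor=1.0):
    """Average of |r - f| / max(f, floor) over all sections (assumed definition of MRE).

    `floor` is a hypothetical lower bound on the denominator so that sections
    with zero flow do not cause a division by zero.
    """
    return sum(abs(r - f) / max(f, floor) for r, f in zip(shared, raw)) / len(raw)


# The same absolute error of 2 vehicles is far more damaging for a small flow.
raw = [2.0, 200.0]
shared = [4.0, 202.0]
print(mean_absolute_error(shared, raw))
print(mean_relative_error(shared, raw))
```

The example illustrates the motivation for treating small traffic flows separately: both sections have the same absolute error, but the relative error of the small section dominates the MRE.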

4. The Design of DP-SCR

In this section, we propose a scheme, named DP-SCR, which satisfies w-event privacy and provides high accuracy of traffic flows. The proposed DP-SCR achieves an adaptive allocation of the privacy budget, where the spatial correlation prediction is used to improve the budget allocation. Additionally, dynamic clustering is proposed to reduce the perturbation error caused by small traffic flows. Finally, a novel approximation method and a perturbation method are used to handle the non-sampled sections and the sampled sections in DP-SCR, respectively. Figure 3 shows the flowchart of DP-SCR, where the sampling of sections is determined by $dis$ and $λ_{t + 1}$. Here, $dis$ is the dissimilarity between the predicted traffic flow and the last shared traffic flow, and $λ_{t + 1}$ is the perturbation error.

Algorithm 1 gives an overall description of the proposed DP-SCR. The main processes of DP-SCR are described in detail as follows.

Algorithm 1: DP-SCR.

Require: raw traffic data $D_{t + 1}$, the shared traffic flows $R_{t}$, $tp_{t} = \{tp_{t}^{1}, \dots, tp_{t}^{n}\}$.
Ensure: new shared traffic flows $R_{t + 1}$.

1: Obtain traffic flows $F_{t + 1} = Q(D_{t + 1})$;
2: for each section i at timestamp t do
3:   Calculate the predicted traffic flow $\hat{f}_{t + 1}^{i}$ according to $tp_{t}^{i}$, and then calculate $dis$;
4:   Calculate the privacy budget of section i for timestamp t + 1;
5:   Perform sampling according to $\hat{f}_{t + 1}^{i}$.
6: end for (see Section 4.1)
7: For sampling points, perform dynamic clustering at timestamp t + 1, and perturb their traffic flows with the Laplace mechanism; (see Section 4.2 and Section 4.3)
8: For non-sampled points, approximate the current traffic flows with the corresponding predicted traffic flows; (see Section 4.3)
9: Obtain $R_{t + 1}$ by combining the results at sampling points and the results at non-sampled points.

4.1. Adaptive Allocation of Privacy Budget

According to Theorems 1 and 2, if the sliding window size w is too large, the privacy budget allocated to the sections at each timestamp is small, which results in a large magnitude of noise. Sampling is a promising way to reduce noise: because non-sampled points do not consume any privacy budget, more privacy budget can be allocated to the sampling points. To this end, an adaptive allocation of the privacy budget was proposed in [24]. It uses temporal correlation to predict the value at the next timestamp and uses the predicted value for sampling, so the accuracy of the prediction determines the quality of the sampling. Exploiting the spatial correlation between traffic flows can increase the accuracy of the predicted traffic flows and make the sampling more reasonable.

Inspired by the above ideas, a mechanism for the adaptive allocation of the privacy budget based on spatial correlation is proposed in this paper. The mechanism includes three operations, which are described in detail as follows.

4.1.1. Spatial Correlation Prediction

In the phase of spatial correlation prediction, $V_{max}^{i}$ is the maximal speed limit of section i, and the speed of all vehicles is assumed to be less than or equal to this maximum speed. The predefined $T_{Sample}$ is the sampling period of the raw traffic data. According to Equation (9) of [34], the speed-flow relationship between $V_{t}^{i}$ and $r_{t}^{i}$ is

$$V_{t}^{i} = \frac{θ_{1} \times V_{max}^{i}}{1 + (r_{t}^{i} / CA_{i})^{ρ}}, \quad ρ = θ_{2} + θ_{3} \times (r_{t}^{i} / CA_{i})^{3},$$

where $V_{t}^{i}$ is the average vehicle speed in section i at timestamp t, $CA_{i} = V_{max}^{i} \times T_{Sample}$ is the maximum capacity of section i, and $θ_{i}$ ($i = 1, 2, 3$) and $ρ$ are scale factor corrections. The average vehicle speed is thus related to the shared traffic flows, the maximal speed limit of the section, and the predefined sampling period. However, the scale factors are hard to set manually. Inspired by [34], we design a novel spatial correlation algorithm $SC$ to calculate the predicted traffic flows of section i. The goal of the method is to predict the traffic flow of section i at timestamp t + 1 based on $r_{t}^{i}$ and the prior knowledge $V_{max}^{i}$, $T_{Sample}$ and $rn_{k, i}$. The specific processes are described in the following three steps:

Step 1: To calculate $V_{t}^{i}$, we train a model $M_{v}$ to learn the relationship between $V_{t}^{i}$ and $r_{t}^{i}$, $V_{max}^{i}$ and $T_{Sample}$. Based on the trained model, we can obtain the average vehicle speed of each section at any timestamp by inputting the prior knowledge and the shared traffic flows.

Step 2: Based on the average speed of section i and the sampling period $T_{Sample}$, the traffic outflow $\bar{r}_{t}^{i}(out)$ of section i is calculated by

$$\bar{r}_{t}^{i}(out) = V_{t}^{i} \times T_{Sample}.$$

Step 3: The predicted traffic flow ($\hat{f}_{t + 1}^{i}$) of section i is the difference between the traffic inflows from its linked sections and the traffic outflow of section i. For example, if section j and section k are the linked sections of section i, the predicted traffic flow of section i is given by

$$\hat{f}_{t + 1}^{i} = rn_{j, i} \cdot \bar{r}_{t}^{j}(out) + rn_{k, i} \cdot \bar{r}_{t}^{k}(out) - (1 - rn_{i, i}) \cdot \bar{r}_{t}^{i}(out).$$
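The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. The paper obtains $V_{t}^{i}$ from a trained model $M_{v}$; here the closed-form speed-flow relation quoted from [34] is used as a stand-in, with placeholder values for $θ_{1}, θ_{2}, θ_{3}$. The function names, the dictionary layout of `rn`, and the clamping of negative shared flows to zero are all hypothetical choices.

```python
def speed_from_flow(r, v_max, t_sample, theta=(1.0, 1.0, 1.0)):
    """Stand-in for the trained model M_v: speed-flow relation quoted from [34].

    theta = (theta_1, theta_2, theta_3) are placeholder scale factors.
    """
    capacity = v_max * t_sample              # CA_i = V_max^i * T_Sample
    ratio = max(r, 0.0) / capacity           # noisy shared flows may be negative; clamp to 0
    rho = theta[1] + theta[2] * ratio ** 3
    return theta[0] * v_max / (1.0 + ratio ** rho)


def outflow(speed, t_sample):
    """Step 2: traffic outflow of a section, V_t^i * T_Sample."""
    return speed * t_sample


def predict_flow(i, linked, outflows, rn):
    """Step 3: inflow from linked sections minus the share of section i's outflow that leaves.

    rn[(j, i)] is the probability that traffic of section j enters section i;
    rn[(i, i)] is the probability that traffic stays in section i.
    """
    inflow = sum(rn[(j, i)] * outflows[j] for j in linked)
    return inflow - (1.0 - rn[(i, i)]) * outflows[i]


# Hypothetical outflows and probabilities for section "i" with linked sections "j" and "k".
outflows = {"i": 10.0, "j": 20.0, "k": 30.0}
rn = {("j", "i"): 0.5, ("k", "i"): 0.1, ("i", "i"): 0.4}
print(predict_flow("i", ["j", "k"], outflows, rn))
```

With these hypothetical numbers the prediction is 0.5 · 20 + 0.1 · 30 − 0.6 · 10, i.e., approximately 7.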

4.1.2. Calculation of Privacy Budget

In the calculation of the privacy budget, to satisfy w-event privacy, the total privacy budget of each section in any sliding window should not exceed $ε$. Here, assume that all sections at the next timestamp are sampling points. Thus, the privacy budget of all sections at the next timestamp should be calculated.

Without loss of generality, let the current timestamp be t; then, the privacy budget for section i at timestamp t + 1 is $ε_{t + 1}^{i}$. The remaining privacy budget $ε_{r}$ in the sliding window $[t - w + 2, t]$ is calculated by $ε_{r} = ε - \sum_{j = t - w + 2}^{t} ε_{j}^{i}$. Additionally, the sampling interval is $I = (t + 1 - l)$, where l is the last sampling point of section i. Then, a scale factor p, which determines how much privacy budget will be allocated for section i at timestamp t + 1, is calculated by

$$p = \min(φ \times \ln(I + 1), p_{max}),$$

where $φ$ is a scale factor that varies in (0, 1], and $p_{max}$ is the maximum portion of the privacy budget allocated for each sampling point. In the end, the privacy budget allocated for section i at timestamp t + 1 is calculated by

$$ε_{t + 1}^{i} = \min(p \times ε_{r}, ε_{max}),$$

where $ε_{max}$ is the maximum privacy budget allocated for each sampling point. The two constraints (i.e., $p_{max}$ and $ε_{max}$) are aimed at striking a good balance between the data utility and privacy protection of traffic flows.
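The budget calculation above translates directly into a few lines of Python. The sketch below is an illustration rather than the authors' implementation. The function name `allocate_budget` and its argument names are hypothetical, and the remaining budget is clamped at zero as a defensive assumption.

```python
import math


def allocate_budget(epsilon, spent_in_window, interval, phi, p_max, eps_max):
    """Candidate budget for section i at timestamp t + 1.

    spent_in_window: budgets eps_j^i already spent at timestamps t - w + 2, ..., t.
    interval:        I = t + 1 - l, the distance to the last sampling point l.
    """
    eps_r = max(epsilon - sum(spent_in_window), 0.0)   # remaining budget in the window
    p = min(phi * math.log(interval + 1), p_max)       # portion grows with the interval
    return min(p * eps_r, eps_max)


# Hypothetical setting: w = 4, so three earlier timestamps share the window.
spent = [0.2, 0.0, 0.3]
print(allocate_budget(1.0, spent, interval=1, phi=0.5, p_max=0.6, eps_max=0.2))
print(allocate_budget(1.0, spent, interval=10, phi=0.5, p_max=0.6, eps_max=0.2))
```

In the first call the logarithmic term decides the portion (about 0.35 of the remaining 0.5). In the second call the long interval pushes the portion to $p_{max}$, and the result is capped by $ε_{max}$.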

4.1.3. Sampling with the Predicted Traffic Flows

The perturbation error of section i is $λ_{t + 1}^{i} = 1 / ε_{t + 1}^{i}$, and the dissimilarity between the predicted traffic flow and the last shared traffic flow is $dis = \hat{f}_{t + 1}^{i} - r_{l}^{i}$, where $r_{l}^{i}$ is the last shared traffic flow of section i. If $λ_{t + 1}^{i} > dis$, the traffic flow at timestamp t + 1 is approximated by the predicted traffic flow. Then, the privacy budget of section i is withdrawn, i.e., section i at timestamp t + 1 is a non-sampled point, and its privacy budget is zero. Otherwise, section i at timestamp t + 1 is a sampling point, and its privacy budget remains unchanged. The mechanism for the adaptive allocation of the privacy budget is formally presented in Algorithm 2.

Algorithm 2: Adaptive allocation of privacy budget for section i at timestamp t + 1.

Require: privacy budget $ε$, $ε_{max}$, $p_{max}$, $r_{l}^{i}$, $RN$ and the traffic flows of section i and its linked sections at timestamp t.
Ensure: the privacy budget of section i at timestamp t + 1.

1: Assume that section i at timestamp t + 1 is a sampling point, and calculate its privacy budget $ε_{t + 1}^{i} = \min(p \times ε_{r}, ε_{max})$, where $p = \min(φ \times \ln(I + 1), p_{max})$ and $I = (t + 1 - l)$;
2: According to the spatial correlation prediction, calculate $\hat{f}_{t + 1}^{i}$, the predicted traffic flow of section i at timestamp t + 1;
3: Calculate the dissimilarity between the predicted traffic flow and the last shared traffic flow, $dis = \hat{f}_{t + 1}^{i} - r_{l}^{i}$;
4: $λ_{t + 1}^{i} = 1 / ε_{t + 1}^{i}$;
5: if $dis > λ_{t + 1}^{i}$ then
6:   section i at timestamp t + 1 is a sampling point;
7:   return $ε_{t + 1}^{i}$;
8: else
9:   section i at timestamp t + 1 is a non-sampled point;
10:   return 0;
11: end if
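Steps 3 to 11 of Algorithm 2 can be sketched in Python as shown below. Two points are assumptions of this sketch rather than statements of the paper: the dissimilarity is taken as an absolute difference, so that a drop in traffic can also trigger sampling (the text writes the signed difference), and a section with no remaining budget is treated as non-sampled. The function name `decide_sampling` is hypothetical.

```python
def decide_sampling(eps_candidate, predicted_flow, last_shared_flow):
    """Return (is_sampling_point, budget_spent) for one section at timestamp t + 1.

    eps_candidate is the budget computed under the assumption that the
    section is a sampling point (step 1 of Algorithm 2).
    """
    if eps_candidate <= 0.0:
        # Assumption: without any budget the section cannot be perturbed and published.
        return False, 0.0
    lam = 1.0 / eps_candidate                       # perturbation error lambda
    dis = abs(predicted_flow - last_shared_flow)    # assumption: absolute dissimilarity
    if dis > lam:
        return True, eps_candidate                  # change outweighs the expected noise
    return False, 0.0                               # budget is withdrawn


print(decide_sampling(0.2, predicted_flow=120.0, last_shared_flow=100.0))
print(decide_sampling(0.2, predicted_flow=103.0, last_shared_flow=100.0))
```

With a candidate budget of 0.2 the perturbation error is 5, so a predicted change of 20 vehicles leads to sampling, whereas a change of 3 vehicles does not.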

4.2. Dynamic Clustering

As shown in the analyses in Section 1, sections with small traffic flow can cause a large relative perturbation error in w-event privacy schemes. In this section, a dynamic clustering algorithm, i.e., bisecting k-means, is adopted to reduce the perturbation error. Specifically, sections with similar traffic flow and privacy budget are aggregated via dynamic clustering to resist noise.

First, it is necessary to determine which sections have small traffic flow before clustering. Here, the noise resistance threshold is defined as $τ$, which reflects whether the traffic flows have sufficient capacity to resist noise. Traffic flows smaller than $τ$ are classified as small traffic flows. Then, the sections with small traffic flows are saved in the cluster $C_{t + 1}^{0}$.

Assume that the number of sections with small traffic flow is n; then $C_{t + 1}^{0} = \{y_{1, 0}, \dots, y_{n, 0}\}$, $y_{x, 0} = (f_{t + 1}^{x, 0}, λ_{t + 1}^{x, 0})$, and $λ_{t + 1}^{x, 0} = 1 / ε_{t + 1}^{x, 0}$, $x \in [1, n]$, where $f_{t + 1}^{x, 0}$ is the traffic flow $f_{t + 1}^{x}$ of section x in cluster $C_{t + 1}^{0}$, and $λ_{t + 1}^{x, 0}$ is the perturbation error. When each cluster satisfies $\sum_{x \in C_{t + 1}^{h}} f_{t + 1}^{x, h} \geq τ$, the clustering at timestamp t + 1 can be denoted as $CLU_{t + 1} = \{C_{t + 1}^{0}, \dots, C_{t + 1}^{k}\}$, $(k \leq n)$. Also, the sum of the squared error ($SSE$) of $CLU_{t + 1}$ is $SSE(CLU_{t + 1}) = \sum_{h = 1}^{k} \sum_{x \in C_{t + 1}^{h}} \| y_{x, h} - c_{t + 1}^{h} \|^{2}$, where $c_{t + 1}^{h}$ is the cluster center of $C_{t + 1}^{h}$. The smaller the $SSE$ is, the more similar the traffic flows and privacy budgets of the sections are. Thus, the dynamic clustering aims to find the smallest $SSE$, as described in Algorithm 3.

After dynamic clustering, a suitable privacy budget should be allocated for each cluster in $CLU_{t + 1}$. Without loss of generality, $\sum_{j = 0}^{w - 1} ε_{t + j}^{i}$ denotes the total privacy budget for section i at any successive w timestamps. In order to ensure $\sum_{j = 0}^{w - 1} ε_{t + j}^{i} \leq ε$, the privacy budget allocated for $C_{t + 1}^{i}$ is equal to $\hat{ε}_{t + 1}^{i}$, and the privacy budget of the sections in $C_{t + 1}^{i}$ is also $\hat{ε}_{t + 1}^{i}$, where $\hat{ε}_{t + 1}^{i} = \min_{x \in C_{t + 1}^{i}} (ε_{t + 1}^{x, i})$.
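The min rule for cluster budgets can be written in a few lines. The Python sketch below is illustrative only, and the function name `cluster_budgets` and the data layout (clusters as lists of section identifiers) are assumptions. Because every section receives at most its own candidate budget, the per-section window constraint that held before clustering still holds afterwards.

```python
def cluster_budgets(clusters, eps):
    """Assign to every section the smallest candidate budget found in its cluster.

    clusters: list of clusters, each a list of section identifiers.
    eps:      dict mapping a section identifier to its candidate budget.
    """
    return {x: min(eps[y] for y in members) for members in clusters for x in members}


# Hypothetical clusters and candidate budgets.
eps = {"a": 0.1, "b": 0.3, "c": 0.2}
shared = cluster_budgets([["a", "b"], ["c"]], eps)
print(shared)
```

In this example section "b" gives up part of its candidate budget because it shares a cluster with "a", which is why clustering sections with very different budgets inflates the noise for the whole cluster.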

Algorithm 3: Dynamic Clustering Algorithm at timestamp t + 1.

Require: $C_{t + 1}^{0}$ .
Ensure: $CLU_{t + 1}$ .

1: Initialization: $Clu = C_{t + 1}^{0}$ ;
2: if $\sum_{x \in C_{t + 1}^{0}} f_{t + 1}^{x, 0} \leq τ$ then
3: return $Clu$ ;
4: end if
5: while 1 do
6: $k = size (Clu)$ ; $Clu_{2} = \{C_{t + 1}^{0}, \dots, C_{t + 1}^{k}\}$ ;
7: for $i = 0 : k$ do
8: do 2-means for $C_{t + 1}^{i}$ in $Clu$ , then obtain new $C_{t + 1}^{i}$ and $C_{t + 1}^{k + 1}$ ;
9: if $\sum_{x \in C_{t + 1}^{i}} f_{t + 1}^{x, i} \geq τ$ and $\sum_{x \in C_{t + 1}^{k + 1}} f_{t + 1}^{x, k + 1} \geq τ$ then
10: $Clu_{1} = \{C_{t + 1}^{0}, \dots, C_{t + 1}^{i}, \dots, C_{t + 1}^{k + 1}\}$ ; (the clusters except $C_{t + 1}^{i}$ and $C_{t + 1}^{k + 1}$ are from $Clu$ )
11: if $SSE (Clu_{1}) < SSE (Clu_{2})$ then
12: $Clu_{2} = Clu_{1}$ ;
13: end if
14: end if
15: end for
16: if $Clu \neq Clu_{2}$ then
17: $Clu = Clu_{2}$ ;
18: else
19: return $Clu$ ;
20: end if
21: end while

The function of 2-means is as follows.
Function: $(C_{t + 1}^{i}, C_{t + 1}^{k + 1})$ = 2-means( $C_{t + 1}^{i}$ )
1: randomly select two objects from the objects of $C_{t + 1}^{i}$ as the initial cluster centers of cluster A and cluster B;
2: calculate the similarity (Euclidean distance) between each object $y_{x, i}$ and each cluster center (the smaller the distance, the higher the similarity);
3: assign each object to the cluster whose center is closer;
4: recalculate the cluster centers of cluster A and cluster B;
5: repeat steps 2–4 until neither cluster changes;
6: $C_{t + 1}^{i} = A$ , and $C_{t + 1}^{k + 1} = B$ ;
7: return $C_{t + 1}^{i}$ and $C_{t + 1}^{k + 1}$ .
end Function
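
The bisecting procedure of Algorithm 3 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not part of the paper: each section is represented by its scalar traffic flow, the 2-means centers are initialized at the minimum and maximum value (instead of two random objects) so that the result is deterministic, and the names `two_means`, `sse` and `dynamic_clustering` are hypothetical.

```python
def two_means(values, max_iter=100):
    """Split scalar values into two clusters (A, B) with 2-means.

    Assumption: centres start at min/max rather than at two random objects.
    """
    ca, cb = min(values), max(values)
    a, b = list(values), []
    for _ in range(max_iter):
        a = [v for v in values if abs(v - ca) <= abs(v - cb)]
        b = [v for v in values if abs(v - ca) > abs(v - cb)]
        if not b:  # all values identical, nothing to split
            break
        new_ca, new_cb = sum(a) / len(a), sum(b) / len(b)
        if (new_ca, new_cb) == (ca, cb):  # clusters no longer change
            break
        ca, cb = new_ca, new_cb
    return a, b


def sse(clusters):
    """Sum of squared errors of all clusters around their own centres."""
    total = 0.0
    for c in clusters:
        centre = sum(c) / len(c)
        total += sum((v - centre) ** 2 for v in c)
    return total


def dynamic_clustering(flows, tau):
    """Keep bisecting while both halves carry at least tau flow and SSE drops."""
    clu = [list(flows)]
    if sum(flows) <= tau:
        return clu
    while True:
        best = clu  # plays the role of Clu_2
        for i, cluster in enumerate(clu):
            if len(cluster) < 2:
                continue
            a, b = two_means(cluster)
            if not b or sum(a) < tau or sum(b) < tau:
                continue
            # plays the role of Clu_1: all other clusters come from clu
            candidate = clu[:i] + [a] + clu[i + 1:] + [b]
            if sse(candidate) < sse(best):
                best = candidate
        if best is clu:  # no admissible split improved the SSE
            return clu
        clu = best
```

In this sketch, a cluster of small flows whose halves would fall below the threshold `tau` stays merged, while sections with large flows end up in their own clusters.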

4.3. Approximation and Perturbation

To ensure that each section satisfies w-event privacy, Laplace noise is injected into each sampled section. The scheme in [24] uses the last shared value to approximate non-sampled sections. In contrast, we propose an approximation mechanism that takes the predicted value as the value of a non-sampled section, i.e., the predicted values are used to approximate the real values. There exists a dissimilarity

dis

between the last shared values and the predicted values, and the predicted value of a section is closer to its real traffic flow than its last shared value is. Moreover, the predicted values are calculated from the shared traffic flows at the previous timestamp. Thus, the approximation also protects the real values and prevents privacy leakage.

In the perturbation mechanism,

D_{t + 1}

is the raw traffic data at timestamp t + 1, and

C_{t + 1}^{h}

is a cluster at timestamp t + 1 consisting of

n_{h}

sections. As each vehicle can only appear in at most one section at each timestamp, the sensitivity of Q (

Δ (Q)

) is 1. Then, the sanitized traffic flow of section i at timestamp t + 1 can be denoted as

M (D_{t + 1}^{i}) = \begin{cases} Q (D_{t + 1}^{i}) + Lap (Δ (Q) / ε_{t + 1}^{i}), & \text{if } i \notin C_{t + 1}^{h} \\ (Q (D_{t + 1}^{i}) + Lap (Δ (Q) / {\hat{ε}}_{t + 1}^{i})) / n_{h}, & \text{otherwise.} \end{cases}

If section

i \notin C_{t + 1}^{h}

, the sanitized traffic flows are

Q (D_{t + 1}^{i}) + Lap (Δ (Q) / ε_{t + 1}^{i})

. Otherwise, the sanitized traffic flows are

(Q (D_{t + 1}^{i}) + Lap (Δ (Q) / {\hat{ε}}_{t + 1}^{i})) / n_{h}

.
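
As a rough illustration of the perturbation rule above, the following Python sketch draws Laplace noise with the standard library and applies the two cases of the formula as written. The function names `laplace_noise` and `sanitize` and the parameter `n_h=None` (meaning "section is not in a cluster") are assumptions made for this example only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # expovariate takes the rate, so rate = 1/scale gives mean = scale
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sanitize(flow, eps, rng, n_h=None, sensitivity=1.0):
    """Perturb one traffic flow with budget eps.

    n_h is None  -> section is outside any cluster: Q + Lap(sensitivity/eps)
    n_h is given -> clustered case: (Q + Lap(sensitivity/eps)) / n_h,
                    where eps is the cluster budget
    """
    noisy = flow + laplace_noise(sensitivity / eps, rng)
    return noisy if n_h is None else noisy / n_h


rng = random.Random(2024)  # seeded only to make the sketch repeatable
single = sanitize(50.0, eps=0.5, rng=rng)
clustered = sanitize(50.0, eps=0.1, rng=rng, n_h=4)
```

A smaller `eps` gives a larger noise scale, which is why small traffic flows are the ones most distorted by perturbation.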

5. Performance Analyses

In this section, we will analyze the privacy protection, the correlation of traffic flows and effects of filtering in DP-SCR.

5.1. Privacy Analyses

In this subsection, the privacy loss and privacy protection are analyzed.

5.1.1. Privacy Loss

Privacy loss is used to measure the leakage of private information. According to the definition of differential privacy, we have the privacy loss

\begin{aligned} \ln \frac{\Pr (M (D) = r)}{\Pr (M (D^{'}) = r)} &= \ln \frac{\Pr (Q (D) + {\langle Lap (Δ (Q) / ε) \rangle}^{d} = r)}{\Pr (Q (D^{'}) + {\langle Lap (Δ (Q) / ε) \rangle}^{d} = r)} \\ &= \ln \frac{\Pr ({\langle Lap (Δ (Q) / ε) \rangle}^{d} = r - Q (D))}{\Pr ({\langle Lap (Δ (Q) / ε) \rangle}^{d} = r - Q (D^{'}))} \\ &= \ln \frac{\exp (- | r - Q (D) | ε / Δ (Q))}{\exp (- | r - Q (D^{'}) | ε / Δ (Q))} \\ &= \ln \exp (ε (| r - Q (D^{'}) | - | r - Q (D) |) / Δ (Q)) \\ &\leq \ln \exp (ε | Q (D) - Q (D^{'}) | / Δ (Q)) \\ &\leq \ln \exp (ε) \\ &= ε \end{aligned}

where r represents the output after the processing of differential privacy. Therefore, the privacy loss is determined by the allocated privacy budget

ε

and does not exceed

ε

.
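
The bound can be checked numerically for a scalar query. The sketch below is an illustration only: it evaluates the log-density ratio of two Laplace mechanisms whose true answers differ by at most the sensitivity, and the helper names `laplace_log_pdf` and `privacy_loss` are hypothetical.

```python
import math


def laplace_log_pdf(x, scale):
    """Log-density of Laplace(0, scale); the log form avoids underflow."""
    return -abs(x) / scale - math.log(2.0 * scale)


def privacy_loss(r, q_d, q_d_prime, eps, sensitivity=1.0):
    """ln( Pr[M(D) = r] / Pr[M(D') = r] ) for the scalar Laplace mechanism."""
    scale = sensitivity / eps
    return laplace_log_pdf(r - q_d, scale) - laplace_log_pdf(r - q_d_prime, scale)


# Neighbouring answers differ by 1 (the sensitivity), budget eps = 0.5.
losses = [privacy_loss(r / 2.0, 10.0, 11.0, 0.5) for r in range(-10, 51)]
worst = max(abs(v) for v in losses)  # never exceeds eps, up to rounding
```

For outputs that lie between the two true answers the loss is strictly smaller than the budget; it reaches the budget only for outputs outside that interval.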

5.1.2. Privacy Protection

The schemes compared in this paper, BA [23], BD [23], E-RescueDP [24] and CLDP [29], all satisfy w-event privacy. Here, we prove that the proposed DP-SCR also satisfies w-event privacy.

Claim 1.

The proposed DP-SCR satisfies w-event privacy.

Proof.

In DP-SCR, the perturbation phase is the only one accessing raw traffic flow. If

\sum_{j = t - w + 1}^{t} {\bar{ε}}_{j}^{i} \leq ε

for each section, Claim 1 holds, where

{\bar{ε}}_{j}^{i}

is the privacy budget allocated for section i at timestamp j in perturbation.

For section i,

ε_{t}^{i}

is the allocated privacy budget at timestamp t after the adaptive allocation of the privacy budget. Then, the privacy budget

ε_{t}^{i}

will be changed after dynamic clustering, and the changed privacy budget will be denoted as

{\bar{ε}}_{t}^{i}

. If section i belongs to

C_{t}^{h}

,

{\bar{ε}}_{t}^{i} = {\hat{ε}}_{t}^{h}

, where

{\hat{ε}}_{t}^{h} = \min_{x \in C_{t}^{h}} (ε_{t}^{x, h})

, meaning

ε_{t}^{i} \geq {\hat{ε}}_{t}^{h}

. Otherwise,

{\bar{ε}}_{t}^{i} = ε_{t}^{i}

. Thus,

ε_{t}^{i} \geq {\bar{ε}}_{t}^{i}

holds. Moreover, as

\sum_{j = t - w + 1}^{t} ε_{j}^{i} \leq ε

for section i at any successive w timestamps has been required in the mechanism of the adaptive allocation of the privacy budget,

\sum_{j = t - w + 1}^{t} {\bar{ε}}_{j}^{i} \leq ε

also holds. Finally, according to Theorem 2, the proposed DP-SCR satisfies w-event privacy, so Claim 1 holds. □
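
The argument of the proof can be illustrated with a small check: the budget spent after clustering is the minimum over the cluster, so it never exceeds the allocated budget, and every sliding window therefore stays within the total budget. The following sketch uses hypothetical helper names (`spent_budget`, `satisfies_w_event`) and made-up budget values purely for illustration.

```python
def spent_budget(own, cluster=None):
    """Budget used in perturbation: the cluster minimum, or the own budget.

    `cluster` is assumed to list the budgets of all sections in the
    cluster, including the section's own budget.
    """
    return own if cluster is None else min(cluster)


def satisfies_w_event(budgets, w, eps, tol=1e-12):
    """True if every window of w successive budgets sums to at most eps."""
    for start in range(max(1, len(budgets) - w + 1)):
        if sum(budgets[start:start + w]) > eps + tol:
            return False
    return True


# Illustrative per-timestamp budgets of one section (not from the paper).
allocated = [0.3, 0.2, 0.5, 0.3, 0.2]
clusters = [None, [0.2, 0.1], [0.5, 0.4, 0.6], None, [0.2, 0.05]]
spent = [spent_budget(a, c) for a, c in zip(allocated, clusters)]
```

Because each entry of `spent` is at most the matching entry of `allocated`, a window constraint that holds for the allocated budgets also holds for the spent ones.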

5.2. Correlation Analyses

Claim 2.

The spatial correlation between traffic flows is more remarkable than other characteristics of traffic flows in the prediction.

Proof.

Traffic flows have four characteristics, i.e., temporal correlation, spatial correlation, historical correlation and multistate. The authors in [33] indicated that the prediction of traffic flows is related only to the temporal, spatial and historical correlation between traffic flows, and they illustrated that the multistate characteristic is not useful for prediction. Upon further analysis, the historical correlation also has little effect on the prediction of traffic flows, for the following reasons: (1) On-road traffic events, such as accidents and road closures, affect the traffic flows in the transportation system, and these effects cannot be predicted a priori [35]. (2) Off-road events have a major impact on the traffic flows and may not be included in the usual historical traffic flows [35]. (3) In the sharing of real-time traffic flows, the sampling interval is too short for historical traffic flows to predict the traffic flows at the next timestamp. Thus, the prediction of traffic flows is mainly correlated with temporal and spatial characteristics. The authors in [36,37] have also emphasized that most mechanisms for traffic flow prediction are based mainly on temporal and spatial correlation. However, the spatial characteristics of traffic flows reflect the correlation between traffic flows more distinctly than the temporal characteristics, as illustrated below.

The Pearson correlation coefficient is used to calculate the spatial correlation between traffic flows, i.e.,

ρ_{X, Y} = COV (X, Y) / (σ_{X} σ_{Y})

, where

COV

is covariance,

σ_{X}

and

σ_{Y}

are the standard deviations of X and Y, respectively. In the experiment, the traffic flows of the target section and the traffic flows of its linked sections at 160 successive timestamps serve as X and Y, respectively. The resulting spatial correlation between traffic flows is 0.7072. The autocorrelation coefficient is adopted to calculate the temporal correlation of the above-mentioned traffic flows, where the lag ranges over [0, 10] timestamps, and the sample size is 10,000, which is large enough for the coefficient calculation. Figure 4 shows the temporal correlation of X, where all results are smaller than 0.3. As a larger correlation value means a more remarkable correlation, the spatial correlation between traffic flows is more striking than the temporal correlation between traffic flows.

That is, the prediction based on spatial correlation obtains more accurate results than that based on other characteristics of traffic flows. Thus, Claim 2 holds. □
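
The two coefficients used in this analysis can be computed as in the sketch below. It is a minimal illustration with standard textbook definitions; the function names `pearson` and `autocorrelation` and the toy series are assumptions for this example, not the paper's data or code.

```python
import math


def pearson(x, y):
    """Pearson correlation COV(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag (lag 0 gives 1)."""
    n = len(x)
    mx = sum(x) / n
    var = sum((a - mx) ** 2 for a in x)
    return sum((x[t] - mx) * (x[t + lag] - mx) for t in range(n - lag)) / var


# Toy series only: a target section and a linked section that tracks it.
target = [12, 15, 11, 18, 20, 17, 14, 19]
linked = [10, 14, 10, 16, 19, 15, 13, 18]
spatial = pearson(target, linked)
temporal = [autocorrelation(target, lag) for lag in range(1, 4)]
```

Comparing `spatial` with the entries of `temporal` mirrors the comparison made in the proof between the spatial coefficient and the autocorrelation values.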

5.3. Effects of Filtering on DP-SCR

In many differential privacy schemes, the sanitized traffic flows cannot be shared directly because the noise introduced by perturbation may reduce the accuracy of the shared traffic flows. Thus, a filtering mechanism is used to improve the accuracy of the sanitized traffic flows after perturbation.

E-RescueDP [24] uses the Kalman Filter to improve the accuracy of the sanitized traffic flows. To compare the effects of filtering in the proposed DP-SCR and E-RescueDP, we also use the Kalman Filter (KF) to deal with the noise in DP-SCR.

Inspired by the FAST algorithm [17], KF [38] is used to improve the accuracy of the sanitized traffic flows

M (D_{t + 1}^{i})

. The filtering mechanism includes two steps:

Predict

and

Correct

, which are shown in Algorithm 4.

Algorithm 4: Filtering with KF for

M (D_{t + 1}^{i})

.

Require: the previous shared $r_{t}^{i}$ and noisy measurement $z_{t + 1}^{i} = M (D_{t + 1}^{i})$ .
Ensure: the posterior estimate ${\hat{x}}_{t + 1}^{i}$ .

1: $KFPredict (t + 1)$ :
2: ${\bar{x}}_{t + 1}^{i} = r_{t}^{i}$ ;
3: ${\bar{P}}_{t + 1}^{i} = P_{t}^{i} + G$ ;
4: $KFCorrect (t + 1)$ :
5: $K_{t + 1}^{i} = {\bar{P}}_{t + 1}^{i} / ({\bar{P}}_{t + 1}^{i} + H)$ ;
6: ${\hat{x}}_{t + 1}^{i} = {\bar{x}}_{t + 1}^{i} + K_{t + 1}^{i} (z_{t + 1}^{i} - {\bar{x}}_{t + 1}^{i})$ ;
7: $P_{t + 1}^{i} = (1 - K_{t + 1}^{i}) {\bar{P}}_{t + 1}^{i}$ .

The posterior estimate

{\hat{x}}_{t + 1}^{i}

is the final shared traffic flow of section i at timestamp t + 1, i.e.,

r_{t + 1}^{i} = {\hat{x}}_{t + 1}^{i}

. The detailed principles and processes of KF are explained in the FAST algorithm; readers may refer to [17].
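
One Predict/Correct round of the scalar filter can be sketched as follows. This is a hedged illustration: it assumes that G and H play the role of the process-noise and measurement-noise variances, as in standard scalar Kalman filtering, and the function name `kf_step` and the numbers in the usage lines are hypothetical.

```python
def kf_step(r_prev, p_prev, z, g, h):
    """One scalar Kalman round: returns (posterior estimate, posterior variance).

    r_prev : previously shared value, used as the prior estimate
    p_prev : posterior error variance from the previous timestamp
    z      : noisy (sanitized) measurement at the current timestamp
    g, h   : assumed process-noise and measurement-noise variances
    """
    # Predict
    x_prior = r_prev
    p_prior = p_prev + g
    # Correct
    gain = p_prior / (p_prior + h)
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post


# Illustrative numbers: prior 10, measurement 14, equal trust in both.
estimate, variance = kf_step(r_prev=10.0, p_prev=1.0, z=14.0, g=1.0, h=2.0)
```

With these inputs the gain is 0.5, so the estimate lands halfway between the prior and the measurement; a larger `h` would pull it toward the prior.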

Some experiments on filtering are conducted in this paper. As shown in Figure 5, the

MRE

of DP-SCR is only slightly influenced by the Kalman Filter and has a small value. This indicates that DP-SCR also has high accuracy without the Kalman Filter. Thus, the sanitized traffic flows in DP-SCR can be shared directly, without the Kalman Filter.

In addition, the

MRE

of E-RescueDP, with or without the Kalman Filter, is larger than that of DP-SCR. This indicates that DP-SCR is superior to E-RescueDP in terms of accuracy.

5.4. Complexity Analyses

The proposed DP-SCR scheme is compared with the other four schemes (i.e., BA, BD, E-RescueDP, and CLDP) in terms of time complexity, and the comparison results are shown in Table 2, where d is the number of sections, m is the number of groups/clusters in E-RescueDP and DP-SCR, and e represents the number of iterations required for the convergence of the 2-means in DP-SCR. As can be seen, BA, BD, and CLDP schemes are faster than E-RescueDP and DP-SCR, and DP-SCR may be faster than E-RescueDP when the number of sections is large.

6. Experimental Simulation and Evaluation

In this section, the related experiments are simulated on real-world datasets, and the performance of the proposed DP-SCR is compared with that of the schemes E-RescueDP [24], BD [23] and BA [23]. All experiments are run in Matlab 2018a on a PC with an Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz, 4.00 GB of main memory, and a 500 GB hard disk, running the Microsoft Windows 7 operating system.

The datasets of our experiments include the vehicular mobility dataset and the street layout dataset. The vehicular mobility dataset is mainly based on the real data collected by the General Departmental Council of Val de Marne (94) in France (downloaded from http://vehicular-mobility-trace.github.io/, accessed on 2 March 2024). It comprises around 10,000 traces, over rush hour periods of two hours in the morning (7 a.m.–9 a.m.) and two hours in the evening (5 p.m.–7 p.m.). The real street layout of the Creteil roundabout area (sampled area) is obtained from the OpenStreetMap database, as shown in Figure 6. Here, each 400 m stretch of road serves as one section.

Subsequently, a traffic flow dataset with 160 timestamps for each section is created, which is sampled every 85 s from the vehicular mobility dataset. Moreover, the traffic flow dataset contains vehicle numbers, vehicle coordinates on the two-dimensional plane (x and y coordinates in meters), vehicle speed (in meters per second), and vehicle id. The target section is randomly selected from the sections generated by function Q. In any case, to ensure the credibility of our experiments, all experiments involving the Laplace mechanism are conducted 100 times, and the average value of these 100 experiment results is represented by the points in the figures.

6.1. Data Utility of the Shared Traffic Flow

In this section, we conduct experiments on the designed adaptive allocation of privacy budget and dynamic clustering to evaluate their data utility. The accuracy of the shared traffic flows reflects the data utility. The mean absolute error (

MAE

) and the mean relative error (

MRE

) serve as the accuracy metrics. The smaller the

MAE

and

MRE

are, the more accurate the traffic flows are. Let

F^{i, n} = \{f_{1 + m}^{i}, f_{2 + m}^{i}, \dots, f_{n + m}^{i}\}

be the raw traffic flows, and let

R^{i, n} = \{r_{1 + m}^{i}, r_{2 + m}^{i}, \dots, r_{n + m}^{i}\}

be the sanitized traffic flows of section i at n successive timestamps before filtering. Then, the formulas for the

MAE

and

MRE

, respectively, are

MAE (F^{i, n}, R^{i, n}) = \frac{1}{n} \sum_{j = 1}^{n} | f_{j + m}^{i} - r_{j + m}^{i} |, and

MRE (F^{i, n}, R^{i, n}) = \frac{1}{n} \sum_{j = 1}^{n} \frac{| f_{j + m}^{i} - r_{j + m}^{i} |}{\max (r_{j + m}^{i}, δ)},

where

δ

is the bound of small traffic flows, which is used to reduce the effect of excessively small traffic flows and is equal to 0.1% of

\sum_{j = 1}^{n} f_{j + m}^{i}

.
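
The two metrics can be computed as in the following sketch. It assumes that the maximum in the MRE denominator is taken per timestamp inside the sum and that the bound defaults to 0.1% of the total raw flow, as described above; the function names `mae` and `mre` and the sample values are illustrative only.

```python
def mae(raw, shared):
    """Mean absolute error between raw and shared traffic flows."""
    return sum(abs(f - r) for f, r in zip(raw, shared)) / len(raw)


def mre(raw, shared, delta=None):
    """Mean relative error with a lower bound delta on the denominator.

    Assumption: delta defaults to 0.1% of the sum of the raw flows.
    """
    if delta is None:
        delta = 0.001 * sum(raw)
    return sum(abs(f - r) / max(r, delta) for f, r in zip(raw, shared)) / len(raw)


raw_flows = [10.0, 20.0, 30.0]      # illustrative values, not paper data
shared_flows = [12.0, 18.0, 30.0]
error_abs = mae(raw_flows, shared_flows)
error_rel = mre(raw_flows, shared_flows)
```

The bound `delta` only matters when a shared flow is close to zero, where an unbounded denominator would otherwise dominate the average.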

(1) Prediction accuracy evaluation for DP-SCR. The allocation of the privacy budget greatly affects the accuracy of the predicted traffic flows. RescueDP and E-RescueDP are the baseline temporal-based schemes designed for sharing real-time data with w-event privacy. Since E-RescueDP performs better than RescueDP, we compare our scheme only with E-RescueDP in terms of prediction accuracy. In these experiments, the privacy budget is

ε = 1

.

E-RescueDP is based on temporal correlation and uses the Elman network [39] (an RNN algorithm). In the Elman network, there are 5 neurons in the input layer, 18 in the hidden layer and 1 in the output layer. The designed diagram of the Elman network is shown in Figure 7. The traffic flows of the target section at the first 80 successive timestamps are selected as the training set, and the remaining traffic flows are taken as the testing set. As depicted in Figure 8, the blue line represents the training loss (i.e., mean squared error) of the Elman network during training. The training loss stabilizes and is equal to 0.012323 at 4997 epochs. In Figure 9, the blue dashed lines depict the raw traffic flows, while the orange solid lines represent the predicted traffic flows. Figure 9a displays the predicted results of the E-RescueDP scheme. There are significant differences between the predicted results of E-RescueDP and the raw traffic flows: the

MAE

and

MRE

of E-RescueDP are 5.0875 and 0.4881, respectively.

In the proposed DP-SCR, the traffic data of the above target section and its linked sections at the last 80 timestamps are selected as experimental data. For a fair comparison, the model

M_{v}

is also trained on an Elman network with 18 neurons in the hidden layer and 1 in the output layer. The input layer has 3 neurons, corresponding to the shared traffic flow, the maximal speed limit, and the sampling period. The prediction results of DP-SCR are shown in Figure 9b, where the predicted traffic flows match the raw traffic flows well. Moreover, the

MAE

and

MRE

of DP-SCR are 3.1000 and 0.2680, respectively.

In summary, the predicted results in DP-SCR are more accurate than those in E-RescueDP, which means the data utility of the proposed scheme is higher.

(2) Accuracy evaluation for dynamic clustering. The dynamic clustering based on bisecting k-means is adopted to reduce the perturbation error caused by small traffic flows, which would otherwise result in a loss of data utility. For the experiments on dynamic clustering, an experimental dataset based on real data is created, which includes 5000 sections with small traffic flows, each allocated a random privacy budget. The

MRE

of the bisecting k-means is compared with that of the non-partitioned operation and dynamic programming, where dynamic programming is the partitioning operation in [24]. Figure 10 illustrates the

MRE

results of the different strategies for different sections. The

MRE

of DP-SCR is smaller than that of the other schemes, indicating that the dynamic clustering in DP-SCR yields shared traffic flows with higher data utility than the other partitioning operations.

6.2. Data Utility vs. Privacy Budget $(ε)$

In this section, experiments on the

MAE

and

MRE

are conducted with

ε

varying from 0.1 to 1.0. The

MAE

and

MRE

of DP-SCR are compared with those of the schemes BA [23], BD [23], E-RescueDP [24] and CLDP [29]. BA and BD are baseline w-event privacy schemes for real-time data release, and CLDP is the baseline w-event privacy scheme with local differential privacy. Figure 11 compares the

MAE

and

MRE

of the shared traffic flows as

ε

changes, where w is fixed at 10. The results indicate that the

MAE

and

MRE

of DP-SCR are significantly smaller than those of the other schemes for every privacy budget tested. Moreover, the

MAE

and

MRE

of BA, BD and CLDP decrease as

ε

increases, and the magnitude of the decrease becomes smaller. The

MAE

and

MRE

of E-RescueDP and DP-SCR change little. There are three reasons for these results. First, BA, BD and CLDP allocate too small a privacy budget for perturbation, which introduces more noise into the shared data. Second, the prediction of DP-SCR is more accurate than that of E-RescueDP, so the

MAE

and

MRE

of DP-SCR are smaller than those of E-RescueDP. Third, since E-RescueDP and DP-SCR both adopt adaptive allocation of the privacy budget, which provides more reasonable privacy budgets, their

MAE

and

MRE

are relatively stable when

ε

changes.

6.3. Data Utility vs. Sliding Window Size $(w)$

The data utility of DP-SCR is compared with that of the schemes BA, BD, E-RescueDP and CLDP, where w varies from 5 to 45. The results are shown in Figure 12, where the

MAE

and

MRE

of DP-SCR are lower than those of the other schemes. Also, the

MAE

and

MRE

of BA, BD and CLDP increase as w increases, and the

MAE

and

MRE

of E-RescueDP and DP-SCR are relatively stable. This is because the adaptive allocation of privacy budget and dynamic processing improve the accuracy of the shared traffic flows and make them robust to changes in w.

7. Conclusions

In this paper, we propose a scheme, named DP-SCR, to ensure the sharing of real-time traffic flows with high data utility under privacy protection. DP-SCR consists of four key components: adaptive allocation of privacy budget, dynamic clustering, approximation and perturbation. In the proposed DP-SCR, we take advantage of the spatial correlation prediction and the novel clustering strategy to improve the accuracy of the shared traffic flows. Moreover, the results of the experiments on real-world datasets have also shown that the shared traffic flows in DP-SCR are more accurate than those in the existing baseline w-event privacy schemes. Also, in terms of privacy protection, DP-SCR has been proven to satisfy w-event privacy, which provides strong privacy protection to the shared traffic flows.

However, some aspects can still be improved in future work. First, more characteristics of traffic flows may be considered together in prediction to improve the data utility. Second, genetic algorithms [40] may be used to improve the accuracy of the spatial correlation prediction. Finally, other privacy-preserving methods such as secure data deduplication [41], blockchain-based secure sharing schemes [42] and federated learning [43,44] can be used to enhance the data utility and security.

Author Contributions

Methodology, J.L., B.X., D.Z. and D.Q.; Software, J.L. and B.X.; Validation, J.L., B.X. and D.Q.; Formal analysis, J.L.; Writing—original draft, J.L. and D.Z.; Writing—review & editing, B.X. and D.Q.; Supervision, D.Z.; Funding acquisition, J.L. and D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (Grant no. 62202071, 62302072), in part by the China Postdoctoral Science Foundation (Grant no. 2022M710518, 2022M710520), and in part by the Natural Science Foundation of Chongqing, China (Grant no. CSTB2022NSCQ-MSX0358, CSTB2022NSCQ-MSX1217).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ding, X.; Zhou, W.; Sheng, S.; Bao, Z.; Choo, K.K.R.; Jin, H. Differentially private publication of streaming trajectory data. Inf. Sci. 2020, 538, 159–175.
2. Li, L.; Jiang, R.; He, Z.; Chen, X.M.; Zhou, X. Trajectory data-based traffic flow studies: A revisit. Transp. Res. Part Emerg. Technol. 2020, 114, 225–240.
3. Liu, Y.; James, J.; Kang, J.; Niyato, D.; Zhang, S. Privacy-preserving traffic flow prediction: A federated learning approach. IEEE Internet Things J. 2020, 7, 7751–7763.
4. Le, J.; Lei, X.; Mu, N.; Zhang, H.; Zeng, K.; Liao, X. Federated Continuous Learning With Broad Network Architecture. IEEE Trans. Cybern. 2021, 51, 3874–3888.
5. Yang, X.; Gu, B.; Zheng, B.; Ding, B.; Han, Y.; Yu, K. Toward Incentive-Compatible Vehicular Crowdsensing: An Edge-Assisted Hierarchical Framework. IEEE Netw. 2022, 36, 162–167.
6. Chiou, J.M.; Liou, H.T.; Chen, W.H. Modeling time-varying variability and reliability of freeway travel time using functional principal component analysis. IEEE Trans. Intell. Transp. Syst. 2019, 22, 257–266.
7. Wu, T.; Zhou, P.; Liu, K.; Yuan, Y.; Wang, X.; Huang, H.; Wu, D.O. Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks. IEEE Trans. Veh. Technol. 2020, 69, 8243–8256.
8. Meese, C.; Chen, H.; Asif, S.A.; Li, W.; Shen, C.C.; Nejad, M. Bfrt: Blockchained federated learning for real-time traffic flow prediction. In Proceedings of the IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Taormina, Italy, 16–19 May 2022; pp. 317–326.
9. Miglani, A.; Kumar, N. Deep learning models for traffic flow prediction in autonomous vehicles: A review, solutions, and challenges. Veh. Commun. 2019, 20, 100184.
10. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4909–4926.
11. Morlock, F.; Rolle, B.; Bauer, M.; Sawodny, O. Forecasts of electric vehicle energy consumption based on characteristic speed profiles and real-time traffic data. IEEE Trans. Veh. Technol. 2019, 69, 1404–1418.
12. Gazdag, A.; Lestyán, S.; Remeli, M.; Ács, G.; Holczer, T.; Biczók, G. Privacy pitfalls of releasing in-vehicle network data. Veh. Commun. 2023, 39, 100565.
13. De Montjoye, Y.A.; Hidalgo, C.A.; Verleysen, M.; Blondel, V.D. Unique in the Crowd: The privacy bounds of human mobility. Sci. Rep. 2013, 3, 1376.
14. Dwork, C. Differential Privacy: A Survey of Results. In Proceedings of the International Conference on Theory and Applications of MODELS of Computation (TAMC), Xi’an, China, 25–29 April 2008; pp. 1–19.
15. Dwork, C.; Naor, M.; Pitassi, T.; Rothblum, G.N. Differential Privacy Under Continual Observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing (STOC), Cambridge, MA, USA, 6–8 June 2010; pp. 715–724.
16. Chan, T.H.H.; Shi, E.; Song, D. Private and Continual Release of Statistics. ACM Trans. Inf. Syst. Secur. 2011, 14, 26:1–26:24.
17. Fan, L.; Xiong, L. An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy. IEEE Trans. Knowl. Data Eng. 2014, 26, 2094–2106.
18. Fan, L.; Xiong, L.; Sunderam, V. Differentially private multi-dimensional time series release for traffic monitoring. In Differentially Private Multi-Dimensional Time Series Release for Traffic Monitoring; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7964, pp. 33–48.
19. Chen, Y.; Machanavajjhala, A.; Hay, M.; Miklau, G. PeGaSus: Data-Adaptive Differentially Private Stream Processing. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Dallas, TX, USA, 30 October–3 November 2017; pp. 1375–1388.
20. Ren, X.; Wang, S.; Yao, X.; Yu, C.M.; Yu, W.; Yang, X. Differentially Private Event Sequences Over Infinite Streams With Relaxed Privacy Guarantee. In Differentially Private Event Sequences over Infinite Streams with Relaxed Privacy Guarantee; Springer: Cham, Switzerland, 2019; Volume 11604, pp. 272–284.
21. Gati, N.J.; Yang, L.T.; Feng, J.; Nie, X.; Ren, Z.; Tarus, S.K. Differentially private data fusion and deep learning framework for cyber–physical–social systems: State-of-the-art and perspectives. Inf. Fusion 2021, 76, 298–314.
22. Li, Q.; Heusdens, R.; Christensen, M.G. Communication efficient privacy-preserving distributed optimization using adaptive differential quantization. Signal Process. 2022, 194, 108456.
23. Kellaris, G.; Papadopoulos, S.; Xiao, X.; Papadias, D. Differentially Private Event Sequences over Infinite Streams. Proc. VLDB Endow. 2014, 7, 1155–1166.
24. Wang, Q.; Zhang, Y.; Lu, X.; Wang, Z.; Qin, Z.; Ren, K. Real-time and Spatio-temporal Crowd-sourced Social Network Data Publishing with Differential Privacy. IEEE Trans. Dependable Secur. Comput. 2016, 15, 591–606.
25. Huo, Y.; Yong, C.; Lu, Y. Re-ADP: Real-Time Data Aggregation with Adaptive ω-Event Differential Privacy for Fog Computing. Wirel. Commun. Mob. Comput. 2018, 2018, 6285719.
26. Wang, H.; Cai, S.; Liu, P.; Zhang, J.; Shen, Z.; Liu, K. DP-STGAT: Traffic statistics publishing with differential privacy and a spatial-temporal graph attention network. Inf. Sci. 2023, 623, 258–274.
27. Wang, T.; Chen, J.Q.; Zhang, Z.; Su, D.; Cheng, Y.; Li, Z.; Li, N.; Jha, S. Continuous release of data streams under both centralized and local differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Virtual, 15–19 November 2021; pp. 1237–1253.
28. Ren, X.; Shi, L.; Yu, W.; Yang, S.; Zhao, C.; Xu, Z. LDP-IDS: Local differential privacy for infinite data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Philadelphia, PA, USA, 12–17 June 2022; pp. 1064–1077.
29. Errounda, F.Z.; Liu, Y. Collective location statistics release with local differential privacy. Future Gener. Comput. Syst. 2021, 124, 174–186.
30. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284.
31. Dwork, C. Differential Privacy. In Proceedings of the International Conference on Automata, Languages and Programming (ICALP), Venice, Italy, 10–14 July 2006; pp. 1–12.
32. McSherry, F.D. Privacy Integrated Queries: An Extensible Platform for Privacy-preserving Data Analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Providence, RI, USA, 29 June–2 July 2009; pp. 19–30.
33. Lu, H.; Sun, Z.; Qu, W. Big Data-Driven Based Real-Time Traffic Flow State Identification and Prediction. Discret. Dyn. Nat. Soc. 2015, 2015, 284906.
34. Wang, W.; Li, W.; Ren, G. A speed-flow relationship model of highway traffic flow. J. Harbin Inst. Technol. 2005, 12, 331–335.
35. Alvarez-Marquez, A.; Aguilera, I.; Gentil, M.A.; Cabello, V.; Gonzalez-Escribano, M.F.; Nunez-Roldan, A. Traffic Flow Prediction for Road Transportation Networks With Limited Traffic Data. IEEE Trans. Intell. Transp. Syst. 2015, 16, 653–662.
36. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.Y. Traffic Flow Prediction With Big Data: A Deep Learning Approach. IEEE Trans. Intell. Transp. Syst. 2015, 16, 865–873.
37. Liebig, T.; Piatkowski, N.; Bockermann, C.; Morik, K. Dynamic route planning with real-time traffic predictions. Inf. Syst. 2017, 64, 258–265.
38. Kalman, R. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
39. Elman, J.L. Distributed representations, simple recurrent networks, and grammatical structure. Mach. Learn. 1991, 7, 195–225.
40. Rangel, H.R.; Puig, V.; Farias, R.L.; Flores, J.J. Short-term demand forecast using a bank of neural network models trained using genetic algorithms for the optimal management of drinking water networks. J. Hydroinform. 2017, 19, 1–16.
41. Zhang, D.; Le, J.; Mu, N.; Wu, J.; Liao, X. Secure and efficient data deduplication in jointcloud storage. IEEE Trans. Cloud Comput. 2021, 11, 156–167.
42. Zhao, R.; Xu, C.; Zhu, Z.; Mo, W. A Blockchain-Based Secure Sharing Scheme for Electrical Impedance Tomography Data. Mathematics 2024, 12, 1120.
43. Le, J.; Zhang, D.; Lei, X.; Jiao, L.; Zeng, K.; Liao, X. Privacy-preserving federated learning with malicious clients and honest-but-curious servers. IEEE Trans. Inf. Forensics Secur. 2023, 18, 4329–4344.
44. Zhang, L.; Lei, X.; Shi, Y.; Huang, H.; Chen, C. Federated Learning for IoT Devices with Domain Generalization. IEEE Internet Things J. 2023, 10, 9622–9633.

Figure 1. Partial road networks and its transition probabilities.

Figure 2. The system model.

Figure 3. The flowchart of DP-SCR.

Figure 4. The temporal correlation (autocorrelation) of X.

Figure 5. The effects of filtering for E-RescueDP and DP-SCR.

Figure 6. The street layout of training data.

Figure 7. The designed diagram of the Elman network.

Figure 8. The training of the Elman network.

Figure 9. (a) The predicted results in E-RescueDP; (b) The predicted results in DP-SCR.

Figure 10. The effects of dynamic processing.

Figure 11. (a) The MAE of the shared traffic flows with

ε

changing (w = 10); (b) The MRE of the shared traffic flows with

ε

changing (w = 10).

Figure 12. (a) The MAE of the shared traffic flows with w changing (

ε

= 1); (b) The MRE of the shared traffic flows with w changing (

ε

= 1).

Table 1. The mathematical notations.

Notations	Semantic Meanings
$M, Q, SC$	Laplace mechanism, query function and prediction function, respectively.
S, $S_{t}$	An infinite stream and the stream prefix of S at timestamp t, respectively.
$D_{t}$	The raw traffic data at timestamp t.
$F_{i}$ , $F^{i, n}$	The raw traffic flows at timestamp i; the raw traffic flows of section i at successive n timestamps.
$f_{k}^{i}$ , $f_{k}^{x, h}$ , ${\hat{f}}_{k}^{i}$	The raw traffic flow of section i at timestamp k; $f_{k}^{i}$ at h-th cluster; the predicted traffic flow of section i at timestamp k.
$R_{i}$ , $R^{i}$ , $R^{i, n}$	The sanitized traffic flows at timestamp i, the sanitized traffic flows of section i at all timestamps, and the sanitized traffic flows of section i at successive n timestamps.
$r_{k}^{i}$ , ${\bar{r}}_{k}^{i} (out)$	The sanitized traffic flows and the traffic outflow of section i at timestamp k, respectively.
$rn_{i, j}$ , $RN$	The transition probability that the traffic flows of section i enter section j; the set of $rn_{i, j}$ .
$tp_{t}^{i, j}$ , $tp_{t}^{i}$	The traffic parameter between section i and section j at timestamp t; the set of the traffic parameters between section i and its linked sections at timestamp t.
$dis, SSE$	The dissimilarity between the predicted traffic flows and the last shared traffic flow; the squared error.
$V_{max}^{i}$ , $V_{t}^{i}$	The maximal speed limit of section i; the average speed of section i at timestamp t.
$T_{samples}$	The sampling period of raw traffic data.
$CA_{i}$	The maximum capacity of section i.
$θ_{i}$ , $ρ$	The scale factor of corrections.
$C_{t}^{i}$ , $c_{t}^{i}$	The i-th cluster at timestamp t and the cluster center of $C_{t}^{i}$ , respectively.
$CLU_{t}$	The set of clusters at timestamp t.
$ε_{t}^{i}$ , $ε_{t}^{x, i}$ , ${\hat{ε}}_{i}^{j}$	The privacy budget of section i at timestamp t; the privacy budget included in $C_{t}^{i}$ ; the privacy budget of $C_{t}^{i}$ .
$ε_{r}$ , $ε_{max}$	The remaining privacy budget; the maximum privacy budget allowed for sections.
$λ_{i}^{j}$	The perturbation error of section i at timestamp j.

Table 2. Comparison of time complexity.

Scheme	BA [23]	BD [23]	E-RescueDP [24]	CLDP [29]	DP-SCR
Time complexity	$O(d)$	$O(d)$	$O(md^{2})$	$O(d)$	$O(mde)$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
