Article

Traffic Volume Estimation Based on Spatiotemporal Correlation Adaptive Graph Convolutional Network

1 College of Electrical and Power Engineering, Taiyuan University of Technology, Taiyuan 030024, China
2 School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(4), 599; https://doi.org/10.3390/sym17040599
Submission received: 28 February 2025 / Revised: 9 April 2025 / Accepted: 11 April 2025 / Published: 15 April 2025

Abstract

Traffic volume estimation is a fundamental task in Intelligent Transportation Systems (ITS). The highly unbalanced and asymmetric spatiotemporal distribution of traffic flow, combined with the sparse and uneven deployment of sensors, poses significant challenges for accurate estimation. To address these issues, this paper proposes a novel traffic volume estimation framework that combines a dynamic adjacency matrix Graph Convolutional Network (GCN) with a multi-scale transformer structure to capture spatiotemporal correlations. First, an adaptive speed-flow correlation module captures global road correlations based on historical speed patterns. Second, a dynamic recurrent graph convolution network captures both short- and long-range correlations between roads. Third, a multi-scale transformer module models the short-term fluctuations and long-term trends of traffic volume at multiple scales, capturing temporal correlations. Finally, the output layer fuses the spatiotemporal correlations to estimate the global road traffic volume at the current time. Experiments on the PEMS-BAY dataset in California show that the proposed model outperforms the baseline models and achieves good estimation results with only 30% sensor coverage. Ablation and hyperparameter experiments validate the effectiveness of each component of the model.

1. Introduction

Speed, density, and volume constitute the three main macroscopic traffic features that reflect the operational status of urban transportation networks. With the rapid development of ITS and autonomous driving technologies, traffic volume estimation has become a crucial task for traffic management [1]. Accurate traffic volume data help urban planners improve scheduling, network planning, accident prevention, and emergency response. They can also guide infrastructure development and reduce urban pollution [2,3].
There has been a series of innovative works on traffic volume estimation [4,5,6,7,8]. Traditional spatiotemporal modeling methods rely on static traffic network assumptions and simple local information inference. This makes it challenging to capture the dynamic changes in traffic volume across both space and time. To tackle these challenges, recent research has increasingly focused on spatiotemporal modeling based on Graph Neural Network (GNN) and deep learning methods [9,10]. However, challenges remain, such as the unbalanced distribution of traffic volume, missing local data, and traffic volume delays, especially in regions with sparse traffic data or external disruptions like accidents and weather [11,12].
Traffic volume often shows spatial imbalances, with noticeable differences in flow between regions. Transportation hubs and commercial centers tend to generate higher volumes due to geographic and infrastructural factors, and this spatial imbalance requires models to accurately estimate global volume trends [13]. Moreover, traffic volume is highly variable over time, with patterns differing during peak hours, holidays, and special events. This temporal variability adds further complexity to modeling, requiring robust, adaptable models.
Traditional estimation methods often rely on homogeneous assumptions or local information, which limits their accuracy in complex environments. Basic spatiotemporal models may fail to handle sudden traffic changes, leading to inaccurate estimates in congested areas or overly smooth predictions for off-peak roads.
Missing local data are another critical challenge. Sparse sensor deployment and incomplete data lead to missing traffic volume information, reducing model accuracy. Fixed sensors are costly and sparse, and the penetration rate of mobile sensors is unknown. This data shortage complicates traffic volume estimation across the entire network.
Localized data loss affects model accuracy and training. Missing data prevent traditional supervised learning methods from functioning effectively, as incomplete data make it difficult for loss functions to measure prediction errors. Approaches like matrix completion [14], weighted loss functions [15], and GAN-based methods [16] have been proposed but have limitations in large-scale networks, high computational cost, and the need for precise hyperparameter tuning.
Despite progress, localized data loss remains a significant obstacle. For instance, the assumption of spatiotemporal smoothness in traffic models can lead to incorrect traffic distribution patterns, especially when propagation delays and time lags are not considered. This may result in overfitting, inaccurate estimates for congested areas, and overly smooth predictions for off-peak segments.
In summary, the crucial challenges in traffic volume estimation remain unresolved. To improve accuracy, models must better capture spatiotemporal correlations and adapt to unexpected events and missing data.
To address these problems, this paper introduces a novel traffic volume estimation model, DTSAGCN (Dynamic Temporal-Spatial Attention Graph Convolutional Network). First, we design a Dynamic Graph Convolution Recurrent (DGCR) module to capture the local dynamic spatial correlations between road traffic flows. Next, an Adaptive Speed-Flow Correlation (ASFC) module is proposed to adaptively model the dynamic relationships across global roads. Finally, a Multi-Scale Transformer (MSTF) module is adopted to capture long-term temporal dependencies through a patch mechanism and pyramid structure. Together, these modules aim to comprehensively enhance the accuracy and robustness of traffic volume estimation.
The main contributions of this paper are summarized as follows:
  • We employ a dynamic adjacency matrix and DGCR module to adaptively model the spatial correlations of road traffic. The ASFC module adapts to historical speed data to extract road patterns and capture global spatial relationships, improving model performance.
  • We design the MSTF module with a pyramid structure. This structure efficiently captures periodic and long-term dependencies in traffic flow by segmenting time-series data from the state matrix.
  • Experiments on the PEMS-BAY real-world traffic dataset demonstrate that our model achieves improvements with sensor coverage rates ranging from 30% to 50%, outperforming all baseline models. Ablation and hyperparameter experiments further validate the rationality of the model architecture.
The structure of this paper is organized as follows: Section 2 reviews relevant work and achievements in traffic flow estimation; Section 3 presents the problem formulation; Section 4 introduces the construction details of our model; Section 5 evaluates the model on real-world datasets; and Section 6 concludes the paper and provides insights for future development.

2. Literature Review

2.1. Method of Traffic Volume Estimation

In traffic volume estimation research, methods can generally be divided into two main categories: model-based estimation methods and data-based estimation methods. Each of these approaches has its own advantages and is suitable for different conditions.
Model-based estimation methods typically rely on the physical theories and mathematical models of traffic flow, estimating traffic conditions by calibrating and validating model parameters. Classical macroscopic traffic flow models [17,18,19] (such as the first-order Lighthill–Whitham–Richards (LWR) model, the higher-order Aw–Rascle–Zhang (ARZ) model, and the Payne–Whitham (PW) model) are commonly used for traffic state estimation. These methods are relatively simple and computationally efficient: they estimate the traffic state by iteratively computing the propagation equations of the traffic flow. However, two main issues remain:
  • Model-driven approaches require large amounts of accurate and complete traffic data for parameter calibration and validation. This results in high data requirements and complex tasks.
  • In real-world traffic networks, traffic flow often does not meet the model’s assumption of evenly distributed vehicles. This limits the accuracy of these methods and makes them less adaptable to dynamic changes in the road network.
These methods are currently facing bottlenecks and are rarely seen in the latest traffic volume estimation research, although they are still used in some real-time traffic signal control applications. A promising improvement for model-based methods is Physics-Informed Deep Learning (PIDL), which combines physical models with deep learning, leveraging the strengths of both to enhance the accuracy of traffic flow estimation [20,21].
Unlike model-based methods, data-driven estimation methods rely on large amounts of historical data or real-time sensor data to estimate traffic volume through statistical analysis or machine learning algorithms. These methods typically do not depend on complex physical models but instead focus on capturing features and patterns from the data.
  • Statistical Methods: For example, time-series models, such as ARIMA [22] and Kalman filters [23], are mainly used to process traffic volume time-series data. These methods generally perform well with small datasets but struggle in more complex environments.
  • Machine Learning Methods: In recent years, methods such as deep learning, support vector machines (SVM), and random forests (RF) have been widely used for traffic volume estimation. By training on datasets, machine learning models can automatically identify spatiotemporal features in traffic volume and make accurate estimations. These methods offer greater adaptability. With advancements in sensor technology and the increase in the amount of data, machine learning-based methods have gradually become the mainstream in traffic volume estimation.
With the development of machine learning technologies, deep learning-based traffic volume estimation methods have attracted more attention. Nie et al. [24] proposed a deep learning-based model that combines convolutional neural networks (CNN) with time-series data to capture the spatiotemporal dynamics of traffic flow. This method improves volume estimation accuracy by learning from historical traffic data. Recognizing the limitations of CNN-based approaches, researchers have turned to GNN-based methods. These methods model the non-Euclidean traffic topology network as a graph and use GNN to capture spatial dependencies in traffic data, further improving estimation accuracy [25,26]. Abdelraouf et al. [27] introduced a method combining GCN and long short-term memory (LSTM) networks, using a sequence-to-sequence architecture. This approach effectively captures spatiotemporal dependencies and remains effective even under the low penetration rate of floating vehicle data.
Existing deep learning-based models still face several limitations when handling issues such as spatial dependencies, nonlinear changes, and missing data in traffic volume. These methods typically require large amounts of training data to learn traffic flow patterns, and their performance is affected when data are sparse or incomplete.
The emergence of adaptive graph convolution networks offers the potential for further improving traffic volume estimation accuracy [28]. The adaptive adjacency matrix can automatically adjust the weights between nodes based on the dynamic changes in traffic volume [29]. Urban traffic networks are becoming increasingly complex. This complexity is driven by the unbalanced spatiotemporal distribution of traffic and the need to respond to sudden events. Additionally, the high variability and volatility of traffic flow present further challenges. As a result, the full potential of this technology has not yet been realized. Future research will further drive its widespread application in ITS.

2.2. Application of Transformer and Its Variants in Traffic Time-Series Modeling

LSTM networks [30] and Recurrent Neural Networks (RNNs) [31] are widely used for processing traffic time-series data. LSTM and the Gated Recurrent Unit (GRU) introduce gating mechanisms into the network, which effectively address the vanishing gradient problem that traditional RNNs face when handling long-term dependencies [32]. As a result, these recurrent architectures have become effective tools for capturing temporal dependencies in traffic volume estimation.
The transformer [33] is a novel sequence modeling framework. It overcomes the bottlenecks of RNNs and LSTMs by using a self-attention mechanism to capture long-term dependencies. The self-attention mechanism computes dependencies between all positions in the input sequence; it captures long-term dependencies and allows parallel computation, significantly improving computational efficiency. Compared to LSTM and RNN, the transformer has shown substantial advantages in large-scale traffic data prediction, particularly in capturing long-term dependencies in traffic flow. Yan et al. [34] applied the transformer structure to traffic volume prediction, achieving improved performance: it relieved the mismatch between traffic flow propagation and the road network and enhanced the ability to capture traffic flow changes, leading to improved accuracy.
Despite the advantages of the transformer in capturing long-term dependencies, it still has certain limitations. Due to the nature of the self-attention mechanism, the transformer encounters difficulties when handling tasks with fixed positional dependencies, such as the spatial structure and periodic patterns of traffic volume. A potential solution is the patch method. This approach divides time-series data into segments of a fixed length, allowing for the transformer structure to better capture multi-level temporal dependencies [35,36]. Zhang et al. [37] introduced a patch method with a pyramid structure, employing a tree structure, which was successful in time-series prediction tasks. These results offer the potential to further capture the long-term correlations and periodicity of road traffic, enhancing its capability to address complex traffic environments.
Although current research has made progress in capturing spatial and temporal correlations of traffic volume, several challenges remain unresolved. This study aims to propose the DTSAGCN model, which combines the strengths of GNN and MSTF, to address these issues and provide a more flexible and efficient solution for traffic volume estimation.

3. Problem Formulation

Traffic volume estimation refers to estimating the traffic state of unknown road segments based on traffic volume and speed data from known sections of the traffic network. Unlike traditional traffic prediction tasks, volume estimation focuses on estimating the traffic state of the entire network at a specific moment. It relies on available local data, particularly when sensor data are incomplete or partially missing. Accurate traffic volume estimation is crucial for intelligent traffic management, road network optimization, and traffic safety warnings.
We are given a directed traffic graph network $G = (V, E, A)$ with $N$ nodes, where $V = \{v_1, \ldots, v_N\}$ is the set of road segments, $E$ is the set of connections between road segments, and $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix that describes the relationships between road segments in the traffic network. The volume and speed features of the nodes are recorded as graph signals $X_v \in \mathbb{R}^{N \times T}$ and $X_s \in \mathbb{R}^{N \times T}$, where $N$ is the total number of nodes (road segments) and $T$ is the total number of time intervals. At time step $t$, we denote the volume $x_{v_i}^t$ and speed $x_{s_i}^t$ of each road segment $v_i \in V$. To obtain fine-grained spatiotemporal traffic volume estimates, we set the time window to 5 min.
With the above notation, we can further describe the traffic volume estimation problem. In the given directed graph $G = (V, E, A)$, based on the volume data of partial roads and the speed data of all roads across time steps, we estimate the traffic volume for all roads at time $t$. We maintain the same assumption as Meng et al. [6] and Zhang et al. [8]: traffic speed records are complete and available for all road segments (sensors). However, traffic volume records are partial. Only a few roads in the network can directly access their volume data [38].
We aim to learn a mapping function $F_\Theta(\cdot)$ that takes the known partial traffic volume and speed data as input and estimates the traffic volume based on the structure of the traffic network $G$. For road segments with known traffic volumes $V_o$ (measured segments), the volume remains fixed. For road segments with unknown traffic volumes $V_u$ (unmeasured segments), we have
$$\hat{X}_v = F_\Theta(X_{v_i}, X_{s_j}, G), \quad i \in V_o, \; j \in V_o \cup V_u$$
where $\hat{X}_v$ represents the estimated traffic volume, $V_o$ is the set of road segments with known traffic volumes, and $V_u$ is the set of road segments with unknown traffic volumes.
Therefore, the goal of the problem is to estimate the traffic volume state of the entire network by using the known partial traffic volumes, the full network speed data, and the spatiotemporal dependencies between road segments. This process is based on GNN and spatiotemporal modeling techniques, which utilize the spatial dependencies of the traffic network and time-series data to make accurate traffic volume estimates.
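As a concrete illustration of this problem setup, the NumPy sketch below lays out the data shapes and masking described above. The toy sizes, the `observed` mask, and the variable names are our illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical toy network: N=4 road segments, T=12 five-minute intervals.
N, T = 4, 12
rng = np.random.default_rng(0)
X_v = rng.uniform(100, 500, size=(N, T))   # ground-truth volumes (only partially observed)
X_s = rng.uniform(30, 70, size=(N, T))     # speeds, assumed complete for ALL segments

# V_o: segments equipped with volume sensors; the rest form V_u.
observed = np.array([True, True, False, False])

# The model only sees volumes on observed segments; unobserved entries are masked out.
X_v_input = np.where(observed[:, None], X_v, 0.0)

# A trained estimator F_Theta would map (X_v_input, X_s, G) -> estimates for ALL
# segments; measured segments keep their sensor readings in the final output.
```

The key point is that speeds are available network-wide while volumes are only available on the measured set $V_o$; the estimator must fill in the rest.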

4. Methodology

This study proposes a deep learning framework based on the DTSAGCN model to address the spatiotemporal correlation issues in traffic volume estimation, particularly when facing challenges such as localized data missingness, unbalanced traffic flow, and propagation delays. The model combines adaptive adjacency matrix graph convolution and multi-scale transformer models, forming a robust spatiotemporal modeling structure to estimate the traffic volume of unobserved roads.
Figure 1 illustrates the overall architecture of the DTSAGCN model. It consists of two correlation-capturing components for both time and space:
  • Spatial Correlation: The dynamic graph convolution recurrent module and the adaptive speed-flow correlation module work together to capture the spatial correlations of traffic volume. The DGCR models the traffic volume relationships between roads based on their relative static distances. The adjacency matrix is dynamically updated according to the changes in traffic volume. This enhances the model’s expressiveness. The ASFC dynamically captures different speed-flow patterns across various roads, significantly improving accuracy compared to static linear methods.
  • Temporal Correlation: The Multi-Scale Transformer (MSTF) module is used to extract the temporal dependencies of traffic volume. The MSTF employs a multi-head attention pyramid structure that effectively captures periodic and long-term dependencies in traffic volume. It also retains the efficient training characteristics of the transformer structure, allowing for more effective handling of large datasets.
The detailed implementation mechanisms of each module will be described in the following subsections.

4.1. Spatial Correlations

Full-scale road network traffic data are often unavailable or come at a high cost. Therefore, capturing spatial correlations is a crucial task in traffic volume estimation. Traditional GCNs are widely used for this task. However, their static modeling of road dependencies limits their ability to adapt to changes in road relationships caused by factors such as traffic conditions, weather, and accidents. This reduces the accuracy of volume estimation. To effectively address this issue, the DTSAGCN model combines ASFC and DGCR. This combination allows for the dynamic capture of spatial correlations between road traffic volumes.

4.1.1. Graph Convolution and Local Spatial Dependency

GCN has been widely used in recent spatial modeling work [39,40,41]. By aggregating the features of neighboring nodes on the graph, GCN models local spatial dependencies. Traditional GCN assumes that the adjacency matrix is static, but in traffic volume estimation, factors such as traffic conditions, weather, and accidents can change the relationships between roads. To address this, this paper proposes using a dynamic adjacency matrix in the graph convolution network. This allows the network to capture the relationships between nodes as they change over time, enabling the model to adapt to the time-varying nature of the traffic network, particularly the spatiotemporal imbalance of traffic flow. The dynamic adjacency matrix $\tilde{A}$ consists of two parts: $A_{spa}$ and $A_{sfc}^t$, where $A_{spa}$ is the static adjacency matrix and $A_{sfc}^t$ is the dynamic adjacency matrix at a given time step $t$.
When the GCN receives input signals $X \in \mathbb{R}^{N \times D}$, at each time step the input signal $X_t \in \mathbb{R}^{N \times D}$ and the previous hidden state $H_{t-1} \in \mathbb{R}^{N \times D_m}$ are used, where $D$ is the feature dimension of the input signal $X_t$ and $D_m$ is the feature dimension of the hidden state $H_{t-1}$. This yields a dynamic input $Y_t = \mathrm{Concat}(X_t, H_{t-1})$, $Y_t \in \mathbb{R}^{N \times (D + D_m)}$.
To extract static traffic features and model the relationship between upstream and downstream road segments in terms of their impact on traffic flow, we have
$$(\tilde{D}_O)_{i,i} = \sum_j \tilde{A}_{i,j}$$
$$(\tilde{D}_I)_{i,i} = \sum_j \tilde{A}_{i,j}^T$$
$$F_t = \sigma\left( \tilde{D}_O^{-1} \tilde{A} Y_t W_O + \tilde{D}_I^{-1} \tilde{A}^T Y_t W_I \right)$$
where $\tilde{A}$ is the adjacency matrix, $\tilde{D}_O$ and $\tilde{D}_I$ are the out-degree and in-degree matrices, $W_O$ and $W_I$ are the corresponding parameter matrices, and $\sigma(\cdot)$ is the activation function. The node feature at time $t$ is denoted by $F_t \in \mathbb{R}^{N \times D_m}$.
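The bidirectional (outflow/inflow) graph-convolution step above can be sketched in NumPy as follows; the function name `diffusion_step` and the choice of $\tanh$ for $\sigma$ are our illustrative assumptions:

```python
import numpy as np

def diffusion_step(A, Y, W_O, W_I):
    """One bidirectional graph-convolution step F_t.
    A: (N, N) adjacency; Y = concat(X_t, H_{t-1}): (N, D + Dm);
    W_O, W_I: (D + Dm, Dm) outflow/inflow parameter matrices."""
    D_O = np.diag(A.sum(axis=1))        # out-degree matrix (row sums)
    D_I = np.diag(A.T.sum(axis=1))      # in-degree matrix (column sums)
    out_part = np.linalg.inv(D_O) @ A @ Y @ W_O
    in_part = np.linalg.inv(D_I) @ A.T @ Y @ W_I
    return np.tanh(out_part + in_part)  # sigma chosen as tanh here

# toy run: 3 nodes with self-loops so degrees are nonzero
rng = np.random.default_rng(1)
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float) + np.eye(3)
F = diffusion_step(A, rng.normal(size=(3, 5)),
                   rng.normal(size=(5, 4)), rng.normal(size=(5, 4)))
```

Normalizing separately by out- and in-degree lets the layer treat upstream and downstream influence asymmetrically, matching the directed traffic graph.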
Next, we establish the basic static adjacency matrix $A_{spa}$ part of the module. Based on prior research, we construct $A_{spa}$ from the road network distances between sensor nodes, applying a thresholded Gaussian kernel to model the relationships between nodes [42]:
$$A_{spa}(v_i, v_j) = \begin{cases} \exp\left( -\left( \dfrac{\mathrm{dist}(v_i, v_j)}{\delta} \right)^2 \right), & \text{if } \mathrm{dist}(v_i, v_j) \le \epsilon \\ 0, & \text{otherwise} \end{cases}$$
where $\mathrm{dist}(v_i, v_j)$ is the travel distance between nodes $v_i$ and $v_j$, $\delta$ is a scaling parameter (typically set as the standard deviation of the distances), and $\epsilon$ is a threshold that cuts off correlations between nodes that are too far apart.
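A minimal NumPy sketch of this thresholded Gaussian kernel; the function name and the toy distance matrix are illustrative, not from the paper:

```python
import numpy as np

def gaussian_kernel_adjacency(dist, delta=None, eps=1.0):
    """Build the static adjacency A_spa from a pairwise road-distance matrix.

    dist : (N, N) array of travel distances between sensor nodes.
    delta: scaling parameter; defaults to the std of the distances.
    eps  : distance threshold beyond which edges are dropped.
    """
    if delta is None:
        delta = dist.std()
    A = np.exp(-(dist / delta) ** 2)
    A[dist > eps] = 0.0            # zero out pairs that are too far apart
    return A

# toy example: 3 road segments
dist = np.array([[0.0, 0.5, 2.0],
                 [0.5, 0.0, 0.8],
                 [2.0, 0.8, 0.0]])
A_spa = gaussian_kernel_adjacency(dist, delta=1.0, eps=1.0)
```

Because the kernel depends only on distance, this matrix is symmetric, which is exactly the limitation the dynamic component $A_{sfc}^t$ later compensates for.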
It can be observed that the correlations calculated using the static adjacency matrix are symmetric, as the weights depend on the distance between nodes. In reality, since the traffic graph is directed, the adjacency matrix A should not be completely symmetric. This reflects the imbalance and asymmetry inherent in the traffic network to some extent.

4.1.2. Adaptive Speed-Flow Correlation Module

In traffic volume estimation, speed and volume are two of the three key traffic elements, and they are closely related. Among the three (speed, volume, and density), speed is the most readily available parameter, and speed pattern features play a crucial role in inferring traffic volume. Numerous studies have shown that when roads share the same speed pattern, their traffic volumes may follow similar patterns of evolution [6]. Figure 2 shows two typical speed–volume patterns from the PEMS-BAY dataset used in this study. To address this, this paper proposes the ASFC module, which captures global road correlations based on speed information.
Most current research on traffic speed–volume correlation is based on the Macroscopic Fundamental Diagram (MFD) approach. However, the shape of the MFD changes under different traffic conditions, leading to errors [43,44]. In contrast, the ASFC module captures the time-varying speed–volume correlations directly from the data, avoiding this issue. Additionally, the ASFC module accounts for the delay and time variation in traffic flow propagation and performs a weighted fusion of historical traffic flow patterns, allowing the model to adaptively capture the correlations between different roads. This design ensures that the model can handle changing correlations in traffic volume, especially during congestion or sudden events, and thus more accurately reflects the variation and diffusion of traffic volume.
First, we use a clustering algorithm to extract typical short-term traffic speed patterns. The input speed data are segmented using a sliding window of size $T_w$:
$$X_s^t = \begin{bmatrix} x_{s_1}^{t - T_w + 1} & \cdots & x_{s_1}^t \\ \vdots & \ddots & \vdots \\ x_{s_N}^{t - T_w + 1} & \cdots & x_{s_N}^t \end{bmatrix}$$
where $x_{s_i}^t$ represents the speed of road $i$ at time $t$.
Next, we apply the k-Shape clustering algorithm to the speed sequences, resulting in $N_p$ typical speed patterns. Each cluster has a centroid $p_i$, which is itself a time series of length $T_w$. This gives us the collection of short-term traffic flow speed patterns.
By incorporating the historical speed information of the sliding-window length into the node speed data at the current time step $t$, we capture the temporal correlations between each road and other roads, including time delays. To achieve this, the embedding matrix $W_w$ is used to obtain a high-dimensional representation of the short-term historical speed data from time step $t - T_w + 1$ to $t$:
$$u_n^t = x_n^{t - T_w + 1 : t} W_w$$
The embedding matrix $W_m$ is used to generate a memory vector $m_i$ for each traffic flow pattern $p_i$:
$$m_i = p_i W_m$$
Then, the typical traffic flow patterns are used to represent the historical speed information of the current road, yielding a weight vector $w_{s_i}^t$. This weight vector is normalized with the softmax function so that all weights sum to 1:
$$w_{s_i}^t = \mathrm{softmax}\left( (u_n^t)^T m_i \right)$$
Finally, the historical speed information of the current road is represented as a weighted fusion of the memory vectors of the traffic flow patterns:
$$r_n^t = \sum_{i=1}^{N_p} w_{s_i}^t \left( p_i^T W_c \right)$$
The Pearson correlation coefficient is used to measure the correlation between two nodes $v_i$ and $v_j$:
$$\rho(r_i^t, r_j^t) = \frac{\left\langle r_i^t, r_j^t \right\rangle}{\lVert r_i^t \rVert \, \lVert r_j^t \rVert}$$
Therefore, the ASFC module generates the adjacency matrix $A_{sfc}^t$ from short-term historical speed information, capturing the global road correlations:
$$A_{sfc}^t(v_i, v_j) = \mathrm{softmax}\left( \rho(r_i^t, r_j^t) \right)$$
The weighted fusion vector $r_n^t$ introduces the contribution of each traffic flow pattern to the short-term speed representation of the current road, taking into account the propagation delay of traffic flow information in the DTSAGCN model. This design is highly valuable in practice: during peak hours, for example, traffic flow changes on a major road may take time to affect secondary roads, a phenomenon that must be captured through the model's time-delay mechanism. Without such a design, directly computing road correlations would lose important correlation information between roads.
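Under stated assumptions (random toy matrices standing in for the learned embeddings $W_w$, $W_m$, $W_c$, and cluster centroids from k-Shape), the ASFC computation chain, from window embedding to the row-softmax adjacency, can be sketched as:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def asfc_adjacency(X_hist, patterns, W_w, W_m, W_c):
    """Sketch of the ASFC module for one time step.
    X_hist  : (N, T_w)  recent speed window for each road
    patterns: (Np, T_w) k-Shape centroids p_i
    W_w, W_m, W_c: (T_w, d) embedding matrices (learned in the real model)."""
    U = X_hist @ W_w                                  # (N, d)  road embeddings u_n^t
    M = patterns @ W_m                                # (Np, d) pattern memories m_i
    W_s = np.apply_along_axis(softmax, 1, U @ M.T)    # (N, Np) pattern weights w_s^t
    R = W_s @ (patterns @ W_c)                        # (N, d)  fused vectors r_n^t
    # normalized correlation between fused vectors -> row-softmax adjacency
    Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
    rho = Rn @ Rn.T
    return np.apply_along_axis(softmax, 1, rho)

rng = np.random.default_rng(0)
A_sfc = asfc_adjacency(rng.normal(size=(4, 6)), rng.normal(size=(3, 6)),
                       rng.normal(size=(6, 5)), rng.normal(size=(6, 5)),
                       rng.normal(size=(6, 5)))
```

Each row of the result sums to 1, so it can be fused element-wise with the static adjacency in the next step.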

4.1.3. Dynamic Graph Convolution Recurrent Module

Dynamic graph convolution aims to transfer and fuse information between nodes and their neighbors to express the dynamic spatial dependencies within the traffic network. We integrate the static adjacency matrix $A_{spa}$ and the dynamic adjacency matrix $A_{sfc}^t$, continuously updating the spatial interactions between nodes and capturing the time-varying relationships between roads through the aggregation of local and global information. First, we dynamically adjust $\tilde{A}$ using element-wise multiplication:
$$\tilde{A}_t = A_{sfc}^t \odot A_{spa}$$
The graph convolution propagation process consists of two stages. The first stage is recursive propagation, where information is gradually spread through the dynamic adjacency matrix of the graph. The second stage is the mixing stage, where important information is further extracted through aggregation and feature selection. This method effectively addresses the spatiotemporal dependency problem.
The propagation stage aims to recursively fuse each node's information with the features of its deeper neighbors. We assume that the maximum depth of the propagation process is $K$, meaning that information can be propagated through up to $K$ hops. The propagation formula at hop $k$ ($k = 1, \ldots, K$) is as follows:
$$\tilde{H}_t^{(k)} = \tilde{D}_O^{-1} \tilde{A}_t H_t^{(k-1)} W_O^{(k)} + \tilde{D}_I^{-1} \tilde{A}_t^T H_t^{(k-1)} W_I^{(k)}$$
where $W_O^{(k)}$ and $W_I^{(k)}$ are the corresponding outflow and inflow parameter matrices for each hop.
An activation function (ReLU) is applied to the computed features to introduce nonlinearity, which helps the model capture more complex patterns and features:
$$H_t^{(k)} = \mathrm{ReLU}\left( \left( \alpha H_t^{in} + (1 - \alpha) \tilde{H}_t^{(k)} \right) W_P^{(k)} \right)$$
Here, $\alpha$ is a hyperparameter that balances retaining the original node features against deeper neighborhood information. Selecting an appropriate $\alpha$ helps the model explore deeper neighborhood information while maintaining locality. $W_P^{(k)}$ is the parameter matrix for the $k$-th hop propagation.
After the recursive propagation stage, we proceed to the mixing stage. This involves aggregating the results from multiple hops and removing irrelevant information. The multi-hop propagation results are concatenated along the feature dimension, and a fully connected layer is applied to select the most useful features. This approach effectively reduces information redundancy, enhances model performance, and preserves important spatiotemporal features.
$$H_t^{out} = \mathrm{Concat}\left( H_t^{(0)}, H_t^{(1)}, \ldots, H_t^{(K)} \right) W_{out}$$
where $H_t^{out}$ is the output node state of the dynamic graph convolution at time $t$ and $W_{out}$ is a trainable feature transformation matrix.
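The two stages, $K$-hop propagation followed by concat-and-project mixing, can be sketched compactly in NumPy; the function name, parameter lists, and shapes below are our illustrative assumptions:

```python
import numpy as np

def dgcr_layer(A_t, H_in, K, alpha, Ws_O, Ws_I, Ws_P, W_out):
    """K-hop propagation followed by the mixing stage (NumPy sketch).
    A_t : (N, N) dynamic adjacency (assumed strictly positive here)
    H_in: (N, Dm) input node states H_t^in
    Ws_O, Ws_I, Ws_P: lists of K (Dm, Dm) hop parameter matrices
    W_out: ((K + 1) * Dm, Dm) mixing projection."""
    D_O_inv = np.diag(1.0 / A_t.sum(axis=1))   # inverse out-degree
    D_I_inv = np.diag(1.0 / A_t.sum(axis=0))   # inverse in-degree
    hops, H = [H_in], H_in
    for k in range(K):
        # propagation: bidirectional diffusion over the dynamic adjacency
        H_tilde = D_O_inv @ A_t @ H @ Ws_O[k] + D_I_inv @ A_t.T @ H @ Ws_I[k]
        # residual blend with the original features, then ReLU
        H = np.maximum(0.0, (alpha * H_in + (1 - alpha) * H_tilde) @ Ws_P[k])
        hops.append(H)
    # mixing: concatenate all hops along features and project
    return np.concatenate(hops, axis=1) @ W_out

rng = np.random.default_rng(2)
N, Dm, K = 4, 3, 2
H_out = dgcr_layer(rng.uniform(0.1, 1.0, size=(N, N)), rng.normal(size=(N, Dm)),
                   K, 0.5,
                   [rng.normal(size=(Dm, Dm)) for _ in range(K)],
                   [rng.normal(size=(Dm, Dm)) for _ in range(K)],
                   [rng.normal(size=(Dm, Dm)) for _ in range(K)],
                   rng.normal(size=((K + 1) * Dm, Dm)))
```

Keeping every hop in the concatenation lets the final projection select which neighborhood depths matter, rather than committing to a single receptive field.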
Through this design, the DTSAGCN model uses an adaptive method to capture road correlations based on different information patterns from various nodes. It relates roads with highly correlated traffic volume even when they are spatially distant, and it captures globally correlated roads with similar speed-change or flow-change patterns. This design improves the accuracy and interpretability of the model, especially for estimating unmonitored regions.

4.2. Temporal Correlations

Traffic volume estimation involves capturing the complex spatiotemporal correlations of traffic volume changes within the network, particularly the periodicity and short-term volatility. Periodicity refers to the regular fluctuations of traffic flow during different time periods of the day or week with significant differences, especially between peak and off-peak hours [45,46]. Short-term volatility refers to the variations in traffic volume over short time intervals, influenced by the volume of the previous time step [47,48]. This requires the model to capture both long-term patterns and short-term fluctuations. To address this, the DTSAGCN model incorporates the MSTF module for time modeling, enhancing the model’s ability to capture both long- and short-term temporal dependencies in traffic flow.
The MSTF module utilizes a multi-scale attention pyramid mechanism, modeling short-term fluctuations and long-term trends of traffic flow at multiple scales. It leverages the self-attention mechanism to better capture long-range dependencies and can process time series in parallel, improving training efficiency. Figure 3 illustrates the structure of the MSTF module.
First, following the research findings of Wu et al. [49] and Zhou et al. [50], we decompose the input data into short-term fluctuation components and long-term trend components. The input data consist of the time-hidden states obtained from spatial modeling:
$$S = H_{1:T}^{out} = [H_1^{out}, H_2^{out}, \dots, H_T^{out}]^{\mathsf{T}} \in \mathbb{R}^{N \times T \times D_m}$$
where $T$ is the number of time steps, $N$ is the number of road segments, and $D_m$ is the feature dimension of the hidden node state $H_t^{out}$.
At time step $t$, the decomposition process is as follows:
$$S_T = \mathrm{mean}\left( \sum_{i=1}^{n} \mathrm{AvgPool}(\mathrm{Padding}(S))_i \right)$$
$$S_s = S - S_T$$
where $S_s$ is the long-term cyclical trend part and $S_T$ is the short-term fluctuation part.
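A minimal NumPy sketch of this additive decomposition follows; it uses a single pooling kernel and replicate padding as simplifying assumptions (the formulation above averages over several pooled versions of the series):

```python
import numpy as np

def decompose(S, kernel=5):
    """Split a series S of shape (T, D) into a pooled component S_T
    (moving average via Padding + AvgPool) and the residual S_s = S - S_T."""
    T, _ = S.shape
    pad = kernel // 2
    # Replicate-pad both ends so the moving average keeps length T
    padded = np.concatenate([np.repeat(S[:1], pad, axis=0), S,
                             np.repeat(S[-1:], pad, axis=0)])
    S_T = np.stack([padded[t:t + kernel].mean(axis=0) for t in range(T)])
    return S_T, S - S_T

t = np.arange(48, dtype=float)
S = (np.sin(2 * np.pi * t / 24) + 0.01 * t).reshape(-1, 1)  # periodic + drift
S_T, S_s = decompose(S)
assert np.allclose(S_T + S_s, S)   # exact additive decomposition
```

Because the two components sum back to the original series, the two branches that process them can be trained jointly without losing information.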
Based on the different data characteristics of S s and S T , we apply different processing methods.
For the simpler variations in $S_T$, we use a single-layer convolutional network. The convolution kernel $W_{ss} \in \mathbb{R}^{h_{ss} \times 1}$ has length $h_{ss}$, which can be adjusted according to the time-step length. Finally, a linear layer outputs the result, allowing us to effectively capture the short-term trend.
$$H_{cn}^{T+1} = \mathrm{linear}(\mathrm{conv}(S_T, W_{ss}))$$
For the long-term dependencies in $S_s$, we model them using a multi-scale pyramid transformer network. Through extensive analysis of traffic data, we found that traffic flow typically exhibits clear daily and weekly periodicity. Based on this, we introduce $K_t$ levels, which are not fixed and depend on the characteristics of the dataset. At each level $k_t \in \{1, \dots, K_t\}$, we divide $S_s$ into patches of length $p_k \in \{p_1, \dots, p_{K_t}\}$, which helps capture temporal features at different granularities.
$$S_e^k = \mathrm{Patch}(S_s, S_s^0, p_k)$$
When the length of the time series is not evenly divisible by the patch size $p_k$, $S_s^0$ performs zero-padding. The patch operation splits the time series into $N_{pt} = \lceil T / p_k \rceil$ non-overlapping patches. The resulting embedded data are represented as $S_e^k \in \mathbb{R}^{N \times N_{pt} \times p_k \times D_m}$, where $N$ is the number of nodes, $N_{pt}$ is the number of patches, and $p_k$ is the patch size.
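The patch operation with tail zero-padding can be sketched as follows for a single node (shapes are illustrative; the model applies this per node and per pyramid level):

```python
import numpy as np

def patch(series, p):
    """Split a length-T series (T, D) into non-overlapping patches of size p,
    zero-padding the tail when T is not divisible by p (the S_s^0 term)."""
    T, D = series.shape
    n_patches = -(-T // p)                 # ceil(T / p)
    padded = np.zeros((n_patches * p, D))
    padded[:T] = series
    return padded.reshape(n_patches, p, D)

S_s = np.ones((10, 4))                     # T = 10 time steps, D_m = 4 features
patches = patch(S_s, p=3)
print(patches.shape)  # (4, 3, 4): ceil(10/3) = 4 patches of length 3
assert patches[-1, -1].sum() == 0          # tail patch is zero-padded
```

Smaller $p_k$ gives many fine-grained patches for short-range patterns; larger $p_k$ gives few coarse patches that expose daily or weekly structure to the attention layers.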
Next, the time-series data at each level are input into the transformer structure. The encoder adopts a top-down approach, gradually extracting features from fine-grained to coarse-grained levels. The pyramid structure generates $K_t$ hidden representations, and a convolutional layer is applied to generate the hidden state matrix for the current time step. The embedded data $S_e^{k-1}$ are concatenated with the output $Z_{en}^{k-1}$ of the lower-level encoder and passed through a $1 \times 1$ convolution layer to generate the new input data:
$$S_e^k = \mathrm{Conv}(\mathrm{Concat}(S_e^{k-1}, Z_{en}^{k-1}))$$
We use the classic transformer multi-head attention mechanism to calculate attention weights, including standard layer normalization and residual connections. At the $k_t$-th layer, the formulas are as follows:
$$Q_{en}^{d_k}, K_{en}^{d_k}, V_{en}^{d_k} = \mathrm{Linear}(S_e^k)$$
$$\mathrm{Attention}(Q_{en}^{d_k}, K_{en}^{d_k}, V_{en}^{d_k}) = \mathrm{Softmax}\!\left( \frac{Q_{en}^{d_k} \left(K_{en}^{d_k}\right)^{\mathsf{T}}}{\sqrt{d_k}} \right) V_{en}^{d_k}$$
where $Q_{en}^{d_k}$, $K_{en}^{d_k}$, and $V_{en}^{d_k}$ are the query, key, and value, $d_k$ is the dimension of the query, and $S_e^k$ is the feature data at the $k_t$-th layer.
Through linear transformations of $Q_{en}^{d_k}$, $K_{en}^{d_k}$, and $V_{en}^{d_k}$, multiple different queries, keys, and values are obtained, each forming a group of attention heads. For heads $h = 1, 2, \dots, h_m$,
$$Q_{en,h}^{d_k}, K_{en,h}^{d_k}, V_{en,h}^{d_k} = \mathrm{Linear}(Q_{en}^{d_k}, K_{en}^{d_k}, V_{en}^{d_k})$$
For each attention head,
$$Z_{en,h}^{d_k} = \mathrm{Attention}(Q_{en,h}^{d_k}, K_{en,h}^{d_k}, V_{en,h}^{d_k})$$
The multi-head attention output is
$$Z_{en}^{d_k} = \mathrm{linear}(\mathrm{concat}(Z_{en,1}^{d_k}, Z_{en,2}^{d_k}, \dots, Z_{en,h_m}^{d_k}))$$
For the output $Z_{en}^k$ of each layer, we add a residual connection to alleviate the vanishing gradient problem that may occur during backpropagation, thereby accelerating training.
$$Z_{en}^k \leftarrow Z_{en}^k + S_e^k$$
A linear layer is then used to output $H_{en}^{T+1}$, the predicted spatial-modeling hidden state for the next time step.
$$H_{en}^{T+1} = \mathrm{linear}(\mathrm{concat}(Z_{en}^1, \dots, Z_{en}^{K_t}))$$
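The per-layer attention block described above can be sketched in NumPy; head count, dimensions, and random weights are illustrative stand-ins for the learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, d_k, rng):
    """Multi-head self-attention on X of shape (L, D): per-head linear
    projections, scaled dot-product attention, then concat + linear."""
    L, D = X.shape
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.random((D, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V       # Attention(Q, K, V)
        outs.append(Z)
    W_out = rng.random((heads * d_k, D))
    return np.concatenate(outs, axis=1) @ W_out       # linear(concat(heads))

rng = np.random.default_rng(0)
X = rng.random((6, 16))          # 6 patches, 16-dim embeddings
Z = multi_head_attention(X, heads=4, d_k=8, rng=rng)
Z = Z + X                        # residual connection, as in the encoder
print(Z.shape)  # (6, 16)
```

Each head attends over the patch sequence independently, so different heads can specialize in different lags before the final linear map recombines them.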
The MSTF module integrates the long-term and short-term trend predictions to output $H_t^{pre}$. $H_t^{pre}$ accounts for the impact of all hidden states at previous time steps on the current time step and combines features from time scales $p_1$ to $p_{K_t}$. The multi-scale latent representations effectively leverage temporal dependencies across scales.
$$H_t^{pre} = H_{en}^t + H_{cn}^t$$

4.3. Model Implementations

4.3.1. Output Layer

We integrate the MSTF module with the DGCR module to fully utilize the state information from the previous $t-1$ time steps and from the current time step $t$. A linear layer combines them and outputs the global traffic volume estimation results.
$$\tilde{x}_v = \mathrm{concat}(H_t^{pre}, H_t^{out})\, W_{out} + b_{out}$$
where $W_{out} \in \mathbb{R}^{2 D_m \times D_m}$ and $b_{out} \in \mathbb{R}^{D_m}$ are the learnable parameters of the linear layer.
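A shape-level sketch of this fusion step (random arrays stand in for the two modules' outputs and the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_m = 5, 8
H_pre = rng.random((N, D_m))          # temporal branch output, H_t^pre
H_out = rng.random((N, D_m))          # spatial branch output, H_t^out

W_out = rng.random((2 * D_m, D_m))    # learnable, shape (2*D_m, D_m)
b_out = rng.random(D_m)               # learnable bias, shape (D_m,)
x_est = np.concatenate([H_pre, H_out], axis=1) @ W_out + b_out
print(x_est.shape)  # (5, 8)
```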

4.3.2. Training Strategy and Loss Function

In the traffic volume estimation task, the loss function is a crucial component for evaluating the model’s performance. To effectively capture the spatiotemporal correlations of traffic volume, especially when facing challenges such as localized data missingness, volume imbalance, and propagation delays, we propose a comprehensive loss function. This loss function combines the traffic reconstruction loss for sensor nodes and graph structure regularization. The goal of the loss function is to balance the model’s estimation accuracy with the consistency of the graph’s structure, ensuring that the estimated traffic volume reflects the spatiotemporal dependencies of the graph.
  • Traffic volume reconstruction loss
The traffic reconstruction loss is the core accuracy term. Since ground truth is available for only a subset of nodes, this loss alone cannot fully measure the model's performance and must be complemented by other loss terms. For nodes equipped with traffic sensors, the estimated value is $\tilde{x}_v$ and the ground truth is $x_v$. We use the L1 norm to calculate the difference between them:
$$\mathcal{L}_r = \| x_v - \tilde{x}_v \|_1$$
where $\| \cdot \|_1$ denotes the L1 norm. Compared to the L2 norm, the L1 norm is more robust to outliers, which is particularly important in traffic volume estimation, where sudden events (such as traffic accidents or construction) may occur.
  • Learnable graph structure regularization
Based on existing research, we believe that the correlations between roads should be limited to a few related roads. To make the graph structure sparser and smoother, we employ graph Laplacian regularization. This is achieved by learning the representation of the adjacency matrix. The regularization encourages smoother relationships between traffic flow on neighboring nodes.
The graph Laplacian matrix $L$ is defined as
$$L = D - A$$
where $D$ is the degree matrix and $A$ is the adjacency matrix. The regularization term is then
$$\mathcal{L}_g = X_v^{\mathsf{T}} L X_v = \sum_{i,j} \tilde{A}_t(v_i, v_j) \, \| x_{v_i} - x_{v_j} \|_2^2$$
where $\tilde{A}_t(v_i, v_j)$ is the connection weight between each pair of nodes in the graph and $\| \cdot \|_2$ is the L2 norm. Because the L2 norm is continuous and smooth, it encourages smoother relationships between neighboring nodes.
By integrating these two parts, we obtain the total loss function for the DTSAGCN model:
$$\mathcal{L} = \mathcal{L}_r + \lambda_g \mathcal{L}_g$$
where λ g is a hyperparameter used to control the weight of the regularization term.
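The two loss terms can be sketched as follows. The data are synthetic, and the pairwise check uses the identity $x^{\mathsf{T}} L x = \tfrac{1}{2}\sum_{i,j} A_{ij}(x_i - x_j)^2$ for a symmetric weight matrix (any constant factor is absorbed by $\lambda_g$):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
x_true = rng.random(N)                       # ground truth (sensor nodes only, in practice)
x_est = x_true + 0.1 * rng.standard_normal(N)
A = rng.random((N, N)); A = (A + A.T) / 2    # symmetric learned adjacency weights

# Reconstruction loss: L1 norm over observed nodes
loss_r = np.abs(x_true - x_est).sum()

# Graph Laplacian regularization: x^T L x with L = D - A
D_deg = np.diag(A.sum(axis=1))
L = D_deg - A
loss_g = x_est @ L @ x_est

# Equivalent pairwise form: 0.5 * sum_ij A_ij (x_i - x_j)^2
pairwise = 0.5 * sum(A[i, j] * (x_est[i] - x_est[j]) ** 2
                     for i in range(N) for j in range(N))
assert np.isclose(loss_g, pairwise)

lam_g = 5e-3                                 # regularization weight (Section 5.2)
total = loss_r + lam_g * loss_g
```

Since $L$ is positive semi-definite for non-negative symmetric weights, the regularizer is always non-negative and simply penalizes large volume differences between strongly connected roads.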

4.4. Time Complexity Analysis

We analyze the time complexity of the proposed model from three key components: correlation matrix construction, dynamic graph convolution, and the multi-scale temporal transformer.
The correlation matrix includes both static and dynamic components. The static matrix computes pairwise correlations between $N$ nodes based on distance. Since it is symmetric and fixed, it only needs to be computed once, resulting in a time complexity of $O(N^2/2)$. The dynamic matrix is computed using Pearson correlation coefficients across $N$ nodes, $N_p$ typical speed patterns, and $T$ time intervals, leading to a time complexity of $O(N^2 N_p T)$. Therefore, the overall complexity for correlation matrix construction is $O(N^2 N_p T)$.
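As a rough illustration of the dynamic component's cost, pairwise Pearson correlations between road speed series can be computed as below; `np.corrcoef` is a simplified stand-in for the paper's typical-speed-pattern matching, and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 48
speeds = rng.random((N, T))     # one historical speed series per road

# Pairwise Pearson correlations between the N series: O(N^2 * T) work
A_dyn = np.corrcoef(speeds)
assert A_dyn.shape == (N, N)
assert np.allclose(A_dyn, A_dyn.T)          # symmetric
assert np.allclose(np.diag(A_dyn), 1.0)     # self-correlation is 1
```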
In the dynamic graph convolution module, each time step requires a weighted fusion of static and dynamic correlation matrices, with complexity $O(N^3)$. Multi-hop message passing and aggregation are then performed based on the dynamic adjacency matrix. This step has complexity $O(K(N+E) D_m)$, where $K$ is the number of graph convolution layers, $D_m$ is the hidden state dimension, and $E$ is the number of edges. Over $T$ time intervals, the total complexity of this module is $O((N^3 + K(N+E) D_m) T)$.
The multi-scale temporal transformer introduces additional cost due to the multi-head self-attention mechanism. The number and size of temporal patches vary across scales, and we use their average for analysis. Let $K_t$ be the number of transformer layers, $P$ the average patch size, and $D_k$ the hidden dimension. The resulting complexity is $O\!\left(K_t N \frac{T}{P} D_k \left(\frac{T}{P} + D_k\right)\right)$.
The total time complexity of the proposed model is the sum of the complexities from all three components.

5. Experiments and Result Analysis

5.1. Data

In this section, we evaluate the proposed DTSAGCN model on the PEMS-BAY dataset and analyze the experimental results from multiple perspectives. Because aligned traffic volume and speed data for urban road networks are difficult to obtain, we conducted the traffic volume estimation experiment on PEMS-BAY, a widely used dataset sourced from the traffic sensor system in the Bay Area of California. We used data from 1 January 2017 to 31 May 2017, which include speed and volume data collected by 325 sensors across 75 road segments at a 5 min collection interval. The dataset covers traffic conditions across different time periods and has high spatiotemporal resolution. Missing data were pre-interpolated.
Due to the presence of multiple on-ramps on highways, where vehicles constantly flow in and out, the traffic volume at sensors before and after the on-ramp can experience abrupt changes, while speed may not be affected. This underdetermined situation can lead to increased overall errors in the model. Therefore, we conducted multi-model comparison experiments using traffic data with 30–50% coverage. Based on the model assumptions, we used speed data with 100% coverage. The remaining volume data were only used to evaluate the experimental results and were not involved in model training and validation. The distribution of traffic and speed sensor locations used in the experiments is shown in Figure 4. We used 80% of the data for training, 10% for validation, and 10% for testing.

5.2. Parameter Settings

The main hyperparameter settings for the DTSAGCN model are as follows. In the spatial modeling part, the graph convolution network has three layers with a hidden dimension of 128. The sliding window length for the ASFC module is 2 h, which corresponds to 24 time steps. In the temporal modeling part, the transformer consists of four encoder layers arranged in a four-level pyramid. The patch sizes are (4, 24, 144, 288). The number of multi-head attention heads is four, and the hidden dimension is 128. In the loss function, the weight of the regularization term is set to 5 × 10−3.
During the training phase, the input batch size is 32, using the Adam optimizer with an initial learning rate of 1 × 10−3. The learning rate is halved every 10 training epochs, with a minimum learning rate of 1 × 10−5. The number of training epochs is 200, and early stopping is applied to prevent overfitting.
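The learning-rate schedule described above (halved every 10 epochs, floored at 1 × 10−5) can be written as a small helper; this is a sketch of the schedule only, not the authors' training code:

```python
def lr_schedule(epoch, base=1e-3, min_lr=1e-5, step=10, gamma=0.5):
    """Halve the learning rate every `step` epochs, clamped at `min_lr`."""
    return max(base * gamma ** (epoch // step), min_lr)

assert lr_schedule(0) == 1e-3      # initial rate
assert lr_schedule(10) == 5e-4     # halved after 10 epochs
assert lr_schedule(25) == 2.5e-4   # halved twice by epoch 25
print(lr_schedule(200))            # clamped at the 1e-5 floor
```

In a PyTorch-style setup this corresponds to a step decay (e.g. `StepLR` with `step_size=10`, `gamma=0.5`) combined with a lower bound on the rate.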

5.3. Baselines

We compare the DTSAGCN model with several baseline models, including the benchmark KNN method and several advanced deep learning models, to demonstrate the effectiveness of our model design. A brief description of each method is as follows:
  • KNN: The K-Nearest Neighbors method is a simple and intuitive algorithm that calculates the correlation between nodes based on their distance. This serves as the benchmark for all traffic volume estimation methods, reflecting the performance improvements of different models.
  • ST-SSL [6]: This is a pioneering work in the field of traffic volume estimation. It first introduced a semi-supervised learning model and used speed information to supplement missing volume data. This model eliminates the reliance on complete labeled data for traffic volume estimation.
  • TGMC-F [8]: This method integrates traffic and speed data into a geometric matrix completion model and designs a loss function that includes spatiotemporal regularization, further improving the accuracy of traffic volume estimation.
  • GCBRNN [2]: This model designs a graph convolutional gated recurrent module to capture the spatiotemporal correlations in the data, and it performs both traffic volume estimation and traffic flow prediction tasks.
The performance of all models is evaluated using three common metrics: (1) Mean Absolute Error (MAE), which reflects the overall estimation accuracy; (2) Root Mean Squared Error (RMSE), which is more sensitive to abnormal traffic states; and (3) Mean Absolute Percentage Error (MAPE), which eliminates the effect of data range. These metrics are defined as follows:
$$\mathrm{MAE}(x_{ij}, \tilde{x}_{ij}) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{T} \left| x_{ij} - \tilde{x}_{ij} \right|$$
$$\mathrm{RMSE}(x_{ij}, \tilde{x}_{ij}) = \sqrt{ \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{T} \left( x_{ij} - \tilde{x}_{ij} \right)^2 }$$
$$\mathrm{MAPE}(x_{ij}, \tilde{x}_{ij}) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{T} \left| \frac{x_{ij} - \tilde{x}_{ij}}{x_{ij}} \right|$$
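The three metrics can be computed directly over the $N \times T$ estimation matrix; the toy values below are for illustration only (MAPE assumes no zero ground-truth entries):

```python
import numpy as np

def metrics(x, x_hat):
    """MAE, RMSE, and MAPE (in %) over all N*T entries of x and x_hat."""
    err = x - x_hat
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = np.abs(err / x).mean() * 100.0
    return mae, rmse, mape

x = np.array([[100.0, 200.0], [50.0, 80.0]])       # (N=2 roads, T=2 steps)
x_hat = np.array([[110.0, 190.0], [45.0, 88.0]])
mae, rmse, mape = metrics(x, x_hat)
print(round(mae, 2), round(rmse, 2), round(mape, 2))  # 8.25 8.5 8.75
```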

5.4. Volume Estimation Results at Different Coverage Rates

Table 1 shows the performance of the DTSAGCN model and other baseline models for global traffic volume estimation with 30–50% sensor coverage. Figure 5 provides more intuitive visual representations.
The experimental results show that, under all sensor coverage conditions, the DTSAGCN model consistently outperforms the baseline models. At the lower 30% coverage rate, DTSAGCN achieves a MAPE of 35.42%, which is superior to the other models and nearly equivalent to their performance at the higher 50% coverage rate. This substantial improvement demonstrates the model's ability to capture the spatiotemporal correlations of roads and highlights the importance of dynamic modeling.
Through a horizontal comparison of the models, we find that deep learning methods outperform traditional machine learning methods, showing their potential to capture complex features. The ST-SSL model uses a static adjacency matrix and incorporates various data sources, such as weather and POI data, to construct the adjacency matrix between roads. However, the highway dataset we used, like many traffic volume datasets, lacks such auxiliary information; as a result, the ST-SSL model performed suboptimally in this experiment. Applied to a dataset with richer features, its performance could improve. This suggests that future research could collect traffic datasets with more diverse features and enhance model performance through multi-feature fusion, although reliance on richer features may limit a model's generalizability.

5.5. Model Analysis

5.5.1. Ablation Study

To verify the necessity of each component in the proposed model, we conducted an ablation study and compared the model’s performance under 40% sensor coverage. We replaced specific modules and compared them with the original DTSAGCN model:
  • DTSAGCN w/o DGCR: We replaced the dynamic adjacency matrix in DTSAGCN with a static adjacency matrix based on distance calculations between roads.
  • DTSAGCN w/o ASFC: We directly computed the adjacency matrix using the Pearson correlation coefficient between road speeds, without considering the temporal delay of traffic volume.
  • DTSAGCN w/o MSTF: We replaced the multi-scale transformer module in DTSAGCN with a standard transformer, without explicitly modeling the periodicity of traffic volume.
The results of the different variations are summarized in Table 2. When comparing these with the results of the full DTSAGCN model, the performance of all models with replaced modules declined. This demonstrates that each module makes a significant contribution to the overall improvement of the DTSAGCN model.
Analyzing the results of the ablation study, we can summarize the following:
  • After replacing the DGCR module, the model’s performance significantly declined, demonstrating the necessity of capturing the dynamic spatial correlations of traffic volume. Both static speed–volume adjacency matrices and distance-based correlation coefficient adjacency matrices fail to fully reflect the time-varying, unbalanced spatial correlations between roads.
  • After replacing the ASFC module, the performance decreased, but its impact on the overall model was minimal. This may be because the model's capture of spatial correlations is partly redundant across modules.
  • After replacing the MSTF module, the model's performance showed a noticeable decline. This is because the standard transformer structure cannot retain sequence position information and lacks the ability to capture long-term dependencies, which are crucial in traffic volume tasks, especially for capturing periodicity over longer time spans, such as daily and weekly cycles.

5.5.2. Hyperparameter Study

In this section, we conduct a parameter study on the core hyperparameters of the DTSAGCN model at 40% sensor coverage. We adopt a one-at-a-time sensitivity analysis strategy, investigating the impact of each hyperparameter individually rather than performing a full grid search. This approach reduces computational cost and helps reveal the independent effect of each parameter on model performance. Such an analysis improves the interpretability of the model and supports efficient parameter tuning when applying the model to other datasets. The selected hyperparameters are as follows:
  • Dimension of hidden node states, ranging from 32 to 256. The results are shown in Figure 6a. The dimension of the hidden node states determines the complexity of each node’s feature representation in the graph convolution network. The estimation accuracy of the DTSAGCN model increases significantly as the dimension of hidden node states increases, indicating that accommodating more features can enhance model performance. However, when the dimension exceeds 128, the model’s estimation accuracy decreases, suggesting that excessively large dimensions may lead to overfitting and reduced model training efficiency.
  • Number of graph convolutional layers, ranging from one to four. The results are shown in Figure 6b. The number of graph convolutional layers determines the ability to aggregate information from neighboring nodes. The estimation accuracy of the DTSAGCN model improves initially as the number of layers increases, suggesting that deeper learning can focus on more correlations between nodes. However, when the number of layers exceeds three, the model’s estimation accuracy decreases, indicating that too many layers may introduce excessive information, causing the model to become overly smooth and risk underfitting.
  • Number of multi-head attention heads in the transformer, ranging from one to eight. The results are shown in Figure 6c. The number of attention heads in the temporal extraction module determines the model’s flexibility in capturing long-term dependencies in time series. The estimation accuracy of the DTSAGCN model improves as the number of heads increases, showing that more attention heads can effectively capture multi-scale temporal patterns. However, when the number of heads exceeds four, the model’s estimation accuracy declines, suggesting that an excessively high model complexity may weaken the model’s performance.
In conclusion, after experimental tuning, the optimal hyperparameter configuration for the DTSAGCN model is a hidden node state dimension of 128, three layers of graph convolution, and four attention heads in the temporal extraction module. This configuration ensures model training efficiency and improves the model’s accuracy.

5.5.3. Temporal Interpretability Study

To enhance model interpretability and examine its ability to capture temporal dependencies, we visualize attention weights learned at different temporal scales. We extract attention scores from each temporal attention head and plot them as heatmaps. These visualizations show how the model focuses on different historical time steps under multi-scale temporal settings.
Figure 7 presents attention coefficients at four temporal scales on the PEMS-BAY dataset. Darker colors indicate stronger attention values, which reflect higher dependency between two time steps. These heatmaps help identify key temporal patterns and reveal how the model adjusts its focus under different traffic conditions. This analysis provides insights that support improved estimation accuracy.
The results show that the model adapts its attention according to the temporal scale. At the 20 min scale, it focuses on recent time steps during free-flow conditions and captures longer-term dependencies under congestion. At the 12 h scale, the model highlights daily traffic patterns, showing high attention between time steps 24 h apart. At the 1-day scale, some segments reveal weekly periodicity, while others do not, depending on their functional characteristics. These results confirm that the model uses multi-scale temporal cues effectively and adjusts attention to reflect temporal variability in traffic volume.

6. Conclusions and Future Directions

In this paper, we present a new deep learning model, DTSAGCN, for traffic volume estimation. This model captures global spatial correlations, which are difficult to obtain due to the sparse deployment of traffic detectors, and accounts for the complex temporal correlations of traffic flow, including propagation delays and periodic patterns. This enables high-accuracy traffic estimation under sparse sensor data conditions. Experiments conducted on a real-world dataset validate the effectiveness and robustness of the DTSAGCN model. The model demonstrates its ability to address the challenges posed by complex traffic conditions and data scarcity. Ablation studies further confirm the significant contribution of each component to the overall performance of the model.
In future research, we aim to gather urban traffic network data, which will allow for the incorporation of additional external factors, such as weather, POI data, and their strong associations with traffic patterns. This will facilitate the development of strategies to further enhance the performance of traffic estimation models.

Author Contributions

Conceptualization, S.D. and F.Y.; methodology, S.D.; software, S.D.; validation, S.D., F.Y. and Y.Y.; formal analysis, S.D.; investigation, F.Y.; resources, F.Y.; data curation, S.D.; writing—original draft preparation, S.D.; writing—review and editing, F.Y.; visualization, S.D. and Y.Y.; supervision, F.Y. and Y.Y.; project administration, F.Y.; funding acquisition, F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Sanqin Talent Introduction Plan Innovative Talent Project and the Shaanxi Innovative Team for Science and Technology (funding number 2023-CX-TD-01) and funded in part by the Shaanxi Province Natural Science Basic Research Program (funding number 2024JC-YBMS-547) and in part by the Qin Chuangyuan Cited High-level Innovation and Entrepreneurship Talent Program (funding number QCYRCXM-2023-076).

Data Availability Statement

The data used in this study are publicly available. The code is related to an ongoing patent application and is also integral to a paper currently under preparation. As such, the code cannot be made publicly available at this time.

Acknowledgments

We are grateful to the editors and anonymous referees for their valuable comments and efforts. We also thank our colleagues and institutional support that contributed to the successful completion of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ITS	Intelligent Transportation Systems
GCN	Graph Convolutional Network
GNN	Graph Neural Network
DTSAGCN	Dynamic Temporal-Spatial Attention Graph Convolutional Network
DGCR	Dynamic Graph Convolution Recurrent
ASFC	Adaptive Speed-Flow Correlation
MSTF	Multi-Scale Transformer
PIDL	Physical Information Deep Learning
CNN	Convolutional Neural Network
LSTM	Long Short-Term Memory
RNN	Recurrent Neural Network
MFD	Macroscopic Fundamental Diagram

References

  1. Ren, Y.; Yin, H.; Wang, L.; Ji, H. Data-Driven RBFNN-Enhanced Model-Free Adaptive Traffic Symmetrical Signal Control for a Multi-Phase Intersection with Fast-Changing Traffic Flow. Symmetry 2023, 15, 1235. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Lin, X.; Li, M.; Wang, Y. A Customized Deep Learning Approach to Integrate Network-Scale Online Traffic Data Imputation and Prediction. Transp. Res. Part C Emerg. Technol. 2021, 132, 103372. [Google Scholar] [CrossRef]
  3. Nigam, N.; Singh, D.P.; Choudhary, J. A Review of Different Components of the Intelligent Traffic Management System (ITMS). Symmetry 2023, 15, 583. [Google Scholar] [CrossRef]
  4. Aslam, J.; Lim, S.; Pan, X.; Rus, D. City-Scale Traffic Estimation from a Roving Sensor Network. In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems, Toronto, ON, Canada, 6–9 November 2012; Association for Computing Machinery: New York, NY, USA, 2012; pp. 141–154. [Google Scholar]
  5. Zhan, X.; Zheng, Y.; Yi, X.; Ukkusuri, S.V. Citywide Traffic Volume Estimation Using Trajectory Data. IEEE Trans. Knowl. Data Eng. 2017, 29, 272–285. [Google Scholar] [CrossRef]
  6. Meng, C.; Yi, X.; Su, L.; Gao, J.; Zheng, Y. City-Wide Traffic Volume Inference with Loop Detector Data and Taxi Trajectories. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, USA, 7–10 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1–10. [Google Scholar]
  7. Tang, X.; Gong, B.; Yu, Y.; Yao, H.; Li, Y.; Xie, H.; Wang, X. Joint Modeling of Dense and Incomplete Trajectories for Citywide Traffic Volume Inference. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1806–1817. [Google Scholar]
  8. Zhang, Z.; Li, M.; Lin, X.; Wang, Y. Network-Wide Traffic Flow Estimation with Insufficient Volume Detection and Crowdsourcing Data. Transp. Res. Part C Emerg. Technol. 2020, 121, 102870. [Google Scholar] [CrossRef]
  9. Atwood, J.; Towsley, D. Diffusion-Convolutional Neural Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Newry, UK, 2016; Volume 29. [Google Scholar]
  10. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2017, arXiv:1609.02907. [Google Scholar] [CrossRef]
  11. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. Proc. AAAI Conf. Artif. Intell. 2019, 33, 922–929. [Google Scholar] [CrossRef]
  12. Jiang, W.; Luo, J. Graph Neural Network for Traffic Forecasting: A Survey. Expert Syst. Appl. 2022, 207, 117921. [Google Scholar] [CrossRef]
  13. Qu, Z.; He, S. A Time-Space Network Model Based on a Train Diagram for Predicting and Controlling the Traffic Congestion in a Station Caused by an Emergency. Symmetry 2019, 11, 780. [Google Scholar] [CrossRef]
  14. Zhou, H.; Zhang, D.; Xie, K. Accurate Traffic Matrix Completion Based on Multi-Gaussian Models. Comput. Commun. 2017, 102, 165–176. [Google Scholar] [CrossRef]
  15. Kampitakis, E.P.; Fafoutellis, P.; Oprea, G.-M.; Vlahogianni, E.I. Shared Space Multi-Modal Traffic Modeling Using LSTM Networks with Repulsion Map and an Intention-Based Multi-Loss Function. Transp. Res. Part C Emerg. Technol. 2023, 150, 104104. [Google Scholar] [CrossRef]
  16. Yuan, Y.; Zhang, Y.; Wang, B.; Peng, Y.; Hu, Y.; Yin, B. STGAN: Spatio-Temporal Generative Adversarial Network for Traffic Data Imputation. IEEE Trans. Big Data 2023, 9, 200–211. [Google Scholar] [CrossRef]
  17. Richards, P.I. Shock Waves on the Highway. Oper. Res. 1956, 4, 42–51. [Google Scholar] [CrossRef]
  18. Aw, A.; Rascle, M. Resurrection of “Second Order” Models of Traffic Flow. SIAM J. Appl. Math. 2000, 60, 916–938. [Google Scholar] [CrossRef]
  19. Zhang, H.M. A Non-Equilibrium Traffic Model Devoid of Gas-like Behavior. Transp. Res. Part B Methodol. 2002, 36, 275–290. [Google Scholar] [CrossRef]
  20. Zhang, J.; Mao, S.; Yang, L.; Ma, W.; Li, S.; Gao, Z. Physics-Informed Deep Learning for Traffic State Estimation Based on the Traffic Flow Model and Computational Graph Method. Inf. Fusion 2024, 101, 101971. [Google Scholar] [CrossRef]
  21. Huang, A.J.; Agarwal, S. Physics Informed Deep Learning for Traffic State Estimation. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6. [Google Scholar]
  22. Shahriari, S.; Ghasri, M.; Sisson, S.A.; Rashidi, T. Ensemble of ARIMA: Combining Parametric and Bootstrapping Technique for Traffic Flow Prediction. Transp. A Transp. Sci. 2020, 16, 1552–1573. [Google Scholar] [CrossRef]
  23. Zhao, M.; Yu, H.; Wang, Y.; Song, B.; Xu, L.; Zhu, D. Real-Time Freeway Traffic State Estimation for Inhomogeneous Traffic Flow. Phys. A Stat. Mech. Its Appl. 2024, 639, 129633. [Google Scholar] [CrossRef]
  24. Nie, L.; Li, Y.; Kong, X. Spatio-Temporal Network Traffic Estimation and Anomaly Detection Based on Convolutional Neural Network in Vehicular Ad-Hoc Networks. IEEE Access 2018, 6, 40168–40176. [Google Scholar] [CrossRef]
  25. Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar]
  26. Zhang, M.; Chen, Y. Link Prediction Based on Graph Neural Networks. arXiv 2018, arXiv:1802.09691. [Google Scholar] [CrossRef]
  27. Abdelraouf, A.; Abdel-Aty, M.; Mahmoud, N. Sequence-to-Sequence Recurrent Graph Convolutional Networks for Traffic Estimation and Prediction Using Connected Probe Vehicle Data. IEEE Trans. Intell. Transp. Syst. 2023, 24, 1395–1405. [Google Scholar] [CrossRef]
  28. Li, R.; Wang, S.; Zhu, F.; Huang, J. Adaptive Graph Convolutional Neural Networks. Proc. AAAI Conf. Artif. Intell. 2018, 32, 11691. [Google Scholar] [CrossRef]
  29. Ta, X.; Liu, Z.; Hu, X.; Yu, L.; Sun, L.; Du, B. Adaptive Spatio-Temporal Graph Neural Network for Traffic Forecasting. Knowl.-Based Syst. 2022, 242, 108199. [Google Scholar] [CrossRef]
  30. Lu, Z.; Lv, W.; Cao, Y.; Xie, Z.; Peng, H.; Du, B. LSTM Variants Meet Graph Neural Networks for Road Speed Prediction. Neurocomputing 2020, 400, 34–45. [Google Scholar] [CrossRef]
  31. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Newry, UK, 2020; Volume 33, pp. 17804–17815. [Google Scholar]
  32. Zhang, Z.; Li, Y.; Song, H.; Dong, H. Multiple Dynamic Graph Based Traffic Speed Prediction Method. Neurocomputing 2021, 461, 109–117. [Google Scholar] [CrossRef]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2016, arXiv:1706.03762. [Google Scholar] [CrossRef]
  34. Yan, H.; Ma, X.; Pu, Z. Learning Dynamic and Hierarchical Traffic Spatiotemporal Features With Transformer. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22386–22399. [Google Scholar] [CrossRef]
  35. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series Is Worth 64 Words: Long-Term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar] [CrossRef]
  36. Zhang, Y.; Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  37. Zhang, Y.; Wu, R.; Dascalu, S.M.; Harris, F.C. Multi-Scale Transformer Pyramid Networks for Multivariate Time Series Forecasting. IEEE Access 2024, 12, 14731–14741. [Google Scholar] [CrossRef]
  38. Liang, W.; Li, Y.; Xie, K.; Zhang, D.; Li, K.-C.; Souri, A.; Li, K. Spatial-Temporal Aware Inductive Graph Neural Network for C-ITS Data Recovery. IEEE Trans. Intell. Transp. Syst. 2023, 24, 8431–8442. [Google Scholar] [CrossRef]
  39. Bi, J.; Yuan, H.; Xu, K.; Ma, H.; Zhou, M. Large-Scale Network Traffic Prediction with LSTM and Temporal Convolutional Networks. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 3865–3870. [Google Scholar]
  40. Zhang, R.; Sun, F.; Song, Z.; Wang, X.; Du, Y.; Dong, S. Short-Term Traffic Flow Forecasting Model Based on GA-TCN. J. Adv. Transp. 2021, 2021, 1338607. [Google Scholar] [CrossRef]
  41. Gao, H.; Jia, H.; Yang, L. An Improved CEEMDAN-FE-TCN Model for Highway Traffic Flow Prediction. J. Adv. Transp. 2022, 2022, 2265000. [Google Scholar] [CrossRef]
  42. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv 2017, arXiv:1707.01926. [Google Scholar] [CrossRef]
  43. Hong, R.; Liu, H.; An, C.; Wang, B.; Lu, Z.; Xia, J. An MFD Construction Method Considering Multi-Source Data Reliability for Urban Road Networks. Sustainability 2022, 14, 6188. [Google Scholar] [CrossRef]
  44. Yildirimoglu, M.; Ramezani, M.; Geroliminis, N. Equilibrium Analysis and Route Guidance in Large-Scale Networks with MFD Dynamics. Transp. Res. Part C Emerg. Technol. 2015, 59, 404–420. [Google Scholar] [CrossRef]
  45. Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. Proc. AAAI Conf. Artif. Intell. 2023, 37, 4365–4373. [Google Scholar] [CrossRef]
  46. Ma, D.; Song, X.; Li, P. Daily Traffic Flow Forecasting Through a Contextual Convolutional Recurrent Neural Network Modeling Inter- and Intra-Day Traffic Patterns. IEEE Trans. Intell. Transp. Syst. 2021, 22, 2627–2636. [Google Scholar] [CrossRef]
  47. Han, L.; Du, B.; Sun, L.; Fu, Y.; Lv, Y.; Xiong, H. Dynamic and Multi-Faceted Spatio-Temporal Deep Learning for Traffic Speed Forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 547–555. [Google Scholar]
  48. Wu, Y.; Tan, H. Short-Term Traffic Flow Forecasting with Spatial-Temporal Correlation in a Hybrid Deep Learning Framework. arXiv 2016, arXiv:1612.01022. [Google Scholar] [CrossRef]
  49. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Newry, UK, 2021; Volume 34, pp. 22419–22430. [Google Scholar]
  50. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-Term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
Figure 1. The overall architecture of the proposed DTSAGCN model. Dynamic graph convolution networks and multi-scale transformers are applied. The input consists of the road network topology, partial volume, and complete speed data, while the output is the complete volume data.
Figure 2. (a) Typical steady speed and flow pattern; (b) Typical congestion speed and flow pattern.
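The steady and congested speed-flow patterns in Figure 2 follow the classic fundamental-diagram shape from traffic flow theory. As a point of reference only (this is the textbook Greenshields relation, not the paper's learned speed-flow correlation module), a one-line mapping from speed to implied flow can be sketched as follows; the free-flow speed and jam density values are illustrative:

```python
def greenshields_flow(v, v_free=100.0, k_jam=120.0):
    """Flow (veh/h) implied by speed v (km/h) under the classic
    Greenshields model: density k = k_jam * (1 - v/v_free), flow q = k * v.
    Flow peaks at v = v_free / 2 and vanishes at v = 0 and v = v_free,
    which is why one flow value maps to two speeds (steady vs. congested).
    """
    k = k_jam * (1.0 - v / v_free)  # linear speed-density assumption
    return k * v


# The same flow (3000 veh/h here) occurs at half of free-flow speed,
# illustrating the two regimes in Figure 2a,b.
print(greenshields_flow(50.0))  # 3000.0
```

This two-regime ambiguity is precisely why a learned, adaptive speed-flow correlation is needed rather than a fixed analytic curve.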
Figure 3. The mechanism of the MSTF module. The input is divided into blocks to collect features at multiple scales.
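The MSTF details are not reproduced in this excerpt; a minimal sketch of the block-division step the Figure 3 caption describes might look like the following, assuming non-overlapping mean-pooled blocks (the function name and patch sizes, e.g. 4/24/144 steps for 5-min samples to approximate 20 min/2 h/12 h scales, are illustrative, not the paper's exact configuration):

```python
import numpy as np

def multi_scale_patches(x, patch_sizes=(4, 24, 144)):
    """Split a 1-D series of length T into non-overlapping blocks at
    several temporal scales, averaging within each block to obtain
    one coarser-scale token per block (ragged tail is dropped).
    Returns {patch_size: array of block means}."""
    out = {}
    for p in patch_sizes:
        T = (len(x) // p) * p              # truncate to a multiple of p
        blocks = np.asarray(x[:T], dtype=float).reshape(-1, p)
        out[p] = blocks.mean(axis=1)       # one token per block
    return out


# 12 samples at scale 4 -> 3 coarse tokens
print(multi_scale_patches(np.arange(12), patch_sizes=(4,))[4])  # [1.5 5.5 9.5]
```

Each scale's token sequence would then feed a transformer branch, letting attention at coarse scales capture long-term trends while fine scales retain short-term fluctuations.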
Figure 4. (a) Distribution of speed sensors; (b) Distribution of volume sensors (40% coverage).
Figure 5. (a) MAE values for each model at different coverage rates; (b) RMSE values for each model at different coverage rates; (c) MAPE values for each model at different coverage rates.
Figure 6. (a) Model performance under different hidden node state dimensions; (b) Model performance under different graph convolutional layer counts; (c) Model performance under different transformer attention head counts.
Figure 7. Heatmaps of temporal attention coefficients at multiple time scales. Subfigures (ad) correspond to attention maps at the 20 min, 2 h, 12 h, and 1-day scales. In each heatmap, the vertical axis represents different road segments, and the horizontal axis denotes consecutive time segments. Darker colors indicate higher attention scores, reflecting stronger temporal dependencies between time steps.
Table 1. Volume estimation results under 30–50% coverage rate of different methods.
Coverage   Metric   DTSAGCN   KNN      ST-SSL   TGMC-F   GCBRNN
30%        MAE      44.13     54.9     52.45    46.62    47.66
30%        RMSE     61.93     83.47    78.57    70.91    68.99
30%        MAPE     35.42%    42.75%   41.08%   38.48%   37.83%
40%        MAE      40.75     52.39    50.28    47.81    45.92
40%        RMSE     60.73     79.01    76.51    70.89    67.18
40%        MAPE     32.41%    41.69%   37.20%   36.82%   35.54%
50%        MAE      38.24     50.33    47.37    47.06    43.76
50%        RMSE     58.02     75.26    71.05    69.43    65.89
50%        MAPE     44.13%    54.9%    52.45%   46.62%   47.66%
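The three evaluation metrics in Tables 1 and 2 have standard definitions; a self-contained sketch (the function name and the small eps guard against zero-volume roads are illustrative choices, not from the paper):

```python
import numpy as np

def volume_metrics(y_true, y_pred, eps=1e-6):
    """Standard error metrics for traffic volume estimation.

    MAE  : mean absolute error, in vehicles per interval
    RMSE : root mean squared error, penalizes large misses more
    MAPE : mean absolute percentage error; eps avoids division by
           zero on intervals with no observed vehicles
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err) / (np.abs(y_true) + eps)) * 100.0
    return mae, rmse, mape


# e.g. two road-interval observations, each off by 10 vehicles
print(volume_metrics([100, 200], [110, 190]))  # MAE 10.0, RMSE 10.0, MAPE 7.5
```

Because RMSE squares the errors before averaging, the larger RMSE-to-MAE gap of the baselines in Table 1 indicates they make more large individual misses than DTSAGCN.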
Table 2. Ablation study: The impact of different components of the DTSAGCN model.
Metric   DTSAGCN   w/o DGCR   w/o ASFC   w/o MSTF
MAE      40.75     47.14      43.27      45.76
RMSE     60.73     72.26      64.22      66.86
MAPE     32.41%    39.91%     35.43%     37.15%

Ding, S.; Yan, F.; Yi, Y. Traffic Volume Estimation Based on Spatiotemporal Correlation Adaptive Graph Convolutional Network. Symmetry 2025, 17, 599. https://doi.org/10.3390/sym17040599

