Article

MAT-WGCN: Traffic Speed Prediction Using Multi-Head Attention Mechanism and Weighted Adjacency Matrix

College of Information Engineering, Beijing Institute of Petrochemical Technology, Beijing 102617, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(17), 13080; https://doi.org/10.3390/su151713080
Submission received: 20 July 2023 / Revised: 19 August 2023 / Accepted: 26 August 2023 / Published: 30 August 2023
(This article belongs to the Special Issue Traffic Flow, Road Safety, and Sustainable Transportation)

Abstract
Traffic prediction is important in applications such as traffic management, route planning, and traffic flow optimization. Traffic speed prediction is a key part of traffic forecasting and has long been a challenging problem due to the complexity and dynamics of traffic systems. To predict traffic speed more accurately, we propose a traffic speed prediction model based on a multi-head attention mechanism and a weighted adjacency matrix: MAT-WGCN. MAT-WGCN first uses GCN to extract the road spatial features encoded in the weighted adjacency matrix, and it uses GRU to extract the correlation between speed and time from the original features. The spatial features extracted by GCN and the temporal features extracted by GRU are then fused, and a multi-head attention mechanism is introduced to integrate the spatiotemporal features, collect and summarize spatiotemporal road information, and realize traffic speed prediction. In this study, the prediction performance of MAT-WGCN was tested on two real datasets, EXPY-TKY and METR-LA, and compared with traditional methods that do not use spatial features, such as HA and SVR, as well as methods that do, such as GCN, T-GCN, A3T-GCN, and the newer NA-DGRU. The experimental results demonstrate that MAT-WGCN can capture the temporal and spatial characteristics of road conditions, enabling accurate traffic speed predictions. Furthermore, the incorporation of a multi-head attention mechanism significantly enhances the robustness of our model.

1. Introduction

As an important part of modern urban traffic management and planning, intelligent transportation systems aim to improve traffic efficiency, reduce traffic congestion, and provide more convenient travel methods. As the population and number of vehicles have increased, traffic congestion has become a common problem in cities. Traffic speed prediction provides a decision-making basis for urban governance. Of course, if more factors such as population and vehicle information can be introduced into future traffic prediction models, the accuracy of traffic speed prediction will be further improved. Traffic speed prediction can not only provide important data support for intelligent transportation systems [1,2] but can also improve efficiency for intelligent navigation and improve safety for intelligent vehicle control. The model proposed in this paper makes full use of spatial information such as road length, and it will provide decision-making ideas for urban road planning, municipal planning, and river regulation. The main methods of traffic forecasting are divided into traditional methods and deep learning methods.
Traditional traffic forecasting mainly targets traffic flow prediction. ARIMA, integrated with the Kalman filter, predicts short-term traffic flow, while the historical average (HA) model predicts future traffic using historical data averages [1,3]. However, statistical methods cannot capture aspects like speed, congestion, and accidents. Smola [4] introduced the SVR model with a linear kernel function for traffic speed prediction. Zhang [5] utilized a deep spatiotemporal residual network for urban traffic congestion prediction. Meanwhile, Hu et al. [6] employed data-driven techniques to predict traffic incidents considering traffic, weather, and social events.
Traffic speed prediction has always been the focus of traffic forecasting. In particular, traffic forecasting based on deep learning has become the main method by which to explore the application of traffic forecasting. Deep learning methods mainly focus on how to use historical data to predict future traffic speeds. For example, the RNN [7] time-series model predicts traffic speeds by capturing the long-term dependencies of time series and constructing recursive connections. Zhang et al. [8] utilized a deep residual network to learn spatiotemporal features from historical traffic data. This approach involved using a convolutional neural network (CNN) for spatial feature extraction and a recurrent neural network (RNN) for time-series modeling to realize traffic speed prediction. MA [9] and others proposed to use a long short-term memory neural network to predict traffic speed, which can capture long-term dependencies and is suitable for the modeling and forecasting of time-series data such as traffic data. Lv [10] and others used a multi-layer perceptron to build a prediction model and used a large dataset for training to improve the accuracy of traffic prediction. Yu [11] and others introduced a spatiotemporal graph convolutional network for traffic prediction, which can capture the spatiotemporal characteristics of the traffic network, and used the graph convolutional neural network to predict traffic flow. Narmadha et al. [12] predicted spatiotemporal vehicle traffic flow using multivariate CNN and LSTM models.
Introducing more features for accurate traffic speed prediction is a novel research avenue. Fang et al. [13] utilized a global attention mechanism with a local convolutional neural network to capture spatiotemporal traffic data traits. Yao et al. [14] employed meta-learning, learning inter-city relationships from multi-city data to enhance shared knowledge and transferability. Both methods emphasize convolutional feature extraction for speed prediction. Li et al. [15] reviewed machine learning techniques in short-term traffic prediction, discussing challenges and future prospects. Addressing RNN’s limitations with long sequences, Hochreiter [16] introduced LSTM, while Cho et al. [17] proposed GRU for complex traffic data prediction. Zhao et al. [17] introduced a spatiotemporal graph convolutional network (T-GCN), blending GCN and GRU to refine spatiotemporal traffic data correlations. Bai et al. [18] developed A3T-GCN, an attentional spatial–temporal graph convolutional network, enhancing traffic perception with an attention mechanism. However, it differs from prior models in spatial and temporal feature processing. Tian et al. [19] proposed the dual-GRU NA-DGRU model, leveraging neighborhood traffic network data for accuracy. The authors of [20] proposed a traffic speed prediction method based on temporal convolutional networks, adopting residual links to speed up learning and emphasizing the mapping relationship between input and output. The authors of [21] described a platoon-based traffic flow model that predicts optimal speeds with constant separation between platoons, emphasizing the importance of speed prediction for traffic management. Table 1 presents a comparative analysis of various methods suggested by different researchers in traffic prediction. Although these methods combine spatiotemporal features and introduce an attention mechanism, when dealing with spatial features, the models are not as strong as GCN in extracting spatial information.
To overcome the limitations of traditional methods on temporal dependence, deep learning methods introduce different models to extend spatial relationships. In fact, the main advantage of deep learning methods is their ability to utilize both captured temporal features and extracted spatial features.
In summary, traffic speed prediction methods today face the following problems:
(1)
How to introduce the influencing factor of road length in road spatial feature extraction.
(2)
How to combine temporal and spatial features to improve the accuracy of traffic speed prediction.
To address the challenges mentioned, we present MAT-WGCN, a traffic prediction model leveraging a multi-head attention mechanism and weighted adjacency matrix. Unlike the methods in [20,21], our work provides a new method for traffic speed prediction: combining road length and historical data for traffic speed prediction. MAT-WGCN employs GCN with the weighted adjacency matrix for enhanced spatial understanding and GRU for temporal insights. These features are integrated using the multi-head attention mechanism, which normalizes the hidden states. The output feeds into a fully connected layer for final prediction. By merging temporal and spatial perspectives, MAT-WGCN aims to refine traffic prediction accuracy.
The main contributions of this paper are as follows:
(1)
Introduce a weighted adjacency matrix: In general, long roads have little influence on vehicle speed, while short roads have a great influence on vehicle speed. The conventional 0-1 adjacency matrix is replaced with a weighted adjacency matrix whose weights depend on road length. This weighted adjacency matrix reflects the spatial relationships between roads in more detail.
(2)
Apply multi-head attention mechanism: By using multiple attention heads, the multi-head attention mechanism enables the model to pay attention to different feature subspaces, which is beneficial for improving the robustness and stability of the model.
(3)
Construct the MAT-WGCN model: Use GCN, GRU, and multi-head attention to jointly construct a new traffic prediction model, in which GCN and GRU extract spatial information and time information, and then, we use the multi-head attention model to establish temporal and spatial relevance. In this way, the traffic speed prediction at various time lengths can be realized.
(4)
Test the effect of the model on a real dataset: By testing on a real dataset and comparing with other methods, the superiority of the MAT-WGCN model in traffic speed prediction is demonstrated.
The rest of the paper is organized as follows. Section 2 introduces the methods and models used in this paper. Section 3 presents the datasets used, evaluation criteria, other baseline methods, and experimental results and analysis. Finally, conclusions are given in Section 4.

2. Research Methods

2.1. Problem Definition

MAT-WGCN combines temporal and spatial features for speed prediction, and the spatial features include the influence of road length. When a road is longer, the speed of the vehicle tends to be stable within a range. Conversely, when a road is shorter, the speed of the vehicle changes frequently, so the change in vehicle speed can be taken as inversely proportional to the length of the road. Therefore, we propose a traffic prediction model based on a multi-head attention mechanism and weighted adjacency matrix: MAT-WGCN. MAT-WGCN combines GCN and a weighted adjacency matrix to extract spatial features. The weighted adjacency matrix not only represents the road connection relationships but also reflects the distance between nodes, which enhances the model's extraction of spatial features. In addition, MAT-WGCN uses GRU to process temporal features to capture the impact of velocities at historical moments on velocities at future moments. Then, the hidden layers of GCN and GRU are input into a multi-head attention mechanism, which weights and averages the outputs of multiple attention heads to generate the attention output [23]. Finally, the output of the attention mechanism is passed to the fully connected layer to obtain the predicted value. The MAT-WGCN model thus comprehensively considers temporal and spatial features to improve the accuracy of traffic prediction.
This study uses the city’s road information as the spatial feature and the historical traffic speed as the temporal feature, and it combines the effective features of the two to predict the traffic speed of the future segment.
Definition 1.
Urban road topology graph G can be expressed as a directed graph $G = \{V, E\}$, where $V = \{v_1, v_2, \ldots, v_N\}$ is a node set containing N roads and E is the set of directed edges between roads. For a given road graph G, its adjacency matrix $A \in \mathbb{R}^{N \times N}$ describes the connection relationships between roads. For a road node $v_i$, all roads connected to $v_i$ constitute the neighborhood of $v_i$. The rows and columns of the adjacency matrix represent nodes, and the element values represent the connection relationship between the corresponding nodes. In an unweighted adjacency matrix, 1 indicates that two nodes are connected and 0 indicates that they are not. In a weighted adjacency matrix, a non-zero value indicates that two nodes are connected and gives the weight of that connection. In this paper, the diagonal elements of the adjacency matrix A default to 0.
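As a concrete illustration of Definition 1, the sketch below builds a weighted adjacency matrix from made-up segment lengths, using the reciprocal-of-length weighting that the paper introduces for its road length factor (all numbers here are purely illustrative):

```python
import numpy as np

# Toy road graph: 4 road nodes; entry (i, j) is the length in meters of
# the segment connecting roads i and j (0 means no connection).
lengths = np.array([
    [0.0, 300.0, 0.0, 0.0],
    [300.0, 0.0, 150.0, 0.0],
    [0.0, 150.0, 0.0, 600.0],
    [0.0, 0.0, 600.0, 0.0],
])

# Weighted adjacency: reciprocal of length, so short roads (whose speeds
# change frequently) receive larger weights; the diagonal stays 0.
A = np.zeros_like(lengths)
mask = lengths > 0
A[mask] = 1.0 / lengths[mask]
```

With this weighting, the shortest connection (150 m) gets the largest entry and the longest (600 m) the smallest, matching the intuition that short roads deserve more attention.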
Definition 2.
Feature matrix $X \in \mathbb{R}^{N \times P}$: If each road in the road graph G is abstracted as a node, the traffic speed on the road can be regarded as the attribute of that node and represented by the feature matrix $X \in \mathbb{R}^{N \times P}$, where N is the number of roads (that is, the number of nodes) and P is the length of the historical time series (that is, the number of features per node). The traffic speed at time t on all roads is denoted $X_t$.
In this study, the road in the urban road topology map G is used as the spatial feature, the traffic speed of the road is used as the time feature, the mapping function f is learned, and the mapping function is used to predict the speed at time T in the future. The calculation formula is shown in Formula (1):
$[X_{t+1}, \ldots, X_{t+T}] = f(X_{t-M+1}, \ldots, X_t, A)$
where T is the length of the predicted time series, M is the length of the historical time series, and $X_t$ is the original feature of the roads at time t, that is, the temporal feature. A is the weighted adjacency matrix, which represents the spatial characteristics of the roads. The prediction framework of this research model is shown in Figure 1, which shows the overall framework of the MAT-WGCN model. The model uses GRU, GCN, and a multi-head attention mechanism. The model first uses GRU to extract the temporal features of the roads, obtaining the GRU hidden states $h_1, \ldots, h_M$. At the same time, the model uses GCN to extract the spatial features of the roads, obtaining the GCN hidden states $h'_1, \ldots, h'_M$. The hidden states output by GRU and GCN are concatenated as $h_1, \ldots, h_M, h'_1, \ldots, h'_M$ and input into the multi-head attention mechanism. Finally, the output of the multi-head attention mechanism is passed to the fully connected layer to obtain the predicted speed.

2.2. Normalization Preprocessing

The influence of road length on adjacency matrix A can be realized through normalization preprocessing. Before using GCN to extract spatial features, the adjacency matrix A should be preprocessed. We propose three different normalization methods for preprocessing adjacency matrix A. From [24,25,26], we know the following methods: (1) min–max normalization; (2) logarithmic normalization; and (3) sigmoid normalization. The three normalization formulas are shown in (2), (3), and (4), where X denotes the input data and $X_{\max}$, $X_{\min}$ represent the maximum and minimum values of the input data, respectively. Min–max normalization subtracts the minimum value from the original data and divides by the difference between the maximum and minimum values. Logarithmic normalization takes the logarithm of the data, mapping the original data to a more even distribution. Sigmoid normalization uses the sigmoid function to map the data.
$X_{\mathrm{norm\_min\_max}} = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$

$X_{\mathrm{norm\_logarithmic}} = \dfrac{\log X}{\log X_{\max}}$

$X_{\mathrm{norm\_sigmoid}} = \dfrac{1}{1 + e^{-X}}$
These three normalization methods have different advantages and disadvantages, so different methods should be used for preprocessing different datasets. (1) Min–max normalization. Advantages: simple; retains the relative order and linear relationships of the data. Disadvantage: sensitive to extreme values; if there are many outliers, the processed result is poor. (2) Logarithmic normalization. Advantages: preserves relative relationships, is suitable for non-negative datasets, and narrows the gap between larger and smaller values. (3) Sigmoid normalization. Advantages: maps to a probability-like distribution; simple. Disadvantage: sensitive to extreme values. In this paper, the three methods are compared experimentally to find the optimal one.
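The three options in Formulas (2)–(4) can be sketched as follows; the sample data are illustrative, and the logarithmic variant assumes strictly positive inputs:

```python
import numpy as np

def min_max_norm(x):
    # Formula (2): shift by the minimum, scale by the range
    return (x - x.min()) / (x.max() - x.min())

def log_norm(x):
    # Formula (3): assumes strictly positive data (e.g., road lengths)
    return np.log(x) / np.log(x.max())

def sigmoid_norm(x):
    # Formula (4): squash every value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 2.0, 4.0, 8.0])
```

On this toy vector, min–max maps the data onto [0, 1] exactly, logarithmic normalization gives evenly spaced values (the inputs are powers of 2), and the sigmoid compresses large values toward 1.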

2.3. Two-Layer GCN Model

Constructing a two-layer GCN model can effectively extract the spatial features between road nodes. GCN models have been applied in many fields, such as image classification [27] and unsupervised learning [28]. In this study, we built a 2-layer GCN model, applied GCN to the nodes of the graph, captured the spatial features between nodes through the first-order neighborhood, and then built the GCN model by stacking multiple convolutional layers. The model can be expressed as Formula (5):
$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \theta^{(l)}\right)$
where $\tilde{A} = A + I_N$ is the adjacency matrix of the graph with added self-loops, $I_N$ is the identity matrix, $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ denotes the preprocessed adjacency matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $H^{(l)} \in \mathbb{R}^{N \times P}$ denotes the output of layer l, and P is the length of the feature, depending on the specific situation. $\sigma$ is the activation function; the activation function selected in this paper is the ReLU function, and $\theta^{(l)}$ contains the parameters of this layer.
In this study, a 2-layer GCN model was selected to extract spatial features, which can be expressed as Formula (6):
$f(X, A) = \sigma\left(\hat{A}\,\mathrm{ReLU}(\hat{A} X W_0) W_1\right)$
where $W_0 \in \mathbb{R}^{P \times H}$ represents the weight matrix of the hidden layer, P represents the length of the feature matrix, H represents the number of hidden units, $W_1 \in \mathbb{R}^{H \times T}$ represents the weight matrix between the hidden layer and the output layer, and $f(X, A) \in \mathbb{R}^{N \times T}$ represents the output of length T. The GCN model can learn the topological relationships between a node and its neighbors, and it finally obtains the spatial correlation.
The two-layer GCN model involves the process of extracting the hidden state of the space matrix. MAT-WGCN leverages the GCN model to extract spatial features. The flow of 2-layer GCN processing is shown in Figure 2. First, the weighted adjacency matrix is used as input, and then, the spatial features of the road network are learned through the GCN model. The GCN model can capture the topological relationship between roads and surrounding roads while encoding them with the topological structure and road attributes of the road network. Finally, MAT-WGCN can obtain the spatial correlation between roads in the road network and then extract the spatial features.
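The propagation rule in Formulas (5) and (6) can be sketched in NumPy as follows; the random graph, feature sizes, and weights here are illustrative stand-ins, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_adj(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, the preprocessing in Formula (5)."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    d = A_tilde.sum(axis=1)               # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_two_layer(X, A, W0, W1):
    """f(X, A) = sigma(A_hat ReLU(A_hat X W0) W1), as in Formula (6)."""
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)   # first layer with ReLU
    return np.maximum(A_hat @ H @ W1, 0.0)  # second layer, ReLU as sigma

N, P, H_units, T = 5, 12, 16, 3   # roads, history length, hidden units, horizon
X = rng.random((N, P))
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T    # symmetric adjacency, zero diagonal
out = gcn_two_layer(X, A, rng.random((P, H_units)), rng.random((H_units, T)))
```

Each application of `A_hat` mixes a node's features with those of its first-order neighborhood, so stacking two layers lets information propagate two hops along the road graph.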

2.4. GRU Model

The GRU model can effectively extract the temporal characteristics of historical traffic speed. In traffic speed prediction, it is critical to analyze the relationship between time and speed. For processing traffic time-series data, GRU is widely considered a reasonable choice for the following reasons. First, GRU has a more compact structure than other traditional RNN models, such as the simple RNN and long short-term memory (LSTM). It contains only an update gate and a reset gate, which makes the model more computationally efficient and faster during training and inference. Second, GRU effectively mitigates the problems of vanishing and exploding gradients through its gating mechanism. It can adaptively adjust the flow of information and selectively forget or update previous states, thereby better capturing long-term dependencies in time-series data. In addition, GRU has fewer parameters, which reduces the complexity of the model and the risk of overfitting. When dealing with large-scale traffic data, choosing a model with fewer parameters can improve training efficiency and generalization ability. Compared with other neural network models, GRU has better performance and adaptability in traffic time-series data processing. It accurately captures cyclical, trending, and seasonal changes in traffic data, providing more accurate forecasting results. In summary, due to its simplicity, gating mechanism, and smaller parameter count, GRU is a reasonable model for processing traffic time-series data. The structure of GRU is shown in Figure 3. The calculation of GRU is given in Formulas (7)–(10); Formula (7) is the update gate, where $z_t$ is the update gate with a value range of 0–1. The closer $z_t$ is to 1, the more the model remembers past information; the closer $z_t$ is to 0, the more past information is forgotten.
$X_t$ is the speed of all roads at time t and $h_{t-1}$ is the hidden state at time t − 1. $r_t$ is the reset gate with a value range of 0–1; the closer $r_t$ is to 0, the more the model discards past hidden information; the closer $r_t$ is to 1, the more the model adds past information to the current information. $\tilde{h}_t$ is the candidate state, which contains the information of $X_t$ and selectively retains the information of $h_{t-1}$. $h_t$ is the output state at the current moment, calculated from $\tilde{h}_t$, $z_t$, and $h_{t-1}$. $\sigma$ is the sigmoid function, $[X_t, h_{t-1}]$ denotes the concatenation of $X_t$ and $h_{t-1}$, $\cdot$ denotes the matrix product, and $*$ denotes the element-wise product; $W_z$, $b_z$, $W_r$, $b_r$, $W_{\tilde{h}}$, and $b_{\tilde{h}}$ are learnable parameters. In the experiment, the dimension of $X_t$ is N × 1; the dimensions of $h_{t-1}$, $\tilde{h}_t$, and $h_t$ are all N × u; and the dimensions of $W_z$, $W_r$, and $W_{\tilde{h}}$ are all (u + 1) × u, where N is the total number of roads in the road network and u is the number of hidden units used by GRU.
$z_t = \sigma([X_t, h_{t-1}] \cdot W_z + b_z)$

$r_t = \sigma([X_t, h_{t-1}] \cdot W_r + b_r)$

$\tilde{h}_t = \tanh([X_t, (r_t * h_{t-1})] \cdot W_{\tilde{h}} + b_{\tilde{h}})$

$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t$
The GRU model involves the process of extracting the hidden state in the historical traffic speed sequence. MAT-WGCN uses GRU to process the time characteristics of the road, mines the correlation between time and speed, and outputs the hidden state at each moment. Figure 4 shows the flow of GRU processing time features of M historical moments. The time features of each historical moment are input into GRU for calculation, and GRU outputs a hidden state. According to the characteristics of GRU, the hidden state will not only affect the feature calculation of the next moment but also be retained and input into the multi-head attention mechanism.
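A single GRU update following Formulas (7)–(10) can be sketched as below; the dimensions match the paper's setup (X_t is N × 1, gate weights are (u + 1) × u), while the random parameters are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, bz, Wr, br, Wh, bh):
    """One GRU update following Formulas (7)-(10)."""
    xh = np.concatenate([x_t, h_prev], axis=-1)     # [X_t, h_{t-1}]
    z = sigmoid(xh @ Wz + bz)                       # update gate, Formula (7)
    r = sigmoid(xh @ Wr + br)                       # reset gate, Formula (8)
    xrh = np.concatenate([x_t, r * h_prev], axis=-1)
    h_cand = np.tanh(xrh @ Wh + bh)                 # candidate state, Formula (9)
    return z * h_prev + (1.0 - z) * h_cand          # output state, Formula (10)

N, u = 4, 8                        # roads, hidden units
x_t = rng.random((N, 1))           # speeds of all roads at time t
h = np.zeros((N, u))               # initial hidden state
W = [rng.standard_normal((1 + u, u)) * 0.1 for _ in range(3)]
b = [np.zeros(u) for _ in range(3)]
h = gru_step(x_t, h, W[0], b[0], W[1], b[1], W[2], b[2])
```

Iterating `gru_step` over the M historical moments produces the sequence of hidden states that is later passed to the multi-head attention mechanism.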

2.5. Multi-Head Attention Mechanism

The attention mechanism can effectively focus on sensitive information in features. An attention mechanism is a technique used in models to process sequence or set data, and it has demonstrated powerful capabilities in tasks across many fields. The attention mechanism allows the model to dynamically assign weights to different parts during the learning process and thus pay more attention to the important information in the input data and the relationships between contexts. Initial work in the field of machine translation laid the groundwork for the development of attention mechanisms. In their 2014 paper, Bahdanau et al. [29] proposed a machine translation model that achieves better translation performance by learning to weight different parts of the source language sequence. Subsequently, the introduction of the self-attention mechanism in the Transformer model further promoted the application and development of the attention mechanism [17]. Veličković et al. [30] proposed the graph attention mechanism and developed a multi-head attention mechanism to increase the stability and robustness of the model. Attention mechanisms have also proven powerful in computer vision, as in [31], where they are used to focus on key parts of an image for accurate classification. At their core, attention mechanisms automatically assign a content- and context-based weight to each part of the input data. In traffic speed prediction, this mechanism focuses on the influence of different road lengths on speed, and it achieves accurate prediction by assigning different weights.
In the field of traffic prediction, in order to make full use of the attention mechanism, researchers have introduced graph neural networks. Transportation networks are viewed as graphs, where nodes represent intersections and edges represent roads. The attention mechanism helps the model focus on key nodes and edges. Traditional attention does not fully consider the spatial dependence between nodes when processing graph data, and it cannot easily deal with irregular graph structures. As task complexity rises, single attention may be insufficient. This is where the multi-head attention mechanism comes into play.
The multi-head attention mechanism broadens the observation range of the model, but it may introduce redundancy. The authors of [32] observed that, in training, each head learns different weights to reduce redundancy, and the outputs of the multiple heads are merged to integrate information. Multiple heads capture traffic patterns more accurately but increase computational requirements and model complexity; they also require more computing resources and careful parameter tuning. The multi-head attention mechanism can effectively balance the sensitivity of features, and the weighted assignment of temporal and spatial features can be achieved with it. In time-series data, attention weights can be assigned to features at different time steps to highlight important time points or windows. Similarly, in spatial data, attention weights can be assigned to features at different spatial locations to emphasize key spatial regions. Repeating the attention computation several times and averaging the resulting weights yields a better effect: this is the multi-head attention mechanism. By applying an attention mechanism to both temporal and spatial features, the model can adaptively focus on important temporal and spatial information, thereby improving task performance and representation.
Before using the attention mechanism, we first introduce the road length factor; that is, the assigned weight is inversely proportional to the road length. When a road is longer, vehicle speed tends to remain stable, so its weight should be smaller. When a road is shorter, vehicle speed changes frequently, so more weight should be assigned to it and more attention paid to the change in vehicle speed. Therefore, the road length factor is introduced; that is, the weight is taken as the reciprocal of the road length.
After extracting the road length factor, hidden-state weight selection is the key step in the multi-head attention mechanism. MAT-WGCN introduces a multi-head attention mechanism to assign the weights of spatial and temporal features. After the processing by GRU and GCN, MAT-WGCN concatenates and fuses the hidden states output by GRU and GCN as the input of the attention mechanism and uses the attention mechanism to assign weights to the hidden states. As described above, let the spatial features processed by GCN be $H_1 = \{h_1, h_2, \ldots, h_M\}$ and the temporal features processed by GRU be $H_2 = \{h'_1, h'_2, \ldots, h'_M\}$, where M is the length of the time series. Concatenating $H_1$ and $H_2$ gives $H = \{h_1, h_2, \ldots, h_M, h_{M+1}, h_{M+2}, \ldots, h_{2M}\}$, which is input into the attention mechanism. The process of assigning weights by the attention mechanism is shown in Figure 5. The multi-head attention mechanism performs feature extraction on traffic data in the temporal and spatial dimensions of the MAT-WGCN model. It can focus on different historical time periods to capture long-term and short-term dependencies, and at the same time, it can focus on multiple spatial locations to capture the interaction of traffic speeds at different locations. On short roads in particular, the model pays more attention to speed changes. By combining the outputs of GCN and GRU, this mechanism strengthens the correlation of time and space, provides rich context for traffic speed prediction, and enables the model to capture the complex relationships of traffic speed more comprehensively.
The significance of the weighted hidden state is that it provides us with a guideline as to which past time points are more important in predicting the traffic speed at the current time point. Once these weighted hidden states are obtained, we can use them to predict the traffic speed at the current point in time. Operationally, these weighted hidden states are pooled together and passed through one or several fully connected layers to arrive at the final prediction.
The attention mechanism workflow involves processing H. First, we apply a multi-layer perceptron (MLP) with two hidden layers to H to obtain the score $e_i$ ($i = 1, 2, \ldots, 2M$) of each hidden state. The calculation is shown in Formula (11):
$E = (H \cdot W_1 + b_1) \cdot W_2 + b_2$
where $E = \{e_1, e_2, e_3, \ldots, e_{2M}\}$, $\cdot$ represents the matrix product, and $W_1, b_1$ and $W_2, b_2$ are the learnable parameters of the two hidden layers, respectively. Then, we apply softmax to E to obtain a set of weights $\alpha_i$ ($i = 1, 2, \ldots, 2M$); the softmax calculation is shown in Formula (12). We repeat the previous steps three times and average the three sets of weights obtained, as shown in Formula (13). Finally, we multiply the weights by the hidden states and sum to obtain $C_t$, the final output of the attention mechanism, as shown in Formula (14):
$\alpha_i = \mathrm{softmax}(e_i) = \dfrac{\exp(e_i)}{\sum_{k=1}^{2M} \exp(e_k)}$

$\alpha = \dfrac{1}{3} \sum_{j=1}^{3} \alpha^{(j)}$

$C_t = \sum_{i=1}^{2M} \alpha_i h_i$
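Formulas (11)–(14) can be sketched as follows; the hidden states and per-head parameters are random stand-ins, with three heads whose weight vectors are averaged before the weighted sum:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(e):
    e = e - e.max()          # numerical stability
    w = np.exp(e)
    return w / w.sum()

def attention_head(H, W1, b1, W2, b2):
    """Score each of the 2M hidden states (Formula (11)) and
    normalize the scores with softmax (Formula (12))."""
    E = (H @ W1 + b1) @ W2 + b2          # one score per hidden state
    return softmax(E.squeeze(-1))

M, d = 6, 8                   # history length, hidden-state size
H = rng.random((2 * M, d))    # concatenated GCN + GRU hidden states

# Three heads with independent parameters; average their weights (Formula (13)).
alphas = [
    attention_head(H, rng.standard_normal((d, d)), np.zeros(d),
                   rng.standard_normal((d, 1)), np.zeros(1))
    for _ in range(3)
]
alpha = np.mean(alphas, axis=0)
C_t = (alpha[:, None] * H).sum(axis=0)   # weighted sum, Formula (14)
```

Since each head's softmax output sums to 1, the averaged weights also sum to 1, and $C_t$ is a convex combination of the 2M hidden states.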

2.6. MAT-WGCN Model

The loss function is the L2 norm, optimized with the Adam optimizer. MAT-WGCN uses GRU, GCN, and a multi-head attention mechanism. MAT-WGCN first uses GCN to process the connection information of the input nodes and obtains the hidden states $h_1, h_2, \ldots, h_M$ representing the spatial correlations between nodes. It then uses GRU to process the temporal features to obtain the hidden states $h'_1, h'_2, \ldots, h'_M$ representing the temporal features. After concatenation, the two sets of hidden states are input into the multi-head attention mechanism. The multi-head attention mechanism scores each hidden state, and three attention heads are introduced; that is, this step is repeated three times and the weights are averaged, after which the softmax function is used to obtain the full attention weights. The output $C_t$ of the attention model is obtained by multiplying the weights by the concatenated hidden states and summing. Finally, the fully connected layer processes $C_t$ to obtain the predicted speed $\hat{Y}_t$. The loss function of MAT-WGCN is the L2 norm, calculated from the actual speed $Y_t$ and the predicted value $\hat{Y}_t$, as shown in Formula (15); the Adam optimizer is chosen to make L as small as possible.
\( L = \lVert Y_t - \hat{Y}_t \rVert_2 \)  (15)
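For reference, the training objective in Formula (15) amounts to the following (a minimal sketch; in practice this value is minimized by the Adam optimizer over the model parameters):

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # Formula (15): L2 norm of the error between actual and predicted speeds
    return np.linalg.norm(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))

print(l2_loss([60.0, 55.0, 70.0], [58.0, 55.0, 70.0]))  # -> 2.0
```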
This study introduces MAT-WGCN for traffic speed prediction, leveraging GCN for spatial features and GRU for temporal dynamics. The multi-head attention mechanism fuses the outputs of GCN and GRU, integrating road length and historical time data, thereby enhancing the traffic speed forecasting accuracy.

3. Experiments

3.1. Data Description

Two real datasets, METR-LA and EXPY-TKY, were selected for the experiments. These datasets have been used in previous studies [23,33,34,35,36] (EXPY-TKY dataset source: https://github.com/deepkashiwa20/MegaCRN/tree/main/EXPYTKY, accessed on 15 June 2023; METR-LA dataset source: https://github.com/deepkashiwa20/MegaCRN/tree/main/METRLA, accessed on 15 June 2023). Each dataset consists of two parts: a weighted adjacency matrix and a road feature matrix. The weighted adjacency matrix represents the connection relationship and connection distance between roads, and the road feature matrix represents the relationship between speed and time on the road. The METR-LA data were obtained from 207 sensors on the highway in Los Angeles, USA, from 1 March 2012 to 7 March 2012, and the size of the weighted adjacency matrix is R 207 × 207 . Speed was recorded once every 5 min. The matrix containing temporal features has size R 207 × 2017 ; that is, each of the 207 sensors measured 2017 velocity values. The EXPY-TKY dataset was compiled from the speeds of 2841 points on the Tokyo Expressway from 1 October 2021 to 31 October 2021, and the size of the weighted adjacency matrix is R 2841 × 2841 . Speed was recorded every 10 min. The matrix containing temporal features has size R 2841 × 4464 ; that is, each of the 2841 sensors measured 4464 velocity values. The adjacency matrices of both datasets are weighted: compared with those used in previous studies, the weight matrix represents not only the connection relationship but also the road distance. Table 2 shows the statistical data of the METR-LA and EXPY-TKY datasets, including the number of nodes (that is, the number of roads), the number of edges (that is, the number of connected roads), the time step, and the sampling interval. Figure 6 shows the speed on a road in the METR-LA dataset in one day.
Figure 7 shows the speed on a certain road in the EXPY-TKY dataset on 1 October 2021. The abscissa of both graphs is time, and each sampling instance records one data point, so METR-LA contains more data points per day than EXPY-TKY: the sampling interval of METR-LA is 5 min, while that of EXPY-TKY is 10 min. The vertical axis represents speed. The speed range of the METR-LA dataset is between 40 and 80, and the speed of the EXPY-TKY dataset fluctuates between 0 and 130. Figure 8 shows the adjacency matrix distance boxplots of some nodes in EXPY-TKY and METR-LA. The figure mainly shows the median and quartiles of the data, the length of the box, and the outliers. The outliers were calculated using the interquartile range (IQR) and the inner limit of the boxplot. The inner limit defines the range of outliers; the calculation of the upper and lower limits is shown in Formulas (16) and (17), where Q 3 and Q 1 denote the third and first quartiles, respectively. A data point is considered an outlier when it falls outside the upper and lower bounds. In Figure 8a, among the 144 speed samples of the EXPY-TKY dataset, 36 data points are outliers. In Figure 8b, among the 288 speed samples of the METR-LA dataset, 16 data points are outliers. The experiment used 80% of the data as the training set and 20% as the test set.
\( \mathrm{Upper} = Q_3 + 1.5 \times IQR \)  (16)

\( \mathrm{Lower} = Q_1 - 1.5 \times IQR \)  (17)
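The IQR rule of Formulas (16) and (17) can be sketched as follows (the speed values below are hypothetical, chosen only to illustrate the bounds):

```python
import numpy as np

def iqr_outliers(x):
    # Formulas (16)-(17): points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lower) | (x > upper)]

speeds = np.array([62, 64, 65, 63, 61, 66, 20, 130])  # hypothetical speed samples
print(iqr_outliers(speeds))                           # [ 20 130]
```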

3.2. Evaluation Metrics

In this study, four indicators were selected to evaluate the predictive effect of the compared models: root mean squared error (RMSE), mean absolute error (MAE), accuracy, and the optimal scale. The calculation formulas are shown in Formulas (18)–(20), where M is the time step size and N is the number of nodes (that is, the number of roads); y i j and y i j represent the actual value and predicted value at the j-th moment on the i-th road, respectively; Y and Y are the sets of y i and y i , respectively, indicating the real and predicted speed values at all moments on the i-th road. The optimal scale calculation is shown in Formulas (21)–(23).
(1) Root Mean Squared Error (RMSE).
RMSE is the square root of the average squared error between the predicted and true values.
\( RMSE = \sqrt{ \dfrac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} \left( y_{ij} - \hat{y}_{ij} \right)^2 } \)  (18)
(2) Mean Absolute Error (MAE).
MAE is used to calculate the average of the absolute values of the error between the predicted and true values.
\( MAE = \dfrac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} \left| y_{ij} - \hat{y}_{ij} \right| \)  (19)
(3) Accuracy
Accuracy is calculated using the F-norm. The F-norm is the square root of the sum of the squares of the absolute values of each element in a matrix.
\( Accuracy = 1 - \dfrac{ \lVert Y - \hat{Y} \rVert_F }{ \lVert Y \rVert_F } \)  (20)
(4) Optimal Scale
The improvement rate is computed by taking the difference between a baseline model's evaluation metric (RMSE, MAE, or accuracy) and the corresponding metric of the MAT-WGCN model, and dividing it by the MAT-WGCN metric.
\( RMSE_{Optimal\_ratio} = \dfrac{ RMSE_{base} - RMSE_{MAT\text{-}WGCN} }{ RMSE_{MAT\text{-}WGCN} } \)  (21)

\( MAE_{Optimal\_ratio} = \dfrac{ MAE_{base} - MAE_{MAT\text{-}WGCN} }{ MAE_{MAT\text{-}WGCN} } \)  (22)

\( Accuracy_{Optimal\_ratio} = \dfrac{ Accuracy_{base} - Accuracy_{MAT\text{-}WGCN} }{ Accuracy_{MAT\text{-}WGCN} } \)  (23)
RMSE measures the root mean squared error between predicted and true values, with a smaller value indicating better prediction. MAE represents the mean absolute error between these values; again, a smaller value indicates a better predictive effect. The optimal scale expresses the relative improvement of MAT-WGCN over the corresponding metric of another model. Accuracy, computed with the F-norm, gauges how close the predicted values are to the actual values, with a larger value implying higher model accuracy. In essence, smaller RMSE and MAE values, together with higher accuracy, signify superior prediction by the model.
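The four metrics of Formulas (18)–(23) can be expressed compactly in NumPy (a minimal sketch; the toy arrays below are hypothetical and stand in for the actual and predicted speed matrices Y and Ŷ):

```python
import numpy as np

def rmse(y, y_hat):
    # Formula (18): root of the mean squared error
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    # Formula (19): mean absolute error
    return np.mean(np.abs(y - y_hat))

def accuracy(y, y_hat):
    # Formula (20): 1 - ||Y - Y_hat||_F / ||Y||_F (Frobenius norms)
    return 1 - np.linalg.norm(y - y_hat) / np.linalg.norm(y)

def optimal_ratio(metric_base, metric_model):
    # Formulas (21)-(23): relative improvement of MAT-WGCN over a baseline
    return (metric_base - metric_model) / metric_model

y = np.array([[60.0, 62.0], [58.0, 65.0]])      # toy actual speeds
y_hat = np.array([[59.0, 63.0], [57.0, 64.0]])  # toy predicted speeds
print(rmse(y, y_hat), mae(y, y_hat), round(accuracy(y, y_hat), 4))
```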

3.3. Experiment Settings

3.3.1. The Influence of Normalization on the Prediction Effect

We used different normalization methods to process the weighted adjacency matrix and compared their actual effects in the model. Three normalization methods were used: (1) min_max normalization; (2) logarithmic normalization; and (3) sigmoid normalization. The learning rate of the MAT-WGCN model was set to 0.001, the epoch was 2000, and the batch size was 32. According to the research [19], GRU performs better when the number of hidden units is 64. Therefore, with the number of hidden units fixed at 64, different normalized adjacency matrices were used, and the historical traffic speed of the preceding 60 min was used as input. Traffic speed was predicted for 10, 30, and 60 min ahead for the EXPY-TKY dataset and for 15, 30, 45, and 60 min ahead for the METR-LA dataset. Figure 9 and Figure 10 show the impact of the different normalization methods on the evaluation indicators (RMSE, MAE, accuracy) for the EXPY-TKY and METR-LA datasets, respectively. The horizontal axis comprises three parts, representing logarithmic, min_max, and sigmoid normalization. The vertical axis represents the values of the different evaluation indicators.
From Figure 9 and Figure 10, it can be seen that on the EXPY-TKY dataset, the logarithmically normalized accuracy of the adjacency matrix is the largest, and the RMSE and MAE values are the smallest. Therefore, on the EXPY-TKY dataset, the adjacency matrix logarithmic normalization should be selected for processing. On the METR-LA dataset, the min_max normalized accuracy for the adjacency matrix is the largest, and the RMSE and MAE values are the smallest, so for the METR-LA dataset, the adjacency matrix should be processed using min_max normalization.
Different normalizations behave differently on different datasets, owing to the different distributions of values in the datasets. The characteristic of min_max normalization is that it preserves the relative order and linear relationships of the data, but it is sensitive to extreme values, and if there are outliers, the effect is poor. Logarithmic normalization maintains the relative relationships while shortening the gap between the maximum and minimum values, but it has the disadvantage that some information may be lost. As can be seen from Figure 8, there are many discrete values in the EXPY-TKY dataset, and the values fluctuate over a large range, so min_max normalization is strongly affected and performs poorly; therefore, logarithmic normalization is selected. There are almost no discrete values in the METR-LA dataset, and the data are relatively concentrated, so min_max normalization works better.
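The three normalization variants can be sketched as follows. Note that the paper does not restate its exact formulas, so these are common forms given as an assumption (in particular, log1p is used here to handle zero distances; the paper's exact variants may differ):

```python
import numpy as np

def min_max_norm(A):
    # preserves relative order and linear relationships; sensitive to extreme values
    return (A - A.min()) / (A.max() - A.min())

def log_norm(A):
    # compresses the gap between large and small distances (log1p handles zeros)
    return np.log1p(A) / np.log1p(A.max())

def sigmoid_norm(A):
    # squashes all distances into (0, 1)
    return 1.0 / (1.0 + np.exp(-A))

# hypothetical symmetric distance matrix with one extreme value (35.0)
A = np.array([[0.0, 1.2, 0.0], [1.2, 0.0, 35.0], [0.0, 35.0, 0.0]])
print(min_max_norm(A).max(), log_norm(A).max())
```

On such a matrix, the single large distance pushes most min_max-normalized values toward 0, while the logarithmic variant keeps them better spread, mirroring the behavior observed on EXPY-TKY.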

3.3.2. Comparison of Baseline Prediction Model Parameter Settings

We compared the effect of MAT-WGCN with the following methods:
  • Historical average (HA) model [1]: The historical average model uses the average value of traffic speed at historical moments as the prediction value of future traffic speed. HA is a simple benchmark model that does not consider the correlation of time and space.
  • Support vector regression (SVR) model [4]: The support vector regression model uses a linear kernel function to predict future traffic speeds. SVR builds a predictive model by learning a linear kernel function from the training dataset.
  • Temporal graph convolutional network (T-GCN) [17]: T-GCN uses a two-layer GCN and a GRU for traffic prediction. T-GCN represents the road network as a graph, uses GCN to extract spatial features from the graph structure, and uses GRU to extract temporal features from time-series data. By fusing spatial and temporal features, T-GCN can establish a model of road speed and spatiotemporal correlation to realize traffic prediction.
  • Attention temporal graph convolutional network (A3T-GCN) [18]: A3T-GCN is connected to an attention model after T-GCN for traffic prediction. A3T-GCN uses the attention model to weight and fuse temporal and spatial features, which can dynamically adjust the weight of temporal and spatial features, and pay more attention to important feature subspaces, thereby improving the robustness and stability of the model. By introducing an attention mechanism, A3T-GCN can more accurately capture the spatiotemporal correlation of road speed and achieve more precise traffic prediction.
  • Dual-GRU traffic speed prediction model based on neighborhood aggregation and attention mechanism (NA-DGRU) [19]: NA-DGRU uses two layers of GRU and an attention model for traffic speed prediction. NA-DGRU extracts temporal and spatial features through neighborhood aggregation and attention mechanisms. Through neighborhood aggregation, NA-DGRU can capture spatial dependencies in road networks and use GRU to extract temporal features from time-series data. At the same time, introducing an attention mechanism can dynamically adjust the importance of different adjacent roads and historical speeds, improving the prediction accuracy.
The parameters selected for the baseline model are shown below. The SVR model uses a linear kernel function with 20,000 iterations. According to the literature [17,18], the learning rate, epoch, and batch size of T-GCN and A3T-GCN were set to 0.001, 2000, and 32, respectively, and the number of GRU hidden units was set to 100. According to the literature [19], the learning rate, epoch, and batch size of NA-DGRU were 0.001, 2000, and 32, respectively, and the number of GRU hidden units on the EXPY-TKY dataset was set to 64. The number of GRU hidden units on the METR-LA dataset was set to 100, and the parameter settings are shown in Table 3.

3.4. Experimental Results and Analysis

Table 4 shows the test results of different methods on the EXPY-TKY dataset at 10, 30, and 60 min, where the best results are indicated in bold. Figure 11 shows the test result curves of different methods on the EXPY-TKY dataset at various moments.
By analyzing Table 4, MAT-WGCN can be compared with the five groups of baseline models. It can be seen from the data in the table that MAT-WGCN achieved the best RMSE in the 10-min prediction on the EXPY-TKY dataset and the best results on all evaluation indicators in the 30- and 60-min tests. Figure 11 is composed of three parts: the RMSE curve, the MAE curve, and the accuracy curve. The curves compare MAT-WGCN with the other baseline methods on these three evaluation indicators. For MAT-WGCN, as the prediction horizon increases, the accuracy decreases, but the performance of MAT-WGCN is still the best at 60 min.
The optimization ratios of MAT-WGCN relative to the other models are shown in Table 5. A positive number indicates an optimization improvement of MAT-WGCN, and a negative number indicates an optimization reduction. The calculation formulas of RMSE, MAE, and accuracy are shown in Formulas (18)–(20); smaller RMSE and MAE values are better, and a higher accuracy is better. Compared with T-GCN, the optimization ratios of the 10-min, 30-min, and 60-min RMSEs of MAT-WGCN are 0.42%, 0.51%, and 0.52%, respectively. Compared with A3T-GCN and NA-DGRU, MAT-WGCN has the highest RMSE optimization ratios: relative to A3T-GCN, the RMSE optimization ratios at each moment are 31.06%, 31.17%, and 31.06%, respectively, and relative to NA-DGRU, they are 31.51%, 31.95%, and 31.80%, respectively.
Table 6 shows the test results of different methods on the METR-LA dataset at 15, 30, 45, and 60 min, where the best results are indicated in bold. Figure 12 shows the resulting curves from the tests of different methods on the METR-LA dataset at various moments.
By analyzing Table 6, MAT-WGCN can be compared with the five groups of baseline models. It can be seen from the data in the table that MAT-WGCN achieved the best accuracy in the 15-min prediction on the METR-LA dataset and the best results on all evaluation indicators at 30 and 60 min. At 45 min, it achieved the best MAE and accuracy. Figure 12 is composed of three parts: the RMSE curve, the MAE curve, and the accuracy curve. The curves compare MAT-WGCN with the other baseline methods on these three evaluation indicators. For MAT-WGCN, as the prediction horizon increases, the accuracy decreases, but the performance of MAT-WGCN is still the best at 60 min.
The optimization ratios of MAT-WGCN relative to the other models are shown in Table 7, with the calculation formulas given by Formulas (21)–(23). Compared with T-GCN, the MAE optimization ratios at 15, 30, 45, and 60 min are 9.40%, 13.33%, 8.41%, and 2.48%, respectively. The MAE optimization ratios of MAT-WGCN over A3T-GCN are the highest, at 7.79%, 10.48%, 9.32%, and 2.76%, respectively. Compared with NA-DGRU, the RMSE optimization ratios are −2.06%, 3.15%, 7.21%, and 5.37%, respectively.
It can be seen from Figure 11 and Figure 12 that both RMSE and MAE on the EXPY-TKY dataset and METR-LA dataset have a trend of increasing over time, and accuracy also decreases over time. This shows that with an increase in the forecast time length, the forecast accuracy of the model will decrease, and the error will gradually increase.
According to the analysis of Formula (1), when predicting the speed at time t + T, the model also predicts the speeds at times t + 1 to t + T − 1. Therefore, as the forecast horizon increases, errors gradually accumulate in the calculation of RMSE, MAE, and accuracy, which is reflected in the curve changes in Figure 11 and Figure 12. This cumulative error arises because the model's prediction is based on the value of the previous moment, and each prediction result is used as the input for the next moment. Any initial or accumulated forecast error is therefore propagated and amplified at subsequent moments, producing an upward trend in the error curves as the prediction horizon increases: the RMSE and MAE values gradually increase, while the accuracy gradually decreases. This phenomenon stems from the inherent nature of the forecasting problem and the error propagation mechanism of the model. Therefore, when evaluating and interpreting model performance, it is necessary to consider the prediction horizon and be aware of the cumulative effect of errors.
From Table 5 and Table 7, it can be seen that MAT-WGCN has the best performance in 60-min prediction compared with the other models on the EXPY-TKY dataset. According to Table 5, the RMSE optimization ratios in 60-min predictions are 24.60%, 9.88%, 0.52%, 31.06%, and 31.80%; the MAE optimization ratios are 16.96%, 5.81%, 0.27%, 26.85%, and 31.54%; and the accuracy optimization ratios are 2.14%, 0.74%, 0.90%, 5.62%, and 6.30%. On the METR-LA dataset, MAT-WGCN also performs better in 60-min predictions: the RMSE optimization ratios are 4.10%, 3.53%, and 5.37%, respectively; the MAE optimization ratios are 2.48%, 2.76%, and 2.12%, respectively; and the accuracy optimization ratios are 3.16%, 2.39%, and 2.58%, respectively. MAT-WGCN performs better in longer-term predictions, which is mainly due to the addition of the multi-head attention mechanism. The multi-head attention mechanism computes the weights multiple times and averages them, which improves the robustness, performance, and stability of the model to a certain extent and enhances its expressive ability, enabling it to better capture the complex relationships in the data.
From the above analysis, MAT-WGCN can effectively capture time and spatial information and establish a correlation to achieve the purpose of traffic prediction. The performance of MAT-WGCN on METR-LA is better than that on EXPY-TKY. This is mainly because of the following:
  • The sampling interval is different: The sampling interval of the EXPY-TKY dataset is 10 min; that is, there are six sampling points per hour. However, the sampling interval of the METR-LA dataset is 5 min, and there are 12 sampling points per hour. Since the input of the model is all historical speeds collected in the past 60 min, the EXPY-TKY dataset has six fewer inputs than the METR-LA dataset. This difference may cause the prediction results of the EXPY-TKY dataset to be less stable than those of the METR-LA dataset.
  • The internal data structure of the adjacency matrix is different: The EXPY-TKY dataset covers the Tokyo Ring Expressway. Compared with the METR-LA dataset, its road connections are fewer; that is, each node is usually connected to only one or two other nodes. In addition, in the EXPY-TKY dataset, there are many outliers among the distance values in the adjacency matrix, resulting in an uneven distribution of distances. This inhomogeneity may negatively affect the learning and prediction performance of the model, which in turn leads to inferior performance on the EXPY-TKY dataset compared with the METR-LA dataset.
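The effect of the sampling interval on the input size can be made concrete with a sliding-window sketch (a hypothetical helper, not the authors' preprocessing code; window lengths follow the 60-min input described above):

```python
import numpy as np

def make_windows(speeds, in_steps, out_steps):
    # speeds: (T, N) road feature matrix; returns inputs covering the past
    # `in_steps` samples and targets `out_steps` samples ahead
    X, Y = [], []
    for t in range(len(speeds) - in_steps - out_steps + 1):
        X.append(speeds[t:t + in_steps])
        Y.append(speeds[t + in_steps + out_steps - 1])
    return np.array(X), np.array(Y)

speeds = np.arange(20, dtype=float).reshape(20, 1)     # toy series for 1 road
# METR-LA: 5-min interval -> 12 inputs per hour; EXPY-TKY: 10-min -> only 6
X, Y = make_windows(speeds, in_steps=12, out_steps=3)  # predict 15 min ahead
print(X.shape, Y.shape)                                # (6, 12, 1) (6, 1)
```

With the same 60-min history, EXPY-TKY would use `in_steps=6` instead of 12, which is the input-size difference discussed in the first bullet.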

3.5. Visualization Analysis

The prediction results of the MAT-WGCN model on the two datasets EXPY-TKY and METR-LA can be visualized to show the degree of overlap between the predicted value and the actual value.
For the EXPY-TKY dataset, the prediction results for 10, 30, and 60 min are shown in Figure 13, and for the METR-LA dataset, the prediction results for 15, 30, 45, and 60 min are shown in Figure 14. In Figure 13 and Figure 14, the horizontal axis represents time, the vertical axis represents speed, the blue broken line shows the actual speed values, and the red broken line shows the predicted values at different times. The images show how well the predicted values fit the actual values at different prediction horizons. By observing the two broken lines of different colors, we can see the prediction effect of MAT-WGCN. After analyzing the speed trends in Figure 13 and Figure 14, it can be concluded that, at all prediction horizons, MAT-WGCN predicts results close to the actual values. In addition, MAT-WGCN can learn the trend of speed changes, track the peaks and troughs of the speed curve, and identify the starting and end points of speed changes. In particular, during the period 300–350 in Figure 14a, MAT-WGCN fits the actual values closely and performs well.

4. Conclusions and Future Work

In this paper, we proposed a new traffic prediction model: MAT-WGCN. Compared with traditional models, MAT-WGCN can combine the spatial information of the road and the temporal information of the speed to achieve traffic prediction. MAT-WGCN combines a multi-head attention mechanism, weighted adjacency matrix, GRU, and GCN. MAT-WGCN extracts the spatial features of the road by processing the adjacency matrix of GCN, and it extracts the temporal features of the road by processing the speed of historical time through GRU. The multi-head attention mechanism weights and sums all hidden states, performs multiple attention-mechanism processing, and finally uses the obtained result as input for the fully connected layer to obtain the final predicted value. The result of MAT-WGCN contains the temporal and spatial characteristics of the road at each moment and finally achieves the goal of traffic prediction.
MAT-WGCN not only outperforms traditional methods that do not incorporate spatial features, such as HA and SVR, in terms of prediction accuracy but also compares favorably with some current state-of-the-art baseline methods that incorporate spatial features (such as T-GCN, A3T-GCN, and NA-DGRU). These experimental results validate the design concept of MAT-WGCN, which can achieve higher traffic prediction accuracy by integrating spatial and temporal information for deep learning.
However, MAT-WGCN still has its limitations and challenges. First, although we explored three normalization methods for adjacency matrices, further experiments are still needed to determine the best normalization strategy. Second, in short-term forecasting, the improvement of the model has not been as expected. This may point to a broader issue where traffic flow over short timescales may be influenced by more external factors such as emergencies or road conditions.

Author Contributions

Conceptualization, X.T.; data curation, X.T., L.D. and X.Z.; formal analysis, X.T. and X.Z.; funding acquisition, X.T.; investigation, L.D. and S.W.; methodology, X.T. and L.D.; project administration, L.D.; resources, X.Z. and L.D.; software, L.D. and S.W.; supervision, X.T.; validation, L.D.; visualization, X.Z. and S.W.; writing—original draft, L.D.; writing—review and editing, X.T., X.Z., L.D. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Laboratory of Police Internet of Things Application, Ministry of Public Security, People's Republic of China (JWWLWKFKT2022001), and the Cross-Disciplinary Science Foundation from Beijing Institute of Petrochemical Technology (Project BIPTCSF-017).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study including the EXPY-TKY and METR-LA sets are openly available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, J.; Guan, W. A summary of traffic flow forecasting methods. J. Highw. Transp. Res. Dev. 2004, 21, 82–85. [Google Scholar]
  2. Huang, H. Dynamic modeling of urban transportation networks and analysis of its travel behaviors. Chin. J. Manag. 2005, 2, 18–22. [Google Scholar]
  3. Williams, B.M.; Hoel, L.A. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. J. Transp. Eng. 2003, 129, 664–672. [Google Scholar] [CrossRef]
  4. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  5. Zhang, J.; Zheng, Y.; Qi, D. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  6. Hu, Z.; Zhou, J.; Huang, K.; Zhang, E. A data-driven approach for traffic crash prediction: A case study in Ningbo, China. Int. J. Intell. Transp. Syst. Res. 2022, 20, 508–518. [Google Scholar] [CrossRef]
  7. Lv, Z.; Xu, J.; Zheng, K.; Yin, H.; Zhao, P.; Zhou, X. Lc-rnn: A deep learning model for traffic speed prediction. In Proceedings of the IJCAI, Stockholm, Sweden, 13–19 July 2018; p. 27. [Google Scholar]
  8. Zhang, S.; Wu, G.; Costeira, J.P.; Moura, J.M. Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras. In Proceedings of the IEEE international conference on computer vision, Venice, Italy, 22–29 October 2017; pp. 3667–3676. [Google Scholar]
  9. Ma, X.; Tao, Z.; Wang, Y.; Yu, H.; Wang, Y. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transp. Res. Part C Emerg. Technol. 2015, 54, 187–197. [Google Scholar] [CrossRef]
  10. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.-Y. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intell. Transp. Syst. 2014, 16, 865–873. [Google Scholar] [CrossRef]
  11. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875. [Google Scholar] [CrossRef]
  12. Narmadha, S.; Vijayakumar, V. Spatio-Temporal vehicle traffic flow prediction using multivariate CNN and LSTM model. Mater. Today Proc. 2021, 81, 826–833. [Google Scholar] [CrossRef]
  13. Fang, S.; Zhang, Q.; Meng, G.; Xiang, S.; Pan, C. GSTNet: Global Spatial-Temporal Network for Traffic Flow Prediction. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 2286–2293. [Google Scholar]
  14. Yao, H.; Liu, Y.; Wei, Y.; Tang, X.; Li, Z. Learning from multiple cities: A meta-learning approach for spatial-temporal prediction. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2181–2191. [Google Scholar]
  15. Li, Y.; Shahabi, C. A brief overview of machine learning methods for short-term traffic forecasting and future directions. SIGSPATIAL Spéc. 2018, 10, 3–9. [Google Scholar] [CrossRef]
  16. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  17. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858. [Google Scholar] [CrossRef]
  18. Bai, J.; Zhu, J.; Song, Y.; Zhao, L.; Hou, Z.; Du, R.; Li, H. A3t-gcn: Attention temporal graph convolutional network for traffic forecasting. ISPRS Int. J. Geo-Inf. 2021, 10, 485. [Google Scholar] [CrossRef]
  19. Tian, X.; Zou, C.; Zhang, Y.; Du, L.; Wu, S. NA-DGRU: A Dual-GRU Traffic Speed Prediction Model Based on Neighborhood Aggregation and Attention Mechanism. Sustainability 2023, 15, 2927. [Google Scholar] [CrossRef]
  20. Du, Y.; Qin, X.; Jia, Z.; Yu, K.; Lin, M. Traffic Speed Prediction Based on Heterogeneous Graph Attention Residual Time Series Convolutional Networks. AI 2021, 2, 650–661. [Google Scholar] [CrossRef]
  21. Lilhore, U.K.; Imoize, A.L.; Li, C.-T.; Simaiya, S.; Pani, S.K.; Goyal, N.; Kumar, A.; Lee, C.-C. Design and implementation of an ML and IoT based adaptive traffic-management system for smart cities. Sensors 2022, 22, 2908. [Google Scholar] [CrossRef]
  22. Anand, D.; Singh, A.; Alsubhi, K.; Goyal, N.; Abdrabou, A.; Vidyarthi, A.; Rodrigues, J.J. A Smart Cloud and IoVT-Based Kernel Adaptive Filtering Framework for Parking Prediction. IEEE Trans. Intell. Transp. Syst. 2022, 24, 2737–2745. [Google Scholar] [CrossRef]
  23. Wang, J.; Wang, W.; Liu, X.; Yu, W.; Li, X.; Sun, P. Traffic prediction based on auto spatiotemporal Multi-graph Adversarial Neural Network. Phys. A Stat. Mech. Its Appl. 2022, 590, 126736. [Google Scholar] [CrossRef]
  24. Jayalakshmi, T.; Santhakumaran, A. Statistical normalization and back propagation for classification. Int. J. Comput. Theory Eng. 2011, 3, 1793–8201. [Google Scholar]
  25. Phyo, N.W.W.; Hlaing, C.S. Implementation of Normalization using Min_Max Method. MERAL Portal. Ph.D. Thesis, University of Computer Studies, Yangon, Burma, 2020. [Google Scholar]
  26. Zolfani, S.; Yazdani, M.; Pamucar, D.; Zarate, P. A VIKOR and TOPSIS focused reanalysis of the MADM methods based on logarithmic normalization. Facta Univ. Series: Mech. Eng. 2020, 18, 341–355. [Google Scholar]
  27. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  28. Wang, X.; Li, J.; Yang, L.; Mi, H. Unsupervised learning for community detection in attributed networks based on graph convolutional network. Neurocomputing 2021, 456, 147–155. [Google Scholar] [CrossRef]
  29. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  30. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  31. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International conference on machine learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  32. Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  33. Guo, Z.; Liu, S.; Pan, L.; Li, Q. Addressing the Uncertainty in Urban Traffic Prediction via Bayesian Graph Neural Network. 2022. Available online: https://www.researchsquare.com/article/rs-1751349/v1 (accessed on 19 July 2023).
  34. Wang, Y.; Zheng, J.; Du, Y.; Huang, C.; Li, P. Traffic-ggnn: Predicting traffic flow via attentional spatial-temporal gated graph neural networks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18423–18432. [Google Scholar] [CrossRef]
  35. Zhang, H.; Chen, L.; Cao, J.; Zhang, X.; Kan, S. A combined traffic flow forecasting model based on graph convolutional network and attention mechanism. Int. J. Mod. Phys. C 2021, 32, 2150158. [Google Scholar] [CrossRef]
  36. Zhao, H.; Yao, Q.; Li, J.; Song, Y.; Lee, D.L. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, Halifax, NS, Canada, 13–17 August 2017; pp. 635–644. [Google Scholar]
Figure 1. MAT-WGCN prediction framework.
Figure 2. Two-layer GCN processing flowchart.
Figure 3. GRU structure diagram.
Figure 4. Time feature flowchart of GRU processing M historical moments.
Figure 5. Workflow diagram of the multi-head attention mechanism.
Figure 6. The speed on a road in a day in the METR-LA dataset.
Figure 7. The speed on a certain road in the EXPY-TKY dataset on 1 October 2021.
Figure 8. Adjacency matrix distance boxplots: (a) boxplot of adjacency matrix distance of EXPY-TKY dataset; (b) boxplot of adjacency matrix distance of METR-LA dataset.
Figure 9. Effects of different normalization methods on the EXPY-TKY dataset in terms of accuracy, RMSE, and MAE. (a) Effects on accuracy; (b) effects on RMSE; (c) effects on MAE.
Figure 10. Effects of different normalization methods on the METR-LA dataset in terms of accuracy, RMSE, and MAE. (a) Effects on accuracy; (b) effects on RMSE; (c) effects on MAE.
Figure 11. Test result curves of different methods at various moments on the EXPY-TKY dataset. (a) RMSE curves of different methods; (b) MAE curves of different methods; (c) accuracy curves of different methods.
Figure 12. Test result curves of different methods on the METR-LA dataset at various moments. (a) RMSE curves of different methods; (b) MAE curves of different methods; (c) accuracy curves of different methods.
Figure 13. Visualization of the real value of the EXPY-TKY dataset and the prediction results at 10, 30, and 60 min. (a) 10 min results; (b) 30 min results; (c) 60 min results.
Figure 14. Visualization of the real value of the METR-LA dataset and the prediction results at 15, 30, 45, and 60 min. (a) 15 min results; (b) 30 min results; (c) 45 min results; (d) 60 min results.
Table 1. Review of various research in traffic prediction.

| Ref. | Key Technique | Method/Algorithm | Traffic Flow Prediction | Speed Prediction | Merits/Limitations |
| --- | --- | --- | --- | --- | --- |
| Williams, B.M. et al. [3] | ARIMA, Kalman filter | ARIMA combined with a Kalman filter for traffic flow prediction | Yes | No | Based on statistical methods |
| Zhang, J. et al. [5] | Deep learning | Neural networks built from historical traffic data | Yes | Yes | Early form of deep learning methods |
| Hu, Z. et al. [6] | Random forest | Regression conformity prediction based on random forest | Yes | No | Potential for handling non-linear and complex patterns |
| Zhang, S. et al. [8] | Deep residual network | Deep residual network to learn spatio-temporal features | Yes | No | Requires large data for effective training |
| Ma, X. et al. [9] | LSTM | Long short-term memory network for traffic speed prediction | No | Yes | May underperform on very long time series |
| Yu, B. et al. [11] | Spatio-temporal graph convolutional network | Spatio-temporal graph convolutional network for traffic prediction | Yes | Yes | Application to non-graph data may be limited |
| Narmadha, S. et al. [12] | Deep multivariate spatio-temporal regression | Deep multivariate spatio-temporal regression model | Yes | No | Application to non-time-series data may be limited |
| Fang, S. et al. [13] | Global attention mechanism, local CNN | Global attention mechanism with a local convolutional neural network | Yes | Yes | Model complexity may slow training and inference |
| Hochreiter, S. et al. [16] | LSTM | Introduced the improved RNN model LSTM | Yes | Yes | May have limitations in handling long time dependencies |
| Tian, X. et al. [19] | Dual GRU, attention mechanism | Dual GRU with an attention mechanism | No | Yes | Challenges with large-scale or high-dimensional spatial data |
| Du, Y. et al. [20] | TCN, heterogeneous graph attention | Time-series convolutional networks, gating mechanism, residual blocks | Yes | Yes | Enhanced model performance and long-term dependency extraction |
| Lilhore et al. [21] | ML and IoT for smart-city traffic management | Machine learning algorithms with Internet of Things (IoT) integration | Yes | Yes | High adaptability with real-time adjustment; real-time traffic and speed prediction |
| Anad, D. et al. [22] | Cloud- and IoVT-based architecture | Kernel Least Mean Square (KLMS) algorithm | No | Yes | Cloud-based system integrating multiple sensors for real-time processing |
Table 2. Statistics of the METR-LA and EXPY-TKY datasets.

| Dataset | Nodes | Edges | Time Steps | Time Interval (Minutes) |
| --- | --- | --- | --- | --- |
| METR-LA | 207 | 2833 | 2016 | 5 |
| EXPY-TKY | 2841 | 2982 | 4464 | 10 |
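The time-step counts in Table 2 can be sanity-checked against the sampling intervals. Assuming (as Figure 7 suggests) that EXPY-TKY covers the 31 days of October 2021 at a 10-minute interval, and that the METR-LA subset spans one week at a 5-minute interval — both assumptions, since the covered date ranges are not stated in this excerpt:

```python
# Time steps = days * (minutes per day / sampling interval in minutes).
def expected_steps(days: int, interval_min: int) -> int:
    return days * (24 * 60) // interval_min

print(expected_steps(31, 10))  # EXPY-TKY: 31 days at 10 min -> 4464
print(expected_steps(7, 5))    # METR-LA subset: 7 days at 5 min -> 2016
```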
Table 3. Parameter settings of the baseline methods.

| Parameter | HA | SVR | T-GCN | A3T-GCN | NA-DGRU (EXPY) | NA-DGRU (METR) | MAT-WGCN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Learning rate | – | – | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| Epochs | – | – | 2000 | 2000 | 2000 | 2000 | 2000 |
| Batch size | – | – | 32 | 32 | 32 | 32 | 32 |
| GRU units | – | – | 100 | 100 | 64 | 100 | 64 |
| Iterations | – | 20,000 | – | – | – | – | – |
Table 4. Test results on the EXPY-TKY dataset.

| T | Metric | HA | SVR | T-GCN | A3T-GCN | NA-DGRU | MAT-WGCN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 min | RMSE | 13.0592 | 11.4832 | 10.5007 | 13.7053 | 13.7519 | 10.4570 |
| | MAE | 8.4511 | 7.5949 | 7.0977 | 9.0983 | 9.4577 | 7.2119 |
| | Accuracy | 0.8445 | 0.8633 | 0.8749 | 0.8368 | 0.8362 | 0.8725 |
| 30 min | RMSE | 13.0592 | 11.6034 | 10.6174 | 13.8254 | 13.9065 | 10.5637 |
| | MAE | 8.4511 | 7.6325 | 7.3214 | 9.1223 | 9.4896 | 7.2196 |
| | Accuracy | 0.8445 | 0.8584 | 0.8658 | 0.8254 | 0.8221 | 0.8694 |
| 60 min | RMSE | 13.0592 | 11.7164 | 10.7352 | 13.9364 | 13.9145 | 10.6812 |
| | MAE | 8.4511 | 7.6451 | 7.3453 | 9.1654 | 9.5046 | 7.2255 |
| | Accuracy | 0.8445 | 0.8566 | 0.8582 | 0.8145 | 0.8086 | 0.8630 |
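The paper's exact metric formulas are not reproduced in this excerpt; the definitions below are the ones commonly used in the T-GCN family of work (accuracy as one minus the relative Frobenius-norm error), so treat them as a sketch of the standard conventions rather than the authors' own code:

```python
import numpy as np

# RMSE, MAE, and accuracy as commonly defined in T-GCN-style evaluations.
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def accuracy(y_true, y_pred):
    # 1 - ||Y - Y_hat||_F / ||Y||_F
    return 1 - np.linalg.norm(y_true - y_pred) / np.linalg.norm(y_true)

# Toy example with three speed observations (km/h).
y_true = np.array([60.0, 55.0, 50.0])
y_pred = np.array([58.0, 56.0, 49.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred), accuracy(y_true, y_pred))
```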
Table 5. For the EXPY-TKY dataset, the optimization ratio of MAT-WGCN relative to the other models at each prediction time.

| T | Metric | HA | SVR | T-GCN | A3T-GCN | NA-DGRU |
| --- | --- | --- | --- | --- | --- | --- |
| 10 min | RMSE | 24.88% | 9.81% | 0.42% | 31.06% | 31.51% |
| | MAE | 17.18% | 5.31% | −0.20% | 26.16% | 31.14% |
| | Accuracy | 3.21% | 1.05% | −0.28% | 4.09% | 4.16% |
| 30 min | RMSE | 24.80% | 9.94% | 0.51% | 31.17% | 31.95% |
| | MAE | 17.06% | 5.72% | 0.02% | 26.35% | 31.44% |
| | Accuracy | 2.86% | 1.27% | 0.41% | 5.06% | 5.44% |
| 60 min | RMSE | 24.60% | 9.88% | 0.52% | 31.06% | 31.80% |
| | MAE | 16.96% | 5.81% | 0.27% | 26.85% | 31.54% |
| | Accuracy | 2.14% | 0.74% | 0.90% | 5.62% | 6.30% |
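The percentages in Tables 5 and 7 are numerically consistent with measuring each baseline against MAT-WGCN's own score: (baseline − ours)/ours for the error metrics and (ours − baseline)/ours for accuracy. A minimal sketch reproducing three 10 min EXPY-TKY entries from Table 4:

```python
# Optimization ratios consistent with Table 5 (EXPY-TKY, 10 min horizon).
# Error metrics (RMSE/MAE): gain = (baseline - ours) / ours.
# Accuracy: gain = (ours - baseline) / ours.
def error_gain(baseline: float, ours: float) -> float:
    return (baseline - ours) / ours * 100

def accuracy_gain(baseline: float, ours: float) -> float:
    return (ours - baseline) / ours * 100

mat_rmse, mat_acc = 10.4570, 0.8725  # MAT-WGCN, 10 min, from Table 4
print(round(error_gain(13.0592, mat_rmse), 2))   # vs. HA RMSE     -> 24.88
print(round(error_gain(10.5007, mat_rmse), 2))   # vs. T-GCN RMSE  -> 0.42
print(round(accuracy_gain(0.8445, mat_acc), 2))  # vs. HA accuracy -> 3.21
```

Note that this convention divides by MAT-WGCN's value, not the baseline's, which is why the ratios differ from a plain relative-error reduction.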
Table 6. Test results on the METR-LA dataset.

| T | Metric | SVR | T-GCN | A3T-GCN | NA-DGRU | MAT-WGCN |
| --- | --- | --- | --- | --- | --- | --- |
| 15 min | RMSE | 6.1958 | 5.2775 | 5.2403 | 5.0563 | 5.1624 |
| | MAE | 3.6133 | 3.3175 | 3.2685 | 3.0153 | 3.0324 |
| | Accuracy | 0.8937 | 0.9102 | 0.9118 | 0.9222 | 0.9259 |
| 30 min | RMSE | 7.2257 | 6.1698 | 6.1502 | 6.1292 | 5.9421 |
| | MAE | 4.1163 | 3.9073 | 3.8088 | 3.5407 | 3.4476 |
| | Accuracy | 0.8789 | 0.8851 | 0.8991 | 0.8956 | 0.9143 |
| 45 min | RMSE | 7.9241 | 6.7925 | 6.7415 | 6.9303 | 6.4643 |
| | MAE | 4.4035 | 4.1688 | 4.2036 | 3.9395 | 3.8454 |
| | Accuracy | 0.8661 | 0.8743 | 0.8889 | 0.8836 | 0.9066 |
| 60 min | RMSE | 8.8138 | 7.3451 | 7.3049 | 7.4349 | 7.0558 |
| | MAE | 4.8354 | 4.4494 | 4.4617 | 4.434 | 4.3418 |
| | Accuracy | 0.8508 | 0.8681 | 0.8750 | 0.8733 | 0.8964 |
Table 7. For the METR-LA dataset, the optimization ratio of MAT-WGCN relative to the other models at each prediction time.

| T | Metric | SVR | T-GCN | A3T-GCN | NA-DGRU |
| --- | --- | --- | --- | --- | --- |
| 15 min | RMSE | 20% | 2.23% | 1.51% | −2.06% |
| | MAE | 19.14% | 9.40% | 7.79% | −0.56% |
| | Accuracy | 3.47% | 1.70% | 1.52% | 0.40% |
| 30 min | RMSE | 21.63% | 3.83% | 3.50% | 3.15% |
| | MAE | 19.40% | 13.33% | 10.48% | 2.70% |
| | Accuracy | 3.87% | 3.19% | 1.66% | 2.05% |
| 45 min | RMSE | 22.56% | 5.08% | 4.29% | 7.21% |
| | MAE | 14.51% | 8.41% | 9.32% | 2.45% |
| | Accuracy | 4.47% | 3.56% | 1.95% | 2.54% |
| 60 min | RMSE | 24.92% | 4.10% | 3.53% | 5.37% |
| | MAE | 11.35% | 2.48% | 2.76% | 2.12% |
| | Accuracy | 5.08% | 3.16% | 2.39% | 2.58% |

Citation: Tian, X.; Du, L.; Zhang, X.; Wu, S. MAT-WGCN: Traffic Speed Prediction Using Multi-Head Attention Mechanism and Weighted Adjacency Matrix. Sustainability 2023, 15, 13080. https://doi.org/10.3390/su151713080