STEFT: Spatio-Temporal Embedding Fusion Transformer for Traffic Prediction

Cui, Xiandai; Lv, Hui

doi:10.3390/electronics13193816


by Xiandai Cui ¹,² and Hui Lv ¹,²,*

¹ School of Science, Hubei University of Technology, Wuhan 430068, China

² School of Chip Industry, Hubei University of Technology, Wuhan 430068, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(19), 3816; https://doi.org/10.3390/electronics13193816

Submission received: 25 July 2024 / Revised: 6 September 2024 / Accepted: 25 September 2024 / Published: 27 September 2024


Abstract

Accurate traffic prediction is crucial for optimizing taxi demand, managing traffic flow, and planning public transportation routes. Traditional models often fail to capture complex spatial–temporal dependencies. To tackle this, we introduce the Spatio-Temporal Embedding Fusion Transformer (STEFT). This deep learning model leverages attention mechanisms and feature fusion to effectively model dynamic dependencies in traffic data. STEFT includes an Embedding Fusion Network that integrates spatial, temporal, and flow embeddings, preserving original flow information. The Flow Block uses an enhanced Transformer encoder to capture periodic dependencies within neighboring regions, while the Prediction Block forecasts inflow and outflow dynamics using a fully connected network. Experiments on NYC (New York City) Taxi and NYC Bike datasets show STEFT’s superior performance over baseline methods in RMSE and MAPE metrics, highlighting the effectiveness of the concatenation-based feature fusion approach. Ablation studies confirm the contribution of each component, underscoring STEFT’s potential for real-world traffic prediction and other spatial–temporal challenges.

Keywords: traffic prediction; transformer; multi-head attention

1. Introduction

The increasing availability of traffic-related datasets and their significant implications in real-world applications have recently propelled traffic prediction, a spatial–temporal prediction challenge, to the forefront of research interests. This evolution is paralleled by the recognition that an accurate traffic prediction model holds immense practical value for real-world applications [1]. To illustrate, taxi demand prediction offers taxi companies the opportunity to proactively allocate vehicles, thereby optimizing service delivery. Additionally, traffic volume prediction enables transportation authorities to more effectively manage and control traffic flow, ultimately mitigating congestion and enhancing overall traffic operations. Through the accurate prediction of traffic flow, public transportation companies can enhance route planning and scheduling optimization, ultimately leading to improved service efficiency and decreased passenger waiting times. Similarly, logistics and express delivery companies can leverage traffic forecasts to streamline delivery and pickup routes, minimize transportation durations and costs, and ultimately elevate their service efficiency to a higher level [2].

Data analysis technology has demonstrated excellent predictive accuracy in financial risk control, healthcare, and climate forecasting [3]. Data mining of traffic conditions and analysis based on hidden patterns has become one of the most important methods [4]. Traffic data, which encompasses both temporal and spatial dimensions, determines the effectiveness of traffic prediction. On the temporal scale, data from each node are associated with a certain time period before and after. On the spatial scale, there is also a certain correlation between adjacent nodes. To ensure the accuracy of predictions, it is essential to consider both temporal and spatial information simultaneously.

In the field of traffic prediction, deep learning algorithms, such as CNN (Convolutional Neural Network) [5], LSTM (Long Short-Term Memory) [6], SAE (Stacked Autoencoder) [7], and GRU (Gated Recurrent Unit) [8], have garnered extensive application. While CNN is a popular choice for extracting spatial features, LSTM networks frequently capture temporal dependencies within data.

In recent years, the Transformer architecture, which initially garnered acclaim in NLP (Natural Language Processing), has broadened its horizons to find utilization within the transportation domain. For example, Li [9] proposed a Transformer-based traffic prediction model that integrates a spatio-temporal network with information fusion and regional sampling. This model takes into account the dynamic spatial and temporal dependencies between regions within each time slot, as well as the periodic spatial–temporal dependencies from multiple time slots. This traffic prediction model is characterized by its minimal parameter count, rapid computational performance, and exceptional precision.

Despite its merits, this approach is not without constraints. Spatial and temporal embeddings are individually processed through convolution blocks to achieve the same size as the flow information, and the spatio-temporal information is accumulated with the flow information through sequential addition. This can lead to the loss of spatio-temporal information. Firstly, such accumulation mixes temporal and spatial information, making it impossible for subsequent models to distinguish between temporal and spatial embeddings; secondly, only using a Transformer to encode the accumulated vectors also results in the partial loss of original flow information.

Although adopting element-wise addition for integrating spatial and temporal embeddings with flow information is computationally efficient, it carries the risk of information loss. Element-wise addition necessitates matching dimensions between tensors, ensuring each element of the two tensors is summed at the same position. However, this approach inherently blends the temporal and spatial features, potentially obscuring their distinction within the aggregated result. When the values of the two feature maps being summed vary significantly, the smaller valued features may be overwhelmed by the larger ones, leading to a loss of important spatio-temporal details. Consequently, this combination method, while facilitating efficient computation, necessitates careful consideration of the potential information loss it incurs.
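The two failure modes described above can be seen in a toy example. The sketch below uses made-up vectors (they are not taken from the datasets) and two hypothetical helper functions, `add` and `concat`. It shows that two different embedding pairs can collapse to the same sum while remaining distinguishable after concatenation, and that a small-valued feature is numerically swamped when summed with a large one.

```python
# Toy illustration: element-wise addition vs. concatenation of two embeddings.
# All values are made up for illustration only.

def add(a, b):
    """Element-wise sum; both inputs must have the same length."""
    if len(a) != len(b):
        raise ValueError("element-wise addition needs equal lengths")
    return [x + y for x, y in zip(a, b)]

def concat(a, b):
    """Concatenation keeps every input value in its own position."""
    return list(a) + list(b)

# Two different (spatial, temporal) pairs ...
spatial_1, temporal_1 = [1.0, 3.0], [3.0, 1.0]
spatial_2, temporal_2 = [3.0, 1.0], [1.0, 3.0]

# ... collapse to the same vector under addition,
same_sum = add(spatial_1, temporal_1) == add(spatial_2, temporal_2)
# but stay distinguishable under concatenation.
same_cat = concat(spatial_1, temporal_1) == concat(spatial_2, temporal_2)

# A small-magnitude feature barely changes the result when summed with a large one.
small, large = [0.001, 0.002], [100.0, 200.0]
swamped = add(small, large)
kept = concat(small, large)

print(same_sum, same_cat)
print(swamped, kept)
```

The cost of concatenation is visible here too: `kept` has twice as many entries as `swamped`, so any layer that consumes it needs a wider input.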

In traffic prediction, traditional models have consistently struggled to capture the intricate dynamic interactions between spatial and temporal features, which limits their prediction accuracy. Our research aims to bridge this gap by developing a model capable of efficiently modeling these complex dependencies. A major issue with existing approaches is the loss of valuable information when spatial and temporal embeddings are integrated, which compromises the uniqueness and integrity of the flow data.

To address these issues, we explore innovative ways to fuse spatial, temporal, and flow embeddings while minimizing information loss and preserving the essential characteristics for accurate traffic predictions. Attention mechanisms, which have demonstrated remarkable abilities in identifying relevant patterns within large datasets, present an attractive opportunity for improvement. By integrating these mechanisms with effective feature fusion techniques, we aim to unlock a potent pathway for enhancing prediction performance.

Our research aims to devise a comprehensive framework that leverages attention mechanisms and advanced feature fusion approaches tailored specifically for the complexities of traffic prediction. This innovative approach will strive to overcome the limitations of traditional models, resulting in more accurate and reliable traffic forecasts.

Consequently, our objective is to devise a technique that effectively balances the preservation of the original flow data’s integrity with the integration of temporal and spatial context. In this work, to address the challenges we have identified, we propose a novel deep learning network for traffic prediction: Spatio-Temporal Embedding Fusion Transformer (STEFT).

Our main contributions are outlined as follows:

A Novel Feature Fusion Approach Integrated with Transformer: We introduce a unique approach that leverages concatenation-based feature fusion within the Transformer architecture to enhance traffic prediction. By integrating spatial, temporal, and flow embeddings through concatenation rather than summation, STEFT preserves richer details from each embedding, enabling it to capture intricate spatio-temporal dependencies within traffic data more accurately. This method significantly advances the state-of-the-art feature fusion techniques for traffic prediction tasks.

Enhanced Multi-Time-Interval Transformer Encoder in Flow Block: Within the Flow Block, we employ an improved Transformer Encoder equipped with a multi-head attention mechanism. This design enables STEFT to effectively discern and capture periodic patterns embedded within the spatial–temporal dependencies across multiple time intervals. By adopting a purely data-driven approach, our model avoids relying on prior assumptions about traffic periodicity, rendering it more adaptable and robust in real-world scenarios.

Extensive Experimental Validation and Superior Performance: To substantiate the effectiveness of STEFT, we conducted exhaustive experiments on the NYC Taxi and NYC Bike datasets. The results convincingly demonstrate that our model outperforms state-of-the-art baselines, achieving lower RMSE and MAPE scores. These findings not only validate the superiority of our concatenation-based feature fusion strategy and model architecture but also underscore the practical value of STEFT for real-world traffic prediction applications, where accuracy and reliability are paramount.

This paper is organized as follows: Section 2 introduces related works. In Section 3, Preliminaries, we provide an overview of our proposed method. Next, Section 4 elaborates on the details of STEFT. Section 5 presents experiments and results, and Section 6 concludes the paper.

2. Related Works

Data-based traffic forecasting challenges have garnered significant attention over several decades. Fundamentally, the objective of traffic prediction endeavors is to ascertain a traffic-associated metric for a specific locale at a designated time point, leveraging historical datasets.

Due to its significant value and crucial role, traffic prediction has attracted extensive attention from researchers. Early models, such as ARIMA (Autoregressive Integrated Moving Average Model) [10], considered temporal dependencies but neglected spatial dependencies between regions. Other models, like the Support Vector Machine [11] and linear regression model [12], despite achieving performance improvements, did not consider spatio-temporal dependencies.

Recently, deep learning has been applied to modeling spatial and temporal correlations in traffic forecasting. Many models consider the spatial and temporal information of traffic only separately, so the network cannot capture their joint effects. For example, CNN [13] is usually used to extract spatial features, while LSTM is often used to capture temporal information [14].

Some works have begun to consider combining spatial and temporal information to build networks, such as combining CNN with LSTM [15]. Among various neural network models, “add” and “concat” are two common feature fusion methods. Element-wise add is adopted by networks such as ResNet (Residual Network) [16] or FPN (Feature Pyramid Network) [17] for feature fusion, while DenseNet (Densely Connected Convolutional Networks) [18] and other architectures employ concat for the same purpose. Commonly, concatenation leads to an augmentation in the channel count, whereas the summation of features maintains the original channel number unchanged.

In a detailed study of Convolutional Neural Networks, researchers found that the different layers of these networks capture information from remote sensing images at varying levels of detail [19]. Combining features from different layers has therefore become a useful way to improve overall network performance. To test this, researchers ran extensive experiments on remote sensing scene classification in which features from different CNN layers were combined by adding or concatenating them [20]. The results showed that both the 'Add' and 'Concat' feature fusion approaches have distinct advantages, with neither definitively superior to the other.

In STEFT, the effective integration of the attention mechanism and feature fusion techniques is primarily achieved through the following means: Feature fusion techniques are first embodied in the design of the Embedding Fusion Network (EFN). Rather than simply adding spatial embeddings, temporal embeddings, and flow embeddings together, EFN concatenates them, preserving the uniqueness of each type of embedding and mitigating information loss. This concatenated fused embedding serves as input to the attention mechanism, enabling the model to fully leverage information from spatial, temporal, and flow aspects. Subsequently, the built-in multi-head attention mechanism within EFN is capable of handling complex interactions between different regions. By calculating attention scores between regions to update the embedded representation of each region, this process ensures that the model can capture dynamic correlations among regions. After passing through the flow block, fusion embeddings from different time points are further processed by the Transformer decoder, which calculates correlations between different time points. The combination of these two techniques allows STEFT to effectively integrate spatio-temporal context while preserving the integrity of original flow information, thereby significantly enhancing prediction performance in traffic forecasting tasks. By applying the attention mechanism to analyze fused embeddings and retaining the integrity of multi-dimensional information through feature fusion techniques, STEFT achieves efficient modeling of complex traffic dynamics.

Our proposed model integrates the attention mechanism and feature fusion techniques effectively, enhancing performance significantly through the combined utilization of both approaches.

3. Preliminaries

3.1. Problem Formulation

We view the entire city of New York as an a × b grid rectangle containing a total of n regions (where n = a × b). The regions are designated by their serial numbers, starting from i = 1 and incrementing to i = n. The total duration is segmented into m distinct intervals. Any taxi or bike starts from one region and arrives at the destination after a while. We utilize I and O to denote the inflow and outflow data, respectively, of all regions over time. At any given time t, taxis and bikes move from one region to another, contributing to the inflow of I^t_i and outflow of O^t_i of traffic for each region.

Definitions:

Region: The area is partitioned into non-overlapping regions R = {r₁, r₂, …, r_n}, where r_i represents the i-th region.

Traffic data: We use I and O to denote the inflow and outflow data of all regions over time, respectively. Specifically, I^t ∈ R^{1×n} and O^t ∈ R^{1×n} represent the inflow and outflow for the n regions at time t. The elements I^t_i and O^t_i denote the inflow and outflow for region r_i at time t.

Traffic Flow: The cumulative effect of discrete journeys between regions contributes to the formation of traffic flow, which captures the dynamic shifts occurring between distinct regional pairs over time. Traffic flow reflects both the connectivity between regions and the propagation of individuals.

Problem (Traffic Forecasting):

The objective of traffic forecasting is to predict the inflow and outflow of each region at the next time interval from the data observed in the preceding intervals. Specifically, given the traffic data from time steps 0 to t − 1, represented by I^{0:t−1} and O^{0:t−1}, the traffic forecasting problem aims to predict I^t and O^t for every region.
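As a rough illustration of the formulation above, the sketch below counts inflow and outflow per region and time slot from a handful of trip records and then splits the result into a history and a prediction target. The trip tuple layout, the row-major region numbering in `region_id`, and the convention that outflow is counted at departure and inflow at arrival are assumptions made for this example, not details confirmed by the text.

```python
# Hypothetical trip records: (origin_region, destination_region, depart_slot, arrive_slot).
# Regions are numbered 1..n as in the text; time slots are 0-based indices.

def region_id(row, col, b):
    """Serial number (1-based) of the cell at (row, col) in an a x b grid, row-major."""
    return row * b + col + 1

def build_flows(trips, n_regions, n_slots):
    """Count outflow at departure and inflow at arrival for every region and slot."""
    inflow = [[0] * n_regions for _ in range(n_slots)]
    outflow = [[0] * n_regions for _ in range(n_slots)]
    for origin, dest, t_dep, t_arr in trips:
        outflow[t_dep][origin - 1] += 1
        if t_arr < n_slots:  # ignore arrivals outside the observed horizon
            inflow[t_arr][dest - 1] += 1
    return inflow, outflow

trips = [(1, 2, 0, 1), (1, 3, 0, 1), (2, 1, 1, 2)]
I, O = build_flows(trips, n_regions=4, n_slots=3)

t = 2
history = (I[:t], O[:t])  # I^{0:t-1} and O^{0:t-1}
target = (I[t], O[t])     # I^t and O^t, the quantities to predict
```

With a 20 × 10 grid, `region_id(19, 9, 10)` gives 200, the last of the n = a × b regions.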

3.2. STEFT Overview

In this section, we present the comprehensive details of our proposed Spatio-Temporal Embedding Fusion Transformer (STEFT). The structural layout of our approach is depicted in Figure 1.

Encoder and Decoder: Based on Transformer and multi-head attention.

Embedding Fusion Network (EFN): Utilizing the spatial and temporal embeddings within a specific time slot, in conjunction with the region-connected graph, the proposed STEFT framework effectively captures the dynamic spatial–temporal dependencies between regions at that particular time slot. By encoding spatial and temporal information, the framework can discern the collective influence of these spatial–temporal dependencies among regions. The flow embedding is sequentially accumulated with the spatial embedding and temporal embedding. The outcome will not be immediately conveyed to the transformer encoder. Instead, we utilize three separate encoders to analyze the results of each step, a strategy that mitigates the loss of effective information from flow, spatial, and temporal dimensions.

Flow Block (FB): Utilizing the region embedding generated by the EFN across multiple time slots, the Flow Block effectively captures periodic patterns in spatial–temporal dependencies through an attention mechanism. Consequently, STEFT incorporates the influence of historical spatial–temporal dependencies on the predicted time slot, facilitating a data-driven learning process that does not rely on any prior assumptions regarding traffic periodicity.

Prediction Block (PB): STEFT uses a fully connected neural network to predict the inflow and outflow information.

4. STEFT Details

This section describes STEFT in detail. Section 4.1 describes the Embedding Extractor, Section 4.2 introduces the Embedding Fusion Network, Section 4.3 elaborates on the Flow Block, and Section 4.4 presents the Prediction Block.

4.1. Embedding Extractor

To accurately capture the dynamic joint spatial–temporal dependency for traffic forecasting, we should obtain the spatial–temporal information for each region at specific time slots.

We initially employ one-hot encoding to represent every node within the region r_i (as described in Section 3.1) and re-encode it using a fully connected network.

\hat{S}_{i} = S_{i} \cdot W^{S} + b^{S}

(1)

where \hat{S}_{i} represents the spatial embedding of r_i. W^S and b^S are weight and bias, respectively. In this study, weights and biases are generated randomly. During the training process, parameters are updated based on the Adam [21] method, which adaptively adjusts the learning rate for each parameter by calculating the first-moment estimates and second-moment estimates of gradients, thereby accelerating convergence and improving model performance.
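Equation (1) can be sketched in a few lines of NumPy. The sizes `n` and `d` below are hypothetical, and the weights are simply drawn at random to mirror the random initialization mentioned above. The sketch also illustrates a standard property of this construction: multiplying a one-hot vector by W^S selects a single row of W^S, so the layer behaves like a lookup table with one learnable vector per region.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # hypothetical: 6 regions, embedding size 4

W_S = rng.normal(size=(n, d))    # randomly initialised weight, as in the text
b_S = rng.normal(size=d)         # randomly initialised bias

def spatial_embedding(i):
    """Eq. (1) for region r_i (1-based): one-hot row vector times W^S plus b^S."""
    S_i = np.zeros(n)
    S_i[i - 1] = 1.0
    return S_i @ W_S + b_S

emb = spatial_embedding(3)
# The one-hot product picks out row 3 of W^S (index 2 when counting from zero).
assert np.allclose(emb, W_S[2] + b_S)
```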

The temporal information was also encoded using a linear layer:

\hat{T}_{i} = T_{i} \cdot W^{T} + b^{T}

(2)

where \hat{T}_{i} represents the temporal embedding of r_i. W^T and b^T are weight and bias, respectively.

Then the flow embedding is computed from the inflow and outflow over the w preceding time slots:

F_{i}^{t_{j}} = \mathrm{Concat}(\mathrm{Conv}^{I}(I_{i}^{t_{j} - w : t_{j} - 1}), \mathrm{Conv}^{O}(O_{i}^{t_{j} - w : t_{j} - 1})) \cdot W^{F} + b^{F}

(3)

where Conv^I and Conv^O are the convolutional blocks for the inflow and outflow data. W^F and b^F are weight and bias, respectively.

4.2. Embedding Fusion Network

Element-wise addition allows the addition of two or more feature maps at corresponding elements, thereby integrating the information from these feature maps. This integration method preserves the dimensions of the original feature maps (such as the number of channels, height, and width) but increases the diversity of information under each dimension, helping the network learn more comprehensive or diverse feature representations.

If spatial, temporal, and flow embeddings are to be fused simultaneously, one approach is to consider their summation as follows:

L_{i}^{t_{j}} = (\hat{S}_{i} + \hat{T}_{M(t_{j})} + F_{i}^{t_{j}}) \cdot W^{L} + b_{i}^{L}

(4)

where L_{i}^{t_{j}} is the fusion embedding, and W^L and b^L are learnable parameters.

Spatio-temporal information is taken into account when spatial and temporal embeddings are accumulated with flow embedding. The flow, spatial, and temporal embeddings share a common set of encoder and decoder parameters in this model.

Unlike addition, the concatenation method does not involve summing the input vectors but rather preserves all input vectors intact, thereby preventing smaller features from being overshadowed by larger ones during element-wise addition. Of course, this approach also leads to an increase in the scale of outcomes, thereby imposing greater computational costs.

In our work, we propose concatenating the intermediate superposition results to avoid losing information from the flow embedding and the spatial embedding. This inevitably increases the number of parameters in the network:

φ_{i}^{t_{j}} = \mathrm{Concat}[F_{i}^{t_{j}}, (\hat{S}_{i} + F_{i}^{t_{j}}), (\hat{S}_{i} + \hat{T}_{M(t_{j})} + F_{i}^{t_{j}})]

(5)

where φ_{i}^{t_{j}} represents the fusion embedding after concatenation, and

L_{i}^{t_{j}} = φ_{i}^{t_{j}} \cdot W^{L} + b_{i}^{L}

(6)
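The difference between the summation in Equation (4) and the concatenation in Equations (5) and (6) can be sketched with NumPy. The embedding size `d`, the random inputs, and the choice to project the concatenated vector back to width `d` are assumptions for this example. The text does not state the output width of W^L.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # hypothetical embedding size

S_hat = rng.normal(size=d)             # spatial embedding of region r_i
T_hat = rng.normal(size=d)             # temporal embedding for time slot t_j
F = rng.normal(size=d)                 # flow embedding of r_i at t_j

# Eq. (4): summation-based fusion, projection input width d.
W_sum, b_sum = rng.normal(size=(d, d)), np.zeros(d)
L_sum = (S_hat + T_hat + F) @ W_sum + b_sum

# Eq. (5): concatenate the three partial sums, giving width 3d.
phi = np.concatenate([F, S_hat + F, S_hat + T_hat + F])

# Eq. (6): linear projection of the concatenated vector (output width d is assumed).
W_L, b_L = rng.normal(size=(3 * d, d)), np.zeros(d)
L_cat = phi @ W_L + b_L

# The raw flow embedding is still readable from the first d entries of phi,
# and the spatial embedding can be recovered as a difference of two slices.
assert np.allclose(phi[:d], F)
assert np.allclose(phi[d:2 * d] - phi[:d], S_hat)

print(phi.shape, L_cat.shape, W_sum.size, W_L.size)
```

Under these assumptions the projection matrix for the concatenated input has three times as many entries as the one for the summed input, which is the parameter increase mentioned above.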

4.3. Flow Block

The Flow Block acquires knowledge of periodic dependencies through the interplay between Fusion Embedding and the historical dependencies at various time intervals. Subsequently, it produces a novel embedding by integrating embeddings from diverse time slots, considering their respective correlations.

With the fusion embedding as input, STEFT effectively captures the complicated spatial–temporal dependencies among regions through an enhanced Transformer encoder. Adhering to the standard architecture of the canonical Transformer, STEFT incorporates a multi-head attention mechanism, enabling it to consider diverse dependencies among regions. Specifically, for the m-th attention head, the attention score between regions r_i and r_v at time t_j is formulated as:

A_{m}(r_{i}, r_{v}, t_{j}) = \frac{(L_{i}^{t_{j}} \cdot W_{Q_{m}}) \cdot (L_{v}^{t_{j}} \cdot W_{K_{m}})^{T}}{\sqrt{d}}

(7)

where L_{i}^{t_{j}} and L_{v}^{t_{j}} represent the fusion embeddings at time step t_j for regions r_i and r_v, respectively. W_{Q_{m}} and W_{K_{m}} are learnable weight matrices of the Transformer encoder specific to the m-th attention head, and d is the dimensionality of the embedding space. These matrices transform the region embeddings into query (Q) and key (K) vectors. The query vector represents the region r_i for which we want to compute attention scores, while the key vectors represent the other regions (r_v in this case) in the spatial domain.

The dot product of the query vector L_{i}^{t_{j}} \cdot W_{Q_{m}} and the transpose of the key vector (L_{v}^{t_{j}} \cdot W_{K_{m}})^{T} gives the raw attention score between regions r_i and r_v for the m-th head. Dividing by the scaling factor \sqrt{d} mitigates the extreme values that can arise when the embedding dimensionality d is high, which helps stabilize gradient descent during training.

In contrast to the canonical Transformer, our method does not compute attention scores between a region and all other regions in the dataset. Instead, it computes attention scores only between a given region and its neighboring regions within the region-connected graph. We aggregate the embeddings of these neighboring regions according to their attention scores to update the embedding of the target region. For the m-th attention head, the embedding of a region r_i at a given time t_j is updated as follows:

\begin{aligned} \hat{L}_{i,m}^{t_{j}} &= \sum_{r_{v} \in \mathrm{Neigh}(r_{i})} \mathrm{softmax}(A_{m}(r_{i}, r_{v}, t_{j})) \cdot L_{v}^{t_{j}} \\ &= \sum_{r_{v} \in \mathrm{Neigh}(r_{i})} \frac{\exp(A_{m}(r_{i}, r_{v}, t_{j}))}{\sum_{r_{u} \in \mathrm{Neigh}(r_{i})} \exp(A_{m}(r_{i}, r_{u}, t_{j}))} \cdot L_{v}^{t_{j}} \end{aligned}

(8)

where \hat{L}_{i,m}^{t_{j}} represents the output of the m-th attention head, while \mathrm{Neigh}(r_{i}) denotes the neighboring regions of r_i within the graph structure. A_{m}(r_{i}, r_{v}, t_{j}) signifies the attention score associated with the respective attention head and neighboring region.

The attention score A_m(r_i, r_v, t_j) between region r_i and neighboring region r_v at time t_j for the m-th attention head is calculated using Equation (7). This score represents the importance of region r_v on region r_i at the current time step. The attention scores are passed through a softmax function. This step converts the raw attention scores into probabilities, making them suitable for weighted averaging. The embedding L_{v}^{t_{j}} of each neighboring region r_v is multiplied by its corresponding attention score after softmax normalization. This results in a weighted version of the neighboring region's embedding that indicates its contribution to the target region r_i. The weighted embeddings are then summed up to update the embedding of the target region \hat{L}_{i,m}^{t_{j}} for the m-th attention head.
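The neighbor-restricted attention of Equations (7) and (8) can be sketched for a single head as follows. The function name `neighbor_attention`, the toy neighbor lists, and the random embeddings are hypothetical. Whether a region counts as its own neighbor is not stated in the text, so here it does not. As in Equation (8), the weights are applied directly to the neighbor embeddings rather than to a separate value projection.

```python
import numpy as np

def neighbor_attention(L, neighbors, i, W_Q, W_K):
    """One head of Eqs. (7)-(8) for region index i (0-based).

    L: (n, d) array of fusion embeddings at one time slot.
    neighbors: dict mapping a region index to a list of neighbour indices.
    Returns the softmax weights and the updated embedding of region i.
    """
    d = L.shape[1]
    q = L[i] @ W_Q                                   # query of region i
    keys = L[neighbors[i]] @ W_K                     # keys of its neighbours only
    scores = keys @ q / np.sqrt(d)                   # Eq. (7), one score per neighbour
    scores = scores - scores.max()                   # stability shift; softmax is unchanged
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over Neigh(r_i)
    return weights, weights @ L[neighbors[i]]        # Eq. (8)

rng = np.random.default_rng(0)
n, d = 5, 4
L = rng.normal(size=(n, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}

weights, updated = neighbor_attention(L, neighbors, 1, W_Q, W_K)
```

Because the softmax runs over the neighbor list only, the cost per region grows with the number of neighbors rather than with the total number of regions n.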

The outputs of the attention heads are concatenated:

\hat{L}_{i}^{t_{j}} = \mathrm{Concat}(\hat{L}_{i,1}^{t_{j}}, \dots, \hat{L}_{i,M}^{t_{j}}) \cdot W^{O}

(9)

where M is the number of heads and W^{O} is a learnable parameter matrix.

The results are then processed sequentially by a Transformer decoder and a fully connected network, with a residual connection implemented between every two layers. We designate the outcome of the Flow Block as R^{t_{j}} = \{R_{1}^{t_{j}}, R_{2}^{t_{j}}, \dots, R_{n}^{t_{j}}\}. Specifically, R_{i}^{t_{j}} represents the embedded representation of the region r_i at time slot t_j.

4.4. Prediction Block

Given the results from the Flow Block, the Prediction Block employs a fully connected architecture to forecast the inflow and outflow dynamics. The prediction is calculated as follows:

[\hat{I}_{i}^{t}, \hat{O}_{i}^{t}] = \mathrm{ReLU}(η_{i}^{t} \cdot W^{PB} + b^{PB})

(10)

where \hat{I}_{i}^{t} and \hat{O}_{i}^{t} are predictions of inflow and outflow, respectively. W^{PB} and b^{PB} are learnable parameters.

We concurrently predict both inflow and outflow and establish the loss function as follows:

\mathrm{LOSS} = \sqrt{\frac{\sum_{i = 1}^{n} (I_{i}^{t} - \hat{I}_{i}^{t})^{2} + \sum_{i = 1}^{n} (O_{i}^{t} - \hat{O}_{i}^{t})^{2}}{2n}}

(11)

where n represents the count of distinct regions.
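Equation (11) can be written as a small function. The sketch below assumes the denominator counts 2n squared errors (n regions, each contributing one inflow and one outflow term); the function name `joint_rmse_loss` and the input values are made up for illustration.

```python
import math

def joint_rmse_loss(I_true, I_pred, O_true, O_pred):
    """Eq. (11): root of the mean squared error over inflow and outflow of n regions."""
    n = len(I_true)
    if not (len(I_pred) == len(O_true) == len(O_pred) == n):
        raise ValueError("all inputs must have one value per region")
    sq_in = sum((a - b) ** 2 for a, b in zip(I_true, I_pred))
    sq_out = sum((a - b) ** 2 for a, b in zip(O_true, O_pred))
    return math.sqrt((sq_in + sq_out) / (2 * n))

# Two regions; only the inflow of the second region is mispredicted (by 2).
loss = joint_rmse_loss([1.0, 2.0], [1.0, 4.0], [0.0, 3.0], [0.0, 3.0])
print(loss)  # sqrt((0 + 4 + 0 + 0) / 4) = 1.0
```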

5. Experimental Results

5.1. Datasets

To verify the performance of our model, we employed two datasets: the NYC taxi dataset and the NYC Bike dataset.

(1): NYC Taxi Dataset

NYC taxi data were obtained from the NYC Open Data portal between 2011 and 2016. Each entry in the dataset represents a single taxi trip and contains the longitude and latitude coordinates where the taxi departed. Each entry can be considered as a single instance of taxi demand.

The content of the NYC taxi dataset is shown in Table 1. The whole city is split into a 20 × 10 grid map, where m and n represent the coordinates of the grid. Each time slot lasts for 30 min.

Figure 2 illustrates the Time Distribution of Flow from the Start Region in the NYC Taxi Dataset. In this dataset, inflow and outflow data are recorded every half hour. As evidenced by the figure, the flow exhibits a certain periodic trend on a daily basis, which may be closely related to daily peak hours. Meanwhile, the peak values vary across different days, adding complexity to flow prediction.

(2): NYC Bike Dataset

NYC bike data were collected from 2019 to 2021. Each entry in the dataset corresponds to a unique bike journey and encompasses the longitude and latitude of the bike’s unlock location. Each entry thus serves as a representation of an individual instance of bike demand.

STEFT does not directly introduce specific traffic characteristics, such as the number of lanes on each grid road, but we indirectly capture the impact of these characteristics through the input data of the model, namely, the historical inflow and outflow traffic volume of each grid. Traffic volume is a complex indicator that reflects the combined effects of factors such as the number of lanes on roads, traffic flow control, and traffic signals. Therefore, although we did not directly model physical characteristics such as the number of lanes, these factors have been implicitly included in our model through historical traffic data. Additionally, in the actual traffic forecast, the influence of land use characteristics on traffic flow is evident. However, in our current work, due to the limitations of data, we did not directly include land use characteristics.

5.2. Metrics and Baseline Methods

The evaluation of the models in this paper was mainly conducted using two metrics: RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percentage Error) [22].

\mathrm{RMSE} = \sqrt{\frac{1}{Ω} \sum_{i = 1}^{Ω} (y^{i} - x^{i})^{2}}

(12)

\mathrm{MAPE} = \frac{1}{Ω} \sum_{i = 1}^{Ω} \left| \frac{y^{i} - x^{i}}{x^{i}} \right|

(13)

where y and x are the predicted and actual observed values, respectively, and Ω is the number of samples.

Additionally, we used R-square as a separate metric to evaluate the performance of STEFT:

R\text{-}square = 1 - \frac{\sum_{i = 1}^{Ω} (\hat{y}^{i} - y^{i})^{2}}{\sum_{i = 1}^{Ω} (\bar{y} - y^{i})^{2}}

(14)

where \hat{y} and y are the predicted and actual observed values, respectively, and \bar{y} is the mean of the observed values.
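The three metrics can be sketched in plain Python using their standard definitions, with RMSE taken as the root of the mean squared error. The function names and the two-sample input are made up for illustration. The text does not say how zero observed values are handled in MAPE, so this sketch simply assumes none occur.

```python
import math

def rmse(pred, actual):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((y - x) ** 2 for y, x in zip(pred, actual)) / len(actual))

def mape(pred, actual):
    """Mean absolute percentage error; assumes no observed value is zero."""
    return sum(abs((y - x) / x) for y, x in zip(pred, actual)) / len(actual)

def r_square(pred, actual):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((p - a) ** 2 for p, a in zip(pred, actual))
    ss_tot = sum((mean - a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [2.0, 4.0]
pred = [1.0, 5.0]
print(rmse(pred, actual), mape(pred, actual), r_square(pred, actual))
```

For this input the errors are 1 in both cells, so RMSE is 1.0, MAPE is (0.5 + 0.25) / 2 = 0.375, and R-square is 0.0 because the residual and total sums of squares are both 2.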

In this work, we implemented our model using the PyTorch framework (version 1.8.1). The hyperparameters were chosen for their performance on the validation data: the encoder in Figure 1 is configured with 3 layers, and both modules use 4 attention heads. The dropout rate was set to 0.1 and the learning rate to 0.001 to ensure stable and effective model training. A batch size of 8 was adopted for computational efficiency, and the Adam optimizer was used for training. The proposed model was trained on an NVIDIA Tesla A100 GPU.
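For reference, the reported settings can be collected in a single configuration object. This is only a convenience sketch: the class name `TrainConfig` and its field names are hypothetical, and only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the text; the field names are hypothetical."""
    encoder_layers: int = 3
    attention_heads: int = 4
    dropout: float = 0.1
    learning_rate: float = 1e-3
    batch_size: int = 8
    optimizer: str = "adam"

config = TrainConfig()
print(config)
```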

5.3. Forecast Accuracy

We compared our model with the baseline methods listed below. Table 2 reports the results for both datasets, and Table 3 displays the R-square results of STEFT on the NYC Taxi and NYC Bike datasets. For each method, we executed five independent runs and report the average along with the standard deviation. The tables show that STEFT outperforms all alternative methods on both evaluation metrics and both datasets.

Baselines:

(1): MLP (Multilayer Perceptron): This refers to a three-layered neural network architecture, where each layer is fully interconnected with the subsequent one.
(2): LSTM: A recurrent neural network capable of learning long-range dependencies in sequential data.
(3): GRU [23]: A simpler variant of LSTM, also used for processing sequences with long-term dependencies by employing update and reset gates.
(4): ConvLSTM (Convolutional LSTM Network) [24]: An extension of the standard LSTM, where the fully connected layers are replaced with convolutional layers. This modification is specifically tailored for addressing precipitation nowcasting challenges.
(5): ST-ResNet (Spatio-Temporal Residual Networks) [25]: A deep learning model that applies residual convolutional networks across multiple layers to capture spatial correlations from varying temporal periods, enabling accurate crowd flow predictions.
(6): STDN (Spatial–Temporal Dynamic Network) [26]: A hybrid architecture combining CNNs and LSTMs, which significantly enhances crowd flow forecasting by incorporating both transition dynamics and temporal shifts.
(7): ASTGCN (Attention-based Spatial–Temporal Graph Convolutional Networks) [27]: A model specifically designed for traffic flow forecasting, leveraging attention mechanisms within spatial–temporal graph convolutions to improve prediction accuracy.
(8): STGODE (Spatial–Temporal Graph ODE Networks) [28]: Utilizes ordinary differential equations (ODEs) over a spatial graph (based on distance) and a semantic graph (based on flow similarity) to predict traffic flows, offering a novel approach to modeling dynamic spatial–temporal systems.
(9): STSAN (Spatial Temporal Self-Attention Network) [29]: Employs CNNs to extract spatial information and a Transformer model to model temporal dependencies across time, with additional consideration for regional correlations through transition data between regions.
(10): DSAN (Dynamic Switch-Attention Network) [30]: Leverages the Transformer architecture to capture intricate spatial–temporal correlations, enabling effective prediction for various spatial–temporal tasks while utilizing transition data between regions to refine correlation modeling.
(11): ST-TIS (Spatial–Temporal Transformer with Information Fusion and Region Sampling) [9]: Integrates attention mechanisms with Convolutional Neural Networks (CNNs) to bolster its capability for accurate and comprehensive traffic prediction.

We compared the features of the above models in Table 4.

5.4. Distribution of Predictions

To gain a deeper understanding of the spatial and temporal distribution patterns within our predictions, we performed distinct analyses for each dimension.

Figure 3 presents a comparison of actual and predicted values for a specific time slot within the NYC-Taxi dataset. New York City is partitioned into a grid of 20 × 10 regions. Consequently, the location information in the dataset is numbered from 1 to 200. As evident from the figure, STEFT demonstrates a good alignment between predicted and actual outflow values, whereas the prediction accuracy for inflow is slightly lower than that for outflow. This observation aligns with the outcomes of most models, suggesting that predicting inflow poses a greater challenge than outflow. This disparity can likely be attributed to real-world traffic conditions such as traffic lights or congestion, which facilitate a more straightforward estimation of outflow timing and volume, whereas inflow predictions exhibit higher uncertainty.
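The numbering of the 20 × 10 grid from 1 to 200 can be illustrated with a small index conversion. This sketch assumes row-major numbering with 1-based indices (m over 20 rows, n over 10 columns); the text does not state the ordering actually used, and the function names are hypothetical.

```python
GRID_ROWS, GRID_COLS = 20, 10  # 20 × 10 partition described in the text


def region_id(m, n):
    """Map 1-based grid indices (m, n) to a region id in 1..200.

    Assumes row-major order; the dataset's real ordering may differ.
    """
    if not (1 <= m <= GRID_ROWS and 1 <= n <= GRID_COLS):
        raise ValueError("grid index out of range")
    return (m - 1) * GRID_COLS + n


def region_coords(rid):
    """Inverse of region_id: region id in 1..200 back to (m, n)."""
    if not 1 <= rid <= GRID_ROWS * GRID_COLS:
        raise ValueError("region id out of range")
    m, n = divmod(rid - 1, GRID_COLS)
    return m + 1, n + 1


print(region_id(1, 8))      # a cell in the first row
print(region_coords(200))   # the last region
```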

Figure 4 illustrates the distribution of RMSE and MAPE across 24 h within the NYC-Taxi dataset, utilizing data extracted from a 24 h period within the test set. Similarly, the model exhibits superior prediction performance for outflow compared to inflow. Furthermore, the predictive performance remains relatively consistent across different time periods, without significant variations based on the specific time of day, which could serve as evidence for the robustness of STEFT.
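A per-hour breakdown like the one in Figure 4 can be computed by grouping test samples by hour and evaluating each group separately. The sketch below uses the standard RMSE and MAPE definitions; the `min_true` threshold that skips near-zero ground-truth values in MAPE is an assumption (the paper's exact filtering rule is not given in this section), and the records are toy values.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred, min_true=1.0):
    """Mean absolute percentage error in percent.

    Samples with |truth| < min_true are skipped (assumed rule) to avoid
    division by zero in regions with no flow.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if abs(t) >= min_true]
    if not pairs:
        return float("nan")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


def metrics_by_hour(records):
    """records: iterable of (hour, truth, prediction) tuples."""
    buckets = defaultdict(lambda: ([], []))
    for hour, truth, pred in records:
        buckets[hour][0].append(truth)
        buckets[hour][1].append(pred)
    return {h: (rmse(t, p), mape(t, p)) for h, (t, p) in sorted(buckets.items())}


# Toy records (hour, truth, prediction); not taken from the dataset.
records = [(0, 10.0, 12.0), (0, 20.0, 18.0), (1, 50.0, 45.0), (1, 0.0, 1.0)]
by_hour = metrics_by_hour(records)
for hour, (r, m) in by_hour.items():
    print(f"hour {hour}: RMSE={r:.3f}, MAPE={m:.2f}%")
```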

5.5. Learning Rate

Figure 5 demonstrates the influence of the learning rate on the NYC-Taxi dataset. As evident from the figure, when the learning rate is relatively low, the model exhibits underfitting. Subsequently, as the learning rate increases, both RMSE and MAPE gradually decrease. A minimum in RMSE and MAPE is observed when the learning rate reaches 0.001. However, further increasing the learning rate beyond 0.001 results in overfitting of the model, leading to a significant rise in RMSE and MAPE and a consequent degradation in model performance.

5.6. Batch Size

The batch size is the number of samples fetched from the dataset and processed together during a forward and backward pass. Figure 6 illustrates the impact of varying batch sizes on the performance of the model when trained on the NYC-Taxi dataset. Specifically, as the batch size increases from 2 to 4, a gradual decrease in RMSE and MAPE is observed. However, upon further incrementing the batch size, a notable rise in RMSE and MAPE is evident. Small batch sizes provide some regularization through noisy gradients, but they may undermine training stability and convergence rates. In contrast, medium batch sizes stabilize gradients and thereby improve optimization. Large batch sizes, however, carry a higher risk of overfitting specific data subsets and require notably more resources.

On the other hand, the batch size significantly influences the speed of model training. A smaller batch size requires more iterations per epoch, which often slows training substantially. Larger batch sizes enhance computational efficiency by processing more data per iteration, but they require more memory to store the data and corresponding gradients. Unreasonably large batch sizes may exceed the limits of the available hardware, ultimately hindering the training process.
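The relationship between batch size and the amount of data processed per iteration can be made concrete by counting optimizer steps per epoch. This is a small arithmetic sketch; the sample count is a hypothetical figure, not the size of either dataset.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of forward/backward passes needed to see every sample once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_samples / batch_size)


num_samples = 10_000  # hypothetical training-set size
for batch_size in (2, 4, 8, 16, 32):
    print(f"batch_size={batch_size:>2}: {steps_per_epoch(num_samples, batch_size)} steps")
```

Halving the batch size doubles the number of steps per epoch, while doubling it roughly doubles the activation memory held per step, which is the trade-off the paragraph above describes.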

5.7. Layer Number

In STEFT, the transformer encoder is stacked n_l times, a strategy intended to facilitate information propagation across different regions and thereby enhance the robustness of the model. The results on the NYC-Taxi dataset are shown in Figure 7.

When the layer number is set to 1, the encoder’s capability to capture deep information is limited. However, as the number of layers increases, a decline in RMSE and MAPE is observed, indicating a performance improvement. Nevertheless, an excessive number of layers (layer number > 3) leads to a reversal in this trend, with RMSE and MAPE values rising again. This is attributed to the difficulties in training the model with too many layers, particularly when the layer number surpasses 5, where convergence becomes challenging, resulting in significantly larger RMSE values (exceeding 150) and MAPE percentages (exceeding 170%).

5.8. Head Number

We incorporate the multi-head attention mechanism for dependency learning. This approach enables individual heads to discern and capture distinct patterns embedded within the historical data, thereby enhancing the model’s ability to identify intricate relationships and dependencies. The results are shown in Figure 8.

As the number of attention heads increases, both RMSE and MAPE decline, indicating that the multi-head attention mechanism improves prediction precision. However, when the number of heads becomes too large, RMSE and MAPE rise again. This may be ascribed to the growth in trainable parameters, which complicates the model's training phase. Notably, when the number of heads exceeds six, convergence issues arise, resulting in markedly higher RMSE values (surpassing 172) and MAPE percentages (exceeding 300%).
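How the parameter count of a multi-head attention layer depends on the number of heads varies with how the heads are sized. The sketch below counts weights and biases for the four projections (query, key, value, output) of a standard attention layer under two conventions. How STEFT sizes its heads is not stated in this section, so both conventions, the function name, and the width of 64 are assumptions for illustration.

```python
def mha_param_count(d_model, num_heads, head_dim=None):
    """Weights + biases of the Q, K, V and output projections.

    head_dim=None  -> heads split the model width (head_dim = d_model / heads).
    head_dim=k     -> every head has fixed width k, so the inner width grows.
    """
    if head_dim is None:
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        head_dim = d_model // num_heads
    inner = num_heads * head_dim
    qkv = 3 * (d_model * inner + inner)   # three input projections
    out = inner * d_model + d_model       # output projection
    return qkv + out


d_model = 64  # hypothetical model width
for heads in (1, 2, 4, 8):
    split = mha_param_count(d_model, heads)
    fixed = mha_param_count(d_model, heads, head_dim=16)
    print(f"heads={heads}: split-width={split}, fixed-width={fixed}")
```

Under the split-width convention the count does not change with the number of heads; under the fixed-width convention it grows linearly with it.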

5.9. Diverse Design Permutations of STEFT

To assess the efficacy of the proposed modules in STEFT, we conduct a comparative analysis with its variants. The subsequent variants under investigation are delineated as follows:

ADD: In this model, we replace the concatenation with addition, as discussed in Equation (4).
noEFN: We removed the Embedding Fusion Network from this model.
noS: We eliminated the spatial embedding in the Embedding Fusion Network, leaving only the flow and temporal embedding as the input.
noT: We eliminated the temporal embedding in the Embedding Fusion Network, leaving only the flow and spatial embedding as the input.

Figure 9 displays the scores of various structures on the NYC-Taxi dataset. Key insights emerge from this analysis:

The concat method in STEFT yields superior results compared to the add method. This indicates that the add method may cause a loss of flow and spatial and temporal embedding information. By using the concat method, the fusion embedding presents more comprehensive features, allowing subsequent models to capture a broader spectrum of information.

Compared with the add method, the concat-based fusion embedding increases the parameter count in the transformer encoder. This added capacity is accompanied by lower RMSE and MAPE when the concat method is applied.

Removing the Embedding Fusion Network from the model (noEFN model) leads to a moderate decline in performance. Both RMSE and MAPE show an increase, suggesting that the transformer encoder gathers less information without spatial or temporal embeddings. Consequently, this results in a decrease in prediction accuracy.

Our experiments also show that adding only temporal or spatial embedding improves prediction performance, but it still falls short of STEFT’s results. This highlights the critical role of the Embedding Fusion Network in deeply integrating flow embedding with spatial and temporal embeddings, extracting more robust features that enhance prediction performance.
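The difference between the add and concat fusion strategies compared above can be illustrated on toy vectors. This is a simplified sketch on plain Python lists, not the model's Equation (4) or its learned tensors; the function names and values are hypothetical.

```python
def fuse_add(flow, spatial, temporal):
    """Element-wise sum: output keeps the same width as each input."""
    if not (len(flow) == len(spatial) == len(temporal)):
        raise ValueError("addition requires equal-width embeddings")
    return [f + s + t for f, s, t in zip(flow, spatial, temporal)]


def fuse_concat(flow, spatial, temporal):
    """Concatenation: output width is the sum of the input widths."""
    return [*flow, *spatial, *temporal]


# Toy embeddings chosen so that the sum cancels out.
flow = [0.5, -1.0]
spatial = [1.0, 2.0]
temporal = [-1.5, -1.0]

added = fuse_add(flow, spatial, temporal)
concatenated = fuse_concat(flow, spatial, temporal)

print(added)          # the three sources can no longer be told apart
print(concatenated)   # each source occupies its own slice
print(concatenated[:len(flow)] == flow)  # flow values are kept verbatim
```

With addition, different combinations of inputs can collapse to the same fused vector, whereas concatenation keeps the flow values recoverable at the cost of a wider input to the encoder.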

6. Conclusions

In this work, we introduce STEFT, a novel Spatio-Temporal Embedding Fusion Transformer framework designed specifically for traffic prediction. STEFT efficiently integrates attention mechanisms and concatenation techniques to simultaneously capture dynamic spatial and temporal dependencies, thereby ensuring the integrity of the original flow information.

Unlike traditional methods, STEFT utilizes an Embedding Fusion Network (EFN) to effectively combine spatial, temporal, and flow embeddings, minimizing information loss and enhancing predictive performance. By employing three separate encoders for spatial, temporal, and flow embeddings, STEFT can maintain the distinctiveness of each type of information, facilitating a more comprehensive understanding of traffic dynamics.

Furthermore, the Flow Block within STEFT incorporates historical spatial–temporal dependencies into the prediction process, leveraging an enhanced Transformer encoder with a multi-head attention mechanism. This approach enables STEFT to capture intricate patterns across time slots, enhancing its ability to predict traffic flows accurately.

Comprehensive experiments performed on both taxi and bike datasets affirm the superior performance of STEFT compared to current leading-edge methodologies. Our model records substantial enhancements in RMSE and MAPE metrics, underscoring its proficiency in providing more accurate traffic forecasts. Furthermore, the ablation analysis elucidates the individual contributions of each STEFT component, validating the efficacy of our architectural decisions.

However, despite the notable successes achieved by the STEFT model across multiple datasets, it may still have certain limitations. Specifically, in regions characterized by lower levels of structured traffic data, such as rural areas in some developing countries, STEFT could encounter challenges in capturing intricate traffic dynamics. Similarly, in cities with diversified and unpredictable traffic patterns, the performance of STEFT may also be impacted to some degree. Therefore, in future research endeavors, it is imperative to conduct a more thorough analysis of these unique scenarios to identify potential deficiencies in STEFT and explore corresponding optimization strategies, thereby enhancing its applicability and robustness across various complex traffic environments.

To summarize, STEFT embodies a computationally efficient and remarkably accurate model for traffic prediction, adeptly merging spatial, temporal, and flow data. Its innovative fusion and attention techniques enhance its exceptional performance, positioning it as a viable solution for practical traffic management and control scenarios.

The application prospects of the STEFT model in real-world traffic management systems are extensive, primarily manifested in the optimization of traffic signal control systems, integration into smart city traffic systems, and support for intelligent transportation decision-making. This model is capable of real-time prediction of traffic flow changes, providing precise data for traffic signal control systems, enabling dynamic adjustment of signal timing, alleviating traffic congestion, and enhancing road traffic efficiency. Simultaneously, it facilitates the deep integration of the transportation system with other urban infrastructure, optimizing the allocation of traffic resources and elevating the operational efficiency and service level of the overall transportation system. Furthermore, the robust data processing and learning capabilities of the STEFT model can provide long-term traffic trend analysis for transportation planning and management departments, supporting intelligent transportation decision-making. In the future, we will further explore the integration pathways of the STEFT model with existing traffic control systems and evaluate its application effects in different cities and traffic scenarios, aiming to drive innovation and development in traffic management technology and contribute to the comprehensive construction of smart cities.

Author Contributions

Methodology, X.C.; Software, X.C.; Writing—original draft, X.C.; Writing—review & editing, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Hubei Province University Outstanding Middle-Aged and Young Innovative Team Project No. T201907, the National Key Research and Development Program of China No. 2023YFE0126400, the China-Africa Partner Institute Exchange Program, and the Hubei Province Program for the Introduction of Foreign Talents and Intelligence No. 2023DJC02.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Jiang, W.; Luo, J. Graph neural network for traffic forecasting: A survey. Expert Syst. Appl. 2022, 207, 117921.
2. Jiang, W.; Luo, J.; He, M.; Gu, W. Graph neural network for traffic forecasting: The research progress. ISPRS Int. J. Geo-Inf. 2023, 12, 100.
3. Haq, M.A. CDLSTM: A novel model for climate change forecasting. Comput. Mater. Contin. 2022, 71, 2363.
4. Yang, Y.; Yuan, Z.; Meng, R. Exploring traffic crash occurrence mechanism toward cross-area freeways via an improved data mining approach. J. Transp. Eng. Part A Syst. 2022, 148, 04022052.
5. Ye, T.; Zou, F.; Guo, F. Expressway Short-Term Traffic Flow Prediction Based on CNN-LSTM. In Genetic and Evolutionary Computing, Proceedings of the International Conference on Genetic and Evolutionary Computing, Kaohsiung, Taiwan, 6–8 October 2023; Springer Nature: Singapore, 2023; pp. 29–36.
6. Cai, L.; Lei, M.; Zhang, S.; Yu, Y.; Zhou, T.; Qin, J. A noise-immune LSTM network for short-term traffic flow forecasting. Chaos Interdiscip. J. Nonlinear Sci. 2020, 30, 023135.
7. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.-Y. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intell. Transp. Syst. 2014, 16, 865–873.
8. Sun, P.; Boukerche, A.; Tao, Y. SSGRU: A novel hybrid stacked GRU-based traffic volume prediction approach in a road network. Comput. Commun. 2020, 160, 502–511.
9. Li, G.; Zhong, S.; Deng, X.; Xiang, L.; Chan, S.-H.G.; Li, R.; Liu, Y.; Zhang, M.; Hung, C.-C.; Peng, W.-C. A lightweight and accurate spatial-temporal transformer for traffic forecasting. IEEE Trans. Knowl. Data Eng. 2022, 35, 10967–10980.
10. Shekhar, S.; Williams, B.M. Adaptive seasonal time series models for forecasting short-term traffic flow. Transp. Res. Rec. 2007, 2024, 116–125.
11. Zhang, Y.; Xie, Y. Forecasting of short-term freeway volume with v-support vector machines. Transp. Res. Rec. 2007, 2024, 92–99.
12. Tong, Y.; Chen, Y.; Zhou, Z.; Chen, L.; Wang, J.; Yang, Q.; Ye, J.; Lv, W. The simpler the better: A unified approach to predicting original taxi demands based on large-scale online platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 13–17 August 2017; pp. 1653–1662.
13. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875.
14. Zhao, Z.; Chen, W.; Wu, X.; Chen, P.C.Y.; Liu, J. LSTM network: A deep learning approach for short-term traffic forecast. IET Intell. Transp. Syst. 2017, 11, 68–75.
15. Yu, H.; Wu, Z.; Wang, S.; Wang, Y.; Ma, X. Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. Sensors 2017, 17, 1501.
16. Targ, S.; Almeida, D.; Lyman, K. Resnet in Resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029.
17. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
18. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Keutzer, K.; Darrell, T. DenseNet: Implementing efficient ConvNet descriptor pyramids. arXiv 2014, arXiv:1404.1869.
19. Ding, Y.; Zhang, Z.; Zhao, X.; Hong, D.; Yu, C.; Yang, N.; Cai, W. Multi-feature fusion: Graph neural network and CNN combining for hyperspectral image classification. Neurocomputing 2022, 501, 246–257.
20. Mei, S.; Yan, K.; Ma, M.; Chen, X.; Zhang, S.; Du, Q. Remote sensing scene classification using sparse representation-based framework with deep feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5867–5878.
21. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
22. Lin, J.; Ren, Q. Rethinking Spatio-Temporal Transformer for Traffic Prediction: Multi-level Multi-view Augmented Learning Framework. arXiv 2024, arXiv:2406.11921.
23. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555.
24. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
25. Zhang, J.; Zheng, Y.; Qi, D. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
26. Yao, H.; Tang, X.; Wei, H.; Zheng, G.; Li, Z. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5668–5675.
27. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 922–929.
28. Fang, Z.; Long, Q.; Song, G.; Xie, K. Spatial-temporal graph ode networks for traffic flow forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 364–373.
29. Lin, H.; Jia, W.; You, Y.; Sun, Y. Interpretable crowd flow prediction with spatial-temporal self-attention. arXiv 2020, arXiv:2002.09693.
30. Lin, H.; Bai, R.; Jia, W.; Yang, X.; You, Y. Preserving dynamic attention for long-term spatial-temporal prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Online, 6–10 July 2020; pp. 36–46.

Figure 1. STEFT overview (The ‘+’ symbol denotes the addition operator, and ‘||’ denotes the concatenation operator).

Figure 2. Time Distribution of Flow from the Start Region in the NYC Taxi Dataset.

Figure 3. Actual and predicted values for a specific time slot on the NYC-Taxi dataset. (a) Inflow, (b) Outflow.

Figure 4. The distribution of RMSE and MAPE across 24 h on the NYC-Taxi dataset. (a) RMSE, (b) MAPE.

Figure 5. The influence of the learning rate on the NYC-Taxi dataset. (a) RMSE, (b) MAPE.

Figure 6. The influence of batch size on the NYC-Taxi dataset. (a) RMSE, (b) MAPE.

Figure 7. The influence of layer number on the NYC-Taxi dataset. (a) RMSE, (b) MAPE.

Figure 8. The influence of head number on the NYC-Taxi dataset. (a) RMSE, (b) MAPE.

Figure 9. Scores of different structures on the NYC-Taxi dataset. (a) RMSE, (b) MAPE.

Table 1. NYC Taxi Dataset.

Time Slot	1	1	1	1	1	1	1	1
m	1	1	1	1	1	1	1	1
n	1	2	3	4	5	6	7	8
Inflow Volume	20	17	1	1	0	0	0	0
Outflow Volume	26	25	4	0	0	0	0	0

Table 2. Comparison with related works.

Dataset	Method	Inflow RMSE	Inflow MAPE	Outflow RMSE	Outflow MAPE
NYC-Taxi	MLP	22.08 ± 0.50	18.31 ± 0.83%	26.67 ± 0.56	18.43 ± 0.62%
NYC-Taxi	LSTM	24.04 ± 0.29	20.01%	29.92 ± 0.32	19.11%
NYC-Taxi	GRU	23.36 ± 0.30	18.57%	29.77 ± 0.25	19.31%
NYC-Taxi	ConvLSTM	23.67 ± 0.20	20.70 ± 0.20%	28.13 ± 0.25	20.50 ± 0.10%
NYC-Taxi	ST-ResNet	21.63 ± 0.25	21.09 ± 0.51%	26.23 ± 0.33	21.13 ± 0.63%
NYC-Taxi	STDN	19.05 ± 0.31	16.25 ± 0.26%	24.10 ± 0.25	16.30 ± 0.23%
NYC-Taxi	ASTGCN	22.05 ± 0.37	20.25 ± 0.26%	26.10 ± 0.25	20.30 ± 0.31%
NYC-Taxi	STGODE	21.46 ± 0.42	19.22 ± 0.36%	27.24 ± 0.46	19.30 ± 0.34%
NYC-Taxi	STSAN	23.07 ± 0.64	22.24 ± 1.91%	27.83 ± 0.30	25.90 ± 1.67%
NYC-Taxi	DSAN	18.32 ± 0.39	16.07 ± 0.31%	24.27 ± 0.30	17.70 ± 0.35%
NYC-Taxi	ST-TIS	18.21 ± 0.23	16.48 ± 0.32%	22.65 ± 0.13	15.42 ± 0.76%
NYC-Taxi	This work	17.83 ± 0.24	14.99 ± 0.27%	22.57 ± 0.19	15.80 ± 0.32%
NYC-Bike	MLP	9.12 ± 0.24	22.40 ± 0.40%	9.83 ± 0.19	23.12 ± 0.24%
NYC-Bike	LSTM	10.53 ± 0.14	23.22%	10.99 ± 0.19	24.16%
NYC-Bike	GRU	10.66 ± 0.15	23.12%	11.35 ± 0.16	24.25%
NYC-Bike	ConvLSTM	9.22 ± 0.19	23.20 ± 0.47%	10.40 ± 0.17	25.10 ± 0.45%
NYC-Bike	ST-ResNet	8.85 ± 0.13	22.98 ± 0.53%	9.80 ± 0.12	25.06 ± 0.36%
NYC-Bike	STDN	8.15 ± 0.15	20.87 ± 0.39%	8.85 ± 0.11	21.84 ± 0.36%
NYC-Bike	ASTGCN	9.05 ± 0.31	22.25 ± 0.36%	9.34 ± 0.24	23.13 ± 0.30%
NYC-Bike	STGODE	8.58 ± 0.38	23.33 ± 0.26%	9.23 ± 0.31	23.99 ± 0.23%
NYC-Bike	STSAN	8.20 ± 0.45	20.42 ± 1.33%	9.87 ± 0.23	23.87 ± 0.71%
NYC-Bike	DSAN	7.97 ± 0.25	20.23 ± 0.18%	10.07 ± 0.58	23.92 ± 0.39%
NYC-Bike	ST-TIS	8.04 ± 0.04	20.31 ± 0.23%	8.00 ± 0.10	19.07 ± 0.19%
NYC-Bike	This work	7.86 ± 0.06	19.41 ± 0.22%	8.18 ± 0.19	18.28 ± 0.22%

Table 3. The R-square results of STEFT on the NYC Taxi and NYC Bike datasets.

Dataset	NYC Taxi	NYC Bike
Inflow	98.48 ± 0.27%	94.69 ± 0.21%
Outflow	98.38 ± 0.23%	94.44 ± 0.20%

Table 4. Features of each model.

Model	Type	Key Features
MLP	Feedforward Neural Network	Simple structure, easy to implement
LSTM	RNN Variant	Learns long-term dependencies, uses forget, input, and output gates
GRU	RNN Variant	Simpler structure than LSTM, faster training, uses update and reset gates
ConvLSTM	Convolutional RNN	Replaces fully connected layers in LSTM with convolutional layers
ST-ResNet	Deep Residual Network	Combines spatial convolutions with residual connections
STDN	Spatio-temporal Deep Network	Designed to capture complex patterns in spatio-temporal data
ASTGCN	Attention-based Spatial–Temporal GCN	Integrates GCNs with attention mechanisms for graph-structured data
STGODE	Spatio-temporal Graph ODE Network	Combines ordinary differential equations with graph convolutional networks
STSAN	Self-Attention Network	Incorporates self-attention mechanisms and spatio-temporal convolutions
DSAN	Dual Self-Attention Network	Includes two self-attention mechanisms for spatial and temporal dimensions
ST-TIS	Self-Attention Network	Integrates multiple spatio-temporal modeling techniques
STEFT	CNN, Feedforward Neural Network, Transformer	Feature Fusion, Multi-Time-Interval Transformer Encoder

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
