1. Introduction
Accurate traffic flow prediction has become crucial to the development and construction of smart cities [1]. Precisely processing urban traffic information and forecasting its changes enables city officials to understand current urban operations and make more efficient and effective management decisions [2,3,4]. In addition, the results of traffic flow prediction can be used to optimize the city’s land use [5], reduce traffic congestion [6], and optimize the location of functional areas [7], helping to develop a complete and comprehensive smart city system. The field of traffic flow prediction has progressed from traditional dynamical models [8] and statistical models [9,10,11] to models based on deep learning. Although they outperform traditional prediction models, existing deep learning models still have some limitations.
Firstly, traffic flow pattern recognition remains limited. Different urban areas serve different functions, and traffic patterns within these areas change over time. As shown in Figure 1, the areas marked in green illustrate traffic connectivity at a small scale: a single area contains multiple road intersections with different flow efficiencies, and it is connected by roads to neighboring areas, so that it forms its own traffic flow pattern within a very short period of time. At a larger scale, an area is composed not only of roads but may also contain different types of buildings. The zones marked in red represent the functional areas of the city that result from actual use, which may be work, residential, or commercial areas, and the traffic flows between these zones are influenced by the time of day and the road connections in the area. For example, on weekdays, the main traffic flows are between residential and commercial areas, and on weekends, the focus of traffic flows shifts to residential and consumer areas. If the road connections between two areas are dense, the traffic flow between them changes faster, whereas for areas that are far apart and connected only through multiple intermediate areas, the traffic flow changes relatively slowly. Capturing the specific traffic flow patterns at different scales helps the model predict future traffic flow trends accurately.
The second limitation is the failure to capture interactions between regions over long spatio-temporal ranges. Urban traffic flow changes are associated not only with changes in time but also with spatial relationships among regions. Over time, traffic flow changes in a city’s periphery affect traffic flow in the core, and accurately identifying the potential connections between these areas improves the accuracy of traffic flow prediction. Graph convolution models [12,13,14,15,16] stack temporal and spatial convolutions to predict future traffic trends from past consecutive flow maps and other influencing factors. However, most current work is biased towards analyzing the impact of various environmental factors on local traffic flow [14,15]. Networks that combine Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) networks or their variants [17,18,19,20] usually capture the spatial and temporal dependence between distant regions by looping the LSTM over time and continuously stacking convolutional layers to enlarge the receptive field. With this approach, it is difficult to establish the direct links that may exist between the local and the global.
In this paper, we present a detailed model to solve the problems of traffic flow forecasting. Our contributions are as follows:
We propose a multi-scale non-local spatio-temporal information fusion network (MN-STFN), which is able to accurately and stably make multi-step predictions of future traffic flows by inputting gridded data of traffic flows in the past period.
Our model is able to capture the unique traffic flow patterns at different scales.
We add a non-local network structure to the model to better capture the spatio-temporal direct traffic connections between the local and global parts of the urban region in the temporal traffic flow data.
We compare our model with multiple baseline models on two public datasets, from Beijing and New York. Experiments show that our model outperforms existing models in terms of Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE).
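For reference, the two evaluation metrics named above can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions, not the evaluation code used in the experiments. The `min_flow` threshold, which skips cells with zero true flow when computing MAPE, is an assumption introduced here to avoid division by zero.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error over paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred, min_flow=0.0):
    """Mean Absolute Percentage Error, in percent.

    Pairs whose true value has magnitude <= min_flow are skipped
    (an assumption of this sketch) so that the ratio stays defined.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if abs(t) > min_flow]
    if not pairs:
        raise ValueError("no observations above min_flow")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


# Hypothetical flows for four grid cells at one time step.
observed = [100.0, 200.0, 50.0, 0.0]
predicted = [110.0, 190.0, 55.0, 2.0]
print(rmse(observed, predicted))
print(mape(observed, predicted))
```

RMSE penalizes large absolute errors, which are typical of busy cells, while MAPE measures relative error and is therefore more sensitive to cells with low flow.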
2. Related Work
In traffic flow prediction, accurately establishing the interdependence between regions and effectively identifying the potential traffic flow patterns in the city are crucial to improving the accuracy of future traffic flow trend prediction.
The most direct way to establish the relationship between areas is to express the spatial layout of urban areas as a graph. In this graph, each node corresponds to a point of interest (POI) in the real city and each edge corresponds to a real road connection, and a graph neural network can then extract the urban traffic flow pattern from the traffic flow graph [21]. In STGCN [13], the authors constructed spatio-temporal convolution blocks in which spatial graph convolution is interspersed within temporal convolution, and captured long-distance spatio-temporal traffic flow patterns in the city by continuously stacking these blocks and incorporating additional influencing factors. DCRNN [22] represents the dynamics of traffic flow as a diffusion process: it introduces an interpretable diffusion convolution operation to model the spatial dependence between nodes and combines it with a recurrent neural network that models temporal correlation, thereby capturing the spatio-temporal characteristics of traffic flow.
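The idea of treating spatial dependence as a diffusion process can be illustrated with a small sketch. The code below is a simplified, single-direction version written for illustration only: it assumes one scalar signal per node and fixed coefficients `thetas`, whereas DCRNN [22] learns its filter parameters and also diffuses along the reversed edges. The function name and the example graph are hypothetical.

```python
def diffusion_conv(adj, x, thetas):
    """Truncated diffusion: sum_k thetas[k] * P^k x, with P = D_out^{-1} W.

    adj    -- N x N weighted adjacency matrix (list of lists)
    x      -- one scalar signal per node, e.g. current flow at each sensor
    thetas -- one coefficient per diffusion step (fixed here, learned in practice)
    """
    n = len(adj)
    # Row-normalize the adjacency matrix into random-walk transition probabilities.
    transition = []
    for row in adj:
        degree = sum(row)
        transition.append([w / degree if degree > 0 else 0.0 for w in row])

    out = [0.0] * n
    hidden = list(x)  # P^0 x
    for theta in thetas:
        out = [o + theta * h for o, h in zip(out, hidden)]
        # Advance the signal by one more diffusion step: P^(k+1) x.
        hidden = [sum(transition[i][j] * hidden[j] for j in range(n))
                  for i in range(n)]
    return out


# Hypothetical three-node road graph: node 0 links to 1, node 1 to 0 and 2, node 2 to 1.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
signal = [1.0, 0.0, 0.0]
print(diffusion_conv(adjacency, signal, thetas=[1.0, 0.5]))
```

Each additional step lets a node's output depend on nodes one hop further away, which is why the number of diffusion steps bounds the spatial range such a layer can cover.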
On this basis, more work focuses on refining the additional factors present in traffic flow in order to improve the accuracy of predicting future traffic flow trends. STC-CDPM [12] analyzes the trajectories of real crowd flows, extracts additional activity features from the POI attributes of the areas passed through along the way, and combines them with the flow features. STUaNet [15] aims to reduce the uncertainty between the predicted traffic flow and the actual data by quantifying both the uncertainty of external influencing factors and the uncertainty caused by variation in urban traffic. ST-SSL [14] performs data augmentation on the original data and constructs two spatio-temporal self-supervised learning tasks on it to enhance the model’s recognition of traffic flow patterns and of the spatio-temporal heterogeneity of traffic.
Although the graph neural networks above can intuitively and effectively construct the traffic flow relationships between urban areas and regions, expanding the prediction area dramatically increases the number of model parameters. This increase reduces the model’s prediction performance and makes it less practical. Additionally, extracting features through graph convolution cannot combine the temporal and spatial features of regional traffic flow simultaneously, which can result in the loss of a region’s unique spatio-temporal traffic flow pattern.
In addition to using edges to represent connectivity between neighboring city areas, the actual distance between two regions can also serve as a measure of connection. The city can be partitioned into uniformly sized two-dimensional grids, which allows long-range relationships between regions to be established with deep convolutional neural networks. Deep-ST [2] models traffic flow grid data with three distinct deep convolutional networks, one for each of three temporal features: closeness, period, and trend. It extracts flow trends over these time intervals and predicts future traffic flow by combining the spatio-temporal information with additional features through two different fusions. ST-ResNet [3] instead replaces the convolutional networks with stacked residual units, forming deep residual networks that extract the changing features of spatio-temporal traffic flow data at greater depth. Stacking convolutional layers enhances the model’s ability to perceive space and establishes inter-regional interactions over long distances through the depth of the network. However, the traffic flow pattern of a region cannot be determined from spatial relationships alone, because temporal changes are also significant, and stacking convolutional operations can impede the model’s extraction of regional features that vary over time. LSTM networks are generally better suited to modeling time series data. Unlike approaches that rely on convolutional networks alone, AT-Conv-LSTM [20] incorporates LSTM into the network architecture. It combines LSTM and a CNN to discern the traffic flow trend, captures the traffic flow pattern over a longer period of time through two bi-directional LSTM networks, and finally merges the three output hidden states through a fully connected network to produce the final prediction. However, due to the limitations of its network structure, it is only suitable for handling traffic flow relationships in one dimension. ConvLSTM [23] removes the redundant connections introduced by the fully connected structure in LSTM by substituting convolutional operations, thus reducing the number of parameters in the network. In addition, sharing the convolutional kernel reflects the similarity of temporal changes across regions. DeepSTCL [18] employs a network structure similar to ST-ResNet but applies ConvLSTM to traffic flow data with three distinct periodic trends. This model establishes the spatio-temporal relationship of traffic flow among time slices and accounts for multiple spatio-temporal factors simultaneously when identifying traffic flow patterns, which improves prediction accuracy. However, the size of the convolutional kernel constrains the range of regional relationships and therefore hampers the model’s capacity to capture long-distance spatio-temporal dependencies.
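The closeness, period, and trend inputs used by the grid-based models above can be made concrete with a short sketch of how the corresponding historical frames might be selected. The code assumes 48 half-hour intervals per day, a one-day period, and a one-week trend. These values, like the function names and sequence lengths, are illustrative assumptions rather than settings taken from the cited papers.

```python
def temporal_indices(t, intervals_per_day=48, len_closeness=3,
                     len_period=1, len_trend=1, days_per_trend=7):
    """Return the time-step indices feeding the three temporal branches.

    closeness -- the most recent steps before t
    period    -- the same time of day on previous days
    trend     -- the same time of day in previous weeks
    """
    closeness = [t - i for i in range(1, len_closeness + 1)]
    period = [t - i * intervals_per_day for i in range(1, len_period + 1)]
    trend_step = days_per_trend * intervals_per_day
    trend = [t - i * trend_step for i in range(1, len_trend + 1)]
    if min(closeness + period + trend) < 0:
        raise ValueError("not enough history before time step t")
    return {"closeness": closeness, "period": period, "trend": trend}


def select_frames(flow_maps, t, **kwargs):
    """Pick the historical flow maps for each branch from a time-ordered list."""
    indices = temporal_indices(t, **kwargs)
    return {name: [flow_maps[i] for i in steps]
            for name, steps in indices.items()}


# Placeholder frames stand in for H x W inflow/outflow grids.
history = ["frame_%d" % i for i in range(500)]
inputs = select_frames(history, t=400)
print(inputs["closeness"])
print(inputs["period"])
print(inputs["trend"])
```

Each branch therefore sees a different slice of the same history, and a model fuses their outputs before predicting the frame at step `t`.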
Furthermore, research in meta-learning typically focuses on using extracted meta-knowledge in combination with model parameter optimization. STMetaNet [24] proposes two distinct meta-knowledge learners, for nodes and edges, enabling individualized regression models for different regions while avoiding an overwhelming number of model parameters. AutoSTG [25] uses meta-learning to learn the regional spatio-temporal network structure: it aggregates node and edge meta-knowledge and combines it with a search over the network structure. By continuously training and updating the network weights, spatio-temporal data prediction models can be constructed.
Since the attention mechanism was proposed, it has been widely applied in various deep learning tasks [26]. ACFM [27] improves crowd flow prediction accuracy with a spatio-temporal feature learning module that includes two ConvLSTM networks and an attention mechanism that adaptively constructs spatio-temporal weights. SA-ConvLSTM [17] presents a self-attention module that adds extra memory units to ConvLSTM. The additional memory units express global spatio-temporal dependencies and are combined with self-attention to rectify content errors, which addresses the challenge of capturing long-distance dependencies in spatio-temporal data. AttConvLSTM [19] uses a structure composed of seq2seq and attention to predict future traffic flows over multiple steps. However, SA-ConvLSTM applies attention inside the cell, which only establishes relationships between neighboring regions in time intervals and leaves long-range dependencies to be built up through repeated recurrence, while AttConvLSTM focuses attention on the final encoded output of a single time slice, obscuring the regional structural relationships present in the original inputs and reflecting only the direct associations between temporal variations of traffic flows. The non-local network [28] was initially proposed to establish direct connections between distant pixels across space and time in computer vision. It provides an attention-based mechanism for capturing long-range dependencies and is a versatile module that can be conveniently embedded in other models. Applying a non-local network to traffic flow prediction enables direct connections between the traffic flow characteristics of different regions, capturing traffic flow dependencies over time and space and improving predictive accuracy.
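To make the non-local operation concrete, the following is a minimal sketch of its core computation on a flattened set of positions, where each position could be one grid cell at one time step. It is a simplified Gaussian form that assumes identity embeddings instead of the learned projections of the original block [28], and it is not the implementation used in our model. The function name and example features are hypothetical.

```python
import math


def non_local(features):
    """Apply a simplified non-local operation with a residual connection.

    features -- list of N position vectors of equal length
    Each output is z_i = x_i + sum_j softmax_j(x_i . x_j) * x_j, so every
    position attends directly to every other position, near or far.
    """
    outputs = []
    for x_i in features:
        # Pairwise similarity between position i and all positions j.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Numerically stable softmax over j.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of all position features.
        y_i = [sum(w * x_j[d] for w, x_j in zip(weights, features))
               for d in range(len(x_i))]
        # The residual connection keeps the original local feature.
        outputs.append([a + b for a, b in zip(x_i, y_i)])
    return outputs


# Hypothetical features for three positions, two channels each.
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
for row in non_local(cells):
    print(row)
```

Because the weights are computed between all pairs of positions, the cost grows quadratically with the number of positions, which is the usual trade-off for replacing stacked local convolutions with a single global step.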
Therefore, our model uses the Multi-scale Traffic Flow Pattern Capture (MTFPC) block to capture traffic flow features at different scales within an encoding-prediction network structure. Compared with graph neural networks, this allows better adaptation to urban areas of different sizes and better extraction of complex traffic flow patterns. We also introduce the non-local network, which establishes a direct link between local and global urban areas. This reduces prediction errors and enables accurate multi-step prediction of traffic flow.
6. Conclusions
In this paper, we propose a multi-scale non-local spatio-temporal information fusion network (MN-STFN) for multi-step prediction of traffic flow. Spatio-temporal traffic flow features are extracted at different scales by stacking multiple multi-scale traffic flow pattern capture (MTFPC) blocks, supplemented with a non-local network that captures direct spatio-temporal dependencies between local regions and the global region to improve prediction accuracy. The evaluation results on the BJTaxi and NYCBike datasets show that our model performs better than the baseline models on multi-step traffic flow prediction. The comparison across these two datasets of different magnitudes also shows that our model comprehensively captures and predicts the simple traffic flow patterns of smaller regions, while as the prediction region grows, it models regional relationships and predicts future traffic flow better than other models. We also verify the effectiveness of the MTFPC block and the non-local block through a series of ablation experiments. Compared with purely local modules, the multi-scale structure helps the model construct the complex traffic flow patterns of the city more comprehensively and shows intuitively how historical traffic flow changes affect the current moment. This helps city managers grasp the current state of urban traffic flow more comprehensively and make sound decisions for future urban planning in light of real needs.
In our future work, we will explore the following directions: first, improve the model structure to reduce the impact on the accuracy of multi-step prediction due to error accumulation and achieve longer-term traffic flow prediction, and second, consider the impact of additional factors such as weather and holidays in the prediction to reduce uncertainty in the model prediction.