1. Introduction
With continued economic growth and urbanization, cities keep expanding, and many are increasingly plagued by traffic congestion. Congestion brings a series of social problems, such as longer travel times for residents, frequent traffic accidents, environmental pollution, and increased fuel consumption [1]. These problems greatly reduce the quality of life of urban residents. Transportation departments have therefore adopted many measures to alleviate urban congestion, such as improving traffic infrastructure, charging congestion fees, providing route guidance, promoting public transport, and implementing traffic control [2,3]. However, when the imbalance between urban traffic supply and demand is long-standing and urban resources are limited, the problem cannot be solved by adding road infrastructure alone, and further construction brings little relief once the infrastructure is already well developed. With the progress of science and technology, traffic management can instead be improved in a digital and intelligent way, namely through the construction of an Intelligent Transportation System (ITS) [4]. An ITS integrates traffic management, traffic information, vehicle control, and other systems. It brings advanced computing, communication, sensing, and artificial intelligence technologies into the traffic management system and organically combines people, roads, and vehicles to establish an efficient, comprehensive traffic and transportation management system, which provides a theoretical basis for managers to carry out active traffic control.
An intelligent transportation system is an effective means of relieving traffic congestion, and traffic flow prediction is the foundation of, and the key to, such a system. It not only provides accurate travel information for travelers but also helps traffic management departments make more effective decisions. The goal of traffic prediction is to predict traffic state parameters over a future period based on historical data (such as traffic speed and traffic flow) collected by detectors on the road network. The predicted quantity can be traffic flow, traffic speed, travel time, etc. In terms of forecasting horizon, traffic flow forecasting can be divided into two categories: short-term and long-term. Long-term forecasting focuses on monthly or annual traffic flow, which is useful for urban construction and transportation planning. Short-term forecasting focuses on real-time forecasts of traffic flow over a short future period (e.g., minutes), and its results are widely used in traffic control, congestion prediction, and other fields. The performance of an intelligent transportation system largely depends on the accuracy of short-term real-time traffic flow prediction [5]. Accurate and timely short-term predictions provide an effective basis for traffic light control, vehicle scheduling, police deployment, etc. With them, the traffic management department can take control and scheduling measures in time to avoid or alleviate traffic congestion as far as possible. Residents can also rely on short-term forecasts to plan travel times and routes, saving travel time and improving travel efficiency. Therefore, research on short-term traffic flow forecasting has important theoretical and practical significance.
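As a minimal sketch of the prediction setup described above, historical detector readings can be cut into (history, target) pairs with a sliding window. The function name `make_windows`, the window lengths, and the sample counts are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def make_windows(series: List[float], n_history: int, n_future: int
                 ) -> List[Tuple[List[float], List[float]]]:
    """Split a series into (history, target) pairs with a sliding window."""
    if n_history <= 0 or n_future <= 0:
        raise ValueError("window lengths must be positive")
    samples = []
    last_start = len(series) - n_history - n_future
    for start in range(last_start + 1):
        # The model input: the n_history most recent observations.
        history = series[start:start + n_history]
        # The value(s) to predict: the next n_future observations.
        target = series[start + n_history:start + n_history + n_future]
        samples.append((history, target))
    return samples


# Hypothetical 5-minute vehicle counts from one detector (illustrative only).
flow = [120.0, 135.0, 150.0, 160.0, 155.0, 140.0]
pairs = make_windows(flow, n_history=3, n_future=1)
for history, target in pairs:
    print(history, "->", target)
```

A short-term forecaster would be trained on such pairs with a small `n_future`; a long-term forecast would use coarser aggregation (for example monthly totals) rather than a longer window over minute-level data.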
Predicting the future state of the transportation system has always been a challenging task, especially for city-wide traffic flow forecasting, which involves many complex factors. First, compared with forecasting for an individual roadway, city-wide forecasting requires an assessment of the overall traffic conditions of the urban road network. The urban road network consists of many road segments, and the traffic flow on a particular segment is affected by upstream and downstream segments as well as adjacent segments. City-wide forecasting therefore cannot ignore this complex spatial relationship. At the same time, urban traffic flow forecasting is a typical time series forecasting problem, and the long-term temporal correlation in traffic series is difficult for traditional forecasting methods to capture, especially when modeling periodicity and trend. In addition, an urban traffic system is nonlinear and uncertain, and its traffic state is also affected by weather, traffic accidents, holidays, and other factors. Traditional prediction models struggle to address these challenges. Therefore, how to improve prediction accuracy deserves further study.
In the current field of traffic flow prediction, researchers mainly transform complex urban road networks into images carrying spatiotemporal feature information and use a convolutional neural network (CNN) to learn their features and make predictions, since CNNs are powerful at image feature extraction. However, CNNs have limitations in characterizing road network features. First, the pooling operation in a CNN actively discards a large amount of information, so key correlations among road states are lost, the spatial resolution of the feature maps is reduced, and the final prediction suffers. Second, for complex road network structures containing viaducts, intersections, and branch roads, a CNN cannot capture the interdependencies between road connections. Given the irregular spatial distribution of urban road networks, traditional CNNs are therefore unable to model and predict traffic road networks effectively. For this reason, this paper adopts a model that integrates CNN and capsules, namely the capsule network (CapsNet) [6], as the backbone of the traffic flow prediction model before improvement. CapsNet uses capsules in vector form as neurons to extract spatial features of the traffic road network, instead of the scalar neurons that a traditional CNN uses for feature representation, and it discards the pooling layer of convolutional networks by passing vectors between layers through a dynamic routing algorithm. All extracted local features are retained, which avoids losing the spatial relationships between traffic roads, mitigates the insufficient extraction of road network spatial features by traditional CNNs, and supports accurate traffic flow prediction. Building on CapsNet, this paper proposes an improved capsule network, MCapsNet. The main contributions of this paper are as follows:
CapsNet first extracts features of the traffic network through its front-segment CNN and converts the extracted feature information into vector neurons, namely capsules, at the capsule layer. However, CapsNet uses only two convolution layers, which prevents the network from extracting high-dimensional local feature information from the feature map. Therefore, in this paper, a depthwise separable convolutional block is added to the convolutional layer of CapsNet and the feature channels are expanded, which allows the network to extract deeper feature information while limiting the parameters and computation introduced. At the same time, because separable convolution is prone to feature degradation during training and to channel collapse, a residual-style shortcut connection and a modified linear bottleneck layer structure are added to the convolution layer to avoid the information loss caused by channel restoration and the network degradation that can occur during training.
Depthwise separable convolution is used in the convolutional layer to enrich the feature channels, but information does not interact between channels; that is, the dependencies between channels are not considered. Therefore, this paper integrates the separable convolution block and channel attention into the network together, placing the depthwise separable convolutional block ahead of the channel attention. The network can then selectively learn the feature information of the important channels after expansion, suppress useless information, and strengthen the expressive ability of the model, which improves prediction performance.
Adding the improved channel attention to the residual blocks of the network gives a richer and more detailed feature map of the traffic road network. However, it is not necessary to use residual blocks with attention in every region of the network. For this reason, the effect of embedding the attention blocks in different regions on traffic flow prediction performance is examined further, and the location with the best result is used as the final embedding location of the attention blocks.
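To make the lightweight argument for depthwise separable convolution concrete, the following sketch counts the weights of a standard convolution and of its depthwise separable counterpart. The counting formulas are the standard ones for these layer types; the layer sizes are hypothetical and are not the configuration used in MCapsNet.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std  # equals 1 / c_out + 1 / k**2

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

For a 3 x 3 kernel the ratio approaches 1/9 as the number of output channels grows, which is why the feature channels can be expanded at modest cost.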
The paper is organized as follows. Section 1 introduces the related research and development of traffic flow forecasting. Section 2 introduces the related work. Section 3 details the proposed improved method and the structure of the improved model. In Section 4, the traffic flow datasets and experimental equipment used are introduced, followed by experiments on both datasets and an analysis of the experimental results. Finally, Section 5 summarizes the research conclusions of this paper and provides an outlook for future research.
2. Related Work
The complex conditions of the urban road network and its changing spatiotemporal correlations make traffic prediction challenging. Traffic data have both temporal and spatial characteristics: temporal characteristics mean that the traffic at the current moment and in historical periods affects the traffic flow of a road section in a future period, while spatial characteristics mean that the future traffic of a road section is affected by the traffic flow of adjacent sections. Early traffic forecasting work used time series analysis models, represented by the autoregressive integrated moving average (ARIMA) model [7] and the vector autoregressive (VAR) model [8]. These methods assume dynamic linear dependence in the time series, whereas traffic data are nonlinear and highly complex, so the models perform poorly in practice. To overcome the linearity assumption, traditional machine learning methods such as k-nearest neighbor (KNN) [9] and support vector regression (SVR) [10] were adopted. They improve on the earlier linear models, but their performance depends largely on hand-crafted features, which take considerable manual effort to extract and limit the application of the models in practical scenarios. Traditional machine learning methods acquire knowledge from data and have strong nonlinear approximation and self-learning ability in prediction. However, they also have obvious drawbacks, such as demanding data requirements, the need for large amounts of training data, and poor model interpretability.
With the continuous improvement of deep learning theory and of GPU computing power, deep learning methods have become widely used in traffic flow prediction. Zhuo et al. [11] proposed a traffic prediction model based on a long short-term memory (LSTM) network and added an autocorrelation coefficient to the model to improve its prediction accuracy. Fu et al. [12] used a gated recurrent unit (GRU) to predict flow and compared it with LSTM to verify its superiority. The convolutional neural network performs strongly in image feature extraction and exploits the translation invariance of image information to extract the spatial features of the data. However, it does not work well when applied to traffic flow prediction on a real road network, because a real traffic network is usually graph-structured data, whose topology a CNN is poorly suited to model. With the rise of graph convolutional networks, researchers transferred the convolution operation from Euclidean space to graph (non-Euclidean) structures. Kipf et al. [13] proposed the graph convolutional network (GCN). Applied to traffic, a GCN extracts the spatial correlation of the traffic network by aggregating the information of a central detector and its neighboring detectors. Zhao et al. [14] proposed a temporal graph convolutional network (T-GCN), which integrates graph convolution and gated recurrent units to model the spatial and temporal dependencies of the traffic network simultaneously for traffic flow prediction. Yu et al. [15] proposed the Spatio-Temporal Graph Convolutional Networks (STGCN) method, which uses causal convolution and GCN to model spatiotemporal correlation. Stacking multiple spatiotemporal convolution blocks gives the model fewer parameters, faster training, and parallel input, so that it can predict the flow of large-scale traffic networks accurately and responsively. Song et al. [16] proposed the Spatial-Temporal Synchronous Graph Convolutional Networks (STSGCN) method, which constructs a local spatiotemporal graph, fuses the spatiotemporal features of the previous, current, and subsequent moments, and extracts the local spatiotemporal correlation and heterogeneity in the graph with multiple synchronous convolutional layers. Their experiments show that this method, which considers the local dynamic correlations in the traffic road network, is effective for traffic flow prediction and outperforms other advanced models.
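The neighbor aggregation performed by a GCN can be illustrated with a single layer in plain Python. The sketch below follows the commonly cited propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where D is the degree matrix of A + I; the three-detector road graph, the readings, and the identity weight are hypothetical values chosen so the result is easy to check by hand.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph convolution: aggregate neighbors, transform, apply ReLU."""
    propagated = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in propagated]


# Hypothetical road graph: three detectors in a line, 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[10.0], [20.0], [30.0]]   # one flow reading per detector
weights = [[1.0]]                     # identity weight, for readability
out = gcn_layer(adj, features, weights)
print(out)
```

Each output row mixes a detector's own reading with those of its direct neighbors, so detector 0 is influenced by detector 1 but not by detector 2 after a single layer.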
With the development of lightweight and efficient deep learning networks, the attention mechanism was proposed by Velickovic et al. [17] and used for image description. It has since been widely used in natural language processing (NLP) [18,19,20,21,22]. Its main idea is to recalibrate the static hidden vector with a memory function by introducing an attention module to obtain the output value. Guo et al. [23] proposed the attention-based spatial-temporal graph convolutional networks (ASTGCN) method, which models temporal and spatial correlations with one-dimensional temporal convolution and graph convolution, respectively, and uses an attention module to capture correlations between different moments, between different locations, and between different sequences at the same location. Their experiments show that the model, which considers the spatiotemporal characteristics of different periods, predicts better than the advanced models it was compared with. On this basis, Guo et al. [24] proposed the Attention-based Spatial-Temporal Graph Neural Network (ASTGNN) method. In the temporal dimension, they proposed a temporal trend-aware multi-head attention module to capture causal relationships and local context information in time series effectively. In the spatial dimension, they proposed a new dynamic graph convolution block that calculates the spatial correlation strength between nodes from the intermediate sequence output of the temporal attention module, yielding a dynamic spatial correlation weight matrix that enables the model to extract local spatiotemporal correlation and spatial heterogeneity in the spatiotemporal graph. Their experiments verify that using temporal trend and causality to capture local spatiotemporal correlations simultaneously benefits traffic flow prediction, and that the method outperforms other advanced models.
The above work provides a valuable reference for the development of traffic flow prediction, and current research on traffic flow prediction models mainly focuses on how to extract and model the features of traffic data. To address the limitations of the traditional convolutional neural network in modeling the spatial relationships of the road network, as well as the inadequate feature extraction of the CapsNet front-segment network, this paper proposes MCapsNet, an improved traffic flow prediction model based on CapsNet. So that the model can perform high-dimensional feature extraction on the traffic feature map and capture more key features, and thereby improve prediction performance, we use a depthwise separable convolutional block in the network. To avoid feature degradation, channel collapse, and information redundancy, we propose a series of adaptations for using the depthwise separable convolutional block effectively: a shortcut connection design, a modified linear bottleneck layer structure, and added channel attention. The improved MCapsNet enables the front-segment CNN to extract features in a high-dimensional space, strengthens the extraction of deeper features from the input, and is better adapted to vectorizing the CNN features into capsule form, which better characterizes the attribute features of the road network. The experimental results show that, compared with the CapsNet model, MCapsNet improves the ability to extract high-dimensional feature details in the spatiotemporal feature map and enhances the expressiveness and prediction performance of the network.
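The role of channel attention can be sketched with a squeeze-and-excitation style module: each channel is summarized by its global average, a small bottleneck produces one weight per channel, and the feature map is rescaled channel by channel. This is an assumed, generic form of channel attention; the exact structure of the improved channel attention in MCapsNet is not given in this section, and the weights below are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap: FeatureMap,
                      w_reduce: List[List[float]],
                      w_expand: List[List[float]]
                      ) -> Tuple[FeatureMap, List[float]]:
    """SE-style attention: squeeze, excite, then rescale each channel."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in fmap]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, squeezed)))
              for w_row in w_reduce]
    scales = [sigmoid(sum(w * h for w, h in zip(w_row, hidden)))
              for w_row in w_expand]
    # Scale: important channels are kept, the others are suppressed.
    scaled = [[[v * scales[c] for v in row] for row in ch]
              for c, ch in enumerate(fmap)]
    return scaled, scales


# Hypothetical 2-channel, 2 x 2 feature map and hand-picked weights.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]]]
w_reduce = [[0.5, 0.5]]        # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel weights
scaled, scales = channel_attention(fmap, w_reduce, w_expand)
print(scales)
```

In a trained network the two weight matrices are learned, so the per-channel scales reflect which expanded channels actually help the prediction.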
5. Conclusions
This paper proposed an improved traffic flow prediction model based on CapsNet. CapsNet first extracts features from traffic data through the front-segment CNN and then vectorizes them into capsules to model the space of the traffic road network. The front section of CapsNet has only two convolution layers, which is insufficient for deep feature extraction. To keep the model lightweight, a depthwise separable convolutional block is introduced to enrich the number of features, and channel attention is used to selectively learn the expanded channel features. The depthwise separable convolutional block and channel attention are connected into the residual block of the network through the modified linear bottleneck layer, which ensures that little information is lost when the features are restored to the low-dimensional space. In the experiments, we introduced the improved channel attention into three residual blocks of CapsNet, namely the first convolution layer, the second convolution layer, and the PrimaryCaps layer, to explore the impact of the attention embedding position on the performance of the prediction model. The model achieves its best prediction performance when the improved channel attention is introduced into the second convolution layer. For Wenyi Road, the full-period prediction results have an RMSE of 30.70 and an MAE of 16.55, the peak period has an RMSE of 40.93 and an MAE of 22.75, and the off-peak period has an RMSE of 37.54 and an MAE of 20.52. We also changed the ReLU6 activation function in the linear bottleneck layer to ReLU, since an activation function without an upper limit can improve the expressive power of the model in prediction. Experimental comparison confirmed that the model with this change predicts better.
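For reference, the two error metrics reported above follow their standard definitions, sketched here in plain Python. The observed and predicted values are hypothetical and are unrelated to the Wenyi Road results.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average magnitude of the prediction errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more than MAE."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical observed and predicted flows (illustrative only).
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 115.0, 150.0, 150.0]
mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
print(mae_value, rmse_value)
```

RMSE is never smaller than MAE on the same data, and a large gap between the two indicates that a few large errors dominate.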
In the experiments on two different datasets, MCapsNet was compared with other prediction models. SVR, ARIMA, and GRU are single-module models that consider only temporal characteristics; their prediction performance was unstable, which makes them ill-suited to traffic flow prediction across different time periods. GCN and T-GCN are graph convolution-based prediction models that can model the space of the traffic road network. Compared with these models, the error metrics of MCapsNet were also significantly lower, which suggests that the spatial modeling ability of the vector neurons in MCapsNet is better than that of a graph convolution network. At the same time, compared with the CapsNet and CapsNet (CA) models, the prediction performance of MCapsNet also improved to varying degrees. These experiments verify the effectiveness of the proposed method.
The MCapsNet model proposed in this paper uses a depthwise separable convolutional block to achieve deeper extraction of the input features, thus improving the expressiveness of the model. Meanwhile, to ensure that the deeper model trains effectively, a shortcut connection is used in the network, and the depthwise separable convolutional block and channel attention are connected through the improved linear bottleneck layer structure to avoid feature channel collapse and information redundancy. In addition, our experiments verified that replacing the activation function of the linear bottleneck layer, ReLU6 in the original target recognition task, with ReLU gives better prediction performance. We therefore believe that, in the prediction task, an activation function that is not limited by a maximum value can improve the expressiveness of the model and thus its prediction performance. In future work, we will focus on improving the running speed of the model and its application value.
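The difference between the two activation functions can be seen in a few lines. ReLU6 clips every activation at 6, whereas ReLU leaves large positive values untouched; the sample activations below are hypothetical and only illustrate the clipping behavior, not the values that occur inside MCapsNet.

```python
from typing import List


def relu(x: float) -> float:
    """Standard ReLU: no upper bound on positive activations."""
    return max(0.0, x)


def relu6(x: float) -> float:
    """ReLU6: same as ReLU, but clipped at a maximum of 6."""
    return min(max(0.0, x), 6.0)


# Hypothetical pre-activation values (illustrative only).
values: List[float] = [-2.0, 3.0, 6.0, 45.5]
relu_out = [relu(v) for v in values]
relu6_out = [relu6(v) for v in values]

print(relu_out)
print(relu6_out)
```

With ReLU6, any two activations above 6 become indistinguishable, which is one plausible reason an unbounded activation may suit a regression task whose targets span a wide range.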