Research on Improved Traffic Flow Prediction Network Based on CapsNet

Qiu, Bin; Zhao, Yun

doi:10.3390/su142315996

by Bin Qiu * and Yun Zhao

School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China

* Author to whom correspondence should be addressed.

Sustainability 2022, 14(23), 15996; https://doi.org/10.3390/su142315996

Submission received: 19 October 2022 / Revised: 24 November 2022 / Accepted: 27 November 2022 / Published: 30 November 2022

(This article belongs to the Special Issue Big Data, Information and AI for Smart Urban)

Abstract:

Traffic flow prediction is the basis and key to the realization of an intelligent transportation system. The road traffic flow prediction of city-level complex road network can be realized using traffic big data. In the traffic prediction task, the limitation of the convolutional neural network (CNN) for modeling the spatial relationship of the road network and the insufficient feature extraction of the shallow network make it impossible to accurately predict the traffic flow. In order to improve the prediction performance of the model, this paper proposes an improved capsule network (MCapsNet) based on capsule network (CapsNet). First, in the preliminary feature extraction stage, a depthwise separable convolutional block is added to expand the feature channel to enrich channel information. Subsequently, in order to strengthen the reuse of important features and suppress useless information, channel attention is used to selectively reinforce learning of extended channel information so that the network can extract a large number of high-dimensional important features and improve the ability of network feature learning and expression. At the same time, in order to alleviate the feature degradation during training and the channel collapse problem easily caused by deep convolution, a shortcut connection, and a modified linear bottleneck layer structure are added to the convolution layer. The bottleneck layer adds the depth convolution and channel attention connection to the residual block of the network. Finally, the deep local feature information extracted from the improved convolutional layer is vectorized into the form of a capsule, which can more accurately model the details of road network attributes and features and improve the model expression power and prediction performance. The network is tested on the Wenyi Road dataset and the public dataset SZ-taxi. 
MCapsNet outperforms the comparison models on every evaluation indicator in the tests of different time periods and predictors. Compared with CapsNet, MCapsNet reduces the RMSE by 10.50% in the full period, 4.66% in the peak period, and 9.78% in the off-peak period on the Wenyi Road dataset, and by 6.07% on the SZ-taxi dataset. The experimental results verify the effectiveness of the model improvement.

Keywords:

intelligent transportation; traffic flow prediction; attention mechanism; temporal-spatial correlation

1. Introduction

With the continuous development of economic growth and urbanization, the scale of cities is increasing, and many cities are increasingly plagued by traffic jams. Traffic brings a series of social problems, such as time-consuming travel for residents, frequent traffic accidents, environmental pollution, and fuel consumption increases [1]. These problems greatly reduce the quality of life of urban residents. Therefore, the transportation department has formulated many measures to alleviate urban congestion, namely improving traffic infrastructure, charging congestion fees, providing route guidance, promoting public transport, and implementing traffic control [2,3]. However, when the contradiction between urban traffic supply and demand is long-term and urban resources are limited, it is not possible to solve this problem only by increasing road infrastructure, and it is also very difficult to alleviate traffic congestion by increasing road infrastructure when the traffic infrastructure has been perfected. With the continuous development and progress of science and technology, traffic management can be improved in a digital and intelligent way, namely through the construction of an Intelligent Transportation System (ITS) [4]. Intelligent transportation systems can integrate traffic management, traffic information, vehicle control, and other systems. Advanced computer technology, communication technology, sensing technology, artificial intelligence, and other technologies are effectively integrated into the traffic management system, the organic combination of people, roads, and cars, to establish an efficient operation of the traffic and transportation comprehensive management system, to provide a theoretical basis for managers to carry out active traffic control.

An intelligent transportation system is an effective means to relieve traffic congestion, and traffic flow prediction is the basis and key to its realization. Prediction can not only provide accurate travel information for travelers but also help traffic managers make more effective judgments. The goal of traffic prediction is to predict traffic state parameters for a future period based on historical data (such as traffic speed and traffic flow) collected by detectors on the road network. The predicted quantity can be traffic flow, traffic speed, travel time, etc. In terms of forecasting horizon, traffic flow forecasting can be divided into two categories: short-term and long-term. Long-term forecasting focuses on monthly or annual traffic flow, which is useful for urban construction or transportation planning. Short-term forecasting focuses on real-time forecasts of potential traffic flows over a short future period (e.g., minutes), and its results are widely used in traffic control, congestion prediction, and other fields. Since the performance of an intelligent transportation system largely depends on the accuracy of short-term real-time traffic flow prediction [5], accurate and sensitive short-term predictions can provide an effective basis for traffic light control, vehicle scheduling, police deployment, etc. Based on the forecast results, the traffic management department can take control and scheduling measures in a timely manner so as to avoid or alleviate traffic congestion as much as possible. Residents can also rely on short-term forecasts to plan travel times and routes reasonably, saving travel time and improving travel efficiency. Therefore, research on short-term traffic flow forecasting has important theoretical and practical significance.
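The forecasting task described above, predicting a future value from a window of historical observations, can be illustrated with a minimal sliding-window sketch. The function name, window length, horizon, and sample speeds are hypothetical and are not taken from the paper's datasets.

```python
def make_windows(series, history_len, horizon):
    """Split a 1-D series into (history, future) pairs with a sliding window."""
    samples = []
    last_start = len(series) - history_len - horizon
    for start in range(last_start + 1):
        history = series[start:start + history_len]
        future = series[start + history_len:start + history_len + horizon]
        samples.append((history, future))
    return samples


# Hypothetical speeds (km/h) recorded at fixed intervals on one road segment.
speeds = [42.0, 40.5, 38.0, 35.5, 37.0, 41.0, 43.5]
pairs = make_windows(speeds, history_len=4, horizon=1)
```

Each pair holds four past readings and the single reading that follows them, which is the input and target form a short-term predictor would be trained on.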

Predicting the future state of the transportation system has always been a challenging task, especially for city-wide traffic flow forecasting, which involves many complex factors. First, compared to individual roadway forecasting, city-wide forecasting requires an assessment of the overall traffic conditions of the urban road network. The urban road network consists of many road segments, and the traffic flow on a particular road segment is affected by both upstream and downstream road segments as well as adjacent road segments. Therefore, city-wide forecasting cannot ignore this complex spatial relationship of the urban road network. At the same time, urban traffic flow forecasting is a typical time series forecasting problem, and the long-term time correlation in the traffic series is difficult to be captured by traditional forecasting methods, especially when modeling its periodicity and trend. In addition, an urban traffic system is a system with nonlinearity and uncertainty, and its traffic state is also affected by weather, traffic accidents, holidays, and other factors. Traditional prediction models cannot solve the above challenges. Therefore, how to improve prediction accuracy is worthy of further discussion.

In the current field of traffic flow prediction, researchers mainly transform complex urban road networks into images with spatiotemporal feature information and use CNN to learn their features and make predictions, because CNN shows powerful performance in image feature extraction. However, CNN has limitations in characterizing traffic road network features. First, the pooling operation in CNN actively discards a large amount of information, so key correlations between road states are lost, the spatial resolution of the feature maps is reduced, and the final prediction results are affected. Second, for complex road network structures containing viaducts, intersections, and branch roads, CNN cannot capture the interdependencies between road connections. Given the irregular spatial distribution of urban road networks, traditional CNNs are unable to effectively model and predict traffic road networks. Therefore, this paper adopts a model that integrates CNN and capsules, namely the capsule network (CapsNet) [6], as the backbone network of the traffic flow prediction model before improvement. Instead of the scalar neurons used for feature representation in a traditional CNN, CapsNet uses capsules in vector form as neurons to extract spatial features in the traffic road network, and it discards the pooling layer of convolutional networks by transferring vectors through a dynamic routing algorithm. All extracted local features are retained, which avoids losing the spatial feature relationships between traffic roads, mitigates the insufficient extraction of road network spatial features by traditional CNN, and supports accurate prediction of traffic flow. This paper proposes an improved capsule network, MCapsNet, based on CapsNet. The main contributions of this paper are as follows:

CapsNet initially extracted the features of the traffic network through the front segment CNN and converted the feature information extracted by CNN into vector neurons, namely capsules, at the capsule layer. However, CapsNet only uses two layers of convolution, which makes it impossible for the network to extract the information of high-dimensional local features in the feature map. Therefore, in this paper, a depthwise separable convolutional block was added to the convolutional layer of CapsNet, and the feature channel was expanded to allow the network to extract deep-level feature information and effectively reduce the introduction of parameters in the network and its calculation amount. At the same time, considering feature degradation during training and channel collapse easily caused by separable convolution, a shortcut design of the residual network and modified linear bottleneck layer structure was added to the convolution layer to avoid information loss caused by channel restoration and network degradation during training.
Depthwise separable convolution was used in the convolutional layer to enrich feature channels, but the information between channels did not interact: that is, the dependencies between channels were not considered. Therefore, this paper considered integrating separable convolution block and channel attention into the network at the same time and using depthwise separable convolutional block at the head of channel attention. The network can selectively learn the feature information of the important channels after expansion, suppress the useless information, and enhance the model expression’s ability to improve the model prediction performance.
By adding improved channel attention to the residual blocks of the network, the network obtained a richer and more detailed feature map of the traffic road network. However, it was not necessary to use residual blocks with attention in each region of the network. For this reason, the effect of embedding the attention blocks in different regions on the traffic flow prediction performance was further investigated, and the location with the best result was used as the final embedding location of the attention blocks.

The paper is organized as follows. Section 1 introduces the related research and development of traffic flow forecasting. Section 2 introduces the related work. Section 3 details the proposed improved method and the structure of the improved model. In Section 4, the used traffic flow datasets and experimental equipment are introduced, followed by experiments on both datasets and an analysis of the experimental results. Finally, Section 5 summarizes the research conclusions of this paper and provides an outlook for future research.

2. Related Work

The complex condition of the urban road network and the changing spatiotemporal correlation make traffic prediction challenging. The characteristics of traffic data mainly include temporal and spatial characteristics: temporal characteristics refer to the traffic changes at the current moment and historical periods affecting the traffic flow of the road section in the future period. Spatial characteristics refer to the traffic of the road section in the future period that will be affected by the traffic flow of adjacent sections. In early traffic forecasting problems, time series analysis models were used, represented by autoregressive integrated moving average (ARIMA) [7] and vector autoregressive (VAR) [8]. These methods are based on the assumption of dynamic linear dependence in time series data, while traffic data has the characteristics of nonlinearity and high complexity, so the effect of the model is not ideal in actual performance. In order to address the effect of the linearity assumption, traditional machine learning-based methods such as k-nearest neighbor (KNN) [9] and support vector regression (SVR) [10] are used, which have improved prediction performance compared to earlier linear models, but the performance largely depends on hand-crafted features, which will take a lot of manpower to extract and limit the application of the model in practical scenarios. Traditional machine learning methods acquire knowledge from data and have strong nonlinear approximation ability and self-learning ability in prediction. However, they also have obvious defects, such as high requirements for data, large amounts of training data, and poor interpretability of models.

With the continuous improvement of deep learning theory and of GPU computing power, methods based on deep learning are widely used in traffic flow prediction tasks. Zhuo et al. [11] proposed a traffic prediction model based on a long short-term memory (LSTM) network and added an autocorrelation coefficient to the model to improve its prediction accuracy. Fu et al. [12] used a gated recurrent unit (GRU) to predict flow and compared it with LSTM to verify its superiority. The convolutional neural network shows strong performance in image feature extraction and uses the translation invariance of image information to extract the spatial features of the data. However, when it is applied to a real road network for traffic flow prediction, the effect is not good because the real traffic network is usually graph-structured data, and regular convolution is not conducive to modeling the network topology. With the rise of graph convolutional networks, researchers transferred the convolution operation from Euclidean space to graph (non-Euclidean) structures. Kipf et al. [13] proposed the graph convolutional network (GCN). In traffic prediction, a GCN extracts the spatial correlation of the traffic network by aggregating the information of a central detector and its neighboring detectors. Zhao et al. [14] proposed a temporal graph convolutional network (T-GCN), which integrates graph convolution and gated recurrent units to simultaneously model the spatial and temporal dependencies of the traffic network. Yu et al. [15] proposed the Spatio-Temporal Graph Convolutional Networks (STGCN) method, which uses causal convolution and GCN to model the spatiotemporal correlation. Multiple spatiotemporal convolution blocks are stacked so that the model has fewer parameters, trains faster, and accepts parallel input, allowing it to predict large-scale traffic network flow accurately and sensitively. Song et al. [16] proposed the Spatial-Temporal Synchronous Graph Convolutional Networks (STSGCN) method, which constructs a local spatiotemporal graph, fuses the spatiotemporal features of the previous, current, and subsequent moments, and extracts the local spatiotemporal correlation and heterogeneity in the spatiotemporal graph through multiple synchronous convolutional layers. Their experiments verify that considering the local dynamic correlation in the traffic road network is effective for traffic flow prediction, and the prediction performance is better than that of other advanced models.

With the development of lightweight and high-efficiency deep learning networks, the attention mechanism was proposed by Velickovic et al. [17] and used for image description. It has been widely used in the field of natural language processing (NLP) [18,19,20,21,22]. Its main idea is to recalibrate the static hidden vector with a memory function by introducing an attention module to obtain the output value. Guo et al. [23] proposed the attention-based spatial-temporal graph convolutional networks (ASTGCN) method, which uses one-dimensional temporal convolution and graph convolution to model temporal and spatial correlations, respectively, and uses an attention module to capture correlations between different moments, different locations, and different sequences at the same location. The experiments show that the model, which considers the spatiotemporal characteristics of different periods, predicts better than other advanced models. On this basis, Guo et al. [24] proposed the Attention-based Spatial-Temporal Graph Neural Network (ASTGNN) method. In the temporal dimension, a temporal trend-aware multi-head attention module is proposed to effectively capture the causal relationships and local context information in time series. In the spatial dimension, a new dynamic graph convolution block is proposed that calculates the spatial correlation strength between nodes from the intermediate sequence output of the temporal attention module, yielding a dynamic spatial correlation weight matrix. This enables the model to extract local spatiotemporal correlation and spatial heterogeneity in the spatiotemporal graph. The experiments verify that using temporal trend and causality to capture local spatiotemporal correlation simultaneously is favorable for traffic flow prediction, and the prediction performance is better than that of other advanced models.

The above work provided reference significance for the development of traffic flow prediction, and the current research on traffic flow prediction models mainly focuses on how to extract and model the features of traffic data. In order to solve the limitations of the traditional convolutional neural network for modeling the spatial relationship of the road network and the problem of inadequate feature extraction by CapsNet front segment network, this paper proposes MCapsNet, an improved traffic flow prediction model based on CapsNet, and in order to make the model perform high-dimensional feature extraction for traffic feature map to capture more key features to achieve the improvement of prediction performance, we proposed to use depthwise separable convolutional block in the network. In order to avoid the problems of feature degradation, channel collapse, and information redundancy, a series of adaptation improvements such as shortcut connection design, modified linear bottleneck layer structure and added channel attention are proposed for the effective use of a depthwise separable convolutional block. The improved MCapsNet network enabled the front segment CNN to perform feature extraction in a high-dimensional space, enhanced the network to extract features at a deeper level from the input features, and made a better adaptation for vectorizing the features extracted by the CNN into a capsule form to better characterize the attribute features in the road network. The experimental results showed that MCapsNet improves the ability to extract high-dimensional feature details in the spatiotemporal feature map compared with the CapsNet model and enhances the expressiveness and prediction performance of the network.

3. Methodology

3.1. Depthwise Separable Convolutional Block and Inverted Residual Structure

CapsNet is an integrated model of CNN and capsules. Figure 1 shows the CapsNet network structure before improvement. First, the traffic flow data is converted into images with spatiotemporal feature information. The first and second convolution layers in the network each use 32 convolution kernels of size 3 × 3 to initially extract the features in the spatiotemporal traffic map, and the resulting local features serve as input to the PrimaryCaps layer. In the PrimaryCaps layer, features are extracted from the feature maps by 128 convolution kernels of size 3 × 3, and the feature maps are then divided into several sub-feature maps with a channel dimension of 8. The sub-feature maps are viewed as multiple 8-dimensional vectors. Each vector represents a primary capsule containing 8 neurons, the number of primary capsules is determined by the size of the dataset, and the primary capsules are used as the input to the TrafficCaps layer. The fourth layer is the TrafficCaps layer, which outputs a 16 × N array, where 16 is the dimension of each capsule vector (each capsule in this layer has 16 elements) and N is the actual number of roads. The spatial relationships between the local features implicit in all primary capsules are captured by three iterations of the dynamic routing algorithm between the primary capsule layer and the advanced capsule layer, and the features are output to a set of advanced capsules of dimension 16. The direction of each 16-dimensional advanced capsule vector represents various attribute features of the corresponding road segment, and its length represents the predicted traffic flow of the road segment. Finally, the predicted values are output through a flattening operation. CapsNet uses CNN to extract features, which is the premise for the subsequent capsule layers to integrate the extracted features and vectorize them. However, only two convolution layers are used in CapsNet, so the network is unable to extract high-dimensional local feature information from the feature map. As a result, the capsules formed by vectorizing the local features do not represent the feature details well enough, which degrades the prediction performance of the model. The more layers a neural network has, the more abstract its representation of the input features and the more accurate its understanding. This is because the expansion of feature channels enables the network to fully extract abstract features in high-dimensional space, with stronger expressive power to cover more key features. However, adding convolutional layers also brings more parameters to the network, which makes it more prone to overfitting.
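The routing step between primary and advanced capsules can be sketched as follows. This is a minimal pure-Python illustration of routing-by-agreement with the squash nonlinearity as described in the original CapsNet work [6]. Whether the paper's TrafficCaps layer uses exactly this form is an assumption, and the toy prediction vectors `u_hat` (4-dimensional rather than 16-dimensional) are hypothetical.

```python
import math


def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def length(v):
    return math.sqrt(sum(x * x for x in v))


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction of primary capsule i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients per input capsule
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):  # raise logits where prediction and output agree
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


# Hypothetical toy input: 3 primary capsules, 2 output capsules (roads).
u_hat = [
    [[0.9, 0.1, 0.0, 0.2], [0.1, 0.0, 0.1, 0.0]],
    [[0.8, 0.2, 0.1, 0.1], [0.0, 0.1, 0.0, 0.1]],
    [[1.0, 0.0, 0.1, 0.3], [0.1, 0.1, 0.0, 0.0]],
]
capsules = dynamic_routing(u_hat, iterations=3)
lengths = [length(v) for v in capsules]
```

In this sketch the length of each output capsule plays the role of the per-road predicted value, and the squash function keeps that length below 1, which suggests the flow values would be normalized before training.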

Adding depthwise separable convolution to the convolutional layers of an existing network allows features to be extracted at a deeper level while introducing fewer parameters and less computation than ordinary convolution. To enable CapsNet to perform deep feature extraction in high-dimensional space during the feature extraction stage, while keeping the network lightweight and less prone to overfitting, depthwise separable convolutional blocks are added to the convolutional layers of CapsNet. This expands the feature channels and extracts features while limiting the additional parameters and computation. The depthwise separable convolution is divided into two parts: Depthwise Convolution (DW) and Pointwise Convolution (PW). (1) In the Depthwise Convolution, one convolution kernel is responsible for one channel, and each channel is convolved by only one convolution kernel. The number of feature map channels generated by this process is the same as the number of input channels. (2) The convolution kernel of the Pointwise Convolution has size 1 × 1 × M × N, where M is the number of channels in the previous layer and N is the number of output channels. This convolution weights and combines the previous feature maps in the depth direction to generate new feature maps, with one output feature map per convolution kernel. This grouped-convolution approach reduces the number of parameters and lowers the operation cost: with 3 × 3 kernels, the computation is 8–9 times less than that of a standard convolution performing the same operation [25].
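The 8–9 times saving can be checked with simple arithmetic. The sketch below counts parameters and multiply-accumulate operations for a standard convolution and for its depthwise separable counterpart, assuming stride 1, "same" padding, and no bias terms. The layer size (32 to 128 channels on a 16 × 16 map) is a hypothetical example, not a layer taken from MCapsNet.

```python
def standard_conv_cost(k, m, n, h, w):
    """Parameters and multiply-adds of a k x k convolution, M -> N channels."""
    params = k * k * m * n
    return params, params * h * w


def depthwise_separable_cost(k, m, n, h, w):
    """Cost of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    dw_params = k * k * m  # one k x k kernel per input channel
    pw_params = m * n      # 1 x 1 x M x N pointwise kernels
    params = dw_params + pw_params
    return params, params * h * w


# Hypothetical layer: 3 x 3 kernels, 32 -> 128 channels, 16 x 16 feature map.
std_params, std_macs = standard_conv_cost(3, 32, 128, 16, 16)
ds_params, ds_macs = depthwise_separable_cost(3, 32, 128, 16, 16)
ratio = std_macs / ds_macs
```

The ratio equals 1 / (1/N + 1/k²), so for 3 × 3 kernels it approaches 9 as the number of output channels N grows.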

Although the depthwise convolution can enrich feature channels and greatly reduce the amount of computation in the model, the convolution kernel trained by this grouping convolution method is prone to failure [26]. That is, most of the parameters in the convolution kernel are 0 because after the channel is restored, the feature is in the low-dimensional space, and after the ReLU layer, the negative value will become 0, causing information loss. Compared with the high-dimensional space, there is a greater probability that a certain dimension will be eliminated. That is, the channel will collapse.

To address feature degradation and the channel collapse easily caused by separable convolution during training, the shortcut design of the residual network is used between layers to change the direction of data transmission in the network. Because the depthwise convolution block first raises the dimension of the feature map and then extracts features, the residual design differs slightly from the classical one. Using depthwise convolution in a residual block to first raise the dimension, then extract features, and finally reduce the dimension is similar to the residual structure in MobileNetV2. The classical residual connection is shown in Figure 2a: a 1 × 1 convolution first reduces the dimension, a 3 × 3 convolution follows, and finally a 1 × 1 convolution increases the dimension so that the output matches the identity shortcut. In MobileNetV2, a 1 × 1 pointwise convolution is used first to increase the dimension and enrich the number of features, which is the opposite of the reduce-then-increase order of the classical residual network. This structure, which raises and then reduces the dimension, is called the inverted residual and is shown in Figure 2b; it matches the dimension-raising operation applied to the features before the depthwise separable convolution. This paper uses the inverted residual structure to connect the depthwise convolution so that feature extraction can be performed in a high-dimensional space. At the same time, because the ReLU nonlinear activation function destroys feature information when used in low-dimensional space, a linear function is preferable there. Therefore, after the second pointwise convolution in Figure 2b, no nonlinear activation is applied and a linear function is used instead, which is why this layer is called the linear bottleneck layer. For the remaining nonlinear activations in the bottleneck block, experimental comparison showed that ReLU gives better prediction results than ReLU6, so the ReLU6 of MobileNetV2 is replaced by ReLU. The experimental comparison and analysis are given in the Experiments and Results Analysis section. The resulting linear bottleneck layer structure of the residual block in the improved MCapsNet network is shown in Figure 2b.
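The expand, filter, project, and add order of the inverted residual block can be sketched in pure Python on a tiny `[channel][row][column]` feature map. This is an illustration only: normalization layers are omitted, and the weights, the expansion from 2 to 4 channels, and the identity depthwise kernels are hypothetical choices that keep the arithmetic easy to follow.

```python
def relu(x):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def pointwise(x, weights):
    """1 x 1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wo[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
             for i in range(h)] for wo in weights]


def depthwise3x3(x, kernels):
    """Depthwise 3 x 3 convolution, zero padded; kernels[c] sees channel c only."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for ch, k in zip(x, kernels):
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            plane[i][j] += k[di + 1][dj + 1] * ch[ii][jj]
        out.append(plane)
    return out


def inverted_residual(x, w_expand, dw_kernels, w_project):
    expanded = relu(pointwise(x, w_expand))              # 1 x 1 expansion + ReLU
    filtered = relu(depthwise3x3(expanded, dw_kernels))  # depthwise 3 x 3 + ReLU
    projected = pointwise(filtered, w_project)           # linear bottleneck
    return [[[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(x, projected)]             # shortcut connection


# Hypothetical input and weights: 2 channels expanded to 4, then projected to 2.
x = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [1.0, 2.0]]]
w_expand = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
dw_kernels = [identity] * 4
w_project = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
y = inverted_residual(x, w_expand, dw_kernels, w_project)
```

The negative input value in the second channel survives in `y` because no ReLU follows the projection and the shortcut adds the input back unchanged. That is the information a nonlinear activation in the low-dimensional space would have discarded.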

3.2. Improved Channel Attention

The depthwise separable convolution is used in the network to realize the expansion of feature channels, but the information between channels does not produce interaction. That is, the dependencies between channels are not considered. It is more important to selectively learn the information of a large number of channels after expansion to avoid the redundancy of information, which makes it difficult for the network to grasp the deep and important information, resulting in negative optimization training of the network and affecting the final prediction results.

To solve the above problems, channel attention is added to the convolution layer. The channel attention [27] used in this paper is based on the SE (Squeeze-and-Excitation) module, a typical attention model in the channel domain. It calculates the weight coefficient of each channel through three steps: squeezing, excitation, and redistribution. Global average pooling (AvgPool) aggregates the global information of each channel of the input feature map into a channel vector, in which each value carries the global feature information of the corresponding channel. Two fully connected layers then learn the interdependence between different channels to obtain weight coefficients, batch normalization (BN) rescales them, and a Sigmoid maps the weights into the range [0, 1] to obtain the attention feature map. The formula is as follows:

C_{A} = σ \{ BN [ fc \{ fc ( AvgPool (F) ) \} ] \} \quad (1)

where σ represents the sigmoid function, BN represents batch normalization, fc is a fully connected layer, AvgPool represents global average pooling, and F represents the input feature map.

The depthwise separable convolutional block and channel attention are integrated into the network together by placing a depthwise separable convolutional block at the head of the channel attention. The block enriches the number of feature channels and feeds feature maps with more high-dimensional information into the channel attention, which assigns weights to the high-dimensional feature channels so that the network can selectively learn the feature information of each channel, suppress useless information, and enhance the model’s expressive ability. The improvement is shown in Figure 3. The blue box in Figure 3 is the channel attention before the improvement, Conv represents a convolution layer with a ReLU activation layer, and conv represents a convolution layer with a linear function. DS 3 × 3 is the depthwise separable convolutional block, and its internal structure is shown on the right side. A separable convolution with a kernel size of 3 × 3 is used in the block to extract information from the high-dimensional features after channel expansion. The normalization layer and ReLU layer are added to make the parameters obtained during training more stable and to learn more nonlinear relationships between channels, improving the nonlinear expressive power and convergence speed of the network. The formula is as follows:

Out = σ \{ BN [ fc \{ fc ( AvgPool (D_{S} \cdot F) ) \} ] \} \otimes F \quad (2)

where D_{S} represents the channel expansion multiple, and \otimes represents the multiplication of corresponding elements.
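The squeeze, excitation, and reweighting steps behind Equations (1) and (2) can be sketched in pure Python. Several simplifications here are assumptions rather than the paper's design: batch normalization and the depthwise separable block at the head are omitted, a ReLU is placed between the two fully connected layers as in the original SE design, and all weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def global_avg_pool(feature_map):
    """Squeeze: one mean value per channel of a [C][H][W] feature map."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def fully_connected(vec, weights, bias):
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def channel_attention(feature_map, w1, b1, w2, b2):
    """Excitation: map pooled channel statistics to one weight per channel."""
    z = global_avg_pool(feature_map)
    hidden = [max(0.0, h) for h in fully_connected(z, w1, b1)]  # assumed ReLU
    logits = fully_connected(hidden, w2, b2)
    return [sigmoid(t) for t in logits]  # each weight lies in (0, 1)


def reweight(feature_map, weights):
    """Redistribution: scale every value of channel c by its attention weight."""
    return [[[wc * v for v in row] for row in ch]
            for ch, wc in zip(feature_map, weights)]


# Hypothetical 2-channel, 2 x 2 feature map and fully connected weights.
F = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1, b1 = [[1.0, 1.0]], [0.0]             # 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]     # 1 hidden unit -> 2 channel weights
attention = channel_attention(F, w1, b1, w2, b2)
out = reweight(F, attention)
```

With these toy weights the first channel receives a weight close to 1 and the second a weight close to 0, which is the selective emphasis and suppression of channels that the text describes.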

3.3. Improved MCapsNet Network Structure

The improved MCapsNet is a network with an obvious layered structure. Only the first convolution layer, the second convolution layer, and the PrimaryCaps layer involve convolutional operations, so shortcut connections are considered only for these three layers. The improved channel attention (DCA) module is embedded in the residual blocks of these layers. Figure 4a is the CapsNet structure, Figure 4b is the MCapsNet structure with improved channel attention introduced in each layer, and Figure 4c is the structure of MCapsNet with the DCA module added to the second convolutional layer. The modified linear bottleneck layer in the residual block connects the depthwise separable convolutional block and the channel attention. When the CNN at the front of CapsNet converts the traffic flow data into images with spatiotemporal feature information, the feature channels are expanded so that feature extraction works in a high-dimensional space, and channel attention selectively learns the high-dimensional channel information so that the model can capture the important features in the data. After channel recovery, the bottleneck layer uses a linear output to avoid a large loss of feature information in the low-dimensional space. The weighted feature map then enters the capsule layer, where the high-dimensional local feature information extracted by the CNN layers is converted into capsules in vector form. With the improved feature extraction stage, the capsules carry more high-dimensional feature information, so they can better represent the deeper abstract features in the data, which enhances the expressive power and prediction performance of the model.
By adding a channel attention module to the residual block, richer and more detailed feature information in the spatiotemporal feature map can be obtained, but using the attention mechanism in the residual block of each convolutional layer is not necessarily optimal [28]. Therefore, we further discuss the impact of attention modules embedded in different regions on model prediction performance in our experiments.

4. Experiment and Result Analysis

4.1. Datasets

The data used in the experiment are the traffic flow data of four intersections on Wenyi Road, which are, from west to east, Gudun Road, Jingzhou Road, Fengtan Road, and Yile Road. Figure 5A–D show the channelization diagrams of the four intersections. Detectors were placed at the four intersections, and data were collected every 3 min from 1 August 2020 to 30 August 2020. The collected data included road codes, lane codes, collection time, and traffic flow; of these, time and traffic flow are used in the experiment. The data are arranged as follows. At the Gudun Road intersection, North–South 271 and 278 have nine lanes, and the sum of the vehicle counts in these nine lanes is used as the data sample of Gudun Road. Likewise, East–West 269 and 272 have ten lanes, and the sum of the vehicle counts in these lanes is used as the data sample of Wenyi Road at that intersection. The sum of the vehicle data of the east and west lanes of the four intersections is used as the total data sample of Wenyi Road. The same statistical method was used for the other three intersections. A total of 480 × 4 samples were collected per 24 h day. Due to sensor failures and other conditions, some data values were less than zero, and these records were deleted, leaving 13,125 × 5 samples for the 30 days. Figure 6a visualizes the 30-day traffic flow of the four adjacent roads (Gudun Road, Jingzhou Road, Fengtan Road, and Yile Road, from west to east) and shows that traffic flow on these roads has similar spatial features. Figure 6b visualizes 7 days of traffic flow on Wenyi Road and shows an obvious periodic pattern within a week.
In order to verify the validity and practicability of the prediction model, the dataset is divided into a full-time period, a peak period, and an off-peak period, and predictions are made for each. The morning peak in Hangzhou is 7:00–9:00, and the evening peak is 16:30–18:30. During the peak period, 82 samples were collected every day, giving 2460 samples within 30 days; during the off-peak period, 10,665 samples were collected within 30 days. For each dataset, the first 80% (i.e., the first 3 weeks of data) was used as the training set, and the last 20% (i.e., the last week of data) was used as the test set. The public SZ-taxi dataset, in which traffic speed is the predicted quantity, is also used in the experiment; it is likewise split into the first 80% for training and the last 20% for testing.
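The sample counts and the chronological split described above can be checked with a short Python sketch. The helper names are hypothetical, and counting both endpoints of each peak window is an assumption; it is one reading that reproduces the reported 82 peak samples per day.

```python
SAMPLE_INTERVAL_MIN = 3   # one record every 3 minutes
DAYS = 30

# Records per day at one detector: 24 h * 60 min / 3 min.
samples_per_day = 24 * 60 // SAMPLE_INTERVAL_MIN

def window_samples(minutes, interval=SAMPLE_INTERVAL_MIN):
    """Records in a time window, counting both endpoints (assumption)."""
    return minutes // interval + 1

# Two 2-hour peak windows per day: 7:00-9:00 and 16:30-18:30.
peak_per_day = 2 * window_samples(120)
peak_total = peak_per_day * DAYS

def chronological_split(samples, train_fraction=0.8):
    """Keep time order: the earliest part trains, the latest part tests."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

train, test = chronological_split(list(range(10)))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for a forecasting task.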

4.2. Experimental Equipment

The test platform used in the experiment is as follows: CPU: Intel(R) Core(TM) i7-9750H; memory: 16 GB; graphics card: NVIDIA GTX 2080Ti (11 GB); operating system: Win10. The software environment is CUDA 9.0, Python 3.6, and TensorFlow 1.9.0. All models were trained with the Adam optimizer with parameter β = 0.001, for 200 training epochs, with a batch size of 128.

4.3. Evaluation Indicators

The model is evaluated with two indicators, RMSE and MAE, shown in Equations (3) and (4), to verify the effect of the improvement.

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(ŷ_{i} - y_{i})}^{2}}

(3)

MAE = \frac{1}{n} \sum_{i = 1}^{n} |ŷ_{i} - y_{i}|

(4)

where n is the number of predicted samples, y_{i} represents the real value of traffic flow, and ŷ_{i} is the corresponding predicted value.
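Equations (3) and (4) translate directly into code. The following is a minimal pure-Python sketch; the function names and the toy values are illustrative and not taken from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, Equation (3)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error, Equation (4)."""
    n = len(y_true)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n

# Toy example: three real flow values and three predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
example_rmse = rmse(y_true, y_pred)
example_mae = mae(y_true, y_pred)
```

Because RMSE squares each error before averaging, it penalizes large deviations more heavily than MAE, so the two indicators can rank models differently.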

4.4. Embedding Position of Channel Attention in CapsNet

In order to observe the impact of the embedding position of the attention module on prediction performance, the improved channel attention is embedded in the first three layers of CapsNet, individually and in combination, and each variant is used for traffic flow prediction on the Wenyi Road dataset. The position with the best result is adopted as the final embedding position of the improved attention module. The experimental results are shown in Table 1. When the module is embedded in the residual block of the second convolution layer, the RMSE of the model is 30.70 and the MAE is 16.55, the lowest errors among all embedding positions. A possible reason is that, after the initial extraction by the first convolutional layer, channel expansion and attention reinforcement are performed on top of two layers of feature extraction, which makes the initially obtained local features more adequate and gives the subsequent capsule vectorization a better characterization effect. Table 1 also shows that adding the attention mechanism to every residual block is not optimal. The attention mechanism in a residual block changes the output of the middle layer, and adding multiple attention modules may increase the confusion of information during forward and back propagation and damage the semantic feature information contained in the trained parameters. Based on these results, the channel attention is embedded in the second convolution layer of the network, which gives the best prediction performance.

Section 3.1 mentions that the nonlinear activation function used in the linear bottleneck layer is determined through experimental comparison. The Wenyi Road dataset was used in this experiment, and the results are shown in Table 2. The model predicts better with the ReLU activation function than with ReLU6. This may be because MobileNetV2 limits the maximum output for the recognition task, which makes the network easier to deploy on low-precision mobile terminals, whereas limiting the network output in the traffic flow prediction task affects the expressive ability of the model and thus the prediction results. Therefore, the nonlinear activation function in the linear bottleneck layer of MCapsNet is changed to ReLU.
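The difference between the two activation functions is small in code but explains the argument above. The sketch below uses the standard definitions of ReLU and ReLU6; the sample inputs are arbitrary illustrative values.

```python
def relu(x):
    """ReLU: negative inputs become 0, positive inputs pass unchanged."""
    return max(0.0, x)

def relu6(x):
    """ReLU6: like ReLU, but the output is additionally capped at 6."""
    return min(max(0.0, x), 6.0)

inputs = [-2.0, 3.0, 8.5]
relu_out = [relu(v) for v in inputs]
relu6_out = [relu6(v) for v in inputs]
```

Any activation above 6 is clipped by ReLU6, so information about how large a response is gets lost, whereas ReLU leaves the positive range unbounded.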

4.5. Model Performance Comparison and Result Analysis

In order to verify the effectiveness of the proposed improvements, the MCapsNet model proposed in this paper is compared with SVR, ARIMA, GCN, GRU, TGCN, and CapsNet. In addition, CapsNet with the unimproved channel attention added to the second convolution layer, denoted CapsNet (CA), is included in the comparison. The experimental results are shown in Table 3.

As can be seen from Table 3, the prediction results of the proposed MCapsNet in the three periods show a certain improvement over most of the benchmark models. Comparing CapsNet (CA) with CapsNet, the RMSE in the three periods is reduced by 4.49%, 4.12%, and 8.92%, respectively. These results indicate that channel attention can focus on the regions of the feature map with the most abundant information during feature extraction and reduce the information redundancy between channels. MCapsNet incorporates both depthwise convolution and channel attention in the residual block of the second convolution layer. While the feature channels are expanded, channel attention helps the model selectively learn from the large number of features, and the depthwise convolution and channel attention are combined through a linear bottleneck layer to avoid a large loss of feature information during channel recovery. Based on the experimental comparison, the ReLU6 activation function in the linear bottleneck layer was changed to ReLU, so that the maximum output is not limited and the model’s expressive ability in prediction improves. MCapsNet uses vector neurons and a routing algorithm to transmit information, which lets the model retain and learn the local features extracted in the early stage and improves its prediction ability. Comparing MCapsNet with CapsNet, the RMSE in the three periods is reduced by 10.50%, 4.66%, and 9.78%, respectively, and the MAE is reduced by 6.55%, 5.21%, and 10.43%, respectively, which indicates that the channel expansion and selective channel learning introduced in the feature extraction stage of MCapsNet are effective.
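The percentage reductions of MCapsNet relative to CapsNet follow from the RMSE values in Table 3 as (baseline − improved) / baseline. A small sketch of that arithmetic, with hypothetical helper and variable names, is:

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the error of `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# RMSE values of CapsNet and MCapsNet copied from Table 3.
capsnet_rmse = {"full-time": 34.30, "peak": 42.93, "off-peak": 41.61}
mcapsnet_rmse = {"full-time": 30.70, "peak": 40.93, "off-peak": 37.54}

rmse_reduction = {
    period: round(relative_reduction(capsnet_rmse[period], mcapsnet_rmse[period]), 2)
    for period in capsnet_rmse
}
```

The same function applied to the MAE columns gives the MAE reductions quoted in the text.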
The prediction error of the model was also reduced in comparison with GCN and TGCN, the models based on graph convolutional networks, which are good at spatial modeling; this suggests that the spatial modeling ability of capsule networks is better than that of graph convolutional networks. Single-module models such as SVR, ARIMA, and GRU consider only temporal features, and their prediction performance is worse than that of the proposed model on datasets with complex spatial features. The RMSE of the ARIMA model was better than that of most models in some periods, but its MAE was much worse than that of the other models. Moreover, its prediction performance in the complete period was also poor, so its overall prediction performance was not stable. Comparing the prediction results in the three time periods supports the effectiveness of the attention-based channel domain improvement in MCapsNet. Figure 7 visualizes the test set prediction results of MCapsNet for one day on the five road segments in the Wenyi Road full-time dataset. The blue and yellow lines represent the real traffic flow curve and the prediction curve output by MCapsNet for the same time period. Figure 7a–d correspond to secondary roads, where the traffic flow is small and fluctuates little, which makes it easy for the model to learn and predict, whereas Wenyi Road, shown in Figure 7e, is the main road, with a large traffic flow and frequent changes.
For Wenyi Road, learning and prediction are therefore more difficult, and the error between the two curves is larger than for the other four roads. However, the trend of the prediction curve is consistent with the real curve, and given the complexity of real traffic conditions on Wenyi Road, we believe that the error range of the model is acceptable. Table 4 compares the prediction results of the models on the public SZ-taxi dataset, where traffic speed is the predicted quantity. This experiment tests the proposed models under a different prediction scenario. Both MCapsNet and CapsNet (CA) achieve lower RMSE than the other models, and MCapsNet also achieves the lowest MAE. Compared with CapsNet, the two models, which use channel attention and, for MCapsNet, the depthwise separable convolutional block, each improve on the previous model to a different degree. The experimental results on the SZ-taxi dataset support the robustness of the improved models in this paper.

5. Conclusions

This paper proposed an improved traffic flow prediction model based on CapsNet. CapsNet first extracts features from traffic data through its front-end CNN and then vectorizes them into capsules to model the traffic road network space. The front end of CapsNet has only two convolution layers, which is insufficient for deep feature extraction. To keep the model lightweight, a depthwise separable convolutional block is introduced to enrich the number of feature channels, and channel attention is used to selectively learn the expanded channel features. The depthwise separable convolutional block and channel attention are connected into the residual block of the network through the modified linear bottleneck layer, which avoids a large loss of information when the features are restored to the low-dimensional space. In the experiment, we introduced the improved channel attention into the residual blocks of three layers of CapsNet (the first convolution layer, the second convolution layer, and the PrimaryCaps layer) to explore the impact of the attention embedding position on prediction performance. The best prediction performance is achieved when the improved channel attention is introduced into the second convolution layer. For Wenyi Road, the full-period RMSE is 30.70 and the MAE is 16.55, the peak-period RMSE is 40.93 and the MAE is 22.75, and the off-peak RMSE is 37.54 and the MAE is 20.52. We also changed the ReLU6 activation function in the linear bottleneck layer to ReLU; in our experimental comparison, the activation function without an upper limit gave better prediction results, which suggests that it improves the expressive power of the model in prediction.
In the experiments on two different datasets, MCapsNet was compared with other prediction models. SVR, ARIMA, and GRU are single-module models that consider only temporal characteristics; their prediction performance was unstable, which is not conducive to traffic flow prediction across different time periods. GCN and TGCN are prediction models based on graph convolution that can model the space of the traffic road network. Compared with them, the error indexes of MCapsNet were also reduced, which suggests that the spatial modeling ability of vector neurons in MCapsNet is better than that of a graph convolution network. At the same time, compared with the CapsNet and CapsNet (CA) models, the prediction performance of MCapsNet also improved to different degrees. The effectiveness of the proposed method was verified by experiments.

The MCapsNet model proposed in this paper uses a depthwise separable convolutional block to achieve deeper feature extraction from the input, thus improving the expressive ability of the model. Meanwhile, to ensure that the deeper model is trained effectively, shortcut connections are used in the network, and the depthwise separable convolutional block and channel attention are connected by the improved linear bottleneck layer structure to avoid feature channel collapse and information redundancy. In addition, we verified experimentally that replacing ReLU6, the activation function used in the linear bottleneck layer for the original object recognition task, with ReLU gives better prediction performance. We therefore believe that, in the prediction task, an activation function without an upper limit can improve the expressiveness of the model and thus its prediction performance. In future work, we will focus on improving the running speed of the model and its application value.

Author Contributions

Conceptualization, B.Q. and Y.Z.; methodology, B.Q. and Y.Z.; software, B.Q. and Y.Z.; validation, Y.Z.; formal analysis, B.Q.; investigation, B.Q.; resources, B.Q.; data curation, B.Q.; writing (original draft preparation), B.Q.; writing (review and editing), Y.Z. and B.Q.; visualization, B.Q.; supervision, Y.Z.; project administration, B.Q.; and funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant/award number: 2019YFE0126100; Science and Technology project of Zhejiang Province, grant/award number: 2019C54005.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Spyropoulou, I. Impact of public transport strikes on the road network: The case of Athens. Transp. Res. Part A Policy Pr. 2020, 132, 651–665.
2. Ma, J.; Xu, M.; Meng, Q.; Cheng, L. Ridesharing user equilibrium problem under OD-based surge pricing strategy. Transp. Res. Part B Methodol. 2020, 134, 1–24.
3. Sun, J.; Wu, J.; Xiao, F.; Tian, Y.; Xu, X. Managing bottleneck congestion with incentives. Transp. Res. Part B Methodol. 2020, 134, 143–166.
4. Nallaperuma, D.; Nawaratne, R.; Bandaragoda, T.; Adikari, A.; Nguyen, S.; Kempitiya, T.; De Silva, D.; Alahakoon, D.; Pothuhera, D. Online Incremental Machine Learning Platform for Big Data-Driven Smart Traffic Management. IEEE Trans. Intell. Transp. Syst. 2019, 20, 4679–4690.
5. Guo, J.; Huang, W.; Williams, B.M. Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification. Transp. Res. Part C Emerg. Technol. 2014, 43, 50–64.
6. Kim, Y.; Wang, P.; Zhu, Y.; Mihaylova, L. A capsule network for traffic speed prediction in complex road networks. In Proceedings of the 2018 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 9–11 October 2018; pp. 1–6.
7. Williams, B.M.; Hoel, L.A. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. J. Transp. Eng. 2003, 129, 664–672.
8. Lu, Z.; Zhou, C.; Wu, J.; Jiang, H.; Cui, S. Integrating granger causality and vector auto-regression for traffic prediction of large-scale WLANs. KSII Trans. Internet Inf. Syst. 2016, 10, 136–151.
9. van Lint, J.W.C.; van Hinsbergen, C. Short-term traffic and travel time prediction models. Artif. Intell. Appl. Crit. Transp. Issues 2012, 22, 22–41.
10. Wu, C.H.; Ho, J.M.; Lee, D.T. Travel-time prediction with support vector regression. IEEE Trans. Intell. Transp. Syst. 2004, 5, 276–281.
11. Zhuo, Q.; Li, Q.; Yan, H.; Qi, Y. Long short-term memory neural network for network traffic prediction. In Proceedings of the 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Nanjing, China, 24–26 November 2017; pp. 1–6.
12. Fu, R.; Zhang, Z.; Li, L. Using LSTM and GRU neural network methods for traffic flow prediction. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), Wuhan, China, 11–13 November 2016; pp. 324–328.
13. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
14. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858.
15. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875.
16. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. Proc. AAAI Conf. Artif. Intell. 2020, 34, 914–921.
17. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
18. Vashishth, S.; Upadhyay, S.; Tomar, G.S.; Faruqui, M. Attention interpretability across nlp tasks. arXiv 2019, arXiv:1909.11218.
19. Hu, D. An introductory survey on attention mechanisms in NLP problems. In Proceedings of SAI Intelligent Systems Conference; Springer: Cham, Switzerland, 2019; pp. 432–448.
20. Athier, G.; Floquet, P.; Pibouleau, L.; Domenech, S. Synthesis of heat-exchanger network by simulated annealing and NLP procedures. AIChE J. 1997, 43, 3007–3020.
21. Skantze, G.; Gustafson, J. Attention and interaction control in a human-human-computer dialogue setting. In Proceedings of the SIGDIAL 2009 conference; Association for Computational Linguistics: London, UK, 2009; pp. 310–313.
22. Chavis, M.H. Attention Deficit Hyperactivity Disorder: A Clinical Interventive Approach with Case Vignettes. Dissertations & Theses-Gradworks, Drew University, Madison, NJ, USA, 2008.
23. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. Proc. AAAI Conf. Artif. Intell. 2019, 33, 922–929.
24. Guo, S.; Lin, Y.; Wan, H.; Li, X.; Cong, G. Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting. IEEE Trans. Knowl. Data Eng. 2021, 34, 5415–5428.
25. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
26. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. arXiv 2018, arXiv:1709.01507.
28. Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. A Simple and Light-Weight Attention Module for Convolutional Neural Networks. Int. J. Comput. Vis. 2020, 128, 783–798.

Figure 1. Structure of CapsNet network.

Figure 2. Residual block and inverted residual block. (a) Residual block; (b) inverted residual block.

Figure 3. Improved channel attention DCA module and DS convolutional layer structure.

Figure 4. Comparison of CapsNet structure before and after introducing improved channel attention. (a) CapsNet structure; (b) MCapsNet structure; (c) Structure of improved channel attention DCA module in MCapsNet.

Figure 5. Traffic channelization diagram of four intersections in Wenyi Road. (A) Gudun Road; (B) Jingzhou Road; (C) Fengtan Road; (D) Yile Road.

Figure 6. Visualization of traffic flow in Wenyi Road dataset. (a) Visualization of traffic flow on four adjacent roads within a month; (b) Visualization of traffic flow in Wenyi Road within a week.

Figure 7. Visualization of prediction results of MCapsNet on the full-time dataset. (a) Gudun Road; (b) Jingzhou Road; (c) Fengtan Road; (d) Yile Road; (e) Wenyi Road.

Table 1. Test results of channel attention embedding in CapsNet.

Convolution 1	Convolution 2	PrimaryCaps	RMSE	MAE
√			32.96	16.81
	√		30.70	16.55
		√	33.13	16.94
√	√		32.65	17.97
√		√	32.66	17.19
	√	√	32.95	16.82
√	√	√	33.29	18.13

Table 2. Test results of different activation functions in linear bottleneck layer.

Model	RMSE	MAE
MCapsNet(ReLU)	30.70	16.55
MCapsNet(ReLU6)	32.85	16.81

Table 3. Prediction results of the model in the complete period, peak period, and off-peak period.

Model	Full-Time RMSE	Full-Time MAE	Peak RMSE	Peak MAE	Off-Peak RMSE	Off-Peak MAE
SVR [10]	35.50	17.90	44.84	24.14	41.81	22.12
ARIMA [7]	48.18	41.18	37.12	29.09	33.01	26.29
GCN [13]	38.99	22.64	49.45	30.16	42.95	25.33
GRU [12]	35.40	18.22	43.28	23.36	41.69	22.10
TGCN [14]	36.05	19.16	44.64	24.79	37.53	18.31
CapsNet [6]	34.30	17.71	42.93	24.00	41.61	22.91
CapsNet(CA)	32.76	16.98	41.68	23.53	37.90	21.34
MCapsNet	30.70	16.55	40.93	22.75	37.54	20.52

Table 4. Prediction results of the model in SZ-taxi dataset.

Model	RMSE	MAE
SVR [10]	4.12	2.60
ARIMA [7]	5.25	3.87
GCN [13]	5.97	4.41
GRU [12]	4.17	2.76
TGCN [14]	4.16	2.78
CapsNet [6]	4.12	2.76
CapsNet (CA)	3.98	2.69
MCapsNet	3.87	2.56

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
