Traffic prediction remains a key functional component and a research focus of ITS. Various traffic prediction methods have been proposed and have improved prediction accuracy in recent years. Deep learning approaches have been proposed for traffic prediction because they effectively extract the features of traffic data. CNN-based methods can improve predictive accuracy by converting traffic conditions to images and effectively extracting features from the images. Despite the success of CNN-based methods, such methods are clearly limited because they do not directly extract the features across multiple links and do not use interval times as an aspect.
2.1. Traffic Prediction Methods
Various methods have been proposed for traffic prediction in the past years. The target functions of these methods relate the explanatory variables to the target variable and are usually implemented with statistical and machine learning techniques [4]. According to their implementation techniques, these methods can be grouped into two categories: statistical methods and machine learning methods. Time series models are the most typical statistical methods and usually include the autoregressive (AR), moving average (MA), autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), seasonal autoregressive integrated moving average (SARIMA) and Kalman filtering methods. The machine learning methods for traffic prediction usually include linear regression (LR), k-nearest neighbor regression (k-NNR), support vector regression (SVR), SSNN, LSTM NN, AES and CNN.
The ARIMA(p, d, q) model combines the AR and MA models, where the parameters p, d and q are non-negative integers: p is the order (number of time lags) of the autoregressive part, d is the degree of differencing (the number of times the data have had past values subtracted) and q is the order of the moving-average part. The ARIMA model is a generalization of the ARMA model; the I in ARIMA denotes differencing of the observed values to make the time series stationary. In [16], the ARIMA(0, 1, 3) model was proposed for short-term traffic prediction. Based on 166 data sets from three surveillance systems deployed on a freeway in Los Angeles, the experimental results show that the proposed model achieves better performance than the MA model. In [17], an ARIMA(0, 1, 2) model was proposed for traffic prediction. Based on a dataset from five major urban arterials, the experimental results show that the proposed model is effective in reproducing the original time series.
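The differencing step denoted by d can be illustrated in a few lines of code; the flow values below are hypothetical, chosen so that a single difference removes a linear trend:

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to a time series."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A toy flow series with a linear trend (hypothetical values):
flows = [100, 105, 110, 115, 120]
# One round of differencing removes the trend, leaving a constant series.
print(difference(flows, d=1))  # [5, 5, 5, 5]
```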
The SARIMA model is the ARIMA model with a seasonal component. The Wold decomposition theorem states that any stationary time series can be decomposed into a deterministic series and a stochastic series. Based on this theorem, it is hypothesized in [18] that a weekly seasonal difference of traffic data can yield a weakly stationary transformation, so that univariate traffic data streams can be modeled as a SARIMA process. Experiments on datasets from two freeways validated this theoretical hypothesis.
In [19], a prediction scheme based on the Kalman filtering technique was proposed for traffic flow prediction. The method uses historical data (the previous two days' flow data), and real-time data on the day of interest were also attempted. Promising results were obtained, with a mean absolute percentage error (MAPE) of 10 between observed and predicted flows.
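A minimal one-dimensional sketch of Kalman filtering with a random-walk state model, together with the MAPE metric, is given below. This is an illustration of the technique, not the exact scheme of [19]; the noise variances and flow values are hypothetical:

```python
def kalman_1d(observations, q=1.0, r=4.0, x0=0.0, p0=1.0):
    """1-D Kalman filter with a random-walk state model.
    q: process noise variance, r: measurement noise variance."""
    x, p, estimates = x0, p0, []
    for z in observations:
        # Predict: the state carries over; uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement via the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

def mape(actual, predicted):
    """Mean absolute percentage error between two sequences."""
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)
```

Feeding a constant flow level to the filter shows the estimate converging toward the observations as more measurements arrive.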
Researchers in the traffic prediction field have paid much attention to machine learning methods in recent years because of their ability to extract features from traffic data.
In [20], a linear relation (LR) between the predicted travel-time and two naive predictors, the current-status travel-time and the historical-mean travel-time, was proposed for travel-time prediction. Based on a dataset from 116 single-loop detectors, the proposed LR method outperforms the principal components method and the k-NNR method.
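A least-squares fit of such a linear relation between the two naive predictors and the target can be sketched by solving the 2x2 normal equations directly. The travel-time values below, and the underlying weights they encode, are hypothetical:

```python
def fit_two_predictor_lr(cur, hist, y):
    """Least-squares fit of y ~ w1*cur + w2*hist (no intercept),
    solving the 2x2 normal equations in closed form."""
    a11 = sum(c * c for c in cur)
    a12 = sum(c * h for c, h in zip(cur, hist))
    a22 = sum(h * h for h in hist)
    b1 = sum(c * t for c, t in zip(cur, y))
    b2 = sum(h * t for h, t in zip(hist, y))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

# Hypothetical travel-times generated as y = 0.7*cur + 0.3*hist:
cur, hist = [10.0, 20.0, 30.0], [15.0, 15.0, 35.0]
y = [0.7 * c + 0.3 * h for c, h in zip(cur, hist)]
print(fit_two_predictor_lr(cur, hist, y))  # close to (0.7, 0.3)
```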
The accuracy of k-NNR may be improved by using a larger dataset [21]. K-NNR methods have some advantages over the SARIMA model [22] and can avoid situations where the current time series data lead to inefficient predictions [23].
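The core of k-NNR for traffic prediction can be sketched as follows: find the k historical traffic states closest to the current state and average the values that followed them. The states and values below are hypothetical:

```python
def knn_regress(history, query, k=3):
    """Predict the next value by averaging the successors of the k
    historical states closest to the query (squared Euclidean distance).
    history: list of (state_vector, next_value) pairs."""
    ranked = sorted(history,
                    key=lambda sv: sum((a - b) ** 2
                                       for a, b in zip(sv[0], query)))
    return sum(v for _, v in ranked[:k]) / k

# Hypothetical (state, next-flow) pairs; the query matches the first two.
history = [((1.0, 1.0), 10.0), ((1.1, 1.0), 12.0),
           ((5.0, 5.0), 50.0), ((5.0, 5.1), 52.0)]
print(knn_regress(history, (1.0, 1.0), k=2))  # 11.0
```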
Support vector regression does not depend on the dimensionality of the input vector space and can nonlinearly map input vectors to a high-dimensional feature space in which a linear decision surface is constructed. Thus, SVR has advantages in high-dimensional spaces [24]. In [25], an SVR method with a radial basis function (RBF) kernel was proposed for traffic prediction and achieves better performance than the current-time predictor and the historical-mean predictor on a highway traffic dataset. In [26], an online version of SVR was proposed for short-term travel-time prediction and achieves better performance than the Gaussian maximum likelihood, Holt exponential smoothing and artificial neural network methods. In [27], an incremental SVR method was proposed for traffic flow prediction and achieves better performance than the back-propagation neural network.
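The RBF kernel that realizes this implicit high-dimensional mapping can be written in a few lines; in kernel SVR, a prediction is a weighted sum of such kernel evaluations between the input and the support vectors, so the input dimension enters only through the distance computation:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2): the inner product
    of x and z after an implicit map to a high-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical inputs have maximal similarity; distant inputs decay toward 0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
```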
Various neural networks have been proposed for traffic prediction because of their ability to extract features from traffic data. The advantages and disadvantages of individual deep learning methods are shown in Table 1. In [28], the Elman recurrent neural networks [29] were referred to as SSNN for travel-time prediction based on the traffic state-space formulation. In [6], the AES method was proposed for traffic flow prediction and was trained in a greedy layerwise fashion. Based on data from the freeway system in California, the experimental results show that the AES method outperforms the random walk (RW), SVR, RBF network and back-propagation neural network for traffic flow prediction. In [5], LSTM NN was proposed for traffic speed prediction. Based on a dataset from an expressway without signal controls, LSTM NN outperforms the ARIMA(2, 2, 2) model, SVM, Kalman filter [3], Elman NN, time-delay neural network (TDNN) [30] and nonlinear autoregressive with exogenous inputs neural network (NARX) [31]. In [7], the deep learning approach combines a sequence of tanh layers and a linear layer to capture the impacts of breakdowns, recoveries or congestion on traffic flow prediction. Based on a dataset from twenty-one loop detectors, the experimental results show that the deep learning method is effective for traffic flow prediction.
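The forward pass of such an architecture, a sequence of tanh layers followed by a linear output layer, can be sketched as below. The layer sizes and weights here are hypothetical placeholders, not the trained parameters of [7]:

```python
import math

def dense(x, weights, biases, activation=None):
    """One fully connected layer: y_j = act(sum_i x_i * W[i][j] + b_j)."""
    y = [sum(xi * w_row[j] for xi, w_row in zip(x, weights)) + biases[j]
         for j in range(len(biases))]
    return [math.tanh(v) for v in y] if activation == "tanh" else y

def forward(x, layers):
    """Apply tanh layers for every layer except the last, which is linear.
    layers: list of (weight_matrix, bias_vector) pairs, all hypothetical."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b, activation="tanh" if i < len(layers) - 1 else None)
    return x

# Toy two-layer network: one tanh hidden unit, one linear output unit.
layers = [([[0.0]], [0.0]),   # hidden tanh layer (zero weights/bias)
          ([[1.0]], [0.5])]   # linear output layer
print(forward([0.0], layers))  # [0.5]
```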
In [8,9,10], CNN-based methods were proposed for traffic prediction by converting traffic data to images and extracting the adjacent relationships in the images. In [8], a CNN-based method was proposed for traffic speed prediction. In [9], a fusion of CNN and LSTM was proposed for passenger demand prediction. In [10], a CNN-based method with an error-feedback RNN was proposed for traffic speed prediction. In [12], a global-level representation was proposed to directly capture the relationships within a single link.
While various traffic prediction methods have improved the accuracy of traffic prediction, these methods have been evaluated on different datasets, so it is difficult to say that one method is clearly superior to the others under all traffic conditions [6].
2.2. Convolutional Neural Network
To improve the learning ability of neural networks, an interesting scheme is to rely on the topology of the input data [11]. CNN was proposed to implement this scheme by combining three architectural ideas: local receptive fields, shared weights and spatial or temporal subsampling. By applying feature mapping and weight sharing, the ability of neural networks to recognize handwritten zip codes was enhanced [34]. In [11], LeNet-5 was proposed for document recognition and achieves better performance than the baseline methods.
These three architectural ideas can be seen as a way for CNN to integrate the constraints of the task domain into its architecture. The idea of building task-domain constraints into the proposed method has also been widely applied in many other fields. In [33], the convolutional LSTM extends the fully connected LSTM to have convolutional structures in both the input-to-state and state-to-state transitions. The experimental results show that the convolutional LSTM outperforms the fully connected LSTM for precipitation prediction. In [32], a convolutional architecture with piecewise max pooling utilizes all local features to perform relation extraction globally. The experimental results show that the proposed method achieves better performance than the baseline methods for relation extraction. In [12], a global-level representation was proposed to directly capture the relationships within a single link. In this paper, we use the global-level representation to integrate the constraints across multiple links.
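Local receptive fields and shared weights can be illustrated with a one-dimensional convolution over a sequence of link speeds (the speed values and the kernel are hypothetical):

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation form): the same kernel
    weights are shared across every local receptive field of the input."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A difference kernel highlights where adjacent speeds change.
speeds = [1.0, 2.0, 4.0, 4.0]
print(conv1d(speeds, [1.0, -1.0]))  # [-1.0, -2.0, 0.0]
```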
2.3. Attention Mechanism
The attention mechanism has recently succeeded in many tasks, including image classification [13], neural machine translation [14] and multimedia recommendation [15], because it can focus on the effective parts of the features when different aspects are considered. The attention mechanism is usually implemented with a neural network designed for the corresponding task. In [13], an attention mechanism was proposed to address the problem of the enormous computation cost of CNN. In [14], the attention mechanism selectively focuses on the effective parts of the input sentences during translation to improve the accuracy of machine translation. In [15], a two-layer attention mechanism was proposed to adaptively extract implicit feedback. In this paper, the attention mechanism is proposed to use interval times as the aspect for traffic prediction.
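A minimal sketch of aspect-based attention: an aspect vector (for instance, one encoding an interval time) scores each feature vector by dot product, the scores are normalized with a softmax, and the output is the weighted sum of the features. All vectors below are hypothetical:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aspect_attention(features, aspect):
    """Weight each feature vector by its softmax-normalized similarity
    to the aspect vector, then return the weighted sum of the features."""
    scores = [sum(f * a for f, a in zip(feat, aspect)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[j] for w, feat in zip(weights, features))
            for j in range(dim)]
```

With a strongly aligned aspect, the output is dominated by the matching feature vector, which is the "focus on the effective parts" behavior described above.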
In [35], the convolutional block attention module was proposed for object detection. Given an intermediate feature map, the proposed module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement. The experimental results show that the proposed method outperforms all the baselines. The attention mechanism in our work is similar to the channel attention in [35].
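A simplified sketch of channel attention: each channel is globally average-pooled into a single descriptor, a gate in (0, 1) is computed from the descriptor with a sigmoid, and the channel is rescaled by its gate. The per-channel weight here is a hypothetical stand-in for the shared MLP used in [35]:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_map, w):
    """Simplified channel attention: pool each channel, gate it through a
    sigmoid of a (hypothetical) per-channel weight, and rescale.
    feature_map: list of channels, each a flat list of spatial values."""
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    gates = [sigmoid(wi * p) for wi, p in zip(w, pooled)]
    return [[g * v for v in ch] for g, ch in zip(gates, feature_map)]
```

Channels with strong pooled responses are passed through nearly unchanged, while weak channels are attenuated, which is the adaptive feature refinement effect described above.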