*Article* **Global Spatial-Temporal Graph Convolutional Network for Urban Traffic Speed Prediction**

#### **Liang Ge 1,2,\*, Siyu Li 1,2, Yaqian Wang 1,2, Feng Chang 1,2 and Kunyan Wu 1,2**


Received: 13 January 2020; Accepted: 19 February 2020; Published: 22 February 2020

**Abstract:** Traffic speed prediction plays a significant role in the intelligent transportation system (ITS). However, due to the complex spatial-temporal correlations of traffic data, it is very challenging to predict traffic speed in a timely and accurate manner. Traffic speed exhibits not only short-term neighboring and multiple long-term periodic dependencies in the temporal dimension but also local and global dependencies in the spatial dimension. To address this problem, we propose a novel deep-learning-based model, Global Spatial-Temporal Graph Convolutional Network (GSTGCN), for urban traffic speed prediction. The model consists of three spatial-temporal components with the same structure and an external component. The three spatial-temporal components model the recent, daily-periodic, and weekly-periodic spatial-temporal correlations of the traffic data, respectively. More specifically, each spatial-temporal component consists of a dynamic temporal module and a global correlated spatial module. The former contains multiple residual blocks built from stacked dilated causal convolutions, while the latter contains a localized graph convolution and a global correlated mechanism. The external component extracts the effect of external factors, such as holidays and weather conditions, on traffic speed. Experimental results on two real-world traffic datasets demonstrate that the proposed GSTGCN outperforms the state-of-the-art baselines.

**Keywords:** spatial-temporal dependencies; traffic periodicity; graph convolutional network; traffic speed prediction

#### **1. Introduction**

Traffic speed prediction is an important part of the Intelligent Transportation System (ITS). Accurate and timely traffic prediction can assist in real-time dynamic traffic light control [1] and urban road planning, which will help alleviate the huge congestion problem as well as improve the safety and convenience of public transportation. Besides, traffic control in advance can prevent traffic paralysis, pedaling, and other events. Traffic speed prediction aims to predict future traffic speed based on a series of historical traffic speed observations. The three key complex factors affecting traffic speed are as follows:

Factor 1: Global Spatial Dependencies. As shown in Figure 1, given the road network and sensors, the spatial correlations over different nodes on the traffic network are both local and global. Take Sensor 1 for example; the traffic status of its adjacent sensors (see Sensors 2 and 3) can influence that of Sensor 1. These are localized spatial correlations between sensors. In addition, the sensors (see Sensor 4) far from Sensor 1 can indirectly affect the traffic status of Sensor 1. Thus, all other sensors on the road network have impacts on Sensor 1. These are global spatial correlations between sensors.

Factor 2: Multiple Temporal Dependencies. Historical traffic conditions at different timestamps in the same location have different effects on the status at a future timestamp. As shown by Sensor 1 in Figure 1, the traffic status at time *t* + 1 is more related to that of time *t* − *l* + 1 than to that of time *t*. In addition, Figure 2a shows that the trend of traffic speed over time is highly similar across different workdays. Moreover, Figure 2b shows that the trend of traffic speed on the same workday in different weeks is similar as well, which indicates that traffic speed exhibits both short-term neighboring and multiple long-term periodic dependencies. Thus, we consider the recent, daily, and weekly periodic patterns simultaneously for traffic speed prediction.

Factor 3: External Factors. Traffic speed is significantly affected by external factors such as weather conditions, holidays, other special events, and so on. According to Figure 3a, it is clearly shown that the traffic speed on holidays is different from that on normal days. In addition, it can be seen in Figure 3b that the traffic speed of a heavily rainy day is much lower than that of a sunny day.

In addition to the above-mentioned key factors affecting traffic speed, there is uncertainty and inconsistency in the traffic data that sensors collect, due to sensor failures, sensor maintenance, and other reasons. Several studies [2,3] have focused on evaluating and improving the reliability of sensors. To address this problem, in this paper, we also handle the outliers and missing values in the traffic data.

**Figure 1.** The topological structure of the road network and complex spatial-temporal correlations between sensors.

**Figure 2.** Multiple temporal dependencies of the traffic speed for PeMSD7. (PeMSD7 is a dataset containing traffic information from the sensors on the highways of Los Angeles County.)

Studies on traffic prediction have never stopped in the past few decades. Early statistical methods [4,5] and traditional machine learning methods [6–8] for traffic prediction cannot model the non-linear temporal correlations of traffic data effectively, and they hardly consider spatial dependencies. In recent years, with the continuous development of deep learning, many researchers have applied deep-learning-based methods to the traffic domain. Some studies [9–11] combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for traffic prediction, where CNNs are used to capture the spatial dependencies while RNNs are used to extract the temporal correlations of traffic data.

**Figure 3.** Effects of holidays and weather in San Francisco Bay Area.

The main limitation of the aforementioned methods is that conventional convolution operations only capture the spatial characteristics of regular grid structures. They are not suitable for data with irregular topologies. To tackle this problem, graph convolutional networks (GCNs), which can effectively handle non-Euclidean relations, are integrated with RNNs [12] or CNNs [13] to embed prior knowledge of the road network and capture the correlations between sensors. The graph convolutional network here represents the road network structure as a fixed weighted graph. Wu et al. [14] integrated WaveNet [15] into the GCN to capture the dynamic spatial-temporal correlations of traffic data, while using an adaptive adjacency matrix to obtain hidden spatial dependencies in the road network. However, there are still some limitations in these methods: (i) RNN-based models are challenging to train well [16] due to vanishing or exploding gradients, and the receptive field of RNNs is limited; (ii) many existing methods only consider localized spatial correlations but ignore non-local ones; and (iii) they do not utilize more complicated traffic-related features such as periodicity, repeating patterns, and external factors.

To capture the dynamic complex spatial-temporal correlations more effectively, we propose a novel global spatial-temporal graph convolutional network called GSTGCN to predict urban traffic speed, which consists of three independent spatial-temporal components with the same structure and one external component. Each spatial-temporal component contains a dynamic temporal module and a global correlated spatial module. The main contributions of this paper are as follows:


#### **2. Related Work**

Over the past few decades, traffic prediction has been extensively studied. Early statistical methods for traffic prediction were simple time series models, including Autoregressive Integrated Moving Average (ARIMA) [4] and its variant [17], vector autoregression (VAR) [5], etc. These methods rely on the assumption that the data are stationary and thus have limited ability to model complex traffic data. Later, models based on traditional machine learning methods, such as k-nearest neighbors (KNN) [6], support vector regression (SVR) [7], and Kalman filtering (KF) [8], were applied to traffic prediction to model more complex data. However, these methods cannot capture non-linearity in traffic data effectively, and barely utilize spatial correlations [18]. Moreover, they need more detailed feature engineering.

Recently, methods based on deep learning have been applied successfully in many fields, which has inspired traffic prediction studies to use deep-learning-based methods to model the complex spatial-temporal dependencies of the traffic data [19]. Lv et al. [20] utilized a stacked autoencoder (SAE) to predict the traffic status of different nodes. Luo et al. [21] integrated KNN and LSTM [22] to predict traffic flow. Yu et al. [23] combined LSTM networks with SAE to predict traffic status in extreme conditions. Cui et al. [24] proposed an LSTM-based network composed of bidirectional and unidirectional LSTMs for traffic prediction. In addition, Zhang et al. [25] transformed the road network into a regular 2D grid and used a convolutional neural network to predict citywide crowd flows. Liu et al. [26] used fully-connected neural networks and an improved residual network to predict bus traffic flow. Later, the authors of [9–11] combined convolutional neural networks (CNNs) with recurrent neural networks (RNNs) and their variants for traffic forecasting. However, the main limitation of the above models is that conventional convolution operations can only capture the spatial characteristics of regular grid structures but do not work for data points with irregular topologies. Therefore, they fail to make effective use of the topological structure of the traffic network to capture complex spatial correlations.

To extract the spatial correlations of traffic data with complex topologies, extending neural networks to process graph-structured data has attracted widespread attention [27]. A series of studies has extended traditional convolution to model arbitrary graphs in the spectral [28–30] or spatial [31–33] domain. Spectral-based methods use a graph spectral filter to smooth the input signals of nodes. Spatial-based methods extract high-level representations of nodes by gathering feature information of neighbors. Other studies focus on graph embedding, which learns low-dimensional representations for vertices that preserve the graph-structured information [34,35]. To overcome the limitation of conventional convolution and capture more complex spatial-temporal dependencies, Li et al. [12] proposed a framework that combines diffusion convolution with a recurrent neural network to forecast traffic conditions. Fang et al. [36] proposed the Global Spatial-Temporal Network (GSTNet) for traffic flow prediction. GSTNet employs tensor causal convolution and a global correlated mechanism to extract dynamic temporal dependencies and global spatial correlations. Yu et al. [13] proposed the Spatio-Temporal GCN (STGCN), which uses a fully convolutional structure combining graph convolution with 1D convolution. In STGCN, the graph convolution is used to obtain the spatial correlations, and the 1D convolution is used to extract the temporal dependencies. STGCN is much more computationally efficient than the above-mentioned models using RNNs. Afterward, ST-MetaNet [37] utilized a sequence-to-sequence structure and combined the graph attention network (GAT) with the recurrent neural network (RNN) to capture the spatial-temporal correlations. Wu et al. [14] integrated WaveNet [15] into the GCN to extract the dynamic temporal dependencies of traffic data, while using an adaptive adjacency matrix to obtain hidden spatial dependencies in the road network. This self-adaptive adjacency matrix is constructed from the similarity of different node embeddings on the road network. However, the learned spatial dependencies between nodes lack the guidance of domain knowledge, and the model is prone to overfitting during the training phase [18]. In addition, most existing traffic speed prediction methods ignore global spatial correlations between different nodes in the road network, and they hardly utilize multiple temporal correlations and external factors.

#### **3. Materials and Methods**

#### *3.1. Problem Description*

The task of traffic speed prediction is to predict the future traffic speed based on the given historical traffic measurements (such as traffic speed, traffic flow, etc.) of observed sensors in the road network. We first define the road network as a weighted undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is a set of $|\mathcal{V}| = N$ nodes, representing observed sensors in the road network; $\mathcal{E}$ is a set of edges, indicating connectivity of nodes; and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix, which represents the proximity between nodes and can be computed from the distance in the road network. Then, the traffic data observed at time $t$ on $\mathcal{G}$ are denoted as a graph signal $X_t \in \mathbb{R}^{N \times F}$, where $F$ is the number of features observed at each node. The goal of traffic speed prediction is to learn a function $f$ that predicts the future $T$ graph signals based on graph $\mathcal{G}$ and $T'$ historical graph signals:

$$X_{t+1}, \dots, X_{t+T} = f_{\theta}(X_{t-T'+1}, \dots, X_t; \mathcal{G}) \tag{1}$$

where *θ* stands for the learnable parameters.

#### *3.2. The Architecture of Our Designed Network*

**Overview:** As presented in Figure 4, the GSTGCN model proposed in this paper contains three independent spatial-temporal components with the same structure and an external component. The three spatial-temporal components are designed to model the recent, daily-periodic, and weekly-periodic dependencies of the historical speed data, respectively, and the external component extracts the characteristics of external factors, such as weather condition, time of the day, and day of the week, to model external impacts on traffic speed. Each spatial-temporal component is composed of a temporal module with multiple stacked residual blocks and a global correlated spatial module. The global correlated spatial module models both the localized spatial dependencies and the global spatial correlations of traffic data. We first construct an adjacency matrix based on the points of interest (POI) data around the sensors and the related features of the road segments. Then, we extract three time series segments $\mathcal{X}_h$, $\mathcal{X}_d$, and $\mathcal{X}_w$ from the traffic data along the time axis as inputs to the three spatial-temporal components. Next, we extract the characteristics of external factors and feed them into the external component. After that, the outputs of the three spatial-temporal components $Y_h$, $Y_d$, and $Y_w$ are assigned different weights and then fused into the output $Y_{res}$. Then, we merge $Y_{res}$ with the output of the external component $Y_{ext}$ to generate the prediction result. Finally, we utilize a tanh function to map the result into [−1, 1] [38].

**Figure 4.** The architecture of GSTGCN.

**Adjacency Matrix Construction.** Previous studies have only used the distances between sensors to construct the adjacency matrix, which represents the topological structure of the road network. However, even if two sensors are geographically far apart, they may have similar traffic conditions when they are in similar functional areas. Therefore, we consider not only the distance between the sensors, but also the similarity of the regions in which they are located to construct the adjacency matrix. More specifically, we first use the Dijkstra algorithm to calculate the distances between pairs of sensors in the road network, where $dist(i, j)$ represents the distance between sensor $i$ and sensor $j$. Next, we use OpenStreetMap [39] to map each sensor to the corresponding road segment and collect the properties of the segment as the road-related features of the sensor, which include speed_limit, lanes, length, etc. Then, we obtain from FourSquare [40] the number of points of interest (POIs) in each of ten categories within 500 m of each sensor as POI-relevant features; the categories include travel and transport, food, arts and entertainment, residence, etc. Finally, we concatenate the road-related features and POI data to form a feature vector $E^r$, defined as:

$$E^r = (poi_1, \dots, poi_{10}, \text{lanes}, \text{speed\_limit}, \text{type}, \text{length}, \text{is\_bridge}, \text{oneway}) \tag{2}$$

where $poi_i$ is the number of points of interest of the $i$th category. The details of the road-related features are presented in Table 1. We then calculate the similarity of sensors $i$ and $j$ using the cosine similarity formula [41]:

$$sim(i,j) = \frac{\sum_{m=1}^{C} E_{i,m}^r \times E_{j,m}^r}{\|E_i^r\| \times \|E_j^r\|} \tag{3}$$

where $C$ is the length of the feature vector and $E_i^r$ represents the feature vector of sensor $i$. Finally, we use a threshold-based Gaussian kernel [42] to calculate the adjacency matrix; the formula is as follows:

$$W_{i,j} = \begin{cases} B_{i,j} = w_1 \exp\left(-\frac{dist(i,j)^2}{2\sigma_1^2}\right) + w_2 \exp\left(-\frac{(1 - sim(i,j))^2}{2\sigma_2^2}\right) & \text{if } B_{i,j} \ge \kappa \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\sigma_1$ is the standard deviation of distances; $\sigma_2$ is the standard deviation of similarities; $w_1 + w_2 = 1$; and $\kappa$ is the threshold.
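As a rough illustration of Equations (3) and (4), the following sketch builds a thresholded adjacency matrix for three toy sensors. The function names, the toy distances and feature vectors, and the default values of `w1` and `kappa` are assumptions made for this example only; they are not settings reported in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors (Equation (3))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_adjacency(dist, features, sigma1, sigma2, w1=0.5, kappa=0.5):
    """Thresholded Gaussian-kernel adjacency matrix following Equation (4).

    dist[i][j]  : road-network distance between sensors i and j
    features[i] : road/POI feature vector E^r_i of sensor i
    w1, kappa   : illustrative defaults (assumptions), not values from the paper
    """
    w2 = 1.0 - w1  # the text requires w1 + w2 = 1
    n = len(dist)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sim = cosine_similarity(features[i], features[j])
            b = (w1 * math.exp(-dist[i][j] ** 2 / (2 * sigma1 ** 2))
                 + w2 * math.exp(-(1 - sim) ** 2 / (2 * sigma2 ** 2)))
            W[i][j] = b if b >= kappa else 0.0  # entries below the threshold are dropped
    return W

# Toy example: sensors 0 and 1 are close and similar; sensor 2 is far and dissimilar.
dist = [[0.0, 1.0, 10.0],
        [1.0, 0.0, 10.0],
        [10.0, 10.0, 0.0]]
features = [[3.0, 1.0, 2.0],
            [3.0, 1.0, 2.0],
            [0.0, 5.0, 0.0]]
W = build_adjacency(dist, features, sigma1=2.0, sigma2=0.5)
```

In this toy setting the weight between sensors 0 and 1 stays close to 1, while the weights involving sensor 2 fall below the threshold and are set to zero.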

**Table 1.** Road-related features.


**Detailed Three Time Series Segments:** Suppose that the sampling frequency is $p$ times a day, the current time is $t_0$, and the length of the sequence to be predicted is $T_p$. As described in Figure 5, we extract three time series fragments of length $T_h$, $T_d$, and $T_w$ along the time axis as the inputs of the three spatial-temporal components, respectively. Here, $T_h$, $T_d$, and $T_w$ are all integer multiples of $T_p$. The details of the three time series fragments are as follows:

• **The recent segment:** $\mathcal{X}_h = (X_{t_0-T_h+1}, X_{t_0-T_h+2}, \dots, X_{t_0}) \in \mathbb{R}^{N \times F \times T_h}$. As shown by the red part in Figure 5, this segment is directly adjacent to the time series to be predicted. Since the traffic condition of the sensors gradually spreads to the vicinity over time, the adjacent historical time series have a great impact on it.


**Figure 5.** An example of extracting time series segments. $\mathcal{X}_h$, $\mathcal{X}_d$, and $\mathcal{X}_w$ correspond to the three time series fragments input into the model. $\mathcal{X}_p$ is the time series to be predicted and its length is $T_p$. The lengths of $\mathcal{X}_h$, $\mathcal{X}_d$, and $\mathcal{X}_w$ are $T_h$, $T_d$, and $T_w$. Here, $T_p$ is equal to $T_h$, and $T_d$ and $T_w$ are double $T_h$.
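The segment extraction can be sketched with plain list slicing. The recent segment follows the definition given above. The layout of the daily and weekly segments is not spelled out in this section, so `periodic_segment` below is an assumption: it takes, from each of the preceding periods, the $T_p$ steps aligned with the prediction window. The values $p = 288$ (5-min sampling), $T_h = T_p = 12$, and $T_d = T_w = 24$ are likewise illustrative, following one reading of the Figure 5 caption.

```python
def recent_segment(series, t0, T_h):
    """The T_h time steps ending at the current time t0 (inclusive)."""
    return series[t0 - T_h + 1 : t0 + 1]

def periodic_segment(series, t0, T_seg, T_p, period):
    """Assumed layout of a periodic segment (hypothetical helper).

    For each of the T_seg // T_p preceding periods, take the T_p steps that
    line up with the prediction window t0+1 .. t0+T_p, oldest period first.
    """
    n_periods = T_seg // T_p
    out = []
    for k in range(n_periods, 0, -1):
        start = t0 - k * period + 1
        out.extend(series[start : start + T_p])
    return out

# Toy series whose value at each position is its own time index.
p = 288                      # samples per day at 5-min resolution (assumption)
T_p, T_h, T_d, T_w = 12, 12, 24, 24
t0 = 4132
series = list(range(5000))

recent = recent_segment(series, t0, T_h)
daily = periodic_segment(series, t0, T_d, T_p, period=p)
weekly = periodic_segment(series, t0, T_w, T_p, period=7 * p)
```

In a real pipeline each element of `series` would be a graph signal $X_t$ rather than an integer; integers are used here only to make the selected indices easy to inspect.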

#### *3.3. Structures of the Three Spatial-Temporal Components*

Traffic conditions usually involve multiple temporal periodic patterns, and the traffic data exhibit strong daily and weekly periodicity. Taking multiple periodic temporal dependencies into account improves prediction performance [19]. The three spatial-temporal components model the recent, daily-periodic, and weekly-periodic spatial-temporal dependencies, respectively, with the same structure. Each component includes two sub-modules: a dynamic temporal module and a global correlated spatial module (see Figure 4).

#### 3.3.1. Dynamic Temporal Module

We propose a dynamic temporal module to extract the temporal dependencies of the traffic data. The dynamic temporal module is composed of multiple residual blocks containing stacked dilated causal convolutions [43]. It has a large receptive field so as to capture both short-term neighboring and long-term periodic temporal dependencies effectively.

**Dilated Causal Convolution:** Dilated causal convolution (DCC) is based on 1D convolution: it injects holes into the convolution kernel, pads the input sequence with zeros to keep its length unchanged, and slides over the input sequence while skipping a fixed step. As shown in Figure 6, for a 1D sequence $x \in \mathbb{R}^T$ and a convolution kernel $f \in \mathbb{R}^K$, where $x_t$ denotes the $t$th value of $x$, the $t$th value of the dilated causal convolution $F$ is as follows:

$$F_t(x) = (f *_d x)_t = \sum_{i=0}^{K-1} f(i) \cdot x_{t-d \cdot i} \tag{5}$$

where $d$ is the dilation factor, i.e., the distance skipped during the convolution process. Stacking multiple dilated causal convolutions with a progressively increasing dilation factor $d(l) = 2^{l-1}$ makes the model's receptive field grow exponentially, where $l$ is the layer index.
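A minimal single-channel sketch of Equation (5) is given below. Positions before the start of the sequence are treated as zeros, which corresponds to the left zero-padding mentioned above. The helper `receptive_field` is an illustrative addition that assumes one convolution per layer with dilation $2^{l-1}$ at layer $l$.

```python
def dilated_causal_conv(x, f, d):
    """Equation (5): y[t] = sum_i f[i] * x[t - d*i], with x[t] = 0 for t < 0."""
    K = len(f)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(K):
            idx = t - d * i
            if idx >= 0:  # implicit left zero-padding keeps len(y) == len(x)
                acc += f[i] * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked layers with dilations 1, 2, 4, ...

    Assumes a single convolution per layer (illustrative helper).
    """
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [1.0, 1.0]
print(dilated_causal_conv(x, f, d=1))  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
print(dilated_causal_conv(x, f, d=2))  # [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(receptive_field(kernel_size=2, num_layers=4))  # 16
```

Each output depends only on the current and earlier inputs, which is the causal property, and doubling the dilation at every layer doubles the span covered by the stack.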

**Figure 6.** Dilated causal convolution with kernel size 2. With dilation factor *d*, it performs a standard 1D convolution on the selected sequence that is picked from the input every *d* steps.

**Residual Block Architecture:** As shown in Figure 4, a residual block contains two stacked dilated causal convolutional layers and an identity map. The identity map is connected across layers, which mitigates the gradient explosion problem in deep networks. Weight normalization [44] is added after each dilated causal convolution layer to tackle the overfitting problem. For non-linearity, the rectified linear unit (ReLU) [45] keeps the model's convergence rate steady. The dilation factors of the two DCC layers in a residual block are the same. Given an input $X^l \in \mathbb{R}^{N \times T \times F}$, the result $X^{l+1} \in \mathbb{R}^{N \times T \times F}$ after passing through the $(l+1)$th residual block is:

$$X^{l+1} = X^l + \operatorname{ReLU}(\Phi_1 *_{d_1} (\operatorname{ReLU}(\Phi_0 *_{d_0} X^l))) \tag{6}$$

where $\Phi_0, \Phi_1 \in \mathbb{R}^{F \times F' \times K}$ are the convolution kernels of the two dilated causal convolutions in a residual block; $F$ and $F'$ represent the number of input features and output channels, respectively; $d_0$ and $d_1$ are the dilation factors; and $K$ is the length of the convolution kernel.
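The structure of Equation (6) can be sketched for a single node with a single feature, so that each kernel reduces to a list of $K$ scalars. This is a simplification for illustration: weight normalization and the channel dimensions are omitted, and the kernel values are made up.

```python
def dilated_causal_conv(x, f, d):
    """Single-channel dilated causal convolution with implicit left zero-padding."""
    return [sum(f[i] * x[t - d * i] for i in range(len(f)) if t - d * i >= 0)
            for t in range(len(x))]

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def residual_block(x, phi0, phi1, d):
    """Equation (6) with d0 = d1 = d: x + ReLU(phi1 *_d ReLU(phi0 *_d x))."""
    h = relu(dilated_causal_conv(x, phi0, d))
    h = relu(dilated_causal_conv(h, phi1, d))
    return [a + b for a, b in zip(x, h)]  # identity (skip) connection

x = [1.0, -1.0, 2.0, -2.0]
phi0 = [1.0, -1.0]   # illustrative kernel values (assumption)
phi1 = [0.5, 0.5]
out = residual_block(x, phi0, phi1, d=1)
print(out)  # [1.5, -0.5, 3.5, -0.5]

# Stacking blocks with dilations 1, 2, 4 enlarges the receptive field.
h = x
for d in (1, 2, 4):
    h = residual_block(h, phi0, phi1, d)
```

Because the skip connection adds the input back unchanged, the block output has the same length as its input and blocks can be stacked directly.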

Most previous models used RNN- and CNN-based methods to capture temporal dependencies, but they cannot handle very long sequences and are prone to exploding gradients. In contrast, residual blocks achieve a larger receptive field with fewer stacked dilated causal convolutional layers, and the residual connections also alleviate the vanishing and exploding gradient problems. Besides, this architecture can be computed in parallel with much less resource consumption.

#### 3.3.2. Global Correlated Spatial Module

This paper proposes a global correlated spatial module for capturing complex spatial correlations between nodes on the traffic network. The module contains a localized graph convolution and a global correlated mechanism with residual connection [16], where the former is used to extract local spatial correlations while the latter is used to capture the non-local spatial correlations.

**Localized Graph Convolution:** Since traditional convolutions fail to effectively extract the complicated spatial correlations between different nodes on the traffic network, the spectral graph theory extends the convolution to the graph-structured data. In spectral graph theory, the Laplacian matrix of a graph represents its topological structure. Therefore, we can study the properties of the graph by analyzing the eigenvalues and eigenvectors of the Laplacian matrix. The Laplacian matrix of a graph is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, and the normalized Laplacian matrix is $\mathbf{L} = \mathbf{I}_N - \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \in \mathbb{R}^{N \times N}$, where $\mathbf{I}_N$ is a unit matrix with $N$ dimensions, $\mathbf{A} \in \mathbb{R}^{N \times N}$ is the adjacency matrix, and $\mathbf{D} \in \mathbb{R}^{N \times N}$ is the degree matrix, with $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$. The eigenvalue decomposition of the normalized Laplacian matrix is $\mathbf{L} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$, where $\mathbf{U}$ is the matrix of eigenvectors of the normalized $\mathbf{L}$ and $\mathbf{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues. Let $x \in \mathbb{R}^N$ be the signal of all nodes on the traffic network. The Fourier transform of the signal $x$ is $\hat{x} = \mathbf{U}^T x$. According to the properties of the Laplacian matrix, $\mathbf{U}$ is an orthogonal matrix, so the inverse Fourier transform of the signal $\hat{x}$ is $x = \mathbf{U} \hat{x}$. Based on these concepts, the signal $x$ on graph $G$ is filtered by the convolution kernel $g_\theta$:

$$g_{\theta} *_{G} x = g_{\theta}(\mathbf{L}) x = g_{\theta}(\mathbf{U} \mathbf{\Lambda} \mathbf{U}^{T}) x = \mathbf{U} g_{\theta}(\mathbf{\Lambda}) \mathbf{U}^{T} x \tag{7}$$

The spectral graph convolution first uses the Fourier transform to map the graph signal $x$ and the kernel $g_\theta$ into an orthogonal space formed by the Laplacian matrix eigenvectors, then performs convolution in the Fourier domain, and finally applies the inverse Fourier transform to obtain the graph convolution result. However, this method requires an explicit eigenvalue decomposition of the Laplacian matrix, whose computational complexity is too high when the graph is large. Therefore, in this paper, we employ the Chebyshev polynomial [29] to approximate the convolution kernel and solve this problem. The formula is as follows:

$$g_{\theta} *_{G} x = g_{\theta}(\mathbf{L}) x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\mathbf{L}}) x \tag{8}$$

where the parameter $\theta \in \mathbb{R}^K$ is a vector of polynomial coefficients and $\tilde{\mathbf{L}} = \frac{2}{\lambda_{max}} \mathbf{L} - \mathbf{I}_N$, with $\lambda_{max}$ the maximal eigenvalue of the Laplacian matrix. The Chebyshev polynomials are recursively defined as $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$, in which $T_0(x) = 1$ and $T_1(x) = x$.

We denote $x_{i,t} \in \mathbb{R}^C$ as all extracted features of the $i$th node at the $t$th historical timestamp, where $C$ is the number of input channels. Thus, the input signal of the graph convolution is a feature matrix $X_t = [x_{1,t}, x_{2,t}, \dots, x_{N,t}]^T \in \mathbb{R}^{N \times C}$, and the result of the graph convolution is as follows:

$$\hat{X}_{t} = \sum_{k=0}^{K-1} \theta_{k} T_{k}(\tilde{\mathbf{L}}) X_{t} \tag{9}$$

where $\hat{X}_t = [\hat{x}_{1,t}, \hat{x}_{2,t}, \dots, \hat{x}_{N,t}] \in \mathbb{R}^{N \times D}$ and $D$ is the number of output channels. It is worth noting that, because the sum runs over $k = 0, \dots, K-1$, the convolution result contains the feature information of neighbors up to order $K-1$ only, and thus captures just the local spatial correlation of the road network structure.
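The Chebyshev recursion behind Equations (8) and (9) can be sketched in plain Python on a three-node path graph. Two points are assumptions of this sketch rather than statements from the paper: each $\theta_k$ is treated as a $C \times D$ weight matrix so that the stated input and output shapes agree, and $\lambda_{max}$ defaults to 2, a common upper bound for the normalized Laplacian, instead of being computed exactly.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def eye(i, j):
    return 1.0 if i == j else 0.0

def scaled_laplacian(A, lambda_max=2.0):
    """L~ = (2 / lambda_max) * L - I with L = I - D^{-1/2} A D^{-1/2}."""
    n = len(A)
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in A]
    L = [[eye(i, j) - inv_sqrt_deg[i] * A[i][j] * inv_sqrt_deg[j] for j in range(n)]
         for i in range(n)]
    return [[(2.0 / lambda_max) * L[i][j] - eye(i, j) for j in range(n)] for i in range(n)]

def cheb_graph_conv(X, L_tilde, thetas):
    """sum_k T_k(L~) X thetas[k], using T_k = 2 L~ T_{k-1} - T_{k-2}."""
    T_prev = X                                  # T_0(L~) X = X
    out = matmul(T_prev, thetas[0])
    if len(thetas) == 1:
        return out
    T_curr = matmul(L_tilde, X)                 # T_1(L~) X = L~ X
    out = add(out, matmul(T_curr, thetas[1]))
    for k in range(2, len(thetas)):
        LT = matmul(L_tilde, T_curr)
        T_next = [[2.0 * a - b for a, b in zip(r1, r2)] for r1, r2 in zip(LT, T_prev)]
        out = add(out, matmul(T_next, thetas[k]))
        T_prev, T_curr = T_curr, T_next
    return out

# Path graph 0 - 1 - 2 with a unit signal on node 0 (C = D = 1).
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
X = [[1.0], [0.0], [0.0]]
L_tilde = scaled_laplacian(A)
out_k2 = cheb_graph_conv(X, L_tilde, [[[1.0]], [[1.0]]])
out_k3 = cheb_graph_conv(X, L_tilde, [[[1.0]], [[1.0]], [[1.0]]])
```

With two terms ($k = 0, 1$) the signal on node 0 reaches only its direct neighbor, so node 2 stays at zero; adding a third term lets it reach node 2. This is the locality that motivates the global mechanism described next.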

**Global Correlated Spatial Mechanism:** To model the global spatial correlations between different nodes in the road network, a global correlated mechanism is proposed, as depicted in Figure 4. The formula for computing global correlations is as follows:

$$y_{i,t} = \sum_{\forall v_j \neq v_i} s_{i,j} \cdot \phi(\hat{x}_{i,t}, \hat{x}_{j,t}) \cdot \hat{x}_{j,t} W_{g} + \hat{x}_{i,t} W_{r} \tag{10}$$

where $y_{i,t} \in \mathbb{R}^F$ represents the output feature of the $i$th node at timestamp $t$. The static global topological weight $s_{i,j}$ depends on whether there is an edge between the $i$th node and the $j$th node in the road network: if there is an edge, $s_{i,j} = \alpha > 1$; otherwise, $s_{i,j} = 1$. In Equation (10), $\phi$ is the Gaussian kernel $\phi(x, y) = \exp(x^T W_\phi y)$, measuring the correlations between two node embedding representations, where $W_\phi \in \mathbb{R}^{D \times D}$ is the learnable parameter. The term $\sum_{\forall v_j \neq v_i} s_{i,j} \cdot \phi(\hat{x}_{i,t}, \hat{x}_{j,t}) \cdot \hat{x}_{j,t} W_g$ represents the impact of all other nodes on the $i$th node in the spatial dimension, and "$+\hat{x}_{i,t} W_r$" denotes a residual connection with the localized output features of the $i$th node, with $W_r \in \mathbb{R}^{D \times F}$ and $W_g \in \mathbb{R}^{D \times F}$ the learnable parameters. The output of the global correlated mechanism at timestamp $t$ is $Y_t = [y_{1,t}, y_{2,t}, \dots, y_{N,t}]^T \in \mathbb{R}^{N \times F}$, and the final output feature matrix of the spatial module is $Y \in \mathbb{R}^{N \times T \times F}$.
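Equation (10) can be written out directly for one timestamp. The sketch below follows the formula literally, without any normalization of the kernel values; the function name, the value $\alpha = 2$, and the toy weight matrices are assumptions chosen so that the result is easy to verify by hand.

```python
import math

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[k] * M[k][c] for k in range(len(v))) for c in range(len(M[0]))]

def global_correlation(X_hat, adjacency, W_phi, W_g, W_r, alpha=2.0):
    """Equation (10) for a single timestamp.

    X_hat     : N x D output of the localized graph convolution
    adjacency : N x N matrix, nonzero entry = edge in the road network
    W_phi     : D x D, W_g and W_r : D x F (learnable in the model, fixed here)
    alpha     : edge weight, must exceed 1 (illustrative value)
    """
    N = len(X_hat)
    Y = []
    for i in range(N):
        y_i = vecmat(X_hat[i], W_r)            # residual term x_i W_r
        xi_W = vecmat(X_hat[i], W_phi)         # x_i^T W_phi, reused for every j
        for j in range(N):
            if j == i:
                continue                       # the sum runs over v_j != v_i
            s_ij = alpha if adjacency[i][j] != 0 else 1.0
            phi = math.exp(sum(a * b for a, b in zip(xi_W, X_hat[j])))
            g = vecmat(X_hat[j], W_g)
            y_i = [y + s_ij * phi * gc for y, gc in zip(y_i, g)]
        Y.append(y_i)
    return Y

# Three nodes, D = 2, F = 1; only nodes 0 and 1 are connected.
X_hat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
W_phi = [[0.0, 0.0], [0.0, 0.0]]   # zero matrix makes phi = exp(0) = 1
W_g = [[1.0], [1.0]]
W_r = [[1.0], [0.0]]
Y = global_correlation(X_hat, adjacency, W_phi, W_g, W_r)
print(Y)  # [[5.0], [4.0], [3.0]]
```

Every node contributes to every other node, and connected pairs are simply weighted more heavily through $\alpha$, which is how the mechanism reaches beyond the local neighborhood.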

#### *3.4. The Structure of the External Component*

Traffic speed is affected by many factors such as holidays, weather conditions, and so on. Suppose $t_0$ is the current time and $S_{t+1}$ represents the feature vector of the external factors at the time interval $t + 1$ to be predicted. We use the feature vectors of the $T$ time intervals to form a feature matrix $S$. In our implementation, $S = [S_{t+1}, \dots, S_{t+T}]^T \in \mathbb{R}^{T \times F_s}$, where $F_s = 15$ is the number of features we select. Specific details are shown in Table 2. Since the weather conditions at the next $T$ intervals are unknown, we use the weather forecasting data from the weather website Darksky [46]. Next, we stack two fully-connected layers in the external component to process the external factor features. The first layer embeds each sub-factor and is followed by an activation. The second layer maps the low-dimensional features to higher-dimensional ones to obtain $Y_{ext}$, whose shape is the same as that of $Y_{res}$.

**Table 2.** External factors.


#### *3.5. Multi-Component Fusion*

In this section, we discuss how to integrate the four main parts of the model. Since the three spatial-temporal components model the recent, daily-periodic, and weekly-periodic spatial-temporal correlations, respectively, their influence varies across locations. For example, suppose we intend to predict the traffic speed at 08:30 on Monday morning. For places with obvious morning peaks, the outputs of the daily-periodic and weekly-periodic components have significant impacts on prediction performance, whereas for places with no obvious periodic pattern, the outputs of the daily and weekly components contribute little. In short, all locations are affected by short-term neighbors, long-term period, and trend, but to different degrees [25]. Therefore, the impact weights of the different spatial-temporal components differ from node to node, and these weight values should be learned from historical traffic data. Thus, we fuse the three components of Figure 4 as follows:

$$Y_{res} = W_{h} \circ Y_{h} + W_{d} \circ Y_{d} + W_{w} \circ Y_{w} \tag{11}$$

where $\circ$ is the Hadamard product. $W_h$, $W_d$, and $W_w$ are all learned parameters. These parameters indicate the degree to which the outputs of the three spatial-temporal components affect the forecasting target.

Then, we further merge the fusion result *Y<sub>res</sub>* of the three spatiotemporal components with the output of the external component *Y<sub>ext</sub>* to generate the final prediction result *Ŷ*, as illustrated in Figure 4. The output of the entire model is:

$$\hat{\mathbf{Y}} = \tanh(\mathbf{Y}\_{\text{res}} + \mathbf{Y}\_{\text{ext}}) \tag{12}$$

where tanh maps the prediction result to the range [−1, 1], which matches the range of the normalized speed, and helps the model converge faster.
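A minimal sketch of Equations (11) and (12) in plain Python is given here for a toy case with two nodes and three prediction steps. All numbers, including the weights, are made up for illustration; in the model the weights are learned.

```python
import math

def fuse(y_h, y_d, y_w, w_h, w_d, w_w, y_ext):
    """Weighted element-wise fusion (Eq. 11) followed by tanh (Eq. 12)."""
    rows, cols = len(y_h), len(y_h[0])
    y_hat = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Eq. (11): Hadamard products summed over the three components.
            y_res = (w_h[i][j] * y_h[i][j]
                     + w_d[i][j] * y_d[i][j]
                     + w_w[i][j] * y_w[i][j])
            # Eq. (12): add the external output and squash into [-1, 1].
            row.append(math.tanh(y_res + y_ext[i][j]))
        y_hat.append(row)
    return y_hat

# Toy outputs of the recent, daily, and weekly components (nodes x steps).
y_h = [[0.2, 0.1, 0.0], [0.5, 0.4, 0.3]]
y_d = [[0.3, 0.3, 0.3], [0.1, 0.1, 0.1]]
y_w = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]
# Node 0 leans on the periodic components, node 1 on the recent one.
w_h = [[0.2] * 3, [0.9] * 3]
w_d = [[0.5] * 3, [0.1] * 3]
w_w = [[0.3] * 3, [0.0] * 3]
y_ext = [[0.0] * 3, [0.0] * 3]

y_hat = fuse(y_h, y_d, y_w, w_h, w_d, w_w, y_ext)
```

Because each node has its own row of weights, a node with a strong morning peak can weight the daily and weekly outputs heavily while a node without a periodic pattern relies mostly on the recent component.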

Our model predicts the speeds of all sensors at the next *T* timestamps based on the *T* most recent historical observations. We choose the L2 loss as the training objective of GSTGCN, which is defined by:

$$L(\hat{\mathbf{Y}}\_{t+1}, \dots, \hat{\mathbf{Y}}\_{t+T}; \Theta) = \sum\_{t} \sum\_{i=1}^{T} ||\hat{\mathbf{Y}}\_{t+i} - \mathbf{Y}\_{t+i}||^2 \tag{13}$$

where Θ denotes all learnable parameters in the GSTGCN, *Y*<sub>*t*+*i*</sub> is the ground truth, and *Ŷ*<sub>*t*+*i*</sub> is the model's prediction result.
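For a single training sample, the inner sum of Equation (13) reduces to a sum of squared errors over the *T* predicted steps and all nodes. The sketch below illustrates this with made-up normalized values; the outer sum over training samples is omitted.

```python
def l2_loss(y_pred, y_true):
    """Sum of squared errors over all steps and nodes for one sample.

    y_pred, y_true: lists of T steps, each a list of per-node speeds.
    """
    return sum(
        (p - t) ** 2
        for step_pred, step_true in zip(y_pred, y_true)
        for p, t in zip(step_pred, step_true)
    )

# Toy example: T = 2 steps, 3 nodes, values in the normalized range.
y_true = [[0.5, -0.25, 0.0], [0.25, 0.0, -0.5]]
y_pred = [[0.25, -0.25, 0.5], [0.25, 0.5, -0.5]]
loss = l2_loss(y_pred, y_true)
```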

#### **4. Experiments**

#### *4.1. Datasets*

The proposed model was verified on two highway traffic datasets, PeMSD4 and PeMSD7, collected by the Caltrans Performance Measurement System (PeMS) [47] at 30-s intervals. This system deploys 39,000 detectors in major cities in California. The traffic speed data were aggregated from the raw data into 5-min windows. The geographic information of the sensors is recorded in the datasets along with the readings for each interval. The details of the datasets in our experiments are:

**PeMSD7:** It contains the traffic information from the sensors on the highways of Los Angeles County. We selected 204 sensors and collected three months of data from 1 January 2018 to 31 March 2018 for the experiment.

**PeMSD4:** It refers to the traffic data from the sensors in San Francisco Bay Area. We chose 325 sensors and extracted the data from 1 January 2017 to 31 March 2017 for the experiment.

During the experiment, both datasets were split in chronological order, with 70% used for training, 10% for validation, and the remaining 20% for testing. The sensor distribution of the two datasets is displayed in Figure 7. In the data preprocessing stage, we discarded traffic speed outliers less than 0 and used the tensor decomposition method to complete the missing values in the traffic speed data. Then, we encoded the non-numeric features in the external factors using a one-hot encoding scheme. Next, we used Min-Max normalization to map the external factor values into [0, 1] and the original speed into [−1, 1]. During the evaluation phase, we re-projected the predicted speed back to the original range as the final prediction result.
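The split and the speed normalization can be sketched as follows. The helper names are hypothetical, the readings are made up, and fitting the minimum and maximum on the training split only is an assumption of this sketch rather than something the text specifies.

```python
def chronological_split(series, train_pct=70, val_pct=10):
    """Split a time-ordered list into train/validation/test without shuffling."""
    n = len(series)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return (series[:n_train],
            series[n_train:n_train + n_val],
            series[n_train + n_val:])

def to_unit_range(x, lo, hi):
    """Min-Max normalization: map x from [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def from_unit_range(z, lo, hi):
    """Inverse mapping: project z from [-1, 1] back to [lo, hi]."""
    return (z + 1.0) / 2.0 * (hi - lo) + lo

speeds = [float(v) for v in range(20, 70)]  # 50 made-up readings
train, val, test = chronological_split(speeds)

# Assumption: the scaling statistics come from the training split only.
lo, hi = min(train), max(train)
scaled = [to_unit_range(v, lo, hi) for v in train]
restored = [from_unit_range(z, lo, hi) for z in scaled]
```

Keeping the split chronological means the test period always follows the training period, so the model is never evaluated on data that precedes what it was trained on.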

**Figure 7.** Sensor distribution of PeMSD4 and PeMSD7 datasets.

#### *4.2. Evaluation Metric*

In the experiments, we applied three widely-used metrics to evaluate the performance of our model: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). They are defined as follows:

$$MAE = \frac{1}{M \times N} \sum\_{i=1}^{M} \sum\_{j=1}^{N} |y\_{i,j} - \hat{y}\_{i,j}| \tag{14}$$

$$RMSE = \sqrt{\frac{1}{M \times N} \sum\_{i=1}^{M} \sum\_{j=1}^{N} (y\_{i,j} - \hat{y}\_{i,j})^2} \tag{15}$$

$$MAPE = \frac{1}{M \times N} \sum\_{i=1}^{M} \sum\_{j=1}^{N} \frac{|y\_{i,j} - \hat{y}\_{i,j}|}{y\_{i,j}} \tag{16}$$

where *y*<sub>*i*,*j*</sub> and *ŷ*<sub>*i*,*j*</sub> are the true value and the predicted value, respectively, *N* is the number of detectors we select in the road network, and *M* is the total number of predicted samples.
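The three metrics can be computed as in the following sketch, which treats the true speed as the denominator of MAPE. The speeds are made up for illustration.

```python
import math

def evaluate(y_true, y_pred):
    """Return (MAE, RMSE, MAPE) over M samples x N detectors, Eqs. (14)-(16)."""
    pairs = [(t, p)
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    count = len(pairs)
    mae = sum(abs(t - p) for t, p in pairs) / count
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / count)
    # MAPE divides by the true speed, so it is undefined for zero readings.
    mape = sum(abs(t - p) / t for t, p in pairs) / count
    return mae, rmse, mape

# Made-up speeds: M = 2 samples, N = 2 detectors.
y_true = [[50.0, 40.0], [20.0, 80.0]]
y_pred = [[55.0, 36.0], [22.0, 80.0]]
mae, rmse, mape = evaluate(y_true, y_pred)
```

In this toy case the absolute errors are 5, 4, 2, and 0, so the MAE is 2.75, while each nonzero error is 10% of its true value, giving a MAPE of 7.5%.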

#### *4.3. Baselines*

We compared our model with the following eight models:


#### *4.4. Experiment Settings*

We implemented our GSTGCN model based on the PyTorch framework and conducted experiments on a computer with one Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz and one NVIDIA Quadro P2000 GPU card. The dynamic temporal module in the model contained four residual blocks. Each residual block consisted of two stacked dilated causal convolutions. The kernel size was set as 3. The dilation factors of three residual blocks were 1, 2, and 4. The number of kernels in the localized graph convolution and the number of hidden channels were both set to 8. The hyperparameter *α* was set as 2. For the external component, we set the output channels of the first fully-connected layer to 22, and the output channels were reduced to 1 by the next fully-connected layer. When constructing the adjacency matrix, *κ*, *w*<sub>1</sub>, and *w*<sub>2</sub> were set as 0.5. The batch size was 256, and we trained the model for 30 epochs. We used the Adam optimizer to train our model with an initial learning rate of 0.001. During the testing phase, we predicted the traffic speed in the next hour (12 steps) based on 12 historical speeds.
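As a rough check on these temporal settings, the sketch below implements a one-dimensional dilated causal convolution in plain Python and computes the receptive field of a stack with kernel size 3 and dilation factors 1, 2, and 4. It assumes that both convolutions in a residual block share that block's dilation factor and uses zero padding on the left; only the three dilation factors listed above are included.

```python
def dilated_causal_conv(x, weights, dilation):
    """1D causal convolution: out[t] uses only x[t], x[t-d], x[t-2d], ...

    Positions before the start of the sequence are treated as zeros.
    """
    k = len(weights)
    out = []
    for t in range(len(x)):
        total = 0.0
        for j, w in enumerate(weights):
            idx = t - (k - 1 - j) * dilation  # never looks past time t
            if idx >= 0:
                total += w * x[idx]
        out.append(total)
    return out

def receptive_field(kernel_size, dilations, convs_per_block=2):
    """Number of past time steps visible to one output of the causal stack."""
    field = 1
    for dilation in dilations:
        # Each dilated causal convolution adds (kernel_size - 1) * dilation steps.
        field += convs_per_block * (kernel_size - 1) * dilation
    return field

out = dilated_causal_conv([1, 2, 3, 4, 5, 6], [1, 1, 1], dilation=2)
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

Under these assumptions the receptive field is 29 time steps, which already covers the 12 historical observations used as input.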

#### *4.5. Experimental Results*

#### 4.5.1. Prediction Performance Comparison

Table 3 reports the MAE, RMSE, and MAPE of GSTGCN and all baseline models on the PeMSD4 and PeMSD7 datasets for 15-min (3 steps), 30-min (6 steps), and 60-min (12 steps) predictions. Deep learning methods perform better than the simple time series method (ARIMA) and the traditional machine learning method (SVR), indicating that deep learning methods can model more complex traffic data. The graph-based models, namely STGCN, DCRNN, Graph WaveNet, ST-MetaNet, and GSTGCN, predict more accurately than SAE and SBU-LSTM, which means that the spatial topological information of the traffic data is critical to prediction performance. Compared to DCRNN, STGCN, ST-MetaNet, and Graph WaveNet, GSTGCN has a clear advantage in long-term prediction, with a slower error growth rate, and achieves the best prediction accuracy on all metrics and on both datasets. To further verify the accuracy of our model, we compared the prediction performance of GSTGCN for the morning peak hours and weekends with that of Graph WaveNet. Based on the experimental results shown in Figure 8, we found that GSTGCN performs better than Graph WaveNet, which demonstrates that GSTGCN is more effective in modeling the complex spatiotemporal correlations.

**Figure 8.** Speed prediction in the morning peak hours and weekends of the dataset PeMSD7.

**Table 3.** Performance comparison of different approaches for traffic prediction on PeMSD7 and PeMSD4 datasets. The best results are marked in bold.


#### 4.5.2. Model Structure Comparison

In this section, we mainly discuss the structural differences between STGCN [13], ST-MetaNet [37], Graph WaveNet [14], and our proposed model GSTGCN.

STGCN is a deep learning framework with complete convolutional structures. It contains multiple 1D causal convolutions followed by a gated linear unit (GLU) for capturing temporal correlations and employs K-order Chebyshev graph convolution on traffic data to extract spatial dependencies. The architecture only captures simple nonlinear temporal correlations and localized spatial dependencies of traffic data. We can observe that STGCN performs poorly compared to the other models in Figure 9a, especially in the case of long-term prediction.

ST-MetaNet employs a sequence-to-sequence architecture and introduces meta-learning to spatiotemporal modeling. The model first utilizes the points of interest (POIs) and the density of the road network around each detector to construct node attributes and then constructs the graph's edge attributes using the k-nearest neighbor (KNN) algorithm. In the model, a meta graph attention network (GAT) is used to capture diverse spatial correlations, and a meta recurrent neural network (RNN) is employed to capture diverse temporal correlations. Compared with the ordinary 1D causal convolution, an RNN has an advantage in time series modeling, as it can remember the previous input sequence using its inner memory structure. Besides, ST-MetaNet takes meta-learning knowledge into account. Thus, it is superior to STGCN in Table 3 and Figure 9a.

Graph WaveNet is a graph neural network architecture for spatial-temporal graph modeling. In the spatial dimension, Graph WaveNet introduces an adaptive adjacency matrix to capture spatial correlations based on diffusion convolution. The adaptive adjacency matrix can learn the hidden spatial dependencies in the road network, and the diffusion convolution can capture localized spatial correlations. In the temporal dimension, Graph WaveNet employs stacked dilated causal convolutions (DCC) to obtain temporal dependencies. The receptive field of the stacked dilated causal convolutions grows exponentially as the number of layers increases, so the model can handle long sequences very well. Therefore, as shown in Table 3 and Figure 9a, Graph WaveNet performs better than ST-MetaNet.

Our proposed model GSTGCN integrates the spatiotemporal correlations of traffic data and the influence of external factors. In the temporal dimension, we employ three spatial-temporal components that consider multiple temporal periodicities, and we use stacked dilated causal convolutions (DCC) with residual connections to obtain temporal dynamics in each component. In the spatial dimension, we model local and global correlations through a global correlated module, which contains K-order Chebyshev graph convolution and a global correlated mechanism. When constructing the adjacency matrix, we consider not only the distance between the geographic locations of the sensors, but also the surrounding points of interest (POIs) to explore the functional similarity of the areas where the sensors are located. In addition, we take external factors into account using fully connected layers. Compared to Graph WaveNet, GSTGCN considers multiple temporal periodicities, global spatial correlations, and the impact of external factors on traffic data. Hence, the experimental results demonstrate that GSTGCN achieves the best prediction accuracy on all metrics.

**Figure 9.** (**a**) Test Mean RMSE of 12 steps versus the number of training epochs on PeMSD7 dataset. (**b**) Fault-tolerance comparison on PeMSD7 dataset.

#### 4.5.3. Number of Residual Blocks in Dynamic Temporal Module

To determine the appropriate number of residual blocks in the model, we ran experiments with different numbers of residual blocks. The experimental results are presented in Figure 10a. As the number of residual blocks increases, the prediction performance of the model improves. However, once the number of residual blocks reaches 4, the accuracy of the model stops improving or even degrades, while the training time increases greatly. We therefore use four residual blocks in the dynamic temporal module of our model.

#### 4.5.4. Effect of Each Component

To investigate the effect of each component of our model on the prediction result, we evaluate four variants obtained by removing, in turn, the external module, the global correlated mechanism, the independent daily-periodic spatial-temporal component, and the independent weekly-periodic spatial-temporal component from GSTGCN. These four variants are GSTGCN-noExt, GSTGCN-noGlo, GSTGCN-noDay, and GSTGCN-noWeek. Figure 10b illustrates the MAE comparison of GSTGCN and its four variants when predicting the next 12 steps on PeMSD7. The figure shows that GSTGCN consistently outperforms GSTGCN-noExt and GSTGCN-noGlo, indicating the effectiveness of the external component and the global correlated mechanism. The other two variants, GSTGCN-noDay and GSTGCN-noWeek, have short-term prediction performance similar to that of GSTGCN, but they perform worse in long-term prediction. This indicates that the daily-periodic and weekly-periodic components help to capture the long-term temporal dependencies of the traffic data more effectively.

**Figure 10.** (**a**) Prediction performance of GSTGCN with a different number of residual blocks on PeMSD7. (**b**) MAE of each prediction step of GSTGCN and its four variants on PeMSD7.

#### 4.5.5. Fault Tolerance Comparison

Due to sensor maintenance and breakdown, there are partially missing values in the traffic data. To evaluate the fault-tolerant ability of the model, we randomly discarded a fraction *α* of the historical traffic data and trained the model using the remaining data. In the experiment, *α* ranged from 10% to 90%. We conducted separate experiments on GSTGCN, Graph WaveNet, DCRNN, STGCN, and ST-MetaNet using the PeMSD7 dataset, and the resulting prediction MAE values are shown in Figure 9b. Our proposed model, GSTGCN, has better fault tolerance than all the other baselines. This indicates that GSTGCN learns complex spatiotemporal correlations more effectively from sparse and noisy real-world datasets.
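The discarding step can be sketched as follows. The helper name, the fixed seed, and the use of `None` to mark discarded readings are assumptions of this sketch; how the discarded entries are handled afterwards is not specified here.

```python
import random

def drop_fraction(data, alpha, seed=0):
    """Return a copy of data with a fraction alpha of entries set to None."""
    rng = random.Random(seed)
    n_drop = round(alpha * len(data))
    dropped = set(rng.sample(range(len(data)), n_drop))
    return [None if i in dropped else v for i, v in enumerate(data)]

speeds = [60.0 + i for i in range(100)]  # made-up readings
masked = drop_fraction(speeds, alpha=0.3)
remaining = [v for v in masked if v is not None]
```

Repeating this for each value of the discarded fraction and retraining on the remaining data gives one point per fraction on a fault-tolerance curve.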

#### 4.5.6. Training Efficiency

We compared the computational cost of GSTGCN, DCRNN, STGCN, ST-MetaNet, and Graph WaveNet on PeMSD7. For the sake of fairness, the training time is the time each model takes to train one epoch, and the inference time is the time each model takes to predict the traffic speed at 12 timestamps in the next hour on the validation data. Table 4 reports the results. During the training phase, GSTGCN is the fastest, followed by STGCN and Graph WaveNet. GSTGCN runs eleven times faster than DCRNN and seven times faster than ST-MetaNet in training. Since DCRNN and ST-MetaNet use recurrent neural networks to capture temporal dependencies, they need more time to train. For inference, GSTGCN is the most efficient, and the time cost of STGCN and DCRNN increases significantly because they need to predict the results of the 12 steps iteratively, while GSTGCN and Graph WaveNet generate 12 predictions in one run. To further investigate the performance of the compared models, we plot the mean RMSE of 12 steps on the PeMSD7 test set against the number of training epochs, as shown in Figure 9a. The figure suggests that GSTGCN converges more easily and trains faster.


**Table 4.** The computation time on the PeMSD7 dataset.

#### **5. Conclusions**

We propose a novel global spatial-temporal graph convolutional network called GSTGCN to predict urban traffic speed. In the spatial dimension, the model combines a localized graph convolution and a global correlated mechanism to capture local and non-local spatial correlations. When constructing the adjacency matrix that represents the structure of the road network, the model considers not only the distances between the sensors, but also the similarities of the sensors' locations. In the temporal dimension, three independent modules are used to model the recent, daily-periodic, and weekly-periodic temporal dependencies, respectively. Each module consists of several residual blocks containing stacked dilated causal convolutions. In addition, the model takes the effects of weather conditions and other factors such as holidays into account. Experiments on two real-world datasets showed that the prediction accuracy of GSTGCN is better than that of existing models on all three metrics. In future work, we plan to explore more complex spatial correlations to further improve the prediction accuracy. Since GSTGCN is a general framework for the spatiotemporal prediction problem of graph-structured data, we can also apply it to other practical applications, such as arrival time estimation.

**Author Contributions:** S.L., K.W., and L.G. conceptualized the work and defined the methodology; Y.W. and F.C. did the data curation; S.L. implemented the experiments and drafted the manuscript; and L.G. contributed to the supervision. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
