Spatio-Temporal Self-Attention Network for Origin–Destination Matrix Prediction in Urban Rail Transit
Abstract
1. Introduction
2. Related Work
2.1. Origin–Destination Matrix Prediction in Traffic Scenarios
2.2. Self-Attention Mechanism
2.3. Our Contribution
3. Methods
3.1. Problem Formulation
3.2. Overview
3.3. Spatio-Temporal Non-Local Operation
3.3.1. Spatial Orthogonal Non-Local Operation
3.3.2. Spatio-Temporal Non-Local Operation
3.4. Spatio-Temporal Self-Attention Module
3.4.1. The Overall Architecture of STSM
3.4.2. Attention Map Generation Network
3.4.3. Non-Local Feature Aggregation Network
3.5. Loss Function
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Comparison Methods
4.5. Results and Discussions
4.5.1. Comparison of Prediction Performance
4.5.2. Ablation Studies
- (1) The proposed SSNet outperforms SSNet-A, which indicates that the proposed STSMs are crucial to improving the network's prediction performance. A possible reason is that the STSMs aggregate long-range spatio-temporal contextual information more effectively than convolutional layers, helping the optimizer reach lower local minima and therefore a lower best RMSE.
- (2) From Table 2, the prediction performance improves from SSNet-A to SSNet but degrades from SSNet to SSNet-B, which indicates that simply increasing the number of STSMs cannot continuously improve prediction performance. A possible explanation is as follows. Because the convolutional operation cannot efficiently capture non-local feature information, SSNet-A struggles to converge to a low minimum during parameter optimization, so its prediction performance is hard to improve further. Compared with SSNet-A, the proposed SSNet adds an appropriate number of STSMs, which effectively aggregates long-range spatio-temporal contextual information with little impact on parameter optimization; the prediction performance therefore improves. However, adding even more STSMs can hardly strengthen the network's non-local information aggregation ability any further, while it greatly increases the difficulty of parameter optimization and hinders convergence. Consequently, the prediction performance of SSNet-B decreases relative to SSNet. The convergence curves of SSNet-A, SSNet-B and SSNet in Figure 11 further support this analysis.
- (3) SSNet performs better than SSNet-C, which demonstrates that placing the STSMs toward the back of the network yields better prediction performance than placing them at the front. The main reason may be that, during backpropagation, the parameters of STSMs closer to the network output are easier to optimize, which is conducive to network convergence.
- (4) As shown in Table 2, the prediction performance improves from SSNet-D to SSNet but degrades from SSNet to SSNet-E, which indicates that increasing the number of RB–SRBs cannot continuously improve prediction performance. A possible explanation is as follows. SSNet-D contains only one RB–SRB, so the network is too shallow to extract high-level feature information. Compared with SSNet-D, the appropriately increased number of RB–SRBs in SSNet extracts higher-level features while having little impact on parameter optimization, so SSNet performs better. SSNet-E adds one more RB–SRB on top of SSNet; since SSNet already extracts effective high-level features, the extra block contributes little, while three RB–SRBs increase the difficulty of parameter optimization during training. Therefore, the OD prediction performance of SSNet-E decreases relative to SSNet.
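The long-range aggregation attributed to the STSMs above builds on the non-local (self-attention) operation of Wang et al. (see References). As an illustration only, not the authors' exact STSM (whose attention-map generation and feature aggregation networks are described in Section 3.4), a minimal NumPy sketch of an embedded-Gaussian non-local operation over a spatio-temporal feature map might look like this; all weight matrices here are hypothetical placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_aggregation(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian non-local operation over a spatio-temporal
    feature map x of shape (T, H, W, C).

    Every position attends to every other position across all time
    steps, so long-range spatio-temporal context is aggregated in a
    single operation.
    """
    t, h, w, c = x.shape
    flat = x.reshape(-1, c)                 # (N, C) with N = T*H*W
    theta = flat @ w_theta                  # queries (N, C')
    phi = flat @ w_phi                      # keys    (N, C')
    g = flat @ w_g                          # values  (N, C')
    attn = softmax(theta @ phi.T, axis=-1)  # (N, N) attention map
    out = attn @ g                          # aggregated features
    return out.reshape(t, h, w, -1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4, 8))       # T=2, H=W=4, C=8
w = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
y = non_local_aggregation(x, *w)
print(y.shape)  # (2, 4, 4, 8)
```

Because every position attends to all T·H·W positions at once, a single such operation covers context that stacked convolutions would need many layers to reach, which is consistent with the SSNet versus SSNet-A comparison above. The quadratic (N, N) attention map also hints at why adding many STSMs complicates optimization.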
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2018, arXiv:1608.06993v5.
- Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 105–114.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215v3.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805v2.
- Zhang, J.; Zheng, Y.; Qi, D. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA, 4–9 February 2017; pp. 1655–1661.
- Liu, L.; Qiu, Z.; Li, G.; Wang, Q.; Ouyang, W.; Lin, L. Contextualized Spatial–Temporal Network for Taxi Origin-Destination Demand Prediction. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3875–3887.
- Zhang, J.; Chen, F.; Wang, Z.; Liu, H. Short-Term Origin-Destination Forecasting in Urban Rail Transit Based on Attraction Degree. IEEE Access 2019, 7, 133452–133462.
- Li, D.; Cao, J.; Li, R.; Wu, L. A Spatio-Temporal Structured LSTM Model for Short-Term Prediction of Origin-Destination Matrix in Rail Transit With Multisource Data. IEEE Access 2020, 8, 84000–84019.
- Zhang, J.; Che, H.; Chen, F.; Ma, W.; He, Z. Short-term origin-destination demand prediction in urban rail transit systems: A channel-wise attentive split-convolutional neural network method. Transp. Res. Part C Emerg. Technol. 2021, 124, 102928.
- Zhang, J.; Shen, D.; Tu, L.; Zhang, F.; Xu, C.; Wang, Y.; Tian, C.; Li, X.; Huang, B.; Li, Z. A Real-Time Passenger Flow Estimation and Prediction Method for Urban Bus Transit Systems. IEEE Trans. Intell. Transp. Syst. 2017, 18, 3168–3178.
- Ou, J.; Lu, J.; Xia, J.; An, C.; Lu, Z. Learn, Assign, and Search: Real-Time Estimation of Dynamic Origin-Destination Flows Using Machine Learning Algorithms. IEEE Access 2019, 7, 26967–26983.
- Bierlaire, M.; Crittin, F. An Efficient Algorithm for Real-Time Estimation and Prediction of Dynamic OD Tables. Oper. Res. 2004, 52, 116–127.
- Wang, S.-W.; Ou, D.-X.; Dong, D.-C.; Xie, H. Research on the model and algorithm of origin-destination matrix estimation for urban rail transit. In Proceedings of the 2011 International Conference on Transportation, Mechanical, and Electrical Engineering (TMEE), Changchun, China, 16–18 December 2011; pp. 1403–1406.
- Yang, C.; Yan, F.; Xu, X. Daily metro origin-destination pattern recognition using dimensionality reduction and clustering methods. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 548–553.
- Noursalehi, P.; Koutsopoulos, H.N.; Zhao, J. Dynamic Origin-Destination Prediction in Urban Rail Systems: A Multi-Resolution Spatio-Temporal Deep Learning Approach. IEEE Trans. Intell. Transp. Syst. 2022, 23, 5106–5115.
- Cao, Y.; Hou, X.; Chen, N. Short-Term Forecast of OD Passenger Flow Based on Ensemble Empirical Mode Decomposition. Sustainability 2022, 14, 8562.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5753–5763.
- Lin, Z.; Feng, M.; dos Santos, C.N.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A structured self-attentive sentence embedding. arXiv 2017, arXiv:1703.03130v1.
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv 2019, arXiv:1901.02860v3.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2016, arXiv:1409.0473v7.
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2022, arXiv:2009.14794v4.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
- Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. arXiv 2019, arXiv:1805.08318.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929v2.
- Ye, J.; Zhao, J.; Zheng, F.; Xu, C. Completion and augmentation-based spatiotemporal deep learning approach for short-term metro origin-destination matrix prediction under limited observable data. Neural Comput. Appl. 2023, 35, 3325–3341.
- Zhou, W.; Du, H.; Mei, W.; Fang, L. Spatial orthogonal attention generative adversarial network for MRI reconstruction. Med. Phys. 2021, 48, 627–639.
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.; Wong, W.; Woo, W. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 802–810.
Table 1. Comparison of prediction performance among different methods.

Method | RMSE | MAE | SMAPE |
---|---|---|---|
ConvLSTM | 0.648 ± 0.343 | 0.161 ± 0.100 | 0.185 ± 0.078 |
STResNet | 0.722 ± 0.476 | 0.172 ± 0.119 | 0.189 ± 0.078 |
CASCNN | 0.740 ± 0.473 | 0.178 ± 0.116 | 0.195 ± 0.077 |
SSNet | 0.622 ± 0.336 * | 0.159 ± 0.105 | 0.183 ± 0.080 |
Table 2. Ablation study results of SSNet and its variants with different block sequences.

Method | Block Sequence | RMSE | MAE | SMAPE |
---|---|---|---|---|
SSNet-A | RB–RB–RB–RB | 0.726 ± 0.490 | 0.172 ± 0.120 | 0.188 ± 0.079 |
SSNet-B | SRB–SRB–SRB–SRB | 0.631 ± 0.310 | 0.172 ± 0.107 | 0.202 ± 0.086 |
SSNet-C | SRB–RB–SRB–RB | 0.648 ± 0.343 | 0.168 ± 0.111 | 0.190 ± 0.086 |
SSNet-D | RB–SRB | 0.656 ± 0.365 | 0.167 ± 0.105 | 0.192 ± 0.079 |
SSNet-E | RB–SRB–RB–SRB–RB–SRB | 0.652 ± 0.353 | 0.168 ± 0.110 | 0.193 ± 0.085 |
SSNet | RB–SRB–RB–SRB | 0.622 ± 0.336 * | 0.159 ± 0.105 | 0.183 ± 0.080 |
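The three metrics reported above are standard. For reference, a short NumPy sketch follows; note that SMAPE has several common variants, so the normalization shown here is an assumption and may differ slightly from the paper's exact definition:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # mean absolute error
    return float(np.mean(np.abs(y_true - y_pred)))

def smape(y_true, y_pred, eps=1e-8):
    # one common SMAPE variant: absolute error over the mean of the
    # absolute values; eps guards against zero denominators
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0 + eps
    return float(np.mean(np.abs(y_true - y_pred) / denom))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
print(rmse(y_true, y_pred))  # ≈ 0.354
print(mae(y_true, y_pred))   # 0.25
```

RMSE penalizes large errors more heavily than MAE, while SMAPE is scale-free, which is why reporting all three gives a more complete picture of OD prediction quality.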
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhou, W.; Tang, T.; Gao, C. Spatio-Temporal Self-Attention Network for Origin–Destination Matrix Prediction in Urban Rail Transit. Sustainability 2024, 16, 2555. https://doi.org/10.3390/su16062555