Article

Short-Term Online Forecasting for Passenger Origin–Destination (OD) Flows of Urban Rail Transit: A Graph–Temporal Fused Deep Learning Method

School of Traffic and Transportation, Beijing Jiaotong University, No. 3 Shang Yuan Cun, Hai Dian District, Beijing 100044, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(19), 3664; https://doi.org/10.3390/math10193664
Submission received: 14 September 2022 / Revised: 30 September 2022 / Accepted: 4 October 2022 / Published: 6 October 2022

Abstract

Predicting short-term passenger flow accurately is of great significance for the daily management and timely emergency response of rail transit networks. In this paper, we propose an attention-based Graph–Temporal Fused Neural Network (GTFNN) that can make online predictions of origin–destination (OD) flows in a large-scale urban transit network. In order to solve the key issue of passenger hysteresis in online flow forecasting, the proposed GTFNN takes the finished OD flow and a series of features, which are known or observable, as the input and performs multi-step prediction. The model is constructed to capture both spatial and temporal characteristics. For learning spatial characteristics, a multi-layer graph neural network is proposed based on hidden relationships in the rail transit network, and the graph convolution is embedded into a Gated Recurrent Unit to learn spatial–temporal features. For learning temporal characteristics, a sequence-to-sequence structure embedded with the attention mechanism is proposed to enhance the ability to capture both local and global dependencies. Experiments based on real-world data collected from Chongqing’s rail transit system show that GTFNN outperforms other methods; e.g., its SMAPE (Symmetric Mean Absolute Percentage Error) score is about 14.16%, which is 5% to 20% better than that of the compared methods.

1. Introduction

1.1. Background

Urbanization brings growing populations to cities and leads to significant mobility and sustainability challenges. Predicting the short-term passenger flow of urban rail transit systems, especially the origin–destination (OD) flow, is vital for fully understanding travel patterns and improving daily operation efficiency. In case of emergencies or special incidents, accurate online passenger flow forecasting can help implement efficient response measures and enhance the service quality of public transport systems [1].
Nevertheless, traffic forecasting is challenging due to the complex temporal and spatial relationships in the data. More specifically, in the temporal dimension, both global features and local features influence the evolution of passenger flow. In the spatial dimension, the interrelationship between passenger flow and the complex rail network, i.e., the similarities and correlations among the passenger flow series of different stations [2], makes the prediction of passenger flow much more complex.
In addition, passenger forecasting problems often contain a complex mix of inputs—including time-invariant, known future inputs, and other exogenous time series that are only observed historically—without any prior information on how they interact with the target [3].
In addition, OD prediction has its own characteristics. In a large-scale rail transit network, online forecasting can only obtain data on the finished OD passenger flows, since passenger trips cannot finish immediately. The unknown passenger flow includes the currently unfinished passenger flow as well as the future flow that will continuously enter the network. Meanwhile, the AFC (Auto Fare Collection) system has a delay in updating the OD flow information. Thus, the OD flow forecasting task naturally takes the finished OD flow observed so far as input and the actual OD flow as output.
To solve the above issues, we need a method that simultaneously considers the spatial–temporal properties of OD passenger flow. The development of deep learning provides the possibility of finding a reasonable solution. The temporal fusion transformer method [3] is explored, which can combine high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. Furthermore, Graph Convolution Networks (GCNs), which can automatically learn representations on non-Euclidean data (e.g., graphs), have been proposed and utilized in many scenarios [4,5,6,7].
Inspired by the above, a Graph–Temporal Fused Neural Network (GTFNN) is proposed in this paper. The most important feature of this model is the introduction of a graph neural network component in the attention-based sequence-to-sequence structure to fuse the hidden spatial relationships in a rail transit network into temporal relationships. This realizes the prediction of time-series OD passenger flow, which is highly dependent on the network structure.

1.2. Potential Contributions

This paper develops a method for forecasting the OD flow of a network-level rail transit system, with consideration of the hysteresis characteristic and the hidden complex spatial–temporal relationships. The potential contributions can be summarized as:
  • Considering the hysteresis characteristic of OD flow, we proposed an online forecasting framework that maps the historical observable finished flow and features to the actual OD flow in the future.
  • To improve the prediction accuracy, we built a feature system including observable and knowable features based on their temporal relationships. We also used specialized feature selection components for weighting the features.
  • For capturing the hidden spatial relationships in the rail transit network, we introduced a multi-layer graph neural network as a key component in the method to depict the relationships among stations from different aspects.
  • Based on a sequence-to-sequence framework, a Graph–Temporal Fused Deep Learning model was built. In addition, an attention mechanism was attached to the model to fuse local and global temporal dependencies; then, we achieved the prediction of short-term online OD passenger flow.
For rail transit systems, obtaining accurate OD passenger flow data in real time is an important support for transportation organization. Train line planning, timetabling and station passenger flow planning all depend on accurate OD prediction results. In particular, the prediction of different OD passenger flow peaks can effectively guide the allocation of rail transportation resources. Therefore, the OD prediction model studied in this paper has strong practical significance.

2. Literature Review

We performed a literature review from three aspects: passenger flow prediction, graph convolution neural networks, and time series forecasting with attention-based deep neural networks. All three aspects are relevant to the study of this paper.

2.1. Passenger Flow Prediction

Passenger flow prediction is one of the most significant tasks in the field of intelligent transportation. There are two major categories of prediction methods covered in the flow prediction issue, which are statistical methods and machine learning methods [8]. Among the statistical algorithms, the Auto-Regressive Integrated Moving Average (ARIMA) model is representative. Williams et al. [9] and Lee and Fambro [10] applied the ARIMA model to realize urban freeway traffic flow prediction.
A large number of studies in recent years have focused on applying machine learning or deep learning algorithms in transportation applications, which have generated considerable achievements. Huang et al. [11] claimed that their research was the earliest application of deep learning to traffic flow prediction through their proposed Deep Belief Network. Specific to subway passenger flow, Ni et al. [12] trained a linear regression model with event occurrence information to tackle prediction under abnormal conditions. Sun et al. [13] proposed a Wavelet-SVM and discussed predicting different kinds of passenger flows. A combination of empirical mode decomposition and MLP was proposed by Wei and Chen [1] to forecast short-term metro flow. Later on, a multi-scale radial basis function network [14] was proposed by the same authors to improve their forecasting accuracy. Furthermore, some researchers [15] focused on using a nonparametric regression model to predict the flows of transfer stations instead of the whole subway system.
In the early days, OD predictions were more concerned with potential relationships in the time dimension. Some works [16,17,18] usually employed time series filtering models (e.g., Kalman filter) to estimate the OD flow. Recently, many studies have fused spatial relationships with temporal relationships. For instance, Liu et al. [19] proposed a Contextualized Spatial–temporal Network that incorporated local spatial context, temporal evolution context, and global correlation context to forecast taxi OD demand. Shi et al. [20] utilized long short-term memory (LSTM) units to extract temporal features for each OD pair and then learned the spatial dependency of origins and destinations by a two-dimensional graph convolutional network.
For ride-hailing applications, the origin and destination of a passenger are known once a taxi request is generated. However, in online metro systems, the destination of a passenger is unknown until the passenger reaches the destination station; thus, the operators cannot obtain the complete OD distribution immediately to forecast the future OD demand. To address this issue, Gong et al. [21] used indication matrices to mask and neglect the potentially unfinished metro orders. Lingbo Liu et al. [22] handled this task by learning a mapping from the historical incomplete OD demands to the future complete OD demands.

2.2. Graph Convolution Neural Networks

To fit the required input format of Convolutional Neural Network and Recurrent Neural Network, some works [23,24] divided the studied network system into regular grid cells and transformed the raw traffic data to tensors. However, this preprocessing manner has limitations in handling the traffic systems with irregular topologies, such as rail transit systems and road networks.
Graph Convolutional Networks can be used to improve the generality of deep learning-based forecasting methods in networks. For instance, Li et al. [25] modeled the traffic flow as a diffusion process on a directed graph and captured the spatial dependency with bi-directional random walks. Song et al. [26] developed a Spatial–Temporal Synchronous Graph Convolutional Network (STSGCN), which captured the complex localized spatial–temporal correlations through a spatial–temporal synchronous modeling mechanism. In the article by Han et al. [27], graph convolution operations were applied to capture the irregular spatial–temporal dependencies along with the metro network. Geng X. et al. [28] developed a ST-MGCN, which incorporated a neighborhood graph (NGraph), a transportation connectivity graph (TGraph), and a functional similarity graph (FGraph) for ride-hailing demand prediction. Lingbo Liu et al. [22] designed a more flexible historical ridership data network based on this and fully explored the inter-station flow similarity and OD correlation for virtual graph construction.

2.3. Time Series Forecasting with Attention-Based Deep Neural Networks

Attention mechanisms are used in translation [29], image classification [30] or tabular learning [31] to identify salient portions of input for each instance using the magnitude of attention weights. Recently, they have been adapted for time series with interpretability motivations [32,33,34], using LSTM-based [35] and transformer-based [34] architectures.
Deep neural networks have increasingly been used in time series forecasting, which demonstrates stronger performance over traditional time-series models [32,36,37]. More recently, transformer-based architectures have been explored in Li et al. [34], which proposes the use of convolutional layers for local processing and a sparse attention mechanism to increase the size of the receptive field during forecasting. Despite their simplicity, iterative methods rely on the assumption that only the target needs to be recursively fed into future inputs.
The Multi-horizon Quantile Recurrent Forecaster (MQRNN) [38] uses LSTM or convolutional encoders to generate context vectors. In Fan et al. [39], a multi-modal attention mechanism was used with LSTM encoders to construct context vectors for a bi-directional LSTM decoder. Despite performing better than LSTM-based iterative methods, interpretability remains challenging for such standard direct methods.
In general, deep learning models involve a large number of input features. The significance of features is difficult to measure, which is not beneficial to the large-scale application of prediction models. Some methods are used for the significance analysis of different input features in time series forecasting. For example, Interpretable Multi-Variable LSTMs [40] partition the hidden state such that each variable contributes uniquely to its own memory segment and weights memory segments to determine variable contributions. Methods combining temporal importance and variable selection have also been considered [33], which compute a single contribution coefficient based on attention weights from each. However, in addition to the shortcoming of modeling only one-step-ahead forecasts, existing methods also focus on instance-specific (i.e., sample-specific) interpretations of attention weights—without providing insights into global temporal dynamics. The usage of an attention mechanism can provide an improvement for this issue. Temporal Fusion Transformer (TFT) [3] is able to analyze global temporal relationships and allows users to interpret global behaviors of the model on the whole dataset—specifically in the identification of any persistent patterns (e.g., seasonality or lag effects) and regimes present.
From the existing studies, it can be seen that current deep learning techniques have started to incorporate fused temporal and spatial features into prediction models. To solve the prediction problem in this paper, we need to systematically consider the hysteresis of OD flow, the hidden spatial relationships in complex networks, the different input features and their temporal characteristics, and the capability to capture both local and global dependencies in the temporal dimension.

3. Methodology

3.1. Short-Term Online Framework Considering Hysteresis of Passenger OD Flow

The short-term online framework for forecasting passenger OD flow should consider the hysteresis of passenger OD flow, as well as the hidden temporal and spatial relationships.

3.1.1. Solution for Passenger Hysteresis

Passenger OD flow has a hysteresis characteristic: the destination of a passenger is unknown until the passenger reaches the destination station, which poses a challenge for short-term online forecasting. Passengers cannot finish their trips within a short forecasting time step (e.g., 15 min or even 5 min); furthermore, in each time step, we cannot obtain the actual or complete OD distribution for dynamic forecasting. This characteristic has attracted attention in existing research. Gong et al. [21] used indication matrices to mask and neglect the potentially unfinished trips in urban rail networks.
In order to handle this issue, this paper proposes a framework that maps the time-dependent finished trips to the actual entered trips based on Equation (1).
$$\sum_{t} ODflow^{en}_{t} = ODflow^{un}_{t} + \sum_{t} ODflow^{fi}_{t} \tag{1}$$
where $ODflow^{en}_{t}$ represents the actual or entered OD flow of time step $t$, which can be obtained from historical data but cannot be calculated instantaneously. $ODflow^{en}_{t}$ is the object we wish to forecast and the guidance for system management. $ODflow^{fi}_{t}$ represents the finished OD flow of time step $t$, which can be obtained from historical data and instantaneous observations. The gap between the cumulative values of $ODflow^{en}_{t}$ and $ODflow^{fi}_{t}$ in the temporal dimension is caused by unfinished journeys; we denote this gap as $ODflow^{un}_{t}$, the unfinished OD flow of time step $t$.
Based on this relationship, it is clear that one of the core aspects of the forecasting process is how to establish a mapping between the observable finished passenger flow and the actual passenger flow.
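To make the finished/entered decomposition above concrete, the following Python sketch (our illustration; the record layout, field names and the 15 min step length are assumptions) counts the entered and finished OD flows from historical AFC-style transaction records, where destinations are eventually known. The cumulative gap between the two corresponds to the unfinished flow in Equation (1).

```python
# Illustrative sketch (assumed record layout): counting entered vs. finished
# OD flow per 15-min time step from AFC-style transaction records.
from collections import defaultdict

STEP = 15 * 60  # assumed time-step length in seconds (15 min)

def od_flows(records, now):
    """records: iterable of (origin, destination, t_entry, t_exit);
    t_exit is None for trips still in progress at observation time `now`."""
    entered = defaultdict(int)   # ODflow^en, keyed by (entry step, origin, destination)
    finished = defaultdict(int)  # ODflow^fi, keyed by the same entry step
    for o, d, t_in, t_out in records:
        step = int(t_in // STEP)
        entered[(step, o, d)] += 1              # only known once the trip has ended (historical data)
        if t_out is not None and t_out <= now:
            finished[(step, o, d)] += 1         # observable online
    return entered, finished
```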

3.1.2. Consideration for Temporal Relationships

Each OD pair $(i, j)$ is associated with a set of features and targets at each time step $t \in [0, m_{max}]$. To facilitate the following description, we hereby make conventions on the notation, model, and framework of the study. The framework of the forecasting task can be illustrated as Equation (2).
$$\hat{Y}^{A}_{pt,m} = \mathrm{GTFNN}\Big(Y^{A}_{pt,h,\pi},\ X^{F}_{pt,h,\pi},\ O_{pt,h,\pi},\ K_{pt,h,m},\ G\big(X^{F}_{pt,h,\pi},\ O^{s}_{pt,h,\pi},\ K^{s}_{pt,h,m}\big)\Big) \tag{2}$$
In Equation (2), $\hat{Y}^{A}_{pt,m}$ is the $m$-step-ahead ($m$ is the length of the predicted sequence) forecast result at prediction time $pt$. $\mathrm{GTFNN}(\cdot)$ refers to the proposed forecasting model. $X^{F}_{pt,h,\pi}$ is the vector of historical finished OD flow over the time range $[pt-h, pt-\pi]$, where $h$ refers to the horizon of the input sequence and $\pi$ refers to the blind spot for data updates caused by the information system update mechanism. $Y^{A}_{pt,h,\pi}$ is the target vector of forecasting, i.e., a vector of historical actual (entered) OD flow. $O_{pt,h,\pi}$ refers to the observed input features that can only be obtained from historical data, $K_{pt,h,m}$ refers to the known input features that can be obtained over the whole time range, and $G(X^{F}_{pt,h,\pi}, O^{s}_{pt,h,\pi}, K^{s}_{pt,h,m})$ represents the graphic structures, called multi-layer networks, that help the framework extract hidden localized features among stations. In Figure 1, we show the relationship of the above concepts on the time axis to facilitate the construction of the following model.
Theoretically, the OD pairs among stations should be fully connected, i.e., if the number of stations is $N$, the number of OD pairs is $N(N-1)$. However, the OD matrix is relatively sparse. Thus, we only consider the OD flow from station $i$ to the top-$\kappa$ stations that its passengers are most likely to reach, as well as the total OD flow to the remaining stations. The details of the mentioned notations are shown in Table 1.
The details of $O_{pt,h,\pi}$ and $K_{pt,h,m}$ are shown in Table 2. These two sets mainly include sequenced features (e.g., $O^{s}_{pt,h,\pi}$ and $K^{s}_{pt,h,m}$) and graphic features (e.g., $O^{g}_{pt,h,\pi}$ and $K^{g}_{pt,h,m}$) across time. These features are designed to help the model learn the relationships among the different components of the historical data.
Unfolding the features in Table 2 by the spatial and temporal dimensions, each element (e.g., $k^{j}_{t} \in K^{g}_{pt,h,m}$, $j \in \{4, 5, \ldots, 10\}$) can be denoted as:
$$o^{j}_{t} = \{o^{j}_{1,t}, o^{j}_{2,t}, \ldots, o^{j}_{i,t}, \ldots, o^{j}_{N,t}\}, \quad j \in [1,4] \tag{3}$$
$$k^{w}_{t} = \{k^{w}_{1,t}, k^{w}_{2,t}, \ldots, k^{w}_{i,t}, \ldots, k^{w}_{N,t}\}, \quad w \in [1,10] \tag{4}$$
Thus, each element $k^{w}_{t}$ can be mapped to the station set of the network.
The input of the model is denoted as:
$$I = \{X^{F}_{pt,h,\pi},\ O_{pt,h,\pi},\ K_{pt,h,m}\} \tag{5}$$
$$I_t = \{I^{1}_{t}, I^{2}_{t}, \ldots, I^{N}_{t}\} \tag{6}$$
$$I^{i}_{t} = \big\{x^{F}_{i,t},\ \{o^{j}_{i,t}\}_{j \in [1,4]},\ \{k^{w}_{i,t}\}_{w \in [1,10]}\big\} \tag{7}$$
Specifically, in Table 2, the features in the set $K^{s}_{pt,h,m}$ are dependent on the OD flow but are independent of the structure of the network. Thus, we define the input of the graph model in Equation (8).
$$IG_t = \{IG^{1}_{t}, IG^{2}_{t}, \ldots, IG^{N}_{t}\} \tag{8}$$
$$IG^{i}_{t} = \big\{x^{F}_{i,t},\ \{o^{j}_{i,t}\}_{j \in [1,4]},\ \{k^{w}_{i,t}\}_{w \in [4,10]}\big\} \tag{9}$$

3.1.3. Consideration for Spatial Relationships

The spatial relationships studied in this paper refer to the relationships among stations, which can influence the prediction of OD flow. We summarized four classes of spatial relationships with six derived graphs that exist in the urban transit system. A summarization of these relationships is shown in Table 3.
1. Station–line–network relationship
This is the basic topology relationship of the rail transit network that determines the connections between each pair of stations. We focused on two networks of this relationship:
a. Station network
The station network $G_{sn} = (N, E_{sn}, W_{sn})$ is constructed directly according to the connections of sections and stations of the studied rail transit network. An edge connecting nodes $i$ and $j$ is added to $E_{sn}$ if the corresponding stations $i$ and $j$ are connected in the real network.
b. Transfer network
The transfer network $G_{tn} = (N, E_{tn}, W_{tn})$ is constructed from the connections of a station with its nearby transfer stations. An edge connecting nodes $i$ and $j$ is added to $E_{tn}$ if the corresponding station $i$ and transfer station $j$ are connected by a path along the station network that does not contain any other transfer station.
2. Passenger flow characteristics relationship
When two stations are located in different areas but have the same function (e.g., office, education, business districts), it makes sense that the evolution of the passenger flow will be similar. We chose two kinds of feature to measure similarities.
a. Time series similarity
The daily passenger flow data along the time axis form a time series. The similarity among the time series belonging to different stations can be used to construct the edges and weights among stations. By these, the time series similarity graph $G_{ts} = (N, E_{ts}, W_{ts})$ is built by a similarity measurement with a threshold. We illustrate the details of how to construct $G_{ts}$ in the following part on measurement weights.
b. Peak hour factor similarity
Likewise, the peak hour factor was also chosen as a feature to measure the similarity between any pair of stations. The peak hour factor is calculated by the formula:
$$p_i = \frac{\max\left(Y^{A}_{pt,h_1,h_2}\right)}{\mathrm{avg}\left(Y^{A}_{pt,h_1,h_2}\right)} \tag{10}$$
where $h_1$ and $h_2$ are the operational beginning and end time points of a day. The function $\max(\cdot)$ finds the maximum OD flow in the vector $Y^{A}_{pt,h_1,h_2}$, and the function $\mathrm{avg}(\cdot)$ calculates the average flow of the off-peak hours.
By these, the peak hour factor similarity graph $G_{ps} = (N, E_{ps}, W_{ps})$ is built by a similarity measurement based on the second norm. We illustrate the details of how to construct $G_{ps}$ in the following part on measurement weights.
3. Line planning relationship
Although urban rail transit has a natural physical topology, passengers can only move with the trains. Thus, line planning has a huge impact on passenger OD flow, especially at a short-term resolution. A line plan is one of the most basic documents for rail transit operations. A line is often taken to be a route in a high-level infrastructure graph, ignoring precise details of platforms, junctions, etc. In addition, a line is a route in the network together with a stopping pattern for the stations along that route, as a line may either stop at or bypass a station on its route [41]. We define a line plan as a set of such routes, each with a series of way stations, a stopping pattern and a frequency, which together must meet certain targets such as providing minimal service at every station.
a. Line planning network
The line planning network $G_{lp} = (N, E_{lp}, W_{lp})$ describes the connected relationships formed by the line plan. This correlation has a huge impact on passenger travel. $E_{lp}$ and $W_{lp}$ are determined by the stopping pattern and the running frequency.
4. Correlation relationship
For representing potential large OD pairs or potential travel demand hidden in the urban rail transit, we built a network to represent the correlation relationships.
a. Correlation of flow evolution
OD flow between every two stations is not uniform, and the direction of passenger flow implicitly represents the correlation of two stations. For instance, if (I) the majority of the inflow of station $a$ streams to station $b$, or (II) the outflow of station $a$ primarily comes from station $b$, we believe that stations $a$ and $b$ are highly correlated.
According to the above discussions, we defined the graphs by nodes $N$, edges $E$ and weights $W$. In this context, we set node $n \in N$, where each node represents a real station. The graphs share the same nodes but have their own edges and weights, that is, $E \in \{E_{sn}, E_{tn}, E_{ts}, E_{ps}, E_{lp}, E_{cf}\}$ and $W \in \{W_{sn}, W_{tn}, W_{ts}, W_{ps}, W_{lp}, W_{cf}\}$. Specifically, we denote $W \in \mathbb{R}^{6 \times N \times N}$ as the weights of all edges, with each $W_z \in W$, $z \in \{sn, tn, ts, ps, lp, cf\}$. An overall design of the graphs is shown in Table 3.
We denote $W_z(i,j)$ as the weight of edge $(i,j)$. In Table 3, the fifth column summarizes the calculation methods of the different weights.
For calculating $W_{sn}(i,j)$ and $W_{tn}(i,j)$, $Se(i,j)$ represents the connection function of nodes (i.e., stations) $i$ and $j$: $Se(i,j) = 1$ if there exists a section between nodes $i$ and $j$, else $Se(i,j) = 0$, and we set $Se(i,i) = 0$. Likewise, $Tr(i,j)$ represents the connection function of a station with its nearby transfer stations: $Tr(i,j) = 1$ if there exists a path without another transfer station between station $i$ and transfer station $j$, or else $Tr(i,j) = 0$.
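As a small illustration of the two topology graphs, the sketch below (our own, with assumed data structures such as the section list and transfer-station set) builds $W_{sn}$ from the physical sections and $W_{tn}$ by searching from each station to the transfer stations reachable without passing through another transfer station.

```python
# Illustrative sketch (assumed inputs): building the station-network weights W_sn
# from physical sections, and the transfer-network weights W_tn by a search that
# stops at transfer stations, following Se(i, j) and Tr(i, j) above.
import numpy as np

def build_station_graph(n_stations, sections):
    """sections: list of (i, j) station index pairs that share a physical section."""
    W_sn = np.zeros((n_stations, n_stations))
    for i, j in sections:
        W_sn[i, j] = W_sn[j, i] = 1.0   # Se(i, j) = 1; Se(i, i) stays 0
    return W_sn

def build_transfer_graph(W_sn, transfer_stations):
    """Tr(i, j) = 1 if transfer station j is reachable from station i without
    passing through any other transfer station."""
    n = W_sn.shape[0]
    W_tn = np.zeros((n, n))
    for i in range(n):
        frontier, visited = [i], {i}
        while frontier:                      # graph search along the station network
            u = frontier.pop()
            for v in np.nonzero(W_sn[u])[0]:
                if v in visited:
                    continue
                visited.add(v)
                if v in transfer_stations:
                    W_tn[i, v] = 1.0         # stop: a transfer station was reached
                else:
                    frontier.append(v)       # continue past ordinary stations
    return W_tn
```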
Theoretically, similarity exists between any two stations. However, in large-scale networks, the number of station pairs is large, and the similarity of most of them is small, which leads to a complex similarity graph. For this reason, we used a combination of a clustering method and threshold selection to first find the potential groups to which the stations belong and then exclude the station pairs with small similarity. Thus, for calculating $Ts(i,j)$ in $W_{ts}(i,j)$, using the time series as inputs, we first obtained the similarity relationships of different stations based on the clustering method [42] and obtained clusters of stations, denoted as $C$. Then, for each $c \in C$, a predefined similarity threshold was set to control the number of similarities. Based on the finite similarity relationships, we built the edge set $E_{ts}$ and used Equations (17)–(20) to calculate $Ts(i,j)$ in category $c \in C$.
$$Ts(i,j) = \exp\left(-SBD\left(\{sum(y^{A}_{i,t})\}_{t \in T_h},\ \{sum(y^{A}_{j,t})\}_{t \in T_h}\right)\right) \tag{17}$$
$$sum(y^{A}_{i,t}) = \sum_{j \in [1,\kappa]} y^{A}_{i \sim j,\, t} \tag{18}$$
$$SBD(r_i, r_j) = 1 - \max_{w} \frac{CC_w(r_i, r_j)}{\lVert r_i \rVert \cdot \lVert r_j \rVert} \tag{19}$$
$$CC_w(r_i, r_j) = \begin{cases} \sum_{l=1}^{2m-w} r_{i,\, l+w-m}\, r_{j,\, l}, & w \geq m \\ CC_{2m-w}(r_j, r_i), & w < m \end{cases} \tag{20}$$
In Equation (17), the function $SBD$ [43] (shape-based distance) measures the distance between two temporal sequences of equal length. Specifically, the $SBD$ is calculated by Equation (19), where $r_i, r_j \in \mathbb{R}^{I}$ are the flow time series of ODs $i$ and $j$ and $\lVert \cdot \rVert$ refers to the second-norm operator. $CC_w(r_i, r_j)$ represents the cross-correlation between $r_i$ and $r_j$. In addition, we set $Ts(i,i)$ to 0.
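For reference, a minimal NumPy sketch of the shape-based distance and the resulting similarity weight is given below. It is our own illustration, and the negative sign in the exponential follows the similarity interpretation of Equation (17).

```python
# Minimal sketch of the shape-based distance (SBD) and the similarity weight
# Ts(i, j); our illustration, not the authors' code.
import numpy as np

def sbd(r_i, r_j):
    """Shape-based distance between two equal-length flow series (Eqs. (19)-(20))."""
    cc = np.correlate(r_i, r_j, mode="full")        # cross-correlation at all shifts w
    denom = np.linalg.norm(r_i) * np.linalg.norm(r_j)
    if denom == 0:
        return 1.0                                   # degenerate all-zero series
    return 1.0 - cc.max() / denom

def ts_weight(r_i, r_j):
    """Edge weight Ts(i, j) = exp(-SBD(r_i, r_j)), as in Equation (17)."""
    return np.exp(-sbd(r_i, r_j))
```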
Analogously, we calculated $Ps(i,j)$ for $W_{ps}(i,j)$ by the cluster–threshold–calculation framework. The classical k-means method is used for clustering, and the similarity is measured by the second-norm operator, as shown in Equation (21).
$$Ps(i,j) = \lVert P_i - P_j \rVert \tag{21}$$
where $P_i$ and $P_j$ are the peak hour factors of stations $i$ and $j$.
In Equation (15), the weight $W_{lp}(i,j)$ takes the line plans of the urban transit system into consideration. We set $l \in L$ as a train running line and $SL(l)$ as the stations on the route of $l$. $Lp(i,j) = 1$ represents that there exists an edge between stations $i$ and $j$ in terms of a train running line $l$, and $Fre(l)$ is the running frequency of the train running line $l$.
In Equation (16), $D(i,j)$ is the total number of passengers that traveled from station $j$ to station $i$ in the whole dataset.

3.2. Multi-Layer Graph Neural Networks Model (MGNN) for Structured Forecasting

3.2.1. Structure of the Multi-Layer Graph Neural Networks Model (MGNN)

In previous works [3,16,44], sequenced features and graphic features have both been proven useful for traffic state prediction. One key issue in designing the MGNN is how to fuse the temporal features and the spatial features during the training of the model. In the context of network-level passenger flow prediction, we specifically call the features that can be represented by the graphs the spatial features, and the others the temporal features. For fusing the temporal features and the spatial features, we adopted the designs of the Graph Convolution Gated Recurrent Unit (GC-GRU) and the Fully Connected Gated Recurrent Unit (FC-GRU) proposed in [22].

3.2.2. Convolution Operation for MGNN by GC-GRU

A GC-GRU [22] was introduced for the spatial–temporal feature learning of MGNN. By using this GC-GRU, we can effectively learn spatial–temporal features from the OD flow data. The convolution operation was designed as follows:
The parameters of this graph convolution are denoted as $\Theta$. Following the definition in Equation (9), the input of the graph convolution is set as $IG_t = \{IG^{1}_{t}, IG^{2}_{t}, \ldots, IG^{N}_{t}\}$. By the definition of convolution, the output feature $f(IG^{i}_{t}) \in \mathbb{R}^{d}$ of $IG^{i}_{t}$ is computed by:
$$\begin{aligned} f(IG^{i}_{t}) = \Theta_{l} \odot IG^{i}_{t} &+ \sum_{j \in N_{sn}(i)} W_{sn}(i,j)\, \Theta_{sn} \odot IG^{j}_{t} + \sum_{j \in N_{tn}(i)} W_{tn}(i,j)\, \Theta_{tn} \odot IG^{j}_{t} \\ &+ \sum_{j \in N_{ts}(i)} W_{ts}(i,j)\, \Theta_{ts} \odot IG^{j}_{t} + \sum_{j \in N_{ps}(i)} W_{ps}(i,j)\, \Theta_{ps} \odot IG^{j}_{t} \\ &+ \sum_{j \in N_{lp}(i)} W_{lp}(i,j)\, \Theta_{lp} \odot IG^{j}_{t} + \sum_{j \in N_{cf}(i)} W_{cf}(i,j)\, \Theta_{cf} \odot IG^{j}_{t} \end{aligned} \tag{22}$$
where $\odot$ is the Hadamard product and $\Theta \in \{\Theta_{sn}, \Theta_{tn}, \Theta_{ts}, \Theta_{ps}, \Theta_{lp}, \Theta_{cf}\}$. Specifically, $\Theta_{l} \odot IG^{i}_{t}$ is the self-loop term for all graphs, and $\Theta_{l}$ contains learnable parameters. $\Theta_{sn}$ denotes the parameters of the station network graph $G_{sn}$, and $N_{sn}(i)$ represents the neighbor set of node $i$ in $G_{sn}$. The other notations $\Theta_{tn}$, $\Theta_{ts}$, $\Theta_{ps}$, $\Theta_{lp}$, $\Theta_{cf}$, $N_{tn}(i)$, $N_{ts}(i)$, $N_{ps}(i)$, $N_{lp}(i)$ and $N_{cf}(i)$ have similar semantic meanings. $d$ is the dimensionality of the feature $f(IG^{i}_{t})$. In this manner, a node can dynamically receive information from highly correlated neighbor nodes. For convenience, we denote the graph convolution in Equation (22) as $IG_{t} \ast \Theta$ in the following.
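To make the convolution in Equation (22) concrete, the following PyTorch sketch applies one learnable transformation per graph plus a self-loop term. It is our illustration only; the class name, tensor shapes and the use of a linear layer for each $\Theta_{z}$ are assumptions, not the authors' released implementation.

```python
# Hedged PyTorch sketch of the multi-graph convolution in Equation (22):
# a self-loop term plus one weighted neighbour aggregation per graph.
import torch
import torch.nn as nn

class MultiGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, weight_mats):
        super().__init__()
        # weight_mats: list of the six N x N tensors W_sn, W_tn, W_ts, W_ps, W_lp, W_cf
        self.register_buffer("W", torch.stack(list(weight_mats)))
        self.theta_self = nn.Linear(in_dim, out_dim, bias=False)   # Theta_l (self-loop)
        self.theta_g = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in weight_mats]
        )

    def forward(self, x):
        # x: (batch, N, in_dim) node features at one time step
        out = self.theta_self(x)
        for W_z, theta_z in zip(self.W, self.theta_g):
            out = out + theta_z(W_z @ x)   # sum_j W_z(i, j) * transformed neighbour features
        return out                          # (batch, N, out_dim)
```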
Since the above-mentioned operation is conducted in the spatial dimension, we embedded the graph convolution into a Gated Recurrent Unit (GRU) to learn spatial–temporal features. Specifically, the reset gate $R_t = \{R^{1}_{t}, R^{2}_{t}, \ldots, R^{N}_{t}\}$, update gate $Z_t = \{Z^{1}_{t}, Z^{2}_{t}, \ldots, Z^{N}_{t}\}$, new information $N_t = \{N^{1}_{t}, N^{2}_{t}, \ldots, N^{N}_{t}\}$ and hidden state $H_t = \{H^{1}_{t}, H^{2}_{t}, \ldots, H^{N}_{t}\}$ are computed by:
$$H_t = \text{GC-GRU}(IG_t, H_{t-1}): \tag{23}$$
$$R_t = \sigma(\Theta_{rx} \ast IG_t + \Theta_{rh} \ast H_{t-1} + b_r) \tag{24}$$
$$Z_t = \sigma(\Theta_{zx} \ast IG_t + \Theta_{zh} \ast H_{t-1} + b_z) \tag{25}$$
$$N_t = \tanh\big(\Theta_{nx} \ast IG_t + R_t \odot (\Theta_{nh} \ast H_{t-1} + b_n)\big) \tag{26}$$
$$H_t = (1 - Z_t) \odot N_t + Z_t \odot H_{t-1} \tag{27}$$
where $\sigma$ is the sigmoid function and $H_{t-1}$ is the hidden state of the last iteration $t-1$. $\Theta_{rx}$, $\Theta_{rh}$, $\Theta_{zx}$, $\Theta_{zh}$, $\Theta_{nx}$ and $\Theta_{nh}$ denote the graph convolution parameters. $b_r$, $b_z$ and $b_n$ are bias terms. The feature dimensions of $R^{i}_{t}$, $Z^{i}_{t}$, $N^{i}_{t}$ and $H^{i}_{t}$ are set to $d$.
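A compact sketch of the GC-GRU cell defined by Equations (23)–(27) is shown below, reusing the MultiGraphConv sketch above for every gate; bias terms are omitted for brevity, and all names are illustrative assumptions.

```python
# Sketch of the GC-GRU cell of Equations (23)-(27); bias terms omitted for brevity.
import torch
import torch.nn as nn

class GCGRUCell(nn.Module):
    def __init__(self, in_dim, hidden_dim, weight_mats):
        super().__init__()
        gc = lambda d: MultiGraphConv(d, hidden_dim, weight_mats)
        self.gc_rx, self.gc_rh = gc(in_dim), gc(hidden_dim)   # reset gate convolutions
        self.gc_zx, self.gc_zh = gc(in_dim), gc(hidden_dim)   # update gate convolutions
        self.gc_nx, self.gc_nh = gc(in_dim), gc(hidden_dim)   # new-information convolutions

    def forward(self, x_t, h_prev):
        # x_t: (batch, N, in_dim); h_prev: (batch, N, hidden_dim)
        r = torch.sigmoid(self.gc_rx(x_t) + self.gc_rh(h_prev))    # Eq. (24)
        z = torch.sigmoid(self.gc_zx(x_t) + self.gc_zh(h_prev))    # Eq. (25)
        n = torch.tanh(self.gc_nx(x_t) + r * self.gc_nh(h_prev))   # Eq. (26)
        return (1.0 - z) * n + z * h_prev                           # Eq. (27)
```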

3.2.3. The Combination of GC-GRU and FC-GRU

The proposed MGNN can conduct convolution in the graph space, while the FC-GRU can learn the inputs from a sequenced view. We combined the GC-GRU and FC-GRU and denote the combination as GFGRU. Multiple GFGRUs are organized according to the Encoder–Decoder framework, and we built a sequence-to-sequence structure, as shown in Figure 2. Specifically, the inputs of a GFGRU are $I_t$ and $\tilde{H}_{t-1}$, where $\tilde{H}_{t-1}$ is the output hidden state of the last iteration.
In the GFGRU, the GC-GRU utilizes the accumulated information in $\tilde{H}_{t-1}$ to update the hidden state, rather than taking the original $H_{t-1}$ as input. Thus, Equation (23) becomes Equation (28).
$$H_t = \text{GC-GRU}(IG_t, \tilde{H}_{t-1}) \tag{28}$$
For the FC-GRU, we first transformed $I_t$ and $\tilde{H}_{t-1}$ into the embedded $I^{e}_{t} \in \mathbb{R}^{d}$ and $H^{e}_{t-1} \in \mathbb{R}^{d}$ with two fully connected (FC) layers. Then, we fed $I^{e}_{t}$ and $H^{e}_{t-1}$ into a common GRU [45] implemented with full connections to generate a global hidden state $H^{f}_{t} \in \mathbb{R}^{d}$, which can be expressed as:
$$I^{e}_{t} = FC(I_t) \tag{29}$$
$$H^{e}_{t-1} = FC(\tilde{H}_{t-1}) \tag{30}$$
$$H^{f}_{t} = \text{FC-GRU}(I^{e}_{t}, H^{e}_{t-1}) \tag{31}$$
Finally, we incorporated $H_t$ and $H^{f}_{t}$ to generate a combined hidden state $\tilde{H}_t = \{\tilde{H}^{1}_{t}, \tilde{H}^{2}_{t}, \ldots, \tilde{H}^{N}_{t}\}$ with a fully connected layer:
$$\tilde{H}^{i}_{t} = FC\big(H^{i}_{t} \oplus H^{f}_{t}\big) \tag{32}$$
where $\oplus$ denotes the feature concatenation operator.
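The GFGRU combination of Equations (28)–(32) can be sketched as follows: the graph branch runs the GC-GRU above, the global branch embeds the flattened inputs into a standard GRU cell, and a fully connected layer fuses the two per station. All shapes and names are illustrative assumptions.

```python
# Sketch of the GFGRU cell combining GC-GRU (Eq. (28)) and FC-GRU (Eqs. (29)-(31)),
# fused per station by a fully connected layer (Eq. (32)).
import torch
import torch.nn as nn

class GFGRUCell(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_nodes, weight_mats):
        super().__init__()
        self.gc_gru = GCGRUCell(in_dim, hidden_dim, weight_mats)
        self.embed_x = nn.Linear(n_nodes * in_dim, hidden_dim)       # FC embedding of I_t
        self.embed_h = nn.Linear(n_nodes * hidden_dim, hidden_dim)   # FC embedding of H~_{t-1}
        self.fc_gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)            # per-node fusion layer

    def forward(self, x_t, h_prev):
        # x_t: (batch, N, in_dim); h_prev: (batch, N, hidden_dim) is H~_{t-1}
        b, n, _ = x_t.shape
        h_local = self.gc_gru(x_t, h_prev)                           # Eq. (28)
        x_e = self.embed_x(x_t.reshape(b, -1))                       # Eq. (29)
        h_e = self.embed_h(h_prev.reshape(b, -1))                    # Eq. (30)
        h_global = self.fc_gru(x_e, h_e)                             # Eq. (31)
        h_global = h_global.unsqueeze(1).expand(-1, n, -1)           # broadcast to every station
        return self.fuse(torch.cat([h_local, h_global], dim=-1))     # Eq. (32)
```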

3.3. Graph–Temporal Fused Neural Network (GTFNN)

3.3.1. Overview of the Proposed GTFNN

When building the GTFNN framework, we need to address two main issues: firstly, the model has a comprehensive but complex input. The relationship (e.g., linear or nonlinear) between the complex inputs and the forecasting model is difficult to determine; second, time series data naturally have local features (e.g., change-points, etc.) and global features (e.g., series trends and attention at different time positions), and the framework we designed needs to take both types of features into account to ensure better forecast performance. Thus, three important designs are considered in GTFNN:
1. Gating structure GRN
A gating structure, the GRN, can decide whether non-linear learning is required. It can skip over any unused components of the architecture, which provides adaptive depth and network complexity to accommodate a wide range of datasets and scenarios.
2. Feature selection layer
While the designed features may be available, their relevance and specific contribution to the output are typically unknown. The GTFNN also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. We adopted the feature selection network proposed by Lim et al. [3] to tune the weights of the input features at each time step dynamically. This feature selection network was designed based on the gating mechanism.
3. Sequence-to-sequence layer with attention mechanisms
For learning both local and global temporal relationships from time-varying inputs, a sequence-to-sequence layer is employed for local processing, whereas long-term dependencies are captured using an interpretable multi-head attention block. The GTFNN employs a self-attention mechanism to learn long-term relationships across different time steps [3], which is modified from multi-head attention in transformer-based architectures [29,34] to enhance explainability.
Figure 3 shows the high-level architecture of GTFNN, where individual components are described in detail in the subsequent sections.

3.3.2. Gating Structure GRN

With the motivation of giving the model the flexibility to apply non-linear processing only where needed, the Gated Residual Network (GRN) was chosen as a building block of GTFNN. The GRN takes in a primary input $I_t$, as designed in Equation (6), and obtains:
$$GRN_{\omega}(I_t) = \mathrm{LayerNorm}\big(I_t + GLU_{\omega}(\eta_1)\big) \tag{33}$$
$$\eta_1 = W_{1,\omega}\, \eta_2 + b_{1,\omega} \tag{34}$$
$$\eta_2 = \mathrm{ELU}\big(W_{2,\omega}\, I_t + b_{2,\omega}\big) \tag{35}$$
where ELU is the Exponential Linear Unit activation function [46], $\eta_1 \in \mathbb{R}^{d_{model}}$ and $\eta_2 \in \mathbb{R}^{d_{model}}$ are intermediate layers, $\mathrm{LayerNorm}(\cdot)$ is standard layer normalization [47], and $\omega$ is an index to denote weight sharing. When $W_{2,\omega}\, a + b_{2,\omega} \gg 0$, the ELU activation acts as an identity function, and when $W_{2,\omega}\, a + b_{2,\omega} \ll 0$, the ELU activation generates a constant output, resulting in linear layer behavior. The structure of the GRN is shown in Figure 4.
In Equation (33), Gated Linear Units (GLUs) [48] are selected as the gating components to provide the flexibility to suppress any part of the architecture that can be skipped in a given scenario. Denoting $\gamma \in \mathbb{R}^{d_{model}}$ as the input of the GLU, we obtain Equation (36).
$$GLU_{\omega}(\gamma) = \sigma\big(W_{3,\omega}\, \gamma + b_{3,\omega}\big) \odot \big(W_{4,\omega}\, \gamma + b_{4,\omega}\big) \tag{36}$$
where $\sigma(\cdot)$ is the sigmoid activation function, $W_{(\cdot)} \in \mathbb{R}^{d_{model} \times d_{model}}$ and $b_{(\cdot)} \in \mathbb{R}^{d_{model}}$ are the weights and biases, $\odot$ is the element-wise Hadamard product, and $d_{model}$ is the hidden state size. The GLU allows GTFNN to control the extent to which the GRN contributes to the original input $I_t$, potentially skipping over the layer entirely if necessary, as the GLU outputs could all be close to 0 in order to suppress the nonlinear contribution [3].
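A minimal PyTorch sketch of the GRN and its GLU gate (Equations (33)–(36)) is given below; it is an illustration under the stated definitions, not the authors' released implementation.

```python
# Compact sketch of the Gated Residual Network of Equations (33)-(36).
import torch
import torch.nn as nn

class GRN(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.fc2 = nn.Linear(d_model, d_model)       # W_2, b_2 (ELU branch, Eq. (35))
        self.fc1 = nn.Linear(d_model, d_model)       # W_1, b_1 (Eq. (34))
        self.glu = nn.Linear(d_model, 2 * d_model)   # W_3/W_4, b_3/b_4 of the GLU (Eq. (36))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a):
        eta2 = torch.nn.functional.elu(self.fc2(a))   # Eq. (35)
        eta1 = self.fc1(eta2)                         # Eq. (34)
        gate, value = self.glu(eta1).chunk(2, dim=-1)
        glu_out = torch.sigmoid(gate) * value         # Eq. (36)
        return self.norm(a + glu_out)                 # Eq. (33): residual + LayerNorm
```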

3.3.3. Instance-Wise Feature Selection Layer

Instance-wise feature selection is provided by feature selection networks applied to all input features. Entity embeddings [49] are used as feature representations for categorical variables, and linear transformations are used for continuous variables, transforming each input variable into a $d_{model}$-dimensional vector. All inputs make use of separate feature selection networks with distinct weights.
Let $\xi^{(j)}_{t} \in \mathbb{R}^{d_{model}}$ denote the transformed input of the $j$-th feature at time $t$, with $\Xi_t = \big[\xi^{(1)\,T}_{t}, \ldots, \xi^{(m_\chi)\,T}_{t}\big]^{T}$ being the flattened vector of all past inputs at time $t$. Feature selection weights are generated by feeding $\Xi_t$ through a GRN, followed by a Softmax layer:
$$v_{\chi t} = \mathrm{Softmax}\big(GRN_{v_\chi}(\Xi_t)\big)$$
where $v_{\chi t} \in \mathbb{R}^{m_\chi}$ is a vector of feature selection weights.
At each time step, an additional layer of non-linear processing is employed by feeding each $\xi^{(j)}_{t}$ through its own GRN:
$$\tilde{\xi}^{(j)}_{t} = GRN_{\tilde{\xi}(j)}\big(\xi^{(j)}_{t}\big)$$
where $\tilde{\xi}^{(j)}_{t}$ is the processed feature vector for variable $j$. We note that each variable has its own $GRN_{\tilde{\xi}(j)}$, with weights shared across all time steps $t$. The processed features are then weighted by their variable selection weights and combined:
$$\tilde{\xi}_{t} = \sum_{j=1}^{m_\chi} v^{(j)}_{\chi t}\, \tilde{\xi}^{(j)}_{t}$$
where $v^{(j)}_{\chi t}$ is the $j$-th element of the vector $v_{\chi t}$.
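The feature selection layer can be sketched as follows, reusing the GRN sketch above. The extra projection from the flattened vector to the $m_\chi$ selection weights is an implementation assumption made so that the sketch stays self-contained.

```python
# Sketch of instance-wise feature selection: softmax selection weights from the
# flattened inputs, one GRN per feature, and a weighted combination.
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    def __init__(self, n_features, d_model):
        super().__init__()
        self.weight_grn = GRN(n_features * d_model)
        self.to_weights = nn.Linear(n_features * d_model, n_features)
        self.feature_grns = nn.ModuleList([GRN(d_model) for _ in range(n_features)])

    def forward(self, xi):
        # xi: (batch, n_features, d_model) transformed inputs at one time step
        flat = xi.flatten(start_dim=1)                                        # flattened vector Xi_t
        v = torch.softmax(self.to_weights(self.weight_grn(flat)), dim=-1)    # selection weights
        processed = torch.stack(
            [grn(xi[:, j]) for j, grn in enumerate(self.feature_grns)], dim=1
        )                                                                      # per-feature GRNs
        return (v.unsqueeze(-1) * processed).sum(dim=1)                        # weighted combination
```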

3.3.4. Sequence-to-Sequence Layer with Attention Mechanisms

The local features of the time series that are identified in relation to their surrounding values, such as anomalies and change-points, are significant. Thus, we applied a sequence-to-sequence layer to build an Encoder–Decoder structure, feeding $\tilde{\xi}_{t-h:t}$ into the encoder and $\tilde{\xi}_{t+1:t+m_{max}}$ into the decoder. This then generates a set of uniform temporal features that serve as inputs into the decoder itself, denoted by $\phi(t,n) \in \{\phi(t,-h), \ldots, \phi(t, m_{max})\}$ with $n$ being a position index. We also employed a gated skip connection over this layer:
$$\tilde{\phi}(t,n) = \mathrm{LayerNorm}\big(\tilde{\xi}_{t+n} + GLU_{\tilde{\phi}}(\phi(t,n))\big)$$
where $n \in [-h, m_{max}]$ is the position index.
(1) Temporal self-attention layer
Following the sequence-to-sequence layer and the gate layer, we applied a modified self-attention layer [3]. All temporal features are grouped into a matrix, i.e., $\Theta(t) = [\theta(t,-h), \ldots, \theta(t, m_{max})]^{T}$, and interpretable multi-head attention [3] is applied at each forecast time step, with the number of time steps feeding into the attention layer being $N = m_{max} + h + 1$:
$$B(t) = \mathrm{InterpretableMultiHead}\big(\Theta(t), \Theta(t), \Theta(t)\big) = \big[\beta(t,-h), \ldots, \beta(t, m_{max})\big]$$
Decoder masking [29,34] is applied to the multi-head attention layer to ensure that each temporal dimension can only attend to features preceding it.
$$\mathrm{InterpretableMultiHead}(Q, K, V) = \tilde{H}\, W_H$$
$$\tilde{H} = \tilde{A}(Q, K)\, V W_V = \left\{ \frac{1}{n_H} \sum_{h=1}^{n_H} A\big(Q W^{(h)}_{Q}, K W^{(h)}_{K}\big) \right\} V W_V = \frac{1}{n_H} \sum_{h=1}^{n_H} \mathrm{Attention}\big(Q W^{(h)}_{Q}, K W^{(h)}_{K}, V W_V\big)$$
$$A(Q, K) = \mathrm{Softmax}\big(Q K^{T} / \sqrt{d_{attn}}\big)$$
$$\mathrm{Attention}(Q, K, V) = A(Q, K)\, V$$
where $V \in \mathbb{R}^{N \times d_V}$, $K \in \mathbb{R}^{N \times d_{attn}}$ and $Q \in \mathbb{R}^{N \times d_{attn}}$ are the values, keys and queries of the attention mechanism, $d_V = d_{attn} = d_{model} / n_H$, and $n_H$ is the number of heads. $A(\cdot)$ is a normalization function. $W^{(h)}_{K} \in \mathbb{R}^{d_{model} \times d_{attn}}$ and $W^{(h)}_{Q} \in \mathbb{R}^{d_{model} \times d_{attn}}$ are head-specific weights for keys and queries, $W_V \in \mathbb{R}^{d_{model} \times d_V}$ are value weights shared across all heads, and $W_H \in \mathbb{R}^{d_{attn} \times d_{model}}$ is used for the final linear mapping.
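The interpretable multi-head attention can be sketched as follows: head-specific query/key projections, one value projection shared across all heads, head-averaged attention, and a final linear mapping. This is our illustration of the mechanism described above, with assumed class and variable names.

```python
# Sketch of interpretable multi-head attention: shared values, head-specific
# queries/keys, and averaging across heads before the final linear mapping.
import torch
import torch.nn as nn

class InterpretableMultiHead(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        d_attn = d_model // n_heads
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_attn) for _ in range(n_heads)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_attn) for _ in range(n_heads)])
        self.v_proj = nn.Linear(d_model, d_attn)   # value weights W_V shared across heads
        self.out = nn.Linear(d_attn, d_model)      # final linear mapping W_H
        self.scale = d_attn ** 0.5

    def forward(self, q, k, v, mask=None):
        # q, k, v: (batch, T, d_model); mask: (T, T) additive mask (-inf above the diagonal)
        value = self.v_proj(v)
        heads = []
        for wq, wk in zip(self.q_proj, self.k_proj):
            scores = wq(q) @ wk(k).transpose(-2, -1) / self.scale   # scaled dot products
            if mask is not None:
                scores = scores + mask                               # decoder masking
            heads.append(torch.softmax(scores, dim=-1) @ value)
        h_tilde = torch.stack(heads).mean(dim=0)                     # average across heads
        return self.out(h_tilde)
```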
The self-attention layer allows GTFNN to pick up long-range dependencies that may be challenging for RNN-based architectures to learn. Following the self-attention layer, an additional gating layer is also applied to facilitate training:
$$\delta(t,n) = \mathrm{LayerNorm}\big(\theta(t,n) + GLU_{\delta}(\beta(t,n))\big)$$
(2) Position-wise feed-forward layer
We applied additional non-linear processing to the outputs of the self-attention layer. This layer also makes use of GRNs:
$$\psi(t,n) = GRN_{\psi}\big(\delta(t,n)\big)$$
where the weights of $GRN_{\psi}$ are shared across the entire layer. As shown in Figure 5, we also applied a gated residual connection that skips over the entire transformer block, providing a direct path to the sequence-to-sequence layer and yielding a simpler model if additional complexity is not required, as shown below:
$$\tilde{\psi}(t,n) = \mathrm{LayerNorm}\big(\tilde{\phi}(t,n) + GLU_{\tilde{\psi}}(\psi(t,n))\big)$$
All the notation is summarized in Appendix A.

4. Numerical Experiments

4.1. Experiment Settings

4.1.1. Dataset

We collected a large number of trip transaction records from a real-world metro system and constructed a large-scale dataset, termed CQMetro. An overview of the dataset is summarized in Table 4.
This dataset was built based on the rail transit system of Chongqing, China. The transaction records were collected from 1 January to 15 February 2019, with a daily passenger flow of 1.72 million on average. The total number of OD pairs in the whole network is 14,314. Each record contains the information of the entry/exit station and the corresponding timestamps. In this period, 170 stations operated normally, and they were connected by 224 physical edges (i.e., the sections between stations). For each station, we measured the designed features every 15 min. The data of the first 30 days were used for training, and the last 15 days were used for training and testing, while the OD flows of the following day were used for validation. In particular, 1–3 January and 4–10 February 2019 correspond to the New Year’s Day and Chinese New Year holidays.

4.1.2. Details for Implementing GTFNN

We implemented our GTFNN with the deep learning framework PyTorch. The lengths of the input and output sequences are listed in Table 4. Hyperparameter optimization was conducted via random search with 60 iterations. The search ranges of all hyperparameters and the selected optimal hyperparameters can be found in Table 5. We applied Adam [50] to optimize our GTFNN for 200 epochs by minimizing the joint loss function between the predicted results and the corresponding ground truths. The joint loss function [38], summed across every time step output, is:
$$\mathcal{L}(\Omega, W) = \sum_{y_t \in \Omega} \sum_{\tau=1}^{m_{max}} \frac{QL\big(y_t,\ \hat{y}_{t-\tau,\,\tau}\big)}{M\, m_{max}}$$
$$QL(y, \hat{y}) = \lvert y - \hat{y} \rvert$$
where $\Omega$ is the domain of training data containing $M$ samples.
During training, dropout is applied before the gating layer and layer normalization.
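Under the definitions above, the training objective reduces to the absolute error averaged over the $M$ samples and the $m_{max}$ forecast steps; a minimal sketch (our illustration, with assumed tensor shapes) is:

```python
# Sketch of the training objective: |y - y_hat| averaged over samples and horizons.
import torch

def gtfnn_loss(y_true, y_pred):
    """y_true, y_pred: tensors of shape (M, m_max) with multi-step targets/predictions."""
    return (y_true - y_pred).abs().mean()
```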

4.1.3. Evaluation Metrics

Following previous works [25,51], we evaluated the performance of methods with Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Symmetric Mean Absolute Percentage Error (SMAPE), which are defined as:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{X}_i - X_i\big)^2$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{X}_i - X_i\big)^2}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \big\lvert \hat{X}_i - X_i \big\rvert$$
$$SMAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{2\, \lvert \hat{X}_i - X_i \rvert}{\lvert \hat{X}_i \rvert + \lvert X_i \rvert}$$
where $n$ is the number of testing time steps; for example, if we forecast a 60 min horizon at a 15 min granularity, then $n = 4$. $\hat{X}_i$ and $X_i$ denote the predicted ridership and the ground-truth ridership, respectively. Note that $\hat{X}_i$ and $X_i$ have been transformed back to the original scale with an inverted z-score normalization. Our GTFNN is developed to predict the metro ridership of the next two steps. In the following experiments, we measure the errors of each time interval separately. For the 15 min granularity OD prediction, there may be true values of 0; thus, SMAPE is used instead of MAPE to describe the relative accuracy of the prediction. Unlike MAPE (Mean Absolute Percentage Error), SMAPE values range from 0 to 200%. For all metrics, values closer to 0 mean higher prediction accuracy.
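For completeness, the four metrics can be computed as in the following NumPy sketch; the small epsilon in the SMAPE denominator is our own guard against the zero-flow intervals mentioned above.

```python
# Sketch of the evaluation metrics on predictions already mapped back to the
# original scale (inverse z-score normalization applied beforehand).
import numpy as np

def evaluate(y_pred, y_true, eps=1e-8):
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    smape = np.mean(2.0 * np.abs(err) / (np.abs(y_pred) + np.abs(y_true) + eps))
    return mse, rmse, mae, smape   # SMAPE is reported as a percentage in the tables
```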

4.2. Comparison with State-of-the-Art Methods

In this section, we compared our GTFNN with four baseline methods, including:
Gradient Boosting Decision Trees (GBDT) [52]: GBDT is a weighted ensemble method that consists of a series of weak estimators. We implemented this method with the Python package scikit-learn. The number of boosting stages is set to 100, and the maximum depth of each estimator is 4. A gradient descent optimizer is applied to minimize the loss function.
Long Short-Term Memory (LSTM) [53]: This network is a simple Seq2Seq model, and its core module consists of two fully connected LSTM layers. The hidden size of each LSTM layer is set to 256. Its hyper-parameters are the same as ours.
Gated Recurrent Unit (GRU) [54]: With the similar architecture of the previous model, this network replaces the original LSTM layers with GRU layers. The hidden size of GRU is also set to 256. Its hyper-parameters are the same as ours.
Graph-WaveNet [55]: This method develops an adaptive dependency matrix to capture the hidden spatial dependency and utilizes a stacked dilated 1D convolution component to handle very long sequences. We implemented this method with its official code.

4.3. Performance in Different Scenarios

In this part, we discuss the prediction results of the different methods on weekends, weekdays and holidays and use four metrics, i.e., MSE, RMSE, MAE and SMAPE, for evaluation. Each metric is the average of the predicted results of the 14,314 OD pairs, with extreme values excluded. The performances of all methods are summarized in Table 6, Table 7 and Table 8.
The predictions for the weekdays are shown in Table 6. To predict the ridership at the next four consecutive time intervals (60 min), the baseline LSTM obtains a SMAPE score of 35.17% on CQMetro, ranking last among all the methods. Compared to LSTM and GRU, the performance of GBDT and Graph-WaveNet was much better. The prediction ability of the GTFNN model is the best among the above models: its SMAPE is 14.16%, its MAE is only 0.83, and the average prediction error for each period is lower than 1 person. It can be seen that GTFNN fully combines the advantages of the graph neural network model and the time series model and has good passenger flow prediction ability.
The predictions for the weekends are shown in Table 7. Similar to working days, the GTFNN model still predicted better than the other models, with a SMAPE of 13.21% and MAE, MSE and RMSE metrics of 0.78, 4.02 and 2.00, respectively. In general, the prediction accuracy of the different models for weekend passenger flows is higher than that for weekday passenger flows, with the SMAPE values of the different models decreasing by about 1–4%.
The predictions for the holidays are shown in Table 8. From the results, the prediction results of different models for holidays are not very satisfactory. The GTFNN model still predicted better than the other models, with SMAPE of 47.96% and MAE, MSE and RMSE metrics of 4.41, 35.08 and 5.92, respectively.

4.4. The Rank of Features

Figure 6 shows the variable importance for the CQMetro dataset. From the model description in Section 3, the input features include the observed input feature and known input feature classes. For the encoder, most of the known input features are more important than the observed input features, with $k^{1}_{t}$ (hour of the day) being the most important, at more than 20%. The importance of the "passenger flow of the first 2–4 periods" ($k^{5}_{t}$–$k^{7}_{t}$) exceeds 10%. Among the observed input features, $o^{3}_{t}$ (max. passenger flow) is the most important. In the decoder, the "passenger flow in the two time steps before the latest update time" ($k^{5}_{t}$) is much more important than all the other features, at more than 60%. The importance of each of the remaining features is less than 10%.

5. Discussion

The results of the overall metrics analysis of Section 4.3 showed the excellent performance of the GTFNN model. To further analyze the forecasting performance in different scenarios, this section is designed from two aspects: the prediction results and characteristics of different ODs on weekends and weekdays are analyzed; furthermore, the comparison between forecasting results of ordinary days and holidays is discussed to analyze the applicability of the studied model in forecasting OD passenger flow from different sources.

5.1. Comparison of Different ODs

The following four typical ODs were selected for further discussion and analysis.
(1) OD 1: This OD consists of a hub-type station and a station in the CBD. The selected hub-type station is located close to the city’s high-speed rail passenger hub, mainly serving long-distance passengers entering and leaving the city, and it is an interchange station between the high-speed rail network and the urban rail transit. The CBD station is located in the CBD of the city, which is the most prosperous part of the city, with large passenger flow.
(2) OD 2: This OD consists of a station in a residential area and a station in the CBD. The selected station in the residential area is located in the main residential area of the city and mainly serves the commuting needs of passengers in the residential area. The station in the CBD is the same as the station in OD 1.
(3) OD 3: This OD consists of a station in a residential area and a station in a suburban area. The selected station in the residential area is the same as that in OD 2. The suburban-type station is the starting and ending station of the line; the station is far away from the city hub, trains need to make a turnaround there, and the daily passenger flow is small.
(4) OD 4: This OD consists of a hub-type station and a station in a suburban area. The selected hub-type station and the station in the suburban area are the same as the stations in OD 1 and OD 3, respectively.
Figure 7, Figure 8, Figure 9 and Figure 10 show the prediction results of the four different OD pairs on weekdays or weekends. The blue dashed line represents the prediction result, and the red solid line represents the actual flow. The x-axis represents the time step, and the y-axis represents the passenger flow. Each time step represents 15 min of the day; thus, the whole day is divided into 1440/15 = 96 periods (since the passenger flow at night is 0, those periods are not shown among the operation periods).

5.1.1. Forecasting of OD 1 in Weekdays and Weekends

The main purposes of the passenger flow between hubs and CBDs are business activities, shopping, consumption, and attending large events. Hubs and CBDs attract frequent economic activities and commercial activities, which lead to strong population mobility. Passenger flow in these areas is extremely large. As shown in Figure 7, the peak passenger flow exceeds 20 people/15 min on weekdays and 40 people/15 min on weekends. Table 9 shows the metrics of forecasting results. The passenger flow forecast results on weekends are more accurate compared to those on weekdays, where MAE is reduced by 0.20 and SMAPE is reduced by 3.3%.

5.1.2. Forecasting of OD 2 in Weekdays and Weekends

Jobs or shopping opportunities offered by enterprises located in CBD attract people from near and far, which causes large passenger flow in OD 2. As shown in Figure 8, its traffic peak is close to 15 people/15 min on weekdays, while its traffic peak drops to 10 people/15 min on weekends due to the reduction of commuter traffic. In terms of the forecast metrics (Table 10), the forecast accuracy of passenger flow on weekdays is slightly higher than that on weekends. In particular, MAE is reduced by 0.02, and SMAPE is reduced by 2.24%.

5.1.3. Forecasting of OD 3 in Weekdays and Weekends

Residential–suburban stations are more diverse in terms of the main purposes of passenger traffic. As usual, suburban stations have smaller passenger flows, but the residential station area selected in this case still has some passenger flows due to its physical proximity to the suburban station. As shown in Figure 9, its peak passenger flow is close to 10 passengers/15 min on weekdays, and drops to 6 passengers/15 min due to the reduced economic activity in the city on weekends. In terms of forecasting metrics (Table 11), the forecast accuracy of passenger flow on weekends is slightly higher than that of weekdays. In particular, MAE is reduced by 0.04 and SMAPE is reduced by 2.24%.

5.1.4. Forecasting of OD 4 in Weekdays and Weekends

The main purpose of passenger flow between hubs and suburban stations is to commute to urban hubs or to engage in commercial activities. Since suburban stations are less populated nearby, hubs are extremely attractive to suburban stations located at the edge of the city. The urban functions of suburban areas are largely dependent on the existence of urban hubs; therefore, suburban stations would have lower passenger volumes. As shown in Figure 10, its passenger peak is just over 6 passengers/15 min on weekdays and drops to less than 6 passengers/15 min due to the reduced urban economic activity on weekends. In terms of forecasting metrics (Table 12), the accuracy of passenger flow forecasting on weekdays is slightly higher than that on weekends. In particular, MAE is reduced by 0.02, and SMAPE is reduced by 5.83%.

5.1.5. Analyses of the Model Performance

From the above characteristics, we can analyze the following points.
(1) The predicted and actual values of OD 1 (on both weekdays and weekends) exceed those of the other three OD pairs. Moreover, because of the large passenger flow, the prediction ability for OD 1 is much better than for the other three OD pairs.
(2) The absolute MAE scores of all ODs are small regardless of the SMAPE differences. This shows that the absolute prediction accuracy of the model is relatively high.
(3) Because the passenger flow in a single period is small, small errors can cause large relative errors. The MAE scores of the above four OD pairs differ little, but the larger the passenger flow, the smaller the SMAPE score.
(4) The SMAPE scores of OD 1 are much lower than those of the other three OD pairs.
In general, it can be seen from the above discussion that for OD prediction, the size of the OD passenger flow affects the relative evaluation metrics of the prediction, especially when the passenger flow is small. ODs with small passenger flows also show poor prediction ability when evaluated with SMAPE. This conclusion correlates with the finding in Section 4.4 that max. passenger flow is the most important observed input feature.
In terms of absolute metrics, such as MAE and RMSE, the prediction accuracy of the GTFNN model is high both overall and locally, with MAE scores less than 0.4 for all six pairs of ODs mentioned above. We believe that for the urban rail passenger flow, when the prediction error is less than 1, the impact on the overall line is relatively small, and it is within the acceptable range.

5.2. Comparison between Ordinary Days and Holidays

As the prediction results in Section 4.3 show, the prediction performance for holidays is worse than that for ordinary days. From the perspective of MAE, the score of holidays (4.41) is 3.58 and 3.63 higher than that of weekdays (0.83) and weekends (0.78), respectively. Likewise, from the perspective of MSE, the score of holidays (35.08) is 30.66 and 31.06 higher than that of weekdays (4.42) and weekends (4.02), and from the perspective of SMAPE, the score of holidays (47.96%) is 33.80% and 34.75% higher than that of weekdays (14.16%) and weekends (13.21%). Although the passenger flow on holidays is larger than that on ordinary days, it is clear that the passenger flow characteristics of holidays are not yet well captured by the model. On the one hand, the model was not designed to capture the characteristics of holidays (i.e., uncommonly large passenger flow). On the other hand, there is only a small amount of holiday data in the training data, and the passenger flow characteristics vary between holidays, which poses a great challenge to the model’s prediction during holidays.

6. Conclusions

In this work, we proposed a Graph–Temporal Fused Neural Network (GTFNN) to address the problem of online short-term forecasting of network-level origin–destination (OD) flows. To solve the key issues of online flow forecasting, the proposed GTFNN makes contributions in the four aspects below.
(1)
The GTFNN takes finished OD flow and a series of known and observable features as the input and performs multi-step prediction.
(2)
Unlike previous works that focus on either the spatial or the temporal relationships of OD flow evolution, the proposed method captures both spatial and temporal characteristics.
(3)
In order to learn spatial characteristics, a multi-layer graph neural network is proposed based on hidden relationships in the rail transit network. The graph convolution is then embedded into a Gated Recurrent Unit to learn spatial–temporal features (a minimal code sketch of this structure is given after this list).
(4)
Based on the sequence-to-sequence framework, a Graph–Temporal Fused Deep Learning model was built, and an attention mechanism was attached to fuse local and global temporal dependencies for short-term online OD passenger flow prediction.
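As a rough illustration of point (3), the sketch below shows one way to embed graph convolution into the GRU gates. It is a minimal single-graph PyTorch sketch: a single row-normalized adjacency matrix stands in for the paper's six weighted graphs, and names such as `GraphConvGRUCell` are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GraphConvGRUCell(nn.Module):
    """Minimal GRU cell whose input/hidden transforms are preceded by a graph convolution."""

    def __init__(self, in_dim, hid_dim, norm_adj):
        super().__init__()
        self.register_buffer("A", norm_adj)                     # (N, N) row-normalized adjacency
        self.gates = nn.Linear(in_dim + hid_dim, 2 * hid_dim)   # reset gate R_t and update gate Z_t
        self.cand = nn.Linear(in_dim + hid_dim, hid_dim)        # candidate state N_t

    def graph_conv(self, x):
        # Propagate each station's features over its neighbors: (B, N, F) -> (B, N, F)
        return torch.einsum("ij,bjf->bif", self.A, x)

    def forward(self, x_t, h_prev):
        # x_t: (B, N, in_dim), h_prev: (B, N, hid_dim)
        xh = self.graph_conv(torch.cat([x_t, h_prev], dim=-1))
        r, z = torch.sigmoid(self.gates(xh)).chunk(2, dim=-1)
        n = torch.tanh(self.cand(self.graph_conv(torch.cat([x_t, r * h_prev], dim=-1))))
        return (1 - z) * n + z * h_prev                          # new hidden state H_t

# Toy usage: 4 stations, 3 input features per station, hidden size 8.
A = torch.eye(4) + torch.rand(4, 4)
A = A / A.sum(dim=1, keepdim=True)
cell = GraphConvGRUCell(3, 8, A)
h = torch.zeros(2, 4, 8)
h = cell(torch.randn(2, 4, 3), h)
```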
Experiments based on real-world data collected from Chongqing's rail transit system showed that the proposed model performed better than other models. For instance, in the weekday forecasting scenario, the SMAPE score of GTFNN was about 14.16%, roughly 5 to 20 percentage points lower than those of the other methods, and its MAE scores ranged from approximately 0.1 to 0.8, which is suitable for practical applications. By comparing several representative ODs, we found that ODs with small average passenger flows are harder to forecast; forecasting such small-flow ODs should be a focus of future research.
The proposed model can also analyze the weights of different features. The weights of observed and known input features differ in the encoder: the most important observed input feature is "max. passenger flow", and the most important known input feature is "hour of the day".
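The feature-importance readout mentioned above can be sketched as averaging the feature selection weights (the softmax weights v_χt listed in Appendix A) over all samples and time steps. The snippet below is a simplified illustration only, not the exact post-processing used to produce Figure 6, and the dummy weights are randomly generated.

```python
import numpy as np

def variable_importance(selection_weights, feature_names):
    """Rank features by their average selection weight.

    selection_weights: (num_samples, num_time_steps, num_features) softmax weights
                       produced by the feature selection network.
    """
    mean_w = selection_weights.mean(axis=(0, 1))   # average over samples and time steps
    order = np.argsort(mean_w)[::-1]
    return [(feature_names[i], float(mean_w[i])) for i in order]

# Toy example with the four observed input features of the encoder (Table 2).
names = ["finished flow (t+1)", "finished flow (t+2)", "max. passenger flow", "min. passenger flow"]
w = np.random.dirichlet(np.ones(4), size=(100, 382))   # dummy softmax weights
print(variable_importance(w, names))
```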
Obtaining accurate OD passenger flow data in time is vital to support transportation organization in a rail transit system. Accurate OD prediction results allow operators to understand passenger demand between different ODs of the network at a given future time, thereby supporting the dynamic and efficient deployment of transportation resources; well-designed train line plans, timetables and station passenger flow plans can be derived from them. From the perspective of the development of rail transit passenger flow prediction, the traditional prediction of station in/out volumes can no longer meet practical application demands, because it only gives the volumes collected and dispersed at each station while the flow distribution over the network remains unknown. Accurate OD prediction will therefore become an increasingly popular research topic, and methods such as the one studied in this paper, which combine temporal and spatial relationships in the prediction system, will be one of the supports for solving it. The OD prediction model studied in this paper thus has strong practical significance.
Nevertheless, the proposed model still has some limitations. By comparing the results for ordinary days and holidays (with uncommonly large passenger flows), we find that the method studied in this paper cannot yet guarantee accuracy across different passenger flow scenarios. In the future, the model and algorithm need to be optimized for scenarios with sudden large passenger flows to meet the needs of on-time forecasting. In addition, the six graphs used in the current model's graph neural network share the same weights, and the effect of each graph on prediction accuracy has not been studied. A better understanding of the importance of the different graph relations could help reduce model complexity and improve training efficiency.

Author Contributions

Data curation, H.Z.; Funding acquisition, H.Z.; Methodology, H.Z., Z.H. and K.Y.; Supervision, J.C.; Writing—original draft, H.Z., Z.H. and K.Y.; Writing—review and editing, J.C. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Postdoctoral Science Foundation, grant numbers 2021T140003 and 2021M700186.

Data Availability Statement

Not applicable.

Acknowledgments

Thanks to Guofei Gao for his support and help in data processing and for providing hardware.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

CBD	Central Business District
AFC	Auto Fare Collection
ARIMA	Auto-Regressive Integrated Moving Average
ELU	Exponential Linear Unit
FC	Fully Connected
FC-GRU	Fully Connected Gated Recurrent Unit
FGraph	Functional Similarity Graph
GBDT	Gradient Boosting Decision Trees
GC-GRU	Graph Convolution Gated Recurrent Unit
GCN	Graph Convolution Networks
GLU	Gated Linear Units
GRN	Gated Residual Network
GRU	Gated Recurrent Unit
GTFNN	Graph–Temporal Fused Neural Network
LSTM	Long Short-Term Memory
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
MGNN	Multi-Layer Graph Neural Networks Model
MQRNN	Multi-Horizon Quantile Recurrent Forecaster
MSE	Mean Square Error
NGraph	Neighborhood Graph
OD	Origin–Destination
RMSE	Root Mean Square Error
SBD	Shape-Based Distance
SMAPE	Symmetric Mean Absolute Percentage Error
STSGCN	Spatial–Temporal Synchronous Graph Convolutional Networks
TGraph	Transportation Connectivity Graph
TFT	Temporal Fusion Transformer

Appendix A

Table A1 lists the notation used in this paper and the descriptions.
Table A1. Notation and the descriptions.
IndexNotationDescription
1 t The time   step   t [ 0 , m m a x ]
2 m m a x The maximum of   t
3 O D f l o w t e n The   actual   or   entered   OD   flow   of   time   step   t , which can be obtained in historical data but cannot be calculated instantaneously
4 O D f l o w t e n The object we wish to forecast and the guidance of system management
5 O D f l o w t f i The   finished   OD   flow   of   time   step   t , which can be obtained from historical data and instantaneous observations
6 Y ^ p t , m A The   m - step - ahead   ( m   is   the   length   of   the   predicted   sequence )   forecast   result   at   prediction   time   p t
7 G T F N N ( · ) The proposed forecasting model
8 C C w ( r i , r j ) The   cross - correlation   between   r i   and   r j
Graph convolution
9 m The length of the predicted sequence
10 p t The prediction time
11 h The length of the input sequence
12 π The blind spot for data updates caused by the information system update mechanism
13 N The number of stations
14 X p t , h , π F The   finished   passenger   flow   of   time   steps   { p t h + 1 , p t h + 2 , p t π } ,   X p t , h , π F { X p t n + 1 F , X p t n + 2 F , X p t π F }
15 X t F The   finished   OD   flows   of   time   step   t   with   the   origin   station   i [ 1 , N ] ,   X t F { x 1 , t F , x 2 , t F , , x i , t F , , x N , t F } κ × N
16 x i , t F The   top   κ 1   finished   OD   pairs   that   origin   from   station   i   at   time   step   t ,   x i , t F { x i ~ 1 , t F , x i ~ 2 , t F , , x i ~ j , t F , , x i ~ κ 1 , t F , x i ~ κ , t F } κ
17 x i ~ κ , t I C The   rest   of   the   finished   OD   flow   of   station   i
18 x i ~ j , t F The   finished   OD   flow   traveled   from   station   i   to   station   j   at   time   step   t
19 x i ~ κ , t C The   rest   of   the   actual   ( or   entered )   OD   flow   of   station   i
20 A The   variable   represents   the   actual   ( or   entered )   passenger   flow   of   time   steps   { p t + 1 , p t + 2 , p t + m } ,
21 Y ^ p t , m A The   vector   of   predicted   actual   ( or   entered )   passenger   flow   sequence ,   Y ^ p t , m A { Y p t + 1 A , Y p t + 2 A , Y p t + m A }
22 Y p t , h , π A The   actual   passenger   flow   of   time   steps   { p t h + 1 , p t h + 2 , p t π } ,   where   h > 0   and   π < h , Y p t , h , π A { Y p t n + 1 A , Y p t n + 2 A , Y p t π A }
23 Y t A The   actual   ( or   entered )   OD   flows   of   time   step   t   with   the   origin   station   i [ 1 , N ] ,   Y t A { y 1 , t A , y 2 , t A , , y i , t A , , y N , t A } κ × N
24 y i , t A The   top   κ 1   actual   ( or   entered )   OD   pairs   that   origin   from   station   i   at   time   step   t ,   y i , t A { y i ~ 1 , t A , y i ~ 2 , t A , , y i ~ j , t A , , y i ~ κ 1 , t A , y i ~ κ , t A } κ
25 y i ~ j , t A The   actual   ( or   entered )   OD   flow   traveled   from   station   i   to   station   j   at   time   step   t
26 O p t , h , π The   observed   input   features   that   can   only   be   obtained   in   historical   data ,   O p t , h , π { O p t , h , π s , O p t , h , π g }
27 K p t , h , m The   known   input   features   that   can   be   obtained   in   the   whole   range   of   time ,   K p t , h , m { K p t , h , m s , K p t , h , m g }
28 O p t , h , π s The set of finished OD passenger flow
29 O p t , h , π g The set of horizontal passenger flow
30 K p t , h , m s The sequenced known input features
31 K p t , h , m g The graphic known input features
32 O The observed input features
33 K The known input features
34 o t 1 Finished OD passenger flow in next time step
35 o t 2 Finished OD passenger flow in next two time steps
36 o t 3 Max. passenger flow
37 o t 4 Min. passenger flow
38 k t 1 Hour of the day
39 k t 2 Day of the week
40 k t 3 The weather (Sunny, rainy, and cloudy)
41 k t 4 Passenger flow in the latest update time step
42 k t 5 Passenger flow in the two time steps before the latest update time
43 k t 6 Passenger flow in the three time steps before the latest update time
44 k t 7 Passenger flow in the four time steps before the latest update time
45 k t 8 Passenger flow in the same time step of the previous day
46 k t 9 Passenger flow in the same time step last week
47 k t 10 Passenger flow for the same time step two weeks ago
48 o t j The   vector   of   o 1 , t j j [ 1 , 4 ] collected by stations
49 k t w The   vector   of   k i , t w w [ 1 , 10 ] collected by stations
50 I The input of the model
51 I t The   part   of   input   at   time   step   t
52 I t i The   input   of   station   i   at   time   step   t
53 I G t The   input   of   graph   convolution   at   time   step   t
54 I G t i The   input   of   graph   convolution   of   station   i   at   time   step   t
55 G s n The graph of station network
56 G t n The graph of transfer network
57 G t s The graph of time series similarity
58 G p s The graph of peak hour factor similarity
59 G l p The graph of line planning network
61 G c f The graph of correlation of flow evolution
62 p i The peak hour factor
63 W z The   weight   of   graphs   z { s n , t n , t s , p s , l p , c f }
64 S e ( i ,   j ) The   connection   function   of   node   ( i . e . ,   station )   i   and   j
65 T s ( i ,   j ) The   time   series   similarity   between   station   i   and   j
66 T r ( i ,   j ) The connection function of a station with the nearby transfer stations, T r ( i ,   j ) = 1   if   there   exists   a   path   without   other   transfer   station   between   station   i   and   transfer   station   j ,   or   else   T r ( i ,   j ) = 0
67 r i The   passenger   flow   time   series   of   OD   i
68 r j The   passenger   flow   time   series   of   OD   j
70 P s ( i ,   j ) The   peak   hour   factor   similarity   between   station   i   and   j
71 P i The   peak   hour   factor   of   station   i
72 P j The   peak   hour   factor   of   station   j
73 F r e ( l ) The running frequency of the train running line
74 D ( i ,   j ) The   total   number   of   passengers   that   traveled   from   station   j   to   station   i in the whole dataset
75 Θ s n The parameters of the corresponding networks
76 Θ t n
77 Θ t s
78 Θ p s
79 Θ l p
80 Θ c f
81 N s n ( i ) The   neighbor   set   of   node   i of the corresponding networks
82 N t n ( i )
83 N t s ( i )
84 N p s ( i )
85 N l p ( i )
86 N c f ( i )
87 R t The reset gate
88 Z t The update gate
89 N t The new information
90 H t The hidden state of GC-GRU
91 σ The sigmoid function
92 H t 1 The   hidden   state   at   last   t 1 iteration of GC-GRU
93 H t 1 g The   hidden   state   of   the   multi - layer   graphic   structures   at   the   t 1 iteration of GC-GRU
94 Θ r x The graph convolution parameters of the corresponding networks
95 Θ z x
96 Θ z h
97 Θ n x
98 Θ n h
99 b r , The   bias   terms   of   R t ,   Z t ,   and   N t
100 b z
101 b n
102 R t i The   i th   element   of   R t ,   where   i is the index of station
103 Z t i The   th   element   of   Z t ,   where   i is the index of station
104 N t i The   i th   element   of   N t ,   where   i is the index of station
105 H t i The   i th   element   of   H t ,   where   i is the index of station
106 H ˜ t The   combined   hidden   state   at   time   step   t
107 H t f The   hidden   state   generated   by   FC - GRU   at   time   step   t
108 The operator of feature concatenation
109 E L U The exponential linear unit activation function
110 η 1 The intermediate layers of ELU
111 η 2 The intermediate layers of ELU
112 ω The index to denote weight sharing
113 γ The input of the GLU
114 W 1 , ω The weights of ELU
115 W 2 , ω
116 b 4 , ω The biases of ELU
117 ξ t ( j ) The   transformed   input   of   the   j th   feature   at   time   t
118 ξ ˜ t ( j ) The   processed   feature   vector   for   variable   j
119 Ξ t The   flattened   vector   of   all   past   inputs   at   time   t
120 v χ t The feature selection weights
121 v χ t ( j ) The   j th   element   of   vector   v χ t
122 ϕ ˜ The gated skip connection
123 ϕ The set of uniform temporal features which serve as inputs into the decoder itself
124 Θ The temporal features matrix
125 V The value vector of the attention mechanism
126 K The key vector of the attention mechanism
127 Q The query vector of the attention mechanism
128 A ( · ) The normalization function
129 δ The gating layer
130 ψ The non-linear processing by GRNs
131 ψ ˜ The gated residual connection
132 The   domain   of   training   data   containing   M samples
133 M The number of samples

References

  1. Wei, Y.; Chen, M.C. Forecasting the short-term metro passenger flow with empirical mode decomposition and neural networks. Transp. Res. Part C Emerg. Technol. 2012, 21, 148–162. [Google Scholar] [CrossRef]
  2. Bai, L.; Yao, L.; Kanhere, S.S.; Wang, X.; Sheng, Q.Z. Stg2seq: Spatial-temporal graph to sequence model for multi-step passenger demand forecasting. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}, Macao, China, 10–16 August 2019; pp. 1981–1987. [Google Scholar]
  3. Lim, B.; Ark, S.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  4. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  5. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2016. [Google Scholar]
  6. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. Comput. Sci. 2015, 29. Available online: https://proceedings.neurips.cc/paper/2016/hash/390e982518a50e280d8e2b535462ec1f-Abstract.html (accessed on 1 September 2022).
  7. Velikovi, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  8. Vlahogianni, E.I.; Golias, J.C.; Karlaftis, M.G.; Banister, D.; Givoni, M. Short-term traffic forecasting: Overview of objectives and methods. Transp. Rev. 2003, 24, 533–557. [Google Scholar] [CrossRef]
  9. Williams, B.; Durvasula, P.; Brown, D. Urban freeway traffic flow prediction: Application of seasonal autoregressive integrated moving average and exponential smoothing models. Transp. Res. Rec. 1998, 1644, 132–141. [Google Scholar] [CrossRef]
  10. Lee, S.; Fambro, D.; Lee, S.; Fambro, D. Application of subset autoregressive integrated moving average model for short-term freeway traffic volume forecasting. Transp. Res. Rec. J. Transp. Res. Board 1999, 1678, 179–188. [Google Scholar] [CrossRef]
  11. Huang, W.; Song, G.; Hong, H.; Xie, K. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2191–2201. [Google Scholar] [CrossRef]
  12. Ni, M.; He, Q.; Gao, J. Forecasting the subway passenger flow under event occurrences with social media. IEEE Trans. Intell. Transp. Syst. 2016, 18, 1623–1632. [Google Scholar] [CrossRef]
  13. Sun, Y.; Leng, B.; Guan, W. A novel wavelet-svm short-time passenger flow prediction in beijing subway system. Neurocomputing 2015, 166, 109–121. [Google Scholar] [CrossRef]
  14. Li, Y.; Wang, X.; Sun, S.; Ma, X.; Lu, G. Forecasting short-term subway passenger flow under special events scenarios using multiscale radial basis function networks—Sciencedirect. Transp. Res. Part C Emerg. Technol. 2017, 77, 306–328. [Google Scholar] [CrossRef]
  15. Sun, Y.; Zhang, G.; Yin, H. Passenger flow prediction of subway transfer stations based on nonparametric regression model. Discret. Dyn. Nat. Soc. 2014, 2014, 397154. [Google Scholar] [CrossRef]
  16. Zhou, X.; Mahmassani, H.S. A structural state space model for real-time traffic origin-destination demand estimation and prediction in a day-to-day learning framework. Transp. Res. Part B Methodol. 2007, 41, 823–840. [Google Scholar] [CrossRef]
  17. Hazelton, M.L. Inference for origin–destination matrices: Estimation, prediction and reconstruction. Transp. Res. Part B 2008, 35, 667–676. [Google Scholar] [CrossRef]
  18. Djukic, T. Dynamic od Demand Estimation and Prediction for Dynamic Traffic Management; Delft University of Technology: Delft, The Netherlands, 2014. [Google Scholar]
  19. Liu, L.; Qiu, Z.; Li, G.; Wang, Q.; Ouyang, W.; Lin, L. Contextualized spatial-temporal network for taxi origin-destination demand prediction. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3875–3887. [Google Scholar] [CrossRef] [Green Version]
  20. Shi, H.; Yao, Q.; Guo, Q.; Li, Y.; Liu, Y. Predicting Origin-Destination Flow via Multi-Perspective Graph Convolutional Network. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020. [Google Scholar]
  21. Gong, Y.; Li, Z.; Zhang, J.; Liu, W.; Zheng, Y. Online spatio-temporal crowd flow distribution prediction for complex metro system. IEEE Trans. Knowl. Data Eng. 2020, 34, 865–880. [Google Scholar] [CrossRef]
  22. Liu, L.; Chen, J.; Wu, H.; Zhen, J.; Li, G.; Lin, L. Physical-virtual collaboration modeling for intra-and inter-station metro ridership prediction. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3377–3391. [Google Scholar] [CrossRef]
  23. Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; Ye, J.; Li, Z. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LO, USA, 2–7 February 2018. [Google Scholar]
  24. Dong, W.; Wei, C.; Jian, L.; Ye, J. Deepsd: Supply-demand prediction for online car-hailing services using deep neural networks. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, CA, USA, 19–22 April 2017. [Google Scholar]
  25. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017, arXiv:01926. [Google Scholar]
  26. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 914–921. [Google Scholar]
  27. Han, Y.; Wang, S.; Ren, Y.; Wang, C.; Gao, P.; Chen, G. Predicting station-level short-term passenger flow in a citywide metro network using spatiotemporal graph convolutional neural networks. Int. J. Geo Inf. 2019, 8, 243. [Google Scholar] [CrossRef] [Green Version]
  28. Geng, X.; Li, Y.; Wang, L.; Zhang, L.; Yang, Q.; Ye, J.; Liu, Y. In Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2019; pp. 3656–3663. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  30. Fei, W.; Jiang, M.; Chen, Q.; Yang, S.; Tang, X. Residual attention network for image classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  31. Arik, S.O.; Pfister, T. Tabnet: Attentive interpretable tabular learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  32. Alaa, A.M.; Schaar, M. Attentive state-space modeling of disease progression. In Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2019. [Google Scholar]
  33. Choi, E.; Bahadori, M.T.; Schuetz, A.; Stewart, W.F.; Sun, J. Retain: Interpretable Predictive Model in Healthcare Using Reverse Time Attention Mechanism; Curran Associates Inc.: Red Hook, NY, USA, 2016. [Google Scholar]
  34. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2019. [Google Scholar]
  35. Song, H.; Rajan, D.; Thiagarajan, J.J.; Spanias, A. Attend and diagnose: Clinical time series analysis using attention models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LO, USA, 2–7 February 2018. [Google Scholar]
  36. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V.; Hyndman, R.J. The m4 competition: 100,000 time series and 61 forecasting methods. Int. J. Forecast. 2020, 36, 54–74. [Google Scholar] [CrossRef]
  37. Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. In Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2018. [Google Scholar]
  38. Wen, R.; Torkkola, K.; Narayanaswamy, B. A multi-horizon quantile recurrent forecaster. arXiv 2017, arXiv:1711.11053. [Google Scholar]
  39. Fan, C.; Zhang, Y.; Pan, Y.; Li, X.; Zhang, C.; Yuan, R.; Wu, D.; Wang, W.; Pei, J.; Huang, H. Multi-horizon time series forecasting with temporal attention learning. In Proceedings of the 25th ACM SIGKDD International Conference, Anchorage, AL, USA, 3–7 August 2019. [Google Scholar]
  40. Guo, T.; Lin, T.; Antulov-Fantulin, N. Exploring interpretable lstm neural networks over multi-variable data. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  41. Burggraeve, S.; Bull, S.H.; Vansteenwegen, P.; Lusby, R.M. Integrating robust timetabling in line plan optimization for railway systems. Transp. Res. Part C Emerg. Technol. 2017, 77, 134–160. [Google Scholar] [CrossRef] [Green Version]
  42. Zheng, H.; Cui, Z.; Zhang, X. Automatic discovery of railway train driving modes using unsupervised deep learning. ISPRS Int. J. Geo Inf. 2019, 8, 294. [Google Scholar] [CrossRef] [Green Version]
  43. Paparrizos, J.; Gravano, L. K-shape: Efficient and accurate clustering of time series. ACM SIGMOD Rec. 2015, 45, 69–76. [Google Scholar] [CrossRef]
  44. Fang, S.; Zhang, Q.; Meng, G.; Xiang, S.; Pan, C. Gstnet: Global spatial-temporal network for traffic flow prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 2286–2293. [Google Scholar]
  45. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  46. Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. In Fast and accurate deep network learning by exponential linear units (elus). In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  47. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  48. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2016. [Google Scholar]
  49. Gal, Y.; Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Barcelona, Spain, 2016; pp. 1027–1035. [Google Scholar]
  50. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  51. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858. [Google Scholar] [CrossRef] [Green Version]
  52. Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  53. Gers, F.A.; Schmidhuber, E. Lstm recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans. Neural Netw. 2001, 12, 1333–1340. [Google Scholar] [CrossRef]
  54. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An empirical exploration of recurrent network architectures. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
  55. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}, Macao, China, 10–16 August 2019. [Google Scholar]
Figure 1. Illustration of the forecasting task in the temporal perspective.
Figure 2. Illustrations for the Encoder–Decoder component of the GTFNN.
Figure 3. The framework of Graph–Temporal Fused Neural Network (GTFNN).
Figure 4. The structure of GRN.
Figure 5. Illustration for feature selection network.
Figure 6. Variable importance for the CQMetro dataset. (a) Encoder variables importance. (b) Decoder variables importance.
Figure 7. The examples of forecasting curves and the ground-truth curves for OD 1: (a) weekdays; (b) weekends.
Figure 8. The examples of forecasting curves and the ground-truth curves for residential areas to CBD OD flow: (a) weekdays; (b) weekends.
Figure 9. The examples of forecasting curves and the ground-truth curves for residential to suburban OD flow: (a) weekdays; (b) weekends.
Figure 10. The examples of forecasting curves and the ground-truth curves for hub to suburban OD flow: (a) weekdays; (b) weekends.
Table 1. Key notations in forecasting task.
IndexNotationDescription
1 X p t , h , π F A vector of input finished passenger flow sequence, where p t is the time step of forecasting. h refers to the length of the input sequence and π refers to the blind spot for data updates caused by the information system update mechanism, where h > 0 and π < h . X p t , h , π F represents the finished passenger flow of time steps { p t h + 1 , p t h + 2 , p t π } .
X p t , h , π F { X p t n + 1 F , X p t n + 2 F , X p t π F }
2 X t F The finished OD flows of time step t with the origin station i [ 1 , N ] .
X t F { x 1 , t F , x 2 , t F , , x i , t F , , x N , t F } κ × N
3 x i , t F The top κ 1 finished OD pairs that originate from station i at time step t .   x i ~ κ , t I C represents the rest of the finished OD flow of station i .
x i , t F { x i ~ 1 , t F , x i ~ 2 , t F , , x i ~ j , t F , , x i ~ κ 1 , t F , x i ~ κ , t F } κ
4 x i ~ j , t F The finished OD flow traveled from station i to station j at time step t .
5 Y ^ p t , m A A vector of predicted actual (or entered) passenger flow sequence, where p t is the predicted or query time step. m refers to the length of the predicted sequence where. A indicates that the variable represents the actual (or entered) passenger flow of time steps { p t + 1 , p t + 2 , p t + m } .
Y ^ p t , m A { Y p t + 1 A , Y p t + 2 A , Y p t + m A }
6 Y p t , h , π A A vector of historical actual (or entered) passenger flow sequence, where p t is the time step of forecasting. h refers to the length of the sequence, and π refers to the blind spot for data updates caused by the information system update mechanism, where h > 0 and π < h .   Y p t , h , π A represents the actual passenger flow of time steps { p t h + 1 , p t h + 2 , p t π } .
Y p t , h , π A { Y p t n + 1 A , Y p t n + 2 A , Y p t π A }
7 Y t A The actual (or entered) OD flows of time step t with the origin station i [ 1 , N ] .
Y t A { y 1 , t A , y 2 , t A , , y i , t A , , y N , t A } κ × N
8 y i , t A The top κ 1 actual (or entered) OD pairs that originate from station i at time step t . x i ~ κ , t C represents the rest of the actual (or entered) OD flow of station i .
y i , t A { y i ~ 1 , t A , y i ~ 2 , t A , , y i ~ j , t A , , y i ~ κ 1 , t A , y i ~ κ , t A } κ
9 y i ~ j , t A The actual (or entered) OD flow traveled from station i to station j at time step t .
10 O p t , h , π The observed input features that can only be obtained in historical data. We define O p t , h , π { O p t , h , π s , O p t , h , π g }
11 K p t , h , m The known input features that can be obtained in the whole range of time. We define K p t , h , m { K p t , h , m s , K p t , h , m g }
Table 2. Time-dependent features.

Observed input features (O), time range [pt − h + 1, pt − π]:
  Set O^s_{pt,h,π}:
    o_t^1   Finished OD passenger flow in the next time step
    o_t^2   Finished OD passenger flow in the next two time steps
  Set O^g_{pt,h,π}:
    o_t^3   Max. passenger flow
    o_t^4   Min. passenger flow
Known input features (K), time range [pt − h + 1, pt + m]:
  Set K^s_{pt,h,m}:
    k_t^1   Hour of the day
    k_t^2   Day of the week
    k_t^3   Weather (sunny, rainy, or cloudy)
  Set K^g_{pt,h,m}:
    k_t^4   Passenger flow in the latest update time step
    k_t^5   Passenger flow in the two time steps before the latest update time
    k_t^6   Passenger flow in the three time steps before the latest update time
    k_t^7   Passenger flow in the four time steps before the latest update time
    k_t^8   Passenger flow in the same time step of the previous day
    k_t^9   Passenger flow in the same time step last week
    k_t^10  Passenger flow for the same time step two weeks ago
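To make the windowing in Tables 1 and 2 concrete, the sketch below assembles the observed and known feature windows around a prediction time step pt. It is a simplified NumPy illustration; array names such as `finished_flow` and `known_feat` are placeholders rather than the paper's data schema, and the blind-spot value used in the toy call is only an example.

```python
import numpy as np

def build_windows(finished_flow, known_feat, pt, h, m, pi):
    """Slice model inputs around prediction time step `pt`.

    finished_flow: (T, N) finished OD flow per time step (observed input)
    known_feat:    (T, N, K) features known for all time steps (calendar, weather, ...)
    Observed inputs cover [pt - h + 1, pt - pi]; the last `pi` steps are the blind spot
    caused by the data-update mechanism. Known inputs cover [pt - h + 1, pt + m].
    """
    obs = finished_flow[pt - h + 1 : pt - pi + 1]   # length h - pi
    known = known_feat[pt - h + 1 : pt + m + 1]     # length h + m
    return obs, known

# Toy example: 15-min steps, h = 384 (4 days), m = 4 (60 min), blind spot pi = 2 (30 min).
T, N, K = 3000, 170, 10
finished_flow = np.random.rand(T, N)
known_feat = np.random.rand(T, N, K)
obs, known = build_windows(finished_flow, known_feat, pt=2500, h=384, m=4, pi=2)
print(obs.shape, known.shape)   # (382, 170) (388, 170, 10)
```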
Table 3. Multi-layer networks (general form G = (N, E, W); the original table shows an illustration (a)–(f) for each graph).

Station–line–network relationship:
  (a) Station network: G_sn = (N, E_sn, W_sn), with W_sn(i, j) = Se(i, j) / Σ_{k=1..N} Se(i, k)  (10)
  (b) Transfer network: G_tn = (N, E_tn, W_tn), with W_tn(i, j) = Tr(i, j) / Σ_{k=1..N} Tr(i, k)  (11)
Passenger flow characteristics relationship:
  (c) Time series similarity: G_ts = (N, E_ts, W_ts), with W_ts(i, j) = Ts(i, j) / Σ_{k∈c} Ts(i, k)  (12)
  (d) Peak hour factor similarity: G_ps = (N, E_ps, W_ps), with W_ps(i, j) = Ps(i, j) / Σ_{k∈c} Ps(i, k)  (13)
Line planning relationship:
  (e) Line planning network: G_lp = (N, E_lp, W_lp), with W_lp(i, j) = Fre(l)·Lp(i, j) / Σ_{k∈SL(l)} Lp(i, k)·Fre(l)  (14)
Correlation relationship:
  (f) Correlation of flow evolution: G_cf = (N, E_cf, W_cf), with W_cf(i, j) = D(i, j) / Σ_{k=1..N} D(i, k)  (15)
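The edge weights in Table 3 all follow the same row-normalization pattern, which can be sketched generically as below. The pairwise score functions (Se, Tr, Ts, Ps, Lp·Fre, D) come from the paper, while the helper name `row_normalize` is ours.

```python
import numpy as np

def row_normalize(score):
    """Turn a pairwise score matrix S(i, j) into edge weights W(i, j) = S(i, j) / sum_k S(i, k).

    The same normalization is applied to all six graphs in Table 3 (station network,
    transfer network, time-series similarity, peak-hour-factor similarity,
    line planning network, and flow-evolution correlation), each with its own score.
    """
    score = np.asarray(score, dtype=float)
    row_sum = score.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0.0] = 1.0    # isolated node: keep a zero row instead of dividing by zero
    return score / row_sum

# Example: a 4-station connectivity score Se(i, j) (1 if adjacent in the station network).
Se = np.array([[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]])
W_sn = row_normalize(Se)   # each non-zero row of W_sn sums to 1
```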
Table 4. Dataset of CQMetro.

  City                  Chongqing, China
  Stations              170
  Physical edges        448
  Flow volume per day   1.72 M
  OD pairs              14,314
  Time step             15 min
  Input length          384 (4 days)
  Output length         4 (60 min)
  Forecasting horizon   2 (30 min)
  Training timespan     2880 (30 days)
  Testing timespan      1440 (15 days)
  Validation timespan   96 (1 day)
Table 5. Hyperparameter optimization.

Random search range:
  State size              10, 20, 40, 80, 160, 240, 320
  Dropout rate            0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9
  Minibatch size          64, 128, 256
  Learning rate           0.0001, 0.001, 0.01
  Decay ratio             0.1
  Max. gradient norm      0.01, 1.0, 100.0
  Feature dimensionality  256
Optimal hyperparameters:
  Random search iterations  60
  State size                320
  Dropout rate              0.3
  Minibatch size            128
  Learning rate             0.001
  Decay ratio               0.1
  Max. gradient norm        100
  Num. heads                4
  Feature dimensionality    256
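A random search over the ranges in Table 5 can be sketched with the generic loop below. Here `train_and_validate` is a placeholder standing in for training GTFNN and returning a validation loss; it is not a function from the paper.

```python
import random

SEARCH_SPACE = {
    "state_size":     [10, 20, 40, 80, 160, 240, 320],
    "dropout_rate":   [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9],
    "minibatch_size": [64, 128, 256],
    "learning_rate":  [1e-4, 1e-3, 1e-2],
    "max_grad_norm":  [0.01, 1.0, 100.0],
}

def random_search(train_and_validate, n_iter=60, seed=0):
    """Draw `n_iter` random configurations and keep the one with the lowest validation loss."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_iter):
        cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        loss = train_and_validate(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Usage with a dummy objective standing in for model training (60 iterations, as in Table 5).
best_cfg, best_loss = random_search(lambda cfg: random.random(), n_iter=60)
```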
Table 6. The mean performances of models in forecasting weekdays OD flow.

  Metrics   GTFNN    GBDT     LSTM     GRU      Graph-WaveNet
  MAE       0.83     1.71     4.57     4.63     2.91
  MSE       4.42     7.06     15.49    16.34    10.42
  RMSE      2.10     2.66     3.94     4.04     3.23
  SMAPE     14.16%   19.41%   35.17%   34.52%   23.25%
Table 7. The mean performances of models in forecasting weekends OD flow.

  Metrics   GTFNN    GBDT     LSTM     GRU      Graph-WaveNet
  MAE       0.78     1.58     4.12     4.26     2.13
  MSE       4.02     6.53     13.09    13.19    9.63
  RMSE      2.00     2.56     3.62     3.63     3.10
  SMAPE     13.21%   18.35%   31.48%   31.54%   20.16%
Table 8. The mean performances of models in forecasting holiday OD flow.

  Metrics   GTFNN    GBDT     LSTM      GRU      Graph-WaveNet
  MAE       4.41     6.60     19.97     20.30    8.89
  MSE       35.08    44.44    85.76     85.11    67.32
  RMSE      5.92     6.67     9.26      9.23     8.20
  SMAPE     47.96%   60.83%   109.86%   85.44%   60.07%
Table 9. The metrics of forecasting results for OD 1.

  Passenger Flow Scenario 1   MAE    MSE    RMSE   SMAPE
  weekdays                    0.38   0.52   0.72   11.51%
  weekends                    0.18   0.07   0.27   8.20%
Table 10. The metrics of forecasting results for residential areas to CBD OD flow.

  Passenger Flow Scenario 2   MAE    MSE    RMSE   SMAPE
  weekdays                    0.26   0.70   0.84   31.23%
  weekends                    0.28   0.77   0.88   33.47%
Table 11. The metrics of forecasting results for residential to suburban OD flow.

  Passenger Flow Scenario 3   MAE    MSE    RMSE   SMAPE
  weekdays                    0.16   0.18   0.41   81.98%
  weekends                    0.12   0.10   0.32   76.42%
Table 12. The metrics of forecasting results for hub to suburban OD flow.

  Passenger Flow Scenario 4   MAE    MSE    RMSE   SMAPE
  weekdays                    0.12   0.09   0.29   106.84%
  weekends                    0.14   0.13   0.36   112.67%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
