1. Introduction
Along with the rapid development of urban transportation, the metro has become an important choice for urban residents due to its convenience and efficiency. The expansion of the metro network and the continuous growth in ridership have made metro operation increasingly difficult, which in turn results in station congestion and prolonged commuting times for passengers. In this context, precise prediction of passenger flow has emerged as a pivotal means to alleviate the operational pressure posed by the ever-increasing demand [1], thereby ensuring a secure and comfortable travel experience for passengers [2]. In addition, accurate passenger flow prediction can be used to respond to emergencies or traffic interruptions [3], reduce train delays [4], improve overall punctuality, and optimize system operations [5,6], thereby improving the overall efficiency and service quality of the transportation system. A selection of the relevant literature on passenger flow prediction and its role in operations is shown in Table 1. Nevertheless, existing passenger flow prediction methods struggle to handle complex nonlinear interactions and to integrate multi-source data, which leads to poor prediction results. Therefore, improving the prediction accuracy of these models is particularly important.
In response to the above issues, a considerable amount of research has been conducted on passenger flow prediction methods. The earliest passenger flow prediction methods were statistical methods, followed by machine learning methods. Statistical methods primarily rely on extracting time series patterns from historical data [7,8] and struggle to meet the requirements for real-time performance and prediction accuracy. Compared with statistical methods, machine learning approaches offer higher prediction precision and more advantages in integrating multi-source data. Moreover, some models can capture deep nonlinear relationships, such as the back-propagation neural network (BPNN) [9], gradient boosting decision tree (GBDT) [10], and multilayer perceptron (MLP) [11]. In recent years, deep learning methods, a branch of machine learning, have attracted considerable attention because they not only automatically extract intricate temporal and spatial features but also effectively model high-dimensional data. In the initial stage of deep learning, many models based on recurrent neural networks (RNNs) emerged and were widely employed in passenger flow prediction tasks for their superior ability to handle temporal features [12,13,14]. As a variant of the RNN, the gated recurrent unit (GRU) distinguishes itself through its gating mechanism [15,16]. However, despite the progress made by contemporary passenger flow prediction models, a fundamental limitation of these models is their inability to capture spatial features. Owing to its ability to extract spatial relationships, the convolutional neural network (CNN) is widely applied to process spatial data with regular grid structures, i.e., Euclidean data [17]. As a complement to the CNN, the graph convolutional network (GCN) [18] is valuable for capturing network-based spatial dependencies from graph-structured data, aggregating the edge features of adjacent nodes to promote global modeling [19,20,21].
Spatial dependencies are crucial for metro passenger flow prediction because they capture inter-station flow patterns [22]. Considering this, we need to capture not only the local dependencies within each station but also the global interactions between stations throughout the entire metro network. Accordingly, some studies focus on multi-station passenger flow prediction, which individually predicts passenger flow for each cluster comprising multiple stations. For instance, Dong et al. [23] employed K-Means to classify all stations into different categories and created a prediction model for each category in the network. Liu et al. [24] proposed a novel two-step model that predicted the passenger flow for each type of classification result. Unlike stand-alone prediction strategies that only consider interdependencies within a cluster, network-level prediction focuses on interactions across clusters and the flow dynamics of the entire network. In addition to temporal and spatial correlations among stations, stations are also affected by inter-station correlations measured by geographical connectivity and passenger travel behavior patterns [25]. These inter-station correlations not only exist in local areas but can also occur between geographically distant stations. Nevertheless, some recent models only consider the physical topology of metro networks and neglect the diversity of inter-station dependencies. Zhang et al. [26] constructed a geographic topology graph based on the connection relationships in the metro network. Overall, such methods are limited to learning local spatial dependencies between adjacent stations and cannot fully capture the spatial dependencies underlying long-term passenger flow trends. Therefore, there remain substantial opportunities to improve the accuracy and robustness of passenger flow prediction.
To address these limitations and improve the performance of passenger flow prediction, we propose a deep learning framework for metro passenger flow prediction, named MRDSCNN (Multiscale Residual Depthwise Separable Convolutional Neural Network). Based on the travel behavior and operational principles of the metro, we extract the historical passenger flow from smart card data as input. The proposed approach leverages a residual network to capture spatiotemporal dependency details from multiple temporal patterns of passenger flow and utilizes multi-scale convolution to extract inter-station correlations from two graphs. Subsequently, the spatiotemporal dependency and inter-station correlation features are fused and fed into the AttBiGRU to integrate comprehensive global information. The passenger inflow and outflow prediction results are obtained through a fully connected layer. The main contributions of this study can be summarized as follows:
- (1)
We employ the RDSC module, a residual network structure with a channel attention mechanism, to capture spatiotemporal dependencies between stations from various temporal patterns (real-time, daily, and weekly).
- (2)
We model inter-station interactions through a network structure correlation graph and passenger flow similarity graph and utilize the MDSC module to enhance multi-scale spatial correlations on these graphs.
- (3)
Experimental results based on real data of metro passenger inflow and outflow prove that our approach outperforms other baseline models in terms of prediction performance.
The remainder of this study is organized as follows. In
Section 2, we define the problem and describe the architecture of the proposed model, providing a detailed overview of the different components of the framework. In
Section 3, we elaborate on the experimental details and discuss the results of the case study. The conclusion is given in
Section 4.
2. Methodology
2.1. Overall Framework
This study aims to predict the inflow and outflow of passengers based on historical metro passenger flow data and predefined graph structure data. Assuming there are $N$ metro stations in the network, the passenger flow at time interval $t$ is denoted as $X_t \in \mathbb{R}^{N}$. The historical passenger flow of the entire metro system over the previous time steps is represented as a signal over a series of time intervals, whereby $T_r$, $T_d$, and $T_w$ represent different time offset values from the current time interval back into the past. Aiming to efficiently capture the trends of passenger flow across various time periods (real-time, daily, and weekly), our model contains three corresponding temporal patterns. The passenger flow data of the $N$ metro stations in these time intervals can be defined as $X = \{X^{r}, X^{d}, X^{w}\}$, whereby $X^{r}$, $X^{d}$, and $X^{w}$ represent the passenger inflow or outflow in the three temporal patterns mentioned above, respectively.
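For illustration, the following minimal sketch shows how the three temporal patterns could be sliced from a historical flow matrix; the array name `flow`, the 15-minute interval length, and the segment lengths `Tr`, `Td`, and `Tw` are assumptions made for the example, not values fixed by this paper.

```python
import numpy as np

# Assumed setup: 'flow' holds inflow (or outflow) counts for N stations over
# consecutive 15-minute intervals; all sizes below are illustrative.
N = 80                      # number of stations (assumption)
intervals_per_day = 96      # 24 h / 15 min (assumption)
T = 30 * intervals_per_day  # 30 days of history (assumption)
flow = np.random.randint(0, 500, size=(T, N))

t = T - 1              # current time interval
Tr, Td, Tw = 4, 4, 4   # segment lengths of the three patterns (assumptions)

# Real-time pattern: the Tr intervals immediately preceding t.
x_real = flow[t - Tr:t]

# Daily pattern: intervals around the same time on the previous day.
x_daily = flow[t - intervals_per_day - Td // 2 : t - intervals_per_day + Td // 2]

# Weekly pattern: intervals around the same time one week earlier.
x_weekly = flow[t - 7 * intervals_per_day - Tw // 2 : t - 7 * intervals_per_day + Tw // 2]

print(x_real.shape, x_daily.shape, x_weekly.shape)  # (4, 80) (4, 80) (4, 80)
```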
The inter-station correlation not only represents the connection between adjacent stations but also the correlation between station-level passenger flow trends in the metro network. Station interaction graphs are constructed to capture inter-station correlations based on the network structure correlation graph $G_s = (V, E_s)$ and the passenger flow similarity graph $G_p = (V, E_p)$, where $V$ is the set of stations, and $E_s$ and $E_p$ are the sets of edges that depict connections between stations in the respective graphs. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ indicates the weights of the edges from station $v_i$ to station $v_j$. $A_s$ is the adjacency matrix of $G_s$, and the adjacency matrix of $G_p$ is represented by $A_p$, which is the normalized result of the passenger flow similarity matrix. The main purpose of this study is to perform network-level prediction of passenger inflow and outflow from the data matrices $X$ and the station interaction graphs $G_s$ and $G_p$.
The framework of the MRDSCNN model is depicted in
Figure 1, encompassing the RDSC module, the MDSC module, and the AttBiGRU module. The RDSC module utilizes a residual structure with skip connections and stacks residual blocks for the three temporal patterns. Each residual block incorporates a channel attention mechanism, which focuses on the important temporal dynamics of metro passenger flow. Meanwhile, the MDSC module utilizes multi-scale convolution to extract complex inter-station correlations from the network structure correlation graph and the passenger flow similarity graph. This strategy empowers the model to capture features at different scales and enhances fine-grained feature modeling. Then, the outputs of the above modules are integrated through the fusion layer and fed into the AttBiGRU to learn the global evolutionary features of all stations. Notably, the adoption of depthwise separable convolution (DSC) effectively reduces parameter overhead while maintaining predictive performance.
2.2. Construction of Relationship Graphs between Metro Stations
In this section, we explore the relationship between metro stations from two perspectives. One is the track connectivity of adjacent stations, and the other is the correlation of passenger flow patterns between different stations. Considering the station-to-station relevant information from the network map and historical passenger flow, network structure correlation graph and passenger flow similarity graph are constructed to fully explore the complex inter-station correlations.
2.2.1. Network Structure Correlation Graph
When two stations are situated in close proximity, their passenger flow interactions tend to exhibit stronger correlations. That is, the topological association between adjacent stations is a critical spatial relationship within the metro network. This spatial relationship exerts a notable influence on the trajectory and speed of passenger traffic, thereby shaping the dynamic behavior of the metro system. In order to effectively capture this relationship among stations, we establish a network structure correlation graph, denoted as $G_s$, referring to the real-world network map. The entry $a^{s}_{ij}$ of the adjacency matrix $A_s$ is assigned a value of 1 if two stations $v_i$ and $v_j$ are adjacent, and 0 otherwise. Additionally, the diagonal entries of the matrix are set to 0 to avoid self-loop connections between stations and themselves, thus reducing redundant information and computational burden.
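As an illustration, the adjacency matrix $A_s$ can be assembled from a list of physically adjacent station pairs; the sketch below is a minimal example, and the edge list is hypothetical.

```python
import numpy as np

def build_structure_adjacency(num_stations, adjacent_pairs):
    """Binary adjacency matrix of the network structure correlation graph.

    adjacent_pairs: iterable of (i, j) station index pairs that are directly
    connected by track; the matrix is symmetric and its diagonal is kept at 0
    (no self-loops), as described above.
    """
    A = np.zeros((num_stations, num_stations), dtype=np.float32)
    for i, j in adjacent_pairs:
        A[i, j] = 1.0
        A[j, i] = 1.0
    np.fill_diagonal(A, 0.0)
    return A

# Hypothetical 5-station line: 0-1-2-3-4
A_s = build_structure_adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(A_s)
```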
2.2.2. Passenger Flow Similarity Graph
Although the network structure correlation graph can depict the spatial positions of stations, it falls short of capturing the intricate interactions underlying long-term trends in passenger flow. At the same time, a single graph is limited in fully describing inter-station interactions. Thus, we introduce a passenger flow similarity graph $G_p$ to provide rich insights into, and an adequate understanding of, the high-order spatiotemporal relationships among stations. Diverging from the network structure correlation graph, which is based on the network map, the passenger flow similarity graph employs historical passenger flow to model inter-station correlations through the dynamic time warping (DTW) algorithm. The DTW algorithm captures the matching relationships between passenger flows at different time points, which can identify stations with similar travel patterns in passenger flow prediction [27]. Based on this approach, we align and flexibly stretch the passenger flow series to determine the optimal matching path and then calculate the distances between data points at the corresponding locations on this path.
Supposing there are two stations, $v_a$ and $v_b$, with passenger flow series $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_m)$, respectively, we generate a distance matrix of dimensions $n \times m$, where $n$ and $m$ represent the lengths of the two time series being compared, as shown in Figure 2. The index pair $(i, j)$ identifies each element in the matrix, and $d(i, j)$ denotes the value of that element. The unit of each element in the distance matrix of the DTW algorithm is persons.
The path consisting of gray grids in Figure 2 is the optimal warping path, which depicts how the data points of $P$ are matched with those of $Q$ to achieve the best alignment. The overall difference resulting from this alignment is measured by the cumulative warping distance $D(i, j)$, which is defined as the distance between the first $i$ data points of $P$ and the first $j$ data points of $Q$. In other words, it is the shortest path length from the lower left corner $(1, 1)$ to any point $(i, j)$ in the matrix of Figure 2. Moreover, the computation of the shortest distance adheres strictly to the constraints of boundedness, monotonicity, and continuity, ensuring the reliability and accuracy of the results, as expressed in Equation (2).
where $i$ represents the index of a data point $p_i$ in $P$, and $j$ represents the index of a data point $q_j$ in $Q$.
During the path movement of the DTW algorithm, three potential directions are available: horizontal, vertical, and diagonal. For instance, when the current point is at coordinates $(i, j)$, the subsequent point can be chosen from three options: $(i+1, j)$, $(i, j+1)$, and $(i+1, j+1)$. Consequently, the cumulative warping distance resulting from the alignment of the path can be represented as shown in Equation (3):
$$D(i, j) = d(i, j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$
where $d(i, j)$ represents the calculated value of point $(i, j)$ in the matrix, and we choose the Euclidean distance to measure the warped distance between the time series.
The smaller the warped distance, the more similar the passenger flow patterns between the two stations, as shown in Equation (4). Subsequently, we take the resulting similarity value as the entry $(i, j)$ of the passenger flow similarity matrix [28], as shown in Equation (5). Considering the different impacts of time granularities and the directional distinctions inherent in historical passenger flow data, a more detailed division of the passenger flow similarity matrix is required in the construction process (presented in Section 3.4.1). In this way, these graphs help to identify stations with, for example, similar passenger flow patterns, functions, and network structures, thereby responding to different passenger flow changes and complex network dynamics.
2.3. Residual Depthwise Separable Convolution Module
As we all know, traditional CNN may suffer from performance degradation and overfitting as the network depth increases, especially for limited training data. Therefore, the residual network (ResNet) has emerged as a solution to improve performance and optimize deeper models [
29]. In this study, we employ an improved residual module RDSC that leverages the synergy between depthwise separable convolution and attention mechanism. As depicted in
Figure 3, the input passenger flow passes through a series of operations to obtain the output features: DSConv represents a depthwise separable convolution, BN indicates a batch normalization layer, ReLU denotes the activation function, and the channel attention mechanism includes squeeze and excitation operations. The main effect of the attention mechanism is to make the model adaptively learn the importance of each channel, thus reducing the focus on non-essential feature channels. Through the skip connections, information can flow directly from one layer to another, alleviating the vanishing gradient problem and facilitating the training of deep networks.
Figure 1 illustrates two RDSC modules being used to extract inflow or outflow features. Subsequently, the extracted features are flattened and fed into the fully connected layer.
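A minimal PyTorch sketch of a residual block in the spirit of the RDSC module is given below: a depthwise separable convolution, batch normalization, ReLU, squeeze-and-excitation channel attention, and a skip connection. The layer arrangement, kernel sizes, and reduction ratio are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DSConv2d(nn.Module):
    """Depthwise separable convolution: depthwise conv followed by 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=pad, dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation: global average pool, two FC layers, sigmoid gating."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze over spatial dims
        return x * w.unsqueeze(-1).unsqueeze(-1)  # excite: reweight channels

class RDSCBlock(nn.Module):
    """Residual block: DSConv -> BN -> ReLU -> DSConv -> BN -> SE -> skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            DSConv2d(channels, channels), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            DSConv2d(channels, channels), nn.BatchNorm2d(channels))
        self.se = ChannelAttention(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.se(self.body(x)))

# Example: 16 channels over a (time steps x stations) grid (shapes assumed)
y = RDSCBlock(16)(torch.randn(2, 16, 4, 80))
print(y.shape)  # torch.Size([2, 16, 4, 80])
```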
In this module, the structure of the traditional residual network is improved to simultaneously optimize the number of model parameters and increase prediction accuracy. Specifically, we introduce depthwise separable convolution, which divides the convolution process into two distinct stages, depthwise convolution and pointwise convolution, as depicted in Figure 4. The numbers of input and output channels are $C_{in}$ and $C_{out}$, the input width and height are $W_{in}$ and $H_{in}$, the output width and height are $W_{out}$ and $H_{out}$, and the convolutional kernel has spatial size $D_K \times D_K$. First, the depthwise convolution applies a single convolution kernel to each input channel and obtains $C_{in}$ feature maps of size $W_{out} \times H_{out}$. Subsequently, the pointwise convolution takes the feature maps produced by the depthwise convolution as input and uses $C_{out}$ convolution kernels of size $1 \times 1$ spanning all channels, thus obtaining a new feature map of size $W_{out} \times H_{out} \times C_{out}$. This improvement significantly reduces the number of parameters and the computational complexity while retaining the model's capacity for intricate feature representation.
We also introduce dilation rates within the framework of depthwise separable convolution. The dilation rate permits convolution kernels to skip over intermediate pixels or feature points while performing convolution, thus extending the receptive field. Standard convolution is a special case in which the dilation rate equals 1, as shown in Figure 5a, which demonstrates the receptive field of a standard convolution for a given filter size. For the same filter size, the enlarged receptive fields obtained with different dilation rates can be observed in Figure 5b,c. A receptive field that extends beyond the range of the standard convolution contributes to a comprehensive understanding of the data distribution and improves the ability to discern complex spatiotemporal relationships.
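As a simple check of how dilation enlarges the receptive field (the `DSConv2d` sketch above exposes a `dilation` argument), the effective kernel size of a $k \times k$ filter with dilation rate $d$ is $k + (k-1)(d-1)$; the kernel size of 3 used below is only an example.

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k filter with dilation rate d."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 3):
    print(d, effective_kernel(3, d))  # 1 -> 3, 2 -> 5, 3 -> 7
```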
2.4. Multi-Scale Depthwise Separable Convolution Module
Based on the two predefined graphs, we construct the MDSC module to achieve a balance between feature propagation and extraction, which complements the spatial fine-grained information. Since GCN can capture spatial relationships from graphical data (presented in
Section 2.2), two parallel GCNs are used to extract inter-station correlations from the two graphs. The GCN model captures the topological relationship between a central node and its neighboring nodes to extract the spatial features of the network, which mainly involves self-looping, neighbor aggregation, and normalization, as illustrated in
Figure 6.
Taking the central node in
Figure 6 as an example, a node’s neighbors are the nodes directly connected to it in the graph. Self-looping means that the central node also considers its own features when aggregating information from its neighbors during convolution operations. Subsequently, the GCN aggregates information from these neighboring nodes to update the features of the central node. Finally, the GCN normalizes the adjacency matrix to ensure the stability and controllability of information transmission. In this study, the GCN performs the convolution operation in the spatial domain, i.e., it is a spatial-based GCN. It performs convolution on the adjacency matrices of the network structure correlation graph and the passenger flow similarity graph. Each station accumulates and assigns weights to its own features and its adjacent stations’ features. The functionality of the GCN can be mathematically expressed as shown in Equation (6).
$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $H^{(l)}$ and $H^{(l+1)}$ represent the feature matrices of the $l$-th layer and the $(l+1)$-th layer, respectively. $\hat{A} = A + I_N$ represents the adjacency matrix with self-connections, where $A$ is the predefined adjacency matrix of the network structure correlation graph or the passenger flow similarity graph, and $I_N$ is the identity matrix. $\hat{D}$ is the degree matrix of the graph. $W^{(l)}$ represents the weight matrix of the $l$-th layer, and $\sigma$ is the sigmoid activation function.
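A compact sketch of the propagation rule in Equation (6) is given below; the symmetric normalization follows the equation above, while the feature dimensions and the random adjacency used for the shape check are assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One spatial GCN layer: H' = sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # weight matrix W

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)  # add self-connections
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt            # normalized adjacency
        return torch.sigmoid(A_norm @ self.linear(H))

# Hypothetical example: 80 stations with 16-dimensional station features
A = (torch.rand(80, 80) > 0.9).float()
A = torch.maximum(A, A.t())          # make the toy adjacency symmetric
H = torch.randn(80, 16)
print(GCNLayer(16, 32)(H, A).shape)  # torch.Size([80, 32])
```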
However, several studies have shown that increasing the number of GCN layers not only leads to higher computational cost during backpropagation but can also lead to vanishing gradients. Therefore, we feed the inter-station correlation features obtained through the GCN into the multi-scale convolution to extract spatial features at multiple scales. A single convolutional kernel cannot capture dependencies at various scales, whereas several convolutional kernels of different sizes can be viewed as feature extractors that capture different levels of features. Multi-scale convolution is able to extend the multidimensional relationships between input and output, as shown in Figure 7, and consists of four branches combining single convolutional layers with different kernel sizes and a maximum pooling layer. Each branch uses a $1 \times 1$ filter to compress the number of channels and improve the nonlinear fitting ability of the model.
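An Inception-style sketch of the multi-scale convolution described above is shown below: parallel branches with different kernel sizes plus a max-pooling branch, each using a $1 \times 1$ compression; the specific kernel sizes, the number of branch channels, and concatenation as the merge operation are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Four parallel branches with different receptive fields, concatenated on channels."""
    def __init__(self, in_ch, branch_ch=8):
        super().__init__()
        def conv_branch(k):
            # 1x1 compression followed by a k x k convolution (kernel sizes assumed).
            return nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(branch_ch, branch_ch, k, padding=k // 2))
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))
        self.b2 = conv_branch(3)
        self.b3 = conv_branch(5)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, branch_ch, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.pool(x)], dim=1)

# Shape check with assumed dimensions
y = MultiScaleConv(16)(torch.randn(2, 16, 4, 80))
print(y.shape)  # torch.Size([2, 32, 4, 80])
```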
2.5. AttBiGRU
In view of the various influences that may lead to changes in spatiotemporal correlations, we have embraced a weighted fusion methodology in our framework to amplify the important information from the input data and avoid limitations arising from over-reliance on a single feature, as shown in Equation (7).
$$Y_{\mathrm{fusion}} = w_1 \odot Y_{R} + w_2 \odot Y_{N} + w_3 \odot Y_{P}$$
where $Y_{\mathrm{fusion}}$ represents the fused output of the multiple branches; $Y_{R}$, $Y_{N}$, and $Y_{P}$ represent the outputs of the RDSC modules, the MDSC module based on $G_s$, and the MDSC module based on $G_p$, respectively; $w_1$, $w_2$, and $w_3$ are the weight parameters of each branch, which are automatically updated through backpropagation during the training process; and $\odot$ represents the Hadamard product.
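A small sketch of the weighted fusion in Equation (7) follows; treating the branch weights as learnable scalars broadcast over the branch outputs is an assumption about their shape.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable weighted sum of branch outputs via element-wise products."""
    def __init__(self, num_branches=3):
        super().__init__()
        # One learnable weight per branch, updated by backpropagation;
        # scalar weights broadcast over all elements are an assumption.
        self.w = nn.Parameter(torch.ones(num_branches))

    def forward(self, branch_outputs):
        return sum(w * y for w, y in zip(self.w, branch_outputs))

# Outputs of the RDSC branch and the two MDSC branches (equal shapes assumed)
y_r, y_n, y_p = (torch.randn(2, 80, 32) for _ in range(3))
print(WeightedFusion()([y_r, y_n, y_p]).shape)  # torch.Size([2, 80, 32])
```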
The AttBiGRU takes the fused features $Y_{\mathrm{fusion}}$ as input and provides powerful bidirectional modeling and prediction capability through the combination of an attention mechanism and BiGRU. As a successor to the RNN in sequence modeling, the gated recurrent unit (GRU) inherits its advantages and has the potential to surpass it in various applications. Nevertheless, the unidirectional nature of GRU constrains its access to global information, potentially leading to losses or errors in information accumulation. The bidirectional gated recurrent unit (BiGRU) addresses this by processing input sequences in both the forward and reverse directions, enabling it to identify critical factors or patterns that might be missed when processing the data in a single direction. The attention mechanism of AttBiGRU captures key bidirectional dependencies and context information, allowing the model to maintain long-term memory of useful information in prediction tasks, as depicted in
Figure 8.
The fused features are passed through the AttBiGRU to obtain the output $H$. Then, via flattening and a fully connected layer, we obtain the final predicted passenger flow results, as shown in Equation (8):
$$\hat{Y} = \mathrm{FC}(H) = W_{fc}\,\mathrm{Flatten}(H) + b_{fc}$$
where $H$ is the output of the AttBiGRU, $\mathrm{FC}(\cdot)$ denotes the fully connected network, and $W_{fc}$ and $b_{fc}$ are the weights and biases of the fully connected layer, respectively.
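Finally, a minimal sketch of a BiGRU with attention followed by the fully connected output layer is given below; the additive attention form, hidden sizes, and tensor shapes are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class AttBiGRU(nn.Module):
    """Bidirectional GRU with attention over time, followed by a fully connected head."""
    def __init__(self, in_dim, hidden_dim, num_stations):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)   # attention scoring (assumed form)
        self.fc = nn.Linear(2 * hidden_dim, num_stations)

    def forward(self, x):                 # x: (batch, time, features)
        h, _ = self.bigru(x)              # (batch, time, 2 * hidden)
        alpha = torch.softmax(self.score(h), dim=1)  # attention weights over time
        context = (alpha * h).sum(dim=1)  # weighted sum of hidden states
        return self.fc(context)           # predicted inflow/outflow for all stations

# Fused features: batch of 2, 12 time steps, 64-dimensional features (shapes assumed)
pred = AttBiGRU(64, 32, 80)(torch.randn(2, 12, 64))
print(pred.shape)  # torch.Size([2, 80])
```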
4. Conclusions
As an important aspect of urban transportation planning and management, accurate passenger flow prediction is crucial for ensuring efficient and reliable metro services. Predicting metro passenger flow is a difficult task due to its intricacies and uncertainties. In this study, we propose the MRDSCNN model, based on a residual structure, multi-scale convolution, and AttBiGRU, to predict passenger inflow and outflow in the metro. Considering changes in passenger flow at different time scales, we take three different passenger flow patterns as input to the residual network and capture their spatiotemporal characteristics. Moreover, we develop two graphs based on prior knowledge and model the inter-station interaction correlations through multi-scale convolution operations. Then, the AttBiGRU is utilized for global information fusion. The proposed model is evaluated on Hangzhou metro passenger flow data, and the results demonstrate that it effectively captures the global dynamic dependencies of the entire metro system. In the future, we will explore more inter-station correlations and external factors (such as weather and emergencies) to further improve the accuracy of passenger flow prediction, as well as extend the application to other urban transportation domains.