Article

ARFGCN: Adaptive Receptive Field Graph Convolutional Network for Urban Crowd Flow Prediction

Genan Dai, Hu Huang, Xiaojiang Peng, Bowen Zhang and Xianghua Fu
1 College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
2 Guangdong Key Laboratory for Intelligent Computation of Public Service Supply, Shenzhen 518055, China
3 Shenzhen Graduate School, Peking University, Shenzhen 518055, China
* Authors to whom correspondence should be addressed.
Mathematics 2024, 12(11), 1739; https://doi.org/10.3390/math12111739
Submission received: 29 April 2024 / Revised: 22 May 2024 / Accepted: 30 May 2024 / Published: 3 June 2024
(This article belongs to the Section Mathematics and Computer Science)

Abstract

Urban crowd flow prediction is an important task for transportation systems and public safety. While graph convolutional networks (GCNs) have been widely adopted for this task, existing GCN-based methods still face challenges. Firstly, they employ fixed receptive fields, failing to account for urban region heterogeneity where different functional zones interact distinctly with their surroundings. Secondly, they lack mechanisms to adaptively adjust spatial receptive fields based on temporal dynamics, which limits prediction performance. To address these limitations, we propose an Adaptive Receptive Field Graph Convolutional Network (ARFGCN) for urban crowd flow prediction. ARFGCN allows each region to independently determine its receptive field size, adaptively adjusted and learned in an end-to-end manner during training, enhancing model prediction performance. It comprises a time-aware adaptive receptive field (TARF) gating mechanism, a stacked 3DGCN, and a prediction layer. The TARF aims to leverage gating in neural networks to adapt receptive fields based on temporal dynamics, enabling the predictive network to adapt to urban regional heterogeneity. The TARF can be easily integrated into the stacked 3DGCN, enhancing the prediction. Experimental results demonstrate ARFGCN’s effectiveness compared to other methods.

1. Introduction

Urban crowd flow, characterized by inflow and outflow, refers to the movement of people entering and leaving various regions within a city over specific time intervals [1]. Accurate urban crowd flow prediction has gained substantial importance due to its far-reaching implications across diverse social and economic domains [2]. Predicting crowd flow across different urban regions is pivotal for optimizing resource allocation, mitigating congestion, and enhancing emergency response capabilities. For example, it enables governments to implement effective and timely measures for public safety during urban events. Moreover, ride-sharing platforms can leverage such predictions to efficiently dispatch vehicles to regions with high anticipated demand.
Urban crowd flow prediction is a highly intricate task, as it necessitates not only the forecasting of temporal sequences but also the consideration of intricate spatial dependencies. Over the years, numerous approaches have been proposed to tackle this problem. Traditional time-series prediction methods, such as ARIMA [3] and SARIMA [4], mainly focus on analyzing temporal dimensions while neglecting spatial correlations. With the recent advancements in deep learning techniques, deep neural networks have introduced novel perspectives and methodologies for urban crowd flow prediction. One category of these approaches employs convolutional neural networks (CNNs), which represent crowd flow data as regular grids and build CNN models on them for forecasting [5,6,7]. However, CNN-based methods are restricted to regular spatial grid data, limiting their applicability to real-world problems. Recently, Graph Convolutional Networks (GCNs) have been widely adopted due to their ability to effectively capture information from irregular regions, as graph structures provide a powerful representation of such data [8,9,10,11,12]. GCN-based methods partition cities according to the actual usage (e.g., commercial, residential) and geographical features, conforming to urban geospatial characteristics and road network structure [13]. Moreover, functionally or geographically consistent partitioning may yield higher-quality inputs for predictive modeling. In these approaches, the entire city is represented as a graph comprising multiple regions, where each region is modeled as a node, and the inter-regional flow changes or geographical proximities between regions are encoded as edges. This graph structure naturally captures the spatial relationships among various urban regions. Consequently, GCN-based methods have gradually emerged as dominant in this domain owing to their inherent ability to effectively model and leverage these spatial dependencies.
Despite the effectiveness of prior GCN-based methods, urban crowd flow prediction in the real world remains challenging for several reasons: (1) Current GCN layers apply a fixed receptive field to all regions. This overlooks urban heterogeneity, as distinct functional zones, such as commercial, residential, and educational regions, interact differently with their surroundings. Fixed receptive fields may introduce too much or too little information for certain regions, which can degrade prediction performance. (2) Crowd flow depends on both spatial and temporal correlations, yet current methods lack mechanisms to adaptively adjust spatial receptive fields based on temporal dynamics. For instance, a business district requires a larger receptive field during peak hours on workdays to capture traffic fluctuations, but depends less on its surroundings during off-peak hours or weekends.
To address the aforementioned limitations, in this paper, we propose an Adaptive Receptive Field Graph Convolutional Network (ARFGCN) for urban crowd flow prediction. The purpose of ARFGCN is to allow each region to independently determine its receptive field size, which is adaptively adjusted and learned in an end-to-end manner during training to enhance model generalizability. The proposed ARFGCN model consists of three main components: a stacked 3DGCN, a time-aware adaptive receptive field (TARF) gating mechanism, and a prediction layer. Specifically, the stacked 3DGCN consists of multiple 3DGCN layers [13], which are used to learn complex spatio-temporal correlations, facilitating the simultaneous capture of temporal and spatial dependencies. The TARF aims to leverage gating in neural networks to adapt receptive fields based on temporal dynamics, enabling the predictive network to adapt to urban regional heterogeneity. The TARF can be easily integrated into the stacked 3DGCN, enhancing the prediction.
The main contributions of our work are summarized as follows:
  • We propose a novel framework (ARFGCN) for urban crowd flow prediction. To the best of our knowledge, this is the first approach to simultaneously consider dynamic receptive fields in both spatial and temporal dimensions.
  • We propose a time-aware adaptive receptive field gating mechanism to enable each region to independently and adaptively determine its receptive field size, considering temporal dynamics to capture intricate spatial dependencies.
  • We conduct extensive experiments on two real-world datasets to evaluate the effectiveness of ARFGCN for urban crowd flow prediction. The empirical findings validate that the proposed ARFGCN achieves notable improvements over the benchmark methods.
The structure of this paper is as follows. Section 2 provides a review of the literature, covering both traditional and recent approaches to urban crowd flow prediction. Section 3 gives the problem definition. Section 4 details the methodology of the proposed model. The experimental setup, including the datasets, benchmark methods, evaluation metrics, and implementation details used, is presented in Section 5. Section 6 presents the results and analysis. Finally, Section 7 concludes this paper and discusses future work.

2. Related Works

This section presents a review of the existing research on crowd flow prediction, categorizing the methods into three classes: traditional methods, CNN-based methods, and GCN-based methods.
Traditional methods predominantly employ machine learning methods such as Autoregressive Integrated Moving Average (ARIMA) [3], Space-Time ARIMA (STARIMA) [4], Vector Autoregression (VAR) [14], Hidden Markov Models [15], and Gaussian Processes [16]. ARIMA [3] is a classic time-series forecasting method that relies on autocorrelation within historical data to predict future trends. STARIMA [4] extends ARIMA by incorporating the influence of neighboring areas, adapting it for spatio-temporal data. VAR [14] extends univariate regression models to multivariate time-series autoregression but requires a substantial number of parameters, leading to high computational costs. The aforementioned methods provide suboptimal predictions, as they cannot effectively capture the nonlinear and complex spatio-temporal relationships.
Recent advancements in deep learning have led to the development of numerous models for predicting crowd flow [17]. Given the advantages of CNNs in capturing image features, urban crowd flow prediction frequently employs CNNs [18,19,20,21] to capture the spatial correlations of crowd flows in surrounding areas. Zhang et al. introduced the DeepST model [22], marking the first application of CNNs to urban crowd flow prediction. This method models urban crowd flow at each time interval as an image and samples data across different time scales (e.g., hourly, daily, weekly) to generate sequences at three temporal granularities. However, the predictive performance of this approach is limited by the number of convolutional layers: as the number of layers increases, the performance rapidly deteriorates. Zhang et al. proposed ST-ResNet [5], an improvement on DeepST. ST-ResNet incorporates residual neural networks to address the problem of network degradation when a deep neural network has too many hidden layers. Owing to the effectiveness of the ST-ResNet model, numerous enhancements have since been developed. DeepSTN+ [6] further improves the residual units in ST-ResNet by replacing the convolutional layers with ConvPlus units, which better capture spatial associations in distant areas. To address the inefficiency of ST-ResNet in learning global spatial dependencies, Liang et al. introduced DeepLGR [20], which utilizes spatial pyramid pooling to efficiently aggregate regional features for capturing global spatial dependencies. Addressing the insufficient capture of spatio-temporal correlations by ST-ResNet, MST3D [23] replaces the 2D CNNs in the ST-ResNet model with 3D CNNs to better learn the spatio-temporal correlations in the data simultaneously. STCNN, a spatio-temporal convolutional neural network based on ConvLSTM, was proposed in [7] to address long-term traffic prediction challenges. GeoMAN [24] is a multi-level attention mechanism-based recurrent neural network designed to model the dynamic spatio-temporal characteristics of sensor data. Liu et al. [25] combined ConvLSTM with attention mechanisms to propose the ACFM model for predicting urban crowd flow. However, CNN-based methods are restricted to operating on regular spatial grids, rendering them impractical for real-world applications, where meaningful spatial units, such as street blocks, are more relevant. To address this limitation, GCN has emerged as a promising approach for modeling non-grid spatial correlations.
Recently, GCN-based methods [8,9,10,11,12] have been proposed for urban crowd flow prediction. Yu et al. [26] proposed STGCN, which pioneered the application of graph neural networks to spatio-temporal prediction. The work in [27] further improved on STGCN by incorporating the attention mechanism into the spatio-temporal convolutional module, proposing the ASTGCN model. Similar to ST-ResNet, ASTGCN adopts a three-branch network architecture, individually modeling the three temporal attributes of traffic flow: closeness, periodicity, and trend. MVGCN [28] utilizes data at multiple time scales to predict future crowd flow in regions. Addressing the limitation of existing traffic prediction models in balancing long-term and short-term prediction tasks, Huang et al. [29] proposed a novel graph convolutional network called LSGCN, which enhances STGCN. This method simplifies the model structure to reduce accumulated errors in iterative prediction and adopts a more efficient network architecture to capture spatio-temporal features. DCRNN [30] simulates traffic flow as a diffusion process, employing diffusion convolution to capture spatial dependencies and improving GRU by replacing matrix multiplication with diffusion convolution to capture spatio-temporal characteristics. The authors of [31] argued that explicit graph structures may not necessarily reflect true dependencies, and they proposed the AGCRN model that is capable of automatically capturing spatio-temporal relationships without predefined graph structures. For predicting passenger flow in urban rail transit systems, He et al. introduced MGC-RNN [32], which leverages multiple graphs to encode the spatial correlations and other heterogeneous inter-station relationships. To effectively address multivariate correlation-aware multi-scale traffic flow prediction, Wang et al. proposed MC-STGCN [33], which employs cross-scale spatial-temporal feature learning and fusion techniques to capture spatio-temporal correlations. 3DGCN [13] generalizes 3D CNNs from structured data to graph structures, capturing the spatio-temporal correlations in graph data. However, these methods often employ a fixed receptive field for all regions, overlooking urban heterogeneity and leading to decreased prediction performance.
In addition to the aforementioned deep learning-based prediction methods, integrating constraint conditions, particularly those concerning the fundamental diagram of pedestrian movement, into model predictions is becoming increasingly important [34]. Considering different scenarios, the construction of network models that account for group dynamics or panic in crowd behavior is gaining significant attention [35].

3. Problem Overview

Definition 1
(Irregular Region). Regions refer to a set of non-overlapping areas in a city partitioned based on road networks, as in previous work [13]. Let $V = \{ v_i \mid i = 1, 2, \ldots, |V| \}$ denote the set of partitioned regions with irregular sizes and geometries. The road networks are composed of multiple levels, dividing the city into $|V|$ distinct regions $v_i$.
Definition 2
(Inflow/Outflow). The inflow and outflow of the $i$-th region $v_i$ at the $t$-th time interval are defined as follows:

$$x_{t,i,\mathrm{in}} = \sum_{Tr \in P} \left| \{ m > 1 \mid g_{m-1} \notin v_i \ \mathrm{and} \ g_m \in v_i \} \right|, \quad x_{t,i,\mathrm{out}} = \sum_{Tr \in P} \left| \{ m > 1 \mid g_{m-1} \in v_i \ \mathrm{and} \ g_m \notin v_i \} \right|,$$

where $P$ is the set of trajectories in the $t$-th time interval, $Tr: g_1 \to g_2 \to \cdots \to g_{|Tr|}$ is a trajectory in $P$, and $g_m \in v_i$ means that point $g_m$ lies within region $v_i$.
Definition 3
(OD Flow). Besides the inflow and outflow, we define the origin-destination (OD) flow as the number of people who move from one region to another at a given time interval. The OD flow from region $v_i$ to region $v_j$ at the $t$-th time interval, denoted by $p_{t,i,j}$, is also obtained from the trajectory set $P$ as

$$p_{t,i,j} = \sum_{Tr \in P} \left| \{ m > 1 \mid g_{m-1} \in v_i \ \mathrm{and} \ g_m \in v_j \} \right|$$

Thus, $P_t \in \mathbb{R}^{|V| \times |V|}$ represents all directional OD flows at the $t$-th time interval, where $|V|$ denotes the number of regions. Specifically, the sum of OD flows toward region $v_i$ represents its inflow, and the sum of OD flows from region $v_i$ represents its outflow at the same time interval.
Task definition. The purpose of urban crowd flow prediction is to estimate the inflow and outflow of urban regions based on historical data. Given the historical crowd flows $(X_{t-T+1}, \ldots, X_{t-1}, X_t)$, the historical OD flows $(P_{t-T+1}, \ldots, P_{t-1}, P_t)$, and POI information as input, the goal is to predict the crowd flows $(X_{t+1}, \ldots, X_{t+s})$ of regions within the next $s$ time intervals, where $T$ is the length of the input sequence.
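To make Definitions 2 and 3 concrete, the following is a minimal Python sketch of how inflow, outflow, and OD counts can be tallied from trajectories for a single time interval; the trajectory representation (a sequence of region indices) is an illustrative assumption, not the preprocessing pipeline used in the paper.

```python
import numpy as np

def flow_counts(trajectories, num_regions):
    """Tally inflow, outflow, and OD counts for one time interval.

    trajectories: list of trajectories, each a list of region indices
                  (the region containing each consecutive point g_m).
    Returns (inflow, outflow, od) with od[i, j] = OD flow from region i to j.
    """
    inflow = np.zeros(num_regions)
    outflow = np.zeros(num_regions)
    od = np.zeros((num_regions, num_regions))
    for traj in trajectories:
        for m in range(1, len(traj)):
            prev, cur = traj[m - 1], traj[m]
            if prev != cur:          # the trajectory crossed a region boundary
                outflow[prev] += 1   # g_{m-1} in v_prev and g_m not in v_prev
                inflow[cur] += 1     # g_{m-1} not in v_cur and g_m in v_cur
                od[prev, cur] += 1   # movement from v_prev to v_cur
    return inflow, outflow, od

# Usage: two toy trajectories over 3 regions within one time interval.
inflow, outflow, od = flow_counts([[0, 0, 1, 2], [2, 1, 1]], num_regions=3)
print(inflow, outflow)
print(od)
```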

4. Method

Figure 1 shows the overall pipeline of ARFGCN, which is described in detail below.

4.1. Graph Construction

In the first step of the ARFGCN framework, we prepare the data to construct a spatio-temporal graph (STGraph) for prediction. First, we treat each region as a node in the STGraph. The inflow and outflow for each region are calculated according to Definition 2 and assigned as node attributes. Second, we construct the edges of the STGraph based on historical OD flows according to Definition 3. Following [13], to leverage OD flows across different time intervals, we categorize dates as weekdays and weekends and divide each day into multiple time intervals. We then obtain the average OD flow patterns within the same time intervals for weekdays and weekends separately from the OD flows. Accordingly, graph topologies tailored to each respective time interval are constructed. Hence, the normalized adjacency matrix $A_p^t$ for time interval $t$ is calculated as
$$A_p^t = D^{-1/2} P_t D^{-1/2}$$

where $D$ is the degree matrix of $P_t$, namely $D_{i,i} = \sum_j P_{t,i,j}$.
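For illustration, the symmetric normalization above can be computed as in the following sketch, assuming the averaged OD flows are stored as a dense NumPy matrix; the guard for zero-degree regions is an added assumption.

```python
import numpy as np

def normalized_adjacency(P_t, eps=1e-8):
    """Symmetric normalization A_p^t = D^{-1/2} P_t D^{-1/2},
    where D is diagonal with D_ii = sum_j P_t[i, j]."""
    deg = P_t.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, eps))  # guard against empty regions
    return (P_t * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# Usage: average OD flows of one weekday time interval (toy 3-region example).
P_t = np.array([[0., 5., 1.],
                [3., 0., 2.],
                [1., 4., 0.]])
A_p_t = normalized_adjacency(P_t)
```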

4.2. Stacked 3DGCN

The stacked 3DGCN consists of multiple 3DGCN layers [13], which are used to learn complex spatio-temporal correlations, facilitating the simultaneous capture of temporal and spatial dependencies. Its convolutional field encompasses both spatial and temporal views, while its aggregator component enables accurate aggregation of related information from temporal and spatial neighbors. This capability aids in more effectively modeling temporal and spatial correlations simultaneously.
Formally, given the input of the $l$-th layer $H^{(l-1)t}$, the output $H^{(l)t} \in \mathbb{R}^{|V| \times C_{out}}$ of the $l$-th 3DGCN layer at time interval $t$ is given by

$$H^{(l)t} = \sigma\left( \sum_{\tau=-T}^{T} \left( H^{(l-1)(t+\tau)} W^{(l)}_{0,\tau+T} + A^{t+\tau} H^{(l-1)(t+\tau)} W^{(l)}_{1,\tau+T} \right) \right)$$

where $\sigma$ is an activation function, $W^{(l)} \in \mathbb{R}^{2 \times (2T+1) \times C_{in} \times C_{out}}$ is the 3D convolutional kernel, $2T+1$ is the temporal size of the kernel, and $A^{t+\tau}$ is the weighted adjacency matrix at time interval $t+\tau$. Subsequently, in order to better account for region heterogeneity, 3DGCN can be enhanced with a node-based partitioning approach based on different region types. Specifically, regions are categorized into $K$ classes based on their POI information: we adopt K-means to cluster the regions and choose the ones close to each cluster centroid as labels, and then utilize a two-layer GCN to classify the regions into $K$ classes in a semi-supervised way. The node partition-enhanced 3DGCN method can be expressed as

$$H^{(l)t} = \sigma\left( \sum_{\tau=-T}^{T} \left( H^{(l-1)(t+\tau)} W^{(l)}_{0,\tau+T} + \sum_{k=1}^{K} A_k^{t+\tau} H^{(l-1)(t+\tau)} W^{(l)}_{k,\tau+T} \right) \right)$$

where $W^{(l)} \in \mathbb{R}^{(K+1) \times (2T+1) \times C_{in} \times C_{out}}$ is the 3D convolutional kernel.
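The following simplified PyTorch sketch illustrates the basic 3DGCN propagation in the first equation above (without the node-partition extension); the module name, tensor shapes, and the clamping of the temporal window at sequence borders are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Simple3DGCNLayer(nn.Module):
    """One 3DGCN-style layer: for each output time step t, sum a self term and a
    graph-propagated term over a temporal window of 2T+1 neighboring time steps,
    each offset with its own weight matrix."""

    def __init__(self, c_in, c_out, T=1):
        super().__init__()
        self.T = T
        # W[0, tau] for the self term, W[1, tau] for the neighbor term.
        self.W = nn.Parameter(torch.randn(2, 2 * T + 1, c_in, c_out) * 0.01)

    def forward(self, H, A):
        # H: (num_steps, num_regions, c_in); A: (num_steps, num_regions, num_regions)
        num_steps = H.shape[0]
        out = []
        for t in range(num_steps):
            acc = 0
            for tau in range(-self.T, self.T + 1):
                s = min(max(t + tau, 0), num_steps - 1)            # clamp at sequence borders
                acc = acc + H[s] @ self.W[0, tau + self.T]         # self term
                acc = acc + A[s] @ H[s] @ self.W[1, tau + self.T]  # neighbor term
            out.append(torch.relu(acc))
        return torch.stack(out)            # (num_steps, num_regions, c_out)

# Usage: 5 historical steps, 82 regions, 2 input channels (inflow/outflow).
H = torch.randn(5, 82, 2)
A = torch.rand(5, 82, 82)
layer = Simple3DGCNLayer(c_in=2, c_out=32, T=1)
print(layer(H, A).shape)   # torch.Size([5, 82, 32])
```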

4.3. TARF

Following [13], stacking multiple 3DGCN layers progressively increases the receptive field of each region. However, simply stacking more 3DGCN layers to expand the receptive field would result in uniform receptive field sizes across all regions, thus limiting the model’s flexibility and predictive performance. To address this limitation, we propose a time-aware adaptive receptive field gating unit to learn region-specific receptive fields. Figure 2 shows the framework of the TARF.
Specifically, inspired by [36], we score each region to determine whether its receptive field needs further expansion. Unlike the work in [36], our scoring rules additionally take into account the temporal influence on crowd flows. For a given region, the extent to which it is affected by other regions varies over time, implying that its receptive field size should be different across different time periods.
Formally, for region $v_i$ at time $t$, its output from the $l$-th 3DGCN layer is denoted as $h_i^{(l)t} \in \mathbb{R}^{T \times C_{out}}$. To compute the receptive field, we first define the receptive field score $s_i^{(l)t}$ for region $v_i$ at the $l$-th layer and at time $t$ as

$$s_i^{(l)t} = \delta\left( W\left( h_i^{(l)t} \odot Q^t \right) + b \right)$$

where $\delta$ is the activation function, $W$ and $b$ are linear transformation parameters, $\odot$ denotes element-wise multiplication, and $Q^t \in \mathbb{R}^{T \times C_{out}}$ is the time-aware parameter used to adjust the receptive field across different time periods. Subsequently, we define the time-aware adaptive receptive field gate $g_i^{(l)t}$ for region $v_i$ at the $l$-th layer as the sum of its receptive field scores up to layer $l$:
$$g_i^{(l)t} = \sum_{l'=1}^{l} s_i^{(l')t}$$
Subsequently, we leverage the time-aware adaptive receptive field gate to update the output $z_i^{(l)t}$ for region $v_i$ at the $l$-th layer:

$$z_i^{(l)t} = g_i^{(l)t} h_i^{(l)t} + \left( 1 - g_i^{(l)t} \right) h_i^{(l-1)t}$$

where $g_i^{(l)t}$ denotes the time-aware adaptive receptive field gate for region $v_i$ at the $l$-th layer, $h_i^{(l)t}$ represents the output of region $v_i$ from the $l$-th 3DGCN layer, and $h_i^{(l-1)t}$ corresponds to the output of region $v_i$ from the $(l-1)$-th layer.
When the gate $g_i^{(l)t}$ exceeds the threshold $1-\epsilon$, the receptive field ceases to expand further, where $\epsilon$ is a hyperparameter. Additionally, to ensure that the dynamic changes in the receptive field incorporate temporal information, we introduce a maximum receptive field size $L$ by considering historical temporal data. If the cumulative receptive field expansion reaches $L$, the receptive field stops growing. The time-aware adaptive receptive field size $R_i^t$ for region $v_i$ at time $t$ is defined as follows:

$$R_i^t = \min\left\{ L,\ \min\left\{ l : g_i^{(l)t} \geq 1 - \epsilon \right\} \right\}$$
Based on the time-aware adaptive receptive field size $R_i^t$ for region $v_i$ at time $t$, we aggregate the information from neighboring regions within the receptive field range $R_i^t$ to obtain the spatio-temporal features $\hat{z}_i^t$:

$$\hat{z}_i^t = \frac{1}{R_i^t} \sum_{l=1}^{R_i^t} z_i^{(l)t}$$
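A minimal PyTorch sketch of the TARF gating logic described in this subsection is given below, assuming scalar per-region scores, a sigmoid activation for the scoring function, and pre-computed per-layer 3DGCN outputs; the names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class TARFGate(nn.Module):
    """Illustrative sketch of TARF gating over stacked 3DGCN outputs.

    h_layers: list of L tensors, each (num_regions, T, C): per-region outputs of
              the 1st, ..., L-th 3DGCN layers at the current time interval.
    h0:       (num_regions, T, C) layer-0 features (assumed already projected to
              the same channel size as the 3DGCN outputs).
    Q_t:      (T, C) time-aware parameter for the current time interval.
    Returns aggregated features z_hat (num_regions, T, C) and per-region
    receptive field sizes R.
    """

    def __init__(self, T, C, eps=0.05):
        super().__init__()
        self.eps = eps
        self.score = nn.Linear(T * C, 1)   # the W and b of the scoring function

    def forward(self, h_layers, h0, Q_t):
        L = len(h_layers)
        num_regions = h_layers[0].shape[0]
        g = torch.zeros(num_regions, 1, 1)       # cumulative gate g_i^{(l)t}
        reached = torch.full((num_regions,), L)  # first layer where g >= 1 - eps
        z_layers, prev = [], h0
        for l, h in enumerate(h_layers, start=1):
            s = torch.sigmoid(self.score((h * Q_t).flatten(1)))      # score s_i^{(l)t}
            g = g + s.view(num_regions, 1, 1)                        # accumulate scores
            z_layers.append(g * h + (1 - g) * prev)                  # gated output z_i^{(l)t}
            crossed = (g.view(-1) >= 1 - self.eps) & (reached == L)  # newly saturated regions
            reached = torch.where(crossed, torch.full_like(reached, l), reached)
            prev = h
        R = reached                                # R_i^t (defaults to L if never saturated)
        z = torch.stack(z_layers)                  # (L, num_regions, T, C)
        keep = (torch.arange(1, L + 1).view(L, 1) <= R.view(1, -1)).float()
        z_hat = (z * keep.view(L, num_regions, 1, 1)).sum(0) / R.view(-1, 1, 1).float()
        return z_hat, R

# Usage: 3 stacked 3DGCN layers, 82 regions, input window T=5, 32 channels.
h_layers = [torch.randn(82, 5, 32) for _ in range(3)]
gate = TARFGate(T=5, C=32)
z_hat, R = gate(h_layers, torch.randn(82, 5, 32), torch.randn(5, 32))
print(z_hat.shape, R.shape)   # torch.Size([82, 5, 32]) torch.Size([82])
```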

4.4. Prediction Layer

Through the TARF, we can obtain the spatio-temporal features $Z^t = \{ \hat{z}_i^t \mid i = 1, 2, \ldots, |V| \}$ for all $|V|$ regions. Subsequently, the prediction layer employs a temporal convolutional unit to transform the spatio-temporal features $Z^t$ into the prediction $\hat{X}_t$. In particular, for multi-step forecasting, ARFGCN adopts an iterative prediction mechanism, where the predicted output from the previous step serves as the historical observation data for the next prediction step, iteratively performing the forecasting process.
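The iterative multi-step mechanism can be sketched as follows, under the assumption of a one-step model interface; this is a schematic illustration rather than the exact prediction-layer implementation.

```python
import torch

def iterative_forecast(model, history, steps):
    """history: (T, num_regions, 2) past inflow/outflow observations.
    Repeatedly predict one step ahead and feed the prediction back in."""
    window = history.clone()
    preds = []
    for _ in range(steps):
        x_next = model(window)   # one-step prediction, shape (num_regions, 2)
        preds.append(x_next)
        # slide the window: drop the oldest step, append the new prediction
        window = torch.cat([window[1:], x_next.unsqueeze(0)], dim=0)
    return torch.stack(preds)    # (steps, num_regions, 2)
```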

4.5. Loss and Training

To facilitate effective model training, our objective function comprises two components. The first is the $\mathcal{L}_a$ term, which measures the mean squared error (MSE) between the predicted values $\hat{X}_t$ and the ground truth $X_t$:

$$\mathcal{L}_a = \frac{1}{N} \sum_{t=T}^{N} \left\| \hat{X}_t - X_t \right\|_2^2$$
Additionally, to effectively leverage the POI information, we introduce a region classification loss function $\mathcal{L}_c$:

$$\mathcal{L}_c = -\sum_{v_i \in V_L} \frac{1}{\left| V_L^{\omega_i} \right|} \ln\left( Z_{i,\omega_i} \right)$$

where $\omega_i$ denotes the cluster of region $v_i$, and $Z_{i,\omega_i}$ represents the probability that region $v_i$ belongs to cluster $\omega_i$. $V_L$ is the set of all labeled nodes, with a subset $V_L^{\omega_i}$ containing the labeled nodes within cluster $\omega_i$. Since the number of regions in each cluster is imbalanced, $\frac{1}{|V_L^{\omega_i}|}$ serves as a normalization term.
Finally, during training, we incorporate a scaling factor $\gamma$ and sum the two loss terms to obtain the overall loss function for ARFGCN:

$$\mathcal{L} = \mathcal{L}_a + \gamma \mathcal{L}_c$$

We provide the training process of ARFGCN in Algorithm 1. First, given historical crowd flows $\{X_0, X_1, \ldots, X_{n-1}\}$ and historical transition flows $\{P_0, \ldots, P_{n-1}\}$, we construct the STGraph. Second, we train the proposed ARFGCN by optimizing the parameters to minimize the designed loss function, i.e., Equation (13). The adaptive receptive fields are learned in a data-driven manner during training.
Algorithm 1: Training process of ARFGCN
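Consistent with Algorithm 1 and the loss defined above, a minimal training-loop sketch is shown below; the data-loader interface and model signature are assumptions, and the standard cross-entropy call omits the per-cluster normalization of the classification loss for brevity.

```python
import torch
import torch.nn.functional as F

def train_arfgcn(model, loader, epochs=50, lr=1e-3, gamma=0.1):
    """loader yields (history, adjacency, target, labeled_idx, labels):
    history     (T, N, 2)  past inflow/outflow,
    adjacency   (T, N, N)  time-interval-specific normalized OD adjacency,
    target      (N, 2)     next-step ground-truth flows,
    labeled_idx, labels    indices and cluster labels of labeled regions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for history, adjacency, target, labeled_idx, labels in loader:
            pred, region_logits = model(history, adjacency)
            loss_a = F.mse_loss(pred, target)                 # prediction loss L_a
            # region classification loss L_c (per-cluster weighting omitted here)
            loss_c = F.cross_entropy(region_logits[labeled_idx], labels)
            loss = loss_a + gamma * loss_c                    # overall loss L = L_a + gamma * L_c
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```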

5. Experimental Setup

5.1. Experimental Data

Comprehensive experiments are conducted on the BikeNYC dataset [13] and the YellowTaxi dataset, obtained from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (accessed on 1 May 2022), to evaluate the proposed approach. Each dataset comprises three subdatasets: crowd flows, OD flows, and POI information. The POI data are obtained from the OpenStreetMap repository. Following [13], the POI categories for both datasets span nine distinct classes: dining, residential, shopping, educational institutions, nightlife venues, tourism, arts and entertainment, outdoor recreation, and other professional facilities. The details of the datasets are presented in Table 1.

5.2. Baseline Methods

To comprehensively evaluate the efficacy of our proposed model, we compare the ARFGCN model with several baselines for predicting urban crowd flow:
  • HA (Historical Average) [37]: This approach employs the historical average of inflow and outflow as the predicted future crowd flow.
  • VAR (Vector Autoregression) [38]: A data-driven time-series prediction model that captures interdependencies among multiple time series.
  • STGCN [26]: A spatio-temporal prediction method based on GCNs, combining graph convolutions and gated temporal convolutions to model spatial and temporal dependencies.
  • DCRNN [30]: Leverages RNNs to capture temporal dependencies and bidirectional random walks on graphs to model spatial dependencies.
  • MVGCN [28]: A deep learning model for non-grid-based crowd flow prediction, utilizing multi-view data from various time scales.
  • AGCRN [31]: A deep spatio-temporal model capable of automatically capturing spatial and temporal correlations in time-series data without predefined graph structures.
  • 3DGCN [13]: A model for non-grid-based crowd flow prediction that generalizes 3D CNNs from structured data to graph-structured data to capture spatio-temporal correlations.

5.3. Parameters

Following [13,26,29], we employ the root mean squared error (RMSE) and mean absolute error (MAE), which are widely adopted metrics for assessing crowd flow prediction performance. In our experiments, we utilize observations from the previous five time intervals to predict future crowd flow. ARFGCN employs 3 × 3 × 3 convolutional kernels, with 32 kernels per layer. The temporal convolution units in the prediction layer utilize 3 × 3 kernels. For both the BikeNYC and YellowTaxi datasets, the maximum receptive field threshold L is set to 5. The batch size is 32, and the learning rate is 0.001. The Adam optimizer [39] is employed for model training.
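For reference, the RMSE and MAE metrics used here are typically computed as in the following sketch (a standard formulation, not code from the paper).

```python
import numpy as np

def rmse(pred, truth):
    """Root mean squared error over all regions and test intervals."""
    return np.sqrt(np.mean((pred - truth) ** 2))

def mae(pred, truth):
    """Mean absolute error over all regions and test intervals."""
    return np.mean(np.abs(pred - truth))

# Usage with toy predictions and ground truth.
pred = np.array([[10.2, 3.1], [7.8, 5.0]])
truth = np.array([[11.0, 3.0], [8.0, 4.5]])
print(rmse(pred, truth), mae(pred, truth))
```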

6. Experimental Results

In this section, we evaluate the performance of ARFGCN both quantitatively and qualitatively compared to robust baselines.

6.1. Overall Performance

Table 2 and Table 3 present a comparative analysis of the performance of ARFGCN against the baseline methods for single-step and multi-step predictions (specifically for the second and third future time intervals) on the BikeNYC and YellowTaxi datasets, respectively. Optimal results are highlighted in bold, and the second-best performance is underlined.
The results show that our proposed ARFGCN consistently outperformed all baseline methods, achieving new state-of-the-art results across all six experimental setups. Additionally, we observed that the performance improvement of our method increased with the extension of the prediction interval. For instance, in the BikeNYC dataset, the enhancement achieved by the ARFGCN method over a 3 h interval was 8.25 times greater than that observed over a 1 h interval (6.13 times greater in YellowTaxi). This enhancement can be attributed to our method’s TARF, which adaptively learns receptive fields based on region heterogeneity and temporal dynamics, thereby facilitating an effective integration of temporal and spatial attributes and enhancing performance.

6.2. Ablation Study

To examine the contribution of the components of our proposed model, we conducted an ablation study of ARFGCN by removing the time-aware gating mechanism (denoted w/o time). It should be noted that when the entire TARF component was removed, the model reverted to the 3DGCN model. The ablation results are summarized in Table 2 and Table 3. The results demonstrate that the proposed components contribute substantially to the performance improvements of ARFGCN. In particular, removing the TARF leads to a considerable decline in performance, underscoring its significance to the model. This was expected since TARF was designed to adaptively learn receptive fields based on region heterogeneity and temporal dynamics, thereby enhancing the predictive performance of the model.

6.3. The Effect of the Number of Layers

The depth of the 3DGCN, quantified by the number of layers L, is a critical hyperparameter in ARFGCN, markedly affecting the performance of the model. Experiments were conducted to determine the optimal number of layers for the two dataset configurations. The number of layers was varied from 1 to 6 to evaluate performance across different model depths.
As shown in Figure 3a,b, when the number of layers in 3DGCN was restricted to one or two, ARFGCN exhibited suboptimal predictive performance on the BikeNYC and YellowTaxi datasets. This limitation stems from an inadequately small receptive field, where regions only perceive information from first- or second-order neighbors, failing to capture the influence of higher-order neighbors and the associated spatiotemporal correlations. Furthermore, with only one or two layers, ARFGCN struggled to effectively adjust the receptive field across different regions. Conversely, when the layer count of 3DGCN ranged from three to six, ARFGCN’s predictive performance was not significantly sensitive to layer variations. This indicates that even with multiple layers, ARFGCN can adaptively adjust the receptive fields, avoiding the pitfalls of incorporating irrelevant information and subsequent declines in predictive performance. It also verifies the effectiveness of the TARF.

6.4. Analysis of Learned Adaptive Receptive Field

To further demonstrate the efficacy of ARFGCN, we analyzed the learned adaptive receptive field distributions on the BikeNYC and YellowTaxi datasets. As shown in Figure 4, the x-axis indicates the adaptive receptive field size, and the y-axis denotes the percentage of regions corresponding to the respective adaptive receptive field size out of the total number of regions. In this experiment, the number of 3DGCN layers L is set to 5, i.e., the maximum receptive field size is 5. It can be observed on both datasets that the receptive field sizes for the regions are not uniformly 5, validating that different regions can adaptively adopt distinct receptive field sizes through the proposed TARF, rather than a uniform size. Furthermore, on both datasets the proportion of regions with a receptive field of 1 is extremely small, while regions with a receptive field of 5 constitute a relatively large proportion. This indicates that most regions require a larger receptive field to perceive higher-order neighborhood information, reflecting the importance of capturing the influence of higher-order neighbors.

7. Conclusions

This paper proposes an Adaptive Receptive Field Graph Convolutional Network (ARFGCN) for urban crowd flow prediction. ARFGCN enables each region to independently determine its receptive field size, which is adaptively adjusted and learned in an end-to-end manner during training, thereby enhancing model prediction performance. It comprises a time-aware adaptive receptive field (TARF) gating mechanism, a stacked 3D graph convolutional network (3DGCN), and a prediction layer. The TARF leverages gating in neural networks to adapt receptive fields based on temporal dynamics, allowing the predictive network to adapt to urban regional heterogeneity. The TARF can be easily integrated into the stacked 3DGCN, enhancing prediction performance. Experimental results demonstrate that ARFGCN achieves superior performance on the BikeNYC and YellowTaxi datasets. In future work, we plan to incorporate prior geospatial knowledge to further improve the performance of urban crowd flow prediction.

Author Contributions

Conceptualization, G.D. and B.Z.; methodology, G.D.; software, G.D.; validation, G.D., X.P. and H.H.; formal analysis, X.P. and X.F.; writing—original draft preparation, G.D. and B.Z.; writing—review and editing, G.D., H.H., X.F. and B.Z.; visualization, B.Z.; supervision, B.Z. and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 62306184), the Natural Science Foundation of Top Talent of SZTU (No. GDRC202320), and the Research Promotion Project of the Key Construction Discipline in Guangdong Province (No. 2022ZDJS112).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, S.; Miao, H.; Li, J.; Cao, J. Spatio-temporal knowledge transfer for urban crowd flow prediction via deep attentive adaptation networks. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4695–4705. [Google Scholar] [CrossRef]
  2. Xie, P.; Li, T.; Liu, J.; Du, S.; Yang, X.; Zhang, J. Urban flow prediction from spatiotemporal data using machine learning: A survey. Inf. Fusion 2020, 59, 1–12. [Google Scholar] [CrossRef]
  3. Alghamdi, T.; Elgazzar, K.; Bayoumi, M.; Sharaf, T.; Shah, S. Forecasting traffic congestion using ARIMA modeling. In Proceedings of the 2019 15th International Wireless Communications & Mobile Computing Conference, Tangier, Morocco, 24–28 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1227–1232. [Google Scholar]
  4. Duan, P.; Mao, G.; Zhang, C.; Wang, S. STARIMA-based traffic prediction with time-varying lags. In Proceedings of the 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 1–4 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1610–1615. [Google Scholar]
  5. Zhang, J.; Zheng, Y.; Qi, D. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  6. Lin, Z.; Feng, J.; Lu, Z.; Li, Y.; Jin, D. Deepstn+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1020–1027. [Google Scholar]
  7. He, Z.; Chow, C.Y.; Zhang, J.D. STCNN: A spatio-temporal convolutional neural network for long-term traffic prediction. In Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, China, 10–13 June 2019; pp. 226–233. [Google Scholar]
  8. Wen, H.; Lin, Y.; Xia, Y.; Wan, H.; Wen, Q.; Zimmermann, R.; Liang, Y. Diffstg: Probabilistic spatio-temporal graph forecasting with denoising diffusion models. In Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems, Hamburg, Germany, 13–16 November 2023; pp. 1–12. [Google Scholar]
  9. Ke, S.; Pan, Z.; He, T.; Liang, Y.; Zhang, J.; Zheng, Y. AutoSTG+: An automatic framework to discover the optimal network for spatio-temporal graph prediction. Artif. Intell. 2023, 318, 103899. [Google Scholar] [CrossRef]
  10. Deng, P.; Zhao, Y.; Liu, J.; Jia, X.; Wang, M. Spatio-temporal neural structural causal models for bike flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 4242–4249. [Google Scholar]
  11. Wang, T.; Chen, J.; Lü, J.; Liu, K.; Zhu, A.; Snoussi, H.; Zhang, B. Synchronous spatiotemporal graph transformer: A new framework for traffic data prediction. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 10589–10599. [Google Scholar] [CrossRef] [PubMed]
  12. Guo, Z.; Liu, H.; Zhang, L.; Zhang, Q.; Zhu, H.; Xiong, H. Talent demand-supply joint prediction with dynamic heterogeneous graph enhanced meta-learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2957–2967. [Google Scholar]
  13. Xia, T.; Lin, J.; Li, Y.; Feng, J.; Hui, P.; Sun, F.; Guo, D.; Jin, D. 3dgcn: 3-dimensional dynamic graph convolutional network for citywide crowd flow prediction. ACM Trans. Knowl. Discov. Data 2021, 15, 1–21. [Google Scholar] [CrossRef]
  14. Luo, J.; Huang, Y.S.; Weng, Y.S. Design of variable traffic light control systems for preventing two-way grid network traffic jams using timed Petri nets. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3117–3127. [Google Scholar] [CrossRef]
  15. Chen, Z.; Wen, J.; Geng, Y. Predicting future traffic using hidden markov models. In Proceedings of the 2016 IEEE 24th International Conference on Network Protocols (ICNP), Singapore, 8–11 November 2016; pp. 1–6. [Google Scholar]
  16. Le, T.V.; Oentaryo, R.; Liu, S.; Lau, H.C. Local Gaussian processes for efficient fine-grained traffic speed prediction. IEEE Trans. Big Data 2016, 3, 194–207. [Google Scholar] [CrossRef]
  17. Zuo, J.; Zeitouni, K.; Taher, Y.; Garcia-Rodriguez, S. Graph convolutional networks for traffic forecasting with missing values. Data Min. Knowl. Discov. 2023, 37, 913–947. [Google Scholar] [CrossRef]
  18. He, R.; Liu, Y.; Xiao, Y.; Lu, X.; Zhang, S. Deep spatio-temporal 3D densenet with multiscale ConvLSTM-Resnet network for citywide traffic flow forecasting. Knowl.-Based Syst. 2022, 250, 109054. [Google Scholar] [CrossRef]
  19. Mo, J.; Gong, Z.; Chen, J. Attentive differential convolutional neural networks for crowd flow prediction. Knowl.-Based Syst. 2022, 258, 110006. [Google Scholar] [CrossRef]
  20. Liang, Y.; Ouyang, K.; Wang, Y.; Liu, Y.; Zhang, J.; Zheng, Y.; Rosenblum, D.S. Revisiting convolutional neural networks for citywide crowd flow analytics. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, Bilbao, Spain, 13–17 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 578–594. [Google Scholar]
  21. Dai, G.; Hu, X.; Ge, Y.; Ning, Z.; Liu, Y. Attention based simplified deep residual network for citywide crowd flows prediction. Front. Comput. Sci. 2021, 15, 1–12. [Google Scholar] [CrossRef]
  22. Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; Yi, X. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Francisco, CA, USA, 31 October–3 November 2016; pp. 92:1–92:4. [Google Scholar]
  23. Chen, C.; Li, K.; Teo, S.G.; Chen, G.; Zou, X.; Yang, X.; Vijay, R.C.; Feng, J.; Zeng, Z. Exploiting spatio-temporal correlations with multiple 3d convolutional neural networks for citywide vehicle flow prediction. In Proceedings of the 2018 IEEE International Conference on Data Mining, Singapore, 17–20 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 893–898. [Google Scholar]
  24. Liang, Y.; Ke, S.; Zhang, J.; Yi, X.; Zheng, Y. GeoMAN: Multi-level Attention Networks for Geo-sensory Time Series Prediction. In Proceedings of the International Joint Conferences on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3428–3434. [Google Scholar]
  25. Liu, L.; Zhang, R.; Peng, J.; Li, G.; Du, B.; Lin, L. Attentive crowd flow machines. In Proceedings of the ACM Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1553–1561. [Google Scholar]
  26. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar]
  27. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. In Proceedings of the AAAI, Honolulu, HI, USA, 27 January 2019; pp. 922–929. [Google Scholar]
  28. Sun, J.; Zhang, J.; Li, Q.; Yi, X.; Liang, Y.; Zheng, Y. Predicting citywide crowd flows in irregular regions using multi-view graph convolutional networks. IEEE Trans. Knowl. Data Eng. 2022, 34, 2348–2359. [Google Scholar] [CrossRef]
  29. Huang, R.; Huang, C.; Liu, Y.; Dai, G.; Kong, W. LSGCN: Long short-term traffic prediction with graph convolutional networks. In Proceedings of the IJCAI, Yokohama, Japan, 11–17 July 2020; Volume 7, pp. 2355–2361. [Google Scholar]
  30. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2018, arXiv:1707.01926. [Google Scholar] [CrossRef]
  31. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive graph convolutional recurrent network for traffic forecasting. Adv. Neural Inf. Process. Syst. 2020, 33, 17804–17815. [Google Scholar]
  32. He, Y.; Li, L.; Zhu, X.; Tsui, K.L. Multi-graph convolutional-recurrent neural network (MGC-RNN) for short-term forecasting of transit passenger flow. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18155–18174. [Google Scholar] [CrossRef]
  33. Wang, S.; Zhang, M.; Miao, H.; Peng, Z.; Yu, P.S. Multivariate correlation-aware spatio-temporal graph convolutional networks for multi-scale traffic prediction. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–22. [Google Scholar] [CrossRef]
  34. Chattaraj, U.; Seyfried, A.; Chakroborty, P. Comparison of pedestrian fundamental diagram across cultures. Adv. Complex Syst. 2009, 12, 393–405. [Google Scholar] [CrossRef]
  35. Aghabayk, K.; Soltani, A.; Shiwakoti, N. Investigating pedestrians’ exit choice with incident location awareness in an emergency in a multi-level shopping complex. Sustainability 2022, 14, 11875. [Google Scholar] [CrossRef]
  36. Spinelli, I.; Scardapane, S.; Uncini, A. Adaptive propagation graph convolutional network. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4755–4760. [Google Scholar] [CrossRef] [PubMed]
  37. Pan, B.; Demiryurek, U.; Shahabi, C. Utilizing real-world transportation data for accurate traffic prediction. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, Brussels, Belgium, 10–13 December 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 595–604. [Google Scholar]
  38. Chandra, S.R.; Al-Deek, H. Predictions of freeway traffic speeds and volumes using vector autoregressive models. J. Intell. Transp. Syst. 2009, 13, 53–72. [Google Scholar] [CrossRef]
  39. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. The framework of the proposed ARFGCN model for urban crowd flow prediction.
Figure 2. The framework of the TARF.
Figure 3. The effect of the number of layers. (a) BikeNYC. (b) YellowTaxi.
Figure 4. Regional adaptive receptive field distributions on two datasets.
Table 1. Dataset descriptions.

Dataset             BikeNYC                          YellowTaxi
Data type           Bike rent                        Taxi trip
Time span           1 July 2017–30 September 2017    1 January 2022–28 February 2022
Time interval       1 h                              1 h
Number of regions   82                               263
Number of POIs      26,202                           317,445
Table 2. Comparison of ARFGCN with other baseline models on the BikeNYC dataset.

Method      1 h (RMSE / MAE)    2 h (RMSE / MAE)    3 h (RMSE / MAE)
HA          17.05 / 9.97        17.05 / 9.97        17.05 / 9.97
VAR         11.45 / 7.25        16.77 / 10.34       20.63 / 12.60
STGCN       11.73 / 6.49        12.93 / 7.06        15.37 / 7.94
DCRNN        9.85 / 5.88        10.39 / 6.19        12.37 / 7.72
MVGCN        9.64 / 5.65        13.53 / 7.72        13.93 / 8.00
AGCRN       14.67 / 6.49        14.92 / 6.72        15.89 / 7.15
3DGCN        7.76 / 4.81         9.49 / 5.61        11.74 / 6.99
ARFGCN       7.55 / 4.61         8.35 / 5.05         8.85 / 5.34
w/o-time     7.61 / 4.72         8.83 / 5.41         8.95 / 5.58
Table 3. Comparison of ARFGCN with other baseline models on the YellowTaxi dataset.

Method      1 h (RMSE / MAE)    2 h (RMSE / MAE)    3 h (RMSE / MAE)
HA          22.96 / 11.01       22.96 / 11.01       22.96 / 11.01
VAR         23.09 / 12.04       32.43 / 16.74       37.31 / 19.22
STGCN       12.04 / 4.59        14.83 / 5.77        18.46 / 7.10
DCRNN       11.13 / 3.43        17.12 / 4.95        21.96 / 6.29
MVGCN       10.81 / 3.74        12.38 / 4.25        13.23 / 4.52
AGCRN       11.44 / 3.31        11.56 / 3.46        12.15 / 3.61
3DGCN        7.15 / 2.76         9.05 / 3.37        11.43 / 4.02
ARFGCN       6.95 / 2.60         7.81 / 2.86         8.35 / 3.04
w/o-time     7.15 / 2.72         8.16 / 3.05         9.20 / 3.43
