To evaluate the performance of the proposed prediction model, we conduct our case study at Qingdao Jiaodong International Airport (IATA: TAO; ICAO: ZSQD). Three types of data are used: operations and stand-allocation records provided by the airport authority/AODB, meteorological data from CMA METAR/TAF reports (coded according to CMA documentation), and historical capacity statistics. All timestamps are converted to Beijing Time for consistency. A total of 80 stands are distributed along the U-shaped corridor (pier) area; the five corridors contain 10, 10, 20, 20, and 20 stands, respectively. The study covers a 12-month span from 1 January 2023 to 31 December 2023, sampled at a 30 min resolution.
Weather phenomena are extracted from the meteorological messages; message times are converted from Coordinated Universal Time (UTC) to the local time of the airport meteorological observation station (Beijing Time), and the coded reports are semantically transformed into model features. Part of the meteorological data are graded and encoded by referring to official documents of the China Meteorological Administration and expert judgment, using real-valued, integer, and binary (0–1) encodings.
Let T denote the number of half-hourly timestamps after alignment. With look-back L = 24, horizon H = 6, and stride s = 1, a complete 2023 calendar (T = 17,520) yields T − L − H + 1 = 17,491 windows per stand; with N = 80 stands, this gives up to 1,399,280 windows before filtering. We report effective train/validation/test counts after removing windows that overlap data gaps and after chronological splitting (60/20/20, consistent with our experimental setup). A leakage-safe preprocessing pipeline is applied: forward fill (≤2 steps), KNN imputation (k = 5), winsorization at 1%/99%, circular encoding of wind direction, and standardization fitted on the training set only. The actual capacity values are derived from historical hourly arrival/departure statistics at the target airport, preprocessed for noise filtering and time alignment.
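The following sketch illustrates, under stated assumptions, how such a leakage-safe pipeline and window construction could be implemented; the function names, column layout, and DataFrame interfaces are illustrative rather than the project's actual code.

```python
# Minimal sketch of the leakage-safe preprocessing and windowing described above.
# All thresholds and statistics are fitted on the training split only.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df_train: pd.DataFrame, df_eval: pd.DataFrame):
    # 1) Short-gap forward fill (at most 2 consecutive 30 min steps).
    df_train, df_eval = df_train.ffill(limit=2), df_eval.ffill(limit=2)

    # 2) KNN imputation (k = 5) for remaining gaps, fitted on training data only.
    imputer = KNNImputer(n_neighbors=5)
    train_vals = imputer.fit_transform(df_train)
    eval_vals = imputer.transform(df_eval)

    # 3) Winsorization at the 1%/99% quantiles, thresholds taken from training data.
    lo, hi = np.percentile(train_vals, [1, 99], axis=0)
    train_vals, eval_vals = np.clip(train_vals, lo, hi), np.clip(eval_vals, lo, hi)

    # 4) Standardization fitted on the training split only.
    scaler = StandardScaler().fit(train_vals)
    return scaler.transform(train_vals), scaler.transform(eval_vals), scaler

def encode_wind_direction(deg: np.ndarray) -> np.ndarray:
    # Circular encoding: map wind direction (degrees) to (sin, cos) components.
    rad = np.deg2rad(deg)
    return np.stack([np.sin(rad), np.cos(rad)], axis=-1)

def make_windows(x: np.ndarray, look_back: int = 24, horizon: int = 6, stride: int = 1):
    # Sliding windows: T - L - H + 1 windows per stand (17,491 for T = 17,520).
    T = x.shape[0]
    idx = range(0, T - look_back - horizon + 1, stride)
    X = np.stack([x[i:i + look_back] for i in idx])
    Y = np.stack([x[i + look_back:i + look_back + horizon] for i in idx])
    return X, Y
```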
3.1. Experimental Environment and Parameter Configuration
The model implementation was based on Python 3.9 and PyTorch 2.1.0, leveraging PyTorch Geometric for graph-based learning components. Feature importance and selection were performed in the preprocessing phase using XGBoost (v1.7.6) in conjunction with SHAP (v0.41.0), both of which are widely adopted in interpretable machine learning pipelines.
The proposed ST-GTNet model incorporates multi-layer Graph Convolutional Networks (GCNs) followed by temporal encoding through a Transformer architecture. Specifically, the GCN module comprises two stacked layers, each with 64 hidden units and ReLU activation, enabling the extraction of localized spatial dependencies within the airport network topology. The Transformer module consists of two encoder layers, each employing 4 attention heads and a hidden dimension of 128, designed to capture long-range temporal dependencies from the sequence of historical features.
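A minimal sketch of this GCN-plus-temporal-Transformer stack is given below. The layer sizes follow the text (two GCN layers with 64 units and ReLU, two Transformer encoder layers with 4 heads and width 128); the class name, projection layer, and exact wiring are illustrative assumptions rather than the published implementation.

```python
# Hedged architecture sketch: per-time-slice GCN over the stand graph,
# followed by temporal-only self-attention per node.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class STGTNetSketch(nn.Module):
    def __init__(self, in_dim: int = 6, gcn_dim: int = 64, d_model: int = 128,
                 n_heads: int = 4, horizon: int = 6, dropout: float = 0.25):
        super().__init__()
        # Two stacked GCN layers (64 hidden units, ReLU) over the airport stand graph.
        self.gcn1 = GCNConv(in_dim, gcn_dim)
        self.gcn2 = GCNConv(gcn_dim, gcn_dim)
        self.proj = nn.Linear(gcn_dim, d_model)
        # Two Transformer encoder layers (4 heads, width 128) applied along the time axis.
        enc = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                         dropout=dropout, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (L, N, F) look-back window of node features; edge_index: sparse graph edges.
        L, N, _ = x.shape
        h = torch.stack([torch.relu(self.gcn2(torch.relu(self.gcn1(x[t], edge_index)),
                                              edge_index)) for t in range(L)])  # (L, N, 64)
        z = self.proj(h).permute(1, 0, 2)   # (N, L, 128): one temporal sequence per node
        z = self.temporal(z)                # temporal-only self-attention
        return self.head(z[:, -1])          # (N, H): next-H capacity values per node
```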
The model was trained using the Adam optimizer with an initial learning rate of 0.0005. A cosine annealing scheduler with warm restarts was employed to dynamically adjust the learning rate during training, following best practices for spatiotemporal sequence modeling. The batch size was set to 32, and dropout with a rate of 0.25 was applied to mitigate overfitting. The maximum number of training epochs was fixed at 20, with early stopping enabled (patience = 5) based on validation loss to ensure generalization and avoid unnecessary computation.
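The training configuration can be summarized by the sketch below, which matches the stated hyperparameters (Adam, learning rate 5e-4, cosine annealing with warm restarts, batch size 32, dropout 0.25, at most 20 epochs, early stopping with patience 5). The scheduler periods (T_0, T_mult), the data loaders, and the evaluate helper are illustrative assumptions not specified in the text, and batching is schematic.

```python
# Training-loop sketch with the stated optimizer, scheduler, and early stopping.
import torch

model = STGTNetSketch(dropout=0.25)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=5, T_mult=2)
criterion = torch.nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(20):                       # maximum of 20 training epochs
    model.train()
    for xb, yb, edge_index in train_loader:   # batch size 32 configured in the DataLoader
        optimizer.zero_grad()
        loss = criterion(model(xb, edge_index), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()

    val_loss = evaluate(model, val_loader)    # validation loss drives early stopping
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                             # early stopping (patience = 5)
```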
All experiments were repeated with three random seeds to ensure statistical robustness and reproducibility. Model performance was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²) on both takeoff and landing capacity prediction tasks.
3.4. Benchmark Models Comparison
In this section, the performance of the proposed ST-GTNet model is compared with three widely used benchmark models: Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and the Transformer. These models are chosen for their proven success in sequential forecasting tasks. The goal is to demonstrate the advantages of the ST-GTNet in airport capacity forecasting by integrating both spatial and temporal dependencies.
Hardware and protocol: NVIDIA A100 (40 GB); PyTorch 2.1.0; CUDA 12.1; fixed random seed; 50 warm-up iterations; timings averaged over 1000 forward passes, with torch.cuda.synchronize() called before and after the timed region; identical input shapes across models (N ≈ 80, L = 24, batch size 32).
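A minimal sketch of this timing protocol is shown below, reusing the STGTNetSketch class from the earlier architecture sketch; the random input tensors and placeholder edge list are assumptions standing in for real preprocessed windows.

```python
# Latency measurement sketch: fixed seed, 50 warm-up passes, then the mean over
# 1000 forwards with torch.cuda.synchronize() bracketing the timed region.
import time
import torch

torch.manual_seed(0)
device = torch.device("cuda")
model = STGTNetSketch().to(device).eval()
x = torch.randn(24, 80, 6, device=device)                    # one look-back window (L, N, F)
edge_index = torch.randint(0, 80, (2, 320), device=device)   # placeholder sparse stand graph

with torch.no_grad():
    for _ in range(50):                                       # warm-up forwards
        model(x, edge_index)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(1000):                                     # timed forwards
        model(x, edge_index)
    torch.cuda.synchronize()
    ms_per_step = (time.perf_counter() - t0) / 1000 * 1e3
print(f"{ms_per_step:.2f} ms/step")
```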
To compare the differences between predicted values and real values of the ST-GTNet and the benchmark models, three standard indicators are used:
Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$
Mean Absolute Percentage Error (MAPE): $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$
Root Mean Square Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$
where $y_i$ and $\hat{y}_i$ denote the actual and predicted capacity at sample $i$, and $n$ is the number of evaluated samples.
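These indicators correspond to straightforward NumPy implementations, sketched below; the small epsilon guarding MAPE against zero-capacity intervals is an added safeguard, not part of the definitions above.

```python
# NumPy implementations of the three evaluation indicators.
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred, eps=1e-8):
    # Percentage error; eps guards against division by zero-capacity intervals.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / (y_true + eps)))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```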
As shown in Table 5, the ST-GTNet combines Graph Convolutional Networks (GCNs) and the Transformer to capture spatial dependencies between airport gates and terminals while also modeling temporal dependencies across the time-series data of flight schedules. By leveraging both types of dependencies, the ST-GTNet outperforms traditional models, which typically focus only on temporal dependencies.
LSTM is a well-established model that captures long-range temporal dependencies in sequential data through a gating mechanism [22]. While effective for time-series forecasting, it does not model spatial interactions between entities (e.g., gates and terminals) in the airport network, which limits its performance in tasks where spatial context is crucial [23].
The GRU is a simplified version of LSTM, with fewer gates, making it computationally more efficient [24]. It performs similarly to LSTM on many tasks but, like LSTM, does not account for spatial dependencies, which are critical in airport capacity prediction [25].
The Transformer utilizes self-attention to capture long-range temporal dependencies [26]. It is highly efficient in learning complex sequential patterns and has become a dominant model for tasks that require processing long sequences of data. However, like LSTM and the GRU, it does not incorporate spatial relationships and is limited in tasks where spatial context is essential [27].
In addition to the classical sequence models, we incorporated three recent spatiotemporal learning approaches as benchmarks: the MF-Transformer, GAT-LSTM, and the DMCSTN. These models represent the state-of-the-art in dynamic capacity estimation and spatiotemporal modeling.
The MF-Transformer extends the traditional Transformer by incorporating multi-feature fusion across meteorological and operational variables [11].
GAT-LSTM leverages graph attention mechanisms to model spatial dependencies among gates and combines them with temporal learning via LSTM [12].
The DMCSTN (Dynamic Multi-Graph Convolutional Spatiotemporal Network) utilizes dynamic graph convolution along with hierarchical temporal attention to adaptively learn from evolving spatiotemporal patterns [13].
All models were trained using the same set of six input features, with a sliding window of historical observations and consistent gate-level adjacency graphs.
Table 4 summarizes the performance metrics across four representative months. Results show that the MF-Transformer, GAT-LSTM, and the DMCSTN outperform traditional temporal models (LSTM, GRU, Transformer), confirming the advantage of incorporating spatial dependencies and enhanced attention mechanisms. Among them, GAT-LSTM achieves the lowest RMSE among the new benchmarks, but the ST-GTNet consistently surpasses all others, achieving an average improvement of 12.8% in RMSE over GAT-LSTM and 20.2% over the MF-Transformer. These results validate the effectiveness of the ST-GTNet’s unified spatiotemporal representation and interpretable structure, especially under complex and dynamic traffic conditions such as in September and December.
We train three controlled variants under identical settings: w/o the GCN (Transformer-only), w/o the Transformer (GCN-only), and the full ST-GTNet. Metrics are reported per month (Table 6). Given the per-month MAE of each variant, we attribute the improvement of the full ST-GTNet over the single-module baselines via complementary marginal gains: $\Delta_{\mathrm{GCN}} = \mathrm{MAE}_{\text{w/o GCN}} - \mathrm{MAE}_{\text{full}}$ and $\Delta_{\mathrm{Trans}} = \mathrm{MAE}_{\text{w/o Trans}} - \mathrm{MAE}_{\text{full}}$, with attribution weights $\omega_{\mathrm{GCN}} = \Delta_{\mathrm{GCN}}/(\Delta_{\mathrm{GCN}} + \Delta_{\mathrm{Trans}})$ and $\omega_{\mathrm{Trans}} = \Delta_{\mathrm{Trans}}/(\Delta_{\mathrm{GCN}} + \Delta_{\mathrm{Trans}})$.
Results (MAE), using the values in Table 6 (w/o GCN / w/o Transformer / full), are as follows:
March: 1.30/1.38/1.09 → $\omega_{\mathrm{GCN}}$ = 0.42; $\omega_{\mathrm{Trans}}$ = 0.58;
June: 0.75/0.78/0.54 → 0.47/0.53;
September: 1.77/1.91/1.41 → 0.42/0.58;
December: 1.72/1.84/1.39 → 0.42/0.58.
Average: $\omega_{\mathrm{GCN}}$ = 0.43; $\omega_{\mathrm{Trans}}$ = 0.57. RMSE yields a similar split (≈0.42/0.58). These ablations confirm that the GCN and the temporal-only Transformer make complementary contributions.
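The attribution arithmetic can be reproduced directly from the reported MAE values, as in the short sketch below; the helper name and dictionary layout are illustrative.

```python
# Marginal-gain attribution from the ablation MAE values (w/o GCN, w/o Transformer, full).
def attribute(mae_wo_gcn: float, mae_wo_trans: float, mae_full: float):
    d_gcn = mae_wo_gcn - mae_full      # gain from adding the GCN to the temporal model
    d_trans = mae_wo_trans - mae_full  # gain from adding the Transformer to the GCN model
    total = d_gcn + d_trans
    return d_gcn / total, d_trans / total

months = {"Mar": (1.30, 1.38, 1.09), "Jun": (0.75, 0.78, 0.54),
          "Sep": (1.77, 1.91, 1.41), "Dec": (1.72, 1.84, 1.39)}
for month, vals in months.items():
    w_gcn, w_trans = attribute(*vals)
    print(f"{month}: w_GCN = {w_gcn:.2f}, w_Trans = {w_trans:.2f}")
# Mar 0.42/0.58, Jun 0.47/0.53, Sep 0.42/0.58, Dec 0.42/0.58 -> average 0.43/0.57
```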
3.5. Performance Evaluation
To assess the predictive effectiveness of the proposed ST-GTNet model in dynamic airport capacity estimation, a comprehensive evaluation is conducted against six benchmark models (LSTM, the GRU, the Transformer, the MF-Transformer, GAT-LSTM, and the DMCSTN), with the actual capacity series serving as the reference. Four representative months (March, June, September, and December) are selected to capture seasonal variability. Three standard metrics are used: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). The results are summarized in Table 6, and time-series prediction performance is visually illustrated in Figure 5. The actual capacity values are derived from real-world historical hourly arrival/departure rates at the target airport, after data cleaning and temporal alignment.
As shown in Table 6, the ST-GTNet achieves the best performance across all metrics and all months. In March, the model records a MAE of 1.09, RMSE of 1.37, and MAPE of 7.9%, compared to LSTM (1.60, 2.10, 12.1%), the GRU (1.42, 1.78, 10.5%), and the Transformer (1.30, 1.66, 9.2%). Other advanced models such as the MF-Transformer, GAT-LSTM, and the DMCSTN exhibit MAEs between 1.18 and 1.22 and RMSEs between 1.49 and 1.55, still falling short of the ST-GTNet.
In June, a relatively stable operational month, the ST-GTNet achieves outstanding results with a MAE of 0.54, RMSE of 0.68, and MAPE of only 2.4%, significantly outperforming LSTM (0.95, 1.18, 4.3%) and the Transformer (0.75, 0.93, 3.2%). Competing models such as the GAT-LSTM and DMCSTN yield RMSEs of 0.84 and 0.86, respectively, but do not match the overall accuracy of the ST-GTNet.
In September, where dynamic fluctuations are more pronounced, the ST-GTNet maintains superior robustness with a MAE of 1.41, RMSE of 1.77, and MAPE of 12.8%, which are lower than those of the GRU (1.90, 2.39, 17.4%), the Transformer (1.77, 2.21, 15.8%), and the DMCSTN (1.63, 2.01, 14.1%).
In December, the ST-GTNet continues to lead with a MAE of 1.39, RMSE of 1.74, and MAPE of 8.0%, again outperforming the best baselines such as the GAT-LSTM (1.56, 1.94, 8.5%) and DMCSTN (1.59, 1.98, 8.7%).
As shown in Figure 5, on average across the four months, the ST-GTNet achieves substantial reductions in RMSE compared to all baseline models: 34.6% lower than LSTM, 26.3% lower than the GRU, 21.2% lower than the Transformer, 13.2% lower than the DMCSTN, and 11.5% lower than GAT-LSTM. These consistent improvements demonstrate the proposed model's strong generalization capability across both stable and high-variance seasonal conditions.
In addition to the quantitative metrics, Figure 5 presents a full time-series comparison of predicted capacity versus actual values over 1500 timesteps (each representing a 30 min interval) for the four selected months. To enhance interpretability, the figure is presented in two stacked panels. The top panel compares the ST-GTNet with three classical models (LSTM, the GRU, and the Transformer), while the bottom panel shows the ST-GTNet alongside three advanced baselines (the MF-Transformer, GAT-LSTM, and the DMCSTN).
In both panels, the gray curve denotes the actual observed airport capacity, while each colored curve represents the output of a specific model. The ST-GTNet (highlighted in bold red) consistently exhibits the closest alignment with the actual series across all four months. It accurately captures daily periodicity, multi-peak patterns, and sudden capacity changes, particularly during morning ramp-ups and evening slowdowns. In contrast, traditional models like LSTM and the GRU show visible time lags during sharp transitions and tend to oversmooth the amplitude of fluctuations. The Transformer improves short-term tracking but introduces instability in low-capacity zones. Among the advanced models, GAT-LSTM and the DMCSTN reduce lag but still miss peak alignment and underestimate volatility in some intervals.
The performance differences are especially notable during high-variance periods, such as midday peaks in September or morning troughs in December. The ST-GTNet maintains both temporal responsiveness and amplitude accuracy, reinforcing the numerical findings in Table 4 and confirming its robustness for real-time, fine-grained airport capacity forecasting.
Model-level complexity. Let $N$ be the number of gates, $L$ the look-back length, $d$ the hidden width, $k$ the average degree of the sparse graph, and $h$ the number of attention heads. Per inference step (predicting one 30 min horizon vector):
GCN (2 layers, per time slice): $O(Nkd + Nd^2)$, i.e., $O\bigl(L\,(Nkd + Nd^2)\bigr)$ over the look-back window.
Temporal-only Transformer (2 layers): $O(NL^2d + NLd^2)$ per layer; $O(hNL^2)$ memory for attention maps.
By confining attention to the temporal axis and using a sparse physical graph, the ST-GTNet avoids the $O\bigl((NL)^2 d\bigr)$ attention cost of fully spatiotemporal transformers.
Concrete deployment shape (TAO case). $N = 80$, $L = 24$, $d = 64$ for the GCN and $128$ for the Transformer, $h = 4$. The total arithmetic per step is the sum of the GCN and temporal-attention terms above; attention memory ≈ $hNL^2 = 4 \times 80 \times 24^2 \approx 1.8 \times 10^5$ elements (≲1 MB in FP32).
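A back-of-the-envelope check of these terms at the TAO deployment shape is given below; the average graph degree k is an assumed placeholder, since the text does not state it.

```python
# Rough operation and memory counts for the complexity terms above.
N, L, d_gcn, d_model, h, k = 80, 24, 64, 128, 4, 4   # k is an assumed average degree

gcn_macs = 2 * L * N * (k * d_gcn + d_gcn ** 2)            # two GCN layers over L slices
attn_macs = 2 * N * (L ** 2 * d_model + L * d_model ** 2)  # two temporal-attention layers
attn_mem_elems = h * N * L ** 2                            # attention maps

print(f"GCN ~ {gcn_macs / 1e6:.1f} M MACs, attention ~ {attn_macs / 1e6:.1f} M MACs per step")
print(f"attention memory ~ {attn_mem_elems:,} elements "
      f"~ {attn_mem_elems * 4 / 1e6:.2f} MB in FP32")
# ~184,320 attention-map elements (~0.74 MB FP32), consistent with the <1 MB figure above.
```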
Measured latency and resources. On an NVIDIA A100-40GB, the ST-GTNet runs at 0.98 ms/step (batch = 1) with a parameter count of 0.302 M. This is orders of magnitude below the decision cadence (rolling 30 min updates) and leaves ample headroom for what-if simulation or multi-terminal batched scoring [29].
Real-time terminal-area prediction has been emphasized by, e.g., MST-WA (Zeng et al., AEI 2024), which integrates multimodal weather and spatiotemporal dependencies [30]. Our architecture is complementary: by keeping attention temporal-only and spatial modeling GCN-based, we achieve lower asymptotic complexity and lower empirically measured latency while retaining interpretability through SHAP-guided features.
The effectiveness of the ST-GTNet is primarily due to its integration of spatiotemporal graph modeling and temporal attention mechanisms. These enable the model to capture nonlinear dependencies across input features while remaining sensitive to short-term variations. Moreover, the model balances accuracy and computational efficiency, making it suitable for real-time capacity forecasting in airport operations.
In summary, the quantitative results in Table 4, the rolling-inference procedure in Algorithm 3, and the visual alignment shown in Figure 5 confirm that the ST-GTNet significantly outperforms existing models in terms of accuracy, stability, and adaptability, making it a highly effective tool for dynamic airport capacity prediction across diverse operational scenarios.
Algorithm 3. Rolling Inference.
1  loop:
2    x_now ← FetchLatest()
3    x_now ← ForwardFillOne(x_now, max_gap = 2) → KNNImpute(k = 5)
4    x_now ← EncodeWindDir(x_now) → Winsorize(1%, 99%)
5    x_now ← Apply(scaler, x_now) → KeepFeatures(x_now, F_sel)
6    X_ctx ← Push(X_ctx, x_now, maxlen = 24)
7    if len(X_ctx) == 24:
8      H_ctx ← [GCN_stack(X_ctx[t], A) for t = 1..24]
9      Z_ctx ← Transformer_timeonly(H_ctx)
10     Ŷ_next ← DecoderMLP(Z_ctx)   # predicts next H = 6 steps
11     Emit(Ŷ_next); SleepUntilNext30min()
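A runnable Python rendering of this loop is sketched below. The fetch_latest() and emit() stubs and the random edge list stand in for the airport data interfaces and stand-adjacency graph (they are assumptions, not part of the published code), preprocessing is assumed to happen inside fetch_latest(), and STGTNetSketch is the architecture sketch given earlier.

```python
# Rolling-inference sketch: maintain a 24-step context and predict the next 6 steps
# every 30 minutes.
import time
from collections import deque
import torch

model = STGTNetSketch().eval()
edge_index = torch.randint(0, 80, (2, 320))   # placeholder stand-adjacency graph
ctx = deque(maxlen=24)                        # rolling 24-step (12 h) context window

def fetch_latest() -> torch.Tensor:
    # Stub: in deployment this returns the newest preprocessed 30 min feature slice.
    return torch.randn(80, 6)                 # (N stands, F selected features)

def emit(y: torch.Tensor) -> None:
    print("next-6-step capacity per stand:", y.shape)

while True:
    ctx.append(fetch_latest())
    if len(ctx) == 24:                        # full look-back window available
        window = torch.stack(list(ctx))       # (L, N, F)
        with torch.no_grad():
            y_next = model(window, edge_index)   # predicts next H = 6 half-hour steps
        emit(y_next)
    time.sleep(30 * 60)                       # wait for the next 30 min update
```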