Inspired by related work [31,32,35,36], dynamic spatio-temporal learning is divided into three stages. In the first stage, we assume that the taxi time of an individual taxiing segment is directly influenced by the traffic conditions of the segment itself, as well as those of its neighboring taxiing segments [31]. Spatial and temporal relationships between and within taxiing segments are captured using spatial and temporal attention, respectively [35]. This process generates the hidden states of the dynamic spatio-temporal tensor of taxiing segments. In the second stage, a 3D cross spatio-temporal attention network [36] adaptively captures the joint dynamic spatio-temporal correlation between the hidden states of flight properties and the hidden states of the taxiing segments through which the taxi route passes, together with their neighbors. This process generates a higher-level dynamic spatio-temporal representation of the taxi route. In the third stage, the final representation of the taxi route is obtained by adding sequence information to the links using a positional embedding technique [32].
Attention mechanisms are widely used techniques in deep learning. By incorporating attention mechanisms, neural networks can automatically learn to focus selectively on important information in the input, thus enhancing the model's performance and generalization ability. The well-known self-attention mechanism was proposed in the Transformer [38] and consists of a query–key–value (QKV) structure: attention scores are computed from the queries and keys, after which important information is extracted from the values according to these scores. In this module, we use different attention mechanisms to capture the complex spatio-temporal correlations in airport traffic data.
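The QKV structure described above can be sketched as a minimal NumPy example. The batch size, sequence lengths, and embedding dimension below are illustrative placeholders, not the model's actual dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Standard QKV attention: scores from Q·K^T, then a weighted sum of V."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)  # (batch, n_queries, n_keys)
    weights = softmax(scores, axis=-1)              # attention scores
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4, 8))   # 2 samples, 4 queries, embedding dim 8
K = rng.normal(size=(2, 6, 8))   # 6 keys per sample
V = rng.normal(size=(2, 6, 8))   # one value per key
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with the weights given by the softmax-normalized query–key scores.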
Static spatio-temporal tensors of the taxiing segments need to be constructed first; here, we leverage the concept of [31]. The dynamic link hidden states, one per link, incorporate the historical and future traffic-condition features of the taxiing segments (generated by the time series block) and are encoded through a fully connected layer. The static link hidden states are encoded from discrete features, such as the link ID, using a lookup-table technique, and from continuous features, such as link length, using a fully connected layer. The hidden states of temporal information are encoded through a fully connected layer. The dynamic, static, and temporal matrices are then combined into static spatio-temporal matrices by expanding the static and temporal matrices to a common shape.
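The two encoders can be sketched as follows. The link count, embedding dimension, number of time steps, and length feature are hypothetical placeholders; the lookup table is simply a row index into an embedding matrix, and the fully connected layer is a single linear map with a ReLU activation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_links, d = 5, 8            # illustrative: 5 taxiing segments, embedding dim 8

# Lookup table for the discrete link ID: one learnable row per link.
id_table = rng.normal(size=(n_links, d))

# Fully connected layer for continuous features (here: link length only).
W_len, b_len = rng.normal(size=(1, d)), np.zeros(d)

link_ids = np.arange(n_links)
link_length = rng.uniform(50, 300, size=(n_links, 1))   # metres, illustrative

# Static link hidden states: lookup-table embedding + FC-encoded length.
static_h = id_table[link_ids] + np.maximum(link_length @ W_len + b_len, 0)

# Expand the static states along the time axis so they can be combined
# with the dynamic (per-time-step) hidden states into one tensor.
T = 4                                                    # time steps
static_st = np.broadcast_to(static_h[None], (T, n_links, d))
```

The broadcast in the last step mirrors the expansion of the static and temporal matrices to the dynamic matrix's shape described above.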
3.5.1. Spatial Attention and Temporal Attention
(a) Spatial attention
The term “dynamic spatial correlation” describes how the influence weights of other taxiing segments in the airport network change over time, affecting the traffic state of the target taxiing segment. For example, as illustrated in Figure 4, the upstream taxiing segment in light blue may negatively affect the traffic state of the green taxiing segment during rush hour, and this impact may diminish when the congestion eases. To capture these dynamic spatial correlations, we design a spatial attention network, a multi-head graph attention network (MGAT) [39], to adaptively model the spatial correlations between neighboring taxiing segments.
In the airport network, for a taxiing segment at time step k, the set of spatially related taxiing segments includes the segment itself and its neighboring segments at time step k. We then use the following graph attention settings to compute the dynamic spatial attention hidden representation of the taxiing segment.
The static spatio-temporal hidden representation of taxiing segment i at time step k is taken as the query of the attention mechanism. The static spatio-temporal hidden representations of the spatially related taxiing segments are taken as the keys and values of the attention mechanism. To be specific, the attention mechanism is formulated as follows:
where M is the number of heads, the heads' outputs are combined by concatenation, each head's learnable weight matrix is shared by all taxiing segments, a further learnable output weight matrix is likewise shared by all taxiing segments, the exponential function appears in the attention-score normalization, d refers to the embedding dimension, and BN is the batch normalization operation. The dynamic spatial attention hidden representation of taxiing segment i at time step k can then be encoded by Equation (9); the residual connection [40] and batch normalization (BN) [41] are added to prevent feature loss and internal covariate shift.
The initial input of the spatial attention network is the static spatio-temporal hidden representation of the taxiing segments, and the output is the dynamic spatial attention hidden representation of the taxiing segments.
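A minimal sketch of the per-segment neighbor attention (a single layer with two heads, omitting the residual connection and batch normalization described above). The toy graph, head count, and dimensions are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_attention(H, neighbors, W_heads):
    """Each segment attends over itself and its neighbours, per head,
    and the head outputs are concatenated (multi-head graph attention)."""
    n, d = H.shape
    M, _, dh = W_heads.shape                   # M heads, each projecting d -> dh
    out = np.zeros((n, M * dh))
    for i in range(n):
        related = [i] + neighbors[i]           # segment itself + its neighbours
        for m in range(M):
            q = H[i] @ W_heads[m]              # query: the target segment
            KV = H[related] @ W_heads[m]       # keys/values: related segments
            scores = softmax(KV @ q / np.sqrt(dh))
            out[i, m * dh:(m + 1) * dh] = scores @ KV
    return out

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 8))                    # 4 segments, dim 8, one time step
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # toy taxiway adjacency
W = rng.normal(size=(2, 8, 4))                 # 2 heads, each projecting 8 -> 4
H_spatial = graph_attention(H, neighbors, W)
```

Restricting the keys and values to each segment's neighbor set is what distinguishes this from dense self-attention over all segments.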
(b) Temporal attention
The term “dynamic temporal correlation” refers to how the influence weights of a given taxiing segment’s historical traffic conditions on its future traffic state change over time in the airport network. For example, morning rush hour congestion might be influenced by preceding traffic conditions, and this influence may accumulate over time until it diminishes. To dynamically capture relationships across multiple time steps, we implement a temporal attention network based on the mask mechanism of the Transformer [38], as depicted in Figure 5.
In the airport network, for a taxiing segment at time step k, the set of temporally related taxiing segments includes the segment itself at time step k and the same segment at the time steps before k. We then use the following temporal attention settings to compute the dynamic temporal attention hidden representation of the taxiing segment.
The static spatio-temporal hidden representation of taxiing segment i at time step k is taken as the query of the attention mechanism. The static spatio-temporal hidden representations of the temporally related taxiing segments are taken as the keys and values of the attention mechanism. To be specific, the attention mechanism is formulated as follows:
where each head's learnable weight matrix is shared by all taxiing segments, as is the learnable output weight matrix. The meanings of the other symbols remain consistent with those defined previously. The dynamic temporal attention hidden representation of taxiing segment i at time step k can then be encoded by Equation (16).
The initial input of the temporal attention network is the static spatio-temporal hidden representation of the taxiing segments, and the output is the dynamic temporal attention hidden representation of the taxiing segments.
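A sketch of masked temporal self-attention for one segment's sequence of time steps, using a causal mask in the style of the Transformer so that step k attends only to steps up to k. The sequence length and dimension are illustrative, and the multi-head structure is omitted for brevity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_temporal_attention(H):
    """Self-attention over one segment's time steps with a causal mask:
    step k attends only to steps <= k, as in the masked Transformer."""
    T, d = H.shape
    scores = H @ H.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # future positions
    scores[mask] = -np.inf                             # masked out of softmax
    return softmax(scores) @ H

rng = np.random.default_rng(3)
H_t = rng.normal(size=(6, 8))      # 6 time steps for one segment, dim 8
H_temporal = causal_temporal_attention(H_t)
```

Because of the mask, the first time step can only attend to itself, so its output equals its input exactly.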
(c) Fusion network
The traffic states on a taxiing segment at a given time step are related to the traffic states at previous time steps and the neighboring taxiing segments. To obtain a dynamic spatio-temporal hidden representation of the taxiing segment, the dynamic spatial attention hidden representation and dynamic temporal attention hidden representation of a taxiing segment are fused using a fully connected layer. To be specific, the fusion mechanism is formulated as follows:
where the spatial and temporal representations are first concatenated and then passed through fully connected layers with activation functions. The initial inputs of the fusion network are the dynamic spatial attention hidden representation and the dynamic temporal attention hidden representation of a taxiing segment; the output is its dynamic spatio-temporal hidden representation.
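The fusion step can be sketched as a concatenation followed by one fully connected layer with a ReLU activation. The number of segments, dimension, random weights, and choice of ReLU are placeholders; the paper's exact layer count and activation are not specified here.

```python
import numpy as np

rng = np.random.default_rng(4)
n_seg, d = 4, 8
H_spatial = rng.normal(size=(n_seg, d))    # dynamic spatial attention states
H_temporal = rng.normal(size=(n_seg, d))   # dynamic temporal attention states

# Fusion: concatenate the two representations, then a fully connected
# layer with a ReLU activation maps them back to dimension d.
W_f, b_f = rng.normal(size=(2 * d, d)), np.zeros(d)
H_st = np.maximum(np.concatenate([H_spatial, H_temporal], axis=-1) @ W_f + b_f, 0)
```

Concatenation (rather than summation) lets the learned weights decide how much each branch contributes per output dimension.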
3.5.2. Three-Dimensional Cross Spatio-Temporal Attention
Intuitively, the taxi time for an aircraft within a given taxiing segment is related to the historical and future traffic states of the taxiing segment itself and its neighbors, as well as the flight properties. For instance, aircraft of different types may require different amounts of time on the same taxiing segment, even under the same traffic conditions. Therefore, in the second stage, we design a 3D cross spatio-temporal attention network [31,36] to adaptively model the joint dynamic spatio-temporal correlations between the hidden states of flight properties and the hidden states of the taxiing segments (generated in the first stage) through which the taxi route passes, as well as those of their neighbors. We then obtain the higher-level spatio-temporal representation of the taxi route, as illustrated in Figure 6.
The hidden states of flight properties, which incorporate features of aircraft type, airline affiliation, arrival or departure status, and total taxi distance, are encoded using a lookup-table technique and a fully connected layer. The hidden states of start-time information are encoded through a fully connected layer.
Then, the combination of the representation of the flight properties and the representation of the start time is taken as the query of the attention mechanism. The dynamic spatio-temporal hidden states of the taxiing segments through which the taxi route passes and their neighbors, over the last and future time slots, are taken as the keys and values of the attention mechanism. To be specific, the attention mechanism is formulated as follows:
where each head's learnable query weight matrix is shared by all aircraft, and each head's learnable key and value weight matrices are shared by all taxiing segments. The meanings of the other symbols remain consistent with those defined previously. The higher-level dynamic spatio-temporal attention hidden states of taxiing segment i can then be encoded by Equation (25).
The initial inputs of the 3D cross spatio-temporal attention network are the hidden states of flight properties, the hidden states of start time, and the dynamic spatio-temporal hidden states of the taxiing segments through which the taxi route passes and their neighbors over the last and future time slots. The output is the higher-level dynamic spatio-temporal representation of the taxiing segments through which the taxi route passes.
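A single-head sketch of the cross attention: the flight-level vector (properties plus start time) forms the query, while the flattened (segment, time slot) states form the keys and values. All dimensions, the flattening choice, and the projection matrices are illustrative assumptions about the network's structure.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
d, n_seg, T = 8, 3, 4

flight = rng.normal(size=(d,))            # flight properties + start time
# Dynamic spatio-temporal states of the route's segments and neighbours
# over past/future slots, flattened to one key/value per (segment, slot).
seg_states = rng.normal(size=(n_seg * T, d))

Wq = rng.normal(size=(d, d))              # query projection (flight side)
Wk = rng.normal(size=(d, d))              # key projection (segment side)
Wv = rng.normal(size=(d, d))              # value projection (segment side)

q = flight @ Wq
scores = softmax((seg_states @ Wk) @ q / np.sqrt(d))
route_repr = scores @ (seg_states @ Wv)   # higher-level route representation
```

Because the query comes from the flight and the keys/values from the traffic states, the same traffic situation can be weighted differently for different aircraft, matching the aircraft-type example above.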
3.5.3. Positional Encoding
The taxiing segments along each taxi route exhibit a clear sequential structure, which is significant for training a highly accurate taxi time prediction model. However, recurrent neural networks are very time-consuming during inference, especially with large sequence lengths, and may not meet the real-time requirements of taxi time prediction tasks. Therefore, in this study, a positional encoding technique [32] designed to capture the sequence relationship of taxiing segments is used to incorporate sequence information into the taxiing segments. For the sequence encoding problem, we generate a series of cosinusoids:
where the position index gives the taxiing segment's position in a taxi route, the dimension index runs over the dimensions of the taxiing segment hidden representation, and the scaling constant is the size of the taxiing segment hidden representation. We denote the resulting vector as the positional encoding for position i. Then, the final hidden representation of the taxi route can be encoded as follows:
where a hyper-parameter controls the effect of the sequence information. The initial input of the positional encoding is the higher-level dynamic spatio-temporal representation of the taxiing segments through which the taxi route passes, and the output is the final hidden representation of the taxi route.
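A sketch of the positional encoding and the weighted addition. The implementation below uses the familiar interleaved sine/cosine form and a weighting hyper-parameter (here called `alpha`); both are assumptions about the cited technique's exact formulation.

```python
import numpy as np

def positional_encoding(n_pos, d):
    """Sinusoidal positional encoding: even dims use sine, odd dims cosine."""
    pos = np.arange(n_pos)[:, None]          # (n_pos, 1) position index
    dim = np.arange(d)[None, :]              # (1, d) dimension index
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(6)
n_seg, d = 5, 8
H_route = rng.normal(size=(n_seg, d))  # higher-level segment representations
alpha = 0.1                            # hyper-parameter weighting sequence info
H_final = H_route + alpha * positional_encoding(n_seg, d)
```

Because the encoding is computed in closed form for all positions at once, it adds sequence information without the sequential inference cost of a recurrent network.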