Article

A Methodology for Predicting Ground Delay Program Incidence through Machine Learning

Xiangning Dong, Xuhao Zhu, Minghua Hu and Jie Bao
1 College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, No. 29 General Avenue, Nanjing 211106, China
2 National Key Laboratory of Air Traffic Flow Management, Nanjing University of Aeronautics and Astronautics, No. 29 General Avenue, Nanjing 211106, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(8), 6883; https://doi.org/10.3390/su15086883
Submission received: 27 March 2023 / Revised: 17 April 2023 / Accepted: 18 April 2023 / Published: 19 April 2023

Abstract

Effective ground delay programs (GDPs) are needed to intervene when bad weather or airport capacity issues arise. This paper proposes a new methodology for predicting the incidence of effective ground delay programs using machine learning techniques, which can improve the safety and economic benefits of flights. We combine local weather and flight operation data with the ATM airport performance (ATMAP) algorithm to quantify the weather and generate an ATMAP score. We then compare the accuracy of three machine learning models, Support Vector Machine, Random Forest, and XGBoost, in estimating the probability of GDPs. The weather analysis performed by the ATMAP algorithm indicated that the ceiling was the most critical weather factor. Lastly, we used two linear regression models (ridge and LASSO) and a non-linear regression model (decision tree) to predict departure flight delays during GDPs. The predictive accuracy of the regression models was enhanced by the inclusion of ATMAP scores, with the decision tree model outperforming the other models, achieving an 8.8% improvement in its coefficient of determination (R²).

1. Introduction

Over the past several decades, weather has significantly reduced airspace efficiency and capacity around the world; it ranks first among the causes of flight irregularities, accounting for roughly 50% of them [1]. To manage air traffic during poor weather conditions or airspace congestion, traffic flow managers may implement various air traffic flow management (ATFM) measures, including ground delay programs, ground stops, reroute advisories, and mileage restrictions [2]. Among these measures, ground delay programs are often used to regulate air traffic flow, resulting in significant flight delays. In 2011, the United States alone issued 1065 ground delay programs, causing 519,940 flights to be delayed by a total of 26.8 million minutes, an average delay of 52 min per flight [3].
These delays impose considerable economic losses on the aviation sector. By forecasting ground delay occurrences, the relevant authorities can take preemptive measures, such as adjusting flight plans, refining resource allocation, and improving operational efficiency, thereby reducing the frequency of flight delays and fuel consumption, enhancing service quality, lessening the economic and environmental impact of flights, and improving the efficiency and safety of air traffic management.
Current research on ground delay programs (GDPs) can be broadly categorized into two approaches. The first uses simulation methods or mathematical modeling to find more efficient and cost-effective strategies for optimizing flight times, assuming that a GDP is necessary [4,5,6]; this approach aims to minimize the impact on airport and airline operations. The second employs machine learning techniques to predict ground delay programs [7,8,9]. Mathematically modeled GDPs aim to reduce flight delays by determining which flights must wait and for how long [10]. Ball et al. [4] investigated an integer program with a two-layer network structure to address the ground-holding problem at a single airport. Mukherjee and Hansen [5] presented a dynamic stochastic linear programming model that uses weather forecasts at various decision points to revise ground delays, thus solving the single-airport ground-holding problem. Later, Ball et al. [11] introduced probabilistic airport capacity forecasts to determine flight departure delays. This approach optimizes the number of scheduled arriving flights in stages using a static stochastic ground-holding model, which is more straightforward to construct than earlier stochastic dynamic optimization models, and it also offers a new perspective on ground delay programs. Yan et al. [12] established a comprehensive platform to model flight operations during a GDP and proposed flight route recovery schemes. Jacquillat [13] developed a large-scale integer optimization model using a passenger-centric GDP (GDP-PAX) optimization strategy, which significantly reduced passenger delays while only slightly increasing aircraft delay costs.
The application of machine learning in the aviation industry has gained significant momentum in recent years [14,15]. With the availability of historical data, researchers have been able to make predictions about the occurrence of GDPs and their causes. Grabbe et al. [2,16] applied improved clustering algorithms to three years' worth of GDP records to identify the primary factors leading to GDPs; the results showed that clustering techniques have great potential for determining the causes of GDPs. Liu et al. [17] proposed a semi-supervised learning algorithm to evaluate weather forecast similarity for strategic air traffic management and to determine whether a GDP should be issued by searching for similar days. Smith and Sherry [7] used terminal aerodrome forecasts (TAF) as input variables for a Support Vector Machine (SVM) algorithm to predict aircraft arrival rates; they then used the airport acceptance rate (AAR) to determine the planned rate, duration, and passenger delays for GDPs. Liu et al. [17] also compared the performance of logistic regression and the Random Forest algorithm in predicting GDP incidence using weather and arrival demand variables generated by an SVM. Wang and Kulkarni [9] studied the impact of dynamic airport ground weather on GDPs and used T-WITI-FA (terminal weather-impacted traffic index forecast accuracy) and air traffic data to model GDP prediction. Chen et al. [8] employed multi-agent reinforcement learning (MARL) to simulate the application of GDPs for resolving the demand-capacity balancing problem in high-density situations at the pre-tactical stage. Their MARL approach, based on a double Q-learning network (DQN), has the potential to significantly decrease the number of delayed flights and the average delay duration.
The abovementioned GDP models focus on defining objective functions and selecting decision variables, and many conditional assumptions are predetermined; the resulting GDP solutions can therefore have limited practical applicability. The quantitative analysis of weather data is also critical, and many researchers utilize various forms of the weather-impacted traffic index (WITI) [18]. However, owing to their complex nature and diverse forms, many types of WITI are not well suited for fast-time modeling and forecasting. Previous studies [9,18] have attempted to predict GDP duration in a given hour, yielding promising results. Meanwhile, Bloem [13] found that machine learning models had difficulty predicting the initiation and cancellation of GDPs. As such, predicting departure delay times during a GDP remains a relevant challenge.
Collectively, the results described in this paper show that the proposed methodology offers significant contributions to GDP decision makers. (a) We employed the ATMAP method to evaluate airport weather and to quantify meteorological aerodrome report (METAR) messages. The ATMAP score, calculated with a simple equation, is easier to determine than other metrics, such as the WITI. We then created a dataset for predictive modeling using actual flight operating data, meteorological data, and ATMAP scores. (b) The study comprehensively compares the predictive abilities of three classification models, namely SVM, Random Forest, and XGBoost, using Bayesian parameter tuning to enhance their predictive accuracy; we also investigate the correlation between actual GDP runs and feature importance. (c) We established a departure flight delay prediction model based on the known GDP duration and assessed its ability to predict outcomes with and without ATMAP scores; the reliability of the ATMAP scores was also assessed.
The rest of the paper is organized as follows: Section 2 briefly describes GDPs and states the research objective. Section 3 explains the process of gathering and processing data, as well as the machine learning classification models and regression models used to forecast GDP occurrences and delay times, respectively. Section 4 describes the experimental findings, assesses how well the machine learning models perform, and examines the influence of various features. Finally, Section 5 summarizes the conclusions and suggests areas for future study.

2. Description of Ground Delay Programs

When the anticipated demand for arriving air traffic exceeds the AAR for an extended period, a GDP is used as one of the ATFM measures to keep departing flights waiting at their gates or on taxiways until demand and capacity are balanced.
The GDP occurrence process is shown in Figure 1. Before a GDP begins, traffic flow management (TFM) provides the concerned airport with an initial GDP plan based on the GDP operating parameters, including the GDP start time, end time, and geographic coverage. During a GDP, the ground waiting time is the interval between the estimated time of departure (ETD) and the controlled time of departure (CTD) [19].
TFM chooses a suitable end time based on the capacity forecast and the overall GDP operation; the chosen end time is essentially an expected time. Ideally, the GDP operates as intended, with cancellation occurring at the scheduled end time, which coincides with the restoration of the demand-capacity balance. If cancellation is postponed because of anticipated weather and expected demand, unnecessary delays may result.

3. Methods

We used three types of data, weather scores from the ATMAP algorithm, actual weather conditions, and airport traffic, to predict the hourly probability of a GDP under local weather conditions. The general framework is shown in Figure 2.
The METAR data were transformed into a weather severity score by the ATMAP method. The resulting dataset, consisting of ATMAP scores, local weather, and airport flight data, was used to predict GDP events with the XGBoost, Random Forest, and SVM algorithms, with 70% of the data used as the training set and the remaining 30% as the testing set. A GDP flight delay prediction model was then constructed using the dataset either with or without the ATMAP scores, and the performance of ridge regression, LASSO regression, and decision tree regression was compared. To address the issue of imbalanced data, in which there are far more hours without GDP events than with them, the synthetic minority over-sampling technique (SMOTE) [20] was employed to oversample the training set. This increases the amount of data with GDP events and improves the performance of the prediction model.
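As a concrete illustration of this pipeline, the sketch below strings together the 70/30 split, the SMOTE oversampling, and one of the classifiers; the file and column names are hypothetical placeholders, and the imbalanced-learn and xgboost packages are assumed to be available.

```python
# Minimal sketch of the framework in Figure 2 (hypothetical file/column names).
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE      # assumes imbalanced-learn is installed
from xgboost import XGBClassifier             # assumes xgboost is installed

data = pd.read_csv("zsnj_hourly_dataset.csv")  # one row per hour: ATMAP score, weather, traffic, GDP label
X = data.drop(columns=["gdp"])
y = data["gdp"]                                # 1 = GDP in effect, 0 = no GDP

# 70/30 split, stratified so the rare GDP hours appear in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample the minority (GDP) class in the training set only
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = XGBClassifier(eval_metric="logloss")
clf.fit(X_train_bal, y_train_bal)
print("test accuracy:", clf.score(X_test, y_test))
```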

3.1. Data Set

We used flight operation data and METAR messages from Nanjing Lukou International Airport (ICAO four-character code: ZSNJ) between 1 January 2021 and 30 June 2021. ZSNJ is a dual-runway 4F airport, the second largest in eastern China in terms of size. Its terminal building covers 425,000 square meters, and its apron area is around 1.1 million square meters. Figure 3 depicts the airport's two parallel 3600 m runways, RWY06/24 and RWY07/25. We used data for RWY06/24, the main departure runway at ZSNJ.
The flight data comprise the scheduled departure time, actual departure time, and scheduled landing time. Following [21], the numbers of scheduled arrival flights and scheduled departure flights were selected as the inputs representing airport traffic.
METAR and TAF are both weather reports for the airport terminal area. A TAF provides forecast information and is updated every three hours, while METAR data are observations of weather conditions at the airport, typically issued every half hour or hour. METARs primarily contain critical meteorological information, such as wind direction, wind speed, precipitation, and ceiling. As far as GDPs are concerned, ceiling, visibility, wind direction, and wind speed are considered significant [3]. Wind, a major factor affecting takeoff and landing, is classified as headwind or tailwind relative to the runway direction and as crosswind relative to the direction perpendicular to the runway. For instance, when runway 06 at ZSNJ is in use, a wind from the northeast is considered a headwind and denoted with 1; otherwise, it is logged as 0. Figure 4 depicts the meteorological information extracted from ZSNJ METAR messages during the first six months of 2021.
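To make the headwind encoding concrete, the snippet below shows one plausible way to derive the binary wind-direction feature from a METAR wind direction; the 060° runway heading and the ±90° threshold are our assumptions for illustration, not values stated in the paper.

```python
# Hedged sketch of the headwind/tailwind encoding for RWY06 (assumed heading ~060 degrees).
RUNWAY_06_HEADING = 60  # degrees

def headwind_flag(wind_dir_deg: float, runway_heading: float = RUNWAY_06_HEADING) -> int:
    """Return 1 if the reported wind direction is a headwind for the runway, else 0."""
    diff = abs((wind_dir_deg - runway_heading + 180) % 360 - 180)  # smallest angular difference
    return 1 if diff <= 90 else 0

print(headwind_flag(40))   # wind from the northeast -> 1 (headwind for RWY06)
print(headwind_flag(240))  # wind from the southwest -> 0 (tailwind)
```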
Whether an hourly GDP occurred is recorded in the ZSNJ ATC logbook, with a start time and an end time. ZSNJ reported a total of 117 h with a GDP and 4227 h without one from January to June 2021. Table 1 lists the essential features selected for the study of GDPs. Figure 5 illustrates the correlation coefficients linking the weather and traffic features.

3.2. ATMAP Algorithm

The ATMAP algorithm [22] was developed by the EUROCONTROL Performance Review Unit (PRU), in collaboration with the ATMAP Group and at the request of the Performance Review Commission (PRC), to provide a standardized measurement of air-side performance across European airports. The algorithm assesses airport weather conditions in the post-operational phase and quantifies METAR messages to identify airports negatively impacted by prolonged poor weather; it also evaluates the effect on airport performance [23].
The ATMAP method quantifies airport weather data into five categories: visibility, wind, precipitation, freezing conditions, and dangerous weather phenomena. Each category is assigned a severity level, and coefficients reflect the varying levels of severity under different meteorological conditions. The key steps of the algorithm are outlined below:
  • Parse the METAR messages and extract the five elements: visibility, wind, precipitation, freezing conditions, and hazardous weather phenomena;
  • Express the status of each weather class as a severity code, ranging from 1 for good weather to a maximum of 5 for bad weather;
  • Assign each severity code a discrete value (from 0 to 30), called a synoptic score coefficient. This enables the non-linear behavior of different weather events to be described. Examples of the ATMAP scoring criteria for precipitation phenomena are given in Table 2.
The algorithm quantifies each parsed METAR report according to these scoring criteria; Table 3 provides a concrete example of the quantification of a typical ZSNJ weather report.
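The toy sketch below illustrates the scoring logic using only the precipitation coefficients of Table 2 and the per-class summation of Table 3; the other four class tables are analogous, and the simplified METAR-group handling here is our own shorthand rather than the official EUROCONTROL implementation.

```python
# Illustrative ATMAP-style scoring (precipitation class only; other classes are analogous).
PRECIPITATION_COEFF = {
    "RA": 1, "UP": 1, "DZ": 1, "IC": 1,   # severity code 2
    "-SN": 2, "SG": 2, "+RA": 2,          # severity code 3
    "FZRA": 3, "SN": 3, "+SN": 3,         # severity code 4 (FZxx shown here as FZRA)
}

def precipitation_score(metar_groups):
    """Coefficient of the worst matching precipitation group (0 if none present)."""
    return max((PRECIPITATION_COEFF.get(g, 0) for g in metar_groups), default=0)

def atmap_score(class_scores):
    """ATMAP score of a report = sum of the five weather-class coefficients."""
    return sum(class_scores.values())

# Toy example loosely following Table 3 ("-SHRA" treated as rain; values are illustrative)
scores = {"visibility": 0, "wind": 0,
          "precipitation": precipitation_score(["RA"]),
          "freezing": 0, "dangerous": 12}
print(atmap_score(scores))  # -> 13
```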

3.3. GDP Classification Models

The classification problem consists of two processes: learning and classification. In the learning process, a classifier $Y = f(X)$ is learned from the available training data using an efficient learning method; in the classification process, the learned classifier assigns a new input $X_{N+1}$ to a class $Y_{N+1}$ [24]. The present study evaluates three machine learning algorithms, SVM, Random Forest, and XGBoost, to determine the most appropriate approach for the prediction model. SVM is a conventional classification model, whereas Random Forest and XGBoost are ensemble learning models. Ensemble learning is an umbrella term for combining multiple learners in machine learning; its benefit is that several weakly supervised models can be integrated into a strongly supervised model, thereby improving forecasting precision [25].
To achieve this goal, the dataset was divided into two parts, with 30% allocated for testing and the remaining 70% serving as the training set, as detailed in Table 1. However, the data were significantly imbalanced, with a disproportionately larger number of "no GDP" hours than "GDP" hours. To mitigate this issue, the SMOTE technique was used to oversample the minority class in the training set.

3.3.1. Evaluation Indicators

In this paper, an hour with a GDP in effect is labeled positive, and an hour without one is labeled negative. An incorrect positive prediction is called a false positive (FP), and an incorrect negative prediction is called a false negative (FN). Samples correctly classified as positive are true positives (TP), and samples correctly classified as negative are true negatives (TN). The evaluation indicators accuracy and precision are calculated using Equations (1) and (2), respectively.
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (2)
The F1 score is the harmonic mean of precision and recall, calculated as Equation (3), where recall is given by Equation (4).
$F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (3)
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$ (4)
Cohen's Kappa is calculated as Equation (5). The Cohen's Kappa statistic is used to evaluate the consistency of observations, with values closer to 1 indicating a higher degree of consistency.
$K = \dfrac{P_0 - P_e}{1 - P_e}$ (5)
where $P_0$ is the observed agreement and $P_e$ is the agreement expected by chance.
The receiver operating characteristic (ROC) curve is a composite indicator of sensitivity and specificity; it plots the true positive rate (recall) against the false positive rate (FPR), which is given by Equation (6).
$\mathrm{FPR} = \dfrac{FP}{FP + TN}$ (6)
The area between the curve and the horizontal axis is the area under the ROC curve (AUC); the closer the AUC value is to 1, the better the performance of the classifier. Figure 6 shows a diagram of the ROC curve.
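All of the indicators above are available in scikit-learn; a minimal sketch on toy label arrays is given below for reference.

```python
# Sketch of the evaluation indicators in Equations (1)-(6) using scikit-learn (toy labels).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = GDP hour, 0 = no GDP
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # predicted labels
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted GDP probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))  # area under the ROC curve
```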

3.3.2. Support Vector Machine

In this study, a non-linear SVM model was chosen to cope with the high dimensionality and complexity of the classification problem. In the non-linear SVM [26] classification problem, an optimal separating hyperplane cannot be found in the original low-dimensional space, so a mapping $x \to \Phi(x)$ is introduced: all points in the original space are mapped into a higher-dimensional space, where each class can be separated entirely and correctly, and the problem is solved there. The non-linear SVM solution is obtained by solving the optimization problem in Equation (7); the Sequential Minimal Optimization (SMO) algorithm repeatedly updates pairs $(\alpha_i, \alpha_j)$ until the solution is determined. For non-linear SVM classification, the kernel trick [27] is used, so the specific form of the mapping $\Phi(x)$ does not need to be known; the optimization problem in Equation (7) is constructed and solved by selecting an appropriate kernel function $K(x_i, x_j)$ and penalty parameter $C$. The four prevalent kernel types are the linear, polynomial (poly), hyperbolic tangent (sigmoid), and radial basis function (RBF) kernels.
$\min_{\alpha} \ \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N} \alpha_i \quad \mathrm{s.t.} \ \sum_{i=1}^{N} \alpha_i y_i = 0, \ 0 \leq \alpha_i \leq C, \ i = 1, 2, \ldots, N$ (7)
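For reference, a minimal scikit-learn sketch of such a classifier is given below; the RBF kernel and the C and gamma values are taken from Table 5, while the feature scaling step is our own addition, since RBF kernels are distance-based.

```python
# Sketch of the non-linear (RBF-kernel) SVM classifier of Equation (7).
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

svm_clf = make_pipeline(
    StandardScaler(),                                  # scaling added here as an assumption
    SVC(kernel="rbf", C=0.1, gamma=0.029, probability=True),
)
# svm_clf.fit(X_train_bal, y_train_bal)       # arrays from the pipeline sketch in Section 3
# svm_clf.predict_proba(X_test)[:, 1]         # hourly GDP probabilities
```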

3.3.3. Random Forest

The Random Forest algorithm [28] uses bootstrap sub-sampling to obtain different sample sets for model construction, thus increasing the degree of variation between models and improving the ability to extrapolate predictions. Since the presence or absence of a GDP is a binary classification problem, $k$ bootstrap sample sets are first drawn randomly from the initial training set and a decision tree is trained on each; the final class is decided by a vote over the individual trees' outputs, as given by Equation (8), where $H(x)$ denotes the combined classification model, $h_i$ is a single decision tree classifier, and $I(\cdot)$ is the indicator function.
$H(x) = \arg\max_{Y} \sum_{i=1}^{k} I(h_i(x) = Y)$ (8)
We evaluate each feature's importance using the Gini index [29], given by Equation (9); the lower the Gini index, the more homogeneous the child nodes.
$\mathrm{Gini} = 1 - \sum_{k=1}^{|y|} p_k^2$ (9)
where $p_k$ represents the probability that the sample belongs to the $k$-th category.
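A corresponding scikit-learn sketch is shown below, with the hyperparameters from Table 5; the commented lines indicate how the Gini-based feature importances of Figure 11a could be read out, assuming the training arrays from the earlier pipeline sketch.

```python
# Sketch of the Random Forest classifier with the Gini criterion of Equation (9).
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=20, max_depth=3, min_samples_split=10,
    min_samples_leaf=5, criterion="gini", random_state=42)
# rf_clf.fit(X_train_bal, y_train_bal)
# print(dict(zip(X_train_bal.columns, rf_clf.feature_importances_)))  # Gini-based importances
```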

3.3.4. XGBoost Classification Algorithm

XGBoost is short for "extreme gradient boosting", a scalable machine learning approach for tree boosting [30]. The algorithm uses a second-order Taylor expansion of the loss function and adds a regularization term, which effectively prevents over-fitting and speeds up convergence. It is based on the Gradient-Boosted Decision Tree (GBDT) algorithm. The XGBoost model can be represented in additive form, as shown in Equation (10):
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$ (10)
where $f_k$ represents the $k$-th sub-model.
The objective function of XGBoost consists of the loss function and the regularization term, as given in Equations (11) and (12):
$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$ (11)
$\Omega(f_t) = \gamma T + \dfrac{1}{2} \lambda \|\omega\|^2$ (12)
where $\hat{y}_i^{(t-1)}$ is the prediction after the previous $t-1$ iterations, $f_t$ is the tree added at iteration $t$, and $\Omega$ is the regularization term, in which $T$ is the number of leaf nodes, $\omega$ the vector of leaf weights, and $\gamma$ and $\lambda$ the regularization coefficients; this term prevents the decision trees from becoming over-fitted or overly complex.
The objective function in Equation (11) is then expanded with a second-order Taylor approximation to give Equation (13), where $g_i$ and $h_i$ denote the first- and second-order gradients of the loss function:
$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \dfrac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \dfrac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$ (13)
The XGBoost algorithm [31] is an effective ensemble learning technique that has recently gained popularity. It has a number of benefits, including fast processing, acceptance of various input data types, built-in cross-validation, tree pruning, high flexibility, and better control of over-fitting than other boosting models.
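A minimal sketch of such a classifier with the tuned values from Table 5 is given below; the reg_lambda argument, left at its default, corresponds to the λ of Equation (12), and the remaining settings are assumptions for illustration.

```python
# Sketch of the XGBoost classifier corresponding to Equations (10)-(13).
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(
    n_estimators=200, gamma=3.54, learning_rate=0.05,
    max_depth=10, min_child_weight=1.3,
    reg_lambda=1.0,            # lambda of Equation (12), left at its default here
    eval_metric="auc")
# xgb_clf.fit(X_train_bal, y_train_bal)
# xgb_clf.predict_proba(X_test)[:, 1]    # hourly GDP probabilities
```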

3.4. GDP Departure Delay Time Models

Our study analyzes how ATMAP scores affect the accuracy of delay prediction models during ground delay programs. Many past studies [32] have used different algorithms to predict flight delays; we focus on the connection between such prediction models and ATMAP scores. It is important to note that our study does not evaluate whether our prediction model is the best available or whether a more advanced one should be used; we solely explore the relationship between conventional regression models and ATMAP scores.
This part of the research aims to predict the holding time of departing aircraft under GDP conditions. We limited the scope of our analysis to the total delay time of flights that experience delays of up to 15 min per hour of GDP, as such delays are deemed acceptable by passengers. The data for this study were collected from GDP events, and the total delay was predicted using decision tree, LASSO regression, and ridge regression methods [33]. As in Section 3.3, we split the data into a 70% training set and a 30% test set.
Ridge regression and LASSO are two regularization methods employed to counteract over-fitting in linear regression models. Ridge regression adds an $L_2$ penalty term to the sum of squared residuals, which shrinks the model's coefficient values, as shown in Equation (14). Conversely, LASSO adopts an $L_1$ penalty term, which can drive some coefficients exactly to zero, thereby performing feature selection; its objective function is shown in Equation (15).
$\min_{w} \|Xw - Y\|_2^2 + \alpha_1 \|w\|_2^2$ (14)
$\min_{w} \dfrac{1}{2} \|Xw - Y\|_2^2 + \alpha_2 \|w\|_1$ (15)
where $X$ represents the feature matrix, $w$ the set of regression coefficients, $Y$ the target values, and $\alpha_i$ the regularization coefficient.
Decision tree regressors are non-parametric models that repeatedly split the data on the input feature values to generate predictions. They can handle both discrete and continuous data and are commonly used in machine learning tasks for their interpretability and ease of use.
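For orientation, the sketch below instantiates the three regressors in scikit-learn; the alpha and max_depth values are illustrative defaults, not the settings tuned in the paper, and the feature/target arrays are hypothetical.

```python
# Sketch of the delay-time regressors of Equations (14) and (15) plus the decision tree.
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

delay_models = {
    "ridge": Ridge(alpha=1.0),                              # L2 penalty, Equation (14)
    "lasso": Lasso(alpha=0.1),                              # L1 penalty, Equation (15)
    "tree":  DecisionTreeRegressor(max_depth=5, random_state=42),
}
# Each model is fit on the 70% training split of the GDP-hour data, e.g.:
# delay_models["tree"].fit(X_delay_train, y_delay_train)    # target: delay minutes per GDP hour
```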

4. Results and Discussion

In this section, the quantitative analysis of local weather with the ATMAP algorithm is provided in Section 4.1. Section 4.2 gives a thorough examination of the GDP classification models' results and a discussion of the dataset's salient features. Finally, Section 4.3 analyzes the effect of ATMAP scores on the prediction of departure flight delays.

4.1. ATMAP Score Analysis

In this study, the computing environment was a Windows 10 operating system with an Intel Core i7-10700 processor (2.9 GHz) and 16 GB of memory. The programming was conducted using the scikit-learn package in Python.
Our study used the ATMAP algorithm to better understand the hourly weather conditions at ZSNJ. The results of our analysis are shown in Figure 7, which breaks down the weather components into categories and shows their relative impact as a percentage. The most significant factor affecting the weather score at the airport is visibility, which makes up 61% of the total, and wind is also a significant factor, contributing 17%. In comparison, dangerous weather events and freezing phenomena are much less common, making up only 9% and 3% of the weather conditions at ZSNJ, respectively; temperatures at the airport rarely drop below 0 °C, so freezing phenomena are relatively rare.
We use 26 January as a case example to examine the correlation between planned and actual flight departures and the local weather conditions, as depicted in Figure 8. Previous research [23] has shown that when the ATMAP score exceeds 1.5, the likelihood of flight delays or cancellations increases; a score of 3 or more is typically considered to indicate severe weather. On that day, the ATMAP score was mostly above 3, and we observed an upward trend in both actual and scheduled departures between 6:00 a.m. and 8:00 a.m. However, between 10:00 a.m. and 7:00 p.m., when there was heavy fog and rain, the number of actual departures dropped dramatically, by as much as 60%. This example underscores the importance of considering local weather when defining severe weather conditions: as expected, the higher the ATMAP score, the more likely flights are to be canceled or delayed [18].
Examining the figure further, it becomes evident how the timing of weather events and of the operational delays at the airport differ. Between 9:00 a.m. and 2:00 p.m., the number of actual departures decreases, while the number of actual arrivals increases slightly. However, as the ATMAP score rises, the number of actual arrivals drops sharply, reaching a low of five flights per hour at 8:00 p.m. This is a typical example of the air traffic system's delayed response to local weather events.

4.2. GDP Classification Models

4.2.1. Parameter Selection

We used Bayesian optimization [34] to search for the important parameters of the three algorithms, SVM, Random Forest, and XGBoost, with the goal of locating the optimal parameter combination for each model. The Bayesian optimization proceeds as follows:
  • Establish the model function, determine the parameters of the model, and set the hyperparameter intervals to be optimized;
  • Perform Bayesian optimization, using the model's cross-validated AUC value as the objective function to be maximized;
  • Train the model with 5-fold cross-validation under the current combination of parameters and calculate the value of the evaluation function; the result is returned to the Bayesian optimizer, which selects the next set of parameters for a new round of training based on the acquisition function of the probabilistic model, until the iteration limit is reached;
  • Output the set of parameters that gives the best model performance after Bayesian optimization.
Based on the data we collected, we established intervals for optimizing the crucial parameters of the three models, as illustrated in Table 4; the remaining parameters were set based on our previous tuning experience.
We set the number of iterations to 100, and the optimization halts once this number of evaluations is reached, yielding the most efficient parameter combination. The resulting parameter values are listed in Table 5.
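The paper does not name a specific Bayesian optimization library; the sketch below uses scikit-optimize's BayesSearchCV as one plausible implementation, with the XGBoost search space of Table 4 and the AUC-maximizing 5-fold cross-validation described above.

```python
# Hedged sketch of the Bayesian hyperparameter search (XGBoost search space from Table 4).
from skopt import BayesSearchCV                  # assumes scikit-optimize is installed
from skopt.space import Integer, Real
from xgboost import XGBClassifier

search_space = {
    "n_estimators":     Integer(10, 300),
    "gamma":            Real(0.0, 10.0),
    "learning_rate":    Real(0.05, 0.5),
    "max_depth":        Integer(3, 30),
    "min_child_weight": Real(0.0, 10.0),
}

opt = BayesSearchCV(
    XGBClassifier(eval_metric="auc"),
    search_space,
    n_iter=100,           # number of Bayesian optimization iterations
    scoring="roc_auc",    # AUC is maximized, as described above
    cv=5,                 # 5-fold cross-validation
    random_state=42)
# opt.fit(X_train_bal, y_train_bal)
# print(opt.best_params_)
```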

4.2.2. Model Evaluation

In our study, we compared the performance of the three machine learning algorithms, XGBoost, SVM, and Random Forest, in predicting GDPs. Figure 9a shows the ROC curves on the test set; XGBoost performed best, with the highest AUC value and the fewest false positives. This is because XGBoost considers the second-order derivatives of the loss function, providing more detail and accuracy. The exact computation time for each model depends on the size of the input data; with our current data volume, the XGBoost and Random Forest models take about 5 s to train, while the SVM takes 31 s.
When we evaluated the models on the test set, as shown in Figure 9b, XGBoost again came out on top, outperforming both SVM and Random Forest in the precision and Kappa metrics. XGBoost also performed best in terms of F1 score, with Random Forest slightly higher than SVM. These results demonstrate the superiority of XGBoost as a model for predicting GDPs.
The prediction performance for both the training and test sets can be found in Table 6. It is common for a model's performance to be slightly lower on the test set than on the training set. The XGBoost model performed best, with a test accuracy of 0.902, while the SVM model had the lowest test accuracy at 0.722; thus, XGBoost is the most accurate model and SVM the least accurate. The Cohen's Kappa coefficient is a frequently employed evaluation metric in classification problems that considers not only the precision of the model but also its consistency and reliability. XGBoost exhibited the best performance, with Cohen's Kappa coefficients of 0.962 and 0.8 on the training and test sets, respectively. Furthermore, the precision, F1, and AUC scores of XGBoost on the training and test sets also remained at an elevated level.
The confusion matrices of the prediction results are shown in Figure 10. The higher the values on the diagonal of the confusion matrix (the darker the diagonal color), the better the classification performance of the model. The proposed XGBoost model demonstrates a superior correct classification rate on the test set, significantly better than the other two models.
We determined the feature importance rankings for the XGBoost and Random Forest models to investigate the significance of the selected features in predicting the likelihood of GDP occurrence. Figure 11 shows the importance of each feature in the predictive models, and Table 1 describes each feature. The ceiling (cloud base height) is the most influential factor for both XGBoost and Random Forest, accounting for 20% of the total importance. Given that cloud base height can affect visibility, this supports the conclusion of the ATMAP algorithm's quantitative analysis of local weather.
Regarding traffic demand, the importance of scheduled departures and scheduled arrivals ranks in the top four in both the XGBoost and Random Forest models. This indicates that not only does the weather play a role in GDP occurrence, but the anticipated number of flights per hour also affects the congestion of airspace capacity. During certain "busy hours", airports may implement ground waiting measures to balance capacity and demand, as was noted in [17].
The importance of the ATMAP score was around 0.05 in both the XGBoost and Random Forest models, which appears incongruent with actual conditions at the airport. One main reason for this discrepancy is that the local airport does not typically experience highly adverse weather, so the ATMAP score is usually low and the models cannot fully learn its correlation with GDP occurrence. As a result, the impact of the ATMAP score on the final prediction results is limited.

4.3. Departure Delay Time Models

To understand the impact of a GDP, it is crucial to examine the estimated hourly delay time of departing flights during GDPs. Figure 12 provides an overview of all GDP releases and shows the delays of departing aircraft during each occurrence. As seen in the figure, a GDP can remain in effect even during hours with no departing flights and no delays. Furthermore, the period with high ATMAP scores occurs mainly between January and March, leading to extended delays due to unfavorable weather conditions, with a particularly noticeable impact on 19 March.

4.3.1. Performance Measures

In assessing the predictive performance of the models, their efficiency was verified using the coefficient of determination ($R^2$), the mean squared error (MSE), and the mean absolute error (MAE). $R^2$ reflects how well the predicted values explain the true values. The MSE measures the deviation between the predicted and true values; the smaller the MSE, the better the prediction accuracy. The MAE is the average of the absolute errors, which reflects the actual magnitude of the prediction errors. $R^2$, MSE, and MAE are calculated using Equations (16)–(18), respectively.
$R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ (16)
$\mathrm{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (17)
$\mathrm{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$ (18)
where $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $\bar{y}$ the mean of the true values.
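These three measures are available directly in scikit-learn; a toy sketch with hypothetical actual and predicted delay values is given below.

```python
# Sketch of the performance measures in Equations (16)-(18) (toy values).
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_actual = [30, 45, 10, 60]      # actual hourly delay in minutes (illustrative)
y_pred   = [25, 50, 15, 55]      # predicted hourly delay in minutes (illustrative)

print("R2 :", r2_score(y_actual, y_pred))
print("MSE:", mean_squared_error(y_actual, y_pred))
print("MAE:", mean_absolute_error(y_actual, y_pred))
```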
The prediction results, evaluated using $R^2$, MSE, and MAE, are presented in Table 7, along with the performance metrics for the three regression models. All three regression models exhibited improved performance when the ATMAP scores were included. The decision tree model stands out with a remarkable 8.8% improvement in its $R^2$ score. This could be because the ATMAP score captures non-linear variation in the weather, acting as a more impactful composite weather metric than the individual weather attributes.

4.3.2. Example Validation

We employed the decision tree model to demonstrate the prediction of departure delays over the GDP duration, using data from the GDP that occurred on 22 April. Figure 13 shows the predictions made with and without the inclusion of ATMAP scores: the actual data are depicted in red, the predictions made with ATMAP scores by the blue dotted line, and the predictions made without ATMAP scores by the purple dotted line. The predictions made with ATMAP scores are more precise and closer to the actual data than those made without. Additionally, the majority of prediction errors are within 15 min, underscoring the effectiveness of the decision tree model in forecasting departure flight delays over the GDP duration.

5. Conclusions

This paper presents a novel approach for predicting GDP incidence using machine learning techniques and evaluates the performance of three different models: SVM, Random Forest, and XGBoost. Although the models achieved similar AUC values, XGBoost outperformed the others in terms of F1 score, accuracy, and Kappa metrics (the test set accuracy was 0.902 with the XGBoost model). Regarding feature importance, the highest contributing factor was the ceiling, which is consistent with the analysis of local weather conducted by the ATMAP algorithm.
We then investigated the forecasting of departure delays over the GDP duration using both linear and non-linear regression models. The decision tree model outperformed the ridge and LASSO models, with an MAE of 10.9 to 12 min. Incorporating the ATMAP score into the models led to improved accuracy, especially for the decision tree model, whose $R^2$ value increased by 8.8%. This can be attributed to the ATMAP score's ability to reflect non-linear weather variance and to capture the ground-holding delays caused by weather events, resulting in a more precise prediction of flight delays. The results also show that the ATMAP scores introduced in this study are indeed of high importance for the accuracy of flight departure delay prediction during GDPs.
In conclusion, the GDP prediction model outlined in this paper sheds light on the interplay between weather conditions and air traffic flow with regard to GDP occurrences. These machine learning models offer a valuable tool for controllers to make informed decisions on GDP activation and help anticipate the extent of its impact on flight delays, thus improving operational efficiency and reducing environmental impact. The combination of these models has the potential to predict the entire GDP process accurately. This study sets the stage for further research on GDP formation and its effects. It is recommended that future studies explore the collaborative decision-making paradigm and examine the ripple effect of GDPs among airport clusters, taking their contributing factors into account.

Author Contributions

Conceptualization, X.D. and X.Z.; methodology, M.H. and X.D.; software, X.Z. and J.B.; data curation, X.D. and X.Z.; supervision, M.H.; validation, M.H. and J.B.; writing—original draft preparation, X.D. and X.Z.; writing—review and editing, X.D., M.H. and J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (No. 2022YFB2602401) and the National Natural Science Foundation of China (grant numbers U2033203 and 52272333).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding authors upon reasonable request.

Acknowledgments

The authors thank the National Key Laboratory of Air Traffic Flow Management of Nanjing University of Aeronautics and Astronautics and the Jiangsu Air Traffic Control Branch for providing the GDP data used in the model tests described in this paper. We also acknowledge the helpful comments of the peer reviewers.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Civil Aviation Administration of China. Development Statistics Bulletin of Civil Aviation Industry in 2018; Civil Aviation Administration of China: Beijing, China, 2019.
  2. Grabbe, S.; Sridhar, B.; Mukherjee, A. Clustering Days and Hours with Similar Airport Traffic and Weather Conditions. J. Aerosp. Inf. Syst. 2014, 11, 751–763.
  3. Liu, Y.; Hansen, M.; Zhang, D.; Liu, Y. Modeling Ground Delay Program Incidence Using Convective and Local Weather Information. In Proceedings of the Twelfth USA/Europe Air Traffic Management Research and Development Seminar (ATM2017), Seattle, WA, USA, 2 May 2020; pp. 1–8.
  4. Ball, M.O.; Hoffman, R.; Odoni, A.R.; Rifkin, R. A Stochastic Integer Program with Dual Network Structure and Its Application to the Ground-Holding Problem. Oper. Res. 2003, 51, 167–171.
  5. Mukherjee, A.; Hansen, M. A Dynamic Stochastic Model for the Single Airport Ground Holding Problem. Transp. Sci. 2007, 41, 444–456.
  6. Kuhn, K.D. Ground Delay Program Planning: Delay, Equity, and Computational Complexity. Transp. Res. Part C Emerg. Technol. 2013, 35, 193–203.
  7. Smith, D.A.; Sherry, L. Decision Support Tool for Predicting Aircraft Arrival Rates from Weather Forecasts. In Proceedings of the 2008 Integrated Communications, Navigation and Surveillance Conference, Bethesda, MD, USA, 5–7 May 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–12.
  8. Chen, Y.; Xu, Y.; Hu, M.; Yang, L. Demand and Capacity Balancing Technology Based on Multi-Agent Reinforcement Learning. In Proceedings of the 2021 IEEE/AIAA 40th Digital Avionics Systems Conference (DASC), San Antonio, TX, USA, 3–7 October 2021; pp. 1–9.
  9. Wang, Y.; Kulkarni, D. Modeling Weather Impact on Ground Delay Programs. SAE Int. J. Aerosp. 2011, 4, 1207–1215.
  10. Vossen, T.W.M.; Ball, M.O. Slot Trading Opportunities in Collaborative Ground Delay Programs. Transp. Sci. 2006, 40, 29–43.
  11. Ball, M.O.; Hoffman, R.; Mukherjee, A. Ground Delay Program Planning Under Uncertainty Based on the Ration-by-Distance Principle. Transp. Sci. 2010, 44, 1–14.
  12. Yan, C.; Vaze, V.; Barnhart, C. Airline-Driven Ground Delay Programs: A Benefits Assessment. Transp. Res. Part C Emerg. Technol. 2018, 89, 268–288.
  13. Jacquillat, A. Predictive and Prescriptive Analytics Toward Passenger-Centric Ground Delay Programs. Transp. Sci. 2022, 56, 265–298.
  14. Ikram, R.M.A.; Goliatt, L.; Kisi, O.; Trajkovic, S.; Shahid, S. Covariance Matrix Adaptation Evolution Strategy for Improving Machine Learning Approaches in Streamflow Prediction. Mathematics 2022, 10, 2971.
  15. Kalita, J.; Balas, V.E.; Borah, S.; Pradhan, R. (Eds.) Recent Developments in Machine Learning and Data Analytics: IC3 2018; Advances in Intelligent Systems and Computing; Springer Singapore: Singapore, 2019; Volume 740, ISBN 9789811312793.
  16. Grabbe, S.R.; Sridhar, B.; Mukherjee, A. Clustering Days with Similar Airport Weather Conditions. In Proceedings of the 14th AIAA Aviation Technology, Integration, and Operations Conference, Atlanta, GA, USA, 16 June 2014.
  17. Liu, Y.; Liu, Y.; Hansen, M.; Pozdnukhov, A.; Zhang, D. Using Machine Learning to Analyze Air Traffic Management Actions: Ground Delay Program Case Study. Transp. Res. Part E Logist. Transp. Rev. 2019, 131, 80–95.
  18. Klein, A.; Craun, C.; Lee, R.S. Airport Delay Prediction Using Weather-Impacted Traffic Index (WITI) Model. In Proceedings of the 29th Digital Avionics Systems Conference, Salt Lake City, UT, USA, 3–7 October 2010; pp. 2.B.1-1–2.B.1-13.
  19. Liu, J.; Li, K.; Yin, M.; Zhu, X.; Han, K. Optimizing Key Parameters of Ground Delay Program with Uncertain Airport Capacity. J. Adv. Transp. 2017, 2017, 1–9.
  20. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  21. Kuhn, K.D. A Methodology for Identifying Similar Days in Air Traffic Flow Management Initiative Planning. Transp. Res. Part C Emerg. Technol. 2016, 69, 1–15.
  22. EUROCONTROL. Algorithm to Describe Weather Conditions at European Airports; Technical Report; EUROCONTROL: Brussels, Belgium, 2011.
  23. Schultz, M.; Lorenz, S.; Schmitz, R.; Delgado, L. Weather Impact on Airport Performance. Aerospace 2018, 5, 109.
  24. Hazarika, B.B.; Gupta, D. 1-Norm Random Vector Functional Link Networks for Classification Problems. Complex Intell. Syst. 2022, 8, 3505–3521.
  25. Kutty, A.A.; Wakjira, T.G.; Kucukvar, M.; Abdella, G.M.; Onat, N.C. Urban Resilience and Livability Performance of European Smart Cities: A Novel Machine Learning Approach. J. Clean. Prod. 2022, 378, 134203.
  26. Hazarika, B.B.; Gupta, D.; Natarajan, N. Wavelet Kernel Least Square Twin Support Vector Regression for Wind Speed Prediction. Environ. Sci. Pollut. Res. 2022, 29, 86320–86336.
  27. Borah, P.; Gupta, D. Affinity and Transformed Class Probability-Based Fuzzy Least Squares Support Vector Machines. Fuzzy Sets Syst. 2022, 443, 203–235.
  28. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  29. Wang, Y.; Xia, S.-T. Unifying Attribute Splitting Criteria of Decision Trees by Tsallis Entropy. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2507–2511.
  30. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
  31. Guo, M.; Yuan, Z.; Janson, B.; Peng, Y.; Yang, Y.; Wang, W. Older Pedestrian Traffic Crashes Severity Analysis Based on an Emerging Machine Learning XGBoost. Sustainability 2021, 13, 926.
  32. Carvalho, L.; Sternberg, A.; Maia Gonçalves, L.; Beatriz Cruz, A.; Soares, J.A.; Brandão, D.; Carvalho, D.; Ogasawara, E. On the Relevance of Data Science for Flight Delay Research: A Systematic Review. Transp. Rev. 2021, 41, 499–528.
  33. AlKhereibi, A.H.; Wakjira, T.G.; Kucukvar, M.; Onat, N.C. Predictive Machine Learning Algorithms for Metro Ridership Based on Urban Land Use Policies in Support of Transit-Oriented Development. Sustainability 2023, 15, 1718.
  34. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 2016, 104, 148–175.
Figure 1. GDP occurrence process.
Figure 2. General framework.
Figure 3. Nanjing Lukou International Airport layout.
Figure 4. Weather data of ZSNJ for the first six months of 2021.
Figure 5. Weather and traffic correlation matrix.
Figure 6. ROC curve.
Figure 7. Proportion of each weather element.
Figure 8. Actual/scheduled flights and ATMAP weather score for 26 January 2021.
Figure 9. (a) ROC curve on the test set and (b) test set performance comparison of XGBoost, Random Forest, and SVM.
Figure 10. Confusion matrix on the test set for (a) SVM, (b) Random Forest, and (c) XGBoost.
Figure 11. (a) Random Forest feature importance ranking and (b) XGBoost feature importance ranking.
Figure 12. Departure flight delays during GDP duration.
Figure 13. Predicted delay time on 22 April over the GDP duration.
Table 1. Weather and traffic variables description.
Category | Attribute Name (Abbr.) | Attribute Description
Weather elements | ATMAP scores | Score obtained by the ATMAP algorithm
Weather elements | Visibility | Surface visibility (miles)
Weather elements | Ceiling | Cloud height (miles)
Weather elements | Wind direction (WD) | 1 if headwind, 0 otherwise
Weather elements | Wind speed (WS) | Surface wind speed (m/s)
Traffic features | Scheduled arrivals (SA) | Hourly scheduled arrival counts (aircraft/h)
Traffic features | Scheduled departures (SD) | Hourly scheduled departure counts (aircraft/h)
Traffic features | Actual arrivals (AA) | Hourly actual arrival counts (aircraft/h)
Traffic features | Actual departures (AD) | Hourly actual departure counts (aircraft/h)
Table 2. ATMAP weather algorithm scoring criteria for precipitation phenomena.
Precipitation Severity Code | Type of Precipitation | Coefficient
1 | No precipitation | 0
2 | RA, UP, DZ, IC | 1
3 | −SN, SG, +RA | 2
4 | FZxx, SN, +SN | 3
Table 3. Quantitative results of a routine airport weather report.
METAR: METAR ZSNJ 150800Z VRB01MPS 3500 -SHRA BR FEW033TCU OVC033 18/17 Q1011
Weather Class | Visibility | Wind | Precipitation | Freezing | Dangerous | Sum
Reported value | 3500 | 01MPS | −SHRA | − | −SHRA FEW033TCU |
Quantitative result | 0 | 0 | 1 | 0 | 12 | 13
Table 4. Model parameter search ranges.
Model | Parameter | Parameter Search Range
SVM | C | [0, 1]
SVM | gamma | [0, 1]
SVM | kernel | [linear, poly, rbf, sigmoid]
Random Forest | n_estimators | [0, 100]
Random Forest | max_depth | [0, 30]
Random Forest | min_samples_split | [0, 50]
Random Forest | min_samples_leaf | [0, 10]
XGBoost | n_estimators | [10, 300]
XGBoost | gamma | [0, 10]
XGBoost | learning_rate | [0.05, 0.5]
XGBoost | max_depth | [3, 30]
XGBoost | min_child_weight | [0, 10]
Table 5. The models' key parameter selections.
Model | Parameter | Optimal Value
SVM | C | 0.1
SVM | gamma | 0.029
SVM | kernel | rbf
Random Forest | n_estimators | 20
Random Forest | max_depth | 3
Random Forest | min_samples_split | 10
Random Forest | min_samples_leaf | 5
XGBoost | n_estimators | 200
XGBoost | gamma | 3.54
XGBoost | learning_rate | 0.05
XGBoost | max_depth | 10
XGBoost | min_child_weight | 1.3
Table 6. Predictive performance of the different models.
Prediction Model | Accuracy (Train/Test) | Cohen Kappa (Train/Test) | Precision (Train/Test) | F1 Score (Train/Test) | AUC (Train/Test)
SVM | 0.838/0.722 | 0.676/0.445 | 0.780/0.683 | 0.853/0.749 | 0.901/0.792
Random Forest | 0.812/0.748 | 0.595/0.497 | 0.733/0.707 | 0.822/0.771 | 0.871/0.821
XGBoost | 0.986/0.902 | 0.962/0.800 | 0.991/0.977 | 0.986/0.891 | 0.998/0.961
Table 7. Model performance measures with and without the ATMAP score.
Model | Data Set | MSE | MAE | R²
Ridge | Without ATMAP score | 472 | 16.60 min | 0.51
Ridge | With ATMAP score | 430 | 15.71 min | 0.55
LASSO | Without ATMAP score | 462 | 16.89 min | 0.52
LASSO | With ATMAP score | 418 | 15.92 min | 0.56
Decision tree | Without ATMAP score | 302 | 12.03 min | 0.68
Decision tree | With ATMAP score | 244 | 10.90 min | 0.74