Article

Predictive Modeling of Flight Delays at an Airport Using Machine Learning Methods

1 Department of International Trade and Logistics, Faculty of Applied Sciences, Akdeniz University, Antalya 07070, Turkey
2 Department of Management Information Systems, Faculty of Applied Sciences, Akdeniz University, Antalya 07070, Turkey
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5472; https://doi.org/10.3390/app14135472
Submission received: 13 May 2024 / Revised: 8 June 2024 / Accepted: 22 June 2024 / Published: 24 June 2024

Abstract

Flight delays represent a significant challenge in the global aviation industry, resulting in substantial costs and a decline in passenger satisfaction. This study addresses the critical issue of predicting flight delays exceeding 15 min using machine learning techniques. The arrival delays at a Turkish airport are analyzed utilizing a novel dataset derived from airport operations. This research examines a range of machine learning models, including Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, XGBoost, CatBoost, and LightGBM. To address the issue of imbalanced data, additional experiments are conducted using the Synthetic Minority Over-Sampling Technique (SMOTE), in conjunction with the incorporation of meteorological data. This multi-faceted approach ensures robust forecast performance under varying conditions. The SHAP (SHapley Additive exPlanations) method is employed to interpret the relative importance of features within the models. The study is based on a three-year period of flight data obtained from a Turkish airport. The dataset is sufficiently extensive and robust to provide a reliable foundation for analysis. The results indicate that XGBoost is the most proficient model for the dataset, demonstrating its potential to deliver highly accurate predictions with an accuracy of 80%. For this dataset, incorporating weather factors is found to have no significant impact on the predictions compared with scenarios that exclude weather data.

1. Introduction

Air travel, renowned for its remarkable speed compared to other modes of transportation, is also highly susceptible to unforeseen delays. The airline industry often formulates flight plans well in advance, with some airlines planning up to 300 days ahead [1]. Nevertheless, this extensive planning period is beset by variables and circumstances that can disrupt flight schedules, resulting in delays.
The consequences of flight delays have a profound impact on a multitude of stakeholders. In terms of economic impact, flight delays cost the United States (U.S.) an estimated 33 billion dollars in 2019 [2]. Furthermore, sectors that are indirectly affected by flight delays, including retailers, accommodation businesses, and tourism, collectively incur costs that exceed two billion U.S. dollars. In addition to financial losses, flight delays have a multitude of other consequences. The consequences of flight delays are felt by all parties involved, including passengers, airlines, airport operators, and the environment. Repetitive and unpredictable delays have been identified as one of the most significant challenges faced by both airlines and passengers [3,4]. For passengers, flight delays can have a detrimental impact on their travel plans, leading to missed connections and delayed vacations and business appointments, which in turn exacerbates passenger dissatisfaction [5]. Furthermore, persistent delays can prompt leisure travelers to explore alternative transportation options, which in turn affects the number of passengers carried by airlines [6,7]. Airlines are also affected by delays in terms of their operating costs. The occurrence of persistent delays on specific routes has the potential to erode an airline’s appeal to customers over time [8]. Even airports are not immune to the negative effects of delays, as passenger satisfaction levels decline significantly when such occurrences occur [5,9]. Another crucial aspect to be considered is the environmental impact of flight delays. It was estimated that U.S. airlines expend a considerable 2 billion liters of jet fuel annually as aircraft remain idling on runways due to flight delays [10]. This heightened fuel consumption not only results in economic losses but also amplifies carbon emissions, thereby exacerbating environmental concerns [11]. In light of these considerations, it is evident that flight delays have a multitude of adverse effects on stakeholders, encompassing both financial costs and other detrimental consequences.
To mitigate the adverse effects of flight delays, it is of the utmost importance to be able to accurately predict when such delays are likely to occur. The objective of this study is to predict whether flights will be delayed by more than 15 min, using readily available weather and airport data, with a particular focus on the arrival performance of the selected airport. Arrival delays not only affect the on-board passengers but also disrupt future flight plans that rely on the same aircraft. Given the uniqueness of the dataset in comparison to those utilized in the existing literature, a variety of machine learning techniques, including Logistic Regression, Naïve Bayes, Neural Networks, XGBoost, Random Forests, CatBoost, and LightGBM, are explored to determine the most suitable approach.
This paper proceeds with a detailed examination of the research stages. The study begins with an explicit definition of the research question, followed by a comprehensive review of the existing literature. The methods employed are then elucidated, with detailed information provided about the dataset. Finally, the findings are presented and discussed in depth, leading to recommendations for future research.

2. Problem Definition

Air travel plays a pivotal role in modern society, offering unparalleled convenience and efficiency in long-distance transportation. One of the most persistent challenges currently facing the aviation industry is the occurrence of flight delays. Such delays have the potential to disrupt travel plans, lead to the incurrence of additional costs, and ultimately result in the dissatisfaction of passengers. In order to address this issue, the objective of this study is to develop a predictive model for flight delays at a Turkish airport using machine learning techniques.
Flight delays are commonly defined as instances where flights depart later than their scheduled departure times. It is, however, important to note that the Federal Aviation Administration (FAA) does not categorize delays of 15 min or less as tardy. Accordingly, for the purposes of this study, only those instances of flight delay exceeding the 15 min threshold are considered as delayed, in alignment with the FAA standards [12].
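As a minimal illustration of how this threshold translates into a binary target, the snippet below labels flights exceeding 15 min of delay; the column names are hypothetical and not the actual dataset schema.

```python
import pandas as pd

# Hypothetical column holding the arrival delay in minutes for each flight record.
flights = pd.DataFrame({"arrival_delay_min": [3, 22, -5, 47, 15]})

# Following the FAA convention, only delays exceeding 15 min are labelled as delayed.
flights["delayed"] = (flights["arrival_delay_min"] > 15).astype(int)
print(flights)
```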
In order to mitigate potential biases and improve the robustness of our models, we explore the use of a resampling technique called Synthetic Minority Over-Sampling Technique (SMOTE). SMOTE is a data augmentation technique that generates synthetic samples by interpolating existing minority class instances, thereby addressing the class imbalance issue. This process helps to prevent models from being biased towards the majority class [13]. In our case, the objective of SMOTE is to achieve a more balanced representation of both on-time and delayed departures by generating synthetic instances of the minority class (delayed flights).
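A minimal sketch of applying SMOTE with the imbalanced-learn library is shown below; the data are synthetic and the class ratio is only indicative of the imbalance addressed later in this study.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced data standing in for the flight features and the
# binary delayed/on-time label (roughly one delayed flight per four on-time flights).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates between existing minority-class instances to create
# synthetic delayed-flight samples until the two classes are balanced.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))
```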
There is a paucity of consensus regarding the impact of meteorological conditions on aviation operations. According to the Bureau of Transportation Statistics (2022), extreme weather conditions affect flight operations 4% of the time, yet in 2020, 45.8% of flights were delayed due to weather conditions. The financial impact of flight delays due to weather and other circumstances differs between airlines: when a flight is delayed by inclement weather, the airline is not required to make additional payments to passengers, and some airlines may attempt to exploit this situation. Consequently, a dedicated set of experiments was devised in which weather conditions were incorporated alongside the available airport data, with the objective of assessing whether they enhance the accuracy of flight delay predictions.
Feature selection is a crucial aspect of machine learning model development. This study investigates whether comparable predictive results can be achieved using fewer features. In particular, the use of SHAP, an interpretation method whose importance values can guide the retention of relevant features and the discarding of less informative ones, is explored. This analysis aims to determine whether the dimensionality of the dataset can be reduced without sacrificing predictive accuracy.
In conclusion, this article addresses the complex issue of flight delay prediction at a Turkish airport. By considering the guidelines set forth by the FAA, meteorological conditions, data imbalances, and feature selection techniques, our objective is to gain a comprehensive understanding of the factors contributing to flight delays. Ultimately, we aim to develop accurate predictive models that can assist the aviation industry in optimizing operations and enhancing passenger experiences.
This study makes several significant contributions to the field of flight delay prediction. Firstly, we conducted an analysis of the performance of various machine learning models, including Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, XGBoost, CatBoost, and LightGBM, with the objective of identifying the optimal model for our dataset. This diversity allowed for a comprehensive evaluation of model performance. Secondly, SMOTE was employed in order to address the issue of imbalanced datasets, with the effect of this specific technique being assessed. Additionally, the integration of weather data into the models enabled the examination of the impact of weather conditions on flight delays, thereby enhancing the overall accuracy of the models and the understanding of the contribution of weather variability to the predictions.
Furthermore, we employed a novel dataset that has not been previously explored in the literature for the purpose of predicting flight delays. This enhanced the originality of our study and its contribution to the field. Additionally, the performance of the models was evaluated on both the training and the test datasets, with recommendations provided for improving the models’ generalization ability. This will facilitate the generation of more reliable and consistent predictions in real-world applications. Finally, we employed SHAP values to identify the key features affecting flight delays, thereby enhancing the interpretability of the models and providing more reliable and understandable predictions for decision-makers.

3. Literature Review

In the field of flight delay prediction, a variety of methods have been employed in the literature. These methods can be categorized into regression models, probability-based models, operational research models, network-based models, and machine learning methods.
Regression models have been used extensively, such as a hybrid regression and time series models applied to Frankfurt Airport, explaining 60–69% of variability [14], and a multivariate regression model achieving a 5.3 min precision [15]. Probability-based approaches have been utilized, including a probabilistic model combined with genetic algorithms to estimate delay distributions at Denver International Airport [16], and survival analysis with the Cox model to analyze flight delays probabilistically [17].
Operations research techniques have also played a role, with dynamic optimization models for flight operations [18], optimization of aircraft routes and departure times using mixed-integer programming [19], and multi-objective optimization leading to a 12–31% improvement in flight delays [20]. Network-based models have also been used, such as the clustered airport modeling approach for optimizing flight operations [21], the minimum-cost flow problem for system design [22], and investigations of delay propagation among airports with Granger causality [23,24].
Despite the historical utilization of these methods, their practical application has not been widely adopted. In recent years, there has been a notable shift towards the embrace of machine learning techniques. This transition has been driven by the abundance of big datasets in the aviation industry and the demonstrated success of machine learning in pattern recognition.
The earliest documented use of machine learning methods in flight delay prediction from our literature review dates to 2014. Since 2016, research employing machine learning techniques has been on the rise, with an evolving array of methods. Table 1 provides a concise summary of findings from these studies, encompassing information on data sources, datasets used, and selected techniques. Among the methods employed, Random Forest (RF) stands out as the most utilized technique in the papers reviewed. Following closely, Decision Trees (DT), Gradient Boosting (GB), and Artificial Neural Networks (ANN) have been preferred in successive order. In terms of dataset origins, the United States predominantly serves as the source, with select studies encompassing data from Egypt, Iran, Brazil, and various European countries or individual airline companies. While most publications cover information on multiple airports, some studies focus on just one or a few specific airports.
In conclusion, each method for flight delay prediction has its own set of strengths and weaknesses. Regression models and probability-based approaches offer simplicity and flexibility but may encounter difficulties when faced with complex relationships and data requirements. Operations research models offer powerful optimization capabilities but can be complex and difficult to scale. Network-based models offer a comprehensive view of delay propagation but are data-intensive and computationally demanding. Machine learning models are distinguished by their high accuracy and capacity to handle complex relationships. However, these models require substantial data and computational resources.
In the digital age, data are widely acknowledged as a form of currency. Consequently, it is of paramount importance for businesses to make predictions based on their own data sources. In this context, flight data belonging to a single airport operator were utilized in this study. As this dataset has not been previously used in the literature, seven different machine learning techniques were applied. The choice of using the ANN technique was driven by its frequent preference in the literature. Naïve Bayes (NB) and logistic regression were selected due to their suitability for binary classification tasks. Furthermore, the inclusion of XGBoost, LightGBM, and CatBoost was motivated by their frequent comparison in the context of gradient boosting-based techniques.
This study addresses several notable gaps in the existing literature on flight delay prediction. By utilizing a novel dataset from a Turkish airport, which has not been previously explored, the study shows originality and provides a new perspective on flight delay prediction. Furthermore, it conducts a comprehensive comparison of various machine learning models, including Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, XGBoost, CatBoost, and LightGBM, allowing for a thorough evaluation of model performance across different scenarios. The integration of weather data into the predictive models enables an examination of the impact of weather conditions on flight delays, offering a more holistic view of the influencing factors. Furthermore, this study addresses data imbalance by employing SMOTE to balance the dataset. This assessment demonstrates an innovative approach to handling data imbalance. Additionally, feature importance analysis using SHAP values identifies key features affecting flight delays, enhancing the interpretability of the models and providing more reliable and understandable predictions for decision-makers.

4. Methods

One of the initial decisions to be made when utilizing machine learning techniques is how to conduct the learning process. In this study, all methods were employed in a supervised learning setting, given that the outputs of the dataset were available. Accordingly, 70% of the available data were used for training and the remaining 30% for testing. Another issue that requires resolution is the selection of hyperparameter values; Bayesian optimization was used for hyperparameter tuning. The advantage of the Bayesian technique is that it performs a probabilistic search rather than evaluating every hyperparameter combination: it selects a number of hyperparameter settings, evaluates their quality, and decides which settings to sample next. Consequently, results can be obtained more rapidly. Finally, all the selected machine learning techniques are suitable for binary classification, that is, for predicting the status of an aircraft (late/on time). An illustrative sketch of this training setup is given below, after which each technique is briefly introduced together with the reason for its selection.
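The sketch below assumes scikit-optimize's BayesSearchCV and an XGBoost classifier purely for illustration; the model, search space, and data are placeholders and do not reproduce the exact configuration summarized in Table 4.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the prepared flight data (binary late/on-time target).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.78, 0.22], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42  # 70/30 split
)

# Bayesian optimization probes promising hyperparameter settings instead of
# exhaustively evaluating every combination, so results are obtained faster.
search = BayesSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    search_spaces={
        "n_estimators": Integer(100, 1000),
        "max_depth": Integer(3, 10),
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
    },
    n_iter=25,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)
print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```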
  • Naïve Bayes
The Naïve Bayes algorithm is a straightforward probabilistic classifier that calculates probabilities based on the frequency and combinations of values within a given dataset. The fundamental principle of this approach is the assumption of strong independence, which states that the presence or absence of any given class feature has no bearing on the presence or absence of any other feature. In essence, the Naïve Bayes algorithm operates using conditional probability [50,51]. One of its advantages is its ability to generate classifications from large datasets without the need for complex iterative parameter estimation procedures [52]. Because of this, it is chosen for its simplicity and effectiveness with large datasets, which makes it a good baseline model.
  • Artificial Neural Networks
Artificial neural networks (ANNs) are designed to mimic the functionality of the human brain. Specifically, ANN models emulate the electrical activity of the brain and nervous system. Due to their structure, they can offer solutions to various problems, such as nonlinear and stochastic problems [53]. Artificial neurons or nodes, which are information processing units grouped in layers and joined by synaptic weights (connections), constitute perceptron-type neural networks. The technique has the advantages of working with incomplete data and being fault-tolerant: the corruption of one or more cells does not prevent the network from producing output [54]. An ANN is chosen for its ability to capture complex, non-linear relationships in the data, as shown in [30,32,36].
  • Logistic Regression
Logistic regression is a well-established technique that was first developed by D. R. Cox [55]. It is sensitive to overfitting, which is usually addressed with L2 regularization [44]. The technique assumes linearity only between the explanatory variables and the logit of the output [56]. The method can be used only when the dependent variable is categorical. Unlike artificial neural networks, practitioners can see the relationship between the variables and the output. Another advantage is that a solution can be found quickly, as less processing time is required [57]. Therefore, logistic regression is preferred for its interpretability and speed, which makes it suitable for quick insights and baseline comparisons, as seen in [35,44].
  • Random Forests
Random Forests (RFs) are a model type that provides a relatively transparent and understandable estimation, since they are based on the idea of combining decision trees. The model was introduced by [58]. The main idea of the Random Forests method is to generate a large number of predictive trees, each built on a randomly drawn subset of the data, and to aggregate their predictions. The fundamental concept underlying the method is bagging; in other words, the combination of decision trees and the bagging technique constitutes the Random Forests method. The RF method is chosen for its robustness and ability to handle high-dimensional data and interactions between features with high accuracy, as demonstrated in [25,31,38].
  • XGBoost
XGBoost was developed as an improvement over the gradient boosted decision tree (GBDT) technique and has been successful in many machine learning competitions. It has several advantages over classical methods [59]. First, it contains a sparsity-aware tree-learning algorithm for sparse datasets. Second, it employs a theoretically justified weighted quantile sketch that allows approximate tree learning to handle instance weights. Parallel and distributed computing reduces the training time, allowing researchers to conduct model explorations more quickly. Finally, out-of-core computation enables data scientists to process massive amounts of information. As one of the most widely preferred algorithms, XGBoost was included for its high performance and efficiency, especially in handling sparse datasets and complex patterns, as reported by [29,31,49].
  • LightGBM
LightGBM is another widely used technique that has gained recognition, particularly in classification problems. Its stated motivation was the inadequacy of existing methods for datasets with very many instances and features [60]. Building on XGBoost, it takes a different approach to classification problems by adding and combining two techniques: gradient-based one-side sampling and exclusive feature bundling. LightGBM was therefore included for its speed and efficiency, especially with large datasets with many features, as evidenced by [31,42].
  • CatBoost
CatBoost is a technique developed especially for categorical features. It introduces two fundamental innovations: a permutation-driven ordered boosting algorithm and a novel algorithm for processing categorical information [61]. It is claimed that CatBoost can also handle cases in which the number of categories is too large for conventional GBDT models. It achieves this by dividing the dataset into random subsets, converting the labels to integer values, and converting the remaining categorical features to numerical values. The reasons for its selection were its superior handling of categorical data and its avoidance of overfitting through ordered boosting, as demonstrated in [43].
By employing a diverse set of models, we aimed to leverage the strengths of each technique to develop a comprehensive and accurate predictive model for flight delays.
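As an illustrative sketch (not the tuned configuration used in this study), the seven classifiers can be trained and compared on a common split as follows; default or placeholder hyperparameters are used, and GaussianNB and MLPClassifier stand in for the Naïve Bayes and ANN models, respectively.

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for the flight dataset (binary late/on-time target).
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
    "Random Forest": RandomForestClassifier(n_estimators=300),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
}

# Fit each candidate on the same training split and report its test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {accuracy_score(y_test, model.predict(X_test)):.3f}")
```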

5. Dataset

Although aviation data are published as open source in some countries, such a secondary data source is not available in most countries. For this reason, secondary data were obtained directly from an airport for this study. The dataset covered three years (2016, 2017, 2018) and contained records of 40,077 flights (8625 late and 31,452 on time). As this dataset has not been used before, it allows the results to be evaluated on data that differ from those used in the existing literature.
While creating the dataset, care was taken to ensure that the features recorded at the airport were similar to those used in the literature for estimating flight delays. Table 2 shows the features obtained from the airport, the data types of these features, and the studies in the literature that made flight delay estimations using relevant data.
The study included meteorological data and data pertaining to the airport. The relevant weather data were obtained via the online website rp5.ru. The weather data from the time zone closest to the flight were extracted from the dataset, which contained weather information updated every 20 min. In addition to these variables, those presented in Table 2 were employed in the estimation of flight delays in conjunction with meteorological conditions. Table 3 presents the weather-related features utilized and the types of data associated with each feature. Furthermore, the flight type variable was included, which indicated whether the flight in question was scheduled or not. This encompassed special, ferry, and technical flights.
Encoding of categorical data and handling of the imbalanced dataset were required during the data preparation process. The day of the week feature had 7 unique values; among the other categorical features, there were 31 for day of the month, 12 for month, 97 for origin (departure) airport, 17 for mean wind direction, 419 for total cloud cover, and 2 for flight type. For proper modelling, all categorical features were converted to numerical values. One-hot encoding, which is frequently used for categorical data conversion, was selected for this purpose. With this approach, new features were created from the unique values of each categorical feature; after encoding, the number of features used in the study was 595. A further data preparation step was required for the imbalanced dataset. A dataset is imbalanced if the classification categories are not represented approximately equally, and this was the case here owing to the lower number of delayed flights. SMOTE [67], one of the recommended approaches, was used to address this problem; it minimizes the minority class disadvantage by generating synthetic data similar to the minority class. Results obtained without SMOTE are also presented to allow further analysis of the outcomes.
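A minimal sketch of the one-hot encoding step with pandas is shown below; the column names and category values are hypothetical placeholders, not the actual schema of the airport dataset.

```python
import pandas as pd

# Hypothetical rows illustrating a few of the categorical features described above.
raw = pd.DataFrame({
    "day_of_week": ["Mon", "Fri", "Sun"],
    "origin_airport": ["SAW", "IST", "ESB"],
    "flight_type": ["scheduled", "non-scheduled", "scheduled"],
})

# One-hot encoding creates one binary column per unique category value,
# which is how the feature count can grow substantially after conversion.
encoded = pd.get_dummies(raw, columns=["day_of_week", "origin_airport", "flight_type"])
print(encoded.columns.tolist())
```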

6. Results

We conducted four distinct trials using the seven selected methods on our prepared dataset to predict flight delays using Python. These trials were designed to examine the impact of different factors on prediction accuracy. In the first trial, we estimated flight delays solely by optimizing hyperparameters without incorporating weather data. The second experiment incorporated the results of the SMOTE technique in addition to the same hyperparameter optimization. In the third and fourth trials, we introduced weather data into the dataset. Notably, the third trial did not include the SMOTE technique, while the fourth trial employed it.
In all trials, the Bayesian hyperparameter optimization approach was used, and appropriate hyperparameters were selected to fine-tune the machine learning models. Table 4 provides an overview of the scanned hyperparameter space for reference.

6.1. Performance Evaluation

The effectiveness of the learning models was assessed during both the training and the testing phases for all trials. The success of the techniques was determined through a comprehensive evaluation using multiple performance metrics. These metrics offer various insights into a model’s performance:
Accuracy: This metric measures the overall correctness of a model’s predictions and is widely used in the literature. It is computed from the confusion matrix. However, for imbalanced datasets, relying solely on accuracy can be misleading [66].
Recall (sensitivity): Recall represents the percentage of relevant instances that were correctly identified by the model. It is a vital metric for cases where finding all relevant instances is crucial.
Precision (positive predictive value): Precision indicates the percentage of retrieved instances that are relevant to the problem at hand. High precision is important when minimizing false positives is essential.
F-1 score: The F-1 score is the harmonic mean of precision and recall. It provides a balanced evaluation that considers both false positives and false negatives.
Balanced error rate: This metric offers an assessment of error that considers class imbalances and can be particularly valuable for imbalanced datasets.
These metrics collectively offer a robust evaluation of a technique's performance, considering different aspects of predictive accuracy and suitability for the task.
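These metrics can be computed directly from a model's predictions, for example with scikit-learn; the labels below are illustrative, and the balanced error rate is derived as 1 minus the balanced accuracy.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# Illustrative true labels (1 = delayed) and model predictions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy:           ", accuracy_score(y_true, y_pred))
print("Recall:             ", recall_score(y_true, y_pred))
print("Precision:          ", precision_score(y_true, y_pred))
print("F-1 score:          ", f1_score(y_true, y_pred))
print("Balanced error rate:", 1 - balanced_accuracy_score(y_true, y_pred))
```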
The result of the first trial, which was conducted with only the airport data, is shared in Table 5. XGBoost and CatBoost consistently demonstrated strong performance across multiple metrics in both the training and the testing phases, which makes them promising candidates for predicting flight delays. Logistic Regression, ANN, and LightGBM also delivered reasonable results, while Naïve Bayes performed less effectively in most metrics.
In the second experiment, the same dataset was balanced with SMOTE. The results are presented in Table 6. Overall, the application of the SMOTE technique helped improve recall in several models during the training phase, at the expense of precision in some cases. However, it did not consistently improve the model performance in the testing phase, with recall often decreasing. XGBoost, CatBoost, and LightGBM continued to demonstrate strong performance even with SMOTE, suggesting their robustness in dealing with both imbalanced and balanced data.
Table 7 reflects the varying performance of different machine learning models when combined with airport and weather data. XGBoost, CatBoost, and LightGBM continued to emerge as strong candidates, demonstrating excellent accuracy, recall, precision, and balanced error rates. On the other hand, Naïve Bayes excelled in recall but at the expense of precision and accuracy. Logistic Regression and ANN showed reasonable performance, while Random Forest exhibited suboptimal results with zero recall, which made it less suitable for this dataset and task.
Table 8 reveals the performance of the machine learning models when using a combination of weather and airport data with the application of the SMOTE technique. XGBoost, CatBoost, and LightGBM continued to demonstrate strong performance, showcasing high accuracy, recall, precision, and balanced error rates. Naïve Bayes exhibited exceptionally high recall but at the cost of precision, which makes it a suitable choice if recall is the top priority. Logistic Regression and ANN showed reasonable and balanced performance, while Random Forest demonstrated suboptimal results, with low recall and precision.

6.2. Analyses of Feature Importance

In this section, we delve into the analysis of feature importance. The forecasting model employed in this study incorporated 595 features, resulting in increased decision complexity. The selection of features is of critical importance for determining the extent to which a simplified model can produce results that are comparable in quality to those of a more comprehensive model.
Various feature selection approaches exist in the literature, but in this study, we focused on SHAP (SHapley Additive exPlanations) values. SHAP values, originating from cooperative game theory, are widely used to enhance the transparency and interpretability of machine learning models. They aim to measure each input feature’s contribution to individual predictions or the final model outcome [68].
A defining property of SHAP values is that, for each prediction, they sum to the difference between the model's output for that instance and the baseline (expected) model output obtained when no feature information is used. In other words, the sum of the SHAP values of all input features fully accounts for how the current prediction deviates from the expected model output.
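As an illustrative sketch of how such SHAP values can be obtained for a fitted tree ensemble with the shap library (the data and model here are placeholders, not the study's trained classifier):

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Placeholder data and model standing in for the trained flight delay classifier.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
model = XGBClassifier(eval_metric="logloss").fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles; for each
# prediction, the values sum to the difference between that prediction and
# the expected (baseline) model output.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance plot similar in spirit to Figure 1 (mean absolute SHAP value per feature).
shap.summary_plot(shap_values, X, plot_type="bar")
```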
Figure 1 indicates the relative importance of each feature in influencing the model’s output. According to the SHAP results shown in Figure 1, the variables in order of importance were scheduled time, number of passengers (PAX), atmospheric pressure at the arrival airport (P0), travel distance, arrival airport’s temperature, arrival airport’s relative humidity, arrival airport’s dew point temperature, arrival airport’s mean wind speed, carried cargo by the flight, and flights departed from Sabiha Gökçen Airport. Understanding these features’ impacts is crucial for interpreting and improving the model, as it allows one to focus on the most influential factors when addressing flight delay predictions.
Figure 2 indicates that flights with late scheduled times had a positive impact and were more likely to be delayed. It can be observed that flights with a smaller number of passengers were more likely to be delayed at the destination airport. Longer distances resulted in a positive impact on delay, indicating that longer flights may be more likely to result in arrival delays. The significance of air pressure (P0), relative humidity (U), mean wind speed (FF), and dew point temperature (Td) indicated that the weather conditions exerted a considerable influence on flight delays. A reduction in wind speed or an increase in air pressure appeared to have a negative impact on flight delays, resulting in a greater number of on-time arrivals. Conversely, high wind speeds or low air pressure at the arrival airport were associated with increased delays.
The three models (XGBoost, CatBoost, and LightGBM) were selected for analysis due to their high-accuracy results, which demonstrated their robustness and reliability in predicting flight delays. Although the three models exhibited similarly impressive performance, XGBoost was selected for in-depth analysis due to its consistently higher recall rates, which is essential for reducing the number of false negatives when forecasting flight delays. The subsequent results were obtained by utilizing the 10 most significant features in conjunction with XGBoost, as detailed in Table 9.
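An assumed workflow for this reduced-feature model is sketched below, continuing the placeholder names from the previous snippet; it illustrates the idea of ranking features by mean absolute SHAP value rather than reproducing the exact procedure behind Table 9.

```python
import numpy as np

# Rank features by mean absolute SHAP value and keep the ten most influential.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
top10_idx = np.argsort(mean_abs_shap)[::-1][:10]

# Refit the classifier on the reduced feature set and inspect its training fit.
reduced_model = XGBClassifier(eval_metric="logloss").fit(X[:, top10_idx], y)
print("Training accuracy with top-10 features:", reduced_model.score(X[:, top10_idx], y))
```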
The XGBoost model with selected features demonstrated excellent performance on the training data, with high accuracy, recall, precision, F-1 score, and AUC score and low balanced error rate (Table 9). However, the model’s performance on the testing data was slightly lower, suggesting that it may have some difficulty in generalizing to new, unseen data. Further model refinement and evaluation may be necessary to improve its generalization capabilities.
The techniques that showed signs of overfitting were Naïve Bayes, ANN, XGBoost, Random Forest, CatBoost, and LightGBM. These models showed high performance on the training data, but significantly lower performance on the test data, indicating that they learned specific patterns in the training data that did not generalize well to the new data.

7. Conclusions

In this section, we interpret and analyze the results of our study on flight delay prediction at a Turkish airport using machine learning techniques.
Based on the accuracy rates commonly reported in the literature, XGBoost, CatBoost, and LightGBM performed consistently well across different scenarios: they maintained robust performance when dealing with unbalanced data, when integrating weather information, and when SMOTE was applied. Specifically, XGBoost achieved an accuracy of 80% and recall rates of 20% to 27% across the different trials, indicating its robustness. CatBoost and LightGBM also demonstrated strong performance, with accuracies around 80% and recall rates ranging from 13% to 24%. Logistic Regression, although generally less robust than the more advanced models, showed improved performance when SMOTE was applied: with only airport data and SMOTE, it achieved an accuracy of 64% and 61% on the test set, and when both weather and airport data were used with SMOTE, the test accuracy improved to 65%, with a recall rate of 60%. Since most studies in the literature report results primarily in terms of accuracy, our findings suggest that XGBoost, CatBoost, and LightGBM, with their higher accuracy rates, are more effective for predicting flight delays than Logistic Regression and are therefore strong candidates for predicting flight delays at this Turkish airport.
In the context of predicting delayed flights on time, the most unfavorable scenario would be a model that performs poorly in terms of recall. In this context, the least desirable outcome would be that the model failed to correctly identify a significant proportion of delayed flights (i.e., high number of false negatives). XGBoost consistently demonstrated higher recall rates across a range of trials and scenarios. Nevertheless, the performance of the models on the test dataset, while satisfactory, indicates a need for further improvement in terms of generalization. This finding indicates that our models may benefit from additional fine-tuning and evaluation to enhance their performance in real-world settings.
The incorporation of weather data into our models proved to be a valuable addition, as it provided insights into the impact of weather on flight delay predictions. The experiments demonstrated that the weather conditions exert some influence on flight delays. Nevertheless, while the incorporation of weather data enhanced the model performance to a certain extent, it did not markedly alter the overall prediction accuracy in comparison to that of models that did not include weather data. This suggests that while the weather conditions are a significant factor, airport-specific operational factors may be more influential in determining flight delays at the studied location. Future research could investigate the interplay between weather conditions and operational factors in greater depth in order to gain a more comprehensive understanding of their combined effect on flight delays. Furthermore, the limited impact of the weather data on overall prediction accuracy indicates the potential necessity for more granular weather data or the investigation of other environmental factors that might affect flight operations.
Feature selection plays a critical role in model performance and interpretability. Our analysis of feature importance using SHAP values indicated that several key features significantly influence flight delays. Factors such as scheduled time, number of passengers, time to arrival, atmospheric pressure at the arrival airport, travel distance, and the number of flights departed from the airport showed the highest impacts on predicting flight delays. Understanding these influences is essential for aviation professionals to optimize operations and passenger experiences.

8. Discussion and Future Recommendations

Here, we will consider the implications of our findings and the potential avenues for further research in the field of aviation and machine learning.
While our machine learning models achieved noteworthy results on the training dataset, the slightly reduced performance on the testing dataset highlights the need for model generalization. To enhance their ability to handle unseen data effectively, further research could explore techniques such as additional feature engineering.
The transition from traditional methods to machine learning in the aviation industry is becoming increasingly apparent. The incorporation of large datasets and advanced algorithms has the potential to revolutionize flight delay prediction and aviation operations. As our models continue to improve, they can be valuable tools for airlines, airports, and regulators, allowing them to proactively manage delays and enhance passenger experiences.
At this juncture, it is academically prudent to engage in a discourse on ethical considerations. In addition to the positive results brought by accurate predictions, it is important to consider ethical issues when using machine learning techniques. This is important not only to ensure the accuracy of a study’s results but also to uphold fairness, impartiality, and respect for the rights and well-being of all relevant stakeholders. In this context, one of the crucial aspects is data privacy and security, emphasizing the need for data anonymization and compliance with data protection regulations, such as GDPR. To prevent biases in the training data, characteristics such as airline names or aircraft types were not included.
When examining the impact on stakeholders, it becomes evident that passengers may experience increased discomfort if they are misdirected based on prediction outcomes, which potentially leads to missed connecting flights or important events. Airlines, on the other hand, could face disruptions to their operations, resulting in financial losses and damage to their reputation due to incorrect predictions. Airport authorities might struggle with efficient resource allocation and, contrary to expectations, may encounter traffic congestion, delays, and additional costs.
To maintain ethical standards, it is crucial not only to ensure the accuracy of predictions but also to strive for fairness and the enhancement of the well-being of all stakeholders. Thus, monitoring and, if necessary, improving models during their usage become imperative in preserving ethical standards.
Our study has limitations, including the reliance on a specific dataset from a single airport. Future research could expand to include data from multiple airports and airlines to generalize the findings. Additionally, exploring advanced machine learning techniques like deep learning and recurrent neural networks (RNNs) may offer further insights. Also, regression algorithms should be tested to predict the delay times of delayed flights.
The practical implications of our study extend to the operations of the aviation industry and passenger services. By harnessing the potential of machine learning, airlines can optimize the flight schedules, allocate resources more efficiently, and reduce the costs associated with delays. Passengers, too, can benefit from more reliable travel experiences.

Author Contributions

Conceptualization, methodology, Ö.T. and I.H.; validation, I.H.; formal analysis, Ö.T.; writing—original draft preparation, Ö.T. and I.H.; writing—review and editing, I.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are not publicly available as they were obtained from a private company, and restrictions apply to the availability of these data, but can be acquired from the corresponding author on reasonable request.

Acknowledgments

This work was supported by The Scientific Research Projects Coordination Unit of Akdeniz University. Project Number: SBG-2020–5120.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schrader, R. How Far in Advance Can You Book a Flight? Skyscanner: London, UK, 2023. Available online: https://www.skyscanner.com/tips-and-inspiration/how-far-in-advance-can-you-book-a-flight (accessed on 1 April 2022).
  2. FAA. Cost of Delay Estimates. In Federal Aviation Administration; 2020. Available online: https://www.faa.gov/data_research/aviation_data_statistics/media/cost_delay_estimates.pdf (accessed on 1 April 2022).
  3. Yang, C.; Marshall, Z.A.; Mott, J.H. A Novel Integration Platform to Reduce Flight Delays in the National Airspace System. In Proceedings of the 2020 Systems and Information Engineering Design Symposium, Charlottesville, VA, USA, 24 April 2020. [Google Scholar] [CrossRef]
  4. Transportation Research Board. Future Flight: A Review of the Small Aircraft Transportation System Concept; National Academy Press: Washington, DC, USA, 2002. [Google Scholar]
  5. Song, C.; Guo, J.; Zhuang, J. Analyzing passengers’ emotions following flight delays—A 2011–2019 case study on SKYTRAX comments. J. Air Transp. Manag. 2020, 89, 101903. [Google Scholar] [CrossRef]
  6. Ferrer, J.C.; Rocha e Oliveira, P.; Parasuraman, A. The behavioral consequences of repeated flight delays. J. Air Transp. Manag. 2012, 20, 35–38. [Google Scholar] [CrossRef]
  7. Shaw, S. Airline Marketing and Management, 6th ed.; Ashgate: Hampshire, UK, 2007. [Google Scholar]
  8. Britto, R.; Dresner, M.; Voltes, A. The impact of flight delays on passenger demand and societal welfare. Transp. Res. E Logist. Transp. Rev. 2012, 48, 460–469. [Google Scholar] [CrossRef]
  9. Efthymiou, M.; Njoya, E.T.; Lo, P.L.; Papatheodorou, A.; Randall, D. The impact of delays on customers’ satisfaction: An empirical analysis of the british airways on-time performance at heathrow airport. J. Aerosp. Technol. Manag. 2019, 11, 1–13. [Google Scholar] [CrossRef]
  10. Ball, M.; Barnhart, C.; Dresner, M.; Hansen, M.; Neels, K.; Odoni, A.; Peterson, E.; Sherry, L.; Trani, A.; Zou, B.; et al. Total Delay Impact Study; The National Center of Excellence for Aviation Operations Research: Washington, DC, USA, 2010. [Google Scholar]
  11. Daley, B. Air Transport and the Environment; Routledge: London, UK, 2010. [Google Scholar]
  12. Federal Aviation Administration. Types of Delay. Available online: https://aspm.faa.gov/aspmhelp/index/Types_of_Delay.html (accessed on 1 April 2022).
  13. Brownlee, J. SMOTE for Imbalanced Classification with Python; Machine Learning Mastery: San Juan, PR, USA, 2020; Available online: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/ (accessed on 1 April 2022).
  14. Markovic, D.; Hauf, T.; Röhner, P.; Spehr, U. A statistical study of the weather impact on punctuality at Frankfurt Airport. Meteorol. Appl. 2008, 15, 233–293. [Google Scholar] [CrossRef]
  15. Xu, N.; Sherry, L.; Laskey, K.B. Multifactor model for predicting delays at U.S. airports. Transp. Res. Rec. 2008, 2052, 62–71. [Google Scholar] [CrossRef]
  16. Gil, R.; Myongjin, K. Does competition increase quality? Evidence from the US airline industry. Int. J. Ind. Organ. 2021, 77, 102742. [Google Scholar] [CrossRef]
  17. Wong, J.T.; Tsai, S.C. A survival model for flight delay propagation. J. Air Transp. Manag. 2012, 23, 5–11. [Google Scholar] [CrossRef]
  18. Jungai, T.; Hongjun, X. Optimizing Arrival Flight Delay Scheduling Based on Simulated Annealing Algorithm. Phys. Procedia 2012, 33, 348–353. [Google Scholar] [CrossRef]
  19. Lan, S.; Clarke, J.; Barnhart, C. Planning for Robust Airline Operations: Optimizing Aircraft Routings and Flight Departure Times to Minimize Passenger Disruptions. Transp. Sci. 2006, 40, 15–28. [Google Scholar] [CrossRef]
  20. Chen, X.; Yu, H.; Cao, K.; Zhou, J.; Wei, T.; Hu, S. Uncertainty-Aware Flight Scheduling for Airport Throughput and Flight Delay Optimization. IEEE Trans. Aerosp. Electron. Syst. 2000, 56, 853–862. [Google Scholar] [CrossRef]
  21. Güvercin, M.; Ferhatosmanoğlu, N.; Gedik, B. Forecasting Flight Delays Using Clustered Models Based on Airport Networks. IEEE Trans. Intell. Transp. Syst. 2020, 22, 5. [Google Scholar] [CrossRef]
  22. Helme, M.P. Reducing air traffic delay in a space-time network. In Proceedings of the 1992 IEEE International Conference on Systems, Man, and Cybernetics, Chicago, IL, USA, 18–21 October 1992; pp. 236–242. [Google Scholar]
  23. Xiao, Y.; Zhao, Y.; Wu, G.; Jing, Y. Study on Delay Propagation Relations Among Airports Based on Transfer Entropy. IEEE Access 2020, 8, 97103–97113. [Google Scholar] [CrossRef]
  24. Zanin, M.; Belkoura, S.; Yanbo, Z. Network analysis of Chinese air transport delay propagation. Chin. J. Aeronaut. 2017, 30, 491–499. [Google Scholar] [CrossRef]
  25. Rebollo, J.J.; Balakrishnan, H. Characterization and prediction of air traffic delays. Transp. Res. Part C Emerg. 2014, 44, 231–241. [Google Scholar] [CrossRef]
  26. Choi, S.; Kim, Y.J.; Briceno, S.; Mavris, D. Prediction of Weather-induced Airline Delays Based on Machine Learning Algorithms. In Proceedings of the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, USA, 25–29 September 2016; pp. 1–6. [Google Scholar]
  27. Kim, Y.J.; Choi, S.; Briceno, S.; Mavris, D. A deep learning approach to flight delay prediction. In Proceedings of the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, USA, 25–29 September 2016; pp. 1–6. [Google Scholar] [CrossRef]
  28. Belcastro, L.; Marozzo, F.; Talia, D.; Trunfio, P. Using scalable data mining for predicting flight delays. ACM Trans. Intell. Syst. Technol. 2016, 8, 1–20. [Google Scholar] [CrossRef]
  29. Manna, S.; Biswas, S.; Kundu, R.; Rakshit, S.; Gupta, P.; Barman, S. A statistical approach to predict flight delay using gradient boosted decision tree. In Proceedings of the 2017 International Conference on Computational Intelligence in Data Science, Chennai, India, 2–3 June 2017; pp. 1–5. [Google Scholar] [CrossRef]
  30. Takeichi, N.; Kaida, R.; Shimomura, A.; Yamauchi, T. Prediction of delay due to air traffic control by machine learning. In Proceedings of the AIAA Modeling and Simulation Technologies Conference, Grapevine, TX, USA, 9–13 January 2017; pp. 1–7. [Google Scholar] [CrossRef]
  31. Thiagarajan, B.; Srinivasan, L.; Sharma, A.V.; Sreekanthan, D.; Vijayaraghavan, V. A machine learning approach for prediction of on-time performance of flights. In Proceedings of the 2017 IEEE/AIAA 36th Digital Avionics Systems Conference (DASC), St. Petersburg, FL, USA, 17–21 September 2017; pp. 5–10. [Google Scholar] [CrossRef]
  32. Venkatesh, V.; Arya, A.; Agarwal, P.; Lakshmi, S.; Balana, S. Iterative machine and deep learning approach for aviation delay prediction. In Proceedings of the 2017 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON), Mathura, India, 26–28 October 2017; pp. 562–567. [Google Scholar] [CrossRef]
  33. Mogha, G.; Ahlawat, K.; Singh, A.P. Performance analysis of machine learning techniques on big data using apache spark. Commun. Comput. Info. Sci. 2018, 799, 17–26. [Google Scholar] [CrossRef]
  34. Al-Tabbakh, S.M.; El Mohamed, H. Machine Learning Techniques for Analysis of Egyptian Flight Delay. J. Sci. Res. Sci. 2018, 35, 390–399. [Google Scholar] [CrossRef]
  35. Nigam, R.; Govinda, K. Cloud based flight delay prediction using logistic regression. In Proceedings of the 2017 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, India, 7–8 December 2017; pp. 662–667. [Google Scholar] [CrossRef]
  36. Moreira, L.; Dantas, C.; Oliveira, L.; Soares, J.; Ogasawara, E. On Evaluating Data Preprocessing Methods for Machine Learning Models for Flight Delays. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018. [Google Scholar] [CrossRef]
  37. Yu, B.; Guo, Z.; Asian, S.; Wang, H.; Chen, G. Flight delay prediction for commercial air transport: A deep learning approach. Transp. Res. Part E Logist. Transp. Rev. 2019, 125, 203–221. [Google Scholar] [CrossRef]
  38. Chen, J.; Li, M. Chained predictions of flight delay using machine learning. In Proceedings of the AIAA Scitech 2019 Forum, San Diego, CA, USA, 7–11 January 2019; pp. 1661–1686. [Google Scholar] [CrossRef]
  39. Mangortey, E.; Pinon, O.J.; Puranik, T.G.; Mavris, D.N. Predicting the occurrence of weather and volume related ground delay programs. In Proceedings of the AIAA Aviation 2019 Forum, Dallas, TX, USA, 17–21 June 2019; pp. 1–29. [Google Scholar] [CrossRef]
  40. McCarthy, N.; Karzand, M.; Lecue, F.; Kim, Y.J.; Choi, S.; Briceno, S.; Mavris, D.; Yazdi, M.F.; Kamel, S.R.; Chabok, S.J.M.; et al. Amsterdam to Dublin eventually delayed? Lstm and transfer learning for predicting delays of low cost airlines. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 7, pp. 9541–9546. [Google Scholar] [CrossRef]
  41. Khaksar, H.; Sheikholeslami, A. Airline delay prediction by machine learning algorithms. Sci. Iran. 2019, 26, 2689–2702. [Google Scholar] [CrossRef]
  42. Meel, P.; Singhal, M.; Tanwar, M.; Saini, N. Predicting Flight Delays with Error Calculation using Machine Learned Classifiers. In Proceedings of the 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 27–28 February 2020; pp. 71–76. [Google Scholar] [CrossRef]
  43. Dou, X. Flight Arrival Delay Prediction and Analysis Using Ensemble Learning. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; pp. 836–840. [Google Scholar] [CrossRef]
  44. Patgiri, R.; Hussain, S.; Nongmeikapam, A. Empirical Study on Airline Delay Analysis and Prediction. EAI Endorsed Trans. 2020, 2020, 1–8. [Google Scholar]
  45. Gui, G.; Liu, F.; Sun, J.; Yang, J.; Zhou, Z.; Zhao, D. Flight delay prediction based on aviation big data and machine learning. IEEE Trans. Veh. Technol. 2020, 69, 140–150. [Google Scholar] [CrossRef]
  46. Yazdi, M.F.; Kamel, S.R.; Chabok, S.J.M.; Kheirabadi, M. Flight delay prediction based on deep learning and Levenberg-Marquart algorithm. J. Big Data 2020, 7, 106. [Google Scholar] [CrossRef]
  47. Esmaeilzadeh, E.; Mokhtarimousavi, S. Machine Learning Approach for Flight Departure Delay Prediction and Analysis. Transp. Res. Rec. 2020, 2674, 145–159. [Google Scholar] [CrossRef]
  48. Aljubairy, A.; Zhang, W.E.; Shemshadi, A.; Mahmood, A.; Sheng, Q.Z. A system for effectively predicting flight delays based on IoT data. Computing 2020, 102, 2025–2048. [Google Scholar] [CrossRef]
  49. Liu, F.; Sun, J.; Liu, M.; Yang, J.; Gui, G. Generalized Flight Delay Prediction Method Using Gradient Boosting Decision Tree. In Proceedings of the 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), Antwerp, Belgium, 25–28 May 2020. [Google Scholar] [CrossRef]
  50. Zheng, A. Evaluating Machine Learning Models A Beginner’s Guide to Key Concepts and Pitfalls; O’Reilly: North Sebastopol, CA, USA, 2015. [Google Scholar]
  51. Iswarya, B.; Sathyapriya, T. Detection of Diabetes and Cholesterol. J. Rec. Res. Eng. Technol. 2015, 2, 24–28. [Google Scholar]
  52. Dimitoglou, G.; Adams, J.A.; Jim, C.M. Comparison of the C4.5 and a Naive Bayes Classifier for the Prediction of Lung Cancer Survivability. arXiv 2012, arXiv:1206.1121. [Google Scholar]
  53. Graupe, D. Principles Of Artificial Neural Networks, 3rd ed.; World Scientific: Singapore, 2013. [Google Scholar]
  54. Mijwil, M.M. Artificial Neural Networks Advantages and Disadvantages; Bagdad College of Economic Science University: Baghdad, Iraq, 2018; pp. 1–2. Available online: https://www.researchgate.net/publication/323665827 (accessed on 1 April 2022).
  55. Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Soc. Series B 1958, 20, 215–232. [Google Scholar] [CrossRef]
  56. Laitinen, E.K.; Laitinen, T. Bankruptcy prediction: Application of the Taylor’s expansion in logistic regression. Int. Rev. Financ. Anal. 2000, 9, 327–349. [Google Scholar] [CrossRef]
  57. Tu, J.V. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J. Clin. Epidemiol. 1996, 49, 1225–1231. [Google Scholar] [CrossRef]
  58. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar]
  59. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  60. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3147–3155. [Google Scholar]
  61. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. Catboost: Unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 6638–6648. [Google Scholar]
  62. Fernandes, N.; Moro, S.; Costa, C.J.; Aparício, M. Factors influencing charter flight departure delay. Res. Transp. Bus. Manag. 2020, 34, 100413. [Google Scholar] [CrossRef]
  63. Lambelho, M.; Mitici, M.; Pickup, S.; Marsden, A. Assessing strategic flight schedules at an airport using machine learning-based flight delay and cancellation predictions. J. Air Transp. Manag. 2020, 82, 101737. [Google Scholar] [CrossRef]
  64. Mustapha, I.B.; Shamsuddin, S.M.; Hasan, S. A Preliminary Study on Learning Challenges in Machine Learning-based Flight Delay Prediction. Int. J. Innov. Comput. 2019, 9, 1–5. [Google Scholar] [CrossRef]
  65. Deshpande, V.; Arikan, M. The impact of airline flight schedules on flight delays. Manuf. Serv. Oper. Manag. 2012, 14, 423–440. [Google Scholar] [CrossRef]
  66. Tsionas, M.G.; Chen, Z.; Wanke, P. A structural vector autoregressive model of technical efficiency and delays with an application to Chinese airlines. Transp. Res. Part A Policy Pract. 2017, 101, 1–10. [Google Scholar] [CrossRef]
  67. Fernández, A.; García, S.; Galar, M.; Prati, R.C. Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  68. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Figure 1. Feature impacts on model output.
Figure 2. Feature impacts on model output with positive and negative effects.
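For readers who wish to produce plots of this kind, the following minimal Python sketch shows how SHAP summary plots similar to Figures 1 and 2 can be generated for a fitted gradient-boosting classifier. The data, model, and variable names are illustrative stand-ins and are not taken from the study's actual pipeline.

```python
# Minimal sketch (assumed workflow, illustrative names): SHAP summary plots in the
# style of Figures 1 and 2 for a fitted XGBoost classifier.
import shap
import xgboost
from sklearn.datasets import make_classification

# Stand-in for the airport features and the >15 min arrival-delay label.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.79], random_state=0)
model = xgboost.XGBClassifier(eval_metric="logloss").fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, plot_type="bar")  # Figure 1 style: mean |SHAP| per feature
shap.summary_plot(shap_values, X)                   # Figure 2 style: positive/negative impacts per feature
```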
Table 1. Flight delay prediction studies.
Source | Data Source | Methods | Summary
[25] | Trained with 2007–2008 data and tested with the 100 most-delayed flight routes | Random Forest (RF) | Predicted time/network-related delays 2–24 h in advance; 19% test error when classifying flights as above or below 60 min of delay
[26] | U.S. domestic traffic 2005–2015; NOAA hourly weather data | RF, DT, KNN, AdaBoost | Supervised machine learning algorithms to predict individual flight delays; highest accuracy of 83.07% with AdaBoost
[27] | 10 U.S. airports | RNN, Long Short-Term Memory (LSTM) | The highest accuracy was achieved at Atlanta Airport (90.95%), while the lowest was observed for Phoenix Airport (71.34%)
[28] | U.S. airports | DT | Used a MapReduce program to analyze airline flight and weather datasets; with a 15 min delay threshold, the accuracy was 74.2%
[29] | April–October 2013 U.S. Department of Transportation data | GB | Study focused on Gradient Boosted Decision Trees analyzing departure delay statistics with MAE, RMSE, and R2
[30] | Actual operation data from Tokyo Airport from 42 days in 2012 | ANN | Evaluated the performance of neural networks compared to traditional queueing analysis
[31] | U.S. | RF, GB, ANN, K-Nearest Neighbors (KNN), Extra Trees | The model first performed a binary classification to predict whether a flight would be delayed and, in a second stage, conducted a regression analysis to estimate the delay duration in minutes; the best departure delay classification was obtained with Gradient Boosting, with an accuracy of 86.48%
[32] | Kaggle.com | ANN, Deep Belief Network | Deep neural networks attained an accuracy rate of 77%, whereas the ANN achieved 89%
[33] | GitHub | RF, DT, Support Vector Machine (SVM), Naïve Bayes (NB) | The analysis revealed that DT achieved the highest accuracy, with a score of 0.70
[34] | EgyptAir flight delay data | REPTree, Forest, Stump, J48 | REPTree achieved the highest accuracy of 80.3%; among rule-based classifiers, PART performed best at 83.1%; considering both accuracy and running time, REPTree emerged as the most efficient classifier
[35] | 70 airports in the U.S. | Logistic Regression (LR) | The proposed approach achieved an accuracy rate of approximately 80%
[36] | Brazilian National Civil Aviation Agency | ANN, RF, SVM, NB, KNN | Incorporating balancing techniques into the models led to a noteworthy enhancement, resulting in an accuracy rate of approximately 60%
[37] | Beijing International Airport | Deep Belief Network | The model incorporated support vector regression (SVR) for refinement; with the DBN-SVR model, approximately 99.3% of the predicted values fell within 25 min of the observed values
[38] | U.S. | RF, Delay Propagation Model | An air traffic delay prediction model integrating RF with an approximate delay propagation model; it achieved a departure delay accuracy of 0.83 and an arrival delay accuracy of 0.87
[39] | Air traffic data systems | DT, NB, Classification Rule Learners, SVM, Bagging Ensemble, Boosting Ensemble, RF | Supervised machine learning models for weather-related and volume-related Ground Delay Programs; the Boosting Ensemble was the most effective, achieving the highest Kappa statistic of 0.66
[40] | Low-cost airlines in Europe | LSTM, LR, BM, TL | Highlighted the potential for small airlines to initiate model training through transfer learning and subsequently fine-tune the models with their own data
[41] | U.S. and Iranian data | RF, DT | The proposed approach reached 71.39% for predicting delay occurrence and 70.16% for delay magnitude in the U.S. network, and 76.44% and 75.93%, respectively, in the Iranian network
[42] | U.S. | LR, DT Regression, Bayesian Ridge, RF, Gradient Boosting Regression | The RF Regressor was the best-performing model for predicting both departure and arrival delays, with the lowest mean squared error and mean absolute error
[43] | U.S. | CatBoost | The model achieved an accuracy of 80.44%, indicating its effectiveness in predicting flight arrival delays
[44] | U.S. | LR with L2 Regularization, Gaussian NB, KNN, DT, RF | The RF model achieved an accuracy of 82% when predicting flight delays with a 15 min threshold
[45] | Automatic Dependent Surveillance–Broadcast | LSTM | The proposed model achieved an accuracy of 90.2% in binary classification
[46] | U.S. | Levenberg–Marquardt algorithm | For imbalanced datasets, the accuracy of the SDA-LM model was 8.2–11.3% greater than that of the SAE-LM and SDA models; for balanced datasets, it was 10.4% greater than SAE-LM and 7.3% greater than SDA
[47] | Airports in New York City | SVM | Variable impact analysis showed that pushback delay, taxi-out delay, and demand–capacity imbalance are significantly associated with flight departure delay
[48] | IoT data sources | Multiple LR, SVM, Multiple Linear Regression | The results revealed an association between flight delays and air quality index factors; the prediction model achieved an accuracy rate of 85.74%
[49] | Automatic Dependent Surveillance–Broadcast | GB | The proposed GBDT-based model produced an accuracy rate of 87.72% for binary classification
Table 2. Flight characteristics, types, and studies in the literature.
Features | Data Type | References
Day of Week | Categorical | [25,27,32,41,46,62]
Day of Month | Categorical | [43,63,64]
Month | Categorical | [27,35,62]
Origin (Departure Airport) | Categorical | [41,46,62]
Scheduled Departure Time | Continuous | [35,41,62]
Distance | Continuous | [32]
Number of Passengers | Continuous | [37,65,66]
Total Cargo | Continuous | [66]
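As an illustration of how the categorical and continuous types in Table 2 might be handled before model training, the short sketch below one-hot encodes the categorical columns with pandas; the column names and values are hypothetical, and tree-based libraries such as CatBoost or LightGBM could instead use their native categorical handling.

```python
# Illustrative sketch (assumed column names): encode categorical flight features,
# leave continuous features unchanged.
import pandas as pd

flights = pd.DataFrame({
    "day_of_week": ["Mon", "Tue"],
    "month": [7, 8],
    "scheduled_departure_time": [10.5, 14.25],   # decimal hours, continuous
    "number_of_passengers": [168, 142],          # continuous
})

categorical = ["day_of_week", "month"]
encoded = pd.get_dummies(flights, columns=categorical)  # one-hot encode only the listed columns
print(encoded.head())
```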
Table 3. Weather condition features and data types.
Features | Data Type
Temperature | Continuous
Atmospheric Pressure | Continuous
Relative Humidity | Continuous
Mean Wind Direction at a Height of 10–12 m | Categorical
Mean Wind Speed at a Height of 10–12 m | Continuous
Total Cloud Cover | Categorical
Horizontal Visibility | Continuous
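One plausible way to attach hourly weather observations such as those in Table 3 to individual flight records is a nearest-timestamp join, sketched below with pandas. The column names, timestamps, and one-hour matching tolerance are assumptions for illustration only, not details taken from the study.

```python
# Illustrative sketch (assumed schema): join each flight to the nearest weather
# observation in time, discarding matches more than one hour apart.
import pandas as pd

flights = pd.DataFrame({
    "scheduled_arrival": pd.to_datetime(["2021-07-01 10:20", "2021-07-01 13:45"]),
    "flight_id": ["TK1", "PC2"],
})
weather = pd.DataFrame({
    "observation_time": pd.to_datetime(["2021-07-01 10:00", "2021-07-01 14:00"]),
    "temperature": [31.0, 33.5],
    "wind_speed": [4.2, 6.0],
})

merged = pd.merge_asof(
    flights.sort_values("scheduled_arrival"),
    weather.sort_values("observation_time"),
    left_on="scheduled_arrival", right_on="observation_time",
    direction="nearest", tolerance=pd.Timedelta("1h"),
)
print(merged)
```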
Table 4. Experimental design for hyperparameter space.
Method | Parameters
Logistic Regression | penalty: (l2, l1); C: (0.001, 0.01, 0.1, 1, 10, 100, 1000)
Naïve Bayes | var_smoothing: (0, 0.9, num = 100)
XGBoost | learning_rate: (0.01, 1.0); n_estimators: (100, 1000); max_depth: (3, 15); subsample: (0, 1.0); gamma: (0, 5); min_child_weight: (0, 20)
CatBoost | max_depth: (5, 15); bagging_temperature: (3, 10); l2_leaf_reg: (2, 10)
Random Forest | n_estimators: (100, 1000); max_depth: (4, 40); min_samples_split: (2, 100); min_samples_leaf: (1, 20)
LightGBM | num_leaves: (20, 50); max_depth: (5, 30); lambda_l2: (0.0, 0.05); lambda_l1: (0.0, 0.05); min_child_samples: (5, 100); min_data_in_leaf: (5, 100); feature_fraction: (0.1, 0.9); bagging_fraction: (0.8, 1)
ANN | hidden_layer_sizes: ((5,5), (5,10), (5,15), (5,20), (5,25), (10,5), (10,10), (10,15), (10,20), (10,25), (15,5), (15,10), (15,15), (15,20), (15,25), (20,5), (20,10), (20,15), (20,20), (20,25), (25,10), (25,15), (25,20), (25,25))
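The ranges in Table 4 define a search space rather than a single configuration. The sketch below shows one way such a space could be explored for XGBoost with scikit-learn's randomized search; the search strategy, cross-validation setup, scoring metric, and stand-in data are assumptions, with only the parameter ranges taken from the table.

```python
# Minimal sketch: randomized search over the XGBoost ranges listed in Table 4.
# Distributions, CV, scoring, and the stand-in data are assumptions.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Stand-in for the airport feature matrix and the >15 min arrival-delay label.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.79], random_state=0)

param_distributions = {
    "learning_rate": uniform(0.01, 0.99),   # (0.01, 1.0)
    "n_estimators": randint(100, 1001),     # (100, 1000)
    "max_depth": randint(3, 16),            # (3, 15)
    "subsample": uniform(0.1, 0.9),         # table range (0, 1.0); floor avoids degenerate draws
    "gamma": uniform(0, 5),                 # (0, 5)
    "min_child_weight": randint(0, 21),     # (0, 20)
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions,
    n_iter=50, scoring="f1", cv=5, n_jobs=-1, random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```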
Table 5. Airport dataset results.
Method | Phase | Accuracy | Recall | Precision | F-1 Score | Balanced Error Rate
Logistic Regression | Train | 0.79 | 0.07 | 0.58 | 0.13 | 0.47
Logistic Regression | Test | 0.79 | 0.08 | 0.62 | 0.14 | 0.47
Naïve Bayes | Train | 0.78 | 0.06 | 0.42 | 0.10 | 0.48
Naïve Bayes | Test | 0.77 | 0.05 | 0.37 | 0.08 | 0.49
ANN | Train | 0.80 | 0.19 | 0.63 | 0.29 | 0.42
ANN | Test | 0.78 | 0.15 | 0.48 | 0.23 | 0.45
XGBoost | Train | 0.83 | 0.26 | 0.78 | 0.39 | 0.38
XGBoost | Test | 0.80 | 0.20 | 0.65 | 0.31 | 0.41
Random Forest | Train | 0.82 | 0.16 | 0.91 | 0.28 | 0.42
Random Forest | Test | 0.79 | 0.09 | 0.70 | 0.16 | 0.46
CatBoost | Train | 0.84 | 0.25 | 0.91 | 0.39 | 0.37
CatBoost | Test | 0.80 | 0.15 | 0.71 | 0.25 | 0.43
LightGBM | Train | 0.82 | 0.17 | 0.83 | 0.28 | 0.42
LightGBM | Test | 0.80 | 0.13 | 0.72 | 0.22 | 0.44
Best values are shown in bold.
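For clarity, the sketch below shows how the metrics reported in Tables 5–8 can be computed with scikit-learn, taking the balanced error rate as one minus the balanced accuracy (a common definition, assumed here); the variable names are placeholders.

```python
# Sketch of the evaluation metrics in Tables 5-8. Balanced error rate is assumed
# to be 1 - balanced accuracy; the delayed (>15 min) class is the positive class.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

def report(y_true, y_pred):
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F-1 Score": f1_score(y_true, y_pred),
        "Balanced Error Rate": 1 - balanced_accuracy_score(y_true, y_pred),
    }

# Example usage (placeholder names): report(y_test, fitted_model.predict(X_test))
```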
Table 6. Second trial, airport data with SMOTE.
Method | Phase | Accuracy | Recall | Precision | F-1 Score | Balanced Error Rate
Logistic Regression | Train | 0.64 | 0.63 | 0.64 | 0.64 | 0.36
Logistic Regression | Test | 0.64 | 0.61 | 0.33 | 0.43 | 0.37
Naïve Bayes | Train | 0.54 | 0.13 | 0.76 | 0.22 | 0.46
Naïve Bayes | Test | 0.77 | 0.11 | 0.42 | 0.18 | 0.47
ANN | Train | 0.86 | 0.82 | 0.90 | 0.86 | 0.14
ANN | Test | 0.74 | 0.28 | 0.38 | 0.33 | 0.42
XGBoost | Train | 0.96 | 0.92 | 0.99 | 0.95 | 0.04
XGBoost | Test | 0.80 | 0.26 | 0.58 | 0.36 | 0.39
Random Forest | Train | 0.90 | 0.80 | 0.99 | 0.89 | 0.10
Random Forest | Test | 0.79 | 0.12 | 0.62 | 0.20 | 0.45
CatBoost | Train | 0.90 | 0.81 | 0.99 | 0.89 | 0.10
CatBoost | Test | 0.80 | 0.15 | 0.66 | 0.25 | 0.43
LightGBM | Train | 0.88 | 0.77 | 0.98 | 0.86 | 0.12
LightGBM | Test | 0.80 | 0.12 | 0.69 | 0.22 | 0.44
Best values are shown in bold.
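The sketch below illustrates a typical SMOTE workflow consistent with Table 6, in which oversampling is applied only to the training split so that the test split retains its natural class imbalance; the data generation, split ratio, and model settings are illustrative assumptions rather than the study's exact configuration.

```python
# Minimal sketch (assumed pipeline): SMOTE is fitted on the training split only,
# and evaluation uses the untouched, imbalanced test split.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Stand-in for the airport features and the >15 min arrival-delay label.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.79], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)  # oversample the minority class

model = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)
print("Test F-1:", f1_score(y_test, model.predict(X_test)))
```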
Table 7. Results from the combined use of airport and weather datasets.
Method | Phase | Accuracy | Recall | Precision | F-1 Score | Balanced Error Rate
Logistic Regression | Train | 0.80 | 0.10 | 0.63 | 0.18 | 0.46
Logistic Regression | Test | 0.78 | 0.08 | 0.55 | 0.15 | 0.47
Naïve Bayes | Train | 0.29 | 0.96 | 0.22 | 0.36 | 0.47
Naïve Bayes | Test | 0.28 | 0.92 | 0.22 | 0.36 | 0.49
ANN | Train | 0.81 | 0.20 | 0.65 | 0.31 | 0.41
ANN | Test | 0.77 | 0.13 | 0.46 | 0.21 | 0.45
XGBoost | Train | 0.89 | 0.53 | 0.93 | 0.68 | 0.24
XGBoost | Test | 0.79 | 0.25 | 0.58 | 0.35 | 0.40
Random Forest | Train | 0.79 | 0.00 | 0.00 | 0.00 | 0.50
Random Forest | Test | 0.78 | 0.00 | 0.00 | 0.00 | 0.50
CatBoost | Train | 0.88 | 0.42 | 0.98 | 0.59 | 0.29
CatBoost | Test | 0.80 | 0.15 | 0.69 | 0.24 | 0.43
LightGBM | Train | 0.89 | 0.53 | 0.93 | 0.89 | 0.24
LightGBM | Test | 0.79 | 0.24 | 0.58 | 0.34 | 0.40
Best values are shown in bold.
Table 8. Results of weather and airport data using the SMOTE technique.
Method | Phase | Accuracy | Recall | Precision | F-1 Score | Balanced Error Rate
Logistic Regression | Train | 0.66 | 0.66 | 0.65 | 0.66 | 0.34
Logistic Regression | Test | 0.65 | 0.60 | 0.33 | 0.43 | 0.36
Naïve Bayes | Train | 0.52 | 0.99 | 0.51 | 0.67 | 0.48
Naïve Bayes | Test | 0.25 | 0.95 | 0.22 | 0.36 | 0.50
ANN | Train | 0.91 | 0.90 | 0.92 | 0.91 | 0.09
ANN | Test | 0.70 | 0.32 | 0.33 | 0.32 | 0.43
XGBoost | Train | 0.99 | 0.97 | 0.99 | 0.98 | 0.02
XGBoost | Test | 0.79 | 0.27 | 0.56 | 0.35 | 0.40
Random Forest | Train | 1.00 | 1.00 | 1.00 | 1.00 | 0.00
Random Forest | Test | 0.78 | 0.12 | 0.50 | 0.19 | 0.46
CatBoost | Train | 1.00 | 0.99 | 1.00 | 1.00 | 0.00
CatBoost | Test | 0.79 | 0.16 | 0.57 | 0.26 | 0.43
LightGBM | Train | 0.89 | 0.78 | 0.99 | 0.87 | 0.11
LightGBM | Test | 0.80 | 0.15 | 0.70 | 0.24 | 0.44
Best values are shown in bold.
Table 9. Results of XGBoost with selected features.
Phase | Accuracy | Recall | Precision | F-1 Score | Balanced Error Rate
Train | 0.99 | 0.94 | 1.00 | 0.97 | 0.03
Test | 0.78 | 0.28 | 0.51 | 0.35 | 0.41