Deep-learning-based flight delay propagation prediction consists of three main parts: data preprocessing and flight chain data set construction, feature extraction, and classification prediction. Feature extraction is covered in Sections 3 and 4 of this paper. This section describes data preprocessing, flight chain data set construction, and classification prediction.
2.1. Data Preprocessing
The flight data used in this project cover flights in China from March 2018 to May 2019 and were provided by the East China Regional Administration (ECRA) of the Civil Aviation Administration of China. Each sample has 38 attributes in total, including flight number, aircraft number, actual departure/arrival airport, flight path, planned departure/arrival time, actual departure/arrival time, planned departure/arrival airport, planned aircraft type, cruise altitude, cruise speed, military batch number, and coverage type. These attributes are closely related to whether a flight is delayed and contain both important spatial features and abundant temporal information. Because the flight data provided by the ECRA contain some abnormal values and null values, the widely used data analysis library Pandas is used to clean the original flight data set. The characteristic attributes required by the model are defined as follows.
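The cleaning step can be illustrated with a minimal Pandas sketch. The column names, the sample records, and the rule used to flag abnormal values (arrival not later than departure) are assumptions made for illustration; the text does not give the ECRA schema or the exact cleaning rules.

```python
import pandas as pd

# Hypothetical column names; the real ECRA schema is not given in the text.
TIME_COLS = ["planned_dep", "actual_dep", "planned_arr", "actual_arr"]
KEY_COLS = ["tail_number", "dep_airport", "arr_airport"] + TIME_COLS


def clean_flights(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop records with null key attributes or impossible times."""
    df = raw.copy()
    # Unparseable timestamps become NaT so they are dropped with the nulls.
    for col in TIME_COLS:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    df = df.dropna(subset=KEY_COLS)
    # Assumed abnormal-value rule: a flight cannot land before it departs.
    df = df[df["actual_arr"] > df["actual_dep"]]
    return df.drop_duplicates().reset_index(drop=True)


# Illustrative records only, not taken from the ECRA data.
raw = pd.DataFrame({
    "tail_number": ["B-1234", "B-1234", None, "B-5678"],
    "dep_airport": ["PEK", "TSN", "PEK", "SHA"],
    "arr_airport": ["TSN", "SHA", "SHA", "PEK"],
    "planned_dep": ["2018-03-01 08:00", "2018-03-01 10:30",
                    "2018-03-01 09:00", "2018-03-01 12:00"],
    "actual_dep": ["2018-03-01 08:10", "2018-03-01 10:45",
                   "2018-03-01 09:05", "2018-03-01 12:20"],
    "planned_arr": ["2018-03-01 09:00", "2018-03-01 12:30",
                    "2018-03-01 11:00", "2018-03-01 14:10"],
    "actual_arr": ["2018-03-01 09:15", "2018-03-01 12:50",
                   "2018-03-01 11:10", "2018-03-01 11:00"],
})
cleaned = clean_flights(raw)
print(cleaned[["tail_number", "dep_airport", "arr_airport"]])
```

In this toy input, the record with a missing aircraft number and the record whose arrival precedes its departure are removed, leaving two flights.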
Definition 1. Flight data $F_f$ comprise 38 characteristic attributes, such as flight number, aircraft number, actual departure/arrival airport, flight path, planned departure/arrival time, and actual departure/arrival time.
Definition 2. Flight chain data $F_c$: within a certain time range, the same aircraft performs successive flight tasks from a class 1 airport to a class 2 airport and then to a class 3 airport, and these tasks are related in time sequence. Such a sequence is a flight chain, and multiple flight chains constitute the flight chain data set.
2.2. Construction of the Flight Chain Data Set
Flight delays are distributed in both time and space. When the same aircraft performs several flight missions in succession, a delay of an earlier flight commonly delays the subsequent flights. As the delay is passed step by step along the flight plan, it can lead to flight delays over a large area. The airport from which an aircraft first takes off within a certain time range is defined as the class 1 airport. The airport at which the aircraft arrives from the class 1 airport on flight task 1 is called the class 2 airport, also known as the class 1 arrival airport or the class 2 departure airport. By analogy, the same aircraft Z continuously performs flight tasks between multiple airports, which are connected in chronological order to form a flight chain relationship, as shown in Figure 1. Taking “Beijing-Tianjin-Shanghai” as an example, Beijing is defined as the class 1 airport. The aircraft performs flight task 1 from Beijing to Tianjin, so Tianjin is the class 2 airport in the flight chain, also known as the class 1 arrival airport or the class 2 departure airport. The aircraft then performs flight task 2 from Tianjin to Shanghai, so Shanghai is the class 3 airport in the flight chain, also known as the class 2 arrival airport or the class 3 departure airport.
Based on the above characteristic attributes, the flight chain data set is constructed. Firstly, a hub airport is selected as the class 1 airport. The airports served by flights from this class 1 airport are ranked by number of flights from high to low, and the top 20 are selected as class 2 airports. All airports with flights from any class 2 airport are then taken as class 3 airports. This determines an air transport network centered on the class 1 airport and radiating outward. Secondly, taking the time and the aircraft tail number as key values, each flight chain is extracted from the aviation network to form the flight chain data set. Thirdly, the discrete and continuous data in the original data are encoded by different methods to avoid misleading the training of the network. Lastly, the processed data are converted into a suitable characteristic matrix that is fed into the network. To describe the flight chain data set more clearly, the $i$-th flight chain in Definition 2 is represented by $f_i = (f_{i1}, f_{i2}, f_{i3})$, where $f_{i1}$, $f_{i2}$, and $f_{i3}$ are the three single flights of chain $f_i$, performed one after another in time. The $F_c$ data set can then be written as $F_c = \{(f_{11}, f_{12}, f_{13}), (f_{21}, f_{22}, f_{23}), \ldots, (f_{n1}, f_{n2}, f_{n3})\}$. The flight chain data set description is shown in Figure 2.
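The first construction step (choosing class 2 and class 3 airports around a hub) can be sketched as follows. The function name `build_network`, the airport codes, and the flight list are hypothetical, and `top_k` is set to 2 only to keep the example small; the paper uses the top 20.

```python
from collections import Counter


def build_network(flights, hub, top_k=20):
    """Return (class 2 airports, class 3 airports) radiating from `hub`.

    `flights` is a list of (departure_airport, arrival_airport) pairs.
    """
    # Rank the hub's destinations by number of flights, highest first.
    counts = Counter(arr for dep, arr in flights if dep == hub)
    class2 = [airport for airport, _ in counts.most_common(top_k)]
    # Class 3 airports: every destination served from a class 2 airport.
    class2_set = set(class2)
    class3 = sorted({arr for dep, arr in flights if dep in class2_set})
    return class2, class3


# Illustrative flight legs only.
flights = [
    ("PEK", "TSN"), ("PEK", "TSN"), ("PEK", "TSN"),
    ("PEK", "SHA"), ("PEK", "SHA"), ("PEK", "CAN"),
    ("TSN", "SHA"), ("SHA", "CAN"), ("CAN", "PEK"),
]
class2, class3 = build_network(flights, hub="PEK", top_k=2)
print(class2)  # ['TSN', 'SHA']
print(class3)  # ['CAN', 'SHA']
```

Whether the hub itself should be excluded from the class 3 set is not stated in the text; this sketch does not exclude it.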
The flight data of three consecutive flights of the same aircraft within a certain time range constitute one flight chain. The original single-flight data contain 1,048,576 records. After data cleaning and construction of the flight chain data set, 36,287 flight chain records remain for the flight delay propagation prediction experiment. The data set construction steps are as follows:
According to Definition 2, an aircraft performs continuous flight missions. In this paper, a three-class flight chain data set is formed by following the movements of the same aircraft within 24 h. Firstly, four attributes (aircraft number, flight execution date, class 1 arrival airport, and class 2 departure airport) are selected as the key values for data fusion. The first data fusion is performed on the cleaned flight data set, and abnormal flight chains in which the departure time from the class 2 airport is earlier than that from the class 1 airport are removed. At this point, each aircraft in the flight chain data set has performed two flight missions and visited three airports.
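One way to realize this first fusion is a Pandas self-merge on the four key attributes, followed by the time-order filter. This is a sketch under assumptions: the column names and sample flights are hypothetical, and the paper does not state which Pandas operation was actually used.

```python
import pandas as pd

# Hypothetical column names standing in for the cleaned ECRA attributes.
flights = pd.DataFrame({
    "tail_number": ["B-1234", "B-1234", "B-5678", "B-5678"],
    "flight_date": ["2018-03-01"] * 4,
    "dep_airport": ["PEK", "TSN", "PEK", "TSN"],
    "arr_airport": ["TSN", "SHA", "TSN", "CAN"],
    "dep_time": pd.to_datetime([
        "2018-03-01 08:10", "2018-03-01 10:45",
        "2018-03-01 15:00", "2018-03-01 09:00",
    ]),
})

# Self-merge: the arrival airport of flight 1 must equal the departure
# airport of flight 2, for the same aircraft on the same date.
chains = flights.merge(
    flights,
    left_on=["tail_number", "flight_date", "arr_airport"],
    right_on=["tail_number", "flight_date", "dep_airport"],
    suffixes=("_1", "_2"),
)

# Remove abnormal chains in which flight 2 departs before flight 1.
chains = chains[chains["dep_time_2"] > chains["dep_time_1"]]
chains = chains.reset_index(drop=True)
print(chains[["tail_number", "dep_airport_1", "arr_airport_1", "arr_airport_2"]])
```

Here aircraft B-1234 yields the chain PEK, TSN, SHA, while the B-5678 pair is discarded because its second leg departs before its first. Later fusions would repeat the same merge, matching the arrival airport of the last leg in each chain against the departure airport of the next flight.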
Delay propagation passes from one class to the next, so the flight chain data set is fused a second time. The aircraft number and flight execution date remain unchanged as key values, and the class 2 arrival airport and class 3 departure airport are selected for the second data fusion. Abnormal flight chains in which the departure time from the class 3 airport is earlier than the arrival time at the class 2 airport are removed. The flight chain data set of two consecutive flight tasks is obtained.
By analogy, the data are fused three times in this paper to form the final flight chain data set. The aircraft in each flight chain performs three consecutive flight missions, which in the spatial dimension involve transit through four airports: the class 1, class 2, class 3, and class 4 airports, all of which are affected by flight delays. The delay label of the flight chain is the delay level of the class 3 flight mission. Most aircraft fly one mission and do not fly another that day, so the more flights a chain requires on the same day, the fewer chains are available in the flight chain data set. Therefore, we focus on the delay propagation of flight chains consisting of three consecutive flight missions. Finally, the characteristic attributes in the flight chain data set are divided into numerical and discrete types. The numerical features are scaled by Min-Max normalization, and the discrete features are encoded by CatBoost.
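The two encodings can be illustrated in plain Python. Min-Max normalization maps each value $x$ to $(x - x_{\min}) / (x_{\max} - x_{\min})$. For the discrete features, the text only says they are "coded by CatBoost"; the sketch below assumes this means CatBoost-style ordered target statistics, in which each row is encoded from the labels of earlier rows with the same category, smoothed by a prior. The function names, the binary delay labels, the `prior` value, and the sample data are all hypothetical.

```python
def min_max(values):
    """Scale numbers linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ordered_target_encode(categories, targets, prior, weight=1.0):
    """Encode each category using the targets of earlier rows only.

    Assumed form: (sum of earlier targets + weight * prior) /
    (number of earlier rows + weight), per category.
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + weight * prior) / (n + weight))
        # Update the running statistics only after encoding this row,
        # so a row never sees its own label.
        sums[cat], counts[cat] = s + y, n + 1
    return encoded


# Illustrative values only.
speeds = [700.0, 800.0, 900.0]
airports = ["PEK", "TSN", "PEK", "PEK"]
delayed = [1, 0, 0, 1]  # assumed binary labels; the paper uses delay levels

print(min_max(speeds))  # [0.0, 0.5, 1.0]
print(ordered_target_encode(airports, delayed, prior=0.5))  # [0.5, 0.5, 0.75, 0.5]
```

Using only earlier rows for each statistic avoids leaking a sample's own label into its feature, which is the motivation for the ordered scheme.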