1. Introduction
China’s civil aviation passenger volume has always been one of the important indicators of national economic development and people’s living standards. With the issuance of the “14th Five-Year Plan for Civil Aviation Development”, China will embark on a new journey of building a strong civil aviation country in many fields. At the same time, with the rapid development of China’s economy and the improvement of people’s living standards, civil aviation passenger traffic also shows a trend of continuous growth. However, since the outbreak of the global epidemic at the end of 2019, people have been worried about the various impacts that the epidemic may bring, so they have reduced their travel, which has caused an unprecedented impact on the aviation industry, and the passenger volume of civil aviation has been greatly reduced.
Therefore, it is particularly important to accurately predict civil aviation passenger traffic. Accurate forecasts not only promote the sustainable development of China’s aviation industry but also help airlines adjust flights and aircraft types in advance, thereby optimizing the passenger load factor and improving operational efficiency. It is equally important to understand the impact of the epidemic on passenger traffic, which enables airlines to adjust their long-term and short-term strategies according to the actual situation, for example by optimizing the route network and re-evaluating the aircraft procurement plan.
The primary task in predicting civil aviation passenger volume is to select the model. Common prediction methods include time series analysis models, such as the moving average method, the weighted moving average method, the simple exponential-smoothing method, and the ARIMA model. ARIMA is the classic choice: it is among the most widely used methods for univariate time series prediction in practice, and it needs only the endogenous variable, with no exogenous variables. Neural network models are also often used to predict passenger volume. The BP neural network is a commonly used artificial neural network with strong nonlinear modeling capabilities. It is trained with the backpropagation algorithm, which improves its prediction accuracy, and it can handle a large amount of data. Therefore, this paper chooses the ARIMA model, in its seasonal form (SARIMA), and the BP neural network model, analyzes the characteristics of the two models in depth, and combines them to achieve more accurate prediction results.
The main idea of this paper is to first apply the SARIMA model and the BP neural network model to predict and analyze the civil aviation passenger traffic from 2006 to 2019. Subsequently, the prediction residual of the SARIMA model is used as the input data of the BP neural network, and the BP neural network is used to represent the nonlinear characteristics of the civil aviation passenger volume to obtain more accurate prediction results. At the same time, this study also uses this method to predict the passenger volume of civil aviation during the epidemic period from 2020 to 2023, compares the predicted results with the actual data to analyze the specific impact of the epidemic on the passenger volume of civil aviation, and puts forward the corresponding suggestions accordingly. The main contributions of this paper are as follows:
- (1) The SARIMA-BP combined model is used to predict civil aviation passenger volume. It improves the accuracy of the prediction, which allows airlines to adjust their flights and aircraft types in advance and improve operational efficiency.
- (2) By predicting the passenger volume during the epidemic and comparing it with the actual passenger transport data, the impact of the epidemic on the passenger volume of civil aviation is demonstrated. These research results can provide some reference information for airlines to help them develop effective strategies and measures to cope with possible future challenges.
The rest of this paper is organized as follows:
Section 2 presents the relevant literature.
Section 3 describes the methods used to forecast civil aviation passenger traffic and data processing.
Section 4 details how to use the SARIMA model, the BP neural network model, and the combined model. This section aims to explain how to use these models for forecasting and compare and analyze the results obtained.
Section 5 presents the forecasting using this optimal methodology and analyses the impact of the epidemic on civil aviation passenger traffic by comparing it with the actual data.
Section 6 presents the conclusions drawn.
Section 7 contains some discussions summarizing the highlights and shortcomings of this paper.
2. Literature Review
Prediction methods are mainly divided into three categories: traditional time series analysis prediction, non-traditional time series analysis prediction, and prediction technology based on machine learning. In traditional time series analysis and prediction, the main methods include various regression models, the moving average method, the autoregressive integrated moving average method (ARIMA), the Holt–Winters method (also known as Winters’ method), and various exponential-smoothing methods. Forecasting based on non-traditional time series analysis offers relatively new methods that integrate several disciplines, namely statistics, system dynamics, and grey system theory.
Although traditional time series analysis provides an effective prediction solution to a certain extent, these methods may struggle with modern, highly complex big data: the amount of data in the flight-booking process grows exponentially, the trend is nonlinear, and the data are highly irregular and volatile. Machine learning, with its powerful nonlinear modeling ability, therefore provides a new and effective solution to these complex and volatile flight demand forecasting problems. The neural network model based on the error backpropagation algorithm is one such prediction model, with a strong nonlinear mapping ability.
The prediction of civil aviation passenger volume has been widely studied. Scholars usually divide it into two methods: single-model prediction and combined-model prediction. There are many prediction methods for a single model. Yu et al. [1] used the GM (1,1) model to simulate the prediction of civil aviation passenger traffic and corrected it using the GM (1,1) residual model, proving the high accuracy of the prediction formula. Zhang et al. [2] used a BP neural network prediction model to forecast the passenger traffic of civil aviation in Beijing from four aspects: economy, tourism, competition, and airport operational capacity. The ELM prediction model was used to predict civil aviation passenger traffic by Chen et al. [3]. Wu et al. [4] used the LSTM prediction model to predict civil aviation passenger traffic. Their results show that the performance of the model is better than the existing fusion model and stable. Meng et al. [5] used a fuzzy diagonal regression neural network to forecast civil aviation passenger traffic. Ma et al. [6] used a multiple linear regression model to analyze the influencing factors of civil aviation passenger traffic in the Gansu province. Anupam et al. [7] used the NARX dynamic neural network to forecast civil aviation passenger traffic. Li used the SARIMA model and LSTM neural network for prediction, respectively, and the LSTM model was better in predicting the passenger traffic of civil aviation [8]. Kanavos et al. [9] developed an air travel demand estimation and forecasting model using the classical autoregressive integrated moving average (ARIMA), the seasonal approach (SARIMA), and a deep learning neural network (DLNN). In addition, many scholars [10,11,12,13,14] have also used the ARIMA model to forecast the passenger traffic of civil aviation.
Although individual-model prediction methods are straightforward to implement, they often have inherent shortcomings that lead to an insufficient prediction accuracy. Therefore, some scholars choose to use the combined model prediction method to improve the accuracy of their predictions. Chen et al. [15] utilized a combined SARIMA-LR model to forecast civil aviation passenger traffic and analyze the impact of the civil aviation industry during the epidemic. Gan et al. [16] employed a bi-directional LSTM model for prediction, resulting in a high prediction accuracy. Al-Sultan [17] considered a wide range of time series prediction models. An empirical analysis shows that the BSTS model is superior to other time series models in predicting complex time series. Hu [18] used the nonadditive Choquet fuzzy integral to combine the prediction of four commonly used univariate grey prediction models into combined prediction ones. Yao et al. [19] used a combined ARIMA-BP model to predict civil aviation passenger volume, but the modeling process was cumbersome. Yu et al. [20] used the ARIMA-BP combined model to forecast short-term traffic flows, which effectively reduced the error.
The COVID-19 pandemic has had a profound impact on the global development of civil aviation. Su et al. [21] examined the spatial distribution of outbreaks and civil aviation passenger throughput in China utilizing COVID-19 statistical data, alongside socioeconomic development data from various Chinese cities, and integrating the Moran index with econometric models. Deveci et al. [22] investigated the economic ramifications of COVID-19 on the civil aviation sector. Wojcik et al. [23] built a behavioral model of flu search based on survey data linked to users’ online browsing data. The research results of the above-selected parts of the literature are summarized in Table 1.
3. Research Methodology and Data
3.1. Data Source and Processing
This paper uses the monthly national civil aviation passenger traffic data published by the National Bureau of Statistics for January 2006 to December 2019. The collated data are shown in Figure 1.
According to Figure 1, the distribution of data points is relatively continuous, and there are no obvious outliers or anomalies, so there is no need for data cleaning. In addition, each month’s data are complete, and there are no missing values, so there is no need to fill in missing data.
3.2. SARIMA Model
SARIMA is a time series forecasting model for data with seasonal patterns. It extends the ARIMA model to support time series with a seasonal component. Three seasonal hyper-parameters $(P, D, Q)$ are added to $\mathrm{ARIMA}(p, d, q)$, as well as an additional seasonal cycle parameter $s$. $\mathrm{SARIMA}(p, d, q)(P, D, Q)_s$ has a total of seven parameters, which can be classified into two categories, three non-seasonal parameters $(p, d, q)$ and four seasonal parameters $(P, D, Q, s)$, where $P$ is the seasonal autoregressive order, $p$ is the non-seasonal autoregressive order, $q$ and $Q$ are the maximum lag orders of the non-seasonal and seasonal moving average operators, $d$ is the number of non-seasonal differences, and $D$ is the number of seasonal differences.
We performed $D$ seasonal differences (de-periodization) and $d$ ordinary differences (de-trending) on the time series $\{y_t\}$ to obtain the new series $\{w_t\}$, then modeled the differenced $w_t$ as follows:
$$w_t = (1 - B)^d (1 - B^s)^D\, y_t, \qquad \varphi(B)\,\Phi(B^s)\, w_t = \theta(B)\,\Theta(B^s)\,\varepsilon_t$$
where $B$ is the backshift operator ($B y_t = y_{t-1}$), $\varphi(B)$ and $\theta(B)$ are the autoregressive and moving average polynomials, $\Phi(B^s)$ and $\Theta(B^s)$ are the seasonal autoregressive and seasonal moving average polynomials, $y_t$ is the observed value, and $\varepsilon_t$ is white noise.
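As a worked illustration of how the seasonal and non-seasonal moving average polynomials multiply, consider a deliberately simple order, $\mathrm{SARIMA}(0,1,1)(0,1,1)_{12}$. This order is chosen only for illustration and is not the model fitted in this paper. The sign convention $\theta(B) = 1 - \theta_1 B$ and $\Theta(B^{12}) = 1 - \Theta_1 B^{12}$ is also an assumption. Under these assumptions the model expands to:

```latex
\begin{aligned}
w_t &= (1 - B)(1 - B^{12})\, y_t, \\
w_t &= (1 - \theta_1 B)(1 - \Theta_1 B^{12})\, \varepsilon_t \\
    &= \varepsilon_t - \theta_1 \varepsilon_{t-1} - \Theta_1 \varepsilon_{t-12}
       + \theta_1 \Theta_1\, \varepsilon_{t-13}.
\end{aligned}
```

The cross term at lag 13 appears without an extra parameter, which is why a multiplicative seasonal model can describe monthly data with few coefficients.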
3.3. BP Neural Network Model
The backpropagation neural network, or BP network, has been widely used in various applications. It learns and stores a large number of input–output mapping relations. The learning rule uses the steepest descent method to iteratively adjust the weights and thresholds of the network through backpropagation so as to minimize the sum of squared errors. Because it relies on steepest descent, a plain BP neural network can suffer from slow learning convergence and low learning efficiency, so the learning rate and the number of iterations must be chosen with care.
3.3.1. Fundamentals
A BP network consists of an input layer, a hidden layer, and an output layer. The input layer receives the input data, the hidden layer processes the information, and the output layer emits the result. The weights from the input layer to the hidden layer are represented by $w_{ij}$, while the weights from the hidden layer to the output layer are represented by $w_{jk}$.
In Figure 3, the model diagram depicts a neural network with a single hidden layer. The process of the BP neural network can be divided into two stages. The first stage involves the forward propagation of the signal, where the input data pass through the hidden layer and eventually reach the output layer. The second stage is the backward propagation of the error. The error is propagated from the output layer to the hidden layer and then to the input layer. This backward propagation allows for the adjustment of the weights and biases in the hidden layer and the weights in the input layer.
The neural network is trained by a backpropagation algorithm. The algorithm uses gradient descent to adjust the connection weights and biases by minimizing the error between the network output and the actual values. This process consists of iterative steps of forward propagation and backward updating of the weights.
Common activation functions include Sigmoid, Tanh, and ReLU. They introduce nonlinearity so that the neural network can handle complex nonlinear relationships. A widely used choice is the Sigmoid (logistic) function, also known as the S-shaped growth curve, which works well in classifiers.
3.3.2. Training Process
Step 1 Input data: Input data from the training set is fed into the input layer of the network;
Step 2 Forward propagation: Calculate the output of each neuron through the forward propagation of the network;
Step 3 Calculate the error: Compare the network output with the actual value and calculate the error;
Step 4 Backpropagation: Backpropagate using the error information, calculate the gradient, and update the connection weights and bias according to the gradient;
Step 5 Repeat iteration: Adjust the network parameters through multiple iterations of the training process until the error converges to a satisfactory level.
In summary, the BP neural network has a strong nonlinear fitting ability and is suitable for complex problems; it has a strong learning ability and handles large-scale data sets well. However, it is sensitive to the initial weights and the learning rate, and it may require more training data for some specific problems.
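The five steps above can be sketched in plain Python for a network with one hidden layer and Sigmoid activations. This is a minimal sketch under stated assumptions, not the implementation used in this paper: the function name `train_bp`, the default hyper-parameters, and the per-sample (online) weight update are all illustrative choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_bp(samples, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network by gradient descent (hypothetical helper).

    samples: list of (inputs, target) pairs with targets inside (0, 1).
    Returns (predict, history); history holds the sum of squared errors per epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # w_ih[j][i]: input i -> hidden j; w_ho[j]: hidden j -> output
    w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_ho = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(w_ih[j], x)) + b_h[j])
             for j in range(n_hidden)]
        y = sigmoid(sum(w * hj for w, hj in zip(w_ho, h)) + b_o)
        return h, y

    history = []
    for _ in range(epochs):                      # Step 5: repeat
        sse = 0.0
        for x, target in samples:                # Step 1: feed one sample
            h, y = forward(x)                    # Step 2: forward propagation
            err = y - target                     # Step 3: error
            sse += err * err
            # Step 4: backpropagate the error through the Sigmoid derivatives
            delta_o = err * y * (1.0 - y)
            delta_h = [delta_o * w_ho[j] * h[j] * (1.0 - h[j])
                       for j in range(n_hidden)]
            for j in range(n_hidden):
                w_ho[j] -= lr * delta_o * h[j]
                b_h[j] -= lr * delta_h[j]
                for i in range(n_in):
                    w_ih[j][i] -= lr * delta_h[j] * x[i]
            b_o -= lr * delta_o
        history.append(sse)

    return (lambda x: forward(x)[1]), history
```

Because the output unit is a Sigmoid, targets should be scaled into the open interval (0, 1) before training, which is one reason passenger volumes are normally normalized first.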
3.4. SARIMA-BP Neural Network Forecasting Model
Due to the pronounced seasonal characteristics of civil aviation passenger traffic, this study initially employs the seasonal ARIMA (SARIMA) model to describe its linear components. However, the model’s predictive accuracy may be compromised when delineating time series changes, as the SARIMA model employs differencing to isolate linear factors and fails to adequately account for the nonlinear elements influencing time series fluctuations. The SARIMA model’s prediction error (residual) serves as the input for the BP neural network. This study utilizes the nonlinear BP neural network model to characterize the nonlinear aspects of civil aviation passenger transportation volume. Concurrently, this approach corrects the SARIMA model’s prediction residuals to enhance the prediction accuracy. The nonlinear BP neural network learns the residual prediction model through training, and the final prediction result is as follows:
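$$\hat{Y}_t = \hat{L}_t + \hat{e}_t$$
where $\hat{Y}_t$ is the final prediction of the combined model, $\hat{e}_t$ is the corrected residual of the SARIMA model, as predicted by the BP neural network, and $\hat{L}_t$ is the prediction of the SARIMA model.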
3.5. Evaluating Indicator
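To evaluate the error and bias of the prediction results and the performance of the prediction method, this study used five indicators: $RE$, $MRE$, $R^2$, $MSE$, and $RMSE$. They are expressed by Equations (6)–(10).
$$RE_i = \frac{|\hat{y}_i - y_i|}{y_i} \times 100\% \tag{6}$$
$$MRE = \frac{1}{n} \sum_{i=1}^{n} RE_i \tag{7}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{8}$$
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{9}$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{10}$$
where $\hat{y}_i$ is the predicted value of the model; $y_i$ is the true value; $\bar{y}$ is the mean of the true values; $n$ is the number of observations; $RE$ is the relative error; $MRE$ is the mean relative error; $R^2$ is the coefficient of determination; and $MSE$ is the mean square error, which can evaluate the degree of change in the data. The smaller the value of $MSE$, the more accurately the prediction model describes the experimental data. Meanwhile, $RMSE$ is the root mean square error, which measures the deviation between the predicted value and the real value and is sensitive to the outliers in the data.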
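The five indicators can be computed together in a few lines. The sketch below assumes that the relative error is taken in absolute value and expressed as a percentage, and that $R^2$ is defined against the mean of the true values. The function name `evaluate` and the sample numbers in the usage note are illustrative only.

```python
import math


def evaluate(y_true, y_pred):
    """Return RE (per point, %), MRE (%), R2, MSE, and RMSE for two equal-length lists."""
    n = len(y_true)
    re = [abs(p - t) / abs(t) * 100.0 for t, p in zip(y_true, y_pred)]
    mre = sum(re) / n
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    mse = ss_res / n
    return {
        "RE": re,
        "MRE": mre,
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }
```

For example, `evaluate([100, 200, 300], [110, 190, 300])` gives per-point relative errors of 10%, 5%, and 0%, so the MRE is 5%.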
4. Model Application and Analysis of Results
4.1. Forecasting Civil Aviation Passenger Traffic Based on the SARIMA Model
4.1.1. Stationarity Test
In the line graph of the original series (Figure 1), civil aviation passenger traffic grows over time, indicating that the time series has an obvious linear trend. The line graph also shows that the same fluctuation pattern recurs every 12 time intervals, which indicates that the time series of civil aviation passenger traffic has a strong periodicity, where the cycle length is $s = 12$.
The civil aviation passenger traffic time series shows significant seasonal volatility, so the effects of seasonality and trend must be removed. We label the original series as $\{y_t\}$. Firstly, we perform a seasonal differencing of the series with a step size of 12, denoted as $\nabla_{12} y_t$, as shown in Figure 4. Next, a first-order differencing with a step size of 1 is performed, denoted as $\nabla \nabla_{12} y_t$, as shown in Figure 5. These two operations make the series stationary and easier to handle in the subsequent time series analysis and modeling.
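The two differencing operations can be sketched as follows. This is a minimal illustration on a synthetic series; the helper names `difference` and `seasonal_then_first_difference` are hypothetical and are not part of any library.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def seasonal_then_first_difference(series, season=12):
    """Apply a seasonal difference (step `season`), then a first difference (step 1)."""
    return difference(difference(series, lag=season), lag=1)


# Synthetic monthly series: a linear trend plus a repeating 12-month pattern.
pattern = [0, 3, 1, 4, 6, 9, 12, 11, 7, 5, 2, 1]
series = [2 * t + pattern[t % 12] for t in range(36)]

stationary = seasonal_then_first_difference(series)
```

On this synthetic input the seasonal difference removes the repeating pattern and leaves a constant, and the first difference then removes that constant, so every value of `stationary` is zero. The result is 13 observations shorter than the input.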
Meanwhile, the ADF unit root test is performed on the differenced sequence $\nabla \nabla_{12} y_t$, and the test results are detailed in Table 2. The ADF test statistic is smaller (more negative) than the critical values at the 1%, 5%, and 10% significance levels. In addition, the p-value is 0.0000, which is well below the usual significance level of 0.05, so the unit root hypothesis is rejected. Combining Figure 5 with the unit root test, it can be seen that the sequence $\nabla \nabla_{12} y_t$ is stationary.
4.1.2. Model Identification
We identified the model using the Box–Jenkins model identification method. This method first assumes that the process generating the time series can be approximated by an ARMA model (if it is stationary) or an ARIMA model (if it is non-stationary). Two diagnostic charts help select the p and q parameters of ARMA or ARIMA: the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The ACF plot summarizes the correlation between the observations and their lagged values. The PACF plot summarizes the correlation between the observations and their lagged values that is not explained by the intervening lags. If the ACF cuts off sharply to near 0 at a small lag k while the PACF decays gradually to 0, then we can use the MA model. If the PACF cuts off sharply to near 0 at a small lag k while the ACF decays gradually to 0, then we can use the AR model. If neither the ACF nor the PACF cuts off sharply but both eventually converge to 0, then it is more appropriate to use the ARMA model. Here a sharp cutoff means a cliff-like drop; it is distinct from gradual convergence to 0, and the values may rise again at later lags.
The autocorrelation function (ACF) and partial autocorrelation function (PACF) of the $\nabla \nabla_{12} y_t$ sequence are shown in Figure 6. The p-value of the white-noise test is less than 0.05, so the sequence is not white noise and can be modeled. The autocorrelations converge to 0 after the second period and show some tailing, and the partial autocorrelations also show some tailing. As a preliminary decision, the ARMA model is selected.
4.1.3. Model Ordering and Parameter Estimation
In this section, we analyze the ACF and PACF plots to determine $p$ and $q$. The value of the autoregressive term $p$ is determined using the PACF plot: if all the bars after lag $k$ are close to zero, then $p = k$ can be chosen, so the last lag with a significant non-zero bar in the PACF plot is a candidate value for $p$. The value of the moving average term $q$ is determined in the same way using the ACF plot: if all the bars after lag $k$ are close to zero, then $q = k$ can be chosen, so the last lag with a significant non-zero bar in the ACF plot is a candidate value for $q$.
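The ACF side of this rule can be sketched in code. The example below computes the sample autocorrelation and reports the largest lag whose bar lies outside the approximate 95% band $\pm 1.96/\sqrt{n}$. The band, the helper names, and the toy values in the usage note are assumptions for illustration; the PACF side, which needs a recursion such as Durbin–Levinson, is omitted.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (lag 0 is always 1.0)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def last_significant_lag(acf, n):
    """Largest lag k >= 1 with |acf[k]| above 1.96 / sqrt(n); 0 if there is none."""
    bound = 1.96 / math.sqrt(n)
    significant = [k for k in range(1, len(acf)) if abs(acf[k]) > bound]
    return max(significant) if significant else 0
```

With `n = 100` the band is about 0.196, so an ACF of `[1.0, 0.6, 0.1, 0.05]` yields a candidate order of 1. In practice the plots are read together with information criteria rather than by this rule alone.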
It can be seen from Figure 6 that the candidate AR lags are 2, 11, and 12, and the candidate MA lags are 2, 3, 11, and 12. Since the autocorrelation function (ACF) of the time series shows a significant correlation at the first lag point after each seasonal cycle, a seasonal moving average term is needed to help the model capture this seasonal effect, so SMA takes 1. Model debugging yields the parameter estimates in Table 3. The resulting model is the most appropriate of the candidates, and it is used to model the differenced sequence $\nabla \nabla_{12} y_t$.
4.1.4. Model Testing
Residual Analysis: The residual autocorrelation plot in Figure 7 shows that the model’s residuals behave as white noise, with no obvious pattern or trend.
Ljung–Box Test: According to the residual autocorrelation plot in Figure 8, the p-value is greater than the significance level (usually 0.05), which indicates that there is no significant autocorrelation in the residual series, in agreement with the residual analysis.
AIC Comparison: The fitting performance of the candidate SARIMA models is compared using information criteria such as the Akaike Information Criterion (AIC), and the selected model has the minimum AIC.
Prediction Performance: The model is trained using historical data and then used to make predictions of future data. Table 4 shows the prediction results and the relative error. Figure 8 and Table 4 show that the relative error is small and that the predictive performance of the model is good.
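For reference, the two test quantities named above have standard textbook forms, written here with assumed symbols: $n$ is the number of residuals, $\hat{\rho}_k$ is the residual autocorrelation at lag $k$, $h$ is the number of lags tested, $m$ is the number of estimated parameters, and $\hat{L}$ is the maximized likelihood.

```latex
\begin{aligned}
Q &= n\,(n + 2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^{\,2}}{n - k}, \\
\mathrm{AIC} &= 2m - 2 \ln \hat{L}.
\end{aligned}
```

The statistic $Q$ is compared with a chi-square distribution; a small p-value rejects the hypothesis that the residuals are uncorrelated. A lower AIC indicates a better balance between goodness of fit and the number of parameters.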
4.2. Passenger Traffic Prediction Based on BP Neural Network Modeling
4.2.1. BP Neural Network Design
This paper uses the sequence values of the preceding 12 periods to predict the value of the next period. Specifically, the sequence values from periods 1–12 serve as the input to the network, while the sequence value of the 13th period is designated as the network’s output. Likewise, the sequence values from periods 2–13 serve as the input, with the sequence value of the 14th period being the output, and this pattern continues. According to the “Rule of Thumb,” the number of hidden-layer neurons is typically calculated as 2/3 of the number of input-layer neurons plus 1/3 of the number of output-layer neurons, which here gives about 8.3, that is, either 8 or 9 neurons. An empirical comparison is then used to choose between these values, and the number of hidden-layer neurons is set to nine. The final network configuration consists of 12 input-layer neurons, 9 hidden-layer neurons, and 1 output-layer neuron, with the Sigmoid function selected as the activation function. The training process involves at most 5000 iterations, with an error threshold of 0.000001 and a learning rate of 0.01.
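The sliding-window construction and the rule of thumb can be sketched as follows. The helper names `make_windows` and `hidden_size_rule_of_thumb` are hypothetical, and the short integer series stands in for the passenger data.

```python
def make_windows(series, window=12):
    """Pair each run of `window` consecutive values with the value that follows it."""
    count = len(series) - window
    inputs = [series[i:i + window] for i in range(count)]
    targets = [series[i + window] for i in range(count)]
    return inputs, targets


def hidden_size_rule_of_thumb(n_in, n_out):
    """2/3 of the input-layer size plus 1/3 of the output-layer size."""
    return (2 * n_in + n_out) / 3


# Periods 1..15 as a stand-in series: three training samples result.
inputs, targets = make_windows(list(range(1, 16)), window=12)
suggested = hidden_size_rule_of_thumb(12, 1)
```

Here `inputs[0]` holds periods 1–12 with target 13, `inputs[1]` holds periods 2–13 with target 14, and `suggested` lies between 8 and 9, which matches the two candidate sizes mentioned above.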
4.2.2. BP Neural Network Prediction Results
After training on the sample data, the network produces output values, and their fit with the actual values is illustrated in Table 5 and Figure 9. The relative error between the network’s output and the actual values is small, suggesting that the BP neural network can be effectively applied to predicting China’s civil aviation passenger traffic.
4.3. SARIMA-BP Neural Network Prediction Model for Civil Aviation Passenger Traffic Volume
The results demonstrate that individual prediction methods exhibit limited accuracy. Therefore, the combined SARIMA-BP model is employed to forecast civil aviation passenger traffic volume. Residuals are derived from the predictions of the seasonal ARIMA model and serve as the desired output for the BP neural network. The original civil aviation passenger traffic data are then used for training, and the resulting data are fed into the BP neural network, which learns a model of the residuals and produces predicted residual sequence values. Finally, MATLAB 2023b outputs the prediction results of the combined SARIMA-BP model. As depicted in Figure 10, the predicted values closely align with the true values, leading to a significant reduction in prediction error and an enhancement in the model’s prediction accuracy.
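The data flow of the combined model can be sketched independently of the two underlying models. In the sketch below, `residual_model` is a hypothetical callable that maps the most recent residuals to the next predicted residual. In this paper that role is played by the trained BP neural network; the usage note substitutes a simple mean only to show the plumbing. Feeding the network lagged residuals is an assumption based on Section 3.4.

```python
def combine_forecasts(actual, sarima_fitted, sarima_future, residual_model, window=12):
    """Add a predicted residual to each SARIMA forecast.

    actual, sarima_fitted: in-sample observations and SARIMA fitted values.
    sarima_future: out-of-sample SARIMA forecasts (the linear part).
    residual_model: callable taking the last `window` residuals and
        returning the next predicted residual (the nonlinear part).
    """
    history = [a - f for a, f in zip(actual, sarima_fitted)]
    combined = []
    for linear_part in sarima_future:
        correction = residual_model(history[-window:])
        combined.append(linear_part + correction)
        history.append(correction)   # roll the window forward with the prediction
    return combined
```

For instance, with in-sample residuals that are all 1 and a stand-in model that returns their mean, SARIMA forecasts of 15 and 16 become 16 and 17.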
4.4. Comparison and Analysis of Results
The relative errors of the three models were compared and analyzed, and the results of the comparison are shown below (see Figure 11). The evaluation indicators (MRE, R², MSE, RMSE) of the three models are compared in Table 6.
Figure 11 reveals that all the relative errors of the combined model are below 5 percent, whereas the individual prediction models exhibit some significant relative error values.
It can be observed from Table 6 that the prediction results of the combined model are in good agreement with the actual civil aviation passenger volume data. The mean relative error is 1.6906%, and the R² value is as high as 0.9816, which is very close to 1, indicating that the model fits well. In addition, the mean square error (MSE) and root mean square error (RMSE) of the model are also significantly lower than those of the other models, which further supports its superiority. The SARIMA-BP model combines the advantages of the two models and makes effective use of the prediction information of each. This combination improves the accuracy of the prediction and thereby the reliability of the prediction results. Therefore, the SARIMA-BP model was used to predict the civil aviation passenger volume during the epidemic period (2020–2023).
5. Analysis of the Impact of the Epidemic on Passenger Transport Volume
We compared the forecast of civil aviation passenger traffic during the epidemic period (2020–2023) with the actual data, as shown in Figure 12.
We observed the severe impact of the epidemic on the aviation industry. Overall, civil aviation passenger traffic suffered losses totaling approximately 1347.2 million passengers. The losses peaked in February 2022 at 87.62 percent, or approximately 55.8 million passengers. They were also severe in February 2020, at the beginning of the outbreak, with a reduction of 85.11 percent, or approximately 49.81 million passengers. Over time, however, and especially from the beginning of 2023, civil aviation passenger traffic gradually recovered: the smallest loss, in July 2023, was 13.37 percent, or about 9.64 million passengers, and traffic then gradually returned to normal levels.
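The loss figures can be reproduced with a short calculation. The paper does not state the formula, so the sketch below assumes that the loss is the forecast minus the actual value and that the percentage is taken relative to the forecast. The function name and the numbers in the usage note are hypothetical and are not the paper's data.

```python
def epidemic_loss(forecast, actual):
    """Return per-month losses, loss percentages, and the total loss.

    forecast: counterfactual (no-epidemic) monthly predictions.
    actual: observed monthly passenger volumes, same length and units.
    """
    losses = [f - a for f, a in zip(forecast, actual)]
    percentages = [loss / f * 100.0 for loss, f in zip(losses, forecast)]
    return losses, percentages, sum(losses)
```

With hypothetical forecasts of 60 and 50 million and actual volumes of 15 and 40 million, the monthly losses are 45 and 10 million (75% and 20%), and the total loss is 55 million.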
Presently, with the risk of the epidemic receding and the steady growth in civil aviation passenger traffic, people’s willingness to travel abroad has increased significantly. Due to the constraints of road and railway transportation, airlines have the opportunity to attract more passengers choosing to fly by launching various promotional activities and improving cabin comfort. In addition, airlines can open up new routes according to changes in market demand or optimize or even discontinue existing routes to more effectively meet the needs of passengers and enhance their market competitiveness.
6. Conclusions
Based on the comparative study of the BP neural network model and the SARIMA model in predicting civil aviation passenger volume as well as the results of combining the two models for simultaneous prediction, we have drawn the following conclusions.
Firstly, when predicting the passenger volume of civil aviation in 2019, we found that the SARIMA-BP combination model performed the best, with a better prediction accuracy than using the BP neural network model or the SARIMA model alone. This shows that the accuracy and stability of prediction can be improved by combining multiple prediction methods according to the characteristics of a single model.
Secondly, for predicting civil aviation passenger volume from 2020 to 2023, we utilized the SARIMA-BP combination model, which had been validated as the best method. Through comparison with actual data, it was observed that the epidemic had significantly impacted the aviation industry, resulting in substantial losses in civil aviation passenger traffic. The decline reached its peak in February 2022, and it was also severe in February 2020, during the initial outbreak of the epidemic. However, over time, especially in early 2023, the passenger volume of civil aviation gradually rebounded and eventually returned to normal levels. Airlines can adjust their long-term and short-term strategies according to the actual situation. This includes the optimization and adjustment of the route network, the re-evaluation of the aircraft procurement plan, and so on.
In summary, this study demonstrates the effectiveness of combination models in predicting civil aviation passenger volume and provides an in-depth analysis of the epidemic’s impact on the aviation industry. These findings offer valuable insights for airlines and government departments, enabling them to develop effective response strategies and measures to address similar crises which may arise in the future. Future research could focus on exploring alternative prediction models or integrating multiple methods to enhance the precision and stability of predictions, thereby better adapting to the ever-changing market environment.
7. Discussion
Civil aviation passenger volume shows a significant linear growth trend. The SARIMA model has a high prediction accuracy for time series with regular growth. At the same time, the BP neural network also shows an excellent prediction ability for nonlinear sequences. By combining these two models, we can further improve the accuracy of the prediction. The research literature shows that, compared with the single model, the combined prediction model can usually provide a higher accuracy. As shown in Table 7, the example verifies the advantages of the combined model in the prediction effect.
In this paper, the combination model of SARIMA and a BP neural network is used to predict the passenger volume of civil aviation, and the prediction accuracy is improved. However, this paper has the following shortcomings:
This paper does not try to use a variety of combinations in the prediction.
In the prediction of civil aviation passenger volume, this paper does not take into account the economic, demographic, and other external factors.
The amounts of data used in this paper are relatively limited, including only monthly data but not annual data.
In future research, we can consider introducing more external factors and expanding the types and quantities of data to improve the accuracy of prediction. In addition, the combined-model method can also be applied to the prediction of highway and railway passenger volume.