1. Introduction
International air networks have expanded and the number of air passengers has increased dramatically due to globalization and increasingly liberalized bilateral air service agreements (ASAs). Such domestic and international air connectivity allows people to travel around the world easily, stimulating trade and people-to-people exchanges, but it may also allow infectious diseases to spread rapidly around the world. International air travel has been an important medium in the fast spread of several past epidemics and pandemics, for example, SARS in 2003 and H1N1 in 2009 [
1,
2,
3]. The COVID-19 outbreak in early 2020 caused significant disruption to economic activity; the aviation market has been hit particularly hard. Scholars have studied the influence of COVID-19 around the world, such as its impact on domestic and international U.S. air travel [
4,
5,
6] as well as policies in or its impact on Europe [
7,
8,
9]. In a study on COVID-19 focused on aviation, Sun et al. [
10] reviewed more than 110 publications and found that current research on the impact of COVID-19 on the aviation market is mainly focused on (1) the analysis of the global air transport system during COVID-19 [
11,
12]; (2) the impact of COVID-19 on the passenger-centric flying experience [
13,
14,
15]; and (3) the long-term impact of COVID-19 on the aviation market [
16,
17,
18,
19,
20], etc. China was the first country to be hit by COVID-19. Under a number of control measures, the epidemic in China was contained relatively well, and work and production resumed at the same time. However, the epidemic rebounded in several stages due to factors that were beyond control. Zhang et al. [
21] plotted the changes in the number of air passengers throughout the year from 2018 to 2020. Their plot shows that, after May 2020, the overall number of air passengers in China recovered well, and that the inflection points of reduced passenger flow all corresponded to local epidemic outbreaks. We further plot air passengers from 2018 to 2021 in
Figure 1, which also shows that local outbreaks affect the total air demand.
Therefore, a sudden local epidemic clearly hurts the recovery of the aviation market. With the rapid spread of COVID-19 and the gloomy global situation, local outbreaks are no longer accidental events. Studying the impact of local outbreaks on urban air passenger flow can not only inform transportation organizations about the future trend of passenger numbers during the epidemic, so that airlines and urban airports can plan adjustments to flight frequency, airport operation, and maintenance, but also help air passengers make more reasonable travel plans.
The change in air passenger numbers is uncertain in the context of COVID-19. Therefore, the main problem that we study is how to evaluate the change in air traffic demand after a city is affected by a local outbreak. In order to exclude the natural increase in tourists during the recovery period from our assessment, we used a classic research method in economics: we regard the outbreak of COVID-19 as a policy variable and divide the research objects into two groups. The group affected by a local outbreak is the experimental group; by finding a control group not affected by the policy, we can obtain the treatment effect as the difference between the two. Here, we assume that the only difference between these two groups is the policy. However, the control group is difficult to find in practice. For example, in our study, an outbreak had already occurred in a certain city, so it is difficult to obtain a control group that was not affected by the epidemic and had the same trend in air traffic as the experimental group. Therefore, economists study the effects of policies by constructing counterfactual experimental groups. The difference-in-differences (DID) method is one of the methods most widely used by scholars [
13]. In this method, the areas affected by the policy can be defined as the experimental group, while the areas not affected by the policy can be defined as the control group. The difference between the experimental group and the control group before and after the policy treatment can then be compared. However, it is difficult for the DID method to solve the problem of selection bias. Taking China as an example, the geographical location, population, and economic level of cities vary greatly, making it difficult to directly screen out a matched control group. To address this deficiency of the DID method, Abadie and Gardeazabal [
22] proposed a new method to identify the effect of policies: the synthetic control method (SCM). The synthetic control method also has drawbacks. For example, because it uses a weighted average to synthesize virtual controls, it is largely unable to synthesize “extreme” virtual controls. The defects of the above two methods can be avoided by predicting the trend of the experimental group and constructing a counterfactual control group. Both linear and nonlinear methods can be used to predict this trend; machine learning prediction is a nonlinear approach that has improved the accuracy of counterfactual estimation in recent years [
23,
24].
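To make the difference-in-differences idea above concrete, the following minimal Python sketch computes a DID estimate from group averages. The function name `did_estimate` and all passenger figures are hypothetical and chosen only for illustration; they are not taken from the data used in this paper.

```python
from statistics import fmean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group means.

    The change in the control group is subtracted from the change in the
    treated group, so that growth common to both groups cancels out.
    """
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Hypothetical daily departures (thousands) before and after an outbreak.
effect = did_estimate(
    treated_pre=[100, 110], treated_post=[70, 80],
    control_pre=[50, 60], control_post=[60, 70],
)
print(effect)  # -40.0
```

The estimate is only meaningful if the control group would have followed the same trend as the treated group, which is exactly the condition that is hard to satisfy for very different cities.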
In our study, we used machine learning to predict the air traffic demand of cities and its changing trend in the absence of outbreaks. This group can be seen as a counterfactual experimental group. The impact of a local outbreak on urban air traffic is the difference between the actual air demand and the counterfactual group.
When we use the machine learning method to predict the trend of air traffic in cities with local outbreaks, the following difficulties need to be solved: (1) Due to the incubation period and lockdown policy of COVID-19, the impact of a local outbreak may be long-term, so we need to predict the long-term trend of air passenger traffic as a counterfactual experimental group after a local outbreak through a machine learning algorithm. (2) Due to the lack of historical data and the dramatic growth in the recovery period, it is difficult to make predictions from time series. (3) Air transport is a network structure. Local outbreaks not only affect the inbound and outbound passenger flows of urban airports, but also change the flow at the airline level. In order to solve the above problems, we transformed the prediction problem from time series to the fitting problem at the spatial level. We used the LightGBM algorithm to fit the air traffic demand of target cities through those cities not affected by local outbreaks.
This model has the advantages of fast computing speed, low memory usage, and resistance to overfitting, and is suitable for our research background. At the same time, in order to reduce the influence of the network, we assume that a city with a sudden outbreak and the other cities are spatially dependent. When processing the training and forecasting datasets, we therefore remove from each of the other cities the passengers traveling to the city being fitted, so that the data used for fitting are free of the effect of that city's local outbreak. In the same way, the spatial dependencies associated with the epidemic in that city are removed during training.
We introduce research on causal inference through the machine learning method into the field of air transportation in this study. We construct a counterfactual experimental group of urban air passenger demand in the case of a local outbreak through the LightGBM method and evaluate the impact of a local outbreak on urban air demand. Finally, we select two local outbreak cities (Beijing and Dalian) as case studies and compare them with synthetic control methods to verify the effect of our policy evaluation method.
The main structure of this paper includes the following parts:
Section 2 is a review of the literature.
Section 3 introduces the research method, which includes information on how to construct a counterfactual group, how to predict air traffic demand, and how to measure the impact.
Section 4 takes the local outbreaks in Beijing and Dalian as cases to show the method that we use and the impact that we want to measure.
Section 5 is the conclusion.
2. Literature Review
In order to exclude the natural increase in tourists during the recovery period from our assessment, we use a classic research method from economics: we regard the outbreak of COVID-19 as a policy variable and divide the research objects into two groups. The difference-in-differences (DID) method is one of the methods most widely used by scholars [
13]. The DID method has been widely used in airline competition and policy analyses. For example, Yan et al. [
25] and Ma et al. [
26] treated an airline merger as a treatment policy and used the DID method to investigate the fare effects of mergers. However, it is difficult for the DID method to solve the problem of selection bias. To address this problem, Abadie et al. [
22] developed a synthetic control procedure for estimating the effect of a treatment, in the presence of a single treated unit and a number of control units, with pretreatment outcomes observed for all units. This method constructs a set of weights, such that covariates and pretreatment outcomes of the treated unit are approximately matched by a weighted average of control units. The weights are restricted to be non-negative and to sum to one. The synthetic control method has been widely used in policy evaluation. Borbely [
27] applied the synthetic control method to the change in air travel passenger volume under the influence of an air tax. When we regard an emergency as a policy and study its impact on a specific city, we can make an appropriate linear combination of several major cities to construct a better “synthetic control region” and compare a “real city” with a “synthetic city”. For example, Xin et al. [
28] studied and estimated the impact of COVID-19 on the daily passenger volume of urban rail transit (URT) through the synthetic control method. However, the synthetic control method also has defects. For example, because the synthetic control method uses a weighted average to synthesize a virtual control group, and the weights sum to one, a virtual control group with an “extreme value” cannot be synthesized.
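The “extreme value” limitation follows directly from the weight constraints: a combination with non-negative weights that sum to one can never exceed the largest control value. The short Python sketch below illustrates this with hypothetical passenger figures; the function name `synthetic_value` and all numbers are assumptions made for the example.

```python
def synthetic_value(weights, donor_values):
    """Weighted combination with non-negative weights that sum to one."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w * v for w, v in zip(weights, donor_values))


# Hypothetical daily departures (thousands) of three control cities.
donors = [300.0, 450.0, 520.0]
# Hypothetical treated city that is larger than every control city.
target = 900.0

candidates = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.2, 0.3, 0.5)]
best = max(synthetic_value(w, donors) for w in candidates)
print(best, best < target)  # 520.0 True
```

Whatever weights are chosen, the synthetic city stays at or below 520, so a city at 900 cannot be reproduced.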
Therefore, how to obtain a set of counterfactual experimental groups that is closer to reality is one of the difficult problems in the research. In this paper, we construct a counterfactual experimental group by predicting the air demand of a city without a local outbreak and study the impact of an outbreak on the city. In terms of air transportation demand prediction on the country or city level, there exists a large amount of literature. The existing studies that focus on linear forecasting include a variety of univariate, multivariate, and panel regression OLS models [
23,
29], ARIMA models [
30], gravity models [
31,
32,
33], and so on. Although this kind of method has achieved good prediction results, its prediction accuracy needs to be improved for nonlinear cases. The change in aviation demand is nonlinear, so analyzing air demand and its growth trend in the recovery period is challenging. Nonlinear methods are mainly based on the framework of machine learning, such as artificial neural networks [
34], support vector regression [
35], and so on. There are also some machine learning algorithms used in the prediction of road traffic demand, such as long short-term memory (LSTM) networks [
36]. Researchers report that the machine learning methodologies adhere more closely to the actual transportation demands of air traffic than the econometric ones. Alekseev and Seixas [
37] researched air demand forecasting for Brazil based on simple OLS regression and artificial neural network (ANN) models, and found that ANNs provide more accurate forecasts of future air transportation demand than the econometric models. Srisaeng et al. [
38] predicted Australia’s low-cost carrier passenger demand and revenue passenger kilometers (RPKs) performance using traditional econometric and artificial neural network (ANN) methods, and they found that the prediction performance of the ANN model was better than that of the traditional multiple linear regression (MLR) approaches.
There are also some hybrid methods to improve the accuracy of prediction, such as Xie et al. [
39] using hybrid seasonal decomposition and least squares support vector regression approaches, which predict short-term air passenger demand; they found that hybrid approaches are better than other time series models.
In addition, various graph neural networks based on the characteristics of air transport networks are also used to predict the number of passengers at the node or route level. ConvLSTM was used to deal with a temporal and spatial network of airlines [
40]. However, ConvLSTM is very complex and requires a large amount of training data. With the increase in network depth, the training cost will increase significantly, which limits the depth of the network and the ability to capture a wide range of spatial–temporal correlations. The traffic demand prediction of road networks is also faced with the problem of a short prediction time. Zhao et al. [
41] considered temporal–spatial correlations with the LSTM approach in a traffic system via a two-dimensional network for short-term traffic forecasting. However, network models based on LSTM cannot effectively capture long-range temporal correlations, so they cannot make long-term predictions.
3. Data Description and Research Method
3.1. Data Description and Processing
The passenger data used in this paper were retrieved from UMETRIP, which is the largest aviation data service company in China. It is a technical company which is jointly operated by China TravelSky Holding Company Limited and TravelSky Mobile Technology Limited. Both are state-owned companies that operate air ticket booking and are integrated with IATA’s global air ticket reservation system. It provides daily air travel service information and monitors more than 12,000 domestic flights in China as well as 60,000 flights every day. For more information, please refer to the following website:
https://www.umetrip.com (accessed on 28 April 2022). The database contains the weather and number of air passenger departures from various cities in China from 1 May to 1 October 2020. We also collected the grade data, geographical location, and GDP of urban airports from the Civil Aviation Administration of China and the 2019 urban statistical yearbook. The cities and research periods of the local outbreaks in the study are shown in
Table 1.
After a local outbreak affects Beijing, the number of passengers departing from airports on routes connected with Beijing will also change due to the network structure. Therefore, if the city affected by the local outbreak is i, and a city not affected by the local outbreak is j, we divide the number of passengers of city j into two types: the first is the total number of passengers departing from city j, and the second is the number of passengers departing from city j excluding those arriving at city i. At the same time, we also introduce factors that affect air traffic volume into the predictions. These mainly include weather, GDP, and the geographical location of the airport.
Weather: Extreme weather may affect the number of passengers. For example, in the case of strong typhoon weather, flights may be canceled, which reduces the number of passengers.
GDP: The GDP of a city can be used as one of the indicators to measure the consumption capacity of a city. The greater the consumption capacity of a city, the greater the possibility of traveling by air.
The geographical location of a city: In this paper, the geographical location of an airport where a city is located is identified according to the air traffic control bureau, which is specifically divided into East China, North China, South China, Central China, northwest, southwest, northeast, and Hong Kong, Macao, and Taiwan. We use these data to exclude the impact of policy differences between different air traffic control bureaus.
We also obtained the data of outbreaks through the statistics of Dingxiangyuan in China.
3.2. Research Method
When portraying the effect of a local outbreak of COVID-19 on aviation demand, we can regard it as a policy variable. When a city suffers from a local outbreak, the number of air passengers departing from the city will be reduced due to restrictive travel policies and a reduction in passengers’ willingness to travel. When we quantitatively analyze this impact, we should also consider the dynamic change in air passengers and the natural growth in the recovery period. In order to eliminate the influence of natural growth, we borrow the idea of a randomized experiment from economics. We suppose that there are two groups of experimental subjects: one group (the experimental group) is affected by the policy, while the other group (the control group) is not. The only difference between the two groups is the policy, and the treatment effect of the policy is the difference between the two. However, a randomized controlled experiment is difficult to carry out in practice. For example, in our study, once an outbreak has occurred in a certain place, the trend that air traffic would have followed without the outbreak cannot be observed in real life. Therefore, following the approach economists use to study the treatment effect of a policy, a counterfactual experimental group (control group) was constructed. In our study, the synthetic control method and the machine learning prediction method are both used to construct the control group of the counterfactual experiment to study the treatment effect of an outbreak, and the two are compared. Finally, we extended the impact of a local epidemic on a single city to other stations in China.
3.2.1. Definition of Impact
First, we define the impact of a COVID-19 local outbreak. Assuming that a city with a local outbreak is $i$ and the time point of the local outbreak is $t_0$, the variation curve of the actual departure demand of air passengers in city $i$ with time $t$ is $Q_i(t)$. The time period, $T$, we studied is from a moment $t_s$ before an outbreak to a moment $t_e$ after an outbreak. If there is no local outbreak in city $i$, the curve of air passenger departure demand changing with time $t$ is $\hat{Q}_i(t)$. Therefore, we have $Q_i(t) = \hat{Q}_i(t)$ when $t \in [t_s, t_0]$, and $Q_i(t) = \hat{Q}_i(t) + \Delta Q_i(t)$ when $t \in [t_0, t_e]$. Here, $\Delta Q_i(t)$ refers to the changes of departure demand due to the local outbreak of city $i$. It can be written as:

$$\Delta Q_i(t) = Q_i(t) - \hat{Q}_i(t), \quad t \in [t_0, t_e] \tag{1}$$
Due to the different airport sizes in the outbreak cities, the throughput of air passengers is also different. In order to make the impact values comparable, the relative impact values are calculated as follows:
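As a rough numerical illustration of the absolute and relative impact defined in this subsection, the Python sketch below takes an actual and a counterfactual departure series and returns both measures day by day. It assumes that the relative impact is the absolute impact divided by the counterfactual demand; this normalization, the function name `outbreak_impact`, and the passenger figures are assumptions made for the example rather than the paper's exact formula.

```python
def outbreak_impact(actual, counterfactual):
    """Per-day absolute and relative impact of an outbreak.

    absolute[t] = actual[t] - counterfactual[t]
    relative[t] = absolute[t] / counterfactual[t]   (assumed normalization)
    """
    if len(actual) != len(counterfactual):
        raise ValueError("both series must cover the same days")
    absolute = [a - c for a, c in zip(actual, counterfactual)]
    relative = [d / c for d, c in zip(absolute, counterfactual)]
    return absolute, relative


# Hypothetical daily departures after the outbreak date.
actual = [800.0, 450.0, 300.0]
counterfactual = [1000.0, 900.0, 1200.0]
absolute, relative = outbreak_impact(actual, counterfactual)
print(absolute)  # [-200.0, -450.0, -900.0]
print(relative)  # [-0.2, -0.5, -0.75]
```

Dividing by the counterfactual demand makes a loss of 200 passengers at a small airport comparable with a loss of 2000 at a large one.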
3.2.2. Prediction of Counterfactual Group
In previous studies, machine learning methods have usually been used for the time series prediction of air passenger demand. The historical dataset is divided into two parts: the first part is the training set, and the second part is the test set. The pattern of the changing trend in air passenger demand is learned from the training set, and the prediction error is obtained from the test set. However, COVID-19 in 2020 was a one-off event, and relatively few data have been available since domestic restrictions were lifted in May and aviation demand began to recover. When the number of available data is small, time series prediction has its limitations. During the recovery period, the departure demand of cities presents a very obvious growing trend. If only a time series prediction is made, the long-term forecast value may be too high. Therefore, we abandon the traditional recurrent neural network algorithm and use the LightGBM algorithm to construct the proportional relationship of passenger flow between the city to be predicted and other cities at the spatial level. The air passenger flow of the local outbreak city,
$i$, was predicted from the spatial dimension and the time dimension at the same time. The characteristics and training time used in the prediction are described in Section 3.1. We want to predict $\hat{Q}_i(t)$, which is the number of air passengers departing from the local outbreak city, $i$, and it is also a continuous value prediction problem. This being the case, we have:

$$\hat{Q}_i(t) = f\left(Q_1(t), \ldots, Q_m(t);\ X_1(t), \ldots, X_m(t);\ X_i(t)\right) \tag{3}$$

Here, $f$ is a function related to the following three parts: (1) the air passenger number of other cities, $Q_1(t), \ldots, Q_m(t)$; (2) the external characteristics of other cities, $X_1(t), \ldots, X_m(t)$; and (3) the external characteristics of city $i$, $X_i(t)$. Since the changing trend of time, $T$, will affect the prediction results, time, $T$, is not included in the data training process in our study: that is, $f$ in Equation (3) does not depend explicitly on $T$. In addition, when we fit the local outbreak city, $i$, through other cities, the data we use have removed those numbers connected with city $i$. The training logic is shown in Figure 2:
Here, $Q_1(t), \ldots, Q_m(t)$ represent the departure demand of selected cities 1 to $m$, $X_1(t), \ldots, X_m(t)$ are the characteristics reflecting their traffic demand, such as the weather, GDP, and so on, and $X_i(t)$ are those characteristics that reflect the traffic demand of city $i$, such as weather, GDP, departure date, and so on. Our training set period is from 1 May 2020 to 5 June 2020, and the test set is from 6 June to 10 June.
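The spatial fitting set-up described above can be sketched as a table with one row per day, in which the inputs are the departures and external characteristics of the other cities plus the external characteristics of the outbreak city, and the target is the outbreak city's own departures. The Python sketch below only builds such rows; the function name `build_spatial_rows`, the city labels, and the single weather flag per city are hypothetical. The resulting rows could then be passed (for example as a NumPy array) to a gradient-boosting regressor such as `lightgbm.LGBMRegressor`; that step is not run here.

```python
def build_spatial_rows(other_departures, other_features, target_features,
                       target_departures):
    """Build one training row per day for spatial fitting.

    other_departures: {city: [departures per day]}, already excluding
        passengers bound for the target city.
    other_features:   {city: [[external features] per day]}
    target_features:  [[external features of the target city] per day]
    target_departures: [departures of the target city per day]
    No time index is added, so the model learns a cross-city relationship
    rather than a trend over time.
    """
    rows, targets = [], []
    for day, y in enumerate(target_departures):
        row = []
        for city in sorted(other_departures):  # fixed column order
            row.append(other_departures[city][day])
            row.extend(other_features[city][day])
        row.extend(target_features[day])
        rows.append(row)
        targets.append(y)
    return rows, targets


# Hypothetical two-day example with two unaffected cities, A and B.
rows, targets = build_spatial_rows(
    other_departures={"A": [100, 120], "B": [50, 55]},
    other_features={"A": [[1.0], [0.0]], "B": [[0.0], [0.0]]},  # weather flag
    target_features=[[0.0], [1.0]],
    target_departures=[200, 180],
)
print(rows)     # [[100, 1.0, 50, 0.0, 0.0], [120, 0.0, 55, 0.0, 1.0]]
print(targets)  # [200, 180]
```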
3.3. Synthetic Control Method
The synthetic control method allows for factors that change over time as well as factors that cannot be observed. Through synthetic control, we can mitigate the sample selection bias in choosing control units and the endogeneity caused by unobserved factors. The data used in this paper are the air passenger departure data of all airports in China. However, the weather or GDP data are missing for the small cities. Therefore, before conducting research through the synthetic control method, it is necessary to sort out and screen the full data of all cities in China. Firstly, urban airports are classified according to the airport grades of the Civil Aviation Administration of China. The two cities studied in this paper are Beijing and Dalian. Both of the airports in Beijing are 4F-level airports, while Dalian has a 4E-level airport. Therefore, airports at level 4E and above are preferentially selected as research objects. On the other hand, due to the small number of flights at airports below 4D and the limited level of airport informatization, the number of passengers on many dates was missing in the process of data collection, and the urban weather records are incomplete. The GDP of small cities is also missing from China’s statistical yearbook, so all 4F and 4E airports and their cities in China are retained in the calculation, as shown in
Table 2. For comparison, the variables used in the synthetic control method are consistent with the eigenvalues used in machine learning.
The synthetic control method was first applied in the research of Abadie et al. [
22] (2010). In our study, we assume that in the research period, $T$, there are $N + 1$ cities, and that the air traffic of city $i$ is affected by a local outbreak. $Y_{it}^{I}$ is the real air traffic volume that we can observe after a local outbreak, and $Y_{it}^{N}$ represents the potential outcome without a local outbreak. Suppose that $T_0$ is the time at which the intervention was applied.
The observed outcome, $Y_{it}$, in region $i$ at time $t$ can be written in two parts; $\alpha_{it}$ is the estimated effect ($\alpha_{it} = Y_{it}^{I} - Y_{it}^{N}$) of the intervention:

$$Y_{it} = Y_{it}^{N} + \alpha_{it} D_{it} \tag{4}$$

The potential outcome due to predictors, $Y_{it}^{N}$, can be written as:

$$Y_{it}^{N} = \delta_t + \theta_t Z_i + \lambda_t \mu_i + \varepsilon_{it} \tag{5}$$

where $\delta_t$ is a constant factor across all units, $Z_i$ is a vector composed of the predictors not affected by the intervention, $\mu_i$ is a vector of the unobserved predictors, and $\theta_t$ as well as $\lambda_t$ are two vectors of coefficients. $D_{it}$ is a dummy variable with a value of 1 if unit $i$ is exposed to the intervention, and a value of 0 otherwise. $\varepsilon_{it}$ is an error term.
Estimating the effect of the intervention with the synthetic control method requires the creation of a “synthetic control unit”, which is a weighted combination of other units that are not exposed to the intervention. The estimation process for vector
W is proposed in the literature [
22], as is the significance of the estimation. The estimation of
W is achieved by Stata 15.0 using the “synth” command.
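To illustrate what estimating such a weight vector involves, the Python sketch below searches a coarse grid of non-negative weights that sum to one and keeps the combination that best reproduces the treated city's pre-outbreak passenger series. This is a toy illustration only and is not the algorithm implemented by the Stata “synth” command, which also matches on predictor variables; the function names and all series are hypothetical.

```python
from itertools import product


def fit_weights(treated_pre, donors_pre, grid=20):
    """Search weights k/grid (non-negative, summing to one) that minimise
    the squared pre-outbreak gap between the treated city and the donors."""
    n = len(donors_pre)
    best_w, best_loss = None, float("inf")
    for head in product(range(grid + 1), repeat=n - 1):
        if sum(head) > grid:
            continue  # the last weight would be negative
        w = [k / grid for k in head] + [(grid - sum(head)) / grid]
        loss = sum(
            (y - sum(wj * d[t] for wj, d in zip(w, donors_pre))) ** 2
            for t, y in enumerate(treated_pre)
        )
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w


def synthetic(weights, donors):
    """Weighted combination of the donor series, day by day."""
    days = range(len(donors[0]))
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in days]


# Hypothetical pre-outbreak series for three control cities and one treated city.
donors_pre = [[100.0, 120.0, 140.0], [200.0, 190.0, 210.0], [50.0, 80.0, 60.0]]
treated_pre = [175.0, 172.5, 192.5]
weights = fit_weights(treated_pre, donors_pre)
print(weights)  # [0.25, 0.75, 0.0]

# Gap between the observed and synthetic series after the outbreak.
donors_post = [[150.0], [220.0], [70.0]]
treated_post = [120.0]
effect = [y - s for y, s in zip(treated_post, synthetic(weights, donors_post))]
print(effect)  # [-82.5]
```

A grid search is only practical for a handful of control cities; with more donors a constrained optimizer would be needed.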
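Finally, the impact of the local outbreak on air traffic is shown in Equation (6):

$$\hat{\alpha}_{it} = Y_{it} - \sum_{j \neq i} w_j^{*} Y_{jt}, \quad t > T_0 \tag{6}$$

where $w_j^{*}$ is the estimated weight of control city $j$ in $W$.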
3.4. Goodness of Fit
In the synthetic control method, the outputs of the “synth” command include $W$, the balance of the variables, and the root mean square prediction error (RMSPE). The RMSPE is the square root of the average squared discrepancy between $Y_{it}$ in the treated unit and its synthetic counterpart $\sum_{j \neq i} w_j^{*} Y_{jt}$ during $T$ periods, and is written as follows:

$$\mathrm{RMSPE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( Y_{it} - \sum_{j \neq i} w_j^{*} Y_{jt} \right)^{2}}$$

For comparison, we also compute the root mean square prediction error (RMSPE) of the LightGBM method:

$$\mathrm{RMSPE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( Q_i(t) - \hat{Q}_i(t) \right)^{2}}$$

where $Q_i(t)$ is the actual number of air passengers departing from city $i$ on day $t$ before the outbreak, and $\hat{Q}_i(t)$ is the number of air passengers departing from city $i$ on day $t$ obtained through the machine learning method.
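The goodness-of-fit measure used for both methods can be computed in a few lines. The Python sketch below implements the standard root mean square prediction error for an actual series and a fitted series; the function name `rmspe` and the numbers are hypothetical.

```python
import math


def rmspe(actual, predicted):
    """Root mean square prediction error over the compared days."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("series must be non-empty and of equal length")
    mean_sq = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)


# Hypothetical pre-outbreak departures and their fitted values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 32.0, 38.0]
print(rmspe(actual, predicted))  # 2.0
```

Because the same function is applied to the synthetic control fit and to the machine learning fit, a smaller value indicates that the corresponding counterfactual tracks the pre-outbreak data more closely.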
3.5. Estimation of the Impact on Aviation Network
Assume that a city affected by a local outbreak is $i$, and that the city to be estimated, $j$, has a direct air route to $i$. In our research period, $t \in [t_s, t_e]$, the outbreak point is $t_0$, so city $j$ will also be affected through this connection when $t \in [t_0, t_e]$.
If the actual volume of air passengers departing from city $j$ is $Q_j(t)$, $Q_j^{-i}(t)$ is the traffic volume of city $j$ without passengers from $j$ to $i$. In the interval $[t_0, t_e]$, $Q_j(t)$ is the number of passengers actually observed after being affected by the epidemic in city $i$. The relative influence of city $i$ with an outbreak obtained in the previous section is $r_i(t)$.
First, we estimate the air traffic volume between city
i and city
j if there is no epidemic in city
i:
We have the traffic volume of city
j without a local outbreak:
Then, we have a related impact of city
i to city
j:
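The three steps above can be illustrated numerically. The Python sketch below is one possible reading and not necessarily the exact formulas of this paper: it assumes that passenger flow on the route from j to i shrinks by the same relative impact as the total departures of city i, so that the no-outbreak route flow is the observed route flow divided by one plus that relative impact. The function name `network_impact` and all figures are hypothetical.

```python
def network_impact(total_j, excl_i_j, rel_impact_i):
    """Relative impact on city j of an outbreak in city i, day by day.

    total_j:      observed departures from j (all destinations)
    excl_i_j:     observed departures from j excluding passengers to i
    rel_impact_i: relative impact on city i (negative for a loss, > -1)
    Assumption: the j-to-i route is reduced in the same proportion as
    city i's total departures.
    """
    impacts = []
    for q_j, q_j_excl, r_i in zip(total_j, excl_i_j, rel_impact_i):
        if r_i <= -1.0:
            raise ValueError("relative impact must be greater than -1")
        route_observed = q_j - q_j_excl                  # passengers j -> i
        route_no_outbreak = route_observed / (1.0 + r_i)  # step 1
        q_j_no_outbreak = q_j_excl + route_no_outbreak    # step 2
        impacts.append((q_j - q_j_no_outbreak) / q_j_no_outbreak)  # step 3
    return impacts


# Hypothetical day: 1000 departures from j, 100 of them to i, and a 50% loss in i.
impacts = network_impact([1000.0], [900.0], [-0.5])
print(round(impacts[0], 4))  # -0.0909
```

In this example the route to i would have carried 200 passengers without the outbreak, so city j loses 100 of an expected 1100 departures, or about 9%.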
5. Conclusions
We describe the impact of COVID-19 local outbreaks on air demand by constructing a counterfactual framework. In this research, we constructed the counterfactual control group through both the synthetic control method and the machine learning method. In comparing the two, we found that the synthetic control method cannot be applied to Beijing, a city with a special population, economy, airport composition, and urban scale. For prefecture-level cities, such as Dalian, the synthetic control method is feasible, but it cannot capture the impact of weather and other external characteristics on air traffic. Through machine learning, the counterfactual air demand of the experimental group is closer to reality, and the estimated impact is more accurate.
We transform the time series prediction problem into a spatial fitting problem through the LightGBM algorithm, predicting the air demand of cities without a sudden epidemic, which avoids the problems of a small amount of historical data and a short prediction period. In the study, we found that the impact of a local outbreak on air passengers does not disappear with the end of the epidemic, and that its impact is longer-term. After the epidemic situation in China was basically controlled, the restrictive policies were lifted, and the number of air passengers began to recover. The number of air passengers in China fully recovered to the historical level during the Golden Week of National Day. According to the recovery trend of air passengers before the epidemic and the conclusions obtained in this study, if there had been no local outbreak, the number of air passengers in China might have returned to the historical level in August. By estimating the impact of the outbreak on the aviation network, we found that if the outbreak occurs in mega cities, it will have a great impact on the whole aviation network, especially on tourist cities. Similarly, the ability of large stations, such as Shanghai, Guangzhou, and Shenzhen, to resist the impact of sudden outbreaks is stronger than that of small stations and tourist cities. For a prefecture-level city, such as Dalian, due to the relatively small number of connected cities and flights, the impact of the epidemic on the whole aviation network is relatively small.
The aviation industry is not only one of the carriers of virus transmission but also one of the industries most seriously affected by the epidemic. In particular, local epidemics broke out in many places in China at the beginning of 2022 and had a significant impact on China’s aviation industry just as it was getting back on track. According to the research of this paper, the industry cannot fully recover from such an impact in a short time after the epidemic is controlled. As we know, COVID-19 has a high socioeconomic impact in the long and short terms [
42], especially on tourism and its value chain (hotels, restaurants, etc.), which are related to air transport activities that affect multiple sectors of the economy. Our research also suggests that, when local outbreaks happen, tourism may lose a large share of its passenger flow. Therefore, if we want to restore the tourism economy, we must control the epidemic.