Air Quality Prediction and Ranking Assessment Based on Bootstrap-XGBoost Algorithm and Ordinal Classification Models

Yang, Jingnan; Tian, Yuzhu; Wu, Chun Ho

doi:10.3390/atmos15080925


by Jingnan Yang ¹, Yuzhu Tian ¹,* and Chun Ho Wu ²,*

¹ School of Mathematics and Statistics, Northwest Normal University, Lanzhou 730070, China
² Big Data Intelligence Centre, The Hang Seng University of Hong Kong, Hong Kong, China
* Authors to whom correspondence should be addressed.

Atmosphere 2024, 15(8), 925; https://doi.org/10.3390/atmos15080925

Submission received: 23 June 2024 / Revised: 16 July 2024 / Accepted: 26 July 2024 / Published: 2 August 2024

(This article belongs to the Special Issue Atmospheric Pollutants: Monitoring and Observation)


Abstract:

With the rapid development of industry and the acceleration of urbanisation, air pollution is becoming more serious. Exploring the factors that affect air quality and accurately predicting the air quality index (AQI) are important for improving overall environmental quality and realising green economic development. Machine learning algorithms and statistical models have been widely used in air quality prediction and ranking assessment. In this paper, based on daily air quality data for the city of Xi’an, China, from 1 October 2022 to 30 September 2023, we construct support vector regression (SVR), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), random forest (RF), neural network (NN) and long short-term memory (LSTM) models to analyse the factors influencing the AQI for Xi’an and to compare their predictive performance. The predicted values and 95% prediction intervals of the AQI for the next 15 days for Xi’an are then given based on the Bootstrap-XGBoost algorithm. Further, ordinal logit and ordinal probit regression models are constructed to evaluate and predict the AQI ranks for Xi’an from 1 October 2023 to 15 October 2023. Finally, some suggestions and policy measures are proposed based on these findings.

Keywords:

AQI; air quality rank; XGBoost; neural network; ordinal probit

1. Introduction

As industrialisation and urbanisation advance and urban energy consumption increases, the world faces regional and compound air pollution problems of varying severity. The deterioration of the ecological environment caused by atmospheric pollution has become increasingly evident and is closely related to human health, making it a significant public health problem. Relevant studies [1,2,3] have shown that severe air pollution can cause a variety of respiratory diseases, such as chronic pharyngitis, chronic bronchitis, and bronchial asthma, and that exposure to atmospheric particulate matter and near-surface ozone is significantly correlated with morbidity and mortality from chronic cardiovascular diseases. In addition, severe air pollution exacerbates haze and reduces near-surface visibility, leading to problems such as traffic congestion and flight delays [4]. Addressing this issue is consistent with the Sustainable Development Goals (SDGs) of “good health and well-being” and “sustainable cities and communities”, which emphasise the importance of addressing environmental health risks and promoting sustainable urban development. Therefore, it is of great significance to explore how air pollutant concentrations and other factors influence air quality and to establish an efficient and accurate urban air quality prediction model to promote the prevention and control of air pollution in the country.

Currently, methods for analysing and predicting air quality are divided into two main categories: statistical methods and machine learning algorithms. Statistical methods are mostly linear and mainly use traditional modelling strategies to predict air quality [5]. For example, Niu et al. [6] used the auto-regressive moving average model to study the composite air quality index for Chengdu. The results of the study showed that the air quality in Chengdu had no improving trend in the coming period. Jian et al. [7] predicted airborne particulate matter concentrations via the auto-regressive integrated moving average model. Abedi et al. [8] used the auto-regressive distributed lag (ARDL) model to study the relationship between air pollution and respiratory and cardiovascular diseases. The results showed that for every 10-unit increase in the AQI, the number of hospitalisations for cardiovascular diseases increased by 7.3%. Woldu [9] used the ARDL model to study the bidirectional causal relationship between urbanisation, globalisation, and CO₂ emissions in Mozambique. With their rapid development, new-generation information technologies such as big data analytics and machine learning can perform analysis and prediction by learning data patterns and optimising algorithms [10,11,12,13]. Because of their high accuracy and reliability, many scholars use different machine learning algorithms to study and predict air quality. Biancofiore et al. [14] used a recursive neural network model to obtain real-time information on PM₁₀ and PM_2.5 in Adriatic coastal cities. Yang et al. [15] used a space–time support vector regression model to predict hourly PM_2.5 concentrations. Pawul [16], based on Polish National Environmental Monitoring System data and meteorological data, developed a neural network model for air pollutant prediction. Ma et al. [17] performed an impact analysis of air quality based on an extreme gradient boosting model. The results of the study showed that six factors, such as personal income and power plant density, have the greatest impact on air quality in the United States. Zhao et al. [18] integrated feedforward neural networks and recurrent neural networks to predict air quality hourly in northwestern China. Bekkar et al. [19] performed hour-by-hour forecasting of the PM_2.5 concentration in Beijing based on a neural network with a long short-term memory network model. Huang et al. [20] used spatio-temporal information to forecast air quality through deep convolutional networks. Samad et al. [21] used five machine learning models, such as ridge regression, support vector regression, and random forest, to study pollutant concentrations at monitoring sites. Zhang et al. [22] used a spatial transform network model to forecast PM_2.5 concentrations in the next 6, 12, 24, and 48 h in Beijing and Taizhou. In addition, many other scholars have studied and forecasted air quality ranks. Liu et al. [23] forecasted the air quality ranks for Bayannur based on an integrated model of stepwise regression, principal component analysis, and BP neural network (STEPDISC-PCA-BP). Ratković et al. [24] accurately predicted air quality levels in the next hour in a city in Montenegro by using a hybrid LSTM model. Zhao et al. [25] proposed a detailed examination model of air quality with a co-training semi-supervised learning approach.

This paper models and predicts the daily air quality in the city of Xi’an, China, using data from 1 October 2022 to 30 September 2023 together with the Bootstrap-XGBoost algorithm and ordinal classification models. The paper is organised as follows: Section 2 preprocesses the data and draws a stacked plot of air quality percentages to study the seasonality of air pollution in Xi’an, China; Section 3 constructs the SVR, GBDT, XGBoost, RF, NN, and LSTM models and compares their predictions on the test set; Section 4 proposes the Bootstrap-XGBoost algorithm, which combines the best-performing XGBoost model with the Bootstrap method, and gives the predicted values and prediction intervals of the AQI for the next 15 days in Xi’an, China; Section 5 forecasts the AQI ranks for the next 15 days based on the ordinal logit and ordinal probit regression models; Section 6 proposes targeted recommendations and improvement measures based on the results of the study. The research framework of this paper is shown in Figure 1.

2. Data Sources and Preprocessing

This section gives a clear explanation of the data sources used in this study and their descriptive analysis.

2.1. Data Sources

The data in this paper were published on the official statistics website of the China Air Quality Online Testing Platform (https://www.aqistudy.cn/historydata/ (accessed on 29 November 2023)); daily air quality data for one year, from 1 October 2022 to 30 September 2023, were collected for Xi’an, China.

The data used in this paper include the air quality index (AQI) and the concentrations of six pollutants: sulphur dioxide (SO₂), nitrogen dioxide (NO₂), fine particulate matter (PM_2.5), carbon monoxide (CO), particulate matter (PM₁₀), and ozone (O₃). The basic information about these data and the related descriptions are shown in Table 1. In 1999, the United States Environmental Protection Agency (USEPA) established the AQI as a quantitative way of interpreting air quality [26]. The individual indexes of the six criteria pollutants are calculated by Equation (1), and the maximum value is determined as the AQI. The higher the AQI, the more severe the air pollution and the greater the risk to human health.

I_{p} = \frac{I_{Hi} - I_{Lo}}{{BP}_{Hi} - {BP}_{Lo}} (C_{p} - {BP}_{Lo}) + I_{Lo} .

(1)

where $I_{p}$ is the index for pollutant $p$, $C_{p}$ is the truncated concentration of pollutant $p$, ${BP}_{Hi}$ is the concentration breakpoint that is greater than or equal to $C_{p}$, ${BP}_{Lo}$ is the concentration breakpoint that is less than or equal to $C_{p}$, $I_{Hi}$ is the AQI value corresponding to ${BP}_{Hi}$, and $I_{Lo}$ is the AQI value corresponding to ${BP}_{Lo}$.
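As a minimal sketch of how Equation (1) could be evaluated, the Python snippet below interpolates an individual index between two breakpoints and takes the maximum over pollutants as the AQI. The breakpoint table is an assumption made for illustration (the paper does not list the values it used); it follows the commonly published 24-h PM_2.5 breakpoints of the Chinese standard, and the names `sub_index` and `aqi` are hypothetical.

```python
from bisect import bisect_left

# Illustrative 24-h PM2.5 breakpoints (ug/m3) and matching index values.
# ASSUMPTION: these values are for demonstration only; the paper does not
# state the breakpoint table it used.
BP = [0, 35, 75, 115, 150, 250, 350, 500]
IDX = [0, 50, 100, 150, 200, 300, 400, 500]


def sub_index(c_p, bp=BP, idx=IDX):
    """Individual index I_p of Equation (1), by linear interpolation."""
    if not bp[0] <= c_p <= bp[-1]:
        raise ValueError("concentration outside the breakpoint table")
    # Locate the segment [BP_Lo, BP_Hi] that contains c_p.
    hi = max(1, bisect_left(bp, c_p))
    lo = hi - 1
    return (idx[hi] - idx[lo]) / (bp[hi] - bp[lo]) * (c_p - bp[lo]) + idx[lo]


def aqi(sub_indices):
    """The AQI is the maximum of the individual pollutant indexes."""
    return max(sub_indices)


example_index = sub_index(55)          # PM2.5 concentration of 55 ug/m3
example_aqi = aqi([40.0, example_index, 12.5])
```

With these assumed breakpoints, a concentration of 55 lies between 35 and 75, so the index is interpolated between 50 and 100.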

2.2. Data Preprocessing and Seasonal Air Pollution Percentage Analysis

Data preprocessing is the first step to establishing a statistical learning model, and this paper uses the interpolation method and imputed package in R to fill in the missing values; the interpolation method formula is

\hat{X} = 2 X_{t} - X_{t - 1} .

(2)
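The following Python sketch illustrates Equation (2) on a short series. It assumes that $\hat{X}$ estimates the value immediately following $X_{t}$, i.e., a missing observation is replaced by a linear extrapolation of the two preceding values; the paper does not spell out this indexing, and the function name `fill_missing` is hypothetical.

```python
def fill_missing(series):
    """Fill None gaps with 2 * X_t - X_{t-1}, using the two preceding values."""
    filled = list(series)
    for i, value in enumerate(filled):
        if value is None:
            if i < 2:
                raise ValueError("need two preceding values to extrapolate")
            # Equation (2): extrapolate linearly from the previous two points.
            filled[i] = 2 * filled[i - 1] - filled[i - 2]
    return filled


example_filled = fill_missing([10, 12, None, 15])
```

Consecutive gaps are filled left to right, so each filled value feeds the next extrapolation.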

December to February is defined as winter, March to May as spring, June to August as summer, and September to November as fall. With the help of Tableau software 2024.2.0 (20242.24.0613.1930), the daily air quality levels in Xi’an, China, were counted, and the proportion of each of the six air quality levels in each of the four seasons was calculated and analysed seasonally, as shown in Figure 2.

It is obvious from Figure 2 that the air quality in Xi’an, China, in the past year was mainly “Good”, accounting for 45.21% of the total data. In summer, the degree of air pollution was the smallest, and the air quality was the best, mainly “Good” and “Excellent”; the air quality in spring and fall was “Excellent” and “Good”, second only to summer, with an increase in the number of “Mild pollution” days and some days being also accompanied by “Moderate pollution” and average air quality; with the arrival of cold air, the number of days with air quality of “Mild pollution” and above in Xi’an, China, in winter accounted for 16.44%, which is more than twice the number of days with air quality of “Excellent” and “Good”. Especially in winter, Xi’an is characterised by severe haze and poor air quality. This is due to the city’s geographical location and climate conditions. During the winter months, Xi’an experiences temperature inversions, which trap pollutants close to the ground, leading to a buildup of particulate matter and other air pollutants. Additionally, the increased heating demand in winter leads to greater emissions from coal-fired power plants and residential heating sources, further exacerbating the air quality issues. As a result, the air situation in Xi’an is not optimistic during the autumn and winter seasons.

3. Empirical Analysis of AQI Prediction

In this section, the selected dataset is empirically analysed by using the SVR, GBDT, XGBoost, RF, NN, and LSTM models. Here, 80% of the dataset is used as the training set and 20% as the test set.

3.1. Analysis Based on SVR Model

SVR is a machine learning algorithm for classification and regression that utilises a kernel function to map the data into a high-dimensional space [27]. It searches for the optimal hyperplane to separate different categories of data, thus realising classification or regression prediction.

Let the training data be $(x_{i}, y_{i})$, $i = 1, 2, \dots, n$, where $x_{i} \in R^{d}$ is the feature vector of the ith sample and $y_{i} \in R$ is the true value of the ith sample. The objective function of the SVR model can be expressed as

\underset{w, b}{\min} \frac{1}{2} {‖w‖}^{2} + C \sum_{i = 1}^{n} [y_{i} - f (x_{i})]^{2} .

(3)

where $w$ is the weight vector, $b$ is the bias term, $C$ is the regularisation parameter, and ${‖w‖}^{2}$ is the squared $L_{2}$ norm of the weight vector.

3.2. Analysis Based on GBDT Model

GBDT is an ensemble machine learning algorithm based on the Boosting strategy proposed by Friedman (2001) [28]. It is an iterative algorithm consisting of multiple decision trees. The results obtained from the individual decision trees are accumulated to give the final result, and it is one of the best-performing methods in statistical learning. The specific implementation process of the GBDT model is as follows.

1. Initialise the learner:

f_{0} (x) = \underset{C}{\arg\min} \sum_{i = 1}^{N} L (y_{i}, C) .

(4)

2. For each tree $m = 1, 2, \dots, M$ and each sample $i = 1, 2, \dots, N$, calculate the corresponding negative gradient, i.e., the residual

r_{i m} = - [\frac{\partial L (y_{i}, f (x_{i}))}{\partial f (x_{i})}] .

(5)

where $L$ is the loss function, $C$ is a constant, $f (x_{i})$ is the value predicted for the ith sample by the current learner, and $y_{i}$ is the true value of the ith sample.
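The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the configuration used in the paper: the loss is taken to be $L (y, f) = (y - f)^{2} / 2$ (so the negative gradient is the ordinary residual), each tree is a one-split stump on a single feature, and the names `fit_stump`, `gbdt_fit`, `n_trees`, and `learning_rate` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (squared error).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gbdt_fit(xs, ys, n_trees=50, learning_rate=0.1):
    """Gradient boosting for the squared loss L(y, f) = (y - f)^2 / 2."""
    # Equation (4): the constant minimising the squared loss is the mean.
    f0 = sum(ys) / len(ys)
    trees = []
    preds = [f0] * len(ys)
    for _ in range(n_trees):
        # Equation (5): for the squared loss the negative gradient is y - f(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(t(x) for t in trees)


# Toy data, invented purely for demonstration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = gbdt_fit(xs, ys)
mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble, so the training error of `model` falls below that of the constant initial learner.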

3.3. Analysis Based on XGBoost Model

XGBoost is a supervised learning model that combines classifiers with lower accuracy into a model with higher accuracy, mainly to reduce the model error [29]. It is built from multiple CART trees and has a strong generalisation ability. The algorithm uses trees as base classifiers, approximates the loss function by a second-order Taylor expansion, and adds a regularisation term to the objective function, thus avoiding an overly complex model. The objective function is

P (x) = L (x) + E (x),

(6)

where $L (x)$ is the loss function, which measures how well the model fits the training dataset, and $E (x)$ is the regularisation term, which measures the complexity of the model. If the model generates $J$ trees and the combined output of the base classifiers is

{\hat{y}}_{i} = \sum_{j = 1}^{J} f_{j} (x_{i}), f_{j} \in F,

(7)

where $f_{j}$ is one of the base classifiers in $F$, $J$ is the number of base classifiers, and $F$ is the set of all base classifiers, then the objective function of the XGBoost model is

P (x) = \sum_{i = 1}^{n} L (y_{i}, {\hat{y}}_{i}) + \sum_{j = 1}^{J} E (f_{j}) .

(8)
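For reference, the second-order Taylor expansion mentioned above can be written out as follows. This is the standard form from the original XGBoost formulation rather than something stated in this paper; the symbols $g_{i}$, $h_{i}$, $γ$, $λ$, $T$ (number of leaves), and $w_{k}$ (leaf weights) are therefore assumptions introduced here for illustration.

```latex
P^{(j)} \approx \sum_{i = 1}^{n} \left[ L \left( y_{i}, \hat{y}_{i}^{(j - 1)} \right)
        + g_{i} f_{j} (x_{i}) + \frac{1}{2} h_{i} f_{j}^{2} (x_{i}) \right] + E (f_{j}),

g_{i} = \frac{\partial L \left( y_{i}, \hat{y}_{i}^{(j - 1)} \right)}{\partial \hat{y}_{i}^{(j - 1)}},
\qquad
h_{i} = \frac{\partial^{2} L \left( y_{i}, \hat{y}_{i}^{(j - 1)} \right)}{\partial \left( \hat{y}_{i}^{(j - 1)} \right)^{2}},

E (f_{j}) = γ T + \frac{1}{2} λ \sum_{k = 1}^{T} w_{k}^{2} .
```

Here $\hat{y}_{i}^{(j - 1)}$ denotes the prediction after the first $j - 1$ trees, so the $j$th tree is chosen to minimise an approximation that depends on the loss only through its first and second derivatives.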

3.4. Analysis Based on RF Model

RF is an ensemble learning method proposed by Breiman [30] which is commonly used for classification, regression, and other machine learning tasks. The principle is to construct a large number of decision trees during training, each grown independently of the others; when a new sample enters the algorithm, each decision tree makes a judgment separately and identifies which category the sample should belong to. The sample is then assigned to a category according to the votes of the trees (for regression, the outputs of the trees are averaged). The flow of the algorithm is shown in Figure 3. RF is a robust ensemble algorithm that combines bagged decision trees with random split selection, effectively correcting the overfitting problem of single decision trees.

3.5. Analysis Based on NN Model

NN is a computational model inspired by the human nervous system. It consists of many interconnected neurons that transmit and process information through weights. A neural network usually consists of multiple layers, including input, hidden, and output layers. It is often used to solve complex pattern recognition and machine learning problems. The input layer receives raw data, the hidden layer processes and extracts features from the input data, and the output layer generates the final output. The connection weight of each neuron in the network determines how much the input affects the output. In Figure 4, $x_{i}$ $(i = 1, 2, 3)$ is the value of the input layer, and $a_{i}^{(k)}$ $(k = 1, 2, \dots, K; i = 1, 2, \dots, N_{k})$ denotes the activation value of the ith neuron (the output of that neuron) in the kth layer.
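As a brief illustration of how the activation values are obtained, the usual feedforward recursion is written out below. The weight matrix $W^{(k)}$, the bias vector $b^{(k)}$, and the activation function $σ$ are notation assumed here; the paper does not specify them.

```latex
a^{(0)} = x, \qquad
a^{(k)} = σ \left( W^{(k)} a^{(k - 1)} + b^{(k)} \right), \quad k = 1, 2, \dots, K,
```

where $a^{(k)} = (a_{1}^{(k)}, \dots, a_{N_{k}}^{(k)})^{T}$ collects the activation values of the kth layer, so each layer applies a weighted sum followed by a nonlinearity to the output of the previous layer.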

3.6. Analysis Based on LSTM Model

Hochreiter and Schmidhuber proposed the LSTM model in the late 1990s [31]; it is a variant of the traditional RNN. Compared with the traditional neural network model, LSTM can handle long-term dependencies in temporal data while mitigating the vanishing gradient problem. It captures long-term dependencies in sequence data by introducing memory units together with input, forget, and output gates, and it is trained with the error back-propagation algorithm. The structural flowchart of the LSTM model is shown in Figure 5.

3.7. Forecast Results and Comparative Analysis

After preprocessing the data, the AQI values were predicted on the test set using the above models, and the actual and predicted values are shown in Figure 6. It can be seen that the six models have similar prediction trends for the AQI and that the predicted values are close to the true values, indicating that the models capture the relationships between the AQI and the concentrations of the six pollutants well and fit the data well.

3.8. Model Evaluation

In order to assess the effectiveness of different models in fitting the AQI, four evaluation indexes, namely, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Goodness-of-Fit (R-squared) were selected to evaluate the results of the six model fits in this paper.

\begin{array}{l} \mathrm{MAE} (y, \hat{y}) = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - {\hat{y}}_{i}|, \\ \mathrm{RMSE} (y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}, \\ \mathrm{MAPE} (y, \hat{y}) = \frac{100\%}{n} \sum_{i = 1}^{n} |\frac{y_{i} - {\hat{y}}_{i}}{y_{i}}|, \\ R^{2} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}} . \end{array}

(9)

where $y_{i}$ is the AQI observation of the ith sample, ${\hat{y}}_{i}$ is the AQI prediction of the ith sample, and $\bar{y}$ is the sample mean. Smaller MAE, RMSE, and MAPE values and a larger R-squared value indicate better prediction. The fitting results are shown in Table 2. From the table, one can see that the MAE and MAPE values of the XGBoost model are the smallest and that its R-squared value is the largest. The XGBoost model therefore predicts the AQI for Xi’an, China, with the smallest error and the best fit, so the prediction of the AQI for the next 15 days below is based on this model.
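The four indexes in Equation (9) are straightforward to compute. The Python sketch below is a direct transcription using only the standard library; the function names and the three-point example are illustrative assumptions and are not taken from Table 2.

```python
import math


def mae(y, y_hat):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean Absolute Percentage Error, in percent (requires non-zero y)."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, y_hat))


def r_squared(y, y_hat):
    """Goodness of fit: 1 - SSE / SST."""
    y_bar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - sse / sst


# Invented observations and predictions, for demonstration only.
y_obs = [100.0, 50.0, 80.0]
y_pred = [90.0, 55.0, 80.0]
scores = (mae(y_obs, y_pred), rmse(y_obs, y_pred),
          mape(y_obs, y_pred), r_squared(y_obs, y_pred))
```

Note that MAPE is undefined when an observation equals zero, which is not a concern for AQI values but matters if the functions are reused elsewhere.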

4. AQI Prediction Based on Bootstrap-XGBoost

The bootstrap method is a resampling method in statistics that is used to give the standard error of prediction and prediction interval in predictive analysis. In this paper, time-series data were considered, so the residual Bootstrap method was used, and the prediction in each cycle of the Bootstrap method adopted the XGBoost model. The specific algorithm flow is shown in Algorithm 1.

Algorithm 1: The Bootstrap-XGBoost algorithm

1: Input: Dataset $(x_{t}, y_{t})$; $B$: Bootstrap sample size
2: Output: The $B$ Bootstrap prediction results
3: Fit a model to the preprocessed data: $y_{t} = f (x_{t}) + ε_{t}$
4: Predict on the preprocessed data: ${\hat{y}}_{t} = \hat{f} (x_{t})$
5: Calculate the prediction residuals: $ε_{t} = y_{t} - {\hat{y}}_{t}$
6: Residual Bootstrap step:
7: Set the number of resampling times $B$
8: for $b = 1, 2, \dots, B$ {
9:     Obtain a new sample $ε_{t}^{(b)}$ by resampling $ε_{t}$:
10:     $ε_{t}^{(b)} \leftarrow sample (ε_{t}, replace = TRUE)$
11:     Get the new samples: $y_{t}^{(b)} = {\hat{y}}_{t} + ε_{t}^{(b)}$
12:     Obtain the Bootstrap prediction result ${\hat{y}}_{t}^{(b)}$ by fitting the new sample via the XGBoost algorithm: $(x_{t}, y_{t}^{(b)}) \to {\hat{y}}_{t}^{(b)} = {\hat{f}}_{b} (x_{t})$
}
13: Based on the series of ${\hat{y}}_{t}^{(b)}$, calculate the prediction standard deviation and the 95% prediction interval

Based on the proposed Bootstrap-XGBoost algorithm, the standard deviations and 95% prediction intervals of the predicted AQI for the next 15 days in Xi’an, China, were obtained by setting $B = 500$, as shown in Table 3.
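The residual Bootstrap loop of Algorithm 1 can be sketched in Python as follows. To keep the example self-contained, an ordinary least-squares line is swapped in for the XGBoost fit, so this is not the model used in the paper. The percentile rule used to form the 95% interval is also an assumption, since the paper does not state how the interval is computed from the Bootstrap predictions. All function and variable names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in for the XGBoost fit."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def residual_bootstrap(xs, ys, x_new, n_boot=500, seed=0):
    """Return (mean, sd, lower, upper) of the Bootstrap predictions at x_new."""
    rng = random.Random(seed)
    f_hat = fit_line(xs, ys)                                  # steps 3-4
    fitted = [f_hat(x) for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]           # step 5
    draws = []
    for _ in range(n_boot):                                   # steps 8-12
        # Step 10: resample the residuals with replacement.
        resampled = rng.choices(residuals, k=len(residuals))
        # Step 11: build the new response from fitted values plus residuals.
        y_star = [f + e for f, e in zip(fitted, resampled)]
        # Step 12: refit on the new sample and predict.
        draws.append(fit_line(xs, y_star)(x_new))
    draws.sort()
    # Step 13 (percentile rule, an assumption): 2.5% and 97.5% points.
    lower = draws[int(0.025 * n_boot)]
    upper = draws[int(0.975 * n_boot) - 1]
    return statistics.fmean(draws), statistics.stdev(draws), lower, upper


# Toy data, invented for demonstration: a line with alternating noise.
xs = list(range(1, 21))
ys = [2.0 * x + 0.5 * (-1) ** x for x in xs]
mean_pred, sd_pred, lower, upper = residual_bootstrap(xs, ys, x_new=21)
```

Fixing `seed` makes the resampling reproducible; replacing `fit_line` with a call to an XGBoost regressor would recover the structure of Algorithm 1.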

Table 3 and Figure 7 show that the actual values generally fall within the 95% prediction intervals. The prediction accuracy is high for most of the forecast days, which indicates that the prediction of the future short-term AQI using the Bootstrap-XGBoost method has a high degree of reliability and accuracy, whereas predictions for longer periods need to take other influencing factors into account.

5. AQI Rank Assessment

The ordinal logit and probit models are generalised linear models used to establish the relationship between ordinal categorical variables and independent variables and are commonly used in biomedicine, socioeconomics, and machine learning.

5.1. Ordinal Logit Model and Ordinal Probit Model

In this section, the response variable, the AQI rank, is classified into six categories, namely, “Excellent”, “Good”, “Mild pollution”, “Moderate pollution”, “Severe pollution”, and “Serious pollution”. Based on the following two models, the AQI ranks for Xi’an, China, are evaluated and then predicted for the period from 1 October 2023 to 15 October 2023.

Assuming that the responses $y_{i} \in \{1, 2, \dots, C\}$ are ordinal variables and the predictor variables are $X = (x_{1}, x_{2}, \dots, x_{k})^{T}$, the expression of the ordinal logit regression model is

\mathrm{Logit} [P (y_{i} \leq j)] = \log [\frac{P (y_{i} \leq j)}{P (y_{i} > j)}] = α_{j} + x_{i}^{T} β, i = 1, 2, \dots, n; j = 1, 2, \dots, C - 1 .

(10)

The ordinal probit regression model expression is

Φ^{- 1} [P (y_{i} \leq j)] = α_{j} + x_{i}^{T} β, i = 1, 2, \dots, n; j = 1, 2, \dots, C - 1 .

(11)

where $x_{i} = (x_{i 1}, x_{i 2}, \dots, x_{i k})^{T}$ denotes the $k$ pollutant concentration indicators affecting the AQI, $β$ is the vector of regression coefficients, $α_{1} \leq \dots \leq α_{C - 1}$ are the model intercepts, and $Φ (\cdot)$ is the cumulative distribution function of the standard normal distribution. Only $C - 1$ intercepts are needed because $P (y_{i} \leq C) = 1$.
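To make the step from cumulative probabilities to category probabilities concrete, the Python sketch below evaluates Equations (10) and (11) as printed, i.e., with $P (y_{i} \leq j) = F (α_{j} + x_{i}^{T} β)$, and differences the cumulative values. The intercepts and linear predictor in the example are invented for illustration, and the function names are hypothetical. Some software, for example `polr` in the R package MASS, instead parameterises the model as $α_{j} - x_{i}^{T} β$; in that case the negated linear predictor would be passed.

```python
import math


def logistic(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probs(alphas, eta, link=logistic):
    """P(y = j), j = 1..C, from C - 1 ordered intercepts and eta = x^T beta."""
    # Cumulative probabilities P(y <= j); the last category closes at 1.
    cumulative = [link(a + eta) for a in alphas] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1]
              for j in range(1, len(cumulative))]
    return probs


# Invented intercepts and linear predictor, for demonstration only.
example_alphas = [-1.0, 0.5, 2.0]
probs_logit = category_probs(example_alphas, 0.3)
probs_probit = category_probs(example_alphas, 0.3, link=std_normal_cdf)
```

Because the intercepts are ordered, every difference is non-negative and the probabilities sum to one; the predicted rank is the category with the largest probability.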

5.2. Model Estimation Results

The results of parameter estimation using the above two models are shown in Table 4. Based on p-values of less than 0.05, it can be concluded that the PM_2.5, PM₁₀, SO₂, and O₃ concentrations have the most significant effects on the AQI rank, which means that the higher the concentrations of these pollutants, the more serious the air pollution and the higher the AQI rank.

According to Equation (10) and Table 4, the probability of each category of the AQI rank for Xi’an, China, under the ordinal logit regression model can be obtained by

\begin{array}{l} P_{1} = \frac{\exp (9.5826 + \mathrm{Estimate}_{L})}{1 + \exp (9.5826 + \mathrm{Estimate}_{L})} \\ P_{2} = \frac{\exp (23.2516 + \mathrm{Estimate}_{L})}{1 + \exp (23.2516 + \mathrm{Estimate}_{L})} - P_{1} \\ P_{3} = \frac{\exp (34.9270 + \mathrm{Estimate}_{L})}{1 + \exp (34.9270 + \mathrm{Estimate}_{L})} - P_{1} - P_{2} \\ P_{4} = \frac{\exp (46.4949 + \mathrm{Estimate}_{L})}{1 + \exp (46.4949 + \mathrm{Estimate}_{L})} - P_{1} - P_{2} - P_{3} \\ P_{5} = \frac{\exp (67.1061 + \mathrm{Estimate}_{L})}{1 + \exp (67.1061 + \mathrm{Estimate}_{L})} - P_{1} - P_{2} - P_{3} - P_{4} \end{array}

(12)

where

\mathrm{Estimate}_{L} = 0.1729 \times PM_{2.5} + 0.1033 \times PM_{10} - 0.3825 \times SO_{2} + 0.0415 \times NO_{2} + 0.0379 \times O_{3} - 1.2952 \times CO,

and $P_{1}$, $P_{2}$, $P_{3}$, $P_{4}$, $P_{5}$, and $P_{6} = 1 - P_{1} - P_{2} - P_{3} - P_{4} - P_{5}$ represent the probabilities of the AQI ranks “Excellent”, “Good”, “Mild pollution”, “Moderate pollution”, “Severe pollution”, and “Serious pollution”, respectively.

According to Equation (11) and Table 4, the probability of each category of AQI rank for Xi’an, China, under the ordinal probit regression model can be obtained by

\begin{array}{l} P_{1} = Φ (5.4850 + \mathrm{Estimate}_{P}) \\ P_{2} = Φ (14.0113 + \mathrm{Estimate}_{P}) - P_{1} \\ P_{3} = Φ (20.7566 + \mathrm{Estimate}_{P}) - P_{1} - P_{2} \\ P_{4} = Φ (28.0236 + \mathrm{Estimate}_{P}) - P_{1} - P_{2} - P_{3} \\ P_{5} = Φ (41.2004 + \mathrm{Estimate}_{P}) - P_{1} - P_{2} - P_{3} - P_{4} \end{array}

(13)

where

\mathrm{Estimate}_{P} = 0.1109 \times PM_{2.5} + 0.0632 \times PM_{10} - 0.2405 \times SO_{2} + 0.0170 \times NO_{2} + 0.0224 \times O_{3} - 0.9499 \times CO .

5.3. AQI Ranking Forecast

The accuracy of predicting AQI rankings on the test set based on the ordinal logit regression and ordinal probit regression models developed in Section 5.1 and Section 5.2 is shown in Table 5. As can be seen from Table 5, the prediction accuracy of both the ordinal logit regression and ordinal probit regression models is above 86%, which is high, and the AQI ranks for Xi’an, China, for the period from 1 October to 15 October 2023, were predicted using these two models.

Table 6 shows that the predictions for the next 15 days based on the two models are entirely consistent with the actual results, and the prediction probabilities are mostly high, indicating that the two models have a good prediction effect on the AQI ranks.

6. Conclusions and Suggestions

In this paper, the prediction of the AQI from the PM_2.5, PM₁₀, SO₂, NO₂, O₃, and CO concentrations is investigated by building SVR, GBDT, XGBoost, RF, NN, and LSTM models and comparing their prediction results on the test set. It is found that the XGBoost model has the best prediction effect. On this basis, we propose a robust Bootstrap-XGBoost algorithm to give the AQI prediction values and 95% prediction intervals for the next 15 days in Xi’an, China; the results show that most of the predicted values are close to the true values and that the prediction intervals cover more than 70% of the true values. In addition, we fit the air quality ranks based on ordinal logit and ordinal probit regression models and forecast the AQI ranks from 1 October to 15 October 2023 in Xi’an, China. The results show that the prediction accuracy of both ordinal models over this period is 100%.

Based on the above conclusions, this paper proposes some applicable recommendations and targeted initiatives to improve air quality in Xi’an.

(1) As Xi’an is one of the megacities in China, preventing and controlling particulate pollution there has been crucial in recent years. While continuing to maintain emission reduction, the prevention and control of gaseous pollutants (especially O₃ and SO₂) should be strengthened. The main goals for the present and future are the synergistic prevention and control of atmospheric particulate matter and O₃ and the introduction of related policies and standards on VOC management, together with measures such as centralised heating, optimised motor vehicle travel restrictions, and increased vegetation cover. In addition, it is also necessary to raise public awareness of air pollution prevention and control.

(2) The air in Xi’an is dry, and in spring and winter, the soil has a low water content, so dust is easily carried into the air in windy weather. The relevant government departments should strengthen the monitoring of dust from construction sites and material stacking sites in urban areas and urge construction units to cover bare ground in a timely manner. At the same time, regular urban sanitation sweeps should be carried out to pick up garbage, wipe down fences, clean roads, and carry out spraying on schedule to keep the city clean and tidy.

(3) As Xi’an is one of the cities with a developed modern industrial system, it is essential to maintain the synergistic development of the green economy and create a better living environment by improving the resource utilisation rate, optimising the energy structure and industrial layout, continuing the comprehensive treatment and in-depth emission reduction of multiple pollution sources and pollutants from coal combustion, industry, transportation, and biomass combustion, and increasing the use of renewable resources and consumable alternative resources.

Author Contributions

Conceptualisation, J.Y. and Y.T.; methodology, Y.T.; software, J.Y.; validation, J.Y., Y.T. and C.H.W.; formal analysis, J.Y., Y.T. and C.H.W.; investigation, J.Y. and Y.T.; resources, Y.T. and C.H.W.; data curation, J.Y. and Y.T.; writing (original draft preparation), J.Y. and Y.T.; writing (review and editing), C.H.W.; visualisation, J.Y.; supervision, Y.T.; project administration, C.H.W.; funding acquisition, Y.T. and C.H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by grants from National Natural Science Foundation of China (grant No. 12061065), National Foundation for Social Sciences of China (grant No. 22BTY037), Funds for Innovative Fundamental Research Group Project of Gansu Province (grant No. 23JRRA684), and University Grants Committee of HK SAR (RMGS Project Acc. No. 700043).

Institutional Review Board Statement

Not applicable. This study does not involve humans or animals. Our study mainly involves the air quality forecast issue and does not address any matter which violates ethics approval.

Informed Consent Statement

Not applicable. This study does not involve humans or animals. Our study mainly involves the air quality forecast issue and does not address any matter which violates ethics approval.

Data Availability Statement

The data can be found at https://www.aqistudy.cn/historydata/ (accessed on 29 November 2023) or requested from the corresponding authors.

Acknowledgments

The authors would like to thank the School of Decision Sciences, Hang Seng University of Hong Kong, and the School of Mathematics and Statistics, Northwest Normal University, for supporting this research. Special thanks go to P.P.L. Leung for the assistance given in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chang, Q.; Zhang, H.; Zhao, Y. Ambient air pollution and daily hospital admissions for respiratory system–related diseases in a heavy polluted city in Northeast China. Environ. Sci. Pollut. Res. 2020, 27, 10055–10064.
Schwartz, J. Particulate air pollution and chronic respiratory disease. Environ. Res. 1993, 62, 7–13.
Chai, G.; He, H.; Sha, Y.; Zhai, G.; Zong, S. Effect of PM2.5 on daily outpatient visits for respiratory diseases in Lanzhou, China. Sci. Total Environ. 2019, 649, 1563–1572.
Gao, J.; Woodward, A.; Vardoulakis, S.; Kovats, S.; Wilkinson, P.; Li, L.; Xu, L.; Li, J.; Yang, J.; Cao, L.; et al. Haze, public health and mitigation measures in China: A review of the current evidence for further policy response. Sci. Total Environ. 2017, 578, 148–157.
Graupe, D.; Krause, D.; Moore, J. Identification of autoregressive moving-average parameters of time series. IEEE Trans. Autom. Control 1975, 20, 104–107.
Niu, B.; Yin, Y. The prediction and research of air quality in Chengdu based on ARMA model. Stat. Appl. 2016, 5, 365–372.
Jian, L.; Zhao, Y.; Zhu, Y.P.; Zhang, M.B.; Bertolatti, D. An application of ARIMA model to predict submicron particle concentrations from meteorological factors at a busy roadside in Hangzhou, China. Sci. Total Environ. 2012, 426, 336–345.
Abedi, A.; Baygi, M.M.; Poursafa, P.; Mehrara, M.; Amin, M.M.; Hemami, F.; Zarean, M. Air pollution and hospitalisation: An autoregressive distributed lag (ARDL) approach. Environ. Sci. Pollut. Res. 2020, 27, 30673–30680.
Woldu, G. Impact of urbanisation and globalisation on environmental quality in Mozambique: An ARDL bound testing approach. Int. J. Clim. Chang. Impacts Responses 2021, 13, 147–161.
Wu, C.H.; Ng, S.C.H.; Kwok, K.C.M.; Yung, K.L. Applying Industrial Internet of Things Analytics to Manufacturing. Machines 2023, 11, 448.
Wang, D.; Cao, J.; Zhang, B.; Zhang, Y.; Xie, L. A Novel Flexible Geographically Weighted Neural Network for High-Precision PM2.5 Mapping across the Contiguous United States. ISPRS Int. J. Geo-Inf. 2024, 13, 217.
Wu, C.H.; Wong, Y.S.; Ip, W.H.; Lau, H.C.W.; Lee, C.K.M.; Ho, G.T.S. Modeling the cleanliness level of an ultrasonic cleaning system by using design of experiments and artificial neural networks. Int. J. Adv. Manuf. Technol. 2009, 41, 287–300.
Lin, C.M.; Lin, Y.S. TPTM-HANN-GA: A Novel Hyperparameter Optimization Framework Integrating the Taguchi Method, an Artificial Neural Network, and a Genetic Algorithm for the Precise Prediction of Cardiovascular Disease Risk. Mathematics 2024, 12, 1303.
Biancofiore, F.; Busilacchio, M.; Verdecchia, M. Recursive neural network model for analysis and forecast of PM10 and PM2.5. Atmos. Pollut. Res. 2017, 8, 652–659.
Yang, W.; Deng, M.; Xu, F.; Wang, H. Prediction of hourly PM2.5 using a space-time support vector regression model. Atmos. Environ. 2018, 181, 12–19.
Pawul, M. Application of neural networks to the prediction of gas pollution of air. New Trends Prod. Eng. 2019, 2, 515–523.
Ma, J.; Ding, Y.; Cheng, J.C.P.; Jiang, F.; Tan, Y.; Gan, V.J.; Wan, Z. Identification of high impact factors of air quality on a national scale using big data and machine learning techniques. J. Clean. Prod. 2020, 244, 118955.
Zhao, Z.; Qin, J.; He, Z.; Li, H.; Yang, Y.; Zhang, R. Combining forward with recurrent neural networks for hourly air quality prediction in Northwest of China. Environ. Sci. Pollut. Res. 2020, 27, 28931–28948.
Bekkar, A.; Hssina, B.; Douzi, S. Air-pollution prediction in smart city, deep learning approach. J. Big Data 2021, 8, 161.
Huang, G.; Ge, C.; Xiong, T.; Song, S.; Yang, L.; Liu, B.; Yin, W.; Wu, C. Large scale air pollution prediction with deep convolutional networks. Sci. China Inf. Sci. 2021, 64, 192107.
Samad, A.; Garuda, S.; Vogt, U.; Yang, B. Air pollution prediction using machine learning techniques: An approach to replace existing monitoring stations with virtual monitoring stations. Atmos. Environ. 2023, 310, 119987.
Zhang, Z.; Zhang, S. Modeling air quality PM2.5 forecasting using deep sparse attention-based transformer networks. Int. J. Environ. Sci. Technol. 2023, 20, 13535–13550.
Liu, M.; Hu, H.; Zhang, L.; Zhang, Y.; Li, J. Construction of air quality level prediction model based on STEPDISC-PCA-BP. Appl. Sci. 2023, 13, 8506.
Ratković, K.; Kovač, N.; Simeunović, M. Hybrid LSTM Model to Predict the Level of Air Pollution in Montenegro. Appl. Sci. 2023, 13, 10152.
Zhao, Y.; Wang, L.; Zhang, N.; Huang, X.; Yang, L.; Yang, W. Co-Training Semi-Supervised Learning for Fine-Grained Air Quality Analysis. Atmosphere 2023, 14, 143.
Seo, J.H.; Jeon, H.W.; Sung, U.J.; Sohn, J.R. Impact of the COVID-19 outbreak on air quality in Korea. Atmosphere 2020, 11, 1137.
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. Available online: https://www.researchgate.net/publication/13853244_Long_Short-term_Memory (accessed on 29 November 2023).

Figure 1. The research framework of this paper.

Figure 2. Stacked graph of air quality shares by season.

Figure 3. The flow chart of the RF algorithm.

Figure 4. The illustration of NN.

Figure 5. LSTM model network structure.

Figure 6. The plot of predicted vs. actual AQI values on the test set.

Figure 7. Box plots with 95% confidence intervals.

Table 1. Basic characteristics of the data.

	Minimum	Maximum	Mean	Median	Mode	Skewness
AQI	13	439	86.1206	67	53	2.2285
SO₂	4	26	7.7342	7	6	2.1754
NO₂	8	90	36.3123	32	20	0.8220
PM₂.₅	6	283	52.2740	33	17	2.1467
CO	0.3	2	0.7047	0.6	0.6	1.5547
PM₁₀	10	1072	98.6603	76	46 and 52	4.6481
O₃	5	147	56.3726	52	20	0.5450

Table 2. Comparison of model fitting results.

Model	MAE	RMSE	MAPE (%)	R-Squared
SVR	9.6472	33.0640	10.68	0.7611
XGBoost	5.5969	16.1382	3.90	0.9431
GBDT	9.8143	21.3959	11.4	0.9000
RF	7.6624	19.2412	8.05	0.9191
CNN	9.1216	25.8005	9.86	0.8546
LSTM	6.2284	7.6542	38.11	0.2235
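
For readers who want to recompute scores like those in Table 2, the following is a minimal sketch of the four metrics using their standard textbook definitions. It assumes the paper follows these definitions; the function name `regression_metrics` and the toy values are hypothetical and are not taken from the Xi'an data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in %) and R-squared for paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the actual value, so actual values must be non-zero
    # (AQI readings are strictly positive).
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Toy values, invented for illustration only.
actual = [50.0, 100.0, 200.0]
predicted = [55.0, 90.0, 210.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because the squared-error average can never be smaller than the squared absolute-error average, MAE is always less than or equal to RMSE, which gives a quick sanity check on any row of such a table.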

Table 3. AQI forecast values and 95% prediction intervals.

Forecast Date	Actual Value	Forecast Value	Standard Deviation	95% Prediction Interval
October 1	56	56.8184	2.4194	[55.0003, 64.3353]
October 2	26	28.5179	2.2609	[25.3514, 33.7640]
October 3	25	23.0133	2.6373	[22.8364, 32.3976]
October 4	22	22.6515	2.6171	[21.0723, 30.7458]
October 5	26	28.0114	2.3195	[25.0484, 33.1558]
October 6	24	23.0133	2.5359	[22.7003, 31.5158]
October 7	36	33.5798	2.2505	[32.3784, 41.1384]
October 8	44	45.4493	1.9171	[44.9358, 52.4214]
October 9	56	56.8184	2.8158	[54.9225, 64.9447]
October 10	70	69.2584	2.6190	[68.1752, 78.4387]
October 11	77	77.8555	2.7382	[77.8773, 88.8032]
October 12	66	63.8876	3.2501	[61.4553, 73.4492]
October 13	41	41.7567	2.0295	[41.7567, 48.7301]
October 14	39	36.6577	2.7054	[34.8364, 45.4578]
October 15	40	41.7567	2.3040	[40.5036, 49.5267]
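
The forecast value, standard deviation and interval columns of Table 3 are the kind of summary a bootstrap procedure produces. The sketch below illustrates the general idea with a pairs bootstrap and a percentile interval. It is an illustration under stated assumptions, not the authors' implementation: an ordinary least-squares line stands in for the XGBoost model so that the example needs only the standard library, and the names `fit_line` and `bootstrap_interval`, the number of resamples, and the seed are all hypothetical. This section does not say whether the paper adds residual noise to the resampled predictions, so the sketch omits it.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in for the XGBoost regressor)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def bootstrap_interval(xs, ys, x_new, n_boot=500, level=0.95, seed=0):
    """Resample (x, y) pairs, refit, and summarise the predictions at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is undefined
        by = [ys[i] for i in idx]
        preds.append(fit_line(bx, by)(x_new))
    preds.sort()
    tail = (1.0 - level) / 2.0
    lower = preds[int(tail * (n_boot - 1))]
    upper = preds[int((1.0 - tail) * (n_boot - 1))]
    return statistics.fmean(preds), statistics.stdev(preds), (lower, upper)

# Toy data, invented for illustration only.
xs = list(range(10))
ys = [1 + 2 * x + (1 if x % 2 else -1) for x in xs]
mean_pred, sd_pred, (low, high) = bootstrap_interval(xs, ys, 10.0)
print(mean_pred, sd_pred, low, high)
```

Swapping `fit_line` for a function that trains a gradient-boosted model on the resampled rows would bring the sketch closer to a Bootstrap-XGBoost setup, at the cost of an external dependency.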

Table 4. Parameter estimation results.

Parameter	Logit Estimate	Logit S.E.	Logit p-Value	Probit Estimate	Probit S.E.	Probit p-Value
$α_{1}$	9.5826	1.6391	0.0000	5.4850	0.9473	0.0000
$α_{2}$	23.2516	3.0082	0.0000	14.0113	1.7184	0.0000
$α_{3}$	34.9270	4.4030	0.0000	20.7566	2.4414	0.0000
$α_{4}$	46.4949	5.8521	0.0000	28.0236	3.3141	0.0000
$α_{5}$	67.1061	8.7257	0.0000	41.2004	4.8089	0.0000
PM₂.₅	0.1729	0.0260	0.0000	0.1109	0.0146	0.0000
PM₁₀	0.1033	0.0147	0.0000	0.0632	0.0086	0.0000
SO₂	−0.3825	0.1456	0.0086	−0.2405	0.0893	0.0071
NO₂	0.0415	0.0240	0.0834	0.0170	0.0136	0.2104
O₃	0.0379	0.0091	0.0000	0.0224	0.0053	0.0000
CO	−1.2952	1.6675	0.4373	−0.9499	0.9723	0.3286
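
To show how the estimates in Table 4 translate into rank probabilities, here is a minimal sketch using the ordinal logit column. It assumes the common cumulative-link parametrisation P(rank ≤ j) = F(α_j − xᵀβ), which this section does not state explicitly, and it assumes pollutant values are on the same scale as in Table 1. The function names and both example days are hypothetical.

```python
import math

# Ordinal logit estimates copied from Table 4.
CUTPOINTS = [9.5826, 23.2516, 34.9270, 46.4949, 67.1061]  # alpha_1 .. alpha_5
COEFS = {"PM2.5": 0.1729, "PM10": 0.1033, "SO2": -0.3825,
         "NO2": 0.0415, "O3": 0.0379, "CO": -1.2952}

def logistic_cdf(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def rank_probabilities(x, cutpoints=CUTPOINTS, coefs=COEFS, cdf=logistic_cdf):
    """P(rank = j) for j = 1..6, ASSUMING P(rank <= j) = cdf(alpha_j - x'beta)."""
    eta = sum(coefs[name] * x[name] for name in coefs)
    cumulative = [cdf(a - eta) for a in cutpoints] + [1.0]
    # Differences of consecutive cumulative probabilities give each rank's mass.
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, len(cumulative))]

# Hypothetical pollutant readings (not from the paper's data set).
clean_day = {"PM2.5": 10, "PM10": 25, "SO2": 6, "NO2": 20, "O3": 40, "CO": 0.5}
hazy_day = {"PM2.5": 150, "PM10": 200, "SO2": 6, "NO2": 20, "O3": 40, "CO": 0.5}

for label, day in [("clean", clean_day), ("hazy", hazy_day)]:
    probs = rank_probabilities(day)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(label, "most likely rank:", best + 1, "probability:", round(probs[best], 4))
```

Under the same assumption, the probit variant is obtained by passing the probit column of Table 4 as `cutpoints` and `coefs` together with `statistics.NormalDist().cdf` as `cdf`.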

Table 5. Prediction accuracy on test set.

Model	Accuracy (%)
Ordinal logit regression model	89.04
Ordinal probit regression model	86.30

Table 6. Predicted results of AQI ranks.

Forecast Date	True Rank	Ordinal Logit Regression: Predicted Rank (Predicted Probability)	Ordinal Probit Regression: Predicted Rank (Predicted Probability)
October 1	II	II (0.6531)	II (0.6729)
October 2	I	I (0.9870)	I (0.9935)
October 3	I	I (0.9956)	I (0.9994)
October 4	I	I (0.9961)	I (0.9996)
October 5	I	I (0.9904)	I (0.9966)
October 6	I	I (0.9914)	I (0.9971)
October 7	I	I (0.9407)	I (0.9493)
October 8	I	I (0.8979)	I (0.9049)
October 9	II	II (0.5727)	II (0.5560)
October 10	II	II (0.9978)	II (0.9999)
October 11	II	II (0.9923)	II (0.9989)
October 12	II	II (0.9974)	II (0.9999)
October 13	I	I (0.9157)	I (0.9213)
October 14	I	I (0.9159)	I (0.9258)
October 15	I	I (0.9078)	I (0.9249)


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

