Construction of Air Quality Level Prediction Model Based on STEPDISC-PCA-BP

Liu, Min; Hu, Hua; Zhang, Liqian; Zhang, Yongan; Li, Jia

doi:10.3390/app13148506

by Min Liu ¹,², Hua Hu ¹,²,*, Liqian Zhang ¹,², Yongan Zhang ¹,² and Jia Li ¹,²

¹ School of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

² Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, School of Computer and Information Engineering, Hohhot 010018, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(14), 8506; https://doi.org/10.3390/app13148506

Submission received: 26 June 2023 / Revised: 20 July 2023 / Accepted: 21 July 2023 / Published: 23 July 2023

Abstract

Air quality level has a complex nonlinear relationship with air pollutants and meteorological conditions, which involves multiple factors, overlapping information, and equations that are difficult to solve. In order to identify significant factors, remove correlations, reduce data dimensionality, and simplify the model structure, a BP neural network model for air quality level prediction optimized by stepwise discriminant analysis (STEPDISC) and principal component analysis (PCA) is proposed, with 12 factors of historical daily meteorology and air pollutants in Bayannur city as samples. The results showed that, at the significance level of 0.01, the STEPDISC method retained 9 significant impact factors. The PCA method made an orthogonal linear combination of the 9 factors to form the principal components, and the contribution rates of the top 5 principal components were 37.6%, 19.2%, 15.3%, 8.8%, and 7.7%. At a cumulative contribution threshold of 0.85, the top 5 principal component scores were used as input nodes to construct the STEPDISC-PCA-BP model, which had a prediction accuracy of 85.5%. Compared with the PCA-BP and BP models, which had prediction accuracies of 61.8% and 56.7%, respectively, the STEPDISC-PCA-BP model has a higher prediction accuracy, shorter training time, and lower complexity of structure and data dimensionality, and can provide the necessary technical support for local air quality improvement.

Keywords:

computer neural network; prediction of air quality level; stepwise discriminant method; principal component analysis; BP neural network

1. Introduction

Green water and green mountains are reservoirs, grain depots, money depots, and carbon depots [1]. Bayannur city in Inner Mongolia relies on its rich resources to develop an ecological tourism economy. Air pollution has a significant impact on tourism, so accurate and effective prediction of ambient air quality can provide an important guarantee for the development of local tourism [2]. The air quality index (AQI) and the AQI-based air quality level (AQL) belong to the national ambient air quality evaluation standard released in 2012, which states that the main evaluation factors of AQI are six air pollutants, namely sulfur dioxide (SO₂), nitrogen dioxide (NO₂), carbon monoxide (CO), ozone (O₃), and particulate matter with an aerodynamic diameter ≤ 10 µm or ≤ 2.5 µm (PM₁₀, PM_2.5). There are six grades of AQL, i.e., level 1 (0 ≤ AQI ≤ 50, excellent), level 2 (51 ≤ AQI ≤ 100, good), level 3 (101 ≤ AQI ≤ 150, light pollution), level 4 (151 ≤ AQI ≤ 200, moderate pollution), level 5 (201 ≤ AQI ≤ 300, heavy pollution), and level 6 (AQI > 300, severe pollution). The greater the AQI value, the higher the AQL and the more serious the air pollution [3,4]. The complex composition and extensive sources of air pollutants make the prediction of air quality levels highly uncertain. With their simple structure and strong self-learning ability, BP neural networks have shown promise in improving the accuracy of air quality predictions by learning complex, non-linear relationships in the data [5], and they have been applied to AQL prediction by domestic and foreign scholars in recent years. Ji D. et al. established a BP prediction model based on air pollutants that showed good generalization and strong stability [6]. Shakerkhatibi M. et al. confirmed that the BP neural network is optimal in detecting the best predictors of air pollution-induced cardiopulmonary disease admissions [7].
The selection of evaluation factors is directly related to the robustness of the AQL prediction model. The Ambient Air Quality Standard states that the AQI value depends on six air pollutants, which have become the necessary parameters for detecting air quality. In addition, the accumulation and dispersion of air pollutants are affected by weather conditions such as air pressure, temperature, humidity, wind speed, wind direction, and sunshine [8]. Some scholars have modeled both air pollutant factors and meteorological factors. Song D. et al. built a model based on the previous day’s AQI value and the current day’s meteorological factor values [9]. You Y. et al. used not only the AQI value of the previous day and the meteorological values of the current day, but also the concentration values of the six air pollutants as model inputs [10]. Given the issues of multiple factors, overlapping information, and equations that are difficult to solve, it is necessary to integrate multiple methods in practice to retain significant factors, remove correlation, reduce data dimensionality, simplify structure, and improve model fitting performance. However, most of the existing modeling studies used only a single BP neural network and had the following problems: some did not include meteorological evaluation factors; some lacked the six air pollutants; and some neither screened out the factors with strong discrimination ability nor performed dimensionality reduction on the factors.
Therefore, this study collected 1771 sets of real day-by-day air pollutant evaluation factors and air quality level classification labels in Bayannur city during 2015–2020 and combined them with meteorological data such as air pressure, temperature, and wind speed for the same period. STEPDISC was used to retain the evaluation factors with significant discriminative ability. On this basis, PCA was used to reduce the dimensionality of the factor vector, and the score vectors of the principal components with large contribution rates were calculated and input into a BP neural network for air quality level prediction modeling; the resulting model is named the STEPDISC-PCA-BP model. The purpose of this study is to provide a comprehensive approach to air quality modeling that improves accuracy and to offer a new perspective for air quality level forecasting.

2. Materials and Methods

2.1. Study Area and Datasets

Bayannur city is located in the west of Inner Mongolia, China, with a total length of 378 km, a width of 238 km, and an area of 64,000 km², with a temperate continental climate [11]. The location is shown in Figure 1.

The sample air pollutant data were obtained from the weather report website (http://www.tianqihoubao.com (accessed on 9 January 2023)), and the meteorological data for the same period were obtained from the Inner Mongolia Meteorological Bureau, giving a total of 1771 sets of daily average monitoring values from 1 January 2015 to 13 January 2020. Since more than 50% of the wind direction values are missing, this factor is not included in the dataset. Each sample contains the date and 12 factors: air pressure (AP), air temperature (AT), wind speed (WS), sunshine hours (SH), relative humidity (RH), daily precipitation (DP), PM_2.5, PM₁₀, SO₂, NO₂, CO, and O₃. AQL is the predicted grouping variable mapped from the value of the AQI. Some of the sample data are shown in Table 1 and Table 2.

2.2. Methodology

2.2.1. Stepwise Discriminant Analysis (STEPDISC)

To avoid the influence of irrelevant factors on the discriminant analysis results, or the instability caused by too many factors, STEPDISC performs a hypothesis test on each factor in turn for a sample dataset with one classification label and several factors, and retains the factors with the strongest discriminatory power. It requires the factors to follow a multivariate normal distribution within each category [12]. Assume that there are k classes with m factors within each class. Firstly, for each factor component x_i (i = 1, 2, …, m), the equality of its means across the k classes is tested at the significance level α, and the Wilks’ λ value is calculated to measure the discriminative power of x_i; it equals the ratio of the determinant of the within-group deviation matrix to the determinant of the total deviation matrix. The smaller the Wilks’ λ value, the greater the difference between groups and the more important x_i is. The Wilks’ λ value is then used to construct the F statistic and to calculate p = P{F ≥ F₀}, where F₀ is the observed value of the F statistic. The larger the F value, the smaller the p value; if p ≤ α, the discriminative power of x_i is significant and x_i is introduced into the discriminant function. Otherwise, there is no significant difference in x_i among the classes, so it cannot provide additional information for distinguishing between the k classes and is not introduced. Secondly, every time a new factor with the strongest discriminative power is introduced into the discriminant function, the factors already in the discriminant function are tested one by one and are eliminated if their discriminative power has become insignificant because of the newly introduced factor. Finally, the best subset of factors for distinguishing the k classes is selected step by step according to discriminative power.
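The entry test for a single factor can be illustrated with a minimal sketch. It assumes the one-factor case, where the deviation matrices reduce to sums of squares, so Wilks’ λ is the within-group sum of squares divided by the total sum of squares and the F statistic has (k − 1, n − k) degrees of freedom. The function name `wilks_lambda_f` and the toy data are hypothetical and are not taken from the study.

```python
def wilks_lambda_f(groups):
    """Return (Wilks' lambda, F statistic) for one factor split into k classes.

    `groups` is a list of k lists, one list of observed values per class.
    """
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Total deviation: squared distance of every value from the grand mean.
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    # Within-group deviation: squared distance from each class's own mean.
    within_ss = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        within_ss += sum((v - mean_g) ** 2 for v in g)
    lam = within_ss / total_ss
    # One-factor F statistic built from lambda, with (k - 1, n - k) d.o.f.
    f_stat = (1.0 - lam) / lam * (n - k) / (k - 1)
    return lam, f_stat


# Hypothetical factor whose class means differ strongly (small lambda, large F).
strong = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Hypothetical factor whose class means are identical (lambda = 1, F = 0).
weak = [[1, 5, 9], [2, 5, 8], [1, 5, 9]]

lam_strong, f_strong = wilks_lambda_f(strong)
lam_weak, f_weak = wilks_lambda_f(weak)
```

A full stepwise procedure would repeat this test conditionally on the factors already in the model and re-test those factors for removal after each entry; that multivariate part is not shown here.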

2.2.2. Principal Component Analysis (PCA)

The PCA method forms linear combinations of correlated original factors by orthogonal transformation to obtain a smaller number of uncorrelated comprehensive factors, i.e., principal components, that reflect as much information as possible about the original factors, thus achieving data dimensionality reduction. Suppose there are n samples, each including p factors; the original observation data matrix is denoted as X, and each observation is denoted as x_ij (i = 1, 2, …, n, j = 1, 2, …, p). When the ranges of the p original factors differ greatly from each other, X needs to be standardized (Equation (1)); the standardized matrix is denoted as X*, where x̄_j and s_j are the mean and standard deviation of the j-th factor.

x_{i j}^{*} = \frac{x_{i j} - {\bar{x}}_{j}}{s_{j}}

(1)

Computing the principal components from the correlation matrix of X is equivalent to standardizing the original data first. The calculation steps are as follows:

1. Calculate the correlation matrix R of X (Equation (2)).

R = {(r_{i j})}_{p \times p}, \quad r_{i j} = \frac{\sum_{k = 1}^{n} (x_{k i} - {\bar{x}}_{i}) (x_{k j} - {\bar{x}}_{j})}{\sqrt{\sum_{t = 1}^{n} {(x_{t i} - {\bar{x}}_{i})}^{2}} \sqrt{\sum_{t = 1}^{n} {(x_{t j} - {\bar{x}}_{j})}^{2}}} \quad (i, j = 1, 2, \dots, p)

(2)

2. Calculate the eigenvalues of R and the corresponding orthonormal eigenvectors.

The eigenvalues of R in descending order are λ_1 ≥ λ_2 ≥ … ≥ λ_p > 0, and the i-th principal component is F_i. λ_i is the variance of F_i, and the unit-length, mutually orthogonal eigenvectors corresponding to the eigenvalues are denoted as:

a_{1} = (\begin{array}{l} a_{11} \\ a_{21} \\ \dots \\ a_{p 1} \end{array}), a_{2} = (\begin{array}{l} a_{12} \\ a_{22} \\ \dots \\ a_{p 2} \end{array}), \dots, a_{p} = (\begin{array}{l} a_{1 p} \\ a_{2 p} \\ \dots \\ a_{p p} \end{array})

The i-th principal component is (Equation (3)):

F_{i} = a_{i}^{T} X^{*}, i = 1, 2, \dots, p

(3)

3. Select the principal components.

The contribution rate of the principal component F_i is \frac{λ_{i}}{\sum_{k = 1}^{p} λ_{k}}. The cumulative contribution rate of the top m principal components is \frac{\sum_{k = 1}^{m} λ_{k}}{\sum_{k = 1}^{p} λ_{k}}, which can be interpreted as the share of the information in the original factors that the top m principal components reflect. In general, a cumulative contribution rate greater than 85% retains the main information of the original factors while simplifying the model structure. Therefore, m is determined with a cumulative contribution rate threshold of 85% [13].

4. Explain the principal components.

According to ρ (F_{k}, x_{i}) = a_{i k} \sqrt{λ_{k}} (k, i = 1, 2, \dots, p), a_{i k} reflects the correlation between the k-th principal component and the i-th original factor. A principal component should be interpreted on the basis of the few original factors that play a major role in it, rather than simply as the effect of a single factor.

5. Calculate the scores of n samples on m principal components (Equation (4)):

F_{j} = a_{1 j} X_{1}^{*} + a_{2 j} X_{2}^{*} + \dots + a_{p j} X_{p}^{*}, j = 1, 2, \dots, m

(4)
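The five steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `pca_scores`, the `threshold` parameter, and the synthetic data matrix are hypothetical, and the sample standard deviation (`ddof=1`) is assumed for s_j in Equation (1).

```python
import numpy as np


def pca_scores(X, threshold=0.85):
    """Correlation-matrix PCA: return (scores, contribution rates, m)."""
    X = np.asarray(X, dtype=float)
    # Equation (1): standardize each factor (column) of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Equation (2): correlation matrix R of the original factors.
    R = np.corrcoef(X, rowvar=False)
    # Step 2: eigenvalues and orthonormal eigenvectors of the symmetric R.
    # eigh returns eigenvalues in ascending order, so sort them descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 3: contribution rates and the smallest m reaching the threshold.
    rates = eigvals / eigvals.sum()
    cumulative = np.cumsum(rates)
    m = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigvals))
    # Step 5 / Equation (4): scores of the n samples on the top m components.
    scores = Z @ eigvecs[:, :m]
    return scores, rates, m


# Synthetic data (assumption): two strongly correlated factors plus one
# independent factor, 200 samples.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])
scores, rates, m = pca_scores(X_demo)
```

The columns of `scores` are uncorrelated, and the sample variance of the i-th column equals λ_i, which matches the statement that λ_i is the variance of F_i.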

2.2.3. BP Neural Network

A BP neural network is trained on pairs of input and output vectors to approximate the function that maps one to the other, so that when a factor vector is fed to the trained network, its output is as close as possible to the expected output value. Because of its strong nonlinear mapping ability, it is widely used in classification problems. Therefore, the BP neural network model is chosen to predict the air quality level of the study area. The classical three-layer network structure contains an input layer, a hidden layer, and an output layer, as shown in Figure 2.

Assume that the input layer has p neurons and receives the factor vector X = {(x_{1}, \dots, x_{i}, \dots, x_{p})}^{T}, and that {(α_{1 j}, \dots, α_{i j}, \dots, α_{p j})}^{T} and {(ω_{1 k}, ω_{2 k}, \dots, ω_{j k}, \dots, ω_{m k})}^{T} denote the connection weights into the j-th hidden neuron and the k-th output neuron, respectively. The hidden and output layers, with m and n neurons, respectively, receive the dot product of their input vector and the weights, subtract the set thresholds {(β_{1}, β_{2}, \dots, β_{j}, \dots, β_{m})}^{T} and {(θ_{1}, \dots, θ_{k}, \dots, θ_{n})}^{T}, and pass the result as the argument of the activation function f. The value returned by f is the actual output of each neuron in that layer: {({\hat{h}}_{1}, {\hat{h}}_{2}, \dots, {\hat{h}}_{j}, \dots, {\hat{h}}_{m})}^{T} for the hidden layer (Equation (5)) and {({\hat{y}}_{1}, \dots, {\hat{y}}_{k}, \dots, {\hat{y}}_{n})}^{T} for the output layer (Equation (6)) [14].

{\hat{h}}_{j} = f (\sum_{i = 1}^{p} α_{i j} x_{i} - β_{j})

(5)

{\hat{y}}_{k} = f (\sum_{j = 1}^{m} ω_{j k} {\hat{h}}_{j} - θ_{k})

(6)

The Sigmoid function is used as the activation function of the hidden layer; it works well when the differences between factors are complex (Equation (7)).

f (x) = \frac{1}{1 + e^{- x}}

(7)

The output layer uses the Softmax function, which converts the output for each class into a probability value and selects the class with the highest probability as the final classification. Because of the exponential, this function also widens the gap between outputs that differ strongly, which makes the classification more clear-cut (Equation (8)).

P (k) = \mathrm{softmax} (y_{k}) = \frac{e^{y_{k}}}{\sum_{t = 1}^{n} e^{y_{t}}}

(8)

The loss function E is defined as the mean square error between the observed output and the predicted output (Equation (9)).

E = \frac{1}{2} \sum_{j = 1}^{n} ({\hat{y}}_{j} - y_{j})^{2}

(9)
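Equations (5) to (9) can be read as one forward pass followed by a loss evaluation. The sketch below assumes NumPy, a weight matrix `alpha` of shape (p, m), a weight matrix `omega` of shape (m, n), and Softmax applied directly as the output-layer activation; the function names and the all-zero toy weights are hypothetical.

```python
import numpy as np


def sigmoid(x):
    # Equation (7).
    return 1.0 / (1.0 + np.exp(-x))


def softmax(y):
    # Equation (8); subtracting the maximum avoids overflow and does not
    # change the result.
    e = np.exp(y - np.max(y))
    return e / e.sum()


def forward(x, alpha, beta, omega, theta):
    """Forward pass of a three-layer network with thresholds beta and theta."""
    h_hat = sigmoid(alpha.T @ x - beta)        # Equation (5)
    y_hat = softmax(omega.T @ h_hat - theta)   # Equations (6) and (8)
    return h_hat, y_hat


def loss(y_hat, y):
    # Equation (9): half the sum of squared errors over the n outputs.
    return 0.5 * np.sum((y_hat - y) ** 2)


# Toy network with p = 5 inputs, m = 4 hidden nodes, n = 6 outputs and
# all-zero weights and thresholds (assumption, for a result that is easy
# to check by hand).
p, m, n = 5, 4, 6
x = np.ones(p)
alpha, beta = np.zeros((p, m)), np.zeros(m)
omega, theta = np.zeros((m, n)), np.zeros(n)
h_hat, y_hat = forward(x, alpha, beta, omega, theta)

target = np.array([0, 1, 0, 0, 0, 0], dtype=float)  # one-hot label
E = loss(y_hat, target)
```

With zero weights every hidden output is 0.5 and every class probability is 1/6, so the network is maximally undecided before training.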

The network model is trained with the objective of minimizing E. Using the gradient descent algorithm (Equations (10) and (11)), the weights ω_{j k}^{(l)} and thresholds θ_{k}^{(l)} of the neurons at the l-th iteration are updated from their previous values based on the chain rule, starting from E at the output and propagating backward to the input. Learning terminates once the expected error or the set number of iterations is reached, which yields the final network model.

ω_{j k}^{(l)} = ω_{j k}^{(l - 1)} - b \frac{\partial E}{\partial ω_{j k}^{(l - 1)}}

(10)

θ_{k}^{(l)} = θ_{k}^{(l - 1)} - b \frac{\partial E}{\partial θ_{k}^{(l - 1)}}

(11)

In the above equations, b is the learning rate. If it is too large, the updates can overshoot and fail to settle at a minimum; if it is too small, training the network model takes a long time. It usually needs to be adjusted empirically.
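The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule of Equations (10) and (11) to a single weight on the hypothetical error surface E(w) = (w − 3)², which is an assumption chosen for illustration and not the network's actual loss.

```python
def gradient_descent(grad, w0, b, steps):
    """Iterate w <- w - b * dE/dw, as in Equations (10) and (11)."""
    w = w0
    for _ in range(steps):
        w = w - b * grad(w)
    return w


def grad(w):
    # Gradient of the toy error surface E(w) = (w - 3)^2; minimum at w = 3.
    return 2.0 * (w - 3.0)


w_small = gradient_descent(grad, w0=0.0, b=0.001, steps=50)  # barely moves
w_good = gradient_descent(grad, w0=0.0, b=0.1, steps=50)     # close to 3
w_large = gradient_descent(grad, w0=0.0, b=1.1, steps=50)    # overshoots, diverges
```

On this surface each step multiplies the distance to the minimum by (1 − 2b), so a very small b shrinks it slowly, while any b above 1 makes the distance grow at every step.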

In addition, a reasonable choice of the number of neuron nodes in each layer also improves the fitting ability of the model to a great extent. The number of nodes in the input and output layers needs to be determined from the actual problem. Too many nodes in the hidden layer tend to over-fit and weaken the generalization of the model, while too few tend to under-fit. In practice, the number of hidden nodes m is often calculated using Equation (12), where p and n are the numbers of input and output nodes and c takes values in the range of 1 to 10 [15].

m = \sqrt{n + p} + c

(12)
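As a quick illustration of Equation (12), the sketch below lists the candidate hidden-layer sizes for a network with five input nodes and six output nodes. Rounding to the nearest integer is an assumption, since the equation itself does not say how a non-integer value is handled, and the function name `hidden_node_candidates` is hypothetical.

```python
import math


def hidden_node_candidates(p, n, c_min=1, c_max=10):
    """Candidate hidden-layer sizes from m = sqrt(n + p) + c, c = c_min..c_max."""
    base = math.sqrt(n + p)
    return [round(base + c) for c in range(c_min, c_max + 1)]


# p = 5 input nodes and n = 6 output nodes give sqrt(11), about 3.32.
candidates = hidden_node_candidates(p=5, n=6)
```

Under this rounding assumption the candidates run from 4 to 13, and the final size is usually chosen among them by trial training.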

3. Results and Discussion

3.1. Data Pre-Processing

From Table 1 and Table 2, it can be seen that the meteorological and air pollutant factors in the samples have different units and differ by orders of magnitude. Training the network model directly on such data tends to produce weights that also differ by orders of magnitude, which makes the network model unstable. Standardization puts all factors on a comparable scale and improves convergence. Equation (1) is used to standardize the original data.

3.2. STEPDISC Method for Selecting Factors

The values of the factors obey a multivariate normal distribution and meet the prerequisites of the STEPDISC method [16,17]. Using all of them to build the discriminant function would increase the computational effort, and possible correlation between the factors would cause computational difficulties and reduce the accuracy and stability of the discriminant function, so the factors with significant discriminant ability need to be selected from the 12 factors. With AQL as the predicted grouping variable, the 12 factors were selected stepwise at the significance level of 0.01. The results are shown in Table 3.

Step 1 performed an analysis of covariance for the 12 factors outside the model. From Table 3, PM₁₀ has a Wilks’ λ value of 0.280 and an F value of 909.7, the largest F value among all factors. Its p value (< 0.0001) is also the smallest and meets the criterion for entering the model (p < 0.01), so PM₁₀ can be judged to be most closely related to the AQL and was selected to enter the discriminant model. Step 2 started with an analysis of covariance for the factors within the model; at that point only PM₁₀ was in the model, and it was not eliminated. An analysis of covariance was then performed on the remaining 11 factors outside the model, and PM_2.5 was selected for inclusion in the discriminant model. Steps 3 to 9 analyzed the factors inside and outside the model to determine whether they met the criteria for exclusion or entry. In this process, AP, NO₂, CO, RH, SO₂, SH, and WS were selected into the model in turn, and no factors were excluded. For the remaining factors outside the model (AT, DP, and O₃), the corresponding p values did not meet the entry criterion, so they were not selected into the discriminant model. Finally, nine of the 12 factors were retained: AP, WS, SH, RH, PM_2.5, PM₁₀, SO₂, NO₂, and CO.

The annual averages of PM_2.5, PM₁₀, SO₂, NO₂, CO concentration, and AQI for the years 2015–2019 are shown in Figure 3 and Figure 4.

According to Figure 3, PM₁₀ and PM_2.5 were the top pollutants in each year, SO₂ and NO₂ decreased significantly after 2018, and CO had little change in its impact in each year. From Figure 4, we know that the highest mean AQI value was 81.66 in 2015, the next highest value was 77.63 in 2016, and the lowest mean AQI value was 68.62 in 2019, which is a decrease of 15.97% compared to 2015.

3.3. PCA Method to Reduce the Factors Dimension

To reduce the number of input nodes of the BP neural network, principal component analysis was performed on the discriminant significant factors in Table 3 to further reduce the data dimensionality. The Pearson correlation coefficients of the 9 factors in Table 3 were calculated to verify the existence of correlation between the factors, and the results are shown in Table 4.

As can be seen from Table 4, the correlation coefficient between PM_2.5 and PM₁₀ is the largest, 0.7765, which indicates a moderate correlation. The main sources of PM_2.5 are primary particulate matter directly emitted from coal combustion, fuel oil motor vehicle exhaust, ground dust, and restaurant fumes, as well as secondary fine particles generated by chemical reactions of nitrogen oxides and volatile organic compounds in the air. About 70% of PM₁₀ is PM_2.5, and as PM₁₀ emissions increase, PM_2.5 concentrations show an increasing trend. NO₂ and SO₂ have the next-highest correlation coefficients with CO, 0.7476 and 0.7236, respectively. CO arises from the incomplete combustion of coal, natural gas, and other fuels and is contained in the same flue gas as SO₂, so the three air pollutants increase or decrease together [18,19,20]. SH and RH are negatively correlated with a correlation coefficient of −0.5112, with a decrease in sunshine hours accompanying an increase in relative humidity. In summary, there are obvious correlations between meteorological factors and between some air pollutant factors, which act together on air quality. Principal component analysis is therefore used to extract new, mutually uncorrelated principal components that explain as much information of the original factors as possible. The eigenvalues and contribution rates calculated from the correlation coefficient matrix in Table 4 are shown in Table 5.

Table 5 is arranged in descending order of the eigenvalues. The contribution rates of the top 5 principal components are 37.6%, 19.2%, 15.3%, 8.8%, and 7.7%, respectively, covering a total of 88.6% of the information from the nine factors. At a cumulative contribution threshold of 85%, the top 5 principal components were selected.
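The selection can be retraced with the five contribution rates quoted above. The sketch below is only a check of the threshold rule; the function name `select_components` is hypothetical, and the rates of the remaining four components are not needed because the threshold is reached within the first five.

```python
def select_components(rates, threshold):
    """Return (m, cumulative rate) for the smallest m reaching the threshold."""
    cumulative = 0.0
    for m, rate in enumerate(rates, start=1):
        cumulative += rate
        if cumulative >= threshold:
            return m, cumulative
    return len(rates), cumulative


# Contribution rates of the top five principal components reported in the text.
top5 = [0.376, 0.192, 0.153, 0.088, 0.077]
m, covered = select_components(top5, threshold=0.85)
```

The running totals are 37.6%, 56.8%, 72.1%, 80.9%, and 88.6%, so four components fall short of the 85% threshold and the fifth is the first to pass it.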

Based on the eigenvectors of the first five principal components and Equation (3), the expressions of the principal components can be written as follows:

\mathrm{prin1} = 0.203 AP^{*} - 0.197 WS^{*} - 0.219 SH^{*} + 0.205 RH^{*} + 0.402 PM_{2.5}^{*} + 0.194 PM_{10}^{*} + 0.457 SO_{2}^{*} + 0.449 NO_{2}^{*} + 0.47 CO^{*}

\mathrm{prin2} = - 0.39 AP^{*} + 0.422 WS^{*} + 0.022 SH^{*} - 0.202 RH^{*} + 0.456 PM_{2.5}^{*} + 0.635 PM_{10}^{*} - 0.107 SO_{2}^{*} - 0.078 NO_{2}^{*} - 0.03 CO^{*}

\mathrm{prin3} = - 0.058 AP^{*} - 0.236 WS^{*} + 0.668 SH^{*} - 0.612 RH^{*} - 0.030 PM_{2.5}^{*} - 0.007 PM_{10}^{*} + 0.182 SO_{2}^{*} + 0.277 NO_{2}^{*} + 0.091 CO^{*}

\mathrm{prin4} = 0.83 AP^{*} + 0.221 WS^{*} - 0.106 SH^{*} - 0.363 RH^{*} - 0.002 PM_{2.5}^{*} + 0.233 PM_{10}^{*} + 0.081 SO_{2}^{*} - 0.18 NO_{2}^{*} - 0.161 CO^{*}

\mathrm{prin5} = - 0.07 AP^{*} + 0.778 WS^{*} - 0.018 SH^{*} - 0.106 RH^{*} - 0.2 PM_{2.5}^{*} - 0.375 PM_{10}^{*} + 0.226 SO_{2}^{*} + 0.16 NO_{2}^{*} + 0.348 CO^{*}

The coefficients of SO₂, NO₂, and CO have the three largest absolute values in prin1, namely 0.457, 0.449, and 0.47, respectively, so prin1 combines the factors SO₂, NO₂, and CO and is called the composite factor of sulfur and nitrogen carbon oxide pollution, reflecting 37.6% of the information of the original factors. The coefficients of PM_2.5 and PM₁₀ have the two largest absolute values in prin2, 0.456 and 0.635, respectively. Thus, prin2 combines the PM₁₀ and PM_2.5 air pollutant factors and is called the composite factor of particulate matter pollution, reflecting 19.2% of the information of the original factors. The SH coefficient has the largest absolute value in prin3, 0.668. Thus, prin3 is called the composite factor of sunshine hours, reflecting 15.3% of the information of the original factors. The AP coefficient has the largest absolute value in prin4, 0.83. Therefore, prin4 is called the air pressure composite factor and reflects 8.8% of the information of the original factors. The WS coefficient has the largest absolute value in prin5, 0.778, so prin5 is called the wind speed composite factor, which reflects 7.7% of the information of the original factors.

The data of the nine factors of the 1771 groups of samples in Table 3 and Table 4 were substituted into the five principal component expressions, and the principal component scores of each group of samples were calculated using Equation (4) as prin1–prin5, and the output results are shown in Table 6.
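Substituting a sample into the five expressions amounts to five dot products between the standardized factor vector and the coefficient rows. The sketch below uses the coefficients printed above; the function name `component_scores` and the sample vector are hypothetical (a sample that sits at the mean of every factor except CO, which is one standard deviation above its mean).

```python
FACTORS = ["AP", "WS", "SH", "RH", "PM2.5", "PM10", "SO2", "NO2", "CO"]

# Coefficients of the five principal component expressions, in FACTORS order.
LOADINGS = {
    "prin1": [0.203, -0.197, -0.219, 0.205, 0.402, 0.194, 0.457, 0.449, 0.470],
    "prin2": [-0.390, 0.422, 0.022, -0.202, 0.456, 0.635, -0.107, -0.078, -0.030],
    "prin3": [-0.058, -0.236, 0.668, -0.612, -0.030, -0.007, 0.182, 0.277, 0.091],
    "prin4": [0.830, 0.221, -0.106, -0.363, -0.002, 0.233, 0.081, -0.180, -0.161],
    "prin5": [-0.070, 0.778, -0.018, -0.106, -0.200, -0.375, 0.226, 0.160, 0.348],
}


def component_scores(z):
    """Scores of one standardized sample z (ordered as FACTORS), Equation (4)."""
    if len(z) != len(FACTORS):
        raise ValueError("expected one standardized value per factor")
    return {
        name: sum(a * v for a, v in zip(coeffs, z))
        for name, coeffs in LOADINGS.items()
    }


# Hypothetical standardized sample: only CO deviates from its mean.
z_sample = [0.0] * 8 + [1.0]
sample_scores = component_scores(z_sample)
```

For this sample each score equals the CO coefficient of the corresponding expression, which shows directly that CO raises prin1 and prin5 the most and has almost no effect on prin2.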

The first two principal components are extracted from the air pollutant factors, which contribute more significantly to the air quality, and the month-by-month averages of prin1 and prin2 are plotted under the same coordinate system, as in Figure 5.

As can be seen from Figure 5, the months with higher prin1 scores are concentrated in December to February, indicating that local sulfur and nitrogen carbon oxide pollution emissions are higher in winter. This is closely related to coal-fired heating, increased traffic volume, and the higher air pressure and calm weather that make it difficult for pollutants to disperse. The lower prin1 scores occur from May to July, when air pollution is low because abundant rainfall and strong air convection in summer make it easy for pollutants to be removed. The two months with the highest prin2 scores are April and May, the most active months for dust storms in spring; combined with abundant sand sources, a dry surface, and loose soil, the frequent dusty weather makes particulate matter pollution most pronounced. The lower prin2 scores occur from December to February: as cold air arrives in autumn and winter and the pressure rises, the snow over the sand sources does not melt easily and forms a protective cover on the surface, so dust and sand are hard to raise and particulate pollution emissions gradually decrease. In September, both prin1 and prin2 scores are relatively low, indicating that pollutant emissions are low and air quality is good, which makes this time of year more suitable for travel.

3.4. STEPDISC-PCA-BP Model Construction and Performance

Prin1–prin5 obtained by PCA was used as the input of the BP neural network, and the AQL was used as the output. Since stratified sampling better reflects the overall characteristics of the data [21,22], a total of 1240 groups were selected as the training set and 531 groups as the test set by random independent sampling in the ratio of 7:3 in each level, and the division is shown in Table 7.
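A stratified 7:3 split of this kind can be sketched with the standard library. The function name `stratified_split`, the fixed seed, the rounding rule for the per-level cut, and the toy label list are all assumptions; the per-level sample counts of the study are given in Table 7 and are not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.7, seed=0):
    """Split sample indices into train/test sets, level by level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for idx, level in enumerate(labels):
        by_level[level].append(idx)
    train, test = [], []
    for level in sorted(by_level):
        indices = by_level[level]
        rng.shuffle(indices)  # random sampling within the level
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy labels (assumption): 100 samples of level 1, 50 of level 2, 10 of level 3.
labels = [1] * 100 + [2] * 50 + [3] * 10
train_idx, test_idx = stratified_split(labels)
```

Because the cut is made inside each level, rare levels keep the same 7:3 proportion as common ones instead of being left out of the test set by chance.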

The sample data were normalized to the [−1, 1] interval before training the network model, and the five-dimensional vectors {(prin1, prin2, prin3, prin4, prin5)}^{T} of the 1240 groups of training samples were fed to the five neuron nodes of the input layer. The output variable was AQL, encoded with one-hot encoding: level 1 to level 6 were represented by 100000, 010000, 001000, 000100, 000010, and 000001, corresponding to the six output layer neuron nodes. The hidden layer contained four neuron nodes. The STEPDISC-PCA-BP model structure is shown in Figure 6.
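The two preprocessing steps in this paragraph can be sketched as follows. The text does not state which normalization formula was used, so min-max scaling to [−1, 1] is an assumption, and the function names `one_hot` and `scale_to_minus_one_one` are hypothetical.

```python
def one_hot(level, n_levels=6):
    """Encode an air quality level (1..n_levels) as a one-hot list."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    code = [0] * n_levels
    code[level - 1] = 1
    return code


def scale_to_minus_one_one(values):
    """Min-max scale a list of numbers to the [-1, 1] interval (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


level_5_code = one_hot(5)
scaled = scale_to_minus_one_one([0.0, 5.0, 10.0])
```

Each output node then corresponds to exactly one level, so the index of the largest Softmax probability can be read back as the predicted level.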

The Sigmoid function was used for the hidden layer, the Softmax function was used for the output layer, and the trainrp (resilient backpropagation) function was used for neural network training. The maximum number of training iterations was set to 1000, the learning rate was 0.05, MSE was selected as the network performance function, and the training target was a minimum error of 0.03. The five-dimensional principal component score vectors of the 1240 groups were used as the training set; the training results are shown in Figure 7, and the training error is shown in Figure 8. From Figure 7 and Figure 8, training stopped after 372 iterations and 20 s, when the training error had converged to 0.03.

The 531 test samples were classified with the trained STEPDISC-PCA-BP model, and the resulting confusion matrix is shown in Table 8. The diagonal entries of Table 8 are the numbers of samples correctly classified at each level, and the off-diagonal entries are the numbers of samples classified into the wrong level. The correct rates for each level, which sum to the overall accuracy of 85.5% and therefore appear to be expressed as shares of all test samples, were in descending order 62% (level 2), 13.9% (level 1), 6.2% (level 3), 1.7% (level 4), 1.1% (level 5), and 0.6% (level 6). Table 7 shows the number of test samples at each level, from which it can be seen that the larger the sample size of a level, the more stable the pattern and the higher the prediction accuracy.

The TP, FN, FP, and TN values of the model for each air quality level were calculated from Table 8. TP is the number of samples of the current true level that were correctly predicted. FN is the number of samples of the current true level that were predicted as other levels. FP is the number of samples of other true levels that were predicted as the current level. TN is the number of samples of other true levels that were not predicted as the current level. The calculation results are shown in Table 9.

In a multi-level classification task, Micro-F1 takes all levels into account at once and is often used to quantitatively assess the overall accuracy of the model prediction; its values range from 0 to 1, and the closer to 1, the better the model performance. For each of the three models, the TP, TN, FP, and FN values were averaged over all levels to obtain \overline{TP}, \overline{TN}, \overline{FP}, and \overline{FN}, from which Micro-P and Micro-R were calculated with Equations (13) and (14) and then substituted into Equation (15) to obtain Micro-F1.

\text{Micro-}P = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}

(13)

\text{Micro-}R = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}

(14)

\text{Micro-}F1 = \frac{2 \times (\text{Micro-}P) \times (\text{Micro-}R)}{(\text{Micro-}P) + (\text{Micro-}R)}

(15)

After calculation, the Micro-F1 value of the STEPDISC-PCA-BP model was 0.855, and the overall correct rate was 85.5%. The STEPDISC-PCA-BP model has a simple structure, high prediction accuracy, strong generalization ability, and a robust network, which makes it feasible for the prediction of future air quality levels.
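The per-level counts and Equations (13) to (15) can be sketched from a confusion matrix as follows. The 3 × 3 matrix is hypothetical and is not Table 8, and the function names are assumptions. The sketch sums the counts over levels instead of averaging them, which gives the same ratios because the common factor cancels.

```python
def per_level_counts(cm):
    """Return [(TP, FN, FP, TN), ...] per level.

    cm[i][j] is the number of samples of true level i + 1 predicted as
    level j + 1.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    counts = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true level k, predicted elsewhere
        fp = sum(cm[i][k] for i in range(n)) - tp   # other levels predicted as k
        tn = total - tp - fn - fp
        counts.append((tp, fn, fp, tn))
    return counts


def micro_f1(cm):
    """Micro-averaged F1, following Equations (13) to (15)."""
    counts = per_level_counts(cm)
    tp = sum(c[0] for c in counts)
    fn = sum(c[1] for c in counts)
    fp = sum(c[2] for c in counts)
    micro_p = tp / (tp + fp)   # Equation (13)
    micro_r = tp / (tp + fn)   # Equation (14)
    return 2 * micro_p * micro_r / (micro_p + micro_r)   # Equation (15)


# Hypothetical confusion matrix with 40 samples, 30 of them on the diagonal.
cm_demo = [
    [8, 2, 0],
    [1, 15, 4],
    [0, 3, 7],
]
counts_demo = per_level_counts(cm_demo)
f1_demo = micro_f1(cm_demo)
```

When every sample has exactly one true level and one predicted level, each misclassification counts once as an FN and once as an FP, so Micro-P, Micro-R, and Micro-F1 all equal the overall accuracy. This is consistent with the Micro-F1 of 0.855 and the overall correct rate of 85.5% reported above.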

3.5. STEPDISC-PCA-BP, PCA-BP, and BP Model Performance Evaluation

In order to evaluate the performance of the three models objectively and accurately, the PCA-BP and BP models were trained with the same training parameter settings as the STEPDISC-PCA-BP model. The PCA-BP model performed principal component analysis on the 12 factors, and the top six principal component scores were calculated and input to the BP neural network. The BP model took the 12 factors directly as training input, so the number of input layer nodes differed among the three models. The results observed for the three models after training are shown in Figure 9.

As can be seen from Figure 9, the STEPDISC-PCA-BP model with five input nodes converged to a training error of 0.03 after 372 iterations, took 0.333 min, and had an accuracy of 85.5%. In contrast, the PCA-BP and BP models, which used the same neural network configuration and training settings but more input nodes, trained for 0.917 and 1 min, respectively, did not converge to the target minimum error, and reached accuracies of only 61.77% and 56.69%. In a comprehensive comparison, the STEPDISC-PCA-BP model has lower data dimensionality, weaker factor correlation, a simpler structure, faster convergence, and higher prediction accuracy, which makes it more suitable for local air quality prediction research and helps to improve the forecasting level.

4. Conclusions

In this study, we constructed an air quality level prediction model based on a BP neural network optimized by STEPDISC and PCA, using the daily average monitoring values of air pollutants and meteorological factors, together with air quality level labels, in Bayannur city from 2015 to 2020. We used 1240 sets of samples to train the model; a large training set can improve the robustness of the model. First, at the significance level of 0.01, the STEPDISC method screened nine factors with a significant ability to distinguish air quality levels from the twelve original factors: air pressure, wind speed, sunshine hours, relative humidity, PM₂.₅, PM₁₀, SO₂, NO₂, and CO. PM₂.₅ was moderately positively correlated with PM₁₀, NO₂ and SO₂ were moderately positively correlated with CO, and sunshine hours were moderately negatively correlated with relative humidity. Second, under the contribution threshold of 0.85, principal component analysis was performed on the nine significant discriminating factors, and the top five principal components were interpreted as sulfur, nitrogen, and carbon oxide pollution; particulate pollution; sunshine hours; air pressure; and wind speed, with contribution rates of 37.6%, 19.2%, 15.3%, 8.8%, and 7.7%, respectively. Together, the top five principal components explain 88.6% of the information in the nine factors, which effectively reduces the factor dimensionality, removes the correlation, and improves the overall interpretability of the result. Sulfur, nitrogen, and carbon oxide pollution is higher from December to February and lower from May to July. Particulate pollution is most serious in April and May and gradually decreases from December to February; both types of pollution are light in September, when the air quality is mainly excellent or good.
Finally, the scores of each sample on the five principal components were calculated, and the normalized principal component score vectors were input into the BP neural network for air quality level prediction modeling. With the same network training parameter configuration, the prediction accuracies of the STEPDISC-PCA-BP, PCA-BP, and BP models were 85.5%, 61.77%, and 56.69%, respectively. The STEPDISC-PCA-BP model has a simpler structure, higher prediction accuracy, shorter training time, better network performance, stronger generalization ability, and faster convergence, and it performs better when applied to local air quality prediction. Research on air quality prediction algorithms such as decision trees, convolutional neural networks, and random forests can promote the application of machine learning in the environmental field, advance science and technology, and accelerate the process of environmental protection.

Author Contributions

Conceptualization, M.L. and H.H.; data curation, M.L., Y.Z. and J.L.; formal analysis, M.L., Y.Z. and J.L.; funding acquisition, H.H. and L.Z.; investigation, M.L.; methodology, H.H.; project administration, H.H. and L.Z.; resources, H.H.; software, M.L., L.Z. and J.L.; supervision, H.H.; validation, M.L., H.H. and Y.Z.; visualization, M.L.; writing—original draft, M.L.; writing—review and editing, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research Program of Science and Technology at Universities of Inner Mongolia Autonomous Region of China, grant number NJZZ23044; Multidisciplinary Interdisciplinary Research Project at Universities of Inner Mongolia Autonomous Region, grant number BR231516; National Natural Science Foundation of China, grant number 51969025; Natural Science Foundation of Inner Mongolia Autonomous Region of China, grant number 2023MS05023; National Natural Science Foundation of China, grant number 61962047; Natural Science Foundation of Inner Mongolia Autonomous Region of China, grant number 2021MS06009; Youth Science and Technology Talent Development (Innovation Team) Program at Universities of Inner Mongolia Autonomous Region of China, grant number NMGIRT2313; National Natural Science Foundation of China, grant number 32160813; Natural Science Foundation of Inner Mongolia Autonomous Region of China, grant number 2021BS03038; Research Program of Science and Technology at Universities of Inner Mongolia Autonomous Region of China, grant number NJZY21457.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Bayannur city location: (a) Map of China; (b) Map of Inner Mongolia; (c) Map of Bayannur city.

Figure 2. Three-layer BP neural network structure.

Figure 3. Annual averages of pollutant concentrations for 2015–2019: (a) Annual averages of PM₂.₅, PM₁₀, SO₂, and NO₂ concentrations; (b) Annual average CO concentration.

Figure 4. Annual average of AQI for 2015–2019.

Figure 5. Monthly average for prin1 and prin2.

Figure 6. Structure of STEPDISC-PCA-BP model.

Figure 7. Training results of STEPDISC-PCA-BP model.

Figure 8. Training error of STEPDISC-PCA-BP model.

Figure 9. Comparison results of performance evaluation of STEPDISC-PCA-BP, PCA-BP, and BP models.

Table 1. Meteorological factors sample.

Date	AP (0.1 hPa)	AT (0.1 °C)	WS (0.1 m/s)	SH (0.1 h)	RH (1%)	DP (0.1 mm)
1 January 2015	9109	−96	14	81	39	0
2 January 2015	9060	−70	7	78	47	0
3 January 2015	8985	−42	8	83	38	0
…	…	…	…	…	…	…
11 January 2020	9047	−122	16	77	78	0
12 January 2020	9042	−124	19	78	79	3
13 January 2020	9066	−167	14	78	76	0

Table 2. Air pollutant factors sample during the same period.

PM₂.₅ (μg/m³)	PM₁₀ (μg/m³)	SO₂ (μg/m³)	NO₂ (μg/m³)	CO (mg/m³)	O₃ (μg/m³)	AQI	AQL
39	83	57	27	0.58	43	67	2
96	152	118	48	1.75	26	128	3
67	128	85	36	1.27	46	97	2
…	…	…	…	…	…	…	…
110	159	38	43	1.55	54	145	3
145	228	38	52	2	38	187	4
45	66	20	27	1	54	63	2

Table 3. Stepwise selecting results of factors.

Step	Entered	Wilks' λ	F Value	Pr > F
1	PM₁₀	0.280	909.7	<0.0001
2	PM₂.₅	0.218	99.97	<0.0001
3	AP	0.195	41.38	<0.0001
4	NO₂	0.186	17.68	<0.0001
5	CO	0.182	6.30	<0.0001
6	RH	0.180	5.46	<0.0001
7	SO₂	0.177	4.29	<0.0001
8	SH	0.175	4.06	<0.0001
9	WS	0.174	3.41	<0.0001
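
The F values in Table 3 can be related to the successive Wilks' λ values through the usual partial-F relation for stepwise discriminant analysis: with p factors already entered, F = ((1 − Λ*) / Λ*) · (n − g − p) / (g − 1), where Λ* is the ratio of the new λ to the previous one. The following is a minimal sketch of that relation, not the procedure actually run for the study. It assumes n = 1771 samples and g = 6 groups (both taken from Table 7), and the function names are illustrative.

```python
# Wilks' lambda after each step, copied from Table 3 (rounded to 3 decimals).
WILKS = [0.280, 0.218, 0.195, 0.186, 0.182, 0.180, 0.177, 0.175, 0.174]

N_SAMPLES = 1771  # assumption: all samples listed in Table 7 were used
N_GROUPS = 6      # assumption: the six air quality levels are the groups


def partial_f(lambda_prev, lambda_new, n_entered, n=N_SAMPLES, g=N_GROUPS):
    """Partial F for a factor entering when n_entered factors are already in."""
    partial_lambda = lambda_new / lambda_prev
    return (1.0 - partial_lambda) / partial_lambda * (n - g - n_entered) / (g - 1)


def f_to_enter(wilks=WILKS):
    """Approximate F value for every step, starting from the empty model."""
    values = []
    previous = 1.0  # Wilks' lambda of a model with no factors
    for step, current in enumerate(wilks):
        values.append(partial_f(previous, current, n_entered=step))
        previous = current
    return values


if __name__ == "__main__":
    for step, f_value in enumerate(f_to_enter(), start=1):
        print(f"step {step}: F ~ {f_value:.2f}")
```

Under these assumptions the first three steps come out within about 1% of the tabulated F values. Later steps drift further, because the differences between consecutive λ values are small compared with the three-decimal rounding in the table.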

Table 4. Pearson correlation coefficient matrix for discriminant significant factors.

	AP	WS	SH	RH	PM₂.₅	PM₁₀	SO₂	NO₂	CO
AP	1
WS	−0.2644	1
SH	−0.2047	−0.0193	1
RH	0.1595	−0.1473	−0.5112	1
PM₂.₅	0.0038	−0.0191	−0.2614	0.1867	1
PM₁₀	−0.1162	0.1787	−0.1207	−0.0974	0.7765	1
SO₂	0.3566	−0.3195	−0.2085	0.1481	0.4875	0.1332	1
NO₂	0.2297	−0.3680	−0.0944	0.1391	0.5049	0.1350	0.6993	1
CO	0.2309	−0.2254	−0.2404	0.2712	0.5511	0.1693	0.7236	0.7476	1
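
Each entry of Table 4 is a Pearson correlation coefficient between two factors. As a small illustration of the calculation, the sketch below applies the textbook formula to the PM₂.₅ and PM₁₀ columns of the six rows printed in Table 2. Because it uses only those six rows rather than the full sample, its result is not expected to reproduce the 0.7765 reported in Table 4; the function name is illustrative.

```python
import math

# PM2.5 and PM10 values from the six sample rows printed in Table 2.
PM25 = [39, 96, 67, 110, 145, 45]
PM10 = [83, 152, 128, 159, 228, 66]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(f"r(PM2.5, PM10) on six sample rows: {pearson(PM25, PM10):.4f}")
```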

Table 5. Eigenvalues and contribution rates.

Principal Component Number	Eigenvalue	Contribution Rate	Cumulative Contribution Rate
1	3.382	0.376	0.376
2	1.727	0.192	0.568
3	1.377	0.153	0.721
4	0.793	0.088	0.809
5	0.694	0.077	0.886
6	0.441	0.049	0.935
7	0.275	0.031	0.966
8	0.215	0.024	0.989
9	0.096	0.011	1.000
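
The contribution rates in Table 5 are each eigenvalue divided by the sum of all eigenvalues, and the number of retained components is the smallest count whose cumulative rate reaches the 0.85 threshold. A minimal sketch of that arithmetic, using the eigenvalues from Table 5 (function names are illustrative):

```python
# Eigenvalues of the nine principal components, copied from Table 5.
EIGENVALUES = [3.382, 1.727, 1.377, 0.793, 0.694, 0.441, 0.275, 0.215, 0.096]


def contribution_rates(eigenvalues):
    """Share of the total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_needed(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative rate >= threshold."""
    cumulative = 0.0
    for count, rate in enumerate(contribution_rates(eigenvalues), start=1):
        cumulative += rate
        if cumulative >= threshold:
            return count, cumulative
    return len(eigenvalues), cumulative


if __name__ == "__main__":
    count, cumulative = components_needed(EIGENVALUES)
    print(f"components kept: {count}, cumulative contribution: {cumulative:.3f}")
```

With the tabulated eigenvalues, four components reach a cumulative rate of about 0.809, which is below the threshold, so the fifth component is the first to pass it.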

Table 6. Principal component score results.

Date	AQL	Prin1	Prin2	Prin3	Prin4	Prin5
1 January 2015	2	1.324642	−1.21233	0.624716	1.287813	−0.47382
2 January 2015	3	6.04963	−0.42106	1.733634	0.010815	0.17862
3 January 2015	2	3.384212	−0.21597	1.558986	−0.49273	−0.23203
…	…	…	…	…	…	…
11 January 2020	3	4.196798	0.431482	−0.63176	−0.78609	−0.5376
12 January 2020	4	5.687232	1.622739	−0.4458	−0.91892	−0.35238
13 January 2020	2	1.392678	−1.31886	−1.09655	−0.43905	−0.70181
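
Before the scores in Table 6 are fed to the BP network they are normalized. The sketch below shows one common choice, column-wise min-max scaling to [0, 1], applied to the six rows printed in Table 6. Min-max scaling is an assumption made for illustration; the table itself does not say which normalization was applied, and in practice the minimum and maximum would come from the full training set rather than from six rows.

```python
# Prin1..Prin5 scores for the six sample rows printed in Table 6.
SCORES = [
    [1.324642, -1.21233, 0.624716, 1.287813, -0.47382],
    [6.04963, -0.42106, 1.733634, 0.010815, 0.17862],
    [3.384212, -0.21597, 1.558986, -0.49273, -0.23203],
    [4.196798, 0.431482, -0.63176, -0.78609, -0.5376],
    [5.687232, 1.622739, -0.4458, -0.91892, -0.35238],
    [1.392678, -1.31886, -1.09655, -0.43905, -0.70181],
]


def min_max_normalize(rows):
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


if __name__ == "__main__":
    for row in min_max_normalize(SCORES):
        print([round(value, 3) for value in row])
```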

Table 7. Training set and test set division.

Level	Training Set	Test Set	Total Sample
1	247	106	353
2	804	345	1149
3	119	51	170
4	43	18	61
5	17	7	24
6	10	4	14
Total	1240	531	1771
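
The counts in Table 7 are consistent with splitting each level separately, with about 70% of its samples going to the training set. The sketch below reproduces the per-level counts under that assumption. The 70% proportion is inferred from the counts (1240 of 1771), not stated in the table, and the sketch only computes how many samples go to each set, not which ones.

```python
# Total number of samples per air quality level, copied from Table 7.
LEVEL_TOTALS = {1: 353, 2: 1149, 3: 170, 4: 61, 5: 24, 6: 14}


def stratified_counts(level_totals, train_percent=70):
    """Return {level: (train_count, test_count)} with a per-level split.

    Integer arithmetic with round-half-up avoids floating-point surprises.
    """
    counts = {}
    for level, total in level_totals.items():
        train = (total * train_percent + 50) // 100
        counts[level] = (train, total - train)
    return counts


if __name__ == "__main__":
    split = stratified_counts(LEVEL_TOTALS)
    for level, (train, test) in split.items():
        print(f"level {level}: train {train}, test {test}")
    print("train total:", sum(train for train, _ in split.values()))
    print("test total:", sum(test for _, test in split.values()))
```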

Table 8. Confusion matrix of STEPDISC-PCA-BP model.

		Observed Level
		1	2	3	4	5	6
Predicted Level	1	74	16	0	0	0	0
	1	13.9%	3.0%	0.0%	0.0%	0.0%	0.0%
	2	32	329	15	0	0	0
	2	6.0%	62.0%	2.8%	0.0%	0.0%	0.0%
	3	0	0	33	5	0	0
	3	0.0%	0.0%	6.2%	0.9%	0.0%	0.0%
	4	0	0	2	9	1	0
	4	0.0%	0.0%	0.4%	1.7%	0.2%	0.0%
	5	0	0	1	3	6	1
	5	0.0%	0.0%	0.2%	0.6%	1.1%	0.2%
	6	0	0	0	1	0	3
	6	0.0%	0.0%	0.0%	0.2%	0.0%	0.6%
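
The overall accuracy and the percentage rows of Table 8 follow directly from the counts: accuracy is the sum of the diagonal divided by the number of test samples, and each percentage is a cell count divided by the same total. A short sketch of that arithmetic (function names are illustrative):

```python
# Counts from Table 8: rows are predicted levels 1-6, columns are observed levels 1-6.
CONFUSION = [
    [74, 16, 0, 0, 0, 0],
    [32, 329, 15, 0, 0, 0],
    [0, 0, 33, 5, 0, 0],
    [0, 0, 2, 9, 1, 0],
    [0, 0, 1, 3, 6, 1],
    [0, 0, 0, 1, 0, 3],
]


def overall_accuracy(matrix):
    """Fraction of test samples on the diagonal (predicted == observed)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cell_percentages(matrix):
    """Each cell as a percentage of all test samples, to one decimal place."""
    total = sum(sum(row) for row in matrix)
    return [[round(100.0 * count / total, 1) for count in row] for row in matrix]


if __name__ == "__main__":
    print(f"accuracy: {overall_accuracy(CONFUSION):.3f}")
    for row in cell_percentages(CONFUSION):
        print(row)
```

The diagonal holds 454 of the 531 test samples, which gives the 85.5% accuracy reported for the STEPDISC-PCA-BP model.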

Table 9. TP, FN, FP, and TN values of the STEPDISC-PCA-BP model.

Level	TP	FN	FP	TN
1	74	32	16	409
2	329	16	47	139
3	33	18	5	475
4	9	9	3	510
5	6	1	5	519
6	3	1	1	526
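
The TP, FN, FP, and TN values can be derived from the Table 8 counts by treating each level in turn as the positive class (one-vs-rest). The following sketch shows that derivation; it assumes, as in Table 8, that rows are predicted levels and columns are observed levels, and the function name is illustrative.

```python
# Counts from Table 8: rows are predicted levels 1-6, columns are observed levels 1-6.
CONFUSION = [
    [74, 16, 0, 0, 0, 0],
    [32, 329, 15, 0, 0, 0],
    [0, 0, 33, 5, 0, 0],
    [0, 0, 2, 9, 1, 0],
    [0, 0, 1, 3, 6, 1],
    [0, 0, 0, 1, 0, 3],
]


def one_vs_rest(matrix):
    """Return {level: (TP, FN, FP, TN)} for every level, numbered from 1."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    result = {}
    for k in range(size):
        tp = matrix[k][k]
        fp = sum(matrix[k]) - tp                          # predicted k, observed another level
        fn = sum(matrix[i][k] for i in range(size)) - tp  # observed k, predicted another level
        tn = total - tp - fp - fn
        result[k + 1] = (tp, fn, fp, tn)
    return result


if __name__ == "__main__":
    print("Level\tTP\tFN\tFP\tTN")
    for level, (tp, fn, fp, tn) in one_vs_rest(CONFUSION).items():
        print(f"{level}\t{tp}\t{fn}\t{fp}\t{tn}")
```

For each level, TP + FN equals the column total in Table 8, that is, the number of observed test samples at that level, and the four values always sum to the 531 test samples.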

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
