1. Introduction
Environmental pollution is becoming increasingly severe as industrialization and urbanization accelerate, attracting the attention of governments and experts worldwide [1]. PM2.5, an essential factor causing haze [2] and reducing atmospheric visibility [3], contributes much more to heavily polluted weather than other pollutants [4]. Many researchers have studied the relationship between PM2.5 and health. One study shows that breathing air with excessive pollution levels for a long time can worsen cardiovascular and respiratory diseases [5]. Ashtari et al. [6] demonstrated a positive relationship between air quality and multiple sclerosis severity through an eight-year study in the Isfahan region of Iran. An estimated 1.65 million to 2.19 million early deaths every year are associated with PM2.5 [7]. To prevent air pollution and protect the environment, air pollution prediction is crucial [8]. An accurate PM2.5 concentration forecast model can help the public make sensible travel arrangements, lessen the risks to people's health, and guide governments and the public to make the right choices.
Obtaining accurate PM2.5 concentration predictions from the existing large quantity of historical meteorological and pollutant monitoring data is a crucial issue in air pollution research. The concentration of pollutants at a given point in time can affect future pollutant concentrations, and the effects may last hours, days, or longer. At present, the time granularity in pollutant prediction is mainly hourly [9] or daily [10], and some studies have examined pollutant emissions on a yearly scale [11,12]. Current pollutant prediction models are classified into mechanism models, statistical models, and deep learning models. Mechanism models simulate the chemical and physical evolution of pollutant transport through atmospheric kinetic theory. Typical methods are community multiscale air quality (CMAQ) [13], the nested air quality prediction and modeling system (NAQPMS) [14], and the weather research and forecasting (WRF) model [15]. The advantage of these methods is that the modeling does not require historical data. Still, the complexity of the model calculations and parameter settings requires a strong background in atmospheric science, limiting the practical application of these methods. With the development of sensing technology, data acquisition has become easy, and statistical models are now widely used for pollutant prediction. Statistical models are divided into traditional statistical models and machine learning models. For pollutant prediction, the traditional statistical models known as the autoregressive integrated moving average (ARIMA) [16] and the autoregressive moving average with exogenous input (ARMAX) [17] are frequently utilized. However, they are based on linear presumptions and have straightforward model structures that cannot be used to solve nonlinear problems. Machine learning techniques are also used to address nonlinear problems in pollutant prediction. Machine learning-based prediction methods include back propagation (BP) [18], the extreme learning machine (ELM) [19], and support vector regression (SVR) [20]. Even though machine learning algorithms have produced positive results for pollutant prediction, PM2.5 concentration sequences exhibit complex fluctuations over time and are influenced by a variety of factors. The prediction accuracy of statistical models is therefore constrained because they cannot satisfy the requirements of multivariate nonlinear data prediction.
With the development of computer technology, deep learning has demonstrated outstanding nonlinear data processing capabilities. For example, computer vision [21,22], image classification [23], and natural language processing [24] have demonstrated the effectiveness of deep learning models. Because recurrent neural networks (RNNs) are good at dealing with sequential data, they have been used in numerous models to predict pollutant concentrations [25]. Nevertheless, RNNs suffer from gradient vanishing and gradient explosion when predicting long time series. Long short-term memory (LSTM) [26], a variant of the RNN, can solve this problem [27]. To demonstrate the viability of using LSTM models in short-term prediction, LSTM-based models have been compared with statistical models [28,29,30]. Both Wu et al. [31] and Wang et al. [32] predicted air pollution using LSTM; the difference was that Wang et al. [32] took the role of influencing factors into account through the chi-square test. As the LSTM can only encode data in one direction, Graves and Schmidhuber [33] proposed the bidirectional long short-term memory (BiLSTM) model, which contains a forward LSTM unit and a backward LSTM unit. Experiments have shown that the BiLSTM is more accurate and stable in pollutant prediction than the single LSTM model, particularly for maximums and minimums. A single neural network cannot handle complex prediction tasks, since changes in pollutant concentration are complex and have spatiotemporal characteristics, so more and more combination models are being developed. Bai et al. [34] considered the seasonal characteristics of PM2.5 concentration variation and then obtained PM2.5 predicted values using a stacked auto-encoder model. Pak et al. [35] extracted the spatial features of multidimensional time series using a convolutional neural network and then used LSTM to make predictions. Chang et al. [36] achieved PM2.5 prediction by aggregating multiple LSTM units and considered the correlation of influencing factors in the model. A combination of an auto-encoder and BiLSTM was proposed as a new model [37]. Furthermore, the encoder–decoder model based on LSTM has also been applied to time series prediction [38,39]. Because encoders and decoders based on LSTM can fully extract time series features, they are widely used in pollutant concentration prediction [40,41]. These models use an LSTM unit as an encoder to extract the features of the sequence data and then use another LSTM unit as a decoder to decode the encoded vector and obtain the predicted values. Existing studies show that BiLSTM performs better than LSTM in time series prediction, and so we construct an encoder–decoder prediction model using BiLSTM.
In addition to prediction models, the presence of excessive features in PM2.5 short-term prediction can cause information redundancy and lead to error accumulation [42]. Multiple-series data can also lead to the overfitting of models [43]. Based on the above issues, we select features from the many pollutants and the large quantity of historical meteorological monitoring data. Feature selection improves prediction accuracy by selecting strongly correlated variables from multidimensional data. The methods commonly used for feature selection include mutual information [44,45], the Pearson correlation coefficient [46,47], and the Kendall correlation coefficient [48]. As part of their study, Bai et al. [34] calculated the Kendall correlation coefficient for PM2.5 and meteorological variables. Zhang et al. [37] solved the same problem using the Pearson correlation coefficient, but the Pearson correlation coefficient is not a good measure of nonlinear relationships [49]. To measure correlations between PM2.5 and other pollutants, Wang et al. [2] used interval gray incidence analysis. However, the above studies did not consider the redundancy between features. Studies also show that adjacent monitoring sites can provide important information for the prediction model. To overcome these limitations, we use the maximum correlation minimum redundancy (MRMR) algorithm based on mutual information [50] to decide the model's input variables. Mutual information is also used to calculate the correlation of PM2.5 between adjacent stations.
Although the existing methods can adequately extract time series features, they only consider the correlation between multidimensional feature variables and ignore the effect of feature redundancy on model accuracy. In addition, few studies have been conducted to predict pollutant concentrations in areas without monitoring stations. In this paper, a new combination model for PM2.5 concentration prediction is proposed based on the above analysis. This study contributes in the following areas:
- (1)
Based on the nonlinear characteristics of the PM2.5 concentration series, this paper adopts the MRMR algorithm based on mutual information for feature selection, removing the effect of redundant features with full consideration of the influencing factors.
- (2)
To fully extract spatial and temporal features, we construct a dual encoder–decoder prediction model based on BiLSTM, named ED-BiLSTM, for predicting PM2.5 concentration. The proposed model accurately predicts PM2.5 concentration.
- (3)
To obtain the PM2.5 concentration distribution in Xi’an, China, we combine the ED-BiLSTM model with inverse distance weight (IDW) spatial interpolation to obtain predicted PM2.5 concentration values at locations without monitoring stations. Finally, the overall distribution is obtained through ArcGIS visualization.
3. Methods
We propose a new three-stage combination model for use in short-term PM2.5 prediction. In the first part of the research, we select the temporal and spatial features related to PM2.5 concentration through the MRMR algorithm. For the second part, we construct a dual encoder–decoder model based on BiLSTM to train and learn the spatiotemporal features and perform prediction. Finally, the study area is divided into 1 km × 1 km grids, and the IDW method is used to interpolate the grids to obtain PM2.5 concentration values for all grids.
Figure 1 presents a flow chart for the proposed model. This section describes the methods involved in the model in detail.
3.1. MRMR Feature Selection Algorithm
PM2.5 concentration is affected by other atmospheric pollutants, meteorological conditions [51], and the PM2.5 concentration at adjacent sites [52]. However, using multiple feature variables may result in information redundancy and reduce the prediction model’s accuracy [42]. To eliminate information redundancy between features, we use the MRMR feature selection algorithm based on mutual information to remove irrelevant and redundant features, reduce computational consumption, and improve prediction accuracy.
The MRMR algorithm considers the correlation and redundancy between features simultaneously. The correlation between random variables $x$ and $y$ can be calculated using mutual information (MI) from Equation (3):

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, \mathrm{d}x \, \mathrm{d}y \quad (3)$$

where $p(x)$ and $p(y)$ are the probability distribution functions of $x$ and $y$, $p(x, y)$ is the joint probability distribution function of $x$ and $y$, and $I(x, y)$ is the mutual information of $x$ and $y$.
Suppose that $S$ denotes the set of feature variables $\{x_1, x_2, \ldots, x_m\}$ and that $m$ represents the number of features. The correlation between the feature variables and the target variable is calculated by Equation (4), and the redundancy between the feature variables is calculated by Equation (5):

$$C = \frac{1}{m} \sum_{x_i \in S} I(x_i, y) \quad (4)$$

$$R = \frac{1}{m^2} \sum_{x_i, x_j \in S} I(x_i, x_j) \quad (5)$$

where $x_i$ is the value of the $i$th feature variable, $y$ is the value of the target variable, $I(x_i, x_j)$ is the MI between feature variables, $I(x_i, y)$ is the MI between the target variable and a feature variable, $C$ is the correlation between the feature variables and the target variable, and $R$ is the redundancy between the features.
The objective function for selecting the initial feature subset is shown in Equations (6) and (7):

$$\max \Phi(C, R) \quad (6)$$

$$\Phi = C - R \quad (7)$$

After determining the initial feature subset, an incremental search is performed using Equation (8):

$$\max_{x_j \in X - S} \left[ I(x_j, y) - \frac{1}{m} \sum_{x_i \in S} I(x_j, x_i) \right] \quad (8)$$

The steps for conducting an incremental search are as follows:
Step 1: The initial feature set determined by the MRMR feature selection algorithm is represented as $S$, which contains $m$ features. The remaining features are expressed as $S' = X - S$. The feature variables contained in $S'$ are denoted as $x'_1, x'_2, \ldots$.
Step 2: Add $x'_1$ to $S$ to form feature set $S_1$. Input $S$ and $S_1$ into the prediction model to calculate the evaluation index of the model, respectively. If the prediction error corresponding to input $S$ is smaller, the original data set is kept. If the error corresponding to input $S_1$ is smaller, $S_1$ replaces $S$ as the new data set.
Step 3: Add the remaining features in $S'$ to the feature set with the minimum error in turn and perform Step 2 until the final feature set with the minimum error is obtained.
The final feature set fully preserves the correlation between the multidimensional feature variables and the prediction target and achieves a dimensionality reduction of the prediction model's input, which reduces the computational cost.
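As a concrete illustration of the relevance/redundancy trade-off described above, the following Python sketch selects features greedily by maximizing MI with the target minus mean MI with already-selected features. This is a minimal, hypothetical sketch: the histogram-based MI estimator, the bin count, and the function names (`mutual_info`, `mrmr_select`) are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def mutual_info(a, b, bins=10):
    """Estimate I(a; b) in nats from a 2-D histogram (a discretized Equation (3))."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(a)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(b)
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mrmr_select(X, y, n_select, bins=10):
    """Greedy MRMR: at each step pick the feature maximizing
    relevance I(x_j; y) minus mean redundancy with the selected set."""
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, j], y, bins) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]        # start with the most relevant feature
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, i], bins) for i in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

On synthetic data where one candidate duplicates an already-selected feature, the redundancy penalty steers the selection toward a less correlated feature instead, which is the behavior the MRMR criterion is designed to produce.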
3.2. ED-BiLSTM Prediction Model
The ED-BiLSTM model comprises a dual encoder and a decoder based on BiLSTM. One encoder encodes the temporal features, and the other encodes the spatial features. The encoded vectors are then aggregated and input to the decoder to obtain the ED-BiLSTM model. Finally, PM2.5 concentration prediction is performed on the test set.
In the deep neural network prediction model, the time series modeling and time step calculation for the characteristic variables are the basis of the realization of the prediction model. Therefore, before proposing the ED-BiLSTM prediction model, we first introduce the time series modeling and time step calculation methods.
3.2.1. Time Series Sample Modeling
To convert the data into a form that the computer can understand, we model the time series through a rolling window. An example of time series sample modeling by a rolling window is shown in Figure 2. Suppose that the period contains 10 records $x_1, x_2, \ldots, x_{10}$, where $x_i \in \mathbb{R}^n$ and $n$ is the number of feature dimensions. When $\Delta t = 6$, for Sample 1, $x_1, x_2, \ldots, x_6$ are the features and $x_7$ is the label. For Sample 2, the features are $x_2, x_3, \ldots, x_7$ and $x_8$ is the label, and all samples are modeled in the same way. In the prediction model, a small $\Delta t$ will make the input information incomplete, while a larger one will increase the noise and the computational consumption [27]. We determined $\Delta t$ by calculating the autocorrelation coefficient (ACF) and the partial autocorrelation coefficient (PACF), as implemented in Section 3.2.2.
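The rolling-window construction above can be sketched as follows. The helper name `make_samples` and the argument name `dt` (for $\Delta t$) are our own; the logic mirrors the example in the text, where 10 records and $\Delta t = 6$ yield samples whose features are six consecutive records and whose label is the next record.

```python
import numpy as np

def make_samples(series, dt):
    """Slide a window of length dt over the record sequence: the dt records
    inside the window are the features, and the next record is the label."""
    X, y = [], []
    for start in range(len(series) - dt):
        X.append(series[start:start + dt])   # e.g. x1..x6 for Sample 1 when dt = 6
        y.append(series[start + dt])         # x7 is the label of Sample 1
    return np.array(X), np.array(y)
```

With 10 records and `dt = 6`, this produces 4 samples, matching the enumeration in the text; the same function works unchanged when each record is an $n$-dimensional feature vector.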
3.2.2. Time Step Calculation
The number of time stamps in each sample is determined by the size of the time step $\Delta t$, which reflects the lag effect. A small value of $\Delta t$ does not ensure sufficient information input, while a large value will introduce irrelevant information and cause computational loss. Therefore, determining an appropriate $\Delta t$ is crucial to designing the prediction model. We determined $\Delta t$ by analyzing the ACF and PACF.
As a measure of correlation, the ACF describes the degree of correlation between values of a series over time. For the time series $y_1, y_2, \ldots, y_T$, the ACF between the values of $y_t$ and $y_{t-k}$ can be measured by Equation (9); a larger ACF value indicates a stronger correlation between $y_t$ and $y_{t-k}$:

$$\mathrm{ACF}(k) = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2} \quad (9)$$

where $y_t$ is the label, $y_{t-k}$ is the sample with a lag of $k$ time steps, and $\bar{y}$ is the average value of the sample.
Compared to the ACF, the PACF focuses more on the direct correlation between $y_t$ and $y_{t-k}$. The PACF of $y_t$ and $y_{t-k}$ is the correlation coefficient between $y_t$ and $y_{t-k}$ after removing the indirect effects of $y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}$. The formula for the PACF is given in Equations (10)–(12). The closer the value of the PACF is to 0, the weaker the correlation between $y_t$ and $y_{t-k}$.
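The sample ACF of Equation (9) can be sketched directly in a few lines of numpy. This is an illustrative helper (the function name `acf` is ours); in practice a statistics library would also supply the PACF of Equations (10)–(12), which we omit here.

```python
import numpy as np

def acf(y, k):
    """Sample autocorrelation at lag k (Equation (9)):
    sum((y_t - ybar)(y_{t-k} - ybar)) / sum((y_t - ybar)^2)."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    den = np.sum((y - ybar) ** 2)
    if k == 0:
        return 1.0                       # a series is perfectly correlated with itself
    num = np.sum((y[k:] - ybar) * (y[:-k] - ybar))
    return num / den
```

A slowly varying series (e.g. a linear trend) keeps a high ACF at small lags, while white noise decays toward 0; scanning `acf(y, k)` over increasing `k` is one way to pick the window length $\Delta t$ for the rolling-window samples.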
3.2.3. ED-BiLSTM Model
The BiLSTM [
33] neural network structure contains a forward LSTM cell and a backward LSTM cell. The BiLSTM can fully mine time series information and has shown outstanding performance capacity in time series prediction tasks [
27,
29,
53].
Figure 3 shows a comparison between LSTM and BiLSTM. We can see that BiLSTM performs not only forward, but also backward, extraction of features compared with LSTM. To extract bi-directional features from sequence data, we use the BiLSTM as the main component to construct the dual encoder and decoder model. To provide a deeper understanding of BiLSTM, we give the detailed calculation process of LSTM.
Figure 4 shows that each LSTM neural unit consists of an input gate, an output gate, and a forget gate. Data are received through the input gate, historical information is preserved through the forget gate, and information is output through the output gate. Equations (13)–(18) give the calculation formulas:

$$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f) \quad (13)$$

$$i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i) \quad (14)$$

$$\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c) \quad (15)$$

$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \quad (16)$$

$$o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o) \quad (17)$$

$$h_t = o_t * \tanh(c_t) \quad (18)$$

where $h_{t-1}$ is the output of the hidden layer at $t-1$, $c_{t-1}$ is the output of the memory cell at $t-1$, and $x_t$ is the current input vector. $W_f$, $W_i$, $W_c$, and $W_o$ are the parameter matrices controlling the hidden state; $U_f$, $U_i$, $U_c$, and $U_o$ are the parameter matrices controlling the input information; $b_f$, $b_i$, $b_c$, and $b_o$ are the bias vectors; $\sigma$ is the sigmoid function; tanh is the hyperbolic tangent function; and $*$ is the Hadamard operation. $f_t$ is the forget gate used to control the degree of forgetting of the previous state $c_{t-1}$, $\tilde{c}_t$ is the memory of the current input information, $i_t$ is the memory gate, and $i_t$ and $\tilde{c}_t$ perform Hadamard operations to determine the storage of new information. The forgetting of past states and the updating of the current input information together determine the state vector $c_t$. The output state $h_t$ is obtained by the operation of $o_t$ and $\tanh(c_t)$.
The output of the BiLSTM neural network unit is a combination of the forward LSTM hidden layer output $\overrightarrow{h_t}$ and the backward LSTM hidden layer output $\overleftarrow{h_t}$. The calculation process is shown in Equation (19):

$$h_t = \alpha \overrightarrow{h_t} + \beta \overleftarrow{h_t} \quad (19)$$

where $\alpha$ and $\beta$ are the output weights of the forward and backward LSTM hidden layers, respectively.
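To make Equations (13)–(19) concrete, the following numpy sketch implements one LSTM step and the bidirectional combination. It is a minimal illustration, not the trained model: the parameter dictionary layout, the random initialization helper `random_params`, and the fixed equal output weights are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step following Equations (13)-(18). P holds the parameter
    matrices W_* (hidden state), U_* (input), and bias vectors b_*."""
    f = sigmoid(P["Wf"] @ h_prev + P["Uf"] @ x_t + P["bf"])        # forget gate (13)
    i = sigmoid(P["Wi"] @ h_prev + P["Ui"] @ x_t + P["bi"])        # memory gate (14)
    c_tilde = np.tanh(P["Wc"] @ h_prev + P["Uc"] @ x_t + P["bc"])  # candidate memory (15)
    c = f * c_prev + i * c_tilde                                   # new cell state (16)
    o = sigmoid(P["Wo"] @ h_prev + P["Uo"] @ x_t + P["bo"])        # output gate (17)
    h = o * np.tanh(c)                                             # hidden output (18)
    return h, c

def random_params(n_in, n_hidden, rng):
    """Illustrative small random initialization of the LSTM parameters."""
    P = {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for k in ("Wf", "Wi", "Wc", "Wo")}
    P.update({k: rng.normal(scale=0.1, size=(n_hidden, n_in)) for k in ("Uf", "Ui", "Uc", "Uo")})
    P.update({k: np.zeros(n_hidden) for k in ("bf", "bi", "bc", "bo")})
    return P

def bilstm_outputs(xs, P_fwd, P_bwd, alpha=0.5, beta=0.5):
    """Run a forward pass over the sequence and a backward pass over the
    reversed sequence, then combine the hidden outputs as in Equation (19)."""
    n_hidden = P_fwd["bf"].shape[0]
    def run(seq, P):
        h, c, outs = np.zeros(n_hidden), np.zeros(n_hidden), []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, P)
            outs.append(h)
        return outs
    fwd = run(xs, P_fwd)
    bwd = run(xs[::-1], P_bwd)[::-1]    # realign backward outputs with time order
    return [alpha * hf + beta * hb for hf, hb in zip(fwd, bwd)]
```

Because $h_t = o_t * \tanh(c_t)$ with both factors bounded in magnitude by 1, every combined BiLSTM output component stays strictly inside $(-1, 1)$, which the gating structure guarantees regardless of the parameters.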
For multivariate time series, one hidden layer cannot express all the information, and adding hidden layers may give more accurate results. The general approach is to stack multiple hidden layers [54] and use the output of the previous hidden layer as the input of the next. Since both temporal and spatial features are considered in our study, simply adding hidden layers cannot solve the complex data feature problem. Therefore, we propose a dual encoder and decoder model (ED-BiLSTM) based on BiLSTM. Figure 5 shows the model structure, which consists of two parts: the encoder and the decoder. One of the encoders encodes the temporal features obtained from MRMR, and the other extracts spatial features from adjacent sites. Then, the encoded vectors from both encoders are aggregated and fed into the decoder for prediction. The ED-BiLSTM prediction model is also beneficial for interpolating grid cells without monitoring stations.
3.3. Inverse Distance Weight (IDW) Interpolation Method
With increasing public concern over environmental pollution, air pollutant prediction has made significant progress in recent years. However, for deep learning to be effective, large amounts of historical data are required, which leads most research on pollutant prediction to target monitoring stations [30,42]. There have also been a few studies of new sites with limited historical data [27]. However, for areas without monitoring sites, the absence of historical data has led to a lack of research. In this study, the ED-BiLSTM prediction model is combined with the IDW algorithm to interpolate a 1 km × 1 km spatial grid and obtain predicted values for each grid. The method is divided into the following steps. Firstly, the PM2.5 concentration at all monitoring stations in the study area at the next moment is predicted using the ED-BiLSTM model. Then, the study area is divided into grids of 1 km × 1 km, and the coordinates of each grid center point are determined based on latitude and longitude. Finally, the IDW spatial interpolation algorithm obtains the predicted values for all grids.
IDW is a point-based spatial interpolation algorithm. For an interpolated grid with center point coordinates $(x_0, y_0)$, the predicted PM2.5 value at this grid is calculated by Equation (20):

$$\hat{z}_0 = \frac{\sum_{i=1}^{n} z_i / d_i^k}{\sum_{i=1}^{n} 1 / d_i^k} \quad (20)$$

where $n$ is the number of monitoring stations, $z_i$ is the predicted value at monitoring station $i$, and $d_i$ indicates the distance between the interpolation grid $(x_0, y_0)$ and the monitoring point $(x_i, y_i)$, calculated by Equation (21):

$$d_i = \sqrt{(x_0 - x_i)^2 + (y_0 - y_i)^2} \quad (21)$$

Here, $k$ is the power exponent of the distance. With higher values of $k$, adjacent stations have a more significant effect on the interpolation; with smaller values of $k$, remote stations have a more significant effect. The core of the IDW algorithm is the selection of $k$. It has been shown that the search radius does not affect the results when $k$ is 3 or more [55]. Based on the analysis of previous studies, we set $k$ to 3.
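Equations (20) and (21) amount to a distance-weighted average of the station predictions. The sketch below (the function name `idw_predict` and the array layout are our own) applies them to one grid center; running it over every 1 km × 1 km cell yields the full surface.

```python
import numpy as np

def idw_predict(grid_xy, station_xy, station_values, k=3):
    """IDW interpolation (Equations (20)-(21)): weight each station's predicted
    value by the inverse k-th power of its distance to the grid center."""
    d = np.sqrt(np.sum((station_xy - np.asarray(grid_xy)) ** 2, axis=1))  # Eq. (21)
    if np.any(d == 0):                       # grid center coincides with a station
        return float(station_values[np.argmin(d)])
    w = 1.0 / d ** k
    return float(np.sum(w * station_values) / np.sum(w))                  # Eq. (20)
```

A grid center equidistant from two stations receives their plain average, while moving the center toward one station pulls the estimate toward that station's value — increasingly sharply as `k` grows, which is why the choice of `k` is the core of the algorithm.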
3.4. Evaluation Index
To quantify the prediction model performance, we evaluated the ED-BiLSTM model and the comparison models using the root mean square error (RMSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE), and R squared ($R^2$). The deviation between actual and predicted values is measured by RMSE and MAE, and MAPE determines the percentage deviation. The smaller the RMSE, MAE, and MAPE values, the more accurate the prediction. $R^2$ reflects the ability to fit the data, with values ranging from 0 to 1; the higher the $R^2$, the better the model fits the data. Extreme values do not affect MAE and MAPE, while RMSE uses the square of the error, which amplifies the prediction error and makes RMSE more sensitive to anomalies. Therefore, we combine these evaluation metrics to evaluate the ED-BiLSTM model. The formulas are shown in Equations (22)–(25):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \quad (22)$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \quad (23)$$

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \quad (24)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \quad (25)$$

where $y_i$ is the observed value of PM2.5 concentration, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the monitored values, and $N$ denotes the number of observed samples.
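The four metrics of Equations (22)–(25) can be computed together in a short helper; this is an illustrative sketch (the function name `evaluate` is ours), and note that MAPE assumes no observed value is zero.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute RMSE, MAE, MAPE (%), and R^2 as in Equations (22)-(25)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0   # assumes no zero observations
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

For a perfect prediction the errors vanish, so RMSE, MAE, and MAPE are 0 and $R^2$ is 1; squaring in RMSE is what makes it react more strongly than MAE to a single large miss.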
5. Discussion
This paper proposes a new combination prediction model, combining the MRMR feature selection algorithm, the BiLSTM neural network, and the IDW spatial interpolation algorithm to predict PM2.5 concentrations in Xi’an. Firstly, we select an appropriate feature subset using the MRMR algorithm, and experiments show that the algorithm fully considers the correlation and redundancy among features, which helps to improve the accuracy of the prediction model. Secondly, we compare our model with machine learning and single deep learning models. Because our model considers the spatiotemporal correlation of multivariate time series, it shows strong advantages. Thirdly, the model based on BiLSTM units is compared with models based on LSTM and auto-encoder units, and it is found that BiLSTM has a more robust feature extraction ability. Next, our model improves predictive ability by considering the spatial correlation between adjacent sites. Finally, the experimental analysis finds that the dual encoder model outperforms a single encoder in extracting the spatiotemporal features of multiple time series. In our comparative prediction experiments, the ED-BiLSTM model has the lowest prediction error. In addition, we divided Xi’an into 1 km × 1 km grids to predict PM2.5 concentrations in each grid and explored the process of PM2.5 concentration variation over the whole day through ArcGIS visualization. The proposed model combines a feature selection algorithm, a neural network, and an interpolation algorithm, which solves the feature redundancy problem in PM2.5 prediction and the difficulty of PM2.5 prediction in regions without monitoring stations.
According to the analysis of the PM2.5 distribution results in Figure 14, industrial pollution and traffic pollution are the primary sources of PM2.5. The public can reduce PM2.5 emissions by choosing public transportation more often and reducing the use of motor vehicles. Regarding industrial emissions, industrial production processes should be optimized to reduce and purify industrial waste gas. In addition, the emission of PM2.5 precursors, such as NOx, SO2, and volatile organic compounds, should be controlled in the management of PM2.5. When the level of PM2.5 pollution in the atmosphere is high, it is reasonable for the public to wear masks when traveling, which can effectively protect the cardiovascular system. Effective source prevention and control can cut off pollution sources and solve pollution problems. As such, environmental protection departments should formulate more reasonable pollution prevention and control countermeasures to eliminate pollution at the source.
Our model predicts PM2.5 concentration at the hourly level in the short term, and in the case analysis, the model’s accuracy is verified by selecting stations only in Xi’an. However, this does not mean that the proposed model is geographically limited; it can be applied to PM2.5 or other pollutant prediction in different cities. Due to the limitation of data sources, we only consider the effect of other pollutants and meteorological factors on PM2.5 concentration. In addition, we only consider the effects of feature correlation and the nonlinear features of PM2.5 concentration on the prediction results, ignoring the errors caused by the non-stationarity of the PM2.5 concentration series. In the future, we will analyze the non-stationarity of PM2.5 to achieve a more accurate PM2.5 concentration prediction. Optimizing the model parameters using optimization algorithms is also a future research direction.