Article

Long-Term Forecasting of Air Pollution Particulate Matter (PM2.5) and Analysis of Influencing Factors

Faculty of Applied Mathematics and Control Processes, Saint-Petersburg State University, Saint Petersburg 198504, Russia
*
Author to whom correspondence should be addressed.
Sustainability 2024, 16(1), 19; https://doi.org/10.3390/su16010019
Submission received: 16 October 2023 / Revised: 4 December 2023 / Accepted: 12 December 2023 / Published: 19 December 2023

Abstract

Long-term forecasting and analysis of PM2.5, a major air pollutant, is vital for environmental governance and sustainable development. We evaluated 10 machine learning and deep learning models using PM2.5 concentration data along with environmental variables. Employing explainable AI (XAI) technology facilitated explainability and formed the basis for factor analysis. At a 30-day forecasting horizon, ensemble learning surpassed deep learning in performance, with CatBoost emerging as the top-performing model. For forecasting horizons of 90 and 180 days, Bi-LSTM and Bi-GRU, respectively, exhibited the highest performance. An analysis of influencing factors by SHAP showed that PM10 exerted the greatest impact on PM2.5 forecasting. This effect was particularly pronounced at higher concentrations of CO; conversely, at lower CO concentrations, the impact of increased PM10 concentrations on PM2.5 was limited. Hence, it can be inferred that CO plays a pivotal role in driving these effects. Following CO, “dew point” and “temperature” were identified as influential factors. These factors exhibited varying degrees of linear correlation with PM2.5: temperature showed a negative correlation, while PM10, CO, and dew point generally demonstrated positive correlations with PM2.5.

1. Introduction

Air pollution, especially the presence of particulate matter (PM) [1,2,3], has become a significant global environmental concern due to its detrimental effects on human health and the environment. Of the various PM fractions, PM2.5—fine particles with a diameter of 2.5 μm or smaller—is particularly worrisome because it can be inhaled deep into the respiratory system, leading to respiratory and cardiovascular diseases. Accurate forecasting of PM2.5 concentrations [4] is essential for effective air quality management and policy implementation. Long-term forecasting of PM2.5 levels plays a crucial role in predicting trends, identifying potential risks, and formulating appropriate mitigation strategies. By projecting future PM2.5 concentrations, governments, policymakers, and public health agencies can make well-informed decisions to reduce exposure and protect vulnerable populations. To develop reliable long-term PM2.5 forecasts, researchers have focused on understanding the factors [5] that contribute to the formation and dispersion of PM2.5 particles. These factors can vary significantly across regions and time frames due to variations in emission sources, meteorological conditions, geographical features, and local atmospheric chemistry. Consequently, a thorough analysis of these factors is necessary for accurate predictions and effective decision-making. Under clear-sky conditions, visibility is negatively correlated with the diffuse fraction, AQI, PM10, and PM2.5, indicating a connection between scattered radiation and air quality [6].
Several emission factors and meteorological conditions can significantly affect the PM2.5 concentration in the atmosphere. Emission factors such as PM10, SO2, NO2, CO, and O3 contribute to the formation and presence of PM2.5 particles. PM10 emissions [7,8,9,10] are composed of larger particles, which can undergo atmospheric processes that produce smaller particles (PM2.5). SO2 emissions [9,10,11,12], predominantly from fossil fuel combustion, can react with other pollutants and form fine particulate matter, including PM2.5. Similarly, NO2 emissions [9,10,13,14,15], generated by high-temperature combustion processes, can contribute to the development of secondary particles, including PM2.5. While CO itself is not a direct precursor of PM2.5, it can indirectly influence the atmospheric chemistry that leads to particle formation. Ozone (O3) [9,10,13,14,15], a secondary pollutant formed through complex atmospheric photochemical reactions, can also contribute to the generation of secondary particles, including PM2.5.
Meteorological conditions [16,17,18,19,20] also play a crucial role in PM2.5 concentrations. Temperature influences the rate of the chemical reactions involved in the formation of secondary pollutants, including PM2.5; higher temperatures accelerate these reactions. Additionally, higher temperatures enhance the vertical mixing of pollutants in the atmosphere, affecting their dispersion and the resulting concentrations. Air pressure affects the vertical distribution and movement of air masses. Low-pressure systems often create stagnant conditions, which lead to the accumulation of pollutants, including PM2.5, in the lower atmosphere. Dew point [21,22,23] represents the temperature at which air becomes saturated, causing water vapor condensation. Higher dew point values indicate increased moisture in the air, which can affect particle growth dynamics and increase the likelihood of hygroscopic particle growth, potentially leading to higher PM2.5 concentrations. Precipitation, such as rainfall, facilitates the removal of particulate matter from the atmosphere; certain pollutants, like sulfates, become more soluble in wet conditions, potentially enhancing their removal through precipitation. Lastly, wind direction and speed determine the dispersion and transport pathways of pollutants. Different emission sources and wind patterns can result in long-distance transport or confinement of pollutants in specific areas, influencing PM2.5 concentrations in different regions.
Long-term forecasting based on statistical methods is challenging due to non-stationarity and noise. Ensemble learning [24] and deep learning [25] are both powerful techniques for handling nonlinear and non-stationary data, offering effective ways to uncover relationships in complex datasets. Ensemble learning accomplishes this by combining the predictions of multiple models, leveraging their collective wisdom to enhance accuracy. Boosting algorithms [26], such as XGBoost, LightGBM, and CatBoost, are widely used in ensemble learning, along with Bagging algorithms. Deep learning, on the other hand, utilizes a complex network structure to identify intricate patterns and dependencies between input and output variables. Based on the network structure, these models can be categorized into two types: Artificial Neural Networks (ANNs) and Recurrent Neural Networks (RNNs). Both families find significant applications in domains like long-term PM2.5 forecasting [27,28,29,30,31], where nonlinear relationships and patterns are prevalent. By employing ensemble learning and deep learning, researchers and practitioners can handle nonlinear and non-stationary data in these fields more effectively.
However, black-box models have limitations in identifying the factors that impact forecasting accuracy in air-pollution-prediction systems. It is crucial to understand these factors in order to improve system performance and reduce costs. Factors such as air conditions, industrial structure, and population quality play a critical role in achieving these objectives [32,33,34,35,36]. By combining the analysis of influencing factors with effective forecasting techniques, it is possible to minimize risks and identify key factors that affect PM2.5 levels. To address the challenges posed by black-box models, explainable AI (XAI) technology [37,38,39], specifically SHapley Additive exPlanations (SHAP) [40,41,42], can be utilized. XAI aims to enhance transparency, accountability, and trust in AI systems by providing users with insights into their decision-making processes. SHAP, known for its consistency, model-agnosticism, and local interpretability, employs cooperative game theory’s Shapley values to assign importance to each variable in a machine learning model. These values measure the contribution of each variable to the prediction, capturing interactions between variables. By incorporating XAI approaches like SHAP, AI systems become more explainable and comprehensible, allowing users to gain a clearer understanding of the system’s decisions and increasing user confidence. SHAP provides a reliable and transparent approach for explaining machine learning models, ultimately assisting users in making informed decisions regarding air pollution prediction and management.
This research contributes to the field by focusing on long-term forecasting of PM2.5 and analyzing the influencing factors. The contributions of this work include:
  • Comparing the performance of 10 mainstream models in long-term PM2.5 forecasting;
  • Exploring the application of explainable AI technology in air-pollution-prediction systems;
  • Analyzing environmental factors affecting PM2.5 levels from the perspective of individual factors and the interaction between factors.
Compared to previous research, the third contribution stands out. While previous work primarily focused on optimizing and comparing the performance of long-term PM2.5 forecasting models, it neglected the analysis of both the model’s output results and the model itself. This limitation hindered the successful practical application of theoretical research in real-world scenarios. In contrast, this work makes a concerted effort to address this gap. It not only emphasizes selecting the most-appropriate forecasting model, but also employs explainable artificial intelligence (XAI) techniques to analyze the contributing factors. By doing so, it provides valuable insights for environmental governance and sustainable development. In Section 2, we review relevant prior research. Section 3 delves into the data description and visualization. In Section 4, we outline the methods used, including the forecasting models (ensemble learning, multilayer perceptron, RNN-based models, Bi-RNN-based models), explainable AI, and the objective function. Section 5 presents the experimental results, with Section 5.1 evaluating the forecasting performance, Section 5.2 analyzing the influencing factors, and Section 5.3 summarizing the findings. Finally, in Section 6, we discuss the obtained results, and the conclusion is provided in Section 7.

2. Related Work

For long-term forecasting of PM2.5, the outstanding performance achieved by machine learning and deep learning has received increasing attention, and these methods have been successfully applied in real-life cases. Zhang K and Yang X et al. [27] presented a robust forecasting system for accurate multi-step forecasting of PM2.5 and PM10 concentrations. The system uses correlation analysis, spatial–temporal attention mechanisms, and a residual-based convolutional neural network to improve accuracy and stability, outperforming baseline models by reducing the root-mean-squared error by 5.595–15.247% for PM2.5 and 6.827–16.906% for PM10 predictions in three major cities, with successful applicability demonstrated in an additional 23 cities within the region. Liu H and Jin K et al. [28] introduced a hybrid model called WPD-PSO-BP-Adaboost for PM2.5 forecasting, which combines wavelet packet decomposition (WPD), particle swarm optimization (PSO), a back propagation neural network (BPNN), and the Adaboost algorithm. The experimental results indicated that the WPD-PSO-BP-Adaboost model achieved the most-accurate multi-step PM2.5 forecasts compared to other models for four cities in China, demonstrating the effectiveness of WPD, PSO, and Adaboost in enhancing forecasting precision. Ahani I K and Salari M et al. [29] introduced an ensemble multi-step-ahead forecasting system for urban PM2.5 forecasts, combining different strategies and prediction tools. The proposed hybrid framework, incorporating Ensemble Empirical Mode Decomposition (EEMD), Boosting, LSSVR, and LSTM, demonstrated effectiveness in providing accurate long-term air quality forecasts compared to other strategies, with the smallest error rates.
For PM2.5, the analysis of influencing factors plays a significant role in the prevention of air pollution. Jing Z and Liu P et al. [32] used a geographic detector method to analyze the effects of anthropogenic precursors (APs) and meteorological factors on PM2.5 concentrations in Chinese cities. The results revealed significant spatio-temporal disparities in the impacts of these factors, with temperature being the main influencing factor throughout the year. Precipitation and temperature had primary effects in southern and northern China, respectively, while APs had stronger impacts in northern China during winter. The interaction between ammonia and temperature was found to have the strongest influence on PM2.5 concentrations at the national scale. Gao X and Ruan Z et al. [33] examined the relationship between PM2.5 pollution and various factors using machine learning. The results showed that the random forest model accurately predicted the PM2.5 concentration, with the mean relative humidity (RH) and aerosol optical depth (AOD) having a significant impact. Niu M and Zhang Y et al. [34] proposed a new PM2.5 predictor for accurate long time series prediction in Beijing. The predictor simplifies the input parameters based on Spearman correlation analysis and utilizes Informer for long-term prediction. The results indicated that selecting the AQI, CO, NO2, and PM10 concentrations from air quality data, along with incorporating the dew point temperature (DEWP) and wind speed from meteorological data, led to a significant 27% improvement in prediction efficiency, highlighting the importance of these meteorological conditions. Zhang L and An J et al. [35] analyzed the spatio-temporal variations and influencing factors of PM2.5 concentrations in Beijing from 2013 to 2018 and evaluated the impacts of recent environmental protection policies. The results showed a yearly decrease in PM2.5 concentrations, indicating the success of air pollution control policies. Winter and southern regions had higher concentrations, and meteorological factors such as relative humidity and wind speed, as well as gaseous pollutants, notably SO2, NO2, and CO, were key factors influencing PM2.5 distribution. Pang N and Gao J et al. [36] investigated the cause of fine particulate matter (PM2.5) pollution in the North China Plain during the heating season. The study revealed that secondary inorganic ions (SNA) were the primary contributors to PM2.5 pollution, accounting for 30–40% of PM2.5, while total carbon (TC) contributed 26.5–30.1%. Sulfate played a significant role in driving elevated PM2.5 concentrations, formed through regional transport and homogeneous reactions, while nitrate formation was limited by HNO3 in ammonia-rich conditions. Furthermore, favorable meteorological conditions and both regional transport and local emissions contributed to PM2.5 pollution in all three cities.
These research studies suffered from a limitation wherein the prediction of future PM2.5 levels and the exploration of influencing factors were treated as distinct tasks. The objective of this study was to address this limitation by employing machine learning and deep learning techniques to integrate these tasks, thereby enhancing the forecasting accuracy. Furthermore, the study utilized explainable artificial intelligence (XAI) technology to provide comprehensible explanations for black-box models, enabling users to analyze the underlying influencing factors in a more-understandable manner.

3. Data Description

The dataset used in this study comprises PM2.5 concentration data collected at the Beijing Olympic Sports Centre Gymnasium, as illustrated in Figure 1. The data span a period from 1 March 2013 to 28 February 2017, with measurements recorded at hourly intervals.
The dataset used in the study consisted of 31,815 observations. It encompassed 11 variables, comprising both meteorological conditions and emission factors. The emission factors were particulate matter (PM10, μg/m³), sulfur dioxide (SO2, μg/m³), nitrogen dioxide (NO2, μg/m³), carbon monoxide (CO, μg/m³), and ozone (O3, μg/m³). Among the meteorological conditions, there were continuous variables, namely temperature (TEMP, °C), pressure (PRES, hPa), dew point (DEWP, °C), rainfall (RAIN, mm), and wind speed (WSPM, m/s), and one discrete variable, wind direction (WD). Wind direction was coded using a natural order system with 16 directions: NNW, N, NW, NNE, ENE, E, NE, W, SSW, WSW, SE, WNW, SSE, ESE, S, SW.
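For concreteness, the following is a minimal sketch of loading such a dataset and encoding the ordered wind-direction variable in Python; the file name, the "datetime" column, and the column labels are hypothetical and should be adjusted to the actual data source.

```python
import pandas as pd

# Hypothetical file and column names, modeled on publicly available
# Beijing air quality records; adjust to the actual data source.
df = pd.read_csv("beijing_aotizhongxin_2013-2017.csv",
                 parse_dates=["datetime"], index_col="datetime")

# Encode the discrete wind-direction variable using the natural order
# of the 16 directions listed above.
wd_order = ["NNW", "N", "NW", "NNE", "ENE", "E", "NE", "W",
            "SSW", "WSW", "SE", "WNW", "SSE", "ESE", "S", "SW"]
df["WD"] = pd.Categorical(df["WD"], categories=wd_order, ordered=True).codes
```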
The field of machine learning focuses on developing algorithms and models with the ability to acquire knowledge and make autonomous predictions or decisions, eliminating the reliance on explicit programming. It finds extensive application in dealing with large and complex datasets. The main goal in this domain is to achieve accurate predictions and discover patterns within the data. While machine learning techniques consider data distributions, they often do not rely on strict assumptions about the statistical characteristics of the underlying data. Consequently, it becomes necessary to analyze the data distribution, particularly with regard to explainability. Figure 2 shows the data distribution for all the variables, and to show the distribution more clearly, we also used Kernel Density Estimation (KDE) to smooth the data distribution shown in the histogram.
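A sketch of how Figure 2 can be reproduced with standard Python tooling is given below; it assumes the DataFrame `df` and the column names from the loading sketch above.

```python
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3",
        "TEMP", "PRES", "DEWP", "RAIN", "WSPM"]

# One histogram per variable with a KDE overlay, as in Figure 2.
fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for ax, col in zip(axes.ravel(), cols):
    sns.histplot(df[col].dropna(), kde=True, ax=ax)  # KDE smooths the histogram
    ax.set_title(col)
plt.tight_layout()
plt.show()
```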
The analysis of data distributions can yield valuable insights, offering useful information for various purposes. Among the data distributions examined in this study, PM10 and CO exhibit similarities to the target variable, PM2.5, although slight variations were observed across other variables’ distributions. It is noteworthy that the data in this dataset deviated from a normal distribution; notably, RAIN displayed the most-uneven distribution. By presenting data distributions, we can make informed decisions about selecting an appropriate method to calculate the correlation coefficients. In this particular dataset, given the non-normal distribution of all variables, we employed the Spearman correlation coefficient matrix. This approach provided a measure of correlation and afforded some degree of explainability at the data level.
The Spearman correlation coefficient matrix is a valuable tool for visualizing variable relationships. In Figure 3, the correlation coefficients between the variables provide explainability at the data level. Examining emission factors, PM10 and CO exhibited the highest correlations with PM2.5 at 0.89 and 0.83, respectively, followed by NO2 (0.69) and SO2 (0.47). However, the correlation between O3 and PM2.5 was insignificant. Meteorological conditions, on the other hand, did not show significant correlations between the variables. The correlation coefficient matrix highlighted the presence of multicollinearity, particularly amongst emission factors. Notably, CO showed strong correlations with both PM10 (0.72) and NO2 (0.75) and a correlation of 0.56 with SO2, as well as a negative correlation of −0.46 with O3. Additionally, NO2 exhibited high correlations with CO (0.65), PM10 (0.45), and SO2 (0.66), along with a negative correlation of −0.66 with O3. Meteorological conditions also demonstrated significant correlations, such as the correlation of 0.82 between dew point and temperature and the correlation of −0.78 between dew point and pressure. Moreover, the correlation between temperature and pressure was −0.83. Multicollinearity poses challenges to explanatory analysis using traditional methods alone, as this hinders the understanding of the effect of explanatory variables on the target variable at deeper levels based solely on correlation coefficient matrices. To address this issue, Section 5.2 discusses the SHAP method, which can elucidate the mechanism of influence in the presence of multicollinearity between variables.
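The matrix in Figure 3 can be computed in a single call; the sketch below assumes the `df` and `cols` objects introduced earlier.

```python
# Spearman rank correlation, chosen because none of the variables follows
# a normal distribution (Figure 2); this reproduces the matrix in Figure 3.
corr = df[cols].corr(method="spearman")
plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```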
Overall, the Spearman correlation coefficient matrices showed that all emission factors except O3 had significant direct positive effects on PM2.5, while there was no significant effect in the meteorological conditions. However, multicollinearity existed among the emission factors, and therefore, the mechanism of their effects on PM2.5 needs to be further discussed with the help of the XAI method.

4. Method

4.1. Data Processing and Influencing Factors’ Analysis Process

The overview of this work is shown in Figure 4. In this study, the dataset employed included emission factors and meteorological conditions, both considered pivotal for enhanced precision in forecasting. To attain more-accurate predictions, a black-box model was utilized to effectively model these variables.
In order to elucidate the intricacies of the black-box model, explainable artificial intelligence (XAI) technology was employed, resulting in the development of an explainer. This explainer served the purpose of elucidating the relative significance of different variables, encompassing both emission factors and meteorological conditions. Notably, the explainer leverages a SHapley-Additive-exPlanations (SHAP)-based approach, thereby providing valuable insights not only regarding the importance of individual variables, but also their interactions on the forecasted outcomes. Consequently, this integrative framework formed a solid foundation for conducting comprehensive analyses on the influencing factors that contribute to the forecasting results.

4.2. Forecasting Methods

4.2.1. Ensemble Learning

The Boosting algorithm is a common form of ensemble learning, where weak learners are connected in series to construct a robust learner. This study focused on Boosting algorithms, specifically XGBoost, LightGBM, and CatBoost, which are based on decision trees. XGBoost improves generalization ability through regularization and second-order derivative information. LightGBM is designed for extensive datasets and reduces memory usage and computation time using Gradient-based One-Side Sampling (GOSS). CatBoost is optimized for categorical features with Ordered Boosting and Symmetric Tree-Based Sampling (STBS) for handling missing values effectively. Figure 5 illustrates the differences between these algorithms.
Figure 5 shows the growth strategies of the three types of trees. XGBoost uses depthwise leaf growth, resulting in balanced tree structures. LightGBM uses leafwise leaf growth, leading to deeper, but sparser trees, and utilizes Histogram-based Gradient Boosting for efficient training of unbalanced trees. CatBoost combines both strategies by using a predefined split threshold and balances instance numbers in each leaf node to prevent overfitting.
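As an illustration of how the three boosting regressors can be trained and compared, consider the following sketch. The hyperparameters are illustrative defaults, not the tuned values of this study, and `X_train`, `y_train`, `X_test`, and `y_test` are assumed to be prepared train/test splits.

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error

# Illustrative default hyperparameters, not the tuned values of this study.
models = {
    "XGBoost":  XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05),
    "LightGBM": LGBMRegressor(n_estimators=500, num_leaves=31, learning_rate=0.05),
    "CatBoost": CatBoostRegressor(iterations=500, depth=6,
                                  learning_rate=0.05, verbose=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)                       # assumed train split
    print(name, mean_absolute_error(y_test, model.predict(X_test)))
```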

4.2.2. Multilayer Perceptron

The Multilayer Perceptron (MLP) is an effective model for capturing intricate nonlinear connections between input and output variables. Figure 6 illustrates its structure. It comprises interconnected artificial neurons that transmit data from the input layer to the output layer using forward propagation.
The input layer receives the input data, while the output layer produces the model’s results. Positioned between the input and output layers, the intermediate layers are known as hidden layers. Each neuron in an MLP is characterized by specific weights and biases that govern its output.
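A minimal PyTorch sketch of such an MLP, with two hidden layers as in Figure 6, is shown below; the layer width is an arbitrary choice for illustration, not the configuration used in this study.

```python
import torch.nn as nn

# A small MLP of the kind sketched in Figure 6: two hidden layers with
# nonlinear activations, mapping the input features to one forecast value.
class MLP(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),  # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),      # hidden layer 2
            nn.Linear(hidden, 1),                      # output layer
        )

    def forward(self, x):
        return self.net(x)
```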

4.2.3. RNN-Based Models

Recurrent Neural Networks (RNNs) are designed to handle sequential data by capturing temporal dependencies using an internal state. Figure 7 (left) illustrates its structure. Unlike MLP models, RNNs can process input sequences of various lengths and retain the memory of previous inputs. This flexibility makes RNNs suitable for tasks like voice recognition, natural language processing, and time series prediction.
However, RNNs can face issues with vanishing or exploding gradients during training. The gradients, which propagate backwards in time, can exponentially increase or decrease, affecting the accuracy of the network. To address this problem, Long Short-Term Memory (LSTM) was developed. LSTM uses a memory cell that preserves its state over longer periods. Its architecture, shown in Figure 7 (upper right), includes three gates: the input, output, and forget gates, which control the flow of information.
Figure 7 (lower right) presents the structure of the Gated Recurrent Unit (GRU), another form of RNN, which resembles LSTM, but lacks a distinct memory cell. Instead, it combines the memory cell and output gate into a unified component called the update gate. Additionally, the GRU incorporates a reset gate, controlling the information discarded from the previous hidden state. Despite its simplicity compared to LSTM, the GRU achieves comparable performance in many sequence-modeling tasks while requiring fewer parameters and less computation.
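The following sketch shows a minimal PyTorch forecaster built on either an LSTM or a GRU cell; the hidden size and the single-output head reading the last time step are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch.nn as nn

class RNNForecaster(nn.Module):
    """LSTM- or GRU-based forecaster over a window of past observations."""
    def __init__(self, n_features, hidden=64, cell="lstm"):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(input_size=n_features, hidden_size=hidden,
                           batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (batch, seq_len, n_features)
        out, _ = self.rnn(x)             # out: (batch, seq_len, hidden)
        return self.head(out[:, -1, :])  # forecast from the last time step
```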

4.2.4. Bi-Directional-RNN-Based Models

The Bi-directional Recurrent Neural Network (Bi-RNN) is an effective neural network architecture specifically designed for sequence-modeling tasks. It consists of two Recurrent Neural Networks (RNNs) operating in opposite directions on the input sequence. Figure 8 illustrates its structure.
By incorporating both forward and backward dependencies, the Bi-RNN aims to capture comprehensive temporal information. The final output is obtained by combining the outputs of both RNNs, either through concatenation or other merging techniques. This dual-layer structure significantly enhances the ability of the model to understand and represent sequential patterns. It is worth noting that the formulations for the Bi-RNN, Bi-LSTM, and Bi-GRU closely mirror those of their unidirectional counterparts, making them relatively straightforward to implement and adapt in practice.
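Extending the unidirectional sketch above to the bidirectional case requires only the `bidirectional=True` flag and a doubled head input, since the forward and backward hidden states are concatenated; again, the sizes are illustrative assumptions.

```python
import torch.nn as nn

class BiRNNForecaster(nn.Module):
    """Bidirectional variant: forward and backward passes over the window."""
    def __init__(self, n_features, hidden=64, cell="lstm"):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(n_features, hidden, batch_first=True,
                           bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # concatenated fwd + bwd states

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])
```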
We present a summary of several prominent machine learning models that are widely utilized in time-series-forecasting domains. Table 1 provides an overview of these models, highlighting their respective descriptions, advantages, and disadvantages.

4.3. Explainable AI

The prevalence of black-box models in modern AI systems raises concerns among users due to their lack of transparency. These models only provide predicted outcomes without explaining how the variables affect these predictions. To address this issue, explainable artificial intelligence (XAI) has been developed to facilitate human understanding of AI system decision-making processes. XAI methods focus on exploring feature importance and interactions within black-box models during the training phase. These methods prove particularly valuable for domain experts in fields like environmental governance and sustainable development as they provide interpretative results that aid in assessing model rationality. Furthermore, XAI techniques enable users to discover interactive relationships between variables.
One widely used method for explaining predictions in complex machine learning models is SHapley Additive exPlanations (SHAP). This technique is model-agnostic, making it applicable to a broad range of models including tree-based models, neural networks, and linear models. The foundation of the SHAP method lies in Shapley values, which measure the contribution of each player in a game. Shapley values originate from cooperative game theory, aiming to allocate benefits fairly among players forming a coalition. In recent years, there has been increasing interest in applying the concept of Shapley values to explain models in machine learning. The connection between Shapley values and model explanation arises from considering the variables used for training as “players”, while the model’s predictions represent the corresponding “revenues”. By employing the SHAP method, users can gain insights into the significance of individual variables in predicting outcomes, fostering a better understanding of complex AI models. The Shapley value is defined by the following formula:
$$k_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \bigl( v(S \cup \{i\}) - v(S) \bigr),$$
where $k_i$ is the importance (contribution) of the $i$-th variable; $N$ is the set of all players (features) $\{1, 2, \ldots, i, \ldots, n\}$, which is the complete set; $S$ is a subset of $N$ that excludes the explained feature $i$, of which there are $2^{n-1}$ such subsets; and $v$ is the gain function, $v(S) = \mathbb{E}_{\hat{D}}\left[ f(x) \mid x_S \right]$, where $\hat{D}$ is the empirical distribution of the training data and $f$ is the black-box model.
The SHAP method is highly regarded for its ability to offer both a global and local perspective on feature importance. It allows us to understand how each feature influences overall predictions and also how it specifically affects the outcome of individual instances. This powerful technique has gained increasing popularity across diverse domains including healthcare, finance, and image recognition. A noteworthy advantage of the SHAP method is its capacity to provide intuitive and insightful explanations for complex machine learning models. By doing so, it significantly enhances the transparency and interpretability of these models, thereby instilling trust in the accuracy of their predictions.
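A minimal sketch of applying SHAP to one of the fitted tree ensembles from Section 4.2.1 follows; `model` and `X_test` are assumed to come from the training step, and the two `summary_plot` calls correspond to the mean-value and scatter views used later in Figure 9.

```python
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles such
# as CatBoost and LightGBM; "model" and "X_test" are assumed to come from
# the training step (X_test as a DataFrame, so feature names are kept).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test, plot_type="bar")  # global importance
shap.summary_plot(shap_values, X_test)                   # per-observation effects
```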

4.4. Objective Function

Various criteria have been developed to assess the performance of machine learning models. Three widely used metrics are the Coefficient of Determination ($R^2$), the root-mean-squared error (RMSE), and the Mean Absolute Error (MAE). They can be calculated as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
$$\mathrm{RMSE}\% = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}}{\bar{y}} \times 100$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
where $y_i$ is the true value, $\hat{y}_i$ is the forecast value, $\bar{y}$ is the average of all true values, and $n$ is the number of observations.
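These three metrics can be implemented directly from the formulas above; the following NumPy sketch mirrors the definitions, including the normalization of the RMSE by the mean of the true values.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse_pct(y, y_hat):
    """RMSE normalized by the mean of the true values, in percent."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2)) / y.mean() * 100

def mae(y, y_hat):
    """Mean Absolute Error."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat))
```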

5. Results

5.1. Results of Forecasting Performance

We applied emission factors and meteorological conditions as the inputs to the black-box models for PM2.5 forecasting and evaluated the performance of these models at different forecast horizons (30 days, 90 days, and 180 days). The results are summarized in Table 2.
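The paper does not spell out the exact preprocessing, but one plausible construction of the horizon experiments is sketched below: the hourly records are aggregated to daily means and the target is shifted forward by the forecast horizon. The function name and the daily aggregation are assumptions for illustration only.

```python
# Aggregate the hourly records to daily means (requires a DatetimeIndex)
# and predict PM2.5 "horizon_days" ahead from today's features.
def make_horizon_dataset(df, horizon_days, features, target="PM2.5"):
    daily = df.resample("D").mean(numeric_only=True)
    X = daily[features].iloc[:-horizon_days]
    y = daily[target].shift(-horizon_days).iloc[:-horizon_days]
    return X, y

for horizon in (30, 90, 180):
    X, y = make_horizon_dataset(df, horizon, features=cols)
    print(horizon, X.shape, y.shape)
```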
The analysis revealed that CatBoost performed best at the 30-day forecast horizon, while Bi-LSTM yielded the highest accuracy at 90 days, and Bi-GRU excelled at 180 days. When considering model types, ensemble learning models demonstrated superior performance for shorter forecast horizons, whereas recurrent neural network models, including their Bi-directional counterparts, exhibited better accuracy for longer-term forecasting. In contrast, the multilayer perceptron model did not deliver noteworthy results in our study.

5.2. Analysis of Influencing Factors

5.2.1. Impact Analysis of Single Factor

In our study, to analyze the factors influencing PM2.5, we selected the best-performing ensemble learning model for each horizon, as these tree-based models are well supported by SHAP; the comparison was therefore restricted to the ensemble models. Specifically, we utilized CatBoost to examine the factors affecting PM2.5 at the 30-day forecast horizon and LightGBM for the 90- and 180-day analyses. The SHAP results offered valuable insights into how the variables influenced the forecasting outcomes, capturing both individual variable effects and interactions between variables. To visually represent each variable’s contribution, we utilized mean value plots, as shown in Figure 9.
In this analysis, the SHAP results were consistent across the three forecasting horizons. They revealed that PM10 had the highest contribution to the PM2.5 results at all horizons, indicating its significant impact. Following PM10, CO also exhibited a high level of significance in the forecasted results for all horizons. The impact of emission factors like O3 and SO2 was moderate, while NO2 had the least-significant effect among all the emission factors. Concerning meteorological conditions, the dew point demonstrated the most-pronounced effect on PM2.5, with temperature being the next most influential factor; these two factors outweighed all emission factors except PM10 and CO. Pressure and wind speed had a minor influence on PM2.5 concentrations. Since the analysis focused solely on the PM2.5 concentration in a specific area, the impact of wind direction was limited. Furthermore, in this dataset, the variable RAIN, representing the rainfall amount, was recorded as 0 in most cases (Figure 2, RAIN), a reflection of Beijing’s climatic characteristics. Consequently, its effect on PM2.5 was minimal.
The analysis presented above focused on individual variables, but the scatterplot provides additional insights into the relationship between the variables and SHAP values. The scatterplot (Figure 9 upper) shows high variable values depicted as red dots, while low values are represented by blue dots. For instance, with respect to PM10, a high PM10 value corresponds to a high SHAP value on the horizontal axis, indicating a positive effect. In other words, an increase in PM10 (red dots) contributes to the increase in PM2.5, whereas a decrease in PM10 (blue dots) inhibits the increase in PM2.5. Applying the same analytical process, we observed similar patterns for CO, dew point, and ozone. However, the effect of temperature exhibited a different correlation mechanism. An increase in temperature inhibited the increase of PM2.5, whereas a decrease promoted its increase. This phenomenon is likely influenced by the presence of centralized winter heating in the Beijing area. As the primary heating method during winter involves thermal power, a significant amount of fossil fuels is burned at lower temperatures, leading to higher concentrations of PM2.5.

5.2.2. Interaction Analysis of Factors

Given the significance of the interaction explanation plots, we focused our analysis on four variables that demonstrated a substantial influence on the results: two emission factors, PM10 and CO, and two meteorological conditions, dew point and temperature. The SHAP interaction explanation plot offers valuable insights, particularly regarding the interplay among variables. It unveils the mechanisms behind the variables’ influences on the forecast results, even in cases where there is covariance between them. Figure 10 visually illustrates how two crucial factors, PM10 and CO, impacted the PM2.5 concentration.
The relationship between the PM10 concentration and its contribution to the forecasting result, as indicated by the SHAP value, demonstrated that an increase in PM10 led to a corresponding increase in PM2.5. The interaction effect of CO on PM10 was particularly significant. To analyze this interaction effect, we represent the CO concentration using colors, with red indicating a high CO concentration and blue indicating a low CO concentration. Examining the data revealed that, at high CO concentrations, there was a linear promotion of PM2.5 with increased levels of PM10 (red dots). Conversely, a decrease in the CO concentration significantly inhibited the contribution of PM10 to PM2.5 concentrations (blue dots). Similarly, analyzing the CO interaction plots revealed that increasing CO also promoted PM2.5, although not to the same extent as PM10. Furthermore, at high PM10 concentrations, both PM10 and CO contributed to higher PM2.5 levels, but the suppression of this trend by low PM10 levels was not significant (only a few blue dots are observed).
The SHAP interaction plot provides insights that are not attainable through the scatter plot and average plot alone. While PM10 ranked highest in importance, this was observed specifically at high CO concentrations. At low CO concentrations, an increase in PM10 concentration tended to stabilize its impact on PM2.5. Therefore, it can be deduced that CO plays a crucial role in driving these effects.
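The interaction view in Figure 10 corresponds to SHAP’s dependence plot; a one-line sketch, reusing `shap_values` and `X_test` from the explainer sketch above, is:

```python
# Dependence plot of PM10 coloured by CO, its strongest interaction partner,
# as in Figure 10 (interaction_index="auto" would let SHAP pick the partner).
shap.dependence_plot("PM10", shap_values, X_test, interaction_index="CO")
```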
Figure 11 illustrates the impact of each single variable on the forecasted outcomes. The variable exhibiting the most-pronounced interaction with it is displayed on the right vertical axis. The interaction plots provide clearer depictions of the linear associations between changes in these variables and changes in PM2.5 levels. Specifically, the dew point exhibited a positive correlation with PM2.5, while temperature generally showed a negative correlation. For PM2.5 forecasting with a horizon of 30 days, “month” exhibited the most-obvious interactions with “dew point” and “pressure”. However, since “month” did not have a linear relationship with the forecast results, it does not show a clear red–blue boundary here either.
For the remaining horizons, there was a significant interaction between temperature and dew point. It was evident that higher dew points corresponded to higher temperatures, indicating a positive correlation between dew point and temperature. This relationship can be further confirmed by examining the “temperature” variable, where high dew points (represented by red dots) were observed in the higher temperature range. This consistency aligned with the Spearman correlation coefficient matrix, providing some reassurance regarding the reliability of SHAP. Additionally, an increase in the dew point coincided with an increase in the SHAP value, suggesting a rise in the PM2.5 concentration. Conversely, high temperatures accompanying elevated dew points led to a decrease in the PM2.5 concentration. However, numerically, when the dew point exceeded 25 °C, it may have contributed up to a 100 μg/m³ increase in the PM2.5 concentration, while a corresponding temperature increase only resulted in a decrease of about 60 μg/m³. Therefore, the impact of a high dew point on the PM2.5 concentration can be considered reasonably reliable. Consequently, it is crucial to underscore the positive effect of a high dew point on PM2.5 concentrations.

5.3. Findings from the Analysis

The findings of the analysis indicated that “PM10” had the strongest impact on the prediction of PM2.5, followed by “CO”, “dew point”, and “temperature”. These influential factors exhibited varying degrees of linear association with PM2.5. In particular, temperature displayed a negative correlation with PM2.5, whereas PM10, CO, and dew point generally exhibited positive correlations with PM2.5. Additionally, the influence of PM10 on PM2.5 diminished at lower concentrations of CO, while the influence of CO on PM2.5 appeared to be largely unaffected by PM10. Theoretically, these correlated variables can mutually influence the PM2.5 concentration. Numerically, however, a dew point exceeding 25 °C led to a significant increase of up to 100 μg/m³ in the PM2.5 concentration, compared to a maximum reduction of about 60 μg/m³ at high temperatures and 20–40 μg/m³.

6. Discussion

In our experiments, we observed inadequate performance of complex deep learning models at a forecast horizon of 30 days. This was primarily because these models were initially designed for natural-language-processing tasks. Although language sequences are also referred to as time series, their nature is fundamentally different from the temporal relationships in forecasting tasks.
To illustrate this distinction, consider the following example:
  • “I like playing the piano and playing basketball.”
  • “I like playing basketball and playing the piano.”
These two sentences illustrate the temporal relationship in natural-language-processing tasks: the positions of the phrases can be interchanged without altering the meaning conveyed by the sentences. In forecasting tasks, however, the temporal points within a time series cannot undergo such positional changes. This is the fundamental explanation for the phenomenon observed above.
Moreover, such deep learning models are better suited for handling complex data types, particularly high-dimensional data. In the case of most time series forecasting, the complexity of the dataset itself is limited, making complex models with intricate structures more susceptible to overfitting. Consequently, simpler models are generally more appropriate for time series forecasting tasks.
There is a wide range of explainable artificial intelligence (XAI) methods available. Notably, popular algorithms such as XGBoost, LightGBM, and CatBoost have well-established feature-importance techniques for assessing variable significance during training. However, these methods are limited to tree-based models, providing only importance values without offering more-comprehensive explanation details such as positive and negative correlations. While LIME can estimate such correlations and is applicable to all model types, it can only explain individual samples and lacks an overall, global explanation of the variables. The Layerwise Relevance Propagation (LRP) method offers several advantages, but is restricted to neural network models. Furthermore, in terms of technical support and associated libraries, SHAP excels over the aforementioned methods.
Boundary layer stability strongly influences surface pollutants, but this aspect was not addressed in the present study. A related study [43] explored the interaction between aerosols and the planetary boundary layer (PBL) in the North China Plain (NCP). The research focused on differentiating the effects of absorbing and scattering aerosols under various synoptic patterns and aerosol vertical distributions. The results demonstrated that synoptic conditions can influence the PBL’s thermal structure and the vertical distribution of aerosols, with scattering aerosols primarily affecting the PBL under stable stratification and absorbing aerosols being more dependent on their vertical distribution. The study suggested controlling emissions of scattering aerosols under stable stratification and cooperating on controlling both scattering and absorbing aerosols under unstable stratification for effective air pollution control.

7. Conclusions

We conducted an analysis of long-term PM2.5 prediction using ten widely used models. The findings indicated that ensemble learning based on decision trees performed better than deep learning models in predicting PM2.5 levels up to 30 days ahead. Beyond this time frame, however, the deep learning models exhibited an advantage in longer-term predictions, specifically at 90 days and 180 days. Among the deep learning models considered, Bi-LSTM yielded the best performance for a 90-day prediction horizon, while Bi-GRU was most effective for a 180-day prediction horizon. Regarding the analysis of factors influencing PM2.5 concentrations, our results revealed that PM10 had the greatest impact among all emission factors, with CO following closely behind. We also observed that CO can significantly modulate the impact of PM10: at low concentrations of CO, the contribution of PM10 to PM2.5 levels was limited, even in conjunction with high concentrations of PM10. Consequently, we identified CO as a key factor in PM2.5 formation. In terms of meteorological conditions, dew point and temperature had the most-substantial influence, each exhibiting an approximately linear relationship with PM2.5. An increase in the dew point corresponded to higher PM2.5 concentrations, whereas an increase in the temperature could inhibit PM2.5 concentration escalation.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z., Q.S., and J.L.; formal analysis, Y.Z.; investigation, Q.S. and J.L.; resources, Q.S., J.L.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, O.P.; visualization, Y.Z., Q.S., and J.L.; supervision, O.P.; project administration, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the China Scholarship Council (CSC) and Saint-Petersburg State University, Project ID 94062114.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Maciejczyk, P.; Chen, L.C.; Thurston, G. The role of fossil fuel combustion metals in PM2.5 air pollution health associations. Atmosphere 2021, 12, 1086. [Google Scholar] [CrossRef]
  2. Meo, S.A.; Almutairi, F.J.; Abukhalaf, A.A. Effect of green space environment on air pollutants PM2.5, PM10, CO, O3, and incidence and mortality of SARS-CoV-2 in highly green and less-green countries. Int. J. Environ. Res. Public Health 2021, 18, 13151. [Google Scholar] [CrossRef]
  3. Fan, Z.; Zhan, Q.; Yang, C. How did distribution patterns of particulate matter air pollution (PM2.5 and PM10) change in China during the COVID-19 outbreak: A spatiotemporal investigation at Chinese city-level. Int. J. Environ. Res. Public Health 2020, 17, 6274. [Google Scholar] [CrossRef]
  4. Wang, Z.; Chen, L.; Ding, Z. An enhanced interval PM2.5 concentration forecasting model based on BEMD and MLPI with influencing factors. Atmos. Environ. 2020, 223, 117200. [Google Scholar] [CrossRef]
  5. Delp, W.W.; Singer, B.C. Wildfire smoke adjustment factors for low-cost and professional PM2.5 monitors with optical sensors. Sensors 2020, 20, 3683. [Google Scholar] [CrossRef]
  6. Luo, H.; Han, Y.; Lu, C. Characteristics of surface solar radiation under different air pollution conditions over Nanjing, China: Observation and simulation. Adv. Atmos. Sci. 2019, 36, 1047–1059. [Google Scholar] [CrossRef]
  7. Fan, H.; Zhao, C.; Yang, Y. Spatio-temporal variations of the PM2.5/PM10 ratios and its application to air pollution type classification in China. Front. Environ. Sci. 2021, 9, 692440. [Google Scholar] [CrossRef]
  8. Spandana, B.; Rao, S.S.; Upadhya, A.R. PM2.5/PM10 ratio characteristics over urban sites of India. Adv. Space Res. 2021, 67, 3134–3146. [Google Scholar] [CrossRef]
  9. Al-Janabi, S.; Alkaim, A.; Al-Janabi, E. Intelligent forecaster of concentrations (PM2.5, PM10, NO2, CO, O3, SO2) caused air pollution (IFCsAP). Neural Comput. Appl. 2021, 33, 14199–14229. [Google Scholar] [CrossRef]
  10. Hu, Y.; Yao, M.; Liu, Y. Personal exposure to ambient PM2.5, PM10, O3, NO2, and SO2 for different populations in 31 Chinese provinces. Environ. Int. 2020, 144, 106018. [Google Scholar] [CrossRef] [PubMed]
  11. Zhang, Y.; Shen, Z.; Zhang, B. Emission reduction effect on PM2.5, SO2 and NOx by using red mud as additive in clean coal briquetting. Atmos. Environ. 2020, 223, 117203. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Bao, F.; Li, M. Photoinduced uptake and oxidation of SO2 on Beijing urban PM2.5. Environ. Sci. Technol. 2020, 54, 14868–14876. [Google Scholar] [CrossRef] [PubMed]
  13. Orellano, P.; Reynoso, J.; Quaranta, N. Short-term exposure to particulate matter (PM10 and PM2.5), nitrogen dioxide (NO2), and ozone (O3) and all-cause and cause-specific mortality: Systematic review and meta-analysis. Environ. Int. 2020, 142, 105876. [Google Scholar] [CrossRef] [PubMed]
  14. Naghan, D.J.; Neisi, A.; Goudarzi, G. Estimation of the effects PM2.5, NO2, O3 pollutants on the health of Shahrekord residents based on AirQ+ software during (2012–2018). Toxicol. Rep. 2022, 9, 842–847. [Google Scholar] [CrossRef] [PubMed]
  15. Rovira, J.; Domingo, J.L.; Schuhmacher, M. Air quality, health impacts and burden of disease due to air pollution (PM10, PM2.5, NO2 and O3): Application of AirQ+ model to the Camp de Tarragona County (Catalonia, Spain). Sci. Total Environ. 2020, 703, 135538. [Google Scholar] [CrossRef]
  16. Chen, Z.; Chen, D.; Zhao, C. Influence of meteorological conditions on PM2.5 concentrations across China: A review of methodology and mechanism. Environ. Int. 2020, 139, 105558. [Google Scholar] [CrossRef]
  17. Li, M.; Wang, L.; Liu, J. Exploring the regional pollution characteristics and meteorological formation mechanism of PM2.5 in North China during 2013–2017. Environ. Int. 2020, 134, 105283. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Xu, B.; Xu, W. Machine learning combined with the PMF model reveal the synergistic effects of sources and meteorological factors on PM2.5 pollution. Environ. Res. 2022, 212, 113322. [Google Scholar] [CrossRef]
  19. Dong, J.; Liu, P.; Song, H. Effects of anthropogenic precursor emissions and meteorological conditions on PM2.5 concentrations over the “2+ 26” cities of northern China. Environ. Pollut. 2022, 315, 120392. [Google Scholar] [CrossRef]
  20. Yang, Z.; Yang, J.; Li, M. Nonlinear and lagged meteorological effects on daily levels of ambient PM2.5 and O3: Evidence from 284 Chinese cities. J. Clean. Prod. 2021, 278, 123931. [Google Scholar] [CrossRef]
  21. Shrestha, A.K.; Thapa, A.; Gautam, H. Solar radiation, air temperature, relative humidity, and dew point study: Damak, Jhapa, Nepal. Int. J. Photoenergy 2019, 2019, 8369231. [Google Scholar] [CrossRef]
  22. Sein, Z.M.M.; Ullah, I.; Iyakaremye, V. Observed spatiotemporal changes in air temperature, dew point temperature and relative humidity over Myanmar during 2001–2019. Meteorol. Atmos. Phys. 2022, 134, 7. [Google Scholar] [CrossRef]
  23. Feistel, R.; Hellmuth, O.; Lovell-Smith, J. Defining relative humidity in terms of water activity: III. Relations to dew-point and frost-point temperatures. Metrologia 2022, 59, 045013. [Google Scholar] [CrossRef]
  24. Dong, X.; Yu, Z.; Cao, W. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258. [Google Scholar] [CrossRef]
  25. Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379. [Google Scholar] [CrossRef]
  26. Wen, L.; Hughes, M. Coastal wetland mapping using ensemble learning algorithms: A comparative study of bagging, Boosting and stacking techniques. Remote Sens. 2020, 12, 1683. [Google Scholar] [CrossRef]
  27. Zhang, K.; Yang, X.; Cao, H. Multi-step forecast of PM2.5 and PM10 concentrations using convolutional neural network integrated with spatial–temporal attention and residual learning. Environ. Int. 2023, 171, 107691. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, H.; Jin, K.; Duan, Z. Air PM2.5 concentration multi-step forecasting using a new hybrid modeling method: Comparing cases for four cities in China. Atmos. Pollut. Res. 2019, 10, 1588–1600. [Google Scholar] [CrossRef]
  29. Ahani, I.K.; Salari, M.; Shadman, A. An ensemble multi-step-ahead forecasting system for fine particulate matter in urban areas. J. Clean. Prod. 2020, 263, 120983. [Google Scholar] [CrossRef]
  30. Gao, X.; Li, W. A graph-based LSTM model for PM2.5 forecasting. Atmos. Pollut. Res. 2021, 12, 101150. [Google Scholar] [CrossRef]
  31. Zaini, N.; Ean, L.W.; Ahmed, A.N. PM2.5 forecasting for an urban area based on deep learning and decomposition method. Sci. Rep. 2022, 12, 17565. [Google Scholar] [CrossRef] [PubMed]
  32. Jing, Z.; Liu, P.; Wang, T. Effects of meteorological factors and anthropogenic precursors on PM2.5 concentrations in cities in China. Sustainability 2020, 12, 3550. [Google Scholar] [CrossRef]
  33. Gao, X.; Ruan, Z.; Liu, J. Analysis of atmospheric pollutants and meteorological factors on PM2.5 concentration and temporal variations in harbin. Atmosphere 2022, 13, 1426. [Google Scholar] [CrossRef]
  34. Niu, M.; Zhang, Y.; Ren, Z. Deep learning-based PM2.5 long time-series prediction by fusing multisource data—A case study of Beijing. Atmosphere 2023, 14, 340. [Google Scholar] [CrossRef]
  35. Zhang, L.; An, J.; Liu, M. Spatiotemporal variations and influencing factors of PM2.5 concentrations in Beijing, China. Environ. Pollut. 2020, 262, 114276. [Google Scholar] [CrossRef] [PubMed]
  36. Pang, N.; Gao, J.; Che, F. Cause of PM2.5 pollution during the 2016-2017 heating season in Beijing, Tianjin, and Langfang, China. J. Environ. Sci. 2020, 95, 201–209. [Google Scholar] [CrossRef] [PubMed]
  37. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable ai: A review of machine learning interpretability methods. Entropy 2020, 23, 18. [Google Scholar] [CrossRef] [PubMed]
  38. Phillips, P.J.; Hahn, C.A.; Fontana, P.C. Four Principles of Explainable Artificial Intelligence; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2020.
  39. Vilone, G.; Longo, L. Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 2021, 76, 89–106. [Google Scholar] [CrossRef]
  40. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  41. Chen, H.; Covert, I.C.; Lundberg, S.M. Algorithms to estimate Shapley value feature attributions. Nat. Mach. Intell. 2023, 5, 590–601. [Google Scholar] [CrossRef]
  42. Chen, H.; Lundberg, S.M.; Lee, S.I. Explaining a series of models by propagating Shapley values. Nat. Commun. 2022, 13, 4512. [Google Scholar] [CrossRef]
  43. Luo, H.; Dong, L.; Chen, Y. Interaction between aerosol and thermodynamic stability within the planetary boundary layer during wintertime over the North China Plain: Aircraft observation and WRF-Chem simulation. Atmos. Chem. Phys. 2022, 22, 2507–2524. [Google Scholar] [CrossRef]
Figure 1. Location information of the data-collection place. Beijing Olympic Sports Center Gymnasium—Longitude: 116.3995; Latitude: 39.98365.
Figure 2. Visualization of the data distribution. The orange line is the Kernel Density Estimation (KDE) curve, used to show the distribution more clearly. As a non-parametric method for estimating the probability density function, it smooths the distribution shown by the histogram and thus facilitates viewing the distribution. Note that the KDE curve is continuous and its tails approach the x-axis so that the curve integrates to 1; the curve may therefore extend over “negative” values of a variable, but this does not mean that the variable takes negative values.
Figure 3. Correlation coefficient between variables.
Figure 4. Flowchart for data processing and analysis of influencing factors.
Figure 5. Example of growing Boosting trees. CatBoost (left); XGBoost (middle); LightGBM (right). All are models based on decision trees, which consist of a root node, internal nodes, and leaf nodes from top to bottom.
Figure 6. Network structure of MLP. y is the output of MLP, h i = w i × x ; w 1 , w 2 , …, w L are the weight matrices that determine the strength of the connections between layers. The activation function f can be any nonlinear function, such as s i g m o i d , R e L U , or t a n h . Hidden layer 1 represents the first hidden layer; Hidden layer 2 represents the second hidden layer.
Figure 6. Network structure of MLP. y is the output of MLP, h i = w i × x ; w 1 , w 2 , …, w L are the weight matrices that determine the strength of the connections between layers. The activation function f can be any nonlinear function, such as s i g m o i d , R e L U , or t a n h . Hidden layer 1 represents the first hidden layer; Hidden layer 2 represents the second hidden layer.
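A minimal NumPy sketch of the forward pass in Figure 6 follows. Layer sizes are illustrative, and bias terms are omitted to mirror the caption's $h_i = w_i \times x$; practical MLPs add them.

```python
# A minimal sketch of the MLP forward pass in Figure 6: each hidden layer
# computes h = f(w x), with f a nonlinearity such as ReLU. Sizes are
# illustrative; biases are omitted to mirror the caption.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
# Weight matrices w_1 ... w_L for input -> hidden 1 -> hidden 2 -> output.
w1 = rng.normal(size=(16, 8))   # hidden layer 1
w2 = rng.normal(size=(16, 16))  # hidden layer 2
w3 = rng.normal(size=(1, 16))   # output layer

x = rng.normal(size=8)          # one feature vector (PM10, CO, dew point, ...)
h1 = relu(w1 @ x)
h2 = relu(w2 @ h1)
y = w3 @ h2                     # scalar forecast
print(float(y))
```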
Figure 7. Network structure of the traditional RNN (left) and neuron structure of the LSTM (upper right) and GRU (lower right). In the RNN, $h_t = f(U x_t + W h_{t-1} + b_h)$; $O_t = \mathrm{Softmax}(V h_t + b_o)$, where $O_t$ is the output at time step $t$; $h_t$ is the hidden state at time step $t$; $f$ is the activation function; $W$ is the weight matrix for the previous hidden state; $b_i$ is the bias term of each layer; $U$ is the weight matrix for the input at time step $t$; $x_t$ is the input at time step $t$. Hidden layer 1 and Hidden layer 2 denote the first and second hidden layers, respectively.
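The recurrence in Figure 7 can be written out directly; the following is a minimal NumPy sketch with illustrative shapes. For PM2.5 regression, the softmax output layer would typically be replaced by a linear one.

```python
# A minimal sketch of the RNN recurrence in Figure 7:
# h_t = f(U x_t + W h_{t-1} + b_h), O_t = softmax(V h_t + b_o).
# Shapes are illustrative; this is not the trained model from the paper.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 3
U = rng.normal(size=(d_h, d_in))   # input weights
W = rng.normal(size=(d_h, d_h))    # recurrent weights on the previous hidden state
V = rng.normal(size=(d_out, d_h))  # output weights
b_h, b_o = np.zeros(d_h), np.zeros(d_out)

xs = rng.normal(size=(30, d_in))   # 30 time steps of features
h = np.zeros(d_h)
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h + b_h)  # hidden state carries history forward
O_t = softmax(V @ h + b_o)              # output at the final time step
print(O_t)
```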
Figure 8. Network structure of Bi-RNN-based model.
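A minimal PyTorch sketch of the bidirectional idea in Figure 8 follows: the same sequence is read forward and backward, and the two hidden representations are concatenated before the output layer. The class name and sizes are illustrative, not the architecture trained in this study.

```python
# A minimal sketch of a bidirectional recurrent forecaster in the spirit of
# Figure 8; names and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class BiGRUForecaster(nn.Module):
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        # bidirectional=True runs the sequence in both directions.
        self.rnn = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # 2x: forward + backward states

    def forward(self, x):                      # x: (batch, time, features)
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])        # forecast from the last time step

model = BiGRUForecaster()
y = model(torch.randn(4, 30, 8))               # 4 windows of 30 days, 8 features
print(y.shape)                                 # torch.Size([4, 1])
```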
Figure 9. Univariate explanation results of SHAP for CatBoost (left: 30 days) and for LightGBM (middle: 90 days; right: 180 days). The scatter plot of the explanation results is shown above, and the mean plot is shown below. For each variable in each observation, SHAP calculates its contribution to the forecasted outcome; the absolute values of these contributions are averaged per variable to yield that variable's global contribution. Each observation's SHAP value is displayed as a point in the scatter plot, and the global contributions are plotted in the mean plot. In the scatter plot, the horizontal axis is the SHAP value, where a larger value indicates a stronger push toward a higher forecast of the target variable; the vertical axis ranks the variables by their contribution to the forecast, with higher-ranked variables having greater influence on the final result. Point color transitions from blue to red as the variable's value increases.
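Plots of this kind can be generated with the shap package; the following is a minimal sketch on synthetic data with illustrative feature names, not the study's data or fitted models.

```python
# A minimal sketch of the two SHAP views in Figure 9, assuming a fitted
# tree-based model and a feature DataFrame X; all names are illustrative.
import numpy as np
import pandas as pd
import shap
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["PM10", "CO", "dew_point", "temperature"])
y = 0.8 * X["PM10"] + 0.5 * X["CO"] + rng.normal(size=500)

model = LGBMRegressor(n_estimators=200).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one contribution per feature per row

shap.summary_plot(shap_values, X)                    # scatter (beeswarm) view
shap.summary_plot(shap_values, X, plot_type="bar")   # mean |SHAP| view
```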
Figure 10. SHAP interaction-effect plots of PM10 and CO for CatBoost (left: 30 days) and for LightGBM (middle: 90 days; right: 180 days). The horizontal axis shows the value of the variable, and the shaded area shows the data distribution of that variable. The left vertical axis shows the SHAP value; the right vertical axis shows the variable with which the plotted variable interacts most strongly, with color transitioning from blue (smallest value) to red (largest value).
Figure 11. SHAP interaction-effect plots of dew point and temperature for CatBoost (left: 30 days) and for LightGBM (middle: 90 days; right: 180 days).
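The interaction views in Figures 10 and 11 follow shap's dependence-plot pattern. The sketch below builds a synthetic PM10 × CO interaction purely for illustration; the feature names, data, and model are stand-ins, not the study's.

```python
# A minimal sketch of the interaction plots in Figures 10 and 11,
# on synthetic data with a built-in PM10 x CO interaction.
import numpy as np
import pandas as pd
import shap
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["PM10", "CO", "dew_point", "temperature"])
# PM10's effect is amplified when CO is high, mimicking the reported pattern.
y = 0.8 * X["PM10"] * (1 + 0.5 * X["CO"]) + rng.normal(size=500)

model = LGBMRegressor(n_estimators=200).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

# PM10's SHAP values against PM10, colored by CO (Figure 10's pattern).
shap.dependence_plot("PM10", shap_values, X, interaction_index="CO")
# Dew point colored by temperature (Figure 11's pattern).
shap.dependence_plot("dew_point", shap_values, X, interaction_index="temperature")
```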
Table 1. Model information summary.
| Model | Description | Advantages | Disadvantages |
|---|---|---|---|
| XGBoost | Gradient Boosting framework | High accuracy, feature-importance assessment | Computationally expensive |
| LightGBM | Gradient Boosting framework | Fast training speed, low memory usage | Prone to overfitting |
| CatBoost | Gradient Boosting framework | Handles categorical features well | Slower training speed than the others |
| MLP | Multilayer Perceptron neural network | Versatile, can handle diverse problems | Sensitive to input scaling |
| RNN | Recurrent Neural Network | Sequential data modeling | Difficulty capturing long-term dependencies |
| LSTM | Long Short-Term Memory network | Captures long-term dependencies | Prone to overfitting |
| GRU | Gated Recurrent Unit network | Fewer parameters than LSTM | May struggle with complex sequences |
| Bi-RNN | Bi-directional RNN | Captures information from both directions | Higher computational complexity |
| Bi-LSTM | Bi-directional LSTM | Captures information from both directions | Increased memory requirements |
| Bi-GRU | Bi-directional GRU | Captures information from both directions | Complexity may lead to overfitting |
Table 2. Comparison of forecasting performance. Bold represents the best performance value for each metric at each horizon.
| Model | R² (30 d) | RMSE (30 d) | MAE (30 d) | R² (90 d) | RMSE (90 d) | MAE (90 d) | R² (180 d) | RMSE (180 d) | MAE (180 d) |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.9623 | 24.180 | 0.0144 | 0.9376 | 28.167 | 0.0239 | 0.9349 | 28.471 | 0.0210 |
| LightGBM | 0.9639 | 23.653 | 0.0148 | 0.9433 | 26.863 | 0.0227 | 0.9414 | 27.001 | 0.0203 |
| CatBoost | **0.9734** | **20.310** | **0.0128** | 0.9297 | 29.893 | 0.0252 | 0.9349 | 28.471 | 0.0214 |
| MLP | 0.9278 | 33.298 | 0.0218 | 0.9498 | 26.305 | 0.0256 | 0.9304 | 28.036 | 0.0264 |
| RNN | 0.9614 | 24.462 | 0.0149 | 0.9639 | 21.418 | 0.0206 | 0.9564 | 23.274 | 0.0185 |
| LSTM | 0.9659 | 22.973 | 0.0148 | 0.9521 | 24.677 | 0.0216 | 0.9531 | 24.149 | **0.0182** |
| GRU | 0.9327 | 32.307 | 0.0227 | 0.9513 | 24.878 | 0.0237 | 0.9520 | 24.433 | 0.0189 |
| Bi-RNN | 0.9603 | 24.807 | 0.0165 | 0.9570 | 23.375 | 0.0262 | 0.9576 | 22.970 | 0.0189 |
| Bi-LSTM | 0.9419 | 30.005 | 0.0212 | **0.9661** | **20.740** | **0.0196** | 0.9576 | 22.949 | 0.0185 |
| Bi-GRU | 0.9648 | 23.345 | 0.0153 | 0.9656 | 20.912 | 0.0214 | **0.9589** | **22.601** | **0.0182** |
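As a reference for Table 2, the following is a minimal sketch of computing the three reported metrics with scikit-learn. The arrays are illustrative stand-ins for true and forecasted PM2.5 series, not results from the paper.

```python
# A minimal sketch of the metrics in Table 2: R^2, RMSE, and MAE.
# y_true and y_pred are illustrative placeholders.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([35.0, 80.0, 120.0, 60.0, 45.0])
y_pred = np.array([38.0, 75.0, 110.0, 63.0, 50.0])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
mae = mean_absolute_error(y_true, y_pred)
print(f"R2={r2:.4f}  RMSE={rmse:.3f}  MAE={mae:.4f}")
```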