Article

Short-Term PM2.5 Concentration Changes Prediction: A Comparison of Meteorological and Historical Data

1 School of Civil and Surveying & Mapping Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
2 Guangdong Science & Technology Infrastructure Center, Guangzhou 510033, China
3 School of Marine Technology and Geomatics, Jiangsu Ocean University, Lianyungang 222005, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(14), 11408; https://doi.org/10.3390/su151411408
Submission received: 12 April 2023 / Revised: 14 July 2023 / Accepted: 18 July 2023 / Published: 22 July 2023
(This article belongs to the Special Issue Geographic Big Data Analysis and Urban Sustainable Development)

Abstract
Machine learning is being extensively employed in the prediction of PM2.5 concentrations. This study aims to compare the prediction accuracy of machine learning models for short-term PM2.5 concentration changes and to find a universal and robust model for both hourly and daily time scales. Five commonly used machine learning models were constructed, along with a stacking model consisting of Multivariable Linear Regression (MLR) as the meta-learner and the ensemble of Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM) as the base learners. The meteorological datasets and historical PM2.5 concentration data with meteorological datasets were preprocessed and used to evaluate the models' accuracy and stability across hourly and daily time scales, using the coefficient of determination (R2), Root-Mean-Square Error (RMSE), and Mean Absolute Error (MAE). The results show that historical PM2.5 concentration data are crucial for the prediction precision of the machine learning models. Specifically, on the meteorological datasets, the stacking model, XGBoost, and RF had better performance for hourly prediction, and the stacking model, XGBoost, and LightGBM had better performance for daily prediction. On the historical PM2.5 concentration data with meteorological datasets, the stacking model, LightGBM, and XGBoost had better performance for both hourly and daily datasets. Consequently, the stacking model outperformed the individual models, with the XGBoost model being the best individual model for predicting the PM2.5 concentration based on meteorological data, and the LightGBM model being the best individual model for predicting the PM2.5 concentration using historical PM2.5 data with meteorological datasets.

1. Introduction

Owing to China’s rapid development of industrialization and urbanization, air pollution has become increasingly severe [1,2,3]. PM2.5 is a common environmental pollutant that is closely associated with visibility, human health, meteorology, and climate [4,5,6,7,8,9,10]. Therefore, accurate prediction of PM2.5 concentrations is crucial for government officials to control air pollution.
Because of the uneven distribution of air quality and meteorological stations, researchers have attempted to use meteorological factors to predict PM2.5 concentrations and have found that meteorological data can serve as a valuable supplement for missing air quality data [11]. Previous studies have demonstrated a close relationship between PM2.5 concentrations and factors such as wind speed, wind direction, humidity, air pressure, and temperature [12]. Some scholars have utilized meteorological factors to predict changes in PM2.5 concentrations [13]. Others proposed a hybrid spatiotemporal Land Use Regression (LUR) model system that combines Support Vector Regression (SVR), MLR, and the ST algorithm, which yielded good spatial prediction performance in almost all time panels [14]. However, these studies did not compare prediction precision across both hourly and daily time scales. In addition, research is still lacking on a robust PM2.5 concentration prediction model that can be applied simultaneously to both meteorological-factor-based and historical-data-based PM2.5 concentration prediction.
Recently, in addition to mechanism models [15,16,17], statistical models and machine learning methods have become the main approaches for PM2.5 concentration prediction. Statistical models, such as the grey prediction model [18,19] and the Multiple Linear Regression (MLR) model [20], are commonly used for PM2.5 concentration prediction. Machine learning methods for PM2.5 concentration prediction mainly include Support Vector Regression (SVR) [21,22,23], Random Forest (RF) [24,25,26,27], and Long Short-Term Memory (LSTM) [28,29,30,31]. Chen [27] demonstrated that the Random Forest approach can be used to estimate daily PM2.5 concentrations across China. Zhai [31] proposed an LSTM approach for predicting air quality, and the results indicated better prediction performance than traditional methods for the PM2.5, PM10, O3, NO2, SO2, and CO concentrations. However, most studies used a single machine learning model to predict the PM2.5 concentration. Linear models perform poorly when processing nonlinear and large amounts of data, and neural networks suffer from overfitting, slow convergence, and poor generalization [32]. Therefore, combined models [33,34,35,36,37,38] and hybrid models [22,39,40,41,42,43] have been used to improve the prediction accuracy for the PM2.5 concentration. However, most of these combined models focused on weighted averages or a simple combination of data preprocessing, modeling, and optimization techniques, lacked complementary training, and ignored robustness. A general stacking ensemble algorithm that fully integrates multiple machine learning models has been used for PM2.5 concentration prediction [44,45]. Stacking integration technology finds the optimal combination of base learners by training a higher-level learner, and conducts cross-validation training on the base learners. Based on the output of the base learners, secondary features are constructed to train the meta-learner [46].
As a combined model, stacking can overcome the limitations of individual models by integrating multiple machine learning methods [47], and it has shown promising performance in various applications [46,48]. However, in addition to model selection, many studies have faced challenges when dealing with datasets [49]. Existing studies have focused on using only one dataset for PM2.5 concentration prediction, either meteorological data or historical pollution data [50], and a comparison and analysis of model performance using both datasets simultaneously is lacking.
The specific objectives of this study are threefold: (1) to develop a PM2.5 concentration prediction model using stacking technology, (2) to analyze the potential of using meteorological datasets for PM2.5 concentration prediction, and (3) to compare the prediction accuracy of using meteorological datasets and historical PM2.5 concentration datasets with meteorological datasets.

2. Materials and Methods

2.1. Study Area and Datasets

Jiangxi Province (24.29° N–30.04° N, 113.34° E–118.28° E) is located in southeastern China and is an important node of the Yangtze River Economic Belt. In recent years, as restrictive environmental policies have been enacted in coastal provinces, many high-energy-consuming and highly polluting enterprises have relocated to Jiangxi Province, and as a result, the air quality in some areas of the province does not meet the national secondary standards (GB 3095-2012) [51].
The datasets included hourly meteorological data and hourly air quality data from 2016 to 2018, in which historical meteorological data were derived from the China Meteorological Information Center website (http://data.cma.cn/ (accessed on 15 April 2020)), and historical air quality data were derived from the China Environmental Monitoring Station (http://www.cnemc.cn/ (accessed on 15 April 2020)). There are 91 meteorological stations covering all cities of Jiangxi Province, and 60 air quality monitoring stations located in the central cities and industrial areas of the province. As the meteorological stations and air quality stations occupy different geographical locations, it is necessary to match them by distance. After considering the actual matching situation of Jiangxi Province and relevant research conducted by scholars [52,53], the matching distance between meteorological stations and air quality stations was set to 20 km.
In the process of station matching, the meteorological data of meteorological stations with more than one air quality station and the air quality data of the matched air quality stations were used as the research data; the meteorological stations were matched to nearby air quality stations within a distance threshold of 20 km [52,53]. This matching approach was adopted because meteorological data have broader coverage, and the meteorological data from several air quality stations near a meteorological station are consistent [54]. After matching, data from 17 meteorological stations and 57 air quality monitoring stations were used (the specific information of the site is shown in Tables S1 and S2). Figure 1 displays the distribution and the use of meteorological stations (represented by red dots and five-pointed stars) and air quality monitoring stations (represented by blue dots and five-pointed stars) in Jiangxi Province. The base map of Figure 1 was downloaded from the Jiangxi Provincial Geographic Information Public Service Platform (http://bnr.jiangxi.gov.cn/col/col45382/index.html (accessed on 15 April 2022)) without any modifications. For the selection of the input variable, considering the quality and completeness of the data in the Jiangxi area and related research on PM2.5 concentration prediction [50,55], 10 variables (Table 1) were finally used as input variables for the prediction [56].

2.2. Data Preprocessing

2.2.1. Data Quality Control

Preprocessing historical data from air quality and meteorological stations is a crucial step in ensuring data accuracy and reliability. This involves identifying and removing abnormal or missing values, which can distort the data and lead to incorrect results. Specifically, if a certain meteorological data item is missing or abnormal, all data for that hour will be removed. Similarly, when preprocessing PM2.5 historical data, not only are abnormal values removed, but data with PM2.5 concentrations below 0 μg/m3 or above 1000 μg/m3 are excluded, as these values are considered outliers [57]. This preprocessing is essential for conducting accurate and meaningful analyses of air quality and meteorological data.
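A minimal sketch of this quality-control step could look like the following; the column names and values are illustrative placeholders, not taken from the paper's datasets:

```python
import pandas as pd

# Hypothetical hourly records; column names are illustrative.
df = pd.DataFrame({
    "pm25": [35.0, -5.0, 1200.0, 88.0, None],
    "temperature": [21.3, 20.9, 22.1, None, 19.5],
})

# Drop any hour with a missing value, then exclude PM2.5 outliers
# outside the 0-1000 ug/m3 range described in Section 2.2.1.
clean = df.dropna()
clean = clean[(clean["pm25"] >= 0) & (clean["pm25"] <= 1000)]
print(len(clean))
```

Here only the first row survives: two rows have missing values, and two fall outside the valid PM2.5 range.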

2.2.2. Data Matching

Spatial–temporal matching of station and time information involves several steps, as illustrated in Figure 2:
(1) The geographic location information of the meteorological and air quality stations is used to match the average PM2.5 concentration of corresponding stations and obtain a simultaneous dataset;
(2) Based on the simultaneous dataset, future 1–6 h time scale data are matched;
(3) The daily average dataset is then calculated from the simultaneous dataset;
(4) Finally, based on the daily average dataset, future 1–6-day time scale data are matched.
These steps are necessary to ensure that the datasets are appropriately matched and that the data analysis is reliable and accurate.
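Steps (2) and (3) above can be sketched with pandas; this is a toy, single-station example (the index, column names, and values are assumptions for illustration):

```python
import pandas as pd

# Illustrative simultaneous dataset: one station, ten hourly PM2.5 values.
ts = pd.date_range("2017-01-01", periods=10, freq="H")
df = pd.DataFrame({"time": ts, "pm25": range(10)}).set_index("time")

# Step (2): match future 1-6 h targets by shifting the series backwards,
# pairing each row's predictors with the PM2.5 value h hours ahead.
for h in range(1, 7):
    df[f"pm25_t+{h}h"] = df["pm25"].shift(-h)

# Step (3): daily average dataset computed from the simultaneous dataset.
daily = df[["pm25"]].resample("D").mean()

# Keep only rows where all six future targets exist.
matched = df.dropna()
print(len(matched))
```

With ten hourly records, only the first four rows have all six future targets available, so four matched records remain.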
Figure 2. Matching process of station data.
After matching and merging the data from each site, the final datasets were obtained. The dataset of meteorological factors for estimating the PM2.5 concentration included 419,147 records at the hourly scale and 18,001 records at the daily scale. The different time scale PM2.5 concentration-estimation datasets included 414,163 records at the 1 h scale, 413,101 at the 2 h scale, 412,101 at the 3 h scale, 411,264 at the 4 h scale, 410,741 at the 5 h scale, 410,443 at the 6 h scale, 17,783 at the 1-day scale, 17,680 at the 2-day scale, 17,608 at the 3-day scale, 17,534 at the 4-day scale, 17,465 at the 5-day scale, and 17,391 at the 6-day scale.
These datasets are critical for conducting accurate and meaningful analyses of the PM2.5 concentration and its relationship with meteorological factors. The large number of records in these datasets reflects the extensive data collection efforts and underscores the importance of data preprocessing and matching to ensure the reliability and validity of the data.

2.2.3. Normalization and Division of the Datasets

Before training, the data were normalized and divided into a training set (90%) and a test set (10%) [58] (Table 2).

2.3. Research Methods

2.3.1. Prediction Model

The construction process of an individual model is shown in Figure 3. First, the initial parameters of each algorithm were set based on their characteristics. Then, according to the specific situation of each machine learning algorithm, the parameter values and ranges were adjusted. Ten-fold cross-validation was used to select the optimal parameters [59]. Hyperparameter optimization for all single models was implemented with the GridSearch method [55] in the Scikit-Learn library of Python 3.6 (Python Software Foundation, Fredericksburg, VA, USA).
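The grid search with ten-fold cross-validation described above can be sketched as follows; the parameter grid here is a small illustrative subset (the paper's full ranges are in Tables S3–S5), and the data are random placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the matched station datasets.
rng = np.random.default_rng(0)
X, y = rng.random((60, 5)), rng.random(60)

# Small illustrative grid; real ranges would come from Tables S3-S5.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,          # ten-fold cross-validation, as in the paper
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to XGBoost and LightGBM via their Scikit-Learn-compatible regressor classes.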
(1) Random Forest model
RF is modeled based on the Bootstrap idea, which has high prediction accuracy and overcomes over-fitting. It has been widely used in medicine, bioinformatics, and agriculture [60,61,62]. The algorithm was implemented using the RandomForestRegressor method in the Scikit-Learn library, and the GridSearch method was used to adjust the main parameters of the algorithm, such as n_estimators, oob_score, max_features, max_depth, min_samples_split, and min_samples_leaf. The parameter selection range and results are shown in Table S3.
(2) XGBoost model
XGBoost is a boosting algorithm commonly used in regression and classification, especially in text classification, customer behavior prediction, and advertising click-through rate prediction [63,64]. The algorithm was implemented using the XGBoost library's Scikit-Learn-compatible interface, and the main parameters of the algorithm (booster, n_estimators, max_depth, min_child_weight, gamma, subsample, colsample_bytree, reg_alpha, and reg_lambda) were adjusted step-by-step using the GridSearch method. The parameter selection range and results are shown in Table S4.
(3) LightGBM model
LightGBM [65,66] is a framework developed by Microsoft that can be used for ranking, classification, regression, and many other machine learning tasks. It is a gradient boosting framework based on decision trees, with the advantages of distributed computation and high performance. The algorithm was implemented through the Scikit-Learn and LightGBM libraries using the LGBMRegressor method. The main parameters (boosting_type, n_estimators, max_depth, num_leaves, min_child_samples, min_child_weight, subsample, colsample_bytree, reg_lambda, and reg_alpha) were adjusted step-by-step using the GridSearch method. The results are shown in Table S5.
(4) Stacking model
Stacking is an ensemble algorithm that integrates multiple lower-level learners to obtain a higher-level learner, overcoming the shortcomings of a single model and achieving higher accuracy. A stacking model commonly consists of two layers. The first layer, the "base learners", takes the initial training set as input. The second layer, the "meta-learner", takes the output of the first layer as input data to train and obtain the final result. For the first layer, models with strong learning ability and diversity can greatly improve the overall prediction effect. For the second layer, the input features of the "meta-learner" are obtained by combining the output features calculated through cross-validation of the "base learners". These features are strongly correlated, and the input and output are linearly related; therefore, a simple model with good stability is usually selected as the second-layer model.
(5) Adaboost model
The Adaboost model is an iterative algorithm that belongs to the family of boosting algorithms and is commonly used for classification and regression tasks. Its core purpose is to train different classifiers (weak classifiers) on the same training set and then combine them to form a stronger final classifier (strong classifier).
(6) DT model
Decision Tree (DT) is a non-parametric supervised learning method for classification and regression. The model predicts the value of the target variable by learning simple decision rules inferred from the features of the data. DT is commonly used in operations research and decision analysis.
Many previous studies on PM2.5 concentration prediction based on machine learning models showed that the XGBoost model and RF model have high prediction accuracy and good performance [67,68,69]. The LightGBM model has high prediction accuracy and efficiency in many applications [65,66]. Therefore, in this study, considering performance and modeling speed simultaneously, RF, XGBoost, and LightGBM were chosen as the "base learners" for the first layer, and Multiple Linear Regression (MLR) was chosen as the "meta-learner" for the second layer to construct a stacked ensemble model. The parameters of the XGBoost, RF, and LightGBM algorithms were determined during single model construction. The multiple linear regression model was implemented using the Scikit-Learn library's LinearRegression method, without the need for parameter selection. The results are shown in Table S6. The stacking model construction process is shown in Figure 4. The steps are as follows:
(1) Obtain the original training set and the original test set;
(2) Each base model was trained using 5-fold cross-validation [70,71]. The training set is divided into 5 parts; in each fold, 4 parts are used for training and the remaining part is predicted, yielding an out-of-fold prediction, a, while the trained model also predicts the test set, yielding a prediction, b. After the five folds, the five out-of-fold predictions, a, are concatenated into one column, A, and the five test-set predictions, b, are averaged into B. The resulting one-dimensional A has the same length as the training set;
(3) Step (2) was applied to the RF model, the XGBoost model, and the LightGBM model, respectively, producing 3 columns of A and 3 columns of B. Combining the 3 A columns with the actual values of the original training set, and the 3 B columns with the actual values of the original test set, yielded a new training set and a new test set, which were input into the "meta-learner";
(4) A multiple linear regression algorithm was trained on the new training set. The trained model was saved, and the stacking model's predictions were then obtained by inputting the new test set.
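The four steps above correspond closely to Scikit-Learn's StackingRegressor. The sketch below follows that scheme under two stated assumptions: the data are random placeholders, and GradientBoostingRegressor stands in for XGBoost and LightGBM in case those libraries are not installed:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression

# Placeholder data standing in for the matched station datasets.
rng = np.random.default_rng(0)
X, y = rng.random((100, 10)), rng.random(100)

# Base learners feed out-of-fold predictions (cv=5, as in step (2))
# to the MLR meta-learner, which is trained on them (steps (3)-(4)).
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("gb1", GradientBoostingRegressor(random_state=0)),
        ("gb2", GradientBoostingRegressor(max_depth=2, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X, y)
pred = stack.predict(X)
print(pred.shape)
```

StackingRegressor handles the A/B bookkeeping of step (2) internally: each base learner's column A is built from out-of-fold predictions, so the meta-learner never sees predictions made on data the base learner was trained on.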
Figure 4. The process of stacking model construction.

2.3.2. Model Evaluation

Three dimensionless indicators, R2, RMSE, and MAE [72,73], were used to evaluate these models. The calculation methods for each indicator are as follows:
$$R^2 = 1 - \frac{\sum_{t=1}^{n} \left( y_{ot} - y_{mt} \right)^2}{\sum_{t=1}^{n} \left( y_{ot} - \bar{y}_{o} \right)^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( y_{ot} - y_{mt} \right)^2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| y_{ot} - y_{mt} \right|$$

where n represents the number of data points; y_mt is the predicted value at point t; y_ot is the real (observed) value; and ȳ_o represents the average of the observed values. Generally, the closer the R2 value is to 1 and the smaller the values of RMSE and MAE, the better the model's performance.
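The three indicators are straightforward to implement directly from their definitions; the observed and predicted values below are made-up numbers for illustration:

```python
import numpy as np

def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_obs, y_pred):
    """Root-Mean-Square Error."""
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

def mae(y_obs, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_obs - y_pred))

# Toy example: four observed vs. predicted PM2.5 values.
y_obs = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 41.0])
print(r2(y_obs, y_pred), rmse(y_obs, y_pred), mae(y_obs, y_pred))
```

Note that RMSE squares the errors before averaging, so it penalizes large errors more heavily than MAE does on the same residuals.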

3. Results

3.1. PM2.5 Concentration Prediction Based on Meteorological Datasets

3.1.1. Current PM2.5 Concentration Estimation

Meteorological data cannot directly reflect the PM2.5 concentration, so before using meteorological data to predict the PM2.5 concentration, experiments were conducted to verify the precision of the stacking model and the individual models by using meteorological data to estimate the current hourly and current daily PM2.5 concentrations; the results are shown in Table 3. When estimating the current hourly PM2.5 concentration, the accuracy of each model was good, with R2 values above 0.8; in terms of comprehensive indicators, the stacking model performed the best. However, when using meteorological data to estimate the current daily PM2.5 concentration, the accuracy of each model was moderate, with average R2, RMSE, and MAE values of 0.76, 12.63, and 9.00. The performance achieved when estimating the hourly PM2.5 concentration was better than that for the current daily concentration. Air quality monitoring stations are concentrated in city centers and industrial areas, and their spatial distribution is uneven, resulting in a lack of monitoring of PM2.5 and other air pollutant concentrations in some areas. Using meteorological data to estimate the PM2.5 concentration can fill gaps where PM2.5 monitoring is absent or data are missing; compared with spatial interpolation methods, the stacking model was more reliable and accurate.

3.1.2. Hourly PM2.5 Concentration Prediction

Based on the hourly dataset of meteorological factors, the prediction of the hourly PM2.5 concentration for the future 1–6 h is shown in Figure 5. The X-axis represents the predicted PM2.5 concentration, while the Y-axis represents the measured PM2.5 concentration. The fitted function between the predicted and measured values, together with R2, MAE, and RMSE, is shown in the upper-left corner. The blue dashed line in the middle of each panel is the 1:1 line, and the red solid line is the fitted function line. If the red line lies above the blue line, the predicted value is greater than the measured value; if below, the measured value is greater than the predicted value. The farther the scatter points deviate from the 1:1 line, the greater the difference between the predicted and measured results. According to the relationship between the red fitted line and the 1:1 line, the models generally predicted higher results when the measured PM2.5 concentration values were low, and slightly lower results when the measured values were high. The stacking model performed best of the six models: its R2 value was 0.88 on the 4 h scale and 0.89 on the other scales; the RMSE values were 9.49, 9.58, 9.52, 9.87, 9.79, and 9.32; and the MAE values were 6.10, 6.07, 6.15, 6.14, 6.17, and 6.12, respectively. Compared with the best-performing base model at each time scale on this dataset, the stacked ensemble model achieved average improvements of 0.9%, 5.3%, and 1.3%.

3.1.3. Daily Average PM2.5 Concentration Prediction

The prediction of the daily average PM2.5 concentration for the future 6 days based on the meteorological datasets is shown in Figure 6. According to the relationship between the red fitted line and the 1:1 line, the models generally predicted higher results when the measured PM2.5 concentration values were low, and slightly lower results when the measured values were high. The fitted slopes of the models were all less than 1, and the stacking model had the smallest deviation and was closest to the 1:1 line. Its performance was optimal on the different time scales, with index values improved by 1.41%, 1.99%, 1.98%, 4.11%, 3.57%, and 3.52% on average compared with the single models. Compared with the best-performing base model at each time scale on this dataset, the stacking model improved R2, RMSE, and MAE by 3.1%, 3.6%, and 3.2%.

3.1.4. Model Stability Comparison Analysis

For the model stability comparison, and to increase the credibility of the research, in addition to comparing the stacked ensemble model and its component models, this study also added the commonly used ensemble Adaboost model [74,75] and DT model [76,77] to the experiment. The changes in all models on the hourly and daily scales based on the meteorological datasets are shown in Figure 7 and Figure 8. With the increase in the time scale, the indicators of the Stacking, XGBoost, LightGBM, and RF models did not change significantly, while the Adaboost and DT models showed obvious changing trends.
When utilizing meteorological data, XGBoost exhibited superior prediction accuracy for the PM2.5 concentration compared with other individual models; among all models, the stacking model had the most stable trends and the best prediction performance.

3.2. PM2.5 Concentration Prediction Based on Historical PM2.5 Concentration with Meteorological Datasets

3.2.1. Hourly PM2.5 Concentration Prediction

The predicted results for the future 1–6 h PM2.5 concentration values are shown in Figure 9. The stacking model performed better than the single models in predicting the PM2.5 concentration for the future 1–6 h, with R2 values ranging from 0.92 to 0.97. The best prediction performance was for the future 1 h (R2: 0.97, RMSE: 5.03, MAE: 2.91), while the future 6 h prediction was poorer than the other hourly predictions (R2: 0.92, RMSE: 8.24, MAE: 5.29). As the prediction horizon increased by each hour, R2 decreased by an average of 1%, while RMSE and MAE increased by averages of 10% and 16%, respectively; that is, performance gradually decreased with time. On this dataset, the stacking model achieved average improvements in R2 of 0.3%, RMSE of 6.3%, and MAE of 3% compared with the best-performing single models at each time scale.

3.2.2. Daily Average PM2.5 Concentration Prediction

Figure 10 displays the daily average PM2.5 concentration prediction for the next 1–6 days. The stacking model outperformed the individual models in predicting the PM2.5 concentrations for the next 1 day, with an R2 value of 0.82. However, the prediction accuracy decreased for the next 2–6 days, with R2 values ranging from 0.71 to 0.73. Despite this, the stacking model still performed better than all individual models across all time scales, with an average increase of 2.3% for R2, 2.4% for RMSE, and 3.2% for MAE, compared with the best of the individual models for different time scales.

3.2.3. Model Stability Comparison Analysis

Figure 11 and Figure 12 provide a comparison of the stability of all models on hourly and daily time scales. It can be observed that, as the time scale increased, the performance of all models gradually decreased, as evidenced by decreasing R2 and increasing RMSE and MAE values. The stacking model exhibited the smoothest broken line and showed the smallest change, indicating its strong stability and lower susceptibility to the effect of time. Conversely, the Adaboost model displayed the steepest broken line, indicating the weakest stability among all models and the most obvious range of change.
When using historical PM2.5 data with meteorological datasets, LightGBM performed better than other individual models, and among all models, the stacking model still had the most stable trends and the best prediction performance.

4. Discussion

Many existing studies on PM2.5 concentration prediction based on machine learning models show that the XGBoost model and RF model have high prediction accuracy and good performance [67,68,69]. Hourly PM2.5 concentration forecasting can provide accurate PM2.5 levels and pollution warnings for environmental agencies [56]. However, these studies only deployed hourly prediction experiments, with no further comparison of model precision across different time scales, including hourly and daily. In our experiments, twelve meteorological datasets and twelve historical PM2.5 with meteorological datasets were preprocessed to compare short-term PM2.5 concentration prediction, covering the future 1 h to 6 h and the future 1 day to 6 days. The results demonstrate that the XGBoost model was the best individual model for predicting the PM2.5 concentration based on meteorological data, and the LightGBM model was the best individual model when using historical PM2.5 data with meteorological datasets. In our previous work [56], variable importance analysis was conducted on the saved XGBoost model; the main parameters selected, in order of importance, were RHU, TEM, PRS, GST, and WIN_S. A seasonal analysis of the models' prediction accuracy showed that XGBoost and the other individual models performed poorly in spring and summer, and better in autumn and winter. PM2.5 concentrations are lower in summer, whereas in winter they are higher and there are multiple pollution sources; coal-fired heating is one of the largest sources of PM2.5 pollution [78]. This indicates that the model had a more stable prediction ability in polluted seasons.
The stacking model performed better than all individual models in predicting the PM2.5 concentrations across different time scales on both hourly and daily datasets (as shown in Table 4). The model using historical PM2.5 data with meteorological datasets performed better than those only using meteorological datasets, indicating the importance of historical pollutant concentrations in accurately predicting the PM2.5 concentration [79]. For hourly PM2.5 concentration prediction, the R2, RMSE, and MAE indexes increased on average by 5.6%, 26.4%, and 27.9% respectively; meanwhile, for the average daily PM2.5 concentration prediction, the R2, RMSE, and MAE indexes increased on average by 4.2%, 5.6%, and 7.1% respectively. The stacking model performed better in the hourly PM2.5 concentration prediction than in the daily average PM2.5 concentration prediction on the historical PM2.5 concentration dataset. Additionally, the stacking model and all individual models showed high accuracy in hourly prediction, while for daily average prediction, only the daily average prediction for the next day had high accuracy. Certain regions may exhibit diverse pollution sources, including both point sources and area sources. Additionally, variations in climatic conditions across different areas can influence the trends in the PM2.5 concentrations. Moreover, as the forecasting time increased, the impact of prediction factors on the changes in the PM2.5 concentration weakened.
The stacking model outperformed the single models for two reasons. (1) Choice of base learners: RF, LightGBM, and XGBoost were chosen as the base learners for the stacking method in this study. The RF model handles nonlinear problems by integrating multiple decision trees, which avoids the tendency of neural networks to overfit and compensates for the support vector machine model's over-reliance on the user's choice of kernel function and on a regression function that must be set in advance [80]. The LightGBM model uses gradient-based one-side sampling and exclusive feature bundling to improve the traditional GBDT algorithm, which increases prediction speed and reduces computational complexity [80,81]. The XGBoost model performs a second-order Taylor expansion of the cost function and adds a regularization term, which improves training speed and reduces overfitting. The stacking method then learns an optimal combination of these base learners through training. (2) Structure of the stacking model: the input of the meta-learner is the output of the base learners, which effectively prevents the overfitting caused by the repeated use of data and combines the strengths of the individual base learner models, improving the ensemble effect and the model's overall predictive capability.
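The two-level structure described in point (2) can be sketched in a few lines: the base learners' predictions become the meta-learner's input features, and the meta-learner is a multivariable linear regression fit by least squares. For brevity this toy uses two hypothetical base models and a no-intercept linear meta-learner solved with Cramer's rule; the study's actual base learners (RF, XGBoost, LightGBM) and MLR meta-learner would slot into the same structure.

```python
def fit_meta_learner(p1, p2, y):
    """Least-squares weights for y ≈ w1*p1 + w2*p2, where p1 and p2 are two
    base learners' predictions (2x2 normal equations via Cramer's rule)."""
    a11 = sum(a * a for a in p1)
    a22 = sum(b * b for b in p2)
    a12 = sum(a * b for a, b in zip(p1, p2))
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def stack_predict(p1, p2, w):
    """Meta-learner output: a weighted combination of base predictions."""
    return [w[0] * a + w[1] * b for a, b in zip(p1, p2)]

# Base model 1 tracks the target well; base model 2 is an uninformative constant.
y  = [1.0, 2.0, 3.0]
p1 = [1.0, 2.0, 3.0]
p2 = [1.0, 1.0, 1.0]
w = fit_meta_learner(p1, p2, y)  # → weights (1.0, 0.0)
```

The meta-learner automatically down-weights the weak base model, which is the sense in which stacking "achieves an optimal combination of base learners through training"; in practice the base predictions are produced out-of-fold so the meta-learner never sees training-set leakage.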
Based on the results, the stacking model is not well suited to long-time-series tasks, for two reasons. First, its complexity increases significantly: as the number of stacked models grows, so do the computational resource requirements, and training and optimization become more challenging because the model must learn from and predict over a longer history of information. Second, data lag becomes an issue: each sub-model may only have access to past observations as input features, making it difficult to capture the relationship between current observations and future time points, which hampers the stacking model's predictive performance.
Furthermore, the stacking model is not limited to PM2.5 prediction; it can also be applied to other tasks, such as solar radiation [82] and electricity price forecasting [83]. These tasks exhibit clear time-series characteristics and are influenced by multiple factors. To generate accurate prediction results, all such tasks require considering the impact of the various relevant factors and selecting and constructing appropriate individual models to capture their relationships.

5. Conclusions

This study performed data preprocessing and variable selection based on meteorological and air quality data from 2016 to 2018 for Jiangxi Province, China. Then, five PM2.5 concentration prediction models and a stacking model based on RF, XGBoost, and LightGBM algorithms were constructed, and the R2, RMSE, and MAE indicators were used to evaluate the predictive ability.
The models' prediction performance when using historical PM2.5 data with meteorological datasets was better than when using the meteorological data alone; additionally, when data were missing from a ground PM2.5 monitoring site, meteorological data could still be used to predict PM2.5 concentration changes. The stacking model performed better than all individual models in predicting the PM2.5 concentration at different time scales on both types of datasets, and was less affected by time factors. The XGBoost model was the best individual model for predicting the PM2.5 concentration on meteorological data, and the LightGBM model was the best individual model on the historical PM2.5 data with meteorological datasets. The results demonstrate that the stacking integration strategy combined the advantages of the RF, XGBoost, and LightGBM models to effectively improve the robustness and accuracy of PM2.5 concentration prediction, and can be used in warning systems for short-term air pollution.
As the time scale increased, the prediction performance of all individual models and the stacking model gradually decreased, especially on daily time scales. For the stacking model, this is because stacked models exhibit higher data dependency on long time series, and their complexity and computational resource requirements are also higher in such tasks; training and optimizing them therefore becomes challenging, making them more suitable for short-term prediction tasks. Future work will therefore focus on developing a high-precision prediction model for medium- and long-term PM2.5 concentrations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su151411408/s1, Table S1: Meteorological stations; Table S2: Air quality stations; Table S3: /; Table S4: Parameter range and results of Random Forest model; Table S5: Parameter range and results of XGBoost model; Table S6: Parameter range and results of LightGBM model; Figure S1: Simulation results of LUR model based on multiple linear regression; Figure S2: Kriging interpolation results; Figure S3: Elevation Map of Jiangxi Province; Figure S4: Daily distribution of errors; Figure S5: Hourly distribution of errors; Figure S6: Daily distribution of errors; Figure S7: Hourly distribution of errors.

Author Contributions

Conceptualization, J.K. and J.L.; methodology, J.T.; software, J.T. and X.Z.; data curation, visualization and writing—original draft preparation, X.Z.; writing—review and editing, J.K., X.Z., J.L. and H.K.; project administration, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 42261071).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available at https://doi.org/10.6084/m9.figshare.23716080.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationship that could be construed as a potential conflict of interest.

References

1. Ye, W.; Ma, Z.; Ha, X. Spatial-Temporal Patterns of PM2.5 Concentrations for 338 Chinese Cities. Sci. Total Environ. 2018, 631–632, 524–533.
2. Cao, X.; Kostka, G.; Xu, X. Environmental Political Business Cycles: The Case of PM2.5 Air Pollution in Chinese Prefectures. Environ. Sci. Policy 2019, 93, 92–100.
3. Fontes, T.; Li, P.; Barros, N.; Zhao, P. Trends of PM2.5 Concentrations in China: A Long Term Approach. J. Environ. Manag. 2017, 196, 719–732.
4. Zanobetti, A.; Franklin, M.; Koutrakis, P. Fine Particulate Air Pollution and Its Components in Association with Cause-Specific Emergency Admissions. Environ. Health-Glob. 2009, 19, S315–S316.
5. Xing, Y.; Xu, Y.; Shi, M.; Lian, Y. The Impact of PM2.5 on the Human Respiratory System. J. Thorac. Dis. 2016, 8, 69–74.
6. Whittaker, A.; BeruBe, K.; Jones, T.; Maynard, R.; Richards, R. Killer Smog of London, 50 Years On: Particle Properties and Oxidative Capacity. Sci. Total Environ. 2004, 334–335, 435–445.
7. Kim, Y.; Manley, J.; Radoias, V. Medium- and Long-Term Consequences of Pollution on Labor Supply: Evidence from Indonesia. IZA J. Labor Econ. 2017, 6, 1010.
8. Mimura, T.; Ichinose, T.; Yamagami, S.; Fujishima, H.; Kamei, Y.; Goto, M.; Matsubara, M. Airborne Particulate Matter (PM2.5) and the Prevalence of Allergic Conjunctivitis in Japan. Sci. Total Environ. 2014, 487, 493–499.
9. Nguyen, G.; Shimadera, H.; Uranishi, K.; Matsuo, T.; Kondo, A. Numerical Assessment of PM2.5 and O3 Air Quality in Continental Southeast Asia: Impacts of Potential Future Climate Change. Atmos. Environ. 2019, 215, 116901.
10. Requia, W.J.; Jhun, I.; Coull, B.A.; Koutrakis, P. Climate Impact on Ambient PM2.5 Elemental Concentration in the United States: A Trend Analysis over the Last 30 Years. Environ. Int. 2019, 131, 104888.
11. Bu, Q.; Hong, Y.; Tan, H.; Liu, L.; Wang, C.; Zhu, J.; Chan, P.; Chen, C. The Modulation of Meteorological Parameters on Surface PM2.5 and O3 Concentrations in Guangzhou, China. Aerosol Air Qual. Res. 2020, 20, 200084.
12. Hou, P.; Wu, S.L. Long-Term Changes in Extreme Air Pollution Meteorology and the Implications for Air Quality. Sci. Rep. 2016, 6, 23792.
13. Ji, M.Y.; Jiang, Y.Y.; Han, X.P.; Liu, L.; Xu, X.L.; Qiao, Z.; Sun, W. Spatiotemporal Relationships between Air Quality and Multiple Meteorological Parameters in 221 Chinese Cities. Complexity 2020, 2020, 6829142.
14. Wang, J.W.; Xu, H. A Novel Hybrid Spatiotemporal Land Use Regression Model System at the Megacity Scale. Atmos. Environ. 2021, 244, 117971.
15. Huang, L.; Sun, J.; Jin, L.; Brown, N.J.; Hu, J. Strategies to Reduce PM2.5 and O3 Together During Late Summer and Early Fall in San Joaquin Valley, California. Atmos. Res. 2021, 258, 105633.
16. Dennis, R.L.; Byun, D.W.; Novak, J.H.; Galluppi, K.J.; Coats, C.J.; Vouk, M.A. The Next Generation of Integrated Air Quality Modeling: EPA's Models-3. Atmos. Environ. 1996, 30, 1925–1938.
17. Wang, Q.; Zeng, Q.; Tao, J.; Sun, L.; Zhang, L.; Gu, T.; Chen, L. Estimating PM2.5 Concentrations Based on MODIS AOD and NAQPMS Data over Beijing-Tianjin-Hebei. Sensors 2019, 19, 1207.
18. Zhang, Z.; Wu, L.; Chen, Y. Forecasting PM2.5 and PM10 Concentrations Using GMCN(1,N) Model with the Similar Meteorological Condition: Case of Shijiazhuang in China. Ecol. Indic. 2020, 119, 106871.
19. Pai, T.; Ho, C.; Chen, S.; Lo, H.; Sung, P.; Lin, S.; Kao, J. Using Seven Types of GM (1, 1) Model to Forecast Hourly Particulate Matter Concentration in Banciao City of Taiwan. Water Air Soil Pollut. 2011, 217, 25–33.
20. Ziomas, I.C.; Melas, D.; Zerefos, C.S. Forecasting Peak Pollutant Levels from Meteorological Variables. Atmos. Environ. 1995, 29, 3703–3711.
21. Zhu, S.; Yang, L.; Wang, W. Optimal-Combined Model for Air Quality Index Forecasting: 5 Cities in North China. Environ. Pollut. 2018, 243, 842–850.
22. Murillo-Escobar, J.; Sepulveda-Suescun, J.P.; Correa, M.A. Forecasting Concentrations of Air Pollutants Using Support Vector Regression Improved with Particle Swarm Optimization: Case Study in Aburrá Valley, Colombia. Urban Clim. 2019, 29, 100473.
23. Sun, W.; Sun, J. Daily PM2.5 Concentration Prediction Based on Principal Component Analysis and LSSVM Optimized by Cuckoo Search Algorithm. J. Environ. Manag. 2017, 188, 144–152.
24. Chen, G.; Li, S.; Knibbs, L.D.; Hamm, N.A.; Cao, W.; Li, T.; Guo, Y. A Machine Learning Method to Estimate PM2.5 Concentrations Across China with Remote Sensing, Meteorological and Land Use Information. Sci. Total Environ. 2018, 636, 52–60.
25. Huang, K.; Xiao, Q.; Meng, X.; Geng, G.; Wang, Y.; Lyapustin, A.; Liu, Y. Predicting Monthly High Resolution PM2.5 Concentrations with Random Forest Model in the North China Plain. Environ. Pollut. 2018, 242, 675–683.
26. Li, X.; Zhang, X. Predicting Ground-Level PM2.5 Concentrations in the Beijing-Tianjin-Hebei Region: A Hybrid Remote Sensing and Machine Learning Approach. Environ. Pollut. 2019, 249, 735–749.
27. Sekula, P.; Ustrnul, Z.; Bokwa, A.; Bochenek, B.; Zimnoch, M. Random Forests Assessment of the Role of Atmospheric Circulation in PM10 in an Urban Area with Complex Topography. Sustainability 2022, 14, 3388.
28. Zhai, W.X.; Cheng, C.Q. A Long Short-Term Memory Approach to Predicting Air Quality Based on Social Media Data. Atmos. Environ. 2020, 237, 117411.
29. Ma, J.; Ding, Y.; Cheng, J.C.P.; Jiang, F.; Wan, Z. A Temporal-Spatial Interpolation and Extrapolation Method Based on Geographic Long Short-Term Memory Neural Network for PM2.5. J. Clean. Prod. 2019, 237, 117729.
30. Wen, C.; Liu, S.; Yao, X.; Peng, L.; Li, X.; Hu, Y.; Chi, T. A Novel Spatiotemporal Convolutional Long Short-Term Neural Network for Air Pollution Prediction. Sci. Total Environ. 2019, 654, 1091–1099.
31. Kristiani, E.; Lin, H.; Lin, J.; Chuang, Y.; Huang, C.; Yang, C. Short-Term Prediction of PM2.5 Using LSTM Deep Learning Methods. Sustainability 2022, 14, 2068.
32. Wang, X.; Lin, X.; Dang, X. Supervised Learning in Spiking Neural Networks: A Review of Algorithms and Evaluations. Neural Netw. 2020, 125, 258–280.
33. Wu, Y.C.; Qi, S.F.; Hu, F.; Ma, S.B.; Mao, W.; Li, W. Recognizing Activities of the Elderly Using Wearable Sensors: A Comparison of Ensemble Algorithms Based on Boosting. Sensor Rev. 2019, 39, 743–751.
34. Bai, Y.; Wu, L.; Qin, K.; Zhang, Y.; Shen, Y.; Zhou, Y. A Geographically and Temporally Weighted Regression Model for Ground-Level PM2.5 Estimation from Satellite-Derived 500 m Resolution AOD. Remote Sens. 2016, 8, 262.
35. Dietterich, T.G. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Mach. Learn. 2000, 40, 139–157.
36. Liu, D.; Sun, K. Short-Term PM2.5 Forecasting Based on CEEMD-RF in Five Cities of China. Environ. Sci. Pollut. Res. 2019, 32, 32790–32803.
37. Liu, H.; Tian, H.; Li, Y.; Zhang, L. Comparison of Four AdaBoost Algorithm Based Artificial Neural Networks in Wind Speed Predictions. Energy Convers. Manag. 2015, 92, 67–81.
38. Chang, Y.-S.; Chiao, H.-T.; Abimannan, S.; Huang, Y.-P.; Tsai, Y.-T.; Lin, K.-M. An LSTM-Based Aggregated Model for Air Pollution Forecasting. Atmos. Pollut. Res. 2020, 11, 1451–1463.
39. Bai, Y.; Li, Y.; Zeng, B.; Li, C.; Zhang, J. Hourly PM2.5 Concentration Forecast Using Stacked Autoencoder Model with Emphasis on Seasonality. J. Clean. Prod. 2019, 224, 739–750.
40. Bai, Y.; Zeng, B.; Li, C.; Zhang, J. An Ensemble Long Short-Term Memory Neural Network for Hourly PM2.5 Concentration Forecasting. Chemosphere 2019, 222, 286–294.
41. Liu, H.; Jin, K.; Duan, Z. Air PM2.5 Concentration Multi-Step Forecasting Using a New Hybrid Modeling Method: Comparing Cases for Four Cities in China. Atmos. Pollut. Res. 2019, 10, 1588–1600.
42. Liu, H.; Dong, S. A Novel Hybrid Ensemble Model for Hourly PM2.5 Forecasting Using Multiple Neural Networks: A Case Study in China. Air Qual. Atmos. Health 2020, 13, 1411–1420.
43. Dai, H.; Huang, G.; Zeng, H.; Yang, F. PM2.5 Concentration Prediction Based on Spatiotemporal Feature Selection Using XGBoost-MSCNN-GA-LSTM. Sustainability 2021, 13, 12071.
44. Zhai, B.; Chen, J. Development of a Stacked Ensemble Model for Forecasting and Analyzing Daily Average PM2.5 Concentrations in Beijing, China. Sci. Total Environ. 2018, 635, 644–658.
45. Chen, J.; Yin, J.; Zang, L.; Zhang, T.; Zhao, M. Stacking Machine Learning Model for Estimating Hourly PM2.5 in China Based on Himawari-8 Aerosol Optical Depth Data. Sci. Total Environ. 2019, 697, 134021.
46. Jahangir, H.; Golkar, M.A.; Alhameli, F.; Mazouz, A.; Ahmadian, A.; Elkamel, A. Short-Term Wind Speed Forecasting Framework Based on Stacked Denoising Auto-Encoders with Rough ANN. Sustain. Energy Technol. 2020, 38, 100601.
47. Agarwal, S.; Chowdary, C.R. A-Stacking and A-Bagging: Adaptive Versions of Ensemble Learning Algorithms for Spoof Fingerprint Detection. Expert Syst. Appl. 2020, 146, 113160.
48. Moon, J.; Jung, S.; Rew, J.; Rho, S.; Hwang, E. Combination of Short-Term Load Forecasting Models Based on a Stacking Ensemble Approach. Energy Build. 2020, 216, 109921.
49. Zhang, Y.; Bocquet, M.; Mallet, V.; Seigneur, C.; Baklanov, A. Real-Time Air Quality Forecasting, Part I: History, Techniques, and Current Status. Atmos. Environ. 2012, 60, 632–655.
50. Gang, L.; Jing, F.; Dong, J.; Wang, J.H. Spatial Variation of the Relationship between PM2.5 Concentrations and Meteorological Parameters in China. BioMed Res. Int. 2015, 2015, 684618.
51. Wang, Z.; Tan, Y.; Guo, M.; Cheng, M.M.; Gu, Y.Y.; Chen, S.Y.; Wu, X.F.; Chai, F.H. Prospect of China's Ambient Air Quality Standards. J. Environ. Sci. 2023, 123, 255–269.
52. Xu, G.Y.; Ren, X.D.; Xiong, K.N.; Li, L.Q.; Bi, X.C.; Wu, Q.L. Analysis of the Driving Factors of PM2.5 Concentration in the Air: A Case Study of the Yangtze River Delta, China. Ecol. Indic. 2020, 110, 5889.
53. Tan, H.; Chen, Y.; Wilson, J.P.; Zhang, J.; Cao, J.; Chu, T. An Eigenvector Spatial Filtering Based Spatially Varying Coefficient Model for PM2.5 Concentration Estimation: A Case Study in Yangtze River Delta Region of China. Atmos. Environ. 2020, 223, 117205.
54. Roy, M.; Brokamp, C.; Balachandran, S. Clustering and Regression-Based Analysis of PM2.5 Sensitivity to Meteorology in Cincinnati, Ohio. Atmosphere 2022, 13, 545.
55. Tandon, A.; Yadav, S.; Attri, A.K. Non-Linear Analysis of Short Term Variations in Ambient Visibility. Atmos. Pollut. Res. 2013, 4, 199–207.
56. Kang, J.F.; Huang, L.X.; Zhang, C.Y.; Zeng, Z.L.; Yao, S.J. Hourly PM2.5 Prediction and Comparative Analysis under Multiple Machine Learning Models. China Environ. Sci. 2020, 40, 1895–1905. (In Chinese)
57. Peng, H.J.; Zhou, Y.; Hu, J.F.; Zhang, L.; Peng, Y.Z.; Cai, X.Y. PM2.5 Concentration Prediction Model Based on Deep Learning and Random Forest. J. Remote Sens. 2023, 27, 430–440.
58. Wenchao, B.; Liangduo, S. PM2.5 Prediction Based on the CEEMDAN Algorithm and a Machine Learning Hybrid Model. Sustainability 2022, 14, 16128.
59. van der Gaag, M.; Hoffman, T.; Remijsen, M.; Hijman, R. The Five-Factor Model of the Positive and Negative Syndrome Scale II: A Ten-Fold Cross-Validation of a Revised Model. Schizophr. Res. 2006, 85, 280–287.
60. Liu, D.; Sun, K. Random Forest Solar Power Forecast Based on Classification Optimization. Energy 2019, 187, 115940.
61. Su, X.; Peña, A.T.; Liu, L.; Levine, R.A. Random Forests of Interaction Trees for Estimating Individualized Treatment Effects in Randomized Trials. Stat. Med. 2018, 37, 2547–2560.
62. Zhao, J.; Yuan, L.; Sun, K.; Huang, H.; Guan, P.; Jia, C. Forecasting Fine Particulate Matter Concentrations by In-Depth Learning Model According to Random Forest and Bilateral Long- and Short-Term Memory Neural Networks. Sustainability 2022, 14, 9430.
63. Lim, S.; Chi, S. XGBoost Application on Bridge Management Systems for Proactive Damage Estimation. Adv. Eng. Inform. 2019, 41, 100922.
64. Rad, A.K.; Shamshiri, R.R.; Naghipour, A.; Razmi, S.; Shariati, M.; Golkar, F.; Balasundram, S.K. Machine Learning for Determining Interactions between Air Pollutants and Environmental Parameters in Three Cities of Iran. Sustainability 2022, 14, 8027.
65. Chen, C.; Zhang, Q.; Ma, Q.; Yu, B. LightGBM-PPI: Predicting Protein-Protein Interactions through LightGBM with Multi-Information Fusion. Chemom. Intell. Lab. 2019, 191, 54–64.
66. Sun, X.; Liu, M.; Sima, Z. A Novel Cryptocurrency Price Trend Forecasting Model Based on LightGBM. Financ. Res. Lett. 2020, 32, 101084.
67. Harishkumar, K.S.; Yogesh, K.M.; Gad, I. Forecasting Air Pollution Particulate Matter (PM2.5) Using Machine Learning Regression Models. Procedia Comput. Sci. 2020, 171, 2057–2066.
68. Ma, J.; Cheng, J.C.P.; Xu, Z.; Chen, K.; Lin, C.; Jiang, F. Identification of the Most Influential Areas for Air Pollution Control Using XGBoost and Grid Importance Rank. J. Clean. Prod. 2020, 274, 122835.
69. Xiong, Z.; Cui, Y.; Liu, Z.; Zhao, Y.; Hu, M.; Hu, J. Evaluating Explorative Prediction Power of Machine Learning Algorithms for Materials Discovery Using K-Fold Forward Cross-Validation. Comput. Mater. Sci. 2020, 171, 109203.
70. Wan, M.; Hu, W.; Qu, M.; Li, W.; Zhang, C.; Kang, J.; Huang, B. Rapid Estimation of Soil Cation Exchange Capacity through Sensor Data Fusion of Portable XRF Spectrometry and Vis-NIR Spectroscopy. Geoderma 2020, 363, 114163.
71. Hu, S.; Liu, P.F.; Qiao, Y.X.; Wang, Q.; Zhang, Y.; Yang, Y. PM2.5 Concentration Prediction Based on WD-SA-LSTM-BP Model: A Case Study of Nanjing City. Environ. Sci. Pollut. Res. 2022, 29, 70323–70339.
72. Chu, W.; Zhang, C.; Zhao, Y.; Li, R.; Wu, P. Spatiotemporally Continuous Reconstruction of Retrieved PM2.5 Data Using an AutoGeoi-Stacking Model in the Beijing-Tianjin-Hebei Region, China. Remote Sens. 2022, 14, 4432.
73. Xiao, L.; Dong, Y.X.; Dong, Y. An Improved Combination Approach Based on AdaBoost Algorithm for Wind Speed Time Series Forecasting. Energy Convers. Manag. 2018, 160, 273–288.
74. Liu, D.; Li, L. Application Study of Comprehensive Forecasting Model Based on Entropy Weighting Method on Trend of PM2.5 Concentration in Guangzhou, China. Int. J. Environ. Res. Public Health 2015, 12, 7085–7099.
75. Zhou, F.; Zhang, Q.; Sornette, D.; Jiang, L. Cascading Logistic Regression onto Gradient Boosted Decision Trees for Forecasting and Trading Stock Indices. Appl. Soft Comput. 2019, 84, 105747.
76. Shcherbakov, M.; Kamaev, V.; Shcherbakova, N. Automated Electric Energy Consumption Forecasting System Based on Decision Tree Approach. IFAC Proc. 2013, 46, 1027–1032.
77. Xu, N.; Zhang, F.; Xuan, X. Impacts of Industrial Restructuring and Technological Progress on PM2.5 Pollution: Evidence from Prefecture-Level Cities in China. Int. J. Environ. Res. Public Health 2021, 18, 5283.
78. Wang, P.; Zhang, H.; Qin, Z.; Zhang, G. A Novel Hybrid-GARCH Model Based on ARIMA and SVM for PM2.5 Concentrations Forecasting. Atmos. Pollut. Res. 2017, 8, 850–860.
79. Gounaridis, D.; Koukoulas, S. Urban Land Cover Thematic Disaggregation, Employing Datasets from Multiple Sources and Random Forests Modeling. Int. J. Appl. Earth Obs. Geoinf. 2016, 51, 1–10.
80. Gao, X.; Luo, H.; Wang, Q.; Zhao, F.; Ye, L.; Zhang, Y. A Human Activity Recognition Algorithm Based on Stacking Denoising Autoencoder and LightGBM. Sensors 2019, 19, 947.
81. Ju, Y.; Sun, G.; Chen, Q.; Zhang, M.; Zhu, H.; Rehman, M.U. A Model Combining Convolutional Neural Network and LightGBM Algorithm for Ultra-Short-Term Wind Power Forecasting. IEEE Access 2019, 7, 28309–28318.
82. Huang, L.; Kang, J.; Wan, M.; Fang, L.; Zhang, C.; Zeng, Z. Solar Radiation Prediction Using Different Machine Learning Algorithms and Implications for Extreme Climate Events. Front. Earth Sci. 2021, 9, 596860.
83. Divina, F.; Gilson, A.; Goméz-Vela, F.; García Torres, M.; Torres, J.F. Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting. Energies 2018, 11, 949.
Figure 1. Distribution map of meteorological stations and air quality monitoring stations in the study area.
Figure 3. The processes in a single model’s construction.
Figure 5. Scatter plot of predicted and measured results in the future 1–6 h.
Figure 6. Scatter plots of predicted and measured results for the future 1–6 days.
Figure 7. Comparison of the stability of different models for 1–6 h prediction based on meteorological datasets.
Figure 8. Comparison of the stability of different models for 1–6-day prediction based on meteorological datasets.
Figure 9. Scatter plots of predicted and measured results in the future 1–6 h.
Figure 10. Scatter plots of predicted and measured results in the future 1–6 days.
Figure 11. Comparison of the stability of different models for 1–6 h prediction based on historical PM2.5 concentration with meteorological datasets.
Figure 12. Comparison of the stability of different models for 1–6-day prediction based on historical PM2.5 concentration with meteorological datasets.
Table 1. Variable information.
Variable | Unit | Description
LON | ° | Longitude
LAT | ° | Latitude
DOY | – | Day of the year
Hour | h | Hour
PRS | hPa | Average pressure
TEM | °C | Average temperature
RHU | % | Relative humidity
GST | °C | Average surface temperature
WIN_S | m/s | Average wind speed
PM2.5 | μg/m3 | PM2.5 concentration
Table 2. Datasets and division.
Data | Train Set | Test Set | All Datasets
Meteorological factors estimating PM2.5 concentration dataset (hourly) | 377,232 | 41,915 | 419,147
Meteorological factors estimating PM2.5 concentration dataset (daily) | 16,200 | 1,801 | 18,001
Future 1 h PM2.5 concentration prediction | 372,746 | 41,417 | 414,163
Future 2 h PM2.5 concentration prediction | 371,790 | 41,311 | 413,101
Future 3 h PM2.5 concentration prediction | 370,890 | 41,211 | 412,101
Future 4 h PM2.5 concentration prediction | 370,137 | 41,127 | 411,264
Future 5 h PM2.5 concentration prediction | 369,666 | 41,075 | 410,741
Future 6 h PM2.5 concentration prediction | 369,398 | 41,045 | 410,443
Future 1-day PM2.5 concentration prediction | 16,004 | 1,779 | 17,783
Future 2-day PM2.5 concentration prediction | 15,912 | 1,768 | 17,680
Future 3-day PM2.5 concentration prediction | 15,847 | 1,761 | 17,608
Future 4-day PM2.5 concentration prediction | 15,780 | 1,754 | 17,534
Future 5-day PM2.5 concentration prediction | 15,718 | 1,747 | 17,465
Future 6-day PM2.5 concentration prediction | 15,651 | 1,740 | 17,391
Table 3. Use of meteorological factors to estimate current PM2.5 concentration.
Model | Hourly R2 | Hourly RMSE | Hourly MAE | Daily R2 | Daily RMSE | Daily MAE
XGBoost | 0.87 | 10.49 | 6.74 | 0.78 | 12.11 | 8.75
LightGBM | 0.87 | 10.74 | 6.73 | 0.76 | 12.66 | 8.95
RF | 0.88 | 10.39 | 6.03 | 0.73 | 13.49 | 9.55
Stacking | 0.88 | 10.18 | 5.97 | 0.78 | 12.25 | 8.75
Average | 0.87 | 10.45 | 6.37 | 0.76 | 12.63 | 9.00
Table 4. The prediction performance of the stacking model based on two types of datasets.
Horizon | Meteorological Dataset R2 | RMSE | MAE | Combined Dataset R2 | RMSE | MAE
Hourly +1 | 0.89 | 9.49 | 6.10 | 0.97 | 4.99 | 2.91
Hourly +2 | 0.89 | 9.58 | 6.07 | 0.95 | 6.37 | 3.97
Hourly +3 | 0.89 | 9.52 | 6.15 | 0.94 | 7.19 | 4.56
Hourly +4 | 0.88 | 9.87 | 6.14 | 0.93 | 7.77 | 4.86
Hourly +5 | 0.89 | 9.79 | 6.17 | 0.92 | 8.18 | 5.09
Hourly +6 | 0.89 | 9.32 | 6.12 | 0.92 | 7.94 | 5.14
Hourly average | 0.89 | 9.60 | 6.13 | 0.94 | 7.07 | 4.42
Daily +1 | 0.77 | 12.99 | 9.09 | 0.82 | 11.48 | 7.92
Daily +2 | 0.69 | 13.77 | 9.60 | 0.72 | 13.13 | 8.95
Daily +3 | 0.68 | 14.10 | 10.01 | 0.73 | 12.95 | 9.26
Daily +4 | 0.69 | 13.05 | 9.63 | 0.71 | 12.61 | 9.16
Daily +5 | 0.71 | 13.58 | 9.65 | 0.73 | 13.19 | 9.14
Daily +6 | 0.70 | 12.59 | 9.31 | 0.72 | 12.25 | 8.87
Daily average | 0.71 | 13.35 | 9.55 | 0.74 | 12.60 | 8.88

Share and Cite

MDPI and ACS Style

Kang, J.; Zou, X.; Tan, J.; Li, J.; Karimian, H. Short-Term PM2.5 Concentration Changes Prediction: A Comparison of Meteorological and Historical Data. Sustainability 2023, 15, 11408. https://doi.org/10.3390/su151411408


