1. Introduction
Against the backdrop of continued global industrialization and urbanization, air pollution has grown increasingly serious and has become one of the key factors threatening human health and the ecological environment. Atmospheric pollutants are diverse and include sulfur dioxide (SO2), nitrogen oxides (NOx), and particulate matter (PM). Among them, PM2.5, a major air pollutant, can penetrate deep into the human respiratory system because of its small particle size (≤2.5 μm), causing serious harm such as respiratory and cardiovascular diseases and an increased risk of death [1,2,3]. In high-density urban areas, traffic pollutant emissions have a significant impact on air quality and residents’ health [4]. Many studies have shown that PM2.5 is closely related to a variety of health problems [5,6]. With respect to the respiratory system, some studies have reported that it can damage the human respiratory system [7,8]. Additionally, epidemiological studies associate PM2.5 with allergic rhinitis [9]. Furthermore, environmental exposure to PM2.5 elevates the risk of influenza-like illness through airborne transmission mechanisms [10,11]. Heavy metals in PM2.5 also affect lung health [12,13,14]. Other studies have summarized the impact of ambient PM2.5 on human health in China, including updated summaries of the adverse health effects of PM2.5 exposure [15,16]. Accurate prediction of PM2.5 concentration is therefore vital for formulating pollution prevention and control strategies in advance, protecting public health, and maintaining ecological balance [17].
Meteorological variables and satellite datasets can play an important role in PM2.5 inversion, especially in model building and training. Many studies have shown that meteorological conditions significantly affect changes in atmospheric pollutant concentrations, not only in the inversion of PM2.5 but also in that of other pollutants (such as NO2, O3, and nitrogen oxides). For example, Liu et al. evaluated the impact of meteorological and emission changes on O3 concentrations in different regions of China using machine learning methods [18]. In addition, Mak et al. retrieved the tropospheric NO2 column concentration over southern China by improving the air mass factor (AMF) [19]. Rodriguez-Sanchez et al. studied the influence of meteorological conditions on the effectiveness of traffic measures in controlling NOx concentration [20]. These studies provide an important reference for the inversion of atmospheric pollutant concentrations using meteorological variables and satellite data. Regarding the use of meteorological attributes, many studies have pointed out that meteorological factors have a significant impact on changes in atmospheric pollutant concentrations [21]. Regarding the use of satellite data and its combination with ground monitoring, relevant explorations have also emerged. Some studies have developed methods that combine satellite remote sensing and low-cost sensor networks to estimate long-term PM2.5 concentrations in specific areas [22,23]. For China, three-dimensional variational data fusion methods have been proposed to improve the modeling of spatiotemporal changes in fine particulate matter (PM2.5). By fusing multi-source data, including satellite remote sensing data and ground monitoring data, the spatiotemporal distribution of PM2.5 can be portrayed more accurately, and the simulation and prediction of PM2.5 concentrations are improved [24]. These studies provide important references for understanding pollutant characteristics and distribution patterns from different dimensions, and also provide ideas and methods for our research on PM2.5.
Early studies on PM2.5 concentration primarily focused on the development of monitoring technologies and traditional statistical models. Initially, researchers relied on basic air sampling techniques and gravimetric methods to measure PM2.5 levels, which were time-consuming and less precise compared to modern methods [25,26,27]. With the advancement of technology, automated monitoring systems using beta attenuation and light scattering techniques were introduced, significantly improving the accuracy and efficiency of PM2.5 monitoring [28]. In addition to ARIMA and SARIMA models, the multivariate linear regression (MLR) model was also widely used to predict PM2.5 concentration by establishing a linear relationship between PM2.5 concentration and multiple influencing factors such as meteorological factors and pollution source emissions [29]. However, due to the complexity of the air pollution process, these traditional models often had limitations in complex environments and struggled to effectively capture the nonlinear and multi-factor influencing characteristics of PM2.5 concentration changes [30].
With the rapid development of science and technology, Machine Learning (ML) and Deep Learning (DL) have emerged as powerful tools for predicting PM2.5 concentration. These techniques offer significant advantages over traditional statistical methods, especially in handling nonlinear relationships and large datasets [31]. For instance, Support Vector Machines (SVMs) have been widely used due to their ability to handle high-dimensional data and provide accurate predictions [32,33]. Random Forests have also proven effective in capturing the complex interactions between various factors influencing PM2.5 levels [34,35]. More recently, deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been employed to leverage spatial and temporal patterns in PM2.5 data [36]. For example, the Long Short-Term Memory (LSTM) network has shown superior performance in capturing long-term dependencies in time series data, making it particularly suitable for PM2.5 concentration prediction [37]. Hybrid models combining LSTM with other techniques, such as wavelet transform and decomposition methods, have further enhanced prediction accuracy [38,39].
Research on PM2.5 concentration has been conducted across various geographical regions, including urban, rural, and industrial areas, to better understand its spatial and temporal variations. In urban areas, studies have shown that local meteorological conditions, traffic emissions, and industrial activities significantly influence PM2.5 levels [40,41]. For example, in the Beijing-Tianjin-Hebei region of China, PM2.5 concentrations exhibit pronounced seasonal and diurnal patterns due to the combined effects of meteorology and anthropogenic emissions [42]. In contrast, rural areas are more influenced by agricultural activities and biomass burning, which can lead to significant PM2.5 spikes during specific periods [43,44]. Industrial areas, particularly those with heavy manufacturing and chemical plants, often experience higher PM2.5 concentrations due to direct emissions and secondary pollution formation [45,46]. Recent studies have also highlighted the importance of considering regional differences in PM2.5 prediction models. For instance, a study in the United States found that machine learning models trained on data from one region may not perform well in another due to variations in pollution sources and meteorological conditions [47,48].
Internationally, PM2.5 concentration prediction has become a critical area of research due to its significant impact on public health and environmental quality. In Europe, studies have utilized high-resolution chemical transport models (CTMs) combined with machine learning techniques to improve PM2.5 predictions [49]. These hybrid models leverage the strengths of both approaches, using CTMs to simulate atmospheric chemistry and machine learning algorithms to capture complex spatial and temporal patterns in PM2.5 data. For example, a study in Germany demonstrated that integrating machine learning with CTMs improved prediction accuracy by 20% compared to traditional methods alone [50]. In North America, researchers have focused on developing advanced machine learning models that incorporate satellite data and ground-based measurements to enhance spatial coverage and prediction accuracy [51]. For instance, a study in the United States used a combination of Random Forest and Gradient Boosting algorithms to predict PM2.5 concentrations across different regions, highlighting the importance of meteorological factors and land use in model performance [52]. These models have shown significant potential in capturing regional variations in PM2.5 concentrations, which is crucial for effective air quality management. In Asia, particularly in countries like India and South Korea, research has focused on developing localized prediction models tailored to specific urban and industrial environments. A study in South Korea used a novel spectral clustering algorithm combined with machine learning techniques to analyze PM2.5 data and identify key pollution sources [53]. In India, researchers have developed ensemble machine learning models to predict PM2.5 concentrations in highly polluted urban areas, such as Bengaluru and Delhi [54]. These studies emphasize the need for region-specific approaches due to variations in pollution sources and meteorological conditions.
In recent years, numerous innovative approaches have been developed to improve PM2.5 concentration prediction. For example, the integration of Convolutional LSTM (ConvLSTM) and Graph Convolutional Network (GCN) architectures has been proposed to leverage both spatial and temporal features in multi-source data [55]. Another study introduced a hybrid model combining LSTM with a Deep Auto-Encoder (DAE) to enhance the model’s ability to capture complex patterns in PM2.5 data [56]. Additionally, the application of ensemble learning techniques, such as stacking, has shown promise in improving prediction accuracy by combining the strengths of multiple machine learning models [57]. These advancements highlight the ongoing efforts to develop more accurate and robust PM2.5 prediction models, which are crucial for effective air quality management and public health protection. The integration of high-resolution data from satellites, ground-based sensors, and meteorological models, combined with advanced machine learning techniques, is expected to further enhance the accuracy and reliability of PM2.5 predictions in diverse geographical regions.
After achieving remarkable results in fields such as natural language processing, the Transformer model has also been adopted in environmental science, especially in PM2.5 concentration prediction, where it has been studied in depth. For example, Thundiyil et al. pointed out that the architecture of the Transformer model enables it to handle complex dependencies in time series data effectively, which provides new ideas for PM2.5 concentration prediction [58]. Wang et al. used a large-scale air quality monitoring dataset to train and verify a Transformer-based prediction model. The results showed that the model performed well in capturing the long-term trend and seasonal changes of PM2.5 concentration [59]. In addition, Mohammed et al. developed a PM2.5 time-series prediction model named ResInformer, based on the Informer architecture and the residual transformer, to predict the PM2.5 concentration in three major cities in China (Beijing, Shijiazhuang, and Wuhan) [60].
At the same time, the fusion of Deep Learning models has also been widely explored. Li et al. proposed a deep learning-based method, AC-LSTM, which combines a one-dimensional convolutional neural network (CNN), a long short-term memory (LSTM) network, and an attention-based network for urban PM2.5 concentration prediction [61]. Kim et al. developed a hybrid attention transformer (HAT) to predict daily PM2.5 concentrations in Seoul. The performance of HAT was evaluated by comparing its predictions with ground observations and with a 3-D chemical transport model (3-D CTM). The experimental results show that this fusion model can effectively reduce the model training time while improving the prediction accuracy [62].
A comprehensive analysis of the above literature shows that machine learning models such as neural networks and decision trees have made great progress in dealing with nonlinear problems, but their parameter selection and training processes are often complex and require substantial computing resources and data. For example, the LSTM model may suffer from vanishing or exploding gradients when processing long sequences, which requires special techniques to alleviate. Although ensemble learning can improve prediction performance, how to select appropriate base models and combination methods still needs further research. Deep Learning models such as the Transformer have shown strong feature learning capabilities, but when applied to PM2.5 concentration prediction they may face problems such as overfitting and low computational efficiency. Moreover, current research pays little attention to model interpretability, which makes it difficult to understand the prediction results in a physical sense. In addition, most studies have limitations in data selection and processing. Some consider only a few influencing factors and do not fully explore the multi-source drivers of PM2.5 concentration changes. The quality and spatiotemporal resolution of the data also affect prediction accuracy, but this aspect has not been studied in sufficient depth.
In view of the above situation, our research aims to address the shortcomings of existing research. Taking Nanning urban area in China as an example, we conducted a prediction study on the PM2.5 concentration in the area based on Machine Learning methods. We selected the SARIMA, Prophet, and LightGBM models. Among them, SARIMA, as a classic time series model, is good at capturing the seasonality and trend of data [63]. The Prophet model is designed for processing data with seasonal and holiday effects and has strong flexibility and interpretability [64]. LightGBM is an efficient gradient boosting framework with significant advantages in processing large-scale data and complex nonlinear relationships [65]. We took the nationwide annual and monthly PM2.5 raster data for 2012 to 2023 from the ChinaHighAirPollutants (CHAP) dataset, cropped them with the Nanning urban area mask, used the 2012–2022 data as the training set and the 2023 data as the test set, and performed model training and prediction according to the standard process. By predicting and saving the 2023 results grid cell by grid cell, we provided data support for subsequent analysis, compared the performance of different models, explored the prediction method suitable for the region, and provided a scientific basis for air quality management.
It is worth mentioning that our research has several notable innovations that distinguish it from existing studies. Firstly, unlike previous studies that were mostly limited to a single model type, we systematically compared different types of models, including traditional statistical models (SARIMA) and advanced machine learning models (Prophet and LightGBM). This multi-model comparison approach allows for a comprehensive evaluation of the strengths and weaknesses of each model in the context of PM2.5 concentration prediction, providing a more robust basis for model selection [66]. Secondly, we employed a variety of visualization methods to deeply analyze the prediction results from multiple dimensions, such as spatial and temporal variations. This approach not only enhances the interpretability of the results but also provides valuable insights into the underlying patterns of PM2.5 concentration changes [67]. Thirdly, considering the characteristics of PM2.5 data in Nanning urban area, our study adopted a data-driven strategy, focusing solely on time series data to mine potential laws. This strategy maximizes the use of high-resolution data, thereby improving the reliability and effectiveness of the prediction [68]. Lastly, our research on Nanning urban area has important demonstration value for other regions facing similar situations. By providing a detailed case study, our methods and findings can help promote the widespread application and innovative development of Machine Learning models in the field of environmental science [69]. These innovations not only enhance the scientific rigor of our study but also contribute to practical applications. For instance, the multi-model comparison approach can help policymakers and environmental managers make more informed decisions by selecting the most appropriate model based on specific needs and conditions. The data-driven strategy and high-resolution data utilization can improve the accuracy of PM2.5 concentration predictions, thereby supporting more effective air quality management and public health protection measures [70].
3. Research Method
3.1. Research Process
Our research process is shown in Figure 2.
3.2. Model Introduction
In our study, we used three different prediction models: the Seasonal Autoregressive Integrated Moving Average (SARIMA) model, the Prophet model, and the LightGBM model. Each has different advantages, and all are applicable to the core goal of PM2.5 concentration prediction in Nanning urban area.
3.2.1. SARIMA Model
The SARIMA (Seasonal Autoregressive Integrated Moving Average) model is a mature statistical time series analysis tool that is particularly suitable for data with seasonal fluctuations [77]. The model combines the three components of autoregression, differencing, and moving average, and captures the variation pattern of PM2.5 concentration by modeling the seasonal characteristics of the time series.
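In our study, the non-seasonal parameter is set to non_seasonal_order = (1, 1, 1), corresponding to (p, d, q), and the seasonal parameter is set to seasonal_order = (1, 1, 1) with S = 12, corresponding to (P, D, Q) and the seasonal period S. In its general multiplicative form, the model is written as:
$$\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t$$
where $y_t$ is the PM2.5 concentration at time $t$, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi_p$ and $\theta_q$ are the non-seasonal autoregressive and moving average polynomials of orders p and q, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts of orders P and Q, and $\varepsilon_t$ is a white-noise error term. The meaning of each parameter is as follows: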
p is the number of autoregressive terms. In our study, p = 1, which means that the current PM2.5 concentration is affected by the concentration at the previous moment, which can help the model capture the short-term dependence trend of the data.
d is the non-seasonal difference order, and d = 1 means that the original data are processed by first-order difference, the purpose of which is to make the data stationary so that the model can better identify the patterns in the data.
q is the number of moving average terms. q = 1 indicates that the current PM2.5 concentration is related to the previous error term, and the previous error term can be used to correct the current prediction.
P is the number of seasonal autoregressive terms, P = 1 and the seasonal period S = 12, indicating that the current concentration is affected by the concentration in the previous period (i.e., 12 months ago), which helps capture the annual seasonal dependence of the data.
D is the seasonal difference order, and D = 1 means that the data are seasonally differenced once to remove seasonal trends and make the data more stationary, which makes it easier for the model to analyze other potential patterns.
Q is the number of seasonal moving average terms, and Q = 1 means that the previous error term within the seasonal cycle is used to adjust the current forecast to improve the accuracy of the seasonal forecast.
S is the length of the seasonal cycle. It is set to S = 12, which is consistent with the expected annual periodicity of the monthly PM2.5 data.
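The effect of d = 1, D = 1, and S = 12 can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names `difference` and `undo_difference` and the synthetic series are assumptions made only for illustration. It shows how a seasonal difference at lag 12 removes an annual cycle, how a further first-order difference removes a linear trend, and how differenced values can be turned back into concentrations once the last observed values are known.

```python
import numpy as np

def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the output is `lag` values shorter."""
    series = np.asarray(series, dtype=float)
    return series[lag:] - series[:-lag]

def undo_difference(diffed, history, lag=1):
    """Invert `difference` given the values observed before the differenced span."""
    restored = list(np.asarray(history, dtype=float)[-lag:])
    for step, delta in enumerate(diffed):
        # Each restored value is the value `lag` steps earlier plus the difference.
        restored.append(restored[step] + delta)
    return np.array(restored[lag:])

# Synthetic monthly series (illustrative only): linear trend plus annual cycle.
t = np.arange(48)
series = 30.0 + 0.1 * t + 8.0 * np.cos(2.0 * np.pi * t / 12.0)

seasonal_diff = difference(series, lag=12)     # D = 1 with S = 12
both_diff = difference(seasonal_diff, lag=1)   # followed by d = 1

# Round trip: rebuild the last 12 months from their first-order differences.
rebuilt = undo_difference(difference(series, lag=1)[-12:], series[:-12], lag=1)
```

On this synthetic series the seasonal difference is constant (the annual cycle cancels and only the trend increment remains), and the additional first-order difference is zero, which is the sense in which differencing makes a series stationary.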
3.2.2. Prophet Model
The Prophet model is a time series forecasting tool developed by Facebook, specifically designed to handle data containing seasonal and holiday effects [78]. Its additive model is used in our study to predict the PM2.5 concentration in Nanning urban area. The additive model and the meaning of each variable are as follows:
$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$
where $\varepsilon_t$ is the error term that is not explained by the other components.
y(t) is the predicted value of PM2.5 concentration at time t.
g(t) is a trend function that reflects the long-term trend of PM2.5 concentration over time. This function can be a linear or logistic growth model, which adapts to the long-term trend characteristics of the data by adjusting relevant parameters.
s(t) is a seasonal function that can capture the seasonal fluctuations of PM2.5 concentration on a monthly, quarterly or annual scale. The model will automatically identify and fit these seasonal patterns.
h(t) is the holiday effect function, which is used to reflect the abnormal impact of special events such as holidays on PM2.5 concentration. Users can customize specific holidays or events according to actual conditions to improve the prediction accuracy of the model.
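Prophet is fitted on a two-column table with a date column ds and a target column y. The following sketch builds such a table with pandas for a monthly series starting on 1 January 2012. The helper name `to_prophet_frame` and the placeholder values are assumptions for illustration, not the study's data or code.

```python
import numpy as np
import pandas as pd

def to_prophet_frame(values, start="2012-01-01"):
    """Pair a monthly series with month-start dates in columns `ds` and `y`."""
    dates = pd.date_range(start=start, periods=len(values), freq="MS")
    return pd.DataFrame({"ds": dates, "y": np.asarray(values, dtype=float)})

# Placeholder values standing in for one pixel's 132 monthly concentrations.
monthly_values = np.linspace(40.0, 25.0, 132)
frame = to_prophet_frame(monthly_values)
```

With the prophet package installed, a frame of this shape would typically be passed to `Prophet().fit(frame)`, and `make_future_dataframe(periods=12, freq="MS")` would extend the dates by the 12 months of 2023 before calling `predict`.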
3.2.3. LightGBM Model
LightGBM (Light Gradient Boosting Machine) is an efficient gradient-based Machine Learning framework developed by Microsoft, which is particularly suitable for regression and classification tasks of large-scale datasets [79]. In our study, it is used to mine the deep features of PM2.5 concentration changes and capture complex nonlinear patterns.
Core calculation formulas:
$$Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \Omega(f)$$
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i)$$
$Obj$ is the objective function value, which measures the difference between the model prediction and the actual value and constrains the complexity of the model to balance its fitting ability and generalization ability.
$N$ is the number of PM2.5 concentration data samples used to train the LightGBM model.
$l(y_i, \hat{y}_i)$ is the loss function for the actual value $y_i$ and the predicted value $\hat{y}_i$ of the $i$-th sample. Common loss functions include mean square error, mean absolute error, etc. In our study, we select a suitable loss function to evaluate the model performance according to the task requirements.
$\Omega(f)$ is a regularization term, which is used to control the complexity of the model, prevent overfitting, and make the model perform well on both training data and new data.
$\hat{y}_i^{(t)}$ is the predicted value of the $i$-th sample in the $t$-th iteration. As the number of iterations increases, the predicted value of the model is continuously updated and optimized.
$\hat{y}_i^{(t-1)}$ is the predicted value of the $i$-th sample in the $(t-1)$-th iteration, that is, the result of the previous iteration. The new predicted value is updated based on the result of the previous iteration.
$\eta$ is the learning rate, which controls the contribution of the newly generated weak learner to the final prediction at each iteration. A suitable learning rate can make the model converge stably during training and improve the accuracy of the prediction.
$f_t(x_i)$ is the predicted value of the $t$-th weak learner for the $i$-th sample $x_i$. LightGBM constructs multiple weak learners through continuous iteration and accumulates their predictions to form a strong learner.
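The iterative update (previous prediction plus learning rate times the new weak learner) can be made concrete with a small NumPy sketch. This is not LightGBM: as a stand-in for its decision trees, each weak learner here is simply the mean residual per calendar month, and the loss is assumed to be squared error. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_month_booster(months, targets, n_rounds=50, learning_rate=0.05):
    """Boost per-month residual means under a squared-error loss."""
    months = np.asarray(months)
    targets = np.asarray(targets, dtype=float)
    base = float(targets.mean())                  # prediction at iteration 0
    prediction = np.full(targets.shape, base)
    tables = []
    for _ in range(n_rounds):
        residual = targets - prediction           # negative gradient of 0.5 * squared error
        # Weak learner: mean residual of each calendar month (stand-in for a tree).
        table = {int(m): float(residual[months == m].mean()) for m in np.unique(months)}
        tables.append(table)
        update = np.array([table[int(m)] for m in months])
        # New prediction = previous prediction + learning rate * weak learner output.
        prediction = prediction + learning_rate * update
    return {"base": base, "learning_rate": learning_rate, "tables": tables}

def predict_month_booster(model, months):
    """Add the shrunken weak-learner outputs to the base value (months must be seen in training)."""
    output = np.full(len(months), model["base"], dtype=float)
    for table in model["tables"]:
        output += model["learning_rate"] * np.array([table[int(m)] for m in months])
    return output

# Synthetic demonstration data (not the study's data): 11 years with an annual cycle.
train_months = np.tile(np.arange(1, 13), 11)
train_values = 30.0 + 8.0 * np.cos(2.0 * np.pi * (train_months - 1) / 12.0)

model = fit_month_booster(train_months, train_values)
fitted = predict_month_booster(model, train_months)
forecast_next_year = predict_month_booster(model, np.arange(1, 13))
```

Because every round adds only a fraction (the learning rate) of the remaining residual, the fit approaches the monthly pattern gradually rather than in one step, which is the stabilizing role of the learning rate described above.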
3.3. Data Preprocessing
In our study, the original PM2.5 data are China’s high-resolution PM2.5 concentration data. To focus on the Nanning urban area and keep the computation within the capacity of our equipment, the PM2.5 data for all years are cropped using the Nanning urban area mask (Figure 3). The cropping is performed with the rasterio and geopandas libraries to ensure accurate and efficient data processing.
After cropping, the data needs to be further processed and divided into training sets and test sets. The specific steps are as follows:
Data reading and stacking: By generating a list of year and month strings from January 2012 to December 2023, traversing and reading the corresponding cropped TIFF files, the data from 2012 to 2022 are stacked into a training set, and the data from 2023 are stacked into a test set.
Missing value and outlier processing: Check the training set data. If there are missing values, fill them with the mean of the corresponding column. For outliers, determine the upper and lower boundaries by calculating the interquartile range, and trim the values that exceed the boundaries to within the boundaries to improve data quality.
SARIMA model: The training set data are differenced to ensure stationarity, and the last month of training data is recorded so that the differencing can be inverted in subsequent forecasts.
Prophet model: With the help of pandas’ date_range function, with 1 January 2012 as the starting date and month as the frequency, a date sequence is generated according to the length of the monthly time series data, and combined with the PM2.5 concentration into a format containing a date column ds and a target value column y.
LightGBM model: After extracting the time series from the training set data by pixel, create a DataFrame containing date and month features, and use the month as the input feature of the model.
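The splitting and cleaning steps above can be sketched as follows. Several details are assumptions, since the text does not specify them: the year-month string format ("YYYYMM"), a two-dimensional array layout with one row per month and one column per pixel, and the conventional 1.5 multiplier on the interquartile range. The function names are hypothetical.

```python
import numpy as np

def month_keys(first_year, last_year):
    """Year-month strings such as '201201', one per calendar month (format assumed)."""
    return [f"{year}{month:02d}"
            for year in range(first_year, last_year + 1)
            for month in range(1, 13)]

def fill_missing_with_column_mean(data):
    """Replace NaN entries with the mean of the non-missing values in the same column."""
    data = np.array(data, dtype=float)            # copy, so the input is left untouched
    column_means = np.nanmean(data, axis=0)       # a column that is entirely NaN stays NaN
    rows, cols = np.where(np.isnan(data))
    data[rows, cols] = column_means[cols]
    return data

def clip_outliers_iqr(data, k=1.5):
    """Clip each column to [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is an assumed default."""
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    return np.clip(data, q1 - k * iqr, q3 + k * iqr)

keys = month_keys(2012, 2023)
train_keys = [key for key in keys if key < "202301"]    # 2012-2022
test_keys = [key for key in keys if key >= "202301"]    # 2023

# Tiny illustrative array: 5 months x 2 pixels, one gap and one extreme value.
sample = np.array([[10.0, 20.0],
                   [12.0, np.nan],
                   [11.0, 22.0],
                   [13.0, 21.0],
                   [90.0, 23.0]])
filled = fill_missing_with_column_mean(sample)
clipped = clip_outliers_iqr(filled)
```

In this example the gap is filled with the column mean, and the extreme value of 90 is pulled back to the upper boundary of its column while the remaining values are unchanged.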
We selected the cropped and processed PM2.5 data from 2012 to 2022 as the training set and the processed data from 2023 as the test set, mainly based on the following considerations:
Data characteristics and trends: The 11 years of data form a coherent time series that can fully reflect the long-term trend and seasonal characteristics of PM2.5 concentration. Affected by factors such as interannual climate, urban development, and environmental protection policies, the concentration change trend can be observed to provide information for model learning. At the same time, the data covers multiple complete seasonal cycles, which helps the model capture the seasonal variation patterns under Nanning’s subtropical monsoon climate and enhance the SARIMA and Prophet models’ ability to predict PM2.5 concentration changes.
Model training and validation: Sufficient data is the key to model training. 11 years of data allows the model to fully learn the relationship between PM2.5 concentration and various potential influencing factors. For example, LightGBM can mine complex nonlinear relationships. Using data from 2012 to 2022 for training and data from 2023 for validation can effectively test the generalization ability of the model and avoid overfitting.
Research purpose and application: In environmental forecasting, the accuracy of near-term forecasts is of great significance to environmental management and decision-making. Selecting 2023 for verification can evaluate the current forecasting performance of the model and understand its adaptability to environmental changes. By comparing the 2023 forecast value with the true value, the advantages and disadvantages of the model can be summarized, providing a reference for future forecasts and environmental management decisions, and helping to formulate scientific and effective environmental protection policies.
3.4. Model Training and Prediction
Figure 4 shows the model training and prediction workflow that we employed in our research.
SARIMA model training and prediction: For the SARIMA model, during training, according to the non-seasonal parameter non_seasonal_order = (1, 1, 1) and seasonal parameter seasonal_order = (1, 1, 1), S = 12, the corresponding algorithm is used to estimate the parameters of the training set data from 2012 to 2022. The model parameters are optimized through multiple iterations to achieve a better fitting effect. After the training is completed, the trained model is used to predict the 2023 test set data grid by grid, and the prediction results are saved.
Prophet model training and prediction: Initialize the Prophet model and use the previously prepared formatted training set data from 2012 to 2022 for training. After the training is completed, according to the set 12-month forecast period, use model.make_future_dataframe to generate 2023 future date data with a monthly frequency, and input it into the trained model to obtain the 2023 PM2.5 concentration forecast results.
LightGBM model training and prediction: For the LightGBM model, set the target to regression task, use root mean square error as the evaluation indicator, use gradient boosted decision tree (GBDT) as the boosting type, the number of leaf nodes is 31, the learning rate is 0.05, the feature sampling ratio is 0.9, the minimum number of leaf node samples is 1, the maximum depth of the tree is not limited, and detailed output information is turned off. Using these parameter settings, the training set data from 2012 to 2022 is trained, and the model continuously learns the characteristics of PM2.5 concentration data during the training process. After the training is completed, the month is used as the input feature to perform grid-by-grid prediction on the 2023 test set data, and the prediction results are saved.
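For reference, the LightGBM settings listed above can be written as a parameter dictionary. The mapping from the prose descriptions to LightGBM parameter names is our reading of the text rather than the authors' configuration file; in particular, mapping "minimum number of leaf node samples" to `min_data_in_leaf` is an assumption. The number of boosting rounds is not stated in the text and is therefore left out.

```python
# Hyperparameters described in the text, expressed with LightGBM parameter names.
lightgbm_params = {
    "objective": "regression",   # regression task
    "metric": "rmse",            # root mean square error as the evaluation metric
    "boosting_type": "gbdt",     # gradient boosted decision trees
    "num_leaves": 31,            # number of leaf nodes per tree
    "learning_rate": 0.05,
    "feature_fraction": 0.9,     # feature sampling ratio
    "min_data_in_leaf": 1,       # minimum samples per leaf (assumed mapping)
    "max_depth": -1,             # -1 means the tree depth is not limited
    "verbose": -1,               # suppress detailed output
}
```

A dictionary of this form would typically be passed to `lightgbm.train` together with a `lightgbm.Dataset` built from the month feature and the PM2.5 values of each pixel.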
3.5. Model Metrics
3.5.1. Mean Squared Error, MSE
The mean squared error is the average of the squared errors between the predicted and true values, and is calculated as:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $n$ is the number of samples, $y_i$ is the true value of the $i$-th sample, and $\hat{y}_i$ is the predicted value of the $i$-th sample.
MSE measures the average squared error between the predicted value and the true value. Since the error is squared, larger errors are magnified in MSE, so MSE is sensitive to outliers. In PM2.5 concentration prediction research, MSE reflects the overall prediction error of the model. A small value means that the predictions are close to the true values and the prediction accuracy is high. Conversely, a large value means that the model has a large prediction error and may need to be adjusted or improved.
3.5.2. Root–Mean–Squared Error, RMSE
The Root–Mean–Squared Error is the square root of the mean squared error and is calculated as:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
RMSE has the same dimensions as the original data, which makes it easier to interpret and understand in practical applications. In the PM2.5 concentration prediction study, RMSE can intuitively represent the average size of the model prediction error. For example, if the unit of PM2.5 concentration is μg/m3, then the unit of RMSE is also μg/m3, which can be directly compared with the actual value of PM2.5 concentration. In addition, RMSE is also sensitive to larger errors, which can highlight the shortcomings of the model in dealing with large errors and help researchers find the weak links of the model.
3.5.3. Mean Absolute Error, MAE
The mean absolute error is the average of the absolute errors between the predicted value and the true value, and is calculated as:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
MAE directly calculates the absolute error between the predicted value and the true value, avoiding the amplification effect caused by the square of the error, so it is more robust to outliers. In the PM2.5 concentration prediction study, due to the complexity of environmental factors, some abnormal PM2.5 concentration values may appear (such as high concentration values caused by sudden pollution events). MAE can more robustly reflect the average prediction error of the model and will not produce large fluctuations due to individual outliers. Compared with MSE and RMSE, MAE focuses more on measuring the average prediction error of the model on most samples.
3.5.4. Coefficient of Determination, R2
The coefficient of determination R2 is used to evaluate the goodness of fit of the model to the data and is calculated as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of the true values. The closer $R^2$ is to 1, the better the model fits the data, that is, the larger the share of the variation in the true values that the model can explain. In the PM2.5 concentration prediction study, $R^2$ can help researchers determine whether the model effectively captures the variation pattern of PM2.5 concentration. For example, $R^2$ = 0.8 means that the model can explain 80% of the PM2.5 concentration changes, and the remaining 20% may be caused by factors not considered by the model (such as unincluded meteorological variables, sudden human activities, etc.). Therefore, $R^2$ can be used as an important reference indicator for evaluating model performance and selecting the optimal model.
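The four metrics can be computed together in a few lines of NumPy. This is a minimal sketch; the function name `regression_metrics` and the small example values are illustrative assumptions, not results from the study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R2 for paired true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(errors)))
    ss_res = float(np.sum(errors ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot    # undefined when all true values are identical
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

# Illustrative values only.
observed = [30.0, 25.0, 20.0, 25.0]
predicted = [28.0, 26.0, 21.0, 25.0]
scores = regression_metrics(observed, predicted)
```

For these values the errors are 2, -1, -1, and 0, giving MSE = 1.5, MAE = 1.0, and R2 = 0.88. The single larger error accounts for most of the MSE, which illustrates why MSE and RMSE react more strongly to large errors than MAE.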
4. Experimental Results and Analysis
4.1. Comparison of Original Data and Model Prediction Results
Our study uses Figure 5 to analyze the spatial distribution of PM2.5 concentration in Nanning urban area in 2023 and the prediction performance of the LightGBM, Prophet and SARIMA models. Figure 6 displays the prediction results for selected months (January, April, July, and September), so that the spatial prediction results are presented at time resolutions from year to month.
From the annual distribution of PM2.5 in 2023 shown in Figure 5, it can be inferred that the city center has dense industrial activities, heavy traffic flow, and large exhaust emissions. In parts of Jiangnan District and Xixiangtang District, PM2.5 concentrations are high due to the combined effects of industrial activities and traffic emissions. In parts of Qingxiu District and Yongning District, by contrast, PM2.5 concentrations are low: the green coverage rate is high, so vegetation has a significant adsorption and filtration effect on particulate matter, and there are relatively few industrial activities and pollution sources in these areas.
From the monthly distribution in Figure 6, the concentration distribution in different months is significantly affected by factors such as meteorological conditions. In January, high-concentration areas are concentrated in urban centers and industrial areas. This is because winter heating activities increase energy consumption, which in turn increases pollutant emissions, while the atmospheric diffusion conditions in winter are poor. In April, the overall concentration decreases and the high-concentration area shrinks, mainly because the strong winds in spring favor the diffusion of pollutants. In July, the concentration decreases further, and the high-concentration areas are mainly confined to busy traffic sections and local industrial areas, because summer precipitation has a scouring effect on particulate matter. In September, the concentration distribution is similar to that in July, but the concentration clearly rebounds in some areas. This may be because the meteorological conditions change after the summer, resulting in worse pollutant diffusion conditions.
From the annual prediction results, although the SARIMA model can present a general concentration distribution trend, its spatial distribution prediction accuracy is insufficient in some streets in Xixiangtang District and some industrial parks in Jiangnan District. This shows that the model does not capture the spatial variation characteristics of these areas, such as the local airflow changes caused by building shading that affect pollutant diffusion. The Prophet model is not accurate enough in predicting spatial location distribution. In areas with large concentration gradient changes such as the border between Xingning District and Qingxiu District, the predicted values in some areas are low, which shows that the model has limitations in dealing with concentration gradient changes. Relatively speaking, the LightGBM model has a better ability to capture spatial distribution, but within the high-concentration industrial area of Jiangnan District, its prediction of details such as emission differences between different factories is not accurate enough, resulting in relatively high predicted values.
In the comparison of monthly prediction results, the LightGBM model predicts the extent of high-concentration areas well in months such as April and September, but its predicted concentration values still deviate considerably. Taking September as an example, in low-concentration areas such as the surroundings of some parks in Qingxiu District, the model did not take into account factors such as vegetation adsorption of particulate matter and low surrounding traffic flow, resulting in overestimation of values in some low-concentration areas. The prediction results of the Prophet model are relatively fragmented. Although the values in some rural areas of Yongning District are relatively close to the true values, in border areas such as the border with Jiangnan District, where changes in topography and human activities produce complex concentration gradients, the prediction deviates greatly from the actual distribution. The SARIMA model has certain advantages in time series trend prediction, but its capture of actual spatial features shows deviations in different urban areas in different months, such as the commercial area of Xingning District and the newly developed area of Liangqing District.
In general, each model has its own advantages and disadvantages in capturing and predicting the spatial distribution of PM2.5 concentration at different time resolutions. In practical applications, it is necessary to comprehensively consider the changes in meteorological conditions in different time periods in the study area, such as the low temperature and calm wind period in winter, the concentrated precipitation period in summer, etc., as well as regional characteristics, such as the functional zoning of urban areas (commercial areas, industrial areas, residential areas, etc.), topography (mountainous areas, plains, water areas, etc.) and other factors. For areas with complex and changeable meteorological conditions and complex topography, the LightGBM model that can handle complex nonlinear relationships can be given priority. For time series with obvious seasonal and trend characteristics, the SARIMA model may be more advantageous. At the same time, the prediction accuracy of the spatial distribution of PM2.5 concentration can be further improved by adjusting model parameters, such as optimizing the tree depth and learning rate of the LightGBM model, or fusing the prediction results of multiple models.
4.2. Time Series Trend Analysis
Figure 7 presents the monthly average PM2.5 concentration data of Nanning urban area from 2012 to 2023 in the form of a line chart. These data allow us to trace the long-term trend of PM2.5 concentration during this period and, in combination with known influencing factors such as holidays and epidemics, to analyze the specific effects of these factors on PM2.5 concentration.
From the overall change, during the period of 2012–2014, the PM2.5 concentration was high at the beginning of each year, such as 75.4 μg/m3 in January 2012, about 96 μg/m3 in January 2013, and 120.4 μg/m3 in January 2014. Although it declined in the following months, it still fluctuated at a high level. At that time, the industrialization process in Nanning urban area was advancing rapidly, industrial activities were frequent, energy consumption was high, and environmental protection measures were relatively weak. A large amount of industrial waste gas, dust and other pollutants were discharged into the atmosphere. At the same time, a large number of fireworks and firecrackers were set off during the Spring Festival, releasing a large amount of smoke, sulfur dioxide and other pollutants in a short period of time, causing the PM2.5 concentration in the air to rise sharply.
Since 2015, the PM2.5 concentration in some months has shown a clear downward trend, especially in the summer months (June to August) from 2015 to 2017, when the concentration was generally low and stable. For example, the PM2.5 concentration in July 2015 dropped to 25.5 μg/m3, and the concentration in the same period of 2016 and 2017 also remained at a low level. This is due to strict environmental governance measures, including strengthening the control of industrial pollution sources, requiring enterprises to install advanced waste gas treatment equipment, promoting clean energy, replacing high-pollution energy with solar energy, wind energy, etc., upgrading motor vehicle exhaust emission standards, and eliminating old vehicles, which effectively controlled pollutant emissions. However, in certain months of some years, such as National Day and other holidays, PM2.5 concentrations will rise briefly due to factors such as increased mobility and traffic flow, and increased fume emissions from restaurants.
The PM2.5 concentration changes from 2020 to 2022 were atypical because of the impact of the COVID-19 pandemic. In the early stages of the pandemic, strict prevention and control measures such as factory shutdowns and vehicle restrictions significantly reduced industrial production and transportation activities, and pollutant emissions dropped sharply. In February 2020, PM2.5 concentrations in many cities, including Nanning urban area, dropped to a level that was low relative to the same period in previous years, about 32 μg/m3. As epidemic prevention and control became routine, production and daily life gradually resumed but remained restricted. Enterprises paid more attention to environmental protection and public transportation operations were optimized, so PM2.5 concentrations during this period were relatively stable, the overall level was not high, and the seasonal fluctuation range narrowed. In addition, after 2019, the small concentration peak that might otherwise have occurred in October was significantly reduced or postponed. This may be because the epidemic affected holiday travel and gathering activities, reducing pollutant emissions, or because of the public’s improved environmental awareness and the positive impact of green travel, energy conservation, and emission reduction.
The trend of PM2.5 concentration in 2012–2022 described above affects the prediction of each model for 2023 in several ways. According to Figure 8 (monthly average PM2.5 concentration line chart of each model in 2023), each model draws on the trend characteristics of the historical data (the main features used in this experiment) and the seasonal fluctuations caused by various factors when predicting the concentration in 2023.
The LightGBM model, with its nonlinear fitting ability, captures the complex changes in the historical trends from 2012 to 2022 and applies them to the PM2.5 concentration forecast for 2023. The line chart shows that the model reproduces the overall shape of the annual cycle: a high concentration at the beginning of the year, a subsequent decline, and a rebound at the end of the year. This is due to its effective learning of seasonal changes, holiday impacts, and special fluctuations during the epidemic in the historical data. However, comparing the overall forecast with the actual values, the predicted values are generally higher than the actual values, a significant “overestimation”. For example, the actual concentration in January is about 41 μg/m3 while the predicted value is about 58 μg/m3, and the actual concentration in December is about 33 μg/m3 while the predicted value is about 52 μg/m3. This shows that although the model can learn historical influencing factors, it does not take into account the changes in new pollution sources, differences in meteorological conditions, and different holiday activities in 2023, and further optimization is needed to improve the accuracy of the forecast.
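The direction of such a systematic deviation can be quantified with a signed mean error. The sketch below uses only the two approximate value pairs quoted above for January and December; the function name `mean_bias` is introduced here for illustration and is not a metric reported in this study.

```python
def mean_bias(y_true, y_pred):
    """Signed mean error: positive means overestimation, negative means underestimation."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

# Approximate values quoted in the text (ug/m3): January and December 2023.
actual = [41.0, 33.0]
lightgbm_pred = [58.0, 52.0]

bias = mean_bias(actual, lightgbm_pred)
print(f"mean bias over the two quoted months: {bias:+.1f} ug/m3")
```

Unlike MAE, the signed error keeps the direction of the deviation, so a consistently positive value is a simple indicator of the overestimation described here.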
The Prophet model has certain methods for dealing with seasonality, trend and holiday effects of time series. However, when predicting the concentration in 2023 based on the trend from 2012 to 2022, the performance is not satisfactory. In the stage of concentration decline, the predicted value of the model deviates greatly from the original data, showing an obvious “underestimation” trend. For example, during the period from April to July, the actual concentration dropped from about 32 μg/m3 to about 15 μg/m3, while the predicted value of the Prophet model was always lower than the actual value, and the gap gradually increased. This may be because the Prophet model simply models the holiday effect and seasonal changes when processing historical data, ignoring the significant differences in pollution sources and meteorological conditions between 2023 and previous years. For example, in 2023, some factories upgraded their emission reduction equipment and reduced pollutant emissions, but the model did not take this key change into account. When using the previous average diffusion coefficient for prediction, the prediction of the concentration decline stage was seriously underestimated.
As a classic time series model, the SARIMA model is fitted to the data from 2012 to 2022 to predict the concentration in 2023. In some months it reflects the trend of concentration changes, but it deviates from the actual values in the overall seasonal fluctuation range and at some key time points; in September and November, for example, there is a certain difference between the actual concentration and the predicted value. Nevertheless, compared with the other two models, its deviation is relatively small. The deviations may arise because the model has high requirements for the stationarity of the data, whereas the concentration changes in 2023 are affected by the combined effect of multiple factors, such as the resumption of normal holiday activities and the gradual weakening of the impact of the epidemic, and therefore do not fully conform to the stable pattern presented by the historical data. Some sudden factors or new change patterns are not fully captured by the model.
The PM2.5 concentration changes from 2012 to 2022 provide an important reference for the prediction in 2023. However, when this time series information is used for prediction, each model has certain advantages and disadvantages due to the limitations of the model itself and the complexity of the actual situation in the prediction year. In subsequent research and application, it is necessary to further consider multiple factors, optimize the models, or combine the results of multiple models to improve the accuracy of the prediction results.
4.3. Analysis of Model Performance Metrics
Figure 9 presents, in the form of a heat map, the calculation results of the performance indicators of the LightGBM, Prophet and SARIMA models for each month in 2023: Mean Squared Error (MSE), Root-Mean-Squared Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R2). To present the error magnitudes and goodness-of-fit of the three models more clearly and intuitively, each set of data was normalized independently within itself and sorted by size, and then marked with colors from dark to light to better display the comparative relationships within each group of indicators (the values shown in the figure are still the original values of the indicators).
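The per-group normalization used for coloring can be sketched as follows. The text does not specify the normalization formula, so min-max scaling is an assumption made here, and the monthly values and the function name `min_max_normalize` are hypothetical.

```python
def min_max_normalize(values):
    """Scale one group of values to [0, 1] independently of all other groups."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant group: no contrast to show
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical RMSE values (ug/m3) of one model for four months (illustrative only).
months = ["Jan", "Apr", "Jul", "Oct"]
rmse_values = [17.0, 9.0, 5.0, 13.0]

# The normalized value drives the cell color; the annotation keeps the original value.
color_scale = min_max_normalize(rmse_values)
darkest_month = months[color_scale.index(max(color_scale))]

for month, raw, shade in zip(months, rmse_values, color_scale):
    print(f"{month}: annotation={raw:5.1f}  color intensity={shade:.2f}")
```

Because each group is scaled on its own, colors are comparable only within a group: a dark cell marks the largest value of that indicator for that model, not necessarily a large value in absolute terms.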
The heat map of the LightGBM model is darker in January, October and December, indicating that the evaluation indicator values in these months are relatively high after normalization, that is, the model prediction deviates greatly from the actual value. In May, by contrast, the color is relatively light and the values of the indicators are low, which means that the errors are relatively small and the prediction effect is good. This difference is closely related to the seasonal characteristics of the data in the study area. Although the LightGBM model has a strong nonlinear fitting ability and can handle complex data relationships, in the winter months of January and December the atmospheric diffusion conditions are poor and the PM2.5 concentration is affected by a variety of complex factors, so even this model has difficulty capturing the concentration change pattern accurately. In July, the summer meteorological conditions are relatively stable, and the patterns in the data are easier for the model to learn and predict, so it performs better.
Unlike the LightGBM model, the Prophet model has darker colors in May-July, indicating that the MSE, RMSE and MAE values are large during this period, and the prediction effect is poor, while the color is relatively light in other months, and the prediction error is relatively stable and small. The Prophet model is based on the seasonality and trend of time series, and has a certain ability to handle conventional seasonal patterns. However, during the period from May to July, the changes in PM2.5 concentration in the study area were more complex. There may be some special meteorological conditions or human activities, which led to deviations between the actual situation and the seasonal patterns in the historical data. The model is too dependent on the historical pattern, which increases the prediction error.
For the SARIMA model, the darker colors in January, May, September and November mean that the MSE, RMSE and MAE values are higher and the prediction performance is poor. For the R2 indicator, the closer R2 is to 1, the better the model fits the data. Among the three models, the negative R2 values of the SARIMA model are smaller in magnitude than those of the LightGBM and Prophet models, that is, closer to 1, indicating that it fits the data better than the other two models. A negative R2 nevertheless means that the squared error of the prediction exceeds that of simply predicting the mean of the true values, so this advantage is only relative. Combined with its higher error indicators in some months, it can be seen that the model still lacks prediction accuracy. This is because the SARIMA model has certain requirements for the stationarity of the data, and the PM2.5 concentration data in 2023 are affected by a variety of complex factors and hardly satisfy the stationarity assumption in some months, so the overall performance still has room for improvement.
Overall, the three models of LightGBM, Prophet and SARIMA have their own advantages and disadvantages in predicting the PM2.5 concentration in 2023. The LightGBM model has strong fitting ability and can capture complex trends, but it is greatly affected by special environmental factors. The Prophet model has certain methods in dealing with conventional seasonal time series, but it is not adaptable enough in the face of complex changes. The SARIMA model has a relatively good fitting effect, but the requirement for data stability limits its prediction accuracy in some months. Due to their own problems, it is difficult to fully and accurately grasp the data characteristics, and there will be large errors and poor fitting effects in different months, as the models cannot perfectly match the complex changes in the data. In practical applications, it is necessary to comprehensively consider the selection of appropriate models or combine the advantages of multiple models for prediction based on specific data characteristics and application scenarios.
4.4. Error Analysis with Scatter Plots and Box Plots
Figure 10 shows the predicted values and true values of the three models, LightGBM, Prophet, and SARIMA, in the form of scatter plots. With the help of the identity function (y = x) image, the gap between the predicted value and the true value can be more intuitively displayed, and it is also convenient to observe the performance of different models in different numerical ranges.
The LightGBM model is presented as green scatter points. In months such as May and July, the green scatter points are relatively close to the identity function (y = x) line, which indicates that the model can better capture the distribution trend of PM2.5 concentration in these months. From the distribution of scatter points, the degree of dispersion is low, indicating that the model prediction results are more stable. However, overall, there is a large deviation between the predicted values of the LightGBM model and the true value range. This may be due to the fact that the LightGBM model allocates weights to data features or fits complex relationships during training, which leads to a certain deviation in capturing the overall range of data. Although it can stably reflect trends, there is still room for improvement in accuracy.
In contrast, the blue scatter points of the Prophet model are more loosely distributed, and most of the scatter points deviate downward from the identity function. Taking June and July as examples, the model’s predicted values deviate greatly from the true values, indicating that it is insufficient in grasping the trend of PM2.5 concentration changes in these two months, and the prediction error is large. The Prophet model is mainly based on the seasonality, trend and holiday effects of time series for modeling. If in a specific month, the actual concentration change is affected by some factors that are not fully considered by the model, such as sudden meteorological changes or special human activities, it is easy to cause the model’s prediction deviation to increase.
The red scatter points of the SARIMA model are the most dispersed of the three models. In months such as March and November, the red points lie clearly far from the identity function, showing that the model has a weak ability to capture the trend of PM2.5 concentration changes in these months. However, it is worth noting that the predicted value range of the SARIMA model is close to the true value range, and more of its points lie close to the identity function. This is because the SARIMA model, as a classic time series model, can reflect the overall characteristics of the data to a certain extent when processing time series with a certain degree of stationarity. However, because of its high requirements for data stationarity, its trend capture becomes inaccurate when the data are affected by sudden or abnormal factors.
Figure 11 presents the data as box plots, which show the distribution characteristics of the prediction errors and make statistical analysis of the errors easier. When the prediction-error distributions of each model are analyzed month by month, comparing these box plots with those of the original data highlights more intuitively and clearly the advantages each model has in predicting PM2.5 concentrations.
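The statistics that a box plot summarizes (quartiles, median, and box length) can be computed as in the following sketch, which compares a predicted sample with an original sample. The values are hypothetical, and the quartile convention (the `inclusive` method of Python's `statistics.quantiles`) is an assumption, since plotting libraries differ slightly in how they interpolate quartiles.

```python
from statistics import quantiles

def box_stats(values):
    """Return the lower quartile, median, upper quartile and box length (IQR)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical PM2.5 values for one month in ug/m3 (illustrative only).
original = [18.0, 22.0, 25.0, 30.0, 35.0]
predicted = [24.0, 27.0, 31.0, 33.0, 36.0]

orig_box = box_stats(original)
pred_box = box_stats(predicted)

median_deviation = abs(pred_box["median"] - orig_box["median"])
print("original :", orig_box)
print("predicted:", pred_box)
print(f"median deviation: {median_deviation:.1f} ug/m3")
```

A shorter box (smaller IQR) corresponds to less dispersed predictions, and the median deviation corresponds to the distance between the center lines of the two boxes.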
The LightGBM model performed well in some months from May to July. From the box plot, the boxes of the predicted data in these months are shorter, indicating that the data has a smaller degree of dispersion, that is, the prediction error is relatively concentrated. At the same time, the median is close to the median of the original data, and the deviation is within 8 μg/m3, which shows that the model can accurately grasp the central trend of the data in these months, and the prediction accuracy and stability are high. This advantage can provide a more reliable reference for the decision-making of relevant departments in practical applications. For example, in terms of air quality control decisions, based on the accurate prediction of the model, pollution prevention and control measures can be arranged more reasonably. However, in some other months, the box plot of the LightGBM model shows that the dispersion of its predicted data is large, indicating that there is a certain degree of fluctuation in the prediction error, and the adaptability of the model can be further improved.
The prediction performance of the Prophet model in most months is poor, and the boxes of the predicted data box plot are generally longer and the degree of dispersion is large. However, in months such as September to November, the lower quartile is close to that of the original data, with a deviation of about 7 μg/m3, indicating that in these months, the model has a certain degree of accuracy in predicting the lower value part of the data. This may be because the change characteristics of PM2.5 concentration in these months are more consistent with some patterns in the historical data learned by the model. However, the large error of the Prophet model in other months limits its application effect in the overall forecast of the whole year, and further optimization is needed to improve stability.
Observing the data of each month throughout the year, in months with obvious seasonal characteristics, such as winter (December–February) or summer (June–August), the fluctuation trend of the box plot of the SARIMA model predicted data is more similar to the original box plot, and the deviation of key statistics (such as median, quartiles, etc.) is within 7 μg/m3. This is due to the fact that the SARIMA model, as a classic time series model, can better capture the seasonal changes in the data. Therefore, the model has certain advantages in understanding and predicting the changes in PM2.5 concentration caused by seasonal factors, and can provide valuable reference for the study and prediction of seasonal air quality changes. However, in some months, there are still some differences between its box plot and the original data box plot, indicating that the model still has certain limitations when dealing with some special situations or complex changes.
4.5. Normality Test and Analysis of Prediction Errors
Figure 12 shows the normality test results of the forecast errors of the LightGBM, Prophet, and SARIMA models in 2023, including error histograms and normal probability plots. In order to better illustrate the normal distribution, we also marked the relevant information of the Kolmogorov-Smirnov (K-S) test in the chart (when performing a normality test, 0.05 is usually used as the limit of the significance level).
From the error histogram analysis, the error distribution of the LightGBM model shows an obvious multi-peak shape, and the intervals between the peaks are relatively scattered, indicating that its error values are relatively complex, with a certain frequency of occurrence in different intervals, and no obvious central trend. The error distribution of the Prophet model fluctuates more disorderly and lacks regularity. The error frequency of each interval fluctuates, and it is difficult to find a unified distribution pattern. Although the SARIMA model shows a certain unimodal trend, the peak is not located in the center of the distribution, and the left and right sides are asymmetric, indicating that its error distribution deviates from the symmetric characteristics of the normal distribution. Overall, the error histograms of the three models do not conform to the typical characteristics of the normal distribution.
In the normal probability plots, the sample quantile curves of the three models deviate significantly from the theoretical quantile straight line. The curve of the LightGBM model deviates greatly from the straight line and fluctuates significantly in different quantile intervals, indicating that its error distribution is quite different from the normal distribution. The curve of the Prophet model is separated from the straight line in most intervals, and only a few points are close to it, showing the inconsistency between its error distribution and the normal distribution. Although the curve of the SARIMA model overlaps with the straight line to a certain extent in some intervals, there is still a significant deviation overall, especially at the quantiles at both ends. At the same time, the K-S test results all rejected the null hypothesis that the errors follow the normal distribution, further confirming that the errors of the three models do not follow the normal distribution.
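The idea behind the K-S check can be sketched in a few lines: the K-S statistic is the largest gap between the empirical distribution of the errors and a normal distribution, and normality is rejected when this gap exceeds a critical value. The implementation used in this study is not described, so everything below is an assumption for illustration: the samples are synthetic, the normal distribution is fitted with the sample mean and standard deviation, and the large-sample critical value 1.36/√n for the 0.05 level is used. That critical value strictly applies to a fully specified distribution and is conservative when the parameters are estimated from the same sample.

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_statistic_normal(errors):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(errors)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def ks_critical_value(n, coefficient=1.36):
    """Large-sample critical value at the 0.05 significance level."""
    return coefficient / math.sqrt(n)

n = 40
# Synthetic near-normal errors: evenly spaced quantiles of a standard normal.
near_normal = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
# Synthetic two-peak errors, loosely mimicking a multi-peak histogram.
bimodal = ([-10.0 + 0.1 * k for k in range(-10, 10)]
           + [10.0 + 0.1 * k for k in range(-10, 10)])

d_normal = ks_statistic_normal(near_normal)
d_bimodal = ks_statistic_normal(bimodal)
critical = ks_critical_value(n)

print(f"near-normal: D={d_normal:.3f}, reject normality: {d_normal > critical}")
print(f"two-peak   : D={d_bimodal:.3f}, reject normality: {d_bimodal > critical}")
```

The two-peak sample is rejected while the near-normal sample is not, which mirrors how a multi-peak error histogram leads to rejection of the normality hypothesis.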
Although the errors of the three models do not follow the normal distribution, this study is still of great value for PM2.5 concentration prediction. From the perspective of model improvement, the multi-peak and dispersed error distribution of the LightGBM model suggests that subsequent optimization can focus on the influencing factors corresponding to different peaks, such as whether they are related to specific time periods and data characteristics, and then adjust the feature selection and parameter settings of the model in a targeted manner to reduce the dispersion of the errors. For the irregular error distribution of the Prophet model, the handling of outliers and sudden changes in the time series could be strengthened, for example by introducing a more flexible anomaly detection mechanism or by improving the modeling of special factors such as holidays, so that the model adapts better to dynamic changes in the data. For the asymmetric single-peak error distribution of the SARIMA model, we can try to adjust the way the series is made stationary, or combine the model with other components that can capture asymmetric characteristics, to improve its ability to fit the data distribution.
In terms of new model development or model fusion strategy formulation, the data and experience accumulated in this study provide rich references. Through comparative analysis of the error distribution of different models, we can clarify the advantages and disadvantages of each model under different data conditions, so that when developing new models, we can integrate the advantages of each model in a targeted manner and abandon its disadvantages. In terms of model fusion strategy, we can reasonably allocate the weights of different models according to the characteristics of the error distribution. For example, we can give a higher weight to the model with a relatively stable error distribution to improve the accuracy and stability of the overall prediction. These will indirectly promote the PM2.5 prediction research to a more accurate and reliable direction, and help us to understand the changing law of PM2.5 concentration more deeply.
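One simple way to turn this weighting idea into practice is to weight each model by the inverse of an error measure, so that models with smaller errors contribute more. This is only one possible fusion scheme and is not the method of this study; the error values, the predictions, and the function names below are hypothetical.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1 / error, normalized to sum to 1."""
    inverse = {name: 1.0 / err for name, err in errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def fuse(predictions, weights):
    """Weighted average of the individual model predictions."""
    return sum(weights[name] * predictions[name] for name in predictions)

# Hypothetical validation RMSE (ug/m3) and predictions for one month (illustrative only).
validation_rmse = {"LightGBM": 20.0, "Prophet": 10.0, "SARIMA": 5.0}
monthly_prediction = {"LightGBM": 49.0, "Prophet": 28.0, "SARIMA": 35.0}

weights = inverse_error_weights(validation_rmse)
fused = fuse(monthly_prediction, weights)

print({name: round(w, 3) for name, w in weights.items()})
print(f"fused prediction: {fused:.1f} ug/m3")
```

In a real application the weights would have to be estimated on data not used for the final evaluation, otherwise the fused result would look better than it is.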
5. Discussion
5.1. Model Selection Rationale
Before we delve into the performance of the models, it is necessary to explain why we chose the three models of SARIMA, Prophet and LightGBM to predict PM2.5 concentration in Nanning urban area. Accurate prediction of PM2.5 concentration is extremely important for urban environmental management and public health protection, and appropriate model selection is the key to achieving accurate prediction.
Our study aims to comprehensively evaluate the performance of different types of models in predicting PM2.5 concentration in Nanning urban area, and provide a scientific basis for subsequent research and practical applications. The PM2.5 concentration in Nanning urban area is affected by many factors such as meteorological conditions, human activities, and topography. The interaction of these factors causes the concentration change to present a complex situation, which not only shows obvious seasonal and trend characteristics, but also has complex nonlinear relationships. In addition, the research data has time series characteristics and covers information at different time resolutions. Based on these research objectives and data characteristics, we selected the following three models:
As a mature statistical time series analysis tool, the SARIMA model has a deep theoretical foundation and extensive application practice in processing data with seasonality and trend. The PM2.5 concentration data in Nanning urban area shows a significant seasonal variation pattern. For example, in summer, precipitation is relatively abundant, which has a flushing effect on pollutants and the concentration is relatively low. In winter, the atmospheric diffusion conditions are relatively poor, and pollutants are easy to accumulate, making the PM2.5 concentration relatively high. In the long run, with the development of the city and the implementation of environmental protection policies, the PM2.5 concentration also shows a specific trend of change.
The SARIMA model can effectively capture the seasonal and trend characteristics in the data through its autoregressive (AR), differencing (I) and moving average (MA) components. The reasonable setting of its non-seasonal parameters (p, d, q) and seasonal parameters (P, D, Q, S) helps to accurately model the time series of PM2.5 concentration in Nanning urban area. In this study, the non-seasonal parameters are set to non_seasonal_order = (1, 1, 1), the seasonal parameters are set to seasonal_order = (1, 1, 1), and S = 12. This setting enables the model to fully consider the seasonal fluctuations of PM2.5 concentration on a monthly scale and the long-term trend of the data. By learning from historical data, the SARIMA model can uncover underlying patterns and provide a reliable basis for predicting future changes in PM2.5 concentration. Therefore, choosing the SARIMA model helps analyze the long-term trend of PM2.5 concentration in Nanning urban area and provides strong support for long-term monitoring and prediction of air quality.
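To make the role of the differencing orders concrete, the following sketch applies one seasonal difference with a 12-month period (D = 1, S = 12) and then one ordinary difference (d = 1) to a synthetic monthly series. It illustrates the differencing step only, not the fitting of the AR and MA terms, and the series is hypothetical: a linear downward trend plus a pattern that repeats every 12 months.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Synthetic monthly series (illustrative only): linear trend + fixed seasonal pattern.
seasonal_pattern = [20, 15, 10, 5, 0, -5, -10, -5, 0, 5, 10, 15]
series = [60.0 - 0.5 * t + seasonal_pattern[t % 12] for t in range(36)]

# D = 1 with S = 12 removes the repeating 12-month pattern.
seasonally_differenced = difference(series, lag=12)
# d = 1 removes what is left of the linear trend.
fully_differenced = difference(seasonally_differenced, lag=1)

print(seasonally_differenced[:3])   # constant: the yearly change of the trend
print(fully_differenced[:3])        # zeros: no trend or seasonality left
```

For this idealized series the two differences leave nothing but zeros; for real data they leave a residual series that the AR and MA terms then model, which is why the approach is sensitive to departures from stationarity.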
The Prophet model was developed by Facebook and is a time series prediction tool specifically for processing data containing seasonal and holiday effects. The PM2.5 concentration in Nanning urban area is not only affected by seasonal changes, but also significantly changes due to changes in human activities during holidays. For example, during the National Day holiday, changes in traffic flow and residents’ activity patterns will lead to an increase in PM2.5 emissions, which in turn affects the concentration level.
The additive structure of the Prophet model, y(t) = g(t) + s(t) + h(t), enables it to model the trend g(t), the seasonality s(t), and the holiday effects h(t) separately. The model can automatically identify and fit seasonal patterns in the data without pre-setting complex seasonal parameters. At the same time, users can customize holidays or special events according to actual conditions, so as to capture the impact of these factors on PM2.5 concentration more accurately. This flexibility and the ability to handle seasonal and special events match the characteristics of PM2.5 concentration changes in the Nanning urban area. Therefore, choosing the Prophet model allows a better analysis of the impact of seasonality and special events on PM2.5 concentration in the Nanning urban area and can improve the accuracy of prediction, especially during special periods such as holidays.
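The additive decomposition can be illustrated with a small hand-written sketch. It does not call the Prophet library and does not fit anything: the three component functions, their parameter values, and the choice of October (month index 9, with January as 0) as the holiday month are all hypothetical assumptions chosen only to show how g(t), s(t) and h(t) sum to a prediction. Prophet itself uses a piecewise-linear or logistic trend and a Fourier series for seasonality, and its published formulation also includes an error term.

```python
import math


def trend(t, base=35.0, slope=-0.1):
    """g(t): a single linear segment (hypothetical coefficients)."""
    return base + slope * t


def seasonality(t, period=12, amplitude=8.0):
    """s(t): one Fourier (cosine) term with a 12-month period."""
    return amplitude * math.cos(2 * math.pi * t / period)


def holiday_effect(t, holiday_months=(9,), bump=3.0):
    """h(t): constant offset in months flagged as holiday periods."""
    return bump if t % 12 in holiday_months else 0.0


def additive_forecast(t):
    """y(t) = g(t) + s(t) + h(t), with t counted in months from January."""
    return trend(t) + seasonality(t) + holiday_effect(t)


for month in (0, 6, 9):
    print(month, round(additive_forecast(month), 2))
```

Because the components are summed, each one can be inspected or adjusted on its own, which is what makes the holiday term easy to customize.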
The LightGBM model is an efficient machine learning framework based on gradient boosting decision trees, which has unique advantages in processing large-scale data and mining complex nonlinear relationships. The change in PM2.5 concentration in Nanning urban area is the result of the combined action of multiple complex factors. The relationship between these factors is intricate and nonlinear, and it is difficult to accurately describe it with traditional linear models. For example, the relationship between temperature, humidity, wind speed, wind direction and other meteorological factors and PM2.5 concentration is not a simple linear relationship. Topographic features can also have complex indirect effects on PM2.5 concentration by affecting the diffusion and transmission of pollutants.
By constructing gradient boosting decision trees, the LightGBM model can automatically learn complex patterns in the data and mine these underlying nonlinear relationships. When processing large-scale data, it exhibits high computational efficiency and low memory consumption, which makes it suitable for modeling long-term, high-resolution PM2.5 concentration data in the Nanning urban area. In addition, the LightGBM model has good scalability and flexibility, and its performance can be optimized by adjusting parameters. Based on these advantages, the selection of the LightGBM model helps to reveal the complex driving mechanism behind the changes in PM2.5 concentration in the Nanning urban area and to improve the accuracy of predictions.
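The idea behind gradient boosting can be sketched in a few lines of plain Python. This is a generic illustration with one-split trees (stumps) and squared-error loss, not LightGBM itself, which uses histogram-based splitting and leaf-wise tree growth. The wind-speed and PM2.5 values are synthetic, and all function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    # The largest value is skipped: splitting there leaves the right side empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared loss) and adds a shrunken copy to the ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for pi, xi in zip(preds, x)]
    return base, stumps, learning_rate


def predict_boosting(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Synthetic nonlinear relationship: concentration falls off with wind speed.
wind_speed = [0.5 * k for k in range(1, 13)]
pm25 = [60.0 / (1.0 + w) for w in wind_speed]

model = fit_boosting(wind_speed, pm25)
baseline_mse = mse(pm25, [sum(pm25) / len(pm25)] * len(pm25))
boosted_mse = mse(pm25, [predict_boosting(model, w) for w in wind_speed])
print(baseline_mse, boosted_mse)
```

The ensemble of simple stumps approximates the curved relationship that a single linear model would miss. The errors shown are training errors on synthetic data and say nothing about out-of-sample accuracy.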
5.2. Comprehensive Discussion of Model Performance
Our study used LightGBM, Prophet and SARIMA models to predict the PM2.5 concentration in Nanning urban area in 2023, and compared and analyzed the performance of each model from many aspects:
Spatial dimension: To assess the spatial distribution of predicted PM2.5 concentration, our study used ArcMap software (version 10.8) and ArcGIS Pro software (version 3.1) to conduct geographic visualization analysis of the raster prediction results of the three models. Among them, the LightGBM model showed certain advantages. It has a relatively strong ability to capture the spatial distribution and can roughly delineate the extent of high-concentration areas. However, it is deficient in the internal details of high-concentration areas, and its predicted values are sometimes too high, so it cannot accurately reflect the actual concentration changes in high-pollution areas. The Prophet model has obvious shortcomings in spatial prediction. Its predicted spatial distribution deviates from the actual situation, the predicted values in some areas are too low, and its ability to handle concentration gradients is limited, making it difficult to present the spatial trend of concentration accurately. Although the SARIMA model can reproduce the general outline of the concentration distribution and show the overall trend, it does not accurately capture the spatial change characteristics of some areas, and the prediction accuracy in these areas needs to be improved.
Time dimension and stability in different months: From the perspective of time series trend prediction, the performance of each model also differs. With its strong learning ability, the LightGBM model captures the complex change trends in historical data well; for example, it identifies the pattern of PM2.5 concentration being high at the beginning of the year, then decreasing, and rising again at the end of the year. In actual predictions, however, its results deviate greatly from the true values and show a general “overestimation”, especially in winter months with poor atmospheric diffusion conditions, where the prediction error is more obvious. The Prophet model can handle seasonality and trend in time series to a certain extent, but it over-relies on seasonal patterns in historical data during prediction. Faced with the more complex concentration changes in May–July, the model is not adaptable enough and its predictions are not ideal; in the concentration decline stage in particular, the predicted values deviate significantly from the original data and show an “underestimation” trend. As a classic time series model, the SARIMA model can reflect the trend of PM2.5 concentration in some months. Because it requires the data to be stationary, it adapts poorly to the non-stationary data caused by various factors in 2023, and its forecast performance in January, May, September and November is poor. Nevertheless, compared with the other two models, the overall deviation of the SARIMA model is relatively low.
Accuracy and reliability: Analysis of the model performance indicators shows differences in the accuracy and reliability of each model. The performance of the LightGBM model fluctuates greatly across months. In some months (such as May), its error indicators (MSE, RMSE, MAE) are low and its predictions reflect the actual concentration more accurately, but in January, October and December the error indicators are high and the prediction error is large. For the Prophet model, the MSE, RMSE and MAE values were significantly larger from May to July, indicating poor prediction during this period, while its performance in other months was relatively stable. The coefficient of determination (R²) of the SARIMA model shows that it fits the data relatively well, which means that the model can explain the variation of the data to a certain extent. However, in some months, its error indicators are still high, indicating that there is room for improvement in its prediction accuracy.
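For reference, the four indicators named above can be computed as in the following sketch, which uses their standard definitions. The observed and predicted values are hypothetical numbers chosen for easy checking, not results from this study, and the function name is an assumption.

```python
import math


def regression_metrics(observed, predicted):
    """Return MSE, RMSE, MAE and R² for paired observations and predictions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}


# Hypothetical monthly mean concentrations (µg/m³) and model predictions.
observed = [30.0, 25.0, 20.0, 35.0]
predicted = [32.0, 24.0, 23.0, 33.0]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

MSE and RMSE penalize large deviations more heavily than MAE, so a month with a few strongly overestimated cells raises them faster than it raises MAE. R² compares the residual error with the variance of the observations and becomes negative when a model performs worse than predicting the mean.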
In summary, the LightGBM model has advantages in capturing complex nonlinear relationships, but its stability is poor. The Prophet model has methods to deal with seasonality and trends, but it lacks adaptability to complex changes and is easily affected by special circumstances. The SARIMA model has a theoretical basis for time series prediction, but it has limitations in dealing with non-stationary data and spatial heterogeneity.
5.3. Analysis of Influencing Factors on PM2.5 Prediction
Meteorological conditions: Meteorological factors have a significant impact on PM2.5 concentration. Changes in temperature alter the stability of the atmosphere and thereby affect the diffusion capacity of pollutants. Humidity affects the hygroscopic growth of particulate matter and also participates in some chemical reactions. Wind speed and direction determine the direction and speed of pollutant transport. Precipitation acts as nature’s “cleaner” and effectively washes PM2.5 out of the air. For example, in winter, the temperature is low, the atmosphere is stable, and the diffusion conditions of pollutants are poor, so PM2.5 accumulates easily and the concentration increases, whereas in summer abundant precipitation can significantly reduce the concentration of PM2.5. Although our research uses monthly raster data to capture certain seasonal characteristics, it does not track changes in meteorological elements in real time. In months with changeable weather, such as when cold and warm air masses meet, the meteorological conditions are complex. Without real-time meteorological data, the models have difficulty simulating the diffusion, transport and transformation of pollutants accurately, resulting in increased prediction errors.
Anthropogenic emission sources: Anthropogenic emissions are an important source of PM2.5, covering industrial activities, traffic exhaust, coal-fired heating and many other aspects. Taking the Nanning urban area as an example, with the rapid development of urbanization, the scale of industry continues to expand, the number of motor vehicles has increased significantly, and the emission sources have become increasingly diversified. However, the models in our study failed to fully consider the dynamic changes in emission sources. Factories will adjust production according to orders and production plans, and traffic flow will fluctuate with time, weather and special events. These changes will directly lead to changes in PM2.5 emissions. However, the model does not take these real-time changes into account, and cannot accurately capture the changes in PM2.5 concentration caused by fluctuations in anthropogenic emissions, reducing the accuracy of the prediction.
Topographic and geomorphic features: The topography and geomorphic features of the Nanning urban area have a significant impact on PM2.5 concentration. The basin-shaped area with the Yongjiang River Valley as the center and the mountains on three sides hinder the diffusion of atmospheric pollutants, making PM2.5 easily accumulate in the urban area. Unfortunately, the model of our study was not optimized in combination with local topographic and geomorphic information during the construction process. Since the model cannot accurately reflect the hindering effect of the terrain on the diffusion of pollutants, deviations occur when simulating the transmission and distribution of pollutants, which in turn affects the prediction accuracy of PM2.5 concentration.
Data uncertainty: Data quality is crucial to the accuracy of model predictions, but the data in our study have several limitations. Monthly raster data reflect the concentration changes in time and space only to a certain extent. First, the data have a limited time span and can hardly cover the full range of environmental change scenarios, which limits the models’ ability to learn long-term patterns. Second, data updates lag, so the latest changes, such as new pollution sources and traffic control measures, cannot be reflected in time. Third, the granularity of monthly raster data is coarse, making it difficult to present short-term concentration changes in local areas in detail. Finally, the data lack key fine-grained meteorological and traffic features, such as hourly wind speed and traffic volume on specific road sections. These uncertainties make it difficult for the models to capture complex environmental changes and affect the accurate prediction of PM2.5 concentrations.
To improve the accuracy of predictions, future studies will consider refining meteorological data, incorporating real-time updated data on anthropogenic emission sources, and optimizing the model in combination with topographic information of the study area to reduce the impact of data uncertainty.
5.4. Model Application and Practical Significance
In the PM2.5 concentration prediction work, different models are suitable for different scenarios, and there are complementary characteristics between each other. In terms of short-term prediction, although the LightGBM model has errors, it performs well in quickly locating the scope of high-concentration areas and can provide preliminary pollution area distribution information for emergency response. The Prophet model has high prediction stability in some months and is good at dealing with seasonality and special events. When special activities are clearly arranged, it can effectively predict short-term concentration changes. Combining the two, LightGBM first determines the scope of high-concentration areas, and Prophet then accurately predicts the concentration fluctuations in these areas during special periods, which can improve the accuracy and completeness of short-term predictions. In long-term trend analysis, the SARIMA model can effectively identify the long-term change law of PM2.5 concentration by virtue of its ability to accurately capture time series trends. However, it is insufficient in dealing with complex nonlinear relationships and special events. At this time, combined with LightGBM’s ability to mine complex relationships and Prophet’s ability to deal with fluctuations in special events, long-term trend analysis can be improved to provide a more reliable basis for long-term monitoring and policy making. In the scenario of regional pollution warning, a single model is difficult to meet the needs. By comprehensively utilizing the spatial capture capability of LightGBM, the special event processing capability of Prophet, and the trend analysis capability of SARIMA, a composite early warning system can be constructed, which can significantly improve the accuracy of early warnings and provide strong support for timely prevention and control of pollution.
In practical applications, appropriate models should be selected flexibly, or the results of multiple models should be integrated, according to the scenario. In daily monitoring work, the SARIMA model can serve as the basis for analyzing long-term trends, the Prophet model can focus on concentration changes in special periods, and the LightGBM model can supplement spatial distribution information, to achieve comprehensive and continuous monitoring of air quality. At the same time, model fusion techniques such as weighted averaging and stacked generalization (stacking) can be used to combine the advantages of each model and further improve the reliability and accuracy of the prediction.
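As one possible form of the weighted averaging mentioned above, the sketch below weights each model by the inverse of its validation RMSE, so that models with smaller errors contribute more. The weighting rule, the RMSE values and the predictions are all hypothetical assumptions for illustration; the study does not specify a fusion scheme or report these numbers.

```python
def inverse_error_weights(validation_rmse):
    """Turn per-model validation RMSE into weights that sum to 1."""
    inverse = {name: 1.0 / rmse for name, rmse in validation_rmse.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_average(predictions, weights):
    """Combine per-model predictions for one grid cell and month."""
    return sum(weights[name] * predictions[name] for name in predictions)


# Hypothetical validation errors and predictions (µg/m³) for a single cell.
validation_rmse = {"SARIMA": 2.0, "Prophet": 4.0, "LightGBM": 4.0}
predictions = {"SARIMA": 28.0, "Prophet": 24.0, "LightGBM": 32.0}

weights = inverse_error_weights(validation_rmse)
combined = weighted_average(predictions, weights)
print(weights, combined)
```

Stacked generalization goes one step further by training a second-level model on the base models' out-of-sample predictions instead of fixing the weights by a rule.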
Our research results provide a solid scientific basis for the formulation of environmental protection policies. Based on the model predictions, differentiated prevention and control strategies can be formulated for different regions and seasons [80]. For example, in industrial areas, supervision can be strengthened, emission standards can be strictly enforced, and clean production technologies can be promoted; in traffic-intensive areas, traffic management can be optimized, public transportation can be encouraged, and new energy vehicles can be promoted. By accurately predicting changes in air quality during special periods, traffic control, enterprise production restrictions and other measures can be formulated in advance to effectively reduce pollution emissions. In addition, accurate prediction results can help the public understand the air quality status in a timely manner so that they can take corresponding protective measures, such as wearing masks and reducing outdoor activities, to effectively protect public health.
5.5. Research Limitations
Data: The data has a limited time span and lacks a real-time update mechanism, which makes the models less adaptable to new data. The study relies on a single dataset and does not integrate multi-source data, making it difficult to fully reflect influencing factors. The data granularity is coarse, and fine-grained and auxiliary features are not included in model training. The chemical composition of PM2.5 is not considered, which limits the in-depth understanding of the nature of pollution and the accuracy of prediction.
Model: Model parameters are mostly set empirically; there is no automatic tuning mechanism, and the parameters cannot adapt as conditions change. Model assumptions may not match the actual situation, which is shaped by many interacting factors. For example, the SARIMA model requires stationary data and is difficult to adapt to the complex and changeable actual environment. Some models (such as LightGBM) have poor interpretability, which is not conducive to an in-depth understanding of the prediction process.
Research methods: Although the evaluation indicators can reflect the performance of the model, they have limitations and cannot comprehensively evaluate the performance of the model in different scenarios. The research did not fully consider practical problems such as monitoring equipment errors and data transmission delays, nor did it fine-tune and adjust the model for different regions.
These deficiencies may limit the accuracy of model predictions and affect the reliability and universality of research results. Future research needs to improve in these aspects.
5.6. Comparative Analysis and Innovation of Research Findings
Compared with other related studies, our study is unique in several ways, which have different impacts on the results:
Model selection: Our research is innovative in model selection. Unlike most studies, which focus on a single type of model, our research compares a statistical model (SARIMA), a model designed for seasonal and holiday effects (Prophet), and a machine learning model (LightGBM). Traditional studies that rely solely on machine learning models can mine complex relationships but often neglect a precise analysis of the seasonality and trend of the data. Our research overcomes this limitation by systematically comparing different types of models to comprehensively evaluate their performance in PM2.5 concentration prediction. This comparison shows the advantages of each model, such as SARIMA’s ability to capture time series trends, Prophet’s strength in handling seasonality and special events, and LightGBM’s ability to mine complex nonlinear relationships. It also clarifies the disadvantages of each model, providing a basis for subsequent studies to select the most appropriate model according to different needs. This multi-model comparison offers a new approach for PM2.5 concentration prediction research.
Data application: Our research is also innovative in data application. We selected the China High-Resolution and High-Quality PM2.5 Dataset (CHAP), which provides raster data at 1 km resolution, and combined two temporal resolutions, yearly and monthly. Yearly data present the long-term trend of PM2.5 concentration from a macro perspective and help analyze the impact of long-term factors such as urban development and industrial structure adjustment on concentration. Monthly data focus on short-term changes and capture concentration fluctuations caused by seasonal factors, such as the different effects of meteorological conditions on concentration in different seasons. This use of dual temporal resolution gives the research a comprehensive and detailed view of the time dimension, which is conducive to in-depth analysis of the changing characteristics and mechanisms of PM2.5 concentration at different time scales. Compared with data at a single temporal resolution, it reveals the pattern of PM2.5 concentration changes more comprehensively.
Research area and time scale: Our study selected the Nanning urban area as the research area, which is both novel and targeted. The Nanning urban area has distinctive geographical, climatic and pollution source characteristics. It has a subtropical monsoon climate with distinct dry and wet seasons and diversified pollution sources, providing rich and distinctive samples for research. Environmental conditions differ greatly between regions. For example, winter coal-fired heating in northern cities has a significant impact on PM2.5 concentration, which is very different from the pollution source composition in the Nanning urban area, and this distinguishes our research from studies of other regions. Carrying out the research in the Nanning urban area not only provides an in-depth understanding of the variation of PM2.5 concentration in the region, but also provides a practical case for the application of the models in a special environment. In terms of time scale, data from 2012 to 2023 were selected. This period covers changes in many aspects, such as urban development and the implementation of environmental protection policies. Data from this period provide rich environmental change information for model training and prediction, which helps to explore the impact of environmental changes in different time ranges on model performance, and offers a reference for subsequent research on time scale selection and data processing.