Decoding PM2.5 Prediction in Nanning Urban Area, China: Unraveling Model Superiorities and Drawbacks Through SARIMA, Prophet, and LightGBM

Chen, Minru; Liu, Binglin; Liang, Mingzhi; Yao, Nini

doi:10.3390/a18030167


¹ School of Geography and Planning, Nanning Normal University, Nanning 530001, China

² Key Laboratory of Environment Change and Resources Use in Beibu Gulf, Ministry of Education, Nanning Normal University, Nanning 530001, China

³ Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

⁴ Department of Architecture and Built Environment, University of Nottingham, Ningbo 315154, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Algorithms 2025, 18(3), 167; https://doi.org/10.3390/a18030167

Submission received: 17 January 2025 / Revised: 6 March 2025 / Accepted: 8 March 2025 / Published: 14 March 2025

(This article belongs to the Special Issue Development of Machine Learning and Artificial Intelligence Algorithms in Environmental Retrieval Tasks)


Abstract:

With the rapid development of industrialization and urbanization, air pollution is becoming increasingly serious. Accurate prediction of PM_2.5 concentration is of great significance to environmental protection and public health. Our study takes Nanning urban area, which has unique geographical, climatic and pollution source characteristics, as the object. Based on the dual-time resolution raster data of the China High-resolution and High-quality PM_2.5 Dataset (CHAP) from 2012 to 2023, the PM_2.5 concentration prediction study is carried out using SARIMA, Prophet and LightGBM models. The study systematically compares the performance of each model from the spatial and temporal dimensions using indicators such as mean square error (MSE), mean absolute error (MAE) and coefficient of determination (R²). The results show that the LightGBM model has a strong ability to mine complex nonlinear relationships, but its stability is poor. The Prophet model has obvious advantages in dealing with seasonality and trend of time series, but it lacks adaptability to complex changes. The SARIMA model is based on time series prediction theory and performs well in some scenarios, but has limitations in dealing with non-stationary data and spatial heterogeneity. Our research provides a multi-dimensional model performance reference for subsequent PM_2.5 concentration predictions, helps researchers select models reasonably according to different scenarios and needs, provides new ideas for analyzing concentration change patterns, and promotes the development of related research in the field of environmental science.

Keywords:

machine learning; PM_2.5; prediction; Nanning urban area; China

1. Introduction

Against the backdrop of continued global industrialization and urbanization, air pollution has become increasingly serious and is now one of the key factors threatening human health and the ecological environment. Atmospheric pollutants are diverse, including sulfur dioxide (SO₂), nitrogen oxides (NO_x), and particulate matter (PM). Among them, PM_2.5, a major air pollutant, can penetrate deep into the human respiratory system because of its tiny particle size (≤2.5 μm), causing respiratory and cardiovascular diseases and even increasing the risk of death [1,2,3]. In high-density urban areas, traffic pollutant emissions have a significant impact on air quality and residents’ health [4]. Many studies have shown that PM_2.5 is closely related to a variety of health problems [5,6]. Regarding respiratory effects, some studies have pointed out that it can damage the human respiratory system [7,8]. Additionally, epidemiological studies associate PM_2.5 with allergic rhinitis [9]. Furthermore, environmental exposure to PM_2.5 elevates influenza-like illness risks through airborne transmission mechanisms [10,11]. Heavy metals in PM_2.5 also affect lung health [12,13,14]. Other studies have summarized the impact of ambient PM_2.5 on human health in China, including an updated summary of the adverse health effects of PM_2.5 exposure [15,16]. In view of this, accurate prediction of PM_2.5 concentration is vital for formulating pollution prevention and control strategies in advance, protecting public health, and maintaining ecological balance [17].

Meteorological properties and satellite datasets can play an important role in PM_2.5 inversion, especially in model building and training. Many studies have shown that meteorological conditions significantly affect atmospheric pollutant concentrations, not only in the inversion of PM_2.5 but also in the inversion of other pollutants (such as NO₂, O₃, and nitrogen oxides). For example, Liu et al. used machine learning methods to evaluate the impact of meteorological and emission changes on O₃ concentrations in different regions of China [18]. Mak et al. retrieved the tropospheric NO₂ column concentration over southern China by improving the air mass factor (AMF) [19]. Rodriguez-Sanchez et al. studied the influence of meteorological conditions on the effectiveness of traffic measures in controlling NO_x concentration [20]. These studies provide an important reference for the inversion of atmospheric pollutant concentrations using meteorological properties and satellite data, and further work confirms the significant influence of meteorological factors on pollutant concentrations [21]. Satellite data have also been combined with ground monitoring. Some studies have developed methods that combine satellite remote sensing and low-cost sensor networks to estimate long-term PM_2.5 concentrations in specific areas [22,23]. For China, some studies have proposed three-dimensional variational data fusion methods to improve the modeling of spatiotemporal changes in fine particulate matter (PM_2.5). By fusing multi-source data, including satellite remote sensing data and ground monitoring data, the spatiotemporal distribution of PM_2.5 can be portrayed more accurately, and the simulation and prediction of PM_2.5 concentrations are improved [24]. These studies provide important references for understanding pollutant characteristics and distribution patterns from different dimensions, and also provide ideas and methods for our research focusing on PM_2.5.

Early studies on PM_2.5 concentration primarily focused on the development of monitoring technologies and traditional statistical models. Initially, researchers relied on basic air sampling techniques and gravimetric methods to measure PM_2.5 levels, which were time-consuming and less precise compared to modern methods [25,26,27]. With the advancement of technology, automated monitoring systems using beta attenuation and light scattering techniques were introduced, significantly improving the accuracy and efficiency of PM_2.5 monitoring [28]. In addition to ARIMA and SARIMA models, the multivariate linear regression (MLR) model was also widely used to predict PM_2.5 concentration by establishing a linear relationship between PM_2.5 concentration and multiple influencing factors such as meteorological factors and pollution source emissions [29]. However, due to the complexity of the air pollution process, these traditional models often had limitations in complex environments and struggled to effectively capture the nonlinear and multi-factor influencing characteristics of PM_2.5 concentration changes [30].

With the rapid development of science and technology, Machine Learning (ML) and Deep Learning (DL) have emerged as powerful tools for predicting PM_2.5 concentration. These techniques offer significant advantages over traditional statistical methods, especially in handling nonlinear relationships and large datasets [31]. For instance, Support Vector Machines (SVMs) have been widely used due to their ability to handle high-dimensional data and provide accurate predictions [32,33]. Random Forests have also proven effective in capturing the complex interactions between various factors influencing PM_2.5 levels [34,35]. More recently, deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been employed to leverage spatial and temporal patterns in PM_2.5 data [36]. For example, the Long Short-Term Memory (LSTM) network has shown superior performance in capturing long-term dependencies in time series data, making it particularly suitable for PM_2.5 concentration prediction [37]. Hybrid models combining LSTM with other techniques, such as wavelet transform and decomposition methods, have further enhanced prediction accuracy [38,39].

Research on PM_2.5 concentration has been conducted across various geographical regions, including urban, rural, and industrial areas, to better understand its spatial and temporal variations. In urban areas, studies have shown that local meteorological conditions, traffic emissions, and industrial activities significantly influence PM_2.5 levels [40,41]. For example, in the Beijing-Tianjin-Hebei region of China, PM_2.5 concentrations exhibit pronounced seasonal and diurnal patterns due to the combined effects of meteorology and anthropogenic emissions [42]. In contrast, rural areas are more influenced by agricultural activities and biomass burning, which can lead to significant PM_2.5 spikes during specific periods [43,44]. Industrial areas, particularly those with heavy manufacturing and chemical plants, often experience higher PM_2.5 concentrations due to direct emissions and secondary pollution formation [45,46]. Recent studies have also highlighted the importance of considering regional differences in PM_2.5 prediction models. For instance, a study in the United States found that machine learning models trained on data from one region may not perform well in another due to variations in pollution sources and meteorological conditions [47,48].

Internationally, PM_2.5 concentration prediction has become a critical area of research due to its significant impact on public health and environmental quality. In Europe, studies have utilized high-resolution chemical transport models (CTMs) combined with machine learning techniques to improve PM_2.5 predictions [49]. These hybrid models leverage the strengths of both approaches, using CTMs to simulate atmospheric chemistry and machine learning algorithms to capture complex spatial and temporal patterns in PM_2.5 data. For example, a study in Germany demonstrated that integrating machine learning with CTMs improved prediction accuracy by 20% compared to traditional methods alone [50]. In North America, researchers have focused on developing advanced machine learning models that incorporate satellite data and ground-based measurements to enhance spatial coverage and prediction accuracy [51]. For instance, a study in the United States used a combination of Random Forest and Gradient Boosting algorithms to predict PM_2.5 concentrations across different regions, highlighting the importance of meteorological factors and land use in model performance [52]. These models have shown significant potential in capturing regional variations in PM_2.5 concentrations, which is crucial for effective air quality management. In Asia, particularly in countries like India and South Korea, research has focused on developing localized prediction models tailored to specific urban and industrial environments. A study in South Korea used a novel spectral clustering algorithm combined with machine learning techniques to analyze PM_2.5 data and identify key pollution sources [53]. In India, researchers have developed ensemble machine learning models to predict PM_2.5 concentrations in highly polluted urban areas, such as Bengaluru and Delhi [54]. These studies emphasize the need for region-specific approaches due to variations in pollution sources and meteorological conditions.

In recent years, numerous innovative approaches have been developed to improve PM_2.5 concentration prediction. For example, the integration of Convolutional LSTM (ConvLSTM) and Graph Convolutional Network (GCN) architectures has been proposed to leverage both spatial and temporal features in multi-source data [55]. Another study introduced a hybrid model combining LSTM with a Deep Auto-Encoder (DAE) to enhance the model’s ability to capture complex patterns in PM_2.5 data [56]. Additionally, the application of ensemble learning techniques, such as stacking, has shown promise in improving prediction accuracy by combining the strengths of multiple machine learning models [57]. These advancements highlight the ongoing efforts to develop more accurate and robust PM_2.5 prediction models, which are crucial for effective air quality management and public health protection. The integration of high-resolution data from satellites, ground-based sensors, and meteorological models, combined with advanced machine learning techniques, is expected to further enhance the accuracy and reliability of PM_2.5 predictions in diverse geographical regions.

After achieving remarkable results in fields such as natural language processing, the Transformer model has also emerged in environmental science, especially in PM_2.5 concentration prediction, where many scholars have studied it in depth. For example, Thundiyil et al. pointed out that the architecture of the Transformer model enables it to handle complex dependencies in time series data effectively, which provides new ideas for PM_2.5 concentration prediction [58]. Wang et al. used a large-scale air quality monitoring dataset to train and verify a Transformer-based prediction model. The results showed that the model performed well in capturing the long-term trend and seasonal changes of PM_2.5 concentration [59]. In addition, Mohammed et al. combined the Informer architecture with a residual transformer to develop ResInformer, an artificial time-series prediction model for PM_2.5 concentration in three major cities in China (Beijing, Shijiazhuang, and Wuhan) [60].

At the same time, there have been many explorations of fused Deep Learning models. Li et al. proposed a deep learning-based method, AC-LSTM, which combines a one-dimensional convolutional neural network (CNN), a long short-term memory (LSTM) network, and an attention-based network for urban PM_2.5 concentration prediction [61]. Kim et al. developed a hybrid attention transformer (HAT) to produce accurate daily PM_2.5 predictions for Seoul. The performance of HAT was evaluated by comparing its predictions with ground observations and a 3-D chemical transport model (3-D CTM). The experimental results show that this fusion model can effectively reduce the model training time while improving the prediction accuracy [62].

Through a comprehensive analysis of the above literature, we can find that models such as neural networks and decision trees in machine learning models have made great progress in dealing with nonlinear problems, but the parameter selection and training process of the model are often complex and require a lot of computing resources and data. For example, the LSTM model may encounter the problem of gradient vanishing or gradient explosion when processing long sequence data, which requires special techniques to alleviate. Although the ensemble learning model can improve the prediction performance, how to select the appropriate base model and combination method still needs further research. At the same time, Deep Learning models such as Transformer have shown strong feature learning capabilities, but when applied to PM_2.5 concentration prediction, they may face problems such as overfitting and computational efficiency. Moreover, current research pays less attention to the interpretability of the model, and it is difficult to understand the prediction results of the model in a physical sense. In addition, most studies have certain limitations in data selection and processing. Some studies only consider a few influencing factors and do not fully explore the multi-source driving factors of PM_2.5 concentration changes. At the same time, the quality and spatiotemporal resolution of the data will also affect the prediction accuracy of the model, but research in this regard is not deep enough.

In view of the above situation, our research aims to address the shortcomings of existing research. Taking Nanning urban area in China as an example, we conducted a prediction study on the PM_2.5 concentration in the area based on the Machine Learning method. We selected the SARIMA, Prophet, and LightGBM models in our research. Among them, SARIMA, as a classic time series model, is good at capturing the seasonality and trend of data [63]. The Prophet model is designed for processing data with seasonal and holiday effects and has strong flexibility and interpretability [64]. LightGBM is an efficient gradient boosting framework with significant advantages in processing large-scale data and complex nonlinear relationships [65]. We selected the annual and monthly PM_2.5 raster data from 2012 to 2023 nationwide in ChinaHighAirPollutants (CHAP), cropped it with the Nanning urban area mask, used the 2012–2022 data as the training set and the 2023 data as the test set, and performed model training and prediction according to the standard process. By predicting and saving the 2023 results raster by raster, we provided data support for subsequent analysis, compared the performance of different models, explored the prediction method suitable for the region, and provided a scientific basis for air quality management.

Our research has several innovations that distinguish it from existing studies. Firstly, unlike previous studies that were mostly limited to a single model type, we systematically compared different types of models, including a traditional statistical model (SARIMA) and advanced machine learning models (Prophet and LightGBM). This multi-model comparison allows for a comprehensive evaluation of the strengths and weaknesses of each model in the context of PM_2.5 concentration prediction, providing a more robust basis for model selection [66]. Secondly, we employed a variety of visualization methods to analyze the prediction results from multiple dimensions, such as spatial and temporal variations. This approach not only enhances the interpretability of the results but also provides valuable insights into the underlying patterns of PM_2.5 concentration changes [67]. Thirdly, considering the characteristics of PM_2.5 data in Nanning urban area, our study adopted a data-driven strategy, focusing solely on time series data to mine potential patterns. This strategy maximizes the use of high-resolution data, thereby improving the reliability and effectiveness of the prediction [68]. Lastly, our research on Nanning urban area has demonstration value for other regions facing similar situations. By providing a detailed case study, our methods and findings can help promote the application and development of Machine Learning models in environmental science [69]. These innovations not only enhance the scientific rigor of our study but also contribute to practical applications. For instance, the multi-model comparison can help policymakers and environmental managers make more informed decisions by selecting the most appropriate model for specific needs and conditions. The data-driven strategy and the use of high-resolution data can improve the accuracy of PM_2.5 concentration predictions, thereby supporting more effective air quality management and public health protection measures [70].

2. Data Sources and Study Area

2.1. Data Sources

Our research data comes from the ChinaHighAirPollutants (CHAP) dataset (2000–2023) [71] published by Dr. Wei Jing and Professor Li Zhanqing’s team on the website of the National Tibetan Plateau Data Center.

The spatial resolution of this dataset is 1 km: each grid cell corresponds to a ground area of 1 km × 1 km, which allows the spatial distribution of PM_2.5 to be presented in detail. For the Nanning urban area, it can clearly distinguish the PM_2.5 concentration differences among commercial, industrial, residential, and other functional areas, accurately locate high-pollution areas, and help study the relationship between PM_2.5 and local environmental factors.

Our study uses the PM_2.5 raster data from 2000 to 2023 in the national dataset (Table 1). The original format is NetCDF (.nc), which is subsequently converted to raster format (.tif) for easier analysis and processing. The data coordinate system is WGS_1984, the unit is µg/m³, and the temporal resolution is annual and monthly. In this study, the data from 2012 to 2022 are used as the training set to predict the PM_2.5 concentration in 2023, so as to evaluate the prediction ability of each model within this time range. The PM_2.5 data are highly representative: they cover the entire country over a long time span and can therefore reflect the long-term trend and seasonal characteristics of PM_2.5 concentration in the study area. Although Nanning urban area has unique geographical and climatic conditions, the data capture the combined impact of various environmental factors on PM_2.5 concentration at a large scale and can be applied effectively to our study after clipping.

2.2. Study Area

Our study focuses on the urban area of Nanning, Guangxi Zhuang Autonomous Region, China. As the capital of Guangxi Zhuang Autonomous Region, Nanning is not only the economic, political and cultural center of the region, but also a city in a rapid development stage [72]. It has a superior geographical location, located in the central and southern part of Guangxi, bordering many cities, and has rich natural resources and a unique geographical environment [73].

Nanning is an ideal region for evaluating the applicability of our research approach because of its distinctive climate and PM_2.5 pollution sources. Nanning has a typical humid subtropical monsoon climate with four distinct seasons, abundant precipitation, and high air humidity. Under this climate, high atmospheric water vapor content favors complex chemical reactions among the components of PM_2.5, which in turn leads to dynamic changes in its concentration and chemical composition. Precipitation, as a natural mechanism of atmospheric purification, has a complex impact on PM_2.5 concentrations. In the early stage of precipitation, PM_2.5 is prone to accumulation due to high atmospheric stability, and after precipitation, as atmospheric conditions change, PM_2.5 concentrations may rise again [74]. In addition, high air humidity can cause PM_2.5 particles to absorb moisture and grow, changing their optical and physical properties, which significantly affects their transport and diffusion in the atmosphere. Studies have shown that when relative humidity increases, the water-soluble components in PM_2.5 absorb water vapor, which increases particle size and enhances the particles’ ability to scatter and absorb light, thereby affecting atmospheric visibility [75].

In terms of PM_2.5 pollution sources, along with the accelerated urbanization process, the urban economy of Nanning urban area has developed rapidly. According to the statistics in 2024, the permanent population of Nanning urban area has reached 8.9408 million, and the number of urban motor vehicles has exceeded 3.37 million. The continuous increase in urban population and the continuous expansion of traffic scale have made the PM_2.5 pollution sources present obvious diversified characteristics. Emissions from industrial production activities are one of the important sources of PM_2.5. Various factories will release a large amount of waste gas containing PM_2.5 and other pollutants into the atmosphere during the production process. Exhaust emissions in the transportation sector should not be ignored either. The continuous increase in the number of urban motor vehicles has led to a significant increase in exhaust emissions. In addition, dust generated during urban construction and fumes emitted by the catering service industry have also played an important role in the formation of PM_2.5. PM_2.5 from different sources has obvious differences in chemical composition, particle size distribution, etc. This difference further increases the complexity and difficulty of PM_2.5 pollution.

Nanning urban area has a total area of approximately 8000 square kilometers, and its jurisdiction covers several administrative districts, such as Xingning District, Qingxiu District, Jiangnan District, Xixiangtang District, and Yongning District [76]. Its terrain is centered on the Yongjiang River Valley and is in the form of a basin. It is surrounded by mountains on the south, north, and west, and only has an open east side. This terrain condition is not conducive to the diffusion of atmospheric pollutants, making it easy for pollutants such as PM_2.5 to accumulate within the urban area, further exacerbating the degree of air pollution in the area.

In view of the above situation, in-depth research on the PM_2.5 concentration changes and influencing factors in Nanning urban area has important scientific and practical significance. Our research will focus on the PM_2.5 concentration data in this area for systematic analysis, aiming to provide a scientific basis for improving urban air quality and provide a reference for policy makers to formulate practical air pollution control strategies. At the same time, limiting the scope of the research to the urban area will help reduce the complexity of data processing, improve the efficiency and accuracy of the prediction model, and more effectively respond to the increasingly severe air pollution challenges.

Figure 1 shows the location of the study area. Sub-figure a shows the Guangxi Zhuang Autonomous Region of China, with the study area clearly marked. Sub-figure b shows the location of the study area in more detail at the city scale. To accelerate model prediction and highlight the representative area of the city center, this study selected only the urban part of Nanning City. Sub-figure c shows the high-resolution PM_2.5 grid data of China used in the study, cropped by the mask of the study area. The values shown are the annual PM_2.5 concentrations (unit: μg/m³) for the Nanning urban area in 2023. Both the maximum and minimum values are relatively low overall, indicating good air quality.

3. Research Method

3.1. Research Process

Our research process is shown in Figure 2.

3.2. Model Introduction

In our study, we used three different prediction models: Seasonal Autoregressive Integrated Moving Average model (SARIMA), Prophet model and LightGBM model. They each have different advantages and are all applicable for achieving the core goal of PM_2.5 concentration prediction in Nanning Urban Area.

3.2.1. SARIMA Model

The SARIMA (Seasonal Autoregressive Integrated Moving Average) model is a mature statistical time series analysis tool that is particularly suitable for data with seasonal fluctuations [77]. The model combines three components (autoregression, differencing, and moving average) and captures the variation pattern of PM_2.5 concentration by modeling the seasonal characteristics of the time series.

In our study, the non-seasonal parameter is set to non_seasonal_order = (1, 1, 1), corresponding to (p, d, q). The seasonal parameter is seasonal_order = (1, 1, 1), S = 12. The mathematical formula and the meaning of each parameter are as follows:

$\mathrm{SARIMA}(p, d, q)(P, D, Q)_{S}$

p is the number of autoregressive terms. In our study, p = 1, which means that the current PM_2.5 concentration is affected by the concentration at the previous moment, which can help the model capture the short-term dependence trend of the data.

d is the non-seasonal difference order, and d = 1 means that the original data are processed by first-order difference, the purpose of which is to make the data stationary so that the model can better identify the patterns in the data.

q is the number of moving average terms. q = 1 indicates that the current PM_2.5 concentration is related to the previous error term, and the previous error term can be used to correct the current prediction.

P is the number of seasonal autoregressive terms, P = 1 and the seasonal period S = 12, indicating that the current concentration is affected by the concentration in the previous period (i.e., 12 months ago), which helps capture the annual seasonal dependence of the data.

D is the seasonal difference order, and D = 1 means that the data are seasonally differenced once to remove seasonal trends and make the data more stationary, which makes it easier for the model to analyze other potential patterns.

Q is the number of seasonal moving average terms, and Q = 1 means that the previous error term within the seasonal cycle is used to adjust the current forecast to improve the accuracy of the seasonal forecast.

S is the length of the seasonal cycle. It is set to S = 12, consistent with the annual periodicity expected in the monthly PM_2.5 data.
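As a rough illustration of what d = 1 and D = 1 with S = 12 do to a monthly series, the sketch below applies the two differencing steps by hand. The function names and the synthetic series are hypothetical and are not taken from the study; a time series library would normally perform these steps internally when fitting the model.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_differencing(series, d=1, D=1, S=12):
    """Apply D seasonal differences of period S, then d ordinary differences."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=S)
    for _ in range(d):
        out = difference(out, lag=1)
    return out


# Synthetic monthly series (illustrative only): a linear trend plus a fixed
# 12-month pattern, three years long.
pattern = [9, 7, 5, 3, 1, 0, 0, 1, 3, 5, 7, 9]
series = [30 + 0.5 * t + pattern[t % 12] for t in range(36)]

stationary = sarima_differencing(series, d=1, D=1, S=12)
print(len(stationary))                      # 36 - 12 - 1 values remain
print(max(abs(v) for v in stationary))      # trend and seasonal pattern removed
```

In backshift notation the two steps correspond to applying (1 − B)(1 − B^12) to the series. The autoregressive and moving average terms (p, q, P, Q) are then fitted to the differenced values.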

3.2.2. Prophet Model

The Prophet model is a time series forecasting tool developed by Facebook, specifically designed to handle data containing seasonal and holiday effects [78]. Its additive model is used in our study to predict the PM_2.5 concentration in Nanning urban area. The additive model formula and the meaning of each variable are as follows:

$y(t) = g(t) + s(t) + h(t)$

y(t) is the predicted value of PM_2.5 concentration at time t.

g(t) is a trend function that reflects the long-term trend of PM_2.5 concentration over time. This function can be a linear or logistic growth model, which adapts to the long-term trend characteristics of the data by adjusting relevant parameters.

s(t) is a seasonal function that can capture the seasonal fluctuations of PM_2.5 concentration on a monthly, quarterly or annual scale. The model will automatically identify and fit these seasonal patterns.

h(t) is the holiday effect function, which is used to reflect the abnormal impact of special events such as holidays on PM_2.5 concentration. Users can customize specific holidays or events according to actual conditions to improve the prediction accuracy of the model.
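The additive structure can be sketched with three hand-written components. This is only a toy illustration: Prophet estimates its trend, seasonal, and holiday terms from data, whereas every coefficient, the single Fourier pair, and the flagged holiday month below are hypothetical values chosen for readability.

```python
import math


def trend(t, k=-0.1, m=40.0):
    """g(t): linear trend with slope k and offset m (hypothetical values)."""
    return m + k * t


def seasonality(t, period=12.0, a=5.0, b=2.0):
    """s(t): one Fourier pair with the given period (hypothetical coefficients)."""
    angle = 2.0 * math.pi * t / period
    return a * math.cos(angle) + b * math.sin(angle)


def holiday(t, holiday_months=frozenset({1}), effect=3.0):
    """h(t): constant shift in months flagged as holidays (assumed flag set)."""
    return effect if (t % 12) in holiday_months else 0.0


def additive_forecast(t):
    """y(t) = g(t) + s(t) + h(t), with t counted in months."""
    return trend(t) + seasonality(t) + holiday(t)


for t in range(0, 13, 3):
    print(t, round(additive_forecast(t), 3))
```

Because the components are simply summed, each one can be inspected on its own, which is the source of the interpretability attributed to the model.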

3.2.3. LightGBM Model

LightGBM (Light Gradient Boosting Machine) is an efficient gradient boosting Machine Learning framework developed by Microsoft, which is particularly suitable for regression and classification tasks on large-scale datasets [79]. In our study, it is used to mine the deep features of PM_2.5 concentration changes and capture complex nonlinear patterns.

Core calculation formula:

(1) Objective function

L = \sum_{i = 1}^{N} l (y_{i}, \hat{y_{i}}) + Ω (f)

L is the objective function value, which is used to measure the difference between the model prediction result and the actual value, and to constrain the complexity of the model to balance the model’s fitting ability and generalization ability.

N is the number of PM_2.5 concentration data samples used to train the LightGBM model.

l (y_{i}, \hat{y_{i}}) is the loss function for the actual value y_{i} and the predicted value \hat{y_{i}} of the i-th sample. Common loss functions include mean square error, mean absolute error, etc. In our study, we select a suitable loss function to evaluate the model performance according to the task requirements.

Ω (f) is a regularization term, which is used to control the complexity of the model, prevent the model from overfitting, and make the model perform better on both training data and new data.
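As a worked illustration of the objective, the sketch below evaluates a data term plus a penalty on toy numbers. The text does not specify the form of l or Ω, so two assumptions are made here: l is the squared error, and Ω is a hypothetical L2 penalty on the leaf outputs with weight lam. Neither is claimed to be the exact form used in the study.

```python
def squared_loss(y_true, y_pred):
    """l(y_i, y_hat_i): squared error for one sample (assumed loss)."""
    return (y_true - y_pred) ** 2


def l2_penalty(leaf_values, lam=1.0):
    """Hypothetical Omega(f): lam times the sum of squared leaf outputs."""
    return lam * sum(w * w for w in leaf_values)


def objective(y_true, y_pred, leaf_values, lam=1.0):
    """L = sum_i l(y_i, y_hat_i) + Omega(f)."""
    data_term = sum(squared_loss(y, p) for y, p in zip(y_true, y_pred))
    return data_term + l2_penalty(leaf_values, lam)


# Toy values only; they are not taken from the CHAP dataset.
y_true = [40.0, 30.0, 20.0]
y_pred = [42.0, 29.0, 23.0]
leaf_values = [0.5, -1.0]

L = objective(y_true, y_pred, leaf_values, lam=2.0)
# Data term: 4 + 1 + 9 = 14; penalty: 2 * (0.25 + 1.0) = 2.5; L = 16.5
```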

(2) Additive Model

{\hat{y_{i}}}^{(m)} = {\hat{y_{i}}}^{(m - 1)} + ν \cdot f_{m} (x_{i})

{\hat{y_{i}}}^{(m)} is the predicted value of the i-th sample in the m-th iteration. As the number of iterations increases, the predicted value of the model is continuously updated and optimized.

{\hat{y_{i}}}^{(m - 1)} is the predicted value of the i-th sample in the (m - 1)-th iteration, that is, the result of the previous iteration. The new predicted value is updated based on the result of the previous iteration.

ν is the learning rate, which controls the contribution of the newly generated weak learner f_{m} (x_{i}) to the final prediction result at each iteration. A suitable learning rate can make the model converge stably during the training process and improve the accuracy of the prediction.

f_{m} (x_{i}) is the predicted value of the m-th weak learner for the i-th sample x_{i}. LightGBM constructs multiple weak learners through continuous iteration, and accumulates their prediction results to finally form a strong learner for prediction.
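The update rule can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not LightGBM's implementation: the loss is taken to be the squared error (so each weak learner is fitted to the current residuals), and the weak learner is a simple per-month lookup table standing in for a decision tree. The function names and the toy series are hypothetical.

```python
from collections import defaultdict


def fit_month_mean_learner(months, residuals):
    """Weak learner f_m: predicts the mean residual of each calendar month."""
    sums, counts = defaultdict(float), defaultdict(int)
    for mth, r in zip(months, residuals):
        sums[mth] += r
        counts[mth] += 1
    table = {mth: sums[mth] / counts[mth] for mth in sums}
    return lambda mth: table.get(mth, 0.0)


def boost(months, y, n_rounds=50, nu=0.05):
    """Apply y_hat^(m) = y_hat^(m-1) + nu * f_m(x) for n_rounds iterations."""
    pred = [sum(y) / len(y)] * len(y)  # start from the global mean
    history = []  # training MSE after each round
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        f_m = fit_month_mean_learner(months, residuals)
        pred = [pi + nu * f_m(mth) for pi, mth in zip(pred, months)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return pred, history


# Two years of toy monthly concentrations (not real data).
months = list(range(1, 13)) * 2
y = [60, 55, 45, 35, 28, 22, 20, 21, 27, 34, 42, 52,
     56, 50, 41, 33, 25, 20, 18, 20, 25, 31, 40, 48]

pred, history = boost(months, y)
```

With a small ν each round corrects only a fraction of the remaining residual, so the training error shrinks gradually rather than in one step.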

3.3. Data Preprocessing

In our study, the original PM_2.5 data is China’s high-resolution PM_2.5 concentration data. To focus on the Nanning urban area and to keep the computation within the capacity of our hardware, the PM_2.5 data of all years are cropped using the Nanning urban area mask (Figure 3). The cropping process uses the rasterio and geopandas libraries to ensure the accuracy and efficiency of data processing.

After cropping, the data needs to be further processed and divided into training sets and test sets. The specific steps are as follows:

Data reading and stacking: By generating a list of year and month strings from January 2012 to December 2023, traversing and reading the corresponding cropped TIFF files, the data from 2012 to 2022 are stacked into a training set, and the data from 2023 are stacked into a test set.

Missing value and outlier processing: Check the training set data. If there are missing values, fill them with the mean of the corresponding column. For outliers, determine the upper and lower boundaries by calculating the interquartile range, and trim the values that exceed the boundaries to within the boundaries to improve data quality.

SARIMA model: The training set data is differenced to ensure stationarity, and the last month of training data is recorded so that the differencing can be inverted in subsequent forecasts.

Prophet model: With the help of pandas’ date_range function, with 1 January 2012 as the starting date and month as the frequency, a date sequence is generated according to the length of the monthly time series data, and combined with the PM_2.5 concentration into a format containing a date column ds and a target value column y.

LightGBM model: After extracting the time series from the training set data by pixel, create a DataFrame containing date and month features, and use the month as the input feature of the model.
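The preprocessing steps above can be sketched for a single pixel's time series as follows. This is a simplified illustration using only the standard library, not the study's raster pipeline. Several details are assumptions: the "YYYYMM" string format, the conventional 1.5 multiplier on the interquartile range, and all function names.

```python
import statistics


def year_month_strings(start_year=2012, end_year=2023):
    """Year-month keys such as '201201' (the exact file naming is an assumption)."""
    return [f"{y}{m:02d}" for y in range(start_year, end_year + 1) for m in range(1, 13)]


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def clip_iqr(values, k=1.5):
    """Trim values beyond [Q1 - k*IQR, Q3 + k*IQR] back to the boundaries."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]


def difference(series):
    """First-order differencing; also returns the last level for later inversion."""
    return [b - a for a, b in zip(series, series[1:])], series[-1]


def undo_difference(diff_forecast, last_value):
    """Turn forecast differences back into levels by cumulative summation."""
    out, level = [], last_value
    for d in diff_forecast:
        level += d
        out.append(level)
    return out


keys = year_month_strings()
train_keys, test_keys = keys[:132], keys[132:]  # 2012-2022 vs 2023

column = [30.0, None, 32.0, 31.0, 29.0, 200.0, 33.0]  # toy values with a gap and a spike
filled = fill_missing_with_mean(column)
clipped = clip_iqr(filled)

diffs, last_value = difference([10.0, 12.0, 15.0, 14.0])
levels = undo_difference([1.0, -2.0], last_value)
```

Recording the last training value matters because a model fitted on differences predicts changes, and those changes need a starting level before they can be compared with the 2023 observations.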

We selected the trimmed and processed PM_2.5 data from 2012 to 2022 as the training set and the processed data from 2023 as the test set, mainly based on the following considerations:

Data characteristics and trends: The 11 years of data form a coherent time series that can fully reflect the long-term trend and seasonal characteristics of PM_2.5 concentration. Affected by factors such as interannual climate, urban development, and environmental protection policies, the concentration change trend can be observed to provide information for model learning. At the same time, the data covers multiple complete seasonal cycles, which helps the model capture the seasonal variation patterns under Nanning’s subtropical monsoon climate and enhance the SARIMA and Prophet models’ ability to predict PM_2.5 concentration changes.

Model training and validation: Sufficient data is the key to model training. 11 years of data allows the model to fully learn the relationship between PM_2.5 concentration and various potential influencing factors. For example, LightGBM can mine complex nonlinear relationships. Using data from 2012 to 2022 for training and data from 2023 for validation can effectively test the generalization ability of the model and avoid overfitting.

Research purpose and application: In environmental forecasting, the accuracy of near-term forecasts is of great significance to environmental management and decision-making. Selecting 2023 for verification can evaluate the current forecasting performance of the model and understand its adaptability to environmental changes. By comparing the 2023 forecast value with the true value, the advantages and disadvantages of the model can be summarized, providing a reference for future forecasts and environmental management decisions, and helping to formulate scientific and effective environmental protection policies.

3.4. Model Training and Prediction

Figure 4 shows the model training and prediction workflow that we employed in our research.

SARIMA model training and prediction: For the SARIMA model, during training, according to the non-seasonal parameter non_seasonal_order = (1, 1, 1) and seasonal parameter seasonal_order = (1, 1, 1), S = 12, the corresponding algorithm is used to estimate the parameters of the training set data from 2012 to 2022. The model parameters are optimized through multiple iterations to achieve a better fitting effect. After the training is completed, the trained model is used to predict the 2023 test set data grid by grid, and the prediction results are saved.

Prophet model training and prediction: Initialize the Prophet model and use the previously prepared formatted training set data from 2012 to 2022 for training. After the training is completed, according to the set 12-month forecast period, use model.make_future_dataframe to generate 2023 future date data with a monthly frequency, and input it into the trained model to obtain the 2023 PM_2.5 concentration forecast results.

LightGBM model training and prediction: For the LightGBM model, set the target to regression task, use root mean square error as the evaluation indicator, use gradient boosted decision tree (GBDT) as the boosting type, the number of leaf nodes is 31, the learning rate is 0.05, the feature sampling ratio is 0.9, the minimum number of leaf node samples is 1, the maximum depth of the tree is not limited, and detailed output information is turned off. Using these parameter settings, the training set data from 2012 to 2022 is trained, and the model continuously learns the characteristics of PM_2.5 concentration data during the training process. After the training is completed, the month is used as the input feature to perform grid-by-grid prediction on the 2023 test set data, and the prediction results are saved.
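For reference, the settings described above can be written down as plain configuration dictionaries. The mapping from the prose to library keyword names is our reading, not something the text confirms: the SARIMA orders are expressed in the style of statsmodels' SARIMAX (`order`, `seasonal_order`), the Prophet entry holds arguments for `make_future_dataframe` (with a month-start frequency assumed), and the LightGBM entry uses that library's standard parameter names. No library is imported here, so the block only records the values.

```python
# SARIMA orders: (p, d, q) and (P, D, Q, S), assuming a statsmodels-style layout.
sarima_config = {
    "order": (1, 1, 1),
    "seasonal_order": (1, 1, 1, 12),
}

# Arguments for Prophet's make_future_dataframe; "MS" (month start) is an assumption
# based on the series starting on 1 January 2012 with monthly frequency.
prophet_future = {"periods": 12, "freq": "MS"}

# LightGBM parameters as described in the text.
lightgbm_params = {
    "objective": "regression",   # regression task
    "metric": "rmse",            # root mean square error as evaluation indicator
    "boosting_type": "gbdt",     # gradient boosted decision trees
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,     # feature sampling ratio
    "min_data_in_leaf": 1,       # minimum number of samples per leaf
    "max_depth": -1,             # no depth limit
    "verbose": -1,               # suppress detailed output
}
```

If the month is indeed the only input feature, each per-pixel LightGBM model sees at most 12 distinct inputs, so its prediction for a given calendar month cannot change from one year to the next.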

3.5. Model Metrics

3.5.1. Mean Squared Error, MSE

The mean square error is the average of the squares of the errors between the predicted value and the true value, and is calculated as:

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}

n is the number of samples, y_{i} is the true value of the i-th sample, and \hat{y_{i}} is the predicted value of the i-th sample.

MSE measures the average error between the predicted value and the true value. Since the error is squared, larger errors will be magnified in MSE, so MSE is sensitive to outliers. In PM_2.5 concentration prediction research, MSE can reflect the overall prediction error of the model. If the MSE value is small, it means that the predicted value of the model is closer to the true value and the prediction accuracy of the model is high. Conversely, if the MSE value is large, it means that the model has a large prediction error and the model may need to be adjusted or improved.
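A short sketch makes the magnifying effect of squaring concrete. The two toy prediction sets below have the same total absolute error (8 μg/m³), but one spreads it evenly and the other concentrates it in a single sample. The values are invented for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long, non-empty sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [30.0, 25.0, 20.0, 35.0]
uniform = [32.0, 27.0, 22.0, 37.0]    # four errors of 2
one_large = [30.0, 25.0, 20.0, 43.0]  # a single error of 8

mse_uniform = mse(y_true, uniform)      # (4 + 4 + 4 + 4) / 4 = 4.0
mse_one_large = mse(y_true, one_large)  # 64 / 4 = 16.0
```

The single large error yields an MSE four times higher, which is the sensitivity to outliers described above.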

3.5.2. Root–Mean–Squared Error, RMSE

The Root–Mean–Squared Error is the square root of the mean square error and is calculated as:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}} = \sqrt{MSE}

RMSE has the same dimensions as the original data, which makes it easier to interpret and understand in practical applications. In the PM_2.5 concentration prediction study, RMSE can intuitively represent the average size of the model prediction error. For example, if the unit of PM_2.5 concentration is μg/m³, then the unit of RMSE is also μg/m³, which can be directly compared with the actual value of PM_2.5 concentration. In addition, RMSE is also sensitive to larger errors, which can highlight the shortcomings of the model in dealing with large errors and help researchers find the weak links of the model.
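The unit argument can be checked on toy numbers. In the sketch below the inputs are taken to be in μg/m³, so the result is in μg/m³ as well. The values are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error; same unit as the inputs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [21.0, 29.0, 47.0, 43.0]  # errors of 1, 1, 7 and 7

value = rmse(y_true, y_pred)  # sqrt((1 + 1 + 49 + 49) / 4) = 5.0
```

An RMSE of 5 μg/m³ can be read directly against concentrations of 20 to 50 μg/m³, whereas the corresponding MSE of 25 is in squared units.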

3.5.3. Mean Absolute Error, MAE

The mean absolute error is the average of the absolute errors between the predicted value and the true value, and is calculated as:

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - \hat{y_{i}} |

MAE directly calculates the absolute error between the predicted value and the true value, avoiding the amplification effect caused by the square of the error, so it is more robust to outliers. In the PM_2.5 concentration prediction study, due to the complexity of environmental factors, some abnormal PM_2.5 concentration values may appear (such as high concentration values caused by sudden pollution events). MAE can more robustly reflect the average prediction error of the model and will not produce large fluctuations due to individual outliers. Compared with MSE and RMSE, MAE focuses more on measuring the average prediction error of the model on most samples.
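The difference in robustness can be seen by adding one extreme value to an otherwise well-predicted toy series. The numbers below are invented. The last observation mimics a sudden pollution event that the prediction misses.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, included here only for comparison."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


y_true = [20.0, 22.0, 21.0, 23.0, 120.0]  # last value: hypothetical pollution spike
y_pred = [21.0, 21.0, 22.0, 22.0, 24.0]   # errors of 1, 1, 1, 1 and 96

mae_all = mae(y_true, y_pred)    # (1 + 1 + 1 + 1 + 96) / 5 = 20.0
rmse_all = rmse(y_true, y_pred)  # sqrt(9220 / 5), about 42.9

mae_typical = mae(y_true[:4], y_pred[:4])    # 1.0
rmse_typical = rmse(y_true[:4], y_pred[:4])  # 1.0
```

Without the spike both metrics equal 1. With it, RMSE rises to more than twice the MAE, so the gap between the two is itself a hint that a few large errors dominate.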

3.5.4. Coefficient of Determination, R²

The coefficient of determination R² is used to evaluate the goodness of fit of the model to the data and is calculated as:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

\bar{y} is the average value of the true value. The closer R^{2} is to 1, the better the model fits the data, that is, the higher the degree of variation of the true value that the model can explain. In the PM_2.5 concentration prediction study, R^{2} can help researchers determine whether the model can effectively capture the variation pattern of PM_2.5 concentration. For example, if R^{2} = 0.8, it means that the model can explain 80% of the PM_2.5 concentration changes, and the remaining 20% of the changes may be caused by factors not considered by the model (such as unincluded meteorological variables, sudden human activities, etc.). Therefore, R^{2} can be used as an important reference indicator for evaluating model performance and selecting the optimal model.
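Under this definition R² is not bounded below by zero. It becomes negative whenever the model's squared error exceeds that of simply predicting the mean of the true values. The sketch below shows the three reference cases on toy data. The constant offset of 17 is an arbitrary illustrative bias.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the true values are constant")
    return 1.0 - ss_res / ss_tot


y_true = [15.0, 20.0, 25.0, 30.0, 35.0]  # mean 25, SS_tot = 250

perfect = r2(y_true, y_true)                      # 1.0
mean_only = r2(y_true, [25.0] * 5)                # 0.0
biased = r2(y_true, [y + 17.0 for y in y_true])   # 1 - 1445 / 250 = -4.78
```

The biased case follows the shape of the data exactly, yet scores far below zero, so a negative R² can reflect a systematic offset rather than a failure to track the pattern.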

4. Experimental Results and Analysis

4.1. Comparison of Original Data and Model Prediction Results

Our study uses Figure 5 to analyze the spatial distribution of PM_2.5 concentration in Nanning urban area in 2023 and the prediction performance of LightGBM, Prophet and SARIMA models. Figure 6 selects some months (January, April, July, and September) to display the prediction results, presenting the spatial prediction distribution results at different time resolutions from year to month.

From the annual distribution of PM_2.5 in 2023 shown in Figure 5, it can be deduced that the city center has dense industrial activities, heavy traffic flow, and large exhaust emissions. In parts of Jiangnan District and Xixiangtang District, PM_2.5 concentrations are high due to the combined effects of industrial activities and traffic emissions, while in parts of Qingxiu District and Yongning District, due to the high green coverage rate, vegetation has a significant adsorption and filtration effect on particulate matter. At the same time, there are relatively few industrial activities in this area and relatively few pollution sources, so PM_2.5 concentrations are low.

From the monthly distribution in Figure 6, the concentration distribution in different months is significantly affected by factors such as meteorological conditions. In January, high-concentration areas are concentrated in urban centers and industrial areas. This is because winter heating activities increase energy consumption, which in turn leads to increased pollutant emissions. At the same time, the atmospheric diffusion conditions in winter are poor. In April, the overall concentration decreased and the scope of the high-concentration area decreased. This is mainly attributed to the strong wind in spring, which is conducive to the diffusion of pollutants. In July, the concentration further decreased, and the high-concentration areas were mainly concentrated in busy traffic sections and local industrial areas. This is because summer precipitation has a scouring effect on particulate matter. In September, the concentration distribution is similar to that in July, but it can be clearly observed that the concentration in some areas has rebounded. This may be because the meteorological conditions have changed after the summer, resulting in worse pollutant diffusion conditions.

From the annual prediction results, although the SARIMA model can present a general concentration distribution trend, its spatial distribution prediction accuracy is insufficient in some streets in Xixiangtang District and some industrial parks in Jiangnan District. This shows that the model does not capture the spatial variation characteristics of these areas, such as the local airflow changes caused by building shading that affect pollutant diffusion. The Prophet model is not accurate enough in predicting spatial location distribution. In areas with large concentration gradient changes such as the border between Xingning District and Qingxiu District, the predicted values in some areas are low, which shows that the model has limitations in dealing with concentration gradient changes. Relatively speaking, the LightGBM model has a better ability to capture spatial distribution, but within the high-concentration industrial area of Jiangnan District, its prediction of details such as emission differences between different factories is not accurate enough, resulting in relatively high predicted values.

In the comparison of monthly prediction results, the LightGBM model predicts the extent of high-concentration areas well in months such as April and September, but its predicted concentration values still deviate considerably. Taking September as an example, in low-concentration areas such as around some parks in Qingxiu District, the model did not take into account factors such as vegetation adsorption of particulate matter and low surrounding traffic flow, resulting in overestimation of values in some low-concentration areas. The prediction results of the Prophet model are relatively fragmented. Although the values in some rural areas of Yongning District are relatively close to the true values, in border areas such as the border with Jiangnan District, where topography and human activities produce complex concentration gradients, the prediction deviates greatly from the actual distribution. The SARIMA model has certain advantages in time series trend prediction, but its capture of spatial features shows deviations in different urban areas in different months, such as the commercial area of Xingning District and the newly developed area of Liangqing District.

In general, each model has its own advantages and disadvantages in capturing and predicting the spatial distribution of PM_2.5 concentration at different time resolutions. In practical applications, it is necessary to comprehensively consider the changes in meteorological conditions in different time periods in the study area, such as the low temperature and calm wind period in winter, the concentrated precipitation period in summer, etc., as well as regional characteristics, such as the functional zoning of urban areas (commercial areas, industrial areas, residential areas, etc.), topography (mountainous areas, plains, water areas, etc.) and other factors. For areas with complex and changeable meteorological conditions and complex topography, the LightGBM model that can handle complex nonlinear relationships can be given priority. For time series with obvious seasonal and trend characteristics, the SARIMA model may be more advantageous. At the same time, the prediction accuracy of the spatial distribution of PM_2.5 concentration can be further improved by adjusting model parameters, such as optimizing the tree depth and learning rate of the LightGBM model, or fusing the prediction results of multiple models.

4.2. Time Series Trend Analysis

Figure 7 presents the monthly average PM_2.5 concentration data of Nanning urban area from 2012 to 2023 in the form of a line chart. Through a detailed and in-depth analysis of these data, we can not only clearly sort out the long-term trend of PM_2.5 concentration during this period, but also combine known influencing factors such as holidays and epidemics to accurately analyze their specific effects on PM_2.5 concentration.

From the overall change, during the period of 2012–2014, the PM_2.5 concentration was high at the beginning of each year, such as 75.4 μg/m³ in January 2012, about 96 μg/m³ in January 2013, and 120.4 μg/m³ in January 2014. Although it declined in the following months, it still fluctuated at a high level. At that time, the industrialization process in Nanning urban area was advancing rapidly, industrial activities were frequent, energy consumption was high, and environmental protection measures were relatively weak. A large amount of industrial waste gas, dust and other pollutants were discharged into the atmosphere. At the same time, a large number of fireworks and firecrackers were set off during the Spring Festival, releasing a large amount of smoke, sulfur dioxide and other pollutants in a short period of time, causing the PM_2.5 concentration in the air to rise sharply.

Since 2015, the PM_2.5 concentration in some months has shown a clear downward trend, especially in the summer months (June to August) from 2015 to 2017, when the concentration was generally low and stable. For example, the PM_2.5 concentration in July 2015 dropped to 25.5 μg/m³, and the concentration in the same period of 2016 and 2017 also remained at a low level. This is due to strict environmental governance measures, including strengthening the control of industrial pollution sources, requiring enterprises to install advanced waste gas treatment equipment, promoting clean energy, replacing high-pollution energy with solar energy, wind energy, etc., upgrading motor vehicle exhaust emission standards, and eliminating old vehicles, which effectively controlled pollutant emissions. However, in certain months of some years, such as National Day and other holidays, PM_2.5 concentrations will rise briefly due to factors such as increased mobility and traffic flow, and increased fume emissions from restaurants.

The PM_2.5 concentration changes were special from 2020 to 2022 due to the impact of the COVID-19 pandemic. In the early stages of the pandemic, strict prevention and control measures such as factory shutdowns and vehicle restrictions significantly reduced industrial production and transportation activities, and pollutant emissions dropped sharply. In February 2020, PM_2.5 concentrations in many cities, including Nanning urban area, dropped to a relatively low level in the same period in history, about 32 μg/m³. With the normalization of epidemic prevention and control, production and life have gradually resumed but are still restricted. Enterprises pay more attention to environmental protection, and public transportation operations are optimized, making PM_2.5 concentrations relatively stable during this period and the overall level is not high, and the seasonal fluctuation range is reduced. In addition, after 2019, the small peak concentration that might have occurred in October was significantly reduced or postponed. This may be because the epidemic affected holiday travel and gathering activities, reducing pollutant emissions, or it may be due to the public’s improved environmental awareness and the positive impact of practicing green travel and energy conservation and emission reduction.

The above-mentioned trend of PM_2.5 concentration in 2012–2022 affects the 2023 prediction of each model in several ways. As Figure 8 (monthly average PM_2.5 concentration line chart of each model in 2023) shows, when predicting the 2023 concentration each model draws on the trend characteristics of the historical data (the main characteristics used in this experiment) and on the seasonal fluctuations caused by various factors.

The LightGBM model, with its excellent nonlinear fitting ability, effectively captures the complex changes in the historical trends from 2012 to 2022, and applies them to the PM_2.5 concentration forecast for 2023. The line chart shows that the model reproduces the overall shape of the year: a high concentration at the beginning of the year (the actual concentration in January is about 41 μg/m³, against a LightGBM prediction of about 58 μg/m³), a subsequent decline, and a rebound at the end of the year (the actual concentration in December is about 33 μg/m³, against a prediction of about 52 μg/m³). This is due to its effective learning of seasonal changes, holiday impacts, and special fluctuations during the epidemic in historical data. However, comparing the overall forecast with the actual values, the predicted values are generally higher than the actual values, a significant “overestimation”. This shows that although the model can learn historical influencing factors, it does not take into account the changes in new pollution sources, differences in meteorological conditions, and different holiday activities in 2023, and further optimization is needed to improve the accuracy of the forecast.

The Prophet model has certain methods for dealing with seasonality, trend and holiday effects of time series. However, when predicting the concentration in 2023 based on the trend from 2012 to 2022, the performance is not satisfactory. In the stage of concentration decline, the predicted value of the model deviates greatly from the original data, showing an obvious “underestimation” trend. For example, during the period from April to July, the actual concentration dropped from about 32 μg/m³ to about 15 μg/m³, while the predicted value of the Prophet model was always lower than the actual value, and the gap gradually increased. This may be because the Prophet model simply models the holiday effect and seasonal changes when processing historical data, ignoring the significant differences in pollution sources and meteorological conditions between 2023 and previous years. For example, in 2023, some factories upgraded their emission reduction equipment and reduced pollutant emissions, but the model did not take this key change into account. When using the previous average diffusion coefficient for prediction, the prediction of the concentration decline stage was seriously underestimated.

As a classic time series model, the SARIMA model builds a model based on the data characteristics from 2012 to 2022 to predict the concentration in 2023. In some months, it can reflect the trend of concentration changes, but there are deviations from the actual values in the overall seasonal fluctuation range and the prediction of some key time points. Nevertheless, compared with the other two models, its deviation is relatively small. For example, in September and November, there is a certain difference between the actual concentration and the predicted value. This may be because the model has high requirements for the stability of the data, and the concentration changes in 2023 are affected by the combined effect of multiple factors such as the resumption of normal holiday activities and the gradual weakening of the impact of the epidemic. It does not fully conform to the stable pattern presented by historical data. There are some sudden factors or new change patterns that are not fully captured by the model.

The PM_2.5 concentration changes from 2012 to 2022 provide an important reference for the prediction in 2023. However, when using the information of these time series changes for prediction, each model has certain advantages and disadvantages due to the limitations of the model itself and the complexity of the actual situation in the prediction year. In subsequent research and application, it is necessary to further consider multiple factors, optimize the model or combine the results of multiple models to better improve the accuracy of the prediction results.

4.3. Analysis of Model Performance Metrics

Figure 9 presents, in the form of a heat map, the performance indicators of the LightGBM, Prophet and SARIMA models for each month in 2023: Mean Squared Error (MSE), Root–Mean–Squared Error (RMSE), Mean Absolute Error (MAE) and Coefficient of Determination (R²). To present the error magnitudes or goodness-of-fit of the three models more clearly and intuitively, each set of data was normalized independently within itself and sorted by size. The cells are colored from dark to light to better display the comparative relationships within each group of indicators (the values shown in the figure are still the original data of the indicators).
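Independent normalization of this kind can be sketched as follows. The text does not state which normalization was used or exactly how the groups were formed, so min-max scaling applied separately to each metric is an assumption, and the values are hypothetical rather than read from Figure 9.

```python
def min_max(values):
    """Scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monthly metric values for one model (four months shown).
metrics = {
    "MSE": [289.0, 64.0, 16.0, 361.0],
    "MAE": [17.0, 8.0, 4.0, 19.0],
}

# Each metric is scaled on its own, so colors are comparable within a metric
# but not across metrics; the annotated numbers stay the original values.
normalized = {name: min_max(vals) for name, vals in metrics.items()}
```

Scaling each group separately keeps a metric with large raw values, such as MSE, from washing out the color contrast of the others.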

The heat map of the LightGBM model is darker in January, October and December, indicating that the normalized evaluation index values in these months are relatively high, that is, the model prediction deviates greatly from the actual value. In May, by contrast, the color is relatively light and the values of the various indicators are low, which means that the errors are relatively small and the prediction effect is good. This difference is closely related to the seasonal characteristics of the data in the study area. Although the LightGBM model has a strong nonlinear fitting ability and can handle complex data relationships, in the winter months of January and December the atmospheric diffusion conditions are poor, and the PM_2.5 concentration is affected by a variety of complex factors. Even this model has difficulty accurately capturing the pattern of concentration change. In July, the summer meteorological conditions are relatively stable, and the patterns in the data are easier for the model to learn and predict, so it performs better.

Unlike the LightGBM model, the Prophet model has darker colors in May-July, indicating that the MSE, RMSE and MAE values are large during this period, and the prediction effect is poor, while the color is relatively light in other months, and the prediction error is relatively stable and small. The Prophet model is based on the seasonality and trend of time series, and has a certain ability to handle conventional seasonal patterns. However, during the period from May to July, the changes in PM_2.5 concentration in the study area were more complex. There may be some special meteorological conditions or human activities, which led to deviations between the actual situation and the seasonal patterns in the historical data. The model is too dependent on the historical pattern, which increases the prediction error.

Turning to the SARIMA model, the darker colors in January, May, September and November mean that the MSE, RMSE and MAE values are higher and the prediction performance is poor. For the R² indicator, the closer the value is to 1, the better the model fits the data. Among the three models, the negative R² value of the SARIMA model is smaller in magnitude than those of the LightGBM and Prophet models, that is, it is closer to 1, indicating that it fits the data better than the other two. However, combined with its higher error index in some months, it can be seen that although the model has relatively good fitting performance, it still lacks in prediction accuracy. This is because the SARIMA model has certain requirements for the stability of the data, and the PM_2.5 concentration data in 2023 is affected by a variety of complex factors, and it is difficult to meet its stability assumption in some months, so the overall performance still has room for improvement.

Overall, the three models of LightGBM, Prophet and SARIMA have their own advantages and disadvantages in predicting the PM_2.5 concentration in 2023. The LightGBM model has strong fitting ability and can capture complex trends, but it is greatly affected by special environmental factors. The Prophet model has certain methods in dealing with conventional seasonal time series, but it is not adaptable enough in the face of complex changes. The SARIMA model has a relatively good fitting effect, but the requirement for data stability limits its prediction accuracy in some months. Due to their own problems, it is difficult to fully and accurately grasp the data characteristics, and there will be large errors and poor fitting effects in different months, as the models cannot perfectly match the complex changes in the data. In practical applications, it is necessary to comprehensively consider the selection of appropriate models or combine the advantages of multiple models for prediction based on specific data characteristics and application scenarios.

4.4. Error Analysis with Scatter Plots and Box Plots

Figure 10 shows the predicted values and true values of the three models, LightGBM, Prophet, and SARIMA, in the form of scatter plots. With the help of the identity function (y = x) image, the gap between the predicted value and the true value can be more intuitively displayed, and it is also convenient to observe the performance of different models in different numerical ranges.
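The position of scatter points relative to the identity line can also be summarized numerically. The sketch below assumes the true value is on the horizontal axis and the predicted value on the vertical axis (the text does not state the axis assignment), so a point above y = x is an overestimate. The function name and the data are hypothetical.

```python
def identity_line_summary(y_true, y_pred):
    """Signed vertical distance of each point from y = x, plus simple shares."""
    deviations = [p - y for y, p in zip(y_true, y_pred)]
    n = len(deviations)
    return {
        "mean_bias": sum(deviations) / n,                       # > 0: overestimation
        "share_above": sum(1 for d in deviations if d > 0) / n,
        "share_below": sum(1 for d in deviations if d < 0) / n,
    }


y_true = [20.0, 30.0, 40.0]
overestimating = [28.0, 40.0, 52.0]   # every point lies above y = x
underestimating = [15.0, 22.0, 31.0]  # every point lies below y = x

over = identity_line_summary(y_true, overestimating)
under = identity_line_summary(y_true, underestimating)
```

A mean bias far from zero with nearly all points on one side of the line indicates a systematic offset, while points split evenly around the line with large distances indicate scatter.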

The LightGBM model is presented as green scatter points. In months such as May and July, the green scatter points are relatively close to the identity function (y = x) line, which indicates that the model can better capture the distribution trend of PM_2.5 concentration in these months. From the distribution of scatter points, the degree of dispersion is low, indicating that the model prediction results are more stable. However, overall, there is a large deviation between the predicted values of the LightGBM model and the true value range. This may be due to the fact that the LightGBM model allocates weights to data features or fits complex relationships during training, which leads to a certain deviation in capturing the overall range of data. Although it can stably reflect trends, there is still room for improvement in accuracy.

In contrast, the blue scatter points of the Prophet model are more loosely distributed, and most of the scatter points deviate downward from the identity function. Taking June and July as examples, the model’s predicted values deviate greatly from the true values, indicating that it is insufficient in grasping the trend of PM_2.5 concentration changes in these two months, and the prediction error is large. The Prophet model is mainly based on the seasonality, trend and holiday effects of time series for modeling. If in a specific month, the actual concentration change is affected by some factors that are not fully considered by the model, such as sudden meteorological changes or special human activities, it is easy to cause the model’s prediction deviation to increase.

The red scattered points of the SARIMA model are the most scattered compared to the other two models. In months such as March and November, the red scattered points are obviously far away from the identity function, showing that the model has a weak ability to capture the trend of PM_2.5 concentration changes in these months. However, it is worth noting that the predicted value range of the SARIMA model is close to the true value range, and there are more scattered points close to the identity function. This is because the SARIMA model, as a classic time series model, can reflect the overall characteristics of the data to a certain extent when processing time series data with a certain degree of stationarity. However, due to its high requirements for data stationarity, when the data is affected by some sudden or abnormal factors, inaccurate trend capture will occur.
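The reading of the scatter plots against the identity line can also be expressed numerically. The following is a minimal sketch, assuming paired lists of true and predicted values; the function name and the sample values are hypothetical and are not taken from the study.

```python
from statistics import mean

def identity_line_summary(y_true, y_pred):
    """Summarise how predictions sit relative to the y = x line."""
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    return {
        # Positive mean bias: points lie above y = x on average (overestimation).
        "mean_bias": mean(residuals),
        # Average vertical distance of the points from y = x.
        "mean_abs_dev": mean(abs(r) for r in residuals),
        # Fraction of points above the identity line.
        "share_above": sum(r > 0 for r in residuals) / len(residuals),
    }

# Hypothetical monthly values in ug/m^3 (illustration only).
y_true = [30.0, 20.0, 25.0, 40.0]
y_pred = [36.0, 26.0, 24.0, 46.0]
summary = identity_line_summary(y_true, y_pred)
```

A positive mean bias together with a small mean absolute deviation would correspond to points that are tightly grouped but shifted away from the identity line, which is the pattern described for LightGBM above.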

Figure 11 presents the data as box plots. These plots show the distribution characteristics of the prediction errors and make statistical analysis of the errors easier. Comparing each model's monthly box plots with those of the original data highlights more clearly the advantages each model has in predicting PM_2.5 concentrations.

The LightGBM model performed well in some months from May to July. From the box plot, the boxes of the predicted data in these months are shorter, indicating that the data has a smaller degree of dispersion, that is, the prediction error is relatively concentrated. At the same time, the median is close to the median of the original data, and the deviation is within 8 μg/m³, which shows that the model can accurately grasp the central trend of the data in these months, and the prediction accuracy and stability are high. This advantage can provide a more reliable reference for the decision-making of relevant departments in practical applications. For example, in terms of air quality control decisions, based on the accurate prediction of the model, pollution prevention and control measures can be arranged more reasonably. However, in some other months, the box plot of the LightGBM model shows that the dispersion of its predicted data is large, indicating that there is a certain degree of fluctuation in the prediction error, and the adaptability of the model can be further improved.

The prediction performance of the Prophet model in most months is poor, and the boxes of the predicted data box plot are generally longer and the degree of dispersion is large. However, in months such as September to November, the lower quartile is close to that of the original data, with a deviation of about 7 μg/m³, indicating that in these months, the model has a certain degree of accuracy in predicting the lower value part of the data. This may be because the change characteristics of PM_2.5 concentration in these months are more consistent with some patterns in the historical data learned by the model. However, the large error of the Prophet model in other months limits its application effect in the overall forecast of the whole year, and further optimization is needed to improve stability.

Observing the data of each month throughout the year, in months with obvious seasonal characteristics, such as winter (December–February) or summer (June–August), the fluctuation trend of the box plot of the SARIMA model predicted data is more similar to the original box plot, and the deviation of key statistics (such as median, quartiles, etc.) is within 7 μg/m³. This is due to the fact that the SARIMA model, as a classic time series model, can better capture the seasonal changes in the data. Therefore, the model has certain advantages in understanding and predicting the changes in PM_2.5 concentration caused by seasonal factors, and can provide valuable reference for the study and prediction of seasonal air quality changes. However, in some months, there are still some differences between its box plot and the original data box plot, indicating that the model still has certain limitations when dealing with some special situations or complex changes.
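The box-plot comparisons above rest on the median, the quartiles, and the box length (interquartile range). As a hedged illustration, the sketch below computes these statistics for one month of observed and predicted values and reports their differences. The helper names and the numbers are hypothetical, and the quartile convention ("inclusive") is an assumption, since the article does not state which one was used.

```python
from statistics import quantiles

def box_stats(values):
    """Quartiles and box length (IQR) of one box in a box plot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}

def box_deviation(observed, predicted):
    """Predicted minus observed value of each box statistic."""
    obs, pred = box_stats(observed), box_stats(predicted)
    return {key: pred[key] - obs[key] for key in obs}

# Hypothetical grid-cell values for a single month in ug/m^3 (illustration only).
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [14.0, 24.0, 34.0, 44.0, 54.0]
deviation = box_deviation(observed, predicted)
```

In this toy case every statistic is shifted by the same amount while the box length is unchanged, that is, the spread is reproduced but the level is offset.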

4.5. Normality Test and Analysis of Prediction Errors

Figure 12 shows the normality test results of the forecast errors of the LightGBM, Prophet, and SARIMA models in 2023, including error histograms and normal probability plots. To support the assessment of normality, the results of the Kolmogorov-Smirnov (K-S) test are also marked in the chart, with the conventional significance level of 0.05 used as the threshold.

From the error histogram analysis, the error distribution of the LightGBM model shows an obvious multi-peak shape, and the intervals between the peaks are relatively scattered, indicating that its error values are relatively complex, with a certain frequency of occurrence in different intervals, and no obvious central trend. The error distribution of the Prophet model fluctuates more disorderly and lacks regularity. The error frequency of each interval fluctuates, and it is difficult to find a unified distribution pattern. Although the SARIMA model shows a certain unimodal trend, the peak is not located in the center of the distribution, and the left and right sides are asymmetric, indicating that its error distribution deviates from the symmetric characteristics of the normal distribution. Overall, the error histograms of the three models do not conform to the typical characteristics of the normal distribution.

In the normal probability graph, the sample quantile curves of the three models deviate significantly from the theoretical quantile straight line. The curve of the LightGBM model deviates greatly from the straight line, and fluctuates significantly in different quantile intervals, indicating that its error distribution is quite different from the normal distribution. The curve of the Prophet model is separated from the straight line in most intervals, and only a few points are close to the straight line, showing the inconsistency between its error distribution and the normal distribution. Although the curve of the SARIMA model overlaps with the straight line to a certain extent in some intervals, there is still a significant deviation overall, especially at the quantiles at both ends. At the same time, the K-S test results all rejected the null hypothesis that the errors follow the normal distribution, further confirming that the errors of the three models do not follow the normal distribution.
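For readers who wish to reproduce this kind of check, a one-sample K-S statistic against a fitted normal distribution can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the procedure used in the study: the function names are hypothetical, the normal parameters are estimated from the sample, and the decision rule uses the large-sample 5% critical value of about 1.358/√n, which is conservative when the parameters are estimated (a Lilliefors-type correction would be stricter).

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(errors):
    """One-sample K-S statistic D against a normal fitted to the sample."""
    n = len(errors)
    fitted = NormalDist(mean(errors), stdev(errors))
    d = 0.0
    for i, x in enumerate(sorted(errors), start=1):
        cdf = fitted.cdf(x)
        # Largest gap between the fitted CDF and the empirical CDF,
        # checked just after and just before each sample point.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

def rejects_normality(errors, coeff=1.358):
    """Large-sample rule at the 0.05 level: reject when D > coeff / sqrt(n)."""
    return ks_statistic_normal(errors) > coeff / math.sqrt(len(errors))

# Hypothetical error samples (illustration only).
bimodal_errors = [-10.0] * 50 + [10.0] * 50
normal_like_errors = [NormalDist().inv_cdf((i - 0.5) / 100) for i in range(1, 101)]
```

A multi-peak error sample such as `bimodal_errors` produces a large D and is rejected, whereas the evenly spaced normal quantiles in `normal_like_errors` are not.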

Although the errors of the three models do not follow the normal distribution, this study is still of value for PM_2.5 concentration prediction. From the perspective of model improvement, the multi-peak and dispersed error distribution of the LightGBM model suggests that subsequent optimization can focus on the influencing factors corresponding to the different peaks, such as whether they are related to specific time periods or data characteristics, and then adjust the feature selection and parameter settings of the model in a targeted manner to reduce the dispersion of the error. For the irregular error distribution of the Prophet model, it is possible to strengthen the handling of outliers and sudden changes in the time series, for example by introducing a more flexible anomaly detection mechanism or by improving the modeling of special factors such as holidays, so that the model can better adapt to the dynamic changes of the data. For the asymmetric single-peak error distribution of the SARIMA model, we can try to adjust how the series is made stationary (for example, the differencing or transformation applied), or combine the model with other components that can capture asymmetric characteristics to improve its ability to fit the data distribution.

In terms of new model development or model fusion strategy formulation, the data and experience accumulated in this study provide rich references. Through comparative analysis of the error distribution of different models, we can clarify the advantages and disadvantages of each model under different data conditions, so that when developing new models, we can integrate the advantages of each model in a targeted manner and abandon its disadvantages. In terms of model fusion strategy, we can reasonably allocate the weights of different models according to the characteristics of the error distribution. For example, we can give a higher weight to the model with a relatively stable error distribution to improve the accuracy and stability of the overall prediction. These will indirectly promote the PM_2.5 prediction research to a more accurate and reliable direction, and help us to understand the changing law of PM_2.5 concentration more deeply.

5. Discussion

5.1. Model Selection Rationale

Before we delve into the performance of the models, it is necessary to explain why we chose the three models of SARIMA, Prophet and LightGBM to predict PM_2.5 concentration in Nanning urban area. Accurate prediction of PM_2.5 concentration is extremely important for urban environmental management and public health protection, and appropriate model selection is the key to achieving accurate prediction.

Our study aims to comprehensively evaluate the performance of different types of models in predicting PM_2.5 concentration in Nanning urban area, and provide a scientific basis for subsequent research and practical applications. The PM_2.5 concentration in Nanning urban area is affected by many factors such as meteorological conditions, human activities, and topography. The interaction of these factors causes the concentration change to present a complex situation, which not only shows obvious seasonal and trend characteristics, but also has complex nonlinear relationships. In addition, the research data has time series characteristics and covers information at different time resolutions. Based on these research objectives and data characteristics, we selected the following three models:

As a mature statistical time series analysis tool, the SARIMA model has a deep theoretical foundation and extensive application practice in processing data with seasonality and trend. The PM_2.5 concentration data in Nanning urban area shows a significant seasonal variation pattern. For example, in summer, precipitation is relatively abundant, which has a flushing effect on pollutants and the concentration is relatively low. In winter, the atmospheric diffusion conditions are relatively poor, and pollutants are easy to accumulate, making the PM_2.5 concentration relatively high. In the long run, with the development of the city and the implementation of environmental protection policies, the PM_2.5 concentration also shows a specific trend of change.

The SARIMA model captures the seasonal and trend characteristics in the data through its autoregressive (AR), differencing (I) and moving average (MA) components. A reasonable setting of its non-seasonal parameters (p, d, q) and seasonal parameters (P, D, Q, S) helps to accurately model the time series of PM_2.5 concentration in Nanning urban area. In this study, the non-seasonal parameters are set to non_seasonal_order = (1, 1, 1) and the seasonal parameters to seasonal_order = (1, 1, 1) with S = 12, that is, (p, d, q) = (1, 1, 1) and (P, D, Q, S) = (1, 1, 1, 12). This setting enables the model to consider both the seasonal fluctuations of PM_2.5 concentration on a monthly scale and the long-term trend of the data. By learning from historical data, the SARIMA model can uncover potential patterns and provide a basis for predicting future changes in PM_2.5 concentration. Therefore, choosing the SARIMA model helps analyze the long-term trend of PM_2.5 concentration in Nanning urban area and supports long-term monitoring and prediction of air quality.
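For reference, a SARIMA model with non-seasonal order (1, 1, 1), seasonal order (1, 1, 1) and period 12 can be written in the usual backshift-operator form. The notation below is the standard textbook one and is an assumption here, since it does not appear in the article: B is the backshift operator (B y_t = y_{t-1}), φ_1 and Φ_1 are the non-seasonal and seasonal AR coefficients, θ_1 and Θ_1 the non-seasonal and seasonal MA coefficients, and ε_t is white noise.

```latex
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12})\,(1 - B)\,(1 - B^{12})\, y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\, \varepsilon_t
```

The factors (1 − B) and (1 − B^{12}) correspond to d = 1 and D = 1: the series is differenced once against the previous month and once against the same month of the previous year before the AR and MA terms are fitted.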

The Prophet model was developed by Facebook and is a time series prediction tool specifically for processing data containing seasonal and holiday effects. The PM_2.5 concentration in Nanning urban area is not only affected by seasonal changes, but also significantly changes due to changes in human activities during holidays. For example, during the National Day holiday, changes in traffic flow and residents’ activity patterns will lead to an increase in PM_2.5 emissions, which in turn affects the concentration level.

The additive model structure of the Prophet model, y(t) = g(t) + s(t) + h(t), enables it to model the trend g(t), the seasonality s(t), and the holiday effects h(t) separately. The model can automatically identify and fit seasonal patterns in the data without pre-setting complex seasonal parameters. At the same time, users can customize holidays or special events according to actual conditions, so as to more accurately capture the impact of these factors on PM_2.5 concentration. This flexibility and the ability to handle seasonal and special events are consistent with the characteristics of PM_2.5 concentration changes in Nanning urban area. Therefore, choosing the Prophet model can better analyze the impact of seasonality and special events on PM_2.5 concentration in Nanning urban area and improve the accuracy of prediction, especially during special periods such as holidays.
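The additive structure y(t) = g(t) + s(t) + h(t) can be illustrated with a toy example. The sketch below is not Prophet's implementation: it assumes a straight-line trend, a single first-order Fourier term with a 12-month period, and a constant holiday offset, and every coefficient value is hypothetical. Prophet itself fits more flexible forms of each component.

```python
import math

def trend(t, slope=-0.1, intercept=40.0):
    """g(t): a simple linear trend (hypothetical coefficients)."""
    return intercept + slope * t

def seasonality(t, period=12.0, a=8.0, b=0.0):
    """s(t): one Fourier term with a yearly period on monthly data."""
    angle = 2.0 * math.pi * t / period
    return a * math.cos(angle) + b * math.sin(angle)

def holiday(t, holiday_months, effect=3.0):
    """h(t): a fixed offset in months flagged as containing a holiday."""
    return effect if t in holiday_months else 0.0

def additive_forecast(t, holiday_months=frozenset()):
    """y(t) = g(t) + s(t) + h(t)."""
    return trend(t) + seasonality(t) + holiday(t, holiday_months)

# Month index 0 without and with a (hypothetical) holiday effect.
plain = additive_forecast(0)
with_holiday = additive_forecast(0, holiday_months={0})
```

Because the components simply add, a change in one of them (for example a holiday offset) shifts the forecast by exactly that amount without altering the trend or seasonal terms.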

The LightGBM model is an efficient machine learning framework based on gradient boosting decision trees, which has unique advantages in processing large-scale data and mining complex nonlinear relationships. The change in PM_2.5 concentration in Nanning urban area is the result of the combined action of multiple complex factors. The relationship between these factors is intricate and nonlinear, and it is difficult to accurately describe it with traditional linear models. For example, the relationship between temperature, humidity, wind speed, wind direction and other meteorological factors and PM_2.5 concentration is not a simple linear relationship. Topographic features can also have complex indirect effects on PM_2.5 concentration by affecting the diffusion and transmission of pollutants.

By constructing a gradient boosting decision tree, the LightGBM model can automatically learn complex patterns and relationships in the data and effectively mine these potential nonlinear relationships. When processing large-scale data, it exhibits high computational efficiency and low memory consumption, and is suitable for modeling long-term, high-resolution PM_2.5 concentration data in Nanning urban area. In addition, the LightGBM model has good scalability and flexibility, and the model performance can be optimized by adjusting parameters. Based on these advantages, the selection of the LightGBM model helps to reveal the complex driving mechanism behind the changes in PM_2.5 concentration in Nanning urban area, improve the accuracy of predictions, and provide strong support for more accurate predictions of PM_2.5 concentrations.
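The core idea of gradient boosting, on which LightGBM builds, is to add small trees one after another, each fitted to the residuals of the current ensemble. The sketch below shows this idea with one-split trees (stumps) on a single feature and squared-error loss. It is a teaching illustration only: it does not reproduce LightGBM's tree-growing strategy or the features used in the study, and the function names and data values are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, rounds=100, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        pred = [pi + learning_rate * (left_mean if xi <= threshold else right_mean)
                for xi, pi in zip(x, pred)]
    return {"base": base, "rate": learning_rate, "stumps": stumps}

def predict(model, xi):
    """Sum the shrunken stump outputs on top of the base prediction."""
    total = sum(left if xi <= t else right for t, left, right in model["stumps"])
    return model["base"] + model["rate"] * total

# Hypothetical U-shaped monthly series in ug/m^3 (illustration only).
months = list(range(1, 13))
pm25 = [45.0, 40.0, 30.0, 25.0, 20.0, 18.0, 17.0, 19.0, 24.0, 30.0, 38.0, 46.0]
model = boost(months, pm25)
fitted = [predict(model, m) for m in months]
```

Even with stumps, the ensemble follows the nonlinear U-shaped pattern that a single straight line could not represent, which illustrates why boosted trees are suited to nonlinear relationships.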

5.2. Comprehensive Discussion of Model Performance

Our study used LightGBM, Prophet and SARIMA models to predict the PM_2.5 concentration in Nanning urban area in 2023, and compared and analyzed the performance of each model from many aspects:

Spatial dimension: In terms of the spatial distribution prediction of PM_2.5 concentration, our study used ArcMap software (version 10.8) and ArcGIS Pro software (version 3.1) to conduct geographic visualization analysis of the raster data prediction results of the three models. Among them, the LightGBM model showed certain advantages. It has a relatively strong ability to capture spatial distribution and can roughly delineate the scope of high-concentration areas. However, there are deficiencies in the internal details of high-concentration areas, and the predicted values are sometimes too high, which cannot accurately reflect the actual concentration changes in high-pollution areas. The Prophet model has obvious shortcomings in spatial prediction. There is a deviation between its predicted spatial position distribution and the actual situation. The predicted values in some areas are too low, and its ability to deal with concentration gradient changes is limited, making it difficult to accurately present the concentration change trend in space. Although the SARIMA model can outline the general outline of the concentration distribution and show the overall trend, it does not accurately grasp the spatial change characteristics of some areas, and the prediction accuracy in these areas needs to be improved.

Time dimension and stability in different months: From the perspective of time series trend prediction, the performance of each model also differs. With its strong learning ability, the LightGBM model can better capture the complex change trends in historical data, such as the pattern of PM_2.5 concentration being high at the beginning of the year, then decreasing, and rising again at the end of the year. However, in actual predictions, its overall results deviate greatly from the true values, and there is a general “overestimation” phenomenon, which is most pronounced in winter months with poor atmospheric diffusion conditions. The Prophet model can handle seasonality and trend in time series, but it over-relies on seasonal patterns in historical data during prediction. In the face of the more complex concentration changes in May–July, the model is not adaptable enough and the prediction effect is not ideal. In the concentration decline stage in particular, the predicted values deviate significantly from the original data, showing an “underestimation” trend. As a classic time series model, the SARIMA model can reflect the trend of PM_2.5 concentration in some months. However, due to its high requirements for data stationarity, it adapts poorly to the non-stationary data caused by various factors in 2023, and its forecast performance in January, May, September and November is poor. Nevertheless, compared with the other two models, the overall deviation of the SARIMA model is relatively low.

Accuracy and reliability: Analysis of the model performance indicators shows that the models differ in accuracy and reliability. The performance of the LightGBM model fluctuates greatly between months. In some months (such as May), its error indicators (MSE, RMSE, MAE) are low, the prediction effect is good, and it reflects the actual concentration more accurately, but in January, October and December the error indicators are high and the prediction error is large. From May to July, the MSE, RMSE and MAE values of the Prophet model are significantly larger, indicating that the model predicts poorly during this period, while in other months its performance is relatively stable. The coefficient of determination (R²) of the SARIMA model shows that it fits the data relatively well, which means that the model can explain the variation of the data to a certain extent. However, in some months its error indicators are still high, indicating that there is room for improvement in its prediction accuracy.
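The error indicators referred to above follow their standard definitions, which can be sketched as follows. The function name and the sample values are hypothetical and serve only to show the calculation.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical values in ug/m^3 (illustration only).
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```

MSE and RMSE penalize large individual errors more heavily than MAE, so a month with a few badly missed grid cells can show a high RMSE even when its MAE is moderate.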

In summary, the LightGBM model has advantages in capturing complex nonlinear relationships, but its stability is poor. The Prophet model has methods to deal with seasonality and trends, but it lacks adaptability to complex changes and is easily affected by special circumstances. The SARIMA model has a theoretical basis for time series prediction, but it has limitations in dealing with non-stationary data and spatial heterogeneity.

5.3. Analysis of Influencing Factors on PM_2.5 Prediction

Meteorological conditions: Meteorological factors have a significant impact on PM_2.5 concentration. Changes in temperature alter the stability of the atmosphere and thereby affect the diffusion capacity of pollutants. Humidity not only affects the hygroscopic growth of particulate matter but also participates in some chemical reactions. Wind speed and direction determine the direction and speed of pollutant transmission. Precipitation acts as nature’s “cleaner” and can effectively wash PM_2.5 out of the air. For example, in winter the temperature is low, the atmosphere is stable, and the diffusion conditions of pollutants are poor, so PM_2.5 accumulates easily and its concentration increases, whereas in summer abundant precipitation can significantly reduce the concentration of PM_2.5. Although our research uses monthly raster data to capture certain seasonal characteristics, it does not track changes in meteorological elements in real time. In months with changeable weather, such as when cold and warm air masses meet, the meteorological conditions are complex. Without real-time meteorological data, it is difficult for the models to accurately simulate the diffusion, transmission and transformation of pollutants, which increases the prediction errors.

Anthropogenic emission sources: Anthropogenic emissions are an important source of PM_2.5, covering industrial activities, traffic exhaust, coal-fired heating and many other aspects. Taking the Nanning urban area as an example, with the rapid development of urbanization, the scale of industry continues to expand, the number of motor vehicles has increased significantly, and the emission sources have become increasingly diversified. However, the models in our study failed to fully consider the dynamic changes in emission sources. Factories will adjust production according to orders and production plans, and traffic flow will fluctuate with time, weather and special events. These changes will directly lead to changes in PM_2.5 emissions. However, the model does not take these real-time changes into account, and cannot accurately capture the changes in PM_2.5 concentration caused by fluctuations in anthropogenic emissions, reducing the accuracy of the prediction.

Topographic and geomorphic features: The topography and geomorphic features of the Nanning urban area have a significant impact on PM_2.5 concentration. The basin-shaped terrain, centered on the Yongjiang River Valley and surrounded by mountains on three sides, hinders the diffusion of atmospheric pollutants, so that PM_2.5 accumulates easily in the urban area. The models in our study were not optimized with local topographic and geomorphic information during construction. Since the models cannot reflect the hindering effect of the terrain on the diffusion of pollutants, deviations occur when simulating the transmission and distribution of pollutants, which in turn affects the prediction accuracy of PM_2.5 concentration.

Data uncertainty: Data quality is crucial to the accuracy of model predictions, but there are some problems with the data in our study. Although monthly raster data are used to reflect the concentration changes in time and space to a certain extent, there are still limitations. On the one hand, the data has a limited time span and it is difficult to cover various scenarios of environmental changes, which limits the model’s ability to learn the laws of long-term changes. On the other hand, there is a lag in data updates, and it is impossible to timely reflect the latest changes such as new pollution sources and traffic control. In addition, the granularity of monthly raster data is coarse, and it is difficult to accurately present the details of concentration changes in local areas in the short term. Moreover, the data lacks key fine-grained features such as meteorology and traffic, such as hourly wind speed and traffic volume on specific sections of roads. The uncertainty of these data makes it difficult for the model to capture complex environmental changes and affects the accurate prediction of PM_2.5 concentrations.

To improve the accuracy of predictions, future studies will consider refining meteorological data, incorporating real-time updated data on anthropogenic emission sources, and optimizing the model in combination with topographic information of the study area to reduce the impact of data uncertainty.

5.4. Model Application and Practical Significance

In the PM_2.5 concentration prediction work, different models are suitable for different scenarios, and there are complementary characteristics between each other. In terms of short-term prediction, although the LightGBM model has errors, it performs well in quickly locating the scope of high-concentration areas and can provide preliminary pollution area distribution information for emergency response. The Prophet model has high prediction stability in some months and is good at dealing with seasonality and special events. When special activities are clearly arranged, it can effectively predict short-term concentration changes. Combining the two, LightGBM first determines the scope of high-concentration areas, and Prophet then accurately predicts the concentration fluctuations in these areas during special periods, which can improve the accuracy and completeness of short-term predictions. In long-term trend analysis, the SARIMA model can effectively identify the long-term change law of PM_2.5 concentration by virtue of its ability to accurately capture time series trends. However, it is insufficient in dealing with complex nonlinear relationships and special events. At this time, combined with LightGBM’s ability to mine complex relationships and Prophet’s ability to deal with fluctuations in special events, long-term trend analysis can be improved to provide a more reliable basis for long-term monitoring and policy making. In the scenario of regional pollution warning, a single model is difficult to meet the needs. By comprehensively utilizing the spatial capture capability of LightGBM, the special event processing capability of Prophet, and the trend analysis capability of SARIMA, a composite early warning system can be constructed, which can significantly improve the accuracy of early warnings and provide strong support for timely prevention and control of pollution.

In practical applications, appropriate models should be flexibly selected or multiple model results should be integrated according to different scenarios. In daily monitoring work, the SARIMA model is used as the basis for analyzing long-term trends, the Prophet model is used to focus on concentration changes in special periods, and the LightGBM model is used to supplement spatial distribution information to achieve comprehensive and continuous monitoring of air quality. At the same time, model fusion techniques such as weighted averaging and stacking generalization are used to combine the advantages of each model to further improve the reliability and accuracy of the prediction.
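As a hedged illustration of the weighted-averaging idea mentioned above, the sketch below assigns each model a weight inversely proportional to its error and combines the three predictions for one grid cell. The weighting rule, the function names and all numbers are assumptions for illustration; the article does not specify a fusion scheme, and in practice the weights would be estimated on held-out validation data rather than on the test year.

```python
def inverse_error_weights(mse_by_model):
    """Weights proportional to 1 / MSE, normalised to sum to 1."""
    inverse = {name: 1.0 / mse for name, mse in mse_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def fuse(pred_by_model, weights):
    """Weighted average of the individual model predictions."""
    return sum(weights[name] * pred for name, pred in pred_by_model.items())

# Hypothetical validation errors and predictions (illustration only).
validation_mse = {"SARIMA": 10.0, "Prophet": 40.0, "LightGBM": 20.0}
predictions = {"SARIMA": 30.0, "Prophet": 24.0, "LightGBM": 36.0}

weights = inverse_error_weights(validation_mse)
fused = fuse(predictions, weights)
```

With this rule, the model with the smallest validation error contributes most to the fused value, and the result always lies between the lowest and highest individual predictions.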

Our research results provide a scientific basis for the formulation of environmental protection policies. Based on the model predictions, differentiated prevention and control strategies can be formulated for different regions and seasons [80]. For example, in industrial areas, supervision can be strengthened, emission standards can be strictly enforced, and clean production technologies can be promoted; in traffic-intensive areas, traffic management can be optimized, public transportation can be encouraged, and new energy vehicles can be promoted. By accurately predicting changes in air quality during special periods, traffic control, enterprise production restrictions and other measures can be formulated in advance to reduce pollution emissions. In addition, accurate prediction results can help the public understand the air quality status in a timely manner so that they can take corresponding protective measures, such as wearing masks and reducing outdoor activities, to protect public health.

5.5. Research Limitations

Data: The data has a limited time span and lacks a real-time update mechanism, which makes the models less adaptable to new data. The study relies on a single data set and does not integrate multi-source data, making it difficult to fully reflect influencing factors. The data granularity is coarse, and fine-grained and auxiliary features are not included in model training. The chemical composition of PM_2.5 is not considered, which limits the in-depth understanding of the nature of pollution and the accuracy of prediction.

Model: Model parameters are mostly set empirically; there is no automatic tuning mechanism, and the parameters cannot be adjusted adaptively. Model assumptions may not match the actual situation and do not fully account for conditions shaped by multiple interacting factors. For example, the SARIMA model has high requirements for data stationarity and is difficult to adapt to the complex and changeable actual environment. Some models (such as LightGBM) have poor interpretability, which is not conducive to an in-depth understanding of the prediction process.

Research methods: Although the evaluation indicators can reflect the performance of the model, they have limitations and cannot comprehensively evaluate the performance of the model in different scenarios. The research did not fully consider practical problems such as monitoring equipment errors and data transmission delays, nor did it fine-tune and adjust the model for different regions.

These deficiencies may limit the accuracy of model predictions and affect the reliability and universality of research results. Future research needs to improve in these aspects.

5.6. Comparative Analysis and Innovation of Research Findings

Compared with other related studies, our study is unique in several ways, which have different impacts on the results:

Model selection: Our research is innovative in model selection. Unlike most studies that focus on a single type of model, our research compares a statistical model (SARIMA), a model that handles seasonal and holiday effects (Prophet), and a machine learning model (LightGBM). Studies that use only machine learning models can mine complex relationships, but they often ignore the precise analysis of data seasonality and trends. Our research addresses this limitation by systematically comparing different types of models to comprehensively evaluate their performance in PM_2.5 concentration prediction. This not only shows the advantages of each model, such as SARIMA’s ability to capture time series trends, Prophet’s advantages in handling seasonality and special events, and LightGBM’s strength in mining complex nonlinear relationships, but also clarifies the disadvantages of each model, providing a basis for subsequent studies to select the most appropriate model according to different needs. This multi-model comparison offers new ideas for PM_2.5 concentration prediction research.

Data application: Our research is also innovative in data application. We selected the China High-Resolution and High-Quality PM_2.5 Dataset (CHAP), which provides raster data at 1 km resolution, and combined two time resolutions, yearly and monthly, to bring a new perspective to the research. With this dual-time resolution data, the yearly data present the long-term trend of PM_2.5 concentration from a macro perspective and help analyze the impact of long-term factors such as urban development and industrial structure adjustment on concentration. The monthly data focus on short-term changes and capture concentration fluctuations caused by seasonal factors, such as the differing impact of meteorological conditions on concentration in different seasons. This approach provides a comprehensive and detailed view of the time dimension, which is conducive to in-depth analysis of the changing characteristics and mechanisms of PM_2.5 concentration at different time scales. Compared with the use of data with a single time resolution, it can more comprehensively reveal the pattern of PM_2.5 concentration changes.

Research area and time scale: Our study selected Nanning urban area as the research area, a choice that is both novel and targeted. Nanning urban area has distinctive geographical, climatic and pollution source characteristics. It has a subtropical monsoon climate with distinct dry and wet seasons and diverse pollution sources, which provides rich and distinctive samples for research. Environmental conditions differ greatly between regions. For example, winter coal-fired heating in northern cities has a significant impact on PM_2.5 concentration, which is very different from the composition of pollution sources in Nanning urban area. Studying Nanning urban area therefore not only deepens understanding of how PM_2.5 concentration varies in the region, but also provides a practical case for applying the models in a distinctive environment. In terms of time scale, we selected data from 2012 to 2023, a period that covers many changes, including urban development and the implementation of environmental protection policies. Data from this period provide rich information on environmental change for model training and prediction, help explore how environmental changes over different time ranges affect model performance, and offer a reference for time scale selection and data processing in subsequent research.

6. Conclusions and Outlook

Our study systematically evaluated the performance of three models, SARIMA, Prophet and LightGBM, in predicting PM_2.5 concentration in Nanning urban area in 2023. The study found that each model has its own advantages and limitations. SARIMA is good at capturing time series changes and is suitable for long-term monitoring, but adapts poorly to non-stationary data. Prophet can handle the seasonality of time series and the impact of special events, but predicts poorly under extreme events and tends to underestimate high concentrations. LightGBM has strong feature extraction capabilities and can mine complex nonlinear relationships, but is prone to overfitting and has poor stability. In terms of spatial prediction, LightGBM can roughly delineate high-concentration areas, but the details are not accurate enough. Prophet’s spatial positioning is inaccurate, and its predicted values are too low in some areas. SARIMA shows the overall trend but does not sufficiently capture the spatial characteristics of some areas. In the time dimension, LightGBM has a large prediction error in winter, and Prophet performs poorly from May to July. SARIMA is constrained by the stationarity of the data, and its prediction performance decreases in January, May, September and November, although the overall deviation is relatively small. In addition, the error distributions of all three models are non-normal, indicating that the models have limitations in processing PM_2.5 concentration data and need further optimization.
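The non-normality of the prediction errors was assessed in the article with histograms and normal probability plots. As a complementary and purely illustrative sketch, the departure from normality can also be summarized numerically with the Jarque–Bera statistic, which combines sample skewness and kurtosis. This statistic is a technique swapped in here for illustration and is not the authors' method. The function name and the sample errors are hypothetical.

```python
import numpy as np

# 5% critical value of the chi-square distribution with 2 degrees of freedom.
CHI2_2DF_CRITICAL_5PCT = 5.991


def jarque_bera(errors):
    """Return (JB statistic, skewness, kurtosis) for a 1-D sample of errors."""
    e = np.asarray(errors, dtype=float).ravel()
    n = e.size
    centered = e - e.mean()
    m2 = np.mean(centered ** 2)   # second central moment (biased variance)
    m3 = np.mean(centered ** 3)   # third central moment
    m4 = np.mean(centered ** 4)   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2       # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return float(jb), float(skewness), float(kurtosis)


# Hypothetical error samples (predicted minus observed), for illustration only.
symmetric_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed_errors = [0.0] * 9 + [10.0]  # one large overestimate

jb_sym, skew_sym, kurt_sym = jarque_bera(symmetric_errors)
jb_skew, skew_skew, kurt_skew = jarque_bera(skewed_errors)
print(jb_sym > CHI2_2DF_CRITICAL_5PCT, jb_skew > CHI2_2DF_CRITICAL_5PCT)
```

The chi-square reference distribution is only an asymptotic approximation, so for small samples the statistic is best read alongside the plots, not as a replacement for them.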

In the future, our research will focus on improving the accuracy of PM_2.5 concentration prediction and on expanding the depth and breadth of the work. First, we will use genetic algorithms, particle swarm optimization and similar methods to tune model parameters automatically and optimize model structure. Second, we will explore integrating SARIMA, Prophet, LightGBM and neural networks so as to combine the advantages of each model, and build a real-time air quality prediction system that integrates multi-source data such as ground monitoring, satellite remote sensing, transportation, and industry. Third, we will make deeper use of meteorological data, carry out PM_2.5 chemical composition prediction experiments, and establish quantitative relationships between composition and related factors. Finally, we will establish a high-resolution model based on regional characteristics that considers the interaction of multi-scale factors, and use GIS technology to visualize the prediction results in support of urban planning and environmental management decisions.

Author Contributions

Conceptualization: B.L., M.L. and N.Y.; Methodology: B.L. and M.C.; Software: B.L. and M.C.; Formal Analysis: B.L. and M.C.; Investigation: B.L. and M.C.; Resources: B.L. and M.L.; Data Curation: B.L. and N.Y.; Writing—Original Draft Preparation: B.L. and M.L.; Writing—Review and Editing: B.L. and M.C.; Visualization: B.L.; Supervision: N.Y. and M.C.; Project Administration: B.L. and N.Y.; Funding Acquisition: B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Zhuang Autonomous Region Human Resources and Social Security Department 2024 Young Scientist Program Scientific Research Start-up Fund (No. 60203038919630213), the key research development project of Guangxi (GK: AB22080101), the Guangxi Innovation and Entrepreneurship Training Program for College Students (S202410603072), and the Nanning Normal University Doctoral Research Startup Project (No. 602021239447).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Acknowledgments

During the preparation of this manuscript, the authors used Doubao (1.42.6_win) to polish the introduction and discussion sections and to check the grammar of the entire text. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. A schematic diagram of the study area (arrows indicate the spatial scale transition from a larger to a smaller scope). (a) The geographical location of Nanning Urban Area within Guangxi, (b) The location of Nanning Urban Area within Nanning City, (c) Distribution map of PM_2.5 concentration in Nanning Urban Area in 2023.

Figure 2. Overall Process Flow Diagram of the Research.

Figure 3. Trimming of Original Data (The arrow represents the process of cropping from the original data range to the study area range).

Figure 4. Model Training and Prediction Workflow.

Figure 5. Comparison of Observed and Predicted PM_2.5 for 2023.

Figure 6. Comparison Chart of Observed versus Predicted PM_2.5 for Selected Months in 2023.

Figure 7. Monthly Average PM_2.5 concentration from 2012 to 2023.

Figure 8. Monthly Average PM_2.5 concentration in 2023 by Model.

Figure 9. Monthly Heatmap of PM_2.5 Prediction Error Metrics in 2023.

Figure 10. Scatter Plots of PM_2.5 Predictions in 2023.

Figure 11. Box plots of PM_2.5 Data (Original & Predicted) in 2023.

Figure 12. Visualizing 2023 PM_2.5 Prediction Errors: Histograms and Normal Probability Plots of 3 Models.

Table 1. Data Sources.

Dataset	Time Range	Spatial Resolution	Temporal Resolution	Period Used
ChinaHighAirPollutants (CHAP)	2000–2023	1 km	Day	Not used
			Month	Jan 2012–Dec 2023
			Year	2012–2023

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, M.; Liu, B.; Liang, M.; Yao, N. Decoding PM_2.5 Prediction in Nanning Urban Area, China: Unraveling Model Superiorities and Drawbacks Through SARIMA, Prophet, and LightGBM. Algorithms 2025, 18, 167. https://doi.org/10.3390/a18030167

