Next Article in Journal
Intra-Seasonal Features of Winter Extreme Cold Events in Northeast–North China and Synergistic Effects of Circulation Systems in Mid-High Latitude
Next Article in Special Issue
Air Quality Improvement in China: Evidence from PM2.5 Concentrations in Five Urban Agglomerations, 2000–2021
Previous Article in Journal
Population and GDP Exposure to Extreme Precipitation Events on Loess Plateau under the 1.5 °C Global Warming Level
Previous Article in Special Issue
Demarcation of Coordinated Prevention and Control Regions in the Yangtze River Delta Based on Spatio-Temporal Variations in PM2.5 and O3 Concentrations
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Analysis of Atmospheric Pollutants and Meteorological Factors on PM2.5 Concentration and Temporal Variations in Harbin

School of Energy Science and Engineering, Harbin Institute of Technology, 92 West Dazhi Street, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Atmosphere 2022, 13(9), 1426; https://doi.org/10.3390/atmos13091426
Submission received: 26 July 2022 / Revised: 18 August 2022 / Accepted: 31 August 2022 / Published: 2 September 2022

Abstract

:
With rapid economic development, the problem of air pollution has become increasingly prominent. Countries have paid attention to PM2.5, one of the main air pollutants, and have gradually addressed this issue. Based on the 2015–2019 air quality data, meteorological data, and aerosol optical depth data from Harbin, China, this study investigated the relationship between PM2.5, a number of influencing factors, and their temporal changes using a machine-learning method. It can be seen from the analysis that the random forest model can predict PM2.5 concentration. In this model, the mean RH and AOD have a high impact on PM2.5 concentration, but there was negligent correlation with PM2.5. The results indicated that the level of PM2.5 pollution continuously decreased from 2015 to 2019, and there were significant seasonal differences in PM2.5 concentration and its variations. In 2019, due to the impact of heating and adverse meteorological conditions, PM2.5 pollution during the heating period increased significantly. This study provides theoretical and data support for the analysis of PM2.5 pollution in Harbin and formulation of air pollution control policies.

1. Introduction

In recent years, with the continuous advancement of industrialization and urbanization, polluting gases from fuel combustion and fugitive dust [1] from traffic and construction have triggered frequent occurrence of haze pollution globally under unfavorable meteorological conditions of diffusion. In haze-polluted environments, PM2.5 is one of the most significant pollutants [2]. It increases travel barriers, impacts economic development, and adversely affects human health [3]. Presently, strategies to improve the management of PM2.5 have been launched worldwide. In China, from 2016 to 2020, the 13th Five-Year Plan was implemented with the aim of controlling the precursors of fine particulate matter, including SO2 and NOx [4]. At the same time, under the China’s existing carbon peak and carbon-neutral goal, effective management of PM2.5 has also become an important objective. To effectively control PM2.5, it is crucial to analyze air pollutants and meteorological factors closely associated with it.
Numerous studies have been conducted with the goal of identifying means to effectively prevent and control air pollution. Many factors affect the concentration of PM2.5, including social, economic, and meteorological factors, and its interaction with other pollutants. By definition, PM2.5 resides in the atmosphere, so changes in meteorological conditions directly affect its diffusion, transformation, and deposition. The impacts of other factors on PM2.5 are essentially the same as those on other manmade pollutants. Therefore, scholars have focused their research on the following: meteorological conditions and other pollutants. For example, Cifuentes et al. [5] found that solar radiation and temperature were the most significant predictors using statistical models. Park et al. [6] found that meteorological conditions such as wind and turbulent motion at the surface were sensitive to PM2.5 concentrations. Greatly reduced PM2.5 concentrations were mainly influenced by synoptic rather than local conditions. Li et al. [7] found that the daily variation in the PM2.5 concentration is related to weather conditions. Chen et al. [8] found obvious seasonal and regional differences in the effect of meteorological factors on the concentration of PM2.5. The effects of temperature, humidity, and wind on the concentration of PM2.5 were greater than those of other meteorological factors. Jing et al. [9] found significant temporal and spatial differences in the impact of meteorological conditions and precursors of human activities on PM2.5. Zheng et al. [10] elucidated that PM10, SO2, NO2, and CO are the main factors affecting the concentration of PM2.5, while meteorological factors and O3 are secondary factors affecting its concentration. Luo [11] found that climatic conditions, NO2, and O3 are the Granger causes of PM2.5. Zhang et al. [12] employed diverse statistical methods to evaluate the regional and seasonal differences of PM2.5 concentrations based on long-term air quality data. Shao et al. [13] evaluated the influence of typical adverse meteorological conditions in Tianjin on PM2.5 and observed that an increase in wind speed and planetary boundary height reduced the PM2.5 concentration, with inversion having the most significant effect on PM2.5. Therefore, to predict PM2.5 concentrations more accurately, studies must consider meteorological variables such as temperature, wind direction and speed, rainfall, and other air pollutants, such as SO2, CO, O3, NO2, and PM10.
Because of the uncertainty in the variation of PM2.5 concentration and the diversity of the influencing factors, it is of great scientific significance to explore an effective PM2.5 concentration prediction model. A PM2.5 concentration forward prediction model based on physical principles, starting from meteorological factors and pollution conditions, has many shortcomings owing to the imperfect initial state, inaccurate lower boundary conditions, and approximation of physical parameters [14]. At the same time, with the development of computer technology, data-driven statistical methods have attracted increasing attention. Among them, machine learning has received extensive attention because of its advantages in automatically improving algorithms through experience [15]. Some typical nonlinear machine-learning models, such as random forests and neural networks, have exhibited good predictive performance [16]. The integration algorithm is not a single machine-learning algorithm. It builds and combines multiple machine learners, creating a strong learner to complete a task. Random forest is a type of integration algorithm that comprises a large number of individual decision trees that operate as an ensemble [17]. The key difference between the neural network method and the conventional parametric model method is that the former is a data-driven adaptive technique and does not require any prior assumptions on the problem model. When the internal rules of problem solving are unknown, neurons can acquire the hidden functional relationships between the data by learning and training samples [18]. It is appropriate for solving problems that are challenging to explain using hypotheses and existing theories but have sufficient data and observed variables. Owing to its outstanding learning ability, machine learning has been widely used for predicting PM2.5. Wang et al. [19] introduced gaseous pollutant data (NO2, SO2, CO, and O3) as predictors into a deep neural network model to predict the PM2.5 concentration in the missing AOD (aerosol optical depth) area, reducing the estimation bias caused by insufficient AOD. Wang selected a single category of predictors, considering only gaseous pollutants. Yang et al. [20] developed a random forest model that integrated TOA reflectance, meteorological field, and land use variables to evaluate ground PM2.5 and demonstrated its excellent performance. Yang emphasized the impact of ground-level factors on PM2.5, selected meteorological factors, and land use variables. Zheng et al. [21] selected PM10, SO2, NO2, temperature, pressure, humidity, wind direction, and wind speed as influencing factors. A prediction model based on a radial basis function (RBF) neural network was used to predict PM2.5. The results showed that the model has significant advantages. Zheng used a more comprehensive combination of gaseous pollutants and meteorological factors as predictors. Shi et al. [22] proposed a neural network method based on the attention mechanism to predict the 24-h PM2.5 concentration. He found that the model had a better predictive effect with respect to the root-mean-square error (RMSE) and the mean absolute error (MAE) evaluation indicators. Lu et al. [23] established an enhanced deep neural network model to estimate the hourly PM2.5 concentration of the Yangtze River Delta Urban Agglomeration. The results demonstrate that this model can significantly improve the estimation accuracy of the algorithm. Du et al. [24] designed a hybrid deep-learning framework of one-dimensional convolutional neural networks and bidirectional long short-term memory networks to handle PM2.5 air pollution forecasts with suitable accuracy. Czernecki et al. [25] tested four machine-learning models and confirmed the high applicability of machine learning to short-term air quality prediction with the perfect approach employing analysis and cross-validation. The above studies focused on enhancing the existing model to achieve better prediction accuracy and performance, ignoring the interpretability of the model and not studying the various factors affecting PM2.5.
Most of the existing PM2.5 studies in China have focused on the Beijing–Tianjin–Hebei, Yangtze River Delta, and Pearl River Delta regions [26,27,28], and only a few have focused on the Northeast region. Northeast China has a high latitude and a mid-temperate continental monsoon climate with four distinct seasons. The summer is short, hot, and rainy, while the winter is long, cold, and dry. Development has mainly focused on heavy industries that consume a significant amount of energy. The heating period is long, and a large amount of straw is burned in the open air in the surrounding rural areas, causing frequent smog in winter. Harbin is a typical representative of northeast cities with typical characteristics of air pollution in Northeast China. The evaluation of the current situation with PM2.5 pollution in Harbin is easy for the public to have a better and more intuitive understanding of the current situation of the air environment. This can also provide a basis for environmental management departments and government decision-making bodies to conduct air environment control in a targeted manner. It is of utmost significance to formulate urban development plans and maintain sustainable economic development.
We used the random forest model to study the influence of air quality data, meteorological data and AOD on the PM2.5 concentration changes and quantify the different contributions of these factors. A prediction model was established using air quality, meteorological, and AOD data of the Harbin City from 2015 to 2019 as the predictors and the PM2.5 concentration as the outcome. Under this model, the SHAP method was used to establish the critical influencing factors of PM2.5, and the effect of each factor on PM2.5 was quantified. The Pearson algorithm was used to analyze the correlation between each factor and PM2.5. The inter- and intra-annual variations in PM2.5 concentration were analyzed, and multiple reasons for the variation in the PM2.5 concentration in Harbin were investigated. This can provide theoretical and data support for air pollution control and air quality management in Harbin.

2. Materials and Methods

2.1. Data

The following datasets were used in this study: air quality data, meteorological data, and AOD data.
Air quality data were obtained from China’s Air Quality Online Monitoring and Analysis Platform (https://www.aqistudy.cn/historydata/, accessed on 9 March 2021). They included the average daily concentrations of PM2.5, PM10, NO2, CO, SO2, and O3 in Harbin from 2015 to 2019 (CO, mg/m; other pollutants, μg/m3).
Meteorological data were obtained from a greenhouse gas data-sharing platform (http://data.sheshiyuanyi.com/, accessed on 12 March 2021). We obtained the daily mean temperature (mean TEMP, °C), and the total daily precipitation (PRE, mm), mean relative humidity (mean RH, %), daily mean pressure (mean PRS, hPa), daily sunshine duration (SSD, h), and daily mean wind speed (mean WIN, m/s) in Harbin from 2015 to 2019.
AOD data were obtained from the MODIS aerosol products from the Terra satellite of the National Aeronautics and Space Administration (NASA). We downloaded the data from 2015 to 2019 from the Level 1 Atmospheric Archive and Distribution System (LAADS, http://ladswebnascomnasagov/data/searchhtml, accessed on 15 March 2021). MODIS provides two AOD products with resolutions of 10 km and 3 km. In this study, the dark target and deep blue methods were combined to retrieve AOD data with a resolution of 10 km. The AOD data obtained by the Terra satellite at 10:30 am transit time were the average for the day.
The data used in this study are shown in Table 1.

2.2. Random Forest Prediction Model

Random forest (RF) is a machine-learning algorithm proposed by Breiman in 2001 [29]. Among the ensemble-learning algorithms, bagging and RF are two representative parallelization methods in which individual learners do not have strong dependence on each other; can be generated simultaneously. The principle of bagging is to extract a sampling set of several training samples from the sample set using the bootstrap method, train a weak learner based on this sampling set, repeat training many times, and then combine the trained multiple learners. The classification tasks are simply voted in the output of the prediction, and the regression tasks are simply averaged to obtain the final result. RF is an extended variant of bagging. Its algorithm principle is shown in Figure 1. First, a RF uses a decision tree based on the classification and regression tree (CART) algorithm as a weak learner, and the RF introduces a random selection of attributes to the training process of decision trees. For a traditional decision tree, an optimal feature is selected from all N sample features on the node to divide the left and right subtrees of the decision tree. However, RF randomly selects N s u b ( N s u b < N ) sample features on the node to select an optimal feature to divide the left and right subtrees of the decision tree. This further enhances the generalizability of the model. In each round of random sampling of bagging, for the kth tree, approximately 36.8% of the training data do not participate in the generation of the kth tree. These are called out-of-bag samples of the kth tree. These unselected data are not involved in modeling and can be used as verification data to evaluate the model performance.
In summary, RF uses the bootstrap method to generate a sample subset and randomly selects F features for node splitting to build a single regression decision subtree. The above process is repeated many times to construct T regression decision subtrees, and each tree grows freely without pruning to form a random forest. The final prediction result is determined by the mean value of all decision trees for sample training. A flowchart of the algorithm for the random forest is shown in Figure 2. The random forest method can deal with high-dimensional data and is not prone to overfitting owing to the introduction of sample randomness and feature randomness. Moreover, it achieves an unbiased estimate of the importance of each feature and obtains good results for default value problems. This method can be highly parallelized during training. It has advantages of speed for training large samples, strong adaptability to datasets, and high prediction accuracy.
In our study, to obtain the best prediction, a grid search was used to find the optimal parameters of the model. Grid search is an exhaustive search parameter-tuning method: the procedure is to loop through all the candidate parameters, try every possibility, and obtain the best-performing set of parameters. Some of the key parameter descriptions and settings of random forest used in the current analysis are shown in Table 2, and default values were adopted for the other parameters.

2.3. Data Analysis Method

Explainability refers to the degree to which humans grasp the purpose for their decisions. Machine learning is based on an algorithm that acquires potential patterns and relationships from data and ultimately generates decisions or predictions. The more explainable it is, the easier it is for people to understand the rationale for certain decisions or predictions. People are not only satisfied with the effect of the model, but also think more about the reasons for its effect. Such thinking helps to optimize the model and features and can also help to comprehend the model itself and enhance the quality of the model.
SHAP is an additive explanation model inspired by the Shapley value [30]. Its goal is to explain the prediction of an instance X by computing the contribution of each feature to the prediction. The feature values of a data instance act as “contributors”, and SHAP attributes the output values to the Shapely values of each feature. Shapely values are measures of contributions each feature has in a machine-learning model. The Shapley value for feature Xj in the model is as follows [31]:
  S h a p e l y ( X j ) = S N \ { j } k ! ( p k 1 ) ! p ! ( f ( S { j } ) f ( s ) )
where p is the total number of features, N \ { j } is the set of all the conceivable combinations of the features excluding X j , S is the feature set in N \ { j } , f(S) is the model prediction with features in S and f ( S { j } ) is the model prediction with features in S and X j . The interpretation of Equation (1) indicates that the Shapley value of a feature is its marginal contribution to the model prediction averaged over all potential models with diverse permutations of features. As the main obstacle to the widespread adoption of Shapley values is computation, Lundberg and Lee [30] created SHAP. It calculates the Shapley value of each feature value to measure the influence of feature values on the final output value, and specifies the explanation as follows:
g ( z ) = ϕ 0 + j = 1 M ϕ j Z j
where g is the explanation model, M is the number of input features, z { 0 , 1 } M is the simplified features and indicates whether the corresponding feature exists (1 or 0), ϕ j R is the feature attribution for feature j, the Shapley values, and ϕ 0 is usually the mean of the target variable for all the samples. For instance, with the X we are explaining, all the feature values are “present”, i.e., all the simplified features are 1. At this point, the above formula can be simplified as follows:
g ( z ) = ϕ 0 + j = 1 M ϕ j
It can be regarded as the contribution of the sum of Shapley values of each feature, which changes the predicted result from the mean to the predicted result, namely g ( z ) . The advantage of SHAP is that we can clearly see whether each feature has a positive or negative impact on the prediction.
The analysis reported in this paper adopted the SHAP library of Python. By calling this library, the random forest prediction model established in the current analysis can be explained.

3. Results and Discussion

3.1. Prediction Results and Model Evaluation

A random forest prediction model was established; it considered the atmospheric pollutants (PM10, NO2, CO, SO2, and O3), meteorological conditions (PRE, SSD, mean TEM, mean RH, mean PRS, and mean WIN), and AOD in Harbin from 2015 to 2019 as characteristic quantities. Seven out of ten of them were used as training data, and those remaining were used as test data. Daily concentrations of PM2.5 were the output of the model.
The results of fitting the test samples are shown in Figure 3. The density scatter diagram shows that the model had an excellent prediction ability. The sample points are roughly distributed on both sides of the straight line and are relatively clustered. The predicted PM2.5 concentrations obtained by the prediction model were very close to the actual PM2.5 concentrations. The fitting equation was Y = 0.97X + 1.89. R2 was as high as 0.94; RMSE and MAE were 13.23 μg/m3 and 6.69 μg/m3, respectively, i.e., relatively low and, thus, satisfactory. It can be seen from the results above that the adopted approach accurately predicted PM2.5 concentrations. A random forest model was also used to predict PM2.5 and obtained similar results with an R2 of 0.93 [32]. On this basis, the influence of different factors can be analyzed.

3.2. Influence of Predictive Factors on the PM2.5 Concentration

We used the SHAP method to explain the model results and the roles of the influencing factors (Figure 4). Figure 4a shows that the mean RH had the greatest influence on PM2.5 and its SHAP value was as high as 22.52. It was discovered that RH has an inverted U-shaped relationship with PM2.5 concentrations (peaking at RH = 45–70%) [33]. Dry (RH = 45–60%) and low-humidity (RH = 60–70%) conditions positively affected PM2.5 and exerted an accumulation effect. Under normal weather conditions, RH is mostly 45–60%; the SHAP value of RH is thus relatively high. The influence of AOD was the second highest, with a SHAP value of 8.32, which was less than one-third of the mean RH. The influence of the other factors (SSD, PRE, NO2, SO2, CO, mean TEMP, PM10, O3, mean WIN, and mean PRS) was lower, from 5.12 to 0.23.
Figure 4b demonstrates the effect of different values of each feature on PM2.5. With an increase in the values of the mean PRS and the mean RH, as well as of the NO2, CO, and PM10 concentrations, their influence on the PM2.5 concentration gradually changed from negative to positive. The turning points were found at around 1002 hPa (mean PRS), 70% (mean RH), 45 μg/m3 NO2, 1.1 mg/m3 CO, and 85 μ g/m3 PM10, respectively. The SHAP absolute values of the mean PRS and the mean RH changed a little around the 0 line, while the SHAP values of the NO2, CO, and PM10 concentrations increased significantly with an increase in their numerical values. The NO2 concentration increased slowly, the CO concentration increased more rapidly, and the PM10 concentration increased even faster. The influence of SSD and the mean TEMP on the PM2.5 concentration changed from positive to negative as their values increased, with turning points around 6 h (SSD) and 6 (mean TEMP). The absolute values of SHAP for both remained in a relatively stable range around the 0 line.
When PRE > 1.3 mm, its influence on the PM2.5 concentration was positive and the degree of influence remained relatively stable. When the mean WIN < 3.9 m/s, the influence on the PM2.5 concentration was mostly positive. When the mean WIN > 3.9 m/s, it began to have a negative influence on PM2.5. There was an obvious cutoff point for the influence of SO2 on PM2.5: when SO2 < 17 μ g/m3, it exerted a negative influence on PM2.5, with a drastic variation in effect. When SO2 > 17 μ g/m3, the impact became positive and the variation decreased slightly as the concentration increased. The influence of AOD on the PM2.5 concentration remained stable. The influence trend of the concentration of O3 first decreased and then increased, with a turning point at O3 = 60 μ g/m3.

3.3. Correlation between PM2.5 and Predictors

We used the Pearson correlation algorithm to calculate the correlation between PM2.5 and the studied predictors. The results are shown in Figure 5.
Among the major atmospheric pollutants, the O3 concentration was negatively but weakly correlated (−0.217) with PM2.5. The main reason is that an increase in the PM2.5 concentration can increase the scattering of solar radiation in the visible and near-infrared bands, thus reducing the photochemical rate and, finally, leading to a decrease in the O3 concentration. However, SSD and temperature had a more obvious impact on the total solar radiation energy; consequently, O3 had a stronger correlation with the mean TEMP and SSD, with correlation coefficients reaching 0.624 and 0.383, respectively. Significant positive correlations were observed between other pollutants and PM2.5, with coefficients ranging from 0.906 to 0.598 in the order of PM10  > NO2 > CO > SO2. In fact, PM10, NO2, CO, and SO2 concentrations were positively correlated, indicating that the emission sources of these pollutants were similar; for example, they may have been the burning of straw and coal [34]. In addition, the photochemical reactions of NO2, CO, and SO2 can generate nitrate and carbonate, which can lead to an increase in PM2.5 [35].
Among the meteorological factors, the mean PRS was positively correlated with the PM2.5 concentration, while the mean TEMP, mean WIN, SSD, and PRE were negatively correlated with the PM2.5 concentration. This is because when pollution occurs in a season with low temperatures, such as winter, high pressure and low wind speed promote a stable state for the near-surface atmosphere, which increases the strength of the thermal inversion layer and hinders the diffusion of pollutants [36]. A larger SSD can enhance the photolysis of organic carbon, which is a major component of particulate matter, thereby reducing the PM2.5 concentration [37]. A higher wind speed has a diffusion and dilution effect on PM2.5, reducing its concentration. More precipitation corresponds to a more significant removal of moisture, which is also an important reason for the reduction in the PM2.5 concentration [38,39]. The correlation coefficients between the mean RH and AOD and the PM2.5 concentration were less than 0.1 and could thus be considered irrelevant. According to a previous study [33], RH has an inverted U-shaped relationship with PM2.5 concentrations (peaking at RH = 45–70%). The RH fluctuated between 45–60% and the PM2.5 concentrations were always high, which may be attributed to a low correlation.
The hygroscopic growth of aerosol particles is an important factor in the estimation of particle concentration. The measurement of the PM2.5 concentration near the ground is achieved through a drying process, so the air quality data represent the concentration under dry air. However, AOD via satellite remote sensing was obtained by inversion under environmental RH conditions. Higher atmospheric humidity causes a hygroscopic growth of aerosol particles, which affects the composite refractive index, resulting in greater AOD values. Therefore, to reduce the uncertainty introduced by particle hygroscopic growth and the difference between AOD and the PM2.5 concentration in the monitoring environment and increase their correlation, we made an appropriate humidity correction for AOD. The revised formula [40] is as follows:
A O D R H = A O D / f ( R H )
f ( R H ) = 1 / ( 1 R H / 100 )
where   A O D R H is the aerosol optical thickness corrected for humidity, RH is the relative humidity in the atmosphere (%), and f ( R H ) is the scattering hygroscopic growth factor of particulate matter, which represents the influence of air humidity on AOD.
After revision, the correlation coefficient between AOD and the PM2.5 concentration was −0.032. This represented a slight improvement, but the value remained very low. Based on this, the correlation coefficient between AOD and the PM2.5 concentration in the four seasons was calculated as shown in Table 3.
The correlation between AOD and the PM2.5 concentration exhibited seasonal differences in Harbin and was the highest in winter. However, the correlation coefficient was only −0.218, which may be because the AOD data were obtained by the Terra satellite at 10:30 am transit time, while the PM2.5 concentration data were daily averages and the degree of time matching of the two data sources was insufficient.

3.4. Time Series Analysis of PM2.5

In this section, we report our investigation on the temporal relationships between the PM2.5 concentration and the abovementioned factors. They are described in terms of annual and monthly averages.

3.4.1. Interannual Variation of PM2.5

We analyzed the trend of the PM2.5 concentration in Harbin. As shown in Figure 6, the PM2.5 concentration in 2015 had the highest median, quartile, and upper margin values, corresponding to the worst pollution. In 2016, the PM2.5 concentration decreased overall and the number of outliers increased, but the violin plot widened around the median, indicating that the number of days with a low PM2.5 concentration increased and the pollution situation improved. In 2017, the PM2.5 concentration decreased slightly and the number of outliers decreased significantly, but the number of days with a low PM2.5 concentration also decreased. In 2018, the overall distribution and outliers of the PM2.5 concentration decreased significantly and the distribution was more concentrated near the median, making it the year with the lightest pollution in the past five years. This result may be attributable to the comprehensive and in-depth promotion of air pollution prevention and control work in Harbin in 2018. A total of 1191 coal-fired boilers were eliminated, 415 coal-fired boiler pollution prevention and control facilities were upgraded, 376 scattered pollution enterprises were investigated and cleaned up, and the comprehensive utilization rate of straw reached more than 75% [41]. PM2.5 pollution increased slightly in 2019, but the overall level did not differ significantly from that of the previous year.
According to the Ambient Air Quality Standard (GB3095-2012), the annual average concentration limit and the 24-hour average concentration limit of PM2.5 are 35 μg/m3 and 75 μg/m3, respectively. As shown in Figure 7, the number of days of PM2.5 exceeding the standard and the annual average concentration of PM2.5 showed a declining trend overall from 2015 to 2019. In 2015, fine particulate matter pollution was severe in Harbin, with an average PM2.5 concentration nearly twice the standard, and the number of days exceeding the standard was approximately one-third of the whole year. The annual average PM2.5 concentration in 2018 was the lowest, but it still failed to meet the standard. The number of days exceeding the standard in 2019 was less than half that in 2015. In the past five years, the annual average concentration of PM2.5 decreased in a wave pattern, showing a phenomenon of “annual decline,” and the number of days of PM2.5 exceeding the standard decreased sharply.
According to the climate characteristics of Harbin, the seasons are divided as follows: spring lasts from March to May, summer—from June to August, autumn—from September to November, and winter—from December to February. Figure 8 shows the average PM2.5 concentration data for the different seasons. It can be seen from the figure that the average concentration of PM2.5 changed as follows: in the spring of 2015–2018, it was basically stable at 46 μg/m3 and decreased slightly in 2019; in the summer, a stepwise decrease occurred from 30 μg/m3 to 15 μg/m3, and in the autumn of 2015–2017, the PM2.5 concentration decreased first, then increased, exhibiting a value as low as 23 μg/m3 and a cliff-like decline in 2018 and 2019. In fact, in 2018 and 2019, efforts were made to ban straw burning, which may have prevented air pollution caused by open burning during the harvest season [42]. Finally, the concentration in the winter did not show a clear trend.

3.4.2. Monthly Variation of PM2.5

After analyzing the interannual variation trend as reported in the previous section, it was found that with respect to the number of days exceeding the standard and seasonal variation, PM2.5 pollution in 2019 was moderate. These findings are in line with national governance policies and urban living requirements and are most representative of the current PM2.5 pollution levels.
We investigated the monthly variation of PM2.5 and the related factors in 2019. The results are shown in Figure 9. In 2019, Harbin’s non-heating period lasted from 28 April to 19 October. During the heating period, three small peaks of PM2.5 were observed, specifically, in January, February, and December. The PM10, NO2, CO, and SO2 concentrations showed similar trends. This could be attributed to the surge of gas precursors and PM2.5 caused by coal combustion for heating in winter. Moreover, the meteorological conditions at that time, namely low wind speed, low temperature, low sunshine duration, high pressure, and high relative humidity, were conducive to the accumulation of PM2.5. In March, the PM2.5 concentration dropped sharply, remaining below 100 μg/m3, and it fluctuated around this value until the end of April. In fact, the temperature increased significantly during this period, and the intensity of urban heating was reduced. From the middle of October, the PM2.5 concentration began to increase slightly, and the PM10, NO2, CO, AOD, and SO2 concentrations also increased slightly at the same time. Several heating companies in Harbin began a hot-state trial operation of the pipe network before the heating season, and research [43] has shown that the main source of air pollution is coal burning (62.365%). Combined with the changes in pollutants, it can be seen that coal combustion could have led to an increase in PM2.5. This continued until November. During the entire heating period, the factors contributing to O3 were dominated by temperature and sunshine duration and the trend in O3 was completely different from that of PM2.5.
During the non-heating period, the slightly higher PM2.5 concentration in mid-to-early May could be attributable to residual pollution caused by heating. During the rest of the non-heating period, the PM2.5 concentration remained at approximately 50 μg/m3, and the concentrations of PM10, NO2, CO, and SO2 were low. In addition, the increases in temperature and the mean RH, SDD, and PRE were conducive to the dissipation of PM2.5. During this period, the low concentration of PM2.5 not only reduced the weakening effect of solar radiation, but also reduced the heterogeneous absorption of the precursors of O3, such as HO2, NO2, NO3, and N2O5, leading to an acceleration of the O3 photochemical generation rate and, thus, to an increase in its concentration [43]. However, the O3 concentration was very unstable and fluctuated significantly. Each drop in O3 concentration was accompanied by a large amount of precipitation.
Briefly, according to the evaluation of changes in the PM2.5 concentration, its pollution levels in Harbin gradually reduced from 2015 to 2019. The annual average concentration of PM2.5 indicated a wavelike decline, whereas the number of days when the PM2.5 concentration exceeded the standard demonstrated a stepwise decline. There were significant seasonal differences in the PM2.5 concentration over the past five years. The PM2.5 concentration was highest in winter but declined in a wavelike pattern, slightly lower in autumn, declining in a cliff-like manner, low but stable in spring, and lowest in summer, decreasing gradually year by year. During the 2019 heating period, the meteorological conditions were conducive to the accumulation of PM2.5. Hence, the concentration of PM2.5 was high, while the opposite was true in the non-heating period. In addition, PM2.5 has a similar trend to PM10, NO2, CO, and SO2, indicating that its concentration is significantly affected by urban heating.

4. Conclusions

In this study, a random forest model was established, which used the air quality data, meteorological data, and atmospheric aerosol optical thickness data of Harbin from 2015 to 2019 to predict the concentration of PM2.5. After evaluating the prediction accuracy of the model, the SHAP method was used to analyze the contribution of each feature to the predicted value. The correlation between each factor and PM2.5 and the temporal and spatial patterns of PM2.5 were identified. The following conclusions were drawn.
The random forest model had a very good effect on the prediction of the PM2.5 concentration. The SHAP method was employed to explain the model. It was found that the mean RH and AOD have a great influence on the PM2.5 concentration, but the correlation coefficient with the PM2.5 concentration is less than 0.1. PM10, NO2, CO, SO2, and the mean PRS were positively correlated with PM2.5, whereas the mean TEMP, SSD, O3, mean WIN, and PRE were negatively correlated with PM2.5. There were seasonal differences in the correlation between AOD and PM2.5, and the highest correlation was found in winter, with a correlation coefficient of −0.218.
The degree of PM2.5 pollution in Harbin gradually diminished from 2015 to 2019, and there were substantial seasonal differences in the PM2.5 concentration and its variations. In 2019, the PM2.5 concentration was considerably affected by heating; together with the difference in meteorological conditions, it led to remarkable differences in the PM2.5 concentration between the heating and non-heating periods.
However, due to the different accuracy of the different data, this article can only be used for calculation, which may cause deviation of the analysis results. In future work, we will continue to improve this issue.

Author Contributions

Conceptualization, X.G.; methodology, X.G.; software, X.G. and Z.R.; data curation, J.L.; writing—original draft preparation, X.G. and J.L.; visualization, Z.R.; supervision, Y.Y.; project administration, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC), No. 51776051.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

For the raw air quality data, meteorological data, and AOD data files, please contact [email protected].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gupta, V.; Bisht, L.; Deep, A.; Gautam, S. Spatial distribution, pollution levels, and risk assessment of potentially toxic metals in road dust from major tourist city, Dehradun, Uttarakhand India. Stoch. Environ. Res. Risk Assess. 2022, 1–17. [Google Scholar] [CrossRef]
  2. Westervelt, D.M.; Horowitz, L.W.; Naik, V.; Tai, A.P.K.; Fiore, A.M.; Mauzerall, D.L. Quantifying PM2.5-meteorology sensitivities in a global climate model. Atmos. Environ. 2016, 142, 43–56. [Google Scholar] [CrossRef]
  3. Sharma, S.; Chandra, M.; Kota, S.H. Health Effects Associated with PM2.5: A Systematic Review. Curr. Pollut. Rep. 2020, 6, 345–367. [Google Scholar] [CrossRef]
  4. Sun, W.; Shao, M.; Granier, C.; Liu, Y.; Ye, C.S.; Zheng, J.Y. Long-Term Trends of Anthropogenic SO2, NOx, CO, and NMVO-Cs Emissions in China. Earths Future 2018, 6, 1112–1133. [Google Scholar] [CrossRef]
  5. Cifuentes, F.; Gálvez, A.; González, C.M.; Orozco-Alzate, M.; Aristizábal, B.H. Hourly Ozone and PM2.5 Prediction Using Meteorological Data—Alternatives for Cities with Limited Pollutant Information. Aerosol Air Qual. Res. 2021, 21, 200471. [Google Scholar] [CrossRef]
  6. Park, I.S.; Park, M.S.; Kim, S.H.; Jang, Y.W.; Lee, J.; Owen, J.S.; Cho, C.R.; Jee, J.B.; Chae, J.H.; Kang, M.S. Meteorological Characteristics during Periods of Greatly Reduced PM2.5 Concentrations in March 2020 in Seoul. Aerosol Air Qual. Res. 2021, 21, 200512. [Google Scholar] [CrossRef]
  7. Li, X.; Miao, Y.; Ma, Y.; Wang, Y.; Zhang, Y. Impacts of synoptic forcing and topography on aerosol pollution during winter in Shenyang, Northeast China. Atmos. Res. 2021, 262, 105764. [Google Scholar] [CrossRef]
  8. Chen, Z.; Xie, X.; Cai, J.; Chen, D.; Gao, B.; He, B.; Cheng, N.; Xu, B. Understanding meteorological influences on PM2.5 concentrations across China: A temporal and spatial perspective. Atmos. Chem. Phys. 2018, 18, 5343–5358. [Google Scholar] [CrossRef]
  9. Jing, Z.; Liu, P.; Wang, T.; Song, H.; Lee, J.; Xu, T.; Xing, Y. Effects of Meteorological Factors and Anthropogenic Precursors on PM2.5 Concentrations in Cities in China. Sustainability. 2020, 12, 3550. [Google Scholar] [CrossRef]
  10. Zheng, Y.; Li, S.; Zou, C.; Ma, X.; Zhang, G. Analysis of PM2.5 concentrations in Heilongjiang Province associated with forest cover and other factors. Int. J. For. Res. 2019, 30, 269–276. [Google Scholar] [CrossRef]
  11. Luo, M. Analysis of Influence Factors of PM2.5 in Chengdu Based on VAR Model. IOP Conf. Ser. Earth Environ. Sci. 2017, 63, 012039. [Google Scholar] [CrossRef]
  12. Zhang, L.; An, J.; Liu, M.; Li, Z.; Liu, Y.; Tao, L.; Liu, X.; Zhang, F.; Zheng, D.; Gao, Q.; et al. Spatiotemporal variations and influencing factors of PM2.5 concentrations in Beijing, China. Environ. Pollut. 2020, 262, 114276. [Google Scholar] [CrossRef] [PubMed]
  13. Shao, M.; Dai, Q.; Yu, Z.; Zhang, Y.; Xie, M.; Feng, T. Responses in PM2.5 and its chemical components to typical unfavorable meteorological events in the suburban area of Tianjin, China. Sci. Total Environ. 2021, 788, 147814. [Google Scholar] [CrossRef]
  14. Cheng, F.; Feng, C.; Yang, Z.; Hsu, C.; Chan, K.; Lee, C.; Chang, S. Evaluation of real-time PM2.5 forecasts with the WRF-CMAQ modeling system and weather-pattern-dependent bias-adjusted PM2.5 forecasts in Taiwan. Atmos. Environ. 2021, 244, 118263. [Google Scholar] [CrossRef]
  15. Zhang, L.; Na, J.; Zhu, J.; Shi, Z.; Zou, C.; Yang, L. Spatiotemporal causal convolutional network for forecasting hourly PM2.5 concentrations in Beijing, China. Comput. Geosci. 2021, 155, 104869. [Google Scholar] [CrossRef]
  16. Delavar, M.R.; Gholami, A.; Shiran, G.R.; Rashidi, Y.; Nakhaeizadeh, G.R.; Fedra, K.; Afshar, S.H. A Novel Method for Improving Air Pollution Prediction Based on Machine Learning Approaches: A Case Study Applied to the Capital City of Tehran. ISPRS Int. J. Geo-Inf. 2019, 8, 99. [Google Scholar] [CrossRef]
  17. Sadorsky, P. A Random Forests Approach to Predicting Clean Energy Stock Prices. J. Risk Financ. Manag. 2021, 14, 48. [Google Scholar] [CrossRef]
  18. Lee, I.H. Topological Invariant Prediction via Deep Learning. J. Korean Phys. Soc. 2020, 76, 401–405. [Google Scholar] [CrossRef]
  19. Wang, X.; Sun, W. Meteorological parameters and gaseous pollutant concentrations as predictors of daily continuous PM2.5 concentrations using deep neural network in Beijing-Tianjin-Hebei, China. Atmos. Environ. 2019, 211, 128–137. [Google Scholar] [CrossRef]
  20. Yang, L.; Xu, H.; Yu, S. Estimating PM2.5 concentrations in Yangtze River Delta region of China using random forest model and the Top-of-Atmosphere reflectance. J. Environ. Manag. 2020, 272, 111061. [Google Scholar] [CrossRef]
  21. Zheng, H.; Shang, X. Study on prediction of atmospheric PM2.5 based on RBF neural network. In Proceedings of the 2013 Fourth International Conference on Digital Manufacturing & Automation, Shangdong, China, 29–30 June 2013. [Google Scholar] [CrossRef]
  22. Shi, P.; Fang, X.; Ni, J.; Zhu, J. An Improved Attention-Based Integrated Deep Neural Network for PM2.5 Concentration Prediction. Appl. Sci. 2021, 11, 4001. [Google Scholar] [CrossRef]
  23. Lu, X.; Wang, J.; Yan, Y.; Zhou, L.; Ma, W. Estimating hourly PM2.5 concentrations using Himawari-8 AOD and a DBSCAN-modified deep learning model over the YRDUA, China. Atmos. Pollut. Res. 2021, 12, 183–192. [Google Scholar] [CrossRef]
  24. Du, S.; Li, T.; Yang, Y.; Horng, S. Deep Air Quality Forecasting Using Hybrid Deep Learning Framework. IEEE Trans. Knowl. Data Eng. 2021, 33, 2412–2424. [Google Scholar] [CrossRef]
  25. Czernecki, B.; Marosz, M.; Jędruszkiewicz, J. Assessment of Machine Learning Algorithms in Short-term Forecasting of PM10 and PM2.5 Concentrations in Selected Polish Agglomerations. Aerosol Air Qual. Res. 2021, 21, 200586. [Google Scholar] [CrossRef]
  26. Guo, L.; Chen, B.; Zhang, H.; Fang, J. Prediction and Source Contribution Analysis of PM2.5 Using a Combined FLEXPART Model and Bayesian Method over the Beijing-Tianjin-Hebei Region in China. Atmosphere 2021, 12, 860. [Google Scholar] [CrossRef]
  27. Yun, G.; He, Y.; Jiang, Y.; Dou, P.; Dai, S. PM2.5 Spatiotemporal Evolution and Drivers in the Yangtze River Delta between 2005 and 2015. Atmosphere 2019, 10, 55. [Google Scholar] [CrossRef]
  28. Zhao, X.; Sun, Y.; Zhao, C.; Jiang, H. Impact of Precipitation with Different Intensity on PM2.5 over Typical Regions of China. Atmosphere 2020, 11, 906. [Google Scholar] [CrossRef]
  29. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  30. Lundberg, S.; Lee, S. A Unified Approach to Interpreting Model Predictions, edited. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA, 4–9 December 2017. [Google Scholar] [CrossRef]
  31. Li, Z. Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Comput. Environ. Urban Syst. 2022, 96, 101845. [Google Scholar] [CrossRef]
  32. Li, Z.; Tong, X.; Ho, J.; Kwok, T.; Dong, G.; Ho, K.; Yim, S. A practical framework for predicting residential indoor PM2.5 concentration using land-use regression and machine learning methods. Chemosphere 2021, 265, 129140. [Google Scholar] [CrossRef]
  33. Lou, C.; Liu, H.; Li, Y.; Peng, Y.; Wang, J.; Dai, L. Relationships of relative humidity with PM2.5 and PM10 in the Yangtze River Delta, China. Environ. Monit. Assess 2017, 189, 582. [Google Scholar] [CrossRef] [PubMed]
  34. Li, M.; Zhang, Q.; Kurokawa, J.; Woo, J.; He, K.; Lu, Z.; Ohara, T.; Song, Y.; Streets, D.G.; Carmichael, G.R.; et al. MIX: A mosaic Asian anthropogenic emission inventory under the international collaboration framework of the MICS-Asia and HTAP. Atmos. Chem. Phys. 2017, 17, 935–963. [Google Scholar] [CrossRef]
  35. Zhang, S.; Li, D.; Ge, S.; Liu, S.; Wu, C.; Wang, Y.; Chen, Y.; Lv, S.; Wang, F.; Meng, J.; et al. Rapid sulfate formation from synergetic oxidation of SO2 by O3 and NO2 under ammonia-rich conditions: Implications for the explosive growth of atmospheric PM2.5 during haze events in China. Sci. Total Environ. 2021, 772, 144897. [Google Scholar] [CrossRef] [PubMed]
  36. Ma, J.; Cao, Y.; Xu, J.; Qu, Y.; Yu, Z. PM2.5 concentration distribution patterns and influencing meteorological factors in the central and eastern China during 1980–2018. J. Clean. Prod. 2021, 311, 127565. [Google Scholar] [CrossRef]
  37. Cao, C.; Jiang, W.; Wang, B.; Fang, J.; Lang, J.; Tian, G.; Jiang, J.; Zhu, T.F. Inhalable Microorganisms in Beijing’s PM2.5 and PM10 Pollutants during a Severe Smog Event. Environ. Sci. Technol. 2014, 48, 1499–1507. [Google Scholar] [CrossRef]
  38. Sun, Y.; Zhao, C.; Su, Y.; Ma, Z.; Li, J.; Letu, H.; Yang, Y.; Fan, H. Distinct Impacts of Light and Heavy Precipitation on PM2.5 Mass Concentration in Beijing. Earth Space Sci. 2019, 6, 1915–1925. [Google Scholar] [CrossRef]
  39. Gao, B.; Ouyang, W.; Cheng, H.; Xu, Y.; Lin, C.; Chen, J. Interactions between rainfall and fine particulate matter investigated by simultaneous chemical composition measurements in downtown Beijing. Atmos. Environ. 2019, 218, 117000. [Google Scholar] [CrossRef]
  40. Wu, J.; Liang, J.; Zhou, L.; Yao, F.; Peng, J. Impacts of AOD Correction and Spatial Scale on the Correlation between High-Resolution AOD from Gaofen-1 Satellite and In Situ PM2.5 Measurements in Shenzhen City, China. Remote Sens. 2019, 11, 2223. [Google Scholar] [CrossRef] [Green Version]
  41. Harbin’s In-Depth Promotion of Air Pollution Prevention and Control Has Achieved Results. Available online: http://www.harbin.gov.cn/art/2019/1/4/art_452_601290.html (accessed on 9 September 2021).
  42. Huang, L.; Zhu, Y.; Wang, Q.; Zhu, A.; Liu, Z.; Wang, Y.; Allen, D.T.; Li, L. Assessment of the effects of straw burning bans in China: Emissions, air quality, and health impacts. Sci. Total Environ. 2021, 789, 147935. [Google Scholar] [CrossRef]
  43. Bai, L.; Li, C.; Yu, C.W.; He, Z. Air pollution and health risk assessment in Northeastern China: A case study of Jilin Province. Indoor Built Environ. 2021, 30, 1857–1874. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the random forest principle.
Figure 1. Schematic diagram of the random forest principle.
Atmosphere 13 01426 g001
Figure 2. Flow chart of the algorithm for the random forest.
Figure 2. Flow chart of the algorithm for the random forest.
Atmosphere 13 01426 g002
Figure 3. Prediction results.
Figure 3. Prediction results.
Atmosphere 13 01426 g003
Figure 4. Contribution of features to predicted values. (a) Ranking of feature contributions. (b) Variation of feature influence.
Figure 4. Contribution of features to predicted values. (a) Ranking of feature contributions. (b) Variation of feature influence.
Atmosphere 13 01426 g004aAtmosphere 13 01426 g004b
Figure 5. Heatmap of correlations between the variables.
Figure 5. Heatmap of correlations between the variables.
Atmosphere 13 01426 g005
Figure 6. Distribution of PM2.5 by year.
Figure 6. Distribution of PM2.5 by year.
Atmosphere 13 01426 g006
Figure 7. Histogram of PM2.5 pollution over time.
Figure 7. Histogram of PM2.5 pollution over time.
Atmosphere 13 01426 g007
Figure 8. Seasonal changes in the PM2.5 concentration.
Figure 8. Seasonal changes in the PM2.5 concentration.
Atmosphere 13 01426 g008
Figure 9. Changes of the PM2.5 concentration, air pollutants, and meteorological conditions in 2019. (a) Concentrations of PM2.5 and meteorological conditions. (b) Concentrations of PM2.5 and air pollutants.
Figure 9. Changes of the PM2.5 concentration, air pollutants, and meteorological conditions in 2019. (a) Concentrations of PM2.5 and meteorological conditions. (b) Concentrations of PM2.5 and air pollutants.
Atmosphere 13 01426 g009
Table 1. Data summary.
Table 1. Data summary.
TypeNameUnitValue RangeSource
Air quality dataPM2.5 μ g/m30–704China’s Air Quality Online Monitoring and Analysis Platform
PM10 μ g/m30–628
NO2 μ g/m312–145
COmg/m30.4–4.3
SO2 μ g/m33–222
O3 μ g/m30–232
Meteorological dataMean TEMP −29.4–30.8Greenhouse data-sharing platform
PREmm0–121
Mean RH % 15–99
Mean PRShPa973.2–1025.4
SSDh0–14.6
Mean WINm/s0.2–10.2
AOD dataAOD/0.2–1.345NASA
Table 2. Random forest model: parameter settings.
Table 2. Random forest model: parameter settings.
NameMeaningValues
n_estimatorsThe number of trees in the forest.35
max_featuresThe number of features to consider when looking for the best split.6
max_depthThe maximum depth of the tree.10
bootstrapWhether bootstrap samples are used when building trees.True
criterionThe function to measure the quality of a split.mse
oob_scoreWhether to use out-of-bag samples to estimate the generalization score.True
random_stateControls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.20
Table 3. Seasonal correlation between AOD and the PM2.5 concentration.
Table 3. Seasonal correlation between AOD and the PM2.5 concentration.
SpringSummerAutumnWinter
Correlation coefficient−0.0300.0840.094−0.218
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Gao, X.; Ruan, Z.; Liu, J.; Chen, Q.; Yuan, Y. Analysis of Atmospheric Pollutants and Meteorological Factors on PM2.5 Concentration and Temporal Variations in Harbin. Atmosphere 2022, 13, 1426. https://doi.org/10.3390/atmos13091426

AMA Style

Gao X, Ruan Z, Liu J, Chen Q, Yuan Y. Analysis of Atmospheric Pollutants and Meteorological Factors on PM2.5 Concentration and Temporal Variations in Harbin. Atmosphere. 2022; 13(9):1426. https://doi.org/10.3390/atmos13091426

Chicago/Turabian Style

Gao, Xiuyan, Zhaohui Ruan, Jiemei Liu, Qixiang Chen, and Yuan Yuan. 2022. "Analysis of Atmospheric Pollutants and Meteorological Factors on PM2.5 Concentration and Temporal Variations in Harbin" Atmosphere 13, no. 9: 1426. https://doi.org/10.3390/atmos13091426

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop