4.1. Benford’s Law Method
The data were recorded hourly, with each city having a monitoring value every hour. To assess the reliability of the PM2.5 concentration data, we processed the data to calculate the daily average PM2.5 concentrations for each of the 283 cities from 2015 to 2022. These daily averages were then tested against Benford’s Law.
To compare the frequency of leading digits in the PM2.5 concentration data with the theoretical values predicted by Benford’s Law, we used the Z-statistic, a common method for large-sample tests. We set the confidence level at 0.95. If the frequency of a particular leading digit falls within the confidence interval, the digit is considered to conform to Benford’s Law and is highlighted in green; otherwise, it is considered non-conforming and is highlighted in yellow. The test results are presented in Table 5, and the first-digit analysis is shown in Figure 2.
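The digit-by-digit comparison described above can be sketched as follows. This is a minimal illustration rather than the exact procedure behind Table 5: it assumes a plain one-sample Z-test for a proportion without continuity correction, a two-sided critical value of 1.96 for the 0.95 confidence level, and hypothetical helper names (`leading_digit`, `benford_z_test`).

```python
import math
from collections import Counter

# Theoretical first-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a positive number."""
    # repr() gives the shortest round-trip decimal form, so the first
    # non-zero character is the first significant digit.
    return next(int(c) for c in repr(float(x)) if c in "123456789")


def benford_z_test(values, z_crit=1.96):
    """Compare observed first-digit shares with Benford's Law, digit by digit.

    Assumption: a simple Z-test for a proportion with no continuity correction.
    """
    digits = [leading_digit(v) for v in values if v > 0]
    n = len(digits)
    counts = Counter(digits)
    rows = []
    for d in range(1, 10):
        p_exp = BENFORD[d]
        p_obs = counts[d] / n
        se = math.sqrt(p_exp * (1 - p_exp) / n)  # standard error under the null
        z = (p_obs - p_exp) / se
        rows.append({
            "digit": d,
            "observed": p_obs,
            "expected": p_exp,
            "z": z,
            "conforms": abs(z) <= z_crit,
        })
    return rows


if __name__ == "__main__":
    # Synthetic values whose logarithms are evenly spread over one decade,
    # which gives first digits close to the Benford distribution.
    sample = [10 ** (i / 1000) for i in range(1000)]
    for row in benford_z_test(sample):
        print(row)
```

In practice the `values` argument would hold the daily average PM2.5 concentrations; the synthetic sample here only demonstrates the mechanics.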
We used Benford’s Law to test the reliability of the urban PM2.5 concentration data. The law predicts how often each digit from 1 to 9 appears as the leading digit in naturally occurring data. The observed digit frequencies did not fall within the expected confidence intervals, so the dataset did not pass the Benford’s Law test. Even so, the frequencies decrease gradually from digit 1 to digit 9, which matches the pattern the law predicts. Two features of the dataset may also limit the applicability of Benford’s Law: it is large, at about 800,000 records, so the confidence intervals are very narrow and even small deviations are rejected; and its values span a narrow range of magnitudes, mostly in the tens and hundreds. We therefore cannot conclude from the Benford’s Law results alone that the urban PM2.5 concentration data are unreliable.
To further verify the reliability of the data, we calculated the annual average PM2.5 concentration for each of the 283 cities from 2015 to 2022. We then built a statistical model of the factors affecting PM2.5 concentration and quantified the reliability of the data in terms of outliers. This approach accounts not only for the distribution of the data but also for factors that may affect PM2.5 concentration, such as meteorological conditions, energy consumption, and industrial structure.
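The sample-size caveat raised above can be made concrete with a short calculation. The sketch below assumes the same normal-approximation interval as a standard Z-test for a proportion (critical value 1.96) and uses a hypothetical helper name, `benford_interval_half_width`. It shows how the acceptance band around the Benford share of digit 1 shrinks as the number of records grows; at roughly 800,000 records the half-width is about 0.001, so a deviation of a fraction of a percentage point is already enough to fail the test.

```python
import math


def benford_interval_half_width(digit, n, z_crit=1.96):
    """Half-width of the acceptance interval for one leading digit.

    Assumption: normal approximation to the binomial, as in a standard
    Z-test for a proportion.
    """
    p = math.log10(1 + 1 / digit)  # Benford probability of this digit
    return z_crit * math.sqrt(p * (1 - p) / n)


# Illustrative sample sizes; 800,000 is the approximate record count in the text.
for n in (1_000, 10_000, 800_000):
    hw = benford_interval_half_width(1, n)
    print(f"n={n:>7}: digit-1 share must lie within +/-{hw:.4f} of {math.log10(2):.4f}")
```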
4.2. Robust Regression Results
Because the Benford’s Law test indicates that the PM2.5 data do not strictly conform to the law, we hypothesize that the data may contain outliers. Ordinary regression applied directly to such data could yield biased estimates, so we use robust regression instead and assess the reliability of the PM2.5 data by the proportion of outliers in the dataset. The steps are as follows. First, we derive the regression equation using robust estimation methods. Next, we calculate the residual between the predicted and actual PM2.5 value for each sample and determine the standard deviation of these residuals. The threshold for identifying outliers is set at three times the residual standard deviation: if the absolute residual for a sample exceeds this threshold, the actual PM2.5 value for that sample is considered an outlier. Finally, we compute the proportion of outliers relative to the total number of samples to determine the reliability of the dataset. The results are shown in Table 6.
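The outlier-detection steps above can be sketched as follows. The text does not name the robust estimator, so this example assumes Huber M-estimation fitted by iteratively reweighted least squares, with the conventional tuning constant 1.345 and a median-absolute-deviation scale estimate. The function names (`huber_irls`, `flag_outliers`) and the synthetic data are illustrative only and are not the study's variables or results.

```python
import numpy as np


def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Fit y ~ X by Huber M-estimation using iteratively reweighted least squares.

    Assumption: the Huber loss is one common robust choice; the source
    does not specify which robust estimator was used.
    """
    X = np.column_stack([np.ones(len(y)), X])    # add an intercept column
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale estimate: normalized median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, down-weighted beyond c
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, y - X @ beta


def flag_outliers(residuals, k=3.0):
    """Flag samples whose absolute residual exceeds k residual standard deviations."""
    threshold = k * np.std(residuals, ddof=1)
    return np.abs(residuals) > threshold


# Synthetic demonstration: two made-up predictors and five injected gross errors.
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 2))
y_demo = 40.0 - 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(size=n)
y_demo[:5] += 30.0

beta, resid = huber_irls(X_demo, y_demo)
mask = flag_outliers(resid)
print("coefficients:", np.round(beta, 2))
print(f"outliers: {mask.sum()} of {n} ({mask.mean():.1%})")
```

The final line reports the outlier share in the same form as the reliability measure described above: the number of flagged samples divided by the total number of samples.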
Annual average wind speed, precipitation, and relative humidity are negatively correlated with PM2.5 concentration, while annual average temperature and the Normalized Difference Vegetation Index (NDVI) are positively correlated with it. Increased wind speed typically aids the dilution and dispersion of airborne pollutants, as high winds can carry pollutants away from the source area, thereby reducing local PM2.5 concentrations. Precipitation helps to remove particulate matter, including PM2.5, from the air through wet deposition (i.e., rain bringing pollutants down to the ground). Additionally, under high relative humidity, airborne particles are more likely to aggregate into larger particles and settle out of the air, further reducing PM2.5 concentrations.
In terms of temperature, while lower temperatures can lead to increased fossil fuel combustion for heating, contributing to higher PM2.5 emissions, higher temperatures may also be associated with other factors that elevate PM2.5 concentrations. For instance, higher temperatures can lead to increased vehicle use and the operation of air conditioning systems, both of which contribute to PM2.5 emissions. Furthermore, higher temperatures can enhance biogenic emissions (such as volatile organic compounds emitted by plants), which, under the influence of sunlight and heat, can form secondary organic aerosols. This process, along with enhanced atmospheric chemical reactions, can also contribute to increased PM2.5 concentrations under warmer conditions.
Per capita GDP and the urbanization rate are negatively correlated with PM2.5 concentration, while industrial structure (the proportion of the secondary industry), energy consumption, and population density are positively correlated. Per capita GDP, as a measure of economic development, typically reflects an improvement in economic structure and advancements in environmental protection technologies. As per capita GDP increases, cities often have more resources to invest in environmental management and pollution control, leading to a reduction in air pollutants like PM2.5. Furthermore, with the advancement of urbanization, cities tend to enhance their infrastructure, including the construction and operation of environmental protection facilities, which contributes to lowering PM2.5 concentrations.
The secondary industry, which includes sectors such as manufacturing and construction, is usually a major source of energy consumption and pollutant emissions. As the proportion of the secondary industry increases, pollution emissions tend to rise, resulting in higher PM2.5 concentrations. Increased energy consumption generally indicates greater fossil fuel combustion, leading to higher concentrations of particulate matter (including PM2.5) in the air. Additionally, densely populated urban areas typically experience more vehicular traffic, residential activities, and industrial and commercial operations, all of which contribute significantly to pollutant emissions.
In summary, the influencing factors selected in this study are both significant and reasonable, making them suitable for analyzing the reliability of PM2.5 data. Additionally, through robust regression analysis, we identified 18 outliers, representing 0.8% of the total data points (18/2264).
As Table 7 shows, the outlier ratio is just 0.8%, which suggests that the overall PM2.5 concentration data are relatively reliable. Among the 18 outliers, Jinan, Liaocheng, and Urumqi each appeared twice, while every other city appeared only once. Notably, most outliers were concentrated in 2015, with few or none in the following years.
However, the standard deviation of the residuals was 11.41, which is quite high. This indicates that the robust regression estimates may still contain significant bias. Because the outlier threshold is three times the residual standard deviation, a large standard deviation raises the threshold, which may lead to an underestimation of the number of outliers and to errors in the reliability analysis. One likely explanation for the large residuals is that robust regression, being a linear model, may not fully capture the complex nonlinear relationships between the factors and PM2.5 concentrations. To address this, we propose using machine learning methods to better capture these nonlinear relationships and produce a more accurate model. This would improve the identification of outliers and lead to more precise conclusions in our reliability analysis.