Reliability Assessment of PM2.5 Concentration Monitoring Data: A Case Study of China

Duan, Hongyan; Yue, Wenfu; Li, Weidong

doi:10.3390/atmos15111303


by Hongyan Duan ^1,2, Wenfu Yue ^1 and Weidong Li ^1,2,*

^1 School of Economics and Management, Beijing Jiaotong University, No.3 Shangyuan Village, Haidian District, Beijing 100044, China

^2 Beijing Laboratory of National Economic Security Early-Warning Engineering, Beijing Jiaotong University, Office Building 7, No.3 Shangyuan Village, Haidian District, Beijing 100044, China

^* Author to whom correspondence should be addressed.

Atmosphere 2024, 15(11), 1303; https://doi.org/10.3390/atmos15111303

Submission received: 12 September 2024 / Revised: 13 October 2024 / Accepted: 23 October 2024 / Published: 29 October 2024

(This article belongs to the Section Air Quality)


Abstract

This study systematically evaluates the reliability of PM2.5 monitoring data across major urban areas, using a comprehensive dataset covering 283 cities in China over the eight years from 2015 to 2022. Applying Benford’s Law, robust regression analysis, and machine learning methods such as Gradient Boosting Trees and Random Forests, we find that the overall reliability of China’s PM2.5 monitoring data is high. These models effectively captured complex patterns and detected anomalies related to natural environmental and socioeconomic factors, as well as potential data manipulation. Based on the integrated models, the proportion of anomalies in PM2.5 concentration monitoring data across the 283 cities from 2015 to 2022 was less than 2%, which strongly indicates that the data are reliable overall. Additionally, the machine learning models provided a ranking of the importance of the variables affecting PM2.5 concentrations, offering a scientific basis for understanding the driving factors behind the data. The three variables with the greatest impact on PM2.5 concentrations are population density, average temperature, and relative humidity. We further validated our findings by comparing them with other related studies. Overall, this study provides new methods and perspectives for understanding and evaluating the reliability of PM2.5 data in China, laying a solid foundation for future research.

Keywords:

data science; data reliability; PM2.5; machine learning

1. Introduction

Data reliability, defined as “the degree to which the measured value of a quantity corresponds to its true value” [1], is a prerequisite for effective environmental monitoring [2]. Human-induced environmental changes have led to an appropriate focus on ecosystem protection, which in turn has driven the development of numerous ecological monitoring programs worldwide [3]. Monitoring surrounding environmental conditions is crucial for environmental management and regulation. However, effective monitoring is subject to a range of institutional, political, and legal constraints, as it requires sustained, long-term efforts that are well matched to the resources being studied [4]. In policy-oriented research, the role of non-technical factors in compromising data accuracy and reporting is well documented. Under certain conditions, institutions and individuals may favor specific outcomes over others, leading to systematic bias due to flaws in sampling, data collection, and processing procedures. The underlying reasons for this bias may include conflicting institutional objectives, political pressures, private interests, official narratives, and the ideological structures that express the fundamental beliefs of groups and their members.

Over the past decade, the Chinese central government has made significant efforts to curb frequent data manipulation [5] and has implemented a series of nationwide environmental monitoring reforms [6]. However, in recent years, numerous cases have been exposed involving the falsification of data by environmental monitoring stations [7], the submission of forged materials by polluting enterprises, and tampering with environmental monitoring equipment. Instances of improper local interference in environmental monitoring and the fabrication of monitoring data by polluting entities continue to occur, undermining the credibility and authority of the monitoring data. The issue of information asymmetry in China is particularly evident in the relationship between local governments and polluting enterprises, with frequent cases of enterprises evading environmental regulations [8].

These issues have led both academia and legislative bodies to recognize that the quality of environmental information is a significant constraint on the effectiveness of environmental assessments. Zhao et al. (2017) [9] pointed out that environmental data falsification persists because such actions carry low costs and high benefits, and because environmental protection agencies lack the authority to penalize environmental monitoring laboratories. Liu and Wang (2019) [10] highlighted that the types of environmental monitoring fraud in China are diverse and complex, involving not only corporate managers and employees but also government officials. Yan (2019) [11] suggested that the widespread use of “data-based” assessments in China has turned data into a key metric for evaluating government performance, thereby encouraging the falsification of environmental monitoring data. Furthermore, the lack of a robust evaluation system has allowed these instances of data falsification to go undetected in their early stages, so that the problem escalates.

Among the various dimensions of air pollution, fine particulate matter, particularly PM2.5, has garnered increasing attention due to its severe impact on human health. PM2.5 particles are small enough to penetrate the lungs and cardiovascular system, leading to serious health conditions such as stroke and lung cancer. As China’s urbanization and industrialization have rapidly advanced, along with a surge in fossil fuel consumption, smog caused by fine particulate matter has reached epidemic levels in many cities, characterized by its complex, cumulative, and cross-regional pollution patterns [12]. The Global Burden of Disease study indicates that over 250,000 premature deaths occur annually in China due to air pollution, with 80% of these deaths attributable to PM2.5 pollution [13].

Currently, the spatiotemporal distribution and driving factors of PM2.5 are environmental issues that attract significant attention in geography, ecology, and environmental science. Since the 1960s and 1970s, researchers in the United States and Europe have conducted extensive studies on how to design air quality monitoring networks that are efficient, representative, and purpose-driven [14,15]. It is generally acknowledged that PM2.5 is closely related to human social and economic activities, with primary sources including vehicle emissions, industrial activities, soil dust, and the combustion of biomass and coal. Secondary sources include other gaseous pollutants that undergo complex chemical reactions to form PM2.5 [16]. In 2007, China introduced a trial version of its ambient air quality monitoring standards. According to the 2022 China Ecological and Environmental Statistics Yearbook, China has established 15,143 ambient air quality monitoring sites across the country.

The primary objective of this study is to evaluate the reliability of PM2.5 concentration monitoring data in 283 cities across China from 2015 to 2022. Previous studies on the reliability of air quality monitoring data in China have largely been theoretical, focusing on motivational analysis. Using advanced machine learning techniques, such as Gradient Boosting Trees and Random Forests, we aim to identify anomalies in the data and explore the factors that most significantly influence PM2.5 levels. Firstly, we applied Benford’s Law to detect irregular patterns in the numerical data, which may indicate manipulation, inaccuracies, or biases in the PM2.5 monitoring data of various cities. Secondly, we used robust regression analysis to further test the data while controlling for natural environmental factors and socioeconomic variables that might affect PM2.5 levels. Thirdly, we utilized a big data approach to meticulously analyze the anomalies in PM2.5 data, addressing concerns about data scrutiny related to threshold reporting. Through models like Gradient Boosting Trees and Random Forests, we were able to capture complex patterns and potential anomalies within the data, effectively helping us to identify anomalies related to natural environmental and socioeconomic factors and to detect possible manipulation. The findings of this research will contribute to improving air quality management and policy decision-making in China by ensuring that PM2.5 monitoring data is accurate and reliable. Furthermore, the methods and approaches developed in this study can serve as a foundation for future research in evaluating the reliability of environmental data in other regions.

2. Literature Review

Environmental monitoring plays a crucial role in assisting decision-making processes by identifying new environmental issues, supporting the evaluation of environmental management policies, plans, and projects, and providing evidence for regulatory evolution [17]. There are significant gaps in our understanding of current environmental conditions [18,19]. These gaps exist because environmental monitoring is challenging to conduct effectively, and this difficulty creates political, legal, and organizational barriers to the collection and use of monitoring data. In policy-oriented research, the role of non-technical factors in undermining the accuracy of data and data reporting is well known. In fact, accurately measuring environmental data remains a challenge, and many scholars have expressed concerns about potential mismeasurement in official statistics [20,21]. This poses challenges for the use of data and policy formulation. Therefore, a scientific assessment of the reliability of PM2.5 monitoring data becomes particularly important.

The leading digits of many numerical datasets follow a logarithmic pattern. The first to discover this regularity was the astronomer and mathematician Simon Newcomb [22]. Nearly sixty years later, physicist Frank Benford [23] observed the same logarithmic pattern in the expected frequency of leading digits, indicating that the relative frequency of the first significant digit is consistent with a uniform distribution on a logarithmic scale. This principle, now known as Benford’s Law, is useful for data validation because a leading-digit distribution that deviates significantly from the Benford distribution suggests that the integrity of the data may have been compromised [24]. Sambridge et al. (2010) [25] demonstrated the applicability of Benford’s Law to observations in the natural sciences. Liu et al. (2012) [26] explored how to combine Benford’s Law with panel models to identify specific regions and time series data that may have quality issues. Lu and Boritz (2006) [27] noted that while Benford’s Law is effective for data validation, it faces challenges when applied to incomplete datasets, highlighting the limitations of the law in such scenarios.

Some scholars have also tested data reliability by examining the correlations between indicators. Klein and Özmucur (2003) [28] used classical principal component regression methods to validate the reliability of China’s economic growth data. Zhou and Lian (2010) [29] utilized principal component analysis, considering the impact of geographic spatial factors on economic growth. However, these methods generally rely on traditional techniques such as principal component analysis or least squares methods, which can be sensitive to the presence of anomalous data.

Based on the aforementioned literature, the methods used to assess the accuracy of statistical data must meet the specific conditions and assumptions required by those methodologies. Specifically, there needs to be a certain degree of correlation between the indicators being studied; otherwise, it is not possible to draw meaningful conclusions. On the other hand, while correlation-based methods can indicate that there may be issues with the statistical data, they do not identify which specific data points fail to maintain the expected balance or correlation. As a result, these methods can detect a general problem with the data but are unable to pinpoint the exact data points that are problematic.

As big data and artificial intelligence technologies gradually penetrate the economic field, they have sparked a data revolution centered on algorithms and grounded in big data. Su and Zhou (2018) [30] proposed a statistical data quality assessment method based on a cloud model, using weighted arithmetic mean integration techniques to construct a comprehensive evaluation cloud that assesses various dimensions of data quality. Wang and Zhou (2018) [31] addressed the lack of comprehensive big data analysis, quality standards research, and quality assessment methods by proposing a set of standards for big data quality. Fisman and Wei (2007) [32] and Mishra et al. (2007) [33] compared customs data from source and destination countries to identify missing imports or exports. Patel et al. (2019) [34] used a Random Forest model to assess the reliability of financial statement data from several companies in Mumbai for the period 2008–2011. Qian et al. (2023) [35] developed a machine learning-based public health data reliability assessment system, which includes modules for data preparation, feature engineering, multi-model quantitative evaluation, and data reliability assessment.

Numerous scholars have proposed robust methods for principal component analysis and principal component regression. Maronna and Yohai (2017) [36] and Boudt et al. (2020) [37] have noted that the Minimum Covariance Determinant (MCD) estimator does not perform optimally in high-dimensional settings, particularly when the number of variables exceeds the sample size. To address this issue, Boudt et al. (2020) [37] introduced a high-dimensional robust covariance matrix estimator, abbreviated as MRCD (Minimum Regularized Covariance Determinant), which proves to be more effective in cases where the dimensionality exceeds the sample size. Existing literature suggests that diagnosing the quality of officially released data is no longer assessed solely from a one-dimensional perspective, such as accuracy, although accuracy still plays a crucial role in the overall evaluation of statistical data quality. This paper contributes to the relatively limited body of research that seeks to measure the accuracy of environmental data by examining statistical anomalies within large datasets (Dumas and Devine 2000) [38].

3. Data and Methods

This study considers PM2.5 data from 283 cities across China for the period from 2015 to 2022. The PM2.5 concentration data were sourced from the official website of the China National Environmental Monitoring Center: https://www.cnemc.cn/ (accessed on 18 June 2024). The data were collected from monitoring stations in 283 cities nationwide, with PM2.5 concentration levels recorded hourly, resulting in a monitoring value for each city every hour. The monitoring equipment is provided by several Chinese companies; details are available at https://www.ccgp.gov.cn/ (accessed on 10 July 2024). The data on socioeconomic factors and natural environmental factors are sourced from the “China Urban Statistical Yearbook”, the “China Environmental Statistical Yearbook”, and the China Qinghai-Tibet Plateau Scientific Data Center platform. Figure 1 illustrates the geographical scope of the study by showing the distribution of the selected cities across the study area; the white dots mark the locations of the cities chosen for analysis. These cities form the basis of our analysis after filtering out those with incomplete data.

3.1. Benford’s Law

Benford’s Law is applicable to a variety of datasets, especially those containing naturally occurring quantities, such as financial data, demographic data, and physical measurements. It works best when the dataset is large and spans a wide range of values, and it is often used for fraud detection and outlier analysis. Since PM2.5 data are naturally occurring measurements with a wide distribution range, the analysis conducted here uses Benford’s Law to identify cities where PM2.5 levels might have been inaccurately measured or are more likely to have been misreported.

In 1938, American physicist Frank Benford uncovered this distribution pattern. By analyzing over 20,000 samples, Benford identified the law governing the distribution of leading significant digits in natural numbers, as shown in Equation (1):

p (d_{1}) = \log_{10} \left(1 + \frac{1}{d_{1}}\right), \quad d_{1} \in \{1, \dots, 9\}

(1)

Here, d_{1} is a digit from 1 to 9, and p (d_{1}) denotes the probability of d_{1} appearing as the leading digit in a dataset. Therefore, Equation (1) indicates that the digit 1 should appear as the first digit approximately 30.1% of the time, the digit 2 should appear as the first digit about 17.6% of the time, and so on, with the digit 9 appearing as the leading digit less than 5% of the time. In 1995, Hill [39] provided a mathematical proof of Benford’s Law and also derived the frequency distribution for the second significant digit d_{2}, as shown in Equation (2):

p (d_{2}) = \sum_{d_{1} = 1}^{9} \log_{10} \left(1 + \frac{1}{10 d_{1} + d_{2}}\right), \quad d_{2} \in \{0, \dots, 9\}

(2)

However, in some cases the second-digit results do not conform to the expected distribution as closely as the first-digit results do. When a dataset is unintentionally or deliberately manipulated, the frequency distribution often deviates from Benford’s Law [38,40,41]. Table 1 illustrates the distribution of leading digits according to Benford’s Law. The law also predicts the distribution of the second, third, and subsequent digits, as well as combinations of digits.
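As a quick numerical check of Equations (1) and (2), the expected frequencies can be tabulated directly. The sketch below is illustrative only; the function names are hypothetical and are not taken from the study.

```python
import math

def benford_first_digit(d1: int) -> float:
    """Expected probability that d1 (1..9) is the leading digit, Equation (1)."""
    return math.log10(1 + 1 / d1)

def benford_second_digit(d2: int) -> float:
    """Expected probability that d2 (0..9) is the second digit, Equation (2)."""
    return sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))

first = {d: benford_first_digit(d) for d in range(1, 10)}
second = {d: benford_second_digit(d) for d in range(0, 10)}

for d, p in first.items():
    print(f"first digit {d}: {p:.4f}")
for d, p in second.items():
    print(f"second digit {d}: {p:.4f}")
```

Both sets of probabilities sum to 1, and the second-digit distribution is visibly flatter than the first-digit one.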

3.2. Robust Regression

To gain deeper insights into the anomalies present in the PM2.5 data and to rigorously evaluate the quality of air pollution measurements, we employ robust regression techniques. Building on the methodologies of [42,43], we identified 10 fundamental indicators, X_{1} to X_{10}, encompassing both natural environmental factors and socioeconomic determinants that significantly impact PM2.5 concentrations. The data sources for these 10 factors are listed in Table 2, and their descriptive statistics are shown in Table 3.

The basic linear model selected is as follows in Equation (3).

Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \dots + \beta_{10} X_{10} + e

(3)

Robust Regression is an application of robust estimation methods in statistics, designed to enhance the resilience of regression models against outliers and other data points that violate standard assumptions. Traditional Ordinary Least Squares (OLS) regression is susceptible to the influence of outliers because it minimizes the sum of squared errors, which can disproportionately amplify the impact of large errors. In contrast, robust regression mitigates the influence of outliers, providing more reliable estimates. In this study, we utilize the rlm function in R to perform robust regression, employing M-estimation to diminish the effect of outliers by applying different weights to the residuals. The core principle of M-estimation involves minimizing a loss function that is less sensitive to outliers when estimating model parameters. Specifically, we use the Huber loss function, which combines the strengths of both OLS and Least Absolute Deviations (LAD) by applying a quadratic penalty to small residuals and a linear penalty to large residuals. This approach retains the efficiency of OLS while enhancing robustness against outliers.
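The study itself relies on the rlm function in R. Purely to illustrate the idea of Huber M-estimation, the following Python sketch runs iteratively reweighted least squares on a toy problem with a single predictor. The tuning constant k = 1.345, the scale estimate (median absolute residual divided by 0.6745), the toy data, and all function names are assumptions made for this example and are not reported in the paper.

```python
import statistics

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ym - b1 * xm, b1

def huber_fit(x, y, k=1.345, iterations=50):
    """Huber M-estimate via iteratively reweighted least squares (IRLS)."""
    w = [1.0] * len(x)  # the first pass is plain OLS
    for _ in range(iterations):
        b0, b1 = weighted_line_fit(x, y, w)
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 for small residuals, k * scale / |r| for large ones.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return b0, b1

# Toy data (hypothetical): y = 2 + 3x plus small noise, with one gross outlier.
x = [float(i) for i in range(10)]
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5]
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]
y[4] += 50.0

ols = weighted_line_fit(x, y, [1.0] * len(x))
rob = huber_fit(x, y)
print("OLS   intercept, slope:", ols)
print("Huber intercept, slope:", rob)
```

The quadratic-then-linear penalty shows up here as the weight rule: points with small residuals keep full weight, while the outlier is progressively downweighted, so the Huber fit stays much closer to the underlying line than OLS does.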

Based on the results of the correlation tests (see Appendix A), only a few factors exhibit strong correlations. In the multicollinearity test (see Table 4), all Variance Inflation Factor (VIF) values are below 10, with only one exceeding 5. We therefore conclude that there is no severe multicollinearity among the variables selected in this study, and the subsequent analysis can proceed.
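For reference, the VIF thresholds quoted here can be read through the standard textbook definition of the statistic. The symbol R_{j}^{2} is introduced only for this note: it denotes the coefficient of determination obtained when predictor X_{j} is regressed on the other nine predictors.

```latex
\mathrm{VIF}_{j} = \frac{1}{1 - R_{j}^{2}}, \qquad j = 1, \dots, 10
```

Under this definition, a VIF below 10 corresponds to R_{j}^{2} < 0.9, and a VIF above 5 corresponds to R_{j}^{2} > 0.8.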

3.3. Machine Learning

Traditional regression methods are often constrained by several limitations, such as linearity assumptions, feature selection, and data dimensionality, which may hinder their ability to effectively detect outliers. Machine learning methods, on the other hand, are not bound by these constraints. Common machine learning techniques include K-Nearest Neighbors (KNN), Naive Bayes, Support Vector Machines (SVM), Decision Trees, and Neural Networks. Given the need to analyze the impact of various factors on PM2.5 concentrations, machine learning methods are employed for regression in this study.

Random Forest Regression, an ensemble learning method, predicts outcomes by constructing multiple decision trees and can effectively reduce the problem of overfitting. Additionally, Random Forests can handle high-dimensional data and large datasets, automatically manage missing values, and offer robust performance. Gradient Boosting Regression Trees (GBRT) improve prediction accuracy by iteratively training multiple weak learners (decision trees), with each iteration focusing on the samples that previous models predicted incorrectly. This approach is particularly effective at handling complex nonlinear relationships, making it suitable for most real-world applications. Support Vector Regression (SVR) is another powerful tool for handling nonlinear data, utilizing kernel functions to manage nonlinearity and finding an optimal balance by maximizing the margin and minimizing empirical risk, thus ensuring stability and accuracy, especially in noisy datasets.
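For squared-error loss, “focusing on” earlier mistakes amounts to fitting each new tree to the residuals left by the current ensemble. The sketch below illustrates that mechanism with one-split trees (stumps) on a single feature. It is a toy illustration, not the model used in the study; the learning rate, the number of rounds, the data, and all names are assumptions.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep the right side non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda xi: lm if xi <= threshold else rm

def gradient_boost(x, y, rounds=100, learning_rate=0.1):
    """Squared-error gradient boosting: each stump fits the current residuals."""
    base = sum(y) / len(y)  # start from the mean prediction
    stumps = []
    pred = [base] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Toy nonlinear target (hypothetical): y = x^2 on a regular grid.
x = [i / 10 for i in range(50)]
y = [xi ** 2 for xi in x]
model = gradient_boost(x, y)

mean_y = sum(y) / len(y)
mse_base = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boost = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(f"baseline MSE {mse_base:.2f}, boosted MSE {mse_boost:.2f}")
```

Although each stump is a very weak learner, the sum of many shrunken stumps tracks the curved target, which is the sense in which boosting handles nonlinear relationships.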

Since the dataset of PM2.5 and its influencing factors is large, and there may be nonlinear relationships between the variables, this study uses the above three methods to comprehensively assess and validate the reliability of PM2.5 data. Because the SVR estimates performed poorly, they are reported in Appendix B.

4. Results

4.1. Benford’s Law Method

The data were recorded hourly, with each city having a monitoring value every hour. To assess the reliability of the PM2.5 concentration data, we processed the data to calculate the daily average PM2.5 concentrations for each of the 283 cities from 2015 to 2022. These daily averages were then tested against Benford’s Law.

To compare the frequency of leading digits in the PM2.5 concentration data with the theoretical values predicted by Benford’s Law, we used the Z-statistic, a common method for large-sample tests. We set the confidence level at 0.95. If the frequency of a particular leading digit falls within the confidence interval, it is considered to conform to Benford’s Law and is highlighted in green; otherwise, it is considered non-conforming and is highlighted in yellow. The test results are presented in Table 5, and the first-digit analysis is shown in Figure 2.
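A per-digit Z-test of this kind can be sketched as follows. The paper does not state the exact form of its statistic, so this example assumes the plain large-sample proportion test without a continuity correction and a two-sided critical value of 1.96; the function names and both toy samples are hypothetical.

```python
import math
from collections import Counter
from decimal import Decimal

def leading_digit(value: float) -> int:
    """First significant digit of a positive number."""
    return Decimal(str(value)).as_tuple().digits[0]

def benford_z_test(values, z_crit=1.96):
    """Return {digit: (observed, expected, z, conforms)} for digits 1..9."""
    digits = [leading_digit(v) for v in values if v > 0]
    n = len(digits)
    counts = Counter(digits)
    result = {}
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts[d] / n
        se = math.sqrt(expected * (1 - expected) / n)  # binomial standard error
        z = (observed - expected) / se
        result[d] = (observed, expected, z, abs(z) <= z_crit)
    return result

# Sample spread evenly over one full decade on a log scale: close to Benford.
conforming = [10 ** (k / 1000) for k in range(1000)]
# Sample confined to roughly 20..80: leading digits 1, 8 and 9 never occur.
narrow = [20 + 0.06 * k for k in range(1000)]

report = benford_z_test(conforming)
narrow_report = benford_z_test(narrow)
for d in range(1, 10):
    print(d, report[d][3], narrow_report[d][3])
```

The second sample shows how a narrow range of magnitudes can fail the test without any manipulation. Because the standard error shrinks with the square root of the sample size, very large samples also flag deviations that are small in absolute terms.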

In the statistical analysis of urban PM2.5 concentration data, we employed Benford’s Law to test the reliability of the data. Benford’s Law predicts the frequency with which the digits 1 to 9 appear as the leading digit under natural conditions. Applying this law, we found that the digit frequencies did not fall within the expected confidence intervals, indicating that the dataset did not pass the Benford’s Law test. It is worth noting, however, that the frequency of the digits decreases gradually from 1 to 9, which is consistent with the pattern expected under Benford’s Law. Furthermore, the dataset is large, reaching 800,000 records, and the values span a narrow range of magnitudes, concentrated mainly in the tens and hundreds; both factors may significantly affect the applicability of Benford’s Law. With so many observations, even very small deviations from the expected frequencies become statistically significant. Therefore, we cannot conclude that urban PM2.5 concentration data are unreliable simply on the basis of the Benford’s Law results. To further verify the reliability of the data, we calculated the annual average PM2.5 concentration for the 283 cities from 2015 to 2022. By constructing a statistical model of the factors affecting PM2.5 concentration, we quantitatively analyzed the reliability of the data from the perspective of outliers. This method considers not only the distribution characteristics of the data but also various factors that may affect PM2.5 concentration, such as meteorological conditions, energy consumption, and industrial structure.

4.2. Robust Regression Results

Since the results of the Benford’s Law test indicate that the PM2.5 data do not strictly conform to the law, we hypothesize that there may be some outliers in the data. If regression analysis were conducted directly under these conditions, it could lead to biased results. Therefore, we opt for robust regression. Based on this, we assess the reliability of the PM2.5 data by calculating the proportion of outliers within the dataset. The specific steps are as follows: First, we derive the regression equation using robust estimation methods. Next, we calculate the residuals between the predicted and actual PM2.5 values for each sample and determine the standard deviation of these residuals. The threshold for identifying outliers is set at three times the residual standard deviation. If the residual for a particular sample exceeds this threshold, the actual PM2.5 value for that sample is considered an outlier. Finally, we compute the proportion of outliers relative to the total number of samples to determine the reliability of the dataset. The results are shown in Table 6.
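The outlier rule described in these steps can be sketched as below. The paper does not say whether the population or sample form of the standard deviation is used, or whether the threshold applies to the absolute residual, so the population form and the absolute value are assumptions here. The function name and the toy numbers are hypothetical.

```python
import statistics

def flag_outliers(actual, predicted, multiplier=3.0):
    """Flag samples whose |residual| exceeds multiplier * std of the residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sigma = statistics.pstdev(residuals)  # population form, assumed
    threshold = multiplier * sigma
    flags = [abs(r) > threshold for r in residuals]
    return flags, sum(flags) / len(flags)

# Toy example: 19 samples within +/-1 of the prediction and one far above it.
predicted = [40.0] * 20
actual = [40.0 + (1 if i % 2 == 0 else -1) for i in range(19)] + [75.0]

flags, share = flag_outliers(actual, predicted)
print("outlier indices:", [i for i, f in enumerate(flags) if f])
print(f"outlier share: {share:.1%}")
```

Note that the outliers themselves inflate the residual standard deviation, so a large spread raises the threshold and can hide borderline cases.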

Annual average wind speed, precipitation, and relative humidity are negatively correlated with PM2.5 concentration, while annual average temperature and the Normalized Difference Vegetation Index (NDVI) are positively correlated with PM2.5 concentration. Increased wind speed typically aids in the dilution and dispersion of airborne pollutants, as high winds can carry pollutants away from the source area, thereby reducing local PM2.5 concentrations. Precipitation helps to remove particulate matter, including PM2.5, from the air through wet deposition (i.e., rain bringing pollutants down to the ground). Additionally, under high relative humidity, airborne particles are more likely to aggregate into larger particles and settle out of the air, further reducing PM2.5 concentrations.

In terms of temperature, while lower temperatures can lead to increased fossil fuel combustion for heating, contributing to higher PM2.5 emissions, higher temperatures may also be associated with other factors that elevate PM2.5 concentrations. For instance, higher temperatures can lead to increased vehicle use and the operation of air conditioning systems, both of which contribute to PM2.5 emissions. Furthermore, higher temperatures can enhance biogenic emissions (such as volatile organic compounds emitted by plants), which, under the influence of sunlight and heat, can form secondary organic aerosols. This process, along with enhanced atmospheric chemical reactions, can also contribute to increased PM2.5 concentrations under warmer conditions.

Per capita GDP and the urbanization rate are negatively correlated with PM2.5 concentration, while industrial structure (the proportion of the secondary industry), energy consumption, and population density are positively correlated. Per capita GDP, as a measure of economic development, typically reflects an improvement in economic structure and advancements in environmental protection technologies. As per capita GDP increases, cities often have more resources to invest in environmental management and pollution control, leading to a reduction in air pollutants like PM2.5. Furthermore, with the advancement of urbanization, cities tend to enhance their infrastructure, including the construction and operation of environmental protection facilities, which contributes to lowering PM2.5 concentrations.

The secondary industry, which includes sectors such as manufacturing and construction, is usually a major source of energy consumption and pollutant emissions. As the proportion of the secondary industry increases, pollution emissions tend to rise, resulting in higher PM2.5 concentrations. Increased energy consumption generally indicates greater fossil fuel combustion, leading to higher concentrations of particulate matter (including PM2.5) in the air. Additionally, densely populated urban areas typically experience more vehicular traffic, residential activities, and industrial and commercial operations, all of which contribute significantly to pollutant emissions.

In summary, the influencing factors selected in this study are both significant and reasonable, making them suitable for analyzing the reliability of PM2.5 data. Additionally, through robust regression analysis, we identified 18 outliers, representing 0.8% of the total data points (18/2264).

As Table 7 shows, the outlier ratio is just 0.8%, which suggests that the overall PM2.5 concentration data are relatively reliable. Among the 18 outliers, Jinan, Liaocheng, and Urumqi each appeared twice, while the other cities appeared only once. Notably, most outliers were concentrated in 2015, with few or none in the following years.

However, it is important to note that the standard deviation of the residuals was 11.41, which is quite high. This indicates that the robust regression estimates may still contain significant bias, potentially leading to errors in outlier detection and reliability analysis and to an underestimation of the number of outliers. Because robust regression is a linear model, it may not fully capture the complex nonlinear relationships between the factors and PM2.5 concentrations, which could explain the large residuals. To address this, we use machine learning methods to better capture these nonlinear relationships and thereby produce a more accurate model. This improves the identification of outliers and leads to more precise conclusions in our reliability analysis.

4.3. Machine Learning Performance

4.3.1. Random Forest

To identify the optimal hyperparameters, we employed a grid search method in conjunction with cross-validation. Specifically, we set the number of folds for cross-validation to 5 and used the negative value of Mean Squared Error (MSE) as the scoring criterion. In this process, the grid search systematically explored a predefined set of hyperparameter combinations. For each combination, we evaluated its performance through cross-validation and selected the combination with the highest negative MSE, that is, the lowest MSE. This approach ensured the model’s stability and reliability across different data splits. The results are shown in Table 8 and Table 9.
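The selection logic (5-fold cross-validation, negative MSE as the score, highest score wins) can be sketched without any external library. To keep the example short, a one-feature k-nearest-neighbours regressor stands in for the Random Forest, and its single hyperparameter k stands in for the grid reported in the paper; the data, the grid values, and all names are hypothetical. Libraries such as scikit-learn follow the same sign convention with their `neg_mean_squared_error` scorer.

```python
import random

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k nearest training points (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

def neg_mse_cv(x, y, k, folds=5):
    """Mean negative MSE over `folds` cross-validation splits."""
    scores = []
    for f in range(folds):
        test_idx = [i for i in range(len(x)) if i % folds == f]
        train_idx = [i for i in range(len(x)) if i % folds != f]
        tx, ty = [x[i] for i in train_idx], [y[i] for i in train_idx]
        errors = [(y[i] - knn_predict(tx, ty, x[i], k)) ** 2 for i in test_idx]
        scores.append(-sum(errors) / len(errors))
    return sum(scores) / folds

def grid_search(x, y, grid):
    """Return (best_k, best_score); a higher negative MSE is better."""
    scored = {k: neg_mse_cv(x, y, k) for k in grid}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Toy data (hypothetical): noisy quadratic relationship.
random.seed(0)
x = [random.uniform(0, 10) for _ in range(100)]
y = [xi ** 2 + random.gauss(0, 2) for xi in x]

best_k, best_score = grid_search(x, y, grid=[1, 3, 5, 10, 30])
print("best k:", best_k, "negative MSE:", round(best_score, 3))
```

Negating the MSE lets one "higher is better" rule serve every metric, which is why the best combination is the one with the largest (least negative) score.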

The Root Mean Squared Error (RMSE) is 2.02, which is significantly lower than the 11.41 obtained from robust regression. Additionally, the R-squared value of the Random Forest model reached 0.97, indicating that the model has a very high explanatory power and predictive capability. According to Table 10, the model also identified 27 outliers, representing 1.19% of the total data points (27/2264). With an outlier ratio of 1.19%, which is still relatively low, it can be concluded that the overall reliability of PM2.5 concentration data in China is strong. Among the 27 outliers, Chaozhou appeared twice and Shijiazhuang three times, while the other cities only appeared once. Most of the outliers were concentrated in 2015, with few or none in the subsequent years.
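The two fit statistics quoted here follow their usual definitions, restated as a reminder. The notation is assumed for this note rather than taken from the paper: y_{i} is the observed annual mean PM2.5 for sample i, \hat{y}_{i} is the model prediction, \bar{y} is the sample mean, and n is the number of city-year samples.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}, \qquad
R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}
```

When the residuals average to zero, the RMSE coincides with the (population) standard deviation of the residuals, which is presumably what makes it comparable with the 11.41 reported for robust regression.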

4.3.2. Gradient Boosting Tree

Although the hyperparameters of the Gradient Boosting Tree model differ slightly from those of the Random Forest, the selection method is the same: the optimal hyperparameters are identified using grid search with 5-fold cross-validation, with the negative Mean Squared Error (MSE) as the scoring metric. The results are shown in Table 11 and Table 12.

The RMSE is 1.30, far lower than the 11.41 from robust regression, and the R-squared value of the Gradient Boosting Tree model reaches 0.99, indicating very high explanatory power and predictive capability. The model identified 32 outliers, or 1.41% of the total data points (32/2264). This low outlier ratio again suggests that the overall reliability of China’s PM2.5 concentration data is strong. Among the 32 outliers, Dezhou and Handan each appeared twice, while every other city appeared only once. Outliers were detected in all eight years, with the most in 2017 (7 outliers) and the fewest in 2020 (only 1 outlier).

Based on the results in Table 13 and those of the other models, the proportion of anomalies in PM2.5 concentration monitoring data across 283 cities in China from 2015 to 2022 is consistently below 2%, which strongly indicates that the overall monitoring of PM2.5 concentrations in China is reliable. Of the two tree-based models, the Gradient Boosting Tree has the smaller RMSE and the larger R-squared, and therefore gives the more accurate estimates (Figure 3). Notably, six city-year observations were flagged by both the Random Forest and Gradient Boosting Tree models: Baoding, Dezhou, Langfang, and Liaocheng in 2015, Shijiazhuang in 2017, and Qujing in 2019, suggesting potential anomalies in these data. However, there is no evidence of any city experiencing severe monitoring issues overall. It is also worth noting that, aside from the Gradient Boosting Tree model, the outliers identified by the other three models (robust regression, Random Forest, and Support Vector Machine) were primarily concentrated in 2015.
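The overlap between the two tree-based models can be checked directly from the tables. The sketch below transcribes the city-year pairs of Table 10 and Table 13 and intersects them; the variable and function names are illustrative, and Langfang and Qinhuangdao are spelled as in standard romanization.

```python
# City-year outliers transcribed from Table 10 (Random Forest).
RF_OUTLIERS = {
    2015: ["Baoding", "Beijing", "Dezhou", "Dongying", "Harbin", "Hangzhou",
           "Jinan", "Jingmen", "Jingzhou", "Kaifeng", "Langfang", "Liaocheng",
           "Wuhan", "Xinyang", "Xingtai", "Yichang", "Changsha", "Zhengzhou",
           "Zibo"],
    2016: ["Shijiazhuang"],
    2017: ["Shijiazhuang"],
    2018: ["Shijiazhuang"],
    2019: ["Qujing"],
    2020: ["Chaozhou"],
    2022: ["Chaozhou", "Fuzhou", "Zhangzhou"],
}

# City-year outliers transcribed from Table 13 (Gradient Boosting Tree).
GBT_OUTLIERS = {
    2015: ["Baoding", "Dezhou", "Huaian", "Langfang", "Liaocheng"],
    2016: ["Anshan", "Binzhou", "Jingmen", "Sanmenxia", "Shuozhou"],
    2017: ["Changde", "Chizhou", "Handan", "Hefei", "Jilin", "Shijiazhuang",
           "Xiangyang"],
    2018: ["Dezhou", "Huanggang", "Qinhuangdao", "Yongzhou"],
    2019: ["Liupanshui", "Qujing", "Suizhou"],
    2020: ["Huainan"],
    2021: ["Handan", "Xianyang"],
    2022: ["Cangzhou", "Guyuan", "Shanwei", "Weinan"],
}


def as_pairs(table):
    """Flatten a {year: [cities]} table into a set of (city, year) pairs."""
    return {(city, year) for year, cities in table.items() for city in cities}


# Pairs flagged by both models, ordered by year and then by city.
shared = sorted(as_pairs(RF_OUTLIERS) & as_pairs(GBT_OUTLIERS),
                key=lambda pair: (pair[1], pair[0]))
for city, year in shared:
    print(year, city)
```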

Several factors may explain these anomalies. Many of the flagged cities have undergone rapid industrialization, and the construction of new factories and infrastructure may have delayed the deployment and calibration of monitoring equipment. Some cities are located in remote areas with relatively weak transportation and communication infrastructure, which may have resulted in unstable data transmission. In addition, 2015 marked the beginning of systematic environmental monitoring for many cities, when monitoring equipment and data processing technologies may not yet have been fully developed, contributing to data anomalies.

5. Discussion

According to Benford’s Law, the leading digits in naturally occurring data do not appear with equal frequency but follow a specific logarithmic pattern. Any deviation from the expected distribution may therefore indicate that the data are inaccurate, possibly due to unintentional or intentional manipulation. Building on this principle, we tested the first-digit distribution of the PM2.5 records. The results revealed deviations of the PM2.5 data from Benford’s Law, prompting a discussion of the possible motivations behind these findings. However, the underlying reasons why the empirical data deviate from Benford’s Law are not satisfactorily explained in this study.
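The comparison behind this discussion can be reproduced from Table 5. The sketch below computes the Benford first-digit probability, log10(1 + 1/d), and compares it with the observed relative frequencies derived from the counts in Table 5. The mean absolute deviation is included only as a simple summary and is not claimed to be the statistic used in the study.

```python
import math

# First-digit counts transcribed from Table 5.
FIRST_DIGIT_COUNTS = {
    1: 342378, 2: 261583, 3: 186795, 4: 128736, 5: 91393,
    6: 67737, 7: 54246, 8: 46234, 9: 41349,
}


def benford_probability(digit):
    """Expected share of leading digit `digit` under Benford's Law."""
    return math.log10(1 + 1 / digit)


total = sum(FIRST_DIGIT_COUNTS.values())
deviations = []
for digit, count in FIRST_DIGIT_COUNTS.items():
    observed = count / total
    expected = benford_probability(digit)
    deviations.append(abs(observed - expected))
    print(f"{digit}: observed={observed:.3f} expected={expected:.3f} "
          f"difference={observed - expected:+.3f}")

# Simple summary of the gap between the observed and expected shares.
mean_abs_deviation = sum(deviations) / len(deviations)
print(f"mean absolute deviation: {mean_abs_deviation:.4f}")
```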

The large size of the dataset, which exceeds 800,000 entries when divided by day, and the relatively small range of data values—most of which fall between tens and hundreds—can significantly impact the applicability of Benford’s Law. Therefore, we further employed robust regression techniques. Following the methodologies of [41,42], we selected 10 key indicators related to natural environmental and socioeconomic factors that influence PM2.5 concentrations. By calculating the proportion of outliers in the PM2.5 dataset, we aimed to assess the reliability of the data.
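The effect of a narrow value range on first-digit frequencies can be seen in a small, deterministic example. The sketch below compares values confined to 10 through 199 (a hypothetical range chosen for illustration, not the PM2.5 data) with powers of two, which span hundreds of orders of magnitude and are known to follow Benford’s Law closely.

```python
from collections import Counter


def first_digit_shares(values):
    """Share of each leading digit 1-9 among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    n = sum(counts.values())
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


# Hypothetical narrow range: every integer from 10 to 199.
narrow = first_digit_shares(range(10, 200))

# Wide range: the first 1000 powers of two.
wide = first_digit_shares([2 ** n for n in range(1000)])

for d in range(1, 10):
    print(f"{d}: narrow={narrow[d]:.3f} wide={wide[d]:.3f}")
```

In the narrow range, the leading digit 1 accounts for 110 of 190 values, far above the Benford share of about 0.301, even though nothing has been manipulated.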

To our knowledge, only two other studies have addressed the reliability of air monitoring data in China [2,44]. Ghanem and Zhang (2014) [2] focused exclusively on self-reported air pollution data from before 2010, with the primary emphasis on the motivations behind local government self-disclosure. They assessed the differences between observed and expected relative frequencies using statistical measures. Such measures are well known to be highly sensitive to sample size, and specific events within the period under investigation are a potential source of interference. Since then, with China’s increased emphasis on environmental protection, air quality data have shifted from local government disclosure to collection through air monitoring stations. Given this technical shift, it is necessary to reassess the reliability of current PM2.5 data in China.
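The sensitivity of test statistics to sample size can be illustrated with the relative frequencies in Table 5. The sketch below holds the observed shares fixed and varies only the sample size n in a Pearson chi-square statistic. The chi-square test is used here purely as an example of a size-sensitive measure; it is not assumed to be the statistic used in either study.

```python
import math

# Observed first-digit shares (digits 1-9) from Table 5.
OBSERVED_SHARES = [0.281, 0.214, 0.153, 0.105, 0.075, 0.056, 0.044, 0.038, 0.034]
# Benford's expected shares for digits 1-9.
EXPECTED_SHARES = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Chi-square critical value for 8 degrees of freedom at the 5% level.
CRITICAL_VALUE = 15.507


def chi_square(n):
    """Pearson chi-square statistic if the same shares came from n records."""
    return n * sum(
        (o - e) ** 2 / e for o, e in zip(OBSERVED_SHARES, EXPECTED_SHARES)
    )


for n in (500, 1000, 1220451):
    statistic = chi_square(n)
    verdict = "reject" if statistic > CRITICAL_VALUE else "do not reject"
    print(f"n={n}: chi-square={statistic:.1f} -> {verdict} conformity")
```

The statistic grows in direct proportion to n, so identical proportions that pass the test in a small sample are rejected decisively in a very large one.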

Zhang (2021) [44] used Benford’s Law to examine SO₂ monitoring data from Jiangsu Province, China, focusing solely on the frequency distribution of the leading significant digit in this region. The study was limited to Jiangsu Province, with minimal consideration of regional factors, and did not extend the analysis to other provinces or other types of air quality data. In contrast, our approach combines robust regression and machine learning methods to identify factors influencing PM2.5 concentrations. It not only helps monitoring personnel identify cities that may be reporting suspicious data but could also be adapted to detect potentially unreliable SO₂ monitoring data.

From an econometric perspective, our results demonstrate that measurement errors caused by manipulation may be correlated with observable variables that are typically considered exogenous. This undermines the use of these observable variables as true indicators of air quality. The primary focus of this paper is on identifying potential manipulation behaviors. Although we have conducted preliminary analyses on patterns of data manipulation, we have not explored the political or economic factors that may make manipulation more likely in certain cities than others. This aspect will be addressed in future research.

6. Conclusions

This study conducted a detailed reliability assessment of PM2.5 monitoring data from 283 cities in China between 2015 and 2022 by applying Benford’s Law, robust regression, and machine learning analysis. We found that the overall reliability of China’s air quality monitoring data is high, despite minor deviations in certain cities and years. Our detailed analysis indicates that while 3% of the data points deviated from Benford’s Law, the remaining 97% were consistent with the expected patterns, indicating a high level of reliability. These deviations may be influenced by various factors, including natural environmental conditions, socioeconomic impacts, and data collection techniques. We also discussed the impact of data range and scale on the applicability of Benford’s Law. For our dataset, which includes over 800,000 daily records, we used robust regression and machine learning to further verify the reliability of the PM2.5 data. Our quantitative analysis of 10 key PM2.5 indicators affected by natural and socioeconomic factors showed that the proportion of outliers was minimal, enhancing the reliability of the dataset. Although our study identified potential data manipulation, we acknowledge the need for a deeper investigation into the political and economic motivations behind such actions. This will be a focus of our future research to strengthen our understanding and assessment of the reliability of China’s air quality monitoring data.

To further improve the quality and reliability of China’s environmental monitoring data, we recommend a multifaceted approach, including standardizing monitoring equipment and technology, enhancing data management systems, increasing data transparency, strengthening personnel training, raising public environmental awareness, introducing independent auditing mechanisms, and conducting in-depth research on the socioeconomic and natural environmental factors that lead to data anomalies. These measures will provide robust data support for environmental protection and governance, contributing to the sustainable development of China’s ecological civilization.

Author Contributions

Conceptualization, H.D. and W.L.; methodology, W.Y.; software, W.Y.; validation, H.D. and W.L.; formal analysis, W.L.; investigation, H.D.; resources, W.L.; data curation, W.Y.; writing—original draft preparation, H.D.; writing—review and editing, W.L.; visualization, W.Y.; supervision, W.L.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Beijing Laboratory of National Economic Security Early-Warning Engineering, Beijing Jiaotong University; the Humanities and Social Science Planning Project [2023JBW8006]; and the Fundamental Research Funds for the Central Universities [2024YJS006].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article. The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Correlation Test

Table A1. Correlation analysis.

	X₁	X₂	X₃	X₄	X₅	X₆	X₇	X₈	X₉	X₁₀
X₁	1
X₂	0.272	1
X₃	0.705	0.102	1
X₄	0.547	0.032	0.410	1
X₅	0.457	0.064	0.444	0.370	1
X₆	0.111	−0.164	0.205	0.105	−0.012	1
X₇	0.118	0.053	0.020	−0.023	0.198	−0.336	1
X₈	0.039	0.019	−0.097	−0.038	0.195	−0.382	0.828	1
X₉	0.088	0.242	−0.099	0.038	0.302	−0.421	0.434	0.683	1
X₁₀	−0.327	−0.181	−0.311	−0.129	−0.142	−0.265	−0.265	0.443	0.174	1

Appendix B

Support Vector Machine (SVM)

Table A2. Alternative hyperparameters.

Hyperparameters	Alternative Sets
Regularization parameter	10, 100, 500, 1000, 5000, 10,000
Kernel function type	Polynomial kernel; Radial Basis Function kernel
Kernel function parameters	scale, auto
epsilon	0.1, 0.2, 0.5, 1

Table A3. Support Vector Machine results.

Parameters	Results
Regularization parameter	5000
Kernel function type	Radial Basis Function kernel
Kernel function parameters	scale
epsilon	1
RMSE	6.21
R²	0.67

The Root Mean Squared Error (RMSE) is 6.21, which is significantly higher than that of the Random Forest and Gradient Boosting Tree models. Additionally, the R-squared value of 0.67 is also much lower than those of the aforementioned models, indicating that the Support Vector Regression (SVR) model does not perform well on the PM2.5 concentration dataset. The model identified 23 outliers, representing 1.02% of the total data points (23/2264).

Table A4. Estimation results.

Cities	Year
Anshan	2015
Baoding	2015
Dezhou	2015
Dongying	2015
Harbin	2015
Hangzhou	2015
Jinan	2015
Jingmen	2015
Liaocheng	2015
Shenyang	2015
Xiangyang	2015
Xinyang	2015
Yichang	2015
Changsha	2015
Zibo	2015
Dongying	2016
Jinan	2016
Shijiazhuang	2016
Urumqi	2016
Linfen	2017
Urumqi	2017
Zhaotong	2018
Chaozhou	2022

With an outlier ratio of 1.02%, which is still relatively low, it can be concluded that the overall reliability of China’s PM2.5 concentration data remains strong. Among the 23 outliers, Dongying, Jinan, and Urumqi each appeared twice, while other cities only appeared once. The majority of these outliers were concentrated in 2015, with fewer outliers in the subsequent years.

Appendix C

Among the four models, the Gradient Boosting Tree model had the smallest error and the highest R-squared value, indicating the best performance. Therefore, we conducted a further analysis of the natural environmental and socioeconomic factors influencing PM2.5 concentrations based on this model.

The Gradient Boosting Tree model calculates the importance of variables based on their contribution to the reduction in the residual sum of squares. From the analysis, it is evident that the three most influential variables on PM2.5 concentrations are X₅ (population density), X₉ (average temperature), and X₈ (relative humidity).
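The idea of importance as a reduction in the residual sum of squares can be sketched for a single split. In a boosted ensemble, such gains would be accumulated over every split of every tree and then normalized. The data, feature names, and thresholds below are hypothetical, and the sketch is not the implementation used in the study.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_gain(xs, ys, threshold):
    """Reduction in the sum of squares from splitting at `threshold`."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return sse(ys) - sse(left) - sse(right)


def relative_importance(gains):
    """Normalize accumulated gains so they sum to 1."""
    total = sum(gains.values())
    return {name: gain / total for name, gain in gains.items()}


# Hypothetical toy data: one target and two candidate features.
ys = [10.0, 10.0, 20.0, 20.0]
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [1.0, 3.0, 2.0, 4.0]

gains = {
    "feature_a": split_gain(feature_a, ys, threshold=2.0),
    "feature_b": split_gain(feature_b, ys, threshold=1.0),
}
importance = relative_importance(gains)
print(gains)
print(importance)
```

Here the split on `feature_a` separates the two target levels perfectly and therefore receives the larger share of the total gain.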

Figure A1. Relative influence.

Next, we plotted the partial dependence plots for X₅ (population density), X₉ (average temperature), and X₈ (relative humidity) against PM2.5 concentrations. The plots clearly demonstrate that the relationships between these factors and PM2.5 concentrations are not strictly linear but rather exhibit complex nonlinear patterns. This further confirms the limitations of robust regression and supports the appropriateness of the Gradient Boosting Tree model in capturing these intricate relationships.
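A partial dependence curve can be computed with very little machinery: for each grid value, the chosen feature is set to that value in every row, and the model predictions are averaged. The sketch below uses a hypothetical toy model and toy rows in place of the fitted Gradient Boosting Tree model and the study data.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction with one feature forced to each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # override only the chosen feature
            predictions.append(predict(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_model(row):
    """Hypothetical stand-in model: nonlinear in x0, linear in x1."""
    x0, x1 = row
    return x0 ** 2 + 3 * x1


# Hypothetical toy rows (x0, x1) and grid for the first feature.
rows = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
grid = [0.0, 1.0, 2.0]

curve = partial_dependence(toy_model, rows, feature_index=0, grid=grid)
print(curve)
```

Unequal steps between successive points of the curve indicate a nonlinear relationship, which is the pattern a linear model such as robust regression cannot represent.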

Figure A2. Partial Dependence on X₅, X₈ and X₉.

References

1. Joint Committee for Guides in Metrology (JCGM). International Vocabulary of Metrology, 3rd ed.; JCGM: Glasgow, UK, 2012.
2. Ghanem, D.; Zhang, J. ‘Effortless Perfection’: Do Chinese cities manipulate air pollution data? J. Environ. Econ. Manag. 2014, 68, 203–225.
3. Nichols, J.D.; Williams, B.K. Monitoring for conservation. Trends Ecol. Evol. 2006, 21, 668–673.
4. Biber, E. The challenge of collecting and using environmental monitoring data. Ecol. Soc. 2013, 18, 68.
5. Brombal, D. Accuracy of environmental monitoring in China: Exploring the influence of institutional, political and ideological factors. Sustainability 2017, 9, 324.
6. Lo, K. How Authoritarian Is the Environmental Governance of China? Environ. Sci. Policy 2015, 54, 152–159.
7. Xiang, L.; Song, L. Comparison of Legal Norms on Government Information Quality in China and the United States. Libr. Inf. Serv. 2014, 6, 54–57.
8. Tu, Z.; Chen, R. Can the Emissions Trading Mechanism Achieve the Porter Effect in China? Econ. Res. J. 2015, 50, 160–173.
9. Zhao, X.; Yang, H.; Luo, W.; Wang, M.; Song, Y.; Deng, G. Under the New Situation, How to Effectively Prevent the Fraud of Social and Environmental Monitoring Institutions. Environ. Monit. Early Warning 2017, 1, 63–66.
10. Liu, D.; Wang, S. Handling Different Types of Environmental Monitoring Fraud in Multiple Ways. Int. J. Environ. Sci. Technol. 2019, 16, 4963–4966.
11. Yan, F. Research on the accuracy of institutional matching in regional environmental collaborative governance: A case study of environmental data fraud in Linfen, Shanxi Province. J. Party Sch. Harbin Munic. Party Comm. 2019, 1, 37–41.
12. Wang, Z.; Liang, L.; Wang, X. Spatiotemporal evolution of PM2.5 concentrations in urban agglomerations of China. J. Geogr. Sci. 2021, 31, 878–898.
13. Yin, P.; Brauer, M.; Cohen, A.J.; Wang, H.; Li, J.; Burnett, R.T.; Murray, C.J. The effect of air pollution on deaths, disease burden, and life expectancy across China and its provinces, 1990–2017: An analysis for the Global Burden of Disease Study 2017. Lancet Planet. Health 2020, 4, e386–e398.
14. Markiewicz, M. A review of mathematical models for the atmospheric dispersion of heavy gases. Part I. A classification of models. Ecol. Chem. Eng. S 2012, 19, 297–314.
15. Hsieh, H.P.; Lin, S.D.; Zheng, Y. Inferring air quality for station location recommendation based on urban big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 437–446.
16. Cheng, Z.; Li, L.; Liu, J. Identifying the spatial effects and driving factors of urban PM2.5 pollution in China. Ecol. Indic. 2017, 82, 61–75.
17. Messer, J.J. Monitoring, assessment, and environmental policy. In Environmental Monitoring; Wiersma, G.B., Ed.; CRC Press: Boca Raton, FL, USA, 2004; pp. 499–516.
18. Christensen, N.L.; Bartuska, A.M.; Brown, J.H.; Carpenter, S.; D’Antonio, C.; Francis, R.; Franklin, J.F.; MacMahon, J.A.; Noss, R.F.; Parsons, D.J.; et al. The report of the Ecological Society of America committee on the scientific basis for ecosystem management. Ecol. Appl. 1996, 6, 665–691.
19. O’Malley, R.; Marsh, A.S.; Negra, C. Closing the environmental data gap. Issues Sci. Technol. 2009, 25, 69–74.
20. Andrews, S.Q. Inconsistencies in air quality metrics: ‘Blue Sky’ days and PM10 concentrations in Beijing. Environ. Res. Lett. 2008, 3, 034009.
21. Chen, Y.; Li, H.; Zhou, L.-A. Relative performance evaluation and the turnover of provincial leaders in China. Econ. Lett. 2005, 88, 421–425.
22. Newcomb, S. Note on the frequency of use of the different digits in natural numbers. Am. J. Math. 1881, 4, 39–40.
23. Benford, F. The law of anomalous numbers. Proc. Am. Philos. Soc. 1938, 78, 551–572.
24. Nigrini, M.J.; Mittermaier, L.J. The use of Benford’s law as an aid in analytical procedures. Audit. J. Pract. Theory 1997, 16, 52–67.
25. Sambridge, M.; Tkalčić, H.; Jackson, A. Benford’s law in the natural sciences. Geophys. Res. Lett. 2010, 37, L22301.
26. Liu, Y.; Wu, X.; Zeng, W. A study on the comprehensive use of Benford’s rule and panel model to detect the quality of statistical data. Stat. Res. 2012, 29, 74–78.
27. Lu, F.; Boritz, J.E. Adaptive fraud detection using Benford’s law. In Advances in Artificial Intelligence; Lane, H.C., Aha, D.W., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4013, pp. 347–356.
28. Klein, L.R.; Özmucur, S. The estimation of China’s economic growth rate. J. Econ. Soc. Meas. 2003, 28, 187–202.
29. Zhou, G.; Lian, F. Evaluation of GDP data quality in China: An empirical analysis based on spatial panel data model. J. Shanxi Univ. Financ. Econ. 2010, 32, 17–23+48.
30. Su, W.; Zhou, J. Research on statistical information quality assessment method based on cloud theory. Stat. Res. 2018, 35, 86–93.
31. Wang, L.; Zhou, X. Research on the standard and process of big data quality assessment. Bus. Manag. Manag. 2018, 4, 84–88.
32. Fisman, R.; Wei, S.J. The smuggling of art and the art of smuggling: Uncovering the illicit trade in cultural property and antiques. Natl. Bur. Econ. Res. 2007, 1, 82–96.
33. Mishra, P.; Topalova, P.; Subramanian, A. Policies, Enforcement, and Customs Evasion: Evidence from India; International Monetary Fund: Washington, DC, USA, 2007; pp. 7–60. Available online: https://ssrn.com/abstract=973990 (accessed on 20 June 2024).
34. Patel, H.; Parikh, S.; Patel, A.; Parikh, A. An application of ensemble random forest classifier for detecting financial statement manipulation of Indian listed companies. In Recent Developments in Machine Learning and Data Analytics; Manogaran, G., Lopez, D., Chilamkurti, N., Eds.; Springer: Singapore, 2019; pp. 349–360.
35. Qian, C.; Xia, H.; Xia, T.; Liu, X.; Fu, C.; Zhao, D. Research and design of public health data reliability evaluation system based on machine learning. China Health Resour. 2023, 26, 244–248.
36. Maronna, R.A.; Yohai, V.J. Robust and efficient estimation of multivariate scatter and location. J. Comput. Stat. Data Anal. 2017, 109, 64–75.
37. Boudt, K.; Rousseeuw, P.J.; Vanduffel, S.; Verdonck, T. The minimum regularized covariance determinant estimator. J. Stat. Comput. 2020, 30, 113–128.
38. Dumas, C.F.; Devine, J.H. Detecting evidence of non-compliance in self-reported pollution emissions data: An application of Benford’s law. In Proceedings of the American Agricultural Economics Association Annual Meeting, Tampa, FL, USA, 30 July–2 August 2000; pp. 1–16.
39. Hill, T.P. A statistical derivation of the significant-digit law. Stat. Sci. 1995, 10, 354–363.
40. Brown, R.J.C. Benford’s law and the screening of analytical data: The case of pollutant concentrations in ambient air. Analyst 2005, 130, 1280–1285.
41. Durtschi, C.; Hillison, W.; Pacini, C. The effective use of Benford’s law to assist in detecting fraud in accounting data. J. Forensic Account. 2004, 5, 17–34.
42. Liu, X.J.; Xia, S.Y.; Yang, Y.; Wu, J.F.; Zhou, Y.N.; Ren, Y.W. Spatiotemporal evolution and influencing factors of PM2.5 in the Yangtze River Economic Belt. Resour. Environ. Yangtze Basin 2022, 31, 647–658.
43. He, Y.; Lin, K.; Liao, N.; Jiang, Y. Exploring the spatial effects and influencing factors of PM2.5 concentration in the Yangtze River Delta Urban Agglomerations of China. Atmos. Environ. 2022, 268, 118805.
44. Zhang, Y. Benford test of ambient air quality monitoring data in Jiangsu Province. J. Hebei Univ. Environ. Eng. 2021, 31, 91–94.

Figure 1. Geographical Distribution of Selected Cities in the Study Area.

Figure 2. First digit analysis (α = 0.05).

Figure 3. Model effect comparison chart.

Table 1. Frequency distribution of the first two significant digits of Benford’s law.

Digit	0	1	2	3	4	5	6	7	8	9
Benford’s frequency (first digit)		30.1%	17.6%	12.5%	9.7%	7.9%	6.7%	5.8%	5.1%	4.6%
Hill’s method (second digit)	12.0%	11.4%	10.9%	10.4%	10.0%	9.7%	9.3%	9.0%	8.8%	8.5%

Table 2. Impact factors.

Category	Factor	Symbol	Data Source
Socioeconomic factors	GDP per capita	$X_{1}$	China City Statistical Yearbook (2015–2022)
	Industrial structure	$X_{2}$	China Statistical Yearbook for Regional Economy (2015–2022)
	Urbanization	$X_{3}$	China Statistical Yearbook for Regional Economy (2015–2022)
	Energy consumption	$X_{4}$	China City Statistical Yearbook (2015–2022)
	Population density	$X_{5}$	China City Statistical Yearbook (2015–2022)
Natural environmental factors	Average annual wind speed	$X_{6}$	National Earth System Science Data Center: http://www.geodata.cn/ (accessed on 5 June 2024)
	Annual precipitation	$X_{7}$	Resource and Environmental Science and Data Platform: http://www.resdc.cn/ (accessed on 11 June 2024)
	Relative humidity	$X_{8}$	National Earth System Science Data Center: http://www.geodata.cn/ (accessed on 13 June 2024)
	Average annual temperature	$X_{9}$	Resource and Environmental Science and Data Platform: http://www.resdc.cn/ (accessed on 5 July 2024)
	Normalized vegetation index	$X_{10}$	Spatial Distribution of Normalized Difference Vegetation Index: http://www.resdc.cn/data.aspx (accessed on 6 July 2024)

Table 3. Descriptive statistics.

Variables	Unit	Obs	Mean	Std. Dev	Min	Max
Average annual PM2.5(Y)	μg/m³	2264	40.578	14.782	10.858	106.669
GDP per capita (X₁)	10,000 yuan	2264	6.227	3.456	1.099	25.691
Industrial structure (X₂)	%	2264	41.922	10.191	10.680	73.030
Urbanization (X₃)	%	2264	59.469	13.954	25.030	117.790
Industrial soot emissions (X₄)	10,000 tons	2264	14.579	11.885	1.490	80.690
Population density (X₅)	people/sq km	2264	497.247	693.374	5.750	9147.758
Average annual wind speed (X₆)	km/h	2264	2.203	0.465	1.106	3.859
Annual precipitation (X₇)	mm	2264	1086.392	564.854	49.457	4370.590
Relative humidity (X₈)	%	2264	70.029	10.201	36.853	89.499
Average annual temperature (X₉)	℃	2264	15.412	5.374	−2.532	25.912
Normalized vegetation index (X₁₀)		2264	0.728	0.127	0.113	0.903

Table 4. Multicollinearity test.

Variables	VIF
X₁	2.947
X₂	1.276
X₃	2.351
X₄	1.566
X₅	1.585
X₆	1.332
X₇	3.475
X₈	5.328
X₉	2.700
X₁₀	2.014
Mean VIF	2.457

Table 5. Frequency distribution of Benford’s law.

First Digit	Frequency	Relative Frequency	Theoretical Value
1	342,378	0.281	0.301
2	261,583	0.214	0.176
3	186,795	0.153	0.125
4	128,736	0.105	0.097
5	91,393	0.075	0.079
6	67,737	0.056	0.067
7	54,246	0.044	0.058
8	46,234	0.038	0.051
9	41,349	0.034	0.046

Table 6. Robust regression.

Variables	Coefficient	t-Value
Intercept	38.510 *	9.853
X₁	−1.412 *	−11.289
X₂	0.531 *	19.015
X₃	−0.069 *	−2.489
X₄	0.302 *	11.376
X₅	0.037 *	8.050
X₆	−3.501 *	−5.598
X₇	−0.792 *	−9.529
X₈	−0.138 *	−2.417
X₉	0.228 *	2.956
X₁₀	1.179 *	4.207

* The t-value is calculated by dividing the regression coefficient by its corresponding standard error: $t = \hat{\beta} / \mathrm{SE}(\hat{\beta})$.

Table 7. Robust regression to outliers.

City	Year	Residuals
Anyang	2015	37.642
Baoding	2015	46.470
Beijing	2015	40.929
Dezhou	2015	50.141
Dongying	2015	39.432
Heze	2015	36.778
Hengshui	2015	43.359
Jinan	2015	45.384
Liaocheng	2015	49.908
Xinxiang	2015	36.675
Xingtai	2015	47.370
Zhengzhou	2015	39.808
Zibo	2015	38.377
Jinan	2016	37.470
Liaocheng	2016	36.651
Shijiazhuang	2016	49.376
Urumqi	2016	43.175
Urumqi	2017	40.514
Residual standard deviation	11.410

Table 8. Alternative hyperparameters.

Hyperparameters	Alternative Sets
n_estimators	50, 100, 200, 500
max_depth	None, 10, 20, 30
min_samples_split	2, 5, 10
min_samples_leaf	1, 2, 4

Table 9. Random forest estimations.

Parameters	Results
n_estimators	50
max_depth	None
min_samples_split	2
min_samples_leaf	1
RMSE	2.02
R²	0.97

Table 10. Outlier Detection Using Random Forest.

Year	Cities
2015	Baoding, Beijing, Dezhou, Dongying, Harbin, Hangzhou, Jinan, Jingmen, Jingzhou, Kaifeng, Langfang, Liaocheng, Wuhan, Xinyang, Xingtai, Yichang, Changsha, Zhengzhou, Zibo
2016	Shijiazhuang
2017	Shijiazhuang
2018	Shijiazhuang
2019	Qujing
2020	Chaozhou
2022	Chaozhou, Fuzhou, Zhangzhou

Table 11. Alternative hyperparameters.

Hyperparameters	Alternative Sets
n_estimators	50, 100, 200, 500
max_depth	3, 5, 7
min_samples_split	2, 5, 10
min_samples_leaf	1, 2, 4
learning_rate	0.01, 0.1, 0.2

Table 12. Gradient Boosting Tree Estimation Results.

Parameters	Results
n_estimators	200
max_depth	5
min_samples_split	10
min_samples_leaf	4
learning_rate	0.2
RMSE	1.30
R²	0.99

Table 13. Outlier Detection Using Gradient Boosting Tree.

Year	Cities
2015	Baoding, Dezhou, Huaian, Langfang, Liaocheng
2016	Anshan, Binzhou, Jingmen, Sanmenxia, Shuozhou
2017	Changde, Chizhou, Handan, Hefei, Jilin, Shijiazhuang, Xiangyang
2018	Dezhou, Huanggang, Qinhuangdao, Yongzhou
2019	Liupanshui, Qujing, Suizhou
2020	Huainan
2021	Handan, Xianyang
2022	Cangzhou, Guyuan, Shanwei, Weinan

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
