Estimating PM2.5 Concentrations Using Spatially Local Xgboost Based on Full-Covered SARA AOD at the Urban Scale

Fan, Zhiyu; Zhan, Qingming; Yang, Chen; Liu, Huimin; Bilal, Muhammad

doi:10.3390/rs12203368

by Zhiyu Fan ¹,², Qingming Zhan ¹,²,*, Chen Yang ³, Huimin Liu ⁴ and Muhammad Bilal ⁵

¹ School of Urban Design, Wuhan University, 8 Donghu South Road, Wuhan 430072, China

² Collaborative Innovation Center of Geospatial Technology, 129 Luoyu Road, Wuhan 430079, China

³ College of Urban and Environmental Sciences, Peking University, Beijing 100871, China

⁴ Institute of Space and Earth Information Science, The Chinese University of Hong Kong, Shatin, NT, Hong Kong, China

⁵ School of Marine Sciences, Nanjing University of Information Science & Technology, Nanjing 210044, China

* Author to whom correspondence should be addressed.

Remote Sens. 2020, 12(20), 3368; https://doi.org/10.3390/rs12203368

Submission received: 18 August 2020 / Revised: 13 October 2020 / Accepted: 14 October 2020 / Published: 15 October 2020

(This article belongs to the Section Atmospheric Remote Sensing)

Abstract

The adverse effects of PM_2.5 have drawn extensive concern, and identifying its spatial distribution is of great significance. Satellite-derived aerosol optical depth (AOD) has been widely used for PM_2.5 estimation. However, its coarse spatial resolution and the gaps caused by data deficiency impede its application at the urban scale. Additionally, obtaining accurate results in unsampled areas when PM_2.5 ground sites are insufficient and sparsely distributed is also challenging. This paper aimed to develop a model, spatially local extreme gradient boosting (SL-XGB), that combines the powerful fitting ability of machine learning with the optimal bandwidths of local models to better estimate PM_2.5 concentrations at the urban scale, using Beijing as the study area. The paper adopted simplified high-resolution MODIS aerosol retrieval algorithm (SARA) AOD at 500 m resolution as the major independent variable, ensuring that the estimation can be performed at a fine scale. Moreover, the extreme gradient boosting (XGBoost) model was adopted to fill the gaps in SARA AOD, thus improving its availability. Then, based on full-covered SARA AOD and other multisource data, the SL-XGB model, which integrates multiple local XGBoost models, each with its own optimal bandwidth, was trained to estimate PM_2.5 concentrations. For comparison, SL-XGB and two other models, XGBoost and geographically weighted regression (GWR), were evaluated by 10-fold cross validation (CV). The sample-based CV results reveal that SL-XGB performed the best as assessed through R² (0.88), root mean square error (RMSE = 24.08 μg/m³) and mean prediction error (MPE = 16.90 μg/m³). SL-XGB also performed the best in the site-based CV, with an R² of 0.86, an RMSE of 26.15 μg/m³ and an MPE of 17.97 μg/m³, which shows its good spatial generalization ability. 
These results demonstrate that SL-XGB can better simultaneously handle non-linear and spatial heterogeneity issues despite spatially limited data at the urban scale. As far as the PM_2.5 concentration distribution was concerned, it presented a gradient increase in PM_2.5 concentrations from the northwest to the southeast in Beijing, with abundant spatial details. Overall, the proposed approach for PM_2.5 estimation showed outstanding performance and can support preventive pollution control and mitigation at the urban scale.

Keywords:

fine particulate matter (PM_2.5); SARA AOD; urban scale; extreme gradient boosting (XGBoost); spatially local model

1. Introduction

PM_2.5, with aerodynamic diameters smaller than 2.5 μm, has a sharp effect on the quality of living environment [1,2,3]. Numerous epidemiological and clinical studies have claimed that PM_2.5 has an unfavorable impact on human health, inducing cardiovascular and respiratory diseases such as lung cancer [4,5,6]. It was reported that PM_2.5 is associated with premature mortality and that it contributed to the death of 4.2 million people worldwide in 2015 [7,8]. In China, air pollution has been extremely serious in recent decades. Due to the unprecedented development of economy and urbanization, more and more people in China flock to cities. Thus, energy consumption caused by human activities (e.g., transportation, heating, cooking and industrial manufacture) keeps increasing, which results in many pollutants emitted to the atmosphere [9,10,11]. PM_2.5 has been listed as one of the primary pollutants in urban areas and is of considerable concern to the public and government [12,13]. To mitigate this problem, the Chinese government has established approximately 1500 urban PM_2.5 ground observation sites for real-time and accurate ground-level monitoring in more than 320 cities since 2012. However, the sparse and non-uniform distribution hampers the acquisition of full-covered PM_2.5 spatial distribution. Therefore, it is of great significance to estimate high-resolution and accurate continuous PM_2.5 distribution.

Aerosol optical depth (AOD), which measures light extinction in the atmospheric column, has been proven to be closely correlated with PM_2.5 and is widely used as the major predictor in PM_2.5 estimation [14,15,16]. Owing to instrumental and methodological developments, spatiotemporally continuous PM_2.5 distributions can be derived from satellite remote sensing AOD with broad spatial coverage [17]. Commonly used satellite-based AOD products, such as the 10 and 3 km Moderate Resolution Imaging Spectroradiometer (MODIS) AOD products, the 1 km Multi-Angle Implementation of Atmospheric Correction (MAIAC) AOD product, the 6 km Geostationary Ocean Color Imager (GOCI) AOD product and the 0.05° Advanced Himawari-8 Imager (AHI) AOD product, have been widely applied for PM_2.5 estimation [13,18,19,20,21,22]. The spatial resolutions of these AOD products are at the “km” level, which is too coarse for PM_2.5 estimation at the urban scale. Some scholars have also derived AOD using fusion algorithms and obtained PM_2.5 distributions within 160 m based on Gaofen-1 satellite images [23]. The spatial resolution is ultrahigh, but the temporal resolution is low because of the long revisit cycle of Gaofen-1. There are also some “sub-km” level AOD products, such as the 750 m Visible Infrared Imaging Radiometer Suite (VIIRS) Intermediate Product (IP) and the 500 m simplified high-resolution MODIS aerosol retrieval algorithm (SARA) AOD. These are more appropriate choices for PM_2.5 estimation at the urban scale because they offer both high spatial and high temporal resolution [24,25]. In addition, the non-random gaps in AOD data caused by cloud cover or high ground surface reflectance are a challenge in PM_2.5 estimation [26]. Clouds can contaminate AOD pixels and cause substantial data deficiency, which may negatively affect PM_2.5 concentration estimation [18]. 
Several simple methods such as geographical interpolation and multiple imputation have been employed and some deficient AOD values can be restored by using them [27,28,29]. Recently, some studies applied random forest (RF), a machine learning method, based on multisource data such as meteorology to obtain the fully covered AOD with good performance [30,31]. Thus, the powerful fitting ability of machine learning methods may provide a new guidance for the AOD gap filling.

The PM_2.5 estimation models that are dependent on the relationship between AOD and PM_2.5 can be categorized into two classes—traditional statistical models and machine learning models. Statistical models have been applied for PM_2.5 estimation for a long time. Early on, researchers applied some simple linear regression models with few simple auxiliary variables such as temperature to strengthen the representation of the relationship between AOD and PM_2.5 [32,33]. With the improvement in computation ability, some advanced statistical models were proposed. Among them, the performance of geographically weighted regression (GWR) is prominent and has been widely used at different scales [21,34,35]. GWR investigates the spatial heterogeneity of the relationship between AOD and PM_2.5 by using many local models instead of one global model to predict PM_2.5 concentrations. However, the model structure of GWR is relatively simple given its linear assumption, so it is difficult to characterize the relationship between AOD and PM_2.5 when both the dimension and amount of data are large. In recent years, machine learning models are becoming increasingly popular because of their impressive ability in terms of fitting data and the feasibility of processing complex data. To date, machine learning models such as random forest, extreme gradient boosting (XGBoost) and generalized regression neural networks (GRNNs) have been applied and some of them outperformed statistical models [20,36,37]. These machine learning models are global models, which means that they deal with the whole data directly and neglect the spatial relationship between sampling data points. However, as a geographical phenomenon, the distribution of PM_2.5 is significantly affected by spatial factors such as meteorology and land use. Moreover, the estimation of PM_2.5 by exploiting the relationship between AOD and PM_2.5 over an extensive area is implemented based on insufficient sampling PM_2.5 ground sites. 
Therefore, the spatial heterogeneity of the relationship between AOD and PM_2.5 cannot be well captured by global machine learning methods, which limits improvement in the accuracy of estimation.

In this paper, our goal is to propose a spatially local machine learning model that can capture the spatial heterogeneity of the relationship between AOD and PM_2.5 in order to estimate PM_2.5 accurately at the urban scale. We first used an XGBoost model to fill the gaps in the cloud-removal SARA AOD and obtain full-covered AOD. Then, the proposed spatially local XGBoost model (SL-XGB) was trained to estimate PM_2.5 concentrations. We also compared SL-XGB with global XGBoost and GWR using both sample-based and site-based 10-fold CV to examine the effect of our strategy.

The paper is organized as follows: Section 2 introduces the data, related methods and the introduction of SL-XGB; Section 3 mainly demonstrates the result of full-covered SARA AOD and the performance of SL-XGB in PM_2.5 estimation; Section 4 discusses some advantages and limitations of this work by comparing it with some other studies; Section 5 is the conclusion of this paper.

2. Materials and Methods

2.1. Study Area

Beijing is the capital of China, located in the North China Plain and surrounded by Taihang Mountain and Yanshan Mountain. Beijing has 16 districts, with Dongcheng, Xicheng, Haidian, Chaoyang, Fengtai, Shijingshan as the major urban areas and Mentougou, Yanqing, Daxing, Tongzhou, Miyun, Fangshan, Huairou, Changping, Shunyi and Pinggu as the suburban areas, as shown in Figure 1. Beijing has been suffering from PM_2.5 pollution for a long time, with increasing emissions, vehicle exhausts and residential energy consumption under rapid urbanization. Therefore, it is urgent to estimate fine and accurate PM_2.5 spatial distribution in Beijing for PM_2.5 pollution control.

2.2. Data

2.2.1. Ground-Level PM_2.5 Measurements

There are 35 PM_2.5 ground observation sites (Figure 1) in the study area established by Beijing Environmental Protection Bureau. In order to record the seasonal continuity (here, March, April and May were viewed as spring; June, July and August were viewed as summer; September, October and November were viewed as autumn; December, January and February were viewed as winter), we downloaded hourly PM_2.5 data spanning from March, 2016 to February, 2017 of Beijing from the China Environmental Monitoring Center (CEMC, http://106.37.208.233:20035/). Then daily average PM_2.5 data were calculated further in order to match independent variables temporally.
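The temporal aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' script: the record layout, the site name and the rule of skipping missing hours are assumptions, since the text does not say how incomplete days were handled. Only the season definition is taken from the paragraph.

```python
from collections import defaultdict
from datetime import datetime

# Season definition stated in the text.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def daily_means(hourly_records):
    """Average hourly PM2.5 values per (site, date).

    hourly_records: iterable of (site_id, datetime, pm25) tuples; pm25 may be
    None for a missing observation, which is skipped (an assumption).
    Returns {(site_id, date): (daily mean, season)}.
    """
    buckets = defaultdict(list)
    for site_id, timestamp, pm25 in hourly_records:
        if pm25 is not None:
            buckets[(site_id, timestamp.date())].append(pm25)
    return {
        key: (sum(values) / len(values), SEASON_BY_MONTH[key[1].month])
        for key, values in buckets.items()
    }


# Hypothetical records for one made-up site.
records = [
    ("site_A", datetime(2016, 3, 1, 0), 40.0),
    ("site_A", datetime(2016, 3, 1, 1), 60.0),
    ("site_A", datetime(2016, 3, 1, 2), None),
    ("site_A", datetime(2016, 12, 18, 9), 120.0),
]
result = daily_means(records)
```

Keeping the season label next to each daily mean makes it straightforward to split the samples later into the four seasonal subsets used for the AOD gap-filling models.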

2.2.2. Satellite-Based AOD

An AOD retrieval algorithm with a 500 m spatial resolution, SARA, developed by Bilal et al. [38], was adopted in our study. SARA AOD products were retrieved based on aerosol properties derived from the Aerosol Robotic Network (AERONET, https://aeronet.gsfc.nasa.gov/) and MODIS swath products (https://ladsweb.nascom.nasa.gov/) including MOD03 geolocation data, MOD02HKM-calibrated radiance data and MOD09 surface reflectance data (from Terra satellite). SARA shows the advantage of high accuracy in a complex atmospheric environment [39] and has been validated in Beijing and surrounding areas [40]. There are two AERONET sites in Beijing—the Beijing-CAMS site and the Beijing site. We applied the AOD data for the Beijing-CAMS site for SARA AOD retrieval and the AOD data for the Beijing site for validation (Figure S1). Considering the negative effect of clouds, we used MODIS cloud mask product MOD35 to remove the cloud-contaminated pixels. This operation would produce much missing AOD data. Adding the effect of some deficiency in AERONET observation data, the number of available days for SARA AOD retrieved was 197 in total. To improve the availability and reliability of SARA AOD, we applied XGBoost regression to fill the gaps and obtained the full-covered SARA AOD. More details about XGBoost are introduced in Section 2.3.1.

2.2.3. Meteorological Data

ERA-Interim reanalysis meteorological field data obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF, http://www.ecmwf.int/) was used in this study. Specifically, meteorological data included (1) temperature at 2 m (T2M), (2) relative humidity (RH), (3) surface pressure (SP), (4) eastward wind at 10 m above displacement height (U10M), (5) northward wind at 10 m above displacement height (V10M), and (6) planetary boundary layer height (PBLH). The spatial resolution of the data is 0.125° × 0.125°. Daily values of all meteorological variables were calculated by averaging the data obtained at 2:00, 8:00, 14:00 and 20:00 (Beijing time).

2.2.4. Geographical Data

Geographical data comprised four types: (1) the normalized difference vegetation index (NDVI), (2) a digital elevation model (DEM), (3) land use, and (4) socioeconomic data including population (POP), gross domestic product (GDP), the density of residential areas and restaurants (DRR), the density of industries (DI) and road length (RL). MODIS Terra level-3 16-day average NDVI products (MOD13A1) with a spatial resolution of 500 m were downloaded from the LAADS website (https://ladsweb.nascom.nasa.gov/). DEM products with a spatial resolution of 90 m were downloaded from the Shuttle Radar Topography Mission (http://srtm.csi.cgiar.org/). Land use data with a spatial resolution of 30 m, classified into forest, farmland, shrub land, water bodies, impervious land and bare land, were obtained from the Resource and Environment Data Cloud Platform (REDCP) (http://www.resdc.cn). Coverages of water bodies (CWB), impervious land (CII), farmland (CFA) and forest (CFO) were then calculated from the land use data. Population raster data, GDP raster data and road shapefile data were also obtained from REDCP (http://www.resdc.cn). RL was calculated from the road shapefile data. DRR and DI were calculated using kernel density analysis in ArcGIS 10.6 software based on Point of Interest (POI) data downloaded from Amap (https://www.amap.com/). Here, RL, DRR and DI represent the intensity of traffic, residential activity and industrial activity, respectively, which affect the emission of PM_2.5 [41,42].

2.2.5. Data Preparation

Several pre-processing steps were carried out before the experiments. All raster data were projected to the UTM_Zone_49N projected coordinate system using ENVI 5.3 software and resampled to the 500 m × 500 m grid defined by the SARA AOD. Then, PM_2.5 concentration records were matched to the grid cells in which the observation sites are located. All information on the variables is presented in Table S1.
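Matching a site to its grid cell amounts to integer division of projected coordinates by the cell size. The sketch below illustrates this under stated assumptions: the grid origin and site coordinates are hypothetical, and a north-up raster with its origin at the upper-left corner is assumed. Only the 500 m cell size comes from the text.

```python
import math

CELL_SIZE_M = 500.0  # grid resolution stated in the text


def grid_cell(easting, northing, x_min, y_max, cell_size=CELL_SIZE_M):
    """Return (row, col) of the cell containing a projected point.

    x_min / y_max are the coordinates of the grid's upper-left corner;
    rows increase southward and columns increase eastward, as in a
    north-up raster.
    """
    col = math.floor((easting - x_min) / cell_size)
    row = math.floor((y_max - northing) / cell_size)
    return row, col


# Hypothetical grid origin and site position (metres, projected CRS).
X_MIN, Y_MAX = 400_000.0, 4_500_000.0
row, col = grid_cell(401_250.0, 4_499_100.0, X_MIN, Y_MAX)
```

With every raster resampled to the same grid, the same (row, col) pair indexes AOD, meteorological and geographical predictors for a given site.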

2.3. Methods

2.3.1. Extreme Gradient Boosting (XGBoost) Regression

Extreme gradient boosting (XGBoost) regression is an ensemble machine learning algorithm that is widely used in data mining with excellent performance [43]. It is an improved version of the gradient boosting decision tree (GBDT) algorithm. It supports parallel computing and can also randomly select independent variables, as random forest (RF) does, so it is more efficient than other boosting machine learning models. In addition, in contrast to some other machine learning models such as RF, XGBoost has a more complex structure and introduces regularization terms in the loss function to control overfitting, so that it can better handle complex data. Therefore, for work with large amounts of data and multidimensional influencing factors, such as AOD gap filling and PM_2.5 estimation, XGBoost is the more appropriate choice. XGBoost has also been applied to the estimation of pollutants, where it shows better performance than some other statistical and machine learning models [44,45,46].

As a boosting method, XGBoost is the combination of many decision trees and each tree is generated by prior trees based on residuals. Each decision tree will generate a predicted value by split operation based on different independent variables. Then, after the construction of all decision trees, the sum of generated values will be the final predictions. The process can be described as Equation (1), where $h_{m}(X)$ is the decision tree in iteration $m$, $F_{m-1}(X)$ is the sum of the predictions of the $m-1$ previous trees and $\gamma_{m}$ is the learning rate in iteration $m$.

$$F(X) = \sum_{m=1}^{M} \gamma_{m} h_{m}(X) = F_{m-1}(X) + \gamma_{m} h_{m}(X)$$

(1)

For constructing $h_{m}(X)$, the parameters $\theta_{m}$ are obtained by optimizing the sum of the loss function $L(\theta_{m})$ and the regularizing item $\Omega(\theta_{m})$. This process is described as:

$$h_{m}(X) = \underset{\theta_{m}}{\operatorname{argmin}} \left[ L(\theta_{m}) + \Omega(\theta_{m}) \right]$$

(2)
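To make Equations (1) and (2) concrete, the following toy sketch builds an additive model from one-split trees fitted to residuals. It is not XGBoost itself: it assumes squared-error loss, a single predictor, a constant learning rate standing in for $\gamma_{m}$, and an L2 penalty `lam` on leaf values as the only regularization term. Under those assumptions the penalised optimal leaf value is the residual sum divided by (count + `lam`). The data are made up.

```python
def fit_stump(xs, residuals, lam):
    """Fit a one-split tree to the residuals with L2-penalised leaf values.

    For squared-error loss the penalised optimal leaf value is
    sum(residuals) / (count + lam); the split is chosen to maximise
    sum(residuals)^2 / (count + lam) summed over both leaves.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total = sum(residuals)
    best = None  # (score, threshold, left_value, right_value)
    left_sum, left_n = 0.0, 0
    for rank in range(len(order) - 1):
        i, nxt = order[rank], order[rank + 1]
        left_sum += residuals[i]
        left_n += 1
        if xs[i] == xs[nxt]:
            continue  # cannot split between identical feature values
        right_sum, right_n = total - left_sum, len(xs) - left_n
        score = left_sum ** 2 / (left_n + lam) + right_sum ** 2 / (right_n + lam)
        if best is None or score > best[0]:
            threshold = (xs[i] + xs[nxt]) / 2
            best = (score, threshold,
                    left_sum / (left_n + lam), right_sum / (right_n + lam))
    if best is None:  # all x identical: fall back to a single leaf
        value = total / (len(xs) + lam)
        return lambda x: value
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def boost(xs, ys, n_trees=50, learning_rate=0.3, lam=1.0):
    """Return F(x) = sum_m learning_rate * h_m(x), built tree by tree."""
    trees = []
    preds = [0.0] * len(xs)  # F_0 = 0
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals, lam)
        trees.append(tree)
        # F_m = F_{m-1} + learning_rate * h_m, as in Equation (1)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * tree(x) for tree in trees)


# Toy non-linear data (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9, 9.0, 9.1]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
```

The penalty shrinks every leaf value toward zero, which is the mechanism by which the regularization term in Equation (2) limits how much any single tree can change the prediction.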

We used XGBoost to obtain full-covered SARA AOD before estimating PM_2.5. Since AOD represents the column of aerosol from the Earth’s surface to the atmosphere, it is greatly affected by meteorological factors, so the six types of meteorological data mentioned above were used as the main independent variables. Following previous studies [30,31], we also used the day of year (DOY), the NDVI, the DEM, POP and land use coverages as auxiliary variables to indicate the effects of time and land surface on AOD. A grid search was carried out to find the optimal hyperparameters for the XGBoost model (Table S2). Given that AOD varies significantly across seasons in Beijing [31], we constructed a separate XGBoost model for each of the 4 seasons and compared them with an annual model trained on the data for the whole period.

As a comparison method, XGBoost was also applied for PM_2.5 estimation. In addition to full-covered SARA AOD and all the indicators mentioned in Section 2.2, DOY and season (represented by dummy variables for the 4 seasons) were also used as independent variables to account for the daily and seasonal variance of PM_2.5.

2.3.2. Geographically Weighted Regression (GWR)

Geographically weighted regression (GWR) is a local linear model that considers spatial heterogeneity and can better reveal the spatially non-stationary relationship between independent variables and dependent variables [47]. In fact, GWR constructs a local ordinary least squares (OLS) regression model at each sampling position. Some previous studies show that the relationship between AOD and PM_2.5 varies significantly and discontinuously in space due to changes in surface context [19,21]. They applied GWR to the estimation of PM_2.5 using AOD as the main independent variable and found that it outperformed other traditional statistical models.

In this paper, we used GWR as a comparison method for PM_2.5 estimation. Before fitting GWR, collinearity detection was performed to exclude variables with multicollinearity, using the Variance Inflation Factor (VIF) as the measurement [48]. We set the VIF threshold to 7.5, which means that variables with a VIF larger than 7.5 were excluded. Thus, compared with using XGBoost for PM_2.5 estimation, a few temporally constant variables such as “GDP” were removed when constructing the GWR model (Table S3). The form of GWR is given in Equation (3), where $(u_{i}, v_{i})$ represents the geographical position of sample $i$, $a_{0}(u_{i}, v_{i})$ is the intercept at $(u_{i}, v_{i})$, $m$ is the number of independent variables, $a_{k}(u_{i}, v_{i})$ is the coefficient of the $k$th independent variable $X_{k}$ at $(u_{i}, v_{i})$ and $\varepsilon(u_{i}, v_{i})$ is the corresponding error term. Bandwidth is an important parameter in GWR: samples outside the bandwidth range are not used in local model training [47,49]. An adaptive bandwidth strategy was used in this study for better performance.

$$PM_{2.5}(u_{i}, v_{i}) = a_{0}(u_{i}, v_{i}) + \sum_{k=1}^{m} a_{k}(u_{i}, v_{i}) X_{k} + \varepsilon(u_{i}, v_{i})$$

(3)
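The local fitting behind Equation (3) can be illustrated with a single predictor ($m = 1$), for which weighted least squares has a closed form. This sketch rests on several assumptions that the text does not confirm: a bi-square kernel for the GWR weights, an adaptive bandwidth equal to the distance to the $k$th nearest sample, and made-up coordinates and values in which the AOD to PM_2.5 slope differs between two clusters.

```python
import math


def bisquare(distance, bandwidth):
    """Bi-square kernel: the weight drops to 0 at and beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1.0 - (distance / bandwidth) ** 2) ** 2


def local_fit(target, coords, x, y, k):
    """Weighted least squares at `target` for the model y = a0 + a1 * x.

    The adaptive bandwidth is the distance from `target` to its k-th
    nearest sample, so dense areas get a smaller bandwidth than sparse ones.
    """
    dists = [math.hypot(target[0] - u, target[1] - v) for u, v in coords]
    bandwidth = sorted(dists)[k - 1]
    w = [bisquare(d, bandwidth) for d in dists]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


# Made-up samples: a western and an eastern cluster (coordinates in km)
# with different AOD-PM2.5 relationships.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0), (101, 0), (102, 0), (103, 0)]
aod = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]
pm25 = [20.0, 30.0, 40.0, 50.0, 50.0, 70.0, 90.0, 110.0]

west = local_fit((1.5, 0.0), coords, aod, pm25, k=5)
east = local_fit((101.5, 0.0), coords, aod, pm25, k=5)
```

A single global regression over these eight samples would blend the two slopes, whereas each local fit recovers the relationship of its own cluster. That is the spatial heterogeneity GWR is designed to capture.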

2.3.3. Spatially Local XGBoost (SL-XGB)

Compared with the regression process used to obtain the gap-filled SARA AOD in Section 2.3.1, constructing a model to fit the relationship between AOD and PM_2.5 is more challenging. The available PM_2.5 data were obtained from only 35 ground sites, which are not uniformly distributed in the study area (dense in urban areas and sparse in suburban areas). If a global model is used to predict PM_2.5 concentrations over the whole study area, it may produce biases because spatial heterogeneity in the relationship between AOD and PM_2.5 cannot be revealed by globally constant coefficients. GWR is well suited to describing local characteristics of the relationship between AOD and PM_2.5. However, some recent studies pointed out that GWR is built on the assumption of linearity, so it cannot capture the non-linearity in the relationship between AOD and PM_2.5 as machine learning methods such as RF and XGBoost can [13,20].

In order to better address both spatial heterogeneity and non-linearity, we proposed the spatially local XGBoost (SL-XGB) model. The strategy of SL-XGB is similar to that of GWR, but it uses XGBoost in place of OLS regression as the core regression method in each local model. The concept of “bandwidth” in GWR is introduced into our model: for each local site, data outside the bandwidth are not used for local model training. To better account for the spatial variance of the data, adaptive bandwidths are adopted in SL-XGB, which means that each local model has its own bandwidth. Specifically, the optimal bandwidth of a local model is selected from among the distances from all other sites to the local site. First, to obtain the optimal bandwidth of each local model, the root mean square error (RMSE) at each candidate distance is calculated by establishing a separate XGBoost model. Each of these models uses the data for the other sites within the corresponding distance as the training set and the data for the local site as the test set (not added to the training set). Then, the distance with the minimal RMSE is selected as the optimal local bandwidth. After that, the local XGBoost model is established based on the data within the optimal bandwidth. To predict PM_2.5 concentrations at non-sampling locations, we integrated the predictions of all local models as an average weighted by the geographical Bi-Square function. As a result, predictions in overlapping areas can also be calculated by utilizing two or more neighboring local models. Equation (4) shows the form of the geographical Bi-Square function and Equation (5) shows the prediction of PM_2.5 at position $i$.

In Equation (4), $W(i, j)$ is the weight of local XGBoost model $j$, $d_{ij}$ is the distance between position $i$ and position $j$, and $b_{j}$ is the optimal bandwidth of local XGBoost model $j$. In Equation (5), $(u_{i}, v_{i})$ represents the geographical coordinates of position $i$, $n$ is the number of all sampling ground sites, $X_{i}$ represents all independent variables at position $i$, and $F_{j}(X_{i})$ is the prediction of the $j$th local XGBoost model at position $i$. Figure 2 displays the structure and schematic of SL-XGB.

$$W(i, j) = \begin{cases} \left[1 - \left(\frac{d_{ij}}{b_{j}}\right)^{2}\right]^{2}, & d_{ij} < b_{j} \\ 0, & d_{ij} \geq b_{j} \end{cases}$$

(4)

$$PM_{2.5}(u_{i}, v_{i}) = \sum_{j=1}^{n} W(i, j) F_{j}(X_{i})$$

(5)
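The bandwidth search and the weighted combination in Equations (4) and (5) can be sketched as below. This is an illustrative reading of the procedure, not the authors' code. The learner `fit_mean` is a deliberately trivial stand-in for XGBoost, and the site layout and values are invented. The `normalize` option, which divides by the sum of weights, is an assumption suggested by the word "average" in the text; Equation (5) as printed has no such term.

```python
import math


def bisquare_weight(d_ij, b_j):
    """Equation (4): weight of local model j at distance d_ij."""
    if d_ij >= b_j:
        return 0.0
    return (1.0 - (d_ij / b_j) ** 2) ** 2


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def rmse(preds, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


def fit_mean(samples):
    """Stand-in learner (NOT XGBoost): predicts the training-target mean."""
    mean = sum(y for _, y in samples) / len(samples)
    return lambda x: mean


def select_bandwidth(local_id, sites, fit=fit_mean):
    """Pick the candidate distance whose model best predicts the local site.

    sites: {site_id: (coord, [(x, y), ...])}. Candidates are the distances
    from every other site to the local site; the local site's own samples
    are used only for testing.
    """
    local_coord, local_samples = sites[local_id]
    others = [(distance(local_coord, c), s)
              for sid, (c, s) in sites.items() if sid != local_id]
    best = None  # (rmse, bandwidth)
    for candidate in sorted({d for d, _ in others}):
        train = [pair for d, s in others if d <= candidate for pair in s]
        model = fit(train)
        error = rmse([model(x) for x, _ in local_samples],
                     [y for _, y in local_samples])
        if best is None or error < best[0]:
            best = (error, candidate)
    return best[1]


def combine(position, x, local_models, normalize=False):
    """Equation (5): sum_j W(i, j) * F_j(x).

    local_models: list of (coord_j, b_j, F_j). normalize=True divides by the
    weight sum (an assumption, not part of Equation (5) as printed).
    """
    weights = [bisquare_weight(distance(position, c), b) for c, b, _ in local_models]
    total = sum(w * f(x) for w, (_, _, f) in zip(weights, local_models))
    if normalize and sum(weights) > 0:
        return total / sum(weights)
    return total


# Invented sites: A and B are close and similar, C is far away and cleaner.
sites = {
    "A": ((0.0, 0.0), [(0.5, 80.0), (0.6, 90.0)]),
    "B": ((10.0, 0.0), [(0.5, 82.0), (0.6, 88.0)]),
    "C": ((60.0, 0.0), [(0.5, 30.0), (0.6, 40.0)]),
}
bandwidth_a = select_bandwidth("A", sites)

models = [((0.0, 0.0), 20.0, lambda x: 80.0), ((10.0, 0.0), 20.0, lambda x: 90.0)]
midpoint = combine((5.0, 0.0), None, models, normalize=True)
```

In this toy case the search excludes the distant site C because including it raises the error at site A, which mirrors the stated purpose of the bandwidth: filtering out samples that are spatially dissimilar to the local site.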

2.3.4. Model Evaluation

We applied a sample-based 10-fold CV technique to validate the results of the gap-filled SARA AOD and PM_2.5 estimation [50]. In sample-based CV, the entire data set was randomly split into 10 equal folds, with nine folds as the training set and the remaining fold as the test set for evaluating model fitting and prediction. This process was repeated 10 times until all folds had been tested. Moreover, in order to examine the spatial generalization ability of SL-XGB, we further used site-based CV in PM_2.5 estimation. Similar to sample-based CV, site-based CV used the data for 10% of the sites as the test set and the data for the other 90% of the sites as the training set, which shows the prediction ability in different geographical environments. Three statistical indicators, root mean square error (RMSE), mean prediction error (MPE, the mean absolute error of prediction) and coefficient of determination (R²), were used to measure model performance. XGBoost also provides variable importance to evaluate the effect of variables on the result. It is determined by how frequently a variable is used for tree splits across all decision trees. Given that our model includes many local XGBoost models, the average variable importance was also calculated. All experiments were conducted in Python 3.7.
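The three indicators and the two ways of forming folds can be written down compactly. The sketch below is a hedged illustration: it assumes R² is computed as one minus the ratio of residual to total sum of squares (the text does not give the formula), and the fold helpers, seeds and site labels are hypothetical.

```python
import math
import random


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mpe(obs, pred):
    """Mean prediction error, defined in the text as the mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (an assumption)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def k_folds(items, k=10, seed=0):
    """Shuffle items and deal them into k nearly equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def site_based_folds(sample_sites, k=10, seed=0):
    """Folds of sample indices in which each site's samples stay together."""
    folds = []
    for fold in k_folds(sorted(set(sample_sites)), k, seed):
        held_out = set(fold)
        folds.append([i for i, s in enumerate(sample_sites) if s in held_out])
    return folds


# Made-up observations and predictions.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]

# Hypothetical data set: 350 samples spread over 35 site labels.
sample_sites = [f"s{n % 35}" for n in range(350)]
sample_folds = k_folds(range(350))            # sample-based CV
site_folds = site_based_folds(sample_sites)   # site-based CV
```

The difference matters for interpretation: in sample-based folds the test samples come from sites the model has already seen on other days, whereas site-based folds hold out entire locations and therefore probe spatial generalization.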

3. Results

3.1. PM_2.5 and AOD Data Set Description

The minimum, maximum, mean and standard deviation of all variables are presented in Table S3. The maximum and mean PM_2.5 concentrations at the ground-level sites in the study period are 506 and 78.35 μg/m³, respectively. According to the Chinese standard for ambient air pollution [51], the mean PM_2.5 concentration falls in the “lightly polluted” range (75 μg/m³–115 μg/m³) and the maximum PM_2.5 concentration is more than twice the threshold of the “severe pollution” standard (250 μg/m³). Thus, the air pollution in Beijing is extremely serious and it is necessary to estimate the continuous distribution of PM_2.5.

We also calculated the minimum, maximum, mean and standard deviation of seasonal and annual AOD coverage (Table S4). Although some previous studies pointed out that SARA AOD has relatively high coverages compared with other AOD products [24,52], the annual coverage of SARA AOD ranged from 27% to 47% and the average coverages of four seasons in our study are only 45%, 42%, 35% and 33% for spring, summer, autumn and winter, respectively.

3.2. Missing AOD Filling

We constructed five XGBoost models for missing AOD filling based on the data for the four seasons and for the whole study period. Table 1 shows the sample-based CV results of the five models, where “N” represents the number of training samples. The four seasonal models achieve high accuracy, with an R² ranging from 0.90 to 0.94 and an RMSE ranging from 0.07 to 0.15. The results suggest that the complicated relationship between AOD and the other predictive variables can be captured by XGBoost. Aerosol characteristics are significantly affected by meteorological fields and vary distinctly between seasons [53]. Hence, the performance of the annual model, with an R² of 0.86 and an RMSE of 0.15, is not as good as that of the four seasonal models. Therefore, the four seasonal models were applied to obtain full-covered AOD for PM_2.5 estimation.

Figure 3 shows the difference between the cloud-removal SARA AOD and the gap-filled SARA AOD on 18 December 2016 (a specific example) and the difference between the annually averaged spatial distributions of cloud-removal AOD and gap-filled AOD. In (a) and (b), the gaps in the cloud-removal SARA AOD are restored and the gap-filled SARA AOD has strong spatial continuity. As for the annual results, the gap-filled AOD values are generally higher than the cloud-removal AOD values because the AOD data missing from the cloud-removal product are mostly lost to cloud cover: when the cloud fraction is high, the humidity increases, which promotes the growth of aerosols [29]. In terms of spatial distribution, gap-filled AOD and cloud-removal AOD have similar spatial patterns, with an increasing trend from northwest to southeast. That is because the terrain is flatter and there are more urban regions in the southeastern areas of Beijing [54]. In addition, the gap-filled AOD is smoother than the cloud-removal AOD, which suggests that the gap-filling process can restore more spatially continuous details.

3.3. PM_2.5 Estimation Model Performance

The fitting and sample-based validation performance of GWR, XGBoost and SL-XGB is shown in Table 2. Of the three models, GWR performed the worst, with an R² of 0.81 and 0.71 and an RMSE of 30.74 and 33.67 μg/m³ in the training set (for model fitting) and the test set (for model validation), respectively. In contrast, XGBoost and SL-XGB performed better and controlled overfitting better. The R² and the RMSE of XGBoost are 0.89 and 21.71 μg/m³ in the training set, and 0.85 and 27.01 μg/m³ in the test set. When the spatial heterogeneity between PM_2.5 and the independent variables is considered, SL-XGB performed better still, with an R² and an RMSE of 0.93 and 18.09 μg/m³ in the training set and 0.88 and 24.08 μg/m³ in the test set.

The density scatter plots of the two types of CV (sample-based and site-based) are displayed in Figure 4. The total number of points recorded in all the sub-graphs in Figure 4 is 12,620. SL-XGB performed best in both types of CV, with the highest R² (0.88 and 0.86 in sample-based CV and site-based CV, respectively) and the lowest RMSE and MPE. In addition, the superior performance of SL-XGB in site-based CV illustrates that SL-XGB adapts better to complex geographical environments than the other two models and has better spatial generalization ability. Although GWR takes spatial heterogeneity into consideration, it has difficulty characterizing the complicated relationship between PM_2.5 and the independent variables, so its performance is the worst in both types of CV. Moreover, the better performance of SL-XGB relative to XGBoost indicates that adopting the “local regression” strategy in non-linear machine learning improves PM_2.5 estimation. In fact, the “local” approach we applied was aimed at filtering out samples with large spatial heterogeneity compared to local samples. Although there may be slight overfitting in SL-XGB because a few local models have relatively few training samples, the overall accuracy is still better than that of XGBoost.

The spatial distribution of the sample-based CV RMSE at the PM_2.5 ground sites is given in Figure 5. Generally, all three models performed better in central urban areas than in suburban areas because of the high density of sites there. XGBoost and SL-XGB performed better than GWR at most sites, indicating the powerful fitting ability of machine learning. The RMSEs of SL-XGB and XGBoost in suburban areas are between 25 and 30 μg/m³. However, SL-XGB performs better in central urban areas, with most RMSEs ranging from 10 to 20 μg/m³, because the sub-models in urban areas can exploit more spatially homogeneous data, which helps to describe the local relationship between AOD and PM_2.5.

Bandwidth, which defines the optimal scale for prediction, is a very important parameter in each local model of SL-XGB, so the effect of bandwidth distance on SL-XGB performance needs to be explored further. Here, a sensitivity analysis based on polynomial fitting was conducted to explore the relationship between bandwidth distance and the corresponding overall RMSE, which is calculated by using the local sites’ data as the test sets. We divided the 35 sites in Beijing into urban sites (17 sites) and suburban sites (18 sites) according to the administrative division in which they are located (sites in Dongcheng, Xicheng, Haidian, Chaoyang, Fengtai and Shijingshan are urban sites and the rest are suburban sites); the results are given in Figure 6. The trend in the two fitted curves is similar, decreasing first and then rising. The rising part demonstrates that data from sites far away from the local site have a negative impact on local model training. However, when the bandwidth is too small, the training data are insufficient to capture the localized relationship between PM_2.5 and the independent variables, so the accuracy is also low. The optimal bandwidth differs between the two types of sites: approximately 20 km for urban sites and approximately 50 km for suburban sites. This is because the density of urban sites is higher than that of suburban sites, so abundant information for local model training can be collected within a smaller bandwidth. In addition, the overall optimal RMSE of urban sites is lower than that of suburban sites, which also demonstrates that utilizing spatially homogeneous data significantly increases accuracy.
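One way to read an optimal bandwidth off such a curve is sketched below. The text does not state the order of the polynomial, so a second-order fit is assumed here; the RMSE values are synthetic and were constructed to have a minimum at 20 km purely for illustration, and the function names are hypothetical.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(xi ** p for xi in x) for p in range(5)]              # sums of x^0 .. x^4
    t = [sum(yi * xi ** p for xi, yi in zip(x, y)) for p in range(3)]
    # Normal equations: m @ [c, b, a] = t
    m = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]

    def det3(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det3(m)
    coeffs = []
    for col in range(3):                  # Cramer's rule, one unknown at a time
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = t[r]
        coeffs.append(det3(mc) / d)
    c, b, a = coeffs
    return a, b, c

def optimal_bandwidth(bandwidths_km, rmses):
    """Bandwidth at the vertex of the fitted parabola (requires a convex fit)."""
    a, b, _ = fit_quadratic(bandwidths_km, rmses)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return -b / (2 * a)

# Synthetic RMSE-versus-bandwidth curve (NOT the paper's data).
bandwidths_km = [10, 20, 30, 40, 50, 60, 70, 80]
synthetic_rmse = [0.004 * (d - 20) ** 2 + 20.0 for d in bandwidths_km]
best_km = optimal_bandwidth(bandwidths_km, synthetic_rmse)
```

With real, noisy RMSE values the fitted minimum is only approximate, which is consistent with the optimal bandwidths being quoted as approximate figures.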

3.4. Variable Importance in SL-XGB

The values of variable importance in all local models were averaged and the top 10 important variables are shown in Figure 7. Among them, PBLH and AOD are the two most important variables, with variable importance values of 0.26 and 0.18, respectively. Here, PBLH contributed more than AOD. This may be related to the frequent occurrence of haze events in Beijing, when the PM_2.5 concentration is more sensitive to PBLH [55]. Moreover, PBLH was the mean of four time points and its temporal scale may be more similar to that of the PM_2.5 data. The variable importance value of V10 is higher than that of U10, so wind in a south–north direction has more impact on PM_2.5 than wind in an east–west direction. This may be due to the large amounts of pollutants from Hebei province, which lies to the south of Beijing, drifting to Beijing on south–north winds. Season and DOY are ranked as the 3rd and the 4th most important independent variables, respectively, demonstrating the strong seasonal and daily variations of PM_2.5. Because of high multicollinearity, these two variables were not used in the GWR model, which shows that machine learning models can accommodate a wider range of variables. CII also has a relatively important impact on the final result, which indicates the close relationship between urbanization and PM_2.5. However, most spatial geographical variables contributed little to the result because they are constant over time.
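The averaging step described above can be sketched as follows. The per-model importance values are invented for illustration; they are merely chosen so that the averages for PBLH and AOD come out at the reported 0.26 and 0.18. The function names are hypothetical, and a variable missing from a local model is treated as contributing zero, which is an assumption.

```python
def mean_importance(local_importances):
    """Average each variable's importance over all local models."""
    totals = {}
    for importance in local_importances:
        for name, value in importance.items():
            totals[name] = totals.get(name, 0.0) + value
    n = len(local_importances)
    return {name: total / n for name, total in totals.items()}

def top_k(importance, k=10):
    """Variables sorted by descending mean importance, truncated to k entries."""
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Invented importances from three local models (not the paper's values).
local_importances = [
    {"PBLH": 0.30, "AOD": 0.20, "V10": 0.10, "U10": 0.05},
    {"PBLH": 0.25, "AOD": 0.15, "V10": 0.08, "U10": 0.06},
    {"PBLH": 0.23, "AOD": 0.19, "V10": 0.09, "U10": 0.04},
]
averaged = mean_importance(local_importances)
ranking = top_k(averaged, k=3)
```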

3.5. Seasonal and Annual PM_2.5 Distribution

Seasonal and annual PM_2.5 spatial distributions estimated by SL-XGB are presented in Figure 8. Owing to the high resolution of SARA AOD, the PM_2.5 estimates provide more spatial detail than the results of some previous studies with a “km” level resolution [56,57]. In comparison, the use of full-covered SARA AOD provides smoother results with less noise interference [24,25]. PM_2.5 pollution was most serious in winter, with the highest average PM_2.5 concentration reaching 145 μg/m³ in some areas, roughly four times the “good” standard of the National Ambient Air Quality Standard of China (35 μg/m³). The overall average PM_2.5 concentration in summer is as low as 35 μg/m³ in some northern areas, although there is still some light pollution in the southeastern areas. This suggests a gradient of increasing PM_2.5 concentrations from the northwest to the southeast in all seasons. This pattern is similar to the spatial distribution of AOD, which also indicates a strong relationship between AOD and PM_2.5. There are two reasons for the high PM_2.5 concentrations in southeastern areas. One is that urban areas are mainly located in southeastern areas, so more pollutants are produced there. The other is that many high-emission factories in Hebei province are adjacent to these areas, so external pollutants also have an adverse impact. In northwestern areas, by contrast, the main land use types are forest and grassland, which helps to reduce air pollution to some extent.

3.6. The Effect of the AOD Gap-Fill Process on PM_2.5 Estimation

To explore the impact of the AOD gap-fill process on PM_2.5 estimation, we also used the SARA AOD with gaps to predict the PM_2.5 spatial distribution for comparison. For predictions in grids where ground sites are located, the annually averaged PM_2.5 concentration using the AOD gap-fill process (called “GF” hereafter) was 80 μg/m³. The average PM_2.5 concentration without using the gap-fill process (called “NGF” hereafter) was lower, at 64 μg/m³. The true annually averaged PM_2.5 concentration at the ground observation sites is 78 μg/m³, which is closer to the GF result. This means that the result will be underestimated without the AOD gap-fill process. Figure 9 presents the annually averaged spatial differences between GF and NGF (GF minus NGF) over the whole study area, and the average differences between the true PM_2.5 concentration and NGF and GF, respectively, at the PM_2.5 sites. The results show that the difference in northern areas is small, while the difference in southern areas is large (28 μg/m³), likely as a result of data deficiencies caused by misidentification in the cloud mask. On the one hand, there are many bright urban areas in the south of Beijing which are often misidentified as clouds. On the other hand, haze events from Hebei province often pass over this area via the North China Plain, so some data in severely polluted areas would be ignored without the gap-fill process [24,58]. In terms of the comparison with true PM_2.5 concentrations, the absolute differences for NGF are higher than those for GF at most ground sites, which also suggests the essential role of the gap-fill process in our study.
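The size of the two biases follows directly from the three annual means quoted above. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
observed = 78.0                          # annual mean at ground sites, μg/m³
estimates = {"GF": 80.0, "NGF": 64.0}    # with and without AOD gap filling, μg/m³

# Signed bias (estimate minus observation) and its size relative to the observation.
bias = {name: value - observed for name, value in estimates.items()}
relative_bias_pct = {name: round(100 * b / observed, 1) for name, b in bias.items()}
```

The gap-filled estimate is about 2 μg/m³ (2.6%) above the observed mean, whereas the estimate without gap filling is 14 μg/m³ (17.9%) below it.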

4. Discussion

This study presented a novel approach based on full-covered SARA AOD for PM_2.5 estimation. The results offer a relatively high spatial resolution and abundant spatial detail, and can be applied well at the urban scale. Compared with previous studies that used multisource data including satellite AOD for PM_2.5 estimation, our study has the following advantages. First, PM_2.5 estimates with a 500 m spatial resolution were derived for the urban scale. Many previous studies paid more attention to PM_2.5 estimation at a large scale, such as the national scale. However, for PM_2.5 pollution studies in a specific area, errors will be introduced if national data are used for estimation, because PM_2.5 is significantly affected by local geographical environments and the negative effects of spatial heterogeneity become larger. Many PM_2.5 estimation models based on data for a certain city or area have shown good performance, so it is worth developing a specific PM_2.5 estimation model when investigating PM_2.5 pollution at a small scale, such as the urban scale [25,59]. Thus, studies at a relatively small scale are also important. Compared with some previous studies using satellite AOD products at the urban or urban agglomeration scale [57,60], our study has a finer spatial resolution of 500 m by utilizing SARA AOD data. Additionally, compared with studies using a data fusion method for higher resolution [23], SARA AOD is retrieved directly from MODIS products, so it can be applied relatively easily. Therefore, the series of operations we applied is more practical.

Another strength of our study is the gap-fill process by XGBoost. Gaps in AOD have a significantly negative impact on PM_2.5 estimation. As shown in the analysis in Section 3.6, not adopting a gap-fill process leads to significant biases. To address AOD data deficiency, previous studies have adopted three types of methods to fill the gaps. The first is to utilize the daily, weekly, or seasonal relationship between AOD and PM_2.5 [61,62]. A disadvantage of this kind of method is its poor accuracy, because it is difficult to capture the relationship between AOD and PM_2.5 without introducing other independent variables. The second is to use interpolation or imputation, such as kriging interpolation (KI) and multiple imputation (MI) [29,53]. These methods are convenient to implement, but they cannot guarantee accuracy when a large amount of data is missing. The third, applied more recently, is machine learning. For example, some studies used RF based on multisource data such as meteorology to derive AOD and obtained high accuracy (sample-based CV R² higher than 0.9) and full coverage [30,31]. To our knowledge, this is the first time that XGBoost has been used to fill gaps in AOD at a fine scale. Compared with the RF model, XGBoost can guarantee high accuracy and high efficiency at the same time. Table S5 shows that XGBoost took much less time than RF to achieve almost the same accuracy. Thus, XGBoost may be the better choice for gap filling when a large share of the AOD data is missing.
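The machine learning route to gap filling follows a simple pattern: train on pixels where AOD was retrieved, then predict AOD where it is missing. The sketch below shows that pattern only. A one-variable least-squares line stands in for XGBoost, and the single predictor (`humidity`), the pixel records and the function names are all hypothetical rather than the variables or code used in this study.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (stand-in for XGBoost)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def fill_gaps(pixels, predictor="humidity"):
    """Train on pixels with a retrieved AOD, then predict AOD where it is missing."""
    known = [p for p in pixels if p["aod"] is not None]
    slope, intercept = fit_line([p[predictor] for p in known],
                                [p["aod"] for p in known])
    filled = []
    for p in pixels:
        if p["aod"] is None:
            value, source = slope * p[predictor] + intercept, "predicted"
        else:
            value, source = p["aod"], "retrieved"   # retrieved values are kept as-is
        filled.append({**p, "aod": value, "source": source})
    return filled

# Hypothetical pixels; None marks a gap (for example, a cloud-masked retrieval).
pixels = [
    {"id": 0, "humidity": 40.0, "aod": 0.30},
    {"id": 1, "humidity": 60.0, "aod": 0.50},
    {"id": 2, "humidity": 80.0, "aod": 0.70},
    {"id": 3, "humidity": 70.0, "aod": None},
]
full = fill_gaps(pixels)
```

Keeping a `source` flag on every pixel makes it straightforward to compare downstream PM_2.5 estimates on retrieved and predicted AOD separately.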

The proposed model, SL-XGB, showed outstanding performance in PM_2.5 estimation, with a fitting R² of 0.93 and a CV R² of 0.88, and is the main contribution of our study. In contrast to statistical models such as GWR used in some studies [63], SL-XGB, as a machine learning model, is better suited to model fitting, and its non-linear nature is more in line with the relationship between AOD and PM_2.5. Moreover, compared with some previous global models for PM_2.5 estimation [13,64], SL-XGB enhances prediction ability by taking the local relationship between the independent variables and PM_2.5 into consideration, so it is more robust when the global relationship is difficult to capture. Thus, the CV performance of our model is better than that in previous studies using Beijing as the study area [24,25,57] and in some studies using Beijing–Tianjin–Hebei (BTH, which includes Beijing and some additional areas) as the study area [60,65,66]. The comparisons are listed in Table 3. By combining the powerful fitting ability of machine learning with optimal local bandwidths, SL-XGB performs better in both sample-based CV and site-based CV than not only some statistical models, such as mixed-effects models, but also some global non-linear models, such as deep neural networks and random forest. Additionally, the spatial resolution of our study is the highest among these studies, so the results can be better utilized at the urban scale.

SL-XGB is an extension of XGBoost in space that considers the spatial heterogeneity of the relationship between the independent variables and PM_2.5. Compared with using only one global XGBoost model, the predictive accuracy can be improved by utilizing several local sub-models. According to the first law of geography, for a local geographical object, neighboring objects are more correlated with it than objects that are far away. That is to say, for each local model, excluding data that are far away from the local site from the training set can reduce spatial heterogeneity, which is beneficial for local model performance [67]. Therefore, bandwidth plays a very important role in model construction. As the results in Section 3.3 show, the optimal bandwidth is approximately 20 km for the urban local models and approximately 50 km for the suburban local models in our study. A similar experiment was conducted in an investigation of the relationship between local sites’ PM_2.5 concentrations and spatially neighboring sites’ lagged PM_2.5 concentrations, which demonstrated that the optimal window size is also approximately 50 km [68]. Given that that study was conducted at an urban agglomeration scale with a relatively low density of ground sites, it may be considered that when sites are distributed sparsely, the PM_2.5 data for ground sites within 50 km may be more appropriate for most local model training. Additionally, the optimal bandwidth may vary between areas because of differences in geographical environments and socioeconomic development, so it deserves more discussion in the future by extending this work to more areas.
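The role of the bandwidth in building local sub-models can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes great-circle (haversine) distance between sites, uses a single bandwidth for all sites, and plugs in a trivial mean predictor where SL-XGB would train an XGBoost model. The site coordinates, sample values and function names are hypothetical, and the sketch does not cover how local models are combined for prediction at unsampled locations.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def local_training_sites(local, sites, bandwidth_km):
    """Names of sites (including the local one) within the bandwidth."""
    lat0, lon0 = sites[local]
    return [name for name, (lat, lon) in sites.items()
            if haversine_km(lat0, lon0, lat, lon) <= bandwidth_km]

def fit_local_models(sites, samples, bandwidth_km, fit):
    """Train one model per site on samples from sites within the bandwidth."""
    models = {}
    for local in sites:
        keep = set(local_training_sites(local, sites, bandwidth_km))
        models[local] = fit([s for s in samples if s["site"] in keep])
    return models

def fit_mean(rows):
    """Stand-in learner: always predicts the mean PM2.5 of its training rows."""
    mean = sum(r["pm25"] for r in rows) / len(rows)
    return lambda features=None: mean

# Hypothetical sites (lat, lon): A and B are about 7 km apart, C is over 80 km away.
sites = {"A": (39.90, 116.40), "B": (39.95, 116.45), "C": (40.60, 116.90)}
samples = [{"site": "A", "pm25": 100.0}, {"site": "B", "pm25": 80.0},
           {"site": "C", "pm25": 30.0}]
models_20km = fit_local_models(sites, samples, 20.0, fit_mean)
models_100km = fit_local_models(sites, samples, 100.0, fit_mean)
```

With the 20 km bandwidth the model for site A is trained on A and B only, while the 100 km bandwidth also pulls in the distant and much cleaner site C, which is the kind of spatially heterogeneous data the bandwidth is meant to exclude.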

However, there are still some limitations in this study and further work is needed. The first is that our model did not consider temporal heterogeneity. Some previous studies constructed local models such as geographically and temporally weighted regression (GTWR) [56,69] by introducing the concept of “spatiotemporal bandwidth” for improvement in both the spatial and temporal domains, whereas the spatiotemporal relationship is still not clear for some non-parametric models such as XGBoost. Thus, in future research, we will consider temporal heterogeneity in machine learning model structures. Secondly, limited by data availability, PM_2.5 concentrations were estimated from only 35 sites in Beijing, which may produce some biases. Thirdly, the performance of SL-XGB on larger spatial scales or in other cities needs to be explored, and further work on this will be conducted. Moreover, in terms of the AOD gap-fill process, the spatiotemporal relationship between AOD and the independent variables, and some other variables such as particle emission data, may be further introduced into the model for better performance. Lastly, we estimated PM_2.5 concentrations based on one year’s data, so the performance of SL-XGB over multiple years remains to be evaluated, which may be achieved using tools such as Google Earth Engine (GEE) [3].

5. Conclusions

In this paper, we proposed a novel model, SL-XGB, for PM_2.5 estimation over Beijing based on full-covered SARA AOD at a 500 m spatial resolution. Firstly, full-covered AOD was obtained by XGBoost with good performance (sample-based CV R² higher than 0.9 in all seasons), which was very helpful for increasing the availability of SARA AOD and reducing biases in PM_2.5 estimation. Secondly, we showed that the “local regression” strategy of GWR can be introduced into XGBoost to improve performance when sampling points are insufficient. Compared with GWR (with a sample-based CV R² of 0.71 and an RMSE of 33.67 μg/m³) and XGBoost (with a sample-based CV R² of 0.85 and an RMSE of 27.01 μg/m³), SL-XGB had better applicability in estimating PM_2.5 (with a sample-based CV R² of 0.88 and an RMSE of 24.08 μg/m³), as it can better address non-linearity and spatial heterogeneity based on the powerful fitting ability of XGBoost and the division by optimal bandwidth. Moreover, in site-based CV, SL-XGB showed the best spatial generalization ability among the three models, with an R² of 0.86, an RMSE of 26.15 μg/m³ and an MPE of 17.97 μg/m³. In terms of the result, the high spatial resolution of the estimation provided more spatial detail of the PM_2.5 distribution at the urban scale. The estimates show that PM_2.5 pollution in Beijing was serious, especially in winter, and that pollution prevention in the southeastern areas of Beijing should be given more attention.

Despite some limitations, the present study can still provide great support for fine-scale urban PM_2.5 exposure assessment and for air pollution management and control. In the future, we will incorporate temporal heterogeneity and test the model performance in more urban areas using multitemporal data. Additionally, as more and more PM_2.5 ground sites are established in Chinese cities, the spatial heterogeneity of the relationship between AOD and PM_2.5 can be better explored and more accurate results are expected to be obtained by using our approach.

Supplementary Materials

The following are available online at https://www.mdpi.com/2072-4292/12/20/3368/s1, Figure S1: Regression scattering plots of SARA AOD and AERONET AOD in Beijing site. Table S1: Data information and related variable acronym. Table S2: Parameters selection in XGBoost. Table S3: Variables description. Table S4: Seasonal and annual AOD coverage. Table S5: Cross-validation result comparison of XGBoost and Random Forest (RF).

Author Contributions

Conceptualization, Z.F. and Q.Z.; Methodology, Z.F.; Software, Z.F. and M.B.; Validation, Z.F.; Formal Analysis, Z.F.; Investigation, Z.F.; Resources, Q.Z.; Data Curation, Z.F. and Q.Z.; Writing—Original Draft Preparation, Z.F.; Writing—Review and Editing, C.Y., H.L. and Q.Z.; Supervision, Q.Z., C.Y. and H.L.; Project Administration, Q.Z.; Funding Acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (No. 51878515, 41331175 and 51378399).

Acknowledgments

We very much appreciate the support of the Beijing Municipal Environmental Monitoring Center (BMEMC) in providing PM_2.5 monitoring data and the European Centre for Medium-Range Weather Forecasts (ECMWF) in providing ERA-Interim reanalysis meteorological data. In addition, we sincerely appreciate all the anonymous reviewers for their excellent comments and efforts.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Anenberg, S.C.; Horowitz, L.W.; Tong, D.Q.; West, J.J. An estimate of the global burden of anthropogenic ozone and fine particulate matter on premature human mortality using atmospheric modeling. Environ. Health Perspect. 2010, 118, 1189–1195.
2. Ezzati, M.; Lopez, A.D.; Rodgers, A.A.; Murray, C.J. Comparative Quantification of Health Risks: Global and Regional Burden of Disease Attributable to Selected Major Risk Factors; World Health Organization (WHO): Geneva, Switzerland, 2004; pp. 1353–1378.
3. Fuentes, M.; Millard, K.; Laurin, E. Big geospatial data analysis for Canada’s Air Pollutant Emissions Inventory (APEI): Using Google Earth Engine to estimate particulate matter from exposed mine disturbance areas. GISci. Remote Sens. 2020, 57, 245–257.
4. Brook, R.D.; Sanjay, R.; Arden, P.C.; Brook, J.R.; Aruni, B.; Diez-Roux, A.V.; Fernando, H.; Yuling, H.; Luepker, R.V.; Mittleman, M.A. Particulate matter air pollution and cardiovascular disease: An update to the scientific statement from the American Heart Association. Circulation 2010, 121, 2331–2378.
5. Arden, P.C.; Burnett, R.T.; Thun, M.J.; Calle, E.E.; Daniel, K.; Kazuhiko, I.; Thurston, G.D. Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. JAMA 2002, 287, 1132–1141.
6. Francesca, D.; Peng, R.D.; Bell, M.L.; Luu, P.; Aidan, M.D.; Zeger, S.L.; Samet, J.M. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA 2006, 295, 1127.
7. Forouzanfar, M.H.; Afshin, A.; Alexander, L.T.; Anderson, H.R.; Bhutta, Z.A.; Biryukov, S.; Brauer, M.; Burnett, R.; Cercy, K.; Charlson, F.J. Global, regional, and national comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks, 1990–2015: A systematic analysis for the Global Burden of Disease Study 2015. Lancet 2016, 388, 1659–1724.
8. Lelieveld, J.; Evans, J.S.; Fnais, M.; Giannadaki, D.; Pozzer, A. The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature 2015, 525, 367–371.
9. Chafe, Z.A.; Brauer, M.; Klimont, Z.; Van Dingenen, R.; Mehta, S.; Rao, S.; Riahi, K.; Dentener, F.; Smith, K.R. Household cooking with solid fuels contributes to ambient PM_2.5 air pollution and the burden of disease. Environ. Health Perspect. 2014, 122, 1314–1320.
10. Chen, Z.; Wang, J.N.; Ma, G.X.; Zhang, Y.S. China tackles the health effects of air pollution. Lancet 2013, 382, 1959–1960.
11. Han, L.; Zhou, W.; Li, W.; Li, L. Impact of urbanization level on urban air quality: A case of fine particles (PM_2.5) in Chinese cities. Environ. Pollut. 2014, 194, 163–170.
12. Sun, L.; Wei, J.; Duan, D.; Guo, Y.; Yang, D.; Jia, C.; Mi, X. Impact of Land-Use and Land-Cover Change on urban air quality in representative cities of China. J. Atmos. Sol. Terr. Phys. 2016, 142, 43–54.
13. Wei, J.; Huang, W.; Li, Z.; Xue, W.; Peng, Y.; Sun, L.; Cribb, M. Estimating 1-km-resolution PM_2.5 concentrations across China using the space-time random forest approach. Remote Sens. Environ. 2019, 231, 111221.
14. Engel-Cox, J.A.; Holloman, C.H.; Coutant, B.W.; Hoff, R.M. Qualitative and quantitative evaluation of MODIS satellite sensor data for regional and urban scale air quality. Atmos. Environ. 2004, 38, 2495–2509.
15. Guo, J.P.; Zhang, X.Y.; Che, H.Z.; Gong, S.L.; An, X.; Cao, C.X.; Guang, J.; Zhang, H.; Wang, Y.Q.; Zhang, X.C. Correlation between PM concentrations and aerosol optical depth in eastern China. Atmos. Environ. 2009, 43, 5876–5886.
16. Lin, C.; Li, Y.; Yuan, Z.; Lau, A.K.; Li, C.; Fung, J.C. Using satellite remote sensing data to estimate the high-resolution distribution of ground-level PM_2.5. Remote Sens. Environ. 2015, 156, 117–128.
17. Diao, M.; Holloway, T.; Choi, S.; O’Neill, S.M.; Al-Hamdan, M.Z.; Van Donkelaar, A.; Martin, R.V.; Jin, X.; Fiore, A.M.; Henze, D.K. Methods, availability, and applications of PM_2.5 exposure estimates derived from ground measurements, satellite, and atmospheric models. J. Air Waste Manag. Assoc. 2019, 69, 1391–1414.
18. Park, S.; Lee, J.; Im, J.; Song, C.K.; Choi, M.; Kim, J.; Lee, S.; Park, R.; Kim, S.M.; Yoon, J. Estimation of spatially continuous daytime particulate matter concentrations under all sky conditions through the synergistic use of satellite-based AOD and numerical models. Sci. Total Environ. 2020, 713, 136516.
19. Hu, X.; Waller, L.A.; Al-Hamdan, M.Z.; Crosson, W.L.; Estes Jr, M.G.; Estes, S.M.; Quattrochi, D.A.; Sarnat, J.A.; Liu, Y. Estimating ground-level PM_2.5 concentrations in the southeastern US using geographically weighted regression. Environ. Res. 2013, 121, 1–10.
20. Li, T.; Shen, H.; Zeng, C.; Yuan, Q.; Zhang, L. Point-surface fusion of station measurements and satellite observations for mapping PM_2.5 distribution in China: Methods and assessment. Atmos. Environ. 2017, 152, 477–489.
21. Song, W.; Jia, H.; Huang, J.; Zhang, Y. A satellite-based geographically weighted regression model for regional PM_2.5 estimation over the Pearl River Delta region in China. Remote Sens. Environ. 2014, 154, 1–7.
22. Wang, W.; Mao, F.; Du, L.; Pan, Z.; Gong, W.; Fang, S. Deriving hourly PM_2.5 concentrations from Himawari-8 AODs over Beijing–Tianjin–Hebei in China. Remote Sens. 2017, 9, 858.
23. Zhang, T.; Zhu, Z.; Gong, W.; Zhu, Z.; Sun, K.; Wang, L.; Huang, Y.; Mao, F.; Shen, H.; Li, Z. Estimation of ultrahigh resolution PM_2.5 concentrations in urban areas using 160 m Gaofen-1 AOD retrievals. Remote Sens. Environ. 2018, 216, 91–104.
24. Xie, Y.; Wang, Y.; Bilal, M.; Dong, W. Mapping daily PM_2.5 at 500 m resolution over Beijing with improved hazy day performance. Sci. Total Environ. 2019, 659, 410–418.
25. Yao, F.; Wu, J.; Li, W.; Peng, J. Estimating Daily PM_2.5 Concentrations in Beijing Using 750-M VIIRS IP AOD Retrievals and a Nested Spatiotemporal Statistical Model. Remote Sens. 2019, 11, 841.
26. Yu, C.; Di Girolamo, L.; Chen, L.; Zhang, X.; Liu, Y. Statistical evaluation of the feasibility of satellite-retrieved cloud parameters as indicators of PM_2.5 levels. J. Expo. Sci. Environ. Epidemiol. 2015, 25, 457.
27. Unnithan, S.K.; Gnanappazham, L. Spatiotemporal mixed effects modeling for the estimation of PM_2.5 from MODIS AOD over the Indian subcontinent. GISci. Remote Sens. 2020, 57, 159–173.
28. Kloog, I.; Koutrakis, P.; Coull, B.A.; Lee, H.J.; Schwartz, J. Assessing temporally and spatially resolved PM_2.5 exposures for epidemiological studies using satellite aerosol optical depth measurements. Atmos. Environ. 2011, 45, 6267–6275.
29. Xiao, Q.; Wang, Y.; Chang, H.H.; Meng, X.; Geng, G.; Lyapustin, A.; Liu, Y. Full-coverage high-resolution daily PM_2.5 estimation using MAIAC AOD in the Yangtze River Delta of China. Remote Sens. Environ. 2017, 199, 437–446.
30. Zhang, R.; Di, B.; Luo, Y.; Deng, X.; Grieneisen, M.L.; Wang, Z.; Yao, G.; Zhan, Y. A nonparametric approach to filling gaps in satellite-retrieved aerosol optical depth for estimating ambient PM_2.5 levels. Environ. Pollut. 2018, 243, 998–1007.
31. Zhao, C.; Liu, Z.; Wang, Q.; Ban, J.; Chen, N.X.; Li, T. High-resolution daily AOD estimated to full coverage using the random forest model approach in the Beijing-Tianjin-Hebei region. Atmos. Environ. 2019, 203, 70–78.
32. Chu, D.A.; Kaufman, Y.; Zibordi, G.; Chern, J.; Mao, J.; Li, C.; Holben, B. Global monitoring of air pollution over land from the Earth Observing System-Terra Moderate Resolution Imaging Spectroradiometer (MODIS). J. Geophys. Res. Atmos. 2003, 108.
33. Gupta, P.; Christopher, S.A. Particulate matter air quality assessment using integrated surface, satellite, and meteorological products: Multiple regression approach. J. Geophys. Res. Atmos. 2009, 114.
34. Van Donkelaar, A.; Martin, R.V.; Spurr, R.J.; Burnett, R.T. High-resolution satellite-derived PM_2.5 from optimal estimation and geographically weighted regression over North America. Environ. Sci. Technol. 2015, 49, 10482–10491.
35. You, W.; Zang, Z.; Zhang, L.; Li, Y.; Pan, X.; Wang, W. National-scale estimates of ground-level PM_2.5 concentration in China using geographically weighted regression based on 3 km resolution MODIS AOD. Remote Sens. 2016, 8, 184.
36. Xiao, Q.; Chang, H.H.; Geng, G.; Liu, Y. An ensemble machine-learning model to predict historical PM_2.5 concentrations in China from satellite data. Environ. Sci. Technol. 2018, 52, 13260–13269.
37. Xu, Y.; Ho, H.C.; Wong, M.S.; Deng, C.; Shi, Y.; Chan, T.-C.; Knudby, A. Evaluation of machine learning techniques with multiple remote sensing datasets in estimating monthly concentrations of ground-level PM_2.5. Environ. Pollut. 2018, 242, 1417–1426.
38. Bilal, M.; Nichol, J.E.; Bleiweiss, M.P.; Dubois, D. A Simplified high resolution MODIS Aerosol Retrieval Algorithm (SARA) for use over mixed surfaces. Remote Sens. Environ. 2013, 136, 135–145.
39. Bilal, M.; Nichol, J.E. Evaluation of MODIS aerosol retrieval algorithms over the Beijing-Tianjin-Hebei region during low to very high pollution events. J. Geophys. Res. Atmos. 2015, 120, 7941–7957.
40. Bilal, M.; Nichol, J.E.; Chan, P.W. Validation and accuracy assessment of a Simplified Aerosol Retrieval Algorithm (SARA) over Beijing under low and high aerosol loadings and dust storms. Remote Sens. Environ. 2014, 153, 50–60.
41. Di, Q.; Amini, H.; Shi, L.; Kloog, I.; Silvern, R.; Kelly, J.; Sabath, M.B.; Choirat, C.; Koutrakis, P.; Lyapustin, A. An ensemble-based model of PM_2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ. Int. 2019, 130, 104909.
42. Xu, M.; Sbihi, H.; Pan, X.; Brauer, M. Local variation of PM_2.5 and NO₂ concentrations within metropolitan Beijing. Atmos. Environ. 2019, 200, 254–263.
43. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA; pp. 785–794.
44. Just, A.; De Carli, M.; Shtein, A.; Dorman, M.; Lyapustin, A.; Kloog, I. Correcting Measurement Error in Satellite Aerosol Optical Depth with Machine Learning for Modeling PM_2.5 in the Northeastern USA. Remote Sens. 2018, 10, 803.
45. Reid, C.E.; Jerrett, M.; Petersen, M.L.; Pfister, G.G.; Morefield, P.E.; Tager, I.B.; Raffuse, S.M.; Balmes, J.R. Spatiotemporal prediction of fine particulate matter during the 2008 Northern California wildfires using machine learning. Environ. Sci. Technol. 2015, 49, 3887–3896.
46. Zhai, B.; Chen, J. Development of a stacked ensemble model for forecasting and analyzing daily average PM_2.5 concentrations in Beijing, China. Sci. Total Environ. 2018, 635, 644–658.
47. Brunsdon, C.; Fotheringham, A.S.; Charlton, M.E. Geographically weighted regression: A method for exploring spatial nonstationarity. Geogr. Anal. 1996, 28, 281–298.
48. Fox, J. Applied Regression Analysis and Generalized Linear Models; Sage Publications: Newbury Park, CA, USA, 2015; pp. 341–358.
49. Fotheringham, A.S.; Brunsdon, C.; Charlton, M. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships; John Wiley & Sons: Hoboken, NJ, USA, 2003; pp. 27–65.
50. Rodriguez, J.D.; Perez, A.; Lozano, J.A. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 569–575.
51. China, M. Ambient Air Quality Standards; GB 3095-2012; China Environmental Science Press: Beijing, China, 2012.
52. Bai, Y.; Wu, L.; Qin, K.; Zhang, Y.; Shen, Y.; Zhou, Y. A geographically and temporally weighted regression model for ground-level PM_2.5 estimation from satellite-derived 500 m resolution AOD. Remote Sens. 2016, 8, 262.
53. Hu, H.; Hu, Z.; Zhong, K.; Xu, J.; Zhang, F.; Zhao, Y.; Wu, P. Satellite-based high-resolution mapping of ground-level PM_2.5 concentrations over East China using a spatiotemporal regression kriging model. Sci. Total Environ. 2019, 672, 479–490.
54. Yang, J.; Hu, M. Filling the missing data gaps of daily MODIS AOD using spatiotemporal interpolation. Sci. Total Environ. 2018, 633, 677–683.
55. Luan, T.; Guo, X.; Guo, L.; Zhang, T. Quantifying the relationship between PM_2.5 concentration, visibility and planetary boundary layer height for long-lasting haze and fog–haze mixed events in Beijing. Atmos. Chem. Phys. 2018, 18, 203.
56. Guo, Y.; Tang, Q.; Gong, D.Y.; Zhang, Z. Estimating ground-level PM_2.5 concentrations in Beijing using a satellite-based geographically and temporally weighted regression model. Remote Sens. Environ. 2017, 198, 140–149.
57. Xie, Y.; Wang, Y.; Zhang, K.; Dong, W.; Lv, B.; Bai, Y. Daily estimation of ground-level PM_2.5 concentrations over Beijing using 3 km resolution MODIS AOD. Environ. Sci. Technol. 2015, 49, 12280–12288.
58. Ma, X.; Wang, J.; Yu, F.; Jia, H.; Hu, Y. Can MODIS AOD be employed to derive PM_2.5 in Beijing-Tianjin-Hebei over China? Atmos. Res. 2016, 181, 250–256.
59. Yang, L.; Xu, H.; Yu, S. Estimating PM_2.5 concentrations in Yangtze River Delta region of China using random forest model and the Top-of-Atmosphere reflectance. J. Environ. Manag. 2020, 272, 111061.
60. He, Q.; Huang, B. Satellite-based high-resolution PM_2.5 estimation over the Beijing-Tianjin-Hebei region of China using an improved geographically and temporally weighted regression model. Environ. Pollut. 2018, 236, 1027–1037.
61. Goldberg, D.L.; Gupta, P.; Wang, K.; Jena, C.; Zhang, Y.; Lu, Z.; Streets, D.G. Using gap-filled MAIAC AOD and WRF-Chem to estimate daily PM_2.5 concentrations at 1 km resolution in the Eastern United States. Atmos. Environ. 2019, 199, 443–452.
62. Lv, B.; Hu, Y.; Chang, H.H.; Russell, A.G.; Bai, Y. Improving the accuracy of daily PM_2.5 distributions derived from the fusion of ground-level measurements with aerosol optical depth observations, a case study in North China. Environ. Sci. Technol. 2016, 50, 4752–4759.
63. Zhai, L.; Li, S.; Zou, B.; Sang, H.; Fang, X.; Xu, S. An improved geographically weighted regression model for PM_2.5 concentration estimation in large areas. Atmos. Environ. 2018, 181, 145–154.
64. Hu, X.; Belle, J.H.; Meng, X.; Wildani, A.; Waller, L.A.; Strickland, M.J.; Liu, Y. Estimating PM_2.5 concentrations in the conterminous United States using the random forest approach. Environ. Sci. Technol. 2017, 51, 6936–6944.
65. Wang, X.; Sun, W. Meteorological parameters and gaseous pollutant concentrations as predictors of daily continuous PM_2.5 concentrations using deep neural network in Beijing–Tianjin–Hebei, China. Atmos. Environ. 2019, 211, 128–137.
66. Zhao, C.; Wang, Q.; Ban, J.; Liu, Z.; Zhang, Y.; Ma, R.; Li, S.; Li, T. Estimating the daily PM_2.5 concentration in the Beijing-Tianjin-Hebei region using a random forest model with a 0.01° × 0.01° spatial resolution. Environ. Int. 2020, 134, 105297.
Wang, S.; Liu, X.; Yang, X.; Zou, B.; Wang, J. Spatial variations of PM_2.5 in Chinese cities for the joint impacts of human activities and natural conditions: A global and local regression perspective. J. Clean. Prod. 2018, 203, 143–152. [Google Scholar] [CrossRef]
Bai, K.; Li, K.; Chang, N.-B.; Gao, W. Advancing the prediction accuracy of satellite-based PM_2.5 concentration mapping: A perspective of data mining through in situ PM_2.5 measurements. Environ. Pollut. 2019, 254, 113047. [Google Scholar] [CrossRef] [PubMed]
He, Q.; Huang, B. Satellite-based mapping of daily high-resolution ground PM_2.5 in China via space-time regression modeling. Remote Sens. Environ. 2018, 206, 72–83. [Google Scholar] [CrossRef]

Figure 1. Topography of Beijing and the spatial distribution of PM_2.5 sites.

Figure 2. The structure and schematic of the SL-XGB model for PM_2.5 estimation.

Figure 3. Difference between the cloud-removal SARA AOD (a) and the gap-filled SARA AOD (b) on 18 December 2016. Difference between the annually averaged distribution of the cloud-removal SARA AOD (c) and the gap-filled SARA AOD (d).

Figure 4. Density scatter plots of the CV results. Sample-based results: (a) GWR, (b) XGBoost and (c) SL-XGB. Site-based results: (d) GWR, (e) XGBoost and (f) SL-XGB. The red lines are the 1:1 lines and the black lines are the fitted regression lines.
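The two validation schemes compared in this figure differ only in how samples are assigned to the 10 folds. The sketch below illustrates that difference under the usual reading of the terms: sample-based CV shuffles individual samples, while site-based CV assigns whole monitoring sites to folds so that a held-out site contributes nothing to training. The function names, the seed and the site identifiers are hypothetical and are not taken from the paper.

```python
import random


def sample_based_folds(n_samples, k=10, seed=0):
    """Assign every sample to one of k folds at random (sample-based CV)."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [0] * n_samples
    for rank, sample_index in enumerate(order):
        folds[sample_index] = rank % k
    return folds


def site_based_folds(site_ids, k=10, seed=0):
    """Assign folds per site, so all samples of a site share one fold."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    fold_of_site = {site: rank % k for rank, site in enumerate(sites)}
    return [fold_of_site[site] for site in site_ids]


# Hypothetical data: 20 sites with 15 daily samples each.
site_ids = [f"site_{i % 20}" for i in range(300)]
by_sample = sample_based_folds(len(site_ids))
by_site = site_based_folds(site_ids)
```

With site-based folds, the error measured on a held-out fold reflects prediction at locations the model has never seen, which is why it is commonly read as a test of spatial generalization.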

Figure 5. The spatial distribution of the sample-based CV RMSE at the ground-level PM_2.5 sites: (a) GWR, (b) XGBoost and (c) SL-XGB.

Figure 6. The relationship between bandwidth distance and RMSE. Left: suburban sites; right: urban sites. The red lines are the polynomial fitting curves and the shaded areas represent the 95% confidence intervals.

Figure 7. The average importance of the top 10 independent variables.

Figure 8. The spatial distribution of the seasonally and annually averaged PM_2.5 in Beijing.

Figure 9. Annual comparisons. (a) Average differences between the GF and the NGF (GF minus NGF). (b) Average differences between the true PM_2.5 concentrations and the NGF at PM_2.5 sites (true concentrations minus NGF). (c) Average differences between the true PM_2.5 concentrations and the GF at PM_2.5 sites (true concentrations minus GF).

Table 1. Sample-based CV results of seasonal and annual XGBoost models.

Period	N	R²	RMSE	MPE
Spring	6578061	0.93	0.10	0.05
Summer	5663296	0.90	0.15	0.09
Autumn	4763970	0.94	0.10	0.06
Winter	4624431	0.92	0.07	0.04
Annual	21629758	0.86	0.15	0.10

Table 2. Fitting and sample-based CV performance of GWR, XGBoost and SL-XGB.

Method	R² (Fitting)	R² (CV)	RMSE, μg/m³ (Fitting)	RMSE, μg/m³ (CV)	MPE, μg/m³ (Fitting)	MPE, μg/m³ (CV)
GWR	0.81	0.71	30.74	33.67	19.86	21.79
XGBoost	0.89	0.85	21.71	27.01	15.87	19.52
SL-XGB	0.93	0.88	18.09	24.08	13.24	16.90
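
For readers who want to reproduce the three scores in this table from their own observed and predicted values, the following is a minimal sketch. It rests on two assumptions not confirmed by this section: that R² is the coefficient of determination (some PM_2.5 studies instead report the squared correlation of the fitted line), and that MPE, the mean prediction error, is the mean absolute difference between observation and prediction. The function name and the sample values are hypothetical.

```python
import math


def cv_metrics(observed, predicted):
    """Return (R2, RMSE, MPE) for paired observed and predicted values.

    Assumptions: R2 is the coefficient of determination and MPE is the
    mean absolute prediction error; the paper may define them differently.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mpe = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    return r2, rmse, mpe


# Hypothetical concentrations in μg/m³, for illustration only.
r2, rmse, mpe = cv_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```

RMSE penalizes large misses more heavily than MPE under these definitions, so a gap between the two that widens from fitting to CV points to a few poorly predicted samples rather than a uniform loss of accuracy.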

Table 3. Some comparisons of PM_2.5 estimation studies using Beijing or BTH as the study area.

Source	Year	Model	Resolution	Study Area	Performance
Ours	-	Spatially local XGBoost (SL-XGB)	500 m	Beijing	sample-based CV R² 0.88, site-based CV R² 0.86
Xie et al. [57]	2015	Mixed-effects model	3 km	Beijing	site-based CV R² 0.83
He and Huang [60]	2018	Improved geographically and temporally weighted regression	3 km	BTH	sample-based CV R² 0.84
Xie et al. [24]	2019	Mixed-effects model with cloud screen	500 m	Beijing	site-based CV R² 0.82
Yao et al. [25]	2019	Nested spatiotemporal statistical model	750 m	Beijing	sample-based CV R² 0.85
Wang et al. [65]	2019	Deep neural network	10 km	BTH	sample-based CV R² 0.87
Zhao et al. [66]	2020	Random forest considering meteorological lag effects	0.01°	BTH	sample-based CV R² 0.83

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Fan, Z.; Zhan, Q.; Yang, C.; Liu, H.; Bilal, M. Estimating PM_2.5 Concentrations Using Spatially Local Xgboost Based on Full-Covered SARA AOD at the Urban Scale. Remote Sens. 2020, 12, 3368. https://doi.org/10.3390/rs12203368
