#### **1. Introduction**

PM2.5 refers to the concentration of airborne particulate matter (PM) with an aerodynamic diameter of less than 2.5 microns. Although small, these particles are abundant and active, and attach easily to toxic and harmful substances. PM2.5 can be suspended in the atmosphere for extended periods, ranging from months to even years, which has an important impact on air quality and visibility and also affects human health [1–3]. While PM2.5 has been monitored in many parts of the world, observations are still highly limited and very inhomogeneous, with many regions not covered [4–6]. However, satellite remote sensing provides continuous spatial coverage and has been widely used in the estimation of surface PM2.5 concentrations [7–9].

**Citation:** Tian, Z.; Wei, J.; Li, Z. How Important Is Satellite-Retrieved Aerosol Optical Depth in Deriving Surface PM2.5 Using Machine Learning? *Remote Sens.* **2023**, *15*, 3780. https://doi.org/10.3390/rs15153780

Academic Editor: Stephan Havemann

Received: 30 June 2023; Revised: 26 July 2023; Accepted: 26 July 2023; Published: 29 July 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Previous studies have made great efforts to infer PM2.5 from satellite retrievals of aerosol optical depth (AOD) by virtue of their positive correlation, because AOD is much more widely monitored from both space and the ground. Many factors can influence their relationship, including aerosol vertical distribution, relative humidity, mixed-layer height, and topography, among others [10–12]. The relationship also varies with both location and time scale. Wang and Christopher (2003) [13] used the AOD product retrieved by the Moderate Resolution Imaging Spectroradiometer (MODIS) and in situ measurements of PM2.5 at seven ground observation stations in Alabama, USA, finding a sound correlation between them on a monthly time scale. Natunen et al. (2010) [14] explored the relationship at four stations in Helsinki, Finland, on seasonal and monthly time scales and found that time averaging increased the correlation. The correlation also varies with spatial resolution [15], indicating that different geographical locations, study area sizes, and spatial resolutions of MODIS AOD products can change the correlation between AOD and PM2.5. In general, the relationship varies considerably with location and season [16,17]. Su et al. (2018) [18] studied the relationship extensively across China, one of the most polluted regions of the world, finding that it differs considerably among different parts of China (better in northern than in southern China) and among the four seasons (better in winter than in summer). The relationship can be significantly improved by normalizing against the height of the planetary boundary layer.
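The effect of boundary-layer normalization can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the data are generated so that PM2.5 scales with AOD divided by planetary boundary layer height (PBLH), the coefficient and noise level are arbitrary, and the simple ratio AOD/PBLH is only one possible form of normalization, not necessarily the one used in the cited studies.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic samples (assumed, not real observations).
rng = random.Random(0)
n = 500
aod = [rng.uniform(0.1, 1.5) for _ in range(n)]      # column AOD, unitless
pblh_km = [rng.uniform(0.3, 2.5) for _ in range(n)]  # boundary layer height, km

# Assumed toy relation: surface PM2.5 follows the column loading diluted
# over the boundary layer depth, plus Gaussian measurement noise.
pm25 = [60.0 * a / h + rng.gauss(0.0, 5.0) for a, h in zip(aod, pblh_km)]

# Normalize AOD by PBLH before correlating it with PM2.5.
aod_norm = [a / h for a, h in zip(aod, pblh_km)]

r_raw = pearson(aod, pm25)
r_norm = pearson(aod_norm, pm25)
print(f"r(AOD, PM2.5)      = {r_raw:.2f}")
print(f"r(AOD/PBLH, PM2.5) = {r_norm:.2f}")
```

Because the toy data are built around the ratio AOD/PBLH, the normalized correlation is higher by construction; with real observations the size of the improvement depends on region, season, and the vertical distribution of aerosols.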

Because the relationship between AOD and PM2.5 is complex, many statistical regression methods have been proposed for estimating PM2.5 from satellite AOD retrievals [7,19–21], such as the multiple linear regression model [22], the geographically weighted regression model [23,24], the geographically and temporally weighted regression model [25], and the linear mixed-effects model [26]. To a certain extent, these models are capable of estimating surface PM2.5 concentrations from satellite AOD data. However, they struggle to account for the influences of the many other factors that affect PM2.5, such as meteorological factors (boundary layer height, relative humidity, etc.) and surface factors (underlying surface types, etc.) [27,28]. Machine-learning (ML) models, by contrast, have a strong data-mining capability and can establish robust nonlinear relationships. They can extract pertinent information from very large numbers of auxiliary factors to improve the accuracy of PM2.5 retrievals. Various types of ML models have therefore been adopted in PM2.5 inversion studies in recent years, e.g., the Random Forest model [29,30], the Extra-trees model [31,32], the XGBoost model [33], and the LightGBM model [34].
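For orientation, the two simplest regression families mentioned above can be written in generic textbook form. The notation is assumed here for illustration and is not taken from the cited studies, whose covariates and exact specifications differ: $i$ indexes samples, $X_{ik}$ denotes the $k$-th auxiliary predictor, and $(u_i, v_i)$ are the coordinates of sample $i$.

$$
\mathrm{PM}_{2.5,i} = \beta_0 + \beta_1\,\mathrm{AOD}_i + \sum_{k=2}^{K} \beta_k X_{ik} + \varepsilon_i
$$

$$
\mathrm{PM}_{2.5,i} = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,\mathrm{AOD}_i + \sum_{k=2}^{K} \beta_k(u_i, v_i)\,X_{ik} + \varepsilon_i
$$

The first is multiple linear regression with global coefficients. The second is geographically weighted regression, in which each coefficient varies with location. Both remain linear in the predictors at any given location, which is the limitation that nonlinear ML models are meant to relax.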

AOD has been regarded as an essential input variable in inferring PM2.5 from satellites [7,19,27,35]. However, a handful of studies have presented contrasting results [36–39]. Chen et al. (2021) [38], for example, developed a Random Forest model for areas with and without AOD data, finding that the model for areas without AOD can result in better PM2.5 retrievals. Yu et al. (2022) [39] developed a deep ensemble ML framework to estimate daily PM2.5 concentrations in Italy from 2015 to 2019 and found similar accuracies (cross-validated R² = 0.853 and 0.857) in comparison with ground observations when including or not including satellite AOD in the model. These conflicting findings raise two critical questions: whether satellite-retrieved AOD plays any significant role in estimating surface PM2.5, and what factor, if any, dictates its role in ML applications for estimating PM2.5. We attempt to address these questions by taking advantage of rich satellite AOD data and in situ PM2.5 measurements in eastern China, together with a large array of other ancillary data, introduced next.
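The with-and-without-AOD comparison described above can be sketched as a small experiment. This is an illustrative assumption, not the method of any cited study: a k-nearest-neighbour regressor stands in for the Random Forest and ensemble models, the predictors and coefficients are synthetic, and the helper names (`knn_predict`, `cv_r2`) are hypothetical.

```python
import math
import random


def knn_predict(train_X, train_y, x, k=5):
    """Average the targets of the k training rows closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def cv_r2(X, y, n_folds=10, k=5, seed=0):
    """Sample-based k-fold cross-validated R^2, pooled over held-out rows."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(y)
    for fold in range(n_folds):
        test_idx = idx[fold::n_folds]
        held_out = set(test_idx)
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        for i in test_idx:
            preds[i] = knn_predict(train_X, train_y, X[i], k)
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - est) ** 2 for obs, est in zip(y, preds))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_res / ss_tot


# Synthetic predictors scaled to [0, 1]: AOD, relative humidity, PBLH.
rng = random.Random(42)
rows = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]

# Assumed toy relation in which AOD carries real signal.
pm25 = [40.0 * aod + 30.0 * rh - 20.0 * pblh + rng.gauss(0.0, 3.0)
        for aod, rh, pblh in rows]

r2_with = cv_r2([list(r) for r in rows], pm25)
r2_without = cv_r2([[rh, pblh] for _, rh, pblh in rows], pm25)
print(f"CV R^2 with AOD:    {r2_with:.3f}")
print(f"CV R^2 without AOD: {r2_without:.3f}")
```

In this toy setup the model with AOD scores higher only because the synthetic PM2.5 was generated from AOD. With real data the gap can shrink or vanish when other predictors already carry the same information, as the similar accuracies reported by Yu et al. (2022) [39] suggest.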

#### **2. Data and Methods**

#### *2.1. Study Area*

The study area (approximately 1,830,000 km²) covers 14 provinces in China, including the North China Plain, the Yangtze River Delta, the Pearl River Delta, and parts of central China (Figure 1). As the most populous and economically developed regions in China, they have experienced serious air pollution problems that have garnered significant public attention. To monitor air pollution, relatively dense and uniformly distributed PM2.5 ground observation stations have been deployed, enabling us to investigate the effects of satellite AOD on estimating PM2.5 concentrations at different levels of ground observation station density.

**Figure 1.** Study area and the distribution of ground stations (green triangles). The colored background shows land elevations (unit: m).

#### *2.2. Data Sources*

The datasets used in this study consist of observed PM2.5 concentrations, 1-km-resolution MODIS Multi-Angle Implementation of Atmospheric Correction (MAIAC) AOD products, and many auxiliary datasets related to PM2.5, such as meteorological, land-related, and population-related information. The study period spans 2018 to 2020, ensuring an adequate volume of data for conducting sensitivity analyses.
