#### 2.3.1. Machine-Learning (ML) Models

ML applies statistical theory and algorithms so that computer programs can acquire new knowledge in a manner somewhat analogous to human learning. It can exploit existing knowledge structures and ample information content to infer new knowledge. ML has been applied in various fields, including the remote sensing of PM2.5 [45–47]. This study compares and analyzes four widely used ML algorithms: the Random Forest, Extra-trees, XGBoost, and LightGBM models.

The Random Forest model constructs an ensemble of decision trees, where each tree is grown from a bootstrap sample of the training dataset [48–50]. The Extremely Randomized Trees (Extra-trees) model is a similar tree-based ensemble method, but it introduces additional randomness by choosing split features and split points at random during tree building, typically growing each tree from the full training set rather than a bootstrap sample [31,32,51]. Both the Random Forest and Extra-trees models are ensemble-learning algorithms based on the bagging technique. In bagging, the base classifiers (decision trees) are trained independently, with no strong dependence or correlation between them. This allows the base classifiers to be trained in parallel, which can significantly speed up training and makes these ensemble algorithms efficient and scalable for large datasets.
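As a concrete illustration of the parallel bagging scheme described above, the following sketch fits both bagging ensembles with multi-core training enabled (scikit-learn with synthetic data; the predictors and target are placeholders, not this study's actual inputs):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))           # e.g., AOD plus auxiliary predictors (synthetic)
y = 3 * X[:, 0] + rng.normal(size=500)  # synthetic PM2.5-like target

# Random Forest: each tree sees a bootstrap sample of the training set.
rf = RandomForestRegressor(n_estimators=100, bootstrap=True, n_jobs=-1, random_state=0)
# Extra-trees: random split points, full training set per tree by default.
et = ExtraTreesRegressor(n_estimators=100, bootstrap=False, n_jobs=-1, random_state=0)

# Trees are independent, so n_jobs=-1 trains them in parallel on all cores.
rf.fit(X, y)
et.fit(X, y)
```

The `n_jobs=-1` setting is precisely where the independence of the base classifiers pays off: each tree can be grown on a separate core with no communication between trees.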

In contrast, the XGBoost and LightGBM models are based on a boosting ensemble algorithm (Figure 2), in which the base classifiers are trained sequentially, each depending on its predecessors. The main idea of boosting is to stack these classifiers, with each stage assigning higher weights to the samples that were misclassified by the previous stages. The two models nevertheless differ in several ways. XGBoost uses a pre-sorted algorithm: all feature values are sorted before the iterative tree-building operations take place, which significantly reduces the number of split-finding calculations required [33,52]. LightGBM instead employs a histogram-based algorithm, which reduces memory usage and shortens runtime [34,53]. In terms of growth strategy, XGBoost follows a level-wise approach, in which all child nodes at the same depth are split simultaneously. Conversely, LightGBM adopts a leaf-wise growth strategy, in which each step splits only the leaf with the largest gain (typically the one with the largest data volume).

**Figure 2.** Illustration of the bagging and boosting algorithms.
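The sequential dependence of boosting can be sketched with a minimal hand-rolled gradient-boosting loop (squared-error loss, shallow scikit-learn trees, synthetic data). This is an illustrative simplification of the boosting principle, not the exact XGBoost or LightGBM algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)  # synthetic nonlinear target

learning_rate, trees = 0.1, []
pred = np.zeros(len(y))
for _ in range(50):
    residual = y - pred  # depends on ALL previously fitted trees
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)  # stack the new tree on top

mse = float(np.mean((y - pred) ** 2))
```

Because each tree is fitted to the residuals left by the ensemble so far, the loop cannot be parallelized across trees, in contrast to the bagging case. XGBoost and LightGBM differ from this sketch mainly in how they find splits (pre-sorted vs. histogram) and grow trees (level-wise vs. leaf-wise).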

#### 2.3.2. Importance Assessment Method

(1) Feature importance

The feature importance (FI) score is a common indicator of the importance of input variables that comes built into tree-based ML models. The FI score is calculated via the Gini index based on the mean decrease in impurity (MDI) and evaluates the importance of each feature by measuring its contribution to node splitting in the decision trees [54]. Taking one decision tree as an example, let VIM denote the variable importance measure and GI the Gini index. Assuming there are *m* feature variables, the importance score of the *i*th feature X<sub>i</sub>, denoted VIM<sub>i</sub><sup>(Gini)</sup>, is the average change in node-splitting impurity attributable to X<sub>i</sub> in the tree model. The Gini index of node *m* is calculated as

$$\mathrm{GI}_{m} = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2} \tag{1}$$

where *K* represents the total number of categories of a feature, and *p<sub>mk</sub>* represents the proportion of category *k* in node *m*. The importance of feature X<sub>i</sub> in node *m*, i.e., the change in GI before and after the node branches, is expressed as

$$\mathrm{VIM}_{im}^{(\mathrm{Gini})} = \mathrm{GI}_{m} - \mathrm{GI}_{l} - \mathrm{GI}_{r} \tag{2}$$

where GI<sub>l</sub> and GI<sub>r</sub> represent the Gini indices of the two new nodes after the branch. Let *M* denote the set of nodes at which feature X<sub>i</sub> appears in the *j*th decision tree. The importance of X<sub>i</sub> in the *j*th tree is then

$$\mathrm{VIM}_{ij}^{(\mathrm{Gini})} = \sum_{m \in M} \mathrm{VIM}_{im}^{(\mathrm{Gini})} \tag{3}$$
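Equations (1)–(3) are what tree libraries aggregate over all trees (and normalize) into their built-in importance scores; in scikit-learn, for example, the MDI-based FI is exposed as `feature_importances_`. A minimal sketch on synthetic data, in which one feature is constructed to dominate the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
# Feature 0 dominates the target, feature 1 contributes weakly,
# features 2-3 are pure noise.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
fi = model.feature_importances_  # MDI-based FI scores, normalized to sum to 1
```

With this construction the FI vector assigns by far the largest share to feature 0, mirroring how the Gini-based scores above rank the features that drive the most impurity reduction at splits.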

(2) Permutation importance

Although the FI can reflect the importance of feature variables, it tends to favor features with many categories. For feature variables with multiple correlations, FI may be biased in describing correlated features, and its assessment can also be affected by overfitting [55]. Therefore, the permutation importance (PI), another method for evaluating the contribution of each feature, is employed. The PI is a model-agnostic method applicable to almost all types of models, including deep-learning models. Its basic idea is to randomly shuffle one feature at a time on a held-out set and measure the resulting change in model performance. The PI of a feature is calculated as follows (Figure 3). First, the dataset is divided into a training set and a validation set, and the model is fitted on the training set. Second, a baseline metric is evaluated on the validation set. Third, one feature column of the validation set is randomly permuted, and the metric is evaluated again. The PI is the difference between the baseline metric and the metric obtained after permuting that feature column.

**Figure 3.** Illustration of the permutation importance (PI).
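The shuffle-and-rescore procedure above is available off the shelf, e.g., as scikit-learn's `permutation_importance`. A sketch on synthetic data (the estimator and data are placeholders, not the study's actual configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # feature 0 carries the signal

# Step 1: split the data and fit the model on the training set.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Steps 2-3: baseline score on the validation set, then shuffle each
# column n_repeats times and record the drop in the score.
result = permutation_importance(model, X_va, y_va, n_repeats=10, random_state=0)
pi = result.importances_mean  # mean score drop per feature
```

Unlike MDI, this estimate is computed on held-out data, which is why PI is less prone to the overfitting and high-cardinality biases noted above.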

#### 2.3.3. Model Validation Methods

In this study, two methods are used to evaluate the performance of ML models: sample-based and station-based ten-fold cross-validation (10-CV) [32,56]. In sample-based 10-CV, the sample dataset is randomly divided into ten groups. One group, comprising 10% of the samples, is set aside as the independent validation set, while the remaining nine groups (90%) form the training set. This process is repeated ten times, with each group serving as the validation set once, ensuring that all samples have been tested. The final accuracy is calculated as the average of the results obtained from the ten runs. This approach is commonly used to represent the overall accuracy of ML models in estimating PM2.5 levels at locations where ground measurements are available.
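A sketch of sample-based 10-CV using scikit-learn's `KFold` (the estimator and synthetic data are placeholders for the study's actual models and samples):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=0.2, size=300)  # synthetic PM2.5-like target

scores = []
# Each of the 10 folds serves as the 10% validation set exactly once.
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))

cv_r2 = float(np.mean(scores))  # final accuracy: average over the ten runs
```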

The station-based 10-CV is another evaluation method used to assess the predictive ability of ML models in estimating PM2.5 concentrations at locations where ground measurements are not available [57]. This method serves as a spatially independent validation technique. Similarly to the sample-based 10-CV, the station-based 10-CV involves dividing the ground observation stations in the study area into ten groups. One group, consisting of 10% of all stations, is designated as the validation set, while the dataset corresponding to the remaining nine groups of stations (90% of all stations) is used as the training set. This approach creates training and validation samples from different locations. This helps isolate spatial autocorrelations among the data samples, making it an effective spatially independent verification method.
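Station-based 10-CV can be sketched by splitting on station labels instead of individual samples, e.g., with scikit-learn's `GroupKFold`, which guarantees that no station contributes samples to both the training and validation folds (the station labels and data below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
station_id = np.repeat(np.arange(30), 20)  # 30 hypothetical stations, 20 samples each
X = rng.normal(size=(600, 5))
y = X[:, 0] + rng.normal(scale=0.2, size=600)

scores = []
for tr, va in GroupKFold(n_splits=10).split(X, y, groups=station_id):
    # Training and validation samples come from disjoint station sets,
    # which isolates spatial autocorrelation between the two sides.
    assert set(station_id[tr]).isdisjoint(set(station_id[va]))
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[tr], y[tr])
    scores.append(r2_score(y[va], model.predict(X[va])))

station_cv_r2 = float(np.mean(scores))
```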

#### 2.3.4. Sensitivity Analysis Methods

This study starts by considering the study area as a whole, with 775 ground observation stations. The total number of stations in the study area is then randomly reduced by 10%. The remaining number of stations is again reduced by 10%. This process continues until the number of stations in a group is 31% of the original total number of stations. The end result is 12 groups of stations, each reflecting a certain station density in the study area. The smaller the proportion of remaining stations in a group, the smaller the density of stations in that group. This set of 12 groups of stations and their associated observations is used next to explore the influence of AOD on inversions of PM2.5 from different ML models. To assess the importance and contribution of satellite AOD retrievals to ML modeling and quantitatively evaluate their impact on model performance, this study incorporates an uncertainty analysis from three key aspects, presented in the following section.
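The thinning scheme can be sketched as follows, assuming each 10% reduction applies to the stations remaining from the previous step (which is what yields 12 groups ending near 31%, since 0.9<sup>11</sup> ≈ 0.314):

```python
import random

random.seed(0)
stations = list(range(775))  # station IDs for the 775 ground stations

groups = [stations]  # group 1: 100% of stations
while len(groups) < 12:
    prev = groups[-1]
    keep = round(len(prev) * 0.9)  # drop 10% of the *remaining* stations
    groups.append(random.sample(prev, keep))

fractions = [len(g) / 775 for g in groups]
# fractions runs from 1.0 down to roughly 0.31 across the 12 groups
```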


#### **3. Results**

#### *3.1. Variations of Satellite AOD Contributions*

Figure 4 presents the FI and PI of satellite AOD retrievals in estimating PM2.5, obtained by the four ML models as the density of ground-based monitoring stations decreases. Also shown are the best-fit lines from linear regression and their confidence intervals (CIs). The CI is a statistical measure used to quantify the uncertainty of an estimate; here, a 95% confidence level is used (pink-shaded areas in the figure), indicating a 95% probability that the true value falls within the specified range and a 5% probability that it falls outside it. Figure 4a–d show that as the monitoring-station density decreases, the FI score significantly increases, with regressed coefficients of determination (R<sup>2</sup>) of 0.84, 0.81, 0.77, and 0.43 for the Random Forest, Extra-trees, XGBoost, and LightGBM models, respectively; all pass the 99% or 95% confidence test (*p* < 0.01 or 0.05). The spread of FI in the LightGBM case (Figure 4d) indicates a higher variance compared with the other methods, which may be attributed to the node-splitting behavior of its leaf-wise growth strategy. Similar conclusions can be drawn from the PI scores, i.e., the contribution of satellite AOD significantly increases as the station density decreases, with much higher regressed R<sup>2</sup> values (R<sup>2</sup> = 0.94–0.96) for the four ML models (Figure 4e–h); all regressed trends reach the 99% confidence level (*p* < 0.01). Note that the FI and PI values differ among the ML methods because the models employ distinct frameworks and operations, including sampling, feature selection, and node-splitting techniques. The methods of computing FI and PI also differ: FI relies on the Gini index, which involves calculating the MDI, while PI assesses the change in model performance after randomly shuffling a feature. The results obtained by the two importance verification methods are nevertheless consistent, suggesting that they can complement and verify each other. These findings reveal that satellite AOD is crucial for PM2.5 modeling using ML because it plays a dominant predictive role with the highest importance scores, particularly in regions with a low density of PM2.5 ground observation stations.

**Figure 4.** Variations of (**a**–**d**) feature importance (unit: %) and (**e**–**h**) permutation importance (unit: %) of satellite AOD as a function of the decreasing percentage of ground-based monitoring stations for the Random Forest, Extra-trees, XGBoost, and LightGBM models, respectively. Pink-shaded areas are 95% confidence intervals, where \* and \*\* denote the 95% (*p* < 0.05) and 99% (*p* < 0.01) confidence levels of the regressed fits, respectively.

#### *3.2. Impacts of Satellite AOD on Overall Accuracy*

Figure 5 shows the overall accuracies, measured by the sample-based CV R<sup>2</sup> (CV-R<sup>2</sup>), of the daily PM2.5 estimates with and without AOD as a model input, as a function of decreasing station density for the four ML models. For the model results with AOD included as an input variable (Figure 5, red dots), the overall accuracy of the PM2.5 estimates gradually decreases as the station density decreases, showing significant average decreasing trends (i.e., change in sample-based CV-R<sup>2</sup> per 1% of discarded stations) of −0.16% (*p* < 0.01), −0.15% (*p* < 0.01), −0.19% (*p* < 0.01), and −0.11% (*p* < 0.01) for the Random Forest, Extra-trees, XGBoost, and LightGBM models, respectively. Comparing these with the model results without AOD (Figure 5, blue dots), at a station proportion of 100%, the Random Forest CV-R<sup>2</sup> is 0.83 with AOD and 0.82 without AOD, a small relative difference of about 1.48%. However, as the station density decreases, the accuracy difference between the models with and without AOD gradually increases (black diamonds in the figure). At a station density of 31%, the accuracy of the model with AOD is 0.72, while that without AOD is 0.65; the relative difference in model accuracy significantly increases to 10.35%. Similarly, Figure 5b–d present comparable results: at 100% station density, the overall accuracies of the Extra-trees, XGBoost, and LightGBM models with AOD are 0.86, 0.84, and 0.85, respectively, versus 0.86, 0.83, and 0.81 without AOD, giving relative differences of 0.10%, 0.58%, and 4.80%, respectively. When the station density drops to 31%, the accuracies of the models with AOD become 0.76, 0.70, and 0.76, while those without AOD are 0.71, 0.64, and 0.69, and the relative differences in overall accuracy between these three models with and without AOD increase to 6.48%, 9.41%, and 10.44%, respectively. In particular, as the number of monitoring stations decreases, the slopes of the decreasing overall accuracy are much steeper for the Random Forest (−0.24%, *p* < 0.01), Extra-trees (−0.22%, *p* < 0.01), XGBoost (−0.27%, *p* < 0.01), and LightGBM (−0.16%, *p* < 0.01) models without AOD than for the same models with AOD. This is because, as the station density decreases, the sample data volume gradually decreases, as do the model accuracies. In general, for every 10% reduction in the station proportion, the four ML models without AOD experience a 1.2% (*p* < 0.01), 0.9% (*p* < 0.01), 1.1% (*p* < 0.01), and 0.7% (*p* < 0.01) increase in the uncertainty of the estimated results, respectively, compared with these models using AOD. These findings highlight the indispensable role of satellite AOD in improving the accuracy of PM2.5 estimates from ML models, particularly in regions with limited observation stations.

**Figure 5.** Variation in the overall accuracy (sample-based CV-R2, left ordinate) and relative difference (%, right ordinate) of daily PM2.5 estimates as a function of decreasing station density for four ML models: (**a**) Random Forest, (**b**) Extra-trees, (**c**) XGBoost, and (**d**) LightGBM, with (red dots) and without (blue dots) including satellite AOD as an input predictor. Slopes of the best-fit lines from linear regression for each set of results are given, where \*\* denotes the 99% confidence (*p* < 0.01) level. Black dashed lines represent the regressed fits of the relative difference between each set of results.

#### *3.3. Impacts of Satellite AOD on Predictive Ability*

Here, the model performance in predicting the PM2.5 level in areas without surface observations is examined based on results from the station-based 10-CV method. Figure 6 shows the predictive ability of four ML models in retrieving daily PM2.5 with (red dots) and without (blue dots) AOD as inputs under different station density conditions. Overall, the trend observed in station-based 10-CV results aligns with those of sample-based 10-CV: When AOD is included as an input variable, with the decreasing number of monitoring stations, the predictive accuracies of the Random Forest, Extra-trees, XGBoost, and LightGBM models all show significantly decreasing trends (i.e., change in station-based CV-R2 per 1% of discarded stations) of −0.14% (*p* < 0.01), −0.13% (*p* < 0.01), −0.18% (*p* < 0.01), and −0.10% (*p* < 0.01), respectively. However, the model's predictive ability experiences a much faster decline, with larger slopes of −0.21%, −0.18%, −0.26%, and −0.13% for the four models without including satellite AOD, respectively. Specifically, when data from all PM2.5 stations are used (100%), the Random Forest, Extra-trees, XGBoost, and LightGBM models that include AOD produce slightly better results, with higher station-based CV-R2 values (0.81, 0.83, 0.80, and 0.79) than those without considering AOD (CV-R<sup>2</sup> = 0.77, 0.81, 0.76, and 0.71, respectively). However, when the proportion of monitoring stations drops to 31%, the predictive accuracies for the same four models decrease to 0.72, 0.75, 0.69, 0.73 (including AOD) and 0.64, 0.69, 0.61, 0.64 (not including AOD), respectively. More importantly, the average relative differences become 2.48, 2.68, 2.19, and 1.33 times larger than the results when all stations are considered. This is because station-based 10-CV uses known stations (regions) to predict unknown stations (regions), reflecting the spatial prediction ability of the model. 
However, it is difficult for an ML model to make predictions for regions without training samples, which inevitably lowers the prediction accuracy, consistent with the general behavior of ML models. In general, for every 10% reduction in station density, the four ML models without AOD experience 1.0% (*p* < 0.01), 0.7% (*p* < 0.01), 1.2% (*p* < 0.01), and 0.6% (*p* < 0.01) increases in the uncertainty of the predicted results, respectively, compared with these models using AOD. These findings confirm the indispensable role of satellite AOD in predicting PM2.5 concentrations in areas without observations using ML models, particularly in low-station-density situations.

**Figure 6.** Variation in the predictive ability (station-based CV-R2, left ordinate) and relative difference (%, right ordinate) of daily PM2.5 estimates as a function of decreasing station density for (**a**) Random Forest, (**b**) Extra-trees, (**c**) XGBoost, and (**d**) LightGBM, with (red dots) and without (blue dots) including satellite AOD as an input predictor. Slopes of the best-fit lines from linear regression for each set of results are given, where \*\* denotes the 99% confidence (*p* < 0.01) level. Black dashed lines represent the regressed fits of the relative difference between each set of results.

The contrasting finding that comparably high PM2.5 estimation accuracy can be achieved without incorporating AOD input can be attributed to the high density of ground observation stations in the study area. In areas with a sufficiently dense network of monitoring sites, excluding AOD can still yield relatively accurate PM2.5 estimates, largely because of the spatial autocorrelation of air pollution, e.g., PM2.5 levels and auxiliary factors such as meteorological fields are highly similar at neighboring locations. As a result, PM2.5 concentrations at nearby sites can be reasonably estimated from the relationships between PM2.5 and non-AOD factors established at nearby stations. Nevertheless, as the station density decreases, the spatial autocorrelation weakens, the disparity between natural and human-influenced conditions grows, and consequently the prediction error rapidly increases. This highlights the critical role of AOD in areas with limited ground monitoring: it significantly enhances the accuracy of PM2.5 predictions by providing crucial background pollution information, compensating for the lack of ground observation data.
