#### *3.2. Algorithms for Forest Diversity Mapping*

In this study, four machine-learning algorithms with various setups were employed: Lasso Regression (LR), Random Forest (RF), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). Non-parametric, non-linear algorithms such as KNN, RF, and SVM have been applied successfully in a variety of remote sensing applications [44]. KNN and SVM represent distance-based and kernel-based models, respectively, while RF represents tree-based models. Specifically, KNN measures the similarity between new data and the available samples and assigns the new data to the category of its most similar neighbors. SVM can handle regression problems with multidimensional data by separating positive and negative samples to identify the optimal decision hyperplane [45]. RF is an ensemble of a large number of decision trees [46], each trained on randomly selected training samples to solve a single problem [47]. All algorithms were implemented using the Scikit-learn Python library, and the hyperparameters of the LR, KNN, SVM, and RF models were tuned through cross-validation (Table 4) [48].

**Table 4.** Description of the regression models used in this study, including the parameters considered and the criteria used to rank the feature importance.
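A minimal sketch of this setup in Scikit-learn, assuming a feature matrix `X` and a diversity index vector `y`; the parameter grids below are illustrative placeholders, not the values of Table 4:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# The four regressors with illustrative hyperparameter grids.
models = {
    "LR":  (Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1]}),
    "RF":  (RandomForestRegressor(random_state=0), {"n_estimators": [100, 300]}),
    "KNN": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}),
    "SVM": (SVR(kernel="rbf"), {"C": [1, 10], "gamma": ["scale"]}),
}

def tune(X, y, cv=5):
    """Fit each model with grid-search cross-validation; return best estimators."""
    best = {}
    for name, (estimator, grid) in models.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="r2").fit(X, y)
        best[name] = search.best_estimator_
    return best
```

Each best estimator can then be refit on the full training set before prediction.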


#### *3.3. Accuracy Assessment*

The coefficient of determination (R<sup>2</sup>), root-mean-square error (RMSE), and mean absolute error (MAE) were applied to assess the accuracy of the tree species diversity estimation. The following equations were used to calculate R<sup>2</sup>, RMSE, and MAE:

$$R^2 = 1 - \frac{\sum\_{i=1}^{n} \left(y\_i - x\_i\right)^2}{\sum\_{i=1}^{n} \left(y\_i - \bar{y}\right)^2} \tag{1}$$

$$\text{RMSE} = \sqrt{\frac{\sum\_{i=1}^{n} (x\_i - y\_i)^2}{n}} \tag{2}$$

$$\text{MAE} = \frac{\sum\_{i=1}^{n} |x\_i - y\_i|}{n} \tag{3}$$

where *x<sub>i</sub>* and *y<sub>i</sub>* are the estimated and measured values, respectively, *ȳ* is the average of the measured values, and *n* is the number of samples.
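Equations (1)–(3) translate directly into NumPy; here `x` is the vector of estimated values and `y` the vector of measured values:

```python
import numpy as np

def r2(x, y):
    """Eq. (1): fraction of variance in y explained by the estimates x."""
    return 1.0 - np.sum((y - x) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(x, y):
    """Eq. (2): root-mean-square error between estimates and measurements."""
    return np.sqrt(np.mean((x - y) ** 2))

def mae(x, y):
    """Eq. (3): mean absolute error between estimates and measurements."""
    return np.mean(np.abs(x - y))
```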

All samples were randomly assigned to either a training set or a validation set at a ratio of 70%:30%. K-fold cross-validation was then employed to estimate the generalization error of each method directly: the data are divided into K folds of almost equal size, the model is fitted on K − 1 folds and evaluated on the remaining fold, and the estimated generalization error is the average error over the K folds.
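The split-and-cross-validate procedure above can be sketched as follows; an RF regressor stands in for any of the four models, and the fold count `k` is an assumption, as the section does not state it:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

def split_and_cv(X, y, k=5, seed=0):
    """70/30 train-validation split, then K-fold CV error on the training set."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed)  # 70% train / 30% validation
    model = RandomForestRegressor(random_state=seed)
    # Generalization error estimated as the mean RMSE over the K folds.
    scores = cross_val_score(
        model, X_tr, y_tr,
        cv=KFold(n_splits=k, shuffle=True, random_state=seed),
        scoring="neg_root_mean_squared_error")
    return X_tr, X_val, y_tr, y_val, -scores.mean()
```

The held-out 30% is never seen during cross-validation and serves as the final validation set.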

#### **4. Results**

#### *4.1. Optimal Features from SENTINEL-2 Images and GEDI LiDAR Data*

The MDG and BRT algorithms were applied to the 68 features obtained from Sentinel-2 images and GEDI LiDAR data to find the optimal features for diversity mapping. Cross-validation was further used to score several feature subsets and choose the best-scoring feature collection. Figure 2 shows the ranking results of the key features for the three diversity indices; other detailed results are displayed in Appendix A, Table A1. The FHD and PAI of GEDI LiDAR in the growing season, the vegetation indices NDVI, NDWI, and EVI, and the spectral bands B7, B8A, B11, and B12 were identified. Compared with individual spectral bands, the GEDI features and vegetation indices explain a larger share of the variation in forest diversity.

**Figure 2.** Relative importance of the features selected for estimations of forest diversity indices.
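The MDG ranking step can be approximated with a random forest's impurity-based importances (`feature_importances_` in Scikit-learn); the feature names below are illustrative stand-ins for the 68 candidate predictors:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rank_features(X, y, names, top=8, seed=0):
    """Rank predictors by mean decrease in impurity and keep the top ones."""
    rf = RandomForestRegressor(n_estimators=300, random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # descending importance
    return [(names[i], float(rf.feature_importances_[i])) for i in order[:top]]
```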

After feature selection, we applied the mixed features from GEDI LiDAR data and Sentinel-2 images to estimate forest diversity. For comparison, we selected the RF model and applied only GEDI LiDAR data or only Sentinel-2 images for forest diversity estimation. Our results show that the Sentinel-2 data alone (average R<sup>2</sup> = 0.62) give better prediction accuracies than the GEDI LiDAR data alone (average R<sup>2</sup> = 0.51), but both are lower than those of the combined data sources (Table 5). Specifically, the Sentinel-2&VIs data perform well in predicting the *H* and *J* indices, with R<sup>2</sup> values of 0.66 and 0.63 and RMSE of 0.56 and 0.18, respectively, although the result for the λ index is slightly lower (R<sup>2</sup> = 0.57, RMSE = 0.15). The GEDI data alone yield relatively high prediction accuracies for the *H* and λ indices (R<sup>2</sup> = 0.51 and 0.54, respectively) but a lower accuracy for the *J* index (R<sup>2</sup> = 0.48).

**Table 5.** Estimated accuracy for different data combinations in three diversity indices.
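The data-source comparison amounts to scoring the same RF model on different column subsets; a minimal sketch, assuming `groups` maps each source label to its column indices in the feature matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def compare_sources(X, y, groups, cv=5):
    """Cross-validated R^2 of an RF model on each feature subset."""
    rf = RandomForestRegressor(random_state=0)
    return {label: cross_val_score(rf, X[:, cols], y, cv=cv, scoring="r2").mean()
            for label, cols in groups.items()}
```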


#### *4.2. Diversity Indices Modelling Using Machine Learning Algorithms*

Based on the selected optimal predictor variables from the Sentinel-2 and GEDI data, the three diversity indices were modelled using the LR, KNN, RF, and SVM models. Our results showed that the R<sup>2</sup> values of all models are above 0.45 for all the diversity indices (Figure 3). Specifically, the RF model exhibited the best performance, with R<sup>2</sup> = 0.86 (RMSE = 0.11) for the *J* index, 0.78 (RMSE = 0.15) for the λ index, and 0.73 (RMSE = 0.47) for the *H* index (Figure 3a,e,i). The SVM also performed well on the *H* and λ indices, with R<sup>2</sup> values of 0.80 and 0.72 and RMSE of 0.37 and 0.16, respectively, although its result for the *J* index was lower than that of the other models (R<sup>2</sup> = 0.57, RMSE = 0.21) (Figure 3b,f,j). The KNN and LR models showed relatively low results for the λ index (R<sup>2</sup> = 0.46 and 0.57, respectively) (Figure 3c,d) but higher results for the *J* index (R<sup>2</sup> = 0.81 and 0.71, respectively) (Figure 3k,l). Overall, the main trend was that low values of the three indices were slightly overestimated (above the 1:1 line) while high values were underestimated (below the 1:1 line).

**Figure 3.** Scatterplots of measured versus predicted values using the RF, SVM, KNN, and LR models for the three diversity indices: Shannon (**a**–**d**), Simpson (**e**–**h**), and Pielou (**i**–**l**). \*\*: significant correlation (*p* < 0.01).
