• Feature Extraction

Four categories of 101 features, covering forest, height, density, and intensity metrics, were derived from the normalized ALS point cloud data. Forest features include canopy cover, leaf area index (LAI), and gap fraction. Canopy cover refers to the proportion of the forest floor covered by the vertical projection of the tree crowns [85]. LAI is one of the most significant variables for representing canopy structure and is defined as half the total foliage area per unit ground surface area [86]. The gap fraction can be calculated as the ratio of the number of ground points (i.e., returns whose normalized height is lower than the height threshold, 2 m in this study) to the total number of returns. All 101 ALS features, including three forest metrics, 46 height metrics, 10 density metrics, and 42 intensity metrics, were extracted using LiDAR360 V3.2 of GreenValley International. The feature details are listed in Table A1 of Appendix A.
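As a concrete illustration of the forest metrics described above, the following sketch computes the gap fraction from normalized return heights, and canopy cover as its complement (one common approximation). The array contents and function names are hypothetical; the study itself derived these metrics with LiDAR360 V3.2.

```python
import numpy as np

def gap_fraction(heights, threshold=2.0):
    """Gap fraction: share of returns whose normalized height falls
    below the height threshold (2 m in this study)."""
    heights = np.asarray(heights)
    return np.sum(heights < threshold) / heights.size

def canopy_cover(heights, threshold=2.0):
    """Canopy cover approximated as the complement of the gap fraction:
    the proportion of returns intercepted by crowns above the threshold."""
    return 1.0 - gap_fraction(heights, threshold)

# Hypothetical plot: normalized return heights in metres
heights = np.array([0.1, 0.5, 1.8, 3.2, 7.5, 12.0, 15.3, 0.0])
print(gap_fraction(heights))  # 0.5
print(canopy_cover(heights))  # 0.5
```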

A variety of features can be derived from optical imagery. Following previous studies (e.g., [48,54,73,87]), band combinations, vegetation indices, textures of each band (e.g., gray-level co-occurrence matrix (GLCM) statistics), and image transformations (e.g., principal component analysis, tasseled cap, minimum noise fraction) were extracted as potential predictors for AGB modeling. In total, 98 features were selected or extracted from the Landsat 8 imagery in this study, including seven original bands (bands 1–7), ten band combinations, ten image enhancement features (i.e., three principal components, three tasseled-cap features, and four minimum noise fractions), 56 GLCM features, and 15 vegetation indices. The details of the 98 features derived from Landsat 8 are listed in Table A2 of Appendix A.
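To make the optical feature extraction concrete, the sketch below derives one representative vegetation index (NDVI) and two GLCM texture statistics from placeholder reflectance arrays. The band arrays, the 32-level quantization, and the distance/angle choices are illustrative assumptions, not the study's exact settings.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19

# Hypothetical Landsat 8 surface-reflectance arrays (band 4 = red, band 5 = NIR)
rng = np.random.default_rng(0)
red = rng.random((64, 64)).astype(np.float32)
nir = rng.random((64, 64)).astype(np.float32)

# A representative vegetation index (NDVI)
ndvi = (nir - red) / (nir + red + 1e-9)

# GLCM textures of a single band: quantize to 32 grey levels, then
# derive second-order statistics such as contrast and homogeneity
band_q = (np.digitize(red, np.linspace(0, 1, 32)) - 1).astype(np.uint8)
glcm = graycomatrix(band_q, distances=[1], angles=[0],
                    levels=32, symmetric=True, normed=True)
contrast = graycoprops(glcm, "contrast")[0, 0]
homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
```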

• Feature Selection

To avoid the "curse of dimensionality", it is a prerequisite to select the most effective feature for AGB estimation. In this study, the two-step feature selection procedure is implemented, including (1) preliminary selection using Pearson correlation coefficient; and (2) further selection based on variable importance measure using random forest. For the first step, Pearson correlation coefficients of each feature and AGB were calculated and the features with *p*-value less than 0.05 that significantly correlated with AGB were selected. Then, the selected features were ranked according to variable important measures calculated with random forest. Due to the randomness, the ranking procedure was implemented 10 times to find out the most stable set of features with high ranking.

The two-step feature selection was implemented for the ALS and Landsat 8 data separately to select two sets of best-performing features. Among the selected ALS features, the best-performing ALS variable was determined by establishing and evaluating univariate models between AGB and each ALS feature. The feature selection procedure was implemented using R version 4.0.4 (https://www.r-project.org/ (accessed on 1 September 2021)).
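A minimal sketch of this univariate screening step, assuming simple linear univariate models (the paper does not specify the functional form):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def best_lidar_variable(X_als, y_agb, names):
    """Fit a univariate model of AGB on each ALS feature and return
    the feature with the highest R^2, i.e., the BLV used in
    Equations (3) and (4) below."""
    best_name, best_r2 = None, -np.inf
    for j, name in enumerate(names):
        model = LinearRegression().fit(X_als[:, [j]], y_agb)
        r2 = model.score(X_als[:, [j]], y_agb)
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name, best_r2
```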

Following [48], two types of indices (*COLI*1 and *COLI*2) incorporating optical imagery and ALS information were established by combining the best-performing LiDAR variable with each best-performing optical spectral feature. The best-performing LiDAR variable was determined as the LiDAR variable whose univariate model of AGB achieved the highest R². The best-performing spectral features of Landsat 8 were selected by the two-step feature selection procedure described above. The generation of *COLI*1 and *COLI*2, based on the best-performing LiDAR variable (a single feature) and the best-performing Landsat 8 features (possibly several features), therefore involved both feature selection and feature extraction. For convenience, we retained the notation of [48] but adjusted the equations as follows.

$$\text{COLI1} = SF_i \times BLV \tag{3}$$

$$\text{COLI2} = \frac{BLV - SF_i}{BLV + SF_i} \tag{4}$$

where *BLV* is the best-performing LiDAR variable (a single feature) and *SFi* is the *i*-th best-performing feature derived from the Landsat 8 imagery (there may be several such features). Thus, the number of *COLI*1 or *COLI*2 indices is identical to the number of best-performing spectral features (*SFi*).
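Computationally, generating the combined indices is a simple element-wise operation. The sketch below applies Equations (3) and (4) to placeholder arrays, producing one *COLI*1/*COLI*2 pair per selected spectral feature.

```python
import numpy as np

def coli_indices(sf, blv):
    """COLI1 and COLI2 for one or more spectral features SF_i and the
    best-performing LiDAR variable BLV (Equations (3) and (4))."""
    coli1 = sf * blv                  # Equation (3)
    coli2 = (blv - sf) / (blv + sf)   # Equation (4)
    return coli1, coli2

# Hypothetical data: 5 selected spectral features over 100 plots
rng = np.random.default_rng(0)
sf_stack = rng.random((5, 100))
blv = rng.random(100)                        # BLV value per plot
coli1, coli2 = coli_indices(sf_stack, blv)   # shapes (5, 100) each
```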

#### 2.3.3. Classic Machine Learning Algorithms

In this study, seven classic machine learning algorithms were applied to estimate the AGB of NSFs: extreme learning machine (ELM), backpropagation (BP) neural network, regression tree (RegT), RF, support vector regression (SVR), KNN, and CNN. Traditional multiple linear regression (MLR) was applied as a baseline for model comparison.
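For reference, a minimal sketch of the MLR baseline on placeholder data; the array shapes and AGB values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 8))     # hypothetical: 100 plots x 8 selected features
y = rng.random(100) * 200    # hypothetical AGB (Mg/ha)

mlr = LinearRegression().fit(X, y)
print(mlr.score(X, y))       # coefficient of determination (R^2)
```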

• ELM

ELM is a class of machine learning methods built on the feedforward neural network (FNN) for supervised and unsupervised learning problems [88]. ELM improves on the FNN trained with backpropagation in that the weights of the hidden-layer nodes are assigned randomly or set manually and never need to be updated; only the output weights are learned. Compared to the single-layer perceptron and SVM, ELM is considered to have possible advantages in terms of learning speed and generalization ability [88].
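The defining trait of ELM, fixed random hidden weights with only the output weights learned, fits in a few lines. The sketch below solves the output weights by least squares via the pseudo-inverse; the hidden-layer size and activation function are illustrative assumptions.

```python
import numpy as np

class ELM:
    """Minimal single-hidden-layer ELM for regression."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        # Random input-to-hidden weights and biases; never updated
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)    # hidden-layer outputs
        self.beta = np.linalg.pinv(H) @ y   # analytic output weights
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```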

• BP

The BP neural network, proposed by Rumelhart et al. in 1986 [89], is a multilayer feedforward network trained by the error backpropagation algorithm and is one of the most widely used neural network models [90]. Its learning rule uses the steepest descent method to continuously adjust the weights and thresholds of the network through backpropagation so as to minimize the sum of squared errors of the network. Based on trial and error, the BP algorithm was trained for 1000 epochs in this study.
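A minimal BP sketch using scikit-learn's MLPRegressor with gradient-descent training capped at 1000 epochs, matching the epoch setting above; the hidden-layer size, learning rate, and data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 8))     # hypothetical: 100 plots x 8 selected features
y = rng.random(100) * 200    # hypothetical AGB (Mg/ha)

# solver="sgd" trains by (stochastic) gradient descent on the
# backpropagated error; max_iter caps training at 1000 epochs
bp = MLPRegressor(hidden_layer_sizes=(16,), solver="sgd",
                  learning_rate_init=0.01, max_iter=1000, random_state=0)
bp.fit(X, y)
pred = bp.predict(X)
```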

• RegT and RF

A regression tree is a basic method built on the principle of minimizing a loss function for a regression problem. The major advantages of the regression tree are the readability of the model and its fast computational speed, which make it particularly suitable for ensemble learning methods such as random forests. RF, proposed by Leo Breiman [76], is an ensemble of multiple regression trees and is capable of capturing the complicated relationship between a response and a set of explanatory variables. Its advantages include robustness against over-fitting, the ability to determine variable importance, high accuracy, few parameters to tune, low sensitivity to parameter settings, fast training speed, and resistance to noise. The number of regression trees and the random state of the RF algorithm were set to 1000 and 10, respectively, in this study.
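A sketch of both learners with the RF settings reported above (1000 trees, random state 10); the data are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 8))     # hypothetical: 100 plots x 8 selected features
y = rng.random(100) * 200    # hypothetical AGB (Mg/ha)

# Single regression tree (RegT)
regt = DecisionTreeRegressor(random_state=10).fit(X, y)

# RF with the settings reported in the study
rf = RandomForestRegressor(n_estimators=1000, random_state=10).fit(X, y)
print(rf.feature_importances_)   # the variable importance measure used above
```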
