A Comparative Analysis of Remote Sensing Estimation of Aboveground Biomass in Boreal Forests Using Machine Learning Modeling and Environmental Data

Song, Jie; Liu, Xuelu; Adingo, Samuel; Guo, Yanlong; Li, Quanxi

doi:10.3390/su16167232


by Jie Song ¹,*, Xuelu Liu ¹, Samuel Adingo ², Yanlong Guo ³ and Quanxi Li ¹

¹ College of Resources and Environment, Gansu Agricultural University, Lanzhou 730070, China

² Nanjing Institute of Soil Science, Chinese Academy of Sciences, Nanjing 210008, China

³ National Tibetan Plateau Data Center, State Key Laboratory of Tibetan Plateau Earth System and Resource Environment, Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing 100101, China

* Author to whom correspondence should be addressed.

Sustainability 2024, 16(16), 7232; https://doi.org/10.3390/su16167232

Submission received: 17 July 2024 / Revised: 14 August 2024 / Accepted: 20 August 2024 / Published: 22 August 2024


Abstract

Precise and current maps of aboveground biomass (AGB) in boreal forests are crucial for accurately tracking global carbon levels and developing effective plans for addressing climate change. Remote sensing, as a cost-effective tool, offers the potential to update AGB maps for boreal forests in real time. This study evaluates four machine learning algorithms, namely Light Gradient Boosting Machine (LightGBM), Extreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Regression (SVR), for predicting AGB in boreal forests. Conducted in the Qilian Mountains, northwest China, the study integrated field measurements, space-borne LiDAR, optical remote sensing, and environmental data to develop a training dataset. Among 34 variables, 22 were selected for AGB estimation modeling. Our findings revealed that the LightGBM AGB model had the highest accuracy (R² = 0.84, RMSE = 15.32 Mg/ha), outperforming the XGBoost, RF, and SVR AGB models. Notably, the LightGBM AGB model effectively reduced both underestimation and overestimation. We also observed that the disparity in accuracy among the models widened with increasing altitude. The LightGBM AGB model consistently performed best across all elevation gradients, with residuals generally below 25 Mg/ha for low-value overestimation and within −38 Mg/ha for high-value underestimation. The model developed in this study presents a viable alternative approach for enhancing AGB estimation accuracy in boreal forests based on remote sensing technology.

Keywords:

boreal forests; AGB; machine learning algorithms; remote sensing technology; topography; bioclimate; soil

1. Introduction

The boreal forest biome contains a significant amount of carbon in its living plants and soils, and its role in maintaining the global carbon balance is becoming increasingly important [1]. However, accurately determining the stored carbon of boreal forests in terms of amount and distribution is challenging due to their remote location and frequent disturbances such as forest fires and insect outbreaks [2].

Aboveground biomass (AGB), an essential consideration in the evaluation of carbon sequestration potential in forests [3], can be effectively estimated using remote sensing techniques [4,5]. Combining remote sensing data from several sources can overcome the limitations associated with data from a single source, such as signal saturation in dense vegetation with optical sensors, wavelength-dependent saturation effects in radar data, and discontinuous sampling strategies in LiDAR data [6]. Additionally, enhancing predictive accuracy for forest AGB from remote sensing data can be further achieved by incorporating environmental variables as predictors [7].

However, multiple data sources pose additional challenges, including complex data dimensionality, duplicate metrics, and the selection of an appropriate prediction model [8]. Linear regression models have been commonly used because of their simplicity and interpretability. Nevertheless, these models have limitations when dealing with the intricate connections between remote sensing variables and AGB [9]. Nonparametric machine learning methods, like Support Vector Regression (SVR), Random Forest (RF), K-Nearest Neighbor (K-NN), Artificial Neural Networks (ANNs), and Gradient Boosting (GB), offer flexibility and accuracy, especially with high-dimensional data [10,11,12]. Additionally, an equally crucial challenge in integrating data from multiple sources is selecting the most informative variables for AGB estimation [8]. One potential solution involves using feature selection techniques, like the variable importance test in the RF model or selecting features based on their contribution to the decision-making process [13,14].

Another important step in modeling forest AGB using remote sensing is to collect a sufficient number of high-quality sample plots [4]. In the absence of remote sensing tools to directly estimate biomass, the use of sample plots is essential to establish an accurate correlation between remotely sensed signals and biomass. Therefore, the size and representation of the sample plots are crucial, especially considering the heterogeneity within boreal forests due to changes in precipitation and temperature caused by altitude gradients [15]. Consequently, generating sample datasets based on different elevation gradients may be advantageous for accurately estimating boreal forest AGB [16].

Despite the availability of various remote sensing techniques and modeling approaches for estimating forest AGB, the diverse ecological, topographic, and biophysical factors across different forest areas require the selection of optimal biomass mapping variables and models tailored to the specific characteristics of each forest or landscape [7].

This study, conducted within the Qilian Mountains in northwestern China, aims to compare the effectiveness of four machine learning (ML) modeling algorithms in mapping the distribution of AGB in boreal forests. By integrating field measurements, space-borne LiDAR, optical remote sensing, and environmental data, this study focuses on constructing a multi-source dataset and evaluating algorithm performance. In particular, LightGBM [17,18], a relatively new decision tree algorithm using gradient boosting, is explored for estimating forest AGB, a domain where its application is yet to be fully realized.

This study aims to (1) investigate the sensitivity of boreal forest AGB estimation to spectral bands, vegetation indices, terrain, bioclimate, and soil variables; (2) construct sample datasets from GLAS footprints and field survey plots to train AGB estimation models using four algorithms; (3) compare the effectiveness of AGB estimation models based on different ML algorithms; and (4) map forest AGB using optimal models and variables. Additionally, since this study focuses on remote sensing-based forest AGB estimation in montane boreal forests, we also compare the reliability of AGB estimation models developed using various algorithms across different value domains and elevation gradients, and we analyze the effects of sample dataset establishment and model selection on the AGB estimation results. This study seeks to provide a reference for remote sensing-based AGB estimation in boreal forests.

2. Materials

2.1. Study Area

Qilian Mountains National Park graces the Qilian Mountains, nestled along the border of Gansu and Qinghai provinces in China (Figure 1). This region contains a diverse range of ecosystems, including forests, shrubs, grasslands, and glaciers. The importance of the park lies in its role in biodiversity conservation, water conservation, and regional carbon storage.

The study area is characterized by altitudes ranging from 1770 to 5740 m and slopes varying from 0° to 58°. The forest is dominated by Picea crassifolia, accompanied by a minor proportion of Sabina przewalskii. Other scattered species in the area are Populus davidiana and Betula platyphylla. Picea crassifolia predominantly thrives within the altitude range of 2500 to 3300 m, favoring shady and semi-shady slopes. Sabina przewalskii is primarily distributed on sunny, semi-shaded, or semi-sunny slopes within the altitude range of 2700 to 3300 m, frequently cohabiting with Picea crassifolia in high-altitude areas.

2.2. Field Data

Field surveys were conducted in July 2018, encompassing GLAS footprint surveys and typical forest stand surveys. Based on the land cover data for the study area, a total of 201 GLAS footprints were identified within forested regions. Considering GLAS waveform sensitivity to surface topography [19] and accessibility, we selected 95 GLAS footprints across varied slope gradients: 12 at 0°–5°, 30 at 5°–15°, 20 at 15°–25°, 26 at 25°–35°, and 7 at 35°–58° slopes.

For the GLAS footprint surveys, a 55 m diameter field plot was centered on each selected footprint to maintain a uniform resolution between the field plots and GLAS data. Each plot contained four 15 m diameter circular subplots: one at the center and three at azimuths of 360°, 240°, and 120° from it, with 20 m between the center of the central subplot and the center of each outer subplot.

To address GLAS data discontinuity and ensure the sample plots are representative of the typical stands and tree species in the study area, 98 circular field plots, each with a 15 m diameter, were positioned strategically in typical forest areas. The distribution of the two types of field plots can be seen in Figure 1.

Data on tree species, number of sampled trees, canopy height, and diameter at breast height (DBH) were collected from sample areas of approximately 700 m² for GLAS footprint survey plots (sum of the areas of the four subplots) and 170 m² for typical forest stand survey plots. According to the National Forest Continuous Inventory (NFCI) data, the forest in the study area is mainly composed of mature and over-mature stands. However, the field survey revealed that saplings (DBH < 5 cm) planted in the last two years were present in three sample plots, all of them GLAS footprint survey plots on slopes of 0°–5°. Because of the time difference between the GLAS data and the field data, these newly planted trees and their AGB were excluded from the records to avoid errors.

The mean canopy height within a GLAS footprint is the average of the mean canopy heights observed in its subplots; footprint AGB was averaged across subplots in the same way. Individual tree AGB was estimated using species-specific allometric models obtained from previous studies [20]. Sample plot AGB was calculated by dividing the total AGB of all trees within the plot by the plot area. Supplementary Table S1 presents detailed statistics on the two types of field plot records, as well as the calculated AGB for these plots.
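The plot-level arithmetic described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and per-tree AGB is assumed to be supplied in kilograms by whatever allometric model is used.

```python
import math

def circular_plot_area_m2(diameter_m):
    """Area of a circular plot computed from its diameter."""
    return math.pi * (diameter_m / 2.0) ** 2

def plot_agb_mg_per_ha(tree_agb_kg, plot_area_m2):
    """Plot AGB density: total tree AGB divided by plot area, in Mg/ha.

    Assumes per-tree AGB values are given in kilograms.
    """
    total_mg = sum(tree_agb_kg) / 1000.0   # kg -> Mg
    area_ha = plot_area_m2 / 10000.0       # m^2 -> ha
    return total_mg / area_ha

def footprint_agb_mg_per_ha(subplot_agb):
    """Footprint AGB taken as the mean of its subplot AGB densities."""
    return sum(subplot_agb) / len(subplot_agb)

subplot_area = circular_plot_area_m2(15.0)  # about 176.7 m^2 per 15 m subplot
print(round(subplot_area, 1), round(4 * subplot_area, 1))
```

A 15 m diameter circle gives roughly 177 m² per subplot and roughly 707 m² for four subplots, which is in line with the approximate sample areas of 170 m² and 700 m² quoted in the text.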

2.3. ICESat/GLAS Data

The LiDAR data utilized in this study were acquired from the space-borne ICESat/GLAS system. Operating within the 1064 nm wavelength range, GLAS emits a pulse waveform, illuminating the ground and generating a nearly circular footprint, typically spanning a diameter of about 70 m (though this dimension may vary depending on the specific laser utilized; for instance, lasers #1, #2, and #3 exhibit footprint diameters of 110 m, 90 m, and 55 m, respectively). Notably, the centers of successive footprints during the satellite’s orbit maintain an approximate separation of 170 m [21].

During the period from 2003 to 2009, GLAS conducted annual data collection over cycles spanning roughly 33 days each. We selected GLAS data from 2005 (L3B, L3C, and L3D), 2007 (L3H and L3I), and 2008 (L3J and L3K) from Laser 3 to ensure uniform footprint sizes across all GLAS data, which is crucial for developing an effective field survey strategy. We accessed version 34 of GLA14 from the National Snow and Ice Data Center (NSIDC) website (http://nsidc.org/), which encompasses details regarding surface height, waveform length, and laser footprint geolocation [22].

To ensure the acquisition of high-quality waveform data, we initially subjected the data to filtering procedures following prior research [23]. Subsequently, we superimposed the GLAS footprints onto the Digital Elevation Model (DEM) data and conducted a weighted averaging of the elevation values across all pixels within each laser footprint. This allowed for a comparison with the elevation values recorded in the GLAS data, facilitating an assessment of the GLAS footprints’ geolocation accuracy [24].

x' = (x_{D} - x_{G}) \sin θ + (y_{D} - y_{G}) \cos θ    (1)

y' = (y_{D} - y_{G}) \sin θ - (x_{D} - x_{G}) \cos θ    (2)

W = e^{-2 \sqrt{(x' / α)^{2} + (y' / β)^{2}}}    (3)

where x' and y' are the coordinates of DEM pixels along the major and minor axes of the GLAS footprint; (x_{D}, y_{D}) and (x_{G}, y_{G}) represent the center coordinates of DEM pixels and the GLAS footprint, respectively; θ is the azimuth of the major axis of the GLAS footprint, as recorded in GLA14 data; W represents the weight assigned to each DEM pixel within the GLAS footprint; α is the semi-major axis; and β is the semi-minor axis.
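As a rough sketch of how Equations (1)–(3) could be applied, the snippet below weights each DEM pixel and forms a normalized weighted mean elevation for one footprint. The function names, the use of degrees for the azimuth, and the normalization by the sum of weights are assumptions made for illustration; the source only states that a weighted average was taken.

```python
import math

def footprint_weight(x_d, y_d, x_g, y_g, theta_deg, alpha, beta):
    """Weight of one DEM pixel inside a GLAS footprint, following Eqs. (1)-(3)."""
    theta = math.radians(theta_deg)  # assumption: azimuth supplied in degrees
    dx, dy = x_d - x_g, y_d - y_g
    x_p = dx * math.sin(theta) + dy * math.cos(theta)  # Eq. (1)
    y_p = dy * math.sin(theta) - dx * math.cos(theta)  # Eq. (2)
    return math.exp(-2.0 * math.sqrt((x_p / alpha) ** 2 + (y_p / beta) ** 2))  # Eq. (3)

def weighted_elevation(pixels, x_g, y_g, theta_deg, alpha, beta):
    """Weighted mean elevation over DEM pixels given as (x, y, elevation) tuples."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x_d, y_d, elevation in pixels:
        w = footprint_weight(x_d, y_d, x_g, y_g, theta_deg, alpha, beta)
        weighted_sum += w * elevation
        weight_total += w
    return weighted_sum / weight_total
```

A pixel at the footprint center receives a weight of 1, and the weight decays exponentially with the elliptical distance from the center, so the comparison with the GLAS elevation is dominated by pixels near the footprint center.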

We made a random selection of 50 GLAS footprints to check the accuracy of their geolocation. The results showed a coefficient of determination (R²) of 0.957 and a root mean square error (RMSE) of 1.49 m, both within the acceptable range [25].

2.4. Landsat 8 OLI Data

Landsat 8 OLI multispectral optical images, acquired from the United States Geological Survey (USGS) with a resolution of 30 m, were employed for this study. Nine L1T scenes with low cloud cover (path/row: 136/32, 137/33, 136/33, 135/33, 134/33, 133/33, 133/34, 132/34, and 131/34), acquired between July and August 2018, were selected to match the timing of image acquisition with the field measurement dates.

To mitigate the impact of atmospheric molecules and aerosols on the optical reflectance of land surfaces, radiometric calibration and atmospheric correction were performed. A DEM-based C-correction model [26] was applied for topography correction to alleviate geometric distortions and remove terrain shadows. Subsequently, the corrected image scenes were merged using histogram matching and converted to the UTM coordinate system (Zone 47 North, WGS84).

Forest AGB estimation has often relied on original spectral bands and vegetation indices (VIs) [27]. The VIs used in this study were the Normalized Difference Vegetation Index (NDVI), Simple Ratio (SR), Transformed Vegetation Index (TVI), Soil-Adjusted Vegetation Index (SAVI), and Enhanced Vegetation Index (EVI). A 3 × 3 sliding window was employed to extract average reflectance, minimizing spatial disparities between field plots and OLI images. The derivation process resulted in a total of 11 variables, each of which is outlined in detail in Table 1.
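For orientation, the five indices can be computed from surface reflectance as sketched below. The formulas are the commonly cited forms (with the SAVI soil factor set to 0.5 and the usual EVI coefficients) and are assumptions here; the exact definitions used in the study are those given in Table 1. The function name is hypothetical.

```python
import math

def vegetation_indices(blue, red, nir, soil_factor=0.5):
    """Common forms of NDVI, SR, TVI, SAVI, and EVI from reflectance in [0, 1].

    For Landsat 8 OLI, blue, red, and NIR correspond to Bands 2, 4, and 5.
    """
    ndvi = (nir - red) / (nir + red)
    return {
        "NDVI": ndvi,
        "SR": nir / red,
        "TVI": math.sqrt(ndvi + 0.5),  # transformed VI; defined for NDVI >= -0.5
        "SAVI": (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor),
        "EVI": 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0),
    }

indices = vegetation_indices(blue=0.05, red=0.10, nir=0.40)
```

In practice the reflectance inputs would be the 3 × 3 window means described above rather than single-pixel values.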

2.5. Land Cover Map

Our study used the Qilian Mountains Land Cover Dataset version 2.0 as the basis for identifying forested areas in the study area [33]. This dataset employs a classification system based on IGBP and FROM_LC, encompassing 9 categories: forest, shrubland, grassland, wetland, water, cropland, urban and built-up areas, bare land, and permanent snow and ice (Figure 2). Data were obtained from the China National Tibetan Plateau Environment Data Center (https://doi.org/10.11888/Ecolo.tpdc.270916 (accessed on 19 January 2024)).

By creating validation samples using data from the field survey, the land cover classification map accuracy was assessed. Within the study area, the 2018 land cover classification products exhibited a kappa coefficient of 0.84 and achieved an overall accuracy of 90.84%. Specifically, forest classification exhibited 95.34% producer accuracy and 93.53% user accuracy (Table 2).

2.6. Environmental Data

This study relied on elevation data sourced from GDEM V2 (ASTER Global Digital Elevation Model, version 2), accessible from the Japan Aerospace Exploration Agency (JAXA) website (https://www.jspacesystems.or.jp/ersdac/GDEM/J/4.html (accessed on 19 January 2024)). These GeoTIFF-format data, with a resolution of 30 m, also provided slope and aspect information.

The bioclimate variables utilized in this study were obtained from the WorldClim Global Climate Data website (www.worldclim.org (accessed on 19 January 2024)). This dataset includes 19 bioclimatic variables, with a resolution of 30 arc-seconds (approximately 1 km) [34].

The Resource and Environmental Science and Data Center of the Chinese Academy of Sciences (www.resdc.cn (accessed on 19 January 2024)) provided the soil data, which contain information on the spatial pattern of soil texture. The data were compiled from 1:1,000,000 soil type maps and soil profile data derived from the Second Soil Census of China.

For consistency across all datasets, the aforementioned information was resampled to a 30 m resolution and transformed into the WGS84 coordinate system.

3. Methods

3.1. Deriving Forest Canopy Heights from GLAS Data in Mountainous Areas

The extent of the GLAS waveform is affected by factors such as pulse energy distribution, footprint size, and surface topography [35]. To address these biases, a physically based terrain correction model was applied to derive canopy heights of mountainous forests from GLAS data based on diverse slope gradients. Given the significant time difference of approximately 10 years between the GLAS and field data, we enhanced the terrain correction model by incorporating translational coefficients, denoted as a and b, based on prior studies [36]. The modified model is expressed by the following equation:

H_{GLAS} = (\mathrm{d\_SigBegOff} - \mathrm{d\_gpCntRngOff}) - [a \times (d \cdot \tan θ / 2 + c \cdot FWHM / 2) + b]    (4)

where d_SigBegOff and d_gpCntRngOff are both recorded in GLA14, representing the canopy top waveform signal and the surface waveform signal, respectively; d denotes the diameter of the GLAS footprint, set at 55 m; c signifies light speed; FWHM (Full Width at Half Maximum) represents the width of the laser pulse, set at 6 ns; θ is defined as the average slope value of the slope class in which the footprint is located; and the translation coefficients are denoted by a and b.
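Equation (4) can be evaluated as in the sketch below. It is an illustration under stated assumptions: the waveform extent (the difference of the two GLA14 offsets, with the sign convention of Equation (4)) is passed in directly in meters, the slope is given in degrees, and the function and argument names are hypothetical. The values of a and b are inputs here; fitting them is not shown.

```python
import math

SPEED_OF_LIGHT_M_PER_NS = 0.299792458  # speed of light expressed in m/ns

def terrain_corrected_height(waveform_extent_m, slope_deg, a, b,
                             footprint_diameter_m=55.0, fwhm_ns=6.0):
    """Canopy height from Eq. (4): waveform extent minus the scaled terrain term.

    waveform_extent_m stands for (d_SigBegOff - d_gpCntRngOff) in meters.
    """
    slope_term = footprint_diameter_m * math.tan(math.radians(slope_deg)) / 2.0
    pulse_term = SPEED_OF_LIGHT_M_PER_NS * fwhm_ns / 2.0  # about 0.9 m for 6 ns
    return waveform_extent_m - (a * (slope_term + pulse_term) + b)

# Example with a = 1 and b = 0 (placeholder coefficients, not fitted values).
height = terrain_corrected_height(30.0, slope_deg=20.0, a=1.0, b=0.0)
```

With a 55 m footprint, the slope term alone is about 10 m at 20°, which shows why uncorrected waveform extents overstate canopy height so strongly on steep terrain.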

To derive optimal estimates of the translation coefficients, we employed the Levenberg–Marquardt (LM) algorithm [37], so that the derived canopy height closely matches the true canopy height within the study area. The LM algorithm is widely used for nonlinear least-squares optimization and typically converges to an optimal solution within a small number of iterations.

A random sample comprising 60% of the field-surveyed GLAS footprints from each of the five slope gradients was used to fit the translational coefficients a and b. The remaining 40% of the field-surveyed GLAS footprints from each slope gradient were utilized to calculate corrected canopy heights and compare them to actual forest canopy heights in the study area.

3.2. Relating Field-Based AGB to GLAS-Derived Canopy Heights

As mentioned in Section 2.2, out of the total 201 GLAS footprints within the forested area of the study site, only 95 were field-surveyed and had AGB values recorded. To calculate the AGB values for the remaining 106 GLAS footprint sites where no field surveys were conducted, we needed to model the relationship between GLAS-derived canopy heights and GLAS footprint AGB.

Vegetation distribution in the Qilian Mountains is strongly influenced by altitude, with different tree species exhibiting distinct altitudinal ranges [38]. Therefore, we established the relationship between GLAS-derived canopy heights and GLAS footprint AGB based on different elevation gradients.

We categorized elevation gradients into four intervals at 1000 m increments (1770–2770 m, 2770–3770 m, 3770–4770 m, and 4770–5740 m, respectively) within the study area. Subsequently, GLAS footprint survey plots were extracted at different elevation gradients: 19 at 1770–2770 m, 52 at 2770–3770 m, 24 at 3770–4770 m, and no footprint survey plots above 4770 m.

For each elevation gradient, the footprint survey plots were applied to establish a functional connection between tree height extracted by GLAS and footprint AGB. Utilizing SPSS, various mathematical functions were assessed for best fit (highest R² and adjusted R²). The model with optimal performance in each gradient was chosen as its GLAS footprint AGB fitting model.
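Because candidate functions differ in their number of parameters, adjusted R² is the more conservative of the two criteria. The sketch below shows one way the comparison might be made; the selection rule (highest adjusted R²), the function names, and the example R² values are assumptions for illustration only and are not results from the study.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def select_model(candidates, n_samples):
    """Pick the candidate with the highest adjusted R^2.

    candidates maps a model name to (R^2, number of predictor terms).
    """
    return max(
        candidates,
        key=lambda name: adjusted_r2(candidates[name][0], n_samples, candidates[name][1]),
    )

# Hypothetical fits on 19 plots: the cubic has a slightly higher R^2,
# but the quadratic wins once the extra term is penalized.
best = select_model({"quadratic": (0.80, 2), "cubic": (0.81, 3)}, n_samples=19)
```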

3.3. Variable Selection for AGB Estimation Modeling

To identify the key variables among the 34 that most significantly impact the estimation of boreal forest AGB, we employed a decision tree-based feature selection method.

A decision tree (DT) divides data into groups based on information purity, using splitting criteria such as the Gini index or information gain. Information gain is utilized for segmentation and pruning to prevent overfitting [39]. The “gain ratio” [40] served as the criterion for feature selection and is expressed as follows:

GainRatio(S, T) = \frac{Gain(S, T)}{- \sum_{n = 1}^{k} (|S_{n}| / |S|) \times \log_{2} (|S_{n}| / |S|)}    (5)

where Gain(S, T) is the information gain acquired by partitioning the sample set S by feature T; k denotes the number of possible values of feature T; and S_{n} is the subset of S in which T takes its n-th value.
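A small worked example may help make Equation (5) concrete. The sketch below computes the gain ratio for a categorical feature and discrete class labels, which is the textbook setting for this criterion; the function names are hypothetical and the toy data are not from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Gain ratio of Eq. (5): information gain divided by split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after it.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information: entropy of the partition sizes |S_n| / |S|.
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

two_way = gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1])   # clean 2-way split
four_way = gain_ratio(["a", "b", "c", "d"], [0, 0, 1, 1])  # same gain, 4-way split
```

Both splits separate the labels perfectly, but the four-way split is penalized by its larger split information, which is the reason the gain ratio is preferred over raw information gain for many-valued features.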

At each decision tree branching, the influential features are determined based on their impact on information entropy. Consequently, ranking variables by frequency of occurrence allows for identifying the most important ones. In this study, all feature variables are inputted into the XGBoost model (a decision tree-based ML algorithm) [41]. The frequency of occurrence of each variable is determined by its ratio of occurrences in decision tree branches to the total number of branches. Subsequently, variables ranked in the bottom 10% are excluded to form a new subset, and the training dataset is reconstructed. The performance of the subset for each round was evaluated based on 10-fold cross-validation. The search for influential features halts when R² no longer increases.
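The elimination loop described above can be outlined as follows. This is a sketch under assumptions: the two callables stand in for the XGBoost split-frequency ranking and the 10-fold cross-validated R², the rounding rule for "bottom 10%" (floor, with at least one variable dropped) is a guess, and all names are hypothetical.

```python
import math

def backward_eliminate(features, split_frequency, cv_r2, drop_fraction=0.10):
    """Drop the least frequently used features until CV R^2 stops increasing.

    split_frequency(features) -> dict mapping feature name to split frequency
    cv_r2(features)           -> cross-validated R^2 for that feature subset
    """
    current = list(features)
    best_score = cv_r2(current)
    while len(current) > 1:
        frequency = split_frequency(current)
        n_drop = max(1, math.floor(len(current) * drop_fraction))
        # Least frequent features come first in the ranking.
        to_drop = set(sorted(current, key=lambda f: frequency[f])[:n_drop])
        candidate = [f for f in current if f not in to_drop]
        score = cv_r2(candidate)
        if score <= best_score:  # stop when R^2 no longer increases
            break
        current, best_score = candidate, score
    return current, best_score
```

In a real run, `split_frequency` would refit the model on the current subset each round, since the ranking can change once variables are removed.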

3.4. Algorithms of AGB Estimation Modeling

3.4.1. Extreme Gradient Boosting (XGBoost)

Chen and Guestrin introduced XGBoost in 2016 as an enhancement to the Gradient Boosting Decision Tree (GBDT) algorithm [42]. It optimizes a regularized objective function in which tree size and leaf weights are controlled by standard regularization parameters [42]. Rather than using all sample features in each iteration, XGBoost employs the random subspace method during training, inspired by Random Forests. This strategy helps with both computational speed and accuracy. Additionally, XGBoost supports diverse objective functions, including classification, regression, and ranking [43].

3.4.2. Light Gradient Boosting Machine (LightGBM)

LightGBM is a relatively new decision tree algorithm that uses gradient boosting [17]. It improves upon GBDT and operates as an ensemble learning technique grounded in boosting. Compared to XGBoost, LightGBM offers reduced training time and memory usage while maintaining comparable prediction accuracy. Its training speed is improved by exclusive feature bundling (EFB) and gradient-based one-side sampling (GOSS) [44]. LightGBM grows trees leaf-wise, at each step splitting the leaf that yields the largest reduction in loss [17].

3.4.3. Support Vector Regression (SVR)

SVR, a popular machine learning method for regression problems, uses kernel functions and statistical learning theory to map input data into a higher-dimensional space [45]. This transformation allows complex nonlinear regressions to be treated as linear problems, improving learning outcomes, especially with limited statistical samples [45]. The choice of kernel function depends on the specific data characteristics [46]. We opted for the radial basis function (RBF) kernel for this study due to its superior generalization performance.

3.4.4. Random Forest (RF)

RF is an ensemble learning method that relies on bagging (bootstrap aggregating), an early form of ensemble tree method [47]. RF generates new datasets by bootstrapping the original sample dataset. A decision tree is then constructed from each bootstrap sample, and the collective results are used to enhance prediction accuracy [48]. Approximately one-third of the original sample data, referred to as the out-of-bag (OOB) data, is excluded from each bootstrap sample, allowing OOB error estimates to be generated for each decision tree. Averaging these estimates yields a generalized error estimate for RF that is unaffected by multicollinearity among variables [11].

The scikit-learn package in Python was utilized to carry out all modeling techniques. The performance of each model was assessed through repeating 5-fold cross-validation 10 times. To optimize these models, grid search methods were implemented to adjust the hyperparameters, as outlined in Table 3.
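To make the resampling scheme explicit, the sketch below generates the train/test index sets for 5-fold cross-validation repeated 10 times, giving 50 evaluations per model. It is a dependency-free illustration with hypothetical names; in a scikit-learn workflow the same splits would more likely come from `sklearn.model_selection.RepeatedKFold`.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test

n_evaluations = sum(1 for _ in repeated_kfold_indices(100))
```

Averaging the per-split scores over all repeats reduces the dependence of the reported accuracy on any single random partition, which matters with a sample of only a few hundred plots.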

3.5. Accuracy Assessment and Statistical Analysis

We used stratified random sampling to divide the 193 field survey plots, allocating 70% to the training dataset and the remaining 30% to an independent validation dataset. Additionally, GLAS footprints in the forested areas of the study area that were not surveyed in the field were used for model training, too. The AGB of these sites was calculated using the methodology described in Section 3.2.
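A stratified 70/30 split can be sketched as below. The source does not state which variable defined the strata, so the use of an elevation-class label per plot is an assumption made purely for illustration, as are the function and argument names.

```python
import random
from collections import defaultdict

def stratified_split(sample_ids, strata, train_fraction=0.7, seed=0):
    """Split samples into training and validation sets within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for sample_id, stratum in zip(sample_ids, strata):
        by_stratum[stratum].append(sample_id)
    train, valid = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Toy example: 20 plots in two hypothetical elevation classes.
plot_ids = list(range(20))
classes = ["low"] * 10 + ["high"] * 10
train_ids, valid_ids = stratified_split(plot_ids, classes)
```

Splitting within strata keeps each class represented in the validation set in roughly the same proportion as in the full sample.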

To comprehensively evaluate and compare the effectiveness of the AGB models developed by each algorithm, we first assessed their model accuracy by repeating the 5-fold cross-validation 10 times during model training. We measured and compared the cross-validated RMSE (CV-RMSE) and cross-validated R² (CV-R²) for each model. Next, we independently validated the AGB estimates generated by the models of each algorithm using an independent validation dataset. We calculated the RMSE to quantify the deviation between the estimated and true values and the R² to assess the correlation between these values to evaluate the predictive accuracy of each model. Finally, we analyzed the residuals and Relative Error (RE) of the AGB estimates of each model across different field data value ranges and elevations using the independent validation dataset to assess the reliability of each model’s AGB estimates under different conditions. Through these comprehensive analyses and comparisons, we aim to identify the most suitable AGB estimation model for boreal forests. Moreover, we calculated the proportion of rasters with different value ranges in the AGB estimation maps produced by each model and compared them with the training dataset. This comparison allowed us to analyze the impact of the sample dataset creation and algorithm selection on the AGB estimation results.

The forms of the R², RMSE, and RE are as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}    (6)

RMSE = \sqrt{\frac{\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}}    (7)

RE = \frac{\hat{y}_{i} - y_{i}}{y_{i}} \times 100\%    (8)

where n is the sample size, y_{i} represents the field AGB, \hat{y}_{i} signifies the estimated AGB, and \bar{y} stands for the average field AGB.
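Equations (6)–(8) translate directly into code. The following is a minimal sketch with hypothetical function names; the total sum of squares in R² is taken about the mean of the field values, as is standard, and the toy numbers are not from the study.

```python
import math

def r_squared(field, estimated):
    """Coefficient of determination, Eq. (6)."""
    mean_field = sum(field) / len(field)
    ss_res = sum((y - p) ** 2 for y, p in zip(field, estimated))
    ss_tot = sum((y - mean_field) ** 2 for y in field)
    return 1.0 - ss_res / ss_tot

def rmse(field, estimated):
    """Root mean square error, Eq. (7), in the units of the inputs."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(field, estimated)) / len(field))

def relative_error_percent(field_value, estimated_value):
    """Relative error of one estimate, Eq. (8); positive means overestimation."""
    return (estimated_value - field_value) / field_value * 100.0

field_agb = [100.0, 120.0, 80.0]      # toy values in Mg/ha
estimated_agb = [110.0, 110.0, 80.0]
print(r_squared(field_agb, estimated_agb), rmse(field_agb, estimated_agb))
```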

In summary, the initial step of this study was to apply a terrain correction model grounded in physical principles to eliminate any terrain-related impact on the GLAS waveforms. Second, we performed a regression analysis to determine the best relationship between the GLAS-derived canopy heights and the field-measured AGB, considering elevation differences. Predictor variables for AGB modeling were selected based on their contribution to feature importance. Finally, we combined field survey data and additional forest extent GLAS footprint AGB estimates with the selected feature variables to create training and testing datasets for evaluating various ML algorithms for estimating boreal forest AGB. The methodology employed in this study is summarized in Figure 3.

4. Results

4.1. GLAS-Derived Canopy Heights Results

Using the LM algorithm, we derived suitable values of

a

and

b

for various terrain slope gradients. Subsequently, we corrected the canopy height obtained from GLAS data. By assessing the RMSE of GLAS-extracted canopy heights before and after topographic correction, we evaluated the correction effect of different slope gradients, as depicted in Figure 4.

Figure 4 illustrates that applying topographic corrections significantly reduced the disparity between GLAS-derived and actual canopy heights, especially on steeper slopes. Prior to the correction, the maximum gap between the two heights reached 12.59 m. However, post-correction, the difference consistently remained below 4 m across all slope categories.

4.2. AGB Estimation of GLAS Footprints in Forest Areas

We fitted the relationship between footprint AGB and GLAS-derived canopy heights using the GLAS footprint survey plots. Quadratic, Cubic, and Cubic models proved optimal for elevation gradients of 1770–2770 m, 2770–3770 m, and 3770–4770 m, respectively (Table 4) (p < 0.01).

The 106 GLAS footprints that were not field surveyed were distributed as follows: 12 footprints at elevations between 1770 and 2770 m, 90 footprints between 2770 and 3770 m, and 4 footprints between 3770 and 4770 m. There were no footprints observed above 4770 m. The fitted models were subsequently applied to these footprints, and the estimated AGB results are summarized in Table 5.

4.3. Variable Selection for AGB Estimation

Figure 5 illustrates the frequency of occurrence for each variable in the initial XGBoost model run.

The variable with the highest occurrence frequency (8.77%) was Bio 15 from the WorldClim dataset. Conversely, Bio 8 from the WorldClim dataset and SR from the VIs dataset had the lowest occurrence frequency (0%). Eight of the top ten variables in frequency were from the WorldClim dataset, while two were from the DEM dataset. In contrast, all variables from the VIs dataset had occurrences below 3%.

The most frequent variables in each dataset were Bio 15, Bio 3, and Bio 4 in the WorldClim dataset; Elevation in the DEM dataset; Band 5 and Band 6 in the Original Band dataset; and EVI in the VIs dataset.

We removed variables in the bottom 10% occurrence ranking, resulting in a new variable dataset. After four rounds of filtering, the R² of the AGB prediction model no longer changed significantly. The final AGB estimation dataset comprised 22 variables (Table 6).

4.4. Comparison of Different AGB Estimation Models

Figure 6 displays the cross-validation results for all AGB models evaluated with CV-RMSE and CV-R².

Figure 6 illustrates that the LightGBM AGB model achieved the highest performance (CV-R²avg = 0.67, CV-RMSEavg = 14.34 Mg/ha). The XGBoost AGB model followed closely with comparable results (CV-R²avg = 0.62, CV-RMSEavg = 15.33 Mg/ha), ahead of the RF AGB model (CV-R²avg = 0.41, CV-RMSEavg = 19.68 Mg/ha). The weakest model was the SVR AGB model (CV-R²avg = 0.37, CV-RMSEavg = 22.86 Mg/ha).

We clipped the AGB distribution maps produced by each ML model using the forest extent delineated in the land cover map. Subsequently, we counted the proportion of rasters with different value ranges in the AGB maps estimated by each model and compared them to the training dataset (Figure 7).

Figure 7 indicates that the training dataset exhibited a generally uniform distribution across multiple value ranges, except for values below 30 Mg/ha. Conversely, AGB estimates generated by different ML models primarily fell within the 60–120 Mg/ha range. The concentration of data distribution estimated by each model followed the order SVR, RF, XGBoost, and LightGBM.

Figure 8 displays the independent validation results for each model. The LightGBM AGB model was the most accurate, followed in descending order by the XGBoost, RF, and SVR AGB models.

We further analyzed the residual distribution for different models across various value ranges (Figure 9a) and elevations (Figure 9b).

In Figure 9a, all models showed stronger underestimation at high AGB values than overestimation at low AGB values. LightGBM had residuals within ±25 Mg/ha for field AGB between 30 and 120 Mg/ha. XGBoost had errors over 25 Mg/ha in the 30–60 Mg/ha range. However, XGBoost showed less underestimation than LightGBM in the 120–150 Mg/ha range. The RF AGB model performed similarly to the XGBoost AGB model in the 30–90 and 120–150 Mg/ha value ranges and slightly outperformed it in the 90–120 Mg/ha range. However, RF significantly underestimated values between 150 and 180 Mg/ha. The SVR AGB model exhibited the highest errors across all value ranges.
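A residual summary by value range of this kind can be sketched as below. The sketch makes two assumptions that the text implies but does not state: residuals are taken as predicted minus field-measured AGB (so overestimates are positive and underestimates negative), and the value ranges are 30 Mg/ha wide. The function names and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def range_label(value: float, edges: Sequence[float]) -> str:
    """Label the half-open interval [low, high) of `edges` that contains `value`."""
    for low, high in zip(edges[:-1], edges[1:]):
        if low <= value < high:
            return f"{low:g}-{high:g}"
    return "out of range"


def residuals_by_range(
    observed: Sequence[float],
    predicted: Sequence[float],
    edges: Sequence[float] = (0, 30, 60, 90, 120, 150, 180),
) -> Dict[str, Tuple[float, float]]:
    """Return (min, max) residual per field-AGB range; residual = predicted - observed."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for obs, pred in zip(observed, predicted):
        grouped[range_label(obs, edges)].append(pred - obs)
    return {label: (min(r), max(r)) for label, r in grouped.items()}


# Made-up plots: positive residuals are overestimates, negative are underestimates.
summary = residuals_by_range([45.0, 55.0, 160.0], [60.0, 50.0, 130.0])
```

The same grouping works for elevation bands by passing plot elevations to `range_label` with elevation edges instead of AGB edges.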

Figure 9b illustrates the overestimation at low values and the underestimation at high values across different elevation ranges.

For low-value overestimation, at elevations between 1770 and 2770 m, the residuals are 2–35 Mg/ha for the SVR AGB model, 4–25 Mg/ha for the RF AGB model, 3–27 Mg/ha for the XGBoost AGB model, and 1–37 Mg/ha for the LightGBM AGB model. In the 2770–3770 m range, the residuals are 3–46 Mg/ha for the SVR AGB model, 3–30 Mg/ha for the RF AGB model, 1–32 Mg/ha for the XGBoost AGB model, and 1–18 Mg/ha for the LightGBM AGB model. At elevations of 3770–4770 m, the residuals are 15–50 Mg/ha for the SVR AGB model, 1–37 Mg/ha for the RF AGB model, 1–32 Mg/ha for the XGBoost AGB model, and 2–14 Mg/ha for the LightGBM AGB model. As elevation increases, the degree of overestimation at low values rises, following the order SVR > RF > XGBoost. However, the LightGBM AGB model’s residuals decrease with increasing elevation, generally remaining below 25 Mg/ha at all levels.

For high-value underestimation, at elevations of 1770–2770 m, the residuals range from −19 to −58 Mg/ha for the SVR AGB model, −2 to −48 Mg/ha for the RF AGB model, −2 to −50 Mg/ha for the XGBoost AGB model, and −9 to −38 Mg/ha for the LightGBM AGB model. In the 2770–3770 m range, the residuals are −3 to −49 Mg/ha for the SVR AGB model, −1 to −43 Mg/ha for the RF AGB model, −2 to −39 Mg/ha for the XGBoost AGB model, and −1 to −38 Mg/ha for the LightGBM AGB model. At elevations of 3770–4770 m, the residuals are −14 to −74 Mg/ha for the SVR AGB model, −1 to −83 Mg/ha for the RF AGB model, −4 to −54 Mg/ha for the XGBoost AGB model, and −5 to −24 Mg/ha for the LightGBM AGB model. The greatest underestimation occurs at 3770–4770 m for the SVR, RF, and XGBoost models, while the least occurs at 2770–3770 m, with the severity order being SVR > RF > XGBoost. The underestimation of the LightGBM AGB model generally decreases with elevation, with residuals typically no more negative than −38 Mg/ha.

Overall, the models tended to overestimate AGB values more frequently than they underestimated them across all elevation gradients. Low-value overestimation and high-value underestimation were most pronounced at elevations of 3770–4770 m, where the models showed the greatest variability in performance.

We also analyzed the Relative Error (RE) of AGB estimation for each algorithm. The analysis revealed that SVR exhibits the largest RE, with a maximum exceeding 135%, while LightGBM shows the smallest RE, with a maximum of 43.75%. The average RE for the algorithms follows the order SVR > RF > XGBoost > LightGBM. Regarding elevation gradients, all models show the highest RE at elevations of 3770–4770 m. However, LightGBM demonstrates the highest reliability across all altitude gradients.
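As an illustration of how maximum and average RE values like these can be derived, the sketch below assumes the common definition RE (%) = |predicted − observed| / observed × 100. The formula choice, function names, and sample values are assumptions for illustration and are not taken from the study.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def relative_errors(observed: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """RE (%) per plot, assuming RE = |predicted - observed| / observed * 100."""
    return [abs(p - o) / o * 100.0 for o, p in zip(observed, predicted)]


def summarize_re(observed: Sequence[float], predicted: Sequence[float]) -> Tuple[float, float]:
    """Return (mean RE, maximum RE) in percent; observed values must be positive."""
    res = relative_errors(observed, predicted)
    return mean(res), max(res)


# Made-up field and estimated AGB values in Mg/ha.
mean_re, max_re = summarize_re([80.0, 100.0, 160.0], [100.0, 90.0, 120.0])
```

Because RE divides by the observed value, the same absolute residual weighs more heavily on low-AGB plots than on high-AGB plots.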

4.5. Forest AGB Mapping

The distribution of forest AGB in the study area, estimated by the LightGBM AGB model, is depicted in Figure 10. The average AGB of the regional forests is 102.06 Mg/ha. Higher AGB values were predominantly located in the central part of the park.

5. Discussion

5.1. Modeling AGB Estimation Using Different Algorithms

The accuracy of estimating forest AGB through the integration of remote sensing (RS) techniques with field survey data relies on comprehending the complex relationship between RS variables and forest biomass. Table 7 presents recent studies employing ML algorithms to develop RS-based AGB estimation models.

Commonly used ML techniques for estimating forest AGB include kernel function-based learners (e.g., GPR and SVR), tree-based learners (e.g., RF, GBR, ERT, SGB, CatBoost, and XGBoost), and neural network-based learners (e.g., ANN and MLP) [58].

Studies have demonstrated that SVR, a classical ML method, has been extensively and consistently used for AGB estimation. Additionally, tree-based ML approaches such as RF and GBR have proven effective in predicting AGB, with RF achieving high estimation accuracy in numerous studies. Gradient boosting models, such as XGBoost and CatBoost, have even surpassed RF in performance [51,52,55]. In this context, we chose to develop the boreal forest AGB estimation model using SVR; RF; the gradient-boosted XGBoost algorithm, which has shown superior performance; and LightGBM, a newer member of the gradient-boosting family. We evaluated the AGB estimation results from models based on these four algorithms. Among them, the XGBoost and LightGBM AGB models exhibited higher estimation accuracy, with the LightGBM AGB model performing best in terms of model accuracy, estimation accuracy, and reliability across various value ranges and elevation gradients.

The concept of deep learning originates from the study of Artificial Neural Networks, with a multilayer perceptron (MLP) featuring multiple hidden layers as a typical deep learning architecture [59]. As a subfield of machine learning, deep learning is characterized by its ability to approximate complex functions through deep, nonlinear network structures. It captures distributed representations of input data and demonstrates a strong capability to identify critical features from a limited number of samples, thereby enhancing classification or prediction accuracy [60]. Several studies have shown that deep learning methods can be effective for estimating forest AGB, often achieving high levels of accuracy [61,62]. Our subsequent research aims to further improve AGB estimation accuracy in boreal forests by integrating various types of variables and sample data and by developing models based on deep learning techniques.

5.2. Modeling Variables Selected for AGB Estimation

The improvement of forest AGB modeling accuracy heavily depends on the careful selection of appropriate predictor variables [14]. Techniques such as stepwise regression, RF-based variable importance testing, correlation analysis, decision tree algorithm-based feature contribution analysis, and recursive feature elimination (RFE) are commonly used [14,52,63]. The study by Luo et al. [52] emphasized that relying solely on one ML technique for both feature selection and modeling may not always yield the most precise AGB estimation.

In the present study, the 22 top-ranked variables were used, including the Red, NIR, SWIR1, and SWIR2 bands extracted from optical image data. Previous research highlighted the significance of these bands in accurately estimating AGB [8,14]. The higher sensitivity of the SWIR band to vegetation shadows and soil moisture, as well as its resilience against atmospheric conditions, likely contributes to its importance [64]. The Red band distinguishes between vegetation types, while the NIR band in Landsat 8 OLI displays enhanced sensitivity to diverse vegetation types [65].

Local environmental conditions significantly influence forest diversity. Consequently, an increasing number of studies incorporate topographic or climatic variables as ancillary data to predict forest AGB. In our study, elevation, slope, and aspect were selected, with elevation standing out as the most influential variable given its crucial role in determining boreal forest species distribution.

Furthermore, various climatic variables closely related to plant growth, such as mean warmest month temperature, mean coldest month temperature, summer precipitation, and winter precipitation, will also be considered in our next study. These variables will enrich the dataset of environmental factors, enabling us to further explore their effects on the accuracy of boreal forest AGB estimation.

5.3. Forest Field Survey Data for AGB Estimation

The main categories of field samples used for estimating forest AGB include forest field survey data [51,53], Forest Management Inventory (FMI) data [66], National Forest Inventory (NFI) data [15,52], and forest AGB sample plot data from the previous research literature [11]. The size and representativeness of sample plots are a primary focus of scholarly attention [4]. Cao et al. [16] utilized optical data, airborne LiDAR data, and 32 sample plots in the Qilian Mountains to estimate forest AGB and observed that increasing sample size improved model accuracy up to a point beyond which the variation decreased. Similarly, Strunk et al. [67] discovered that the model accuracy plateaued at a sample size of 35.

Optimal distribution of field plots can partially compensate for reduced sample size [68]. In this study, we carefully considered spatial distribution and forest structure to create a comprehensive sample dataset based on GLAS footprints and typical forest stand survey plots. However, comparing raster proportions of AGB maps estimated by ML models across different value ranges with those of the training dataset revealed that, while training samples were evenly distributed across value segments, model predictions were mainly within the 60–120 Mg/ha range. This highlights that, in the estimation of forest AGB using ML techniques, the choice of predictive models is more crucial than the selection of samples.

5.4. AGB Estimation in Mountain Forests Using GLAS Data

LiDAR instruments, like space-borne ICESat/GLAS, are frequently used to estimate forest AGB. However, footprint size and geolocation accuracy can significantly impact AGB estimation [11]. Milenkovic et al. [69] found that field sample size should match LiDAR footprint size when applying space-borne waveform LiDAR to estimate forest biomass, and additional efforts should be made to minimize footprint geolocation errors.

In this study, we assessed GLAS footprint geolocation accuracy and developed a matching field sampling scheme. We also implemented an enhanced terrain correction model to reduce terrain slope effects on GLAS waveforms. Furthermore, we incorporated elevation gradients to estimate GLAS footprint AGB, enhancing sample quality.

However, there is nearly a 10-year gap between the space-borne GLAS LiDAR data used in this study and the field sampling and Landsat OLI data. Despite the predominance of mature and over-mature forests in the study area, changes in natural conditions and discrepancies between remote sensing datasets over this period may contribute to estimation errors. To address the impact of terrain slope and the temporal gap on GLAS data application, several statistical methods were employed. These included optimizing the terrain correction model using the LM algorithm and fitting the relationship between GLAS-derived canopy heights and GLAS footprint AGB across different elevation gradients. Errors introduced by these fitting processes may also affect AGB estimation accuracy. Recognizing that each model and dataset introduces uncertainty, we plan to quantify these uncertainties using the Monte Carlo method [70] in future work.

5.5. Results of Forest AGB Estimation

In studies utilizing RS data for forest AGB estimation, a common challenge is the tendency to underestimate high AGB values and overestimate low ones [4]. In this study, the LightGBM AGB model outperformed the others within the 30 to 120 Mg/ha range, with residuals consistently within ±25 Mg/ha. Even in the 120–180 Mg/ha range, the magnitude of LightGBM’s underestimation residuals remained below 50 Mg/ha. Additionally, compared to the other models, the LightGBM AGB model demonstrated the smallest Relative Error, with a maximum value of 43.75%. These findings suggest that appropriate modeling algorithms can mitigate underestimation and overestimation in boreal forest AGB estimation using RS variables.

One of the main reasons for the heterogeneity of boreal forests is the variation in precipitation and temperature caused by differences in elevation gradients [15]. The variability of environmental variables at different altitudes may affect the estimates produced by various machine learning algorithms. In fact, we found that, with respect to residual and Relative Error distributions across different elevation gradients, the models showed no significant difference between 1770 and 3770 m, except for the SVR model, which performed slightly worse. However, above 3770 m, differences in model accuracy became more pronounced, indicating that greater efforts are needed to identify suitable models for estimating forest AGB in high-elevation regions. Notably, LightGBM demonstrated the highest reliability, with the lowest Relative Error, across all elevation gradients. In future studies, we will further refine the optimal strategy for AGB estimation across different elevation conditions in boreal forests by improving sample dataset establishment, variable selection, and algorithm comparison. This is expected to offer valuable insights for enhancing AGB estimation based on varying elevations in boreal forests.

5.6. Error Analysis

The quality of remote sensing images can be affected by factors such as system noise, cloud cover, sun elevation, and atmospheric effects, which may introduce errors in forest AGB estimation. Employing multitemporal image stacking to enhance image quality [71] could potentially improve the accuracy of RS-based AGB estimates for boreal forests.

In this study, the entire Qilian Mountains National Park was selected as the study area, and AGB estimation models were developed using various algorithms to compare their accuracy for boreal forest AGB estimation. However, the presence of other landforms within the study area, which constitute a significant portion of the region, may have interfered with the model fitting process and introduced errors. Additionally, caution is advised when interpreting results due to the focus on a single study area. The generalizability of this method to other forest types or different terrain conditions should also be further explored.

Therefore, in subsequent studies, we will narrow the research scope and focus specifically on regional forest areas. Furthermore, we will explore the effects of variable selection methods, AGB mapping scales, and changes in sample size and representativeness on the boreal forest AGB estimation accuracy.

6. Conclusions

This study assesses and compares the performance of four ML algorithms in estimating AGB in boreal forests using multi-source data. Our findings highlight the significant role of certain environmental variables, particularly precipitation seasonality and elevation, in boreal forest AGB estimation. Among the models evaluated, the LightGBM AGB model demonstrated superior accuracy, effectively addressing issues of overestimation and underestimation in AGB estimates derived from RS data. It also showed the highest reliability across all elevation gradients. Our results suggest that the choice of predictive models is more crucial than sample selection for accurate AGB estimation in boreal forests. Additionally, discrepancies in model accuracy become more pronounced above 3770 m elevation, emphasizing the necessity to identify suitable models for estimating AGB in high-elevation regions of boreal forests.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/su16167232/s1. Table S1: Detailed description of data from field survey sample plots.

Author Contributions

Conceptualization, J.S.; methodology, J.S.; software, Y.G. and Q.L.; validation, X.L., Y.G. and S.A.; formal analysis, J.S.; investigation, X.L., J.S. and S.A.; resources, J.S.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, X.L. and S.A.; visualization, Y.G.; supervision, X.L.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Nature Science Foundation of Gansu Province, grant number 23JRRAI413; Gansu Agricultural University Publicly Recruited Doctoral Research Initiation Grant, grant number GAU-KYQD-2021-46.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

We would like to thank the Qilian Mountains National Park Administration for their help in the field inventory of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tagesson, T.; Schurgers, G.; Horion, S.; Ciais, P.; Tian, F.; Brandt, M.; Ahlstrom, A.; Wigneron, J.-P.; Ardo, J.; Olin, S.; et al. Recent divergence in the contributions of tropical and boreal forests to the terrestrial carbon sink. Nat. Ecol. Evol. 2020, 4, 202–209.
2. Margolis, H.A.; Nelson, R.F.; Montesano, P.M.; Beaudoin, A.; Sun, G.; Andersen, H.-E.; Wulder, M.A. Combining satellite lidar, airborne lidar, and ground plots to estimate the amount and distribution of aboveground biomass in the boreal forest of North America. Can. J. For. Res. 2015, 45, 838–855.
3. Zhang, Y.; Liang, S.; Sun, G. Forest Biomass Mapping of Northeastern China Using GLAS and MODIS Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 140–152.
4. Lu, D.; Chen, Q.; Wang, G.; Liu, L.; Li, G.; Moran, E. A survey of remote sensing-based aboveground biomass estimation methods in forest ecosystems. Int. J. Digit. Earth 2016, 9, 63–105.
5. Zolkos, S.G.; Goetz, S.J.; Dubayah, R. A meta-analysis of terrestrial aboveground biomass estimation using lidar remote sensing. Remote Sens. Environ. 2013, 128, 289–298.
6. Sun, X.; Li, G.; Wang, M.; Fan, Z. Analyzing the Uncertainty of Estimating Forest Aboveground Biomass Using Optical Imagery and Spaceborne LiDAR. Remote Sens. 2019, 11, 722.
7. Lopez-Serrano, P.M.; Cardenas Dominguez, J.L.; Javier Corral-Rivas, J.; Jimenez, E.; Lopez-Sanchez, C.A.; Jose Vega-Nieva, D. Modeling of Aboveground Biomass with Landsat 8 OLI and Machine Learning in Temperate Forests. Forests 2020, 11, 11.
8. de Almeida, C.T.; Galvao, L.S.; de Oliveira Cruz e Aragao, L.E.; Henry Balbaud Ometto, J.P.; Jacon, A.D.; de Souza Pereira, F.R.; Sato, L.Y.; Lopes, A.P.; Lima de Alencastro Graca, P.M.; Silva, C.V.d.J.; et al. Combining LiDAR and hyperspectral data for aboveground biomass modeling in the Brazilian Amazon using different regression algorithms. Remote Sens. Environ. 2019, 232, 111323.
9. Fassnacht, F.E.; Hartig, F.; Latifi, H.; Berger, C.; Hernandez, J.; Corvalan, P.; Koch, B. Importance of sample size, data type and prediction method for remote sensing-based estimations of aboveground forest biomass. Remote Sens. Environ. 2014, 154, 102–114.
10. Gao, Y.; Lu, D.; Li, G.; Wang, G.; Chen, Q.; Liu, L.; Li, D. Comparative Analysis of Modeling Algorithms for Forest Aboveground Biomass Estimation in a Subtropical Region. Remote Sens. 2018, 10, 627.
11. Liu, K.; Wang, J.; Zeng, W.; Song, J. Comparison and Evaluation of Three Methods for Estimating Forest above Ground Biomass Using TM and GLAS Data. Remote Sens. 2017, 9, 341.
12. Tien Dat, P.; Yokoya, N.; Xia, J.; Nam Thang, H.; Nga Nhu, L.; Thi Thu Trang, N.; Thi Huong, D.; Thuy Thi Phuong, V.; Tien Duc, P.; Takeuchi, W. Comparison of Machine Learning Methods for Estimating Mangrove Above-Ground Biomass Using Multiple Source Remote Sensing Data in the Red River Delta Biosphere Reserve, Vietnam. Remote Sens. 2020, 12, 1334.
13. Bolon-Canedo, V.; Sanchez-Marono, N.; Alonso-Betanzos, A. Feature selection for high-dimensional data. Prog. Artif. Intell. 2016, 5, 65–75.
14. Li, Y.; Li, C.; Li, M.; Liu, Z. Influence of Variable Selection and Forest Type on Forest Aboveground Biomass Estimation Using Machine Learning Algorithms. Forests 2019, 10, 1073.
15. Fayad, I.; Baghdadi, N.; Guitet, S.; Bailly, J.-S.; Herault, B.; Gond, V.; El Hajj, M.; Dinh Ho Tong, M. Aboveground biomass mapping in French Guiana by combining remote sensing, forest inventories and environmental data. Int. J. Appl. Earth Obs. Geoinf. 2016, 52, 502–514.
16. Cao, L.; Pan, J.; Li, R.; Li, J.; Li, Z. Integrating Airborne LiDAR and Optical Data to Estimate Forest Aboveground Biomass in Arid and Semi-Arid Regions of China. Remote Sens. 2018, 10, 532.
17. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
18. Sang, M.; Xiao, H.; Jin, Z.; He, J.; Wang, N.; Wang, W. Improved Mapping of Regional Forest Heights by Combining Denoise and LightGBM Method. Remote Sens. 2023, 15, 5436.
19. Hilbert, C.; Schmullius, C. Influence of Surface Topography on ICESat/GLAS Forest Height Estimation and Waveform Shape. Remote Sens. 2012, 4, 2210–2235.
20. Wang, J.Y.; Ju, K.J.; Fu, H.E.; Chang, X.X.; He, H.Y. Study on biomass of water conservation forest on North Slope of Qilian Mountains. J. For. Environ. 1998, 18, 319–323.
21. Abshire, J.B.; Sun, X.L.; Riris, H.; Sirota, J.M.; McGarry, J.F.; Palm, S.; Yi, D.H.; Liiva, P. Geoscience Laser Altimeter System (GLAS) on the ICESat mission: On-orbit measurement performance. Geophys. Res. Lett. 2005, 32, L21S02.
22. Park, T.; Kennedy, R.E.; Choi, S.; Wu, J.; Lefsky, M.A.; Bi, J.; Mantooth, J.A.; Myneni, R.B.; Knyazikhin, Y. Application of Physically-Based Slope Correction for Maximum Forest Canopy Height Estimation Using Waveform Lidar across Different Footprint Sizes and Locations: Tests on LVIS and GLAS. Remote Sens. 2014, 6, 6566–6586.
23. Chi, H.; Sun, G.; Huang, J.; Guo, Z.; Ni, W.; Fu, A. National Forest Aboveground Biomass Mapping from ICESat/GLAS Data and MODIS Imagery in China. Remote Sens. 2015, 7, 5534–5564.
24. Chen, Q. Retrieving vegetation height of forests and woodlands over mountainous areas in the Pacific Coast region using satellite laser altimetry. Remote Sens. Environ. 2010, 114, 1610–1627.
25. Garcia, M.; Popescu, S.; Riano, D.; Zhao, K.; Neuenschwander, A.; Agca, M.; Chuvieco, E. Characterization of canopy fuels using ICESat/GLAS data. Remote Sens. Environ. 2012, 123, 81–89.
26. Teillet, P.M.; Guindon, B.; Goodenough, D.G. On the slope-aspect correction of multispectral scanner data. Can. J. Remote Sens. 1982, 8, 84–106.
27. Turgut, R.; Gunlu, A. Estimating aboveground biomass using Landsat 8 OLI satellite image in pure Crimean pine stands: A case from Turkey. Geocarto Int. 2022, 37, 720–734.
28. Rouse, J.W., Jr.; Haas, R.H.; Deering, D.; Schell, J.; Harlan, J.C. Monitoring the Vernal Advancement and Retrogradation (Green Wave Effect) of Natural Vegetation; Final Report; NASA Goddard Space Flight Center: Greenbelt, MD, USA, 1974.
29. Pearson, R.L.; Miller, L.D. Remote mapping of standing crop biomass for estimation of the productivity of the shortgrass prairie, Pawnee National Grasslands, Colorado. In Proceedings of the Eighth International Symposium on Remote Sensing of Environment, Ann Arbor, MI, USA, 2–6 October 1972.
30. Srestasathiern, P.; Rakwatin, P. Oil Palm Tree Detection with High Resolution Multi-Spectral Satellite Imagery. Remote Sens. 2014, 6, 9749–9774.
31. Huete, A.R. A soil-adjusted vegetation index (SAVI). Remote Sens. Environ. 1988, 25, 295–309.
32. Liu, H.Q.; Huete, A. A Feedback Based Modification of the NDVI to Minimize Canopy Background and Atmospheric Noise. IEEE Trans. Geosci. Remote Sens. 1995, 33, 814.
33. Zhong, B.; Yang, A.; Nie, A.; Yao, Y.; Zhang, H.; Wu, S.; Liu, Q. Finer Resolution Land-Cover Mapping Using Multiple Classifiers and Multisource Remotely Sensed Data in the Heihe River Basin. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 4973–4992.
34. Fick, S.E.; Hijmans, R.J. WorldClim 2: New 1-km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 2017, 37, 4302–4315.
35. Hayashi, M.; Saigusa, N.; Oguma, H.; Yamagata, Y. Forest canopy height estimation using ICESat/GLAS data and error factor analysis in Hokkaido, Japan. ISPRS J. Photogramm. Remote Sens. 2013, 81, 12–18.
36. Hu, K.; Liu, Q.; Pang, Y.; Li, M.; Mu, X. Forest canopy height estimation based on ICESat/GLAS data by airborne lidar. Trans. Chin. Soc. Agric. Eng. 2017, 33, 88–95.
37. Fan, J.Y.; Pan, J.Y. Convergence properties of a self-adaptive Levenberg-Marquardt algorithm under local error bound condition. Comput. Optim. Appl. 2006, 34, 47–62.
38. Jiang, F.; Sun, H.; Ma, K.; Fu, L.; Tang, J. Improving aboveground biomass estimation of natural forests on the Tibetan Plateau using spaceborne LiDAR and machine learning algorithms. Ecol. Indic. 2022, 143, 109365.
39. Radivojac, P.; Chawla, N.V.; Dunker, A.K.; Obradovic, Z. Classification and knowledge discovery in protein databases. J. Biomed. Inform. 2004, 37, 224–239.
40. Sun, H.; Hu, X. Attribute selection for decision tree learning with class constraint. Chemom. Intell. Lab. Syst. 2017, 163, 16–23.
41. Chen, C.; Zhang, Q.; Yu, B.; Yu, Z.; Lawrence, P.J.; Ma, Q.; Zhang, Y. Improving protein-protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier. Comput. Biol. Med. 2020, 123, 103899.
42. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
43. Sheridan, R.P.; Wang, W.M.; Liaw, A.; Ma, J.; Gifford, E.M. Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2016, 56, 2353–2360.
44. Machado, M.R.; Karray, S.; de Sousa, I.T. LightGBM: An Effective Decision Tree Gradient Boosting Method to Predict Customer Loyalty in the Finance Industry. In Proceedings of the 14th International Conference on Computer Science and Education (ICCSE 2019), Toronto, ON, Canada, 19–21 August 2019; pp. 1111–1116.
45. Chen, P.H.; Lin, C.J.; Schölkopf, B. A tutorial on support vector machines. Appl. Stoch. Models Bus. Ind. 2005, 21, 111–136.
46. Zhao, K.; Popescu, S.; Meng, X.; Pang, Y.; Agca, M. Characterizing forest canopy structure with lidar composite metrics and machine learning. Remote Sens. Environ. 2011, 115, 1978–1996.
47. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
48. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
49. Zhang, J.; Huang, S.; Hogg, E.H.; Lieffers, V.; Qin, Y.; He, F. Estimating spatial variation in Alberta forest biomass from a combination of forest inventory and remote sensing data. Biogeosciences 2014, 11, 2793–2808.
50. Wu, C.; Shen, H.; Shen, A.; Deng, J.; Gan, M.; Zhu, J.; Xu, H.; Wang, K. Comparison of machine-learning methods for above-ground biomass estimation based on Landsat imagery. J. Appl. Remote Sens. 2016, 10, 035010.
51. Tien Dat, P.; Nga Nhu, L.; Nam Thang, H.; Luong Viet, N.; Xia, J.; Yokoya, N.; Tu Trong, T.; Hong Xuan, T.; Lap Quoc, K.; Takeuchi, W. Estimating Mangrove Above-Ground Biomass Using Extreme Gradient Boosting Decision Trees Algorithm with Fused Sentinel-2 and ALOS-2 PALSAR-2 Data in Can Gio Biosphere Reserve, Vietnam. Remote Sens. 2020, 12, 777.
52. Luo, M.; Wang, Y.; Xie, Y.; Zhou, L.; Qiao, J.; Qiu, S.; Sun, Y. Combination of Feature Selection and CatBoost for Prediction: The First Application to the Estimation of Aboveground Biomass. Forests 2021, 12, 216.
53. Ye, Q.; Yu, S.; Liu, J.; Zhao, Q.; Zhao, Z. Aboveground biomass estimation of black locust planted forests with aspect variable using machine learning regression algorithms. Ecol. Indic. 2021, 129, 107948.
54. Wai, P.; Su, H.; Li, M. Estimating Aboveground Biomass of Two Different Forest Types in Myanmar from Sentinel-2 Data with Machine Learning and Geostatistical Algorithms. Remote Sens. 2022, 14, 2146.
55. Uniyal, S.; Purohit, S.; Chaurasia, K.; Amminedu, E.; Rao, S.S. Quantification of carbon sequestration by urban forest using Landsat 8 OLI and machine learning algorithms in Jodhpur, India. Urban For. Urban Green. 2022, 67, 127445.
56. Rana, P.; Popescu, S.; Tolvanen, A.; Gautam, B.; Srinivasan, S.; Tokola, T. Estimation of tropical forest aboveground biomass in Nepal using multiple remotely sensed data and deep learning. Int. J. Remote Sens. 2023, 44, 5147–5171.
57. Wang, Z.; Yi, L.; Xu, W.; Zheng, X.; Xiong, S.; Bao, A. Integration of UAV and GF-2 Optical Data for Estimating Aboveground Biomass in Spruce Plantations in Qinghai, China. Sustainability 2023, 15, 9700.
58. Zhang, Y.; Ma, J.; Liang, S.; Li, X.; Li, M. An Evaluation of Eight Machine Learning Regression Algorithms for Forest Aboveground Biomass Estimation from Multiple Satellite Data Products. Remote Sens. 2020, 12, 4015.
59. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
60. Bengio, Y.; Delalleau, O. On the Expressive Power of Deep Architectures. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT 2011), Espoo, Finland, 5–7 October 2011; pp. 18–36.
61. Ghosh, S.M.; Behera, M.D. Aboveground biomass estimates of tropical mangrove forest using Sentinel-1 SAR coherence data-The superiority of deep learning over a semi-empirical model. Comput. Geosci. 2021, 150, 104737.
62. Narine, L.L.; Popescu, S.C.; Malambo, L. Synergy of ICESat-2 and Landsat for Mapping Forest Aboveground Biomass with Deep Learning. Remote Sens. 2019, 11, 1503.
63. Yu, X.; Ge, H.; Lu, D.; Zhang, M.; Lai, Z.; Yao, R. Comparative Study on Variable Selection Approaches in Establishment of Remote Sensing Model for Forest Biomass Estimation. Remote Sens. 2019, 11, 1437.
64. Roy, D.P.; Wulder, M.A.; Loveland, T.R.; Woodcock, C.E.; Allen, R.G.; Anderson, M.C.; Helder, D.; Irons, J.R.; Johnson, D.M.; Kennedy, R.; et al. Landsat-8: Science and product vision for terrestrial global change research. Remote Sens. Environ. 2014, 145, 154–172.
65. Chrysafis, I.; Mallinis, G.; Gitas, I.; Tsakiri-Strati, M. Estimating Mediterranean forest parameters using multi seasonal Landsat 8 OLI imagery and an ensemble learning method. Remote Sens. Environ. 2017, 199, 154–166.
66. Han, H.; Wan, R.; Li, B. Estimating Forest Aboveground Biomass Using Gaofen-1 Images, Sentinel-1 Images, and Machine Learning Algorithms: A Case Study of the Dabie Mountain Region, China. Remote Sens. 2022, 14, 176.
67. Strunk, J.; Temesgen, H.; Andersen, H.-E.; Flewelling, J.P.; Madsen, L. Effects of lidar pulse density and sample size on a model-assisted approach to estimate forest inventory variables. Can. J. Remote Sens. 2012, 38, 644–654.
68. Paine, C.E.T.; Baraloto, C.; Diaz, S. Optimal strategies for sampling functional traits in species-rich forests. Funct. Ecol. 2015, 29, 1325–1331.
69. Milenkovic, M.; Schnell, S.; Holmgren, J.; Ressl, C.; Lindberg, E.; Hollaus, M.; Pfeifer, N.; Olsson, H. Influence of footprint size and geolocation error on the precision of forest biomass estimates from space-borne waveform LiDAR. Remote Sens. Environ. 2017, 200, 74–88.
70. Tian, X.; Yan, M.; van der Tol, C.; Li, Z.; Su, Z.; Chen, E.; Li, X.; Li, L.; Wang, X.; Pan, X.; et al. Modeling forest above-ground biomass dynamics using multi-source data and incorporated models: A case study over the Qilian Mountains. Agric. For. Meteorol. 2017, 246, 1–14.
71. Xu, N.; Ma, X.; Ma, Y.; Zhao, P.; Yang, J.; Wang, X.H. Deriving Highly Accurate Shallow Water Bathymetry From Sentinel-2 and ICESat-2 Datasets by a Multitemporal Stacking Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6677–6685.

Figure 1. (a) Distribution of field survey plots and GLAS footprints across the study area. Representative surveyed forests are shown in (b) a Qilian Mountains forest landscape and (c) a Picea crassifolia forest.

Figure 2. Land cover map for the study area.

Figure 3. Workflow used in this study to estimate boreal forest AGB.

Figure 4. Effect of topographic correction of GLAS-derived canopy heights based on different slope gradients.

Figure 5. Frequency distribution of variables in the initial XGBoost run.

Figure 6. Cross-validation results for each ML model (the different colors represent distinct models corresponding to the X-axis).

Figure 7. Comparison of raster ratios across different value domains in predicted AGB maps and training data.

Figure 8. Independent validation results for each ML model.

Figure 9. Distribution of residuals of AGB maps estimated by each model for (a) different value ranges and (b) different elevations (the red line represents no difference between the model’s predicted values and the observed values).

Figure 10. Distribution of forest AGB across the study area.

Table 1. Information on variables from optical remote sensing data.

Source	Label	Description
Original Band	Band 2	Blue (B)
	Band 3	Green (G)
	Band 4	Red (R)
	Band 5	Near Infrared (NIR)
	Band 6	Shortwave Infrared (SWIR1)
	Band 7	Shortwave Infrared (SWIR2)
Vegetation Indices (VIs)	NDVI [28]	$(NIR - R) / (NIR + R)$
	SR [29]	$NIR / R$
	TVI [30]	$\sqrt{NDVI + 1}$
	SAVI [31]	$(NIR - R) \times (1 + L) / (NIR + R + L)$ , L = 0.5
	EVI [32]	$2.5 \times (NIR - R) / (NIR + 6 \times R - 7.5 \times B + 1)$
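
The indices in Table 1 can be computed per pixel from three reflectance bands. The following is a minimal sketch, not the authors' processing code: the function name `vegetation_indices` and the sample reflectance values are hypothetical, NDVI is taken in its standard normalized-difference form (NIR − R)/(NIR + R), and TVI follows the expression as printed in Table 1. Inputs are assumed to be surface reflectance in the range 0 to 1, with Band 2 as blue, Band 4 as red, and Band 5 as NIR.

```python
import math


def vegetation_indices(blue, red, nir, soil_factor=0.5):
    """Return the Table 1 vegetation indices for one pixel.

    blue, red, nir: reflectance of Landsat 8 OLI Bands 2, 4, and 5 (assumed 0-1).
    soil_factor: the SAVI soil adjustment factor L (0.5 in Table 1).
    """
    ndvi = (nir - red) / (nir + red)
    return {
        "NDVI": ndvi,
        "SR": nir / red,
        "TVI": math.sqrt(ndvi + 1),  # as printed in Table 1
        "SAVI": (nir - red) * (1 + soil_factor) / (nir + red + soil_factor),
        "EVI": 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1),
    }


# Hypothetical reflectance values for a densely vegetated pixel.
indices = vegetation_indices(blue=0.03, red=0.05, nir=0.35)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

The function works on scalars for clarity; applying it to whole rasters would need an array library and handling of zero denominators, which this sketch leaves out.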

Table 2. The accuracy of the classification for the land cover map in the study area.

Land-Use Type		Ground Truth: Cropland (%)	Ground Truth: Forest (%)	Ground Truth: Grassland (%)	Ground Truth: Water (%)	Ground Truth: Bare Land (%)	Producer Accuracy (%)
Land cover map (%)	Cropland	21.24	0.00	2.26	0.26	0.00	21.24
	Forest	3.15	95.34	36.28	0.00	0.67	95.34
	Grassland	75.61	3.76	61.46	6.15	1.72	61.46
	Water	0.00	0.06	0.00	69.31	0.00	69.31
	Bare Land	0.00	0.84	0.00	24.28	97.61	97.61
User accuracy (%)		61.59	93.53	58.39	98.69	95.99	NA

Table 3. Optimal hyperparameters of each machine learning (ML) model used in this study.

Algorithm	Learning Rate	Min_Samples_Leaf/Min_Child_Weight	Gamma	Max_Depth/Max_Features	n_Estimators/n_Iterations/C Value
XGBoost	1	1	0	6	100
LightGBM	2	20	NA	6	200
SVR	0.1	NA	1000	NA	1
RF	NA	1	NA	15	50

Table 4. Performance of regression models for different elevation gradients.

Elevation Gradient	Model	p-Value	R²	Adjusted R²	Fitted Equation*
1770–2770 m	Quadratic	0.004	0.67	0.60	Y = −204.42 + 59.52X − 2.62X²
2770–3770 m	Cubic	0.00004	0.69	0.66	Y = 6.97X + 0.46X² − 0.02X³ + 6.48
3770–4770 m	Cubic	0.00001	0.78	0.73	Y = 23.79X − 3.28X² + 0.19X³ − 11.22

*Y represents the GLAS footprint AGB, and X represents the GLAS-derived canopy height.
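
The three fitted equations in Table 4 amount to a piecewise function of canopy height, selected by elevation. The sketch below shows one way to evaluate it. The function name `glas_footprint_agb` is hypothetical, and two points are assumptions rather than statements from the table: canopy height is taken to be in metres with AGB in Mg/ha, and the shared boundary elevations (2770 m and 3770 m) are assigned to the upper band.

```python
def glas_footprint_agb(canopy_height_m, elevation_m):
    """Evaluate the Table 4 equation for the elevation band of a footprint.

    canopy_height_m: GLAS-derived canopy height X (assumed metres).
    elevation_m: footprint elevation used to pick the band (metres).
    Returns the estimated footprint AGB Y (assumed Mg/ha).
    """
    x = canopy_height_m
    if 1770 <= elevation_m < 2770:
        # Quadratic model for the lowest band.
        return -204.42 + 59.52 * x - 2.62 * x**2
    if 2770 <= elevation_m < 3770:
        # Cubic model for the middle band.
        return 6.48 + 6.97 * x + 0.46 * x**2 - 0.02 * x**3
    if 3770 <= elevation_m <= 4770:
        # Cubic model for the highest band.
        return -11.22 + 23.79 * x - 3.28 * x**2 + 0.19 * x**3
    raise ValueError("elevation outside the 1770-4770 m range covered by Table 4")


# Hypothetical footprint: 10 m canopy height at 3000 m elevation.
print(round(glas_footprint_agb(10.0, 3000.0), 2))
```

Because these are empirical polynomial fits, they should only be evaluated within the range of canopy heights observed in each band; outside it, the quadratic and cubic terms can produce implausible or negative values.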

Table 5. Summary of AGB estimation results for 106 GLAS footprints in forested areas (Mg/ha).

Elevation Gradient	Number	Minimum	Maximum	Median	Mean	Standard Deviation
1770–2770 m	12	64.28	176.14	78.57	91.67	38.04
2770–3770 m	90	18.15	184.09	76.54	81.46	34.25
3770–4770 m	4	11.09	64.78	48.72	43.11	19.91
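
As a consistency check on Table 5, the per-band footprint counts can be summed and the band means combined into a count-weighted overall mean. This is a small sketch using only the values printed in the table; the name `pooled_mean` is hypothetical, and the result is approximate because the band means are rounded to two decimals.

```python
# (number of footprints, mean AGB in Mg/ha) for each elevation band in Table 5.
bands = [
    (12, 91.67),  # 1770-2770 m
    (90, 81.46),  # 2770-3770 m
    (4, 43.11),   # 3770-4770 m
]


def pooled_mean(groups):
    """Return the count-weighted mean of (count, mean) pairs."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total


total_footprints = sum(count for count, _ in bands)
print(total_footprints)              # matches the 106 footprints in the caption
print(round(pooled_mean(bands), 2))  # overall mean AGB across all footprints
```

The weighting matters here because 90 of the footprints fall in the 2770–3770 m band, so the overall mean sits close to that band's mean rather than to the simple average of the three band means.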

Table 6. Details of variables utilized for modeling forest AGB estimates.

Dataset	Variables
Original Band	Band 4, Band 5, Band 6, Band 7
VIs	EVI
WorldClim	Bio1, Bio2, Bio3, Bio4, Bio5, Bio6, Bio7, Bio11, Bio12, Bio13, Bio15, Bio16, Bio17
DEM	Elevation, Slope, Aspect
Soil	Soil Texture

Table 7. Recent studies on estimating forest AGB using different ML algorithms.

No.	Modeling Approach	Data Sources	Forest Type	Optimal Model	R²	RMSE (Mg/ha)	Year	Reference
1	Non-spatial and spatial regression models; Spatial interpolation; Random Forest (RF)	Forest inventory data; ICESat/GLAS; Climatic variables; Elevation	Boreal forests	RF	0.62	47.03	2014	[49]
2	Support Vector Regression (SVR); k-Nearest Neighbor (kNN); Stepwise Linear Regression (SLR); Random Forest (RF); Stochastic Gradient Boosting (SGB)	Field survey data; Landsat 5 TM	Subtropical forests	RF	0.63	26.44	2016	[50]
3	Random Forest (RF); Support Vector Regression (SVR)	Forest inventory data; Landsat 8 OLI; Climatic variables; Topographic variables	Temperate forests	SVR	0.8	8.20	2020	[7]
4	Extreme Gradient Boosting (XGBoost); Support Vector Regression (SVR); Gradient Boosting Regression (GBR); Random Forest (RF); Gaussian Process Regression (GPR)	Field survey data; Sentinel-2 MSI; ALOS-2 PALSAR-2	Mangrove	XGBoost	0.805	28.13	2020	[51]
5	Random Forest Regression (RFR); Extreme Gradient Boosting (XGBoost); Categorical Boosting (CatBoost)	National Forest Continuous Inventory (NFCI) data; Landsat 8 OLI; Topographic variables; Canopy density	Temperate forests	CatBoost	0.73	25.77	2021	[52]
6	Random Forest (RF); Support Vector Regression (SVR); Artificial Neural Network (ANN)	Field survey data; Landsat 8 OLI; Aspect	Planted forests	RF	0.8519	12.552	2021	[53]
7	Kriging algorithms; Stochastic Gradient Boosting (SGB); Random Forest (RF)	Field survey data; Sentinel-2 image; Terrain factors (elevation, slope, aspect)	Tropical forests	RF-based ordinary Kriging	0.47	24.91	2022	[54]
8	Random Forest (RF); Support Vector Regression (SVR); Extreme Gradient Boosting (XGBoost); k-Nearest Neighbor (kNN)	Field survey data; Landsat 8 OLI	Urban forests	XGBoost	0.89	14.08	2022	[55]
9	Random Forest (RF); Stacked Autoencoder (SAE) network; Extremely Randomized Trees (ERT); Weighted Least Squares (WLS)	Field survey data; Airborne Laser Scanning (ALS); RapidEye satellite image; Landsat 5 TM	Tropical forests	SAE	0.8	54.01	2023	[56]
10	Random Forest (RF); Support Vector Regression (SVR); Ordinary Least-Squares (OLS); Artificial Neural Network (ANN)	Unmanned Aerial Vehicle (UAV); GF-2 image	Planted forests	RF	0.86	1.75	2023	[57]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

