1. Introduction
Forest growing stock volume is a fundamental data source for estimating forest biomass and carbon sequestration [1,2]. It is a vital indicator of forest quality and a key parameter reflecting the effectiveness of forest and resource management [3].
An important method for assessing forest growing stock volume is the Continuous Forest Inventory [4], which in China is the most reliable means of forest resource inventory. The process begins with preliminary sampling on a kilometer grid that takes into account the distribution of forest resources within each province and the topographical conditions. Clear precision requirements for sampling are established in advance [5,6,7]. Moreover, the sample plots are re-surveyed every 5 years, with fixed sample plots expected to maintain a reset rate of over 98% and fixed sample trees a reset rate of over 95% [8]. The traditional forest growing stock survey method is therefore objective and precise. However, it has drawbacks, including lengthy survey periods, challenging fieldwork, and high investigation costs.
The rapid advancement of remote sensing technology has prompted a shift in forest growing stock surveys away from traditional manual ground surveys and towards remote-sensing-based estimation methods [9,10,11]. This approach uses image-scanning equipment to capture remote sensing data within the study area [12]. Additionally, Geographic Information System (GIS) technology is employed to gather terrain-related factors [13], compute feature parameters such as forest type information and topographic factors [14], and construct forest growing stock estimation models using both linear and nonlinear modeling techniques [15]. However, judged by their reported statistical indices, the precision of these models is not very high.
Since the 1980s, numerous researchers have explored the connection between optical remote sensing data and forest growing stock [11,16]. They found that the reflectance values from the red (TM3) and near-infrared (TM4) bands of Landsat TM data can be combined to form a vegetation index [17], allowing community characteristics to be estimated. Moreover, forest growing stock can be estimated by establishing a regression relationship between ground survey data and spectral values [18,19]. This relationship was subsequently extended to the AVHRR1 and AVHRR2 bands of NOAA/AVHRR data [20]. As research progressed, it became evident that the reflectance values from SPOT and Landsat TM were negatively correlated with forest wood volume, especially in the near-infrared bands [21]. Building on this insight, some scholars extracted texture features from Landsat TM images and conducted correlation analyses with forests of varying ages to derive the spectral change characteristics associated with different forest ages, which allowed them to estimate forest growing stock more effectively [22]. However, this approach can only estimate the growing stock volume for a specific tree species.
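As an illustration of combining the red and near-infrared bands into a vegetation index, the sketch below computes the normalized difference vegetation index (NDVI). NDVI is one common vegetation index and is assumed here purely for illustration; the function name and the sample reflectance values are hypothetical and are not taken from the cited studies.

```python
def ndvi(red, nir):
    """Normalized difference vegetation index for a single pixel.

    red, nir: surface reflectance in the red (e.g., TM3) and
    near-infrared (e.g., TM4) bands, given as floats in [0, 1].
    """
    total = nir + red
    if total == 0:
        # The index is undefined when both bands are zero;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (nir - red) / total


# Hypothetical (red, nir) reflectance pairs, not real measurements.
pixels = [(0.05, 0.45), (0.10, 0.30), (0.20, 0.22)]
values = [ndvi(red, nir) for red, nir in pixels]
```

Denser, healthier vegetation typically reflects less red and more near-infrared light, so the first hypothetical pixel yields the highest index value and the last the lowest.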
Optical remote sensing struggles to capture the vertical structural characteristics of forests and is susceptible to cloud cover [23]. Relying on a single remote sensing data source therefore often falls short in accurately estimating forest growing stock. Integrating multiple data sources has consequently emerged as a solution that leverages their complementary strengths and improves estimation precision [24]. In the early 21st century, scholars investigated remote sensing estimation models that made use of multi-source data [25]. By combining remote sensing data with terrain factors and applying a linear regression model, they achieved a significant improvement in the precision of forest growing stock estimation [26]. Over the past decade, as remote sensing technology has continued to advance, high-resolution data, such as ground laser scanning, hyperspectral imagery, and unmanned aerial vehicle (UAV) data, have been widely adopted and tested [27]. These high-resolution datasets provide more precise information on terrain, ground features, and land cover types. By combining multi-source high-resolution images with machine learning models, scholars have achieved even greater precision in forest growing stock estimation [28]. However, hyperspectral and high-resolution imagery is difficult to obtain, which makes this approach unsuitable for large-scale geographic analysis [29].
To improve the precision of forest growing stock volume estimation, researchers have focused on identifying more precise machine learning models [30,31,32,33]. These models can be divided into two main groups: parametric models and non-parametric models [34]. Parametric models generally assume that the data follow a specific distribution characterized by a set of parameters, which form the basis of the model construction. They can be built using both linear and nonlinear approaches [35]. They are typically simple and straightforward to explain, but their inherent assumptions carry a risk of underfitting. In contrast, non-parametric models are constructed by fitting the training data without imposing strict constraints on the form of the objective function [36]. They tend to fit the data well but may have complex and less interpretable structures [32]. Among the machine learning models, the random forest model stands out as one of the most widely used and most accurate in forest growing stock volume estimation [37].
AdaBoost (Adaptive Boosting), an improved version of the boosting algorithm, employs forward stagewise additive modeling to construct an ensemble model [38]. During each iteration, AdaBoost adds a base classifier to the model, focusing on the error between the model’s predictions and the actual label values. This incremental process aims to gradually reduce the model’s deviation from the true values. AdaBoost achieves this by adjusting the weights assigned to the samples: it increases the weights of samples misclassified by the previous base classifier and decreases the weights of correctly classified samples. The subsequent base classifier is then trained with these updated weights. Iteration continues, with a new weak classifier added to the ensemble each time, until either a predetermined, sufficiently low error rate is achieved or a specified maximum number of iterations is reached, at which point the final strong classifier is obtained. The AdaBoost algorithm is known for significantly improving prediction precision, and it performs particularly well when the weak classifiers used within it have higher precision [39]. However, to our knowledge, the AdaBoost algorithm has not yet been applied to forest growing stock volume estimation.
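The sample reweighting step described above can be sketched as follows. This is a minimal illustration of the classic binary-classification form of AdaBoost, not the regression variant used later in this study, whose details may differ; the function name `update_weights` and the five-sample example are hypothetical.

```python
import math


def update_weights(weights, misclassified):
    """One AdaBoost reweighting step for binary classification.

    weights: current sample weights, assumed to sum to 1.
    misclassified: booleans, True where the latest weak classifier erred.
    Returns (alpha, new_weights), where alpha is the weak classifier's
    weight in the ensemble and new_weights again sum to 1.
    """
    # Weighted error of the latest weak classifier.
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")

    # Classifiers with lower weighted error receive a larger alpha.
    alpha = 0.5 * math.log((1.0 - err) / err)

    # Raise the weights of misclassified samples, lower the others.
    raw = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]

    # Renormalize so the weights form a distribution again.
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Hypothetical example: five equally weighted samples, one misclassified.
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = update_weights(weights, misclassified)
```

In this form of the update, the misclassified samples together carry half of the total weight after renormalization, which forces the next weak classifier to concentrate on them.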
Considering the strengths of the AdaBoost algorithm and the role of multi-source data in forest growing stock volume estimation, we employed several data sources, including Landsat remote sensing data, a Digital Elevation Model (DEM), and Continuous Forest Inventory data. From these datasets, we extracted vegetation indices, elevation, and selected survey factors as model features. We built an AdaBoost model with Random Forest as the weak learner (RF-Adaboost) for estimating growing stock volume in the study area. For comparison, we also established standalone Random Forest and AdaBoost models. We then compared the three models under different data schemes. The RF-Adaboost model consistently outperformed the others and proved universal, without the need to differentiate specific tree species.
4. Discussion
The primary objective of this study was to propose a universal model for estimating forest growing stock volume, a crucial indicator for evaluating forest quality. A predictive model based on continuous inventory data of national forest resources and remote sensing data is particularly significant for Yunnan, where forest cover is extensive, the terrain is complex, and forestry survey tasks are demanding. A growing stock volume estimation model based on partial continuous inventory data and remote sensing data could therefore greatly enhance survey efficiency and reduce the risks associated with forestry investigations.
Overall, our findings support several conclusions about forest growing stock volume estimation: (1) The RF-Adaboost model performs strongly in estimating forest growing stock volume. Regardless of the data scheme, it outperforms the other machine learning models we tested. This consistency underscores its robustness and reliability, making it a valuable tool for forest resource assessment in varying geographical and environmental contexts. (2) Incorporating multi-source features has a significant positive impact on model performance. Combining data from various sources not only enhances the predictive precision of the model but also improves its robustness against variations in input data. (3) Compared to the other machine learning models, our approach distributes importance more evenly across features. It makes fuller use of the different features and thereby reduces the model’s dependence on any single feature. (4) RF-Adaboost does not require strict differentiation between tree species, only between tree species structures, making it more versatile and universal.
The RF-Adaboost model proposed in this study consistently outperforms the traditional Random Forest and Adaboost models across the data schemes and on multiple evaluation metrics. This improvement can be attributed to an inherent limitation of traditional Random Forest models, which rely on a regression algorithm that gives every decision tree equal weight. This uniform weighting makes the model vulnerable to outliers and reduces its universality. Specifically, the traditional Random Forest model excels in estimating the growing stock volume of a specific tree species in a small local area but may falter when applied to different regions and various tree species. To address these limitations, the RF-Adaboost model integrates the Adaboost algorithm, which assigns different weights to the weak learners and to the samples at each iteration. This adaptive weighting strategy mitigates the negative impact of outliers, enhancing the model’s precision and overall performance.
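The contrast between equal weighting and learner-specific weighting can be sketched as follows. This section does not specify the exact combination rule of the RF-Adaboost model, so the example assumes the weighted median used by the AdaBoost.R2 regression variant; the function names, predictions, and learner weights are all hypothetical.

```python
def equal_weight_mean(predictions):
    """Combine learner predictions with equal weights (a plain average)."""
    return sum(predictions) / len(predictions)


def weighted_median(predictions, learner_weights):
    """Combine learner predictions by weighted median.

    Returns the smallest prediction at which the cumulative learner
    weight reaches half of the total weight.
    """
    pairs = sorted(zip(predictions, learner_weights))
    half = 0.5 * sum(learner_weights)
    running = 0.0
    for prediction, weight in pairs:
        running += weight
        if running >= half:
            return prediction
    return pairs[-1][0]


# Hypothetical predictions from three weak learners for one plot.
# The third learner is inaccurate and therefore has a small weight.
predictions = [110.0, 120.0, 300.0]
learner_weights = [1.2, 1.0, 0.1]

plain = equal_weight_mean(predictions)
combined = weighted_median(predictions, learner_weights)
```

With these hypothetical values, the plain average is pulled towards the outlying prediction of 300, whereas the weighted median stays at 110 because the outlying learner carries little weight.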
In terms of data schemes, combining ground survey data with remote sensing data yields higher precision in predicting forest growing stock volume than data from a single source. We also observed that the model’s precision is lowest when using only Landsat remote sensing data. We attribute this to the large temporal and spatial scale of remote sensing data, along with issues such as cloud cover, which prevent the imagery from accurately reflecting the current vegetation cover in the monitored area.
As for features, every model contains some features with extremely high importance and others with very low importance, but the RF-Adaboost model balances them better than the other models. It significantly reduces the importance of some features while increasing that of others, which reduces the model’s reliance on particular features and makes the model more stable.
To facilitate comparison with other studies, we used the coefficient of determination (R²) and the mean absolute percentage error (MAPE, %) as comparison metrics. These ratio-based indicators enable meaningful horizontal comparisons across different studies. We selected the most relevant and recent studies for comparison with our research (as shown in Table 8).
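For reference, the two comparison metrics can be computed as in the sketch below, which follows their standard definitions. The function names and the observed and predicted volumes are hypothetical and are not data from this study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mape_percent(observed, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every observed value is nonzero.
    """
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical plot-level volumes, not measurements from this study.
observed = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 210.0]

r2 = r_squared(observed, predicted)
mape = mape_percent(observed, predicted)
```

Both metrics are unitless ratios, which is why they can be compared across studies that differ in study area and measurement scale.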
When applied to Data Scheme C, our RF-Adaboost model achieves notable improvements in both R² and MAPE compared to previous studies. It attains a significantly higher R² than Mauya (2019) [12], Ruyi Zhou (2018) [33], and Huajian Huang (2022) [56], which underscores its predictive capability. Furthermore, our study achieves a substantially lower MAPE than Ruyi Zhou and a slightly lower MAPE than Huajian Huang. Mauya and Jingjing Zhou did not report MAPE values, precluding direct comparisons.
Nonetheless, our study’s R² is marginally lower than the R² of 0.82 reported by Jingjing Zhou (2020) [31]. When we applied the Random Forest method used in Jingjing Zhou’s study, with vegetation indices calculated from Landsat data and no tree species distinction, the obtained R² was only 0.43, significantly lower than the 0.82 reported in that study. This divergence may be attributed to three main factors. (1) Variation in remote sensing image precision: The SPOT6 satellite imagery used in Jingjing Zhou’s study has a much higher spatial resolution than the Landsat imagery used in our study. (2) Study scale and complexity: Jingjing Zhou’s (2020) [31] study focused on Taizi Mountain in Jingshan County, China, which has a smaller geographic area and exhibits less variability in topographic and climatic conditions, so the range and complexity of these variables were more limited. In contrast, our study covers the entire Yunnan Province, a broader and more diverse geographic region with a multitude of complex factors influencing forest growing stock volume. This diversity may contribute to lower predictive precision. (3) Tree species structure: Jingjing Zhou’s (2020) [31] research primarily centered on Pinus massoniana plantations. In contrast, our study categorizes tree species into five major classes: coniferous pure forest, broad-leaved pure forest, coniferous relatively pure forest, broad-leaved relatively pure forest, and mixed coniferous and broad-leaved forest. This added complexity in tree species structure may have introduced greater variability and reduced predictive precision.
Yangyang Zhou (2023) [57], whose study bears the closest resemblance to ours, also employed remote sensing data and forest inventory data for estimating forest stock volume. They highlighted the optimal performance of the Random Forest model in forest stock volume estimation, achieving an R² of 0.776 when the remote sensing data source was Landsat 8, consistent with our findings (as shown in Table 6). However, our research, with improvements to the Adaboost model, attains an R² of 0.82, demonstrating that the RF-Adaboost model is more effective than the Random Forest model. Intriguingly, when Yangyang Zhou incorporated Sentinel-2 data, the R² reached 0.831, providing a promising direction for our subsequent research: combining the RF-Adaboost model with higher-resolution remote sensing data for stock volume estimation could lead to even more outstanding results.
During the research process, we also identified some issues that warrant further exploration. (1) Although the best-performing model was the RF-Adaboost model based on Data Scheme C, all models exhibited some variability in their predictions, with precision fluctuating slightly. One possible explanation is the presence of outliers in the dataset, but further investigation is needed to ascertain the cause and nature of this variability. (2) The RF-Adaboost model uses a more extensive feature set than both the Random Forest and Adaboost models. However, we did not perform feature selection within each data source, so the RF-Adaboost model’s operational efficiency is lower than that of its counterparts. Future research should explore feature selection methods to streamline the model’s feature set, enhancing efficiency without compromising predictive precision.