The Impact of Forest Management Inventory Factors on the Ecological Service Value of Forest Water Conservation Based on Machine Learning Algorithms

Chen, Zhefu; Lü, Yong; Liu, Yang; Chen, Duanlv; Peng, Baofa

doi:10.3390/f15081431


by Zhefu Chen ^1,2, Yong Lü ^1,*, Yang Liu ^3, Duanlv Chen ^2 and Baofa Peng ^2

^1 College of Forestry, Central South University of Forestry and Technology, Changsha 410004, China

^2 College of School of Geographical Sciences and Tourism, Hunan University of Arts and Science, Changde 415000, China

^3 Engineering Research Center for Smart Agricultural Machinery Beidou Navigation Adaptation Technology and Equipment of Hunan Province, Hunan Automotive Engineering Vocational University, Zhuzhou 412000, China

^* Author to whom correspondence should be addressed.

Forests 2024, 15(8), 1431; https://doi.org/10.3390/f15081431

Submission received: 3 July 2024 / Revised: 4 August 2024 / Accepted: 12 August 2024 / Published: 14 August 2024

(This article belongs to the Special Issue Water Cycle and Energy Balance Measurements in Forests)


Abstract:

Based on forest management inventory data, this study applies machine learning algorithms to explore the relationships between forest water conservation capacity and forest management inventory factors, thus providing more extensive insights into forest water conservation services. By integrating the InVEST model and machine learning algorithms, this study identifies the key factors related to water conservation services based on forest management inventory factors and investigates the differences in and accuracy of forest water conservation models using the random forest algorithm. The results are as follows: (1) The coefficients of determination (R²) of the three machine learning models range from 0.508 to 0.869, with root mean square errors (RMSEs) ranging from 28.380 to 69.339. The performance of these models is generally satisfactory, with the random forest algorithm showing superior results. (2) By leveraging the advantages of the three machine learning algorithms in handling categorical data, this study analyzes the contributions of forest management inventory factors, revealing the impact mechanisms of forest-type water conservation services. (3) The integration of machine learning algorithms allows for better processing of the scale and correlation of independent variables, providing more objective information on the main controlling factors of forest water conservation. (4) Predictions of water conservation capacity using machine learning are consistent with those of the InVEST model. Water conservation per unit area follows the order slow-growing broadleaf forests > shrub forests > middle-growing broadleaf forests > Cunninghamia lanceolata forests > fast-growing broadleaf forests > pine forests > bamboo forests. (5) Since this study considers only the factors available in the forest management inventory, which does not encompass all relevant influencing factors, it is difficult to fully address the complexities of how forest water conservation services interact with forest structure. Therefore, further research is needed to investigate the intrinsic mechanisms underlying the interactions between water conservation and forest management inventory factors.

Keywords: water conservation; machine learning algorithms; forest management inventory factors; InVEST model

1. Introduction

Water conservation, a key regulating service of forest ecosystems owing to their close relationship with water resources, has long been a prominent topic in ecosystem service research [1,2]. Understanding water conservation services is crucial for assessing forest ecosystem service functions [3]. Since the early 20th century, it has been widely recognized that forests are closely related to runoff and its distribution throughout the year [4]. Since the 1960s, research in China on water conservation services has mainly focused on the relationship between forest vegetation coverage and changes in basin runoff [5]. Various integrated models, including InVEST, are widely used for evaluating water conservation services because of their ease of data acquisition, flexible parameter adjustment, convenient operation, and visual expression [6,7,8]. However, they are often limited to assessments of regional forest vegetation. The spatial and temporal heterogeneity of water conservation services is believed to result from long-term natural environmental factors acting on forest ecosystems [9,10]. For instance, site factors influence forest ecosystem service functions by affecting stand factors [11,12]. However, quantification has mainly focused on specific forest vegetation types, with limited differentiation by dominant tree species [13,14], age group, and constriction [15,16,17]. This lack of specific delineation reduces the accuracy of water conservation service assessments to some extent. Although many scholars use the industry standard “Guidelines for Assessment of Forest Ecosystem Services (LY/T1721-2008)” [18] and forest management inventory results for value assessments [19], accurately understanding the spatial differentiation of water conservation services across different forest structures remains a challenging and widely debated topic [16].

China has established “technical regulations for inventory for forest management planning and design” [20]. Forest inventory factors are used to characterize the forest environment and stand structure characteristics [21,22,23]. Scholars have used different stand index factors to evaluate forest water conservation services [24,25,26], quantifying input features to predict these services through correlation analysis and regression models at various storage levels using methods such as multiple regression and stepwise regression [27]. Due to the complex nonlinear relationships and scale-dependent nature of inventory factors, comprising both continuous and categorical variables, traditional statistical regression models face limitations in estimation and struggle with multicollinearity, which impacts their predictive accuracy [28]. In recent years, machine learning methods have been introduced for screening relevant influencing variables and making predictions. These methods have been applied in areas such as forest type recognition [29], forest site factors [30], forest structure factor estimation [31,32,33], and the continuous processing of forest resource inventory data [34]. However, the direct use of classification data from forest resource inventory factors in model simulations is rare. Additionally, different machine learning algorithms show different performances in simulations. Random forest and eXtreme Gradient Boosting (XGBoost) algorithms can handle high-dimensional data and have strong generalization capabilities; however, the XGBoost algorithm has complex parameter settings. On the other hand, Lasso regression offers superior variable selection functions and performs well in ensemble learning by combining multiple learners [35]. It also excels in handling the scale and nonlinear relationships of forest inventory factors, enabling more flexible predictions.

Based on this background, this study focuses on the Dongting Lake area and uses the InVEST (integrated valuation of ecosystem services and tradeoffs) model to assess water conservation services [36]. It utilizes readily available forest inventory factor indicators, incorporating the characteristics of categorical data, and employs Lasso regression, random forest, and XGBoost algorithms separately to identify the key factors influencing forest water conservation services. Then, the optimal model is selected for a quantitative simulation of these services, thereby objectively reflecting the spatial differentiation of forest water conservation capacity and promoting further research in this field.

2. Materials and Methods

2.1. Study Area

Dongting Lake is located between 28°44′ and 29°35′ N and between 111°53′ and 113°05′ E in the northeastern part of Hunan Province, China (see Figure 1). It is the second largest freshwater lake in the country. It has a subtropical monsoon climate with distinct seasons and abundant rainfall. The average annual temperature is 17 °C, and the average annual precipitation is 1302 mm. The region’s forested area totals 2.1003 million hectares, comprising arboreal forests (1,414,520.14 hectares; 67.34%), shrub forests (353,710.27 hectares; 16.84%), and bamboo forests (332,069.73 hectares; 15.82%). The arboreal forests can be further divided into Cunninghamia lanceolata forests, pine forests, fast-growing broadleaf forests, middle-growing broadleaf forests, and slow-growing broadleaf forests, accounting for 36.33%, 21.85%, 3.81%, 24.78%, and 13.23%, respectively.

2.2. Indicator Selection and Data Sources

(1): The InVEST water conservation module includes the following main parameter inputs: terrain, annual precipitation, annual potential evapotranspiration, vegetation distribution, soil depth, root depth, available water content in plants, evaporation coefficient, and the Zhang coefficient. Vegetation data were acquired from secondary land-use divisions based on the natural attributes of forest resources derived from forest management inventories [37,38]. Rainfall data were obtained from meteorological stations and converted into raster format using the Kriging interpolation method; potential evapotranspiration was calculated using the Modified–Hargreaves method [39]. The Zhang coefficient, representing rainfall characteristic constants, used default values; the available water content in plants was calculated using a nonlinear fitting model for soil AWC estimation based on established soil texture and soil organic matter data [40]. The soil depth was interpolated based on forest inventory data, while the saturated hydraulic conductivity of the soil was calculated based on the soil particle composition data model [41]. Biophysical parameters were referenced from relevant domestic and international literature, as well as specific studies focused on the research area [42].
(2): According to the “Thirteenth Five-Year Plan” forest management inventory data in Hunan Province, there are seven forest types: Cunninghamia lanceolata forests, pine forests, fast-growing broadleaf forests, mid-growing broadleaf forests, slow-growing broadleaf forests, shrub forests, and bamboo forests (see Figure 2). Additionally, as per the “technical regulations for inventory for forest management planning and design” (GB/T 26424-2010) [18] in Hunan Province, small-scale inventory factors are classified into basic attributes, topography factors, soil factors, forest factors, scattered trees/side trees, spatial structures, and derivative factors. However, considering that many inventory factors are not relevant to specific operations, this study excludes those. The retained inventory factors are categorized into continuous and categorical data types. Among them, continuous variable data types include elevation (EL), thickness of the humus layer (TH), thickness of the litter (TL), soil thickness (ST), constriction/coverage (CN/CE), average DBH (ADBH), average height (AH), number of plants per hectare (NP), and volume per hectare (VH); categorical variable data types include dominant tree species, the natural renewal level (NL), parent material (PM), reachability (RE), topography (TO), slope direction (SD), slope position (SP), slope (SL), soil type (ST), origin (OR), age group (AG), community structure (CS), and forest species (FS). Since tree species are comprehensive attributes of forest structure elements, existing studies indicate significant differences in water conservation among different tree species and forest types [43]. Therefore, model variables are classified according to dominant tree species/groups to improve modeling accuracy. 
Using subcompartment data, estimation models for the main forest factors are established for Cunninghamia lanceolata forests, pine forests, fast-growing broadleaf forests, mid-growing broadleaf forests, slow-growing broadleaf forests, shrub forests, and bamboo forests [44]. The data used in this study are presented in Table 1.

2.3. Methodology

This study constructs an analytical framework, shown in Figure 3, based on the modeling requirements of the Group Lasso regression, XGBoost, and random forest models. The framework includes three key logical steps: the calculation and processing of the water conservation dependent variable, the selection and analysis of feature variables, and the simulation and accuracy evaluation of water conservation capacity. Firstly, the data undergo steady-state screening to obtain stable data for model training and validation. Then, the feature importance (contribution rate) of the influencing factors is determined using the Lasso regression, XGBoost, and random forest models to select forest inventory factor variables. Finally, the water conservation capacity is simulated with the preferred model, and its efficiency is evaluated.

2.3.1. Measurement of Forest Water Conservation

The evaluation is conducted using the annual water yield module in the InVEST model. This module, based on the 3S technology platform and the principles of the water cycle, calculates water yield using parameters such as precipitation, plant transpiration, surface evaporation, and root depth. The calculation is then performed using terrain indices, soil-saturated hydraulic conductivity, and flow velocity coefficients to determine the water conservation capacity [45]. To address the practical application issues of the InVEST model, this study enhances its applicability by localizing and adjusting relevant parameters. The simulation results are validated using water resource data from water resource bulletins [46]. The formulae used for calculating annual water yield are as follows:

Y_{i} = \left(1 - \frac{AET_{i}}{P_{i}}\right) \times P_{i}

(1)

\frac{AET_{i}}{P_{i}} = \frac{1 + ω_{i} R_{i}}{1 + ω_{i} R_{i} + \frac{1}{R_{i}}}

(2)

ω_{i} = Z \frac{AWC_{i}}{P_{i}}

(3)

R_{i} = \frac{K_{i} \times ET_{0}}{P_{i}}

(4)

where Y_i represents the annual water yield (mm) of grid i; P_i is the annual precipitation (mm) of grid i; AET_i denotes the annual average actual evapotranspiration (mm) of grid i; R_i is the Budyko aridity index; ω_i is a nonphysical parameter describing climate and soil properties; Z is an empirical constant (the Zhang coefficient), with a value of 1 [47]; AWC_i represents the soil’s available water content (mm) in grid i; K_i represents the vegetation transpiration coefficient for the land cover type in grid i; and ET_0 represents the reference evapotranspiration (mm).
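As a minimal sketch of how Equations (1) to (4) chain together for a single grid cell, the following function computes the annual water yield from precipitation, reference evapotranspiration, the vegetation coefficient, and available water content. The function name, the guards against zero inputs, and the example input values (other than the 1302 mm precipitation quoted for the study area) are illustrative assumptions, not values or code from the study.

```python
def annual_water_yield(precip_mm, et0_mm, k_veg, awc_mm, z=1.0):
    """Annual water yield (mm) of one grid cell, following Equations (1)-(4)."""
    if precip_mm <= 0:
        return 0.0  # guard added for this sketch: no precipitation, no yield
    r = k_veg * et0_mm / precip_mm  # Eq. (4): Budyko aridity index R_i
    if r <= 0:
        return precip_mm  # guard added for this sketch: no evaporative demand
    w = z * awc_mm / precip_mm  # Eq. (3): omega_i
    aet_ratio = (1 + w * r) / (1 + w * r + 1 / r)  # Eq. (2): AET_i / P_i
    return (1 - aet_ratio) * precip_mm  # Eq. (1)


# Hypothetical cell: P = 1302 mm, ET0 = 1000 mm, K = 1.0, AWC = 150 mm, Z = 1
example_yield = annual_water_yield(1302.0, 1000.0, 1.0, 150.0)
```

With all other inputs fixed, a larger ET_0 or K_i raises the aridity index and therefore lowers the yield, which is the behaviour the Budyko-type curve in Equation (2) is meant to capture.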

The formula used for calculating water conservation capacity is as follows:

WR = \min\left(1, \frac{249}{Velocity}\right) \times \min\left(1, \frac{0.9 \times TI}{3}\right) \times \min\left(1, \frac{K_{soil}}{300}\right) \times Y

(5)

where WR represents the water conservation capacity (mm), Velocity denotes the flow velocity coefficient, TI stands for the terrain index, K_soil represents the soil-saturated hydraulic conductivity (cm/d), and Y represents the total water yield (mm).
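Equation (5) scales the water yield by three factors, each capped at 1. A minimal sketch, assuming scalar inputs for a single grid cell (the function and argument names are illustrative):

```python
def water_conservation(water_yield_mm, velocity, terrain_index, ksoil_cm_per_day):
    """Water conservation capacity (mm) of one grid cell, following Equation (5)."""
    velocity_term = min(1.0, 249.0 / velocity)
    terrain_term = min(1.0, 0.9 * terrain_index / 3.0)
    soil_term = min(1.0, ksoil_cm_per_day / 300.0)
    return velocity_term * terrain_term * soil_term * water_yield_mm


# Hypothetical cell: Y = 700 mm, Velocity = 500, TI = 2, K_soil = 150 cm/d
example_wr = water_conservation(700.0, 500.0, 2.0, 150.0)
```

Because every factor is at most 1, the water conservation capacity can never exceed the water yield of the cell.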

2.3.2. Quantification of the Impact of Forest Management Inventory Factors on Forest Water Conservation

(1): Group Lasso Model

Yuan and Lin [48] extended the Lasso regression method by proposing the Group Lasso approach to analyze the selected group variables, thereby overcoming the limitations of the former in handling categorical variables in group structures. The formula for the Group Lasso is as follows:

{\hat{β}}_{λ} = \underset{β}{\arg\min} \left( {‖Y - Xβ‖}_{2}^{2} + λ \sum_{g = 1}^{G} {‖β_{I_{g}}‖}_{2} \right)

(6)

where \hat{β}_λ is the estimated Group Lasso regression coefficient vector; β is the regression coefficient vector; λ is the penalty parameter, a non-negative number used to control shrinkage; G is the number of feature groups; I_g denotes the index set of the g-th group of variables; and β_{I_g} is the coefficient vector of the g-th group of variables.

The algorithm proposed by Yuan and Lin is slow to solve in practical applications. To address this, Yang [49] proposed the groupwise majorization descent (GMD) algorithm, which speeds up the solution of Group Lasso regression by minimizing the sum of the empirical loss and the Group Lasso penalty. In this study, modeling calculations were performed using the Group Lasso module of the data processing system (DPS) software [50].
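To illustrate why the penalty in Equation (6) selects or drops the dummy variables of a categorical factor together, the sketch below evaluates the penalty term and applies the standard block soft-thresholding step, which shrinks each group's coefficient vector as a whole. This is a generic illustration and not the GMD algorithm or the DPS implementation; the coefficient values and grouping are hypothetical.

```python
import math


def group_lasso_penalty(beta, groups, lam):
    """lam * sum over groups of the Euclidean norm of each group's coefficients."""
    return lam * sum(math.sqrt(sum(beta[j] ** 2 for j in idx)) for idx in groups)


def group_soft_threshold(beta, groups, lam):
    """Shrink each group's coefficient vector toward zero as a block."""
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        # A group whose norm does not exceed lam is set to zero entirely.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out


# Hypothetical coefficients: two groups of two dummy variables each
beta = [3.0, 4.0, 0.3, 0.4]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(beta, groups, lam=1.0)
```

In this example the first group (norm 5) is scaled down but kept, while the second group (norm 0.5) is removed as a unit, which is how a whole categorical factor drops out of the model.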

(2): XGBoost Machine Learning Algorithm

The XGBoost algorithm is an advanced estimator with high performance in both classification and regression tasks based on the development of the GBDT (gradient boosting decision tree) model. It can quickly and reliably select and identify relevant feature variables and possesses strong predictive ability, while being less affected by the quality of training data [32]. This algorithm works by continuously splitting feature variables to generate trees. With each new tree generation, it learns a new function to fit the residuals of the previous prediction, thereby continuously improving learning quality [51]. The prediction process of the XGBoost algorithm for the i-th sample is calculated using the following formula:

{\hat{y}}_{i} = \sum_{k = 1}^{K} f_{k} (x_{i})

(7)

where \hat{y}_i represents the predicted value of the i-th sample, f_k(x_i) represents the prediction result of the k-th tree for the i-th sample, and K is the number of trees. For regression problems, the XGBoost algorithm uses the squared loss as the objective function and optimizes it using the gradient descent algorithm. The general form of the objective function is as follows:

Obj = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k})

(8)

where \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}) represents the total loss over the n samples, obtained by comparing the predicted values (\hat{y}_i) with the true values (y_i), and \sum_{k = 1}^{K} Ω (f_{k}) is the regularization term, which penalizes model complexity, thereby avoiding or reducing overfitting and improving the model’s generalization ability [52]. In this study, categorical data in the input dataset must be converted into numerical form before modeling.
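The additive structure of Equation (7), in which each new tree fits the residuals of the current prediction, can be sketched with one-split regression stumps on a single feature. This is a simplified illustration of residual boosting under squared loss; it omits the regularization term Ω and the other refinements of XGBoost, and the toy data, `n_rounds`, and `eta` values are assumptions.

```python
def fit_stump(xs, residuals):
    """Least-squares one-split stump on one feature (needs >= 2 distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def boost(xs, ys, n_rounds=50, eta=0.1):
    """Return a predictor that sums eta-scaled stumps, each fitted to residuals."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(eta * t(x) for t in trees)


# Toy data with a single step between x = 3 and x = 4
model = boost([1, 2, 3, 4, 5, 6], [10, 10, 10, 20, 20, 20])
```

With a learning rate of 0.1, each round removes one tenth of the remaining residual, so more rounds are needed than with a larger learning rate to reach the same fit.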

(3): Random Forest Model

The random forest model is a nonparametric machine learning algorithm that employs decision trees as base learners in an ensemble learning approach. It constructs a series of base learners (T in total) through resampling and combines their predictions to produce an output [53]; this model is thus capable of addressing both regression and classification problems. It primarily utilizes the bootstrap method to resample the dataset, thereby randomly generating T training sets and subsequently growing one decision tree on each of them. The predicted value of the random forest regression algorithm is represented as follows:

\bar{h} (x) = \frac{1}{T} \sum_{t = 1}^{T} h (x, θ_{t})

(9)

where \bar{h}(x) is the model prediction, h(x, θ_t) is the output of the t-th tree based on x and θ_t, x is the vector of independent variables, θ_t is an independent and identically distributed random vector, and T is the number of regression decision trees. Modeling calculations and validation were performed using the random forest module of the DPS software [50].

In constructing the random forest model, two important hyperparameters must be set: the number of variables preselected for each node of the regression tree (mtry) and the number of trees generated by the random forest (ntree). In theory, the higher the number of trees, the higher the model’s accuracy; however, this also increases the computational workload. Typically, once the number of trees reaches a certain threshold, the model’s performance stabilizes. Therefore, it is necessary to select an appropriate number of trees to build an efficient model.

One significant advantage of the random forest algorithm lies in its use of out-of-bag cross-validation (OOBCV) for overall accuracy assessment. This method allows for evaluating overall model performance and assessing the importance of training features [54]. By scoring the importance of the variables using random forest regression, the impact of forest inventory factors on water conservation can be analyzed. The higher the importance, the greater the contribution of that factor. To determine the importance of a variable (V_j), the out-of-bag error rate (EROOB_t) for each tree (t) must first be calculated. Then, the sequence of values of the variable (V_j) must be randomly changed in the out-of-bag data, while keeping all other variables unchanged. Finally, the out-of-bag error rate (EROOB_t^j) after changing the sequence should be calculated. By comparing the error rates before and after changing the sequence of the out-of-bag data, the importance of the chosen variable can be estimated. The greater the increase in the out-of-bag error rate caused by the variable (V_j), the more the decrease in accuracy, thus indicating the higher importance of that variable. The importance of variable (V_j) is represented as follows:

V_{j} = \frac{1}{N} \sum_{t = 1}^{N} \left( EROOB_{t}^{j} - EROOB_{t} \right)

(10)

In Formula (10), N represents the number of trees in the random forest algorithm, and j indexes the variable among the M features [55]. The random forest algorithm reflects the degree of influence of a factor under the superposition of multiple factors and demonstrates this superposition effect at the macro level. One significant advantage of this algorithm is its ability to handle data with multidimensional features without requiring feature selection. This helps in analyzing the combined impact of multiple factors from a more comprehensive perspective [56].
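The permutation procedure behind Formula (10) can be sketched as follows, assuming that each tree is available as a callable predictor together with its own out-of-bag rows and targets, and taking mean squared error as the error measure. The toy predictors and data are hypothetical stand-ins for fitted trees, and the function names are illustrative.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a predictor on a set of rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(models, oob_sets, j, seed=0):
    """Mean increase in out-of-bag error after shuffling feature j, as in Formula (10).

    models: one fitted predictor per tree (callables taking a row as a list).
    oob_sets: one (rows, targets) pair of out-of-bag data per tree.
    """
    rng = random.Random(seed)
    total = 0.0
    for model, (rows, targets) in zip(models, oob_sets):
        base_error = mse(model, rows, targets)
        shuffled = [row[j] for row in rows]
        rng.shuffle(shuffled)  # break the link between feature j and the target
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, shuffled)]
        total += mse(model, permuted, targets) - base_error
    return total / len(models)


# Toy "trees" that use only feature 0 and ignore feature 1
toy_models = [lambda row: 2 * row[0]] * 2
toy_oob = [([[1, 5], [2, 7], [3, 9], [4, 11]], [2, 4, 6, 8])] * 2
importance_unused = permutation_importance(toy_models, toy_oob, j=1)
importance_used = permutation_importance(toy_models, toy_oob, j=0)
```

Shuffling a feature that the trees never use leaves the error unchanged, so its importance is zero, whereas shuffling a feature the trees rely on can only leave the error unchanged or increase it in this toy case.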

(4): Model Accuracy Evaluation Verification

Three metrics, the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE), are used to assess the accuracy or goodness of fit of the model [55]. To evaluate accuracy in the validation set, the random forest model uses out-of-bag samples, while the Lasso regression and XGBoost models use cross-validation (CV).
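For reference, the three metrics can be computed as in the sketch below, which uses their standard definitions; the sample values are hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted values
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
```

RMSE and MAE are in the units of the target variable, whereas R² is dimensionless, which is why the two kinds of metric are reported side by side.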

3. Results

3.1. Assessment of Water Conservation and Spatial Distribution Characteristics

Based on calculations of the InVEST model, the average water yield in the region is 729.03 mm, with a standard deviation of 272.25 mm. As shown in Figure 4, forested areas exhibit higher water yields, while non-forested areas have significantly lower water yields. The total water conservation capacity is estimated to be 290.73 billion cubic meters. To validate the simulated water yield from the InVEST model, it was compared with the water resource statistics from the same period. According to the water resource reports of Changde, Yueyang, and Yiyang, the average total water resources in the Dongting Lake area in 2020 were calculated to be 340.25 billion cubic meters, with a redundancy of 73.53 billion cubic meters from both surface and underground water sources. The actual water resources amounted to 266.72 billion cubic meters. The relative errors between the simulated water yield and the water resource statistics were all less than 8.26%, indicating a relatively good simulation performance of the water yield. Regarding the water conservation results, based on data on heavy rainfall and historical floods in the Dongting Lake area, the simulation trend of water conservation appears to be reliable.

As shown in Figure 4a, water yield exhibits significant spatial differentiation, with high values concentrated in the eastern and western regions and low values in the middle. Figure 4b shows that the water conservation volume has a spatial distribution that is generally consistent with the water yield, decreasing gradually from the east and west toward the middle. The variation in water conservation volume is mainly influenced by precipitation, topography, and land use types. Areas with high vegetation coverage, such as forests and grasslands, promote water conservation and soil retention, thereby enhancing water provision functions in those areas. When examining different forest types (Figure 5), there are significant differences in the water conservation volume among different dominant tree species. In terms of water conservation volume per unit area, the observed order is slow-growing broadleaf forests > shrub forests > middle-growing broadleaf forests > Cunninghamia lanceolata forests > fast-growing broadleaf forests > pine forests > bamboo forests.

3.2. Key Factor Extraction from Forest Inventory Factors

According to the technical regulations for the forest management inventory and the characteristics of forest resources in the Dongting Lake area, the inventory factors recorded, especially stand factors, differ among forest types. Considering also the effect of differences in tree species structure on water conservation capacity, this study used the forest management inventory data for Hunan Province from the “13th Five-Year Plan” period to extract key factors separately for three types: arboreal forests, shrub forests, and bamboo forests.

3.2.1. Model Parameter Selection and Estimation

(1): Parameter optimization of the Lasso regression model

By controlling the size of λ, variable selection can be achieved. The optimal parameter (λ) in the Lasso model was selected through cross-validation (CV) to ensure the stability of the model. Based on the trend of the estimated values under CV (Figure 6), the λ value that minimized the mean square error of the model was selected. Through this process, the optimal tuning parameters λ for the best models of arboreal forests, shrub forests, and bamboo forests were determined to be 0.6018, 0.5850, and 0.5209, respectively.
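The selection of λ by cross-validation can be sketched for the simplest possible case, a Lasso with one predictor and no intercept, where the fit has a closed form. The scaling of the objective, the fold assignment, the λ grid, and the toy data are all assumptions made for illustration; the study itself used the Group Lasso module of DPS.

```python
def soft_threshold(value, lam):
    """Move value toward zero by lam, returning exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_lasso_1d(xs, ys, lam):
    """Minimise (1/2n) * sum((y - b*x)^2) + lam * |b| for a single predictor."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return soft_threshold(sxy, lam) / sxx


def cv_select_lambda(xs, ys, lambdas, k=5):
    """Return the (lambda, mean squared error) pair with the lowest k-fold CV error."""
    best_lam, best_mse = None, float("inf")
    for lam in lambdas:
        sq_errors = []
        for fold in range(k):
            pairs = list(enumerate(zip(xs, ys)))
            train = [(x, y) for i, (x, y) in pairs if i % k != fold]
            test = [(x, y) for i, (x, y) in pairs if i % k == fold]
            b = fit_lasso_1d([x for x, _ in train], [y for _, y in train], lam)
            sq_errors += [(y - b * x) ** 2 for x, y in test]
        fold_mse = sum(sq_errors) / len(sq_errors)
        if fold_mse < best_mse:
            best_lam, best_mse = lam, fold_mse
    return best_lam, best_mse


# Toy data with an exact linear relation, so no shrinkage is optimal here
toy_x = [float(i) for i in range(1, 11)]
toy_y = [2.0 * x for x in toy_x]
chosen_lam, chosen_mse = cv_select_lambda(toy_x, toy_y, [0.0, 0.5, 5.0])
```

On noisy data with many candidate predictors, the minimum of the cross-validated error typically lies at a positive λ, as with the values reported above.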

(2): XGBoost model parameter optimization

The three parameters, namely, the number of iterations (nrounds), tree depth (max_depth, md), and learning rate (eta), were repeatedly cross-validated using root mean square error (RMSE) as the evaluation metric. The parameter combination that resulted in the lowest RMSE was selected as the optimal parameter. As shown in Figure 7a, as the maximum tree depth increases, the accuracy gradually reaches its optimum. However, further increasing the maximum tree depth can lead to model overfitting and decreased accuracy. For all three types of forests (arboreal, shrub, and bamboo), the tree depth (md) was set to 3. Figure 7b–d show the learning curves corresponding to different learning rates. A smaller learning rate corresponds to a larger steady-state number of iterations. Generally, a smaller learning rate combined with a larger number of iterations can improve the model’s generalization ability. There were no significant differences between the three types of forests (arboreal, shrub, and bamboo) regarding parameter selection. Therefore, the final parameter selection is nrounds = 500, eta = 0.1, and md = 3.

(3): Optimization results of random forest parameters

For the custom parameters mtry and ntree in the random forest regression algorithm, mtry was generally chosen to be around one-third of the number of variables. A larger value for ntree leads to better algorithm performance. As ntree increases, the out-of-bag error decreases significantly and then stabilizes. To save time, the ntree value when stability is reached can be selected. Determining the sizes of these two parameters through experimentation improves the prediction accuracy of the random forest model. The relationship between the mtry and ntree parameters of the random forest models for three types, arboreal, shrub, and bamboo, as well as the error, is shown in Figure 8. Model error rates stabilize when ntree values are 400, 110, and 130, respectively.
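The two rules of thumb in this paragraph, mtry of about one-third of the variables and ntree taken where the out-of-bag error stabilizes, can be sketched as follows. The tolerance, the window length, and the error curve are hypothetical; the paper does not state the stabilization criterion that was used.

```python
def default_mtry(n_features):
    """Roughly one-third of the number of candidate variables, at least 1."""
    return max(1, n_features // 3)


def stable_ntree(oob_errors, tol=0.001, window=3):
    """Smallest ntree after which the OOB error varies by less than tol.

    oob_errors: list of (ntree, error) pairs sorted by ntree.
    The error must stay within tol over the next `window` entries.
    """
    for i in range(len(oob_errors) - window):
        segment = [err for _, err in oob_errors[i:i + window + 1]]
        if max(segment) - min(segment) < tol:
            return oob_errors[i][0]
    return oob_errors[-1][0]  # never stabilized: fall back to the largest ntree


# Hypothetical OOB error curve
curve = [(10, 0.50), (50, 0.30), (100, 0.25), (150, 0.2495),
         (200, 0.2493), (250, 0.2494), (300, 0.2492)]
chosen_ntree = stable_ntree(curve)
```

A looser tolerance selects a smaller forest and shortens training, at the cost of a slightly less settled error estimate.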

3.2.2. Accuracy Evaluation of Three Models

To compare the training accuracy of each model, the data were divided into modeling and validation samples in a 70%:30% ratio. Model fit and applicability for predicting water conservation were tested on the training and testing sets using metrics such as the coefficient of determination (R²), root mean square error (RMSE), and average relative error (ARE) for the Lasso regression, random forest, and XGBoost algorithms. As shown in Table 2 and Table 3, for the three forest types—arboreal, shrub, and bamboo—under the indicators of the coefficient of determination, root mean square error, and average relative error, the random forest model, XGBoost model, and Lasso regression model all meet the accuracy requirements of the simulation results (with R² ranging from 0.508 to 0.869 and RMSE ranging from 28.380 to 69.339). The rationality of feature selection using the original features indicates that the framework of influencing factors constructed in this study is effective, and the three machine learning methods have reliable results in selecting key factors for forest investigation factors. Whether for arboreal, shrub, or bamboo forests, the random forest model exhibits the highest accuracy on both the training and validation sets, with the largest R² value and the smallest RMSE value. This suggests that the prediction deviation error of the random forest model is smaller than that of the XGBoost and Lasso regression models, indicating higher prediction accuracy. Thus, using the random forest method to predict water conservation will yield optimal simulation results.
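A 70%:30% division into modeling and validation samples can be sketched as below. The random shuffling and the seed are assumptions, since the paper does not state how the samples were assigned.

```python
import random


def split_70_30(samples, seed=42):
    """Shuffle the samples and split them 70% / 30% into modeling and validation sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical subcompartment identifiers
modeling, validation = split_70_30(range(100))
```

Fixing the seed makes the split reproducible, so that all three models are compared on the same modeling and validation samples.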

3.2.3. Extraction of Dominant Influencing Factors

(1): Selection of variables in the Lasso regression model

The absolute value of the standardized regression coefficient of each variable in the Lasso regression model was used to measure the magnitude of its effect on the dependent variable (its importance). The larger the absolute value, the greater the impact on the dependent variable. A coefficient of 0 means that the feature makes no contribution to the model’s predictions, so such features can be deleted to achieve dimensionality reduction. Figure 9 shows the results of variable selection using the Lasso regression model. The selected variables are ranked by their standardized regression coefficients. For arboreal forests, the ranking is as follows: topography (TO) > slope (SL) > constriction (CN) > community structure (CS) > origin (OR) > soil thickness (ST) > average DBH (ADBH) > age group (AG) > average height (AH) > volume per hectare (VH). For shrub forests, the importance of variables from greatest to least is as follows: topography (TO) > origin (OR) > slope (SL) > coverage (CE) > soil thickness (ST) > community structure (CS) > average height (AH). For bamboo forests, the importance of variables from greatest to least is as follows: topography (TO) > origin (OR) > slope (SL) > soil thickness (ST) > constriction (CN) > average height (AH) > community structure (CS). For the three forest types, these selected forest inventory factors are the dominant factors influencing water conservation.

(2): Importance of XGBoost model variables

The XGBoost algorithm can identify the importance of feature variables, which are represented by the normalized importance index (Figure 10). By inputting 21 feature variables into the XGBoost model, the importance of different feature variables for water conservation is examined. The variables that cumulatively account for the top 90% of importance from highest to lowest in arboreal forests can be ranked as follows: slope (SL) > constriction (CN) > topography (TO) > origin (OR) > community structure (CS) > soil thickness (ST) > age group (AG) > average DBH (ADBH) > average height (AH) > volume per hectare (VH). In shrub forests, they are ranked as follows: slope (SL) > topography (TO) > coverage (CE) > soil thickness (ST) > community structure (CS) > origin (OR) > average height (AH). In bamboo forests, they are ranked as follows: slope (SL) > topography (TO) > soil thickness (ST) > constriction (CN) > community structure (CS) > origin (OR) > average height (AH). For the three forest types, these selected forest investigation factors are the dominant factors influencing water conservation.

(3): Importance score of forest inventory factors in the random forest model

The random forest regression model expresses the degree of influence of each independent variable on the dependent variable. Using the Type II forest resource inventory factors and the water conservation data, and based on RF’s OOB importance and correlation analysis, the importance scores of the random forest prediction model are normalized and sorted according to the importance index. The results are shown in Figure 11. The variables that cumulatively account for the top 90% of importance from highest to lowest in arboreal forests are ranked as follows: topography (TO) > slope (SL) > community structure (CS) > constriction (CN) > soil thickness (ST) > origin (OR) > age group (AG) > average DBH (ADBH) > average height (AH) > volume per hectare (VH). In shrub forests, they are ranked as follows: slope (SL) > topography (TO) > coverage (CE) > community structure (CS) > origin (OR) > soil thickness (ST) > average height (AH). In bamboo forests, they are ranked as follows: topography (TO) > slope (SL) > constriction (CN) > origin (OR) > community structure (CS) > soil thickness (ST) > average height (AH). For the three forest types, these selected forest inventory factors are the dominant factors influencing water conservation.

(4): Differential analysis of the influencing factors and optimization of variable selection

The feature variables of the two algorithms, XGBoost and random forest, were sorted by importance. After standardizing the importance scores of the XGBoost and random forest models and accumulating them in descending order, the variables that together reach a cumulative explanation rate of 90% were retained. Compared with the variables extracted by Lasso regression, the results indicate that, although there are differences in factor ranking, the influencing factors identified from the Type II forest resource inventory data for arboreal forests, shrub forests, and bamboo forests are approximately consistent. Among them, topography (TO), slope (SL), and constriction/coverage (CN/CE) have the greatest impact on water conservation. Topography (TO) is the highest-ranking site factor, and constriction/coverage (CN/CE) is the highest-ranking stand structure factor. The feature-selection effect obtained by integrating XGBoost, Lasso regression, and random forest is better, indicating that topography (TO), slope (SL), community structure (CS), constriction/coverage (CN/CE), soil thickness (ST), origin (OR), age group (AG), average DBH (ADBH), average height (AH), and volume per hectare (VH) are important factors influencing water conservation.
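One way to compare and combine the three rankings is sketched below. The text does not specify the integration rule, so the vote count and mean rank used here are assumptions chosen for illustration; the input lists are the shrub forest rankings reported above for Lasso regression, XGBoost, and random forest, and `combine_selections` is a hypothetical helper name.

```python
from collections import Counter


def combine_selections(selections):
    """Combine ranked variable lists from several methods by vote count and mean rank."""
    votes = Counter()      # number of methods that selected each variable
    rank_sum = Counter()   # sum of the variable's rank positions across methods
    for ranked in selections.values():
        for position, name in enumerate(ranked, start=1):
            votes[name] += 1
            rank_sum[name] += position
    mean_rank = {name: rank_sum[name] / votes[name] for name in votes}
    # Most votes first, then best (lowest) mean rank, then name for a stable order.
    ordered = sorted(votes, key=lambda n: (-votes[n], mean_rank[n], n))
    return ordered, votes, mean_rank


# Shrub forest rankings as reported in the text for the three models.
shrub_rankings = {
    "lasso": ["TO", "OR", "SL", "CE", "ST", "CS", "AH"],
    "xgboost": ["SL", "TO", "CE", "ST", "CS", "OR", "AH"],
    "random_forest": ["SL", "TO", "CE", "CS", "OR", "ST", "AH"],
}

ordered, votes, mean_rank = combine_selections(shrub_rankings)
print(ordered)
```

For shrub forests all three models select the same seven variables, so every variable receives three votes and only the mean rank separates them, which matches the observation that the factor sets agree while their order differs.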

3.3. Simulation of Water Conservation Capacity in Random Forest Models

3.3.1. Simulation Accuracy of Random Forest Models for Different Forest Types

Using selected variables such as topography (TO), slope (SL), soil thickness (ST), origin (OR), age group (AG), constriction/coverage (CN/CE), average DBH (ADBH), average height (AH), volume per hectare (VH), and community structure (CS) as independent variables, an estimation of water conservation volume was conducted. This estimated water conservation volume was used as the independent variable, and the water conservation data calculated using the InVEST model were used as the dependent variable to construct a random forest model for prediction and accuracy evaluations. The validation equations and R² of seven different forest types are shown in Figure 12, highlighting no significant difference between the simulated water conservation volume estimated using variable optimization and the calculated water conservation volume for the seven forest types. The prediction of the water conservation capacity for slow-growing broadleaf forests is significantly better than for other forest types. Most samples clustered around the 1:1 line, indicating a high degree of fit, with R² coefficients all exceeding 0.8.

Simultaneously, using all the factors from the Type II forest resource inventory as independent variables, the water conservation volume was calculated, and the fit validation is shown in Figure 13. Comparing the R² values in Figures 12 and 13 shows that, for all seven forest types, the simulation accuracy with variable optimization is higher than the accuracy obtained when all factors are introduced. The random forest model with variable selection has a larger R² (p < 0.01) and a smaller RMSE, and its predicted values lie closer to the measured values around the 1:1 reference line in the scatter plots, demonstrating better goodness of fit and statistical power. Its predictive power is also better, indicating that selecting feature variables and identifying the most influential factors for predicting the water conservation volume can achieve better results.
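The accuracy comparison above rests on R², RMSE, and (in Tables 2 and 3) ARE. The sketch below shows how such metrics can be computed for two sets of predictions. The exact formulas are not given in this section, so the definitions are assumptions: R² is taken as 1 minus the ratio of residual to total sum of squares, and ARE as the mean of absolute relative errors. All numbers are hypothetical and do not come from the study.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def average_relative_error(observed, predicted):
    """Mean of |observed - predicted| / |observed| (assumed definition of ARE)."""
    return sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical water conservation values (illustrative only).
invest_values = [100.0, 200.0, 300.0, 400.0]    # reference values
pred_selected = [110.0, 190.0, 310.0, 390.0]    # model with selected variables
pred_all = [130.0, 170.0, 330.0, 370.0]         # model with all factors

for label, pred in (("selected", pred_selected), ("all factors", pred_all)):
    print(label, r_squared(invest_values, pred), rmse(invest_values, pred),
          average_relative_error(invest_values, pred))
```

In this toy case the model with selected variables has the higher R² and the lower RMSE, which is the pattern the comparison of Figures 12 and 13 describes.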

3.3.2. Comparison of the Spatial Distribution of Prediction Results for Different Forests

Based on the selected model parameters and the feature variables filtered using the three models, a random forest model was constructed with the forest water conservation volume as the target variable. Predictions were carried out for seven forest types: Cunninghamia lanceolata forests, pine forests, fast-growing broadleaf forests, middle-growing broadleaf forests, slow-growing broadleaf forests, shrub forests, and bamboo forests, resulting in the spatial distribution of the forest water conservation volume shown in Figure 14a. The overall distribution of the errors relative to the InVEST simulations is shown in Figure 14b. The predictions of the random forest model, depicted in Figure 14, show that the overall distribution trends for the seven forest types are generally similar to the InVEST results, with slight differences observed in some areas.

4. Discussion

(1): Forest water conservation services involve interactions among the influencing factors, with forest inventory factors exhibiting nonlinearity and scale effects. By integrating the Group Lasso regression model, XGBoost, and the random forest model, it is possible to effectively identify complex characteristics. Utilizing important information for variable dimensionality reduction is beneficial for enhancing the accuracy of predictions regarding the correlation between forest water conservation and forest inventory factors. Previous research findings indicate that the Group Lasso regression model, XGBoost, and the random forest algorithm possess different characteristics in simulation prediction [57], and the important variables selected using a specific algorithm do not necessarily imply their importance for other machine learning models [58]. Additionally, different forest inventory factors have varying impacts on water conservation, manifested at different spatial scales. Among these, site factors are large-scale influencing factors of forest water conservation, acting through influencing lower-level stand factors, thus affecting the macro distribution of forest water conservation. On the other hand, stand factors alter and shape the spatial pattern of forest water conservation on a finer scale [22]. This study found that by integrating machine learning algorithms in selecting two types of inventory factors for forest water conservation, the superior qualities of the three machine learning algorithms in handling the scale and correlation of independent variables were fully utilized. Continuously incorporating more data dimensions enables handling multicollinearity and facilitates the calculation of the nonlinear effects of variables, providing more objective information on the main controlling factors of forest water conservation. 
Considering the model’s fitting ability, robustness, prediction accuracy, and ability to conduct feature variable selection, the model estimation accuracy indicators meet the corresponding requirements, indicating the practicality and comparability of integrating the three machine learning algorithms. This integration can achieve the best simulation effects and scalability. Due to the nonindependent characteristics of forest inventory factors that affect water conservation functions, this study can effectively screen modeling variables with nonlinear relationships and accurately identify the specificity of forest water conservation functions at different scales [3]. Comprehensive consideration of the impact of forest inventory factors on forest water conservation will aid in making informed management decisions and implementing coordinated management of forest ecosystem structures and functions. It also holds significant importance in formulating sustainable forest management policies in the context of global climate change.
(2): Due to limited data availability for forest inventory factors, model prediction applications for categorical data face challenges. However, Group Lasso regression and random forest algorithms exhibit good identification characteristics, particularly in ranking the importance of categorical variables, and they can be effectively applied to a wide variety of data types. The specifications required by the “technical regulations for inventory for forest management planning and design” necessitate the use of categorical data in type grades for inventory factors. Group Lasso regression and random forest regression algorithms can effectively handle categorical data, thus addressing some of the limitations of traditional methods. In contrast, the XGBoost model quantifies categorical data via scoring; however, the accuracy of feature variable selection can be affected due to limitations in its quantification method. Previous research indicates that traditional model algorithms perform well in simulations but have limited applications for categorical independent variables. They generally use fuzzy quantification, expert scoring, or dummy variables to handle categorical data [59]. This quantification process affects the accuracy of model simulation results [60]. There are certain differences in the effectiveness of different machine learning methods. The feature importance ranking obtained using random forest regression fitting can only suggest the relative importance of the features, but cannot provide specific contribution values. In contrast, Lasso regression may lose some information while optimizing by discarding some independent variables but can detect the explanatory power of factor correlations. This study leverages the advantages of these machine learning models to process the classification (qualitative) variables for forest management inventory factors. 
By doing so, it ensures accuracy and also improves interpretability to a certain extent, thereby effectively revealing the intrinsic relationships between characteristic elements.
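The dummy-variable treatment of categorical inventory factors mentioned above can be sketched as follows. This is a generic illustration, not the preprocessing used in the study: the factor levels are hypothetical, `one_hot_groups` is an illustrative helper name, and the sketch uses full dummy coding without dropping a reference level. Recording which columns belong to the same factor is what allows a grouped penalty such as Group Lasso to keep or discard a categorical factor as a whole.

```python
def one_hot_groups(records, categorical_columns):
    """Encode categorical columns as dummy variables and record the column group of each factor."""
    # Collect the sorted set of levels observed for every categorical factor.
    levels = {col: sorted({rec[col] for rec in records}) for col in categorical_columns}
    header, groups = [], {}
    for col in categorical_columns:
        groups[col] = []
        for level in levels[col]:
            groups[col].append(len(header))  # index of this dummy column
            header.append(f"{col}={level}")
    # One row of 0/1 indicators per record, in the same column order as the header.
    rows = [
        [1 if rec[col] == level else 0 for col in categorical_columns for level in levels[col]]
        for rec in records
    ]
    return header, groups, rows


# Hypothetical stand records with two categorical inventory factors.
stands = [
    {"origin": "natural", "topography": "hill"},
    {"origin": "planted", "topography": "plain"},
    {"origin": "natural", "topography": "mountain"},
]

header, groups, rows = one_hot_groups(stands, ["origin", "topography"])
print(header)
print(groups)
print(rows)
```

Each record has exactly one indicator set to 1 within each group, so the number of ones per row equals the number of categorical factors.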
(3): The factors influencing forest water conservation services are complex. This study focuses only on the accessibility of data from the forest management inventories, which limits the ability to address complex interrelationships between forest water conservation services and forest structural impacts. Recent studies state that forest water conservation services are generally influenced by four categories of environmental factors: climate, soil, vegetation, and topography. These factors have different impacts at the individual tree scale, stand scale, slope scale, and regional scale. Due to the varying technical regulations and specifications in forest management inventory factors, the predictions obtained in this study regarding forest water conservation capacity using different methods may be inconsistent. Moreover, feature importance ranking can only provide the relative importance of features, potentially sacrificing interpretability to ensure accuracy. Furthermore, not all regions follow a uniform forest management inventory process; thus, factors that are challenging to record but have a significant potential impact on water conservation are not investigated, leading to a lack of complete information and variability in the simulation results. Additionally, this lack of uniformity leads to differences in machine learning parameters, which also affect simulation performance. Despite these limitations, this study provides a methodological reference for studying the relationship between water conservation and forest inventory factors. However, further research is required to gain an in-depth understanding of the dynamic mechanisms influencing the interactions between water conservation and forest inventory factors across different ecological and climatic conditions.

5. Conclusions

This article analyzes forest management inventory data to identify the factors influencing water conservation capacity. Different algorithms, such as Group Lasso regression, random forest regression, and the XGBoost model, were used to assess these factors. Then, the preferred method, random forest regression, was used to construct water conservation capacity models for seven forest types. The accuracy of these models was evaluated through fitting and validation metrics to explore the relationship between management inventory factors and the water-conservation function in forests. The main conclusions of this study are as follows:

(1): Using the InVEST model’s water yield module, this study reveals significant spatial heterogeneity in forest water conservation, with low values in the central region and high values in the eastern and western regions. Additionally, the effectiveness of forest water conservation services varies with the dominant tree species. Slow-growing broadleaf forests provide the best service, followed by shrub forests, middle-growing broadleaf forests, Cunninghamia lanceolata forests, pine forests, fast-growing broadleaf forests, and bamboo forests.
(2): Regarding the inventory factors influencing forest water conservation, the coefficients of determination (R²) for the three algorithms range from 0.508 to 0.869, while the root mean square errors (RMSEs) range from 28.380 to 69.339. These results indicate strong modeling accuracy, good simulation effects, and relatively good scalability. Specifically, based on the test-set results in Table 3, the accuracy of the three models can be ordered as follows: random forest model > XGBoost model > Lasso regression model. Due to its relatively low error and high estimation accuracy, the random forest algorithm was ultimately selected for simulation considering the accuracy and complexity, thus providing a new approach for predicting the forest water conservation function.
(3): The key factors influencing forest-type water conservation include topography, slope, soil thickness, origin, age group, constriction/coverage, average diameter at breast height, average height, volume per hectare, and community structure. This indicates that the three estimation methods are consistent in factor selection, thereby revealing the mechanisms of influence of forest-type water conservation services based on forest management inventory data.
(4): The impact of forest management inventory factors on forest water conservation varies. This study compared two methods for introducing independent variables: introducing all factors and feature selection through the fusion of three machine learning algorithms. The results indicate that for the simulation of forest water conservation using the random forest algorithm, feature selection reduces model redundancy and yields a better estimation of the forest water conservation function. This will promote further research concerning the prediction of forest water conservation services.

Author Contributions

Z.C.: conceptualization and writing—original draft. Y.L. (Yang Liu): software and writing—review and editing. Y.L. (Yong Lü): conceptualization, funding acquisition, methodology, and writing—review and editing. D.C.: methodology, project administration, resources, and writing—review and editing. B.P.: methodology, project administration, and resources. All authors have read and agreed to the published version of the manuscript.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded by the National Natural Science Foundation of China (42271229, 42171213) and the Natural Science Foundation of Hunan Province, China (2024JJ8037).

Data Availability Statement

The data from the sample plots in this study are available on request from the corresponding author. Those data are not publicly available due to privacy and confidentiality.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Xie, G.D.; Zhen, L.; Lu, C.; Xiao, Y.; Chen, C. Expert Knowledge Based Valuation Method of Ecosystem Services in China. J. Nat. Resour. 2018, 23, 911–919.
2. Zhang, H.; Yuan, S. Evaluation of the spatial patterns of the water retention function of the forest ecosystem in the Dongjiang River Watershed. Acta Ecol. Sin. 2016, 3, 8120–8127.
3. Liu, X.; Zhang, W.; Feng, Y.; Zhao, X.; Gan, X.; Zhou, Q. Research on water conservation function of forest ecosystem: Progress and prospect. Chin. J. Ecol. 2022, 41, 784–791.
4. Katherine, J.E.; Lindsay, R.B.; Wayne, T.S. Changes in vegetation structure and diversity after grass-to-forest succession in a Southern. Am. Midl. Nat. 1998, 140, 219–233.
5. Luo, M.; Wang, W.; Zong, X.; Zheng, X. Study on the Coupling Relationship between Structure and Function of Water Conservation Forests in Mountainous Area of Beijing. For. Resour. Manag. 2011, 5, 84–88.
6. Ma, W.; Wang, X.; Geng, R. An Overview of Research on Forest Water Conservation in China. J. Cap. Norm. Univ. Nat. Sci. Ed. 2016, 37, 87–92.
7. Vigerstol, K.; Aukema, J. A comparison of tools for modeling freshwater ecosystem services. J. Environ. Manag. 2011, 92, 2403–2409.
8. Bao, Y.; Li, T.; Liu, H.; Ma, T.; Wang, H.; Liu, K.; Seng, X.; Liu, X. Spatial and temporal changes of water conservation of Loess Plateau in northern Shaanxi province by InVEST model. Geogr. Res. 2016, 35, 664–676.
9. Deng, K.; Shi, P.; Xie, G. Water Conservation of Forest Ecosystem in the Upper Reaches of Yangtze River and its Benefits. Resour. Sci. 2002, 24, 68–73.
10. He, X.; Zhao, Y.; Wang, K. Water Conservation Function of Typical Subtropical Forest Ecosystem. J. Northeast For. Univ. 2023, 51, 77–82.
11. Gong, Z.; Kang, X.; Gu, L.; Zhao, J.; Zheng, Y.; Yang, H. Research methods on natural forest stand structure: A review. J. Zhejiang For. Coll. 2009, 26, 434–443.
12. Jiang, G.; Zheng, X.; Ning, Y. Relationship between Forest Stand Structure and Function of Water Conservation: A Case Study of Badaling Forest Farm. J. Northwest For. Univ. 2012, 27, 175–179.
13. Zhang, G.; Zhang, Y. Analysis of water conservation value of forest vegetation in Zhengzhou. Agric. Sci. Technol. Shanghai. 2018, 82–84.
14. Mao, J.; Tian, Y.; Xie, J.; Zhao, Q.; Ma, L.; Cha, T. Evaluation and value estimation of water conservation function of forest vegetation of four urban functional areas in Beijing. Acta Ecol. Sin. 2021, 41, 9020–9028.
15. Cui, W.; Zheng, X.; Gu, L. Water conservation of forest in Jingouling forest farm and its value. J. Cent. South Univ. For. Technol. 2016, 36, 88–92.
16. Meng, C.; Wang, Q.; Zheng, X. Coupling mechanism between stand structure and function of water conservation forest in Badaling forest farm, Beijing. J. Cent. South Univ. For. Technol. 2017, 37, 69–72.
17. Cui, Y.; Fan, L.; Liu, S.; Sun, T. Evaluation of forest ecosystem service function in Shanxi province. Acta Ecol. Sin. 2019, 39, 4732–4740.
18. GB/T26424-2010; Technical Regulations for Inventory for Forest Management Planning and Design. China Standard Press: Beijing, China, 2010.
19. Wang, B.; Yang, F.; Guo, H. Specifications for Assessment of Forest Ecosystem Services; China Standard Press: Beijing, China, 2008.
20. Cui, C.; Li, J.; Feng, Y.; Shi, B.; Li, X.; Bian, C. Estimations on Carbon Storage and Carbon Density of Forest Resources Based on the County Forest Management Inventory: The case of Sishui County. J. Shandong Agric. Univ. Nat. Sci. Ed. 2017, 48, 279–283.
21. Yang, C. Relationship between Stand Factors of Pinus Massoniana and Topography in Linan; Agriculture & Forestry University: Hangzhou, China, 2016.
22. Dong, L.; Liu, Z.; Li, F.; Jiang, L. Relationships between Stand Spatial Structure Characteristics and Influencing Factors of Broad-leaved Korean Pine (Pinus koraiensis) Forest in Liangshui Nature Reserve, Northeast China. Bull. Bot. Res. 2014, 34, 114–120.
23. Meng, C.; Zheng, X.; Wang, W. Correlation between Stand Structure and Soil of Water Conservation Forest in Badaling Forest Farm of Beijing. J. Northwest For. Univ. 2016, 31, 99–105.
24. Wei, X.; Sun, Y.; Mei, G.; Wang, Y. Comprehensive evaluation of forest resources based on multi-function management. J. Cent. South Univ. For. Technol. 2013, 33, 103–108.
25. Du, Y.; Zheng, X.; Luo, M. Study on leading function types of evergreen broad-leaved forest in Jiangle County, Fujian Province. J. Cent. South Univ. For. Technol. 2013, 33, 39–43.
26. Meng, C.; Zhou, J.; Zheng, X. Health Assessment of Castanopsis Secondary Forest in Jungle Forest Farm of Fujian. J. Northwest For. Univ. 2015, 30, 198–203.
27. Wen, Y.; Liu, S. Quantitative analysis characteristics of rainfall interception of main forest ecosystem types in China. Sci. Silvae Sin. 1995, 31, 289–298.
28. Dormann, C.F.; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; Marquéz, J.R.G.; Gruber, B.; Lafourcade, B.; Leitão, P.J.; et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 2013, 36, 27–46.
29. Lv, J.; Hao, N.; Li, C.; Shi, X.; Li, Z. Identification of Forest Type Based on Random Forest and Texture Characteristics. Remote Sens. Inf. 2017, 32, 109–114.
30. Gao, R.; Su, X.; Xie, Y.; Lei, X.; Lu, Y. Prediction of adaptability of Cunninghamia lanceolata based on random forest. J. Beijing For. Univ. 2017, 39, 36–43.
31. Lu, J.; Feng, Z. Forecast of Stand Volume Growth in Beijing by Using Random Forest. J. Northeast For. Univ. 2020, 48, 7–11.
32. Lin, S.; Wen, Q.; Wu, D.; Huang, H.; Zheng, X. Regional Forest Structure Evaluation Model Based on Remote Sensing and Field Survey Data. Forests 2024, 15, 533.
33. Li, K.; Wu, D.; Fang, L. Forest Volume Stock with Sentinel-2 Remote Sensing Image. J. Northeast For. Univ. 2021, 49, 59–66.
34. Zhou, T.; Xu, H.; Xu, Q.; Zhu, B.; Lu, Y. Filling missing factors for China’s forest resources inventory based on random forest classification models. Acta Ecol. Sin. 2024, 44, 1–14.
35. Liu, K.; Zhong, H.; Lin, S. Tree Species Classification in UAV Hyperspectral Images Based on Different Machine Learning Algorithms. For. Eng. 2024, 40, 98–108.
36. Yu, X.; Zhou, B.; Lü, X.; Yang, Z. Evaluation of Water Conservation Function in Mountain Forest Areas of Beijing Based on InVEST Model. Sci. Silvae Sin. 2012, 48, 1–5.
37. Jia, F. InVEST Model Based Ecosystem Services Evaluation with Case Study on Ganjiang River Basin; China University of Geosciences: Wuhan, China, 2014.
38. Pan, T.; Wu, S.-H.; Dai, E.-F.; Liu, Y.-J. Spatiotemporal variation of water source supply service in Three Rivers Source Area of China based on InVEST model. Chin. J. Appl. Ecol. 2013, 24, 183–189.
39. Li, C.; Cui, N.; Wei, X.; Hu, X.; Gong, D. Improvement of Hargreaves method for reference evapotranspiration in hilly area of central Sichuan Basin. Trans. Chin. Soc. Agric. Eng. 2015, 31, 129–133.
40. Zhou, W.; Liu, G.; Pan, J. Soil Available Water Capacity and its Empirical and Statistical Models - with Special Reference to Black Soils in Northeast China. J. Arid Land Resour. Environ. 2003, 17, 88–95.
41. Yang, Y.; Wang, L.; Ren, W.; Zheng, D.; Zhi, C. Headwater Conservation Evaluation of Forest Resources in Xichuan County Based on InVEST Model. For. Resour. Manag. 2017, 3, 51–55.
42. Yin, G. The Impact of Forest Resource Change on Forest Ecosystem Service Function Based on InVEST Model; Normal University: Chongqing, China, 2017.
43. Gong, B.; Shi, C.; He, H.; Liu, C.; Shi, C.; Zhao, T. The water conservation capacity of 6 kinds of planted forests in northern mountain area of Hebei Province. J. Arid Land Resour. Environ. 2019, 33, 165–170.
44. Zhang, Z. Research and Implementation of Key Techniques for Estimating Main Stand Factors in Multi-Source Remote Sensing Forest Resources Survey; Xi’an University of Architecture & Science and Technology: Xi’an, China, 2020.
45. Lü, L.; Ren, T.; Sun, C.; Zheng, D.; Wang, H. Spatial and temporal changes of water supply and water conservation function in Sanjiangyuan National Park from 1980 to 2016. Acta Ecol. Sin. 2020, 40, 993–1003.
46. Chen, S.; Liu, K.; Bao, Y.; Chen, H. Spatial Pattern and Influencing Factors of Water Conservation Service Function in Shangluo City. Sci. Geogr. Sin. 2016, 36, 1546–1554.
47. Zhang, C.; Li, W.; Zhang, B.; Liu, M. Water Yield of Xitiaoxi River Basin Based on InVEST Modeling. J. Resour. Ecol. 2012, 3, 50–54.
48. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Stat. Methodol. 2006, 68, 49–67.
49. Yang, Y.; Zou, H. A fast unified algorithm for computing group-lasso penalized learning problems. Stat. Comput. 2015, 26, 1129–1141.
50. Tang, Q.; Zhang, C. Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research. Insect Sci. 2013, 20, 254–260.
51. Chen, C.; Huang, Y.; Wu, S.; Qian, C. The Prediction of Opportunity-driven Entrepreneurship Based on XGBoost Algorithm. Sci. Technol. Prog. Policy 2023, 40, 14–22.
52. Huang, Z.; Shen, J.; Jian, W.; Fan, X.; Nie, W. Displacement prediction of rainfall-induced step-like landslide based on XGBoost model. J. Nat. Disasters 2023, 32, 217–226.
53. Prasad, A.M.; Iverson, L.R.; Liaw, A. Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems 2006, 9, 181–199.
54. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
55. Liu, J.; He, X.; Wang, P.; Huang, J. Early prediction of winter wheat yield with long time series meteorological data and random forest method. Trans. Chin. Soc. Agric. Eng. Trans. CSAE 2019, 35, 158–166.
56. Gislason, P.O.; Benediktsson, J.A.; Sveinsson, J.R. Random forests for land cover classification. Pattern Recognit. Lett. 2006, 27, 294–300.
57. Li, G.; Li, J.; Zhao, C.; Jiao, Y.; Yan, Q. Spatiotemporal dynamics of ecosystem services and their nonlinear influencing factors: A case study in the Qiantang River Basin. China Environ. Sci. 2022, 42, 5941–5952.
58. Andersen, C.M.; Bro, R. Variable selection in regression: A tutorial. J. Chemom. 2010, 24, 728–737.
59. Xu, Y.; Zhang, M.; Li, Q.; Xu, E.; Deng, L.; Deng, S.; Liu, Z.; Liang, H. Upscaling subalpine forest soil water-holding capacity based on vegetation and environmental factors: An example of the Zagunao River watershed in the upper reach of the Minjiang River in China. Acta Ecol. Sin. 2023, 43, 5614–5626.
60. Li, C.; Xiao, K.; Li, N. A Comparative Study of Support Vector Machine, Random Forest and Artificial Neural Network Machine Learning Algorithms in Geochemical Anomaly Information Extraction. Acta Geosci. Sin. 2020, 41, 309–319.

Figure 1. Research area.

Figure 2. Distribution areas of different forest types based on the resource management inventory.

Figure 3. Analysis framework.

Figure 4. Spatial changes in water yield (a) and water conservation capacity (b).

Figure 5. Statistical diagram of water conservation capacity for different forest types.

Figure 6. Trend chart of the harmonic parameter λ.

Figure 7. XGBoost model parameter tuning results.

Figure 8. Relationship between ntree parameters and errors in random forest models with different dominant tree species.

Figure 9. Sorting of the selected variables by absolute values of standard regression coefficients.

Figure 10. Ranking of the importance of the influencing factors (XGBoost model).

Figure 11. Ranking of the importance of the influencing factors (random forest model).

Figure 12. Verification of variable optimization simulation accuracy.

Figure 13. Validation of the simulation accuracy for introducing all factors into the random forest regression model.

Figure 14. Simulated values (a) and spatial distribution of errors (b) for water conservation in a random forest model.

Table 1. Data used in this study.

Type	Index	Data Source
Input data for the InVEST model	Root-restricting layer depth	The Food and Agriculture Organization (FAO)
Input data for the InVEST model	Several parameters related to vegetation and soil	Relevant studies [36,37,38,39,40,41] and the InVEST User Guide
Topographic features	Elevation, slope, geomorphology, slope direction, and slope position	The Geospatial Data Cloud (http://www.gscloud.cn/, accessed on 9 October 2020)
Climatic features	Annual precipitation and annual potential evapotranspiration	Meteorological data for 2020 from the National Meteorological Information Centre (NMIC) (https://data.cma.cn/, accessed on 19 February 2021)
Soil factor	Thickness of the humus layer, thickness of the litter, and soil thickness	Hunan Province’s ‘13th Five-Year Plan’ Forest Resources Type II Inventory Data (2020)
Soil factor	Parent material and soil type	Hunan Province’s ‘13th Five-Year Plan’ Forest Resources Type II Inventory Data (2020)
Forest stand factor	Constriction/coverage, average DBH, average height, number of plants per hectare, and volume per hectare	Hunan Province’s ‘13th Five-Year Plan’ Forest Resources Type II Inventory Data (2020)
Forest stand factor	Natural renewal level, forest species, reachability, origin, age group, forest type, and community structure	Hunan Province’s ‘13th Five-Year Plan’ Forest Resources Type II Inventory Data (2020)

Table 2. Precision evaluation of model training set.

Forest Types	Random Forest Model			Lasso Model			XGBoost
Forest Types	R²	RMSE	ARE	R²	RMSE	ARE	R²	RMSE	ARE
Arboreal Forests	0.832 **	33.931	0.0792	0.602 *	52.223	0.146	0.701 **	48.225	0.112
Shrub Forests	0.869 **	28.606	0.0386	0.640 *	58.310	0.166	0.622 *	48.637	0.118
Bamboo Forests	0.860 **	28.380	0.0375	0.557 *	62.558	0.177	0.569 *	53.416	0.157

** p = 0.01; * p = 0.05.

Table 3. Precision evaluation of model test set.

Forest Types	Random Forest Model			Lasso Model			XGBoost
Forest Types	R²	RMSE	ARE	R²	RMSE	ARE	R²	RMSE	ARE
Arboreal Forests	0.801 **	37.771	0.0842	0.562 *	52.337	0.149	0.692 **	46.832	0.105
Shrub Forests	0.867 **	28.328	0.0371	0.598 *	64.258	0.184	0.631 *	49.617	0.123
Bamboo Forests	0.749 **	44.775	0.0949	0.508 *	69.339	0.199	0.610 *	50.269	0.127

** p = 0.01; * p = 0.05.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
