1. Introduction
Forests are an essential component of the carbon cycle because they both store carbon in their biomass and release it into the atmosphere. Globally, forest ecosystems contain approximately 80% of the aboveground and 40% of the belowground biomass [
1]. Knowledge of the amount of biomass and carbon storage is essential for forest management and planning [
2]. Quantifying the biomass available in forests through field measurements is commonly resource-intensive. Remote sensing techniques integrated with geographic information systems (GISs) provide quick access to useful information, typically at short revisit intervals and at lower cost [
3]. Combining remotely sensed data with nonspectral ancillary data such as those produced by field sampling has been suggested by many studies as a way to reach better estimates [
4]. A variety of remotely sensed data, such as those from the Landsat, Sentinel, SPOT, and ALOS missions, have been used to estimate the volume of wood and biomass stored in forests [
5,
6,
7,
8,
9,
10,
11,
12,
13].
Aboveground biomass (AGB) estimation methods include field measurements and remote sensing approaches [
14,
15]. Field measurement relies on two main methods to estimate the AGB: destructive (harvesting) and nondestructive. Although the destructive method is accurate and useful for developing equations to assess AGB over larger areas, it is usually restricted to a few trees because it is time-consuming, difficult to implement, and expensive [
16]. The nondestructive method is an alternative for estimating the AGB. It is implemented either by climbing trees to measure different tree parts or, more commonly, by measuring the diameter at breast height (DBH) and tree height; other options include estimating volume and density using allometric equations or remote imagery [
17,
18]. As a nondestructive method, remote sensing is based on previously developed allometric equations.
The techniques used for estimating the AGB of forests based on remotely sensed data can be divided into two categories, namely those using parametric (statistical regression methods) and nonparametric algorithms, respectively [
7]. Nonparametric techniques, including Machine Learning (ML) algorithms such as the k-Nearest Neighbor (kNN), Artificial Neural Networks (ANNs), and Random Forests (RFs), were found to be better at identifying complex relations between the predictors and the AGB [
7,
19]. For instance, ANNs are considered important nonparametric algorithms for estimating forest-related parameters [
20]. In addition, the kNN algorithm has received considerable attention because it is easily accessible, and some literature reviews have shown that it can substantially increase precision when estimating vegetation parameters [
21,
22,
23]. RF regression algorithms have also been widely used for quantifying forest biophysical parameters [
5,
24,
25,
26]. RF is an ensemble learning algorithm with applications in classification and regression problems. It was developed by Breiman [
27] and can be used to predict continuous and categorical dependent variables. Each regression tree is built from a random sample of observations drawn with replacement, together with a random subset of the explanatory variables [
28].
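The sampling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the cited works; the function names `bootstrap_rows` and `split_candidates` and all parameter values are hypothetical. It assumes the common formulation in which rows are bootstrapped once per tree and a fresh subset of k predictors is drawn at each node.

```python
import random

def bootstrap_rows(n_obs, rng):
    """Draw n_obs row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_obs) for _ in range(n_obs)]

def split_candidates(n_features, k, rng):
    """Draw k distinct predictor indices to be tried at one node (hypothetical helper)."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
rows = bootstrap_rows(n_obs=10, rng=rng)    # indices may repeat
features = split_candidates(n_features=19, k=6, rng=rng)
# Observations never drawn for this tree are "out of bag" and can serve for error estimation.
out_of_bag = sorted(set(range(10)) - set(rows))
```

Because rows are drawn with replacement, some observations appear more than once and others are left out of a given tree, which is what makes the trees differ from one another.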
Traditionally, AGB has been estimated worldwide by destructive methods, which are used to develop allometric equations from parameters measured on harvested trees (e.g., DBH, tree height, and timber volume) [
29]. However, applying allometric equations across a large study area is cumbersome and sometimes impractical because the required field measurements are scarce or unavailable. In comparison, remote sensing techniques can provide accurate biophysical information over large areas to complement forest inventory data. Hence, remote sensing data combined with parametric and nonparametric modeling techniques, including machine learning algorithms, have been widely used to estimate forest AGB in the past decade. For example, Muukkonen and Heiskanen [
30] predicted the AGB in boreal forests using ANNs applied to ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) data. IRS P6 LISS-III (Indian Remote-Sensing Satellite-P6 Linear Imaging Self-Scanning Sensor-3) data were used by Yadav et al. [
31] to estimate the AGB in the Timli forests of India. In their research, the kNN method based on the Mahalanobis distance performed best, with an RMSE of 42.25 Mg/ha, followed by the fuzzy and Euclidean distances, with RMSEs of 44.23 Mg/ha and 45.13 Mg/ha, respectively. Lu et al. [
32] showed that the estimation of AGB in Amazon forests using Landsat-5 TM data is more accurate in young than in mature stands. Ronoud et al. [
33] found that the Landsat-5 TM NIR (near-infrared) band exhibited the highest correlation with AGB (r = 0.427). Several studies have used Sentinel-2 data to estimate AGB in various ecosystems, including semiarid [
34], Mediterranean [
35,
36], temperate [
7,
37,
38], tropical [
37,
39,
40], subtropical [
41,
42] and boreal [
43,
44] forests, and grasslands [
45]. For example, Chrysafis et al. [
46] compared Sentinel-2 MSI (MultiSpectral Instrument) and Landsat-8 OLI (Operational Land Imager) imagery for forest growing stock volume (GSV) estimation in a mixed Mediterranean forest in northeastern Greece. GSV was modeled using RF regression based on spectral bands and vegetation indices. They showed that, for estimating GSV, Sentinel-2 data (R2 = 0.63, RMSE = 63.11 m³/ha) performed slightly better than Landsat-8 OLI data (R2 = 0.62, RMSE = 64.40 m³/ha). According to Castillo et al. [
37], the red and red-edge bands of Sentinel-2 combined with elevation data provided the best estimates of AGB in mangrove forests of the Philippines when using machine learning methods. Nuthammachot et al. [
47] assessed the potential of seven vegetation indices derived from Sentinel-2 images for estimating the AGB in a private forest of Indonesia. They found that, among other indices, including the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Modified Simple Ratio (MSR), Simple Ratio (SR), Sentinel-2 Red-Edge Position (S2REP), and Greenness Normalized Difference Vegetation Index (GNDVI), the Normalized Difference Index (NDI45) exhibited the strongest correlation with AGB (
r = 0.89,
R2 = 0.79). In addition, they found that the NIR spectral band of the Sentinel-2 was the most effective variable in retrieving forest standing volume when using the kNN algorithm. They estimated the standing volume with a relative
RMSE of 22.94%. Research by Pandit et al. [
42] evaluated the usefulness of Sentinel-2 data for estimating the AGB in protected forests from Nepal using the RF algorithm. The effect of the number of input variables, including spectral band values and spectral-derived vegetation indices on the AGB prediction, was also investigated. The model using all spectral bands, in addition to the derived vegetation indices, provided better AGB estimates (
R2 = 0.81 and
RMSE = 25.57 t/ha). Vafaei et al. [
48] assessed ALOS-2 (Advanced Land Observing Satellite 2) and Sentinel-2 data for AGB estimation in the Asalem forests of Iran using four machine learning methods, namely the Gaussian process (GP), support vector regression (SVR), RF, and Multi-Layer Perceptron Neural Networks (MLP Neural Nets, MLP NNs). In their study, an SVR model yielded the best performance in estimating forest AGB. It combined Sentinel-2 spectral information (blue, green, red, and NIR bands), six vegetation indices, namely SVI (Simple Vegetation Index), RVI (Ratio Vegetation Index), NDVI, EVI-2 (Enhanced Vegetation Index 2), PVI-2 (Perpendicular Vegetation Index 2), and SAVI (Soil Adjusted Vegetation Index), and ALOS-2 PALSAR2 (Phased-Array-type L-band Synthetic Aperture Radar 2) imagery in the HH (horizontal transmit and horizontal receive), HV (horizontal transmit and vertical receive), VV (vertical transmit and vertical receive), and VH (vertical transmit and horizontal receive) polarizations.
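Several of the vegetation indices named above are simple band ratios. As a minimal sketch, assuming surface reflectance inputs and the usual Sentinel-2 band assignment (B3 green, B4 red, B5 red edge, B8 NIR), they could be computed as shown here. The function names and the sample reflectance values are illustrative only and do not come from the cited studies.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b); 0.0 when a + b == 0 (an assumption to avoid dividing by zero)."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def vegetation_indices(green, red, red_edge_b5, nir):
    """Compute four common indices from single-pixel reflectances (red must be nonzero for SR)."""
    return {
        "NDVI": normalized_difference(nir, red),           # (NIR - Red) / (NIR + Red)
        "GNDVI": normalized_difference(nir, green),        # (NIR - Green) / (NIR + Green)
        "NDI45": normalized_difference(red_edge_b5, red),  # (B5 - B4) / (B5 + B4)
        "SR": nir / red,                                   # NIR / Red
    }

# Made-up reflectances for a densely vegetated pixel (illustration only).
example = vegetation_indices(green=0.06, red=0.04, red_edge_b5=0.10, nir=0.36)
```

The normalized-difference indices are bounded between −1 and 1, whereas the simple ratio is unbounded, which is one reason the two families can behave differently at high biomass.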
Data saturation often causes problems in estimating forest AGB when dealing with high amounts of biomass or high-canopy-density areas [
49]. This problem was addressed by combining Sentinel-2 and ALOS2-PALSAR2 data [
48]. The studies mentioned above, which evaluated the utility of remotely sensed data for estimating forest standing volume and AGB, report inconsistent performance and outcomes because of differences in forest conditions, satellite data, and methodology, as well as the specific limitations of each study.
In Iran, an area of ~10.7 million hectares is covered by forests accounting for ca. 7.4% of the country’s territory [
50]. Hyrcanian forests are the most important forests among the five vegetation regions in Iran due to the density, canopy cover, and diversity in this ecoregion [
51,
52]. They cover ~2 million hectares and are located on the south coast of the Caspian Sea [
53]. For these forests, management plans are updated every ten years in terms of qualitative and quantitative attributes, and collecting the required data is time-consuming and costly. In contrast, remotely sensed imagery holds promising potential for monitoring and continuously predicting forest attributes. In conjunction with satellite data, field data can be used to create a continuous map of forest attributes through classification or regression. Therefore, forest attributes have been estimated from remote sensing data with various spatial resolutions, ranging from very high to medium.
To the best of our knowledge, this is the first study attempting to estimate the AGB using remotely sensed data and machine learning algorithms in pure common hornbeam (Carpinus betulus L.) forests, a typical forest type in the temperate forest region of many European and Asian countries. The study was motivated by the considerations above, as well as by the fact that pure stands of common hornbeam are distributed from 200 to 1800 m a.s.l., from the western part of the Hyrcanian region, characterized by a very humid climate, to its eastern part, characterized by a humid climate [
54]. Accordingly, this study aimed to evaluate the usefulness of Sentinel-2 imagery and several machine learning algorithms for estimating the AGB of C. betulus forests located in the Patom and Namkhane districts of Kheyrud forest, Northern Iran. The objectives of the study were the following: (i) to compare the performance of different AGB estimation approaches, including parametric (i.e., Multiple Regression, MR) and nonparametric algorithms (ANN, kNN, and RF), and (ii) to investigate the potential of Sentinel-2 imagery for improving the accuracy of AGB estimation under the given conditions of the study.
3. Results
Based on the in situ measurements, the minimum, maximum, and mean values of the AGB for C. betulus stands were estimated at 118, 320, and 210 t/ha, respectively, with a standard deviation of 60 t/ha (Figure 3; Table A1); the variance was high (3588 (t/ha)²), indicating that the data were widely spread around the mean (Table A1). The results of the normality test indicated a normal distribution of both in situ and remotely sensed data. Based on Pearson’s correlation coefficient, a negative association was found between spectral information and in situ AGB (Table 2). Band 6 of the Sentinel-2 data showed the strongest correlation with in situ AGB (r = −0.723; Table 2).
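For readers who want to reproduce this kind of summary, the following is a minimal sketch of how the variance and Pearson's r can be computed with the Python standard library. The plot values are invented for illustration and are not the study's data; `pearson_r` is a hypothetical helper name.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Invented example values (not the study's plots).
agb_t_ha = [120.0, 160.0, 200.0, 260.0, 310.0]
band6_reflectance = [0.30, 0.27, 0.26, 0.22, 0.20]

sd = statistics.stdev(agb_t_ha)           # sample standard deviation, in t/ha
variance = statistics.variance(agb_t_ha)  # sample variance, in (t/ha)^2, equal to sd squared
r = pearson_r(agb_t_ha, band6_reflectance)  # negative: reflectance falls as AGB rises
```

Note that the variance carries squared units, so it is the standard deviation, not the variance, that is directly comparable with the mean.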
The results of the AGB prediction using MR indicated that the backward elimination procedure (R2adj = 0.65, %RMSE = 24.72) outperformed both the linear regression that used all the variables and the stepwise regression model (Table 3).
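The accuracy measures quoted throughout this section can be sketched as follows. This is a minimal sketch that assumes %RMSE is the RMSE expressed as a percentage of the mean observed AGB and that R2adj uses the standard adjustment for the number of predictors; the function name and the sample values are illustrative and are not taken from the study.

```python
import math

def regression_metrics(observed, predicted, n_predictors):
    """Return RMSE, relative RMSE (%), R2, and adjusted R2 for one model."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes additional predictors; requires n > n_predictors + 1.
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {
        "RMSE": rmse,
        "%RMSE": 100.0 * rmse / mean_obs,  # assumption: relative to the observed mean
        "R2": r2,
        "R2adj": r2_adj,
    }

# Invented observed and predicted AGB values in t/ha (illustration only).
metrics = regression_metrics(
    observed=[100.0, 200.0, 300.0, 400.0],
    predicted=[110.0, 190.0, 310.0, 390.0],
    n_predictors=1,
)
```

Because the adjusted R2 decreases when a predictor adds little explanatory power, it is a reasonable basis for comparing models with different numbers of variables, such as full, backward, and stepwise regressions.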
Table 4 shows the performance of the kNN models that included all the variables and used four distance metrics (Euclidean, Squared Euclidean, Manhattan, and Chebyshev). The best distance metric for the kNN algorithm was the Manhattan distance, which returned the lowest %RMSE and the highest R2 (Table 4).
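To make the role of the distance metric concrete, the sketch below implements an unweighted kNN regressor with the four metrics named above. It is a minimal illustration, not the software used in the study; the function names, the two-feature training points, and the AGB values are invented.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

METRICS = {
    "euclidean": euclidean,
    "squared_euclidean": squared_euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
}

def knn_predict(train_x, train_y, query, k, metric):
    """Predict the mean response of the k training points closest to query."""
    dist = METRICS[metric]
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    return sum(y for _, y in ranked[:k]) / k

# Two invented training plots; which one is "nearest" depends on the metric.
train_x = [(3.0, 0.0), (2.0, 2.0)]
train_y = [150.0, 250.0]
predictions = {
    name: knn_predict(train_x, train_y, (0.0, 0.0), k=1, metric=name)
    for name in METRICS
}
```

In this toy case the Manhattan distance selects a different neighbor than the other three metrics. The Euclidean and squared Euclidean distances always rank neighbors identically, so in an unweighted kNN they can differ only if the implementation also weights neighbors by distance.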
The ANN, fitted as an MLP NN model with an input layer containing all variables and two hidden layers, produced a relative RMSE of 19.93% during the validation phase (Table 5). The sensitivity analysis indicated that PC1 was the most effective variable for estimating AGB.
As mentioned before, the performance of the RF algorithm depends on choosing the optimal number of trees and the number of predictors (k) tried at each node. Figure 4 shows the average squared error against the number of trees used for AGB estimation with RF during the training and testing phases. The optimal number of trees corresponds to the point beyond which the error no longer changes as more trees are added (Figure 4). The improvement in accuracy was slow after about 220 trees; therefore, this number was taken as a good estimate of the optimum (Figure 4). Based on the variable importance values obtained from the sensitivity analysis, spectral band 6 of Sentinel-2 was the most effective variable. In this study, the best RF model estimated AGB with a relative RMSE of 22.55% for k set at 6 (Table 5).
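The plateau rule used to pick the number of trees can be expressed as a small helper. The sketch below is one possible formalization under an assumed relative tolerance; the function name, the tolerance, and the error curve are hypothetical and do not reproduce the curve in Figure 4.

```python
def plateau_tree_count(tree_counts, errors, rel_tol=0.01):
    """Return the first tree count from which every later step improves
    the error by less than rel_tol times the current error."""
    for i in range(len(errors) - 1):
        settled = all(
            (errors[j] - errors[j + 1]) < rel_tol * errors[j]
            for j in range(i, len(errors) - 1)
        )
        if settled:
            return tree_counts[i]
    return tree_counts[-1]  # no plateau found within the evaluated range

# Invented error curve (mean squared error versus number of trees).
tree_counts = [50, 100, 150, 200, 250, 300]
errors = [900.0, 700.0, 620.0, 600.0, 598.0, 597.0]
chosen = plateau_tree_count(tree_counts, errors)
```

A stricter tolerance pushes the chosen count higher, so the tolerance is best reported alongside the selected number of trees.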
4. Discussion
Previous studies have found that remote sensing-based models for AGB estimation are more accurate than empirical-based and GIS-based models [
32]. In this study, Sentinel-2 data were used to estimate the AGB in pure stands of C. betulus in a part of the Hyrcanian forest, Iran. A total of 19 variables, including original spectral bands, vegetation indices, and the first principal component of PCA (applied to all original bands), were used for estimation. In situ AGB was found to be negatively correlated with all variables. The strongest correlations were between the AGB and the two spectral bands located at the red edge (0.731–0.749 µm wavelength) and shortwave infrared (1.539–1.681 µm wavelength), with r values of −0.723 and −0.716, respectively. The negative correlation between biomass and spectral values has been discussed in many studies [
9,
10,
11,
90] and is thought to be caused by canopy shadowing, canopy size, stand volume and density, and, consequently, a more complex vertical structure of the forests. Shadowing reduces the spectral reflectance of forests [
91]. In addition, the fraction of vegetation cover (FVC) of the ground at the pixel level is another factor affecting the radiation behavior at the canopy level, particularly in taller stands [
92,
93], as was the case for the forests in this study.
The higher spectral radiances of low-density forests characterized by less biomass can be partially explained by a smaller amount of shadows resulting in a higher contribution of the soil to the spectral radiance [
12,
91]. The age of the studied stands could be another reason for the negative correlation between the amount of AGB and their corresponding spectral values [
13,
94]. In older stands, as was the case in this study, the canopy grows larger [
95], which increases the canopy surface area and the size and number of holes in the canopy [
8,
94]. A larger canopy surface area can reduce reflection because the holes created in the tree crowns cause the electromagnetic waves to spread through the crown [
94]. In addition, as trees age, their water requirements increase. As the amount of water in the leaves increases, more electromagnetic radiation is absorbed and reflection is reduced. Furthermore, as forest stands age, the number of stories usually increases, causing more propagation of the electromagnetic waves within the stand and ultimately a reduction in spectral reflectance [
10,
96]. On the other hand, a positive correlation between biomass and spectral reflectance was reported by different researchers [
33,
47] and was explained by specific characteristics of the study site such as the vertical structure of forest stands, canopy cover percentage, forest health and vitality, species composition, and soil properties. In this study, the relatively strong, though negative, correlation between AGB and B6 led to this variable being retained in the backward and stepwise regression models (Table 3).
Our results indicated that nonparametric models performed better than MR, and the best result was obtained with an ANN, which yielded a relative RMSE of 19.93%. This is in agreement with the findings of Vafaei et al. [
48] (relative
RMSE = 19.17%) and close to those of Gao et al. [
19] (relative
RMSE = 28.8%). The ability to learn during training and to generalize on new datasets makes ANN more powerful and flexible than MR [
7,
97]. Past research has suggested that when too few sample plots are available, parametric models can perform poorly, while nonparametric models may lead to more accurate predictions [
98]. The ANN, as a nonparametric mathematical model, is conceptually similar to biological neural networks and has excellent linear and nonlinear fitting capabilities [
7]. This advantage is mainly due to the ability of nonparametric models to handle nonlinear relations between variables from multiple sources [
34]. By comparing the performance of algorithms for forest AGB estimation on ALOS PALSAR and Landsat data, Gao et al. [
19] concluded that ANN performed better than RF. For the temperate forest of China, Chen et al. [
7] concluded that ANN was more accurate than regression, SVR, and RF algorithms in assessing the biomass of broadleaved deciduous forests. As this study suggests, the higher performance of nonparametric algorithms could be due to the complex relations between AGB and remote sensing variables, which are difficult to capture with parametric algorithms. In addition, nonparametric algorithms are more flexible because they relax limitations such as assumptions about the data distribution and the functional form of the mathematical relation between independent and dependent variables. For instance, Lu et al. [
32] argued that nonparametric algorithms are better suited to building complex nonlinear biomass models because they do not explicitly predefine the model structure but determine it in a data-driven manner.
As in many other studies, addressing data uncertainty is important. In this study, data uncertainty may be associated with GPS errors in locating the sample plots, possible errors in the local volume table, the limited suitability of the available allometric models for calculating the AGB, and spectral interference from other species present in the plots. In addition, optical data produced by the Sentinel-2 mission cannot penetrate the forest canopy, which prevents them from capturing information about woody understories. On the one hand, a larger canopy surface increases the size and number of holes in the canopy, and as trees grow in volume they cast more shadow, which reduces reflection [
99]. On the other hand, water on the leaves and increased water availability also reduce reflectance [
Many studies have indicated that integrating multisensor information from optical, radar, and lidar platforms can improve biomass estimation accuracy [
32,
100]. Furthermore, several points must be considered to improve AGB estimation from Sentinel-2 optical data. Because vegetation cover and trees with a DBH of less than 7.5 cm are not typically considered when calculating stand volume, studies should be carried out in areas without vegetation cover and small trees, or during the time of year when the vegetation cover is absent. Reflectance varies during the year because of changes in leaf color, water availability, and stand structure; therefore, in situ measurements should be performed close to the time of satellite image acquisition. In addition, further studies should clarify the effects of water availability, saturation, canopy cover, vegetation cover, and undergrowth vegetation on canopy reflectance along a continuum of canopy closure. Because our study relied on a limited number of plots for modeling and assessment, one option for future work is to examine how field sampling effort affects the accuracy of the estimates. Another option would be to use a leave-one-out cross-validation (LOOCV) procedure to improve the results [
101]. Nevertheless, the approach described herein has been commonly used in previous studies [
102,
103,
104,
105].
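The LOOCV option mentioned above can be sketched as follows. This is a minimal illustration that uses a one-predictor least-squares line as a stand-in model; it is not the procedure of the cited work, and the function names and plot values are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loocv_rmse(x, y):
    """Hold out each plot once, refit on the rest, and pool the squared errors."""
    squared_errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        squared_errors.append((y[i] - (slope * x[i] + intercept)) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented plot data: one reflectance predictor and AGB in t/ha.
band6_reflectance = [0.30, 0.27, 0.26, 0.22, 0.20]
agb_t_ha = [120.0, 160.0, 200.0, 260.0, 310.0]
rmse_loocv = loocv_rmse(band6_reflectance, agb_t_ha)
```

With LOOCV every plot is used once for validation and all remaining plots for fitting, which makes fuller use of a small sample than a single train and test split, at the cost of refitting the model once per plot.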