1. Introduction
Accurate prediction of forest characteristics is vital for managing forest ecosystems, which serve a range of functions in the Earth’s life-support system [
1]. Moreover, because forests act as a carbon sink, their dynamics are closely related to the global carbon cycle and climate change [
2,
3,
4]. Canopy height is an important forest structure parameter for understanding the forest ecosystem, and has been used to estimate forest aboveground biomass and therefore model the global carbon stock and carbon dynamics [
5,
6,
7,
8].
Traditionally, canopy height has been measured accurately in the field using range finders. However, taking field measurements is highly labor-intensive and time-consuming, which constrains its use in generating spatially continuous canopy height products at large scales. Remote sensing technology can largely solve this problem, since it significantly increases the efficiency and reduces the cost of obtaining large-scale land surface observations. Optical passive remote sensing and radar have been used to estimate forest canopy height at different locations and scales [
4,
9,
10]. However, these derived canopy height products are usually fraught with “saturation” effects since optical and radar signals cannot penetrate forest canopies, which may result in canopy height being underestimated in areas dominated by tall trees [
11,
12].
Light detection and ranging (LiDAR), an active remote sensing technology, provides an alternative way to measure forest canopy height [
13,
14]. Through the use of a focused short-wavelength laser pulse, LiDAR can penetrate the forest canopy effectively and therefore measure three-dimensional forest structure [
15]. Many studies have shown that LiDAR can be used to estimate canopy height accurately at different locations and scales. For example, Su et al. estimated the wall-to-wall Lorey’s tree height distribution at 70 m resolution in the Sierra Nevada, US, using in situ tree height measurements, airborne LiDAR data, Geoscience Laser Altimeter System (GLAS) data, optical imagery, and other ancillary datasets [
14]. M. Clark, D. Clark, and Roberts [
16] estimated sub-canopy tree height of a rainforest by using small-footprint LiDAR data; Alexander et al. [
17] mapped the canopy height distribution of Denmark using airborne LiDAR data. However, the availability of LiDAR data at regional to global scales is limited. Airborne LiDAR, one of the most frequently used LiDAR data sources, is only available in certain areas due to the high cost of flight missions. The Geoscience Laser Altimeter System (GLAS) onboard the Ice, Cloud, and land Elevation Satellite (ICESat), retired in 2009, is the only available source of spaceborne LiDAR data with global coverage. However, its footprints are about 170 m apart along the track and tens of kilometers apart across tracks, which is too sparse for them to be used alone for large-scale canopy height mapping. The proposed Global Ecosystem Dynamics Investigation (GEDI) LiDAR system is designed for observing the 3D structure of the Earth at fine spatial resolution (i.e., at the 500 × 500 m pixel level), which is still not fine enough, and the system is not yet available (
https://gedi.umd.edu/). How to integrate LiDAR to improve the accuracy of large-scale canopy height mapping is still an active field of research.
Recently, the integration of LiDAR data with other remotely sensed data for large-scale canopy height mapping has been increasingly investigated [
14,
18,
19,
20,
21]. In these data fusion schemes, LiDAR data are usually used as the ground truth to train machine learning algorithms, which then predict canopy height from other remotely sensed variables. The large number of training samples provided by LiDAR data can significantly improve the accuracy of canopy height mapping using machine learning algorithms [
15]. For example, the support vector machine (SVM), which handles multi-dimensional data well and has effective generalization capabilities, has been applied to estimate canopy height by fusing LiDAR and imagery data [
22,
23]. However, the results of SVMs are not robust due to parameter assignment issues [
22]. The artificial neural network (ANN) is another kind of machine learning method that has also been widely used in canopy height and biomass estimation [
24]. Although ANN is able to handle non-linearity and non-normality issues, it is prone to over-fitting when trained with too much data [
25]. By contrast, Random Forest (RF) is robust to over-fitting, and has been used successfully in regression problems even with only dozens of samples [
26,
27]. Moreover, RF (1) is not limited by the distribution of covariates; (2) is not sensitive to outliers and noise; and (3) is a nonlinear method that can deal with high-dimensional data [
28,
29]. Overall, RF has been shown to outperform other machine learning algorithms in mapping canopy height through the fusion of multi-source remotely sensed datasets [
30].
Although the RF-based data fusion technique has been widely used, the influence of the vegetation types, locations, and spatial scales of different study areas on model transferability is rarely studied. A single regression tree model is usually built regardless of the size and location of the study area, which assumes that the ecological process is not scale-dependent and shows the same or similar patterns in different circumstances. However, this assumption is often violated in real ecological processes [
31,
32]. The modelling accuracy of these ecological processes is directly affected by heterogeneity in forests [
33], and usually in a nonlinear way [
34]. Locations, vegetation types, and spatial scales all affect the ecological modelling accuracy. For example, Simard et al. [
19] showed that the accuracy of their canopy height product was lower in protected areas with homogeneous low trees and flat terrain; Turner et al. [
35] found that the extent and grain of a landscape influenced ecological measurements and landscape patterns.
Therefore, this study analyzes the transferability of RF for mapping canopy height under different vegetation types, site locations, and spatial scales (in both extent and resolution). Ultimately, it aims to answer two questions: whether a universal RF model can be used to predict canopy height at different locations with various vegetation types, and how spatial scales affect an RF canopy height regression model. The results of this study can provide guidance for generating canopy height products from local to global scales through the fusion of LiDAR data and optical imagery.
5. Discussion
The RF-based method showed great potential to upscale canopy height estimations from LiDAR footprints to larger areas. However, the trained RF model performed differently in different areas, and the controlling factors also varied with study sites and vegetation types. In high-elevation ENF sites, topography-related parameters (i.e., elevation, slope, and aspect) were the more important factors for canopy height modelling, since high-elevation forests are more likely to be energy-limited and topographic parameters can determine the amount of solar radiation received by trees [
49,
50]. Annual mean precipitation and vegetation indices from Landsat imagery played a much bigger role in canopy height prediction for the low to medium elevation EBF and DBF sites. This may be because broadleaf forests are more water-limited than energy-limited, as broadleaf forests intercept less precipitation than coniferous forests, and because the vegetation indices from optical imagery can represent forest structure and vegetation health well [
51,
52,
53]. The four MF forests had site-specific variable importance, since the composition of tree species and the topographic and climate conditions may change significantly among sites. Moreover, it was found that the RF-based canopy height prediction accuracy decreased with increasing elevation (
Table 1 and
Table 4), which is consistent with the suggestion that ecological modelling studies were preferable in low-elevation areas [
54]. Integrating spaceborne LiDAR data, which can provide forest structure information, may further help to improve the model performance in high-elevation areas [
14].
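The way site-specific controlling factors can be ranked from a fitted RF model can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn is available; the variable names, the data-generating formula, and the use of permutation importance are assumptions made for the example and are not necessarily the importance measure or the predictors used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 500
names = ["elevation", "precipitation", "ndvi"]  # hypothetical predictors

# Synthetic, normalized predictors in [0, 1]; not data from this study.
X = rng.uniform(0.0, 1.0, size=(n, 3))
# Hypothetical energy-limited site: canopy height (m) is driven mostly by
# elevation, weakly by the vegetation index, and not at all by precipitation.
y = 35.0 - 20.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(0.0, 1.0, n)

# Train on the first 400 samples and keep the last 100 for evaluation.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], y[:400])

# Permutation importance: the drop in held-out score when one predictor
# is shuffled, averaged over n_repeats shuffles.
result = permutation_importance(
    model, X[400:], y[400:], n_repeats=5, random_state=0
)
order = np.argsort(result.importances_mean)[::-1]
ranking = [names[i] for i in order]
print(ranking)
```

Repeating the same ranking for each study site would show whether the dominant predictor changes between sites, which is the pattern discussed above for the ENF, EBF, DBF, and MF sites.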
The performance of the site-specific canopy height prediction model shows that a single RF model trained within a LiDAR footprint cannot be used to upscale canopy height estimations in a larger area, which may be composed of different vegetation types, topographic characteristics, and climate conditions. The reason may be that the RF canopy height prediction model trained within one LiDAR footprint was very sensitive to the study site location [
55]. The canopy height prediction accuracy dropped significantly when a model was used to estimate canopy height at other study sites, even those with the same vegetation type (
Figure 4), and the weakest transferability of the site-specific RF canopy height prediction model appeared at the four MF sites. As mentioned above, the complexity in vegetation composition and in topographic and climate conditions may lead to significantly different variables controlling tree growth. In addition, MF sites naturally have more complex vegetation, topographic, and climate conditions compared to the other three vegetation types, which may be the reason for their weak transferability. Moreover, the model transferability was also largely diminished among different vegetation types (
Figure 5), which may also be because different vegetation types have different controlling factors for tree growth (
Figure 3).
However, the weak transferability of RF models can be resolved by including representative training samples from all study sites and vegetation types. The location barrier of the RF canopy height prediction model can be removed by using the training samples from all study sites of each vegetation type (
Table 5). It has been documented that the heterogeneity in the training samples from different study sites is beneficial for ecological modelling, and training samples from a range of environmental sets should be included to improve the model performance [
56,
57]. The four mixed models of each vegetation type can even slightly improve the canopy height prediction accuracy, except at the MF1 site. The R² value derived from Model MFm dropped by around 3% compared with the model trained from its own samples. A possible explanation is that MF forests naturally contain mixed vegetation types with sufficient heterogeneity, so the RF model may not gain improvements from additional training samples from different study sites. Moreover, errors introduced in the data pre-processing steps, such as different LiDAR data qualities at different study sites and spatial mis-registration among different data sources, may also influence the performance of mixed models [
55].
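The contrast between a site-specific model and a model trained on pooled samples can be sketched with synthetic data. The example below is only an illustration, assuming scikit-learn is available: the two sites, the tent-shaped height function, and the predictor names are hypothetical and do not reproduce the data or models of this study. Site A covers low elevations and site B covers high elevations, so a model trained only at site A never sees the conditions of site B.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def make_site(elev_low, elev_high, n=400):
    # Hypothetical site: canopy height (m) peaks at a normalized elevation
    # of 1.0 and rises with a vegetation index. Not data from this study.
    elev = rng.uniform(elev_low, elev_high, n)
    vi = rng.uniform(0.2, 0.9, n)
    height = 30.0 - 15.0 * np.abs(elev - 1.0) + 10.0 * vi + rng.normal(0.0, 1.0, n)
    return np.column_stack([elev, vi]), height

def split(X, y, n_train=300):
    # First n_train samples for training, the rest held out for testing.
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def fit_rf(X, y):
    return RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

Xa_tr, ya_tr, Xa_te, ya_te = split(*make_site(0.0, 1.0))  # low-elevation site A
Xb_tr, yb_tr, Xb_te, yb_te = split(*make_site(1.0, 2.0))  # high-elevation site B

# Site-specific model (site A only) versus a model pooling both sites.
site_model = fit_rf(Xa_tr, ya_tr)
pooled_model = fit_rf(np.vstack([Xa_tr, Xb_tr]), np.concatenate([ya_tr, yb_tr]))

r2_within = r2_score(ya_te, site_model.predict(Xa_te))     # A model tested at A
r2_transfer = r2_score(yb_te, site_model.predict(Xb_te))   # A model tested at B
r2_pooled = r2_score(yb_te, pooled_model.predict(Xb_te))   # pooled model at B
print(f"A->A: {r2_within:.2f}  A->B: {r2_transfer:.2f}  pooled->B: {r2_pooled:.2f}")
```

In this toy setting the site A model performs well at site A but poorly at site B, because an RF cannot extrapolate beyond the predictor range it was trained on, while the pooled model recovers the accuracy at site B. Real sites differ in more subtle ways, so the size of the effect shown here should not be read as an estimate for actual forests.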
Moreover, the transferability of an RF model can be further improved by selecting representative training samples covering all vegetation types, so that a universal canopy height prediction model for the continental and maybe even the global scale can be obtained. The canopy height estimated from a universal RF model was almost as accurate as that from the model built for each vegetation type (
Table 6). Vegetation type was not a necessary input for generating the universal RF canopy height prediction model. This may indicate that the RF algorithm can pick out the vegetation type information from the training samples if they can well represent all vegetation types [
58].
In addition, the spatial extent of LiDAR footprints and the targeted spatial resolution of canopy height estimations are two important factors that influence model performance. The canopy height prediction accuracy was positively influenced by both the spatial extent and the spatial resolution (
Figure 6 and
Figure 7). This might be mainly caused by two reasons: (1) heterogeneity increases sharply at first as the spatial extent grows and the spatial resolution becomes coarser (bigger pixel size), and then gradually reaches a sill; this corresponds to the rule that the number of patches changes quickly at first, producing large changes in heterogeneity, and stabilizes at larger extents and coarser resolutions [
59,
60]; (2) the number of training samples increases with the spatial extent and resolution, which can indirectly increase the spatial heterogeneity. This influence was non-linear because the mean values of different variables may not change at the same rate, which brings inconsistency between the various variables and canopy height at each pixel. Larger variance (errors) may arise at coarser scales, as was also suggested by Moody and Woodcock [
61]. Moreover, the influence of spatial resolution varied more significantly than that of extent, because when the resolution decreases, not only does the number of training pixels decrease, but the heterogeneity of each pixel might also be reduced. In addition, spatial scale had less influence in broadleaf forests (i.e., EBF and DBF), which may be caused by the more stable forest characteristics of broadleaf forests [
62]. Therefore, we need to comprehensively consider the influence of spatial extent and resolution when we build the RF model in various vegetation types.
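How a coarser target resolution changes both the number of available pixels and the variation among them can be sketched with a simple block average. The example is a minimal illustration using NumPy; the function name `aggregate`, the aggregation factors, and the synthetic canopy height raster are assumptions for the example, not the resampling procedure used in this study.

```python
import numpy as np

def aggregate(raster, factor):
    """Block-average a 2D raster so that each output pixel covers
    factor x factor input pixels (i.e., a coarser spatial resolution)."""
    rows, cols = raster.shape
    if rows % factor or cols % factor:
        raise ValueError("raster dimensions must be divisible by factor")
    blocks = raster.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
# Hypothetical canopy height raster (m); spatially uncorrelated noise,
# which is a simplification of a real forest.
chm = rng.gamma(shape=4.0, scale=5.0, size=(240, 240))

for factor in (1, 2, 4, 8):
    coarse = aggregate(chm, factor)
    # Number of pixels (potential training samples) and between-pixel variance.
    print(factor, coarse.size, round(float(coarse.var()), 2))
```

Each step to a coarser resolution reduces both the number of pixels and the between-pixel variance. Because the synthetic raster has no spatial autocorrelation, the variance here falls faster than it would in a real, spatially structured canopy.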
“All models are wrong, but some are useful” [
63]. This is especially true for the transferability of models built from remote sensing data [
64,
65]. This study can inform the use of RF models not only to predict regional to global-scale canopy height but also to estimate biomass and other forest parameters. It can provide guidance on how to choose enough LiDAR data footprints to ensure the modelling accuracy. However, the current study did not consider the temporal transferability of the RF-based method in estimating forest parameters, and we did not compare the RF algorithm with other machine learning algorithms. These questions need to be further explored in the future.
6. Conclusions
This study aims to evaluate RF model transferability among various vegetation types, locations, and spatial scales through the combination of multi-source remotely sensed datasets. In total, 16 study sites with an area of 100 km² covering four vegetation types (four study sites for each vegetation type) were chosen for this purpose. The results demonstrate that the RF method can be used to upscale canopy height measurements from airborne LiDAR footprints to areas without LiDAR coverage. However, the RF models built at different study sites have different controlling factors on the canopy height estimations, which constrain their transferability among different locations and vegetation types. An RF model built in a specific location cannot transfer to other locations, and an RF model built in one vegetation type cannot transfer to other vegetation types. By including the training samples from all study sites of each vegetation type, a universal RF canopy height prediction model can be achieved for each vegetation type with accuracy equivalent to that of site-specific models. A universal RF canopy height prediction model can even be generated by including training samples covering all vegetation types and study sites, but vegetation type alone, as a dummy variable, does not play a significant role in the RF model. In addition, the spatial extent of the LiDAR footprint and the spatial resolution of the targeted canopy height products both have a positive correlation with the canopy height prediction accuracy. They need to be carefully evaluated in the RF-based canopy height upscaling process based on the study extent, the available airborne LiDAR data, and the available ancillary datasets.