1. Introduction
Reducing deforestation and forest degradation and promoting their large-scale restoration are promising nature-based solutions to combat climate change [
1]. Several governments, private companies, non-governmental organizations (NGOs), and multilateral organizations have made ambitious pledges to promote forest protection and restoration at unprecedented scales, putting these activities at the core of United Nations’ Sustainable Development Goals and Decade on Ecosystem Restoration (2021–2030) [
2]. One major challenge for integrating deforestation, forest degradation, and restoration into climate mitigation initiatives, such as those based on payments for ecosystem services (PESs), is to accurately differentiate forests under degradation or regeneration processes from relatively intact forests. Depending on the stage of degradation and regeneration processes, degraded and secondary forests can be structurally similar but perform different ecological functions, such as carbon storage [
3]. Thus, the discrimination of forests under anthropogenic influences is essential for the reliable accounting of climate mitigation interventions and for reducing the uncertainties of carbon stock estimates [
4]. One of the most promising ways to better characterize land use and land cover (LULC) is through advanced remote sensing approaches.
Remote sensing is an effective source of information on forest traits on the landscape scale. Specifically, data from passive multispectral sensors have been commonly used to differentiate LULC types [
5,
6]. Despite the potential of such data in LULC studies, the great heterogeneity and complexity of tropical forests pose a challenge in obtaining accurate information on vegetation disturbance through conventional approaches. Thus, advanced remote sensing technologies, such as Hyperspectral Imaging (HSI) and Light Detection And Ranging (LiDAR), provide new opportunities to answer complex ecological questions in tropical forests experiencing degradation and regeneration processes.
HSI systems acquire data in many narrow and contiguous spectral bands, generating high-resolution reflectance spectra per pixel [
7]. The ability of HSI to extract more accurate and detailed information compared with other passive remote technologies makes it suitable for a wide variety of applications. Examples include the classification of tree species or LULC classes [
8,
9,
10,
11], the identification of physiological responses to stress [
12], the estimation of biochemical attributes [
13], the detection of burned areas [
14], and the study of the canopy phenology [
15,
16]. According to Thenkabail et al. [
17], there are advantages of using hyperspectral data over multispectral imagery for improving the classification accuracy of complex rainforests such as those from southern Cameroon.
Active LiDAR sensors produce three-dimensional measurements of forests, allowing the quantification of important structural attributes such as canopy height, Leaf Area Density, and biomass [
18]. That accurate structural information has allowed the discrimination of successional stages in tropical forests [
19,
20]. However, LiDAR systems currently capture limited spectral information, which creates difficulties in distinguishing structurally similar forests with distinct species composition or under stress conditions. Thus, the spectral information from HSI can complement the structural information provided by LiDAR. As reported by Almeida et al. [
21], the combination of LiDAR and hyperspectral data is useful to increase the accuracy of aboveground biomass (AGB) estimates over heterogeneous human-modified landscapes of the Brazilian Amazon. In this context, the integration of LiDAR and hyperspectral data can potentially improve the characterization of the degradation and regeneration status of tropical forests. Despite the potential of this synergism, few studies have tested the combined use of LiDAR and HSI data for the classification of land cover types and successional stages in tropical ecosystems. For instance, in the tropical dry forests of Costa Rica, Sun et al. [
22] used different airborne remote sensing data (waveform LiDAR, HSI, and their combination) and machine learning classifiers to map secondary forest age. The best result was found with the combination of LiDAR and HSI data (overall accuracy of 83%).
The Brazilian Amazon concentrates a large share of Reducing Emissions from Deforestation and Forest Degradation (REDD) initiatives and is broadly recognized for its key role in providing critical ecosystem services to local, regional, and global socio-ecological systems [
23]. The deforestation of native vegetation in the Brazilian Amazon has been monitored over the past decades using remote sensing approaches [
24]. In contrast, assessing the status of secondary vegetation (regrowth after complete clearing) and more subtle forest disturbances (e.g., selective logging and fire) is more challenging [
25,
26]. Advancing this characterization is critically needed for improving conservation, management, and restoration strategies.
In this study, we evaluated the potential of integrating airborne LiDAR and HSI data acquired over tropical forests of the Amazon to distinguish four classes of vegetation degradation and regeneration. The classes were previously defined with the use of historical Landsat data and auxiliary information: undisturbed old-growth forests (UFs), degraded old-growth forests (DFs), younger second-growth forests (SF1–15yr), and older second-growth forests (SF16–32yr). To achieve this goal, several LiDAR and HSI metrics related to structural and functional characteristics were calculated and submitted to three machine learning classifiers: Random Forest (RF), Stochastic Gradient Boosting (SGB), and Support Vector Machine (SVM). Finally, we investigated the effect of the data source and classifiers to better characterize these forest classes, as well as the ability to transfer the predictions of the best models to new sites. With this approach, we aimed to address the following research questions:
What LiDAR and HSI metrics are most effective for distinguishing the forest classes, and how do these metrics vary among them?
How do the choice of remote sensing data sources (single LiDAR, single HSI, or their combination) and machine learning classifiers impact the accuracy of classifying forest degradation and regeneration stages in tropical forests?
What is the capacity of regional models to generalize and accurately predict forest classes in new sites?
To the best of our knowledge, this study is the first to discriminate forest degradation and regeneration classes in the Brazilian Amazon by using a large set of LiDAR and HSI metrics. Although extensive airborne LiDAR datasets have been utilized to support significant research on forest structure in the Amazon [
27,
28], no previous studies have combined these data with hyperspectral imagery across such a broad spatial scale to specifically enhance the assessment of degradation and regeneration processes. By integrating both structural and compositional vegetation information, this study sought to advance the current understanding of Amazonian forest dynamics, providing new insights into remote sensing methodologies for tropical forest monitoring.
2. Materials and Methods
2.1. Study Area
Twelve sites distributed throughout the Brazilian Amazon were considered in this study (
Table 1). Each site is represented by one (most sites) or two (three sites) transects of approximately 12.5 km by 0.3 km, where the airborne remote sensing data were collected (
Figure 1). The sites encompass a wide variety of anthropogenic, climatic, geological, and edaphic conditions.
From the sites used in this study, MAM, ZF2, DUC, AUT, and TAP are part of the so-called Central Amazonia region, comprising old sedimentary substrates and low soil fertility [
30]. On the other hand, the sites PAR, JAM, ALF, SFX1, SFX2, FN1, and FN2 are located over the Brazilian Shield composed of Pre-Cambrian rocks with related high-fertility soils. The predominant soil types are Acrisols and Ferralsols, with Gleysols occurring in the seasonal floodplains of the MAM site [
30]. From a topographic point of view, all sites are considered lowlands, having altitudes lower than 500 m. The AUT and MAM sites present the lowest altitude (<50 m), while the southeastern sites (SFX1, SFX2, ALF, FN1, and FN2) have the highest values (200 to 500 m).
The rainfall gradient ranges from wetter conditions on the MAM, ZF2, DUC, and AUT sites to drier conditions on the PAR, SFX1, SFX2, ALF, FN1, and FN2 sites. The long-term (1973–2013) annual rainfall reported for the Brazilian Legal Amazon (BLA) is approximately 2100 mm [
31]. In the studied sites, annual rainfall ranges from 1800 mm at the FN2 site to more than 3000 mm at the MAM site. The mean annual temperature over the BLA is 26.5 °C, varying in the sites from 24.6 °C at SFX2 to 27.0 °C at AUT.
The forest types over the studied sites encompass seasonally flooded ombrophilous forests (MAM site), terra firme (unflooded) transitional forests (ecotones between ombrophilous and seasonal forests in the FN1 and FN2 sites), and terra firme ombrophilous forests (other sites). Undisturbed forests are mainly located in protected areas (sites MAM, ZF2, DUC, TAP, and JAM), with few relatively intact forests on other sites (AUT, ALF, and FN2). Secondary forests usually occur in small areas that were previously cleared near highways or rivers. They were also commonly found around small communities adjacent to conservation units. Specifically, they were observed near the northern borders of the Adolpho Ducke Reserve (DUC site) and in São Jorge, a community located between the boundaries of the Tapajós National Forest (TAP site). These areas are characterized by a mix of residential settlements and small-scale agricultural activities, contributing to localized deforestation pressures along the conservation boundaries.
Much of the studied forests have been degraded by fire, selective logging, and/or fragmentation. Understory fires were responsible for the degradation in sites SFX1, SFX2, and ALF, especially in fragmented areas. The Central Amazonian forests of the sites AUT [
32] and TAP were affected by extensive fires under the effect of the El Niño Southern Oscillation (ENSO) in 1998/99, 2010, and 2015/16. Major fires were also common in previously logged areas, such as the TAP, PAR, FN1, and FN2 sites. Conventional operations of selective logging were also observed without the occurrence of fires in the sites FN1 and FN2. At the JAM site, reduced-impact logging was authorized by forest concession [
33].
2.2. Forest Classes Identification and Sampling
To properly assess tropical forest dynamics, it is important to differentiate between deforestation and degradation. Lapola et al. [
34] proposed a framework in which deforestation is defined as the conversion of forest to non-forest land cover, often accompanied by changes in land use, such as agriculture or pasture. Additionally, deforestation can result from successive or severe disturbances that reduce forest cover below a critical threshold, even if land-use change does not occur. In contrast, degradation involves a deterioration in forest condition, such as reduced carbon storage or biodiversity, but without a change in land cover (i.e., the forest remains a forest). This study focused on four key human-induced drivers of tropical forest degradation: habitat fragmentation, timber extraction, forest fires, and extreme droughts.
Following this framework, we categorized forest dynamics of degradation/regeneration into four distinct classes (
Figure 2A): undisturbed old-growth forests (UFs), degraded old-growth forests (DFs), younger second-growth forests (SF1–15yr), and older second-growth forests (SF16–32yr). Old-growth classes are those where no deforestation was observed between 1984 (the first year from which we tracked the historical Landsat time series) and 2017, while second-growth classes were considered as forests regenerating after deforestation occurred at some point during this period. Undisturbed forests were defined as old-growth forests that showed no evidence of disturbance by fire or selective logging and that were at least 300 m away from forest edges. In contrast, degraded old-growth forests presented at least one of these types of disturbance.
Second-growth forests in the Amazon are commonly separated into three successional stages based on the stand age [
8,
35]: initial (<5 years), intermediate (5–15 years), and advanced (>15 years) stages. Here, due to the limited spatial coverage of the initial stages across the study sites, we grouped the initial and intermediate successional stages into a broader class of younger second-growth forests. This choice was made to ensure sufficient sample representation and statistical robustness, as the initial stage alone did not occupy a large enough area for reliable analysis. Therefore, we considered two stages of second-growth forests: the class SF1–15yr consisted of areas where the last deforestation event occurred between 2002 and 2016, while the class SF16–32yr included areas where such an event was observed between 1984 and 2001.
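As an illustration only, the rule above can be written as a small lookup on the year of the last observed deforestation event. The function name and the ASCII class labels are hypothetical and are not part of the original workflow, which relied on visual interpretation.

```python
def second_growth_class(deforestation_years):
    """Assign a second-growth class from the years in which clearing was observed.

    Only the most recent clearing event determines the class, following the
    year ranges stated in the text (2002-2016 and 1984-2001).
    """
    last_event = max(deforestation_years)
    if 2002 <= last_event <= 2016:
        return "SF1-15yr"
    if 1984 <= last_event <= 2001:
        return "SF16-32yr"
    raise ValueError("last clearing year outside the 1984-2016 tracking window")
```

For a stand cleared in both 1988 and 2003, the later event governs and the stand falls in the younger class.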
To reconstruct the history of forest degradation and regeneration over the sites, we conducted a visual inspection of Landsat images from 1984 to 2017 (TM/Landsat-5, ETM+/Landsat-7, and OLI/Landsat-8) on the Google Earth Engine (GEE) platform. To facilitate our visual interpretation, we plotted the time series of two key spectral indices: the Normalized Difference Vegetation Index (NDVI [
36]) and the Normalized Burn Ratio (NBR [
37]). By examining fluctuations in these indices over time, we were able to visually identify periods of degradation (e.g., sudden drops in NDVI and NBR following fire events) and regeneration (e.g., gradual increases in NDVI and NBR after a deforestation event). These trends, combined with the spatial patterns visible in the imagery, enabled the classification of forest areas into the four defined categories of degradation and regeneration.
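For reference, both indices are normalized differences of two reflectance bands. The sketch below uses the standard definitions (NDVI from near-infrared and red; NBR from near-infrared and the longer shortwave-infrared band); the function names and the example reflectance values in the test are illustrative assumptions, not values from this study.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    denominator = band_a + band_b
    if denominator == 0:
        return float("nan")
    return (band_a - band_b) / denominator


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized Burn Ratio."""
    return normalized_difference(nir, swir2)
```

A sudden drop in either index between consecutive years is the kind of signal that was inspected visually in the time series.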
Specific visual patterns were considered for each of the four forest classes. Undisturbed old-growth forests (UFs) exhibited consistently high NDVI and stable NBR values throughout the time series. Degraded old-growth forests (DFs), on the other hand, showed declines in these indices in response to events such as selective logging or fire. In some instances, index changes were subtle or absent, but visual inspection of spatial context, such as proximity to fragment edges (within 300 m), revealed degradation pressures. Second-growth forests displayed gradual recovery, marked by increasing NDVI and NBR values following deforestation events identified by initially low values in both indices. The age of second-growth forests was determined based on the year when deforestation was last observed. For instance, in
Figure 2B, two events of vegetation clearing corresponded to strong decreases in NBR values in 1988 and 2003, followed by subsequent vegetation regrowth. We also accounted for potential confounding factors, such as index drops caused by residual cloud cover. To confirm possible small-scale deforestation/degradation events, we also checked historical high-resolution images from Google Earth Pro when available. For the JAM site, we obtained geospatial data from the Annual Operative Plans for selective logging [
33] to discriminate areas of degraded forest.
Furthermore, to assist in the visual identification of the forest disturbance classes and provide additional proof of the quality of our training samples, we inspected auxiliary data from three other disturbance-related maps: i. Primary humid tropical forest map for the year 2001, from Turubanova et al. [
38]; ii. Brazilian secondary forest age map for the year 2017, from Silva Junior et al. [
39]; iii. Map of global forest loss due to fire and other disturbance drivers for 2001–2019, from Tyukavina et al. [
40]. Due to uncertainties and occasional discordant results from these data sources, the visual interpretation of Landsat images was used as the primary reference for determining the class of a sample.
To collect data for training and testing the separability of the forest classes using the machine learning classifiers, we allocated 50 samples to each of the 12 sites, totaling 600 samples in the Amazon biome. At each site, the 50 samples were randomly distributed along the remote sensing flight line, with a minimum distance of 100 m from each other, to reduce spatial autocorrelation. To capture the spatial variation of forest canopies within a stand, the sample unit chosen was a square plot of 0.25 ha (50 m × 50 m). Plots of 0.25 ha have an adequate size to represent the structural variability of tropical forests, as shown in previous studies [
41,
42]. Using the identification from the visual interpretation of Landsat time-series supported by auxiliary data, we detected 53 samples in the SF1–15yr class, 41 in the SF16–32yr class, 317 in the DF class, and 189 in the UF class (
Table 2). The distribution of samples over the sites, as well as the information that supported the forest classes identification for the training and testing samples, can be consulted on the interactive map at
https://catherine-almeida.github.io/forestmap/.
2.3. Airborne LiDAR Data
Airborne LiDAR data were obtained by the Trimble HARRIER 68i system between January 2016 and April 2017 as part of the project EBA (Estimation of Biomass in Amazon) [
43]. The LiDAR sensor recorded multiple discrete returns with a small footprint (nearly 30 cm) and a minimum point density of four points m⁻². This minimum point density is considered sufficient for capturing essential forest attributes, such as canopy height and biomass [
44,
45].
The raw point cloud was denoised with the lasnoise function of the LAStools software, version 171030 [
46]. For the subsequent preprocessing steps, we used the functions GroundFilter, TINSurfaceCreate, Clipdata, and PolyClipData of the FUSION/LDV software [
47]. In short, the ground points were filtered from the denoised point cloud and interpolated into a 1 m Digital Terrain Model (DTM), which was subtracted from point elevations to obtain the height above the ground of each point. The resulting normalized point clouds were clipped to the 600 samples to calculate the LiDAR metrics of each sample.
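The height normalization step can be sketched as follows. This is a simplified stand-in for the FUSION/LDV processing, not a reproduction of it: it assumes the DTM is held as a row-major grid of 1 m cells with a known lower-left origin and uses the elevation of the cell containing each point, without interpolation. All names are hypothetical.

```python
def normalize_heights(points, dtm, x_min, y_min, cell_size=1.0):
    """Convert point elevations to heights above ground.

    points: iterable of (x, y, z) tuples.
    dtm: list of rows of ground elevations; row 0 starts at y_min.
    """
    normalized = []
    for x, y, z in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        ground = dtm[row][col]  # nearest-cell lookup (assumption)
        normalized.append((x, y, z - ground))
    return normalized
```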
A total of 34 area-based LiDAR metrics were considered, including metrics related to height distribution (maximum, mean, percentiles of height, standard deviation, coefficient of variation, skewness, and kurtosis), canopy cover (Leaf Area Density in a specific height interval and proportion of first returns relative to the number of all returns), structural complexity (Shannon and Simpson diversity indices), and topography (terrain roughness).
To obtain the metrics related to height and LAD (Leaf Area Density), we tested both using all returns and only the first returns. Since the two approaches produced highly correlated metrics (r > 0.9), we chose the metrics calculated only from the first returns, which have been considered more stable across different LiDAR acquisition settings [
48]. Additionally, the first returns effectively capture the top-of-canopy structure, reducing noise from lower vegetation and enhancing measurement consistency for cross-site comparisons. This method also alleviates computational challenges by lowering data volume while maintaining essential structural attributes for forest assessments [
45].
The height and LAD metrics also considered just the points above a 2 m height to avoid low vegetation points. The LAD profile was calculated with the LAD function of the lidR package [
49] using a height bin of 2 m and the extinction coefficient k of 0.695 based on the study by Stark et al. [
50] in central Amazon.
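A minimal sketch of a gap-fraction (Beer–Lambert) LAD profile is given below, using the 2 m bin and k = 0.695 reported above. It is a simplified illustration and may differ in detail from the lidR implementation: in particular, it assumes that the gap fraction of a layer is the number of returns below the layer divided by the number of returns below its upper limit, and that returns under the 2 m threshold still count as having passed through the canopy.

```python
import math


def lad_profile(heights, k=0.695, dz=2.0, z0=2.0):
    """Return a list of (layer_base_height, LAD) pairs.

    heights: heights above ground of the LiDAR returns (m).
    """
    z_max = max(heights)
    profile = []
    z = z0
    while z < z_max:
        n_below = sum(1 for h in heights if h < z)       # passed through the layer
        n_reach = sum(1 for h in heights if h < z + dz)  # reached the layer
        if n_below == 0:
            lad = float("nan")  # gap fraction of zero: LAD undefined
        else:
            gap_fraction = n_below / n_reach
            lad = -math.log(gap_fraction) / (k * dz)
        profile.append((z, lad))
        z += dz
    return profile
```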
The HSCI and DSCI metrics are related to canopy structural complexity, based on the commonly used entropy measures of Shannon (H’) and Simpson (D) indices, respectively [
51]. These metrics were normalized between 0 and 1 by considering a fixed number of 30 height bins.
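The two complexity metrics can be sketched from the number of returns per height bin. The normalization shown here (dividing Shannon entropy by ln of the number of bins and the Simpson index by 1 − 1/bins) is an assumption about how the 0–1 scaling was done; the function name is hypothetical.

```python
import math


def structural_complexity(bin_counts):
    """Return (HSCI, DSCI) from return counts in fixed height bins (e.g., 30)."""
    total = sum(bin_counts)
    n_bins = len(bin_counts)
    proportions = [count / total for count in bin_counts if count > 0]
    shannon = -sum(p * math.log(p) for p in proportions)
    simpson = 1.0 - sum(p * p for p in proportions)
    hsci = shannon / math.log(n_bins)      # 1 when returns are spread evenly
    dsci = simpson / (1.0 - 1.0 / n_bins)  # 1 when returns are spread evenly
    return hsci, dsci
```

With this scaling, a canopy with all returns in a single bin scores 0 on both metrics, and a perfectly even vertical distribution scores 1.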
The terrain roughness was defined as the difference between the highest and lowest altitude in a 3 × 3 moving window [
52]. To mitigate extreme localized roughness values, the altitude data were averaged from a 1 m DTM to a 10 m DTM. The 10 m spatial resolution was selected as it aligned with our objective of capturing local topographic variability within the sample areas. More details on LiDAR data processing and metrics calculation can be found in Almeida et al. [
21].
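The roughness definition can be illustrated with two small helpers: one that block-averages the 1 m DTM to a coarser grid and one that takes the elevation range in a 3 × 3 window. The helper names are hypothetical, and how cell values were summarized per plot is not covered by this sketch.

```python
def aggregate_mean(grid, factor):
    """Average non-overlapping factor x factor blocks (e.g., 1 m to 10 m)."""
    rows, cols = len(grid) // factor, len(grid[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [grid[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def roughness(grid):
    """Elevation range in a 3 x 3 moving window; border cells are left as None."""
    rows, cols = len(grid), len(grid[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [grid[r + i][c + j] for i in (-1, 0, 1) for j in (-1, 0, 1)]
            out[r][c] = max(window) - min(window)
    return out
```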
2.4. Airborne HSI Data
Airborne HSI data were collected by the EBA project between September and October 2017 using the AISAFenix sensor (Specim, Spectral Imaging, Ltd., Oulu, Finland). The flight lines were oriented close to the N-S direction to reduce variations in viewing–illumination geometry. In addition, data acquisition was carried out on sunny days, between 10 a.m. and 1 p.m. (local time), with an average solar zenith angle of 30° and a standard deviation of 7°. The AISAFenix sensor provided images at a spatial resolution of 1 m in 361 bands covering the spectral range of 380–2500 nm. Of these, 87 bands in the VNIR (visible and near-infrared) had a bandwidth of approximately 6.8 nm, while the other 274 bands in the SWIR (shortwave infrared) had a bandwidth of about 5.7 nm. We reduced the total number of bands to 232 to remove the noise outside the range of 460–2330 nm and around the spectral intervals of 1400 and 1900 nm (major atmospheric water vapor absorptions). Surface reflectance images were obtained from the at-sensor radiance by using the Atmospheric/Topographic Correction for Airborne Imagery tool (ATCOR-4; version 6.3). The water vapor estimates were based on the 940 nm absorption feature. For geometric correction, we used the data provided by a GNSS (Global Navigation Satellite System) receiver onboard the aircraft.
We considered a total of 278 potential metrics from the HSI data: 232 reflectance bands; 30 vegetation indices (
Table 3); 10 continuum-removal absorption band parameters; and 6 sub-pixel metrics derived from linear spectral mixture analysis. The continuum-removed features were characterized by the depth at the absorption center (Dc) and the width at half depth (Wc) from five fixed absorption bands: chlorophyll band at 495 nm (461–536 nm); chlorophyll band at 670 nm (556–749 nm); leaf water band at 980 nm (893–1074 nm); leaf water band at 1200 nm (1097–1265 nm); and lignin-cellulose band at 2100 nm (2039–2199 nm). To generate the continuum-removed spectrum, the original reflectance was first smoothed for noise reduction using a Savitzky–Golay filter with a window size of five bands and a first-order polynomial. Then, the smoothed reflectance values within each absorption band were divided by the values of a continuum line between the band edges [
53].
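A minimal sketch of the two absorption parameters is shown below. It assumes a straight continuum line between the two band edges, takes the depth at the minimum of the continuum-removed spectrum, and locates the half-depth crossings by linear interpolation; smoothing is omitted, and all names are hypothetical.

```python
def continuum_removed(wavelengths, reflectance):
    """Divide reflectance by a straight line joining the two band edges."""
    w0, w1 = wavelengths[0], wavelengths[-1]
    r0, r1 = reflectance[0], reflectance[-1]
    return [r / (r0 + (r1 - r0) * (w - w0) / (w1 - w0))
            for w, r in zip(wavelengths, reflectance)]


def depth_and_width(wavelengths, reflectance):
    """Return (depth at absorption center, width at half depth)."""
    cr = continuum_removed(wavelengths, reflectance)
    center = min(range(len(cr)), key=cr.__getitem__)
    depth = 1.0 - cr[center]
    half = 1.0 - depth / 2.0

    def crossing(pairs):
        # First interval, moving away from the center, that brackets the half-depth level.
        for a, b in pairs:
            if cr[a] != cr[b] and (cr[a] - half) * (cr[b] - half) <= 0:
                t = (half - cr[a]) / (cr[b] - cr[a])
                return wavelengths[a] + t * (wavelengths[b] - wavelengths[a])
        return None

    left = crossing([(i - 1, i) for i in range(center, 0, -1)])
    right = crossing([(i, i + 1) for i in range(center, len(cr) - 1)])
    if left is None or right is None:
        return depth, float("nan")
    return depth, right - left
```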
We used the linear spectral mixture analysis from the unmix function of the hsdar R package [
71] to calculate the fractions of green vegetation (GV), shade, and non-photosynthetic vegetation/soil (NP) endmembers. NP represents a mixture of bright soils and non-photosynthetic vegetation since these components could not be easily distinguished from each other in the scenes. To select reference endmembers for GV and NP, we applied the minimum noise fraction (MNF) followed by the pixel purity index (PPI) technique in the ENVI software, version 5.3 (Harris Geospatial Solutions, Inc., Boulder, CO, USA). To pick the purest pixels at each site, the endmembers detected by the PPI were projected over an n-dimensional scatterplot. The final GV and NP endmembers were then obtained by averaging the purest pixels of all sites. For the shade endmember, a photometric shade with a uniform reflectance of zero was considered [
72].
For each metric (reflectance, vegetation indices, absorption features, and endmember fractions), we averaged the pixel values within each of the 600 samples to calculate the plot-level metrics. For the shade endmember fraction, we also calculated the proportion of pixels with a shade fraction below 30% (S0_30), between 30 and 60% (S30_60), and above 60% (S60).
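The three shade-proportion metrics can be sketched as simple threshold counts over the pixels of a plot. How pixels exactly at 30% or 60% were assigned is not stated, so placing them in the middle class is an assumption of this sketch.

```python
def shade_proportions(shade_fractions):
    """Proportion of plot pixels in three shade-fraction classes.

    shade_fractions: per-pixel shade fractions on a 0-1 scale.
    """
    n = len(shade_fractions)
    low = sum(1 for s in shade_fractions if s < 0.30)
    high = sum(1 for s in shade_fractions if s > 0.60)
    middle = n - low - high  # boundary values fall here (assumption)
    return {"S0_30": low / n, "S30_60": middle / n, "S60": high / n}
```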
2.5. Feature Selection and Importance
To maximize the information extracted from LiDAR and HSI data, we initially calculated a large set of metrics. However, this high data dimensionality might cause overfitting in the modeling process. To address this issue, we implemented a feature selection process to reduce the number of metrics and avoid redundancy, simplifying the dataset while preserving its informative value.
First, we eliminated highly correlated metrics using the
findCorrelation function from the R package caret [
73]. This function evaluates the absolute values of pairwise Pearson’s correlations between metrics, and if two metrics have a correlation greater than 0.95, it removes the one with the largest mean absolute correlation. Even though some remote sensing metrics had skewed distribution, Pearson’s correlation is recognized as robust against extreme violations of assumptions of normality [
74], effectively fulfilling its purpose of reducing data redundancy.
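The logic of this filter can be sketched in plain Python. This is a simplified greedy version written for illustration and is not the caret implementation: at each step it takes the most correlated pair above the cutoff and drops the member with the larger mean absolute correlation with the remaining metrics. Metric names in the usage are hypothetical, and constant columns are not handled.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def drop_correlated(columns, cutoff=0.95):
    """columns: dict of metric name -> values. Returns the names kept."""
    kept = list(columns)
    while len(kept) > 1:
        corr = {(a, b): abs(pearson(columns[a], columns[b]))
                for i, a in enumerate(kept) for b in kept[i + 1:]}
        (a, b), strongest = max(corr.items(), key=lambda item: item[1])
        if strongest <= cutoff:
            break

        def mean_abs(name):
            values = [v for pair, v in corr.items() if name in pair]
            return sum(values) / len(values)

        kept.remove(a if mean_abs(a) >= mean_abs(b) else b)
    return kept
```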
Next, we inspected the metrics for linear dependencies with the
findLinearCombos function from the same package. These two steps resulted in 20 LiDAR metrics and 42 HSI metrics (
Figure 3 and
Table 4) selected from the original set of variables (34 LiDAR metrics and 278 HSI metrics). This refined set of metrics was then used as predictors for the machine learning classification (RF, SGB, and SVM) of the four forest classes (UF, DF, SF1–15yr, and SF16–32yr) by considering three different scenarios of datasets: a LiDAR-only dataset (20 metrics), an HSI-only dataset (42 metrics), and the combination of both data sources (62 metrics).
Although the remaining number of metrics still produces a high-complexity model, which may be more difficult to apply in practice, our main purpose here was to understand the full potential of each data source. Mainly, we aimed to identify which metrics related to structural and functional characteristics were most important to differentiate the four classes of forests. Therefore, to further explore the discriminative power of metrics, we applied the Kruskal–Wallis test to evaluate differences in metric values across the forest classes. We then calculated the eta squared based on the H statistic from the Kruskal–Wallis test (Equation (1)):
$$\eta^2_H = \frac{H - k + 1}{n - k} \qquad (1)$$
where H is the value obtained in the Kruskal–Wallis test, k is the number of classes, and n is the total number of observations. The non-parametric Kruskal–Wallis test was chosen because some remote sensing metrics had skewed distributions, violating the assumptions of parametric methods. In this context, the eta squared indicates the proportion of total variation in the metric explained by the forest classes, serving as a univariate measure of metric importance. Additionally, we ranked the final set of selected metrics using the importance measures provided by the RF and SGB classifiers, further refining our understanding of which metrics were most informative for the classification task.
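As a worked illustration, the sketch below computes the Kruskal–Wallis H statistic from average ranks (without the correction for ties, which is an assumption of this simplified version) and converts it to eta squared as (H − k + 1)/(n − k), with H, k, and n as defined above. Function names are hypothetical.

```python
def kruskal_h(groups):
    """Kruskal-Wallis H from a list of groups (lists of values), no tie correction."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += average_rank
        i = j
    weighted = sum(r * r / len(values) for r, values in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


def eta_squared_h(h, k, n):
    """Effect size from H, the number of classes k, and the sample size n."""
    return (h - k + 1) / (n - k)
```

For three fully separated groups of three values each, H is 7.2 and eta squared is about 0.87, indicating that class membership explains most of the rank variation.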
2.6. Model Validation, Optimization, and Generalization
The machine learning approach was used to investigate the potential of LiDAR and HSI metrics, used separately and together, for the discrimination of the four forest classes. To validate the machine learning models, we first split the dataset (n = 600 samples) into training and test sets, in which the test set was composed of a given site (n = 50) and the training set by the remaining 11 sites (n = 550). Thus, this procedure was repeated 12 times, always leaving one of the 12 sites out to serve as a test set. The training set results were used to evaluate the models’ performance in a regional context by considering different sites throughout the Amazon. In contrast, the test set results evaluated the model generalization capability, that is, its ability to accurately predict outcomes for a completely new site not included in the training data.
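The leave-one-site-out scheme can be sketched as a generator over site labels. The site codes come from the study area description; the function name and the ordering of samples are illustrative assumptions.

```python
SITES = ["MAM", "ZF2", "DUC", "AUT", "TAP", "PAR",
         "JAM", "ALF", "SFX1", "SFX2", "FN1", "FN2"]


def leave_one_site_out(sample_sites):
    """Yield (held_out_site, train_indices, test_indices) for each site."""
    for held_out in dict.fromkeys(sample_sites):  # unique sites, original order
        train = [i for i, site in enumerate(sample_sites) if site != held_out]
        test = [i for i, site in enumerate(sample_sites) if site == held_out]
        yield held_out, train, test


# 50 samples per site, 600 in total
sample_sites = [site for site in SITES for _ in range(50)]
splits = list(leave_one_site_out(sample_sites))
```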
We used the train function of the caret package to train each classifier (RF, SGB, and SVM) in the training sets and calculate an unbiased performance measure via 11-fold cross-validation (each fold had approximately 50 samples for the model validation). For each classifier, we evaluated different tuning parameters to select the optimal model from those parameters. For the RF classifier, the mtry parameter was tuned from the values 2, 4, 6, 8, and 10, and the ntree parameter was set to 1000. For the SGB, tuning parameters were n.trees (50, 100, and 150) and interaction.depth (1, 2, and 3). The parameters shrinkage and n.minobsinnode were set to the default values (0.1 and 10, respectively). For SVM, we used the Radial Basis Function Kernel by tuning the parameters cost (0.5, 1, 2, and 4) and setting the sigma value with the sigest function from the R package kernlab [
75].
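The tuning grids described above can be enumerated as follows. Parameter names follow the text; the dictionary layout and function name are hypothetical, and the sketch only lists candidate settings without fitting any model (the SVM sigma, estimated separately, is not included).

```python
from itertools import product

# Values searched during tuning, as reported in the text
TUNED = {
    "RF": {"mtry": [2, 4, 6, 8, 10]},
    "SGB": {"n.trees": [50, 100, 150], "interaction.depth": [1, 2, 3]},
    "SVM": {"cost": [0.5, 1, 2, 4]},
}

# Values held constant
FIXED = {
    "RF": {"ntree": 1000},
    "SGB": {"shrinkage": 0.1, "n.minobsinnode": 10},
    "SVM": {},
}


def candidate_settings(classifier):
    """Return every combination of tuned values merged with the fixed ones."""
    grid = TUNED[classifier]
    names = list(grid)
    return [{**dict(zip(names, combo)), **FIXED[classifier]}
            for combo in product(*(grid[name] for name in names))]
```

This gives 5 candidate models for RF, 9 for SGB, and 4 for SVM per training set and dataset.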
We considered the overall F1 score (F1 average of the four classes) as the performance measure to select the optimal model. The F1 score combines precision (also known as user's accuracy) and recall (also known as producer's accuracy or sensitivity) by calculating their harmonic mean (Equation (2)), thus providing a single performance measurement for a given class [76]:
$$F1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2)$$
The overall F1 score is known to be better suited for imbalanced data because it gives the same weight to every class. However, to be able to evaluate the performance for each class, we also provided the confusion matrices with the precision and recall values of each class (see Tables S1–S9 in the Supplementary Material) and calculated the by-class F1. In addition to the overall and by-class F1, the overall accuracy (OA) was also reported. Even though OA tends to undervalue the performance of classifiers on smaller classes, this measure is widely used and may be useful for comparison among other studies.
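The by-class F1, overall F1 (unweighted mean across classes), and OA can be sketched from a confusion matrix as follows. The matrix orientation (rows as reference classes, columns as predicted classes) and the function names are assumptions of this illustration.

```python
def f1_by_class(confusion):
    """F1 per class from a square confusion matrix (rows: reference, columns: predicted)."""
    n_classes = len(confusion)
    scores = []
    for i in range(n_classes):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[r][i] for r in range(n_classes)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return scores


def overall_f1(confusion):
    """Unweighted mean of the by-class F1 scores."""
    scores = f1_by_class(confusion)
    return sum(scores) / len(scores)


def overall_accuracy(confusion):
    """Share of samples on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)
```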
A two-way analysis of variance (ANOVA), followed by a Tukey test, was used to assess whether there were any differences in performance measures from cross-validation (OA and overall/by-class F1) among the three datasets (LiDAR, HSI, and their combination), three classifiers (RF, SGB, and SVM), and the interaction between data and classifiers. To examine the effect size of these factors (data source, classifier, and their interaction) on the overall model performance (OA and F1), we also calculated the eta squared (η²). From the ANOVA results, η² is the ratio of the sum of squares of the factor to the total sum of squares and can be considered a large effect size when greater than 0.14 [77].
Finally, we chose the best model (among data sources and classifiers) for each of the 12 training sets and used it to predict the forest classes on its corresponding test set. Thus, we calculated the overall accuracy and F1 for each test set, allowing us to analyze the ability of the best model to generalize to sites that were not used in its training.
All statistical analyses were performed in the R software, version 4.3.1, and considered a significance level of 0.05. The data processing and analysis were conducted on a computer equipped with an Intel Core i7-1255U processor (10 cores, 4.7 GHz) and 16 GB of RAM (Random Access Memory). For storage, a 4 TB external hard drive was used to store the datasets and results.
4. Discussion
The present study brings relevant insights into how advanced remote sensing technologies can be used together to improve our understanding of forest dynamics concerning anthropogenic disturbances and natural regeneration in tropical forests of the Amazon region. It goes beyond simpler evaluations of forest cover loss and advances toward the assessment of forest quality and ecosystem services in heterogeneous human-modified tropical landscapes. Despite the high complementary potential of LiDAR and HSI data for improving the characterization and quantification of forest characteristics, previous studies displayed divergent results on the synergistic use of both technologies. Some investigations showed no or slight information gain when combining structural (LiDAR) and spectral (HSI) data [
72,
78,
79], while others demonstrated a significant effect on increasing the performance of models [
22,
80,
81].
In our study, the complementary information of LiDAR and HSI data considerably improved the characterization of tropical forest degradation and regeneration. While LiDAR performed well in classifying successional stages by differentiating them from old-growth forests, HSI was effective in distinguishing degraded forests from undisturbed forests. Canopy structural characteristics related to LiDAR metrics, such as height, basal area, and biomass, have been used to characterize successional stages [
82]. For instance, the most important LiDAR metrics found here were similar to the ones used to estimate aboveground carbon density in Borneo’s tropical forests [
83]. Examples include canopy cover at 20 m aboveground (Cover20), which was similar to the metrics LAD20_30 or LAD22 used here, and top of canopy height (TCH), which was related to the metrics H.max or H.p95 used in this study. Additionally, the HSI and LiDAR metrics that had the greatest influence on the classification of forest degradation and regeneration were also the most important for estimating AGB in a previous study in the Amazon biome [
21]. This is because AGB integrates important forest structural and functional information associated with forest disturbance and regrowth, such as tree height, basal area, number of trees per area, and wood density.
The main errors of our LiDAR-only models stemmed from confusion between degraded and undisturbed old-growth forests. This challenge arises because some degraded areas, while structurally similar to undisturbed forests, may experience changes in species composition. For instance, disturbances like timber harvesting and forest fires may reduce the abundance of certain tree species and favor the regeneration of early-successional species in forest gaps and edges [
84], which have contrasting functional and spectral features [
5,
85]. While LiDAR effectively captures canopy structure and height, it lacks the ability to detect changes in species composition, which can be a crucial indicator of forest health and disturbance. Thus, approaches based solely on structural characteristics limit the characterization of a broad spectrum of forest disturbance conditions.
In contrast, the most important HSI metrics found in this study were related to functional characteristics, especially canopy moisture, biochemical components, and health. The SWIR spectral region, especially the absorption feature around 2100 nm, was highly relevant for characterizing forest degradation/regeneration. Consistent with our results, other studies have indicated that SWIR bands contain most of the relevant information to distinguish forest regeneration [
5,
86]. This can be explained by the increase in canopy complexity, shadowing, and moisture along succession, which decreases SWIR reflectance. Furthermore, absorption features around 1700 nm and 2100 nm have been related to non-pigment biochemical components, such as lignin and cellulose [
13], indicating the occurrence of dead or senescent vegetation. Water absorption bands, especially at 1200 nm, were also very important. Asner et al. [
87], using EO-1 Hyperion data in the central Amazon, showed that the canopy water metrics (SWAM, spectroscopic water-absorption metric) and pigment metrics (PRI and ARI) were a proxy for physiological and biochemical changes from chronic water stress. In agreement with our results, Thenkabail et al. [
17] found that EO-1 Hyperion bands related to absorption by biochemical constituents (spectral intervals of 1300–1900 nm, 1100–1300 nm, 1900–2350 nm, and 600–700 nm) were the most important for characterizing African tropical forests following anthropogenic disturbance of different magnitudes. Thus, the importance of HSI metrics in identifying tropical disturbance suggests a greater susceptibility to canopy stress in degraded forests, facilitating their distinction from healthy forests with similar structural characteristics.
It is important to note that among the tested HSI metrics, the most important were the absorption features obtained through the continuum-removal method. These features, unlike reflectance values and some vegetation indices, cannot be derived from multispectral data due to the need for high spectral resolution to detect specific absorption bands associated with biochemical and physiological properties, such as pigments and water content in leaves. Consequently, our study points to a clear advantage of hyperspectral data over multispectral data in this context. This finding aligns with those reported by Thenkabail et al. [
17], who compared three broadband sensors (IKONOS, ALI, and ETM+) to the narrowband hyperspectral Hyperion data for classifying complex rainforest vegetation. Their results indicated that forest classifications using hyperspectral data achieved overall accuracies 45–52% higher than those using multispectral data. Similarly, for non-forest habitats (meadows, grasslands, heaths, and mires), Jarocińska et al. [
88] found that HSI outperformed multispectral Sentinel-2 imagery, with greater improvements in classification accuracy observed in areas of high α or β-diversity.
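To make the continuum-removal idea concrete, the sketch below fits an upper convex hull (the continuum) over a reflectance spectrum, divides the reflectance by it, and reports the depth and position of the deepest point within a chosen wavelength window. It is a minimal Python illustration under stated assumptions rather than the processing chain used in this study: the function names and the five-band synthetic spectrum around 2100 nm are hypothetical, wavelengths are assumed strictly increasing, reflectance is assumed positive, and the continuum is fitted over the whole input spectrum rather than over a feature-specific interval.

```python
def upper_hull(points):
    """Upper convex hull of (wavelength, reflectance) points sorted by wavelength."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Cross product of (hull[-1] - hull[-2]) and (p - hull[-2]).
            # A non-negative value means hull[-1] lies on or below the
            # segment hull[-2] -> p, so it is not part of the upper hull.
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def continuum_removed(wavelengths, reflectance):
    """Reflectance divided by the linearly interpolated hull (continuum)."""
    points = list(zip(wavelengths, reflectance))
    hull = upper_hull(points)
    removed = []
    seg = 0
    for w, r in points:
        # Advance to the hull segment that contains this wavelength.
        while hull[seg + 1][0] < w:
            seg += 1
        (x1, y1), (x2, y2) = hull[seg], hull[seg + 1]
        continuum = y1 + (y2 - y1) * (w - x1) / (x2 - x1)
        removed.append(r / continuum)
    return removed


def band_depth(wavelengths, reflectance, lo, hi):
    """Depth (1 - minimum continuum-removed value) and its wavelength in [lo, hi]."""
    removed = continuum_removed(wavelengths, reflectance)
    window = [(c, w) for w, c in zip(wavelengths, removed) if lo <= w <= hi]
    c_min, w_min = min(window)
    return 1.0 - c_min, w_min


# Synthetic spectrum (made-up values) with an absorption feature near 2100 nm.
wl = [2000, 2050, 2100, 2150, 2200]
refl = [0.30, 0.27, 0.15, 0.27, 0.30]
depth, position = band_depth(wl, refl, 2000, 2200)
```

Because the hull is anchored on narrow local maxima on either side of each absorption feature, the resulting depth depends on contiguous narrow bands, which is consistent with the point above that such features cannot be recovered from broadband multispectral data.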
Although the combination of LiDAR and HSI offers significant advantages for characterizing the complex dynamics of regeneration and degradation in tropical forests, future studies should assess the cost-effectiveness of this approach using airborne data. To expand the analytical scope in larger areas, researchers should consider recently launched hyperspectral missions such as the Environmental Mapping and Analysis Program (EnMAP) and the possibility of integrating their observations with data acquired by existing Synthetic Aperture Radar (SAR) or future LiDAR missions. This strategy would facilitate the development of approaches combining structural and compositional information for large-scale monitoring initiatives.
Most studies on forest ecosystems based on LiDAR and HSI data integration have been performed locally on single study sites [
89,
90,
91,
92]. Although it is important to consider different spatial scales in the study of multisensor data integration, the full realization of its potential as a source of forest information requires an ability to generalize across different environmental conditions and human-induced degradation and regeneration dynamics. Although our best model could not be generalized to all evaluated sites, it was satisfactorily transferred to most sites, which represent the distinct environmental and anthropogenic conditions of the Amazon. Thus, the integrated use of LiDAR and HSI data can also help to understand the dynamics of complex Amazonian forests from a regional perspective.
Second-growth and degraded forests are an integral part of tropical landscapes. However, they present different compositions and structures, leading to divergent functioning patterns. Therefore, their accurate characterization and discrimination from the remaining undisturbed forests are essential for establishing conservation and management priorities. The distinct structural and functional characteristics of undisturbed forests suggest that some of their ecosystem services cannot be replaced by degraded or secondary forests [
93]. It is thus necessary to conserve forests that are still relatively intact, preventing new areas from being degraded or deforested. The remote sensing approaches described here may play a key role in planning interventions under REDD, intact forest landscapes, and the Alliance for the Restoration in the Amazon [
94,
95].
Previous studies [
96,
97,
98] have reported a “secondarization” of degraded forests, described as a process that transforms closed-canopy primary forests into more open forests dominated by short-lived pioneer species due to recurrent anthropogenic disturbances. In fact, older second-growth forests and degraded old-growth forests may represent forests with similar structural and compositional characteristics positioned at the intersection of degradation and regeneration trajectories. However, depending on the degradation intensity and recurrence, degraded forests can retain important structural characteristics of the former primary forests, as well as a generally heterogeneous species composition [
99]. Likewise, the older second-growth forests had distinct characteristics from the younger ones and, according to some HSI metrics, were more similar to the undisturbed forests. Thus, both degraded and secondary forests have great potential to provide significant environmental benefits, as well as to contribute to poverty alleviation through products and services of socio-economic importance. However, avoiding recurrent degradation is essential to ensure the continued functioning of forests.
Finally, to our knowledge, this is the first study to evaluate the potential of combining LiDAR and hyperspectral remote sensing to discriminate different levels of forest degradation and regeneration over the Brazilian Amazon. This knowledge is important for future large-swath orbital missions of both instruments, even considering the constraints of data upscaling from the airborne to the orbital level of data acquisition. Understanding the variability of LiDAR and HSI metrics in different degradation and regeneration gradients from a wide spatial sampling over the Amazon forests can help to build models based on equivalent metrics derived from orbital platforms in an upscaling exercise. Thus, the role played by data sources, metrics, and models described in this study represents the first step toward the production of large-scale maps to be further validated with detailed field information in the Amazon.