*Article* **Woody Aboveground Biomass Mapping of the Brazilian Savanna with a Multi-Sensor and Machine Learning Approach**

**Polyanna da Conceição Bispo 1,2,\*, Pedro Rodríguez-Veiga 2,3, Barbara Zimbres 4, Sabrina do Couto de Miranda 5, Cassio Henrique Giusti Cezare 6, Sam Fleming 7, Francesca Baldacchino 7, Valentin Louis 2, Dominik Rains 8,9, Mariano Garcia 10, Fernando Del Bon Espírito-Santo 2, Iris Roitman 11, Ana María Pacheco-Pascagaza 2, Yaqing Gou 2, John Roberts 2, Kirsten Barrett 2, Laerte Guimaraes Ferreira 6, Julia Zanin Shimbo 4, Ane Alencar 4, Mercedes Bustamante 11, Iain Hector Woodhouse 7,12, Edson Eyji Sano 13, Jean Pierre Ometto 14, Kevin Tansey <sup>2</sup> and Heiko Balzter 2,3**


Received: 16 July 2020; Accepted: 15 August 2020; Published: 19 August 2020

**Abstract:** The tropical savanna in Brazil known as the Cerrado covers circa 23% of the Brazilian territory, but only 3% of this area is protected. High rates of deforestation and degradation in the woodland and forest areas have made the Cerrado the second-largest source of carbon emissions in Brazil. However, data on these emissions are highly uncertain because of the spatial and temporal variability of the aboveground biomass (AGB) in this biome. Remote-sensing data combined with local vegetation inventories provide the means to quantify the AGB at large scales. Here, we quantify the spatial distribution of woody AGB in the Rio Vermelho watershed, located in the centre of the Cerrado, at a high spatial resolution of 30 metres, with a random forest (RF) machine-learning approach. We produced the first high-resolution map of the AGB for a region in the Brazilian Cerrado

using a combination of vegetation inventory plots, airborne light detection and ranging (LiDAR) data, and multispectral and radar satellite images (Landsat 8 and ALOS-2/PALSAR-2). A combination of random forest (RF) models and jackknife analyses enabled us to select the best remote-sensing variables to quantify the AGB on a large scale. Overall, the relationship between the ground data from vegetation inventories and remote-sensing variables was strong (R<sup>2</sup> = 0.89), with a root-mean-square error (RMSE) of 7.58 Mg ha−<sup>1</sup> and a bias of 0.43 Mg ha<sup>−</sup>1.

**Keywords:** aboveground biomass; Cerrado ecosystem; random forest; SAR

#### **1. Introduction**

The tropical savanna in Brazil, known as the Cerrado, is the second-largest biome in South America, covering over 200 million ha or approximately 23% of the Brazilian territory [1]. It has the highest species richness and biodiversity among the world's savannas [2,3]. Gradients of tree density, height, canopy cover, and aboveground biomass (AGB) in the Cerrado vary according to the climate, fire regime, geomorphology, and soil nutrient availability, resulting in 19 distinctive ecoregions [4]. The Cerrado is characterised by a mosaic of grasslands, shrublands, and forestlands in varying proportions, depending on the location [5]. Its physiognomies range from *campo* (grasslands) to the typical Cerrado *stricto sensu* (trees and shrubs up to 8–10-m-high and with an understory dominated by grass) and the *cerradão* (forest formations with trees up to a height of 20-m-high) [6,7].

Although the aboveground carbon stock in the Cerrado is lower than that of the Brazilian Amazon, the conversion of the Cerrado biome to different types of land uses is occurring much faster than in the Brazilian Amazon, mainly because the Brazilian livestock and agricultural frontier has been expanding towards the northern parts of the Cerrado over the last decades [8,9]. This trend has been increasing due to the governmental policies put in place since 2018. The Cerrado land use and land cover (LULC) mapping project, conducted by the Brazilian Ministry of Environment (MMA), showed that, in 2013, approximately 43% (88 million ha) of the biome had already been converted to different land uses, with 55% (111 million ha) still covered by native vegetation [4]. The remaining 2% of the Cerrado were covered by water bodies and by the class "unidentified", which included areas covered by clouds and burned areas [10]. Most of the remaining natural areas have been undergoing degradation due to unsustainable selective logging and burning activities, often overlooked as threats to habitat integrity and connectivity [4,11]. According to MMA-2019 [12] and the MapBiomas alert platform (https://mapbiomas.org/en/project), the Cerrado was the Brazilian biome with the highest levels of deforestation between October 2018 and March 2019, losing 47,704 ha of native vegetation. In addition, about 95% of the deforestation alerts were in areas without any authorisation—that is, without a license for deforestation, which is issued by either the federal or state environmental agency. More than 1400 ha of deforestation took place in legal reserves (Brazil's environmental legislation obligates private property owners to retain a fixed proportion of their total area for native vegetation. These areas are called "legal reserves") [13].

Only 3% of the Cerrado is strictly protected by the law within conservation areas [12,14], and the high rates of vegetation loss and degradation have made the Cerrado the second-largest source of carbon emissions in Brazil [15]. In this context, it is essential to monitor AGB and carbon stocks effectively, and reliable maps are needed for climate change mitigation policies [16–18]. Uncertainties in current vegetation carbon stock estimates over the Cerrado are high, and biomass estimates vary by more than 50 Mg ha−<sup>1</sup> within the same area [19–21]. This demands improvements in the accuracy and spatial resolution to estimate the AGB in this biome. The challenge here is to take the large latitudinal gradient and the high variation of the vegetation structure into consideration [15], as well as the paucity of field studies quantifying AGB over different regions of the biome [15,16].

Different types of sensors, such as satellite-based multispectral imagers and Synthetic Aperture RADAR (SAR) systems, as well as airborne light detection and ranging (LiDAR), have been successfully applied to estimate AGB in the tropics [21–26]. SAR and LiDAR have been increasingly used to estimate AGB in the last eight years [27–36]. The microwave pulses transmitted by a SAR system, especially at longer wavelengths such as the L- or P-band, interact with the branches and trunks, providing information about the forest structure, which is highly correlated to AGB [37–39]. Airborne LiDAR provides information on the canopy height and canopy cover [40–43], which are good proxies for woody AGB estimations [44]. The main limitations of LiDAR are that it is still an expensive technology and it is typically not available for large areas [45]. So far, only a few studies have explored the effectiveness of multi-sensor data synergy in tropical savannas [46–48]. In contrast, this approach has been intensively used to study other biomes [49–51].

Most studies focusing on AGB estimations in the Cerrado are based on establishing statistical models between remote-sensing measurements and field plots (i.e., allometric equations linking in situ measurements to parameters such as tree biomass) [5,16,52]. However, uncertainties remain high [53], and studies using remote sensing to estimate AGB in the Cerrado are still limited [53,54]. Bitencourt et al. [53] studied the Cerrado vegetation using optical and RADAR data, showing a strong relationship between these Earth Observation (EO) datasets and foliage biomass. The same authors also used Japanese Earth Resources Satellite (JERS-1) SAR observations to estimate the woody AGB of the savanna using a multiple regression analysis, resulting in R<sup>2</sup> = 0.87. Miguel et al. [54] used artificial neural networks to predict the wood volume and AGB of the savanna using satellite observations from the Linear Imaging Self-Scanner (LISS-III) sensor onboard the ResourceSat-1 satellite. The neural network approach showed a good accuracy for both the wood volume (R<sup>2</sup> ~0.98 and standard error of estimate (SEE) ~4.83%) and AGB (R2 ~0.94 and SEE ~8.5%). Schwieder et al. [55] combined Landsat phenological metrics with aboveground carbon field samples of woodland savanna vegetation using random forest (RF) regression models to map the regional carbon distribution and to analyse the relationship between the phenological metrics and aboveground carbon stocks. The model performance varied among the three selected study areas, with root mean squared error (RMSE) values of 1.64 Mg ha−<sup>1</sup> (mean relative RMSE 30%), 2.35 Mg ha−<sup>1</sup> (mean relative RMSE 46%), and 2.18 Mg ha−<sup>1</sup> (mean relative RMSE 45%), while the aboveground carbon distributions revealed characteristic spatial patterns. Biomass maps can be assessed through the biomass product accuracy requirements of satellite missions dedicated to the estimation of a biomass, such as those from the BIOMASS mission [56]. BIOMASS is aiming at errors smaller than ± 20% in terms of the relative RMSE (rel. RMSE) for AGB higher than 50 Mg ha−<sup>1</sup> and <sup>±</sup> 10 Mg ha−<sup>1</sup> in terms of RMSE for AGB lower than 50 Mg ha<sup>−</sup>1.

In this study, we quantified the spatial distribution of the AGB in the Rio Vermelho watershed, located in a central Cerrado region in Brazil. Using a machine-learning approach, we produced the first high-resolution AGB map (30 m) of a Brazilian Cerrado area based on a combination of vegetation inventory plots, airborne LiDAR data, and satellite images (Landsat 8 and ALOS−2/PALSAR-2). The two-stage upscaling approach (field to LIDAR and LIDAR to EO) was also applied for the first time in an area of this biome. We used a RF model and jackknife analyses to analyse the importance of the remote-sensing predictor variables, enabling us to select the best ones to quantify the AGB.

#### **2. Data and Methodology**

#### *2.1. Study Area*

The study site is located in the Rio Vermelho watershed, which is part of the Rio Araguaia watershed in the state of Goiás, Central Brazil (Figure 1), and it covers an area of 1,082,460 ha. The landscape topography is typically lowland, and the Cerrado biome's climate is semi-humid (*Aw* in the Köppen's climate classification system). It is a tropical savanna with two marked seasons: a dry winter (from May to September) and a rainy summer (from October to April). This watershed is characterised by a relatively constant air temperature throughout the year, with minimum and

maximum temperatures of 20 ◦C and 32 ◦C, respectively) [57,58]. Intense and localised precipitation and frequent dry spells occur.

**Figure 1.** Location of the Rio Vermelho watershed study area (outlined in yellow in the right panel) located in the State of Goiás in Brazil and in the Cerrado biome (grey and green areas in the left overview panel). The location and a zoom of the canopy height light detection and ranging (LiDAR) footprints within the watershed (each of those covered an area of 16.82 ha) with the location of the field plots (black rectangles) are also shown. The entire right panel corresponds to the red square in the left overview panel. The background image corresponds to the © Bing™ aerial photo, a screenshot(s) reprinted with permission of Microsoft Corporation.

The Rio Vermelho watershed is part of a region in the Goiás State called "Mato Grosso Goiano". This denomination is based on the historical extensive and dense native forest cover (typical for the Mato Grosso State) in the region [59]. This part of the Brazilian Cerrado underwent intensive human settlement in the 18th century due to the Brazilian gold rush in Central Brazil [60]. Since forests in the Cerrado are supposedly associated with soils with good fertility, most of the forest-covered areas have already been converted to other uses, mainly croplands and pastures [59], sometimes resulting in land degradation. Today, forest formations are restricted to a small and few fragments. Pastures cover more than 57% of the study area [6], and livestock ranching is the main economic activity in the region. By 2018, only approximately 33% of the native vegetation persisted [12,61], and the remaining native vegetation is fragmented [62] and often located in legal reserves areas.

#### Vegetation

Most of the Cerrado natural vegetation is comprised of savanna formations (cerrado *stricto sensu*). However, forest formations such as woodlands, riparian forests, and seasonal forests play an important role in the carbon balance, because they have a higher carbon stock density [63]. In the Rio Vermelho watershed, woodlands or cerradão (in Portuguese) and seasonal forest formations predominate as the forest remnants. The forest inventory sites were located in two fragments of cerradão, and the plots follow a structural and biomass gradient from: (a) a savanna-cerradão transition zone (lower biomass), (b) cerradão, and (c) cerradão-seasonal forest transition zone (higher biomass).

Dystrophic cerradão is typical for deep, highly lixiviated soil (Oxisols and sandy soils) areas with a seasonal tropical climate, and their canopy height varies between 8 and 15 m. Canopy cover varies between 50% and 90%, with an understory of shrubs and grass [62]. They are structurally similar to seasonal forests and differ mainly in species composition, as they are comprised of seasonal forest as well as wooded savanna species [62,64].

Seasonal forests are often associated with more fertile soils (mesotrophic and eutrophic soils) and present different levels of deciduousness during the dry season. The canopy height varies between 15 and 25 m, and the forest cover is between 70% and 95% during the rainy season. In the dry season, the canopy cover can be lower than 50% in seasonal semideciduous forests and lower than 35% in seasonal deciduous forests [62].

In wooded savannas, most trees are shorter and sparser than in forest formations, allowing for a continuous herbaceous grassy layer. Wooded savannas are separated into subgroups according to their structural gradient: *cerrado ralo* (2–3-m-tall trees with 5–20% canopy cover), *cerrado sensu stricto* or *typical cerrado* (3–6-m-tall trees with 20–50% forest cover), and *cerrado denso* (8–15-m-tall trees with 50–70% forest cover) [65].

#### *2.2. Methodology*

Figure 2 shows the flowchart of the methodology used to produce the AGB map of the Rio Vermelho watershed. Our reference datasets consisted of ground measurements and airborne LiDAR data. We used a combination of Earth Observation (EO) datasets from multispectral passive optical and synthetic aperture radar (SAR) sensors as predictor variables to estimate the AGB of the study area. The description of each step of the methodology is provided in the following subsections.

**Figure 2.** Flowchart showing the methodology used to produce the aboveground biomass (AGB) map of the Rio Vermelho watershed, Goiás State, Brazil. SR = surface reflectance.

#### 2.2.1. Ground Truth Data

We used woody AGB data from 15 (20 m × 50 m) field plots located in cerradão vegetation to estimate the AGB from airborne LiDAR. These plots were established in 2014, and all trees with a diameter at breast height (dbh) ≥ 5 cm at 1.30 m above the ground were considered. The tree heights were measured with a digital clinometer (HAGLOF ECII-D). The basal area in m2 was calculated from the dbh (basal area = (π/4) <sup>∗</sup> (dbh/100)2) (Table 1). Ten plots were located in the remnants of native vegetation in the municipality of Itapirapuã (Figure 1, right panel, top zoom), while five plots were located in the municipality of Goiás (Figure 1, right panel, lower zoom).

**Table 1.** Floristic and structural characterisations of the plots located in fragments of the native Cerrado vegetation in the Rio Vermelho watershed, Goiás State, Brazil. WS-FS = savanna-cerradão transition zone, TFS = cerradão, FS-SF = cerradão-seasonal forest transition zone; S = species richness, TD = tree density, DBH = diameter at breast height, H = height, TBA = tree basal area, AGB = aboveground biomass, and CV= coefficient of variation.


As mentioned, the two native vegetation fragments sampled are classified as cerradão. However, due to the high structural heterogeneity in the Cerrado biome, there were considerable variations in the structure of the woody vegetation within the sampled plots. The floristic and structural characteristics of the plots described in Table 1 (and from Figure S1 to Figure S20 in the Supplementary Materials) corroborate that the sampled plots adequately represent the gradient of floristic-structural variation inherent to forest formations from all of the Cerrado. Thus, four plots were classified as a savanna-cerradão transition zone, seven as cerradão, and four as a cerradão-seasonal forest transition zone [62]. The sampled plots showed a height gradient directly related to the aboveground biomass stocks of woody vegetation. In the savanna-cerradão transition zone, the mean height was below 6 m, in the cerradão, between 6 and 8 m, and, in the cerradão-seasonal forest transition zone, above 9 m (Table 1). These plots were representative of the structural variation found in the cerradão vegetation of the region, with AGB values ranging from 19 to 104 Mg ha−<sup>1</sup> (Table 1).

In tropical ecosystems, species diversity is generally very high. Therefore, generalised (mixed-species) allometric models are more appropriate than species-specific equations [66,67]. We used the following mixed species allometric model [68] designed for cerradão (N = 87, R<sup>2</sup> = 0.94, residual standard error (RSE) = 0.13 Mg; RSE (%) = 49.83). The model estimates the woody AGB (only live trees were used) with a diameter at breast height ≥ 5 cm, excluding leaf biomass.

$$\text{AGB} = \exp\left[-12.29 + 2.69 \ast \ln(\text{dbh}) + 0.80 \ast \ln(\text{h})\right] \tag{1}$$

where dbh is the diameter at breast height, and h is height.

In our study, these in situ woody AGB estimates were then used to establish an empirical model to estimate AGB as a function of the structural metrics derived from the airborne LiDAR.

#### 2.2.2. LiDAR Data

We used 60 tiles of airborne LiDAR data (covering a total area of 1009.01 ha) collected by the Sustainable Landscapes project led by the Brazilian Enterprise for Agricultural Research Corporation (EMBRAPA). The LiDAR data covered the municipalities of Itapirapuã and Goiás, Goiás State, Brazil. The flights took place between 20 June 2015 and 7 July 2015. The LiDAR dataset has an average density of returns of ~45 ppm2 (points per square meter). The average altitude of the flights was ~850 m, with a field of view of 12◦. Two LiDAR sensors (Optech Orion M300 and Optech ALTM 09SEN243) were used in these campaigns, but the percentage of flight line overlaps was high (~65%). LiDAR returns of all 15 plots are shown from Figure S6 to Figure S20 in the Supplementary Materials.

The LASTools software [69] was used to process the LiDAR data and to generate the following variables derived from the LiDAR point cloud data: DSM (digital surface model), CHM (canopy height model), DTM (digital terrain model), CC (canopy cover coverage), CD (canopy density), Max-H (maximum height), Percentile\_p15, Percentile\_p10, Percentile\_p20, Percentile\_p30, Percentile\_p35, Percentile\_p55, Percentile\_p40, Percentile\_p45, Percentile\_p50, Percentile\_p60, and Percentile\_p65. Highly correlated variables were excluded, and the ones containing the most unique information, CHM, CC, and CD, were used to establish statistical models to extrapolate the spatially limited LiDAR AGB estimates to both the optical and radar observations (Figure 2).

#### 2.2.3. Optical Data

The United States Geological Survey (USGS) Landsat 8 Collection 1 Tier 1 data consists of surface reflectance products generated from the Landsat 8 Operational Land Imager (OLI), with 30-m spatial resolution. The USGS atmospherically corrects the scenes using the Landsat 8 Surface Reflectance Code (LaSRC), which also uses the C function of the mask (CFMASK) algorithm [70] to generate a cloud, shadow, water, and snow mask. We selected all the scenes acquired in the period of 2015–2017 (244 scenes) to generate a cloud and shadow-free median temporal composite (Figure 2). In addition to the spectral bands, we generated a range of vegetation indices that could potentially be representative for the vegetation canopy structure, seasonality, and a measure of vegetation greenness. These were the normalised difference vegetation index (NDVI), normalised burn ratio (NBR), normalised difference moisture index (NDMI), and the soil adjusted vegetation index (SAVI). These indices have previously been used in similar studies [21,23,71–73].

#### 2.2.4. SAR Data

We used the global 25-m resolution ALOS PALSAR/PALSAR−2 annual mosaics, which are freely available at https://www.eorc.jaxa.jp/ALOS/en/palsar\_fnf/fnf\_index.htm. This dataset was pre-processed by the Japanese Aerospace Exploration Agency (JAXA) using L-band SAR images of the backscattering coefficient acquired by the Advanced Land Observing Satellite (ALOS) and Advanced Land Observing Satellite-2 (ALOS-2). The mosaics consist of 10 × 10-degree tiles pre-processed to correct for geometric distortions (ortho-rectification) and topographic effects [74]. The mosaics were calibrated to γ0 using the following equation:

$$
\gamma^0 = 10 \ge \log\_{10}(\text{DN})^2 + \text{CF} \tag{2}
$$

where γ<sup>0</sup> is the gamma-naught in decibels (dB), *DN* is the digital number in unsigned 16 bit, and *CF* is a calibration constant of 83.0 dB.

We reduced the noise of the mosaics by applying a temporal multi-channel filter [75] with a 5 × 5-pixel moving average window. Then, we applied a temporal normalisation between PALSAR and PALSAR-2 images aiming to correct for soil moisture and sensor differences. For our final PALSAR-2 composite, we used the median value of the three mosaics (2015, 2016, and 2017) (Figure 2). We observed a geolocation error of ~80 m on average for the PALSAR-2 mosaics in comparison to HR (high-resolution) imagery and Landsat 8 scenes. We generated a Sentinel-1 SAR median γ<sup>0</sup> composite using 283 scenes from between 2016 and 2018, and we used this not as an input for our modelling but as a geolocation reference image. The PALSAR-2 composite was georeferenced against the Sentinel-1 composite using a SAR-to-SAR co-registration. This approach, instead of a SAR-to-optical approach, was chosen to avoid errors derived from the different viewing geometries of SAR and optical images. The variables used for this study were the SAR backscatter coefficients (γ<sup>0</sup> *HV* and <sup>γ</sup><sup>0</sup> *HH*) and two SAR indices—namely, the radar forest degradation index (RFDI) and the cross-polarised ratio (CpR) [74].

#### *2.3. Modelling Framework*

We adopted a two-stage upscaling method from field measurements to airborne LiDAR point clouds and from the LiDAR-based estimates to satellite imagery. We estimated the woody AGB of the field plots and used these as a reference to estimate the woody AGB across the LiDAR footprints using LiDAR-derived structural variables as a predictor. Our field plots covered a representative range of flora and structure of the woody vegetation (Table 1) from 19 Mg ha−<sup>1</sup> to 104 Mg ha<sup>−</sup>1. As the LiDAR point clouds are sensitive to the forest structure, we assumed that these plots could express the physical and structural variations of the vegetation of the study site, and we allowed the AGB LiDAR model to extrapolate for values < 19 Mg ha−1. This field-to-LiDAR procedure allowed us to increase our sampling from the 15 plots to thousands of observations derived from the LiDAR footprint covering a wide range of woody AGB and vegetation types. We then used these LiDAR AGB, representative for the vegetation in the region, in combination with EO datasets to estimate the woody AGB over the entire Rio Vermelho watershed.

The first step was to model the relationship between LiDAR woody vegetation structural variables and the AGB estimated from the field plots (Figure 2). We generated raster-based LiDAR vegetation structural variables with using a 1-m pixel size. We then used the average value of the pixels within the 20 × 50-m plots (0.10 ha) to develop our models. As we had a limited number of field plots, we had to limit the number of variables included in the empirical models as well. Based on a visual exploration of the relationships between the dependent and independent variables, as well as the distribution of each predictor, we identified three potential predictors: canopy height model (CHM), canopy density (CD), and canopy cover (CC). We then used all possible combinations of these variables to find the best generalised linear model to estimate the AGB based on the Akaike's information criterion corrected for a small sample size (AICc). This final model was used to estimate the AGB over the entire LiDAR footprint at a 30-m spatial resolution (0.09 ha). The next step was to model the relationship between the estimated LiDAR AGB and the EO datasets. These EO datasets were resampled, co-registered, and stacked at 30-m spatial resolutions (Figure 2). We used the LiDAR AGB pixels (N = 2,973) as training and test samples in a regression version of the random forest (RF) algorithm [76] in the Google Earth Engine [77], which was implemented within a k-fold cross-validation framework [73]. The following variables were used as predictors in the model: γ<sup>0</sup> *HV*; <sup>γ</sup><sup>0</sup> *HH*; RFDI; CpR from the PALSAR-2 data; and blue, green, red, near infrared, and shortwave infrared bands, as well as NDVI, NBR, NBR2, NDMI, and SAVI from the Landsat 8 data. RF is a nonparametric machine-learning algorithm that uses a bootstrap technique to construct multiple decision trees. Jackknife tests were also run to compare the importance of the predictor variables based on a R2 of the model. Due to the randomness (stochastic nature) of the RF algorithm, the importance may change for each run. Thus, the variable importance analysis was run for each k and then averaged to produce the overall importance figures. The test was performed running each single set of variables in isolation to assess the accuracy of each set. The higher the R2, the higher the importance of the variable. Then, a set of variables was excluded each time from the whole set to assess the drop in the variance explained by the model.

The k-fold cross-validation framework was used to train and validate the algorithm (Figure 2), maximising the reference data available. Figure 2 shows the overview of the method used to produce the AGB map and the uncertainty characterisation in the Rio Vermelho watershed. The whole reference dataset was used for both training and validation purposes. The dataset was stratified by three AGB levels (low, medium, and high), and randomly sampled into the folds to ensure that all folds have similar probability distribution functions of the AGB. Then, k-1 folds were used to train the RF algorithm and to produce the AGB map, while the remaining fold was used for the validation. The process was then run k times, reusing the folds for training purposes but using them only once for validation purposes. The final outputs were k AGB maps over the study area. The mean value of the k AGB maps was used as the final AGB map, and we used the standard deviation (SD) of the k estimates for any given pixel to generate a prediction error (σ*prediction*) map. This error also accounts for the representativeness of the sampling sites of the true distribution of AGB in the region [21]. The total SD (σ*AGB*) was propagated, as explained in Rodriguez-Veiga et al. [29] and Saatchi et al. [21], as follows:

$$\sigma\_{\text{AGB}} = \left(\sigma\_{\text{maximum}}^2 + \sigma\_{\text{iDDAR}}^2 + \sigma\_{\text{allometry}}^2 + \sigma\_{\text{sampling}}^2 + \sigma\_{\text{prediction}}^2\right)^{\frac{1}{2}} \tag{3}$$

where the measurement error (σ*measurement*) of the tree level parameters averaged at the plot scale was assumed as 10% [78], the LiDAR error (σ*LiDAR*) was assumed as 1.5% (from a 0.11-m instrument error for an average 7.13 m canopy height), the allometric error (σ*allometry*) was assumed to be 49.83% (Scolforo et al. [68]), and the sampling error (σ*sampling*) from the variability of AGB within the pixels was estimated as 12.39% based on Réjou-Méchain et al. [79].

Our k-fold cross-validation assessment followed James et al. [80] and involved the calculation of the R2, RMSE, rel. RMSE, and the mean bias error (MBE). Aside from the overall assessment, we also analysed the errors by biomass ranges. Finally, we also compared our AGB estimates to four other studies [21–23,81] (Table 2). Baccini et al. [23] used multi-sensor satellite data to estimate the aboveground live woody vegetation carbon density for pantropical ecosystems with a 500-m resolution. Saatchi et al. [21] mapped the total carbon stock in live biomasses (above- and belowground) in the tropics using a combination of data from in situ inventory plots and satellite LiDAR samples of the forest structure to estimate the carbon storage, plus optical and microwave imagery (1 km resolution), to extrapolate over the landscape. Santoro et al. [81] estimated AGB globally at a 100-m resolution by combining spaceborne SAR, LiDAR, and optical observations for the year 2010, with auxiliary datasets from forest inventories—namely, additional remote-sensing observations, climatological variables, and ecosystems classifications. Avitabile et al. [22] combined two existing datasets of vegetation aboveground biomasses (AGB) into a pantropical AGB map at a 1-km resolution using an independent reference dataset of field observations and locally calibrated it using high-resolution, harmonised, and upscaled biomass maps. For the comparison of our results to these maps, we aggregated all of them to the same spatial resolution (1 km).

**Table 2.** Summary of forest aboveground biomass (AGB) maps used for comparison with our study results. RMSE: root mean square error, rel. RMSE: relative root mean square error.


#### **3. Results**

#### *3.1. LiDAR-Derived AGB Map*

Table 3 shows the models tested in this study. The AGB = f(CHM, CC) model was the one that best explained the field-based AGB variations (Table 3, adj R<sup>2</sup> = 0.93, RMSE = 6.74 Mg ha<sup>−</sup>1, or 13% of the mean biomass values) and, therefore, was used to predict the AGB over the whole LiDAR flight footprint. This predicted LiDAR AGB was used to generate training and test samples to run the RF algorithm to extrapolate the AGB to the EO datasets covering the entire Rio Vermelho watershed. Although the AGB = f(CC, CD, CHM) model resulted in a slightly smaller RMSE and a higher AIC than AGB = f(CHM, CC), both models showed the same R2 (= 0.93) and RMSE (6.74 Mg ha−<sup>1</sup> or rel. RMSE = 13%). We therefore opted for the AGB = f(CHM, CC) model, as it is more parsimonious and only uses two independent variables (Table 3 and Figure S1 in the Supplementary Materials).

**Table 3.** Model comparison according to Akaike's information criterion corrected for a small sample size (AICc). Parameters for each model include: LogLik (log-likelihood), k (number of predictor variables), AICc (Akaike's information criterion corrected for a small sample size), ΔAICc (difference in AICc between the current and the best model), Adj R2 (adjusted coefficient of determination), and RMSE (root mean square error). The intercept and coefficients, as well as the coefficient of determination (R2) for each model, are also presented. CHM = canopy height model, CD = canopy density, and CC = canopy cover.


#### *3.2. AGB and Uncertainty Map*

The AGB varied from 0 to 90 Mg ha−<sup>1</sup> per pixel (Figure 3), whereas the pixel-scale uncertainty estimated by the error propagation approach ranged from 0 to 49 Mg ha−<sup>1</sup> (Figure 4). Figure 5 shows the averaged variable importance analysis across the k-fold procedure for each set of variables derived from Landsat 8 (L8) and ALOS-2/PALSAR-2 (ALOS) included in the random forest (RF) model. These results indicated that the Landsat reflectances composite was the most important set of variables when predicting AGB. However, the ALOS indices contain more unique information not represented by the other variables, as shown by the largest decrease in accuracy when the variable was excluded. Figure 6 shows the overall accuracy assessment and Table 4 the assessment by AGB range.

The accuracy assessment between the AGB map predicted from EO and the AGB reference from LiDAR showed R2 = 0.89, RMSE = 7.58 Mg ha<sup>−</sup>1, rel. RMSE = 43%, and bias = 0.43 Mg ha−<sup>1</sup> (Figure 6). Our map shows an underestimation of very high AGB (negative bias) and a slight overestimation of low AGB (positive bias) [28] (Table 4). We also found a RMSE of 6.39 Mg ha−<sup>1</sup> (rel. RMSE = 61%) for AGB lower than 50 Mg ha−<sup>1</sup> and a RMSE = 13.41 Mg ha−<sup>1</sup> (rel. RMSE = 19%) for AGB higher than 50 Mg ha<sup>−</sup>1. Details of the accuracy assessment by biomass range are shown in Table 4.

**Figure 3.** AGB map (30-m resolution) of the Rio Vermelho watershed in the Brazilian Cerrado, Goiás State. Zooms on the top right and bottom left of the figure show the details of the variations of the AGB between the different types of forest formations in the study area. Background image: © Bing™ aerial photo. Screenshot(s) reprinted with permission by Microsoft™ Corporation.


**Figure 4.** Error map (30-m resolution) for the Rio Vermelho watershed in the Brazilian Cerrado. Zooms on the top right and bottom left of the figure show the details of the variations of the AGB uncertainty between the different types of forest formations in the study area. Background image: © Bing™ aerial photo. Screenshot(s) reprinted with permission from Microsoft™ Corporation.

**Figure 5.** Averaged variable importance analysis across the k-fold procedure for each set of variables derived from Landsat 8 (L8) and ALOS-2/PALSAR-2 (ALOS) included in the random forest (RF) model. The R<sup>2</sup> for each single set of variables and all variables together (left) and decrease in R<sup>2</sup> for models excluding a single set of variables (right). ALOS backscatter: γ<sup>0</sup> *HV* and <sup>γ</sup><sup>0</sup> *HH*. ALOS indices: radar forest degradation index (RFDI), cross-polarised ratio (CpR). L8 reflectances: blue, green, red, near infrared, shortwave infrared-1, and shortwave infrared-2. L8 indices: normalised difference vegetation index (NDVI), normalised burn ratio (NBR), NBR2, normalised difference moisture index (NDMI), and soil adjusted vegetation index (SAVI).

**Figure 6.** Cross-validation between the AGB map predictions and AGB reference data derived from the LiDAR point clouds. The black dash line corresponds to the y = x line. RMSE: root mean square error.

Figures 7 and 8 show the comparisons between the AGB estimates produced in our study and four other studies [21–23,81]. Although these maps were estimated using different methods and at different resolutions over the pantropical area, it is interesting to observe how the different data products compared to each other. While this study and Avitabile et al. [22] presented probability distribution functions skewed towards low biomass ranges, other studies [21,23,81] showed distributions tending towards much higher AGB levels, with averages two to three times larger than those found in this study and by Avitabile et al. [22] (Figure 7). Figure 8 shows that the datasets from [21,23,81] did not adequately represent all the variations that exist in the area due to the lower spatial resolution and, also, generally showed a greater overestimation of the AGB. Although Santoro et al. [81] provided more details in terms of the distribution of AGB in comparison with the other three maps, their map still overestimated the AGB compared to our results. In the map developed in this study, one can observe the AGB variation in detail, which is corroborated by the observed high coefficient of determination (R2 = 0.89) between the AGB reference datasets and the final AGB map.

**Figure 7.** Distribution of the AGB intervals, average, and total AGB observed in this study and in four comparison maps for the Rio Vermelho watershed, Goiás State, Brazil. All the maps were resampled to 1-km resolutions.

**Figure 8.** AGB maps over part of the Rio Vermelho watershed, Goiás State, Brazil produced by this study (30 m) (**A**) and by Santoro et al. [81] (100 m) (**B**), Baccini et al. [23] (500 m) (**C**), Avitabile et al. [22] (1 km) (**D**), and Saatchi et al. [21] (1 km) (**E**).

#### **4. Discussion**

In 2014, Brazil submitted a greenhouse gas (GHG) emission report (Forest Reference Emission Level (FREL) for Reducing Emissions from Deforestation for REDD+ under the United Nations Framework Convention on Climate Change (UNFCCC)) [82] as part of the country's commitment to reduce GHG emissions from deforestation in the Amazon (FREL Amazonia). Brazil indicated in that report that its national report would be the sum of the FRELs for each of the six Brazilian biomes. The FREL report for the Cerrado biome (FREL Cerrado), together with the FREL Amazonia, showed that emissions induced by land use changes accounted for approximately 73% of the emissions in Brazilian territory [82]. This shows the importance of the Cerrado biome for REDD+ policies and for the efforts of climate change mitigation in the country, such as the national inventories initiatives as part of the Third National Communication of Brazil to the UNFCCC [63].

The reduced number of studies on the quantification and spatial variability of the total Cerrado AGB stocks limits a full understanding of CO2 emission patterns in this biome. These are especially critical for local and national climate governance strategies. Appropriate management and mitigation models will depend on the accuracy of this information. We produced the first high-resolution continuous map of AGB in an area of the Brazilian Cerrado, making use of a combination of vegetation inventory plots, airborne light detection and ranging (LiDAR) data, and satellite images (Landsat 8 and ALOS-2/PALSAR-2). Fieldwork in the Brazilian Cerrado is challenging, time-consuming, and expensive, and existing field datasets still do not entirely represent the extent of the biome. An alternative to increasing the sample size is using LiDAR by upscaling the information from field plots (which ideally should be representative of the analysed ecosystem). This strategy has been used in recent studies in other forest types [83,84]. We increased our sample size thousands of times, reaching a very high R2 of 0.93 in the AGB LiDAR estimation. This consistent result was essential to allow us to use LiDAR as a reference for our AGB modelling process using satellite images and machine learning.

In our study, the RF algorithm proved to be a good strategy to estimate the AGB in the study area (Table 4 and Figure 6). This result can be attributed to the ability of these techniques to capture the nonlinearity present in the data, as they can approximate complex functions [85]. This is an important characteristic for the modelling of vegetation, such as ecological patterns governed by nonlinear and interactive processes at any ecological scale [85], since they usually present complex behaviours [28].

Current sources of AGB information for the Cerrado are the global and pantropical maps that have a limited spatial resolution (from 100 m to 1 km) and questionable high estimations for this biome (Figures 7 and 8) [2]. Figure 7 shows the distribution by AGB intervals, average, and total AGB observed from our study and the four global and pantropical maps over the Rio Vermelho watershed [21–23,81]. This study and the one conducted by Avitabile et al. [22] presented distributions with an average of 18.66 Mg ha−<sup>1</sup> and 16.25 Mg ha<sup>−</sup>1, respectively. On the other hand, Santoro et al. [81], Saatchi et al. [21], and Baccini et al. [23] showed distributions tending towards much higher AGB levels, with averages of 33.89 Mg ha <sup>−</sup>1, 31.12 Mg ha <sup>−</sup>1, and 58.59 Mg ha <sup>−</sup>1, respectively. Saatchi et al. [21] and Baccini et al. [23] found the distributions of the AGB tending toward a normal distribution for the study site, which was not the case for the map produced by Avitabile et al. [22], Santoro et al. [81], and our study. Morandi et al. [2] recently calculated AGB values for the Cerrado biome using a large independent dataset of field plots. The study estimated an average of 20.4 <sup>±</sup> 6.0 Mg ha−<sup>1</sup> for the central areas of the Cerrado biome, where this watershed is located, which agrees with our study results. However, global and pantropical studies showed significantly higher AGB levels in comparison to Morandi et al. [2], our study (Figure 7), and to the AGB values measured in our field plots. We also observed the tendency of these products to overestimate the AGB on the lower AGB ranges, which results from the models attempting to compensate for the underestimation at the higher AGB ranges due to the potential signal saturation [28]. This agrees with previous studies that have indicated the inconsistency of these products when compared in several forest biomes [86,87]. Our regional AGB map (Figure 3) takes the local vegetation structure of the Brazilian Cerrado into account and has a spatial resolution of 30 m. Overall, the relationship between our reference data and the remote-sensing variables is relatively high (R2 = 0.89), with a root mean square error (RMSE) of 7.58 Mg ha<sup>−</sup>1. The accuracy assessment showed that our map underestimates high AGB levels and slightly overestimates low AGB levels (Table 4). This result fully meets the biomass product accuracy requirements set by the BIOMASS mission (RMSE < 10 Mg ha−<sup>1</sup> for AGB < 50 Mg ha−<sup>1</sup> and rel. RMSE < 20% for AGB > 50 Mg ha<sup>−</sup>1) [56], with a RMSE of 6.39 Mg ha−<sup>1</sup> for AGB < 50 Mg ha−<sup>1</sup> and a rel. RMSE = 19% for values > 50 Mg ha<sup>−</sup>1. It is important to highlight that we were interested in a representative range for cerradão AGB, including transitions to other vegetation types.

When analysing the statistical indicators of the goodness-of-fit (RMSE and R2) (Table 4) using the RF technique, a higher accuracy was achieved for the AGB range > 40 Mg ha−1. However, even though we are confident in our AGB estimations, we have to be cautious with the estimations of AGB values < 19 Mg ha<sup>−</sup>1, areas that are not covered by woody vegetation (e.g. grasslands, agriculture, and bare soil), as our field-to-LiDAR AGB model was developed using field plots located only in woody vegetation areas. In Figure 6, we can observe an underestimation of the highest AGB ranges (AGB > 80 Mg ha<sup>−</sup>1), which is also reflected in the negative bias for this high AGB level (Table 4). The range of predictions random forest regression can make is bound by the highest and lowest values in the training data. This can lead to overestimations in the lower value range and underestimations in the higher range [88]. The scatterplot between the satellite AGB estimates from our results and the LiDAR AGB showed this effect with the slight underestimation of high AGB (Table 4 and Figure 6). This effect was also observed in the studies of Nunes and Görgens [88] and Zhang and Lu [89].

It is reported that machine-learning techniques can provide more accurate estimates than classical regression models [88]. Silva et al. [90] evaluated the use of machine-learning techniques and mixed models to estimate the volume and AGB of individual trees in the Brazilian Cerrado. The machine-learning techniques presented, and mixed-effect models, showed similar and highly accurate results (R<sup>2</sup> = from 0.97 to 0.99 and RMSE = from 12% to 13%) [58]. According to Silva et al. [90],

because the modelling of forest resources commonly presents complex relationships among the variables, nonlinear mixed-effects modelling (NLME) and machine-learning techniques such as RF, adaptive network-based fuzzy inference systems (ANFIS), and artificial neural networks (ANNs) may be good alternative modelling techniques. In our study, using RF, we produced an AGB map of an area in the Brazilian Cerrado with a R<sup>2</sup> of 0.89 and RMSE of 7.58 Mg ha−1. Previous studies tried to estimate the AGB in savannas using satellite data and reached performances lower than the results from our work [54,91,92].

Over the last decades, several studies have used SAR, especially ALOS PALSAR L-band radar images, to estimate the AGB of tropical savannas. Mermoz et al. [93] used ALOS PALSAR data to map the AGB in savanna ecosystems in Cameroon. They argued that L-band PALSAR mosaics are suitable for the retrieval of savanna AGB (typically less than 100 Mg ha<sup>−</sup>1) at the national and continental scales. In a similar study, Carreiras et al. [27] tested a combination of field data and ALOS PALSAR backscatter intensity to reduce the uncertainty in the estimation of vegetation AGB in the Miombo savanna woodlands of Mozambique (East Africa). They applied a machine-learning algorithm, resulting in a good fit and accurate model (R<sup>2</sup> = 0.95 and RMSE = 5.03 Mg ha<sup>−</sup>1). However, since they used bootstrap samples in combination with a cross-validation procedure, the reported cross-validation statistics could be overoptimistic [22]. A combination of datasets from different sources, such as in this study, proved to be efficient when the goal was to reduce uncertainties in the AGB estimates. Recent studies have explored the combination of different data types. Braun et al. [91] used passive and active microwaves to estimate the AGB of savannas. They introduced the integration of passive brightness temperature as an additional variable for AGB estimation, based on the hypothesis that it contains information complementary to the microwave backscatter coefficient from active sensors.

As we found in our work, many studies have shown the advantages of combining optical and SAR datasets using machine-learning techniques such as RF for AGB estimations. The optical data are sensitive to the biophysical properties of the vegetation, and radar data are more sensitive to the electrical and geometrical information of the vegetation—more specifically, the moisture content and vegetation structure. Forkuor et al. [46] showed in a recent study that combining field inventory data with Sentinel 1 (SAR) and Sentinel 2 (optical) to estimate the AGB in the West African dryland forest (tropical savanna and woodlands) using a RF algorithm performed much better than either one of them alone, reaching a R<sup>2</sup> of 0.90. This corroborates our results, which have also shown an improvement in accuracy using the combination of SAR and optical datasets rather than using either one individually. Our study also showed the highest performance (R2 = 0.89) when all variables were included. The variable importance analysis (Figure 5) showed a slightly superior R2 for Landsat 8 reflectance data (R<sup>2</sup> <sup>=</sup> 0.84) in comparison to ALOS−2 backscatter data (R2 <sup>=</sup> 0.79) when mapping the AGB using a single set of data. Landsat 8 indices seem to have an almost identical performance to the Landsat 8 reflectance data, and there is no significant drop in performance when the indices are excluded (Figure 6). Conversely, the exclusion of the ALOS indices from the whole set results in the largest drop of R2 (i.e., <sup>−</sup>0.07), despite having the lowest performance when used alone (R<sup>2</sup> = 0.51). This agrees with previous studies that found a higher contribution of the optical reflectance data to estimate AGB, potentially due to the sensitivity of optical data to shadow and moisture [23,29,94], which can be a key factor in sparse vegetation such as Cerrado. Additionally, the relatively low AGB levels in the study area might provide a level playing field with L-band SAR data, which usually presents a sensitivity of the signal to higher AGB levels than optical data [95–99]. Our study highlighted the need to use remote sensing in combination with local vegetation inventories to effectively quantify the spatial variation of AGB in ecosystems of the Brazilian Cerrado. Future high-resolution maps of ABG will likely be more useful to quantify carbon emissions from Cerrado degradations at the local and regional scales. The methodology presented here has the potential to be used to generate the first of its kind AGB map of the entire Brazilian Cerrado, which is often neglected in carbon stock assessments of the South American continent.

#### **5. Conclusions**

We provided the first high-resolution map of the AGB (30-m resolution) of a Brazilian Cerrado area using a combination of field inventory plots, airborne light detection and ranging (LiDAR) data, and satellite images (Landsat 8 and ALOS−2/PALSAR−2). We used random forest (RF) models and jackknife analyses to study the importance of remote-sensing variables to quantify the AGB of cerradão at large scales. Overall, the relationship between the reference data and remote-sensing variables is strong (R2 = 0.89), with a root mean square error (RMSE) of 7.58 Mg ha−1. Spatially, our map slightly underestimated and overestimated the values of the AGB in few areas of the savanna (bias = 0.43 Mg ha<sup>−</sup>1). However, this spatial bias is similar to other AGB maps. Our study highlights the need to use remote sensing in combination with local field inventories to effectively quantify the spatial variation of AGB in the ecosystems of the Brazilian Savanna.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2072-4292/12/17/2685/s1: Figure S1: One-on-one relationship between the selected airborne LiDAR predictors (CC (%): canopy cover; and CHM (m): canopy height model) and the mean AGB values measured in the field plots (Mg ha-1). Figure S2 to Figure S5: Field photos of natural vegetation found in the Rio Vermelho watershed. Figure S6 to Figure S20: LiDAR returns.

**Author Contributions:** Conceptualisation, P.d.C.B., P.R.-V., and H.B.; methodology, P.d.C.B. and P.R.-V.; formal analysis, P.d.C.B., P.R.-V., B.Z., S.d.C.d.M., S.F., F.B., M.G., D.R., and I.R.; validation, P.d.C.B., P.R.-V., S.d.C.d.M., C.H.G.C., B.Z., and L.G.F.; visualisation; P.d.C.B., P.R.-V., V.L., Y.G., A.M.P.-P., and D.R.; original draft preparation, P.d.C.B.; supervision, H.B.; writing—review and editing, P.d.C.B., P.R.-V., B.Z., S.d.C.d.M., C.H.G.C., S.F., F.B., V.L., D.R., M.G., F.D.B.E.-S., A.M.P.-P., Y.G., J.R., K.B., L.G.F., J.Z.S., A.A., I.R., M.B., I.H.W., E.E.S., J.P.O., K.T., and H.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors were supported by the Forests 2020 project from UK Space Agency's International Partnership Programme (IPP) under the Global Challenge Research Fund (GCRF). P. Rodriguez-Veiga and H. Balzter were also supported by the UK's Natural Environment Research Council, Agreement PR140015, between the NERC and National Centre for Earth Observation (NCEO) and ESA Biomass Climate Change Initiative (CCI+) project (4000123662/18/I-NB). S.C. Miranda was supported by the Edital Universal—CNPq (445420/2014-6). M. Bustamante was supported by the INCT MC Fase 2 CNPq (465501/2014-1).

**Acknowledgments:** The authors would like to acknowledge the Sustainable Landscapes Brazil project supported by the Brazilian Agricultural Research Corporation (EMBRAPA), the US Forest Service, USAID, and the US Department of State for the LiDAR data. The authors thank the MapBiomas team who provided the land cover map of Brazilian Cerrado with the different forest formations. The authors thank the editor and the anonymous reviewers for their valuable comments, which have improved the quality of this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*
