Next Article in Journal
Evaluation of Near-Surface Gases in Marine Sediments to Assess Subsurface Petroleum Gas Generation and Entrapment
Previous Article in Journal
Assessment of the Evolution of a Landslide Using Digital Photogrammetry and LiDAR Techniques in the Alpujarras Region (Granada, Southeastern Spain)
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Land Cover Classification in an Ecuadorian Mountain Geosystem Using a Random Forest Classifier, Spectral Vegetation Indices, and Ancillary Geographic Data

1
Instituto de Ciencia, Tecnología, Investigación y Saberes, Universidad Nacional de Chimborazo, Riobamba, Provincia de Chimborazo 060150, Ecuador
2
Facultad de Ingeniería, Universidad Nacional de Chimborazo, Riobamba, Provincia de Chimborazo 060150, Ecuador
3
Facultad de Ciencias Forestales y Ambientales, Universidad de Los Andes, Mérida, Estado Mérida 5101, Venezuela
4
Facultad de Ciencias, Universidad de Los Andes, Mérida, Estado Mérida 5101, Venezuela
5
Facultad de Ciencias, Escuela Superior Politécnica de Chimborazo, Riobamba, Provincia de Chimborazo 060150, Ecuador
*
Author to whom correspondence should be addressed.
Geosciences 2017, 7(2), 34; https://doi.org/10.3390/geosciences7020034
Submission received: 28 February 2017 / Revised: 19 April 2017 / Accepted: 21 April 2017 / Published: 3 May 2017

Abstract

:
We presented a methodology to accurately classify mountainous regions in the tropics. These landscapes are complex in terms of their geology, ecosystems, climate and land use. Obtaining accurate maps to assess land cover change is essential. The objectives of this study were to (1) map vegetation using the Random Forest Classifier (RFC), spectral vegetation index (SVI), and ancillar geographic data (2) identify important variables that help differentiate vegetation cover, and (3) assess the accuracy of the vegetation cover classification in hard-to-reach Ecuadorian mountain region. We used Landsat 7 ETM+ satellite images of the entire scene, a RFC algorithm, and stratified random sampling. The altitude and the two band enhanced vegetation index (EVI2) provide more information on vegetation cover than the traditional and often use normalized difference vegetation index (NDVI) in other settings. We classified the vegetation cover of mountainous areas within the 1016 km2 area of study, at 30 m spatial resolution, using RFC that yielded a land cover map with an overall accuracy of 95%. The user’s accuracy and the half-width of the confidence interval for 95% of the basic map units, forest (FOR), páramo (PAR), crop (CRO) and pasture (PAS) were 95.85% ± 2.86%, 97.64% ± 1.24%, 91.53% ± 3.35% and 82.82% ± 7.74%, respectively. The overall disagreement was 4.47%, which results from adding 0.43% of quantity disagreement and 4.04% of allocation disagreement. The methodological framework presented in this paper and the combined use of SVIs, ancillary geographic data, and the RFC allowed the accurate mapping of hard-to-reach mountain landscapes as well as uncovering the underlying factors that help differentiate vegetation cover in the Ecuadorian mountain geosystem.

1. Introduction

Vegetation cover is the set of biophysical attributes of a land surface; its behavior is the cumulative result of several factors that exert control, including climate, soil, altitude gradients, physiography, and biological aspects. Thus, vegetation cover can be a measurable indicator of the functionalities of a mountain ecosystem. Changes in vegetation cover reflect alterations in natural factors that impact vegetation growth and its performance, as well as the occurrence of anthropogenic factors [1]. The information obtained from monitoring changes in vegetation cover and land use can quantify the effects of primary sources of soil degradation such as deforestation, and the dynamic alteration and transformation of land use over time. Also, this information can serve as essential input in an early warning system for the possible occurrence of potential and irreversible changes in the functionalities of a mountain ecosystem.
The ability to remotely map the vegetation cover of mountain geosystems makes it possible to perform environmental analyses that cannot be easily conducted in the field while monitoring changes in land use. In this context, the additional information provided by geographical information systems and the SVIs are a fundamental requirement for understanding both natural and human-induced changes to vegetation cover and their implications [2]. Vegetation cover analysis is based on defining a classification scheme and a categorization method that allows the identification of primary units. Categorization is a standard application of satellite images. Current attention is shifting from methods of statistical classification (parametric) to methods based on machine learning (ML) or other non-parametric methods. This shifting is because ML methods do not have data distribution assumptions and can handle complex feature spaces and non-normal data. Existing literature suggests that the RFC offers great potential and achieves better results in the categorization of complex scenes [3]. In general, the RFC makes very precise classifications, provides information about the importance of the predictors (variables), classifies outliers, estimates missing data, and gives an estimate of the error rate associated with the prediction. The accuracy and importance of the predictors (variables) are automatically generated, and overfitting is not a problem. Also, RFC is not sensitive to outlier data values and contains a set of parameters that is easy to initialize.
Images from the Landsat satellites are useful for detecting vegetation cover changes, with sufficient detail to provide relevant information for understanding processes and supporting decision-making if appropriate categorization schemes and classifiers are used. Unfortunately, the mapping of vast areas in complex landscapes is difficult due to abrupt environmental changes of humidity, altitude, temperature, and topographic. Páramos typically dominate Ecuadorian mountain geosystems, and anthropogenic factors are the primary source of degradation. Cultivation rapidly reduces the functionality of páramos [4]. This accelerated soil degradation causes a rapid advancement of the agricultural frontier and an accelerated and “irreversible” deterioration of the proper functioning of the páramo soils. It is, therefore, necessary to evaluate vegetation cover in Ecuadorian mountain geosystems efficiently and rapidly to gain a better understanding of the underlying factors that influence vegetation cover, monitor vegetation cover changes, and improve our ability to address emerging incidents promptly.
The objectives of this study were to (1) map vegetation using the RFC, spectral vegetation index, and ancillar geographic data (2) identify important variables that help differentiate vegetation cover, and (3) assess the accuracy of the vegetation cover classification in hard-to-reach Ecuadorian mountain region.
To achieve these objectives , we combined spectral vegetation indices derived from Landsat 7 ETM+ satellite with ancillary geographic data to configure a subset of data to train the RFC algorithm. We subsequently mapped the spatial distribution of the vegetative land cover.

2. Materials and Methods

The land cover assessment is from March 2011, and the land cover changes are not estimated. Post classification accuracy assessment was conducted by generating stratified random points within each vegetation class; points were attributed to their actual land cover type through the expert opinion and to their class by intersecting with the classified image.

3. Studied Area

The studied area is a mountainous and rugged area with very irregular topography in the Ecuadorian Andes, situated in the Achupallas parish in the southwest of Sangay National Park, province of Chimborazo, Ecuador. It is located 300 km south of the city of Quito and covers an area of 1016 km2, which is in the rectangle defined by the UTM coordinates x = 743,089.8; y = 9,760,133.5 and x = 782,504.2; y = 9,715,844.1. Four micro-watersheds (MW) intersect the study area: Ozogoche (OMW), Zula (ZMW), Jubal (JMW), and Pulpito (PMW) (Figure 1).
Ozogoche and Zula MWs have a gradient mostly ranging from undulating to steep (5% to 25%). Jubal MW exhibits a gradient ranging from steep (12% to 25%) to very steep (50% to 70%). Pulpito MW has slopes ranging from inclined (12% to 25%) to steep (30% and 50%), with steep slopes predominating. Ozogoche MW has altitudes ranging from 3608 to 4124 masl. Zula MW has altitudes ranging from 3093 to 4124 masl. Jubal MW has altitudes ranging from 2062 to 4124 masl. Pulpito MW has altitudes ranging from 2162 to 3608 masl.
The studied area is at the junction of Azuay between Hoya del Chanchán and Hoya del Cañar. The junction of Azuay is a mountain massif formed by high transverse geological structures characterized by a wide range of topographies. It is a transition zone dominated by deep valleys and steep slope hills, where it is difficult to distinguish the Cordillera Real (made up of an ancient volcano-sedimentary rock structure that belongs to the calc-alkaline-andesite-dacite series) and the Cordillera Occidental (made up of metamorphic rocks that correspond to the andesite-dacite-rhyodacite series) [5].
In Ozogoche MW, there are bodies of natural water, páramo areas, and cultivated pasture areas, with the páramo areas predominating. In Zula MW, there are pasture areas, planted forest areas, and páramo areas. In Jubal MW, there are areas of páramo, cultivated pasture, planted forest areas, and natural forest, with the páramo areas predominating. In Pulpito MW, there are areas of páramo and natural forest. The soil in the studied area is of volcanic origin, with Andosols in most of the area and Inceptisols and Histosols to a lesser extent. The soils texture types are silty, sandy, and sandy loam. Underdeveloped soils can also be found on rocky substrates. The soil textures of Ozogoche MW are silty and clay. In Ozogoche MW, under-developed soils can be seen on the rocky substrate. The soils texture in Zula MW is silty with inclusions of fine sand and sand. The soils of Jubal MW are silty, sandy loam, and sandy clay loam. The soils of Pulpito MW have a silty-sandy loam texture. In Pulpito MW, under-developed soils can be seen on the rocky substrates.
In the northern part of the studied area, the Achupallas weather station (INAMHI: M5140) recorded an average annual temperature of 10.6 °C, an annual average relative humidity of 73%, and an average annual rainfall of 694 mm [6]. The INAMHI: M5140 weather station is at UTM coordinates 748,006 and 9,747,233. In the southern part of the studied area, JMW the Jubal weather station (INAMHI: M5138) recorded an average annual temperature of 8.2 °C, an annual average relative humidity of 85.6%, and an average annual rainfall of 981 mm [6]. The INAMHI: M5138 weather station is at UTM coordinates 756,205 and 9,734,265 [6]. The average wind speed is 2.16 m s−1; the faster winds are in the OMW sector, where the maximum speed reaches 6 m s−1.
The studied area covers the southwest of Sangay National Park. This park is one of the most important protected areas in the tropical Andean region and the third most extensive in the continent of Ecuador. The Sangay National Park was declared a UNESCO World Heritage Site in 1983; it is host to a wide variety of unaltered landscapes, high levels of endemism, and ecological diversity. The park is in a transition zone where warm air currents, loaded with moisture from the southern Amazon convergence with cold and dry air currents from the northern part of the Ecuadorian Andes. Ozogoche MW is a natural fresh water reservoir (Ozogoche lacustrine system) and provides water to the Agoyan and San Francisco Hydroelectric Plants. Zula MW provides its water to the Nizag Hydroelectric Plant. The Jubal and Pulpito MWs contribute their water resources to the Paute Hydroelectric Plant.

4. Satellite Data and Processing

This study was carried out with images captured by the Landsat 7 satellite with an Enhanced Thematic Mapper (ETM+) sensor (Landsat 7-ETM+). This satellite has a sun-synchronous orbit at an altitude of 705 km, a solar tilt of 98.2°, and a scan strip width of 185 km. It crosses the equator in a descending path at 10:00 AM. with an uncertainty of ±15 min. The Landsat 7-ETM+ has a temporal resolution of 16 days. The signal per pixel of each sensor is coded in an 8 bit binary code (256 different values) [7]. The scene of the studied area was captured on 4 March 2011 with the following tags: (1) ID: LE70100622011063ASN00; (2) CC: 81%; (3) Date: 2011/3/4; (4) Qlty: 9; (5) Product: ETM + L1T; and (6) path 010 and row 062. Images were selected in such way that the area of study was cloud free.
The level one of the processing involved the “Level 1 Terrain Corrected” using Ground Control Points (GCP) and Digital Elevation Model (DEM) that improves absolute geodetic accuracy; the WGS-84 ellipsoid was used for the UTM coordinate transformation system. The DEM was provided by INFOPLAN [8]. The georeferencing of the image was performed using PSAD-56 georeferenced topographic maps supplied by the Military Geographic Institute of Ecuador [9] and by land inspection.
The preprocessing begins with the process of atmospheric correction, reflectance, and calibration, through the Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS), and then the images are subjected to a preprocessing code where information is filtered in four stages. First, the metadata is extracted from the Landsat image, and then the digital number (DN) values are calibrated to the Top of Atmosphere (TOA). Second, the soil reflectance is corrected to the TOA by ancillary data. Third, we applied the Automated Cloud Cover Assessment (ACCA) [10], and finally we use the MODIS/6S algorithm for atmospheric correction; focusing on the radiative transfer with aerosols [11]. The soil reflectance correction and the cloud cover were obtained with pixels that were detected by the LEDAPS process. A complete image is achieved by the composition of the seven bands. Then, the image is checked for errors given by aerosols, clouds, pixel without information, salt and pepper noise, random variation impulse noise and speckle [12].
From May 2003, Landsat 7 ETM images exhibit partial loss of information because the scan line corrector were off (SLC-off). Thus, wedge-shaped information gaps are observed on both sides of the scene, and about 20% of the data was lost [13,14]. The procedure to overcome the SLC-off effect involved the selection of complementary images to compose a complete scene using gap-fill techniques [15]. If the image does not need gap-fill, we performed a cloud mask calculation, which allows generating a mask where information has the clouds and shadows generated by LEDAPS. Then, the composed image is joined with the cloud mask to obtain a cloudless image in each band of the composed image. If the image needs gap-fill, we proceeded to multi-fill the images with the image chosen to take into account the order of filling, same that is given by order of importance of the image for the filling (Table 1).
Finally, the image is re-projected to the area UTM 17 S, datum WGS 84 and the image pixels within the studied area extent were extracted. Also, we integrated land vegetation cover information from 1640 reference points probabilistically sampled. Figure 2 shows the workflow during the satellite data processing.

5. The Random Forest Classifier

The RFC bases its performance on the learning that results from training an ensemble (a large number) of decision trees (DT). The RFC provides its best prediction by counting the number of trees that have similar results and selecting the result that was predicted by the most trees. Therefore, the RFC performance is sensitive to the samples used for training [16]. One of the relevant aspects of the RFC is the random sampling of the training data space. The RFC algorithm randomly selects subsets of data (bootstrap) that serve to train an equal number of DTs while attempting to avoid any correlation between these subsets. The DT can arbitrarily model complex relationships between the input and output without having to assume a normal distribution of the data. The DT can also handle complex data of different types [17]. The categorization process goes through a set of rules that determine the trajectory followed from the root node to the terminal node in each DT. The DT learns these rules by induction, using a subset of the data from the training set. Thus, DT separates classes by choosing variables and points that best separate them. The main idea is to partition predictor variables into different regions so that the dependent variable can be predicted more accurately.
According to Hastie et al. [18], the RFC does not require much tuning. Three parameters need to be set by the user in order to produce the ensemble of DTs; the number of DTs to be generated (ntree), the number of predictor variables to be selected at random and tested for the best split when growing the DTs (ptry), and the minimun amount of data in the terminal node (nodesize). We found that the magnitude of the error was the smallest with ntree = 500, ptry = 3 and nodesize = 3, and the classification accuracy stabilize much before the 500 DT is achieved.
How is a new object classified using RFC? We used the vector containing all the attributes of the new object as input for each of the DTs that make up the full set. The output of each DT is a classification (one vote). The RFC selects as its prediction the result predicted by the most trees with the highest number of votes [17].
Figure 3 shows the structure and flow of information in the context of the RFC. Figure 3 illustrates only the flow of information using a portion of the data; that portion was divided into one data set for DT training and another data set for validation (“out-of-bag” (OOB) data subset), error calculation, and estimating the importance of the variables.
The RFC quantifies the importance of the variables in two ways. (1) The first relates to the gain of information achieved when the partition is performed on each node of each DT. This value is accumulated for each predictor in all DTs; a higher value indicates a more relevant predictor for performing the partition and therefore the classification; (2) The algorithm also uses the samples in an OOB subset to develop a measure of the importance of the predictors and thus estimate the predictive strength of each predictor. Once the DTs have been trained, all the data of the OOB subset go through all the DTs, and the accuracy of the prediction is recorded. The values of predictor j are randomly permuted in the OOB subset, all the data goes through the DTs, and once again, the accuracy of the prediction is recorded. This permutation causes a decrease in the accuracy of all DTs; all the decreases are averaged, and the resulting value is used as an indicator of the importance via permutations of predictor j. The random permutation effect effectively nullifies the effect of the variable, similar to what occurs when we set a coefficient in a linear model to zero [18]. In this study, we used the variables importance indicator via random permutations to find the key variables that are highly related to the effectiveness of the classifier. We will also use this indicator to determine a small number of predictors that are “sufficient” for making accurate predictions or forecasts.
We selected the relevant variables from the group of predictor variables that were initially proposed regarding the quantification that the RFC performs of the importance of the variables and the average error in the final prediction. In this way, it was also possible to optimize the correction factor L, which accounts for the reflectivity of the soil in the analytical expression that determines the soil adjusted vegetation index as defined in the Equation (2). Similarly, we found the optimal value of constant “a” in the equation of the wide dynamic range vegetation index, see Equation (5).
We did not analyze autocorrelation relationship between different variables because correlated variables do not affect the predictive performance of a RFC. Since each split in each tree is the split that yields the maximum improvement. One consequence of correlated variables is that the variable importance scores among variables that are correlated lower than what they would be if there were no correlated variables. The issue of correlates variables is prominent in high-dimensional problems like in genomics [19]. Strobl et al. [19] developed a conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance reflects the true impact of each predictor variable more reliable than the original permutacion approach. We did not implement the conditional permutation approach because our objective was to find a small number of predictor variables sufficient for a good performance of the RFC and we did not want to find important variables for interpretation purpose. Similarly, we did not perfom an independent validation becauce it has been stated that one compelling component of RFC is that there is no need for independent validation [20]. One issue with RFC arises when the clases in categorical response variables are imbalance. Imbalances in the response variable result in biased classification accuracy. This is due to the boostrap over-representing the majority class, leading to under-prediction of the minority class [20]. We circunved this problem by training the RFC with equal allocation of samples in each stratum.

6. Acquisition and Use of the Field Data

We divided the land cover into four non-overlapping subpopulations (stratum): forest (FOR), páramo (PAR), crop (CRO) and pasture (PAS). Within the stratum, samples are selected by the simple random protocol. We used two field datasets in which the ground samples were classified to the correct class. One for training the RFC, assess the RFC performance and predict the whole map of the studied area and the second for the error-adjusted predicted area and map accuracy assessment. To generate the datasets it was necessary to resort to auxiliary information due to the retrospective nature of this study. The field datasets were generated by an analysis of supervised classification plus field validation scheme using pre-existing land-cover maps (PLCM) provided by the Enviroment Minister of Ecuador. More specifically, the PLCM was reclassified into four categories, and sites were sampled randomly from each category. We assessed the corred classifications of the fiel data ussing the maximum likelihood algoritm.
The first dataset (the ground reference dataset) was generated by a stratified random sampling method of the overall area of study. The total number of samples were 740 and the number of samples per class was kept equal to 185 (equal allocation). Roughly 2/5 of the data from the reference set was used for training the RFC and produces the classification rules for the predictive model. Although there is no sampling design requirement for training an RFC, we calculated the sample size to ensure the representation of the target class (see the first row in Table 2). The full reference data set (740 pixels) was left to estimate the accuracy of the RFC (OOB error) and the importance of the variables in an unbiased manner. Lawrence et al. [21] and Zhong and Gong [22] reported that the OOB error in the RFC could be used as a reliable measure of classification accuracy [17]. An error matrix expressed regarding pixel count was subsequently constructed to compute the overall accuracy (OA), user’s accuracy (UA) and producer’s accuracy (PA) as criteria for evaluating the overall RFC performance.
The second dataset (e.g., 900 pixels) was generated by a stratified random sampling method of the overall area of study, following the Neyman optimal allocation (NOA) of samples. The NOA minimize the variance of the estimator of OA given a specified total sample size.

7. Data Processing

We used the RFC incorporated into commercial software from Salford Systems version SPM 8 San Diego CA USA. The RFC was used to differentiate units (pixels) of land cover within the area of study. Figure 4 illustrates the methodological framework, the workflow of data processing, and accuracy assessment.
First, we trained the RFC and assessed its performance. In the RFC training, we get a set of rules (model) that make up 500 DTs. The model is used to predict the class where the sample belongs and the whole map of the study area. An error matrix expressed regarding pixel count was subsequently constructed to compute the OA, UA, and PA as criteria for evaluating the overall RFC performance (Table 2). Second, we used the model and the whole dataset from the scene’s image (1,068,065 pixels) to get a first estimation of the stratum’s area proportions and a prediction of the map of the studied area. A population error matrix expressed regarding proportions was constructed to compute the stratum’s area proportions and to assess map accuracy (Table 2). Third, we assess the map accuracy, error-corrected areas, and the confident intervals through the equation in Table 3. The equations in Table 4 allowed assessing quantity disagreement and allocation map disagreements [23]. The standard tools for assessing agreements between maps are Cohen’s kappa coefficient and weighted kappa coefficient [24], but Pontius and Millones [25] showed that the overall disagreement can be decomposed into the quantity disagreement and the allocation disagreement. We assess the resulting disagreement upon equal sample allocation and Neyman sample allocation. The overall disagreement reflects the level of difference between the estimated (user) map and the reference (producer) map. The overall allocation disagreement reflects the difference between the reference map and a comparison map that is due to the less than a maximum match in the spatial distribution of the categories, given the total category of the maps [24,25,26]. If the quantity disagreement is (relatively) high, there are substantial differences in the category totals of the maps. If the allocation disagreement is high, there are substantial differences in the spatial allocation of the categories. If one of the map-level disagreement measures is high, category-level measures can be used to investigate which categories cause the disagreement [24,25].

8. Spectral Vegetation Indices

The spectral vegetation index (SVI) is a number that expresses the physiological and biophysical characteristics of the vegetation. SVIs are derived from the solar reflectance spectrum of vegetal material. The solar reflectance spectrum is characteristic of each plant species or the reflective surface of water bodies, pavements, and other surfaces. In general, the SVI is a function of one or more solar reflectivity values in the same number of spectral bands. Knowing the spectral signature of water and other elements that coexist in vegetal material allows for inferring biophysical and physiological aspects of interest in the vegetal material responsible for the total reflectance spectrum.
The reflectivity values measured are converted into a numeric value through an analytical expression that determines the SVI. Thus, each SVI defines a metric for one or more physiological or biophysical aspects of the vegetation. However, the effectiveness of the SVI is affected to varying degrees by soil properties, atmospheric conditions (clouds, particles and wind), topographical conditions (altitude, slope and aspect), sunlight, and the calibration and viewing geometry of sensors [29]. Thus, a good SVI should be sensitive to these influencing factors; to achieve this, correction factors are incorporated depending on the nature of the influencing factor. Alternatively, one can determine the minimum set of spectral bands whose reflectance values only provide information about the aspect of interest and combine these values into a SVI. Thus, after slightly more than 50 years of development, there are currently over 150 SVIs proposed in existing literature, although few have been systematically evaluated [1]. With calculated SVI values, we can make inferences about the structure of vegetal material, the state of vegetation cover, photosynthetic capacity, the density and distribution of leaves, leaf water content, mineral deficiencies and evidence of the existence of stressful elements or conditions in plant development.
We can generally group SVIs into indicators that use signals from broadband, narrow band, foliar nitrogen, foliar water content, dry or senescent carbon, leaf pigment and sunlight efficiency sensors [30]. The indicators that use signals from broadband sensors comprise indices of vegetation quantity and vigor (chlorophyll content). These indices compare reflectivity values in the near-infrared and red bands, where chlorophyll absorbs photons to convert light into stored energy through photosynthesis. A reduction of the reflectance in the near-infrared band and an increase of the reflectance in the red band can occur due to an increase in chlorophyll concentration in the leaf, an increase in the leaf area, a decrease of foliage clumping and a change in foliage architecture. This simultaneous decrease/increase in these two bands is reflected in an increase of the SVI that uses broadband sensors. These SVIs are sensitive to the combined effect of chlorophyll concentration of the foliage, leaf area, foliage clumping and architecture. However, they do not provide quantitative information for any biological or environmental factor contributing to the fraction of absorbed photosynthetically active radiation (fAPAR). They do provide comprehensive information about the quantity and quality of the photosynthetic material in the vegetation, and this is essential in order to understand the state of the vegetation for any purpose. These SVIs are used, among other applications, in vegetation phenology studies, climate impact and land use assessments, and plant productivity modeling [30]. From this group of indicators that use signals from broadband sensors, we can highlight, among others, the normalized difference vegetation index (NDVI), the soil adjusted vegetation index (SAVI) and the two-band enhanced vegetation index (EVI2).
The NDVI was initially proposed by Rouse et al. [31] and is defined as the quotient resulting from subtracting the reflectivity measured in the red band (ρR) from the reflectivity measured in the near-infrared band (ρNIR) divided by the number that results from adding (ρNIR) and (ρR); see Equation (1). The symbol ρ denotes the reflectivity measured and its subscript denotes the electromagnetic spectrum band used.
NDVI = ρ NIR ρ R ρ NIR + ρ R
The NDVI value varies between −1 and 1. Vegetal material corresponds only to positive values; the higher the index, the higher the chlorophyll content. Values close to zero indicate sparse vegetation on bare soil. Materials that absorb more radiation in the near-infrared than in the visible (such as water) produce negative NDVI values [1,32]. The NDVI exhibits limitations related to the photons reflected by the ground. To overcome this limitation, Huete [33] considered the spectral signature of a typical soil and modified the NDVI expression by explicitly incorporating a correction factor (L) that accounted for the effect of the soil in a spectral indicator called SAVI; see Equation (2). Huete [33] found that an L value of 0.5 was acceptable in a wide range of vegetation covers. In reality, however, a prior knowledge of the vegetation and an iterative process are required to determine the optimal L value [34].
SAVI = ( 1 + L ) ρ NIR ρ R ρ NIR + ρ R + L
When the NDVI is plotted vs. the leaf projected area index per unit of ground surface area (LAI), it should be noted that once the LAI is greater than a certain value (near one) the sensitivity of the NDVI to the LAI value reduces significantly and its behavior becomes asymptotic [35,36]. This behavior suggests that the NDVI will never reach values close to one in dense vegetation conditions; on the contrary, at a certain LAI value, the variation will not be significant. This can be explained if we consider that all the photons in the visible range are absorbed when LAI values are high. Consequently, this behavior confirms that prior to being an indicator of LAI, the NDVI is first and foremost an indicator of photosynthetic capacity and primary production [1].
The NDVI is also sensitive to attenuation and dispersion by the atmosphere and to particles (aerosols) present in the atmosphere [35,37,38,39]. To overcome this limitation, researchers have focused on the reflectivity in the red band because the spectral components with wavelengths in the visible range are more sensitive to smoke and aerosols in the atmosphere. Kaufman and Tanre [37] corrected the effect of aerosols in the atmosphere by using the reflectivity difference between the blue and the red bands as the reflectivity value of interest. Thus, these authors proposed the atmospherically resistant vegetation index (ARVI). Another alternative that was developed to correct for the atmospheric effect in NDVI values consisted of using the reflectance in the shortwave infrared band (SWIR: 1600 to 2100 nm) as a substitute of the red band in the original NDVI expression, as the long-wavelength spectral components were less sensitive to smoke and aerosols in the atmosphere. Thus, Karnieli et al. [40] proposed the aerosol-free vegetation index (AFRI) based on the high correlations found between the reflectance values in visible bands and shortwave infrared bands. The authors replaced the reflectance value in the red band of the NDVI expression with the estimated value based on the linear relationship they found with the reflectance in the shortwave infrared band.
The enhanced vegetation index (EVI) was developed to provide greater sensitivity in regions with high biomass while minimizing the influence of the soil and the atmosphere. The EVI corrects distortions caused by particles in the air as well as the brightness of the soil. The EVI does not become “saturated” as easily as the NDVI in areas with large amounts of chlorophyll production. The EVI was originally proposed as a function of the reflectance values in the blue, red and near-infrared bands. However, subsequent studies exposed serious controversies that made its use questionable and caused its discontinuation [1]. Alternatively, when the effects of atmospheric conditions are negligible and the data quality is good, it has been proposed to replace the EVI with EVI2, which uses only two bands (red and near-infrared) and is a modified version of the NDVI—see Equation (3) [35]. The EVI2 maintains the sensitivity and linearity shown by EVI in conditions of high biomass and minimizes the influence of the soil. The EVI2 is based on the assumption that a linear relationship exists between the reflectance values measured in the red band and the blue band.
EVI 2 = 2.5 ρ NIR ρ R ρ NIR + 2.4 × ρ R + 1
The normalized difference water index (NDWI) is an indicator of foliar water content. Water classification is required to map the distribution, morphology, and connectivity over large areas. Water classification is also useful when excluding it from processing routines that are directed toward other purposes [41]. Several indicators of water have been used for identification purposes based on the solar reflectance spectrum provided by remote satellite sensors. The NDWI was initially proposed by McFeeters [42]; see Equation (4). The index uses the reflectance values measured in the near-infrared and green bands (ρR) to quantify the presence of water bodies while reducing the impact of soil reflectivity and vegetation cover. We added the McFeeters subscript to distinguish the indicator from other indices that are also identified with the letters NDWI.
NDWI | McFeeters = ρ G ρ NIR ρ G + ρ NIR
This study explored the use of multiple SVIs as predictors of the vegetation cover in our studied area. Indexes EVI2, SAVI, NDVI, and NDWI were calculated using Equations (1)–(4), respectively. However, we also used the indices “Wide Dynamic Range Vegetation Index” (WDRVI) [38] and “Visible Atmospherically Resistant Index — Green” (VARIG) [43,44] with Equations (8) and (9), respectively.
WDRVI = a ρ NIR ρ R a ρ NIR + ρ R | a = 0.05
VARI G = ρ G ρ R ρ G + ρ R
The indices are calculated on a piel-by-pixel basis and when we loss any bit we imported it using the Nearest Neighbor Interpolation algorithm [45] and an appropriate reclassification is carried out to assess their value with respect to the different types of land use. We used SVIs as preditor varibles and not the image band because the SVIs carrie information about the spectral signature of the vegetative material and the image band by itselft does not. The SVI derived from information provided by remote sensors is the primary source of information for monitoring and assessing the vegetation cover of the land surface [1,46]. For example, the vegetation analysis and detection of changes in vegetation patterns are important in the assessment and monitoring of natural resources [2].

9. Results and Discussion

Figure 5 shows the importance of the variables in the RFC performance, which decreased with the following trend: Altitude > EVI2 > WDRVI > NDVI > NDWI > SAVI > VARIG. Low values are associated with variables with little relevance for the categorization and RFC performance. Predictor variables with little importance provide limited information and do not reduce the Shannon entropy of the information. However, this hierarchical framework must be taken with caution because the correlation among the variables could have had an impact on the relative assessment of importance but do not affect the predictive performance of a RFC.
Figure 5 also shows that the topographic variable “Altitude” provided more information—with a relative importance of 100%—than any of the SVIs used in this study. Thus, “Altitude” significantly increases the accuracy of the categorization of the different basic units of vegetation cover. “Altitude” is a topographic variable derived from the digital elevation model scaled to the spatial resolution of the SVIs (30 m). The vital importance of altitude suggests that the spatial distribution of the categories of vegetation cover in the study region depends on the topography or the temperature [47]. In our studied area, the altitude varies from 2162 masl to 4124 masl. “Altitude” is a terrain variable that affects the microclimate and indirectly affects plant physiology. “Altitude” is the most relevant environmental variable in the spatial distribution of vegetation. The space allocation of forests in mountain regions is related to the altitude, slope, and aspect.
The two-band enhanced vegetation index—with a relative importance of 57.32%—was second in importance to “Altitude”, and its information contribution was higher than that provided by the NDVI. This result agrees with the fact the NDVI is directly related to primary production (photosynthesis-chlorophyll), whereas EVI2 is more related to the leaf area index (LAI) and does not lose sensitivity in areas with dense vegetation and abundant chlorophyll production [1].
The wide dynamic range vegetation index—WDRVI—exhibited greater importance (37.74%) than the NDVI (34.16%) and less importance than the EVI2 (57.32%). Studies conducted by Gitelson [36] showed that the WDRVI indicator exhibits greater sensitivity to LAI than the NDVI. This index has been widely used to estimate the “Green LAI”. We found that a constant “a” in the expression of the WDRVI indicator (see Equation (5)) required a value of 0.05 for the importance of the WDRVI in the classifier results to reach 37.74%. With a = 0.05, the WDRVI spectral index provided the most information to the classifier and contributed to an information entropy reduction. The WDRVI is an algebraic transformation of the NDVI that is implemented to increase the sensitivity of the NDVI when the LAI indicator is greater than 1.
The relative importance of the NDVI was 34.16%. Under dense vegetation conditions (LAI > 1), the NDVI never reaches values close to 1 and loses sensitivity (its variants will be insignificant) to the value of the LAI [35,36]. This because NDVI is mainly an indicator of photosynthetic capacity and primary production and not of leaf area [1].
The normalized difference water index—NDWI—had an importance of 30.38% and provided slightly less information than the NDVI and a little more than the SAVI. The NDWI separates water bodies fairly well from other bodies; the NDWI assumes negative values when the reflectance is due to vegetal material and values greater than zero when the reflectance is due to water bodies.
In general, when the vegetation is very dense, the value of constant L approaches zero, and the SAVI is equivalent to the NDVI. When the terrain is bare and has little or no vegetation cover, the value of constant L approaches 1. The soil adjusted vegetation index—SAVI—had an importance of 25.96% in RFC results. We found that constant “L” in the expression of the SAVI indicator (see Equation (2)) must have a value of 0.15 for the importance of SAVI to reach 25.96% in the classifier results.
The visible atmospherically resistant index—VARIG—had the lowest importance (17.86%) of the set of predictor variables. Previous studies conducted by Gitelson et al. [46] showed that VARIG is sensitive to the entire range (0 to 100%) of variation of the vegetal fraction (VF). Moreover, the NDVI shows high sensitivity to the VF within the range of 0 to 50% and begins to exhibit an asymptotic behavior (saturation effect) when the VF exceeds 50%. The VARIG indicator was developed to estimate crop conditions because it exhibits a good correlation with nitrogen contents (g N/m2) [43].
We included the topographic variable “Slope” as a predictor variable in the initial calculations. Our studied area exhibits slopes that range from undulating (5%) to very steep (70%). It is well known that the “Slope” can be relevant in a spatial distribution of vegetation that requires large amounts of moisture and has a significant influence on the microclimate. Because “Slope” affects water conditions in the soil as well as temperature. The slope is one of the sources of heterogeneity in the landscape that is related to the spatial distribution of the vegetation in mountain regions [48]. The tests performed showed that the slope has little importance (less than 3%) on the RFC results in the categorization of basic units of coverage in our study region. By excluding the “Slope” from the group of predictor variables, the change in the overall percentage of correct classifications did not exceed 1%; we, therefore, decided to remove this variable from the group of predictor variables. Our results showed that the slope was not useful in reducing the entropy of the information provided by the set of predictor variables. We explain this behavior in our setting because of the terrain, in general, is very irregular and has areas with various slopes and has very dense and diverse vegetation cover.
With this group of predictor variables, the overall percentage of correct categorizations made with the OOB data subset were 93.78% (see Table 5). This value is relatively acceptable if we consider the difficulties of the studied region and that we only used 247 pixels to train the algorithm. In general, these areas are located in hard-to-reach areas, possess very uneven topography and extreme climates, and offer little available local information.
Table 5 shows the overall trained RFC performance—error matrix expressed in pixel counts—and upon training with equal sample allocation in each stratum. The equations used to estimate the error matrix, UA, PA, and OA are listed in Table 5. The percentages of correct classifications of pixels from the FOR, PAR, CRO, and PAS categories were 92.43%, 94.05%, 95.14%, and 93.51%, respectively. Of the 185 pixels associated with each of the classes, the classifier correctly categorized 171 from the FOR, 174 from the PAR, 176 from CRO, and 173 from the PAS.
Table 6 lists the values of the population error matrix with cell entries expressed regarding the estimated proportion of the area, and upon the training of RFC with equal sample allocation in each stratum. The equations used to estimate the error matrix, UA, PA, and OA are listed in Table 2.
We run on the whole data set of 1,068,065 pixels from the overall scene through the predictive RFC model, and we get the map shown in Figure 6. This map shows a predicted distribution of the different vegetation cover in the study area and its predicted proportions within the whole map (see Table 6). The proportion of category’s area derived from simple pixel counts for FOR, PAR, CRO and PAS are 0.2340, 0.5403, 0.1896 and 0.0361, respectively.
Table 7 lists the values of the estimated map area for each stratum, the map accuracy, and the incertitude assessment of estimated values using the equal sample allocation in each stratum. The equations used to estimate the values are listed in Table 2. The estimated area of FOR, PAR, CROP, and PAS is 21,702 ± 1027 ha, 51,015 ± 1593 ha, 18,585 ± 1429 ha and 4823 ± 960 ha, respectively.
However, the OA is 94.80%, the UA for each class is above 90%, and the PA for the PAS is equal to 66.57% (see Table 6). Forest always surrounds PAS and PAS represents a relatively small proportion (3.61%) compared to FOR (23.40%). Thus, a small error in PAS’s pixel counts (PAS producer’s accuracy is 93.51%, see Table 5) yields smaller PA (66.57%) in the proportion of the area cover by PAS. The smaller the proportion more pixels at the boundaries and less inside, so any error counting pixel at the boundary yield bigger error pixel counts. The lowest values of the producer accuracy for the PAS (66.57%) prompted us to compose a newest error and population matrix. Therefore, we performed a Neyman sample allocation in each stratum.
Table 7 lists the number of samples in each stratum while performing a NOA of samples. We kept the RFC model trained with equal number of samples allocated to each stratum, and re-run the RFC with the new set of samples. A new error matrix and population matrix were calculated using equations in Table 2 and Table 3. From the newest error matrix, the new UA and PA for the PAS class are 82.81% and 93.51%, respectively. From the newest population error matrix, the estimated UA and PA is 82.81% and 82.28%, respectively. The new OA is 95.53%. Therefore, we find the error-corrected areas by performing a NOA of the sample in each stratum (Table 8 and Table 9).
Table 9 lists the values of the new estimated map area for each stratum, the map accuracy, and incertitude assessment of the error-adjusted areas. Thus, the NOA allows an error-adjustment for the FOR’s map area from 21,702 ± 1027 ha to 22,293 ± 772 ha; the PAR’s area from 51,015 ± 1593 ha to 52,319 ± 1057 ha, the CRO’s area from 18,583 ± 1429 ha to 18,011 ± 1014 ha; the PAS’s area from 4823 ± 960 ha to 3492 ± 596 ha (Table 7 and Table 9).
Table 10 lists the values of the overall disagreement assessment and the disagreement assessment by category.
The equal allocation stage yields a 5.20% of overall disagreement with 1.78% due to quantity disagreement and 3.42% due to allocation disagreement (see Table 10). The disagreement assessment of PAS class shows the biggest relative disagreement (22.58%) with a 72.25% of the relative disagreement because of quantity disagreement and 27.75% because of allocation disagreement. Meanwhile, the relative disagreement of the other categories remains below 10%. The NOA stage yields a 4.47% of overall disagreement with 0.43% because of overall quantity disagreement and 4.04% because of OAD. The disagreement assessment of the PAS class shows a relative disagreement of 17.45% with a 1.86% due to quantity disagreement and 98.14% due to allocation disagreement. Thus, the new sample allocation adjusts PAS quantity disagreement from 72.25% to 1.86% and the allocation disagreement from 27.75% to 98.14%.
Our results suggest that CRO had occupied areas initially from the PAR in a proportion of 0.2560 ± 0.0146. Thus, at to 2011, the PAR had lost approximately 26% (around 18,011 ± 1014 ha) of its pristine coverage due to anthropogenic factors. The PAR resilience and functionality is constantly threatened by the expansion of the CRO cover. Establishing crop promotes spreading of the agricultural frontier and a speeded and irreversible decline of the proper functioning of PAR soils in the Ecuadorian mountain geosystem.

10. Conclusions

We had defined a methodological framework for mapping hard-to-reach mountain geosystems. The relevant factors characterizing the vegetation cover in the studied area were the topographic variables “Altitude” and the next SVIs: EVI2, WDRVI_05, NDVI, SAVI_L15, and VARIG. “Altitude” and EVI2 provide more information on vegetation cover than the traditional NDVI and SAVI, particularly when the reflectance of red is low, and the NDVI is “saturated”, as it occurs in conditions of high chlorophyll production (high density of vegetation).
Within this new framework, the RFC has shown great potential to improve the method of the study and mapping of land cover as well as the study of the underlying phenomena responsible for changes in coverage and the use of the land in complex areas such as the mountain geosystems. The RFC confirms the importance of predictor variables, regardless of their nature (whether spectral or topographic) and the information they provide (thus reducing entropy in the data set), in achieving a more accurate categorization. This feature of the RFC is important because previous studies of vegetation and land use in these complex and heterogeneous regions have required significant amounts of information from multiple sources with a large number of variables.
One apparent disadvantage of the RFC is that it is hard to understand the rules used to categorize the primary units of coverage; intrinsically, this is because there are a set number of rules and subsets of these rules that are captured in a significant number of DTs. Each DT corresponds to an ensemble or a set of predictor variables that enables the exploration of data space using the reduction of information entropy as a fundamental criterion. However, an RFC that excels in its performance demands a good statistical representation of the space, data and predictor variables. For training an RFC, any data can be used; there is no sampling design requirement. However, the reference data must be acquired using a probability sampling design that includes a randomization component. Training samples must be representative of the target classes and class balanced. Otherwise, there is no assurance the results “faithfully” represent the study area. Although the accuracy of the RFC, the estimates of class areas taken directly from a map are subject to classification error and may or may not, adequately represent the real area on the ground. The stratified estimation approach and the NOA of samples do circumvent not only the effects of classification errors, but also allows an assessment of the uncertainty of each class area estimate. The quantity disagreement and the allocation disagreement assessment light up a much more enlightened path for the study of an RFC performance in a mapping application.
The combined use of SVIs, supplementary topographic information, and algorithms such as the RFC results in the most appropriate technology for achieving the goal of mapping complex mountain landscapes as well as mapping and monitoring strategies aimed at conservation and management. The methodological framework developed in this research based on the Random Forest Classifier is an appropriate and highly accurate method of classifying land cover when remotely sensed spectral imagery and ancillary data are combined. The resulting land cover classification can be both accurate and usable on a large spatial scale.

Acknowledgments

The authors express their gratitude to the Prometheus Project of the Secretary of Higher Education, Science, Technology and Innovation of the Republic of Ecuador (Proyecto Prometeo de la Secretaria Superior, Ciencia, Tecnología e Innovación de la República del Ecuador) and to the Vice-Rectorate of Postgraduate Studies and Research of the National University of Chimborazo (UNACH) for funding this research through the project “Soil organic carbon evaluation and sequestration in ecuadorian paramos ecosistem” and the project “Caracterización biogeográfica de la subcuenca hídrica para la adaptación al cambio climático considerando el paisaje cultural andino en la parroquia Achupallas, Cantón Alausí, Provincia de Chimborazo”.

Author Contributions

J.A. Ayala-Izurieta, C. Márquez and V. J. García conceived and designed the experiments; J.A. Ayala-Izurieta and D.A. Damián-Carrión performed the experiment; C. Márquez, V.J. García and C.G. Recalde-Moreno Analyzed the data; M.V. Rodríguez-Llerena and C.G. Recalde-Moreno contributed reagents/materials/analysis tools; C. Márquez and V. J. García wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yengoh, G.T.; Dent, D.; Olsson, L.; Tengberg, A.E.; Tucker, C.J., III. The Use of the Normalized Difference Vegetation Index (NDVI) to Assess Land Degradation at Multiple Scales: Current Status, Future Trends, and Practical Considerations; Springer: New York, NY, USA, 2015; p. 100. [Google Scholar]
  2. Ahmad, F. Spectral vegetation indices performance evaluated for Cholistan Desert. J. Geogr. Reg. Plan. 2012, 5, 165–172. [Google Scholar]
  3. Wiesmair, M.; Feilhauer, H.; Magiera, A.; Otte, A.; Waldhardt, R. Estimating Vegetation Cover from High­Resolution Satellite Data to Assess Grassland Degradation in the Georgian Caucasus. Mt. Res. Dev. 2016, 36, 56–65. [Google Scholar] [CrossRef]
  4. Podwojewski, P.; Poulenard, J. La Degradación de los Suelos en los Páramos. Available online: https://es-static.z-dn.net/files/d5c/a9751b4293c82c26ef428db85b953254.pdf#page=27 (accessed on 2 May 2017).
  5. Stern, C.R. Active Andean Volcanism: Its Geologic and Tectonic Setting. Revista Geológica de Chile 2004, 31, 161–206. [Google Scholar] [CrossRef]
  6. UNACH. Isoyetas en la Provincia del Chimborazo [Internet]. Riobamba, Ecuador, 2011. Available online: dspace.unach.edu.ec/bitstream/51000/3007/1/UNACH-ING-CIVIL-2016-0030.pdf (accessed on 2 May 2017).
  7. USGS. Global Visualization Viewer 2014. Available online: http://glovis.usgs.gov/ (accessed on 22 April 2017).
  8. INFOPLAN. Información para el desarrollo—Oficina de planificación de la Presidencia ODEPLAN [Internet]. Quito, Ecuador, 2011. Available online: http://www.cepal.org/deype/mecovi/docs/TALLER6/7.pdf (accessed on 22 April 2017).
  9. Instituto Geográfico Militar. Instituto Geográfico Militar 1968. Available online: http://www.igm.gob.ec/cms/files/cartabase/enie/ENIEV_C2.htm (accessed on 22 April 2017).
  10. Irish, R.R.; Barker, J.L.; Goward, S.N.; Arvidson, T. Characterization of the Landsat-7 ETM+ Automated Cloud-Cover Assessment (ACCA) Algorithm. Photogramm. Eng. Remote Sens. 2006, 72, 1179–1188. [Google Scholar] [CrossRef]
  11. Kaufman, Y.J.; Tanré, D.; Remer, L.A.; Vermote, E.F.; Chu, A.; Holben, B.N. Operational remote sensing of tropospheric aerosol over land from EOS moderate resolution imaging spectroradiometer. J. Geophys. Res. Atmos. 1997, 102, 17051–17067. [Google Scholar] [CrossRef]
  12. Al-amri, S.S.; Kalyankar, N.V.; Khamitkar, S.D. A comparative study of removal noise from remote sensing image. IJCSI Int. J. Comput. Sci. Issues. 2010, 7, 32–36. [Google Scholar]
  13. Howard, S.M.; Lacasse, J.M. An evaluation of gap-filled Landsat SLC-off imagery for wildland fire burn severity mapping. Photogramm Eng. Remote Sens. 2004, 70, 877–880. [Google Scholar]
  14. Pringle, M.J.; Schmidt, M.; Muir, J.S. Geostatistical interpolation of SLC-off Landsat ETM+ images. ISPRS J. Photogramm. Remote Sens. 2009, 64, 654–664. [Google Scholar] [CrossRef]
  15. Zhang, C.; Li, W.; Travis, D. Gaps-fill of SLC-off Landsat ETM+ satellite image using a geostatistical approach. Int. J. Remote Sens. 2007, 28, 5103–5122. [Google Scholar] [CrossRef]
  16. Belgiu, M.; Drăgu, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar] [CrossRef]
  17. Louppe, G. Understanding Random Forests. Ph.D. Thesis, University of Liege, Liège, Belgium, 2014. Available online: https://arxiv.org/abs/1407.7502 (accessed on 2 May 2017).
  18. Hasti, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA.
  19. Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 1–11. [Google Scholar] [CrossRef] [PubMed]
  20. Evans, J.S.; Murphy, M.A.; Holden, Z.A.; Cushman, S.A. Modeling species distribution and change using Randon FOrest. In Predictive Species and Habitat Modeling in Landscape Ecology: Concepts and Applications; Springer Science Business Media, B.V.: New York, NY, USA, 2011; pp. 139–159. [Google Scholar]
  21. Lawrence, R.L.; Wood, S.D.; Sheley, R.L. Mapping invasive plants using hyperspectral imagery and Breiman Cutler classifications (randomForest). Remote Sens. Environ. 2006, 100, 356–362. [Google Scholar] [CrossRef]
  22. Zhong, L.; Gong, P. Efficient corn and soybean mapping with temporal extendability: A multi-year experiment using Landsat imagery. Remote Sens. Environ. 2014, 140, 1–13. [Google Scholar] [CrossRef]
  23. Warrens, M.J. Relative quantity and allocation disagreement measures for category-level accuracy assessment. Int. J. Remote Sens. 2015, 36, 5959–5969. [Google Scholar] [CrossRef]
  24. Warrens, M.J. Properties of the quantity disagreement and the allocation disagreement. Int. J. Remote Sens. 2015, 36, 1439–1446. [Google Scholar] [CrossRef]
  25. Pontius, R.G.; Millones, M. Death to Kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. Int. J. Remote Sens. 2011, 32, 4407–4429. [Google Scholar] [CrossRef]
  26. Stehman, S.V. Impact of sample size allocation when using stratified random sampling to estimate accuracy and area of land-cover change. Remote Sens. Lett. 2012, 3, 111–120. [Google Scholar] [CrossRef]
  27. Olofsson, P.; Foody, G.M.; Herold, M.; Stehman, S.V.; Woodcock, C.E.; Wulder, M.A. Good practices for estimating area and assessing accuracy of land change. Remote Sens. Environ. 2014, 148, 42–57. [Google Scholar] [CrossRef]
  28. Lehtonen, R.; Djerf, K. Introduction to Sample Design and Estimation Techniques. The Statistical Office of the European Communities: Luxembourg, 2008. Available online: http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-RA-08-005/EN/KS-RA-08-005-EN.PDF (accessed on 22 April 2017).
  29. Zhu, G.; Ju, W.; Chen, J.M.; Liu, Y. A novel Moisture Adjusted Vegetation Index (MAVI) to reduce background reflectance and topographical effects on LAI retrieval. PLoS ONE 2014, 9, 1–16. [Google Scholar] [CrossRef] [PubMed]
  30. Harris Geospatial Solutions. Vegetation Indices Background. 2016. Available online: https://www.harrisgeospatial.com/docs/backgroundvegetationindices.html (accessed on 22 April 2017).
  31. Rouse, J.W.J.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring Vegetation Systems in the Great Plains with Erts. In Third Earth Resources Technology Satellite-1 Symposium—Volume I: Technical Presentations NASA SP-351; NASA: Washington, DC, USA, 1974; p. 309. [Google Scholar]
  32. Scheftic, W.; Zeng, X.; Broxton, P.; Brunke, M. Intercomparison of seven NDVI products over the United States and Mexico. Remote Sens. 2014, 6, 1057–1084. [Google Scholar] [CrossRef]
  33. Huete, A.R. A soil-adjusted vegetation index (SAVI). Remote Sens. Environ. 1988, 25, 295–309. [Google Scholar] [CrossRef]
  34. Qi, J.; Chehbouni, A.; Huete, A.R.; Kerr, Y.H.; Sorooshian, S. A modified soil adjusted vegetation index. Remote Sens. Environ. 1994, 48, 119–126. [Google Scholar] [CrossRef]
  35. Jiang, Z.; Huete, A.R.; Didan, K.; Miura, T. Development of a two-band enhanced vegetation index without a blue band. Remote Sens. Environ. 2008, 112, 3833–3845. [Google Scholar] [CrossRef]
  36. Gitelson, A.A. Wide Dynamic Range Vegetation Index for Remote Quantification of Biophysical Characteristics of Vegetation. J. Plant. Physiol. 2004, 161, 165–173. [Google Scholar] [CrossRef] [PubMed]
  37. Kaufman, Y.J.; Tanre, D. Atmospherically resistant vegetation index (ARVI) for EOS-MODIS. IEEE Trans. Geosci. Remote Sens. 1992, 30, 261–270. [Google Scholar] [CrossRef]
  38. Carlson, T.C.; Ripley, D.A. On the relationship between NDVI, fractional vegetation cover and leaf area index. Remote Sens. Environ. 1997, 62, 241–252. [Google Scholar] [CrossRef]
  39. Ben-Ze’ev, E.; Karnieli, A.; Agam, N.; Kaufman, Y.; Holben, B. Assessing vegetation condition in the presence of biomass burning smoke by applying the Aerosol-free Vegetation Index (AFRI) on MODIS images. Int. J. Remote Sens. 2006, 27, 3203–3221. [Google Scholar]
  40. Karnieli, A.; Kaufman, Y.J.; Wald, R.A. AFRI-aerosol free vegetation index. Remote Sens. Environ. 2001, 77, 10–21. [Google Scholar] [CrossRef]
  41. Fisher, A.; Danaher, T. A water index for SPOT5 HRG satellite imagery, New South Wales, Australia, determined by linear discriminant analysis. Remote Sens. 2013, 5, 5907–5925. [Google Scholar]
  42. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  43. Cammarano, D.; Fitzgerald, G.J.; Casa, R.; Basso, B. Assessing the robustness of vegetation indices to estimate wheat N in mediterranean environments. Remote Sens. 2014, 6, 2827–2844. [Google Scholar] [CrossRef]
  44. Gitelson, A.A.; Kaufman, Y.J.; Stark, R.; Rundquist, D. Novel algorithms for remote estimation of vegetation fraction. Remote Sens. Environ. 2002, 8, 76–87. [Google Scholar] [CrossRef]
  45. Haapanen, R.; Ek, A.R.; Bauer, M.E.; Finley, A.O. Delineation of forest/nonforest land use classes using nearest neighbor methods. Remote Sens. Environ. 2004, 89, 265–271. [Google Scholar] [CrossRef]
  46. Gilabert, M.A.; González-Piqueras, J.; García-Haro, F.J.; Meliá, J. A generalized soil-adjusted vegetation index. Remote Sens. Environ. 2002, 82, 303–310. [Google Scholar] [CrossRef]
  47. Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2012, 67, 93–104. [Google Scholar] [CrossRef]
  48. Lu, D.; Weng, Q. A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 2007, 28, 823–870. [Google Scholar] [CrossRef]
Figure 1. Map of the studied area.
Figure 1. Map of the studied area.
Geosciences 07 00034 g001
Figure 2. The processing of the Landsat 7 ETM+ images.
Figure 2. The processing of the Landsat 7 ETM+ images.
Geosciences 07 00034 g002
Figure 3. Information flows in the RFC. Only five classes and M predictors are taken into account. The random examination of the space of predictors is performed by selecting three predictors at random.
Figure 3. Information flows in the RFC. Only five classes and M predictors are taken into account. The random examination of the space of predictors is performed by selecting three predictors at random.
Geosciences 07 00034 g003
Figure 4. Methodological framework, the workflow of data processing, and accuracy assessment. The accuracy assessment could be done for proportional, power, equal or Neyman optimal allocation of samples in each stratum in accord with the sampling design.
Figure 4. Methodological framework, the workflow of data processing, and accuracy assessment. The accuracy assessment could be done for proportional, power, equal or Neyman optimal allocation of samples in each stratum in accord with the sampling design.
Geosciences 07 00034 g004
Figure 5. The importance of the variables assessed suing mean decrease in OOB accuracy.
Figure 5. The importance of the variables assessed suing mean decrease in OOB accuracy.
Geosciences 07 00034 g005
Figure 6. Map obtained from the categorization generated by the RFC of all pixels associated with the study region.
Figure 6. Map obtained from the categorization generated by the RFC of all pixels associated with the study region.
Geosciences 07 00034 g006
Table 1. The images used to fill data gaps and its order of importance.
Table 1. The images used to fill data gaps and its order of importance.
Images Used to Fill the Data GapsOrder of Importance
LE70100622011063ASN001
LE70100622011159ASN002
LE70100622011287ASN003
LE70100622010204ASN004
LE70100622010236ASN005
LE70100622010252ASN006
LE70100622011223ASN007
LE70100622009217EDC008
LE70100622009089EDC009
LE70100622007052ASN0010
LE70100622006289EDC0011
LE70100622004028ASN0112
LT50100621998355XXX0113
Table 2. The equation used to estimate error matrix, population error matrix, user accuracy, producer accuracy and overall accuracy [27].
Table 2. The equation used to estimate error matrix, population error matrix, user accuracy, producer accuracy and overall accuracy [27].
EquationComments
i = j = { 1 , 2 , .. , q } q is the number of categories. The classified (map) categories (i = 1, 2, …, q) are represented by rows. The reference (ground truth) categories (j = 1, 2, …, q) are represented by columns.
n i j The number of pixels categorized as belonging to class 𝑖 and belong to category 𝑗 in the reference dataset.
n i + = j = 1 q n i j The total number of pixels categorized as belonging to class 𝑖 in the map.
n + j = i = 1 q n i j The total number of pixels belonging to the class j in the reference dataset.
U i = n i i n i + User’s accuracy (UA or Ui) is the proportion of the pixels classified (mapped) as a class i that has reference class i. UA is the complement of the probability of commission error.
The user’s accuracy represents the likelihood that a pixel belongs to a specific class and the classifier accurately assigns it, such class.
P j = n j j n + j Producer’s accuracy (PA or Pj) estimates the probability that a pixel, which is of class j in the reference dataset, is correctly classified. Producer accuracy is the complement of the probability of omission error. Producer’s accuracy expresses the likelihood of a certain class being properly recognized.
O = j = 1 q n j j / i = 1 q n i + Overall accuracy (OA or O) is the proportion of all reference pixels, which are classified correctly. Overall accuracy is the probability that a randomly selected pixel on the map is properly classified.
W i = A i m A total W i is the proportion of mapped area of class i in the map. A i m is the mapped area of category “i”. A total is the total area of the map.
p i j = W i n i   j n i + It is the unbiased and post-stratified estimator of the proportion p i j of area in cell i,j of the error matrix. p i j depends on the sampling design used. This equation is valid for equal probability sampling designs (e.g., simple random and systematic sampling) and stratified random sampling.
p i + = j = 1 q p i   j W i Each cell element p i j represents the probability that a randomly selected area is categorized as belonging to the class i and belongs to the class j in the reference dataset. As a consequence, the sum of the cells of each row p i + is equal to W i (the estimated proportion of the area of class i in the map).
p + j = i = 1 q p i j The proportion of the area of class j as determined from a reference data set (classification).
U i = p i i p i + User’s accuracy is the proportion of the area mapped as a particular category that is that category “on the ground” where the reference classification is the best assessment of ground condition.
P j = p j j p + j Producer’s accuracy is the proportion of the area that is a particular category on the ground that is also mapped to that category.
O = j = 1 q p j j Overall accuracy is the proportion of the area mapped correctly. It provides the user of the map with the probability that a randomly selected location on the map is properly classified.
Table 3. Sample size estimation and accuracy assessment [27,28].
Table 3. Sample size estimation and accuracy assessment [27,28].
EquationComments
n i = z 2 O i ( 1 O i ) d i 2 For a simple random sampling in the stratum i and targeting OA (Oi = 0.95) as the estimation objective. z represents a percentile of the standard normal distribution (z = 1.96 for a 95% confidence interval), di = 0.05 is the desire half-width of the confidence interval (HCIi ) of the OA (Oi). Thus the sample size is n = 73.
S i = U i ( 1 U i ) The standard deviation of the stratum i. U i denotes the user’s accuracy of the stratum i.
n ( i W i S i S ( O ) ) 2 The sample size for stratified random sampling with homogeneous allocation. The cost of sampling each stratum is assumed the same. The number of pixels (unit) is significant. S ( O ) symbolizes the standard error of the estimated OA that we would like to achieve. We specify a target standard error for the OA of 0.01.
n i = n N i S i i = 1 q N i S i The number of sample units n i in stratum i under optimal allocation (NOA). Ni denotes the stratum size. Si represents the population standard deviation for single stratum and it is known, or a reliable figure is available.
H C I o = z ( i = 1 q p i i ( W i p i i ) n i + ) 1 2 The half-width interval for 95% confidence in the overall accuracy, O .
H C I U i = z ( p i i ( W i p i i ) W i 2 n i + ) 1 2 The half-width interval for 95% confidence in the user’s accuracy in the category i, U i .
H C I P j = z ( p j j p + j 4 [ p j j ( i j q p i j ( W i p i j ) n i + ) + ( W j p j j ) ( p + j p j j ) 2 n j + ] ) 1 2 The half-width interval for 95% confidence in the producer’s accuracy in the category i, P i .
H C I p + i = z ( i = 1 q p i i ( W i p i i ) ( n i + 1 ) ) 1 2 The half-width interval for 95% confidence in the proportion of the area of class i as determined from reference data (classification).
A j = A tot m × ( p + j ± H C I p + j ) The estimated area of each stratum.
A i m = W i × A tot = p i + × A tot Mapped area of each stratum.
Table 4. Quantity and allocation disagreement assessment [24].
Table 4. Quantity and allocation disagreement assessment [24].
EquationComments
C = O Overall agreement or the proportion of units classified correctly.
D = 1 C = Q + A Overall disagreement.
Q = i = 1 q | p i + p + i | 2 Overall quantity disagreement. It reflects the level of difference between the estimated (user) map and the reference (producer) map.
A = i = 1 q min ( p i + , p + i ) C Overall allocation disagreement. It reflects the difference between the reference map and a comparison map that is due to the less than a maximum match in the spatial distribution of the categories, given the category totals of the maps.
q i = | p i + p + i | Quantity disagreement of the category i. It reflects the level of difference between category i in the estimated (user) map and the reference (producer) map.
a i = 2 min ( p i + , p + i ) 2 p i i Allocation disagreement of the category i.
ϕ i = q i p i + + p + i Relative quantity disagreement of the category i.
α i = a i p i + + p + i Relative allocation disagreement of the category i.
δ i = ϕ i + α i = 1 h i A relative disagreement measure of the category i.
ϕ i * = ϕ i δ i The proportion of quantity disagreement of the category i.
α i * = α i δ i = 1 | p i + p + i | p i + + p + i 2 p i i The proportion of allocation disagreement of the category i. If p i + = p + i the relative disagreement ( δ i ) is only due to allocation disagreement then α i * = 1 and ϕ 1 * = 0 . If p i i = min ( p i + , p + i ) the relative disagreement ( δ i ) is only due to quantity disagreement then α i * = 0 and ϕ 1 * = 1 .
Table 5. The error matrix showing the overall RFC performance expressed in pixel counts, nij a.
Table 5. The error matrix showing the overall RFC performance expressed in pixel counts, nij a.
Reference (Ground Truth) b,c Accuracy Assessment d, %
ClassFORPARCROPASni+UiPiO
FOR171111018393.4492.4393.78
PAR11745118196.1394.05
CRO210176118993.1295.14
PAS110317318792.5193.51
n+j185185185185740
a Map categories (classified pixel by class) are the rows, while the reference categories (reference pixel by class) are the columns, n i + and n + j are the total of the preceding colums and rows, respectively; b The cell entries are calculated using the OOB dataset; c It shows the overall classification and discrimination among four land-covers; Forest (FOR), Páramo (PAR), Crop (CRO), and Pasture (PAS); d User’s accuracy (Ui), producer’s accuracy (Pi), and overall accuracy (O).
Table 6. Population error matrix of the four classes with cell entries ( p i j ) expressed regarding the estimated proportion of the area.
Table 6. Population error matrix of the four classes with cell entries ( p i j ) expressed regarding the estimated proportion of the area.
Reference (Ground Truth) Accuracy Assessment a, %
ClassFORPARCROPAS p i + = W i U i P i O
FOR0.21870.00130.00130.01280.234093.4496.8594.80
PAR0.00300.51940.01490.00300.540396.1397.87
CRO0.00200.01000.17660.00100.189693.1291.32
PAS0.00210.00000.00060.03340.036192.5166.57
p + j 0.22580.53070.19330.0502
W i 0.23400.54030.18960.0361
a User’s accuracy (Ui), producer’s accuracy (Pi), and the OA (O). Wi is the proportion of category’s area.
Table 7. Map accuracy and incertitude assessment of the areas.
Table 7. Map accuracy and incertitude assessment of the areas.
Class U i ± H C I U i (%) P i ± H C I P i (%) O ± H C I O (%) p + j ± H C I p + j (%)Map, A i m (ha)Error-Adjusted A j (ha)
FOR93.84 ± 0.8496.85 ± 2.8294.80 ± 1.8722.58 ± 1.0721,70221,702 ± 1027
PAR96.13 ± 2.8197.87 ± 1.21 53.07 ± 1.6651,01551,015 ± 1593
CRO93.12 ± 3.6191.32 ± 6.22 18.96 ± 1.4918,58518,585 ± 1429
PAS92.51 ± 3.7766.57 ± 13.12 5.02 ± 1.00 48234823 ± 960
User’s accuracy (Ui), producer’s accuracy (Pi), overall accuracy (O), and the sum of P+j ± the HCI for 95% confidence.
Table 8. Neyman optimal allocation of the sample in each stratum, when the total number of samples is n = 900 a.
Table 8. Neyman optimal allocation of the sample in each stratum, when the total number of samples is n = 900 a.
Class p i + = W i N N i P i S i = P i ( 1 P i ) N i S i i = 1 q N i S i N i S i i = 1 q N i S i n i
FOR0.23401,068,065249,9270.96850.174743,653.46202,765.780.2153194
PAR0.5403 577,0760.97870.144483,319.61 0.4109370
CRO0.1896 202,5050.91120.284557,603.56 0.2841256
PAS0.0361 38,5570.66570.471718,189.15 0.089781
a See row four in Table 3 for equation details.
Table 9. Map accuracy and incertitude assessment of the error-adjusted areas.
Table 9. Map accuracy and incertitude assessment of the error-adjusted areas.
Class U i ± H C I U i (%) P i ± H C I P i (%) O ± H C I O (%) p + j ± H C I p + j (%)Map, A i m (ha)Error-Adjusted A j (ha)
FOR95.85 ± 2.8696.70 ± 1.8495.53 ± 1.2923.19 ± 0.8022,29322,293 ± 772
PAR97.64 ± 1.5896.92 ± 1.24 54.43 ± 1.1052,31952,319 ± 1057
CRO91.53 ± 3.3592.61 ± 4.16 18.74 ± 1.0518,01118,011 ± 1014
PAS82.82 ± 7.7482.28 ± 12.56 3.63 ± 0.6234923492 ± 596
User’s accuracy (Ui), producer’s accuracy (Pi), Overall accuracy (O), and the area proportion or sum of P+j ± the HCI for 95% confidence.
Table 10. Relative quantity and allocation disagreement measure for category-level accuracy assessment; Forest (FOR), Páramo (PAR), Crop (CRO), and Pasture (PAS). The relative quantity and allocation disagreement was performed for equal and Neyman allocation of sample in each stratum. The equal allocation is from the first map estimation stage and the Neyman optimal allocation from the second stage of accuracy assessment (Figure 5).
Table 10. Relative quantity and allocation disagreement measure for category-level accuracy assessment; Forest (FOR), Páramo (PAR), Crop (CRO), and Pasture (PAS). The relative quantity and allocation disagreement was performed for equal and Neyman allocation of sample in each stratum. The equal allocation is from the first map estimation stage and the Neyman optimal allocation from the second stage of accuracy assessment (Figure 5).
Equal AllocationNeyman Optimal Allocation
Overall disagreement assessment, %
Agreement (C)94.80 95.53
Quantity disagreement (Q)1.78 0.43
Allocation disagreement (A)3.42 4.04
Disagreement (D = Q + A)5.20 4.47
FORPARCROPASFORPARCROPAS
Disagreement assessment by category, %0.820.960.371.410.210.400.220.02
Quantity disagreement ( q i )1.422.262.610.541.532.552.771.24
Allocation disagreement ( a i ) 1.790.890.9816.310.440.370.590.32
Relative quantity disagreement ( ϕ i )3.102.116.816.273.282.357.3517.13
Relative allocation disagreement ( α i ) 4.883.017.7922.583.722.727.9417.45
A relative disagreement ( δ i )36.6429.7612.5572.2511.8513.687.401.86
Proportion of quantity disagreement ( ϕ i * ) 63.3670.2487.4527.7588.1586.3292.6098.14
* The cell entries are calculated using equations in Table 4.

Share and Cite

MDPI and ACS Style

Ayala-Izurieta, J.E.; Márquez, C.O.; García, V.J.; Recalde-Moreno, C.G.; Rodríguez-Llerena, M.V.; Damián-Carrión, D.A. Land Cover Classification in an Ecuadorian Mountain Geosystem Using a Random Forest Classifier, Spectral Vegetation Indices, and Ancillary Geographic Data. Geosciences 2017, 7, 34. https://doi.org/10.3390/geosciences7020034

AMA Style

Ayala-Izurieta JE, Márquez CO, García VJ, Recalde-Moreno CG, Rodríguez-Llerena MV, Damián-Carrión DA. Land Cover Classification in an Ecuadorian Mountain Geosystem Using a Random Forest Classifier, Spectral Vegetation Indices, and Ancillary Geographic Data. Geosciences. 2017; 7(2):34. https://doi.org/10.3390/geosciences7020034

Chicago/Turabian Style

Ayala-Izurieta, Johanna E., Carmen O. Márquez, Víctor J. García, Celso G. Recalde-Moreno, Marcos V. Rodríguez-Llerena, and Diego A. Damián-Carrión. 2017. "Land Cover Classification in an Ecuadorian Mountain Geosystem Using a Random Forest Classifier, Spectral Vegetation Indices, and Ancillary Geographic Data" Geosciences 7, no. 2: 34. https://doi.org/10.3390/geosciences7020034

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop