**Scale Matters: Spatially Partitioned Unsupervised Segmentation Parameter Optimization for Large and Heterogeneous Satellite Images**

**Stefanos Georganos 1,\*, Tais Grippa 1, Moritz Lennert 1, Sabine Vanhuysse 1, Brian Alan Johnson <sup>2</sup> and Eléonore Wolff <sup>1</sup>**


Received: 13 August 2018; Accepted: 5 September 2018; Published: 9 September 2018

**Abstract:** Geographic Object-Based Image Analysis (GEOBIA) is the most popular method for producing high-quality Land-Use/Land-Cover maps from Very-High-Resolution (VHR) imagery. A crucial step in GEOBIA is the appropriate parametrization of the segmentation algorithm prior to classification. However, little effort has been made to automatically optimize GEOBIA algorithms in an unsupervised and spatially meaningful manner. So far, most Unsupervised Segmentation Parameter Optimization (USPO) techniques assume spatial stationarity across the whole study area extent. This assumption can be questionable, particularly for applications in geographically large and heterogeneous urban areas. In this study, we employed a novel framework named Spatially Partitioned Unsupervised Segmentation Parameter Optimization (SPUSPO), which optimizes segmentation parameters locally rather than globally, for the Sub-Saharan African city of Ouagadougou, Burkina Faso, using WorldView-3 imagery (607 km<sup>2</sup>). The results showed significant spatial variation in the optimal segmentation parameters suggested by USPO across the scene, following landscape patterns, mainly the various built-up and vegetation types. The most appropriate automatic spatial partitioning method among the investigated techniques was an edge-detection cutline algorithm, which achieved higher classification accuracy than a global optimization, better predicted built-up regions, and did not suffer from edge effects. The overall classification accuracy using SPUSPO was 90.5%, whilst the accuracy of a traditional USPO approach was 89.5%. The difference between them was statistically significant (*p* < 0.05) based on a McNemar's test of similarity. Our methods were validated further by employing a segmentation goodness metric, the Area Fit Index (AFI), on building objects across Ouagadougou, which suggested that the global USPO was more over-segmented than our local approach. The mean AFI values for SPUSPO and USPO were 0.28 and 0.36, respectively. Finally, the processing was carried out in the open-source software GRASS GIS, due to its efficiency in raster-based applications.

**Keywords:** unsupervised segmentation parameter optimization; GRASS GIS; image classification; land cover; urban areas; big data

#### **1. Introduction**

Accurate and precise Land-Use/Land-Cover (LULC) maps derived from remotely sensed imagery are crucial for applications spanning several fields, including spatial planning, population estimation, environmental monitoring, and socio-economic and epidemiological modelling [1–4]. These map products not only provide useful information on their own, but also through their use as an input to secondary models (e.g., population distribution models [3], hydrological models [5], or LULC change models [6–8]). As such, maximizing the accuracy of LULC maps is a critical methodological facet in reducing error propagation and enhancing the effectiveness of science-based policy-making.

For the classification of high- and very-high-resolution (VHR) imagery in particular, Geographic Object-Based Image Analysis (GEOBIA) has been established as a superior method over traditional pixel-based approaches [9], as it produces a semantic representation of the data closer to reality than the arbitrary nature of pixels [10]. Recent studies have attempted to establish a formal ontological framework to further advance the use of objects as spatial representation units [11]. In GEOBIA, the most crucial step before classification is the clustering of neighboring image pixels into segments based on spatial, spectral, and contextual criteria [12]. These segments should ideally represent real-world objects or LC categories (e.g., building rooftops or agricultural fields) that are larger than the original image resolution [13]. As several studies have demonstrated, GEOBIA classification accuracy is affected not only by the classification algorithm itself [14], but also by the quality of the extracted image segmentation [15–18]. Consequently, the selection of an appropriate segmentation (i.e., object-creating) algorithm, as well as its parametrization, is crucial with respect to the final output [19–21].

Region-growing (RG) segmentation techniques are the most popular in the GEOBIA literature, mainly due to their implementation through the multiresolution segmentation algorithm [22] in the popular eCognition (Definiens) software [16,23–26]. The most important parameter in RG segmentation is the Threshold Parameter (TP; e.g., the Scale Parameter of the multiresolution segmentation algorithm in eCognition), which governs the average size of the created segments. The selection of this parameter is most commonly attempted through a time-consuming, user-dependent, trial-and-error process [27,28], in which the quality of the produced segmentations is assessed visually [29] or through a quantitative comparison against reference data (i.e., manually digitized polygons based on visual image interpretation) [30–32]. These approaches have been criticized as untenable due to their subjective nature and time inefficiency, whilst the improvement they can offer in classification accuracy might be limited [33]. Therefore, other research has been directed towards the development of objectively defined Unsupervised Segmentation Parameter Optimization (USPO) techniques, which evaluate individual segmentations based on geostatistical metrics and do not require reference data [34–36]. To this end, various USPO metrics have been proposed, such as the rate of change of local variance implemented through the Estimation of Scale Parameter (ESP) tool [34,37], and the optimization of objective functions such as the Global Score (GS) [38] and the F-measure [18,39], among others, with varying degrees of success. In the comparative study of Grybas et al. [23], the F-measure was found superior to the ESP and GS, potentially due to its sensitivity to over- and under-segmentation. The GS and F-measure assess spectral values within (i.e., Weighted Variance (WV)) and between (i.e., Global Moran's I (MI)) segments. Ideally, an accurate segmentation should minimize the spectral heterogeneity within segments and maximize the spectral heterogeneity between segments, so the TP found to maximize the aforementioned function is taken to be optimal [40].

So far, the optimization of segment-creating algorithms (in this study, region growing) has been attempted mainly through global methods, either at single or multiple scales [36,37]. A global approach implies that optimizing the TP using the whole extent of the study area, or a subset assumed to be representative, is adequate [15,33,41]. The vast majority of developments in recent years operate on that assumption, a situation exacerbated by the relatively small study areas that are used (<3 km<sup>2</sup> in ~95% of the recent literature on object-based land cover mapping), as pointed out in the review of Ma et al. [42]. These approaches assume spatial stationarity, i.e., that the relationship between the input data and the segment-generating process is stable across space, which is reflected in a spatially invariant TP for the whole study area. However, this begs the question: "Why is the extent of the study area in a remote sensing application automatically assumed to be the most appropriate scale at which to optimize the segmentation algorithms?". This is of increasing importance, as it has recently been demonstrated that partitioning the study area into smaller regions can provide significantly different results, highlighting the effect of geographic scale in remote sensing operations [43,44]. Spatial stationarity might hold for small homogeneous regions, but is likely unsuitable for large and/or heterogeneous scenes. It is sensible to hypothesize that the optimal TP varies intrinsically and significantly across space due to local variations in data structure, particularly in urban areas, which are known for their landscape variability even within the same LULC class. If a global approach were used in such a case, it might capture only an average and potentially misleading impression of the situation and bias the segmentation model, which could be reflected both in segmentation evaluation metrics and in classification accuracy. In recent years, few studies have tackled this issue by employing more localized or regionalized procedures.

Johnson and Xie [36] refined their global segmentation results in a two-stage procedure by re-segmenting local outliers using geospatial metrics such as Local Moran's I. Cánovas-García and Alonso-Sarría [43] demonstrated improvements in segmentation quality by optimizing the TP independently in agricultural plots, instead of selecting a single parameter for the whole dataset. However, the spatial units were selected a priori using land use parcel vectors, which requires ancillary data and expert knowledge of the study area. Recently, Kavzoglu et al. [35] proposed a regionalized multiscale approach for small, semi-urban environments, in which initial broad-scale segments, derived from the coarse segmentation selected by the ESP tool, defined further areas for calibrating segmentation parameters. Classification results were shown to improve when the parametrization of the TP was performed regionally rather than globally. The improvement local methods offer for urban LULC mapping was recently demonstrated by Grippa et al. [44], where the study area was manually delineated into morphological zones sharing similar built-up characteristics, and a separate USPO optimization was applied to each of them. Nonetheless, the operational capabilities of such methods are either restricted computationally or require tedious manual labor and user expertise that is rarely available. These limitations are important given the advent of big data, which includes the use of VHR datasets at an increasing pace [45]. As such, our effort focuses on semi-automatically identifying and quantifying the degree of spatial non-stationarity and geographic scale dependency of the algorithm parameters for large and heterogeneous satellite images [1].

Our main hypothesis questions the a priori use of global methods when heterogeneous and/or large datasets are employed. To examine it, a distinction must be made between the observation and operating scales of the TP optimization. The observation scale corresponds to the whole extent of the study area, whilst the operating scale can be a spatial delineation that better reflects the optimization of a segmentation algorithm. In simpler words, we ask the question: "Are the segmentation results better if we analyze the data locally rather than globally?".

In this paper, we present a methodological framework named Spatially Partitioned Unsupervised Segmentation Parameter Optimization (SPUSPO), in which the optimization of the TP is performed in a localized manner. The proposed methods are automated and do not require reference information. The underlying rationale of SPUSPO is based on the first law of geography [46], that "everything is related to everything else, but near things are more related than distant things", which suggests that objects near each other (e.g., the built-up characteristics of a neighborhood) have a higher degree of similarity than objects far away. The results of the local optimizations are analyzed, mapped, and quantified through spatial statistics, highlighting the variation of segmentation parameters as a function of location and spatial scale. The presented methods are evaluated at both the segmentation and the classification level. As a proof of concept, we evaluated the procedure for the large, heterogeneous city of Ouagadougou, capital of Burkina Faso. All of the analysis was performed using the open-source GRASS GIS software, along with open-access processing chains suited to VHR satellite datasets [47].

#### **2. Materials and Methods**

#### *2.1. Study Area and Data*

The study area covered the city of Ouagadougou, capital of Burkina Faso in Sub-Saharan Africa (SSA). Ouagadougou comprises a complex and heterogeneous urban landscape of planned and unplanned neighborhoods and buildings of various sizes and materials [48]. The city has been undergoing extensive and partly unregulated urban growth (i.e., rural-to-urban migration) over the last decades [49,50]. To map the LULC of the city, we used a 4-band (R, G, B, NIR) WorldView-3 multispectral image (607 km<sup>2</sup>, Figure 1) from October 2015, and a normalized Digital Surface Model (nDSM) derived from stereo image acquisitions of the same date. The native spatial resolution of the WorldView-3 imagery is 0.30 m, but it was resampled to 0.50 m by the provider. The elevation information was critical, as built-up features were very hard to visually discriminate from bare soil and artificial ground surfaces, due to the presence of dust on rooftops and the use of similar construction materials for roofs and artificial ground surfaces. Thus, this challenging study site provided a good stress test for our methods.

**Figure 1.** (**a**) Study area extent illustrated from a WorldView-3 RGB composite of Ouagadougou, (**b**) a typical built-up neighborhood of Ouagadougou and (**c**) Normalized Digital Surface Model for the region.

#### *2.2. Segmentation and Unsupervised Segmentation Parameter Optimization*

The whole LULC classification framework was realized by employing and extending the semi-automated processing chain proposed by Grippa et al. [1]. The chain was implemented in a Jupyter Notebook and integrates GRASS GIS functions with the Python and R programming languages, framing a complete procedure from the input of the initial datasets to the final LULC map production. For segmentation, we utilized the RG algorithm implementation of GRASS GIS [51], with all four bands (VNIR) used as inputs. In the GRASS implementation, the TP ranges from 0 to 1, where 0 leads to each pixel forming its own segment, while 1 unifies all image pixels into one object. As Böck et al. [52] pointed out, USPO metrics are sensitive to the range of candidate segmentations used as input, so we empirically found a range whose extremes corresponded to evident over- and under-segmentation, as commonly done in similar studies [18,53]. Thus, we evaluated 27 different segmentations, with TP values ranging from 4 to 31 in steps of 1, as in previous studies [54]. For reader convenience, all TP values reported in the text and figures were multiplied by 1000 (e.g., a reported TP of 4 corresponds to an actual threshold of 0.004).

To evaluate the quality of each segmentation, we used the inter- and intra-segment heterogeneity metrics Moran's I (*MI*) and Weighted Variance (*WV*), respectively. *MI* measures the degree of spatial autocorrelation in the values of nearby geographic features; here (as in many other USPO studies) it was used to quantify how spectrally heterogeneous segments are, on average, from their neighbors (i.e., in terms of the mean segment values calculated for each spectral band). For this reason, it provides a measure of "oversegmentation goodness": low *MI* values for a segmentation layer indicate low spatial autocorrelation between segment spectral values, suggesting that most segments belong to a different ground feature (with different spectral reflectance properties) than their neighbors. *WV*, on the other hand, describes the average spectral variability within segments (weighted by each segment's area), and provides a measure of "undersegmentation goodness": low *WV* values indicate little internal variation in the spectral properties of segments, suggesting that segments do not contain a mixture of multiple ground features. *MI* and *WV* are given by:

$$WV = \frac{\sum\_{i=1}^{n} a\_i v\_i}{\sum\_{i=1}^{n} a\_i} \tag{1}$$

$$MI = \frac{n \sum\_{i=1}^{n} \sum\_{j=1}^{n} w\_{ij} z\_i z\_j}{M \sum\_{i=1}^{n} z\_i^2} \tag{2}$$

where, for Equation (1), *n* is the number of segments, *vi* is the spectral variance, and *ai* the area of segment *i*; for Equation (2), *n* is the number of segments, *zi* = *xi* − *x̄*, where *xi* is the mean spectral value of segment *i* and *x̄* is the mean over all segments, *M* = ∑<sub>*i*=1</sub><sup>*n*</sup>∑<sub>*j*=1</sub><sup>*n*</sup> *wij*, and *wij* is the element of the spatial proximity matrix indicating the spatial connectivity of segments *i* and *j* [52,53].
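As a minimal sketch of Equations (1) and (2), the two metrics can be written as follows (our own illustrative Python, not the GRASS GIS implementation; the adjacency dictionary is a stand-in for the spatial proximity matrix):

```python
from statistics import mean

def weighted_variance(areas, variances):
    """Equation (1): area-weighted mean of the within-segment spectral variance."""
    return sum(a * v for a, v in zip(areas, variances)) / sum(areas)

def morans_i(values, W):
    """Equation (2): global Moran's I over segment mean spectral values.

    values : mean spectral value per segment
    W      : dict mapping (i, j) -> w_ij for connected segment pairs
    """
    n = len(values)
    xbar = mean(values)
    z = [x - xbar for x in values]                     # z_i = x_i - xbar
    num = n * sum(w * z[i] * z[j] for (i, j), w in W.items())
    den = sum(W.values()) * sum(zi * zi for zi in z)   # M * sum(z_i^2)
    return num / den
```

For instance, three segments whose mean values alternate between high and low (each neighbor spectrally unlike its own) yield a strongly negative *MI*, i.e., good "oversegmentation goodness".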

To perform USPO, the oversegmentation and undersegmentation goodness values calculated for each segmentation layer need to be combined into a single value, e.g., through addition [38] or the F-measure [18]. We used the F-measure to combine *MI* and *WV* in this study, as it was demonstrated to be less sensitive to excessive over- and undersegmentation than other combination approaches by Zhang et al. [39]; it is implemented in the GRASS GIS module "i.segment.uspo" [55]. To derive an F-measure from these two components, we first need to normalize them to a similar range (0–1) [38]:

$$MI\_n = \frac{MI\_{\max} - MI}{MI\_{\max} - MI\_{\min}} \tag{3}$$

$$WV\_n = \frac{WV\_{\max} - WV}{WV\_{\max} - WV\_{\min}} \tag{4}$$

where *WVn* is the normalized *WV* (or *MI*), *WVmax* is the maximum *WV* (or *MI*) value of all candidate segmentations, *WVmin* is the minimum *WV* (or *MI*) value of all candidate segmentations and *WV* is the *WV* (or *MI*) value of the current segmentation. The F-measure is a harmonic weighting of these two features:

$$F\_{\rm opt} = \left(1 + a^2\right) \frac{WV\_n \cdot MI\_n}{a^2 \cdot WV\_n + MI\_n} \tag{5}$$

where *Fopt* is the score of a candidate segmentation, ranging from 0 to 1, with higher values indicating higher quality; and *a* is the relative weight factor that assigns more significance to one metric over the other [18]. In our case, we used a relative weight of 1, indicating equal weighting of the *MI* and *WV* components in calculating *Fopt*. The procedures were fully automated and parallelized, exploiting the efficiency of GRASS GIS for applications involving large raster datasets.
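A minimal sketch of scoring candidate segmentations, normalizing *WV* and *MI* and combining them with the harmonic F-measure weighting described above (our own illustrative implementation; "i.segment.uspo" performs this internally):

```python
def f_opt_scores(wv_list, mi_list, a=1.0):
    """Score each candidate segmentation.

    wv_list / mi_list hold the WV and MI values of the candidate
    segmentations; a is the relative weight factor (a = 1 weights
    the two components equally).
    """
    wv_min, wv_max = min(wv_list), max(wv_list)
    mi_min, mi_max = min(mi_list), max(mi_list)
    scores = []
    for wv, mi in zip(wv_list, mi_list):
        wv_n = (wv_max - wv) / (wv_max - wv_min)  # undersegmentation goodness
        mi_n = (mi_max - mi) / (mi_max - mi_min)  # oversegmentation goodness
        den = a**2 * wv_n + mi_n
        # harmonic weighting of the two normalized components
        scores.append((1 + a**2) * wv_n * mi_n / den if den > 0 else 0.0)
    return scores
```

The TP whose segmentation maximizes the returned score is selected as optimal; candidates at either extreme of the range (best on one component, worst on the other) score 0.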

#### 2.2.1. Global USPO

The conventional global USPO approach involves using either the whole image extent as input to the USPO procedure, or a representative subset [43]. Since our image was very large (20 GB), we used the latter method, as depicted in Figure 2. The selected subset (10 km<sup>2</sup>) contained planned, unplanned, and industrial built-up zones, different kinds of vegetation, and bare soil, and was thus deemed an appropriate candidate. The TP resulting from applying USPO to that region was 12, and we consequently used that value to segment all parts of the WorldView-3 image.

**Figure 2.** Subset of the WorldView-3 imagery (~10 km<sup>2</sup>) where the RG's TP was optimized for use in the whole image. The selected area contains a distribution of land cover classes representative of the whole study area.

#### 2.2.2. Spatially Partitioned Unsupervised Segmentation Parameter Optimization (SPUSPO)

As mentioned in the introduction, a global optimization of the TP might not be appropriate due to the spatial heterogeneity within the image. An alternative approach is to partition the study area into several subsets and to apply the optimization procedure locally in each subset. If the segmentation level selected as optimal by a global USPO calculation differs significantly from the levels selected locally (i.e., through a local USPO calculation in each partition of the study area), a spatially non-stationary process is taking place, and a global model might not be the best candidate. To investigate this phenomenon, we partitioned the image in three automated ways. The first two partitioning methods used regularly shaped rectangular tiles of predefined sizes, while the third involved automated delineation using a cutline-creating algorithm. The predefined partitions were based on splitting the WorldView-3 image into tiles of equal area and, in most cases, equal geometry. The area of the rectangular image subsets for the first two partitioning approaches was 0.25 km<sup>2</sup> (P1) and 0.12 km<sup>2</sup> (P2), totaling 2427 and 4887 subsets, respectively (Figure 3). Although the results of predefined partitioning can be fruitful for exploratory purposes, they suffer from edge effects at their borders. Since the tiles are predefined and fixed in size, they partition the landscape arbitrarily, which can produce noisy or badly segmented objects as artifacts along the boundaries of the rectangular subsets (i.e., splitting building roofs or trees in half). To treat this issue, for the third and main partitioning approach (P3), we deployed a cutline-creating algorithm using Laplacian zero-crossing edge detection [56–58], as implemented in the 'i.cutlines' module of GRASS GIS [59]. In that way, the created subsets delineate the landscape in a more meaningful way, as they follow linear patterns such as roof edges and streets. The size of the cutline-created subsets can be decided by the user with respect to the application. In our case, we created subsets closer in size to the P2 partition, resulting in 4900 subsets. Examples of the different spatial partitioning methods are illustrated in Figure 3. In both the global and local approaches, the minimum size of a created segment was preset at 14 pixels to avoid unnecessary oversegmentation.
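For illustration, the P1/P2-style regular tiling can be sketched as a simple bounding-box subdivision (a minimal sketch; the actual chain relies on GRASS GIS region handling, and the function name and coordinate layout here are our own):

```python
def tile_extent(xmin, ymin, xmax, ymax, tile_size):
    """Split a bounding box into regular square tiles (P1/P2-style partitioning).

    Border tiles are clipped to the extent, so interior tiles share equal
    geometry while edge tiles may be smaller, as in the paper.
    """
    tiles = []
    y = ymin
    while y < ymax:
        x = xmin
        while x < xmax:
            tiles.append((x, y, min(x + tile_size, xmax), min(y + tile_size, ymax)))
            x += tile_size
        y += tile_size
    return tiles
```

With a 500 m tile side (0.25 km<sup>2</sup>, the P1 size), a 1 km × 1 km extent yields four tiles; tiles at the extent border are clipped rather than dropped, which is precisely where the edge effects discussed above arise.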


**Figure 3.** Partitioning the WorldView-3 image into spatial subsets for local USPO optimization. (**a**) Delineation by 0.250 km<sup>2</sup> area tiles, (**b**) delineation by 0.125 km<sup>2</sup> area tiles and (**c**) delineation based on the zero-crossing cutline algorithm.

#### 2.2.3. Quantifying the Spatial Variation of the Threshold Parameter

One of the merits of a localized approach is that it decomposes a global process into a wide set of useful, mappable information. Since USPO was applied locally, a unique TP was produced for each spatial subset. The deviation of the local TPs from the single TP value of the global USPO can be quantified to assess the degree of spatial non-stationarity. If there were no unexpected variation in the TP, that would suggest that a global approach is indeed adequate, ceteris paribus. Along with mapping the results, we propose a Segmentation Parameter Stationarity Index (*SPSI*), loosely based on the Stationarity Index of Osborne et al. [60] for assessing spatial non-stationarity in Gaussian models:

$$SPSI = \frac{IQR(TP\_L)}{(TP\_G + TP\_{step}) - (TP\_G - TP\_{step})} \tag{6}$$

where *TPG* is the TP of the global USPO, *TPstep* is the step parameter of the USPO procedure, and *IQR*(*TPL*) is the interquartile range of the distribution of TPs from the local approach. The interquartile range was used to mask outlier TP values that could emerge from random variation, and the denominator simplifies to 2·*TPstep*, i.e., one optimization step on either side of the global optimum. Values equal to or smaller than 1 imply stationarity, as the variation of the local TPs does not exceed what one would expect from a random process; values higher than 1 indicate significant spatial variation.
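A minimal sketch of Equation (6) (our own illustrative implementation, not part of the processing chain), computed from the list of locally optimal TPs:

```python
from statistics import quantiles

def spsi(local_tps, global_tp, tp_step):
    """Equation (6): ratio of the IQR of the locally optimal TPs to the
    width of one global step on either side of the global optimum
    (i.e., 2 * tp_step). Values > 1 suggest spatial non-stationarity."""
    q1, _, q3 = quantiles(local_tps, n=4)  # default exclusive method
    iqr = q3 - q1
    return iqr / ((global_tp + tp_step) - (global_tp - tp_step))
```

If every subset selects the same TP, the IQR is 0 and *SPSI* = 0 (stationary); local TPs spread well beyond one step of the global optimum push *SPSI* above 1.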

#### *2.3. Land Use and Land Cover Classification*

Ultimately, the segments were constructed with the aim of being labeled through a classification model. As such, another way to assess the local and global USPO methods is through the accuracy and performance of a LULC classification. The classification scheme and training data are presented below (Table 1). The training data were collected through random and stratified random sampling and consisted of 2478 objects across the city, labeled through visual interpretation by two experts during the same period. The amount of training data was selected in such a way that the addition of new data points did not significantly improve classification accuracy. Swimming pools were sampled manually due to their scarcity. To evaluate the classification results of the two methods, we used an expert-based manual delineation of Ouagadougou based on building size and density [44] (Figure 4). In each of these built-up types, we randomly sampled 150 points, adding up to a total of 1650 points, and computed the Overall Accuracy (OA), as well as the F-score for each LULC class. No overlap between training and testing data was allowed.


**Table 1.** Training objects for each LULC class and method.

**Figure 4.** Morphological delineation of Ouagadougou based on built-up size and density categories.

To classify the whole image, we computed several descriptive statistics for each segment based on the values of the pixels located within it, i.e., the values of each spectral band, NDVI, and the nDSM (min, median, mean, max, range, 1st and 3rd quartiles, and sum), as well as geometrical covariates (fractal dimension, perimeter, area, compactness). An Extreme Gradient Boosting (XGBoost, R 3.5.1) classifier was used, as it was recently shown to outperform benchmark classifiers such as Support Vector Machines in VHR LULC classifications [14]. XGBoost is an ensemble of Classification and Regression Trees based on the principle of boosting [61]. The parameters of the algorithm were tuned through Bayesian optimization [14,62] to ensure the quality of the results. Finally, we performed feature selection to reduce the computational burden and potentially increase the predictive capabilities of the model by deploying the popular Variable Selection Using Random Forests (VSURF) algorithm, which is suited to tree-based classifiers such as XGBoost [63,64]. Out of the initial 59 features, 18 were selected by VSURF to build the most discriminant, redundancy-free model.
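The per-segment spectral statistics listed above can be sketched, for a single band, as follows (an illustrative stand-in; the processing chain computes them with GRASS GIS modules across all bands, the NDVI, and the nDSM):

```python
from statistics import mean, median, quantiles

def segment_features(pixel_values):
    """Descriptive statistics of one raster band within a segment, in the
    spirit of the covariates listed above (min, median, mean, max, range,
    1st/3rd quartiles, sum)."""
    q1, _, q3 = quantiles(pixel_values, n=4)
    return {
        "min": min(pixel_values),
        "median": median(pixel_values),
        "mean": mean(pixel_values),
        "max": max(pixel_values),
        "range": max(pixel_values) - min(pixel_values),
        "q1": q1,
        "q3": q3,
        "sum": sum(pixel_values),
    }
```

One such dictionary per band (plus the geometrical covariates) forms a segment's feature vector, which is then passed to the classifier.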

#### *2.4. Segmentation Goodness Metrics*

To evaluate the effect of SPUSPO on the segmentation of buildings, we compared the cutline-based segmentation and the global approach against reference data. In detail, we manually delineated 100 buildings that were randomly selected from the pool of training data used for the LULC classification. Finally, we computed the *Area Fit Index* (*AFI*) which is a commonly used joint index of over- and undersegmentation [31,32,53]:

$$AFI = \frac{area(x\_i) - area(y\_{i\max})}{area(x\_i)} \tag{7}$$

where *xi* is the reference object and *yimax* is the largest segment intersecting *xi*. Values closer to 0 suggest a better segmentation; values > 0 imply oversegmentation, whereas values < 0 imply undersegmentation.
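A minimal sketch of Equation (7) (the function name and inputs are our own illustrative choices):

```python
def area_fit_index(ref_area, intersecting_areas):
    """Equation (7): AFI for one reference object.

    ref_area           : area of the manually digitized reference object x_i
    intersecting_areas : areas of the segments overlapping x_i; the largest
                         of these is taken as y_imax
    """
    y_max = max(intersecting_areas)
    return (ref_area - y_max) / ref_area
```

A building split into several small segments yields a positive AFI (oversegmentation), while a segment spilling well beyond the building footprint yields a negative AFI (undersegmentation); a perfect match gives 0.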

#### *2.5. Computational Requirements and Data Availability*

The computing infrastructure used for the experiments consisted of two Intel® Xeon® CPU E5-2690 processors (2 × 2.90 GHz, 16 cores, 32 processing threads) and 96 GB of RAM. Segmenting the WorldView-3 image with a single TP (tiled) required roughly 20 h of processing time, while on average a SPUSPO method required about 63 h, exploiting the parallelization of the 'i.segment.uspo' module of GRASS GIS [55]. The code, results, and processed material are openly accessible in the following repository (https://zenodo.org/record/1341116#.W3FSUvZuJ\_t) [65].

#### **3. Results**

#### *3.1. Threshold Parameter Variation*

The spatial variation of the TP was a function of the size and geometry of the subsets used for local optimization. Figure 5 demonstrates that the variation follows landscape patterns. Locations where high TP values were selected as optimal were mainly clustered around unplanned, low-rise neighborhoods, whereas locations where very low TP values were selected as optimal were mostly found in vegetated areas, potentially due to their distinctive spectral properties (high local variation in the NIR band). The local outputs of each metric used for the local USPO calculations are also enlightening with respect to the level of spatial heterogeneity of the imagery. Figures 6 and 7 confirm that MI and WV have an inverse relationship, with MI being the decisive metric in the optimization over the central and eastern unplanned regions, and WV elsewhere. The SPSI value was 1.5 for P1, and 2 for P2 and P3, indicating a non-stationary variation in optimal TP values.

**Figure 5.** Spatial variation of the threshold parameter (TP) across Ouagadougou. (**a**) WorldView-3 RGB composite, partitioning by (**b**) P1, (**c**) P2 and (**d**) P3 approaches, respectively. The TP controls the average size of the created segments.

**Figure 6.** Spatial variation of weighted variance (WV) across Ouagadougou. (**a**) WorldView-3 RGB composite, partitioning by (**b**) P1, (**c**) P2 and (**d**) P3 approaches, respectively. High values of WV indicate large intra-segment variability while low values describe more homogenous objects.

**Figure 7.** Spatial variation of Moran's I (MI) values across Ouagadougou. (**a**) WorldView-3 RGB composite, partitioning by (**b**) P1 (**c**) P2 and (**d**) P3 approaches, respectively. The higher the MI value, the stronger the effect of spatial autocorrelation between a created segment and its neighbors.

The variability of these parameters is also visualized in a set of boxplots in Figure 8. The TP variation is slightly smaller for the P1 approach than for the other two partitioning methods, possibly because the image partitions of P1 are larger than those of P2 and P3, and thus do not capture as much of the local heterogeneity in urban structure. Notably, when using smaller spatial partitions, MI tends to decrease (and WV tends to increase), which accounts for the differences in TP among the methods.

**Figure 8.** Boxplots demonstrating the variability of (**a**) the TP, (**b**) MI and (**c**) WV for the different partitioning approaches (P1, P2, P3).

#### *3.2. Land-Use Land-Cover Classification*

The results of the LULC classification were found to be affected by the segmentation quality. Figures 9 and 10 showcase how SPUSPO can enhance classification accuracy by producing segments that better fit the local environment, in various areas of Ouagadougou. Figure 9 demonstrates that, in both planned and unplanned regions, the improvement in classification results was mainly due to the cutline segmentation delineating the buildings in a less oversegmented fashion, avoiding the overestimation of built-up areas near object borders caused by the inconsistent and "patchy" nature of the nDSM predictor, which does not closely follow built-up boundaries.

**Figure 9.** Example of the LULC classification in a planned and an unplanned built-up area. (**a**) LULC classification with a global approach in a planned neighborhood, (**b**) RGB WorldView-3 composite, (**c**) LULC classification with a cutline approach in the same planned neighborhood, (**d**) LULC classification with a global approach in an unplanned neighborhood, (**e**) RGB WorldView-3 composite, and (**f**) LULC classification with a cutline approach in the same unplanned neighborhood.

The LULC classification based on SPUSPO was also superior for the vegetation and water bodies of Ouagadougou. Figure 10 demonstrates cases of confusion between low and high vegetation when a global approach is used. Additionally, the misclassification of water as built-up is significantly reduced with SPUSPO. Notably, a scene might be segmented with intrinsically different thresholds (Figure 10f), which implies that the reason the SPUSPO methods performed better is that they incorporate only the spatial information of the region being segmented, and not information from distant locations, which may not be useful at the local level.

**Figure 10.** Example of the LULC map classification in a vegetated regions and inland water bodies. (**a**) LULC classification with a global approach in a forested area, (**b**) RGB Pleiades Composite, (**c**) LULC classification with a cutline approach in a forested area, (**d**) LULC classification with a global approach in water bodies, (**e**) RGB Pleiades Composite, and (**f**) LULC classification with a cutline approach in water bodies.

The Overall Accuracy (OA) for SPUSPO and the global optimization based on the reference set was 90.5% and 89%, respectively. Moreover, the difference between them was statistically significant, based on a two-tailed McNemar's test of similarity (*p* < 0.05). The local optimization was superior in most cases, both for OA and for the per-class evaluation metrics (Table 2). The largest improvements were found in the classification of inland water and shadows (+18% and +3% increase in F1 score, respectively).
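A two-tailed McNemar's test compares two paired classifications through their disagreement counts. A minimal standard-library sketch, using the common chi-square approximation with continuity correction; the disagreement counts below are hypothetical, not the paper's:

```python
import math

def mcnemar_test(b, c):
    """Two-tailed McNemar's test with continuity correction.

    b: samples classified correctly only by method A;
    c: samples classified correctly only by method B.
    Returns (chi-square statistic, p-value) for 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical disagreement counts between the SPUSPO and global-USPO maps
chi2, p = mcnemar_test(b=62, c=38)
print(round(chi2, 2), p < 0.05)
```

For small disagreement counts an exact binomial version is preferable; the chi-square form above is the usual large-sample variant.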



An additional, indirect way to assess segmentation quality is to investigate the variable importance of the geometrical covariates. The geometrical covariates retained in the classification model after VSURF feature selection were perimeter, area, and fractal dimension. Figure 11 illustrates the positive effect a local approach has on the importance of most of these variables, further supporting the merit of using SPUSPO. Importance is interpreted here as the gain in model accuracy when a feature is used in the splits during XGBoost tree development. The importance of these covariates varies, but in all cases the local approach further enhances their predictive power for classification, since the segments better fit the variability of the local environment.
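Gain importance averages the accuracy gain of every split that uses a feature; in practice it would be read from a trained booster (e.g., XGBoost's `get_score(importance_type='gain')`). A minimal sketch of the aggregation itself, with hypothetical per-split gains:

```python
from collections import defaultdict

def gain_importance(splits):
    """Average gain per feature over all splits that use it,
    mirroring XGBoost's importance_type='gain'."""
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    return {f: total[f] / count[f] for f in total}

# Hypothetical per-split gains for the three geometrical covariates
splits = [("area", 12.0), ("perimeter", 4.0), ("area", 8.0),
          ("fractal_dimension", 3.0), ("perimeter", 6.0)]
print(gain_importance(splits))
```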

**Figure 11.** Feature importance of geometrical covariates, as derived from an XGBoost classifier, for the global and cutline segmentation-based approaches, respectively. The method used to derive importance is the gain in accuracy.

#### *3.3. Segmentation Goodness Metrics*

The results of the AFI for buildings are presented in Table 3 through several descriptive statistics. As expected, the building objects were less over-segmented with SPUSPO, because the parameter spatially adapted to the characteristics of each built-up neighborhood in Ouagadougou (Figure 12). The AFI values of the local method were consistently closer to zero than those of its global counterpart, further promoting the use of this approach.


**Table 3.** *Area Fit Index* for building objects in Ouagadougou. Values closer to 0 indicate a better segmentation; values > 0 imply over-segmentation, while values < 0 imply under-segmentation.
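Under its common definition, the AFI compares a reference object's area with that of its largest overlapping segment. A sketch under that assumption, with hypothetical areas:

```python
def area_fit_index(ref_area, overlapping_segment_areas):
    """Area Fit Index for one reference object.

    AFI = (A_ref - A_max) / A_ref, where A_max is the area of the
    candidate segment with the largest overlap with the reference.
    AFI > 0 indicates over-segmentation, AFI < 0 under-segmentation.
    """
    a_max = max(overlapping_segment_areas)
    return (ref_area - a_max) / ref_area

# Hypothetical building of 200 m2 split into segments of 140, 40 and 20 m2
afi = area_fit_index(200.0, [140.0, 40.0, 20.0])
print(afi)
```

Averaging this value over all reference buildings gives the per-scene statistics reported in Table 3.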

**Figure 12.** Example segmentations of buildings in Ouagadougou. Red indicates segments created by the global approach, while green indicates segments from SPUSPO. The decrease in over-segmentation is evident in most cases, as the parameters are derived from neighboring locations and better fit the data structure.

#### **4. Discussion**

The results suggest that the benefits of performing SPUSPO are multiple. To start with, it allows the local variations in spectral and spatial heterogeneity within an image to be incorporated into the segmentation parameter optimization, which is more intuitive because the optimization is derived from the actual locations where it is applied, and not from locations situated far away. This supports the hypothesis that in large and heterogeneous areas a single TP may be inadequate, as it is simply an average expression of several non-stationary processes. The results confirm a prior analysis in another Sub-Saharan city, Dakar, where a semi-automated local approach outperformed classical optimization methods [54]. Moreover, several other studies have described how regionalized approaches can be of merit for urban, semi-rural, and agricultural environments [35,43,44]. Nonetheless, an important facet that has been neglected so far is how to partition the landscape in geographically large areas in conjunction with VHR imagery, in the absence of reference data such as parcels or blocks. For a continuous LULC map, an appropriate delineation of the image is important: it must adjust to landscape patterns, such as streets or roofs, as much as possible to avoid or reduce edge effects. Although all local approaches showed merit, the cutline-based partition specifically helped address these issues. Undertaking SPUSPO produced higher classification accuracy than a traditional global optimization method (+1.5% increase in OA). The results are further confirmed by the AFI segmentation goodness metric, which showed that building segments from SPUSPO are less over-segmented than their global counterparts, with mean values of 0.28 and 0.36 for SPUSPO and global USPO, respectively. The analysis validated our initial hypothesis that the way we look at the data can produce significantly different results, and it relates to the importance of appropriate spatial scale selection in geography, largely signified through the work of Woodcock and Strahler [66] and Fotheringham et al. [67]. Additionally, a local segmentation optimization approach is not only linked to traditional GEOBIA analysis, but may also be needed in large-scale applications where deep learning classification is coupled with segments to achieve better object delineation/extraction, as demonstrated recently in References [68–70]. Another useful capability of these methods is the mapping of intermediate and final results, which can be enlightening both for a general understanding of how spatial processes operate at the local scale and for calibrating segmentation parameters in further processing if an unsupervised multi-scale framework is selected [18]. LULC products in SSA cities are often used as inputs for fine-scale population modelling, land use, and spatial planning, and consequently for effective policy making, given the extreme scarcity of reference information [2,71,72].
This is significant for the outcome of our analyses because SPUSPO better predicted most classes; it presents an additional motivation to adopt a local method to reduce error propagation in secondary models.

The main limitation of SPUSPO is the increased computational time and experimentation needed to detect a satisfactory spatial level at which to analyze image information, which can vary depending on the image resolution and study area, leading to a trade-off between computational requirements and performance. Therefore, more sophisticated methods are needed to establish an efficient framework that fully exploits the benefits of local optimization. Ideally, in large and heterogeneous areas, a spatial partition should not suffer from edge effects and should meaningfully delineate the landscape with a certain degree of intra-homogeneity. Cutline partitioning satisfies both criteria to some extent, but its effectiveness can only be determined post hoc, which increases the computational and time demands, as several cutline partitions may need to be evaluated. More adequate methods that focus on the a priori determination of a suitable scale using image statistics, such as spatial dependency among regions [73], could help achieve this, particularly in a multi-scale context. Further research should explore the potential of multi-resolution imagery to define operational partitions using top-down approaches. For instance, a low-to-medium-resolution LULC product could define homogeneous regions in which to apply SPUSPO using finer-resolution imagery. Moreover, noise additive models could help establish a better comparative framework among different segmentation approaches, particularly for SAR or hyperspectral data [74].
Many of the limitations that come with local methods can be significantly reduced (i) by utilizing GRASS GIS, which is highly parallelized in the USPO optimization module and, more notably, performs all operations in raster format without requiring vector conversion at any point, dramatically boosting its effectiveness for large-scale computing; and (ii) by invoking segmentation algorithms that are state of the art with respect to computational efficiency, as recently shown by Gu et al. [75].

#### **5. Conclusions**

In this study, the optimization of a region-growing segmentation algorithm was attempted using a spatially varying parameter model named SPUSPO. The whole framework was developed with a focus on automation and large-scale analysis of VHR imagery. The results validated our hypothesis that in large and heterogeneous areas, a single set of parameters is inadequate to optimize the region-growing algorithm. Using the city of Ouagadougou as a case study, it was demonstrated that local optimization methods are of merit and led to significantly better LULC classification results (+1.5% increase in OA), validated by a McNemar's test of similarity. At the segmentation level, building delineation was improved, with a mean Area Fit Index of 0.28 for SPUSPO versus 0.36 for global USPO. Moreover, the feature importance of geometrical covariates is recommended as an indirect measure of segmentation quality; we demonstrated that geometrical features were more important and predictive when using local approaches. Finally, GRASS GIS was heavily utilized and is promoted as an open-source tool for handling large volumes of data with advanced analysis techniques.

**Author Contributions:** S.G. wrote the manuscript and performed the analysis. M.L. developed the cutline algorithm. T.G. provided technical assistance and helped develop the experiments. B.J., S.V. and E.W. provided valuable feedback and comments during the internal revisions of the manuscript.

**Acknowledgments:** This work was supported by BELSPO (Belgian Science Policy Office) in the frame of the STEREO III program—project REACT (SR/00/337). We would also like to thank the three reviewers for their useful and insightful comments and recommendations which significantly improved the quality of the manuscript. WorldView3 data is copyrighted under the mention "©COPYRIGHT 2015 DigitalGlobe, Inc., Longmont CO USA 80503. DigitalGlobe and the DigitalGlobe logos are trademarks of DigitalGlobe, Inc. The use and/or dissemination of this data and/or of any product in any way derived there from are restricted. Unauthorized use and/or dissemination is prohibited".

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Multiscale Optimized Segmentation of Urban Green Cover in High Resolution Remote Sensing Image**

**Pengfeng Xiao 1,2,3,\*, Xueliang Zhang 1,2,3,\*, Hongmin Zhang 1, Rui Hu <sup>1</sup> and Xuezhi Feng 1,2,3**


Received: 22 October 2018; Accepted: 13 November 2018; Published: 15 November 2018

**Abstract:** Urban green cover in high-spatial resolution (HR) remote sensing images has obvious multiscale characteristics; it is thus not possible to properly segment all features using a single segmentation scale, because over-segmentation or under-segmentation often occurs. In this study, an unsupervised cross-scale optimization method specifically for urban green cover segmentation is proposed. A global optimal segmentation is first selected from multiscale segmentation results by using an optimization indicator. The regions in the global optimal segmentation are then separated into under-segmented and finely segmented parts. The under-segmented regions are further locally refined by using the same indicator as in the global optimization. Finally, the finely segmented part and the refined under-segmented part are combined to obtain the final cross-scale optimized result. In the optimized segmentation, green cover objects can be segmented at their specific optimal segmentation scales, reducing both under- and over-segmentation errors. Experimental results on two HR test datasets verify the effectiveness of the proposed method.

**Keywords:** multiscale segmentation; scale parameter; cross-scale optimization; segmentation refinement; urban green cover

#### **1. Introduction**

Urban green cover can be defined as the layer of leaves, branches, and stems of trees and shrubs and the leaves of grasses that cover the urban ground when viewed from above [1]. This term is typically used to refer to urban green space identified from remote sensing data, as it is in this study. Green space is an essential infrastructure in cities because it provides various products and ecosystem services for urban dwellers, supporting climate-change mitigation and adaptation, human health and well-being, biodiversity conservation, and disaster risk reduction [2]. Therefore, inventorying the spatial distribution of urban green cover is imperative in decision-making about urban management and planning [3].

High-spatial resolution (HR) remote sensing data have shown great potential for identifying both the extent and the corresponding attributes of urban green cover [4–8]. In order to fully exploit the information content of HR images, geographic object-based image analysis (GEOBIA) has become the principal method [9] and has been successfully applied to urban green cover extraction [10–15]. Scale is a crucial aspect of GEOBIA, as it describes the magnitude or the level of aggregation and abstraction at which a certain phenomenon can be described [16]. GEOBIA is sensitive to the segmentation scale, but selecting scale parameters is challenging, because different objects can only be perfectly expressed at the scale corresponding to their own granularity.

Urban green cover in HR images presents obvious multiscale characteristics; for example, its size varies over a large range of scales, from a small area of several square meters, such as a private garden, to a large area of several square kilometers, such as a park. As a result, it is not possible to properly segment all features in a scene using a single segmentation scale, so over-segmentation (producing too many segments) or under-segmentation (producing too few segments) often occurs [17]. Therefore, dividing complex features at the appropriate scale to segment the landscape into non-overlapping homogeneous regions plays a decisive role in GEOBIA [18].

In order to find the optimal scale for each object, multiscale segmentation can be optimized using three different strategies, based on: supervised evaluation measures, unsupervised evaluation measures, and cross-scale optimization. (1) The supervised strategy compares segmentation results with a reference via geometric [19–23] and arithmetic [21,24,25] discrepancy measures. This strategy is effective but is, in fact, subjective and time-consuming when creating the reference. (2) The unsupervised strategy defines quality measures, such as intra-region spectral homogeneity [26–31] and inter-region spectral heterogeneity [32–34], as conditions to be satisfied by an ideal segmentation. It thus characterizes segmentation algorithms by computing goodness measures from segmentation results without a reference. This strategy is objective but has the added difficulty of designing effective measures. (3) The cross-scale strategy fuses multiscale segmentations to express objects of various granularity at their optimal scales [35–37]. It can make better use of the multiscale information than the other two strategies.

Recently, the cross-scale strategy has garnered much attention in multiscale segmentation optimization, using evaluation measures as the indicator. (1) For the unsupervised indicator, some studies generated a single optimal segmentation by fusing multiscale segmentations according to locally oriented unsupervised evaluation [35,38,39]. However, the range of involved scales was found to be limited. By contrast, multiple segmentation scales were selected according to changes in homogeneity [27–29]. (2) For the supervised indicator, multiscale segmentation optimization has been achieved by using a single-scale evaluation measure based on different sets of reference objects [28]. For example, some studies have provided reference objects and suitable segmentation scales for different land cover types [40,41]. The difficulty of this strategy lies in preparing appropriate sets of reference objects that can reflect changes of scale. In our previous work [37], two discrepancy measures were proposed to assess multiscale segmentation accuracy: the multiscale object accuracy (MOA) measure at the object level and the bidirectional consistency accuracy (BCA) measure at the pixel level. The evaluation results showed that the proposed measures can assess multiscale segmentation accuracy and indicate the manner in which multiple segmentation scales can be selected. These measures can manage various combinations of multiple segmentation scales; therefore, their applications for the optimization of multiscale segmentation can be expanded.

In this study, an unsupervised cross-scale optimization method specifically for urban green cover segmentation is proposed. A global optimal segmentation is first selected from multiscale segmentation results by using an optimization indicator. The regions in the global optimal segmentation are then separated into under-segmented and finely segmented parts. The under-segmented regions are further locally refined by using the same indicator as in the global optimization. Finally, the finely segmented part and the refined under-segmented part are combined to obtain the final cross-scale optimization result. The goal of the proposed method is to segment urban vegetation in general, for example, trees and grass included together in one region. The segmentation result of urban green cover can be used in practice for urban planning, for example in surveys of the urban green cover rate [42,43], and for urban environment monitoring, for example in analyses of the influence of urban green cover on residential quality [44,45].

The contribution of this study is a new cross-scale optimization method specifically for urban green cover that achieves the optimal segmentation scale for each green cover object. The same optimization indicator is used both to identify the global optimal scale and to refine the under-segmentation. By refining the isolated under-segmented regions for urban green cover, the optimized result avoids under-segmentation errors and reduces over-segmentation errors, achieving higher segmentation accuracies than single-scale segmentation results. The proposed method also holds potential for cross-scale segmentation optimization of different types of urban green cover, or even other land cover types, by designing a proper under-segmentation isolation rule.

The rest of the paper is organized as follows. Section 2 presents the proposed method of multiscale segmentation optimization. Section 3 describes the study area and test data. Section 4 verifies the effectiveness of the proposed method based on experiments. Section 5 presents the discussions. Finally, conclusions are drawn in Section 6.

#### **2. Method**

#### *2.1. General Framework*

This study proposes a multiscale optimization method for urban green cover segmentation, which aims to comprehensively utilize multiscale segmentation results to achieve optimal-scale expression of urban green cover. Figure 1 shows the general framework of the proposed method. First, a global optimal segmentation is selected from the multiscale segmentation results by using an optimization indicator. The indicator is the local peak (*LP*) of the change rate (*CR*) of the mean value of the spectral standard deviation (*SD*) of each segment. Second, the regions in the global optimal segmentation result are separated into under-segmented and finely segmented parts, based on a designed under-segmentation isolation rule. Third, the under-segmented regions are refined by using the same optimization indicator *LP* in a local version. Finally, the finely segmented part and the optimized under-segmented part are combined to obtain the final cross-scale optimization result.

**Figure 1.** General framework of the proposed multiscale segmentation optimization method. The remote sensing images are shown with false color composite: red: near infrared band; green: red band; and blue: green band. The multiscale, under-segmented, and fine-segmented regions are shown with blue, green, and yellow polygons.

#### *2.2. Hierarchical Multiscale Segmentation*

The hierarchical multiscale segmentation is composed of multiple segments from fine to coarse at each location, in which small objects are represented by fine segments at certain segmentation scales and large objects are correspondingly represented by coarse segments. Furthermore, a fine-scale segment smaller than a real object is supposed to represent a part of the object, while a coarse-scale segment larger than a real object represents an object group. A preliminary requirement for the multiscale segments is that segments at the same location should be nested; otherwise, the object boundaries would conflict when combining or fusing the multiscale segments.

The hierarchical multiscale segmentation is represented using a segment tree model [46], as shown in Figure 2. The tree nodes at different levels represent segments at different scales. An arc connecting a parent and a child node represents the inclusion relation between segments at adjacent scales. The leaf nodes represent the segments at the finest scale and the nodes at upper levels represent segments at coarser scales. Finally, the root node represents the whole image. An ancestry path in the tree is defined as the path from a leaf node up to the root node, revealing the transition from object part to the whole scene. The hierarchical context of each leaf node is conveyed by the ancestry path, in which a segment is gradually becoming coarser and finally reaching the whole image.

**Figure 2.** Illustration of segment tree model (**b**) that represents hierarchical multiscale segmentation (**a**).

Several region-based segmentation methods can be applied to produce the required hierarchical multiscale segmentations, for example the multiresolution segmentation method [47], the mean-shift method [48], and the hierarchical method [49]. Specifically, the multiresolution segmentation method [47] is used in this study, with the shape parameter set to its default of 0.5. The regions at each segmentation scale are represented by the nodes at the same level of the segment tree. Finally, the segment tree is constructed by recording the multiscale segmentation.
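The segment tree and its ancestry paths can be sketched as a simple linked structure; the identifiers and scale values below are illustrative, not from the paper:

```python
class SegmentNode:
    """Node in a segment tree: one segment at one scale."""
    def __init__(self, segment_id, scale, parent=None):
        self.segment_id = segment_id
        self.scale = scale
        self.parent = parent       # coarser segment containing this one
        self.children = []         # finer segments nested inside
        if parent is not None:
            parent.children.append(self)

    def ancestry_path(self):
        """Path from this node up to the root (the whole image)."""
        path, node = [], self
        while node is not None:
            path.append(node)
            node = node.parent
        return path

# Tiny tree: root image -> one coarse segment -> two nested fine segments
root = SegmentNode("image", scale=None)
coarse = SegmentNode("s1", scale=50, parent=root)
fine_a = SegmentNode("s1a", scale=10, parent=coarse)
fine_b = SegmentNode("s1b", scale=10, parent=coarse)
print([n.segment_id for n in fine_a.ancestry_path()])  # → ['s1a', 's1', 'image']
```

The nesting requirement of the previous paragraph is enforced by construction: every node has exactly one parent at the coarser scale.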

#### *2.3. Selecting Global Optimal Scale*

We need to first select a global optimal segmentation scale and then refine the under-segmentation part for urban green cover. Thus, unlike other optimal scale selection methods for compromising under- and over-segmentation errors, we design an indicator to select an optimal scale in which segmentation results mainly include reasonable under-segmented and fine-segmented regions, reducing over-segmented regions as much as possible.

Referring to the standard deviation indicator [28], we adopt an indicator focusing on the homogeneity of segments, calculated as the mean value of the spectral standard deviation (*SD*) of each segment. *SD* is defined as below:

$$SD = \sqrt{\frac{1}{nb} \sum_{k=1}^{b} \sum_{i=1}^{n} SD_{ki}} \tag{1}$$

where *SDki* is the standard deviation of digital number (DN) of spectral band *k* in segment *i*; *n* is the number of segments in the image; and *b* is the number of spectral bands of the image.
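Equation (1) can be sketched directly; the per-segment, per-band standard deviations below are hypothetical:

```python
import math

def sd_indicator(band_sds):
    """Global SD indicator of Equation (1): the square root of the mean of
    the per-band, per-segment spectral standard deviations SD_ki.

    band_sds: list of b lists, each holding the SD of one band over n segments.
    """
    b = len(band_sds)
    n = len(band_sds[0])
    total = sum(sum(per_band) for per_band in band_sds)
    return math.sqrt(total / (n * b))

# Hypothetical SD_ki values: 2 bands over 3 segments
print(sd_indicator([[4.0, 9.0, 2.0], [1.0, 5.0, 3.0]]))  # → 2.0
```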

As the scale parameter increases, *SD* changes as follows. Generally, it tends to increase because the homogeneity of segments gradually decreases during the region-merging procedure. Near the scale at which the segments are close to the real objects, the change rate of *SD* increases suddenly because of the influence of the boundary pixels [29].

To find the scale at which the green cover segments are closest to the real objects, we propose the indicator *CR* to represent the change rate of *SD* and the indicator *LP* to represent the local peak of *CR*. They are defined respectively as below:

$$CR = \frac{dSD}{dl} = \frac{SD(l) - SD(l - \Delta l)}{\Delta l} \tag{2}$$

$$LP = \left[CR(l) - CR(l - \Delta l)\right] + \left[CR(l) - CR(l + \Delta l)\right] \tag{3}$$

where *l* is the segmentation scale and Δ*l* is the scale increment, that is, the lag at which the scale parameter grows. The scale increment strongly controls the global optimal segmentation because it smooths the heterogeneity measure, causing the optimal segmentation to occur at different scales [50]. Experimentally, small increments (e.g., 1) produce optimal segmentations at finer scales, while large increments (e.g., 100) produce optimal segmentations at coarser scales [28]. Hence, a medium scale increment (e.g., 10) is adopted in this study.

According to the aforementioned behavior of *SD*, near the scale at which the segments are close to the real objects, *CR* increases suddenly because of the influence of the boundary pixels of green cover segments. Thus, an *LP* appears when the global optimal segmentation scale is reached for several segments. However, several *LP*s occur within a set of increasing scale parameters, because not all segments share the same optimal segmentation scale. The global optimal segmentation is identified as the scale with the largest *LP*, because the largest *LP* indicates that most of the segments in the image have reached the optimal segmentation state. Furthermore, a large *LP* can also be caused by the large *SD* value of coarse segments, because a large *SD* produces a large *CR* and a correspondingly large *LP*, revealing an under-segmentation state. Therefore, the next step is to optimize the under-segmented part of the global optimal segmentation result for green cover objects.
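Equations (2) and (3), together with the selection of the scale with the largest *LP*, can be sketched as follows; the *SD* series is hypothetical:

```python
def change_rate(sd, dl):
    """CR(l) = (SD(l) - SD(l - dl)) / dl for a series of SD values
    sampled at a constant scale increment dl (Equation (2))."""
    return [(sd[i] - sd[i - 1]) / dl for i in range(1, len(sd))]

def local_peaks(cr):
    """LP(l) = [CR(l) - CR(l - dl)] + [CR(l) - CR(l + dl)] (Equation (3)),
    defined only at interior points of the CR series."""
    return [(cr[i] - cr[i - 1]) + (cr[i] - cr[i + 1])
            for i in range(1, len(cr) - 1)]

# Hypothetical SD values at scales 10, 20, ..., 60 (dl = 10)
sd = [3.0, 3.2, 3.5, 4.6, 4.9, 5.1]
cr = change_rate(sd, dl=10)
lp = local_peaks(cr)
best = lp.index(max(lp))  # index of the interior scale with the largest LP
print(best)
```

Here the sudden jump of *SD* between scales 30 and 40 produces the largest *LP*, so that scale would be selected as the global optimum.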

#### *2.4. Isolating Under-Segmented Regions*

In order to obtain the under-segmented regions for green cover from the globally optimized segmentation result, further isolation of segments is required. When a green cover object is in an under-segmentation state, it is often mixed with other adjacent objects, and its spectral standard deviation (*SDi*) is thus large. Moreover, since the normalized difference vegetation index (NDVI) performs well in distinguishing green cover from other features, when other objects are mixed with a green cover object, the NDVI value of the region is neither very high (it is lower than that of pure green cover objects) nor very low (it is higher than that of non-green-cover objects). NDVI is defined as the ratio of the difference between the near-infrared and red band values to their sum [51]. Thus, the NDVI of a region is calculated as below:

$$NDVI_i = \frac{1}{m} \sum_{j=1}^{m} \frac{NIR_j - R_j}{NIR_j + R_j} \tag{4}$$

where *NIRj* and *Rj* are DNs of near infrared and red band for pixel *j*, respectively; and *m* is the number of pixels in region *i*.

Therefore, a region with a high *SDi* value and a medium *NDVIi* value can be considered an under-segmented region for green cover. The isolation rule for an under-segmented green cover region is thus defined as below:

$$\begin{cases} SD_i > T_{SD} \\ T_{N1} < NDVI_i < T_{N2} \end{cases} \tag{5}$$

where *TSD*, *TN*1, and *TN*2 are thresholds that need to be set by the user; we set them by trial and error. A segment with *SDi* lower than *TSD* is regarded as a fine segment because of its high homogeneity. If the *NDVIi* value of region *i* is higher than *TN*2, the region is regarded as a fine segmentation of green cover; if it is lower than *TN*1, the region is regarded as not containing green cover and is not involved in the subsequent refining procedure.
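The isolation rule of Equation (5) amounts to a simple threshold filter; a sketch with hypothetical thresholds and region statistics:

```python
def isolate(regions, t_sd, t_n1, t_n2):
    """Split regions into under-segmented green-cover candidates and the rest,
    per Equation (5): SD_i > T_SD and T_N1 < NDVI_i < T_N2.

    regions: list of (id, sd_i, ndvi_i) tuples. The thresholds are user-set
    (by trial and error in the paper); the values used below are hypothetical.
    """
    under, fine = [], []
    for rid, sd_i, ndvi_i in regions:
        if sd_i > t_sd and t_n1 < ndvi_i < t_n2:
            under.append(rid)
        else:
            fine.append(rid)
    return under, fine

regions = [("r1", 30.0, 0.35),  # heterogeneous, mixed NDVI -> under-segmented
           ("r2", 8.0, 0.55),   # homogeneous green cover -> fine
           ("r3", 40.0, 0.05)]  # heterogeneous but non-vegetated -> fine
under, fine = isolate(regions, t_sd=20.0, t_n1=0.2, t_n2=0.5)
print(under, fine)  # → ['r1'] ['r2', 'r3']
```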

#### *2.5. Refining Under-Segmented Regions*

For each individual region in the under-segmentation part, the segment tree is first used to quantify the spatial context relationship of the regions at different scales and the appropriate segmentation scale is then selected through the optimization indicator *LP*. Finally, the under-segmentation part is replaced by the optimized segments. The specific steps are performed as follows:


#### *2.6. Accuracy Assessment*

Segmentation quality evaluation strategies include visual analysis, system-level evaluation, empirical goodness methods, and empirical discrepancy methods [37]. The last two are also known as unsupervised and supervised evaluation methods, respectively. The unsupervised evaluation method calculates indexes of homogeneity within segments and heterogeneity between segments [35]. It does not require ground truth, but the interpretability of the designed measures and of their values is limited. The supervised evaluation method compares segmentation results with ground truth, and the discrepancy directly reveals the segmentation quality [52]. Region-based *precision* and *recall* measures are sensitive to both geometric and arithmetic errors; thus, the supervised evaluation method is used to assess the segmentation accuracy of the multiscale optimization.

*Precision* is the ratio of true positives to the sum of true positives and false positives, and *recall* is the ratio of true positives to the sum of true positives and false negatives. Given the segmentation result *S* with *n* segments {*S*1, *S*2, ... , *Sn*} and the reference *R* with *m* objects {*R*1, *R*2, ... , *Rm*}, the *precision* measure is calculated by matching {*Ri*} to each segment *Si* and the *recall* measure by matching {*Si*} to each reference object *Ri*. When calculating the *precision* measure, the matched reference object (*Rimax*) for each segment *Si* is first identified, where *Rimax* has the largest overlapping area with *Si*. The *precision* measure is then defined as [23]:

$$precision = \frac{\sum_{i=1}^{n} |S_i \cap R_{imax}|}{\sum_{i=1}^{n} |S_i|} \tag{6}$$

where |·| denotes the area that is represented by the number of pixels in a region.

Similarly, the matched segment (*Simax*) for each reference object *Ri* is searched according to the maximal overlapping area criterion and the *recall* measure is defined as [23]:

$$recall = \frac{\sum_{i=1}^{m} |R_i \cap S_{imax}|}{\sum_{i=1}^{m} |R_i|} \tag{7}$$

The *precision* and *recall* measures both range from 0 to 1. Using these two measures, one can determine both under- and over-segmentation. An under-segmented result has a high *recall* and a low *precision*; by contrast, an over-segmented result has a high *precision* but a low *recall*. If both the *precision* and *recall* values of one segmentation result are higher than those of another, the first result is considered to have better segmentation quality. However, when one measure is larger and the other is smaller, it is unclear which result is better. Hence, the two measures should be combined into one. In this study, we use the harmonic mean of *precision* and *recall*, called the *F-score* [53], which is defined as:

$$F-score = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{8}$$

where the *F-score* reaches its best value at 1 (perfect *precision* and *recall*) and its worst at 0.
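Equations (6)–(8) can be sketched over pixel sets, with each region represented by the set of pixel indices it covers; the toy segments below are illustrative:

```python
def precision_recall_f(segments, references):
    """Region-based precision, recall and F-score (Equations (6)-(8)).

    segments, references: dicts mapping region id -> set of pixel indices.
    """
    def matched_overlap(regions, others):
        # For each region, the overlap area with its best-matching counterpart
        total_overlap = total_area = 0
        for pix in regions.values():
            best = max((len(pix & other) for other in others.values()), default=0)
            total_overlap += best
            total_area += len(pix)
        return total_overlap / total_area

    precision = matched_overlap(segments, references)
    recall = matched_overlap(references, segments)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Toy 1-D "image": one reference object over pixels 0-9, split into two segments
refs = {"R1": set(range(10))}
segs = {"S1": set(range(6)), "S2": set(range(6, 12))}
p, r, f = precision_recall_f(segs, refs)
print(p, r, f)
```

In this toy case the reference is over-segmented: *precision* (10/12) exceeds *recall* (6/10), matching the diagnostic pattern described above.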

#### **3. Data**

The study area is located in Nanjing City (32°02′38″N, 118°46′43″E), the capital of Jiangsu Province of China and the second largest city in the East China region (Figure 3), with an administrative area of 6587 km<sup>2</sup> and a total population of 8.335 million as of 2017. As one of the four garden cities in China, Nanjing has a greater wealth of urban green space than many other cities; the urban green cover rate in its built-up area was 44.85% in 2018.

**Figure 3.** Location of the Nanjing City and the two test images.

In this study, an IKONOS-2 image acquired on 19 October 2010 and a WorldView-2 image acquired on 29 July 2015 over Nanjing are used as the HR data. Both images consist of four spectral bands: blue, green, red, and near-infrared. The spatial resolution of the multispectral bands of the IKONOS-2 image is improved from 3.2 m to 0.8 m by pan-sharpening, while that of the WorldView-2 image is 2 m.

Two test images, denoted I1 and I2, are subsets of the IKONOS-2 and WorldView-2 images, respectively. They contain urban green cover in traffic, residential, campus, park, commercial, and industrial areas, which are typical urban area types. The sizes of I1 and I2 are 2286 × 1880 and 1478 × 974 pixels, and their areas are approximately 2.8 km<sup>2</sup> and 5.8 km<sup>2</sup>, respectively. As shown in Figure 4, abundant green cover objects, varying in size and shape, are distributed across the images.

To evaluate the segmentation accuracy, we randomly selected green cover objects as references. The reference objects are uniformly distributed across the test images and vary in size and shape. Each reference object was delineated by one person and reviewed by another to catch any obvious errors. In total, we collected 130 reference objects for each test image. Note that where trees cover a road, the area was digitized as a green cover object. The area of the smallest reference object is only 59.5 m<sup>2</sup>, whereas that of the biggest is 14,063.1 m<sup>2</sup>; hence, it is not possible to properly segment all of the green cover objects using a single segmentation scale.


**Figure 4.** Test image I1 from an IKONOS-2 image (**a**) and test image I2 from a WorldView-2 image (**b**), containing different green cover in urban area. The images are shown with false color composite: red: near infrared band; green: red band; and blue: green band. The reference green cover objects are shown with orange polygons.

#### **4. Results**

#### *4.1. Global Optimal Scale Selection*

The multiscale segmentation results are produced by applying the multiresolution segmentation method. For I1, the scale parameters are set from 10 to 250 in increments of 10. Since the spatial resolution of I2 is coarser than that of I1, its scale parameters are set from 10 to 125 in increments of 5. If the same scale parameters were used for I2 as for I1, the coarse segmentation scales (e.g., >130) would be severely under-segmented and the homogeneity of segments at these scales would change erratically, which would not benefit, and could even harm, the optimization procedure.

The multiscale segmentations range from clear over-segmentation through medium segmentation to clear under-segmentation. The optimization indicators *SD*, *CR*, and *LP* are calculated for each segmentation result and shown in Figure 5. As the scale parameter increases, *SD* gradually increases, indicating that the regions are growing and their homogeneity is decreasing. Correspondingly, *CR* exhibits multiple local peaks, which the indicator *LP* highlights very well. *LP* peaks appear at segmentation scales of 80, 110, 150, 170, 190, and 220 for I1, with the maximum at 220. For I2, *LP* peaks appear at scales of 45, 55, 60, 70, 80, 85, 95, 105, and 120, with the maximum at 105. Therefore, the segmentations with scale parameters of 220 and 105 are taken as the global optimal segmentation scales for I1 and I2, respectively.
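The selection logic above can be sketched as a simple peak search. This is our illustrative reconstruction, not the authors' implementation; `scales` and `lp` are assumed to be parallel sequences of scale parameters and the corresponding *LP* indicator values.

```python
import numpy as np

def local_peak_indices(values):
    """Indices i where the indicator exceeds both of its neighbours."""
    v = np.asarray(values, dtype=float)
    return [i for i in range(1, len(v) - 1) if v[i] > v[i - 1] and v[i] > v[i + 1]]

def select_global_scale(scales, lp):
    """Pick the scale whose LP value is largest among the local peaks of LP."""
    peaks = local_peak_indices(lp)
    best = max(peaks, key=lambda i: lp[i])
    return scales[best]
```

With the scale series of I1, for example, this picks scale 220 because that is the largest *LP* among the local peaks.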

**Figure 5.** Changes of optimization indicator *SD*, *CR*, and *LP* with scale parameter for the multiscale segmentations of test image I1 (**a**) and I2 (**b**).

Combining these with the supervised evaluation results of the multiscale segmentations (Figure 6), we can see that the selected global optimal segmentation scale is under-segmented for green cover objects. For both I1 and I2, the *precision* value is clearly lower than the *recall* value at the optimal scale, which indicates under-segmentation. To further illustrate this, the selected I2 segmentation result at scale 105 is presented in Figure 7, in which, except for several fine-segmented green cover objects of relatively large size, many green cover objects are clearly under-segmented.

**Figure 6.** Changes of *precision*, *recall*, and *F-score* with scale parameter for the multiscale segmentation results of test image I1 (**a**) and I2 (**b**).

The selected global optimal segmentation of green cover tends to occur at coarse scales. As a result, over-segmentation errors are reduced, while some small green cover objects are inevitably left under-segmented; a single scale cannot achieve optimal segmentation of green cover objects of different sizes. Therefore, it is necessary to further optimize the global optimal segmentation by refining the under-segmented regions.

**Figure 7.** Selected global optimal segmentation at the scale of 105 for I2. The segments are shown with green polygons, the examples of fine-segmentation for green cover are shown with blue polygons, and those of under-segmentation for green cover are marked by yellow arrows.

#### *4.2. Under-Segmented Region Isolation*

The under-segmented regions are isolated by the rule in Equation (5). The threshold values of *TSD*, *TN*1, and *TN*2 are set to 40, 0.05, and 0.25 for test image I1 and to 40, 0.10, and 0.55 for test image I2. The *NDVIi* thresholds for I2 differ from those for I1 mainly because of the different acquisition dates of the two images, between which the vegetation growth status differs.

To illustrate the effectiveness of the designed isolation rule for under-segmentation with green cover, several sample segments from the global optimal segmentations are presented in Figure 8. The under-segmented regions containing green cover have medium *NDVIi* values and high *SDi* values, as shown in Figure 8a,b,f,g. Fine-segmented green cover presents high *NDVIi* values, as shown in Figure 8c,i. A special case of fine-segmentation is shown in Figure 8h: a segment mainly containing sparse grass, whose *NDVIi* value is therefore not very high; however, its relatively low *SDi* value prevents it from being wrongly identified as under-segmented. Segments without green cover usually present low *NDVIi* values, as shown in Figure 8d. A special case of a segment without green cover is shown in Figure 8e, where the roof segment has a medium *NDVIi* value because of the roof material; again, the relatively low *SDi* value prevents it from being wrongly identified as under-segmentation containing green cover.
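To make the rule concrete, a plausible form of the Equation (5) logic is sketched below. Equation (5) itself is not reproduced in this excerpt, so the exact conditions are our assumption: a segment is flagged as under-segmented green cover when its NDVI is in the medium range (*TN*1, *TN*2) and its standard deviation exceeds *TSD*, while high NDVI marks fine-segmented green cover. The defaults are the I1 thresholds quoted above.

```python
def classify_segment(ndvi, sd, t_sd=40.0, t_n1=0.05, t_n2=0.25):
    """Hypothetical reconstruction of the isolation rule (I1 thresholds).

    Medium NDVI plus high spectral standard deviation suggests a segment
    mixing green cover with other objects, i.e., under-segmentation.
    """
    if t_n1 < ndvi < t_n2 and sd > t_sd:
        return "under-segmented"
    if ndvi >= t_n2:
        return "fine green cover"
    return "fine / non-green"
```

Note how a low *SDi* keeps a medium-NDVI segment (e.g., the roof in Figure 8e) out of the under-segmented class, as described in the text.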

**Figure 8.** Sample segments (cyan polygons) from the selected global optimal segmentation to illustrate the effectiveness of the designed isolation rule for under-segmentation containing green cover. (**a**–**e**) are from the results of test image I1 and (**f**–**i**) are from the results of test image I2. U, F, and N represent under-segmentation, fine-segmentation, and non-green-cover segmentation, respectively. The numbers in the bracket are sequentially the *NDVIi* and *SDi* values of the segment.

To further validate the effectiveness of the isolation rule, the upper-left parts of the isolation results of I1 and I2 are enlarged in Figure 9. The green cover and other objects are mixed in the isolated under-segmented regions, whereas in the fine-segmentation part, the regions are either fine-segmented green cover or segments without green cover.

**Figure 9.** Isolation results of under-segmentation and fine-segmentation using the designed isolation rule. (**a**,**b**) are the upper-left part of the result for test image I1 and (**c**,**d**) are the upper-left part of the result for test image I2. The segments are shown with green polygons.

#### *4.3. Under-Segmented Region Refinement*

The multiscale optimized segmentation is obtained by refining the under-segmented part of the global optimal scale. In the refined segmentation result, thanks to the cross-scale refinement strategy, segments sit at different segmentation scales so that each green cover object approaches its optimal scale. The histogram of segmentation scales in the refinement results is shown in Figure 10. The refined segments cover almost all segmentation scales finer than the selected global optimal scale. Many segments occur at small segmentation scales, for example, scales 20 to 40 for I1 and 15 to 25 for I2, because urban areas contain many small green cover objects, such as single trees.

**Figure 10.** Histogram of segmentation scales in the refinement results of test image I1 (**a**) and I2 (**b**).

To illustrate the effectiveness of achieving an optimal scale for each green cover object, sample refinement results from the test images are enlarged in Figure 11, with each segment labeled by its scale number. In general, large green cover objects are segmented at relatively coarse scales, while small green cover objects are segmented at relatively fine scales. The green cover objects, especially the small ones, tend to be covered by a single segment.

**Figure 11.** Sample refinement results of test image I1 (**a**) and I2 (**b**) labeled with optimized segmentation scale number. The segments are shown with green polygons.

The supervised evaluation results of segmentation before and after refinement are presented in Table 1 to quantify the effectiveness of the under-segmentation refinement. The *precision* value is clearly improved after refinement, showing that the under-segmented green cover objects can be effectively refined. The *recall* value decreases slightly, mainly because of the reduced under-segmentation. As a result, the *F-score* after refinement is clearly higher than before refinement. The segmentation results before and after refinement shown in Figure 12 further confirm this.

To quantify the effectiveness of cross-scale optimization, the refinement result is compared with the single-scale segmentation that has the highest *F-score* among the produced multiscale segmentations, which occurs at scale 70 for I1 and scale 35 for I2. The supervised evaluation results are also presented in Table 1. The *precision* of the refinement result is slightly lower than that of the single-scale best result while its *recall* is higher, which could be caused by over-segmentation of the large green cover objects. Another reason for the lower *precision* could be the misidentification of under-segmented green cover objects, which leaves that under-segmentation unrefined and thus lowers the precision. Overall, the *F-score* of the refinement result is slightly higher than that of the single-scale best segmentation, which can mainly be attributed to the under-segmentation errors reduced by the refining procedure. The segmentation results presented in Figure 12 further show the difference. As highlighted by the orange rectangles, the under-segmentation errors present in the single-scale best segmentation can be effectively reduced by the proposed refining strategy, indicating its effectiveness in overcoming under-segmentation errors. From this comparison, we can safely conclude that the proposed unsupervised multiscale optimization method automatically produces a segmentation result at least equal to the single-scale best segmentation, as indicated by supervised evaluation. Furthermore, the proposed refining strategy can reduce under-segmentation errors even in the single-scale best segmentation.


**Table 1.** Comparisons of the segmentation accuracy for the result before refinement, after refinement, and the single-scale best result.

**Figure 12.** Comparison of sample segmentation results of test image I1 and I2 before refinement (the first row), after refinement (the second row), and the single-scale best segmentation result according to *F-score* (the third row), which are shown with yellow, green, and pink polygons, respectively. Four areas are highlighted for comparison using orange rectangles.

#### **5. Discussions**

#### *5.1. Influence of Selected Global Optimal Segmentation Scale*

The selected global segmentation scale is assumed to be reasonably under-segmented, meaning that some segments are under-segmented and the others are fine-segmented with few over-segmentation errors. This is because the subsequent refining procedure is designed to reduce under-segmentation rather than over-segmentation errors. The results in Section 4.1 show that the optimization indicator selects a segmentation scale in an under-segmented state. However, we can see from Figure 7 that many segments in the selected segmentation are extensively under-segmented, especially those containing small green cover objects. Such extensive under-segmentation can make refinement difficult, because too many different objects are mixed within a segment, making its features erratic or random.

Even though the automatically selected global optimal segmentation scale yields a satisfactory refinement result, we explored whether a less under-segmented scale than the selected one could further improve refinement performance. For test images I1 and I2, a less under-segmented scale than the automatically selected one was input to the refining procedure, and the segmentation accuracies of the refinement results are presented in Table 2. Note that the less under-segmented scale is also a local peak in Figure 5. The less under-segmented global scale yields higher *precision* and *F-score* accuracies, because the prevalent under-segmentation is reduced, which enables better refinement. The sample segments of I1 shown in Figure 13 illustrate the difference. The segment at scale 220 in Figure 13a is severely under-segmented with respect to the green cover object; its *NDVIi* value is therefore very low and it is not identified as under-segmented by the rule. Consequently, it is not involved in the subsequent refinement procedure and the under-segmentation error remains in the refinement result. In contrast, the corresponding segment at scale 110 is less under-segmented and its *NDVIi* value is higher than at scale 220, which allows it to be refined and the under-segmentation error to be removed. This example demonstrates the importance of selecting a reasonably under-segmented scale for the subsequent refinement; specifically, segments should not be severely under-segmented.

**Table 2.** Comparisons of the segmentation accuracy for the refinement results from different global segmentation scale.


**Figure 13.** Sample segments (cyan polygons) of image I1 to illustrate the influence of global segmentation scale to refinement result. The *NDVIi* and *SDi* values for the segment containing green cover object are 0.04 and 0.91 in (**a**) and 0.12 and 0.89 in (**b**) and (**c**) is the refinement result of (**b**).

#### *5.2. Key Role of Identifying Under-Segmented Region*

Given a global segmentation scale with under-segmentation, the identification of under-segmented regions plays a key role in reducing under-segmentation errors, provided the refinement procedure is effective. Hence, before illustrating this key role, the effectiveness of the refinement procedure is further examined in addition to Section 4.3.

Since the optimization indicator of maximum *LP* tends to select an under-segmented scale, the refining procedure is performed iteratively: the optimal scale selected within the refining procedure may still be under-segmented for some segments, which then need to be further refined. This increases the computational cost of the refinement, but it leads to a safe refinement that avoids introducing new over-segmentation errors.
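The iterative logic can be sketched as a generic driver. Everything here is schematic and ours: `is_under_segmented` stands in for the Equation (5) rule and `split_finer` for re-segmenting a region at its locally optimal (possibly still under-segmented) scale; neither callable is defined in this form in the paper.

```python
def refine(segment, is_under_segmented, split_finer):
    """Recursively re-segment flagged regions until none is flagged.

    Because the locally selected scale may itself be under-segmented,
    each child segment is checked and refined again, mirroring the
    iterative procedure described in the text.
    """
    if not is_under_segmented(segment):
        return [segment]
    refined = []
    for child in split_finer(segment):
        refined.extend(refine(child, is_under_segmented, split_finer))
    return refined
```

With a toy model where a "segment" is just its size, a region of size 16 flagged whenever its size exceeds 4 is split twice before the procedure stops.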

The segmentation accuracies during the refining procedure for image I1 are presented in Table 3, showing that the *F-score* gradually increases. This is caused by the iteratively reduced under-segmentation errors, which demonstrates the effectiveness of both the refining strategy and the isolation rule for under-segmented regions. The sample segments in Figure 14 further demonstrate the effectiveness of the refining procedure.

As discussed above, the identification of under-segmented regions plays a key role in the proposed method. If the under-segmented regions are correctly identified, the refinement succeeds in removing under-segmentation errors, because the refining procedure stops only when a segment is no longer identified as under-segmented. Conversely, if an under-segmented region is not correctly identified, it cannot even enter the refining procedure and its under-segmentation error is preserved in the refinement result. The proposed isolation rule is still preliminary and requires users to set appropriate thresholds. Even though it has proved effective, the automatic isolation of under-segmentation still needs to be explored in future work.



**Figure 14.** Sample segments (green polygons) from test image I1 to show the refining procedure. The number of optimized segmentation scale is labeled in each segment.

#### *5.3. Potential of Segmenting Different Types of Urban Green Cover*

The proposed cross-scale optimization method aims to segment urban green cover into single regions by reducing both under- and over-segmentation errors. The segmentation of urban green cover can be used in practice in urban planning, for example for investigating the urban green cover rate [42,43], and in urban environment monitoring, for example for analyzing the influence of urban green cover on residential quality [44,45]. It would certainly be even more useful in practice to segment different types of urban green cover [3,12]. To achieve this with the proposed method, the key step is to adjust the isolation rule to identify under-segmented regions containing different types of green cover. Since the optimization indicator used in the global segmentation scale selection and local refinement is based on the spectral standard deviation, it should be able to handle under-segmentation for different green cover types. Once such under-segmentation can be identified, the refining procedure is expected to reduce the corresponding errors and separate the different types of green cover.

In fact, even though the presented method does not aim at segmenting different types of green cover, several under-segmented segments containing different types of green cover were refined into individual types when they met the isolation rule in this study. An example of isolating different types of green cover based on scale 110 for image I1 is shown in Figure 15. This shows the potential of adjusting the under-segmentation isolation rule to separate different types of green cover in the future.

**Figure 15.** Sample segment from test image I1 to show the potential of discriminating different green cover types by the proposed method. The yellow polygons represent segments at scale 110 before refinement and the green polygons represent the new segments after refinement.

#### *5.4. Potential of Refining Under-Segmented Regions for Other Land Cover Types*

The segments that are under-segmented for urban green cover also contain other land cover types, for example buildings, roads, and water, and are thus under-segmented for those non-green-cover objects as well. When refining the under-segmentation for green cover, the under-segmentation errors for the other land cover objects are also refined, as shown in Figures 11–15. That is, the non-green-cover objects in the identified under-segmented segments can be refined along with the green cover objects, which reveals the potential of refining other land cover objects with the proposed optimization method.

As discussed above, the optimization indicator used in the global segmentation scale selection and local refinement should be able to handle under-segmentation for different land cover types, because it is based on the spectral standard deviation. Accordingly, if a proper isolation rule is designed for other land cover types, the proposed method is expected to optimize the segmentation of those objects as well, which is another direction for future work based on this study.

#### **6. Conclusions**

In this paper, a multiscale optimized segmentation method for urban green cover is proposed. The global optimal segmentation result is first selected from the hierarchical multiscale segmentation results using the global *LP* optimization indicator. Based on this, under-segmented and fine-segmented regions are isolated by the designed rule. For under-segmented regions, the local *LP* is used for refinement, which ultimately allows urban green cover objects of different sizes to be expressed at their optimal scales.

The effectiveness of the proposed cross-scale optimization method is demonstrated by experiments on two HR test images of Nanjing, China. Thanks to the cross-scale optimization, the proposed unsupervised multiscale optimization method automatically produces an optimal segmentation result with higher segmentation accuracy than the single-scale best segmentation, as indicated by supervised evaluation. Furthermore, the proposed refining strategy is shown to effectively reduce under-segmentation errors.

The proposed method can be improved and extended in the future, for example to optimize segmentation for different types of urban green cover or even for other land cover types; the key step is to design an appropriate under-segmentation isolation rule for the specific application. Further exploring this potential will be the main future work based on this study.

**Author Contributions:** Conceptualization, P.X. and X.Z.; Methodology, P.X., X.Z.; Software, H.Z., R.H.; Validation, H.Z. and X.Z.; Formal Analysis, H.Z. and X.Z.; Investigation, P.X.; Resources, P.X.; Data Curation, H.Z. and R.H.; Writing-Original Draft Preparation, P.X.; Writing-Review & Editing, P.X. and X.Z.; Supervision, P.X. and X.F.; Project Administration, P.X.; Funding Acquisition, P.X. and X.Z.

**Funding:** This research was funded by the National Natural Science Foundation of China grant number 41871235 and 41601366, the National Science and Technology Major Project of China grant number 21-Y20A06-9001-17/18, the Natural Science Foundation of Jiangsu Province grant number BK20160623, and the Open Research Fund of State Key Laboratory of Space-Ground Integrated Information Technology grant number 2016\_SGIIT\_KFJJ\_YG\_01.

**Acknowledgments:** The authors thank the academic editors and anonymous reviewers for their insightful comments and suggestions, which helped to improve the quality and acceptability of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Improving Ecotope Segmentation by Combining Topographic and Spectral Data**

#### **Julien Radoux 1,\*,†, Axel Bourdouxhe 2,†, William Coos 2,†, Marc Dufrêne 2,† and Pierre Defourny 1,†**


Received: 14 January 2019; Accepted: 30 January 2019; Published: 11 February 2019

**Abstract:** Ecotopes are the smallest ecologically distinct landscape features in a landscape mapping and classification system. Mapping ecotopes therefore enables the measurement of ecological patterns, processes and change. In this study, a multi-source GEOBIA workflow is used to improve the automated delineation and description of ecotopes. Aerial photographs and LIDAR data provide the input for a landscape segmentation based on spectral signature, height structure and topography. Each segment is then characterized by the proportion of land cover features identified by a 2 m pixel-based classification. The results show that using hillshade bands simultaneously with spectral bands increases the consistency of the ecotope delineation. These results are promising for the further description of biotopes of high conservation value, as suggested by a successful test on a ravine forest biotope.

**Keywords:** GEOBIA; biodiversity; LIDAR; orthophoto; segmentation; classification; biotope distribution model

#### **1. Introduction**

#### *1.1. Context*

In order to mitigate biodiversity loss and the destruction of ecosystems with heritage value around the world, we have to know where biodiversity hotspots and threatened areas are located. Given the current threats and the high extinction rate, there is an urgent race to identify and map these areas before they disappear. This logic has been followed at many scales. Worldwide, biodiversity hotspots have been identified and outlined in order to prioritize conservation actions [1,2].

At the European scale, two directives have defined the need for the conservation of habitats and species through the adoption of appropriate measures. They make it possible to assign a protection status to species and biotopes of interest, and also to define protected areas corresponding to species habitats or groups of biotopes. Within this pan-European ecological network, known as the "Natura 2000 network" of special areas of conservation, natural habitats are monitored to ensure the maintenance or restoration of their composition, structure and extent [3].

Monitoring the evolution of the territory (land cover, species habitats, biotopes, etc.) is essential to identify major changes and economic, social and environmental issues, but also to assess the impact of public policies and private initiatives. This monitoring is carried out by each country and requires a large amount of data, mainly obtained through field surveys with a high financial and time cost. The resulting maps are used to address problems such as conservation measures at national and local levels [4], planning and development of green infrastructure [5], agro-environmental assessments [6], landscape change monitoring [7,8], ecological forest management [9] and identification of ecosystem services [10,11]. However, existing maps are often limited to categorical land cover characterization, which does not provide a precise legend for habitat and biotope types, and are hardly interoperable. Innovative remote sensing products could, however, facilitate status monitoring and the detailed characterization of large areas, sometimes even for fine-scale quality indicators [12]. While it does not replace field data collection, the integration of remote sensing could thus be a first step towards more cost-effective monitoring of natural habitats [13].

Because of the limitations of remote sensing, habitat suitability mapping and biotope prediction models are necessary to fill the gaps in field observations for biodiversity monitoring. Nevertheless, remotely sensed data are of paramount importance in providing the spatially comprehensive information necessary for prediction over large regions [14,15]. In this context, models are often based on regular grids linked with permanent structured inventories. However, with the democratization of geopositioning devices and the rise of citizen science, the precision of observations has increased tremendously. An alternative approach to grid-based habitat and biotope prediction could therefore emerge, with a landscape partitioning into ecologically meaningful irregular polygons.

#### *1.2. Remote Sensing for Ecotope Mapping*

Previous studies showed that irregular polygons supported habitat models that outperformed the standard grid-based approach for more than half of the investigated species [16]. This partition of the landscape into spatially consistent regions can be related to the concepts of ecotopes [17] or of land use management units [18]. Ecotopes are the smallest ecologically distinct landscape features in a landscape mapping and classification system. Mapping ecotopes therefore enables the measurement of ecological patterns, processes and change [19] with much more detail than categorical land cover classes or a continuous field of a single land cover feature.

Ecotope maps are often created by overlaying a large number of components, such as physiotope (topographic and soil features) and biotope (vegetation) layers [20,21]. As a result, ecotope maps are classified into hundreds of types and dozens of groups by combining biological and geophysical variables [22]. Furthermore, the different scales and precision of the boundaries of the overlaid thematic layers may create many artifacts which need to be handled with advanced conflation rules.

Alternatively, Geographic Object-Based Image Analysis can be used to delineate spatial regions by grouping adjacent pixels into homogeneous areas according to the objectives of the study [23,24]. For biodiversity research, image segmentation has been used to automatically derive homogeneous vegetation units based on spectral [25] or a combination of spectral and structural (height) information [16,26]. These approaches helped to reduce the number of polygons and improved the matching of those polygons with entities derived from the field.

On the other hand, GEOBIA has also been used to delineate physiotopes, which can then be overlaid with land cover polygons to derive meaningful spatial regions [18] or used directly to map aquatic habitats [27]. The delineation of physiotopes is difficult to assess because their definition depends on the purpose of the study [28]. Different GEOBIA methods have therefore been developed, based on curvature indices [18,29], decision rules using elevation and slope [30], network properties [28] or a large set (70) of indices including slope, aspect and various texture indices [27]. However, the methods designed for terrestrial landscapes focused on global to regional scales, where the relative position (ridge, side or valley) plays a major role in the classification, and they do not directly take the orientation of the slope into account.

Our study aims at improving the large-scale delineation of ecotopes used for ecological modeling in Delangre et al. [16]. Our hypothesis is that this improvement can be achieved by simultaneously processing the topographic information from a LIDAR DEM and the vegetation structure information from optical imagery and the LIDAR DHM. Topography is indeed a major driver of other abiotic components such as soil properties (which are more difficult to obtain at high precision) [31,32] or insolation (which depends on the orientation of the slope) [33].
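Since the workflow relies on hillshade bands derived from the LIDAR DEM, a standard analytical hillshade is sketched below. This is a common formulation based on DEM slope and aspect, not necessarily the exact product used in the study; the azimuth and altitude defaults are conventional cartographic choices, and `cellsize` is the DEM resolution in the same units as elevation.

```python
import numpy as np

def hillshade(dem, cellsize=1.0, azimuth_deg=315.0, altitude_deg=45.0):
    """Analytical hillshade: illumination of each cell by a light source
    at the given azimuth and altitude, computed from slope and aspect."""
    az, alt = np.radians(azimuth_deg), np.radians(altitude_deg)
    gy, gx = np.gradient(dem, cellsize)          # elevation gradients (rows, cols)
    slope = np.arctan(np.hypot(gx, gy))
    aspect = np.arctan2(-gx, gy)
    shaded = (np.sin(alt) * np.cos(slope)
              + np.cos(alt) * np.sin(slope) * np.cos(az - aspect))
    return np.clip(shaded, 0.0, 1.0)             # shadowed cells clamp to 0
```

Computing several such bands with different azimuths would expose slope orientation to the segmentation alongside the spectral bands.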

#### **2. Data and Study Area**

The study area is located in the Walloon region (the southern part of Belgium). This is a very fragmented landscape including coniferous forests (mainly spruce and other evergreen species), deciduous broadleaved forests (mainly oaks and beeches), crop fields, natural and managed grasslands, peatlands, small water bodies, extraction areas, as well as dense and sparse urban fabric.

There are no mountainous terrains in Belgium, but the topography is mainly driven by a dense hydrological network. In order to test our hypothesis, the experimental study focuses on ravine maple stands, which grow on relatively steep slopes and rocky soils. This biotope is particularly sparse, but present in at least five of the biogeographical regions of Wallonia. A rectangular study area (Figure 1) was delineated to include the majority of these biotopes present in Belgium. This region is relatively flat (slopes smaller than 7 percent) except in the valleys.

**Figure 1.** Aerial orthophoto (**left**) and slope derived from the LIDAR images (**right**) on the study area.

Two types of input data were available for the study area: first, a mosaic of ortho-rectified aerial photographs upscaled to 2 m resolution and including four spectral bands; second, a LIDAR point cloud dataset rasterized at 2 m resolution.

The aerial photographs cover the entire study area. This coverage was acquired in several flights between March and April 2015. Image acquisition included four spectral bands (blue, green, red and near-infrared) at a spatial resolution of 0.25 m. The images available for the analysis were already ortho-rectified, mosaicked and rescaled to byte values. In order to avoid too much local heterogeneity, which would affect the classification process, the original images were resampled to 2 m resolution using the mean value of all contributing pixels.
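The mean-value aggregation from 0.25 m to 2 m pixels can be sketched as a simple block mean with a factor of 8 in each dimension. The following numpy sketch (function name and toy values are ours, and it assumes band dimensions divisible by the factor) illustrates the idea:

```python
import numpy as np

def block_mean(band: np.ndarray, factor: int) -> np.ndarray:
    """Aggregate non-overlapping factor x factor blocks by their mean value."""
    h, w = band.shape
    if h % factor or w % factor:
        raise ValueError("band dimensions must be divisible by the factor")
    return band.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# 0.25 m pixels to 2 m pixels -> factor of 8: each output pixel averages 64 inputs
tile_025m = np.arange(64, dtype=float).reshape(8, 8)
tile_2m = block_mean(tile_025m, 8)
print(tile_2m)  # [[31.5]] -> mean of the 64 source pixels
```

Averaging all contributing pixels (rather than nearest-neighbour sampling) smooths out the sub-metre heterogeneity mentioned above while preserving the radiometric mean of each 2 m cell.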

The LIDAR dataset was acquired in spring 2013 and 2014. The minimum sampling density is 0.8 points per square meter. First and last returns were used to extract the ground elevation and the vegetation canopy height. In addition, this dataset required a specific mathematical morphology analysis in order to remove some artifacts: a grayscale opening was applied in order to remove power lines. A Digital Elevation Model (DEM) and a Digital Surface Model (DSM) of the vegetation were derived from the last and first returns, respectively. A Digital Height Model (DHM) was then obtained by subtracting the DEM from the DSM.
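The grayscale opening removes bright features narrower than the structuring element, which is why it suppresses thin power lines while leaving wider canopy patches intact. A minimal numpy-only sketch (a 3-by-3 structuring element and a toy flat terrain are our assumptions, not the paper's exact settings):

```python
import numpy as np

def min_filter3(a):
    """3x3 grayscale erosion with edge padding."""
    p = np.pad(a, 1, mode="edge")
    return np.stack([p[i:i + a.shape[0], j:j + a.shape[1]]
                     for i in range(3) for j in range(3)]).min(axis=0)

def max_filter3(a):
    """3x3 grayscale dilation with edge padding."""
    p = np.pad(a, 1, mode="edge")
    return np.stack([p[i:i + a.shape[0], j:j + a.shape[1]]
                     for i in range(3) for j in range(3)]).max(axis=0)

def grey_opening3(a):
    """Opening = erosion then dilation; removes bright features
    narrower than the 3x3 element (e.g., 1-pixel-wide power lines)."""
    return max_filter3(min_filter3(a))

# Toy DSM: flat ground at 100 m plus a 1-pixel-wide "power line" artifact
dsm = np.full((20, 20), 100.0)
dsm[10, :] += 15.0
dem = np.full((20, 20), 100.0)   # ground elevation from last returns

dhm = grey_opening3(dsm) - dem   # Digital Height Model after cleaning
print(dhm.max())                 # 0.0 -> the power line is removed
```

In practice a library routine such as `scipy.ndimage.grey_opening` would be used on the rasterized first-return surface before computing DHM = DSM − DEM.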

In addition to the remote sensing data, a vector database describing the biotopes inside Belgian protected areas from the European network of natural sites (NATURA 2000) was available. This database was produced by the Walloon administration for Nature and Forest based on expert knowledge and exhaustive field inventories (all polygons). It is considered the best available information about ravine forests in Wallonia and was therefore used as a reference. In order to ensure and improve the reliability of this map, polygons with visible clear cuts on the 2015 orthophotos were manually removed from this reference.

#### **3. Method**

The core of the proposed process is the simultaneous segmentation of the topographic, spectral and height information. The resulting image segments are then enriched by computing a set of attributes based on remote sensing and ancillary data. The potential of the proposed method to automatically delineate meaningful spatial regions is assessed based on two expected properties of the ecotopes: a large homogeneity and the ability to build high performance ecological models. These steps are summarized on Figure 2.

**Figure 2.** Overall flowchart of the proposed method.

#### *3.1. Automated Ecotope Delineation*

The three variables of interest to discriminate ecological function at the scale of the analysis are the land cover, the topography and the soil type. However, the available soil type information was not precise enough and could be partly inferred from the topography. We therefore focused on variables that could be directly derived from remote sensing: topography and land cover.

The multiresolution segmentation algorithm [34] was used to automatically delineate ecotopes. This algorithm can be tuned with a set of four parameters: the scale, the weights of the raster layers, the shape and the compactness. The scale parameter defines the maximum acceptable change of heterogeneity when merging two neighboring image-segments. Increasing the scale parameter therefore increases the size of the image-segments. The weights of the layers define how much each raster layer contributes to the heterogeneity difference of the merged image-segments, as shown in Equation (1):

$$h\_{diff} = \sum\_{L} w\_{L} \left( n\_{1} (h\_{mergedL} - h\_{1L}) + n\_{2} (h\_{mergedL} - h\_{2L}) \right) / \sum\_{L} w\_{L} \tag{1}$$

where *hdiff* is the total heterogeneity difference after merging based on the raster layers, *wL* is the weight of each raster layer, *hmergedL* is the heterogeneity of the merged image-segment for layer *L*; *n*<sup>1</sup> and *n*<sup>2</sup> are the numbers of pixels in image-segments 1 and 2; *h*1*<sup>L</sup>* and *h*2*<sup>L</sup>* are the heterogeneity indices of image-segments 1 and 2 for layer *L*. Then, the shape parameter defines the proportion of the heterogeneity index that is based on the shape of the image-segment. Increasing the shape parameter therefore reduces the contribution of the layer-based heterogeneity difference after merging. The compactness parameter determines whether this shape index should favor compact image-segments (similar to a disk) or smooth image-segments (similar to a rectangle).
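As an illustration of Equation (1), the weighted merging criterion can be sketched as follows. The function name and toy values are ours; the sketch assumes the per-layer heterogeneity values (e.g., standard deviations) have already been computed:

```python
import numpy as np

def heterogeneity_diff(h_merged, h1, h2, n1, n2, weights):
    """Weighted heterogeneity change when merging two image-segments,
    following Equation (1).

    h_merged, h1, h2 : per-layer heterogeneity of the merged segment and
                       of segments 1 and 2 (e.g., standard deviations);
    n1, n2           : pixel counts of the two candidate segments;
    weights          : per-layer weights w_L.
    """
    h_merged, h1, h2, w = map(np.asarray, (h_merged, h1, h2, weights))
    per_layer = w * (n1 * (h_merged - h1) + n2 * (h_merged - h2))
    return per_layer.sum() / w.sum()

# Two equally sized toy segments; merging increases heterogeneity on two layers
h_diff = heterogeneity_diff(h_merged=[5.0, 3.0], h1=[2.0, 1.0], h2=[2.0, 1.0],
                            n1=100, n2=100, weights=[1.0, 0.5])
print(h_diff)
```

A merge would be accepted only while this weighted difference stays below the threshold set by the scale parameter, which is why larger scale values yield larger image-segments.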

The efficiency of a segmentation combining LIDAR height and multispectral imagery had already been demonstrated [26]. Our working hypothesis is that simultaneously combining the topographic information with the spectral values of the orthophotos and the DHM derived from the LIDAR improves the delineation of the ecotopes. The segmentation results obtained with different weights for the topographic information relative to the other layers are therefore compared. For the sake of a fair comparison, the average size of the image-segments was fixed to approximately 2 ha (two hectares on average corresponds to the smallest ecological management units according to a group of users including biodiversity researchers and managers). To do so, the composite image was first segmented with a scale parameter of 50, a shape parameter of 20% and a compactness of 100%. The shape parameter was then reduced to 10% and a larger scale parameter was obtained using a binary search algorithm with a tolerance of 0.2% on the total number of polygons obtained in the reference image segmentation (that is, 318,380). Apart from the size, which was fixed after the first segmentation, no other optimization of the segmentation was performed. The only difference between the segmentations is therefore the weight of the topographic component, which was tested with values of zero (only spectral and structural information), 0.5, 1 and 2 (increasing the influence of topographic information).
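The binary search on the scale parameter can be sketched as follows. Here `run_segmentation` is a hypothetical callback standing in for a full multiresolution segmentation run that returns the resulting polygon count; the toy count model is ours and only mimics the fact that the number of polygons decreases as the scale grows:

```python
def tune_scale(run_segmentation, target_count, lo, hi, tol=0.002, max_iter=30):
    """Binary-search the scale parameter until the polygon count matches
    target_count within a relative tolerance (0.2% in this study).
    run_segmentation(scale) -> int must be monotonically decreasing in scale."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        count = run_segmentation(mid)
        if abs(count - target_count) <= tol * target_count:
            return mid, count
        if count > target_count:   # too many polygons -> increase scale
            lo = mid
        else:                      # too few polygons -> decrease scale
            hi = mid
    return mid, count

# Toy model: polygon count falls roughly with the square of the scale parameter
scale, n_polygons = tune_scale(lambda s: int(8e8 / s**2),
                               target_count=318380, lo=10, hi=200)
```

Each bisection step halves the search interval, so only a handful of (expensive) segmentation runs are needed to hit the 0.2% tolerance on the 318,380-polygon reference.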

Including topography in the segmentation required a transformation of the DEM data to highlight the different slope types and identify breaks. Because the segmentation algorithm is based on the minimization of the variance inside each image-segment, using DEM values directly would tend to create many linear spatial regions along contour lines in areas of steep slopes, even if the slope is constant. Previous studies used the slope together with some curvature indices [18]. This is interesting for pedomorphic mapping, but (i) it relies on an arbitrary window size to compute minimum and maximum curvature, and (ii) both sides of ridge and valley lines end up in the same segment despite different sun illumination. In the case of ecotopes, the slope and its aspect are therefore more closely related to the functional homogeneity. However, slope aspect could not be used by the segmentation algorithm because (i) it is undefined when the slope is null and (ii) it is a circular metric that jumps from 360 degrees to 0 degrees for the same azimuthal direction. For those reasons, Janowski et al. [27] used easting and northing instead of azimuth. For the ecotopes, two synthetic hillshade maps were instead derived along the North-South and East-West transects using Equation (2), because hillshade is the variable most directly linked with the potential solar energy.

$$\text{Hillshade} = 255 \times \left( \cos(SZA) \times \cos(Slope) + \sin(SZA) \times \sin(Slope) \times \cos(SAA - Aspect) \right) \tag{2}$$

where *SZA* and *SAA* are the hypothetical sun zenithal and azimuthal angles, respectively, and *Slope* and *Aspect* are derived from the DEM using a 3-by-3 moving window. The use of 3-by-3 windows corresponds to local hillshade at a high spatial resolution (2 m), so that only pixels with similar hillshade values are likely to be grouped together by the segmentation algorithm. The shape parameter of 20% used in the segmentation process aims at preserving the compactness of the image-segment when isolated pixels have a markedly different orientation than their surroundings, while image-segments are still expected not to merge when there is a change of slope.

In practice, synthetic hillshade maps were created by setting a large sun zenith angle (75°) for four sun azimuth angles (0°, 90°, 180°, 270°). The difference between the results of each pair of opposite theoretical sun azimuth angles was then computed. Cast shadows were ignored in this process because the aim of the hillshade is only to provide a continuous topographic characterization. As can be seen in Figure 3, the hillshade difference values are equal on flat surfaces and on slopes oriented at 45° or 135° azimuths. In the case of flat areas, the value in two opposite directions is indeed equal, so that their difference is zero for all azimuths. In the other case, the values are either positive or negative, and they are equal for the orthogonal direction because 45° is the bisector of those azimuthal angles.
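Equation (2) and the pairwise differencing can be sketched for a single cell as follows (in practice *Slope* and *Aspect* are raster layers; the toy slope and aspect values are ours):

```python
import numpy as np

def hillshade(slope_deg, aspect_deg, sza_deg=75.0, saa_deg=0.0):
    """Synthetic hillshade following Equation (2); all angles in degrees."""
    sza, saa = np.radians(sza_deg), np.radians(saa_deg)
    slope, aspect = np.radians(slope_deg), np.radians(aspect_deg)
    return 255.0 * (np.cos(sza) * np.cos(slope)
                    + np.sin(sza) * np.sin(slope) * np.cos(saa - aspect))

# Example cell: a 20 degree slope facing due north (aspect 0)
slope, aspect = 20.0, 0.0
ns_diff = hillshade(slope, aspect, saa_deg=0) - hillshade(slope, aspect, saa_deg=180)
ew_diff = hillshade(slope, aspect, saa_deg=90) - hillshade(slope, aspect, saa_deg=270)
# A north-facing slope contrasts strongly North-South but not East-West
print(round(ns_diff, 1), round(ew_diff, 1))  # 168.5 0.0
```

The cosine term in Equation (2) cancels in each difference, so flat cells yield zero in both bands, while tilted cells get a signed value encoding the orientation of the slope, as described above for Figure 3.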

**Figure 3.** False color composite of the hillshades along the North-South and East-West directions (**left**) and slope derived from the LIDAR images (**right**) on a subset of the study area. Shades of grey indicate that the hillshade values in the two orthogonal directions are equal while colored areas highlight the differences between the two directions.

#### *3.2. Quality Assessment*

For the sake of the ecological models, image-segments are enriched based on the proportion of land cover features that they contain, as well as various soil and contextual attributes [16]. With all attributes being derived from external databases, the quality assessment focuses on the homogeneity of the image-segments (which is a key feature for ecotopes) (Section 3.2.2) and on the ability to run high-performance ecological models (Section 3.2.3).

Due to the lack of another up-to-date high resolution land cover map of the Walloon region at the time of the study, a high resolution pixel-based land cover map was produced in order to characterize the ecotopes and build some of their homogeneity indices. While the production of this high resolution land cover database is out of the scope of this paper, it is briefly described in Section 3.2.1.

#### 3.2.1. High Resolution Pixel-Based Land Cover

A Bayesian classifier with an automated training sample extraction method [35] was used to classify eight land cover types: bare soil, artificial, grassland, crops, coniferous, broadleaved, water and shrubs. The input image was based on the same datasets as the segmentation: the four spectral bands of the aerial photographs and the height information extracted from the LIDAR. The *a priori* probability was computed based on the frequency of each land cover type within two height classes (below and above 50 cm). Because of the high reliability of the LIDAR DHM, this step was particularly useful to discriminate forests, shrubs and buildings from the other land covers. The training dataset was compiled from existing datasets covering the study area, including a 2007 land cover map from the Walloon Region, OpenStreetMap data [36] and forest inventory data from the Nature and Forest Department. The results were then consolidated with a crop mask in order to discriminate grassland from cropland. Furthermore, the classification of forest types was consolidated in homogeneous regions thanks to a classification of cloud-free Sentinel-2 images from early spring and mid-summer 2016 (assuming that the forest type does not change from year to year and excluding clear cuts from the analysis).

For the validation, a simple random sample of 700 points was photointerpreted on the orthophoto with a geolocation tolerance of 5 m, and ambiguous points were verified on the ground. The estimated overall accuracy of the consolidated product (93%, at the 95% confidence level) was higher than that of the other products; it was therefore considered as our best reference in the frame of this paper.

#### 3.2.2. Homogeneity Measures

In order to test the hypothesis of this study, different homogeneity indices were computed. Those indices assess homogeneity in terms of land cover (based on the high resolution land cover layer), topography and soil types. They are compared with an arbitrary regular grid with the same cell area as the average polygon size, which provides a reference with the same segmentation ratio. Because of the specific interest in a biotope that is mainly present in areas of steep slopes, the homogeneity indices were computed not only for the whole study area, but also for a subset composed of the polygons with an average slope above 10 degrees.

Giving more weight to the topographic bands could affect the homogeneity in terms of land cover delineation. In order to control for a potential loss of land cover homogeneity, the average purity level was computed for each segmentation. The proportion of each land cover class was computed inside each polygon based on the high-resolution pixel-based land cover classification presented in Section 3.2.1. The purity index is then defined as the average, over all image-segments, of the maximum land cover proportion of each image-segment.
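The purity index described above can be sketched with a few lines of numpy (the function name and the toy rasters are ours; in practice the segment and land cover layers are full-scene rasters):

```python
import numpy as np

def purity_index(segment_ids, land_cover):
    """Average, over segments, of the maximum land cover class proportion."""
    purities = []
    for seg in np.unique(segment_ids):
        classes = land_cover[segment_ids == seg]
        _, counts = np.unique(classes, return_counts=True)
        purities.append(counts.max() / counts.sum())
    return float(np.mean(purities))

# Toy 2x4 raster split into two segments of four pixels each
seg = np.array([[1, 1, 2, 2],
                [1, 1, 2, 2]])
lc  = np.array([[3, 3, 3, 5],   # segment 1 is pure class 3 (purity 1.0)
                [3, 3, 5, 5]])  # segment 2 is 3/4 class 5 (purity 0.75)
print(purity_index(seg, lc))    # 0.875
```

A purity of 1.0 means every segment is covered by a single land cover class; values closer to the class proportions of the scene indicate segments that mix classes.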

From the topographic point of view, the primary variable of interest is the slope. The slope was measured on a smoothed version of the 2 m DEM in order to remove micro-topography effects and artifacts due to the noise of the dataset. Because the slope is a quantitative variable, its heterogeneity was estimated using the standard deviation (STD) inside each polygon. For the aspect of the slope, the standard deviation could not be used because of the break between 0 and 360°. The azimuth values were therefore converted into nine categories: eight directions (North, North-East, East, South-East, South, South-West, West and North-West) plus one class for flat areas (where the aspect is undefined). The purity index of these nine categories is then computed as in the case of the land cover.
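The conversion of the circular azimuth into the nine categories can be sketched as follows; the 45°-wide bins centred on the cardinal directions resolve the 0/360° break, and the flat threshold value here is our assumption for illustration:

```python
def aspect_category(aspect_deg, slope_deg, flat_threshold=1.0):
    """Convert circular aspect (degrees) into 8 direction classes plus 'flat'.
    Bins are 45 degrees wide and centred on N, NE, ..., NW, so that 350 and 10
    degrees both fall in the 'N' class despite the 0/360 break."""
    labels = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    if slope_deg < flat_threshold:
        return "flat"                          # aspect undefined on flat terrain
    idx = int(((aspect_deg + 22.5) % 360) // 45)
    return labels[idx]

print(aspect_category(350.0, 5.0))  # N  -- no discontinuity across 0/360
print(aspect_category(10.0, 5.0))   # N
print(aspect_category(100.0, 0.2))  # flat
```

Once categorized, the aspect purity of each polygon is computed exactly like the land cover purity above.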

Finally, an independent data source was also considered: the soil map. The purity index for soil drainage classes and soil depth classes was used as an additional indirect indicator of the polygon homogeneity. Those two soil classes were derived from the digital soil map of the Walloon region. The precision of this map corresponds to a scale of 1:25,000, which is coarser than the polygon delineation, but this uncertainty affects all polygon boundaries in a similar way.

#### 3.2.3. Biotope Models

In addition to the homogeneity measures, a fitness-to-purpose analysis was implemented. The sensitivity of two state-of-the-art algorithms, namely Random Forest (RF) and Generalized Additive Model (GAM), was tested for the detection of ravine maple forests. Each model was calibrated using the same workflow for each of the segmentation results.

First, a large set of attributes was derived from existing databases and GIS analysis. This set includes bioclimatic variables interpolated from WorldClim [37], soil variables, topographic variables and land cover variables obtained by zonal statistics within each ecotope. Those variables were selected based on expert knowledge, and their contribution to habitat suitability models was assessed in a previous study [16].

Calibration and validation polygons were then selected by crossing the ecotope database with the polygons of the NATURA 2000 database. An ecotope was labeled as a ravine maple forest biotope if more than half of its area was covered by its equivalent in the NATURA 2000 cartography. To obtain a presence/absence dataset, ecotopes matching ravine maple forests were considered as presences, while ecotopes matching any other forest biotope were considered as absences.

Different quality indices were used to validate the models, including the Overall Accuracy (OA) and the Area Under the Curve (AUC) of the model, as well as the producer and user accuracies (PA and UA) of the optimal binary classification between ravine forest and other forest biotopes. In order to evaluate the accuracy of the model in detecting ravine forests among all other biotopes, another overall accuracy was calculated taking into account all surfaces covered by NATURA 2000 surveys (OA\_Tot). Those indices were computed on the validation polygons, which had been separated from the rest of the dataset before the calibration step. In order to provide an unbiased estimate of the correctly classified areas, the ecotope polygons were used as sampling units and their areas were taken into account [38]. The optimal binary classification was automatically determined based on the best compromise between sensitivity and specificity.
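One common way to implement such a "best compromise between sensitivity and specificity" is to pick the cut-off maximizing Youden's J (sensitivity + specificity − 1); the paper does not name its exact criterion, so the following is a sketch under that assumption, with toy scores of ours:

```python
import numpy as np

def best_threshold(scores, labels):
    """Return the cut-off on model scores maximizing Youden's J,
    i.e., sensitivity + specificity - 1 (one possible reading of the
    'best compromise' between the two rates)."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    best_t, best_j = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        sens = (pred & labels).sum() / max(labels.sum(), 1)
        spec = (~pred & ~labels).sum() / max((~labels).sum(), 1)
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t

# Rare-presence toy data: 2 presences with high scores, 6 absences
t = best_threshold([0.9, 0.8, 0.4, 0.3, 0.2, 0.2, 0.1, 0.05],
                   [1, 1, 0, 0, 0, 0, 0, 0])
print(t)  # 0.8 -> this cut-off separates the toy classes perfectly
```

With a rare class like ravine forest, such a criterion tends to favor high sensitivity, which matches the behavior discussed in Section 5.2.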

#### **4. Results**

By design, approximately 318,400 image-segments were automatically created in the study area (with a range of 500 polygons between segmentations, that is, less than 0.2 percent). A visual check did not reveal any macroscopic errors, but it did reveal most of the topographic features hidden by the vegetation in the aerial image. Figure 4 shows a subset of the segmentation result, highlighting the impact of the topography on the image-segments created inside patches of homogeneous land cover. As expected, areas of homogeneous slope are well delineated in addition to the land cover induced partitioning. Furthermore, the limits of the ecotopes are consistent with the pattern of slope curvature, which was not used for the segmentation.

Quantitative results related to the homogeneity of the image-segments are summarized in Tables 1 and 2. Overall, the advantage of an automatically partitioned landscape over a regular grid of the same size is obvious. The results indeed show that the heterogeneity of the topographic attributes decreases and the separability of ecotopes increases when the partition of the landscape is determined by topography and land cover. As shown in Table 2, the results within the subset of polygons with a slope above 15 percent further highlight the differences where the terrain plays a bigger role in the definition of the polygons.



**Figure 4.** Image-segment boundaries (in red) overlaid on the curvature of the DEM at 10 m resolution (**left**) and the orthophoto from the Walloon Region (**right**, copyright SPW 2015). The images at the top (**a**,**b**) display the segmentation with the topographic bands (weight of 1); the images at the bottom (**c**,**d**) display the segmentation without the topographic bands. The green rectangle highlights an area where the land use boundaries follow the topography.

**Table 2.** Homogeneity of the subset of image-segments with a slope greater than 15 percent. The grid is composed of squares with the same area as the average of image-segments. Large purity values and low average of slope variance indicate a good segmentation.


The analysis of the predictive models indicates that the ecotopes are appropriate mapping units to map ravine forest in the study area. The overall accuracy of the best model is indeed 99.9% (Table 3). However, this value does not completely reflect the errors of the model because ravine forests are rare in the study area and specific to the polygons with a majority of broadleaved trees. Additional indices measured on a subset of the ecotopes with a majority of broadleaved trees are therefore more relevant to compare the different scenarios. On this subset, the use of topographic information to delineate polygons had a significantly positive impact on the results of the models. This confirms the results obtained with the homogeneity measures (Tables 1 and 2). However, the best prediction is achieved for the GAM model when the weight of the topographic information is equal to the weight of the spectral information, and the performance of the model decreases with a weight of 2.

**Table 3.** Results of the different models of ravine forest predictions with respect to the weight of topography in the segmentation. Matches correspond to the number of ecotopes matching at more than 50 percent with biotopes survey polygons. The next rows show the different quality indices including the overall accuracy of ravine forest mapping in the study area (*OATot*), the Overall Accuracy (OA), Area Under the Curve (AUC), Producer Accuracy (PA) and User Accuracy (UA) of the ravine forests for each model based on ecotopes covered by a majority of broadleaved trees.


The number of matching polygons increases when the segmentation uses more topographic components (Table 3). This is due to the fact that the polygon boundaries match the NATURA 2000 boundaries more closely than in the case without topographic contributions, but also because the polygons become smaller on average in rugged terrain when the topography is taken into account, as shown in Table 2.

#### **5. Discussion**

This study demonstrates that automated image segmentation simultaneously combining topographic information from LIDAR with the spectral information from optical sensors provides ecologically relevant polygons. Two facets of the results are discussed in this section: the technical quality of the results and the usefulness of the model for biodiversity studies.

#### *5.1. Consistency of the Polygons*

The objective of this paper was to build homogeneous polygons that would better match the concept of ecotopes than a delineation solely accounting for the land cover. While it was foreseen that the addition of topographic information to the segmentation process would reduce the topographic heterogeneity, the increased land cover homogeneity was surprising. This could be due to long term land management practices that optimized land use based on the topography (for example, most crop fields are located on flat areas while steep slopes are mainly covered by forests). Such patterns have been observed in the landscape, but the causality should be further investigated.

The results of the models show that including topographic information improves the correspondence between ecotope polygons and the field mapping of the ravine forest biotope. The increased number of matches is partly due to the reduction of the average polygon size inside rugged terrain (about 25% for a weight of 2). The reduction in size is however not sufficient to explain the large increase in the number of matches. This is better explained by the fact that the presence of the biotope is highly dependent on the topographic situation. Indeed, contrasting situations such as south- or north-facing hillsides lead to very different abiotic features. Furthermore, even a small difference in slope leads to different water intakes, and hence to different vegetation communities.

Even if the models were not created in the best conditions due to the scarcity of the biotope, we can stress that there is a large leap in the AUC of the models (more than 16% for both RF and GAM) when adding topographic data to the segmentation process. Concerning the model accuracy, the large increase observed when adding topographic data is consistent with the improvement observed in the heterogeneity indices. However, the model indices do not follow the same trend as the heterogeneity indices and are less correlated with the level of contribution of the topographic data. This is probably due to the fact that we model a rare biotope with scarce presence data; the overall quality of the models is thus sensitive to the polygons selected in the calibration/validation process.

In this study, the average size was selected according to user requirements and fixed in order to fairly compare the contributions of the different layers. However, the size could also be optimized based on data-driven features [39,40]. In this case, the weight of the topography, the vegetation height and the spectral values of the images should be considered to determine an optimal size. However, the observed difference between the trend in terms of ecotope purity and the trend of the fitness-to-purpose analysis should be further investigated as a potential issue for the use of a single optimization criterion.

On the other hand, landscape and landform analysis very much depends on the scale of the process being addressed, and a hierarchical approach could help to extract additional characteristics from the landscape. For instance, elevation and curvature are important features at coarser scales to identify ridges or valleys. However, ridge and valley landforms include two sides facing opposite directions, which is not homogeneous in terms of insolation. The use of hillshade at the local scale obviously placed the emphasis on potential insolation, but it also split image-segments in places of strong positive or negative curvature, which contributed to the improved homogeneity of soil properties. Identifying patterns at other scales could however be necessary to cover the processes that occur in more mountainous areas.

#### *5.2. Usefulness of Biotope Models*

The high AUC of the best models (above 95% for both GAM and RF) shows that models based on ecotope polygons are very consistent. However, despite the high specificity and sensitivity of the models, the user accuracy of the ravine forest class is very low. This low user accuracy can be explained by three different factors.

First, the ravine forests are rare in the study area. As a result, even a very small proportion of false positives has a strong impact on the user accuracy. Nevertheless, the absolute number of predicted forests remains small, so that the models could help field prospecting by narrowing the search area to fewer than five percent of the total surface of forests.

Second, the models were used to test a concept and compare different segmentation strategies, but they could be tuned for other specific purposes. The selected optimization criterion, based on the sensitivity and the specificity, favored a solution that maximized the sensitivity, because the specificity, computed on a large proportion of absences, was rapidly very high (above 97%) and then increased only slowly. This type of optimization is particularly useful for restricting the surveyed area in order to exhaustively map a specific biotope. Another threshold for the binary classification could seek an optimum between producer's and user's accuracies based on the F-score. With this alternative, the UA would be 0.63 and the PA would be 0.65, which is a good compromise for a generic result but less interesting for the identification of unknown biotopes of high biological interest.

Third, the model is limited by the available data, which does not replace field based observations. For instance, typical ferns in the understory are not visible by remote sensing. On the other hand, the ecotope might include all conditions for the development of maple ravine forests, but a different type of forest could have developed because of historical land management of the ecotope. From this point of view, a substantial proportion of the false positives could be considered priority zones for biotope restoration.

Nevertheless, the user accuracy values discussed here relate to the correct classification of ravine forests among other broadleaved forests. The total overall accuracy values show an excellent prediction of ravine forests in our study area, thereby moderating the poor user accuracy previously discussed.

With respect to the field data, forest communities do not follow a logic of tangible frontiers; rather, they form a gradient of vegetation communities mixing with one another. Limits are vague, but the conceptual model of the geographic database uses crisp boundaries to represent those transition areas. This conceptual mapping of so-called spatial regions is not specific to GEOBIA; it is also performed in the field when a boundary has to be drawn. While about half of the boundaries match the boundaries of the reference polygons with a precision close to 10 m, diverging delineations occur on the other half of the boundaries (see the example in Figure 5). An independent field campaign was unable to undoubtedly and consistently arbitrate between the two datasets. Despite its limitations, the repeatability of the automated image segmentation makes it very useful in prospective studies or to guide interpretation in the field. However, the use of sharp boundaries could be an issue when representing gradients of vegetation or when associating the polygons with point observations. It is therefore of paramount importance to remember that the proposed mapping strategy is a model used to represent the landscape in a way that closely matches the definition of biotopes in the field, but that there is no universal representation of nature.

**Figure 5.** Divergences of delineation of ravine forest in the Natura 2000 database (green polygons) and the automated segmentation (red outlines).

#### **6. Conclusions**

This study demonstrates that hillshade layers can be used simultaneously with spectral information to improve the automated delineation of ecotopes in a GEOBIA framework. The AUC of the predictive ecological models was improved by 15% when the ecotopes were delineated using these topographic layers. Furthermore, the inclusion of topographic features in the segmentation process also improved the purity in terms of land cover, probably due to the indirect impact of the topography on the land use in the study area.

The good results at a small scale factor suggest that the proposed GEOBIA workflow could be tested at larger scale factors in combination with curvature indices in order to generate homogeneous landforms with minimal arbitrary decisions.

**Author Contributions:** Conceptualization, J.R., A.B., W.C., M.D. and P.D.; Formal analysis, J.R. and A.B.; Methodology, J.R., A.B. and W.C.; Supervision, M.D. and P.D.; Validation, J.R. and A.B.; Writing—original draft, J.R., A.B. and W.C.; Writing—review and editing, J.R., A.B., M.D. and P.D.

**Funding:** This research was funded by Fédération Wallonie-Bruxelles in the frame of the Lifewatch-WB project.

**Acknowledgments:** The authors thank the three reviewers for their constructive comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Dataset:** Extended database is available on http://maps.elie.ucl.ac.be/lifewatch/ecotopes.html.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Edge Dependent Chinese Restaurant Process for Very High Resolution (VHR) Satellite Image Over-Segmentation**

#### **Hong Tang \*, Xuejun Zhai and Wei Huang**

Beijing Key Laboratory of Environmental Remote Sensing and Digital Cities, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China; zhai.xuejun@mail.bnu.edu.cn (X.Z.); huangwei@mail.bnu.edu.cn (W.H.)

**\*** Correspondence: tanghong@bnu.edu.cn; Tel.: +86-10-5880-6401

Received: 15 August 2018; Accepted: 18 September 2018; Published: 21 September 2018

**Abstract:** Image over-segmentation aims to partition an image into spatially adjacent and spectrally homogeneous regions. It can reduce the complexity of image representation and enhance the efficiency of subsequent image processing. Many methods for image over-segmentation have been proposed, but almost all of them require model parameters, e.g., the number of segments, to be assigned in advance. In this paper, a nonparametric clustering model is applied to the over-segmentation of Very High Resolution (VHR) satellite images, in which the number of segments can be inferred automatically from the observed data. The proposed model is called the Edge Dependent Chinese Restaurant Process (EDCRP), which extends the distance dependent Chinese restaurant process to make full use of local image structure information, i.e., edges. Experimental results show that the presented method outperforms state-of-the-art methods for image over-segmentation in terms of both metrics-based direct evaluation and classification-based indirect evaluation.

**Keywords:** image over-segmentation; distance dependent Chinese restaurant process; nonparametric Bayesian clustering model; superpixels

#### **1. Introduction**

With the development of imaging techniques and satellite manufacturing, both the spectral and spatial resolutions of Very High Resolution (VHR) satellite images have improved significantly [1–3]. Consequently, VHR satellite images have been extensively applied in many domains, such as natural disaster monitoring [4,5], land cover change detection [6,7], environmental protection [8], and agricultural production [9], among others [10,11]. Along with the increase in spatial resolution, it has become a considerable challenge for traditional methods to extract information from VHR satellite images. Image over-segmentation is a common way to simplify image representations and speed up image processing by partitioning images into spatially adjacent and spectrally homogeneous regions [12,13].

Image over-segmentation has been widely used in many applications, including computer vision [14,15], object recognition [16,17], image retrieval [18,19], and image classification [20,21]. The following properties are expected of a "good" over-segmentation result: (1) pixels within a segment should be spatially adjacent and have similar features; (2) the boundaries of segments should adhere well to the "meaningful" image boundaries; and (3) each segment should reside within only one real geo-object. Many over-segmentation algorithms aiming at these properties have been used to segment everyday photos and pictures, including Simple Linear Iterative Clustering (SLIC) [22], Entropy Rate Superpixel segmentation (ERS) [23], normalized cut (NC) [24], edge-augmented mean-shift (ED) [25], and others [26,27]. These methods can be roughly divided into four categories: threshold-based algorithms [28,29], edge-based algorithms [30,31], graph-based algorithms [32,33], and clustering-based methods [34,35]. Threshold-based algorithms obtain an over-segmentation simply and efficiently, but choosing an appropriate threshold is very difficult. Although the boundary recall of edge-based methods is often rather high, their results are sensitive to the quality of the detected edges. Graph-based algorithms can straightforwardly exploit spatial constraints by measuring affinity in terms of spectral or structural similarity.

Clustering-based methods are widely used for image over-segmentation because they can embed spatial information. SLIC is the most typical clustering-based over-segmentation method [22]. The Entropy Rate Superpixel (ERS) method also treats over-segmentation as a clustering problem [23]. The main difficulty for clustering-based image segmentation is estimating a suitable number of clusters before clustering. As a Bayesian nonparametric clustering model, the Chinese restaurant process (CRP) mixture model [36] and its variants provide a principled way to infer the number of clusters from the observed data. An underlying assumption of these nonparametric clustering models is that the observed data in each group are modeled as an exchangeable sequence of random variables. In other words, for an exchangeable sequence of random variables, any permutation of the sequence has the same joint probability distribution as the original sequence [37]. If the CRP were applied directly to image clustering, the spatial dependency among pixels would be lost. Generally speaking, there are three ways to enhance the spatial dependency among pixels when the CRP and its variants are applied to image analysis: preprocessing, post-processing, and directly modeling the interdependence among neighboring variables [36]. For example, by using over-segmentation of VHR satellite images as a preprocessing step, the CRP and its variants have been successfully applied to image or feature fusion [38], geo-object category detection [39], unsupervised classification [40], and so on. However, it is difficult for humans to assign an appropriate number of over-segments, since a large number of geo-objects of different sizes and shapes are scattered across VHR satellite images without a uniform spatial distribution. Majority voting over neighboring pixels is often used as a post-processing step to enhance the spatial consistency of clustering labels [41]. However, the clustering results then depend strongly on the chosen neighborhood system.

As an extension of the CRP, the distance dependent Chinese restaurant process (ddCRP) explicitly models the dependency between two random variables as a function of their distance [37]. The ddCRP was originally proposed for text modeling [37]. Can the ddCRP be used effectively for the over-segmentation of VHR satellite images, and if so, how? To answer this question, we systematically analyzed the characteristics of the ddCRP. Specifically, we evaluated the impact of three components of the ddCRP model on the over-segmentation of VHR satellite images, i.e., the distance dependent term, the sampling probability term, and the connection mode. Furthermore, we present an improved model for the over-segmentation of VHR satellite images, termed the Edge Dependent Chinese Restaurant Process (EDCRP).

The rest of this paper is organized as follows. In Section 2, we describe the basic principle of the ddCRP and analyze the characteristics of its components. The EDCRP model is described in Section 3. Experimental results coupled with related discussions are given in Section 4. Conclusions are drawn in Section 5.

#### **2. Distance Dependent Chinese Restaurant Process**

After the ddCRP is introduced in the first subsection, we systematically analyze the characteristics of its three components, i.e., the distance dependent term, the likelihood term, and the connection mode.

#### *2.1. ddCRP*

The CRP is a classical representation of the Dirichlet Process (DP) mixture model and an intuitive way to obtain a distribution over infinite partitions of the integers. Under the metaphor of the CRP, the random process is described as follows. Customers enter a Chinese restaurant one by one for dinner. When a customer enters the restaurant, he or she either chooses a table already occupied by previously entered customers, with a probability proportional to the number of customers sitting at that table, or takes a seat at a new table, with a probability proportional to a scale parameter *α*. The random process is given by

$$p(t_i = k \mid \mathbf{T}_{\neg i}, \alpha) \propto \begin{cases} n_k, & k \le K \\ \alpha, & k = K + 1 \end{cases} \tag{1}$$

where *t<sub>i</sub>* represents the index of the table chosen by the *i*th customer; **T**<sub>¬*i*</sub> = {*t*<sub>1</sub>, . . . , *t*<sub>*i*−1</sub>, *t*<sub>*i*+1</sub>, . . . , *t*<sub>*N*</sub>} denotes the table assignments of the *N* customers with the exception of the *i*th customer; *K* is the number of tables already occupied by some customers; and *n<sub>k</sub>* denotes the number of customers sitting at table *k*.

After all of the customers have taken their seats, a random partition has been realized: customers sitting at the same table belong to the same cluster. This clustering property yields a powerful probabilistic clustering method. Its most important advantage is that the number of partitions does not need to be assigned in advance, because the CRP treats the number of clusters as a random variable that can be inferred from the observations by marginalizing out the random measure function.
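The seating process of Equation (1) can be sketched as a short simulation; `crp_partition` is a hypothetical helper name for illustration, not code from the paper:

```python
import random

def crp_partition(n_customers, alpha, seed=0):
    """Simulate Equation (1): each customer joins occupied table k with
    probability proportional to n_k, or opens a new table with
    probability proportional to the scale parameter alpha."""
    rng = random.Random(seed)
    tables = []   # tables[k] = number of customers at table k
    labels = []   # table index chosen by each customer
    for _ in range(n_customers):
        weights = tables + [alpha]   # K occupied tables, then a new one
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)         # open table K + 1
        else:
            tables[k] += 1
        labels.append(k)
    return labels, tables

labels, sizes = crp_partition(100, alpha=2.0)
```

With a larger *α*, customers open new tables more often, so the expected number of clusters grows with *α* (and only logarithmically with the number of customers).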

Please note that, although the customers are described as entering the restaurant one by one in order, any permutation of their ordering has the same joint probability distribution as the original sequence. This is the exchangeability property of the CRP mixture model. In practice, exchangeability is not a reasonable assumption for many realistic applications. For image over-segmentation, it is necessary to identify spatially contiguous segments in which the same label is allocated to neighboring pixels with homogeneous spectra. Therefore, the traditional CRP model is not suitable for image over-segmentation. To introduce dependency among random variables, Blei et al. proposed the distance dependent Chinese restaurant process (ddCRP) in [37]. The partition formed by assigning customers to tables in the CRP is replaced with relationships among customers in the ddCRP. Under a similar metaphor, the ddCRP model states that each customer selects another customer, who has already entered the restaurant, to sit with. Given the relationships among customers, it is easy to infer which customers will sit at the same table; the tables are thus a byproduct of the customer assignments. Let *f* denote the decay function, *d<sub>ij</sub>* the distance measure between two observations *i* and *j*, *α* the scaling parameter, and *c<sub>i</sub>* the customer assignment of the *i*th customer, i.e., the *i*th customer chooses to sit with customer *c<sub>i</sub>*. The random process allocates customer assignments according to

$$p(c_i = j) \propto \begin{cases} f(d_{ij}), & i \neq j \\ \alpha, & i = j \end{cases} \tag{2}$$

where the distance measure *d<sub>ij</sub>* evaluates the difference between the two observations *i* and *j*, and the decay function *f*(*x*) mediates how the distance affects the probability of connecting the two customers. The farther apart two data points are, the smaller their probability of connecting with each other. The traditional CRP is a particular case of the ddCRP in which the decay function is assumed to be constant, i.e., *f*(*d<sub>ij</sub>*) = 1.
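Sampling from the ddCRP prior of Equation (2) can be sketched as follows; the function name and the exponential decay default are our illustrative choices, not the paper's code:

```python
import math
import random

def ddcrp_links(dist, alpha, f=lambda d: math.exp(-d), seed=0):
    """Sample customer assignments from the ddCRP prior (Equation (2)):
    customer i links to customer j != i with probability proportional
    to f(d_ij), or to itself with probability proportional to alpha."""
    rng = random.Random(seed)
    n = len(dist)
    links = []
    for i in range(n):
        weights = [alpha if j == i else f(dist[i][j]) for j in range(n)]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                links.append(j)
                break
    return links

# pairwise distances for four points on a line: d_ij = |i - j|
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
links = ddcrp_links(dist, alpha=1.0)
```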

To bridge the gap in terminology between feature clustering and image over-segmentation, Figure 1 constructs the analogy between the metaphor of the ddCRP and image over-segmentation. An image with 16 pixels is shown in Figure 1a, where each numbered square denotes a pixel. Under the metaphor of the ddCRP, the image is a restaurant that accommodates 16 customers. Figure 1b shows an intermediate state during the random process of customer assignments, where each arrow indicates a customer choosing to sit with another customer. All customers connected by these assignments naturally form a cluster; in other words, they sit at the same table in the ddCRP. This is illustrated in Figure 1c, where pixels of the same color belong to one cluster. In the terminology of image over-segmentation, a segment likewise consists of all pixels of the same color, and every two pixels within a segment can reach each other along the inferred customer assignments. Therefore, the inferred customer assignments are the key to deriving the results of image over-segmentation.

**Figure 1.** Illustration of the distance dependent Chinese restaurant process (ddCRP) for image over-segmentation; (**a**) an image (i.e., a restaurant) with 16 pixels (i.e., customers), where each numbered square denotes a pixel; (**b**) each arrow indicates a customer choosing to sit with another customer; and (**c**) a segment consists of pixels of the same color, and every two pixels within a segment can reach each other along the inferred customer assignments.
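Deriving segments from inferred customer assignments amounts to finding connected components of the link graph; a minimal union-find sketch (the helper name is ours):

```python
def segments_from_links(links):
    """Derive segment labels from ddCRP customer assignments: pixel i
    and its chosen pixel links[i] sit at the same table, so segments
    are the connected components of the link graph (union-find)."""
    parent = list(range(len(links)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for i, j in enumerate(links):
        parent[find(i)] = find(j)           # union i with its assignment

    roots = [find(i) for i in range(len(links))]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

labels = segments_from_links([1, 1, 3, 2])  # two segments: {0, 1} and {2, 3}
```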

Given an image with *N* pixels **X** = {*x*<sub>1</sub>, . . . , *x<sub>N</sub>*}, image over-segmentation partitions the pixels into multiple regions with labels **Z** = {*z*<sub>1</sub>, . . . , *z<sub>N</sub>*}. In the ddCRP, the label **Z** is a by-product of the customer assignment **C**, whose posterior is given by $p(\mathbf{C}|\mathbf{X}) = \frac{p(\mathbf{C})\,p(\mathbf{X}|\mathbf{C})}{\sum_{\mathbf{c}_{1:N}} p(\mathbf{C})\,p(\mathbf{X}|\mathbf{C})}$. A customer assignment, e.g., *c<sub>i</sub>*, is assumed to depend on the distance between two pixels, e.g., *d<sub>ij</sub>*, and to be independent of the other customer assignments. Therefore, the posterior of the customer assignments is proportional to

$$p(\mathbf{C}|\mathbf{X}) \propto \left[\prod_{i=1}^{N} p(c_i \mid \alpha, d, f)\right] p(\mathbf{X} \mid \mathbf{C}, G), \tag{3}$$

where the likelihood of the image pixels is given by *p*(**X**|**C**, *G*). It is difficult to compute the posterior because the ddCRP places a prior over a huge number of customer assignments. A simple Gibbs sampling method can be used to approximate the posterior by inferring the conditional probability

$$p(c_i \mid \mathbf{C}_{\neg i}, \mathbf{x}_{1:N}, \alpha, d, f, G) \propto \begin{cases} f(d_{ij}), & \text{situation 1} \\[6pt] f(d_{ij})\, \dfrac{p\left(\mathbf{x}_{(z(\mathbf{C}_{\neg i})=k)\,\cup\,(z(\mathbf{C}_{\neg i})=l)} \mid G\right)}{p\left(\mathbf{x}_{z(\mathbf{C}_{\neg i})=k} \mid G\right)\, p\left(\mathbf{x}_{z(\mathbf{C}_{\neg i})=l} \mid G\right)}, & \text{situation 2} \\[6pt] \alpha, & \text{situation 3} \end{cases} \tag{4}$$

where **C**<sub>¬*i*</sub> represents the customer assignments of all pixels except the *i*th pixel, and **x**<sub>(*z*(**C**<sub>¬*i*</sub>)=*k*)</sub> denotes the pixels associated with the *k*th segment derived from **C**<sub>¬*i*</sub>; *G* is a prior on the parameter *θ*, which is used to generate observations with probability *p*(*x*|*θ*). In this paper, the prior *G* and the distribution *p*(*x*|*θ*) are assumed to be a Dirichlet process [36] and a multinomial distribution, respectively. Since they are conjugate, the parameter *θ* can be marginalized out. As a family of stochastic processes, the Dirichlet process can be seen as the infinite-dimensional generalization of the Dirichlet distribution.

As shown in Equation (4), there are three possible situations when generating a new assignment *c<sub>i</sub>*. In situation 1, the assignment *c<sub>i</sub>* points to another pixel but does not change the over-segmentation result. Situation 2 means that two different segments are merged into a new segment as a result of the customer assignment *c<sub>i</sub>*. In situation 3, the assignment *c<sub>i</sub>* points to itself with a probability proportional to *α*.

In the following, we analyze the characteristics of the three components of the ddCRP, i.e., the distance dependent term *f*(*d<sub>ij</sub>*), the likelihood of the image *p*(*x*|*G*), and the connection mode among the pixels.

#### *2.2. Distance Dependent Term*

For image over-segmentation, the dependence between pixels can be naturally embedded into the ddCRP through a spatial distance measure and a decay function, for example the Euclidean distance between pixels, $d_{spatial} = \sqrt{(row_i - row_j)^2 + (col_i - col_j)^2}$, where (*row*, *col*) is a pixel's location in the image. The decay function mediates the effect of the distance between pixels on the sampling probability. It should satisfy the following nonrestrictive assumptions: it should (1) be nonincreasing, (2) take nonnegative finite values, and (3) satisfy *f*(∞) = 0. There are several types of decay functions [37], such as the window decay function, the exponential decay function, and the logistic decay function. If the distance satisfies the preference, the decay function returns a larger value; otherwise, it returns a smaller value.
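The three decay functions mentioned above can be written down directly; the forms below follow the definitions in [37], with `a` as the decay parameter:

```python
import math

# Each decay function is nonincreasing, takes nonnegative finite
# values, and satisfies f(inf) = 0.
def window_decay(d, a=1.0):
    """Window decay: 1 if d < a, else 0."""
    return 1.0 if d < a else 0.0

def exponential_decay(d, a=1.0):
    """Exponential decay: exp(-d / a)."""
    return math.exp(-d / a)

def logistic_decay(d, a=1.0):
    """Logistic decay: exp(-d + a) / (1 + exp(-d + a))."""
    return math.exp(-d + a) / (1.0 + math.exp(-d + a))
```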

Since over-segments consist of pixels that are spatially adjacent and spectrally homogeneous, only neighboring pixels are considered in this paper, i.e., pixels *i* and *j* with spatial distance *d<sub>spatial</sub>* ≤ 1. As shown in Figure 1, the possible assignments of the 11th pixel are the 7th, 10th, 12th, and 15th pixels. This setting introduces information on neighboring spatial locations. To also take the spectral information of neighboring pixels into account, the spectral difference between neighboring pixels is introduced into the ddCRP as *d<sub>spectral</sub>* = |*x<sub>i</sub>* − *x<sub>j</sub>*|, where *x<sub>i</sub>* is the DN value of the *i*th pixel. The spectral difference and the spatial distance are then combined into the final distance measure, *d<sub>ij</sub>* = *d<sub>spatial</sub>* + *d<sub>spectral</sub>*. The ddCRP model with this spatial and spectral distance is abbreviated as the spatial-spectral ddCRP model. In our previous work [42], empirical experiments showed that the spatial-spectral ddCRP model exhibits promising performance.
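A minimal sketch of the combined distance and the neighborhood restriction described above, assuming a 2-D array of panchromatic DN values (function names are ours):

```python
def combined_distance(img, i, j):
    """Spatial-spectral distance d_ij = d_spatial + d_spectral between
    pixels i and j, each given as (row, col); img is a 2-D list of DN
    values."""
    (ri, ci), (rj, cj) = i, j
    d_spatial = ((ri - rj) ** 2 + (ci - cj) ** 2) ** 0.5
    d_spectral = abs(img[ri][ci] - img[rj][cj])
    return d_spatial + d_spectral

def neighbors4(r, c, h, w):
    """4-connected neighbors (d_spatial <= 1), the only candidate
    assignments considered in the paper."""
    return [(r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < h and 0 <= c + dc < w]
```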

#### *2.3. Likelihood Term*

Let the prior and the probability distribution be the Dirichlet distribution and the multinomial distribution, respectively. Given customer assignments **c**, the likelihood of the *k*th segment is

$$p\left(\mathbf{x}_{z(\mathbf{c})=k} \mid G\right) = \frac{\Gamma\left(\sum_{w} \beta_w\right)}{\prod_{w} \Gamma(\beta_w)} \cdot \frac{\prod_{w} \Gamma\left(\beta_w + n\left(\mathbf{x}_{z(\mathbf{c})=k} = w\right)\right)}{\Gamma\left(\sum_{w} \beta_w + n\left(\mathbf{x}_{z(\mathbf{c})=k}\right)\right)}, \tag{5}$$

where *β* denotes the parameter of the base distribution *H*, initialized using a uniform distribution, and Γ(·) represents the gamma function. Each visual word is denoted by *w*; in this paper, a visual word is a digital number (DN) value of the panchromatic image. *n*(**x**<sub>*z*(**c**)=*k*</sub>) denotes the number of pixels belonging to the *k*th segment. As shown in Equation (4), the *k*th and *l*th over-segments can be merged into a new one with a probability proportional to the product of the distance dependent term and the likelihood ratio $\frac{p(\mathbf{x}_{(z(\mathbf{c}_{\neg i})=k)\cup(z(\mathbf{c}_{\neg i})=l)} \mid G)}{p(\mathbf{x}_{z(\mathbf{c}_{\neg i})=k} \mid G)\, p(\mathbf{x}_{z(\mathbf{c}_{\neg i})=l} \mid G)}$. For brevity, this likelihood ratio is called the merging probability of two over-segments in the following. To reveal the characteristics of the merging probability, we performed five simulated experiments, in which a pair of Gaussian distributions was used to simulate the DN values of the pixels within the *k*th and *l*th over-segments. Table 1 lists the parameters of the Gaussian distributions, i.e., mean *μ* and variance *σ*, used in the five experiments. For each experiment, two subfigures are drawn on one row of Figure 2: the left subfigure shows the pair of Gaussian distributions, and the right one shows the distribution of the merging probability over the numbers of pixels within the paired over-segments. In the right subfigure, the *z*-axis is the merging probability, and the *x*- and *y*-axes are the numbers of simulated pixels within the two over-segments, respectively.


**Table 1.** Parameters of paired Gaussian distributions used in the five experiments.
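The merging probability can be reproduced numerically from the Dirichlet-multinomial marginal of Equation (5); working in log space avoids overflow of the gamma function. The helper names and the symmetric prior over 256 gray levels are our assumptions:

```python
from collections import Counter
from math import lgamma

def log_marginal(dn_values, beta=1.0, n_levels=256):
    """Log Dirichlet-multinomial marginal (Equation (5)) of one
    segment's DN values under a symmetric Dirichlet(beta) prior over
    n_levels gray values; words with zero count contribute nothing."""
    counts = Counter(dn_values)
    n = len(dn_values)
    out = lgamma(n_levels * beta) - lgamma(n_levels * beta + n)
    for c in counts.values():
        out += lgamma(beta + c) - lgamma(beta)
    return out

def log_merge_ratio(seg_k, seg_l, beta=1.0, n_levels=256):
    """Log merging probability: likelihood of the merged segment over
    the product of the two separate likelihoods (situation 2 in
    Equation (4))."""
    return (log_marginal(seg_k + seg_l, beta, n_levels)
            - log_marginal(seg_k, beta, n_levels)
            - log_marginal(seg_l, beta, n_levels))
```

In this sketch, two segments with identical DN values give a positive log ratio (merging favored), while clearly different DN values give a negative one, matching the qualitative pattern described for Figure 2.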

It can be seen from Figure 2 that the distributions of the merging probability exhibit a similar pattern in the first four experiments, i.e., (a) to (d). The merging probability is always less than 1. That is to say, the two over-segments tend to remain separate, since they are dissimilar given the rather large difference between the two mean DN values. Therefore, during the inference process of the ddCRP, such a customer assignment is allocated with a lower probability; in other words, any pair of pixels from the two over-segments would connect with each other with a lower probability.

Furthermore, for a given number of pixels within one segment, the merging probability decreases as the number of pixels within the other segment increases. The underlying reason is that the dissimilarity between the two over-segments is confirmed further as more observations accumulate. In other words, a reliable state of customer assignments can be expected only when the number of pixels within each segment reaches a certain level. As shown in Figure 2a–d, the merging probability falls below 0.1 and becomes stable when the number of pixels exceeds about eight. This observation motivated us to use a set of pixels, instead of an individual pixel, as the descriptor of each pixel in the improved model in Section 3.

As shown in Figure 2d, clear local fluctuations occur in the distribution of the merging probability as the two segments become more similar in terms of mean DN values. It can be seen from Figure 2e that the merging probability in experiment (e) looks very different from that in the other four experiments: it is significantly larger than 1 in most cases. This shows that two over-segments tend to be merged when they are similar in terms of mean DN values, which is confirmed further as the number of pixels grows. In summary, Figure 2 shows that if two segments are generated from dissimilar distributions, the merging probability is less than 1, and otherwise larger than 1. This suggests that setting the parameter *α* to 1 is a good choice in the ddCRP.


**Figure 2.** The five experiments are shown in subfigures (**a**–**e**), respectively. For each experiment, the left subfigure shows the pair of Gaussian distributions used to generate the DN values of the two over-segments, and the right one shows the distribution of the merging probability over the numbers of pixels within the two over-segments.

#### *2.4. Connection Mode*

In the ddCRP, the customer assignment is always allocated from candidates that satisfy pre-specified constraints, which we call the connection mode in this paper. As shown in Figure 3a, the 11th pixel is assigned to one of its nearest neighbors (i.e., the 7th, 10th, 12th, or 15th pixel) or to itself according to the Gibbs sampling formula in Equation (4). This setting implies a competition among the candidates and decreases the connecting probability of each candidate to some extent. In fact, it may be unnecessary to restrict each pixel to a single connection. As shown in Figure 3b,c, each pixel may point to multiple pixels, i.e., have multiple arrows. To discriminate the different cases, the case of one arrow, where a pixel points to a single other pixel, is termed the "one-connection mode" (CM1); two arrows the "two-connection mode" (CM2); and four arrows the "four-connection mode" (CM4). Under the metaphor of the CRP, the number of arrows indicates how many customers one may select to sit with during the inference process.

**Figure 3.** Three kinds of connection modes, (**a**) the one-connection mode (CM1), (**b**) the two-connection mode (CM2), and (**c**) the four-connection mode (CM4).

Figure 4 shows the results of over-segmentation of three kinds of geographic scenes, i.e., Suburban (S), Farmland (F), and Urban (U) areas, using the three types of connection modes. The suburban area contains sparse buildings, roads, and fields. The farmland area consists of cultivated lands with similar shapes and slightly different spectra. In contrast, the urban area displays large spectral variation, contains many buildings and roads, and has a complex structure. The images are panchromatic TIANHUI satellite images with a size of 200 × 200 pixels and a resolution of approximately 2 m.

Four metrics are employed to quantitatively evaluate the quality of image over-segmentations: the boundary recall (BR) [43], the achievable segmentation accuracy (ASA) [44], the under-segmentation error (UE) [45], and the degree of landscape fragmentation (LF) [42]. The BR measures the percentage of real object boundaries discovered by an over-segmentation method. Based on the overlap between segments and real object regions, the ASA and UE are the percentages of pixels inside and outside object regions, respectively. The last metric measures the degree of fragmentation of an over-segmentation result.
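As an illustration of the first metric, a minimal boundary recall computation might look as follows; this is our simplified reading, in which a ground-truth boundary pixel counts as recalled if a segmentation boundary pixel lies within a small tolerance:

```python
def boundary_pixels(labels):
    """Pixels whose right or lower neighbor carries a different label."""
    h, w = len(labels), len(labels[0])
    b = set()
    for r in range(h):
        for c in range(w):
            if c + 1 < w and labels[r][c] != labels[r][c + 1]:
                b.add((r, c))
            if r + 1 < h and labels[r][c] != labels[r + 1][c]:
                b.add((r, c))
    return b

def boundary_recall(gt, seg, tol=1):
    """Fraction of ground-truth boundary pixels that have a
    segmentation boundary pixel within `tol` (Chebyshev distance)."""
    gt_b, seg_b = boundary_pixels(gt), boundary_pixels(seg)
    if not gt_b:
        return 1.0
    hit = sum(1 for (r, c) in gt_b
              if any(abs(r - rs) <= tol and abs(c - cs) <= tol
                     for (rs, cs) in seg_b))
    return hit / len(gt_b)
```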

It can be seen from Table 2 that CM1 is the best connection mode in terms of both BR and ASA for all three scenes. In contrast, CM4 is the best in terms of LF for all three scenes. CM2 lies in the middle for all of the experiments, with the exception of the UE of the urban area. Please note that both BR and ASA can increase with the number of segments; this is also reflected in the measurement of LF. The LF under CM1 is the highest among the three connection modes for all three scenes. It can also be seen from Figure 4 that, under CM1, there exist many segments of very small size, even many isolated points. In contrast, the segments are rather large under CM4, and clearly under-segmented results appear. For example, many long and narrow objects, e.g., roads, are often incorrectly merged into larger regions under CM4. To make full use of the characteristics of the different connection modes, both CM1 and CM2 are utilized in our proposed method in Section 3.

**Table 2.** Quantitative evaluation of over-segmentations using the ddCRP for three types of geo-scenes, i.e., Suburban (S), Farmland (F), and Urban (U) areas, under three kinds of connection modes, i.e., the one-connection mode (CM1), the two-connection mode (CM2), and the four-connection mode (CM4). For each evaluation metric, the best performance among the connection modes is marked in red or blue bold.


**Figure 4.** Over-segmentation of Very High Resolution (VHR) satellite images using the ddCRP with different connection modes: (**a**) VHR satellite images of three kinds of scenes, i.e., Suburb (S), Farmland (F), and Urban (U) areas, and corresponding over-segmentations using the ddCRP with (**b**) one-connection mode (CM1), (**c**) the two-connection mode (CM2), and (**d**) the four-connection mode (CM4), respectively.

#### **3. Edge Dependent Chinese Restaurant Process**

Based on the analysis of the ddCRP in Section 2, we present in this section an improved method for the over-segmentation of VHR satellite images, called the Edge Dependent Chinese Restaurant Process (EDCRP). As shown in Figure 5, the EDCRP consists of three parts. The first part is edge detection in the VHR satellite image. The second part is the selection of the feature descriptor and the connection mode based on the detected edges. The third part is image over-segmentation using the spatial-spectral ddCRP.

**Figure 5.** The flowchart of Edge Dependent Chinese Restaurant Process (EDCRP) for image over-segmentation.

#### *3.1. Edge Detection*

There are many methods to extract edge information from images, e.g., the Sobel, Roberts, Prewitt, Laplacian, and Canny operators [46]. Any of these methods could be used in the proposed method to improve the structural integrity of the image over-segmentation. In this paper, the Canny operator was adopted, since it performs well by simultaneously considering both strong and weak edges.

Although the distance between neighboring pixels in the ddCRP can reflect the strength of a potential edge to some degree, there are often many locations where the distance is rather large even though no meaningful edge exists. The detected edges can therefore provide important prior knowledge of image structures that helps the ddCRP work better. For a pixel on an edge, the one-connection mode lets the model choose the neighboring pixel at the shortest distance with the highest probability (assuming equal likelihood values); consequently, multiple over-segments occur near the pixel with higher probability. Meanwhile, if the one-connection mode is used everywhere, there would be many segments of very small size, even many isolated points, as shown in Figure 4. Therefore, the two-connection mode coupled with a gray histogram descriptor is designed to remove these tiny over-segments, in particular isolated points.
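As a stand-in for the edge-detection step, the sketch below computes a thresholded Sobel gradient magnitude in pure Python; a real pipeline would instead use a full Canny implementation (e.g., OpenCV's `cv2.Canny`), which adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding. The threshold value here is an arbitrary illustrative choice:

```python
def sobel_edges(img, thresh=100.0):
    """Boolean edge mask from the Sobel gradient magnitude; img is a
    2-D list of DN values, and border pixels are left unmarked."""
    h, w = len(img), len(img[0])
    edges = [[False] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = (img[r-1][c+1] + 2*img[r][c+1] + img[r+1][c+1]
                  - img[r-1][c-1] - 2*img[r][c-1] - img[r+1][c-1])
            gy = (img[r+1][c-1] + 2*img[r+1][c] + img[r+1][c+1]
                  - img[r-1][c-1] - 2*img[r-1][c] - img[r-1][c+1])
            edges[r][c] = (gx * gx + gy * gy) ** 0.5 >= thresh
    return edges
```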

#### *3.2. Feature Descriptor and Connection Mode*

For each pixel, there are two options for the feature descriptor and connection mode, depending on whether or not the pixel is on an edge. For a pixel on an edge, its DN value is used as the feature descriptor, and the one-connection mode is used to decide which pixels can be tied together. A pixel that is not on any edge is described by a gray histogram of its neighboring pixels, and the two-connection mode is used. As shown in Figure 3, the gray histogram descriptor of the 6th pixel is constructed from the nine values of the 1st, 2nd, 3rd, 5th, 6th, 7th, 9th, 10th, and 11th pixels. The neighboring gray histogram descriptor is designed to remove isolated points from the over-segmentation results. As discussed in Section 2.3, the merging probability approaches the scaling parameter *α* when a segment has only a small number of pixels, and, as shown in Figure 2a–d, it falls below 0.1 and becomes stable when the number of pixels exceeds about eight. This observation motivated us to use nine pixels, instead of an individual pixel, as the descriptor in the proposed model.
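Our reading of this selection rule can be sketched as follows; the function name and return convention are illustrative, not the paper's code:

```python
from collections import Counter

def pixel_descriptor(img, r, c, edge_mask):
    """Per-pixel descriptor selection in the EDCRP (Section 3.2, our
    reading): an on-edge pixel keeps its raw DN value and uses the
    one-connection mode; an off-edge pixel uses the gray histogram of
    its 3x3 neighborhood and the two-connection mode."""
    if edge_mask[r][c]:
        return img[r][c], "CM1"
    h, w = len(img), len(img[0])
    hist = Counter(img[rr][cc]
                   for rr in range(max(0, r - 1), min(h, r + 2))
                   for cc in range(max(0, c - 1), min(w, c + 2)))
    return hist, "CM2"
```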

#### *3.3. Spatial-Spectral ddCRP*

As discussed in [42], a distance constructed from both spatial and spectral features is better than a purely spatial distance. Therefore, both spatial and spectral distances are used in the proposed method, which, for the sake of clarity, is called the spatial-spectral ddCRP.

#### **4. Experimental Results**

In this section, we first describe the experimental data. Then, we analyze the effect of the feature descriptor and the connection mode on image over-segmentations. Furthermore, the EDCRP is compared with state-of-the-art methods in terms of quantitative evaluation metrics. Finally, we discuss the efficiency and possible extensions of the EDCRP.

#### *4.1. Experimental Data*

As shown in Figure 6, two panchromatic images from different sensors are used in our experiments. The top-left image is a panchromatic TIANHUI image of the Mi Yun district of Beijing, China. It was acquired on 25 July 2013, its resolution is about 2 m, and its size is 800 × 800 pixels. As shown in Figure 6a–d, four subareas of the TIANHUI image, coupled with the real boundaries of the geo-objects of interest, are employed to illustrate the qualitative effect of the feature descriptor and the connection mode on over-segmentations in Section 4.2. The bottom-right image in Figure 6 is a panchromatic QUICKBIRD image acquired on 22 April 2006, covering an area in the Tong Zhou district of Beijing, China. Its size is 900 × 900 pixels, with a resolution of 0.60 m.

#### *4.2. Effect of Feature Descriptor and Connection Mode over Over-Segmentations*

As shown in Figure 5, the EDCRP can be regarded as a mixture of two kinds of feature descriptors and connection modes built on the spatial-spectral ddCRP model. To validate the necessity of this mixture, we compared the results of the EDCRP with those of the spatial-spectral ddCRP model under two combinations of feature descriptor and connection mode. In the first combination, the DN value of each pixel is used as its feature and the one-connection mode is used. In the second combination, the feature descriptor of each pixel is the gray histogram of its neighboring pixels and the two-connection mode is used. To simplify notation, the two combinations are denoted by ddCRP\_1 and ddCRP\_2, respectively.

Only the TIANHUI image is used for the analysis in this section. Specifically, we first analyze the over-segmentations of the two combinations through qualitative visual inspection and compare them with that of the proposed method. Finally, four quantitative metrics are employed to measure the quality of these over-segmentations.

**Figure 6.** Two VHR satellite images used in our experiments, where subfigures (**a**–**d**) show subareas A, B, C, and D coupled with the boundaries of the geo-objects of interest, respectively.

#### 4.2.1. ddCRP\_1

Figure 7 shows the boundaries of the over-segments obtained with ddCRP\_1. Generally speaking, there are many tiny segments, and even isolated points. As shown in the two subfigures, individual geo-objects, e.g., water and buildings, are over-segmented into many segments. In addition, most of the over-segments wriggle along the real boundaries of the geo-objects, and the structure of the over-segments is poor. For example, two edges run along the top-right boundary of the building in subarea D of Figure 7.

#### 4.2.2. ddCRP\_2

Figure 8 shows the boundaries of the over-segmentation obtained with ddCRP\_2. It can be seen that ddCRP\_2 significantly outperforms ddCRP\_1 in terms of the structure of the over-segment boundaries. On the one hand, from the viewpoint of overall visual inspection, the boundaries of the over-segments shown in Figure 8a–d correspond well to the boundaries of the real geo-objects. For example, as shown in Figure 8c,d, respectively, the boundaries of both the water and the building closely match the real ones in terms of shape. On the other hand, the over-segments of ddCRP\_2 have been pushed in two different directions relative to those of ddCRP\_1: over homogeneous regions the over-segments become larger, whereas heterogeneous regions are partitioned into more segments of relatively small size. Both situations can be seen simultaneously in Figure 8d: the top-right part of the building has been merged into a rather large segment, while its bottom-left part has been segmented into multiple small segments with regular boundaries.

**Figure 7.** The boundaries of over-segments obtained by using the ddCRP\_1 from the TIANHUI image. Both subareas C and D coupled with over-segments are shown in the two right-side subfigures.


**Figure 8.** The boundaries of over-segments obtained by using the ddCRP\_2 from the experimental image. The subfigures (**a**–**d**) are zoomed results from subareas A, B, C, and D, respectively.

Regarding the weaknesses of ddCRP\_2, on the one hand, there are too many tiny segments over heterogeneous regions. On the other hand, some distinct geo-objects have been merged into a single large segment. For example, it can be seen from Figure 8a that both the water and part of the land have been merged into one segment. In Figure 8b, the road has likewise been merged with the vegetation along it into one segment. The objects in Figure 6a,b are expected to be separated by an over-segmentation algorithm; however, ddCRP\_2 did not achieve this.

#### 4.2.3. EDCRP

The result of the EDCRP is shown in Figure 9. It is similar to that of ddCRP\_2 in terms of the "shape" of the over-segments. We argue that this is reasonable, since the EDCRP can be regarded as a mixture of ddCRP\_1 and ddCRP\_2, chosen depending on whether pixels lie on edges or not. Generally speaking, the number of pixels on edges is clearly smaller than the number of pixels off edges. Therefore, the EDCRP inherits the strength of ddCRP\_2, i.e., the well-structured boundaries of the over-segments. Fortunately, the EDCRP also rules out the weakness of ddCRP\_2 by using ddCRP\_1 over pixels on detected edges. In other words, the EDCRP keeps two regions separated wherever the points between them have been identified as edge points. For example, the distinct geo-objects in Figure 9a,b have been well separated into different segments, unlike those in Figure 8a,b, which were merged into large segments.

**Figure 9.** The boundaries of over-segments obtained by using the EDCRP from the experimental image. The subfigures (**a**–**d**) are zoomed results from subareas A, B, C, and D, respectively.

#### 4.2.4. Quantitative Evaluation

Four metrics are adopted to quantitatively evaluate the quality of the over-segmentations: BR, ASA, UE, and LF. As shown in Table 3, the EDCRP obtains the highest boundary recall (BR) and achievable segmentation accuracy (ASA). Meanwhile, the EDCRP achieves the lowest under-segmentation error (UE). As for landscape fragmentation (LF), ddCRP\_2 is the lowest, except for the category building. This result is explained by the observation in Figure 8 that ddCRP\_2 produces a large number of tiny segments. For a given image, both BR and ASA usually increase with the number of over-segments; in other words, both BR and ASA increase as LF decreases. However, the EDCRP attains the highest BR and ASA even though its LF is higher than that of ddCRP\_2. In short, the EDCRP outperforms both ddCRP\_1 and ddCRP\_2 in terms of all four quantitative metrics.
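For reference, BR, ASA, and UE can be computed from a segmentation and a ground-truth label map as sketched below. These are common formulations from the superpixel literature (here UE is taken as 1 − ASA, one of several definitions in use) and may differ in detail from the exact formulas used in the paper; the function names and the 2-pixel boundary tolerance are assumptions for this example.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundaries(labels):
    """Boolean mask of pixels whose right or lower neighbor differs in label."""
    b = np.zeros(labels.shape, dtype=bool)
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return b

def asa_ue(seg, gt):
    """Achievable segmentation accuracy and (one formulation of)
    under-segmentation error, given integer label maps seg and gt."""
    overlap = 0
    for s in np.unique(seg):
        # best achievable assignment: map each over-segment to the
        # ground-truth segment it overlaps most
        counts = np.bincount(gt[seg == s].ravel())
        overlap += counts.max()
    asa = overlap / seg.size
    return asa, 1.0 - asa

def boundary_recall(seg, gt, tol=2):
    """Fraction of ground-truth boundary pixels that have a segmentation
    boundary pixel within `tol` pixels."""
    gt_b, seg_b = boundaries(gt), boundaries(seg)
    seg_b_dil = binary_dilation(seg_b, iterations=tol)
    return (gt_b & seg_b_dil).sum() / max(gt_b.sum(), 1)
```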

**Table 3.** The quantitative evaluation of over-segmentation.


#### *4.3. Comparative Experiments*

In this subsection, the EDCRP is compared with three state-of-the-art methods, i.e., Turbo Pixels [47], SLIC [22], and ERS [23], using the two panchromatic images of different resolutions. The methods are compared from the viewpoints of both direct and indirect performance evaluation [48,49]. On the one hand, four metrics are used to directly evaluate the quality of the over-segmentations: BR, ASA, UE, and LF. On the other hand, the quality of the over-segmentations is indirectly evaluated in terms of image classification accuracy.

Both SLIC and ERS achieve image over-segmentation by clustering, while Turbo Pixels [47] uses a geometric-flow-based algorithm. For a fair comparison, the number of clusters for the other methods is set to the number of over-segments inferred by the EDCRP. As shown in Figure 10, both SLIC and Turbo Pixels generated over-segments of similar size and more or less convex shape; their boundaries do not reflect image structure. The over-segments produced by ERS carry some structural information, but the result is overly complex in flat areas, especially in the water area.

**Figure 10.** The over-segmentation results over both TIANHUI (first row) and QUICKBIRD (second row) images using four over-segmentation methods, i.e., (**a**) Turbo pixel, (**b**) Simple Linear Iterative Clustering (SLIC), (**c**) Entropy Rate Superpixel segmentation (ERS), and (**d**) Edge Dependent Chinese Restaurant Process (EDCRP).

#### 4.3.1. Metrics Based Direct Evaluation

Table 4 lists the values of the four metrics for the over-segmentation of the TIANHUI image by the three baseline approaches. ERS achieved the highest boundary recall (BR) and achievable segmentation accuracy (ASA) among the three methods for all four geo-object categories. As for the under-segmentation error (UE), ERS is the lowest except for the category water, and it also obtained a relatively low LF. Based on this quantitative evaluation, ERS outperforms both Turbo Pixels and SLIC.

However, compared with the EDCRP, ERS achieved lower BR and ASA values and rather higher UE and LF values. In other words, the EDCRP outperforms all of the state-of-the-art over-segmentation methods in terms of the four quantitative metrics. Under visual inspection, the over-segmentation produced by ERS is similar to that of the EDCRP in terms of over-segment structure; the significant difference is that ERS still produces rather small segments, even over homogeneous regions such as water.

**Table 4.** The quantitative evaluation of over-segmentations of the TIANHUI image using Turbo Pixels (TP), SLIC, and ERS.


Table 5 lists the values of the four metrics for the over-segmentation of the QUICKBIRD image by the four approaches. The EDCRP exhibits the best performance in terms of three metrics: BR, UE, and LF. As for the achievable segmentation accuracy, the EDCRP is lower only than ERS, and higher than both Turbo Pixels and SLIC.


**Table 5.** The quantitative evaluation of different over-segmentations of QUICKBIRD image.

#### 4.3.2. Classification Based Indirect Evaluation

Image over-segments are often used as objects for classification. In our experiments, random forests with 100 trees were trained as classifiers using a randomly selected 50% of the over-segments. Table 6 lists both the Overall Accuracy (OA) and the kappa of the classifiers. The EDCRP outperforms the other three over-segmentation approaches in terms of classification accuracy.
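The indirect evaluation protocol can be sketched with scikit-learn as follows. This is an illustrative sketch, not the authors' code: the per-segment feature vectors (e.g., mean DN per segment) and the function name are assumptions, while the 100-tree forest, the random 50% train split, and the OA/kappa scores follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

def evaluate_segments(features, labels, seed=0):
    """Train a 100-tree random forest on a random 50% of the over-segments
    and report Overall Accuracy and Cohen's kappa on the remaining 50%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, train_size=0.5, random_state=seed, stratify=labels)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    return accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred)
```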

**Table 6.** The Overall Accuracy (OA) and kappa of the classifier over both the TIANHUI and QUICKBIRD images.


Figure 11 shows the classification maps of the TIANHUI image. Rather explicit fingerprints of the over-segments can be seen in the maps. For example, for the building in subarea D, there are misclassified convex blobs in Figure 11d,e, and in subarea C there is an explicitly wriggled water boundary in Figure 11f. In contrast, the classification map of the EDCRP exhibits very good structural integrity for almost all geo-objects. Specifically, as shown in subarea B of Figure 11c, the wriggled *path* has been well classified.

**Figure 11.** Classification maps of the TIANHUI image based on the over-segments; (**a**) is the TIANHUI image; (**b**) is the ground-truth of the image; (**c**–**f**) are the classification maps based on the over-segments produced by the EDCRP, Turbo Pixel, SLIC, and ERS, respectively. The legend shows the class labels in both the ground-truth and classification maps.

#### *4.4. Discussions*

In this section, we discuss the efficiency and possible extensions of the EDCRP.

#### 4.4.1. Efficiency

Since the number of clusters is itself a random variable to be inferred in a nonparametric Bayesian clustering model, the clustering process needs more time to reach a convergent state than a parametric clustering method. As shown in Equation (4), the posterior of the customer assignments is approximated using Gibbs sampling, and a reliable sample is generated only once the sampling process reaches a stable state. It can be seen from Table 7 that the efficiency of the proposed method is clearly lower than that of the state-of-the-art over-segmentation methods. How to significantly enhance the efficiency of the EDCRP therefore deserves future investigation.

**Table 7.** The running time of different approaches for image over-segmentation.


#### 4.4.2. Possible Extensions

Although the present model has only been applied to VHR panchromatic images in this paper, it is straightforward to extend the EDCRP to the over-segmentation of multi-spectral images [50–52]. It can be seen from Equation (4) that the posterior of the customer assignments can be approximated as long as both the likelihood and the distance-dependent terms are well defined. Given a multi-spectral image, the likelihood in Equation (5) can be redefined by replacing the multinomial distribution over discrete DN values with a Gaussian distribution over multi-spectral pixel vectors.
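The contrast between the two likelihood choices can be sketched as follows. This is only an illustration of the idea, not the paper's Equation (5): the Dirichlet-smoothed multinomial, the plug-in Gaussian with per-band variance, and both function names are assumptions made for the example.

```python
import numpy as np
from scipy.special import gammaln

def multinomial_loglik(dn_values, n_levels=256, alpha=1.0):
    """Marginal log-likelihood of discrete panchromatic DN values under a
    multinomial with a symmetric Dirichlet(alpha) prior."""
    counts = np.bincount(np.asarray(dn_values), minlength=n_levels)
    n = counts.sum()
    return (gammaln(n_levels * alpha) - gammaln(n_levels * alpha + n)
            + gammaln(counts + alpha).sum() - n_levels * gammaln(alpha))

def gaussian_loglik(pixels):
    """Plug-in Gaussian log-likelihood of multi-spectral pixel vectors,
    using the segment's own mean and per-band variance."""
    pixels = np.asarray(pixels, dtype=float)
    mu = pixels.mean(axis=0)
    var = pixels.var(axis=0) + 1e-6  # guard against zero variance
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (pixels - mu) ** 2 / var)
```

Under either likelihood, a spectrally homogeneous segment scores higher than a heterogeneous one, which is what drives the merging decisions.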

#### **5. Conclusions**

In this paper, we systematically analyzed the characteristics of the components of the ddCRP for VHR panchromatic image over-segmentation, i.e., the distance-dependent term, the likelihood term, and the connection mode among neighboring pixels. Furthermore, we presented an improved nonparametric Bayesian method called the Edge Dependent Chinese Restaurant Process (EDCRP). Experimental results show that the presented method outperforms state-of-the-art over-segmentation methods both in terms of four evaluation metrics, i.e., under-segmentation error (UE), boundary recall (BR), achievable segmentation accuracy (ASA), and degree of landscape fragmentation (LF), and in terms of classification accuracy. In the future, we will investigate how to speed up the inference of customer assignments and extend the present method to over-segment VHR multi-spectral satellite images.

**Author Contributions:** Conceptualization, H.T.; Funding acquisition, H.T.; Investigation, H.T. and X.-J.Z.; Methodology, X.-J.Z.; Software, X.-J.Z. and W.H.; Supervision, H.T.

**Acknowledgments:** This work was supported by the National Key R&D Program of China (No. 2017YFB0504104) and the National Natural Science Foundation of China (No. 41571334).

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Robust Rule-Based Ensemble Framework Using Mean-Shift Segmentation for Hyperspectral Image Classification**

#### **Majid Shadman Roodposhti 1,\*, Arko Lucieer 1, Asim Anees 2,3 and Brett A. Bryan <sup>4</sup>**


Received: 10 August 2019; Accepted: 29 August 2019; Published: 1 September 2019

**Abstract:** This paper assesses the performance of DoTRules—a dictionary of trusted rules—as a supervised rule-based ensemble framework based on the mean-shift segmentation for hyperspectral image classification. The proposed ensemble framework consists of multiple rule sets with rules constructed based on different class frequencies and sequences of occurrences. Shannon entropy was derived for assessing the uncertainty of every rule and the subsequent filtering of unreliable rules. DoTRules is not only a transparent approach for image classification but also a tool to map rule uncertainty, where rule uncertainty assessment can be applied as an estimate of classification accuracy prior to image classification. In this research, the proposed image classification framework is implemented using three world reference hyperspectral image datasets. We found that the overall accuracy of classification using the proposed ensemble framework was superior to state-of-the-art ensemble algorithms, as well as two non-ensemble algorithms, at multiple training sample sizes. We believe DoTRules can be applied more generally to the classification of discrete data such as hyperspectral satellite imagery products.

**Keywords:** image classification; ensemble; mean-shift; entropy; uncertainty map

#### **1. Introduction**

Image classification is a vital tool for generating maps for environmental monitoring [1]. While for decades, multispectral imagery archives have been used to produce thematic maps, hyperspectral imagery is potentially a better option because of the higher spectral resolution. Hyperspectral images, which often contain more than 50 bands of continuous spectral information [2], can provide considerably more spatial and spectral information about the visible objects in their recorded field of view than multispectral imagery [3]. Because of the quality of information, hyperspectral images are widely used in applications such as precision agriculture [4], biotechnology [5], mineral exploration [6], and land-cover investigations [7]. These various types of applications have generated interest in hyperspectral image classification that has grown rapidly during the past two decades, with significant progress [8].

Up to now, many popular machine-learning algorithms have been applied in hyperspectral image classification. These include instance-based [9], regression [10], regularization [11], decision tree [12], probabilistic [13], reinforcement learning [14], dimensionality reduction [15], ensemble [16], Bayesian [17], maximum margin [18], evolutionary [19], clustering [9], association rule learning [20], artificial neural network [12,21,22] and deep learning [23] methods (see Figure 1). Regardless of the classification performance, many of these algorithms act as black-boxes, resulting in a poor recognition of the classification structure and robustness owing to the high-dimensionality of the data [24,25].

**Figure 1.** A visual illustration of different categories of machine learning methods used for image classification.

Recently, ensemble classification methods have received more attention from the machine learning community, resulting in their increased popularity in different applications such as hyperspectral image classification [26–28]. Nonetheless, as opposed to other black-box classification algorithms, rule-based ensembles have demonstrated the ability to inform the interpretation of classification schemes [29]. Rules are very general structures that offer an easily understandable and transparent way to find the most reliable class allocation [30]. The inferred logic of the model structure obtained by rule-based methods can be dissected, deciphered and applied out-of-the-box to new homogeneous classification problems. This is a major motivation, and it makes rule-based approaches more desirable compared with black-box approaches, even at the potential cost of a reduced classification accuracy [31]. This paper presents a simplified and novel rule-based ensemble framework based on the mean-shift and uncertainty assessment as a hyperspectral image classification tool, and we compare its performance against other state-of-the-art ensemble algorithms, where the mean-shift application is exclusive to the proposed framework. For the sake of simplicity throughout the paper, the proposed framework is referred to as DoTRules (Dictionary of Trusted Rules).

Here, we present DoTRules for hyperspectral image classification to provide a better and more transparent understanding of classification schemes, as well as accurate and robust classification performance. This adds to the growing literature of ensemble methods applied to the classification of hyperspectral data [32–34], especially those aimed at improving the performance of classification with acceptable clarity [35]. DoTRules is based on rules and uncertainty assessment. It was first introduced and applied to the calibration of land-use/cover change simulation models [36]. We assess the performance of the DoTRules algorithm as a novel rule-based classification framework modified to employ a bagging approach in order to boost accuracy. This accuracy boost is implemented by applying a thresholding assignment in order to extract trusted rules and then employing a novel voting approach to extract the class label recommended by the more trusted rules.

DoTRules extracts different subsets of training data from the full dataset, which can then be incorporated into boosting accuracy using a bagging approach designed to improve stability and accuracy. DoTRules has been found to perform well at modelling discrete data [37]. Since satellite imagery products inherently contain discrete digital numbers (DNs), DoTRules can work natively with them, quantifying the likelihood of belonging to a certain map class. It identifies classification rules and quantifies their frequencies so that some will be more influential than others. It also handles null values, which originate from unmatched rules between training and test samples. In addition, the uncertainty of every recognised classification rule is quantified using Shannon entropy. In simple terms, it scrutinises the uncertainty of each classification rule prior to assigning class labels based on their uncertainty value, so that the overall accuracy of classification can be improved. This not only results in boosting accuracy but also enables data analysts to spatially map every unique rule's uncertainty. In terms of applying DoTRules, every pixel of the target hyperspectral dataset corresponds to one rule from each rule set, and, after quantifying uncertainty, only the most competitive one is selected among all of the corresponding rules for a target pixel. Thus, as opposed to many other methods, DoTRules is not a black-box method, as the attributes and characteristics of every single rule can be openly observed. In addition, by quantifying the uncertainty of every rule we can then anticipate their hit ratio. This provides a tool for the spatial segregation of more reliable/accurate classified boundaries from less reliable/accurate ones prior to image classification.

The main objectives of this study are to: (1) demonstrate DoTRules as an accurate and transparent rule-based ensemble framework for hyperspectral image classification; (2) map the uncertainty of every unique classification rule as an estimate of the rules' hit ratio. This highlights the contribution of this paper, i.e., developing an accurate and transparent rule-based ensemble algorithm that provides a prior estimate of classification accuracy at the pixel level. Mapping the spatial distribution of classification accuracies is considered extremely beneficial for enhancing the capabilities of a classifier used as a land-use and land-cover map production tool based on satellite imagery [38,39]. Here, we describe the modified version of DoTRules for hyperspectral image classification, before demonstrating its application in three different study areas. We quantify the accuracy of DoTRules for hyperspectral image classification, and compare the results against some popular state-of-the-art ensemble approaches, i.e., extreme gradient boosting (XGBoost) [40,41], random forest (RF) [1,42–45], rotation forests (RoFs) [46–49], regularised random forest (RRF) [50,51], as well as two non-ensemble algorithms, namely, support vector machine (SVM) [52–56], and deep belief network (DBN) [57,58] as the classic deep learning method. Although SVM and DBN are not ensemble methods, they are included in our comparison because of their popularity, as they have been repeatedly used in recent hyperspectral image classification studies using Indian Pines, Salinas and Pavia University datasets. Finally, we discuss the advantages and disadvantages of the proposed approach for hyperspectral image classification.

#### **2. Methods and Datasets**

#### *2.1. DoTRules*

DoTRules is based on a dictionary of trusted rules. It is designed for prediction when a large amount of discrete data are involved. However, it may also be applied to continuous data after discretisation. This is similar to the RF [59,60] method insofar as rule sets are used to select the mode response (i.e., most frequent class label). However, instead of generating random trees, DoTRules operates by constructing many corresponding rules for every pixel (i.e., feature vector), which are derived from different rule sets. Each rule set is generated from a different combination of predictor variables in the ensemble run. For every unique rule, the most frequently occurring class label, which carries the highest probability of occurrence, is assigned [37,61]. However, as there are many rule sets, there may be many matching rules with defined class labels for a single data sample. To get the best (i.e., final) class label, a weighted majority filter (weighted mode) is applied on every available corresponding rule for a single data sample after the elimination of unreliable rules. The weighted majority filter puts more emphasis on those rules that are assembled by more components (i.e., matching variables) with less generalised class labels. The DoTRules procedure consists of the following steps implemented in R [62]:

#### STEP 1: *Segmentation analysis*

First, a data segmentation or segmentation analysis should be applied to each predictor variable *J*=*{j1,j2,* ... *, jn}* before classification, where *J* is a defined set of *spectral bands*/*band combinations*, but not necessarily every spectral band or a possible combination. These homogeneous digital numbers (DNs) of the hyperspectral satellite image are then converted to segments. This is intended to partition *m* observations of the original image into *S* segments for each *j* in *J*, in which each DN in each segment (ideally) shares some common trait. Although various types of segmentation or even clustering algorithms can benefit the proposed classification framework, here we applied a *mean-shift* segmentation algorithm [63]. The mean-shift algorithm [63,64] is a recursive algorithm that allows us to execute a nonparametric mode-based segmentation. This is performed by a data segmentation based on a kernel density estimate of the probability density function associated with the data-generating process. The main motivation for applying a mean-shift algorithm is the fact that it is model-free and does not assume any prior distribution shape for data segments. Furthermore, it is robust to outliers and does not require a pre-specification of the number of segments.

In its standard form, the mean-shift algorithm works as follows. We observe a set of DN values from *x*1, ... ,*xm*, for each spectral band *J*=*{j1,j2,* ... *, jn}*. We fix a kernel function *ker f* and a bandwidth parameter σ, and we apply the update rule:

$$\mathbf{x} \leftarrow \frac{\sum\_{i=1}^{m} \ker f(\|\frac{\mathbf{x}\_i - \mathbf{x}}{\sigma}\|) \mathbf{x}\_i}{\sum\_{i=1}^{m} \ker f(\|\frac{\mathbf{x}\_i - \mathbf{x}}{\sigma}\|)} \tag{1}$$

where σ is the bandwidth parameter, the fundamental parameter of mean-shift algorithms, since it determines the number of segments [65]. Furthermore, regions containing fewer than some pixel count *C* may optionally be eliminated. To account for different spatial and spectral variances, it is practical to choose a kernel window of size σ = (σ*s*, σ*r*) with differing radii, where σ*s* acts in the spatial domain and σ*r* in the range domain. The statistics literature has developed various ways to estimate the bandwidth. One of them is the adaptive mean-shift, in which the bandwidth parameter varies for each data point. Here, the σ parameter is calculated using the kNN algorithm [66]. If *xi,S* is the *k*-th nearest neighbour of *xi*, then the bandwidth is calculated as:

$$\sigma\_i = \|\mathbf{x}\_i - \mathbf{x}\_{i,S}\| \tag{2}$$
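The update rule of Equation (1) and the adaptive bandwidth of Equation (2) can be sketched directly in NumPy. This is a minimal sketch assuming a Gaussian kernel for *ker f* (the kernel is not fixed in the text); the function names and stopping tolerance are illustrative.

```python
import numpy as np

def mean_shift_mode(x, data, sigma, n_iter=50, tol=1e-6):
    """Iterate the mean-shift update (Equation (1)) with a Gaussian
    kernel until x converges to a mode of the kernel density estimate."""
    for _ in range(n_iter):
        # Gaussian kernel weights ker f(||(x_i - x) / sigma||)
        w = np.exp(-0.5 * np.sum(((data - x) / sigma) ** 2, axis=1))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def knn_bandwidths(data, k):
    """Adaptive per-point bandwidth (Equation (2)): the distance from
    each point to its k-th nearest neighbour."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is each point's distance to itself
```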

Here, the aim of the segmentation analysis is to summarise the input data and thereby minimise the number of rules required to correctly classify pixels to their corresponding class labels. As more accurate segments improve the classification results, it is beneficial to apply the segmentation analysis to spectral band compositions composed of less similar spectral bands (i.e., within a multidimensional space). Thus, a pairwise dissimilarity measure *dis(ji, jn)* between spectral bands *ji* and *jn*, for 1 ≤ *i*, *j* ≤ *n* [67], can be applied to achieve more robust segments.

#### STEP 2: *Formatting the data*

In order to avoid mixing segment (*S*) values during the concatenation phase of the rule implementation in later steps, the data segments should be formatted. Following the data segmentation, and considering the maximum number of segments (*S*), the obtained data from step one should be converted to two-digit (*S* < 100), three-digit (100 ≤ *S* < 1000), or wider numbers. This is a requirement prior to the rule implementation. Hence, if the maximum value of *S* is under 100, the data should be formatted in a two-digit format (e.g., 3 = 03, 26 = 26), while if the maximum value of *S* is ≥ 100 and < 1000, the data should be in a three-digit format (e.g., 3 = 003, 26 = 026), and so forth.

#### STEP 3: *Splitting data into training and test samples*

Both our training and test sets will be in a tabular form, consisting of a set of pixels *I*=*{i1,i2, .., im}.* Each pixel *i* in *I* has a value *xij* for each predictor variable *J*. Simply, *xij* is the converted segment value of the sample *i* in *I* and *j* in *J*. Thus, for each predictor variable *j*, *xij* can adopt one of a fixed set of possible values ≤ *S*. Each pixel *i* has a corresponding class label *li* ∈ L={*l*1, *l*2, ... , *lh*}, which are also discrete semantic attributes from the global set of class labels, such as corn, grass, oats, etc. It should be noted that to implement ensemble learners using DoTRules, we need to derive *z* sub-sets of our training dataset to construct different rule sets *D* containing individual classification rules *d*. This consists of all the available pixels in the primary training dataset but includes a different (random) combination of *j* in the feature vector.

#### STEP 4: *Creating a rule set*

For every *zth* sub-set of the training set, we will concatenate values of a pixel *xij* for every *j* in *J* to form a rule set *D*. The concatenation of two or more characters is the string formed by them in a series (i.e., the concatenation of 001, 020, and 200 is 001020200). Equation (3) illustrates the pixel values for the segmented predictor variables concatenated for each pixel (row) *i*, thereby creating a rule for each pixel in the corresponding subset of the training dataset.

$$D\_z = \begin{bmatrix} x\_{11} & x\_{12} & \cdots & x\_{1n} \\ x\_{21} & x\_{22} & \cdots & x\_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x\_{m1} & x\_{m2} & \cdots & x\_{mn} \end{bmatrix} \rightarrow \begin{bmatrix} d\_1 \\ d\_2 \\ \vdots \\ d\_m \end{bmatrix} \tag{3}$$

Note that following the concatenation and extraction of rules (Equation (3)), every rule within a specific rule set retains its single class label *li* ∈ *L*. We then aggregate duplicate rules, where pixels have exactly the same values for all criteria, leaving an efficient new rule set of unique rules *Dz*. The frequency of occurrence of all potential class labels *li* ∈ *L* is then calculated for each unique rule *d* in *Dz*:

$$\begin{bmatrix} d\_1 \\ d\_2 \\ \vdots \\ d\_v \end{bmatrix} \rightarrow \begin{bmatrix} f(l\_1) & f(l\_2) & f(l\_3) & \cdots & f(l\_h) \\ f(l\_1) & f(l\_2) & f(l\_3) & \cdots & f(l\_h) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ f(l\_1) & f(l\_2) & f(l\_3) & \cdots & f(l\_h) \end{bmatrix} \tag{4}$$

where *v* is the number of unique rules in *Dz*. The class label from the set *L* with the highest frequency (i.e., the mode) is then assigned to each corresponding unique rule *d*. The total number of rule sets *D* = [1, ... , *z*] and the number of components in each rule set (i.e., the length of a rule) are user-defined. Although classification accuracy may increase with more rule sets, this comes at the expense of computation cost. In terms of rule length, classification accuracy does not necessarily increase with longer rules: longer rules with more conditional components from the *J* set model the training data too well (i.e., overfitting), resulting in less generalised estimates of class labels, and vice versa (i.e., underfitting). Overall, as the number of matching pixels in the test dataset is inversely proportional to rule length, longer rules with more components are more specific, with fewer matches, while shorter rules with fewer components are more general, with many matches in the test dataset.
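The formatting of STEP 2 and the rule construction of STEP 4 (Equations (3) and (4)) can be sketched as follows. This is a minimal illustration, not the authors' R implementation: the function names and the list-of-strings data layout are assumptions, while the zero-padding widths, the concatenation into rule strings, and the per-rule label frequencies follow the text.

```python
from collections import Counter, defaultdict

def format_segments(segment_ids):
    """STEP 2: zero-pad segment ids (two digits if S < 100, three digits
    if S < 1000, ...) so concatenated rule strings stay unambiguous."""
    width = max(len(str(max(segment_ids))), 2)  # at least two digits
    return [str(s).zfill(width) for s in segment_ids]

def build_rule_set(formatted_rows, labels, band_subset):
    """STEP 4: concatenate the formatted segment values of a chosen band
    subset into one rule string per pixel (Equation (3)) and count the
    class labels observed under each unique rule (Equation (4))."""
    freq = defaultdict(Counter)
    for row, label in zip(formatted_rows, labels):
        rule = "".join(row[j] for j in band_subset)
        freq[rule][label] += 1
    # each unique rule is assigned its most frequent (mode) class label
    modes = {rule: counts.most_common(1)[0][0] for rule, counts in freq.items()}
    return modes, freq
```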

To ensure a more accurate estimation, the default value of *z* is set to 100 rule sets. Then, to avoid overfitting and underfitting, the number of predictor variables (*j*) used in every rule *d* within a specific rule set (i.e., the length of the rules within that rule set) is defined by a random function with user-defined lower and upper bounds. This random function is called once, before creating each rule set, to define the number and combination of components within that rule set. As the optimal combination of predictor variables is unknown, random band selection helps reduce the potential for overfitting of the classifier. In this way, rules of various lengths are implemented. The lower (*min*) and upper (*max*) bounds for the length of rules (λ) in each rule set *D* = [1, ... , *z*] are positive natural numbers satisfying:

$$\lambda \begin{cases} \max(\lambda) \le n \\ \min(\lambda) > 0 \end{cases} \tag{5}$$

where *n* is the number of selected predictor variables in the *J* set. The number of rule sets and the *min* and *max* values of λ can be further optimised using cross-validation.
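
The random rule-length mechanism (Equation (5)) can be sketched as follows. This is an illustrative sketch; function names and default bounds are our own, not from the paper:

```python
import random

def draw_rule_length(n, lam_min=1, lam_max=None):
    """Draw a rule length lambda with min(lambda) > 0 and max(lambda) <= n
    (Equation (5))."""
    lam_max = n if lam_max is None else min(lam_max, n)
    if not (0 < lam_min <= lam_max):
        raise ValueError("bounds must satisfy 0 < min <= max <= n")
    return random.randint(lam_min, lam_max)

def draw_rule_components(predictors, lam_min=1, lam_max=None):
    """Randomly pick which predictor variables (bands) form one rule set.

    Called once per rule set, so all rules in that set share the same
    components, while different rule sets vary in length and composition."""
    k = draw_rule_length(len(predictors), lam_min, lam_max)
    return random.sample(predictors, k)
```

Because the band subset is redrawn for every rule set, the ensemble mixes short, general rules with long, specific ones, which is the behaviour the text describes.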

#### STEP 5: *Calculating and mapping rule entropy.*

The aim of this step is to assess the uncertainty value of each rule. In information theory, entropy is the quantitative measure of system disorder, instability and uncertainty, and may be used to forecast the trend of a specified system. Entropy indicates the expected amount of information contained [68]. Here, the entropy value of every unique rule *d* from a rule set *Dz* is calculated based on the frequencies of each possible class label (Equation (4)) as:

$$e\_d = -\sum\_{i=1}^{h} p\_{l\_i} \log\_2 p\_{l\_i} \tag{6}$$

where *ed* is the entropy of the unique rule *d*, *pli* is the probability of the class label *li* ∈ *L*, and *h* is the number of class labels in *L*. The general idea is that for a given rule, which may cover one or many pixels, the greater the probability of membership in a given class, the lower the uncertainty associated with that rule. This provides a quantitative estimate of uncertainty for every single rule within different rule sets prior to assigning class labels. These uncertainty estimates can be applied both to the spatial mapping of rule uncertainty in classification and to eliminating unreliable rules with a high entropy from different subsets and/or rule sets before combining votes. The spatial distribution of uncertainty is quantified by mapping the entropy of each unique rule back to its corresponding pixels. Every time DoTRules is applied to a training data subset, the class label with the highest frequency is allocated to each rule, and the entropy of that rule is calculated.
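
Equation (6) is the standard Shannon entropy over a rule's label frequencies. A minimal sketch (our own function name):

```python
import math

def rule_entropy(label_counts):
    """Shannon entropy e_d of a unique rule (Equation (6)), computed from
    the frequency of each class label among the pixels the rule covers."""
    total = sum(label_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in label_counts.values() if c > 0)
```

A rule whose pixels all share one label has entropy 0 bits (no uncertainty); an even split over two labels gives 1 bit, the maximum for two classes.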

#### STEP 6: *Eliminating unreliable rules within all rule sets.*

After assessing the uncertainty of each individual rule, unreliable rules (i.e., rules with a high entropy) should be eliminated to improve the quality of the voting outcome, which directly affects the classification accuracy. Thus, every rule *d* (in *Dz*) whose *ed* exceeds the user-defined threshold is eliminated. In our study, a rule was considered unreliable and eliminated if its *ed* was > 0.3 and the frequency of its corresponding pixels was < α (to avoid entropy values arising by random chance). α is calculated as follows, keeping the probability that an entropy value results from random chance below 0.05 (5%):

$$\alpha = \left\lceil \frac{\ln(0.05)}{\ln(1/h)} \right\rceil \quad \text{for} \quad h > 1 \tag{7}$$

where *h* is the number of class labels.
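
Equation (7) is a one-liner in code; the function name is our own:

```python
import math

def alpha_threshold(h):
    """Minimum label frequency alpha (Equation (7)) keeping the chance that
    an entropy value arises purely by random guessing below 0.05."""
    if h <= 1:
        raise ValueError("Equation (7) requires h > 1 class labels")
    return math.ceil(math.log(0.05) / math.log(1 / h))
```

For h = 2 class labels this gives α = 5: a rule must cover at least five pixels before its entropy is trusted, since (1/2)^5 ≈ 0.03 < 0.05.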

#### STEP 7: *Classifying the test dataset.*

Above, we described the process of creating DoTRules and allocating the most likely class label to each rule based on its frequency. In the same way, class labels can now be assigned for the study area using further subsets of the primary training dataset (i.e., implementing more rule sets). Every time a new rule set is implemented, the same procedure is followed to establish rules for the test dataset. We then match each test data rule with its equivalent training rules in the DoTRules using a many-to-one matching algorithm and allocate the most likely class label to each test data rule. This is repeated each time a weak learner is implemented from a rule set.

#### STEP 8: *Handling null values*

There is always a possibility of encountering null records in the test dataset while using DoTRules. In this situation, pixels in the test dataset present combinations of criterion states not encountered in the training data, which may increase the *out-of-bag* error. Handling null values is essential for maintaining the classification accuracy; in the proposed ensemble framework using mean-shift, it is a two-part procedure. First, all rules are sorted by their similarity; then, every null value is assigned the class label of its closest (i.e., most similar) rule, based on the alphanumeric similarity of the constructed rules. However, the influence of these rules in combining votes is minimised, as they are characterised by null entropy values.
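
STEPS 7 and 8 can be sketched together: a dictionary lookup for matched rules, with null records falling back to the most similar known rule. This is a simplified illustration; we use `difflib` string similarity as a stand-in for the paper's alphanumeric similarity measure, and the names are our own:

```python
import difflib

def classify_with_null_handling(test_rules, trusted_labels):
    """Many-to-one matching of test rules (strings of concatenated segment
    values) against the dictionary of trusted rules; unmatched (null) rules
    borrow the label of the most similar known rule."""
    known = list(trusted_labels)
    result = []
    for rule in test_rules:
        if rule in trusted_labels:
            result.append(trusted_labels[rule])
        else:
            # Null record: fall back on alphanumeric similarity.
            closest = difflib.get_close_matches(rule, known, n=1, cutoff=0.0)
            result.append(trusted_labels[closest[0]] if closest else None)
    return result
```

In the full framework these fallback labels still enter the vote, but with minimal influence, since their rules carry null entropy values.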

#### STEP 9: *Combining votes*

In order to complete the classification procedure, this step assigns a final label to each pixel. To combine the votes of each set of learners, we first remove all unreliable rules (those with a high or null entropy) within every rule set using the thresholding approach described above. Afterwards, a mode filter is applied to the class labels produced by the sets of corresponding rules for each pixel. This mode function considers not only the frequency of class labels but also the length of each rule as a weight: since a rule is formed by concatenating a number of predictor variables (*j*), a rule containing more predictor variables as components carries a higher weight in the mode function. Nonetheless, if none of the recognised reliable rules for a certain pixel in the test dataset is matched by any corresponding rule from the various training rule sets (derived from subsets of the training data), then the same mode function is applied to the labels of the unreliable rules described in STEP 8.
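
A length-weighted mode can be sketched as follows. Weighting each vote linearly by rule length is our simplification; the paper does not give the exact weight function, and the names are our own:

```python
from collections import defaultdict

def combine_votes(votes):
    """Length-weighted mode (STEP 9): votes is a list of (label, rule_length)
    pairs, one per rule set; longer rules carry more weight. Votes with a
    None label (eliminated or unmatched rules) are ignored."""
    weight = defaultdict(float)
    for label, length in votes:
        if label is not None:
            weight[label] += length
    return max(weight, key=weight.get) if weight else None
```

For example, one vote from a five-component rule outweighs two votes from two-component rules, reflecting the greater specificity of longer rules.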

#### STEP 10: *Calculating and mapping the hit ratio*

Calculating and mapping the hit ratio helps to visualise the spatial distribution of the classification error. Similar to the entropy value, which is calculated for every unique rule based on the frequencies of each possible class label, we map the hit ratio of every unique rule in our combined results back to the original pixels. DoTRules is rule-based, where every unique rule *d* from a rule set *Dz* corresponds to one or many pixels; thus, we can calculate the classification hit ratio of those rules using Equation (8):

$$A\_d = \sum\_{i=1}^{h} l\_i^{+} \Bigg/ \sum\_{i=1}^{h} l\_i \tag{8}$$

Here, *Ad* is the hit ratio of the unique rule *d* and *li*+ is the number of correctly classified labels *li*.

#### *2.2. Rule Uncertainty Threshold*

In using DoTRules for the classification of hyperspectral imagery, the class label of a rule is described by both its entropy value and the frequencies of all potential class labels (Figure 2). A rule is considered reliable if its entropy is less than 0.3 bits and its label frequencies are supported by enough pixels (frequency > α). Among the reliable rules coming from the various rule sets for a certain pixel, however, those with a longer concatenated string (rule) have more impact in combining the final votes. This is because they are composed of more variables yet still meet the same uncertainty threshold, and hence can make more robust predictions. In other words, longer rules match fewer pixels with a specific class label, while shorter rules match more pixels belonging to multiple class labels. The fewer the pixels shared between different rules, the more accurate the classification results will be.

**Figure 2.** Schematic demonstration of (**a**,**b**) reliable and (**c**,**d**) unreliable rules extracted using DoTRules. The black circles represent the segment values of randomly selected spectral bands composing different rules for one target pixel. Considering rule sets number #1 and #2, the latter will have more impact in combining votes due to its larger length.

As the estimated entropy values for the distribution of response variables (class labels) with low frequencies are less reliable (Figure 2d) and may result from random chance, a second threshold is applied to the frequencies of potential labels. This will further improve the quality of the rule elimination process.

#### *2.3. Comparing DoTRules with Other Methods*

To measure and quantify DoTRules' performance, we implemented different classification algorithms, including XGBoost, RF, RoF, RRF, SVM, and DBN, on the same datasets. These six algorithms are among the most popular methods for hyperspectral image classification, and they belong to three different categories of machine-learning methods: the first four are state-of-the-art ensemble methods, SVM is a maximum-margin classifier, and DBN is a deep learning method. Thus, these methods provide appropriate benchmarks for assessing the performance of DoTRules. XGBoost is an algorithm that has recently been dominating applied machine learning [69], while RF, RoF and RRF were selected because of both their natural similarity to DoTRules and their performance in hyperspectral data classification [1,42–45]. They are also computationally efficient, suitable for large training datasets with many variables, and able to solve multiclass classification problems [70]. Furthermore, the SVM [9,18,71] and DBN [57,58] algorithms have demonstrated promising results in previous studies. We compared the overall accuracy (OA) and kappa coefficient (*k*) of DoTRules with these algorithms for hyperspectral image classification, using three different datasets from Indian Pines, Salinas and Pavia University (Figure 3).

**Figure 3.** False color composites and ground truth images of the datasets used to illustrate the image classification using DoTRules, including the (**a**,**b**) Indian Pines, (**c**,**d**) Salinas and (**e**,**f**) Pavia University datasets.

After tuning the required parameters of the above algorithms using the CARET package in R [72], a training process was implemented. In order to make a valid comparison, not only applicable to different study areas but also robust to variations of the portion of training and test sample sizes, different sample sizes of 1%, 5% and 10% were used. In addition, the overall accuracy value was taken as an average of five consecutive runs of each combination of algorithm and sample size. This was to avoid a sudden change in the overall accuracy value arising from changes in the training sample.

#### *2.4. Datasets*

DoTRules was tested using three hyperspectral image datasets (Figure 3), namely, the Indian Pines [22,73], Salinas [74,75] and Pavia University [22,71] datasets. Both the Indian Pines and Salinas datasets contain noisy bands due to water vapour, atmospheric effects, and sensor noise. All three datasets are available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral\_Remote\_Sensing\_Scenes. The mean spectral signatures of the three datasets are shown in Figure 4.

**Figure 4.** The mean spectral signatures of the (**a**) Indian Pines, (**b**) Salinas Valley and (**c**) Pavia University datasets.

The Indian Pines dataset is an AVIRIS image collected over the Indian Pines test site in Indiana, USA. The dataset consists of 220 spectral bands in the same wavelength range as the Salinas dataset (four of the original bands were removed as they contain no data). The scene is a subset of a larger scene and contains 145 × 145 pixels covering 16 ground truth classes. We removed a further 20 spectral bands affected by water absorption and noise.

The Salinas image consists of 224 bands, and each band contains 512 × 217 pixels covering 16 classes. It was recorded by the AVIRIS sensor over Salinas Valley, CA, USA, with a spatial resolution of 3.7 m, and the spectral information ranging from 0.4 to 2.5 μm. We used 204 bands, after removing the water absorption bands.

The Pavia University dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) system that is a compact airborne imaging spectrometer. It consists of 103 spectral bands after removing the noisy bands, and 610 × 340 pixels for each band with a pixel resolution of 1.3 m. The ground truth image consists of nine classes.

#### **3. Results**

#### *3.1. Simulation Experiments*

For all three hyperspectral datasets, DoTRules was superior to the other algorithms in terms of overall accuracy and kappa coefficient, with one exception: at the very low sample size (i.e., 1%) of the smaller datasets (i.e., Indian Pines and Pavia University), it was not the most accurate approach. This is confirmed by the accuracy assessment results for the different sample sizes, averaged over five consecutive runs per sample size (Table 1).

**Table 1.** The accuracy assessment results of three applied datasets, including the overall accuracy (OA%) and kappa coefficient (κ) for all applied methods including support vector machine (SVM), deep belief network (DBN), extreme gradient boosting (XGBoost), random forest (RF), rotation forests (RoFs), regularised random forest (RRF), as well as Dictionary of Trusted Rules (DoTRules). The maximum values are highlighted in bold.


The classification results also demonstrate that the DoTRules classification was able to closely match the spatial pattern of the ground truth image (Figure 5). These results were consistent across all three hyperspectral datasets. DoTRules was not only an accurate but also a transparent rule-based approach where the reliability (based on uncertainty) of each rule can be mapped. This is a desirable feature in remote sensing applications where the visual investigation of classification rules is informative.

**Figure 5.** The DoTRules classification results and estimated pixel-based *ed* for the Indian Pines, Salinas and Pavia datasets. The red pixels show the location of unreliable rules according to entropy thresholding (*ed* > 0.3, for α = 0.05), while the grey pixels correspond to reliable rules below the threshold. The red pixels are counted for each sample size.

#### *3.2. Uncertainty Mapping*

As DoTRules is rule-based, and each unique rule with its specific entropy value corresponds to one or more pixels, it is possible to estimate and map the uncertainty of every unique rule back to those pixels. This is a preliminary product of DoTRules, before assigning a class label to every pixel.

To illustrate the applicability of the entropy map for locating areas of low versus high classification accuracy, entropy values above and below the applied threshold (*ed* = 0.3; Equation (7)) were mapped to segregate regions with more and less reliable classification responses (Figure 5). In this way, the DoTRules spatial uncertainty map can facilitate a better understanding of uncertainty in the classified product and the segregation of more and less reliable geographic areas before class labels are assigned to every pixel of the test dataset. This provides clear spatial insight into the uncertainty of the classification at an early stage of the classification process.

In developing and applying DoTRules, we have found that a larger sample size offers a higher classification accuracy where the number of less reliable rules with higher levels of uncertainty is reduced. Conversely, a smaller sample size, with fewer rules detected in our rule sets, was less able to capture the complexity of the hyperspectral image classification. This is mainly due to the fact that for DoTRules, training samples should be enough to cover all possible forms of rules. Figure 5 demonstrates the rule uncertainty for the Indian Pines, Salinas and Pavia University datasets using 1%, 5% and 10% training sample sizes.

#### *3.3. Correspondence Between Uncertainty and Hit Ratio of Rules*

In general, where there is low entropy (i.e., low uncertainty) for a rule within our rule set, the classification also tends to be more accurate. Simply, a lower entropy means there is just one clear answer (the mode class label) for a rule, while a high entropy indicates a more uniform distribution of the map class frequencies for that rule, which indicates a less reliable classification. Plotting hit ratio values against entropy values of every constructed rule among our various rule sets demonstrates that the hit ratio of rules can be defined by a polynomial function of their entropy value (Figure 6), which is supported by a strong coefficient of determination for all three datasets.
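
This fitted relationship can be exploited as a predictive model for the hit ratio. A sketch with NumPy; the polynomial degree here is our assumption, as the paper does not state the order of the fitted function:

```python
import numpy as np

def fit_hit_ratio_model(entropies, hit_ratios, degree=2):
    """Fit a polynomial mapping rule entropy to hit ratio (cf. Figure 6),
    trained on rules extracted from the training data."""
    return np.poly1d(np.polyfit(entropies, hit_ratios, degree))

# model = fit_hit_ratio_model(train_entropy, train_hit_ratio)
# predicted_hr = model(test_entropy)  # estimate hit ratio without test labels
```

Once fitted on training rules, the model estimates the hit ratio of test rules from their entropy alone, which is the basis of the RMSE comparison reported in Table 2.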

**Figure 6.** Entropy versus the hit ratio of rules for the (**a**) Indian Pines, (**b**) Salinas Valley and (**c**) Pavia University dataset 10% training sample sizes. The bubble sizes show the frequency of each rule among all corresponding rules from different rule sets before combining votes.

To further demonstrate the applicability of DoTRules' uncertainty product for the anticipation of the rule-exclusive hit ratio, we then applied the derived functions based on the correspondence of the hit ratio and entropy of the training data to predict the hit ratio of rules within the test datasets. The root mean square error (RMSE) values of the predicted hit ratios based on the entropy polynomial function were <1 for all three datasets (Table 2). Table 2 demonstrates that the uncertainty product of DoTRules may be applied to estimate the hit ratio of the classification rules in the context of the hyperspectral image classification.


**Table 2.** The prediction of rules' hit ratio based on the corresponding entropy values for a 10% training sample size.

#### **4. Discussion**

In this paper, we have presented a rule-based ensemble framework based on a mean-shift segmentation and uncertainty analysis, referred to as DoTRules (a Dictionary of Trusted Rules), for hyperspectral image classification. DoTRules constructs many rule sets composed of corresponding rules for each pixel in a hyperspectral image to predict the class of the test samples. When applied to different datasets and sample sizes, DoTRules proved to be an effective strategy for the classification of hyperspectral imagery, with promising results compared to other established algorithms. Furthermore, DoTRules enables both rule uncertainty and hit ratio mapping, which is an advantage for the users of classified land-use and land-cover maps created from remote sensing imagery. Below, we discuss improvements in hyperspectral image classification achieved using DoTRules.

#### *4.1. The Overall Accuracy of Classification*

According to our results, for all three applied hyperspectral datasets, the DoTRules ensemble framework was more accurate than the other applied classification algorithms for most training sample sizes (Table 1). This is due to the robust rule detection framework using mean-shift segmentation, where Shannon entropy is used to assess the uncertainty of individual rules for classification purposes. Here, the segmentation is done in such a way that each DN in each segment (ideally) shares some common trait. This bears similarities to object-oriented classification, which categorises pixels based on their spatial relationships with the surrounding pixels. While pixel-based classification relies exclusively on the information in each pixel, object-based classification is based on information from a set of similar pixels (i.e., objects or image objects). Image objects are groups of pixels that are similar to one another based on spectral properties (i.e., colour), size, shape, and texture, as well as context from the neighbourhood surrounding the pixels, in an attempt to mimic the type of analysis done by humans during visual interpretation. In addition, passing segment information to pixels and extracting reliable rules (i.e., low-uncertainty rules) using minimum entropy through a voting system further preserves the high classification accuracy, especially when a representative training sample size is applied.

The observed increase in the overall accuracy of DoTRules' estimates when applying larger sample sizes may be due to an extra number of rules being detected and relatively fewer *null* records. Rules are very general structures that offer an easily understandable and transparent way to find the most reliable class allocation for a pixel [30]. As opposed to decision trees, every pixel corresponds to exactly one rule within each rule set. This is unique to DoTRules, and the lack of such transparency is a common criticism of XGBoost, RF, RoF, RRF and similar black-box algorithms [76,77]. Users can access all rules and their corresponding information, such as the rule ID, components of a rule (segment class for every selected band), true class label, probability (relative frequency) of every potential class label, rule entropy and hit ratio (accuracy) (Figure 2), while they are always connected to their corresponding pixels. This trait is highly valued in geoscience and remote sensing applications, especially in the context of land-use and land-cover mapping [38,78,79]. To assign every pixel to a map class, each pixel should have at least one matching rule from the various rule sets. Logically, the number of recognised rules within each individual rule set increases with the training sample size (i.e., from 1% to 10%), while the number of null records derived from unmatched rules between the test and training datasets is reduced. Therefore, the greater the number of trusted rules, the greater the capability of our proposed framework to allocate test pixels to their true class labels.

#### *4.2. Quantifying and Mapping the Uncertainty of Rules*

While a few studies have successfully mapped the uncertainty of classification before image classification [38,80], one strength of DoTRules in hyperspectral image classification is its demonstrated ability to quantify the uncertainty of every identified transition rule using entropy values prior to the final classification (Figure 5). In other words, DoTRules was able to report the uncertainty of rules based on Shannon entropy, independent from the test dataset. The results from different hyperspectral datasets show that the lower the entropy value, the higher the hit ratio (Figure 6). Thus, considering the strong relationship between the entropy and hit ratio, it is possible to apply the entropy values as estimates of the hit ratio. The estimation of the rule uncertainty prior to the classification of a hyperspectral dataset aids in understanding the specific strengths and weaknesses of a classifier dealing with pixels containing a range of spectral information.

#### *4.3. Quantifying Hit Ratio of Rules*

DoTRules demonstrated the ability to quantify the rule-exclusive hit ratio from the corresponding entropy values (Table 2). The uncertainty product based on the entropy values can thus be applied to segregate areas of less and more reliable prediction, independently of test data availability. In the absence of a proper test dataset for the validation of classification results, rules' uncertainty values can therefore represent their corresponding hit ratio. The collection of reliable ground truth data for validation purposes is usually an expensive task in terms of time and economic costs [81]; consequently, in many cases, it may not be possible to rely on test data to ensure the good performance of a classifier. Accordingly, aside from using traditional accuracy metrics as a single number derived from a confusion matrix, mapping and thresholding the rule-exclusive hit ratio in a classification scheme is worthwhile for visualising general patterns of high and low accuracy within the classified map and for quantifying the accuracy of prediction in specific target locations.

#### *4.4. Limitations of DoTRules and Future Work*

Although the results obtained by DoTRules are encouraging, further comparative experiments with additional hyperspectral imagery datasets should be implemented, ideally with a particular focus on assessing classification performance at higher levels of disaggregation, such as a class-level accuracy assessment. As some of the required parameters of the DoTRules implementation are subjective, namely (1) the rule uncertainty threshold, (2) the minimum and maximum length of random rules and (3) the optimum number of rule sets, more research may be beneficial for the computational optimisation of DoTRules parameters. Our further work focuses on the development of more computationally efficient schemes for the ensemble framework.

Another limitation is that the proposed ensemble framework is less efficient for very low sample sizes (i.e., 1% or less). DoTRules usually needs a larger training set to extract the underlying relationships between variables. This is a common requirement for all of the tested ensemble methods except RoF: although RoF was the best-performing algorithm for the 1% sample size of Indian Pines, it benefits from its transformation of the hyperspectral data.

#### **5. Conclusions**

We have applied DoTRules—a Dictionary of Trusted Rules—as an innovative ensemble framework for classifying hyperspectral data with high accuracy estimates compared with other popular classification algorithms. DoTRules' classification accuracy was superior to six other popular and state-of-the-art ensemble and non-ensemble algorithms. In the case of DoTRules, every rule within any rule set can be accessed, and their corresponding uncertainty value may be observed. This feature is unique to DoTRules and the absence of this ability underpins a common criticism of many ensemble algorithms (including many of the algorithms applied here) as black-box classifiers. Furthermore, DoTRules is also capable of quantifying and mapping the uncertainty of these classification rules, prior to the image classification where the uncertainty values can be applied as an estimate of the hit ratio. While the entropy product of DoTRules provides spatial insights, including the location of less reliable classification rules as well as more reliable ones, regardless of the test sample dataset availability, it can also certify and locate less accurate rules using the estimated hit ratio. The spatial exploration of rule uncertainty in hyperspectral image classification is beneficial for the early prediction of success or failure of a classifier in specific geographic locations. The uncertainty maps may also serve to enhance the application of map products by alerting map users to the spatial variation of rules' hit ratio over the entire mapped region. This, together with the simplicity and accuracy of DoTRules, indicates that the methodology offers new features and is ready for operational use by the remote sensing community.

**Author Contributions:** Conceptualization, M.S.R.; methodology, M.S.R.; validation, M.S.R. and A.A.; writing—original draft preparation, M.S.R., A.L. and B.A.B.; writing—review and editing, A.L., A.A. and B.A.B.; visualization, M.S.R.; supervision, A.L. and B.A.B.; funding acquisition, B.A.B. All authors read and approved the final manuscript.

**Funding:** This research was supported by CSIRO Australian Sustainable Agriculture Scholarship (ASAS) as a top-up scholarship to Majid Shadman Roodposhti, a PhD scholar, at the University of Tasmania (RT109121), School of Land and Food.

**Acknowledgments:** The authors greatly appreciate Maia Angelova Turkedjieva and Ye Zhu, Deakin University, for their suggestions to improve this manuscript. We also thank Monica Cuskelly, Associate Dean (Research), University of Tasmania, for editing the manuscript. We also thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


on Modelling and Simulation, Hobart, Australia, 3–8 December 2017; Syme, G., Hatton MacDonald, D., Fulton, B., Piantadosi, J., Eds.; Hobart, TAS, Australia, 2017; p. 508.


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Operational Large-Scale Segmentation of Imagery Based on Iterative Elimination**

#### **James D. Shepherd 1,\*, Pete Bunting <sup>2</sup> and John R. Dymond <sup>1</sup>**


Received: 28 February 2019; Accepted: 15 March 2019; Published: 18 March 2019

**Abstract:** Image classification and interpretation are greatly aided through the use of image segmentation. Within the field of environmental remote sensing, image segmentation aims to identify regions of unique or dominant ground cover from their attributes such as spectral signature, texture and context. However, many approaches are not scalable for national mapping programmes due to limits in the size of images that can be processed. Therefore, we present a scalable segmentation algorithm, which is seeded using k-means and provides support for a minimum mapping unit through an innovative iterative elimination process. The algorithm has also been demonstrated for the segmentation of time series datasets capturing both the intra-image variation and change regions. The quality of the segmentation results was assessed by comparison with reference segments along with statistics on the inter- and intra-segment spectral variation. The technique is computationally scalable and is being actively used within the national land cover mapping programme for New Zealand. Additionally, 30-m continental mosaics of Landsat and ALOS-PALSAR have been segmented for Australia in support of national forest height and cover mapping. The algorithm has also been made freely available within the open source Remote Sensing and GIS software Library (RSGISLib).

**Keywords:** image segmentation; remote sensing; land cover; iterative elimination; RSGISLib

#### **1. Introduction**

The classification and analysis of remotely-sensed optical data has become a key technology for the mapping and monitoring of land cover [1,2]. Traditionally, such analysis has been performed on a per pixel basis, but over the last 20 years, there has been a significant movement to embrace context and segment-based classifications [3] due to observed improvements in classification accuracy [3–7]. A significant driver for this adoption has been the availability of the eCognition software [8], which has provided user-friendly tools for these operations and made them widely available to the community. For national mapping programs, high-resolution remotely-sensed data such as RapidEye, Sentinel 1 and 2, SPOT (4–5) and Landsat (TM, ETM+, OLI), provide the majority of data. These data typically have a resolution of 5–30 m, thereby providing sufficient detail for monitoring land cover, such as forestry, agriculture and grasslands. The use of segmentation for land cover mapping at this resolution aims to identify spatially-homogeneous units such as fields and forestry blocks as single objects, or a small number of large objects. However, any under-segmentation will result in a poor classification result, as features are merged and cannot be identified. Therefore, where errors occur within the segmentation result, a small amount of over-segmentation is preferable to retain classification accuracy [9].

Building on the extensive literature on image segmentation (e.g., [10–14]) within the image processing and medical imaging communities, there have been a significant number of publications devoted to segmentation of remotely-sensed imagery. These can be categorised as either top-down or bottom-up approaches. Top-down approaches, such as multi-scale wavelet decomposition [15], start with a single region and progressively break the image into smaller regions until a termination condition is met. Bottom-up methods, such as region growing and spectral clustering [16], start with individual pixels and group them until a termination condition is met. Termination criteria vary between approaches, but they are generally a combination of the colour similarity, shape, and size of the regions generated. To group pixels within a bottom-up approach, a number of methods have been proposed, including region growing [17], thresholding [18], statistical clustering [19], Markov random field models [20], fuzzy clustering [21], neural networks [22] and image morphology watersheds [23]. Thresholding techniques are generally applied to extract features from a distinctive background, with histograms commonly being used to identify the threshold values automatically [24]. Region growing techniques require seeds, which are difficult to generate reliably automatically and are time consuming to provide manually. Soille [17] and Brunner and Soille [25] used an automated seeding method based on quasi-flat zones, as did Weber et al. [26], who applied it to multi-temporal imagery. Watershed techniques are among the most commonly applied, but their results are often over-segmented and, due to the filtering used to generate the gradient image(s), boundaries between features are often blurred and poorly defined at the pixel level.

Segmentation techniques are often designed for specific applications, such as tree crown delineation [27], requiring considerable parameterisation that can be difficult to define and optimise [28]. The termination criteria used are also a common differentiating factor in the results generated by a segmentation algorithm. For example, Baatz and Schäpe [28] aimed to minimise the spectral heterogeneity of the image regions while creating regions of similar size and compact shape [8] to allow accurate statistical comparison between features for later classification. The key parameter within the eCognition multi-scale segmentation algorithm is the scale factor *f* [28]. The user-defined threshold of *f* controls the termination of the segmentation process, where *f* is defined as:

$$\begin{cases} f = w \cdot h\_{\text{colour}} + (1 - w) \cdot h\_{\text{shape}}\\ h\_{\text{colour}} = \sum\_{b=1}^{n\_{\text{bands}}} w\_b \cdot \sigma\_b\\ h\_{\text{shape}} = w\_s \cdot h\_{\text{cmpct}} + (1 - w\_s) \cdot h\_{\text{smooth}}\\ h\_{\text{cmpct}} = \frac{p\_l}{\sqrt{n\_{\text{pxl}}}}\\ h\_{\text{smooth}} = \frac{p\_l}{p\_{\text{bbox}}} \end{cases} \tag{1}$$

where *w* is a user-defined weight (range: 0–1) defining the importance of object colour versus shape within the termination criteria, *n*<sub>bands</sub> is the number of input image bands, *w<sub>b</sub>* is the weight for image band *b*, *σ<sub>b</sub>* is the standard deviation of the object for band *b*, *w<sub>s</sub>* is a user-defined weight (range: 0–1) defining the importance of compactness versus smoothness, *p<sub>l</sub>* is the perimeter length of the object, *n*<sub>pxl</sub> is the number of pixels within the object and *p*<sub>bbox</sub> is the perimeter of the bounding box of the object.
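
For illustration, Equation (1) can be expressed as a small function. This is a sketch only: the function and argument names below are ours, not part of the eCognition API.

```python
import numpy as np

def scale_factor(band_stddevs, band_weights, perimeter, n_pixels,
                 bbox_perimeter, w=0.5, w_s=0.5):
    """Sketch of the multi-scale merge criterion f (Equation 1).

    band_stddevs  : per-band standard deviations of the object (sigma_b)
    band_weights  : per-band weights (w_b)
    perimeter     : perimeter length of the object (p_l)
    n_pixels      : number of pixels within the object (n_pxl)
    bbox_perimeter: perimeter of the object's bounding box (p_bbox)
    w             : colour vs. shape weight; w_s: compactness vs. smoothness
    """
    h_colour = float(np.dot(band_weights, band_stddevs))
    h_cmpct = perimeter / np.sqrt(n_pixels)    # compactness term
    h_smooth = perimeter / bbox_perimeter      # smoothness term
    h_shape = w_s * h_cmpct + (1.0 - w_s) * h_smooth
    return w * h_colour + (1.0 - w) * h_shape
```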

Others have used k-means clustering as an alternative approach. Lucchese and Mitra [29] used k-means clustering and median filtering for post-processing. While the method is simple and effective, some of the objects with sharper edges were over-smoothed from the median filtering. Wang et al. [19] first initialised the segmentation using a k-means clustering to remove noise and reduce the feature space from single pixels to clumps. The clumps were then merged using a combination of spectral colour (weighted by object size) and object shape. The resulting segmentation is similar to that derived from the eCognition software. Likewise, Brunner and Soille [25] executed a single merge of clumps based on spectral similarity.

Computational constraints are another issue commonly encountered with many segmentation algorithms. The size of the scene that can be processed is in some cases significantly limited. Often, one whole satellite image cannot be processed in a single step, due to memory constraints. Additionally, methods commonly deployed within the field of computer vision expect either one (greyscale) or three (red, green and blue) image bands, or spectral measurements, and it is assumed that these data channels are scaled to the same range (i.e., 0–255). In remote sensing, this is commonly not the case, with many satellites having multi-spectral capabilities with at least four spectral wavelengths, including the near-infrared (NIR), which, over vegetation, results in a considerably larger range of values than the other channels due to leaf cell structure [30].

Assessing the quality and effectiveness of a segmentation result has proven difficult [31]. In this context, the aim of segmentation is to subdivide the landscape into units of the same land cover or spectral colour, where, ideally, an entire feature, such as a single-species forest block, will be captured as a single segment so that these units can be classified within an appropriate context. However, if the scene is under-segmented, that is, multiple features of interest are contained within a single segment, then classification of these features is not possible, so under-segmentation is to be avoided. A degree of over-segmentation is therefore generally accepted, as neighbouring objects of the same class can be merged following classification. Additionally, the scale of the imagery and of the result being sought will significantly impact segmentation requirements. Segmentation quality can be assessed for specific tasks, such as delineation of tree crowns or buildings, with respect to a set of reference data. Generating such a reference set is a well-defined task with a relatively clear answer for specific features, but for more general segmentation tasks, such as splitting a whole SPOT 5 scene into units for land cover classification, where a large number of land covers, uses and vegetation conditions are present, there is no single "best" solution. In an attempt to quantify the differences between segmentation results, various approaches exist [32], where parameters such as shape (e.g., the shape index [33]), comparison to a set of reference features [34–36], synthetic images [37] and colour metrics [31,38] have been used.

Our aim was to provide a segmentation technique that is scalable to support the national land cover mapping program of New Zealand, while only using a few understandable parameters. Segments needed to be of uniform spectral response (i.e., colour) with a minimum size and suitable for land cover and land cover change classification, while boundaries between land covers should be sharp and accurate. The algorithm needs to be efficient over large datasets (i.e., multiple mosaicked scenes) and applicable to a wide range of sensors and image pixel resolution. We have therefore proposed a new iterative elimination algorithm to meet these aims. We describe and apply the method on Landsat ETM+ and SPOT5 imagery of typical landscapes in New Zealand.

#### **2. Methods**

The iterative elimination method we present here has four steps (Figure 1): first, a seeding step, which identifies the unique spectral signatures within the image; second, a clumping process to create unique regions; third, an elimination process to remove regions below the minimum mapping unit; and, finally, a relabelling step so that the regions are sequentially numbered, providing the final segmentation. These steps require two parameters: the initial number of cluster centres to be produced and the minimum mapping unit defined for the elimination process.

When processing imagery with a large number of highly-correlated image bands (i.e., time series and hyper-spectral imagery), it is recommended that a subset or derived compression (e.g., principal component analysis) of the available input image bands be used to both minimise auto-correlation between the bands and reduce processing time. In this study, the segmentation of SPOT-5 data from two dates was undertaken using the near-infrared (NIR), shortwave infrared (SWIR) and red wavelengths (i.e., the green band was not included due to its significant correlation with the red band). Images were processed from DN to standardised reflectance before segmentation to reduce scene complexity [39].

**Figure 1.** Flowchart for the segmentation algorithm.

#### *2.1. Seeding*

The first step in the algorithm is to seed an initial set of unique spectral signatures to represent the image. To seed in an unsupervised fashion, the k-means [40] clustering algorithm was used. While there are many clustering algorithms within the literature, such as Iterative Self-Organising Data (ISOData [41]), mean-shift [42], hierarchical clustering [43] and fuzzy k-means [44], it was found that k-means produced the best results while being computationally efficient. The k-means clustering algorithm is an iterative procedure requiring calculations in the order of *nc* (O(*nc*)), where *n* is the number of pixels and *c* is the number of cluster centres. To minimise the number of calculations, the image was subsampled before clustering. It was found that subsampling to 10% or even 1% of the image pixels was sufficient to provide cluster centres similar to those derived from the whole input dataset (Figure 2), while providing a significant improvement in computation time (Table 1). Another advantage of k-means is that it allows development of the model on one dataset (i.e., a subset) and application to another. As with any statistical sampling, a sufficiently large sample is required to ensure representativeness and that no features are missed. Experiments were carried out using a number of scenes (5 Landsat-7, 5 SPOT-5 and 3 WorldView-2) from each of the sensors to cover a range of environments.
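
The subsampled seeding step might look like the following minimal sketch (a plain Lloyd's iteration in NumPy; the function name and defaults are hypothetical, and production use would rely on an optimised k-means implementation):

```python
import numpy as np

def kmeans_seed(pixels, k, sample_frac=0.1, n_iter=25, rng=None):
    """Derive k cluster centres from a random subsample of the pixels.

    pixels: (n, bands) array; sample_frac: fraction of pixels retained
    for clustering, mirroring the paper's 10%/1% subsampling experiments.
    """
    rng = np.random.default_rng(rng)
    n = pixels.shape[0]
    n_sample = max(k, int(n * sample_frac))
    sample = pixels[rng.choice(n, n_sample, replace=False)]
    # Initialise centres from randomly chosen sampled pixels.
    centres = sample[rng.choice(n_sample, k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each sampled pixel to its nearest centre (Euclidean).
        d = np.linalg.norm(sample[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = sample[labels == c]
            if len(members):  # keep the old centre if a cluster empties
                centres[c] = members.mean(axis=0)
    return centres
```

Because the model is fitted on the subsample only, the resulting centres can then be applied to classify every pixel of the full image.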

**Figure 2.** Demonstration that k-means cluster centres generated using 50%, 10%, 1% and 0.1% subsampled image data are close to those derived from the original data, with considerably less computing time. The imagery used is Landsat ETM+, SPOT-5 and WorldView-2.


**Table 1.** Timing experiment generating 60 clusters using k-means employing resampling. Experiments were carried out on a 2.93-GHz Intel Xeon processor with 8 GB of RAM running Mac OSX 10.7 using a SPOT 5 scene with 60 million pixels and 4 image bands.

Within the k-means clustering process, the Euclidean distance metric is used to calculate the distance between features and cluster centres within feature space. Therefore, due to the differences in dynamic range between the input wavelengths (i.e., near-infrared has a larger range than red), some channels could be weighted higher than others and need to be rescaled. A number of options exist for rescaling the data, but it was found that normalising each image band (wavelength) independently within a range of two standard deviations from the mean, thresholded by the image minimum and maximum, produced the best results, increasing the contrast between the land cover classes.
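
A minimal sketch of this per-band rescaling, assuming the ±2 standard deviation window is clipped to the band's own minimum and maximum before scaling to 0–1 (the function name is illustrative):

```python
import numpy as np

def stretch_band(band):
    """Normalise one band to mean +/- 2 standard deviations,
    thresholded by the band's minimum and maximum, scaled to 0-1."""
    band = band.astype(float)
    mean, std = band.mean(), band.std()
    lower = max(band.min(), mean - 2.0 * std)
    upper = min(band.max(), mean + 2.0 * std)
    if upper == lower:  # constant band: nothing to stretch
        return np.zeros_like(band)
    clipped = np.clip(band, lower, upper)
    return (clipped - lower) / (upper - lower)
```

Applying this to each wavelength independently equalises the dynamic ranges so that no single band (e.g., the NIR) dominates the Euclidean distances in feature space.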

#### *2.2. Clumping*

The k-means-classified pixels are then clumped to define a set of geographically uniquely-labelled regions. This result is over-segmented in many regions of the scene due to the per-pixel nature of the processing and spectral confusion between units. It is common that over half of the resulting clumps are only a single pixel in size (Figure 3).
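
Clumping amounts to connected-component labelling within each k-means class; a sketch using `scipy.ndimage.label` (with 4-connectivity as one plausible choice):

```python
import numpy as np
from scipy import ndimage

def clump(class_image):
    """Assign a unique label to every spatially connected region of
    pixels sharing the same k-means class (4-connectivity)."""
    clumps = np.zeros(class_image.shape, dtype=np.int64)
    next_label = 0
    for cls in np.unique(class_image):
        # Label connected regions of this class, then offset the labels
        # so they are unique across all classes.
        labelled, n = ndimage.label(class_image == cls)
        mask = labelled > 0
        clumps[mask] = labelled[mask] + next_label
        next_label += n
    return clumps, next_label
```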

**Figure 3.** The number of clumps of each size at each elimination step from 1–100 pixels (1 ha) on a 10 × 10 km SPOT5 image (see Figure 4).

#### *2.3. Local Elimination*

The next stage eliminates these small clumps up to a minimum size, which will correspond with the minimum mapping unit for subsequently derived mapping products. The elimination is an iterative process (Algorithm 1) where regions are eliminated based on size, starting with the smallest (i.e., 1 pixel). To eliminate a region, it is merged with its spectrally-closest (colour) neighbour, which must be larger than itself, but does not need to be larger than the minimum mapping unit. If there are no neighbouring regions larger than the region being eliminated, then it is left for the next iteration, thereby ensuring that regions are eliminated naturally into the most appropriate neighbour. A critical consideration is that the elimination itself (i.e., the merging of clumps) is not performed until the end of each iteration; otherwise, the elimination would result in region growing from the first clump reviewed for elimination, as the clump spectral mean would change. Upon merging, the clump statistics are updated to provide the new mean spectral values used within the next iteration.

**Algorithm 1** Pseudocode for the elimination process.

```
for size = 1 → minunitsize do
   List elimPairs
   clumps ← GetClumpsOfSizeLessThanOrEqual(size)
   for all clump ← clumps do
      neighbors ← GetNeighborsLargerThan(clump)
      if neighbors.size() > 0 then
         neighbor ← FindSpectrallyClosestNeighbor(neighbors, clump)
         elimPairs ← pair(neighbor, clump)
      end if
   end for
   for all pair ← elimPairs do
      pair.neighbor ← MergeRegions(pair.neighbor, pair.clump)
   end for
end for
```
One problem with the elimination process is that features that are highly spectrally distinct from their neighbours, such as water bodies, but with an area smaller than the minimum clump size will nevertheless be merged. This can result in undesirable spectral mixing and potentially reduces the likelihood of a correct land cover classification. To resolve this issue, an optional threshold was introduced that prevents clumps with a spectral distance (*d*) to their closest neighbour above the defined threshold from being merged, hence allowing regions below the minimum size to be retained. The spectral distance threshold was set at 0.1 reflectance: whenever a clump differs from its spectrally-closest larger neighbour by more than this, the two are not merged. We found the 0.1 threshold effective, but it may need to be reviewed depending on the types of features found in some scenes.
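
Algorithm 1, extended with the optional spectral distance threshold, can be sketched on a region-adjacency structure as follows. The data layout (dictionaries of clump sizes, mean spectra and neighbour sets) and the function name are illustrative, not the RSGISLib implementation; for brevity, merged ids are recorded but not chain-resolved.

```python
import numpy as np

def eliminate(sizes, means, adjacency, min_size, dist_thresh=None):
    """Iteratively merge clumps smaller than min_size into their
    spectrally-closest larger neighbour (sketch of Algorithm 1).

    sizes: {id: pixel count}; means: {id: mean-spectra array};
    adjacency: {id: set of neighbouring ids}. dist_thresh optionally
    retains spectrally distinct clumps below the minimum size.
    The inputs are modified in place and also returned.
    """
    merged = {c: c for c in sizes}
    for size in range(1, min_size):
        pairs = []  # merges are deferred to the end of each iteration
        for c in list(sizes):
            if sizes[c] > size:
                continue
            # Candidate neighbours must be larger than the clump itself.
            cands = [n for n in adjacency[c] if sizes[n] > sizes[c]]
            if not cands:
                continue  # leave for a later iteration
            dists = [float(np.linalg.norm(means[c] - means[n])) for n in cands]
            if dist_thresh is not None and min(dists) > dist_thresh:
                continue  # spectrally distinct: retain below minimum size
            pairs.append((cands[int(np.argmin(dists))], c))
        for keep, gone in pairs:
            if gone not in sizes or keep not in sizes:
                continue  # already consumed earlier in this pass
            total = sizes[keep] + sizes[gone]
            # Area-weighted update of the surviving clump's mean spectra.
            means[keep] = (means[keep] * sizes[keep] +
                           means[gone] * sizes[gone]) / total
            sizes[keep] = total
            adjacency[keep] = (adjacency[keep] | adjacency.pop(gone)) - {keep, gone}
            for n in adjacency[keep]:
                adjacency[n].discard(gone)
                adjacency[n].add(keep)
            del sizes[gone], means[gone]
            merged[gone] = keep
    return merged, sizes, means
```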

As shown in Figure 3, the elimination process reduces the number of clumps representing a 10 × 10 km SPOT5 scene from over 170,000 to less than 4000. Before elimination (Figure 4a), there were 1137 segments over the 100-pixel (1 ha) threshold of the object size. Therefore, 31% of segments in the final result (Figure 4b) were based on a clump directly originating from the k-means clustering with the rest amalgamated from the smaller clumps, which will not correspond directly with the spectral regions defined by the k-means clustering.

The first step, which removes single pixels, eliminates a significant number of clumps (from over 170,000 to around 80,000). For single pixels, a memory-efficient approximation of the elimination process can be adopted. Using a 3 × 3 window, all single-pixel clumps can be identified and stored as a binary image mask. The single pixels can then be iteratively eliminated by merging each pixel with the clump of the spectrally-closest pixel that is part of a larger clump (i.e., at least two pixels). If there is no neighbouring clump of at least two pixels, then the single pixel is considered in the next iteration. The spectrally-closest pixel may occasionally differ from the spectrally-closest clump, but this has not been found to be significant, and the approximation greatly reduces the memory requirements of the elimination process, thus allowing larger datasets to be rapidly processed.


**Figure 4.** (**a**,**b**) Elimination from 1–100 pixels on a 10 × 10 km SPOT5 image (10-m resolution). Segments are coloured with their mean spectra.

#### *2.4. Relabelling*

During elimination, the features below the minimum mapping unit are removed and merged with their spectrally-closest larger neighbour. This results in a gap within the attribute table for each eliminated unit. Therefore, a relabelling process to collapse the attribute table is undertaken to ensure the feature identification numbers are sequential. Sequential numbering minimises the number of required rows within the attribute table and makes later classification more efficient.
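
The relabelling step is a simple dense renumbering; a minimal sketch in NumPy (function name illustrative):

```python
import numpy as np

def relabel_sequential(clumps):
    """Collapse the gaps left by elimination so clump ids run 1..n."""
    ids = np.unique(clumps)                 # surviving ids, sorted
    return np.searchsorted(ids, clumps) + 1  # dense 1-based relabelling
```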

#### *2.5. Parameterisation*

Using k-means seeding, the algorithm has two main parameters: (1) the number of seeds (*k*) (i.e., initial clusters in feature space) and (2) the minimum segment size for the elimination. There are no hard and fast rules for identifying these parameters, as with all segmentation algorithms, it is difficult to define a metric that completely quantifies the quality of the segmentation [31]. Nevertheless, the selection process may be guided by observing the resulting number of segments, their size, and their internal and inter-neighbour spectral variation [38]. It is worth noting that smaller segments will provide lower spectral variation both within the segments and to their neighbours.

The number of clusters selected for the k-means to be produced as seeds is the key parameter governing the spectral separation of the features. With too few clusters, features that are close to one another within the feature space will be under-segmented, while too many clusters will result in an over-segmented scene. Through experimentation, it was found that between 30 and 90 seeds, depending on the level of detail required by the user, were generally sufficient for multi-spectral imagery, such as RapidEye, Sentinel-2, SPOT5, WorldView-2 and Landsat (TM, ETM+, OLI), regardless of the number of image bands or the use of multi-date image stacks. However, in some use cases, such as segmenting bright objects on a dark background (e.g., ships on the ocean), a significantly smaller number of seeds (e.g., 2 or 3) may be more suitable. For vegetation studies, a value of 60 seeds has been consistently used across a range of studies (e.g., [45,46]) using Sentinel-2, SPOT5 and Landsat data, at either a single date or multiple dates, and was found to produce good results.

The minimum segment size for the elimination can be related to the minimum mapping unit of the output product being derived using the resulting segments. Where a minimum mapping unit is not defined by the project, the user needs to assess the size of the smallest features they are interested in within the image. For example, if isolated trees within paddocks are of interest within high resolution imagery (e.g., WorldView2), then the minimum segment size needs to be set such that the isolated trees are above this threshold, otherwise they will be eliminated into the larger paddock segment.

#### **3. Results**

Assessment of the segmentations was aided through the establishment of a set of reference regions, which were manually drawn with reference to the original satellite imagery. We also observed key features and the overall appearance of the segmentation result compared with the input image. In this context, segmentation can be thought of as a means of image compression; therefore, when segments are coloured with their mean spectral values, the resulting image should retain the same structures as the original image (i.e., as shown in Figure 4). We segmented a range of optical imagery, and the resulting number of segments, their size and their spectral variation were assessed to ascertain whether the segmentation results were fit for the purpose of land cover and land cover change classification.

#### *3.1. Visual Assessment*

Segmentations were produced using SPOT-5 scenes for the whole of New Zealand, and Figure 5 shows an example. The NIR, SWIR and red image bands were used and segmented with *k* = 60 and a minimum segment size of 100 pixels.


**Figure 5.** Example segmentation on a SPOT 5 scene (**a**), using a colour composite of red = NIR, green = SWIR and blue = Red. Segmented with 60 seeds and eliminated to a minimum mapping unit of 1 ha (**b**). The spatial pattern of land cover is well represented with the number of clumps being 0.6% of the number of pixels.

Figure 5 demonstrates the application of the segmentation process on a region of New Zealand typical of the North Island, with forest, scrub, agricultural fields and grasslands in various conditions. As demonstrated, the algorithm has maintained the spectrally homogeneous forest and scrub blocks as larger continuous segments, while regions of higher spectral heterogeneity, such as the grasslands, have been segmented into smaller units. Overall, it is considered that this segmentation provides an ideal basis for a subsequent land cover classification. There is some moderate over-segmentation, particularly within the grasslands and some spectrally-heterogeneous forest, but this is often desirable, as it ensures that under-segmentation, which would prevent the correct classification of those segments, is not present.

Where small spectrally-distinct features need to be retained, the spectral distance threshold on the elimination, which prevents features too far away from one another in the feature space from being merged, can be used. As demonstrated in Figure 6c, where a threshold of 0.1 reflectance has been applied, the small ponds and some additional shelter belts have been retained that are below the minimum mapping unit of 2.5 ha. This produces a segmentation result that more closely corresponds to the original image (Figure 6a) compared to the segmentation without the application of the threshold (Figure 6b).

#### *3.2. Parameterisation*

To understand the effect of the number of seeds, a number of tests were undertaken for a range of sensors. The datasets used for these experiments (Table 2) were a pair of multi-spectral WorldView2 scenes for Wales, U.K., spectrally transformed to top-of-atmosphere reflectance, and a pair each of SPOT5 and Landsat (L4/L7) scenes covering regions of the North Island, New Zealand, all corrected to surface reflectance and standardised for Sun angle and topography [39]. For each of these sensors, experiments were undertaken for an image at a single date and the combined multi-date pair. For the single date tests, all the image bands were used, while for the multi-date tests, a subset of the image bands deemed optimal for visualisation was used. For the SPOT5 and Landsat data, these were the red, NIR and SWIR (SWIR1 for Landsat) image bands, while for the WorldView2 images, the NIR (Band 8), red edge (Band 6) and blue (Band 2) bands were used. The minimum segment size was defined as 100 pixels for the WorldView2 and SPOT5 data and 30 pixels for the Landsat 4/7 data.


**Figure 6.** (**a**) The original SPOT 5 image, (**b**) the segmentation applied without a spectral distance threshold (60 seeds and eliminated to the 2.5-ha minimum mapping unit) and (**c**) the segmentation applied with a spectral distance threshold of 0.1 reflectance. Images are displayed with a band combination of NIR, SWIR and red.


**Table 2.** Datasets used for parameter testing. \* indicates the image used for the single date tests.

Figure 7 shows the number of segments required to characterise 50% of the total area of the scenes. If a large number of segments is required to make up 50% of the area, the scene is likely to be over-segmented, while if only a small number is required, it is likely to be under-segmented. These plots therefore provide an understanding of the relationship between the number and size of the segments with respect to the number of *k* clusters, from which the recommendation that the suitable range of values for *k* is 30–90 can be made. A concurrent visual examination of the segments is also important.

For the SPOT scenes (Figure 7A,B), the number of segments increased rapidly before plateauing. This shape suggests there is limited benefit in increasing the size of *k* once the plateau has been reached. In the test datasets, it was observed that a typical range of *k* was from 30–90 clusters (shown by the red lines). The 50% cumulative area profile (Figure 7C) for the WorldView2 scene (Figure 8) is a straight line. This is due to the number of large features within the scene, specifically the raised bog (Borth Bog, Wales, U.K.) in the centre of the scene, the estuary and sea, which even at high values of *k*, generated large segments due to their spectral homogeneity. When higher values of the cumulative area were considered (Figure 7D), the normal profile was seen, and the range of *k* from 30–90 clusters when visually assessed was still found to be appropriate.

#### *3.3. Comparison to Other Algorithms*

To compare the performance of the algorithm detailed in this paper to those of others, four alternatives were identified. These were: the mean-shift algorithm [42] implemented within the Orfeo toolbox [47], as it is widely cited (e.g., [48–50]) as an approach that produces good results on a wide variety of Earth Observation (EO) data; the eCognition multi-resolution segmentation algorithm [28], as the algorithm most commonly used within the literature (e.g., [1,2,51]); and the Quickshift algorithm of Vedaldi and Soatto [52] and the algorithm of Felzenszwalb and Huttenlocher [53], implemented within the scikit-image library and interfaced within the RSGISLib library [54], as examples of more recent approaches from the computer vision community applicable to EO data. A SPOT-5 scene (a subset of which is shown in Figure 10A), which represents a range of land covers and uses, was used for the experiment. To identify the parameters for each of the segmentation algorithms, a grid search was performed across a range of parameters (Table 3).



**Figure 8.** (**A**,**B**) Example segmentation on a WorldView2 scene, with a value of *k* = 60 and a minimum mapping unit of 0.4 ha, using a colour composite of red = NIR, green = red edge and blue = blue. The spatial pattern of land cover is well represented with the number of clumps being 0.2% of the number of pixels.

To compare the parameterisations and segmentation quality, two quantitative approaches were adopted. The first used a set of 200 reference segments (Figure 9), manually drawn on the SPOT-5 image, to calculate the precision and recall metrics of Zhang et al. [36]. The precision and recall metrics were combined to produce the *f* metric [36] as follows using an *α* of 0.5:

$$f = \frac{1}{\alpha\frac{1}{\text{precision}} + (1 - \alpha)\frac{1}{\text{recall}}}\tag{2}$$

The second approach was to use the metrics of Johnson and Xie [38] for estimating the over- and under-segmentation within the result. For this study, the metrics of Johnson and Xie [38] were calculated with all four spectral bands of SPOT-5 to create the overall global score (*gs*). To combine *f* and *gs* to generate a single ranked list, *gs* was normalised to a range of 0–1 (*gs*\_*norm*) and combined with *f* using a weight *ω*:

$$\text{gs\_f} = \omega \cdot \text{gs\_norm} + (1 - \omega) \cdot f \tag{3}$$

For this analysis, *ω* was set at 0.25, giving more weight to the reference-segment metric *f*. The top parameter sets and metrics for each segmentation algorithm are given in Table 4.
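
The two metrics and their combination (Equations (2) and (3)) are straightforward to compute; a sketch with the paper's values *α* = 0.5 and *ω* = 0.25 as defaults (function names are ours):

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (Equation 2)."""
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def combined_score(gs_norm, f, omega=0.25):
    """Combine the normalised global score with f (Equation 3);
    omega = 0.25 weights the reference-segment metric f more heavily."""
    return omega * gs_norm + (1.0 - omega) * f
```

With *α* = 0.5, `f_measure` reduces to the plain harmonic mean of precision and recall.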

**Figure 9.** (**A**–**D**) Examples of the reference segments that were manually drawn.


**Table 4.** Comparison of the segmentation results.

The rank refers to the position in the ordered list of all performed segmentations. Therefore, using the *gs*\_*f* metric, there were 85 segmentations using different parameterisations of the Shepherd et al. algorithm (from this article) that were ranked higher than the highest-ranked Quickshift segmentation. For the mean-shift, there were 252 Quickshift and Shepherd et al. segmentations ranked higher than the highest-ranked mean-shift segmentation. Of those Shepherd et al. segmentations, the parameters, particularly *k*, were clustered, with a *k* value of 60 being identified for the top 38 segmentations, corresponding with the analysis shown in Figure 7.

Figure 10 illustrates the segmentation results from the comparison using the parameters outlined in Table 4. The results were similar; however, two were visually more appropriate, these being the Shepherd et al. (Figure 10B) and mean-shift (Figure 10D) results, as they both captured the majority of the homogeneous regions as continuous segments while sub-dividing the areas of spectral variation. Both the Shepherd et al. (Figure 10B) and mean-shift (Figure 10D) algorithms demonstrated some artefacts associated with smooth gradient transitions, as did the other methods tested. In regions of transition, the Quickshift algorithm (Figure 10C) produced a large number of very small segments, which are not desirable within a segmentation approach; however, the homogeneous regions were well delineated, with good correspondence with the reference regions. The eCognition multi-resolution segmentation (Figure 10E) aims to produce segments of similar size and, therefore, showed some over-segmentation of the homogeneous features that were captured as single segments in the Shepherd et al. (Figure 10B), Quickshift (Figure 10C) and mean-shift (Figure 10D) segmentations.

#### **4. Discussion**

It is difficult to compare segmentation algorithms because comparison involves multiple criteria and depends on the specific application and scale [51]. However, we have demonstrated that the algorithm proposed in this paper compares favourably in most assessment metrics with the commonly-used mean-shift [42] and multi-resolution [28] segmentation approaches when applied to a typical rural and forest landscape in New Zealand. In addition, our algorithm has attributes that make it more useful in certain applications. It is readily scalable to large areas, such as nations or regions (e.g., [55]), which is desirable for preventing hard boundaries on tiles. It uses a small number of parameters, which may be consistently used across a large range of geographic areas and data types. It permits the direct setting of a minimum mapping unit, and it is freely available in open-source software.

The technique presented in this paper is being used operationally for national land-cover and land-cover change mapping of New Zealand using a single parametrisation (60 seeds and 100-pixel minimum mapping unit) on SPOT-5 and Sentinel-2 data. This is because the iterative elimination algorithm is highly scalable, being operationally applied to each of the regions of New Zealand, the largest of which is Canterbury. The Canterbury regional mosaic comprised 36 SPOT5 scenes and produced a single image with 36,533 × 35,648 pixels. Computationally, the segmentation of the Canterbury region required 12 GB of RAM and ran in approximately three hours on a 3-GHz x86 processor running Linux, producing 1,222,885 segments. This level of scalability is advantageous for operational use, as image tiling is not required, thus avoiding complex and time-consuming operations to merge tile boundaries, and the results are consistent across the scene. However, for larger areas, we have also implemented a tiled version of the algorithm [56], which has been used to produce a single segmentation for the Australian continent [55] using a mosaic (141,482 × 130,103 pixels) of Landsat and ALOS PALSAR. This resulted in 33.7 million segments.

In hilly and mountainous terrain, topographic shadowing can significantly affect the result of the image segmentation, as the shadowing alters the spectral response of the affected pixels. We therefore recommend that imagery be standardised to a consistent Sun and view angle before segmentation [39]. For example, a SPOT5 scene of 7753 × 7703 pixels, segmented with 60 clusters and eliminated to 100 pixels, generated 7715 fewer segments when standardised imagery was used; more significantly, the distribution of those features then corresponded only with patterns in land cover, as opposed to land cover and topographic shadowing.

We have also successfully applied our segmentation algorithm to multi-date imagery. In this case, the segmentation is required to capture the information present in both scenes and the change between the scenes for subsequent classification. The change segments can be easily differentiated from the rest by the change in spectral reflectance between early and later dates. Automatic identification of these change segments has proven useful in developing a semi-automated method for updating land cover.

Segments are now being successfully provided across large regions using this algorithm. Therefore, new methods of assessing the quality of segmentation results are required, so that optimal parameters can be estimated. This will become increasingly important as datasets for segmentation become larger. Additionally, when segmenting large regions, the number of segments being generated is large, and their classification remains challenging, primarily due to the significant memory requirements for attribution. Existing software and methods are commonly unable to cope with this number of segments. Therefore, a new and advanced attribute table file and image format (using HDF5, called KEA [57]) and software API have been implemented within the RSGISLib [54] (http://www.rsgislib.org) to support rule-based object-oriented classification of the segments. For full details, see Clewley et al. [56].

#### **5. Conclusions**

This paper has outlined a new technique for the segmentation of remotely-sensed imagery, which is highly scalable through an innovative iterative elimination process. It is being used operationally at national scales across New Zealand. The technique has clear and simple parameters and can be consistently applied across a large range of geographic areas and data types. Segmentation quality is similar to that achieved by other commonly-used algorithms such as mean-shift implemented within the Orfeo toolbox and the multi-resolution segmentation algorithm widely used within the eCognition software. The algorithm provides good performance in both memory usage and computation time. The algorithm is recommended for segmenting all modalities of remotely-sensed imagery, although pre-processing of SAR data for speckle reduction is required. Additionally, a free and open implementation of the algorithm has also been provided as part of the remote sensing and GIS software library (RSGISLib; [54]) and can be downloaded from http://www.rsgislib.org.

**Author Contributions:** J.D.S. was responsible for algorithm development. P.B. was responsible for algorithm application. J.R.D. contributed to algorithm application. All authors were involved in reviewing and approving the final version of the manuscript.

**Funding:** The Ministry of Business Innovation & Employment (New Zealand) funded this research under Contract C09X1709.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Direct, ECOC, ND and END Frameworks—Which One Is the Best? An Empirical Study of Sentinel-2A MSIL1C Image Classification for Arid-Land Vegetation Mapping in the Ili River Delta, Kazakhstan**

**Alim Samat 1,2,3,\*, Naoto Yokoya 4, Peijun Du 5, Sicong Liu 6, Long Ma 1,2,3, Yongxiao Ge 1,2, Gulnura Issanova 7,8, Abdula Saparov 9, Jilili Abuduwaili 1,2,3 and Cong Lin <sup>5</sup>**


Received: 30 July 2019; Accepted: 16 August 2019; Published: 20 August 2019

**Abstract:** To facilitate the advances in Sentinel-2A products for land cover from Moderate Resolution Imaging Spectroradiometer (MODIS) and Landsat imagery, Sentinel-2A MultiSpectral Instrument Level-1C (MSIL1C) images are investigated for large-scale vegetation mapping in an arid land environment that is located in the Ili River delta, Kazakhstan. For accurate classification purposes, multi-resolution segmentation (MRS) based extended object-guided morphological profiles (EOMPs) are proposed and then compared with conventional morphological profiles (MPs), MPs with partial reconstruction (MPPR), object-guided MPs (OMPs), OMPs with mean values (OMPsM), and object-oriented (OO)-based image classification techniques. Popular classifiers, such as C4.5, an extremely randomized decision tree (ERDT), random forest (RaF), rotation forest (RoF), classification via random forest regression (CVRFR), ExtraTrees, and radial basis function (RBF) kernel-based support vector machines (SVMs) are adopted to answer the question of whether nested dichotomies (ND) and ensembles of ND (END) are truly superior to direct and error-correcting output code (ECOC) multiclass classification frameworks. 
Finally, based on the results, the following conclusions are drawn: 1) the superior performance of OO-based techniques over MPs, MPPR, OMPs, and OMPsM is clear for Sentinel-2A MSIL1C image classification, while the best results are achieved by the proposed EOMPs; 2) the superior performance of ND, ND with class balancing (NDCB), ND with data balancing (NDDB), ND with random-pair selection (NDRPS), and ND with further centroid (NDFC) over the direct and ECOC frameworks is not confirmed, especially when weak classifiers are used on low-dimensional datasets; 3) in terms of computational efficiency, accuracy, robustness to data dimensionality, and ease of implementation, END, ENDCB, ENDDB, and ENDRPS are viable alternatives to the direct and ECOC frameworks; 4) surprisingly, because, according to ensemble learning (EL) theory, "weaker" classifiers (here, ERDT) always have a better chance of reaching the trade-off between diversity and accuracy than "stronger" classifiers (here, RaF, ExtraTrees, and SVM), END with ERDT (END-ERDT) achieves the best performance, with less than a 0.5% difference in overall accuracy (OA) values, while being 100 to 10,000 times faster than END with RaF or ExtraTrees and ECOC with SVM across datasets of various dimensions; and, 5) Sentinel-2A MSIL1C is a better choice than the land cover products derived from MODIS and Landsat imagery for vegetation species mapping in an arid land environment, where the vegetation species are critically important but sparsely distributed.

**Keywords:** ND; END; ECOC; MRS; Extended object-guided morphological profiles; Multiclass classification; Arid-land vegetation mapping; Sentinel-2A MSIL1C; Central Asia

#### **1. Introduction**

Arid and semiarid lands encompass approximately 30–40% of the Earth's surface, and Central Asia contains one of the world's largest arid and semiarid areas. These areas have harsh climatic conditions and are under high pressure to produce food and fibers for their rapidly increasing populations, under a wide range of land utilization and management regimes, which results in a reduction in arid ecosystem quality. Understanding the effects and responses between landscapes and regional environments is fundamental to maintaining their ecological and productive value in such circumstances. Hence, the effects and responses of landscape heterogeneity on the local and regional atmosphere, the surface energy balance, carbon exchange, and climate change are major topics that have attracted widespread interest [1–5]. Among these responses, the vegetation species, distribution, diversity, and biomass in these lands typically undergo wide seasonal and interannual fluctuations, which are largely regulated by water availability and affected by both climatic shifts and human activities [6–8]. Thus, monitoring the vegetation status of these lands is an essential part of identifying problems, developing solutions, and assessing the effects of actions.

However, large-scale and long-term field sampling of vegetation information is challenging when considering the sampling effort and cost. Moreover, the samples are often very sparsely distributed, and site revisits remain infrequent, while the success of any monitoring of vegetation dynamics depends on the availability of up-to-date and spatially detailed species richness and distribution data at a regional scale [9–11]. Fortunately, satellite-based remote sensing (RS) data can address these challenges by identifying and detailing the biophysical characteristics of vegetation species' habitats, predicting the distribution and spatial variability of species richness, and detecting natural and human-caused changes at scales that range from individual landscapes to the whole world [1,9,12–18]. Therefore, an increasing number of geologists, ecologists, and biologists are turning to rapidly developing RS data sources for vegetation-based ecological and environmental research at local, regional, and global scales [19–24].

Regarding applications of RS data in vegetation studies, high- and moderate-resolution optical RS sensors, including IKONOS, Satellite for Observation of Earth (SPOT), Thematic Mapper (TM), Enhanced Thematic Mapper (ETM), ETM+, Operational Land Imager (OLI), Sentinel-2, and Moderate Resolution Imaging Spectroradiometer (MODIS), are widely accepted and considered adequate for extracting vegetation species, diversity, distribution, and biophysical information in different settings [25–32]. Creating land cover maps that detail the biophysical cover of the Earth's surface is among the oldest and still ongoing applications. Land cover maps have continuously proven especially valuable for predicting the distribution of both individual plant species and species assemblages across broad areas that could not otherwise be surveyed in more quantitative ways, relative to vegetation index (VI)-based approaches [9,33–35]. In particular, various land cover products derived from RS data at regional and global scales have been produced and are freely available at spatial resolutions from 30 m to 1 km. Solid proof can be found in the extensive applications of representative products, including the 1 km University of Maryland (UMD) land cover layer [36], the Global Land Cover 2000 (GLC2000) products [37], the MODIS products [38], the 500 m MODIS [39], the 300 m GlobCover [40], and the 30 m global land cover maps [41], for vegetation studies at regional and global scales [42–49]. However, most of the existing land cover products are coarse, not only in spatial resolution and land cover type detail, but also in update frequency. For example, the 30 m global land cover maps are only available for 2010, 2015, and 2017, with a maximum of 10 land cover types (only eight types for our study area), while the MODIS products are only available for 2000, 2005, 2010, and 2015, with a maximum 300 m resolution (only 15 types for our study area). Furthermore, the differences between these products are very large, as shown in Figures 1b and 1c.

**Figure 1.** (**a**) Geographic location of the study area; (**b**) 2015 Moderate Resolution Imaging Spectroradiometer land use and cover change (MODIS LUCC) map; (**c**) 2017 GLC30 map; (**d**) Sentinel-2 RGB image of the study area with in situ points (red dots) and the corresponding land cover types; (**e**) enlargement of the blue rectangle; and (**f**) enlargement of the green rectangle.

The Sentinel-2 mission comprises a constellation of two polar-orbiting satellites placed in the same orbit, with a five-day revisit time over land and coastal areas. Each satellite carries a MultiSpectral Instrument (MSI) with 13 bands spanning from the visible and near-infrared (VNIR) portion of the electromagnetic spectrum to the short wave infrared (SWIR) portion, featuring four bands at a 10 m spatial resolution, six bands at a 20 m spatial resolution, and three bands at a 60 m spatial resolution, with a 290 km field of view [50]. Since Sentinel-2A and Sentinel-2B were successfully launched on June 23, 2015, and March 7, 2017, respectively, their products (Level-2A, the bottom-of-atmosphere reflectance in cartographic geometry; Level-1B, the top-of-atmosphere (TOA) radiance in sensor geometry; and Level-1C, the TOA reflectance in fixed cartographic geometry) have been widely applied to monitoring land cover change, agricultural applications, vegetation monitoring and biophysical parameter retrieval, coastal zone observation, inland water monitoring, glacier, ice, and snow mapping, and flood mapping and management [51–56]. However, these products have not been used for detailed vegetation mapping in arid land environments. Hence, the first objective of this paper is to investigate the performance of the Sentinel-2A MSIL1C product for vegetation mapping in an arid land environment.

Producing any substantial land cover/use map always requires a specific classification method or ML algorithm. Although many methods and algorithms have been developed for satellite data classification, the search for advanced classification methods or algorithms is still a hot field [57–60]. No single classification method or algorithm works universally well, because classification performance is controlled not only by the robustness of the adopted method or algorithm, but also by the discrimination and identification quality, size, and distribution of the provided data [61,62]. According to the RS data classification literature, the most commonly used ML algorithms are artificial neural networks (ANNs) [63], support vector machines (SVMs) [64,65], extreme learning machines (ELMs) [66], decision trees (DTs) [67], ensemble methods such as bagging, AdaBoost, and RaF [57,68,69], and deep neural networks (DNNs) [70,71]. In most scenarios, these algorithms must handle a nominal class variable with more than two values, because the real-world land surface recorded by EO sensors usually requires the simultaneous discrimination of numerous classes. In general, there are two approaches for addressing this type of problem: 1) adapting the binary algorithm to a multiclass counterpart that deals with multiclass problems directly (e.g., DTs, ANNs, ELMs, RaFs); and, 2) reducing the multiclass problem into several binary subproblems and then forming a multiclass prediction from the results of the binary classifiers, as in AdaBoost, multiclass SVMs, and ND [72–74]. Compared with the direct approach, the latter is appealing because it does not require any changes to the underlying binary algorithm [75].
In particular, a structural risk minimization (SRM) rule-based SVM can successfully work with limited quantities and quality of training samples and it often achieves a higher classification accuracy than linear discriminate analysis (LDA), DTs, ANNs, bagging, AdaBoost, and RaFs [64–66,76–78].

Well-known examples of the second approach are ECOC and pairwise classification, which often result in significant increases in accuracy [72,75,79]. However, many studies have explicitly shown that ECOC works better than pairwise classification, mainly due to its more advanced decoding strategies [80–82]. The ECOC framework consists of two steps: coding and decoding. Popular coding strategies include one-versus-all, one-versus-one, random sparse, binary complete, ternary complete, and original and dense random coding, while the most frequently applied decoding designs are Hamming decoding, inverse Hamming decoding, Euclidean decoding, attenuated Euclidean decoding, loss-based decoding, probabilistic decoding, β-density-based distribution decoding, and loss-weighted decoding [72,82]. While the one-versus-one and one-versus-all strategies have been widely adopted in RS data classification, only a few works [83,84] have focused on applications of ND and its ensemble variants, which have been proven to outperform the direct multiclass, ECOC, and pairwise classification methods when using C4.5 and logistic regression as the base learners [75,85,86]. Additionally, the most recent and more advanced direct multiclass classification methods may also see improved accuracy when combined with ND and END. Thus, the second objective of this paper is to investigate the performance of the popular ND algorithms, including NDCB, NDDB, NDRPS, and NDFC, and their ensemble versions (i.e., ENDCB, ENDDB, ENDRPS, and ENDFC, respectively), by setting C4.5 and bagging [87], AdaBoost [88], an RaF [89], a RoF [90], ExtraTrees [91], and an SVM [92] as the base learners.
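As a minimal illustration of the two-step coding/decoding pipeline described above, the sketch below implements dense random coding with Hamming decoding; the nearest-centroid binary learner and the toy clusters are hypothetical stand-ins, not the base learners or data evaluated in this paper.

```python
import numpy as np

def ecoc_fit_predict(X_tr, y_tr, X_te, n_bits=8, seed=1):
    """Toy ECOC: dense random coding + Hamming decoding, with a
    nearest-centroid binary learner as an illustrative stand-in."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_tr)
    # Coding step: one random binary codeword per class.
    code = rng.integers(0, 2, size=(len(classes), n_bits))
    for b in range(n_bits):  # avoid degenerate all-0/all-1 columns
        while code[:, b].min() == code[:, b].max():
            code[:, b] = rng.integers(0, 2, size=len(classes))
    bits = np.zeros((len(X_te), n_bits))
    for b in range(n_bits):
        # One binary subproblem per bit: "1" superclass vs. "0" superclass.
        pos = np.isin(y_tr, classes[code[:, b] == 1])
        c1, c0 = X_tr[pos].mean(axis=0), X_tr[~pos].mean(axis=0)
        d1 = np.linalg.norm(X_te - c1, axis=1)
        d0 = np.linalg.norm(X_te - c0, axis=1)
        bits[:, b] = (d1 < d0).astype(float)
    # Decoding step: the class whose codeword is nearest in Hamming distance.
    ham = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return classes[np.argmin(ham, axis=1)]

# Three toy Gaussian clusters standing in for three land cover classes.
gen = np.random.default_rng(0)
X = np.vstack([gen.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)
print(ecoc_fit_predict(X, y, np.array([[0.1, 0.1], [3.9, 0.2]])))  # [0 1]
```

Swapping the centroid learner for a stronger binary classifier, or the random code for a one-versus-all or one-versus-one code matrix, recovers the other coding strategies listed above.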

The discrimination and identification quality of the provided data is another critical factor that controls the classification performance of the adopted classifier. Over the years, many approaches have been proposed to increase the discrimination and identification ability of the provided data. Among these, structural filtering, MPs, random fields, object-based image analysis (OBIA) and geographic OBIA (GEOBIA), sparse representation (SR), and deep learning (DL)-based contextual information extraction are the most commonly pursued families of methods [77,93–96]. In the last ten years, mathematical morphology (MM)-based operators, such as MPs, EMPs, APs, and MPs with partial reconstruction (MPPR), have been the most widely accepted approaches in the RS image classification community, mainly due to their advanced and proven performance in contextual information extraction from HR/VHR RS imagery [68,77,93,96,97].
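The core MP idea can be sketched in a few lines: stack the original band with openings and closings of increasing structuring-element size. This is a simplified version using plain greyscale morphology with square SEs; the by-reconstruction and attribute-filter variants discussed here are more involved.

```python
import numpy as np
from scipy import ndimage

def morphological_profile(band, sizes=(3, 5, 7)):
    """Stack the original band with greyscale openings and closings of
    increasing (square) SE size; a simplified stand-in for full MPs."""
    feats = [band]
    for s in sizes:
        feats.append(ndimage.grey_opening(band, size=s))  # removes bright detail < s
        feats.append(ndimage.grey_closing(band, size=s))  # removes dark detail < s
    return np.stack(feats)  # shape: (1 + 2 * len(sizes), H, W)

band = np.random.default_rng(0).random((64, 64)).astype(np.float32)
mp = morphological_profile(band)
print(mp.shape)  # one original band plus two filtered bands per SE size
```

The output shape makes the dimensionality problem noted below concrete: each additional SE size adds two feature bands per input band, which is exactly why the SE sequences quickly become redundant and high-dimensional.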

However, the structuring element (SE) sequences or attribute filters (AFs) necessarily adopted in the above operators always result in computationally inefficient and redundant high-dimensional features, which may become prohibitive in large data processing cases. Additionally, sequences of SEs and AFs with limited sizes and shapes cannot match all of the sizes and shapes of the objects in an image; in particular, a single SE is not suitable for an entire image in each operation [97,98]. To overcome such challenges, MSER-MPs, SPMPs, and multi-resolution segmentation (MRS)-based OMPs have been proposed in our previous works for the spectral-spatial classification of VHR multi/hyperspectral images with the ExtraTrees, ForestPA, and ensemble extremely randomized decision tree (EERDT) classifiers [98,99]. MSER\_MPs(M), SPMPs(M), and OMPs(M) were also proposed, taking the mean pixel values within regions (MSER objects, superpixels, and MRS objects, respectively) to foster effective and efficient spatial FE. The improvements from taking the mean values were clear. Specifically, the size of the generated regions was on a reasonable scale, mainly controlled by the spatial resolution and a readily available landscape image [99]. However, as hybrid methods of MPs and OBIA, comparisons between SPMPs and OBIA, and between OMPs and OBIA, were not carried out in our previous works. In addition, as is well known in the OBIA and GEOBIA communities, there are plenty of spectral, statistical, spatial, and geometrical measures of regions (i.e., objects) that can be adopted to further improve classification accuracy [27,100–102]. Thus, extending the OMPs by considering more advanced object measures is of interest, especially when using Sentinel-2A MSIL1C data for vegetation mapping over large areas in arid land environments, which is the last objective of this paper.
In Table 1, we provide the acronym with corresponding full names that are used in this paper.


**Table 1.** Acronyms with corresponding full names used in this paper.



#### **2. Materials and Methods**

#### *2.1. Materials*

#### 2.1.1. Study Region

Our study area is located in the Ili River delta, in the central-western part of the Balkhash Lake basin, in southeastern Kazakhstan (Figure 1a). The Balkhash Lake basin is one of the largest internal drainage areas of the arid and semiarid region of Central Asia; it lies between 72.44°–84.99° E and 42.24°–49.14° N, covers an area of approximately 500,000 km<sup>2</sup>, and is shared by the Republic of Kazakhstan (approximately 60%) and the People's Republic of China (approximately 40%) [103]. Balkhash Lake is the world's fifth-largest inland water reservoir (605 km long and 4–74 km wide), with a volume of 87.7 km<sup>3</sup> and a catchment area of 15,730 km<sup>2</sup> [104]. All of the inflow to Balkhash Lake is received from the western Tien-Shan and the Dzungarsky Alatau and the runoff from their ridges. The two largest rivers flowing into the lake are the Ili River and the Karatal River, accounting for approximately 78% and 15% of the total inflow, respectively [105]. Balkhash Lake and the plentiful wetlands in its inflow deltas are considered very sensitive ecosystems, whose existence depends on variable climate conditions and extensive human activities, especially water abstraction from the inflows. During the Soviet era, the inflow waters were largely used for irrigation (mainly for rice crops), industry, the water supply of populated areas, and the fishing industry, which resulted in a significant decrease in the water level and the degradation of the surrounding environments [103,105]. After the collapse of the Soviet Union, most of the social and economic activities in the Balkhash Lake basin rapidly diminished, which caused drastic changes in land cover/use and a broad rehabilitation of the ecosystem.
Understanding the effects and responses between such drastic changes and the regional environment is crucial for the sustainable development of this basin, which can only be accomplished by the sustained monitoring of entire environments. Many efforts have been made in recent decades; however, most studies have focused on water, e.g., water resource management, water level and surface changes, chemical properties, regional-scale land cover/use changes, ecosystem services, and vegetation activity [106–111], and only a few have addressed basin-level studies using RS datasets [104]. In almost all of the above studies, datasets from Landsat, the Advanced Very-High-Resolution Radiometer (AVHRR), and MODIS were mainly used. Hence, it is of interest to apply the more advanced Sentinel-2A MSIL1C products, together with more advanced spatial FE and ML techniques, to vegetation mapping in this area.

#### 2.1.2. Datasets

• Sentinel-2 data collection and preprocessing

In this study, Sentinel-2A geolocated TOA reflectance (L1C) products were acquired from the Copernicus Open Access Hub (https://scihub.copernicus.eu). We selected a total of six images with zero or near-zero (<10%) cloud coverage, taken between 25 July and 8 August 2017. Only the blue (band 2), green (band 3), red (band 4), and near-infrared (band 8) bands, with a 10 m spatial resolution, were used. All of the images of the study region were projected to WGS 84/UTM zone 43N and then mosaicked using SNAP (v6.0), a free, open-source software program distributed by the ESA under the GNU General Public License and funded through the ESA's Scientific Exploration of Operational Missions (SEOM) Program. In Figure 1d–f, true-color RGB images were composited by setting band 4 to red, band 3 to green, and band 2 to blue for real-world land surface illustration purposes.
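The mosaicking step can be illustrated in miniature. The function below only sketches first-valid compositing of tiles that already share a common grid (as after reprojection to WGS 84/UTM zone 43N); it is not the SNAP implementation, and the toy arrays are placeholders for real granules.

```python
import numpy as np

def first_valid_mosaic(tiles, nodata=0):
    """Composite co-registered tiles: keep the first non-nodata value per pixel.
    Tiles are assumed to already share a common grid after reprojection."""
    out = np.full_like(tiles[0], nodata)
    for tile in tiles:
        fill = (out == nodata) & (tile != nodata)  # pixels still empty, tile has data
        out[fill] = tile[fill]
    return out

# Two toy "granules" with 0 marking nodata.
a = np.array([[1, 2, 0],
              [0, 0, 0]])
b = np.array([[9, 9, 3],
              [0, 4, 5]])
print(first_valid_mosaic([a, b]))
# [[1 2 3]
#  [0 4 5]]
```

Note that the overlap pixels keep the values of the first granule (1 and 2, not 9 and 9), so the ordering of scenes, e.g., by acquisition date or cloud score, determines the mosaic content.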

• In situ data collection

In total, 120 valid in situ sites (red dots in Figure 1d) were visited on July 27–30, 2017. Specifically, 46 field sites were visited on July 27 in the Bakanas irrigation area (Figure 1e), 23 on July 28 in the Ili River delta region, six on July 29 on the way back from Balkhash city to Bakanas District, and 45 on July 30 in the Bakbakty irrigation area (Figure 1f). For all of the field sites, the coordinates were determined using a differential global positioning system (GPS) and the Chinese BeiDou navigation system, which has a 2 m positioning accuracy. Additionally, the land cover type, among 23 possibilities including 19 vegetation types, was recorded (Figure 1) between 10 AM and 6 PM local time. Moreover, field sites were restricted to locations with large and uniform spatial coverage of a single land cover type for a more objective and representative in situ site selection. Figure 2 shows ground photos of representative land cover types in our study area. Based on the collected in situ information and further reference to the high-resolution optical images in Google Earth, 582 regions of interest (ROIs) were selected for model training, validation, and data classification for vegetation mapping. Detailed ROI, training sample, and validation sample information is listed in Table 2.

**Figure 2.** Ground photos of the land cover types.


**Table 2.** Detailed ROI, training sample, and validation sample information.

#### *2.2. Methods*


#### • Nested dichotomies (ND)

In statistics, ND are a standard technique for solving certain multiclass classification problems with logistic regression (LR). Generally, an ND can be represented by a binary tree structure in which the set of classes is recursively split into two subsets until each subset contains only one class (Figure 3). In other words, the root node of the ND contains all of the classes of the multiclass problem, and each leaf node contains a single class, which means that, for an *n*-class problem, there are *n* leaf nodes and *n*−1 internal nodes. To build an ND based on such a tree structure, we perform the following steps: 1) at each internal node, store the instances pertaining to the classes associated with the current node, but no other instances; 2) group the classes pertaining to each node into two subsets, such that each subset holds the classes associated with exactly one of the node's two successor nodes; and, 3) train a binary classifier at each internal node on the resulting binary class problem [75,85]. If the adopted binary classifier at each node can compute class probabilities, the ND can compute class probabilities in a natural way, which is a convenient feature in real-world applications [112].

**Figure 3.** Overall technical flowchart for the methodology.

After an ND is built, one critical question is how to combine the probability estimates from the individual binary problems to obtain class probability estimates for the original multiclass problem. Multiclass probability estimates can be obtained by simply multiplying the probability estimates returned by the individual binary learners, because the individual dichotomies are statistically independent given that they are nested. More specifically, let *C*<sub>*i*1</sub> and *C*<sub>*i*2</sub> be the two subsets of classes generated by the split of the set of classes *C<sub>i</sub>* at internal node *i* of the tree structure, and let *p*(*c* ∈ *C*<sub>*i*1</sub>|*x*, *c* ∈ *C<sub>i</sub>*) and *p*(*c* ∈ *C*<sub>*i*2</sub>|*x*, *c* ∈ *C<sub>i</sub>*) be the conditional probability distributions estimated by the binary learner at node *i* for a given instance *x*. Then, the estimated class probability distribution for the original multiclass problem is given by [85]:

$$p(c = C|x) = \prod_{i=1}^{n-1} \Big( I(c \in C_{i1})\, p(c \in C_{i1}|x, c \in C_i) + I(c \in C_{i2})\, p(c \in C_{i2}|x, c \in C_i) \Big) \tag{1}$$

where *I*(·) is the indicator function and the product is taken over all of the internal nodes. Notably, not all of the nodes must be examined to compute the probability for a particular class value; it is sufficient to evaluate the path from the root to the leaf associated with that class.
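The three construction steps and the product rule of Equation (1) can be sketched as follows. The centroid-based binary learner, the random class split, and the toy clusters are illustrative stand-ins (hypothetical helpers), not the classifiers or data used in this study.

```python
import random
import numpy as np

class CentroidBinary:
    """Toy binary learner: nearest-centroid with a softmax-style probability;
    a stand-in for any probabilistic binary classifier."""
    def fit(self, X, side):
        self.c0 = X[side == 0].mean(axis=0)
        self.c1 = X[side == 1].mean(axis=0)
        return self

    def p_side1(self, x):
        d0 = np.linalg.norm(x - self.c0)
        d1 = np.linalg.norm(x - self.c1)
        return float(np.exp(-d1) / (np.exp(-d0) + np.exp(-d1)))

class NDNode:
    """One node of a nested dichotomy, built by the three steps above."""
    def __init__(self, classes, X, y, rng):
        classes = list(classes)
        if len(classes) == 1:          # leaf: a single class remains
            self.clf = None
            return
        rng.shuffle(classes)           # random split criterion (simplest choice)
        half = len(classes) // 2
        self.left, self.right = classes[:half], classes[half:]
        mask = np.isin(y, classes)                        # step 1: node's classes only
        side = np.isin(y[mask], self.left).astype(int)    # step 2: two subsets
        self.clf = CentroidBinary().fit(X[mask], side)    # step 3: binary learner
        self.lchild = NDNode(self.left, X, y, rng)
        self.rchild = NDNode(self.right, X, y, rng)

    def proba(self, x, c):
        """Product rule of Equation (1) along the path to class c."""
        if self.clf is None:
            return 1.0
        p1 = self.clf.p_side1(x)
        if c in self.left:
            return p1 * self.lchild.proba(x, c)
        return (1.0 - p1) * self.rchild.proba(x, c)

# Three well-separated 2-D clusters as a stand-in dataset.
gen = np.random.default_rng(0)
X = np.vstack([gen.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)
root = NDNode([0, 1, 2], X, y, random.Random(0))
probs = [root.proba(np.array([4.0, 0.0]), c) for c in (0, 1, 2)]
print(int(np.argmax(probs)))  # the query point sits at cluster 1's centre
```

Note that only the path from the root to the queried class's leaf contributes factors to each probability, and the three class probabilities sum to one by construction.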

Ever since the basic form of the class subset split criterion was originally proposed by Frank and Kramer [85], many other sophisticated criteria, such as random selection, random-pair selection, clustering, multisubset evaluation, class-balanced-based optimization, data-balanced-based optimization, and genetic algorithm-based optimization, have been proposed and proven to have superior performance on the classification accuracy and model training efficiency, especially with END, which use common EL algorithms such as bagging, boosting, and RaFs [74,75,85,86,112–114].

According to the formulation of Frank and Kramer [85], there are *T*(*c*) = (3<sup>*c*</sup> − (2<sup>*c*+1</sup> − 1))/2 possible dichotomies for a *c*-class problem, which is very large and not ideal for efficient model training, especially when large amounts of data are readily available and advanced but computationally inefficient learners (e.g., ANNs and SVMs) are adopted in an ensemble scenario. One simple solution is to use randomly selected dichotomies instead of the complete set, which reduces the number of possible dichotomies to *T*(*c*) = (2*c* − 3) × *T*(*c* − 1), where *T*(1) = 1. Briefly, the distinct dichotomies for a given *n*-class problem are uniformly sampled with replacement, and the class probability estimates for a given instance *x* are obtained by averaging the estimates from the individual END members. According to statistical EL theory, the reduced number of dichotomies is still large enough to ensure a high level of diversity among END members and thus facilitate improvement by the ensemble. One drawback of random selection is that it can produce very unbalanced tree structures, which negatively affects the training time of the full model even though the number of internal nodes remains the same in any ND for the same number of classes, because an unbalanced tree often implies that the internal binary learners near the leaves are trained on large datasets. To mitigate this issue, Dong et al. [75] proposed class-balanced and data-balanced versions of ND, namely, NDCB and NDDB, respectively. Compared with NDCB, NDDB can avoid the potential problem of multiclass problems with imbalanced samples.
Empirical experiments have shown that NDCB and NDDB have little effect on accuracy in most cases, but they greatly reduce the time needed for model training, particularly for problems with many classes in their ensemble versions (ENDCB and ENDDB, respectively). The growth function for NDCB is [75]:

$$T(c) = \begin{cases} \dfrac{1}{2}\dbinom{c}{c/2}\, T\!\left(\dfrac{c}{2}\right)^{2}, & \text{if } c \text{ is even} \\ \dbinom{c}{(c+1)/2}\, T\!\left(\dfrac{c+1}{2}\right) T\!\left(\dfrac{c-1}{2}\right), & \text{if } c \text{ is odd} \end{cases} \tag{2}$$

where *T*(2) = *T*(1) = 1.
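For intuition on how quickly these spaces grow, the three growth formulas above (complete selection, random selection, and the class-balanced Equation (2)) can be evaluated directly. This is a counting sketch, not classifier code.

```python
from functools import lru_cache
from math import comb

def n_dichotomies(c):
    """Complete selection: T(c) = (3^c - (2^(c+1) - 1)) / 2 distinct dichotomies."""
    return (3 ** c - (2 ** (c + 1) - 1)) // 2

def n_trees_random(c):
    """Random selection: T(c) = (2c - 3) * T(c - 1), with T(1) = 1."""
    return 1 if c <= 2 else (2 * c - 3) * n_trees_random(c - 1)

@lru_cache(maxsize=None)
def n_trees_balanced(c):
    """Class-balanced recursion of Equation (2), with T(1) = T(2) = 1."""
    if c <= 2:
        return 1
    if c % 2 == 0:
        return comb(c, c // 2) * n_trees_balanced(c // 2) ** 2 // 2
    return comb(c, (c + 1) // 2) * n_trees_balanced((c + 1) // 2) * n_trees_balanced((c - 1) // 2)

for c in (3, 5, 10, 23):
    print(c, n_dichotomies(c), n_trees_random(c), n_trees_balanced(c))
```

For the 23 land cover classes considered in this study, complete selection already yields tens of billions of candidate dichotomies, which makes the random and balanced restrictions practically necessary.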

When constructing an effective EL system, one always faces a dilemma between high classifier diversity and excellent individual performance, which is hard to satisfy in practice. Unfortunately, the above END methods generate subclass groups deterministically and therefore cannot maintain the benefits of high diversity. Additionally, errors made by binary classifiers at earlier nodes are inherently propagated to lower nodes of the ND tree and cannot be easily corrected. For these reasons, it is important to generate the dichotomies in a nondeterministic way, both to meet the high-diversity requirement and to reduce the number and impact of errors at the upper nodes of the ND tree. NDFC, NDRPS, and their ensemble versions are good examples of this intention [74,112]. However, compared with NDCB, NDRPS is more direct, easily discovers similar classes, and exhibits a degree of randomness, which leads to more diversity and a higher-performing ensemble. The growth function of NDRPS was empirically estimated as [112]:

$$T(c) = p(c)T\left(\frac{c}{3}\right)T\left(\frac{2c}{3}\right) \tag{3}$$

where *p*(*c*) represents the size of the dichotomies from the base learner and *T*(2) = *T*(1) = 1.

#### • Multiresolution segmentation

OBIA is a classic technique in RS image interpretation that integrates the spatial and spectral features and it splits RS images into a set of nonoverlapping homogeneous regions or objects, depending on the segmentation method that was specified. During recent decades, OBIA has gained widespread attention in the RS community, mainly because it can overcome the limitations of pixelwise analysis, such as the neglect of geometric, contextual, and semantic information, particularly in the processing of HR/VHR RS imagery [100–102,115]. Over the years, many image segmentation methods have been proposed and extensively examined while using various RS imagery. Among these methods, MRS is one of the most frequently used methods, which is mainly due to its capability to produce high-quality segments at different scales [69,102,116,117].

MRS is a bottom-up region-merging-based segmentation technique that starts with one-pixel objects and merges the most similar adjacent pixels or objects provided that the internal heterogeneity of the resulting object does not exceed a user-defined threshold [118]. The heterogeneity measure in eCognition considers the spectral heterogeneity, which allows for multivariate segmentation by adding weights to the image channels, and the shape heterogeneity, which describes the improvement in the shape with regard to smoothness and compactness. In any OBIA, the segmentation scale determines the average size and the number of segments produced. Defining an optimal segmentation scale that avoids both oversegmentation and undersegmentation is always challenging because of the spectral similarity between different objects and the complexity of real-world landscapes [101,102,119]. For MRS, numerous studies demonstrate the importance of the scale parameter, because it controls the dimensions and size of the segments, which directly affect the subsequent results [101,117,119]. A successful approach to scale optimization combines the local variance (LV) and the rate of change of the LV (ROC-LV) to determine the appropriate segmentation scales for a given resolution [120]. The automated selection of scale parameters is essentially an automation of the Estimation of Scale Parameter (ESP) tool, in which the visual inspection of a graph is replaced by an iterative procedure that segments the image at the first threshold occurring in the LV graph. The readers are referred to the original works by [118] and [120] for more detailed information.
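To illustrate the LV/ROC-LV idea, the sketch below computes a simple pixel-window proxy for LV at several scales and its rate of change. The ESP tool proper operates on the LV of segmented objects as a function of the MRS scale parameter, so this is only a conceptual analogue under that stated simplification:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(img, size):
    """Mean of the per-pixel variance within a size x size moving window
    (a pixel-based stand-in for the object-level LV used by ESP)."""
    img = img.astype(float)
    mean = uniform_filter(img, size)
    mean_sq = uniform_filter(img * img, size)
    return float(np.mean(mean_sq - mean * mean))

def roc_lv(lv_values):
    """Rate of change of LV between consecutive scales, in percent."""
    lv = np.asarray(lv_values, dtype=float)
    return 100.0 * (lv[1:] - lv[:-1]) / lv[:-1]
```

Candidate scales are then read off where the ROC-LV curve peaks or first crosses a threshold, mirroring the iterative procedure described above.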

#### 2.2.2. Proposed Method

MPs are composed of morphological opening and closing profiles, which consist of an ensemble of OBR and CBR operators. According to the definition of MPs, OBR and CBR operators are connected operators that satisfy the assertion of removing the structures that cannot contain the SE and preserving those structures that can contain the SE [121–123]. While applying such operators with a sequence of SEs of increasing size, one can extract information regarding the contrast and the size of the geometrical structures that are present in the image. Originally, the formulation of the spatial information that was included in the MPs refers to a single-band image; therefore, the direct construction of MPs is not straightforward for multi/hyperspectral images. Several approaches have been considered to overcome this shortcoming [77,93,94,124,125]. Among these approaches, one simple, and yet efficient, approach is to use a few images that contain most of the spectral information that was obtained by some FE method, namely, the EMPs [121]. If we consider the first *m* principal components that were extracted from the multi/hyperspectral images with principal component analysis (PCA), the EMPs are obtained by stacking all of the MPs that are built on all *m* components.
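A minimal sketch of building an (E)MP stack from a few input components, with opening/closing by reconstruction implemented via iterated geodesic dilations, is given below. It is SciPy-based and assumes square SEs for simplicity (the experiments later use disk-shaped SEs), and the input components would be the first *m* PCA components in the EMP case:

```python
import numpy as np
from scipy.ndimage import grey_erosion, grey_dilation

def opening_by_reconstruction(img, size):
    """OBR: erode with an SE, then reconstruct by geodesic dilation under img."""
    marker = grey_erosion(img, size=size)
    while True:  # elementary geodesic dilations until stability
        dilated = np.minimum(grey_dilation(marker, size=3), img)
        if np.array_equal(dilated, marker):
            return marker
        marker = dilated

def closing_by_reconstruction(img, size):
    """CBR via duality: complement, apply OBR, complement back."""
    m = img.max()
    return m - opening_by_reconstruction(m - img, size)

def emp(components, sizes):
    """Stack the MPs of every component: CBRs, the component itself, OBRs."""
    profiles = []
    for comp in components:
        profiles += [closing_by_reconstruction(comp, s) for s in sizes]
        profiles.append(comp)
        profiles += [opening_by_reconstruction(comp, s) for s in sizes]
    return np.stack(profiles)
```

Because the reconstruction step is connected, structures too small to contain the SE are removed while the remaining boundaries are preserved, which is the defining property of MPs noted above.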

According to the definition from MM and our previous works [98,99], the MRS object-guided morphological OBR operator can be obtained by first eroding the input image using the segmented objects as SEs (where Θ<sup>λ</sup><sub>*S*</sub> represents the set of *S* objects from MRS with scale λ) and then using the result as the marker in a geodesic reconstruction-by-dilation phase:

$$\text{OOBR}(f) = R\_f^D \left[ f \ominus \left( \exists \Theta\_{j, j \in S}^{\lambda} \in \Theta\_S^{\lambda} \right) \right] \tag{4}$$

Similarly, we have

$$\text{OCBR}(f) = R\_f^E \left[ f \oplus \left( \exists \Theta\_{j, j \in S}^{\lambda} \in \Theta\_S^{\lambda} \right) \right] \tag{5}$$

Alternatively, the object-guided CBR (OCBR) can be obtained by complementing the image (*f*<sup>C</sup>), applying the object-guided OBR (OOBR) with the SEs ∃Θ<sup>λ</sup><sub>*j*, *j*∈*S*</sub>, and complementing the result:

$$\text{OCBR}(f) = R\_f^{D\,\text{C}} \left[ f^{\text{C}} \ominus \left( \exists \Theta\_{j, j \in S}^{\lambda} \in \Theta\_S^{\lambda} \right) \right] \tag{6}$$

In MM, the erosion of *f* by *b* at any location (*x*, *y*) is defined as the minimum value over all of the pixels in the neighborhood denoted by *b*. In contrast, dilation returns the maximum value of the image in the window outlined by *b*. Accordingly, we obtain the following formulations of the erosion and dilation operators:

$$\begin{aligned} \left[ f \ominus \left(\exists \Theta^{\lambda}\_{j, j \in S} \in \Theta^{\lambda}\_{S}\right) \right](x, y) &= \min\_{(s, t) \in \Theta^{\lambda}\_{j, j \in S}} \{ f(x + s, y + t) \} \\ \left[ f \oplus \left(\exists \Theta^{\lambda}\_{j, j \in S} \in \Theta^{\lambda}\_{S}\right) \right](x, y) &= \max\_{(s, t) \in \Theta^{\lambda}\_{j, j \in S}} \{ f(x + s, y + t) \} \end{aligned} \tag{7}$$

By substituting Equation (7) into Equations (4)–(6), we obtain the formulations of the OOBR and the OCBR as:

$$\begin{aligned} \text{OOBR}(f) &= R\_f^D \left[ \min\_{(s,t) \in \Theta\_{j, j \in S}^{\lambda}} \{ f(x + s, y + t) \} \right] \\ \text{OCBR}(f) &= R\_f^E \left[ \max\_{(s,t) \in \Theta\_{j, j \in S}^{\lambda}} \{ f(x + s, y + t) \} \right] = R\_f^{D\,\text{C}} \left[ \min\_{(s,t) \in \Theta\_{j, j \in S}^{\lambda}} \left\{ f^{\text{C}}(x + s, y + t) \right\} \right] \end{aligned} \tag{8}$$

If the SEs ∃Θ<sup>λ</sup><sub>*j*, *j*∈*S*</sub> are specified by MRS objects with a sequence of scale parameters λ, then the MRS object-guided morphological profiles (OMPs) of an image *f* can be defined as:

$$\text{OMPs}(f) = \left[ \text{OOBR}(f)^{\left(\exists \lambda \in [\lambda\_1^\*, \lambda\_2^\*, \dots, \lambda\_Q^\*]\right)}, \text{OCBR}(f)^{\left(\exists \lambda \in [\lambda\_1^\*, \lambda\_2^\*, \dots, \lambda\_Q^\*]\right)} \right] \tag{9}$$

where {λ<sub>1</sub><sup>∗</sup>, λ<sub>2</sub><sup>∗</sup>, ... , λ<sub>*Q*</sub><sup>∗</sup>} represents the set of *Q* user-specified values of the scale parameter λ.
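Equation (8) translates directly into code: the per-object minimum acts as the eroded marker, which is then reconstructed by geodesic dilation under *f*, and OCBR follows from the duality in Equation (6). A simplified sketch, with the MRS objects represented by an integer label image, is:

```python
import numpy as np
from scipy import ndimage

def oobr(img, labels):
    """Object-guided OBR: every pixel takes its object's minimum value
    (object-as-SE erosion), then the marker is reconstructed by geodesic
    dilation under the original image."""
    ids = np.unique(labels)
    mins = np.asarray(ndimage.minimum(img, labels=labels, index=ids))
    marker = mins[np.searchsorted(ids, labels)]
    while True:  # elementary geodesic dilations until stability
        dilated = np.minimum(ndimage.grey_dilation(marker, size=3), img)
        if np.array_equal(dilated, marker):
            return marker
        marker = dilated

def ocbr(img, labels):
    """Object-guided CBR via the duality of Equation (6)."""
    m = img.max()
    return m - oobr(m - img, labels)
```

Stacking `oobr`/`ocbr` outputs over several MRS scale parameters then yields the OMPs of Equation (9).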

By further considering the extensively proven performance of object features in OO-based image classification, the extended OMPs (EOMPs) can be calculated as follows:

$$\text{EOMPs}(f) = \left[ \text{OOBR}(f)^{\left(\exists \lambda \in [\lambda\_1^\*, \lambda\_2^\*, \dots, \lambda\_Q^\*]\right)}, \text{OCBR}(f)^{\left(\exists \lambda \in [\lambda\_1^\*, \lambda\_2^\*, \dots, \lambda\_Q^\*]\right)}, (f)\_{\text{OO}}^{\left(\exists \lambda \in [\lambda\_1^\*, \lambda\_2^\*, \dots, \lambda\_Q^\*]\right)} \right] \tag{10}$$

where

$$(f)\_{\text{OO}} = \left[ O\_{k\,\text{Min}}^{\lambda^\*}, O\_{k\,\text{Max}}^{\lambda^\*}, O\_{k\,\text{Mean}}^{\lambda^\*}, O\_{k\,\text{Std}}^{\lambda^\*}, O\_{k\,\text{MeanIB}}^{\lambda^\*}, O\_{k\,\text{Roun.}}^{\lambda^\*}, O\_{k\,\text{Comp.}}^{\lambda^\*}, O\_{k\,\text{Asym.}}^{\lambda^\*}, O\_{k\,\text{Rect.}}^{\lambda^\*}, O\_{k\,\text{BorderI.}}^{\lambda^\*}, O\_{k\,\text{ShapeI.}}^{\lambda^\*}, O\_{k\,\text{Elliptic}}^{\lambda^\*} \right]\_{k = 1, \dots, K} \tag{11}$$

represents the collection of 13 object features, including pixel-value-based measures, such as the minimum, the maximum, the mean, the standard deviation, and the mean of the inner border, and geometrical measures, such as the roundness (*O*<sup>λ∗</sup><sub>*k* Roun.</sub>), the compactness (*O*<sup>λ∗</sup><sub>*k* Comp.</sub>), the asymmetry (*O*<sup>λ∗</sup><sub>*k* Asym.</sub>), the rectangular fit (*O*<sup>λ∗</sup><sub>*k* Rect.</sub>), the border index (*O*<sup>λ∗</sup><sub>*k* BorderI.</sub>), the shape index (*O*<sup>λ∗</sup><sub>*k* ShapeI.</sub>), and the elliptic fit (*O*<sup>λ∗</sup><sub>*k* Elliptic</sub>).
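The pixel-value-based part of (*f*)<sub>OO</sub> can be computed cheaply from a label image; the sketch below covers four of the five statistical measures (the mean of the inner border is omitted for brevity, and the geometrical measures, e.g., compactness or the shape index, would require a region-properties routine such as `skimage.measure.regionprops`):

```python
import numpy as np
from scipy import ndimage

def object_pixel_features(img, labels):
    """Per-object min, max, mean, and standard deviation from a label image:
    one feature row per object, in the order of the sorted label ids."""
    ids = np.unique(labels)
    return ids, np.column_stack([
        ndimage.minimum(img, labels, ids),
        ndimage.maximum(img, labels, ids),
        ndimage.mean(img, labels, ids),
        ndimage.standard_deviation(img, labels, ids),
    ])
```

Broadcasting each object's feature row back to its pixels, per scale λ, yields the OO feature bands that are stacked into the EOMPs of Equation (10).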

Finally, Figure 3 shows the overall technical flowchart for the proposed method.

#### 2.2.3. Experimental Setup

To analyze the performance of the introduced multiclass classification methods ND and END, state-of-the-art and classic ML algorithms, including C4.5 [87], END with ERDT (END-ERDT) [91], RaFs [89], ExtraTrees [91,98], classification via random forest (CVRaFs) [126], RoFs [90], and an SVM [64], were also applied in direct- or ECOC-based multiclass classification. The considered ECOC methods include one-versus-one (ECOC:1vs1), one-versus-all (ECOC:1vsAll), random correlation (ECOC:RC), dense random (ECOC:DR), sparse random (ECOC:SR), and ordinal (ECOC:Ordinal) methods. Critical tree parameters of the C4.5, END-ERDT, RaF, RoF, CVRaF, and ExtraTrees classifiers are set to their defaults, while the ensemble size is set to 100 by default for RaF, RoF, CVRaF, and ExtraTrees. The parameters of the radial basis function (RBF) kernel-based SVM were tuned by using Bayes optimization (SVM-B) and 10 by 10 grid-search optimization (SVM-G) [127].
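The ECOC variants above differ mainly in their code matrices and in how the binary outputs are decoded. A minimal sketch of two code-matrix constructions and Hamming decoding (our own simplified formulation, not the exact implementation used in the experiments) is:

```python
import numpy as np

def one_vs_all_codes(n_classes):
    """ECOC:1vsAll code matrix: +1 on the diagonal, -1 elsewhere."""
    return 2 * np.eye(n_classes, dtype=int) - 1

def dense_random_codes(n_classes, n_bits, rng):
    """ECOC:DR-style code matrix with random +/-1 entries."""
    return rng.choice([-1, 1], size=(n_classes, n_bits))

def ecoc_decode(code_matrix, binary_outputs):
    """Predict the class whose codeword is closest, in Hamming distance,
    to the signs of the binary classifiers' outputs."""
    signs = np.where(np.asarray(binary_outputs) >= 0, 1, -1)
    return int(np.argmin((code_matrix != signs).sum(axis=1)))
```

One binary classifier is trained per column of the code matrix; at prediction time, the column outputs form the bit string that `ecoc_decode` matches against the class codewords.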

We applied a disk-shaped SE with n = 10 openings and closings by conventional and partial reconstruction to obtain the MPs and MPPR from the four raw bands of MSIL1C and the first three PCA-transformed components, with disk sizes ranging from one to ten with a step-size increment of one. These parameters mean that we obtain 84 = 4 + 4 × 10 × 2 dimensional datasets using the four raw bands and 63 = 3 + 3 × 10 × 2 dimensional datasets using the first three PCA-transformed components, which are represented by Raw\_MPs, Raw\_MPPR, PCA\_MPs, and PCA\_MPPR in the graphs in the experimental parts. For fair comparisons of dimensionality, we set the MRS segmentation scale parameter λ to 10 different values in the FE phase for OMPs and EOMPs. In other words, we obtained 84 = 4 + 4 × 10 × 2 and 63 = 3 + 3 × 10 × 2-dimensional datasets for the raw and PCA-transformed data, respectively, while using OMPs, and 524 = 4 + 4 × 13 × 10 and 393 = 3 + 3 × 13 × 10 dimensional OO feature datasets from the raw and PCA-transformed data, respectively. Naturally, there are 604 = (524 − 4) + 84 and 453 = (393 − 3) + 63 dimensional datasets for the raw and PCA-transformed data, respectively, while using EOMPs.
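The dimensionality bookkeeping above can be written out as a quick self-check (the function name is ours):

```python
def feature_dims(n_bands, n_scales=10, n_oo_feats=13):
    """Dimensionality of the MP/MPPR/OMPs, OO, and EOMPs stacks
    for a given number of input bands."""
    mp = n_bands + n_bands * n_scales * 2           # bands + openings + closings
    oo = n_bands + n_bands * n_oo_feats * n_scales  # bands + OO features per scale
    eomp = (oo - n_bands) + mp                      # OO features stacked on OMPs
    return mp, oo, eomp
```

For the four raw bands this gives (84, 524, 604), and for the three PCA components (63, 393, 453), matching the dataset sizes stated above.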

In the experiment, the average accuracy (AA), the overall accuracy (OA), the CPU running time (CPUTime), and the kappa statistic were used to evaluate the classification performance of all the considered methods. All of the experiments were conducted while using Octave 5.1.0 on a Windows 10 64-bit system with an Intel Core i7-4790 3.60 GHz CPU and 64 GB of RAM.

#### **3. Results**

#### *3.1. Assessment of the Feature Extractors*

#### 3.1.1. Accuracy Evaluation

Figure 4 illustrates the OA values from the ensemble methods, including RaF, ExtraTrees, and END-ERDT, while using the MPs, MPPR, and EOMP features that were extracted from the raw and PCA-transformed datasets. Each point on the *x*-axis represents the MRS scale set for the OO, OMPs, OMPsM, and EOMPs feature extractors (e.g., 50-500-50 means that the scale parameter λ of MRS starts at 50 and stops at 500 in 10 steps of 50), while the *y*-axis represents the OA values. First, the superiority of the proposed FE method EOMPs is obvious when compared with that of the MPs, MPPR, OMPs, and OMPsM, as is the superiority of OO when compared with that of the MPs, MPPR, and OMPs. Specifically, the best improvements were achieved by EOMPs across all three classifiers with both datasets (see the dark green lines). Moreover, the superiority of MPPR compared to MPs and OMPs and the superiority of OMPsM when compared to MPPR are clear, which again supports the findings by Liao et al. [97] and Samat et al. [98]. Additionally, the performance of OO and OMPs could actually be limited by setting the segmentation scale parameter λ to very large values. For example, a decreasing trend in the OA values from OMPs can be observed after the starting scale is larger than 100 with 100 or 50 scale steps (see the brown lines).

**Figure 4.** Overall accuracy (OA) values from RaF (**a** and **d**), ExtraTrees (**b** and **e**), and END-ERDT (**c** and **f**) using morphological profiles (MPs) (disk, 1-10-1), MPs with partial reconstruction (MPPR) (disk, 1-10-1), and extended object-guided MPs (EOMP) features extracted from the raw bands.

#### 3.1.2. Visual Evaluation

In Figure 5, a 600 × 800 image patch was selected from the south-central area of the Bakbakty irrigation area (Figure 1f) to show the differences between the OBR, opening by partial reconstruction (OBPR), and object-guided OBR (OOBR) operators with different scale parameter settings while using the first raw band. According to the images in the first row of Figure 5, the image becomes slightly grayer as the size of the SEs increases in OBR, with most of the small details, such as boundaries between objects, still remaining. In contrast, boundaries between different objects become too blurred and indistinguishable as the size of the SEs increases in OBPR; many large objects, such as the urban area in the central-western area that should appear at a certain scale of the area attribute, remain at a low scale, and new disk-shaped objects are created with large SE sizes after OBPR (see the last image in row 2 of Figure 5). For the proposed EOMPs, the target image becomes slightly grayer as the scale parameter λ, which controls the total number and individual scales of the segments, increases. Furthermore, most of the boundaries between the different land cover types remain exactly as in the original, which is mainly due to the EOMPs only filtering the areas within the corresponding boundaries. However, when the segments are too large and are composed of many different objects, the performance of OOBR could be limited by returning profiles of only one object. In addition, different segments could have the same minimum and/or maximum pixel values; as a result, OOBR could return similar, or even the same, profiles for different objects. For example, a very bright and rectangular building in the center of the target image cannot be distinguished from its surroundings when the value of λ is greater than 600 (see the last row of Figure 5).

**Figure 5.** Examples of OBR (row 1), opening by partial reconstruction (OBPR) (row 2), multi-resolution segmentation (MRS) segments (row 3), object-guided OBR (OOBR) (row 4) computed from band 1 (first image at row 1) at the center-bottom of Figure 1f (the numbers in the table in row 3 show the disk sizes in OBR and OBPR and the segmentation scale λ in MRS).

#### *3.2. Evaluation of ND and END*

#### 3.2.1. Classification Accuracy

As mentioned in Section 1, the second objective of this paper is to investigate the performance of popular ND algorithms and their ensemble versions. Hence, Figure 6 presents the OA values from the various classifiers adopted in the direct, ND, ECOC, and END multiclass classification frameworks using all of the considered features.

If we simply compare the OA bars from all of the adopted classification algorithms using the various features in the three multiclass classification framework scenarios, the results of MPPR are superior to those of MPs and OMPs, those of OMPsM are superior to those of MPPR, OO is superior to MPs and MPPR, and EOMPs is superior to all of the others, uniformly across almost all of the classification scenarios, which confirms the superiority of our proposed method.

**Figure 6.** OA values from various classifiers in different multiclass classification frameworks (Direct and ND: a, b; ECOC: c, d; END: e, f) using all the considered features.

When comparing the OA bars in Figures 6a and 6b from C4.5, ERDT, RaF, and ExtraTrees in the direct and ND multiclass classification frameworks, the improvements of ND, NDCB, NDDB, NDRPS, and NDFC over the direct framework are not clear. Interestingly, the performance of a weak, direct multiclass classification algorithm can even be reduced in the case of readily available low-dimensional data with low discrimination capability. For instance, the C4.5 classifier reached OA values greater than 82% and 80% while using the original raw bands and the PCA-transformed datasets, respectively, in the direct multiclass classification framework, whereas C4.5 in the ND, NDCB, NDDB, NDRPS, and NDFC multiclass classification frameworks uniformly reached OA values less than 82% and 79% while using the original raw bands and the PCA-transformed datasets, respectively. Similar results can also be found for ERDT, not only using the original raw and PCA-transformed datasets but also using the MPs and MPPR features extracted from them, as the ERDT has proven much weaker than C4.5 [98]. When comparing the OA bars of the ensemble classifiers, such as RaF and ExtraTrees, no obviously increased or decreased OA values are observed for ExtraTrees between the direct and the ND, NDCB, NDDB, NDRPS, and NDFC multiclass classification frameworks, but a slightly decreasing trend is shown by RaF in the ND, NDCB, NDDB, NDRPS, and NDFC frameworks while using the original raw and PCA-transformed datasets.

According to the results shown in Figures 6c and 6d, there are no obvious increases or decreases in the OA for the same classifiers with different ECOC techniques, except for ERDT and C4.5 in the ECOC:1vsAll case and C4.5 in the ECOC:Ordinal case while using the original raw and PCA-transformed datasets. Additionally, the differences in OA values among the ECOC techniques using OO and spatial features are smaller and more stable than those among the ND frameworks. Taking the C4.5 classifier as an example, the 95–99% and 93–98% OA ranges in the ND multiclass classification framework become 98–99.80% and 97.5–99.80% for ECOC:RC. More interestingly, in comparison with the direct and ND frameworks, better OA results can always be reached by weak classifiers (e.g., C4.5 and ERDT) in the ECOC one-vs-one, random correlation, dense random, and sparse random multiclass classification techniques. For instance, a minimum OA larger than 82% (ECOC:1vsAll) and a maximum of around 86% (ECOC:RC, DR, and SR) are shown by C4.5 in the ECOC frameworks while using the original raw bands, whereas a minimum larger than 81% (NDRPS) and a maximum slightly larger than 82% (ND) are shown in the ND, NDCB, NDDB, NDRPS, and NDFC frameworks. On the contrary, when stronger classifiers, such as RaF, ExtraTrees, and SVM, are adopted, the differences between the direct, ND, and ECOC frameworks are much smaller, especially when using high-dimensional datasets with high discrimination capabilities. In contrast with RaF and ExtraTrees, better OA values could be reached by SVM in the ECOC frameworks while using low-dimensional datasets with low discrimination capabilities, i.e., the original raw bands and the PCA-transformed datasets.

By comparing the results in Figures 6e and 6f with the results in Figures 6a and 6b, we can clearly observe the superiority in the OA values of END, ENDRPS, ENDCB, and ENDDB over ND, NDRPS, NDCB, and NDDB, respectively, which is in accordance with the findings from Frank and Kramer [93], Dong et al. [83], and Rodríguez et al. [94]. Interestingly, ERDT in the END, ENDRPS, ENDCB, and ENDDB frameworks always reached better OA values than C4.5 and RaF (except for ENDRPS) with the same multiclass classification sets, even when using the original raw bands and PCA-transformed datasets with low discrimination capabilities (see the bars in light blue in Figures 6e and 6f). When better data with high discrimination capabilities are available, the END, ENDRPS, ENDCB, and ENDDB multiclass classification frameworks are capable of reaching better OA values while using weak but simple classifiers (e.g., C4.5 and ERDT) than the direct and ECOC frameworks using stronger but more complex classifiers (e.g., RaF, ExtraTrees, and SVM). For example, the OA values for C4.5 and ERDT are larger than approximately 98% in the END framework, while the OA values for SVM-B and SVM-G are larger than approximately 97% in the ECOC:Ordinal framework while using the various spatial features extracted from the original raw bands.

In Figure 7, we present the OA curves from the direct and END-based multiclass classification frameworks with incrementally increasing ensemble size. In the direct multiclass classification approaches, the ensembles of the conventional C4.5 and ERDT classifiers correspond to RaF and ExtraTrees, respectively, while C4.5, ERDT, RaF, and ExtraTrees are adopted as the base learners in the END, ENDCB, ENDDB, and ENDRPS frameworks. Note that the size of RaF and ExtraTrees is set to 100 in the END, ENDCB, ENDDB, and ENDRPS frameworks. Based on the results, the superiority of ENDCB and ENDDB over END is not obvious in the context of the OA values, as was also shown in the study by Dong et al. [75]. In contrast, ENDRPS showed the worst results while using the C4.5 and ERDT classifiers. Additionally, the END, ENDCB, and ENDDB frameworks with the ERDT classifier can achieve classification accuracy results that are better than those attained by RaF, using both the original raw bands and the MPs features extracted from the raw bands (see the results in Figures 7a and 7e). However, optimum results can be reached by feeding ExtraTrees to the END, ENDCB, ENDDB, and ENDRPS frameworks. Regarding the effect of the ensemble size, increasing the ensemble size beyond 80 does not yield obvious improvements in the OA values for the END, ENDCB, ENDDB, and ENDRPS frameworks with C4.5 and ERDT while using the considered features, and increasing the ensemble size beyond 30 does not yield obvious improvements for these frameworks with RaF and ExtraTrees.

**Figure 7.** OA values versus the ensemble size of the END, ENDCB, ENDDB, and ENDRPS frameworks with ERDT (**a**, **e**), C4.5 (**b**, **f**), RaF (**c**, **g**), and ExtraTrees (**d**, **h**) classifiers while using the original raw bands (**a**, **b**, **c**, **d**) and the MPs (**e**, **f**, **g**, **h**).

#### 3.2.2. Computational Efficiency

Computational efficiency is always considered to be another key factor after the classification accuracy when evaluating a classifier's performance. In accordance with Figures 6 and 7, Figure 8 shows the CPUTime (in seconds) in the training phase for various classifiers in different multiclass classification frameworks and using all of the considered features, while Figure 9 shows the results for the END, ENDCB, ENDDB, and ENDRPS frameworks with different ensemble sizes.

When comparing the charts in Figure 8, direct ERDT is at least 10 to 1000 times faster than the C4.5, RaF, ExtraTrees, CVRaF, and RoF classifiers, and ExtraTrees is faster than RaF, CVRaF, and RoF, which is in accordance with our previous findings [98]. The extremely fast operability of ERDT is inherently preserved in the ND, NDCB, NDDB, and NDFC multiclass classification frameworks and in their ensemble versions, as shown in Figures 8e and 8f. Specifically, using ERDT in the ND, NDCB, NDDB, and NDFC multiclass classification frameworks is at least 10 times faster than using C4.5 and at least 100 times faster than using RaF and ExtraTrees, which is reasonable given that the ensemble size of RaF and ExtraTrees is set to 100 by default in those frameworks.

In contrast with the results from the ECOC frameworks, as shown in Figures 8c and 8d, C4.5, ERDT, RaF, and ExtraTrees in the ND, NDCB, NDDB, NDRPS, and NDFC frameworks are slightly faster than their counterparts in the ECOC:1vs1, ECOC:1vsAll, and ECOC:RC frameworks. As expected, the worst computational efficiency is shown by the ECOC frameworks with the SVM-B and SVM-G parameter optimization techniques. Specifically, SVM-B is 10 times faster than SVM-G, but even SVM-B is at least 1000 times slower than ERDT in the ND frameworks and at least 100 times slower than ERDT in the END frameworks.

Critical tree parameters, including the minimum leaf size and the maximum depth, were also tuned using Bayes optimization in the ECOC:DR, ECOC:Ordinal, and ECOC:SR frameworks to identify the computational effects of parameter optimization. As shown in Figures 8c and 8d, the computational burden of the parameter tuning process is also severe for C4.5. For instance, the ECOC:SR framework with C4.5 took approximately 1000 seconds of CPUTime on the four original raw bands, while less than 5, 10, and 100 seconds are typical in the direct framework, the ND, NDCB, NDRPS, and NDDB frameworks, and their ensemble versions, respectively. If we correspondingly look back at the OA results shown in Figure 6, no obvious improvements in the OA values are indicated. In other words, the computational complexity introduced by parameter optimization in the more sophisticated ECOC multiclass classification frameworks could be eliminated without an obvious reduction in accuracy.

According to the results shown in Figure 9, it is clear that the direct classifier ExtraTrees is faster than RaF, and RaF is faster than the END, ENDCB, ENDRPS, and ENDDB multiclass classification frameworks while using C4.5, ERDT, RaF, and ExtraTrees as the base learners. Moreover, the computational efficiency of ENDCB and ENDDB over END is also clear, while the computational costs of all of the END, ENDCB, ENDRPS, and ENDDB frameworks increase linearly with the ensemble size. Interestingly, both the adopted classifier and the ND framework can influence the computational efficiency. For instance, the worst computational efficiency is shown by ENDRPS when ERDT is used while using both the original raw and MPs datasets (see Figures 9a and 9e), whereas END is the least efficient framework when C4.5, RaF, and ExtraTrees are used. Considering also the accuracy results shown in Figure 7, ENDRPS with ERDT might therefore not be the optimal choice for both accurate and efficient classification when compared with the END, ENDCB, and ENDDB frameworks.

**Figure 8.** CPU running in seconds for various classifiers in different multiclass classification frameworks (Direct and ND: a, b; ECOC: c, d; END: e, f) using considered features.

**Figure 9.** CPU runtime in seconds versus the ensemble sizes of the END, ENDCB, ENDDB, and ENDRPS frameworks with the ERDT (**a**, **e**), C4.5 (**b**, **f**), RaF (**c**, **g**), and ExtraTrees (**d**, **h**) classifiers using the original raw bands (**a**, **b**, **c**, **d**) and features from the MPs (**e**, **f**, **g**, **h**).

#### 3.2.3. Robustness to the Data Dimensionality

Data quality is also a critical factor that controls the classification performance of the adopted classifiers, and many approaches can be used to increase the discrimination capability of the provided data by introducing new features. However, increasing the number of data dimensions by introducing new features may leave the available training samples too small to mitigate the Hughes phenomenon on the one hand and increase the computational complexity of feature-space-splitting-based classifiers (e.g., C4.5, RaF, and RoF) on the other hand. Hence, it is of interest to comparatively investigate the robustness of ND and END to the data dimensionality.

According to the results in Figures 6 and 7, the improved data quality from introducing new features is clear. Among the single and ensemble, direct and ND-based classifiers, C4.5 is more robust than ERDT to the data dimensionality in the direct, ND, NDCB, NDDB, NDRPS, and NDFC frameworks. For example, ND with ERDT achieves OA values between 92% and 99% after features from MPs, MPPR, OMPs, OMPsM, and EOMPs are introduced, while ND with C4.5 achieves OA values between 95% and 99% (see Figure 6a). The ensemble versions of C4.5 and ERDT are less robust than RaF, RoF, CVRaF, and ExtraTrees, both in the direct and in the various ND frameworks. When compared with the results from the direct and various ND frameworks, uniformly better robustness to the data dimensionality is shown by all of the ECOC frameworks, especially with the RaF, ExtraTrees, and SVM classifiers. Taking the ECOC:RC framework with C4.5 as an example, the OA values, which range between 95% and 99% for ND with C4.5, shrink to a range between 98% and 99% after features from MPs, MPPR, OMPs, OMPsM, and EOMPs are introduced.

As expected, the ECOC frameworks with SVM show better robustness to the data dimensionality than the ECOC frameworks with C4.5, ERDT, RaF, and ExtraTrees, since the SVM is capable of overcoming the Hughes phenomenon caused by the data dimensionality through the kernel trick [64,65]. When comparing the OA values from the various END-based multiclass classification frameworks, it is clear that (1) the various END frameworks have better robustness to the data dimensionality than the various ND frameworks; (2) the differences in robustness to the data dimensionality between C4.5 and ERDT, C4.5 and RaF, and ERDT and ExtraTrees in the various END frameworks are much smaller than those in the various ND frameworks; and (3) robustness to the data dimensionality that is similar to, and even better than, that of the ECOC frameworks can be reached by the END frameworks. For instance, END with C4.5 showed an OA ranging between 98% and approximately 99.8% after the various considered features were introduced, while most of the ECOC frameworks with SVM show an OA that ranges between 97% and approximately 99.8%.

As shown in Figure 8, the computational complexity introduced by the data dimensionality is clear for all classifiers in all of the multiclass classification frameworks, especially for the C4.5, ERDT, ExtraTrees, RoF, and CVRaF classifiers, which adopt feature-splitting or selection criteria in feature space that control the complexity of the adopted DTs. For example, a higher computational cost is always shown for ERDT and ExtraTrees in the END frameworks when using the Raw\_OO (with 524 dimensions) and Raw\_EOMPs (with 604 dimensions) features, while a similar and lower computational cost is shown when using the Raw\_MPs, Raw\_MPPR, Raw\_OMPs, and Raw\_OMPsM features (see Figure 8e). RaF is more robust than the C4.5, ERDT, ExtraTrees, RoF, and CVRaF classifiers to the data dimensionality. From a computational efficiency point of view, the best robustness to the data dimensionality is always shown by ERDT in the direct, ND, ECOC, and END frameworks. Additionally, because of the kernel trick, the differences in robustness to the data dimensionality for SVM in the ECOC frameworks are smaller than those for the DT-based classifiers adopted in the END frameworks.

#### *3.3. Final Vegetation Map*

Figure 10 shows the classification maps from the proposed method and the considered products, demonstrating the superiority of the Sentinel-2 MSIL1C products over the MODIS LUCC and GLC30 datasets in arid regions. To further verify the finding that END-ERDT is capable of reaching the best classification accuracy with very high computational efficiency, Table 3 reports the classification accuracy values (the user accuracy (UA), AA, OA, and kappa statistics) with the CPUTime in seconds for END-ERDT and for ECOC:1vsAll with SVM-G optimization.

According to the results in Figure 10, it is apparent that Sentinel-2A MSIL1C is better than MODIS LUCC and GLC30 for vegetation diversity mapping in the arid regions of Central Asia. Specifically, 19 different vegetation types were recorded from Sentinel-2A MSIL1C for our study area, while 15 and 8 land cover types were recorded by the MODIS LUCC and GLC30 products, respectively, without specific vegetation taxonomic names. For instance, vegetation species, such as Alhagi sparsifolia, Haloxylon ammodendron, and Artemisia lavandulaefolia, are classified as shrubs or herbaceous, while Iris lactea Pall. and Sophora alopecuroides are classified as grassland in the MODIS LUCC and GLC30 products. From a vegetation species taxonomy and distribution mapping point of view, the land cover taxonomy classification system might not be appropriate. For example, the vegetation species richness, which is defined as the number of different species present in a certain study zone, for the Bakanas and Bakbakty irrigation zones, which are depicted by the blue and green rectangles, respectively, in Figure 10a, is four (crops, forest, grass, and shrubs) from the GLC30 product (see Figures 10f and 10g), five (crops, tree, grass, herbaceous, and shrubs) from the MODIS LUCC product, and 12 (rice, cloves, wheat, corn, reeds, Alhagi sparsifolia, Carex duriuscula, shrubs, Haloxylon ammodendron, grass, tamarisk, Iris lactea Pall., and Sophora alopecuroides) and 13 (rice, cloves, wheat, corn, desert steppe, reeds, Alhagi sparsifolia, Carex duriuscula, shrubs, Haloxylon ammodendron, grass, tamarisk, Iris lactea Pall., and Sophora alopecuroides) from the Sentinel-2 MSIL1C classification with END-ERDT while using the spectral and spatial features.

Based on the results in Table 3, it can again be clearly seen that the END-ERDT method achieves the best results (OA = 99.85%) when using the stacked raw and EOMPs features, with the highest model training efficiency (15.20 s) relative to the RBF kernel-based SVM-G optimization training in the ECOC:1vsAll multiclass classification framework. This confirms the previous finding that END-ERDT can be an alternative to an SVM from the points of view of generalized classification accuracy, computationally efficient operation, and ease of deployment, especially when sufficient samples with advanced features are readily available. When the original raw data were adopted, OA values of 87.80% and 88.71% were achieved by the END-ERDT and SVM classifiers, respectively. Furthermore, END-ERDT showed its worst UA of 15.14% for Alhagi sparsifolia, while SVM showed its worst UA values of 2.75% for Alhagi sparsifolia and 1.79% for Sophora alopecuroides. After the advanced features were included, almost all of the land cover classes were correctly classified with UA values above 95% by both classifiers, especially once the OO and EOMPs were included. However, even on the raw data alone, while the END-ERDT model was trained within several to just over ten seconds, the optimal RBF kernel-based SVM model took more than ten thousand seconds.
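The accuracy measures reported in Table 3 can all be derived from a confusion matrix. A minimal sketch with an illustrative three-class matrix (the values are made up, not those of Table 3), using the standard definitions of OA, UA, PA, and kappa:

```python
import numpy as np

# Toy confusion matrix (rows = reference, columns = predicted) for three
# hypothetical classes.
cm = np.array([
    [50,  2,  3],
    [ 4, 60,  1],
    [ 2,  3, 75],
], dtype=float)

n = cm.sum()
oa = np.trace(cm) / n                                  # overall accuracy
ua = np.diag(cm) / cm.sum(axis=0)                      # user's accuracy per predicted class
pa = np.diag(cm) / cm.sum(axis=1)                      # producer's accuracy per reference class
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # expected chance agreement
kappa = (oa - pe) / (1 - pe)                           # Cohen's kappa
print(round(oa, 3), round(kappa, 3))  # 0.925 0.886
```

AA (average accuracy) is then simply the mean of the per-class accuracies, e.g., `pa.mean()`.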

**Figure 10.** Final vegetation distribution map using Sentinel-2 MSIL1C products with the END-ERDT classifier for our study area (**a**) and subareas (**b**,**c**), and corresponding examples from the 2015 MODIS LUCC products (**d**,**e**) and the 2017 GLC30 products (**f**,**g**) (for the legends, refer to Figure 1).



#### **4. Discussion**

For the task of arid-land vegetation mapping using Sentinel-2 MSIL1C images, the superior performance of the proposed EOMPs over the conventional OO, MPs, MPPR, OMPs, and OMPsM is confirmed, both statistically and visually. Additionally, as expected, possible side effects from very large segments could be controlled and even overcome by simply including the mean pixel values of the objects and the object profiles, such as compactness, roundness, and shape index. To overcome these potential drawbacks, multiple values of the scale parameter λ should be provided in OOBR and OCBR. On the other hand, object profiles that have been repeatedly proven effective should also be considered.

With respect to the results from the various classifiers in the direct, ECOC, ND, and END frameworks, END with ERDT (END-ERDT) was always capable of reaching the highest OA values. This finding can be explained by the "diversity" foundation for constructing an effective EL system, which holds that "weaker" classifiers (ERDT here) always have a better chance of reaching the trade-off between diversity and accuracy than "stronger" classifiers (C4.5 here) [85,128,129]. Additionally, according to statistical theory regarding EL, the reduced numbers of dichotomies in ENDCB, ENDDB, and ENDRPS are still large enough to ensure a high level of diversity among END members, facilitating improvement by the ensemble. Hence, investigating the performance of other weak classifiers in the END framework will be an interesting topic.

Ensembling randomly generated NDs is an effective approach to multiclass classification problems, as shown by the results in Figure 6 and by the work of Frank and Kramer [85]. However, the equal sampling strategy adopted in END could limit the classification accuracy by generating a very limited tree depth, which is controlled by the number of classes; moreover, a very unbalanced tree can negatively affect the runtime. To remedy such limitations, NDCB, NDDB, and their ensemble versions (ENDCB and ENDDB, respectively) were proposed by Dong et al. [75]. According to their results, the runtime efficiency of ENDCB and ENDDB was slightly better than that of END in the same cases, and no obvious improvements were observed by setting the ensemble size to a constant value. Hence, it is of interest to comparatively investigate the performance of END, ENDCB, ENDDB, and ENDRPS with various ensemble sizes. Our experiments confirmed that the positive effects of ensemble size are larger for END frameworks with weak classifiers than for those with strong classifiers.
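As a concrete illustration of the nested-dichotomy idea discussed above, the following is a minimal sketch, assuming scikit-learn's `ExtraTreeClassifier` as a stand-in for an ERDT base learner and the iris data as a placeholder multiclass problem. It is not the paper's implementation: the uniform random splits only loosely mirror END, and the class-balanced/data-balanced variants (ENDCB/ENDDB) would constrain how the class set is divided.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import ExtraTreeClassifier  # single extremely randomized tree

rng = np.random.default_rng(42)

def build_nd(classes):
    """Recursively split the class set into two random, non-empty subsets."""
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]
    rng.shuffle(classes)
    k = int(rng.integers(1, len(classes)))
    return (build_nd(classes[:k]), build_nd(classes[k:]))

def leaves(node):
    """Collect the class labels under a (sub)tree."""
    return [node] if not isinstance(node, tuple) else leaves(node[0]) + leaves(node[1])

def fit_nd(node, X, y):
    """Train one binary classifier per internal node of the dichotomy."""
    if not isinstance(node, tuple):
        return node
    left, right = node
    mask = np.isin(y, leaves(node))                      # samples reaching this node
    target = np.isin(y[mask], leaves(left)).astype(int)  # 1 = left subset
    clf = ExtraTreeClassifier(random_state=0).fit(X[mask], target)
    return (clf, fit_nd(left, X, y), fit_nd(right, X, y))

def nd_proba(node, x, cidx):
    """Class-probability vector obtained by multiplying branch probabilities."""
    if not isinstance(node, tuple):
        p = np.zeros(len(cidx))
        p[cidx[node]] = 1.0
        return p
    clf, left, right = node
    p_left = clf.predict_proba(x.reshape(1, -1))[0][list(clf.classes_).index(1)]
    return p_left * nd_proba(left, x, cidx) + (1.0 - p_left) * nd_proba(right, x, cidx)

X, y = load_iris(return_X_y=True)
classes = sorted(set(y))
cidx = {c: i for i, c in enumerate(classes)}

# The ensemble (END) averages the probability estimates of several random NDs.
ensemble = [fit_nd(build_nd(classes), X, y) for _ in range(10)]

def predict(x):
    p = np.mean([nd_proba(t, x, cidx) for t in ensemble], axis=0)
    return classes[int(np.argmax(p))]

train_acc = np.mean([predict(x) == c for x, c in zip(X, y)])
```

Because the base trees are grown to purity, the ensemble fits the training data essentially perfectly; generalization behavior would be assessed on held-out samples, as done in the paper.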

In studies that involve RS for biodiversity searches, land cover classification is considered the first-order analysis for species occurrence and mapping [10,23]. In general, coarse-spatial-resolution satellite imagery (e.g., MODIS, TM, and ETM+) and land cover products (e.g., the MODIS land use and cover change (LUCC) and Global Land Cover 30 (GLC30) datasets) are useful for detecting and evaluating ecosystems and habitat structures at large scales, while HR/VHR satellite imagery products are useful for estimating habitat quality, predicting taxonomic groups, determining species richness, and mapping diversity [130–132]. In arid and semiarid regions, sparsely distributed vegetation species are crucially important to regional ecosystems, but they are easily mixed into dominant land cover types (e.g., bare land) in coarse-resolution satellite imagery. Spectral unmixing and subpixel mapping techniques could eventually alleviate these problems; however, vegetation species diversity mapping at a fine scale using coarse-resolution satellite images from the MODIS, TM, ETM+, and OLI sensors remains quite challenging. For example, spectral unmixing can determine the fractions of classes within mixed pixels, but it fails to predict their spatial locations. Our experiments also showed that the Sentinel-2 MSIL1C products are a more valuable data source than the MODIS LUCC and GLC30 datasets for arid-land vegetation mapping. Hence, the Sentinel-2 products, with 10 m, 20 m, and 60 m spatial resolutions, 13 spectral bands spanning from the visible and near-infrared (VNIR) to the shortwave infrared (SWIR) portions of the spectrum, and a five-day revisit time over land and coastal areas, are a better choice than MODIS, TM, ETM+, and OLI for arid-land vegetation mapping.
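The unmixing limitation noted above can be illustrated with a toy linear mixing model. The endmember spectra below are hypothetical, and an unconstrained least-squares inversion stands in for operational unmixing, which typically adds non-negativity and sum-to-one constraints on the fractions:

```python
import numpy as np

# Hypothetical endmember spectra as columns (bare soil, shrub, water)
# in four bands; the values are illustrative, not measured signatures.
E = np.array([
    [0.30, 0.05, 0.02],
    [0.35, 0.08, 0.01],
    [0.40, 0.45, 0.01],
    [0.45, 0.30, 0.00],
])

# A synthetic mixed pixel: 70% bare soil, 30% shrub.
pixel = 0.7 * E[:, 0] + 0.3 * E[:, 1]

# Linear unmixing: solve E @ f ≈ pixel for the fraction vector f.
fractions, *_ = np.linalg.lstsq(E, pixel, rcond=None)
print(np.round(fractions, 3))  # [0.7 0.3 0. ]
```

The recovered fractions (70% bare soil, 30% shrub) say nothing about where within the pixel each cover occurs, which is exactly why sparsely distributed species cannot be located at a fine scale from coarse-resolution imagery alone.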

Based on this work, we also envisage future perspectives. Computing more advanced vegetation species diversity indices, such as those based on the spectral variation hypothesis (SVH), alpha-diversity, and beta-diversity, to demonstrate the superiority of Sentinel-2 MSIL1C images over MODIS and Landsat images would be an interesting future direction, especially at large regional or national scales. Since END-ERDT showed state-of-the-art classification performance, further statistical and empirical experiments should also be conducted. Finally, we will deploy the END-ERDT method on a Spark platform to support big data processing and facilitate its application.

#### **5. Conclusions**

Sentinel-2 MSIL1C images of the Ili River delta region of Kazakhstan were classified using spectral and EOMP features, in order to investigate the performance of the Sentinel-2A MSIL1C products for vegetation mapping in an arid-land environment with respect to land cover products from MODIS and Landsat, and to answer the question of whether ND and END are superior to state-of-the-art direct and ECOC-based multiclass classification approaches for accurate classification purposes.

According to the results, several conclusions can be drawn. First and foremost, the proposed EOMP features are better than all of the other features, while the OO features are better than the spatial features from the MPs, MPPR, OMPs, and OMPsM for Sentinel-2 MSIL1C image classification. Furthermore, in line with some previous findings, the ND, NDCB, NDDB, NDRPS, and NDFC frameworks showed superiority over direct multiclass classification, and the ECOC approaches are arguably useful in the Sentinel-2 MSIL1C image classification task. This finding can be explained by the fact that the final classification performance is controlled not only by the robustness of the adopted classifier but also by the discriminative capability of the provided data. Additionally, the superiority of the END, ENDRPS, ENDCB, and ENDDB frameworks over the ND, NDCB, NDDB, and NDRPS frameworks is confirmed, and one can obtain comparable and even better OA results than with the direct and ECOC frameworks by using weak and simple classifiers in the END, ENDRPS, ENDCB, and ENDDB frameworks. For example, END-ERDT can be an alternative to the RBF kernel-based SVM in the ECOC framework from the points of view of generalized classification accuracy, computationally efficient model training, and easy deployment. Finally, from the points of view of both identifying greater numbers of species and achieving high classification accuracy, the Sentinel-2A MSIL1C product is more suitable than the global land cover products generated from MODIS and Landsat imagery for arid-land vegetation species mapping.

**Author Contributions:** Conceptualization, A.S. and P.D.; Methodology, A.S.; In situ data collection: A.S., Y.G., L.M. and G.I.; Sentinel-2 data collection and processing: A.S.; Land cover type check using VHR imagery from Google Earth: C.L.; Original draft preparation: A.S.; Review and editing: A.S., N.Y., P.D., and S.L.; Project administration: A.S. and J.A.; Funding acquisition: A.S. and J.A.

**Funding:** This work was partially supported by the National Natural Science Foundation of China (grant nos. U1603242, 41601440), the Youth Innovation Promotion Association Foundation of the Chinese Academy of Sciences (2018476), and the West Light Foundation of the Chinese Academy of Sciences (2016-QNXZ-B-11).

**Acknowledgments:** The authors would like to acknowledge the use of free open-access Copernicus Sentinel-2 data (https://scihub.copernicus.eu) and the use of SNAP-ESA Sentinel Application Platform v6.0.0 (http://step.esa.int).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
