2. Related Work
Studies on parametrizing the SP can be categorized into two general groups: supervised approaches [21, 30, 31, 40] and unsupervised approaches [17, 20, 22, 29]. With supervised methods, the goal is to identify an optimal segmentation (or set of segmentations) by evaluating the overlap between ground-truth reference polygons and computer-generated image segments. This evaluation uses arithmetic or geometric dissimilarity metrics to quantify the discrepancy between each reference polygon and the generated segments that overlap it. The segmentation for which these metrics indicate the least dissimilarity between the reference polygons and their overlapping segments is taken to be optimal. Liu et al. [
40] proposed three discrepancy metrics (i.e., Potential Segmentation Error (PSE), Number-of-Segments Ratio (NSR), and Euclidean Distance 2 (ED2)) for the supervised optimization of MRS parameters. In that study, PSE was defined as a measure of undersegmentation (i.e., a PSE value equal to zero indicates no undersegmentation), NSR as a measure of oversegmentation (i.e., larger values of NSR indicate oversegmentation occurred), and ED2 was used to combine the other two metrics into a single value that takes into account both undersegmentation and oversegmentation (i.e., the smaller the value of ED2 is, the more the resulting segments match the corresponding reference polygons). In another study, Clinton et al. [
41] compared several supervised metrics and combination approaches, and found that the D-metric, calculated as the root-mean-square of an oversegmentation measure (OverSegmentation_ij, or OSeg) and an undersegmentation measure (UnderSegmentation_ij, or USeg), was consistently a good indicator of segmentation quality. In a further work, Zhang et al. [42] used similar undersegmentation/oversegmentation measures, compared different approaches for combining them (F-measure, Euclidean Distance, ED2), and found that the F-measure was more appropriate for combining the undersegmentation and oversegmentation metrics because of its higher sensitivity to excessive under-/oversegmentation.
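As a rough illustration of how the combined metrics discussed above relate to their components, the following sketch computes ED2, the D-metric, and an F-measure from given component values. The function names are hypothetical. The ED2 form (Euclidean norm of PSE and NSR) and the weighted harmonic-mean form of the F-measure are assumptions based on commonly cited definitions, and the F-measure inputs are assumed to be quality scores in (0, 1] (e.g., one minus an error measure) rather than the exact quantities used in the cited studies.

```python
import math


def ed2(pse, nsr):
    """Euclidean Distance 2, assumed here to be the Euclidean norm of PSE and NSR."""
    return math.sqrt(pse ** 2 + nsr ** 2)


def d_metric(oseg, useg):
    """D-metric: root-mean-square of the over- and undersegmentation measures."""
    return math.sqrt((oseg ** 2 + useg ** 2) / 2.0)


def f_measure(quality_a, quality_b, alpha=0.5):
    """Weighted harmonic mean of two quality scores in (0, 1] (assumed form)."""
    if quality_a <= 0 or quality_b <= 0:
        return 0.0  # one component failed completely
    return 1.0 / (alpha / quality_a + (1.0 - alpha) / quality_b)
```

Under these assumptions, two quality scores of 0.1 and 0.9 give an F-measure of 0.18, well below their arithmetic mean of 0.5, which illustrates why a harmonic-mean combination reacts more strongly to one excessive error than an additive one.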
In contrast to supervised methods, unsupervised SP optimization methods can be applied without the need for reference polygons. Many unsupervised methods aim to identify the segmentation parameters that maximize the average intra-segment homogeneity and inter-segment heterogeneity of segmentation results [29]. Compared to supervised methods, unsupervised techniques can potentially be faster (not requiring reference polygons) and less subjective. Aside from the methods based on maximizing intra-segment homogeneity/inter-segment heterogeneity, Drǎgut et al. [
28] developed a method known as the Estimation of Scale Parameter (ESP) tool, which was inspired by the concept of Local Variance (LV) that earlier proved to be beneficial for recognizing the structure of a given image with respect to its spatial resolution and land cover [
43], and for applying to the GEOBIA framework [
44]. The basis on which the ESP tool is built is that as larger SPs are used for segmenting a given image, the average global standard deviation (as a representative of LV) of the spectral values of image segments increases accordingly. This increase in the average standard deviation stops when the boundaries of some image segments approximately correspond to a real-world feature. By monitoring the LV curve resulting from this approach, it can be observed that several break points appear on the curve, and each of these abrupt changes can be indicative of an optimal SP for the corresponding land cover. As a result, using the ESP tool it is possible to estimate multiple SPs for various image objects with different sizes and structures.
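The break-point idea described above can be sketched as follows. This is a minimal illustration, not the ESP tool itself: it assumes that the mean local variance has already been computed for a series of increasing SPs, measures the percentage rate of change between consecutive levels, and treats local peaks in that rate of change as candidate SPs. The function names and the peak-selection rule are assumptions made for illustration.

```python
def rate_of_change(lv):
    """Percentage change of local variance between consecutive scale levels."""
    return [100.0 * (lv[i] - lv[i - 1]) / lv[i - 1] for i in range(1, len(lv))]


def candidate_scales(scales, lv):
    """Return the scales at which the rate of change of LV peaks locally."""
    roc = rate_of_change(lv)
    picks = []
    for i in range(1, len(roc) - 1):
        if roc[i] > roc[i - 1] and roc[i] >= roc[i + 1]:
            # roc[i] describes the step from scales[i] to scales[i + 1]
            picks.append(scales[i + 1])
    return picks
```

For example, with the hypothetical inputs `scales = [10, 20, 30, 40, 50, 60]` and `lv = [10, 12, 13, 16, 17, 17.5]`, the only local peak in the rate of change occurs at the step to SP 40, so that value would be flagged as a candidate.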
Parameterizing the MRS algorithm has also been carried out using the notion of spatial autocorrelation in some studies [45, 46, 47]. Spatial autocorrelation is important in the sense that it effectively reveals the statistical separability of spatial image objects from each other [
45]. Martha et al. [
47] made use of the objective function (composed of a spatial autocorrelation indicator (i.e., Moran’s I) and inter-segment variance analysis) proposed by Espindola et al. [
45] to develop a plateau objective function (POF) capable of constraining the lower limit of the objective function for detecting multiple optimal SPs. The authors hypothesized that the peak values of the POF are close to the maximum values of the objective function. Thus, it can be concluded that a trade-off between oversegmentation and undersegmentation in the results can be found where the peaks exceeding the constrained lower limit of the objective function are observed; that is, such a trade-off indicates an optimal SP for the image under consideration. In addition, the local peaks on the curve of the objective function can be indicative of optimal SPs for land features with different sizes in the image. Johnson and Xie [
29] proposed another multi-scale approach that estimates optimal segmentations in two steps: a global evaluation step and a local evaluation step. Optimal segmentations are selected by normalizing and combining (through addition) weighted variance (for evaluation of intra-segment homogeneity) and Moran’s I (for evaluation of inter-segment heterogeneity). Following that study, Johnson et al. [
22] found that combining the Weighted Variance and Moran’s I metrics using the
F-measure was more effective than combining them using addition, which echoed the findings by Zhang et al. [
42] that the
F-measure was more sensitive to excessive over- and undersegmentation. In a more recent study, Cánovas-García and Alonso-Sarría [
48] adapted the method introduced by Espindola et al. [
45] and developed a local SP optimizing technique by replacing the Moran’s I index with the Geary index (as an intra-segment heterogeneity measure) to also include objects’ variability when optimizing the SP. The main advantage of their method is its local optimization nature (uniform spatial units) that is beneficial in cases where the study area is large and covers diverse types of land use/cover. The main reason that local approaches could typically lead to more desirable results is that they can better capture the spectral contrast between objects than global approaches can, thus yielding more appropriate SPs [
19].
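To make the additive combination of the two indicators concrete, the sketch below normalizes weighted variance and Moran’s I over a set of candidate SPs and sums them. It assumes a min-max normalization in which lower raw values map to higher scores (lower variance meaning more homogeneous segments, lower Moran’s I meaning more distinct neighboring segments); the function names are hypothetical and the exact normalization used in the cited studies may differ.

```python
def normalize_lower_is_better(values):
    """Min-max normalize so that the smallest raw value scores 1 and the largest 0."""
    hi, lo = max(values), min(values)
    if hi == lo:
        return [0.0 for _ in values]  # no contrast between candidates
    return [(hi - v) / (hi - lo) for v in values]


def objective_scores(weighted_variance, morans_i):
    """Additive objective: normalized weighted variance plus normalized Moran's I."""
    v_norm = normalize_lower_is_better(weighted_variance)
    i_norm = normalize_lower_is_better(morans_i)
    return [v + i for v, i in zip(v_norm, i_norm)]


def best_scale(scales, weighted_variance, morans_i):
    """Return the SP with the highest combined score."""
    scores = objective_scores(weighted_variance, morans_i)
    return scales[scores.index(max(scores))]
```

With the hypothetical inputs `scales = [10, 20, 30]`, `weighted_variance = [1.0, 2.0, 4.0]`, and `morans_i = [0.8, 0.3, 0.1]`, the intermediate SP of 20 obtains the highest combined score, reflecting the trade-off between oversegmentation and undersegmentation.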
In 2014, Yang et al. [
49] developed a multiband unsupervised approach based on measuring the spectral homogeneity of image segments generated to estimate appropriate SP automatically. According to their study, as the size of an image object increases, its corresponding spectral homogeneity continues to decrease until it conforms to a real-world object. Therefore, measuring spectral homogeneity can be considered as a proxy for objectively estimating the SP. In order to measure spectral homogeneity, the authors adopted spectral angle [
50]. The spectral angle indicates the amount of similarity between two pixels; that is, the more the two pixels are similar to each other, the smaller the value of the spectral angle is. Based on this fact, Yang et al. [
49] first calculated the spectral angle between each pair of two pixels in each segment. Then, the calculated mean spectral angle values of all the pairs of two pixels for each segment were averaged over the entire image. Finally, the SP resulting in the smallest value of mean spectral angle (i.e., the largest spectral homogeneity) was considered as an optimal SP.
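The procedure described above can be sketched in a few lines. This is a simplified illustration that follows the description literally, not the authors' implementation: pixels are represented as tuples of band values, each segment is a list of pixels, and each candidate segmentation is a list of segments. All function names are hypothetical.

```python
import math
from itertools import combinations


def spectral_angle(p, q):
    """Angle (radians) between two pixel spectra; smaller means more similar."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0:
        return 0.0
    # Clamp to guard against floating-point values slightly outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def mean_segment_angle(pixels):
    """Mean spectral angle over all pixel pairs in one segment."""
    pairs = list(combinations(pixels, 2))
    if not pairs:
        return 0.0
    return sum(spectral_angle(p, q) for p, q in pairs) / len(pairs)


def image_mean_angle(segments):
    """Average of the per-segment mean spectral angles over the image."""
    return sum(mean_segment_angle(s) for s in segments) / len(segments)


def best_scale(segmentations):
    """segmentations maps each SP to its list of segments; pick the smallest mean angle."""
    return min(segmentations, key=lambda sp: image_mean_angle(segmentations[sp]))
```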
In a recent study, Jozdani et al. [
20] developed a scene-independent unsupervised approach to optimizing the SP to extract urban buildings of different sizes. In that research, by assuming that the sizes of buildings in a given urban block are close to each other, a degree-2 polynomial regression model was established that associated appropriate SPs with the median size of buildings and the spatial resolution of the image. According to their experiments, it is possible to estimate appropriate multiple SPs to quickly extract differently sized buildings with reasonable accuracy and without relying on intensive computations.
In addition to the abovementioned approaches, other methods have been proposed that do not heavily rely on the SP to optimize segmentation. For instance, Martha et al. [
51] first performed an initial segmentation with a small SP to generate oversegmented results. These oversegmented results were then fed into the chessboard segmentation algorithm to be fine-tuned and merged, resulting in more appropriate final segmentation results without directly optimizing the SP. In a similar study, Witharana and Lynch [
31] combined the segmentation derived from the multi-threshold segmentation (MTS) algorithm with that derived from the MRS to improve segmentation results without applying an intensive trial-and-error process. In their method, MTS is first applied to the image to generate undersegmented, simplified image objects. Subsequently, MRS is performed on the simplified segments to obtain finer image objects whose boundaries more reasonably correspond to real-world features of interest. Finally, a straightforward trial-and-error process can be applied to the segmentation results of the MRS to further fine-tune image objects generated.
Rather than hybridizing the MRS algorithm with other segmentation algorithms to optimize segmentation and to improve classification accuracy, in a study by Stumpf and Kerle [
52], the main objective was to identify which features were more significant at each scale level to improve final classification results in the GEOBIA framework (in this case, distinguishing landslides from other image objects). For this purpose, they first performed multiple segmentations with different SP values (i.e., 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, and 100) and then calculated several features (i.e., spectral, textural, geometric, etc.) for the resulting image objects at different scale levels. Following this step, an RF-based variable-importance model was applied to the calculated features to recognize significant features at each segmentation level that could lead to more accurate classification results, namely smaller out-of-bag (OOB) error resulting from the RF model. To put it simply, instead of solely taking the SP into account to improve the extraction of objects of interest, final classification accuracy can be considered as a function of the SP and significant features calculated for image objects generated.
Although unsupervised methods to optimize the SP/segmentation do not require reference polygons, they can be still computationally intensive. For example, most existing unsupervised approaches require iteratively testing many different SP values to identify an optimal segmentation, which can be problematic especially when segmenting a large volume of remotely sensed images. Even though the framework proposed by Jozdani et al. [
20] avoids this iterative procedure, it is still limited to the appropriate extraction of urban buildings and thus cannot be generalized to other land covers in its current form.
4. Results and Discussion
Based on a title and abstract screening of the journal papers identified by the Scopus/Google Scholar search queries, 215 journal papers were identified as potentially relevant to our study. (More details on the selection and reviewing design in this research are shown in
Table 2 and
Table 3.) Examining these selected papers, we encountered three types of papers: (1) papers that used the MRS and provided all the details we needed for the RT modeling (i.e., the MRS parameters, corresponding land cover class(es), and spatial and radiometric resolutions), (2) papers that applied MRS but did not mention some or any of the information needed, and (3) papers that did not use MRS for the segmentation phase. After discarding the papers that did not use the MRS for segmentation, the main challenge was with the papers that did not report some values related to MRS parameters, image spatial resolutions, and/or image radiometric resolutions. To address this problem, we applied a different approach for each group of missing information. If no detail on the spatial resolution of the image was given in a paper, the spatial resolution of the panchromatic band was recorded (we assumed that the image had been pansharpened by the vendor or the author(s) prior to segmentation, as this was often the case in the other studies that did report the image spatial resolutions). If the radiometric resolution of an image was not reported, we also assumed that it had not been changed from its original resolution by the authors. Finally, in the case of lacking information on the MRS parameters, we used the default value for each of the parameters (i.e., shape = 0.1, compactness = 0.5), as these default parameters are used quite commonly (if the SP was not reported, we discarded the study from our analysis).
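The rules for handling missing values described above can be summarized in a short sketch. The record layout, field names, and function name are hypothetical; only the rules themselves (default shape of 0.1 and compactness of 0.5, panchromatic-band resolution as the fallback spatial resolution, original sensor bit depth as the fallback radiometric resolution, and discarding records without a reported SP) come from the text.

```python
DEFAULT_SHAPE = 0.1
DEFAULT_COMPACTNESS = 0.5


def fill_missing(record, pan_resolution_m=None, native_bits=None):
    """Return a cleaned copy of a record, or None if it must be discarded.

    record: dict with keys 'sp', 'shape', 'compactness', 'spatial_res_m',
    and 'radiometric_bits'; a value of None means "not reported".
    """
    if record.get("sp") is None:
        return None  # studies without a reported SP are dropped
    cleaned = dict(record)
    if cleaned.get("shape") is None:
        cleaned["shape"] = DEFAULT_SHAPE
    if cleaned.get("compactness") is None:
        cleaned["compactness"] = DEFAULT_COMPACTNESS
    if cleaned.get("spatial_res_m") is None:
        # Assume the image was pansharpened before segmentation
        cleaned["spatial_res_m"] = pan_resolution_m
    if cleaned.get("radiometric_bits") is None:
        # Assume the original bit depth of the sensor was kept
        cleaned["radiometric_bits"] = native_bits
    return cleaned
```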
As mentioned earlier, the common classes considered in the papers were trees, grass, bare soil, impervious features (including buildings, roads, asphalt, and other artificial features), buildings, water, roads, parking lot, and pools. Since some of these classes had overlap with one another, we merged similar classes with each other: the tree and grass classes were merged into a vegetation class, and the water and pool classes were merged into a water class. We decided not to consider the impervious class in our analysis because it often consisted of multiple more specific land cover classes (e.g., roads and buildings), which were also included in our analysis. Merging the overlapping classes led to two decisive advantages: (1) avoiding the construction of redundant regression models, and (2) increasing the number of samples in some classes that did not have sufficient samples for regression modeling before the merging. As a result, five final land cover classes (i.e., building, vegetation, road, bare soil, and water) were selected for subsequent analyses in this study.
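The class merging described above amounts to a simple lookup, sketched below with hypothetical label strings. The impervious class is deliberately left unmapped, as stated in the text. How parking lots were handled is not stated, so the sketch leaves that label unmapped as well rather than guessing.

```python
# Mapping from reported class labels to the five final classes.
CLASS_MAP = {
    "tree": "vegetation",
    "grass": "vegetation",
    "water": "water",
    "pool": "water",
    "building": "building",
    "road": "road",
    "bare soil": "bare soil",
}


def merge_class(label):
    """Return the merged class for a reported label, or None if it is not used."""
    return CLASS_MAP.get(label.strip().lower())
```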
4.1. Regression Modeling
The majority of the papers reviewed in this study employed only VHR optical data, although a few also used other types of remotely sensed data (e.g., RADAR, LiDAR, etc.) for segmentation. Because the number of papers that used non-optical images was not sufficient, we considered only optical images for the fitting process. In addition, VHR images are the most commonly used data for urban area mapping because of the high spatial heterogeneity of cities [45], so we further filtered the recorded data to include only VHR images. The images considered as VHR in this study were those with spatial resolutions of ≤5 m. However, due to a lack of data for images coarser than 3.2 m (the spatial resolution of IKONOS’s multispectral bands) and finer than 9 cm, we limited our analysis to images with spatial resolutions between 9 cm and 3.2 m. After all of these refinements, 39 papers remained, from which a total of 114 samples consisting of selected MRS parameters, spatial resolution, and radiometric resolution were used for regression modeling. Figure 3 gives the number of samples for each class. As can be seen in this figure, the largest and smallest classes were the vegetation (30 samples) and water (18 samples) classes, respectively.
In addition, in the studies we reviewed, an equal weight was assigned to all of the spectral bands that were used for segmentation. Not all of the studies used the same number of spectral bands, though, and some studies used only a subset of the available spectral bands for segmentation. Because we could not find evidence in the literature that spectral band selection/band weighting greatly affected the SP value(s) that were selected as most appropriate for segmentation (and also because of our already relatively small sample size), we chose not to exclude any studies from our analysis on the basis of the spectral bands that they utilized (or did not utilize) for segmentation.
To perform regression modeling on these samples, the dependent variable was chosen to be the SP, and the independent variables were chosen to be the spatial and radiometric resolutions, shape, and compactness. Because the main goal of this paper was to construct an equation separately for each land cover to estimate appropriate SPs, we fitted a separate RT model to the data of each class. Because of the limited number of data (specifically in the water class), no specific parameter was applied to the RT models, and no pruning was performed. The graphical representations of the equations of the RT models are shown in
Appendix C.
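To illustrate what fitting a separate, unpruned regression tree to each class involves, the sketch below grows a minimal tree by repeatedly choosing the split that minimizes the summed squared error of the two child nodes. It is a simplified stand-in written for illustration, not the software used in this study; the function names, the feature order (spatial resolution, radiometric resolution, shape, compactness), and the stopping rule are assumptions.

```python
def _sse(ys):
    """Sum of squared errors of targets around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def grow_tree(rows, targets, min_leaf=1):
    """Grow an unpruned regression tree; leaves store the mean SP."""
    leaf = {"value": sum(targets) / len(targets)}
    if len(set(targets)) == 1:
        return leaf
    best = None  # (score, feature index, threshold)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = _sse(left) + _sse(right)
            if best is None or score < best[0]:
                best = (score, f, thr)
    if best is None:
        return leaf  # no valid split (identical feature values)
    _, f, thr = best
    li = [i for i, r in enumerate(rows) if r[f] <= thr]
    ri = [i for i, r in enumerate(rows) if r[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": grow_tree([rows[i] for i in li], [targets[i] for i in li], min_leaf),
        "right": grow_tree([rows[i] for i in ri], [targets[i] for i in ri], min_leaf),
    }


def predict(tree, row):
    """Follow the splits down to a leaf and return its mean SP."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


def fit_per_class(samples):
    """samples: iterable of (class label, feature tuple, SP); one tree per class."""
    by_class = {}
    for label, features, sp in samples:
        rows, ys = by_class.setdefault(label, ([], []))
        rows.append(features)
        ys.append(sp)
    return {label: grow_tree(rows, ys) for label, (rows, ys) in by_class.items()}
```

Because no pruning is applied, such a tree reproduces its training samples exactly whenever their feature values differ, which is one reason the predictions for classes with few samples should be treated with caution.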
4.2. Evaluation of Image Segmentation Results
4.2.1. Applying the RT Model Results
As mentioned earlier, to estimate SPs using the RT models, one needs to incorporate the spatial and radiometric resolutions of the image (i.e., image-based information), and the shape and compactness parameters (MRS parameters information). The default values for the shape and compactness parameters are commonly set to 0.1 and 0.5, respectively, which can be incorporated into the models to estimate the SP for each land cover. However, assigning the default values to these parameters can cause some degree of uncertainty and bias in the SPs that are derived from the RT models for different land covers. To address this problem, we first calculated the mean of each of these parameters for each class separately (
Table 4) and then used their mean values to estimate the SPs for the corresponding classes. As examples of our RT model outputs,
Table 5 shows the class-specific SPs estimated for several common VHR satellite sensors as well as the airborne sensors used in our test images.
4.2.2. Visual Evaluation Results for the Test Images
For a visual evaluation of results, we segmented the test images using the SPs identified by the RT models (as reported in
Table 5), in combination with the shape/color and compactness parameters given in
Table 4. It is worth noting that because the data used to construct the RT models were assumed to be pansharpened (if applicable), we also pansharpened (using the Gram-Schmidt Spectral sharpening method) the satellite images for segmentation. In
Figure 4,
Figure 5,
Figure 6 and
Figure 7, the segmentation results for different classes in four of the six test images (the 30-cm, 50-cm, 75-cm, 1-m) are depicted (for the sake of brevity, the segmentation results for the 25-cm and 65-cm test images are instead shown in
Appendix B). For the 30-cm test image (
Figure 4), we selected four representative land covers (i.e., building, vegetation, road, and water) to visually evaluate the performance of the respective RT models. Given the complexity of the building rooftops, the segmentation result of the buildings was generally satisfactory. However, some degree of undersegmentation occurred in areas where the rooftops had a very similar appearance to nearby non-building land covers. The vegetation class was also extracted relatively accurately, as green spaces (e.g., trees and grass) were distinguished from other land covers in most cases. The problem with green spaces in remotely sensed imagery is that their extraction could be highly scale-dependent; that is, using a single SP, it might not be possible to, for example, extract both urban forests and single trees simultaneously. This problem also occurred in this image, where trees, grass, and urban forests were all present at the same time. As a result, it can be seen that the large urban forests were oversegmented, while a few areas containing small patches of grass or single trees were undersegmented (mixed with the pixels of roads). One way to mitigate this issue could be to construct a separate RT model for each type of vegetation (e.g., one for single trees, one for tree patches, one for parks, and so on), although it may be difficult to find sufficient samples from the literature to construct these models. The results for the extraction of the roads were also generally satisfactory, having some acceptable degree of oversegmentation and no excessive undersegmentation. In general, roads could be very difficult to be appropriately extracted because of various nearby non-road features (e.g., cars, shadows, vegetation, asphalt defects, etc.) that hinder the segmentation algorithm from detecting correct boundaries of roads. The RT model established for the extraction of water bodies in this test image also performed relatively well in segmenting the pond in the image.
In the 50-cm test image (
Figure 5), the accuracy of the extracted building, road, and water features was evaluated. Undersegmentation of the extracted buildings from this image was more prevalent than in the 30-cm image. One reason for this problem could be the highly spectrally similar information of some buildings to their neighboring buildings. Nevertheless, this type of undersegmentation did not cause the buildings to be mixed with other land covers. In contrast to the 30-cm test image, the roads present in the 50-cm image were not extracted properly, resulting in oversegmentation. The extraction of the small lake in this image also led to some oversegmentation, but not as much as was observed for roads.
In the 75-cm test image (
Figure 6), it is evident that vegetation, road, and bare soil land cover features were extracted better than buildings and water bodies. There was again an apparent degree of oversegmentation in the extracted water bodies in this image. Moreover, some of the buildings were undersegmented, which was undesirable.
As with the segmentation results of the 50 cm and 75 cm test images, the water bodies in the 1-m image (
Figure 7) were oversegmented. The buildings in this image, however, were more successfully extracted than those of the 75 cm image, although a few undersegmented buildings could be still observed. As with the results of the 30 cm image, the road class was segmented very reasonably in this test image, although there were similar spectral properties between the roads and nearby features, specifically vegetation. The vegetation land cover extracted using the SP estimated by the corresponding RT model did not generally behave consistently in this test image; that is, in some cases (e.g., over urban forests), some level of oversegmentation was apparent, but in some other cases, where the nearby buildings had similar spectral information to vegetation, undersegmentation occurred.
4.2.3. Quantitative Evaluation Results for the Test Images
Table 6 and
Figure 8 show the values of the supervised metrics calculated for the segmentations selected by our RT models for all the six test images, as well as the values for the segmentations selected by the ESP tool, and the results for a range of other segmentations. From
Figure 8, it can be seen that our RT models produced more accurate segmentations than the ESP tool (lower
D values and higher
F values) for the building and vegetation classes, while the ESP tool performed better than our RT model for the water class. For the building and vegetation classes, our RT models also had segmentation accuracies that were among the highest of the other SPs tested, while for the water class our RT model produced less accurate segmentations than many of the other tested SPs. The main problem with our RT model predictions for the water class was likely due to the large differences in the sizes/shapes of water features extracted in the past studies, e.g., some studies aimed to accurately segment pools, others aimed to segment linear water features (rivers or canals), and yet others large water bodies (e.g., ponds or lakes). From
Table 6, it is also clear that our RT models (as well as the ESP tool) resulted in more undersegmentation than oversegmentation for the building and vegetation classes, but more oversegmentation than undersegmentation for the water class. Comparing the
D and
F values of the RT models obtained for the building and vegetation classes, the building class was extracted more accurately, perhaps because the polygons for the vegetation class were more diverse in size and spectral heterogeneity (e.g., some polygons were of individual trees, some were larger grassy areas).
4.3. General Discussion
As the experiments in this study showed, different SPs are needed to segment different land cover types successfully. In other words, as already suggested in a number of studies, a multi-scale approach is more effective than a single-scale segmentation approach for extracting different land cover features in a given image. Although the level of complexity of different features can vary from landscape to landscape and from image to image, thus affecting the selection of appropriate SPs, this study indicated that it is possible to narrow down the range of suitable SPs for segmenting a specific type of land cover using information from the literature and regression modeling. Previous studies mainly emphasized the inverse relationship between the SP selected and the spatial resolution of an image. However, other image characteristics also contribute to the selection of SPs. In this research, in contrast to many previous studies, we considered the radiometric resolution of the image as another determinant of the SP, although its effect on the SP was much smaller than that of the spatial resolution. Furthermore, the evaluation results showed that such approaches to the parametrization of the SP are only a proxy for narrowing down the wide range of possible SPs to segment a given image optimally. Nevertheless, such an approach could significantly reduce the time and labor required to generate desirable segmentation results.
The RT models presented in this research can be applied to other urban areas as well. However, the RT models may fail to estimate appropriate SPs where the urban structure differs significantly from that of the areas whose data were used to construct them. The main cause of this problem is that the fitted RT models, which were generated using a limited number of samples, cannot completely account for all the variability encountered in the real world. One of the most viable ways to address this problem is to increase the amount and diversity of the data used to generate the RT models, so that they can be further generalized and better adapted to more types of urban areas. In this regard, we have provided all of the data that we used to construct the RT models in this study (as a Supplementary Table), so that other researchers can build upon it to achieve better results in the future (e.g., by adding more data for their classes/study areas of interest and then rerunning the RT models).
Another caveat that should be noted here is that in some studies that we reviewed and extracted details from to establish the models, a single universal SP (e.g., an SP of 10) had been used to extract all of the land covers, potentially resulting in over- or undersegmentation of some classes in these studies. This could have affected the accuracy of some of our RT models, especially for those land cover classes with small sample sizes. For instance, the oversegmented results of the water class in most of our cases demonstrate the negative effect of applying a single SP to extract multiple types of land cover present in an image. Although it is generally agreed that oversegmentation is less harmful than undersegmentation, it could still deteriorate the classification results because potentially informative non-spectral information (e.g., size and shape of land cover features) will not be utilized for training the classifier [
36]. It should therefore be emphasized again that segmentation quality in the GEOBIA framework should not be neglected by assigning a universal (and, in most cases, very small) SP value to extract different types of land cover in urban areas, whose complexity calls for more attention in the segmentation phase.
In addition to the aforementioned problem affecting the quality of the RT models, it is important to note that the predicted SP values for the classes with lower sample sizes should be used with greater caution, and may require more fine-tuning by the user than for other classes (e.g., users may want to perform segmentation using a wider range of SP values greater/less than our identified SP value to select the most appropriate SP value for their own images).
In this study (as in many other GEOBIA studies), we have found that different land cover classes were better represented at different segmentation levels (i.e., using different SP values). However, in many cases, it is desirable to produce a multi-class land cover map (rather than separate maps of each individual land cover class). To use our approach for multi-class land cover map production, one possibility would be to perform classification in a step-wise manner (e.g., classifying the land cover types with lower SP values first, or vice versa), as has been done in several other GEOBIA studies that utilized multiple segmentation levels for classification [
37,
54,
55,
56]. More sophisticated solutions for multi-scale classification also exist (and may indeed work better than our simple example), but a deeper discussion/comparison of these methods is not provided here because it is outside the main focus of our study.
5. Conclusions
In this study, we conducted a meta-analysis to investigate the potential of employing information derived from past GEOBIA urban land cover mapping studies to select appropriate, i.e., reasonably accurate, segmentation parameters (specifically the scale parameter (SP)) for the commonly used multiresolution segmentation (MRS) algorithm. In this regard, we reviewed peer-reviewed journal papers involving GEOBIA for urban land cover mapping published from 2010 to 2017, and extracted data on the MRS parameters used, image spatial/radiometric resolutions used, and land cover types mapped, from a total number of 39 papers selected from 215 potentially relevant papers. Afterward, considering five classes (i.e., building, vegetation, road, bare soil, and water), we applied an RT model to the corresponding data of each class to predict the appropriate SP. The experiments performed on six test images (two pansharpened satellite images, and four aerial images) with different spatial resolutions (25 cm, 30 cm, 50 cm, 65 cm, 75 cm, 1 m) and with different radiometric resolutions (8 bits, 11 bits) showed that it was possible to narrow down the wide range of possible SPs that can be used to segment remotely sensed imagery for urban land cover mapping. Therefore, the RT models and equations from this study can be applied to different urban areas, although there would be no guarantee that the results would be suitable in every case. This study also confirmed the conclusions drawn by a number of other studies on GEOBIA that highlight the importance of applying a multi-scale approach to segment and extract different land cover types more accurately.