**3. Experiments**

We implemented all code in Python 3.6; the CRF algorithm was implemented with the PyDenseCRF package. To analyze the correction effect on the classification result images of deep semantic segmentation models, this study uses images from Vaihingen and Potsdam in the "semantic labeling contest of the ISPRS WG III/4 dataset" as the two test datasets.

#### *3.1. Comparison of CRF and ELP on Vaihingen Dataset*

#### 3.1.1. Method Implementation and Study Images

We introduced five commonly used DSSNN models as testing targets: FCN8s, FCN16s, FCN32s, SegNet, and U-Net. All of these deep models are implemented with the Keras package, and each takes a 224 × 224 image patch as input and outputs a corresponding semantic segmentation result. Five image files from the Vaihingen dataset were selected as the study images; they are listed in Table 1.


**Table 1.** Five study images from the Vaihingen dataset.

All five images contain five categories: impervious surfaces (I), buildings (B), low vegetation (LV), trees (T), and cars (C). These images have three spectral bands: near-infrared (NIR), red (R), and green (G). The study images and their corresponding ground truth images are shown in Figure 4.

We selected study images 1 and 3 as training data and used all the images as test data. Study images 1 and 3 and their ground truth images were cut into 224 × 224 image patches with 10-pixel intervals; all the patches were stacked into a training set, and all the deep semantic segmentation models were trained on this training set.
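The patch-cutting step above can be sketched as follows. This is a minimal illustration only; the function name and array layout are our assumptions, not the authors' code:

```python
import numpy as np

def extract_patches(image, label, patch=224, stride=10):
    """Cut an image and its aligned label map into square patches.

    image: (H, W, C) array; label: (H, W) array of class indices.
    Patches are taken at every `stride` pixels, matching the paper's
    224 x 224 windows with 10-pixel intervals.
    """
    h, w = label.shape
    imgs, labs = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            imgs.append(image[y:y + patch, x:x + patch])
            labs.append(label[y:y + patch, x:x + patch])
    # Stack all windows into one training array each.
    return np.stack(imgs), np.stack(labs)
```

The stacked arrays can then be fed directly to Keras model training.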

This study used two methods to compare correction ability:

(1) CRF: We compared our method with the traditional CRF method. For each classification result image, the CRF was executed 10 times, and each time, the corresponding correct-distance parameters were set to 10, 20, ... , 100. Since all the study images have corresponding ground truth images, the CRF algorithm selects the result with the highest accuracy among the 10 executions.

(2) ELP: Using the proposed method, the threshold value parameter α was set to 0.05, and the CRF's correct-distance parameters in the LocalizedCorrection algorithm were set to 10. The number of ELP iterations was set to 10. Since ELP emphasizes "end-to-end" ability, no ground truth is needed to analyze the true classification accuracy during the iteration process; therefore, ELP directly outputs the result of the last iteration as the corrected image.
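The ELP configuration described in (2) amounts to the following outer loop. This is a schematic sketch only: `find_suspicious_areas` and `localized_correction` are placeholders standing in for the paper's HList-based localization and CRF-based LocalizedCorrection steps, whose internals are not reproduced here:

```python
def elp_correct(result, image, alpha=0.05, crf_distance=10, n_iter=10,
                find_suspicious_areas=None, localized_correction=None):
    """Outer loop of ELP as configured in the text: alpha = 0.05,
    CRF correct-distance = 10, 10 iterations. No ground truth is
    consulted; the last iteration's image is returned as the result."""
    for _ in range(n_iter):
        # Locate suspicious areas in the current result (placeholder).
        areas = find_suspicious_areas(result, image, alpha)
        for area in areas:
            # Apply a localized CRF correction to each area (placeholder).
            result = localized_correction(result, image, area, crf_distance)
    return result
```

Because no accuracy check gates the loop, the method stays end-to-end: it simply emits the image after the fixed number of iterations.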

**Figure 4.** The study images and their corresponding ground truth from the Vaihingen dataset.

#### 3.1.2. Classification Results of Semantic Segmentation Models

We used all five deep semantic segmentation models as end-to-end classifiers to process five study images. The classification results are illustrated in Figure 5.

As shown in Figure 5, because study images 1 and 3 are used as training data (the deep neural network is sufficiently large to "remember" these images), the classification results by the five models for study images 1 and 3 are close to "perfect": almost all the ground objects and boundaries are correctly identified. However, because just two training images cannot exhaustively represent all the boundaries and object characteristics, these models cannot perfectly process study images 2, 4, and 5. As shown in Figure 5, there are obvious defects and boundary deformations, and many objects are misclassified in large areas. Based on the ground truth images, the classification accuracies of these result images are as follows:


Table 2 shows that because study images 1 and 3 are training images, all five models' classification accuracies of these two images are above 95%, which is a satisfactory result. However, on study images 2, 4, and 5, due to the many errors on ground objects and boundaries, all five models' classification accuracies degrade to approximately 80%. Therefore, it is necessary to introduce a correction mechanism to correct the boundary errors in these images.
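The per-pixel overall accuracy used in these comparisons can be computed as a simple match ratio against the ground truth; the helper name below is ours:

```python
import numpy as np

def overall_accuracy(pred, truth):
    """Overall per-pixel accuracy: the fraction of pixels whose
    predicted class index equals the ground truth class index."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    return float((pred == truth).mean())
```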

**Figure 5.** Classification results of five deep semantic segmentation models.

**Table 2.** Classification accuracy for the study images.


#### 3.1.3. Comparison of the Correction Characteristics of ELP and CRF

To compare the correction characteristics of ELP and CRF, this section uses U-Net's classification result for test image 5 and applies ELP and CRF to correct a subarea of the result image. The detailed results of the two algorithms over iterations (ELP) and executions (CRF) are shown in Figure 6.

**Figure 6.** Detailed comparison of the results of the ELP and conditional random field (CRF) algorithms. (**a**) Ground truth and U-Net's classification results of a subarea of test image 5; (**b**) results of ten iterations of ELP; (**c**) results of ten executions of CRF.

As shown in Figure 6a, for the sub-image of test image 5, the classification results obtained by U-Net are far from perfect, and the boundaries of objects are blurred or chaotic. Especially at locations *A*, *B*, and *C* (marked by the red circles), the buildings are confused with impervious surfaces, and the buildings contain large holes or misclassified parts. On this sub-image, the classification accuracy of U-Net is only 79.02%.

Figure 6b shows the results of the 10 ELP iterations. As the method iterates, the object boundaries are gradually refined, and the errors at locations *A* and *B* are gradually corrected. By the 5th iteration, the hole at position *A* is completely filled, and by the 7th iteration, the errors at location *B* are also corrected. For location *C*, because our algorithm follows an end-to-end process, no samples exist in this process to determine which part of the corresponding area is incorrect; therefore, location *C* is not significantly modified during the iterations. Nevertheless, the initial classification error is not enlarged. As the iteration continues, the resulting images change little from the 7th to 10th iterations, and the algorithm's result becomes stable.

Figure 6c shows the results of the CRF. From executions 1 to 3, it can be seen that the CRF can also perform boundary correction. After the 4th execution, the errors at locations *A* and *B* are corrected. It can also be seen that at location *C*, part of the correctly classified building roof was changed into impervious surfaces, further exaggerating the errors. The reason for this outcome is that at location *C*, for the corresponding roof color, the correctly classified part is smaller than the misclassified part, and in a global correction context, the CRF algorithm more readily replaces relatively small parts. At the same time, as the executions progress, errors gradually appear due to the CRF's correction process (as marked in orange); some categories that were originally not dominant in an area (such as trees and cars) shrink considerably, and the classification accuracy continues to decrease with further executions.

Based on the ground truth image, we evaluate the classification accuracy of the two methods after each iteration/execution as shown in Table 3:


**Table 3.** Comparison of two algorithms by iteration/execution.

As seen from Table 3, compared to the original classification result, whose accuracy is 79.02%, the ELP's classification accuracy increases to 80.93% after the first iteration (the lowest among its ten iterations) and reaches 85.81% by the 8th iteration, an improvement of 6.79%. The CRF's classification accuracy is 79.41% after the first execution, and it reaches its highest accuracy of 83.76% in the 4th execution; subsequently, the classification accuracy gradually declines during the remaining executions, falling to 73.86% by the 10th execution, 5.16% below the original classification accuracy. A graphical comparison of the classification accuracy of the two methods is shown in Figure 7:

**Figure 7.** Graphical comparison of the classification accuracy of the two methods.

In Figure 7, the black dashed line indicates the original classification accuracy of 79.02%. The CRF's accuracy improvement was slightly better than that of the ELP algorithm in the second, third, and fourth executions; however, its classification accuracy decreases rapidly in the later executions, and by the ninth execution it falls below that of the original classification result image. In contrast, the classification accuracy of ELP increases steadily and, after approaching its highest value, remains relatively stable in subsequent iterations. From the above results, ELP not only achieves a better correction result but also avoids the obvious accuracy reductions caused by performing too many iterations. In end-to-end application scenarios where no samples participate in the result evaluation, we cannot know when the highest correction result has been reached, so the ideal termination condition is also unknown. Likewise, specifying too small a correct-distance parameter causes under-correction, while too large a parameter causes over-correction. The relatively stable behavior and greater accuracy of ELP clearly allow it to achieve better processing results than CRF.

#### 3.1.4. Correction Results Comparison

The correction results of the CRF method are shown in Figure 8.

As shown in Figure 8, for images 1 and 3, because the classification accuracy is already relatively high, less room exists for correction, and the resulting images change only slightly. For images 2, 4, and 5, although the large errors and holes are corrected, numerous incorrect borders are present, misclassified pixels appear in shadowed parts of the ground objects, and many small objects (e.g., cars or trees) are erased by larger objects (e.g., impervious surfaces or low vegetation). For ELP, the correction results are shown in Figure 9.

**Figure 8.** CRF correction results.

As Figure 9 shows, ELP also corrects the large errors and holes but does not produce overcorrection errors in the shadowed parts of the ground objects, and small objects are not erased. Therefore, in general, the correction results of ELP are better than those of CRF. Correction accuracy comparisons of the two algorithms are shown in Table 4.



**Figure 9.** ELP correction results.

As shown in Table 4, for study images 1 and 3, because the original classification accuracy is high, the correction results of CRF and ELP are similar to the original classification result, and the improvements are limited. On study images 2, 4, and 5, the ELP's average improvements are 6.78%, 7.09%, and 5.83%, respectively, while the corresponding CRF improvements are only 2.84%, 2.88%, and 2.74%. Thus, ELP's correction ability is significantly better than that of CRF.

#### *3.2. Comparison of Multiple Post-Processing Methods on Potsdam Dataset*

#### 3.2.1. Test Images and Methods

This study introduces four images from the Potsdam dataset, which are listed in Table 5.


**Table 5.** Four study images from the Potsdam dataset.

We selected two images as training data and the other two as test data. Three bands (red (R), green (G), and blue (B)) were selected. These images contain six categories: impervious surfaces (I), buildings (B), low vegetation (LV), trees (T), cars (C), and clutter/background (C/B). The study images and their corresponding ground truth images are shown in Figure 10.

**Figure 10.** The study images and their corresponding ground truth from the Potsdam dataset.

To further evaluate ELP's ability, this paper compares four methods:

(1) U-Net + CRF: Use U-Net to classify an image and use CRF to perform post-processing.

(2) U-Net + MRF: Use U-Net to classify an image and use the Markov random field (MRF) to perform post-processing.

(3) DeepLab: Adopt the DeepLab v1 model; DeepLab v1 has a built-in CRF as its last processing component and can therefore obtain more accurate boundaries than a model without the CRF component.

(4) U-Net + ELP: Use U-Net to classify an image and use ELP to perform post-processing.

#### 3.2.2. Process Results of Four Methods

For the two testing images, the final process results of the four methods are illustrated in Figure 11.

As can be seen in Figure 11, because U-Net + CRF uses a global CRF processing strategy, there are many overcorrected areas, and some objects in the result image contain chaotically misclassified pixels. For U-Net + MRF, the majority of noise pixels are removed, but the correction effects are not obvious. DeepLab's CRF is performed on each image patch rather than on the whole image, so the overcorrection phenomenon is somewhat weaker than that of U-Net + CRF. U-Net + ELP obtains the best classification among all of the methods. The classification accuracies of the four methods are presented in Table 6.


**Table 6.** Classification accuracies of the four methods.

**Figure 11.** Results of the four methods.

As shown in Table 6, U-Net + MRF achieves the lowest classification accuracy, U-Net + CRF and DeepLab are higher than U-Net + MRF, and U-Net + ELP achieves the best classification accuracy.

#### 3.2.3. Analysis of Computational Complexity

To analyze the computational complexity of the methods, we use the four methods to process testing image 1 and run each process five times. The experiments are performed on a computer (i9 9900K/64 GB/RTX 2080 Ti 11 GB), and the average process times are listed in Table 7.


**Table 7.** Process time of four methods.

As shown in Table 7, because the U-Net model can make full use of the graphics processing unit (GPU), and the processing speed of CRF and MRF on the whole image is also fast, U-Net + CRF and U-Net + MRF obtain results in a short time. DeepLab performs the CRF after classifying each image patch, so it needs no separate post-processing stage; however, the patch-based CRF requires additional data access time and duplicates pixels at the patch borders, so its process time is similar to that of U-Net + CRF and U-Net + MRF. Since U-Net + ELP adopts the same deep model, its process time for the first three steps is similar to that of U-Net + CRF and U-Net + MRF, but its post-processing stage takes much longer than that of the other methods.
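Averaged timings like those in Table 7 can be collected with a simple harness; this is a sketch, and the helper name is ours:

```python
import time

def average_runtime(fn, *args, repeats=5):
    """Run fn(*args) `repeats` times and return the mean wall-clock
    time in seconds. perf_counter is monotonic and high-resolution,
    so it is suitable for timing both CPU and GPU-dispatching code."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```

In practice, each of the four pipelines would be wrapped in a function and passed to this harness with `repeats=5`, matching the five runs described above.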

For the ELP algorithm, the *HList* is updated at each iteration, and the suspicious areas marked by *HList* change constantly. Each suspicious area must be processed by the CRF method, so the processing complexity of ELP varies with the complexity of the image content. Although each suspicious area is small, ELP's greater number of iterations, its localization process, and its result-image update mechanism introduce an additional computational burden, so ELP needs more process time than the traditional methods.

#### 3.2.4. Analysis of Different Threshold Parameter Values of ELP

The threshold value α of ELP determines the choice of suspicious areas. To test the influence of this parameter, we chose U-Net + ELP to process testing image 1 and varied α from 0 to 0.09 with an interval of 0.01. The classification accuracy is shown in Table 8.


**Table 8.** Classification accuracy comparison of different threshold values α.

As can be seen from Table 8, when α is less than 0.06, the classification accuracy does not change significantly; when α is larger than 0.07, the classification accuracy decreases. The main reason for this phenomenon is that when α is small, ELP is more sensitive in discovering suspicious areas; however, too many suspicious areas merely increase the computational burden without contributing obvious changes in accuracy. In contrast, when α is larger, ELP has a diminished capability to discover suspicious areas, and many suspicious areas that need correction are omitted, which causes a decrease in accuracy. At the same time, we can see from Table 8 that over a large range (0.00 to 0.06), the accuracy of ELP does not change greatly, which reveals that ELP is stable with respect to the threshold value α.

#### 3.2.5. Analysis of Different Segment Numbers of ELP

The ELP method adopts the SLIC algorithm as its segmentation method, and an important parameter of SLIC is *Nsegment*, which determines the number of segments the algorithm produces. When *Nsegment* is assigned an overly small value, under-segmentation may appear; conversely, when *Nsegment* is assigned an overly large value, over-segmentation may appear. To test the influence of this parameter on ELP, we varied *Nsegment* from 1000 to 10,000 with an interval of 1000. The classification accuracy of testing image 1 by U-Net + ELP is shown in Table 9.
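A SLIC segmentation with a chosen *Nsegment* can be obtained, for example, with scikit-image; this sketch assumes the `skimage` implementation rather than the authors' exact setup:

```python
import numpy as np
from skimage.segmentation import slic

def segment_image(image, n_segments=4000):
    """Partition an (H, W, 3) image into roughly n_segments SLIC
    superpixels; returns an (H, W) integer label map. The actual
    number of segments may differ slightly from the request."""
    return slic(image, n_segments=n_segments, compactness=10)
```

The label map then delimits the regions that ELP's localized correction operates on; for a 6000 × 6000 tile, the mid-range values (*Nsegment* = 4000 to 8000) discussed below keep each superpixel large enough to be meaningfully corrected.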


**Table 9.** Classification accuracy comparison of different segment numbers.

It can be seen from Table 9 that when *Nsegment* = 1000 to 3000, because the image is large (6000 × 6000) and the segment number is relatively small, the image is under-segmented, and each segment may contain pixels with different colors or brightness. This situation makes it difficult for ELP to focus on suspicious areas, and the classification accuracy is low. When *Nsegment* = 9000 to 10,000, the image is obviously over-segmented, and the segments are too small. This leads ELP to make only small updates in its LocalizedCorrection algorithm and results in poor performance. For *Nsegment* = 4000 to 8000, the classification accuracy of ELP does not vary greatly, which indicates that ELP does not place restrictive requirements on the segment parameter; as long as the segmentation method correctly separates regions with similar colors/brightness and the segments are not too small, ELP can achieve satisfactory results.
