**2. Methodology**

#### *2.1. Deep Semantic Segmentation Neural Networks' Classification Strategy and Post-Processing*

Deep semantic segmentation neural networks such as FCNs, SegNet, and U-Net have been widely studied and applied by the remote sensing image classification research community. The training and classification processes of these networks are illustrated in Figure 1.

**Figure 1.** End-to-end classification strategy.

As shown in Figure 1, the end-to-end classification strategy is usually adopted for DSSNNs' training and classification. During the training stage, a set of remote sensing images *ImageSet* = {*I*1, *I*2, ... , *In*} is manually interpreted into a ground truth set *GroundTruthSet* = {*Igt*1, *Igt*2, ... , *Igtn*}; the *GroundTruthSet* is then divided into patches to construct the training dataset, from which the classification model *Mendtoend* is obtained. During the classification stage, the model is used to classify a completely new remote sensing image *Inew* (not an image from *ImageSet*). This strategy achieves a high degree of automation: the classification process is independent of the training data and the training algorithm, so newly obtained images, or other images of the same area, can be classified automatically with *Mendtoend*, forming an input-to-output (end-to-end) structure. This strategy is therefore particularly valuable in practical applications where massive amounts of remote sensing data must be processed quickly.

However, the classification results of the end-to-end classification strategy are usually not "perfect", and they are affected by two factors. On the one hand, because the training data are constructed by manual interpretation, it is difficult to provide training ground truth images that are precise at the pixel level (especially at the boundaries of ground objects). Moreover, the incorrectly interpreted areas of these images may even be amplified through the repetitive training process [16]. On the other hand, during data transfer among the neural network layers, some spatial context information may be lost while high-level spatial features are being obtained [35]. Therefore, the classification results obtained by the end-to-end classification strategy may contain many flaws, especially at ground object boundaries. To correct these flaws, the computer vision research community usually adopts the conditional random field (CRF) method in the post-processing stage to refine the result image. The conditional random field can be defined as follows:

$$P(X|F) = \frac{1}{Z(F)} \exp\left(-\sum_{c \in C_g} \phi_c(X_c|F)\right),\tag{1}$$

where *F* is a set of random variables {*F*1, *F*2, ... , *FN*}, in which *Fi* is the vector of pixel *i*; *X* is a set of random variables {*x*1, *x*2, ... , *xN*}, where *xi* is the category label of pixel *i*; *Z*(*F*) is a normalizing factor; and *c* is a clique in a set of cliques *Cg*, each of which induces a potential ϕ*c* [23,24]. By calculating Equation (1), the CRF adjusts the category label of each pixel and thereby corrects the result image. The CRF is highly effective at processing images that contain only a small number of objects. However, the numbers, sizes, and locations of objects in remote sensing images vary widely, and the traditional CRF tends to perform a global optimization over the entire image. This process leads to some ground objects being excessively enlarged or reduced. Furthermore, if the shadowed and unshadowed parts of ground objects are processed in the same manner, the CRF result will contain more errors [31]. In our previous work, we proposed a method called the restricted conditional random field (RCRF) that can handle these situations [31]. Unfortunately, the RCRF requires samples to control its iteration termination and to produce an integrated output image. When integrated into the classification process, this need for samples causes the whole classification process to lose its end-to-end characteristic; thus, the RCRF cannot be integrated into an end-to-end process. In summary, to address the above problems, the traditional CRF method needs to be further improved by adding the following characteristics:

(1) End-to-end result image evaluation: Without requiring samples, the method should be able to automatically identify which areas of a classification result image may contain errors. By identifying areas that are strongly suspected of being misclassified, we can limit the scope of the CRF processing and analysis.

(2) Localized post-processing: The method should be able to transform the whole-image post-processing operation into local corrections, separating the various objects or different parts of objects (such as roads in shadow or not in shadow) into sub-images to alleviate the negative impacts of differences in the number, size, location, and brightness of objects.
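To make Equation (1) concrete, the following minimal sketch enumerates the Gibbs distribution of a toy CRF with three pixels, two labels, and unary potentials only; the potential values and the unary-only simplification are illustrative assumptions, not details from this paper.

```python
import itertools
import math

# Toy CRF over 3 pixels with 2 labels (Equation (1)), unary potentials only.
# phi[i][x] is an assumed potential for assigning label x to pixel i.
phi = [
    [0.2, 1.5],  # pixel 0: low potential (preferred) for label 0
    [1.4, 0.3],  # pixel 1: low potential for label 1
    [0.1, 2.0],  # pixel 2: low potential for label 0
]

def energy(labels):
    """Sum of clique potentials; here every clique is a single pixel."""
    return sum(phi[i][x] for i, x in enumerate(labels))

# Partition function Z(F): normalizes exp(-energy) over all configurations.
configs = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(c)) for c in configs)

def prob(labels):
    """P(X|F) = exp(-sum of clique potentials) / Z(F)."""
    return math.exp(-energy(labels)) / Z

# The most probable labeling follows each pixel's lowest potential.
best = max(configs, key=prob)
```

Maximizing *P*(*X*|*F*) here reduces to minimizing the total potential; a practical CRF additionally uses pairwise potentials between neighboring pixels, which is what lets it smooth labels across object boundaries.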

To achieve this goal, a new mechanism must be introduced to improve the traditional CRF algorithm for the post-processing of remote sensing classification results.

#### *2.2. End-to-End Result Image Evaluation and Localized Post-Processing*

The majority of evaluation methods for classification results require samples with category labels that allow the algorithm to determine whether the classification result is good; however, to achieve end-to-end evaluation of classification results, samples cannot be required during the evaluation process. In the absence of testing samples, although it is impossible to indicate exactly which pixels are incorrectly classified, we can still find areas that are highly suspected of containing classification errors by applying certain conditions.

Therefore, we need to establish a relation between the remote sensing image and the classification result image and find the areas where the colors (bands) of the remote sensing image are consistent but the classification results are inconsistent; these areas may belong to the same object yet be incorrectly classified into different categories. Such areas are strong candidates for containing incorrectly classified pixels. Furthermore, we try to correct these errors within a relatively small area.
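As a minimal sketch of this idea, the helper below flags color-consistent segments whose classification labels disagree; the purity threshold and the flat per-pixel list format are illustrative assumptions, not details from this paper.

```python
from collections import Counter

def suspicious_segments(segment_labels, class_labels, purity_threshold=0.9):
    """Flag segments whose classification labels are inconsistent.

    segment_labels, class_labels: flat per-pixel lists of equal length.
    A segment whose dominant class covers less than purity_threshold of
    its pixels is suspected of containing misclassified pixels.
    """
    pixels_per_segment = {}
    for seg, cls in zip(segment_labels, class_labels):
        pixels_per_segment.setdefault(seg, []).append(cls)

    flagged = set()
    for seg, classes in pixels_per_segment.items():
        dominant_count = Counter(classes).most_common(1)[0][1]
        if dominant_count / len(classes) < purity_threshold:
            flagged.add(seg)
    return flagged
```

A segment whose pixels all carry the same class label passes the check; a segment mixing several labels over a color-consistent area is exactly the kind of region the text singles out for localized correction.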

To achieve the above goals, for a remote sensing image *Iimage* and its corresponding classification image *Icls*, the method proposed in this paper is illustrated in Figure 2:

**Figure 2.** End-to-end result image evaluation and localized post-processing.

As shown in Figure 2, we use four steps to perform localized correction:

(1) Remote sensing image segmentation

We need to segment the remote sensing image based on color (band value) consistency. In this paper, we adopt the simple linear iterative clustering (SLIC) algorithm as the segmentation method. The algorithm initially contains *k* clusters. Each cluster is denoted by *Ci* = {*li*, *ai*, *bi*, *xi*, *yi*}, where *li*, *ai*, and *bi* are the color values of *Ci* in CIELAB color space, and *xi*, *yi* are the center coordinates of *Ci* in the image. For two clusters *Ci* and *Cj*, the SLIC algorithm compares color and spatial distances simultaneously, as follows:

$$distance_{color}(C_i, C_j) = \sqrt{(l_i - l_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2},\tag{2}$$

$$distance_{space}(C_i, C_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2},\tag{3}$$

where *distancecolor* is the color distance and *distancespace* is the spatial distance. Based on these two distances, the distance between the two clusters is:

$$distance(C_i, C_j) = \sqrt{\left(\frac{distance_{color}(C_i, C_j)}{N_{color}}\right)^2 + \left(\frac{distance_{space}(C_i, C_j)}{N_{space}}\right)^2},\tag{4}$$

where *Ncolor* is the maximum color distance and *Nspace* is the maximum spatial distance. The SLIC algorithm uses the iterative mechanism of the *k*-means algorithm to gradually adjust each cluster's position and the cluster to which each pixel belongs, eventually obtaining *Nsegment* segments [36]. The advantage of the SLIC algorithm is that it can quickly and easily cluster adjacent similar regions into a segment; this characteristic is particularly useful for finding adjacent areas with a consistent color (band value). For *Iimage*, the SLIC algorithm is used to obtain the segmentation result *ISLIC*. Within each segment in *ISLIC*, the pixels are assigned the same segment label.
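Equations (2)-(4) translate directly into code; the dictionary representation of a cluster *Ci* = {*li*, *ai*, *bi*, *xi*, *yi*} below is an illustrative assumption.

```python
import math

def distance_color(ci, cj):
    """Equation (2): Euclidean distance in CIELAB color space."""
    return math.sqrt((ci["l"] - cj["l"]) ** 2 +
                     (ci["a"] - cj["a"]) ** 2 +
                     (ci["b"] - cj["b"]) ** 2)

def distance_space(ci, cj):
    """Equation (3): Euclidean distance between cluster centers."""
    return math.sqrt((ci["x"] - cj["x"]) ** 2 + (ci["y"] - cj["y"]) ** 2)

def distance(ci, cj, n_color, n_space):
    """Equation (4): combined distance, each term normalized by its maximum."""
    return math.sqrt((distance_color(ci, cj) / n_color) ** 2 +
                     (distance_space(ci, cj) / n_space) ** 2)

# Two clusters with identical color but centers 5 pixels apart.
c1 = {"l": 50.0, "a": 10.0, "b": -5.0, "x": 0.0, "y": 0.0}
c2 = {"l": 50.0, "a": 10.0, "b": -5.0, "x": 3.0, "y": 4.0}
```

The normalization by *Ncolor* and *Nspace* is what lets SLIC trade off color similarity against spatial compactness on a common scale.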

(2) Create a list of segments with suspicious degree evaluations

For all the segments in *ISLIC*, a suspicion evaluation list *HList* = {*h*1, *h*2, ... , *hn*} is constructed, where each *hi* is a set *hi* = {*hidi*, *hpixelsi*, *hreci*, *hspci*}: *hidi* is a segment label; *hpixelsi* holds the locations of all the pixels in the segment; *hreci* is the location and size of the enclosing frame rectangle of *hpixelsi*; and *hspci* is a suspicion value, which is either 0 or 1: a "1" means that the pixels in the segment are suspected of being misclassified, and a "0" means that the pixels in the segment are likely correctly classified. The algorithm to construct the suspicion evaluation list is as follows (SuspiciousConstruction algorithm):
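A runnable sketch of this construction, assuming *ISLIC* is given as a mapping from pixel locations (row, col) to segment labels (an illustrative input format, not specified in the paper):

```python
def suspicious_construction(i_slic):
    """Build HList from a segmentation result.

    i_slic: dict mapping (row, col) pixel locations to segment labels.
    Returns a list of dicts with the fields hid, hpixels, hrec, hspc;
    every hspc starts at 0 (not yet suspected).
    """
    pixels_by_segment = {}
    for (row, col), seg in i_slic.items():
        pixels_by_segment.setdefault(seg, []).append((row, col))

    h_list = []
    for seg, pixels in sorted(pixels_by_segment.items()):
        rows = [p[0] for p in pixels]
        cols = [p[1] for p in pixels]
        # Enclosing frame rectangle as (top, left, height, width).
        rect = (min(rows), min(cols),
                max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)
        h_list.append({"hid": seg, "hpixels": pixels,
                       "hrec": rect, "hspc": 0})
    return h_list
```

Storing the enclosing rectangle up front is what later allows each suspected segment to be corrected as a small sub-image rather than as part of a global optimization.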

**Algorithm** SuspiciousConstruction

```
Input: ISLIC
Output: HList
Begin
     HList = an empty list;
     foreach (segment label i in ISLIC){
          hidi = i;
          hpixelsi = Locations of all the pixels in corresponding segment i;
          hreci = the location and size of hpixelsi's enclosing frame rectangle;
          hspci = 0;
          hi = Create a set {hidi, hpixelsi, hreci, hspci};
          HList ← hi;
     }
     return HList;
```