*2.3. Initial Classification by Automatic Sample Selection Using RF*

A noisy label is used to guide the feature selection, which may harm the classification result compared with a clean label. Therefore, the existing basemaps are viewed as noisy labels of the newly acquired HRS image; the selected deep features are then utilized to purify the initial labels through a data cleaning procedure.

In machine learning, data cleaning is often introduced into classification tasks with noisy labels and intends to identify and correct mislabeled samples [45]. The core of the data cleaning idea lies in estimating the label uncertainty of each sample. Note that in the label uncertainty estimation step, the training data are also noisy; therefore, classifiers that are robust to label noise are preferable. Many classifiers, such as SVM and AdaBoost, are highly sensitive to label noise, whereas some algorithms can limit its effect to an extent. As mentioned before, the random forest is an ensemble decision tree classifier that introduces randomness into both sample and feature selection, which makes it more robust and thus suitable for data cleaning tasks.

Inspired by the work in Reference [46], we use a cross-validation algorithm to estimate the uncertainty of the sample labels. The pseudocode for estimating the uncertainty of the initial labels is given in Algorithm 1.

**Algorithm 1.** Label uncertainty estimation

**Input**: *S* (sample set, i.e., pixel indexes from the HRS image) with *F* (features from Section 2.2); *L* (noisy labels acquired from the existing basemaps); *kmax* (pre-defined number of dataset partitions); *Nest* (number of RF meta-estimators); *Dmax* (max depth of the decision trees in RF)

**Procedure:**

(1) Divide *S* into *Spos* and *Sneg* according to *L*.

(2) Initialize *Mu* as an *N*-dimensional zero vector to accumulate label uncertainty, where *N* is the number of samples.

**For** k in range(*kmax*):

(3) Randomly partition *S* into a training subset and a validation subset.

(4) Train an RF classifier (*Nest* estimators, max depth *Dmax*) on the training subset using features *F* and labels *L*.

(5) Predict the labels *Lp* of the validation samples; for each sample whose predicted label disagrees with *L*, increase *Mu* according to Equation (1).

**End for**

**Output:** Accumulator *Mu* indicating the label uncertainty of *S*.

For supervised machine learning, equally sized training samples for each class are preferable. In satellite images, however, the background usually occupies more space than the buildings. To adjust the bias introduced by this unbalanced sample distribution, a larger penalty is imposed on inconsistent label predictions for background samples, i.e.,

$$M_{u}[L(S) \neq L_{p}(S)] = \begin{cases} 1 & \text{if } L(S) = \text{pos} \\ \sqrt{N_{neg}/N_{pos}} & \text{otherwise} \end{cases} \tag{1}$$

where *Mu* is the accumulator describing the label uncertainty of each sample, *L*(*S*) is the noisy label of *S*, *Lp*(*S*) is the label predicted by the classifier, and *Nneg* and *Npos* are the numbers of background and building pixels, respectively.

After obtaining *Mu*, *r* = *Mu*/*kmax* is calculated for each pixel; a pixel with *r* > 0.5 is regarded as a possibly mislabeled sample, otherwise it is considered a clean sample. Finally, these clean samples are used to train an RF classifier, *RFfinal*, to re-predict the label of potentially changed samples as building or other. The label probability of each sample is also obtained by *RFfinal* and is used in the subsequent post-processing.
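To make the procedure concrete, a minimal Python sketch of Algorithm 1 and the subsequent cleaning step is given below, using scikit-learn's random forest. The function and variable names (e.g., `estimate_label_uncertainty`, `features`) are illustrative assumptions, not taken from a released implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def estimate_label_uncertainty(features, noisy_labels, k_max=10,
                               n_est=5, d_max=11, seed=0):
    """Accumulate Equation (1) over k_max random partitions (Algorithm 1)."""
    rng = np.random.RandomState(seed)
    n = len(noisy_labels)
    m_u = np.zeros(n)                          # label uncertainty accumulator
    n_pos = np.sum(noisy_labels == 1)          # building pixels
    n_neg = n - n_pos                          # background pixels
    w_neg = np.sqrt(n_neg / n_pos)             # larger penalty for background
    for _ in range(k_max):
        train_idx, val_idx = train_test_split(
            np.arange(n), test_size=0.5, stratify=noisy_labels,
            random_state=rng.randint(1 << 30))
        rf = RandomForestClassifier(n_estimators=n_est, max_depth=d_max)
        rf.fit(features[train_idx], noisy_labels[train_idx])
        pred = rf.predict(features[val_idx])
        wrong = val_idx[pred != noisy_labels[val_idx]]
        # Equation (1): weight disagreements by the noisy label's class.
        m_u[wrong] += np.where(noisy_labels[wrong] == 1, 1.0, w_neg)
    return m_u

def clean_and_retrain(features, noisy_labels, m_u, k_max=10,
                      n_est=5, d_max=11):
    """Drop samples with r > 0.5 and train RF_final on the clean rest."""
    clean = (m_u / k_max) <= 0.5
    rf_final = RandomForestClassifier(n_estimators=n_est, max_depth=d_max)
    rf_final.fit(features[clean], noisy_labels[clean])
    return rf_final  # rf_final.predict_proba gives the label probabilities
```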

#### *2.4. Post-Optimization Using Graph Cuts and Change Map Computing*

Since the data cleaning is conducted pixel-wise and little contextual information is taken into account, the initial classification result is fragmented. To ensure neighborhood consistency, post-optimization is formulated as an energy minimization problem, and the graph cuts algorithm [47], performed on superpixels instead of individual pixels for efficiency, is used to find the solution.

Here we use the SLIC algorithm to segment the HRS image into superpixels; SLIC has been shown to generate compact superpixels that adhere tightly to boundaries [35]. The probability of a superpixel belonging to each class (building or other) is then calculated using Equation (2), which combines two aspects: (1) The averaged label probability of the pixels in the superpixel; and (2) the proportion of pixels belonging to the current class.

$$p(L(Spix) = c) = 0.5 \times \left( \frac{\sum_{pix \in Spix} p(L(pix) = c)}{\left| pix \in Spix \right|} + \frac{\left| pix \in Spix,\ L(pix) = c \right|}{\left| pix \in Spix \right|} \right) \tag{2}$$

where *Spix* is the superpixel, *pix* denotes the pixels belonging to *Spix*, *c* is one of the two defined class labels, *L*(*x*) returns the label of *x*, and |*s*| is the number of elements in set *s*.
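As an illustration, Equation (2) can be computed directly from a SLIC segmentation with scikit-image; `pixel_prob` (the *RFfinal* probability map for class *c*) and `pixel_labels` are assumed inputs with names of our choosing.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_probability(image, pixel_prob, pixel_labels, c,
                           n_segments=2000, compactness=10.0):
    """Equation (2): averaged probability plus label fraction, per superpixel."""
    segments = slic(image, n_segments=n_segments, compactness=compactness)
    probs = np.zeros(segments.max() + 1)
    for s in np.unique(segments):
        mask = segments == s
        mean_prob = pixel_prob[mask].mean()       # averaged label probability
        frac = np.mean(pixel_labels[mask] == c)   # proportion labeled c
        probs[s] = 0.5 * (mean_prob + frac)
    return segments, probs
```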

The basic idea of graph cuts is to incorporate both prior knowledge of the label assignment and the penalty imposed on adjacent superpixels with different labels into a weighted graph. An energy function is then constructed on the graph, and the optimal label assignment is obtained by minimizing it:

$$E = \sum_{i} D(c_i) + \lambda \sum_{(i,j) \in \mathcal{N}} S(c_i, c_j), \tag{3}$$

The first term, *D*(*ci*), is the data term, determined by the negative logarithm of the probability obtained from Equation (2) and defined as

$$D(c_i) = -\log(p(L(Spix_i) = c_i)), \tag{4}$$

The second term in Equation (3), *S*(*ci*, *cj*), is the smooth term, which imposes a penalty on adjacent superpixels with different labels according to their similarity. A metric of spectral difference, i.e., a Gaussian kernel of the averaged RGB features, is utilized as the similarity measurement. Since the longer the boundary shared between two superpixels, the greater their influence on each other, the penalty is weighted by the mutual border length. The smooth term employed in this paper is defined as:

$$S(c_i, c_j) = w(i,j) \times \exp\left(-\frac{\|f_i - f_j\|^2}{\sigma^2}\right) \times \delta(i,j), \tag{5}$$

where

$$w(i,j) = \frac{\text{bon}(i,j) \times N(i)}{\sum_{k \in \mathcal{N}(i)} \text{bon}(i,k)}, \tag{6}$$

$$\delta(i,j) = \begin{cases} 1 & \text{if } c_i \neq c_j \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

σ is the standard deviation of the Gaussian kernel; *fi* and *fj* are the averaged RGB features of the *i*th and *j*th superpixels, respectively; *bon*(*i*, *j*) is the shared border length of the *i*th and *j*th superpixels; *N*(*i*) is the number of neighbors of superpixel *i*; and *ci* is the label of superpixel *i*.

The parameter λ in Equation (3) controls the proportion of the smooth term in the energy function. The larger the value of λ, the heavier the penalty imposed on adjacent superpixels with different labels, and hence the stronger the smoothing effect. The value of λ is related to the size of the buildings in the HRS image. If most buildings are small, consisting of only a few superpixels, λ needs to be reduced to avoid over-smoothing of the building superpixels by the surrounding background superpixels; otherwise, λ is set to a larger value to achieve a better smoothing effect.

After building the energy function, the maximum flow of the graph [48] is computed to obtain the minimum cut and thus the optimal label for each superpixel. After obtaining the final classification result of the new HRS image, its labels are compared to the existing map to obtain the change map.
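A minimal sketch of this optimization with PyMaxflow (the library used in our implementation, see Section 3) is given below. The adjacency containers `neighbors`, `bon` (shared border lengths, stored for both orderings of each pair), and `feat` (mean RGB per superpixel) are assumed to be precomputed; their names are ours.

```python
import numpy as np
import maxflow  # PyMaxflow

def optimize_labels(probs, neighbors, bon, feat, lam=0.3, sigma=10.0):
    """Graph cuts over superpixels, Equations (3)-(7)."""
    eps = 1e-10
    n = len(probs)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n)
    for i in range(n):
        # Data term, Equation (4): with this t-edge assignment,
        # segment 0 (source) = building, segment 1 (sink) = other.
        g.add_tedge(nodes[i], -np.log(1.0 - probs[i] + eps),
                    -np.log(probs[i] + eps))
        total_bon = sum(bon[(i, j)] for j in neighbors[i])
        for j in neighbors[i]:
            if j <= i:
                continue  # add each undirected edge once
            # Smooth term, Equations (5)-(6); w(i, j) is used for
            # both directions here for simplicity.
            w = bon[(i, j)] * len(neighbors[i]) / total_bon
            s = lam * w * np.exp(-np.linalg.norm(feat[i] - feat[j]) ** 2
                                 / sigma ** 2)
            g.add_edge(nodes[i], nodes[j], s, s)
    g.maxflow()  # min cut = optimal assignment; delta(i,j) is implicit in the cut
    return np.array([g.get_segment(nodes[i]) for i in range(n)])
```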

#### **3. Experimental Results and Discussion**

The proposed framework is implemented in Python. Pre-trained FCN-8s model weights are obtained from https://github.com/shelhamer/fcn.berkeleyvision.org under the Caffe [49] framework, transformed into a TensorFlow (https://www.tensorflow.org/) readable form, and reconstructed into the fully convolutional feature extractor (FCFE). Graph cuts are implemented using PyMaxflow (https://github.com/pmneila/PyMaxflow).

#### *3.1. Experiment Setup*

#### 3.1.1. Datasets Description

To evaluate the proposed method, we use five datasets, shown in Figure 7, grouped into two sets: the ISPRS simulated dataset and the Boston real dataset; for details, see Table 1.

**Figure 7.** Experimental datasets: (**a**,**b**) ISPRS simulated dataset, (**c**–**e**) Boston real dataset (the first row is the newly acquired HRS image, the second row is the outdated building map, and the third row is the ground truth building map for the new HRS images).


**Table 1.** Details of newly acquired HRS images in five datasets.

ISPRS simulated dataset: Two airborne images from the ISPRS 2D semantic labeling benchmark (downloaded from http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html) are employed to create two synthetic datasets serving as newly acquired HRS images, with approximately 10% of new building labels randomly added. To simulate the outdated basemaps, 15% of the existing labels are deleted from the ground truth.

Boston real dataset: Three real datasets are selected from urban areas of Boston, USA. The outdated basemaps are obtained from an existing classification dataset [50] (downloaded from http://www.cs.utoronto.ca/~vmnih/data/), and regions that contain obvious changes are cropped. The corresponding newly acquired HRS images are then downloaded from Google Earth. The main challenges of this dataset are: (1) The backgrounds are heterogeneous and share spectral similarity with the buildings; therefore, purely pixel-based change detection may result in a high false-positive rate. (2) The buildings are relatively small; therefore, object-based strategies may suffer from the instability of random classifiers, which may lead to false-negative outcomes. (3) The labels of the existing buildings suffer from severe mis-registration errors, which makes the building sample information inaccurate. To evaluate the effectiveness of the proposed framework, an expert was invited to delineate the building boundaries from the HRS images; the results were then reviewed by a second expert, both independent of the experiment.

#### 3.1.2. Assessment Criteria

In image-to-image change detection, the result is a change map indicating the pixels that differ notably between multiple images, whereas the result of image-to-map comparison is an updated label map. Similar criteria can be used to assess accuracy in both cases. In this paper, three pixel-wise evaluation indexes are used to assess the accuracy of the change detection result: completeness (Comp), false detection rate (FDR), and overall accuracy (OA):

$$\text{Completeness} = \frac{C_d}{C_t}, \tag{8}$$

$$FDR = 1 - \frac{C_d}{C_a}, \tag{9}$$

$$OA = \frac{C_d + C_n}{C}, \tag{10}$$

where *Cd* is the number of changed pixels (both background to building and building to background) that are correctly detected, *Ct* is the number of truly changed pixels between the newly acquired HRS image and the outdated basemap, *Ca* is the number of all pixels that are labeled differently in the new label map and the outdated basemap, *Cn* is the number of unchanged pixels that are correctly detected, and *C* is the total number of pixels in the HRS image. Completeness measures the percentage of successfully detected changed pixels among all changed pixels, whereas *FDR* reflects the proportion of false change pixels among those labeled as changed by the proposed algorithm. *OA* measures the comprehensive detection capability by taking both changed and unchanged pixels into account.
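For clarity, the three criteria can be computed from binary building masks as follows (a small sketch; the array names are ours):

```python
import numpy as np

def change_metrics(new_map, old_map, truth_map):
    """Equations (8)-(10) from binary building masks of equal shape."""
    pred_changed = new_map != old_map     # pixels the algorithm marks as changed
    true_changed = truth_map != old_map   # truly changed pixels
    c_d = np.sum(pred_changed & true_changed)    # correctly detected changes
    c_t = np.sum(true_changed)                   # all true changes
    c_a = np.sum(pred_changed)                   # all detected changes
    c_n = np.sum(~pred_changed & ~true_changed)  # correct non-changes
    c = truth_map.size                           # all pixels
    return c_d / c_t, 1 - c_d / c_a, (c_d + c_n) / c  # Comp, FDR, OA
```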

#### 3.1.3. Parameters Setting

Three parameters have a high impact on the results. All of them are set based on trial and error, and unless otherwise stated, the following values are used in our experiments.

The first one is the maximum depth of the RF classifier, *Dmax*, which determines the degree to which RF fits the training set. For a small *Dmax*, RF under-fits the training set, resulting in a high bias. If *Dmax* is set to a large value, RF tends to over-fit the mislabeled data in the training sets, resulting in a high variance. To balance completeness and FDR, we set *Dmax* = 11.

Compared to *Dmax*, the number of decision tree estimators in RF, *Nest*, has a trivial effect on the data cleansing accuracy. For *Nest* < 5, OA and FDR fluctuate slightly, due to the intrinsic randomness of the meta-classifiers, whereas for *Nest* > 5, OA and FDR converge to a fixed level. Since the computational cost is linearly proportional to *Nest*, we set its value to the minimum stable value of 5.

The main parameters of the post-optimization are the proportion of the smooth term, λ, and the standard deviation of the Gaussian kernel, σ.

Parameter λ controls the smoothness of the classification result. For a small λ, graph cuts tend to under-smooth the label results; holes and gaps in building labels and spurious fragmentations remain, causing low completeness and OA and a high FDR. For a very large λ, the label results are over-smoothed and many existing buildings are obliterated, causing FDR to rise again and completeness and OA to drop. Here, we set λ to 1.0 for the ISPRS datasets and 0.3 for the Boston datasets. The value of σ is set to 10.

### *3.2. Results of ISPRS Simulated Data*

#### 3.2.1. Change Detection Results

The detection results of the ISPRS datasets are presented in Figure 8. The middle row of Figure 8 presents the initial classification results, and the bottom row shows the results after optimization by the graph cuts algorithm. The initial results show that most of the new buildings are detected. However, the building labels have holes and gaps that undermine OA. Moreover, in areas that share similar spectral-textural characteristics with the buildings, such as bare soil and roads, spurious and fragmented building labels occur, resulting in a high FDR. After optimization, cleaner building extraction results are obtained.

**Figure 8.** Experiment results: (**a**) results of the ISPRS simulated dataset a, (**b**) results of the ISPRS simulated dataset b (the first row is the HRS images, the second row is the initial classification results, and the third row is the final classification results).

#### 3.2.2. Results with Different Label Noise Levels

Here we analyze the performance of the proposed method on datasets with different levels of label noise and explore the overall accuracy w.r.t. these settings. The HRS images, shown in Figure 7a,b, are segmented into superpixels of approximately the size of the buildings. The labels of randomly selected proportions of superpixels (ranging from 5% to 50%) are then flipped to introduce different levels of noise. The whole procedure of the proposed method is performed on these modified datasets, and the results are presented in Figure 9.
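The label-flipping simulation can be sketched as follows (an illustrative snippet with names of our choosing), given a binary label map and a superpixel segmentation:

```python
import numpy as np

def flip_superpixel_labels(labels, segments, noise_rate, seed=0):
    """Flip the labels of a randomly chosen fraction of superpixels."""
    rng = np.random.RandomState(seed)
    ids = np.unique(segments)
    flipped = rng.choice(ids, size=int(noise_rate * len(ids)), replace=False)
    noisy = labels.copy()
    for s in flipped:
        noisy[segments == s] = 1 - noisy[segments == s]  # binary flip
    return noisy
```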


**Figure 9.** Overall accuracy w.r.t different simulated noise levels: (**a**) results of ISPRS dataset a, (**b**) results of ISPRS dataset b.

The results indicate that for noise rates up to 40%, the overall accuracy of the proposed method stays above 90%. Even when the noise rate reaches 50% (meaning the label information provided by the outdated basemaps is thoroughly mixed), the proposed framework still obtains an accuracy of 75%. This indicates the effectiveness of the proposed method.

#### *3.3. Results of Boston Real Dataset*

#### 3.3.1. Detection Results

Figure 10 shows the initial classification results for the Boston real datasets. Comparing the results obtained by the proposed method (the third row of Figure 10) with the ground truth map (the second row of Figure 10), it can be seen that most of the new buildings are correctly detected and the mis-registration errors are corrected. However, the building labels have holes and gaps that undermine OA.

**Figure 10.** Initial classification results: (**a**) results of Boston real dataset c, (**b**) results of Boston real dataset d, (**c**) results of Boston real dataset e (first row—outdated basemap; second row—ground truth; third row—data cleansing result).

The results after optimization using graph cuts are presented in the second row of Figure 11. Compared with the first row of Figure 11, the small spurious segments are removed, and the building extraction results are more accurate. Based on the optimized classification results, we obtain the change maps and compare them with the ground truth change map. The results are shown in the fourth row of Figure 11, where red means the changes are correctly detected and green means they are not detected.

**Figure 11.** Results after post-optimization: (**a**) results of Boston real dataset c, (**b**) results of Boston real dataset d, (**c**) results of Boston real dataset e (first row—building label maps before optimization by object-based analysis and graph cuts; second row—building label maps optimized by object-based analysis and graph cuts; third row—building map ground truth; fourth row—change map, where red means the changes are correctly detected and green means they are not).

#### 3.3.2. Performance Comparison

To demonstrate the effectiveness of the proposed method, comparisons are made with three benchmark methods, namely A, B, and C. Method A employs the same framework as the proposed method but replaces our feature detector with conventional spatial-spectral features, combining GLCM textural features and normalized RGB. Method B employs the deep feature extractor of Reference [24] and follows these steps: (1) Segment the HRS images into superpixels; (2) crop the bounding box of each superpixel, feed it into ImageNet pre-trained VGGNet, extract the 4096-dimensional features from fc7, and reduce them to 100 dimensions using principal component analysis; (3) cleanse the data and apply graph cuts optimization. Method C is a fully pixel-based method that directly uses the pixel-wise re-predicted label map for graph cuts optimization.
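For reference, the fc7 feature extraction in Method B could look as follows; this is a hypothetical sketch using torchvision (the original reference may use another framework), and `crops`, a batch of superpixel bounding-box crops resized to 224 × 224, is an assumed input.

```python
import torch
from torchvision.models import vgg16
from sklearn.decomposition import PCA

model = vgg16(pretrained=True).eval()
# Keep the classifier up to fc7 and its ReLU (children 0-4), so the
# forward pass returns 4096-dimensional fc7 activations.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:5])

with torch.no_grad():
    feats_4096 = model(crops).numpy()                 # fc7 features
feats_100 = PCA(n_components=100).fit_transform(feats_4096)  # 100-D features
```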

To make the four methods comparable, the receptive field of the features is set to 15, the same as in the proposed method. All hyperparameters are determined through a grid search for the highest performance. The accuracy results, shown in Table 2, confirm that the proposed method outperforms methods A, B, and C.


**Table 2.** Comparison Results.

Compared with the proposed method, Method A shows a lower AR and a higher FDR, which indicates that the deep features perform better than the hand-crafted features. Method B employs an earlier deep feature extraction strategy; however, its performance on the experimental data is very low. The reason is that the buildings in the datasets are generally small, which causes two problems when the HRS images are directly segmented into objects for data cleansing: (1) The number of building samples is severely decreased; therefore, there is not enough information to distinguish the background from the buildings; (2) a single building consists of only a few superpixels, making the building objects vulnerable to the instability of random classifiers and/or to over-smoothing by surrounding background objects. For Method C, even with the additional pixel-wise graph cuts post-processing, the accuracy remains low compared to the initial classification result. This is because the graph cuts algorithm penalizes adjacent pixels with different labels, and correcting a spurious clique requires a large amount of energy; therefore, such cliques cannot be corrected through max-flow optimization of the energy function. On the contrary, holes in building labels and fragmentations in non-building areas may dilate, decreasing AR and OA.

All experiments were performed on a laptop with an Intel Core i7-7700HQ CPU at 2.8 GHz, 32 GB memory, and an NVIDIA GTX 1060 Max-Q GPU (6 GB memory). The processing time is about five minutes for the three real datasets.

#### **4. Conclusions and Future Works**

In this paper, we proposed a novel framework for image-map building change detection. First, we demonstrated the representational ability of the features extracted from the early convolutional layers of pre-trained DCNNs and showed the feasibility of selecting important features using outdated building basemaps. Then, a random forest-based data cleansing method was implemented to preliminarily detect and correct changed pixels. The pixel-level re-predicted label maps were, however, fragmented; therefore, we adopted object-based analysis to introduce contextual information and ameliorate spurious predictions, and used a graph cuts algorithm to optimize the label assignment results.

The proposed method has some limitations; for instance, a sparse distribution of buildings may result in omission errors. Since the FCFE computes dense feature descriptors efficiently, it can also be used in other tasks, such as classification and image registration [51].

**Author Contributions:** Conceptualization, Y.Z. (Yunsheng Zhang), J.P.; Methodology, Y.Z. (Yunsheng Zhang), Y.Z. (Yaochen Zhu), J.P.; Software, Y.Z. (Yaochen Zhu), S.C.; Validation, Y.Z. (Yaochen Zhu), S.C.; Resources, H.L., L.Z.; Writing—Original Draft Preparation, Y.Z. (Yunsheng Zhang), Y.Z. (Yaochen Zhu), J.P.; Writing—Review & Editing, Y.Z. (Yunsheng Zhang), H.L., L.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Hunan Provincial Natural Science Foundation of China (No. 2018JJ3637), Natural Science Foundation of China (No. 51978283), Open Fund of Key Laboratory of Urban Land Resource Monitoring and Simulation, Ministry of Land and Resource (No. KF-2018-03-047).

**Conflicts of Interest:** The authors declare no conflict of interest.
