*2.2. Methodology*

In this study, we aim to correctly train and evaluate a model using imperfect annotations. Because of the inevitable misalignments, the values of loss functions or metrics that are computed by direct pixel-to-pixel comparison of the prediction and the annotation are inaccurate. To avoid this, we introduce the nearest feature selector (NFS) module, which performs similarity selection during the training and testing stages.

As shown in Figure 2, in the training phase, the NFS is applied to the prediction and the imperfect annotation to generate an aligned prediction and annotation for accurate loss estimation and proper back-propagation. In the testing phase, the NFS is applied in the same way to generate an aligned prediction and annotation for reliable accuracy analysis. Because the NFS selects the best-matched overlap, it avoids misalignments in the ground truth and produces a more reliable estimate of the accuracy or prediction error.

**Figure 2.** Experimental design for model training and evaluation under imperfect annotation. The proposed nearest feature selector (NFS) is applied to perform similarity selection during the training and testing stages.

Figure 3 presents the workflow for building outline extraction. The aerial images and their corresponding building outlines are partitioned into two sets for training and testing. Through several cycles of training and validation, the hyperparameters, including the batch size, number of iterations, random seed, and initial learning rate, were determined and optimized using the basic model (i.e., SegNet + L1 loss). Subsequently, the predictions generated by the optimized models are evaluated using the patches within the test set. For performance evaluation, we select three typically used balanced metrics, i.e., the F1-score, Jaccard index, and kappa coefficient. These metrics are computed before the post-processing operations [43,44].

**Figure 3.** Experimental workflow for building outline extraction. Existing loss functions and the proposed nearest feature selector are trained and evaluated using 224 × 224 image patches extracted from the original dataset.

#### 2.2.1. Data Preprocessing

According to the location and extent of each building polygon, a square window centered on the polygon's centroid is applied to extract the corresponding image patch, as sketched below. All patches are then resized to 224 × 224 pixels. After data preprocessing, there are 16,635 and 14,834 image patches extracted from the training and testing areas, respectively. Since we carefully checked the annotations, no negative patches had to be discarded. The image patches within the training area are then shuffled and partitioned into two groups: training (70%) and validation (30%). Consequently, the numbers of patches used for training, validation, and testing are 11,644, 4990, and 14,834, respectively.
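As a concrete illustration, the cropping step can be written in a few lines of Python. This is a hypothetical sketch rather than the authors' code: the image is assumed to be a NumPy array, the centroid is assumed to be given in pixel coordinates, and the window size is assumed to be supplied from the polygon extent.

```python
import numpy as np
from PIL import Image

def extract_patch(image, centroid_xy, window, out_size=224):
    """Crop a square window centered on a building centroid and resize it.

    image       : (H, W, C) uint8 aerial image as a NumPy array
    centroid_xy : (col, row) polygon centroid in pixel coordinates
    window      : side length of the square window, e.g., derived from
                  the polygon extent (an assumption of this sketch)
    """
    cx, cy = int(centroid_xy[0]), int(centroid_xy[1])
    half = window // 2
    top, left = max(cy - half, 0), max(cx - half, 0)
    patch = image[top:top + window, left:left + window]
    # Resize every patch to 224 x 224 pixels, as described above.
    return np.asarray(Image.fromarray(patch).resize((out_size, out_size)))
```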

#### 2.2.2. Proposed Model

For efficient building outline extraction, we utilize a modified SegNet [30] for feature extraction and the NFS module to achieve a dynamic alignment between the ground truth and the prediction (see Figure 4).

**Figure 4.** Overview of the proposed model. The model consists of a modified SegNet for feature extraction and the nearest feature selector (NFS) module for dynamic alignment.

#### • **Feature extraction**

In this study, we utilize a modified SegNet for effective feature extraction from very-high-resolution aerial images. As shown in Figure 4, the modified SegNet comprises sequential operation layers, including convolution, nonlinear activation, batch normalization, subsampling, and unpooling operations.

The convolution operation is an element-wise multiplication and summation within a two-dimensional kernel (e.g., 3 × 3 or 5 × 5). The size of the kernel determines the receptive field and the computational efficiency of the convolution operation. Owing to the complexity of the task, we set the numbers of kernels of the corresponding convolutional layers to [24, 48, 96, 192, 384, 192, 96, 48, 24] [34]. The convolution output is then passed through a rectified linear unit [45], which sets all negative values to zero. To accelerate network training, a batch normalization [46] layer is appended to every activation function except for the final layer. Max-pooling [47] and the corresponding unpooling [30] are used to reduce and upsample the width and height of intermediate features, respectively.
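The paragraph above fixes only the per-layer kernel counts; the exact placement of the pooling stages is not spelled out. The following PyTorch sketch is therefore one plausible reading, with 2 × 2 max-pooling after each of the first four blocks, a 384-kernel bottleneck, and index-preserving unpooling before each decoder block; it is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Convolution -> ReLU -> batch normalization, matching the layer
    # ordering described in the text.
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True),
                         nn.BatchNorm2d(c_out))

class ModifiedSegNet(nn.Module):
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        c = in_channels
        self.enc = nn.ModuleList()
        for w in [24, 48, 96, 192]:          # encoder blocks, each followed by pooling
            self.enc.append(conv_block(c, w)); c = w
        self.bottleneck = conv_block(c, 384); c = 384
        self.dec = nn.ModuleList()
        for w in [192, 96, 48, 24]:          # decoder blocks, each preceded by unpooling
            self.dec.append(conv_block(c, w)); c = w
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.head = nn.Conv2d(c, out_channels, kernel_size=1)  # no BN after the final layer

    def forward(self, x):
        indices, sizes = [], []
        for block in self.enc:
            x = block(x)
            sizes.append(x.size())
            x, idx = self.pool(x)            # max-pooling with stored indices
            indices.append(idx)
        x = self.bottleneck(x)
        for block in self.dec:
            x = self.unpool(x, indices.pop(), output_size=sizes.pop())
            x = block(x)
        return torch.sigmoid(self.head(x))   # single-channel building probability
```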

#### • **Nearest Feature Selector (NFS)**

Figure 5 shows the mechanism of the NFS. The center area of the ground truth, $\mathbf{Y}_c$, slides over the corresponding prediction along both the X- and Y-axes to generate overlaps $\mathbf{X}_{i,j}$, where *i* and *j* are the offsets from the initial position. To balance computational efficiency against the size of the sliding field, we set the maximum value of both *i* and *j* to five. The overlaps are then used for similarity estimation through different criteria, according to the number of channels of the output.

For the prediction and ground truth containing a single channel, the classic L1 distance is used. Thus, the distance of the (*i*,*j*) overlap can be formulated as:

$$\mathbf{D}_{i,j} = \frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \left\| \left(\mathbf{X}_{i,j}\right)_{m,n} - \left(\mathbf{Y}_{c}\right)_{m,n} \right\| \tag{1}$$

where $\mathbf{X}_{i,j}$ is the (*i*, *j*) overlap of the prediction, and $\mathbf{Y}_c$ is the center area of the corresponding ground truth. Both $\mathbf{X}_{i,j}$ and $\mathbf{Y}_c$ are in $\mathbb{R}^{W \times H}$, where *W* and *H* are the width and height of the corresponding overlap, respectively.
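Equation (1) can be sketched directly in PyTorch. This is a minimal illustration under two assumptions we make explicit: tensors have shape (B, C, H, W), and the sliding field covers symmetric offsets −5…5 around the initial position (the text fixes only the maximum of five).

```python
import torch

def nfs_distances(pred, gt, margin=5):
    """Mean L1 distance (Eq. (1)) between the center area of the ground
    truth, Y_c, and every shifted overlap X_{i,j} of the prediction.
    Returns a (B, 2*margin+1, 2*margin+1) distance map D."""
    B, C, H, W = pred.shape
    yc = gt[:, :, margin:H - margin, margin:W - margin]   # fixed center area Y_c
    D = torch.empty(B, 2 * margin + 1, 2 * margin + 1)
    for i in range(-margin, margin + 1):
        for j in range(-margin, margin + 1):
            x_ij = pred[:, :, margin + i:H - margin + i,
                              margin + j:W - margin + j]  # overlap X_{i,j}
            D[:, i + margin, j + margin] = (x_ij - yc).abs().mean(dim=(1, 2, 3))
    return D
```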

**Figure 5.** Overview of the nearest feature selector (NFS) module. The center area of the ground truth slides over the prediction along the X- and Y-axes to generate overlaps that are used for similarity selection.


For a prediction and ground truth containing multiple channels, the average cosine similarity along the channels is calculated; that is, at each pixel, the dot product and norms are taken along the channel dimension. In this case, the distance of the (*i*, *j*) overlap can be formulated as:

$$\mathbf{D}_{i,j} = 1 - \frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \frac{\left(\mathbf{X}_{i,j}\right)_{m,n} \cdot \left(\mathbf{Y}_{c}\right)_{m,n}}{\left\|\left(\mathbf{X}_{i,j}\right)_{m,n}\right\| \times \left\|\left(\mathbf{Y}_{c}\right)_{m,n}\right\|} \tag{2}$$
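For multi-channel outputs, the same sliding loop applies with the per-pixel cosine distance of Equation (2). The sketch below mirrors the single-channel version above, with a small epsilon added for numerical stability (our addition, not from the paper).

```python
import torch

def nfs_distances_cos(pred, gt, margin=5, eps=1e-8):
    """Mean cosine distance (Eq. (2)); the dot product and norms at each
    pixel are taken along the channel dimension."""
    B, C, H, W = pred.shape
    yc = gt[:, :, margin:H - margin, margin:W - margin]
    D = torch.empty(B, 2 * margin + 1, 2 * margin + 1)
    for i in range(-margin, margin + 1):
        for j in range(-margin, margin + 1):
            x_ij = pred[:, :, margin + i:H - margin + i,
                              margin + j:W - margin + j]
            cos = (x_ij * yc).sum(dim=1) / (x_ij.norm(dim=1) * yc.norm(dim=1)).clamp_min(eps)
            D[:, i + margin, j + margin] = 1 - cos.mean(dim=(1, 2))
    return D
```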

From all overlaps, the location indices of the one with the closest distance to the ground truth are determined as:

$$(i_{\min}, j_{\min}) = \underset{i,j}{\arg\min}\ \mathbf{D}_{i,j} \tag{3}$$
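Putting Equations (1)–(3) together, the selection step reduces to an argmin over the distance map. The following sketch reuses `nfs_distances` from above and returns the aligned pair, again assuming (B, C, H, W) tensors.

```python
import torch

def nfs_align(pred, gt, margin=5):
    """Select the nearest overlap X_{i_min, j_min} (Eq. (3)) and return it
    together with the center area Y_c for loss estimation."""
    B, C, H, W = pred.shape
    D = nfs_distances(pred, gt, margin)        # Eq. (1); see the sketch above
    flat_idx = D.view(B, -1).argmin(dim=1)     # per-sample argmin over (i, j)
    aligned = []
    for b in range(B):
        i = int(flat_idx[b]) // (2 * margin + 1) - margin
        j = int(flat_idx[b]) % (2 * margin + 1) - margin
        aligned.append(pred[b:b + 1, :, margin + i:H - margin + i,
                                        margin + j:W - margin + j])
    yc = gt[:, :, margin:H - margin, margin:W - margin]
    return torch.cat(aligned, dim=0), yc
```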

The nearest overlap ($\mathbf{X}_{i_{\min}, j_{\min}}$) and the corresponding ground truth ($\mathbf{Y}_c$) are selected for the final loss estimation. Four well-known loss functions, namely, L1, mean square error (MSE), binary cross-entropy (BCE) [48], and focal loss [49], are chosen in this study.

$$\mathcal{L}_{\mathbf{L1}} = \frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \left\| y_{m,n} - g_{m,n} \right\| \tag{4}$$

$$\mathcal{L}_{\mathbf{MSE}} = \frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \left( y_{m,n} - g_{m,n} \right)^2 \tag{5}$$

where *W* and *H* represent the width and height of the nearest overlap ($\mathbf{X}_{i_{\min}, j_{\min}}$) and the corresponding ground truth ($\mathbf{Y}_c$), and $y_{m,n}$ and $g_{m,n}$ are the predicted probability and the ground truth value at pixel (*m*, *n*), respectively.

For notational convenience, we define $p_{m,n}$:

$$p_{m,n} = \begin{cases} y_{m,n}, & \text{if } g_{m,n} = 1 \\ 1 - y_{m,n}, & \text{if } g_{m,n} = 0 \end{cases} \tag{6}$$

Compared with the traditional cross-entropy, the focal loss introduces a scaling factor (*γ*) to focus on difficult samples. Mathematically, the BCE and focal losses can be formulated as:

$$\mathcal{L}_{\mathbf{BCE}} = -\frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \log(p_{m,n}) \tag{7}$$

$$\mathcal{L}_{\mathbf{focal}} = -\frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} (1 - p_{m,n})^{\gamma} \log(p_{m,n}) \tag{8}$$

Because the NFS is computed dynamically, it can be seamlessly integrated with existing loss functions without further modification, as sketched below.
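To make this integration concrete, the sketch below wraps the aligned pair from `nfs_align` with each of the four losses, Equations (4)–(8). The focal-loss exponent `gamma` and the clamping constant are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def nfs_loss(pred, gt, kind="l1", gamma=2.0, margin=5):
    x, y = nfs_align(pred, gt, margin)        # nearest overlap and center area Y_c
    if kind == "l1":
        return F.l1_loss(x, y)                # Eq. (4)
    if kind == "mse":
        return F.mse_loss(x, y)               # Eq. (5)
    if kind == "bce":
        return F.binary_cross_entropy(x, y)   # Eq. (7)
    # Focal loss, Eq. (8), with p_{m,n} defined as in Eq. (6).
    p = torch.where(y == 1, x, 1 - x)
    return -((1 - p) ** gamma * torch.log(p.clamp_min(1e-7))).mean()
```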

Three typically used balanced metrics, i.e., the F1-score, Jaccard index, and kappa coefficient, are used for the quantitative evaluation. Compared with unbalanced metrics such as precision and recall, the selected metrics provide a more generalized measure of accuracy by considering both precision and recall.

$$\text{F1-score} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{9}$$

$$\text{Jaccard} = \frac{TP}{TP + FP + FN} \tag{10}$$

$$P_e = \frac{(TP + FN) \times (TP + FP) + (FP + TN) \times (FN + TN)}{(TP + FP + FN + TN) \times (TP + FP + FN + TN)} \tag{11}$$

$$P_o = \frac{TP + TN}{TP + FP + FN + TN} \tag{12}$$

$$\text{Kappa} = \frac{P_o - P_e}{1 - P_e} \tag{13}$$

where TP, FP, FN, and TN represent the number of true positives, false positives, false negatives, and true negatives, respectively.
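For completeness, Equations (9)–(13) translate directly into a few lines of Python; the counts TP, FP, FN, and TN are assumed to come from a pixel-wise confusion matrix of the aligned prediction and ground truth.

```python
def balanced_metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    f1 = 2 * tp / (2 * tp + fp + fn)                                # Eq. (9)
    jaccard = tp / (tp + fp + fn)                                   # Eq. (10)
    pe = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2   # Eq. (11)
    po = (tp + tn) / n                                              # Eq. (12)
    kappa = (po - pe) / (1 - pe)                                    # Eq. (13)
    return f1, jaccard, kappa
```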
