**1. Introduction**

The rooftops of buildings are dominant features in urban satellite and aerial imagery. For many remote sensing applications, such as slum mapping [1], urban planning [2], and solar panel capacity analysis [3], the spatial distribution and temporal change of buildings are critical. Such information is traditionally collected through labor-intensive and time-consuming field surveys [4]. For analyses at the city or country scale, especially in developing countries, a robust and cost-efficient method for automatic building extraction is preferable.

Over the past decades, many building extraction algorithms have been proposed [5]. These methods have been validated on datasets of various types (e.g., imagery or point clouds), scales (e.g., city or country), resolutions (e.g., centimeter or meter), and spectra (e.g., visible light or multispectral) [6–10]. Based on whether labeled ground truths are required, existing building outline extraction methods can be classified into two categories: (i) unsupervised and (ii) supervised methods.

#### *1.1. Unsupervised Methods*

For most unsupervised methods, building outlines are extracted by thresholding pixel values or histograms [11], applying edge detectors [12], or using region-based techniques [13,14]. Because of their simplicity, these methods require no additional training data and are fast. However, when applied to residential areas with complex backgrounds, artifacts and noise are inevitable in the extracted building outlines.
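As a concrete illustration of the histogram-thresholding family, the sketch below implements Otsu's method in plain NumPy on a toy image; the cited works may use different thresholding criteria, so this is only a generic example of the idea.

```python
import numpy as np

def otsu_threshold(img):
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total = hist.sum()
    grand_sum = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    cum_count, cum_sum = 0, 0.0
    for t in range(256):
        cum_count += hist[t]
        cum_sum += t * hist[t]
        if cum_count == 0 or cum_count == total:
            continue
        w0 = cum_count / total                      # background weight
        mu0 = cum_sum / cum_count                   # background mean
        mu1 = (grand_sum - cum_sum) / (total - cum_count)  # foreground mean
        var = w0 * (1.0 - w0) * (mu0 - mu1) ** 2    # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Toy image: a bright 16x16 "rooftop" (value 200) on a dark background (value 50)
img = np.full((32, 32), 50, dtype=np.uint8)
img[8:24, 8:24] = 200
t = otsu_threshold(img)
mask = img > t  # binary rooftop mask
```

On real imagery with complex backgrounds, a single global threshold is rarely sufficient, which is exactly the limitation noted above.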

#### *1.2. Supervised Methods*

Unlike unsupervised methods, supervised methods extract building outlines from images through patterns learned from ground truths. By learning from correct examples, supervised methods typically perform better in terms of both generalization and precision [15–17].

In the early stages, a two-stage approach combining handcrafted descriptors for feature extraction [18–21] with classifiers for categorization [22–24] was adopted in supervised methods. Because of this separation, an optimal combination of feature descriptor and classifier is difficult to achieve. Rather than the two-stage approach, convolutional neural network (CNN) methods enable unified feature extraction and classification through sequential convolutional and fully connected layers [25,26]. Initially, CNN-based methods were constructed in a patch-by-patch manner that predicts the class of a pixel from its surrounding patch [27]. Subsequently, fully convolutional networks (FCNs) were introduced to reduce memory costs and improve computational efficiency through sequential convolutional, subsampling, and upsampling operations [28,29]. Because of the information loss caused by subsampling and upsampling, the predictions of classic FCN models often present blurred edges. Hence, advanced FCN-based methods using various strategies have been proposed, such as unpooling [30], deconvolution [31], skip connections [32,33], multi-constraints [34], and stacking [35]. Among FCN-based methods, two different approaches exist: (a) indirect and (b) direct approaches.

#### 1.2.1. Indirect Approach

In the indirect approach, instead of extracting building outlines directly from the input aerial or satellite image, semantic maps are first generated, and the outlines are then derived from these maps. Because the outlines are computed from the segmentation output, the final accuracy depends significantly on the robustness of the semantic segmentation.
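The second step of this pipeline can be illustrated with a minimal sketch that derives outline pixels from a binary semantic map, defining the outline as building pixels with at least one non-building 4-neighbor; this is a generic formulation for illustration, not any specific cited method.

```python
import numpy as np

def mask_to_outline(mask):
    """Outline = building pixels that have at least one 4-neighbor outside the mask."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior if it and all four 4-neighbors are inside the mask
    interior = (padded[1:-1, 1:-1]
                & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior

# Toy semantic map: one 6x6 building footprint
sem = np.zeros((10, 10), dtype=bool)
sem[2:8, 2:8] = True
outline = mask_to_outline(sem)  # a 1-pixel-wide ring of 20 pixels
```

Any segmentation error in `sem` propagates directly into `outline`, which is why the indirect approach inherits the weaknesses of its segmentation stage.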

In principle, all the FCN-based methods mentioned above can be used for indirect building outline extraction. However, owing to the sensitivity of the outline/boundary, training with only semantic information typically results in inconsistent outlines or boundaries. To prevent this, BR-Net [36] utilizes a modified U-Net and a multitask framework to generate predictions for semantic maps and building outlines from a consistent feature representation produced by a shared backbone.

#### 1.2.2. Direct Approach

Unlike the indirect approach, the direct approach extracts building outlines directly from the input aerial or satellite images. Compared with the indirect approach, the direct approach learns the extraction pattern directly from the ground truth outline, which preserves higher fidelity. In the direct approach, building outline extraction is treated as a segmentation or pixel-level classification problem that involves extremely biased data [37]. In recent years, advanced FCN-based models, such as RSRCNN [38], ResUNet [39], and D-LinkNet [40], have been proposed for improved outline extraction.

However, these models focus on deeper network architectures to better exploit the feature representation capability of hidden layers. Furthermore, regardless of how these models generate predictions, their loss functions are computed directly from the pixel-to-pixel similarity between prediction and ground truth. Owing to the extremely biased distribution of positive and negative pixels, gradient explosion during training becomes a severe problem. Additionally, because of occasional human error, misalignments of several to tens of pixels inevitably occur between the annotation and the corresponding aerial image. Because the building outline contains far fewer positive pixels, pixel-to-pixel losses are extremely sensitive to these misalignments.
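This sensitivity can be illustrated with a toy example (not the paper's experiment): a one-pixel-wide outline shifted by just two pixels loses all pixel-wise overlap with its label, even though the two outlines are visually almost identical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

# One-pixel-wide "outline" label and a copy shifted down by 2 pixels,
# mimicking a small annotation misalignment.
gt = np.zeros((64, 64), dtype=bool)
gt[30, :] = True               # thin horizontal outline
pred = np.roll(gt, 2, axis=0)  # same outline, 2 px off

print(iou(gt, gt))    # perfect alignment -> 1.0
print(iou(gt, pred))  # tiny shift -> zero pixel-wise overlap
```

For thick region labels such a shift barely changes the overlap, but for thin outlines it drives any pixel-to-pixel loss toward its worst case.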

Hence, we propose a nearest feature selector (NFS) module that enables a dynamic re-alignment between the ground truth and the prediction. A dynamic matching between the ground truth and the prediction is performed at every iteration to determine the matched position. Subsequently, the overlapping areas of the ground truth and the prediction are used for loss computation. Because the NFS operates upstream of the loss computation, it can be seamlessly integrated with all existing loss functions. The effectiveness of the proposed NFS module is demonstrated on a VHR image dataset [36] located in New Zealand (see Section 2.1). In comparative experiments under different loss functions, adding the NFS yields significantly higher values of the F1 score, Jaccard index [41], and kappa coefficient [42].
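The matching idea can be sketched as follows, assuming a simple translation search over small offsets as the matching criterion; this is an illustrative stand-in, and the actual NFS implementation may differ. It also shows how the re-alignment wraps an existing loss (here binary cross-entropy) without modifying it.

```python
import numpy as np

def best_shift(gt, pred, max_shift=3):
    """Search small translations of the ground truth and return the one that
    best overlaps the prediction (illustrative stand-in for the NFS matching
    step; the paper's actual criterion may differ)."""
    best, best_score = (0, 0), -1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(gt, dy, axis=0), dx, axis=1)
            score = np.logical_and(shifted, pred).sum()
            if score > best_score:
                best_score, best = score, (dy, dx)
    return best

def aligned_bce(gt, pred_prob, max_shift=3):
    """Binary cross-entropy computed after re-aligning the ground truth."""
    dy, dx = best_shift(gt, pred_prob > 0.5, max_shift)
    g = np.roll(np.roll(gt, dy, axis=0), dx, axis=1).astype(float)
    p = np.clip(pred_prob, 1e-7, 1 - 1e-7)
    return -np.mean(g * np.log(p) + (1 - g) * np.log(1 - p))

# Misaligned toy pair: a confident prediction of the outline, shifted 2 px down.
gt = np.zeros((32, 32)); gt[10, 5:25] = 1
pred = np.roll(gt, 2, axis=0) * 0.9 + 0.05
```

With `max_shift=0` the search degenerates to the ordinary pixel-to-pixel loss, so the module adds alignment on top of any loss rather than replacing it.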

The main contributions of this study are as follows:


The rest of this paper is organized as follows: Section 2 introduces the materials and methods used in this research. Section 3 presents the learning curves and the quantitative and qualitative results. Sections 4 and 5 provide the discussion and conclusion, respectively.

#### **2. Material and Method**
