2.2.1. Structure of the Proposed Fully Convolutional Feature Extractor

The structure of the proposed fully convolutional feature extractor is illustrated in Figure 3. To reconstruct the pre-trained FCN-8s for dense feature extraction, we make the following three modifications: (1) The feature maps extracted by the convlayers after the pool2 layer are coarse (i.e., one-sixteenth the size of the original image) and are assumed to contain more information specific to the source dataset. Therefore, only the first two groups of convlayers, together with the first pooling layer, are preserved. This modification aims to exploit multi-level, well-generalized features while preserving valuable spatial information. (2) In the original FCN-8s, the first convlayer zero-pads the input image with 100 pixels to prevent the severe size reduction imposed by the cascaded pooling layers, while the other convlayers pad the input feature map with 1 pixel. Note that all convolution kernels in FCN-8s are 3 × 3 in size, so with 1-pixel padding the output has exactly the same spatial dimensions as the input. In our fully convolutional feature extractor (FCFE), all convlayers are therefore set to pad the input feature map with 1 pixel. Consequently, feature maps from the first group of convlayers have the same size as the input image, while feature maps from the last convlayers are downsampled by a factor of two. (3) The feature maps extracted from the last group of convlayers are upsampled to the input size using bilinear interpolation, and all feature maps are then concatenated to form the multi-scale deep features.

As shown in Figure 3, the multi-scale features extracted by the FCFE are the upsampled and fused feature maps from the conv1\_1, conv1\_2, conv2\_1, and conv2\_2 layers of the FCN-8s model pre-trained on the PASCAL VOC dataset, with 64, 64, 128, and 128 channels, respectively. The deconv2 layer uses bilinear interpolation to upsample the feature maps from conv2\_1 and conv2\_2 to the size of the input image and fuses them together. The fusing1 layer then concatenates the feature maps from conv1\_1, conv1\_2, and deconv2 to obtain the final 384-dimensional multi-scale features.
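The fusion described above can be sketched in NumPy as follows. The channel counts (64, 64, 128, 128) and the bilinear upsampling follow the text; the function and variable names are illustrative and not taken from the original implementation, and random arrays stand in for real conv-layer outputs.

```python
import numpy as np

def bilinear_upsample(fm, out_h, out_w):
    """Bilinearly resize a (C, H, W) feature map to (C, out_h, out_w)."""
    c, h, w = fm.shape
    ys = np.linspace(0.0, h - 1.0, out_h)
    xs = np.linspace(0.0, w - 1.0, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]   # vertical interpolation weights
    wx = (xs - x0)[None, None, :]   # horizontal interpolation weights
    top = fm[:, y0][:, :, x0] * (1 - wx) + fm[:, y0][:, :, x1] * wx
    bot = fm[:, y1][:, :, x0] * (1 - wx) + fm[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_multiscale(c1_1, c1_2, c2_1, c2_2):
    """Concatenate full-resolution conv1 features with upsampled conv2 features."""
    _, h, w = c1_1.shape
    up2_1 = bilinear_upsample(c2_1, h, w)   # deconv2: back to input size
    up2_2 = bilinear_upsample(c2_2, h, w)
    return np.concatenate([c1_1, c1_2, up2_1, up2_2], axis=0)  # fusing1

# Toy shapes: a 32 x 32 input; conv2 maps are downsampled by a factor of two.
h, w = 32, 32
feats = fuse_multiscale(
    np.random.rand(64, h, w), np.random.rand(64, h, w),
    np.random.rand(128, h // 2, w // 2), np.random.rand(128, h // 2, w // 2))
print(feats.shape)  # (384, 32, 32): the 384-dimensional multi-scale descriptor
```

Note that the concatenation is along the channel axis, so every pixel of the input image receives a single 384-dimensional feature vector.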

**Figure 3.** Structure of the proposed FCFE.

2.2.2. Feature Selection Guided by the Existing Basemaps Using Random Forest

Only part of the features directly extracted by the FCFE are highly discriminative for buildings; the rest are redundant, and the features are high-dimensional. Therefore, directly feeding the features into the subsequent data cleaning pipeline demands excessive computation and also harms the data cleaning performance. According to the study in Reference [41], each feature layer generated by a DCNN responds to a major class. Thus, feature selection is performed to select the most informative features and ensure the quality of the classification result. Feature selection is the process of removing redundant and irrelevant features, often accomplished by evaluating the usefulness of all feature variables [42]. Feature selection methods can generally be classified into three categories: supervised, semi-supervised, and unsupervised methods. The existing building basemaps may contain erroneously labeled areas due to the time lapse relative to the newly acquired HRS image; however, the majority of the labels remain correct and can be used in the feature selection scheme.

Here, we employ an RF classifier to select features in our proposed method. An RF classifier trains multiple decision trees, each on a random subset of samples and a random subset of features [43,44]. The RF algorithm can be trained efficiently on multi-class classification problems and is widely used in RS image classification tasks [43]. RF also estimates the importance of the features it uses: the feature importance reported by RF is the importance averaged over all decision trees.
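This averaging behavior can be illustrated with scikit-learn's `RandomForestClassifier` (a stand-in for the authors' RF implementation; the toy data and names below are ours): the forest-level importance of each feature is the normalized mean of the per-tree Gini importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy pixel samples: 6 features, of which only feature 0 separates the classes.
X = rng.normal(size=(500, 6))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Forest importance = mean of the normalized per-tree importances.
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
avg = per_tree.mean(axis=0)
print(np.allclose(rf.feature_importances_, avg / avg.sum()))  # True
print(rf.feature_importances_.argmax())  # 0: the informative feature
```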

In order to select the salient features that discriminate well between building and background pixels, the 384-dimensional FCFE-extracted features, together with the existing building basemaps as pixel-wise labels, are used as the training set to fit an RF classifier. The importance of each feature is then evaluated, and the n\_ch (experimentally set to 20) most important features are selected to form the feature descriptor of the newly acquired HRS image.
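The selection step can be sketched as follows: the feature maps are flattened to a pixel-by-feature matrix, the (possibly outdated) basemap supplies the pixel-wise labels, and the n\_ch = 20 highest-importance channels are kept. This is a hypothetical sketch using scikit-learn; the function name, synthetic data, and the reduced channel count (64 instead of 384, for speed) are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_CH = 20  # number of channels to keep (set experimentally in the text)

def select_features(feats, basemap, n_ch=N_CH):
    """feats: (C, H, W) FCFE features; basemap: (H, W) 0/1 building labels."""
    c = feats.shape[0]
    X = feats.reshape(c, -1).T   # (n_pixels, C) training matrix
    y = basemap.ravel()          # pixel-wise labels from the basemap
    rf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:n_ch]  # most important first
    return feats[top], top

# Synthetic example: channel 5 mirrors the basemap, so it should rank highly.
rng = np.random.default_rng(1)
basemap = (rng.random((16, 16)) > 0.5).astype(int)
feats = rng.normal(size=(64, 16, 16))
feats[5] = basemap + 0.1 * rng.normal(size=(16, 16))
selected, idx = select_features(feats, basemap)
print(selected.shape)  # (20, 16, 16)
print(5 in idx)        # True
```

Because the RF is robust to a moderate fraction of label noise, the erroneously labeled basemap areas discussed above do not prevent the informative channels from being ranked highly.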

To visually analyze the features extracted by the proposed method, the image shown in Figure 4 is processed by the FCFE and the feature selection step. To display and compare features within and across layers, eight features are randomly chosen from each layer, and a total of 32 feature maps are illustrated in Figure 5.

By carefully examining Figure 5, three characteristics of the features extracted by the FCFE can be observed: (1) A small portion of the features are highly discriminative between buildings and background, with the corresponding feature maps showing salient contrast between the two classes; (2) a large number of features are less useful, with feature maps that are ambiguous and show inconspicuous differences; (3) features from early convlayers are fine-grained and adhere better to boundaries, whereas features from later convlayers are comparatively coarse and more abstract.

**Figure 4.** Example data for illustration of the proposed feature extraction and selection techniques. (**a**) Example image, and (**b**) outdated map.

**Figure 5.** Eight randomly selected feature maps from each layer of the FCFE; (**a**) conv1\_1; (**b**) conv1\_2; (**c**) conv2\_1; (**d**) conv2\_2.

The sixteen most important features chosen by feature selection are shown in Figure 6. Three properties of the selected features can be seen in Figure 6: (1) With the ineffective features filtered out, the remaining features are more representative and visually separable; (2) the selected feature maps are functionally versatile: (a,d,e,h,o) respond positively to buildings, whereas (b,c,f,j,k) respond negatively to buildings, and (i,m,p) respond strongly to shadows and effectively act as shadow detectors. Since buildings are expected to appear near shadows, the detection of shadows can positively support the recognition of buildings. (3) Features from all four convlayers are selected to form the multi-scale features. As stated before, features from early layers contain low-level knowledge, such as positions and boundaries, while features from later layers encode high-level cues, such as neighborhood and contextual information. Accordingly, the selected features are complementary and representative, and they are combined into a feature descriptor for HRS images.

**Figure 6.** The sixteen most important features selected by RF with the support of existing building basemaps.
