**1. Introduction**

Developing countries have witnessed a rapid expansion of urban areas during the last decades. With this fast urbanization, updating building geo-databases plays an important role in urban planning, as it provides valuable information for, e.g., land use/cover monitoring [1], evaluation of agricultural land decline [2], disaster assessment [3], and civil BIM updating [4]. Such information also enables governments to adopt suitable and sustainable development strategies. Automatic building geo-database updating relies on identifying the areas where changes have occurred. Currently, change identification is mainly labor-intensive work, especially in complex urban environments. Therefore, automatic geo-database updating based on remote sensing images remains an open issue.

During the past decades, several methods have been proposed to increase the level of automation in change detection. According to their comparison basis, change detection methods can be categorized into two classes: (1) Image-image comparison; and (2) image-map comparison [5]. The former approach aims at direct recognition of differences between multi-temporal remotely sensed images [6,7]. The image-map comparison-based approach, however, detects changes between existing data and newly acquired images, which requires a semantic classification of the newly acquired images. For image-map comparison, supervised machine learning methods are employed, see, e.g., Reference [8]. However, for an accurate classifier to be trained, a sufficiently large set of labeled samples is required. Labeling samples, however, requires expensive manual work and a high level of expertise in image interpretation.

To address this issue, existing GIS data or online maps, such as OpenStreetMap (OSM) data and Google Maps, are employed to provide prior information. For example, Bouziani et al. obtain prior class knowledge from an existing geo-database to identify building changes, based on transition probabilities between classes, and to guide map segmentation [5]. Kaiser et al. exploit online maps to guide aerial image segmentation, although they simply ignore the temporal inconsistencies between the used maps and aerial images and rely on human interaction to remove mis-registrations between the map and the building roofs in the images [9]. Wan et al. employ OSM data to obtain initial samples for training an SVM to classify high-resolution remote sensing (HRS) images [10]. To alleviate the effect of intrinsic errors caused by incorrect labeling by volunteers, they further use cluster analysis to filter out possible errors. Gevaert et al. model outdated base-maps as noisy labels of newly acquired UAV images, utilize data cleansing methods to filter out potentially mislabeled samples, and then re-predict their labels by supervised classification [11]. Chen et al. treat historical digital line graph (DLG) data as the source of initial noisy labels, from which a pure part is selected by an iterative training method [12]. For highly accurate classification, they also use several hand-crafted image-based and point-cloud-based features for the supervised classification task. The elevation feature is also very useful for distinguishing buildings; however, it is not always available.

In addition to the availability of a sufficiently large set of labeled samples, selecting properly discriminative features is another key point for classification. Carefully hand-crafted features have been heuristically proposed and combined to classify VHR images. Most existing methods employ spectral and textural features, or DEM data, as feature descriptors, see References [11,13,14]. Although hand-crafted features are designed to describe a specific image pattern, their performance depends on the available training data. Different from hand-crafted features, recently developed deep learning techniques learn features directly from the original data. Deep learning is widely used in various research areas, e.g., natural image classification [15], object detection [16], and semantic segmentation [17]. Deep learning methods are also used to learn features from remote sensing (RS) images for classification [18]. For instance, autoencoder-based techniques are used in RS to extract features from images [19–21]. Such methods learn, in an unsupervised setting, feature encodings from which the input can be reconstructed with minimum error [21]. Different variations of autoencoders are applied to various tasks in the RS field. However, as the spatial resolution of RS images increases, training such autoencoders becomes time-consuming and requires large amounts of memory.

In practice, a large set of accurately labeled data is often unavailable. Recent works in the RS domain address this issue by transferring pre-trained deep convolutional neural networks (DCNNs) rather than training them from scratch. Feature extraction with DCNNs is also widely used in computer vision research, where training is based on large open-source datasets, see References [22,23]. The intuition behind DCNNs is that, with their strong learning abilities, they can learn to respond to various kinds of feature patterns at different levels of abstraction from large and complex datasets. The learned features can then be generalized to smaller datasets, even if those datasets are remarkably different from the training datasets [24]. Much research has been done on generating a single feature descriptor for a whole image from the high-level activations of pre-trained DCNNs [25]. In these methods, the size of the input is strictly fixed, so interpolation is needed to resize the images to a specified scale. To extract dense feature maps in a pixel-wise fashion, such methods need to crop a window centered at each pixel, resize it, and perform a forward propagation [20,26]. Since most of the computation in neighboring windows is shared through the convolution, they are computationally redundant and limited to small/moderate-size images. Many existing methods focus on extracting features from the back part of DCNNs (i.e., the last convolutional layer and fc layers) and generate one single feature descriptor for the whole image.
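The redundancy of window-based extraction can be illustrated with a toy NumPy sketch: one "valid" convolution over the whole image produces exactly the dense response map that per-pixel window cropping recomputes patch by patch. The single channel and single 3×3 kernel are simplifying assumptions, not any particular network's layer:

```python
import numpy as np

def conv_full(image, kernel):
    # Single forward pass: slide the kernel once over the whole image ('valid' mode).
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv_windowed(image, kernel):
    # Window-based extraction: crop a patch at each position and run a separate
    # forward pass on it. Neighboring patches overlap, so nearly all
    # multiplications are recomputed -- the redundancy discussed in the text.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]           # crop
            out[i, j] = conv_full(patch, kernel)[0, 0]  # per-patch forward pass
    return out

rng = np.random.default_rng(42)
img, k = rng.random((12, 12)), rng.random((3, 3))
assert np.allclose(conv_full(img, k), conv_windowed(img, k))
```

Both routines yield identical dense feature maps, but the windowed variant repeats almost every multiplication, which is why single-pass fully convolutional extraction scales to large images while window-based extraction does not.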

To improve classification performance, the spatial context of the images has to be fully used [23,27]. Single-pixel-based methods are unable to take in a large enough image field to distinguish building objects from the background and to ensure a consistent classification result in the global context. Several pixel-based methods have proved successful for change detection in low- and moderate-resolution remotely sensed images [7]. Nevertheless, with the emergence of HRS data, such methods are not effective, since the results easily suffer from salt-and-pepper noise due to increased intra-class and decreased inter-class variance [28]. To address this issue, object-based methods are adopted in References [29–32]. Such object-based change detection methods significantly reduce the amount of data to be processed and generate change recognition results with shape and boundary information that can be directly used to update geo-databases, see Reference [33]. This, however, may lead to new problems, as object segmentation is intrinsically challenging for remote sensing images [34].

In this paper, we propose to cast the image-map change detection problem as the identification and correction of noisy labels. For extracting discriminative features, a fully convolutional network (FCN) pre-trained on the PASCAL VOC dataset [17] is treated as a fully convolutional feature extractor (FCFE). Since long-range relationships are comparatively trivial in HRS images, and spatial information is severely lost by down-sampling in the later convolutional layers, only the first two groups of convolutional layers (four layers) are preserved. The tensors from all convolutional layers are then up-sampled to the size of the input and fused by concatenation into pixel-wise features. Through the FCFE, the feature computation for all pixels is achieved by a single forward propagation; it is therefore more efficient than most window-based feature extractors. However, the directly concatenated and up-sampled pixel-wise features are redundant and of high dimension for subsequent processing. Therefore, a noisy-label-guided feature selection is proposed to select the most informative features for building extraction. Pixel-wise re-predicted labels of newly acquired HRS images are usually fragmented, especially in areas with similar spectral and textural characteristics, such as buildings, roads, and bare soil. To alleviate this problem, the new HRS images are segmented into superpixels, and superpixel-based graph cuts are used to refine the initial classification result. For further performance improvement, we also propose a new label uncertainty calculation technique for each superpixel.
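The extract–up-sample–concatenate pipeline can be sketched minimally in NumPy. Random kernels stand in for the pre-trained FCN weights, two small convolutional groups stand in for the preserved layers, and nearest-neighbour up-sampling replaces interpolation; these are all simplifications for illustration, not the paper's implementation:

```python
import numpy as np

def conv_relu(x, kernels):
    # x: (C, H, W); kernels: (K, C, 3, 3). 'Same' convolution + ReLU.
    C, H, W = x.shape
    K = kernels.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((K, H, W))
    for k in range(K):
        for i in range(H):
            for j in range(W):
                out[k, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * kernels[k])
    return np.maximum(out, 0.0)

def max_pool2(x):
    # 2x2 max pooling with stride 2 (assumes even H and W).
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def upsample_to(x, H, W):
    # Nearest-neighbour up-sampling back to the input size.
    C, h, w = x.shape
    return x.repeat(H // h, axis=1).repeat(W // w, axis=2)

def fcfe(image, kernels1, kernels2):
    H, W = image.shape[1:]
    f1 = conv_relu(image, kernels1)          # conv group 1 (full resolution)
    f2 = conv_relu(max_pool2(f1), kernels2)  # pool + conv group 2 (half res.)
    # Up-sample every tensor to the input size, concatenate channel-wise:
    # one feature vector per pixel, from one forward pass.
    return np.concatenate([f1, upsample_to(f2, H, W)], axis=0)

rng = np.random.default_rng(0)
img = rng.random((3, 32, 32))                  # toy 3-band HRS patch
k1 = rng.standard_normal((8, 3, 3, 3)) * 0.1   # stand-ins for pre-trained weights
k2 = rng.standard_normal((16, 8, 3, 3)) * 0.1
features = fcfe(img, k1, k2)
assert features.shape == (24, 32, 32)          # 8 + 16 channels per pixel
```

The concatenated tensor already exposes the redundancy and dimensionality problem the text mentions: every pixel carries all channels from all preserved layers, which is what motivates the subsequent feature selection step.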

The contributions of our work are the following: (1) We present a novel framework combining pixel-wise and object-based analysis for image-map change detection based on a data cleansing method; (2) an FCN pre-trained on the PASCAL VOC dataset for semantic segmentation is used to construct the proposed fully convolutional feature extractor, which extracts dense features from HRS images; and (3) the outdated noisy labels are used to guide feature selection, eliminating the redundancy of the features.

The remainder of this paper is organized as follows. Section 2 provides the details of the proposed image-map change detection framework. Section 3 analyzes the performance of experiments conducted on two simulated and three real datasets. Finally, conclusions are presented in Section 4.

#### **2. Methods**

#### *2.1. Overview of the Method*

The workflow of the proposed approach is illustrated in Figure 1, where the three main components are:

(1) Feature calculation, which is a fully convolutional feature extractor reconstructed from FCN-8s [17] and pre-trained on the PASCAL VOC dataset. Feature calculation extracts multi-scale pixel-wise features from the newly acquired HRS images. A random forest (RF) classifier is then trained to rank the importance of the extracted features based on the outdated basemap. After that, representative features are selected as feature descriptors for each pixel.
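The ranking-and-selection step can be sketched with scikit-learn's impurity-based `RandomForestClassifier.feature_importances_`. The synthetic feature matrix, the noisy basemap labels, and the cut-off of five features are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_pixels, n_features = 2000, 20

# Pixel-wise feature vectors from the FCFE (synthetic stand-ins here) and
# noisy labels read off the outdated basemap (0 = background, 1 = building).
# Channel 3 is constructed to be the informative one.
X = rng.standard_normal((n_pixels, n_features))
y = (X[:, 3] + 0.1 * rng.standard_normal(n_pixels) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank channels by impurity-based importance and keep the top k as the
# per-pixel feature descriptor.
k = 5
top_k = np.argsort(rf.feature_importances_)[::-1][:k]
X_selected = X[:, top_k]
assert 3 in top_k  # the informative channel survives selection
```

Because the labels come from an outdated basemap, they are noisy; the intent of this step is only to rank features by their usefulness for separating buildings from background, for which moderate label noise is tolerable.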


**Figure 1.** Flowchart of the proposed change detection framework. HRS, high-resolution remote sensing.

#### *2.2. Feature Extraction through Fully Convolutional Feature Extractor*

Although the last layers of CNNs are more effective in capturing semantics, they are ineffective in capturing the fine-grained spatial details needed for spatial feature extraction [36]. Two obstacles hinder the direct transformation of DCNNs into dense feature extractors: (1) Pooling layers shrink feature maps exponentially, which suppresses valuable spatial information; (2) fc layers map fixed-size feature tensors into activation vectors, which constrains the input size. In computer vision, images are relatively small and contain only a few salient objects and/or one main scene. This makes cascaded down-sampling important for extracting relationships among the main objects. However, HRS images contain objects that belong to different categories, and no single subject globally determines the theme of an HRS image. Therefore, the long-range relationships captured by stacked pooling layers are of minor importance, while the local responses captured by the early convolutional layers (convlayers) matter much more.

Convolutional kernels in DCNNs pre-trained on a very large dataset constitute rich filter banks capturing various kinds of features. Zeiler and Fergus demonstrate that the early convlayers encode low-level features, such as edges, corners, shapes, or textures, while the deeper layers extract high-level information, such as objects or categories [37]. Kemker et al. assert that the features extracted by the convlayers of pre-trained DCNNs can produce Gabor-like results [38]. Generally, feature maps extracted by the deeper convlayers are coarse and abstract, suffer from a severe size reduction, and contain more information specific to the source datasets, which is irrelevant when transferring to a new target dataset. In contrast, feature maps extracted from the earlier layers are fine-grained and adhere better to object boundaries. Therefore, one can assume that the features from the early convlayers of pre-trained DCNNs have stronger generalization abilities [39]. Since convlayers also accept arbitrary input sizes and intrinsically preserve spatial information, fully convolutional networks (FCNs) reconstructed from the early part of pre-trained DCNNs are more efficient at extracting dense features.

FCN-8s [17], an FCN pre-trained on the PASCAL VOC dataset for 20-class semantic segmentation, is used to construct the proposed fully convolutional feature extractor (FCFE). The used FCN-8s is trained on the PASCAL VOC 2011 segmentation challenge training set, which includes 11,530 images and 5034 segmentations. It is reconstructed and fine-tuned from VGGNet [40], which is pre-trained on ImageNet. FCN-8s consists of five groups of convlayers with pooling layers that encode the input image into high-dimensional dense feature maps. It also has three deconvolutional layers that up-sample and fuse the activations from the last three pooling layers to the size of the input as the predictions. The structure of the original FCN-8s is illustrated in Figure 2.
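As a quick sanity check on these sizes, the stride-2 pooling cascade halves the spatial resolution at each of the five stages, so the pool3, pool4, and pool5 outputs that FCN-8s fuses sit at 1/8, 1/16, and 1/32 of the input. The square input and textbook 2×2 stride-2 pooling are simplifying assumptions (FCN-8s' extra input padding is ignored):

```python
def stage_sizes(input_size, n_pools=5):
    # Spatial side length of the feature maps after each stride-2 pooling stage.
    sizes, s = [], input_size
    for _ in range(n_pools):
        s //= 2
        sizes.append(s)
    return sizes

# For a 512 x 512 input, the five stages yield 256, 128, 64, 32, 16:
# the last three (1/8, 1/16, 1/32 of the input) are what FCN-8s fuses and
# up-samples back to a full-resolution prediction.
assert stage_sizes(512) == [256, 128, 64, 32, 16]
```

This exponential shrinkage is exactly why the proposed FCFE keeps only the early convolutional groups: by the fifth pooling stage, almost all fine-grained spatial detail has been lost.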

**Figure 2.** Structure of the original fully convolutional network (FCN)-8s [17].
