Article

Semi-Supervised Building Detection from High-Resolution Remote Sensing Imagery

1 Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, Shenzhen 518034, China
2 School of Computer Science, China University of Geosciences, Wuhan 430074, China
3 National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(15), 11789; https://doi.org/10.3390/su151511789
Submission received: 27 June 2023 / Revised: 22 July 2023 / Accepted: 27 July 2023 / Published: 1 August 2023
(This article belongs to the Special Issue Intelligent GIS Application for Spatial Data Analysis)

Abstract

Urban building information reflects the status and trends of a region’s development and is essential for urban sustainability. Detection of buildings from high-resolution (HR) remote sensing images (RSIs) provides a practical approach for quickly acquiring building information. Mainstream building detection methods are based on fully supervised deep learning networks, which require a large number of labeled RSIs. In practice, manually labeling building instances in RSIs is labor-intensive and time-consuming. This study introduces semi-supervised deep learning techniques for building detection and proposes a semi-supervised building detection framework to alleviate this problem. Specifically, the framework is based on teacher–student mutual learning and consists of two key modules: the color and Gaussian augmentation (CGA) module and the consistency learning (CL) module. The CGA module is designed to enrich the diversity of building features and the quantity of labeled images for better training of an object detector. The CL module derives a novel consistency loss by imposing consistency of predictions from augmented unlabeled images to enhance the detection ability on the unlabeled RSIs. The experimental results on three challenging datasets show that the proposed framework outperforms state-of-the-art building detection methods and semi-supervised object detection methods. This study develops a new approach for optimizing the building detection task and provides a methodological reference for various object detection tasks on RSIs.

1. Introduction

Buildings are the carrier of human productive activities, and their information reflects many characteristics of the urban environment [1]. Quickly acquiring accurate and reliable building information provides effective references for the construction of smart cities, which can improve the livability of cities and urban development [2], and also provides important references for urban planning [3], disaster management [4], and map updating [5]. Recently emerging high-resolution (HR) remote sensing images (RSIs) provide richer building details, which alleviates the difficulty of acquiring small buildings from low-resolution images [6,7] and provides a data foundation for accurately acquiring building information. Automatic building detection from HR RSIs provides a practical approach for building information acquisition [8] and has received increasing attention in recent years. However, accurate detection of buildings from HR RSIs still poses significant challenges due to factors such as complex backgrounds, diverse building appearances, occlusions, and shadows. Thus, developing new and effective building detection methods has become a challenging and valuable research topic.
Building detection methods can be classified into two categories: traditional and deep learning-based methods. Traditional methods are based on handcrafted features that are mainly derived from physical characteristics, such as building contours [9] and spectral features [10]. These methods are typically combined with classical machine learning methods to detect buildings [11]. However, they are sensitive to noise and illumination variations. In addition, the features need to be hand-crafted anew for new regions and new data sources, which typically requires extensive engineering skills and geoscience expertise. Deep learning-based methods are data-driven and learn features from labeled data [12,13]. Object detection models that require no manual feature engineering, such as Faster R-CNN [14] and SSD [15], have been extended to building detection and, owing to advances in deep learning for object detection, now serve as the dominant approach. Liu et al. [16] propose a hierarchical building detection framework to extract building features at different scales and spatial resolutions. A locally constrained framework is proposed to improve the detection of small and densely distributed buildings [17]. A feature split–merge–enhancement network based on the SSD architecture is proposed to better detect ground objects with scale differences [18]. However, these methods are based on fully supervised models that require a large amount of annotated data for training, which is costly and time-consuming.
Recently, semi-supervised learning (SSL), which utilizes a limited amount of labeled data together with a large amount of unlabeled data, has been shown to be effective in reducing the burden of sample annotation. Several semi-supervised object detection (SS-OD) methods have been derived for detecting objects from RSIs [19,20,21]. However, these methods rely on an anchor mechanism for detection, which is very sensitive to the size and aspect ratio of the detected objects. In particular, the wide variety of building sizes and shapes poses a great challenge to accurately detecting buildings in HR RSIs.
In this work, an SSL building detection (SS-BD) framework is proposed. Specifically, a color and Gaussian augmentation (CGA) module is developed to train a fully convolutional one-stage (FCOS) object detector with labeled images to address the scarcity of annotated samples for detector training. Then, a consistency learning (CL) module is derived from the teacher–student network to impose consistent predictions between differently perturbed images, improving the detection ability on the unlabeled images. Finally, the student model is trained with a joint loss, and the teacher model is refined by averaging the weights of the student model over training steps. The proposed framework removes the predefined anchor boxes and provides a more feasible and efficient detection pipeline compared with anchor-based detection frameworks. To the best of our knowledge, this method is the first to introduce SSL for building detection from HR RSIs. The contributions of this study are summarized as follows:
  • This study proposes a semi-supervised framework for building detection from HR RSIs, which leverages the information from the unlabeled RSIs to improve the semi-supervised building detection performance. This study provides a methodological reference for the various object detection tasks on RSIs.
  • A CGA module is developed to increase the diversity of building features, which enhances the detection ability of an anchor-free detector on the labeled RSIs.
  • A CL module is designed to impose consistent prediction between different perturbed unlabeled RSIs, which improves the detection accuracy and generalization of the detectors.
  • The experimental results on three datasets demonstrate that the proposed framework is superior to several state-of-the-art building detection methods and SS-OD methods, achieving a higher  AP 50  of 0.736, 0.704, and 0.370 on the WHU, CrowdAI, and TCC building datasets, respectively.
The rest of this paper is organized as follows: Section 2 reviews some related works; Section 3 details the proposed framework; Section 4 presents the experimental results; Section 5 describes the ablation study and discusses the factors that may affect the performance of the proposed framework; and finally, Section 6 concludes this study.

2. Related Literature

The work related to this study can be classified into three categories: building detection from RSIs, SS-OD, and SS-OD from RSIs. These categories are reviewed in the following three subsections.

2.1. Building Detection from RSIs

Traditional building detection approaches are mainly handcrafted feature-based methods. These methods detect buildings from RSIs by utilizing physical building characteristics, such as geometric shape and contextual information. Huang et al. [22] improve the morphological building index detector by considering the spectral, geometrical, and contextual information of buildings. Furthermore, a geometric saliency-based method with a new geometric building index is proposed for accurate building detection [23]. An automatic building detection model uses texture information to better distinguish between trees and buildings [24]. A building detector is created by using invariant color features and shadow information of buildings [25]. Handcrafted feature-based methods for automatic building detection relieve the pressure of manual visual interpretation to a large extent. However, the design of handcrafted features is time-consuming and laborious and depends on empirical parameter settings, which limits the generalization and efficiency of these methods.
In the fields of aerial photogrammetry and remote sensing, building detection algorithms based on deep convolutional neural networks show excellent detection performance owing to their ability to extract abstract image features [6]. Existing algorithms can be divided into semantic segmentation-based and object detection-based methods. Segmentation-based methods [26,27,28,29] are mainly built on fully convolutional networks (FCNs) [30] to realize pixel-level building classification. Numerous semantic segmentation models have been generalized to building detection tasks and improve detection performance [5,6,31,32,33]. This work aims to introduce object detection algorithms into automatic building detection. These algorithms are mainly categorized into anchor-based and anchor-free methods. The former require predefined anchor boxes to generate a series of region proposals; representative examples include Faster R-CNN [14], YOLO v1 [34], SSD [15], and RetinaNet [35]. Faster R-CNN uses a region proposal network (RPN) to generate proposal boxes and a proposal prediction network to efficiently classify them. Anchor-free methods remove the anchor boxes and attempt to predict object boxes by detecting the key points or center-ness of objects. Typical detectors are CornerNet [36] and FCOS [37]. CornerNet [36] attempts to predict a bounding box as a pair of corners in a one-stage process. FCOS [37] is the first fully convolutional detector that predicts the category, location, and center-ness of bounding boxes in a per-pixel fashion. Several studies have advanced object detection research in the field of building extraction. An automatic building detection method is proposed to identify roof shape types from RSIs [38]. Hamaguchi et al. [39] propose a building detection method that handles buildings of various sizes. A CNN-based framework with a suitable ROI scale is designed for object detection in HR RSIs [40]. DAPNet [41] is proposed to detect objects in sparse and dense scenes of optical RSIs by improving the architecture of the Faster R-CNN model. An FER-CNN model integrating new boundary detection is proposed to improve the accuracy of building detection [42].
Semantic segmentation-based and object detection-based methods require massive pixel-level and bounding box annotations, respectively, for model training. Their heavy dependence on labeled samples makes them impractical for large-scale regions. The motivation of this study arises from the need to detect buildings by exploiting unlabeled data and to enhance model performance through SSL.

2.2. SS-OD

SSL for object detection has received much attention in recent years. SS-OD is aimed at learning detection models based on labeled and unlabeled images, which can leverage a large number of unlabeled images to improve the performance of object detection. SS-OD methods can be classified into two categories: consistency-based and pseudo labeling-based methods.
The consistency-based methods apply the technique of consistency regularization, which enforces the detection model to produce consistent predictions for different views or perturbations of the same input image [43,44,45,46]. Accordingly, the models can be regularized, and their robustness to noise and variations is enhanced. For example, Mean-Teacher [47] is a consistency-based method that averages model weights instead of label predictions. Jeong et al. [46] apply a consistency constraint between an unlabeled image and its flipped version. Their method proposes a new consistency loss for both the classification of a bounding box and the regression of its location. Tang et al. [48] propose a consistency-based proposal learning module that learns noise-robust proposal features and predictions via consistency losses. ISD [49] is proposed to address the problems caused by interpolation regularization. The method defines different types of interpolation-based loss functions to improve the performance of SSL.
The pseudo labeling-based methods attempt to generate highly confident pseudo labels on unlabeled images to better train a detection model. Pseudo label generation and utilization are crucial to the success of SS-OD [50]. STAC [51] generates stable pseudo labels and updates the model by enforcing consistency by weak augmentations and strong augmentations, respectively. Wang et al. [52] propose a self-training method for object detection, called SSM. This method makes region proposals reliable via cross-image validation and fuses the model with active learning. The Unbiased-Teacher model addresses the pseudo labeling bias issue and produces more accurate pseudo labels [53]. An effective SS-OD model is proposed by using instant teaching and a co-rectify scheme to improve the number of pseudo labels [54]. A soft teacher mechanism and a box-jittering approach are incorporated into an end-to-end SS-OD method [55]. Chen et al. [56] propose a DenSe Learning method to improve the stability and quality of pseudo labels, thus improving the detection performance on SS-OD.
The methods mentioned above aim to enhance model performance on natural images. However, detecting objects from RSIs poses a significant challenge due to diverse environmental factors, such as shadows, vegetation cover, complex roofs, dense building areas, and oblique viewing angles.

2.3. SS-OD from RSIs

Emerging SS-OD techniques have been derived for object detection from RSIs [19,20,21,57] and effectively reduce the burden of sample annotation. These techniques utilize a limited amount of labeled RSIs and a large number of unlabeled RSIs for model training. The performance and generalization ability of the object detectors are improved by reducing the distribution gap between labeled and unlabeled RSIs. For example, Liao et al. [19] propose an improved Faster R-CNN for semi-supervised SAR target detection. Chen et al. [20] develop a Rotation-Invariant and Relation-Aware cross-domain adaptation object detection (CDAOD) network in SSL to address the rotation diversity of HR RSIs. Wang et al. [21] present an SSL-based object detection framework for SAR ship detection. The framework generates pseudo labels by using a label propagation strategy and trains the Faster R-CNN network in SSL. Du et al. [57] propose a novel semi-supervised SAR ship detection network via scene characteristic learning to enhance its feature representation ability for ship targets and clutter.
These RSI-oriented methods are anchor-based, employing the anchor mechanism to generate dense anchor boxes. In practice, anchor-based methods are sensitive to the size and aspect ratio of the detected objects. Because buildings vary greatly in size and shape, these methods are not well suited to detecting buildings from HR RSIs.

3. Research Method

3.1. Framework

This work presents an SSL framework for building detection from HR RSIs. The overall pipeline of the framework is shown in Figure 1. The framework mainly consists of three modules: CGA, CL, and joint learning with EMA. The CGA module takes an anchor-free FCOS detector as a basic detection model and trains the detector on the labeled images with color and Gaussian data augmentation. The anchor-free FCOS model, which is an FCN-based detector, removes the set of anchor boxes and avoids parameter tuning related to them [37]. The CL module designs a consistency loss between predictions obtained by feeding different augmented unlabeled images into the teacher model and the student model. Then, the supervised loss and consistency loss are jointly used to optimize the student model. Finally, the knowledge that the student model learned is transferred to the teacher model which is used as a detector network for building detection.
In this study, two sets of RS building detection data were used to train the model: the labeled set $X^L = \{x_1^L, \dots, x_m^L\}$ and the unlabeled set $X^U = \{x_1^U, \dots, x_n^U\}$, where $m$ and $n$ denote the number of labeled and unlabeled images, respectively, and $m \ll n$. The labeled images annotate the category of each building and the position of each bounding box. The unlabeled images are involved and utilized to assist the detection model in producing precise and robust predictions. The overall objective function of the framework is formulated as follows:
$$L = L_s + \lambda \cdot L_u,$$
where $L_s$ is the supervised loss on labeled images, $L_u$ is the unsupervised loss on unlabeled images, and $\lambda$ is a hyper-parameter that controls the weight of $L_u$.
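To make the weighting in Equation (1) concrete, the following minimal PyTorch sketch combines the two loss terms; the function name is illustrative, and the default value of $\lambda$ is taken from the setting reported later in Section 4.2, not from the authors' code.

```python
import torch

def total_ssl_loss(sup_loss: torch.Tensor, unsup_loss: torch.Tensor,
                   lambda_u: float = 3.0) -> torch.Tensor:
    # L = L_s + lambda * L_u (Equation (1)); lambda = 3 follows Section 4.2.
    return sup_loss + lambda_u * unsup_loss
```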

3.2. CGA

The CGA module aims to enrich the diversity and quantity of labeled RSIs by performing data augmentation on them. In SS-OD methods, the labeled images and their annotations are used to train an object detector with a supervised loss  L s . Specifically, the proposed SS-BD framework adopts the anchor-free FCOS detector as the basic detection model. On the basis of the dense proposal generator [37], the detector is composed of a backbone, an FPN neck, and a dense head, whose architecture is shown in Figure 2. The backbone is used to learn the multilevel discriminative feature maps. These feature maps are then passed through an FPN neck to extract deep features related to building and nonbuilding information from RSIs. The following dense head contains three branches at each point of the feature map, namely, classification, center-ness, and regression branches.
When the number of labeled RSIs is small, it is insufficient to train a good detector owing to the lack of diverse building features. To enrich the building features, two data augmentation strategies, color augmentation and Gaussian augmentation, are introduced to train the FCOS detector and further enhance its detection ability. Color augmentation perturbs the pixel values of the original image in terms of saturation, brightness, contrast, and sharpness. Meanwhile, random Gaussian noise is added to the original images for Gaussian augmentation. Some transformed images with data augmentation are visualized in Figure 3.
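As a rough illustration of the two augmentation strategies, the sketch below applies color and Gaussian perturbations to a labeled image with Pillow and NumPy; the factor ranges mirror those reported later in Section 4.2, and the function names are hypothetical rather than the authors' implementation.

```python
import numpy as np
from PIL import Image, ImageEnhance

def color_augment(img: Image.Image, rng: np.random.Generator) -> Image.Image:
    """Randomly perturb saturation, brightness, contrast, and sharpness.
    Factor ranges follow the settings reported in Section 4.2."""
    img = ImageEnhance.Color(img).enhance(rng.uniform(0.0, 3.0))       # saturation
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.0, 2.0))
    img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.0, 2.0))
    img = ImageEnhance.Sharpness(img).enhance(rng.uniform(0.0, 3.0))
    return img

def gaussian_augment(img: Image.Image, rng: np.random.Generator,
                     sigma: float = 0.15) -> Image.Image:
    """Add zero-mean Gaussian noise; sigma is given on normalized [0, 1] pixels."""
    arr = np.asarray(img).astype(np.float32) / 255.0
    noisy = np.clip(arr + rng.normal(0.0, sigma, arr.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))
```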
During training, each original image and the two augmented images are forward-propagated through the detection model. In FCOS, each point in the feature map can obtain three types of predictions, including the classification score, center-ness score, and regression locations. The supervised loss is then computed based on their predictions and their box annotations. This loss consists of three losses, which can be formulated as follows:
$$L_s = \frac{1}{N_{pos}} \sum_{w,h} \Big[ L_{cls}\big(p_{w,h}, c_{w,h}^B\big) + \mathbb{1}_{\{c_{w,h}^B > 0\}}\, L_{reg}\big(t_{w,h}, t_{w,h}^B\big) + \mathbb{1}_{\{c_{w,h}^B > 0\}}\, L_{center} \Big],$$
where $L_{cls}$, $L_{reg}$, and $L_{center}$ are the classification, regression, and center-ness losses, respectively; $\mathbb{1}_{\{c_{w,h}^B > 0\}}$ is the indicator function; $N_{pos}$ denotes the number of positive samples; $p_{w,h}$ and $t_{w,h}$ are the classification score and regression result at point $(w,h)$ in the feature map, respectively; and $c_{w,h}^B$ and $t_{w,h}^B$ are the ground-truth building category and box position at point $(w,h)$. The supervised loss approaches its minimum value at the end of the training, and the weights of the trained detector are obtained. The weights of the trained detector are then replicated to both the teacher and the student models [53]. This pretraining process with data augmentation provides an initialization for the proposed framework, contributing to the enhancement of building detection accuracy and helping to alleviate overfitting to the unlabeled images in the later CL module.
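The following simplified PyTorch sketch illustrates how the three terms of Equation (2) could be combined over the flattened feature-map locations. The specific loss choices (focal loss for classification, a smooth L1 stand-in for the regression loss, and binary cross-entropy for center-ness) and all tensor names are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def fcos_supervised_loss(cls_logits, centerness, reg_preds,
                         gt_labels, gt_boxes, gt_centerness):
    """Simplified per-location supervised loss in the spirit of Equation (2).

    All tensors are flattened over the (w, h) locations of the feature maps:
      cls_logits:    (L, C) classification logits per location
      centerness:    (L,)   center-ness logits
      reg_preds:     (L, 4) predicted box offsets per location
      gt_labels:     (L,)   target class per location, 0 = background
      gt_boxes:      (L, 4) target box offsets (used at positive locations only)
      gt_centerness: (L,)   target center-ness in [0, 1]
    """
    pos = gt_labels > 0                              # indicator c^B_{w,h} > 0
    num_pos = pos.sum().clamp(min=1).float()         # N_pos

    # Classification loss over all locations (focal loss, as in FCOS).
    cls_targets = torch.zeros_like(cls_logits)
    cls_targets[pos, gt_labels[pos] - 1] = 1.0
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="sum")

    # Regression loss at positive locations only (smooth L1 used here as a
    # simple stand-in for the IoU-style loss of the original detector).
    loss_reg = F.smooth_l1_loss(reg_preds[pos], gt_boxes[pos], reduction="sum")

    # Center-ness loss at positive locations (binary cross-entropy).
    loss_center = F.binary_cross_entropy_with_logits(
        centerness[pos], gt_centerness[pos], reduction="sum")

    return (loss_cls + loss_reg + loss_center) / num_pos
```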

3.3. CL

The CL module uses the idea of consistency regularization, which encourages the detection model to make consistent predictions on the unlabeled images under various disturbances. To leverage the large number of unlabeled images $X^U$, the module defines a consistency loss based on the dense class predictions and regression results from different perturbations of the images. Specifically, given an unlabeled image $x_i^U$, two augmented images $x_i^{u1}$ and $x_i^{u2}$ are generated by applying different weak augmentations to $x_i^U$. Then, the augmented images are fed into the teacher and student models, and two augmented input–output pairs are obtained through the forward pass of each model. The outputs used for CL contain only the class probability scores and regression results and require no extra NMS. In this work, the mean squared error (MSE) is used as the consistency regularization loss. The classification consistency loss for a pair of points in the feature map is defined as follows:
$$L_{con\_cls} = \frac{1}{N} \sum_{w,h} R\big(p_{w,h}^{u1},\, p_{w,h}^{u2}\big),$$
where $R(\cdot,\cdot)$ denotes the MSE loss; $N$ is the number of points in the feature map; and $p_{w,h}^{u1}$ and $p_{w,h}^{u2}$ denote the classification output scores from the teacher and student models, respectively. In the regression branch, an MSE loss is minimized based on the localization outputs of each pair of points to achieve regression consistency, which can be described by
$$L_{con\_reg} = \frac{1}{N} \sum_{w,h} R\big(t_{w,h}^{u1},\, t_{w,h}^{u2}\big),$$
where $t_{w,h}^{u1}$ and $t_{w,h}^{u2}$ are the regression results from the teacher and student models, respectively.
The total consistency loss is the sum of the classification and regression consistency losses:
$$L_{con} = L_{con\_cls} + L_{con\_reg}.$$
The above process can be summarized by Algorithm 1.
Algorithm 1 Consistency Loss
Input: Unlabeled image $x_i^U$
Output: Consistency loss $L_{con}$
1: Generate two randomly augmented images:
2:   $x_i^{u1} = \mathrm{augmentation}_1(x_i^U)$
3:   $x_i^{u2} = \mathrm{augmentation}_2(x_i^U)$
4: Forward $x_i^{u1}$ and $x_i^{u2}$ through the teacher model and the student model to generate outputs:
5:   $t_{w,h}^{u1},\, p_{w,h}^{u1} = \mathrm{teacher}(x_i^{u1})$
6:   $t_{w,h}^{u2},\, p_{w,h}^{u2} = \mathrm{student}(x_i^{u2})$
7: Compute the consistency losses for classification and regression:
8:   $L_{con\_cls} = \frac{1}{N}\sum_{w,h} R(p_{w,h}^{u1}, p_{w,h}^{u2})$, $L_{con\_reg} = \frac{1}{N}\sum_{w,h} R(t_{w,h}^{u1}, t_{w,h}^{u2})$
9: Compute the total consistency loss:
10:   $L_{con} = L_{con\_cls} + L_{con\_reg}$
11: return $L_{con}$
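A minimal PyTorch sketch of Algorithm 1 is given below. It assumes both detectors return dense per-location classification scores and regression results before NMS; the augmentation wrappers and function names are hypothetical, and any spatial misalignment introduced by geometric augmentations (e.g., flips) is assumed to be undone before the losses are computed.

```python
import torch
import torch.nn.functional as F

def consistency_loss(teacher, student, unlabeled_image, weak_aug1, weak_aug2):
    """Sketch of Algorithm 1: MSE consistency between two weakly augmented views.

    `teacher` and `student` are assumed to map an image to dense per-location
    (classification scores, regression results) tensors of matching shapes.
    """
    view_t = weak_aug1(unlabeled_image)      # view fed to the teacher
    view_s = weak_aug2(unlabeled_image)      # view fed to the student

    with torch.no_grad():                    # no gradients through the teacher
        cls_t, reg_t = teacher(view_t)
    cls_s, reg_s = student(view_s)

    loss_cls = F.mse_loss(cls_s, cls_t)      # Equation (3)
    loss_reg = F.mse_loss(reg_s, reg_t)      # Equation (4)
    return loss_cls + loss_reg               # Equation (5)
```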

3.4. Joint Learning with EMA

In this study, the total loss is calculated by weighting the supervised and unsupervised losses. The supervised loss is the sum of three losses: classification, regression, and center-ness losses. These losses are calculated using Equation (2). The unsupervised loss is defined as the total consistency loss in Equation (5). Therefore, the total loss is formulated as follows:
$$L = L_s + \lambda \cdot L_{con},$$
where $\lambda$ is a hyper-parameter that weights the consistency loss term. The total loss $L$ is used to train the student model with the stochastic gradient descent (SGD) [60] algorithm.
Based on existing work [53,56], the EMA method, which averages model weights over training steps, is employed to refine the teacher model by the ensemble of the student model. The EMA updating method aims to facilitate the fitting of the model to the labeled and unlabeled images. Specifically, the weights of the teacher detector are gradually updated in different training iterations, which is defined as follows:
$$\theta_t = \alpha\, \theta_t + (1 - \alpha)\, \theta_s,$$
where $\theta_t$ and $\theta_s$ denote the teacher and student model weights, respectively, and $\alpha$ is the EMA decay coefficient. After the training, the learned teacher model is employed to detect buildings from HR RSIs.
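A compact sketch of the EMA update in Equation (7), applied parameter-wise in PyTorch, might look as follows; copying the buffers (e.g., BatchNorm statistics) from the student is an assumption, as the text does not specify how they are handled.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               alpha: float = 0.99) -> None:
    """Parameter-wise EMA update of the teacher (Equation (7)):
    theta_t <- alpha * theta_t + (1 - alpha) * theta_s; alpha = 0.99 per Section 4.2."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
    # Buffers (e.g., BatchNorm running statistics) are simply copied here.
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)
```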

4. Results

4.1. Datasets

In the experiments, three standard building extraction datasets, including the WHU aerial imagery dataset [33], the CrowdAI building dataset [58], and the building dataset of TCC [59], are employed to examine the performance of the proposed method.

4.1.1. WHU Aerial Imagery Dataset

The WHU aerial imagery dataset [33] is a challenging dataset for building detection and has been widely used in previous works. The dataset includes aerial imagery of Christchurch, New Zealand, with a 0.075 m spatial resolution, containing more than 220,000 building instances. This dataset consists of 8189 tiles of size 512 × 512 pixels. The training, validation, and test sets consist of 4736, 1036, and 2416 images, respectively. Four examples and their ground truth (GT) labels are shown in Figure 4.

4.1.2. CrowdAI Building Dataset

The CrowdAI building dataset [58] contains 341,438 satellite images with the size 300 × 300 pixels. This dataset poses some challenges for building detection, such as shadows, occlusions, and complex backgrounds, as shown in Figure 5. Moreover, the dataset provides instance annotations of approximately 2.4 million buildings in the MS-COCO format. The training and test sets contain 8366 and 1820 images, respectively.

4.1.3. TCC Building Dataset

The TCC building dataset [59] has a total of 7260 images of 500 × 500 pixels with a spatial resolution of 0.29 m, which provides 63,886 building instance annotations with the MS-COCO format. The RSIs in the dataset are sampled from Google Maps and distributed in four cities in China, including Beijing, Wuhan, Shanghai, and Shenzhen. The dataset includes several non-orthophoto images, which are challenging for building detection tasks. Some examples from this dataset are shown in Figure 6. The training set and test set consist of 5985 and 1275 images, respectively.
Three mixed datasets containing labeled and unlabeled RSIs are created based on the three building datasets. Specifically, the authors of this study randomly select 1% of the labeled images from their training sets as labeled RSIs and treat the remaining training images as unlabeled RSIs. The three datasets are summarized in Table 1.
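The sampling of the mixed datasets can be sketched as below; the 1% ratio matches the description above, while the function name and seed handling are illustrative.

```python
import random
from typing import List, Sequence, Tuple

def split_labeled_unlabeled(image_ids: Sequence[str], labeled_ratio: float = 0.01,
                            seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly mark a small fraction (here 1%) of the training images as labeled
    and treat the remainder as unlabeled, mirroring how the mixed datasets in
    Table 1 are assembled."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_labeled = max(1, int(len(ids) * labeled_ratio))
    return ids[:n_labeled], ids[n_labeled:]
```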

4.2. Experimental Settings

All the experiments in this work are implemented with the PyTorch [61] and MMDetection [62] frameworks. The base detector adopts the anchor-free FCOS model [37] with a ResNet50 [12] backbone initialized with ImageNet-pretrained weights. In the CGA module, the “ImageEnhance” functions of the Pillow library are used to perform the color augmentations. The saturation and sharpness are changed by a random factor ranging from 0 to 3, the brightness and contrast are changed by a random factor ranging from 0 to 2, and Gaussian noise is added with a standard deviation of 0.15. The FCOS detector is trained for 12 epochs with a batch size of two. This study follows the MS-COCO [63] training configuration in [62], applying data augmentation through random flip and resize operations. The initial learning rate is 0.0025, and SGD is used as the optimizer with a momentum of 0.9 and a weight decay of 0.0001.
In the CL module, the weak augmentations for consistency regularization follow the previous work [53], including random horizontal flip, salt noise, Gaussian blur, color jittering, and grayscale conversion. In each batch, four images are randomly sampled from the labeled and unlabeled sets at a 1:1 ratio. The learning rate starts from 0.0025 and is divided by 16 at epochs 22 and 28. The maximum number of training epochs is 32, and $\alpha$ and $\lambda$ are set to 0.99 and 3, respectively. All the experiments are carried out on a 3090 GPU with 24 GB of memory.
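The 1:1 labeled/unlabeled batch composition described above could be sketched as follows; the sampler is illustrative and not taken from the authors' code.

```python
import random
from typing import Iterator, List, Tuple

def mixed_batches(labeled_ids: List[str], unlabeled_ids: List[str],
                  batch_size: int = 4) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield batches with a 1:1 ratio of labeled to unlabeled images
    (two of each for the batch size of four used in this work)."""
    half = batch_size // 2
    while True:
        yield (random.sample(labeled_ids, half),
               random.sample(unlabeled_ids, half))
```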

4.3. Evaluation Metrics

In this work, the widely used AP metric is employed to evaluate the detection ability of the model, including $\mathrm{AP}@0.5{:}0.95$, $\mathrm{AP}_{50}$, $\mathrm{AP}_{75}$, $\mathrm{AP}_s$, $\mathrm{AP}_m$, and $\mathrm{AP}_l$. $\mathrm{AP}@0.5{:}0.95$ is calculated with the following formula:
$$\mathrm{AP}@0.5{:}0.95 = \frac{\mathrm{AP}_{0.50} + \mathrm{AP}_{0.55} + \cdots + \mathrm{AP}_{0.90} + \mathrm{AP}_{0.95}}{10},$$
where $\mathrm{AP}@0.5{:}0.95$ is the mean of $\mathrm{AP}_{0.50}$ through $\mathrm{AP}_{0.95}$, computed at IoU thresholds spaced 0.05 apart. $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ denote the AP at intersection-over-union (IoU) thresholds of 0.5 and 0.75, respectively. $\mathrm{AP}_s$ refers to the AP of small targets with an area smaller than 32 × 32 pixels, $\mathrm{AP}_m$ refers to the AP of medium targets with an area between 32 × 32 and 96 × 96 pixels, and $\mathrm{AP}_l$ refers to the AP of large targets with an area larger than 96 × 96 pixels.
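For clarity, Equation (8) amounts to a simple average over ten IoU thresholds, as the following sketch shows; the function and input format are illustrative.

```python
from typing import Mapping

def ap_50_95(ap_per_threshold: Mapping[float, float]) -> float:
    """Average AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95 (Equation (8)).

    `ap_per_threshold` maps each IoU threshold to its AP, e.g. {0.50: 0.736, ...}.
    """
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)
```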

4.4. Baselines

To investigate the effectiveness of the proposed framework, four state-of-the-art building detection methods are selected for comparison, covering both anchor-based and anchor-free object detection methods. Among the anchor-based methods, Faster R-CNN [14] and SSD [15] have been widely applied to building detection. Faster R-CNN utilizes an RPN to generate high-quality region proposals for optimizing building detection performance, and SSD localizes buildings with multilayer feature maps in a one-stage manner. Two anchor-free detectors, CornerNet [36] and FCOS [37], serve as baselines. CornerNet detects buildings by predicting pairs of corners and grouping them into detection results, and the basic detector, FCOS, is also chosen to perform the building detection task for comparison.
Two SS-OD methods are used for performance comparison, including Unbiased-Teacher [53] and Listen2Student [64]. Unbiased-Teacher is designed to improve the quality of pseudo labels by addressing the class-imbalance problem. The Listen2Student mechanism can effectively work on anchor-free detectors and reduce the impact of poor quality pseudo labels on detection performance.

4.5. Overall Results

4.5.1. Evaluation with the WHU Aerial Imagery Dataset

This study evaluates the proposed framework on the WHU test dataset. The experimental results are presented in Table 2. The proposed framework achieves the best accuracy on all evaluation metrics, reaching 0.380 and 0.736 in terms of AP and $\mathrm{AP}_{50}$, respectively. The proposed framework achieves an AP 0.049 and 0.066 higher than the anchor-based methods, Faster R-CNN and SSD, respectively. Compared with the anchor-free methods, it gains a large improvement, performing 0.137 and 0.169 higher in AP than CornerNet and FCOS, respectively. Meanwhile, the proposed framework performs better on most evaluation indicators than the semi-supervised methods. The experimental results demonstrate that our framework can effectively leverage the unlabeled RSIs and improve building detection performance.
Some samples are visualized in Figure 7 to further observe the results. The figure highlights the areas of discrepancy in the detection results with red dotted-line circles. In the first row, the baseline methods present missed or wrong location detections in the dense building areas, while the proposed framework correctly identifies the locations of buildings. In the second row, our framework can better distinguish buildings from backgrounds or roads with similar spectra and textures compared with the baselines. In addition, the proposed framework can reduce the omission of buildings in scenes with large buildings (third row), small buildings (fourth row), and buildings obscured by trees. The visual results demonstrate the superiority of the detection ability of the proposed framework.

4.5.2. Evaluation with the CrowdAI Building Dataset

The comparison of building detection results on the CrowdAI test dataset is listed in Table 3. The anchor-based methods, Faster R-CNN and SSD, achieve superior performance over the anchor-free methods under supervision with the limited labeled images. This improvement can be attributed to the predefined anchors, which help alleviate the scale variance of buildings in the RSIs. The proposed SS-BD method obtains better building detection accuracy than the baseline methods, with AP and $\mathrm{AP}_{50}$ values of 0.365 and 0.704, exceeding the supervised FCOS framework by 0.142 and 0.152, respectively. In addition, the proposed method is 0.127 and 0.141 higher in AP than the semi-supervised methods, Unbiased-Teacher and Listen2Student. This finding shows that the proposed SS-BD framework can achieve more promising building extraction results.
Figure 8 shows the visualized results of the proposed framework and the baselines on the CrowdAI building dataset. The first and second rows of the figure present scenes in which roads are similar to buildings in color. The baseline methods cannot attain accurate and complete building detection results, while the proposed framework effectively avoids this inadequacy. In the fourth row, the building areas are characterized by complex shapes and obscured by trees, and the baseline methods fail to find the buildings. By contrast, the proposed method performs better at detecting these buildings.

4.5.3. Evaluation with the TCC Building Dataset

In this section, experiments are conducted on the TCC building dataset to further verify the performance of the proposed semi-supervised detection framework. In Table 4, the quantitative results indicate that the proposed framework outperforms the baselines on all six AP metrics, reaching 0.133, 0.370, and 0.057 in terms of AP, $\mathrm{AP}_{50}$, and $\mathrm{AP}_{75}$, respectively. All the methods achieve lower precision on this dataset than on the other two datasets because of the more challenging scenes in the TCC building dataset. The supervised methods present poorer detection accuracy than the semi-supervised methods when only a limited number of labeled images are available. The proposed framework enhances the ability to accurately detect buildings owing to the adoption of effective SSL strategies. These experimental results further show that the proposed framework is robust.
Figure 9 visualizes some results on the TCC building dataset. The proposed framework can better detect buildings than the baseline methods in various scenes, including buildings with different textures and shapes (first and second rows), non-orthogonal images (third row), and dense small buildings (fourth row). Furthermore, the proposed framework identifies buildings more accurately than the baseline methods, while the five baseline methods produce false or missed detection boxes. In the fourth row of Figure 9, the proposed framework produces building boxes with accurate locations and sizes even in areas where buildings are densely distributed. These results further indicate that the proposed framework shows superior detection ability in addressing the challenges of building detection in complex scenarios.

5. Discussion

5.1. Ablation Study

In this section, the ablation study is reported to examine the effect of two key components in the proposed framework. The ablation experiment was performed on the WHU aerial imagery dataset. Specifically, the Basic method removes the CGA and CL modules from the proposed framework and only adopts the FCOS detector trained with the labeled RSIs. The w/o CL method removes the CL module from the proposed framework. The w/o CGA method is trained on the labeled and unlabeled images after removing the CGA module. Table 5 lists the experimental results of the proposed framework and the three methods.
As shown in Table 5, the Basic method exhibits the worst performance, with an AP value of 0.211. The w/o CL method outperforms the Basic method in AP, $\mathrm{AP}_{50}$, and $\mathrm{AP}_{75}$ by approximately 0.153, 0.158, and 0.230, respectively. This improvement suggests that the CGA module enables better model training with the augmented labeled images. The w/o CGA method also achieves better results on all six metrics compared with the Basic method, indicating that the CL module helps improve detection ability. The proposed framework achieves the best accuracy in terms of AP when both CGA and CL are introduced into the Basic method. These results demonstrate that the two modules facilitate building detection accuracy in SSL.

5.2. Effect of the Amount of Labeled Samples for the Training Model

As a semi-supervised method, this study is concerned with using the labeled and unlabeled data to perform the building extraction tasks from HR RSIs. The amount of labeled and unlabeled data used in the training may affect the model’s performance. Two groups of experiments are reported in the following two subsections to investigate the effect.

5.2.1. Amount of Labeled RSIs

In this experiment, 1%, 2%, and 5% of the labeled training images are randomly sampled as a labeled set, and the remaining labeled training images are used as an unlabeled set. The experimental results of the supervised FCOS framework and the proposed framework are reported in Figure 10.
The proposed framework achieves performance improvements over the supervised framework under the different ratios of labeled data, as shown in Figure 10. For example, at a 5%-labeled ratio, the AP is 0.049, 0.015, and 0.057 higher on the three building datasets, respectively. The experimental results indicate that the proposed framework is effective and robust when trained with different ratios of labeled data. In addition, the AP values gradually improve as the ratio of labeled data increases. This trend suggests that more labeled RSIs contribute to enhancing the SS-BD performance. Moreover, the proposed semi-supervised framework has a larger performance advantage over the supervised learning method when the ratio of labeled data is small. The reason might be that constraining the consistency among predictions from unlabeled images shows greater potential for performance improvement as the number of unlabeled images increases.

5.2.2. Amount of Building Instances

In practice, the number of building instances in each RSI varies. Accordingly, the performance of the proposed framework is evaluated under different numbers of labeled buildings provided in the labeled RSIs. Specifically, three groups of random image sampling are performed on the CrowdAI building dataset, with sampling ratios of 1%, 2%, and 5%. The images in each group contain a fixed proportion of building instances relative to the total number of buildings in the training set.
Table 6 shows the quantitative results between the supervised FCOS framework and the proposed framework. The proposed framework consistently maintains a better detection capability than the supervised framework with the increase in the amount of building instances for training. The comparison results demonstrate the effectiveness of the proposed semi-supervised framework on varying amounts of building instances in the labeled RSIs.

5.3. Generalization Analysis

In this section, the authors conducted experiments in several scenarios to investigate the generalization of the proposed framework. The same number of unlabeled RSIs is sampled from Google Earth in four areas: Tokyo, Shenzhen, Serbia, and New Zealand. Some sampled images are visualized in Figure 11.
In the comparison experiments, the unlabeled RSIs of the training set are replaced with the data from each of these regions for semi-supervised training. The quantitative results on the WHU test set are plotted in Figure 12. The inclusion of unlabeled RSIs from different regions significantly improves the detection accuracy of the framework compared with the supervised framework trained without unlabeled RSIs. The framework achieves similar accuracy on all evaluation metrics when using the Tokyo, Shenzhen, and New Zealand regions as unlabeled RSIs. Although the proposed framework still achieves good performance when adding the Serbian region as unlabeled RSIs, the AP values are slightly lower than when using the other three regions. This phenomenon is due to the sparse distribution of buildings in the Serbian region, which limits the building features that can be acquired. These cross-region experiments effectively illustrate the excellent generalization ability of the proposed model, which can benefit from the features of unlabeled RSIs from different regions and improve the building detection capability of the model.

5.4. Effect of Augmentation Strategies

The proposed framework employs a data augmentation strategy, CGA, to enrich the diversity of building features by applying perturbation transformations on the RSIs. To further observe the effects of different data augmentation strategies on the proposed SS-BD framework, this study uses color augmentation, Gaussian augmentation, and the CGA strategy to conduct the experiments with WHU 1%-labeled RSIs.
The experimental results are reported in Table 7. The results show that a higher AP of 0.373 is achieved with Gaussian augmentation than that of 0.353 with color augmentation. This observation may be due to color differences that break the pixel value continuity of the image, reducing detection performance. When the two augmentation strategies are combined, the CGA augmentation strategy in the CGA module obtains better accuracy, including 0.380, 0.736, and 0.354 in terms of AP,  AP 50 , and  AP 75 , respectively. These results demonstrate that the proposed CGA strategy can have a positive effect on the detector performance.

6. Conclusions

In this work, a semi-supervised framework is proposed to alleviate the problems of large labeled datasets required for building detection from HR RSIs, which provides an important reference for the resource allocation and sustainable development of smart cities. The experimental results show that the proposed framework has a higher AP of 0.380, 0.365, and 0.133 and an  AP 50  of 0.736, 0.704, and 0.370 on the WHU, CrowdAI, and TCC datasets, respectively. The proposed framework increases the diversity of building features with the color and Gaussian data augmentation strategies and improves the detection ability on the unlabeled images by the introduction of consistency learning. Compared with the competitive approaches, the proposed framework achieves the best detection accuracy over multiple datasets, showing a good generalization ability, and has good detection performance in some challenging scenarios, such as road objects with colors similar to those of buildings, building areas characterized by complex shapes and obscured by trees, and images with dense small buildings.
Furthermore, additional features, such as building center-ness, can be introduced into the framework to further improve building detection from HR RSIs. Moreover, the proposed framework is not superior to supervised methods when labeled RSIs are relatively sufficient, a limitation that should be investigated further in future work.

Author Contributions

Conceptualization, D.Z.; methodology, K.W.; software, J.K.; validation, J.K.; formal analysis, D.Z. and J.K.; resources, H.G. and X.Z.; data curation, J.K.; writing—original draft preparation, D.Z. and Y.F.; writing—review and editing, S.L. and F.F.; visualization, Y.F.; supervision, F.F.; funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, under Grant KF-2021-06-088.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The WHU building dataset used in this article can be downloaded at https://study.rsgis.whu.edu.cn/pages/download/building_dataset.html (accessed on 29 September 2022). The TCC building dataset used in this article can be downloaded at https://doi.org/10.11922/sciencedb.00620 (accessed on 31 March 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Stiller, D.; Stark, T.; Strobl, V.; Leupold, M.; Wurm, M.; Taubenböck, H. Efficiency of CNNs for building extraction: Comparative analysis of performance and time. In Proceedings of the 2023 Joint Urban Remote Sensing Event (JURSE), Heraklion, Greece 17–19 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar]
  2. Huang, L.; Zhu, J.; Qiu, M.; Li, X.; Zhu, S. CA-BASNet: A Building Extraction Network in High Spatial Resolution Remote Sensing Images. Sustainability 2022, 14, 11633. [Google Scholar] [CrossRef]
  3. Zhao, K.; Liu, Y.; Hao, S.; Lu, S.; Liu, H.; Zhou, L. Bounding boxes are all we need: Street view image classification via context encoding of detected buildings. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602817. [Google Scholar] [CrossRef]
  4. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sens. Environ. 2021, 265, 112636. [Google Scholar] [CrossRef]
  5. Gong, M.; Liu, T.; Zhang, M.; Zhang, Q.; Lu, D.; Zheng, H.; Jiang, F. Context-content collaborative network for building extraction from high-resolution imagery. Knowl.-Based Syst. 2023, 263, 110283. [Google Scholar] [CrossRef]
  6. Zhou, Y.; Chen, Z.; Wang, B.; Li, S.; Liu, H.; Xu, D.; Ma, C. BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618617. [Google Scholar] [CrossRef]
  7. Guo, H.; Shi, Q.; Du, B.; Zhang, L.; Wang, D.; Ding, H. Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4287–4306. [Google Scholar] [CrossRef]
  8. Li, J.; Huang, X.; Tu, L.; Zhang, T.; Wang, L. A review of building detection from very high resolution optical remote sensing images. GISci. Remote Sens. 2022, 59, 1199–1225. [Google Scholar] [CrossRef]
  9. Zhang, K.; Ming, D.; Du, S.; Xu, L.; Ling, X.; Zeng, B.; Lv, X. Distance Weight-Graph Attention Model-Based High-Resolution Remote Sensing Urban Functional Zone Identification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  10. Zhang, T.; Huang, X. Monitoring of urban impervious surfaces using time series of high-resolution remote sensing images in rapidly urbanized areas: A case study of Shenzhen. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2692–2708. [Google Scholar] [CrossRef]
  11. Qin, R.; Tian, J.; Reinartz, P. Spatiotemporal inferences for use in building detection using series of very-high-resolution space-borne stereo images. Int. J. Remote Sens. 2016, 37, 3455–3476. [Google Scholar] [CrossRef] [Green Version]
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  13. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13733–13742. [Google Scholar]
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  15. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  16. Liu, Y.; Zhang, Z.; Zhong, R.; Chen, D.; Ke, Y.; Peethambaran, J.; Chen, C.; Sun, L. Multilevel building detection framework in remote sensing images based on convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3688–3700. [Google Scholar] [CrossRef]
  17. Xie, Y.; Cai, J.; Bhojwani, R.; Shekhar, S.; Knight, J. A locally-constrained YOLO framework for detecting small and densely-distributed building footprints. Int. J. Geogr. Inf. Sci. 2020, 34, 777–801. [Google Scholar] [CrossRef]
  18. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature split–merge–enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  19. Liao, L.; Du, L.; Guo, Y. Semi-supervised SAR target detection based on an improved faster R-CNN. Remote Sens. 2021, 14, 143. [Google Scholar] [CrossRef]
  20. Chen, Y.; Liu, Q.; Wang, T.; Wang, B.; Meng, X. Rotation-invariant and relation-aware cross-domain adaptation object detection network for optical remote sensing images. Remote Sens. 2021, 13, 4386. [Google Scholar] [CrossRef]
  21. Wang, C.; Shi, J.; Zou, Z.; Wang, W.; Zhou, Y.; Yang, X. A Semi-Supervised Sar Ship Detection Framework Via Label Propagation and Consistent Augmentation. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4884–4887. [Google Scholar]
  22. Huang, X.; Zhang, L. An adaptive mean-shift analysis approach for object extraction and classification from urban hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2008, 46, 4173–4185. [Google Scholar] [CrossRef]
  23. Huang, J.; Xia, G.S.; Hu, F.; Zhang, L. Accurate building detection in VHR remote sensing images using geometric saliency. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3991–3994. [Google Scholar]
  24. Awrangjeb, M.; Zhang, C.; Fraser, C.S. Improved building detection using texture information. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2013, 38, 143–148. [Google Scholar] [CrossRef] [Green Version]
  25. Sirmacek, B.; Unsalan, C. Building detection from aerial images using invariant color features and shadow information. In Proceedings of the 2008 23th International Symposium on Computer and Information Sciences, Istanbul, Turkey, 27–29 October 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–5. [Google Scholar]
  26. Yin, X.X.; Sun, L.; Fu, Y.; Lu, R.; Zhang, Y. U-Net-Based medical image segmentation. J. Healthc. Eng. 2022, 2022, 4189781. [Google Scholar] [CrossRef]
  27. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  28. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 2022, 493, 626–646. [Google Scholar] [CrossRef]
  29. Gu, J.; Kwon, H.; Wang, D.; Ye, W.; Li, M.; Chen, Y.H.; Lai, L.; Chandra, V.; Pan, D.Z. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 12094–12103. [Google Scholar]
  30. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  31. Boonpook, W.; Tan, Y.; Xu, B. Deep learning-based multi-feature semantic segmentation in building extraction from images of UAV photogrammetry. Int. J. Remote Sens. 2021, 42, 1–19. [Google Scholar] [CrossRef]
  32. Sun, G.; Huang, H.; Zhang, A.; Li, F.; Zhao, H.; Fu, H. Fusion of multiscale convolutional neural networks for building extraction in very high-resolution images. Remote Sens. 2019, 11, 227. [Google Scholar] [CrossRef] [Green Version]
  33. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  35. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  36. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  37. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef] [PubMed]
  38. Alidoost, F.; Arefi, H. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image. PFG–J. Photogramm. Remote Sens. Geoinf. Sci. 2018, 86, 235–248. [Google Scholar] [CrossRef]
  39. Hamaguchi, R.; Nemoto, K.; Imaizumi, T.; Hikosaka, S. Detecting buildings of any size using integration of CNN models. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1280–1283. [Google Scholar]
  40. Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object detection in high resolution remote sensing imagery based on convolutional neural networks with suitable object scale features. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2104–2114. [Google Scholar] [CrossRef]
  41. Cheng, L.; Liu, X.; Li, L.; Jiao, L.; Tang, X. Deep adaptive proposal network for object detection in optical remote sensing images. arXiv 2018, arXiv:1807.07327. [Google Scholar]
  42. Reda, K.; Kedzierski, M. Detection, classification and boundary regularization of buildings in satellite imagery using faster edge region convolutional neural networks. Remote Sens. 2020, 12, 2240. [Google Scholar] [CrossRef]
  43. Verma, V.; Kawaguchi, K.; Lamb, A.; Kannala, J.; Solin, A.; Bengio, Y.; Lopez-Paz, D. Interpolation consistency training for semi-supervised learning. Neural Netw. 2022, 145, 90–106. [Google Scholar] [CrossRef]
  44. Yu, W.; Zhu, S.; Yang, T.; Chen, C. Consistency-based active learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3951–3960. [Google Scholar]
  45. Li, G.; Li, X.; Wang, Y.; Wu, Y.; Liang, D.; Zhang, S. Pseco: Pseudo labeling and consistency training for semi-supervised object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 457–472. [Google Scholar]
  46. Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019); MIT Press: Cambridge, MA, USA, 2019. [Google Scholar]
  47. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems 30 (NIPS 2017); MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  48. Tang, P.; Ramaiah, C.; Wang, Y.; Xu, R.; Xiong, C. Proposal learning for semi-supervised object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2291–2301. [Google Scholar]
  49. Jeong, J.; Verma, V.; Hyun, M.; Kannala, J.; Kwak, N. Interpolation-based semi-supervised learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11602–11611. [Google Scholar]
  50. Guo, L.Z.; Zhang, Z.Y.; Jiang, Y.; Li, Y.F.; Zhou, Z.H. Safe deep semi-supervised learning for unseen-class unlabeled data. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3897–3906. [Google Scholar]
  51. Sohn, K.; Zhang, Z.; Li, C.L.; Zhang, H.; Lee, C.Y.; Pfister, T. A simple semi-supervised learning framework for object detection. arXiv 2020, arXiv:2005.04757. [Google Scholar]
  52. Wang, K.; Yan, X.; Zhang, D.; Zhang, L.; Lin, L. Towards human-machine cooperation: Self-supervised sample mining for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1605–1613. [Google Scholar]
  53. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  54. Zhou, Q.; Yu, C.; Wang, Z.; Qian, Q.; Li, H. Instant-teaching: An end-to-end semi-supervised object detection framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4081–4090. [Google Scholar]
  55. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3060–3069. [Google Scholar]
  56. Chen, B.; Li, P.; Chen, X.; Wang, B.; Zhang, L.; Hua, X.S. Dense learning based semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 4815–4824. [Google Scholar]
  57. Du, Y.; Du, L.; Guo, Y.; Shi, Y. Semi-Supervised SAR Ship Detection Network via Scene Characteristic Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5201517. [Google Scholar] [CrossRef]
58. Mohanty, S.P. CrowdAI Mapping Challenge 2018: Baseline with Mask R-CNN. GitHub Repository. 2018. Available online: https://github.com/crowdai/crowdai-mapping-challenge-mask-rcnn (accessed on 2 May 2022).
  59. Wu, K.; Zheng, D.; Chen, Y.; Zeng, L.; Zhang, J.; Chai, S.; Xu, W.; Yang, Y.; Li, S.; Liu, Y.; et al. A dataset of building instances of typical cities in China. Chin. Sci. Data 2021, 6, 191–199. [Google Scholar]
  60. Hu, J.; Doshi, V.; Eun, D.Y. Efficiency Ordering of Stochastic Gradient Descent. Adv. Neural Inf. Process. Syst. 2022, 35, 15875–15888. [Google Scholar]
61. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019); MIT Press: Cambridge, MA, USA, 2019. [Google Scholar]
62. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
63. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  64. Liu, Y.C.; Ma, C.Y.; Kira, Z. Unbiased teacher v2: Semi-supervised object detection for anchor-free and anchor-based detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 9819–9828. [Google Scholar]
Figure 1. Overview of the proposed SS-BD framework for building detection from RSIs. (a) Color and Gaussian augmentation. (b) Consistency learning. (c) Joint learning with EMA.
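Figure 1c refers to joint learning in which the teacher detector is updated as an exponential moving average (EMA) of the student. Since the update rule itself is not reproduced in this back matter, the following is only a minimal sketch of a standard EMA teacher update; the decay value and function name are illustrative assumptions rather than the authors' settings.

```python
import torch

@torch.no_grad()
def update_teacher_ema(student: torch.nn.Module,
                       teacher: torch.nn.Module,
                       decay: float = 0.999) -> None:
    """Blend student weights into the teacher: theta_t <- d * theta_t + (1 - d) * theta_s."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
    # Buffers such as BatchNorm running statistics are often copied directly as well.

# Usage sketch: the teacher starts as a copy of the student and is refreshed
# by EMA after every student optimization step.
# teacher = copy.deepcopy(student).eval()
# for batch in loader:
#     ...  # student forward/backward/step (omitted)
#     update_teacher_ema(student, teacher)
```

Calling such an update after every student step keeps the teacher a smoothed, more stable copy of the student, which is the usual motivation for mean-teacher-style training [47].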
Figure 2. Structure of the FCOS detector.
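Figure 2 depicts the FCOS detector [37] adopted as the base detector; the experiments are implemented with PyTorch [61] and MMDetection [62]. As a rough, self-contained stand-in rather than the configuration used in the paper, the sketch below instantiates an FCOS model with a single building class through torchvision; the class count and tile size are illustrative assumptions.

```python
import torch
from torchvision.models.detection import fcos_resnet50_fpn

# One foreground class ("building") plus background; no pretrained weights loaded here.
model = fcos_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2)
model.eval()

# A dummy RGB tile (512 x 512 is a typical patch size for aerial building datasets).
dummy_tile = [torch.rand(3, 512, 512)]
with torch.no_grad():
    detections = model(dummy_tile)

# Each output element holds per-image "boxes", "labels", and "scores" tensors.
print(detections[0]["boxes"].shape, detections[0]["scores"].shape)
```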
Figure 3. Visualization of augmented samples from the WHU aerial imagery dataset [33], the CrowdAI dataset [58], and the building dataset of Typical Cities in China (TCC) [59]. From left to right: original image, color augmentation, Gaussian augmentation.
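Figure 3 shows the color and Gaussian augmentations produced by the CGA module. The exact augmentation parameters are defined in the methods section, so the snippet below is only a minimal sketch built from standard torchvision transforms; the jitter ranges, blur kernel, and noise level are placeholder values, not those used in the paper.

```python
import torch
from torchvision import transforms

# Color augmentation: random photometric jitter of the tile (placeholder ranges).
color_aug = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                   saturation=0.4, hue=0.1)

# Gaussian augmentation: blur plus additive zero-mean noise (placeholder parameters).
gaussian_blur = transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))

def gaussian_aug(img: torch.Tensor, noise_std: float = 0.02) -> torch.Tensor:
    """Blur a [0, 1] image tensor and add Gaussian noise."""
    blurred = gaussian_blur(img)
    return (blurred + noise_std * torch.randn_like(blurred)).clamp(0.0, 1.0)

# The three panels of Figure 3: original, color-augmented, Gaussian-augmented.
tile = torch.rand(3, 512, 512)
color_view = color_aug(tile)
gaussian_view = gaussian_aug(tile)
# Photometric-only transforms leave the GT bounding boxes unchanged.
```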
Figure 4. Four samples from the WHU aerial imagery dataset with the corresponding GT bounding boxes. Green boxes denote the building bounding boxes.
Figure 5. Four samples from the CrowdAI building dataset with the corresponding GT bounding boxes. Green boxes denote the building bounding boxes.
Figure 6. Four samples from the TCC building dataset with the corresponding GT bounding boxes. Green boxes denote the building bounding boxes.
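Figures 4–6 overlay the GT bounding boxes in green on example tiles from the three datasets. For readers who wish to reproduce this style of visualization, a small matplotlib sketch is given below; the image path and box coordinates are placeholders.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def draw_gt_boxes(image_path: str, boxes_xywh, color: str = "lime") -> None:
    """Draw ground-truth boxes given as (x, y, width, height) in pixels."""
    img = Image.open(image_path)
    _, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(img)
    for x, y, w, h in boxes_xywh:
        ax.add_patch(patches.Rectangle((x, y), w, h,
                                       linewidth=2, edgecolor=color, fill=False))
    ax.axis("off")
    plt.show()

# Placeholder example (path and boxes are illustrative only):
# draw_gt_boxes("whu_tile_0001.png", [(40, 60, 120, 90), (300, 210, 80, 70)])
```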
Figure 7. Visualization of the detection results on the WHU aerial imagery dataset. (a) Faster R-CNN. (b) SSD. (c) CornerNet. (d) FCOS. (e) Unbiased-Teacher. (f) Listen2Student. (g) Proposed framework. (h) GT. Green boxes show the predictions of each method, and red circles highlight regions where the results differ.
Figure 8. Visualization of the detection results on the CrowdAI building dataset. (a) Faster R-CNN. (b) SSD. (c) CornerNet. (d) FCOS. (e) Unbiased-Teacher. (f) Listen2Student. (g) Proposed framework. (h) GT. Green boxes show the predictions of each method, and red circles highlight regions where the results differ.
Figure 9. Visualization of the detection results on the TCC building dataset. (a) Faster R-CNN. (b) SSD. (c) CornerNet. (d) FCOS. (e) Unbiased-Teacher. (f) Listen2Student. (g) Proposed framework. (h) GT. Green boxes show the predictions of each method, and red circles highlight regions where the results differ.
Figure 10. Quantitative results (AP) on the test sets of the three building datasets. The supervised FCOS framework is trained only on the labeled images.
Figure 11. Three typical samples from the four areas: Tokyo, Shenzhen, Serbia, and New Zealand.
Figure 12. Quantitative results of the supervised FCOS framework trained with only 1% of the labeled WHU RSIs and of the proposed framework trained with the labeled RSIs plus unlabeled RSIs from different regions.
Table 1. Experimental protocols of the training sets of the WHU aerial imagery, CrowdAI, and TCC building datasets.

Dataset | Labeled Images | Labeled Buildings | Unlabeled Images | Unlabeled Buildings
WHU | 47 | 1537 | 4689 | 147,600
CrowdAI | 84 | 728 | 8282 | 71,143
TCC | 60 | 606 | 5925 | 52,309
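Table 1 keeps roughly 1% of each training set as labeled images and treats the remainder as unlabeled. A minimal sketch of deriving such a split from a COCO-style annotation file is shown below; the file name, the per-image 1% ratio, and the random seed are assumptions for illustration rather than the authors' exact protocol.

```python
import json
import random

def split_labeled_unlabeled(ann_file: str, labeled_ratio: float = 0.01, seed: int = 42):
    """Randomly select ~labeled_ratio of the images as the labeled subset."""
    with open(ann_file) as f:
        coco = json.load(f)
    image_ids = [img["id"] for img in coco["images"]]
    random.Random(seed).shuffle(image_ids)
    n_labeled = max(1, int(round(labeled_ratio * len(image_ids))))
    return set(image_ids[:n_labeled]), set(image_ids[n_labeled:])

# Example (hypothetical file name):
# labeled_ids, unlabeled_ids = split_labeled_unlabeled("whu_train.json", 0.01)
```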
Table 2. Evaluation results of all the considered methods on the WHU test dataset. The best results are highlighted in bold. The AP values denote AP@[0.5:0.95].

Method | AP | AP50 | AP75 | APs | APm | APl
Faster R-CNN [14] | 0.331 | 0.671 | 0.283 | 0.206 | 0.483 | 0.001
SSD [15] | 0.314 | 0.637 | 0.283 | 0.182 | 0.479 | 0.024
CornerNet [36] | 0.243 | 0.365 | 0.267 | 0.134 | 0.527 | 0.009
FCOS [37] | 0.211 | 0.581 | 0.085 | 0.147 | 0.307 | 0.011
Unbiased-Teacher [53] | 0.328 | 0.642 | 0.303 | 0.241 | 0.456 | 0.064
Listen2Student [64] | 0.378 | 0.728 | 0.363 | 0.291 | 0.487 | 0.124
Ours | 0.380 | 0.736 | 0.354 | 0.244 | 0.545 | 0.082
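Tables 2–7 report COCO-style metrics [63]: AP is averaged over IoU thresholds from 0.5 to 0.95, AP50 and AP75 are fixed-threshold scores, and APs, APm, and APl are the small, medium, and large object breakdowns. These numbers are typically obtained with pycocotools, as sketched below; the ground-truth and detection file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and detections in COCO JSON format (file names are placeholders).
coco_gt = COCO("building_test_gt.json")
coco_dt = coco_gt.loadRes("building_test_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[0.5:0.95], AP50, AP75, APs, APm, APl, plus recalls
```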
Table 3. Evaluation results of all the considered methods on the CrowdAI test dataset. The best results are highlighted in bold. The AP values denote AP@[0.5:0.95].

Method | AP | AP50 | AP75 | APs | APm | APl
Faster R-CNN [14] | 0.314 | 0.628 | 0.279 | 0.091 | 0.455 | 0.059
SSD [15] | 0.294 | 0.612 | 0.249 | 0.069 | 0.429 | 0.126
CornerNet [36] | 0.208 | 0.380 | 0.208 | 0.043 | 0.401 | 0.025
FCOS [37] | 0.223 | 0.552 | 0.122 | 0.039 | 0.337 | 0.052
Unbiased-Teacher [53] | 0.238 | 0.531 | 0.176 | 0.057 | 0.356 | 0.108
Listen2Student [64] | 0.224 | 0.514 | 0.149 | 0.046 | 0.338 | 0.016
Ours | 0.365 | 0.704 | 0.341 | 0.119 | 0.509 | 0.126
Table 4. Evaluation results of all the considered methods on the TCC test dataset. The best results are highlighted in bold. The AP values denote AP@[0.5:0.95].

Method | AP | AP50 | AP75 | APs | APm | APl
Faster R-CNN [14] | 0.080 | 0.245 | 0.029 | 0.02 | 0.125 | 0.059
SSD [15] | 0.054 | 0.170 | 0.022 | 0.043 | 0.068 | 0.045
CornerNet [36] | 0.055 | 0.123 | 0.043 | 0.006 | 0.099 | 0.049
FCOS [37] | 0.037 | 0.140 | 0.006 | 0.041 | 0.052 | 0.025
Unbiased-Teacher [53] | 0.086 | 0.231 | 0.040 | 0.030 | 0.114 | 0.099
Listen2Student [64] | 0.074 | 0.208 | 0.034 | 0.047 | 0.085 | 0.077
Ours | 0.133 | 0.370 | 0.057 | 0.098 | 0.174 | 0.104
Table 5. Quantitative results of the ablation study on the WHU dataset. The best results are highlighted in bold. The AP values denote AP@[0.5:0.95].

Method | AP | AP50 | AP75 | APs | APm | APl
Basic | 0.211 | 0.581 | 0.085 | 0.147 | 0.307 | 0.011
w/o CL | 0.364 | 0.739 | 0.315 | 0.252 | 0.503 | 0.054
w/o CGA | 0.364 | 0.716 | 0.330 | 0.231 | 0.524 | 0.067
Ours | 0.380 | 0.736 | 0.354 | 0.244 | 0.545 | 0.082
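Table 5 ablates the consistency learning (CL) module, which encourages the detector to make consistent predictions on differently augmented views of the same unlabeled image. Since the exact loss is defined in the methods section, the snippet below is only a generic sketch of a consistency penalty on classification score maps; the L1 form and tensor shapes are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(scores_view1: torch.Tensor,
                     scores_view2: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between per-location classification scores
    predicted for two augmented views of the same unlabeled image."""
    # Both tensors: (N, num_classes, H, W) raw logits from the detector head.
    p1 = scores_view1.sigmoid()
    p2 = scores_view2.sigmoid()
    return F.l1_loss(p1, p2)

# Example with dummy score maps for a single "building" class:
loss = consistency_loss(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```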
Table 6. Quantitative results (AP) of the proposed framework trained with different numbers of labeled building instances. The best results are highlighted in bold. The AP values denote AP@[0.5:0.95].

Method | AP (720 labeled instances, 1%) | AP (1448, 2%) | AP (3600, 5%)
Supervised | 0.225 | 0.321 | 0.445
Ours | 0.318 | 0.382 | 0.454
Table 7. Quantitative results (AP) of the proposed framework trained with varying augmentation. The best results are highlighted in bold. The AP values denote AP@[0.5:0.95].

Augmentation | AP | AP50 | AP75 | APs | APm | APl
Color | 0.353 | 0.709 | 0.311 | 0.238 | 0.502 | 0.057
Gaussian | 0.373 | 0.720 | 0.350 | 0.247 | 0.532 | 0.071
CGA (Ours) | 0.380 | 0.736 | 0.354 | 0.244 | 0.545 | 0.082