1. Introduction
With rapid economic and social development, urbanization is advancing quickly, and Land-Use and Land-Cover Change (LUCC) information is changing accordingly. Timely and accurate information on changes to the Earth’s surface is important for urban development planning [1,2], land management [3], vegetation cover monitoring [4], and other practical uses. In recent years, high-resolution remote sensing imagery has become more accessible and is used to capture land cover change information. Because high-resolution remote sensing images provide richer and clearer spatial feature information, their application in urban change detection has gradually increased [5,6].
Scholars have proposed a variety of remote sensing image change detection methods. Most early change detection methods obtain change results by performing algebraic operations on remote sensing images, such as Change Vector Analysis (CVA) [7]. More recent methods extract spectral, textural, shape, color, and other hand-crafted features from the pre- and post-change remote sensing images to obtain change intensity maps; thresholds are then set, or sample points are manually selected, to obtain the change detection results. Such methods include the Support Vector Machine (SVM) [8], Random Forest (RF) [9], etc.
Traditional change detection methods are complicated to operate and have poor robustness. In recent years, deep learning has performed well in computer vision tasks such as image recognition and semantic segmentation, and some scholars have applied it to change detection. Post-classification change detection is one commonly used approach: a semantic segmentation network segments the bi-temporal images separately, and the difference between the segmentation results is taken as the change area. Semantic segmentation networks such as FCN [10] and U-Net [11] are widely used. Siamese neural networks subsequently became the standard method for change detection [12], and scholars have used various techniques to improve their change detection capability. In [13], pyramid pooling is introduced on top of fully convolutional neural networks. In [14], an encoder–decoder architecture based on UNet++ is proposed. In [15], an SLN model that requires no image pre-processing is proposed. Although CNNs can detect non-linear variations in high-resolution images, they usually predict the class of each pixel independently; pixel-level accuracy may therefore be high while pixel-to-pixel correlations are ignored, leaving the detection results lacking in integrity.
The generative adversarial network (GAN) [16] and its variants have been widely used in computer vision in recent years, for example in image style transfer [17] and image attribute editing [18]. A GAN consists of two parts: a Generator (G) and a Discriminator (D). G’s goal is to generate images that D cannot distinguish from real ones, while D’s goal is to judge the generator’s images as fake. G and D are trained simultaneously in a min–max game against each other, so that the generator’s images gradually approach the true images. When the discriminator judges a generated image to be real, the generator’s data distribution approximates the training data distribution.
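As a toy illustration of this equilibrium (not part of the paper’s model): for a fixed generator, the optimal discriminator of the original GAN is D*(x) = p_data(x)/(p_data(x) + p_g(x)), so once p_g matches p_data the discriminator can do no better than chance:

```python
# Toy illustration: optimal discriminator response for a fixed generator.
# D*(x) = p_data(x) / (p_data(x) + p_g(x)); the densities below are hypothetical.

def optimal_discriminator(p_data: float, p_g: float) -> float:
    """Optimal GAN discriminator output at a point with the given densities."""
    return p_data / (p_data + p_g)

# Early in training: the generator density differs from the data density.
print(optimal_discriminator(0.8, 0.2))  # 0.8 -> generated samples easy to reject

# At convergence: p_g == p_data, so the discriminator outputs 0.5 everywhere.
print(optimal_discriminator(0.5, 0.5))  # 0.5 -> generator matches the data
```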
The wide application of GANs in computer vision has inspired scholars to use them in remote sensing image processing. In 2016, Luc et al. used a GAN for image semantic segmentation, pioneering the application of GANs to this task [19]. Because most available methods for road extraction from high-resolution remote sensing images cannot automatically extract roads with smooth, accurate boundaries, Shi et al. [20] proposed a GAN-based road extraction method. Kim et al. [21] used a conditional generative adversarial network (CGAN) to convert remote sensing images into simple road routing maps and then extracted roads. Niu et al. [22] addressed change detection in heterogeneous remote sensing images by converting heterogeneous images into homogeneous ones using a conditional GAN and performed change detection on the converted results, with good outcomes. We applied GANs to Siamese change detection networks and propose a GAN-enhanced change detection network. The main contributions of this paper are as follows:
(1) We produced a change detection dataset for urban scenes based on high-resolution images of Xi’an, named XI’AN-CDD. It mainly includes changes in three types of land objects: roads, cultivated land, and buildings;
(2) We built a Siamese-structured change detection network based on the center-surround, ASPP, and attention fusion modules, named UNET-CD. We then built a GAN-based change detection network, named UNET-GAN-CD, that uses UNET-CD as its generator. The performance of this network was verified on the XI’AN-CDD and CDD datasets, respectively.
3. Method
The structural composition of the constructed GAN-based remote sensing image change detection network is shown in Figure 5. It mainly consists of two convolutional neural networks: the generator and the discriminator. This model is named UNET-GAN-CD.
In this framework, the generator continuously generates new data distributions from the input data, while the discriminator determines whether the data produced by the generator are real or fake. The two networks confront each other throughout GAN training until the generator and discriminator jointly reach an optimal state. This framework corresponds to a min–max game between the two training networks.
3.1. Generator
The generator is an encoder–decoder Siamese network named UNET-CD, as shown in Figure 6. To avoid the image information loss and resolution reduction caused by pooling in the Siamese network structure, a 3 × 3 convolutional layer with a stride of two replaces each pooling layer.
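As a quick check of this design choice (a sketch assuming a padding of 1, which the text does not state): a 3 × 3 convolution with stride two halves the spatial size exactly as 2 × 2 pooling would, while keeping learnable weights:

```python
def conv_out_size(n: int, k: int = 3, s: int = 2, p: int = 1) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# A 224 x 224 input is halved at each strided-convolution stage.
print(conv_out_size(224))  # 112
print(conv_out_size(112))  # 56
```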
The encoder uses a Siamese structure with shared weights and is designed as a center-surround module, as shown in Figure 6. To obtain multi-scale image features, the central part of the input image is cropped and resampled to the original image size. The bi-temporal images and their corresponding cropped images are fed into the network to extract features. The feature maps obtained from the original images are called the surround features, and the feature maps obtained from the cropped images are called the center features. The surround features and the center features extracted by the encoder are fused in the band dimension, as shown in Figure 6, where m represents the feature fusion; the expression is:
$$\mathrm{feature} = \mathrm{merge}(S_{subi},\, C_{subi})$$

where feature represents the fused feature maps; S_subi and C_subi denote the differential feature maps of the surround features and the center features, respectively; and merge denotes the fusion operation.
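A minimal numpy sketch of this fusion, assuming an absolute difference for the differential maps and channel concatenation for merge (neither operation is spelled out above, and all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bi-temporal feature maps (channels, height, width).
s_t1, s_t2 = rng.random((2, 64, 28, 28))   # surround features, two dates
c_t1, c_t2 = rng.random((2, 64, 28, 28))   # center features, two dates

# Differential feature maps (absolute difference is one common choice).
s_sub = np.abs(s_t1 - s_t2)
c_sub = np.abs(c_t1 - c_t2)

# merge(): fuse along the band (channel) dimension.
feature = np.concatenate([s_sub, c_sub], axis=0)
print(feature.shape)  # (128, 28, 28)
```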
In the up-sampling process, deep features from the decoder and shallow features from the encoder are fused through the attention module. The last layer of the change detection network uses the Softmax activation function to normalize the output neuron values, mapping the changed and unchanged outputs into the range 0–1.
3.1.1. Generator Architecture
The specific network architecture of the generator is shown in Table 1. I, O, and K represent the number of input channels, the number of output channels, and the kernel size, respectively. ReLU is the activation function, and P denotes the pooling operation.
3.1.2. ASPP Module
As shown in Figure 7, the ASPP module uses dilated convolutions with dilation rates of 6, 12, and 18 to enlarge the receptive field of the model and capture multi-scale image features. To obtain more global contextual information, the ASPP module also acquires image-level features through global average pooling (GAP). Finally, the acquired multi-scale features are up-sampled to the appropriate size using bilinear interpolation, and feature fusion is performed.
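To make the multi-scale coverage concrete, the effective spatial extent of a k × k kernel with dilation rate r is k + (k − 1)(r − 1); a quick sketch for the three rates above:

```python
def dilated_kernel_extent(k: int, rate: int) -> int:
    """Effective spatial extent of a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (rate - 1)

# The ASPP dilation rates used above, applied to a 3 x 3 kernel.
for rate in (6, 12, 18):
    print(rate, dilated_kernel_extent(3, rate))  # 13, 25, 37
```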
3.1.3. Center-Surround Architecture
In Figure 8, assuming the original image has a size of 224 × 224 pixels, the cropped center sub-image has a size of 112 × 112 pixels and is up-sampled to 224 × 224 pixels. The original image is then input to the surround module, and the up-sampled center sub-image is input to the central module. Multi-resolution image information is generated via this process.
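The cropping and resampling step can be sketched as follows (nearest-neighbour up-sampling is used here for brevity; the actual resampling method is an assumption):

```python
import numpy as np

# Stand-in single-channel image of the stated 224 x 224 size.
image = np.arange(224 * 224, dtype=np.float32).reshape(224, 224)

# Crop the central 112 x 112 sub-image.
h, w = image.shape
center = image[h // 4: 3 * h // 4, w // 4: 3 * w // 4]

# Nearest-neighbour up-sampling back to 224 x 224.
upsampled = center.repeat(2, axis=0).repeat(2, axis=1)

print(center.shape, upsampled.shape)  # (112, 112) (224, 224)
```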
3.1.4. Attention Fusion Module
In Figure 9, F_low denotes the shallow features of the network at the encoder; F_high denotes the deep features at the decoder; and F denotes the fused feature image.
Shallow and deep features are each passed through 1 × 1 convolution kernels, which integrate information across channels; the integrated deep and shallow feature maps are then summed pixel by pixel to obtain per-pixel weights. The sigmoid activation function maps these weight values into the range 0–1. Finally, the resulting weight map is multiplied element-wise with the integrated deep features, rescaling all values of the deep feature map to obtain the final result.
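A minimal numpy sketch of this attention fusion, with the 1 × 1 convolutions written as channel-mixing matrices (all weights and shapes are illustrative, not taken from the network):

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 32, 14, 14

f_low = rng.standard_normal((C, H, W))    # shallow encoder features
f_high = rng.standard_normal((C, H, W))   # deep decoder features

# 1 x 1 convolutions: per-pixel channel-mixing linear maps.
w_low = rng.standard_normal((C, C)) / C
w_high = rng.standard_normal((C, C)) / C
low_mixed = np.einsum('oc,chw->ohw', w_low, f_low)
high_mixed = np.einsum('oc,chw->ohw', w_high, f_high)

# Pixel-wise sum, then sigmoid to obtain per-pixel weights in (0, 1).
weights = 1.0 / (1.0 + np.exp(-(low_mixed + high_mixed)))

# Element-wise product rescales the integrated deep features.
fused = weights * high_mixed
print(fused.shape)  # (32, 14, 14)
```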
3.2. Discriminator
To reduce the number of network parameters and shorten training and prediction time, the discriminator in this paper uses the fully convolutional network FCN16s. The last layer of the discriminator uses the Softmax activation function, while the remaining layers use the Leaky ReLU activation function.
As shown in Figure 10, the generator’s output is fused with the input image and fed to the discriminator network for judgment; the discriminator’s output is restored only to 1/4 of the original input size. Therefore, the discriminator’s input size is 224 × 224, and its output size is 56 × 56.
Discriminator Architecture
The network architecture of the discriminator is shown in Table 2. I, O, and K represent the number of input channels, the number of output channels, and the kernel size, respectively. LeakyReLU is the activation function, and BN denotes Batch Normalization.
3.3. Loss Function
The loss function of UNET-GAN-CD consists of two parts: the Wasserstein distance and the categorical cross-entropy. The Wasserstein distance was proposed in the WGAN network; we use it instead of the JS divergence in the original GAN model. It is defined as:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$

where Π(P_r, P_g) denotes the set of all joint distributions whose marginals are P_r and P_g. For each possible joint distribution γ, one can sample (x, y) ~ γ to obtain a real sample x and a generated sample y and compute the expected value of the sample-pair distance ‖x − y‖ under γ.
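In one dimension the infimum over joint distributions has a closed form: optimally coupling two equal-sized sample sets simply pairs them in sorted order. A small illustrative sketch (not part of the model):

```python
import numpy as np

def w1_empirical(x, y):
    """W1 distance between two equal-sized 1-D empirical distributions.

    In 1-D the optimal coupling pairs sorted samples, so the infimum
    over joint distributions reduces to a sort-and-average.
    """
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

print(w1_empirical([0.0, 1.0], [1.0, 2.0]))  # 1.0 -> distributions shifted by 1
print(w1_empirical([0.0, 1.0], [0.0, 1.0]))  # 0.0 -> identical distributions
```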
The results obtained from the generator and the real change labels are used to calculate the categorical cross-entropy for the binary change detection classification; the cross-entropy is calculated as:

$$L_{ce} = -\sum_i \left[ y_i \log f(x_i) + (1 - y_i) \log\left(1 - f(x_i)\right) \right]$$

where y_i is the label category information (changed pixels are labeled 1 and unchanged pixels 0), and f(x_i) is the change probability of pixel i obtained by the generator.
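A minimal numpy sketch of this cross-entropy for binary change labels (the reduction to a mean is an assumption; the labels and probabilities below are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, f):
    """Mean binary cross-entropy between labels y (0/1) and probabilities f."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return float(-np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f)))

y = np.array([1, 0, 1, 0])          # changed / unchanged pixel labels
f = np.array([0.9, 0.1, 0.8, 0.2])  # generator change probabilities
print(binary_cross_entropy(y, f))   # small loss: predictions match labels
```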
The loss function of the generator can be expressed as:
The loss function of the discriminator can be expressed as:
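The two formulas are not reproduced in this text. Under the standard WGAN objective combined with the cross-entropy term above, they would take a form like the following; the weighting coefficient λ and the exact signs are assumptions, not taken from the source:

```latex
% Discriminator: widen the critic gap between real and generated samples
L_D = \mathbb{E}_{\hat{x} \sim P_g}\left[ D(\hat{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]

% Generator: fool the critic while matching the real change labels
L_G = -\mathbb{E}_{\hat{x} \sim P_g}\left[ D(\hat{x}) \right]
    + \lambda \, L_{ce}
```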
5. Conclusions
In this paper, we proposed a GAN-based change detection method for urban remote sensing images. The model can detect changes in roads, buildings, and cultivated land occurring in urban scenes, and its effectiveness was demonstrated via quantitative and qualitative evaluations. To verify the contribution of the GAN to the change detection model, we conducted an ablation experiment on XI’AN-CDD. Specifically, adding the GAN improved the overall accuracy, Kappa coefficient, and F1-score by 0.76%, 3.4%, and 4.4%, respectively. It also further improved the completeness of the detection results and reduced missed detections. On the public change detection dataset (CDD), UNET-GAN-CD achieved the best F1-score and the best balance between precision and recall.
During model training and testing, only the RGB bands were used; future research can explore multi-band remote sensing image processing. Most existing change detection datasets are binary, so multi-category change detection is another direction of our future research. In addition, training a high-performing change detection model with only a small sample of change-category training data is a further research focus.