1. Introduction
In the field of remote sensing (RS) data analysis, change detection (CD) has become one of the primary applications of Earth observation, playing a crucial role in urban change analysis [1,2], environmental monitoring [3], agricultural surveillance [4], disaster detection [5,6], and many other domains. CD uses RS images of the same geographical area acquired at different epochs, together with relevant geospatial data, taking into account the characteristics of the observed objects and the RS imaging mechanisms. By applying image and signal processing theory and mathematical modeling, CD contrasts the images to distinguish changed from unchanged regions [7,8]. In practice, changed and unchanged areas are represented with binary labels [9].
Due to the development of satellite sensor hardware and imaging systems, very high resolution (VHR) RS images have become the primary data source for CD. However, conducting CD on VHR RS images poses great challenges due to the limited spectral information [
10], geometric distortions, information loss, and significant differences in the sizes of various objects. Additionally, the challenges include variations in imaging conditions between different epochs of RS images (e.g., seasonal differences, illumination differences, weather differences) so that the same objects may exhibit different spectral characteristics at different times [
11], leading to potential errors in the detection results. Furthermore, irrelevant changes (e.g., vegetation growth and human activities) can reduce the accuracy of CD. In addition, the number of pixels in unchanged regions far exceeds that in changed regions, which can cause a sample imbalance problem. Therefore, effectively extracting high-level semantic features of objects with complex textures and learning the change information in dual-temporal images are crucial for the CD task.
Traditional CD methods fall into three categories: algebraic operations [
12], image transformations [
13], and post-classification [
14]. Algebraic-based methods directly compare pixel values between dual-temporal images via an algebraic operation (e.g., differencing [
15], quantization [
16], or regression [
17,
18]), followed by a thresholding operation to determine the change areas. However, due to the difficulty in choosing appropriate thresholds, algebraic-based methods are ineffective in recognizing complex change information. Conversely, image transformation-based methods mitigate irrelevant information in two-time point images through data transformation, thereby accentuating the disparities between the images to attain improved CD outcomes. Deng et al. [
19] applied Principal Component Analysis (PCA) to multi-temporal data from Landsat and SPOT5 images. By highlighting changes in the images, PCA was utilized for subsequent supervised change classification, facilitating the identification of areas undergoing alterations. However, PCA transformation is scene-dependent, and it may be challenging to interpret its principles [
20]. Post-classification methods [
21] identify changes by comparing multiple classification maps based on pre-learned semantic categories. Machine learning methods, such as Support Vector Machines [22] and decision trees [23], as well as GIS-based methods [24], have been employed in CD tasks. Traditional CD methods are typically applied to low- to medium-resolution images; when applied to VHR images, they are often limited in identifying the changes [
25]. Additionally, traditional CD methods struggle to model contextual information [
26], leading to difficulties in extracting complex change information.
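As a concrete illustration of the algebraic category, the sketch below implements image differencing followed by a simple global threshold. The mean + k·std rule is a hypothetical stand-in for principled threshold selection (e.g., Otsu's method), used here only to keep the example self-contained:

```python
import numpy as np

def difference_cd(img_t1, img_t2, k=2.0):
    """Algebraic change detection: absolute difference + global threshold.

    The mean + k*std threshold is an illustrative choice; practical
    methods typically select the threshold with Otsu or EM-based schemes.
    """
    diff = np.abs(img_t2.astype(np.float64) - img_t1.astype(np.float64))
    if diff.ndim == 3:  # collapse spectral bands into one change magnitude
        diff = np.linalg.norm(diff, axis=-1)
    thr = diff.mean() + k * diff.std()
    return (diff > thr).astype(np.uint8)  # 1 = changed, 0 = unchanged
```

As the text notes, such global thresholds are hard to tune: a single scalar cannot separate complex change patterns from imaging-condition differences.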
Compared to traditional methods, deep learning (DL) has the capability of learning hierarchical feature representations from data samples, demonstrating powerful representation learning capability. Additionally, DL methods exhibit nonlinear mapping and end-to-end learning capabilities. Deep learning-based approaches play an increasingly significant role in RS image analysis tasks [
27,
28,
29]. These deep networks include Deep Belief Networks (DBNs), Stacked Autoencoders (SAEs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). Among these, the weight-sharing mechanism and local connectivity of CNNs enable the extraction of fine abstract features [
30], making CNNs widely utilized for feature extraction in CD tasks. Numerous CD models based on CNNs have been proposed, including classical CNNs [
31] and their extended architectures [
32,
33,
34]. Zhan et al. [
35] utilized a shared-weight Siamese convolutional network to extract features from the dual-temporal images simultaneously. However, due to the simplicity of the network, its capability to identify changes is limited. In contrast to the basic Siamese network, Daudt et al. [
36] designed three Fully Convolutional Neural Network (FCNN)-based CD architectures, including Fully Convolutional Early Fusion (FC-EF), Fully Convolutional Siamese Concatenation (FC-Siam-conc), and Fully Convolutional Siamese Difference (FC-Siam-diff). These methods employ the classical U-shaped network with skip connections in the encoding and decoding stages to reduce the loss of detailed information, thereby enhancing CD accuracy to some extent. Chen et al. [
37] enhanced the feature extraction capability by deepening the network structure. However, with the increasing depth of the network, the performance improvement becomes less significant. To more effectively extract feature information and improve CD accuracy, new methods have been applied to CD tasks. For example, dilated convolution methods have been introduced into the CD field [
38,
39]. Dilated convolution enhances the network’s ability to capture global information by increasing the receptive field, but it lacks information interaction among multiple feature layers. To fully exploit the potential of multi-level features, SNUNet [
40], based on the dense connection structure of UNet++ [
41], stacks convolutional layers at different levels to achieve multi-scale and multi-level feature interaction. Attention mechanisms present another means for CD; for example, channel attention and spatial attention mechanisms were adopted to improve detection results by reducing redundant information [
42,
43]. While these mechanisms focus on the importance of different pixels in each channel, they still lack the establishment of long-term context associations in the temporal–spatial dimensions, with limited ability to extract global information. Self-attention mechanisms [
44] present a new approach to capturing distant dependencies, offering a new solution to enhance the network’s global information representation capability. Chen and Shi [
45] used self-attention modules on multi-scale features to capture long-range dependencies at different scales, but these modules suffer from low computational efficiency.
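The self-attention operation underlying these modules can be sketched in a few lines. This single-head version takes assumed projection matrices `wq`, `wk`, `wv` as inputs; real implementations use multiple learned heads and additional normalization:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a token sequence x of shape (N, d).

    Every output token is a weighted mix of all value tokens, which is
    what gives the mechanism its long-range (global) dependency modeling.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])      # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
    return attn @ v
```

The quadratic (N × N) score matrix is also where the computational-efficiency concern mentioned above comes from: cost grows with the square of the number of tokens.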
Compared to CNNs, the transformer [
46] introduces self-attention mechanisms, considering information from all positions in the sequence. It models distant dependencies in sequence data through a combination of multi-head attention and feedforward neural networks. The Vision Transformer (ViT) [
47] directly divides images into patch sequences and feeds them into a pure transformer for image classification tasks [
48,
49]. In recent years, the transformer has also been introduced into the field of CD. SwinSUNet [
50], based on the Swin Transformer [51], constructs a pure transformer Siamese network for CD. SwinSUNet employs a shifted-window mechanism that allows cross-window connections for extracting global information, but it incurs significant computational complexity. In contrast to pure transformer networks, the Bitemporal Image Transformer (BIT) proposed by Chen et al. [
52] uses a convolutional network to extract dual-temporal image features and then applies transformer modules to extract high-level semantic information, enhancing the representation of global information. However, it lacks interaction between the CNN and transformer networks and does not explicitly focus on change information. ChangeFormer [
53] constructs a dual-branch Siamese multi-layer transformer encoder, combined with a lightweight MLP decoder, to accomplish CD tasks. However, this approach still lacks interaction between different networks. To further enhance the interaction between transformer and CNN networks, ICIF-Net [
54] adopts a parallel connection of a CNN and a transformer to extract semantic features from dual-temporal images. It obtains features from both CNN and transformer networks and uses cross-attention to fuse the two types of features within the scale, balancing local and global information. However, this also consumes significant resources. To reduce the number of parameters, a modified cross-attention hierarchical network (ACAHNet) [
55] was proposed, integrating transformer layers into a CNN-based main framework rather than stacking a full transformer, thereby avoiding a rapid growth in model parameters and computational complexity. To better leverage multi-scale feature information, AMTNet [
56] adopts a CNN–transformer architecture in a Siamese network, utilizing a convolutional network as the backbone to extract multi-scale features from original input images. The features processed by the transformer network are further employed to model context information in multi-scale features of dual-temporal images. Although these methods consider combining CNN and transformer networks, they do not adequately address deep interaction between CNN and transformer networks and the limitation of fixed-size patches of the transformer.
In summary, CNN networks possess powerful feature extraction capabilities, while transformers have the advantage of global attention that focuses on global features. Therefore, considering that the combination of CNN and transformer can obtain both global and local attention, the CNN–transformer architecture has great potential in CD tasks. Existing methods mostly adopt combinations of CNN and transformers in a serial or parallel manner, lacking deep interaction between the CNN and the transformer during the feature extraction process. Furthermore, the transformer network, which utilizes fixed-size patches at each stage, lacks the capability of perceiving features from multiple receptive fields, which is disadvantageous for CD tasks involving objects of various sizes. Therefore, we propose a Siamese network with multiple receptive field branches based on transformers and Convolutional Neural Networks (CNNs) to extract features from dual-temporal images. To capture feature representations with multiple receptive fields in the patch merging module, we introduce additional branches with 5 × 5 and 7 × 7 convolutional layers in addition to the existing 3 × 3 convolutional patch embedding module at each feature extraction stage. This approach creates multi-scale patch embedding feature maps through multiple paths at each patch merging stage, forming multi-receptive field representations while reducing the loss of edge information. To retain detailed information, a feature extraction branch based on convolution is added for the 3 × 3 convolution layer path. We designed a multi-receptive field feature aggregation module to aggregate multi-receptive field features obtained from multiple paths. For the acquired dual-temporal features, a spatiotemporal discrepancy perception module is employed to obtain dual-temporal difference features. A transformer-based multi-level feature interaction module is utilized to further interact information between multi-level features.
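The multi-receptive-field patch merging described above can be sketched as follows. This is a simplified illustration only: the class name, channel sizes, and the 1 × 1 fusion convolution (standing in for the aggregation module) are our assumptions, not the exact published design.

```python
import torch
import torch.nn as nn

class MultiRFPatchEmbed(nn.Module):
    """Sketch of a multi-receptive-field patch merging stage: parallel
    3x3 / 5x5 / 7x7 stride-2 convolutions produce same-size patch
    embeddings, which a 1x1 convolution then aggregates (a simple
    stand-in for a learned aggregation module)."""

    def __init__(self, in_ch, embed_dim):
        super().__init__()
        # padding = k // 2 keeps all three branches spatially aligned
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim, k, stride=2, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * embed_dim, embed_dim, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]       # identical spatial sizes
        return self.fuse(torch.cat(feats, dim=1))   # aggregate the branches
```

Because each branch halves the spatial resolution with a different kernel size, the fused embedding mixes fine local context (3 × 3) with wider context (5 × 5, 7 × 7) at every stage.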
This article primarily contributes in the following ways:
(1) We propose a U-shaped CD network based on the CNN–transformer architecture, namely the Global-Local Collaborative Learning Network (GLCL-Net). At each feature extraction stage, feature maps with different receptive fields are generated by different branches, and these branch features interact with one another, giving the network global-to-local feature extraction capabilities and enhancing its ability to perceive objects of different sizes.
(2) We introduce a multi-branch feature aggregation module (MBFA) to integrate features from multiple branches at the feature extraction stage and a multi-level feature interaction module (MLFI) for interaction among features at different levels, enhancing the network’s representational capacity for ground features.
(3) For extracting change information, we propose a spatiotemporal discrepancy perception module (SDPM) designed for CD tasks. This module effectively captures the changing portions of the dual-temporal features.
The structure of this article is as follows:
Section 2 provides a detailed description of the proposed method.
Section 3 introduces the experimental setup and presents the results.
Section 4 discusses the factors influencing these results. Finally,
Section 5 outlines the conclusions.
3. Experiments and Results
This section begins by describing the datasets and comparative methods used. Subsequently, we explain the evaluation metrics employed to assess the network’s performance and detail the experimental setup. Finally, we present the results of the comparative experiments.
3.1. Datasets
We conducted experiments on three publicly available CD datasets to evaluate the effectiveness of our GLCL-Net: the Global Very-High-Resolution Landslide Mapping (GVLM) dataset [
62], the Visual and RS-Change Detection (LEVIR-CD) dataset [
45], and Sun Yat-sen University Dataset (SYSU-CD) dataset [
63]. Details of these three datasets are summarized in
Table 1.
(1) Visual and RS-Change Detection (LEVIR-CD): The LEVIR-CD dataset consists of 637 pairs of high-resolution images obtained from Google Earth, with a pixel size of 1024 × 1024. Various types of buildings, such as high-rise apartments, villas, small garages, and large warehouses, are included, covering different change types induced by seasonal and lighting variations. Due to GPU memory constraints, we cropped the 637 image pairs into non-overlapping samples of 256 × 256 pixels, resulting in 7120/1024/2048 pairs for training/validation/testing, respectively.
(2) Sun Yat-sen University Dataset (SYSU-CD): This dataset contains 20,000 pairs of 0.5 m aerial images taken in Hong Kong between 2007 and 2014, with a size of 256 × 256 pixels. The main change types in the dataset include new urban construction, suburban expansion, pre-construction foundation work, vegetation change, road expansion, and marine construction. Of the 20,000 image pairs, 8000 were evenly divided into validation and test sets, with the remainder used for training.
(3) Global Very-High-Resolution Landslide Mapping (GVLM): This dataset comprises 17 pairs of VHR images obtained through Google Earth services, covering extensive landslide sites across six continents: Asia, Africa, North America, South America, Europe, and Oceania. The images have a spatial resolution of 0.59 m and depict various landslide sites characterized by distinct geographic locations, sizes, shapes, occurrence times, spatial distributions, phenological states, and land cover types. Each pair of images has been randomly cropped into 256 × 256 image patches with a 40% overlap. Subsequently, a total of 13,529/3866/1932 pairs were selected for training/validation/testing across the entire target domain.
3.2. Baseline
To demonstrate the effectiveness of our GLCL-Net on the dual-temporal RS image CD task, we compared it with several state-of-the-art methods, as follows:
FC-EF [
36]: This method is based on a pure CNN CD model. It features a U-shaped architecture, taking the concatenation of the bi-temporal image pairs as input and employing an early fusion strategy.
FC-Siam-conc [
36]: This method employs a Siamese encoder with shared weights to extract multi-level features from the bi-temporal images. These multi-level features are then concatenated and processed by a fully convolutional decoder to extract change information.
FC-Siam-diff [
36]: This method is a variant of FC-EF. It utilizes a Siamese network to extract multi-level features from bi-temporal images and adopts a feature-differencing approach to extract change information.
DTCDSCN [
43]: This method is a dual-task CD network. It utilizes both channel and spatial attention mechanisms to reduce redundant information. We only compare the outputs of the network for the change detection task.
SNUNet [
40]: This method achieves interaction among multi-level and multi-scale features through dense connections, reducing information loss. It models contextual information using channel attention.
BiT [
52]: This method employs a transformer module to encode the features extracted by a CNN backbone and model their contextual relationships; change information is then extracted through feature differencing.
ChangeFormer [
53]: The method employs a hierarchically structured transformer encoder and a lightweight MLP decoder to accomplish CD tasks.
AMT [
56]: The method is based on a CNN–transformer architecture of Siamese networks, extracting multi-scale features from the original input image pairs. Attention and transformation modules are applied to model the dual-temporal images.
HANet [
64]: This method proposes a progressive foreground-balanced sampling strategy and designs a hierarchical attention network (HANet), which is a discriminative Siamese network capable of integrating multiscale features and refining detailed features.
S²CD [65]: This method obtains initial spatial and channel differences by performing summation and subtraction operations on bi-temporal images and uses transformers to extract meaningful differences in spatial and channel patterns, thereby capturing subtle differences in both the spatial and channel aspects of bi-temporal images.
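Several of the baselines above rely on the Siamese feature-differencing pattern. The sketch below illustrates it in the spirit of FC-Siam-diff; the toy encoder and classification head are placeholders, not the published architecture:

```python
import torch
import torch.nn as nn

class SiamDiff(nn.Module):
    """Minimal Siamese feature-differencing sketch: one weight-shared
    encoder embeds both epochs, and the absolute feature difference
    feeds a per-pixel change classifier."""

    def __init__(self, in_ch=3, feat=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(feat, 2, 1)  # 2 classes: unchanged / changed

    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)  # shared weights
        return self.head(torch.abs(f1 - f2))         # difference features
```

Because the encoder weights are shared, identical inputs yield a zero difference map, so the change logits depend only on where the two epochs genuinely diverge in feature space.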
3.3. Evaluation Metrics
We adopted four mainstream metrics in the CD domain to measure the discrepancy between the predicted results and the ground truth: Precision (Pre.), Recall (Rec.), F1 score, and Intersection over Union (IoU). The calculation for each metric is as follows:
Pre. = TP / (TP + FP)
Rec. = TP / (TP + FN)
F1 = 2 × Pre. × Rec. / (Pre. + Rec.)
IoU = TP / (TP + FP + FN)
where TP (True Positive) represents correctly predicted changed pixels, FP (False Positive) represents unchanged pixels incorrectly predicted as changed, and FN (False Negative) represents changed pixels incorrectly predicted as unchanged.
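These metrics are straightforward to compute from binary change maps; a short reference implementation (a hypothetical helper, not code from the paper):

```python
import numpy as np

def cd_metrics(pred, gt):
    """Compute Pre., Rec., F1, and IoU from binary change maps (1 = changed)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # changed, predicted changed
    fp = np.logical_and(pred, ~gt).sum()   # unchanged, predicted changed
    fn = np.logical_and(~pred, gt).sum()   # changed, predicted unchanged
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"Pre.": pre, "Rec.": rec, "F1": f1, "IoU": iou}
```

Note that none of the four metrics uses TN, which makes them robust to the class imbalance discussed in the introduction, where unchanged pixels dominate.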
3.4. Training Details
We trained the proposed model using the cross-entropy (CE) loss. The network was implemented under the PyTorch framework, and all experiments were run on an NVIDIA GeForce RTX 3090 GPU (24 GB). Considering the limitations of GPU memory, we set the training batch size to 8. The maximum number of training epochs was set to 200. The initial learning rate was 0.0001, using the Adam optimizer with linear learning rate decay. We applied common data augmentation to the input image patches, including flipping, rescaling, cropping, and Gaussian blurring. Throughout training, the model that performed best on the validation set was used for testing.
3.5. Performance Comparison
To evaluate the effectiveness of the proposed GLCL-Net model, our network was trained on the training sets of three dual-temporal RS image change detection datasets and tested on the respective test sets.
Table 2,
Table 3 and
Table 4 report the overall performance metrics on the LEVIR-CD, SYSU-CD, and GVLM-CD test samples. The quantitative results indicate that our GLCL-Net consistently shows clear overall advantages across the four metrics. For instance, on the LEVIR-CD/SYSU-CD/GVLM-CD datasets, the F1 score of our GLCL-Net exceeds that of the best-performing comparison model by 0.7%/0.91%/0.17%, respectively.
Observations from
Table 2,
Table 3 and
Table 4 reveal that our GLCL-Net achieves the best scores across all four metrics on the three datasets. This substantiates the effectiveness of introducing a multi-branch feature extraction network based on CNNs and transformers for CD tasks. Additionally, the visual results of each model on the various datasets are presented in
Figure 6,
Figure 7 and
Figure 8. It is noteworthy that, for the purpose of distinguishing the correctness of different region detections, we have employed different colors, including
TP (white), TN (black),
FP (red), and
FN (green).
LEVIR-CD Visualization: As shown in
Figure 6, we selected representative and challenging samples for visual comparison, such as small buildings in
Figure 6a,e, densely changing buildings in
Figure 6b, a larger scene change in
Figure 6c, and intense seasonal and lighting changes in
Figure 6d. In
Figure 6, it can be observed that our GLCL-Net performs well compared to other methods. In the detection of small building changes in
Figure 6a,e, our GLCL-Net is more sensitive to small building changes. In the detection of densely changing buildings in
Figure 6b, our GLCL-Net is better at identifying subtle boundaries. In
Figure 6c, our GLCL-Net has fewer false detections. In
Figure 6d, our network can avoid interference from complex backgrounds. Therefore, relative to other comparative methods, our GLCL-Net achieves the best visual results on the LEVIR-CD dataset.
Visualization on SYSU-CD: As shown in
Figure 7, we selected representative and challenging samples from the SYSU-CD dataset for visual performance comparison, including large buildings with different shooting angles in
Figure 7a,b, severe interference due to seasonal variations in
Figure 7c, and challenges posed by complex roads in
Figure 7d,e. In
Figure 7a,b, our GLCL-Net demonstrates more accurate building recognition compared to other competitors in terms of visual presentation. Under severe interference caused by seasonal variations in
Figure 7c, only our GLCL-Net maintains a high recognition rate and low false-positive rate. Furthermore, under the challenges of complex road situations in
Figure 7d,e, our GLCL-Net maintains a high recognition rate.
GVLM-CD Visualization: As shown in
Figure 8, we selected representative and challenging samples from the GVLM-CD dataset for visual performance comparison. This includes areas with complex boundaries in
Figure 8a,d, large changed areas in
Figure 8b,c, and changes in complex backgrounds in
Figure 8e. In
Figure 8a,d, the presence of complex boundaries may lead to false detections in the changed areas. Our GLCL-Net demonstrates superior visual performance compared to other comparative methods. In
Figure 8b,c, the recognition accuracy of other comparative methods is significantly affected by false changes, while our network reduces the impact of false changes. In
Figure 8e, other methods exhibit noticeable false detections, while our GLCL-Net shows the best visual performance.