Article

Global-Local Collaborative Learning Network for Optical Remote Sensing Image Change Detection

Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(13), 2341; https://doi.org/10.3390/rs16132341
Submission received: 31 March 2024 / Revised: 23 June 2024 / Accepted: 24 June 2024 / Published: 27 June 2024

Abstract
Due to the widespread applications of change detection technology in urban change analysis, environmental monitoring, agricultural surveillance, disaster detection, and other domains, change detection has become one of the primary applications of Earth orbit satellite remote sensing data. However, dual-temporal change detection (CD) in high-resolution optical remote sensing images remains challenging due to the complexity of such images, including intricate textures, seasonal and climatic differences between acquisition times, and significant differences in the sizes of various objects. In this paper, we propose a novel U-shaped architecture for change detection. In the encoding stage, a multi-branch feature extraction module combining CNN and transformer networks is employed to enhance the network's perception of objects of varying sizes. Furthermore, a multi-branch aggregation module is utilized to aggregate features from different branches, providing the network with global attention while preserving detailed information. For dual-temporal features, we introduce a spatiotemporal discrepancy perception module to model the context of dual-temporal images. In particular, channel attention and token attention modules built on the transformer attention mechanism facilitate information interaction between multi-level features, thereby enhancing the network's contextual awareness. The effectiveness of the proposed network is validated on three public datasets, where qualitative and quantitative experiments demonstrate its superior performance over other state-of-the-art methods.

1. Introduction

In the field of remote sensing (RS) data analysis, change detection (CD) has become one of the primary applications of Earth orbit observations, which plays a crucial role in urban change analysis [1,2], environmental monitoring [3], agricultural surveillance [4], disaster detection [5,6], and many other domains. The purpose of CD is to utilize RS images taken at different epochs of the same geographical area, along with relevant geospatial data, in consideration of the characteristics of corresponding objects and RS imaging mechanisms. This is achieved through the application of image and signal processing theories, as well as mathematical modeling approaches, to contrast and distinguish changed and unchanged regions in the RS images at different times [7,8]. The specific approach involves using binary labels to represent changed and unchanged areas [9].
Due to the development of satellite sensor hardware and imaging systems, very high resolution (VHR) RS images have become the primary data source for CD. However, conducting CD on VHR RS images poses great challenges due to the limited spectral information [10], geometric distortions, information loss, and significant differences in the sizes of various objects. Additionally, the challenges include variations in imaging conditions between different epochs of RS images (e.g., seasonal differences, illumination differences, weather differences) so that the same objects may exhibit different spectral characteristics at different times [11], leading to potential errors in detection results. Furthermore, unrelated changes (e.g., vegetation growth and human activities) can affect the accuracy of CD, and the imbalance in pixel quantities between unchanged and changed areas, where the number of pixels in unchanged regions far exceeds that in changed regions, may lead to a sample imbalance problem. Therefore, effectively extracting high-level semantic features of objects with complex textures and learning the change information in dual-time images are crucial issues for the CD task.
Traditional CD methods can be classified into three categories, including algebraic operations [12], image transformations [13], and post-classification [14]. Algebraic-based methods directly compare pixel values between dual-temporal images via an algebraic operation (e.g., differencing [15], quantization [16], or regression [17,18]), followed by a thresholding operation to determine the change areas. However, due to the difficulty in choosing appropriate thresholds, algebraic-based methods are ineffective in recognizing complex change information. Conversely, image transformation-based methods mitigate irrelevant information in two-time point images through data transformation, thereby accentuating the disparities between the images to attain improved CD outcomes. Deng et al. [19] applied Principal Component Analysis (PCA) to multi-temporal data from Landsat and SPOT5 images. By highlighting changes in the images, PCA was utilized for subsequent supervised change classification, facilitating the identification of areas undergoing alterations. However, PCA transformation is scene-dependent, and it may be challenging to interpret its principles [20]. Post-classification methods [21] identify changes by comparing multiple classification maps based on the pre-learned semantic categories. Support Vector Machines [22], decision trees [23], machine learning methods, and GIS-based methods [24] have been employed in CD tasks. Traditional CD methods are typically used for low- to medium-resolution images. However, when encountering VHR images, these methods often face limitations in identifying the changes [25]. Additionally, traditional CD methods struggle to model contextual information [26], leading to difficulties in extracting complex change information.
Compared to traditional methods, deep learning (DL) has the capability of learning hierarchical feature representations from data samples, demonstrating powerful representation learning capability. Additionally, DL methods exhibit nonlinear mapping and end-to-end learning capabilities. Deep learning-based approaches play an increasingly significant role in RS image analysis tasks [27,28,29]. These deep networks include Deep Belief Networks (DBNs), Stacked Autoencoders (SAEs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). Among these, the weight-sharing mechanism and local connectivity of CNNs enable the extraction of fine abstract features [30], making CNNs widely utilized for feature extraction in CD tasks. Numerous CD models based on CNNs have been proposed, including classical CNNs [31] and their extended architectures [32,33,34]. Zhan et al. [35] utilize a shared-weight Siamese convolutional network to extract features from dual-temporal images simultaneously. However, due to the simplicity of the network, its capability of identifying changes is limited. In contrast to the basic Siamese network, Daudt et al. [36] designed three Fully Convolutional Neural Network (FCNN)-based CD architectures, including Fully Convolutional Early Fusion (FC-EF), Fully Convolutional Siamese Concatenation (FC-Siam-conc), and Fully Convolutional Siamese Difference (FC-Siam-diff). These methods employ the classical U-shaped network with skip connections in the encoding and decoding stages to reduce the loss of detailed information, thereby enhancing CD accuracy to some extent. Chen et al. [37] enhanced the feature extraction capability by deepening the network structure. However, with the increasing depth of the network, the performance improvement becomes less significant. To more effectively extract feature information and improve CD accuracy, new methods have been applied to CD tasks. For example, dilated convolution methods have been introduced into the CD field [38,39]. Dilated convolution enhances the network’s ability to capture global information by increasing the receptive field, but it lacks information interaction among multiple feature layers. To fully exploit the potential of multi-level features, SNUNet [40], based on the dense connection structure of UNet++ [41], stacks convolutional layers at different levels to achieve multi-scale and multi-level feature interaction. Attention mechanisms present another means for CD; for example, channel attention and spatial attention mechanisms were adopted to improve detection results by reducing redundant information [42,43]. While these mechanisms focus on the importance of different pixels in each channel, they still lack the establishment of long-term context associations in the temporal–spatial dimensions, with limited ability to extract global information. Self-attention mechanisms [44] present a new approach to capturing distant dependencies, offering a new solution to enhance the network’s global information representation capability. Chen and Shi [45] use self-attention modules to handle multi-scale features for capturing distant dependencies of features at different scales, but these suffer from low computational efficiency.
Compared to CNNs, the transformer [46] introduces self-attention mechanisms, considering information from all positions in the sequence. It models distant dependencies in sequence data through a combination of multi-head attention and feedforward neural networks. The Vision Transformer (ViT) [47] directly divides images into patch sequences and feeds them into a pure transformer for image classification tasks [48,49]. In recent years, the transformer has also been introduced into the field of CD. SwinSUNet [50], based on SwinTransformer [51], constructs a pure transformer Siamese network for CD. SwinSUNet employs a shifted window mechanism that allows cross-window connections for extracting global information, but it suffers from significant computational complexity. Compared to pure transformer networks, the Bitemporal Image Transformer (BIT) proposed by Chen et al. [52] uses convolutional networks to extract dual-temporal features of images and then utilizes transformer modules to further extract high-level semantic information for enhancing the representation of global information. However, it lacks interaction between the CNN and transformer networks and fails to address the focus on change information. ChangeFormer [53] constructs a dual-branch Siamese multi-layer transformer encoder, combined with a lightweight MLP decoder, to accomplish CD tasks. However, this approach still lacks interaction between different networks. To further enhance the interaction between transformer and CNN networks, ICIF-Net [54] adopts a parallel connection of a CNN and a transformer to extract semantic features from dual-temporal images. It obtains features from both CNN and transformer networks and uses cross-attention to fuse the two types of features within the scale, balancing local and global information. However, this also consumes significant resources. To reduce the number of parameters relative to the basic transformer, an asymmetric cross-attention hierarchical network (ACAHNet) [55] was proposed for CD, integrating transformer layers into a CNN-based main framework and thereby avoiding a rapid increase in model parameters and computational complexity. To better leverage multi-scale feature information, AMTNet [56] adopts a CNN–transformer architecture in a Siamese network, utilizing a convolutional network as the backbone to extract multi-scale features from original input images. The features processed by the transformer network are further employed to model context information in multi-scale features of dual-temporal images. Although these methods consider combining CNN and transformer networks, they do not adequately address deep interaction between CNN and transformer networks or the limitation of the transformer's fixed-size patches.
In summary, CNN networks possess powerful feature extraction capabilities, while transformers have the advantage of global attention that focuses on global features. Therefore, considering that the combination of CNN and transformer can obtain both global and local attention, the CNN–transformer architecture has great potential in CD tasks. Existing methods mostly adopt combinations of CNN and transformers in a serial or parallel manner, lacking deep interaction between the CNN and the transformer during the feature extraction process. Furthermore, the transformer network, which utilizes fixed-size patches at each stage, lacks the capability of perceiving features from multiple receptive fields, which is disadvantageous for CD tasks involving objects of various sizes. Therefore, we propose a Siamese network with multiple receptive field branches based on transformers and Convolutional Neural Networks (CNNs) to extract features from dual-temporal images. To capture feature representations with multiple receptive fields in the patch merging module, we introduce additional branches with 5 × 5 and 7 × 7 convolutional layers in addition to the existing 3 × 3 convolutional patch embedding module at each feature extraction stage. This approach creates multi-scale patch embedding feature maps through multiple paths at each patch merging stage, forming multi-receptive field representations while reducing the loss of edge information. To retain detailed information, a feature extraction branch based on convolution is added for the 3 × 3 convolution layer path. We designed a multi-receptive field feature aggregation module to aggregate multi-receptive field features obtained from multiple paths. For the acquired dual-temporal features, a spatiotemporal discrepancy perception module is employed to obtain dual-temporal difference features. A transformer-based multi-level feature interaction module is utilized to further interact information between multi-level features.
This article primarily contributes in the following ways:
(1) We propose a U-shaped CD network based on the CNN–transformer architecture, namely the Global-Local Collaborative Learning Network (GLCL-Net). At each stage of feature extraction, feature maps with different receptive fields are generated from different branches, and these branch features interact to provide the network with global-to-local feature extraction capabilities. This enhances the network's ability to perceive objects of different sizes.
(2) We introduce a multi-branch feature aggregation module (MBFA) to integrate features from multiple branches at the feature extraction stage and a multi-level feature interaction module (MLFI) for interaction among features at different levels, enhancing the network’s representational capacity for ground features.
(3) For extracting change information, we propose a spatiotemporal discrepancy perception module (SDPM) designed for CD tasks. This module effectively captures the changing portions of the dual-temporal features.
The structure of this article is as follows: Section 2 provides a detailed description of the proposed method. Section 3 introduces the experimental setup and presents the results. Section 4 discusses the factors influencing these results. Finally, Section 5 outlines the conclusions.

2. Proposed Methods

2.1. Overall Architecture

The overall architecture of the proposed network is illustrated in Figure 1. The input to the network consists of three-channel dual-temporal RS images, and the shared-weight dual-branch encoder is employed to obtain dual-temporal features. The resulting features are then fed into the CD network, which outputs a binary prediction map. In the encoding stage, the Overlapping Patch Extraction (OPE) strategy is utilized to enlarge the patch window to allow partial overlap of adjacent windows in the edge region. This effectively avoids the loss of relevant information at the patch edges and enhances the integrity of feature extraction. Additionally, considering that the transformer network captures a single receptive field feature with a fixed patch size in each stage, which may be disadvantageous for extracting features of different-sized objects in CD tasks, our CD network adopts a multi-branch Siamese network based on both transformers and CNNs to extract features from dual-temporal images. In the patch merging stage, we generate feature maps with multi-scale patch embeddings from multiple paths, obtaining multi-receptive field representations of features at each patch merging stage, thus enhancing sensitivity to different-sized land features. To preserve more detailed information, the 3 × 3 convolutional layer branch undergoes an additional convolution-based feature extraction branch. Subsequently, the obtained multi-branch features are fed into the multi-branch feature aggregation module, facilitating interaction between the CNN and transformer networks. For the dual-temporal image features obtained by the encoder, a spatiotemporal discrepancy perception module is proposed, focusing on extracting the changing information in dual-temporal features and obtaining multi-level features containing change information. These multi-level features are further processed using a multi-level feature interaction module based on the transformer attention mechanism. Next, the multi-scale features are decoded, with the decoder having a symmetric structure to the encoder. The decoding stage employs a patch merging layer to upsample the features and increase their resolution. The restored feature maps are skip-connected with features from the multi-level feature interaction module at the same resolution and input to the transformer module for decoding. After the three-stage process, the output resolution is one-fourth of the original input image. Finally, a linear projection is applied to restore the resolution to the original input resolution, and a convolutional layer is used to output the final binary classification prediction map.
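The data flow described above can be summarized in the following PyTorch-style skeleton. It is a minimal sketch, assuming that the encoder, SDPM blocks, MLFI, and decoder are built as separate modules with the interfaces shown; the module names, channel width, and the bilinear upsampling head are illustrative assumptions rather than the exact released implementation.

```python
import torch.nn as nn

class GLCLNetSketch(nn.Module):
    """Illustrative skeleton of the U-shaped CD pipeline described above."""
    def __init__(self, encoder, sdpm_blocks, mlfi, decoder, decoder_ch=64, num_classes=2):
        super().__init__()
        self.encoder = encoder          # shared-weight multi-branch CNN-transformer encoder
        self.sdpm_blocks = sdpm_blocks  # one SDPM per feature level
        self.mlfi = mlfi                # multi-level feature interaction
        self.decoder = decoder          # symmetric transformer decoder with skip connections
        self.upsample = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)
        self.classifier = nn.Conv2d(decoder_ch, num_classes, kernel_size=1)

    def forward(self, img_t1, img_t2):
        # Siamese encoding: the same encoder processes both epochs.
        feats_t1 = self.encoder(img_t1)          # list of multi-scale features
        feats_t2 = self.encoder(img_t2)
        # Per-level spatiotemporal discrepancy perception.
        diff_feats = [sdpm(f1, f2) for sdpm, f1, f2
                      in zip(self.sdpm_blocks, feats_t1, feats_t2)]
        # Cross-level interaction, then decoding back to 1/4 of the input resolution.
        decoded = self.decoder(self.mlfi(diff_feats))
        # Restore the original resolution and predict the binary change map.
        return self.classifier(self.upsample(decoded))
```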

2.2. Multi-Branch Multi-Receptive Field Patch Merging (MMP)

In conventional transformer models, N × N convolutional operations are typically used in the encoding stage for patch merging, after which the token sequence is fed into the transformer modules of subsequent stages. However, using a fixed patch size in each stage of the transformer network results in capturing features with a single receptive field, which is disadvantageous for extracting features of different-sized objects in CD tasks. As shown in Figure 2, we enhance the conventional 3 × 3 convolutional patch embedding module by introducing additional 5 × 5 and 7 × 7 patch embedding branches, enabling multi-receptive-field feature extraction. Specifically, our CD network employs a multi-branch Siamese network based on both CNN and transformer networks to extract features from dual-temporal images. To enable the patch merging module to capture feature information from different receptive fields, we add two additional patch embedding branches with 5 × 5 and 7 × 7 convolutional layers at each feature extraction stage, on top of the existing transformer 3 × 3 convolutional patch embedding module. Inspired by [57,58], we use two consecutive 3 × 3 convolutional layers and three consecutive 3 × 3 convolutional layers to emulate the 5 × 5 and 7 × 7 convolutional layers, respectively, in order to reduce computational complexity. In this way, multi-scale features are generated from multiple patch embedding paths, forming a multi-receptive-field representation at each patch merging stage and mitigating the loss of patch edge information. In addition, to preserve more detailed information, an extra convolution-based feature extraction branch is applied to the 3 × 3 convolutional patch embedding branch.
To reduce the number of model parameters, we apply multi-branch multi-receptive field patch merging in the second, third, and fourth stages of feature extraction. For each input feature map $X_i$ at the i-th feature extraction stage, after applying multi-branch multi-receptive-field patch merging, we obtain the outputs $\Phi_3(X_i)$, $\Phi_5(X_i)$, and $\Phi_7(X_i)$ at the i-th stage. The size of the output features from the multiple branches can be obtained using the following formulas:
$$H_i = \frac{H_{i-1} - p + 2d}{s} + 1, \qquad W_i = \frac{W_{i-1} - p + 2d}{s} + 1$$
where $p$ is the convolution kernel (patch) size, $d$ is the padding, and $s$ is the stride.
By adjusting the padding size, we can obtain multi-branch features with the same resolution, which are then further fused through transformer blocks, and the resulting features are further aggregated through the multi-branch feature aggregation module.
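The following PyTorch sketch illustrates one such patch merging stage. The stride-2 downsampling and the BatchNorm/GELU embedding layers are assumptions made for illustration; the 5 × 5 and 7 × 7 receptive fields are emulated by stacking 3 × 3 convolutions as described above, so all three branches share the same output resolution.

```python
import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch), nn.GELU())

class MultiBranchPatchMerging(nn.Module):
    """Three patch-embedding paths with 3x3, (emulated) 5x5 and 7x7 receptive fields."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 3x3 branch: a single stride-2 convolution.
        self.branch3 = conv_bn(in_ch, out_ch, stride=2)
        # 5x5 branch: two stacked 3x3 convolutions (only the first downsamples).
        self.branch5 = nn.Sequential(conv_bn(in_ch, out_ch, 2), conv_bn(out_ch, out_ch, 1))
        # 7x7 branch: three stacked 3x3 convolutions.
        self.branch7 = nn.Sequential(conv_bn(in_ch, out_ch, 2), conv_bn(out_ch, out_ch, 1),
                                     conv_bn(out_ch, out_ch, 1))

    def forward(self, x):
        # All branches keep the same output resolution (H/2, W/2) by construction,
        # matching the output-size formula above with p = 3, d = 1, s = 2.
        return self.branch3(x), self.branch5(x), self.branch7(x)

# Example: a 256x256 input produces three 128x128 feature maps.
phi3, phi5, phi7 = MultiBranchPatchMerging(64, 128)(torch.randn(1, 64, 256, 256))
```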

2.3. Multi-Branch Feature Extraction Module

As shown in Figure 2, in each multi-branch feature extraction (MBFE) module, the input comprises three feature layers, i.e., the three paths of features from the multi-receptive-field patch merging module with convolution kernel sizes of 3 × 3, 5 × 5, and 7 × 7. To capture multi-receptive-field representations, the feature maps from the three paths are processed through the transformer module separately. Additionally, the 3 × 3 convolutional embedding branch undergoes an extra convolution-based feature extraction branch. Therefore, the multi-branch feature extraction module has four output branches. For the transformer module, we employ a factorized attention mechanism; unlike efficient attention, which applies softmax to both Q and K to reduce computational complexity, the factorized attention mechanism only computes softmax for K:
$$FA(Q, K, V) = \rho_q(Q) \times \big(\rho_\kappa(K)^{T} V\big)$$
where $\rho_q(\cdot)$ is a scaling function and $\rho_\kappa(\cdot)$ is the softmax function:
$$\rho_\kappa(K) = Softmax(K)$$
$$\rho_q(Q) = \frac{Q}{\sqrt{n}}$$
In the transformer module, we employ convolutional position encoding to facilitate the aggregation of features from different branches. Since the features obtained from the multi-branch patch merging module are fed into the transformer module separately, convolutions with different receptive field sizes are applied to the multi-branch features; without position encoding, the network lacks awareness of positional information. Therefore, using a transformer block that includes convolutional position encoding can better perceive the positional information of each branch, facilitating further fusion of the multi-branch features. Specifically, in each convolutional position encoding step, the input feature map $X \in \mathbb{R}^{H \times W \times C}$ is first processed with a depth-wise convolution of kernel size 3 and stride 1. Subsequently, it is flattened into $X_c \in \mathbb{R}^{N \times C}$ and concatenated with the input.
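A minimal sketch of the two ingredients of this block, factorized attention and convolutional position encoding, is given below. The linear qkv projection, the choice of n = N for the scaling factor, and adding (rather than concatenating) the position-encoded map back to the tokens are assumptions made for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FactorizedAttention(nn.Module):
    """Linear-complexity attention: softmax is applied to K only, Q is only scaled."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C) token sequence
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k = F.softmax(k, dim=1)                 # rho_k: softmax over the token axis
        q = q / (N ** 0.5)                      # rho_q: scaling (n = N assumed)
        context = k.transpose(1, 2) @ v         # (B, C, C): K^T V, avoids the N x N map
        return self.proj(q @ context)           # (B, N, C)

class ConvPositionEncoding(nn.Module):
    """Depth-wise 3x3 convolution that injects positional information into the tokens."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, stride=1, padding=1, groups=dim)

    def forward(self, x, H, W):                 # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        # The encoded map is added back to the input tokens (addition assumed here).
        return x + self.dw(feat).flatten(2).transpose(1, 2)
```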

2.4. Multi-Branch Feature Aggregation Module (MBFA)

Inspired by the coordinate attention mechanism [59], we propose a multi-branch feature aggregation (MBFA) module, as shown in Figure 3. Each of the four feature maps $x_i$ generated by the MBFE block serves as input to the module. Initially, the input features are concatenated along the channel dimension, followed by global pooling to extract global spatial information.
$$z_g = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x(i, j)$$
where $x$ represents the input and $z_g$ represents the output of the pooling operation.
The global pooling operation in the MBFA module involves pooling in two directions, expressed as:
$$z^{h} = \frac{1}{W/8}\sum_{0 \le i < W/8} x(h, i), \qquad z^{w} = \frac{1}{H/8}\sum_{0 \le j < H/8} x(j, w)$$
where $z^{h}$ represents the output of pooling with a kernel of size (1, W/8) in the horizontal direction, and $z^{w}$ is the output of applying a kernel of size (H/8, 1) to the input $x$ in the vertical direction.
To mitigate the significant loss of spatial information caused by compressing all global spatial information into a single channel through global average pooling, we employ dual-branch parallel feature encoding along the horizontal and vertical coordinates, respectively. This dual-branch encoding ensures that spatial coordinate information is incorporated into the attention map. The transformation results in both directions (i.e., $z^{h}$ and $z^{w}$) are concatenated to achieve information aggregation, and a 1 × 1 convolutional layer $Conv_{1\times1}$ is applied to reduce the channel dimension.
$$f = Conv_{1\times1}\big(Concat(z^{h}, z^{w})\big)$$
where $f \in \mathbb{R}^{C/r \times (H \times W)}$ represents the aggregated feature, and $r$ is the reduction ratio used to alleviate the model parameter load.
Subsequently, the aggregated feature $f$ is processed through a normalization layer. It is then split along the spatial dimension into two tensors, $f^{h} \in \mathbb{R}^{C/r \times H}$ and $f^{w} \in \mathbb{R}^{C/r \times W}$, which are transformed through another 1 × 1 convolutional layer.
$$g^{w} = \sigma\big(Conv_{1\times1}(f^{w})\big), \qquad g^{h} = \sigma\big(Conv_{1\times1}(f^{h})\big)$$
where $\sigma(\cdot)$ represents the sigmoid function. Then, $g^{h} \in \mathbb{R}^{C \times H}$ and $g^{w} \in \mathbb{R}^{C \times W}$ serve as attention weights that simultaneously weight the input residual $X_{concat}$ in both directions.
$$Y = X_{concat} \otimes g^{h} \otimes g^{w}$$
where $\otimes$ denotes element-wise multiplication, broadcast along the complementary spatial dimension.
The proposed MBFA module takes the outputs of the multi-branch feature extraction block as input and aggregates multi-receptive field information at the same stage. By applying multi-branch feature aggregation, this module facilitates the transmission of rich information, thus enhancing the overall performance of the final CD.
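A compact sketch of this directional aggregation is shown below. It follows the coordinate-attention recipe with full-row/column pooling and a 1 × 1 fusion convolution back to a single-branch width; the pooling granularity (the module above uses (1, W/8) and (H/8, 1) kernels), the reduction ratio, and the fusion layer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MBFA(nn.Module):
    """Sketch of multi-branch feature aggregation with directional attention weights."""
    def __init__(self, branch_ch, num_branches=4, reduction=16):
        super().__init__()
        c = branch_ch * num_branches
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along the height
        self.reduce = nn.Sequential(nn.Conv2d(c, c // reduction, 1),
                                    nn.BatchNorm2d(c // reduction), nn.ReLU(inplace=True))
        self.conv_h = nn.Conv2d(c // reduction, c, 1)
        self.conv_w = nn.Conv2d(c // reduction, c, 1)
        self.fuse = nn.Conv2d(c, branch_ch, 1)          # back to a single-branch width

    def forward(self, branches):                        # list of (B, branch_ch, H, W) maps
        x = torch.cat(branches, dim=1)                  # X_concat
        B, C, H, W = x.shape
        zh = self.pool_h(x)                             # (B, C, H, 1)
        zw = self.pool_w(x).permute(0, 1, 3, 2)         # (B, C, W, 1)
        f = self.reduce(torch.cat([zh, zw], dim=2))     # aggregate both directions
        fh, fw = torch.split(f, [H, W], dim=2)
        gh = torch.sigmoid(self.conv_h(fh))             # (B, C, H, 1)
        gw = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return self.fuse(x * gh * gw)                   # direction-wise re-weighted fusion
```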

2.5. Spatiotemporal Discrepancy Perception Module (SDPM)

For the dual-temporal multi-scale features, the classical element-wise subtraction or channel concatenation is insufficient for fully extracting the change information from dual-temporal feature maps. Therefore, as illustrated in Figure 4, we propose a spatiotemporal discrepancy perception module (SDPM) to capture the change information of dual-temporal features at each stage. This module processes features simultaneously in both spatial and channel dimensions, providing enhanced extraction and interaction of dual-temporal features for better extraction of change information.
Channel dimension feature extraction involves embedding the dual-temporal features $input_1$ and $input_2$ into two channel-wise attention vectors $w_{c1}$ and $w_{c2}$. Initially, the dual-temporal features $input_1$ and $input_2$ are concatenated along the channel dimension to obtain $w_1$. Subsequently, global max-pooling and global average-pooling operations are performed to retain more spatial information, resulting in two vectors, which are concatenated to obtain $Y \in \mathbb{R}^{4C}$. Then, the sigmoid function is applied to $Y$ to obtain $w_c$, which is split into $w_{c1}$ and $w_{c2}$. The process is described as follows:
$$Y = Concat\big(Avg(w_1), Max(w_1)\big)$$
$$w_{c1}, w_{c2} = F_{split}\big(\sigma(Y)\big)$$
where $Avg(\cdot)$ represents global average pooling, $Max(\cdot)$ represents global max pooling, $\sigma(\cdot)$ is the sigmoid function, and $F_{split}$ represents vector splitting.
To further extract local information from the features, we introduce spatial feature extraction, in which the dual-temporal input features $input_1$ and $input_2$ are concatenated and embedded into two spatial weight maps $w_{s1}$ and $w_{s2}$. Firstly, the concatenated feature $w_2$ of the dual-temporal input features $input_1$ and $input_2$ is processed through two 1 × 1 convolutional layers with a ReLU function between them to obtain the feature map $F \in \mathbb{R}^{H \times W \times 2}$. Subsequently, the sigmoid function is applied to $F$ to obtain $w_s$, which is then split into $w_{s1}$ and $w_{s2}$. The process is as follows:
$$F = Conv_{1\times1}\big(ReLU(Conv_{1\times1}(w_2))\big)$$
$$w_{s1}, w_{s2} = F_{split}\big(\sigma(F)\big)$$
where $F_{split}$ represents vector splitting and $Conv_{1\times1}$ denotes the 1 × 1 convolutional layers.
After channel dimension feature extraction and spatial dimension feature extraction, we obtain the channel weight vectors $w_{c1}$ and $w_{c2}$ as well as the spatial weight mappings $w_{s1}$ and $w_{s2}$ for $input_1$ and $input_2$. To aggregate the spatial and channel information for better differentiation, the following steps are performed. First, we multiply the channel weight vectors $w_{c1}$ and $w_{c2}$ by $input_1$ and $input_2$, respectively, to obtain two channel features $c_1$ and $c_2$. Similarly, we multiply the spatial weight mappings $w_{s1}$ and $w_{s2}$ by $input_1$ and $input_2$, respectively, to obtain two spatial features $s_1$ and $s_2$. Then, we add $c_1$, $c_2$, $s_1$, and $s_2$ to obtain the spatiotemporal feature $l_1$. Additionally, we add $c_1$, $s_1$, and $input_1$ to obtain the feature $l_2$ enhanced by $input_1$, and we add $c_2$, $s_2$, and $input_2$ to obtain the feature $l_3$ enhanced by $input_2$. Finally, we compute the difference between $l_2$ and $l_3$ to extract the dual-temporal differential information:
$$l_1 = w_{c1} \otimes input_1 + w_{c2} \otimes input_2 + w_{s1} \otimes input_1 + w_{s2} \otimes input_2$$
$$l_2 = w_{c1} \otimes input_1 + w_{s1} \otimes input_1 + input_1$$
$$l_3 = w_{c2} \otimes input_2 + w_{s2} \otimes input_2 + input_2$$
where $\otimes$ denotes element-wise multiplication.
Ultimately, we concatenate $l_1$, $l_2$, $l_3$, and $l_2 - l_3$, and after adjusting the feature dimensions with a 1 × 1 convolution, we obtain the enhanced change features:
$$output = Conv_{1\times1}\big(Concat(l_1, l_2, l_3, l_2 - l_3)\big)$$
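The following sketch mirrors the SDPM computation for one feature level. The linear layer that maps the 4C pooled statistics to the two C-dimensional channel-weight vectors, and the channel widths of the spatial branch, are assumptions needed to make the dimensions consistent.

```python
import torch
import torch.nn as nn

class SDPM(nn.Module):
    """Sketch of the spatiotemporal discrepancy perception module for one feature level."""
    def __init__(self, ch):
        super().__init__()
        # Channel branch: a linear layer (assumed) maps the 4*ch pooled vector to 2*ch weights.
        self.channel_fc = nn.Linear(4 * ch, 2 * ch)
        # Spatial branch: two 1x1 convolutions produce a 2-channel spatial weight map.
        self.spatial = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(ch, 2, 1))
        self.out_conv = nn.Conv2d(4 * ch, ch, 1)

    def forward(self, x1, x2):                       # x1, x2: (B, ch, H, W)
        w = torch.cat([x1, x2], dim=1)               # (B, 2ch, H, W)
        # --- channel weights: global average- and max-pooled statistics ---
        pooled = torch.cat([w.mean(dim=(2, 3)), w.amax(dim=(2, 3))], dim=1)   # (B, 4ch)
        wc1, wc2 = torch.sigmoid(self.channel_fc(pooled)).chunk(2, dim=1)
        wc1, wc2 = wc1[..., None, None], wc2[..., None, None]
        # --- spatial weights ---
        ws1, ws2 = torch.sigmoid(self.spatial(w)).chunk(2, dim=1)             # (B, 1, H, W)
        # --- aggregation ---
        c1, c2 = wc1 * x1, wc2 * x2
        s1, s2 = ws1 * x1, ws2 * x2
        l1 = c1 + c2 + s1 + s2                       # joint spatiotemporal feature
        l2 = c1 + s1 + x1                            # feature enhanced by input_1
        l3 = c2 + s2 + x2                            # feature enhanced by input_2
        return self.out_conv(torch.cat([l1, l2, l3, l2 - l3], dim=1))
```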

2.6. Multi-Level Feature Interaction Module (MLFI)

In CD tasks, the fusion of information across multiple levels of features is crucial. To facilitate interaction among multi-level features, we propose a dual-attention mechanism based on the transformer architecture, comprising channel attention and token attention. This dual-attention mechanism effectively captures the interaction representation of multi-level features. Specifically, as illustrated in Figure 1, the input to the multi-level feature interaction (MLFI) module consists of the multi-scale features $Y_1 \in \mathbb{R}^{H/4 \times W/4 \times C}$, $Y_2 \in \mathbb{R}^{H/8 \times W/8 \times 2C}$, $Y_3 \in \mathbb{R}^{H/16 \times W/16 \times 4C}$, and $Y_4 \in \mathbb{R}^{H/32 \times W/32 \times 8C}$, which carry the differential information obtained from the dual-temporal features by the SDPM. Initially, these four multi-scale outputs are reshaped into four tensors of shape $\mathbb{R}^{N_i \times C}$ with the same channel number. These tensors are then concatenated along the token dimension to form $Y^{C} \in \mathbb{R}^{N \times C}$. Following this, $Y^{C}$ is fed into our multi-level feature interaction module, where both global and local dependencies among multi-level features are enhanced through one layer of channel attention and three layers of token attention. Finally, the output features are split and reshaped into four feature maps with the same resolutions as the inputs.
In the token attention mechanism, token-aware attention based on the dot-product attention module [60] is applied to process the sequence. As shown in Figure 5a, $Y^{C}$ is mapped to $Q \in \mathbb{R}^{N \times d_k}$. In particular, during the generation of the keys (K) and values (V), a ratio reduction module is inserted to shape-reduce the high-resolution feature map.
$$K_r = Proj_{(rC,\, C)}\Big(Reshape\big(K, \tfrac{N}{r}, rC\big)\Big)$$
where $Reshape$ refers to reorganizing the tensor into the specified shape, and $K_r$ represents the reduced keys. A spatial reduction ratio $r$ shortens the pixel sequence to $1/r$ of its original length; correspondingly, the channel size increases to $r$ times its original size. A linear projection layer $Proj$ then restores the channel depth of the intermediate feature layer from $rC$ back to $C$.
$V_r$ is then generated in the same way as $K_r$:
$$V_r = Proj_{(rC,\, C)}\Big(Reshape\big(V, \tfrac{N}{r}, rC\big)\Big)$$
By mapping the queries and keys, an attention matrix is formed via the dot-product operation, which can weight the tokens.
$$Atten_T(Q, K_r, V_r) = Softmax\!\left(\frac{Q K_r^{T}}{\sqrt{d_k}}\right) V_r$$
where $Softmax(\cdot)$ denotes the normalization function, $T$ represents the transpose operation on the input vector, $Q$, $K$, and $V$ are the input sequences, and $d_k$ is the reduced dimension.
We can use the token-aware attention mechanism to learn long-range dependencies between pixels in the sequence. This dependency can transcend single-stage limitations, allowing us to capture pixel relationship information that differs between different stages.
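A minimal implementation of this token-aware attention with spatial reduction might look as follows; the linear projections and the assumption that the token count N is divisible by the reduction ratio r are illustrative choices.

```python
import torch.nn as nn
import torch.nn.functional as F

class TokenAttention(nn.Module):
    """Token-aware attention with spatial reduction of K and V (sketch)."""
    def __init__(self, dim, reduction_ratio=4):
        super().__init__()
        self.r = reduction_ratio
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr_proj = nn.Linear(dim * reduction_ratio, dim)   # Proj: rC -> C (assumed)
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (B, N, C), N divisible by r
        B, N, C = x.shape
        q = self.q(x)                                      # (B, N, C)
        # Reduce the token length of K/V by r, folding the grouped tokens into channels.
        x_r = self.sr_proj(x.reshape(B, N // self.r, self.r * C))   # (B, N/r, C)
        k, v = self.kv(x_r).chunk(2, dim=-1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, N, N/r)
        return attn @ v                                    # (B, N, C)
```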
Because the token-aware attention mechanism focuses on positional relationships and contextual associations, the long-distance dependencies between channels in multi-level features remain to be explored. Therefore, we further propose channel-aware attention, as shown in Figure 5b. First, the sequence $Y^{C}$ is mapped to $Q \in \mathbb{R}^{N \times d_k}$, $K \in \mathbb{R}^{N \times d_k}$, and $V \in \mathbb{R}^{N \times d_v}$. Then, $K^{T}$ and $V$ are weighted. Specifically, the vector $v_j \in \mathbb{R}^{N \times 1}$ of a single channel $j$ is weighted with the vector $k_i^{T} \in \mathbb{R}^{1 \times N}$ of all token elements at the corresponding positions within channel $i$, yielding $CA(i, j)$, an element of the channel-aware attention matrix. The channel-aware attention matrix $CA$ can weight channels. The channel attention is represented as follows:
$$CA = Softmax\!\left(\frac{K^{T}}{\sqrt{N}}\right) V$$
$$Atten_C(Q, K, V) = Softmax\!\left(\frac{Q}{\sqrt{N}}\right) CA$$
where $Softmax(\cdot)$ denotes the normalization function, $T$ represents the transpose operation on the input vector, $Q$, $K$, and $V$ are the input sequences, and $N$ is the sequence length.
In the MLFI module, the output $Y_A^{C}$ of each sub-layer attention module is added to the residual $Y^{C}$ to obtain $ResY^{C}$, avoiding the vanishing-gradient problem. Then, $ResY^{C}$ is normalized and passed through a feed-forward network (FFN), and the result is again added to the residual to obtain $ResY_A^{C}$. Finally, the output of the last sub-layer of the MLFI module is split and reshaped into four feature maps with the same resolutions as the inputs.
$$Y_A^{C} = Atten\big(Norm(Y^{C})\big)$$
$$ResY^{C} = Y^{C} + Y_A^{C}$$
$$ResY_A^{C} = ResY^{C} + FFN\big(Norm(ResY^{C})\big)$$
$$O_i = Reshape\big(Split(ResY_A^{C})\big)$$
where $Atten(\cdot)$ can be set to $Atten_C(\cdot)$ or $Atten_T(\cdot)$. In the MLFI, $Atten_C(\cdot)$ is executed once, and $Split$ represents vector splitting.
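The channel-aware attention and the residual sub-layer wiring can be sketched as below; the pre-norm layout, the 4× FFN expansion, and the linear qkv projections are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel-aware attention: the N x N token map is replaced by a C x C channel map."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        ca = F.softmax(k.transpose(1, 2) / N ** 0.5, dim=-1) @ v   # (B, C, C) channel map
        out = F.softmax(q / N ** 0.5, dim=-1) @ ca                 # (B, N, C)
        return self.proj(out)

class MLFISubLayer(nn.Module):
    """One pre-norm sub-layer of the MLFI block: attention + FFN, each with a residual."""
    def __init__(self, dim, attention):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = attention                               # ChannelAttention or TokenAttention
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, y):                                   # y: (B, N, C) multi-level tokens
        y = y + self.attn(self.norm1(y))                    # ResY^C
        return y + self.ffn(self.norm2(y))                  # ResY_A^C
```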

3. Experiments and Results

This section begins by describing the datasets and comparative methods used. Subsequently, we explain the evaluation metrics employed to assess the network’s performance and detail the experimental setup. Finally, we present the results of the comparative experiments.

3.1. Datasets

We conducted experiments on three publicly available CD datasets to evaluate the effectiveness of our GLCL-Net: the Global Very-High-Resolution Landslide Mapping (GVLM) dataset [62], the Visual and RS-Change Detection (LEVIR-CD) dataset [45], and the Sun Yat-sen University (SYSU-CD) dataset [63]. Details of the three datasets are summarized in Table 1.
(1) Visual and RS-Change Detection (LEVIR-CD): The LEVIR-CD dataset consists of 637 pairs of high-resolution images obtained from Google Earth, with a pixel size of 1024 × 1024. Various types of buildings, such as high-rise apartments, villas, small garages, and large warehouses, are included, covering different change types induced by seasonal and lighting variations. Due to GPU memory constraints, we cropped the 637 image pairs into non-overlapping samples of 256 × 256 pixels, resulting in 7120/1024/2048 pairs for training/validation/testing, respectively.
(2) Sun Yat-sen University Dataset (SYSU-CD): This dataset contains 20,000 pairs of 0.5 m aerial images taken in Hong Kong during 2007–2014, each with a size of 256 × 256 pixels. The main types of change in the dataset include new urban construction, suburban expansion, pre-construction foundations, vegetation change, road expansion, and marine construction. Of the 20,000 image pairs, 8000 pairs were evenly split into validation and test sets, with the remaining pairs used for training.
(3) Global Very-High-Resolution Landslide Mapping (GVLM): This dataset comprises 17 pairs of VHR images obtained through Google Earth services, covering extensive landslide sites across six continents: Asia, Africa, North America, South America, Europe, and Oceania. The images have a spatial resolution of 0.59 m and depict various landslide sites characterized by distinct geographic locations, sizes, shapes, occurrence times, spatial distributions, phenological states, and land cover types. Each pair of images has been randomly cropped into 256 × 256 image patches with a 40% overlap. Subsequently, a total of 13,529/3866/1932 pairs were selected for training/validation/testing across the entire target domain.
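For reproducibility, a simple tiling routine of the kind used to prepare these splits could look as follows; the file naming scheme and the use of Pillow are illustrative assumptions, with overlap = 0.0 reproducing the non-overlapping 256 × 256 LEVIR-CD crops and overlap = 0.4 approximating the GVLM setting.

```python
from pathlib import Path
from PIL import Image

def crop_to_tiles(image_path, out_dir, tile=256, overlap=0.0):
    """Crop one member of an image pair into fixed-size tiles."""
    img = Image.open(image_path)
    w, h = img.size
    stride = max(1, int(tile * (1.0 - overlap)))        # step between tile origins
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            patch = img.crop((left, top, left + tile, top + tile))
            patch.save(out_dir / f"{Path(image_path).stem}_{top}_{left}.png")
```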

3.2. Baseline

To demonstrate the effectiveness of our GLCL-Net on the dual-temporal RS image CD task, we compared it with several state-of-the-art methods, as follows:
FC-EF [36]: This method is based on a pure CNN CD model. It features a U-shaped architecture, taking the concatenation of the bi-temporal image pairs as input and employing an early fusion strategy.
FC-Siam-conc [36]: This method initially employs a Siamese architecture with shared weights in the encoding phase to extract multi-level features from bi-temporal images. Subsequently, these multi-layered features are concatenated and processed through fully connected layers to extract change information.
FC-Siam-diff [36]: This method is a variant of FC-EF. It utilizes a Siamese network to extract multi-level features from bi-temporal images and adopts a feature-differencing approach to extract change information.
DTCDSCN [43]: This method is a dual-task CD network. It utilizes both channel and spatial attention mechanisms to reduce redundant information. We only compare the outputs of the network for the change detection task.
SNUNet [40]: This method achieves interaction among multi-level and multi-scale features through dense connections, reducing information loss. It models contextual information using channel attention.
BiT [52]: The method employs a transformer module to encode features extracted from the CNN network and extract semantic information, while contextual relationships are modeled by introducing a feature differencing-based network.
ChangeFormer [53]: The method employs a hierarchically structured transformer encoder and a lightweight MLP decoder to accomplish CD tasks.
AMTNet [56]: This method is based on a CNN–transformer Siamese architecture, extracting multi-scale features from the original input image pairs. Attention and transformation modules are applied to model the dual-temporal images.
HANet [64]: This method proposes a progressive foreground-balanced sampling strategy and designs a hierarchical attention network (HANet), which is a discriminative Siamese network capable of integrating multiscale features and refining detailed features.
S2CD [65]: This method obtains the initial spatial and channel differences by performing summation and subtraction operations on bi-temporal images and uses transformers to extract meaningful differences in spatial and channel patterns, thereby capturing subtle differences in both spatial and channel aspects of bi-temporal images.

3.3. Evaluation Metrics

We adopted four mainstream metrics in the CD domain to measure the discrepancy between the predicted results and the ground truth: precision (Pre.), recall (Rec.), F1 score, and Intersection over Union (IoU). Each metric is calculated as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
where T P (True Positive) represents correctly predicted changed pixels. F P (False Positive) represents unchanged pixels incorrectly predicted as changed. F N (False Negative) represents changed pixels incorrectly predicted as unchanged.
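For concreteness, the four metrics can be computed directly from the pixel-level confusion counts; the counts in the usage example below are hypothetical.

```python
def change_detection_metrics(tp, fp, fn):
    """Precision, recall, F1 and IoU from pixel-level confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

# Example with hypothetical counts: 9,000 TP, 500 FP, 700 FN.
print(change_detection_metrics(9000, 500, 700))   # ~ (0.947, 0.928, 0.937, 0.882)
```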

3.4. Training Details

We trained the proposed model using the cross-entropy (CE) loss. The network was implemented under the PyTorch framework, and all experiments were run on an NVIDIA GeForce RTX 3090 GPU (24 GB). Considering the GPU memory limitations, we set the training batch size to 8. The maximum number of training epochs is set to 200. The initial learning rate is 0.0001, using the Adam optimizer with linear learning rate decay. We apply common data augmentation to the input image patches, including flipping, rescaling, cropping, and Gaussian blurring. Throughout the training process, the model that performs best on the validation set is used for testing.
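A minimal training-loop sketch consistent with this setup is given below; the data-loader interface, the LambdaLR schedule used to realize the linear decay, and the `evaluate` helper returning the validation F1 are assumptions, not the authors' exact script.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, evaluate, device="cuda", epochs=200):
    """Sketch: CE loss, Adam with lr = 1e-4, batch size set in the loaders, linear decay."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda e: 1.0 - e / epochs)      # linear learning-rate decay
    best_f1 = 0.0
    for epoch in range(epochs):
        model.train()
        for img1, img2, label in train_loader:
            img1, img2, label = img1.to(device), img2.to(device), label.to(device)
            optimizer.zero_grad()
            logits = model(img1, img2)                        # (B, 2, H, W)
            loss = criterion(logits, label.long())            # label: (B, H, W)
            loss.backward()
            optimizer.step()
        scheduler.step()
        f1 = evaluate(model, val_loader, device)              # assumed helper: validation F1
        if f1 > best_f1:                                      # keep the best validation model
            best_f1 = f1
            torch.save(model.state_dict(), "best_model.pth")
```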

3.5. Performance Comparison

To evaluate the effectiveness of the proposed GLCL-Net model, our network was trained on the training sets of three dual-temporal RS image change detection datasets and tested on the respective test sets. Table 2, Table 3 and Table 4 report the overall performance metrics on the LEVIR-CD, SYSU-CD, and GVLM-CD test samples, with quantitative results indicating that our GLCL-Net consistently demonstrates significant overall advantages across the four metrics. For instance, on the LEVIR-CD/SYSU-CD/GVLM-CD datasets, the F1 score of our GLCL-Net is higher by 0.7%/0.91%/0.17% than that of the best-performing comparison methods.
Observations from Table 2, Table 3 and Table 4 reveal that our GLCL-Net achieves consistent improvements across all four metrics on the three datasets. This substantiates the effectiveness of introducing a multi-branch feature extraction network based on CNNs and transformers for CD tasks. Additionally, the visual results of each model on the various datasets are presented in Figure 6, Figure 7 and Figure 8. It is noteworthy that, to distinguish the correctness of different region detections, we employ different colors: TP (white), TN (black), FP (red), and FN (green).
LEVIR-CD Visualization: As shown in Figure 6, we selected representative and challenging samples for visual comparison, such as small buildings in Figure 6a,e, densely changing buildings in Figure 6b, a larger scene change in Figure 6c, and intense seasonal and lighting changes in Figure 6d. In Figure 6, it can be observed that our GLCL-Net performs well compared to other methods. In the detection of small building changes in Figure 6a,e, our GLCL-Net is more sensitive to small building changes. In the detection of densely changing buildings in Figure 6b, our GLCL-Net is better at identifying subtle boundaries. In Figure 6c, our GLCL-Net has fewer false detections. In Figure 6d, our network can avoid interference from complex backgrounds. Therefore, relative to other comparative methods, our GLCL-Net achieves the best visual results on the LEVIR-CD dataset.
Visualization on SYSU-CD: As shown in Figure 7, we selected representative and challenging samples from the SYSU-CD dataset for visual performance comparison, including large buildings with different shooting angles in Figure 7a,b, severe interference due to seasonal variations in Figure 7c, and challenges posed by complex roads in Figure 7d,e. In Figure 7a,b, our GLCL-Net demonstrates more accurate building recognition compared to other competitors in terms of visual presentation. Under severe interference caused by seasonal variations in Figure 7c, only our GLCL-Net maintains a high recognition rate and low false-positive rate. Furthermore, under the challenges of complex road situations in Figure 7d,e, our GLCL-Net maintains a high recognition rate.
GVLM-CD Visualization: As shown in Figure 8, we selected representative and challenging samples from the GVLM-CD dataset for visual performance comparison. This includes areas with complex boundaries in Figure 8a,d, large changed areas in Figure 8b,c, and changes in complex backgrounds in Figure 8e. In Figure 8a,d, the presence of complex boundaries may lead to false detections in the changed areas. Our GLCL-Net demonstrates superior visual performance compared to other comparative methods. In Figure 8b,c, the recognition accuracy of other comparative methods is significantly affected by false changes, while our network reduces the impact of false changes. In Figure 8e, other methods exhibit noticeable false detections, while our GLCL-Net shows the best visual performance.

4. Discussion

4.1. Ablation Analysis

The experimental results are presented in Table 5, where ‘w/o’ is an abbreviation for ‘without’, ‘MBFA’ stands for multi-branch feature aggregation module, ‘SDPM’ represents spatiotemporal discrepancy perception module, and ‘MLFI’ is multi-level feature interaction module, including token-aware block and channel-aware block. Additionally, ‘MBFE’ denotes the multi-branch feature extraction module. The ablation analysis of each module is as follows:
MBFA: To analyze whether MBFA contributes to better overall network performance, we replaced the MBFA module with a 1 × 1 convolution. As shown in Table 5, removing MBFA leads to a significant performance decline compared to the original GLCL-Net, highlighting the importance of inter-branch feature communication. As anticipated, quantitative experimental results validate the feasibility and superiority of the proposed MBFA module.
SDPM: We designed the SDPM module to extract change information from dual-temporal features at each stage. The SDPM processes features simultaneously in both the spatial and channel dimensions, capturing more global and local information for improved extraction of change information. In this ablation, we replaced the SDPM module with a 1 × 1 convolution. As shown in the column without the SDPM component in Table 5, the network's performance declined, validating the effectiveness of the proposed SDPM module.
MLFI: We designed the MLFI module to facilitate information interaction among multi-level features, aiming to enhance the performance of our GLCL-Net. To validate the performance of our proposed MLFI, we directly disabled the MLFI module in the network. As shown in Table 5, the absence of our MLFI leads to a significant performance decline.
MBFE: To validate the performance improvement brought by multi-branch feature extraction, we replaced the multi-branch feature extraction-based encoder with a simple single-branch encoder. As shown in Table 5, in the case of a single-branch encoder, the network’s performance notably declined. This demonstrates that multi-branch feature extraction contributes to the performance enhancement of our network.
To further validate the effectiveness of each module, we visualized the results of the ablation experiments. Figure 9, Figure 10 and Figure 11 show the ablation experiment visualization results on the LEVIR-CD, SYSU-CD, and GVLM-CD datasets, respectively. It can be seen that after the ablation experiments, the visualization performance on all three datasets degraded. As shown in Figure 9a,d,e, small area changes exhibit more missed detections and false detections. In Figure 10a,b,f, the omission rate for large area changes increased. In Figure 11a,b,d, the boundary regions became more difficult to identify.

4.2. Network Visualization

To better validate the effectiveness of our proposed modules, we visualized the feature maps at different inference stages of GLCL-Net. As shown in Figure 12, we presented two representative feature maps from the original shallow features A(#), four corresponding representative feature maps from the bi-temporal feature maps B(#) after the MBFA module, and four corresponding representative feature maps from the feature maps C(#) with change information after the SDPM module, as well as the change feature maps D(#) after the MLFI module. From Figure 12, it can be observed that the shallow features extracted from the bi-temporal images have clearer textures after passing through the MBFA module. The bi-temporal features can extract preliminary change features after passing through the SDPM module, and these change features become more pronounced after passing through the MLFI module. The final prediction map can be obtained after classification.

4.3. Parameter Analysis

To evaluate the impact of the multi-branch feature extraction module with different parameters on CD results, we conducted ablation experiments by varying the number of branches in the encoding stage of the feature extraction module and the depth of transformer layers in each submodule. The obtained results are presented in Table 6. It can be observed that different parameter settings did not lead to significant changes in overall performance. We found that increasing the depth of transformer layers in the module while keeping the number of branches in the multi-branch feature extraction module fixed or increasing the number of branches while keeping the depth of transformer layers fixed improved the overall performance of CD results. Furthermore, the increase in the depth of transformer layers had a more pronounced effect on enhancing the performance of CD results. Specifically, when the number of branches in the multi-branch feature extraction module was set to (3:3:3), and the depth of transformer layers in the module was set to (3:8:3), although it resulted in a decrease in recall rate, it achieved the best overall performance across the three CD datasets.

4.4. Model Efficiency Analysis

For the model efficiency comparison, Table 7 presents the number of model parameters (Params), the number of floating-point operations (FLOPs), and the average training time required for training on 100 pairs of training data, using an image size of 256 × 256 × 3. Due to our multi-branch feature extraction approach, our GLCL-Net achieves better semantic representation at the cost of higher resource demands. The number of parameters in our model is higher than that of the other networks. In terms of FLOPs, our model ranks between DTCDSCN and ChangeFormer. Unfortunately, the training efficiency of GLCL-Net lags behind that of the other networks. Overall, despite the better semantic representation achieved by GLCL-Net, it is still necessary to explore lightweight architectures and potential acceleration strategies to improve the number of parameters, FLOPs, and training speed.
Since our GLCL-Net is heavier than the other networks, we further verified that it achieves better semantic representation at the cost of higher resource demands, rather than overfitting the data, by plotting the training and validation curves. As shown in Figure 13, the training and validation losses of our GLCL-Net decrease simultaneously on the LEVIR-CD and SYSU-CD datasets. Although the validation loss fluctuates on the GVLM-CD dataset, it generally decreases along with the training loss. Therefore, the network is not overfitting the data. However, the drawback of our network is that it requires more resources to achieve better performance.

5. Conclusions

To comprehensively aggregate local and global features, this paper proposes a multi-branch RS image CD feature extraction network based on CNNs and transformers, capturing multi-scale local features and global features separately. For multi-branch feature fusion, a multi-branch feature aggregation module is introduced to aggregate features from multiple receptive fields of transformers and CNNs. A dual-temporal image spatiotemporal discrepancy perception module is proposed to further extract dual-temporal change information. To integrate multi-level features, a transformer attention-based multi-level feature interaction module is employed.
We conducted extensive experiments on three publicly available datasets, LEVIR-CD, GVLM-CD, and SYSU-CD. Both quantitative and qualitative results demonstrate that our proposed GLCL-Net achieves state-of-the-art overall performance compared to recently published competitors. However, the main disadvantage of our model is its large resource consumption, caused by the introduction of multiple branches for feature extraction, which is the principal limitation of the proposed method. Therefore, future work will focus on improving model efficiency and lightweight design while maintaining the high performance of the network. In addition, exploring the applicability of this method to multi-source remote sensing images, such as overcoming sensor differences between two-phase remote sensing images, is an important direction for future research.

Author Contributions

Conceptualization, J.L. and F.S.; methodology, J.L. and Q.L.; software, J.L.; validation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.L., Q.L. and X.M.; supervision, Q.L., F.S. and X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ningbo Natural Science Foundation of China (grant 2022J067).

Data Availability Statement

The LEVIR-CD dataset is available at: https://justchenhao.github.io/LEVIR (accessed on 18 April 2022). The SYSU-CD dataset is available at: https://github.com/liumency/SYSU-CD (accessed on 10 November 2023). The GVLM-CD dataset is available at: https://github.com/zxk688/GVLM (accessed on 21 December 2023).

Acknowledgments

The authors sincerely appreciate the helpful comments and constructive suggestions given by the academic editors and reviewers.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Marin, C.; Bovolo, F.; Bruzzone, L. Building Change Detection in Multitemporal Very High Resolution SAR Images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2664–2682.
2. Huang, X.; Zhang, L.; Zhu, T. Building Change Detection from Multitemporal High-Resolution Remotely Sensed Images Based on a Morphological Building Index. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 105–115.
3. Chen, C.-F.; Son, N.-T.; Chang, N.-B.; Chen, C.-R.; Chang, L.-Y.; Valdez, M.; Centeno, G.; Thompson, C.; Aceituno, J. Multi-Decadal Mangrove Forest Change Detection and Prediction in Honduras, Central America, with Landsat Imagery and a Markov Chain Model. Remote Sens. 2013, 5, 6408–6426.
4. Adão, T.; Hruška, J.; Pádua, L.; Bessa, J.; Peres, E.; Morais, R.; Sousa, J. Hyperspectral Imaging: A Review on UAV-Based Sensors, Data Processing and Applications for Agriculture and Forestry. Remote Sens. 2017, 9, 1110.
5. Mahdavi, S.; Salehi, B.; Huang, W.; Amani, M.; Brisco, B. A PolSAR Change Detection Index Based on Neighborhood Information for Flood Mapping. Remote Sens. 2019, 11, 1854.
6. Gong, M.; Zhao, J.; Liu, J.; Miao, Q.; Jiao, L. Change Detection in Synthetic Aperture Radar Images Based on Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 125–138.
7. Qin, R.; Tian, J.; Reinartz, P. 3D change detection—Approaches and applications. ISPRS J. Photogramm. Remote Sens. 2016, 122, 41–56.
8. Khelifi, L.; Mignotte, M. Deep Learning for Change Detection in Remote Sensing Images: Comprehensive Review and Meta-Analysis. IEEE Access 2020, 8, 126385–126400.
9. Singh, A. Review Article Digital change detection techniques using remotely-sensed data. Int. J. Remote Sens. 1989, 10, 989–1003.
10. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A Hybrid Transformer Network for Change Detection in Optical Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622519.
11. Li, P.; Xu, H. Land-Cover Change Detection using One-Class Support Vector Machine. Photogramm. Eng. Remote Sens. 2010, 76, 255–263.
12. Malila, W.A. Change Vector Analysis: An Approach for Detecting Forest Changes with Landsat. 1980. Available online: https://docs.lib.purdue.edu/lars_symp/385/ (accessed on 1 January 2018).
13. Bai, B.; Fu, W.; Lu, T.; Li, S. Edge-Guided Recurrent Convolutional Neural Network for Multitemporal Remote Sensing Image Building Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5610613.
14. Ahlqvist, O. Extending post-classification change detection using semantic similarity metrics to overcome class heterogeneity: A study of 1992 and 2001 U.S. National Land Cover Database changes. Remote Sens. Environ. 2008, 112, 1226–1241.
15. Quarmby, N.A.; Cushnie, J.L. Monitoring urban land cover changes at the urban fringe from SPOT HRV imagery in south-east England. Int. J. Remote Sens. 1989, 10, 953–963.
16. Howarth, P.J.; Wickware, G.M. Procedures for change detection using Landsat digital data. Int. J. Remote Sens. 1981, 2, 277–291.
17. Ludeke, A.K.; Maggio, R.C.; Reid, L.M. An analysis of anthropogenic deforestation using logistic regression and GIS. J. Environ. Manag. 1990, 31, 247–259.
18. Jackson, R.D. Spectral indices in N-space. Remote Sens. Environ. 1983, 13, 409–421.
19. Deng, J.S.; Wang, K.; Deng, Y.H.; Qi, G.J. PCA-based land-use change detection and analysis using multitemporal and multisensor satellite data. Int. J. Remote Sens. 2008, 29, 4823–4838.
20. De Carvalho, O.; Guimarães, R.; Silva, N.; Gillespie, A.; Gomes, R.; Silva, C.; De Carvalho, A. Radiometric Normalization of Temporal Images Combining Automatic Detection of Pseudo-Invariant Features from the Distance and Similarity Spectral Measures, Density Scatterplot Analysis, and Robust Regression. Remote Sens. 2013, 5, 2763–2794.
21. Yuan, F.; Sawaya, K.E.; Loeffelholz, B.C.; Bauer, M.E. Land cover classification and change analysis of the Twin Cities (Minnesota) Metropolitan Area by multitemporal Landsat remote sensing. Remote Sens. Environ. 2005, 98, 317–328.
22. Gapper, J.J.; El-Askary, H.; Linstead, E.; Piechota, T. Coral Reef Change Detection in Remote Pacific Islands Using Support Vector Machine Classifiers. Remote Sens. 2019, 11, 1525.
23. Gao, F.; de Colstoun, E.B.; Ma, R.; Weng, Q.; Masek, J.G.; Chen, J.; Pan, Y.; Song, C. Mapping impervious surface expansion using medium-resolution satellite image time series: A case study in the Yangtze River Delta, China. Int. J. Remote Sens. 2012, 33, 7609–7628.
24. Li, D. Remotely sensed images and GIS data fusion for automatic change detection. Int. J. Image Data Fusion 2010, 1, 99–108.
25. Im, J.; Jensen, J.R. A change detection model based on neighborhood correlation image analysis and decision tree classification. Remote Sens. Environ. 2005, 99, 326–340.
26. Johansen, K.; Arroyo, L.A.; Phinn, S.; Witte, C. Comparison of Geo-Object Based and Pixel-Based Change Detection of Riparian Environments using High Spatial Resolution Multi-Spectral Imagery. Photogramm. Eng. Remote Sens. 2010, 76, 123–136.
27. Liu, Q.; Meng, X.; Shao, F.; Li, S. Supervised-unsupervised combined deep convolutional neural networks for high-fidelity pansharpening. Inf. Fusion 2023, 89, 292–304.
28. Liu, Q.; Ren, K.; Meng, X.; Shao, F. Domain Adaptive Cross Reconstruction for Change Detection of Heterogeneous Remote Sensing Images via a Feedback Guidance Mechanism. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4507216.
29. Zhang, S.; Meng, X.; Liu, Q.; Yang, G.; Sun, W. Feature-Decision Level Collaborative Fusion Network for Hyperspectral and LiDAR Classification. Remote Sens. 2023, 15, 4148.
30. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36.
31. Song, A.; Kim, Y.; Kim, Y.-I. Change Detection of Surface Water in Remote Sensing Images Based on Fully Convolutional Network. J. Coast. Res. 2019, 91, 426–430.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. Available online: http://ieeexplore.ieee.org/document/7780459/ (accessed on 8 December 2023).
33. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. Available online: http://arxiv.org/abs/1505.04597 (accessed on 8 December 2023).
  34. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5686–5696. Available online: https://ieeexplore.ieee.org/document/8953615/ (accessed on 8 December 2023).
  35. Zhan, Y.; Fu, K.; Yan, M.; Sun, X.; Wang, H.; Qiu, X. Change Detection Based on Deep Siamese Convolutional Network for Optical Aerial Images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1845–1849. [Google Scholar] [CrossRef]
  36. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. Available online: https://ieeexplore.ieee.org/document/8451652/ (accessed on 8 December 2023).
  37. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1194–1206. [Google Scholar] [CrossRef]
  38. Zhang, M.; Xu, G.; Chen, K.; Yan, M.; Sun, X. Triplet-Based Semantic Relation Learning for Aerial Remote Sensing Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 266–270. [Google Scholar] [CrossRef]
  39. Wang, M.; Tan, K.; Jia, X.; Wang, X.; Chen, Y. A Deep Siamese Network with Hybrid Convolutional Feature Extraction Module for Change Detection Based on Multi-sensor Remote Sensing Images. Remote Sens. 2020, 12, 205. [Google Scholar] [CrossRef]
  40. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8007805. [Google Scholar] [CrossRef]
  41. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165. Available online: http://arxiv.org/abs/1807.10165 (accessed on 8 December 2023).
  42. Peng, X.; Zhong, R.; Li, Z.; Li, Q. Optical Remote Sensing Image Change Detection Based on Attention Mechanism and Image Difference. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7296–7307. [Google Scholar] [CrossRef]
  43. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building Change Detection for Remote Sensing Images Using a Dual-Task Constrained Deep Siamese Convolutional Network Model. IEEE Geosci. Remote Sens. Lett. 2021, 18, 811–815. [Google Scholar] [CrossRef]
  44. Alimjan, G.; Jiaermuhamaiti, Y.; Jumahong, H.; Zhu, S.; Nurmamat, P. An image change detection algorithm based on multi-feature self-attention fusion mechanism UNet network. Int. J. Pattern Recognit. Artif. Intell. 2021, 35, 2159049. [Google Scholar] [CrossRef]
  45. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. Available online: http://arxiv.org/abs/1706.03762 (accessed on 8 December 2023).
  47. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. Available online: http://arxiv.org/abs/2010.11929 (accessed on 8 December 2023).
  48. Li, Y.; Wu, C.-Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4794–4804. Available online: https://ieeexplore.ieee.org/document/9879809/ (accessed on 25 February 2024).
  49. Azad, R.; Jia, Y.; Aghdam, E.K.; Cohen-Adad, J.; Merhof, D. Enhancing Medical Image Segmentation with TransCeption: A Multi-Scale Feature Fusion Approach. arXiv 2023, arXiv:2301.10847. Available online: http://arxiv.org/abs/2301.10847 (accessed on 25 February 2024).
  50. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713. [Google Scholar] [CrossRef]
  51. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. Available online: http://arxiv.org/abs/2103.14030 (accessed on 8 December 2023).
  52. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
  53. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. arXiv 2022, arXiv:2201.01293. Available online: http://arxiv.org/abs/2201.01293 (accessed on 8 December 2023).
  54. Feng, Y.; Xu, H.; Jiang, J.; Liu, H.; Zheng, J. ICIF-Net: Intra-Scale Cross-Interaction and Inter-Scale Feature Fusion Network for Bitemporal Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4410213. [Google Scholar] [CrossRef]
  55. Zhang, X.; Cheng, S.; Wang, L.; Li, H. Asymmetric Cross-Attention Hierarchical Network Based on CNN and Transformer for Bitemporal Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2000415. [Google Scholar] [CrossRef]
  56. Liu, W.; Lin, Y.; Liu, W.; Yu, Y.; Li, J. An attention-based multiscale transformer network for remote sensing image change detection. ISPRS J. Photogramm. Remote Sens. 2023, 202, 599–609. [Google Scholar] [CrossRef]
  57. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. Available online: http://ieeexplore.ieee.org/document/7780677/ (accessed on 5 January 2024).
  58. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef]
  59. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717. Available online: https://ieeexplore.ieee.org/document/9577301/ (accessed on 5 January 2024).
  60. Huang, X.; Deng, Z.; Li, D.; Yuan, X.; Fu, Y. MISSFormer: An Effective Transformer for 2D Medical Image Segmentation. IEEE Trans. Med. Imaging 2023, 42, 1484–1494. [Google Scholar] [CrossRef]
  61. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient Attention: Attention with Linear Complexities. arXiv 2020, arXiv:1812.01243. Available online: http://arxiv.org/abs/1812.01243 (accessed on 18 December 2023).
  62. Zhang, X.; Yu, W.; Pun, M.-O.; Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 2023, 197, 1–17. [Google Scholar] [CrossRef]
  63. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604816. [Google Scholar] [CrossRef]
  64. Han, C.; Wu, C.; Guo, H.; Hu, M.; Chen, H. HANet: A hierarchical attention network for change detection with bitemporal very-high-resolution remote sensing image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3867–3878. [Google Scholar] [CrossRef]
  65. Wang, L.; Fang, Y.; Li, Z.; Wu, C.; Xu, M.; Shao, M. Summator–Subtractor Network: Modeling Spatial and Channel Differences for Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5604212. [Google Scholar] [CrossRef]
Figure 1. The architecture of GLCL-Net comprises several key components. In the feature extraction stage, a multi-branch feature extraction network is constructed based on CNNs and transformers to capture both global and local features from different branches. A multi-branch feature aggregation module is introduced to aggregate features captured by both CNNs and transformers from various branches. Additionally, a dual-temporal image spatiotemporal discrepancy perception module is proposed to further extract change information from dual-temporal features. To facilitate interaction among multi-level features, a multi-level feature interaction module based on transformer attention is employed. The final step involves passing the features through a decoder to generate the predicted result.
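As a reading aid for Figure 1, the following PyTorch-style sketch traces the same dataflow with stand-in modules: shared-weight encoding of the two epochs, per-level discrepancy perception, and a top-down fusion before the prediction head. The class names (GLCLNetSketch, StubStage, StubSDPM), channel counts, and fusion details are placeholders invented for illustration, not the authors' released implementation.

```python
# Minimal sketch of the GLCL-Net dataflow in Figure 1 (stand-in modules, not the authors' code).
import torch
import torch.nn as nn

class StubStage(nn.Module):
    """Placeholder for one encoder stage (multi-branch CNN/transformer extraction + MBFA)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
    def forward(self, x):
        return self.proj(x)

class StubSDPM(nn.Module):
    """Placeholder spatiotemporal discrepancy perception: fuses bitemporal features at one level."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 1)
    def forward(self, f1, f2):
        return self.fuse(torch.cat([f1, f2], dim=1))

class GLCLNetSketch(nn.Module):
    def __init__(self, chans=(32, 64, 128)):
        super().__init__()
        cs = (3,) + chans
        self.stages = nn.ModuleList(StubStage(cs[i], cs[i + 1]) for i in range(len(chans)))
        self.sdpms = nn.ModuleList(StubSDPM(c) for c in chans)
        self.head = nn.Conv2d(chans[0], 1, 1)   # decoder stub: predict a change mask

    def forward(self, img_t1, img_t2):
        diffs = []
        f1, f2 = img_t1, img_t2
        for stage, sdpm in zip(self.stages, self.sdpms):
            f1, f2 = stage(f1), stage(f2)       # shared-weight (Siamese) encoding of both epochs
            diffs.append(sdpm(f1, f2))          # per-level change features
        # MLFI/decoder stub: upsample deep change features and fuse them top-down
        x = diffs[-1]
        for d in reversed(diffs[:-1]):
            x = nn.functional.interpolate(x, size=d.shape[-2:], mode="bilinear", align_corners=False)
            x = x[:, :d.shape[1]] + d           # crude channel alignment, for the sketch only
        return torch.sigmoid(self.head(x))

pred = GLCLNetSketch()(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
print(pred.shape)  # torch.Size([1, 1, 128, 128])
```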
Figure 2. Illustration of multi-branch feature extraction. Building upon the existing 3 × 3 convolutional patch embedding module, two additional branches simulating 5 × 5 and 7 × 7 convolutional patch embeddings are introduced, enhancing the capability of the patch merging module to capture information from multiple receptive fields. To preserve more detailed information, the 3 × 3 patch embedding branch is supplemented with an additional convolution-based feature extraction branch.
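One plausible realization of the multi-branch patch embedding in Figure 2 is a set of parallel strided convolutions with 3 × 3, 5 × 5, and 7 × 7 kernels, plus an extra convolutional branch on the 3 × 3 path. The sketch below follows that reading with arbitrarily chosen channel widths and should not be taken as the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiBranchPatchEmbed(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 strided patch embeddings, as suggested by Figure 2 (illustrative only)."""
    def __init__(self, c_in=3, c_out=64, stride=2):
        super().__init__()
        self.embed3 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        self.embed5 = nn.Conv2d(c_in, c_out, 5, stride=stride, padding=2)
        self.embed7 = nn.Conv2d(c_in, c_out, 7, stride=stride, padding=3)
        # extra convolutional branch on the 3x3 path to retain local detail
        self.detail = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)
        )

    def forward(self, x):
        e3 = self.embed3(x)
        branches = [e3 + self.detail(e3), self.embed5(x), self.embed7(x)]
        return branches  # aggregated later by the multi-branch feature aggregation module

embeds = MultiBranchPatchEmbed()(torch.rand(1, 3, 256, 256))
print([e.shape for e in embeds])  # three tensors of shape [1, 64, 128, 128]
```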
Figure 3. Illustration of multi-branch feature aggregation module.
Figure 4. Illustration of spatiotemporal discrepancy perception module.
Figure 5. Two attention mechanisms in the multi-level feature interaction module. (a) Token-aware attention, based on the dot-product attention mechanism [60]. (b) Channel-aware attention, based on efficient attention [61].
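The two attention variants in Figure 5 differ mainly in where the softmax is applied and which affinity matrix is materialized: token-aware attention builds an n × n token-to-token map, whereas efficient attention [61] builds a d × d channel-to-channel context. Below is a minimal sketch of both, using generic query/key/value tensors rather than the module's actual projections.

```python
import torch
import torch.nn.functional as F

def token_attention(q, k, v):
    """Standard scaled dot-product attention: an n x n token-to-token affinity (cf. Figure 5a)."""
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)   # (B, n, n)
    return attn @ v

def channel_attention(q, k, v):
    """Efficient attention in the style of Shen et al. [61]: normalize queries and keys separately,
    form a d x d channel context, and avoid the n x n token map (cf. Figure 5b)."""
    q = F.softmax(q, dim=-1)           # softmax over the channel dimension of each query
    k = F.softmax(k, dim=-2)           # softmax over the token dimension of each key channel
    context = k.transpose(-2, -1) @ v  # (B, d, d) channel-to-channel context
    return q @ context                 # (B, n, d)

x = torch.rand(2, 1024, 64)            # n tokens x d channels, e.g., a flattened 32 x 32 feature map
print(token_attention(x, x, x).shape, channel_attention(x, x, x).shape)
```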
Figure 6. Visualization results of different methods on the LEVIR-CD dataset. (a–f) Prediction results of all the compared methods for different samples. White represents a TP, black is a TN, red indicates an FP, and green stands for an FN.
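The color convention used in Figures 6–11 (white = TP, black = TN, red = FP, green = FN) can be reproduced with a simple per-pixel comparison of the predicted and reference change maps, as in this illustrative helper (error_map is a hypothetical name, not code from the paper).

```python
import numpy as np

def error_map(pred, gt):
    """Color-code a binary change map against ground truth:
    TP -> white, TN -> black, FP -> red, FN -> green (the convention used in Figures 6-11)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    rgb = np.zeros((*pred.shape, 3), dtype=np.uint8)
    rgb[pred & gt] = (255, 255, 255)      # true positives
    rgb[pred & ~gt] = (255, 0, 0)         # false positives
    rgb[~pred & gt] = (0, 255, 0)         # false negatives
    return rgb                            # true negatives stay black

vis = error_map(np.random.rand(256, 256) > 0.5, np.random.rand(256, 256) > 0.5)
print(vis.shape)  # (256, 256, 3)
```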
Figure 7. Visualization results of different methods on the SYSU-CD dataset. (a–f) Prediction results of all the compared methods for different samples. White represents a TP, black is a TN, red indicates an FP, and green stands for an FN.
Figure 8. Visualization results of different methods on the GVLM-CD dataset. (a–f) Prediction results of all the compared methods for different samples. White represents a TP, black is a TN, red indicates an FP, and green stands for an FN.
Figure 9. Visualization results of ablation experiments on the LEVIR-CD dataset. (a–f) Prediction results of all the compared methods for different samples. White represents a TP, black is a TN, red indicates an FP, and green stands for an FN.
Figure 10. Visualization results of ablation experiments on the SYSU-CD dataset. (a–f) Prediction results of all the compared methods for different samples. White represents a TP, black is a TN, red indicates an FP, and green stands for an FN.
Figure 11. Visualization results of ablation experiments on the GVLM-CD dataset. (a–f) Prediction results of all the compared methods for different samples. White represents a TP, black is a TN, red indicates an FP, and green stands for an FN.
Figure 12. Example of network visualization. Red denotes higher attention values and blue denotes lower values. (A) Selected shallow feature maps A(#), (B) bi-temporal feature maps B(#) after the MBFA module, (C) feature maps C(#) with change information after the SDPM module, and (D) change feature maps D(#) after the MLFI module.
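Heatmaps such as those in Figure 12 are typically obtained by reducing a feature map over its channel dimension, normalizing it, and rendering it with a red-high/blue-low colormap. The helper below is a generic sketch of that procedure, not the authors' visualization code.

```python
import torch
import matplotlib.pyplot as plt

def show_feature_heatmap(feat, path="heatmap.png"):
    """Average a (C, H, W) feature map over channels and render it with a red-high/blue-low
    colormap, similar in spirit to the visualizations in Figure 12 (illustrative only)."""
    heat = feat.mean(dim=0)                                   # channel-wise mean activation
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    plt.imshow(heat.detach().cpu().numpy(), cmap="jet")       # jet: red = high, blue = low
    plt.axis("off")
    plt.savefig(path, bbox_inches="tight")
    plt.close()

show_feature_heatmap(torch.rand(64, 64, 64))
```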
Figure 13. Training and validation curves of GLCL-Net on (a) LEVIR-CD, (b) SYSU-CD, and (c) GVLM-CD.
Table 1. Information on the three datasets.
Dataset | Spatial Resolution | Size/Image | Train | Validation | Test
LEVIR-CD | 0.5 m/pixel | 256 × 256 | 7120 | 1024 | 2048
SYSU-CD | 0.5 m/pixel | 256 × 256 | 12,000 | 4000 | 4000
GVLM-CD | 0.59 m/pixel | 256 × 256 | 13,529 | 3866 | 1932
Table 2. Performance of various algorithms and GLCL-Net on the LEVIR-CD dataset. The highest score is marked in bold. All scores are expressed in percentage (%).
Method | Pre (%) | Rec (%) | F1 (%) | IoU (%)
FC-FE (2018) [36] | 86.91 | 80.17 | 83.40 | 71.51
FC-Siam-conc (2018) [36] | 91.10 | 76.77 | 83.69 | 71.96
FC-Siam-diff (2018) [36] | 89.53 | 83.31 | 86.31 | 75.92
DTCDSCN (2020) [43] | 83.60 | 88.64 | 86.05 | 75.51
SNUNet (2021) [40] | 90.60 | 88.60 | 89.49 | 80.87
BIT (2022) [52] | 89.24 | 89.37 | 89.31 | 80.68
ChangeFormer (2022) [53] | 90.44 | 86.71 | 88.53 | 79.42
AMTNet (2023) [56] | 91.19 | 88.97 | 90.07 | 81.93
HANet (2023) [64] | 90.04 | 87.91 | 88.96 | 80.12
S2CD (2024) [65] | 91.85 | 88.82 | 90.31 | 82.34
Ours | 91.83 | 89.73 | 90.77 | 83.10
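The four scores reported in Tables 2–5 follow the standard definitions computed from pixel-level true positives (TP), false positives (FP), and false negatives (FN). The snippet below applies those textbook formulas and is included only for reference; it is not code from the paper.

```python
def change_detection_scores(tp, fp, fn):
    """Precision, recall, F1, and IoU from pixel-level confusion counts (standard definitions)."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    iou = tp / (tp + fp + fn)
    return {name: round(100 * value, 2) for name, value in
            {"Pre": pre, "Rec": rec, "F1": f1, "IoU": iou}.items()}

# e.g., 9000 TP, 800 FP, and 1000 FN pixels -> Pre 91.84, Rec 90.0, F1 90.91, IoU 83.33
print(change_detection_scores(9000, 800, 1000))
```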
Table 3. Performance of various algorithms and GLCL-Net on the SYSU-CD dataset. The highest score is marked in bold. All scores are expressed in percentage (%).
Method | Pre (%) | Rec (%) | F1 (%) | IoU (%)
FC-FE (2018) [36] | 75.97 | 70.80 | 73.29 | 57.85
FC-Siam-conc (2018) [36] | 76.41 | 76.17 | 76.29 | 61.67
FC-Siam-diff (2018) [36] | 78.05 | 55.29 | 67.92 | 51.43
DTCDSCN (2020) [43] | 81.41 | 71.69 | 76.24 | 61.60
SNUNet (2021) [40] | 83.43 | 74.37 | 78.22 | 64.23
BIT (2022) [52] | 79.63 | 72.10 | 75.68 | 60.87
ChangeFormer (2022) [53] | 82.17 | 74.11 | 77.93 | 63.84
AMTNet (2023) [56] | 82.05 | 75.04 | 78.39 | 64.46
HANet (2023) [64] | 80.55 | 72.60 | 76.37 | 61.77
S2CD (2024) [65] | 83.85 | 73.20 | 78.59 | 64.73
Ours | 84.00 | 75.10 | 79.30 | 65.70
Table 4. Performance of various algorithms and GLCL-Net on the GVLM-CD dataset. The highest score is marked in bold. All scores are expressed in percentage (%).
Method | Pre (%) | Rec (%) | F1 (%) | IoU (%)
FC-FE (2018) [36] | 86.35 | 84.24 | 85.28 | 74.34
FC-Siam-conc (2018) [36] | 83.01 | 84.84 | 83.91 | 72.29
FC-Siam-diff (2018) [36] | 84.94 | 84.90 | 84.92 | 73.79
DTCDSCN (2020) [43] | 88.31 | 86.34 | 87.32 | 77.49
SNUNet (2021) [40] | 78.00 | 83.44 | 79.71 | 74.86
BIT (2022) [52] | 89.19 | 88.20 | 88.70 | 79.69
ChangeFormer (2022) [53] | 89.85 | 87.50 | 88.66 | 79.63
AMTNet (2023) [56] | 86.64 | 91.85 | 89.17 | 80.45
HANet (2023) [64] | 87.96 | 89.17 | 88.56 | 79.47
S2CD (2024) [65] | 88.63 | 91.54 | 89.06 | 79.92
Ours | 89.59 | 89.08 | 89.34 | 80.73
Table 5. Ablation experiment results on LEVIR-CD, SYSU-CD, and GVLM-CD. All values are reported as percentages (%).
Dataset | Method | Pre (%) | Rec (%) | F1 (%) | IoU (%)
LEVIR-CD | w/o MBFE | 89.60 | 86.11 | 87.82 | 78.29
LEVIR-CD | w/o MBFA | 90.79 | 86.89 | 88.80 | 79.86
LEVIR-CD | w/o SDPM | 90.58 | 87.12 | 88.82 | 79.89
LEVIR-CD | w/o MLFI | 90.91 | 86.41 | 88.60 | 79.53
LEVIR-CD | GLCL-Net | 91.77 | 89.78 | 90.77 | 83.09
SYSU-CD | w/o MBFE | 83.39 | 73.13 | 77.92 | 63.83
SYSU-CD | w/o MBFA | 80.02 | 74.18 | 78.05 | 64.00
SYSU-CD | w/o SDPM | 81.93 | 74.67 | 78.13 | 64.12
SYSU-CD | w/o MLFI | 81.71 | 73.54 | 77.41 | 63.14
SYSU-CD | GLCL-Net | 84.00 | 75.10 | 79.30 | 65.70
GVLM-CD | w/o MBFE | 86.20 | 88.16 | 87.17 | 77.26
GVLM-CD | w/o MBFA | 87.92 | 87.45 | 87.68 | 78.07
GVLM-CD | w/o SDPM | 89.36 | 88.71 | 89.04 | 80.24
GVLM-CD | w/o MLFI | 88.16 | 87.52 | 87.84 | 78.32
GVLM-CD | GLCL-Net | 89.59 | 89.08 | 89.34 | 80.73
Table 6. Ablation study on different parameters.
Dataset | Layers | Branches | Pre (%) | Rec (%) | F1 (%) | IoU (%)
LEVIR-CD | 3:6:3 | 2:3:3 | 91.13 | 88.72 | 89.91 | 81.67
LEVIR-CD | 3:6:3 | 3:3:3 | 91.77 | 89.77 | 90.76 | 83.09
LEVIR-CD | 3:8:3 | 2:3:3 | 91.72 | 89.18 | 90.43 | 82.54
LEVIR-CD | 3:8:3 | 3:3:3 | 91.83 | 89.73 | 90.77 | 83.10
SYSU-CD | 3:6:3 | 2:3:3 | 82.85 | 74.24 | 78.31 | 64.35
SYSU-CD | 3:6:3 | 3:3:3 | 83.83 | 74.50 | 78.89 | 65.14
SYSU-CD | 3:8:3 | 2:3:3 | 82.95 | 75.68 | 79.21 | 65.12
SYSU-CD | 3:8:3 | 3:3:3 | 84.00 | 75.10 | 79.30 | 65.70
GVLM-CD | 3:6:3 | 2:3:3 | 88.14 | 89.17 | 88.65 | 79.63
GVLM-CD | 3:6:3 | 3:3:3 | 88.41 | 89.16 | 88.78 | 79.83
GVLM-CD | 3:8:3 | 2:3:3 | 89.22 | 88.12 | 89.16 | 80.44
GVLM-CD | 3:8:3 | 3:3:3 | 89.59 | 89.08 | 89.34 | 80.73
Table 7. Model efficiency comparison. We report the number of model parameters, FLOPs, and average training time on 100 training images.
Model | Params (M) | FLOPs (G) | Training Time (s)
DTCDSCN | 41.07 | 13.15 | 0.022
SNUNet | 12.04 | 4.70 | 0.040
BIT | 12.40 | 10.57 | 0.022
ChangeFormer | 41.02 | 20.32 | 0.033
AMTNet | 25.40 | 22.15 | 0.036
HANet | 30.29 | 20.89 | 0.013
S2CD | 12.16 | 11.32 | 0.032
Ours | 58.79 | 18.21 | 0.094
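The parameter counts in Table 7 can be reproduced by summing tensor sizes over a model's parameters, whereas the FLOP counts require an operator-level profiler and the reported training time depends on the authors' training setup. The sketch below therefore only illustrates generic parameter counting and forward-pass timing for a bitemporal model; profile_model is a hypothetical helper, not the measurement code used in the paper.

```python
import time
import torch

def profile_model(model, input_shape=(1, 3, 256, 256), device="cpu"):
    """Rough parameter count (in millions) and average forward time for a bitemporal CD model."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x1 = torch.rand(*input_shape, device=device)
    x2 = torch.rand(*input_shape, device=device)
    with torch.no_grad():
        start = time.time()
        for _ in range(10):
            model(x1, x2)                     # one bitemporal forward pass
        elapsed = (time.time() - start) / 10  # average seconds per forward pass
    return params_m, elapsed

# Example with the GLCLNetSketch stub defined earlier (not the real GLCL-Net):
# params, sec = profile_model(GLCLNetSketch()); print(f"{params:.2f} M params, {sec:.3f} s/forward")
```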