1. Introduction
With the advancement of remote sensing technology, the spatial resolution of remote sensing images continues to improve, making change detection (CD) a popular research field. Change detection aims to monitor surface changes in the same area using paired temporal remote sensing images. It plays a crucial role in various fields such as urban planning [1,2,3], agricultural management [4], environmental monitoring [5,6,7,8], and disaster response [9,10]. Optical high-resolution remote sensing imagery provides rich contextual information and detailed geometric structure information, and has become one of the most extensively applied and researched data sources in the field of change detection. Traditional change detection methods mainly include transformation-based [11,12,13], classification-based [14,15], and object-based [16,17] approaches. However, these methods, relying only on shallow image features, struggle to effectively handle complex surface change areas.
With the development of the convolutional neural network (CNN), many new solutions have been introduced to the field of change detection [18,19,20,21]. CNN-based methods for change detection typically take a pair of pre-change and post-change images as input and output a segmentation mask that locates the regions where changes have occurred. Despite showing promising results across multiple datasets, the CNN lacks the capability to effectively extract global features, which makes it difficult to handle change detection tasks in complex scenes.
The emergence of the vision Transformer [22,23,24,25] provides an effective approach to address the above-mentioned issues. The Transformer architecture captures global contextual information and effectively models spatial dependencies through self-attention mechanisms, achieving impressive results in the field of change detection [26,27,28,29,30,31]. However, applying the Transformer structure to image processing introduces quadratic complexity, significantly increasing computational costs. Although some methods improve attention efficiency by limiting the size or stride of the computation window, such as the Swin Transformer [24], they do so at the expense of a reduced receptive field.
Recently, Mamba [32] has introduced time-varying parameters into state-space models (SSMs) and achieved significant success in the field of natural language processing. The Mamba architecture features linear complexity, effectively addressing the computational cost issue of the Transformer, and is considered a viable alternative to it. Inspired by this success, the Mamba architecture has been extended to visual image processing with promising results [33,34,35,36,37,38,39,40]. In the domain of remote sensing change detection, several Mamba-based methods [41,42,43] have emerged, showing noticeable performance improvements compared with other change detection approaches.
Despite the strong performance of the latest Mamba-based change detection methods on classic datasets, their results on challenging detection cases and high-difficulty datasets are not satisfactory. This is because these methods lack optimization for difficult cases. In this paper, we define change detection cases with irregular change boundary shapes or irregular continuous change areas as difficult cases. These difficult cases present significant challenges for change detection tasks.
Classic datasets (such as WHU-CD and LEVIR-CD) consist of urban building data, where the change areas are mostly simple in shape and neatly arranged, including residential and warehouse districts. The few difficult cases present do not significantly affect the overall performance metrics of change detection models. In contrast, high-difficulty datasets (such as SYSU-CD) contain various irregular change areas, including vegetation changes, road expansions, and marine constructions. The larger proportion of difficult cases in these datasets leads to poor performance of existing Mamba-based methods.
Figure 1 illustrates two challenging cases of change detection in dual-time remote sensing, along with the detection results from two state-of-the-art Mamba-based methods, highlighted with red bounding boxes on key areas. It can be observed that these challenging cases exhibit irregular boundary shapes or small continuous change areas, posing significant challenges for the change detection task. Due to the cross-scan mechanism of the Mamba architecture, both Mamba-based methods effectively capture the overall features of the changed areas. However, they perform poorly in detecting complex edge features and continuous change areas. Thus, developing effective solutions for difficult case detection is crucial for enhancing the robustness of remote sensing change detection models and for future research endeavors.
In this paper, inspired by VMamba [34], we propose a Mamba-based remote sensing change detection network for difficult cases (DC-Mamba), which effectively enhances the model's detection capability in complex change scenarios. Specifically, DC-Mamba consists of an edge-feature enhancement (EFE) block, a dual-flow state-space (DFSS) encoder, a DFSS decoder, and a dynamic upsampler (DySample). Unlike existing Mamba-based methods, we do not directly feed the original or downsampled images into the encoder. Instead, we first process them through the EFE block to enhance the shallow edge features of the paired temporal images. Additionally, we design the DFSS encoder and DFSS decoder to better extract both global and local image features. To address the needs of challenging case detection, we also introduce a dynamic loss function.
In summary, the main contributions of this paper are as follows:
- (1)
We propose a novel remote sensing change detection network (DC-Mamba), which enhances edge features through the edge-feature enhancement (EFE) block before images enter the encoder, and utilizes the dual-flow state-space (DFSS) block to integrate global and local features, thereby improving the detection of small local change areas and complex change edges.
- (2)
We introduce a dynamic loss function for DC-Mamba, dynamically adjusting the weights of two components of the loss function to address sample imbalance issues and increase focus on challenging samples during training.
- (3)
Extensive experiments on three datasets, WHU-CD, LEVIR-CD+, and SYSU, demonstrate that our proposed DC-Mamba achieves the best overall performance and significantly improves performance in difficult case detection.
The rest of the paper is organized as follows.
Section 2 reviews related work.
Section 3 details our proposed method.
Section 4 presents experimental results, and Section 5 concludes the paper.
2. Related Works
2.1. CNN-Based Method
The rapid development of the CNN and its excellent performance in feature extraction have led to its widespread application in change detection tasks. FC-EF [18], FC-Siam-Conc [18], and FC-Siam-Diff [18] are among the earliest CNN-based methods. FC-EF is a CD model built upon the U-net framework that concatenates the pre-change and post-change images along the channel dimension and processes them as a single input. FC-Siam-Conc instead adopts a weight-shared Siamese encoder to capture multi-level features from each image and fuses bitemporal information by concatenating the corresponding features, while FC-Siam-Diff is a variant that fuses bitemporal information through feature subtraction.
With the introduction of backbone networks like VGG [44], ResNet [45], and DenseNet [46], CNN-based methods have seen further development, and several representative network architectures have been proposed, such as SNUNet [21], IFNet [19], and HANet [47]. Fang et al. designed SNUNet [21], a Siamese architecture with denser connections to promote interaction among shallow features. SNUNet utilizes a weight-shared NestedUNet to capture multi-level features, incorporating channel attention mechanisms at different stages of the decoding process; deep supervision is also employed to improve the training of the intermediate layers. IFNet [19] employs a weight-shared VGG-16 [44] to extract multi-level features and integrates bitemporal information through concatenation. It applies both spatial and channel attention mechanisms at each stage of the decoder, and deep supervision is implemented by calculating a supervised loss at each decoder level. Han et al. designed HANet [47], a discriminative Siamese network that introduces an attention mechanism and a new sampling method to effectively integrate global information, fusing multi-scale features and refining detailed features. Lei et al. [48] proposed a differential enhancement network that reduces the impact of irrelevant factors on detection results by learning differential representations of foreground and background.
Despite the maturity and effectiveness of the aforementioned CNN-based methods in change detection tasks, the CNN is limited by the size of the receptive field, making it challenging to capture global features. This limitation affects their performance in change detection tasks where change objects are sparse.
2.2. Transformer-Based Method
In recent years, Transformer has effectively overcome the limitations of CNN-based methods with its superior ability in long-range modeling, prompting many researchers to adopt Transformer architectures for change detection tasks.
BIT [28] uses a weight-shared ResNet18 to extract bitemporal features and employs a semantic tokenizer to compress these features into a small set of semantic tokens. These tokens are then concatenated and fed into a Transformer encoder/decoder to capture spatial–temporal relationships. ChangeFormer [27] employs a weight-shared SegFormer-B1 to extract multi-level features. At each level, the features are processed through differencing and convolution operations to aid bitemporal feature fusion, and the resulting differential features from the various levels are concatenated and fed into a decoder of fully connected layers to perform change detection. SwinSUNet [26] leverages a weight-shared Swin Transformer [24] to extract multi-level features. Features from the final layer are concatenated to combine bitemporal information before being processed by the decoder, and a U-net-like connection merges the extracted multi-level features with those at corresponding decoder layers via concatenation, further enhanced by channel attention mechanisms. Li et al. [49] introduced TransUNetCD, a hybrid model blending the Transformer's global context modeling capability with CNN blocks and the UNet architecture, eliminating UNet's redundant information and enhancing feature quality through differential enhancement modules.
Despite achieving strong performance in change detection, the aforementioned Transformer-based methods face challenges due to their quadratic complexity in image processing, significantly increasing computational costs. This drawback is particularly disadvantageous for dense prediction tasks like change detection.
2.3. Mamba-Based Method
Mamba, based on state-space models (SSMs), introduces the cross-scan mechanism and hardware-aware algorithms, enabling the parameterization of SSMs based on input sequences. Compared with Transformer, Mamba exhibits linear computational complexity relative to the length of input sequences, showing great potential in long sequence modeling tasks. Works like Vision Mamba and VMamba expand Mamba’s capability in handling visual data through bidirectional and multidirectional scanning methods.
Notably, Mamba has been widely applied in remote sensing change detection tasks. Chen et al. [41] first explored the potential of the Mamba architecture in remote sensing change detection, achieving significant results in binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). Zhao et al. [42] proposed RS-Mamba for dense prediction tasks in high-resolution remote sensing images, introducing a diagonal scan mechanism based on the 2D cross-scan. Zhang et al. [43] introduced the CDMamba model, which effectively integrates global and local features through SRCM blocks, enhancing the ability of Mamba-based methods to detect detailed changes in dense prediction tasks.
Despite the promising results of the aforementioned Mamba-based methods, they often directly input original or downsampled images into the encoder, potentially overlooking some shallow edge detail features. This is because after the original image enters the Mamba encoder, the resolution significantly decreases as the number of channels increases, causing the Mamba network to focus more on deep features, such as semantic information and object categories. However, the neglected shallow features contain important information like edges and textures, which are also vital for change detection tasks. Additionally, they lack optimization for detecting complex change areas, leading to noticeable performance drops in difficult case detection.
In this paper, we propose some simple but effective modules and components based on the Mamba architecture to enhance the model’s capability in handling detailed features and improve accuracy in difficult case detection.
3. Methods
3.1. Preliminaries: SSM
In deep-learning research, state-space models (SSMs) [50,51,52] provide a framework for describing and understanding the generation process of time-series data through state-space modeling, the linear time-invariant (LTI) assumption, and mapping functions.
An SSM utilizes a hidden state $h(t) \in \mathbb{R}^N$ as an intermediary variable to map an input one-dimensional function or sequence $x(t) \in \mathbb{R}$ to the output $y(t) \in \mathbb{R}$. The SSM can be represented by a set of linear ordinary differential equations (ODEs):

$$h'(t) = A h(t) + B x(t), \qquad (1)$$
$$y(t) = C h(t), \qquad (2)$$

where $h'(t)$ denotes the time derivative of $h(t)$, $A \in \mathbb{R}^{N \times N}$ represents the evolution parameter, and $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{1 \times N}$ are projection parameters.
To integrate continuous systems into deep-learning algorithms, the SSM uses a zero-order hold (ZOH) to convert the continuous parameters $A$ and $B$ into their discrete counterparts $\bar{A}$ and $\bar{B}$. This discretization is achieved by introducing a time-scale parameter $\Delta$, defined as follows:

$$\bar{A} = \exp(\Delta A), \qquad (3)$$
$$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B. \qquad (4)$$
The discretized Formulas (1) and (2) can then be expressed as

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad (5)$$
$$y_t = C h_t. \qquad (6)$$

Thus, by unrolling the recurrence into a structured kernel $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L-1}\bar{B})$, the output can be directly computed through full convolution as $y = x * \bar{K}$.
However, the parameters of the SSM based on the LTI assumption impose a constraint where the input–output relationship does not vary with time. To overcome this limitation, the recently proposed Mamba architecture builds upon the SSM, allowing the parameters $B$, $C$, and $\Delta$ to vary with the input sequence. This improvement boosts the model's capability for selective information processing across sequences, dynamically adjusting the learned context. Additionally, the Mamba architecture incorporates hardware-aware algorithms to optimize the GPU memory layout, effectively reducing memory and computational overhead during model training.
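To make these preliminaries concrete, the following is a minimal NumPy sketch of the ZOH discretization and the discrete recurrence in Formulas (3)–(6), assuming (as in S4 and Mamba) a diagonal evolution parameter $A$; the shapes and values are illustrative only, not the parameterization used in DC-Mamba.

```python
import numpy as np

# Minimal sketch of ZOH discretization and the discrete SSM recurrence.
# A is taken to be diagonal (as in S4/Mamba), so exp(dA) is elementwise.
N, L = 16, 64                      # hidden-state size, sequence length
A = -np.abs(np.random.randn(N))    # diagonal evolution parameter (A < 0 for stability)
B = np.random.randn(N)             # input projection
C = np.random.randn(N)             # output projection
delta = 0.1                        # time-scale parameter (Delta)

# Zero-order hold: A_bar = exp(dA); B_bar = (dA)^{-1}(exp(dA) - I) * dB,
# which for diagonal A simplifies to (exp(dA) - 1) / A * B.
A_bar = np.exp(delta * A)
B_bar = (A_bar - 1.0) / A * B

# Discrete recurrence (Formulas (5) and (6)): h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = <C, h_t>.
x = np.random.randn(L)
h = np.zeros(N)
y = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y[t] = C @ h
```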
3.2. Overall Architecture
Figure 2 and Figure 3 illustrate the overall architecture of the proposed DC-Mamba and its constituent modules. DC-Mamba is composed of an edge-feature enhancement (EFE) block, a dual-flow state-space (DFSS) encoder, a DFSS decoder, and a DySample module. As shown in Figure 2, the input consists of two temporal remote sensing images $T_1$ and $T_2$, where $T_1$ and $T_2$ are 3-channel RGB images of size 256 × 256.

Initially, $T_1$ and $T_2$ are separately fed into the EFE block to enhance their shallow edge features. The EFE block, illustrated in Figure 3a, extracts image features with a 3 × 3 convolution using stride = 1, applies BatchNorm for data normalization to prevent data anomalies and ReLU activation to introduce non-linear transformations, and uses residual connections to retain the original image features. After EFE processing, the edge features of $T_1$ and $T_2$ are enhanced while their size and number of channels are maintained, and they serve as inputs to the encoder.
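As a reference, the EFE block described above admits a very small PyTorch sketch; the ordering of the residual addition relative to the activation is our assumption, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EFEBlock(nn.Module):
    """Edge-feature enhancement sketch: 3x3 conv (stride 1), BatchNorm, ReLU,
    plus a residual connection that retains the original image features.
    Input and output keep the same spatial size and channel count."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.relu(self.bn(self.conv(x)))  # residual keeps shallow features

# The same block processes both temporal images before the encoder.
efe = EFEBlock()
t1, t2 = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
e1, e2 = efe(t1), efe(t2)  # shapes unchanged: (1, 3, 256, 256)
```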
The encoder comprises four stages, each designed to extract dual-temporal image features at a different scale, as depicted in Figure 3b. Each stage of the encoder includes a down-sampling operation and DFSS blocks, halving the height and width of the dual-temporal image features and doubling the number of channels.
Similarly, the decoder consists of four stages, as shown in Figure 3c. The input to the first stage of the decoder is a fusion of the dual-temporal image features output by the encoder. The subsequent three stages take inputs composed of two parts: the output of the previous decoder stage and a fusion of the dual-temporal image features from the corresponding encoder stage. This approach facilitates effective interaction of temporal image feature information. Each stage of the decoder first processes its input through a DFSS block, followed by 3 × 3 convolution and BatchNorm layers with residual connections, and gradually restores the image size through linear layers and upsample operations.

The final output of the decoder connects to the DySample module, a lightweight and effective upsampler suitable for dense prediction tasks like dual-temporal remote sensing image change detection. After upsampling, the size of the change detection image is restored to 256 × 256, yielding the binary change map.
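Putting the pieces together, the end-to-end data flow can be summarized by the following sketch; the function and module names are placeholders for illustration, not the authors' actual API.

```python
# Hypothetical composition of the DC-Mamba pipeline described above.
def dc_mamba_forward(t1, t2, efe, encoder, decoder, dysample):
    e1, e2 = efe(t1), efe(t2)        # enhance shallow edge features (EFE block)
    feats1 = encoder(e1)             # 4-stage DFSS encoder, multi-scale features
    feats2 = encoder(e2)             # weight-shared for both temporal images
    fused = decoder(feats1, feats2)  # stage-wise fusion + DFSS blocks
    change_map = dysample(fused)     # dynamic upsampling back to 256 x 256
    return change_map                # binary change-detection result
```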
3.3. Dual-Flow State-Space Block
The Mamba architecture was initially developed for natural language data, where contextual causality exists. However, for non-causal data like remote sensing images, directly expanding the original Mamba along the spatial dimension lacks adequate modeling of spatial contextual information. Recently, VMamba was proposed for visual image tasks, introducing a two-dimensional cross-scan mechanism to address this issue. As shown in Figure 4, the cross-scan mechanism reorganizes image data along four different directions in the spatial dimension, enabling each position in the image to access contextual information from multiple directions. Subsequently, each sequence of image data is selectively processed through the SSM, and the feature information from the four sequences is merged for the final output.
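The following PyTorch sketch illustrates the four-direction reorganization; the exact direction ordering in VMamba's implementation may differ.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the four-direction cross-scan: flatten a feature map
    (B, C, H, W) into four 1D sequences so every position sees context
    from multiple directions, as in Figure 4."""
    B, C, H, W = x.shape
    rowwise = x.flatten(2)                          # left-to-right, top-to-bottom
    colwise = x.transpose(2, 3).flatten(2)          # top-to-bottom, left-to-right
    seqs = torch.stack([rowwise, colwise], dim=1)   # (B, 2, C, H*W)
    return torch.cat([seqs, seqs.flip(-1)], dim=1)  # add the two reversed scans -> (B, 4, C, H*W)

# After each sequence is processed by the selective SSM, the four outputs
# are mapped back to the (H, W) grid by inverting these scans and merged.
```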
However, this two-dimensional cross-scan mechanism primarily enhances the model's ability to extract global features, often neglecting the importance of edge-local features. In tasks like dual-temporal image change detection, this can significantly impact accuracy. Therefore, we propose the dual-flow state-space (DFSS) block, depicted in Figure 3b. The input feature map undergoes downsampling before entering the DFSS block, where it is split into two information flows. The main information flow sequentially passes through LayerNorm [53], Linear, and 3 × 3 convolution layers into the core VSSM module, which integrates the Mamba architecture with the two-dimensional cross-scan mechanism. The VSSM module rearranges the image along four spatial directions, allowing every part of the image to acquire spatial contextual information from different directions, and produces the VSSM output. This output is processed through LayerNorm and then merged, via element-wise multiplication, with feature information activated by SiLU, yielding the output of the main information flow. The second information flow integrates the input features into the main information flow through residual connections, preserving edge-local features and producing the final output of the DFSS block.
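A rough PyTorch sketch of the two information flows is given below. The `VSSM` core (cross-scan plus selective SSM) is assumed given; the channels-last layout, depthwise convolution, and gate projection are our assumptions based on common Mamba-style blocks, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DFSSBlock(nn.Module):
    """Sketch of the dual-flow state-space block following the description
    in the text. `vssm` stands in for the cross-scan selective-SSM module."""
    def __init__(self, dim: int, vssm: nn.Module):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.linear_in = nn.Linear(dim, dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.vssm = vssm                 # cross-scan SSM core (assumed given)
        self.norm_out = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, dim)  # produces the SiLU-activated gate
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), channels-last as is common for Mamba-style blocks.
        main = self.linear_in(self.norm_in(x))
        main = self.conv(main.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        main = self.norm_out(self.vssm(main))
        main = main * self.act(self.gate(x))  # element-wise gating
        return x + main  # second flow: residual connection keeps edge-local features
```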
3.4. Dynamic Loss Function for DC-Mamba
For the dual-temporal image change detection task based on the Mamba architecture, we design a loss function that can be dynamically adjusted through hyperparameters. It consists of two weighted loss components. The first is the balanced cross-entropy (BCE) loss, commonly used for binary classification tasks, defined as

$$\mathcal{L}_{BCE} = -\alpha_t \log(p_t), \qquad (7)$$

$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = -1, \end{cases} \qquad (8)$$

where $y \in \{1, -1\}$ represents the positive and negative classes, and $p$ denotes the estimated probability of the model for the class $y = 1$. $\alpha_t$ is a hyperparameter introduced to address class imbalance issues, set to $\alpha$ for class 1 and $1 - \alpha$ for class −1.
However, in binary classification tasks, the default initialization results in equal probabilities for the two classes. In cases of class imbalance, this can lead to the dominance of the larger, simpler class during training. While BCE loss introduces $\alpha_t$ to balance positive and negative samples [54], it does not differentiate between easy and hard samples. Therefore, for dense prediction tasks with class imbalance like dual-temporal image change detection, we introduce Focal loss [55] as the second part of the loss function.
Focal loss replaces the hyperparameter $\alpha_t$ in BCE loss with a focal modulation factor $(1 - p_t)^{\gamma}$, where $\gamma$ is a tunable focusing parameter. Focal loss can be defined as

$$\mathcal{L}_{Focal} = -(1 - p_t)^{\gamma} \log(p_t). \qquad (9)$$

It can be observed that for easy samples where the model is confident in its predictions ($p_t$ approaching 1), the focal modulation factor tends toward 0; for example, with $\gamma = 2$ and $p_t = 0.9$, the factor is $0.1^2 = 0.01$, reducing that sample's contribution a hundredfold. This significantly reduces the weight of correctly classified easy samples, preventing them from dominating the training process. For difficult cases that are more likely to be misclassified, the model exhibits less confidence (resulting in smaller $p_t$), and the focal modulation factor approaches 1, thereby minimally affecting the weight of hard samples during training.
In conclusion, the designed loss function is expressed as

$$\mathcal{L} = \lambda_1 \mathcal{L}_{BCE} + \lambda_2 \mathcal{L}_{Focal}, \qquad (10)$$

where $\lambda_1$ and $\lambda_2$ represent the weighting parameters for BCE loss and Focal loss, respectively. Different dual-temporal image change detection datasets exhibit varying densities of change targets and significant differences in difficulty. Therefore, dynamically adjusting the weights of these two components of the loss function is expected to improve training results, as validated in our experiments.
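For illustration, the combined loss in Formula (10) can be sketched in PyTorch as follows; $\gamma$, $\lambda_1$, and $\lambda_2$ default to the values reported later in Section 4.2.2, while the value of $\alpha$ and the mean reduction are our assumptions.

```python
import torch

def dynamic_loss(logits, target, lam1=1.0, lam2=0.75, alpha=0.75, gamma=2.0):
    """Sketch of the dynamic loss L = lam1*L_BCE + lam2*L_Focal (Formula (10)).
    logits: raw change-map scores; target: binary ground-truth map (0/1).
    alpha is illustrative; the paper does not report its value here."""
    target = target.float()
    p = torch.sigmoid(logits)
    pt = p * target + (1 - p) * (1 - target)               # p_t as in Formula (8)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)  # class-balancing weight
    log_pt = torch.log(pt.clamp_min(1e-8))
    bce = -alpha_t * log_pt                      # balanced cross-entropy term (7)
    focal = -((1 - pt) ** gamma) * log_pt        # focal term down-weights easy pixels (9)
    return (lam1 * bce + lam2 * focal).mean()
```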
4. Experiments and Analysis
4.1. Datasets
4.1.1. SYSU-CD
The SYSU-CD [56] dataset contains 20,000 pairs of 0.5 m/pixel resolution aerial images of Hong Kong taken between 2007 and 2014, each with dimensions of 256 × 256 pixels, divided into training, validation, and test sets in a ratio of 6:2:2. This dataset is distinguished by its focus on urban and coastal changes, featuring high-rise buildings and infrastructure developments, where change detection poses significant challenges due to shadow and deviation effects. It encompasses a range of change scenarios such as urban construction, suburban expansion, groundwork, vegetation changes, road expansion, and sea construction.
4.1.2. LEVIR-CD+
The LEVIR-CD+ [3] dataset is an advanced version of LEVIR-CD, containing 985 pairs of very high-resolution images at 0.5 m/pixel, each with dimensions of 1024 × 1024 pixels. Spanning a time interval of 5 to 14 years, these multi-temporal images document significant building construction changes. It also encompasses a wide array of building types, including urban residential areas, small-scale garages, and large warehouses, with a focus on both the emergence of new buildings and the decline of existing structures. LEVIR-CD+ is a valuable benchmark for evaluating CD methodologies. We cut each image into patches of size 256 × 256 and divided the images into training and test sets in a ratio of 7:3.
4.1.3. WHU-CD
The WHU-CD [57] dataset, a subset of the larger WHU Building dataset, is tailored for the building CD task. It consists of a pair of 32,507 × 15,354 spatial remote sensing images of New Zealand with a resolution of 0.2 m/pixel, taken in April 2012 and April 2016 and covering an area of 20.5 square kilometers. We cut the WHU-CD dataset into patches of size 256 × 256, resulting in a total of 7620 images, divided into training, validation, and test sets in a ratio of 7:1.5:1.5.
4.2. Experimental Setup
4.2.1. Implementation Details
In the proposed DC-Mamba structure, during the edge-feature enhancement (EFE) stage, a 3 × 3 convolution with stride = 1 and 3 output channels is used to extract and enhance the edge features of the original image. The encoder consists of four stages with layer configurations {2, 2, 4, 2}, where feature map sizes are reduced to {1/2, 1/4, 1/8, 1/16} of the original image size and feature channel numbers are set to {24, 48, 96, 192}. Linear interpolation is employed for downsampling within the encoder stages. In the four stages of the decoder, the feature map sizes and channel numbers mirror those of the encoder, and upsampling is performed using the DySample method.
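For reference, the stage configuration above can be summarized as a small dictionary; the key names are placeholders, not the authors' actual configuration schema.

```python
# Illustrative configuration matching the numbers reported above.
dc_mamba_config = {
    "efe": {"kernel": 3, "stride": 1, "out_channels": 3},
    "encoder": {
        "depths": [2, 2, 4, 2],                   # DFSS blocks per stage
        "channels": [24, 48, 96, 192],            # feature channels per stage
        "scales": ["1/2", "1/4", "1/8", "1/16"],  # feature size vs. input
        "downsample": "linear_interpolation",
    },
    "decoder": {                                  # mirrors the encoder stages
        "channels": [192, 96, 48, 24],
        "upsample": "dysample",
    },
}
```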
4.2.2. Training Details
We construct and deploy DC-Mamba using the PyTorch framework on a single NVIDIA RTX 3080Ti. We employ straightforward data augmentation techniques, avoiding any complex methods. During training, we utilize the AdamW optimizer [58] with a learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻³. The batch size is set to 4, and the training iterations are configured to 60,000.
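A minimal training-loop sketch with these settings follows; `model`, `train_loader`, and the reuse of the `dynamic_loss` sketch from Section 3.4 are assumptions for illustration.

```python
from itertools import cycle
import torch

# AdamW with lr 1e-4 and weight decay 1e-3; batch size 4, 60,000 iterations.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)
for it, (t1, t2, label) in zip(range(60_000), cycle(train_loader)):
    pred = model(t1, t2)              # predicted change-map logits
    loss = dynamic_loss(pred, label)  # combined BCE + Focal loss (Formula (10))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```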
To determine the most suitable Mamba backbone network, we test the performance of DC-Mamba with three different backbones: VMamba-Tiny, VMamba-Small, and VMamba-Base. Table 1 shows the experimental results. VMamba-Small exhibits a significant improvement over VMamba-Tiny, whereas VMamba-Base shows slightly lower Recall, F1, and OA scores due to overfitting caused by its increased parameter count. Therefore, in the subsequent experiments in this paper, we adopt VMamba-Small as the backbone network.
In the dynamic loss function, the focusing parameter $\gamma$ is set to 2, the weight $\lambda_1$ to 1, and $\lambda_2$ to 0.75 (details are described in the ablation studies). According to Formulas (7), (9), and (10), the loss function can be expressed as

$$\mathcal{L} = -\lambda_1 \alpha_t \log(p_t) - \lambda_2 (1 - p_t)^{\gamma} \log(p_t).$$
4.2.3. Evaluation Metrics
To evaluate the performance of our model, we employ a set of metrics, including precision (Pre), recall (Rec), F1 score (F1), overall accuracy (OA), and intersection over union (IoU). The precision reflects the proportion of true positive pixels among all pixels identified as positive. The recall reflects the proportion of all true positive pixels correctly identified as positive. The overall accuracy expresses the ratio of correctly predicted pixels to total pixels. The F1 score balances precision and recall by calculating their harmonic mean. The IoU measures the overlap between the predicted and ground-truth positive regions. The five metrics are defined as follows:

$$Pre = \frac{TP}{TP + FP}, \quad Rec = \frac{TP}{TP + FN}, \quad OA = \frac{TP + TN}{TP + TN + FP + FN},$$
$$F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}, \quad IoU = \frac{TP}{TP + FP + FN},$$

where TP, FP, TN, and FN represent the number of true positives, false positives, true negatives, and false negatives, respectively.
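The five metrics can be computed directly from the confusion counts, as in this small helper (a straightforward sketch of the definitions above):

```python
def cd_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Pre, Rec, F1, OA, and IoU from pixel-level confusion counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    return {"Pre": pre, "Rec": rec, "F1": f1, "OA": oa, "IoU": iou}
```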
4.3. Ablation Studies
In this section, we conduct ablation studies on the LEVIR-CD+, WHU-CD, and SYSU datasets to assess the impact of the various components and parameters of DC-Mamba. Our ablation study includes four different configurations, as shown in Table 2, and Figure 5 provides a visual comparison of the results, with key areas highlighted by red boxes. We use the original Mamba network with the dynamic loss function as the baseline and design ablation experiments on the EFE block and the DFSS block.
It can be seen that independently adding either the EFE block or the DFSS block significantly improves the model performance. Specifically, when the DFSS block is individually added, the F1 score improves by 0.55%, 1.27%, and 1.35% on the three datasets, respectively. This indicates that integrating global and local features is highly effective for change detection tasks. In the DFSS block, the two-dimensional cross-scan mechanism rearranges the image along four spatial dimensions, allowing each part of the image to obtain spatial context information from different directions, thereby enhancing the model’s ability to extract global features. Meanwhile, the second information flow integrates the input image into the main information flow through residual connections, preserving local features.
When the EFE block is individually added, the key metric F1 improves by 0.89%, 1.02%, and 1.68% on the three datasets, respectively. This demonstrates the importance of shallow features for change detection tasks. We believe that the proposed EFE block focuses on fine-grained edge and texture information in the image, effectively preserving the shallow features of the original images $T_1$ and $T_2$. To further validate this point, we visualize the feature heatmaps on different datasets and compare the decoder's output feature heatmaps, as shown in Figure 6.
Figure 6d illustrates the effective shallow feature extraction of the EFE block, which ensures that the edge features and detail information of the image are preserved after it enters the encoder. Comparing Figure 6e and Figure 6f, it is evident that the change areas output by the DC-Mamba decoder with the EFE block are more precise, with notably better performance at the edges of the changed regions. These observations qualitatively support our viewpoint.
Table 3 shows the impact of the hyperparameter settings of the dynamic loss function on model performance, also evaluated on the SYSU dataset. Here, $\lambda_1$ and $\lambda_2$ represent the weight parameters of BCE loss and Focal loss, respectively. The experimental results indicate that introducing the dynamic loss function significantly improves model performance compared with using BCE loss or Focal loss alone. Moreover, the model achieves optimal performance on the SYSU dataset when $\lambda_1 = 1$ and $\lambda_2 = 0.75$. This suggests that the dynamic loss function effectively improves training outcomes by reducing the weight of simple samples, thereby focusing more on difficult cases.
The ablation studies above demonstrate that our proposed components, whether used individually or in combination, are beneficial for enhancing the model’s performance in change detection tasks.
4.4. Comparative Experiment
We conduct comparative experiments using state-of-the-art (SOTA) models from three categories: CNN-based methods (SNUNet [21], IFNet [19], HANet [47], and FC-Siam-Conc [18]), Transformer-based methods (BIT [28], ChangeFormer [27], SwinSUNet [26], and TransUNetCD [49]), and Mamba-based methods (RSMamba [42], ChangeMamba [41], and CDMamba [43]). Our proposed DC-Mamba is tested against these models on the LEVIR-CD+, WHU-CD, and SYSU datasets. Table 4 presents the test results, highlighting the best-performing metrics in red and the second-best in blue.
It is evident that DC-Mamba, our proposed method, demonstrates significant performance advantages compared with both CNN-based and Transformer-based methods. Furthermore, when compared with the three latest Mamba-based methods, our approach exhibits superior overall performance. Specifically, on the LEVIR-CD+ dataset, although DC-Mamba shows lower Recall compared with RSMamba, it achieves the highest scores in other metrics, with the F1 score improving by 1.50%. On the WHU-CD dataset, DC-Mamba achieves the highest Recall, F1 score, and IoU and the second-highest OA, with the F1 score improving by 1.16%.
The SYSU dataset, characterized by more complex and numerous change regions, poses greater difficulty. On this dataset, our method achieves the highest scores across all metrics. Compared with the second-best results, DC-Mamba improves Recall, F1 score, IoU, and OA by 2.57%, 1.37%, 1.08%, and 0.35%, respectively. This highlights the effectiveness of our proposed EFE block, DFSS block, and dynamic loss function in enhancing model performance for detecting challenging cases and intricate edges.
To further validate the performance of DC-Mamba, qualitative results on the LEVIR-CD+, WHU-CD, and SYSU test sets are presented in Figure 7, Figure 8, and Figure 9, respectively, with key areas highlighted by red boxes. Each figure contrasts our method with selected CNN-based, Transformer-based, and Mamba-based methods.
Figure 7 and Figure 8 illustrate that because the change areas in the LEVIR-CD+ and WHU-CD datasets are mostly buildings with regular shapes, all methods can effectively reconstruct the shape of the change areas, but our DC-Mamba clearly performs better. Specifically, in Figure 8a, DC-Mamba detects smaller intermediate change areas missed by the other methods. In Figure 9d, DC-Mamba provides more detailed and accurate detections along change edges.
Figure 9 shows that the SYSU dataset contains irregular and non-structured changes, where Mamba-based methods clearly outperform CNN-based and Transformer-based methods. Compared with CDMamba, our method excels at capturing edge details and detecting challenging cases. Figure 9c demonstrates DC-Mamba's superior ability to reconstruct edge shapes in the left change area, while Figure 9d highlights DC-Mamba's capability to fully detect multiple continuous small change areas, compared with several missed detections by CDMamba.
To validate the efficiency of the proposed model, Table 5 compares the parameters and GFLOPs of different change detection models. Leveraging the Mamba architecture, our model exhibits linear growth in computational complexity within the encoder and the decoder, effectively reducing computational costs. Our model has 17.35 million parameters (Params) and requires 44.95 GFLOPs (billions of floating-point operations). Compared with Transformer-based methods, the Params of our model are significantly reduced, and the GFLOPs are lower than those of most Transformer-based and Mamba-based methods. While our model incurs slightly higher computational costs than CDMamba, it achieves significantly better performance.
5. Conclusions
In this paper, to address the lack of optimized solutions for complex change regions in existing change detection methods, we propose a new change detection model called DC-Mamba. Specifically, to avoid the loss of shallow edge detail features, we introduce the EFE block. Additionally, the DFSS block is designed to enhance the model’s capability in extracting global features while better preserving local features. On this basis, we propose the dynamic loss function to tackle sample imbalance issues, ensuring that complex change regions receive more attention during training. Ablation studies validate the effectiveness of each proposed module and component.
Comparative experiments on the LEVIR-CD+, WHU-CD, and SYSU datasets show that our proposed DC-Mamba outperforms other methods. Particularly, it demonstrates significant improvements in challenging instance detection and on difficult datasets compared with the latest Mamba-based methods.