1. Introduction
With the advancement of remote sensing technology, the spatial resolution of remote sensing images continues to improve, making change detection (CD) a popular research field. Change detection aims to monitor surface changes in the same area using paired temporal remote sensing images. It plays a crucial role in various fields such as urban planning [1,2,3], agricultural management [4], environmental monitoring [5,6,7,8], and disaster response [9,10]. Optical high-resolution remote sensing imagery provides rich contextual information and detailed geometric structure information, and has become one of the most extensively applied and researched data sources in the field of change detection. Traditional change detection methods mainly include transformation-based [11,12,13], classification-based [14,15], and object-based [16,17] approaches. However, these methods, relying only on shallow image features, struggle to effectively handle complex surface change areas.
With the development of the convolutional neural network (CNN), many new solutions have been introduced to the field of change detection [18,19,20,21]. CNN-based methods for change detection typically take a pair of pre-change and post-change images as input and output a segmentation mask that locates the regions where changes have occurred. Despite showing promising results across multiple datasets, the CNN lacks the capability to effectively extract global features, which makes it difficult to handle change detection tasks in complex scenes.
The emergence of the vision Transformer [22,23,24,25] provides an effective approach to address the above-mentioned issues. The Transformer architecture captures global contextual information and effectively models spatial dependencies through self-attention mechanisms, achieving impressive results in the field of change detection [26,27,28,29,30,31]. However, applying the Transformer structure to image processing introduces quadratic complexity, significantly increasing computational costs. Although some methods improve attention efficiency by limiting the size or stride of the computation window, such as the Swin Transformer [24], they do so at the expense of a reduced receptive field.
Recently, Mamba [32] has introduced time-varying parameters into state-space models (SSMs) and achieved significant success in the field of natural language processing. The Mamba architecture features linear complexity, effectively addressing the computational cost issue of the Transformer, and is considered a viable alternative to it. Inspired by this success, the Mamba architecture has been extended to visual image processing with promising results [33,34,35,36,37,38,39,40]. In the domain of remote sensing change detection, several Mamba-based methods [41,42,43] have emerged, showing noticeable performance improvements compared with other change detection approaches.
Despite the strong performance of the latest Mamba-based change detection methods on classic datasets, their results on challenging detection cases and high-difficulty datasets are not satisfactory. This is because these methods lack optimization for difficult cases. In this paper, we define change detection cases with irregular change boundary shapes or irregular continuous change areas as difficult cases. These difficult cases present significant challenges for change detection tasks.
Classic datasets (such as WHU-CD and LEVIR-CD) consist of urban building data, where the change areas are mostly simple in shape and neatly arranged, including residential and warehouse districts. The few difficult cases present do not significantly affect the overall performance metrics of change detection models. In contrast, high-difficulty datasets (such as SYSU-CD) contain various irregular change areas, including vegetation changes, road expansions, and marine constructions. The larger proportion of difficult cases in these datasets leads to poor performance of existing Mamba-based methods.
Figure 1 illustrates two challenging cases of change detection in dual-time remote sensing, along with the detection results from two state-of-the-art Mamba-based methods, highlighted with red bounding boxes on key areas. It can be observed that these challenging cases exhibit irregular boundary shapes or small continuous change areas, posing significant challenges for the change detection task. Due to the cross-scan mechanism of the Mamba architecture, both Mamba-based methods effectively capture the overall features of the changed areas. However, they perform poorly in detecting complex edge features and continuous change areas. Thus, developing effective solutions for difficult case detection is crucial for enhancing the robustness of remote sensing change detection models and for future research endeavors.
In this paper, inspired by VMamba [34], we propose a Mamba-based remote sensing change detection network for difficult cases (DC-Mamba), which effectively enhances the model's detection capability in complex change scenarios. Specifically, DC-Mamba consists of an edge-feature enhancement (EFE) block, a dual-flow state-space (DFSS) encoder, a DFSS decoder, and a dynamic upsampler (DySample). Unlike existing Mamba-based methods, we do not directly feed the original or downsampled images into the encoder. Instead, we first process them through the EFE block to enhance the shallow edge features of the paired temporal images. Additionally, we design the DFSS encoder and DFSS decoder to better extract both global and local image features. To address the needs of challenging case detection, we also introduce a dynamic loss function.
In summary, the main contributions of this paper are as follows:
- (1)
We propose a novel remote sensing change detection network (DC-Mamba), which enhances edge features through the edge-feature enhancement (EFE) block before images enter the encoder, and utilizes the dual-flow state-space (DFSS) block to integrate global and local features, thereby improving the detection of small local change areas and complex change edges.
- (2)
We introduce a dynamic loss function for DC-Mamba, dynamically adjusting the weights of two components of the loss function to address sample imbalance issues and increase focus on challenging samples during training.
- (3)
Extensive experiments on three datasets, WHU-CD, LEVIR-CD+, and SYSU, demonstrate that our proposed DC-Mamba achieves the best overall performance and significantly improves performance in difficult case detection.
The rest of the paper is organized as follows.
Section 2 reviews related work.
Section 3 details our proposed method.
Section 4 presents experimental results, and Section 5 concludes the paper.
2. Related Works
2.1. CNN-Based Method
The rapid development of the CNN and its excellent performance in feature extraction have led to its widespread application in change detection tasks. FC-EF [18], FC-Siam-Conc [18], and FC-Siam-Diff [18] are among the earliest CNN-based methods. FC-EF is a CD model built upon the U-net framework that concatenates the pre-change and post-change images along the channel dimension and processes them as a single input. FC-Siam-Conc instead adopts a weight-shared Siamese encoder to capture multi-level features from each image and fuses bitemporal information by concatenating the corresponding features, while FC-Siam-Diff is a variant that fuses bitemporal information through feature subtraction.
With the introduction of backbone networks like VGG [44], ResNet [45], and DenseNet [46], CNN-based methods have seen further development, and several representative network architectures have been proposed, such as SNUNet [21], IFNet [19], and HANet [47]. Fang et al. designed SNUNet [21], a Siamese architecture with denser connections to promote interaction among shallow features. SNUNet utilizes a weight-shared NestedUNet to capture multi-level features, incorporating channel attention mechanisms at different stages of the decoding process; deep supervision is also employed to improve the training of the intermediate layers. IFNet [19] employs a weight-shared VGG-16 [44] to extract multi-level features and integrates bitemporal information through concatenation. It applies both spatial and channel attention mechanisms at each stage of the decoder, and deep supervision is implemented by calculating a supervised loss at each decoder level. Han et al. designed HANet [47], a discriminative Siamese network that introduces an attention mechanism and a new sampling method to effectively integrate global information, fusing multi-scale features and refining detailed features. Lei et al. [48] proposed a differential enhancement network that reduces the impact of irrelevant factors on detection results by learning differential representations of foreground and background.
Despite the maturity and effectiveness of the aforementioned CNN-based methods in change detection tasks, the CNN is limited by the size of the receptive field, making it challenging to capture global features. This limitation affects their performance in change detection tasks where change objects are sparse.
2.2. Transformer-Based Method
In recent years, Transformer has effectively overcome the limitations of CNN-based methods with its superior ability in long-range modeling, prompting many researchers to adopt Transformer architectures for change detection tasks.
BIT [28] uses a weight-shared ResNet18 to extract bitemporal features and employs a semantic tokenizer to compress these features into a small set of semantic tokens. These tokens are then concatenated and fed into a Transformer encoder/decoder to capture spatial–temporal relationships. ChangeFormer [27] employs a weight-shared SegFormer-B1 to extract multi-level features. At each level, the features are processed through differencing and convolution operations to aid bitemporal feature fusion, and the resulting differential features from the various levels are concatenated and fed into a decoder of fully connected layers to perform change detection. SwinSUNet [26] leverages a weight-shared Swin Transformer [24] to extract multi-level features. Features from the final layer are concatenated to combine bitemporal information before being processed by the decoder, and a U-net-like connection merges the extracted multi-level features with those at corresponding decoder layers via concatenation, further enhanced by channel attention mechanisms. Li et al. [49] introduced TransUNetCD, a hybrid model blending the Transformer's global context modeling capability with CNN blocks and the UNet architecture, eliminating UNet's redundant information and enhancing feature quality through differential enhancement modules.
Despite achieving strong performance in change detection, the aforementioned Transformer-based methods face challenges due to their quadratic complexity in image processing, significantly increasing computational costs. This drawback is particularly disadvantageous for dense prediction tasks like change detection.
2.3. Mamba-Based Method
Mamba, based on state-space models (SSMs), introduces the cross-scan mechanism and hardware-aware algorithms, enabling the parameterization of SSMs based on input sequences. Compared with Transformer, Mamba exhibits linear computational complexity relative to the length of input sequences, showing great potential in long sequence modeling tasks. Works like Vision Mamba and VMamba expand Mamba’s capability in handling visual data through bidirectional and multidirectional scanning methods.
Notably, Mamba has been widely applied in remote sensing change detection tasks. Chen et al. [41] first explored the potential of the Mamba architecture in remote sensing change detection, achieving significant results in binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). Zhao et al. [42] proposed RS-Mamba for dense prediction tasks in high-resolution remote sensing images, introducing a diagonal scan mechanism based on the 2D cross-scan. Zhang et al. [43] introduced the CDMamba model, which effectively integrates global and local features through SRCM blocks, enhancing the ability of Mamba-based methods to detect detailed changes in dense prediction tasks.
Despite the promising results of the aforementioned Mamba-based methods, they often directly input original or downsampled images into the encoder, potentially overlooking some shallow edge detail features. This is because after the original image enters the Mamba encoder, the resolution significantly decreases as the number of channels increases, causing the Mamba network to focus more on deep features, such as semantic information and object categories. However, the neglected shallow features contain important information like edges and textures, which are also vital for change detection tasks. Additionally, they lack optimization for detecting complex change areas, leading to noticeable performance drops in difficult case detection.
In this paper, we propose some simple but effective modules and components based on the Mamba architecture to enhance the model’s capability in handling detailed features and improve accuracy in difficult case detection.
3. Methods
3.1. Preliminaries: SSM
In deep-learning research, state-space models (SSMs) [50,51,52] provide a framework for describing and understanding the generation process of time-series data through state-space modeling, the linear time-invariant (LTI) assumption, and mapping functions.
An SSM utilizes a hidden state $h(t) \in \mathbb{R}^N$ as an intermediary variable to map an input one-dimensional function or sequence $x(t) \in \mathbb{R}$ to the output $y(t) \in \mathbb{R}$. The SSM can be represented by a set of linear ordinary differential equations (ODEs):

$$h'(t) = A h(t) + B x(t), \qquad (1)$$
$$y(t) = C h(t), \qquad (2)$$

where $h'(t)$ denotes the time derivative of $h(t)$, $A \in \mathbb{R}^{N \times N}$ represents the evolution parameter, and $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{1 \times N}$ are projection parameters.
To integrate continuous systems into deep-learning algorithms, the SSM uses a zero-order hold (ZOH) to convert the continuous parameters $A$ and $B$ into their discrete counterparts $\bar{A}$ and $\bar{B}$. This discretization is achieved by introducing a time-scale parameter $\Delta$, defined as follows:

$$\bar{A} = \exp(\Delta A), \qquad (3)$$
$$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B. \qquad (4)$$
The discretized Formulas (1) and (2) can then be expressed as

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad (5)$$
$$y_t = C h_t. \qquad (6)$$

Thus, by unrolling the recurrence into a structured kernel $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L-1}\bar{B})$, the output can be directly computed through full convolution as $y = x * \bar{K}$.
However, the parameters of the SSM based on the LTI assumption impose a constraint where the input–output relationship does not vary with time. To overcome this limitation, the recently proposed Mamba architecture builds upon the SSM, allowing the parameters $B$, $C$, and $\Delta$ to vary with the input sequence. This improvement boosts the model's capability for selective information processing across sequences, dynamically adjusting the learned context. Additionally, the Mamba architecture incorporates hardware-aware algorithms to optimize the GPU memory layout, effectively reducing memory and computational overhead during model training.
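To make these preliminaries concrete, the following is a minimal NumPy sketch of the ZOH discretization and the discrete recurrence in Formulas (3)–(6), assuming (as in S4 and Mamba) a diagonal evolution parameter $A$; the shapes and values are illustrative only, not the parameterization used in DC-Mamba.

```python
import numpy as np

# Minimal sketch of ZOH discretization and the discrete SSM recurrence.
# A is taken to be diagonal (as in S4/Mamba), so exp(dA) is elementwise.
N, L = 16, 64                      # hidden-state size, sequence length
A = -np.abs(np.random.randn(N))    # diagonal evolution parameter (A < 0 for stability)
B = np.random.randn(N)             # input projection
C = np.random.randn(N)             # output projection
delta = 0.1                        # time-scale parameter (Delta)

# Zero-order hold: A_bar = exp(dA); B_bar = (dA)^{-1}(exp(dA) - I) * dB,
# which for diagonal A simplifies to (exp(dA) - 1) / A * B.
A_bar = np.exp(delta * A)
B_bar = (A_bar - 1.0) / A * B

# Discrete recurrence (Formulas (5) and (6)): h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = <C, h_t>.
x = np.random.randn(L)
h = np.zeros(N)
y = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y[t] = C @ h
```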
3.2. Overall Architecture
Figure 2 and Figure 3 illustrate the overall architecture of the proposed DC-Mamba and its constituent modules. DC-Mamba is composed of an edge-feature enhancement (EFE) block, a dual-flow state-space (DFSS) encoder, a DFSS decoder, and a DySample module. As shown in Figure 2, the input consists of two temporal remote sensing images $T_1$ and $T_2$, where $T_1$ and $T_2$ are 3-channel RGB images of size 256 × 256.

Initially, $T_1$ and $T_2$ are separately fed into the EFE block to enhance their shallow edge features. The EFE block, illustrated in Figure 3a, extracts image features with a 3 × 3 convolution using stride = 1, applies BatchNorm for data normalization to prevent data anomalies and ReLU activation to introduce non-linear transformations, and uses residual connections to retain the original image features. After EFE processing, the edge features of $T_1$ and $T_2$ are enhanced while their size and number of channels are maintained, and they serve as inputs to the encoder.
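As a reference, the EFE block described above admits a very small PyTorch sketch; the ordering of the residual addition relative to the activation is our assumption, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EFEBlock(nn.Module):
    """Edge-feature enhancement sketch: 3x3 conv (stride 1), BatchNorm, ReLU,
    plus a residual connection that retains the original image features.
    Input and output keep the same spatial size and channel count."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.relu(self.bn(self.conv(x)))  # residual keeps shallow features

# The same block processes both temporal images before the encoder.
efe = EFEBlock()
t1, t2 = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
e1, e2 = efe(t1), efe(t2)  # shapes unchanged: (1, 3, 256, 256)
```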
The encoder comprises four stages, each designed to extract dual-temporal image features at a different scale, as depicted in Figure 3b. Each stage of the encoder includes a down-sampling operation and DFSS blocks, halving the height and width of the dual-temporal image features and doubling the number of channels.
Similarly, the decoder consists of four stages, as shown in Figure 3c. The input to the first stage of the decoder is a fusion of the dual-temporal image features output by the encoder. The subsequent three stages take inputs composed of two parts: the output of the previous decoder stage and a fusion of the dual-temporal image features from the corresponding encoder stage. This approach facilitates effective interaction of temporal image feature information. Each stage of the decoder first processes its input through a DFSS block, followed by 3 × 3 convolution and BatchNorm layers with residual connections, and gradually restores the image size through linear layers and upsample operations.

The final output of the decoder connects to the DySample module, a lightweight and effective upsampler suitable for dense prediction tasks like dual-temporal remote sensing image change detection. After upsampling, the size of the change detection image is restored to 256 × 256, yielding the binary change map.
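Putting the pieces together, the end-to-end data flow can be summarized by the following sketch; the function and module names are placeholders for illustration, not the authors' actual API.

```python
# Hypothetical composition of the DC-Mamba pipeline described above.
def dc_mamba_forward(t1, t2, efe, encoder, decoder, dysample):
    e1, e2 = efe(t1), efe(t2)        # enhance shallow edge features (EFE block)
    feats1 = encoder(e1)             # 4-stage DFSS encoder, multi-scale features
    feats2 = encoder(e2)             # weight-shared for both temporal images
    fused = decoder(feats1, feats2)  # stage-wise fusion + DFSS blocks
    change_map = dysample(fused)     # dynamic upsampling back to 256 x 256
    return change_map                # binary change-detection result
```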
3.3. Dual-Flow State-Space Block
The Mamba architecture was initially developed for natural language data, where contextual causality exists. However, for non-causal data like remote sensing images, directly expanding the original Mamba along the spatial dimension lacks adequate modeling of spatial contextual information. Recently, VMamba was proposed for visual image tasks, introducing a two-dimensional cross-scan mechanism to address this issue. As shown in Figure 4, the cross-scan mechanism reorganizes image data along four different directions in the spatial dimension, enabling each position in the image to access contextual information from multiple directions. Subsequently, each sequence of image data is selectively processed through the SSM, and the feature information from the four sequences is merged for the final output.
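The following PyTorch sketch illustrates the four-direction reorganization; the exact direction ordering in VMamba's implementation may differ.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the four-direction cross-scan: flatten a feature map
    (B, C, H, W) into four 1D sequences so every position sees context
    from multiple directions, as in Figure 4."""
    B, C, H, W = x.shape
    rowwise = x.flatten(2)                          # left-to-right, top-to-bottom
    colwise = x.transpose(2, 3).flatten(2)          # top-to-bottom, left-to-right
    seqs = torch.stack([rowwise, colwise], dim=1)   # (B, 2, C, H*W)
    return torch.cat([seqs, seqs.flip(-1)], dim=1)  # add the two reversed scans -> (B, 4, C, H*W)

# After each sequence is processed by the selective SSM, the four outputs
# are mapped back to the (H, W) grid by inverting these scans and merged.
```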
However, this two-dimensional cross-scan mechanism primarily enhances the model's ability to extract global features, often neglecting the importance of edge-local features. In tasks like dual-temporal image change detection, this can significantly impact accuracy. Therefore, we propose the dual-flow state-space (DFSS) block, depicted in Figure 3b. The input feature map undergoes downsampling before entering the DFSS block, where it is split into two information flows. The main information flow sequentially passes through LayerNorm [53], Linear, and 3 × 3 convolution layers into the core VSSM module, which integrates the Mamba architecture with the two-dimensional cross-scan mechanism. The VSSM module rearranges the image along four spatial directions, allowing every part of the image to acquire spatial contextual information from different directions, and produces the VSSM output. This output is processed through LayerNorm and then merged, via element-wise multiplication, with feature information activated by SiLU, yielding the output of the main information flow. The second information flow integrates the input features into the main information flow through residual connections, preserving edge-local features and producing the final output of the DFSS block.
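A rough PyTorch sketch of the two information flows is given below. The `VSSM` core (cross-scan plus selective SSM) is assumed given; the channels-last layout, depthwise convolution, and gate projection are our assumptions based on common Mamba-style blocks, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DFSSBlock(nn.Module):
    """Sketch of the dual-flow state-space block following the description
    in the text. `vssm` stands in for the cross-scan selective-SSM module."""
    def __init__(self, dim: int, vssm: nn.Module):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.linear_in = nn.Linear(dim, dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.vssm = vssm                 # cross-scan SSM core (assumed given)
        self.norm_out = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, dim)  # produces the SiLU-activated gate
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), channels-last as is common for Mamba-style blocks.
        main = self.linear_in(self.norm_in(x))
        main = self.conv(main.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        main = self.norm_out(self.vssm(main))
        main = main * self.act(self.gate(x))  # element-wise gating
        return x + main  # second flow: residual connection keeps edge-local features
```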
3.4. Dynamic Loss Function for DC-Mamba
For the dual-temporal image change detection task based on the Mamba architecture, we design a loss function that can be dynamically adjusted through hyperparameters. It consists of two weighted loss components. The first is the balanced cross-entropy (BCE) loss, commonly used for binary classification tasks, defined as

$$\mathcal{L}_{BCE} = -\alpha_t \log(p_t), \qquad (7)$$

$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = -1, \end{cases} \qquad (8)$$

where $y \in \{1, -1\}$ represents the positive and negative classes, and $p$ denotes the estimated probability of the model for the class $y = 1$. $\alpha_t$ is a hyperparameter introduced to address class imbalance issues, set to $\alpha$ for class 1 and $1 - \alpha$ for class −1.
However, in binary classification tasks, the default initialization results in equal probabilities for the two classes. In cases of class imbalance, this can lead to the dominance of the larger, simpler class during training. While BCE loss introduces $\alpha_t$ to balance positive and negative samples [54], it does not differentiate between easy and hard samples. Therefore, for dense prediction tasks with class imbalance like dual-temporal image change detection, we introduce Focal loss [55] as the second part of the loss function.
Focal loss replaces the hyperparameter $\alpha_t$ in BCE loss with a focal modulation factor $(1 - p_t)^{\gamma}$, where $\gamma$ is a tunable focusing parameter. Focal loss can be defined as

$$\mathcal{L}_{Focal} = -(1 - p_t)^{\gamma} \log(p_t). \qquad (9)$$

It can be observed that for easy samples where the model is confident in its predictions ($p_t$ approaching 1), the focal modulation factor tends toward 0; for example, with $\gamma = 2$ and $p_t = 0.9$, the factor is $0.1^2 = 0.01$, reducing that sample's contribution a hundredfold. This significantly reduces the weight of correctly classified easy samples, preventing them from dominating the training process. For difficult cases that are more likely to be misclassified, the model exhibits less confidence (resulting in smaller $p_t$), and the focal modulation factor approaches 1, thereby minimally affecting the weight of hard samples during training.
In conclusion, the designed loss function is expressed as

$$\mathcal{L} = \lambda_1 \mathcal{L}_{BCE} + \lambda_2 \mathcal{L}_{Focal}, \qquad (10)$$

where $\lambda_1$ and $\lambda_2$ represent the weighting parameters for BCE loss and Focal loss, respectively. Different dual-temporal image change detection datasets exhibit varying densities of change targets and significant differences in difficulty. Therefore, dynamically adjusting the weights of these two components of the loss function is expected to improve training results, as validated in our experiments.
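For illustration, the combined loss in Formula (10) can be sketched in PyTorch as follows; $\gamma$, $\lambda_1$, and $\lambda_2$ default to the values reported later in Section 4.2.2, while the value of $\alpha$ and the mean reduction are our assumptions.

```python
import torch

def dynamic_loss(logits, target, lam1=1.0, lam2=0.75, alpha=0.75, gamma=2.0):
    """Sketch of the dynamic loss L = lam1*L_BCE + lam2*L_Focal (Formula (10)).
    logits: raw change-map scores; target: binary ground-truth map (0/1).
    alpha is illustrative; the paper does not report its value here."""
    target = target.float()
    p = torch.sigmoid(logits)
    pt = p * target + (1 - p) * (1 - target)               # p_t as in Formula (8)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)  # class-balancing weight
    log_pt = torch.log(pt.clamp_min(1e-8))
    bce = -alpha_t * log_pt                      # balanced cross-entropy term (7)
    focal = -((1 - pt) ** gamma) * log_pt        # focal term down-weights easy pixels (9)
    return (lam1 * bce + lam2 * focal).mean()
```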
4. Experiments and Analysis
4.1. Datasets
4.1.1. SYSU-CD
The SYSU-CD [56] dataset contains 20,000 pairs of 0.5 m/pixel resolution aerial images of Hong Kong taken between 2007 and 2014, each with dimensions of 256 × 256 pixels, divided into training, validation, and test sets in a ratio of 6:2:2. This dataset is distinguished by its focus on urban and coastal changes, featuring high-rise buildings and infrastructure developments, where change detection poses significant challenges due to shadow and deviation effects. It encompasses a range of change scenarios such as urban construction, suburban expansion, groundwork, vegetation changes, road expansion, and sea construction.
4.1.2. LEVIR-CD+
The LEVIR-CD+ [3] dataset is an advanced version of LEVIR-CD, containing 985 pairs of very high-resolution images at 0.5 m/pixel, each with dimensions of 1024 × 1024 pixels. Spanning a time interval of 5 to 14 years, these multi-temporal images document significant building construction changes. It also encompasses a wide array of building types, including urban residential areas, small-scale garages, and large warehouses, with a focus on both the emergence of new buildings and the decline of existing structures. LEVIR-CD+ is a valuable benchmark for evaluating CD methodologies. We cut each image into patches of size 256 × 256 and divided the images into training and test sets in a ratio of 7:3.
4.1.3. WHU-CD
The WHU-CD [57] dataset, a subset of the larger WHU Building dataset, is tailored for the building CD task. It consists of a pair of 32,507 × 15,354 spatial remote sensing images of New Zealand with a resolution of 0.2 m/pixel, taken in April 2012 and April 2016 and covering an area of 20.5 square kilometers. We cut the WHU-CD dataset into patches of size 256 × 256, resulting in a total of 7620 images, divided into training, validation, and test sets in a ratio of 7:1.5:1.5.
4.2. Experimental Setup
4.2.1. Implementation Details
In the proposed DC-Mamba structure, during the edge-feature enhancement (EFE) stage, a 3 × 3 convolution with stride = 1 and 3 output channels is used to extract and enhance the edge features of the original image. The encoder consists of four stages with layer configurations {2, 2, 4, 2}, where feature map sizes are reduced to {1/2, 1/4, 1/8, 1/16} of the original image size and feature channel numbers are set to {24, 48, 96, 192}. Linear interpolation is employed for downsampling within the encoder stages. In the four stages of the decoder, the feature map sizes and channel numbers mirror those of the encoder, and upsampling is performed using the DySample method.
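For reference, the stage configuration above can be summarized as a small dictionary; the key names are placeholders, not the authors' actual configuration schema.

```python
# Illustrative configuration matching the numbers reported above.
dc_mamba_config = {
    "efe": {"kernel": 3, "stride": 1, "out_channels": 3},
    "encoder": {
        "depths": [2, 2, 4, 2],                   # DFSS blocks per stage
        "channels": [24, 48, 96, 192],            # feature channels per stage
        "scales": ["1/2", "1/4", "1/8", "1/16"],  # feature size vs. input
        "downsample": "linear_interpolation",
    },
    "decoder": {                                  # mirrors the encoder stages
        "channels": [192, 96, 48, 24],
        "upsample": "dysample",
    },
}
```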
4.2.2. Training Details
We construct and deploy DC-Mamba using the PyTorch framework on a single NVIDIA RTX 3080Ti. We employ straightforward data augmentation techniques, avoiding any complex methods. During training, we utilize the AdamW optimizer [58] with a learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻³. The batch size is set to 4, and the training iterations are configured to 60,000.
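A minimal training-loop sketch with these settings follows; `model`, `train_loader`, and the reuse of the `dynamic_loss` sketch from Section 3.4 are assumptions for illustration.

```python
from itertools import cycle
import torch

# AdamW with lr 1e-4 and weight decay 1e-3; batch size 4, 60,000 iterations.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)
for it, (t1, t2, label) in zip(range(60_000), cycle(train_loader)):
    pred = model(t1, t2)              # predicted change-map logits
    loss = dynamic_loss(pred, label)  # combined BCE + Focal loss (Formula (10))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```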
To determine the most suitable Mamba backbone network, we test the performance of DC-Mamba with three different backbones: VMamba-Tiny, VMamba-Small, and VMamba-Base. Table 1 shows the experimental results. VMamba-Small exhibits a significant improvement over VMamba-Tiny, whereas VMamba-Base shows slightly lower Recall, F1, and OA scores due to overfitting caused by its increased parameter count. Therefore, in the subsequent experiments in this paper, we adopt VMamba-Small as the backbone network.
In the dynamic loss function, the focusing parameter $\gamma$ is set to 2, the weight $\lambda_1$ to 1, and $\lambda_2$ to 0.75 (details are described in the ablation studies). According to Formulas (7), (9), and (10), the loss function can be expressed as

$$\mathcal{L} = -\lambda_1 \alpha_t \log(p_t) - \lambda_2 (1 - p_t)^{\gamma} \log(p_t).$$
4.2.3. Evaluation Metrics
To evaluate the performance of our model, we employ a set of metrics, including precision (Pre), recall (Rec), F1 score (F1), overall accuracy (OA), and intersection over union (IoU). The precision reflects the proportion of true positive pixels among all pixels identified as positive. The recall reflects the proportion of all true positive pixels correctly identified as positive. The overall accuracy expresses the ratio of correctly predicted pixels to total pixels. The F1 score balances precision and recall by calculating their harmonic mean. The IoU measures the overlap between the predicted and ground-truth positive regions. The five metrics are defined as follows:

$$Pre = \frac{TP}{TP + FP}, \quad Rec = \frac{TP}{TP + FN}, \quad OA = \frac{TP + TN}{TP + TN + FP + FN},$$
$$F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}, \quad IoU = \frac{TP}{TP + FP + FN},$$

where TP, FP, TN, and FN represent the number of true positives, false positives, true negatives, and false negatives, respectively.
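The five metrics can be computed directly from the confusion counts, as in this small helper (a straightforward sketch of the definitions above):

```python
def cd_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Pre, Rec, F1, OA, and IoU from pixel-level confusion counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    return {"Pre": pre, "Rec": rec, "F1": f1, "OA": oa, "IoU": iou}
```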
4.3. Ablation Studies
In this section, we conduct ablation studies on the LEVIR-CD+, WHU-CD, and SYSU datasets to assess the impact of the various components and parameters of DC-Mamba. Our ablation study includes four different configurations, as shown in Table 2, and Figure 5 provides a visual comparison of the results, with key areas highlighted by red boxes. We use the original Mamba network with the dynamic loss function as the baseline and design ablation experiments on the EFE block and the DFSS block.
It can be seen that independently adding either the EFE block or the DFSS block significantly improves the model performance. Specifically, when the DFSS block is individually added, the F1 score improves by 0.55%, 1.27%, and 1.35% on the three datasets, respectively. This indicates that integrating global and local features is highly effective for change detection tasks. In the DFSS block, the two-dimensional cross-scan mechanism rearranges the image along four spatial dimensions, allowing each part of the image to obtain spatial context information from different directions, thereby enhancing the model’s ability to extract global features. Meanwhile, the second information flow integrates the input image into the main information flow through residual connections, preserving local features.
When the EFE block is individually added, the key metric F1 improves by 0.89%, 1.02%, and 1.68% on the three datasets, respectively. This demonstrates the importance of shallow features for change detection tasks. We believe that the proposed EFE block focuses on fine-grained edge and texture information in the image, effectively preserving the shallow features of the original images $T_1$ and $T_2$. To further validate this point, we visualize the feature heatmaps on different datasets and compare the decoder's output feature heatmaps, as shown in Figure 6.
Figure 6d illustrates the effective shallow feature extraction of the EFE block, which ensures that the edge features and detail information of the image are preserved after it enters the encoder. Comparing Figure 6e and Figure 6f, it is evident that the change areas output by the DC-Mamba decoder with the EFE block are more precise, with notably better performance at the edges of the changed regions. These observations qualitatively support our viewpoint.
Table 3 shows the impact of the hyperparameter settings of the dynamic loss function on model performance, also evaluated on the SYSU dataset. Here, $\lambda_1$ and $\lambda_2$ represent the weight parameters of BCE loss and Focal loss, respectively. The experimental results indicate that introducing the dynamic loss function significantly improves model performance compared with using BCE loss or Focal loss alone. Moreover, the model achieves optimal performance on the SYSU dataset when $\lambda_1 = 1$ and $\lambda_2 = 0.75$. This suggests that the dynamic loss function effectively improves training outcomes by reducing the weight of simple samples, thereby focusing more on difficult cases.
The ablation studies above demonstrate that our proposed components, whether used individually or in combination, are beneficial for enhancing the model’s performance in change detection tasks.
4.4. Comparative Experiment
We conduct comparative experiments using state-of-the-art (SOTA) models from three categories: CNN-based methods (SNUNet [21], IFNet [19], HANet [47], and FC-Siam-Conc [18]), Transformer-based methods (BIT [28], ChangeFormer [27], SwinSUNet [26], and TransUNetCD [49]), and Mamba-based methods (RSMamba [42], ChangeMamba [41], and CDMamba [43]). Our proposed DC-Mamba is tested against these models on the LEVIR-CD+, WHU-CD, and SYSU datasets. Table 4 presents the test results, highlighting the best-performing metrics in red and the second-best in blue.
It is evident that DC-Mamba, our proposed method, demonstrates significant performance advantages compared with both CNN-based and Transformer-based methods. Furthermore, when compared with the three latest Mamba-based methods, our approach exhibits superior overall performance. Specifically, on the LEVIR-CD+ dataset, although DC-Mamba shows lower Recall compared with RSMamba, it achieves the highest scores in other metrics, with the F1 score improving by 1.50%. On the WHU-CD dataset, DC-Mamba achieves the highest Recall, F1 score, and IoU and the second-highest OA, with the F1 score improving by 1.16%.
The SYSU dataset, characterized by more complex and numerous change regions, poses greater difficulty. On this dataset, our method achieves the highest scores across all metrics. Compared with the second-best results, DC-Mamba improves Recall, F1 score, IoU, and OA by 2.57%, 1.37%, 1.08%, and 0.35%, respectively. This highlights the effectiveness of our proposed EFE block, DFSS block, and dynamic loss function in enhancing model performance for detecting challenging cases and intricate edges.
To further validate the performance of DC-Mamba, qualitative results on the LEVIR-CD+, WHU-CD, and SYSU test sets are presented in Figure 7, Figure 8, and Figure 9, respectively, with key areas highlighted by red boxes. Each figure contrasts our method with selected CNN-based, Transformer-based, and Mamba-based methods.
Figure 7 and Figure 8 illustrate that because the change areas in the LEVIR-CD+ and WHU-CD datasets are mostly buildings with regular shapes, all methods can effectively reconstruct the shape of the change areas, but our DC-Mamba clearly performs better. Specifically, in Figure 8a, DC-Mamba detects smaller intermediate change areas missed by the other methods. In Figure 9d, DC-Mamba provides more detailed and accurate detections along change edges.
Figure 9 shows that the SYSU dataset contains irregular and non-structured changes, where Mamba-based methods clearly outperform CNN-based and Transformer-based methods. Compared with CDMamba, our method excels at capturing edge details and detecting challenging cases. Figure 9c demonstrates DC-Mamba's superior ability to reconstruct edge shapes in the left change area, while Figure 9d highlights DC-Mamba's capability to fully detect multiple continuous small change areas, compared with several missed detections by CDMamba.
To validate the efficiency of the proposed model, Table 5 compares the parameters and GFLOPs of different change detection models. Leveraging the Mamba architecture, our model exhibits linear growth in computational complexity within the encoder and the decoder, effectively reducing computational costs. Our model has 17.35 million parameters (Params) and requires 44.95 GFLOPs (billions of floating-point operations). Compared with Transformer-based methods, the Params of our model are significantly reduced, and the GFLOPs are lower than those of most Transformer-based and Mamba-based methods. While our model incurs slightly higher computational costs than CDMamba, it achieves significantly better performance.
5. Conclusions
In this paper, to address the lack of optimized solutions for complex change regions in existing change detection methods, we propose a new change detection model called DC-Mamba. Specifically, to avoid the loss of shallow edge detail features, we introduce the EFE block. Additionally, the DFSS block is designed to enhance the model’s capability in extracting global features while better preserving local features. On this basis, we propose the dynamic loss function to tackle sample imbalance issues, ensuring that complex change regions receive more attention during training. Ablation studies validate the effectiveness of each proposed module and component.
Comparative experiments on the LEVIR-CD+, WHU-CD, and SYSU datasets show that our proposed DC-Mamba outperforms other methods. Particularly, it demonstrates significant improvements in challenging instance detection and on difficult datasets compared with the latest Mamba-based methods.