Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model

Wei, Gang; Miao, Yuqi; Wang, Zhicheng

doi:10.3390/app15073475

Open AccessArticle

Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model

by

Gang Wei

,

Yuqi Miao

and

Zhicheng Wang

^*

CAD Research Center, Tongji University, Shanghai 201804, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(7), 3475; https://doi.org/10.3390/app15073475

Submission received: 4 March 2025 / Revised: 20 March 2025 / Accepted: 21 March 2025 / Published: 21 March 2025

(This article belongs to the Special Issue Intelligent Computing and Remote Sensing—2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

Change detection in remote sensing images is a critical task that requires effectively capturing both global and differential information between bitemporal or more images. Recent progress in foundational vision models, like the Segment Anything Model (SAM), has led to significant improvements in feature extraction. However, these models do not have specific mechanisms designed to effectively utilize global and differential information for change detection tasks. To address this limitation, we propose Siamese-SAM, a novel Siamese network incorporating SAM as the encoder for each input image. To enhance feature representations, we introduce three specialized modules: the Global Information Enhancement Module (GIEM) to refine global representations, the Differential Information Enhancement Module (DIEM) to emphasize differential features, and the Differential Global Information Fusion Module (DGIF) to integrate global and differential information effectively. Our model is evaluated on three benchmark datasets: LEVIR-CD, SYSU-CD, and GZ-CD, achieving state-of-the-art performance. Specifically, Siamese-SAM attains F1 scores of 92.67%, 82.61%, and 88.79% and IoU scores of 86.34%, 70.17%, and 79.83%, respectively, outperforming conventional approaches.

Keywords:

remote sensing change detection; Siamese network; Segment Anything Model; global and differential information fusion

1. Introduction

Change detection (CD) in remote sensing is the process of identifying surface modifications between bitemporal or more images [1]. The specific process can be seen in Figure 1; that is, some methods are used to distinguish the difference between two or more images at different times. It is essential for tracking dynamic changes on the Earth’s surface [2], with wide-ranging applications in areas such as land use monitoring [3], disaster assessment [4], environmental change analysis [5], and urban expansion tracking [6]. Accurate and reliable CD methods are essential for timely decision-making in environmental protection, infrastructure planning, and emergency response [7].

Despite its importance, remote sensing change detection continues to be a challenging task due to various factors [8]. One primary issue is the influence of illumination variations, seasonal changes, and atmospheric conditions, which introduce noise and complicate the differentiation of genuine changes from environmental variations. Second, the spectral characteristics of changed and unchanged regions often overlap, leading to inter-class similarity and intra-class variability, which complicates the learning process. Third, many deep-learning-based CD models rely heavily on local pixel-wise differences, often failing to capture the broader spatial relationships necessary for robust change detection. Siamese-based architectures [9] have been widely employed for bitemporal image analysis, yet existing models typically apply simple difference operations to extract changes, limiting their ability to distinguish subtle variations.

Recent advancements in foundation vision models, particularly the SAM [10], have demonstrated remarkable generalization capabilities in segmentation tasks. SAM, designed as a promptable segmentation model, possesses strong feature extraction abilities and has shown impressive zero-shot performance across diverse datasets. However, its direct application to change detection is non-trivial due to several inherent limitations [11].

SAM, originally designed for natural image segmentation, lacks explicit mechanisms for modeling global contextual information and differential features, which are crucial for accurate change detection. Additionally, it does not inherently integrate global and differential representations, limiting its effectiveness in remote sensing applications [12]. Moreover, due to the inductive bias learned from natural image datasets [13], foundation models like SAM may struggle with domain adaptation when applied to remote sensing. Unlike natural images, remote sensing images exhibit unique spectral properties, large-scale geographic structures, and varied imaging conditions, which are not well captured by models trained on everyday scenes [14]. This domain gap can lead to suboptimal feature representations and reduced sensitivity to remote-sensing-specific changes.

To overcome these limitations, we propose Siamese-SAM, a novel Siamese-based framework that leverages SAM as an encoder while introducing three dedicated modules to enhance both global and differential feature representations. The following summarizes the main contributions of our research:

We introduce the Siamese-SAM model for detecting changes in remote sensing images, which fully leverages the advantages of the Siamese structure and SAM as an encoder. The Siamese structure effectively models bitemporal image dependencies, while SAM provides powerful feature extraction capabilities. This design enables rich feature representation learning and enhances the accuracy of change detection.
We designed the GIEM and the Differential Information Enhancement Module (DIEM) to enhance feature representations. GIEM strengthens global contextual understanding, improving the model’s ability to capture large-scale scene changes. DIEM focuses on differential feature enhancement, ensuring more effective extraction of changed regions while reducing the interference of irrelevant variations.
We designed the DGIF to integrate global and differential information effectively. This module combines global structural consistency with precise change localization, ensuring that both contextual and differential features contribute to accurate change detection. The fusion mechanism further enhances the model’s robustness across various change detection scenarios.
Through experiments on three benchmark datasets (LEVIR-CD, GZ-CD, and SYSU-CD), our Siamese-SAM model achieves state-of-the-art performance, demonstrating superior robustness and accuracy. We also conducted extensive ablation studies to validate the effectiveness of each of the proposed modules, confirming their contributions to performance improvement.

2. Related Work

This section first reviews traditional methods for change detection in remote sensing, summarizing their progress and challenges. Next, we discuss the application of Siamese networks in change detection, highlighting their advantages and limitations. Finally, we introduce foundation vision models (VFMs), including SAM, and analyze their potential and adaptation challenges in remote sensing change detection.

2.1. Change Detection in Remote Sensing

Change detection in remote sensing has evolved significantly since its inception [15], starting with pixel-based methods that utilize algebraic, statistical, and transformation-based techniques. Algebraic methods, such as image differencing and change vector analysis (CVA) [16], construct difference images and apply thresholds to generate change maps, but their reliance on fixed thresholds often limits robustness. Statistical methods, like those used for hyperspectral or SAR images [17,18], detect changes based on pixel distribution attributes, though they often struggle with complex scenes. Transformation-based approaches, including principal component analysis (PCA) and multivariate alteration detection (MAD) [19], enhance changes by transforming images into a feature space, but they still face challenges with interpretability.

Unsupervised methods, which extend traditional techniques, aim to detect changes without labeled data. For example, logical verification CVA (LV-CVA) [20] introduces logical reasoning to reduce false detections, while Markov random field (MRF)-based methods [21,22] incorporate spatial and contextual information to improve detection accuracy. Although these methods enhance robustness, they remain constrained by their dependence on predefined rules and assumptions, limiting their applicability in highly complex scenarios.

Overall, while traditional methods have contributed to the development of CD, their limitations in accuracy and adaptability highlight the need for modern, data-driven approaches that can better handle the complexities of high-resolution remote sensing data.

2.2. Siamese Networks for Change Detection

With the advent of big data and the popularization of computing resources, deep learning methods, especially those utilizing Siamese networks [23], have demonstrated great potential for remote sensing change detection [24,25]. Siamese networks process bitemporal images in parallel, allowing shared weights and consistent feature extraction, which is well suited for identifying changes between two images.

Traditional convolutional neural network (CNN)-based methods, such as FCN [26] and U-Net [27], laid the foundation for Siamese-based architectures. U-Net, with its skip connections [28], retains high-resolution features and improves pixel-level classification accuracy in high-resolution remote sensing images. By leveraging Siamese structures, models can efficiently compare features extracted from bitemporal images and identify changes. For instance, SNUNet-CD utilizes a Siamese U-Net backbone to enhance spatial and temporal consistency during feature extraction [29].

The integration of Transformer architectures has further improved the performance of Siamese networks in change detection tasks [30,31]. For example, Chen et al. [32] proposed BIT (Bitemporal Image Transformer), which combines Transformer-based global feature modeling with Siamese networks to capture contextual information more effectively than pure CNN-based methods. Similarly, UVACD by Wang et al. [33] combines a CNN backbone for semantic feature extraction with a Transformer-based module for temporal information interaction, suppressing irrelevant features and enhancing discriminative change detection capabilities.

Hierarchical structures, such as those used in the Swin Transformer [34], have been applied to Siamese networks to improve computational efficiency and multi-scale feature extraction. Zhang et al. [35] proposed SwinSUNet, which employs a Siamese structure with Swin Transformer blocks to process bitemporal images in parallel, capturing global and local features at multiple scales. Additionally, Yan et al. [36] introduced a fully Transformer-based Siamese network (FTN) that combines global feature extraction with pyramid-based multi-level visual feature fusion, further enhancing change detection performance.

Overall, Siamese networks, particularly those integrating CNNs and Transformers, have become a dominant framework for remote sensing change detection, demonstrating significant improvements in accuracy and efficiency.

2.3. Foundation Vision Models for Remote Sensing

SAM represents a major advancement in VFMs, as it is trained on millions of annotated images to achieve zero-shot segmentation capability. However, SAM and similar models face domain adaptation challenges when applied to specific fields like medical imaging, manufacturing, and remote sensing images. These models, predominantly trained on natural image datasets, often struggle with small or irregular objects and tend to focus on foreground features, which limits their performance in remote sensing applications. To address the high computational cost of SAM, FastSAM has been introduced, providing real-time segmentation with comparable generalization performance but at a significantly faster inference speed [37].

In the context of remote sensing change detection, the first adaptation of SAM, known as SAM-CD [38], has been proposed to explore the model’s potential in this domain. However, while SAM-CD leverages the powerful feature extraction capability of SAM, it does not explicitly consider the importance of differential information between bitemporal images, which is critical for accurately identifying changes. This limitation highlights the need for further adaptation of VFMs to effectively handle the unique challenges of change detection in remote sensing images.

In this work, we address these limitations by utilizing a VFM-based Siamese structure to enhance the model’s ability to extract semantic latent features and effectively capture both global contextual information and differential features, enabling robust and accurate change detection in remote sensing imagery.

3. Methods

In this section, we mainly introduce the structure of the model and modules, the dataset and the training details.

3.1. Proposed Approach

3.1.1. Overall Network Structure

The overall network structure of our proposed model is designed to leverage FastSAM as the encoder for feature extraction, taking advantage of its efficiency and strong generalization capabilities. The network adopts a Siamese structure, which processes the input bitemporal images in parallel, allowing shared and consistent feature extraction. Specifically, the extracted features are further enhanced through two specialized branches for global information enhancement and differential information enhancement, respectively. These enhanced features are then fused through a global and differential information fusion module, which integrates the complementary aspects of both types of information to generate a comprehensive representation of changes. Finally, the fused features are decoded through the decoder to produce the final change detection map.

The overall structure is visualized in Figure 2, where the snowflake icons indicate frozen parameters of the FastSAM encoder, while the spark icons represent trainable parameters. This distinction highlights the effective adaptation of the pre-trained FastSAM model within the Siamese structure while focusing the training on task-specific modules, ensuring both computational efficiency and optimal performance.

3.1.2. Global Information Enhancement Module

The GIEM is designed to comprehensively enhance global feature representations by leveraging three attention mechanisms: Pixel Attention, Channel Attention, and Simple Pixel Attention. These mechanisms work collaboratively to strengthen global contextual information in the extracted features. Specifically, Pixel Attention [39] emphasizes spatially salient regions, Channel Attention [40] highlights important feature channels, and Simple Pixel Attention provides an efficient approach to refine pixel-level features with minimal complexity. The overall structure of the module is shown in Figure 3, and the detailed structures of the three attention mechanisms are illustrated in Figure 4.

To further optimize the module, Pointwise Convolution (PWConv) [41] is employed. By performing 1 × 1 convolutions, PWConv effectively reduces the computational cost while maintaining the ability to process feature maps across all channels. This makes it a lightweight yet powerful operation, enabling the module to achieve global feature enhancement without imposing a heavy computational burden. Combined with the attention mechanisms, PWConv ensures that the module operates efficiently, making it well suited for remote sensing change detection tasks.

3.1.3. Differential Information Enhancement Module

The DIEM enhances differential features by capturing and emphasizing differences in both horizontal and vertical directions of the input feature maps. This is achieved through a Coordinate Attention Mechanism [42], as shown in Figure 5.

The module splits the input features into horizontal and vertical components using adaptive average pooling. These components are concatenated, processed through shared convolution and activation layers, and then split back into horizontal and vertical representations. Attention weights are generated for each direction and applied to the original features, refining them by focusing on significant spatial differences. This mechanism effectively highlights meaningful changes, ensuring better feature representation for change detection tasks.

This coordinate-based attention mechanism not only enhances the discriminative capability of the module but also ensures efficient computation. By focusing on directional differences and dynamically adjusting feature responses, the DIEM effectively highlights meaningful changes between bitemporal images, making it a critical component in the proposed network.

3.1.4. Differential Global Information Fusion Module

The DGIF is designed to effectively combine global and differential features, capturing both broad contextual information and localized changes. As shown in Figure 6, the module utilizes three parallel dilated convolutions with dilation rates of 1, 2, and 3, which extract multi-scale features. This design allows the module to efficiently expand the receptive field, capturing both fine-grained details and large-scale patterns without significantly increasing computational costs.

The module begins by concatenating the global and differential input features, which are then passed through the three dilated convolutions. These parallel convolutions with increasing dilation rates enable the module to focus on changes at different spatial resolutions, ensuring that both small, subtle changes and broader patterns are effectively captured.

To further enhance the fused features, the module incorporates residual connections by adding the original global and differential features, along with their scaled versions. This residual connection not only preserves the original information but also helps the network maintain gradient flow during training, making the module more stable and effective. The final refined output is generated through additional convolutions, which ensure that the integrated features are fully optimized for downstream tasks.

By combining multi-scale feature extraction, residual connections, and efficient fusion, the DGIF module ensures that both global and differential information are fully utilized, resulting in a more robust and accurate representation for change detection.

3.2. Datasets

In this section, we mainly introduce the three datasets used in the experiments, including LEVIR-CD [43], SYSU-CD [44], and GZ-CD [45].

3.2.1. LEVIR-CD

As a key dataset in the field of remote sensing change monitoring, LEVIR-CD has significant advantages over similar studies due to its large-scale multi-temporal image pair characteristics. The dataset consists of 637 pairs of high-resolution Google Earth images, each with a spatial resolution of half a meter. Each image has a size of 1024 × 1024 pixels. The time span of the dataset spans 16 years, from 2002 to 2018, capturing the significant morphological changes in buildings during urbanization.

The image content is highly diverse, covering a variety of building structures such as villa communities, high-rise apartment clusters, decentralized garages, and industrial storage buildings, focusing on the dynamic process of building expansion and morphological changes. To enhance the algorithm’s adaptability, all images are standardized and cropped to a consistent input size of 256 × 256 pixels. The dataset is then split into 7120 training sets, 1024 validation sets, and 2048 test sets.

Figure 7 showcases a selection of data from the dataset, illustrating the transformation over time. Notably, the images depict the emergence of buildings on previously vacant land, highlighting the dynamic nature of urban development. This visual representation underscores the dataset’s capability to capture and document the evolving landscape, making it an invaluable resource for change detection studies.

3.2.2. SYSU-CD

The SYSU-CD dataset, developed by Sun Yat-Sen University in 2022, is a comprehensive large-scale open-source dataset designed for change detection tasks. It comprises 20,000 pairs of image patches, each with a resolution of 0.5 m, extracted from 800 pairs of 1024 × 1024 orthophoto aerial images of Hong Kong. Each image patch is accompanied by a corresponding pixel-level binary change map, providing precise annotations for change detection analysis.

This dataset not only captures typical urban and suburban changes but also includes detailed annotations for high-density building changes and offshore constructions, making it particularly valuable for urban planners and remote sensing applications. In our experiments, we utilize the original 256 × 256 image patches for model training and evaluation. The dataset is divided into a 6:2:2 ratio, consisting of 12,000 training samples, 4000 validation samples, and 4000 test samples.

Figure 8 illustrates a representative selection of the dataset, showcasing five distinct data samples. Columns a to c highlight changes occurring in offshore areas, such as port constructions and land reclamation projects. In contrast, columns d and e focus on terrestrial changes, particularly the development and expansion of high-rise buildings and urban infrastructure. These examples demonstrate the dataset’s ability to capture diverse change scenarios, from large-scale urban developments to subtle offshore modifications, underscoring its utility for both academic research and practical applications in urban planning and remote sensing.

3.2.3. GZ-CD

The GZ-CD dataset is another widely adopted benchmark dataset in the field of change detection, focusing on urbanization-related changes in Guangzhou, China. The dataset contains 19 pairs of high-definition images collected from 2006 to 2019, focusing on monitoring the seasonal evolution characteristics of Guangzhou’s rapidly urbanizing suburbs over a 14-year period. As buildings are the primary drivers of change in urbanization, the dataset emphasizes building-related changes, making it particularly suitable for studying urban development and transformation.

During the training phase, a standard sample unit of 256 × 256 pixels is generated based on the original image pairs. This process generated a total of 1067 image block pairs. To prevent overfitting and ensure a balanced representation of change patterns, we randomly sampled the dataset and divided it into training, validation, and test sets in a 7:1:2 ratio. Specifically, 70% of the dataset (approximately 747 image pairs) was allocated to the training set, 10% (approximately 107 image pairs) to the validation set, and 20% (approximately 213 image pairs) to the test set. This split ratio allows for thorough training, effective validation, and robust evaluation of the model’s performance.

Figure 9 provides a visual representation of the dataset, showcasing five representative examples of change detection. In columns a to c, the changes primarily involve the construction of buildings on previously undeveloped land, highlighting the rapid urbanization in the region. Column d illustrates the emergence of a swimming pool. Finally, column e depicts the transformation of green areas into industrial facilities, emphasizing the dataset’s capacity to detect land-use changes. These examples collectively demonstrate the dataset’s utility in studying diverse urbanization processes, from large-scale building constructions to subtle land-use modifications, making it a valuable resource for both research and practical applications in urban planning and change detection.

3.3. Experimental Details

3.3.1. Evaluation Metrics

To comprehensively evaluate the performance of the Siamese-SAM model, we adopt a set of evaluation metrics, including precision (PR), recall (RC), overall accuracy (OA), Kappa coefficient (KAPPA), intersection over union (IoU), and F1 score (F1). These metrics provide a well-rounded assessment of the model’s effectiveness in detecting changes in remote sensing images.

Among these metrics, the F1 score and IoU are the primary indicators used to evaluate the model’s performance. The F1 score focuses on the accuracy of the detected changed category, while the IoU [46] measures the overlap between the predicted changed area and the ground truth. In addition, PR, RC, OA, and KAPPA [47] are employed as supporting metrics to provide a more comprehensive and detailed evaluation. Their inclusion ensures a thorough understanding of the model’s overall accuracy, class balance, and agreement with the ground truth.

The equations for these metrics are as follows:

P R = \frac{T P}{T P + F P}

(1)

R C = \frac{T P}{T P + F N}

(2)

F 1 = \frac{2 \times P R \times R C}{P R + R C}

(3)

O A = \frac{T P + T N}{T P + T N + F P + F N}

(4)

E = \frac{(T P + F P) \times (T P + F N) + (F N + T N) \times (F P + T N)}{{(T P + T N + F P + F N)}^{2}}

(5)

K A P P A = \frac{O A - E}{1 - E}

(6)

I o U = \frac{T P}{T P + F P + F N}

(7)

In this context, TP (True Positive) refers to the correctly predicted changed area, while FP (False Positive) represents the unchanged area that is mistakenly identified as changed. Similarly, TN (True Negative) denotes the area correctly identified as unchanged, and FN (False Negative) refers to the changed area that is mistakenly marked as unchanged. These definitions form the basis for calculating the evaluation metrics.

3.3.2. Training Details

The experimental environment was built on the NVIDIA GeForce RTX 3090 computing platform (produced in Taiwan, China), using the PyTorch framework (v2.2.1) and CUDA 12.1 computing protocol. A dynamic learning rate strategy is applied during the training phase, with an initial learning rate of 5 × 10⁻⁵. The learning rate decay stops when it reaches 1 × 10⁻⁵. The small batch gradient descent method (batch size = 4) is used for parameter optimization. The model was trained for a total of 100 epochs, and the learning rate was updated using a cosine annealing schedule [48], as defined by the following formula:

l r = l r_{m i n} + \frac{1}{2} (l r_{0} - l r_{m i n}) (1 + \cos (\frac{T_{c u r} * π}{T_{m a x}}))

(8)

where

l r

represents the current learning rate,

l r_{0}

represents the initial learning rate,

l r_{m i n}

represents the minimum learning rate,

T_{c u r}

represents the current number of training epochs, and

T_{m a x}

represents the max number of training epochs. This schedule allows a smooth decrease in the learning rate, balancing parameter exploration and convergence, which is particularly beneficial for this task.

The loss function used in this study is a combination of Binary Cross-Entropy (BCE) Loss and Dice Loss, formulated as follows:

l o s s = a * B C E_{l o s s} + b * D i c e_{l o s s}

(9)

where

a = 0.6

and

b = 0.4

. BCE Loss focuses on achieving accurate pixel-level classification to ensure precise probabilistic predictions, while Dice Loss emphasizes the spatial alignment between predicted and ground-truth change areas, effectively addressing the challenges of detecting subtle and irregular morphological changes. This combination leverages the strengths of both losses, balancing pixel-level precision and structural similarity. The model is optimized using the AdamW optimizer, which incorporates weight decay regularization to prevent overfitting and ensure stable convergence.

4. Experiment and Results

This part mainly includes ablation experiments on SAM encoder and modules, as well as comparative experiments on three datasets.

4.1. Ablation Experiments on LEVIR-CD

In the experiments, we tested four SAM series models as encoders, including SAM-b, SAM-l, FastSAM-s, and FastSAM-x. Among them, FastSAM-x achieved the best performance, as shown in the bolded results in Table 1. The results demonstrated that using FastSAM as the encoder significantly outperforms the traditional SAM models. This is primarily because FastSAM leverages an optimized architecture and faster inference capabilities, which enhance its ability to process and adapt to complex remote sensing data, providing superior accuracy compared to the original SAM models.

Ablation experiments were conducted on each module using FastSAM-x as the encoder. We added or removed each proposed module to evaluate its contribution to the overall model performance. The experimental results are presented in Table 2, where the impact of each module on the model’s accuracy is shown.

Ablation experiments of GIEM and DIEM: First, we analyze the contributions of the GIEM and the DIEM to the model’s performance. Focusing on the IoU and F1 metrics, we observe the following improvements: the baseline model achieves an IoU of 82.35% and an F1 score of 90.32%. By adding GIEM, the IoU improves to 82.92%, and the F1 score increases to 90.66%. Incorporating DIEM further enhances the performance, with the IoU rising to 83.74% and the F1 score reaching 91.15%. When both GIEM and DIEM are added together, the model achieves an IoU of 84.87% and an F1 score of 91.65%, demonstrating a clear synergy between these two modules. These results suggest that both GIEM and DIEM contribute positively, with DIEM showing a more substantial improvement in both IoU and F1, highlighting the importance of enhancing differential information in change detection tasks.
Ablation experiments of DGIF: Building on the results of GIEM and DIEM, we further analyze the impact of the DGIF module. The addition of DGIF yields a significant performance boost, bringing the IoU to 86.34% and the F1 score to 92.67%. Compared to the baseline, the IoU improves by 4.99%, and the F1 score increases by 2.35%. This demonstrates that the DGIF module, which enhances the fusion of global and differential information, plays a crucial role in further improving the model’s accuracy. The results from these ablation experiments clearly indicate that enhancing differential information is particularly beneficial for change detection tasks, where capturing subtle variations is critical.

In terms of loss parameter selection, we also conducted corresponding ablation experiments, and the results are shown in Table 3.

4.2. Comparative Experiments

4.2.1. Comparative Experiments on LEVIR-CD

For the comparison experiment on the LEVIR-CD dataset, we performed a comprehensive evaluation of our proposed Siamese-SAM model by comparing it with multiple other models. All methods were trained using the same approach to ensure fairness and the reliability of the results. The detailed quantification of model evaluation metrics is presented in Table 4.

Empirical evaluation demonstrates that FC-EF exhibits the poorest performance among deep-learning-based change detection architectures, achieving substantially lower F1 score (80.98%) and IoU (68.03%) compared to baseline methods. Compared to SAM-CD, our Siamese-SAM achieves a 1.22% improvement in F1 score and a 2.1% improvement in IoU. Additionally, there are significant improvements in precision (+0.86%), recall (+1.57%), Kappa (+1.28%), and overall accuracy (+0.1%). Although Siamese-SAM slightly outperforms SAM-CD on the LEVIR-CD dataset, its improvements on SYSU-CD and GZ-CD are more significant, demonstrating the strong generalization ability of the model on different datasets. At the same time, we conducted a large amount of visual analysis to prove the superiority of the model.

For the visual analysis of the results, we selected three datasets and highlighted the key regions with red boxes to demonstrate the advantages of our model. The four categories are indicated by different colors: white represents true positives (areas where a change occurred and the model correctly identified the change), black represents true negatives (areas where no change occurred and the model correctly identified no change), red represents false positives (areas where no change occurred but the model incorrectly identified a change), and green represents false negatives (areas where a change occurred but the model failed to identify it).

Figure 10 illustrates the visualization results for three different sets of data on the LEVIR-CD dataset. In the first dataset, our Siamese-SAM model was the only one able to detect the newly built house on the empty lot with a high degree of completeness, accurately capturing the entire change area. In the second dataset, our model demonstrated superior performance in edge detection, providing the most precise delineation of the changed house’s boundary, outperforming other models in handling such fine details. In the third dataset, our model made the fewest mistakes in identifying the changed areas, showing its ability to minimize false positives and false negatives. The experimental results unequivocally demonstrate the methodological excellence of Siamese-SAM, particularly in achieving superior detection capabilities for sub-pixel-scale changes and fine-grained spatial contexts where conventional deep learning architectures exhibit inherent limitations.

4.2.2. Comparative Experiments on SYSU-CD

In the SYSU-CD dataset, FC-Siam-Diff showed the worst performance with an IoU of 50.09% and F1 score of 66.75%. SAM-CD achieved an IoU of 69.76% and F1 score of 81.91%. Our proposed Siamese-SAM model outperformed SAM-CD, with an IoU of 70.17% and F1 score of 82.61%. Compared to SAM-CD, Siamese-SAM achieved an F1 score increase of 0.7% and IoU increase of 0.41%. Additionally, Siamese-SAM also showed significant improvements in precision (+0.79%), recall (+0.61%), Kappa (+1.30%), and overall accuracy (+0.52%). Among all the models tested, only Siamese-SAM surpassed an IoU of 70%, further highlighting its superior performance. The detailed quantification of model evaluation metrics is presented in Table 5.

Figure 11 shows the comparative visualization results on the SYSU-CD dataset, with key regions highlighted using red boxes. In the first set of data, which involves the vegetation reduction near the highway, our Siamese-SAM model detected the largest area of change, clearly distinguishing the affected regions, outperforming other models in both the extent and accuracy of detection. In the second set, where new vegetation grew on open land, our model not only identified the largest change area but also excelled in handling the complex boundary between the vegetation, highway, and buildings, ensuring relatively smooth segmentation without overfitting or underestimating the change, a challenge that other models struggled with. In the third set, which features the appearance of open land due to the reduction of vegetation, our model demonstrated superior capability in accurately detecting the boundary between the open land and the remaining vegetation, capturing fine details that other models failed to identify. These results emphasize the robustness and precision of our Siamese-SAM model, particularly in handling subtle and complex changes, and highlight its superiority over existing models in both large-scale and fine-grained change detection tasks.

4.2.3. Comparative Experiments on GZ-CD

The FC-Siam-Diff model showed the lowest performance on the dataset, with an IoU of just 51.34% and an F1 score of 67.85%. The FC-EF and FC-Siam-Conc models demonstrated comparable results on this dataset. In contrast, SAM-CD achieved an IoU of 78.42% and an F1 score of 87.41%. Our Siamese-SAM model outperformed SAM-CD, achieving an IoU of 79.83% and an F1 score of 88.79%, marking improvements of 1.41% and 1.38%, respectively. Additionally, Siamese-SAM also demonstrated significant improvements in precision (+2.81%), recall (+0.01%), Kappa (+1.04%), and overall accuracy (+0.26%). These results highlight the effectiveness of our proposed model in detecting changes with greater precision and reliability on the GZ-CD dataset. The detailed quantification of model evaluation metrics is presented in Table 6.

Figure 12 shows the comparative visualization results on the GZ-CD dataset, with key regions highlighted using red boxes. In the first set of data, which involves the newly constructed factory on open land, all models perform relatively well, but our Siamese-SAM model provides the smoothest edge detection for the factory, capturing the fine details more effectively than the others. In the second set, where vegetation in the lower-right corner was transformed into a building, only our model was able to detect the majority of the changed area, while other models detected only a small portion of the change. In the third set, which features the disappearance of a factory in the lower-left corner, our model not only detected the change but also minimized the false positives, resulting in fewer errors compared to the other models. These results further demonstrate the superior performance of Siamese-SAM in capturing subtle details, handling complex boundaries, and reducing errors in change detection tasks.

5. Discussion

5.1. Advantages of the Proposed Siamese-SAM

From all the experimental results, it is clear that the Siamese-SAM we proposed holds great potential for change detection tasks. The Siamese-SAM model cleverly integrates the Siamese network architecture with the powerful feature extraction capabilities of the SAM. This combination not only effectively captures the difference information between the two-phase images, but also enhances the understanding of the global context, resulting in more reliable change detection results. The comprehensive and well-conducted ablation experiments demonstrate that the three modules we designed (GIEM, DIEM, and DGIF) each play a crucial role in improving model performance. GIEM and DIEM significantly enhance the model’s ability to understand global context and extract differential features. DGIF effectively integrates both global context and differential features, allowing them to contribute more effectively to accurate change detection.

5.2. Limitations and Prospects

Although the proposed Siamese-SAM performs well, it does have certain limitations. Firstly, the computational overhead of the current model remains relatively high, making it challenging to process high-resolution remote sensing images efficiently. Secondly, the current tasks focus on single-classification tasks, whereas many real-world applications (such as land-use transition monitoring) require multi-class change classification, which is an area not explored in our work. Thirdly, while SAM was trained on natural images, its inductive bias towards remote sensing imagery remains unavoidable. Although our method mitigates this to some extent, it remains a challenge.

Therefore, we envision several future directions to address these limitations:

Hybrid Encoder Development: We plan to develop a hybrid encoder that combines the segmentation prior of SAM with lightweight, remote-sensing-specific adapters. This approach not only strengthens the model’s capability to leverage multispectral feature information but also minimizes computational expenses, thereby preserving robust detection performance under varying environmental conditions [53,54,55].
Expanding Datasets and Multi-Image Change Detection: We hope to create more specialized datasets for remote sensing image change detection and extend the analysis from two images to N images, which will provide more in-depth insights into urban change processes.
Exploring Large Segmentation/Detection Models: Future work could explore larger-scale segmentation [56] and detection models specifically tailored for remote sensing tasks [57].

6. Conclusions

In this paper, we introduced Siamese-SAM, an innovative change detection model that leverages a Siamese network architecture tailored for remote sensing image analysis. By incorporating the SAM as the encoder for each input image and introducing three specialized modules—the GIEM, DIEM, and DGIF—our model significantly enhances the utilization of both global contextual information and differential features.

Our comparative experiments across three benchmarks (LEVIR-CD, SYSU-CD, GZ-CD) reveal that Siamese-SAM achieves state-of-the-art performance metrics: F1 scores of 92.67%, 82.61%, and 88.79%, along with IoU values of 86.34%, 70.17%, and 79.83%. The experimental results not only demonstrate superior performance over existing conventional approaches but also underscore the advantages of our methodology in the domain of remote sensing change detection.

Furthermore, comprehensive ablation studies validate the importance of each proposed module in improving model performance, confirming the rationality and effectiveness of the Siamese-SAM framework. Despite these successes, we acknowledge certain limitations, such as dependence on high-quality training data and computational costs. Future research will focus on addressing these challenges, including exploring optimization techniques to improve computational efficiency and developing more effective cross-domain adaptation strategies to bridge the gap between natural images and remote sensing imagery.

In summary, Siamese-SAM offers a powerful tool for remote sensing image change detection and shows significant potential for applications in environmental monitoring, disaster assessment, and urban expansion tracking. We believe this model will play a pivotal role in advancing both research and practical implementations in the field.

Author Contributions

Conceptualization, G.W. and Y.M.; methodology, Y.M.; software, G.W.; validation, G.W., Y.M. and Z.W.; formal analysis, G.W.; investigation, Z.W.; resources, Z.W.; data curation, Y.M.; writing—original draft preparation, Z.W.; writing—review and editing, Y.M. and G.W.; visualization, Z.W.; supervision, Z.W.; project administration, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and the code of this study are available from the corresponding author upon request.

Acknowledgments

Thanks to Tongji University and the mentor for providing computing resources.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Singh, A. Digital Change Detection Techniques Using Remotely-Sensed Data. Int. J. Remote Sens. 1989, 10, 989–1003. [Google Scholar] [CrossRef]
Huang, Z.; Cao, C.; Chen, W.; Xu, M.; Dang, Y.; Singh, R.P.; Bashir, B.; Xie, B.; Lin, X. Remote Sensing Monitoring of Vegetation Dynamic Changes after Fire in the Greater Hinggan Mountain Area: The Algorithm and Application for Eliminating Phenological Impacts. Remote Sens. 2020, 12, 156. [Google Scholar] [CrossRef]
Mashala, M.J.; Dube, T.; Mudereri, B.T.; Ayisi, K.K.; Ramudzuli, M.R. A Systematic Review on Advancements in Remote Sensing for Assessing and Monitoring Land Use and Land Cover Changes Impacts on Surface Water Resources in Semi-Arid Tropical Environments. Remote Sens. 2023, 15, 3926. [Google Scholar] [CrossRef]
Han, D.; Yang, G.; Lu, W.; Huang, M.; Liu, S. A Multi-Level Damage Assessment Model Based on Change Detection Technology in Remote Sensing Images. Nat. Hazards 2024. [Google Scholar] [CrossRef]
Afaq, Y.; Manocha, A. Analysis on Change Detection Techniques for Remote Sensing Applications: A Review. Ecol. Inform. 2021, 63, 101310. [Google Scholar] [CrossRef]
Al-Dail, M.A. Change Detection in Urban Areas Using Satellite Data. J. King Saud Univ. Eng. Sci. 1998, 10, 217–227. [Google Scholar] [CrossRef]
Willis, K.S. Remote Sensing Change Detection for Ecological Monitoring in United States Protected Areas. Biol. Conserv. 2015, 182, 233–242. [Google Scholar] [CrossRef]
Cheng, G.; Huang, Y.; Li, X.; Lyu, S.; Xu, Z.; Zhao, H.; Zhao, Q.; Xiang, S. Change Detection Methods for Remote Sensing in the Last Decade: A Comprehensive Review. Remote Sens. 2024, 16, 2355. [Google Scholar] [CrossRef]
Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscatawat, NJ, USA, 2022. [Google Scholar]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
Zhang, C.; Liu, L.; Cui, Y.; Huang, G.; Lin, W.; Yang, Y.; Hu, Y. A Comprehensive Survey on Segment Anything Model for Vision and Beyond. arXiv 2023, arXiv:2305.08196. [Google Scholar]
Jiang, H.; Peng, M.; Zhong, Y.; Xie, H.; Hao, Z.; Lin, J.; Ma, X.; Hu, X. A Survey on Deep Learning-Based Change Detection from High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 1552. [Google Scholar] [CrossRef]
Ji, W.; Li, J.; Bi, Q.; Liu, T.; Li, W.; Cheng, L. Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-World Applications. Mach. Intell. Res. 2024, 21, 617–630. [Google Scholar]
Tombe, R.; Viriri, S. Remote Sensing Image Scene Classification: Advances and Open Challenges. Geomatics 2023, 3, 137–155. [Google Scholar] [CrossRef]
Weismiller, R.; Kristof, S.; Scholz, D.; Anuta, P.; Momin, S. Change detection in coastal zone environments. Photogramm. Eng. Remote Sens. 1977, 43, 1533–1539. [Google Scholar]
Li, L.; Li, X.; Zhang, Y.; Wang, L.; Ying, G. Change detection for high-resolution remote sensing imagery using object-oriented change vector analysis method. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 2873–2876. [Google Scholar]
Vu, P.X.; Duc, N.T.; Yem, V.V. Application of Statistical Models for Change Detection in SAR Imagery. In Proceedings of the 2015 International Conference on Computing, Management and Telecommunications, ComManTel 2015, Da Nang, Vietnam, 28–30 December 2015; pp. 239–244. [Google Scholar]
Zhao, J.; Chang, Y.; Yang, J.; Niu, Y.; Lu, Z.; Li, P. A Novel Change Detection Method Based on Statistical Distribution Characteristics Using Multi-Temporal PolSAR Data. Sensors 2020, 20, 1508. [Google Scholar] [CrossRef]
Raj, J.R.; Srinivasulu, S. Change detection of images based on multivariate alteration detection method. In Proceedings of the 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 March 2020; pp. 847–850. [Google Scholar]
Du, P.; Wang, X.; Chen, D.; Liu, S.; Lin, C.; Meng, Y. An improved change detection approach using tri-temporal logic-verified change vector analysis. ISPRS J. Photogramm. Remote Sens. 2020, 161, 278–293. [Google Scholar]
Li, Z.; Shi, W.; Lu, P.; Yan, L.; Wang, Q.; Miao, Z. Landslide mapping from aerial photographs using change detection-based Markov random field. Remote Sens. Environ. 2016, 187, 76–90. [Google Scholar]
Touati, R.; Mignotte, M.; Dahmane, M. Multimodal Change Detection in Remote Sensing Images Using an Unsupervised Pixel Pairwise-Based Markov Random Field Model. IEEE Trans. Image Process. 2020, 29, 757–767. [Google Scholar]
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature Verification Using a “Siamese” Time Delay Neural Network. In Proceedings of the 7th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–2 December 1993; pp. 737–744. [Google Scholar]
Zhu, Q.; Guo, X.; Li, Z.; Li, D. A Review of Multi-Class Change Detection for Satellite Remote Sensing Imagery. Geo Spatial Inf. Sci. 2024, 27, 1–15. [Google Scholar]
You, Y.; Cao, J.; Zhou, W. A Survey of Change Detection Methods Based on Remote Sensing Images for Multi-Source and Multi-Objective Scenarios. Remote Sens. 2020, 12, 2460. [Google Scholar] [CrossRef]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8007805. [Google Scholar]
Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.-M.; Yang, J. LSKNet: A Foundation Lightweight Backbone for Remote Sensing. Int. J. Comput. Vis. 2024, 133, 1410–1431. [Google Scholar] [CrossRef]
Song, L.; Xia, M.; Xu, Y.; Weng, L.; Hu, K.; Lin, H.; Qian, M. Multi-Granularity Siamese Transformer-Based Change Detection in Remote Sensing Imagery. Eng. Appl. Artif. Intell. 2024, 136, 108960. [Google Scholar] [CrossRef]
Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
Wang, G.; Li, B.; Zhang, T.; Zhang, S. A Network Combining a Transformer and a Convolutional Neural Network for Remote Sensing Image Change Detection. Remote Sens. 2022, 14, 2228. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713. [Google Scholar] [CrossRef]
Yan, T.; Wan, Z.; Zhang, P. Fully transformer network for change detection of remote sensing images. In Proceedings of the Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 1691–1708. [Google Scholar]
Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar]
Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611711. [Google Scholar] [CrossRef]
Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient Image Super-Resolution Using Pixel Attention. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part III. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 56–72. [Google Scholar]
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Hua, B.-S.; Tran, M.-K.; Yeung, S.-K. Pointwise Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 984–993. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604816. [Google Scholar] [CrossRef]
Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; Ding, H.; Huang, X. SemiCDNet: A Semisupervised Convolutional Neural Network for Change Detection in High Resolution Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5891–5906. [Google Scholar] [CrossRef]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
Song, C.; Yang, M.; Qi, H.; Li, S. A Kappa Measurement of Query Consistency and Its Application. In Proceedings of the 2009 International Conference on Asian Language Processing, Singapore, 7–9 December 2009; pp. 299–303. [Google Scholar]
Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
Daudt, R.C.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
Varghese, A.; Gubbi, J.; Ramaswamy, A.; Balamuralidhar, P. ChangeNet: A deep learning architecture for visual change detection. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
Yin, H.; Weng, L.; Li, Y.; Xia, M.; Hu, K.; Lin, H.; Qian, M. Attention-guided siamese networks for change detection in high resolution remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 117, 103206. [Google Scholar]
Zhan, Z.; Ren, H.; Xia, M.; Lin, H.; Wang, X.; Li, X. AMFNet: Attention-Guided Multi-Scale Fusion Network for Bi-Temporal Change Detection in Remote Sensing Images. Remote Sens. 2024, 16, 1765. [Google Scholar] [CrossRef]
Wang, Z.; Gu, G.; Xia, M.; Weng, L.; Hu, K. Bitemporal Attention Sharing Network for Remote Sensing Image Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10368–10379. [Google Scholar] [CrossRef]
Jiang, S.; Lin, H.; Ren, H.; Hu, Z.; Weng, L.; Xia, M. MDANet: A High-Resolution City Change Detection Network Based on Difference and Attention Mechanisms under Multi-Scale Feature Fusion. Remote Sens. 2024, 16, 1387. [Google Scholar] [CrossRef]
Ji, H.; Xia, M.; Zhang, D.; Lin, H. Multi-Supervised Feature Fusion Attention Network for Clouds and Shadows Detection. ISPRS Int. J. Geo. Inf. 2023, 12, 247. [Google Scholar]
Hu, Z.; Weng, L.; Xia, M.; Hu, K.; Lin, H. HyCloudX: A Multibranch Hybrid Segmentation Network with Band Fusion for Cloud/Shadow. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6762–6778. [Google Scholar]
Zhu, T.; Zhao, Z.; Xia, M.; Huang, J.; Weng, L.; Hu, K.; Lin, H.; Zhao, W. FTA-Net: Frequency-Temporal-Aware Network for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 3448–3460. [Google Scholar]

Figure 1. Task definition of change detection in remote sensing images.

Figure 2. The overall structure of the Siamese-SAM model.

Figure 3. The structure of global information enhancement module.

Figure 4. The structures of three attention mechanisms, where (a) is Channel Attention, (b) is Pixel Attention, and (c) is Simple Pixel Attention.

Figure 5. The structure of differential information enhancement module.

Figure 6. The structure of differential global information fusion module.

Figure 7. Visualization of the LEVIR-CD dataset. T1_img and T2_img are images at different times, and columns (a–e) are the corresponding instances.

Figure 8. Visualization of the SYSU-CD dataset. T1_img and T2_img are images at different times, and columns (a–e) are the corresponding instances.

Figure 9. Visualization of the GZ-CD dataset. T1_img and T2_img are images at different times, and columns (a–e) are the corresponding instances.

Figure 10. Three sets of visualization results of different algorithms on the LEVIR-CD dataset. T1_img and T2_img are images at different times, label is the ground truth. a–e are BIT, SAGNet, AMFNet, SAM-CD, and Siamese-SAM respectively.

Figure 11. Three sets of visualization results of different algorithms on the SYSU-CD dataset. T1_img and T2_img are images at different times, label is the ground truth. a–e are BIT, SAGNet, AMFNet, SAM-CD, and Siamese-SAM, respectively.

Figure 12. Three sets of visualization results of different algorithms on the GZ-CD dataset. T1_img and T2_img are images at different times, label is the ground truth. a–e are BIT, SAGNet, AMFNet, SAM-CD, and Siamese-SAM, respectively.

Table 1. Ablation experiments of SAM-Encoder on the LEVIR-CD dataset (the best results are highlighted in bold).

SAM-Encoder	PR (%)	RC (%)	OA (%)	Kappa (%)	IoU (%)	F1 (%)
SAM-b	49.40	78.80	90.56	55.69	43.61	60.73
SAM-l	71.88	76.30	95.04	71.29	58.77	74.03
FastSAM-s	90.65	89.58	99.00	89.59	82.00	90.11
FastSAM-x	94.15	91.24	99.42	92.37	86.34	92.67

Table 2. Ablation experiments on the LEVIR-CD dataset (best results are highlighted in bold type).

Method	PR (%)	RC (%)	OA (%)	Kappa (%)	IoU (%)	F1 (%)
Baseline	92.01	88.68	99.03	89.81	82.35	90.32
Baseline + GIEM	91.97	89.39	99.06	90.17	82.92	90.66
Baseline + DIEM	92.82	89.54	99.11	90.68	83.74	91.15
Baseline + GIEM + DIEM	93.12	90.22	99.36	91.28	84.87	91.65
Baseline + GIEM + DIEM + DGIF	94.15	91.24	99.42	92.37	86.34	92.67

Table 3. Ablation experiment results of loss parameters (best results are highlighted in bold type).

Parameters	PR (%)	RC (%)	OA (%)	Kappa (%)	IoU (%)	F1 (%)
$a = 0.6, b = 0.3$	92.20	90.78	99.34	91.15	84.33	91.49
$a = 0.6, b = 0.5$	92.45	91.02	99.36	91.23	84.47	91.73
$a = 0.5, b = 0.4$	92.92	90.89	99.38	91.46	84.77	91.90
$a = 0.7, b = 0.4$	93.50	90.58	99.39	91.70	85.02	92.02
$a = 0.6, b = 0.4$	94.15	91.24	99.42	92.37	86.34	92.67

From the results, we can see that when

a = 0.6

and

b = 0.4

, the model performance is the best.

Table 4. Comparative experiments on the LEVIR-CD dataset (best results are highlighted in bold type).

Method	PR (%)	RC (%)	OA (%)	Kappa (%)	IoU (%)	F1 (%)
FC-EF [49]	84.78	77.50	98.14	80.00	68.03	80.98
FC-Siam-Diff [49]	88.54	79.87	98.45	83.17	72.39	83.98
FC-Siam-Conc [49]	87.52	83.66	98.56	84.79	74.74	85.55
ChangeNet [50]	91.36	86.16	99.11	88.22	79.67	88.69
BIT [32]	91.81	87.62	99.18	89.24	81.27	89.67
SAGNet [51]	92.24	88.10	99.22	89.72	82.02	90.12
AMFNet [52]	92.56	88.87	99.26	90.29	82.95	90.68
SAM-CD [38]	93.29	89.67	99.32	91.09	84.24	91.45
Ours	94.15	91.24	99.42	92.37	86.34	92.67

Table 5. Comparative experiments on the SYSU-CD dataset (best results are highlighted in bold type).

Method	PR (%)	RC (%)	OA (%)	Kappa (%)	IoU (%)	F1 (%)
FC-EF [49]	81.77	66.76	88.65	66.38	58.11	73.50
FC-Siam-Diff [49]	83.34	55.67	86.92	59.01	50.09	66.75
FC-Siam-Conc [49]	85.18	66.61	89.36	68.17	59.69	74.76
ChangeNet [50]	77.09	71.18	88.21	66.41	58.75	74.02
BIT [32]	81.50	73.90	89.89	71.02	63.29	77.52
SAGNet [51]	79.90	81.03	90.36	74.75	66.98	80.46
AMFNet [52]	80.12	82.23	90.95	76.07	69.55	81.17
SAM-CD [38]	81.33	82.50	91.13	76.65	69.76	81.91
Ours	82.12	83.11	91.65	77.95	70.17	82.61

Table 6. Comparative experiments on the GZ-CD dataset (best results are highlighted in bold type).

Method	PR (%)	RC (%)	OA (%)	Kappa (%)	IoU (%)	F1 (%)
FC-EF [49]	82.01	64.02	95.74	69.64	56.14	71.91
FC-Siam-Diff [49]	80.25	58.77	95.25	65.35	51.34	67.85
FC-Siam-Conc [49]	84.85	63.34	95.91	70.38	56.91	72.54
ChangeNet [50]	80.26	81.52	96.43	78.92	67.91	80.89
BIT [32]	89.90	77.16	97.08	81.46	71.00	83.04
SAGNet [51]	84.95	85.01	97.22	83.45	73.88	84.98
AMFNet [52]	87.63	83.95	97.42	84.33	75.06	85.75
SAM-CD [38]	87.90	86.93	97.87	86.72	78.42	87.41
Ours	90.71	86.94	98.13	87.76	79.83	88.79

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wei, G.; Miao, Y.; Wang, Z. Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model. Appl. Sci. 2025, 15, 3475. https://doi.org/10.3390/app15073475

AMA Style

Wei G, Miao Y, Wang Z. Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model. Applied Sciences. 2025; 15(7):3475. https://doi.org/10.3390/app15073475

Chicago/Turabian Style

Wei, Gang, Yuqi Miao, and Zhicheng Wang. 2025. "Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model" Applied Sciences 15, no. 7: 3475. https://doi.org/10.3390/app15073475

APA Style

Wei, G., Miao, Y., & Wang, Z. (2025). Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model. Applied Sciences, 15(7), 3475. https://doi.org/10.3390/app15073475

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Siamese-SAM: Remote Sensing Image Change Detection with Siamese Structure Segment Anything Model

Abstract

1. Introduction

2. Related Work

2.1. Change Detection in Remote Sensing

2.2. Siamese Networks for Change Detection

2.3. Foundation Vision Models for Remote Sensing

3. Methods

3.1. Proposed Approach

3.1.1. Overall Network Structure

3.1.2. Global Information Enhancement Module

3.1.3. Differential Information Enhancement Module

3.1.4. Differential Global Information Fusion Module

3.2. Datasets

3.2.1. LEVIR-CD

3.2.2. SYSU-CD

3.2.3. GZ-CD

3.3. Experimental Details

3.3.1. Evaluation Metrics

3.3.2. Training Details

4. Experiment and Results

4.1. Ablation Experiments on LEVIR-CD

4.2. Comparative Experiments

4.2.1. Comparative Experiments on LEVIR-CD

4.2.2. Comparative Experiments on SYSU-CD

4.2.3. Comparative Experiments on GZ-CD

5. Discussion

5.1. Advantages of the Proposed Siamese-SAM

5.2. Limitations and Prospects

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI