1. Introduction
Monitoring the Earth has become increasingly important owing to the heightened frequency of disasters such as earthquakes and floods, as well as the impact of ongoing human activities such as construction and deforestation. By analyzing temporal remote sensing images acquired over the same location, change detection (CD) serves as a crucial tool for monitoring the Earth's status, supporting a wide range of applications in environmental monitoring, resource monitoring, urban planning, and disaster assessment [1,2]. Consequently, CD has attracted extensive attention in recent years.
In the initial phases of research, researchers predominantly employed conventional algorithms, encompassing algebra-based, transform-based, classification-based, and machine learning-based techniques for change detection tasks. Algebra-based methods derive the change map through algebraic operations or transformations on temporal images, such as image differencing, image regression, image rationing, and change vector analysis (CVA) [3]. Transform-based methods utilize diverse transformations to map images into another space, highlighting the change information [4,5]. These methods then employ threshold-based and clustering-based approaches to generate change maps. Classification-based methods identify changes by comparing multiple classification maps or using a trained classifier [6]. While traditional algorithms demonstrate efficacy in specific applications, their adaptability and accuracy are often restricted due to their dependence on manual features and threshold selection. Furthermore, their performance is significantly compromised when faced with variations in atmospheric conditions, seasonal factors, and differences between various sensors.
With the advancements in remote sensing technology, different platforms have become increasingly capable of collecting a wide range of data. These large-scale data enable deep learning to model the relationship between image contents and real-world geographical features as closely as possible, greatly improving the effectiveness and robustness of change detection. Depending on how they fuse the temporal images, these methods can be classified into early fusion and late fusion. Early fusion concatenates the inputs before feature extraction, followed by classification [7,8]. In contrast, late fusion methods use feature extraction networks to extract features independently from the dual-temporal images and compare feature differences to detect changes [9,10]. Compared with early fusion, late fusion generally provides higher performance.
Numerous studies have embraced the Siamese network architecture, leveraging a shared feature extractor to map temporal remote sensing images into a unified space for quantifying differences [11]. Techniques such as atrous convolution [12], large-kernel convolution [13], and feature pyramid networks [14] have been incorporated to broaden the receptive field. This augmentation strengthens the network's capability to acquire hierarchical spatial-context representations and to address potential disruptive factors, such as seasonal and illumination changes. Spatial attention mechanisms [15,16,17,18] and channel attention mechanisms [19,20,21] play a pivotal role in guiding the network to automatically focus on important information in specific channels or positions while suppressing irrelevant portions that are commonly associated with backgrounds and disruptive elements. For instance, the integration of convolutional block attention modules (CBAM) in [18] facilitates the learning of spatial-wise and channel-wise discriminative features, thereby enhancing change detection. Li et al. [22] designed a supervised attention module to reweight features, enabling more effective aggregation of multilevel features from high to low levels. Self-attention is also employed to establish long-range dependencies across images and improve the overall representation. Chen et al. [15] introduced a spatial–temporal attention module and a pyramid spatial–temporal attention module to capture spatial–temporal long-range dependencies and generate multi-scale attention representations, respectively. Consequently, their network exhibits increased robustness against illumination variations, demonstrating promising performance. Transformers, with self-attention as a key component, have recently shown significant improvements in change detection [23,24,25,26]. Adopting the Swin transformer as a fundamental block, Zhang et al. [24] constructed a Siamese U-shaped structure to learn multiscale features for change detection. Merging the advantages of convolutional neural networks (CNNs) and Transformers, ref. [27] extracts local–global features for enhanced change detection. Additionally, some works attempt to integrate prior information about the changed targets, such as edge information, for enhanced performance [28]. Furthermore, leveraging the superior visual recognition capabilities of vision foundation models, Ding et al. [29] employed FastSAM as a visual encoder to extract visual representations of RS scenes, achieving promising performance.
In addition to feature extraction, understanding temporal dependencies through capturing temporal interactions is crucial for generating feature differences [30,31,32]. Various methods, such as feature subtraction [33,34] and concatenation [35,36], are commonly employed for temporal interaction. Multiscale interaction is also recognized as beneficial, accounting for changes at different scales [37]. When treating change detection as the process of extracting change information from multi-period sequence data, recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM), have proven effective in capturing nonlinear interactions between bitemporal data. Previous studies [38,39,40] have utilized LSTM for acquiring change information. To address potential misinteraction, attention mechanisms have been introduced [41,42,43] to guide the network's focus on critical interactions. Additionally, Fang et al. [44] emphasized the importance of temporal interaction during feature extraction; consequently, aggregation–distribution and feature exchange were introduced to enable interaction during feature extraction. Liang et al. [31] proposed patch exchange between temporal images as a means to augment change detection. Feature exchange, although effective for aligning multimodality features in fusion scenarios [45], poses challenges for bitemporal images due to their inherent content differences. It is worth noting that existing works primarily perform interaction at the feature level, often neglecting change intensity information at the image level, which can lead to the loss of crucial details.
This paper introduces a multistage interaction network (MIN-Net) to address the aforementioned issues. MIN-Net facilitates bitemporal interaction at three stages: image-level interaction, feature-level interaction, and decision-level interaction. The image-level interaction captures change intensity information through image subtraction. The feature-level interaction guides the network in extracting critical spatial features related to image variations and emphasizes the alignment of critical semantic channels to overcome pseudo-changes. Finally, the decision-level interaction combines these stages to produce feature differences for effective change detection. The comprehensive multistage interaction in MIN-Net enhances its capacity to accurately extract changes between bitemporal images. Extensive experiments on three datasets (LEVIR-CD, WHU-CD, and CLCD) demonstrate MIN-Net's state-of-the-art performance. The contributions of this work can be summarized in three key aspects:
We introduce a multistage interaction network that leverages the advantages of both early fusion and late fusion for effective change extraction;
We introduce the spatial and channel interactions to overcome challenges posed by background diversity and pseudo-changes;
Extensive experiments on the LEVIR-CD, WHU-CD, and CLCD datasets showcase promising performance, with F1 scores (defined in Section 3.1) of 91.47%, 93.73%, and 76.60%, respectively.
The remainder of this paper is organized as follows. Section 2 presents the details of our MIN-Net. Section 3 shows the experimental results. Section 4 concludes this paper.
2. The Proposed Method
This section details the introduced MIN-Net, encompassing its overall framework along with a comprehensive explanation of its components, including image-level interaction, feature-level interaction, and decision-level interaction.
2.1. Overall Framework
Figure 1 illustrates the overall framework of our MIN-Net. Given bitemporal images $I_1$ and $I_2$, MIN-Net initially extracts hierarchical features $\{F_j^1\}_{j=1}^{4}$ and $\{F_j^2\}_{j=1}^{4}$ using the shared backbone ResNet-18. These extracted features are then fed into a feature pyramid network (FPN) to leverage the combined benefits of low-level and high-level representations, producing $\{\hat{F}_j^1\}_{j=1}^{4}$ and $\{\hat{F}_j^2\}_{j=1}^{4}$. Distinctively, we introduce a feature-level interaction module (FIM) between the two FPNs, enabling interaction at the feature extraction stage. In addition to feature-level interaction, we incorporate image-level interaction to directly extract difference information from the given images. With both image-level and feature-level interactions, we proceed to extract feature differences using the decision-level interaction, resulting in $D$.
Using $D$, the change probability for each pixel is generated through a simple multi-layer perceptron (MLP):
$$P = \mathrm{Up}(\mathrm{MLP}(D)).$$
Here, $\mathrm{Up}(\cdot)$ represents an upsampling operation that restores the original spatial resolution.
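As a rough illustration, the prediction head can be sketched in PyTorch as below, reading the MLP as per-pixel 1 × 1 convolutions; the channel widths, the sigmoid activation, and the upsampling factor are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical channel width; D stands for the fused multi-scale difference features.
head = nn.Sequential(nn.Conv2d(64, 32, 1), nn.ReLU(), nn.Conv2d(32, 1, 1))

D = torch.rand(2, 64, 64, 64)                      # (batch, channels, H/4, W/4)
logits = head(D)                                   # per-pixel MLP via 1x1 convolutions
prob = torch.sigmoid(F.interpolate(logits, scale_factor=4, mode="bilinear"))
print(prob.shape)                                  # (2, 1, 256, 256)
```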
The loss function in our MIN-Net comprises two components, the pixel-wise classification loss $\mathcal{L}_{ce}$ and the dice loss $\mathcal{L}_{dice}$, to address the sample imbalance problem. Their definitions are given by
$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log p_i + (1-y_i)\log(1-p_i)\right],$$
$$\mathcal{L}_{dice} = 1 - \frac{2\sum_{i=1}^{N} p_i y_i}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} y_i},$$
where $N$ indicates the number of pixels and $i$ indexes each pixel. Here, $p_i$ represents the predicted probability value output by the network, and $y_i \in \{0, 1\}$ corresponds to the ground-truth label. We assign equal contribution to both losses, i.e., the two losses are directly summed. In the following subsections, we elaborate on the image-level, feature-level, and decision-level interactions.
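For concreteness, a minimal PyTorch sketch of this summed loss follows; the class name and the small epsilon guarding the Dice denominator are our additions, assuming the network outputs per-pixel probabilities.

```python
import torch
import torch.nn as nn

class MinNetLoss(nn.Module):
    """Cross-entropy plus Dice loss with equal weights (hypothetical class name)."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                      # guards the Dice denominator
        self.bce = nn.BCELoss()             # pixel-wise classification term

    def forward(self, prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # prob and target: (B, 1, H, W); prob in [0, 1], target in {0, 1}
        l_ce = self.bce(prob, target)
        intersection = (prob * target).sum()
        l_dice = 1.0 - 2.0 * intersection / (prob.sum() + target.sum() + self.eps)
        return l_ce + l_dice                # the two losses are directly summed

loss_fn = MinNetLoss()
prob = torch.rand(2, 1, 256, 256)
target = (torch.rand(2, 1, 256, 256) > 0.9).float()
print(loss_fn(prob, target))
```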
2.2. Image-Level Interaction
The image-level interaction can be considered as an early fusion step that directly extracts change information from the given data, compensating for the potential loss of change information during feature extraction. Specifically, the image-level interaction initiates subtraction between $I_1$ and $I_2$, producing change intensity information. This information is then fed through a subsequent ResNet-18 backbone to extract multiscale change semantic information, resulting in $\{C_j\}_{j=1}^{4}$. Here, the index $j$ denotes the scale, ranging from the first to the fourth. In this setup, the backbone is not shared with the one used for feature extraction from $I_1$ and $I_2$. The primary reason for this choice is the clear information distinction between the two inputs, and we expect this dedicated network to effectively extract the change information.
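A minimal PyTorch sketch of this step is given below, assuming a torchvision ResNet-18 whose four stages yield the multiscale change features; the exact stem and initialization are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ImageLevelInteraction(nn.Module):
    """Subtract the bitemporal images and extract multiscale change
    semantics C_1..C_4 with a dedicated (unshared) ResNet-18."""

    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, img1: torch.Tensor, img2: torch.Tensor):
        x = self.stem(img1 - img2)          # change intensity information
        feats = []
        for stage in self.stages:           # one feature map per scale j = 1..4
            x = stage(x)
            feats.append(x)
        return feats

module = ImageLevelInteraction()
feats = module(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
print([f.shape for f in feats])             # channels 64, 128, 256, 512
```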
2.3. Feature-Level Interaction
As illustrated in Figure 2, the feature-level interaction employs a dual strategy, involving spatial interaction blocks to guide the network in jointly extracting crucial spatial features and channel interaction blocks to jointly emphasize the critical semantic channels. Change detection inherently involves extracting semantic variations from the provided temporal images. Therefore, the network should prioritize the extraction of semantic differences between bitemporal images rather than focusing on all information indiscriminately. To achieve this, we introduce the spatial interaction block to direct the network's attention towards critical spatial features related to semantic variations, preventing it from being misled by irrelevant information. Simultaneously, we acknowledge the potential variations in semantic channels arising from differences in imaging conditions and weather, possibly leading to pseudo-changes. Hence, a channel interaction block is incorporated to align the images, jointly emphasizing critical semantic channels between the bitemporal images.
Spatial Interaction Block: As shown in Figure 3, given $F^1$ and $F^2$ with $C$ channels, the first step involves extracting their semantic differences using the following equation:
$$F_d = \mathrm{Conv}(F^1 - F^2).$$
Using the semantic differences as the query, the features most related to these differences are obtained through a cross-attention mechanism:
$$\tilde{F}^t = \mathrm{softmax}\!\left(\frac{\mathrm{Conv}(F_d)\,\mathrm{Conv}(F^t)^{\top}}{\alpha}\right)\mathrm{Conv}(F^t), \quad t \in \{1, 2\}.$$
Here, $\mathrm{Conv}$ represents the convolutional operation, and the softmax is applied along the channel dimension with scaling factor $\alpha$. This process aims to capture and emphasize the features most relevant to the semantic differences in the images, making the semantic variations more apparent and easier for the network to detect.
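The following PyTorch sketch gives one plausible reading of this block, with the bitemporal difference forming the query of a channel-wise cross-attention; the 1 × 1 convolution projections, the $\sqrt{C}$ scaling, and the residual connection are our assumptions.

```python
import torch
import torch.nn as nn

class SpatialInteractionBlock(nn.Module):
    """Cross-attention where the bitemporal difference serves as the query
    over each temporal feature map (sketch under stated assumptions)."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** 0.5

    def attend(self, diff: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        q = self.q(diff).flatten(2)          # (B, C, HW)
        k = self.k(feat).flatten(2)          # (B, C, HW)
        v = self.v(feat).flatten(2)          # (B, C, HW)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)    # features most related to the difference
        return out + feat                    # residual keeps the original content

    def forward(self, f1: torch.Tensor, f2: torch.Tensor):
        diff = f1 - f2                       # semantic differences as the query source
        return self.attend(diff, f1), self.attend(diff, f2)

blk = SpatialInteractionBlock(64)
out1, out2 = blk(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
print(out1.shape, out2.shape)
```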
Channel Interaction Block: As shown in Figure 4, the channel interaction block aligns the extracted features through shared channel attention. Specifically, considering $F^1$ and $F^2$ with $C$ channels, the context information is initially extracted from the bitemporal features using global average pooling (GAP). The results are then concatenated, and a fully connected layer is applied to obtain the shared attention, denoted as SCA. This process can be mathematically formulated as
$$\mathrm{SCA} = \mathrm{FC}\big(\mathrm{Cat}(\mathrm{GAP}(F^1), \mathrm{GAP}(F^2))\big),$$
where $\mathrm{FC}$ is the fully connected layer. Utilizing the shared channel attention, the extracted feature maps are then calibrated as follows:
$$\bar{F}^t = \mathrm{SCA} \otimes F^t, \quad t \in \{1, 2\},$$
where $\otimes$ is the broadcast element-wise multiplication. Consequently, the extracted features ensure a focus on the same critical semantics, proving advantageous in addressing the challenges posed by pseudo-changes.
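Below is a minimal sketch of the shared channel attention, assuming a sigmoid gate on the fully connected output; the single-layer FC mapping is our simplification.

```python
import torch
import torch.nn as nn

class ChannelInteractionBlock(nn.Module):
    """Shared channel attention: GAP both temporal features, concatenate,
    and map back to C weights applied to both streams (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, f1: torch.Tensor, f2: torch.Tensor):
        b, c, _, _ = f1.shape
        ctx1 = f1.mean(dim=(2, 3))           # GAP -> (B, C)
        ctx2 = f2.mean(dim=(2, 3))
        sca = self.fc(torch.cat([ctx1, ctx2], dim=1)).view(b, c, 1, 1)  # shared SCA
        return f1 * sca, f2 * sca            # broadcast element-wise calibration

blk = ChannelInteractionBlock(64)
out1, out2 = blk(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
print(out1.shape, out2.shape)
```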
With the extracted spatial and channel-wise interacted features, we then obtain the augmented features via
$$F_a^t = \tilde{F}^t + \bar{F}^t, \quad t \in \{1, 2\}.$$
2.4. Decision-Level Interaction
The decision-level interaction leverages both image-level and feature-level interactions to capture difference information for subsequent change detection. Given the feature-level augmented interactions $F_{a,j}^1$ and $F_{a,j}^2$ at scale $j$, the first step involves concatenating them along the channel dimension. The result is then passed through a convolutional layer to produce the feature differences $D_j'$. Mathematically, this process is represented as
$$D_j' = \mathrm{Conv}\big(\mathrm{Cat}(F_{a,j}^1, F_{a,j}^2)\big).$$
Subsequently, $D_j'$ is fused with $C_j$ through feature concatenation to generate the multi-scale feature differences $D_j$. The resulting $D_j$ at different scales are then fed into a feature pyramid network to produce the final output $D$ for change detection.
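One scale of this process can be sketched as follows; the module name, the 3 × 3 kernels, and the convolution used after the second concatenation are our assumptions, since the paper only specifies concatenation followed by a convolutional layer, and we assume matching channel counts across streams.

```python
import torch
import torch.nn as nn

class DecisionLevelInteraction(nn.Module):
    """One scale of the decision-level interaction: concatenate the augmented
    bitemporal features, reduce to a feature difference D'_j, then fuse it
    with the image-level change features C_j (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, fa1, fa2, cj):
        d_prime = self.reduce(torch.cat([fa1, fa2], dim=1))  # feature difference D'_j
        d_j = self.fuse(torch.cat([d_prime, cj], dim=1))     # fused with image-level C_j
        return d_j

m = DecisionLevelInteraction(64)
d = m(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
print(d.shape)
```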
3. Experimental Results
In this section, we present comprehensive experiments to demonstrate the efficacy of our proposed MIN-Net in change detection. A detailed ablation study is also provided to elucidate the effectiveness of individual modules within the network.
3.1. Experimental Setting
Datasets: We select three datasets for evaluation, including LEVIR-CD, WHU-CD and CLCD. Their information is as follows:
LEVIR-CD [15] is a large-scale dataset for building change detection, consisting of 637 pairs of high-resolution images from Google Earth. Each image is 1024 × 1024 pixels with a spatial resolution of 0.5 m. The dataset spans 20 different regions from 2002 to 2018. Following [15], images are segmented into non-overlapping patches of 256 × 256 pixels, resulting in a total of 7120/1024/2048 samples for training, validation, and testing, respectively;
WHU-CD [46] is a publicly available building change detection dataset. It comprises one pair of aerial images covering the area of Christchurch, New Zealand, for the years 2012 and 2016. The image dimensions are 32,507 × 15,354 pixels with a spatial resolution of 0.075 m. Similar to LEVIR-CD, the dataset is divided into non-overlapping patches of 256 × 256 pixels. The dataset is randomly split into 6096/762/762 samples for training, validation, and testing, respectively;
CLCD [47] is a dataset designed for cropland change detection, collected by Gaofen-2 in Guangdong Province, China, in 2017 and 2019. It consists of 600 pairs of cropland change samples, each with dimensions of 512 × 512 pixels and varying spatial resolutions from 0.5 to 2 m. Following the methodology in [47], we allocate 360 pairs for training, 120 pairs for validation, and 120 pairs for testing.
Network Implementation: Our network was trained on two NVIDIA 3090 GPUs with the AdamW optimizer. We adopted the OneCycleLR strategy for learning rate scheduling, setting a maximum learning rate of 0.005 and a minimum of 0.005/500. For the LEVIR-CD and WHU-CD datasets, the batch size was fixed at 32 and the learning rate was set to 0.005. For the CLCD dataset, the learning rate was adjusted to 0.001 and the batch size was set to 4. All training was conducted for 250 epochs.
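For reference, a minimal sketch of this optimizer and scheduler setup in PyTorch is shown below; the placeholder model, the step count, and the exact div-factor arguments used to realize the 0.005/500 minimum are our assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Conv2d(3, 1, 1)          # placeholder standing in for MIN-Net
epochs, steps_per_epoch = 250, 223        # e.g., 7120 LEVIR-CD samples / batch size 32

optimizer = AdamW(model.parameters(), lr=0.005)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.005,                          # maximum learning rate
    total_steps=epochs * steps_per_epoch,
    div_factor=25,                         # initial lr = max_lr / 25
    final_div_factor=20,                   # final lr = max_lr / (25 * 20) = 0.005 / 500
)

for _ in range(5):                         # a few demo steps; forward/backward omitted
    optimizer.step()
    scheduler.step()
```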
Evaluation Metrics: The confusion matrix elements FP (false positive), FN (false negative), TP (true positive), and TN (true negative) serve as the foundation for quantitative analysis in binary change detection. These elements denote pixels that were misclassified as changed, pixels misclassified as unchanged, correctly detected changed pixels, and correctly detected unchanged pixels, respectively. Evaluation metrics, including overall accuracy (OA), precision, recall, F1 score, and Intersection over Union (IoU), are then computed using these elements:
OA calculates the ratio of correctly classified pixels to the total number of pixels in the dataset, defined by
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN};$$
Precision measures the fraction of detections that were actually changed among all the instances predicted as changed, defined by
$$\mathrm{Precision} = \frac{TP}{TP + FP};$$
Recall measures the ability of the model to capture all the actual changes, defined by
$$\mathrm{Recall} = \frac{TP}{TP + FN};$$
F1 combines recall and precision together, defined by
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}};$$
IoU computes the overlap between the predicted and actual change regions, defined by
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}.$$
In general, larger values indicate better prediction.
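As a sanity check, these metrics can be computed directly from binary maps, as in the NumPy sketch below; guards against empty denominators (e.g., when no changes are predicted) are omitted for brevity.

```python
import numpy as np

def change_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute OA, precision, recall, F1, and IoU from binary maps (1 = changed)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"OA": oa, "Precision": precision, "Recall": recall, "F1": f1, "IoU": iou}

pred = np.random.randint(0, 2, (256, 256))
gt = np.random.randint(0, 2, (256, 256))
print(change_metrics(pred, gt))
```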
Compared Methods: We select 10 methods for comparison, including FC-EF [11], FC-Siam-Diff [11], FC-Siam-Conc [11], STANet [15], DTCDSCN [48], ChangeFormer [49], BIT [23], ICIF-Net [30], DMINet [32], and WNet [26]. A comparison of all the methods is presented in Table 1.
By default, we utilized the original parameters provided in the respective papers for training the comparison methods. Additionally, considering that the image input size for the CLCD dataset is 512 × 512, whereas the default batch size of the comparison methods is often configured for an image size of 256 × 256, we reduced the batch size to 1/4 of the original values when training the comparison methods on the CLCD dataset. Specifically, the learning rate, batch size, and epochs for FC-EF, FC-Siam-Diff, and FC-Siam-Conc were set to 0.01, 16, and 200, respectively. For STANet, they were set to 0.001, 4, and 200; for DTCDSCN, 0.001, 16, and 200; for ChangeFormer, 0.0001, 16, and 200; for BIT, 0.01, 16, and 200; for ICIF-Net, 0.01, 8, and 200; for DMINet, 0.01, 16, and 250; and for WNet, 0.0001, 16, and 250.
3.2. Comparisons with State-of-the-Art
3.2.1. Results on LEVIR-CD Dataset
Table 2 provides a comprehensive comparison of various methods on the LEVIR-CD dataset. In general, our MIN-Net outperforms the alternative methods, particularly in terms of F1, IoU, and OA. Methods like BIT, ChangeFormer, and WNet exhibit better performance, owing to superior feature representation. The hybrid advantages of combining CNN and Transformer architectures contribute to the success of ICIF-Net, achieving more promising results. Thanks to the multistage temporal interaction, our MIN-Net demonstrates enhanced capabilities in suppressing irrelevant positions and channels, effectively addressing background complexity and pseudo-changes. As a result, MIN-Net shows a notable improvement in performance, achieving a gain of 0.77% in the F1 score over the second-best method, DMINet. Overall, the superior performance clearly demonstrates the effectiveness of our method in capturing temporal dependencies for enhanced change detection.
We present a qualitative comparison on the LEVIR-CD dataset in Figure 5. In the figure, true positives and true negatives are denoted by white and black, respectively, while false positives and false negatives are indicated by green and red. Here, we focus on the visual results of BIT, ChangeFormer, ICIF-Net, DMINet, WNet, and our MIN-Net, considering their superior performance over the other methods. Overall, our method surpasses the alternatives, with fewer false positives and false negatives, providing a better match with the ground truth. This is particularly evident in the third scene, where all other methods exhibit obvious false negatives. Augmented by the multistage interaction to address background complexity and pseudo-changes, our MIN-Net effectively extracts the actual changes.
3.2.2. Results on WHU-CD Dataset
We present the performance results of all methods on the WHU-CD dataset in Table 3. Notably, our method consistently outperforms the other approaches, demonstrating a significant gain of 2.95% in the F1 score over the second-best method, BIT. The visual comparison in Figure 6 highlights the effectiveness of our MIN-Net, showcasing superior performance with fewer false positives and negatives. This underscores the efficacy of MIN-Net in change detection, attributed to its three stages of interaction. These interactions contribute to shortening the semantic gap between bitemporal images and effectively suppressing interruptions from complex backgrounds.
3.2.3. Results on CLCD Dataset
The CLCD dataset presents increased complexity, with multiple types of changes related to cropland, and the number of training samples is notably smaller than in the LEVIR-CD and WHU-CD datasets. Consequently, as shown in Table 4, all methods exhibit a noticeable performance drop compared to their results on the LEVIR-CD and WHU-CD datasets. Despite these challenges, our MIN-Net, leveraging image-level, feature-level, and decision-level interactions, demonstrates very promising performance by more accurately detecting change information between bitemporal images. The imaging complexity is visually evident in Figure 7. Nevertheless, our MIN-Net exhibits higher robustness, which is particularly noticeable in the last image with low resolution and numerous textures. The feature-level interaction plays a crucial role in enhancing change information extraction and semantic alignment, resulting in improved change boundaries.
3.3. Ablation Study
Here, we provide an ablation study on different modules and spatial–channel interactions to showcase their effect.
3.3.1. Effectiveness of Different Modules
In this section, we conduct an ablation study, focusing on image-level and feature-level interaction. The ablation study for decision-level interaction is omitted since it is essential for producing and fusing the different-level feature differences. As presented in Table 5, the inclusion of feature-level interaction proves effective in encouraging the network to extract variant information between bitemporal images, thereby reducing the semantic gap and resulting in performance improvement. Image-level interaction serves to compensate for the potential loss of change information and contributes to enhanced change detection performance. The combination of image-level and feature-level interaction yields hybrid advantages, culminating in the highest performance.
3.3.2. Effectiveness of Spatial and Channel Interaction Blocks
In this study, we investigate the impact of spatial and channel interaction, as presented in Table 6. The removal of spatial interaction results in the failure to guide the network in extracting change features, leading to a noticeable performance drop. Similarly, removing channel interaction prevents the network from focusing on shared semantics, resulting in decreased performance. Overall, this experiment clearly demonstrates the effectiveness of both spatial and channel interaction for robust feature extraction.
3.4. Discussion
As evidenced by the improved performance on the WHU-CD and LEVIR-CD datasets, our method offers valuable insights into urban construction. Furthermore, the superior performance on the CLCD dataset suggests that our approach can effectively monitor cropland areas.
To provide a comprehensive assessment of our method, we present some challenging cases in Figure 8. As depicted, all methods struggle to accurately extract all changes, possibly due to buildings being obscured by trees. Incorporating global feature extraction may help overcome this limitation, which is an avenue for future research.