Article

Unsupervised Noise-Resistant Remote-Sensing Image Change Detection: A Self-Supervised Denoising Network-, FCM_SICM-, and EMD Metric-Based Approach

by Jiangling Xie 1, Yikun Li 1,2,3,*, Shuwen Yang 1,2,3 and Xiaojun Li 1,2,3

1 Faculty of Geomatics, Lanzhou Jiaotong University, Lanzhou 730070, China
2 National-Local Joint Engineering Research Center of Technologies and Applications for National Geographic State Monitoring, Lanzhou 730070, China
3 Gansu Provincial Engineering Laboratory for National Geographic State Monitoring, No. 88 Anning West Road, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(17), 3209; https://doi.org/10.3390/rs16173209
Submission received: 7 June 2024 / Revised: 24 August 2024 / Accepted: 28 August 2024 / Published: 30 August 2024

Abstract: The detection of change in remote-sensing images is broadly applicable to many fields. In recent years, both supervised and unsupervised methods have demonstrated an excellent capacity to detect changes in high-resolution images. However, most of these methods are sensitive to noise, and their performance significantly deteriorates when dealing with remote-sensing images contaminated by mixed random noise. Moreover, supervised methods require samples to be manually labeled for training, which is time-consuming and labor-intensive. This study proposes a new unsupervised change-detection (CD) framework that is resilient to mixed random noise: self-supervised denoising network-based unsupervised change detection coupling FCM_SICM and EMD (SSDNet-FSE). It consists of two components, namely a denoising module and a CD module. The proposed method first utilizes a self-supervised denoising network with a real 3D weight attention mechanism to reconstruct noisy images. Then, a noise-resistant fuzzy C-means clustering algorithm (FCM_SICM) is used to decompose the mixed pixels of the reconstructed images into multiple signal classes by exploiting local spatial information, spectral information, and membership linking. Next, the noise-resistant Earth mover's distance (EMD) is used to calculate the distance between the signal-class centers and the corresponding fuzzy memberships of bitemporal pixels and generate a change-magnitude map. Finally, automatic thresholding is undertaken to binarize the change-magnitude map into the final CD map. The results of experiments conducted on five public datasets prove the superior noise resistance of the proposed method over six state-of-the-art CD competitors and confirm its effectiveness and potential for practical application.

1. Introduction

In recent years, various advanced remote-sensing and computer technologies have undergone rapid development, which has brought about several opportunities and challenges in monitoring the evolution of the Earth's surface and natural environment. Remote-sensing change detection (RS CD), one of the most important topics in the field of remote sensing, is a potent means of acquiring real-time and accurate information about changes on the Earth's surface. Specifically, RS CD represents a highly effective and efficient method of detecting changes in land use. Thus, it is a potent tool for protecting the ecological environment, managing natural resources, researching social development, and understanding the relationship between humans and the natural world [1,2]. In the initial stages of RS CD, due to limitations in spatial resolution, changes occurring on Earth could only be detected on large spatial scales. For this reason, RS CD was predominantly used for land surveys, urban studies, ecosystem monitoring, and disaster monitoring and assessment. With the advent of new-generation remote-sensing devices capable of acquiring very high-resolution remote-sensing (VHRRS) images (i.e., images with a resolution of less than 1 m/pixel), researchers now have a unique opportunity to monitor the Earth in rich detail [3]. Because of the availability of VHRRS images, current CD technologies can effectively and efficiently capture information about on-the-ground changes in great detail and accurately detect critical changes on the Earth's surface [4,5]. In recent decades, researchers have proposed numerous RS CD methods, which can be roughly categorized as supervised or unsupervised. Supervised RS CD methods incorporate expert knowledge and, for this reason, often produce highly accurate results. However, supervised RS CD algorithms require manually provided training data, which incur high labor and time costs and thus hinder the large-scale application of such algorithms to real-time scenarios [6]. Unsupervised change-detection algorithms require no human intervention and therefore run at high processing speeds. As a result, unsupervised methods are more suitable for real-time CD scenarios and have attracted significant attention both within and outside remote-sensing communities [7].
Recently, with the continuous development of RS technology, both traditional unsupervised CD approaches and deep-learning-based unsupervised CD methods have achieved satisfactory CD accuracy in handling middle-resolution RS (MRRS) images and high-resolution RS (HRRS) images [8]. Traditional CD methods include image ratio [9], change vector analysis (CVA) [10], principal component analysis (PCA) [11], multivariate alteration detection (MAD) [12], iteratively reweighted multivariate alteration detection (IR-MAD) [13], and change detection combining PCA and K-means (PCAKMeans) [14]. The success of unsupervised CD methods based on deep-learning techniques in RS CD cannot be overlooked either [15,16,17,18]. Tang et al. developed an unsupervised CD method (GMCD) based on graph convolutional networks and metric learning. By extracting information on multiple scales with a Siamese FCN, integrating it to produce reliable difference representations, and employing a pseudo-label generation mechanism to produce reliable pseudo-labels, they facilitated model training in an unsupervised manner [19]. Wu et al. built a deep Siamese kernel principal component analysis convolution mapping network (KPCAMNet) to extract high-level spectral–spatial feature maps and generate a final CD map [20]. Saha et al. developed a deep change vector analysis (DCVA) framework, fully utilizing multi-layer deep features extracted by a convolutional neural network (CNN) to determine which pixels had undergone change [21]. Transformer-based methods have recently made remarkable strides in the field of change detection. However, supervised approaches are inherently resource-intensive and demand high computational complexity. Notably, as the number of tokens increases, the computational demand escalates quadratically, meaning that large-scale inputs, such as optical high-resolution images, become difficult to process [22]. Moreover, such supervised transformer models, which are based on an attention mechanism, necessitate a vast amount of training data, among which the quality of the samples is paramount for achieving accurate CD outcomes [23]. In contrast, while unsupervised transformer-based methods circumvent reliance on extensive training datasets, they exhibit limited robustness to noise, rendering them vulnerable to random noise during the generation of pseudo-labels. For example, Trans-MAD [24] creates pseudo-training labels by integrating CD results from C2VA and MAD, processing the deep features extracted by the network via IR-MAD, and finalizing the CD output through a decision fusion module. Previous studies have demonstrated that C2VA and MAD are highly sensitive to random mixed noise, resulting in unreliable CD outputs. Consequently, the labels generated for neural network training may be inaccurate, leading to suboptimal final CD results. Generally, traditional and deep-learning-based CD methods have achieved satisfactory results in handling a variety of remote-sensing data. However, when the target remote-sensing images are contaminated by various types of noise, the critical details of change are often disrupted, and the efficacy of most unsupervised CD methods cannot be guaranteed. Therefore, it is crucial that we integrate image-denoising techniques into unsupervised CD methods to accomplish noise-resistant RS CD [25].
Image denoising is a fundamental task in RS image processing; it is crucial for maintaining the visual quality of acquired images and mitigating noise-induced adverse effects on subsequent image analysis and processing tasks [26]. Denoising techniques effectively extract feature information from remote-sensing images while reducing noise pollution, thus preserving the textural features of images. CNN-based denoising approaches are more effective than traditional denoising methods and have, therefore, become mainstream. DnCNN [27], introduced by Zhang et al. in 2017, is a notable deep-learning-based denoising algorithm that has become a mainstay in image processing. This method employs a deep convolutional network to extract complex features from images by increasing the network depth. PRIDNet [28] utilizes channel attention mechanisms to recalibrate the importance of input noise channels. It then employs a pyramid pooling method to extract multi-scale features. In the final stage, feature fusion is achieved through kernel selection operations, which adaptively combine multi-scale features. DCCNet [29] generates multi-resolution inputs through discrete wavelet transform and shuffling operations. By carrying out convolution operations on low-resolution inputs, DCCNet significantly reduces the number of network parameters. A deep end-to-end persistent memory network (MemNet) [30] has also been proposed for image restoration, within which a memory block incorporates a gating mechanism to address the long-term dependency problem of earlier CNN architectures. CSANN [31] concatenates the noise level with the average and maximum values of each channel as the input and features a convolutional network that can learn the relationships between channels. Simultaneously, it combines the noise-level map with the average and maximum values of each spatial location as the input and uses a convolutional network to learn the relationships between spatial locations. Experimental results obtained via this algorithm have indicated improved visual quality and higher PSNR values. Although image-denoising algorithms represent a significant development in the field, they still have certain limitations. Traditional image-denoising algorithms [32] often perform well on specific types of noise but may be less effective when dealing with mixed-noise types. Moreover, these algorithms generally lack adaptability when processing noise of varying intensity, typically requiring manual parameter adjustments to address different noise types and levels. While supervised deep-learning-based denoising algorithms can achieve superior denoising results, they are highly dependent on the quality and diversity of the corresponding training data. If the training data do not include a sufficient variety of noise types or scenes, the model's performance may be suboptimal when handling new data. Overall, image-denoising algorithms face the inherent challenge of balancing the preservation of image details with noise removal. Excessive denoising can result in the loss of details, such as edges and textures, while insufficient denoising may leave residual noise. Striking this balance is particularly challenging in practice.
Therefore, for effective noise-resistant RS CD, in this study, inspired by [33], we propose an unsupervised CD framework that couples a novel self-supervised denoising network (containing realistic three-dimensional attention weights) with a novel unsupervised method of detecting change. This framework is designed to detect changes within noisy images. Initially, two input images are processed separately by two self-supervised denoising networks (SSDNets). Each SSDNet exploits information from the spatial and channel dimensions of RS images and is trained on a single image without additional parameters. As a result, the reconstructed RS image has a reduced level of noise, and its detailed spatial information is preserved. Subsequently, the reconstructed images are input into the noise-resistant FCM_SICM [34] for fuzzy clustering, which yields pixel-wise signal-class centers and fuzzy membership matrices. FCM_SICM can capture global and local spatial and spectral information from RS images. It adaptively adjusts the size of the spatial weight matrix to accommodate varying degrees of noise. By decomposing pixels into several signal classes through fuzzy clustering, the issue of mixed pixels in RS images is addressed in a noise-resistant way. This approach also prevents noise from being overly emphasized, thereby reducing the impact of noise on the final clustering results. Then, the EMD [35] metric, which is based on a linear optimization technique (that is, the simplex method) and has the capacity to resist noise, is utilized to establish a weight-matched relationship between the signal classes of bitemporal pixels to compute a map of pixel-level changes in magnitude. Finally, an automatic thresholding algorithm is used to binarize the map of change magnitude into the final CD result. The main contributions of this study are as follows.
(1)
We designed SSDNet-FSE, an unsupervised change-detection framework that is resilient to mixed random noise. By combining image-denoising techniques with change-detection methods, this framework enables us to detect changes within bitemporal RS images under noisy conditions. It achieves higher CD accuracy with more compact internal structures and finer boundaries;
(2)
We propose a self-supervised image-denoising method designed explicitly for RS images. This network leverages information from both the spatial and channel dimensions of RS images and is trained using only one image, without additional parameters. It effectively handles mixed-noise scenarios, reconstructing noise-reduced RS images while preserving detailed texture information. The method strikes a favorable balance between noise reduction and detail preservation, resulting in satisfactory denoising performance. This approach is particularly well suited for subsequent CD tasks;
(3)
The CD component of the proposed framework comprises two techniques, FCM_SICM and EMD, that synergistically exhibit good noise resilience. Experimental results demonstrate that the FCM_SICM-EMD method is robust and effective for performing CD tasks under noisy conditions.

2. Materials and Methods

2.1. Overview

The comprehensive framework of SSDNet-FSE is depicted in Figure 1; its architecture comprises two major stages. Stage 1 involves self-supervised denoising of the noise-contaminated RS image; stage 2 employs FCM_SICM and EMD to accomplish unsupervised RS CD. In the initial stage, the original noise-polluted image is fed into the proposed self-supervised denoising network, SSDNet. In SSDNet, the SimAM attention mechanism [36] generates authentic three-dimensional weights; different features are weighted, and information loss is attenuated during feature extraction. In the resulting denoised image, boundaries and fine textural details are successfully restored. In the subsequent stage, the noise-reduced images are fed directly into the noise-robust FCM_SICM module, which adjusts the number and centroids of the signal classes for each individual RS image to accurately reflect its unique spectral and spatial characteristics. Because the bitemporal RS images do not share identical sets of signal classes, a precise one-to-one correspondence between the two signal-class sets is unfeasible [37]. For this reason, we utilize the EMD metric to establish a many-to-many correspondence between the two signal-class sets of the bitemporal RS images. Eventually, the spectral disparities between the pixels of the bitemporal images are accurately measured via the EMD, yielding a change-magnitude map. As the EMD is computed by a linear optimization technique, i.e., the simplex method, it has some capacity to resist noise, which is of great utility in CD tasks involving noisy images. Subsequent processing involves the application of thresholding algorithms and morphological operations to produce the final CD map.
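For illustration, the two-stage pipeline can be sketched in Python as follows. This is a minimal sketch, not the exact implementation: ssdnet_denoise is a hypothetical wrapper for the denoiser of Section 2.2, while fcm_sicm and emd_distance correspond to the sketches given in Section 2.3.

```python
# Minimal sketch of the SSDNet-FSE pipeline (Figure 1). The helpers
# ssdnet_denoise, fcm_sicm, and emd_distance are hypothetical
# stand-ins for the components described in Sections 2.2 and 2.3.
import numpy as np
from skimage.filters import threshold_otsu

def ssdnet_fse(img_t1, img_t2, n_clusters=10, m=2.0):
    """Return a binary change map for two noisy, co-registered images."""
    # Stage 1: self-supervised denoising of each image (Section 2.2).
    x1, x2 = ssdnet_denoise(img_t1), ssdnet_denoise(img_t2)
    # Stage 2: noise-robust fuzzy clustering (Section 2.3.1) yields
    # per-image signal-class centers and pixel-wise fuzzy memberships.
    c1, u1 = fcm_sicm(x1, n_clusters=n_clusters, m=m)
    c2, u2 = fcm_sicm(x2, n_clusters=n_clusters, m=m)
    # Pixel-wise EMD between the two signatures (Section 2.3.2).
    h, w = img_t1.shape[:2]
    magnitude = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            magnitude[i, j] = emd_distance(c1, u1[i, j], c2, u2[i, j])
    # Binarize the change-magnitude map with Otsu's threshold.
    return magnitude > threshold_otsu(magnitude)
```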

2.2. SSDNet

The classic unsupervised denoising method Noise2Noise [38] is limited by two major shortcomings. First, it requires the pre-generation of a synthetic noise sample from a statistical model. Second, it needs to pair existing noisy data to train a denoising neural network. However, due to the limited availability of RS imagery, it is not always possible to collect sufficient RS image pairs for training. The Neighbor2Neighbor [39] self-supervised denoising network achieves ideal denoising results due to its excellent sampling strategy and objective function, but this method still requires a suitable RS image dataset to train its denoiser. Moreover, the Noise2Noise and Noise2Void [40] algorithms can prevent model overfitting by masking some pixels of the input image and exploiting neighboring pixels to recover the critical information lost during image reconstruction. However, this often blurs the details in reconstructed images. Therefore, neither denoising method meets our standards for constructing a noise-resistant unsupervised RS CD framework with a satisfactory performance.
Inspired by [41], in this study, we propose a novel self-supervised network (Figure 2) to denoise RS images and thereby address the aforementioned challenges. This method uses only a single noise-polluted RS image as input and can produce denoised images with well-preserved textural details and edges. The overall network structure is shown in Figure 3. To enable the neural network to focus more effectively on critical information in RS images (i.e., clearer boundaries and contours), we incorporated an attention module into the model. Currently, most attention modules focus on either the channel domain or the spatial domain, although both mechanisms contribute to how the human visual system processes information. Directly estimating complete three-dimensional weights, however, is challenging. For instance, CBAM [42] estimates one-dimensional and two-dimensional weights and combines them, but it cannot directly generate true three-dimensional weight values. Therefore, we added a parameter-free attention module, SimAM, to the network's feature-extraction process, as depicted in Figure 2. Inspired by visual neuroscience, SimAM effectively generates real three-dimensional weights and assigns each neuron a unique weight by defining an energy function for each neuron. Put simply, the energy function adopts binary labels with an added regularization term. The final energy function is defined as follows:
$$e_t\left(w_t, b_t, \mathbf{y}, x_i\right) = \frac{1}{M-1}\sum_{i=1}^{M-1}\left[-1-\left(w_t x_i + b_t\right)\right]^2 + \left[1-\left(w_t t + b_t\right)\right]^2 + \lambda w_t^2 \quad (1)$$

$$w_t = -\frac{2\left(t-\mu_t\right)}{\left(t-\mu_t\right)^2 + 2\sigma_t^2 + 2\lambda} \quad (2)$$

$$b_t = -\frac{1}{2}\left(t + \frac{1}{M-1}\sum_{i=1}^{M-1} x_i\right) w_t \quad (3)$$

$$\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i \quad (4)$$

$$\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(x_i-\mu_t\right)^2 \quad (5)$$

where $t$ and $x_i$ represent the target neuron and the other neurons within a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$, $i$ indexes the spatial dimensions, and $M = H \times W$ denotes the number of neurons in that channel. $\lambda$ is the regularization parameter, and $w_t$ and $b_t$ denote the weight and bias transformations, respectively. $\mu_t$ and $\sigma_t^2$ are the mean and variance calculated over all neurons in that channel except $t$. Theoretically, each channel has $M$ energy functions, so solving all these equations directly is computationally intensive. Nonetheless, there is a rapid closed-form solution for $w_t$ and $b_t$, which can be derived by rewriting the energy function as follows.

$$e_t^* = \frac{4\left(\hat{\sigma}^2+\lambda\right)}{\left(t-\hat{\mu}\right)^2 + 2\hat{\sigma}^2 + 2\lambda} \quad (6)$$

where $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}\left(x_i-\hat{\mu}\right)^2$.
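Because of this closed form, SimAM adds no learnable parameters. As a concrete illustration, the following PyTorch sketch implements the module along the lines of the published SimAM formulation, scaling the input feature by the sigmoid of the inverse energy; the default λ = 1e-4 follows the SimAM paper, and the exact integration into SSDNet may differ.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3D attention via the closed-form energy e_t* (Eq. 6)."""
    def __init__(self, lambda_: float = 1e-4):
        super().__init__()
        self.lambda_ = lambda_

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); n = M - 1 neurons besides the target one
        _, _, h, w = x.shape
        n = h * w - 1
        # squared deviation of each neuron from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel variance estimate (deviations summed over M - 1)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy: lower energy e_t* means higher importance
        e_inv = d / (4 * (v + self.lambda_)) + 0.5
        # scale features by the sigmoid of the inverse energy
        return x * torch.sigmoid(e_inv)
```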
In alignment with the U-Net architecture, the SSDNet network is structured as a convolutional encoder–decoder neural network. Initially, the encoder converts the input image (H × W × C) into feature maps of size H × W × 48 using partial convolution (PConv) layers, which are followed by four additional encoder blocks that further process these feature maps. Each of the first four blocks includes a PConv layer, a leaky rectified linear unit (LeakyReLU), a SimAM layer, and a max-pooling layer with a 2 × 2 receptive field. The final encoder block consists of a PConv layer, a LeakyReLU layer, and a SimAM layer. The encoder’s output is a set of feature maps with the following dimensions: (H/32) × (W/32) × 48. The decoder comprises five blocks, among which the first four blocks contain an upsampling layer, a concatenation operation from U-Net, and two convolutional layers with dropout and LeakyReLU. The upsampling layers double the size of the feature maps. The concatenation operation merges the LeakyReLU outputs with the upsampled feature maps. The last decoder block includes three standard dropout convolutional layers with channel sizes of 64, 32, and C, ultimately converting the feature maps back to an image of size H × W × C.
Before model training, we perform Bernoulli random sampling on the input noisy image $y$, according to Equation (7).

$$\hat{y}_k = b_k \odot y; \qquad \bar{y}_k = \left(1-b_k\right) \odot y \quad (7)$$

where $\{\hat{y}_k\}_{k=1}^{K}$ and $\{\bar{y}_k\}_{k=1}^{K}$ represent the Bernoulli-sampled image pairs, $\odot$ denotes element-wise multiplication, and $b_k$ represents a Bernoulli sampling instance with a probability between 0 and 1. We thereby obtain image pairs whose sampled instances differ from $y$ but still contain most of its information. The proposed method then minimizes the loss function defined by Equation (8) to train the denoiser.
$$\min_\theta \sum_{k=1}^{K}\left\|f_\theta\left(\hat{y}_k\right)-x\right\|_{b_k}^2 + \sum_{k=1}^{K}\sigma_{b_k}^2 \quad (8)$$
where $K$ represents the number of Bernoulli random samplings and $f_\theta$ represents the denoiser; for any $f_{\theta_k}$, $\sigma_{b_k}$ denotes the standard deviation of the $k$th denoiser. In the loss function, only the sampled image pixels are used to compute the loss for each pair $(\hat{y}_k, \bar{y}_k)$. Since $b_k$ is randomly generated, when the number of iterations is sufficient, the difference in image pixels can be measured by the total loss over all image pairs. After the denoiser converges, the obtained denoiser $f_\theta$ and its output are applied in the prediction stage. In the prediction stage, the original noise-contaminated image is again subjected to Bernoulli random sampling to generate multiple noisy image instances. Then, based on the denoiser obtained in the training stage, multiple new denoisers ($f_{\theta_1}, f_{\theta_2}, \ldots, f_{\theta_n}$) are generated and applied to the noise-polluted images. This strategy effectively prevents model overfitting. Subsequently, the clean images ($\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n$) are averaged to obtain the final result, defined as follows:
$$x^* = \frac{1}{N}\sum_{n=1}^{N} f_{\theta_n}\left(\hat{y}_n\right) \quad (9)$$

where $\hat{y}_n$ represents the $n$th Bernoulli sampling instance of the noise-polluted image $y$.
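A simplified PyTorch sketch of this Bernoulli-sampled training and averaged prediction scheme is given below; the denoiser f_theta, its dropout-based instantiation, and the sampling probability p = 0.3 are illustrative assumptions rather than the exact settings used here.

```python
import torch

def bernoulli_pair(y: torch.Tensor, p: float = 0.3):
    """Split a noisy image y into a complementary (y_hat, y_bar) pair, Eq. (7)."""
    b = torch.bernoulli(torch.full_like(y, p))  # sampling mask b_k
    return b * y, (1 - b) * y, b

def train_step(f_theta, optimizer, y, p=0.3):
    """One self-supervised step: predict the held-out pixels of y."""
    y_hat, y_bar, b = bernoulli_pair(y, p)
    # evaluate the loss only on the pixels dropped from y_hat, so the
    # network never sees its own input as the target
    loss = (((f_theta(y_hat) - y_bar) * (1 - b)) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(f_theta, y, n_samples: int = 50, p: float = 0.3):
    """Average N predictions over fresh Bernoulli instances, Eq. (9)."""
    outs = [f_theta(bernoulli_pair(y, p)[0]) for _ in range(n_samples)]
    return torch.stack(outs).mean(dim=0)
```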

2.3. FCM_SICM-EMD

2.3.1. FCM_SICM

In this section, we propose a CD algorithm that couples FCM with adaptive spatial and intensity constraints and fuzzy memberships, along with the EMD metric. The traditional FCM is susceptible to noise interference, causing its clustering results to be adversely disrupted by noise. Moreover, traditional spatial FCM methods, such as FCM_S1/S2, typically consider only the spatial relationships between the center pixel and its neighboring pixels [43]. Prior studies [44] have demonstrated that incorporating local spatial and intensity information effectively preserves the edges of images, albeit at high computational cost. Therefore, we need a method that simultaneously introduces local spatial and intensity information into FCM while reducing computational complexity. To this end, we adopt a hybrid approach that combines local spatial and intensity information with membership linking, termed FCM_SICM. Its objective function is defined as follows:
$$J^{(l)} = \sum_{i=1}^{K}\sum_{j=1}^{N}\alpha \left(u_{ij}^{(l)}\right)^{m}\left\|y_j-c_i^{(l)}\right\|^2 + \sum_{i=1}^{K}\sum_{j=1}^{N}\beta \left(u_{ij}^{(l)}\right)^{m}\left\|\bar{y}_j-c_i^{(l)}\right\|^2 \quad (10)$$
where $\alpha$ and $\beta$ denote the constraints imposed on the original image and the fast-bilateral-filtered image, respectively, and $\bar{y}_j$ represents the $j$th pixel in the fast-bilateral-filtered image. Before entering the iterative clustering process, the image undergoes fast bilateral filtering to obtain local spatial-intensity information. Considering the absolute difference between the original image and the spatial-intensity information, a simple adaptive constraint is applied to the original FCM algorithm. This strategy constrains the spatial and intensity information and decreases computational complexity. However, this approach utilizes Lagrange multipliers to update the memberships and cluster centers without considering the necessary number of iteration steps. For the $i$th cluster, the sum of the memberships calculated in the previous iteration can be used to reduce the number of iteration steps. Thus, in this study, membership linking, $M$, is adopted to prevent the premature convergence of the objective function before satisfactory results have been achieved and to better differentiate the distinct clusters. Membership linking, $M$, is defined as follows:
$$M = \ln^2\left(\sum_{e=1}^{N} u_{ie}^{(l-1)} + 1\right) \quad (11)$$
Therefore, when incorporating M as the denominator of the objective function, we obtain:
$$J^{(l)} = \sum_{i=1}^{K}\sum_{j=1}^{N}\frac{\alpha \left(u_{ij}^{(l)}\right)^{m}\left\|y_j-c_i^{(l)}\right\|^2 + \beta \left(u_{ij}^{(l)}\right)^{m}\left\|\bar{y}_j-c_i^{(l)}\right\|^2}{\ln^2\left(\sum_{e=1}^{N} u_{ie}^{(l-1)} + 1\right)} \quad (12)$$
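The following single-band NumPy sketch illustrates one plausible reading of this objective. The alternating membership/center updates and the placement of the membership-linking term as a per-cluster scaling are our interpretation of Equations (10)-(12), not the exact update rules; multiband data and the adaptive α/β weighting are omitted for brevity.

```python
import numpy as np
import cv2  # provides fast bilateral filtering

def fcm_sicm(img, n_clusters=10, m=2.0, alpha=0.5, beta=0.5,
             max_iter=100, eps=1e-5):
    """Simplified single-band sketch of the FCM_SICM objective (Eq. 12)."""
    y = img.reshape(-1).astype(np.float64)
    # local spatial-intensity information from fast bilateral filtering
    y_bar = cv2.bilateralFilter(img.astype(np.float32), 5, 25, 25)
    y_bar = y_bar.reshape(-1).astype(np.float64)
    n = y.size
    u = np.random.dirichlet(np.ones(n_clusters), size=n)  # memberships
    for _ in range(max_iter):
        u_old = u
        um = u ** m
        # cluster centers weighted by both data terms
        c = (um.T @ (alpha * y + beta * y_bar)) / ((alpha + beta) * um.sum(0))
        # squared distances of both terms to each center
        d = (alpha * (y[:, None] - c) ** 2
             + beta * (y_bar[:, None] - c) ** 2)
        # membership linking scales distances per cluster (Eq. 11)
        d = d / (np.log(u.sum(axis=0) + 1) ** 2)
        # standard FCM membership update on the scaled distances
        u = 1.0 / (d + 1e-12) ** (1.0 / (m - 1))
        u /= u.sum(axis=1, keepdims=True)
        if np.abs(u - u_old).max() < eps:
            break
    return c, u.reshape(*img.shape, n_clusters)
```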

2.3.2. EMD

The Earth mover's distance measures the minimum normalized cost of transforming one probability distribution into another, thereby quantifying the distance between the two distributions. In the context of CD in RS images, this manifests as the following linear programming problem. Let $f_i^1 = \{(c_1, u_1(p_i^1)), \ldots, (c_k, u_k(p_i^1)), \ldots, (c_n, u_n(p_i^1))\}$ represent the features of pixel $p_i^1$ in the time-1 image, where $c_k$ is the signal center of signal class $s_k$, $u_k(p_i^1)$ is the fuzzy membership of pixel $p_i^1$ with respect to signal class $s_k$, and $n$ is the number of signal classes in the first temporal image. Likewise, let $f_i^2 = \{(c_1, u_1(p_i^2)), \ldots, (c_l, u_l(p_i^2)), \ldots, (c_m, u_m(p_i^2))\}$ represent the features of pixel $p_i^2$ in the second temporal image, where $c_l$ is the signal center of signal class $s_l$, $u_l(p_i^2)$ is the fuzzy membership of pixel $p_i^2$ with respect to signal class $s_l$, and $m$ is the number of signal classes in the second image. The EMD is then used to discern the minimum cost of transforming feature $f_i^1$ into feature $f_i^2$, yielding the magnitude of change between pixels in the bitemporal images. The specific formula and relevant parameters of the EMD are as follows.
The EMD finds the optimal flow matrix $W = [w_{kl}]$ by computing the minimum cost of transforming feature $f_i^1$ into feature $f_i^2$, as shown in Equation (13).
$$\mathrm{WORK}\left(f_i^1, f_i^2, W\right) = \sum_{k=1}^{n}\sum_{l=1}^{m} w_{kl}\, d_{kl} \quad (13)$$
where $d_{kl}$ represents the spectral difference between signal centers $c_k$ and $c_l$. The optimal flow matrix $W$ reflects the matching relationships and weights between the signal classes in $f_i^1$ and $f_i^2$, where $w_{kl}$ is the matching weight between signal classes $s_k$ and $s_l$. Equation (13) should satisfy the following constraints:
$$w_{kl} \geq 0, \quad 1 \leq k \leq n,\ 1 \leq l \leq m \quad (14)$$

$$\sum_{l=1}^{m} w_{kl} \leq u_k\left(p_i^1\right), \quad 1 \leq k \leq n \quad (15)$$

$$\sum_{k=1}^{n} w_{kl} \leq u_l\left(p_i^2\right), \quad 1 \leq l \leq m \quad (16)$$

$$\sum_{k=1}^{n}\sum_{l=1}^{m} w_{kl} = 1 \quad (17)$$
Constraint (14) specifies that the matching of signal classes should only occur from feature $f_i^1$ to feature $f_i^2$. Constraint (15) stipulates that a signal class $s_k$ in pixel feature $f_i^1$ cannot be assigned a total matching weight greater than its fuzzy membership $u_k(p_i^1)$. Constraint (16) specifies that a signal class $s_l$ in pixel feature $f_i^2$ cannot be assigned a total matching weight greater than its fuzzy membership $u_l(p_i^2)$. Constraint (17) limits the sum of the total matching weights to 1, referred to as the total flow. The optimal flow $W$ calculated using the EMD reflects the optimal matching relationships and weights of signal classes between features $f_i^1$ and $f_i^2$. Due to Constraint (17), the formula can be simplified, and the EMD is defined as follows.
$$\mathrm{EMD}\left(f_i^1, f_i^2\right) = \frac{\sum_{k=1}^{n}\sum_{l=1}^{m} w_{kl}\, d_{kl}}{\sum_{k=1}^{n}\sum_{l=1}^{m} w_{kl}} = \sum_{k=1}^{n}\sum_{l=1}^{m} w_{kl}\, d_{kl} \quad (18)$$
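The transportation problem in Equations (13)-(17) can be solved with any linear-programming routine. The sketch below uses SciPy's linprog for clarity; the Euclidean ground distance for $d_{kl}$ is an assumption, and dedicated EMD solvers are considerably faster in practice.

```python
import numpy as np
from scipy.optimize import linprog

def emd_distance(centers1, u1, centers2, u2):
    """EMD between two fuzzy pixel signatures via the transportation LP.

    centers1: (n, bands) signal-class centers of image 1
    u1:       (n,) fuzzy memberships of the pixel in image 1 (sum to 1)
    centers2, u2: likewise for image 2 (m classes)
    """
    n, m = len(u1), len(u2)
    # ground distance d_kl between every pair of class centers
    d = np.linalg.norm(centers1[:, None, :] - centers2[None, :, :], axis=-1)
    # flow variables w_kl flattened row-major; constraints (15)-(16)
    A_ub, b_ub = [], []
    for k in range(n):                      # row sums <= u1[k]
        row = np.zeros(n * m); row[k * m:(k + 1) * m] = 1
        A_ub.append(row); b_ub.append(u1[k])
    for l in range(m):                      # column sums <= u2[l]
        col = np.zeros(n * m); col[l::m] = 1
        A_ub.append(col); b_ub.append(u2[l])
    A_eq = [np.ones(n * m)]                 # total flow = 1, constraint (17)
    res = linprog(d.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None))         # w_kl >= 0, constraint (14)
    return res.fun                          # = sum of w_kl * d_kl, Eq. (18)
```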
To further illustrate the characteristics of the EMD algorithm, this study qualitatively analyzes how the EMD establishes a many-to-many weight-matched relationship between the signal classes in $f_i^1$ and $f_i^2$ in the following four cases. ① If the spectral difference $d_{kl}$ between signal classes $s_k$ and $s_l$ is small and both fuzzy memberships $u_k(p_i^1)$ and $u_l(p_i^2)$ are relatively large, then $s_k$ and $s_l$ represent typical spectral characteristics of pixels $p_i^1$ and $p_i^2$, with only minor differences. Consequently, they should be assigned larger matching weights $w_{kl}$, resulting in a smaller final EMD (which means that this pixel will likely be identified as unchanged). ② If the spectral difference $d_{kl}$ between $s_k$ and $s_l$ is small but the fuzzy membership $u_k(p_i^1)$ or $u_l(p_i^2)$ is small, then $s_k$ or $s_l$ is not a typical spectral characteristic of pixel $p_i^1$ or $p_i^2$. Therefore, although the spectral difference $d_{kl}$ is small, only small matching weights $w_{kl}$ should be assigned, constrained by the memberships $u_k(p_i^1)$ and $u_l(p_i^2)$ (Constraints (15) and (16)), to ensure that $d_{kl}$ occupies only a small proportion of the total EMD. ③ If the spectral difference $d_{kl}$ between $s_k$ and $s_l$ is large and both memberships $u_k(p_i^1)$ and $u_l(p_i^2)$ are also large, then $s_k$ and $s_l$ represent typical spectral characteristics of pixels $p_i^1$ and $p_i^2$. Despite the large spectral difference $d_{kl}$, larger matching weights $w_{kl}$ should still be assigned. This results in a larger final EMD, making it more likely that the bitemporal pixels are classed as changed. ④ If the spectral difference $d_{kl}$ between $s_k$ and $s_l$ is large but the membership $u_k(p_i^1)$ or $u_l(p_i^2)$ is small, then $s_k$ or $s_l$ is not a typical spectral representation of pixel $p_i^1$ or $p_i^2$. Under the constraint of the memberships $u_k(p_i^1)$ and $u_l(p_i^2)$ (Constraints (15) and (16)), only small matching weights $w_{kl}$ should be assigned to ensure that $d_{kl}$ occupies a small proportion of the final EMD.
FCM_SICM decomposes mixed pixels into several signal classes, each of which represents a cluster composed of pixels with similar spectral and spatial characteristics. Therefore, signal classes cannot be interpreted as specific forms of land cover. The fuzzy memberships of RS image pixels belonging to different signal classes can accurately represent the pixels' spectral information. Since the signal classes and fuzzy memberships are calculated by the FCM_SICM algorithm based on all pixels in the RS image, they are relatively reliable and less susceptible to noise interference. Thus, the pixel-wise probability distribution composed of the signal-class centers and the corresponding fuzzy memberships is used as the pixel-level feature. As the EMD is an optimal measurement based on a linear optimization technique (i.e., the simplex method), it is resilient to noise interference. Therefore, in this study, to further resist noise, the EMD is used to establish weighted pairwise matching relationships between the signal classes of noise-contaminated image pixels and produce a pixel-level change-magnitude map, which is processed by an automatic thresholding algorithm to generate the final change map, as shown in Figure 4.

3. Dataset and Experimental Setup

Section 3.1 describes the datasets used to assess the performance of SSDNet-FSE and the competing methods. Section 3.2 describes these competing methods, which serve as benchmarks, in more detail. Section 3.3 provides details of our evaluation criteria and experimental design.

3.1. Dataset Descriptions

(1)
The Shangtang dataset is obtained from the SenseEarth platform used in the AI Vision of the World 2020 Artificial Intelligence Remote Sensing Interpretation Competition hosted by SenseTime Technology [45]. Its image size is 512 × 512 pixels, with a spatial resolution of 3 m. The selected study area comprises images of rural areas containing four land-cover types (buildings, farmland, woodland, and wasteland), among which the major changes involve the construction of buildings;
(2)
The DSIFN-CD dataset is manually collected from Google Earth. It consists of six large bitemporal high-resolution images covering six cities in China (Beijing, Chengdu, Shenzhen, Chongqing, Wuhan, and Xi'an) [46]. Five large image pairs are cropped into sub-image pairs of size 512 × 512 with a spatial resolution of 2 m/pixel, followed by data augmentation. The selected study area includes five land-cover types, namely roads, farmland, wasteland, buildings, and vegetation, where the primary changes occur in wasteland and vegetation areas;
(3)
The LZ dataset comprises two Landsat8 images of the Lanzhou New Area in China, captured in 2016 and 2017, respectively [47]. These images contain seven bands, with a spatial resolution of 30 m and a size of 650 × 650 pixels. The scenes in these images include forests, farmland, wastelands, mountains, and buildings;
(4)
The CDD dataset consists of real RS images with seasonal variations (obtained from Google Earth), including four high-resolution images captured in four seasons [48]. A pair of images with a size of 1900 × 1000 pixels forms the study area; they are manually cropped to 992 × 992 pixels with a spatial resolution of 0.3–1 m/pixel and contain water bodies, buildings, forests, and grasslands, among which the dominant changes occur in areas of buildings and forests;
(5)
The GZ dataset covers the suburban area of Guangzhou, China, and was collected between 2006 and 2019 [49]. A total of 19 bitemporal high-resolution images with red, green, and blue bands, a spatial resolution of 0.55 m, and a size of 1006 × 1168 pixels are collected using the BIGEMAP (30.0.31.6) software and the Google Earth service. These images are manually cropped into sub-image pairs of size 1024 × 1024, in which the major changes are caused by urban development.

3.2. Competing Methods

We select three traditional and three deep-learning-based unsupervised CD algorithms for comparison. Traditional algorithms include PCAKMeans [22], ASEA [50], and INLPG [51]. PCAKMeans employs local data projections within the feature vector space to perform k-means clustering. This method extracts feature vectors for each pixel using an h × h neighborhood, thereby automatically incorporating contextual information. Studies have demonstrated that this algorithm exhibits robust performance against zero-mean Gaussian and speckle noises, making it highly appealing for the detection of changes in optical and SAR images. ASEA is a spatial context exploration algorithm investigating the spatial contextual information surrounding each pixel in VHRRS images. It then defines a band-to-band (B2B) distance to measure the magnitude of changes between bitemporal images across adaptive regions. The integration of the proposed ASEA and B2B distance achieves effective CD in RS images. Research indicates that ASEA effectively reduces noise and enhances CD performance. Furthermore, the combination of the B2B distance with ASEA significantly improves the homogeneity of the changed regions, effectively smoothing noise in the CD map. INLPG extends the nonlocal patch-based graph (NLPG) method by improving the graph construction process, the calculation of structural difference, and the DI fusion process, thus making it more resilient to noise and accomplished in CD tasks.
Deep-learning-based algorithms include GMCD [19], KPCAMNet [20], and DCVA [21]. GMCD enhances the robustness of CD results by incorporating a noise modeling block within a feature-learning module based on fully convolutional networks (FCNs). This method significantly enhances the suppression of noise by modeling noise within RS images. KPCAMNet extracts high-level spatial–spectral feature maps through KPCA convolutions and, subsequently, maps changes in feature maps onto a two-dimensional polar domain. Finally, CD results are generated through threshold segmentation and clustering algorithms. Prior work has shown that KPCAMNet is capable of detecting finer changes and more complete regions of change. DCVA, an unsupervised context-sensitive CD method, utilizes a pre-trained CNN to extract spatial contextual information and identify changed pixels.

3.3. Experimental Design and Evaluation Metrics

All experiments are conducted using a computer equipped with an NVIDIA RTX4080 GPU, Intel(R) i9-13900K 3.0 GHz processor, and 64GB RAM (Colorful Co., Ltd. Shenzhen, China). The parameters of the reference methods are as follows. (1) For DCVA, the layers are set to {2, 5, 8}. (2) For KPCAMNet, the network depth is set to four, and the number of KPCA convolutional kernels is set to eight. A radial basis function kernel is selected with a kernel parameter of 5 × 10−4. The size of the convolutional kernel is set to three based on prior experiments. (3) For GMCD, the number of epochs is set to 40, and the learning rate is set to 1 × 10−3. When the number of epochs reaches 20, the learning rate is reduced to 1 × 10−4. Notably, the methods involved are implemented with their default parameters, as described in prior relevant works. Additionally, the same postprocessing operations are applied to all methods to ensure objectivity in our comparisons.
To evaluate the robustness of the proposed algorithm under noisy conditions, in this experiment, we simultaneously add three types of noise to the target RS images: 5% salt-and-pepper noise, zero-mean Gaussian noise with a variance of 0.05, and multiplicative noise with a magnitude of 0.05. The learning rate of the CNN model is set to 1 × 10−4, with 15,000 iterations. The number of clusters for FCM_SICM is set to 10, the fuzziness parameter m is set to 2.0, the minimum error is set to 1 × 10−5, and the maximum number of iterations is set to 100. The automatic thresholding algorithm selected is the Otsu method [52].
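As an illustration of how such mixed contamination can be simulated, the following NumPy sketch adds the three noise types to an image scaled to [0, 1]; the ordering of the noise sources and the [0, 1] scaling are assumptions, as the exact injection procedure is not specified here.

```python
import numpy as np

def add_mixed_noise(img, sp_ratio=0.05, gauss_var=0.05, mult_var=0.05,
                    rng=None):
    """Contaminate an image in [0, 1] with the three noise types used here."""
    if rng is None:
        rng = np.random.default_rng()
    out = img.astype(np.float64).copy()
    # zero-mean Gaussian noise with the given variance
    out += rng.normal(0.0, np.sqrt(gauss_var), out.shape)
    # multiplicative (speckle-like) noise: x <- x + x * n
    out += out * rng.normal(0.0, np.sqrt(mult_var), out.shape)
    # salt-and-pepper noise on a fraction sp_ratio of the pixels
    mask = rng.random(out.shape[:2])
    out[mask < sp_ratio / 2] = 0.0
    out[mask > 1 - sp_ratio / 2] = 1.0
    return np.clip(out, 0.0, 1.0)
```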
For the evaluation criteria, to quantitatively evaluate the performance of different CD methods, several standard metrics are employed, including the missed alarm rate (MA), false-alarm rate (FA), recall (Rec), F1-score (F1), overall accuracy (OA), and Kappa coefficient. Higher OA, Rec, F1, and Kappa values indicate better CD results.
These metrics are calculated as MA = FN/(TP + FN), FA = FP/(FP + TN), Pre = TP/(TP + FP), Rec = TP/(TP + FN), OA = (TP + TN)/(TP + TN + FP + FN), F1 = (2 × Pre × Rec)/(Pre + Rec), and Kappa = (OA − PRE)/(1 − PRE), where PRE denotes the expected chance agreement, PRE = ((TP + FP)(TP + FN) + (FN + TN)(FP + TN))/(TP + TN + FP + FN)². In these formulas, false negatives (FNs) refer to the count of pixels wrongly classified as unchanged, false positives (FPs) refer to pixels wrongly identified as changed, true negatives (TNs) refer to pixels accurately classified as unchanged, and true positives (TPs) refer to pixels accurately identified as changed. The Kappa statistic is particularly useful for measuring the algorithm's ability to detect change, with higher values indicating better results.
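For reference, these metrics can be computed from a pair of binary maps as in the following sketch, which is a direct transcription of the formulas above.

```python
import numpy as np

def cd_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute the evaluation metrics from binary change maps (1 = changed)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    oa = (tp + tn) / n
    # expected chance agreement for the Kappa coefficient
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "MA": fn / (tp + fn),
        "FA": fp / (fp + tn),
        "Rec": rec,
        "F1": 2 * pre * rec / (pre + rec),
        "OA": oa,
        "Kappa": (oa - pe) / (1 - pe),
    }
```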

4. Experimental Results

4.1. Noise-Resistance Analysis

In this study, we introduced three types of noise into the target RS images to evaluate the noise resistance of the proposed method and its six competitors. The added noise includes (1) 5% salt-and-pepper noise, (2) zero-mean Gaussian noise with a variance of 0.05, and (3) multiplicative noise with a magnitude of 0.05. The number of clusters for FCM_SICM was set to 10, and the fuzziness parameter m was set to 2.0. We used the Otsu automatic thresholding algorithm. The highest accuracy achieved by any of the CD methods is marked in bold, while the second highest is underlined. The lowest FA and MA among the CD methods are marked in red and green. Changed and unchanged areas are represented in white and black, while FA and MA are represented in red and green, respectively, on the CD result maps.
The proposed method achieved the best CD results under the influence of the mixed noise on both the Shangtang dataset and the DSIFN-CD dataset, as shown in Figure 5 and Figure 6. From Table 1, it can be observed that SSDNet-FSE achieves the highest OA (0.9714), Kappa (0.8816), and F1 (0.8902). Its Rec (0.8211) is only 0.1102 lower than the highest achieved by DCVA (0.9313). This is because DCVA performs well in detecting regions with a relatively large and concentrated distribution of changes in the entire image, resulting in a high Rec rate at the cost of a high false-positive rate. In the DSIFN-CD dataset (Figure 6), the main land-cover types are farmland and grassland, with changes mainly occurring in farmland, which made competitive algorithms produce a significant number of false alarms. For example, KPCAMNet and DCVA are severely affected by noise, with a large amount of unchanged vegetation being incorrectly identified as changed areas, and KPCAMNet also exhibits a high rate of missed detections. In contrast, the proposed method successfully eliminates these false changes, resulting in more accurate and clearer CD results. The SSDNet-FSE achieves superior performance on the Shangtang and DSIFN-CD datasets with dimensions of 512 × 512 and a spatial resolution of 1–3 m/pixel, indicating that SSDNet-FSE can focus on correctly identified changed areas within small-scale VHRRS CD. This superior performance can be explained by the fact that FCM_SICM considers both the contextual information and spatial intensity information within images and utilizes membership linking to avoid premature convergence, making it resilient to noise contamination.
In the LZ dataset, compared to the competitive methods, SSDNet-FSE produced the best CD map with the least false alarms and detected a more complete and recognizable changed area. From Figure 7 and Table 2, it is evident that the results of several comparative algorithms were suboptimal. GMCD, DCVA, and PCAKMeans exhibited large areas of false alarms, while KPCAMNet, ASEA, and INLPG produced severe omission errors. These inferior results can be attributed to the 30 m spatial resolution of the LZ dataset. The competing algorithms failed to effectively decompose mixed pixels, which is common in mid-resolution images under noisy conditions. In contrast, SSDNet-FSE comprises three noise-resistant components, SSDNet, FCM_SICM, and EMD, which synergically yielded the best CD results.
The GZ dataset and the CDD dataset are both made up of VHRRS images (Figure 8 and Figure 9). Compared to the other datasets, these two datasets have larger image sizes (CDD: 992 × 992; GZ: 1024 × 1024), more land-cover types, and more complex ground structures in changed areas. In images of the first period, the largest areas of change mainly consist of buildings and forests with high spectral variability. Therefore, integrating spatial information is crucial for accurate CD. In the CDD dataset, the changed areas are uniformly distributed across the entire study area, and the many smaller changed areas place higher requirements on the algorithm's capacity to extract spatial information. As shown in Table 3, although the Kappa value of the proposed algorithm is only 0.0063 higher than that of GMCD on the CDD dataset, the change map obtained by the proposed algorithm has clearer boundaries and fewer false-positive areas. Moreover, its Rec and F1 values are 0.045 and 0.011 higher, respectively, than those of GMCD, confirming its superior ability to identify true changes, better noise resistance, and greater capacity to identify small areas of change under the influence of noise.
For the GZ dataset, the proposed algorithm achieves the highest accuracy and the lowest FA and MA. Specifically, its OA is 0.035 higher than that of the second-highest algorithm (PCAKMeans). Its Kappa, Rec, and F1 values are 0.114, 0.116, and 0.1012 higher, respectively, than those of the second-highest algorithm (ASEA). These results, combined with Figure 10 (GZ) and Table 3 (GZ), indicate that our proposed method can robustly process large-scale VHRRS images, obtaining highly desirable noise-resistant CD outcomes. Although VHRRS images are less affected by mixed-pixel issues, they contain richer spatial information than medium-resolution images. Benefiting from FCM_SICM-EMD's ability to fuse spatial information, the proposed method can better integrate such information from the images to produce smoother boundaries and fewer holes, thereby accomplishing significantly enhanced CD performance.

4.2. Noise Sensitivity Analysis

This section describes mixed-noise sensitivity experiments carried out with seven methods: GMCD, KPCAMNet, DCVA, PCAKMeans, ASEA, INLPG, and SSDNet-FSE. The experiments involved adding salt-and-pepper noise (1%, 2%, 3%, 4%, 5%, and 6%), Gaussian noise (zero mean; variance 0.01, 0.02, 0.03, 0.04, 0.05, and 0.06), and multiplicative noise (0.01, 0.02, 0.03, 0.04, 0.05, and 0.06) to contaminate the five datasets for noise sensitivity analysis. When processing noise-free images, our method only utilized the FCM_SICM part of the original framework. In this experiment, the number of clusters was set to 10, and the fuzziness parameter m was experimentally set to 2.0. The same postprocessing procedure was applied to all competing algorithms to guarantee objectivity in our evaluation.
From Figure 10, it can be observed that our proposed method maintains stable CD performance on the five datasets as the noise level increases; the difference between the highest and lowest Kappa values across the five study areas does not exceed 0.1061. DCVA shows weak robustness against noise and performs poorly on all five datasets. PCAKMeans exhibits some robustness to noise and performs well when the noise level is low, but its performance declines rapidly with increasing noise. GMCD and PCAKMeans show a similar trend on medium-resolution RS images (the LZ dataset), and GMCD's ability to detect change weakens with increasing noise. On the DSIFN-CD and Shangtang datasets, GMCD and KPCAMNet show an overall upward trend. This inconsistency in performance is due to the randomness of the noise added to the imagery. As noise intensity increases, many noise points appear in the holes generated in the CD maps, and in the postprocessing stage, these holes are filled by morphologically dilated loci of noise. Other isolated noise points in the CD image are considered small noise speckles and are removed, leading to an increase in CD accuracy. Thus, by observing the performance of GMCD and KPCAMNet on the five datasets, we can confirm that these two methods have some noise resistance, but their detection of change is unstable compared with our method. Except for the LZ dataset, ASEA exhibits a stable and increasing trend in performance as the noise intensity rises across the other four datasets. On the LZ dataset, however, ASEA's performance declines gradually with increasing noise intensity, still indicating a degree of robustness to noise. Nonetheless, ASEA's CD accuracy is lower than that of our method across all five datasets. In contrast, while INLPG exhibited a degree of robustness across the five datasets, its CD performance lacked consistency. Notably, the discrepancy between its highest and lowest accuracy reached 0.1364 and 0.5116 on the Shangtang and DSIFN-CD datasets, respectively. These findings underscore the markedly superior performance of our proposed method.
Figure 10. Noise-resistance performance of competitive methods on the five datasets.
Notably, from the change curves of the CDD dataset, ASEA, INLPG, GMCD, KPCAMNet, and our method all demonstrate stable performance. As shown in Table 4, even when the noise level is 0.03, 0.04, or 0.05, GMCD's Kappa is slightly higher than ours because the CDD dataset has a total of 984,064 pixels, of which only 9.575% (94,220) represent change pixels, and the true changes are distributed relatively evenly in the study area. The small change areas are filtered out as noise, causing a decrease in CD accuracy. Additionally, change-detection algorithms face significant challenges when dealing with bitemporal RS images of changing areas like the CDD dataset, since smaller areas of change may be mistakenly removed as noise during the postprocessing stage. This explains why the CD accuracy of the aforementioned algorithms on noise-free images is similar to their accuracy when the mixed-noise level is 0.01 and 0.02. Therefore, improving the accuracy of SSDNet-FSE in detecting changes in small targets under the influence of noise is a key focus of our future research.
Because of the sufficient pixel samples within large-size images with complicated land covers, FCM_SICM can capture reliable signal classes that resist noise. Furthermore, based on these reliable signal classes, the change magnitude measured by the EMD is accurate, leading to good CD results on large and complicated RS images, as indicated by the following experimental results.
Regarding our method's performance on large-scale VHRRS images (GZ dataset: 1024 × 1024), as shown in Figure 9 and Figure 10 and the CD accuracies detailed in Table 3 and Table 4, our approach achieves the highest accuracy on the GZ dataset as the mixed-noise level increases from 0 to 0.06. Specifically, it outperforms the second-highest accuracy by 13.57%, 11.00%, 10.82%, 11.37%, 11.47%, 11.40%, and 11.21%, respectively. Furthermore, the difference between the highest and lowest accuracy on the GZ dataset is only 1.58%, demonstrating that our method consistently delivers stable and reliable change-detection results on large-scale VHRRS images.

4.3. Ablation Analysis

We further investigated the roles of each component in SSDNet-FSE. We selected the GZ dataset with a mixed-noise level of 0.05 as the study area. By establishing the following models (M1–M9), we verified the effectiveness of the proposed model. In this section, models M1–M7 and M9 include both a self-supervised denoising module and an unsupervised CD module, while M8 only includes the unsupervised CD module. The number of clusters for each FCM variant is set to 10, with a fuzziness parameter of 2.0. The same postprocessing method is applied to each algorithm.
For M1, the complete CD framework (SSDNet-FSE).
For M2, the FCM_SICM is replaced by FLICM in the original CD framework [53], which uses a distance measure incorporating fuzzy local spatial information and spectral similarity. By introducing a spatial fuzzy factor, manual adjustment of parameter factors is avoided.
For M3/M4, the FCM_SICM is replaced by FCM_S1 and FCM_S2 in the original CD framework. They introduce spatial neighborhood information by correcting the fuzzy membership of the central pixel based on the Euclidean distance between the spectral features of adjacent pixels and the central pixel.
For M5/M6, the FCM_SICM is replaced in the original CD framework with KFCM_S1 and KFCM_S2. KFCM [54] transforms the nonlinear transformation in low-dimensional space into a linear transformation in high-dimensional space by replacing the Euclidean distance of the original FCM with a Gaussian kernel function distance. KFCM_S1 and KFCM_S2 are calculated based on mean filtering and median filtering, thus reducing their running time.
For M7, the FCM_SICM is replaced in the original CD framework with the original FCM.
For M8, this framework is based on M1, but the first stage’s denoising module is completely removed.
For M9, the proposed SSDNet denoising network is replaced by the original Self2Self network (Self2Self-FSE), and the rest of the framework remains unchanged.
Table 5 shows the experimental accuracy values. It can be observed that directly performing CD without the denoising module (M8) and replacing SSDNet with Self2Self (M9) decreased the Kappa values, compared to M1, by 0.0881 and 0.0514, respectively. This result indicates that noise-resistant performance is crucial for the accurate detection of change.
Although M3–M6 are complete two-stage CD frameworks, they only incorporate spatial information from RS images, resulting in weak robustness against mixed noise and unsatisfactory CD accuracy. The highest and lowest accuracies among them differ from M1 in Kappa by 0.1394 and 0.3002, respectively, indicating that spatial FCM variants (FCM_S1, FCM_S2, KFCM_S1, and KFCM_S2), which rely solely on spatial neighborhood information, cannot effectively handle mixed noise. The reason is that the signal-class centers generated by spatial FCM, which incorporates only spatial neighborhood information, are adversely affected by random noise (Figure 11 and Table 5). On the one hand, a larger FA is observed if the obtained signal-class centers are perturbed by noise in a way that leads to a large EMD metric. On the other hand, if the obtained signal-class centers are perturbed by noise in a way that leads to a small EMD metric, a larger MA is observed.
Notably, M2 achieves the second-highest Kappa among the CD results of the ablation methods (0.0339 lower than M1) because of FLICM's strong noise resistance. FLICM not only adopts a distance measure that combines the spatial and spectral information of images; it also employs an adaptive cluster-center initialization method. By dynamically adjusting the initial positions of cluster centers based on the local information of pixels, the cluster centers more accurately represent the centers of each signal class, thus improving the robustness of clustering under noisy conditions. It is worth noting that the CD accuracy obtained by M8 is lower only than those of M1, M2, and M9, confirming that the CD stage of the proposed framework itself has a certain noise resistance.

4.4. Analysis of Change-Magnitude Maps

The change-magnitude map is the intermediate result generated by the proposed method and is binarized into the final CD result. To investigate the impact of the change-magnitude map on the final CD accuracy, we employed the nine ablation methods to generate change-magnitude maps. The experimental parameters in this section are the same as those in the ablation experiments, and the experimental results are shown in Figure 12 and Figure 13.
Compared with the eight competitors in the ablation analysis, SSDNet-FSE achieved the most consistent results on both the GZ and LZ datasets. As shown in Figure 12 (the GZ dataset), severe but patchy areas of noise appeared in the change-magnitude maps of M3 and M5. The magnitudes of change in the true regions of change (marked by yellow boundaries) were close to those of the unchanged regions. From Table 5, it can be observed that M3 and M5 obtained the lowest CD Kappa values (0.5902 and 0.4401, respectively), which are consistent with the quality of the corresponding change-magnitude maps. Although M4 and M6–M8 achieved higher CD accuracies than M3 and M5, the effects of noise contamination on their change-magnitude maps are still visually apparent.
For the GZ dataset, only slight noise contamination can be observed in the change-magnitude maps of M1, M2, and M9. In the middle–left part of the study area, clear spatial details can be observed. For M1, the magnitude of change in the changed regions is significantly high. Although the overall magnitude of change in M2 is low, the changed regions still have relatively higher magnitudes of change than the unchanged regions. Consequently, M1 and M2 have the highest (0.7403) and second highest (0.7064) CD Kappa values, confirming the importance of the change-magnitude map for the effective detection of change. It is worth noting that only our method accurately detects the changed region in the bottom–right part of the GZ dataset, which contributes to the superior Kappa of M1 (0.0339 and 0.0514 higher than M2 and M9).
The change-magnitude maps obtained by the nine ablation methods on the mid-resolution images (shown in Figure 13) are qualitatively consistent with the results based on the high-resolution GZ dataset. It is worth noting that M8 yields results similar to those of M1 on the LZ dataset, indicating that, when dealing with mid-resolution RS images, the FCM_SICM-EMD algorithm can effectively suppress noise, decompose mixed pixels, and achieve satisfactory CD results without a preliminary denoising step.
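Because the final CD map is obtained by thresholding these magnitude maps, residual noise in a map translates directly into false alarms. The sketch below illustrates the binarization step under the assumption that Otsu's method serves as the automatic threshold, followed by a light morphological clean-up; the exact threshold algorithm and structuring elements used in the paper may differ.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import binary_opening, binary_closing

def binarize_magnitude(magnitude):
    """Turn a change-magnitude map into a binary CD map. A sketch assuming
    Otsu's automatic threshold plus a 3 x 3 morphological clean-up; the
    paper's exact thresholding and morphology settings may differ."""
    t = threshold_otsu(magnitude)
    cd_map = magnitude > t                        # True = changed pixel
    footprint = np.ones((3, 3), dtype=bool)
    cd_map = binary_opening(cd_map, footprint)    # drop isolated false alarms
    cd_map = binary_closing(cd_map, footprint)    # fill small holes in changed regions
    return cd_map.astype(np.uint8)
```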

4.5. Sensitivity Analysis of the Fuzziness Levels

To compare the impact of different fuzziness levels on the performance of the SSDNet-FSE CD algorithm, we tested seven fuzziness levels m (1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 4.5) on the five datasets contaminated by 5% salt-and-pepper noise, zero-mean Gaussian noise with a variance of 0.05, and multiplicative noise with a magnitude of 0.05, and assessed the effect of fuzziness on the CD Kappa. The experimental results are shown in Figure 14 and Table 6.
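As a concrete reference for this setup, the following sketch shows one plausible way to synthesize the mixed contamination described above on an image scaled to [0, 1]; the paper's exact noise-injection code is not given, so the injection order and the reading of the multiplicative-noise magnitude as a variance are our assumptions.

```python
import numpy as np

def add_mixed_noise(img, sp_ratio=0.05, gauss_var=0.05, mult_var=0.05, seed=0):
    """Contaminate an image (float array in [0, 1]) with the mixed noise used
    in this experiment: 5% salt-and-pepper noise, zero-mean Gaussian noise
    (variance 0.05), and multiplicative noise (magnitude 0.05). A sketch of
    the setup, not the paper's exact code."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, np.sqrt(gauss_var), img.shape)   # additive Gaussian
    noisy += img * rng.normal(0.0, np.sqrt(mult_var), img.shape)   # multiplicative (speckle)
    mask = rng.random(img.shape[:2])
    noisy[mask < sp_ratio / 2] = 0.0                               # pepper
    noisy[mask > 1.0 - sp_ratio / 2] = 1.0                         # salt
    return np.clip(noisy, 0.0, 1.0)
```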
From the experimental results (Figure 14), it can be observed that, as the fuzziness level m increases, the Kappa of SSDNet-FSE declines on the Shangtang and GZ datasets. Since both datasets contain VHRRS imagery with only slight mixed-pixel problems, some pixels may be incorrectly assigned to multiple signal classes with very different fuzzy memberships. The synergistic influence of noise pollution may then produce severely biased signal-class centers, causing the EMD calculation to yield high transformation costs and, hence, high magnitudes of change. Additionally, the cost of transforming noise-contaminated pixels in noisy imagery, as calculated via the EMD, is significantly larger than that of uncontaminated pixels in noise-free imagery, leading to the misclassification of unchanged pixels as changed pixels and an elevated false-alarm rate.
Since the major ground-cover changes in the Shangtang and GZ datasets are from vegetation to buildings, with significant spectral differences (as shown in Figure 5 and Figure 9), the cost of transforming the bitemporal pixels is relatively high. This causes unchanged pixels to be misclassified as changed pixels, elevating the false-alarm rate and decreasing the CD Kappa. In contrast, the spectral variability of the land-cover types in DSIFN-CD and CDD is lower, leading to lower transformation costs. Therefore, although noise may introduce some bias into the signal-class centers, the bias remains within a tolerable range, and relatively low false-alarm rates and accurate CD results can still be achieved.
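To make the cost argument concrete, the sketch below computes the EMD between the signatures of a bitemporal pixel pair, where each signature pairs the pixel's fuzzy memberships with the FCM_SICM signal-class centers. It poses the transport problem as a generic linear program via scipy, which is only one of several ways to compute the metric and is not necessarily the solver used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def emd(w_p, centers_p, w_q, centers_q):
    """Earth mover's distance between the signatures of a bitemporal pixel
    pair: each signature pairs the pixel's fuzzy memberships (weights summing
    to 1) with the signal-class centers from FCM_SICM. A sketch posed as a
    linear program, not necessarily the paper's solver."""
    # Ground distance: Euclidean distance between every pair of class centers.
    D = np.linalg.norm(centers_p[:, None, :] - centers_q[None, :, :], axis=2)
    n, m = D.shape
    # Equality constraints: the flow out of each source class and into each
    # sink class must equal the corresponding fuzzy membership.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums: outgoing flow of class i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums: incoming flow of class j
    b_eq = np.concatenate([w_p, w_q])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun                         # minimal transport cost = change magnitude
```

Because both membership vectors sum to one, the transport problem is always feasible; one equality constraint is redundant, which the HiGHS solver tolerates.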
It is worth noting that, in the LZ dataset, which has a spatial resolution of 30 m/pixel and obvious mixed-pixel phenomena, each pixel contains simpler spectral information and is, hence, less affected by random noise than the pixels of VHRRS images. Moreover, FCM_SICM performs well in decomposing mixed pixels polluted by random noise. Therefore, the proposed method performed well even when the fuzziness level m was large.
The relationship between the convergence process and the number of iterations of FCM_SICM is shown in Figure 15. It can be observed that, as m increases, the loss curve of FCM_SICM becomes smoother and the number of iterations required to reach convergence decreases. For MRRS images, mixed-pixel phenomena can be better addressed by using a large m, which reduces the difference in fuzzy membership between the typical and non-typical signal classes of noise-polluted pixels; hence, the large FA caused by noisy pixels can be suppressed significantly. Here, typical signal classes reflect the primary spectral and spatial characteristics of image pixels, while non-typical signal classes do not. For VHRRS images, although a higher m enables the model to converge quickly and reduce time costs, it also encourages each pixel to align inappropriately with typical and non-typical signal classes with similar fuzzy memberships, ultimately decreasing CD accuracy (as exemplified by the Shangtang and GZ datasets in Figure 14). The advantage of a smaller m for VHRRS images is obvious: during clustering, pixels align with typical signal classes with large fuzzy memberships and with non-typical signal classes with small fuzzy memberships, so the signal-class centers are less affected by noise and deviate less from the correct signal-class centers.
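The role of m can be seen directly in the standard FCM membership update, on which FCM_SICM builds. The toy sketch below (with illustrative distances chosen by us) shows how a small m yields near-crisp memberships while a large m flattens them across typical and non-typical signal classes.

```python
import numpy as np

def fcm_memberships(dists, m):
    """Standard FCM membership update for a single pixel, given its distances
    to each signal-class center; FCM_SICM adds spatial/intensity constraints
    and membership linking on top of this core rule."""
    ratio = (dists[:, None] / dists[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

dists = np.array([0.2, 0.6, 1.0])     # illustrative distances: pixel near class 1
print(fcm_memberships(dists, 1.5))    # sharp: ~[0.986, 0.012, 0.002]
print(fcm_memberships(dists, 4.5))    # flat:  ~[0.52, 0.28, 0.21]
```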

4.6. Computational Time

To compare the computational cost of the proposed method with several other algorithms, we conducted experiments on the Shangtang dataset, with the results presented in Table 7.
It can be observed that the PCAKMeans algorithm required the shortest processing time, taking only 0.48 s. However, because of the small number of parameters used in this algorithm, its fast computation comes with less-than-ideal CD accuracy. The running times of GMCD, KPCAMNet, and DCVA were 10.24, 6.05, and 10.49 s, respectively. The DCVA and GMCD methods utilize pre-trained models inherent to their original algorithms for change detection, which reduces their execution time relative to our approach. Specifically, DCVA integrates CVA with a pre-trained deep convolutional neural network, extracting deep features from multi-layer CNNs that have already been trained, thereby shortening the algorithm's runtime. According to the original GMCD literature, processing an image of 458 × 559 pixels without pre-trained models on a GeForce RTX 2080 Ti with 11 GB of memory required 355 s [19]. The long running time of ASEA can be attributed to its adaptive spatial-context extraction process, which considers the spatial context surrounding each pixel: the adaptive region around each pixel keeps expanding until it encounters the boundary of the detected target, leading to an elevated computational cost. The CD component of the overall framework (FCM_SICM-EMD) required 16.79 s, which is within 0.1 s and 4.35 s of the state-of-the-art ASEA and INLPG methods, respectively. SSDNet-FSE had the longest processing time because of its self-supervised denoising component, which requires substantial computational resources to achieve optimal noise reduction for RS images, an essential step in the pipeline. SSDNet required 314.55 s for 5000 iterations, 749.64 s for 10,000 iterations, and 1144.94 s for 15,000 iterations; thus, our method allows a balance to be struck between time cost and denoising quality, with adjustments possible according to specific needs.
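For reproducibility, runtimes of this kind can be measured with a simple wall-clock harness such as the sketch below; the measurement protocol (number of repetitions, warm-up, hardware) is not specified in the paper, so this is an assumed setup rather than the one used for Table 7.

```python
import time

def time_method(fn, *args, repeats=3):
    """Wall-clock harness for runtime comparisons of the kind in Table 7.
    `fn` stands in for any CD method (e.g., a PCAKMeans or FCM_SICM-EMD run);
    the protocol is an assumption, not the paper's exact setup."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best    # seconds, best of `repeats` runs
```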

5. Conclusions

In this paper, we have proposed a robust unsupervised RS image CD framework, SSDNet-FSE. It employs an enhanced self-supervised denoising network to denoise and reconstruct noise-contaminated RS images before the CD procedure, thus attenuating the influence of noise on CD accuracy. Subsequently, the spatial and spectral information in the reconstructed images is exploited by FCM_SICM to provide adaptive constraints for the noise-resistant fuzzy clustering process. By employing membership linking during clustering, the method iteratively leverages intermediate clustering results, effectively differentiates between clusters, and decomposes pixels into several signal classes, achieving noise-resistant mixed-pixel decomposition. The EMD metric, which is computed with a linear optimization technique and possesses some noise resistance of its own, is then used to calculate the distances between the signal-class centers and the corresponding fuzzy memberships of the bitemporal pixels, yielding a high-quality change-magnitude map. Finally, an automatic threshold algorithm and morphological processing are employed to produce the final CD results. The experimental results confirm that the proposed SSDNet, FCM_SICM, and EMD components synergistically process mixed-noise-contaminated RS images, accurately detect changed regions with clear boundaries and rich spatial details, and achieve superior performance on five public CD datasets compared with six competitive CD methods. Moreover, the proposed SSDNet denoising network effectively removes noise while preserving detailed textural information in the reconstructed images. Additionally, the ablation experiments show that, even when SSDNet is not applied to preprocess the RS images, the FCM_SICM-EMD method still detects change accurately when applied directly to the original noisy images.
Although SSDNet-FSE achieved satisfactory CD results, it still has several limitations. (1) The denoising component has high computational complexity and requires substantial time to reconstruct noise-free images. (2) SSDNet-FSE's performance is less satisfactory when handling uniformly distributed and relatively small change targets. (3) The FCM_SICM method requires manual adjustment of the fuzziness parameter m to adapt to RS images of different resolutions. Future work will focus on reducing the computational complexity of the denoising network while maintaining its effectiveness and on improving the CD component so that it can adaptively handle RS images with different spatial resolutions.

Author Contributions

All the authors have contributed substantially to the manuscript. J.X. and Y.L. proposed the methodology. J.X., Y.L. and X.L. performed the experiments and software. J.X. and Y.L. wrote the paper. J.X., S.Y. and Y.L. analyzed the data. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is co-funded by the National Key R&D Program of China (Grant No. 2022YFB3903604) and the National Natural Science Foundation of China (No. 42161069).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors are grateful to the editor and anonymous reviewers for their helpful and valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Caiza-Morales, L.; Gómez, C.; Torres, R.; Nicolau, A.P.; Olano, J.M. Manglee: A tool for mapping and monitoring mangrove ecosystems on Google Earth Engine—A case study in Ecuador. J. Geovisualization Spat. Anal. 2024, 8, 17. [Google Scholar] [CrossRef]
  2. Zhi, Z.; Liu, J.; Liu, J.; Li, A. Geospatial structure and evolution analysis of national terrestrial adjacency network based on complex network. J. Geovisualization Spat. Anal. 2024, 8, 12. [Google Scholar] [CrossRef]
  3. Yang, J.; Huang, X. 30 m Annual land cover and its dynamics in China from 1990 to 2019. Earth Syst. Sci. Data Discuss. 2021, 13, 3907–3925. [Google Scholar] [CrossRef]
  4. Qin, X.; He, S.; Yang, X.; Dehghan, M.; Qin, Q.; Martin, J. Accurate outline extraction of individual building from Very High-Resolution optical images. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1775–1779. [Google Scholar] [CrossRef]
  5. Li, X.; Ling, F.; Foody, G.M.; Du, Y. A superresolution land-cover change detection method using remotely sensed images with different spatial resolutions. IEEE Trans. Geosci. Remote Sens. 2016, 54, 3822–3841. [Google Scholar] [CrossRef]
  6. Qu, Y.; Li, J.; Huang, X.; Wen, D. TD-SSCD: A novel network by fusing temporal and differential information for self-supervised remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5407015. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Wang, G.; Yang, J.; Li, T.; Li, Z. Au3-gan: A method for extracting roads from historical maps based on an attention generative adversarial network. J. Geovisualization Spat. Anal. 2024, 8, 26. [Google Scholar] [CrossRef]
  8. Lu, D.; Mausel, P.; Brondizio, E.; Moran, E. Change detection techniques. Int. J. Remote Sens. 2004, 25, 2365–2401. [Google Scholar] [CrossRef]
  9. Ma, J.; Gong, M.; Zhou, Z. Wavelet fusion on ratio images for change detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2012, 9, 1122–1126. [Google Scholar] [CrossRef]
  10. Bovolo, F.; Bruzzone, L. A theoretical framework for unsupervised change detection based on change vector analysis in the polar domain. IEEE Trans. Geosci. Remote Sens. 2007, 45, 218–236. [Google Scholar] [CrossRef]
  11. Deng, J.; Wang, K.; Deng, Y.; Qi, G. PCA-Based land use change detection and analysis using multitemporal and multisensor satellite data. Int. J. Remote Sens. 2008, 29, 4823–4838. [Google Scholar] [CrossRef]
  12. Nielsen, A.A.; Conradsen, K.; Simpson, J.J. Multivariate alteration detection (MAD) and MAF postprocessing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sens. Environ. 1998, 64, 1–19. [Google Scholar] [CrossRef]
  13. Marpu, P.R.; Gamba, P.; Canty, M.J. Improving change detection results of IR-MAD by eliminating strong changes. IEEE Geosci. Remote Sens. Lett. 2011, 8, 799–803. [Google Scholar] [CrossRef]
  14. Celik, T. Unsupervised change detection in satellite images using principal component analysis and K-Means clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776. [Google Scholar] [CrossRef]
  15. Chen, P.; Li, C.; Zhang, B.; Chen, Z.; Yang, X.; Lu, K.; Zhuang, L. A region-based feature fusion network for VHR image change detection. Remote Sens. 2022, 14, 5577. [Google Scholar] [CrossRef]
  16. Wang, J.; Zhao, T.; Jiang, X.; Lan, K. A hierarchical heterogeneous graph for unsupervised SAR image change detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4516605. [Google Scholar] [CrossRef]
  17. Zhao, H.; Liu, S.; Du, Q.; Bruzzone, L.; Zheng, Y.; Du, K.; Tong, X.; Xie, H.; Ma, X. GCFnet: Global collaborative fusion network for multispectral and panchromatic image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5632814. [Google Scholar] [CrossRef]
  18. Zhang, H.; Yao, J.; Ni, L.; Gao, L.; Huang, M. Multimodal attention-aware convolutional neural networks for classification of hyperspectral and LiDAR data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3635–3644. [Google Scholar] [CrossRef]
  19. Tang, X.; Zhang, H.; Mou, L.; Liu, F.; Zhang, X.; Zhu, X.; Jiao, L. An unsupervised remote sensing change detection method based on multiscale graph convolutional network and metric learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5632814. [Google Scholar] [CrossRef]
  20. Wu, C.; Chen, H.; Du, B.; Zhang, L. Unsupervised change detection in multitemporal VHR images based on deep kernel PCA convolutional mapping network. IEEE Trans. Cyber. 2022, 52, 12084–12098. [Google Scholar] [CrossRef]
  21. Saha, S.; Bovolo, F.; Bruzzone, L. Unsupervised deep change vector analysis for multiple-change detection in VHR images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3677–3693. [Google Scholar] [CrossRef]
  22. Ke, Q.; Zhang, P. Hybrid-transCD: A hybrid transformer remote sensing image change detection network via token aggregation. ISPRS Int. J. Geo-Inf. 2022, 11, 263. [Google Scholar] [CrossRef]
  23. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622519. [Google Scholar] [CrossRef]
  24. Lin, Y.; Liu, S.; Zheng, Y.; Tong, X.; Xie, H.; Zhu, H.; Du, K.; Zhao, H.; Zhang, J. An unsupervised transformer-based multivariate alteration detection approach for change detection in VHR remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3251–3261. [Google Scholar] [CrossRef]
  25. Liu, M.; Jiang, W.; Liu, W.; Tao, D.; Liu, B. Dynamic adaptive attention-guided self-supervised single remote-sensing image denoising. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4704511. [Google Scholar] [CrossRef]
  26. Gu, S.; Li, Y.; Gool, L.V.; Timofte, R. Self-guided network for fast image denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2511–2520. [Google Scholar]
  27. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a Gaussian Denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef]
  28. Zhao, Y.; Jiang, Z.; Men, A.; Ju, G. Pyramid real image denoising network. In Proceedings of the 2019 IEEE Visual Communications and Image Processing (VCIP), Sydney, NSW, Australia, 1–4 December 2019; pp. 1–4. [Google Scholar]
  29. Jia, X.; Peng, Y.; Li, J.; Ge, B.; Xin, Y.; Liu, S. Dual-complementary convolution network for remote-sensing image denoising. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8018405. [Google Scholar] [CrossRef]
  30. Tai, Y.; Yang, J.; Liu, X.; Xu, C. Memnet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4539–4547. [Google Scholar]
  31. Wang, Y.; Song, X.; Chen, K. Channel and space attention neural network for image denoising. IEEE Signal Process Lett. 2021, 28, 424–428. [Google Scholar] [CrossRef]
  32. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16–19 September 2007; pp. 313–316. [Google Scholar]
  33. Li, Y.; Li, X.; Song, J.; Wang, Z.; He, Y.; Yang, S. Remote-sensing-based change detection using change vector analysis in posterior probability space: A context-sensitive bayesian network approach. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3198–3217. [Google Scholar] [CrossRef]
  34. Wang, Q.; Wang, X.; Fang, C.; Yang, W. Robust fuzzy C-means clustering algorithm with adaptive spatial & intensity constraint and membership linking for noise image segmentation. Appl. Soft Comput. 2020, 92, 106318. [Google Scholar]
  35. Rubner, Y.; Tomasi, C.; Guibas, L.J. A metric for distributions with applications to image databases. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Bombay, India, 7 January 1998; pp. 59–66. [Google Scholar]
  36. Quan, Y.; Chen, M.; Pang, T.; Ji, H. Self2self with dropout: Learning self-supervised denoising from single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1890–1898. [Google Scholar]
  37. Yang, L.; Zhang, R.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  38. Lehtinen, J.; Munkberg, J.; Hasselgren, J.; Laine, S.; Karras, T.; Aittala, M.; Aila, T. Noise2noise: Learning image restoration without clean data. arXiv 2018, arXiv:1803.04189. [Google Scholar]
  39. Huang, T.; Li, S.; Jia, X.; Lu, H.; Liu, J. Neighbor2neighbor: Self-supervised denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14781–14790. [Google Scholar]
  40. Krull, A.; Buchholz, T.O.; Jug, F. Noise2void-learning denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2129–2137. [Google Scholar]
  41. Chen, Y.; Bruzzone, L. Self-supervised change detection in multiview remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  42. Wang, W.; Tan, X.; Zhang, P.; Wang, X. A CBAM based multiscale transformer fusion approach for remote sensing image change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6817–6825. [Google Scholar] [CrossRef]
  43. Ahmed, M.N.; Yamany, S.M.; Mohamed, N.; Farag, A.A.; Moriarty, T. A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Trans. Med. Imaging 2002, 21, 193–199. [Google Scholar] [CrossRef] [PubMed]
  44. Chaudhury, K.N.; Dabhade, S.D. Fast and provably accurate bilateral filtering. IEEE Trans. Image Process. 2016, 25, 2519–2528. [Google Scholar] [CrossRef] [PubMed]
  45. SenseEarth. Available online: https://rs.sensetime.com/ (accessed on 20 March 2024).
  46. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogram. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  47. Geospatial Data Cloud. Available online: https://www.gscloud.cn/sources/ (accessed on 5 June 2022).
  48. Lebedev, M.A.; Vizilter, Y.V.; Vygolov, O.V.; Knyaz, V.A.; Rubis, A.Y. Change detection in remote sensing images using conditional adversarial networks. Int. Arch. Photogram. Remote Sens. Spat. Inf. Sci. 2018, 42, 565–571. [Google Scholar] [CrossRef]
  49. Fan, R.; Xie, J.; Yang, J.; Hong, Z.; Xu, Y.; Hou, H. Multiscale change detection domain adaptation model based on illumination–reflection decoupling. Remote Sens. 2024, 16, 799. [Google Scholar] [CrossRef]
  50. Lv, Z.; Wang, F.; Liu, T.; Kong, X.; Benediktsson, J.A. Novel automatic approach for land cover change detection by using VHR remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8016805. [Google Scholar] [CrossRef]
  51. Sun, Y.; Lei, L.; Li, X.; Tan, X.; Kuang, G. Structure consistency-based graph for unsupervised change detection with homogeneous and heterogeneous remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4700221. [Google Scholar] [CrossRef]
  52. Lv, Z.Y.; Liu, T.; Zhang, P.; Benediktsson, J.A.; Lei, T.; Zhang, X. Novel adaptive histogram trend similarity approach for land cover change detection by using bitemporal very-high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9554–9574. [Google Scholar] [CrossRef]
  53. Krinidis, S.; Chatzis, V. A robust fuzzy local information C-means clustering algorithm. IEEE Trans. Image Process. 2010, 19, 1328–1337. [Google Scholar] [CrossRef] [PubMed]
  54. Chen, S.; Zhang, D. Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure. IEEE Trans. Syst. Man Cyber. Part B 2004, 34, 1907–1916. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Flowchart of the proposed SSDNet-FSE framework.
Figure 2. Graphical illustration of the SimAM attention mechanism, where the complete 3-D weights are used for attention.
Figure 3. Network structure of SSDNet.
Figure 4. Coupling mechanism of FCM_SICM and EMD.
Figure 5. CD results of competitive methods obtained on Shangtang. (a) Time 1 image with mixed noises. (b) Time 2 image with mixed noises. (c) Ground truth. (d) GMCD. (e) KPCAMNet. (f) DCVA. (g) PCAKMeans. (h) ASEA. (i) INLPG. (j) Ours.
Figure 6. CD results of competitive methods obtained on DSIFN-CD. (a) Time 1 image with mixed noises. (b) Time 2 image with mixed noises. (c) Ground truth. (d) GMCD. (e) KPCAMNet. (f) DCVA. (g) PCAKMeans. (h) ASEA. (i) INLPG. (j) Ours.
Figure 7. CD results of competitive methods obtained on LZ. (a) Time 1 image with mixed noises. (b) Time 2 image with mixed noises. (c) Ground truth. (d) GMCD. (e) KPCAMNet. (f) DCVA. (g) PCAKMeans. (h) ASEA. (i) INLPG. (j) Ours.
Figure 8. CD results of competitive methods obtained on CDD. (a) Time 1 image with mixed noises. (b) Time 2 image with mixed noises. (c) Ground truth. (d) GMCD. (e) KPCAMNet. (f) DCVA. (g) PCAKMeans. (h) ASEA. (i) INLPG. (j) Ours.
Figure 9. CD results of competitive methods obtained on GZ. (a) Time 1 image with mixed noises. (b) Time 2 image with mixed noises. (c) Ground truth. (d) GMCD. (e) KPCAMNet. (f) DCVA. (g) PCAKMeans. (h) ASEA. (i) INLPG. (j) Ours.
Figure 11. Change maps obtained by nine ablation methods on the GZ dataset.
Figure 12. Change-magnitude maps obtained by nine ablation methods on the GZ dataset (real change areas are marked with yellow boundaries).
Figure 13. Change-magnitude maps obtained by nine ablation methods on the LZ dataset (real change areas are marked with yellow boundaries).
Figure 14. Fuzzy level sensitivity on the five datasets.
Figure 15. FCM_SICM loss value vs. iteration number.
Table 1. The CD performance of competitive methods on the Shangtang and DSIFN-CD datasets.

Dataset     Method           FA      MA      OA      Kappa   Recall  F1
Shangtang   GMCD [19]        0.0633  0.2307  0.9574  0.8204  0.7546  0.8374
            KPCAMNet [20]    0.4310  0.2204  0.8777  0.5856  0.7635  0.6279
            DCVA [21]        0.6848  0.0661  0.6839  0.3173  0.9313  0.4782
            PCAKMeans [22]   0.2042  0.4128  0.9150  0.6281  0.5752  0.6598
            ASEA [50]        0.0144  0.3271  0.9491  0.7717  0.6591  0.7900
            INLPG [51]       0.0084  0.4670  0.9288  0.6571  0.5194  0.6815
            Ours             0.0303  0.1637  0.9714  0.8816  0.8211  0.8902
DSIFN-CD    GMCD [19]        0.3762  0.1495  0.8831  0.6481  0.8357  0.7365
            KPCAMNet [20]    0.5956  0.5465  0.7857  0.2963  0.5975  0.5962
            DCVA [21]        0.5110  0.4443  0.8192  0.4094  0.5614  0.5414
            PCAKMeans [22]   0.3535  0.1013  0.8954  0.6880  0.8974  0.7701
            ASEA [50]        0.1875  0.1981  0.9323  0.7661  0.7967  0.8114
            INLPG [51]       0.3635  0.4108  0.8681  0.5326  0.5842  0.6479
            Ours             0.1750  0.0928  0.9497  0.8333  0.9040  0.8466
Table 2. The CD performance of competitive methods on the LZ dataset.

Dataset  Method           FA      MA      OA      Kappa   Recall  F1
LZ       GMCD [19]        0.7661  0.0599  0.7750  0.2935  0.9506  0.3918
         KPCAMNet [20]    0.2671  0.2603  0.9620  0.7159  0.7764  0.7480
         DCVA [21]        0.9069  0.0756  0.3495  0.0448  0.9275  0.1744
         PCAKMeans [22]   0.8364  0.0130  0.6373  0.1798  0.9936  0.3028
         ASEA [50]        0.0529  0.5926  0.9558  0.5502  0.4028  0.5653
         INLPG [51]       0.0699  0.5903  0.9554  0.5490  0.4110  0.5701
         Ours             0.1845  0.2267  0.9712  0.7784  0.8079  0.8159
Table 3. The CD performance of competitive methods on the CDD and GZ datasets.

Dataset  Method           FA      MA      OA      Kappa   Recall  F1
CDD      GMCD [19]        0.2550  0.3704  0.9439  0.6519  0.6203  0.6924
         KPCAMNet [20]    0.3643  0.3320  0.9316  0.6135  0.6529  0.6642
         DCVA [21]        0.8716  0.7507  0.7661  0.0494  0.2036  0.1624
         PCAKMeans [22]   0.6292  0.2314  0.8530  0.4261  0.7932  0.5109
         ASEA [50]        0.2476  0.4179  0.9416  0.6250  0.5763  0.6680
         INLPG [51]       0.2094  0.5504  0.9358  0.5413  0.4571  0.5961
         Ours             0.2905  0.3283  0.9422  0.6582  0.6653  0.7034
GZ       GMCD [19]        0.2783  0.4764  0.8659  0.5285  0.5496  0.6444
         KPCAMNet [20]    0.4610  0.4058  0.8194  0.4516  0.5975  0.5962
         DCVA [21]        0.5768  0.3769  0.7577  0.3514  0.6400  0.5326
         PCAKMeans [22]   0.2626  0.3506  0.8850  0.6204  0.6630  0.7106
         ASEA [50]        0.2902  0.3111  0.8828  0.6263  0.7039  0.7231
         INLPG [51]       0.2186  0.4202  0.8848  0.5979  0.6275  0.7147
         Ours             0.1774  0.2409  0.9200  0.7403  0.8199  0.8243
Table 4. The performance of competitive methods on the five datasets with varied mixed-noise levels.

Dataset     Method           Mixed-Noise Level
                             0.00    0.01    0.02    0.03    0.04    0.05    0.06
Shangtang   GMCD [19]        0.6678  0.5546  0.6420  0.7324  0.7538  0.7774  0.7776
            KPCAMNet [20]    0.5706  0.5784  0.5913  0.6210  0.5829  0.5769  0.5415
            DCVA [21]        0.1851  0.3195  0.2801  0.2727  0.2598  0.2473  0.1798
            PCAKMeans [22]   0.6677  0.6858  0.7955  0.8556  0.7308  0.4398  0.2600
            ASEA [50]        0.6507  0.6990  0.7508  0.7316  0.7701  0.7717  0.7544
            INLPG [51]       0.5107  0.5154  0.6330  0.6376  0.6552  0.6571  0.6552
            Ours             0.8626  0.8624  0.8641  0.8573  0.8789  0.8795  0.8606
DSIFN-CD    GMCD [19]        0.6918  0.5258  0.4653  0.4949  0.5892  0.5957  0.5941
            KPCAMNet [20]    0.1384  0.1924  0.3010  0.3408  0.3596  0.3228  0.3268
            DCVA [21]        0.4887  0.3119  0.3011  0.2941  0.2974  0.2862  0.2762
            PCAKMeans [22]   0.8561  0.7324  0.7837  0.7561  0.7494  0.6021  0.4905
            ASEA [50]        0.7311  0.7345  0.7411  0.7699  0.7648  0.7661  0.7602
            INLPG [51]       0.1260  0.3168  0.3799  0.4625  0.5454  0.5326  0.6376
            Ours             0.8398  0.8244  0.8447  0.8393  0.8356  0.8333  0.8085
LZ          GMCD [19]        0.6684  0.6674  0.6419  0.3911  0.2572  0.2038  0.1525
            KPCAMNet [20]    0.6888  0.6540  0.6401  0.6512  0.6922  0.6902  0.6886
            DCVA [21]        0.3192  0.0815  0.0556  0.0372  0.0388  0.0448  0.0406
            PCAKMeans [22]   0.7862  0.7033  0.3171  0.2038  0.1726  0.1645  0.1383
            ASEA [50]        0.7387  0.6581  0.6441  0.6593  0.5657  0.5502  0.5099
            INLPG [51]       0.4399  0.5621  0.6427  0.6473  0.5634  0.5490  0.6479
            Ours             0.7873  0.7917  0.7904  0.6856  0.7849  0.7796  0.7806
CDD         GMCD [19]        0.5946  0.6166  0.6339  0.6476  0.6524  0.6534  0.6249
            KPCAMNet [20]    0.5726  0.5966  0.6129  0.6144  0.6052  0.6044  0.6072
            DCVA [21]        0.0173  0.0318  0.0461  0.0412  0.0251  0.0252  0.0258
            PCAKMeans [22]   0.6029  0.6326  0.5821  0.4475  0.2793  0.2568  0.1395
            ASEA [50]        0.5597  0.5901  0.6021  0.6139  0.6188  0.6250  0.6007
            INLPG [51]       0.4930  0.4948  0.5057  0.5201  0.5509  0.5413  0.5334
            Ours             0.6234  0.6354  0.6372  0.6218  0.6476  0.6431  0.6392
GZ          GMCD [19]        0.5351  0.5392  0.5164  0.5506  0.5392  0.5285  0.5448
            KPCAMNet [20]    0.4259  0.4174  0.4389  0.4493  0.4633  0.4516  0.4694
            DCVA [21]        0.2324  0.3471  0.3185  0.2897  0.2584  0.3514  0.2101
            PCAKMeans [22]   0.6194  0.6385  0.6367  0.6331  0.6255  0.6204  0.6199
            ASEA [50]        0.6122  0.6161  0.6139  0.6182  0.6292  0.6263  0.6272
            INLPG [51]       0.5294  0.5899  0.6122  0.6020  0.5961  0.5979  0.6090
            Ours             0.7551  0.7485  0.7451  0.7468  0.7439  0.7403  0.7393
Table 5. CD performance of nine ablation methods on the GZ dataset.

Method  FA      MA      OA      Kappa   Recall  F1
M1      0.1774  0.2409  0.9200  0.7403  0.8199  0.8243
M2      0.2367  0.2342  0.9067  0.7064  0.7947  0.7884
M3      0.3648  0.2763  0.8632  0.5902  0.7356  0.7084
M4      0.3484  0.2678  0.8696  0.6073  0.7352  0.7153
M5      0.5470  0.1847  0.7689  0.4401  0.8212  0.6397
M6      0.3399  0.2739  0.8719  0.6109  0.7290  0.7185
M7      0.2510  0.3487  0.8879  0.6284  0.6762  0.7232
M8      0.2327  0.3281  0.8948  0.6522  0.6998  0.7423
M9      0.1748  0.3219  0.9079  0.6889  0.7343  0.7801
Table 6. CD Kappa values of our method on the five datasets with increasing fuzziness level.

Dataset          Fuzziness Level m
                 1.5     2.0     2.5     3.0     3.5     4.0     4.5
Shangtang [45]   0.8758  0.8694  0.8446  0.6638  0.6749  0.6797  0.6968
DSIFN-CD [46]    0.8402  0.8383  0.8405  0.8404  0.8412  0.8401  0.8353
LZ [47]          0.7059  0.7123  0.7150  0.7102  0.7071  0.7067  0.6820
CDD [48]         0.6394  0.6573  0.6349  0.6538  0.6510  0.6293  0.6303
GZ [49]          0.7230  0.7371  0.6801  0.6200  0.4806  0.4955  0.4034
Table 7. Comparison of parameters and computational costs of different methods using the Shangtang dataset.

Method           Time (s)
GMCD [19]        10.24
KPCAMNet [20]    6.05
DCVA [21]        10.49
PCAKMeans [22]   0.48
ASEA [50]        16.69
INLPG [51]       12.44
FCM_SICM-EMD     16.79
SSDNet-FSE       1144.94