Article

DMDiff: A Dual-Branch Multimodal Conditional Guided Diffusion Model for Cloud Removal Through SAR-Optical Data Fusion

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 International Research Center of Big Data for Sustainable Development Goals, Beijing 100094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 965; https://doi.org/10.3390/rs17060965
Submission received: 17 January 2025 / Revised: 5 March 2025 / Accepted: 7 March 2025 / Published: 9 March 2025

Abstract:
Optical remote sensing images, as a significant data source for Earth observation, are often impacted by cloud cover, which severely limits their widespread application in Earth sciences. Synthetic aperture radar (SAR), with its all-weather, all-day observation capabilities, serves as a valuable auxiliary data source for cloud removal (CR) tasks. Despite substantial progress in deep learning (DL)-based CR methods utilizing SAR data in recent years, challenges remain in preserving fine texture details and maintaining image visual authenticity. To address these limitations, this study proposes a novel diffusion-based CR method called the Dual-branch Multimodal Conditional Guided Diffusion Model (DMDiff). Considering the intrinsic differences in data characteristics between SAR and optical images, we design a dual-branch architecture that adapts feature extraction to each modality. A cross-attention mechanism then deeply fuses the extracted multimodal features, effectively guiding the progressive diffusion process to restore cloud-covered regions in optical images. Furthermore, we propose an image adaptive prediction (IAP) strategy within the diffusion model, specifically tailored to the characteristics of remote sensing data, which achieves a nearly 20 dB improvement in PSNR compared to the traditional noise prediction (NP) strategy. Extensive experiments on the airborne, WHU-OPT-SAR, and LuojiaSET-OSFCR datasets demonstrate that DMDiff outperforms state-of-the-art (SOTA) methods in terms of both signal fidelity and visual perceptual quality. Specifically, on the LuojiaSET-OSFCR dataset, our method achieves a remarkable 17% reduction in the FID metric over the second-best method, while also yielding significant improvements in quality assessment metrics such as PSNR and SSIM.

1. Introduction

Optical remote sensing satellites face the widespread challenge of cloud cover during Earth observation. Electromagnetic waves in the visible to infrared spectrum cannot effectively penetrate thick cloud layers, so sensors capture only cloud information while the surface features beneath remain obscured. Studies based on Landsat ETM+ data show that approximately 35% of terrestrial regions are cloud-covered, with even higher coverage over oceanic areas [1]. King et al. [2], by analyzing continuous observational data from the Terra and Aqua satellites over 12 and 9 years, respectively, found that the average cloud coverage in MODIS data was as high as 67%. Furthermore, research by Asner et al. [3] suggests that images with more than 10–15% cloud cover are unsuitable for geospatial studies. However, selecting only cloud-free or minimally cloud-covered images for analysis would waste a substantial amount of data, because most cloud-covered images still contain a significant number of usable pixels. To address this issue, cloud removal (CR) for optical remote sensing images has been widely studied; it aims to remove cloud information in identified cloud-covered regions and restore or reconstruct the surface information beneath the clouds.
Synthetic aperture radar (SAR), an active microwave remote sensing technology, acquires surface information by emitting electromagnetic waves with centimeter-scale wavelengths and receiving their echoes. Due to the superior penetration capability of longer wavelengths, SAR enables all-weather, all-time surface observation, capturing high-quality surface information even under cloud cover. This exceptional characteristic makes SAR an ideal auxiliary data source for CR in optical remote sensing images. When optical satellites are obstructed by clouds and unable to acquire valid information, SAR provides essential prior knowledge of surface features in the same region. However, due to significant differences in the imaging mechanisms and feature representations between SAR and optical sensors, how to effectively use SAR images to assist in optical image CR remains a significant challenge.
In recent years, deep learning (DL)-based methods have gained widespread attention due to their powerful nonlinear modeling capabilities and cross-modal feature extraction abilities. These methods can automatically learn the complex mapping relationships between SAR and optical images, effectively transferring prior knowledge from SAR images to assist in the CR task for optical images. They can be further classified into convolutional neural network (CNN)-based methods, generative adversarial network (GAN)-based methods, and diffusion-based methods [4]. CNN-based methods input cloudy images and SAR images into deep neural networks to predict cloud-free images by optimizing network parameters through backpropagation. For example, Meraner et al. [5] proposed a deep residual network, DSen2-CR, which achieves cloud region reconstruction by stacking multiple residual modules. Li et al. [6] introduced the CMD network, which performs cloud area reconstruction through encoding and decoding. While CNN-based methods effectively extract features from SAR and cloudy images, they may face challenges in capturing the complex contextual information required to generate cloud-free images with high perceptual quality [7].
To enhance the quality of cloud-free image generation, GANs have been applied to the domain of remote sensing cloud removal. GANs employ an adversarial training strategy, simultaneously training two models: a generator, which generates cloud-free images, and a discriminator, which evaluates the authenticity of the generated images. This adversarial process enables the generator to produce cloud-free images with improved perceptual quality. For example, Grohnfeldt et al. [8] introduced the SAR-Opt-cGAN model, which jointly inputs SAR and cloudy images into a GAN. Gao et al. [9] proposed a two-stage model; the first stage converts the SAR image into a simulated optical image, and the second stage employs a GAN to fuse multi-source information for cloud removal. Despite their effectiveness, GAN-based models encounter significant challenges, including mode collapse and training instability. Researchers must carefully design loss functions, learning rates, and other hyperparameters, as well as construct a discriminator that matches the complexity of the generator, in order to maintain a dynamic balance between the generator and discriminator. This requires extensive exploratory experimentation, which increases the workload involved in model tuning.
Recently, diffusion models have become a prominent research direction in computer vision (CV), drawing inspiration from nonequilibrium thermodynamics. The training of diffusion models is based on the connection between the denoising score matching of diffusion probabilistic models and Langevin dynamics, using a weighted variational lower bound [10]. Diffusion models have shown outstanding performance across various CV tasks, including deblurring [11,12], super-resolution [13,14], and inpainting [15,16]. Given the significant advantages of diffusion models in image generation and inpainting, researchers have begun exploring their application in remote sensing image CR tasks. Jing et al. [4] proposed the DDPM-CR model, which leverages diffusion models to extract noise features for subsequent CR tasks. Bai et al. [17] proposed the Conditional diffusion model, which employs diffusion models to learn the feature-mapping relationship between SAR and optical images, effectively converting SAR images into optical images. Recent studies have shown that the excellent feature extraction and nonlinear modeling capabilities of diffusion models enable them to effectively capture the complex relationships between SAR and optical images, offering novel solutions for cloud removal.
Currently, diffusion-based CR methods face several limitations when handling the joint conditional guidance of SAR and cloud imagery. First, they do not adequately consider the significant differences in imaging mechanisms and feature representations between the two data sources, which may lead to information distortion during the feature fusion process. Second, diffusion models generally adopt a noise prediction (NP) strategy, where Gaussian noise is predicted at each iteration. This strategy performs well in natural image tasks, but compared to remote sensing images, natural images have smaller imaging ranges and relatively simple content. Remote sensing images, however, have multiple spectral bands, large imaging ranges, and high variability in the imaged areas. For example, Sentinel-2 satellite data contain 13 spectral bands in the visible to short-wave infrared range, and 256 × 256-pixel sample data cover more than 6.5 square kilometers of the Earth’s surface. This area may include relatively uniform regions such as water, vegetation, or soil, as well as mixed areas such as villages or cities. As can be seen, remote sensing images exhibit significant spectral complexity and spatial heterogeneity, making it difficult for the NP strategy to effectively learn and reconstruct the complex information in remote sensing images. During cloud area reconstruction, this may result in spectral distortion and loss of spatial details.
To fully leverage the auxiliary information provided by SAR images and the cloud-free region information from cloudy images, this paper proposes a novel diffusion-based network for CR, named DMDiff (Dual-branch Multimodal Conditional Guided Diffusion Model). The main contributions of this study are as follows:
  • Considering the significant differences in imaging mechanisms and information characteristics between SAR and optical images, a multimodal feature extraction and feature fusion mechanism is designed. DMDiff incorporates an innovative dual-branch encoder and a cross-modal feature fusion encoder. In the dual-branch encoder, the SAR branch extracts spatial and radiometric information from the SAR image, while the optical branch captures optical signals from the cloud-free regions of the cloudy image. In the feature fusion encoder, cross-attention establishes complementary mapping relationships between the two branches to infer optical information for cloud-covered regions. A de-redundancy mechanism ensures compact feature representations, effectively guiding the diffusion model to generate high-quality cloud-free images during the progressive generation process.
  • To address the limitations of the noise prediction (NP) strategy when diffusion models are applied to complex remote sensing scenarios, this study proposes an image adaptive prediction (IAP) strategy. Unlike the traditional NP strategy, which indirectly models the noise distribution, IAP directly models the target image distribution, which more effectively guides the diffusion model to capture the high spatial heterogeneity and complex spectral characteristics inherent in remote sensing images. This strategy notably enhances the performance of diffusion models in remote sensing CR tasks.
  • A comprehensive experimental validation framework is established. For airborne data, various masking modes are designed to evaluate reconstruction performance. For satellite data, a dataset approximating real-world cloud scenarios is developed using actual cloud masks to analyze restoration performance across different land cover types. Finally, the effectiveness of DMDiff in real-world application scenarios is validated using the LuojiaSET-OSFCR real cloud dataset.
The rest of this paper is organized as follows: Section 2 reviews related work on cloud removal. Section 3 presents a detailed description of our method. Section 4 shows the results of both simulated and real experiments. Section 5 discusses the effectiveness of the model’s components, along with its limitations and directions for future work. Section 6 concludes the paper.

2. Related Work

2.1. End-to-End Method for CR

With the rapid development of deep learning technology, DL-based methods have shown significant advantages in optical remote sensing image CR tasks, achieving performance that surpasses traditional methods. End-to-end CR methods build deep neural network models that take cloudy images or other auxiliary data sources as input, with cloud-free images serving as supervisory signals. Through these neural networks, complex nonlinear mapping functions are learned, enabling the model to adaptively capture the deep feature relationships between the input data and the cloud-free image, thereby facilitating effective cloud region recovery.
CNN-based methods use the hierarchical feature extraction capability of deep convolutional neural networks to construct multi-scale feature representations for cloud area reconstruction. Meraner et al. [5] proposed the DSen2-CR model, which employs a series structure of multiple residual modules to extract deep features from Sentinel-1 SAR images and Sentinel-2 cloudy images, enabling end-to-end cloud area reconstruction. Wen et al. [18] used edge feature extraction (EFE) to capture the edge features of SAR images and input them into a ResNet for cloud-covered region reconstruction. Zhang et al. [19] introduced the DeepGEE-S2CR model, which utilizes a multi-level feature connection strategy to perform cloud removal using online data from the GEE platform. CNN-based methods have yielded impressive results. However, their inherent local receptive fields limit the effective modeling of global contextual information [20]. To overcome this limitation, researchers have begun exploring the application of Transformer architectures for CR tasks. The multi-head self-attention mechanism at the core of Transformers excels at capturing long-range dependencies in the data. Ding et al. [21] proposed the CVAE model, based on a conditional variational autoencoder and Transformer, which combines probabilistic graphical models with the global modeling capability of Transformer to deeply analyze the image degradation process and achieve accurate cloud area reconstruction. To combine the strengths of both CNNs and Transformers, Wu et al. [22] designed the Cloudformer architecture, which employs convolutional operators in the shallow layers to extract local features and self-attention mechanisms in the deep layers to capture global dependencies. The Former-CR method proposed by Han et al. [23] further extends this idea by using Transformer to extract multi-scale features from both SAR and optical images, enabling high-quality cloud area reconstruction.
Methods based on CNNs and Transformers have demonstrated significant performance in CR tasks, but there is still potential for improvement in texture details and image visual authenticity. GANs provide an innovative solution to this issue through an adversarial learning paradigm. Bermudez et al. [24] were the first to explore the feasibility of using a conditional GAN to convert SAR images into optical images. While this method performed well in reconstructing landforms with significant geometric features, it still faced limitations in preserving spectral fidelity. To overcome the limitations of relying on a single data source, Grohnfeldt et al. [8] introduced the SAR-Opt-cGAN model, which jointly inputs SAR and cloudy images into the GAN, demonstrating the significant contribution of multimodal data to improving reconstruction quality. Building upon SAR-Opt-cGAN, Gao et al. [9] introduced the SF-GAN model, where the first stage uses U-Net to convert SAR images into simulated optical images, and the second stage inputs multi-source data into a GAN for cloud removal. In an extension of this approach, Zhang et al. [25] designed the Cloud-Attention GAN, which uses a collaborative optimization of the conversion network, attention network, and GAN to achieve efficient end-to-end cloud removal. Li et al. [26] proposed the TransGAN-CFR model, which incorporates an innovative design of the window multi-head self-attention and feed-forward network, effectively improving the generator’s ability to capture global features. While GAN-based methods can generate visually realistic results, their adversarial training process often faces instability and mode collapse, which severely impacts the model’s convergence and generalization ability.
End-to-end methods have made significant progress in recent years, but some inherent limitations remain. End-to-end methods often rely on using pixel values from surrounding areas to replace cloud-covered regions, a strategy that struggles to effectively reconstruct the underlying texture details and surface features obscured by clouds. This issue becomes particularly pronounced when dealing with large cloud-covered regions, resulting in a substantial decline in reconstruction quality [7].

2.2. Diffusion Models for CR

Recently, diffusion models have attracted widespread attention, with research by Dhariwal et al. [27] showing that they surpass state-of-the-art GAN-based image generation methods. The model comprises two main stages: the forward process, which transforms the data distribution into a standard normal distribution, and the reverse process, which recovers the original data distribution from the standard normal distribution, capturing the feature distribution of the data during the diffusion process. Diffusion models have achieved notable success in various tasks within the CV domain [28,29,30,31,32,33].
Given the success of diffusion models in a wide range of CV tasks, researchers have begun exploring their potential applications in remote sensing image cloud removal. Jing et al. [4] proposed the DDPM-CR model, which uses a diffusion model as a feature extractor to extract features from SAR and cloudy images at different noise levels. These extracted features serve as inputs for a subsequent cloud removal network to generate cloud-free images. Bai et al. [17] introduced a SAR-guided conditional diffusion model, which incorporates SAR images into the diffusion process. This approach enables the diffusion model to learn the feature mapping between SAR and optical images, effectively transforming SAR images into optical images. Sui et al. [7] introduced the Diffusion Enhancement (DE) model, which converts cloudy images into reference cloud-free images using an additional reference model and weights the intermediate variables of the diffusion model with the reference image for conditional guidance.
Despite the significant progress made by the aforementioned methods in remote sensing image cloud removal, several limitations remain. First, these methods may not sufficiently consider the inherent differences in feature representations between SAR and cloudy images. Second, they rely on the NP strategy derived from natural image processing, which struggles to effectively capture the complex characteristics of remote sensing images. To address these limitations, this study proposes a novel diffusion-based cloud removal framework featuring a carefully designed dual-branch feature extraction and multimodal fusion mechanism, effectively integrating information from both SAR and cloudy images. Furthermore, to handle the intricate spatial–spectral characteristics of remote sensing images, we introduce an IAP strategy to replace the NP strategy, enabling the diffusion model to generate high-quality cloud-free images through a progressive generation process.

3. Materials and Methods

3.1. Introduction to Diffusion Model

The concept of diffusion models is derived from nonequilibrium thermodynamics, where the data distribution is learned through a diffusion process based on a Markov chain. This process consists of two core components [10]:
Forward process: Given a data distribution $q(x_0)$, we define a forward process that transforms $q(x_0)$ into $q(x_T)$, where $q(x_T)$ follows a standard normal distribution. Here, $T$ represents the total number of diffusion steps. This process follows a predefined Markov chain, with each step in the forward process governed by the following formula:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),$$
$$q(x_1, \ldots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$$
where $\mathcal{N}$ denotes the Gaussian distribution, $\beta_t \in (0, 1)$ is a hyperparameter that controls the noise intensity, and $I$ is the identity matrix. Based on the above formula, the data distribution $q(x_t \mid x_0)$ at any time step and the corresponding sample can be derived as follows:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\big),$$
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,$$
where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, and $\epsilon$ denotes standard Gaussian noise.
Research by Nichol et al. demonstrated that the linear noise schedule used in DDPM results in excessive noise in the forward process. In contrast, the cosine noise schedule adds noise more smoothly [34]. This schedule is defined as follows:
$$f(t) = \cos\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^{2},$$
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad \beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}.$$
Figure 1 illustrates the visual effects of different noise schedules in the forward process. The linear schedule rapidly transforms the samples into pure noise, while the cosine schedule adds noise more gradually, which is more beneficial for training the diffusion model.
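For concreteness, the forward process and the two noise schedules can be written in a few lines of PyTorch. The sketch below is a minimal illustration rather than the authors' implementation; the linear-schedule endpoints and the tensor shapes are assumed values chosen only for demonstration.

```python
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Linear schedule as in the original DDPM (endpoint values are common defaults, not from the paper).
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    # Cosine schedule of Nichol & Dhariwal: alpha_bar_t = f(t)/f(0), beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}.
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

def q_sample(x0, t, alpha_bar):
    # Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise, noise

T = 2000                                  # diffusion steps used in the paper
betas = cosine_beta_schedule(T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

x0 = torch.rand(2, 13, 256, 256)          # e.g., a small batch of 13-band Sentinel-2 patches
t = torch.randint(0, T, (2,))
xt, eps = q_sample(x0, t, alpha_bar)
```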
Reverse process: This is defined as the transformation from the standard normal distribution $p_\theta(x_T)$ back to the original data distribution $p_\theta(x_0)$ via a parameterized neural network. In the Markov chain, the reverse Gaussian transition for any two adjacent intermediate variables is given by the following:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$$
$$p_\theta(x_0, \ldots, x_{T-1} \mid x_T) = \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$$
where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ represent the mean and variance of the reverse Gaussian distribution at time step $t$, respectively.
During training, the model optimizes $p_\theta$ by minimizing the mean squared error (MSE) loss between the standard Gaussian noise $\epsilon$ and the noise predicted by the neural network, $\epsilon_\theta(x_t, t, c)$. The loss function is defined as follows:
$$\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{\epsilon, x_t, t, c}\left[\,\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2\,\right],$$
where $c$ represents the conditional information added to the neural network.

3.2. Overview of the Cloud Removal Network

This study proposes a novel diffusion-based CR method called DMDiff. The workflow of a single reverse diffusion step in DMDiff is shown in Figure 2. To effectively integrate auxiliary information from SAR images and prior optical signals from the cloud-free regions of cloudy images, we design a dual-branch architecture and a multimodal fusion mechanism. These components provide precise conditional information that guides the diffusion model to reconstruct cloud-covered regions. Considering the complex spatial and spectral characteristics of remote sensing images, we introduce an IAP strategy, which directly predicts the target cloud-free image during each reverse iteration of the diffusion process. This approach enhances the model’s ability to learn the intricate features of remote sensing images.
First, considering the inherent differences in imaging mechanisms and feature representations between SAR and optical images, the two data types are separately input into the dual-branch multimodal feature extraction encoder (DMFEE) for feature extraction. The extracted features are then passed into the multimodal feature fusion and de-redundancy encoder (MFFDE), which establishes complementary mapping relationships between the SAR and optical features. This enables global-scale multimodal feature fusion, effectively inferring optical information for cloud-covered regions. Additionally, the MFFDE removes redundancy in both the spatial and channel dimensions, resulting in a compact feature representation. Finally, the optimized multimodal fused features serve as conditional information and are fed into the denoising neural network (whose architecture is inspired by [34]). Under the supervision of the IAP strategy, this information guides the diffusion model’s progressive generation process to reconstruct high-quality cloud-free images.

3.3. Dual-Branch Multimodal Feature Extraction Encoder

As shown in Figure 3, the design of the DMFEE is based on the understanding of the inherent differences between SAR and cloudy images. These two types of remote sensing data exhibit significant differences in imaging mechanisms and feature representations. To address this, we propose a dual-branch architecture that more effectively extracts and utilizes multimodal data features. SAR images contain rich spatial and radiometric information, particularly excelling in the representation of surface textures and structural features. Leveraging this characteristic, the SAR branch is dedicated to enhancing the extraction of spatial features. Cloudy images contain abundant spatial and spectral information in their cloud-free regions, while the invalid information in cloud-covered regions can degrade the quality of feature extraction. Therefore, the design of the optical branch must focus not only on extracting spatial and spectral features but also on effectively distinguishing between cloud-free and cloud-covered regions. To address these challenges, we have carefully designed the DMFEE structure, enabling the model to better leverage the cloud-immune spatial information from the SAR images and the spatial–spectral information from the cloud-free regions of cloudy images. This enables the model to fully extract multimodal prior features, laying a solid foundation for subsequent feature fusion and image reconstruction. Below, we will detail the key components of the DMFEE and explain how they collaborate to achieve efficient multimodal feature extraction.
(1) Gated convolution (GC): Vanilla convolution (VC) assumes that all pixels contain valid information, making it suitable for tasks like classification and detection. However, in cloud removal, cloud-covered regions typically contain invalid information that must be distinguished from valid data in cloud-free areas [35]. To address this, we replace VC with GC, which combines VC with an adaptive gating mechanism. As shown in Figure 3d, GC introduces a learnable dynamic feature selection mechanism during the convolution process, assigning an independent gating value to each pixel. This enables the model to automatically identify and distinguish valid pixels from invalid ones, suppressing the noise from cloud-covered regions while retaining useful information. To further improve model stability, we substitute the ReLU activation function in the original GC with the GELU activation function. This change helps ensure more stable gradient descent during training. Given an input feature map X , GC can be represented as follows:
$$\mathrm{feature} = W_f \cdot X, \qquad \mathrm{gating} = W_g \cdot X,$$
$$X_{\mathrm{out}} = \Phi(\mathrm{feature}) \odot \sigma(\mathrm{gating}),$$
where $W_f$ and $W_g$ represent the weights of the convolution operations in the convolutional branch and the gating branch, respectively, $\Phi$ is the GELU activation function, $\sigma$ is the Sigmoid activation function, and $\odot$ denotes element-wise multiplication.
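As a concrete illustration, a gated convolution layer consistent with the two equations above can be sketched in PyTorch as follows; the kernel size and padding are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: X_out = GELU(W_f * X) ⊙ Sigmoid(W_g * X)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)  # W_f
        self.gating = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)   # W_g
        self.act = nn.GELU()

    def forward(self, x):
        # The per-pixel gate in (0, 1) suppresses responses from invalid (cloud-covered) pixels.
        return self.act(self.feature(x)) * torch.sigmoid(self.gating(x))
```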
(2) Multi-scale feature extraction module (MFEM): Remote sensing images typically contain objects at various spatial scales. Yang et al. [36] emphasized the critical role of multi-scale feature extraction in CR. Building on this understanding, we designed the MFEM-SAR and MFEM-Opt modules for the SAR and optical branches, respectively, as illustrated in Figure 3b,c. For the SAR branch, MFEM-SAR is specifically tailored to SAR images and consists of three parallel convolutional layers with kernel sizes of 3 × 3, 5 × 5, and 7 × 7. Each convolutional layer outputs feature maps with 128 channels, which are then concatenated along the channel dimension. The concatenated features are then processed through a 1 × 1 convolution to fuse cross-scale information. Finally, the GELU activation function is applied to enhance the module’s non-linear representation ability. For the optical branch, MFEM-Opt is designed for cloudy images and follows a similar structure to MFEM-SAR, with the main difference being the replacement of VC with GC. Kernel sizes of 3 × 3, 5 × 5, and 7 × 7 are used, and the feature concatenation and cross-scale information fusion processes are identical to those in MFEM-SAR. Multi-scale feature extraction enables the model to capture object features at different spatial scales, thereby improving the model’s ability to reconstruct complex surface structures.
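The multi-scale module can be sketched as below, reusing the GatedConv2d layer from the previous sketch for the optical variant; the 128-channel branch width and the 3 × 3 / 5 × 5 / 7 × 7 kernels follow the text, while all other details (input channels, padding) are assumptions.

```python
import torch
import torch.nn as nn

class MFEM(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 branches, channel concatenation, 1x1 cross-scale fusion, GELU.
    With use_gated=True this mirrors MFEM-Opt (gated convs); otherwise MFEM-SAR (vanilla convs)."""
    def __init__(self, in_ch, branch_ch=128, use_gated=False):
        super().__init__()
        # GatedConv2d is the layer sketched above; vanilla Conv2d otherwise.
        conv = GatedConv2d if use_gated else (
            lambda ci, co, k, padding: nn.Conv2d(ci, co, k, padding=padding))
        self.branches = nn.ModuleList(
            [conv(in_ch, branch_ch, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3 * branch_ch, branch_ch, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # concat along channels
        return self.act(self.fuse(feats))                        # cross-scale fusion
```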
(3) Spatial attention module (SAM) and channel attention module (CAM): In remote sensing images, the significance of information from different spatial regions and spectral channels varies for CR tasks. Therefore, it is essential to dynamically adjust the model’s attention to these regions and channels. Based on the aforementioned analysis of SAR and optical data characteristics, this paper introduces a differentiated attention mechanism strategy. The SAR branch employs a SAM to enhance the representation of spatial information. The optical branch, considering the multispectral characteristic of optical remote sensing data, integrates both a SAM and a CAM to achieve joint attention on spatial and spectral features. This differentiated attention mechanism enables the model to adaptively highlight key information within each modality, improving the model’s ability to focus on the most relevant features for cloud removal.
SAM: The spatial attention mechanism is designed as an adaptive region selection strategy that enhances the model’s ability to identify and extract key spatial features from an image. It achieves precise focusing on task-relevant spatial regions by dynamically assigning weights to each spatial position in the feature map. The SAM works as a feature enhancement tool based on pixel-level importance, with the primary goal of optimizing the model’s feature representation by emphasizing semantically important spatial information relevant to the task at hand. As shown in Figure 4, the SAM employs efficient multi-scale attention (EMA) [37] as its core unit. EMA refines the semantic space of features through feature grouping and cross-dimensional interactions, significantly improving the pixel-level semantic information in high-level feature maps. The technical implementation of EMA generates a single-channel spatial attention map with the same spatial dimensions as the input feature map. This attention map assigns differentiated weight coefficients to different spatial regions, enabling the model to focus on the most information-rich areas for the task. The dynamic optimization of feature representation is accomplished by performing element-wise multiplication between the spatial attention map and the input feature map, allowing the model to adaptively focus on significant spatial features.
CAM: In addition to spatial information, spectral information is significant for cloud removal in cloudy images. Therefore, for the optical branch, we integrate a CAM in parallel with the SAM to effectively enhance and extract spectral information from cloudy images. As shown in Figure 5, inspired by the work of Chen et al. [38], this study employs a multi-head channel self-attention mechanism to capture intricate dependencies across different feature channels. By dividing the feature channels into multiple attention heads, the model can explore inter-channel relationships from different feature subspaces, enabling multi-angle modeling of spectral information. The process is as follows: Given an input feature map X , the model generates the query, key, and value matrices through linear transformations. For each attention head, channel attention is calculated using the following formula:
$$Q = \mathrm{Reshape}(X W_Q), \qquad K = \mathrm{Reshape}(X W_K), \qquad V = \mathrm{Reshape}(X W_V),$$
$$\mathrm{Attn}_h = \mathrm{Softmax}\!\left(\frac{Q_h^{T} K_h}{\sqrt{d_{K_h}}}\right) V_h,$$
where $Q_h^{T}$, $K_h$, and $V_h$ represent the transpose of the query, the key, and the value matrices of the $h$-th attention head, respectively, and $d_{K_h}$ denotes the dimension of $K_h$. Finally, the outputs from all attention heads are concatenated to form the final output of the CAM.
$$\mathrm{Attn} = \mathrm{Concat}(\mathrm{Attn}_1, \ldots, \mathrm{Attn}_H).$$
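A minimal sketch of a multi-head channel self-attention block in this spirit is given below; the head count, the 1 × 1 projections, and the scaling factor are assumptions, and the exact layout of the paper's CAM may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Multi-head self-attention over the channel dimension (a sketch, not the paper's exact CAM)."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Conv2d(channels, 3 * channels, kernel_size=1)   # W_Q, W_K, W_V
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Channels act as tokens: (B, heads, C/heads, H*W)
        shape = (b, self.num_heads, c // self.num_heads, h * w)
        q, k, v = q.reshape(shape), k.reshape(shape), v.reshape(shape)
        scale = (h * w) ** -0.5                                        # assumed scaling by key length
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # (C/heads, C/heads) per head
        out = (attn @ v).reshape(b, c, h, w)                           # weighted mix of channels
        return self.proj(out)                                          # concat heads + 1x1 projection
```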

3.4. Multimodal Feature Fusion and De-Redundancy Encoder

As another core architectural component of the model, the MFFDE is responsible for cross-modal feature integration and information optimization. As shown in Figure 6, the MFFDE is meticulously designed to achieve deep multimodal feature fusion while effectively suppressing redundant information. Its design is based on two key insights: first, SAR and optical features exhibit significant complementarity, necessitating the establishment of mapping relationships to infer optical information for cloud-covered regions; second, the multimodal feature fusion process inevitably introduces information redundancy, requiring specialized optimization mechanisms for effective suppression. The implementation of the MFFDE involves two key stages: multimodal feature fusion and feature de-redundancy optimization. In the multimodal feature fusion stage, a multimodal cross-attention feature interaction module (MCFIM) is employed, using cross-attention mechanisms and nonlinear transformations. This module establishes complementary mapping relationships between the spatial structural features of SAR and the features from the cloud-free regions of cloudy images, enabling effective inference of optical information for cloud-covered areas. In the feature de-redundancy stage, the feature de-redundancy module (FDM) integrates spatial and channel reconstruction convolution (SCConv) [39] as its core component to suppress spatial and channel redundancy. Below, we provide a detailed explanation of the key components of MFFDE and illustrate how they achieve efficient multimodal feature fusion and information de-redundancy.
(1) MCFIM: CNNs rely on convolutional kernels for local perception. While the receptive field can be expanded by increasing the number of layers or using larger kernels, CNNs are inherently biased toward local feature extraction, making it difficult to capture long-range global contextual information. In contrast, the self-attention mechanism in Transformer architectures facilitates direct interactions between sequence elements, effectively capturing long-range dependencies. Building on this understanding, we designed the MCFIM to establish global mapping relationships between the spatial structural features of SAR images and the cloud-free region features of cloudy images, enabling effective inference of optical information for cloud-covered regions. As shown in Figure 7, the spatial structural feature map $X_1$ from the SAR branch and the cloud-free region feature map $X_2$ from the optical branch are first subjected to $\mathrm{LayerNorm}$. The features are then passed through convolutional layers to adjust their dimensions and reshaped to generate query, key, and value matrices for attention computation.
$$X_1 = \mathrm{LayerNorm}(X_1), \qquad X_2 = \mathrm{LayerNorm}(X_2),$$
$$Q_1 = \mathrm{Reshape}(X_1 W_{Q_1}), \qquad K_1 = \mathrm{Reshape}(X_1 W_{K_1}), \qquad V_1 = \mathrm{Reshape}(X_1 W_{V_1}),$$
$$Q_2 = \mathrm{Reshape}(X_2 W_{Q_2}), \qquad K_2 = \mathrm{Reshape}(X_2 W_{K_2}), \qquad V_2 = \mathrm{Reshape}(X_2 W_{V_2}),$$
where $W_{Q_1}$, $W_{K_1}$, $W_{V_1}$, $W_{Q_2}$, $W_{K_2}$, and $W_{V_2}$ are projection matrices. Specifically, $Q_2$ and $K_1$ undergo matrix multiplication, as do $Q_1$ and $K_2$, to perform the cross-attention calculation:
$$\mathrm{Attn}_1 = \mathrm{Softmax}\!\left(\frac{Q_2 K_1^{T}}{\sqrt{d_{K_1}}}\right) V_1, \qquad \mathrm{Attn}_2 = \mathrm{Softmax}\!\left(\frac{Q_1 K_2^{T}}{\sqrt{d_{K_2}}}\right) V_2.$$
Inspired by ResNet [40], the output of the attention computation is passed through a 1 × 1 convolution to adjust the channel dimensions, followed by an element-wise summation with the input features via a residual connection. This design promotes model convergence and improves training stability.
$$X_{\mathrm{out}} = \mathrm{Conv}(\mathrm{Attn}_1) + \mathrm{Conv}(\mathrm{Attn}_2) + X_1 + X_2.$$
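A compact sketch of the bidirectional cross-attention fusion described by the equations above is shown below; the use of GroupNorm as a channel-wise LayerNorm substitute, the 1 × 1 projections, and the full-resolution attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MCFIM(nn.Module):
    """Bidirectional cross-attention between SAR features (x1) and optical features (x2),
    followed by 1x1 convs and a residual sum. A sketch; dimensions are illustrative."""
    def __init__(self, channels):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, channels)            # LayerNorm over channels (assumption)
        self.norm2 = nn.GroupNorm(1, channels)
        self.to_qkv1 = nn.Conv2d(channels, 3 * channels, 1)
        self.to_qkv2 = nn.Conv2d(channels, 3 * channels, 1)
        self.out1 = nn.Conv2d(channels, channels, 1)
        self.out2 = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def _attn(q, k, v):
        # q, k, v: (B, HW, C); scaled dot-product attention over spatial tokens.
        scale = q.shape[-1] ** -0.5
        return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

    def forward(self, x1, x2):
        b, c, h, w = x1.shape
        def split(qkv):  # (B, 3C, H, W) -> three (B, HW, C) tensors
            return [t.flatten(2).transpose(1, 2) for t in qkv.chunk(3, dim=1)]
        q1, k1, v1 = split(self.to_qkv1(self.norm1(x1)))
        q2, k2, v2 = split(self.to_qkv2(self.norm2(x2)))
        a1 = self._attn(q2, k1, v1).transpose(1, 2).reshape(b, c, h, w)  # optical queries attend to SAR
        a2 = self._attn(q1, k2, v2).transpose(1, 2).reshape(b, c, h, w)  # SAR queries attend to optical
        return self.out1(a1) + self.out2(a2) + x1 + x2                   # residual fusion
```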
(2) FDM: During the multimodal feature fusion process, the fused feature map inevitably contains redundant information across both the spatial and channel dimensions [41]. This redundancy not only increases the model’s computational complexity but may also introduce potential feature noise. To address this issue, this study applies SCConv to perform redundancy removal on the fused features. SCConv optimizes feature representations and reduces computational overhead through a deep reconstruction mechanism along both the spatial and channel dimensions. As shown in Figure 6, the FDM consists of SCConv, a 1 × 1 convolution, and a GELU activation function. SCConv is responsible for the de-redundancy task, the 1 × 1 convolution adjusts the number of channels in the feature map, and the GELU activation function enhances nonlinear representation. SCConv is composed of two concatenated units: the spatial reconstruction unit (SRU) and the channel reconstruction unit (CRU). The SRU separates and reconstructs features based on their weights to suppress spatial redundancy and enhance feature representation, while the CRU employs a split–transform–fuse strategy to reduce channel-wise information redundancy and optimize computational efficiency. Given an input feature map X , the SCConv process can be expressed as follows:
$$X_w = \mathrm{Reconstruct}\big(\mathrm{Separate}(X)\big),$$
$$Y = \mathrm{Fuse}\big(\mathrm{Transform}(\mathrm{Split}(X_w))\big).$$
Finally, a 1 × 1 convolution is employed to adjust the channel dimension of the feature map, followed by a GELU activation function to enhance the nonlinear representation of the features. After this series of carefully designed feature extraction, fusion, and redundancy removal operations, the resulting feature map provides a highly optimized multimodal feature representation. This representation fully integrates the all-weather observation information from the SAR images and the spatial–spectral information from the cloudy images, serving as a rich conditioning signal to guide the subsequent image denoising diffusion process.
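Structurally, the FDM reduces to a thin wrapper around SCConv. The sketch below assumes an SCConv module (SRU followed by CRU, as in [39]) is provided externally and only illustrates the composition described in the text.

```python
import torch.nn as nn

class FDM(nn.Module):
    """Feature de-redundancy module: SCConv (SRU + CRU from [39]), then a 1x1 convolution to
    adjust channels and a GELU non-linearity. SCConv is assumed to be supplied externally."""
    def __init__(self, in_ch, out_ch, scconv: nn.Module):
        super().__init__()
        self.scconv = scconv                      # spatial + channel reconstruction (SRU -> CRU)
        self.proj = nn.Conv2d(in_ch, out_ch, 1)   # channel adjustment
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.proj(self.scconv(x)))
```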

3.5. Image Adaptive Prediction Strategy

Diffusion models were originally designed for natural images. Existing unconditional diffusion models [10,34] and conditional diffusion models [27,42] typically employ the NP strategy, where the neural network predicts noise at each reverse denoising iteration, and the MSE between the predicted noise and the Gaussian noise is calculated. However, remote sensing images possess distinct data characteristics: they typically contain rich multispectral information, and a single image often covers diverse land cover types, such as buildings, farmland, and vegetation, leading to significant spatial heterogeneity. These characteristics make remote sensing images considerably more complex than natural images, both in terms of spatial and spectral dimensions, which presents significant challenges for CR tasks. In our experiments, we found that even with sufficient conditional information, the NP strategy still struggles to effectively learn and reconstruct the complex spatial structures and spectral information in remote sensing images. To overcome this limitation, we propose the IAP strategy, which innovatively drives the denoising network to directly predict the target cloud-free image, while redefining the loss function as follows:
$$\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{x_0, x_t, t, c}\left[\,\|x_0 - \epsilon_\theta(x_t, t, c)\|^2\,\right],$$
where $x_0$ represents the target cloud-free image. Based on the predicted result $\epsilon_\theta(x_t, t, c)$ from the denoising network, we perform diffusion sampling using the following probabilistic transition formula of the diffusion model to obtain the intermediate state $x_{t-1}$ at time step $t-1$, thereby implementing the step-by-step denoising process of the diffusion model:
$$x_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t, c) + \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t}\; \epsilon.$$
The theoretical foundation for proposing this prediction strategy is primarily based on the following considerations: directly predicting the optical image better adapts to the inherent spatial–spectral feature complexity of remote sensing images. Pixel-level supervision ensures that, during each iteration, the diffusion model can achieve fine reconstruction of local details while preserving global features. Additionally, compared to Gaussian noise, conditional data (SAR and cloudy images) have data distributions that are more similar to the target cloud-free image, which significantly improves the model’s fitting accuracy.
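The IAP strategy therefore changes only the regression target and the way the posterior step is assembled. The sketch below illustrates the IAP loss and a single reverse step under the transition formula above; the denoiser is a placeholder, the scalar-index handling of t is a simplification, and the schedule tensors are those from the earlier forward-process sketch.

```python
import torch
import torch.nn.functional as F

def iap_loss(denoiser, x0, xt, t, cond):
    # IAP: the denoising network predicts the clean image x_0 directly,
    # and the MSE is computed against x_0 instead of the Gaussian noise.
    x0_pred = denoiser(xt, t, cond)
    return F.mse_loss(x0_pred, x0)

@torch.no_grad()
def iap_reverse_step(denoiser, xt, t, cond, betas, alpha_bar):
    # One reverse step x_t -> x_{t-1}: plug the predicted x_0 into the DDPM posterior.
    x0_pred = denoiser(xt, t, cond)
    alpha_t = 1.0 - betas[t]
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (alpha_t.sqrt() * (1 - ab_prev) / (1 - ab_t)) * xt \
         + (ab_prev.sqrt() * betas[t] / (1 - ab_t)) * x0_pred
    if t == 0:
        return mean
    var = (1 - ab_prev) / (1 - ab_t) * betas[t]   # posterior variance
    return mean + var.sqrt() * torch.randn_like(xt)
```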

4. Results and Analysis

4.1. Description of Datasets

This study conducted experiments using three datasets. The simulated experiments used an airborne dataset and the WHU-OPT-SAR dataset, while the real experiments used the LuojiaSET-OSFCR dataset.
The airborne dataset originates from the 2001 IEEE Geoscience and Remote Sensing Symposium (GRSS) Data Fusion Competition [43], containing paired SAR and optical images. The original image size is 2813 × 2289 pixels with a spatial resolution of 1 m. The land cover types include urban areas, roads, and agricultural fields. To accommodate the training requirements of neural networks, the original images were cropped to 128 × 128-pixel image patches, resulting in 1428 paired SAR–optical images.
The WHU-OPT-SAR dataset [44] consists of optical and SAR images acquired by the GF-1 and GF-3 satellites. It includes 100 paired SAR and optical images, with an image size of 5556 × 3704 pixels. The spatial resolution is uniformly resampled to 5 m. The dataset includes a variety of land cover types, such as city, farmland, forest, and water, with many mixed land cover regions and complex spatial structures. The images were cropped to 256 × 256 pixels for model training. To simulate real-world cloudy scenarios, we used true cloud masks from the 38-Cloud dataset [45], which is specifically designed for cloud detection tasks. By overlaying these cloud masks onto the WHU-OPT-SAR dataset, optical images with cloud-missing regions were generated for the experiments.
The LuojiaSET-OSFCR dataset comprises SAR and optical images collected by the Sentinel-1 and Sentinel-2 satellites [46], with an image size of 256 × 256 pixels. The SAR images include two polarization channels, while the optical images consist of 13 bands. The dataset is divided into 10 non-overlapping regions of interest (ROIs), with each data batch containing images captured from the same location and within a short temporal span: SAR images, cloudy images, cloud mask images, and cloud-free images. Using the provided cloud mask images, masked optical images corresponding to the cloudy images can be generated as inputs for model training and evaluation.

4.2. Implementation Details and Metrics

The DMDiff proposed in this study was implemented using PyTorch 1.13.1, with training and testing conducted on a single NVIDIA RTX A6000 GPU. The batch size was set to 4, and the AdamW optimizer [47] was employed with an initial learning rate of 1 × 10−4. The denoising network architecture was inspired by [34], with the number of residual blocks set to 2. Self-attention blocks were applied at resolutions of 8, 16, 32, and 64 to capture global information. During training, a cosine noise schedule was used, with diffusion time steps set to 2000. To enhance training stability and prevent overfitting, an exponential moving average (EMA) strategy was incorporated. During testing, the EMA parameters were used, and the IDDPM sampling strategy was employed, with 250 sampling steps configured to optimize computational efficiency.
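For reference, the EMA strategy keeps a shadow copy of the network weights that is updated after every optimizer step; a minimal sketch is shown below, where the decay rate is a commonly used value and not one reported in the paper.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    # Shadow parameters follow: shadow = decay * shadow + (1 - decay) * current.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Usage (illustrative): ema_model = copy.deepcopy(model); call ema_update(ema_model, model)
# after every optimizer step, and evaluate with ema_model's weights at test time.
```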
We used four evaluation metrics to quantitatively assess the model’s performance: peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM) [48], Fréchet inception distance (FID) [49], and learned perceptual image patch similarity (LPIPS) [50]. PSNR is a widely used pixel-level metric for quantifying image quality. SSIM assesses the structural similarity between images by evaluating brightness, contrast, and structure. FID calculates the distribution-level similarity between images using the Inception-v3 network. LPIPS simulates human perception of image similarity based on neural networks. Given a reconstructed image x and a ground truth image y , the definitions of each metric are as follows:
$$\mathrm{PSNR}(x, y) = 20 \cdot \log_{10}\!\left(\frac{\mathrm{Max}}{\sqrt{\tfrac{1}{L}\sum_{i=1}^{L}(x_i - y_i)^2}}\right),$$
where $L$ represents the number of pixels in the image, and $\mathrm{Max}$ denotes the maximum value that a pixel can take.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
where $\mu_x$ and $\mu_y$ represent the means of the two images, $\sigma_x$ and $\sigma_y$ represent the variances of the two images, $\sigma_{xy}$ represents the covariance of the two images, and $C_1$ and $C_2$ are constants.
$$\mathrm{FID}(x, y) = \|\mu_x - \mu_y\|^2 + \mathrm{Tr}\!\left(\sigma_x + \sigma_y - 2\,(\sigma_x \sigma_y)^{\frac{1}{2}}\right),$$
where $\mu_x$ and $\mu_y$ represent the mean matrices of the two images in the Inception network, $\sigma_x$ and $\sigma_y$ represent the covariance matrices, and $\mathrm{Tr}$ represents the trace of the matrix.
$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \odot \big(F_l(x)_{h,w} - F_l(y)_{h,w}\big) \right\|_2^2,$$
where $F_l$ represents the features extracted by the neural network at layer $l$, and $w_l$ represents the weights learned by the neural network.
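As a worked example of the first metric, PSNR can be computed directly from its definition; the NumPy sketch below assumes images scaled to [0, 1] so that Max = 1.0.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    # PSNR = 20 * log10(Max / RMSE), assuming images scaled to [0, max_val].
    rmse = np.sqrt(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))
    return float("inf") if rmse == 0 else 20.0 * np.log10(max_val / rmse)
```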

4.3. Compared Algorithms

We selected seven DL-based methods for comparison with our method, covering different architectures such as CNN, GAN, and diffusion models: SAR-Opt-cGAN (2018) [8], SpA-GAN (2020) [51], SF-GAN (2020) [9], DSen2-CR (2020) [5], GLF-CR (2022) [52], Cloud-Attention GAN (2023) [25], and conditional diffusion (2024) [17]. SAR-Opt-cGAN concatenates SAR and optical images along the channel dimension as inputs to the GAN. SpA-GAN introduces a spatial attention mechanism into the GAN to enhance the generator’s ability to utilize contextual information. SF-GAN proposes a two-stage CR framework; in the first stage, SAR images are converted into simulated optical images using U-Net, and in the second stage, multi-source information is concatenated and input into the GAN. DSen2-CR designs a CR network based on residual learning, extracting deep features from both SAR and optical images. GLF-CR proposes a global-local fusion algorithm that achieves global information fusion via a self-attention mechanism. Cloud-Attention GAN generates attention maps and simulated optical images based on U-Net to guide the GAN in focusing on cloud region restoration while preserving cloud-free areas. Conditional diffusion embeds SAR images as conditional information in the DDPM framework to achieve the conversion from SAR to optical images. To ensure fairness in the experiments, we strictly followed the configurations from the original papers and code for each method. For models trained with multiple GPUs, we conducted longer training on a single GPU to minimize the effects of hardware differences. For example, conditional diffusion was trained with four GPUs and 50,000 iterations in the original paper, but we adjusted it to a single GPU and 200,000 iterations.

4.4. Simulated Experiment Results

4.4.1. Analysis of Model Reconstruction Performance in Multi-Mask Scenarios

To evaluate the model’s reconstruction performance under different types of missing data scenarios, we designed six different masks for the airborne dataset: half, expand, line, sr, thin, and thick. These masks simulate missing data of varying extents and spatial distribution patterns. The simulated corrupted images are shown in Figure 8a and Figure 9a. For example, both the half and line masks result in 50% of the data being missing. In the line mask, there is high spatial autocorrelation between adjacent pixels, enabling the model to extract contextual information, such as spatial structure and texture features, from the unmasked rows. In this case, the model needs to fully utilize the existing prior knowledge from the masked optical image. In the half mask, the model cannot obtain complete spatial structure information from the optical image, making it difficult to accurately infer the specific details of the right half using only the limited prior knowledge from the left half. Thus, the model must rely more on the supplementary information provided by the SAR image, learning optical land features in the missing regions through cross-modal feature transfer. By comparing the model’s reconstruction performance under different masks, we can comprehensively evaluate its robustness and adaptability to various missing-data scenarios.
First, we present the reconstruction results for linear features, as shown in Figure 8. SpA-GAN, which uses only optical images, suffers from severe structural deformation, with the repaired regions losing spatial coherence with the cloud-free image, resulting in very low visual consistency. SAR-Opt-cGAN is able to recover the overall linear structure, but the reconstructed result exhibits some texture artifacts, and there are discrepancies in the fine details and texture when compared to the original structure. SF-GAN shows slight improvements in reconstruction quality compared to SAR-Opt-cGAN, but artifacts and loss of detail still persist in the repaired regions. Cloud-Attention GAN also does not completely suppress texture artifacts in the reconstruction. Although DSen2-CR and GLF-CR partially recover the linear structure, the reconstructed regions appear visibly blurred, making it difficult to restore fine image details. Conditional diffusion, based on the iterative denoising strategy of the diffusion model, effectively alleviates the structural distortion problem caused by one-step generation. However, since conditional diffusion relies solely on SAR images as conditional constraints, the reconstructed results show significant spectral differences compared to the cloud-free images, highlighting the limitations of single-source reconstruction methods.
Urban areas, with their complex artificial structures and diverse spatial distribution patterns, represent some of the most challenging scenarios in remote sensing images. As shown in Figure 9, conditional diffusion demonstrates certain performance advantages in reconstructing complex urban textures, effectively rebuilding both the overall and detailed features of urban structures. However, its reconstruction results show significant spectral discrepancies when compared to cloud-free images, further highlighting the limitations of single-modality reconstruction methods. End-to-end methods face even more severe challenges in urban scene reconstruction. These methods generally suffer from poor reconstruction quality, manifesting as discontinuous textures, severe artifacts, and blurry details, which significantly degrade the visual quality and information integrity of the reconstructed images. This highlights the bottlenecks encountered by end-to-end methods when dealing with highly complex and information-rich urban remote sensing images.
The DMDiff model proposed in this study employs a carefully designed dual-branch feature extraction architecture and multimodal information fusion mechanism. It learns optical signal features from the non-missing regions of cloudy images and incorporates the supplementary information provided by SAR images for the missing regions. Through cross-modal feature transfer learning, the model generates realistic land cover features in the missing regions. It accurately reconstructs the geometric structures of linear features and the complex spatial characteristics unique to urban areas, maintaining strong spatial coherence with the non-missing regions. Moreover, the spectral characteristics of the reconstructed regions seamlessly align with those of the non-missing regions, effectively avoiding spectral distortion and ensuring the overall visual quality of the image.
Table 1 presents the quantitative evaluation results for each method under different missing scenarios, validating the performance of the proposed method from multiple perspectives. Conditional diffusion relies solely on SAR data, meaning the absence of optical data does not impact its performance. Consequently, its evaluation results remain consistent across different missing data types. In PSNR, DMDiff achieves the best results in five out of six masking scenarios, and the second-best in one. In the challenging expand scenario (75% data missing), DMDiff outperforms the second-best method by 2.3 dB. In the line and sr missing data scenarios, the improvement reaches 6 dB, demonstrating that DMDiff incurs minimal quality loss in reconstruction. In SSIM, DMDiff achieves optimal performance in five masking scenarios, highlighting the model’s exceptional ability in pixel-level structural reconstruction. In the expand scenario, DMDiff is the only method with an SSIM value exceeding 0.7, underscoring its robustness in large-area missing data reconstruction. In the line and sr masking scenarios, DMDiff achieves extraordinarily high values over 0.95, confirming the model’s accuracy in pixel-level reconstruction. In FID, which measures the distribution similarity of generated image features, our method achieves the best results in five masking types. Notably, DMDiff achieves two-digit FID values in all masking types and is the only method to achieve a single-digit FID in the line masking scenario, fully demonstrating the high similarity in feature distribution between the generated results and the original images. In LPIPS, which evaluates perceptual quality, DMDiff outperforms all other methods in every masking type, indicating that the reconstruction results are highly consistent with human visual perception. In the half and expand scenarios, DMDiff is the only method with an LPIPS value below 0.2. In the line and sr missing scenarios, it achieves exceptionally low perceptual differences of 0.02 and 0.03, respectively, fully validating the perceptual realism of the model’s reconstruction results.
The comprehensive quantitative evaluation across the four metrics demonstrates that DMDiff excels in multi-modal learning and cross-modal feature transfer, effectively using the prior knowledge from both SAR images and the corrupted images. In contrast, another diffusion-based method, conditional diffusion, while maintaining the overall spatial structure, suffers from spectral distortion due to relying solely on SAR data as a conditional constraint, which adversely impacts its quantitative evaluation metrics.

4.4.2. Analysis of Reconstruction Performance for Different Land Surface Types Using Real Cloud Masks

To evaluate the performance of the proposed method in reconstructing images across different land cover types, this study systematically classifies the WHU-OPT-SAR dataset based on land cover types. Specifically, the dataset is divided into six typical categories: city, farmland, forest, road, village, and water. For each land cover type, 300 images are randomly selected for model training, and 50 images are reserved for testing. To simulate real-world cloudy scenarios, we use true cloud masks from the 38-Cloud dataset. The processing pipeline for generating simulated cloudy images is as follows: (1) Cloud mask selection: randomly select a cloud mask image from the 38-Cloud dataset. (2) Resampling: adjust the selected cloud mask to match the spatial resolution of the optical images in the WHU-OPT-SAR dataset using bilinear interpolation. (3) Binarization: convert the resampled cloud mask into a binary format, where pixels with a value of 0 indicate cloud-covered regions, and pixels with a value of 1 indicate cloud-free regions. (4) Cloud application: perform pixel-wise multiplication between the binarized cloud mask and the optical images from the WHU-OPT-SAR dataset. For pixels with a mask value of 0, the corresponding optical image values are set to 0 (cloud-covered regions), while pixels with a mask value of 1 retain their original values (cloud-free regions). (5) Final dataset preparation: The generated cloudy images, incorporating real cloud mask distributions, are used as inputs for the model, while the original cloud-free images serve as ground truth for training and evaluation.
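Steps (2)–(4) of this pipeline can be expressed compactly; the sketch below assumes the cloud mask and the optical patch are already loaded as NumPy arrays and that high mask values mark clouds, with file handling omitted.

```python
import numpy as np
from scipy.ndimage import zoom

def simulate_cloudy(optical, cloud_mask, threshold=0.5):
    """optical: (H, W, C) cloud-free patch; cloud_mask: (h, w) mask from 38-Cloud.
    Returns the simulated cloudy image following steps (2)-(4) above."""
    # (2) Resample the mask to the optical image's spatial size (bilinear interpolation).
    scale = (optical.shape[0] / cloud_mask.shape[0], optical.shape[1] / cloud_mask.shape[1])
    mask = zoom(cloud_mask.astype(np.float32), scale, order=1)
    # (3) Binarize: 0 marks cloud-covered pixels, 1 marks cloud-free pixels
    #     (assumes high values in the source mask indicate clouds).
    mask = (mask < threshold).astype(optical.dtype)
    # (4) Pixel-wise multiplication zeroes out the cloud-covered regions.
    return optical * mask[:, :, None]
```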
Figure 10 presents the reconstruction results on the WHU-OPT-SAR dataset. The results for city and village areas demonstrate that, in highly complex artificial surface scenes characterized by dense building clusters, road networks, and other challenging spatial structures, DMDiff is capable of generating architectural layouts with reasonable spatial organization in the missing regions, ensuring seamless connections with the surrounding areas. This result fully validates the model’s robustness and superiority in highly heterogeneous environments. The reconstruction results for farmland showcase the model’s outstanding performance in recovering fine-grained spatial features. By effectively integrating the spatial information provided by SAR images, the model accurately reconstructs the geometric shape and boundary features of the plots, clearly restoring the fine demarcation lines between farmland. The reconstructed regions exhibit a high degree of consistency with the unmasked areas in terms of spatial structure. For water and forest, two homogeneous land cover types, the proposed method demonstrates exceptional performance in reconstructing spectral continuity. The reconstructed regions transition smoothly into the surrounding areas, without noticeable spectral discontinuities or texture inconsistencies. This result highlights the model’s ability to effectively leverage the spectral information from the non-missing regions of the optical images, enabling high-precision spectral reconstruction for homogeneous land cover types. In road reconstruction, the model excels in both spatial continuity and spectral consistency restoration. It successfully reconstructs the missing roads, preserving the topological continuity of the road network, with the reconstructed roads maintaining a high degree of spectral consistency with the original roads. This performance highlights the model’s effectiveness in cross-modal feature learning, which contributes to the accurate reconstruction of prominent geometric structural features.
In contrast, end-to-end comparison methods exhibit certain limitations in reconstruction quality, often resulting in blurred reconstructions of the cloud regions and loss of details. This performance gap may be attributed to the shortcomings of the end-to-end model’s one-step generation strategy, which might struggle with the complexity of the dataset. On the other hand, the diffusion model, with its progressive generation mechanism, fine-tunes the image’s texture and spectral features during the iterative denoising process, producing more visually consistent reconstruction results. Notably, another diffusion-based method, conditional diffusion, performs poorly in this experiment, exhibiting severe issues with spatial and spectral distortion. This degradation in performance is primarily due to the limitations of the NP strategy in learning the complex spatial–spectral features of remote sensing images. A detailed analysis of this phenomenon will be provided in the ablation study section.
Table 2 presents the quantitative evaluation results of the proposed method and compared methods on the WHU-OPT-SAR dataset. At the overall test set level, the proposed method achieves the best performance across all four evaluation metrics. Specifically, it achieves a 0.3 dB improvement in PSNR over the second-best method, GLF-CR, and surpasses Cloud-Attention GAN by 20 units in terms of FID, demonstrating the superior performance of the proposed method in both image reconstruction quality and feature distribution consistency.
The proposed method shows significant adaptability and robustness across different land cover types. For complex artificial surface scenes (city and village), the FID improves by 40 and 36 units, respectively, compared to the second-best method, Cloud-Attention GAN. This suggests that the proposed method has a superior ability to learn feature distributions and accurately capture and reconstruct the complex artificial features of cities and villages, highlighting its outstanding performance in highly complex spatial environments. In the reconstruction of homogeneous land cover types (forest and water), the PSNR improves by 0.1 and 0.9 dB compared to GLF-CR, and the LPIPS decreases by 0.04 and 0.07 compared to Cloud-Attention GAN, fully validating the model’s superiority in maintaining both spatial continuity and texture similarity for homogeneous land cover types. In terms of human visual perception, the proposed method generates reconstructions that are much closer to the original images. For linear land cover types (farmland and road), the proposed method improves the SSIM by 0.02 and 0.001 compared to the second-best method, GLF-CR, demonstrating its excellent capability in restoring the spatial structure of linear land features.
The quantitative analysis of end-to-end methods highlights their limitations in reconstructing complex remote sensing images. In city scene reconstruction, the SSIM for existing methods is generally below 0.7, indicating significant challenges in capturing and reconstructing complex spatial structures. The FID for all land cover types typically exceeds 100, suggesting a considerable discrepancy in feature distribution between the generated and real samples. With the exception of Cloud-Attention GAN, most other end-to-end methods have LPIPS values above 0.3, indicating a substantial deviation between the reconstructed images and human visual perception.
Compared to conditional diffusion, the proposed method demonstrates comprehensive improvements in performance. On the overall evaluation of the test set, the PSNR increases by 16 dB, SSIM increases by 0.49, FID decreases by 136, and LPIPS decreases by 0.46. These substantial improvements across multiple quantitative metrics fully validate the proposed method’s exceptional performance in preserving image quality, reconstructing spatial structures, learning feature distributions, and aligning with human visual perception.

4.5. Real Experiment Results

To evaluate the practicality of the model on real-world data, this study conducted experiments using the LuojiaSET-OSFCR dataset. To eliminate outliers, the 13 bands of Sentinel-2 data were clipped to the range [0, 10,000], while the two bands of Sentinel-1 data were clipped to the ranges [−25, 0] and [−32.5, 0], respectively. Two regions of interest (ROIs) from the dataset were selected for training and testing. The model was trained and tested using the full-band data to ensure comprehensive utilization of multispectral information.
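For reference, the band-wise clipping can be written as follows. Only the clipping limits are stated in the text; the subsequent rescaling to [−1, 1] is an assumption, reflecting a common convention for diffusion-model inputs.

```python
# Band-wise clipping for the LuojiaSET-OSFCR experiments. The clipping limits follow
# the text; the rescaling to [-1, 1] is an assumed normalization step.
import numpy as np


def preprocess_s2(s2):
    """s2: (13, H, W) Sentinel-2 bands; clip to [0, 10000], then scale to [-1, 1]."""
    s2 = np.clip(s2.astype(np.float32), 0.0, 10000.0)
    return s2 / 10000.0 * 2.0 - 1.0


def preprocess_s1(s1):
    """s1: (2, H, W) Sentinel-1 backscatter in dB; per-band clip, then scale to [-1, 1]."""
    lows = np.array([-25.0, -32.5], dtype=np.float32).reshape(2, 1, 1)
    s1 = np.clip(s1.astype(np.float32), lows, 0.0)
    return (s1 - lows) / (-lows) * 2.0 - 1.0
```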
Figure 11 presents the reconstruction results of different methods on two ROIs, along with zoomed-in views of the reconstructed regions. SpA-GAN, which uses only the cloudy image as input, performs significantly worse in cloud region reconstruction due to the absence of supplementary information from SAR images. The reconstruction results exhibit large-scale texture artifacts and struggle to effectively recover the spatial structure of the cloud regions. SAR-Opt-cGAN, which jointly constrains SAR and cloudy images, shows some improvement in spatial structure recovery compared to SpA-GAN. However, the zoomed-in results reveal that the model tends to fill the cloud regions with meaningless texture artifacts, leading to significant spectral distortion. SF-GAN provides a noticeable improvement over SAR-Opt-cGAN, with the overall reconstruction closely resembling the target cloud-free image and preserving the spatial and spectral information of the cloud-free regions. However, the zoomed-in results indicate that the texture recovery in the cloud regions is still insufficient, with some blurring observed. Cloud-Attention GAN and SF-GAN exhibit similar overall reconstruction styles, but the enlarged results reveal that the spatial structure recovery in the cloud regions of Cloud-Attention GAN is slightly inferior to that of SF-GAN. DSen2-CR, with its residual connections, retains good information in the cloud-free regions. However, the cloud regions show almost no recovery of effective landform information, resulting in destructive artifacts. GLF-CR, through global information fusion, performs better than DSen2-CR in spectral feature recovery, with the cloud-free regions being closer to the target cloud-free image. Nevertheless, the cloud regions still exhibit blurry results. The generated results of conditional diffusion show significant noise characteristics. While there are some similarities in spatial structure compared to the cloud-free image, such as ridge lines and farmland boundaries, the fine-texture details are almost entirely lost, and the overall visual quality deviates substantially from the cloud-free image.
DMDiff, through its innovative dual-branch feature learning architecture and cross-modal feature fusion mechanism, effectively learns multimodal information from SAR and cloudy images, guiding the diffusion model to generate highly realistic optical land cover representations in the cloud regions. The reconstruction results show that the model can effectively generate naturally continuous ridge lines and homogeneous textures, restoring farmland boundary features, reconstructing farmland landscapes with reasonable geometric shapes and spatial layouts, recovering residential building information, and preserving fine details of small features while maintaining excellent spatial consistency with surrounding areas. The quantitative results in Table 3 further validate the outstanding performance of DMDiff, with improvements of 0.4%, 0.1%, 16.8%, and 6.8% over the second-best method in the four evaluation metrics, respectively.
Given the rich spectral information in Sentinel-2 multispectral images, this study compares the reconstruction performance of different methods from a spectral perspective across typical land cover types. Figure 12 presents the reconstructed spectral curves for vegetation, bare soil, and artificial surfaces. For all three land cover types, the spectral curves generated by the proposed method closely match those of the cloud-free image. In contrast, conditional diffusion, which relies solely on SAR images as the conditional constraint, struggles to capture the rich spectral features of remote sensing images due to its noise prediction strategy, resulting in significant deviations from the cloud-free spectral curve. For vegetation, the spectral curve generated by the proposed method closely matches the cloud-free image, especially in the high-reflectance red-edge region. For artificial surfaces, most of the comparison methods exhibit varying degrees of spectral deviation: SAR-Opt-cGAN and Cloud-Attention GAN show significant deviations at 783 nm, while DSen2-CR displays a notable deviation at 1610 nm. In contrast, DMDiff accurately reconstructs the spectral curve of artificial surfaces. For bare soil, SAR-Opt-cGAN and SpA-GAN show substantial spectral reconstruction errors, and DSen2-CR exhibits significant deviations between 783 and 865 nm. DMDiff, however, maintains high consistency with the cloud-free image across all bands, with only a slight difference at 2190 nm. Overall, DMDiff demonstrates superior spectral reconstruction of typical land cover types compared with existing methods.
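The per-class spectral curves in Figure 12 amount to averaging each band over the pixels of a land-cover sample. A minimal sketch is given below; the boolean sample mask and band count are illustrative assumptions.

```python
# Mean spectrum of a land-cover sample: one value per Sentinel-2 band, averaged over
# the pixels selected by a boolean class mask. Variable names are illustrative.
import numpy as np


def mean_spectrum(image, class_mask):
    """image: (13, H, W) reflectance; class_mask: (H, W) boolean sample selection."""
    return image[:, class_mask].mean(axis=1)  # shape (13,)


# Example: per-band deviation of a reconstruction from the cloud-free reference.
# deviation = np.abs(mean_spectrum(reconstructed, veg_mask) -
#                    mean_spectrum(cloud_free, veg_mask))
```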

5. Discussion

5.1. Ablation Studies

To further validate the contributions of each key component and their interactions within the DMDiff framework, this section presents a systematic ablation study using the LuojiaSET-OSFCR dataset. The experiments are designed at two levels: single-component ablation and module-level ablation. In the single-component ablation, we assess the individual impact of six core components: GC, MFEM, SAM and CAM, MCFIM, SCConv, and IAP. Each component is removed separately (denoted as model w/o), and its effect on model performance is evaluated. For the module-level ablation, we examine two key modules: DMFEE (consisting of GC, MFEM, SAM, and CAM) and MFFDE (consisting of MCFIM and SCConv). Each module is removed (model w/o) to assess its overall contribution and the synergistic effects of its internal components. This hierarchical ablation analysis provides a comprehensive evaluation of both individual components and their combined impact within the model.
(1) Ablation study of GC: To address the challenge of invalid pixels in cloud regions, GC is employed in the optical branch instead of VC in this study. As shown in Table 4, removing GC leads to a performance degradation: PSNR decreases by 1.21 dB, SSIM decreases by 0.0171, FID increases by 10.32, and LPIPS increases by 0.0256. This performance difference can be attributed to the unique advantages of GC in handling cloudy images. By assigning adaptive gating values to each pixel, the model can effectively distinguish between valid and invalid regions, thereby minimizing the interference caused by cloud regions during feature extraction.
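As an illustration of the idea, a gated convolution can be sketched in a few lines of PyTorch. The kernel sizes and activations below follow the common gated-convolution formulation and are not necessarily those used in DMDiff.

```python
# Minimal gated convolution (GC): a feature path modulated by a per-pixel sigmoid gate,
# which lets the network suppress responses from invalid (cloud-covered) pixels.
import torch
import torch.nn as nn


class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):
        # Gate values near 0 mask out features from cloud regions; values near 1 pass them through.
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))
```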
(2) Ablation study of the MFEM: The MFEM is designed to effectively capture hierarchical representations of multi-scale features in remote sensing images. To assess its effectiveness, the MFEM was replaced with a single-scale convolution layer (3 × 3 VC and 3 × 3 GC). Quantitative results show that removing the MFEM leads to a notable decrease in model performance: PSNR decreases by 1.35 dB, SSIM decreases by 0.0197, FID increases by 11.54, and LPIPS increases by 0.026. This performance degradation can be attributed to the inability of single-scale convolutions to capture features at multiple spatial scales, causing the model to lose its ability to model the hierarchical structure of features.
(3) Ablation study of the SAM and CAM: The SAM and CAM are designed to enable adaptive modeling of feature importance for both spatial regions and spectral channels. The ablation study shows that removing the SAM and CAM results in a performance drop: PSNR decreases by 1.68 dB, SSIM decreases by 0.0209, FID increases by 15.88, and LPIPS increases by 0.0272. This outcome highlights the core advantage of the attention mechanism in feature learning: by dynamically adjusting the weight distribution of spatial regions and spectral channels, the model can more accurately capture the most relevant features essential for the reconstruction task, thereby improving the efficiency of learning complex spatial–spectral features.
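For illustration, a simplified channel/spatial attention pair in the spirit of the CAM and SAM is sketched below; the exact DMDiff configurations are those shown in Figures 4 and 5 and may differ from this CBAM-style version.

```python
# CBAM-style channel and spatial attention, used here only to illustrate the idea of
# reweighting spectral channels and spatial positions; not the exact DMDiff modules.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        w = self.mlp(x.mean(dim=(2, 3)))              # global average pooling over H, W
        return x * torch.sigmoid(w)[..., None, None]  # per-channel reweighting


class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))   # per-position reweighting
```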
(4) Ablation study of the MCFIM: This study designed the MCFIM based on the Transformer architecture to achieve deep feature complementarity and fusion between SAR and optical modalities. To validate its effectiveness, the MCFIM was replaced with a simple feature summation operation, resulting in performance degradation: PSNR decreased by 0.69 dB, SSIM decreased by 0.0101, FID increased by 5.11, and LPIPS increased by 0.0081. These results confirm the importance of the MCFIM in multi-modal feature fusion. By leveraging the cross-attention mechanism to capture global dependencies between modalities, the MCFIM facilitates deep fusion and complementary enhancement of multimodal features.
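The core operation of the MCFIM, cross-attention between the two modalities, can be sketched as follows. Treating the optical features as queries and the SAR features as keys/values is an assumption for illustration; the actual module also contains additional normalization and feed-forward layers.

```python
# Cross-attention fusion of SAR and optical feature maps: optical tokens query SAR
# tokens, so each optical position can gather complementary SAR information globally.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_opt, f_sar):
        # f_opt, f_sar: (B, C, H, W) feature maps with identical shapes.
        b, c, h, w = f_opt.shape
        q = f_opt.flatten(2).transpose(1, 2)    # (B, H*W, C) optical tokens (queries)
        kv = f_sar.flatten(2).transpose(1, 2)   # (B, H*W, C) SAR tokens (keys/values)
        fused, _ = self.attn(self.norm(q), kv, kv)
        fused = fused + q                        # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)
```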
(5) Ablation study of SCConv: The goal of SCConv is to reduce redundancy in the fused multimodal features. An ablation study was conducted by constructing a control group without SCConv to quantitatively assess the contribution of this module. As shown in Table 4, removing SCConv results in a performance drop: PSNR decreases by 0.71 dB, SSIM decreases by 0.0106, FID increases by 6.97, and LPIPS increases by 0.011. These results underscore the important role of SCConv in feature optimization. By performing joint reconstruction along the spatial and channel dimensions, the module enables adaptive optimization of information, providing a more refined feature representation for the denoising network, thereby enhancing reconstruction quality.
(6) Ablation study of IAP: In response to the inherent high spatial heterogeneity and complex spectral characteristics of remote sensing images, this study introduces the IAP strategy as a replacement for the NP strategy. To comprehensively evaluate the effectiveness and generalizability of IAP, we designed a bidirectional validation experiment:
  • DMDiff ablation experiment: After replacing IAP with NP, the model performance significantly deteriorated. PSNR decreased by 21.6 dB, SSIM decreased by 0.4443, FID increased by 174.88, and LPIPS increased by 0.4726.
  • Conditional diffusion transfer experiment: As shown in Table 5, by integrating the IAP strategy into conditional diffusion, a substantial performance improvement was observed. PSNR increased by 18.75 dB, SSIM increased by 0.2755, FID decreased by 58.74, and LPIPS decreased by 0.3213.
Figure 13 further validates the necessity of the IAP strategy through qualitative analysis. When the NP strategy is used, both DMDiff and conditional diffusion exhibit significant spatial and spectral distortions in the generated results, with visual outputs deviating substantially from the cloud-free image. In contrast, after adopting the IAP strategy, both models demonstrate stable spectral features, with visual distortions almost completely eliminated. The generated results align much more closely with the physical properties of remote sensing images. These experimental findings highlight the IAP strategy as a versatile approach for applying diffusion models to remote sensing CR tasks. By directly predicting the target cloud-free image, the IAP strategy enables the diffusion model to more effectively capture the spatial–spectral characteristics of remote sensing images, thereby significantly enhancing the quality of the generated samples.
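The difference between the two strategies reduces to the regression target of the denoising network. The sketch below writes this as a simplified DDPM-style training step; variable names and the unweighted MSE loss are illustrative assumptions.

```python
# NP vs. IAP training targets in a simplified DDPM-style step: NP regresses the noise
# added in the forward process, while IAP regresses the cloud-free image x0 directly.
import torch
import torch.nn.functional as F


def training_step(model, x0, cond, alphas_cumprod, strategy="IAP"):
    """x0: clean optical image; cond: fused SAR / cloudy-image conditioning features;
    alphas_cumprod: 1-D tensor of cumulative noise-schedule products."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward diffusion sample

    pred = model(x_t, t, cond)
    target = x0 if strategy == "IAP" else noise               # IAP predicts the image itself
    return F.mse_loss(pred, target)
```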
(7) Ablation study of the DMFEE: As a key feature extraction module of our model, the DMFEE employs a dual-branch structure to separately process SAR and cloudy images, ensuring comprehensive feature extraction and adaptation for different modalities. To assess its overall contribution, we replace the DMFEE with a simplified version, where the original complex dual-branch structure (incorporating GC, the MFEM, the SAM, and the CAM) is reduced to a basic feature extractor consisting of only two 3 × 3 convolutional layers. As shown in Table 4, experimental results reveal a significant decline in performance due to this simplification: PSNR decreases by 2.06 dB, SSIM decreases by 0.0233, FID increases by 18.67, and LPIPS increases by 0.0317. Notably, this performance degradation is substantially greater than that observed when removing GC, the MFEM, or the SAM and CAM individually, highlighting the synergistic effect of these components within the DMFEE. By comparing single-component ablation results with module-level ablation findings, we observe that the combination of components in the DMFEE is not merely additive but forms a complementary and reinforcing structure. This integrated design enables more effective capture of complex priors in SAR and cloudy images, providing a more comprehensive multimodal feature representation for subsequent feature fusion and cloud removal.
(8) Ablation study of the MFFDE: As a key fusion module, the MFFDE enhances multimodal feature interaction and complementarity through the MCFIM, while SCConv optimizes feature redundancy removal. To assess the overall contribution of the MFFDE, we replace it with a simple 1 × 1 convolution layer, which performs only basic feature channel fusion without deep interaction or redundancy removal. As shown in Table 4, this simplification leads to a decline in performance: PSNR decreases by 1.15 dB, SSIM decreases by 0.0128, FID increases by 9.81, and LPIPS increases by 0.0132. Notably, this degradation exceeds the performance drop observed when removing the MCFIM or SCConv individually, highlighting a strong synergistic effect between these two components within the MFFDE. The 1 × 1 convolution fails to facilitate deep multimodal feature interaction and lacks the ability to optimize the fused features, leading to suboptimal multimodal representations being passed to the denoising network. These findings further validate the significant role of the MFFDE in multimodal feature fusion and optimization. Specifically, the MCFIM captures global dependencies between modalities via a cross-attention mechanism, while SCConv performs spatial-channel joint reconstruction for adaptive feature optimization. The collaborative interaction between these components provides the denoising network with a more refined feature representation, ultimately enhancing cloud removal performance.

5.2. Computational Cost

To provide a more comprehensive quantitative analysis of computational complexity, we report the number of model parameters (Params) and floating point operations (FLOPs) for each method under a consistent experimental setup. Specifically, all models are evaluated using input data with a resolution of 256 × 256 pixels, 13 spectral bands, and a batch size of 4. The detailed Params and FLOPs are presented in Table 6.
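The Params column of Table 6 can be reproduced directly from the model object; obtaining the FLOPs additionally requires a profiler and the 4 × 13 × 256 × 256 input described above. The model variable below is a placeholder.

```python
# Counting trainable parameters in millions with plain PyTorch; a profiler such as
# fvcore or ptflops would be needed on top of this to obtain the FLOPs column.
import torch


def count_params_million(model):
    return sum(p.numel() for p in model.parameters()) / 1e6


# example_input = torch.randn(4, 13, 256, 256)           # batch 4, 13 bands, 256 x 256
# print(f"Params: {count_params_million(model):.2f} M")  # model is a placeholder
```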
The DMDiff model ranks third in parameter count, maintaining a relatively reasonable model scale. However, due to the inclusion of computationally intensive modules such as gated convolution and spatial-channel attention mechanisms, DMDiff exhibits a relatively higher FLOP count. While these modules increase computational complexity, they significantly enhance the model’s ability to effectively utilize prior information from SAR and cloudy images. This advantage is particularly evident in cross-modal feature extraction and the processing of complex cloud-covered regions.
Despite the increased computational cost, the substantial improvements in reconstruction quality justify this trade-off. Moreover, with continuous advancements in computing hardware, the impact of higher computational complexity is expected to diminish over time.

5.3. Limitations and Future Directions

Although the proposed method demonstrates outstanding performance across multiple datasets, certain limitations warrant further investigation.
(1) Exploring Latent Diffusion Models
Training and inference of diffusion models in pixel space require substantial computational resources, resulting in significant time costs for both stages. Specifically, on two ROIs of the LuojiaSET-OSFCR dataset, the training process takes approximately 3–4 days, while generating 400 samples during inference requires around 2 h. This high computational cost poses challenges to the deployment and practical application of the model.
To optimize computational efficiency while maintaining model performance, we plan to focus future research on latent diffusion models (LDMs). LDMs use an autoencoder to compress high-dimensional pixel-space data into a lower-dimensional latent space, where the diffusion process is performed. For instance, compressing 256 × 256 inputs into a 32 × 32 latent representation reduces each spatial dimension by a factor of eight, shrinking the number of spatial positions processed by the diffusion model by a factor of 64 and greatly lowering its computational cost. This approach strikes a better balance between performance and efficiency, making it more suitable for real-world applications.
(2) Addressing Thin Cloud Effects
In this study, we utilize multiple datasets with pre-generated cloud masks that delineate most cloud-covered regions, enabling a focused reconstruction of these masked areas. However, the cloud mask generation process has inherent limitations, particularly in accurately identifying and fully masking thin cloud regions, especially along cloud edges. Different cloud types (e.g., cirrus, stratus, and cumulus) introduce thin clouds with varying thicknesses, textures, and spectral properties, leading to distinct feature patterns for the same ground objects under different conditions. The interaction between these partially transparent clouds and surface information has not been fully accounted for in our current approach, potentially affecting reconstruction performance. To address this limitation, future research could explore the following directions:
  • Designing specialized feature extraction strategies for thin cloud regions, such as employing graph neural networks (GNNs) to model spatial dependencies between thin clouds and surrounding clear areas;
  • Integrating multi-temporal data to leverage time-series analysis for mitigating thin cloud influences;
  • Incorporating additional multimodal data sources to provide a more comprehensive observation of thin cloud-affected regions.
These enhancements are expected to improve the model’s robustness and applicability under complex cloud cover conditions, particularly for challenging thin cloud regions that are difficult to mask accurately.
(3) Downstream Task Validation for Cloud Removal Results
This study leverages complementary information provided by SAR data to assist in reconstructing cloud-covered regions. The process remains an estimation influenced by several factors. Significant modality differences exist between SAR and optical data due to their distinct imaging mechanisms, resolutions, and surface feature responses. Additionally, even with quasi-synchronous acquisition, temporal discrepancies between SAR and optical sensors can lead to variations in surface features, especially for rapidly changing phenomena such as seasonal vegetation dynamics and water body fluctuations. These factors collectively limit reconstruction accuracy, potentially affecting the model’s applicability in high-precision scenarios. To address this limitation, future research could explore the following:
  • Practical usability evaluation by applying cloud-free reconstructions to downstream tasks (e.g., land cover classification, change detection) and comparing them with ground truth data;
  • Synergistic multi-temporal reconstruction integrating optical and SAR data from different time points to mitigate uncertainty in single-temporal estimations.
These advancements could improve the reliability of reconstructed images and enhance the method’s applicability in real-world remote sensing tasks.

6. Conclusions

This study proposes the Dual-branch Multimodal Conditional Guided Diffusion Model (DMDiff) for cloud removal in optical remote sensing images. The model uses SAR images as auxiliary data while fully exploiting the valid information in the cloud-free regions of cloudy images to accurately reconstruct surface features obscured by clouds. Considering the modal differences in imaging mechanisms and feature representations between SAR and optical images, we designed a dual-branch multimodal feature extraction encoder with independent feature extraction branches, enabling full adaptation to the unique characteristics of each modality and maximizing the extraction of effective information. Building on this foundation, we further developed a multimodal feature fusion and de-redundancy encoder, which achieves complementary fusion of multimodal features on a global scale while introducing an innovative redundancy reduction mechanism to eliminate information redundancy during feature fusion, significantly enhancing the quality of the final feature representation. Additionally, considering that remote sensing images exhibit more complex spatial and spectral features than natural images, we proposed an image adaptive prediction (IAP) strategy to replace the traditional noise prediction (NP) strategy. The IAP strategy guides the denoising network to directly predict the target cloud-free image, facilitating more effective learning of the complex spatial and spectral features of remote sensing data. Extensive experiments on airborne and satellite datasets demonstrate that DMDiff outperforms existing methods across different payload platforms, various missing data types, and common land cover types, confirming its effectiveness and robustness.
Despite these achievements, this study still has certain limitations. First, the training and inference of the current diffusion model in pixel space demand substantial computational resources, limiting its practical efficiency. Second, thin cloud regions that are difficult to fully identify during cloud masking may affect reconstruction quality. Third, modality differences and temporal inconsistencies between SAR and optical data can impact the accuracy of cloud-covered region reconstruction. To address these challenges, future work will focus on the following:
  • Exploring latent diffusion models to improve computational efficiency;
  • Developing specialized feature extraction strategies for thin clouds and integrating multi-temporal data;
  • Validating the practical usability of reconstruction results through downstream tasks and investigating collaborative reconstruction strategies with multi-temporal data.
These efforts aim to enhance the efficiency and reliability of SAR-assisted optical cloud removal for real-world remote sensing applications.

Author Contributions

Conceptualization, J.M. and W.Z.; methodology, J.M.; software, Y.W.; validation, W.Z.; formal analysis, J.M.; investigation, W.Z.; resources, W.Z.; data curation, J.M.; writing—original draft preparation, J.M.; writing—review and editing, Y.W. and W.Z.; visualization, J.M.; supervision, W.Z.; project administration, W.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC no. 42201503) and the Science and Disruptive Technology Program, AIRCAS (No. 2024-AIRCAS-SDTP-11).

Data Availability Statement

Our code and dataset are available at https://github.com/WenjuanZhang-aircas/DMDiff (accessed on 1 March 2025).

Acknowledgments

The authors would like to express their gratitude to OpenAI for the open-source release of the improved diffusion code. We also extend our thanks to the researchers at Wuhan University for making the WHU-OPT-SAR and LuojiaSET-OSFCR datasets publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ju, J.; Roy, D.P. The availability of cloud-free Landsat ETM+ data over the conterminous United States and globally. Remote Sens. Environ. 2008, 112, 1196–1211. [Google Scholar] [CrossRef]
  2. King, M.D.; Platnick, S.; Menzel, W.P.; Ackerman, S.A.; Hubanks, P.A. Spatial and Temporal Distribution of Clouds Observed by MODIS Onboard the Terra and Aqua Satellites. IEEE Trans. Geosci. Remote Sens. 2013, 51, 3826–3852. [Google Scholar] [CrossRef]
  3. Asner, G.P. Cloud cover in Landsat observations of the Brazilian Amazon. Int. J. Remote Sens. 2001, 22, 3855–3862. [Google Scholar] [CrossRef]
  4. Jing, R.; Duan, F.; Lu, F.; Zhang, M.; Zhao, W. Denoising Diffusion Probabilistic Feature-Based Network for Cloud Removal in Sentinel-2 Imagery. Remote Sens. 2023, 15, 2217. [Google Scholar] [CrossRef]
  5. Meraner, A.; Ebel, P.; Zhu, X.X.; Schmitt, M. Cloud removal in Sentinel-2 imagery using a deep residual neural network and SAR-optical data fusion. ISPRS J. Photogramm. Remote Sens. 2020, 166, 333–346. [Google Scholar] [CrossRef] [PubMed]
  6. Li, W.; Li, Y.; Chan, J.C.W. Thick Cloud Removal With Optical and SAR Imagery via Convolutional-Mapping-Deconvolutional Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2865–2879. [Google Scholar] [CrossRef]
  7. Sui, J.; Ma, Y.; Yang, W.; Zhang, X.; Pun, M.O.; Liu, J. Diffusion Enhancement for Cloud Removal in Ultra-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  8. Grohnfeldt, C.; Schmitt, M.; Zhu, X. A Conditional Generative Adversarial Network to Fuse SAR and Multispectral Optical Data for Cloud Removal from Sentinel-2 Images. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1726–1729. [Google Scholar]
  9. Gao, J.; Yuan, Q.; Li, J.; Zhang, H.; Su, X. Cloud Removal with Fusion of High Resolution Optical and SAR Images Using Generative Adversarial Networks. Remote Sens. 2020, 12, 191. [Google Scholar] [CrossRef]
  10. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
  11. Whang, J.; Delbracio, M.; Talebi, H.; Saharia, C.; Dimakis, A.G.; Milanfar, P. Deblurring via Stochastic Refinement. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16272–16282. [Google Scholar]
  12. Ren, M.; Delbracio, M.; Talebi, H.; Gerig, G.; Milanfar, P. Multiscale Structure Guided Diffusion for Image Deblurring. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 10687–10699. [Google Scholar]
  13. Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59. [Google Scholar] [CrossRef]
  14. Gao, S.; Liu, X.; Zeng, B.; Xu, S.; Li, Y.; Luo, X.; Liu, J.; Zhen, X.; Zhang, B. Implicit Diffusion Models for Continuous Super-Resolution. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10021–10030. [Google Scholar]
  15. Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Gool, L.V. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11451–11461. [Google Scholar]
  16. Xia, B.; Zhang, Y.; Wang, S.; Wang, Y.; Wu, X.; Tian, Y.; Yang, W.; Gool, L.V. DiffIR: Efficient Diffusion Model for Image Restoration. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13049–13059. [Google Scholar]
  17. Bai, X.; Pu, X.; Xu, F. Conditional Diffusion for SAR to Optical Image Translation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  18. Wen, Z.; Suo, J.; Su, J.; Li, B.; Zhou, Y. Edge-SAR-Assisted Multimodal Fusion for Enhanced Cloud Removal. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  19. Zhang, X.; Qiu, Z.; Peng, C.; Ye, P. Removing Cloud Cover Interference from Sentinel-2 Imagery in Google Earth Engine by Fusing Sentinel-1 SAR Data with a CNN Model. Int. J. Remote Sens. 2022, 43, 132–147. [Google Scholar] [CrossRef]
  20. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  21. Ding, H.; Zi, Y.; Xie, F. Uncertainty-Based Thin Cloud Removal Network via Conditional Variational Autoencoders. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macao, China, 4–8 December 2022; pp. 469–485. [Google Scholar]
  22. Wu, P.; Pan, Z.; Tang, H.; Hu, Y. Cloudformer: A Cloud-Removal Network Combining Self-Attention Mechanism and Convolution. Remote Sens. 2022, 14, 6132. [Google Scholar] [CrossRef]
  23. Han, S.; Wang, J.; Zhang, S. Former-CR: A Transformer-Based Thick Cloud Removal Method with Optical and SAR Imagery. Remote Sens. 2023, 15, 1196. [Google Scholar] [CrossRef]
  24. Bermudez, J.D.; Happ, P.N.; Oliveira, D.A.B.; Feitosa, R.Q. SAR to Optical Image Synthesis for Cloud Removal with Generative Adversarial Networks. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2018, 4, 5–11. [Google Scholar] [CrossRef]
  25. Zhang, S.; Li, X.; Zhou, X.; Wang, Y.; Hu, Y. Cloud removal using SAR and optical images via attention mechanism-based GAN. Pattern Recognit. Lett. 2023, 175, 8–15. [Google Scholar] [CrossRef]
  26. Li, C.; Liu, X.; Li, S. Transformer Meets GAN: Cloud-Free Multispectral Image Reconstruction via Multisensor Data Fusion in Satellite Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  27. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; pp. 8780–8794. [Google Scholar]
  28. Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J.-Y. Multi-Concept Customization of Text-to-Image Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1931–1941. [Google Scholar]
  29. Kim, G.; Kwon, T.; Ye, J.C. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2426–2435. [Google Scholar]
  30. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector Quantized Diffusion Model for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706. [Google Scholar]
  31. Tan, H.; Wu, S.; Pi, J. Semantic Diffusion Network for Semantic Segmentation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 8702–8716. [Google Scholar]
  32. Esser, P.; Chiu, J.; Atighehchian, P.; Granskog, J.; Germanidis, A. Structure and Content-Guided Video Synthesis with Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 7312–7322. [Google Scholar]
  33. Tumanyan, N.; Geyer, M.; Bagon, S.; Dekel, T. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1921–1930. [Google Scholar]
  34. Nichol, A.; Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  35. Wang, Y.; Zhang, B.; Zhang, W.; Hong, D.; Zhao, B.; Li, Z. Cloud Removal With SAR-Optical Data Fusion Using a Unified Spatial-Spectral Residual Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
  36. Yang, Q.; Wang, G.; Zhao, Y.; Zhang, X.; Dong, G.; Ren, P. Multi-Scale Deep Residual Learning for Cloud Removal. In Proceedings of the 2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 4967–4970. [Google Scholar]
  37. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  38. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual Aggregation Transformer for Image Super-Resolution. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 12278–12287. [Google Scholar]
  39. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Wang, Y.; Zhang, W.; Zhang, B. Cloud Removal With PolSAR-Optical Data Fusion Using A Two-Flow Residual Network. arXiv 2025, arXiv:2501.07901. [Google Scholar]
  42. Zhao, X.; Jia, K. Cloud Removal in Remote Sensing Using Sequential-Based Diffusion Models. Remote Sens. 2023, 15, 2861. [Google Scholar] [CrossRef]
  43. Hellwich, O.; Reigber, A.; Lehmann, H. Sensor and Data Fusion Contest: Test Imagery to Compare and Combine Airborne SAR and Optical Sensors for Mapping. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Toronto, ON, Canada, 24–28 June 2002; pp. 82–84. [Google Scholar]
  44. Li, X.; Zhang, G.; Cui, H.; Hou, S.; Wang, S.; Li, X.; Chen, Y.; Li, Z.; Zhang, L. MCANet: A joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102638. [Google Scholar] [CrossRef]
  45. Mohajerani, S.; Krammer, T.A.; Saeedi, P. A Cloud Detection Algorithm for Remote Sensing Images Using Fully Convolutional Neural Networks. In Proceedings of the 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), Vancouver, BC, Canada, 29–31 August 2018; pp. 1–5. [Google Scholar]
  46. Pan, J.; Xu, J.; Yu, X.; Ye, G.; Wang, M.; Chen, Y.; Ma, J. HDRSA-Net: Hybrid dynamic residual self-attention network for SAR-assisted optical image cloud and shadow removal. ISPRS J. Photogramm. Remote Sens. 2024, 218, 258–275. [Google Scholar] [CrossRef]
  47. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  48. Zhou, W.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  49. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  50. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  51. Pan, H. Cloud Removal for Remote Sensing Imagery via Spatial Attention Generative Adversarial Network. arXiv 2020, arXiv:2009.13015. [Google Scholar]
  52. Xu, F.; Shi, Y.; Ebel, P.; Yu, L.; Xia, G.-S.; Yang, W.; Zhu, X.X. GLF-CR: SAR-enhanced cloud removal with global-local fusion. ISPRS J. Photogramm. Remote Sens. 2022, 192, 268–278. [Google Scholar] [CrossRef]
Figure 1. Samples generated in the forward process with different noise schedules.
Figure 2. Dual-branch Multimodal Conditional Guided Diffusion Model.
Figure 3. Dual-branch multimodal feature extraction encoder. (a) Dual-branch multimodal feature extraction encoder (DMFEE). (b) Multi-scale feature extraction module (MFEM-SAR). (c) Multi-scale feature extraction module (MFEM-Opt). (d) Gated convolution (GC).
Figure 4. Spatial attention module.
Figure 5. Channel attention module.
Figure 6. Multimodal feature fusion and de-redundancy encoder.
Figure 7. Multimodal cross-attention feature interaction module.
Figure 8. Results on airborne dataset scene 1. The red boxes represent enlarged views of the local regions. (a) Corrupted image. (b) SAR image. (c) SAR-Opt-cGAN. (d) SpA-GAN. (e) SF-GAN. (f) DSen2-CR. (g) GLF-CR. (h) Cloud-Attention GAN. (i) Conditional diffusion. (j) DMDiff (ours). (k) Cloud-free image.
Figure 9. Results on airborne dataset scene 2. The red boxes represent enlarged views of the local regions. (a) Corrupted image. (b) SAR image. (c) SAR-Opt-cGAN. (d) SpA-GAN. (e) SF-GAN. (f) DSen2-CR. (g) GLF-CR. (h) Cloud-Attention GAN. (i) Conditional diffusion. (j) DMDiff (ours). (k) Cloud-free image.
Figure 10. Results on WHU-OPT-SAR dataset. The red boxes represent enlarged views of the local regions. (a) Corrupted image. (b) SAR image. (c) SAR-Opt-cGAN. (d) SpA-GAN. (e) SF-GAN. (f) DSen2-CR. (g) GLF-CR. (h) Cloud-Attention GAN. (i) Conditional diffusion. (j) DMDiff (ours). (k) Cloud-free image.
Figure 11. Results on LuojiaSET-OSFCR dataset. 1 and 2 represent two scenes from the dataset. The red and yellow boxes represent enlarged views of the local regions. 1-(a) and 2-(a) Cloud-free image. 1-(b) and 2-(b) SAR image. 1-(c) and 2-(c) Cloudy image. 1-(d) and 2-(d) SAR-Opt-cGAN. 1-(e) and 2-(e) SpA-GAN. 1-(f) and 2-(f) SF-GAN. 1-(g) and 2-(g) DSen2-CR. 1-(h) and 2-(h) GLF-CR. 1-(i) and 2-(i) Cloud-Attention GAN. 1-(j) and 2-(j) Conditional diffusion. 1-(k) and 2-(k) DMDiff (ours).
Figure 12. Spectral curves of different feature types. The legend is located in subplot (a) Vegetation, where different colors and markers represent different methods in the subplots.
Figure 13. Results of conditional diffusion and DMDiff with different strategies on LuojiaSET-OSFCR dataset. 1 and 2 represent two scenes from the dataset. The red boxes represent enlarged views of the local regions. 1-(a) and 2-(a) Cloud-free image. 1-(b) and 2-(b) Cloudy image. 1-(c) and 2-(c) Conditional diffusion + NP. 1-(d) and 2-(d) Conditional diffusion + IAP. 1-(e) and 2-(e) DMDiff + NP. 1-(f) and 2-(f) DMDiff + IAP.
Table 1. Quantitative evaluation on airborne dataset. (a) SAR-Opt-cGAN. (b) SpA-GAN. (c) SF-GAN. (d) DSen2-CR. (e) GLF-CR. (f) Cloud-Attention GAN. (g) Conditional diffusion. (h) DMDiff (ours). Bold text indicates the best results, and underlined text indicates the second-best results. ↑ indicates higher is better, and ↓ indicates lower is better.

Metric     Method   Half      Expand    Line      SR        Thin      Thick
PSNR ↑     (a)      21.10     19.59     27.48     25.21     26.26     21.60
           (b)      16.91     16.39     26.23     24.13     30.58     22.99
           (c)      24.50     22.19     32.38     29.24     29.48     22.95
           (d)      21.63     20.92     30.44     27.68     33.45     24.83
           (e)      20.87     18.93     30.61     28.48     20.75     25.12
           (f)      23.70     22.20     31.81     28.90     29.87     22.97
           (g)      17.75     17.75     17.75     17.75     17.75     17.75
           (h)      25.03     24.55     38.47     35.94     32.63     25.20
SSIM ↑     (a)      0.6484    0.5362    0.8223    0.7198    0.8597    0.6596
           (b)      0.5778    0.4221    0.7980    0.6836    0.9343    0.7350
           (c)      0.7812    0.6660    0.9266    0.8408    0.9293    0.7395
           (d)      0.6974    0.6193    0.8975    0.8150    0.9574    0.8034
           (e)      0.6792    0.5460    0.9013    0.8337    0.8211    0.8054
           (f)      0.7761    0.6690    0.9272    0.8624    0.9313    0.7463
           (g)      0.6792    0.6792    0.6792    0.6792    0.6792    0.6792
           (h)      0.7928    0.7468    0.9788    0.9623    0.9465    0.8065
FID ↓      (a)      142.10    209.12    86.26     128.93    102.34    170.47
           (b)      168.99    263.54    96.92     146.50    43.49     134.05
           (c)      115.95    208.29    25.67     62.77     56.89     142.87
           (d)      167.67    251.48    47.63     93.37     48.18     178.86
           (e)      188.25    279.27    48.66     83.93     166.65    156.02
           (f)      104.35    194.99    37.85     64.31     55.63     145.76
           (g)      81.69     81.69     81.69     81.69     81.69     81.69
           (h)      64.38     86.90     9.88      15.07     25.32     76.43
LPIPS ↓    (a)      0.2931    0.3934    0.1634    0.2356    0.1661    0.3171
           (b)      0.3447    0.5077    0.1865    0.2650    0.0730    0.2422
           (c)      0.2021    0.3272    0.0669    0.1176    0.0896    0.2551
           (d)      0.3119    0.4535    0.1061    0.1872    0.0786    0.2921
           (e)      0.3487    0.5327    0.1043    0.1709    0.2514    0.2668
           (f)      0.2068    0.3291    0.0735    0.1270    0.0926    0.2682
           (g)      0.2350    0.2350    0.2350    0.2350    0.2350    0.2350
           (h)      0.1569    0.1931    0.0209    0.0332    0.0472    0.1558
Table 2. Quantitative evaluation on WHU-OPT-SAR dataset. (a) SAR-Opt-cGAN. (b) SpA-GAN. (c) SF-GAN. (d) DSen2-CR. (e) GLF-CR. (f) Cloud-Attention GAN. (g) Conditional diffusion. (h) DMDiff (ours). Bold text indicates the best results, and underlined text indicates the second-best results. ↑ indicates higher is better, and ↓ indicates lower is better.

Metric     Method   All       City      Farmland  Forest    Road      Village   Water
PSNR ↑     (a)      25.04     22.31     25.63     27.44     24.95     23.44     26.44
           (b)      23.93     22.49     24.50     28.08     23.68     21.87     22.96
           (c)      27.09     24.48     28.67     28.52     27.44     25.88     27.54
           (d)      28.33     25.14     29.16     32.50     28.23     26.42     28.54
           (e)      28.84     25.51     29.42     33.11     28.51     26.65     29.80
           (f)      28.30     24.58     28.78     32.61     28.06     25.98     29.80
           (g)      13.38     12.97     14.35     12.73     13.56     12.87     13.80
           (h)      29.19     25.15     30.07     33.20     29.07     26.94     30.75
SSIM ↑     (a)      0.6529    0.6116    0.6478    0.6525    0.6481    0.6560    0.7016
           (b)      0.6833    0.6660    0.6924    0.7120    0.6785    0.6888    0.6622
           (c)      0.7477    0.6903    0.7617    0.7491    0.7543    0.7418    0.7892
           (d)      0.7661    0.6986    0.7730    0.7991    0.7699    0.7505    0.8056
           (e)      0.7709    0.6974    0.7757    0.8133    0.7725    0.7506    0.8157
           (f)      0.7470    0.6728    0.7525    0.7927    0.7439    0.7282    0.7918
           (g)      0.2805    0.1838    0.2963    0.3194    0.2858    0.2247    0.3728
           (h)      0.7733    0.7006    0.7984    0.7933    0.7736    0.7554    0.8183
FID ↓      (a)      92.53     150.44    129.74    116.93    142.97    137.63    159.91
           (b)      101.19    148.77    156.33    162.39    157.02    141.53    179.72
           (c)      88.17     140.34    132.48    120.47    142.74    126.32    153.14
           (d)      94.49     151.66    148.65    108.39    151.46    132.08    165.79
           (e)      97.77     161.06    149.48    117.39    153.76    132.16    160.32
           (f)      76.02     128.36    107.48    85.78     124.34    118.93    139.15
           (g)      192.40    274.44    260.83    216.58    274.21    274.69    267.35
           (h)      55.92     88.33     80.19     61.70     94.62     82.07     115.95
LPIPS ↓    (a)      0.3326    0.3192    0.3179    0.3710    0.3275    0.3045    0.3553
           (b)      0.3286    0.3241    0.3244    0.3423    0.3205    0.2918    0.3684
           (c)      0.3093    0.3125    0.2900    0.3540    0.3035    0.2812    0.3148
           (d)      0.3117    0.3240    0.2938    0.3587    0.3081    0.2827    0.3029
           (e)      0.3284    0.3378    0.3149    0.3803    0.3244    0.2975    0.3152
           (f)      0.2796    0.2837    0.2704    0.2938    0.2801    0.2570    0.2926
           (g)      0.6816    0.6632    0.7001    0.6877    0.6826    0.6924    0.6636
           (h)      0.2202    0.2206    0.1985    0.2533    0.2265    0.2012    0.2211
Table 3. Quantitative evaluation on LuojiaSET-OSFCR dataset. (a) SAR-Opt-cGAN. (b) SpA-GAN. (c) SF-GAN. (d) DSen2-CR. (e) GLF-CR. (f) Cloud-Attention GAN. (g) Conditional diffusion. (h) DMDiff (ours). Bold text indicates the best results, and underlined text indicates the second-best results. ↑ indicates higher is better, and ↓ indicates lower is better.

Metric     (a)       (b)       (c)       (d)       (e)       (f)       (g)       (h)
PSNR ↑     25.64     24.48     34.08     26.74     31.00     34.16     13.41     34.29
SSIM ↑     0.7303    0.6966    0.8957    0.7818    0.8637    0.9001    0.5802    0.9012
FID ↓      169.60    178.74    106.61    210.10    191.95    133.11    181.35    88.67
LPIPS ↓    0.5497    0.4903    0.3060    0.4667    0.4235    0.3343    0.6825    0.2852
Table 4. Quantitative evaluation of the model ablation study. (a) Model w/o GC. (b) Model w/o MFEM. (c) Model w/o SAM and CAM. (d) Model w/o MCFIM. (e) Model w/o SCConv. (f) Model w/o IAP. (g) Model w/o DMFEE. (h) Model w/o MFFDE. (i) DMDiff (ours). Bold text indicates the best results. ↑ indicates higher is better; ↓ indicates lower is better.

Metric     (a)       (b)       (c)       (d)       (e)       (f)       (g)       (h)       (i)
PSNR ↑     33.08     32.94     32.61     33.60     33.58     12.69     32.23     33.14     34.29
SSIM ↑     0.8841    0.8815    0.8803    0.8911    0.8906    0.4569    0.8779    0.8884    0.9012
FID ↓      98.99     100.21    104.55    93.78     95.64     263.55    107.34    98.48     88.67
LPIPS ↓    0.3108    0.3112    0.3124    0.2933    0.2962    0.7578    0.3169    0.2984    0.2852
Table 5. Quantitative evaluation of conditional diffusion using different strategies. Bold text indicates the best results. ↑ indicates higher is better; ↓ indicates lower is better.

Method                   Strategy   PSNR ↑   SSIM ↑   FID ↓     LPIPS ↓
Conditional diffusion    NP         13.41    0.5802   181.35    0.6825
                         IAP        32.16    0.8557   122.61    0.3612
Table 6. Params and FLOPs of different methods on LuojiaSET-OSFCR dataset.

Method                   Params (M)   FLOPs (G)
SAR-Opt-cGAN             57.23        92.20
SpA-GAN                  0.23         69.44
SF-GAN                   111.66       170.93
DSen2-CR                 18.94        4964.71
GLF-CR                   14.83        981.11
Cloud-Attention GAN      166.11       243.05
Conditional Diffusion    158.82       1340.22
DMDiff (Ours)            137.86       5645.67
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
