1. Introduction
Significant progress has been made in Earth observation satellite technology, especially in optical Remote Sensing (RS) imagery, which is now used extensively across industries such as agricultural management, urban planning, environmental monitoring, and resource prospecting. According to the International Satellite Cloud Climatology Project, the average annual cloud cover worldwide is as high as 66% [1,2]. In practice, a significant portion of RS imagery is obscured by clouds and fog, leading to information loss, blurred details, and color distortion in the affected areas. This degradation adversely impacts a range of downstream tasks, such as target detection [3], land cover change monitoring [4], and land cover classification [5], hindering their effectiveness and reliability.
Thick cloud cover may obstruct all surface information, making it challenging to achieve satisfactory cloud removal results from a single RS image and limiting its research value. Thin clouds, characterized by their lower optical thickness and more limited impact on RS images, permit partial sunlight transmission and allow some surface information to reach satellite sensors [6]. This property makes it feasible to recover data in thin-cloud-covered areas from a single image. As a result, removing thin clouds from individual RS images is an important task when preprocessing RS imagery.
Substantial progress has been made in reconstructing images affected by thin clouds and haze using both Deep Learning (DL)-based and conventional image processing techniques. Traditional methods have been used extensively in previous studies due to their simplicity and interpretability. Many traditional approaches treat thin clouds as low-frequency components in the frequency domain [7,8,9], processing the low-frequency information to reduce or eliminate the impact of thin clouds and haze while enhancing surface features. For instance, Hu et al. [10] presented a thin cloud removal technique based on the dual-tree complex wavelet transform that removes clouds through sub-band decomposition and low-frequency coefficient prediction. Imaging models, which describe how images are formed in optical systems, have also been applied extensively to remove thin clouds from RS images [11,12,13]. Sahu et al. [14] proposed a method that divides the image into blocks, selects the block with the highest score to estimate the atmospheric light, eliminates atmospheric scattering through novel color channels, and then employs illumination scaling factors to improve the dehazed image. He et al. [15] introduced the Dark Channel Prior (DCP) algorithm in 2010, which estimates the atmospheric light component and transmittance in cloudy images based on an atmospheric scattering model in order to remove thin clouds and haze; many subsequent methods for thin cloud and haze removal are derived from DCP [16,17,18]. Image filtering techniques are also widely employed for thin cloud and haze removal [19,20]. The Homomorphic Filtering (HF) method proposed by Peli et al. [21] is a commonly used approach: the image is first converted from the spatial domain to the frequency domain, then integrated with the atmospheric scattering model to remove thin clouds and haze by compressing brightness and enhancing contrast. Although traditional image processing methods can successfully remove clouds under certain conditions, they rely on specific physical assumptions and mathematical models that may not fully reflect real-world conditions, making it difficult to accurately restore the underlying ground information.
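To make the hypothesis-driven pipeline concrete, the following is a minimal sketch of DCP-style transmission estimation and dehazing under the atmospheric scattering model; the patch size, the haze weight `omega`, and the 0.1% bright-pixel heuristic are common defaults rather than values taken from [15].

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB channels, then a local minimum filter."""
    min_rgb = img.min(axis=2)
    return minimum_filter(min_rgb, size=patch)

def estimate_transmission(img, patch=15, omega=0.95):
    """DCP-style estimates of atmospheric light A and transmission t.

    img: float array in [0, 1] with shape (H, W, 3).
    """
    dc = dark_channel(img, patch)
    # Atmospheric light: mean color of the brightest 0.1% dark-channel pixels.
    n = max(1, int(dc.size * 0.001))
    idx = np.unravel_index(np.argsort(dc, axis=None)[-n:], dc.shape)
    A = img[idx].mean(axis=0)
    # Transmission from the scattering model: t = 1 - omega * dark(I / A).
    t = 1.0 - omega * dark_channel(img / A, patch)
    return A, np.clip(t, 0.1, 1.0)

def dehaze(img, patch=15, omega=0.95):
    A, t = estimate_transmission(img, patch, omega)
    # Invert the scattering model I = J * t + A * (1 - t) to recover J.
    return np.clip((img - A) / t[..., None] + A, 0.0, 1.0)
```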
With the recent development of DL technology, numerous studies have developed cloud and haze removal models tailored to the characteristics of optical RS images. Unlike traditional methods, DL approaches can adaptively learn and optimize, providing advantages when handling complex scenes; furthermore, they can generate cloud-free images that are more realistic and exhibit finer details. Zhang et al. [22] proposed a unified spatiotemporal spectral framework based on Convolutional Neural Networks (CNNs) to eliminate clouds using multisource data for unified processing. Li et al. [23] introduced an end-to-end Residual Symmetric Cascaded Network (RSC-Net), which employs symmetrical convolutional–deconvolutional concatenations to better preserve detail in declouded images. Zhou et al. [24] proposed a multiscale attention residual network for thin cloud removal that combines large-scale filters and fine-grained convolution residual blocks to enhance feature extraction. Ding et al. [25] presented a Conditional Variational Auto-Encoder (CVAE) with uncertainty analysis that produces multiple plausible cloud-free images for each cloudy input. Zi et al. [26] proposed a wavelet integral convolutional neural network, integrating the inverse wavelet transform into an encoder–decoder architecture for thin cloud removal. Guo et al. [27] introduced the Cloud Perception Integrated Fast Fourier Convolutional Network (CP-FFCN), a single-image blind method for thin cloud removal; CP-FFCN employs a cloud perception module and a fast Fourier convolution reconstruction module to effectively model and remove clouds in RS images without requiring external knowledge of the cloud distribution. Generative Adversarial Networks (GANs) are widely applied to cloud removal in RS images because they effectively generate realistic and diverse images [28,29,30]. Wang et al. [31] improved the structural similarity of cloud-free images by introducing a novel objective function in their conditional GAN for cloud removal. Li et al. [32] removed thin clouds by integrating physical cloud distortion models into a GAN. Tan et al. [33] presented GAN-UD, a contrastive learning-based unsupervised technique for removing thin clouds from RS images.
While current DL-based methods can successfully remove thin clouds from RS images with relatively simple structures, such as clear delineation between cloud and land features and limited atmospheric interference, they falter when clouds vary in thickness or shape. These methods often fail to achieve high-quality restoration in areas with uneven cloud cover, leading to detail loss and blurring. Most existing approaches focus primarily on recovering information from cloud-covered regions, which can result in over-correction in thinly clouded areas, causing artifacts and edge distortion. Additionally, architectural limitations in current models may lead to color deviations from the actual surface even after successful cloud removal; such discrepancies and uneven restoration not only degrade image quality but also hinder downstream tasks that are sensitive to color variations, such as land cover classification and quantitative analyses.
This study introduces an end-to-end network based on a sparse transformer that leverages a multi-head self-attention mechanism, which we name the Sparse Transformer-based Generative Adversarial Network (SpT-GAN). Exploiting the excellent long-range modeling capability of multi-head self-attention [34] for capturing relationships between pixels, the proposed network models complex long-range dependencies while ignoring irrelevant information, enhancing the restoration of areas covered by thin clouds. In addition, this study introduces a Global Enhancement Feature Extraction (GEFE) module designed to take advantage of the partial ground detail information that penetrates thin clouds and reaches satellite sensors; this module enhances the model's ability to preserve ground information in cloud-free and sparsely cloud-covered areas. The proposed method makes the following contributions:
1. We introduce a sparse multi-head self-attention (sparse attention) module to build the transformer block within the generator. This module utilizes the self-attention mechanism's outstanding long-range modeling capability to model global pixel relationships, enhancing the model's ability to reconstruct cloud-free images. It employs a weight-learnable filtering mechanism to retain information from highly relevant areas while neglecting information from low-correlation areas.
2. We propose a GEFE module to capture aggregated features from different directions and enhance the model's extraction of perceivable surface information.
3. Our study demonstrates that the proposed SpT-GAN effectively removes clouds from both uniform and nonuniform thin cloud RS images across various scenes without significantly increasing computational complexity. Experimental results on public datasets, including RICE1 and T-Cloud, show that the generated images exhibit precise details, high color fidelity, and close resemblance to the authentic ground images.
The rest of this paper is organized as follows: Section 2 provides a brief introduction to related work; Section 3 thoroughly explains the proposed approach; Section 4 presents the dataset specifics, analysis, experimental results, and relevant discussion; finally, Section 5 presents our conclusions.
4. Results and Analysis
The experimental settings are discussed in Section 4.1, including descriptions of the implementation details, datasets, and evaluation metrics. Section 4.2 compares our results with those of other methods. Section 4.3 discusses model complexity, while Section 4.4 presents the results and analysis of the ablation experiments. Finally, Section 4.5 validates the robustness of SpT-GAN by applying it to cloud-free input images.
4.1. Experimental Settings
4.1.1. Implementation Details
The proposed model was built using the PyTorch framework. The computational platform included an Intel(R) Xeon(R) Silver 4310 CPU and an Nvidia A100 GPU with 80 GB of memory. During training, the proposed model was optimized using the AdamW optimizer [73], which features adaptive learning rates. The initial hyperparameters included the AdamW momentum terms $\beta_1$ and $\beta_2$, a weight decay of 0.00001, a batch size of 1, 300 training epochs, eight attention heads, and an initial learning rate of 0.0004.
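For reference, the training configuration above maps onto PyTorch roughly as follows; because the $\beta_1$/$\beta_2$ values are not reproduced here, the snippet falls back to PyTorch's AdamW defaults purely as a placeholder, and the one-layer `model` merely stands in for the SpT-GAN generator.

```python
import torch

# Hypothetical stand-in for the SpT-GAN generator described in Section 3.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1))

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,             # initial learning rate of 0.0004
    betas=(0.9, 0.999),  # placeholder: the paper's beta_1/beta_2 are not given here
    weight_decay=1e-5,   # weight decay of 0.00001
)

EPOCHS = 300    # 300 training epochs
BATCH_SIZE = 1  # batch size of 1
```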
4.1.2. Description of Datasets
We evaluated the proposed model using the RICE1 [74] and T-Cloud [25] datasets. A comparison between the RICE1 and T-Cloud datasets is shown in Table 1, while samples from both datasets are shown in Figure 5.
The RICE1 dataset comprises 500 pairs of cloud-covered and corresponding cloud-free images extracted from Google Earth, each with a resolution of 512 × 512 pixels; the interval between image acquisitions is 15 days. Most of the cloud-covered images in this dataset exhibit relatively uniform cloud formations.
By contrast, the T-Cloud dataset consists of 2939 pairs of cloud-covered and corresponding cloud-free images captured by Landsat-8, each with a resolution of 256 × 256 pixels. The images in each pair are taken 16 days apart due to the satellite's revisit period. The cloud formations in the T-Cloud dataset are more complex, featuring uneven cloud distribution and varying thicknesses. Both datasets comprise natural-color images collected from diverse ground scenes, including urban areas, mountainous regions, and coastlines.
Experimental assessments were conducted to evaluate the proposed model's generalization capability across these two datasets with diverse cloud formations. For each of the RICE1 and T-Cloud datasets, the images were partitioned into a training subset and a held-out testing subset.
4.1.3. Evaluation Metrics
The experimental results were evaluated using three quantitative evaluation metrics: Peak Signal-to-Noise Ratio (PSNR) [75], Structural Similarity Index (SSIM) [76], and Learned Perceptual Image Patch Similarity (LPIPS) [77]. These metrics compare results against a reference image to highlight the relative performance of each method; we used the real cloud-free images from the dataset as references to ensure a fair comparison among the methods. PSNR measures the pixel-level difference between images based on error sensitivity, SSIM assesses the similarity between the reconstructed and reference images, and LPIPS aligns more closely with human perception. Higher PSNR and SSIM values indicate better quality of the generated images, while lower LPIPS values indicate that the generated image is perceptually closer to the real image. The mathematical definition of PSNR is as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\left( 2^n - 1 \right)^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( Y(i,j) - \hat{Y}(i,j) \right)^2,$$

where $n$ is the number of bits per pixel, $H$ and $W$ respectively stand for the height and width of the image, and $\mathrm{MSE}$ is the mean squared error between the ground truth $Y$ and the fake image $\hat{Y}$.
The mathematical definition of SSIM is as follows:

$$\mathrm{SSIM}(X, Y) = \left[ l(X,Y) \right]^{\alpha} \left[ c(X,Y) \right]^{\beta} \left[ s(X,Y) \right]^{\gamma},$$

$$l(X,Y) = \frac{2\mu_X \mu_Y + C_1}{\mu_X^2 + \mu_Y^2 + C_1}, \quad c(X,Y) = \frac{2\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2}, \quad s(X,Y) = \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3},$$

where $l(X,Y)$ represents the similarity in luminance, $c(X,Y)$ indicates the contrast similarity, $s(X,Y)$ represents the structural similarity, $\alpha$, $\beta$, and $\gamma$ are weighting parameters, $\mu_X$ and $\mu_Y$ are the respective mean brightness values of images $X$ and $Y$, $\sigma_X^2$ and $\sigma_Y^2$ are the respective brightness variances of images $X$ and $Y$, the covariance of the luminance between images $X$ and $Y$ is represented by $\sigma_{XY}$, and $C_1$, $C_2$, and $C_3$ are constants utilized to prevent division by zero.
The mathematical definition of LPIPS is as follows:

$$d(X, X_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^l_{hw} - \hat{y}^l_{0hw} \right) \right\|_2^2,$$

where $d(X, X_0)$ represents the distance between $X$ and $X_0$; $H_l$, $W_l$, and $w_l$ respectively indicate the height, width, and weight of the $l$-th layer; $\hat{y}^l_{hw}$ and $\hat{y}^l_{0hw}$ signify the features predicted by the model and baseline, respectively, at position $(h, w)$; and $\odot$ stands for element-wise multiplication.
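In practice, all three metrics can be computed with off-the-shelf implementations. The sketch below uses scikit-image and the lpips package; the AlexNet backbone for LPIPS is an assumption, as the specific feature network is not stated here.

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

loss_fn = lpips.LPIPS(net='alex')  # backbone choice is an assumption

def evaluate_pair(pred, gt):
    """pred, gt: uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = loss_fn(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp
```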
4.2. Comparison with Other Methods
This section employs two different types of methods for quantitative comparison, namely, hypothesis-driven approaches and DL methods. The representative DCP [15] method is selected as the hypothesis-driven approach, while the DL methods include McGAN [78], SpA-GAN [28], AMGAN-CR [79], CVAE [25], MSDA-CR [80], and MemoryNet [81].
(1) Quantitative results analysis: The quantitative comparison on the RICE1 and T-Cloud datasets shown in Table 2 indicates that SpT-GAN achieves better PSNR, SSIM, and LPIPS values than the other methods. DCP shows relatively low performance on both datasets, likely due to its more basic approach. Methods such as McGAN and SpA-GAN demonstrate strong performance on RICE1 but exhibit variability on T-Cloud, highlighting their sensitivity to dataset characteristics, whereas AMGAN-CR encounters similar challenges while performing better on T-Cloud. While CVAE and MSDA-CR achieve notable results on specific metrics, they do not match the overall effectiveness of SpT-GAN across both datasets. Compared to MemoryNet, the strongest recent method, SpT-GAN shows respective improvements of 1.83 dB and 0.98 dB in PSNR, 0.33% and 0.76% in SSIM, and 0.0032 and 0.0095 in LPIPS on the RICE1 and T-Cloud datasets. This analysis leads to the conclusion that our method holds a decisive advantage in restoring true surface information, thanks to the powerful long-range modeling ability of sparse attention and the excellent global detail perception of the GEFE module.
Figure 6 presents box plots of the MSE results for each DL-based method across the two datasets. The box plots illustrate performance differences: a thicker box positioned higher with many outliers typically indicates larger errors and poorer performance, while a lower, narrower box with a more concentrated distribution suggests better performance. The box plot results are highly consistent with the data presented in Table 2, and the generalization ability and performance differences of the various methods across datasets can be clearly observed. The MSE results of SpT-GAN on both datasets demonstrate its stable performance and good generalization capability.
(2) Qualitative result analysis of DL-based methods on the RICE1 dataset: Visual comparison of the results presented in Figure 7 shows that McGAN effectively removes uniformly distributed thin clouds but struggles with images containing unevenly distributed thin clouds, resulting in detail loss and artifacts. SpA-GAN produces generally satisfactory results, although some color distortion is present. The overall effectiveness of both SpA-GAN and McGAN reflects the ability of their GAN-based architectures to learn the data distribution; however, their relatively simple generator structures lead to the aforementioned adverse effects. AMGAN-CR produces images with elevated color saturation and contrast, but suffers from artifacts in some images that affect the overall visual quality. MemoryNet successfully removes the thin clouds in all scenes, albeit with an overall increase in brightness in the generated images. The brightness enhancement caused by AMGAN-CR and MemoryNet may be due to their inability to accurately model the true distribution of image brightness. The CVAE results show that while the thin clouds are effectively removed, the resulting images still exhibit some color distortion, and the model underperforms on images with unevenly distributed thin clouds, indicating inferior generalization to such variations. In addition, the images generated by MSDA-CR exhibit color differences in some scenes compared to the real cloud-free images.
Figure 8 illustrates zoomed-in details of the cloud-free images generated by each DL-based method. The details and edges produced by AMGAN-CR appear unnatural, with increased contrast. Both CVAE and MSDA-CR fail to eliminate thin clouds in the highlighted areas. The detailed representations generated by McGAN and SpA-GAN are insufficiently precise, making it challenging to interpret real surface information accurately. In addition, the images generated by MemoryNet are brighter than the ground truth. By contrast, SpT-GAN completely removes the thin clouds in the highlighted regions and provides visual fidelity closer to the real surface conditions.
(3) Qualitative result analysis of DL-based methods on the T-Cloud dataset: Figure 9 depicts the overall visual results of each technique on the T-Cloud dataset. This dataset contains varied forms of thin clouds, making it particularly challenging for assessing model performance. Visual inspection shows that McGAN, SpA-GAN, AMGAN-CR, and MSDA-CR can remove simple thin clouds but struggle with complex cloud formations, resulting in artifacts. The second cloud-free image generated by MemoryNet does not meet the expected level of cloud removal, resulting in an effect similar to haze cover. While CVAE demonstrates cloud removal generally close to expectations, small artifacts still appear in some images. Notably, SpT-GAN consistently produces the best visual results, even under wide-area uneven cloud cover.
Figure 10 presents a detailed comparison of each method's performance on the T-Cloud dataset. Only SpT-GAN achieves the optimal level of detail. Other methods, including SpA-GAN, AMGAN-CR, and CVAE, do not fully remove the cloud regions outlined in the boxes. These results may stem from the attention mechanisms employed by these methods, which can over-concentrate on cloud-covered areas and neglect global image features; isolating cloud-covered areas during processing may inadvertently amplify artifacts or inconsistencies in the unaffected regions. Such insufficient integration of local and global features can result in incomplete or inaccurate restoration of the underlying surface, degrading the overall output quality and ultimately causing these models to fail at cloud removal. MSDA-CR and MemoryNet successfully remove clouds but reduce the images' overall brightness, making details harder to interpret. Although McGAN removes the clouds, its recovery of cloud-covered surface information is not as effective as that of SpT-GAN.
Based on these findings, it can be concluded that SpT-GAN exhibits excellent performance in cloud removal and ground detail restoration, even for datasets such as T-Cloud that contain complex cloud shapes.
4.3. Model Complexity Evaluation
We performed comparative experimental evaluations of the complexity of the various DL models, aiming to explore the correlation between model complexity and cloud removal efficacy. We used the T-Cloud dataset to measure the inference speed in Frames Per Second (FPS) of each model under practical conditions, while all other metrics were computed on input tensors of a fixed shape. A comparison of the complexity metrics for each model is presented in Table 3.
A model's complexity is comprehensively reflected by its parameter count, computational complexity, and model size. Here, FPS denotes the speed at which the model processes data; the parameter count refers to the number of trainable parameters that require learning and optimization; the computational complexity (FLOPs) indicates the number of floating-point multiplications and additions required during model execution; and the model size represents the storage space occupied by the pretrained weights generated after training. Compared with the other methods, SpT-GAN effectively removes thin clouds with a relatively low parameter count and computational complexity. Specifically, SpT-GAN has 5.85 M parameters, significantly fewer than McGAN and CVAE, although more than the other models. The FLOPs metric highlights the tradeoff between computational efficiency and task performance, and the results indicate that SpT-GAN balances these factors effectively. Although SpA-GAN has the lowest FLOPs and parameter count, it is less effective at cloud removal. SpT-GAN outperforms methods such as AMGAN-CR and CVAE in terms of efficiency, while MemoryNet demands significantly more computation than the other models.
In summary, the proposed model removes thin clouds effectively with a relatively lower parameter count and computational complexity than the other models. However, its inference speed does not improve in proportion to its low computational complexity, primarily because it incorporates a significant number of depth-wise separable convolutions, which are characterized by low computational complexity but comparatively high inference latency [82]. Despite sacrificing some inference speed, the proposed model demonstrates relatively low computational complexity and achieves superior cloud removal performance based on a comprehensive analysis of the experimental data.
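For context, parameter counts and FPS figures of the kind reported in Table 3 can be measured along the following lines; this is a CPU-side sketch (GPU timing would additionally require torch.cuda.synchronize), and the 256 × 256 input shape is an assumption.

```python
import time
import torch

def complexity_report(model, input_shape=(1, 3, 256, 256), runs=100):
    """Rough parameter count and inference-speed (FPS) measurement."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    x = torch.randn(*input_shape)
    model.eval()
    with torch.no_grad():
        for _ in range(10):  # warm-up iterations
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        fps = runs / (time.perf_counter() - start)
    return n_params / 1e6, fps  # parameters in millions, frames per second
```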
4.4. Ablation Studies
We conducted ablation experiments to evaluate the contributions of the IRFT block and the proposed GEFE module, as well as the effects of the loss function, the number of attention heads, and the sparsity strategy on the model's cloud removal performance.
4.4.1. IRFT Block and GEFE Module Ablation Study
Table 4 displays the quantitative results for the IRFT block and the GEFE module. These data indicate that including the IRFT block and GEFE module significantly improves the model's performance. The difference in trainable parameter counts shows that the IRFT module adds substantially to the model's parameter count, whereas the GEFE module has a relatively minor impact on the number of parameters while still contributing positively to performance.
Figure 11 shows the visual differences in the ablation experiment results. The roles of the GEFE and IRFT modules in the cloud removal task are evident from an examination of the enlarged areas in the figure. Comparing Figure 11d–f, it is clear that introducing these two modules effectively reduces cloud artifacts in the generated images. The GEFE module focuses on surface information, enabling SpT-GAN to learn global features and mitigating the impact of clouds on image quality, while the IRFT module employs the FFT to treat cloud-covered areas as low-frequency components in the frequency domain and filter them out. Additionally, the GEFE module aids in aggregating global features, which helps to reduce color distortion in the generated images. The qualitative results presented in the figure are consistent with the quantitative results in Table 4, further confirming the significance of the GEFE and IRFT modules in the cloud removal task.
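To make the frequency-domain intuition concrete, the sketch below attenuates the low-frequency band of a feature map with torch.fft. It illustrates the principle only, not the actual IRFT block, and the cutoff radius and attenuation factor are arbitrary.

```python
import torch

def suppress_low_frequencies(x, cutoff=0.1, attenuation=0.5):
    """Attenuate low-frequency content of a feature map x of shape (B, C, H, W).

    Thin clouds behave like low-frequency components, so damping the
    spectrum near the origin reduces their contribution.
    """
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    # Radial mask: 'attenuation' inside the cutoff radius, 1.0 outside.
    fy = torch.linspace(-0.5, 0.5, H).view(H, 1)
    fx = torch.linspace(-0.5, 0.5, W).view(1, W)
    radius = torch.sqrt(fy**2 + fx**2)
    mask = torch.where(radius < cutoff,
                       torch.full_like(radius, attenuation),
                       torch.ones_like(radius))
    spec = spec * mask
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)))
    return out.real
```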
This section explores the attention weight allocation of the GEFE module, presenting experimental results through the attention heatmaps in Figure 12. From left to right, each set of images consists of a cloudy image, an attention heatmap generated by the GEFE module, and an attention heatmap generated by the transformer block. In the attention heatmaps, stronger red indicates a greater allocation of attention weight to that region, while weaker color indicates a lower allocation. Observing the heatmaps generated by GEFE, it is evident that in Figure 12a most areas are obscured by cloud cover, rendering the surface information invisible; thus, the corresponding heatmaps are predominantly shaded in blue. In contrast, in Figure 12b,d, areas not obstructed by clouds or covered only by sufficiently sparse cloud layers are predominantly shaded in red, indicating the module's capability to filter out cloud contamination and focus on extracting surface information from sparsely cloud-covered and cloud-free regions, in alignment with the module's original design intent. Substituting a transformer block for the GEFE module causes the model to focus primarily on cloud-covered areas. These results demonstrate that combining these two different attention mechanisms is essential for effectively removing clouds while enhancing the model's ability to retain surface information.
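The directional weighting observed in the GEFE heatmaps follows the coordinate attention pattern that the module builds on (see the conclusions). Below is a minimal sketch of that pattern, assuming a standard reduction ratio; the full GEFE module contains further components not shown here.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Aggregates features along height and width separately, then gates the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Directional pooling: one descriptor per row and per column.
        pool_h = x.mean(dim=3, keepdim=True)                       # (B, C, H, 1)
        pool_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([pool_h, pool_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Per-row and per-column gates, broadcast over the other axis.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w)).permute(0, 1, 3, 2)  # (B, C, 1, W)
        return x * a_h * a_w
```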
4.4.2. Loss Function Ablation Study
We performed additional ablation experiments using various combinations of loss functions to assess their impact on the declouding effect during optimization. The tested combinations included three control groups: one employing the classic GAN loss combination of reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ and adversarial loss $\mathcal{L}_{\mathrm{adv}}$, a second additionally incorporating the edge loss $\mathcal{L}_{\mathrm{edge}}$, and a third additionally incorporating the perceptual loss $\mathcal{L}_{\mathrm{per}}$. The experimental results are presented in Table 5.
As shown in Table 5, individually adding either the edge loss $\mathcal{L}_{\mathrm{edge}}$ or the perceptual loss $\mathcal{L}_{\mathrm{per}}$ contributes to the optimization of the model to a certain extent compared to the traditional GAN loss combination $\mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{adv}}$. Notably, the effect of the perceptual loss $\mathcal{L}_{\mathrm{per}}$ was more pronounced on the RICE1 dataset, which has higher image resolution, whereas the edge loss $\mathcal{L}_{\mathrm{edge}}$ was comparatively more beneficial than $\mathcal{L}_{\mathrm{per}}$ on the lower-resolution T-Cloud dataset. The final experimental results demonstrate that integrating all of these individual loss functions into a composite function yields the most favorable results, validating the effectiveness of the composite loss function constructed in this study.
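A minimal sketch of such a composite objective is given below. The loss symbols follow the ablation above, but the weighting coefficients, the Sobel-based edge term, and the VGG-16 perceptual features are illustrative assumptions rather than the exact formulation used by SpT-GAN.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG-16 features for the perceptual term (ImageNet normalization omitted for brevity).
vgg = torchvision.models.vgg16(weights='IMAGENET1K_V1').features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def sobel_edges(img):
    """Edge maps via fixed Sobel kernels applied per channel."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(img.size(1), 1, 1, 1)
    return F.conv2d(img, k.to(img), padding=1, groups=img.size(1))

def composite_loss(fake, real, d_fake_logits, w_adv=0.01, w_edge=0.1, w_per=0.1):
    l_rec = F.l1_loss(fake, real)                        # reconstruction loss
    l_adv = F.binary_cross_entropy_with_logits(          # adversarial (generator) loss
        d_fake_logits, torch.ones_like(d_fake_logits))
    l_edge = F.l1_loss(sobel_edges(fake), sobel_edges(real))  # edge loss
    l_per = F.l1_loss(vgg(fake), vgg(real))              # perceptual loss
    return l_rec + w_adv * l_adv + w_edge * l_edge + w_per * l_per
```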
4.4.3. Sparse Attention Ablation Study
The number of attention heads affects the model's performance. In the attention mechanism, multi-head attention allows the model to independently learn representations in different subspaces, thereby capturing more information and relationships. However, increasing the number of attention heads raises computational and memory requirements and may lead to overfitting or reduced computational efficiency; moreover, when using a multi-head attention strategy, some attention heads may be redundant [83]. We therefore designed an experiment to investigate the influence of the number of attention heads and identify the optimal strategy. In addition, this section presents experimental results on the impact of the sparsity strategy on model performance. The quantitative results for different numbers of heads and sparsity strategies are shown in Table 6.
Analysis of the data presented in Table 6 shows that a higher number of attention heads is not necessarily better; on the other hand, too few attention heads can leave the model unable to fully capture the complex relationships and diversity of the input data, potentially leading to information loss. Due to the difference in resolution between the RICE1 and T-Cloud datasets, variations in the number of attention heads affect performance differently on each. Based on validation across these two datasets, eight attention heads were determined to be optimal for the proposed model.
Regarding sparsity strategies, the experimental data in Table 6 demonstrate that combining sparsity with varying numbers of attention heads influences model performance to a certain extent, and that introducing sparsity reduces computational complexity. Our comprehensive analysis indicates that the sparse attention module effectively disregards redundant information to enhance model performance.
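For illustration, the sparsification idea can be realized as a top-k filter on the attention logits, as in the sketch below; the fixed keep ratio is a simplification of the weight-learnable filtering mechanism described in the contributions, not the exact SpT-GAN design.

```python
import torch
import torch.nn as nn

class SparseSelfAttention(nn.Module):
    """Multi-head self-attention that keeps only the top-k logits per query."""
    def __init__(self, dim, heads=8, keep_ratio=0.5):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.keep_ratio = heads, keep_ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, dim), N = H*W tokens
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, N, self.heads, -1).transpose(1, 2)
        q, k, v = map(split, (q, k, v))          # (B, heads, N, D/heads)
        logits = q @ k.transpose(-2, -1) / (D // self.heads) ** 0.5
        # Sparsity: mask out all but the top-k most relevant keys per query.
        topk = max(1, int(N * self.keep_ratio))
        thresh = logits.topk(topk, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < thresh, float('-inf'))
        out = torch.softmax(logits, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```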
4.5. Evaluation of Cloud-Free Image Processing
To further confirm the robustness of the proposed method, cloud-free images were used as inputs to the pretrained model. Notably, the model was trained on datasets containing cloud-covered images and then applied to cloud-free images. The outputs exhibited excellent agreement with the original images in terms of both evaluation metrics and visual quality, indicating the model's ability to process images while disregarding cloud-related features and underscoring its high robustness. Visual comparisons are illustrated in Figure 13, where the first row depicts the original cloud-free images and the second row displays the corresponding outputs. This test holds significant practical relevance, since satellite sensors do not always capture cloud-covered scenes; a cloud removal model must therefore leave cloud-free imagery intact.
5. Conclusions
This study proposes SpT-GAN, a novel method for removing thin clouds of different shapes and thicknesses while preserving high pixel-level similarity to real surface images. The proposed method employs a generator built upon an innovative sparse multi-head self-attention mechanism within the transformer block to adeptly model complex long-range dependencies. This advancement enhances the model's capability to interpret RS images with intricate surface environments, while the introduction of sparsity effectively filters out irrelevant information and enhances the quality of the cloud-free images. In addition, we design a novel GEFE module to enhance the model's ability to preserve surface details in cloud-free and sparsely clouded areas; this module integrates a coordinate attention mechanism to sharpen the model's focus on surface details and an inverted residual Fourier transform block to reduce redundant feature information. Compared to other DL-based approaches, the proposed model generates higher-quality cloud-free images and preserves surface details without significantly increasing computational complexity. Quantitative experimental results on two different datasets demonstrate the superiority of the proposed method. Furthermore, the proposed model does not alter cloud-free images when processing them, maintaining high consistency with the originals and indicating good robustness.
Future research could extend the proposed model's applicability by incorporating multi-band or Synthetic Aperture Radar (SAR) images. By combining techniques such as transfer learning, image fusion, and image transformation, expanded data sources such as SAR and multispectral images would enable the model to gain a deeper understanding of thick cloud characteristics and acquire sufficient predictive experience to effectively reconstruct images contaminated by thick clouds. Considering the inference speed challenges of SpT-GAN, future research could also explore more efficient attention mechanisms that enhance inference speed while maintaining cloud removal effectiveness, potentially enabling direct deployment on lightweight devices.