1. Introduction
Cloud occlusion caused by environmental factors limits the performance of optical sensors, including domestically developed optical sensors, so accurately extracting cloud regions is of great significance. Accurate extraction of cloud areas brings two benefits: first, low-quality images can be pre-screened under high-traffic conditions, reducing the amount of downlink data and improving image transmission efficiency; second, it aids land-cover identification and provides the basis for subsequent cloud removal and reconstruction. The cloud extraction task therefore holds significant importance in image interpretation and quality inspection.
Nonetheless, cloud detection still faces several difficulties, summarized as follows: (1) Cloud diversity and ambiguous boundaries: clouds vary greatly in type and shape, from thin to thick clouds, which poses an inherent challenge for cloud extraction and leads directly to substantial intra-class variability. Moreover, the imaging of thinly cloud-covered regions produces unclear boundaries, further complicating interpretation; the lack of distinct edges makes these regions difficult to discern and interpret accurately, as shown in Figure 1a. (2) Confusion of terrain and clouds in complex scenes: depending on season and climate, local terrain may be covered with snow, which produces nearly the same image response as clouds. Clouds and snow against a complex background are therefore difficult to distinguish, as shown in Figure 1b.
At present, cloud detection and segmentation approaches are mainly divided into two categories: threshold-based approaches and learning-based approaches [1,2,3,4]. The first category relies mainly on cloud spectral characteristics (portions of the electromagnetic spectrum), brightness, texture and geometry: by analyzing the spectral differences between clouds and other surfaces, thresholds or rules are formulated to realize cloud extraction [5,6,7,8]. Quan Xiong et al. [9] employed a dynamic-threshold hybrid multi-spectral feature (HMF) method for cloud extraction, which combines the normalized difference vegetation index (NDVI), whiteness and haze optimization transform (HOT) features to detect cloud pixels. Pure threshold algorithms are straightforward, efficient and applicable to cloud detection; however, they are impractical because of their sensitivity to both background variations and cloud coverage. To enhance the ability to recognize edge details, machine-learning methods have been proposed, such as Support Vector Machines (SVM) and Random Forests (RF) for hyper-spectral imagery [10]. Sui Y. et al. [11] used simple linear iterative clustering (SLIC) to divide optical satellite images into super-pixels and then computed energy and spectral features from texture features via a Gabor transformation; the characteristics of cloud super-pixels serve as training samples for an SVM classifier, which is trained to establish the classification model. In addition, Shao M. et al. [12] proposed a multi-dimensional and multi-granularity dense cascade forest (MDForest) for multi-spectral cloud detection. MDForest is a deep-forest architecture with a multi-dimensional, multi-granularity scanning mechanism that enhances the representation-learning ability of the cascade forest while exploiting the spectral information of multi-spectral satellite images for cloud extraction. However, its recognition ability is not ideal, especially against complex backgrounds. Moreover, the overall accuracy of the above cloud detection methods also depends on the number of image bands available for extracting clouds effectively. Recently, convolutional neural network (CNN) methods have also been used to detect clouds. For example, the M-shaped convolutional network RM-Net uses atrous spatial pyramid pooling (ASPP), which consists of atrous convolution and pyramid pooling [13]. When scaling features, the information loss caused by repeated down-sampling is effectively reduced: the network extracts multi-scale image features without losing information and combines residual units so that it is less prone to degradation. The encoding and decoding modules extract the global context of the image, judge the class probability of each pixel from the fused features, and feed it to the classifier for pixel-level cloud/non-cloud segmentation [14].
With the advancement of artificial intelligence, deep learning algorithms exhibit remarkable performance in image interpretation, particularly for optical satellite remote sensing imagery. Unlike natural images, satellite images from optical sensors have a larger scale, wider coverage and richer ground detail. From the deep-learning perspective, we categorize cloud extraction methods into two classes: CNN-based and CNN-Trans-based methods. (a) CNN-based: U-Net [15] is a classic image segmentation method with excellent performance in many binary classification tasks. Numerous studies have demonstrated that U-shaped architectures [16,17] perform outstandingly in the segmentation of optical satellite images [18]. As an illustration, CloudU-Net, a structure derived from U-Net, uses atrous convolution instead of the traditional convolution layer to enlarge the field of view and accelerates training through batch normalization, which also helps prevent over-fitting [19]. Although the algorithm performs well for dense small objects, it does not account for the diversity of clouds, whose characteristics are uncertain. (b) CNN-Trans-based: the self-attention mechanism is the core of the Transformer. The attention mechanism draws inspiration from human visual cognition: when reading text or observing objects, people naturally concentrate on detailed information related to a target while suppressing irrelevant details, a process that proceeds from coarse to fine. The integration of the attention mechanism into CNNs was introduced in 2017 [20]; since then, attention mechanisms in CNNs have found widespread application across many domains. A CNN-Trans attention module comprises two key components: channel attention and spatial attention. The former accentuates correlations among feature channels, focusing attention on informative features and suppressing useless channels; the latter retains high-frequency feature information through spatial operations. On one hand, Hu et al. [21] introduced a novel CNN unit, the squeeze-and-excitation block, which dynamically adjusts feature responses by modeling inter-channel relationships. In comparison, CBAM [22] not only incorporates spatial attention but also employs a concurrent structure with multiscale pyramid pooling within the channel attention, and experiments attest to its effectiveness. Spatial attention, on the other hand, directs focus to regions of interest in the spatial domain. While the attention mechanism has demonstrated considerable utility since its integration into CNNs, redundancy remains a common issue. Consequently, numerous researchers have employed Transformer models for cloud extraction from optical satellite imagery [23,24,25,26]. Zhang J. et al. [27] proposed a CNN cloud detection algorithm for GF-1 satellite images that cascades channel and spatial attention and introduces a probabilistic upsampling module to merge the downsampled channels throughout the network; a dark-channel transformation based on the dark channel prior and NSCT was later added to this network [28]. Even though the attention mechanism in Transformer models has performed excellently in various domains, directly transplanting the Transformer to cloud extraction does not yield satisfactory results, and CNN-Trans-based methods that simply connect a traditional CNN with a Transformer can increase model complexity, making it challenging to find the optimal solution through optimization. Therefore, we design a network that integrates the strengths of deep CNNs and attention, making it better suited to imagery obtained from satellite sensors.
In this article, we present a refined Multiscale Soft Attention Convolutional Neural Network (MSACN), a multiscale deep convolutional network with a soft attention mechanism that incorporates spatial information. Taking inspiration from U-Net [15], ResNet [29] and the attention mechanism [20], MSACN consists of two parts: a deep feature encoder module and a multi-head soft attention decoder module for cloud prediction. Compared with other networks, MSACN exploits both the shallow-level information and the high-level features of cloudy/non-cloudy pixels, which improves the extraction decision without any manually specified spectral processing, since the pre-trained network can extract rich and distinctive high-level representations of visual objects in the images. In summary, the contributions can be outlined as follows:
- (1) We expand the ZY-3 satellite remote sensing cloud extraction dataset. To enable testing in more complex scenarios, we augment the dataset with a subset of cloud-with-snow images. The raw data are pre-extracted using the model and then manually refined in Photoshop; each image underwent meticulous selection to ensure the accuracy of the training data.
- (2) We refine a multiscale deep convolutional neural network with soft attention and spatial information for cloud segmentation from satellite remote sensing images. Following the encoder–decoder architecture, our primary enhancements are a concurrent dilated residual convolution module in the encoding process and a multi-head soft attention fusion between the encoding and decoding processes.
- (3) To validate the method, we compare it with similar approaches, including traditional CNN-based methods, as well as dissimilar approaches such as Transformer-based models, on the same datasets and in the same training environment. In terms of overall accuracy, precision, recall, F1-score and IoU, MSACN outperforms the other networks, showcasing its effectiveness. We also transfer the model to other datasets to assess its adaptability.
2. Materials and Methods
In this paper, MSACN consists of two parts: a concurrent dilated residual convolution module and a multi-head soft attention module. To adaptively accommodate the diversity of cloud shapes and sizes, the concurrent dilated convolution module builds pyramid-like features via multi-scale dilation factors at the front of the architecture, while residual convolution units are used for the remainder of the encoding process. The multi-head soft attention module is integrated into the decoding process to restore the extraction results. By fusing spatial features at different resolution levels, a multi-head spatial fusion attention module is established on top of the deep semantic features, and the outputs of the individual attention heads are then combined to obtain cloud features that are more linearly separable. By concatenating the encoding features with the upsampling channels, the soft attention module alleviates the cloud boundary problems caused by the coarseness of the architecture.
2.1. The Overall Structure
As shown in Figure 2, MSACN is a U-shaped structure whose encoder and decoder components are the multi-scale Concurrent Dilated Residual Convolution Module (CDRCM) and the Multi-head Soft Attention Module (MSAM), respectively. MSACN as a whole takes inspiration from U-Net; apart from the core modules (CDRCM and MSAM), the remaining backbone follows ResNet50, and its expansion and contraction paths have a corresponding relationship. The feature extraction part therefore uses residual modules to deepen the model without loss of resolution and to learn more complex features, reinforcing the representation ability of the model. Before the feature extractor feeds the cloud semantic prediction module through skip connections, the feature shape is adapted by convolution, and the CDRCM replaces the first convolution to enlarge the field of view while preserving the spatial resolution. The input layer contains the richest and most primitive features, which should not be lost through processing; the dilated pyramid-like unit is therefore built on the original input image, with the dilation factors controlling the scale differences in order to improve invariance to cloud shape and scale. A series of multi-head attention modules is added to the cloud semantic prediction module. The multi-head attention in the decoding process belongs to the soft attention family: by combining multiple soft attention heads, the deep semantic features of clouds are strengthened and made more linearly separable. The multi-head soft attention module consists of the white blocks shown in the right half of the structure in Figure 2. Each multi-head attention module operates on concatenated features: it receives two inputs, one from the encoder and one from the decoder, and its output is spliced and convolved with the upsampling results to maintain the depth of the network. The attention mechanism assumes that features at different levels of the network have different importance; by assigning greater weight to important features, it tells subsequent layers to focus on the information of interest and to suppress useless information. This trains the network to capture the spatial locations and boundary details of clouds more accurately, thereby enhancing the model's accuracy. Through repeated upsampling and attention operations, the network gradually restores spatial detail and performs fine boundary segmentation, improving the accuracy and clarity of cloud boundaries.
2.2. Concurrent Dilated Residual Convolution Module
The concurrent dilated residual convolution module is a core component of the encoding process in MSACN and plays a pivotal role in the enhancement brought by our method. During the encoding phase, it incorporates concurrent dilated convolutions with varying dilation factors, forming a pyramid-like structure. This design captures features at multiple scales and enlarges the model's receptive field. The concurrent dilated residual convolution module is typically integrated with the residual basic block: the concatenated output of the dilated convolutional channels is further processed through residual modules to facilitate effective feature encoding. The choice of dilation factors takes into account the relatively high spatial resolution of the raw images, which makes small dilation sizes unsuitable for establishing sparse convolutional kernels. The expansion of the basic block is set to 1, so the shape of the feature channel is the same as the input; the bottleneck has an extra convolution layer on the right side of the basic block, and its expansion is set to 4, which means that the sizes of the input and output maps differ. As illustrated in Figure 2, after the skip connection is matched to the shape of the cloud semantic prediction module, a full convolution layer is introduced in the middle part of ResNet50 to adjust the channel shape. To better receive valid cloud information, we enlarge the field of view while maintaining the resolution of the input channel and replace the first convolution with a dilated convolution of dilation = 4, as illustrated in Figure 3a, where kernel1 is a dilated convolution with dilation = 2 and kernel2 is an ordinary convolution. Through a series of residual blocks and convolutional layers, the number of channels gradually increases and the spatial size gradually decreases, finally yielding a preliminary effective feature layer with a shape of [2048, 8, 8].
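As a concrete illustration, the following minimal PyTorch sketch shows one way the concurrent dilated pyramid could be assembled. The per-branch widths, the 1 × 1 fusion convolution and the choice of three parallel branches (dilation 4, dilation 2, and an ordinary convolution) are assumptions made for this sketch, not values taken from the released implementation.

```python
import torch
import torch.nn as nn


class ConcurrentDilatedBlock(nn.Module):
    """Sketch of a concurrent dilated convolution pyramid (assumed layout)."""

    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        branch_ch = out_ch // 2  # hypothetical per-branch width

        def branch(dilation: int) -> nn.Sequential:
            # padding = dilation keeps the input resolution for a 3x3 kernel
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3,
                          padding=dilation, dilation=dilation, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )

        self.branch_d4 = branch(4)   # large field of view, sparse kernel
        self.branch_d2 = branch(2)
        self.branch_d1 = branch(1)   # ordinary convolution
        # 1x1 fusion restores the channel count expected by the encoder
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * branch_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pyramid = torch.cat(
            [self.branch_d4(x), self.branch_d2(x), self.branch_d1(x)], dim=1)
        return self.fuse(pyramid)


if __name__ == "__main__":
    # A 256x256 RGB tile keeps its spatial size through the block.
    y = ConcurrentDilatedBlock()(torch.randn(1, 3, 256, 256))
    print(y.shape)  # torch.Size([1, 64, 256, 256])
```

Used in place of the first convolution of a ResNet-style encoder, such a block enlarges the receptive field without reducing the spatial resolution of the input features.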
In the remainder of the encoding process, besides a series of convolution, BN, ReLU and MaxPooling layers, the residual encoder is built from multiple bottleneck and basic block modules, shown in Figure 3b,c. These two modules address the vanishing gradient problem inherent in DCNNs by introducing cross-layer shortcut connections, making the network easier to train and optimize.
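For reference, a minimal bottleneck block with expansion 4 and an identity or projection shortcut, following the standard ResNet layout that the encoder reuses, looks roughly as follows; channel widths are illustrative only.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Minimal ResNet-style bottleneck (expansion = 4) with a shortcut."""
    expansion = 4

    def __init__(self, in_ch: int, mid_ch: int, stride: int = 1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes, identity otherwise.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The cross-layer addition is what eases gradient flow in deep networks.
        return self.relu(self.body(x) + self.shortcut(x))
```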
2.3. Multi-Head Soft Attention Module
During the decoding process, we employ the multi-head soft attention module illustrated in Figure 4. Each attention head within this module functions as a soft attention. Inspired by [30,31], it comprises a series of convolutional operations culminating in a final attention score. This design allows the model to dynamically allocate attention across features, adapting to diverse inputs. Unlike conventional attention mechanisms, the multi-head soft attention module processes input representations with spatial dimensions, such as images, feature maps or other numerical data. The input from the skip connection and the input generated by the previous decoder layer are each fed into a 1 × 1 convolution so that they have the same number of channels; because the latter comes from the next (deeper) layer, its spatial size is half that of the skip input, so the skip input is downsampled to match. The two are then summed, passed through ReLU, and fed through another 1 × 1 convolution and a sigmoid activation. This assigns an importance score between 0 and 1 to every position of the feature map. The resulting attention map is multiplied by the skip-connection input, producing the output of the attention block. The outputs of 4 or 8 identical soft attention heads are then concatenated and average-pooled across the heads to produce the final output.
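A minimal sketch of one possible implementation of this module is given below. The intermediate channel width, the stride-2 downsampling of the skip input, and the use of bilinear interpolation to restore the attention map to the skip resolution are assumptions made for the sketch rather than details taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttentionGate(nn.Module):
    """One attention head: an additive (soft) attention gate (assumed layout)."""

    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        # 1x1 convs map both inputs to the same channel count; stride 2 on the
        # skip path matches the half-size gating signal from the deeper layer.
        self.theta_x = nn.Conv2d(skip_ch, inter_ch, 1, stride=2, bias=False)
        self.phi_g = nn.Conv2d(gate_ch, inter_ch, 1, bias=True)
        self.psi = nn.Conv2d(inter_ch, 1, 1, bias=True)

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        a = F.relu(self.theta_x(skip) + self.phi_g(gate))
        score = torch.sigmoid(self.psi(a))                 # scores in (0, 1)
        score = F.interpolate(score, size=skip.shape[2:],  # back to skip size
                              mode="bilinear", align_corners=False)
        return skip * score                                # re-weighted skip


class MultiHeadSoftAttention(nn.Module):
    """Runs several independent gates and averages across heads, which is
    equivalent to concatenating the head outputs and average-pooling over them."""

    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int, heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [SoftAttentionGate(skip_ch, gate_ch, inter_ch) for _ in range(heads)])

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([h(skip, gate) for h in self.heads], dim=0)
        return stacked.mean(dim=0)  # average pooling across the heads
```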
The semantic concatenation layer connects the outputs generated by the multi-head soft attention modules, and its structure mirrors the encoder. Using the five initial effective feature channels, the output from the preceding layer is concatenated and then fused; feature fusion is performed by upsampling and stacking the feature layers.
In Formula (1), the two inputs are the feature at the corresponding position in the encoding process and the feature produced by the convolution of the upper (deeper) layer. We upsample the output of the fifth convolution–pooling block to obtain a feature layer of shape [16, 16, 512] and adjust its channels with a 1 × 1 convolution; this upsampled feature and the skip-connected output of the corresponding encoder block serve as the two inputs of the multi-head soft attention module. Its output is then concatenated with the upsampled feature to obtain a feature layer of [32, 32, 256], and so on. Finally, bilinear interpolation restores the feature layer to the input image size, a 1 × 1 convolution adjusts the channels, and the channel dimension of the final feature layer is set to num_classes (cloud and non-cloud pixels).
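One decoding step of the kind described above can be sketched as follows, reusing the MultiHeadSoftAttention class from the previous sketch. The channel sizes, the bilinear upsampling and the 3 × 3 fusion convolution are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumes MultiHeadSoftAttention from the previous sketch is in scope.


class DecoderStage(nn.Module):
    """One decoder stage: attention-gate the encoder skip feature with the
    deeper decoder feature, upsample the deeper feature, concatenate, fuse."""

    def __init__(self, skip_ch: int, deep_ch: int, out_ch: int, heads: int = 4):
        super().__init__()
        self.attn = MultiHeadSoftAttention(skip_ch, deep_ch, skip_ch // 2, heads)
        self.fuse = nn.Sequential(
            nn.Conv2d(skip_ch + deep_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, skip: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        gated_skip = self.attn(skip, deep)           # skip re-weighted by attention
        deep_up = F.interpolate(deep, size=skip.shape[2:],
                                mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([gated_skip, deep_up], dim=1))
```

After the last stage, bilinear interpolation back to the input size followed by a 1 × 1 convolution to num_classes would produce the pixel-wise cloud/non-cloud map, matching the description above.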
2.4. MSACN Deformation
In this part, the depth of the network is characterized by both relative depth and absolute depth. The absolute depth refers to the total number of layers in the network, while the relative depth primarily denotes the number of pooling layers. Because the characteristic resolution of satellite imagery is closely tied to pooling, especially for medium-resolution remote sensing images, an excessively deep network with many pooling layers causes an irreversible loss of resolution, leading to sub-optimal cloud extraction performance, and is also harder to train. Considering the strong correlation between the depth of the network structure and the resolution of the input images, we introduce a local deformation of the refined method: MSACN-small. We divide the entire network into four segments, using pooling layers and upsampling layers as the boundaries of the encoding and decoding parts. For medium-resolution remote sensing images, where pruning the network to reduce its scale is necessary, we conduct experiments and analyses with a downsized model, denoted MSACN-small, featuring three pooling layers. This compact variant participates in the comparative analysis of our proposed method.
4. Discussion
In this section, to assess the adaptability of MSACN, we further discuss the training and validation results on imagery from two other sensors, Landsat 8 OLI and GF-1. From these results, we also discuss the relationship between satellite image resolution and CNN depth.
For the test images collected from Landsat 8 OLI, we present the comparative experimental results of the different methods in Table 5. Overall, the results suggest that MSACN-small stands out in balancing accuracy, IoU and the precision–recall trade-off. A plausible contributing factor is the 30 m resolution of Landsat 8 OLI imagery: lower resolutions often require less complex networks to achieve satisfactory results. Additionally, as shown in Table 5, the various cloud extraction methods exhibit only minimal differences on lower-resolution images. MSACN-small achieved the highest accuracy (97.18%), closely followed by TransUNet (97.16%). MSACN-small also exhibits the highest IoU (94.50%), indicating superior spatial overlap between the predicted and ground-truth cloud masks; TransUNet and DeeplabV3+ show competitive IoU values, reflecting their strong segmentation performance. Meanwhile, MSACN-small consistently performs well, balancing precision and recall effectively.
In Table 6, we present the comparative experimental results of the various cloud extraction methods on the GF1-WHU dataset. Across all evaluated metrics, the methods exhibit consistently high performance. MSACN-small stands out with the highest scores, indicating robust performance in terms of accuracy, IoU and the precision–recall balance. On the GF-1 data, both TransUNet and SwinUNet exhibit fluctuating results with comparatively lower IoU scores of 94.84% and 93.84%, respectively. In contrast, conventional CNN methods such as U-Net and DeeplabV3+ yield more stable and robust outcomes in this context. The MSACN-series methods consistently showcase outstanding performance across different datasets; this consistency not only affirms the effectiveness of the method but also indicates a certain robustness of its design.
To assess the performance of MSACN at different learning rates and the relationship between the learning rate and image resolution, we compare experiments with different learning rates. Across many experiments on the 38-Cloud dataset, comparing the model metrics in the table shows that a large learning rate yields higher accuracy; in particular, the learning rate has a strong effect on the Transformer-based models. Among the CNN-based models, MSACN-small, which removes the lowest-level convolution–pooling block, allows us to explore the relationship between network structure and image resolution. On the GF1-WHU dataset, regardless of the learning rate, the accuracy of MSACN-small is 0.29% higher than that of U-Net, indicating that this dataset works better with the shallower network. The 38-Cloud dataset is more sensitive to the hyper-parameters and learning rate, and the shallow MSACN-small performs well only at large learning rates. This stems from the different spatial resolutions of the two datasets: lower-resolution satellite images contain fewer details, so a shallow network captures their texture features more easily, and the lower-resolution 38-Cloud data are more susceptible to the learning rate, showing better results at a larger rate. Both Table 5 and Table 6 show that the optimal models in all cases are CNN-based; for low-resolution remote sensing datasets with fewer details, the Transformer may not be able to make full use of its self-attention mechanism and sequence-modeling capability, whereas CNNs excel at extracting image features and capturing local and global context. The following conclusions can be drawn from the experimental data:
- For high-resolution complex scenes, model accuracy is related to the attention module;
- Regarding the learning rate: (a) when the data are low-resolution images, a large learning rate works better; (b) the learning rate has a greater impact on the fusion attention module, for which a small learning rate is better.