Article

MFAFNet: Multi-Scale Feature Adaptive Fusion Network Based on DeepLab V3+ for Cloud and Cloud Shadow Segmentation

1 School of Future Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(7), 1229; https://doi.org/10.3390/rs17071229
Submission received: 23 February 2025 / Revised: 26 March 2025 / Accepted: 28 March 2025 / Published: 30 March 2025

Abstract
The accurate segmentation of clouds and cloud shadows is crucial in meteorological monitoring, climate change research, and environmental management. However, existing segmentation models often suffer from the loss of fine details, blurred boundaries, and false positives or negatives. To address these challenges, this paper proposes an improved model based on DeepLab v3+. First, to enhance the model’s ability to extract fine-grained features, a Hybrid Strip Pooling Module (HSPM) is introduced in the encoding stage, effectively preserving local details and reducing information loss. Second, a Global Context Attention Module (GCAM) is incorporated into the Atrous Spatial Pyramid Pooling (ASPP) module to establish pixel-wise long-range dependencies, thereby effectively integrating global semantic information. In the decoding stage, a Three-Branch Adaptive Feature Fusion Module (TB-AFFM) is designed to merge multi-scale features from the backbone network and ASPP. Finally, an innovative loss function is employed in the experiments, significantly improving the accuracy of cloud and cloud shadow segmentation. Experimental results demonstrate that the proposed model outperforms existing methods in cloud and cloud shadow segmentation tasks, achieving more precise segmentation performance.

1. Introduction

In the field of remote sensing image processing, cloud and cloud shadow detection is a critical research area that plays a key role in enhancing the quality of remote sensing images. In satellite imagery, cloud and cloud shadow often obscure ground targets, leading to reduced illumination uniformity and color accuracy. This obstruction prevents the image from faithfully reflecting surface information, impacting the accuracy and reliability of downstream tasks such as land cover classification and object detection. Therefore, in applications that require precise surface information, such as land and water resource monitoring, the accurate segmentation of cloud and cloud shadow is crucial. Segmentation refers to dividing an image into multiple regions with distinct characteristics. In this study, it specifically involves categorizing remote sensing images into cloud, cloud shadow, and surface regions, thus improving image quality and minimizing analytical errors. Furthermore, the segmented cloud data can be utilized to study cloud distribution, movement patterns, and their impact on climate forecasting and disaster assessment. Thus, enhancing the accuracy of cloud and cloud shadow detection is of paramount importance.
Traditional cloud and cloud shadow detection methods can be categorized into three main types: threshold-based methods, spatial feature-based methods, and machine learning-based methods. Threshold-based methods utilize the spectral brightness characteristics of clouds and cloud shadows to achieve segmentation through fixed or adaptive thresholds. For example, Rossow et al. [1] proposed the ISCCP algorithm, which detects cloud regions using fixed thresholds; however, this method suffers from reduced accuracy in complex scenes and is susceptible to misclassification. To address the limitations of threshold-based methods, Sun et al. [2] proposed the UDTCDA algorithm, which dynamically adjusts thresholds to enhance its adaptability to various cloud types and complex surface backgrounds. Spatial feature-based methods enhance detection performance by analyzing the texture or geometric properties of clouds and cloud shadows. For instance, Vásquez and Manian [3] proposed a cloud detection method based on texture features in remote sensing images, achieving better results than pixel-level threshold classification. Zhu et al. [4] developed the Fmask algorithm, which integrates spectral, shadow, and thermal features to generate highly accurate cloud and cloud shadow masks. Additionally, Danda et al. [5] introduced a morphology-based detection method that improves segmentation by analyzing the shape and spatial distribution of clouds. However, these methods often struggle to distinguish thin clouds and irregular cloud shadows, as their texture and geometric properties closely resemble those of background objects. With the rapid advancement of machine learning, models such as random forests and support vector machines have also been applied to cloud and cloud shadow detection. For example, Le Hégarat-Mascle and André [6] applied the Markov Random Field (MRF) model to the automatic detection of clouds and cloud shadows in high-resolution optical imagery. Cheng and Lin [7] combined multi-scale neighborhood features with multiple classifiers to enhance segmentation accuracy. Wei et al. [8] proposed the RFmask algorithm, which integrates energy-driven superpixel segmentation features to optimize cloud detection in Landsat imagery. Although machine learning methods significantly improve detection accuracy, they rely heavily on feature engineering. The handcrafted features may fail to capture the diverse characteristics of cloud and cloud shadow in complex scenes, limiting both detection accuracy and generalization capability.
Deep learning-based cloud and cloud shadow segmentation methods have gained increasing attention in the field of remote sensing image processing due to their end-to-end network structures, which enable high-precision segmentation in complex scenes. Since 2015, numerous classical network architectures have been proposed and applied to cloud and cloud shadow segmentation tasks. Long et al. [9] first introduced the Fully Convolutional Network (FCN), which is an end-to-end pixel-wise classification network that established the foundation for semantic segmentation. In the same year, Ronneberger et al. [10] developed the U-Net network for biomedical image segmentation, effectively recovering information lost during downsampling through skip connections, thereby significantly improving segmentation accuracy. The application of these network structures in cloud and cloud shadow segmentation has been progressively validated. In 2018, Miller et al. [11] applied Convolutional Neural Networks (CNNs) to meteorological image recognition, demonstrating superior segmentation accuracy compared to traditional random forest classifiers. In the same year, Mohajerani et al. [12] employed FCN for the pixel-level annotation of cloud regions in Landsat 8 imagery, achieving substantial improvements in the Jaccard index and recall rate. In 2019, Gonzales and Sakla [13] utilized a pre-trained deep U-Net for the semantic segmentation of clouds in satellite images, outperforming state-of-the-art networks on benchmark datasets based on several segmentation metrics. To further enhance segmentation accuracy and network efficiency, encoder–decoder architectures have become a key research focus. In 2017, Badrinarayanan et al. [14] proposed SegNet, which minimizes information loss during upsampling by employing pooling indices in the decoding stage. Subsequently, Lu et al. [15] developed a cloud segmentation method based on SegNet, investigating the impact of symmetric network structures on segmentation accuracy. Also in 2017, Zhao et al. [16] introduced the Pyramid Scene Parsing Network (PSPNet), which aggregates global contextual information to improve segmentation precision. Later, Chen et al. [17] proposed the DeepLab v3+ model in 2018, incorporating depthwise separable convolutions in the spatial pyramid pooling and decoder modules to enhance segmentation efficiency and accuracy. Additionally, in 2019, Sun et al. [18] introduced the High-Resolution Network (HRNet), which maintained continuous bidirectional information flow between high- and low-resolution features, achieving more precise cloud boundary detection. Recent studies have increasingly focused on the integration of attention mechanisms and multi-scale feature fusion to enhance segmentation performance. In 2021, Qu et al. [19] proposed the Strip Pooling Channel Spatial Attention Network (SP_CSANet), which enhances cloud and cloud shadow edge segmentation accuracy through strip pooling while integrating shallow and deep contextual information to improve model performance. However, its reliance on strip pooling may limit its ability to capture fine-grained boundary details, particularly in areas with thin clouds or light cloud shadows. In 2022, Lu et al. [20] introduced a Dual-Branch Network (DBNet) combining Transformer and convolutional networks, leveraging an interactive guidance module to extract deep features and refining coarse segmentation boundary representations in the decoding stage.
Compared with SP_CSANet, DBNet provides more refined segmentation details; however, due to its complex structure, it incurs higher computational costs. In 2023, Dai et al. [21] designed the Location Pooling Multi-Scale Network (LPMSNet), which enhances target region attention by aggregating multi-scale information and incorporating attention mechanisms. Compared with the previous two models, LPMSNet achieves a balance between segmentation accuracy and computational efficiency. Nevertheless, in scenarios with severe interference, LPMSNet may still encounter limitations. The most recent research by Hu et al. [22] in 2024 introduced the Multibranch Hybrid Segmentation Network (HyCloudX), which integrates spectral band information in a multibranch architecture, demonstrating significant improvements in segmentation performance and generalization capability under complex noise interference.
Although existing deep learning models have made significant progress in cloud and cloud shadow segmentation tasks, several challenges still remain. First, multi-scale feature fusion is insufficient and inefficient, leading to the loss of fine-grained details and the weakening of semantic information. Second, the models lack sufficient attention to fine-grained features and boundary information, resulting in blurred segmentation in transition areas. Finally, current methods exhibit limited global context awareness and struggle to capture long-range dependencies within scenes, making them less adaptable to diverse scenarios. In recent years, attention mechanisms have emerged as an important approach to enhancing a model’s adaptability to complex scenarios due to their advantage in modeling global context information [23,24,25]. Simultaneously, multi-scale feature fusion techniques have proven effective in improving segmentation accuracy for diverse cloud types and complex boundaries while preserving local details, further enhancing the performance of deep learning models in remote sensing image semantic segmentation [26,27,28,29]. Therefore, to address the aforementioned issues, this paper proposes a Multi-scale Feature Adaptive Fusion Network (MFAFNet) based on the DeepLab v3+ model. DeepLab v3+ is selected as the foundational model due to its strong capability to capture multi-scale contextual information, which aligns well with the significant scale variations in cloud and cloud shadow segmentation. Furthermore, as a widely recognized model in the field of semantic segmentation, DeepLab v3+ offers a clear and modular architecture that facilitates modifications and enhancements. This provides a solid foundation for evaluating the effectiveness of our proposed improvement modules fairly and reliably. In the encoder, we replace the global pooling in the Atrous Spatial Pyramid Pooling (ASPP) with a Hybrid Strip Pooling Module (HSPM), avoiding the spatial information loss that may occur in traditional global pooling. Additionally, a Global Context Attention Module (GCAM) is added to each branch of the ASPP to optimize feature extraction. In the decoding stage, we use a Three-Branch Adaptive Feature Fusion Module (TB-AFFM) to adaptively fuse mid-level and high-level features from the backbone network with deep high-level features extracted by the ASPP, effectively improving feature fusion. Furthermore, we propose a hybrid loss function that combines focal loss and Dice loss, balancing pixel-level and region-level segmentation accuracy. The main contributions of this work are as follows:
  • A Multi-scale Feature Adaptive Fusion Network (MFAFNet) based on DeepLab v3+ is proposed to enhance feature extraction capabilities, optimize multi-scale feature fusion, and improve the accuracy of semantic segmentation for clouds and cloud shadows.
  • To avoid the potential spatial information loss caused by global pooling, we replace the global pooling in the ASPP structure with a Hybrid Strip Pooling Module (HSPM), which can extract global distribution information while focusing on features of locally salient regions, thus enhancing adaptability to different cloud types and complex cloud shadow boundaries.
  • Considering the spatial correlation between clouds and cloud shadows, we introduce a Global Context Attention Module (GCAM) into each branch of the ASPP, enabling the branches to better handle high-level semantic information and optimize feature extraction.
  • A Three-Branch Adaptive Feature Fusion Module (TB-AFFM) is employed to fuse mid-level and high-level features extracted from the backbone network with deep high-level features from the ASPP. This module adaptively adjusts weights in both the channel and spatial dimensions, improving the semantic understanding of complex cloud shadow scenes while preserving key detail information.
  • A weighted hybrid loss function combining focal loss and Dice loss is adopted to enhance the overall segmentation accuracy and improve boundary segmentation performance.

2. Methodology

2.1. Network Structure

Based on these observations, this paper proposes an improved DeepLab v3+ network model, as illustrated in Figure 1. In the encoding phase, the global pooling module in the ASPP structure is replaced with a Hybrid Strip Pooling Module (HSPM) to mitigate information loss during pooling. Additionally, a Global Context Attention Module (GCAM) is incorporated into each branch to enhance feature extraction by integrating global contextual information. In the decoding phase, the original DeepLab v3+ model utilizes only low-level features from Block 2 of the backbone network. In contrast, the proposed improved model introduces a Three-Branch Adaptive Feature Fusion Module (TB-AFFM) in the decoding stage, adaptively fusing mid-level and high-level features from Block 3 and Block 4 of the backbone network with deep high-level features extracted by ASPP. This fusion effectively balances global semantics and local detail representation, significantly improving cloud and cloud shadow segmentation performance.
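To make the overall data flow in Figure 1 concrete, the following PyTorch-style sketch wires the components together. It is a structural illustration only: the backbone interface, the channel alignment of the fused features (projection layers are omitted), and the HSPM/GCAM-augmented ASPP and TB-AFFM modules described in the following subsections are assumptions supplied as constructor arguments, not the authors' exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MFAFNetSketch(nn.Module):
    """Structural sketch of the improved DeepLab v3+ pipeline shown in Figure 1."""
    def __init__(self, backbone, aspp_hspm_gcam, tb_affm, fused_channels=256, num_classes=3):
        super().__init__()
        self.backbone = backbone      # assumed to return (Block 2, Block 3, Block 4) features
        self.aspp = aspp_hspm_gcam    # ASPP with HSPM in place of global pooling and a GCAM per branch
        self.tb_affm = tb_affm        # three-branch adaptive feature fusion (Section 2.3)
        self.classifier = nn.Conv2d(fused_channels, num_classes, kernel_size=1)

    def forward(self, x):
        _low, mid, high = self.backbone(x)     # low-level features are not used in this sketch
        deep = self.aspp(high)                 # deep high-level features
        size = deep.shape[-2:]
        mid = F.interpolate(mid, size=size, mode="bilinear", align_corners=False)
        high = F.interpolate(high, size=size, mode="bilinear", align_corners=False)
        fused = self.tb_affm(mid, high, deep)  # adaptive multi-scale fusion
        logits = self.classifier(fused)
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
```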

2.2. Hybrid Strip Pooling Module (HSPM)

The global pooling layer generates global contextual information through simple average pooling; however, it compresses all spatial information of the entire feature map into a single scalar, often leading to the loss of local details and spatial information. In cloud and cloud shadow segmentation tasks, local information such as boundaries and fine details is crucial for precise segmentation. Therefore, this paper proposes a Hybrid Strip Pooling Module (HSPM), which builds upon traditional strip pooling [30] by integrating both horizontal and vertical average and max pooling operations. The structure of HSPM is illustrated in Figure 2.
Given an input feature map X with spatial dimensions $H \times W$, average and max pooling are first applied along each row (i.e., pooling over the width), producing outputs of size $H \times 1$; similarly, pooling along each column (over the height) yields outputs of size $1 \times W$. The specific formulations are as follows:
$y_{avg}^{h}(i) = \frac{1}{W} \sum_{0 \le j < W} x_{i,j}$ (1)
$y_{avg}^{v}(j) = \frac{1}{H} \sum_{0 \le i < H} x_{i,j}$ (2)
$y_{max}^{h}(i) = \max_{0 \le j < W} \left( x_{i,j} \right)$ (3)
$y_{max}^{v}(j) = \max_{0 \le i < H} \left( x_{i,j} \right)$ (4)
The results of each pooling operation are then fused across channels through convolution and expanded to match the original feature dimensions H × W . These outputs are combined through additive fusion to generate a feature map enriched with contextual information:
$y = y_{avg}^{h} + y_{avg}^{v} + y_{max}^{h} + y_{max}^{v}$ (5)
The fused feature map is further processed using a 1 × 1 convolution for dimensionality reduction and subsequently activated by the Sigmoid function to produce an attention-weight map. This weight map is applied to the input features via element-wise multiplication, yielding the final output Z as follows:
$z = x \otimes \sigma\left(\mathrm{Conv}_{1 \times 1}(y)\right)$ (6)
where $\sigma(\cdot)$ denotes the Sigmoid function and $\otimes$ denotes element-wise multiplication.
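The following PyTorch sketch illustrates one way to realize the HSPM of Equations (1)–(6). The strip-convolution kernel sizes, the single 1 × 1 output convolution, and the summation of the average- and max-pooled strips before the fusing convolution are simplifying assumptions, not settings taken from the paper.

```python
import torch.nn as nn

class HSPM(nn.Module):
    """Hybrid Strip Pooling Module sketch (Eqs. (1)-(6))."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0), bias=False)
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1), bias=False)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        n, c, h, w = x.shape
        # pooling over the width gives one value per row: N x C x H x 1
        y_avg_h = x.mean(dim=3, keepdim=True)
        y_max_h = x.amax(dim=3, keepdim=True)
        # pooling over the height gives one value per column: N x C x 1 x W
        y_avg_v = x.mean(dim=2, keepdim=True)
        y_max_v = x.amax(dim=2, keepdim=True)
        # fuse each strip across channels and expand back to H x W
        y_h = self.conv_h(y_avg_h + y_max_h).expand(-1, -1, h, w)
        y_v = self.conv_w(y_avg_v + y_max_v).expand(-1, -1, h, w)
        y = y_h + y_v                              # additive fusion, Eq. (5)
        attn = self.sigmoid(self.conv_out(y))      # attention-weight map, Eq. (6)
        return x * attn                            # re-weight the input features
```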

2.3. Three-Branch Adaptive Feature Fusion Module (TB-AFFM)

The scale variation in clouds and cloud shadows is often significant, and fusing features from different scales is crucial for improving segmentation accuracy. The original DeepLab v3+ model only fuses low-level features (Block 2) with deep high-level features from the ASPP module, neglecting the important information carried by intermediate layer features. To utilize more comprehensive multi-scale feature information, this paper introduces mid-level and high-level features from the backbone network (Block 3 and Block 4). After upsampling to a unified spatial resolution, these features are fused with the deep high-level features output by the ASPP module. Unlike traditional concatenation or simple addition, this paper employs a three-branch adaptive feature fusion module to dynamically allocate fusion weights to the three input features, as shown in Figure 3.
For the input feature map $X \in \mathbb{R}^{C \times H \times W}$, we design three parallel branches for feature processing. In the global branch, global average pooling is used to extract global distribution information $M \in \mathbb{R}^{C \times 1 \times 1}$ from the channel domain. Two $1 \times 1$ point convolutions with channel reduction ratio $r$ first reduce and then restore the channel dimension, each followed by batch normalization, with a ReLU activation in between, to obtain the global channel attention features. The formulation is as follows:
$G(X) = B\left(W_2\left(\delta\left(B\left(W_1\left(g(X)\right)\right)\right)\right)\right)$ (7)
where $g(\cdot)$ represents global average pooling, $B(\cdot)$ denotes batch normalization, $\delta(\cdot)$ is the ReLU activation function, and $W_1$ and $W_2$ are point convolutions with kernel sizes $1 \times 1 \times C \times \frac{C}{r}$ and $1 \times 1 \times \frac{C}{r} \times C$, respectively.
In the local branch, global average pooling is omitted, and local feature information is directly extracted using a sequence of point convolutions while preserving the original spatial resolution. The formulation is as follows:
$L(X) = B\left(W_4\left(\delta\left(B\left(W_3(X)\right)\right)\right)\right)$ (8)
where $W_3$ and $W_4$ are point convolutions with kernel sizes $1 \times 1 \times C \times \frac{C}{r}$ and $1 \times 1 \times \frac{C}{r} \times C$, respectively.
Additionally, a hybrid strip pooling branch is introduced, combining max and average pooling operations to capture strip-like feature patterns. This branch also undergoes a sequence of point convolutions, batch normalization, and ReLU transformations, as expressed in the following formula:
$H(X) = B\left(W_6\left(\delta\left(B\left(W_5\left(h(X)\right)\right)\right)\right)\right)$ (9)
where $h(\cdot)$ denotes hybrid strip pooling, and $W_5$ and $W_6$ are point convolutions with kernel sizes $1 \times 1 \times C \times \frac{C}{r}$ and $1 \times 1 \times \frac{C}{r} \times C$, respectively.
The outputs of the three branches are passed through a softmax operation to generate normalized weights $\omega$, ensuring that the weights sum to 1. Finally, the three input feature maps are fused by weighted summation:
$\omega = \mathrm{softmax}\left(G(X) + L(X) + H(X)\right)$ (10)
$\mathrm{output} = \omega_1 \otimes X + \omega_2 \otimes Y + \omega_3 \otimes Z$ (11)
where $\omega_1$, $\omega_2$, and $\omega_3$ represent the weights generated by the softmax operation, $X$, $Y$, and $Z$ denote the three input feature maps, and $\otimes$ denotes element-wise multiplication. This adaptive weighting mechanism dynamically adjusts the contribution of each branch based on the importance of the input features, effectively integrating multi-scale feature information.
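A minimal PyTorch sketch of the TB-AFFM is given below. How the three aligned inputs are combined before the attention branches (here, an element-wise sum) and how the softmax produces three weight maps (here, normalization across the stacked branch responses) are assumptions consistent with Equations (7)–(11) but not confirmed by the paper; the hybrid strip pooling branch reuses the HSPM sketch from Section 2.2.

```python
import torch
import torch.nn as nn

class TBAFFM(nn.Module):
    """Three-Branch Adaptive Feature Fusion sketch (Eqs. (7)-(11))."""
    def __init__(self, channels, r=4):
        super().__init__()
        def point_convs():
            # W -> BN -> ReLU -> W -> BN, matching the B(W(delta(B(W(.))))) pattern
            return nn.Sequential(
                nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels // r),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
            )
        self.global_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1), point_convs())  # G(X)
        self.local_branch = point_convs()                                           # L(X)
        self.strip_branch = nn.Sequential(HSPM(channels), point_convs())            # H(X)

    def forward(self, x_mid, x_high, x_deep):
        # the three inputs are assumed to be already aligned in channels and size
        x = x_mid + x_high + x_deep
        g = self.global_branch(x)     # N x C x 1 x 1
        l = self.local_branch(x)      # N x C x H x W
        h = self.strip_branch(x)      # N x C x H x W
        # normalize across the three branches to obtain per-pixel fusion weights
        w = torch.softmax(torch.stack([g.expand_as(l), l, h], dim=0), dim=0)
        return w[0] * x_mid + w[1] * x_high + w[2] * x_deep
```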

2.4. Global Context Attention Module (GCAM)

A significant characteristic of cloud and cloud shadow segmentation is the notable spatial dependency between clouds and cloud shadows, where the morphological features and spatial distribution of cloud shadows directly correspond to the cloud body. To effectively model this spatial dependency, we introduce the Global Context Attention Module (GCAM) into each branch of the Atrous Spatial Pyramid Pooling (ASPP) module. This module establishes long-range dependencies to model global contextual information, enabling the model to not only extract local features but also to effectively integrate the global semantic details of the image, thereby enhancing the accuracy of cloud shadow segmentation. The module is simplified from the non-local module [31], and its specific structure is shown in Figure 4.
The study in [32] found that, for different query positions in the image, the global context modeled by the non-local module is almost identical. Therefore, to simplify computation, the global context computed at a single position can be shared across all positions, with the linear transformation matrix $W_v$ placed outside the multiplication operation. The simplified process is described by Equation (12):
$z_i = x_i \otimes \mathrm{Sigmoid}\left(W_v\, \mathrm{Softmax}\left(W_k x\right)\, x\right)$ (12)
where $x \in \mathbb{R}^{C \times H \times W}$ is the input feature, $z \in \mathbb{R}^{C \times H \times W}$ is the output feature, $W_k$ is the global-information modeling matrix, and $W_v$ represents the linear transformation matrix.
In this module, the input features are first projected by $W_k$, and a weight matrix is generated through softmax normalization over all spatial positions; these weights aggregate the input features into a global context vector, so that the model can focus on the important feature regions. The context vector is then transformed by $W_v$, and a gating value is generated through the sigmoid function to select important features for enhancement. Finally, the activated result is multiplied with the original feature map, allowing the model to dynamically adjust the feature map based on the global context.
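A compact PyTorch sketch of the GCAM, following the GCNet-style simplification and the sigmoid gating of Equation (12), is shown below; the bottleneck ratio and the layer normalization inside the transform are assumptions.

```python
import torch
import torch.nn as nn

class GCAM(nn.Module):
    """Global Context Attention sketch: simplified non-local block with a sigmoid gate (Eq. (12))."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)   # global-information modeling
        self.w_v = nn.Sequential(                          # linear transform of the pooled context
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.LayerNorm([channels // r, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        n, c, h, w = x.shape
        # attention over all H*W positions, shared by every query position
        attn = torch.softmax(self.w_k(x).view(n, 1, h * w), dim=-1)
        context = torch.bmm(x.view(n, c, h * w), attn.transpose(1, 2)).view(n, c, 1, 1)
        gate = self.sigmoid(self.w_v(context))             # channel-wise gating values
        return x * gate                                    # re-weight the input features
```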

2.5. Loss Function

In cloud and cloud shadow segmentation tasks, the cloud shadow regions are typically small and difficult to segment accurately, while the background region occupies most of the pixels, causing the model to be biased toward the background class. To address the class imbalance issue and the challenge of segmenting small targets in cloud and cloud shadow segmentation, this paper adopts a weighted loss function that combines focal loss [33] and Dice loss [34].
Focal loss is an improved version of cross-entropy loss, which assigns higher weights to hard-to-classify samples, thus allowing the model to focus more on the difficult-to-classify cloud shadow regions. It is suitable for handling class imbalance tasks. The formula is as follows:
$L_{focal} = -\sum_{n=1}^{N} \left(1 - p_t^n\right)^{\gamma} \log\left(p_t^n\right)$ (13)
where p t n represents the predicted probability of the model for the true class of the n-th sample. γ is a modulation factor that controls the influence of easily classified samples and increases the weight of hard-to-classify samples, and N is the total number of samples.
Dice loss is derived from the Dice coefficient, a commonly used metric for measuring the overlap between the predicted region and the true region, defined as twice the size of the intersection between the predicted and true segmentation masks divided by the sum of their sizes. Since cloud shadow regions are relatively small and have complex shapes, Dice loss can effectively enhance the model’s attention to the cloud shadow regions, ensuring more accurate segmentation results. The formula is as follows:
$L_{dice} = 1 - \frac{2\left|Y \cap T\right|}{\left|Y\right| + \left|T\right|}$ (14)
where Y is the predicted segmentation mask and T is the true segmentation mask.
This paper combines these two loss functions with weighted coefficients, which helps address the class imbalance between cloud shadow regions and background regions. It also optimizes the overlap of the segmentation results, enhancing the model’s learning of cloud shadow boundaries and details. The form of the hybrid loss function is as follows:
$L = \alpha L_{focal} + \beta L_{dice}$ (15)
where α and β are hyperparameters. In the experiments, multiple combinations of hyperparameters were tested, as shown in Table 1. It is evident that when α = 0.6 and β = 0.4 , the model achieves the highest segmentation accuracy. Therefore, this combination of hyperparameters was chosen for the subsequent experiments.
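The weighted hybrid loss of Equations (13)–(15) can be sketched in PyTorch as follows; the focusing parameter γ = 2, the smoothing term, and the 1 − Dice form of the Dice term are assumptions, while α/β follow the best setting reported in Table 1 (0.6/0.4).

```python
import torch.nn as nn
import torch.nn.functional as F

class HybridFocalDiceLoss(nn.Module):
    """Weighted focal + Dice loss sketch (Eqs. (13)-(15))."""
    def __init__(self, alpha=0.6, beta=0.4, gamma=2.0, eps=1e-6):
        super().__init__()
        self.alpha, self.beta, self.gamma, self.eps = alpha, beta, gamma, eps

    def forward(self, logits, target):
        # logits: N x K x H x W, target: N x H x W with class indices in [0, K)
        num_classes = logits.shape[1]
        log_prob = F.log_softmax(logits, dim=1)
        prob = log_prob.exp()

        # focal term: down-weight easy pixels, emphasise hard ones
        log_pt = log_prob.gather(1, target.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        focal = -((1.0 - pt) ** self.gamma * log_pt).mean()

        # Dice term: per-class soft overlap between prediction and ground truth
        one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
        inter = (prob * one_hot).sum(dim=(0, 2, 3))
        total = prob.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2.0 * inter + self.eps) / (total + self.eps)).mean()

        return self.alpha * focal + self.beta * dice
```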

3. Experimental Analysis

3.1. Experimental Datasets

The publicly available dataset used in this study is the GF-1 WHU cloud and cloud shadow detection dataset, which is derived from images captured by the Chinese Gaofen-1 (GF-1) satellite. It was created by researchers from Wuhan University [35]. The images obtained by the GF-1 satellite’s Wide Field of View (WFV) sensor have a spatial resolution of 16 m and span four multispectral bands across the visible to near-infrared spectral range. The dataset includes 108 Level 2A scenes collected under various cloud conditions, covering multiple land cover types such as forests, barren land, water bodies, wetlands, and urban areas. The annotations in the dataset were manually completed by experts with cloud, cloud shadow, clear sky, and background pixels labeled as 255, 128, 1, and 0, respectively. Some sample images and labels from the dataset are shown in Figure 5.
The original images were divided into smaller 256 × 256 pixel sub-images, with black borders and unclear images removed to avoid negative impacts on model performance. The dataset was then split into training and validation sets in an 8:2 ratio, providing sufficient data for training and evaluation. To increase data diversity and improve model robustness, data augmentation techniques such as horizontal/vertical flipping and random rotations were applied to generate additional variations of the images. Ultimately, 5428 images were used for training, while 1360 images were used for validation and testing.
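The following sketch illustrates the kind of preprocessing described above for the GF-1 WHU scenes; the label remapping (merging clear sky and background into one class) and the black-border threshold are assumptions, not the authors' exact pipeline.

```python
import numpy as np

# Raw pixel values in the GF-1 WHU annotations: cloud = 255, cloud shadow = 128,
# clear sky = 1, background = 0. Merging clear sky and background into a single
# background class is an assumption consistent with the three-class evaluation.
LABEL_MAP = {255: 2, 128: 1, 1: 0, 0: 0}   # cloud -> 2, cloud shadow -> 1, background -> 0

def crop_scene(image, label, patch=256, border_frac=0.5):
    """Cut one scene (H x W x B array) and its mask (H x W array of raw label
    values) into patch x patch tiles, dropping tiles dominated by black borders."""
    remapped = np.zeros_like(label)
    for value, cls in LABEL_MAP.items():
        remapped[label == value] = cls

    tiles = []
    h, w = label.shape[:2]
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            img_tile = image[top:top + patch, left:left + patch]
            lbl_tile = remapped[top:top + patch, left:left + patch]
            # a pixel whose bands are all zero is treated as black border (threshold assumed)
            if (img_tile.sum(axis=-1) == 0).mean() > border_frac:
                continue
            tiles.append((img_tile, lbl_tile))
    return tiles
```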
To evaluate the generalization capability of the model, we also conducted experiments on the HRC_WHU and Cloud and Cloud Shadow datasets. The HRC_WHU dataset, also provided by the laboratory of Wuhan University [36], contains 150 high-resolution images from various regions around the world. This dataset only labels cloud and non-cloud pixels and includes images with five different background types (water, vegetation, urban, snow/ice, and barren land). Due to GPU memory limitations, we cropped the images into 3200 small 256 × 256 patches and divided them into training and testing sets with an 8:2 ratio, resulting in 2560 images for training and 640 images for testing. The Cloud and Cloud Shadow dataset, derived from high-resolution Google Earth images, covers complex backgrounds like plains, farmlands, towns, and water bodies. For training convenience, we cropped the high-resolution images into 224 × 224 patches and applied data augmentation techniques such as rotation and flipping. A total of 2630 images were obtained, which were then split into training and validation sets with an 8:2 ratio. This dataset labels clouds, cloud shadows, and background information.

3.2. Experimental Details

To ensure a fair comparison, all models were tested under the same experimental conditions and with identical hyperparameter settings. The experiments were conducted on an NVIDIA RTX 3090 GPU, manufactured by NVIDIA Corporation, headquartered in Santa Clara, California, USA. The framework used was PyTorch 1.7.0 with CUDA version 11.0. Experimental results show that most models converge within 200–250 epochs, so we set the number of training epochs to 300. Considering the GPU memory limitations, the batch size was set to 16. Due to the good convergence and stability of the Adam optimizer, we selected Adam for the experiment. Except for our model, which uses the fusion loss function proposed in this paper, all other comparative models used the commonly applied cross-entropy loss function. The initial learning rate (LR) was set to 0.001, and a polynomial learning rate decay strategy was employed to ensure gradual convergence during training. During training, the learning rate was adjusted according to the following formula:
$LR = 0.001 \times \left(1 - \frac{epoch}{300}\right)^{2}$ (16)
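In PyTorch, this schedule can be reproduced with a LambdaLR scheduler, as sketched below; the placeholder module stands in for the actual network.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# The base learning rate of 0.001 and the 300 epochs follow the paper;
# the lambda reproduces Equation (16).
model = torch.nn.Conv2d(3, 3, kernel_size=1)   # placeholder module for illustration
optimizer = Adam(model.parameters(), lr=0.001)
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: (1.0 - epoch / 300) ** 2)

for epoch in range(300):
    # ... one training pass over the data loader would go here ...
    scheduler.step()   # decay the learning rate after each epoch
```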
To evaluate the algorithm’s performance in cloud and cloud shadow segmentation tasks, we selected several quantitative metrics: precision (P), recall (R), F1 score, pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MIoU), and frequency-weighted intersection over union (FWIoU). The formulas for each of these metrics are as follows:
$P = \frac{TP}{TP + FP}$ (17)
$R = \frac{TP}{TP + FN}$ (18)
$F1 = \frac{2 \times P \times R}{P + R}$ (19)
$PA = \frac{\sum_{i=0}^{k} p_{i,i}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{i,j}}$ (20)
$MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{i,i}}{\sum_{j=0}^{k} p_{i,j}}$ (21)
$MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{i,i}}{\sum_{j=0}^{k} p_{i,j} + \sum_{j=0}^{k} p_{j,i} - p_{i,i}}$ (22)
$FWIoU = \frac{1}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{i,j}} \sum_{i=0}^{k} \left( \sum_{j=0}^{k} p_{i,j} \right) \frac{p_{i,i}}{\sum_{j=0}^{k} p_{i,j} + \sum_{j=0}^{k} p_{j,i} - p_{i,i}}$ (23)
where $TP$ represents the number of true positive samples, $FP$ represents the number of false positive samples, and $FN$ represents the number of false negative samples. $p_{i,i}$ denotes the number of correctly classified pixels of the i-th class, $p_{i,j}$ denotes the number of pixels of the i-th class incorrectly predicted as the j-th class, and $p_{j,i}$ denotes the number of pixels of the j-th class incorrectly predicted as the i-th class. $k + 1$ is the total number of classes (including the background).
In the experiments on the GF-1 WHU dataset and the Cloud and Cloud Shadow Dataset, we consider three classes (cloud, cloud shadow, and background) for metric computation. For the HRC_WHU dataset experiments, we use two classes (cloud and background) for evaluation. In addition, we also calculated the number of floating-point operations (FLOPs) and the time taken to train a single image (Time).
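The class-level metrics of Equations (20)–(23) can be computed from a confusion matrix, as in the following sketch written against NumPy arrays of predicted and ground-truth class indices; this is an illustration, not the authors' evaluation code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Accumulate a (num_classes x num_classes) matrix; rows are true classes,
    columns are predicted classes, so cm[i, j] corresponds to p_{i,j}."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """PA, MPA, MIoU, and FWIoU from a confusion matrix, following Eqs. (20)-(23)."""
    eps = 1e-12                                   # guard against empty classes
    pa = np.diag(cm).sum() / (cm.sum() + eps)
    class_acc = np.diag(cm) / (cm.sum(axis=1) + eps)
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm) + eps)
    freq = cm.sum(axis=1) / (cm.sum() + eps)
    return {"PA": pa, "MPA": class_acc.mean(), "MIoU": iou.mean(), "FWIoU": (freq * iou).sum()}
```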

3.3. Ablation Experiment

To evaluate the effectiveness of the proposed modules in improving segmentation results, we conducted ablation experiments. The original DeepLab v3+ was used as the baseline model, and we progressively introduced the HSPM, TB-AFFM, and GCAM modules to verify the effectiveness of each module. Finally, we incorporated our proposed innovative loss function to validate the feasibility of the entire model. The metrics used in the ablation experiments were MPA and MIoU with the specific results shown in Table 2. It can be observed that the model that includes all the modules and uses the innovative loss function achieved the best results, improving the MPA and MIoU metrics by 1.47% and 1.88%, respectively, compared to the baseline model.
  • Ablation for HSPM: Global pooling leads to a significant loss of detailed information, while replacing global pooling with the HSPM helps retain both global information and important local details. This enables the model to better recognize the complex shapes and boundaries of clouds and cloud shadows. The experiment shows that the HSPM raised the MPA and MIoU to 84.87% and 75.79%, respectively, validating the module’s effectiveness in improving model accuracy.
  • Ablation for TB-AFFM: The proper use of multi-scale features is a crucial way to enhance model performance. In the decoding stage, the TB-AFFM module adaptively fuses mid-level and high-level features from the backbone network with deep high-level features extracted by ASPP. This helps the model automatically learn the importance of different feature levels and better recover and strengthen the details of clouds and cloud shadows during decoding. The experiment shows that TB-AFFM improved the model’s MPA and MIoU by 0.45% and 0.62%, respectively.
  • Ablation for GCAM: ASPP, using different dilated convolutions, can only capture local information at different scales, but it fails to capture global contextual relationships effectively. Considering that the morphology of clouds and the distribution of cloud shadows are correlated, we introduced GCAM into each branch of ASPP to enhance the model’s ability to perceive long-range pixel relationships and improve its resistance to complex background interference. The experiment shows that GCAM increased the MPA from 85.32% to 85.75% and the MIoU from 76.41% to 76.93%.

3.4. Comparison Experiments

To demonstrate the superiority of the proposed model in cloud and cloud shadow segmentation, we conducted comparative experiments with several outstanding semantic segmentation models. In addition to the models mentioned in the introduction, the comparative experiments also included CSDNet [37], based on a CNN architecture, SegFormer [38], based on a Transformer architecture, and DASUNet [39] and MAINet [40], which are both based on hybrid architectures. The experimental results are shown in Table 3. From the perspective of MIoU, FCN-8s exhibited the worst segmentation performance, followed by SegNet, PSPNet, HRNet, UNet, DeepLab v3+, SP_CSANet, CSDNet, SegFormer, DASUNet, and MAINet, with the latter four models achieving better segmentation results. Our model achieved PA, MPA, MIoU, and FWIoU scores of 90.57%, 86.08%, 77.28%, and 83.05%, respectively, outperforming all other models.
To further analyze the advantages of our model, we computed the segmentation metrics for cloud and cloud shadow. The results are shown in Table 4. For cloud detection, our model achieved precision, recall, and F1 score values of 93.78%, 92.36%, and 92.63%, respectively, with precision and F1 score outperforming other models, while recall was slightly lower than DASUNet. For cloud shadow detection, our model achieved precision, recall, and F1 score values of 77.13%, 73.64%, and 75.34%, respectively, outperforming all other models in every metric. In conclusion, our model significantly improves both cloud and cloud shadow detection accuracy.
Figure 6 visually compares the segmentation results of different models for cloud and cloud shadow detection. In the figure, black represents the background, white represents the cloud, and gray represents the cloud shadow. It can be seen that the segmentation results of SegNet are relatively coarse especially in terms of cloud shadow extraction. Due to its simple network structure, SegNet struggles to capture detailed features, leading to inaccuracies in edge detection and frequent false negatives and false positives, particularly for small cloud and cloud shadow areas, as seen in the first and third sets of images. HRNet produces better segmentation results with smoother boundaries and more precise edge detection, though it still exhibits noticeable false negatives and false positives (third and fourth sets of images). From the first and fifth sets of images, it is evident that the encoder–decoder architecture of UNet helps preserve structural details in the image, enabling more accurate cloud shadow boundary detection. However, it may still result in false positives when handling complex backgrounds or overlapping cloud shadows, as seen in the fourth set of images. DeepLab v3+, by incorporating dilated convolutions, enhances the model’s ability to capture features at different scales, yielding fine segmentation results and strong edge extraction. However, in cases like the fourth set of images, where the cloud shadow boundaries are unclear and there is significant background interference, it may mistakenly classify some non-cloud shadow pixels as cloud shadows. Our proposed method, in the feature extraction phase, first uses HSPM to enhance the extraction of fine details. GCAM is then introduced to help the model better understand the relationships between clouds and the background, as well as between clouds and cloud shadows, reducing false positives and false negatives. Finally, in the decoding stage, TB-AFFM adaptively fuses mid-level and high-level features from the backbone network with deep features from the ASPP, further improving segmentation accuracy for detailed and complex regions. Particularly, as seen in the second and third sets of images, the boundaries of the segmentation results from our model are the most accurate.
To better demonstrate the generalization capability of our model in different scenarios, we compared the segmentation results of various models on seven different scenes: grassland, barren land, water bodies, desert, wasteland, urban areas, and snow/ice. As shown in Figure 7, in the first set of images, SegNet, HRNet, and UNet fail to detect small gaps in the clouds, and their edge extraction is not sufficiently accurate. Although DeepLab v3+ successfully detects the small gaps in the clouds, its segmentation of shapes is not accurate enough, and its detailed handling needs improvement. Our model, however, accurately identifies the gaps in the cloud shapes, with richer details. In the second set of images, HRNet’s edge extraction for irregular cloud boundaries is coarse, while other models perform better in edge segmentation, but their detailed rendering is still inferior to our model. In the third set of images, where the boundaries are unclear and difficult to segment, our model can still precisely locate the cloud boundaries, whereas other models struggle to extract accurate boundaries in this scenario. In the fourth set, the dark background increases the difficulty of cloud shadow detection. Except for our model, other models fail to detect the small cloud shadow area at the bottom of the image. In the fifth set of images, the dark regions in the background are more likely to be incorrectly detected as cloud shadows. The results show that UNet and DeepLab v3+ almost identify the entire dark area as cloud shadow, while SegNet identifies part of the dark area as cloud shadow. HRNet, although successfully distinguishing the dark background from the cloud shadow, does not segment the shapes of other clouds and cloud shadows as accurately as our model. In the sixth set of images, influenced by the urban background, SegNet, HRNet, and UNet show a relatively coarse segmentation of cloud and cloud shadow boundaries, whereas our model demonstrates the highest accuracy in boundary segmentation. In the seventh set of images, SegNet almost fails to distinguish between the snow on the mountaintop and the clouds, and other models also mistakenly classify some mountaintop snow as clouds. In contrast, our model exhibits the lowest false positive rate, demonstrating its superior performance in complex backgrounds.
Figure 8 demonstrates the segmentation results of clouds with varying thickness and shapes. The central portion of thick clouds, due to their lower translucency, appears as bright white regions, which are typically well recognized by the model. At the edges of thick clouds, however, light reflection is weaker; when the background is also relatively dark, the contrast between the cloud edges and the background is low, making it difficult for the model to distinguish the cloud boundaries and leading to false negatives or false positives. Our model can more accurately identify thick clouds and extract clearer boundaries. Thin or light clouds typically have blurred edges and are more susceptible to background interference, often resulting in discontinuous segmentation and false positives. For these types of clouds, our model also exhibits better segmentation performance. In conclusion, our model improves segmentation accuracy for clouds of different scales and shapes by aggregating global contextual information and adaptively assigning attention.

3.5. Generalization Performance Analysis

To further evaluate the generalization ability of our model on different datasets, we conducted comparative experiments on the HRC_WHU dataset (Dataset 1) and the Cloud and Cloud Shadow dataset (Dataset 2). The experiments were evaluated using three metrics: PA, MPA, and MIoU. The experimental results are shown in Table 5.
The experimental results indicate that on Dataset 1, our model demonstrates outstanding performance, leading all comparison models in the PA (94.44%) and MIoU (87.55%) metrics. In the MPA metric, our model ranks second only to DASUNet. On Dataset 2, our model also shows the best overall performance, with the PA metric reaching 97.62%, the MPA metric at 96.56%, and the MIoU metric at 93.02%. Overall, although other models such as CSDNet and DASUNet achieve competitive results in certain metrics, our model outperforms them on both datasets in overall performance. This demonstrates the efficiency and strong generalization capability of our model for cloud and cloud shadow segmentation tasks.

4. Discussion

In this study, we propose an improved model based on DeepLab v3+, aimed at enhancing the segmentation accuracy of clouds and cloud shadows. Our improvement strategy includes the Global Context Attention Module (GCAM), the Hybrid Strip Pooling Module (HSPM), and the Three-Branch Adaptive Feature Fusion Module (TB-AFFM). Experimental results show that our improved model achieved PA, MPA, MIoU, and FWIoU scores of 90.57%, 86.08%, 77.28%, and 83.05%, respectively, on the GF-1 WHU dataset. Our model also outperforms the other comparative models in segmentation accuracy on the other datasets.
Although our research is primarily based on the DeepLab v3+ architecture, we believe that these methods are equally applicable to other image segmentation models. For instance, multi-scale feature fusion techniques can enhance the integration of multi-scale features, thereby improving the handling of images with varying resolutions and complex backgrounds. Similar approaches have been widely applied in other segmentation networks with good results. For example, Surya et al. [41] improved the fusion of multi-scale features in the U-Net architecture, successfully enhancing cloud shadow segmentation performance. Qu et al. [19] combined strip pooling with residual networks, significantly improving edge segmentation accuracy.
Beyond cloud and cloud shadow segmentation, our strategies may also yield similar effects in other task models, particularly in areas such as change detection [42,43,44]. For example, Jiang et al. [45] improved urban change detection by utilizing cross-scale fusion. Ren et al. [46] integrated self-attention and cross-attention mechanisms to enhance the model’s focus on real change regions.
In summary, we believe that our improvement strategies not only enhance the performance of DeepLab v3+ but also have potential applications in other image segmentation models, providing support for a broader range of remote sensing image tasks.

5. Conclusions

The accurate segmentation of clouds and cloud shadows is crucial in meteorological monitoring and remote sensing image analysis. This paper proposes an improved model based on DeepLab v3+, significantly enhancing cloud and cloud shadow segmentation accuracy. The model achieves this by adding a Global Context Attention Module to the ASPP branches, replacing global pooling with hybrid strip pooling, and incorporating an adaptive weighting mechanism to fuse mid- and high-level features from the backbone network with deep features from the ASPP during decoding. Additionally, we adopt an innovative loss function in the experiments. Specifically, the HSPM reduces spatial information loss and enhances the ability to extract detailed features. The GCAM models global contextual information and enhances the model’s perception of the global background through attention allocation, effectively reducing background noise interference. The TB-AFFM uses three parallel branches to adaptively generate weights, effectively integrating multi-scale feature information. Experimental results show that our model outperforms traditional segmentation models and some recent segmentation models across various metrics. In terms of segmentation performance, our model effectively handles different types of clouds and demonstrates strong generalization ability in complex scenarios, particularly excelling in edge extraction and preserving detailed features. However, there are still some limitations: (1) The model still experiences some false negatives and false positives in complex cloud shadow scenes, such as images with strong noise interference. (2) The segmentation performance for large areas of thin clouds needs further improvement. (3) The introduction of complex modules increases the number of parameters, leading to slower inference speed and higher computational overhead. In the future, we will address these limitations through further research to provide more accurate and real-time cloud and cloud shadow segmentation models for related fields.

Author Contributions

Methodology, Y.F. and Z.F.; Software, Y.F., Z.F. and Y.Y.; Validation, Y.F., Y.Y., Z.J. and S.Z.; Formal analysis, Y.F., Z.F. and Y.Y.; Investigation, Y.F. and Z.F.; Data curation, Y.F., Y.Y. and S.Z.; Writing—original draft, Y.F. and Z.F.; Writing—review & editing, Y.F. and Z.F.; Visualization, Y.F. and Z.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rossow, W.B.; Schiffer, R.A. Advances in Understanding Clouds from ISCCP. Bull. Am. Meteorol. Soc. 1999, 80, 2261–2288. [Google Scholar]
  2. Sun, L.; Wei, J.; Wang, J.; Mi, X.; Guo, Y.; Lv, Y.; Yang, Y.; Gan, P.; Zhou, X.; Jia, C.; et al. A Universal Dynamic Threshold Cloud Detection Algorithm (UDTCDA) Supported by a Prior Surface Reflectance Database. J. Geophys. Res. Atmos. 2016, 121, 7172–7196. [Google Scholar] [CrossRef]
  3. Vásquez, R.E.; Manian, V.B. Texture-Based Cloud Detection in MODIS Images. In Proceedings of the SPIE Remote Sensing; SPIE: Bellingham, WA, USA, 2003. [Google Scholar]
  4. Zhu, Z.; Woodcock, C.E. Object-Based Cloud and Cloud Shadow Detection in Landsat Imagery. Remote Sens. Environ. 2012, 118, 83–94. [Google Scholar]
  5. Danda, S.; Challa, A.; Sagar, B.S.D. A Morphology-Based Approach for Cloud Detection. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 80–83. [Google Scholar]
  6. Le Hégarat-Mascle, S.; André, C. Use of Markov Random Fields for Automatic Cloud/Shadow Detection on High Resolution Optical Images. ISPRS J. Photogramm. Remote Sens. 2009, 64, 351–366. [Google Scholar]
  7. Cheng, H.-Y.; Lin, C.-L. Cloud Detection in All-Sky Images via Multi-Scale Neighborhood Features and Multiple Supervised Learning Techniques. Atmos. Meas. Tech. 2017, 10, 199–208. [Google Scholar] [CrossRef]
  8. Wei, J.; Huang, W.; Li, Z.; Sun, L.; Zhu, X.; Yuan, Q.; Liu, L.; Cribb, M. Cloud Detection for Landsat Imagery by Combining the Random Forest and Superpixels Extracted via Energy-Driven Sampling Segmentation Approaches. Remote Sens. Environ. 2020, 248, 112005. [Google Scholar]
  9. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  11. Miller, J.; Nair, U.; Ramachandran, R.; Maskey, M. Detection of Transverse Cirrus Bands in Satellite Imagery Using Deep Learning. Comput. Geosci. 2018, 118, 79–85. [Google Scholar]
  12. Mohajerani, S.; Krammer, T.A.; Saeedi, P. A Cloud Detection Algorithm for Remote Sensing Images Using Fully Convolutional Neural Networks. In Proceedings of the 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), Vancouver, BC, Canada, 29–31 August 2018; pp. 1–5. [Google Scholar]
  13. Gonzales, C.; Sakla, W.A. Semantic Segmentation of Clouds in Satellite Imagery Using Deep Pre-Trained U-Nets. In Proceedings of the 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 15–17 October 2019; pp. 1–7. [Google Scholar]
  14. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar]
  15. Lu, J.; Wang, Y.; Zhu, Y.; Ji, X.; Xing, T.; Li, W.; Zomaya, A.Y. P_Segnet and NP_Segnet: New Neural Network Architectures for Cloud Recognition of Remote Sensing Images. IEEE Access 2019, 7, 87323–87333. [Google Scholar]
  16. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  17. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  18. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-Resolution Representations for Labeling Pixels and Regions. arXiv 2019, arXiv:1904.04514. [Google Scholar]
  19. Qu, Y.; Xia, M.; Zhang, Y. Strip Pooling Channel Spatial Attention Network for the Segmentation of Cloud and Cloud Shadow. Comput. Geosci. 2021, 157, 104940. [Google Scholar] [CrossRef]
  20. Lu, C.; Xia, M.; Qian, M.; Chen, B. Dual-Branch Network for Cloud and Cloud Shadow Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5410012. [Google Scholar] [CrossRef]
  21. Dai, X.; Chen, K.; Xia, M.; Weng, L.; Lin, H. LPMSNet: Location Pooling Multi-Scale Network for Cloud and Cloud Shadow Segmentation. Remote Sens. 2023, 15, 4005. [Google Scholar] [CrossRef]
  22. Hu, Z.; Weng, L.; Xia, M.; Hu, K.; Lin, H. HyCloudX: A Multibranch Hybrid Segmentation Network With Band Fusion for Cloud/Shadow. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6762–6778. [Google Scholar] [CrossRef]
  23. Kang, J.; Liu, L.; Zhang, F.; Shen, C.; Wang, N.; Shao, L. Semantic Segmentation Model of Cotton Roots In-Situ Image Based on Attention Mechanism. Comput. Electron. Agric. 2021, 189, 106370. [Google Scholar] [CrossRef]
  24. Li, D.; Yang, Y.; Zhao, S.; Ding, J. Segmentation of Underwater Fish in Complex Aquaculture Environments Using Enhanced Soft Attention Mechanism. Environ. Model. Softw. 2024, 181, 106170. [Google Scholar] [CrossRef]
  25. Jiang, J.; Feng, X.; Ye, Q.; Hu, Z.; Gu, Z.; Huang, H. Semantic Segmentation of Remote Sensing Images Combined with Attention Mechanism and Feature Enhancement U-Net. Int. J. Remote Sens. 2023, 44, 6219–6232. [Google Scholar] [CrossRef]
  26. Ding, Z.; Wang, T.; Sun, Q.; Wang, H. Adaptive Fusion with Multi-Scale Features for Interactive Image Segmentation. Appl. Intell. 2021, 51, 5610–5621. [Google Scholar] [CrossRef]
  27. Wei, D.; Wang, H. MFFLNet: Lightweight Semantic Segmentation Network Based on Multi-Scale Feature Fusion. Multim. Tools Appl. 2023, 83, 30073–30093. [Google Scholar]
  28. Li, Y.; Huang, M.; Zhang, Y.; Bai, Z. Attention Guided Multi Scale Feature Fusion Network for Automatic Prostate Segmentation. Comput. Mater. Contin. 2024, 78, 1649–1668. [Google Scholar] [CrossRef]
  29. Ji, H.; Xia, M.; Zhang, D.; Lin, H. Multi-Supervised Feature Fusion Attention Network for Clouds and Shadows Detection. ISPRS Int. J. Geo-Inf. 2023, 12, 247. [Google Scholar] [CrossRef]
  30. Hou, Q.; Zhang, L.; Cheng, M.-M.; Feng, J. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4002–4011. [Google Scholar]
  31. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  32. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
  33. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  34. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–26 October 2016; IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 565–571. [Google Scholar]
  35. Li, Z.; Shen, H.; Li, H.; Xia, G.; Gamba, P.; Zhang, L. Multi-Feature Combined Cloud and Cloud Shadow Detection in GaoFen-1 Wide Field of View Imagery. Remote Sens. Environ. 2017, 191, 342–358. [Google Scholar] [CrossRef]
  36. Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep Learning Based Cloud Detection for Medium and High Resolution Remote Sensing Images of Different Sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212. [Google Scholar] [CrossRef]
  37. Zhang, G.; Gao, X.; Yang, Y.; Wang, M.; Ran, S. Controllably Deep Supervision and Multi-Scale Feature Fusion Network for Cloud and Snow Detection Based on Medium- and High-Resolution Imagery Dataset. Remote Sens. 2021, 13, 4805. [Google Scholar] [CrossRef]
  38. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  39. Wei, F.; Wang, S.; Sun, Y.; Yin, B. A Dual Attentional Skip Connection Based Swin-UNet for Real-Time Cloud Segmentation. IET Image Process. 2024, 18, 3460–3479. [Google Scholar]
  40. Ding, L.; Xia, M.; Lin, H.; Hu, K. Multi-Level Attention Interactive Network for Cloud and Snow Detection Segmentation. Remote Sens. 2024, 16, 112. [Google Scholar] [CrossRef]
  41. Surya, S.R.; Abdul Rahiman, M. CSDUNet: Automatic Cloud and Shadow Detection from Satellite Images Based on Modified U-Net. J. Indian Soc. Remote Sens. 2024, 52, 1699–1715. [Google Scholar] [CrossRef]
  42. Zhan, Z.; Ren, H.; Xia, M.; Lin, H.; Wang, X.; Li, X. AMFNet: Attention-Guided Multi-Scale Fusion Network for Bi-Temporal Change Detection in Remote Sensing Images. Remote Sens. 2024, 16, 1765. [Google Scholar] [CrossRef]
  43. Wang, Z.; Gu, G.; Xia, M.; Weng, L.; Hu, K. Bitemporal Attention Sharing Network for Remote Sensing Image Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10368–10379. [Google Scholar] [CrossRef]
  44. Zhu, T.; Zhao, Z.; Xia, M.; Huang, J.; Weng, L.; Hu, K.; Lin, H.; Zhao, W. FTA-Net: Frequency-Temporal-Aware Network for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 3448–3460. [Google Scholar] [CrossRef]
  45. Jiang, S.; Lin, H.; Ren, H.; Hu, Z.; Weng, L.; Xia, M. MDANet: A High-Resolution City Change Detection Network Based on Difference and Attention Mechanisms under Multi-Scale Feature Fusion. Remote Sens. 2024, 16, 1387. [Google Scholar] [CrossRef]
  46. Ren, H.; Xia, M.; Weng, L.; Lin, H.; Huang, J.; Hu, K. Interactive and Supervised Dual-Mode Attention Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5612818. [Google Scholar] [CrossRef]
Figure 1. Multi-scale feature adaptive fusion network structure based on DeepLab v3+.
Figure 2. The structure of the Hybrid Strip Pooling Module. (The intensity of the color represents the magnitude of the feature values, with darker colors indicating larger values).
Figure 3. The structure of the three-branch adaptive feature fusion module.
Figure 4. The structure of the Global Context Attention Module.
Figure 5. Some data from the GF-1 WHU dataset. The first row shows the original images, and the second row corresponds to their labels: (a) forest; (b) Gobi desert; (c) barren land; (d) wetland; and (e) urban areas.
Figure 6. Comparison of segmentation results of clouds and cloud shadows by different methods: (a) test image; (b) labels; (c) segmentation result of our method; (d) segmentation result of SegNet; (e) segmentation result of HRNet; (f) segmentation result of UNet; (g) segmentation result of DeepLab v3+. (The red boxes highlight the areas where the segmentation results differ between models).
Figure 7. Comparison of segmentation results of clouds and cloud shadows by different methods in different scenarios: (a) test image; (b) labels; (c) segmentation result of our method; (d) segmentation result of SegNet; (e) segmentation result of HRNet; (f) segmentation result of UNet; (g) segmentation result of DeepLab v3+. (The red boxes and red circles highlight the areas where the segmentation results differ between models).
Figure 8. Comparison of segmentation results of different cloud types and cloud shadows by different methods: (a) test image; (b) labels; (c) segmentation result of our method; (d) segmentation result of SegNet; (e) segmentation result of HRNet; (f) segmentation result of UNet; (g) segmentation result of DeepLab v3+. (The red boxes highlight the areas where the segmentation results differ between models).
Table 1. The segmentation results of the model under different combinations of hyperparameters.
α      β      MPA (%)   MIoU (%)
0.2    0.8    85.80     77.16
0.3    0.7    85.98     77.20
0.4    0.6    85.48     76.84
0.5    0.5    86.08     77.06
0.6    0.4    86.50     77.43
0.7    0.3    85.52     76.87
0.8    0.2    85.27     76.80
Note: Bold indicates the best result, and the underline indicates the second-best result.
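The weights α and β swept in Table 1 balance the two terms of the composite loss used in training. Purely as a hedged illustration (the authoritative formulation is the one given in the methodology, not this sketch), the PyTorch snippet below assumes the loss is a convex combination of focal loss [33] and Dice loss [34], with α weighting the focal term and β the Dice term; the function name and the γ = 2 focusing parameter are placeholders introduced here for demonstration.
```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=0.6, beta=0.4, gamma=2.0, eps=1e-6):
    """Hypothetical α·focal + β·Dice combination (assumption, not the paper's exact loss).

    logits: (N, C, H, W) raw network outputs; target: (N, H, W) integer class labels.
    alpha and beta play the role of the weights swept in Table 1 (alpha + beta = 1).
    """
    num_classes = logits.shape[1]

    # Focal term: cross-entropy down-weighted by (1 - p_t)^gamma for well-classified pixels.
    log_prob = F.log_softmax(logits, dim=1)                  # (N, C, H, W)
    ce = F.nll_loss(log_prob, target, reduction="none")      # per-pixel CE, (N, H, W)
    p_t = torch.exp(-ce)                                     # probability of the true class
    focal = ((1.0 - p_t) ** gamma * ce).mean()

    # Dice term: 1 - mean per-class soft Dice coefficient.
    prob = log_prob.exp()
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                         # sum over batch and spatial dims
    intersection = (prob * one_hot).sum(dims)
    cardinality = prob.sum(dims) + one_hot.sum(dims)
    dice = 1.0 - ((2.0 * intersection + eps) / (cardinality + eps)).mean()

    return alpha * focal + beta * dice
```
Under this reading, calling combined_loss(logits, target, alpha=0.6, beta=0.4) would correspond to the α = 0.6, β = 0.4 row of Table 1.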
Table 2. The ablation experiment results of different module combinations.
Method                                                  MPA (%)   MIoU (%)
DeepLab v3 plus                                         84.61     75.40
DeepLab v3 plus + HSPM                                  84.87     75.79 (0.39 ↑)
DeepLab v3 plus + HSPM + TB-AFFM                        85.32     76.41 (0.62 ↑)
DeepLab v3 plus + HSPM + TB-AFFM + GCAM                 85.75     76.93 (0.52 ↑)
DeepLab v3 plus + HSPM + TB-AFFM + GCAM + loss (Ours)   86.08     77.28 (0.35 ↑)
Note: Bold indicates the best result, and the arrow represents the improvement in model performance.
Table 3. Comparison of segmentation results of different models on the GF1_WHU dataset.
Method             PA (%)   MPA (%)   MIoU (%)   FWIoU (%)   FLOPs (G)   Time (ms)
FCN-8s [9]         86.15    80.20     69.05      74.52       15.07       3.73
SegNet [14]        88.48    82.85     72.83      79.73       18.15       4.26
PSPNet [16]        88.53    82.64     72.98      79.69       24.82       6.97
HRNet [18]         89.14    84.50     74.77      80.72       34.64       30.93
UNet [10]          89.32    85.02     75.08      81.07       17.36       2.90
DeepLab v3+ [17]   89.72    84.65     75.44      81.66       20.78       13.53
SP_CSANet [19]     89.69    85.35     75.78      82.72       25.04       23.21
CSDNet [37]        89.83    85.51     76.34      82.55       16.21       18.48
SegFormer [38]     90.24    85.45     76.45      82.63       20.32       15.23
DASUNet [39]       90.11    85.83     76.57      82.81       19.24       8.54
MAINet [40]        90.32    85.68     76.73      82.72       25.32       15.43
MFAFNet (Ours)     90.57    86.08     77.28      83.05       23.01       16.93
Note: Bold indicates the best result, and the underline indicates the second-best result.
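The PA, MPA, MIoU, and FWIoU columns of Table 3 follow the standard confusion-matrix definitions. For readers reproducing the evaluation, the NumPy sketch below shows one way to compute them; the 3 × 3 confusion matrix over surface, cloud, and cloud shadow (rows = ground truth, columns = prediction) is an assumed convention, not part of any released code.
```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """PA, MPA, MIoU, and FWIoU from a C x C confusion matrix
    (rows: ground-truth class, columns: predicted class)."""
    tp = np.diag(conf).astype(float)
    gt_total = conf.sum(axis=1).astype(float)     # pixels per ground-truth class
    pred_total = conf.sum(axis=0).astype(float)   # pixels per predicted class
    union = gt_total + pred_total - tp

    pa = tp.sum() / conf.sum()                    # pixel accuracy
    mpa = np.mean(tp / np.maximum(gt_total, 1))   # mean per-class pixel accuracy
    iou = tp / np.maximum(union, 1)               # per-class intersection over union
    miou = iou.mean()                             # mean IoU
    freq = gt_total / conf.sum()                  # class frequencies
    fwiou = (freq * iou).sum()                    # frequency-weighted IoU
    return pa, mpa, miou, fwiou
```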
Table 4. Comparison of segmentation results of different models in each category.
Method            Cloud                          Cloud Shadow
                  P (%)    R (%)    F1 (%)       P (%)    R (%)    F1 (%)
FCN-8s            87.84    86.34    88.63        68.57    65.01    67.78
SegNet            90.33    92.01    91.16        71.89    66.28    68.97
PSPNet            89.73    90.41    90.07        74.78    66.06    70.15
HRNet             91.49    90.72    91.10        74.82    71.37    73.05
UNet              91.72    91.23    91.47        73.60    72.68    73.14
DeepLab v3+       92.24    91.46    91.85        75.74    70.31    72.92
SP_CSANet         92.45    91.73    92.12        74.94    71.48    73.03
CSDNet            92.72    92.02    92.23        75.83    73.07    74.53
SegFormer         92.85    92.07    92.45        76.25    72.63    74.33
DASUNet           93.21    92.41    92.31        76.47    73.11    73.67
MAINet            93.49    92.25    92.42        76.32    72.83    74.24
MFAFNet (Ours)    93.78    92.36    92.63        77.13    73.64    75.34
Note: Bold indicates the best result, and the underline indicates the second-best result.
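The per-class precision, recall, and F1 scores in Table 4 use the usual one-vs-rest definitions. Building on the same assumed confusion-matrix convention as above (rows = ground truth), a minimal sketch is:
```python
import numpy as np

def per_class_prf(conf: np.ndarray, c: int):
    """Precision, recall, and F1 for class index c (e.g., cloud or cloud shadow)
    from a C x C confusion matrix with ground truth on the rows."""
    tp = float(conf[c, c])
    fp = float(conf[:, c].sum() - tp)   # predicted as c but belonging to another class
    fn = float(conf[c, :].sum() - tp)   # belonging to c but predicted as another class
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return precision, recall, f1
```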
Table 5. Comparison of segmentation results of different models on other datasets.
Method            Dataset 1                       Dataset 2
                  PA (%)   MPA (%)   MIoU (%)     PA (%)   MPA (%)   MIoU (%)
FCN-8s            92.93    91.83     86.08        94.58    92.63     87.34
SegNet            93.06    92.02     86.11        94.86    92.74     87.55
PSPNet            93.33    92.25     86.59        95.23    93.61     89.86
HRNet             93.21    92.06     86.27        96.73    95.07     91.35
UNet              93.56    92.87     86.78        97.02    95.77     92.10
DeepLab v3+       93.35    92.14     86.54        97.26    95.81     92.25
SP_CSANet         93.72    93.06     86.85        97.72    96.33     92.76
CSDNet            94.35    93.50     87.27        97.58    96.27     92.64
SegFormer         93.87    93.35     86.92        97.32    96.22     92.43
DASUNet           94.32    93.63     87.32        97.45    96.43     92.81
MAINet            93.95    93.38     87.01        97.52    96.18     92.68
MFAFNet (Ours)    94.44    93.58     87.55        97.62    96.56     93.02
Note: Bold indicates the best result, and the underline indicates the second-best result.