Article

Exploring Factors Affecting the Performance of Neural Network Algorithm for Detecting Clouds, Snow, and Lakes in Sentinel-2 Images

1 State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China
2 College of Earth and Planetary Sciences, Chengdu University of Technology, Chengdu 610059, China
3 Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(17), 3162; https://doi.org/10.3390/rs16173162
Submission received: 24 June 2024 / Revised: 18 August 2024 / Accepted: 25 August 2024 / Published: 27 August 2024

Abstract
Detecting clouds, snow, and lakes in remote sensing images is vital because they obscure underlying surface information and hinder data extraction. In this study, we use Sentinel-2 images to implement a two-stage random forest (RF) algorithm for image labeling and examine the factors influencing neural network performance across six aspects: model architecture, encoder, learning rate adjustment strategy, loss function, input image size, and band combination. Our findings indicate that the Feature Pyramid Network (FPN) achieved the highest MIoU of 87.14%. The multi-head self-attention mechanism was less effective than convolutional methods for feature extraction with small datasets. Incorporating residual connections into convolutional blocks notably enhanced performance. Additionally, employing false-color images (bands 12-3-2) yielded a 4.86% improvement in MIoU compared to true-color images (bands 4-3-2). Notably, variations in model architecture, encoder structure, and input band combination had a substantial impact on performance, with parameter variations resulting in MIoU differences exceeding 5%. These results provide a reference for high-precision segmentation of clouds, snow, and lakes and offer valuable insights for applying deep learning techniques to the high-precision extraction of information from remote sensing images, thereby advancing research in deep neural networks for semantic segmentation.

1. Introduction

Remote sensing technology is a powerful tool for acquiring geographic environmental information and land resource data [1]. Optical remote sensing satellites offer detailed Earth observations, making them significant for monitoring land, forests, and the atmosphere [2]. However, cloud and snow cover can often obscure the Earth’s surface, hindering the observation of geomorphic features and surface characteristics [3]. Additionally, understanding climate change, investigating hydrological resources [4], and issuing snow disaster warnings all depend on accurate segmentation of clouds, snow, and lakes in remote sensing imagery [5].
The similar reflectivity of clouds and snow in the visible and near-infrared bands makes distinguishing between them challenging [6], while lakes and cloud shadows exhibit similarly low reflectance in the visible bands, further complicating segmentation. Various algorithms have been proposed to address this issue, falling into three primary technical approaches: spectral threshold-based methods [7], classical machine learning-based methods [8], and deep learning-based methods [9]. Spectral threshold methods are straightforward, but they rely heavily on prior knowledge and struggle in complex scenes [10]. Classical machine learning methods require manual selection of image features for classification [11], achieving high accuracy but demanding significant manual effort [12]. Deep learning, on the other hand, automatically extracts features and achieves high-precision segmentation, but it is heavily reliant on dataset quantity and quality, often requiring labor-intensive manual annotation [13].
Recent advancements in deep learning have significantly improved cloud, snow, and lake detection in remote sensing data [14]. Deep neural networks can automatically capture texture, shape, and contextual information, often outperforming traditional methods [15]. Previous studies have successfully applied deep neural networks to detect these features. Guo et al. [16] proposed a neural network with an encoder-decoder structure to extract cloud regions in remote sensing images. Qu et al. [17] proposed a parallel asymmetric network with dual attention that achieves both high accuracy and rapid detection speed for clouds in remote sensing images, but it offers no advantage when clouds and snow coexist. Chen et al. [18] used a Convolutional Neural Network (CNN) to segment the water bodies of lakes and demonstrated that deep learning methods were more accurate than traditional methods. However, this method cannot perceive global contextual information, so its segmentation results lack continuity. To improve the perception of global information and achieve dense prediction of lake pixels, Hu et al. [19] designed a multibranch aggregation module using dilated convolutions to improve the prediction accuracy of lakes.
Current research often concentrates on a single neural network model, leading to notable disparities across studies due to varying sample settings and parameters. Moreover, coarse spatial resolution and significant differences in object size contribute to considerable errors in object recognition outcomes. The successful deployment of the Sentinel-2 satellite has opened new avenues for object identification and detection. Sentinel-2, equipped with a high-resolution multi-spectral imager (MSI), facilitates land monitoring by providing comprehensive imagery of vegetation, soil and water cover, inland waterways, coastal regions, and more [20]. Snow exhibits high reflectance in the visible spectrum and low reflectance in the short-wave infrared spectrum, while clouds demonstrate higher short-wave infrared reflectance than snow. Lakes, conversely, exhibit low reflectance in the visible bands.
In this study, we investigate the factors influencing the performance of neural network algorithms for detecting clouds, snow, and lakes using Sentinel-2 remote sensing images. We propose a two-stage random forest algorithm for constructing a labeled dataset and then analyze the influencing factors of semantic segmentation networks from six perspectives: model architecture, encoder, learning rate adjustment strategy, loss function, input image size, and input bands. Section 2 introduces the research area and data. Section 3 introduces the six influencing factors of the neural network model. Section 4 details the experimental setup and parameter selection. Section 5 presents the results and analysis. Section 6 discusses the results of the work. Section 7 concludes this work.

2. Study Area and Data

2.1. Study Area

The Tibetan Plateau features lofty mountain ranges and expansive basins [21]. Renowned as the Earth’s third pole, this region is a critical locus for climate investigation. In this study, we focus on the northwestern part of the plateau, encompassing the Kailas, Karakoram, and Kunlun mountain ranges and the Qiangtang Plateau (Figure 1). This area, characterized by elevated terrain and frigid temperatures, exhibits substantial snow cover. It also hosts a profusion of glaciers and lakes, which serve as vital sources for major rivers. Consequently, it is an ideal locale for investigating semantic segmentation techniques applied to clouds, snow, and lakes, offering valuable insights into the plateau’s hydrological cycle and broader climatic implications.

2.2. Sentinel-2 Data

The Sentinel-2 satellite system is renowned for its high-resolution multi-spectral imaging capabilities and frequent revisits [22]. Equipped with a multispectral imager, Sentinel-2 captures imagery across 13 spectral bands with ground resolutions ranging from 10 to 60 m (Table 1). The spectral reflectance of clouds, snow, and lakes varies among these bands (Figure 2). As the wavelength increases, the spectral reflectance of all three features decreases, with snow showing a particularly sharp decline beyond 1 µm. This characteristic is advantageous for the segmentation of clouds, snow, and lakes. The Sentinel-2A and Sentinel-2B satellites have complementary revisit periods that together total 5 days. In this study, Level-2A data from both Sentinel-2A and Sentinel-2B were employed, which are corrected to bottom-of-atmosphere reflectance. The data, provided by the European Space Agency, were downloaded from the ONDA DIAS website (https://www.onda-dias.eu/cms/, accessed on 3 December 2023). Five images with varying degrees of cloud cover, ranging from 3% to 23%, were selected for analysis (Table 2). To prevent the impact of lake surface freezing on lake segmentation, images from May to September were chosen for the study.

2.3. Dataset Establishment

In semantic segmentation, the quality and quantity of the dataset play pivotal roles in model performance. A comprehensive dataset, characterized by both size and precision, significantly enhances training outcomes. Hence, establishing a robust dataset is a fundamental step preceding the exploration of factors influencing semantic segmentation.
(1) Image labeling
The process of annotating images typically involves visual interpretation, a meticulous and labor-intensive method. To reduce this burden, auxiliary algorithms such as band-threshold methods and machine learning techniques are often employed. However, conventional algorithms encounter challenges, particularly regarding spectral similarity and size variability among objects in remote-sensing images. This is exacerbated by the spectral resemblance between clouds and underlying features, which hinders accurate segmentation. To surmount this hurdle, we introduce a two-stage Random Forest (RF) algorithm: images are first partitioned into blocks to minimize the influence of spectral similarities, and RF is then applied to extract objects within each block. This method yields precise extraction results that are directly applicable as labels for semantic segmentation. The workflow for image label creation is depicted in Figure 3.
(1) Resampling: Bilinear interpolation was utilized to resample Sentinel-2 images, standardizing the resolution of all band images to 10 m. The false-color image, synthesized from bands 12-2-3, was selected for label creation, with snow depicted in blue and clouds in white (Figure 4).
(2) Image Partitioning with Rectangular Boxes: The false-color image (12-2-3) was partitioned into rectangular blocks using Labelme software (version 5.2.1). The location of each block was saved in a JSON file for subsequent processing.
(3) Object Extraction: A pixel-based RF classifier was deployed for target object classification within each block (Figure 5); a minimal code sketch of this step follows the list below. Samples were manually selected in each block using rectangular boxes and then used to train the RF classifier. Cloud shadows were manually distinguished based on their texture features and surrounding terrain characteristics. To ensure label accuracy, the four-class task encompassing clouds, snow, lakes, and background was decomposed into three binary classification tasks, one per target class, and each target object was extracted individually to facilitate precise label creation.
(4) Splicing: Upon label creation for all blocks within each category, they were seamlessly spliced together based on the retained position information in the JSON file. This process yielded the final remote-sensing image labels.
(5) Compositing: Following the creation of labels for clouds, snow, and lakes within each image, they were amalgamated into a composite label. In the resultant composite, background pixels were assigned a value of 0, while cloud, snow, and lake pixels were labeled as 1, 2, and 3, respectively (Figure 6).
(6) Inspection and correction: To ensure the quality of the labels, we manually corrected the output labels generated by the above process using Labelme software (version 5.2.1), deleting misclassified pixels and refining inaccurate boundaries.
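As a concrete illustration of step (3), the following is a minimal sketch of the per-block binary RF classification, assuming scikit-learn; the array layout and function name are illustrative rather than the authors' exact code.

```python
# Minimal sketch of the per-block RF classification in step (3), assuming
# scikit-learn; the array layout and function name are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify_block(block, train_pixels, train_labels, n_trees=100):
    """Binary classification of one rectangular image block.

    block        : (H, W, B) reflectance array for the block
    train_pixels : (N, B) spectra sampled manually inside this block
    train_labels : (N,) 0 = background, 1 = target (cloud, snow, or lake)
    """
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    rf.fit(train_pixels, train_labels)        # train on the block's samples
    h, w, b = block.shape
    pred = rf.predict(block.reshape(-1, b))   # classify every pixel in the block
    return pred.reshape(h, w).astype(np.uint8)
```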
(2) Semantic segmentation dataset
In this study, five Sentinel-2 remote sensing images were selected (Table 2). Four were allocated for training and validation, while one was reserved for testing. Given the substantial size of remote sensing images, inputting them directly into the neural network can strain GPU memory. To address this, both the images and their corresponding labels were cropped. To investigate the impact of size on segmentation performance, the four images designated for training and validation were randomly cropped to varying dimensions: 256 × 256, 512 × 512, 768 × 768, and 1024 × 1024 pixels. This process yielded 10,836, 2707, 1183, and 610 cropped images, respectively, which are sufficient for training the neural network, eliminating the need for additional data augmentation to increase the sample size. The training and validation sets were then partitioned in a 4:1 ratio. For the test image, regular grid cropping at the same scales was applied, yielding 1764, 441, 196, and 100 images, respectively. Following segmentation, the generated masks were reassembled into full scenes to assess overall predictive accuracy.
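The random cropping described above can be sketched as follows, assuming the image and its label mask are NumPy arrays; the function name and tile counts are illustrative:

```python
# Sketch of the random tile cropping used to build the training/validation
# set; image/label arrays and tile counts are illustrative.
import numpy as np

def random_crops(image, label, tile, n_tiles, seed=0):
    """Cut n_tiles random (tile x tile) patches from a (H, W, B) image
    and its (H, W) label mask."""
    rng = np.random.default_rng(seed)
    h, w = label.shape
    crops = []
    for _ in range(n_tiles):
        y = int(rng.integers(0, h - tile + 1))
        x = int(rng.integers(0, w - tile + 1))
        crops.append((image[y:y + tile, x:x + tile],
                      label[y:y + tile, x:x + tile]))
    return crops

# e.g., 512 x 512 tiles from one training scene (the counts in Section 2.3
# refer to all four training scenes combined)
# pairs = random_crops(scene, mask, tile=512, n_tiles=677)
```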

3. Factors Affecting the Performance of Neural Network Algorithm

Utilizing the semantic segmentation dataset of clouds, snow, and lakes, this study examines the influence of six pivotal factors on model performance: model architecture, encoder, learning rate adjustment strategy, loss function, input image size, and input bands (Figure 7). Each factor was explored systematically, with the parameters that performed best in the preceding step serving as the benchmark for subsequent adjustments. We did not explore all possible parameter combinations, as this would be excessively labor-intensive and would likely have minimal impact on the results. Additionally, we did not strictly distinguish between glaciers and snow; in this study, the term ‘snow’ refers to both.

3.1. Model Architecture

The application of neural networks in semantic segmentation relies heavily on effective model construction and architecture optimization. In this study, the impact of different model architectures on final performance was evaluated, encompassing U-Net [23], Deeplabv3+ [24], Feature Pyramid Network (FPN) [25], LinkNet [26], and Pyramid Attention Network (PAN) [27] (Table 3).

U-Net, renowned for its efficacy, integrates encoder and decoder feature layers via skip connections and facilitates the fusion of features from diverse receptive fields. This characteristic enables the retention of both low-level shallow features and high-level semantic features, rendering it particularly suitable for datasets with limited sample sizes.

FPN employs a feature pyramid to amalgamate information from various scales, adeptly managing objects of varying sizes and categories. By leveraging this pyramid structure, FPN effectively handles multi-scale features, enhancing segmentation accuracy.

DeepLabv3+ adopts an encoder-decoder architecture, with the encoder incorporating dilated convolutions to enlarge the receptive field without compromising information integrity. Furthermore, a spatial pyramid pooling module with dilated convolutions at the encoder’s end facilitates the fusion of multi-scale information, contributing to comprehensive feature representation.

LinkNet, inspired by U-Net’s architecture, modifies the connection mechanism between the encoder and decoder. By directly transmitting the encoder’s input and output to the decoder, LinkNet minimizes information loss, thereby enhancing segmentation performance.

PAN takes a distinctive approach to feature aggregation, consolidating feature maps from different levels to ensure maximal utilization of the information within each map. By mitigating information loss during cascading and preserving detailed information, PAN exhibits promising potential for semantic segmentation tasks.

Each model architecture offers distinct advantages tailored to specific segmentation requirements and dataset characteristics. By systematically evaluating these architectures, researchers can discern the most suitable model for a given task, thereby optimizing segmentation performance.
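The model and encoder names used in this paper coincide with those exposed by the open-source segmentation_models_pytorch library; assuming that library (the paper does not name its implementation), the five architectures can be instantiated as follows:

```python
# Hedged sketch: instantiating the five architectures with the
# segmentation_models_pytorch (smp) library, whose model and encoder names
# match those used in this paper. Not necessarily the authors' exact setup.
import segmentation_models_pytorch as smp

common = dict(
    encoder_name="resnet50",   # initial encoder setting (Section 4.2)
    encoder_weights=None,      # no pretraining assumed here
    in_channels=3,             # one 3-band combination, e.g. 12-2-3
    classes=4,                 # background, cloud, snow, lake
)

models = {
    "U-Net":      smp.Unet(**common),
    "DeepLabv3+": smp.DeepLabV3Plus(**common),
    "FPN":        smp.FPN(**common),
    "LinkNet":    smp.Linknet(**common),
    "PAN":        smp.PAN(**common),
}
```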

3.2. Encoder

The encoder-decoder architecture plays a pivotal role in semantic segmentation tasks, serving as the backbone for feature extraction and subsequent pixel-level classification. Understanding the characteristics and trade-offs of different encoders and depths is crucial for designing efficient and effective segmentation networks tailored to specific tasks and computational resources. We will analyze the influence of different encoder structures and depths on neural network performance in this work.
The encoder is responsible for extracting hierarchical feature representations from input images. In the context of semantic segmentation, it compresses the original image while retaining crucial semantic information. Contextual understanding, facilitated by an effective receptive field, is vital for accurate segmentation. Five commonly used encoders—Vgg19 [28], ResNet50 [29], ResNext50_32x4d [30], MobileNet_v2 [31], and Mit_b5 [32]—are selected to discuss the influence of encoder structure on neural network performance in this study.
Vgg19 comprises five convolutional blocks separated by max-pooling operations; each block consists of multiple convolution and ReLU operations. Vgg19 is relatively simple compared to the other architectures yet captures meaningful feature representations. ResNet50 introduces residual connections to address the vanishing gradient problem in deep networks. By allowing information to flow directly across layers, ResNet enhances convergence speed and accuracy, making it a popular choice for various computer vision tasks, including semantic segmentation. Building upon ResNet, ResNext50_32x4d incorporates group convolution to reduce the number of hyperparameters while maintaining accuracy. This architecture enhances parameter efficiency, making the model more suitable for devices with limited GPU memory. MobileNet_v2 employs depthwise separable convolution, which decomposes the convolution operation into depthwise and pointwise convolutions. This reduces the number of parameters and the computational complexity, making MobileNet_v2 particularly efficient for mobile and embedded applications. Mit_b5 merges the self-attention mechanism of the Transformer model [33] with the feature extraction capability of CNNs. By leveraging both global context information and local features, Mit_b5 enhances the accuracy and efficiency of semantic segmentation, especially in capturing long-range dependencies.
Encoder depth refers to the number of layers used for feature extraction. Deeper encoders can learn more complex features, allowing the network to capture intricate patterns and structures in the data. However, increasing depth also amplifies the number of parameters and computational complexity, potentially leading to longer training times. The ResNet model offers varying depths. ResNet18 (with the number “18” indicating the depth), ResNet34, ResNet50, ResNet101, and ResNet152 were selected to explore the impact of encoder depth on segmentation performance.
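As a rough illustration of how parameter count scales with depth, the following sketch (again assuming the segmentation_models_pytorch library) instantiates the FPN with each ResNet depth and counts its parameters:

```python
# Sketch: parameter count versus ResNet encoder depth (same smp assumption
# as above); useful for reproducing the depth comparison in Section 5.2.
import segmentation_models_pytorch as smp

for depth in (18, 34, 50, 101, 152):
    model = smp.FPN(encoder_name=f"resnet{depth}", encoder_weights=None,
                    in_channels=3, classes=4)
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"ResNet{depth}: {n_params:.2f} M parameters")
```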

3.3. Learning Rate Adjustment Strategy

The strategy for adjusting the learning rate plays a pivotal role in the effective training of neural networks. In the initial stages of training, employing a high learning rate enables the model to rapidly approach an optimal solution. However, an excessively large learning rate can lead to instability in the optimization process, thereby impeding the acquisition of an optimal solution. Gradually decreasing the learning rate facilitates more effective convergence toward either a global or local optimum while simultaneously reducing the risk of model overfitting.
In this study, we deployed four prevalent learning rate adjustment strategies to examine their influence on training outcomes. Let $e$ denote the number of completed training epochs and $lr_e$ the learning rate at epoch $e$. We initialize the learning rate at 0.001. The strategies and their corresponding learning rate curves are delineated below (Figure 8), with a code sketch following the list:
(1) Remaining constant: The learning rate remains unchanged throughout the training process. The curve is a straight line, as the learning rate does not change.
$lr_e^{(1)} = lr_{e-1}^{(1)}$
(2) Equally spaced decay: The learning rate decreases at equally spaced intervals.
$lr_e^{(2)} = 0.001 - 0.000064 \times \lfloor e/10 \rfloor$
The learning rate decreases in equal steps, from 0.001 to approximately 0.0001 by the end of training.
(3) Cosine decay: The learning rate follows a cosine function to decay smoothly over time.
$lr_e^{(3)} = 0.00045 \cos\left(\frac{\pi}{10} e\right) + 0.00055$
The learning rate follows a cosine function with a period of 20 epochs, decreasing and then increasing within each period.
(4) Exponential decay: The learning rate decays exponentially, reducing more rapidly in the initial stages.
$lr_e^{(4)} = 0.001 \times 0.99^{e}$
The learning rate decreases smoothly throughout the training.
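Assuming PyTorch (the framework used in Section 4.2), the four schedules can be expressed as LambdaLR multipliers of the initial rate 0.001, mirroring the equations above; the optimizer binding here is a placeholder sketch:

```python
# The four schedules as LambdaLR multipliers of the initial rate 0.001,
# following the equations above; the parameters bound to the optimizer
# are placeholders.
import math
import torch

LR0 = 0.001
schedules = {
    "remain_constant": lambda e: 1.0,
    "equally_spaced":  lambda e: (LR0 - 0.000064 * (e // 10)) / LR0,
    "cosine":          lambda e: (0.00045 * math.cos(math.pi * e / 10) + 0.00055) / LR0,
    "exponential":     lambda e: 0.99 ** e,
}

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder model parameters
optimizer = torch.optim.Adam(params, lr=LR0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, schedules["exponential"])

for epoch in range(150):
    # ... one epoch of training ...
    scheduler.step()  # update the learning rate once per epoch
```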

3.4. Loss Function

The loss function plays a pivotal role within neural networks, serving to quantify the disparity between the model’s predicted output and the actual label. Throughout the training process, minimizing the loss function becomes imperative to align the model’s predictions closely with the true labels. The selection of an appropriate loss function significantly influences the efficacy of training and the performance of the model. Cross-Entropy (CE) [34] loss function stands out as a widely employed choice for semantic segmentation tasks. Its definition is outlined as follows:
$L = -\frac{1}{N} \sum_{i} \sum_{c=1}^{M} y_{ic} \log\left(p_{ic}\right)$
where $N$ represents the number of pixels; $M$ represents the number of classes; $y_{ic}$ is an indicator that equals 1 if the true class of sample $i$ is $c$ and 0 otherwise; and $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$.
On the Tibetan Plateau, where mountains dominate the majority of the area, the sample distribution is markedly imbalanced: one or more classes are significantly underrepresented compared to others. This imbalance biases the loss function, leading to poor model performance, particularly on smaller classes such as clouds and lakes. To mitigate this challenge, the Weighted Cross-Entropy (WCE) loss function was introduced in this study. WCE incorporates weighting factors into the CE loss function, thereby rectifying the distribution of the loss. The formulation of WCE is as follows:
$L = -\frac{1}{N} \sum_{i} \sum_{c=1}^{M} w_c\, y_{ic} \log\left(p_{ic}\right)$
where $w_c$ is the weight assigned to class $c$. By assigning higher weights to underrepresented classes, WCE ensures that the model pays more attention to these classes during training, improving its performance on small samples. This weighted approach helps mitigate the dominance of the majority classes and enhances the overall accuracy of the semantic segmentation model.
Moreover, FocalLoss [35] offers a solution by incorporating a focus factor that diminishes the loss impact of well-classified examples, thereby directing more attention toward challenging-to-classify samples. This is accomplished by assigning a weighting factor that tends toward 0 for accurately classified samples and 1 for misclassified ones. Consequently, the loss value attributed to accurately classified samples decreases, while that of misclassified ones remains relatively unchanged. This approach effectively amplifies the influence of inaccurately classified samples within the loss function. The formulation of Focal Loss is presented as follows:
$L = -\frac{1}{N} \sum_{i} \sum_{c=1}^{M} w_c \left(1 - p_{ic}\right)^{\gamma} y_{ic} \log\left(p_{ic}\right)$
where $\gamma$ is a focusing parameter that adjusts the rate at which easy examples are down-weighted, and $w_c$ is the weight assigned to class $c$.
In this study, CE was employed as the benchmark to assess the efficacy of WCE and Focal Loss in tackling the data imbalance prevalent in cloud, snow, and lake segmentation. Through a comparative analysis of these loss functions, the objective is to investigate the potential of WCE and Focal Loss in alleviating the evaluation errors posed by imbalanced datasets and enhancing model performance.
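A minimal PyTorch sketch of the three losses follows; the class weights are illustrative, as the paper does not report the exact values used:

```python
# Sketch of the three losses in PyTorch. The class weights are illustrative,
# not the values used in the paper.
import torch
import torch.nn.functional as F

weights = torch.tensor([0.1, 1.0, 0.5, 1.0])  # background, cloud, snow, lake

ce_loss  = torch.nn.CrossEntropyLoss()                 # CE
wce_loss = torch.nn.CrossEntropyLoss(weight=weights)   # WCE

def focal_loss(logits, target, weight=weights, gamma=2.0):
    """Multi-class focal loss: down-weights well-classified pixels.
    logits: (N, C, H, W); target: (N, H, W) with integer class indices."""
    log_p = F.log_softmax(logits, dim=1)
    # per-pixel weighted cross-entropy, w_c * (-log p_t)
    ce = F.nll_loss(log_p, target, weight=weight, reduction="none")
    # probability assigned to the true class of each pixel
    p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1).exp()
    return ((1 - p_t) ** gamma * ce).mean()
```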

3.5. Different Input Image Size

The size of the input image significantly impacts the performance of the model. A larger input size equips the model with more detailed information, thereby enhancing its ability to generalize and capture finer details and features within the data. Consequently, this often results in more precise and resilient predictions. However, larger input image size also increases computational complexity and processing time. Conversely, if the input image size is too small, critical information may be lost, potentially leading to model overfitting and undermining the neural network’s accuracy and generalization capacity. Hence, selecting an appropriate input image size tailored to specific application scenarios is very important for achieving optimal performance while striking a balance between accuracy and computational efficiency. In this study, five remote sensing images are cropped into sizes of 256 × 256, 512 × 512, 768 × 768, and 1024 × 1024 to serve as inputs for the neural network.

3.6. Different Band Combinations

Satellite remote sensing images capture both the reflected electromagnetic wave information and the thermal radiation emitted by ground objects. Due to various structural, compositional, and physical-chemical properties, the reflection and thermal radiation patterns of different ground objects are distinct. For instance, water bodies typically exhibit low reflectivity within the wavelength range covered by most remote sensing sensors due to their efficient absorption of solar energy. Conversely, snow demonstrates high reflectivity in visible light (VIS) and low reflectivity in short-wave infrared (SWIR), facilitating its effective differentiation from clouds.
Sentinel-2 incorporates three SWIR bands (Table 1): Band 10 (1.375 µm), Band 11 (1.610 µm), and Band 12 (2.190 µm). Band 10 (referred to as B10 hereinafter, and similarly for the other bands) is designed for cirrus detection and records Top-of-Atmosphere (TOA) reflectance rather than surface reflectance, so it is omitted from L2A products. Consequently, this investigation concentrates on bands B11 and B12. These SWIR bands are combined with the visible bands B2 and B3 to produce false-color images, which are compared with the true-color image. Specifically, combinations such as bands 4-3-2, 12-2-3, and 11-2-3 are used to explore the impact of different band combinations on the segmentation of clouds, snow, and lakes (Figure 9).
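Band compositing can be sketched with rasterio as follows; the file names follow Sentinel-2 L2A naming conventions but are hypothetical here, and the 20 m SWIR band is bilinearly resampled to 10 m as in Section 2.3:

```python
# Sketch of building a 12-2-3 false-color composite with rasterio; the
# band file paths are hypothetical placeholders.
import numpy as np
import rasterio
from rasterio.enums import Resampling

def read_band(path, out_shape=None):
    """Read one single-band JP2, optionally resampling (e.g. 20 m SWIR -> 10 m)."""
    with rasterio.open(path) as src:
        if out_shape is None:
            return src.read(1)
        return src.read(1, out_shape=out_shape, resampling=Resampling.bilinear)

b2 = read_band("T44SKC_B02_10m.jp2")                        # blue, 10 m
b3 = read_band("T44SKC_B03_10m.jp2")                        # green, 10 m
b12 = read_band("T44SKC_B12_20m.jp2", out_shape=b2.shape)   # SWIR, resampled

composite = np.stack([b12, b2, b3], axis=-1)  # R-G-B channels: bands 12-2-3
```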

4. Experimental Settings and Evaluation Metrics

4.1. Overall Flowchart

Figure 10 presents the overall workflow of this study. Label creation is based on a two-stage RF algorithm, followed by the creation of a semantic segmentation dataset from Sentinel-2 data. Five neural networks, i.e., U-Net [23], Deeplabv3+ [24], Feature Pyramid Network (FPN) [25], LinkNet [26], and Pyramid Attention Network (PAN) [27], were selected to explore the factors affecting their performance. The influencing factors examined include model structure, encoder structure, encoder depth, learning rate adjustment, loss function, input bands, and input image size. Seven metrics are used to evaluate the performance of the neural networks: Overall Accuracy (OA), Class Pixel Accuracy (CPA), Mean Pixel Accuracy (MPA), Class Pixel Recall (CPR), Mean Pixel Recall (MPR), Intersection over Union (IoU), and Mean Intersection over Union (MIoU).

4.2. Experimental Settings

All experiments were conducted on an NVIDIA GeForce RTX 3080 GPU with 10 GB of memory, an Intel i5-12400F CPU, and 16 GB of RAM. The experiments were implemented on Windows 10 using CUDA 12.3 for GPU acceleration, the PyTorch 2.1.1 deep learning framework, and Python 3.11. The activation function used in the hidden layers is ReLU, while SoftMax is employed in the output layer. For training, the batch size was set to 4, the number of epochs to 150, and the initial learning rate to 0.001. The encoder was initially set to ResNet50, and the loss function used was CE. The input image size was 512 × 512 with the false-color band combination 12-2-3, and the Adam optimizer [36] was employed to optimize the network.
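Pulling these settings together, a sketch of the training configuration might look as follows (assuming the smp-based model from Section 3.1; the data loader and loop are placeholders, not the authors' exact code):

```python
# Sketch of the Section 4.2 training configuration; the dataset object and
# loop are placeholders, assuming the smp-based FPN from Section 3.1.
import torch
import segmentation_models_pytorch as smp

device = "cuda" if torch.cuda.is_available() else "cpu"
model = smp.FPN(encoder_name="resnet50", encoder_weights=None,
                in_channels=3, classes=4).to(device)
criterion = torch.nn.CrossEntropyLoss()                     # CE baseline
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial LR

# train_set: 512 x 512 crops of the 12-2-3 composites (placeholder object)
# loader = torch.utils.data.DataLoader(train_set, batch_size=4, shuffle=True)
# for epoch in range(150):
#     for x, y in loader:
#         optimizer.zero_grad()
#         loss = criterion(model(x.to(device)), y.to(device))
#         loss.backward()
#         optimizer.step()
```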

4.3. Evaluation Metrics

The detection results were quantitatively evaluated using seven indicators (Table 4): OA, CPA, MPA, CPR, MPR, IoU, and MIoU.
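Assuming CPA and CPR denote per-class precision and recall, respectively (consistent with their names; Table 4 gives the formal definitions), all seven metrics can be computed from a single confusion matrix, as in this sketch:

```python
# Sketch: the seven metrics from one confusion matrix, where cm[i, j] counts
# pixels of reference class i predicted as class j (class 0 = background).
# CPA is treated here as per-class precision and CPR as per-class recall.
import numpy as np

def seg_metrics(cm):
    cm = cm.astype(float)
    tp = np.diag(cm)
    oa = tp.sum() / cm.sum()                            # Overall Accuracy
    cpa = tp / cm.sum(axis=0)                           # Class Pixel Accuracy
    cpr = tp / cm.sum(axis=1)                           # Class Pixel Recall
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)   # per-class IoU
    return {"OA": oa, "CPA": cpa, "MPA": cpa.mean(),
            "CPR": cpr, "MPR": cpr.mean(),
            "IoU": iou, "MIoU": iou.mean()}
```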

5. Results and Analysis

5.1. Deep Learning Model Structures

Table 5 presents the evaluation metrics for the U-Net, Deeplabv3+, FPN, LinkNet, and PAN models on the semantic segmentation dataset of clouds, snow, and lakes. All models achieve a high OA, exceeding 96%. However, the MPA is notably lower than the OA, with a difference of up to 15% for the LinkNet model. This discrepancy is mostly due to the imbalance in the data categories: the background category comprises a significantly larger number of pixels than the target categories. Consequently, even if the segmentation accuracy for the target categories is very low, the overall accuracy remains high, indicating that OA is not a reliable metric for evaluating performance on imbalanced datasets. Additionally, the MPR for all models is above 95%, suggesting that few pixels are missed across categories. However, the large gap between MPR and MPA highlights the data imbalance issue: categories with more instances are usually identified correctly, whereas those with fewer instances are segmented less accurately. Among all the models, LinkNet exhibited the lowest MIoU at 79.61%, while FPN achieved the highest MIoU at 85.69%, demonstrating the best performance.
Figure 11 presents the CPA, CPR, and IoU metrics of the five models for detecting clouds, snow, and lakes. The metrics for snow and lakes are nearly all above 90%, indicating very high segmentation accuracy for these two categories. However, the cloud IoU of most models is below 60%. LinkNet’s cloud IoU was the lowest at only 30%, significantly reducing its MIoU; in contrast, FPN’s cloud IoU was the highest, at 60%. The poor cloud segmentation performance of all models arises because a large number of pixels from other categories are misclassified as clouds.
The prediction outcomes of each model are depicted in Figure 12. Notably, LinkNet exhibits pronounced misclassification of clouds, with a substantial number of background pixels erroneously categorized as clouds. This misclassification directly undermines LinkNet’s cloud segmentation accuracy (low IoU). Structurally, LinkNet reduces subsampling operations in the encoder to retain spatial information, yet this also preserves excessive noise during training, leading to significant disruptions in cloud recognition. In contrast, FPN demonstrates superior segmentation performance compared to U-Net, Deeplabv3+, PAN, and LinkNet, owing to its simpler structure and effective use of a pyramid structure to merge shallow (small receptive field) and deep (large receptive field) features. Simple target objects, such as clouds, snow, and lakes, can be effectively detected using spectral and texture features without relying solely on complex semantic features; consequently, opting for a lightweight network like FPN may yield superior results. The large size variations among ground objects require a network capable of producing outputs with varying receptive field sizes; therefore, the feature pyramid structure is particularly well suited for segmenting clouds, snow, and lakes.

5.2. Encoder Type and Depth

Based on the above conclusions, FPN is identified as the best network structure among the selected models. Therefore, the following section will discuss the influence of the encoder on the performance of the FPN model.
(1) Encoder structure analysis
We selected five encoders, i.e., Vgg19, ResNet50, ResNeXt50_32x4d, MobileNet_v2, and Mit_b5, to analyze their impact on model performance (Table 6). ResNet50, ResNeXt50_32x4d, and MobileNet_v2, which employ residual structures, exhibited approximately 20% higher MIoU than Vgg19, which does not. This underscores the effectiveness of residual connections. In terms of parameter count, MobileNet_v2 significantly reduces the number of parameters (only 4.22 million (M)) and the computational load thanks to its depthwise separable convolution structure, making it well suited to small datasets and relatively simple segmentation tasks. In contrast, Mit_b5 has a substantial number of parameters (83.33 M) due to its extensive use of multi-head self-attention; this large parameter count requires a considerable amount of data for full training, which is a key factor in Mit_b5’s poor performance (an MIoU of only 72.59%) in this experiment.
In summary, the residual connections in ResNet50, ResNeXt50_32x4d, and MobileNet_v2 significantly improve model performance, as evidenced by higher MIoU scores. MobileNet_v2, with its efficient depthwise separable convolution structure, stands out for its balance of performance and computational efficiency. Mit_b5, despite its potential for capturing complex features through multi-head self-attention, requires more extensive data for optimal performance than was available in this experiment.
(2) Encoder depth analysis
As discussed above, the MobileNet_v2 encoder performed best among the tested encoders. However, because MobileNet_v2 is not available at varying depths, ResNet, whose performance was comparable, was chosen for the analysis of encoder depth. ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152 were chosen to explore the impact of layer depth on feature extraction. Figure 13 shows how the model performs as the encoder depth increases. The number of parameters rises with the number of network layers. ResNet50 has the highest MIoU at 85.69%, and ResNet152 the lowest at 84.12%, a difference of 1.57%. This indicates that although residual connections can suppress ineffective convolutional layers, the suppression has its limits, and an optimal number of layers exists. Below the optimal depth, features cannot be fully extracted; beyond it, overfitting occurs. Thus, selecting an appropriate encoder depth is crucial for optimal performance.

5.3. Learning Rate Adjustment Strategy

MobileNet_v2 was used as the encoder for the subsequent experiments, and four learning rate adjustment strategies were applied: remain constant, cosine decay, equally spaced decay, and exponential decay. These strategies were tested with the Adam optimizer. The metrics of the different strategies are shown in Table 7. Exponential decay achieved the highest MIoU (87.14%), while the strategy with no attenuation (remain constant) resulted in the lowest MIoU (86.07%), a difference of 1.07%.
Figure 14 illustrates how the training loss (i.e., the value of the loss function) changes as the epoch increases. Equally spaced decay and exponential decay yield the most stable training processes, reaching the lowest losses by the end of training. Conversely, the training losses associated with the remain-constant and cosine-decay strategies exhibit considerable fluctuations. These fluctuations are likely attributable to an excessive learning rate, which causes the model parameters to oscillate around the extrema. This suggests that, despite the Adam optimizer’s adaptive learning rate adjustment, applying learning rate decay remains essential for achieving optimal performance.

5.4. Loss Function and Data Imbalance

In this study, CE, WCE, and Focal Loss were used as loss functions to assess their effects on the imbalanced dataset, with CE serving as the standard for comparison. Table 8 presents the metrics of the models trained with different loss functions. Notably, the MPA values for WCE and Focal Loss are lower than that for CE, while the MPR values are higher. Additionally, the MIoU for WCE and Focal Loss is approximately 3% lower than that for CE. This decrease can be attributed to the weight factors in WCE and the focusing factor in Focal Loss: these adjustments reduce the emphasis on the dominant classes and can thereby improve recall for the target categories, but the improvement in MPR is relatively small. At the same time, WCE and Focal Loss lead to more misclassifications of other object types as target ground objects, which significantly diminishes the MPA; consequently, the MIoU decreases noticeably. In summary, neither WCE nor Focal Loss achieved the objective of enhancing segmentation accuracy on the imbalanced dataset. Despite attempts to address class imbalance through modified loss functions, the overall performance remains suboptimal compared to traditional Cross-Entropy loss.

5.5. Input Image Size

The input image size is vital for optimizing model performance (Table 9). The 512 × 512 size performs best, achieving an MIoU of 87.14%, while the remaining sizes hovered around 85–86%. Generally, larger image sizes offer the model more information, aiding pixel category prediction. However, distant information may be irrelevant and can interfere with the model’s classification judgment, which likely explains why the 1024 × 1024 size yielded lower prediction accuracy than the 512 × 512 size. Additionally, the size of the model’s receptive field is crucial: the study uses a receptive field of 512 × 512, which may also contribute to the superior performance of the 512 × 512 images.
For segmentation accuracy across different categories, the 512 × 512 image size excelled in cloud and snow segmentation, likely due to its ability to capture relevant details effectively. However, for lakes, the 1024 × 1024 input size demonstrated superior accuracy. This suggests that larger objects, such as lakes, require a more comprehensive representation. The 1024 × 1024 size provides the network with more relevant information, which the 512 × 512 size image may not fully capture.
In conclusion, selecting the appropriate input size involves considering the size of the target object. Input sizes that are too small may hinder the model’s ability to capture all features accurately, while excessively large sizes may introduce noise, reducing prediction accuracy. Moreover, different input sizes impose varying requirements on the model’s receptive field, necessitating the selection of a suitable network structure tailored to the input size.

5.6. Band Combinations

In this experiment, images synthesized from bands 4-3-2 (true color), 11-3-2 (false color), and 12-3-2 (false color) were input into the neural network; the results are shown in Table 10. The MIoU for bands 12-3-2 was the highest at 87.14%, followed closely by bands 11-3-2 at 86.65%; the MIoU for bands 4-3-2 was the lowest at 82.28%. Since the 12-3-2 combination was selected for the initial settings, the parameters were tuned accordingly, which largely explains the slightly higher MIoU for 12-3-2 than for 11-3-2; the two SWIR wavelengths (bands 11 and 12) otherwise present only slight discrepancies. By contrast, there was a significant difference between the 12-3-2 and 4-3-2 results. The IoU for lakes was similar in both cases, at approximately 95%, but the IoU for clouds and snow in the 4-3-2 images was significantly lower, by 7.82% and 11.63%, respectively, than in the 12-3-2 images.
Figure 15 shows images synthesized from different bands along with their predicted segmentation outputs for clouds, snow, and lakes. Notably, with the true-color image (bands 4-3-2), there is significant confusion between clouds and snow: central regions of snow are often erroneously detected as clouds, while cloud edges are mistakenly identified as snow. Moreover, background pixels are frequently misclassified as snow, and lakes are often incorrectly classified as snow owing to precipitated salt crystals or sand whose spectral and textural properties resemble snow. The incorporation of the SWIR bands (bands 11 and 12) substantially aids in distinguishing between clouds and snow.

6. Discussion

Clouds and underlying surfaces often share similar spectral characteristics, posing challenges for accurate cloud recognition due to their susceptibility to noise interference. Effective noise reduction is critical to enhancing precision in cloud recognition. Lakes, characterized by low reflectance in the visible light range, are relatively straightforward to segment. In contrast, clouds and snow present difficulties due to their similar spectral and textural features in the visible light range. To mitigate this, we introduce the SWIR bands, which significantly enhance discrimination between clouds and snow. Notably, different SWIR bands yield similar segmentation results. Some post-processed features, such as Normalized difference vegetation index (NDVI), Normalized difference water index (NDWI), and Normalized difference snow index (NDSI), may offer more direct guidance for the segmentation of snow and clouds. However, we chose not to use them as inputs, as they would interfere with the performance evaluation of the original bands in neural networks, which is the primary objective of this study. Given the considerable size variations among clouds, snow, and lakes, as well as variations within each category, multi-scale feature fusion becomes imperative in model simulation. The FPN employed in this study effectively integrates features across diverse receptive field sizes, facilitating accurate identification of objects at varying scales. Consequently, this model exhibits enhanced performance in the recognition of clouds, snow, and lakes.
The multi-head self-attention mechanism of the Transformer, employed by the Mit_b5 encoder, is currently a focal point of research. This mechanism excels at capturing global information, a task that CNNs struggle with. However, in our study of segmenting clouds, snow, and lakes, the Transformer did not perform as effectively. Two primary reasons account for this discrepancy. First, the Transformer architecture has a large number of parameters and therefore requires a large dataset for effective training; the amount of data in our study did not meet this requirement. Second, the segmentation tasks in our study are relatively straightforward, relying primarily on local information rather than extensive global context. Overemphasis on global information may introduce noise, thereby reducing the accuracy of model predictions.
The introduction of residual structures in encoders such as ResNet50, ResNext50, and MobileNet_v2 significantly accelerates model convergence and enhances prediction accuracy compared to direct convolution (i.e., Vgg19) (Table 6). Incorporating a depthwise-separable structure within the residual block not only reduces the number of parameters but also improves prediction accuracy. More parameters do not necessarily equate to better performance; simple tasks often achieve superior results with streamlined structures and fewer parameters. In the experiment on encoder depth, MIoU initially increased with deeper encoders and more parameters but then plateaued and eventually decreased (Figure 13). This indicates that large parameter counts and deep encoder layers (i.e., large receptive fields) are not universally beneficial; the optimal configuration depends on the specific task. While the Adam optimizer dynamically adjusts learning rates using gradient statistics (first and second moments), our experiment reveals the necessity of incorporating learning rate decay to achieve superior results (Table 7). Learning rate decay aids the model in converging more effectively toward global or local optima, thereby enhancing prediction accuracy. Efforts to address data imbalance in cloud, snow, and lake segmentation using WCE and Focal Loss did not yield the anticipated improvements. While WCE and Focal Loss alleviate the challenges of target classification, they introduce additional noise, leading to decreased prediction accuracy. This result may be attributed to suboptimal settings of the weight and focusing factors or to inherent differences in multi-class problems; in binary classification tasks, WCE and Focal Loss might perform better.
Regarding the setting of the input image size, the experiments were conducted with optimal model parameters based on a receptive field size of 512 × 512. Consequently, it is expected that the 512 × 512 image size performs best. Generally, the choice of input image size should align with the size of the target objects. A too-small image size may limit the model’s ability to learn effective features, whereas an excessively large size can introduce unnecessary noise that hampers training. Ideally, with sufficient video memory, larger input image sizes are beneficial, as they provide more information for the model to use when predicting pixels. However, not all of this additional information is useful; some might introduce noise that interferes with predictions. Adjusting the model’s structure helps define an appropriate receptive field for the task, thereby specifying the scope of reference information used in predicting pixel categories, and mitigating noise interference enhances overall prediction accuracy.
The performance of the model is influenced by multiple factors, each impacting simulation results to varying degrees. The model structure, encoder choice, and input bands notably affect performance, with differences in MIoU indices exceeding 5% across different parameter configurations. In contrast, factors such as learning rate decay strategy, loss function selection, and input image size have a relatively minor impact on performance. These findings provide valuable insights for optimizing model parameters, offering a useful reference for future research and applications in semantic segmentation.
Few previous studies have comprehensively evaluated the factors influencing neural network performance in detecting clouds, snow, and lakes. Qiu et al. (2019) [3] employed the Function of mask (Fmask) 4.0 algorithm to detect clouds and cloud shadows, which improved the overall segmentation accuracy. However, this approach is limited to small-scale regions and is less effective for large-scale areas like the Tibetan Plateau. Zhong et al. (2022) [14] explored using a neural network to extract lake water bodies from public datasets. We believe that public datasets may not be suitable for all regions, and creating labels specific to the study area will likely enhance model performance. Improving the performance of neural networks through structural innovation is a primary objective in the field of semantic segmentation [17,19,20]. Xia et al. (2020) [9] proposed a novel multi-scale feature extraction structure, demonstrating its effectiveness and crucial role in the extraction of clouds, snow, and lakes, which is consistent with the findings of this study. Transformers combined with CNNs have proven effective in semantic segmentation tasks in previous studies [15]. However, our research yielded contrary results. This discrepancy may be attributed to the dataset not being large enough, among other factors that warrant further exploration in future research.

7. Conclusions

This study investigates the factors influencing neural network algorithm performance in detecting clouds, snow, and lakes in Sentinel-2 images. Key contributions and findings include the following:
(1) A two-stage classification algorithm based on spectral features was developed using Sentinel-2 images. This algorithm effectively reduces the interference from spectrally similar pixels between classes, ensuring accurate pixel classification.
(2) FPN was identified as the most effective model structure due to its ability to integrate features from different receptive field sizes, improving the recognition of objects with varying scales. Encoders with residual structures, such as ResNet50 and MobileNet_v2, showed better performance. The introduction of the SWIR band significantly improved the distinction between clouds and snow. Larger input image sizes did not necessarily yield better results due to potential noise interference.
(3) Model architecture, encoder type, and band combination were found to have the most substantial impact on performance, with variations in these parameters resulting in MIoU differences of over 5%. The learning rate decay strategy, loss function, and input image size also influence performance, though their impact is comparatively small.
While this study primarily investigates factors influencing semantic segmentation of clouds, snow, and lakes in plateau mountain regions, its findings provide insights applicable to various surface objects in remote sensing images. The conclusions offer valuable guidance for optimizing neural network parameters and structures tailored to remote sensing image segmentation. These insights are particularly pertinent for applications dealing with substantial object-scale variations and data imbalance.

Author Contributions

Conceptualization, Z.S.; Methodology, K.H.; software, K.H.; validation, K.H. and Z.S.; formal analysis, K.H.; investigation, K.H.; resources, K.H.; data curation, H.W., C.Y., Y.X. and L.T.; writing—original draft preparation, K.H.; writing—review and editing, Z.S.; visualization, K.H.; supervision, Z.S.; project administration, Z.S.; funding acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (42301049) and the Natural Science Foundation of Sichuan Province (2022NSFSC1031, 2024NSFSC0802).

Data Availability Statement

Sentinel-2 data were obtained from https://catalogue.onda-dias.eu/catalogue/ (accessed on 3 December 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  2. Toming, K.; Kutser, T.; Laas, A.; Sepp, M.; Paavel, B.; Nõges, T. First Experiences in Mapping Lake Water Quality Parameters with Sentinel-2 MSI Imagery. Remote Sens. 2016, 8, 640. [Google Scholar] [CrossRef]
  3. Qiu, S.; Zhu, Z.; He, B. Fmask 4.0: Improved cloud and cloud shadow detection in Landsats 4–8 and Sentinel-2 imagery. Remote Sens. Environ. 2019, 231, 111205. [Google Scholar] [CrossRef]
  4. Quesada-Román, A.; Umaña-Ortíz, J.; Zumbado-Solano, M.; Islam, A.; Abioui, M.; Tefogoum, G.Z.; Kariminejad, N.; Mutaqin, B.W.; Pupim, F. Geomorphological regional mapping for environmental planning in developing countries. Environ. Dev. 2023, 48, 100935. [Google Scholar] [CrossRef]
  5. Painter, T.H.; Berisford, D.F.; Boardman, J.W.; Bormann, K.J.; Deems, J.S.; Gehrke, F.; Hedrick, A.; Joyce, M.; Laidlaw, R.; Marks, D.; et al. The Airborne Snow Observatory: Fusion of scanning lidar, imaging spectrometer, and physically-based modeling for mapping snow water equivalent and snow albedo. Remote Sens. Environ. 2016, 184, 139–152. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277. [Google Scholar] [CrossRef]
  7. Zhang, Y.; Song, Y.; Ye, C.; Liu, J. An integrated approach to reconstructing snow cover under clouds and cloud shadows on Sentinel-2 Time-Series images in a mountainous area. J. Hydrol. 2023, 619, 129264. [Google Scholar] [CrossRef]
  8. Rodriguez, L.K.; Polus, S.M.; Matuszak, D.I.; Domka, M.R.; Hanly, P.J.; Wang, Q.; Soranno, P.A.; Cheruvelil, K.S. LAGOS-US RESERVOIR: A database classifying conterminous U.S. lakes 4 ha and larger as natural lakes or reservoir lakes. Limnol. Oceanogr. Lett. 2023, 8, 267–285. [Google Scholar] [CrossRef]
  9. Xia, M.; Li, Y.; Zhang, Y.; Weng, L.; Liu, J. Cloud/snow recognition of satellite cloud images based on multiscale fusion attention network. J. Appl. Remote Sens. 2020, 14, 032609. [Google Scholar] [CrossRef]
  10. Chang, H.; Fan, X.; Huo, L.; Hu, C. Improving Cloud Detection in WFV Images Onboard Chinese GF-1/6 Satellite. Remote Sens. 2023, 15, 5229. [Google Scholar] [CrossRef]
  11. Mahajan, S.; Fataniya, B. Cloud detection methodologies: Variants and development—A review. Complex Intell. Syst. 2020, 6, 251–261. [Google Scholar] [CrossRef]
  12. Jin, D.; Lee, K.-S.; Choi, S.; Seong, N.-H.; Jung, D.; Sim, S.; Woo, J.; Jeon, U.; Byeon, Y.; Han, K.-S. An improvement of snow/cloud discrimination from machine learning using geostationary satellite data. Int. J. Digit. Earth 2022, 15, 2355–2375. [Google Scholar] [CrossRef]
  13. Chai, D.; Newsam, S.; Zhang, H.K.; Qiu, Y.; Huang, J. Cloud and cloud shadow detection in Landsat imagery based on deep convolutional neural networks. Remote Sens. Environ. 2019, 225, 307–316. [Google Scholar] [CrossRef]
  14. Zhong, H.-F.; Sun, Q.; Sun, H.-M.; Jia, R.-S. NT-Net: A Semantic Segmentation Network for Extracting Lake Water Bodies From Optical Remote Sensing Images Based on Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  15. Hu, K.; Zhang, E.; Xia, M.; Weng, L.; Lin, H. MCANet: A Multi-Branch Network for Cloud/Snow Segmentation in High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 1055. [Google Scholar] [CrossRef]
  16. Guo, J.; Yang, J.; Yue, H.; Tan, H.; Hou, C.; Li, K. CDnetV2: CNN-Based Cloud Detection for Remote Sensing Imagery With Cloud-Snow Coexistence. IEEE Trans. Geosci. Remote Sens. 2021, 59, 700–713. [Google Scholar] [CrossRef]
  17. Xia, M.; Qu, Y.; Lin, H. PANDA: Parallel asymmetric network with double attention for cloud and its shadow detection. J. Appl. Remote Sens. 2021, 15, 046512. [Google Scholar] [CrossRef]
  18. Chen, F. Comparing Methods for Segmenting Supra-Glacial Lakes and Surface Features in the Mount Everest Region of the Himalayas Using Chinese GaoFen-3 SAR Images. Remote Sens. 2021, 13, 2429. [Google Scholar] [CrossRef]
  19. Hu, K.; Li, M.; Xia, M.; Lin, H. Multi-Scale Feature Aggregation Network for Water Area Segmentation. Remote Sens. 2022, 14, 206. [Google Scholar] [CrossRef]
  20. Ding, L.; Xia, M.; Lin, H.; Hu, K. Multi-Level Attention Interactive Network for Cloud and Snow Detection Segmentation. Remote Sens. 2024, 16, 112. [Google Scholar] [CrossRef]
  21. Wu, J.; Wang, G.; Chen, W.; Pan, S.; Zeng, J. Terrain gradient variations in the ecosystem services value of the Qinghai-Tibet Plateau, China. Glob. Ecol. Conserv. 2022, 34, e02008. [Google Scholar] [CrossRef]
  22. Lu, P.; Shi, W.; Wang, Q.; Li, Z.; Qin, Y.; Fan, X. Co-seismic landslide mapping using Sentinel-2 10-m fused NIR narrow, red-edge, and SWIR bands. Landslides 2021, 18, 2017–2037. [Google Scholar] [CrossRef]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  24. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
25. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
26. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar] [CrossRef]
  27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  28. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
30. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
  31. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  34. Mao, A.; Mohri, M.; Zhong, Y. Cross-Entropy Loss Functions: Theoretical Analysis and Applications. arXiv 2023, arXiv:2304.07288. [Google Scholar]
35. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
  36. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. General situation of the research area. The red box marks the location of the five remote-sensing images used in this work.
Figure 2. Spectral reflectance of clouds, snow, and lakes in Sentinel-2 images.
Figure 3. The flowchart for label creation.
Figure 4. False-color image (12-2-3) partitioning results: (a) Cloud; (b) Snow; (c) Lake. The rectangles of different colors and sizes represent the segmentation results.
Figure 5. RF classification results: (a) Cloud (white in false-color image) and its mask (red); (b) Snow (blue in false-color image) and its mask (white); (c) Lake (green in false-color composition) and its mask (blue).
Figure 6. False-color images containing clouds, snow, and lakes (first row) and the corresponding labels (second row).
Figure 7. Factors affecting the performance of the neural network algorithm.
Figure 8. Learning rate change curve.
Figure 9. Composites of different bands. Bands 4-3-2 are the true-color image; bands 11-2-3 are a false-color image incorporating the SWIR band centered at 1.610 µm; bands 12-2-3 are a false-color image incorporating the SWIR band centered at 2.190 µm.
Figure 10. Overall flowchart of the study.
Figure 11. The CPA, CPR, and IoU metrics from five models for detecting clouds, snow, and lakes.
Figure 12. Segmentation results of clouds, snow, and lakes by different models.
Figure 13. The MIoU metric and the number of parameters for ResNet encoders at different depths.
Figure 14. Training loss as the number of epochs increases.
Figure 15. Images synthesized using different bands, along with their predicted segmentation outputs for clouds, snow, and lakes. (1) True-color images (bands 4-3-2); (2) false-color images (bands 12-3-2); (3) labels; (4) prediction results of (1); (5) prediction results of (2). Pink boxes indicate areas requiring special attention.
Table 1. Band information of Sentinel-2 data.

| Bands | Band Name | Central Wavelength (µm) | Resolution (m) |
|---|---|---|---|
| Band 1 | Coastal aerosol | 0.443 | 60 |
| Band 2 | Blue | 0.490 | 10 |
| Band 3 | Green | 0.560 | 10 |
| Band 4 | Red | 0.665 | 10 |
| Band 5 | Vegetation Red Edge | 0.705 | 20 |
| Band 6 | Vegetation Red Edge | 0.740 | 20 |
| Band 7 | Vegetation Red Edge | 0.783 | 20 |
| Band 8 | NIR | 0.842 | 10 |
| Band 8A | Vegetation Red Edge | 0.865 | 20 |
| Band 9 | Water vapor | 0.945 | 60 |
| Band 10 | SWIR (Cirrus) | 1.375 | 60 |
| Band 11 | SWIR | 1.610 | 20 |
| Band 12 | SWIR | 2.190 | 20 |
Table 2. Sentinel-2 data used in this study.

| Date | Satellite | Strip Number | Cloud Cover Rate (%) | Usage |
|---|---|---|---|---|
| 2019.5.10 | Sentinel-2A | R105_T44SLC | 16.77 | Training and validation |
| 2019.7.31 | Sentinel-2B | R062_T44SPE | 22.07 | Training and validation |
| 2019.9.4 | Sentinel-2A | R062_T44SMB | 16.80 | Training and validation |
| 2019.9.4 | Sentinel-2A | R062_T44SND | 3.61 | Testing |
| 2019.9.27 | Sentinel-2A | R105_T44SLD | 16.92 | Training and validation |
Table 3. Five models used in this study.

| Time | Model | Main Structural Features | Advantages |
|---|---|---|---|
| 2015 | U-Net [23] | U-shaped structure, skip connections | Simple structure, suitable for small datasets |
| 2016 | FPN [25] | Feature pyramid structure | Efficient integration of multi-scale features |
| 2017 | LinkNet [26] | Direct connection between encoder and decoder | Fewer parameters, faster processing speed |
| 2018 | Deeplabv3+ [24] | Dilated convolution, Atrous Spatial Pyramid Pooling | Enlarges the receptive field without information loss; combines multi-scale information effectively |
| 2019 | PAN [27] | Feature Pyramid Enhancement Module, Feature Fusion Module | Blends feature maps from different levels, retains more detailed information |
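
The paper does not publish its implementation, but all five architectures in Table 3 are available in the open-source segmentation_models_pytorch library. The sketch below is therefore a minimal, assumed setup rather than the authors' code; the three input channels and four output classes mirror the experimental design of this study.

```python
import segmentation_models_pytorch as smp

# The five decoder architectures from Table 3, as named in the
# segmentation_models_pytorch library (an assumed implementation,
# not necessarily the one used by the authors).
ARCHITECTURES = {
    "U-Net": smp.Unet,
    "FPN": smp.FPN,
    "LinkNet": smp.Linknet,
    "Deeplabv3+": smp.DeepLabV3Plus,
    "PAN": smp.PAN,
}

def build_model(name: str, encoder: str = "resnet50"):
    """Build one of the five models with a shared encoder backbone."""
    return ARCHITECTURES[name](
        encoder_name=encoder,        # e.g., "resnet50", "mobilenet_v2", "mit_b5"
        encoder_weights="imagenet",  # ImageNet-pretrained backbone
        in_channels=3,               # one three-band composite per sample
        classes=4,                   # background, cloud, snow, lake
    )

model = build_model("FPN")           # FPN scored the highest MIoU in Table 5
```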
Table 4. Evaluation metrics. Here $p_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$, classes are indexed $0, \dots, M$, and $N$ is the total number of pixels.

| Metrics | Formula | Evaluation Focus |
|---|---|---|
| OA | $\mathrm{OA}=\frac{\sum_{i=0}^{M} p_{ii}}{N}$ | Proportion of correctly classified pixels among all pixels. |
| CPA | $\mathrm{CPA}_i=\frac{p_{ii}}{\sum_{j=0}^{M} p_{ij}}$ | Measures misclassification of a specific category by the model; a higher CPA indicates fewer misclassifications. |
| MPA | $\mathrm{MPA}=\frac{1}{M}\sum_{i=0}^{M}\frac{p_{ii}}{\sum_{j=0}^{M} p_{ij}}$ | Average accuracy over all categories, indicating the overall misclassification of the model. |
| CPR | $\mathrm{CPR}_i=\frac{p_{ii}}{\sum_{j=0}^{M} p_{ji}}$ | Measures the leakage (missed pixels) of the model; a higher CPR indicates fewer leakages. |
| MPR | $\mathrm{MPR}=\frac{1}{M}\sum_{i=0}^{M}\frac{p_{ii}}{\sum_{j=0}^{M} p_{ji}}$ | Average recall over all categories, indicating the overall leakage of the model. |
| IoU | $\mathrm{IoU}_i=\frac{p_{ii}}{\sum_{j=0}^{M} p_{ji}+\sum_{j=0}^{M} p_{ij}-p_{ii}}$ | Comprehensive evaluation of the prediction, considering both misses and false positives. |
| MIoU | $\mathrm{MIoU}=\frac{1}{M}\sum_{i=0}^{M}\frac{p_{ii}}{\sum_{j=0}^{M} p_{ji}+\sum_{j=0}^{M} p_{ij}-p_{ii}}$ | Average IoU over all categories, reflecting the overall prediction quality of the model. |
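
As a concrete reading of the formulas in Table 4, the following sketch computes all seven metrics from a single confusion matrix; it is an illustrative implementation in the table's notation, not the authors' evaluation code.

```python
import numpy as np

def metrics_from_confusion(C: np.ndarray) -> dict:
    """Compute the Table 4 metrics from a confusion matrix C, where
    C[i, j] = p_ij counts pixels of class i predicted as class j."""
    N = C.sum()
    tp = np.diag(C).astype(float)          # p_ii
    row = C.sum(axis=1)                    # sum_j p_ij for each class i
    col = C.sum(axis=0)                    # sum_j p_ji for each class i
    cpa = tp / row                         # per-class accuracy (CPA)
    cpr = tp / col                         # per-class recall (CPR)
    iou = tp / (row + col - tp)            # per-class IoU
    return {"OA": tp.sum() / N,
            "CPA": cpa, "MPA": cpa.mean(),
            "CPR": cpr, "MPR": cpr.mean(),
            "IoU": iou, "MIoU": iou.mean()}

# Toy 2-class example: 90 background pixels and 10 cloud pixels.
C = np.array([[85, 5],
              [2, 8]])
print(metrics_from_confusion(C)["MIoU"])
```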
Table 5. Evaluation metrics from five models.

| Model | OA (%) | MPA (%) | MPR (%) | MIoU (%) |
|---|---|---|---|---|
| U-Net | 97.53 | 84.82 | 96.28 | 82.76 |
| FPN | 98.30 | 88.14 | 96.15 | 85.69 |
| LinkNet | 96.20 | 81.77 | 96.26 | 79.61 |
| Deeplabv3+ | 97.87 | 86.42 | 95.59 | 83.72 |
| PAN | 97.85 | 86.39 | 95.57 | 83.49 |
Table 6. Precision analysis of deep learning models with different encoder structures.

| Encoder | OA (%) | MPA (%) | MPR (%) | MIoU (%) | Params Number (M) |
|---|---|---|---|---|---|
| Vgg19 | 93.82 | 74.68 | 87.92 | 65.18 | 22.11 |
| Mit_b5 | 96.19 | 79.82 | 90.85 | 72.59 | 83.33 |
| Resnext50 | 98.18 | 87.10 | 96.23 | 84.90 | 25.59 |
| Resnet50 | 98.30 | 88.14 | 96.15 | 85.69 | 26.12 |
| MobileNet_v2 | 98.41 | 88.51 | 96.14 | 86.06 | 4.22 |
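
The parameter counts in Table 6 can be sanity-checked with a few lines of PyTorch. The sketch below assumes the segmentation_models_pytorch encoder identifiers; it counts the full encoder-decoder model, so the figures will differ somewhat if the authors reported encoder-only sizes.

```python
import segmentation_models_pytorch as smp

# Trainable-parameter counts (in millions) for the encoders in Table 6;
# the identifiers follow segmentation_models_pytorch conventions and are
# an assumption about how the paper's encoder names map onto the library.
for enc in ["vgg19", "mit_b5", "resnext50_32x4d", "resnet50", "mobilenet_v2"]:
    m = smp.FPN(encoder_name=enc, encoder_weights=None, in_channels=3, classes=4)
    n = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"{enc}: {n / 1e6:.2f} M trainable parameters")
```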
Table 7. The metrics from different learning rate adjustment strategies.

| Learning Rate Adjustment Strategy | OA (%) | MPA (%) | MPR (%) | MIoU (%) |
|---|---|---|---|---|
| Remain Constant | 98.41 | 88.51 | 96.14 | 86.06 |
| Cosine Decay | 98.42 | 88.53 | 96.03 | 86.07 |
| Equally Spaced Decay | 98.48 | 89.14 | 95.85 | 86.46 |
| Exponential Decay | 98.56 | 89.79 | 96.03 | 87.14 |
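
The four strategies in Table 7 map naturally onto standard PyTorch schedulers. In the sketch below, the stand-in model, the initial learning rate, and the decay hyperparameters (step size, gamma) are illustrative assumptions, not values reported in the paper.

```python
import torch
from torch import nn
from torch.optim import Adam, lr_scheduler

model = nn.Conv2d(3, 4, kernel_size=1)      # placeholder for a segmentation model
optimizer = Adam(model.parameters(), lr=1e-4)

def make_scheduler(strategy: str):
    """Map a Table 7 strategy name onto a torch.optim.lr_scheduler object."""
    if strategy == "Remain Constant":
        return None                          # learning rate stays fixed
    if strategy == "Cosine Decay":
        return lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    if strategy == "Equally Spaced Decay":
        return lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    if strategy == "Exponential Decay":      # best MIoU in Table 7
        return lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    raise ValueError(strategy)

scheduler = make_scheduler("Exponential Decay")
for epoch in range(100):
    # ... run one training epoch here ...
    if scheduler is not None:
        scheduler.step()                     # update the learning rate per epoch
```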
Table 8. The metrics of models using different loss functions.

| Metrics | Class | CE (%) | WCE (%) | Focal (%) |
|---|---|---|---|---|
| OA | – | 98.59 | **98.10** | **98.09** |
| CPA | Background | 99.68 | *99.76* | *99.75* |
| | Cloud | 67.42 | **58.55** | **58.84** |
| | Snow | 94.36 | **91.38** | **89.51** |
| | Lake | 97.73 | **96.87** | *97.99* |
| MPA | – | 89.80 | **86.64** | **86.52** |
| CPR | Background | 98.86 | **98.28** | **98.24** |
| | Cloud | 91.11 | **90.63** | *92.58* |
| | Snow | 96.26 | *97.22* | *97.44* |
| | Lake | 97.92 | *98.13* | **97.83** |
| MPR | – | 96.03 | *96.06* | *96.52* |
| IoU | Background | 98.55 | **98.05** | **98.01** |
| | Cloud | 63.26 | **55.20** | **56.19** |
| | Snow | 91.02 | **89.05** | **87.45** |
| | Lake | 95.74 | **95.12** | *95.90* |
| MIoU | – | 87.14 | **84.35** | **84.39** |

Note: Bold indicates the metric is lower than with CE; italics indicate the metric is higher than with CE.
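
For reference, the three loss functions compared in Table 8 can be sketched as follows for four-class segmentation logits of shape (B, 4, H, W) and integer labels of shape (B, H, W); the class weights and the focal gamma are illustrative assumptions, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def ce_loss(logits, target):
    """Plain cross-entropy (CE)."""
    return F.cross_entropy(logits, target)

def weighted_ce_loss(logits, target, weights=(1.0, 2.0, 1.0, 1.0)):
    """Weighted cross-entropy (WCE); e.g., up-weighting the cloud class."""
    w = torch.tensor(weights, device=logits.device)
    return F.cross_entropy(logits, target, weight=w)

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: down-weights easy pixels via (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel CE
    pt = torch.exp(-ce)                                     # prob of true class
    return ((1.0 - pt) ** gamma * ce).mean()
```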
Table 9. The metrics of models for clouds, snow, and lakes segmentation using different input image sizes.

| Image Size | OA (%) | MPA (%) | MPR (%) | IoU: Background (%) | IoU: Cloud (%) | IoU: Snow (%) | IoU: Lake (%) | MIoU (%) |
|---|---|---|---|---|---|---|---|---|
| 256 × 256 | 98.22 | 87.64 | 96.13 | 98.13 | 56.62 | 90.26 | 95.13 | 85.03 |
| 512 × 512 | 98.56 | 89.79 | 96.03 | 98.55 | 63.26 | 94.36 | 95.74 | 87.14 |
| 768 × 768 | 98.27 | 87.89 | 95.66 | 98.25 | 57.73 | 89.71 | 95.26 | 85.23 |
| 1024 × 1024 | 98.54 | 88.39 | 95.69 | 98.52 | 58.32 | 90.02 | 96.71 | 85.89 |
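
Varying the input size, as in Table 9, amounts to cropping each scene into square patches before training. A minimal sketch, assuming partial tiles at the image edges are simply discarded:

```python
import numpy as np

def tile(image: np.ndarray, size: int):
    """Split an (H, W, C) scene into non-overlapping size x size patches;
    size would be one of 256, 512, 768, or 1024 in the Table 9 experiment.
    Partial edge tiles are dropped (an assumption, not the paper's rule)."""
    h, w = image.shape[:2]
    return [
        image[r : r + size, c : c + size]
        for r in range(0, h - size + 1, size)
        for c in range(0, w - size + 1, size)
    ]
```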
Table 10. The metrics of models based on different input band combinations.

| Bands | OA (%) | MPA (%) | MPR (%) | IoU: Background (%) | IoU: Cloud (%) | IoU: Snow (%) | IoU: Lake (%) | MIoU (%) |
|---|---|---|---|---|---|---|---|---|
| 4-3-2 | 98.11 | 86.43 | 93.10 | 98.37 | 55.43 | 79.39 | 95.92 | 82.28 |
| 12-3-2 | 98.56 | 89.79 | 96.03 | 98.54 | 63.25 | 91.02 | 95.73 | 87.14 |
| 11-3-2 | 98.47 | 89.20 | 95.49 | 98.45 | 62.29 | 90.16 | 95.67 | 86.65 |
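
The band combinations in Table 10 correspond to stacking different Sentinel-2 bands into three-channel composites. A minimal sketch using rasterio, with hypothetical per-band file paths:

```python
import numpy as np
import rasterio

# Hypothetical per-band GeoTIFF paths for a Sentinel-2 scene; the SWIR
# bands (11, 12) must be resampled from 20 m to 10 m before stacking.
BAND_FILES = {2: "B02_10m.tif", 3: "B03_10m.tif", 4: "B04_10m.tif",
              11: "B11_10m.tif", 12: "B12_10m.tif"}

def read_band(path: str) -> np.ndarray:
    with rasterio.open(path) as src:
        return src.read(1)                   # the single raster band in the file

def composite(bands=(12, 3, 2)) -> np.ndarray:
    """Stack the requested bands into an (H, W, 3) array."""
    return np.dstack([read_band(BAND_FILES[b]) for b in bands])

false_color = composite((12, 3, 2))          # best MIoU in Table 10
true_color = composite((4, 3, 2))
```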
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
