1. Introduction
Land-cover classification via satellite remote sensing (RS) images is an essential task in the field of earth observation. It has been widely applied in various areas, such as land and water resources, vegetation resources, environmental monitoring, natural disaster forecasting [1], urban planning [2,3,4,5], environmental impact assessment [6,7], and precision agriculture [8,9]. In particular, the Sentinel-2 series satellites, one of the main RS platforms for land-cover mapping, enable persistent sensing for civil applications thanks to customized multispectral sensors with high spatial, spectral, and temporal resolutions over a wide swath. However, as reported in [10], 66% of Earth's surface is covered by clouds. As a result, clouds inevitably appear in acquired satellite images, restricting their use in land-cover classification [11]. Meanwhile, it is also reported in [12] that snow coverage mapping plays an important role not only in studying hydrology and climatology but also in investigating crop disease overwintering for smart agriculture, since winter snow coverage not only weakens the negative effects of extreme cold on crops but also has a certain protective effect on crop diseases (e.g., yellow rust disease in wheat [13]). Considering that snow and clouds share a very similar appearance and color distribution, manually separating snow pixels from cloud pixels requires expert knowledge and is a tedious, time-consuming process. Therefore, it is desirable to develop an automated algorithm to discriminate clouds and snow in RS images, facilitating the post-processing and interpretation of RS images, which is highly beneficial for land-cover classification and precision-agriculture applications.
Conventional snow/cloud classification algorithms can generally be divided into two categories: threshold-based [14,15,16,17] and machine-learning-based ones [18,19,20]. Threshold-based methods manually set spectral thresholds according to how objects such as clouds and snow appear in different bands. For example, Function of mask (Fmask) [14] and Automated Cloud Cover Assessment (ACCA) [15] are two classical threshold-based algorithms. Various extensions have been made to further improve Fmask. For example, Tmask utilized multi-temporal Landsat images to classify clouds and snow, outperforming the single-date Fmask approach [16]. Timbo introduced CFmask for snow/cloud mask error assessment, achieving an overall precision of 70% on Landsat imagery [17,21]. Machine-learning-based approaches adopt handcrafted classifiers combined with multiple features to improve classification precision and speed. For example, a Support Vector Machine (SVM) incorporating multi-feature strategies was used in [18] to classify clouds and other objects, aiming to fully utilize RS image information. Nijhawan integrated several individual classifiers, including SVM, Random Forest (RF), and Rotation Forest (ROF), achieving outstanding performance on Landsat multispectral images [19]. Nafiseh applied Random Forest classifiers to fused visible-infrared (VIR) and thermal data for snow and cloud detection, providing novel insight into feature selection for precise cloud/snow discrimination [20]. Nevertheless, these traditional approaches still have problems and limitations. First, threshold setting relies heavily on manual experience, making the snow/cloud mapping process empirical and subjective. In addition, handcrafted classifiers in machine-learning algorithms struggle to extract useful information from RS images (especially for two visually similar classes) and are therefore less accurate at differentiating cloud from snow in high-resolution images.
Therefore, to avoid the aforementioned problems of threshold-based and machine-learning-based approaches, deep-learning methods have also been introduced to address the snow/cloud classification challenge. Recently, the strong feature extraction capability of deep learning has been verified by its extensive use in computer vision, and its rapid development in the RS field enables its application to snow/cloud classification as well. However, satellite RS images with improved resolution contain abundant spectral features and rich texture distributions, so models that make full use of both spatial and spectral information are needed. In terms of model structure, deep-learning methods can be categorized into two classes: Convolutional Neural Network (CNN)-based and Transformer-based methods. In particular, the CNN is an efficient model for image analysis and has been introduced to snow/cloud classification [12,22,23,24]. After training on a large number of RS images, CNN-based methods automatically extract spatial image features for classification. A novel CNN structure was presented to learn multi-scale semantic features from cloud and snow multispectral imagery, achieving better precision in discriminating cloud and snow in high-resolution images than traditional cloud detection algorithms [22]. Four encoder–decoder-based CNNs proposed by Mohapatra obtained an average accuracy of 94% on AWiFS data [23]. Recently, Wang adopted a UNet-based structure to incorporate both spectral and spatial information from Sentinel-2 imagery, gaining higher accuracy and robustness than other traditional methods [12]. By introducing a specially designed fully convolutional network and a multiscale prediction strategy, Zhan precisely distinguished clouds from snow at the pixel level in satellite images [24]. Although CNN-based methods realize local extraction of texture and semantic information, the convolution operation restricts further progress on snow/cloud discrimination: local feature extraction limits the receptive field, making it difficult for convolutions to exploit spectral information efficiently. To make full use of spectral information, the Transformer [25,26,27,28,29] was introduced from natural language processing and has achieved superior classification results in RS image analysis due to its powerful capability of capturing temporal and long-distance dependencies. For example, He proposed HSI-BERT, capturing for the first time the global dependence of bidirectional encoder representations from Transformers [25]. Building on learned band connections, a cross-context capsule vision Transformer was developed for land-cover classification and demonstrated its efficiency on multispectral LiDAR datasets [26]. Xu proposed an Efficient Transformer based on a Swin Transformer backbone [27] to improve segmentation efficiency, together with edge enhancement methods to cope with inaccurate edges [28]. The Swin Transformer was also adopted as the backbone by Wang to extract context information from fine-resolution remote sensing images [29]. However, since Transformer-based methods mainly focus on obtaining global information, local information such as color and texture is lost during feature extraction. In addition, the Transformer is computationally intensive when the sequence length is long.
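The computational concern raised above can be quantified: self-attention builds an affinity matrix between every pair of tokens, so its cost grows quadratically with sequence length. The sketch below counts tokens and pairwise interactions for an image flattened into patches (the patch size of 4 is illustrative, not a parameter of any method cited here):

```python
def attention_cost(height, width, patch=4):
    # Flattening an H x W image into (H/patch) * (W/patch) tokens,
    # plain self-attention forms an N x N affinity matrix, so memory
    # and compute scale with the square of the token count.
    n_tokens = (height // patch) * (width // patch)
    return n_tokens, n_tokens * n_tokens
```

For a 256 x 256 tile this gives 4096 tokens and roughly 16.8 million pairwise interactions; doubling the image side quadruples the token count and multiplies the attention cost by sixteen, which is why window-based designs such as the Swin Transformer restrict attention to local regions.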
Considering the pros and cons of CNN-based and Transformer-based methods, we propose a dual-flow U-shaped framework named UCTNet that integrates CNNs and Transformers for discriminating snow and cloud, obtaining local and global information from sensing images in parallel. The proposed model combines CNNs and Transformers in a complementary manner, making full use of the spatial and spectral information in satellite multispectral RS images and hence improving snow/cloud classification accuracy. In particular, the CNN and Transformer Integration Module (CTIM) is designed to maximally integrate the information extracted by the two branches. In addition, the Final Information Fusion Module (FIFM) fuses the two branch outputs of the decoder to obtain the final prediction map for supervision. Meanwhile, we propose an Auxiliary Information Fusion Head (AIFH) for better feature representation ability. Finally, to verify the effectiveness of the proposed model, the Sentinel-2 snow/cloud dataset developed in our previous paper is utilized [12]. The original Sentinel-2 dataset is composed of 12 multispectral bands; however, our previous work showed that the best four-band combination not only reduces model size but also yields the best performance, so it is adopted in this study. The proposed UCTNet is compared with advanced CNN-based and Transformer-based algorithms, showing that it achieves state-of-the-art performance in terms of accuracy and model size. In summary, the main contributions are as follows:
- (1) A dual-flow architecture composed of a CNN branch and a Transformer branch is proposed for the first time to solve the snow/cloud classification challenge;
- (2) As the core of the encoder and decoder blocks, CTIM is introduced to leverage local and global features for better performance;
- (3) FIFM and AIFH are designed to fuse the two branches' outputs for better supervision;
- (4) Comparative experiments on the snow/cloud satellite dataset validate the proposed algorithm, showing that UCTNet outperforms both CNN-based and Transformer-based networks in both accuracy and model size.
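The dual-flow idea can be caricatured in a few lines: one branch contributes local features, the other global features, and a fusion step merges them at each stage. The sketch below is purely schematic; the fixed weighted sum stands in for the learned interaction inside CTIM, and the function names are invented for illustration rather than taken from the paper's implementation.

```python
def fuse_branches(local_feat, global_feat, w_local=0.5, w_global=0.5):
    # Toy stand-in for CTIM-style fusion: the CNN branch supplies local
    # texture features, the Transformer branch supplies global context
    # features, and the module merges them element-wise. The real module
    # learns how the branches interact; these weights are fixed here
    # only to make the data flow concrete.
    return [w_local * l + w_global * g for l, g in zip(local_feat, global_feat)]
```

In the actual architecture this exchange happens inside every encoder and decoder block, so each branch repeatedly refines its representation with information from the other rather than fusing only once at the end.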
5. Discussion
The dual-flow UCTNet presented in Section 3 shows better performance than previous methods in terms of Precision, Recall, F1 score, ACC, and mIoU on the collected snow/cloud dataset. Taking mIoU as the main evaluation criterion, the proposed UCTNet improves by 2.30% and 1.70% over CNN-based networks (i.e., U-Net and DeepLab-V3), and is 2.54% and 1.51% higher than previous Transformer-based approaches (i.e., ResT-Tiny and Swin-Tiny). By leveraging the local and global feature modeling abilities of CNNs and Transformers, our UCTNet also increases mIoU by 1.24% over the specially designed snow and cloud detection method (i.e., CSD-Net). In addition, extensive ablation experiments on the two-branch architecture, the local–global fusion approach, position encoding for the Transformer branch, and the well-designed AIFH further verify the superiority of the proposed method.
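For reference, the mIoU criterion used above computes, for each class, the intersection over union between the predicted and ground-truth masks, then averages over classes. A minimal sketch over flattened label lists (class labels as integers; the function name is ours, not from the paper's codebase):

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over all classes present in either mask."""
    ious = []
    for c in range(num_classes):
        # Intersection: pixels labeled c in both maps; union: in either map.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

Because each class contributes equally regardless of how many pixels it covers, mIoU penalizes confusing a rare class (e.g., thin cloud over snow) much more heavily than overall pixel accuracy does, which is why it is taken as the main criterion here.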
There are still issues worth investigating when the proposed method is used in practical scenarios. For example, the spatial resolution of the Sentinel-2 satellite is approximately 10 m, so certain pixels (especially those at the boundaries of different classes) actually contain mixed spectral information from several surface classes. Furthermore, extremely small proportions of cloud and snow coverage within a pixel cannot be effectively identified. Consequently, classification performance in these situations remains unsatisfactory. These problems may be alleviated by training our method on datasets with higher spatial resolution.