Article

A Novel Method for Ground-Based Cloud Image Classification Using Transformer

Department of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(16), 3978; https://doi.org/10.3390/rs14163978
Submission received: 8 July 2022 / Revised: 8 August 2022 / Accepted: 11 August 2022 / Published: 16 August 2022
(This article belongs to the Special Issue Deep Learning-Based Cloud Detection for Remote Sensing Images)

Abstract

In recent years, convolutional neural networks (CNNs) have achieved competitive performance in ground-based cloud image (GCI) classification. CNN-based methods can fully extract the local features of images; however, because the convolution operation is local, they cannot effectively establish long-range dependencies within an image and thus fail to extract its global features. Transformer has been applied to computer vision with great success owing to its powerful global modeling capability. Inspired by this, we propose a Transformer-based GCI classification method that combines the advantages of the CNN and Transformer models. Firstly, the CNN model acts as a low-level feature extractor to generate local feature sequences of images. Then, the Transformer model learns the global features of the images by efficiently extracting the long-range dependencies between the sequences. Finally, a linear classifier performs the GCI classification. In addition, we introduce a center loss function to address the problem that the cross-entropy loss alone does not adequately supervise feature learning. Our method is evaluated on three commonly used datasets: ASGC, CCSN, and GCD. The experimental results show that the method achieves 94.24%, 92.73%, and 93.57% accuracy, respectively, outperforming other state-of-the-art methods. This demonstrates that Transformer has great potential for GCI classification tasks.


1. Introduction

Clouds are visible masses formed when water vapor in the atmosphere condenses, and they cover about 70% of the Earth’s surface. The study of clouds and their properties plays a very important role in many applications, such as climate simulation, weather forecasting, meteorological studies, solar energy production, and satellite communications [1,2,3]. Clouds are also closely linked to the hydrological cycle, affecting the energy balance on local and global scales through interactions with radiation from the Sun and the land [4,5,6,7,8,9]. Because different cloud types have different radiative effects on the Earth’s surface–atmosphere system, the study of cloud type classification is of great importance [10].
There are two main methods of cloud observation: meteorological satellite observations [11,12,13] and ground-based remote sensing observations [14,15,16]. Satellite cloud images capture clouds over large areas and allow the direct observation of their effects on Earth’s radiation; however, their low resolution prevents the study of finer local cloud details. Ground-based cloud image (GCI) classification is widely used to monitor the texture and distribution of clouds in local areas and has the advantages of flexible observation sites and rich image information, so it has become a hot research topic. As cloud monitoring is deployed at more and more sites, large numbers of images are generated simultaneously. Relying on experts alone to identify and classify these images is clearly time-consuming and easily influenced by personal subjective factors. For these reasons, it is crucial to investigate methods that can automatically and accurately classify GCIs.
Early GCI classification was based on manually extracted features, and most methods employed brightness, texture, shape, and color features to represent image content. Heinle et al. [17] proposed a classification algorithm based on spectral features in red–green–blue (RGB) color space and texture features extracted using the Gray-Level Co-occurrence Matrix (GLCM). Li et al. [18] generated the corresponding feature vectors by calculating the weighted frequency values of microstructures in each cloud image and then inputting them into the support vector machine (SVM) classifier for classification. Dev et al. [19] proposed an improved text-based classification method that incorporates manually extracted color and texture features to improve the classification. Xiao et al. [20] extracted visual descriptors from color, texture, and the structure of clouds in a dense sampling manner simultaneously and then completed feature encoding using Fisher Vector, which characterized image samples by computing the log-likelihood gradient of model parameters to enhance classification performance. Zhuo et al. [21] applied color census transform to compute cloud image statistics to extract texture and structure features, which were then fed into a conventional classifier.
In recent years, with the development of deep learning, many convolutional neural network (CNN)-based methods have made significant progress in GCI classification. Shi et al. [22] argued that locally rich information is more important than global layout information, and, therefore, used deep convolutional activation-based features (DCAF) and shallow convolutional layer-based features for classification. Ye et al. [23] extracted multiscale feature maps from pretrained CNNs and then used Fisher Vector coding to perform spatial feature aggregation and high-dimensional feature mapping on the original deep convolutional features, which aimed to find discriminative local information to better distinguish cloud types. Zhao et al. [24] used a 3D-CNN model to process multiple consecutive ground-based cloud images to extract cloud features such as texture and temporal information, followed by a fully connected layer for classification. Zhao et al. [25] proposed a multichannel CNN-based classification method that first extracts cloud objects from large images and then inputs clouds into multiple channels to extract features, thus improving the classification accuracy. Li et al. [26] developed a dual-guided loss function for GCI to integrate information from different CNNs in the optimization process, thereby improving the discriminative ability of cloud feature representation. Zhang et al. [27] presented a GCI classification network called CloudNet, which includes four convolutional layers and two fully connected layers. In addition, some researchers have applied graph convolutional networks (GCN) to the field of GCI classification. Liu et al. [28] treated each cloud image as a node in a graph and then used GCN to aggregate information from the cloud image itself, as well as its connected images, in a weighted manner to extract richer feature information. Liu et al. [29] proposed a classification method based on the context graph attention network (CGAT), which uses the context graph attention layer to learn context attention coefficients and obtain the aggregated features of graph nodes based on these coefficients, solving the problem that the weights assigned by GCN do not accurately reflect the importance of connected nodes.
A cloud cannot be accurately described by the visual information contained in an image alone. Therefore, some researchers have fused visual information with nonvisual features obtained during cloud formation for classification, such as air pressure, wind speed, temperature, and humidity. Liu et al. [30] proposed a joint fusion CNN for learning both visual and nonvisual features in one model, which extracts features using ResNet50 [31] and then uses a weighting strategy for integration. Liu et al. [32] developed a multi-evidence multimodal fusion network (MMFN), which uses an attention network to extract local visual features while learning nonvisual features using a multimodal network. In addition, the authors specifically designed two fusion layers to fully fuse the two features. However, most GCI datasets contain only visual information in practice, so the above approach is not universally applicable.
The disadvantage of CNN models is that they do not handle global features well and thus lead to underutilization of features. In contrast, the recently emerged Transformer model can extract abundant global information. Transformer was originally proposed by Vaswani et al. [33] for natural language processing (NLP) problems, and the model introduced a self-attention mechanism to perform global computation on the input sequence. In the field of NLP, Transformer is gradually replacing recurrent neural networks (RNN) [34,35]. Inspired by this, related works have applied Transformer to image processing, such as DETR for target detection [36] and SETR for semantic segmentation [37]. Meanwhile, many research results have been generated in the field of image classification. Parmar et al. [38] input the pixels of an image as a sequence into the Transformer, which achieved better results but had a high computational cost. Dosovitskiy et al. [39] proposed the Vision Transformer (ViT), a model that reduces computational complexity by first dividing images into patches before feeding them into the Transformer. Touvron et al. [40] suggested a knowledge distillation strategy for Transformer that relies on a distillation token to ensure that the student network learns feature information from the teacher network through attention. Due to the success of Transformer in natural image processing, some researchers have applied it to other fields. Reedha et al. [41] used a transfer learning strategy to apply ViT to unmanned aerial vehicle (UAV) image classification, and the performance outperformed the state-of-the-art CNN model. Chen et al. [42] proposed a LeViT-based method for classifying asphalt pavement images, which consists of convolutional layers, transformer stages, and classifier heads. Shome et al. [43] developed a ViT-based classification model for chest X-ray images, which outperformed previous methods. He et al. [44] proposed a Transformer-based hyperspectral image classification method that uses CNN to extract spatial features while using a densely connected Transformer to capture sequence spectral relationships.
Transformer has proved very successful in several fields, but it has almost no reported applications in GCI classification, which is, in fact, a complex problem: some images contain large cloud areas, while others contain only a small portion. Thus, models for GCI classification should be able to extract both global and local features. Motivated by the above, a novel Transformer-based GCI classification method is proposed in this paper, which first sends the images to a CNN model to extract low-level features and generate local feature sequences, and then uses the Transformer to learn the relationships between these low-level feature sequences. It captures both the local and global features of images, improving the model’s ability to discriminate between them. To the best of our knowledge, this is the first time that Transformer has been introduced to the field of GCI classification. The results of experiments on three GCI datasets show that the classification performance of the method exceeds that of existing methods.
The main contributions of this paper are summarized as follows:
(1) We apply Transformer to the GCI classification task and propose a Transformer-based classification method that combines the advantages of Transformer and CNN to extract both the local and global features of images, maximizing their complementary strengths for GCI classification.
(2) We optimize the loss function to enhance supervised feature learning by supplementing the cross-entropy loss with the center loss.
(3) An experimental evaluation is performed on three datasets (ASGC, CCSN, and GCD), and the results show that the proposed method in this paper has better classification accuracy.
The rest of the paper is structured as follows. Section 2 details the components and overall structure of the research method. Section 3 reports different GCI datasets and the experimental setup used in this paper. The experimental results, as well as the discussion, are presented in Section 4. Section 5 provides the conclusion of this study.

2. Method

2.1. Overview of Proposed Method

This section presents the overall architecture of the designed Transformer-based classification method. As shown in Figure 1, a specially designed CNN model is used to extract the low-level semantic feature maps of the GCI. The convolution operation is highly localized, so it is well suited to learning the local feature information of images. The feature maps are then fed into the Transformer, which learns the feature relationships between sequences to obtain a global feature representation of the images. Finally, a linear classifier completes the classification. In addition, the loss function is supplemented to enhance supervised feature learning. The specific parameters of the model are given in Table 1, where the number in parentheses after a convolutional layer or module indicates the kernel size, and the number in parentheses after a block indicates the factor by which the feature map is downsampled. The details of each part are described below.
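For concreteness, the sketch below shows how such a pipeline can be wired together in PyTorch; the convolutional stem, transformer encoder, and classifier head are simplified stand-ins (not the EfficientNet/Swin-T configuration of Table 1), meant only to illustrate how local feature maps are flattened into a sequence before global modeling.

```python
import torch
import torch.nn as nn

class CNNTransformerClassifier(nn.Module):
    """Sketch of the CNN -> Transformer -> linear classifier pipeline.

    The convolutional stem and the transformer encoder below are simplified
    stand-ins, not the EfficientNet-based CNN / Swin-T configuration of Table 1.
    """
    def __init__(self, num_classes=7, embed_dim=96, depth=2, num_heads=4):
        super().__init__()
        # Local feature extraction: a small stack of strided convolutions
        # (448 x 448 input -> 28 x 28 feature map).
        layers, in_ch = [], 3
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, embed_dim, 3, stride=2, padding=1),
                       nn.BatchNorm2d(embed_dim), nn.ReLU(inplace=True)]
            in_ch = embed_dim
        self.cnn = nn.Sequential(*layers)
        # Global modeling: a transformer encoder over the sequence of spatial positions.
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Linear classifier head.
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feat = self.cnn(x)                     # (B, C, H', W') local feature maps
        seq = feat.flatten(2).transpose(1, 2)  # (B, H'*W', C) local feature sequence
        seq = self.transformer(seq)            # long-range dependencies between positions
        pooled = self.norm(seq.mean(dim=1))    # global average pooling + layer norm
        return self.fc(pooled)                 # class logits

if __name__ == "__main__":
    logits = CNNTransformerClassifier(num_classes=7)(torch.randn(2, 3, 448, 448))
    print(logits.shape)  # torch.Size([2, 7])
```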

2.2. The EfficientNet-Based CNN

Related studies have demonstrated that the local connection and weight-sharing characteristics of CNNs make them effective and robust for local feature extraction from images [45,46]. GCI classification can be regarded as a fine-grained classification task that requires more low-level feature information than natural image classification, so the use of a CNN is necessary. A CNN is mainly composed of convolutional layers, pooling layers, and fully connected layers. The convolutional layer learns representational information in the image through different kernels; the pooling layer downsamples the feature maps, retaining the most useful feature information; and the fully connected layer maps the learned distributed feature representations to the sample label space, thus predicting the sample class.
EfficientNet [47] is an efficient network developed by Google Research using neural architecture search. It scales network depth, channel width, and input image resolution according to a fixed compound scaling factor, offering both high efficiency and high accuracy. EfficientNet is built by stacking the Mobile Inverted Bottleneck Convolution (MBConv) block from MobileNetV2 [48]; the structure of MBConv is shown in Figure 2. On the main branch, the input feature map first passes through a 1 × 1 convolution layer and then a 3 × 3 or 5 × 5 depthwise convolution layer, each followed by a Swish activation function [49] and a batch normalization (BN) layer; the result is then fed into a Squeeze-and-Excitation (SE) module [50], followed by a convolutional layer and a dropout layer. Finally, the output of the main branch is summed with the input branch to obtain the final output of the module. The first convolutional layer increases the dimensionality of the feature map; when the expansion factor is 6, the module is called MBConv6, and MBConv1 is defined analogously. The original EfficientNet-B0 contains thirty-nine MBConv layers, and using all of them is not the best choice for GCI feature extraction. Therefore, only some layers are selected to form the new CNN model in this paper. Specifically, we use thirteen MBConv layers and remove the pooling and fully connected layers from the original model; the detailed structure is shown in the CNN part of Figure 1.
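The following is a minimal PyTorch sketch of one MBConv block (1 × 1 expansion, depthwise convolution, SE module, 1 × 1 projection, dropout, and residual connection); the layer hyperparameters are illustrative and do not reproduce the official EfficientNet implementation.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation: channel-wise reweighting of the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)             # squeeze: global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s                                     # excite: rescale channels

class MBConv(nn.Module):
    """Simplified MBConv block: expansion -> depthwise conv -> SE -> projection."""
    def __init__(self, in_ch, out_ch, kernel_size=3, expand_ratio=6, drop_rate=0.2):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = in_ch == out_ch
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.SiLU(),           # SiLU == Swish
        ) if expand_ratio != 1 else nn.Identity()
        dw_ch = mid_ch if expand_ratio != 1 else in_ch
        self.depthwise = nn.Sequential(
            nn.Conv2d(dw_ch, dw_ch, kernel_size, padding=kernel_size // 2,
                      groups=dw_ch, bias=False),         # depthwise convolution
            nn.BatchNorm2d(dw_ch), nn.SiLU(),
        )
        self.se = SqueezeExcite(dw_ch)
        self.project = nn.Sequential(
            nn.Conv2d(dw_ch, out_ch, 1, bias=False),     # 1 x 1 projection
            nn.BatchNorm2d(out_ch),
        )
        self.dropout = nn.Dropout2d(drop_rate)

    def forward(self, x):
        out = self.project(self.se(self.depthwise(self.expand(x))))
        if self.use_residual:
            out = x + self.dropout(out)                  # residual only when shapes match
        return out

print(MBConv(24, 24)(torch.randn(1, 24, 112, 112)).shape)  # torch.Size([1, 24, 112, 112])
```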

2.3. Transformer Architecture

2.3.1. Vision Transformer

Vision Transformer (ViT) [39] was the first model to apply Transformer to large-scale image datasets, and its overall structure is shown in Figure 3a. The input of Transformer is a one-dimensional (1D) sequence of token embeddings, so two-dimensional (2D) images must first be converted into such a sequence. Firstly, the image $X \in \mathbb{R}^{H \times W \times C}$ is divided into a number of nonoverlapping patches $X_P \in \mathbb{R}^{P \times P \times C}$, where $H \times W$ is the size of the original image, $C$ is the number of channels, and $P \times P$ is the size of each patch. Then, each patch is linearly projected into a vector of fixed dimension using a trainable embedding matrix. To preserve the spatial information of the image, a position embedding is added to each embedding vector, and together they form the input of the transformer encoder. In addition, an extra learnable class embedding is fed into the encoder, and its corresponding output is used to distinguish images from different classes.
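A compact sketch of this patch embedding step is shown below; a strided convolution is used as a common, equivalent implementation of the nonoverlapping patch split plus linear projection, and the class token and learned position embedding follow the description above.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, project them, and add class/position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = P is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # prepend the class token
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```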
The structure of the transformer encoder is shown in Figure 3b. It is mainly composed of a multi-head self-attention (MSA) module and a feed-forward multilayer perceptron (MLP) [33]. MSA is the core module of the encoder, which will be explained in detail below. MLP contains two fully connected layers with a Gaussian Error Linear Unit (GeLU) activation function between the layers [51]. In addition, the two parts of the encoder both use residual connections, while a normalization layer is added before the input of each part. Finally, GCI classification is performed using the MLP head based on the features trained by the transformer encoder.
Multi-head self-attention (MSA) [33] is the core component of Transformer, and the introduction of an attention mechanism can make the network pay more attention to the relevant information in the input vector. Figure 4b provides an illustration of MSA. It is composed of multiple self-attention connections, and the structure of self-attention is shown in Figure 4a. The input vectors are first transformed into three different vector matrices: query matrix, Q; key matrix, K; and value matrix, V. The weight assigned to each value is determined by calculating the dot product of the query matrix and the key matrix of different input vectors so that the individual input vectors can be connected to each other to achieve the effect of global modeling. The specific calculation equation is:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$
where $d_k$ is the dimension of the key matrix $K$; dividing by $\sqrt{d_k}$ provides proper normalization and makes the gradients more stable. In order to capture more correlations between different inputs, the multi-head self-attention mechanism splits the input vector into several parts, computes the matrix dot products of each part in parallel, and finally concatenates all the attention outputs. The calculation is given in Equations (2) and (3):
$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \qquad (2)$
$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_i)\,W^{o} \qquad (3)$
where $i$ denotes the number of input vector splits and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{o}$ are all trainable parameter matrices.
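The sketch below implements Equations (1)–(3) directly in PyTorch (without dropout or masking), to make the per-head computation explicit:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention following Equations (1)-(3)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # W^Q, W^K, W^V for all heads at once
        self.out = nn.Linear(dim, dim)       # W^o

    def forward(self, x):                    # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape into (B, heads, N, head_dim) so each head attends independently.
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # QK^T / sqrt(d_k)
        attn = attn.softmax(dim=-1)
        heads = attn @ v                                          # Attention(Q, K, V)
        heads = heads.transpose(1, 2).reshape(B, N, -1)           # Concat(head_1, ..., head_i)
        return self.out(heads)                                    # ... W^o

x = torch.randn(2, 196, 768)
print(MultiHeadSelfAttention(768)(x).shape)  # torch.Size([2, 196, 768])
```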

2.3.2. Swin Transformer

Swin Transformer [52] is a recently proposed transformer model that is gradually becoming the backbone for various vision tasks. The original vision transformer keeps a single, fixed downsampling rate throughout the network, and each patch must compute self-attention with all other patches; this yields only single-scale features and incurs extremely high computational cost when classifying high-resolution images. For these reasons, Swin Transformer builds a hierarchical structure that imitates the CNN model, as shown in Figure 5. The hierarchical representation is achieved by a patch merging operation at each stage, which merges adjacent patches and applies linear transformations to set the channel dimensionality, so the spatial resolution of the feature maps decreases and the region covered by each token grows as the transformer layers deepen. In addition, the model computes self-attention within nonoverlapping windows, reducing the computational complexity from quadratic to linear in the image resolution. However, this approach lacks information interaction between windows, which prevents a better understanding of contextual information. Therefore, Swin Transformer introduces the shifted window method, which enables information to be exchanged between neighboring windows. This is also the main difference from the original transformer model.
Swin Transformer is provided in four versions: Swin-T, Swin-S, Swin-B, and Swin-L. Considering the specific characteristics of GCIs and the computational complexity, we chose Swin-T, whose overall structure is shown as part of Figure 1. The model is divided into four stages containing 2, 2, 6, and 2 blocks, respectively. Stage 1 contains two Swin Transformer blocks, preceded by a linear embedding layer that changes the number of channels in the feature map. Stage 2 contains a patch merging layer and two Swin Transformer blocks, where patch merging acts like the pooling layer in a CNN, downsampling before each stage to adjust the dimensionality of the feature map. Stages 3 and 4 have a similar structure to stage 2, except that the downsampling factor doubles and the number of blocks differs.
Different from the original MSA, the Swin Transformer block is constructed based on window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA). W-MSA reduces the computational complexity of the model by computing self-attention within the window, but it lacks communication between the windows as a result. Therefore, SW-MSA is proposed to achieve the interaction between the windows by dividing and merging the feature maps. The specific operation is shown in Figure 6. First, the feature map is partitioned into a number of nonoverlapping windows and then attention between different patches in each window is calculated with W-MSA. Then, the window positions are cyclically shifted to form a new feature map. Next, the feature map is input into SW-MSA for the calculation of self-attention within the windows, thus achieving the effect of information interaction across windows. Finally, the feature map is reversed back to the original state for the next loop operation.
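The window partition and cyclic shift can be written compactly with tensor reshapes and torch.roll; the sketch below shows only these data-movement steps and omits the attention computation and the masking that Swin Transformer applies to the wrapped-around regions.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into nonoverlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (B*num_windows, ws*ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(1, 56, 56, 96)       # (B, H, W, C) feature map from stage 1
ws, shift = 7, 3                     # window size and shift (roughly half a window)

# W-MSA: self-attention is computed independently inside each 7 x 7 window.
windows = window_partition(x, ws)                    # (64, 49, 96)

# SW-MSA: cyclically shift the map before partitioning, so patches near former
# window borders now fall into the same window and can exchange information.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)      # attention would be applied here

# After attention, reverse the partition and the shift to restore the layout.
restored = torch.roll(window_reverse(shifted_windows, ws, 56, 56),
                      shifts=(shift, shift), dims=(1, 2))
print(torch.allclose(restored, x))   # True (no attention applied in this sketch)
```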
Swin Transformer blocks always appear in pairs due to the abovementioned shifted window characteristics. Figure 7 shows the structure of two successive Swin Transformer blocks. The composition of each Swin Transformer block is almost the same as that of the ordinary transformer encoder; the only difference is that the MSA in it is replaced by W-MSA and SW-MSA. Based on this window mechanism, the equations for calculating the features of two successive Swin Transformer blocks are as follows:
$\hat{z}^{l} = \mathrm{W\text{-}MSA}\bigl(\mathrm{LN}(z^{l-1})\bigr) + z^{l-1} \qquad (4)$
$z^{l} = \mathrm{MLP}\bigl(\mathrm{LN}(\hat{z}^{l})\bigr) + \hat{z}^{l} \qquad (5)$
$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\bigl(\mathrm{LN}(z^{l})\bigr) + z^{l} \qquad (6)$
$z^{l+1} = \mathrm{MLP}\bigl(\mathrm{LN}(\hat{z}^{l+1})\bigr) + \hat{z}^{l+1} \qquad (7)$
where $\hat{z}^{l}$ and $z^{l}$ represent the outputs of the (S)W-MSA module and the MLP module of the $l$-th block, respectively, and LN denotes layer normalization.
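Equations (4)–(7) translate almost line for line into code. In the sketch below, the (S)W-MSA operators are stand-ins (plain multi-head attention over the full sequence), so only the normalization and residual structure of the paired blocks is illustrated.

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Residual structure of two successive blocks (Equations (4)-(7)).

    Here w_msa and sw_msa are stand-ins (full-sequence attention); the real model
    restricts them to regular and shifted windows, respectively.
    """
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sw_msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        def mlp():
            return nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.mlp1, self.mlp2 = mlp(), mlp()

    def forward(self, z):                              # z = z^{l-1}, shape (B, N, dim)
        h = self.norm1(z)
        z_hat = self.w_msa(h, h, h)[0] + z             # Eq. (4)
        z_l = self.mlp1(self.norm2(z_hat)) + z_hat     # Eq. (5)
        h = self.norm3(z_l)
        z_hat2 = self.sw_msa(h, h, h)[0] + z_l         # Eq. (6)
        return self.mlp2(self.norm4(z_hat2)) + z_hat2  # Eq. (7)

print(SwinBlockPair(96)(torch.randn(2, 49, 96)).shape)  # torch.Size([2, 49, 96])
```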

2.4. Loss Function Design

In the linear classifier, the features output by the Swin Transformer are first fed into an average pooling layer, then passed through a normalization layer, and finally the prediction results are obtained using a fully connected layer. The loss function evaluates the extent to which the model's predictions differ from the true results and is therefore of great importance in the optimization of the model. The loss function commonly used in image classification is the cross-entropy loss, which can be expressed by Equation (8):
$L_{cro} = -\dfrac{1}{M}\displaystyle\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_j)\log\bigl(q(x_j)\bigr) \qquad (8)$
where $M$ and $N$ are the size of the mini-batch and the number of classification classes, respectively, $p(x_j) \in \{0, 1\}$ is the ground truth of sample $x$ belonging to class $j$, and $q(x_j) \in [0, 1]$ represents the probability of sample $x$ being predicted as class $j$.
The GCI classification task can be regarded as a fine-grained classification. The feature difference between cloud regions in the image is small, so the simple cross-entropy loss is not sufficient to adequately supervise the learning of such fine-grained features. Therefore, we introduce the center loss [53] to improve the supervisory ability of the model. The center loss function is defined in Equation (9).
$L_{cen} = \dfrac{1}{2}\displaystyle\sum_{i=1}^{M} \bigl\| x_i - c_{y_i} \bigr\|_2^2 \qquad (9)$
where $M$ is the size of the mini-batch and $x_i$ denotes the deep feature extracted from the $i$-th sample. $c_{y_i} \in \mathbb{R}^d$ denotes the class center of the deep features for class $y_i$, where $d$ is the feature dimension, and $c_{y_i}$ is updated as the deep features change. This function minimizes the intra-class distance while ensuring the separability of inter-class features, thus improving the discriminability between features. In conclusion, the final loss function of the whole model can be expressed as:
$Loss = L_{cro} + \lambda L_{cen} \qquad (10)$
where λ is the hyperparameter to balance the two loss functions.
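A minimal sketch of this joint objective is given below; the class centers are kept as a learnable parameter updated by the optimizer, which is a common simplification of the center update rule described in [53].

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss (Equation (9)): pulls each deep feature toward its class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # Class centers c_{y_i}; here learned jointly with the network parameters,
        # a simplification of the update rule in the original center loss paper.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):             # features: (M, d), labels: (M,)
        centers = self.centers[labels]                # select c_{y_i} for every sample
        return 0.5 * (features - centers).pow(2).sum()

# Joint objective of Equation (10): Loss = L_cro + lambda * L_cen
cross_entropy = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes=7, feat_dim=768)
lam = 0.01                                            # value selected in Section 4.2

features = torch.randn(8, 768, requires_grad=True)    # pooled deep features (stand-in)
logits = torch.randn(8, 7, requires_grad=True)        # classifier outputs (stand-in)
labels = torch.randint(0, 7, (8,))
loss = cross_entropy(logits, labels) + lam * center_loss(features, labels)
loss.backward()
```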

3. Datasets and Experimental Settings

In this section, we introduce three GCI datasets and then describe the relevant experimental settings.

3.1. Dataset Description

We use three GCI datasets in the experiments, including the All-Sky Ground-based Cloud (ASGC), Cirrus Cumulus Stratus Nimbus (CCSN), and the Ground-based Cloud Dataset (GCD).

3.1.1. All-Sky Ground-Based Cloud (ASGC)

The cloud images in this dataset were captured by an all-sky camera [54] located in Muztagh, Xinjiang (38.19°N, 74.53°E). The all-sky camera consists of a Sigma 4.5 mm fisheye lens and a Canon 700D camera, giving a maximum field of view of 180°. Different from conventional GCIs, the sky in this dataset is mapped as a circle, where the center is the zenith and the boundary is the horizon. Images are captured every 20 min during the day and every 5 min at night, and the exposure time is adjusted between 15 and 30 s according to the moon phase. All images are stored in color JPEG format with a resolution of 460 × 460 pixels; to meet the training requirements, they were uniformly resized to 448 × 448 pixels. The original dataset contains seven classes of cloud images: altocumulus (Ac), cumulonimbus (Cb), cirrus (Ci), clear (Cl), cumulus (Cu), mixed (Mi), and stratocumulus (Sc). The number of images per class varies from 210 to 450, and data augmentation is applied because such a small number of samples is prone to overfitting. The augmentation methods include horizontal flip, vertical flip, and random rotation. Example images of each class are shown in Figure 8, and the sample distribution of the training and testing sets in the expanded dataset is given in Table 2.
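The resizing and augmentation operations described above can be expressed with torchvision transforms as follows; the flip probabilities and rotation range are illustrative choices, not values reported in this paper.

```python
from torchvision import transforms

# The paper expands the datasets offline; here the same operations are
# expressed as an equivalent on-the-fly training transform.
train_transform = transforms.Compose([
    transforms.Resize((448, 448)),            # uniform input size
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal flip
    transforms.RandomVerticalFlip(p=0.5),     # vertical flip
    transforms.RandomRotation(degrees=180),   # random rotation
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])
```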

3.1.2. Cirrus Cumulus Stratus Nimbus (CCSN)

This dataset is an open-source dataset collected by the Nanjing University of Information Engineering. It is categorized according to the genus-based classification proposal of the World Meteorological Organization, which separates cloud images into the following types: altocumulus (Ac), altostratus (As), cumulonimbus (Cb), cirrocumulus (Cc), cirrus (Ci), cirrostratus (Cs), contrail (Ct), cumulus (Cu), nimbostratus (Ns), stratocumulus (Sc), and stratus (St). Owing to the large number of cloud types, the cloud images in this dataset exhibit large illumination and intra-class variations; more details are provided in [27].
All images are in a color JPEG format with a resolution of 256 × 256 pixels. To meet the input size requirements of the model, the images were uniformly resized to 448 × 448 pixels using bilinear interpolation [55]. The number of cloud images in each class varies from 140 to 340. In order to avoid the occurrence of overfitting, data augmentation is also used to expand the dataset. Figure 9 shows example images from the dataset, and the number of samples used for training and testing is described in Table 3.

3.1.3. Ground-Based Cloud Dataset (GCD)

The images in this dataset were captured by camera sensors in nine Chinese provinces over a period of more than one year and have a great diversity. This dataset classifies images according to the classification criteria published by the World Meteorological Organization, specifically, altocumulus (Ac), cumulonimbus (Cb), cirrus (Ci), clear (Cl), cumulus (Cu), mixed (Mi), and stratocumulus (Sc). All images are stored in a JPEG format with a resolution of 512 × 512 pixels, and more details of the dataset are presented in [29]. In this paper, the image size is uniformly resized to 448 × 448 pixels. An example of images in the dataset is shown in Figure 10. Table 4 describes the distribution of images in this dataset.

3.2. Experimental Setup

3.2.1. Implementation Details

The proposed method is implemented on a computer with an Intel(R) Core(TM) i7-8750H CPU @ 3.20 GHz and 32.0 GB of RAM, using an NVIDIA GeForce GTX 2070 super 16 G graphics processing unit (GPU). The code is written in Python, and PyTorch is chosen as the deep learning framework.
In order to improve the convergence speed and generalization ability of the model, we adopt transfer learning for training. The weights obtained by pretraining the Swin Transformer on the ImageNet-1K dataset are used to initialize the model, and the model proposed in this paper is then trained from these initial weights, which not only shortens the training time but also helps avoid overfitting. The Adaptive Momentum Estimation with Weight Decay (AdamW) [56] optimizer is used for optimization; it adds a decoupled weight-decay term to the Adam optimizer, alleviating parameter overfitting while retaining fast gradient descent. In addition, the model uses a cosine annealing learning-rate schedule (CosineAnnealingLR) [56] with an initial learning rate of 0.0016. This schedule lets the learning rate fall rapidly and then restart at a higher value during training, helping the optimization avoid poor local minima and approach the true global minimum.
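A sketch of this training configuration is shown below; only the initial learning rate is taken from the text, while the weight decay, epoch count, and the model stand-in are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 7)                  # stand-in for the full network
epochs = 100                               # illustrative; not reported in the paper

# AdamW: Adam with decoupled weight decay [56].
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0016, weight_decay=0.05)

# Cosine annealing of the learning rate over training.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the training set, with optimizer.step() per batch ...
    scheduler.step()                       # update the learning rate once per epoch
```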

3.2.2. Evaluation Metric

In order to comprehensively evaluate the classification performance of the proposed method for various types of images, accuracy, precision, and recall are calculated as the evaluation metrics in this paper. Accuracy can be calculated based on positive and negative samples as:
$\mathrm{Accuracy}\,(Acc) = \dfrac{TP + TN}{TP + TN + FP + FN} \qquad (11)$
where TP (True Positive) is the number of correctly classified samples for a specific class, TN (True Negative) is the number of correctly classified samples for the remaining classes, FP (False Positive) is the number of misclassified samples for the remaining classes, and FN (False Negative) is the number of misclassified samples for a specific class. Precision and recall can be expressed as:
$\mathrm{Precision}\,(Pr) = \dfrac{TP}{TP + FP} \qquad (12)$
$\mathrm{Recall}\,(Re) = \dfrac{TP}{TP + FN} \qquad (13)$
In addition, we also use an F1_score for evaluation, the expression of which is shown in Equation (14).
$F1\_score = \dfrac{2 \times Pr \times Re}{Pr + Re} \qquad (14)$
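These per-class metrics can be computed directly from a confusion matrix, treating each class in turn as the positive class; a small NumPy sketch with a toy three-class matrix:

```python
import numpy as np

def per_class_metrics(conf):
    """Per-class precision, recall, F1 and overall accuracy from a confusion
    matrix `conf`, where conf[true_class, predicted_class] counts samples."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp            # predicted as class c but actually another class
    fn = conf.sum(axis=1) - tp            # actually class c but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / conf.sum()
    return precision, recall, f1, accuracy

conf = np.array([[290,   6,   4],
                 [  8, 285,   7],
                 [  5,   9, 286]])        # toy 3-class confusion matrix
print(per_class_metrics(conf))
```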

4. Experimental Results and Discussion

4.1. Results of GCI Classification

The classification results for each class in the different datasets and the overall classification accuracies are provided in Table 5, Table 6 and Table 7. The proposed method achieves accuracies of 94.24%, 92.73%, and 93.57% in the ASGC, CCSN, and GCD datasets, respectively. In the ASGC dataset, precision and recall exceed 89% for all classes except Sc; the highest precision is 99.66% for Cu and the highest recall is 100% for Cl, whereas both precision and recall are below 90% for Sc. Sorted by F1_score from largest to smallest, the seven classes rank as Cl, Cu, Mi, Ac, Ci, Cb, and Sc. In the CCSN dataset, precision and recall are greater than 87% for all classes; the highest precision is 97.18% for Cb and the highest recall is 98.67% for Ct, while the lowest precision and recall are 87.21% and 88.67%, respectively, for St. Sorted by F1_score, the classes rank as Ct, Ci, Cb, Ac, Cu, As, Cc, Ns, Sc, Cs, and St. In the GCD dataset, both precision and recall are greater than 85% for all classes; the highest precision is 98.62% for Cu and the highest recall is 99.33% for Cl, while the lowest precision is 89.49% for Cb and the lowest recall is 85.33% for Sc. Sorted by F1_score, the classes rank as Cl, Cu, Ac, Ci, Mi, Cb, and Sc. In conclusion, the method identifies the various types of cloud images well and is capable of classifying cloud images automatically.
To analyze the misclassifications, the confusion matrices of the different datasets are shown in Figure 11. The horizontal axis in each panel indicates the true image class, the vertical axis indicates the predicted image class, and the values in the off-diagonal elements represent the number of misclassifications between classes. The figure shows that, in the ASGC and GCD datasets, images of the Cl class are correctly classified most often, and the misclassified images are mainly from Cb and Sc; this is because some Sc images are affected by illumination that makes the cloud bases appear dark black, so they are easily confused with Cb. In addition, the movement of clouds changes the shooting viewpoint, which increases the difficulty of identification. In the CCSN dataset, images of the Ct class are correctly classified in the largest number, and the misclassified images are mainly from St, Ns, and Sc, which is predictable because these are all low-level clouds with relatively similar structure and transparency.

4.2. Parameter Analysis

To provide a comprehensive study of the proposed method, this section analyzes the effect of hyperparameter λ in the loss function on the classification results. We vary λ from 0 to 0.1 to learn different models, and the accuracy of these models in the ASGC, CCSN, and GCD datasets is shown in Figure 12.
It can be seen that when λ is 0, the loss function contains only cross-entropy loss, which cannot adequately supervise feature learning at this point, resulting in poor performance. When the center loss is supplemented, the joint supervision improves the discriminative power of the deep features, thus improving the classification accuracy. In addition, the model performance remains largely stable over a large range of λ , where the proposed method obtains the best results when λ takes the value of 0.01. Therefore, λ is set to 0.01 to obtain the best classification performance for GCI classification when conducting experiments on each dataset.

4.3. Feature Visualization

In order to illustrate the feature extraction ability of the proposed method more intuitively, we use the Gradient-weighted Class Activation Mapping (Grad-CAM) [57] method for feature visualization. The method highlights the image regions that are important for the prediction by generating a coarse attention map from the last layer of the model. The brighter the color in the attention map, the more important the corresponding region of the image. Some images are selected from the ASGC dataset for testing, and the results are shown in Figure 13. It can be seen that the proposed method highlights the regions belonging to the true class of the object and has strong localization and recognition abilities.
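A hedged sketch of how such Grad-CAM attention maps can be produced with forward and backward hooks is given below; the model here is a generic convolutional stand-in, and target_layer is whichever layer of the trained network one chooses to visualize.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx=None):
    """Compute a Grad-CAM heat map for one image of shape (1, 3, H, W)."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                      # gradients of the class score
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # per-channel importance
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Stand-in CNN just to show the call pattern; not the model used in the paper.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 7))
heatmap = grad_cam(model, model[0], torch.randn(1, 3, 448, 448))
print(heatmap.shape)  # torch.Size([1, 1, 448, 448])
```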

4.4. Comparison with Other Methods

To verify the superiority of the proposed method, we compared it with other advanced methods, including CNN-based methods proposed for GCI classification such as DeepCloud [23], CloudNet [27], and CGAT [29], as well as classical CNN models such as GoogLeNet [58], VGG16 [59], ResNet34 [31], ResNet50 [31], DenseNet [60], MobileNet [61], EfficientNet-B0 [47], and EfficientNet-B5 [47]. In addition, we compared it with other Transformer-based classification models, such as ViT-B [39], ViT-L [39], and Swin-T [52]. The training and testing samples of each dataset were kept constant in the experiments, and the final testing results of each dataset are shown in Table 8.
It can be seen from the experimental results that EfficientNet-B5 achieves the highest accuracy among the CNN-based methods, with 91.47%, 89.97%, and 90.48% in the ASGC, CCSN, and GCD datasets, respectively. The classification accuracy of the Transformer-based methods is higher than that of the CNN-based methods, which indicates that Transformer is more capable of extracting image features. The accuracy of the original Swin-T on each dataset reaches 92.86%, 91.06%, and 92.38%. The proposed method combines the CNN with Transformer and optimizes the loss function, compensating for the lack of local features and enhancing supervised feature learning, thus improving the classification accuracy. With the proposed method, the accuracy is improved by 1.38%, 1.67%, and 1.19% compared with the original Swin-T.

5. Conclusions

Transformer is a powerful deep neural network for processing sequences, but it has received little attention in the field of ground-based cloud image processing. In this paper, we apply Transformer to GCIs and propose a novel GCI classification method. Different from traditional CNN-based methods, our method combines the Transformer and CNN models. Specifically, the CNN model is used as a low-level feature extraction tool to generate the local feature sequences of images, and the Transformer effectively extracts long-range dependencies between the sequences from the low-level features. In this way, both the local features and the global features of cloud images can be extracted. In addition, the center loss is introduced to supplement the cross-entropy loss and enhance supervised feature learning. We evaluate the performance of the proposed method on three different GCI datasets. Compared with several other advanced methods, our method achieves the highest accuracy, with 94.24%, 92.73%, and 93.57% in the ASGC, CCSN, and GCD datasets, respectively. The proposed method shows the great potential of Transformer for GCI classification, and in future work we will continue to study various improvements to Transformer to further improve the classification performance.

Author Contributions

Data curation, X.L., G.C. and L.Z.; investigation, C.W.; methodology, X.L. and B.Q.; resources, G.C.; Software, X.L.; supervision, B.Q.; validation, G.C. and C.W.; writing—original draft, X.L.; writing—review and editing, B.Q., C.W. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Joint Research Fund in Astronomy under a cooperative agreement between the National Natural Science Foundation of China (NSFC) and the Chinese Academy of Sciences (CAS) under Grant No. U1931134, and by the Natural Science Foundation of Hebei under Grant No. A2020202001.

Data Availability Statement

The ASGC dataset was accessed from https://github.com/lixiaotong-su/All-Sky-Ground-based-Cloud (accessed on 5 July 2022). The CCSN and GCD datasets used in this work belong to open-source datasets available in their corresponding references within this manuscript.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive and valuable suggestions on the earlier drafts of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nespoli, A.; Niccolai, A.; Ogliari, E.; Perego, G.; Collino, E.; Ronzio, D. Machine Learning techniques for solar irradiation nowcasting: Cloud type classification forecast through satellite data and imagery. Appl. Energ. 2022, 305, 117834. [Google Scholar] [CrossRef]
  2. Cao, Z.H.; Hao, J.X.; Feng, L.; Jones, H.R.; Li, J.; Xu, J. Data processing and data products from 2017 to 2019 campaign of astronomical site testing at Ali, Daocheng and Muztagh-ata. Res. Astron. Astrophys. 2020, 20, 82. [Google Scholar] [CrossRef]
  3. Westerhuis, S.; Fuhrer, O.; Bhattacharya, R.; Schmidli, J.; Bretherton, C. Effects of terrain-following vertical coordinates on simulation of stratus clouds in numerical weather prediction models. Q. J. R. Meteorol. Soc. 2021, 147, 94–105. [Google Scholar] [CrossRef]
  4. Long, C.N.; Sabburg, J.M.; Calbó, J.; Pagès, D. Retrieving cloud characteristics from ground-based daytime color all-sky images. J. Atmos. Ocean. Technol. 2006, 23, 633–652. [Google Scholar] [CrossRef]
  5. Huang, W.; Wang, Y.; Chen, X. Cloud detection for high-resolution remote-sensing images of urban areas using colour and edge features based on dual-colour models. Int. J. Remote Sens. 2018, 39, 6657–6675. [Google Scholar] [CrossRef]
  6. Liu, Y.; Tang, Y.; Hua, S.; Luo, R.; Zhu, Q. Features of the cloud base height and determining the threshold of relative humidity over southeast China. Remote Sens. 2019, 11, 2900. [Google Scholar] [CrossRef]
  7. Zhou, C.; Zelinka, M.D.; Klein, S.A. Impact of decadal cloud variations on the Earth’s energy budget. Nat. Geosci. 2016, 9, 871. [Google Scholar] [CrossRef]
  8. Manzo, M.; Pellino, S. Voting in transfer learning system for ground-based cloud classification. Mach. Learn. Knowl. Extr. 2021, 3, 542–553. [Google Scholar] [CrossRef]
  9. Wild, M.; Hakuba, M.Z.; Folini, D.; Dörig-Ott, P.; Schär, C.; Kato, S.; Long, C.N. The cloud-free global energy balance and inferred cloud radiative effects: An assessment based on direct observations and climate models. Clim. Dynam. 2019, 52, 4787–4812. [Google Scholar] [CrossRef]
  10. Huertas-Tato, J.; Rodríguez-Benítez, F.J.; Arbizu-Barrena, C.; Aler-Mur, R.; Galvan-Leon, I.; Pozo-Vázquez, D. Automatic cloud-type classification based on the combined use of a sky camera and a ceilometer. J. Geophys. Res. Atmos. 2017, 122, 11045–11061. [Google Scholar] [CrossRef]
  11. Zhong, B.; Chen, W.; Wu, S.; Hu, L.; Luo, X.; Liu, Q. A cloud detection method based on relationship between objects of cloud and cloud-shadow for Chinese moderate to high resolution satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 4898–4908. [Google Scholar] [CrossRef]
  12. Young, A.H.; Knapp, K.R.; Inamdar, A.; Hankins, W.; Rossow, W.B. The international satellite cloud climatology project H-Series climate data record product. Earth Syst. Sci. Data 2018, 10, 583–593. [Google Scholar] [CrossRef]
  13. Kumthekar, A.; Reddy, G.R. An integrated deep learning framework of U-Net and inception module for cloud detection of remote sensing images. Arab. J. Geosci. 2021, 14, 1900. [Google Scholar] [CrossRef]
  14. Jain, M.; Gollini, I.; Bertolotto, M.; McArdle, G.; Dev, S. An extremely-low cost ground-based whole sky imager. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium, Brussels, Belgium, 11–16 July 2021; pp. 8209–8212. [Google Scholar]
  15. Nouri, B.; Wilbert, S.; Segura, L.; Kuhn, P.; Hanrieder, N.; Kazantzidis, A. Determination of cloud transmittance for all sky imager based solar nowcasting. Sol. Energy 2019, 181, 251–263. [Google Scholar] [CrossRef]
  16. Nouri, B.; Kuhn, P.; Wilbert, S.; Hanrieder, N.; Prahl, C.; Zarzalejo, L. Cloud height and tracking accuracy of three all sky imager systems for individual clouds. Sol. Energy 2019, 177, 213–228. [Google Scholar] [CrossRef]
  17. Heinle, A.; Macke, A.; Srivastav, A. Automatic cloud classification of whole sky images. Atmos. Meas. Technol. 2010, 3, 557–567. [Google Scholar] [CrossRef]
  18. Li, Q.; Zhang, Z.; Lu, W.; Yang, J.; Ma, Y.; Yao, W. From pixels to patches: A cloud classification method based on a bag of micro-structures. Atmos. Meas. Technol. 2016, 9, 753–764. [Google Scholar] [CrossRef]
  19. Dev, S.; Lee, Y.H.; Winkler, S. Categorization of cloud image patches using an improved texton-based approach. In Proceedings of the 2015 IEEE International Conference on Image Processing, Quebec City, QC, Canada, 27–30 September 2015; pp. 422–426. [Google Scholar]
  20. Xiao, Y.; Cao, Z.; Zhuo, W.; Ye, L.; Zhu, L. mCLOUD: A multiview visual feature extraction mechanism for ground-based cloud image categorization. J. Atmos. Ocean. Technol. 2016, 33, 789–801. [Google Scholar] [CrossRef]
  21. Zhuo, W.; Cao, Z.; Xiao, Y. Cloud classification of ground-based images using texture–structure features. J. Atmos. Ocean. Technol. 2014, 31, 79–92. [Google Scholar] [CrossRef]
  22. Shi, C.; Wang, C.; Wang, Y.; Xiao, B. Deep convolutional activations-based features for ground-based cloud classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 816–820. [Google Scholar] [CrossRef]
  23. Ye, L.; Cao, Z.; Xiao, Y. DeepCloud: Ground-based cloud image categorization using deep convolutional features. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5729–5740. [Google Scholar] [CrossRef]
  24. Zhao, X.; Wei, H.; Wang, H.; Zhu, T.; Zhang, K. 3D-CNN-based feature extraction of ground-based cloud images for direct normal irradiance prediction. Sol. Energy 2019, 181, 510–518. [Google Scholar] [CrossRef]
  25. Zhao, M.; Chang, C.H.; Xie, W.; Xie, Z.; Hu, J. Cloud shape classification system based on multi-channel cnn and improved fdm. IEEE Access 2020, 8, 44111–44124. [Google Scholar] [CrossRef]
  26. Li, M.; Liu, S.; Zhang, Z. Dual guided loss for ground-based cloud classification in weather station networks. IEEE Access 2019, 7, 63081–63088. [Google Scholar] [CrossRef]
  27. Zhang, J.; Liu, P.; Zhang, F.; Song, Q. CloudNet: Ground-based cloud classification with deep convolutional neural network. Geophys. Res. Lett. 2018, 45, 8665–8672. [Google Scholar] [CrossRef]
  28. Liu, S.; Li, M.; Zhang, Z.; Cao, X.; Durrani, T.S. Ground-based cloud classification using task-based graph convolutional network. Geophys. Res. Lett. 2020, 47, e2020GL087338. [Google Scholar] [CrossRef]
  29. Liu, S.; Duan, L.; Zhang, Z.; Cao, X.; Durrani, T.S. Ground-Based Remote Sensing Cloud Classification via Context Graph Attention Network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602711. [Google Scholar] [CrossRef]
  30. Liu, S.; Li, M.; Zhang, Z.; Xiao, B.; Cao, X. Multimodal ground-based cloud classification using joint fusion convolutional neural network. Remote Sens. 2018, 10, 822. [Google Scholar] [CrossRef]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Liu, S.; Li, M.; Zhang, Z.; Xiao, B.; Durrani, T.S. Multi-evidence and multi-modal fusion network for ground-based cloud recognition. Remote Sens. 2020, 12, 464. [Google Scholar] [CrossRef]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  34. Mareček, D.; Rosa, R. Extracting syntactic trees from transformer encoder self-attentions. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 347–349. [Google Scholar]
  35. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  36. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  37. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  38. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064. [Google Scholar]
  39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  40. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, online. 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  41. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer Neural Network for Weed and Crop Classification of High Resolution UAV Images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
  42. Chen, Y.; Gu, X.; Liu, Z.; Liang, J. A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method. Remote Sens. 2022, 14, 1877. [Google Scholar] [CrossRef]
  43. Shome, D.; Kar, T.; Mohanty, S.N.; Tiwari, P.; Muhammad, K.; AlTameem, A.; Zhang, Y.Z.; Saudagar, A.K.J. COVID-transformer: Interpretable COVID-19 detection using vision transformer for healthcare. Int. J. Environ. Res. Public Health 2021, 18, 11086. [Google Scholar] [CrossRef] [PubMed]
  44. He, X.; Chen, Y.; Lin, Z. Spatial-spectral transformer for hyperspectral image classification. Remote Sens. 2021, 13, 498. [Google Scholar] [CrossRef]
  45. Jogin, M.; Madhulika, M.S.; Divya, G.D.; Meghana, R.K.; Apoorva, S. Feature extraction using convolution neural networks (CNN) and deep learning. In Proceedings of the 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 18–19 May 2018; pp. 2319–2323. [Google Scholar]
  46. Liu, Y.; Pu, H.; Sun, D.W. Efficient extraction of deep image features using convolutional neural network (CNN) for applications in detecting and analysing complex food matrices. Trends Food Sci. Technol. 2021, 113, 193–204. [Google Scholar] [CrossRef]
  47. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  48. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  49. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  50. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  51. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  52. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  53. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 499–515. [Google Scholar]
  54. Mommert, M. Cloud Identification from All-sky Camera Data with Machine Learning. Astron. J. 2020, 159, 178. [Google Scholar] [CrossRef]
  55. Mastyło, M. Bilinear interpolation theorems and applications. J. Funct. Anal. 2013, 265, 185–207. [Google Scholar] [CrossRef]
  56. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  57. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  58. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  59. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  60. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  61. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
Figure 1. Overall architecture of the proposed method.
Figure 2. The structure of Mobile Inverted Bottleneck Convolution (MBConv).
Figure 3. The structure of (a) the Vision Transformer (ViT) and (b) the transformer encoder.
Figure 4. The illustration of self-attention in Transformer. (a) The self-attention; (b) the multi-head self-attention.
Figure 5. Illustration of hierarchical feature maps.
Figure 6. The illustration of the shifted window approach.
Figure 7. The structure of two successive Swin Transformer blocks.
Figure 8. Example images from the ASGC dataset: (a) altocumulus (Ac); (b) cumulonimbus (Cb); (c) cirrus (Ci); (d) clear (Cl); (e) cumulus (Cu); (f) mixed (Mi); (g) stratocumulus (Sc).
Figure 9. Example images from the CCSN dataset: (a) altocumulus (Ac); (b) altostratus (As); (c) cumulonimbus (Cb); (d) cirrocumulus (Cc); (e) cirrus (Ci); (f) cirrostratus (Cs); (g) contrail (Ct); (h) cumulus (Cu); (i) nimbostratus (Ns); (j) stratocumulus (Sc); (k) stratus (St).
Figure 10. Example images from the GCD dataset: (a) altocumulus (Ac); (b) cumulonimbus (Cb); (c) cirrus (Ci); (d) clear (Cl); (e) cumulus (Cu); (f) mixed (Mi); (g) stratocumulus (Sc).
Figure 11. Confusion matrix for different datasets: (a) ASGC; (b) CCSN; (c) GCD.
Figure 11. Confusion matrix for different datasets: (a) ASGC; (b) CCSN; (c) GCD.
Figure 12. Classification accuracy of the proposed method for different values of the hyperparameter λ.
Figure 13. Grad-CAM visualization results in the ASGC dataset. Each input image is shown on the second line, and attention maps are shown on the first line: (a) Ac; (b) Cb; (c) Ci; (d) Cu.
Table 1. The details of the proposed model.

| Layer Name | Output Size | Output Channels | Layers |
|---|---|---|---|
| Conv (3 × 3) | 224 × 224 | 48 | 1 |
| MBConv1 (3 × 3) | 224 × 224 | 24 | 3 |
| MBConv6 (3 × 3) | 112 × 112 | 40 | 5 |
| MBConv6 (5 × 5) | 56 × 56 | 64 | 5 |
| Conv (1 × 1) | 56 × 56 | 48 | 1 |
| Linear Embedding and Block (4×) | 56 × 56 | 96 | 2 |
| Patch Merging and Block (8×) | 28 × 28 | 192 | 2 |
| Patch Merging and Block (16×) | 14 × 14 | 384 | 6 |
| Patch Merging and Block (32×) | 7 × 7 | 768 | 2 |
| Average Pooling | 1 × 1 | 768 | 1 |
| Layer Norm | 1 × 1 | 768 | 1 |
| Fully Connected | 1 × 1 | - | 1 |
Table 2. Training and testing samples for the ASGC dataset.

| No | Class | Training | Testing | Total |
|---|---|---|---|---|
| 1 | Ac | 2640 | 300 | 2940 |
| 2 | Cb | 2740 | 300 | 3040 |
| 3 | Ci | 2280 | 300 | 2580 |
| 4 | Cl | 2850 | 300 | 3150 |
| 5 | Cu | 2360 | 300 | 2660 |
| 6 | Mi | 2410 | 300 | 2710 |
| 7 | Sc | 2620 | 300 | 2920 |
| | Total | 17,900 | 2100 | 20,000 |
Table 3. Training and testing samples for the CCSN dataset.

| No | Class | Training | Testing | Total |
|---|---|---|---|---|
| 1 | Ac | 2550 | 300 | 2850 |
| 2 | As | 2320 | 300 | 2620 |
| 3 | Cb | 2440 | 300 | 2740 |
| 4 | Cc | 2430 | 300 | 2730 |
| 5 | Ci | 2660 | 300 | 2960 |
| 6 | Cs | 2310 | 300 | 2610 |
| 7 | Ct | 2280 | 300 | 2580 |
| 8 | Cu | 2350 | 300 | 2650 |
| 9 | Ns | 2240 | 300 | 2540 |
| 10 | Sc | 2420 | 300 | 2720 |
| 11 | St | 2700 | 300 | 3000 |
| | Total | 26,700 | 3300 | 30,000 |
Table 4. Training and testing samples for the GCD dataset.

| No | Class | Training | Testing | Total |
|---|---|---|---|---|
| 1 | Ac | 2276 | 300 | 2576 |
| 2 | Cb | 3210 | 300 | 3510 |
| 3 | Ci | 2424 | 300 | 2724 |
| 4 | Cl | 2718 | 300 | 3018 |
| 5 | Cu | 2350 | 300 | 2650 |
| 6 | Mi | 1826 | 300 | 2126 |
| 7 | Sc | 2096 | 300 | 2396 |
| | Total | 16,900 | 2100 | 19,000 |
Table 5. The classification results for the ASGC dataset. Acc (%) is the overall accuracy over all classes.

| No | Class | Pr (%) | Re (%) | F1_score (%) | Acc (%) |
|---|---|---|---|---|---|
| 1 | Ac | 92.38 | 97.00 | 94.63 | |
| 2 | Cb | 91.92 | 91.00 | 91.46 | |
| 3 | Ci | 94.04 | 89.33 | 91.62 | |
| 4 | Cl | 98.36 | 100 | 99.17 | 94.24 |
| 5 | Cu | 99.66 | 97.00 | 98.31 | |
| 6 | Mi | 97.30 | 96.00 | 96.65 | |
| 7 | Sc | 86.45 | 89.33 | 87.87 | |
Table 6. The classification results for the CCSN dataset. Acc (%) is the overall accuracy over all classes.

| No | Class | Pr (%) | Re (%) | F1_score (%) | Acc (%) |
|---|---|---|---|---|---|
| 1 | Ac | 96.17 | 92.00 | 94.04 | |
| 2 | As | 93.27 | 92.33 | 92.80 | |
| 3 | Cb | 97.18 | 92.00 | 94.52 | |
| 4 | Cc | 93.24 | 92.00 | 92.62 | |
| 5 | Ci | 92.65 | 96.67 | 94.62 | |
| 6 | Cs | 89.14 | 90.33 | 89.73 | 92.73 |
| 7 | Ct | 96.10 | 98.67 | 97.37 | |
| 8 | Cu | 92.79 | 94.33 | 93.55 | |
| 9 | Ns | 92.26 | 91.33 | 91.79 | |
| 10 | Sc | 90.46 | 91.67 | 91.06 | |
| 11 | St | 87.21 | 88.67 | 87.93 | |
Table 7. The classification results for the GCD dataset. Acc (%) is the overall accuracy over all classes.

| No | Class | Pr (%) | Re (%) | F1_score (%) | Acc (%) |
|---|---|---|---|---|---|
| 1 | Ac | 96.32 | 96.00 | 96.16 | |
| 2 | Cb | 89.49 | 88.00 | 88.74 | |
| 3 | Ci | 91.90 | 98.33 | 95.01 | |
| 4 | Cl | 97.07 | 99.33 | 98.19 | 93.57 |
| 5 | Cu | 98.62 | 95.00 | 96.78 | |
| 6 | Mi | 90.88 | 93.00 | 91.93 | |
| 7 | Sc | 90.78 | 85.33 | 87.97 | |
Table 8. The classification results of different methods based on different datasets.

| Method | ASGC Acc (%) | CCSN Acc (%) | GCD Acc (%) |
|---|---|---|---|
| DeepCloud | 85.67 | 85.15 | 84.76 |
| CloudNet | 85.05 | 84.42 | 84.24 |
| CGAT | 88.57 | 87.33 | 88.95 |
| GoogLeNet | 85.52 | 84.06 | 84.14 |
| VGG16 | 88.24 | 86.85 | 87.14 |
| ResNet34 | 89.05 | 87.58 | 88.05 |
| ResNet50 | 90.43 | 88.39 | 88.14 |
| DenseNet | 90.62 | 88.91 | 88.57 |
| MobileNet | 91.05 | 89.39 | 89.29 |
| EfficientNet-B0 | 91.38 | 89.67 | 89.90 |
| EfficientNet-B5 | 91.47 | 89.97 | 90.48 |
| ViT-B | 92.19 | 90.55 | 91.19 |
| ViT-L | 92.71 | 91.15 | 92.33 |
| Swin-T | 92.86 | 91.06 | 92.38 |
| Ours | 94.24 | 92.73 | 93.57 |