1. Introduction
Clouds are visible aggregates of water droplets and/or ice crystals suspended in the atmosphere, formed when water vapor condenses as air cools, and they cover about 70% of the Earth's surface. The study of clouds and their properties plays an important role in many applications, such as climate simulation, weather forecasting, meteorological studies, solar energy production, and satellite communications [1,2,3]. Clouds are also closely linked to the hydrological cycle, affecting the energy balance on local and global scales through their interactions with solar and terrestrial radiation [4,5,6,7,8,9]. Because different cloud types have different radiative effects on the Earth's surface–atmosphere system, cloud type classification is of great importance [10].
There are two main methods of cloud observation: meteorological satellite observation [11,12,13] and ground-based remote sensing observation [14,15,16]. Satellite cloud images can capture clouds over large areas and allow direct observation of the effects of clouds on the Earth's radiation. However, their low spatial resolution prevents the study of local cloud details. Ground-based cloud image (GCI) classification is widely used to monitor the texture and distribution of clouds in local areas and has the advantages of flexible observation sites and rich image information, so it has become a popular research topic. As more and more sites require cloud monitoring, large numbers of images are generated simultaneously. Relying on experts alone to identify and classify these images is clearly time-consuming and is easily influenced by personal subjective factors. For these reasons, it is crucial to investigate methods that can automatically and accurately classify GCIs.
Early GCI classification was based on manually extracted features, and most methods employed brightness, texture, shape, and color features to represent image content. Heinle et al. [17] proposed a classification algorithm based on spectral features in red–green–blue (RGB) color space and texture features extracted using the Gray-Level Co-occurrence Matrix (GLCM). Li et al. [18] generated feature vectors by calculating the weighted frequency values of microstructures in each cloud image and then fed them into a support vector machine (SVM) classifier. Dev et al. [19] proposed an improved texton-based classification method that incorporates manually extracted color and texture features to improve classification. Xiao et al. [20] simultaneously extracted visual descriptors from the color, texture, and structure of clouds in a dense sampling manner and then performed feature encoding using the Fisher Vector, which characterizes image samples by the gradient of the log-likelihood with respect to the model parameters, to enhance classification performance. Zhuo et al. [21] applied the color census transform to compute cloud image statistics and extract texture and structure features, which were then fed into a conventional classifier.
In recent years, with the development of deep learning, many convolutional neural network (CNN)-based methods have made significant progress in GCI classification. Shi et al. [22] argued that locally rich information is more important than global layout information and therefore used deep convolutional activation-based features (DCAF) and shallow convolutional layer-based features for classification. Ye et al. [23] extracted multiscale feature maps from pretrained CNNs and then used Fisher Vector coding to perform spatial feature aggregation and high-dimensional feature mapping on the original deep convolutional features, aiming to find discriminative local information to better distinguish cloud types. Zhao et al. [24] used a 3D-CNN model to process multiple consecutive ground-based cloud images to extract cloud features such as texture and temporal information, followed by a fully connected layer for classification. Zhao et al. [25] proposed a multichannel CNN-based classification method that first extracts cloud objects from large images and then feeds them into multiple channels to extract features, thus improving classification accuracy. Li et al. [26] developed a dual-guided loss function for GCI classification that integrates information from different CNNs during optimization, thereby improving the discriminative ability of cloud feature representations. Zhang et al. [27] presented a GCI classification network called CloudNet, which includes four convolutional layers and two fully connected layers. In addition, some researchers have applied graph convolutional networks (GCNs) to GCI classification. Liu et al. [28] treated each cloud image as a node in a graph and then used a GCN to aggregate information from each cloud image and its connected images in a weighted manner to extract richer feature information. Liu et al. [29] proposed a classification method based on the context graph attention network (CGAT), which uses a context graph attention layer to learn context attention coefficients and obtains the aggregated features of graph nodes from these coefficients, addressing the problem that the weights assigned by a GCN do not accurately reflect the importance of connected nodes.
A cloud cannot be accurately described by the visual information contained in an image alone. Therefore, some researchers have fused visual information with nonvisual features related to cloud formation, such as air pressure, wind speed, temperature, and humidity, for classification. Liu et al. [30] proposed a joint fusion CNN that learns both visual and nonvisual features in one model, extracting features using ResNet50 [31] and then integrating them with a weighting strategy. Liu et al. [32] developed a multi-evidence multimodal fusion network (MMFN), which uses an attention network to extract local visual features while learning nonvisual features with a multimodal network; the authors also designed two dedicated fusion layers to fully fuse the two kinds of features. However, most GCI datasets contain only visual information in practice, so such approaches are not universally applicable.
A disadvantage of CNN models is that they do not handle global features well, which leads to underutilization of image features. In contrast, the recently emerged Transformer model can extract abundant global information. Transformer was originally proposed by Vaswani et al. [33] for natural language processing (NLP) problems, introducing a self-attention mechanism to perform global computation over the input sequence. In NLP, Transformer is gradually replacing recurrent neural networks (RNNs) [34,35]. Inspired by this, related works have applied Transformer to image processing, such as DETR for object detection [36] and SETR for semantic segmentation [37]. Many results have also been reported in image classification. Parmar et al. [38] fed the pixels of an image as a sequence into the Transformer, which achieved good results but at a high computational cost. Dosovitskiy et al. [39] proposed the Vision Transformer (ViT), which reduces computational complexity by first dividing images into patches before feeding them into the Transformer. Touvron et al. [40] suggested a knowledge distillation strategy for Transformer that relies on a distillation token to ensure that the student network learns feature information from the teacher network through attention. Due to the success of Transformer in natural image processing, some researchers have applied it to other fields. Reedha et al. [41] used a transfer learning strategy to apply ViT to unmanned aerial vehicle (UAV) image classification, outperforming state-of-the-art CNN models. Chen et al. [42] proposed a LeViT-based method for classifying asphalt pavement images, which consists of convolutional layers, Transformer stages, and classifier heads. Shome et al. [43] developed a ViT-based classification model for chest X-ray images, which outperformed previous methods. He et al. [44] proposed a Transformer-based hyperspectral image classification method that uses a CNN to extract spatial features while using a densely connected Transformer to capture sequential spectral relationships.
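To make the patch-based input of ViT-style models concrete, the short sketch below shows how an image can be split into non-overlapping patches and flattened into a token sequence before being fed to a Transformer; the 224×224 input size and 16×16 patch size are assumed illustrative values, not settings taken from the cited works.

```python
# Illustrative only: how a ViT-style model turns an image into a patch sequence
# before the Transformer. The 224x224 input and 16x16 patch size are assumed values.
import torch

image = torch.randn(1, 3, 224, 224)                     # (B, C, H, W) input image
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)     # 14x14 grid of 16x16 patches
patches = patches.contiguous().view(1, 3, -1, 16, 16)   # (B, C, 196, 16, 16)
tokens = patches.permute(0, 2, 1, 3, 4).flatten(2)      # (B, 196, 768) flattened patch tokens
print(tokens.shape)                                     # torch.Size([1, 196, 768])
```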
Transformer has proved very successful in several fields, but it has seen almost no reported application to GCI classification. GCI classification is in fact a complex problem: some images contain large cloud areas while others contain only small cloud regions, so models should be able to extract both global and local features. Motivated by these observations, this paper proposes a novel Transformer-based GCI classification method that first feeds images into a CNN to extract low-level features and form local feature sequences, and then uses a Transformer to learn the relationships among these sequences. The method thus captures both local and global image features, improving the discriminative ability of the model. To the best of our knowledge, this is the first time that Transformer has been introduced to GCI classification. Experiments on three GCI datasets show that the classification performance of the proposed method exceeds that of existing methods.
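For illustration only, the following minimal PyTorch sketch shows one way such a CNN-plus-Transformer hybrid could be assembled; the ResNet50 stem, token dimension, encoder depth, and class count are assumptions made for the example and do not reproduce the exact architecture proposed in this paper, which is detailed in Section 2.

```python
# Minimal PyTorch sketch (illustrative only) of a CNN + Transformer hybrid for
# ground-based cloud image classification. The ResNet50 stem, token dimension,
# encoder depth, and class count are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
from torchvision import models


class HybridCloudClassifier(nn.Module):
    def __init__(self, num_classes=7, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        # CNN stem: extracts low-level local features (positional encoding omitted for brevity).
        backbone = models.resnet50(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)       # project to token dimension
        # Transformer encoder: models global relationships among the local feature tokens.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feat = self.proj(self.stem(x))            # (B, D, h, w) low-level feature map
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, D) local feature sequence
        tokens = self.encoder(tokens)             # global interaction via self-attention
        return self.head(tokens.mean(dim=1))      # pool tokens, then classify


# Example: classify a batch of two 224x224 RGB cloud images into 7 cloud types.
logits = HybridCloudClassifier()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```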
The main contributions of this paper are summarized as follows:
(1) We apply Transformer to the GCI classification task and propose a Transformer-based classification method that combines the advantages of Transformer and CNN to extract both local and global image features, maximizing their complementary strengths for GCI classification.
(2) We optimize the loss function to enhance supervised feature learning by supplementing the cross-entropy loss with the center loss (a sketch of the combined objective is given after this list).
(3) An experimental evaluation is performed on three datasets (ASGC, CCSN, and GCD), and the results show that the proposed method achieves better classification accuracy than existing methods.
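As an illustration of contribution (2), the combined objective can be written as the cross-entropy loss plus a weighted center-loss term; the notation below (in particular the balancing weight λ) follows the standard center-loss formulation and is an assumed sketch rather than the exact form used in this paper, which is given in Section 2.

```latex
% Assumed sketch: cross-entropy loss plus center loss with a balancing weight \lambda.
\[
  L = L_{ce} + \lambda L_{c}, \qquad
  L_{c} = \frac{1}{2} \sum_{i=1}^{m} \bigl\| \mathbf{x}_i - \mathbf{c}_{y_i} \bigr\|_2^2 ,
\]
```

where x_i denotes the deep feature of the i-th sample in a mini-batch of size m, y_i its class label, and c_{y_i} the learnable center of class y_i; the center loss draws features toward their class centers, complementing the inter-class separability encouraged by the cross-entropy loss.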
The rest of the paper is structured as follows. Section 2 details the components and overall structure of the proposed method. Section 3 describes the GCI datasets and the experimental setup used in this paper. The experimental results and discussion are presented in Section 4. Section 5 concludes the study.