1. Introduction
Clouds cover more than 50% of the Earth’s surface and are an important meteorological element [1]. They have a significant impact on the global climate system, global irradiance and water vapor changes [2,3,4,5]. The observation and prediction of clouds are essential for weather monitoring, climate forecasting and the prediction of photovoltaic power generation. Cloud cover and cloud type are the two main directions of cloud observation [6], so it is important to study the segmentation and classification of cloud images [7,8,9].
There are three primary methods of acquiring cloud images: space-based satellites, air-based radio sounders and ground-based remote sensing observations [10]. Satellite observations are widely used in large-scale measurements; however, their low update frequency and resolution make satellite imagery impractical, expensive and time-consuming for hyperlocal cloud analysis. Although air-based radio sounders are effective in detecting vertical cloud structures, they are quite expensive. As a result, ground-based remote sensing observation devices such as the total-sky imager (TSI) [11] and the all-sky imager [12] have rapidly evolved. These devices provide high-resolution, low-cost remote sensing images, which can facilitate local cloud analysis.
In addition, with the proliferation of ground-based cloud images, the segmentation and classification of clouds in ground-based cloud images have been the subject of extensive academic research. Traditional classification and segmentation techniques for cloud maps rely heavily on feature extraction and conventional machine learning algorithms.
For instance, Long et al. [13] proposed a cloud segmentation algorithm based on RGB color channels, and Heinle et al. [14] classified clouds by extracting the spectral and texture features of cloud images using the k-nearest neighbor (KNN) algorithm. Kazantzidis et al. [15] proposed an improved KNN classification method that considers the color and texture characteristics of the cloud image while incorporating multi-modal information. Additionally, texture, local structure and statistical features were proposed by Zhou et al. [16] as inputs to support vector machines for cloud classification. Moreover, Dev et al. [17] used feature extraction and clustering to achieve cloud image segmentation via probabilistic pixel-wise binary segmentation. Zhu et al. [18] proposed a new channel attention module, ECA-WS, to improve the network’s ability to express channel information and used a decision fusion algorithm to address network size limitations and dataset imbalance.
However, these methods have disadvantages when dealing with high-dimensional, nonlinear, noisy and large-volume cloud image data, including low classification accuracy and inadequate feature extraction. Therefore, it is necessary to develop new algorithms to improve the accuracy and efficiency of classification and segmentation.
With the rapid advancement of deep learning technology in recent years, deep learning and neural networks have been widely used in the field of atmospheric detection [19], and the classification and segmentation of ground-based cloud images based on deep learning has become a popular research field. In particular, the application of convolutional neural networks (CNNs) to the classification and segmentation of ground-based cloud images has produced remarkable results. A CNN can automatically extract features from the original data while overcoming some limitations of traditional algorithms, significantly improving the classification and segmentation of ground-based cloud images.
Among these studies, Ye et al. [20] employed a CNN to extract cloud image features and used Fisher vector coding and an SVM classifier to classify cloud images. By optimizing the pooled feature map, Shi et al. [21] obtained deep features of the cloud image for cloud identification. Zhang et al. [9] proposed the CloudNet model and achieved high accuracy on the self-built CCSN dataset. In addition, Li et al. [22] introduced the two-lead loss into the field of cloud classification and segmentation, which integrates information from multiple CNNs during the learning process to improve the model’s performance. Moreover, Liu et al. [23] created the MGCD with 8000 ground-based cloud images and corresponding meteorological data and classified them with a multi-modal fusion algorithm. Huertas-Tato et al. [24] proposed an integrated learning algorithm that improves the classification accuracy by fusing the output probability vector of a CNN with a random forest classifier. Liu et al. [25] proposed the MMFN network, which combines heterogeneous features within a unified framework and learns extended cloud information. To enable lightweight mobile deployment, Gyasi and Swarnalatha [26] proposed Cloud-MobiNet, which can be deployed on smartphones.
In these previous studies, to our knowledge, cloud segmentation has been studied independently of cloud classification using several distinct methodologies. However, in areas such as PV forecasting, weather forecasting and climate research, cloud segmentation and classification are both crucial and widely utilized. Accordingly, we believe that the semantic segmentation information of cloud images is conducive to cloud classification tasks.
In this paper, considering the above-mentioned issues, we propose a deep convolutional neural network architecture for the joint segmentation and classification of ground-based cloud images, called CloudY-Net. This network improves on the Y-Net architecture by enhancing its classification branch. By utilizing only one network, our study achieves the dual tasks of cloud segmentation and classification, with the segmentation information in turn improving the accuracy of cloud classification. The classification branch and segmentation branch share weights, and the classification branch pulls feature maps from each segmentation encoder layer. The multi-head self-attention mechanism is used to increase the interactions among the feature vectors in the combined input; meanwhile, it mines deeper feature information and further improves the feature expression. By self-training each feature weight through the C-MoE module, the input to the classifier is weighted so as to significantly improve the cloud classification accuracy while ensuring that cloud segmentation remains effective.
To cater for both segmentation and classification training, we have produced a cloud segmentation dataset by annotating images from a publicly available multi-modal ground-based cloud dataset (MGCD) that contains cloud classification information. Named MGCD-Seg, this dataset contains 4000 ground-based cloud images and their corresponding semantic segmentation annotation files.
The contributions of this paper are summarized as follows.
The proposed CloudY-Net has both segmentation and classification branches, performing the dual task of cloud segmentation and cloud classification in one network.
The CloudY-Net improves on the traditional Y-Net with an enhanced classification branch by introducing more features from the segmentation branch. The classification accuracy is better than that of state-of-the-art neural networks.
We produce a new cloud segmentation dataset, MGCD-Seg, which contains 4000 ground-based cloud images and semantic segmentation annotation files.
3. Proposed Cloud Image Joint Segmentation and Classification Approach
3.1. Overall Framework of the Approach
Figure 3 depicts the proposed structure of CloudY-Net. The U-Net segmentation network serves as the backbone and segmentation branch of the joint classification-segmentation dual-task network. This branch has two components: an encoder and a decoder. The U-Net segmentation network compresses the input cloud image layer by layer using the convolutional layers of the encoder, resulting in a compact central block. Subsequently, the decoder generates the segmentation mask using both transposed and regular convolutional layers. These steps enable the network to produce accurate segmentation outputs. Meanwhile, skip connections are established between encoder and decoder feature maps of corresponding sizes to enhance the segmentation performance of the network. The Y-Net framework uses the feature information in the central block of the U-Net to create an additional classification branch from the central block [29]. In contrast, our modified CloudY-Net structure extracts features collectively from multiple layers in the segmentation branch, instead of focusing solely on the central-block features. This is because we believe that the features obtained only from the central block are too simple and limited; the other layers in the segmentation branch contain richer cloud shape features that can positively impact cloud classification if this feature information is properly processed.
To fully leverage these features, each of the four extracted feature maps is compressed using convolutional blocks. To ensure that the compressed feature vectors have a consistent scale, the output size of each convolutional block is reduced in proportion to the size of the corresponding encoder layer’s feature map, bringing feature maps of different sizes to a common scale. Subsequently, the four same-scale feature vectors are stacked and fed into the multi-head self-attention mechanism. This step enhances the interactions among the individual feature vectors, strengthens the correlations among features and extracts deep information. The scaled dot-product self-attention mechanism leaves the shape of the output features unchanged but increases their representational power.
The enhanced feature terms are then fed into the C-MoE module, which learns the significance of each feature term and outputs the corresponding weight coefficients. These coefficients are then used to weight the different feature terms, improving the utilization of each layer of features. The C-MoE weighted fusion produces a 2 × 256-dimensional feature vector, which serves as the input to the classifier.
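For illustration, the following minimal PyTorch sketch shows one way the classification branch could compress and stack features from the four encoder layers. The channel widths, the global-average-pooling compression and the 256-dimensional common scale are our illustrative assumptions, not the authors’ exact configuration (the paper compresses with proportionally sized convolutional blocks):

```python
import torch
import torch.nn as nn

class FeatureCompressor(nn.Module):
    """Compress one encoder feature map to a common scale.

    Each encoder layer yields a feature map of a different spatial size and
    channel width, so each gets its own conv block; here adaptive pooling
    stands in for the proportionally sized compression described in the text.
    """
    def __init__(self, in_channels: int, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims -> (B, out_dim, 1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x).flatten(1)  # (B, out_dim)

# Hypothetical channel widths of the four U-Net encoder layers.
encoder_channels = [64, 128, 256, 512]
compressors = nn.ModuleList([FeatureCompressor(c) for c in encoder_channels])

def stack_encoder_features(feature_maps):
    """Compress each of the four encoder feature maps and stack them into a
    (batch, 4, 256) tensor, ready for multi-head self-attention."""
    vectors = [comp(f) for comp, f in zip(compressors, feature_maps)]
    return torch.stack(vectors, dim=1)

# Example with dummy encoder outputs for a 256x256 input image.
feats = [torch.randn(2, c, 256 // 2**i, 256 // 2**i)
         for i, c in enumerate(encoder_channels)]
print(stack_encoder_features(feats).shape)  # torch.Size([2, 4, 256])
```

The resulting (batch, 4, 256) stack is the combined input referred to above, ready for the multi-head self-attention step described in Section 3.2.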
3.2. Multi-Head Self-Attention
The multi-head self-attention [31] mechanism can simultaneously compute attention on multiple subspaces, allowing it to better handle the information interaction of multi-layer features and adapt to different feature map sizes. In classification networks, the multi-head self-attention mechanism can improve the classification performance by enhancing the expressiveness of feature representations and extracting richer and more accurate features.
It performs linear projections of $Q$, $K$ and $V$ using projection parameter matrices, applies scaled dot-product attention and repeats the process $h$ times as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O,$$

where $\mathrm{head}_i$ is calculated as

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) = \mathrm{softmax}\!\left(\frac{Q W_i^Q \,(K W_i^K)^{\top}}{\sqrt{d_k}}\right) V W_i^V.$$

The parameter matrices $W_i^Q$, $W_i^K$ and $W_i^V$ perform linear projections of $Q$, $K$ and $V$, respectively.
The structure of the multi-head self-attention module is depicted in Figure 4. Notably, the multi-head self-attention mechanism operates along the sequence dimension, so when it is applied to a classification network, the input data can be treated along the channel dimension as a sequence. Correspondingly, we concatenate the four layers of feature vectors into a single feature matrix containing all the extracted features, which maps naturally onto the inputs of the attention mechanism. Subsequently, the attention module calculates the correlations among the features, concatenates the outputs of the individual heads and produces a more exhaustive feature description. This description contains information at multiple scales and semantic levels, which enhances its comprehensiveness and safeguards against the loss of crucial information.
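As a hedged sketch of this step, the stacked (batch, 4, 256) feature matrix can be passed through a standard multi-head self-attention layer, treating the four layers as a length-4 sequence. The head count of 8 and the residual connection with layer normalization are common choices that we assume here, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class FeatureSelfAttention(nn.Module):
    """Multi-head self-attention over the four stacked feature vectors.

    The four compressed encoder features are treated as a length-4 sequence,
    so the attention weights model the pairwise correlations among layers
    while the output keeps the same (batch, 4, 256) shape.
    """
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Q, K and V are all the stacked feature matrix (self-attention).
        out, _ = self.attn(x, x, x)
        # Residual connection and layer norm (assumed, common practice).
        return self.norm(x + out)

stacked = torch.randn(2, 4, 256)           # from the compression step above
enhanced = FeatureSelfAttention()(stacked)
print(enhanced.shape)                      # torch.Size([2, 4, 256])
```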
3.3. Cloud Mixture-of-Experts
Mixture-of-Experts (MoE) is a method that combines several expert models [32], with each model processing different aspects of the data. The final results are determined by averaging the outputs of the experts, weighted according to their significance. Inspired by the MoE Decomp in FEDformer, the proposed Cloud Mixture-of-Experts (C-MoE) extends this technique to combine different feature maps or feature extractors. In this section, we provide a more detailed analysis of the C-MoE module, elucidating its significance and the rationale for its inclusion.
Significance of Feature Maps: Our approach leverages C-MoE to evaluate the significance of each layer’s feature maps. Unlike traditional methods, C-MoE autonomously learns the relevance of different feature maps, allowing it to adapt dynamically to the data. This dynamic learning capability empowers the model to identify critical features that may not be apparent through manual feature engineering.
Feature Weighting with MLP: To further enhance the feature representation, we employ a multi-layer perceptron (MLP). The MLP takes the extracted features as input and produces corresponding weight coefficients, which are crucial in combining features effectively. This added layer of adaptability improves the model’s capacity to capture intricate relationships within the data.
Softmax for Coefficient Conversion: To ensure that our weight coefficients are valid and range from 0 to 1, we employ the Softmax function. Softmax converts the output weight coefficients of the MLP into a valid probability distribution, allowing them to be used as weighting factors for the features. This transformation guarantees that the weights are proportional and suitable for feature combination.
In this method, each feature is multiplied by its weight coefficient, and the resulting values are summed. The final output is a feature vector used for classification. One notable advantage of this approach is its ability to effectively reduce the feature dimensionality, thereby improving both the classifier’s computational efficiency and overall accuracy. For the input feature maps $F_i$, $i = 1, \ldots, 4$, an MLP is used to estimate the confidence levels $c_i$ separately, and then the coefficients $w_i$ are obtained by scaling these confidence levels using Softmax so that they sum to 1:

$$w_i = \frac{\exp(c_i)}{\sum_{j=1}^{4} \exp(c_j)}, \qquad \sum_{i=1}^{4} w_i = 1.$$

The final output is the weighted sum of $w_i$ and $F_i$:

$$F_{\mathrm{out}} = \sum_{i=1}^{4} w_i \, F_i.$$
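A minimal PyTorch sketch of this weighting scheme follows, assuming the attention-enhanced (batch, 4, 256) features from the previous step. The gating MLP’s hidden width is an illustrative choice, and where the paper reports a 2 × 256-dimensional fused vector, this sketch fuses to a single 256-dimensional vector for simplicity:

```python
import torch
import torch.nn as nn

class CMoE(nn.Module):
    """Cloud Mixture-of-Experts sketch: learn a softmax weight per feature
    layer and fuse the four feature vectors into one weighted sum."""
    def __init__(self, dim: int = 256, hidden: int = 64):
        super().__init__()
        # Gating MLP: one confidence score c_i per feature vector F_i.
        self.gate = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, 4, dim) attention-enhanced feature vectors.
        scores = self.gate(feats)               # (batch, 4, 1) confidences c_i
        weights = torch.softmax(scores, dim=1)  # w_i, summing to 1 over layers
        return (weights * feats).sum(dim=1)     # (batch, dim) fused feature

fused = CMoE()(torch.randn(2, 4, 256))
print(fused.shape)  # torch.Size([2, 256])
```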
Although both the self-attention mechanism and the C-MoE can be used to improve the interactions among features, their functions do not exactly overlap.
The self-attention mechanism is primarily used to enhance the interactions among the features: it maps the various inputs to a unified vector space, calculates the relationships among them and weights each input’s correlations with the other inputs to obtain a more expressive feature representation.
On the other hand, C-MoE is primarily used to learn the weights of each feature automatically. These weights can combine feature information from different levels and judge the importance of each feature by the model’s self-training. C-MoE can improve the classification accuracy by enhancing the weights of important features. They can be utilized in tandem to further enhance the model’s performance.
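Using the illustrative FeatureSelfAttention and CMoE classes sketched above, the tandem usage amounts to a simple composition:

```python
import torch

# Illustrative composition of the two sketches above: self-attention enhances
# the stacked features, then C-MoE fuses them into the classifier input.
stacked = torch.randn(2, 4, 256)        # four compressed encoder features
enhanced = FeatureSelfAttention()(stacked)
fused = CMoE()(enhanced)                # (2, 256), fed to the classifier head
```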
3.4. Cross-Entropy Loss Function
Our proposed model uses the cross-entropy loss function, which is commonly used in classification tasks. Its core idea is to measure the difference between the model’s predicted probability distribution and the true label distribution for each sample. By minimizing the cross-entropy loss, the model is encouraged to better fit the training data, thereby improving the classification performance.
In our task, each sample has an associated true label, indicating which category the sample belongs to. The model’s prediction results in a probability distribution representing the predicted probabilities for each category. The cross-entropy loss quantifies the difference between these two probability distributions and serves as the optimization objective.
For each sample, the computation of the cross-entropy loss is as follows:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}.$$

Here, $L$ represents the cross-entropy loss, $N$ is the number of samples and $C$ is the number of categories; $y_{ij}$ denotes the true label probability for sample $i$ belonging to category $j$, and $\hat{y}_{ij}$ represents the model’s predicted probability for sample $i$ belonging to category $j$.
The computation of the cross-entropy loss involves comparing the true label distribution and the model’s predicted probability distribution for each category, evaluating the model performance. The optimization goal is to minimize this loss, enabling the model to make more accurate predictions about sample categories.
In our proposed model, the cross-entropy loss function is employed to optimize the performance of the 7-class classification task. By minimizing the cross-entropy loss, we encourage the model to generate predicted probability distributions that closely resemble the true label distribution. This helps to improve the classification accuracy, making the model better suited to handle different types of cloud images.
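As a brief usage sketch, in PyTorch this objective corresponds to `nn.CrossEntropyLoss`, which combines the softmax and the negative log-likelihood in a single call; the batch size of 8 below is arbitrary, while the 7 classes follow the paper’s task:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # softmax + negative log-likelihood in one

logits = torch.randn(8, 7, requires_grad=True)  # batch of 8, 7 cloud classes
labels = torch.randint(0, 7, (8,))              # true class indices y_i

loss = criterion(logits, labels)  # mean of -log p(y_i) over the batch
loss.backward()                   # gradients flow back to the classifier
```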
6. Conclusions
In this study, we present an enhanced Y-Net-based joint classification-segmentation technique for ground-based cloud images. Using a single network, the proposed CloudY-Net can both segment and classify ground-based cloud images. Accordingly, we enhanced the classification branch of the original Y-Net framework: feature maps were acquired from each of the four layers of the encoder for the classification task, and the acquired features were fed into a multi-head self-attention mechanism to obtain a feature representation with increased representational power.
Meanwhile, the C-MoE module learned the significance of each feature, assigned weights to them and combined them into a classification feature vector. In addition, we created a ground-based cloud image segmentation dataset called MGCD-Seg, with 4000 images, for the training of the segmentation branch. To evaluate the efficacy of the proposed CloudY-Net, a series of experiments was conducted. The results indicated that the segmentation branch of the current network structure could perform the segmentation of ground-based cloud images effectively: our model achieved strong performance on MGCD-Seg, with an mIoU of 96.55%, an mPA of 98.26% and an accuracy of 98.33%. Compared with traditional methods and recent networks, the improved classification branch also obtained better classification accuracy: the traditional KNN model achieved a classification accuracy of only 68.9%, the advanced Inception v3 model reached 88.32% and our CloudY-Net achieved the highest classification accuracy of 88.58% on MGCD. The improved network greatly improves the capacity for cloud feature extraction and optimizes the weight distribution of the classification output vector. Comparing the classification accuracy of our method with state-of-the-art algorithms and networks demonstrates the strength of CloudY-Net. Therefore, CloudY-Net performs well in the classification of ground-based cloud images and will have important impacts on fields such as photovoltaic power generation prediction and meteorological cloud analysis.
Currently, our improvement over the original Y-Net is limited to the classification branch, and the accuracy of cloud segmentation still needs further improvement. Accordingly, future work will consider modifying and enhancing the CloudY-Net segmentation branch to improve the model’s segmentation performance.