1. Introduction
Precipitation is one of the three fundamental components of the water cycle, and it plays a key role in weather and climate systems as well as energy-transfer processes [1,2,3]. The study of precipitation is valuable in many respects. The spatial–temporal distribution of precipitation and its variation have wide-ranging effects on human well-being and ecosystems [4,5,6]. For example, changes in the intensity, frequency, and distribution of precipitation directly affect crop water supply and soil water balance, and in turn affect atmospheric balance and water resources management [7,8,9]. Precipitation studies are also critical for understanding the global energy flow, i.e., the movement of heat, and for adjusting numerical weather-prediction models. In addition, they carry high climatological significance because studying the long-term variation in global precipitation can inform scientists of how precipitation responds to a changing climate [10,11].
The effects of precipitation on the hydrological cycle, energy balance budget, land–air interactions, and weather-prediction models further vary depending on the form of precipitation. In general, there are three types of precipitation that reach the Earth’s surface: liquid-phase (rain), mixed-phase (sleet), and solid-phase (snow) [12,13]. While snowfall events can have a dramatic impact on the energy budget by increasing surface albedo by over 50% [14,15], rainfall can quickly raise soil moisture, replenish groundwater, and form surface runoff [16,17]. In light of the significance of hydrometeor classification for improved understanding in various earth science fields, including weather forecasting, climate-change dynamics, and the usage of water resources, an increasing number of studies have focused on the identification of hydrometeor types (rain/sleet/snow).
Most existing hydrometeor classification schemes rely on remote sensing measurements from ground-based, airborne, and spaceborne active radars. As listed in Table 1 of the literature review summary, among the various classification algorithms, the fuzzy logic approach has gained considerable attention due to its ability to characterize different ranges of polarimetric measurements using empirically derived membership functions. The fuzzy logic approach was first proposed by Vivekanandan et al. [18] and has since undergone several improvements with more sophisticated membership functions and decision criteria [19,20]. Later, the National Severe Storms Laboratory at the National Oceanic and Atmospheric Administration designed another fuzzy logic algorithm for the operation of the U.S. NEXRAD, which has been upgraded over the years [21,22,23].
Although many researchers have continuously worked to improve hydrometeor classification, most studies have focused on polarimetric radar measurements [24,25]. In this paper, we aim to explore the potential of passive microwave measurements, which have not been fully explored for hydrometeor classification. Passive microwave radiation measurements have proven useful for retrieving unique signatures that identify Earth surface features and for obtaining atmospheric temperature and composition. Microwave radiation interacts with various types of hydrometeors throughout the vertical columns of precipitating clouds, making such measurements particularly beneficial for global precipitation studies [26,27,28,29,30,31]. The first spaceborne radiometer, the electrically scanning microwave radiometer (ESMR), was aboard NASA’s Nimbus-5, launched in 1972. Since then, various microwave radiometers, such as the advanced microwave sounding unit-A and unit-B (AMSU-A and AMSU-B) and the microwave humidity sounder (MHS), have flown aboard satellites worldwide, collecting an increasingly large volume of measurements of radiation emitted from the Earth at selected frequencies between 6 and 190 GHz [32].
Previous research has indicated that the relationship between hydrometeor scattering intensity and passive microwave brightness temperature varies across the frequency range of 19 to 150 GHz. Bennartz and Petty [27] examined this relationship using radiative transfer modeling data and found that it differed among the four frequencies studied within that range. Later studies that explored the relationship between airborne and spaceborne microwave data and signatures of hydrometeor types have supported this finding [33,34,35]. To augment existing precipitation algorithms and to exploit passive microwave observation capabilities to their full potential, herein we explore a deep-learning approach that uses passive microwave radiance to diagnose hydrometeor types among the liquid, mixed, and ice phases in an unprecedented manner. McCulloch and Pitts first proposed the conceptual model of an artificial neural network in 1943, and its application has expanded tremendously over the past few decades [36]. Recently, convolutional neural networks (CNNs) have emerged as one of the most popular deep-learning approaches in various fields, including meteorological studies that involve remote sensing observations [37]. Meteorological applications tend to have large earth observation datasets with spatially and temporally coherent information, and conventional statistically based methodologies may not be accurate enough to capture the spatio–temporal patterns in such data. This is especially true for ice particles and snowflakes, which are non-spherical and whose scattering properties are therefore more complex and imperfectly known. The pattern recognition abilities of CNN models fit well with this type of data [38]. They are especially suited to approximating complicated nonlinear relationships between input values and output results through learning phases and to extracting information from image-like and sequential data. Because the idea is relatively new and classifying hydrometeor types is more challenging with passive microwave observations than with conventionally used precipitation radar measurements, it is desirable to develop a deep-learning algorithm that eliminates the need for a well-defined function describing the relationship between the input passive microwave radiance and the output hydrometeor types. Herein, we leverage a CNN-based model in conjunction with an attention mechanism to learn meaningful feature representations from the spatial and temporal dimensions of passive microwave data for the task of hydrometeor classification.
We trained neural networks using observations from the Fengyun-3C (FY-3C) Micro-Wave Humidity Sounder-2 (MWHS-2) at frequencies between 89 and 190 GHz. The training was supervised by “ground truth” hydrometeor types derived from measurements of the global precipitation measurement (GPM) mission’s core observatory, which carries two critical instruments: the GPM microwave imager (GMI) and the dual-frequency precipitation radar (DPR). It is worth noting that MWHS-2 carries five channels centered at the 118 GHz oxygen line, which is of tremendous significance because it is the first instrument measuring Earth’s radiance from space at 118 GHz. The thermal emission spectrum of the atmosphere near 118 GHz, combined with the other channels at 90, 150, and 183 GHz, provides exceptional data for probing the atmosphere. To prepare datasets for training and testing models, we performed a series of data preprocessing steps, such as removing biases from observations, collocating observations with hydrometeor types, and data sub-setting (explained in detail in Section 2). We used the trained model to predict the type of hydrometeor given an input of MWHS-2 observations.
Our model is capable of learning spatial and temporal feature representations from satellite observations through sophisticated neural networks [39]. We also enhanced its functionality with two mechanisms: context-attention and channel-attention networks. These mechanisms allow us to exploit contextual information around each channel feature and emphasize its contribution to the output of the classification task. We will elaborate on each of these components in the following sections.
This study presents a novel approach to combining 118 GHz channels with other conventional channels between 89 and 190 GHz. By doing so, this study provides new insights into the distribution and variation of global hydrometeor types. Moreover, this is the first study to use spaceborne passive microwave observations to classify various types of hydrometeors globally over ocean. This approach represents a significant advancement in the application of deep-learning techniques to investigate hydrometeor characteristics using passive microwave observations. Furthermore, the proposed deep-learning scheme has significant implications for algorithm development in future missions. Specifically, the scheme can pave the way for innovative hydrometeor classification algorithms for the forthcoming time-resolved observations of precipitation structure and storm intensity with a constellation of smallsats (TROPICS) mission and its pathfinder [40].
The remainder of this article is structured as follows. In Section 2, we provide information about the instrument and data used in our study. Section 3 elaborates on the data-driven deep-learning model mechanism we proposed. We present the experimental evaluation of our model in Section 4. Finally, in Section 5, we discuss the obtained classification accuracy and describe future work.
3. ResNet-18 Network by Attention Mechanism
In this section, we present the enhanced ResNet-18 architecture, which uses stacked convolutional and self-attention modules for hydrometeor classification. While there are several other CNN architectures, such as VGG16, DenseNet, and MobileNet, we conducted extensive experimentation and evaluation and found that ResNet-18 provided the best performance for our specific problem of classifying hydrometeors using passive microwave observations. Furthermore, recent research has shown that ResNet variants with self-attention achieve state-of-the-art results in various classification tasks [43,44,45,46].
Figure 2 illustrates the entire workflow of our proposed architecture, which employs ResNet-18-based neural networks comprising stacked convolutional layers with varying filter sizes. ResNet-18 is the 18-layer variant of the ResNet family of residual networks, which were designed to reduce prediction error on the ImageNet test set [39]. As shown in Figure 3a, the input MWHS-2 passive microwave observation features are converted into vector representations by a convolutional embedding layer and then fed into a stack of bottleneck blocks to exploit spatial and temporal knowledge in MWHS-2 observations for predicting hydrometeor types. To capture spatial knowledge of observations between a channel and its neighboring ones, we extend the functionality of the vanilla bottleneck block in ResNet-18 by (1) using a context-attention layer to replace spatial convolution and (2) embedding a channel-attention module before the last down-sampling convolution layer. To improve the performance of the model and stabilize its output, we repeat the modified bottleneck layer multiple times, with the output of one block fed into the next one. Unless otherwise specified, we use the notation “N × N” to specify the size of a filter in a convolutional layer in the following sections and figures. For example, “3 × 3” in Figure 3 indicates a convolutional module with a 3 × 3 filter. Finally, the output probabilities of hydrometeor types are produced by a generator model comprising a linear neural network and a softmax activation function. In the implementation, we split the data into training, testing, and validation sets in a ratio of 8:1:1 to build and evaluate our proposed model. Furthermore, we use an early stopping strategy to avoid overfitting during training [47]. We describe all sub-components in the remainder of this section.
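As a minimal sketch of the 8:1:1 split described above, the following pure-Python example partitions sample indices into training, testing, and validation subsets. The random shuffling, the fixed seed, and the function name `split_indices` are assumptions made for illustration; the text does not state how samples were assigned to each subset.

```python
import random


def split_indices(n_samples, ratios=(8, 1, 1), seed=0):
    """Shuffle sample indices and split them into train/test/validation parts.

    Random shuffling with a fixed seed is an assumption for this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_test = n_samples * ratios[1] // total
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


train_idx, test_idx, val_idx = split_indices(1000)
print(len(train_idx), len(test_idx), len(val_idx))  # 800 100 100
```

For satellite swath data, a split by orbit or by date may be preferable to a purely random one, since neighboring footprints are strongly correlated; that choice is outside what this sketch shows.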
3.1. Model Configuration
Typically, according to Equation (1) and softmax-based weight function
, the matrix product
has a cost
if assuming
. Therefore, the complexity of the self-attention layers becomes
in the context and channel self-attention modules, where
k is the kernel size of the convolution layer. In addition, the complexity of the initial convolution is
. Hence, the overall complexity becomes
. In this paper, we implemented the algorithm using PyTorch 1.7.1 and trained it using the Adam optimizer [48,49]. During training, we applied early stopping with a patience of 10, based on model performance on the validation dataset, to avoid overfitting. To be consistent with the experimental settings of the baselines, we conducted both training and testing on an NVIDIA RTX 2080Ti GPU. The host server was configured with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz and 503 GB of memory.
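The early stopping rule with a patience of 10 can be sketched as follows. This is an illustrative, framework-independent version: the class name `EarlyStopping`, the `min_delta` parameter, and the use of validation loss as the monitored quantity are assumptions, since the text only states that model performance on the validation dataset was monitored.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` checks."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical tolerance, not from the paper
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up loss curve that stops improving after the third check.
stopper = EarlyStopping(patience=10)
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 12
```

In practice the model weights from the best validation check would typically be saved and restored once the rule fires.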
In addition, we leveraged the Adam optimizer with and to compute and update gradients, and cross-entropy was used as the loss function. Notably, instead of using hard labels directly, we used a regularization technique called label smoothing with a ratio of 0.1. It computes cross-entropy not with the hard labels but with a weighted average of the hard labels and a uniform distribution over the labels. In addition, we set the learning rate and dropout to 0.001 and 0.2, respectively.
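To make the label smoothing step concrete, the sketch below builds a smoothed target for one sample and evaluates the cross-entropy against it. It assumes the common convention in which the target is (1 − ε) times the one-hot label plus ε/K for each of the K classes; the three-class setup, the predicted probabilities, and the function names are illustrative only.

```python
import math


def smooth_labels(true_class, num_classes, smoothing=0.1):
    """Mix a one-hot label with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    target = [uniform] * num_classes
    target[true_class] += 1.0 - smoothing
    return target


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# Three classes (e.g., liquid, mixed, ice) and a smoothing ratio of 0.1.
target = smooth_labels(true_class=0, num_classes=3, smoothing=0.1)
probs = [0.7, 0.2, 0.1]  # made-up model output for one sample

hard_loss = cross_entropy(probs, [1.0, 0.0, 0.0])
soft_loss = cross_entropy(probs, target)
print(target)
print(hard_loss, soft_loss)
```

The smoothed target still sums to one, but it penalizes the model for placing nearly all probability mass on a single class, which is the regularizing effect described above.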
3.2. Convolutional Embedding Layer
As discussed in Section 2, MWHS-2 TBs of 12 channels and the corresponding relative airmass were combined and fed to the proposed model as a low-level input feature with original values. Technically, we needed to convert the input matrix to high-level 3D vectors for two reasons: (1) the proposed model works directly on 3D vectors rather than on a 2D matrix; (2) transforming the inputs into a new and larger high-level space without losing too many primary features can result in better performance. To this end, we followed strategies similar to those used in ResNet-18 to process the input matrix, as shown in Figure 3b. More specifically, we first slid (more precisely, convolved) a filter of size 7 × 7 across the width of the input (the height is equal to 1 in our case) and performed a convolution operation between the entries of the filter and the input at the corresponding position to summarize the presence of features in the input; the results are known as feature maps. An example convolution operation using a 3 × 3 filter is shown in Figure 3c. Typically, a sliding box of size 3 × 3 selects an input block of identical size (blue area in the input), and a dot product with the entries of the filter then produces one value of the output feature map (the output of the first cell is 16 in this case). The sliding box was then moved across the width with a specified step size (the stride was equal to 2 in this case) to obtain all values of the feature maps. Because the gradients of earlier layers are updated backward and their parameters change during training, training a deep neural network is challenging. Therefore, we adopted several strategies to reduce the number of training epochs and stabilize the learning process. For example, the batch normalization technique was used to standardize the inputs to a layer for each mini-batch [50].
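The sliding-window operation and the subsequent standardization can be illustrated with a small one-dimensional example, which matches the case where the input height is 1. The signal, the kernel values, and the function names are made up for illustration and do not reproduce the numbers in Figure 3c. The normalization here standardizes a single feature map and omits the learned scale and shift of a full batch normalization layer.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (no padding) sliding dot product of `kernel` over `signal`.

    As in most deep-learning libraries, the kernel is not flipped.
    """
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def standardize(values, eps=1e-5):
    """Shift and scale values to zero mean and (nearly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 2, 1]

# A stride of 2 moves the window two positions at a time: starts 0, 2, 4.
feature_map = conv1d(signal, kernel, stride=2)
print(feature_map)  # [8, 16, 24]

normalized = standardize(feature_map)
print(normalized)
```

With a stride of 2, the seven-element input yields three output values, roughly halving the width of the feature map.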
Once a feature map was created and standardized by batch normalization, a nonlinear function was applied to the feature map to approximate nonlinear relationships in the underlying data. We used a rectified linear activation function (ReLU), much as we do for the outputs of a fully connected layer [50]. Technically, the ReLU is defined as a max selector between 0 and x: f(x) = max(0, x). In other words, it returns 0 for any negative input and returns the input x unchanged otherwise. Usually, the output feature maps are sensitive to the location of the features in the input. In convolutional networks, pooling layers address this sensitivity by down-sampling the feature maps. There are two practical pooling methods: average and max pooling. The former summarizes the average presence of a feature, while the latter selects the most activated presence of a feature. In this paper, we applied a max pooling layer using a 3 × 3 filter on the output of ReLU to reduce the size of each feature map by a factor of 3.
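A short sketch of the ReLU and max pooling steps follows, again in one dimension. The stride of 3 is an assumption chosen so that the output is one third the size of the input, consistent with the reduction factor stated above; the text does not give the stride explicitly, and the input values are made up.

```python
def relu(values):
    """Apply f(x) = max(0, x) elementwise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, window=3, stride=3):
    """Keep the largest value in each window of the feature map."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]


feature_map = [-1.0, 2.0, 0.5, -3.0, 4.0, 1.0, -2.0, -0.5, 3.0]

activated = relu(feature_map)
print(activated)  # [0.0, 2.0, 0.5, 0.0, 4.0, 1.0, 0.0, 0.0, 3.0]

pooled = max_pool1d(activated, window=3, stride=3)
print(pooled)  # [2.0, 4.0, 3.0]
```

Max pooling keeps only the strongest activation in each window, so a small shift of a feature within the window leaves the pooled output unchanged.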
3.3. Bottleneck Residual Block
In this module, we designed a novel bottleneck block for hydrometeor classification. As shown in Figure 3a, there was a stack of five layers for each residual function F: 1 × 1, 3 × 3, 1 × 1, 1 × 1, and 1 × 1 convolutions. The first and third layers were convolutional layers using 1 × 1 filters to reduce and then increase (restore) dimensions. The second layer was designed to capture spatial knowledge of observations at each channel using a 3 × 3 convolutional layer. In our implementation, we also expanded its capability to exploit contextual information around each channel by using an attention mechanism (explained in the next subsection). The fourth layer performed a global attention mechanism among channels to emphasize the relationships among them. In the last layer, we conducted down-sampling directly by a convolutional layer with a 1 × 1 filter. To address the degradation problem of a complicated and deeper neural network and improve accuracy, we applied an identity mapping that carries the input to the end of the stacked layers and adds it to their output: F(X) + X. A ReLU layer then followed to approximate nonlinear relationships in the underlying data. We repeated the modified bottleneck layer multiple times (in our case), with the output of one block used as the input of the next block.
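The identity shortcut F(X) + X followed by ReLU, and the chaining of blocks, can be sketched as below. The function `toy_f` is a hypothetical stand-in for the five stacked layers and is not the residual function used in the paper; two repetitions are used purely for illustration.

```python
def relu(values):
    """Apply f(x) = max(0, x) elementwise."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return ReLU(F(x) + x), where `residual_fn` plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for the identity shortcut")
    return relu([f + xi for f, xi in zip(fx, x)])


def toy_f(x):
    # Hypothetical stand-in for the stacked layers: a fixed elementwise map.
    return [0.5 * v - 1.0 for v in x]


# Chain two blocks: the output of one block is the input of the next.
out = [1.0, -2.0, 4.0]
for _ in range(2):
    out = residual_block(out, toy_f)
print(out)  # [0.0, 0.0, 6.5]
```

Because the input is added back unchanged, each block only has to learn a correction to its input, which is what eases the training of deeper stacks.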
3.4. Attention Mechanism
There is a growing number of attention-style neural designs with competitive results in numerous tasks in various fields, such as natural language processing and computer vision [51,52]. Recently, earth observation studies have also benefited from their success in enhancing model prediction accuracy. Qiao et al. proposed a novel algorithm that combines an attention mechanism with recurrent neural networks to predict future sea-surface temperature (SST) using historical SST data, and experimental results showed that it outperformed other SST prediction approaches [37]. Nevertheless, none of the existing deep-learning-based algorithms employs attention over satellite passive microwave observations to exploit contexts among neighboring channels for improving the accuracy of hydrometeor classification. In this paper, we designed two novel modules, context attention and channel attention, and integrated them with the core bottleneck block in ResNet-18 to capture spatial knowledge of the microwave radiances around neighboring channels.
An attention mechanism maps a query $Q$ and a set of key-value pairs ($K$, $V$) to a weighted sum of the values, and it is formally defined as:

$$\mathrm{Attention}(Q, K, V) = \alpha(Q, K)\,V \qquad (1)$$

where $Q, K, V \in \mathbb{R}^{N \times d_k}$, $N$ is the number of input observations, and $d_k$ is the dimension of features. In addition, the weights $\alpha(Q, K)$ for $V$ are computed following a compatibility function of the query with the corresponding key [53,54,55].
As shown in Figure 3d, we conducted interactions across different spatial feature locations in the channel-attention module. Specifically, the input $X$ was transformed into $Q$, $K$, and $V$ using three separate 1D 1 × 1 convolutions, respectively. Next, we performed a dot product between $Q$ and $K$, scaled each resulting element by the square root of the dimension size $d_k$, and then applied a softmax function to obtain the corresponding weights for all values in $V$: $\alpha(Q, K) = \mathrm{softmax}(QK^{T}/\sqrt{d_k})$. Eventually, the output $Y$ was obtained using $Y = \alpha(Q, K)\,V$.
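The weighting of values by query-key compatibility can be sketched in plain Python as follows. This sketch assumes the standard scaled dot-product form, in which the scores are divided by the square root of $d_k$ before the softmax, and it takes $Q$, $K$, and $V$ as given rather than producing them with 1 × 1 convolutions. The function names and the toy matrices are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (output, weights) for scaled dot-product attention.

    Q, K, V are lists of N rows with d_k features each.
    """
    scale = math.sqrt(len(K[0]))  # assumed sqrt(d_k) scaling
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K])
        for q_row in Q
    ]
    n_cols = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(n_cols)]
        for w_row in weights
    ]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

Y, alpha = attention(Q, K, V)
print(alpha)  # each row sums to 1
print(Y)      # each output row is a weighted average of the rows of V
```

Each query attends most strongly to the key it is most similar to, so the first output row lies closer to the first row of `V` and the second output row closer to the second.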
Applying a channel-attention module over feature maps has both advantages and disadvantages. While better performance can be achieved, it lacks scalability and does not consider contextual information among neighboring keys, because it handles queries and keys as a group of isolated pairs and investigates their pairwise relationships individually without learning the contexts between them. As a result, we additionally designed a novel local attention, considered as a context-attention module, as shown in Figure 3e. We first employed a 3 × 3 group convolution over all the neighbor keys within a grid of 3 × 3 to extract local contextual representations for each key, denoted by $Z_1 = XW_{K,3\times3}$. Then, we investigated a form of concatenation-based weight function:
where [;] denotes a concatenation operation on input vectors and $W_\alpha$ was a weight vector that projected the concatenated vector to a scalar [56]. Next, we computed the attended feature map $Z_2$ using $\alpha(Q, Z_1) \times V$, through which it captured the global contextual information among all observations. We adopted an attention mechanism between the local context $Z_1$ and the global context $Z_2$ to produce the result. Further, to allow the model to learn and summarize information jointly, we split the input into multiple heads to represent subspace features at different spatial positions.
3.5. Precipitation Generator
We used the usual learned linear transformation and softmax function to convert the bottleneck output to predicted precipitation probabilities [54]. Technically, the softmax function $\sigma$ takes as input a vector $z$ of $K$ real numbers (i.e., $K$ is the number of hydrometeor types in the case of the generator) and normalizes it into a probability distribution consisting of $K$ probabilities proportional to the exponentials of the input numbers, which is defined as follows:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K$$

The output of the softmax function represents a categorical distribution over hydrometeor class labels, from which we obtain the probability of each input element belonging to a label.
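A minimal sketch of the generator step is given below: a linear transformation produces one score per class, and softmax converts the scores into class probabilities. The feature vector, weights, and biases are made-up values rather than trained parameters, and the three labels simply mirror the liquid, mixed, and ice phases discussed in this paper.

```python
import math


def linear(h, W, b):
    """Compute W h + b for a feature vector h (one output per row of W)."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def softmax(z):
    """Normalize K real scores into K probabilities that sum to one."""
    m = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["liquid", "mixed", "ice"]

# Hypothetical bottleneck output and linear-layer parameters.
h = [1.0, 2.0]
W = [[1.0, 0.5], [0.0, 0.25], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0]

logits = linear(h, W, b)  # [2.0, 0.5, -1.0]
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(probs)
print(predicted)  # liquid
```

The predicted hydrometeor type is the label with the largest probability, while the full probability vector is what the cross-entropy loss is computed against during training.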
5. Conclusions
This study has two major goals:
- (1) Utilizing CNN in conjunction with the attention mechanism to learn meaningful feature representations from spatial and temporal dimension space of passive microwave observations for hydrometeor classification;
- (2) Exploiting the information content of passive microwave observations for the purpose of hydrometeor classification with the unprecedented inclusion of 118 GHz channels.
To achieve these goals, we developed a new deep-learning-based algorithm using coincident MWHS-2 observations and GPM DPR estimates for training. The algorithm was composed of independent modules, in particular, convolutional and attention modules, for learning the non-linear relationships between the input and output and for exploiting contexts among neighboring channels.
In addition to developing a classification algorithm, this study also investigated the information content of the different microwave channels ranging from 89 to 190 GHz. Of all the channels, the three highest-peaking channels around 118 GHz (channels 2–4 of MWHS-2) were demonstrated to be the least significant, which aligns with previous findings. The algorithm has been validated on a different full year of MWHS-2 observations. The prediction of hydrometeor types for this full year shows high agreement with state-of-the-art hydrometeor types from the GPM DPR measurements through the combined algorithm (2BCMB). The global geographical distributions of occurrence fractions for different hydrometeor types show overestimation in some areas, such as the ITCZ for liquid precipitation, and underestimation for mixed precipitation. The differences in zonal mean likelihood of both ice and mixed precipitation occur in higher latitudinal regions.
This work is part of the development of precipitation retrieval algorithms for the upcoming TROPICS mission, which consists of a constellation of CubeSats, each of which carries a high-performance radiometer. In particular, the similarities in channel configuration between the MWHS-2 and TROPICS radiometers make the former an appropriate substitute for providing passive microwave observations in the higher microwave spectrum (89–190 GHz) for hydrometeor type classification.
Discussions
While the main purpose of this study is to consider the use of spaceborne passive rather than active microwave measurements for hydrometeor classification, we acknowledge that the capabilities of the proposed deep-learning-based algorithm are limited without further verification using independent ground-based references, such as radar networks. Despite the potential benefits of ground-based radar measurements, the sparse distribution of such instruments makes global evaluation challenging [57,58]. Nevertheless, we remain committed to pursuing this goal in future research. Our future work will also test the potential of CNN in conjunction with the attention mechanism to improve the accuracy of hydrometeor classification by incorporating ancillary information into the input data, including freezing-level altitude and temperature and humidity profiles. Preliminary analysis of adding freezing-level altitude data to the model features shows promising results. We are aware of other machine-learning algorithms, such as inductive logic programming and Bayesian networks, which may also be applicable to this topic. Exploring such algorithms is beyond the scope of this paper and may be better suited for future work.