The core objective of this study is to improve the nowcasting of convective precipitation through deep learning and thereby provide efficient meteorological services to various socio-economic sectors. The Adversarial Autoregressive Network (AANet) is a nowcasting model for convective precipitation. Building on NowcastNet, AANet integrates and enhances the generative adversarial network framework, refines the generation of the provisional forecast data, incorporates a multi-head attention mechanism and an SSIM loss, and proposes a two-stage adversarial strategy.
2.2.1. FURENet
FURENet adopts UNet [18] as its backbone network and utilizes the Squeeze-and-Excitation module (SE module) [24] to delay the fusion of the features of multiple polarization variables. The Delayed Fusion Strategy [17,25,26,27] is commonly applied to learning the complex relationships among different kinds of information. When intricate relationships exist among the input variables, forcing the input layer to combine them through linear operations causes an information entanglement effect; the Delayed Fusion Strategy effectively alleviates this issue.
Let the tensor $X_i \in \mathbb{R}^{N \times H \times W}$ represent a specific section of the polarimetric radar observation data, where $i = 1, 2, 3$ indexes the polarization variables $Z_H$, $Z_{DR}$, and $K_{DP}$, respectively, $N$ represents the time length, $H$ represents the length, and $W$ represents the width; let $E_i^n$ and $D$ represent the encoder features and the decoder features of FURENet, respectively. The specific process of FURENet is shown in Equations (3) and (4).
Here, $X_i^t$ denotes the data of the specific polarization variable at the present moment, and $E_i^n$ represents the feature of the $n$-th-level semantic layer of that polarization variable. Equation (3): the encoder learns the features of consecutive multiple frames of each polarization variable and implements downsampling operations semantically layer by layer; it outputs the semantic feature ($E_i^n$) at each semantic layer and ultimately outputs the semantic feature set ($\{E_i^n\}$). Equation (4): the decoder takes the top-level semantic features of the three variables and, by combining the same-level semantic features ($E_i^n$), upsamples the semantic features; after multiple layer-by-layer operations, the decoder obtains the provisional forecast data ($\hat{X}$).
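To make the delayed-fusion flow of Equations (3) and (4) concrete, the following is a minimal PyTorch sketch of the encode-then-fuse-then-decode pass; the function signature, the per-level SE fusion by channel concatenation, and the module names are illustrative assumptions rather than the exact AANet implementation.

```python
import torch

def furenet_forward(encoders, se_fusion, decoder, inputs):
    """Hedged sketch of Eqs. (3)-(4): per-variable encoding, delayed
    SE fusion at each semantic level, then joint decoding.

    encoders : list of per-variable encoder networks (one per polarization variable)
    se_fusion: list of SE modules, one per semantic level (assumed design)
    decoder  : decoder network taking the fused multi-level features
    inputs   : list of tensors X_i, each of shape (B, N, H, W)
    """
    # Eq. (3): each encoder extracts multi-level semantic features;
    # feats[i] is a list [E_i^1, ..., E_i^n] from shallow to deep.
    feats = [enc(x) for enc, x in zip(encoders, inputs)]

    # Delayed fusion: the features of the three variables are combined only
    # here, after per-variable encoding, by channel concatenation re-weighted
    # with an SE module at every semantic level.
    fused = []
    for level, se in enumerate(se_fusion):
        cat = torch.cat([f[level] for f in feats], dim=1)
        fused.append(se(cat))

    # Eq. (4): the decoder upsamples the top-level feature and merges the
    # same-level fused features (skip connections) to produce the
    # provisional forecast.
    return decoder(fused)
```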
FURENet comprises three encoders and one decoder, and its structure is presented in Figure 2. Each encoder consists of four encoder blocks and a convolutional layer (5 × 5 convolution kernel). An encoder block comprises a residual block [28], a bilateral downsampling, and two residual blocks. The bilateral downsampling includes a convolutional layer (5 × 5 convolution kernel) and max pooling. The decoder is composed of the SE block, four decoder1 blocks, a bilateral upsampling, and a convolutional layer (5 × 5 convolution kernel). The bilateral upsampling includes a transposed convolution (ConvTranspose) [29] and bilinear interpolation [30]. The SE block comprises a global pooling layer, a linear layer, a Tanh layer, a linear layer, and a Sigmoid layer [31]. The decoder1 block is composed of a bilateral upsampling, a convolutional layer (3 × 3 convolution kernel), and two residual blocks. Except for the output layer, the normalization and activation functions of the convolutional layers and residual blocks are group normalization [32] and Tanh [33], respectively.
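The building blocks named above can be sketched in PyTorch as follows. The channel sizes, the SE reduction ratio, and the way the two branches of the bilateral downsampling/upsampling are merged (by summation, with a 1 × 1 convolution to match channels) are assumptions for illustration; the text fixes only the layer types.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """SE block as described: global pooling -> linear -> Tanh -> linear -> Sigmoid.
    The reduction ratio r is an assumption."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.Tanh(),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # channel-wise re-weighting of the features


class BilateralDownsample(nn.Module):
    """Bilateral downsampling: a strided 5x5 convolution branch and a
    max-pooling branch (assumed parallel and summed)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),  # match channel count
        )

    def forward(self, x):
        return self.conv(x) + self.pool(x)


class BilateralUpsample(nn.Module):
    """Bilateral upsampling: a transposed-convolution branch and a
    bilinear-interpolation branch (assumed parallel and summed)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # match channel count

    def forward(self, x):
        up = F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False)
        return self.deconv(x) + self.proj(up)
```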
2.2.2. Semantic Synthesis Model
The Semantic Synthesis Model (SSM) consists of three parts, namely the encoder, the decoder, and the noise generator. The encoder encodes and learns the past observed data as well as the provisional forecast data to acquire the data distribution feature ($S$). The decoder decodes the data distribution feature ($S$) and applies to it a global self-attention mechanism and a spatially affine transformation of the normalization. Owing to the influence of SPADE [22], the decoder focuses on the provisional forecast data and ignores the past observed data. The noise generator generates a learnable noise distribution feature ($z$) and concatenates it with the data distribution feature ($S$), so the noise distribution feature ($z$) deepens the complex relationship between the provisional forecast data and the past observed data. The noise generator uses VGGNet [34] as its backbone network. Let the tensor $X \in \mathbb{R}^{N \times H \times W}$ represent the observed data (where $N$ represents the time length, $H$ represents the length, and $W$ represents the width), let $\varepsilon$ represent the random noise distribution, and let $S$, $z$, and $\hat{Y}$ represent the encoder feature, noise generator feature, and decoder feature of the SSM, respectively. The specific process is shown in Equations (5)–(7) as follows:
Equation (5): the encoder encodes and downsamples the past observation data ($X$) as well as the provisional forecast data ($\hat{X}$), ultimately obtaining the data distribution feature ($S$). Equation (6): the noise generator encodes and downsamples Gaussian noise ($\varepsilon$, with a mean of 0 and a variance of 1), ultimately obtaining the noise distribution feature ($z$). Equation (7): the data distribution feature and the noise distribution feature are concatenated; with the intervention of SPADE (in which the feature normalization undergoes a spatial affine transformation conditioned on the provisional forecast data), the decoder decodes and upsamples the features and ultimately obtains the final forecast data ($\hat{Y}$).
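A minimal sketch of the SSM forward pass in Equations (5)–(7) is given below; the channel-wise concatenations, the shape of the sampled noise, and the `condition` keyword of the decoder are assumptions for illustration.

```python
import torch

def ssm_forward(encoder, noise_generator, decoder, past_obs, provisional):
    """Hedged sketch of Eqs. (5)-(7).

    past_obs    : (B, N, H, W) past observed frames X
    provisional : (B, N', H, W) provisional forecast from FURENet
    """
    # Eq. (5): encode the past observations together with the provisional
    # forecast to obtain the data distribution feature S.
    s = encoder(torch.cat([past_obs, provisional], dim=1))

    # Eq. (6): encode Gaussian noise (mean 0, variance 1) into the noise
    # distribution feature z; the noise shape is an assumption.
    z = noise_generator(torch.randn_like(past_obs))

    # Eq. (7): concatenate S and z and decode; inside the decoder, SPADE
    # layers apply spatially adaptive affine transformations conditioned on
    # the provisional forecast.
    return decoder(torch.cat([s, z], dim=1), condition=provisional)
```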
The SSM consists of three parts, namely the encoder, the noise generator, and the decoder, and its specific structure is shown in Figure 3. The encoder consists of four encoder blocks and a convolutional layer (5 × 5 convolution kernel). An encoder block is composed of a residual block [28], a bilateral downsampling, and two residual blocks [28]; the normalization function is group normalization, and the activation function is Tanh. The decoder is composed of a convolutional layer (3 × 3 convolution kernel), five decoder2 blocks, and a convolutional layer (3 × 3 convolution kernel). A decoder2 block consists of a bilateral upsampling, a convolutional layer (3 × 3 convolution kernel), a dual-head self-attention mechanism, a convolutional layer (3 × 3 convolution kernel), and a SPADE ResBlk [22]. The dual-head self-attention mechanism is composed of two cascaded self-attention [23] modules. The SPADE ResBlk consists of two SPADEs and a residual skip connection. The principles of self-attention and spatially adaptive (de)normalization (SPADE) [22] are presented below.
The self-attention mechanism [23] reassigns the weights of the features, meaning that the feature map carries a global attention mechanism. Assume that the tensor $X \in \mathbb{R}^{C \times H \times W}$ represents the variable (where $C$ represents the number of channels, $H$ represents the length, and $W$ represents the width), and the tensor $\tilde{X} \in \mathbb{R}^{C \times H' \times W'}$ represents the feature (with $C$ representing the number of channels, $H'$ representing the length, and $W'$ representing the width). The principle is depicted in Equations (8)–(13).
The length ($H$) and the width ($W$) are equal; $x_{i,j,k}$ denotes an element of the tensor $X$; and $(i, j, k)$ represents its coordinates.
The principle of the self-attention mechanism [23] is as follows, with the scaling factor $1/\sqrt{d}$, where $d$ is the number of channels. Equations (8)–(10): the autoregressive model encodes and learns the feature weights to obtain three tensors, namely the query ($Q$), the key ($K$), and the value ($V$). Equation (11): the tensor $Q$ is reshaped into a matrix ($\mathbb{R}^{HW \times C}$) and the tensor $K$ is reshaped into a matrix ($\mathbb{R}^{C \times HW}$); a matrix multiplication is then performed on $Q$ and $K$, and the product is multiplied by the factor $1/\sqrt{d}$ to obtain the score matrix $A$ ($\mathbb{R}^{HW \times HW}$). Equation (12): the softmax function is applied to the first dimension of $A$ to obtain the attention weights $A'$ ($\mathbb{R}^{HW \times HW}$). Equation (13): the tensor $V$ is reshaped into a matrix ($\mathbb{R}^{HW \times C}$); a matrix multiplication between $A'$ and $V$ yields the output $\tilde{X}$ ($\mathbb{R}^{HW \times C}$), which is reshaped back into tensor form.
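As a concrete reading of Equations (8)–(13), the following PyTorch sketch implements a single self-attention head; the use of 1 × 1 convolutions for the query/key/value projections and the exact softmax axis are assumptions, while the scaling factor $1/\sqrt{d}$ with $d$ the channel count follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Single-head 2-D self-attention following Eqs. (8)-(13)."""
    def __init__(self, channels: int):
        super().__init__()
        # Eqs. (8)-(10): learned projections producing Q, K, V
        # (1x1 convolutions are an assumed implementation choice).
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Eq. (11): flatten Q to (B, HW, C) and K to (B, C, HW), multiply,
        # and scale by 1/sqrt(d) with d the channel count.
        q = self.to_q(x).reshape(b, c, h * w).transpose(1, 2)   # (B, HW, C)
        k = self.to_k(x).reshape(b, c, h * w)                    # (B, C, HW)
        scores = torch.bmm(q, k) / (c ** 0.5)                    # (B, HW, HW)
        # Eq. (12): softmax normalizes the scores (taken here over the key
        # dimension of each row).
        attn = F.softmax(scores, dim=-1)
        # Eq. (13): the weights are applied to the flattened V and the
        # result is reshaped back to (B, C, H, W).
        v = self.to_v(x).reshape(b, c, h * w).transpose(1, 2)    # (B, HW, C)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return out
```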
The principle of SPADE [22] is that, under the intervention of semantic conditions, a spatial affine transformation is applied to the normalization. SPADE reprocesses instance normalization [35]: the normalization undergoes a spatial affine transformation, which implements a spatially adaptive adjustment of the normalization under the intervention of the semantic condition. Suppose the tensor $X \in \mathbb{R}^{C \times H \times W}$ represents the variable (where $C$ represents the number of channels, $H$ represents the length, and $W$ represents the width), and the tensor $\tilde{X} \in \mathbb{R}^{C \times H' \times W'}$ represents the feature (where $C$ represents the number of channels, $H'$ represents the length, and $W'$ represents the width). The specific principle is detailed in Equations (14)–(18).
The length ($H$) and the width ($W$) are equal; $x_{i,j,k}$ denotes an element of the tensor $X$, and $(i, j, k)$ represents its coordinates; $M$ denotes the semantic condition; $\mu$ denotes the mean value, and $\sigma$ denotes the standard deviation.
Equations (14) and (15): the mean and standard deviation of the feature map are computed channel-wise, following instance normalization. Equations (16) and (17): the semantic condition is encoded and learned by the autoregressive model to acquire the weight tensor $\gamma$ and the bias tensor $\beta$. Equation (18): the normalized feature map undergoes a spatial affine transformation using $\gamma$ and $\beta$.
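Equations (14)–(18) can be summarized by the following SPADE sketch: parameter-free instance normalization supplies the mean and standard deviation (Equations (14) and (15)), a small convolutional network over the semantic condition yields $\gamma$ and $\beta$ (Equations (16) and (17)), and the normalized feature undergoes the spatial affine transformation (Equation (18)). The hidden width, activation, and kernel sizes of the conditioning network are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially adaptive (de)normalization following Eqs. (14)-(18)."""
    def __init__(self, feat_ch: int, cond_ch: int, hidden: int = 128):
        super().__init__()
        # Eqs. (14)-(15): parameter-free instance normalization supplies the
        # per-channel mean and standard deviation.
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        # Eqs. (16)-(17): the semantic condition is encoded into the
        # spatially varying scale (gamma) and bias (beta) maps.
        self.shared = nn.Sequential(
            nn.Conv2d(cond_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.to_gamma = nn.Conv2d(hidden, feat_ch, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_ch, kernel_size=3, padding=1)

    def forward(self, x, cond):
        # Resize the semantic condition (here, the provisional forecast)
        # to the spatial size of the feature map.
        cond = F.interpolate(cond, size=x.shape[-2:], mode="nearest")
        h = self.shared(cond)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        # Eq. (18): spatial affine transformation of the normalized feature.
        return self.norm(x) * gamma + beta
```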