1. Introduction
Short-term precipitation forecasting is a key component of modern meteorological forecasting systems, focusing on accurately predicting drastic changes in precipitation caused by severe convective weather within the next six hours, particularly within the 0–2 h range [1,2,3]. In recent years, severe weather disasters due to heavy precipitation have caused substantial social and economic losses. For example, in 2021, Zhengzhou, China, experienced an extreme precipitation event with a maximum hourly rainfall of 201.9 mm, resulting in 380 deaths and missing persons and direct economic losses of CNY 40.9 billion [4]. Short-term precipitation forecasting is therefore critically important in sectors such as agriculture, transportation, urban management, and tourism, and is essential for disaster prevention and for protecting lives and property [5,6,7]. Enhancing the accuracy and timeliness of short-term precipitation forecasts is thus a crucial research need.
Meteorological radar is one of the core meteorological detection devices. By emitting microwave signals and receiving their reflections, it can effectively detect various atmospheric elements and be used for precipitation forecasting [8]. China has established a new-generation weather radar network that provides high-resolution radar echo data with minute-level temporal resolution and kilometer-level spatial resolution over large areas [9]. These devices and data make research on short-term precipitation forecasting feasible, and developing precise forecasting methods from radar echo data has become both a research hotspot and a bottleneck. Radar echo extrapolation-based methods must capture subtle atmospheric changes within a short time and therefore face significant challenges in accuracy, real-time performance, and technical requirements. The core concept of radar echo extrapolation is to predict future frames of radar echo images from past frames, thereby forecasting future precipitation; this requires developing spatiotemporal sequence prediction models and solving for their optimal parameters [10]. However, modeling the spatiotemporal characteristics of radar data is difficult because of its high-dimensional nonlinearity and extremely complex spatiotemporal distribution.
Traditional radar echo extrapolation methods often rely on mathematical and physical approaches, such as cross-correlation [11], storm cell identification and tracking (SCIT) [12], and optical flow methods [13]. Although these methods can predict precipitation distribution to some extent, their ability to capture spatiotemporal relationships is limited, especially for the nonlinear motion of mesoscale atmospheric processes, so they cannot accurately forecast changes in precipitation. In recent years, with significant improvements in computing power, radar echo extrapolation methods based on deep learning [14] have shown better performance than traditional methods. Deep learning-based methods learn spatiotemporal features and nonlinearity from large amounts of radar echo data to extrapolate future frames of radar echo images, and offer advantages over traditional methods in data utilization and forecasting accuracy [15,16,17,18,19,20]. Currently, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are the mainstream approaches to spatiotemporal sequence prediction in radar echo extrapolation. RNN-based models are designed primarily for spatiotemporal sequence data: they have strong temporal modeling capabilities and can incrementally learn and predict radar echo sequences, but they cannot be parallelized and suffer from gradient explosion and error accumulation. To address these issues, researchers have proposed CNN architectures for radar echo extrapolation, such as the U-Net [21] and SimVP [22] models. U-Net is a groundbreaking convolutional architecture consisting of two symmetric modules, an encoder and a decoder. It not only excels in semantic segmentation [23], visual detection [24], and medical tasks [25], but has also been extensively studied and proven to be an effective backbone in precipitation forecasting. SimVP, a newer model with a CNN-CNN-CNN structure, adds a translator module between the encoder and decoder to learn temporal evolution, enhancing its ability to capture temporal information compared with U-Net. However, 2D convolution mixes originally independent variables into indistinguishable feature channels, losing feature independence and obscuring the interdependencies between variables. Such models do not fully capture the temporal dependencies of radar echo data and still exhibit significant prediction degradation, insufficient capture of temporal dependencies at different time scales, and low accuracy.
To address these issues, we propose an end-to-end short-term precipitation forecasting model based on radar echo extrapolation: the Multi-Scale Deep Dilated 3D Residual Spatio-Temporal Network (MS-DD3D-RSTN). The model uses multi-scale depthwise and dilated 3D convolutions to extract spatiotemporal features while substantially reducing the number of parameters, and introduces residual connections to alleviate prediction degradation. A new loss function, STLoss, combines a weighted mean squared error (WMSE) and differential divergence regularization (DDR) to learn intra-frame and inter-frame changes in radar data, effectively capturing the spatiotemporal variation trends of radar signals. Together, these designs enable the proposed STCB module to capture correlated features in both the temporal and spatial dimensions, improving performance in radar echo extrapolation tasks. To evaluate the effectiveness of this method, we conducted experiments on the Sichuan dataset and the HKO-7 dataset. The results show superior performance in terms of the CSI and POD evaluation metrics: the CSI reached 0.538 and 0.386 at the 20 dBZ reflectivity threshold and 0.485 and 0.198 at the 30 dBZ threshold, outperforming existing radar extrapolation methods.
The main contributions of this work are as follows:
- (1) A spatiotemporal sequence learning network model, MS-DD3D-RSTN, is proposed, which efficiently captures the spatiotemporal dependencies of radar echo data and accurately predicts the target task.
- (2) The STCB module, based on multi-scale 3D convolution, dilated depthwise convolution, and residual connections, is proposed to achieve better spatiotemporal dependency capture and alleviate prediction degradation to some extent.
- (3) We introduce a loss function, STLoss, which combines WMSE and DDR. This effectively addresses data imbalance and enhances the model’s ability to learn spatiotemporal features and their gradient characteristics.
The remainder of the paper is organized as follows:
Section 2 briefly introduces related work on radar echo extrapolation.
Section 3 describes the proposed method.
Section 4 presents comprehensive experiments to validate the effectiveness of the proposed model.
Section 5 summarizes the innovations of this paper and discusses the advantages and disadvantages of related methods.
Section 6 concludes the paper.
3. Methodology
3.1. Datasets
This study uses the Sichuan dataset and the HKO-7 dataset as experimental datasets, with the same model trained separately on each. Analysis of the raw data revealed a significant number of negative values. Because higher radar echo reflectivity generally indicates heavier precipitation, these negative values carry no precipitation signal and can be ignored. All negative values in the raw data were therefore set to zero, noise filtering was applied, and the data were normalized to facilitate model training and optimization. To ensure the quality of the training samples, we excluded most samples without precipitation. After screening and preprocessing the raw data, a sample dataset was generated using a sliding window with a length of 40 and a step size of 1.
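The preprocessing pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' code: the normalization constant `max_dbz` and the screening threshold `min_echo_ratio` are assumptions, since the paper does not state the exact scheme.

```python
import numpy as np

def preprocess_frames(frames, max_dbz=70.0):
    """Clip negative reflectivity to zero and scale to [0, 1].
    `max_dbz` is an assumed normalization constant."""
    frames = np.clip(frames, 0.0, max_dbz)
    return frames / max_dbz

def sliding_window_samples(sequence, length=40, stride=1, min_echo_ratio=0.01):
    """Cut a long frame sequence into overlapping samples of `length` frames,
    discarding windows with almost no echo (mirroring the screening of
    no-precipitation samples; `min_echo_ratio` is a hypothetical threshold)."""
    samples = []
    for start in range(0, len(sequence) - length + 1, stride):
        window = sequence[start:start + length]
        if (window > 0).mean() >= min_echo_ratio:
            samples.append(window)
    return samples

# Toy example: 50 frames of 8x8 "reflectivity" with echo present.
rng = np.random.default_rng(0)
raw = rng.uniform(-5, 60, size=(50, 8, 8))
frames = preprocess_frames(raw)
samples = sliding_window_samples(frames, length=40, stride=1)
print(len(samples), samples[0].shape)  # 11 windows of shape (40, 8, 8)
```

With 50 frames, a window of 40, and a stride of 1, the sweep yields 11 overlapping samples, which shows why a modest archive of radar scans can produce thousands of training sequences.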
Sichuan dataset: This dataset is derived from radar echo data collected by the Plateau Meteorological Bureau of Sichuan Province, China, from 2011 to 2013. The data are three-dimensional, comprising nine layers. Images have a resolution of 360 × 920 pixels, spanning from 105.09° E to 109.95° E in longitude and from 29.09° N to 33.25° N in latitude. Based on experimental results, radar echoes from the first, third, and fifth layers were selected for analysis. The corresponding altitudes are 0.5 km, 1.5 km, and 2.5 km, respectively. After processing, the dataset consists of 10,394 samples, each with a sequence length of 40 frames at 6 min intervals. The first 20 frames of the first, third, and fifth layers are used as input for prediction, while the subsequent 20 frames of the first layer serve as the ground truth.
HKO-7 dataset: This dataset, developed by the Hong Kong Observatory, is commonly used for precipitation nowcasting. It includes radar echo data collected from 2009 to 2015. The images have a resolution of 480 × 480 pixels at an altitude of 2 km, covering a 512 km × 512 km area centered on Hong Kong. After processing, the dataset contains 11,514 samples, each a sequence of 40 frames at 6 min intervals [32].
Both datasets are divided into training, validation, and test sets in a 7:2:1 ratio.
3.2. Problem Definition
In the field of radar echo extrapolation for short-term precipitation forecasting, the spatiotemporal sequence prediction problem can be modeled as a prediction problem based on historical radar image sequences [10]. Specifically, given a time point t, the data of the D time points prior to t are used as historical input, with a time interval of 6 min between data points. Let X_{t−D+1}, …, X_t be the sequence of data within the D time points before t; this sequence is input into the prediction model to forecast the radar images X_{t+1}, …, X_{t+M} for the M time points after t. The prediction interval is also 6 min, and the input data for each time point is a three-dimensional matrix of size C × H × W.
The historical input data can thus be represented as a dataset I = {X_{t−D+1}, …, X_t} with dimensions D × C × H × W, and the predicted radar image sequence at future time points as P = {X̂_{t+1}, …, X̂_{t+M}} with dimensions M × C′ × H × W.
To address this problem, the MS-DD3D-RSTN model, as an objective function F, is used to construct a mathematical model that maps the historical input dataset I to the predicted future radar image sequence P:
P = F(I).
3.3. MS-DD3D-RSTN Network Framework
Figure 1 illustrates the network framework of the MS-DD3D-RSTN model. The model consists of three parts: a spatial encoder, a spatiotemporal learner, and a spatial decoder. The spatial encoder and spatial decoder are symmetric modules: the encoder learns spatial information, reduces the spatial dimensions, and decreases the number of parameters, while the decoder maps the feature information to the target sequence to predict the target task. The spatiotemporal learner models spatiotemporal evolution and captures the temporal dependencies of the radar echo data. To retain spatially related features, multiple skip connections are added between the spatial encoder and the spatial decoder. The input size of the model is 20 × 3 × H × W and the output size is 20 × 1 × H × W, indicating that the model takes radar reflectivity maps from three layers over the previous two hours as input and predicts a single layer of radar images for the next two hours.
The core function of the spatial encoder is to extract spatial feature information and perform dimensionality reduction, focusing primarily on the spatial dimension. To this end, the radar echo image tensor of the past frames is first reshaped so that the temporal and layer dimensions are folded together. The reshaped tensor is then processed by DoubleConv (a dual convolutional layer in which each convolutional layer comprises a Conv2D, a batch normalization (BN) layer, and a ReLU activation; represented by the brown blocks in Figure 1) to increase the hidden dimension for subsequent operations. Next, N_e stacked combinations of MaxPool2d (with a stride of 2) and DoubleConv (represented by the light purple blocks in Figure 1) downsample the features and extract spatial information. The hidden features of the spatial encoder can be represented as
z_e = (DoubleConv ∘ MaxPool2d)^{N_e}(DoubleConv(x)),
where x is the input tensor, z_e is the output tensor, and N_e is the number of MaxPool2d and DoubleConv combination modules. Experiments have shown that the optimal value of N_e is 4.
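The downsampling path above halves the spatial resolution at each stage. A minimal numpy sketch of the pooling behavior (the convolutional parts of DoubleConv are omitted since they preserve spatial size, and the 256 × 256 input size is purely illustrative):

```python
import numpy as np

def maxpool2d(x, k=2):
    """2x2 max pooling with stride 2 over the last two axes
    (a minimal stand-in for MaxPool2d)."""
    h, w = x.shape[-2] // k * k, x.shape[-1] // k * k
    x = x[..., :h, :w]
    x = x.reshape(*x.shape[:-2], h // k, k, w // k, k)
    return x.max(axis=(-3, -1))

# Trace the spatial sizes through N_e = 4 pooling stages.
feat = np.zeros((64, 256, 256))  # (channels, H, W); sizes are illustrative
sizes = [feat.shape[-2:]]
for _ in range(4):
    feat = maxpool2d(feat)
    sizes.append(feat.shape[-2:])
print(sizes)  # [(256, 256), (128, 128), (64, 64), (32, 32), (16, 16)]
```

After four stages the spatial extent shrinks by a factor of 16 per side, which is what makes the subsequent spatiotemporal learner affordable in memory and compute.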
The spatiotemporal learner focuses primarily on the temporal dimension. The output tensor of the encoder is therefore reshaped so that the same variables are arranged sequentially along the time dimension. Then, N_s stacked STCB modules (represented by the light gray blocks in Figure 1) extract temporal features from the reshaped tensor. The STCB module is described in detail in Section 3.4. The hidden features of the spatiotemporal learner can be represented as
z_s = STCB^{N_s}(z_e),
where z_e is the input tensor, z_s is the output tensor, and N_s is the number of STCB modules. Experiments have shown that the optimal value of N_s is 6.
The primary function of the spatial decoder is to integrate feature information and predict the radar reflectivity images of future frames. Mirroring the spatial encoder, the output tensor of the spatiotemporal learner is first reshaped back to the spatial layout. Then, N_d stacked combinations of Upsample (with a scaling factor of 2, using bilinear interpolation) and DoubleConv (represented by the light blue blocks in Figure 1) integrate feature information from the feature tensors of the spatial encoder and the spatiotemporal learner. Finally, a convolutional output layer produces the predicted images. The hidden features of the spatial decoder can be represented as
z_d = (DoubleConv ∘ Upsample)^{N_d}(z_s),
where z_s is the input tensor, z_d is the output tensor, and N_d is the number of Upsample and DoubleConv combination modules; N_d is set equal to N_e.
3.4. STCB
The STCB module, the core component of the spatiotemporal learner, mines the dynamic features of sequential data. It integrates multi-scale 3D convolution, dilated depthwise convolution (DW-D), and residual connections into a network module designed to precisely capture subtle temporal motion changes, as shown in Figure 2.
The specific design of the STCB module is as follows. In the first step, a 3D convolution increases the hidden dimension for subsequent operations. In the second step, a multi-branch architecture is applied: the output tensor from the first step is fed into four branches, each containing a dilated depthwise 3D convolution (DW-D Conv3d) layer with a different dilation rate (d = 1, 2, 3, 5), followed by a GroupNorm normalization layer and a LeakyReLU activation function. In the third step, the different feature information extracted by the four branches is integrated. In the fourth step, the above operations are repeated. In the fifth step, a residual connection performs element-wise addition between the initial input tensor and the output tensor of the fourth step. In the sixth step, the result of the fifth step is passed through a LeakyReLU activation function to obtain the final output of the STCB module.
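The four dilation rates give the branches progressively larger receptive fields at no extra parameter cost. The effective kernel size of a dilated convolution is k + (k − 1)(d − 1); assuming a base kernel of 3 (the paper does not state the exact kernel size, so this is an illustrative assumption):

```python
def effective_kernel(k, d):
    """Effective receptive field of a dilated convolution kernel:
    k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)

# With an assumed base kernel of 3, the four STCB branches (d = 1, 2, 3, 5)
# cover receptive fields of increasing size with the same weight count.
branch_fields = {d: effective_kernel(3, d) for d in (1, 2, 3, 5)}
print(branch_fields)  # {1: 3, 2: 5, 3: 7, 5: 11}
```

The smallest branch (field 3) captures fine local motion, while the largest (field 11) spans broader echo structures, matching the multi-scale design rationale below.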
Given the multidimensional time series nature of radar reflectivity images, traditional 2D convolution operations might mix originally independent variables, leading to a loss of independence among feature channels and making it difficult to reflect the interrelationships between variables. Therefore, this study employs 3D convolution technology to explore the interdependencies across temporal and spatial scales, thus more accurately capturing the spatiotemporal features of radar reflectivity images.
The distribution of key information in spatiotemporal data is complex and dynamically changing, so the model needs to handle information at different scales flexibly. This study achieves multi-scale feature extraction by applying convolution kernels of different sizes. Smaller convolution kernels are used to capture fine local features, while larger convolution kernels are used to capture globally distributed information. Additionally, the multi-branch architecture design allows the model to effectively integrate local details and global trends.
However, large 3D convolution kernels reduce computational efficiency and significantly increase the number of model parameters [45]. To address this issue, this study employs dilated depthwise convolution to achieve different receptive field sizes while reducing the number of model parameters. In addition, the STCB module introduces residual connections to retain the original feature information, which alleviates prediction degradation to some extent and improves the model’s ability to capture long-term dependencies. Although the STCB module is built on a purely convolutional network, it effectively captures spatiotemporal dependencies.
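The parameter saving from going depthwise is easy to quantify. A dense 3D convolution with C input and C output channels holds C² · k³ weights, while a depthwise 3D convolution holds only C · k³ (dilation adds none). The channel count 64 below is illustrative, not from the paper:

```python
def conv3d_params(c_in, c_out, k):
    """Weights of a dense 3D convolution (bias omitted)."""
    return c_in * c_out * k ** 3

def depthwise_conv3d_params(c, k):
    """Weights of a depthwise 3D convolution: one k^3 filter per channel.
    Dilation enlarges the receptive field without adding weights."""
    return c * k ** 3

c, k = 64, 3
dense = conv3d_params(c, c, k)             # 64 * 64 * 27 = 110,592
depthwise = depthwise_conv3d_params(c, k)  # 64 * 27 = 1,728
print(dense // depthwise)  # 64x fewer weights per layer
```

The factor equals the channel count, so the saving grows with network width, which is why depthwise layers make the four-branch design affordable.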
3.5. Loss Functions
We introduce a novel loss function, STLoss, which consists of two components: WMSE [39] and DDR [46], used to learn intra-frame and inter-frame changes in radar data, respectively. WMSE addresses precipitation sample imbalance: by setting thresholds, it adjusts the model’s emphasis on precipitation regions of different intensities, ensuring that the model balances their influence during prediction. DDR, in turn, overcomes the shortcoming of an MSE loss that considers only intra-frame errors: it learns the temporal variation trends of the data, capturing the differences between consecutive frames and the inherent dynamics of the data. This combined loss not only handles data imbalance effectively but also promotes better learning of spatiotemporal features, further improving the model’s accuracy.
The specific implementation of STLoss is as follows:
- (1) Calculate the intra-frame error
We calculate the intra-frame error between the real and predicted radar echo images using a weighted mean squared error. Weights w are assigned according to the radar reflectivity range of each pixel, the squared error is computed between the predicted values ŷ and the target values y, and the errors for the different ranges are multiplied by their corresponding weights. The specific formula is as follows:
WMSE = (1/t) Σ_{i=1}^{t} mean( w ⊙ (ŷ_i − y_i)² ),
where t is the prediction length and w denotes the pixel-wise weights determined by the reflectivity interval of the target values.
Data analysis revealed that the occurrence frequency of different precipitation intensities is highly imbalanced. Based on the data distribution in the various intervals, the weights are set to 1, 2, 4, and 6, respectively, amplifying the prediction errors of the corresponding radar reflectivity intervals by those factors.
- (2) Calculate the inter-frame error
We calculate the inter-frame error between the real and predicted radar echo images using differential divergence regularization. First, the differences between adjacent frames along the time dimension are computed for both the predicted values ŷ and the target values y:
Δŷ_i = ŷ_{i+1} − ŷ_i,  Δy_i = y_{i+1} − y_i.
Next, each difference matrix is flattened into a one-dimensional vector, and the softmax function [47,48,49,50] converts the differences into probability distributions:
q_i = softmax(flatten(Δŷ_i)),  p_i = softmax(flatten(Δy_i)).
Finally, the Kullback–Leibler (KL) divergence [51] measures the difference between the two probability distributions:
DDR = (1/(t − 1)) Σ_{i=1}^{t−1} KL(p_i ‖ q_i),
where KL(·‖·) denotes the KL divergence, p_i is the probability distribution of the target differences, q_i is the probability distribution of the predicted differences, and t is the prediction length.
- (3) Calculate the target loss function, STLoss
The STLoss consists of two parts, the weighted mean squared error and the differential divergence regularization, with λ₁ and λ₂ as the corresponding constant weights. The specific formula is as follows:
STLoss = λ₁ · WMSE + λ₂ · DDR.
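The three steps above can be sketched in numpy. This is an illustrative implementation, not the authors' code: the reflectivity interval boundaries in `wmse` (20/30/40 dBZ, aligned with the evaluation thresholds) and the constants `lam1`/`lam2` are assumptions, since the paper states only the weights 1, 2, 4, and 6.

```python
import numpy as np

def wmse(y_pred, y_true, thresholds=(20.0, 30.0, 40.0),
         weights=(1.0, 2.0, 4.0, 6.0)):
    """Weighted MSE over a (t, H, W) sequence: each pixel's squared error
    is scaled by a weight chosen from the reflectivity interval of the
    target value (interval boundaries are an assumption)."""
    w = np.full_like(y_true, weights[0])
    for thr, wt in zip(thresholds, weights[1:]):
        w[y_true >= thr] = wt
    return float(np.mean(w * (y_pred - y_true) ** 2))

def _softmax(v):
    e = np.exp(v - v.max())  # numerically stable softmax
    return e / e.sum()

def ddr(y_pred, y_true, eps=1e-12):
    """Differential divergence regularization: KL divergence between the
    softmax distributions of frame-to-frame differences of target and
    prediction, averaged over the sequence."""
    t = y_true.shape[0]
    total = 0.0
    for i in range(t - 1):
        q = _softmax((y_pred[i + 1] - y_pred[i]).ravel())
        p = _softmax((y_true[i + 1] - y_true[i]).ravel())
        total += float(np.sum(p * np.log((p + eps) / (q + eps))))
    return total / (t - 1)

def stloss(y_pred, y_true, lam1=1.0, lam2=1.0):
    """STLoss = lam1 * WMSE + lam2 * DDR (lam1 and lam2 are placeholders)."""
    return lam1 * wmse(y_pred, y_true) + lam2 * ddr(y_pred, y_true)

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 50.0, size=(4, 8, 8))  # toy (t, H, W) target sequence
print(stloss(y, y))              # perfect prediction -> 0.0
print(stloss(y * 0.5, y) > 0.0)  # imperfect prediction -> positive loss
```

Note how DDR vanishes whenever the predicted frame-to-frame dynamics match the target dynamics, independently of any constant intensity bias, which is exactly the inter-frame behavior the combined loss is meant to supervise.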
3.6. Implementation Details
This experiment is conducted under the PyTorch framework, using Adam as the optimizer to train the model. The batch size is set to 2; the learning rate and the other experimental parameter settings are shown in Table 1. The MS-DD3D-RSTN model uses STLoss as the loss function. The GPU used in the experiments is a GeForce RTX 4090 with 24 GB of memory.
3.7. Evaluation Metrics
We use a threshold-based evaluation on the test set, with thresholds of 20, 30, and 40 dBZ. The prediction durations are 1 h and 2 h, with a forecast interval of 6 min. The evaluation metrics are the commonly used meteorological indicators: the Critical Success Index (CSI), the Probability of Detection (POD), and the False Alarm Ratio (FAR) [52,53,54]. CSI is a comprehensive score for the accuracy of quantitative precipitation forecasts; POD is the proportion of actual precipitation areas that are correctly identified; and FAR is the proportion of predicted precipitation areas that are incorrect. Therefore, the higher the values of POD and CSI, and the lower the value of FAR, the more accurate the prediction results and the better the model performance. The specific formulas are as follows:
CSI = hit / (hit + miss + far),
POD = hit / (hit + miss),
FAR = far / (hit + far),
where hit represents true positives, meaning both the predicted and actual values are above the threshold; miss represents false negatives, where the predicted value is below the threshold but the actual value is above it; and far represents false positives, where the predicted value is above the threshold but the actual value is below it.
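These contingency-table scores can be computed directly from thresholded arrays; a short numpy sketch with a hand-checkable toy example:

```python
import numpy as np

def scores(y_pred, y_true, thr):
    """Threshold-based CSI, POD and FAR from hits, misses and false alarms."""
    p, t = y_pred >= thr, y_true >= thr
    hit = int(np.sum(p & t))     # predicted and observed above threshold
    miss = int(np.sum(~p & t))   # observed but not predicted
    fa = int(np.sum(p & ~t))     # predicted but not observed
    csi = hit / (hit + miss + fa)
    pod = hit / (hit + miss)
    far = fa / (hit + fa)
    return csi, pod, far

y_true = np.array([25.0, 25.0, 25.0, 10.0, 10.0])
y_pred = np.array([25.0, 25.0, 10.0, 25.0, 10.0])
# hits = 2, misses = 1, false alarms = 1
print(scores(y_pred, y_true, thr=20.0))  # CSI = 0.5, POD ~ 0.667, FAR ~ 0.333
```

In practice the counts are accumulated over all pixels and lead times before the ratios are taken; the per-array version here is enough to verify the formulas.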
5. Discussion
We propose an end-to-end radar echo extrapolation-based nowcasting model: MS-DD3D-RSTN. The main contributions are the spatiotemporal learner and the STLoss function. The spatiotemporal learner, composed of stacked STCB modules, focuses on dynamic changes in time and space, capturing the spatiotemporal dependencies of radar echoes. The STCB is a multi-branch architecture that uses multi-scale depthwise and dilated 3D convolutions to operate at different temporal and spatial scales, and employs residual connections to mitigate prediction degradation to some extent. In addition, the STLoss function learns changes within and between radar frames, enhancing the model’s ability to learn the dynamic changes of radar echoes at different temporal and spatial scales and further improving accuracy, as confirmed by the prediction results.
The core of radar echo extrapolation methods based on recurrent neural networks lies in the recurrent units, which achieve continuous memory and updating of temporal information through recurrent connections, capturing dynamic features in the data. However, recurrent structures also have some significant drawbacks. Firstly, due to the presence of recurrent structures, errors accumulate during the prediction process, especially with long time series, potentially leading to decreased prediction accuracy. Secondly, the characteristics of recurrent structures make parallel processing inefficient, reducing computational efficiency, particularly on large datasets. In tasks such as radar echo extrapolation prediction, these issues result in prediction degradation and low accuracy. The proposed method in this paper, however, strives to preserve all detailed information to prevent prediction degradation and improve model prediction accuracy.
Radar echo extrapolation methods based on convolutional neural networks essentially adopt an encoder-decoder network architecture. This architecture, through symmetric contraction and expansion paths, demonstrates strong feature extraction capabilities, effectively focusing on the texture features of radar echoes, and accelerates the prediction process through parallel computation. However, radar echoes exhibit dynamic variability at different temporal and spatial scales. Conventional convolution operations are performed only on the spatial scale, neglecting temporal dependencies, and thus cannot capture spatiotemporal correlations well, leading to poor performance in radar echo extrapolation tasks. To overcome these issues, the proposed method employs multi-scale 3D convolutions to focus on temporal and spatial dimensions. Additionally, the STLoss function emphasizes the information gradient differences between sequential frames, forcing the network to focus on the temporal evolution of radar signals, thereby more efficiently capturing the features of radar temporal extrapolation.
The radar echo extrapolation model MS-DD3D-RSTN proposed in this paper is a deep learning model purely based on radar data analysis. However, the precipitation process is influenced by various factors such as atmospheric physics and meteorology. Although the MS-DD3D-RSTN model has shown improvements in certain areas, there are still some shortcomings. Firstly, the model has many parameters, requiring substantial computational resources. Secondly, as the prediction time extends, both the accuracy of the predictions and the clarity of the images tend to decrease. Future research can focus on physics-based deep learning methods to enhance the model’s physical interpretation of precipitation generation and evolution, further improving the performance of radar echo extrapolation.
The MS-DD3D-RSTN model has broad potential applications in practice. Firstly, it can be used to improve meteorological forecasting systems, especially in predicting heavy rainfall events, thereby enhancing disaster prevention and mitigation efficiency. Secondly, the model can be deployed in the field of agricultural management, providing farmers with more accurate weather forecasts to help optimize planting and harvesting schedules. Additionally, accurate precipitation forecasts are crucial in urban management for preventing urban flooding and planning infrastructure. Given these application scenarios, further improving the accuracy of the model’s predictions is of significant practical importance.