1. Introduction
Nowcasting typically refers to weather forecasting for the next 0 to 2 h [1] and is particularly important for severe convective weather such as thunderstorms, strong winds, and hail. It plays a crucial role in sectors such as transportation, agriculture, the military, livestock, and tourism, and is indispensable for urban flood prevention and early warning systems. Because radar echo image sequences exhibit high temporal and spatial correlation, they are commonly used as important input variables for nowcasting models. How to use historical radar echo images for nowcasting, especially for predicting severe convective weather, is a hot research topic [2,3]. The Doppler weather radars operated by meteorological departments provide detailed monitoring products with high spatiotemporal resolution, which are extremely important for monitoring disasters and sudden weather events and for early warning. As a result, many scholars have analyzed the characteristics of radar echoes under severe thunderstorm and strong wind conditions, supporting improvements in weather forecasting capabilities in related fields [4,5,6].
Traditional radar echo extrapolation methods mainly include the single-cell centroid method [7], tracking radar echoes by correlation (TREC) [8,9], and the optical flow method [10]. In recent years, the methods used in nowcasting have mainly included the traditional single-cell centroid method, the TITAN (Thunderstorm Identification, Tracking, Analysis, and Nowcasting) algorithm [11], the SCIT (Storm Cell Identification and Tracking) algorithm [12], and the deep learning-based FURENet (Future Radar Echoes prediction Network) model [13], among others. The core idea of these methods is to treat thunderstorm cells as three-dimensional entities and to predict their movement and evolution by identifying, analyzing, and tracking the cells. The traditional single-cell centroid method calculates the centroid position, volume, and projected area of a thunderstorm cell, matches and tracks cells across consecutive radar image sequences, and then extrapolates for early warning. It provides detailed feature data for cells, but it is computationally intensive and suitable only for strong convective storms. The TITAN and SCIT algorithms were developed from the single-cell centroid method and improve tracking accuracy and efficiency through enhanced recognition procedures: TITAN uses a cost function to cast tracking as a combinatorial optimization problem, while SCIT employs the nearest neighbor method from pattern recognition. Both have received positive feedback in practical applications and have become mainstream methods for nowcasting strong convective weather. The FURENet model is a new attempt to apply deep learning to nowcasting; it fuses different types of information and attends to important details to make better predictions. By exploiting the polarimetric radar variables KDP and ZDR, the model improves forecast accuracy: after adding these polarimetric variables, forecast scores at 30 min and 60 min lead times improved by 13.2% and 17.4%, respectively, compared with using the reflectivity factor alone.
With the deepening of research, although the single-cell centroid method has been continuously improved and has made some progress [14,15,16,17], its large computational cost and poor generalization ability prevent it from achieving better results in nowcasting. The cross-correlation method (TREC) computes the spatially optimal correlation between radar echo data at consecutive times to obtain the motion vectors of convective systems at different locations, and then extrapolates the radar echoes along these vectors. Because it considers only the horizontal movement of echoes, it performs poorly for rapidly changing convective precipitation. It also carries a large computational load: it recognizes and tracks larger echoes well but handles smaller cells poorly, especially when multiple cells are close together. The optical flow method computes the optical flow field of radar echoes to obtain a motion vector field and then extrapolates the echoes along it. It can capture echo movement and change, but it does not fully utilize echo image information over longer periods, and its forecasts lag, so it cannot meet real-time forecasting needs. These traditional methods perform well when echoes are stable, tracking echo movement and changes fairly accurately; however, for weather processes in which echoes change rapidly, prediction accuracy decreases significantly. Furthermore, their forecast lead time is short, and accuracy drops rapidly as the forecast duration increases. With the development of deep learning, deep learning-based radar echo extrapolation methods, such as convolutional long short-term memory (ConvLSTM) networks, have shown better temporal and spatial feature extraction by combining the strengths of convolutional neural networks and long short-term memory networks. These methods are particularly suited to radar echo images with strong temporal and spatial correlation and hold the potential to improve the accuracy and timeliness of forecasts.
Recurrent neural networks (RNNs) are a type of neural network specifically designed for handling sequence data [18]. They capture temporal dependencies by recursively passing information across the sequence and sharing parameters within the network. The advantage of RNNs lies in their ability to memorize temporal information, but early RNNs suffered from vanishing gradients. To address this problem, various gated RNN variants were developed, the most famous being long short-term memory (LSTM) networks [19]. LSTM networks largely avoid the vanishing gradient problem by introducing gating mechanisms, enhancing the network's ability to handle long-term dependencies. Building on LSTM, scholars further explored models that incorporate convolutions, such as ConvLSTM [20]. These models are particularly suited to predicting radar echo sequences, as they process spatial and temporal information simultaneously. Subsequently, by introducing dual-memory state transition mechanisms and Gradient Highway Units (GHUs), models such as ST-LSTM, PredRNN++, and MIM were proposed [21,22,23]. These models achieved deep integration of temporal and spatial memories, improving feature extraction capabilities. To further enhance performance, Lin et al. integrated attention mechanisms into ConvLSTM [24], while Jing et al. developed the HPRNN model [25], which incorporates a hierarchical prediction strategy and a recursive coarse-to-fine mechanism; both approaches help reduce error accumulation in long-term prediction. Additionally, some studies combined LSTM with Generative Adversarial Networks (GANs) to leverage GANs' ability to extract deeper radar echo features [26,27]. In terms of model structure, Sato et al. introduced skip connections and dilated convolutions to enhance short-term nowcasting [28]. The U-Net architecture has also been applied to precipitation nowcasting, and Agrawal et al. further improved prediction accuracy by adding residual and attention mechanisms [29]. These enhanced U-Net models are not only more lightweight but also outperform traditional models in practical applications. Through continuous optimization and innovation, especially in handling complex meteorological data and short-term nowcasting, these deep learning-based models have significantly improved the prediction of radar echo sequences.
With the development of deep learning technology, radar echo extrapolation has made significant progress. From early simple neural network models to the current complex deep learning architectures, the application of deep learning in radar echo extrapolation has deepened, demonstrating powerful potential and advantages. However, these deep learning models still have some limitations in practical applications. For instance, in the field of radar echo extrapolation, traditional recurrent neural network models have made some progress in handling sequential data, but they tend to encounter the issue of error accumulation in long-term predictions. This leads to the continuous amplification of deviations in successive time steps, severely affecting the accuracy of the model’s prediction of future moments. Moreover, these models also face challenges in mining long-term spatial dependencies between radar image sequences, and their accuracy in predicting small-area strong echo regions and their trends still needs improvement.
In view of these limitations of existing models in radar echo extrapolation, this study aims to address three core questions: (1) how to effectively integrate attention mechanisms and full-dimensional dynamic convolution into LSTM networks to enhance spatiotemporal feature extraction for radar echo prediction; (2) whether the proposed model can capture multi-scale spatiotemporal dependencies, especially for rapidly evolving convective systems; and (3) to what extent the model improves generalization and reduces error accumulation in long-term predictions under extreme weather conditions. To address these challenges, we propose the EOST-LSTM model, which combines an efficient multi-scale attention module with a full-dimensional dynamic convolution module. The attention mechanism enhances global–local feature perception through the adaptive weighting of key regions in the radar image, while the dynamic convolution optimizes kernel parameters across the spatial, channel, and filter dimensions to capture different contextual information. Our experimental results on the Moving MNIST (Moving Modified National Institute of Standards and Technology) dataset and a real radar dataset show that the model improves prediction accuracy in heavy-rain regions and maintains structural consistency over long forecast periods.
3. Method
In this section, we first introduce the attention mechanism, followed by a description of the full-dimensional dynamic convolution module. We then explain how the attention module and the full-dimensional dynamic convolution module were integrated into the ST-LSTM unit to obtain the new EOST-LSTM unit. Finally, we provide a detailed description of the overall extrapolation structure of the proposed EOST-LSTM model.
3.1. Efficient Multi-Scale Attention Module
In the field of computer vision, channel and spatial attention mechanisms are crucial for recognizing and extracting features. However, traditional methods often reduce the number of channels to better handle inter-channel relationships, which may negatively impact the extraction of deep features. To address this issue, we propose a novel efficient multi-scale attention module that maintains channel information while reducing computational cost through innovative techniques. Specifically, the proposed module extends certain channels to the batch dimension and divides the channels into multiple sub-feature groups, ensuring that the spatial semantic features within each group are evenly distributed. This structure allows the module to both encode global information to adjust the channel weights across different branches and integrate the outputs of two parallel branches through cross-dimensional interaction, capturing pixel-level detail relationships. This approach not only preserves the information within each channel, but also improves efficiency by reducing the computational load. The detailed structure is shown in
Figure 1, where “g” denotes the number of groups, “X Avg Pool” represents one-dimensional horizontal global pooling, and “Y Avg Pool” represents one-dimensional vertical global pooling. This attention module strikes a balance between computational efficiency and rich feature extraction, making it well suited to large-scale spatiotemporal data such as radar echo sequences while maintaining high prediction accuracy. The “g * batch size” in the structure diagram indicates that the number of groups g is multiplied by the batch size; the core purpose is to improve efficiency through parallel group-wise computation. This operation splits each batch sample into g groups for independent processing (a grouping strategy similar to multi-head attention), optimizing hardware resource utilization and enhancing the diversity of feature interactions without increasing computational complexity.
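As a minimal illustration of this group-to-batch operation, the following PyTorch snippet folds g sub-feature groups into the batch dimension and restores them afterwards; all shapes are illustrative and not taken from the model's actual configuration.

```python
import torch

# Illustrative shapes: batch of 8 feature maps, 64 channels,
# a 32x32 grid, and g = 8 sub-feature groups (c must divide by g).
b, c, h, w, g = 8, 64, 32, 32, 8
x = torch.randn(b, c, h, w)

# Fold the groups into the batch dimension ("g * batch size" in Figure 1):
# each group of c//g channels is then processed independently in parallel.
x_grouped = x.reshape(b * g, c // g, h, w)   # (64, 8, 32, 32)

# ... per-group attention is computed here ...

# Restore the original layout afterwards.
x_out = x_grouped.reshape(b, c, h, w)
```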
3.1.1. Coordinate Attention Module
The CA module [31], also known as the coordinate attention module, captures the global spatial information of the image along the horizontal and vertical directions through global average pooling, generating two one-dimensional feature vectors that encode the corresponding positional information. These vectors are then processed by a 1 × 1 convolution and a Sigmoid activation function to produce two attention maps that emphasize the regions of interest in the feature map. The attention maps are used to reweight the original features, retaining spatial location information, learning the correlations between channels, and enhancing the model's ability to recognize key features. The CA module is also designed with computational efficiency in mind: it shares the 1 × 1 convolution weights and aggregates the attention maps through simple multiplications, reducing the computational burden without sacrificing performance. This allows the module to improve feature representation and capture key information in images across a variety of visual tasks. The specific structure is shown in Figure 2. Let the original input tensor be $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of input channels and $H$ and $W$ are the spatial dimensions of the input features. The 1D global average pooling that encodes global information along the horizontal direction for channel $c$ at height $h$ can then be expressed as
$$ z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1} $$
Similarly, the other branch comes from 1D global average pooling along the vertical dimension and can therefore be viewed as a collection of positional information along that direction. This branch spatially captures long-range interactions while retaining precise location information along the horizontal dimension, enhancing the focus on spatial regions of interest. At width $w$, the pooling output for channel $c$ can be expressed as
$$ z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2} $$
The module works by creating two parallel one-dimensional feature encoding vectors that capture global information in different directions. It then rearranges these vectors and merges them through a convolution layer, sharing a 1 × 1 convolution to capture local interactions between channels. The module then splits the output of the 1 × 1 convolution again and applies the Sigmoid function on each branch, generating two attention maps. These attention maps are finally combined to enhance the representation of important parts of the feature map while preserving spatial location information and effectively exploiting long-distance dependencies. In short, the CA module improves the expressiveness of features through fine feature recombination and attention weighting while maintaining the integrity of spatial information.
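To make the data flow concrete, the following is a minimal PyTorch sketch of a coordinate attention block consistent with the description above and with Equations (1) and (2); the reduction ratio and layer names are illustrative assumptions, not the exact implementation of [31].

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module: directional pooling, a shared 1x1
    convolution, and two direction-aware attention maps."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 convolution applied to the concatenated pooled vectors.
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        # Separate 1x1 convolutions produce the two attention maps.
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # 1D global average pooling along each direction, Eqs. (1)-(2).
        x_h = x.mean(dim=3, keepdim=True)                      # (b, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (b, c, w, 1)
        # Concatenate, transform with the shared 1x1 convolution, then split.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)                          # (b, mid, 1, w)
        # Sigmoid yields the two attention maps used to reweight the input.
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w))                  # (b, c, 1, w)
        return x * a_h * a_w
```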
3.1.2. Structure and Working Principle of Efficient Multi-Scale Attention Module
The coordinate attention module performs well at integrating accurate spatial position information and capturing long-distance spatial interactions, thereby improving model performance. However, it does not fully account for all interactions across spatial positions, and the limited receptive field of the 1 × 1 convolution kernel restricts the model's ability to capture local cross-channel interactions and contextual information. In short, while the coordinate attention module improves performance, it considers spatial interactions incompletely and has limited capacity for capturing local interactions.
The core aim of the efficient multi-scale attention module is to enhance feature representation through grouping and parallel processing instead of the traditional channel dimensionality reduction. The specific structure is shown in Figure 3: the input feature map is divided into multiple sub-feature groups, each of which learns different semantic information. The module contains two main parallel branches: a 1 × 1 convolution branch and a 3 × 3 convolution branch. The 1 × 1 branch is responsible for capturing global relationships between channels, while the 3 × 3 branch focuses on local multi-scale spatial information. In the 1 × 1 branch, the module uses global average pooling to encode channel information and a 1 × 1 convolution to aggregate it into channel attention maps; these attention maps then reweight the original feature map to highlight important features. The 3 × 3 branch uses a 3 × 3 convolution kernel to capture local features and spatial relationships, expanding the receptive field and enriching the feature representation. The module then fuses the outputs of the two branches through cross-spatial learning: it uses global average pooling to encode global spatial information and matrix dot-product operations to aggregate spatial information at different scales, generating spatial attention maps. These spatial attention maps capture not only pairwise relationships between pixels (for example, the similarity of a pixel to its surroundings in brightness or shape) but also global context (such as the distribution of weather systems across the radar image). Specifically, “pairwise relationships” refers to the local-detail interactions the model learns by analyzing the relationship between each pixel and the others (such as whether adjacent pixels belong to the same cloud or precipitation region), while “context awareness” means the model can synthesize information from the whole image (such as the overall movement trend of clouds or the distribution pattern of strong echoes), giving each pixel more comprehensive environmental support. For example, when predicting the radar echo intensity at a given point, the model considers not only local features near that point but also the dynamics of distant clouds, improving its ability to predict the evolution of complex weather systems.
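The following PyTorch sketch illustrates one plausible realization of the module as described: group-to-batch reshaping, the parallel 1 × 1 and 3 × 3 branches, and cross-spatial fusion via matrix dot products. The class name, group count, and normalization choices are assumptions for illustration, not the exact code of the model.

```python
import torch
import torch.nn as nn

class EfficientMultiScaleAttention(nn.Module):
    """Sketch of the efficient multi-scale attention module (Figure 3)."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.g = groups
        cg = channels // groups          # channels per sub-feature group
        self.softmax = nn.Softmax(dim=-1)
        self.gn = nn.GroupNorm(cg, cg)
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)  # groups -> batch dim
        # 1x1 branch: directional pooling, shared 1x1 conv, per-direction gating.
        x_h = xg.mean(dim=3, keepdim=True)                       # (bg, cg, h, 1)
        x_w = xg.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (bg, cg, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        y_h, y_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(xg * y_h.sigmoid() * y_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local multi-scale spatial context.
        x2 = self.conv3x3(xg)
        # Cross-spatial learning: each branch's pooled global descriptor
        # attends over the other branch's pixels via a matrix dot product.
        bg, cg = x1.shape[0], x1.shape[1]
        a1 = self.softmax(x1.mean(dim=(2, 3)).reshape(bg, 1, cg))
        a2 = self.softmax(x2.mean(dim=(2, 3)).reshape(bg, 1, cg))
        v1 = x1.reshape(bg, cg, h * w)
        v2 = x2.reshape(bg, cg, h * w)
        weights = torch.matmul(a1, v2) + torch.matmul(a2, v1)    # (bg, 1, h*w)
        weights = weights.reshape(bg, 1, h, w).sigmoid()
        # Reweight each group and restore the original layout.
        return (xg * weights).reshape(b, c, h, w)
```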
3.2. Full-Dimensional Dynamic Convolution Module
In traditional convolutional neural network training, each layer usually learns only a single static convolution kernel. However, recent advances have shown that network performance can be improved by learning linear combinations of multiple convolution kernels and dynamically adjusting their weights based on the input, an approach known as dynamic convolution. It improves the accuracy of lightweight CNNs while maintaining efficient inference. However, existing studies focus only on dynamically weighting the set of convolution kernels, ignoring other important dimensions such as the spatial size of the kernels and the numbers of input and output channels. To overcome this limitation, we propose full-dimensional dynamic convolution, which learns complementary attention along all four dimensions of the convolution kernel through a multidimensional attention mechanism and a parallel processing strategy, achieving a more comprehensive and flexible dynamic convolution design. The specific architecture is shown in
Figure 4.
Continuing the definition of dynamic convolution, full-dimensional dynamic convolution can be described as follows:
$$ y = \left( a_{w1} \odot a_{f1} \odot a_{c1} \odot a_{s1} \odot w_1 + \cdots + a_{wn} \odot a_{fn} \odot a_{cn} \odot a_{sn} \odot w_n \right) * x \tag{3} $$
where $a_{wi}$ represents the attention scalar of the convolution kernel $w_i$, and $a_{si} \in \mathbb{R}^{k \times k}$, $a_{ci} \in \mathbb{R}^{c_{in}}$, and $a_{fi} \in \mathbb{R}^{c_{out}}$ represent three newly introduced attention modules along the spatial dimension, input channel dimension, and output channel dimension, respectively. These four attention modules are calculated from the input by the multi-head attention module $\pi_i(x)$. The symbol “*” represents the convolution operation, a basic operation in convolutional neural networks used to extract and transform the input data: the convolution kernel is multiplied element-by-element with the input feature map and summed to generate a new feature map, capturing spatial and channel dependencies in the data.
In full-dimensional dynamic convolution, for the convolution kernel $w_i$, $a_{si}$ assigns different attention values to the convolution parameters at the $k \times k$ spatial positions; $a_{ci}$ assigns different attention values to the convolution filters of different input channels; $a_{fi}$ assigns different attention values to the convolution filters of different output channels; and $a_{wi}$ assigns different values to the $n$ overall convolution kernels. In principle, these four types of attention are complementary: by progressively multiplying the kernel $w_i$ by the different attention modules along the position, channel, filter, and kernel dimensions, the convolution operation adapts to the input along every dimension, capturing rich contextual information and providing better performance. Full-dimensional dynamic convolution can therefore greatly improve the feature extraction capability of convolution. More importantly, full-dimensional dynamic convolution with fewer convolution kernels can achieve comparable or even better performance than CondConv and DyConv.
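As a hedged illustration of Equation (3), the sketch below computes the four complementary attentions over a bank of $n$ kernels, aggregates them into a per-sample kernel, and applies it via a grouped convolution. The head design (a shared squeeze layer feeding four small heads) is an assumption for illustration; the exact module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Sketch of full-dimensional dynamic convolution (Equation (3))."""
    def __init__(self, c_in, c_out, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.k, self.n = k, n_kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)
        mid = max(c_in // reduction, 4)
        self.gap = nn.AdaptiveAvgPool2d(1)      # global context for pi_i(x)
        self.fc = nn.Conv2d(c_in, mid, 1)
        # Four heads: spatial (k*k), input channel, output channel, kernel.
        self.attn_s = nn.Conv2d(mid, k * k, 1)
        self.attn_c = nn.Conv2d(mid, c_in, 1)
        self.attn_f = nn.Conv2d(mid, c_out, 1)
        self.attn_w = nn.Conv2d(mid, n_kernels, 1)

    def forward(self, x):
        b, c_in, h, w = x.shape
        n, c_out, k = self.n, self.weight.shape[1], self.k
        ctx = F.relu(self.fc(self.gap(x)))                   # (b, mid, 1, 1)
        a_s = torch.sigmoid(self.attn_s(ctx)).view(b, 1, 1, 1, k, k)
        a_c = torch.sigmoid(self.attn_c(ctx)).view(b, 1, 1, c_in, 1, 1)
        a_f = torch.sigmoid(self.attn_f(ctx)).view(b, 1, c_out, 1, 1, 1)
        a_w = torch.softmax(self.attn_w(ctx).view(b, n, 1, 1, 1, 1), dim=1)
        # Progressively modulate the n kernels along all four dimensions,
        # then sum them into one aggregated kernel per sample.
        weight = (a_w * a_f * a_c * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        # Grouped-convolution trick: fold the batch into the channel axis so
        # each sample is convolved with its own aggregated kernel.
        x = x.reshape(1, b * c_in, h, w)
        weight = weight.reshape(b * c_out, c_in, k, k)
        out = F.conv2d(x, weight, padding=k // 2, groups=b)
        return out.reshape(b, c_out, out.shape[-2], out.shape[-1])
```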
3.3. EOST-LSTM Unit
In this section, we describe how to embed an efficient multi-scale attention module and a full-dimensional dynamic convolution module into the ST-LSTM unit to form the network unit EOST-LSTM, as shown in
Figure 5.
The input of the EOST-LSTM unit includes the current input $X_t$, the spatial memory $M_t^{l-1}$, the temporal memory $C_{t-1}^{l}$, and the hidden state $H_{t-1}^{l}$. The current input $X_t$ serves as the input to the attention module; by computing the global context features of $X_t$, attention weights are generated that reflect the relevance of different feature regions to the current task. In this way, the network not only enhances the perception of local features but also improves the understanding of each channel. The calculation of the EOST-LSTM unit is shown in Equation (4):
wherein “Att” represents the attention module, “ODC” represents the full-dimensional dynamic convolution module, $i_t$ is the first input gate, $g_t$ is the first input modulation gate, $f_t$ is the first forget gate, $i_t'$ is the second input gate, $g_t'$ is the second input modulation gate, $f_t'$ is the second forget gate, $C_t^{l}$ is the updated temporal memory, $M_t^{l}$ is the updated spatial memory, $W$ represents the corresponding convolution kernels, and $b$ represents the corresponding bias terms. “*” represents the convolution operation, used for the in-depth analysis of data characteristics, and “⊙” represents the Hadamard product, that is, element-wise matrix multiplication, used for controlling information flow. “σ” represents the sigmoid activation function and “tanh” the hyperbolic tangent activation function, both of which increase the nonlinear expressive ability of the model.
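Since Equation (4) follows the ST-LSTM gate structure, the following is a hedged sketch of one plausible form of the unit: the attention module (“Att”) weights the current input, and the static gate convolutions can be swapped for the full-dimensional dynamic convolution (“ODC”) via the `conv` factory argument (e.g., `conv=lambda ci, co: ODConv2d(ci, co)`). The exact placement of the two modules here is an assumption based on the description above, not a verified reproduction of Equation (4).

```python
import torch
import torch.nn as nn

class EOSTLSTMCell(nn.Module):
    """Sketch of an EOST-LSTM unit: standard ST-LSTM gates, with an
    attention module on the input and pluggable gate convolutions."""
    def __init__(self, in_ch, hid_ch, k=5, att=None, conv=None):
        super().__init__()
        conv = conv or (lambda ci, co: nn.Conv2d(ci, co, k, padding=k // 2))
        self.att = att or nn.Identity()          # e.g. the EMA module above
        self.conv_x = conv(in_ch, 7 * hid_ch)    # gates driven by X_t
        self.conv_h = conv(hid_ch, 4 * hid_ch)   # gates driven by H_{t-1}^l
        self.conv_m = conv(hid_ch, 3 * hid_ch)   # gates driven by M_t^{l-1}
        self.conv_o = conv(2 * hid_ch, hid_ch)   # output gate on [C_t, M_t]
        self.conv_last = nn.Conv2d(2 * hid_ch, hid_ch, 1)
        self.hid = hid_ch

    def forward(self, x, h, c, m):
        x = self.att(x)                          # attention-weighted input
        gx = torch.split(self.conv_x(x), self.hid, dim=1)
        gh = torch.split(self.conv_h(h), self.hid, dim=1)
        gm = torch.split(self.conv_m(m), self.hid, dim=1)
        # Temporal branch: first input/modulation/forget gates update C.
        i = torch.sigmoid(gx[0] + gh[0])
        f = torch.sigmoid(gx[1] + gh[1])
        g = torch.tanh(gx[2] + gh[2])
        c_new = f * c + i * g
        # Spatial branch: second set of gates updates M.
        i2 = torch.sigmoid(gx[3] + gm[0])
        f2 = torch.sigmoid(gx[4] + gm[1])
        g2 = torch.tanh(gx[5] + gm[2])
        m_new = f2 * m + i2 * g2
        # Output gate fuses both memories into the new hidden state.
        mem = torch.cat([c_new, m_new], dim=1)
        o = torch.sigmoid(gx[6] + gh[3] + self.conv_o(mem))
        h_new = o * torch.tanh(self.conv_last(mem))
        return h_new, c_new, m_new
```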
3.4. EOST-LSTM Network Structure
Based on the previous stack structure, the model proposed in this paper adds an efficient multi-scale attention module and a full-dimensional dynamic convolutional module. The specific structure is shown in
Figure 6. To build this network, we stack four layers of EOST-LSTM units. In this structure, the spatial memory M is passed and updated between layers along a zigzag path, representing the flow of information in the spatial dimension of the network; this path is drawn as an orange line. Meanwhile, the temporal memory C is updated horizontally within each layer, representing continuity in the time dimension, shown as a blue line.
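A hedged sketch of this stacked forward pass is given below: the temporal memory C is carried horizontally within each layer, while the single spatial memory M zigzags up through the four layers at each time step and re-enters the bottom layer at the next step. Function and variable names are illustrative; an output head mapping the top hidden state back to frame channels is omitted.

```python
import torch

def eost_lstm_forward(cells, frames, hid_ch):
    """cells: list of 4 EOSTLSTMCell (layer 0 built with the frame's input
    channels, layers 1-3 with hid_ch); frames: tensor (b, T, c, h, w)."""
    b, T, _, H, W = frames.shape
    L = len(cells)
    # Per-layer hidden states and temporal memories (blue lines in Figure 6).
    h = [frames.new_zeros(b, hid_ch, H, W) for _ in range(L)]
    c = [frames.new_zeros(b, hid_ch, H, W) for _ in range(L)]
    # Single spatial memory following the zigzag path (orange line).
    m = frames.new_zeros(b, hid_ch, H, W)
    outputs = []
    for t in range(T):
        x = frames[:, t]
        for layer in range(L):
            inp = x if layer == 0 else h[layer - 1]  # hidden state feeds upward
            h[layer], c[layer], m = cells[layer](inp, h[layer], c[layer], m)
        outputs.append(h[-1])
    return torch.stack(outputs, dim=1)               # (b, T, hid_ch, H, W)
```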
3.5. Evaluation Metrics
In this experiment, we used the Critical Success Index (CSI) and Heidke Skill Score (HSS) to assess the model's ability to predict future situations from historical radar data. Specifically, we first converted the radar data into a binary format and then, at each threshold (10 dBZ, 20 dBZ, and 35 dBZ), counted the correctly predicted cases (true positives, TP; true negatives, TN) and the incorrectly predicted cases (false positives, FP; false negatives, FN). These statistics jointly account for the model's detection probability and false alarm rate, quantifying prediction accuracy. The higher the CSI and HSS values, the more accurate the prediction and the stronger the model's performance.
The specific formulas of CSI and HSS are shown in Formula (5):
$$ \mathrm{CSI} = \frac{TP}{TP + FN + FP}, \qquad \mathrm{HSS} = \frac{2\,(TP \cdot TN - FN \cdot FP)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)} \tag{5} $$
To convert the pixel values of a radar image into actual radar reflectivity in dBZ, we used a specific mathematical mapping that converts the original value of each pixel in the image to the corresponding reflectivity value (expressed in dBZ), as shown in Formula (6):
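For concreteness, the following sketch computes both scores of Formula (5) at the three thresholds used here, assuming the predictions and observations have already been converted to dBZ via Formula (6); the function name and the random example inputs are purely illustrative.

```python
import numpy as np

def csi_hss(pred_dbz, true_dbz, threshold):
    """CSI and HSS at a given reflectivity threshold (Formula (5)).

    pred_dbz, true_dbz: arrays of reflectivity values in dBZ.
    """
    p = pred_dbz >= threshold            # binarize prediction
    t = true_dbz >= threshold            # binarize observation
    tp = np.sum(p & t)                   # hits
    fp = np.sum(p & ~t)                  # false alarms
    fn = np.sum(~p & t)                  # misses
    tn = np.sum(~p & ~t)                 # correct rejections
    csi = tp / (tp + fn + fp)
    hss = (2 * (tp * tn - fn * fp)) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return csi, hss

# Example: evaluate at the three thresholds used in the experiments
# (random inputs here, purely for illustration).
pred = np.random.uniform(0, 60, (100, 100))
true = np.random.uniform(0, 60, (100, 100))
for thr in (10, 20, 35):
    print(thr, csi_hss(pred, true, thr))
```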
5. Conclusions
This study developed a new radar echo prediction model called EOST-LSTM, which utilizes deep learning techniques and improves the accuracy and efficiency of weather forecasts by incorporating global attention mechanisms and optimized LSTM units. Through experiments on the Moving MNIST dataset and the Jiangsu Province meteorological radar dataset, we confirm that the EOST-LSTM model performs well in a variety of scenarios.
In tests on the Moving MNIST dataset, the EOST-LSTM model achieved a low mean-square error and a high structural similarity index, indicating strong performance in dynamic image prediction. In tests on real weather radar data, the model also performed well: EOST-LSTM outperformed existing models such as ConvLSTM and PredRNN at all thresholds, as assessed by the Critical Success Index and Heidke Skill Score. In particular, at the high threshold, EOST-LSTM significantly improved the accuracy of predicting strong echo regions.
Although EOST-LSTM achieved remarkable results in radar echo prediction, we still need to recognize that there is room for improvement in the generalization ability of deep learning models and the prediction of extreme weather events. Future work will focus on further optimizing the model structure, improving computational efficiency, and exploring the application of the model to a wider range of weather prediction areas, such as temperature, humidity, and wind speed, to achieve more comprehensive weather prediction capability.