1. Introduction
Nowcasting typically refers to weather forecasting for the next 0 to 2 h [1] and is particularly important for severe convective weather such as thunderstorms, strong winds, and hail. It plays a crucial role in sectors such as transportation, agriculture, the military, livestock, and tourism, and is indispensable for urban flood prevention and early warning systems. Because radar echo image sequences exhibit high temporal and spatial correlation, they are commonly used as important input variables for nowcasting models. How to use historical radar echo images for nowcasting, especially for predicting severe convective weather, is a hot research topic [2,3]. The Doppler weather radars operated by meteorological departments provide detailed monitoring products with high spatiotemporal resolution, which are extremely important for monitoring disasters and sudden weather events and for early warning. As a result, many scholars have analyzed the characteristics of radar echoes under severe thunderstorm and strong wind conditions, supporting improvements in weather forecasting capabilities in related fields [4,5,6].
Traditional radar echo extrapolation methods mainly include the single-cell centroid method [7], tracking radar echoes by correlation (TREC) [8,9], and the optical flow method [10]. In recent years, the methods used in nowcasting have mainly included the traditional single-cell centroid method, the TITAN (Thunderstorm Identification, Tracking, Analysis, and Nowcasting) algorithm [11], the SCIT (Storm Cell Identification and Tracking) algorithm [12], and the deep learning-based FURENet (Future Radar Echoes prediction Network) model [13], among others. The core idea of these methods is to treat thunderstorm cells as three-dimensional entities and to predict their movement and evolution by identifying, analyzing, and tracking the cells. The traditional single-cell centroid method calculates the centroid position, volume, and projected area of a thunderstorm cell, matches and tracks cells across consecutive radar image sequences, and then extrapolates for early warning. It provides detailed feature data for cells, but it is computationally intensive and suitable only for strong convective storms. The TITAN and SCIT algorithms were developed from the single-cell centroid method and improve tracking accuracy and efficiency through enhanced recognition procedures: TITAN uses a cost function to cast tracking as a combinatorial optimization problem, while SCIT employs the nearest neighbor method from pattern recognition. Both have received positive feedback in practical applications and have become mainstream methods for nowcasting strong convective weather. The FURENet model is a new attempt to apply deep learning to nowcasting; it fuses different types of information and attends to important details to make better predictions. By exploiting the polarimetric radar variables KDP and ZDR, the model improves forecast accuracy: after adding these polarimetric variables, forecast scores at 30 min and 60 min lead times improved by 13.2% and 17.4%, respectively, compared with using the reflectivity factor alone.
With the deepening of research, although the single-cell centroid method has been continuously improved and has made some progress [14,15,16,17], its large computational cost and poor generalization ability prevent it from achieving better results in nowcasting. The cross-correlation method (TREC) computes the spatially optimal correlation between radar echo data at consecutive times to obtain the motion vectors of convective systems at different locations, and then extrapolates the radar echoes along these vectors. Because it considers only the horizontal movement of echoes, it performs poorly for rapidly changing convective precipitation. It also carries a large computational load: it recognizes and tracks larger echoes well but handles smaller cells poorly, especially when multiple cells are close together. The optical flow method computes the optical flow field of radar echoes to obtain a motion vector field and then extrapolates the echoes along it. It can capture echo movement and change, but it does not fully utilize echo image information over longer periods, and its forecasts lag, so it cannot meet real-time forecasting needs. These traditional methods perform well when echoes are stable, tracking echo movement and changes fairly accurately; however, for weather processes in which echoes change rapidly, prediction accuracy decreases significantly. Furthermore, their forecast lead time is short, and accuracy drops rapidly as the forecast duration increases. With the development of deep learning, deep learning-based radar echo extrapolation methods, such as convolutional long short-term memory (ConvLSTM) networks, have shown better temporal and spatial feature extraction by combining the strengths of convolutional neural networks and long short-term memory networks. These methods are particularly suited to radar echo images with strong temporal and spatial correlation and hold the potential to improve the accuracy and timeliness of forecasts.
Recurrent neural networks (RNNs) are a type of neural network specifically designed for handling sequence data [18]. They capture temporal dependencies by recursively passing information across the sequence and sharing parameters within the network. The advantage of RNNs lies in their ability to memorize temporal information, but early RNNs suffered from vanishing gradients. To address this problem, various gated RNN variants were developed, the most famous being long short-term memory (LSTM) networks [19]. LSTM networks largely avoid the vanishing gradient problem by introducing gating mechanisms, enhancing the network's ability to handle long-term dependencies. Building on LSTM, scholars further explored models that incorporate convolutions, such as ConvLSTM [20]. These models are particularly suited to predicting radar echo sequences, as they process spatial and temporal information simultaneously. Subsequently, by introducing dual-memory state transition mechanisms and Gradient Highway Units (GHUs), models such as ST-LSTM, PredRNN++, and MIM were proposed [21,22,23]. These models achieved deep integration of temporal and spatial memories, improving feature extraction capabilities. To further enhance performance, Lin et al. integrated attention mechanisms into ConvLSTM [24], while Jing et al. developed the HPRNN model [25], which incorporates a hierarchical prediction strategy and a recursive coarse-to-fine mechanism; both approaches help reduce error accumulation in long-term prediction. Additionally, some studies combined LSTM with Generative Adversarial Networks (GANs) to leverage GANs' ability to extract deeper radar echo features [26,27]. In terms of model structure, Sato et al. introduced skip connections and dilated convolutions to enhance short-term nowcasting [28]. The U-Net architecture has also been applied to precipitation nowcasting, and Agrawal et al. further improved prediction accuracy by adding residual and attention mechanisms [29]. These enhanced U-Net models are not only more lightweight but also outperform traditional models in practical applications. Through continuous optimization and innovation, especially in handling complex meteorological data and short-term nowcasting, these deep learning-based models have significantly improved the prediction of radar echo sequences.
With the development of deep learning technology, radar echo extrapolation has made significant progress. From early simple neural network models to the current complex deep learning architectures, the application of deep learning in radar echo extrapolation has deepened, demonstrating powerful potential and advantages. However, these deep learning models still have some limitations in practical applications. For instance, in the field of radar echo extrapolation, traditional recurrent neural network models have made some progress in handling sequential data, but they tend to encounter the issue of error accumulation in long-term predictions. This leads to the continuous amplification of deviations in successive time steps, severely affecting the accuracy of the model’s prediction of future moments. Moreover, these models also face challenges in mining long-term spatial dependencies between radar image sequences, and their accuracy in predicting small-area strong echo regions and their trends still needs improvement.
In view of these limitations of existing models in radar echo extrapolation, this study aims to address three core questions: (1) how to effectively integrate attention mechanisms and full-dimensional dynamic convolution into LSTM networks to enhance spatiotemporal feature extraction for radar echo prediction; (2) whether the proposed model can capture multi-scale spatiotemporal dependencies, especially for rapidly evolving convective systems; and (3) to what extent the model improves generalization and reduces error accumulation in long-term predictions under extreme weather conditions. To address these challenges, we propose the EOST-LSTM model, which combines an efficient multi-scale attention module with a full-dimensional dynamic convolution module. The attention mechanism enhances global–local feature perception through the adaptive weighting of key regions in the radar image, while the dynamic convolution optimizes kernel parameters across the spatial, channel, and filter dimensions to capture different contextual information. Our experimental results on the Moving MNIST (Moving Modified National Institute of Standards and Technology) dataset and a real radar dataset show that the model improves prediction accuracy in heavy-rain regions and maintains structural consistency over long forecast periods.
3. Method
In this section, we first introduce the attention mechanism, followed by a description of the full-dimensional dynamic convolution module. We then explain how the attention module and the full-dimensional dynamic convolution module were integrated into the ST-LSTM unit to obtain the new EOST-LSTM unit. Finally, we provide a detailed description of the overall extrapolation structure of the proposed EOST-LSTM model.
3.1. Efficient Multi-Scale Attention Module
In the field of computer vision, channel and spatial attention mechanisms are crucial for recognizing and extracting features. However, traditional methods often reduce the number of channels to better handle inter-channel relationships, which may negatively impact the extraction of deep features. To address this issue, we propose a novel efficient multi-scale attention module that maintains channel information while reducing computational cost through innovative techniques. Specifically, the proposed module extends certain channels to the batch dimension and divides the channels into multiple sub-feature groups, ensuring that the spatial semantic features within each group are evenly distributed. This structure allows the module to both encode global information to adjust the channel weights across different branches and integrate the outputs of two parallel branches through cross-dimensional interaction, capturing pixel-level detail relationships. This approach not only preserves the information within each channel, but also improves efficiency by reducing the computational load. The detailed structure is shown in
Figure 1, where “g” denotes the number of groups, “X Avg Pool” represents one-dimensional horizontal global pooling, and “Y Avg Pool” represents one-dimensional vertical global pooling. This attention module strikes a balance between computational efficiency and rich feature extraction, making it well suited to large-scale spatiotemporal data such as radar echo sequences while maintaining high prediction accuracy. The “g * batch size” in the structure diagram indicates that the number of groups g is multiplied by the batch size; the core purpose is to improve efficiency through parallel group-wise computation. This operation splits each batch sample into g groups for independent processing (a grouping strategy similar to multi-head attention), optimizing hardware resource utilization and enhancing the diversity of feature interactions without increasing computational complexity.
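As a minimal illustration of this group-to-batch operation, the following PyTorch snippet folds g sub-feature groups into the batch dimension and restores them afterwards; all shapes are illustrative and not taken from the model's actual configuration.

```python
import torch

# Illustrative shapes: batch of 8 feature maps, 64 channels,
# a 32x32 grid, and g = 8 sub-feature groups (c must divide by g).
b, c, h, w, g = 8, 64, 32, 32, 8
x = torch.randn(b, c, h, w)

# Fold the groups into the batch dimension ("g * batch size" in Figure 1):
# each group of c//g channels is then processed independently in parallel.
x_grouped = x.reshape(b * g, c // g, h, w)   # (64, 8, 32, 32)

# ... per-group attention is computed here ...

# Restore the original layout afterwards.
x_out = x_grouped.reshape(b, c, h, w)
```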
3.1.1. Coordinate Attention Module
The CA module [31], also known as the coordinate attention module, captures the global spatial information of the image along the horizontal and vertical directions through global average pooling, generating two one-dimensional feature vectors that encode the corresponding positional information. These vectors are then processed by a 1 × 1 convolution and a Sigmoid activation function to produce two attention maps that emphasize the regions of interest in the feature map. The attention maps are used to reweight the original features, retaining spatial location information, learning the correlations between channels, and enhancing the model's ability to recognize key features. The CA module is also designed with computational efficiency in mind: it shares the 1 × 1 convolution weights and aggregates the attention maps through simple multiplications, reducing the computational burden without sacrificing performance. This allows the module to improve feature representation and capture key information in images across a variety of visual tasks. The specific structure is shown in Figure 2. Let the original input tensor be $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of input channels and $H$ and $W$ are the spatial dimensions of the input features. The 1D global average pooling that encodes global information along the horizontal direction for channel $c$ at height $h$ can then be expressed as
$$ z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1} $$
Similarly, the other branch comes from 1D global average pooling along the vertical dimension and can therefore be viewed as a collection of positional information along that direction. This branch spatially captures long-range interactions while retaining precise location information along the horizontal dimension, enhancing the focus on spatial regions of interest. At width $w$, the pooling output for channel $c$ can be expressed as
$$ z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2} $$
The module works by creating two parallel one-dimensional feature encoding vectors that capture global information in different directions. It then rearranges these vectors and merges them through a convolution layer, sharing a 1 × 1 convolution to capture local interactions between channels. The module then splits the output of the 1 × 1 convolution again and applies the Sigmoid function on each branch, generating two attention maps. These attention maps are finally combined to enhance the representation of important parts of the feature map while preserving spatial location information and effectively exploiting long-distance dependencies. In short, the CA module improves the expressiveness of features through fine feature recombination and attention weighting while maintaining the integrity of spatial information.
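To make the data flow concrete, the following is a minimal PyTorch sketch of a coordinate attention block consistent with the description above and with Equations (1) and (2); the reduction ratio and layer names are illustrative assumptions, not the exact implementation of [31].

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module: directional pooling, a shared 1x1
    convolution, and two direction-aware attention maps."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 convolution applied to the concatenated pooled vectors.
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        # Separate 1x1 convolutions produce the two attention maps.
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # 1D global average pooling along each direction, Eqs. (1)-(2).
        x_h = x.mean(dim=3, keepdim=True)                      # (b, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (b, c, w, 1)
        # Concatenate, transform with the shared 1x1 convolution, then split.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)                          # (b, mid, 1, w)
        # Sigmoid yields the two attention maps used to reweight the input.
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w))                  # (b, c, 1, w)
        return x * a_h * a_w
```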
3.1.2. Structure and Working Principle of Efficient Multi-Scale Attention Module
The coordinate attention module performs well at integrating accurate spatial position information and capturing long-distance spatial interactions, thereby improving model performance. However, it does not fully account for all interactions across spatial positions, and the limited receptive field of the 1 × 1 convolution kernel restricts the model's ability to capture local cross-channel interactions and contextual information. In short, while the coordinate attention module improves performance, it considers spatial interactions incompletely and has limited capacity for capturing local interactions.
The core aim of the efficient multi-scale attention module is to enhance feature representation through grouping and parallel processing instead of the traditional channel dimensionality reduction. The specific structure is shown in Figure 3: the input feature map is divided into multiple sub-feature groups, each of which learns different semantic information. The module contains two main parallel branches: a 1 × 1 convolution branch and a 3 × 3 convolution branch. The 1 × 1 branch is responsible for capturing global relationships between channels, while the 3 × 3 branch focuses on local multi-scale spatial information. In the 1 × 1 branch, the module uses global average pooling to encode channel information and a 1 × 1 convolution to aggregate it into channel attention maps; these attention maps then reweight the original feature map to highlight important features. The 3 × 3 branch uses a 3 × 3 convolution kernel to capture local features and spatial relationships, expanding the receptive field and enriching the feature representation. The module then fuses the outputs of the two branches through cross-spatial learning: it uses global average pooling to encode global spatial information and matrix dot-product operations to aggregate spatial information at different scales, generating spatial attention maps. These spatial attention maps capture not only pairwise relationships between pixels (for example, the similarity of a pixel to its surroundings in brightness or shape) but also global context (such as the distribution of weather systems across the radar image). Specifically, “pairwise relationships” refers to the local-detail interactions the model learns by analyzing the relationship between each pixel and the others (such as whether adjacent pixels belong to the same cloud or precipitation region), while “context awareness” means the model can synthesize information from the whole image (such as the overall movement trend of clouds or the distribution pattern of strong echoes), giving each pixel more comprehensive environmental support. For example, when predicting the radar echo intensity at a given point, the model considers not only local features near that point but also the dynamics of distant clouds, improving its ability to predict the evolution of complex weather systems.
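The following PyTorch sketch illustrates one plausible realization of the module as described: group-to-batch reshaping, the parallel 1 × 1 and 3 × 3 branches, and cross-spatial fusion via matrix dot products. The class name, group count, and normalization choices are assumptions for illustration, not the exact code of the model.

```python
import torch
import torch.nn as nn

class EfficientMultiScaleAttention(nn.Module):
    """Sketch of the efficient multi-scale attention module (Figure 3)."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.g = groups
        cg = channels // groups          # channels per sub-feature group
        self.softmax = nn.Softmax(dim=-1)
        self.gn = nn.GroupNorm(cg, cg)
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)  # groups -> batch dim
        # 1x1 branch: directional pooling, shared 1x1 conv, per-direction gating.
        x_h = xg.mean(dim=3, keepdim=True)                       # (bg, cg, h, 1)
        x_w = xg.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (bg, cg, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        y_h, y_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(xg * y_h.sigmoid() * y_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local multi-scale spatial context.
        x2 = self.conv3x3(xg)
        # Cross-spatial learning: each branch's pooled global descriptor
        # attends over the other branch's pixels via a matrix dot product.
        bg, cg = x1.shape[0], x1.shape[1]
        a1 = self.softmax(x1.mean(dim=(2, 3)).reshape(bg, 1, cg))
        a2 = self.softmax(x2.mean(dim=(2, 3)).reshape(bg, 1, cg))
        v1 = x1.reshape(bg, cg, h * w)
        v2 = x2.reshape(bg, cg, h * w)
        weights = torch.matmul(a1, v2) + torch.matmul(a2, v1)    # (bg, 1, h*w)
        weights = weights.reshape(bg, 1, h, w).sigmoid()
        # Reweight each group and restore the original layout.
        return (xg * weights).reshape(b, c, h, w)
```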
3.2. Full-Dimensional Dynamic Convolution Module
In traditional convolutional neural network training, each layer usually learns only a single static convolution kernel. However, recent advances have shown that network performance can be improved by learning linear combinations of multiple convolution kernels and dynamically adjusting their weights based on the input, an approach known as dynamic convolution. It improves the accuracy of lightweight CNNs while maintaining efficient inference. However, existing studies focus only on dynamically weighting the set of convolution kernels, ignoring other important dimensions such as the spatial size of the kernels and the numbers of input and output channels. To overcome this limitation, we propose full-dimensional dynamic convolution, which learns complementary attention along all four dimensions of the convolution kernel through a multidimensional attention mechanism and a parallel processing strategy, achieving a more comprehensive and flexible dynamic convolution design. The specific architecture is shown in
Figure 4.
Continuing the definition of dynamic convolution, full-dimensional dynamic convolution can be described as follows:
$$ y = \left( a_{w1} \odot a_{f1} \odot a_{c1} \odot a_{s1} \odot w_1 + \cdots + a_{wn} \odot a_{fn} \odot a_{cn} \odot a_{sn} \odot w_n \right) * x \tag{3} $$
where $a_{wi}$ represents the attention scalar of the convolution kernel $w_i$, and $a_{si} \in \mathbb{R}^{k \times k}$, $a_{ci} \in \mathbb{R}^{c_{in}}$, and $a_{fi} \in \mathbb{R}^{c_{out}}$ represent three newly introduced attention modules along the spatial dimension, input channel dimension, and output channel dimension, respectively. These four attention modules are calculated from the input by the multi-head attention module $\pi_i(x)$. The symbol “*” represents the convolution operation, a basic operation in convolutional neural networks used to extract and transform the input data: the convolution kernel is multiplied element-by-element with the input feature map and summed to generate a new feature map, capturing spatial and channel dependencies in the data.
In full-dimensional dynamic convolution, for the convolution kernel $w_i$, $a_{si}$ assigns different attention values to the convolution parameters at the $k \times k$ spatial positions; $a_{ci}$ assigns different attention values to the convolution filters of different input channels; $a_{fi}$ assigns different attention values to the convolution filters of different output channels; and $a_{wi}$ assigns different values to the $n$ overall convolution kernels. In principle, these four types of attention are complementary: by progressively multiplying the kernel $w_i$ by the different attention modules along the position, channel, filter, and kernel dimensions, the convolution operation adapts to the input along every dimension, capturing rich contextual information and providing better performance. Full-dimensional dynamic convolution can therefore greatly improve the feature extraction capability of convolution. More importantly, full-dimensional dynamic convolution with fewer convolution kernels can achieve comparable or even better performance than CondConv and DyConv.
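As a hedged illustration of Equation (3), the sketch below computes the four complementary attentions over a bank of $n$ kernels, aggregates them into a per-sample kernel, and applies it via a grouped convolution. The head design (a shared squeeze layer feeding four small heads) is an assumption for illustration; the exact module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Sketch of full-dimensional dynamic convolution (Equation (3))."""
    def __init__(self, c_in, c_out, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.k, self.n = k, n_kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)
        mid = max(c_in // reduction, 4)
        self.gap = nn.AdaptiveAvgPool2d(1)      # global context for pi_i(x)
        self.fc = nn.Conv2d(c_in, mid, 1)
        # Four heads: spatial (k*k), input channel, output channel, kernel.
        self.attn_s = nn.Conv2d(mid, k * k, 1)
        self.attn_c = nn.Conv2d(mid, c_in, 1)
        self.attn_f = nn.Conv2d(mid, c_out, 1)
        self.attn_w = nn.Conv2d(mid, n_kernels, 1)

    def forward(self, x):
        b, c_in, h, w = x.shape
        n, c_out, k = self.n, self.weight.shape[1], self.k
        ctx = F.relu(self.fc(self.gap(x)))                   # (b, mid, 1, 1)
        a_s = torch.sigmoid(self.attn_s(ctx)).view(b, 1, 1, 1, k, k)
        a_c = torch.sigmoid(self.attn_c(ctx)).view(b, 1, 1, c_in, 1, 1)
        a_f = torch.sigmoid(self.attn_f(ctx)).view(b, 1, c_out, 1, 1, 1)
        a_w = torch.softmax(self.attn_w(ctx).view(b, n, 1, 1, 1, 1), dim=1)
        # Progressively modulate the n kernels along all four dimensions,
        # then sum them into one aggregated kernel per sample.
        weight = (a_w * a_f * a_c * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        # Grouped-convolution trick: fold the batch into the channel axis so
        # each sample is convolved with its own aggregated kernel.
        x = x.reshape(1, b * c_in, h, w)
        weight = weight.reshape(b * c_out, c_in, k, k)
        out = F.conv2d(x, weight, padding=k // 2, groups=b)
        return out.reshape(b, c_out, out.shape[-2], out.shape[-1])
```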
3.3. EOST-LSTM Unit
In this section, we describe how to embed an efficient multi-scale attention module and a full-dimensional dynamic convolution module into the ST-LSTM unit to form the network unit EOST-LSTM, as shown in
Figure 5.
The input of the EOST-LSTM unit includes the current input $X_t$, the spatial memory $M_t^{l-1}$, the temporal memory $C_{t-1}^{l}$, and the hidden state $H_{t-1}^{l}$. The current input $X_t$ serves as the input to the attention module; by computing the global context features of $X_t$, attention weights are generated that reflect the relevance of different feature regions to the current task. In this way, the network not only enhances the perception of local features but also improves the understanding of each channel. The calculation of the EOST-LSTM unit is shown in Equation (4):
wherein “Att” represents the attention module, “ODC” represents the full-dimensional dynamic convolution module, $i_t$ is the first input gate, $g_t$ is the first input modulation gate, $f_t$ is the first forget gate, $i_t'$ is the second input gate, $g_t'$ is the second input modulation gate, $f_t'$ is the second forget gate, $C_t^{l}$ is the updated temporal memory, $M_t^{l}$ is the updated spatial memory, $W$ represents the corresponding convolution kernels, and $b$ represents the corresponding bias terms. “*” represents the convolution operation, used for the in-depth analysis of data characteristics, and “⊙” represents the Hadamard product, that is, element-wise matrix multiplication, used for controlling information flow. “σ” represents the sigmoid activation function and “tanh” the hyperbolic tangent activation function, both of which increase the nonlinear expressive ability of the model.
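Since Equation (4) follows the ST-LSTM gate structure, the following is a hedged sketch of one plausible form of the unit: the attention module (“Att”) weights the current input, and the static gate convolutions can be swapped for the full-dimensional dynamic convolution (“ODC”) via the `conv` factory argument (e.g., `conv=lambda ci, co: ODConv2d(ci, co)`). The exact placement of the two modules here is an assumption based on the description above, not a verified reproduction of Equation (4).

```python
import torch
import torch.nn as nn

class EOSTLSTMCell(nn.Module):
    """Sketch of an EOST-LSTM unit: standard ST-LSTM gates, with an
    attention module on the input and pluggable gate convolutions."""
    def __init__(self, in_ch, hid_ch, k=5, att=None, conv=None):
        super().__init__()
        conv = conv or (lambda ci, co: nn.Conv2d(ci, co, k, padding=k // 2))
        self.att = att or nn.Identity()          # e.g. the EMA module above
        self.conv_x = conv(in_ch, 7 * hid_ch)    # gates driven by X_t
        self.conv_h = conv(hid_ch, 4 * hid_ch)   # gates driven by H_{t-1}^l
        self.conv_m = conv(hid_ch, 3 * hid_ch)   # gates driven by M_t^{l-1}
        self.conv_o = conv(2 * hid_ch, hid_ch)   # output gate on [C_t, M_t]
        self.conv_last = nn.Conv2d(2 * hid_ch, hid_ch, 1)
        self.hid = hid_ch

    def forward(self, x, h, c, m):
        x = self.att(x)                          # attention-weighted input
        gx = torch.split(self.conv_x(x), self.hid, dim=1)
        gh = torch.split(self.conv_h(h), self.hid, dim=1)
        gm = torch.split(self.conv_m(m), self.hid, dim=1)
        # Temporal branch: first input/modulation/forget gates update C.
        i = torch.sigmoid(gx[0] + gh[0])
        f = torch.sigmoid(gx[1] + gh[1])
        g = torch.tanh(gx[2] + gh[2])
        c_new = f * c + i * g
        # Spatial branch: second set of gates updates M.
        i2 = torch.sigmoid(gx[3] + gm[0])
        f2 = torch.sigmoid(gx[4] + gm[1])
        g2 = torch.tanh(gx[5] + gm[2])
        m_new = f2 * m + i2 * g2
        # Output gate fuses both memories into the new hidden state.
        mem = torch.cat([c_new, m_new], dim=1)
        o = torch.sigmoid(gx[6] + gh[3] + self.conv_o(mem))
        h_new = o * torch.tanh(self.conv_last(mem))
        return h_new, c_new, m_new
```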
3.4. EOST-LSTM Network Structure
Based on the previous stack structure, the model proposed in this paper adds an efficient multi-scale attention module and a full-dimensional dynamic convolutional module. The specific structure is shown in
Figure 6. To build this network, we stack four layers of EOST-LSTM units. In this structure, the spatial memory M is passed and updated between layers along a zigzag path, representing the flow of information in the spatial dimension of the network; this path is drawn as an orange line. Meanwhile, the temporal memory C is updated horizontally within each layer, representing continuity in the time dimension, shown as a blue line.
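A hedged sketch of this stacked forward pass is given below: the temporal memory C is carried horizontally within each layer, while the single spatial memory M zigzags up through the four layers at each time step and re-enters the bottom layer at the next step. Function and variable names are illustrative; an output head mapping the top hidden state back to frame channels is omitted.

```python
import torch

def eost_lstm_forward(cells, frames, hid_ch):
    """cells: list of 4 EOSTLSTMCell (layer 0 built with the frame's input
    channels, layers 1-3 with hid_ch); frames: tensor (b, T, c, h, w)."""
    b, T, _, H, W = frames.shape
    L = len(cells)
    # Per-layer hidden states and temporal memories (blue lines in Figure 6).
    h = [frames.new_zeros(b, hid_ch, H, W) for _ in range(L)]
    c = [frames.new_zeros(b, hid_ch, H, W) for _ in range(L)]
    # Single spatial memory following the zigzag path (orange line).
    m = frames.new_zeros(b, hid_ch, H, W)
    outputs = []
    for t in range(T):
        x = frames[:, t]
        for layer in range(L):
            inp = x if layer == 0 else h[layer - 1]  # hidden state feeds upward
            h[layer], c[layer], m = cells[layer](inp, h[layer], c[layer], m)
        outputs.append(h[-1])
    return torch.stack(outputs, dim=1)               # (b, T, hid_ch, H, W)
```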
3.5. Evaluation Metrics
In this experiment, we used the Critical Success Index (CSI) and Heidke Skill Score (HSS) to assess the model's ability to predict future situations from historical radar data. Specifically, we first converted the radar data into a binary format and then, at each threshold (10 dBZ, 20 dBZ, and 35 dBZ), counted the correctly predicted cases (true positives, TP; true negatives, TN) and the incorrectly predicted cases (false positives, FP; false negatives, FN). These statistics jointly account for the model's detection probability and false alarm rate, quantifying prediction accuracy. The higher the CSI and HSS values, the more accurate the prediction and the stronger the model's performance.
The specific formulas of CSI and HSS are shown in Formula (5):
$$ \mathrm{CSI} = \frac{TP}{TP + FN + FP}, \qquad \mathrm{HSS} = \frac{2\,(TP \cdot TN - FN \cdot FP)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)} \tag{5} $$
To convert the pixel values of a radar image into actual radar reflectivity in dBZ, we used a specific mathematical mapping that converts the original value of each pixel in the image to the corresponding reflectivity value (expressed in dBZ), as shown in Formula (6):
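For concreteness, the following sketch computes both scores of Formula (5) at the three thresholds used here, assuming the predictions and observations have already been converted to dBZ via Formula (6); the function name and the random example inputs are purely illustrative.

```python
import numpy as np

def csi_hss(pred_dbz, true_dbz, threshold):
    """CSI and HSS at a given reflectivity threshold (Formula (5)).

    pred_dbz, true_dbz: arrays of reflectivity values in dBZ.
    """
    p = pred_dbz >= threshold            # binarize prediction
    t = true_dbz >= threshold            # binarize observation
    tp = np.sum(p & t)                   # hits
    fp = np.sum(p & ~t)                  # false alarms
    fn = np.sum(~p & t)                  # misses
    tn = np.sum(~p & ~t)                 # correct rejections
    csi = tp / (tp + fn + fp)
    hss = (2 * (tp * tn - fn * fp)) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return csi, hss

# Example: evaluate at the three thresholds used in the experiments
# (random inputs here, purely for illustration).
pred = np.random.uniform(0, 60, (100, 100))
true = np.random.uniform(0, 60, (100, 100))
for thr in (10, 20, 35):
    print(thr, csi_hss(pred, true, thr))
```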
5. Conclusions
This study developed a new radar echo prediction model called EOST-LSTM, which utilizes deep learning techniques and improves the accuracy and efficiency of weather forecasts by incorporating global attention mechanisms and optimized LSTM units. Through experiments on the Moving MNIST dataset and the Jiangsu Province meteorological radar dataset, we confirm that the EOST-LSTM model performs well in a variety of scenarios.
In tests on the Moving MNIST dataset, the EOST-LSTM model achieved a low mean-square error and a high structural similarity index, indicating strong performance in dynamic image prediction. In tests on real weather radar data, the model also performed well: EOST-LSTM outperformed existing models such as ConvLSTM and PredRNN at all thresholds, as assessed by the Critical Success Index and Heidke Skill Score. In particular, at the high threshold, EOST-LSTM significantly improved the accuracy of predicting strong echo regions.
Although EOST-LSTM achieved remarkable results in radar echo prediction, we still need to recognize that there is room for improvement in the generalization ability of deep learning models and the prediction of extreme weather events. Future work will focus on further optimizing the model structure, improving computational efficiency, and exploring the application of the model to a wider range of weather prediction areas, such as temperature, humidity, and wind speed, to achieve more comprehensive weather prediction capability.