Article

SAM-Net: Spatio-Temporal Sequence Typhoon Cloud Image Prediction Net with Self-Attention Memory

1 School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
2 China Meteorological Administration, National Climate Center, Beijing 100081, China
3 School of Software, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(22), 4213; https://doi.org/10.3390/rs16224213
Submission received: 14 August 2024 / Revised: 21 October 2024 / Accepted: 8 November 2024 / Published: 12 November 2024
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Cloud image prediction is a spatio-temporal sequence prediction task, similar to video prediction. Spatio-temporal sequence prediction involves learning from historical data and using the learned features to generate future images. In this process, changes in both time and space are crucial for spatio-temporal sequence prediction models. However, most current models rely on stacking convolutional layers and therefore capture mainly local spatial features. To cope with the complex changes in cloud position and shape in cloud images, the prediction module of the model needs to extract both global and local spatial features from the cloud images. In addition, for irregular cloud motion, the temporal sequence prediction module should pay more attention to the spatio-temporal features between input cloud image frames and extract temporal features with long temporal dependencies, so that the spatio-temporal sequence prediction network can learn cloud motion trends more accurately. To address these issues, we introduce an innovative model called SAM-Net. The self-attention module of this model extracts intra-frame spatial features with both global and local dependencies. In addition, a memory mechanism is added to the self-attention module to extract interframe features with long temporal and spatial dependencies. Our method outperforms the PredRNN-v2 model on publicly available datasets such as MovingMNIST and KTH, and achieves the best performance in both 4-time-step and 10-time-step typhoon cloud image prediction. On a cloud dataset with 10 time steps, we observed a decrease in MSE of 180.58, a decrease in LPIPS of 0.064, an increase in SSIM of 0.351, and a significant improvement in PSNR of 5.56 compared to PredRNN-v2.

1. Introduction

Considering the powerful nonlinear modeling capabilities demonstrated by deep learning in image processing tasks, many studies have attempted to use a recurrent neural network (RNN) or a combination of convolutional neural network (CNN) and RNN to characterize and predict satellite cloud images, in order to provide better visual weather forecasts for end users. In addition, spatio-temporal sequence prediction can also be applied to disaster prevention [1] and unmanned driving [2,3]. However, early model prediction performance was not satisfactory, because satellite cloud images have strong complexity in both the time and spatial dimensions, and dynamic temporal and spatial features are difficult to capture.
At present, spatio-temporal sequence prediction has become a mainstream method. Shi et al. [4] first formulated meteorological prediction as a video prediction problem and proposed the convolutional long short-term memory network (ConvLSTM). This model combines CNN and RNN, utilizing their respective advantages in spatio-temporal sequence feature extraction to simultaneously capture temporal relationships and spatial features, and it achieved good results in radar echo extrapolation tasks. In 2017, Shi et al. [5] introduced learnable, location-variant convolution on the basis of ConvLSTM and proposed TrajGRU. This improvement makes the model more capable of capturing synchronous changes such as rotation and scaling, and it also performs better in precipitation prediction. In order to improve the nonlinear fitting ability of the network to spatio-temporal sequence features, ConvLSTM layers are often stacked. However, in simple stacked structures, spatio-temporal sequence information is transmitted horizontally along time steps and vertically along stacked layers, and the vertically transmitted spatio-temporal information is not effectively utilized. Therefore, Wang et al. [6] proposed the spatio-temporal LSTM (ST-LSTM) network. Chen et al. [7] introduced 3D convolution on the basis of ConvLSTM and proposed a hybrid prediction model based on CNN and LSTM to better capture the spatial relationships of typhoon features. Phermphoonphiphat et al. [8] proposed a ConvLSTM-based prediction model with upsampling and downsampling operations to make short- and long-term predictions of upper tropospheric circulation in the northern hemisphere, with higher accuracy than other existing models.
However, existing cloud spatio-temporal sequence prediction methods have some problems. First, they rely on stacked convolutional layers that mostly capture local spatial features; to cope with the complex changes in cloud position and shape, the prediction module of the model needs to extract both global and local spatial features within each cloud image frame. Secondly, for irregular cloud motion, the prediction module should pay more attention to the temporal sequence features between input cloud image frames and consider extracting interframe temporal features with long temporal dependencies. In this way, the prediction network can understand the cloud motion trend more accurately. Especially in multi-step cloud image prediction, the lack of long temporal features and global spatial features may lead to inaccurate predictions.
To tackle these challenges, we designed a typhoon cloud image prediction network based on self-attention memory. In order to cope with the irregular motion of cloud images, the self-attention mechanism in the self-attention memory module aggregates inputs from all positions to generate predictive features, solving the problem of not being able to capture global spatial features in stacked convolutional layers. The additional storage unit M in the self-attention memory module can preserve the long temporal and spatial features between cloud images, thereby solving the problem of inaccurate cloud image prediction due to the lack of long temporal and spatial dependencies in multi-step cloud image prediction. In summary, our contributions are as follows:
We propose a typhoon cloud image prediction model called SAM-Net, which excels at capturing global spatial features as well as long-range dependent spatio-temporal features in cloud images.
In SAM-Net, we solved the problem of complex changes in cloud position and shape, as well as irregular cloud motion in cloud prediction, through the SAM module.
We conducted multi-frame predictions using SAM-Net on publicly available datasets, including MovingMNIST and KTH, as well as cloud image datasets, demonstrating improved performance over PredRNN-v2.

2. Related Work

Next, we introduce related work on traditional methods and deep learning methods.

2.1. Traditional Methods

The traditional typhoon forecasting methods are mainly divided into numerical forecasting, statistical forecasting, and statistical dynamic forecasting.
Numerical weather prediction, first proposed by Godske et al. [9] in 1957, is one of the most commonly used objective methods for typhoon forecasting. This approach uses large-scale meteorological data collected from various observational sources (such as radar and satellites), including meteorological, physical, and terrain factors, to construct atmospheric dynamical equations. Based on the current state of the atmosphere, and given initial and boundary conditions, these equations are solved approximately on large-scale computers to produce numerical forecasts of typhoon tracks and intensities. Sanders et al. proposed the SANBAR [10] and barotropic [11] models, which were applied operationally to predict typhoon tracks one to three days ahead, laying the foundation for subsequent research. After decades of research, Qian et al. [12] discussed the impact of different initial fields and lateral boundary conditions on the accuracy of numerical typhoon forecasting. Statistical forecasting realizes long-term and short-term typhoon forecasting through the statistical analysis of large amounts of historical meteorological data, mining the linear relationships between the predictand and the predictors related to typhoon formation, and building a statistical model based on regression equations. Compared with numerical forecasting, statistical forecasting has the advantages of strong flexibility, low consumption of computing resources, and better performance in forecasting typhoon intensity. The climatology and persistence (CLIPER) method, an early classic statistical forecasting model, was proposed by Neumann [13] in 1972. Chand et al. [14] used a Bayesian probabilistic regression model to predict tropical cyclone formation in the Fiji region. Kim et al. [15] used a multi-model dynamical-statistical approach based on the APEC Climate Center (APCC) to make seasonal forecasts for typhoons in the northwest Pacific. Statistical dynamical forecasting combines numerical forecasting with statistical forecasting: numerical forecasting is used to solve the atmospheric dynamics equations, and statistical forecasting is then used to further correct the numerical prediction results, effectively reducing the prediction errors of typhoon track and intensity. This method not only makes up for the shortcomings of numerical forecasting in intensity prediction, but also overcomes the over-reliance of statistical forecasting on historical data.

2.2. Deep Learning Methods

There are mainly two types of methods based on deep learning: one is based on RNN, and the other is a hybrid model based on CNN and RNN.
Recurrent neural networks (RNNs) are widely used in video prediction. The long short-term memory (LSTM) network [16] and the gated recurrent unit (GRU) [17], both variants of the RNN, are commonly used in video prediction algorithms. In 2014, Ranzato [18] applied RNNs to the field of video prediction and proposed a new strong baseline model for unsupervised feature learning. Srivastava et al. [19] introduced LSTM into video forecasting, using LSTM as the encoder and decoder of the prediction model and highlighting that combining a convolutional neural network (CNN) with LSTM could improve prediction accuracy. Chen et al. [1] used an attention mechanism and GRU to accurately predict floods.
Shi et al. [4] drew inspiration from Srivastava's ideas and combined CNN with LSTM, proposing ConvLSTM for image prediction, designing an encoding-forecasting (EF) structure based on the transmission of information flow in the network, and applying this prediction model to precipitation prediction with good results. Afterwards, Shi et al. [5] further optimized the EF structure and proposed ConvGRU and TrajGRU, which are more lightweight than ConvLSTM; by introducing learnable, location-variant convolution, TrajGRU is theoretically more suitable for predicting irregular motion characteristics such as precipitation data. Deng et al. [20] used a refined pyramid grid to increase the receptive field of image feature extraction. From 2017 to 2019, Wang et al. [6,21,22] integrated spatio-temporal memory units, gradient highway units, and 3D convolution and proposed PredRNN [6], PredRNN++ [21], and E3D-LSTM [22], making significant contributions to the development of video prediction technology. In 2022, Wang et al. [23] proposed PredRNN-v2, which builds on PredRNN and uses a memory decoupling loss to prevent memory cells from learning redundant features. In the same year, Gao et al. [24] released the SimVP video prediction model, and in 2023, Lian et al. [25] proposed the SCSTque model, which achieved prediction on ordinary cloud images.
In recent years, with the popularity of generative adversarial networks (GANs) [26,27], attention mechanisms [28], and self-attention mechanisms [29,30] in deep learning, Xu et al. [31] and Lin et al. [32] have respectively integrated GANs and self-attention mechanisms into LSTM, achieving good results in video prediction. However, the above video prediction algorithms require a significant amount of memory to preserve as much image information as possible, which greatly limits their application scope. Moreover, due to the instability and nonlinearity of cloud movement, predicting cloud motion with these deep learning-based methods remains relatively difficult. In 2017, Hong et al. [33] combined an autoencoder with ConvLSTM and incorporated skip connections between the encoder and decoder to extract the spatio-temporal characteristics of cloud image sequences, but the accuracy of their prediction results could be improved. In 2021, Cai [34] combined 3D convolution with U-Net to form a cloud image prediction module and used a GAN to train the prediction network, improving the prediction model's ability to extract spatio-temporal sequence features.
In summary, existing methods are unable to address the issue of irregular cloud image motion in cloud images. Therefore, it is necessary to consider extracting spatial features from both global and local dependencies. In addition, due to the complex shape and spatio-temporal variations of cloud images, it is necessary to pay more attention to the long temporal dependencies between cloud images when extracting cloud image features. In this paper, we introduce the SAM module, which can effectively extract spatial features from both global and local perspectives. Additionally, with the introduction of additional memory units, we can effectively focus on the long temporal and spatial features between cloud images.

3. Methods

In this section, we provide a detailed introduction to our proposed SAM-Net. The network structure is shown in Figure 1. The network is mainly divided into two parts: the light blue and red dashed boxes form the training part, and the orange part is the prediction part of the model. In the training part, eight frames of cloud images are read in each batch; the first four frames are used to learn the features of the cloud images, and the last four frames are the ground truth used as a comparison for the predicted cloud images. The size of the input cloud images is 128 × 128 × 1. We seamlessly embed the self-attention memory module (SAM) into PredRNN-v2. In the prediction part, cloud images of size 128 × 128 × 1 are input into the trained model to obtain the predicted cloud images.

3.1. Self-Attention Module

The first row of Figure 2 shows the process of the standard self-attention module. $H_t$ is the input feature map, which is mapped to different feature spaces, where $Q_h$, $K_h$, and $V_h$ are, respectively, the query, key, and value. The specific formulas for the self-attention module are as follows:
$$Q_h = W_q \ast H_t \in \mathbb{R}^{\hat{C} \times N}$$
$$K_h = W_k \ast H_t \in \mathbb{R}^{\hat{C} \times N}$$
$$V_h = W_v \ast H_t \in \mathbb{R}^{C \times N}$$
In the above formulas, $W_q$, $W_k$, and $W_v$ are the weights of $1 \times 1$ convolutions, $C$ and $\hat{C}$ are the numbers of channels, and $N = H \times W$. The similarity score of each pair of positions is then calculated through matrix multiplication:
$$e = Q_h^{T} K_h \in \mathbb{R}^{N \times N}$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{N} \exp(e_{i,k})}, \quad i, j \in \{1, 2, \ldots, N\}$$
where $e_{i,j}$ represents the similarity between the $i$-th and $j$-th positions, $e_{i,j} = (H_{t,i}^{T} W_q^{T})(W_k H_{t,j})$, $H_{t,i}$ and $H_{t,j}$ are feature vectors of size $C \times 1$, and the similarity scores are then normalized by the SoftMax function.
$$Z_i = \sum_{j=1}^{N} \alpha_{i,j} (W_v H_{t,j})$$
The feature at the $i$-th position is obtained by weighting all positions, where $W_v H_{t,j} \in \mathbb{R}^{C \times 1}$ is the $j$-th column of the value $V_h$.
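To make the computation concrete, the following is a minimal PyTorch sketch of this standard self-attention step over a feature map; the class and layer names are illustrative assumptions, not the authors' released code.

```python
# A minimal PyTorch sketch of the standard self-attention step above; the class
# and layer names are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn


class SelfAttention2D(nn.Module):
    def __init__(self, in_channels: int, hidden_channels: int):
        super().__init__()
        # 1x1 convolutions play the role of W_q, W_k, and W_v.
        self.w_q = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        self.w_k = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        self.w_v = nn.Conv2d(in_channels, in_channels, kernel_size=1)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        b, c, height, width = h_t.shape
        n = height * width
        q = self.w_q(h_t).view(b, -1, n)           # (B, C_hat, N)
        k = self.w_k(h_t).view(b, -1, n)           # (B, C_hat, N)
        v = self.w_v(h_t).view(b, c, n)            # (B, C, N)
        # Pairwise similarity e = Q^T K, followed by the SoftMax normalization
        # alpha_{i,j} = exp(e_{i,j}) / sum_k exp(e_{i,k}).
        e = torch.bmm(q.transpose(1, 2), k)        # (B, N, N)
        alpha = torch.softmax(e, dim=-1)
        # Weighted aggregation over all N positions: Z_i = sum_j alpha_{i,j} V_{:,j}.
        z = torch.bmm(v, alpha.transpose(1, 2))    # (B, C, N)
        return z.view(b, c, height, width)
```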

3.2. Self-Attention Memory Module

As shown in the second row of Figure 2, introducing an additional spatio-temporal memory cell M into the attention mechanism can capture long-range dependencies between input cloud frames, making it easier to understand the motion trends of the cloud. The self-attention memory module receives two inputs: one is the spatio-temporal memory $M_{t-1}$ from the previous step, and the other is the feature $H_t$ of the current time step. The entire module is divided into three parts: feature aggregation, memory update, and output.
$$e_m = Q_h^{T} K_m \in \mathbb{R}^{N \times N}$$
Similar to Formula (4), the similarity score between the input and the memory is calculated by multiplying the query $Q_h$ with the key $K_m = W_{mk} M_{t-1}$.
$$\alpha_{m;i,j} = \frac{\exp(e_{m;i,j})}{\sum_{k=1}^{N} \exp(e_{m;i,k})}, \quad i, j \in \{1, 2, \ldots, N\}$$
Similar to Formula (5), the SoftMax function is used to calculate the weights of the aggregated features.
$$Z_{m;i} = \sum_{j=1}^{N} \alpha_{m;i,j} V_{m;j} = \sum_{j=1}^{N} \alpha_{m;i,j} W_{mv} M_{t-1;j}$$
The feature $Z_m$ at the $i$-th position is calculated as a weighted sum over all $N$ positions of the value $V_m$, where $M_{t-1;j}$ is the $j$-th column of the memory; the aggregated feature is then $Z = W_z [Z_h; Z_m]$.
$$i_t = \sigma(W_{m;zi} \ast Z + W_{m;hi} \ast H_t + b_{m;i})$$
$$g_t = \tanh(W_{m;zg} \ast Z + W_{m;hg} \ast H_t + b_{m;g})$$
$$M_{t+1}^{l} = (1 - i_t) \odot M_t^{l} + i_t \odot g_t$$
A gating mechanism updates the memory $M$ to capture long-range dependencies in the spatio-temporal domain. The aggregated feature $Z$ and the input $H_t$ generate the input gate $i_t$ and the fused feature $g_t$, and $1 - i_t$ is used instead of a separate forget gate to reduce the number of parameters.
$$o_t = \sigma(W_{m;zo} \ast Z + W_{m;ho} \ast H_t + b_{m;o})$$
$$H_{t+1}^{l} = o_t \odot M_{t+1}^{l}$$
The output feature $H_{t+1}^{l}$ is the Hadamard product of the output gate $o_t$ and the updated memory $M_{t+1}^{l}$.
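Building on the sketch above, the feature aggregation, gated memory update, and output could be organized as follows; again this is a hedged PyTorch sketch with assumed layer names (for brevity, a single grouped convolution produces the three gates), not the authors' implementation.

```python
# Sketch of the SAM feature aggregation, gated memory update, and output; it reuses
# the SelfAttention2D class from the previous sketch, and the grouped gate
# convolution is an assumption made for brevity, not the authors' implementation.
import torch
import torch.nn as nn


class SelfAttentionMemory(nn.Module):
    def __init__(self, channels: int, hidden_channels: int):
        super().__init__()
        self.attn_h = SelfAttention2D(channels, hidden_channels)  # Z_h from H_t
        self.w_q = nn.Conv2d(channels, hidden_channels, 1)        # query from H_t
        self.w_mk = nn.Conv2d(channels, hidden_channels, 1)       # key from memory M
        self.w_mv = nn.Conv2d(channels, channels, 1)              # value from memory M
        self.w_z = nn.Conv2d(2 * channels, channels, 1)           # fuse Z = W_z [Z_h; Z_m]
        # One convolution produces the input gate i_t, candidate g_t, and output gate o_t.
        self.gates = nn.Conv2d(2 * channels, 3 * channels, 1)

    def forward(self, h_t: torch.Tensor, m_prev: torch.Tensor):
        b, c, height, width = h_t.shape
        n = height * width
        # Attention between the current feature (query) and the memory (key/value).
        q = self.w_q(h_t).view(b, -1, n)
        k_m = self.w_mk(m_prev).view(b, -1, n)
        v_m = self.w_mv(m_prev).view(b, c, n)
        alpha_m = torch.softmax(torch.bmm(q.transpose(1, 2), k_m), dim=-1)
        z_m = torch.bmm(v_m, alpha_m.transpose(1, 2)).view(b, c, height, width)
        z_h = self.attn_h(h_t)
        z = self.w_z(torch.cat([z_h, z_m], dim=1))
        # Gated update M_new = (1 - i_t) * M_prev + i_t * g_t, with 1 - i_t as forget gate.
        i_t, g_t, o_t = torch.chunk(self.gates(torch.cat([z, h_t], dim=1)), 3, dim=1)
        i_t, g_t, o_t = torch.sigmoid(i_t), torch.tanh(g_t), torch.sigmoid(o_t)
        m_new = (1.0 - i_t) * m_prev + i_t * g_t
        # Output: the new hidden state is the output-gated updated memory.
        h_new = o_t * m_new
        return h_new, m_new
```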

3.3. SAM-Net Cell

The SAM-Net cell is composed of the self-attention memory module and PredRNN-v2 module. The specific cellular structure is shown in Figure 3, and the formula for the model is as follows:
$$g_t = \tanh(W_{xg} \ast X_t + W_{hg} \ast H_{t-1}^{l})$$
$$i_t = \sigma(W_{xi} \ast X_t + W_{hi} \ast H_{t-1}^{l})$$
$$f_t = \sigma(W_{xf} \ast X_t + W_{hf} \ast H_{t-1}^{l})$$
$$C_t^{l} = f_t \odot C_{t-1}^{l} + i_t \odot g_t$$
In the above formulas, $g_t$, $i_t$, and $f_t$ respectively represent the input modulation gate, input gate, and forget gate, $C_t^{l}$ is the memory state, $H_t^{l}$ is the hidden state, and $W$ and $b$ represent the weights and biases, respectively.
$$g_t' = \tanh(W_{xg}' \ast X_t + W_{mg} \ast M_t^{l-1})$$
$$i_t' = \sigma(W_{xi}' \ast X_t + W_{mi} \ast M_t^{l-1})$$
$$f_t' = \sigma(W_{xf}' \ast X_t + W_{mf} \ast M_t^{l-1})$$
$$M_t^{l} = f_t' \odot M_t^{l-1} + i_t' \odot g_t'$$
$M_t^{l}$ is the spatio-temporal memory flow, and $g_t'$, $i_t'$, and $f_t'$ are the corresponding gates that control how it is updated as it passes through the cell.
$$o_t = \sigma(W_{xo} \ast X_t + W_{ho} \ast H_{t-1}^{l} + W_{co} \ast C_t^{l} + W_{mo} \ast M_t^{l})$$
$$H_t^{l} = o_t \odot \tanh(W_{1 \times 1} \ast [C_t^{l}, M_t^{l}])$$
$$H_{t+1}^{l}, M_{t+1}^{l} = \mathrm{SAM}(H_t^{l}, M_t^{l})$$
In the above formulas, $\sigma$ is the sigmoid activation function, $\ast$ denotes the convolution operator, and $\odot$ denotes the Hadamard product. $o_t$ is the output gate, $X_t$ is the input at the current time step, and SAM denotes the self-attention memory module. The detailed structure of the self-attention memory module is shown in Figure 2, and the light green part of Figure 3 indicates where this module is embedded in the cell.
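A compact sketch of one SAM-Net cell step, combining the PredRNN-v2-style gates above with the SelfAttentionMemory module from the previous sketch, might look as follows; the grouping of convolutions and all layer names are illustrative assumptions rather than the authors' code.

```python
# A compact sketch of one SAM-Net cell step: PredRNN-v2-style gates followed by the
# SelfAttentionMemory module from the previous sketch. The way the convolutions are
# grouped and all layer names are illustrative assumptions.
import torch
import torch.nn as nn


class SAMNetCell(nn.Module):
    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        # Gates driven by X_t and the previous hidden state H_{t-1}^l.
        self.conv_xh = nn.Conv2d(in_channels + hidden_channels, 3 * hidden_channels,
                                 kernel_size, padding=pad)
        # Gates driven by X_t and the spatio-temporal memory M_t^{l-1}.
        self.conv_xm = nn.Conv2d(in_channels + hidden_channels, 3 * hidden_channels,
                                 kernel_size, padding=pad)
        # Output gate sees X_t, H_{t-1}^l, C_t^l, and M_t^l.
        self.conv_o = nn.Conv2d(in_channels + 3 * hidden_channels, hidden_channels,
                                kernel_size, padding=pad)
        self.conv_last = nn.Conv2d(2 * hidden_channels, hidden_channels, 1)
        self.sam = SelfAttentionMemory(hidden_channels, hidden_channels // 8)

    def forward(self, x_t, h_prev, c_prev, m_prev):
        # Temporal memory C_t^l.
        g, i, f = torch.chunk(self.conv_xh(torch.cat([x_t, h_prev], dim=1)), 3, dim=1)
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        # Spatio-temporal memory M_t^l.
        g2, i2, f2 = torch.chunk(self.conv_xm(torch.cat([x_t, m_prev], dim=1)), 3, dim=1)
        m_t = torch.sigmoid(f2) * m_prev + torch.sigmoid(i2) * torch.tanh(g2)
        # Hidden state from the output gate over both memories.
        o_t = torch.sigmoid(self.conv_o(torch.cat([x_t, h_prev, c_t, m_t], dim=1)))
        h_t = o_t * torch.tanh(self.conv_last(torch.cat([c_t, m_t], dim=1)))
        # Refine the hidden state and memory with the self-attention memory module.
        h_t, m_t = self.sam(h_t, m_t)
        return h_t, c_t, m_t
```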

4. Experiments

4.1. Datasets

Given that the MovingMNIST and KTH action datasets are standard benchmarks for video spatio-temporal sequence prediction, we assessed the performance of our SAM-Net on these datasets, in addition to the cloud image dataset, to validate its capability in spatio-temporal sequence forecasting.
The MovingMNIST dataset contains 1000 video sequences, each consisting of 20 frames. In each video sequence, two numbers move independently within the frame, and these numbers are placed in random locations. Numbers are often interlaced, overlapping, and initialized at random speed, and bounce from the edge of the image at a certain angle. The image size of the dataset is 64 × 64 × 1 , and each sequence contains 20 frames, of which 10 frames are used for input and 10 frames are used for prediction. The experimental dataset has a total of 280,000 frames, of which the training dataset contains 20,000 frames. The verification set contains 40,000 frames of images, and the test set contains 40,000 frames of images.
The KTH action dataset was collected from 25 subjects in four different scenarios, covering indoor and outdoor scene changes and different clothing, with each subject repeating six types of actions (walking, jogging, running, boxing, hand waving, and hand clapping). All video clips were shot with a static camera against a homogeneous background at a constant frame rate of 25 fps, with an average length of 4 s. All videos were divided into a training set (subjects 1–16) and a testing set (subjects 17–25). The model predicts the next 10 frames by learning 10 consecutive frames; the training set contains 108,717 sequences and the testing set contains 4086 sequences.
The typhoon cloud image dataset was taken from the geostationary satellite Himawari-8. Himawari-8 is the world's first new-generation geostationary meteorological satellite, with 3 visible, 3 near-infrared, and 10 infrared channels. The study area covers latitudes from 3°51′N to 53°33′N and longitudes from 80°E to 135°05′E; this field of view covers the southeast coast of China, part of the western Pacific Ocean, and a small part of land. The typhoon cloud image data used in this experiment include Typhoon Nida from 20:00 on 29 July 2016 to 20:00 on 2 August 2016, Typhoon Hato from 2:00 on 20 August 2017 to 16:00 on 23 August 2017, Typhoon Mangkhut from 23:00 on 12 September 2018 to 15:00 on 17 September 2018, and Typhoon Higos from 17:00 on 16 August 2020 to 12:00 on 19 August 2020. The spatial resolution of these images is 2 km × 2 km. The dataset consists of 1200 sequences from the water vapor channel, containing 840 training sequences, 180 validation sequences, and 180 test sequences. Each sequence contains eight frames, of which the first four frames are used as input and the last four frames are used as ground truth. To save computing resources, each image was converted into a 128 × 128 × 1 npz file, and the sampling interval between frames was 1 h.
The Northwest Pacific cloud image dataset is also from Himawari-8. In this experiment, we selected cloud image data from 0:00 on 1 April 2024 to 24:00 on 30 April 2024 and from 0:00 on 1 May 2024 to 24:00 on 31 May 2024, sampled once per hour, resulting in a total of 1464 samples. Due to limitations in computing resources, the original images were resized to 128 × 128 pixels with a spatial resolution of 20 km × 20 km. In total, 1024 cloud images from this dataset were used as training samples, 220 as validation samples, and 220 as test samples. The first four frames of each sequence were used as inputs for the training model, and the last four frames were used as ground truth.
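As a rough illustration of how such fixed-length sequences could be assembled from hourly frames, a sliding-window builder along the lines below would produce the 4-input/4-target split described above; the array layout and function name are assumptions, not the authors' data pipeline.

```python
# A rough sketch (not the authors' data pipeline) of slicing hourly cloud frames,
# stored as one (T, 128, 128, 1) array, into 8-frame sequences: 4 input frames
# and 4 ground-truth frames per sequence. Function and argument names are assumed.
import numpy as np


def build_sequences(frames: np.ndarray, seq_len: int = 8, n_input: int = 4):
    """Return (inputs, targets) from overlapping sliding windows over the frames."""
    inputs, targets = [], []
    for start in range(frames.shape[0] - seq_len + 1):
        window = frames[start:start + seq_len]
        inputs.append(window[:n_input])    # first 4 frames feed the model
        targets.append(window[n_input:])   # last 4 frames are ground truth
    return np.stack(inputs), np.stack(targets)
```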

4.2. Implementation

Figure 1 illustrates the flowchart of the typhoon cloud image prediction model. The process is divided into two main components: the training process and the testing process. During the training phase, each batch of input data comprises eight cloud images; the first four frames are fed to the model and the last four frames serve as the ground truth for the four-step predictions. The dimensions of the training and test data are (840, 1, 128, 128) and (180, 1, 128, 128), respectively, representing the total number of data samples, the number of data channels, and the height and width of each sample. To ensure a fair comparison with the PredRNN-v2 model, the experimental setup on the MovingMNIST dataset mirrored that of PredRNN-v2. We utilized the ADAM optimizer, set the batch size to 8, and used a learning rate of 0.0001. Each hidden state within the model has a channel dimension of 128. The convolution kernel size within each cell unit is 5 × 5. The mean squared error (MSE) serves as the loss function, and training is halted after 80,000 iterations.
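The training setup described above could be expressed, in outline, as the following PyTorch sketch; the function and constant names are illustrative and only restate the hyperparameters quoted in the text.

```python
# An outline of the training step and hyperparameters quoted above, assuming a
# PyTorch model; the names below are illustrative, not the authors' code.
import torch
import torch.nn as nn

BATCH_SIZE = 8
LEARNING_RATE = 1e-4        # 1e-3 is used for the typhoon cloud image dataset
HIDDEN_CHANNELS = 128       # channels of each hidden state
KERNEL_SIZE = 5             # 5x5 convolutions inside each cell unit
MAX_ITERATIONS = 80_000     # 10,000 for the typhoon cloud image dataset


def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """One MSE-supervised update on a batch of input/target frame sequences."""
    optimizer.zero_grad()
    preds = model(inputs)                           # predicted future frames
    loss = nn.functional.mse_loss(preds, targets)   # MSE loss as in the paper
    loss.backward()
    optimizer.step()
    return loss.item()
```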
For the typhoon cloud image dataset, after obtaining the raw data we first convert the .tif-format files into .jpg-format images, remove incomplete data, and select band 8; the images are then normalized to the range [0, 1] and, considering the limitation of computing resources, resized to 128 × 128 before finally being converted into .npz format as the training and testing dataset. We again used ADAM as the optimizer, with the batch size set to 8 and the learning rate set to 0.001. The number of channels per hidden state was set to 128, the convolution kernel size per cell unit was set to 5 × 5, and MSE was used as the loss function. The training process is stopped after 10,000 iterations. A similar preprocessing method is used for the Northwest Pacific data: the cloud image data obtained from the Northwest Pacific are video data, so we extracted each frame of cloud imagery from the video, resized the cloud images to 128 × 128, and finally converted them into npz-format data. The other training parameters were the same as those for the typhoon cloud image data.
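A hedged sketch of the per-frame preprocessing chain is given below; it assumes the band-8 array has already been read into NumPy (the actual band extraction from the .tif files is not shown), and the function names are illustrative.

```python
# A hedged sketch of the per-frame preprocessing described above. It assumes the
# band-8 array has already been read into NumPy (band extraction from the .tif
# files is not shown), and the function names are illustrative.
import numpy as np
from PIL import Image


def preprocess_frame(band8: np.ndarray) -> np.ndarray:
    """Normalize a band-8 image to [0, 1] and resize it to 128 x 128 x 1."""
    lo, hi = float(band8.min()), float(band8.max())
    norm = (band8 - lo) / (hi - lo + 1e-8)                    # scale to [0, 1]
    img = Image.fromarray((norm * 255).astype(np.uint8))
    resized = np.asarray(img.resize((128, 128))) / 255.0      # back to [0, 1]
    return resized[..., np.newaxis].astype(np.float32)        # 128 x 128 x 1


def save_dataset(frames, path: str = "typhoon_clouds.npz") -> None:
    """Stack preprocessed frames and store them in the .npz format used for training."""
    np.savez_compressed(path, clips=np.stack([preprocess_frame(f) for f in frames]))
```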

4.3. Evaluating Indicator

For the above experiments, we used evaluation metrics that are widely used in image evaluation: mean square error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM) [35], learned perceptual image patch similarity (LPIPS) [36], and the Pearson spatial correlation coefficient (PCCS). MSE measures the absolute pixel error, PSNR measures the degree of similarity between two images, SSIM measures the similarity of structural information in the spatial neighborhood, LPIPS is an evaluation index based on deep features that is more in line with human perception, and PCCS measures the degree of correlation between two variables. The larger the values of SSIM, PCCS, and PSNR, or the smaller the values of MSE and LPIPS, the better the prediction performance of the model.
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
where $\hat{y}_i$ and $y_i$ represent the predicted and true values at the sampling location, and $N$ represents the number of pixels in the test sample.
$$\mathrm{PSNR} = 10 \times \lg \frac{255^2}{\mathrm{MSE}}$$
where MSE is the mean square error defined above.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
Here, $x$ and $y$ are the two images to be compared, $\mu_x$ and $\mu_y$ are their means, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$ and $C_2$ are constants used to avoid a zero denominator. The values of the constants depend on the pixel value range: generally $C_1 = (K_1 \times L)^2$ and $C_2 = (K_2 \times L)^2$, where $L$ is the range of pixel values (in this study, pixel values were between 0 and 1) and $K_1$ and $K_2$ are constants less than 1. In this experiment, we set $K_1 = 0.01$ and $K_2 = 0.03$.
$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot (\hat{y}_{hw}^{l} - \hat{y}_{0hw}^{l}) \right\|_2^2$$
In the LPIPS formula, $d$ is the distance between $x_0$ and $x$. Feature stacks are extracted from $L$ layers and unit-normalized along the channel dimension; the activations are scaled channel-wise by the vector $w_l$, the $\ell_2$ distance is computed, and the result is averaged over space and summed over channels.
$$\mathrm{PCCS} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$$
where $x_i$ and $y_i$ are the $i$-th pair of observations in the sample data, and $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, respectively.
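For reference, the pixel-based metrics above can be computed with a short NumPy/scikit-image sketch such as the following; LPIPS is omitted because it requires a pretrained perceptual network (e.g., the lpips package), and the peak value is taken as 1 because the pixel values here lie in [0, 1] (the PSNR formula above uses 255 for 8-bit images).

```python
# A short NumPy/scikit-image sketch of the per-frame metrics above. LPIPS is omitted
# because it needs a pretrained perceptual network (e.g., the lpips package); pixel
# values are assumed to lie in [0, 1], so the PSNR peak value is 1 rather than 255.
import numpy as np
from skimage.metrics import structural_similarity


def evaluate_frame(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compute MSE, PSNR, SSIM, and PCCS between a predicted and a true 2-D frame."""
    mse = float(np.mean((pred - truth) ** 2))
    psnr = 10.0 * np.log10(1.0 / mse) if mse > 0 else float("inf")
    ssim = float(structural_similarity(truth, pred, data_range=1.0))
    # Pearson spatial correlation coefficient over the flattened images.
    pccs = float(np.corrcoef(pred.ravel(), truth.ravel())[0, 1])
    return {"MSE": mse, "PSNR": psnr, "SSIM": ssim, "PCCS": pccs}
```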

5. Result and Analysis

5.1. MovingMNIST Dataset

In the MovingMNIST dataset, employing the same hyperparameters and completing 80,000 training iterations, our SAM-Net outperforms PredRNN-v2. As shown in Table 1, our proposed model achieves an MSE lower by 0.1, an SSIM higher by 0.02, and an LPIPS lower by 0.003 compared to PredRNN-v2. In the MovingMNIST experiment shown in Figure 4, there is almost no difference in prediction performance between our proposed model and PredRNN for the simple movement of the digits "8" and "5". However, in Figure 5, for the slightly more complex spatio-temporal case where the digits "8" and "0" overlap and then separate, the self-attention memory module embedded in the cell unit makes a clear difference between the two models: after the digits "8" and "0" separate, the digit "8" is difficult to recognize in the images predicted by PredRNN, while the images predicted by SAM-Net remain clearly distinguishable in both digit identity and trajectory.

5.2. KTH Dataset

On the KTH action dataset, our proposed SAM-Net demonstrates superior performance compared to PredRNN-v2. As shown in Table 2, our model shows improved metric results, with a 1.13 lower MSE, 0.016 higher SSIM, 0.046 higher PSNR, and 0.03 lower LPIPS. As depicted in Figure 6, which illustrates the experiment on the KTH dataset, we observe the experimenter’s arms transitioning from a closed position on the head to an open and extended state. The PredRNN-v2 model’s predictions show the experimenter’s arms remaining closed on the head. In contrast, our proposed SAM-Net accurately captures this transition, with the predicted arms closely resembling the ground truth, indicating a clear progression from a closed to an open and flat state.

5.3. Typhoon Cloud Images

5.3.1. Prediction of 4 Timesteps

As shown in the red box in Figure 7, the white cloud layer generated by our SAM-Net model is more prominent compared to PredRNN-v2. In terms of the overall shape of the typhoon, SAM-Net also gives more accurate predictions. As seen in Table 3, our proposed model demonstrates improvements over the PredRNN-v2 model, with a decrease of 2.39 in MSE, an improvement of 0.025 in SSIM, an improvement of 0.63 in PSNR, and a decrease of 0.005 in LPIPS. Given the relatively short duration of four timesteps, both PredRNN-v2 and SAM-Net capture similar amounts of information in terms of local dependencies; consequently, the prediction images and results of the two models are nearly identical. The visualization of the results of the four indicators is shown in Figure 8.

5.3.2. Prediction of 10 Timesteps

As seen in Figure 9 and Table 3, our SAM-Net predicts cloud images more accurately compared to PredRNN-v2, especially in the white clouds marked with red boxes, where our model's predictions are clearer and more accurate. It is evident that our proposed SAM-Net offers more pronounced advantages and predicts changes in the cloud images more accurately. The integration of the self-attention memory module allows the acquisition of global spatial context within a single layer, which is more effective than traditional convolution operations for capturing global information. The self-attention memory (SAM) leverages a feature aggregation mechanism that calculates pairwise similarity scores, fusing current and memory features to achieve a global receptive field. This capability enables SAM to capture long-range dependencies in both spatial and temporal dimensions, thus enhancing the model's performance in long-term spatio-temporal prediction tasks. This is particularly beneficial in scenarios involving large-scale, irregular, and dispersed cloud distributions, as well as in the presence of substantial noisy data. The visualization of the results of the four indicators is shown in Figure 10.

5.4. Northwest Pacific Cloud Image Datasets

From Figure 11, it can be seen that our SAM-Net is applicable not only to typhoon cloud images with distinct features, but also to ordinary cloud images with scattered cloud features. The areas marked with red boxes show that the two clouds gradually aggregate, while the gap between them gradually dissipates and becomes smaller.

5.5. Analysis of Ablation Experiment Results

In order to further investigate the performance contribution of the self-attention memory module, we conducted ablation experiments on the four-timestep cloud image dataset to analyze the memory cell M of the self-attention memory module and the self-attention mechanism, and thus their necessity and effectiveness in this work. In the ablation experiments, comparisons were made by controlling variables and adding the components one at a time. The results are shown in Table 4. The introduction of the memory cell and the self-attention mechanism each improves all four evaluation indicators. From the table, it can be seen that the introduction of the memory cell yields a stronger improvement on all indicators than the self-attention mechanism alone, because it can extract long-range spatio-temporal dependencies, which is very important for SAM-Net.

6. Discussion

This article proposes an innovative typhoon prediction model based on PredRNN-v2 and a self-attention memory mechanism. In the cloud prediction experiments of this article, we used typhoon cloud image data from four years provided by Himawari-8, covering Typhoons Nida, Hato, Mangkhut, and Higos, as well as ordinary cloud image data of the northwest Pacific from April and May 2024. The satellite data underwent preprocessing steps including format conversion, spatio-temporal matching, missing value processing, image normalization, and image resizing. After training the model, we evaluated it on the 2020 Typhoon Higos dataset. Typhoon Higos formed near Luzon Island, Philippines, at 12:00 on 16 August 2020, then strengthened and moved northwest; it made landfall along the coast of Guangdong, China, at 6:00 on 19 August and dissipated at 8:00 on 19 August. The cloud image data used in this study are the cloud images after the typhoon entered China's territorial waters on 17 August. Quantitative evaluation shows that our proposed SAM-Net model has better accuracy than other models in cloud prediction at 4 and 10 time steps. Especially at 10 time steps, we achieved significant improvements compared to PredRNN-v2. Compared with the 10-time-step setting, the 4-time-step setting shows better absolute performance, highlighting the potential of our model in short-term prediction. Thanks to the introduction of the self-attention memory mechanism, we achieved good performance on short-term typhoon cloud images, indicating that this mechanism can capture the spatio-temporal sequence features between typhoon cloud image frames as well as the global and local spatial dependency features within each frame. However, as the time steps increase, these features are gradually lost, resulting in larger prediction errors compared to the four-timestep typhoon cloud image prediction. Therefore, in future research, we will consider introducing new loss functions or adding super-resolution algorithms to image prediction to improve the accuracy of long-term typhoon cloud image prediction and the clarity of the generated cloud images. In addition, in future research, the weights of cloud image data during the training phase will be adjusted according to the different stages of typhoon development; for example, features from the formation, intensification, weakening, and dissipation stages will be extracted separately to improve the predictive ability and interpretability of the model. These results will be reported in the near future. To verify the generalization ability of our proposed model, we also conducted experiments on a regular cloud image dataset. The experimental results show that our model is not only applicable to typhoons, but also to ordinary cloud images, and has good generalization performance.
In addition, a major challenge in predicting typhoon cloud imagery is resource consumption, especially when dealing with larger and more complex typhoon cloud imagery data. How to reduce resource consumption and train models more quickly and effectively is an important issue. Therefore, in future work, we will consider how to reduce the consumption of computing resources without affecting performance, such as reducing the number of neurons, weight quantization, and other methods, to generate a lightweight framework. In summary, how to make our model lighter and more efficient will be one of the focuses of our future research directions.

7. Conclusions

In this article, we introduce the SAM-Net model for spatio-temporal prediction of images, specifically targeting the challenges posed by cloud images. To cope with the complex changes in cloud position and shape in cloud images, the prediction module of the model needs to be able to extract both global and local spatial features from the cloud images. In addition, for irregular cloud motion, more attention should be paid to the spatio-temporal sequence features between input cloud image frames in the temporal sequence prediction module, considering the extraction of temporal features with long temporal dependencies, so that the spatio-temporal sequence prediction network can learn cloud motion trends more accurately. To address these issues, we have integrated a self-attention memory module into the PredRNN-v2 framework, enabling the capture of remote spatial and temporal features. This enhancement allows the SAM-Net to learn not only temporal patterns but also spatial correlations through the self-attention mechanism.
Our experimental results indicate that the performance of PredRNN-v2 is not satisfactory in predicting typhoon cloud images. Thus, we proposed an improved prediction model based on PredRNN-v2 and evaluated it on the MovingMNIST and KTH action datasets, as well as a typhoon cloud image dataset. The results demonstrate an improvement over the PredRNN-v2 model.

Author Contributions

Conceptualization, Y.R. and J.Y.; methodology, Y.R. and J.Y.; software, Y.R. and J.Y.; validation, Y.R. and J.Y.; formal analysis, Y.R. and J.Y.; investigation, Y.R. and J.Y.; resources, F.X.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, Y.R. and J.Y.; visualization, J.Y.; supervision, X.W.; project administration, Y.R. and X.W.; funding acquisition, R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding support from the National Science and Technology Major Project (2022ZD0119502).

Data Availability Statement

The MovingMNIST and KTH action datasets can be downloaded at https://github.com/thuml/predrnn-pytorch. The typhoon cloud image dataset can be downloaded at https://www.eorc.jaxa.jp/ptree. The Northwest Pacific image dataset can be downloaded at http://agora.ex.nii.ac.jp/digital-typhoon/region/pacific.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, C.; Jiang, J.; Zhou, Y.; Lv, N.; Liang, X.; Wan, S. An edge intelligence empowered flooding process prediction using Internet of things in smart city. J. Parallel Distrib. Comput. 2022, 165, 66–78. [Google Scholar] [CrossRef]
  2. Xiao, T.; Chen, C.; Pei, Q.; Jiang, Z.; Xu, S. SFO: An adaptive task scheduling based on incentive fleet formation and metrizable resource orchestration for autonomous vehicle platooning. IEEE Trans. Mob. Comput. 2023, 23, 7695–7713. [Google Scholar] [CrossRef]
  3. Deng, X.; Liu, Y.; Zhu, C.; Zhang, H. Air–ground surveillance sensor network based on edge computing for target tracking. Comput. Commun. 2021, 166, 254–261. [Google Scholar] [CrossRef]
  4. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28. [Google Scholar]
  5. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  6. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  7. Chen, R.; Wang, X.; Zhang, W.; Zhu, X.; Li, A.; Yang, C. A hybrid CNN-LSTM model for typhoon formation forecasting. GeoInformatica 2019, 23, 375–396. [Google Scholar] [CrossRef]
  8. Fukui, K.I. A Study of Upper Tropospheric Circulations over the Northern Hemisphere Prediction Using Multivariate Features by ConvLSTM. In Proceedings of the 23rd Asia Pacific Symposium on Intelligent and Evolutionary Systems, Tottori, Japan, 6–8 December 2019; Springer Nature: Berlin/Heidelberg, Germany, 2019; Volume 12, p. 130. [Google Scholar]
  9. Godske, C.L.; Bjerknes, V. Dynamic Meteorology and Weather Forecasting; American Meteorological Society: Boston, MA, USA, 1957. [Google Scholar]
  10. Sanders, F.; Burpee, R.W. Experiments in barotropic hurricane track forecasting. J. Appl. Meteorol. Climatol. 1968, 7, 313–323. [Google Scholar] [CrossRef]
  11. Sanders, F.; Pike, A.C.; Gaertner, J.P. A barotropic model for operational prediction of tracks of tropical storms. J. Appl. Meteorol. Climatol. 1975, 14, 265–280. [Google Scholar] [CrossRef]
  12. Qian, C.H.; Duan, Y.H.; Ma, S.H.; Xu, Y. The current status and future development of China operational typhoon forecasting and its key technologies. Adv. Meteor. Sci. Technol. 2012, 2, 36–43. [Google Scholar]
  13. Neumann, C.J. An Alternate to the HURRAN (Hurricane Analog) Tropical Cyclone Forecast System; National Oceanic and Atmospheric Administration: Washington, DC, USA, 1972. [Google Scholar]
  14. Chand, S.S.; Walsh, K.J. Forecasting tropical cyclone formation in the Fiji region: A probit regression approach using Bayesian fitting. Weather. Forecast. 2011, 26, 150–165. [Google Scholar] [CrossRef]
  15. Kim, O.Y.; Kim, H.M.; Lee, M.I.; Min, Y.M. Dynamical–statistical seasonal prediction for western North Pacific typhoons based on APCC multi-models. Clim. Dyn. 2017, 48, 71–88. [Google Scholar] [CrossRef]
  16. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  17. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  18. Ranzato, M.; Szlam, A.; Bruna, J.; Mathieu, M.; Collobert, R.; Chopra, S. Video (language) modeling: A baseline for generative models of natural videos. arXiv 2014, arXiv:1412.6604. [Google Scholar]
  19. Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 843–852. [Google Scholar]
  20. Deng, X.; Liao, L.; Jiang, P.; Qian, Y. Towards scale adaptive underwater detection through refined pyramid grid. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  21. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Philip, S.Y. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5123–5132. [Google Scholar]
  22. Wang, Y.; Jiang, L.; Yang, M.H.; Li, L.J.; Long, M.; Fei-Fei, L. Eidetic 3D LSTM: A model for video prediction and beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  23. Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Philip, S.Y.; Long, M. Predrnn: A recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2208–2225. [Google Scholar] [CrossRef] [PubMed]
  24. Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3170–3180. [Google Scholar]
  25. Lian, J.; Chen, R. A sequence-to-sequence based multi-scale deep learning model for satellite cloud image prediction. Earth Sci. Inform. 2023, 16, 1207–1225. [Google Scholar] [CrossRef]
  26. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  27. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  28. Wu, W.; Deng, X.; Jiang, P.; Wan, S.; Guo, Y. Crossfuser: Multi-modal feature fusion for end-to-end autonomous driving under unseen weather conditions. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14378–14392. [Google Scholar] [CrossRef]
  29. Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  31. Xu, Z.; Du, J.; Wang, J.; Jiang, C.; Ren, Y. Satellite image prediction relying on GAN and LSTM neural networks. In Proceedings of the ICC 2019-2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  32. Lin, Z.; Li, M.; Zheng, Z.; Cheng, Y.; Yuan, C. Self-attention convlstm for spatiotemporal prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11531–11538. [Google Scholar]
  33. Hong, S.; Kim, S.; Joh, M.; Song, S.K. Psique: Next sequence prediction of satellite images using a convolutional sequence-to-sequence network. arXiv 2017, arXiv:1711.10644. [Google Scholar]
  34. Cai, P. Research on Cloud Detection and Cloud Image Prediction Methods Based on FY-4A Satellite. Ph.D. Thesis, Nanjing University of Information Science and Technology, Nanjing, China, 2021. [Google Scholar]
  35. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  36. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  37. Finn, C.; Goodfellow, I.; Levine, S. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; Volume 29. [Google Scholar]
  38. Lotter, W.; Kreiman, G.; Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. arXiv 2016, arXiv:1605.08104. [Google Scholar]
  39. Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9154–9162. [Google Scholar]
  40. Su, J.; Byeon, W.; Kossaifi, J.; Huang, F.; Kautz, J.; Anandkumar, A. Convolutional tensor-train LSTM for spatio-temporal learning. Adv. Neural Inf. Process. Syst. 2020, 33, 13714–13726. [Google Scholar]
  41. Villegas, R.; Yang, J.; Hong, S.; Lin, X.; Lee, H. Decomposing motion and content for natural video sequence prediction. arXiv 2017, arXiv:1706.08033. [Google Scholar]
Figure 1. Flowchart of SAM-Net for cloud image prediction.
Figure 2. The first image is the standard self-attention module, and the second image is the self-attention memory module.
Figure 3. The cell structure of SAM-Net. The light green part represents the embedded self-attention memory module shown in Figure 2.
Figure 4. Prediction on the MovingMNIST test set (1).
Figure 5. Prediction on the MovingMNIST test set (2). (red frame marks the comparison between the prediction results of SAM-Net and other models).
Figure 6. Prediction on the KTH action test set. (red frame marks the comparison between the prediction results of SAM-Net and other models).
Figure 7. Prediction of four timesteps on Higos typhoon cloud image dataset. (red frame marks the comparison between the prediction results of SAM-Net and other models).
Figure 8. Comparison of predictions for different models’ MSE, SSIM, PSNR, and LPIPS on a cloud dataset frame by frame (four timesteps).
Figure 9. Prediction of 10 timesteps on Higos typhoon cloud image datasets. (red frame marks the comparison between the prediction results of SAM-Net and other models).
Figure 10. Comparison of predictions for different models’ MSE, SSIM, PSNR, and LPIPS on a cloud dataset frame by frame (10 timesteps).
Figure 11. Prediction of four timesteps on Northwest Pacific cloud image datasets. (red frame marks the comparison between the prediction results of SAM-Net and other models).
Table 1. Results on MovingMNIST averaged over 10 future timesteps.

| Model | MSE (↓) | Δ | SSIM (↑) | Δ | PSNR (↑) | Δ | LPIPS (↓) | Δ |
|---|---|---|---|---|---|---|---|---|
| ConvLSTM [4] | 103.3 | – | 0.707 | – | – | – | 0.156 | – |
| CDNA [37] | 97.4 | – | 0.846 | – | – | – | – | – |
| VPN Baseline [38] | 64.1 | – | 0.87 | – | – | – | – | – |
| MIM [39] | 52 | – | 0.874 | – | – | – | 0.079 | – |
| Conv-TT-LSTM [40] | 64.3 | – | 0.846 | – | – | – | 0.133 | – |
| PredRNN [6] | 56.8 | – | 0.867 | – | – | – | 0.107 | – |
| PredRNN-v2 [23] | 48.4 | – | 0.891 | – | – | – | 0.071 | – |
| SAM-Net | 48.3 | −0.1 | 0.893 | +0.02 | 20.03 | – | 0.068 | −0.03 |
Table 2. Results on KTH averaged over 10 future timesteps.

| Model | MSE (↓) | Δ | SSIM (↑) | Δ | PSNR (↑) | Δ | LPIPS (↓) | Δ |
|---|---|---|---|---|---|---|---|---|
| ConvLSTM [4] | – | – | 0.712 | – | 23.58 | – | 0.231 | – |
| Mcnet + Residual [41] | – | – | 0.806 | – | 26.29 | – | – | – |
| TrajGRU [5] | – | – | 0.79 | – | 26.97 | – | – | – |
| DFN [36] | – | – | 0.794 | – | 27.26 | – | – | – |
| Conv-TT-LSTM [40] | – | – | 0.815 | – | 27.62 | – | – | – |
| PredRNN [6] | – | – | 0.839 | – | 27.55 | – | 0.204 | – |
| PredRNN-v2 | 20.5 | – | 0.845 | – | 31.72 | – | 0.139 | – |
| SAM-Net | 19.37 | −1.13 | 0.861 | +0.016 | 32.18 | +0.046 | 0.109 | −0.03 |
Table 3. Results on typhoon cloud images averaged over 4 and 10 future timesteps.

Typhoon cloud images (4 timesteps):

| Model | MSE (↓) | Δ | SSIM (↑) | Δ | PSNR (↑) | Δ | LPIPS (↓) | Δ | PCCS (↑) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM | 104.19 | – | 0.448 | – | 18.32 | – | 0.242 | – | – | – |
| ConvGRU | 89.34 | – | 0.532 | – | 20.02 | – | 0.203 | – | – | – |
| PredRNN | 80.86 | – | 0.621 | – | 23.15 | – | 0.162 | – | – | – |
| PredRNN-v2 | 64.2 | – | 0.715 | – | 24.63 | – | 0.143 | – | 0.871 | – |
| SAM-Net | 60.13 | −4.07 | 0.719 | +0.004 | 24.89 | +0.26 | 0.138 | −0.005 | 0.914 | +0.043 |

Typhoon cloud images (10 timesteps):

| Model | MSE (↓) | Δ | SSIM (↑) | Δ | PSNR (↑) | Δ | LPIPS (↓) | Δ | PCCS (↑) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| PredRNN-v2 | 265.89 | – | 0.317 | – | 18.09 | – | 0.238 | – | 0.562 | – |
| SAM-Net | 85.31 | −180.58 | 0.668 | +0.351 | 23.65 | +5.56 | 0.174 | −0.064 | 0.881 | +0.319 |
Table 4. Ablation experiments using different methods on cloud image datasets.

Typhoon cloud images (4 timesteps):

| Model | MSE (↓) | Δ | SSIM (↑) | Δ | PSNR (↑) | Δ | LPIPS (↓) | Δ |
|---|---|---|---|---|---|---|---|---|
| PredRNN-v2 | 66.28 | – | 0.762 | – | 24.61 | – | 0.137 | – |
| w Mem w/o Attention | 64.98 | −1.3 | 0.779 | +0.017 | 25.02 | +0.41 | 0.134 | −0.003 |
| w/o Mem w Attention | 65.77 | −0.51 | 0.771 | +0.009 | 24.93 | +0.31 | 0.136 | −0.001 |
| w Mem w Attention | 63.89 | −2.39 | 0.787 | +0.025 | 25.24 | +0.63 | 0.132 | −0.005 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
