1. Introduction
Precipitation is one of the most common weather events in our daily lives [1]. From traditional rain gauges to modern meteorological radars and satellites, the automation and precision of sensors for measuring precipitation have improved significantly. These advancements provide a data foundation for Numerical Weather Prediction (NWP) [2]. NWP simulates the Earth's atmosphere, a typical nonlinear system, over long time horizons by establishing mathematical and physical models, and it offers good interpretability. However, because NWP models are grounded in the mathematical modeling of atmospheric physics and dynamics, they are better suited to long-term forecasting tasks. They struggle with rapid changes in local weather conditions, particularly in the short-term forecasting of precipitation within specific local areas [3].
The radar extrapolation method [4] is a precipitation forecasting technique based on radar reflectivity data. It predicts the development trend and intensity of precipitation by analyzing the characteristics of radar echoes. One advantage of this approach is its real-time capability, allowing precipitation forecasts within a relatively short time range. However, because echo motion can be highly dynamic and nonlinear, and its intensity can vary over time, radar extrapolation still suffers from low accuracy in practical applications [5]. Owing to limitations such as insufficient data update frequency and insensitivity to dynamic changes, the radar echo extrapolation method struggles to meet the demands of short-term precipitation forecasting.
With the rise of deep learning, new directions have emerged in the field of precipitation forecasting, and many scholars have conducted research in this area [6,7,8,9,10]. Deep learning models for short-term precipitation forecasting can be categorized into four main types: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), generative networks, and Graph Neural Networks (GNNs). CNN structures, notably UNet [11] and its variants, have been extensively developed and applied to the study of hurricanes and extreme precipitation events. However, CNN models exhibit limitations in handling time-series data, particularly for rapidly changing weather events.
Generative Adversarial Networks (GANs) find application in synthesizing meteorological scenarios, especially precipitation scenes. By learning distributions from existing observational data, GANs can generate synthetic data, helping to augment training samples and enhance model performance. Meng et al. [12] introduced TCR-GAN for predicting extreme precipitation events caused by cyclones. Graph Neural Networks (GNNs) excel at capturing spatiotemporal relationships between nodes in graph structures, which is crucial for handling the complex spatiotemporal correlations in meteorological data.
Recently, with the popularity of the Transformer [13], Zhang et al. [14] employed a deep learning framework based on Automated Machine Learning (AutoML) techniques and Transformers. Experimental validation on datasets of two different resolutions confirmed the algorithm's effectiveness. In addition, Transformer variants such as FourCastNet [15], Rainformer [16], and Earthformer [17] have been applied to short-term precipitation forecasting.
Compared to other models, RNN models are more suitable for short-term precipitation forecasting due to their outstanding performance in handling time-series data. These models, by incorporating recurrent structures, effectively capture dependencies in the temporal dimension, showcasing notable performance when dealing with rapidly changing weather events.
However, typical RNN models such as ConvLSTM [18] treat short-term precipitation forecasting as a spatiotemporal prediction problem over radar echo sequences and have only been tested on radar echo datasets. Owing to the instability of meteorological systems, radar data can be subject to noise interference, making it difficult to accurately reflect the true distribution of precipitation and leading to significant prediction errors.
In addition to radar echo data, there are other data sources in the meteorological field, such as satellite observation data, which can reflect real precipitation conditions. Furthermore, unlike other spatiotemporal prediction problems, meteorological data often interact with each other, exhibiting multimodality. Most current precipitation forecasting models do not fully exploit this characteristic. Therefore, our research focuses on the question of how to effectively leverage multimodal data to enhance the accuracy of precipitation forecasting.
Moreover, we hypothesize that the limitations of spatiotemporal prediction models such as ConvLSTM in short-term precipitation forecasting stem from an excessive reliance on radar echo sequences, neglecting the multimodality and spatial feature complexity of meteorological data. Based on this assumption, we propose a novel short-term precipitation forecasting model that combines a multimodal fusion structure with spatiotemporal prediction models, termed the multimodal fusion and prediction model (MultiPred).
Our goal is to integrate multimodal data and thoroughly explore the implicit causal relationships between precipitation data and other meteorological elements, aiming to enhance the accuracy of short-term local precipitation forecasts. The model leverages historical multimodal data from the target area to forecast subsequent precipitation events. Specifically, MultiPred comprises three components for precipitation prediction. Firstly, a spatial feature extraction network (SFEN) is employed to fuse spatial features from historical multimodal data. Secondly, a time feature extraction network consisting of convolutional long short-term memory (ConvLSTM) is utilized for the initial prediction of historical multimodal data. Thirdly, a modal fusion layer (MFL) is incorporated to fuse spatial features from the initial prediction, amalgamating the spatial features obtained from multiple meteorological modal data in the first stage. This enables the model to focus on predicting and outputting precipitation in the target region. The model was tested using ERA5 and GPM data, and the experimental results demonstrate that, compared to other models, MultiPred exhibits higher accuracy. The main contributions of this paper are as follows:
In this study, a precipitation forecasting model named MultiPred is introduced, incorporating a multimodal fusion structure to integrate historical multimodal data from various sources, including satellite observations. This approach maximizes the diversity of meteorological data and enhances the model’s ability to reflect real precipitation.
A novel spatial feature attention layer is proposed for constructing the spatial feature extraction network (SFEN), enabling a more specialized and efficient implementation of spatial feature fusion for historical multimodal data.
Experimental validation on ERA5 and GPM datasets demonstrates that the MultiPred model exhibits higher accuracy compared to other models. This provides a novel and more effective direction and methodology for the future field of meteorological forecasting.
4. Model
4.1. Spatiotemporal Sequence Prediction
Meteorological observation data and reanalysis data differ in spatial resolution, temporal interval, and spatial distribution. For instance, at the same time, precipitation in coastal areas may be more pronounced than in inland regions, and in the same region, spring precipitation is typically lower than summer precipitation, as illustrated in Figure 5.
Therefore, meteorological observations can be formalized as meteorological data sequences with specific spatiotemporal resolutions [32]. For a local region at time $t$, the $i$-th type of meteorological observation data can be represented by the tensor $X_t^i \in \mathbb{R}^{H \times W}$, where $H$ and $W$ represent the grid length and width, based on latitude and longitude resolution, respectively. The entire meteorological observation at time $t$ can be represented by concatenating all meteorological modality data, i.e., $\mathcal{X}_t = [X_t^1, X_t^2, \ldots, X_t^M]$, where $M$ represents the number of modalities. By modeling meteorological data in this way, meteorological forecasting can be formalized as a sequence-to-sequence spatiotemporal prediction task [33]. The goal is to infer the future meteorological states $\hat{\mathcal{X}}_{t+1}, \ldots, \hat{\mathcal{X}}_{t+K}$ based on the historical observation sequence $\mathcal{X}_{t-J+1}, \ldots, \mathcal{X}_t$.
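To make this formalization concrete, the following sketch shows the tensor shapes of one input/target pair in PyTorch; the dimensions are illustrative assumptions, not values from the paper.

```python
import torch

# Hypothetical dimensions: M modalities stacked along the channel axis on
# an H x W lat-lon grid, J historical steps in, K future steps out.
M, H, W = 6, 48, 48      # e.g., five reanalysis variables plus precipitation
J, K = 4, 4              # 4 h of history in, 4 h of forecast out

# One training sample: the observed sequence X_{t-J+1..t} and the
# future states X_{t+1..t+K} the model must infer.
history = torch.randn(J, M, H, W)   # input sequence
future  = torch.randn(K, M, H, W)   # prediction target
```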
4.2. ConvLSTM
The ConvLSTM layer comprises two states, the cell state ($C$) and the hidden state ($H$), as well as three gates: the input gate, the forget gate, and the output gate. Along the time dimension, the forget gate ($f_t$) is activated first to determine which parts of the previous cell state ($C_{t-1}$) should be "forgotten". Subsequently, new information accumulates in the cell state through the input gate ($i_t$). Finally, the output gate ($o_t$) controls which information will propagate to the next time step. Formally, given an input $F_t$, the key equations of the ConvLSTM are as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} \ast F_t + W_{hi} \ast H_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} \ast F_t + W_{hf} \ast H_{t-1} + b_f)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh(W_{xc} \ast F_t + W_{hc} \ast H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} \ast F_t + W_{ho} \ast H_{t-1} + b_o)\\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

where $\ast$ denotes the convolution operator, $\odot$ represents element-wise multiplication, $W$ and $b$ are the parameters of each gate, and $\sigma$ is the sigmoid function. As the second component of the model, ConvLSTM serves as the temporal feature extraction network, predicting multimodal meteorological data across multiple time steps. This further enhances the accuracy and robustness of precipitation forecasting.
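As an illustration of these equations, a minimal ConvLSTM cell can be sketched in PyTorch as follows; the peephole connections of the original formulation are omitted, and the channel counts and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell following the gate equations above
    (peephole terms omitted); hyperparameters are illustrative."""

    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        # One convolution computes all four gate pre-activations at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                              padding=kernel // 2)

    def forward(self, f_t, state):
        h_prev, c_prev = state
        gates = self.conv(torch.cat([f_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)   # forget old, accumulate new
        h_t = o * torch.tanh(c_t)              # expose gated cell content
        return h_t, c_t
```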
4.3. Basic Network Structure
The precipitation forecasting problem in meteorology is commonly regarded as a spatiotemporal prediction challenge in deep learning. In response, the MultiPred model has been devised, consisting primarily of three parts. The model structure is depicted in Figure 6.
The first part is the spatial feature extraction network, which takes as input multiple meteorological modal data sequences including precipitation, $\mathcal{X}_{1:t}$, and outputs the encoded multimodal feature sequence $F_{1:t}$. The second part is the ConvLSTM network, which takes the multimodal feature sequence $F_{1:t}$ as input and outputs the multimodal feature prediction sequence $\hat{F}_{t+1:t+k}$. The third part is the modal fusion layer, which integrates the spatial features of $\hat{F}_{t+1:t+k}$ after upsampling and, via the MFL, outputs the predicted precipitation value $\hat{Y}_{t+1:t+k}$ for the target area. The first part, the spatial feature extraction network, can be formulated as follows:

$$F_{1:t} = \mathrm{SFEN}(\mathcal{X}_{1:t})$$

In this context, $\mathcal{X}_t \in \mathbb{R}^{C \times H \times W}$ represents the multimodal meteorological observation at time $t$, consisting of the U-component wind, V-component wind, humidity, temperature, vertical velocity, and precipitation at time $t$, respectively. $C$, $H$, and $W$ represent the number of channels, the length of the tensor, and the width of the tensor, respectively.

In the first part of the model, the spatial feature extraction network extracts and integrates the spatial features of the reanalysis data sequence $\mathcal{X}_{1:t}$ and outputs the spatially encoded multimodal feature sequence $F_{1:t}$.
The second part is the time feature extraction network, which, based on the output of the first part, $F_{1:t}$, further extracts the temporal features of the multimodal data and outputs the predicted multimodal data values for the target region over $k$ future steps. This stage can be formalized as follows:

$$\hat{F}_{t+1:t+k} = \mathrm{ConvLSTM}(F_{1:t})$$

The third part is the spatial regression layer. Based on the output of the second part, $\hat{F}_{t+1:t+k}$, this layer first upsamples the predicted features, further fuses the multimodal data, and outputs the predicted precipitation value $\hat{Y}_{t+1:t+k}$ for the target area. This stage can be formalized as follows:

$$\hat{Y}_{t+1:t+k} = \mathrm{MFL}(\hat{F}_{t+1:t+k})$$
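A high-level sketch of this three-stage pipeline is shown below; `SFEN`, the ConvLSTM network, and the modal fusion layer are passed in as hypothetical modules and do not reproduce the authors' exact implementations.

```python
import torch.nn as nn

class MultiPredSketch(nn.Module):
    """High-level sketch of the three-stage MultiPred pipeline; the three
    submodules are stand-ins for the components described in this section."""

    def __init__(self, sfen: nn.Module, convlstm_net: nn.Module,
                 mfl: nn.Module):
        super().__init__()
        self.sfen = sfen              # part 1: spatial feature extraction
        self.convlstm = convlstm_net  # part 2: temporal prediction
        self.mfl = mfl                # part 3: modal fusion / regression

    def forward(self, x_seq):
        f_seq = self.sfen(x_seq)       # F_{1:t} = SFEN(X_{1:t})
        f_pred = self.convlstm(f_seq)  # F_{t+1:t+k} = ConvLSTM(F_{1:t})
        y_pred = self.mfl(f_pred)      # Y_{t+1:t+k} = MFL(F_{t+1:t+k})
        return y_pred
```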
4.4. Spatial Feature Extraction Network
4.4.1. Spatial Feature Attention Layer
The spatial feature extraction network is built from spatial feature attention layers (SFALs). Extreme values in meteorological data represent anomalies in the meteorological system and are often associated with extreme weather events such as heavy rain and typhoons. These events can significantly impact precipitation, and the occurrence of extreme values can alter meteorological patterns, thereby influencing the distribution and intensity of precipitation. Therefore, these extreme values should be emphasized, and SFAL was designed with inspiration from spatial attention mechanisms. The process of SFAL is as follows:
Assuming $X \in \mathbb{R}^{C \times H \times W}$ represents the multimodal meteorological input, $X$ is fed into both a left and a right branch. In the left branch, a 1 × 1 convolutional layer is employed to halve the dimensions $H$ and $W$ and double the number of channels $C$, followed by Batch Normalization (BatchNorm) and ReLU activation operations. The left branch ultimately outputs the feature map $X_L \in \mathbb{R}^{2C \times H/2 \times W/2}$. In the right branch, max-pooling is applied to halve the dimensions $H$ and $W$, followed by a 1 × 1 convolutional layer to double the number of channels $C$; finally, the sigmoid activation function yields the attention map $X_R \in \mathbb{R}^{2C \times H/2 \times W/2}$. Finally, $X_L$ and $X_R$ undergo element-wise multiplication, achieving different degrees of enhancement and suppression across the spatial positions of $X_L$. The SFAL ultimately outputs the feature map $X'$. SFAL can be formalized as follows:

$$
\begin{aligned}
X_L &= \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(X)))\\
X_R &= \sigma(\mathrm{Conv}_{1\times1}(\mathrm{MaxPool}(X)))\\
X' &= X_L \odot X_R
\end{aligned}
$$
The structure of SFAL is illustrated in Figure 7.
SFAL utilizes a spatial attention mechanism to learn spatial attention weight maps, achieving different degrees of enhancement and suppression for the spatial positions of the original feature map. This allows the final feature map to focus more on spatial positions containing extreme values. Compared to directly using convolutional layers to extract spatial features, the SFAL module adds a pooling layer, resulting in fewer computations. Simultaneously, it complements the original features, obtaining a feature map with richer semantics.
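Under the description above, a minimal SFAL sketch in PyTorch might read as follows; the stride and pooling sizes are assumptions consistent with the halving of $H$ and $W$, and even input dimensions are assumed.

```python
import torch.nn as nn

class SFAL(nn.Module):
    """Sketch of the spatial feature attention layer: a strided 1x1 conv
    branch (features) gated by a pooled sigmoid branch (spatial attention).
    Assumes even H and W so the two branch outputs align."""

    def __init__(self, c: int):
        super().__init__()
        # Left branch: 1x1 conv with stride 2 halves H and W, doubles C.
        self.left = nn.Sequential(
            nn.Conv2d(c, 2 * c, kernel_size=1, stride=2),
            nn.BatchNorm2d(2 * c),
            nn.ReLU(inplace=True),
        )
        # Right branch: max-pool halves H and W, 1x1 conv doubles C,
        # sigmoid turns the result into per-position attention weights.
        self.right = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(c, 2 * c, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.left(x) * self.right(x)  # element-wise gating
```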
4.4.2. Spatial Feature Extraction Network
The role of the spatial feature extraction network is to preliminarily integrate the spatial information of different meteorological data and establish spatial relationships between them. This module primarily employs multiple stacked SFALs to process the input data. Each SFAL convolves and max-pools a three-dimensional tensor of the form $(C, H, W)$ across all channels, producing a new tensor of the form $(2C, H/2, W/2)$. During this process, SFAL performs element-wise multiplication and addition on feature maps from different channels, capturing spatial relationships between the various meteorological data. By stacking multiple SFALs, the module gradually extracts and integrates the spatial information of different meteorological data, establishing spatial relationships between them.
The structure of the multimodal spatial encoding module is illustrated in Figure 8.
4.5. ConvLSTM Network
The primary function of the ConvLSTM network is to provide an autoregressive recurrent prediction structure. The multimodal spatial encoding module inputs the encoded data $F_t$ at time $t$ into the ConvLSTM network. The ConvLSTM network generates the multimodal prediction $\hat{F}_{t+1}$, containing the predicted precipitation, which is then used to generate the subsequent prediction $\hat{F}_{t+2}$, forming an autoregressive recurrent prediction.
In the second part, the continuous integration and reconstruction of the spatiotemporal features of multimodal meteorological data enhance the overall prediction accuracy of multimodal data and significantly improve the precision of precipitation prediction.
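A minimal sketch of this autoregressive rollout, reusing the `ConvLSTMCell` sketched in Section 4.2 and assuming the feature and hidden channel counts match, is shown below.

```python
import torch

def rollout(cell, f_seq, k):
    """Encode the historical feature sequence, then feed each prediction
    back as the next input for k steps (assumes the cell's hidden channel
    count equals the feature channel count)."""
    b, ch, hh, ww = f_seq[0].shape
    h = torch.zeros(b, ch, hh, ww)
    c = torch.zeros_like(h)
    for f_t in f_seq:              # warm up on the encoded history F_{1:t}
        h, c = cell(f_t, (h, c))
    preds = []
    for _ in range(k):             # autoregressive prediction steps
        h, c = cell(h, (h, c))     # previous output becomes the next input
        preds.append(h)
    return torch.stack(preds)      # predicted features F_{t+1..t+k}
```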
4.6. Modal Fusion Layer
The third component of this model is the modality fusion layer, implemented primarily through upsampling and a self-attention mechanism. In this stage, the initial feature map $\hat{F}_{t+1}$ generated by the temporal feature extraction network is transformed into a higher-resolution feature map. The upsampling part of the modality fusion layer consists of a series of stacked deconvolution layers and ReLU activation functions. The deconvolution layers play a crucial role in this module, enlarging the spatial resolution of the input tensor through the inverse operation of convolution. Through this sequence of deconvolution layers and activation functions, the modality fusion layer first reconstructs the predicted result into a representation $\hat{F}'_{t+1}$ with a higher spatial resolution. Subsequently, $\hat{F}'_{t+1}$ is split by channel into a meteorological part and a precipitation part. Finally, by refining and adjusting the precipitation part through a self-attention mechanism, the precipitation forecast $\hat{P}_{t+1}$ for time step $t+1$ is ultimately obtained.
The schematic diagram of the modal fusion layer is shown in Figure 9.
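The following sketch illustrates this upsample-split-refine flow. The deconvolution hyperparameters and the use of `nn.MultiheadAttention` as the self-attention block are assumptions chosen to reproduce the (192 × 22 × 22) → (17 × 48 × 48) shapes described later in Section 5.4, not the authors' exact design.

```python
import torch.nn as nn

class ModalFusionSketch(nn.Module):
    """Sketch of the modality fusion layer: deconvolution upsamples the
    predicted features, the result is split by channel into meteorological
    (15) and precipitation (2) parts, and the precipitation part is refined
    by self-attention. All layer sizes are illustrative assumptions."""

    def __init__(self, in_ch: int = 192):
        super().__init__()
        self.upsample = nn.Sequential(
            # (192, 22, 22) -> (64, 44, 44) -> (17, 48, 48)
            nn.ConvTranspose2d(in_ch, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 17, kernel_size=5),
            nn.ReLU(inplace=True),
        )
        self.refine = nn.MultiheadAttention(embed_dim=2, num_heads=1,
                                            batch_first=True)

    def forward(self, f_pred):
        f_up = self.upsample(f_pred)               # (B, 17, 48, 48)
        met, precip = f_up.split([15, 2], dim=1)   # channel-wise split
        b, c, h, w = precip.shape
        seq = precip.flatten(2).transpose(1, 2)    # (B, H*W, 2) tokens
        refined, _ = self.refine(seq, seq, seq)    # self-attention refinement
        precip = refined.transpose(1, 2).view(b, c, h, w)
        return met, precip
```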
5. Experiment
5.1. Data Preprocessing
The input data for the model includes GPM precipitation data and ERA5 multimodal data for the target region. Due to the 30 min resolution of GPM data, data for each preceding half-hour is added to the input data to capture more spatiotemporal features of precipitation. The experiment utilizes ERA5 and GPM data from 2017 to 2020, with the first three years as the training set, and the fourth year divided into a validation set and a test set. The training set contains a total of 52,560 precipitation data points, and the ERA5 data consist of 26,280 points. Both the validation and test sets have 8760 precipitation data points and 4380 ERA5 data points each. Each batch inputted into the model consists of 8 h of data, where the first 4 h serve as the model input, and the GPM precipitation data for the following 4 h act as the model prediction label.
Given the different time resolutions (30 min for GPM and 1 h for ERA5), two GPM precipitation data sets within one hour are channel-merged to form the (2 × 120 × 120) input precipitation data.
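As a small illustration (variable names and values are hypothetical), this channel merge can be done as follows.

```python
import torch

# Two consecutive half-hour GPM frames (each 120 x 120) are stacked along
# a channel axis so one tensor covers a full hour, matching ERA5's 1 h step.
gpm_00 = torch.randn(120, 120)   # e.g., 08:00-08:30 precipitation field
gpm_30 = torch.randn(120, 120)   # e.g., 08:30-09:00 precipitation field
hourly = torch.stack([gpm_00, gpm_30], dim=0)   # shape: (2, 120, 120)
```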
Standard score transformation is applied to the ERA5 multimodal meteorological data to convert different modal data into standard normal distribution, facilitating statistical analysis and modeling. This transformation aids in better understanding and analyzing patterns and trends within ERA5 multimodal data.
The standard score transformation formula is given by $z = (x - \mu)/\sigma$, where $x$ is the initial value of the ERA5 meteorological data, $z$ is the standardized value after the transformation, $\mu$ is the mean of the overall data, and $\sigma$ is the standard deviation of each modality.
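A minimal sketch of this per-modality standardization follows; the helper and statistics shown are illustrative, and the paper does not specify over which period the statistics are computed (training-period statistics are assumed here, as is standard practice).

```python
import torch

def standardize(x: torch.Tensor, mean: float, std: float) -> torch.Tensor:
    """Standard score transformation z = (x - mu) / sigma, applied with the
    mean and standard deviation computed separately for each ERA5 modality."""
    return (x - mean) / std

# Hypothetical example: a (levels, lat, lon) temperature field in kelvin.
temperature = torch.randn(3, 48, 48) * 10 + 280
z = standardize(temperature, mean=280.0, std=10.0)
```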
5.2. Loss Function
Zhao et al. [34] argue that combining different loss functions can achieve better image quality. Therefore, during the training phase, we combine the MAE, MSE, and SSIM losses as the network's loss function, as shown in Formula (6):

$$L = L_{\mathrm{MSE}}(Y, \hat{Y}) + L_{\mathrm{MAE}}(Y, \hat{Y}) + L_{\mathrm{SSIM}}(Y, \hat{Y})$$

where $Y$ represents the ground truth and $\hat{Y}$ represents the predicted value.
The Structural Similarity Index (SSIM) [35] is a metric used to measure the structural similarity between two images, commonly employed for image quality assessment. During the calculation, the two images are divided into small patches. SSIM takes brightness, contrast, and structure into account, evaluating the similarity between two images by comparing these features. The SSIM value lies in the range [0, 1]: a value closer to 1 indicates higher similarity between the two images, while a value closer to 0 indicates greater dissimilarity. When the SSIM value is 1, the two images are identical.
In formula form, SSIM can be expressed as:

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma}$$

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$

where $x$ and $y$ represent patches from $Y$ and the corresponding patches from $\hat{Y}$, respectively; $\mu_x$ and $\mu_y$ represent the averages of $x$ and $y$; $\sigma_x^2$ and $\sigma_y^2$ represent the variances of $x$ and $y$; $\sigma_{xy}$ represents the covariance between $x$ and $y$; $l(x, y)$, $c(x, y)$, and $s(x, y)$ represent the luminance similarity, contrast similarity, and structure score between the two patches, respectively; $\alpha$, $\beta$, and $\gamma$ represent weights; and $C_1$, $C_2$, and $C_3$ are small constants that stabilize the division. The schematic diagram of the computation process is shown in Figure 10.
Since an SSIM value closer to 1 indicates greater similarity between two images, subtracting the SSIM value from 1 yields a loss that decreases as the prediction becomes more similar to the target, which facilitates optimization. The formula for the SSIM loss is as follows:

$$L_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(Y, \hat{Y})$$
MSE and MAE are common loss functions in spatiotemporal prediction models. Their calculation formulas are shown in Equation (9):

$$L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2, \quad L_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N}\left|Y_i - \hat{Y}_i\right|$$

where $Y_i$ is the true value and $\hat{Y}_i$ is the predicted result.
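A sketch of the combined loss in PyTorch is shown below. The SSIM term uses a uniform window instead of the Gaussian window of the original formulation, and the unweighted sum mirrors Formula (6) as reconstructed above; both simplifications are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim_loss(y_pred, y_true, c1=0.01 ** 2, c2=0.03 ** 2, win=11):
    """1 - SSIM with a uniform window; expects (B, C, H, W) tensors scaled
    to [0, 1] (the c1/c2 constants assume a data range of 1)."""
    pad = win // 2
    mu_x = F.avg_pool2d(y_pred, win, 1, pad)
    mu_y = F.avg_pool2d(y_true, win, 1, pad)
    var_x = F.avg_pool2d(y_pred * y_pred, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y_true * y_true, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(y_pred * y_true, win, 1, pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim_map.mean()

def combined_loss(y_pred, y_true):
    """MSE + MAE + SSIM loss; the paper does not state relative weights,
    so an unweighted sum is assumed here."""
    return (F.mse_loss(y_pred, y_true)
            + F.l1_loss(y_pred, y_true)
            + ssim_loss(y_pred, y_true))
```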
5.3. Hyperparameter Settings
Before the actual training process, a series of experiments was conducted to select the most suitable hyperparameter configuration, the key parameters being batch size and learning rate. We examined two batch sizes (4, 8) and three learning rates (5 × 10−4, 3 × 10−4, 1 × 10−4), training for 300 epochs in each case to investigate model performance under different prediction time steps, as illustrated in Table 1.
From the results table, it is evident that the model with a batch size of 4 consistently exhibited the lowest loss across scenarios, performing best at a learning rate of 1 × 10−4. Consequently, for the formal training phase, we adopted a batch size of 4 and a learning rate of 1 × 10−4 in anticipation of superior model performance.
In this experiment, we used the Adam optimizer [36] for training. All network parameters were initialized from a normal distribution, and the training process concluded after 1000 iterations. The proposed neural network was implemented in PyTorch 1.8.1 [37] and trained end-to-end. To prevent overfitting, an early stopping strategy was employed: if the loss value did not decrease within 5 epochs, training stopped. Our experimental platform featured 32 GB of memory and an Nvidia RTX 2080 GPU, running Ubuntu 16.04.
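A minimal sketch of this early stopping rule follows; the `train_one_epoch` helper is hypothetical.

```python
# Early stopping as described above: halt once the monitored loss has not
# improved for 5 consecutive epochs.
best_loss, patience, wait = float("inf"), 5, 0
for epoch in range(1000):
    val_loss = train_one_epoch()   # hypothetical helper returning the loss
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            break                  # stop: no improvement within 5 epochs
```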
5.4. Training Process
The overall training process of the model is illustrated in Figure 11. At time $t$, the model's input data is divided into GPM precipitation data and ERA5 reanalysis data. The ERA5 reanalysis data include meteorological information for five modalities: U-wind, V-wind, humidity, temperature, and vertical wind speed, with a temporal resolution of 1 h and spatial dimensions of (3 × 48 × 48). The GPM precipitation data has a temporal resolution of 30 min. To align with the temporal resolution of the ERA5 data, two GPM precipitation frames with dimensions (120 × 120) within one hour are channel-merged to obtain precipitation data with dimensions (2 × 120 × 120). At the beginning of model training, the ERA5 data and GPM precipitation data are input into the spatial feature extraction network. Notably, to align the spatial dimensions of the two types of data, the precipitation data is downsampled to (4 × 48 × 48) through a convolution operation before entering the spatial feature extraction network.
In the spatial feature extraction network, six SFAL branches focus on extreme values in the input feature maps and perform channel expansion. After two SFAL layers, the model has extracted key features related to extreme weather events from the two types of data. Between the first and second SFAL layers of the branch handling precipitation, a convolution layer further downsamples the precipitation feature map.
In the final stage of the spatial feature extraction network, the output feature maps of the six branches are channel-merged to obtain a feature map with dimensions (192 × 22 × 22). This feature map is then input into the temporal feature extraction network, where a single PredRNN layer extracts the spatiotemporal features of the input data, generating the multimodal data prediction for time step $t+1$. Through the temporal feature extraction network, the model enhances its ability to predict multimodal data at future time steps, further improving the accuracy of precipitation forecasting.
Next, the prediction $\hat{F}_{t+1}$ is input into the modality fusion layer and, through deconvolution operations, upsampled to the size of (17 × 48 × 48). Subsequently, it is split by channel into a meteorological part of size (15 × 48 × 48) and a precipitation part of size (2 × 48 × 48). The precipitation part is further upsampled and refined through self-attention to obtain the precipitation prediction $\hat{P}_{t+1}$ for time step $t+1$. Additionally, the meteorological part and $\hat{P}_{t+1}$ together form the input data for the model at the next time step, achieving autoregressive prediction.
5.5. Evaluation Metrics
A series of six model comparison experiments was designed to elucidate the efficacy of the MultiPred model in precipitation forecasting. To evaluate the experimental results, three widely recognized metrics were employed: the Critical Success Index (CSI), the Probability of Detection (POD), and the Heidke Skill Score (HSS). Their calculation formulas are shown in Equation (10):

$$\mathrm{CSI} = \frac{TP}{TP + FN + FP}, \quad \mathrm{POD} = \frac{TP}{TP + FN}, \quad \mathrm{HSS} = \frac{2\,(TP \cdot TN - FN \cdot FP)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)}$$
where TP stands for true positives, indicating the number of times the model correctly predicted precipitation events; FP stands for false positives, representing the number of times the model incorrectly predicted precipitation events, i.e., when there was no actual precipitation, but the model predicted precipitation; FN stands for false negatives, indicating the number of times the model incorrectly failed to predict precipitation events, i.e., when there was actual precipitation, but the model predicted no precipitation; TN stands for true negatives, representing the number of times the model correctly predicted the absence of precipitation events.
Both CSI and POD are pivotal indicators for assessing the accuracy of a precipitation forecasting model. The inclusion of the Heidke Skill Score (HSS) provides a more comprehensive measure: HSS accounts for both correct and incorrect predictions, measuring forecast accuracy relative to that expected by random chance.
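A minimal sketch of how these scores can be computed from gridded forecasts is shown below; the thresholds follow the levels in Table 2, and the function name is illustrative.

```python
import numpy as np

def scores(pred, truth, threshold):
    """CSI, POD, and HSS from binary rain/no-rain events at a given
    rain-rate threshold (e.g., 0.1, 2, or 6 mm per Table 2)."""
    p, t = pred >= threshold, truth >= threshold
    tp = float(np.sum(p & t))    # predicted rain, rain observed
    fp = float(np.sum(p & ~t))   # predicted rain, no rain observed
    fn = float(np.sum(~p & t))   # missed rain event
    tn = float(np.sum(~p & ~t))  # correctly predicted no rain
    csi = tp / (tp + fn + fp)
    pod = tp / (tp + fn)
    hss = 2 * (tp * tn - fn * fp) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return csi, pod, hss
```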
As shown in Table 2, three precipitation levels have been established: light, when the precipitation d satisfies 0.1 mm < d ≤ 2 mm; moderate, when 2 mm < d ≤ 6 mm; and heavy, when d > 6 mm.
5.6. Experimental Results and Analysis
The first group used the complete MultiPred model, with input data consisting of multiple meteorological modalities covering a large area for the preceding 4 h (including time encoding), and output data representing continuous precipitation in the target area for the subsequent 4 h. The second group used the MultiPred model without the ERA5 meteorological data; its input included continuous GPM precipitation and time encoding data for the past 4 h, with the same output as the first group. The third to sixth groups used the ConvLSTM, PredRNN, TrajGRU, and MIM models, respectively, with inputs and outputs similar to the first group. The comparative results on the ERA5 and GPM datasets are shown in Table 3, Table 4, Table 5 and Table 6.
To more effectively analyze the precipitation forecasting capabilities of these six models, the CSI, POD, and HSS indices for each forecasting moment were categorized into three precipitation levels for display. As Table 3, Table 4, Table 5 and Table 6 show, all indicators for the six models gradually decrease over time. In the prediction tasks for light and heavy precipitation over the entire four hours, the complete MultiPred model consistently exhibits the highest indicator values.
However, in the prediction task for moderate precipitation, the MultiPred model without ERA5 achieved the highest scores in the first and second hours. Additionally, the mean CSI, POD, and HSS indices of these models over the four forecasting moments were categorized into three precipitation levels for display. From Figure 12, Figure 13 and Figure 14, it can be seen that the proposed MultiPred model obtained the highest score in the prediction task for light precipitation, with mean CSI, POD, and HSS values reaching 0.471, 0.570, and 0.557, respectively. Among all compared methods, E3D-LSTM is the best-performing precipitation forecasting model. Compared to E3D-LSTM, the MultiPred model improved the CSI, POD, and HSS indices by 14.03%, 22.05%, and 26.31%, respectively.
For the prediction task of heavy precipitation, all indicators of the six models decreased compared to the light precipitation prediction task. Among them, the proposed MultiPred model also obtained the highest score. Compared to the MultiPred model without ERA5 data, the MultiPred model improved the CSI, POD, and HSS indices by 41.46%, 37.89%, and 31.06%, respectively.
This indicates that using only historical precipitation data to predict subsequent precipitation in light and heavy precipitation prediction tasks has clear limitations, as it does not consider the overall climate environment of the region. However, in the prediction task for moderate precipitation, the information from historical precipitation data is crucial for short-term forecasting. Nevertheless, in subsequent hours of forecasting, the complete MultiPred model achieved higher scores, indicating that information about the overall climate environment is necessary for long-term forecasting.
5.7. Visual Analysis
Figure 15 presents visual comparisons of precipitation forecast images for the target area at the first, second, third, and fourth hours, using different networks. The visualized models include the complete MultiPred model, MultiPred (without ERA5), TrajGRU, PredRNN++, MIM and ConvLSTM. The blue area represents regions where the model predicts no precipitation, while the precipitation values in the middle are mapped to green or yellow, and larger precipitation values are mapped to orange or red.
The first row in the figure shows the ground truth precipitation values at the four moments. It can be observed that methods predicting subsequent precipitation in the target area from historical precipitation data alone, like MultiPred (without ERA5), exhibit unclear boundaries and less accurate distributions of predicted precipitation in regions of heavy precipitation compared to the complete MultiPred model. The complete MultiPred model performs well in predicting light and moderate precipitation areas, showing clearer boundaries, and its predictions are closer to the true values. A network's performance in these areas depends on its ability to extract richer boundary features. In this regard, the complete MultiPred model, with its multimodal fusion structure, demonstrates excellent forecasting capability for light and moderate precipitation. This suggests that the multimodal fusion approach can effectively enhance the stability and accuracy of the prediction model.
Compared to the fifth row (PredRNN++), it is evident that the complete MultiPred model exhibits a slight improvement in predictive capabilities. This indicates that adding the multimodal spatial encoding module and the spatial regression module can better capture correlations between various meteorological data and perform feature fusion, thereby improving the model’s predictive accuracy.
Compared to TrajGRU and MIM, the complete MultiPred model shows improved index values and predictive performance, suggesting that the model has certain advantages in multi-factor precipitation forecasting compared to traditional spatiotemporal prediction models.
6. Discussion
Currently, mainstream deep learning-driven short-term precipitation forecasting models commonly fail to effectively integrate multimodal data. Therefore, we conducted an in-depth study on the organic integration of a multimodal fusion structure with spatiotemporal prediction models, leading to the proposal of a spatiotemporal prediction model based on early fusion, namely MultiPred. In MultiPred, we achieved outstanding experimental performance by integrating precipitation data and reanalysis data, in line with previous research by Sun et al. [38], who studied the enhancement of downscaled coarse-resolution precipitation forecasts using satellite precipitation data and reanalysis data. Analyzing the experimental results, we observed that the MultiPred model exhibits poorer predictive performance in the continuous 3 h and 4 h forecasts when not using ERA5 data. This suggests that in longer-range precipitation forecasting, relying solely on historical precipitation information, without considering other meteorological modalities in the region, poses significant challenges.
Furthermore, the visual analysis shows that, compared to ConvLSTM, the MultiPred model produces clearer boundaries in its predictions and captures light and moderate precipitation more sensitively. This aligns with the viewpoint of Ma et al. [28] regarding the enhancement of model accuracy through multimodal fusion structures. Finally, short-term precipitation forecasting models commonly face the challenge of insufficient accuracy in precipitation datasets in practical applications, especially when predicting heavy precipitation from satellite precipitation data. This aligns with Kumar et al.'s findings [39] about the inadequacy of satellite data resolution. Despite the success of deep learning models in improving spatial feature extraction, datasets with higher spatial resolution are still needed. Therefore, in our next research step, we plan to utilize higher-resolution radar data to address this issue.
7. Conclusions
This study introduces a model named MultiPred for predicting precipitation in a target area. The model emphasizes a fusion structure of multimodal data and a combination of spatiotemporal prediction models. In the experiments, we utilized the ERA5 dataset and corresponding precipitation data from the eastern coastal region of China, conducting six sets of model control experiments to forecast continuous precipitation in the target area for the next 4 h. Through quantitative evaluation of multiple indicators and visualization of results, it can be concluded that the MultiPred model excels in both light and heavy precipitation forecasting tasks, demonstrating significantly superior performance compared to other methods. This validates the effectiveness of the multimodal fusion structure in precipitation forecasting, and empirical evidence confirms that the use of ERA5 data further enhances prediction accuracy.
To further enhance short-term precipitation forecasting performance, there is considerable room for improvement in the MultiPred model, primarily in two aspects: in terms of model structure, one can consider introducing more complex network structures or improving existing ones to enhance predictive performance; in terms of data, efforts can be made to add more input features or improve data quality to enhance model input information and prediction accuracy. For instance, introducing more meteorological observation data, geographical information data, or other relevant data could increase the model’s understanding and judgment capabilities in precipitation forecasting.