1. Introduction
Rainfall forecasting is of substantial research significance and practical relevance in the fields of meteorology and hydrology. As a fundamental component of disaster prevention and mitigation frameworks, accurate rainfall forecasting not only supports decision-making in agricultural irrigation scheduling, urban flood management, and water resources operations but also constitutes the scientific basis for developing flood early warning systems and emergency response strategies [1]. The prevailing forecasting methodology, numerical weather prediction (NWP) [2], formulates atmospheric dynamic equations and assimilates multi-source observational data, exhibiting notable advantages in weather prediction across time scales spanning several hours to multiple days. However, the application of NWP in short-term heavy rainfall prediction (0–2 h, referred to as precipitation nowcasting [3]) is hindered by two major challenges: first, the intricate spatial heterogeneity and pronounced temporal non-stationarity of precipitation considerably exacerbate the complexity of traditional model development; second, the substantial computational expense and extended iteration cycle of NWP impede its ability to fulfill the real-time requirements of nowcasting [4]. With the exponential growth of meteorological observational datasets and transformative advancements in GPU-driven parallel computing, deep learning (DL)-based data-driven methodologies have opened new avenues for improving the accuracy of short-term heavy rainfall forecasting by uncovering the spatiotemporal evolution patterns within historical meteorological records [5].
In recent years, deep learning models, particularly those based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have exhibited remarkable capability in capturing the spatiotemporal evolution of precipitation systems by leveraging end-to-end learning to establish nonlinear mappings between historical radar echo sequences and meteorological fields [
6,
7,
8]. Unlike traditional numerical weather prediction methods, which depend on parameterized physical equations, these data-driven models leverage self-supervised learning to extract latent physical correlations from vast historical observational datasets—including Doppler radar reflectivity, wind field profiles, and satellite cloud imagery—thereby enabling probabilistic forecasting of future precipitation fields. Shi et al. [
9] introduced a convolutional recurrent neural network-based approach, ConvLSTM, which harnesses the long short-term memory (LSTM) mechanism to effectively capture both the motion dynamics of rainfall and variations in rainfall intensity, thereby surpassing other methodologies in precipitation forecasting. Expanding on this foundation, Wang et al. [
10] developed PredRNN, an advanced network integrating a novel spatiotemporal LSTM (ST-LSTM) unit, which not only preserves spatiotemporal information but also facilitates inter-layer interactions among memory states from different LSTM units. PhyDNet [
11] explicitly disentangles physical knowledge, specifically partial differential equation (PDE) dynamics, from latent unknown information, incorporating a novel physical unit (PhyCell) to enforce PDE constraints within the latent space. This approach has exhibited exceptional performance in video prediction tasks, particularly in addressing missing data and enhancing long-term forecasting capabilities. Ma et al. [
12] employed a spatial local attention memory (SLAM) module to capture spatial dependencies within meteorological data while integrating a temporal difference memory (TDM) module to extract temporal variations. By coupling these mechanisms with PredRNN, the model effectively enhances the spatiotemporal dependencies of radar observations.
Although sequence modeling approaches, particularly RNNs, have shown promise in meteorological time-series analysis, their inherent limitations continue to hinder practical applicability. Specifically, RNN-based models are prone to vanishing and exploding gradients when training deep networks [13], a problem that is particularly pronounced when capturing the nonlinear dynamical evolution of hourly-scale precipitation systems and that limits their ability to learn long-term temporal dependencies. Recently, the UNet architecture [14] has been widely adopted in precipitation forecasting tasks, demonstrating strong predictive performance. To further optimize this architecture, SmaAt-UNet [15] incorporates the convolutional block attention module (CBAM) [16] and depthwise separable convolution (DSC), maintaining precipitation nowcasting performance while reducing model parameters and computational resource consumption. More importantly, SmaAt-UNet mitigates the gradient explosion issue commonly observed in traditional deep networks, thereby enhancing training stability and overall effectiveness. However, the UNet architecture was not originally designed for time-series forecasting, and as a result, certain variants exhibit limited predictive accuracy when applied to such problems [13,14,17,18,19].
To overcome these bottlenecks, researchers have begun exploring Transformer-based alternatives [
20], which effectively alleviate gradient degradation through the synergistic design of residual connections and layer normalization, while leveraging the global interaction capability of self-attention [
21] to model cross-spatiotemporal feature associations in meteorological fields. Bai et al. [
22] introduced Rainformer, a precipitation nowcasting model that integrates Swin Transformer [
23] and CNN. By utilizing fully convolutional networks (FCNs), Rainformer effectively reduces the risk of gradient explosion and enhances the accuracy of high-intensity rainfall forecasting. Earthformer [
24], leveraging the innovative cuboid attention mechanism and global vector design, not only addresses computational efficiency challenges but also enhances spatiotemporal modeling capabilities, demonstrating significant potential in Earth system forecasting. Wang et al. [
25] proposed a hierarchical architecture named STPF-Net, based on recurrent neural networks, aiming to address the limitations of traditional NWP models in terms of computational efficiency and prediction accuracy for short-term precipitation. This model introduces a layered temporal encoding strategy—comprising high and low temporal resolution modules—to mitigate error accumulation in long-term forecasting. In addition, it incorporates Swin Transformer modules to capture large-scale spatial contextual information. Bojesomo et al. [
26] introduced a Video Swin Transformer (VST) model based on shifted window cross-attention. Employing an encoder–decoder architecture, the model progressively integrates multi-scale spatiotemporal features, enabling the efficient modeling of 8 h weather forecasts. Compared to conventional models, it achieves significantly improved parameter efficiency. Xiong et al. [
27] developed a Spatiotemporal Feature Fusion Transformer (STFFT) for precipitation nowcasting. By leveraging a multi-head squared self-attention mechanism (MHSFFA) and a cross-feature feed-forward network (CAFFFN), the model enables dynamic interaction modeling of radar echo sequences. Liu et al. [
28] innovatively integrated the Swin Transformer with the UNet architecture to develop a post-processing framework for numerical weather prediction (NWP). By fusing fundamental meteorological variables such as temperature and humidity with satellite-based precipitation observations, their approach significantly enhanced the prediction accuracy of severe convective precipitation events. Ji et al. [
29] designed the EDH-STNet model, marking the first application of the Swin-UNet architecture to spatiotemporal forecasting of the evaporation duct height (EDH). This encoder–decoder-based framework integrates multi-source hydro-meteorological parameters, such as sea surface temperature, latent heat flux, and wave height, while utilizing the Swin Transformer’s window-based self-attention mechanism to capture global spatial dependencies and temporal evolution features. Piran et al. [
30] proposed a precipitation forecasting method based on generative Transformer models using composite radar data from multiple meteorological radars in South Korea. Their model effectively predicts future precipitation patterns and helps reduce the risk of catastrophic weather events caused by heavy rainfall. Collectively, these studies demonstrate that Transformer-based architectures, through their global attention mechanisms and advanced feature fusion strategies, are overcoming the limitations of traditional numerical methods and CNN/RNN models in modeling extreme weather events.
Against this backdrop, we propose a novel precipitation nowcasting model, SwinNowcast, designed for high-resolution gridded precipitation forecasting within a 30 min lead time. The input to our model consists of precipitation maps, which are radar images representing accumulated rainfall over a specific period. SwinNowcast is primarily constructed using the multi-scale feature balancing module (M-FBM), which comprises a local multi-scale feature extraction unit, a global feature extraction unit, and a gated attention feature fusion unit (GAFFU). The local multi-scale feature extraction unit focuses on capturing localized features of small- to moderate-intensity precipitation events. The global feature extraction unit, implemented as a Swin Transformer, employs a window-based multi-head self-attention (W-MSA) mechanism to capture global dependencies in precipitation data. Because global and local features differ significantly in numerical magnitude, balancing the two remains a critical challenge in precipitation forecasting. To address this challenge, we designed GAFFU, whose gating mechanism generates two forget matrices that independently regulate the contributions of local and global features. The resulting weighted fusion mitigates the numerical discrepancy between the two feature types and enables the model to process multi-scale feature representations more stably. Adaptive weighted fusion of this kind across feature levels has been widely recognized in various research domains. For example, Yang et al. [31] introduced the adaptive feature pyramid network (AFPN), which incorporates an adaptive spatial fusion module to address the feature imbalance issue in traditional feature pyramid networks (FPNs) during multi-level feature fusion, thereby enhancing feature aggregation capability and overall model performance. Furthermore, the gating mechanism in GAFFU helps prevent the gradient explosion that arises in traditional methods from excessive emphasis on a particular feature type, thereby improving training stability.
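The gated fusion idea described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each of the two gates is a sigmoid of a weighted sum of the local and global feature values at the same position; the function name `gated_fusion` and the scalar weights `w_local` and `w_global` are hypothetical and are not taken from the SwinNowcast implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, w_local=(1.0, 1.0), w_global=(1.0, 1.0)):
    """Fuse two equally sized feature vectors with two independent gates.

    Hypothetical parameterization: each gate is a sigmoid of a weighted sum
    of the local and global value at the same position.
    """
    fused = []
    for loc, glo in zip(local_feat, global_feat):
        gate_local = sigmoid(w_local[0] * loc + w_local[1] * glo)     # scales the local branch
        gate_global = sigmoid(w_global[0] * loc + w_global[1] * glo)  # scales the global branch
        fused.append(gate_local * loc + gate_global * glo)
    return fused


# Local and global features on very different numerical scales.
local_feat = [0.2, -0.1, 0.4]
global_feat = [12.0, -9.0, 15.0]
fused = gated_fusion(local_feat, global_feat)
```

Because both gates lie strictly between 0 and 1, neither branch can be amplified beyond its input magnitude, which is one plausible reading of how a gated design keeps a large-magnitude branch from dominating the fused representation.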
The primary contributions of this study are summarized as follows:
We propose a novel precipitation nowcasting model, SwinNowcast, which independently extracts global and local features from precipitation data and effectively fuses them, thereby enhancing the model’s predictive capability across precipitation events of varying intensities;
We integrate multi-scale feature extraction units with global feature extraction units to enhance the model’s ability to perceive precipitation events at different scales. This integration enables the model to simultaneously extract critical features across multiple scales, thereby capturing spatiotemporal dependencies in precipitation data more comprehensively;
We propose a novel gated attention feature fusion unit (GAFFU), which addresses the imbalance between global and local features through a gating mechanism. GAFFU effectively integrates complementary information from different scales, thereby improving the effectiveness of feature representation.
The proposed SwinNowcast demonstrates significant performance improvements over six state-of-the-art (SOTA) models on publicly available precipitation datasets.
4. Experimental Results and Analysis
To assess the effectiveness of the model, we compared SwinNowcast against six state-of-the-art methods on the KNMI data subset, evaluating the predictions against ground observation data through both qualitative visual inspection and quantitative metrics. Specifically, we independently trained six models (UNet, ConvLSTM, PhyDNet, SmaAt-UNet, PredRNN [40], and Rainformer) and, for each, selected the version with the lowest validation loss during training as its “best version”. Subsequently, we employed these best models to compute the previously introduced evaluation metrics on the test set. In the evaluation of precipitation maps, we selected 0.5, 5, and 10 mm/h as representative thresholds. The threshold of 0.5 mm/h is used to assess the model’s ability to capture light rainfall, while 5 and 10 mm/h correspond to moderate and heavy rainfall events, respectively. However, high-intensity rainfall events are relatively rare in the dataset. We analyzed the pixel-level rainfall distribution in both the training and test sets and found that pixels with rainfall ≥10 mm/h account for only 0.30% of the training set (8,695,139 pixels) and 0.41% of the test set (3,138,573 pixels). Additionally, pixels within the 5–10 mm/h range comprise only 1.71% of the training set (approximately 48,899,083 pixels) and 1.84% of the test set (approximately 14,288,821 pixels). In contrast, pixels with rainfall between 0.5 and 5 mm/h make up the majority of effective precipitation samples, accounting for 36.37% of the training set and 31.43% of the test set. Furthermore, more than 60% of the pixels fall below the 0.5 mm/h threshold, indicating no or extremely light rainfall. This highly imbalanced distribution across rainfall intensities, particularly the scarcity of heavy rainfall samples, partially explains the performance degradation observed at higher thresholds (5 and 10 mm/h), as the model receives insufficient supervision to learn accurate representations for rare events.
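The pixel-level distribution analysis above amounts to counting pixels per intensity class. A minimal sketch follows. The half-open class boundaries (a pixel at exactly 5 mm/h counts as 5–10 mm/h) and the function name `intensity_shares` are assumptions, since the text does not state how boundary values were assigned.

```python
def intensity_shares(pixels_mm_per_h):
    """Return the fraction of pixels in each rainfall-intensity class."""
    if not pixels_mm_per_h:
        raise ValueError("at least one pixel is required")
    counts = {"<0.5": 0, "0.5-5": 0, "5-10": 0, ">=10": 0}
    for value in pixels_mm_per_h:
        if value >= 10.0:
            counts[">=10"] += 1
        elif value >= 5.0:
            counts["5-10"] += 1
        elif value >= 0.5:
            counts["0.5-5"] += 1
        else:
            counts["<0.5"] += 1
    total = len(pixels_mm_per_h)
    return {name: count / total for name, count in counts.items()}


# Toy sample with two pixels per class.
sample = [0.0, 0.2, 0.5, 3.0, 5.0, 9.9, 10.0, 25.0]
shares = intensity_shares(sample)

# Share below 0.5 mm/h implied by the percentages reported in the text.
train_below = 100.0 - (36.37 + 1.71 + 0.30)
test_below = 100.0 - (31.43 + 1.84 + 0.41)
```

The reported shares imply that about 61.62% of training pixels and 66.32% of test pixels fall below 0.5 mm/h, consistent with the "more than 60%" stated above.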
4.1. Quantitative Comparison
As shown in Table 1, the proposed SwinNowcast model demonstrates clear advantages across multiple key metrics in the quantitative evaluation of precipitation forecasting. Specifically, at the 0.5 mm/h threshold it achieves a critical success index (CSI) of 0.7494 and a Heidke skill score (HSS) of 0.3868, representing relative improvements of 1.23% and 1.15%, respectively, over the second-best model, Rainformer. Notably, at the 5 mm/h and 10 mm/h heavy precipitation thresholds, SwinNowcast attains CSI scores of 0.3731 and 0.1955, reflecting improvements of 14.5% and 6.5% over the traditional ConvLSTM architecture and 11.4% and 3.5% over the PredRNN temporal prediction model. These results indicate a consistent enhancement in modeling extreme weather events.
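For readers who want to reproduce scores of this kind, the sketch below computes CSI, FAR, and HSS from a thresholded contingency table using the standard forecast-verification definitions. The paper's own metric definitions sit in a section not reproduced here, so the formulas, the `>=` threshold comparison, and the function names are assumptions. The last line recomputes the relative CSI improvement over Rainformer from the two values quoted in this section (0.7494 and 0.7403).

```python
def contingency(pred, obs, threshold):
    """Count hits, false alarms, misses, and correct negatives at a threshold."""
    tp = fp = fn = tn = 0
    for p, o in zip(pred, obs):
        p_rain, o_rain = p >= threshold, o >= threshold
        if p_rain and o_rain:
            tp += 1
        elif p_rain:
            fp += 1
        elif o_rain:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(pred, obs, threshold):
    """Return (CSI, FAR, HSS); a score is NaN when its denominator is zero."""
    tp, fp, fn, tn = contingency(pred, obs, threshold)
    nan = float("nan")
    csi = tp / (tp + fn + fp) if (tp + fn + fp) else nan
    far = fp / (tp + fp) if (tp + fp) else nan
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    hss = 2 * (tp * tn - fn * fp) / hss_den if hss_den else nan
    return csi, far, hss


# Toy rain rates in mm/h for six pixels.
obs = [0.0, 0.6, 6.0, 12.0, 0.2, 0.7]
pred = [0.1, 0.8, 4.0, 11.0, 0.9, 0.3]
csi, far, hss = scores(pred, obs, threshold=0.5)

# Relative improvement of one score over another, in percent.
rel_gain = (0.7494 - 0.7403) / 0.7403 * 100
```

On the toy data the 0.5 mm/h table has three hits, one false alarm, one miss, and one correct negative. The relative gain evaluates to roughly 1.23%, matching the figure quoted for CSI.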
Regarding forecast reliability, SwinNowcast exhibits a systematic reduction in the false alarm rate (FAR): at 0.5/5/10 mm/h precipitation thresholds, its FAR values are 0.1574, 0.3131, and 0.4742, respectively. This represents an average reduction of 47% compared to UNet-based models and a 27.6% decrease relative to the physics-constrained model PhyDNet. This improvement can be attributed to the model’s dynamic attention mechanism, which effectively suppresses the propagation of non-physical precipitation signals.
A threshold-wise comparison reveals deeper advantages: in the operationally critical moderate-to-heavy precipitation forecasting scenario (5 mm/h threshold), SwinNowcast’s CSI score (0.3731) surpasses Rainformer by 74%, while its HSS (0.2675) outperforms PredRNN by 8.6%. Particularly, in the 10 mm/h extreme precipitation scenario, the model maintains an HSS of 0.1627 while keeping the FAR at 0.4742, which is 49.3% lower than SmaAt-UNet. This “high detection–low false alarm” characteristic is highly valuable for disaster warning applications.
A comparative analysis of existing models highlights three major limitations: static architectures such as UNet exhibit limited spatiotemporal modeling capabilities, as reflected by their low HSS of 0.0929 at the 5 mm/h threshold; recurrent models like ConvLSTM and PredRNN, constrained by localized receptive fields, accumulate long-range spatiotemporal modeling errors up to 23.7%; Rainformer, while achieving a high CSI of 0.7403 in weak precipitation scenarios (0.5 mm/h), lacks physical process modeling, leading to a 41.2% performance drop in heavy precipitation cases. To address these challenges, this study introduces a hierarchical spatiotemporal attention mechanism based on Swin Transformer, which enables decoupled multi-scale meteorological feature modeling. This architecture maintains short-term forecast accuracy (FAR = 0.1574 at 0.5 mm/h) while significantly enhancing forecast stability for extreme precipitation events by reducing variance by 37.2% at the 10 mm/h threshold. These findings establish a new architectural paradigm for deep learning-based meteorological modeling.
To comprehensively analyze the model’s performance at different forecast lead times, Table 2, Table 3, and Table 4 present a detailed evaluation of prediction scores at 10 min, 20 min, and 30 min lead times. As the forecast horizon extends, precipitation prediction becomes more challenging, leading to a general decline in CSI scores and HSSs while FARs tend to increase. However, comparisons across multiple methods indicate that SwinNowcast consistently maintains higher CSI scores and HSSs across all lead times while achieving the lowest or near-lowest FAR in the majority of cases. These findings suggest that, despite the increasing complexity of extended forecast horizons, the proposed model retains high predictive accuracy and stability.
Table 2, Table 3, and Table 4 further indicate that traditional convolutional neural network models (e.g., UNet) and temporal forecasting methods (e.g., ConvLSTM, PredRNN, PhyDNet) exhibit competitive performance at shorter lead times (e.g., 10 min). However, as the forecast period lengthens, their overall performance deteriorates more noticeably. Meanwhile, the attention-based Rainformer model exhibits competitive performance under specific precipitation thresholds, but its predictive stability diminishes at extended lead times and higher thresholds (e.g., 10 mm/h). In contrast, SwinNowcast consistently demonstrates strong adaptability and maintains a lower false alarm rate across both short-term and extended lead times. This suggests that the proposed enhanced network architecture and multi-scale attention mechanism effectively capture the spatiotemporal evolution of precipitation.
Overall, across both the full test set evaluation and the detailed analyses at different forecast lead times, SwinNowcast consistently attains higher CSI scores and HSSs while maintaining a lower FAR in most cases, highlighting its superior ability to characterize precipitation features and ensure stable predictive performance. This sustained superior performance is primarily attributed to the proposed multi-scale feature balancing module (M-FBM). This module comprises three key components: the multi-scale convolutional block attention module (MSCBAM), the Swin Transformer, and the gated attention feature fusion unit (GAFFU). Through the extraction and fusion of local and global spatiotemporal features, it enables effective modeling and balancing of the multi-scale evolution of precipitation systems. Specifically, MSCBAM enhances local feature extraction, improving the model’s capability to analyze precipitation intensity variations and texture details. The Swin Transformer leverages window self-attention (W-MSA) and shifted window self-attention (SW-MSA) to capture long-range spatiotemporal dependencies. GAFFU dynamically integrates local and global features via a gated attention mechanism, enhancing predictive accuracy and stability. Owing to this innovative module, SwinNowcast exhibits superior predictive performance across both short-term and extended lead-time forecasting scenarios. Its reduced FAR further underscores its strong adaptability and robustness in handling varying precipitation intensities and mitigating noise interference.
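The difference between W-MSA and SW-MSA lies in how positions are grouped before attention is computed. The sketch below illustrates only that grouping on a toy 4 × 4 grid. It follows the original Swin Transformer design (non-overlapping windows, then a cyclic shift by half the window size), omits the attention computation and the masking of wrapped-around positions, and uses hypothetical function names.

```python
def window_partition(grid, window):
    """Split an H x W grid into non-overlapping window x window patches."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([grid[top + i][left + j]
                            for i in range(window) for j in range(window)])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` cells, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [[grid[(i + shift) % height][(j + shift) % width]
             for j in range(width)] for i in range(height)]


# Cells are numbered 0..15 row by row so the grouping is easy to read.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, window=2)                   # W-MSA grouping
shifted = window_partition(cyclic_shift(grid, 1), window=2)  # SW-MSA grouping
```

In the regular partition, cells 5 and 6 fall in different windows; after the shift they share one, which is how alternating the two groupings lets information cross window borders.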
Beyond prediction accuracy, we evaluated the practical efficiency and complexity of the models by quantifying their parameter counts, computational costs (GFLOPs), and inference time per sample, as detailed in Table 5. The results demonstrate that SwinNowcast maintains reasonable computational complexity and inference speed while preserving superior predictive performance. Its inference is significantly more efficient than that of temporal modeling approaches such as PhyDNet and PredRNN, while its complexity remains substantially lower than that of the parameter-heavy Rainformer. This indicates that the proposed network architecture improves precipitation forecasting accuracy without substantially increasing computational overhead, demonstrating favorable deployment potential and practical applicability. For high-frequency and large-scale precipitation forecasting in particular, SwinNowcast achieves a favorable accuracy–efficiency balance, making it well suited to operational scenarios demanding both precision and computational efficiency.
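Inference time per sample can be measured in several ways, and the text does not describe the protocol used. The following is a minimal sketch of one common approach, averaging wall-clock time over many samples after an untimed warm-up. The `dummy_model` placeholder and the helper name `mean_inference_time` are hypothetical.

```python
import time


def mean_inference_time(predict, samples, warmup=1):
    """Average wall-clock seconds per sample for a prediction callable."""
    for sample in samples[:warmup]:
        predict(sample)  # warm-up calls are not timed
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    return (time.perf_counter() - start) / len(samples)


def dummy_model(frame):
    """Placeholder standing in for a trained network's forward pass."""
    return [0.9 * value for value in frame]


frames = [[float(i)] * 64 for i in range(100)]
seconds_per_sample = mean_inference_time(dummy_model, frames)
```

When the forward pass runs on a GPU, execution is typically asynchronous, so the device would also need to be synchronized before each clock reading for the measurement to be meaningful.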
4.2. Qualitative Comparison
This study conducts a qualitative analysis of SwinNowcast’s performance in 30 min precipitation forecasting. Figure 6 illustrates the predicted results of various models over six future time steps (each with a 5 min interval, totaling 30 min). Each row represents a different model, while each column corresponds to a specific forecast time step. This visualization allows for an intuitive comparison of the models’ ability to capture the temporal evolution of precipitation patterns. Additionally, Figure 7 presents a side-by-side comparison of the forecasted precipitation at the 6th time step (i.e., 30 min ahead) against the ground truth. This comparison highlights differences in spatial distribution, intensity prediction, and boundary details, providing further insight into each model’s ability to reconstruct precipitation features accurately.
As evidenced by the stepwise predictions in
Figure 6, SwinNowcast demonstrates exceptional performance in precipitation forecasting refinement and spatial consistency. Its predictions (T = 4–T = 6) exhibit remarkable morphological agreement with ground truth precipitation fields, characterized by well-defined and continuous rainband edges, particularly maintaining high fidelity in the positional accuracy and intensity distribution of intense precipitation cells. In contrast, while UNet captures broad precipitation trends during short-term forecasting (T = 1–T = 3), its predictive capability significantly degrades with extended lead times (T = 4–T = 6), manifesting substantial blurring effects where local convective cells become excessively smoothed into amorphous regions, resulting in indistinguishable precipitation boundaries. ConvLSTM shows limitations in modeling the temporal evolution of dynamic precipitation systems, with discontinuous precipitation regions and unphysical diffusion patterns observed in specific sequences (e.g., T = 5), indicating inadequate spatial correlation modeling. PhyDNet preserves the overall morphology of precipitation systems through physical constraints but exhibits temporal lag in predicting rapidly moving rainbands (e.g., eastward migration from T = 3 to T = 4), leading to approximately 30–50 km positional offsets. SmaAt-UNet’s lightweight architecture introduces systematic underestimation (15–20%) of intense precipitation cores and fails to resolve small-scale convective structures (e.g., localized convection at T = 2). Although PredRNN and Rainformer enhance temporal continuity through sophisticated sequence modules, they demonstrate delayed response during convective initiation phases (T = 1–T = 2), resulting in reconstruction lag and occasional spurious precipitation artifacts (e.g., northwest corner at T = 6). 
SwinNowcast’s integration of hierarchical local attention with global meteorological field representation significantly improves the interpretability of multi-scale precipitation structures while reducing historical state dependency, thereby maintaining physical plausibility and spatial detail integrity throughout extended forecasting periods.
As clearly demonstrated in
Figure 7, the disparities in modeling intense precipitation cores and peripheral rainbands become increasingly pronounced across models at the final prediction step (T = 6). UNet exhibits extensive weakening of precipitation coverage with near-complete dissipation of localized convective cells, significantly reducing the contrast between core precipitation areas and background fields. ConvLSTM continues to exhibit fragmented precipitation boundaries and discontinuous diffusion patterns at this stage, revealing persistent deficiencies in capturing late-stage rainband evolution. Although PhyDNet maintains relatively complete rainband morphology, residual positional lag persists in rapidly eastward-moving precipitation systems, indicating insufficient adaptability of its physical constraints to fast-changing meteorological fields. SmaAt-UNet perpetuates its systematic underestimation of precipitation intensity, displaying notably weaker magnitude compared to ground truth observations, with small-scale rainband features being almost entirely smoothed out. While PredRNN and Rainformer show marginal improvements in spatiotemporal continuity, localized artifacts remain observable (e.g., isolated spurious echoes in the northwestern sector), reflecting incomplete modeling of marginal precipitation zones during extended sequence predictions. SwinNowcast consistently maintains superior precipitation resolution and core intensity preservation at T = 6, featuring sharply defined and continuous rainband edges that achieve heightened consistency with ground truth observations. This performance underscores the methodology’s enhanced capability through multi-scale attention mechanisms for forecasting complex precipitation systems.
4.3. Ablation Experiment
To assess the contributions of key components in the SwinNowcast model, we conducted ablation experiments to examine the effects of the following modules:
MSCBAM (multi-scale convolutional block attention module): Enhances local feature representation by leveraging a multi-scale attention mechanism, refining the fine-grained details in precipitation predictions.
GAFFU (gated attention feature fusion unit): Employs a gated attention mechanism to dynamically integrate local and global features, optimizing feature fusion for improved representation.
Inception (a multi-scale feature extraction module) [
41]: Adopts an inception-inspired structure for multi-scale feature extraction, capturing diverse spatial resolutions.
GFU (an alternative feature fusion unit): Serves as a replacement for GAFFU, providing an alternative approach to feature integration.
By systematically removing or substituting these modules, we assessed their influence on precipitation prediction accuracy and false alarm rate. As presented in Table 6, the ablation experiments indicate that the full SwinNowcast configuration (MSCBAM + GAFFU) achieves the best performance in precipitation forecasting. The main conclusions are as follows:
GAFFU exerts the most significant influence on prediction accuracy: The removal of GAFFU results in a decline in CSI and HSS, underscoring its crucial role in feature fusion. Notably, retaining GAFFU alone (without MSCBAM) still yields near-optimal performance, emphasizing GAFFU’s predominant impact.
MSCBAM enhances local feature representation: While the removal of MSCBAM has a relatively minor effect on overall performance, the full model still attains marginally higher CSI scores and HSSs than the GAFFU + inception combination, indicating that local feature enhancement remains beneficial for improving accuracy.
The combined approach delivers the best results: Utilizing MSCBAM or GAFFU independently does not attain optimal performance, whereas their integration offers superior feature extraction and fusion capabilities, ensuring precise local feature capture and effective global information synthesis.
GFU serves as a viable alternative but is slightly inferior to GAFFU: Substituting GAFFU with GFU marginally lowers the false alarm rate but also results in a slight decline in CSI and HSS, suggesting that GAFFU is more proficient in integrating global and local features for precipitation forecasting.
In summary, the MSCBAM + GAFFU combination represents the optimal configuration for SwinNowcast, achieving high prediction accuracy while minimizing false alarms, thereby offering robust support for high-precision short-term precipitation forecasting.
5. Conclusions and Discussion
This study introduces SwinNowcast, a high-precision short-term precipitation nowcasting model built upon the Swin Transformer. The primary contribution of this work is the novel design of the multi-scale feature balancing module (M-FBM), which fuses multi-scale local features with global spatiotemporal contextual information, facilitating precise modeling of precipitation system evolution. Experimental results reveal that SwinNowcast consistently surpasses existing approaches in key evaluation metrics, including CSI and HSS, while demonstrating enhanced robustness in reducing the false alarm rate (FAR). Specifically, the MSCBAM strengthens the representation of local structural features, such as precipitation cores and rainband boundaries, by leveraging multi-scale dilated convolutions and an attention mechanism. Meanwhile, the window self-attention mechanism in the Swin Transformer efficiently models the spatiotemporal dynamics of large-scale meteorological systems, capturing complex phenomena such as frontal movements and convective mergers. Furthermore, the GAFFU module employs a gating mechanism to dynamically regulate the fusion weights between local and global features, preventing global information from excessively overshadowing local details while attenuating local noise interference. This local–global collaborative optimization framework enables the model to capture the nonlinear variations in precipitation intensity, exhibiting distinct advantages in forecasting moderate to heavy precipitation events.
In real-world applications, SwinNowcast offers substantial value for extreme weather early warning systems. For example, in urban flood management, its 30 min high-resolution forecasts give emergency response teams critical lead time, facilitating optimized drainage operations and traffic control measures. Likewise, in flash flood warning scenarios, the model’s ability to capture heavy precipitation events (e.g., achieving a CSI of 0.1955 at the 10 mm/h threshold) enhances disaster warning reliability and mitigates the risk of missed detections. However, transitioning from laboratory research to operational deployment presents several practical challenges, including the need for real-time processing, high computational resource dependency, and reliance on high-performance GPU hardware. To facilitate deployment in edge computing environments, future work should focus on model compression and computational efficiency improvements. Furthermore, the training data for the current model are predominantly sourced from the Netherlands, and the model’s generalizability to regions with distinct topographical and climatic characteristics remains to be validated. The insufficient representation of extreme events may further introduce prediction biases in rare heavy rainfall scenarios, highlighting the need for improved dataset diversity and expanded coverage of extreme weather events in future research.
While SwinNowcast has achieved notable advancements in short-term precipitation forecasting, its methodology still presents certain limitations. These include challenges in modeling extreme heavy rainfall events (>50 mm/h), constrained scalability for extended forecasting horizons (e.g., 1–2 h), and the inherent difficulty in interpreting Transformer-based self-attention mechanisms, which may hinder trust in the model’s decision-making within the meteorological community. Furthermore, the local–global collaborative modeling framework proposed in this study not only proves valuable for precipitation forecasting but also exhibits strong potential for cross-domain transfer learning. For example, it can be leveraged in traffic flow prediction to capture localized congestion patterns alongside global road network dynamics, in air quality forecasting to integrate physical diffusion principles with data-driven models, and in energy demand forecasting to analyze the spatiotemporal correlations between meteorological factors and electricity consumption trends. Future research should prioritize advancements in model efficiency, multi-source and multimodal data integration, the incorporation of physical constraints, computational optimization, and uncertainty quantification to further enhance model performance and real-world applicability. With advancements in computational power and the expansion of interdisciplinary collaborations, SwinNowcast and similar architectures hold great promise as foundational technologies in the construction of smart city disaster resilience systems.