Article

Leveraging Recurrent Neural Networks for Flood Prediction and Assessment

by
Elnaz Heidari
1,
Vidya Samadi
2,3 and
Abdul A. Khan
1,*
1
The Glenn Department of Civil Engineering, Clemson University, Clemson, SC 29634, USA
2
Department of Agricultural Sciences, Clemson University, Clemson, SC 29634, USA
3
Artificial Intelligence Research Institute for Science and Engineering (AIRISE), School of Computing, Clemson University, Clemson, SC 29634, USA
*
Author to whom correspondence should be addressed.
Hydrology 2025, 12(4), 90; https://doi.org/10.3390/hydrology12040090
Submission received: 18 March 2025 / Revised: 10 April 2025 / Accepted: 14 April 2025 / Published: 16 April 2025
(This article belongs to the Section Water Resources and Risk Management)

Abstract

Recent progress in Artificial Intelligence and Machine Learning (AIML) has accelerated improvements in the prediction performance of many hydrological processes. Yet, flood prediction remains a challenging task due to its complex nature. Two common challenges afflicting the task are flood volatility and the sensitivity and complexity of flood generation attributes. This study explores the application of Recurrent Neural Networks (RNNs)—specifically Vanilla Recurrent Neural Networks (VRNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU)—in flood prediction and assessment. By integrating catchment-specific hydrological and meteorological variables, the RNN models leverage sequential data processing to capture the temporal dynamics and seasonal patterns characteristic of flooding. These models were employed across diverse terrains, including mountainous watersheds in the state of South Carolina, USA, to examine their robustness and adaptability. To identify significant hydrological events for flash flood analysis, a discharge frequency analysis was conducted using the Pearson Type III distribution. The 1-year and 2-year return period flows were estimated based on this analysis, and the 1-year return flow was selected as a conservative threshold for flash flood event identification to ensure a sufficient number of training instances. Comparative benchmarking with the National Water Model (NWM v3.0) revealed that the RNN-based approaches offer notable enhancements in capturing the intensity and timing of flood events, particularly for short-duration and high-magnitude floods (flash floods). Comparison of predicted discharges with the discharge recorded at the gauges revealed that GRU had the best performance, as it achieved the highest mean NSE values and exhibited low variability across diverse watersheds. LSTM results were slightly less consistent than those of the GRU while still achieving satisfactory performance, proving its value in hydrological forecasting. In contrast, VRNN had the highest variability and the lowest NSE values among the three. The NWM model trailed the machine learning-based models. The study highlights the efficacy of the RNN models in advancing hydrological predictions.

1. Introduction

Floods have become increasingly frequent and severe on a global scale, posing significant threats to human lives, infrastructure, and economies. According to [1], who utilized data from the Emergency Events Database [2], over 175,000 lives were lost and approximately 2.2 billion people were directly affected by floods over a span of 27 years. These figures are likely underestimated due to unreported events [3]. Furthermore, the United Nations Office for Disaster Risk Reduction [4] identified floods as the most common natural disaster, claiming 606,000 lives from 1995 to 2015, injuring or otherwise affecting 4.1 billion people, and causing estimated financial losses exceeding $30 billion.
The increasing frequency and impact of floods underscore the urgent need for improved prediction and management strategies. Over the past several decades, deterministic [5] and physics-based models [6,7,8] have been extensively applied in various environmental systems. While these approaches have advanced understanding and awareness of flood dynamics, they also reveal limitations. Many of these traditional flood models, particularly physically explicit models, are constrained by their dependence on specific numerical approximations and rigid spatial parameterizations, limiting their adaptability across diverse hydrological contexts [9]. Moreover, these models are usually computationally expensive.
Recent advancements in Artificial Intelligence and Machine Learning (AIML) have demonstrated significant promise in flood modeling applications. Studies such as [3] and [10] highlight the efficacy of AIML methodologies in enhancing predictive accuracy and addressing the limitations of traditional approaches. These methods offer the potential to serve as core drivers of flood modeling, as evidenced by their successful implementation in diverse contexts [11,12,13]. In addition, ref. [14] introduced an innovative AI model called Bagging-LMT, combining bagging ensembles and Logistic Model Trees (LMT) for mapping flood susceptibility. This shift toward AIML-driven frameworks represents a critical evolution in flood prediction, offering a more adaptive, data-driven, and scalable solution to one of the most pressing global challenges.
Early applications of intelligent methods, such as those by [15,16], demonstrated the potential of machine learning (ML) for flood prediction. Recent advancements in deep learning, particularly in deep neural networks (DNNs), have further enhanced predictive capabilities across various spatial and temporal scales, regions, and hydrological processes [10,11,17,18,19,20]. Nevo et al. [3] pioneered the use of DNNs in Google’s flood forecasting system, operational during the 2021 monsoon season in India and Bangladesh, which issued over 100 million flood alerts using models such as Long Short-Term Memory (LSTM) for stage forecasting. Similarly, ref. [21] integrated hybrid machine learning approaches to generate flood susceptibility maps, while [22] combined deep learning with physical models and transfer learning for flood prediction in Japan. More recently, ref. [23] demonstrated that AI-based forecasting can reliably predict extreme river events in ungauged regions up to five days in advance, exceeding the accuracy of traditional models for one-year return period floods. In another case, ref. [20] optimized LSTM hyperparameters using a Genetic Algorithm for flash flood prediction in Taiwan.
Despite these successes, challenges persist. DNNs face difficulties in capturing the spatial variability and magnitude of extreme events over time (anomalous events), conducting parameter sensitivity assessments to understand the influence of datasets and algorithmic structures, and producing physically plausible or explainable predictions for short-duration, high-intensity events. To address these issues, this study leverages advanced Recurrent Neural Networks (RNNs), including LSTM, Gated Recurrent Units (GRU), and Vanilla RNN (VRNN), with a focus on temporal dependencies, sensitivity analysis, and feature composition and engineering. The analysis in this paper explores the impact of different feature compositions on predictive performance and examines various scenarios based on utilizing all the flood-related meteorological and hydrological features or only specific combinations of these features. RNNs are used to capture temporal dependencies within data across different spatiotemporal scales. Among the advanced AIML techniques applied in flood prediction, LSTM and GRU are key components, offering significant advantages in capturing temporal dependencies within data. LSTM, designed to address the vanishing gradient problem, introduces specialized gates, including an input gate, an output gate, and a forget gate, allowing selective retention and forgetting of information over long sequences. Researchers have verified LSTM’s effectiveness in modeling complex temporal patterns in various domains [24]. On the other hand, GRU has recently gained popularity due to its efficiency and simpler architecture compared to LSTM. GRU features two gates, an update gate and a reset gate, which regulate information flow. GRUs have fewer parameters and are computationally efficient, reaching convergence more rapidly during training [25]. Despite their simplicity, GRUs can retain information over longer sequences, making them valuable for modeling complex hydrological processes.
This research aims to achieve the following:
  • Incorporate diverse datasets, including meteorological and hydrological data, into an AIML-based pipeline for sub-daily scale flood prediction.
  • Address the spatial and temporal variability of flooding across multiple regions.
  • Benchmark RNNs’ performance against observational data and the National Water Model (NWM v3.0) reanalysis.
  • Assess the sensitivity and generalizability of RNNs across different scales and eco-physical contexts.
  • Evaluate the impact of feature selection and engineering on predictive accuracy under various scenarios, from full feature utilization to selective combinations of inputs.
To summarize, this paper presents a benchmark study for data-driven flood prediction, highlighting the variability of predictions, the importance of feature selection, and the implications of model architecture choices. By examining rainfall-runoff mechanisms across various regions, this study offers insights into flood generation and hydrological processes at multiple spatiotemporal scales. The findings aim to mitigate the scarcity of accurate flood forecasting information while enhancing model sensitivity and reliability. This study is organized as follows: Section 1 introduces the study’s objectives and contextual background. Section 2 details the study area, data processing, and algorithm structures, followed by the training methodology for robust and accurate predictions. Section 3 explores the hyperparameter tuning results, the strengths of RNNs in flood forecasting, and the sensitivity assessment. Finally, Section 4 concludes with key findings and implications for advancing flood prediction research.

2. Methodology

2.1. Study Area

The study area encompasses four distinct catchments located in the northwest and west parts of the state of South Carolina (Figure 1), each defined by its drainage area and culminating in a United States Geological Survey (USGS) streamflow gauge. All catchments are located in mountainous regions, with one of them draining an urbanized area in Greenville, SC, offering a perspective on hydrological processes in densely developed settings. The remaining catchments represent more natural landscapes, including Stevens Creek in the eastern Piedmont, Twelve Mile Creek in the northern part of the state, and the Chattooga River in the northwest, which also serves as a boundary waterway. These catchments collectively provide a balanced representation of urban and natural hydrological systems, forming the basis for analyzing regional water dynamics.

2.2. Meteorological Forcing Data

The North American Land Data Assimilation System, Version 2 (NLDAS-2), provides high-resolution, hourly atmospheric and surface forcing data specifically designed to support hydrological and environmental modeling applications. NLDAS-2 incorporates multiple data assimilation techniques and observational datasets to generate accurate, temporally and spatially consistent meteorological inputs that are crucial for flood prediction and related studies [26].
For this study, the NLDAS-2 forcing dataset was employed due to its high resolution (spatial resolution of 0.125° and hourly temporal resolution) and extensive coverage over the contiguous United States. The dataset provides a comprehensive suite of meteorological variables, including precipitation, temperature, potential evapotranspiration (PET), specific humidity, shortwave and longwave radiation, and wind components (Table 1). These variables serve as essential inputs for modeling hydrological processes and capturing the dynamic nature of rainfall-runoff mechanisms, which are critical for flood forecasting.
NLDAS-2 combines observational datasets such as radar-based precipitation estimates, satellite observations, and reanalysis products from the National Centers for Environmental Prediction (NCEP). This integration ensures that the data is both accurate and representative of regional hydrometeorological conditions. In addition, NLDAS-2 forcing data is widely recognized for its utility in operational hydrology and has been validated extensively against independent observations [26,27]. In this study, NLDAS-2 data was utilized to capture the spatial and temporal variability of meteorological conditions across diverse eco-physical regions, including mountainous watersheds. The high-resolution temporal data allowed for detailed modeling of hourly flood prediction scenarios, while the spatial resolution (0.125°) ensured that key local and regional meteorological drivers of flooding were accurately represented. By leveraging NLDAS-2 forcing data, this study ensures a robust and reliable foundation for training and validating the predictive capabilities of RNNs. The integration of these meteorological inputs enables the models to better capture the variability and intensity of extreme events, contributing to the overall goal of improving flood prediction accuracy and sensitivity across different regions and seasons.
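As an illustration of how such forcing data can be accessed, the sketch below opens NLDAS-2 files with xarray; the file pattern, bounding box, and variable names (e.g., Rainf, Tair, PotEvap) are assumptions about a typical local data layout rather than this study's actual workflow.

```python
# Hedged sketch: load NLDAS-2 hourly forcing with xarray.
# File pattern and variable names are assumptions; they depend on how the
# NLDAS-2 files were downloaded (e.g., NetCDF-4 files from NASA GES DISC).
import xarray as xr

ds = xr.open_mfdataset("nldas2/NLDAS_FORA0125_H.*.nc", combine="by_coords")

# Subset to a bounding box around the study catchments (illustrative bounds).
region = ds.sel(lat=slice(33.5, 35.5), lon=slice(-83.5, -81.0))

# Keep the forcing variables used as model inputs.
forcing = region[["Rainf", "Tair", "PotEvap", "Qair",
                  "SWdown", "LWdown", "Wind_E", "Wind_N"]]
```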

2.3. USGS Discharge

For this study, streamflow data were obtained from the United States Geological Survey (USGS), a trusted and widely used source for hydrological observations across the United States. The USGS streamflow data consists of flow rates recorded at 15-min intervals, providing the high temporal resolution crucial for capturing dynamic hydrological events, such as flash floods and peak flow scenarios. To align with the hourly temporal resolution of the NLDAS-2 forcing data used in this research, the 15-min flow data were aggregated into hourly time steps. This conversion was achieved by selecting the maximum flow rate within each hour, ensuring that the peak flow events were preserved while maintaining consistency with meteorological inputs.
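As a concrete sketch of this aggregation step (the file and column names are illustrative placeholders, not the study's actual data layout):

```python
# Aggregate 15-min USGS discharge to hourly maxima with pandas.
# File and column names are hypothetical placeholders.
import pandas as pd

q = pd.read_csv("usgs_02164000_15min.csv", parse_dates=["datetime"],
                index_col="datetime")

# Taking the within-hour maximum preserves flood peaks that an
# hourly mean would smooth away.
q_hourly = q["discharge_cfs"].resample("1h").max()
```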
The use of maximum flow data offers several advantages, including long-term coverage and high accuracy in streamflow measurements. These attributes make it an essential dataset for validating and training hydrological models. In this study, the converted hourly maximum flow data were integrated with the NLDAS-2 forcing data to capture the rainfall-runoff dynamics across different catchments. The consistency in temporal resolution between the datasets enhances the reliability of the model outputs, enabling a more precise representation of flood generation mechanisms. In addition, the use of maximum flow data ensures that the model’s predictions remain robust under extreme conditions, such as high-intensity rainfall events or sudden snowmelt, which are critical for real-time flood forecasting. The combined dataset provides a comprehensive framework for evaluating the performance of RNNs in predicting floods across diverse hydrological and climatological settings.

2.4. RNN Algorithms

Jordan [28] introduced the Jordan network based on the theory of parallel distributed processing. Each hidden-layer node in the Jordan network connects to a state unit to enable delayed input, with the logistic function serving as the activation function. This network employs the BackPropagation (BP) algorithm for learning and extracts phonetic features of a given syllable during testing [29,30]. Subsequently, ref. [31] introduced the Elman network, the first fully connected RNN. Both the Jordan and Elman networks form recursive connections from a single-layer feedforward neural network, leading them to be termed Simple Recurrent Networks (SRNs) [32]. Concurrently with the emergence of SRNs, the learning theory of RNNs also progressed. Following the introduction of the BP algorithm, researchers began attempting to train RNNs within the BP framework. Williams and Zipser [33] proposed real-time recurrent learning for RNNs. Later, ref. [29] developed the BackPropagation Through Time (BPTT) algorithm. The fundamental concept behind RNNs is to leverage sequential information. In contrast to traditional neural networks, which regard inputs as independent units (an often flawed assumption; for example, predicting the next word in a sequence benefits from knowing the preceding words [34]), RNNs account for the temporal sequence of inputs, rendering them appropriate for tasks that involve sequential data [35]. RNNs utilize a looping mechanism to perform identical operations on each element within a sequence, where the current computation relies on both the present input and the outcomes of prior computations [36].
In this study, we incorporated three well-known variants of RNNs, namely the Vanilla RNN, LSTM, and GRU, which are described in the following subsections. The full workflow of the proposed RNN pipeline for flood prediction is provided in Figure 2.

2.4.1. Vanilla RNN

A Vanilla RNN, also known as a simple RNN, is the most basic form of RNN and was introduced in the 1980s [29,32,37]. It is designed to handle sequential data by maintaining a form of memory through its recurrent connections. This memory enables the network to capture temporal dependencies in the data, making it suitable for tasks such as time series prediction, language modeling, and sequence labeling. The core idea of a Vanilla RNN is to process a sequence of inputs $x_t$ at each time step $t$ and to maintain a hidden state $h_t$ that captures information from previous time steps. The hidden state is updated using the current dynamic input and the previous hidden state. The mathematical formulation of a Vanilla RNN is given by the following equation:
$h_t = \tanh(W_h x_t + U_h h_{t-1} + b_h)$
where
$h_t$: the hidden state and output at time step $t$. It represents the memory of the network, capturing information from the sequence up to time step $t$.
$x_t$: the dynamic input features at time step $t$.
$h_{t-1}$: the hidden state at the previous time step.
$W_h$: the weight matrix for the dynamic input features.
$U_h$: the weight matrix for the hidden state. It determines how the previous hidden state $h_{t-1}$ influences the current hidden state $h_t$.
$\tanh$: the hyperbolic tangent activation function. It introduces non-linearity into the network, enabling it to learn complex patterns, and outputs values between −1 and 1.
$b_h$: the bias vector.
In a Vanilla RNN, the hidden state $h_t$ is updated at each time step based on the previous hidden state $h_{t-1}$ and the current input $x_t$. This recursive process allows the network to maintain a dynamic memory that evolves as it processes the sequence. The $\tanh$ activation function ensures that the hidden state can capture non-linear relationships in the data. The hidden state $h_t$ inherently serves as the output at each time step; there is no separate output-layer computation, and the hidden state itself is interpreted as the output (see Figure 2a).
While Vanilla RNNs are effective for modeling short-term dependencies in sequential data, they can struggle with long-term dependencies due to issues such as vanishing and exploding gradients during training. Despite these limitations, they form the foundational concept upon which more advanced recurrent architectures, such as LSTM networks and GRUs, are built.
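To make the recurrence concrete, the following minimal PyTorch sketch implements the update equation above as a standalone cell; it is written directly from the formula, not from the authors' code, and the dimensions are illustrative.

```python
# Minimal Vanilla RNN cell implementing h_t = tanh(W_h x_t + U_h h_{t-1} + b_h).
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_h = nn.Linear(input_size, hidden_size, bias=True)    # W_h x_t + b_h
        self.U_h = nn.Linear(hidden_size, hidden_size, bias=False)  # U_h h_{t-1}

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W_h(x_t) + self.U_h(h_prev))

# Unroll over a batch of sequences shaped (batch, seq_len, n_features).
cell = VanillaRNNCell(input_size=9, hidden_size=32)
x = torch.randn(16, 12, 9)      # e.g., a 12-h input window of 9 forcing features
h = torch.zeros(16, 32)
for t in range(x.size(1)):
    h = cell(x[:, t, :], h)     # h_t doubles as the output at each step
```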

2.4.2. LSTM

LSTM networks were introduced by [24] and are a type of RNN particularly well suited to learning from sequences of data. They are designed to capture long-term dependencies and mitigate the vanishing gradient problem, a common issue with traditional RNNs [38,39]. Rather than suffering from vanishing or exploding gradients during backpropagation, LSTM networks allow error signals to propagate across many time steps. This is achieved through the use of memory cells [40], the key distinction from Vanilla RNNs, enabling LSTM networks to effectively manage tasks that require recalling information from events that occurred many time steps earlier.
In LSTM networks, the interaction of the cell state, hidden state, input gate, forget gate, and output gate orchestrates the model’s ability to manage long-term dependencies in sequential data. The cell state acts as a memory carrier, retaining information over time, and is modulated by the gates. The forget gate determines the extent to which the previous cell state is preserved, selectively forgetting parts of the information. The input gate regulates the incorporation of new information into the cell state, allowing relevant data to update the memory. Simultaneously, the input and forget gates ensure that only pertinent information is retained. The hidden state, representing the output of the LSTM cell at each time step, is influenced by the cell state and the output gate. The output gate determines the portion of the cell state that contributes to the hidden state, thus controlling the information flow to subsequent cells in the network (Figure 2b displays the configuration of the LSTM). This interplay between the gates and states enables LSTMs to effectively capture temporal dependencies and maintain a robust memory of past inputs. The subsequent forward propagation equations are presented below:
Forget Gate
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
Input Gate
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
Candidate Cell State
$\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
Cell State Update
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Output Gate
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
Hidden State Update and Output
$h_t = o_t \odot \tanh(C_t)$
where
$f_t$: the forget gate at time step $t$, which controls the retention of the previous cell state $C_{t-1}$.
$x_t$: the dynamic input vector at time step $t$.
$\sigma$: the sigmoid activation function, which outputs values between 0 and 1, used to gate the flow of information.
$\tanh$: the hyperbolic tangent activation function for computing the candidate cell and hidden states.
$i_t$: the input gate at time step $t$, which determines the amount of new information added to the cell state.
$o_t$: the output gate at time step $t$, which controls the portion of the cell state exposed as the hidden state.
$W_i, W_f, W_c, W_o$: weight matrices for the input-to-gate connections.
$U_i, U_f, U_c, U_o$: weight matrices for the hidden state-to-gate connections.
$b_i, b_f, b_c, b_o$: bias vectors for the gates.
$\tilde{C}_t$: the candidate cell state at time step $t$, representing new information to be potentially added to the cell state.
$C_t$: the cell state at time step $t$, acting as the long-term memory.
$C_{t-1}$: the previous cell state.
$h_t$: the updated hidden state and output at time step $t$.
$\odot$: the element-wise (Hadamard) product.
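As an illustration, the sketch below transcribes these gate equations into a standalone PyTorch cell, including the output gate; it mirrors the formulas rather than the authors' implementation, and in practice torch.nn.LSTM provides the same recurrence in optimized form.

```python
# LSTM cell written directly from the gate equations above (illustrative only;
# torch.nn.LSTM implements the same recurrence far more efficiently).
import torch
import torch.nn as nn

class LSTMCellFromEquations(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        def gate():  # one W x_t + b term and one U h_{t-1} term per gate
            return (nn.Linear(input_size, hidden_size),
                    nn.Linear(hidden_size, hidden_size, bias=False))
        self.W_f, self.U_f = gate()   # forget gate
        self.W_i, self.U_i = gate()   # input gate
        self.W_c, self.U_c = gate()   # candidate cell state
        self.W_o, self.U_o = gate()   # output gate

    def forward(self, x_t, h_prev, c_prev):
        f_t = torch.sigmoid(self.W_f(x_t) + self.U_f(h_prev))   # forget
        i_t = torch.sigmoid(self.W_i(x_t) + self.U_i(h_prev))   # input
        c_tilde = torch.tanh(self.W_c(x_t) + self.U_c(h_prev))  # candidate
        c_t = f_t * c_prev + i_t * c_tilde                      # cell update
        o_t = torch.sigmoid(self.W_o(x_t) + self.U_o(h_prev))   # output gate
        h_t = o_t * torch.tanh(c_t)                             # hidden/output
        return h_t, c_t
```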

2.4.3. GRU

GRU is a type of RNN architecture introduced by [25]. It aims to address some of the shortcomings of traditional RNNs, such as the vanishing gradient problem. GRUs are a streamlined version of LSTM networks, designed to handle sequential data more effectively. They achieve this by using gating mechanisms to regulate the flow of information, which simplifies the model and reduces the number of parameters, resulting in faster training and greater efficiency. The GRU network combines the hidden state and cell state into a single state vector and uses two main gating mechanisms: the update gate $z_t$ and the reset gate $r_t$. These gates regulate the input, memory, and output processes in a streamlined manner compared to the three gates in LSTMs (input, forget, and output gates). Figure 2c highlights the architectural layout of the GRU.
Update Gate
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
Reset Gate
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
Candidate Hidden State
$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
Hidden State Update and Output
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
where
$z_t$: the update gate at time step $t$, which controls how much of the previous hidden state is retained and how much of the new candidate hidden state is used.
$r_t$: the reset gate at time step $t$, which determines how much of the previous hidden state is forgotten when computing the new candidate hidden state.
$\tilde{h}_t$: the candidate hidden state at time step $t$, computed using the current input and the reset-gate-modified previous hidden state.
$h_t$: the hidden state at time step $t$, which combines the previous hidden state and the new candidate hidden state according to the update gate.
$x_t$: the dynamic input vector at time step $t$.
$W_z, W_r, W_h$: weight matrices for the input-to-gate connections.
$U_z, U_r, U_h$: weight matrices for the hidden state-to-gate connections.
$b_z, b_r, b_h$: bias vectors for the gates.
$\sigma$: the sigmoid activation function, which outputs values between 0 and 1, used for the update and reset gates.
$\tanh$: the hyperbolic tangent activation function, used for computing the candidate hidden state.
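Because torch.nn.GRU implements exactly this update/reset-gate recurrence, a compact discharge forecaster can be sketched as follows; the layer sizes and dropout rate are illustrative placeholders, not the tuned values reported later in Table 3.

```python
# Sketch of a GRU-based next-hour discharge predictor using torch.nn.GRU.
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    def __init__(self, n_features: int, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, num_layers,
                          batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden_size, 1)  # final hidden state -> discharge

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        out, _ = self.gru(x)
        return self.head(out[:, -1, :])        # predict the next time step
```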

2.5. NWM Model

The NWM Version 3.0, developed by the National Oceanic and Atmospheric Administration (NOAA), is a comprehensive hydrological modeling framework that provides high-resolution water resource forecasts as well as reanalysis (typically 1 km or less) for the contiguous United States. The NWM integrates meteorological inputs, weather forecasts, radar precipitation, soil moisture, snowpack, and land surface conditions to simulate streamflow, soil moisture, and other hydrological variables [41]. It operates on a continuous basis and is designed to deliver short-term, medium-term, and long-term forecasts, making it a valuable tool for flood prediction, water resource management, and emergency response planning [42]. NWM v3.0 builds upon its predecessor by incorporating significant advancements in hydrological modeling, making it a more robust tool for water resource prediction. Predictions in NWM v3.0 are generated using the Weather Research and Forecasting Hydrological model (WRF-Hydro) [43], which resolves hydrological processes across a range of spatial and temporal scales. However, v3.0 introduces enhanced physics-based parameterizations, updated calibration methodologies, and improved data assimilation techniques, increasing its accuracy and scalability. The model simulates streamflow at over 2.7 million river reaches, maintaining its fine-grained representation of water dynamics. Moreover, v3.0 features an upgraded ensemble prediction system that incorporates additional uncertainty quantification methods for meteorological and hydrological processes. These improvements make NWM v3.0 better suited for benchmarking against advanced machine learning models, especially in capturing extreme events and regional hydrological variability. Despite these advancements, challenges remain in localized prediction accuracy, which advanced ML models like GRU and LSTM may address more effectively.
For this study, NWM streamflow reanalysis data was used as a baseline for benchmarking the results of the RNNs. By comparing the RNN outputs with NWM data, we aimed to evaluate the efficacy of data-driven approaches in capturing complex hydrological phenomena and their ability to outperform traditional physics-based models under specific scenarios. To facilitate the retrieval and integration of NWM data, a custom Python script was developed to interact with the NOAA NWM S3 bucket (Figure 2d). This script automates the process of querying, downloading, and preprocessing NWM v3.0 reanalysis data for specific river gauges and time periods. Integrating the NWM data through this Python interface streamlines the benchmarking process, ensuring that the model evaluation is both accurate and efficient. This approach not only validates the performance of the RNNs but also demonstrates the potential for combining physics-based and machine learning approaches to improve flood prediction capabilities.
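A hedged sketch of such a retrieval script is shown below. The bucket name, Zarr store path, and feature_id are assumptions for illustration and should be verified against NOAA's Open Data registry; the authors' actual script may differ.

```python
# Hedged sketch of retrieving NWM v3.0 retrospective streamflow from AWS.
# The bucket name, Zarr path, and feature_id are assumptions for illustration;
# verify them against NOAA's Open Data registry before use.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)  # the NWM archive is publicly readable
store = s3fs.S3Map("noaa-nwm-retrospective-3-0-pds/CONUS/zarr/chrtout.zarr", s3=fs)
ds = xr.open_zarr(store, consolidated=True)

# Select the river reach mapped to a USGS gauge and a time window of interest.
reach = ds["streamflow"].sel(feature_id=9731454,                # hypothetical ID
                             time=slice("2015-01-01", "2020-12-31"))
nwm_hourly = reach.to_dataframe()
```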

2.6. Data Preprocessing and Storm Event Identification

Effective data preprocessing is critical for ensuring the quality and consistency of inputs used in hydrological modeling. In this study, two primary datasets were preprocessed: NLDAS-2 forcing data and USGS discharge data. This section outlines the preprocessing steps undertaken to harmonize these datasets and prepare them for integration into the RNN models.
The NLDAS-2 dataset, a high-resolution gridded dataset with a spatial resolution of 0.125° and an hourly temporal resolution, was used to capture meteorological forcing variables such as precipitation, temperature, and PET. Fortunately, the dataset was complete with no missing values, eliminating the need for imputation. For catchments whose boundaries intersected multiple NLDAS grid cells (see Figure 1), a weighted average/sum was computed to derive representative values for each variable. The weights were determined based on the proportion of the catchment area falling within each grid cell. This spatial aggregation ensured that the meteorological forcings were accurately aligned with the catchment areas, providing a robust basis for modeling flood dynamics.
The USGS discharge data, recorded at 15-min intervals, required multiple preprocessing steps to align it with the temporal resolution of the NLDAS-2 data. First, the maximum flow within each hour was extracted to convert the data to an hourly time step, preserving critical peak flow information essential for flood prediction. A small number of missing values were present in the USGS dataset. These were addressed using a simple max imputation method, where missing values were filled with the maximum of the flow values immediately before and after the gap. This approach maintained the temporal continuity of the dataset while minimizing potential distortions in the data. In addition, the discharge values were normalized by dividing them by the catchment area to convert the data into flow per unit area (m³/s/m²). This step ensured that the RNN models could process data from multiple catchments simultaneously without being influenced by differences in catchment size. This normalization made the dataset more consistent and enabled more robust model training across diverse catchments.
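A condensed sketch of these steps (gap filling, grid-cell weighting, and area normalization) is given below with synthetic values; the weights and catchment area are hypothetical.

```python
import numpy as np
import pandas as pd

# Hourly discharge with a gap (synthetic values).
idx = pd.date_range("2020-07-01", periods=6, freq="1h")
q_hourly = pd.Series([3.0, 4.5, np.nan, 6.0, 5.2, 4.1], index=idx)

# "Max imputation": fill each gap with the larger of the nearest valid
# values immediately before and after it.
q_filled = q_hourly.fillna(
    pd.concat([q_hourly.ffill(), q_hourly.bfill()], axis=1).max(axis=1))

# Area-weighted aggregation of the NLDAS cells intersecting the catchment;
# weights are the catchment-area fractions in each grid cell (hypothetical).
weights = np.array([0.55, 0.30, 0.15])
precip_cells = np.random.rand(len(idx), 3)   # stand-in for three NLDAS cells
precip_catchment = precip_cells @ weights

# Normalize discharge by catchment area so catchments can be pooled.
area_m2 = 3.1e8                              # hypothetical catchment area
q_per_area = q_filled / area_m2              # m^3/s per m^2
```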
Next, in order to identify significant hydrological events for flash flood analysis, threshold discharge values were determined based on return period flows. Specifically, the 1-year and 2-year return period flows were computed from a discharge frequency curve fitted with the Pearson Type III distribution. In accordance with conventional practice, the threshold discharge is conservatively defined as the bankfull flow, which is typically approximated by the 2-year return period flow. However, this study used the 1-year return period flow as the threshold to ensure a sufficient number of events for model training (see Table 2). Consequently, the resulting discharge value was established as the threshold for flash flood early warning in this study, ensuring a robust and practical criterion for event identification. Once the events were extracted, the baseflow values were also determined by identifying the minimum flow preceding each rainfall event, and these were included as an additional variable for the RNN models to account for the initial state of the system.
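A minimal sketch of the frequency analysis, assuming a Pearson Type III fit to annual peak discharges via scipy.stats.pearson3 (the peaks shown are synthetic, and the paper's exact fitting procedure may differ):

```python
import numpy as np
from scipy import stats

annual_peaks = np.array([120., 95., 210., 160., 340., 130., 180., 250., 110., 205.])

# Fit the three-parameter Pearson Type III distribution.
skew, loc, scale = stats.pearson3.fit(annual_peaks)

# A T-year return flow is the quantile at non-exceedance probability 1 - 1/T.
q_2yr = stats.pearson3.ppf(1 - 1 / 2, skew, loc=loc, scale=scale)

# A strict 1-year return period is degenerate for an annual-maximum series,
# so it is typically estimated from a sub-annual (partial-duration) series;
# the value below is purely illustrative.
q_1yr = stats.pearson3.ppf(1 - 1 / 1.05, skew, loc=loc, scale=scale)
```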
To prepare the data for integration into the RNN models, the processed NLDAS-2 and USGS datasets were synchronized to ensure consistent time steps and spatial alignment. The combination of meteorological forcings and normalized streamflow data provided a comprehensive representation of the rainfall-runoff processes across diverse catchments.

2.7. Model Training Procedure

The training of RNNs involved processing the prepared input data to capture temporal dependencies and the dynamics of flood events. The data, including meteorological forcing variables from NLDAS-2 and discharge (as well as baseflow) data from USGS, were preprocessed to ensure compatibility with the model’s requirements. The features were normalized to improve training stability and ensure that variables with larger numerical ranges did not dominate the optimization process [44,45]. The RNN models, including LSTM, GRU, and VRNN, were trained using time-sequential data to predict discharge at an hourly resolution. Training was performed on a sliding window of input sequences, where each window encompassed a specific time lag of past observations to predict the next timestep. The sliding window size varied between the minimum and maximum catchment times of concentration (2 and 12 h, respectively) calculated for the four selected catchments, ensuring the models effectively captured the hydrological response dynamics unique to each catchment. This approach enabled the models to learn patterns over varying temporal scales, capturing both short-term variability and long-term trends. Of the identified events, 183 were allocated for training, while 60 events each were reserved for validation and testing, ordered chronologically to preserve the temporal sequence of hydrological processes (an approximate 60:20:20 split; Table 2). A mean squared error (MSE) loss function was used to quantify prediction error during training, with optimization achieved using the adaptive moment estimation (Adam) optimizer [46]. Early stopping criteria were employed to prevent overfitting, halting training when the validation loss stopped improving for a predefined number of epochs. Hyperparameters such as the learning rate, batch size, and the number of layers and hidden units were fine-tuned based on validation performance. Finally, the trained models were evaluated on their ability to predict discharge across diverse catchments and meteorological conditions, ensuring generalizability and reliability in operational flood forecasting scenarios.
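A minimal sketch of this procedure, assuming the GRUForecaster sketched earlier and full-batch updates for brevity (the study used mini-batches and tuned settings):

```python
import torch
import torch.nn as nn

def make_windows(features: torch.Tensor, target: torch.Tensor, window: int):
    """Turn (T, F) features and (T,) target into sliding-window training pairs."""
    X = torch.stack([features[t:t + window] for t in range(len(features) - window)])
    y = target[window:].unsqueeze(-1)   # next-step discharge for each window
    return X, y

def train(model, X_tr, y_tr, X_val, y_val, lr=1e-3, patience=10, max_epochs=500):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    best_val, wait, best_state = float("inf"), 0, None
    for _ in range(max_epochs):
        model.train()
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            val = loss_fn(model(X_val), y_val).item()
        if val < best_val:              # keep the best weights seen so far
            best_val, wait = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            wait += 1
            if wait >= patience:        # early stopping on validation loss
                break
    model.load_state_dict(best_state)
    return model
```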

2.8. Performance Metrics

The primary objective of this research is to present a discharge forecasting model that relies on RNNs. To gauge the model’s effectiveness, we subject it to a rigorous evaluation using specific performance criteria. These criteria include the Relative Peak Error (RPE), Peak Time Error (PTE), and the Nash-Sutcliffe coefficient of efficiency (NSE). The purpose of employing these criteria is to assess the model’s accuracy in predicting peak flow rates, the precision of lead-time forecasts, and the overall shape of the discharge hydrograph. Each of these metrics provides a different perspective on model performance, helping to comprehensively evaluate the accuracy and reliability of predictive models. Below are the descriptions of these metrics along with their respective formulae.
The RPE metric measures the relative error between the observed and simulated peak discharge. A value closer to zero indicates a better estimation of peak discharge.
$RPE = \dfrac{Q_{peak}^{sim} - Q_{peak}^{obs}}{Q_{peak}^{obs}}$
where:
$Q_{peak}^{sim}$: the simulated discharge at the time at which the peak discharge actually occurs in the observed data.
$Q_{peak}^{obs}$: the observed discharge at the observed peak discharge time.
The PTE metric calculates the difference between the simulated peak time and the observed peak time, expressed in hours.
$PTE = t_{peak}^{sim} - t_{peak}^{obs}$
where:
$t_{peak}^{sim}$: the time at which the peak discharge is predicted by the model (simulated peak time).
$t_{peak}^{obs}$: the time at which the peak discharge actually occurs in the observed data.
Finally, NSE is a normalized statistic that determines the relative magnitude of the residual variance compared to the measured data variance. In other words, it shows how well the model’s predictions match the observed data by comparing the deviations of the predicted values from the observed values against the overall variability in the observed data. NSE is widely used to assess the predictive power of hydrological models.
$NSE = 1 - \dfrac{\sum_{i=1}^{n} \left( Q_{obs,i} - Q_{sim,i} \right)^2}{\sum_{i=1}^{n} \left( Q_{obs,i} - \bar{Q}_{obs} \right)^2}$
where:
$Q_{obs,i}$: the observed value at time step $i$
$Q_{sim,i}$: the simulated value at time step $i$
$\bar{Q}_{obs}$: the mean of the observed values
$n$: the number of observations
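Implemented directly from these definitions, the three metrics reduce to a few lines of NumPy (an illustrative sketch; note that RPE uses the simulated discharge at the observed peak time, as defined above):

```python
import numpy as np

def rpe(q_sim, q_obs):
    """Relative Peak Error: relative error at the observed peak time."""
    q_sim, q_obs = np.asarray(q_sim), np.asarray(q_obs)
    t_peak = np.argmax(q_obs)
    return (q_sim[t_peak] - q_obs[t_peak]) / q_obs[t_peak]

def pte(q_sim, q_obs, dt_hours=1.0):
    """Peak Time Error: simulated minus observed peak time, in hours."""
    return (np.argmax(q_sim) - np.argmax(q_obs)) * dt_hours

def nse(q_sim, q_obs):
    """Nash-Sutcliffe efficiency: 1 is perfect; below 0 is worse than the mean."""
    q_sim, q_obs = np.asarray(q_sim), np.asarray(q_obs)
    return 1 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)
```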

3. Results and Discussion

3.1. Hyperparameters Tuning

Hyperparameter tuning played a pivotal role in optimizing the performance of the RNNs applied for flood prediction. It should be noted that RNNs are computationally expensive to train and difficult to parallelize. Typically, as the size of a network increases, it becomes more powerful and provides better prediction accuracy, but at the cost of significantly higher memory bandwidth requirements and computational expense. In addition, RNNs carry a large number of network parameters, requiring substantial storage and runtime memory. In our case, training was executed on graphical processing units (GPUs; NVIDIA A100), while inference tasks were completed on central processing units (CPUs). We used PyTorch [48], developed by Facebook, the Python descendant of the Lua-based Torch library [47]. PyTorch provides access to low-level controls of operators and loss functions, is easy to implement, and enjoys strong optimization support from NVIDIA’s fast GPU deep learning library cuDNN. All three models were trained and tested on the Clemson University Palmetto high-performance computing (HPC) system with a uniform set of hyperparameters and the same hardware configuration.
The selected hyperparameters included the learning rate, batch size, number of layers, number of hidden units, sequence length, and dropout rate. These parameters, as well as the choice of optimization algorithm, were systematically adjusted to achieve a balance between model complexity and generalization, minimizing overfitting while ensuring accurate predictions. A grid search approach was employed to explore combinations of hyperparameters within predefined ranges (Table 3), guided by the prior literature and the characteristics of the dataset. For instance, the learning rate was varied logarithmically between 10−5 and 10−2, while batch sizes in the range 8–256 were tested. The batch size represents the size of the data volume, namely, the number of training samples in an iteration; the model updates the network parameters, weights, and biases by calculating the error between the observed and simulated output values of the sample set. The number of recurrent layers and hidden units ranged from 1 to 3 and 2 to 128, respectively, to evaluate model depth and capacity. Dropout rates between 0.1 and 0.5 were incorporated to mitigate overfitting, particularly for models trained on smaller datasets. One critical hyperparameter was the sequence length, which determined the time lag of past observations used to predict the next timestep. This choice was particularly challenging, as it needed to reflect the average time of concentration across all catchments to ensure applicability across the models. To address this, a sliding window of input sequences was implemented, with the window size varying between the minimum and maximum times of concentration calculated for the four selected catchments. The optimal value for this hyperparameter was selected as the maximum of the catchments’ times of concentration to account for both small and large catchments. This approach ensured the models effectively captured the hydrological response dynamics unique to each catchment.
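The grid search itself can be sketched as a simple loop over the Cartesian product of these ranges; train_and_validate() is a hypothetical wrapper around the training routine that returns a validation NSE, and the grids shown are a coarse subset of the ranges in Table 3.

```python
import itertools

grid = {
    "lr": [1e-5, 1e-4, 1e-3, 1e-2],
    "batch_size": [8, 32, 128, 256],
    "num_layers": [1, 2, 3],
    "hidden_units": [16, 64, 128],
    "dropout": [0.1, 0.3, 0.5],
}

best_cfg, best_nse = None, -float("inf")
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    val_nse = train_and_validate(cfg)   # hypothetical training wrapper
    if val_nse > best_nse:
        best_cfg, best_nse = cfg, val_nse
print(best_cfg, best_nse)
```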
The tuning process leveraged the validation set, with performance evaluated using metrics such as root mean squared error (RMSE) and Nash-Sutcliffe efficiency (NSE). We used early stopping for the number of epochs to prevent overfitting. Early stopping is an optimization technique widely used to reduce overfitting without compromising model accuracy. The main idea behind early stopping is to stop training before the model starts to overfit. To understand how it works, it is important to look at how the training and validation errors change with the number of epochs. Usually, the training error decreases steadily with increasing epochs until convergence. The validation error, however, initially decreases with increasing epochs, but after a certain point it starts increasing. This is the point at which training must stop, as beyond it overfitting degrades performance. The final hyperparameter settings were chosen based on the best trade-off between validation accuracy and computational efficiency (see Table 3). By iteratively fine-tuning these hyperparameters, the RNN models demonstrated improved convergence and stability during training, ultimately enhancing the reliability of discharge predictions across the diverse catchments and hydrological conditions considered in this study.

3.2. Feature Importance

Once we fine-tuned the hyperparameters, we investigated how variations in input data impact forecast quality across different models. Feature engineering is a critical step in enhancing the performance and interpretability of machine learning models. The initial set of variables included precipitation, PET, humidity, temperature, and wind components (Wind-U and Wind-V), as well as longwave and shortwave radiation (RLDS and RSDS). In addition, baseflow was included as a feature (replicated in all rows) by identifying the minimum discharge value before each rainfall event. This variable provided a critical indicator of antecedent hydrological conditions, enhancing the model’s ability to capture the relationship between initial states and flood responses. All other features were derived from NLDAS forcing data.
To identify the most impactful variables and streamline the input space, an iterative feature selection approach was employed. Initially, the models were trained using the full suite of variables to establish a baseline performance across all catchments. Following this, a leave-one-out strategy was applied (with replacement, i.e., each removed variable was restored before the next one was removed), where each variable was individually removed from the input set and the model’s performance was re-evaluated. The primary metric used for evaluation was the NSE. The analysis revealed that precipitation and baseflow were consistently the most influential variables, followed by temperature and the wind components (Wind-U and Wind-V). Potential evapotranspiration (PET), humidity, and the radiation variables (RLDS and RSDS) showed lower contributions to the model’s predictive performance (see Table 4). As all features contributed to the flood generation process (even if only marginally), we retained all variables as input features for RNN training and validation.
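In code, the leave-one-out loop looks roughly as follows; fit_and_score() is a hypothetical wrapper that trains a model on the given features and returns its validation NSE.

```python
features = ["precip", "baseflow", "temp", "wind_u", "wind_v",
            "pet", "humidity", "rlds", "rsds"]

baseline_nse = fit_and_score(features)            # full feature set as baseline

importance = {}
for f in features:
    reduced = [x for x in features if x != f]     # drop one variable at a time
    importance[f] = baseline_nse - fit_and_score(reduced)

# Larger NSE drops indicate more influential variables.
for f, drop in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{f}: dNSE = {drop:.3f}")
```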

3.3. Performance Assessment

Statistical analysis of the simulation results is documented as follows, wherein a suite of statistical measures was employed to facilitate comparison. Specifically, statistical measures such as the mean and median values, as well as the standard deviation, were utilized to offer a more nuanced understanding of the model performances. These measures allowed us to explore not only the central tendency but also the spread and distribution of performance metric values, providing a robust assessment of each model’s consistency and reliability.
The performance of the RNNs and the NWM was evaluated using the NSE, RPE, and PTE metrics across four USGS gauges, revealing differences in their effectiveness and consistency. For each gauging station, the NSE, RPE, and PTE were evaluated for all events using the observed and predicted discharges, and the mean, median, minimum, maximum, and standard deviation were calculated. Table 5 presents the statistics for the NSE values obtained from the RNN and NWM simulations across the different catchments during the test events.
At USGS gauge 02164000, GRU achieved the highest mean NSE (0.70) with low variability (standard deviation of 0.21) and a maximum NSE of 0.97, demonstrating its robust performance in handling sequential data. LSTM followed with a mean NSE of 0.62, showing strong but slightly less consistent results. The NWM (mean NSE of 0.46) and VRNN (mean NSE of 0.44) performed less effectively, with VRNN exhibiting high variability (standard deviation of 0.44) and a minimum NSE of −0.59, indicating challenges in stability and generalization. Similarly, at USGS gauge 02177000, LSTM led with a mean NSE of 0.74 and a median of 0.78, closely followed by GRU with a mean NSE of 0.72. Both models achieved high maximum NSEs of 0.91 and 0.96, respectively, showcasing their ability to capture hydrological event dynamics. The NWM performed moderately (mean NSE of 0.49), while VRNN struggled with a negative mean NSE (−0.56), high variability (standard deviation of 0.83), and a minimum NSE of −3.12, suggesting insufficient generalization under varying conditions due to its short memory.
At USGS gauge 02186000, GRU again performed best with a mean NSE of 0.66, followed by LSTM at 0.63. Both models exhibited high maximum NSEs (GRU: 0.91, LSTM: 0.92), highlighting their robustness. In contrast, the NWM and VRNN performed poorly, with mean NSEs of 0.08 and −0.23, respectively. VRNN’s negative median NSE (−0.10) and high variability (standard deviation of 0.33) underscored its limitations, while the NWM’s wide performance range (standard deviation of 0.77) reflected the challenges of adapting physics-based models to site-specific conditions. Finally, at USGS gauge 02196000, GRU again delivered the highest mean NSE (0.64), with consistent performance (standard deviation of 0.17) and a maximum NSE of 0.86. LSTM followed with a mean NSE of 0.59, while the NWM (mean NSE of 0.35) and VRNN (−0.01) lagged behind. VRNN exhibited substantial variability (standard deviation of 0.89) and a minimum NSE of −2.98, further emphasizing its instability.
Overall, GRU achieved the highest overall mean NSE of 0.70, followed by LSTM with 0.65. NWM had a moderate overall mean NSE of 0.34, and VRNN underperformed with a mean NSE of −0.12. GRU demonstrated the lowest variability (standard deviation of 0.17), indicating consistent performance, while VRNN had the highest variability (standard deviation of 0.75), reflecting significant fluctuations in performance. These findings highlight the superiority of advanced machine learning models like GRU and LSTM in flood prediction tasks, outperforming the traditional physics-based NWM. The results also suggest that VRNN requires further optimization and model structure updates to improve its reliability and generalizability for hydrological modeling.
To assess the models’ accuracy in predicting peak flow rates, the peak bias assessment was conducted using the absolute values of RPE to calculate the statistics, ensuring a consistent evaluation of model performance (Table 6).
Overall, GRU and LSTM showed the best performance across all catchments, with mean absolute RPE biases of 0.15 (15%) each. GRU exhibited the least variability in performance (STDEV = 0.14), followed closely by LSTM (STDEV = 0.16), reflecting their ability to deliver consistent peak flow predictions. The NWM displayed moderate performance, with a mean RPE bias of 0.25 and variability of 0.22, while VRNN had the highest mean bias of 0.39 and the largest variability (STDEV = 0.24), indicating less reliable performance. Examining specific catchments highlights the variation in model performance. For instance, at USGS gauge 02164000, GRU achieved a mean absolute RPE bias of 0.15, outperforming both the NWM (RPE = 0.24) and VRNN (RPE = 0.34). Similarly, at USGS gauge 02177000, LSTM delivered the lowest mean RPE of 0.18, demonstrating its effectiveness in predicting peak flows under diverse conditions. In contrast, VRNN’s mean RPE of 0.46 at this gauge reflects significant challenges in capturing peak flow dynamics. The superior performance of GRU and LSTM can be attributed to their ability to capture temporal and non-linear patterns in the data effectively, which is critical for predicting extreme hydrological events. The NWM, grounded in physics-based modeling, showed reasonable performance but lacked the precision of the advanced data-driven models. The suboptimal performance of the RNNs in simulating multi-peak floods in some of the events primarily stems from input insufficiency, particularly for rare and complex flood patterns, as well as limitations in temporal resolution that hinder the models’ ability to capture rapid shifts in flow dynamics. While systematic biases may play a role, these factors significantly constrain the RNNs’ capacity to represent multi-scale temporal features, especially in data-scarce scenarios. In addition, while the LSTM and GRU models mostly performed well in estimating peak flows, often capturing the magnitude accurately or even slightly overestimating it, some underestimations still occurred, particularly during complex multi-peak events. Nonetheless, from an early warning perspective, such overestimation is generally more favorable than underprediction, especially for flash floods in mountainous regions where timely alerts are critical. The issue of peak flow underestimation warrants careful attention, as it poses significant challenges for early warning and risk prevention, particularly in the context of flash floods.
Finally, the PTE was assessed to evaluate the accuracy of each model in predicting the timing of peak flows, measured in absolute deviations from the observed peak time (Table 7).
Overall, LSTM and GRU demonstrated the best total performance, with mean PTE absolute biases of 0.90 and 0.97, respectively. GRU exhibited slightly lower variability (STDEV = 0.90) compared to LSTM (STDEV = 0.92), indicating consistent performance across catchments. VRNN had the highest mean PTE of 2.79, with significant variability (STDEV = 1.89), while NWM displayed a moderate mean PTE of 1.51, though with higher variability (STDEV = 1.49). Catchment-specific results highlight similar trends (see Table 7). The total performance results demonstrate that LSTM and GRU models, leveraging their ability to capture temporal relationships effectively, are better suited for peak timing predictions. VRNN’s higher errors and variability due to its short memory suggest challenges in generalization or tuning, especially under varying hydrological conditions. NWM, while moderately effective, relies on deterministic principles that may not fully account for the nuances in timing errors across catchments.
To further illustrate model performance, hydrographs were generated for three sample events per catchment, allowing a detailed comparison of observed and simulated streamflow across events (Figure 3). In addition to the hydrographs, a comprehensive table summarizing the three key metrics was provided for these sample gauges (Table 8). Unlike the earlier analyses, the RPE and PTE values in this table include their signed values, capturing both positive and negative biases. This presentation highlights not only the magnitude but also the directional tendencies of the models in predicting peak flow and timing, offering further insight into their strengths and limitations. The selected gauges and associated metrics provide a focused evaluation of the models’ performance within each catchment, serving as representative examples of the broader analysis.
At the catchment scale, analyzing input–output relationships involves a complex array of interconnected processes. These include intricate and nonlinear dynamics, such as evapotranspiration, cyclic patterns of water storage in lowland regions involving both depletion and replenishment, and various interconnected sub-processes within the hydrological cycle. These complexities highlight the nuanced interactions inherent in hydrological systems, underscoring the need for a thorough understanding and sophisticated modeling techniques to ensure accurate analysis and enhanced management of water resources. Among the three types of RNNs, the GRU model emerged as the standout performer, showcasing an impressive median NSE value of 0.73. This result signifies that the GRU model consistently achieved a high level of efficiency in simulating the hydrological processes under investigation across the various gauge locations. Additionally, the relatively narrow interquartile range associated with the GRU model indicates its stable and predictable performance, further reinforcing its position as the top-performing model.
The LSTM model displayed commendable performance, with a median NSE value of 0.70. While slightly lower than the GRU model, this NSE value still represents a strong and reliable simulation of the flooding event, suggesting that the LSTM model is a robust choice for modeling in this context. The interquartile range associated with the LSTM model provides insights into the variability of its performance, shedding light on its adaptability across different seasonal conditions.
On the contrary, the VRNN model emerged as the least favorable performer among the models examined, as indicated by its notably lower median NSE value of −0.03. This result raises questions about the VRNN model’s suitability for accurately representing complex flood generation mechanisms, given its limitation to short-term dependencies spanning roughly 10 or fewer time steps. This limitation arises from the network’s susceptibility to vanishing or exploding gradients, which manifest as unstable error signals during the backward propagation phase of training. These issues constrain the Vanilla RNN’s ability to effectively capture long-term temporal dependencies. The wide interquartile range associated with the VRNN model underscores its inconsistency in performance, indicating that this algorithm may struggle to provide accurate predictions under varying hydrological conditions. These findings suggest that the VRNN model may require further refinement, or that it is perhaps better suited to regions with linear or less complex flood generation mechanisms.
Also, it is worth noting that the NWM model was not calibrated specifically for the selected watersheds, as it is a nationally calibrated model developed for application across the continental U.S. While the study areas have relatively complete observational data, the lack of localized calibration may partially explain the model’s suboptimal performance in simulating specific flood events. This distinction is important for clarifying the basis of comparison with other methods.

4. Conclusions

This study evaluated the performance of various machine learning models (VRNN, LSTM, and GRU) and the physics-based NWM v3.0 reanalysis data for flood prediction across multiple catchments using a diverse dataset encompassing meteorological and hydrological data. Our findings showed the superiority of LSTM and GRU models in providing accurate predictions of flood characteristics, including hydrograph shapes, peak rates, and time to peak. Both LSTM and GRU models consistently outperformed the NWM, signifying their potential for enhancing flood prediction accuracy. The pivotal role of meteorological features, such as precipitation data and its spatial distribution, was evident in our research. These variables significantly contributed to improved flood prediction accuracy in mountainous watersheds.
Across all gauges, GRU emerged as the most robust model, achieving the highest mean NSE values and demonstrating low variability in performance. LSTM, while slightly less consistent than GRU, also exhibited strong performance, with high mean NSE values and comparable maximum NSE values, underscoring its utility in hydrological forecasting. In contrast, VRNN showed high variability and lower mean NSE values, suggesting challenges in generalization and model tuning. NWM, while consistent with its physics-based framework, lagged behind the machine learning models, particularly in capturing the variability and site-specific characteristics of the data. In addition, the assessment of peak rate, which included both positive and negative biases, revealed that GRU and LSTM provided the most accurate peak flow predictions, with mean RPE values of 0.15 across all gauges, significantly lower than VRNN (0.39) and NWM (0.25). The time to peak analysis further reinforced the superiority of GRU and LSTM, which consistently minimized timing errors compared to VRNN and NWM. LSTM achieved the lowest mean PTE of 0.90 h across all gauges, followed closely by GRU with a mean PTE of 0.97 h. In contrast, VRNN exhibited high mean PTE values (2.79 h), indicating significant timing discrepancies in peak flow predictions. NWM performed moderately, with a mean PTE of 1.51 h, highlighting its reliance on physics-based principles that may not adapt well to catchment-specific dynamics.
In summary, our research underscores the potential of RNNs, specifically LSTM and GRU, to advance flood prediction capabilities, with practical applications in flood warning and response systems. By combining catchment-specific meteorological forcing features, these models offer a promising avenue for accurate flood forecasts across diverse terrains. However, the identified biases toward early predictions and underestimation warrant further investigation and model refinement to improve the reliability of flood prediction. Given the mounting challenges of climate change and its impact on flood occurrence, integrating advanced machine learning techniques into hydrological research holds promise for more effective flood risk management and mitigation strategies.
Future research should focus on enhancing the accuracy and robustness of machine learning models for flood prediction by integrating static catchment attributes, such as soil properties, land use, and topography, to provide additional contextual information. This integration can help improve model generalization and capture catchment-specific hydrological dynamics. Furthermore, exploring advanced machine learning architectures like Transformers, which excel in capturing long-range dependencies and leveraging parallel processing, offers a promising avenue for future work. Transformers can overcome some of the limitations of RNNs, such as vanishing gradients and sequential processing constraints, enabling more efficient training and better representation of complex temporal patterns in hydrological data. These advancements, combined with rigorous calibration and uncertainty quantification, could pave the way for more reliable and interpretable flood prediction systems.
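As a purely illustrative sketch of the static-attribute integration proposed above, one common pattern is to tile time-invariant catchment descriptors along the time axis and concatenate them with the dynamic forcings before the recurrent layer. The class name, feature counts, and single-step output head below are assumptions, not part of this study's pipeline.

```python
import torch
import torch.nn as nn

class StaticAwareGRU(nn.Module):
    """GRU whose inputs are dynamic forcings concatenated with static
    catchment attributes (e.g., soil, land use, topography descriptors)."""
    def __init__(self, n_dynamic: int, n_static: int, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_dynamic + n_static, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one-step-ahead discharge

    def forward(self, dynamic: torch.Tensor, static: torch.Tensor) -> torch.Tensor:
        # dynamic: (batch, seq_len, n_dynamic); static: (batch, n_static)
        static_rep = static.unsqueeze(1).expand(-1, dynamic.size(1), -1)
        out, _ = self.gru(torch.cat([dynamic, static_rep], dim=-1))
        return self.head(out[:, -1])      # prediction from the last hidden state

model = StaticAwareGRU(n_dynamic=8, n_static=4)
q_next = model(torch.randn(16, 12, 8), torch.randn(16, 4))  # shape: (16, 1)
```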

Author Contributions

Conceptualization, E.H. and A.A.K.; methodology, E.H. and A.A.K.; software, E.H.; validation, E.H. and A.A.K.; formal analysis, E.H.; investigation, E.H.; resources, E.H.; data curation, E.H.; writing—original draft preparation, E.H.; writing—review and editing, A.A.K. and V.S.; visualization, E.H.; supervision, A.A.K. and V.S.; project administration, A.A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data and code will be made available upon request.

Acknowledgments

We would like to thank Ioana Popescu from the IHE Delft Institute for Water Education for her invaluable advice and insightful comments, which greatly contributed to the development and refinement of the models. Clemson University is also acknowledged for its generous allotment of computing time on the Palmetto cluster.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Jonkman, S.N. Global perspectives on loss of human life caused by floods. Nat. Hazards 2005, 34, 151–175.
2. EM-DAT—The International Disaster Database. Available online: https://www.emdat.be/ (accessed on 8 May 2024).
3. Nevo, S.; Morin, E.; Gerzi Rosenthal, A.; Metzger, A.; Barshai, C.; Weitzner, D.; Voloshin, D.; Kratzert, F.; Elidan, G.; Dror, G.; et al. Flood forecasting with machine learning models in an operational framework. Hydrol. Earth Syst. Sci. 2022, 26, 4013–4032.
4. UNDRR. Sendai Framework for Disaster Risk Reduction 2015–2030. Available online: https://www.undrr.org/publication/sendai-framework-disaster-risk-reduction-2015-2030 (accessed on 11 January 2025).
5. Singh, V.P.; Woolhiser, D.A. Mathematical Modeling of Watershed Hydrology. In Perspectives in Civil Engineering: Commemorating the 150th Anniversary of the American Society of Civil Engineers; ASCE Publications: Reston, VA, USA, 2003; pp. 345–367.
6. Farooq, M.; Shafique, M.; Khattak, M.S. Flood hazard assessment and mapping of River Swat using HEC-RAS 2D model and high-resolution 12-m TanDEM-X DEM (WorldDEM). Nat. Hazards 2019, 97, 477–492.
7. Hussain, F.; Wu, R.S.; Wang, J.X. Comparative study of very short-term flood forecasting using physics-based numerical model and data-driven prediction model. Nat. Hazards 2021, 107, 249–284.
8. Kaya, C.M.; Tayfur, G.; Gungor, O. Predicting flood plain inundation for natural channels having no upstream gauged stations. J. Water Clim. Change 2019, 10, 360–372.
9. Clark, M.P.; Nijssen, B.; Lundquist, J.D.; Kavetski, D.; Rupp, D.E.; Woods, R.A.; Freer, J.E.; Gutmann, E.D.; Wood, A.W.; Brekke, L.D.; et al. A unified approach for process-based hydrologic modeling: 1. Modeling concept. Water Resour. Res. 2015, 51, 2498–2514.
10. Zhang, B.; Ouyang, C.; Cui, P.; Xu, Q.; Wang, D.; Zhang, F.; Li, Z.; Fan, L.; Lovati, M.; Liu, Y.; et al. Deep learning for cross-region streamflow and flood forecasting at a global scale. Innovation 2024, 5, 100617.
11. Tabas, S.S.; Samadi, S. Variational Bayesian dropout with a Gaussian prior for recurrent neural networks application in rainfall–runoff modeling. Environ. Res. Lett. 2022, 17, 065012.
12. Tabas, S.S.; Humaira, N.; Samadi, S.; Hubig, N.C. FlowDyn: A daily streamflow prediction pipeline for dynamical deep neural network applications. Environ. Model. Softw. 2023, 170, 105854.
13. Sadeghi Tabas, S. Explainable Physics-Informed Deep Learning for Rainfall-Runoff Modeling and Uncertainty Assessment Across the Continental United States. Ph.D. Thesis, Clemson University, Clemson, SC, USA, 2023.
14. Chapi, K.; Singh, V.P.; Shirzadi, A.; Shahabi, H.; Bui, D.T.; Pham, B.T.; Khosravi, K. A novel hybrid artificial intelligence approach for flood susceptibility assessment. Environ. Model. Softw. 2017, 95, 229–245.
15. Hsu, K.-L.; Gupta, H.V.; Sorooshian, S. Artificial Neural Network Modeling of the Rainfall-Runoff Process. Water Resour. Res. 1995, 31, 2517–2530.
16. Tiwari, M.K.; Chatterjee, C. Development of an accurate and reliable hourly flood forecasting model using wavelet–bootstrap–ANN (WBANN) hybrid approach. J. Hydrol. 2010, 394, 458–470.
17. Kratzert, F.; Klotz, D.; Brenner, C.; Schulz, K.; Herrnegger, M. Rainfall-runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrol. Earth Syst. Sci. 2018, 22, 6005–6022.
18. Dtissibe, F.Y.; Ari, A.A.A.; Abboubakar, H.; Njoya, A.N.; Mohamadou, A.; Thiare, O. A comparative study of Machine Learning and Deep Learning methods for flood forecasting in the Far-North region, Cameroon. Sci. Afr. 2024, 23, e02053.
19. Shi, P.; Wu, H.; Qu, S.; Yang, X.; Lin, Z.; Ding, S.; Si, W. Advancing real-time error correction of flood forecasting based on the hydrologic similarity theory and machine learning techniques. Environ. Res. 2024, 246, 118533.
20. Jhong, Y.-D.; Chen, C.-S.; Jhong, B.-C.; Tsai, C.-H.; Yang, S.-Y. Optimization of LSTM Parameters for Flash Flood Forecasting Using Genetic Algorithm. Water Resour. Manag. 2024, 38, 1141–1164.
21. Nguyen, H.D.; Van, C.P.; Do, A.D. Application of hybrid model-based deep learning and swarm-based optimizers for flood susceptibility prediction in Binh Dinh province, Vietnam. Earth Sci. Inform. 2023, 16, 1173–1193.
22. Kimura, N.; Minakawa, H.; Kimura, M.; Fukushige, Y.; Baba, D. Examining practical applications of a neural network model coupled with a physical model and transfer learning for predicting an unprecedented flood at a lowland drainage pumping station. Paddy Water Environ. 2023, 21, 509–521.
23. Nearing, G.; Cohen, D.; Dube, V.; Gauch, M.; Gilon, O.; Harrigan, S.; Hassidim, A.; Klotz, D.; Kratzert, F.; Metzger, A.; et al. Global prediction of extreme floods in ungauged watersheds. Nature 2024, 627, 559–563.
24. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
25. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555.
26. Xia, Y.; Mitchell, K.; Ek, M.; Sheffield, J.; Cosgrove, B.; Wood, E.; Luo, L.; Alonge, C.; Wei, H.; Meng, J.; et al. Continental-scale water and energy flux analysis and validation for the North American Land Data Assimilation System project phase 2 (NLDAS-2): 1. Intercomparison and application of model products. J. Geophys. Res. Atmos. 2012, 117, D03109.
27. Cosgrove, B.A.; Lohmann, D.; Mitchell, K.E.; Houser, P.R.; Wood, E.F.; Schaake, J.C.; Robock, A.; Marshall, C.; Sheffield, J.; Duan, Q.; et al. Real-time and retrospective forcing in the North American Land Data Assimilation System (NLDAS) project. J. Geophys. Res. Atmos. 2003, 108, 8842.
28. Jordan, M.I. Serial Order: A Parallel Distributed Processing Approach. Adv. Psychol. 1997, 121, 471–495.
29. Werbos, P.J. Backpropagation Through Time: What It Does and How to Do It. Proc. IEEE 1990, 78, 1550–1560.
30. Wythoff, B.J. Backpropagation neural networks: A tutorial. Chemom. Intell. Lab. Syst. 1993, 18, 115–155.
31. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211.
32. Elman, J.L. Distributed Representations, Simple Recurrent Networks, and Grammatical Structure. Mach. Learn. 1991, 7, 195–225.
33. Williams, R.J.; Zipser, D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Comput. 1989, 1, 270–280.
34. Shiri, F.; Perumal, T.; Mustapha, N.; Mohamed, R. A Comprehensive Overview and Comparative Analysis on Deep Learning Models: CNN, RNN, LSTM, GRU. arXiv 2023, arXiv:2305.17473.
35. Abbaspour, S.; Fotouhi, F.; Sedaghatbaf, A.; Fotouhi, H.; Vahabi, M.; Linden, M. A Comparative Analysis of Hybrid Deep Learning Models for Human Activity Recognition. Sensors 2020, 20, 5707.
36. Fang, W.; Chen, Y.; Xue, Q. Survey on Research of RNN-Based Spatio-Temporal Sequence Prediction Algorithms. J. Big Data 2021, 3, 97–110.
37. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
38. Hochreiter, S.; Younger, A.S.; Conwell, P.R. Learning to Learn Using Gradient Descent. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2130, pp. 87–94.
39. Huang, Y.; Sun, Y. Learning to pour. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Vancouver, BC, Canada, 24–28 September 2017; pp. 7005–7010.
40. Sak, H.; Senior, A.; Beaufays, F. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. arXiv 2014, arXiv:1402.1128.
41. Cosgrove, B.; Gochis, D.; Flowers, T.; Dugger, A.; Ogden, F.; Graziano, T.; Clark, E.; Cabell, R.; Casiday, N.; Cui, Z.; et al. NOAA's National Water Model: Advancing operational hydrology through continental-scale modeling. J. Am. Water Resour. Assoc. 2024, 60, 247–272.
42. Cohen, S.; Praskievicz, S.; Maidment, D.R. Featured collection introduction: National Water Model. J. Am. Water Resour. Assoc. 2018, 54, 767–769.
43. Scheuerer, M.; Viterbo, F.; Hughes, M.; Thorstensen, A.R. Investigation of the Added Value of Using Statistically Postprocessed GEFS Ensemble Forecasts as Alternative Forcings for the WRF-Hydro National Water Model (NWM). In Proceedings of the 100th AMS Annual Meeting, Boston, MA, USA, 12–16 January 2020. Available online: https://ams.confex.com/ams/2020Annual/meetingapp.cgi/Paper/363669 (accessed on 14 January 2020).
44. Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
45. Minns, A.W.; Hall, M.J. Artificial neural networks as rainfall-runoff models. Hydrol. Sci. J. 1996, 41, 399–417.
46. Zhang, Z. Improved Adam Optimizer for Deep Neural Networks. In Proceedings of the 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018.
47. Collobert, R.; Bengio, S.; Mariéthoz, J. Torch: A Modular Machine Learning Software Library; IDIAP: Martigny, Switzerland, 2002.
48. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026.
Figure 1. Selected catchments and NLDAS grid cells, as well as associated USGS gauges, located in the northwest and west of the state of South Carolina, USA.
Figure 2. Schematic overview of applying RNNs for flood prediction.
Figure 3. Flood hydrographs were generated for three sample events per catchment, allowing a detailed comparison of observed and simulated streamflow across events.
Table 1. NLDAS-v2 meteorological forcing variables, their units, and descriptions, as used for model training and verification.

No. | Variable | Unit  | Description
1   | Prcp     | mm    | Precipitation
2   | PET      | mm    | Potential evapotranspiration
3   | Humidity | kg/kg | Specific humidity
4   | T        | K     | Air temperature
5   | Wind-U   | m/s   | West-to-east wind
6   | Wind-V   | m/s   | South-to-north wind
7   | RLDS     | W/m²  | Surface downward longwave radiation
8   | RSDS     | W/m²  | Surface downward shortwave radiation
Table 2. Catchment information, one-year return period flow threshold, and number of events selected for training, validation, and test periods.

Catchment # | USGS #   | Drainage Area (km²) | Flood Event Threshold (cms) | Training Events | Validation Events | Test Events
1     | 02164000 | 125.87  | 28.31 | 39  | 13 | 13
2     | 02177000 | 536.12  | 56.63 | 44  | 15 | 15
3     | 02186000 | 274.53  | 28.31 | 50  | 16 | 16
4     | 02196000 | 1411.54 | 56.63 | 50  | 16 | 16
Total | –        | –       | –     | 183 | 60 | 60
Table 3. RNNs' hyperparameters, search space, optimal values, and definitions.

Hyperparameter | Search Space | Optimum | Description
Num. of recurrent layers | 1, 2, 3 | 1 | Number of stacked recurrent layers.
Optimizer | Adam, Adagrad, RMSprop | Adam | Algorithm used to update the weights of the network.
Learning rate | 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻² | 0.001 | Controls the learning speed of the model.
Batch size | 8, 16, 32, 64, 128, 256 | 128 | Number of samples processed by the RNN per iteration.
Hidden unit size | 2, 4, 8, 16, 32, 64, 128 | 32 | Output size of the hidden layers in the RNN.
Sequence length | 3–12 | 12 | Time lag of past observations used to predict the next timestep.
Dropout rate | 0.1, 0.2, 0.3, 0.4, 0.5 | 0.3 | Probability of randomly dropping a neuron during training; a regularization technique to prevent overfitting.
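As a rough illustration of how the optimal values in Table 3 translate into code, the sketch below wires them into a single-layer GRU in PyTorch. The input width of eight forcing variables and the output head are our assumptions, and dropout is applied to the final hidden state because PyTorch's built-in recurrent dropout only acts between stacked layers.

```python
import torch
import torch.nn as nn

N_FEATURES, SEQ_LEN, BATCH_SIZE = 8, 12, 128  # seq. length and batch size per Table 3

class FloodGRU(nn.Module):
    def __init__(self):
        super().__init__()
        # One recurrent layer with 32 hidden units, per the Table 3 optima.
        self.gru = nn.GRU(N_FEATURES, 32, num_layers=1, batch_first=True)
        # nn.GRU's dropout argument only applies between stacked layers,
        # so with one layer the 0.3 rate is applied to the last state.
        self.dropout = nn.Dropout(0.3)
        self.head = nn.Linear(32, 1)  # next-timestep discharge

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(x)          # x: (batch, SEQ_LEN, N_FEATURES)
        return self.head(self.dropout(out[:, -1]))

model = FloodGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 0.001
q_pred = model(torch.randn(BATCH_SIZE, SEQ_LEN, N_FEATURES))  # shape: (128, 1)
```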
Table 4. Leave-one-out strategy to investigate how variations in input data impact forecast quality across the different RNNs, in terms of NSE values.

Left-Out Variable | VRNN Mean | VRNN Median | LSTM Mean | LSTM Median | GRU Mean | GRU Median
None (full input set) | −0.12 | −0.03 | 0.65 | 0.70 | 0.70 | 0.73
Baseflow        | −0.15 | −0.04 | 0.51 | 0.57 | 0.57 | 0.61
Precipitation   | −0.21 | −0.20 | 0.39 | 0.43 | 0.41 | 0.43
Temperature     | −0.16 | −0.08 | 0.55 | 0.59 | 0.58 | 0.62
Wind            | −0.16 | −0.11 | 0.57 | 0.60 | 0.59 | 0.62
PET             | −0.14 | −0.06 | 0.63 | 0.65 | 0.65 | 0.66
Humidity        | −0.16 | −0.07 | 0.61 | 0.67 | 0.65 | 0.64
Solar Radiation | −0.14 | −0.05 | 0.62 | 0.68 | 0.67 | 0.65
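The experiment behind Table 4 amounts to retraining each network once per excluded input. Schematically, with train_model and evaluate_nse as hypothetical stand-ins for the full training and evaluation pipeline:

```python
# Schematic of the leave-one-out ablation in Table 4. `train_model` and
# `evaluate_nse` are hypothetical helpers, not functions from this study.
FORCINGS = ["Baseflow", "Precipitation", "Temperature", "Wind",
            "PET", "Humidity", "Solar Radiation"]

results = {}
for left_out in [None] + FORCINGS:
    kept = [f for f in FORCINGS if f != left_out]
    model = train_model(input_features=kept)        # retrained from scratch
    results[left_out or "None"] = evaluate_nse(model, input_features=kept)
```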
Table 5. Statistics of NSE values obtained from the RNN-simulated test events, as well as the NWM simulations, across the different catchments.

USGS #   | Events | Model | Mean  | Median | Min   | Max  | STDEV
02164000 | 13     | VRNN  | 0.44  | 0.62   | −0.59 | 0.88 | 0.44
         |        | LSTM  | 0.62  | 0.61   | 0.25  | 0.90 | 0.20
         |        | GRU   | 0.70  | 0.75   | 0.33  | 0.97 | 0.21
         |        | NWM   | 0.46  | 0.55   | −0.05 | 0.90 | 0.24
02177000 | 15     | VRNN  | −0.56 | −0.39  | −3.12 | 0.66 | 0.83
         |        | LSTM  | 0.74  | 0.78   | 0.33  | 0.91 | 0.18
         |        | GRU   | 0.72  | 0.72   | 0.33  | 0.96 | 0.14
         |        | NWM   | 0.49  | 0.73   | −0.74 | 0.94 | 0.50
02186000 | 16     | VRNN  | −0.23 | −0.10  | −1.25 | 0.03 | 0.33
         |        | LSTM  | 0.63  | 0.69   | −0.33 | 0.92 | 0.31
         |        | GRU   | 0.66  | 0.61   | 0.44  | 0.91 | 0.16
         |        | NWM   | 0.08  | 0.39   | −1.66 | 0.84 | 0.77
02196000 | 16     | VRNN  | −0.01 | 0.37   | −2.98 | 0.78 | 0.89
         |        | LSTM  | 0.59  | 0.59   | 0.38  | 0.85 | 0.17
         |        | GRU   | 0.64  | 0.67   | 0.31  | 0.86 | 0.17
         |        | NWM   | 0.35  | 0.50   | −1.07 | 0.90 | 0.54
Total    | 60     | VRNN  | −0.12 | −0.03  | −3.12 | 0.88 | 0.75
         |        | LSTM  | 0.65  | 0.70   | −0.33 | 0.92 | 0.24
         |        | GRU   | 0.70  | 0.73   | 0.31  | 0.97 | 0.17
         |        | NWM   | 0.34  | 0.51   | −1.66 | 0.94 | 0.59
Table 6. Statistics for absolute RPE biases obtained from the RNNs, as well as the NWM simulations, across the different catchments during the test events.

USGS #   | Events | Model | Mean | Median | Min  | Max  | STDEV
02164000 | 13     | VRNN  | 0.34 | 0.32   | 0.08 | 0.58 | 0.14
         |        | LSTM  | 0.18 | 0.17   | 0.00 | 0.57 | 0.15
         |        | GRU   | 0.15 | 0.08   | 0.00 | 0.41 | 0.15
         |        | NWM   | 0.24 | 0.24   | 0.05 | 0.45 | 0.13
02177000 | 15     | VRNN  | 0.46 | 0.52   | 0.00 | 0.87 | 0.31
         |        | LSTM  | 0.18 | 0.16   | 0.00 | 0.59 | 0.16
         |        | GRU   | 0.22 | 0.26   | 0.00 | 0.58 | 0.15
         |        | NWM   | 0.19 | 0.11   | 0.00 | 0.60 | 0.18
02186000 | 16     | VRNN  | 0.38 | 0.40   | 0.00 | 0.78 | 0.23
         |        | LSTM  | 0.15 | 0.09   | 0.00 | 0.74 | 0.19
         |        | GRU   | 0.13 | 0.12   | 0.00 | 0.37 | 0.12
         |        | NWM   | 0.26 | 0.20   | 0.00 | 0.72 | 0.23
02196000 | 16     | VRNN  | 0.38 | 0.32   | 0.07 | 0.85 | 0.19
         |        | LSTM  | 0.10 | 0.00   | 0.00 | 0.37 | 0.13
         |        | GRU   | 0.12 | 0.08   | 0.00 | 0.45 | 0.13
         |        | NWM   | 0.30 | 0.19   | 0.00 | 0.88 | 0.25
Total    | 60     | VRNN  | 0.39 | 0.37   | 0.00 | 0.87 | 0.24
         |        | LSTM  | 0.15 | 0.13   | 0.00 | 0.74 | 0.16
         |        | GRU   | 0.15 | 0.15   | 0.00 | 0.58 | 0.14
         |        | NWM   | 0.25 | 0.20   | 0.00 | 0.88 | 0.22
Table 7. Statistics for absolute PTE biases (hours) obtained from the RNNs, as well as the NWM simulations, across the different catchments during the test events.

USGS #   | Events | Model | Mean | Median | Min  | Max  | STDEV
02164000 | 13     | VRNN  | 0.67 | 0.50   | 0.00 | 3.00 | 0.85
         |        | LSTM  | 0.67 | 0.50   | 0.00 | 2.00 | 0.75
         |        | GRU   | 0.67 | 1.00   | 0.00 | 1.00 | 0.47
         |        | NWM   | 1.00 | 1.00   | 0.00 | 3.00 | 1.00
02177000 | 15     | VRNN  | 3.39 | 5.00   | 0.00 | 5.00 | 2.03
         |        | LSTM  | 1.11 | 1.00   | 0.00 | 2.00 | 0.87
         |        | GRU   | 1.06 | 1.00   | 0.00 | 2.00 | 0.91
         |        | NWM   | 0.67 | 0.50   | 0.00 | 3.00 | 0.82
02186000 | 16     | VRNN  | 3.68 | 5.00   | 0.00 | 5.00 | 1.66
         |        | LSTM  | 1.14 | 1.00   | 0.00 | 3.00 | 0.97
         |        | GRU   | 1.23 | 1.00   | 0.00 | 3.00 | 1.00
         |        | NWM   | 1.18 | 1.00   | 0.00 | 4.00 | 1.11
02196000 | 16     | VRNN  | 2.55 | 3.00   | 0.00 | 5.00 | 1.32
         |        | LSTM  | 0.60 | 0.00   | 0.00 | 2.00 | 0.86
         |        | GRU   | 0.80 | 0.50   | 0.00 | 2.00 | 0.87
         |        | NWM   | 2.95 | 3.00   | 1.00 | 7.00 | 1.56
Total    | 60     | VRNN  | 2.79 | 3.00   | 0.00 | 5.00 | 1.89
         |        | LSTM  | 0.90 | 1.00   | 0.00 | 3.00 | 0.92
         |        | GRU   | 0.97 | 1.00   | 0.00 | 3.00 | 0.90
         |        | NWM   | 1.51 | 1.00   | 0.00 | 7.00 | 1.49
Table 8. Performance metrics (NSE, signed RPE, and signed PTE in hours) for the sample hydrographs across the different catchments.

USGS #   | Event # | VRNN (NSE / RPE / PTE) | LSTM (NSE / RPE / PTE) | GRU (NSE / RPE / PTE) | NWM (NSE / RPE / PTE)
02164000 | 1 | 0.63 / −0.41 / 0.00   | 0.93 / −0.12 / −1.00 | 0.97 / 0.04 / 0.00   | 0.70 / −0.27 / 0.00
         | 2 | 0.81 / −0.26 / 0.00   | 0.88 / 0.17 / 1.00   | 0.92 / 0.24 / 1.00   | 0.37 / −0.13 / 2.00
         | 3 | 0.61 / −0.38 / 0.00   | 0.69 / −0.21 / 1.00  | 0.65 / −0.04 / 1.00  | 0.69 / −0.31 / 1.00
02177000 | 1 | −0.58 / −0.14 / −5.00 | 0.91 / −0.05 / −2.00 | 0.96 / 0.00 / −2.00  | 0.40 / −0.20 / −1.00
         | 2 | −1.52 / −0.87 / −5.00 | 0.68 / −0.21 / −2.00 | 0.72 / −0.27 / −2.00 | −0.11 / −0.58 / 1.00
         | 3 | −3.12 / −0.55 / −5.00 | 0.77 / 0.06 / −2.00  | 0.79 / 0.17 / 0.00   | −0.74 / −0.04 / −1.00
02186000 | 1 | 0.02 / −0.22 / −5.00  | 0.92 / −0.01 / −2.00 | 0.91 / −0.03 / −2.00 | 0.84 / 0.20 / 0.00
         | 2 | −0.26 / −0.36 / −5.00 | 0.92 / −0.10 / −1.00 | 0.90 / 0.05 / −2.00  | 0.37 / 0.53 / −1.00
         | 3 | −0.18 / −0.74 / −3.00 | 0.91 / 0.01 / 1.00   | 0.90 / 0.05 / 1.00   | −0.14 / −0.61 / 1.00
02196000 | 1 | 0.66 / −0.28 / −3.00  | 0.84 / −0.16 / 2.00  | 0.79 / −0.22 / −1.00 | 0.78 / 0.04 / −1.00
         | 2 | 0.02 / −0.67 / −5.00  | 0.67 / −0.27 / 0.00  | 0.82 / −0.19 / −1.00 | 0.42 / 0.74 / −3.00
         | 3 | 0.74 / −0.32 / −2.00  | 0.63 / −0.18 / −2.00 | 0.86 / −0.11 / 2.00  | 0.65 / −0.28 / 1.00