Enhancing Peak Runoff Forecasting through Feature Engineering Applied to X-Band Radar Data

Álvarez-Estrella, Julio; Muñoz, Paul; Bendix, Jörg; Contreras, Pablo; Célleri, Rolando

doi:10.3390/w16070968


by Julio Álvarez-Estrella ¹,²,*, Paul Muñoz ¹, Jörg Bendix ³, Pablo Contreras ¹ and Rolando Célleri ¹

¹ Departamento de Recursos Hídricos y Ciencias Ambientales, Universidad de Cuenca, Cuenca 010207, Ecuador

² Facultad de Ingeniería, Universidad de Cuenca, Cuenca 010207, Ecuador

³ Laboratory for Climatology and Remote Sensing, Faculty of Geography, University of Marburg, 35032 Marburg, Germany

* Author to whom correspondence should be addressed.

Water 2024, 16(7), 968; https://doi.org/10.3390/w16070968

Submission received: 29 December 2023 / Revised: 6 February 2024 / Accepted: 15 February 2024 / Published: 27 March 2024

(This article belongs to the Section Water and Climate Change)


Abstract

Floods cause significant damage to human life, infrastructure, agriculture, and the economy. Predicting peak runoffs is crucial for hazard assessment, but it is challenging in remote areas like the Andes due to limited hydrometeorological data. We utilized a 300 km² catchment over the period 2015–2021 to develop runoff forecasting models exploiting precipitation information retrieved from an X-band weather radar. For the modeling task, we employed the Random Forest (RF) algorithm in combination with a Feature Engineering (FE) strategy applied to the radar data. The FE strategy is based on an object-based approach, which derives precipitation characteristics from radar data. These characteristics served as inputs for the models, distinguishing them as “enhanced models” compared to “referential models” that incorporate precipitation estimates from all available pixels (1210) for each hour. From 29 identified events, enhanced models achieved Nash-Sutcliffe efficiency (NSE) values ranging from 0.94 to 0.50 for lead times between 1 and 6 h. A comparative analysis between the enhanced and referential models revealed a remarkable 23% increase in NSE-values at the 3 h lead time, which marks the peak improvement. The enhanced models integrated new data into the RF models, resulting in a more accurate representation of precipitation and its temporal transformation into runoff.

Keywords: Peak runoff forecast; X-band radar; Random Forest; Andes

1. Introduction

Floods stand as one of the most devastating natural disasters, impacting and causing damage to human life, infrastructure, agriculture, and the economy [1,2,3]. Thus, peak runoff forecasting tools play a crucial role in hazard assessment and for allowing decision-makers to take mitigation actions with sufficient anticipation [1,2,4]. However, predicting peak runoff remains challenging, particularly in complex (in terms of biophysical and climatological characteristics) and/or remote areas, such as the mountainous region of the Andes, due to a lack of sufficient information to describe the extreme variability of the main hydrometeorological variables that control the runoff generation process (e.g., precipitation, topography, land uses, soil characteristics, etc.), particularly precipitation [3,5].

A solution that has emerged in the past few decades is to exploit Remote Sensing (RS) products obtained either from satellite or ground weather radars. For the case of precipitation, the use of weather radar estimates is encouraged due to their finer spatial and temporal resolutions when compared to precipitation derived from satellite products. This makes radar precipitation more suitable for hydrological applications, such as peak runoff forecasting [2,3,6]. Several precipitation-runoff models have been developed using radar data [2,3,7,8], exploring the utility of radar precipitation estimates. Specifically, with data sourced from the X-band radar, which is also utilized in this study, Orellana-Alvear et al. [9] employed a random forest algorithm for runoff forecasting. They used native radar data (i.e., reflectivity instead of the derived rain rate), achieving satisfactory results (NSE = 0.85, KGE = 0.81). However, to get the most out of this precipitation radar data, it is appropriate to develop a methodology in which the advantages of the high-resolution data can be exploited.

For peak runoff modeling and forecasting, an effective strategy is to develop precipitation-runoff models powered by Machine Learning (ML) techniques. Such models are data-driven: they learn how the system functions by relating a set of inputs to a set of outputs. With higher quality data, such as better resolution imagery for precipitation estimation, improved model results are expected. However, these types of models do not distinguish or consider the physical processes involved in the simulated system (black box modeling) [10]. Commonly used ML techniques for runoff forecasting include the Random Forest (RF) algorithm, Fuzzy Logic, Support Vector Machine (SVM), and Artificial Neural Networks (ANN) [10,11,12]. Based on a literature review, the RF algorithm is better suited for peak runoff forecasting than other ML techniques. Its efficient and scalable architecture results in significantly lower computational costs for setup and operation, although computational efficiency is not the primary objective of this research [1,10,12,13,14].

Despite ML’s success in precipitation-runoff forecasting, several shortcomings affecting model performance have been identified: the use of irrelevant input features that mislead the learning process, a lack of interpretability, and overfitting [1,13,14]. Addressing these issues is therefore important for improving model performance. The current trend in model research focuses on adding physical knowledge to ML models, in what is known as “grey modeling”. These grey-box models aim at optimizing the ML learning process in order to increase accuracy. To improve the learning process in grey modeling, raw data can be transformed, removed (when they carry information the model does not need), or used to create new features that describe certain aspects of the system functioning [15,16,17,18]. This set of conceptual and/or mathematical operations for transforming, removing, or creating inputs is known as Feature Engineering (FE). However, few research initiatives have addressed the importance of developing appropriate FE strategies [19].

The effectiveness of employing FE strategies in hydrological models is supported by several studies [19,20,21,22,23,24,25]. In the specific case of precipitation-runoff models, Muñoz et al. [20], for example, applied FE through a spatiotemporal object-based approach. This object-based approach is derived from the framework proposed by Laverde-Barajas et al. [26]. Among other aspects, this framework suggests the identification of precipitation objects and the extraction of attributes from them, which can be used as inputs for forecasting models. The authors also suggested that this method could be employed, among other potential approaches, to assess the performance of high-resolution precipitation products in a specific area. Thus, given the existing studies, the challenge lies in extracting physical and meteorological features (such as the area, volume, and location of the objects) from high-resolution images in order to add physical meaning to precipitation-runoff processes and enhance the efficiency of peak runoff forecasting models.

All in all, we aim to enhance peak runoff forecasts by exploiting precipitation estimates retrieved from weather radar data using an FE strategy with an object-based approach to derive precipitation attributes, which are then used to generate the enhanced models. Furthermore, we evaluate the effectiveness of the FE approach through a direct comparison between referential models (those without the application of the FE strategy) and enhanced models (those incorporating the FE strategy). This evaluation is performed using performance metrics, considering lead times of 1, 3, and 6 h.

2. Study Area and Dataset

2.1. Study Area

The study area corresponds to the Tomebamba catchment, situated in the southern Ecuadorian Andes, northwest of Cuenca city. The outlet of the study catchment is the Matadero Sayausí discharge station (Figure 1) where the Tomebamba river (also known as Matadero at this point) enters Cuenca. Therefore, it is important to forecast potentially hazardous peak runoff events in this location.

The catchment’s altitudinal range spans from approximately 2592 to 4164 m above sea level (m.a.s.l.), covering a surface area of approximately 300 km². The mean annual precipitation varies across the catchment, from an average of 850 mm per year at lower elevations to around 1100 mm per year in the upper regions [27]. The average annual temperature in the study area ranges between 4 °C and 15 °C [28]. The potential evapotranspiration for the study catchment is approximately 981 mm/year [29]. The higher elevations of the catchment, situated above 3500 m.a.s.l., encompass a pristine region characterized by a blend of wetlands, lagoons, and paramo grasslands. At mid-elevations (2700–3500 m.a.s.l.), the landscape is more diverse, featuring a mix of forests, agricultural and grazing areas, and sporadic urban settlements [30]. The snow line in Ecuador is at approximately 4700 m.a.s.l., so there is no contribution of snow to the runoff in the study area [31].

2.2. Dataset

The dataset for this study encompasses precipitation estimates and runoff data. Precipitation estimates were derived from a single polarized, non-Doppler, X-band radar located at an elevation of 4440 m above sea level on the Paragüillas hill [32]. The radar has a bin resolution of 2 degrees in azimuth and 100 m in range. More detailed information about the radar can be found in the study developed by Orellana-Alvear et al. [32].

Radar data were utilized to derive precipitation estimates on an hourly scale, serving as inputs for the models. These estimations were recorded as precipitation depths in millimeters (mm). The series of radar precipitation depth was obtained using a step-wise correction model, as previously outlined by Orellana-Alvear et al. [32]. This model applies clutter and attenuation corrections to ensure data accuracy. Subsequently, the precipitation rate was determined through a site-specific Z-R relationship (Z = 204R^1.57) for intense precipitation events identified by Orellana et al. [33]. The data were then transformed from the precipitation rate (measured in mm/h) to the precipitation depth (in mm). This approach culminated in aggregating the data to an hourly scale from its original 5 min resolution records.
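The conversion chain described above (reflectivity to rain rate, then 5 min rain rates to an hourly depth) can be sketched as follows. This is a minimal illustration and not the authors' processing code: it assumes reflectivity is available in dBZ, it omits the clutter and attenuation corrections, and the function names and sample values are hypothetical. Only the coefficients 204 and 1.57 come from the text.

```python
import math

# Site-specific Z-R coefficients quoted in the text: Z = 204 * R**1.57
A, B = 204.0, 1.57

def dbz_to_rain_rate(dbz):
    """Convert reflectivity in dBZ (assumed input unit) to a rain rate in mm/h."""
    z_linear = 10.0 ** (dbz / 10.0)      # dBZ -> Z in mm^6 m^-3
    return (z_linear / A) ** (1.0 / B)   # invert Z = A * R**B

def hourly_depth(rates_mm_per_h, step_minutes=5):
    """Accumulate rain rates sampled every `step_minutes` into a depth in mm."""
    return sum(rate * step_minutes / 60.0 for rate in rates_mm_per_h)

# Hypothetical check: a reflectivity equal to 10*log10(204) dBZ maps to 1 mm/h,
# and twelve 5-min scans at a constant 6 mm/h accumulate to 6 mm in one hour.
rate = dbz_to_rain_rate(10.0 * math.log10(A))
depth = hourly_depth([6.0] * 12)
```

Each 5 min rate contributes rate × 5/60 mm, so twelve scans cover exactly one hour.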

Hourly runoff time series data from the Matadero-Sayausí station (outlet of the catchment, see Figure 2) are available from 2015 to July 2021. Figure 2 displays the hourly time series of runoff during the available data period. The events used to generate forecast models were determined based on common dates with records of precipitation and runoff data.

Although runoff data is available from 2015 to 2021, there is a gap in the radar data, resulting in only having radar information for the years 2015 to 2017 and the year 2021. This period constitutes the study timeframe for event detection.

3. Methods

Figure 3 presents an overview of the methodology employed in this study. First, the runoff time series was analyzed to obtain near-independent peak runoff events. For each identified peak, a 12 h window before and after the peak value was considered to capture the entire hydrological event (i.e., each event has a fixed duration of 25 h, starting and ending close to base flow). Additionally, a lag analysis was conducted for each variable (runoff and precipitation) to determine the adequate number of lags for the development stage of the forecasting models. Subsequently, the input feature space (IFS) was obtained by intersecting the dates of near-independent hydrological events together with their corresponding lags from runoff and precipitation data (Figure 3b).

Using this information, referential models were generated, considering only lagged variables and without applying any Feature Engineering strategy. Following this, enhanced models were developed based on the referential models, but with the addition of FE; that is, the precipitation input was replaced with precipitation attributes derived from the object-based approach (Figure 3a). Finally, the referential and enhanced models were evaluated and compared (Figure 3b).

3.1. Determination of Independent Peak Runoff Events

Near-independent peak runoff events were determined using the WETSPRO time series tool [34], which employs a peak-over-threshold (POT) approach to derive nearly independent peak flows. The POT method, based on baseflow, categorizes two peaks as near-independent if the flow between them decreases to approximately the baseflow level.

Two parameters in the POT selection require calibration: the maximum ratio difference with the subflow and the minimum peak height. The maximum ratio difference is the percentage by which the lowest flow can vary below the baseflow level between two events to be considered independent. The minimum peak height was determined using the 90th percentile value obtained from Equation (1), which represents the probability of exceedance.

$$P = \frac{m}{N + 1} \tag{1}$$

where P is the probability of exceedance: this corresponds to the probability that a defined event, or peak runoff, is equaled or exceeded. N represents the total number of elements in a series, and m represents the order of the series when arranged in descending order.
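As a small worked example of Equation (1), the sketch below ranks a flow series in descending order and assigns each value its exceedance probability P = m/(N + 1). Reading the "90th percentile" threshold as the flow whose exceedance probability is about 0.10 is an assumption made here for illustration; the function names and the sample flows are hypothetical.

```python
def exceedance_probabilities(flows):
    """Return (flow, P) pairs with P = m / (N + 1), m being the descending rank."""
    ordered = sorted(flows, reverse=True)
    n = len(ordered)
    return [(q, m / (n + 1)) for m, q in enumerate(ordered, start=1)]

def flow_at_exceedance(flows, p_target):
    """Smallest ranked flow whose exceedance probability does not exceed p_target."""
    candidates = [q for q, p in exceedance_probabilities(flows) if p <= p_target]
    return min(candidates) if candidates else None

# Hypothetical flows in m^3/s (N = 9, so P takes the values 0.1, 0.2, ..., 0.9).
sample = [3.0, 12.0, 7.5, 20.0, 5.0, 9.0, 15.0, 4.0, 6.0]
pairs = exceedance_probabilities(sample)
threshold = flow_at_exceedance(sample, 0.10)  # assumed reading of the 90th percentile
```

With nine values, the largest flow has m = 1 and therefore P = 1/10, so it is the only value that qualifies at a target of 0.10.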

Furthermore, in the flow separation to estimate the baseflow, two parameters must be calibrated: (i) the recession constant of the slow flow component, and (ii) the fraction of the total flow attributed to the quick flow component.

3.2. Development of Peak Runoff Forecasting Models

The referential models were developed using precipitation radar data (for each pixel) and runoff information. Their development involved only statistical lag analyses, without applying any Feature Engineering (FE) strategy to the precipitation data. In contrast, the enhanced models used precipitation inputs based on hydrometeorological attributes, which replaced the raw precipitation radar data. The Random Forest (RF) regression algorithm was employed to build all the models; a detailed description of this algorithm is provided in Section 3.2.2.

The construction of the IFS for the RF models was based on the methodology presented by Muñoz et al. [5], and consists of three primary components. Firstly, it integrates hourly runoff and precipitation radar data. Secondly, it considers three precipitation attributes derived from the object-based approach: total area of precipitation objects, total volume of precipitation objects, and distances to the centroids of precipitation objects. Thirdly, it incorporates lag information from previous hours for both precipitation radar data and runoff. The determination of precipitation and runoff lags was based on statistical correlation analyses, including cross-correlation functions for precipitation, as well as partial and auto-correlation functions for runoff. This process is described in detail in the subsequent subsection.
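To make the structure of a lagged input feature space concrete, the following sketch assembles one row per time step from the current and lagged values of runoff and of a single precipitation input, paired with the runoff observed `lead` hours later. It is a simplified illustration under stated assumptions: one precipitation series stands in for the pixel values or object attributes, and the function name, lag counts, and sample series are hypothetical.

```python
def build_lagged_rows(runoff, precip, n_runoff_lags, n_precip_lags, lead):
    """Return (X, y): rows of current + lagged inputs, and runoff `lead` steps ahead."""
    start = max(n_runoff_lags, n_precip_lags)   # first step with a full lag history
    X, y = [], []
    for t in range(start, len(runoff) - lead):
        row = [runoff[t - k] for k in range(n_runoff_lags + 1)]    # Q(t), Q(t-1), ...
        row += [precip[t - k] for k in range(n_precip_lags + 1)]   # P(t), P(t-1), ...
        X.append(row)
        y.append(runoff[t + lead])                                 # forecast target
    return X, y

# Hypothetical hourly series (runoff in m^3/s, precipitation in mm).
runoff = [1.0, 1.2, 1.5, 2.5, 4.0, 3.0, 2.0, 1.5]
precip = [0.0, 0.4, 1.0, 2.0, 0.5, 0.0, 0.0, 0.0]
X, y = build_lagged_rows(runoff, precip, n_runoff_lags=2, n_precip_lags=3, lead=1)
```

In this layout a separate model would be trained for each lead time, since the target column changes with `lead` while the inputs stay the same.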

3.2.1. Runoff and Precipitation Lags

The determination of runoff and precipitation lags is crucial as they enrich the input feature space for the runoff forecasting models. To determine the optimal number of precipitation and runoff lags, we conducted statistical analyses. For runoff, Sudheer et al. [35] recommend utilizing the Auto-Correlation Function (ACF) and the Partial Auto-Correlation Function (PACF). For precipitation, we used Pearson’s cross-correlation between the precipitation and runoff time series.

On one hand, the ACF and PACF contemplate the autoregressive behavior of runoff. The ACF measures the correlation between a value in a time series and its past values, encompassing the influence of intermediate time intervals. In contrast, the PACF focuses on a direct correlation without considering the influence of other values.

On the other hand, precipitation lags can be seen as a proxy variable for mimicking soil moisture in the catchment. This is advantageous for the model, as precipitation on unsaturated or partially saturated soil initially infiltrates the soil until reaching saturation before transforming into runoff. Conversely, if the soil is already saturated, most of the precipitation is expected to be directly converted into runoff.
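The lag analysis can be illustrated with a short sketch that computes Pearson's correlation between a driver series shifted by k hours and a response series. Passing runoff as both driver and response gives a Pearson-based autocorrelation estimate; the PACF is not implemented here. The helper names, the sample data, and the threshold value used in the example call are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def lagged_correlation(driver, response, lag):
    """Correlation between driver[t - lag] and response[t]."""
    if lag == 0:
        return pearson(driver, response)
    return pearson(driver[:-lag], response[lag:])

def lags_above_threshold(driver, response, max_lag, threshold):
    """Lags (0..max_lag) whose correlation reaches the threshold."""
    return [k for k in range(max_lag + 1)
            if lagged_correlation(driver, response, k) >= threshold]

# Hypothetical data: the runoff series is the precipitation series delayed by 2 h.
precip = [0.0, 1.0, 3.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0]
runoff = [0.0, 0.0, 0.0, 1.0, 3.0, 0.0, 2.0, 0.0, 0.0, 1.0]
best_lag = max(range(5), key=lambda k: lagged_correlation(precip, runoff, k))
selected = lags_above_threshold(precip, runoff, max_lag=4, threshold=0.2)
```

Because the synthetic response is an exact 2 h delay of the driver, the correlation peaks at lag 2, which mirrors how a peak in the cross-correlation curve relates to the catchment response time.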

3.2.2. Random Forest (RF) Algorithm for Regression

Random Forest is a machine learning technique that has been widely employed in hydrological forecasting [1,15,16,20,36]. The strength of RF lies in its ensemble nature, where each decision tree within the forest is trained on a distinct data subset, promoting diversity and minimizing potential bias. Additionally, the technique incorporates randomized feature selection within each tree, enhancing robustness and capturing intricate relationships in the data. A comprehensive explanation of the RF algorithm can be found in Breiman [37]; a concise summary of the algorithm’s flow is as follows:

(i) Bootstrap resampling is applied to randomly select samples from the IFS, which are used to construct individual regression trees. The data not included in a particular bootstrap sample form its “out-of-bag” (OOB) sample, which serves as a validation set for the corresponding tree and allows for an approximately unbiased estimate of the prediction error.
(ii) Each bootstrap sample from step (i) is split at every node of its tree. To reduce the risk of overfitting, a maximum number of features is specified, and the optimal split at each node is chosen among a random subset of that size drawn from the complete set of predictors in the feature space. This helps to ensure diversity among the trees and avoids building duplicate models.
(iii) All trees built from the bootstrap samples grow based on the splits defined in step (ii). Their growth is restricted by defining an upper limit, which can be achieved by configuring a hyperparameter governing the maximum depth or by specifying the minimum number of samples expected in the final node. Regulating the maximum size of the trees (pruning) is intended to decrease the structural complexity of the model, reducing noise and keeping the model simple.
(iv) The regression prediction is obtained as the arithmetic mean of the responses from all the regression trees.
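The four steps above can be sketched in a few lines of plain Python. This is a deliberately reduced illustration, not the scikit-learn implementation used in the study: each "tree" is a depth-1 regression stump (so the per-node feature subset coincides with a per-tree subset), the OOB indices are only recorded rather than used for scoring, and all names and data are hypothetical.

```python
import random

def fit_stump(X, y, feature_ids):
    """Depth-1 regression tree: the best single split among the candidate features."""
    best = None
    for j in feature_ids:
        for thr in sorted({row[j] for row in X})[:-1]:   # keep both sides non-empty
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - mean_l) ** 2 for v in left) + sum((v - mean_r) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, mean_l, mean_r)
    if best is None:  # every candidate feature is constant in this bootstrap sample
        mean_y = sum(y) / len(y)
        return (feature_ids[0], float("inf"), mean_y, mean_y)
    return best[1:]

def predict_stump(stump, row):
    j, thr, mean_l, mean_r = stump
    return mean_l if row[j] <= thr else mean_r

def fit_forest(X, y, n_trees, max_features, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # (i) bootstrap, with replacement
        oob = sorted(set(range(n)) - set(idx))          # (i) out-of-bag rows of this tree
        feats = rng.sample(range(p), max_features)      # (ii) random feature subset
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)  # (iii) depth 1
        forest.append((stump, oob))
    return forest

def predict_forest(forest, row):
    preds = [predict_stump(stump, row) for stump, _ in forest]
    return sum(preds) / len(preds)                      # (iv) arithmetic mean

# Hypothetical data: the target grows with feature 0; feature 1 is uninformative.
X = [[float(i), float(i % 3)] for i in range(12)]
y = [0.5 * i for i in range(12)]
forest = fit_forest(X, y, n_trees=25, max_features=1, seed=42)
pred = predict_forest(forest, [10.0, 1.0])
```

Limiting each tree to a single split plays the role of the depth limit in step (iii); in practice, deeper trees and the library's own OOB scoring would be used.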

Effective hyperparameter tuning is crucial to ensure optimal model performance and prevent overfitting. In the context of runoff forecasting, the most influential hyperparameter is the number of trees (n_estimators) [27]. Additionally, the hyperparameters max_depth (the maximum depth a tree can reach) and max_features (the maximum number of features considered when performing a split) are notably influential as well [27]. To find the best combination of these three hyperparameters (n_estimators, max_depth, and max_features), a systematic search was conducted using a random grid search methodology within a 3-fold cross-validation framework. Model performance was evaluated using the Nash-Sutcliffe Efficiency (NSE), a measure of agreement between simulations and observations, which is defined in Section 3.3. Table 1 presents the grid search space of the three hyperparameters in the optimization process.

The RF forecasting models were implemented using the scikit-learn package for machine learning in Python® version 3.7 [38].

3.2.3. Object-Based Approach to Derive Precipitation Attributes for Enhanced Forecasting Models

The precipitation radar data associated with the identified independent peak runoff events were processed using the object-based approach (OBA) introduced by Laverde-Barajas et al. [26]. The OBA methodology employs algorithms, including size filtering and morphological closing, to derive precipitation characteristics from remote sensing (RS) data. The resulting attributes offer a detailed representation of precipitation events, encompassing information such as their spatial distribution (location of precipitation objects in the catchment, area of the objects) and meteorological properties (volume, intensities). The OBA was implemented using the scikit-image processing package within Python® version 3.7 [39].

Overview of Object-Based Approach (OBA) Process Implementation

An overview of the OBA’s application in this study is presented below, while a comprehensive description can be found in Laverde-Barajas et al. [26].

(i) Data retrieval: The precipitation radar data for the identified peak runoff events were retrieved, along with the clipping of imagery to the Tomebamba catchment (Figure 4a).

(ii) Detection of precipitation objects: The process of detecting precipitation objects begins with the definition of a detection sensitivity threshold. This threshold is set to filter out unwanted noise and retain only well-defined precipitation entities within the precipitation imagery. Calibration of the detection sensitivity was carried out through iterative experimentation, resulting in the selection of a volume threshold of precipitation of 0.1 mm. This implies that precipitation features with depths less than 0.1 mm were excluded (Figure 4b).

(iii) Size filtering: A filter based on size criteria was applied to the objects detected in step (ii). The criteria define the minimum object area to be considered as a precipitation entity. In this instance, four pixels were chosen, equivalent to 1 km², as the minimum area (Figure 4c).

(iv) Morphological closing: The morphological closing technique was employed to refine the precipitation objects retained in step (iii), which involves expanding and/or removing boundaries of the objects (Figure 4d). This algorithm combines dilation and erosion processes to enhance the delineation of precipitation features. During dilation, the boundaries of the precipitation objects are expanded, while erosion subsequently removes these expanded boundaries. This sequential operation of morphological dilation followed by erosion aids in the precise delineation of convective entities, ensuring a more accurate representation of precipitation patterns.

(v) Determination of precipitation attributes: From the refined objects in step (iv), physical characteristics, such as the centroid location and spatial extent, along with meteorological attributes, like the precipitation volume, were retrieved. These characteristics are further detailed in the subsequent subsection.
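Steps (ii) to (iv) can be sketched on a small grid as follows. The study relied on scikit-image; the version below is a self-contained stand-in written in plain Python so that every operation is visible. The 4-connectivity, the 3 × 3 structuring element, the per-object closing, and the sample grid are all assumptions made for illustration; only the 0.1 mm threshold and the four-pixel minimum size come from the text.

```python
NEIGHBOURHOOD = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]  # 3 x 3 element

def threshold_mask(depth, min_depth=0.1):
    """Step (ii): keep pixels with at least `min_depth` mm of precipitation."""
    return [[value >= min_depth for value in row] for row in depth]

def label_objects(mask):
    """Step (ii): 4-connected components via flood fill; returns a list of pixel sets."""
    rows, cols = len(mask), len(mask[0])
    seen, objects = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                stack, pixels = [(r, c)], set()
                seen.add((r, c))
                while stack:
                    i, j = stack.pop()
                    pixels.add((i, j))
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        inside = 0 <= ni < rows and 0 <= nj < cols
                        if inside and mask[ni][nj] and (ni, nj) not in seen:
                            seen.add((ni, nj))
                            stack.append((ni, nj))
                objects.append(pixels)
    return objects

def size_filter(objects, min_pixels=4):
    """Step (iii): discard objects smaller than `min_pixels`."""
    return [obj for obj in objects if len(obj) >= min_pixels]

def dilate(pixels):
    return {(i + di, j + dj) for i, j in pixels for di, dj in NEIGHBOURHOOD}

def erode(pixels):
    return {(i, j) for i, j in pixels
            if all((i + di, j + dj) in pixels for di, dj in NEIGHBOURHOOD)}

def close_object(pixels):
    """Step (iv): morphological closing, i.e. dilation followed by erosion."""
    return erode(dilate(pixels))

# Hypothetical hourly depths (mm): a ring-shaped object with a dry hole, plus one
# isolated wet pixel that the size filter should remove.
depth = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.3, 0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
detected = label_objects(threshold_mask(depth))
kept = size_filter(detected)
closed = [close_object(obj) for obj in kept]
```

In this example the closing fills the dry pixel enclosed by the ring, so the object becomes a solid 3 × 3 block, while the isolated pixel is dropped by the size filter.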

Object Attributes

Three key precipitation attributes for the forecasting models were retrieved from the radar data: the precipitation volume, the areal extent of the precipitation objects, and the object distance, i.e., the distance between the centroid of each precipitation object and the catchment outlet.

The volume of precipitation provided the model with a comprehensive understanding of the water quantity that precipitated during that specific hour. The area allowed us to capture the spatial extent of the precipitation, providing insights for the model into the distribution and coverage of the precipitation. Additionally, the distance from the precipitation objects to the catchment outlet was calculated using the distance between two points. This distance contributed spatial information to the model, helping determine how far from the outlet the precipitation occurs and providing the model with an estimate of the time it takes for that precipitation to reach the outlet.
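The three attributes can be computed for a single object as in the sketch below. The 0.5 km pixel size is inferred from the statement that four pixels correspond to 1 km² and should be treated as an assumption, as should the use of a straight-line distance in grid coordinates, the function name, and the sample values.

```python
import math

PIXEL_SIZE_KM = 0.5  # assumption: four pixels = 1 km^2 implies 0.5 km x 0.5 km pixels

def object_attributes(pixels, depth, outlet_rc):
    """Area (km^2), volume (m^3) and centroid-to-outlet distance (km) of one object."""
    pixel_area_km2 = PIXEL_SIZE_KM ** 2
    area_km2 = len(pixels) * pixel_area_km2
    # 1 mm of depth over 1 km^2 corresponds to 1000 m^3 of water.
    volume_m3 = sum(depth[i][j] for i, j in pixels) * pixel_area_km2 * 1000.0
    centroid_i = sum(i for i, _ in pixels) / len(pixels)
    centroid_j = sum(j for _, j in pixels) / len(pixels)
    distance_km = math.hypot(centroid_i - outlet_rc[0],
                             centroid_j - outlet_rc[1]) * PIXEL_SIZE_KM
    return area_km2, volume_m3, distance_km

# Hypothetical object: a 2 x 2 block with 2 mm of precipitation in every pixel,
# and an outlet located at grid position (3.5, 4.5).
depth = [[2.0, 2.0], [2.0, 2.0]]
pixels = {(0, 0), (0, 1), (1, 0), (1, 1)}
area, volume, distance = object_attributes(pixels, depth, outlet_rc=(3.5, 4.5))
```

For an hour with several objects, the per-object areas and volumes would be summed to obtain the totals used as model inputs.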

3.3. Model Evaluation between Referential and Enhanced Models

For model evaluation, we split the near-independent peak events into two sets: 80% for training and 20% for testing. Each event was utilized to simulate peak runoff within a 25 h window, covering the peak runoff and the 12 h before and after it, in order to capture the entire hydrograph.

To evaluate the model performance, two of the most widely used indices in hydrology for assessing the goodness of fit between model simulations and observations were selected: the Nash-Sutcliffe efficiency (NSE) and the Kling-Gupta efficiency (KGE) [40]. These two indices, along with the Root Mean Square Error (RMSE), were chosen to assess the different aspects of model performance. The KGE is particularly effective in accounting for peak runoff underestimations and low runoff overestimations, while the NSE, also known as the coefficient of efficiency, is less sensitive to extreme high values, providing a robust measure of the overall model accuracy [41]. The equations for these metrics can be found in Table 2.
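For reference, the three metrics can be written compactly as below. Since Table 2 is not reproduced here, the formulas follow the commonly used definitions, with the KGE in its original three-component form (correlation, variability ratio, bias ratio); whether this matches the exact variant in Table 2 is an assumption. The sample series are hypothetical.

```python
import math

def rmse(obs, sim):
    """Root Mean Square Error, in the units of the observations."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance

def kge(obs, sim):
    """Kling-Gupta efficiency; undefined if either series is constant."""
    n = len(obs)
    mean_o, mean_s = sum(obs) / n, sum(sim) / n
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    r = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / (n * std_o * std_s)
    alpha = std_s / std_o    # variability ratio
    beta = mean_s / mean_o   # bias ratio
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

# Hypothetical observed and simulated runoff (m^3/s).
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim = [1.1, 1.9, 3.2, 3.8, 5.0]
scores = (rmse(obs, sim), nse(obs, sim), kge(obs, sim))
```

A simulation equal to the observed mean at every step gives NSE = 0, which is the usual benchmark for judging whether a model adds skill.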

For the comparison between the referential and enhanced models, we first compared the values of the efficiency metrics described above for each of the 1, 3, and 6 h lead times. Additionally, a visual comparison was conducted by examining hydrographs of specific events for the same lead times, in which the observed runoff was compared to the forecasts of both the referential and enhanced models.

4. Results

4.1. Independent Peak Runoff Events

Nearly independent peak runoff events were defined using the following calibrated parameters in the WETSPRO tool. First, a difference of 10% was allowed with the subflow (baseflow), and second, a minimum peak height was obtained from the 90th percentile value (17.92 m³/s) (Figure 5). The baseflow was derived from the original runoff time series using recession constant values of 500, 60, and 5 h for baseflow, interflow, and overland flow, respectively. The parameter w obtained from the calibration was 0.7 for baseflow and 0.5 for interflow. With these criteria, and considering the availability of radar data, 29 independent peak hydrological events were obtained. Ordered chronologically, the first 23 events (80% of the total) were used for training the models, while the remaining 6 events (20% of the total) were reserved for testing.
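The chronological 80/20 split of the 29 events can be reproduced with a few lines. The event identifiers are hypothetical placeholders, and rounding 0.8 × 29 = 23.2 down to 23 training events is how the split is assumed to have been obtained.

```python
def chronological_split(events, train_fraction=0.8):
    """Split chronologically ordered events into leading train and trailing test sets."""
    ordered = sorted(events)
    n_train = round(train_fraction * len(ordered))
    return ordered[:n_train], ordered[n_train:]

# Hypothetical event ids 1..29, where a larger id means a later event.
events = list(range(1, 30))
train_events, test_events = chronological_split(events)
```

Splitting by time rather than at random keeps every test event later than all training events, which mimics operational forecasting.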

4.2. Development of Peak Runoff Forecasting Models

Referential models were developed incorporating, in addition to the latest precipitation and runoff data, a fixed number of precipitation and runoff lags. To determine the number of runoff lags, we used both the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The ACF assesses the correlation between a value in a time series and its past values, including the influence of intermediate time intervals, whereas the PACF isolates the direct correlation at each lag and thus provides a more targeted criterion. From the ACF, we obtained 260 significant lags (approximately 11 days) when using a 95% confidence band (Figure 6a). This result was complemented with the PACF and its respective 95% confidence level. The PACF analysis revealed a significant correlation up to lag 8 (hours) (Figure 6b). Based on both results, we defined 8 as the appropriate number of runoff lags.

Similarly, the number of precipitation lags was defined using Pearson’s cross-correlation between each precipitation pixel (1210) and the runoff time series (i.e., a correlation curve was generated for each pixel with the runoff time series). For this, a correlation threshold of 0.2 was employed, as suggested by Muñoz et al. [5]. With this threshold, we determined 12 as the number of precipitation lags (hours). The maximum correlation (0.31) appeared at lag 5, which agrees with the time of concentration of the catchment [5].

For the enhanced models, we enriched the IFS of referential models with additional information derived from precipitation radar. This additional information was obtained by applying the OBA to the precipitation data, as described in Section 3.2.3. This new dataset replaced the original precipitation data (pixel-based timeseries) used in the referential models. The inputs related to runoff remained unchanged from those used in the referential models.

Table 3 presents the optimal hyperparameter combinations for both the referential and enhanced forecasting models across increasing lead times. The number of trees in the referential models does not show a clear pattern, as it changes independently of the lead time. In contrast, the enhanced models exhibit variation across lead times, with the 1 h lead time having the highest number (420 trees). This suggests a more complex ensemble structure for short-term predictions. Moreover, the referential models consistently used a high value of max features (9688), reflecting their dependence on a broad set of input features, which aligns with the relatively larger number of features employed in these models.

The enhanced models, however, adopted a different approach. For the 1 h and 3 h lead times, they employed n_features (32 features), whereas for the 6 h lead time, they used the log base 2 of n_features (5 features). This could be attributed to the diminishing relevance of some features for the 6 h lead time, as they provide information beyond the catchment’s concentration time. Consequently, using fewer features yields similar results. In addition, the referential models exhibit varying maximum depths, with the 3 h lead time having the lowest depth (5). This may indicate a preference for shallow trees in this case. The choice of shallow trees suggests a modeling strategy prioritizing the capture of simpler and more general patterns in the data, limiting the model’s complexity and mitigating the risk of overfitting.

On the other hand, enhanced models display different depths for different lead times. The 3 h lead time has the highest depth (65), indicating a more complex tree structure. It can be considered that the attributes given to these enhanced models are more effectively leveraged and contribute more substantially to the modeling process. All in all, the enhanced models seem to adopt a more focused approach, particularly evident in the reduction of max features. This implies an attempt to refine the model’s concentration based on the most influential characteristics, potentially enhancing interpretability.

Precipitation Attributes for Enhanced Forecasting Models

From the events identified in Section 4.1, we determined precipitation objects for each hour, as illustrated in Figure 4, and obtained their respective precipitation attributes (area, volume, and distance). For each event and hour, as expected, the attributes varied in magnitude. Below, the different attributes that were identified, along with their respective physical meanings, are presented.

In Figure 7, two precipitation objects represented by circles are observed. For illustration purposes, let us consider that each object corresponds to a different time, but both have the same volume.

Analyzing the area extent of each object, illustrated in Figure 7, one object is noted to have a larger area than the other. Consequently, it is feasible to infer that the object with the smaller area has more intense and localized precipitation, for the same volume, compared to the larger object. This analysis could be reversed with two equal areas but different volumes, in which case the intensity of the precipitation event would depend solely on how large the volume is.

Furthermore, different distances from the centroid of the objects to the outlet of the catchment were obtained. Illustrating this with the example from Figure 7, the smaller object is positioned closer to the outlet, resulting in a shorter distance. This information is relevant for the model as it provides insights into the time required for precipitation to leave the catchment. In the case of the object with the longer distance, it would theoretically take more time for the precipitation to exit the catchment.

In summary, each object presented its unique characteristics, and the physical information corresponding to each precipitation object described in this example was added to the enhanced models through these attributes (area, volume, and distance).
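
The three attributes can be illustrated with a small sketch that reproduces the two-object example of Figure 7. Everything in it is an assumption made for illustration: the pixel size, the grid coordinates, the rainfall depths, the outlet position, the use of an unweighted centroid, and the helper name `object_attributes`. It is not the authors' implementation.

```python
import math

PIXEL_SIZE_KM = 0.5  # assumed pixel edge length; not taken from the paper


def object_attributes(pixels, outlet_xy):
    """Return area (km2), volume (m3), and centroid-to-outlet distance (km).

    pixels: dict mapping (col, row) grid indices to rainfall depth in mm.
    outlet_xy: (x, y) position of the catchment outlet in km.
    """
    pixel_area_km2 = PIXEL_SIZE_KM ** 2
    area_km2 = len(pixels) * pixel_area_km2
    # 1 mm of rain over 1 km2 equals 1000 m3 of water.
    volume_m3 = sum(pixels.values()) * pixel_area_km2 * 1000.0
    # Geometric (unweighted) centroid of the pixel centres.
    cx = sum(col + 0.5 for col, _ in pixels) / len(pixels) * PIXEL_SIZE_KM
    cy = sum(row + 0.5 for _, row in pixels) / len(pixels) * PIXEL_SIZE_KM
    distance_km = math.hypot(cx - outlet_xy[0], cy - outlet_xy[1])
    return {"area_km2": area_km2, "volume_m3": volume_m3, "distance_km": distance_km}


outlet = (0.0, 0.0)  # hypothetical outlet position
# Two hypothetical objects with the same volume but different areas.
small = {(c, r): 10.0 for c in range(0, 2) for r in range(0, 2)}
large = {(c, r): 2.5 for c in range(4, 8) for r in range(4, 8)}

for name, obj in (("small", small), ("large", large)):
    attrs = object_attributes(obj, outlet)
    mean_depth_mm = attrs["volume_m3"] / (attrs["area_km2"] * 1000.0)
    print(name, attrs, round(mean_depth_mm, 2))
```

Both objects carry the same volume, but the smaller one has a mean depth four times higher and lies closer to the outlet, which is the contrast described for Figure 7.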

4.3. Comparison between Referential and Enhanced Models

Figure 8 illustrates the comparative results of the referential models (orange dots) and enhanced models (blue dots) using scatter plots. The improvement in the enhanced models becomes apparent as their points lie closer to the dashed 45-degree line, which marks where observed and predicted values match. For the 1 h lead time, both the referential and enhanced models lie close to the 45-degree dashed line, indicating a high level of accuracy. However, at higher runoff values (exceeding 90 m³/s, beyond the 99th percentile), a substantial improvement is evident in the enhanced models, with all blue values closely adhering to the bisector line. In the range of 40 m³/s to 80 m³/s (approximately between the 98th and 99th percentiles), although some dispersion is present, the blue values lie progressively closer to the line.

Moving to the 3 h lead time, a greater dispersion is observed; however, overall, and particularly for runoff values exceeding 80 m³/s, it is evident that the results of the enhanced models approach the guideline more closely. While there is a tendency to underestimate the observed values, this underestimation is less pronounced than in the referential models. Lastly, for the 6 h lead time, despite increased dispersion, the values of the referential models remain the farthest from the dashed 45-degree line, especially in terms of underestimation. The improvement for this lead time is perceptible in the scatter plot, demonstrating a positive impact on the model’s performance.

In addition, as outlined in Section 3.3, the models were evaluated using three performance metrics: the NSE, the KGE, and the RMSE. For all metrics, the best values (highest NSE and KGE, lowest RMSE) were achieved for the shortest forecast lead time (1 h), and performance degraded progressively for the 3 and 6 h lead times. Furthermore, the enhanced models consistently outperformed the referential models across all metrics and lead times, as demonstrated in Table 4. The best performances, encountered for the 1 h lead time (NSE = 0.93 and 0.94 for the referential and enhanced models, respectively), can be attributed to the autoregressive nature of runoff, which is magnified for the shorter lead times. The decreasing accuracy observed with longer forecasting horizons, such as 3 and 6 h, can be attributed mainly to the growing number of hours within the forecast window for which no precipitation information is available; this could be mitigated, for example, by incorporating precipitation forecasts.

Figure 9 presents hydrographs for each lead time, illustrating two cases from the testing events subset for which meaningful comparisons can be made. Two events were selected: one that clearly demonstrates the improvements and another in which the improvements are less evident. This selection is intended to keep the evaluation rigorous and to provide a balanced overview of the results.

For the 1 h lead time, the values of the referential models closely resemble those of the enhanced models and observed data, particularly evident in event 2. However, notable improvements, especially in peak runoff, are observed in event 1. This enhancement is reflected in the percentage of improvement presented in Table 5. Moving to the 3 h lead time, a more pronounced difference in forecasts is noticeable, with the enhanced models outperforming the referential models, as indicated in Table 5. Despite some underestimation in event 1, there is a noticeable improvement compared to the referential model. Similarly, for the 6 h lead time, enhancements in the enhanced models are evident, even with occasional instances of greater underestimation. However, in cases like event 2, the enhanced models closely approach the observed values, surpassing the performance of the referential models, as mentioned earlier. Furthermore, notable shifts were observed in the peaks of event 2, Figure 9b, likely arising from the absence of precipitation information within the forecast window. This error increases as the forecast window duration extends. Addressing such disparities could involve forecasting precipitation and/or incorporating an additional objective function, such as the time-to-peak during training. However, these considerations are beyond the scope of this study.

Finally, considering all evaluated events, a comprehensive comparison of improvements between the enhanced models and referential models was conducted. According to Table 5, for the 1 h lead time, efficiency improvements are small but noticeable across all metrics. At the 3 h lead time, improvements of at least 15% are observed, with the KGE metric showing the largest increase (23%) in the enhanced models compared to the referential models. Although the metrics are lower for the 6 h lead time than for the 3 and 1 h lead times (see Table 4), the enhanced models still improve on the referential models, reaching up to an 18% increase in NSE.

5. Discussion

For the purposes of this study, peak runoff forecasting models were developed using the RF algorithm for a mountain catchment located in the Ecuadorian Andes. The methodology employed in this study aims to enhance peak runoff forecasts by exploiting precipitation estimates retrieved from weather radar data using a feature engineering strategy with an object-based approach to derive precipitation attributes.

We developed referential models for lead times ranging from 1 to 6 h to address peak runoff forecasting in the study catchment. In addition to these referential models, our focus was on analyzing weather radar precipitation using an OBA to generate new precipitation attributes to add to the models and thus create enhanced models. The enhancement of the models, based on precipitation attributes such as the area and volume of the precipitation objects and the distance from their centroid to the catchment outlet, shows the advantages of applying FE to the already acceptable referential models.

The performance of the referential models, as measured based on the NSE, ranged from 0.42 to 0.93. These results are comparable to a study utilizing radar data and RF, with NSE values between 0.66 and 0.85 [9]. These results also align with studies employing radar data in physically based models, like HEC-HMS [42], with NSE values between 0.55 and 0.98, or TOPMODEL [43], with NSE values between 0.64 and 0.91. While the aim of this study was not to outperform physically based models that use radar data, it is important to note that ML models, which require less data preprocessing and do not rely on simplifying assumptions to represent complex systems, facilitated faster forecast generation, with similar results.

Furthermore, the performance of our models is consistent with studies in runoff forecasting that utilize different machine learning techniques. This is supported by Noymanee et al. [44] in their flood forecasting study, where they achieved NSE values ranging from 0.51 to 0.8 for lead times of 3 and 6 h using different machine learning methods, including neural networks, Bayesian linear regression, and boosted decision trees.

The performance of the enhanced models, with NSE ranging from 0.50 to 0.94, is superior to that of the referential models for all lead times considered in the study (1, 3, and 6 h). Even for the 1 h lead time, where the referential model’s efficiency was already high and left limited room for improvement, the performance improved slightly.

These enhancements can be attributed to the new information provided to the enhanced models through the feature engineering strategy proposed in this study. This new information is expected to add physical insights to the models. To prove this statement, further analysis is required, such as local and global sensitivity analyses to determine the impact of each attribute and the total number of attributes.

Key features included the volume of rain objects, providing an estimate of the amount of water that would contribute to runoff, in combination with the area of rain objects, which helped determine whether the volume mentioned earlier was concentrated in a small area (intense localized rain) or distributed over a larger area. For a specific volume, more intense rain is represented when it falls over a smaller area, leading to soil saturation and the faster conversion of rain into runoff. In addition to area and volume, the distance from the centroid of the rain object to the outlet was extracted.

Analyzing the event of 24 May 2021, it was observed that a rain object concentrated near the outlet of the catchment improved the 1 h forecast since the model interpreted that this rain, being near the outlet, would exit relatively soon. For a 3 h forecast, precipitation data from the middle and upper parts of the catchment (which has a time of concentration of 5 h) are more likely to contribute, as precipitation very close to the outlet would already be considered to have left the catchment. Theoretically, for a 6 h forecast, all observed precipitation should have already left the catchment (one of the reasons for the lower efficiencies at this lead time). However, the feature data still improved the efficiencies, possibly because they also provide certain physical insights, such as antecedent moisture conditions in different areas of the catchment.

Building on the conclusions of Laverde-Barajas et al. [26], we demonstrated the potential of applying an object-based approach to remote sensing products other than satellite sources. Specifically, we explored the OBA for X-band radar data and found it effective in improving peak runoff forecasts. However, it is important to acknowledge that, due to data availability constraints, enhanced models could not be developed separately for classes of events based on their duration and area, as conducted in the study of Laverde-Barajas et al. [26]. Nevertheless, it is anticipated that with an increased number of peak runoff events, this approach could further enhance the models.

A potential extension of this study would be to involve feature engineering techniques that focus on obtaining additional variables derived from remote sensing data, such as satellite imagery. These variables may include, but are not limited to, soil moisture, as demonstrated by Massari et al. [45], watershed topography, as shown by Tripathi et al. [46], and geomorphic and biophysical parameters, such as the Normalized Difference Vegetation Index (NDVI) and the Index of Connectivity (IC), as presented by Asadi et al. [47]. By incorporating these variables, the study could potentially enhance its predictive power and provide valuable insights into the underlying mechanisms driving runoff generation in the study area.

A further step could be to determine whether the observed peak flows lead to flooding. This can be achieved by establishing flow thresholds, analyzing historical flood events, or deriving this information from an extensive runoff dataset. Producing flood models requires an evaluation with additional metrics beyond those used in this study, specifically categorical metrics. Such a system could be assessed using metrics such as the probability of detection (POD), the false alarm ratio (FAR), and/or the critical success index (CSI) [48].
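
As a rough illustration of how such categorical metrics are computed, the sketch below derives POD, FAR, and CSI from a contingency count of threshold exceedances. The runoff values and the 80 m³/s flood threshold are hypothetical and chosen only for the example; the function name `categorical_scores` is likewise an assumption.

```python
def categorical_scores(observed, predicted, threshold):
    """Return (POD, FAR, CSI) for exceedances of a runoff threshold."""
    hits = misses = false_alarms = 0
    for obs, pred in zip(observed, predicted):
        obs_event = obs >= threshold
        pred_event = pred >= threshold
        if obs_event and pred_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif pred_event:
            false_alarms += 1
    nan = float("nan")
    # POD: fraction of observed events that were forecast.
    pod = hits / (hits + misses) if hits + misses else nan
    # FAR: fraction of forecast events that did not occur.
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    # CSI: hits relative to all observed or forecast events.
    total = hits + misses + false_alarms
    csi = hits / total if total else nan
    return pod, far, csi


# Hypothetical hourly runoff values in m3/s and a hypothetical threshold.
observed = [20.0, 95.0, 110.0, 40.0, 85.0]
predicted = [25.0, 90.0, 70.0, 82.0, 88.0]
pod, far, csi = categorical_scores(observed, predicted, threshold=80.0)
print(pod, far, csi)
```

In this toy series there are two hits, one miss, and one false alarm, so the scores reward correct exceedance forecasts while penalizing both missed peaks and false warnings.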

6. Conclusions

In this study, we developed enhanced peak runoff forecasting models by exploiting precipitation data retrieved from an X-band weather radar. This was performed by applying a Feature Engineering (FE) strategy based on an object-based approach to derive key precipitation attributes instead of using pixel-based time series. To assess the effectiveness of the FE strategy, we conducted a comparative analysis of the performance metrics of the referential and enhanced models for lead times ranging from 1 to 6 h. Based on the results, the following conclusions can be drawn:

The application of the FE strategy resulted in enhanced model efficiencies and enabled us to better leverage precipitation radar data by incorporating attributes of precipitation, such as the precipitation volume, areal extension of precipitation objects, and the distance between the centroid of these objects and the outlet of the catchment.
All enhanced models demonstrated improvements in their efficiencies. Notably, the models for 3 and 6 h lead times exhibited more significant enhancements compared to the 1 h forecast, where the autoregressive behavior already produced an efficient model.
To fully utilize the high spatial resolution of radar data for modeling, it is crucial to extract relevant attributes, rather than using the entire dataset, which introduces noise to the models. The enhanced models achieved a significant reduction in input data, emphasizing the efficiency gained through selective attribute extraction. This highlights a simplified method that optimally utilizes ground-based radar data.
This study has demonstrated the positive impact of improving the representativeness of precipitation retrieved from a high-resolution X-band weather radar. By extracting relevant attributes from high-resolution imagery, we were able to better capture the spatial characteristics of precipitation and improve the assimilation of this information into RF models.

Author Contributions

Conceptualization, J.Á.-E. and P.M.; methodology, J.Á.-E. and P.M.; software, J.Á.-E. and P.C.; modeling and formal analysis J.Á.-E.; writing—original draft preparation, J.Á.-E.; writing—review and editing, P.M., J.B. and R.C.; supervision, P.M. and R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Vice-rectorate for Research of the University of Cuenca (VIUC) through the project “Data fusion of remote sensing products and machine learning feature engineering strategies for near-real time runoff forecasting”.

Data Availability Statement

Data are not publicly available due to institutional policies.

Acknowledgments

This research was funded by the Vice-rectorate for Research of the University of Cuenca (VIUC) through the project titled “Data fusion of remote sensing products and machine learning feature engineering strategies for near-real-time runoff forecasting.” We extend our sincere gratitude to this institution for its generous funding. Additionally, the authors wish to express their appreciation to the Editors and anonymous Reviewers for their constructive comments, which have significantly contributed to enriching the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Mosavi, A.; Ozturk, P.; Chau, K.-W. Flood prediction using machine learning models: Literature review. Water 2018, 10, 1536.
2. Falck, A.; Maggioni, V.; Tomasella, J.; Diniz, F.; Mei, Y.; Beneti, C.; Herdies, D.; Neundorf, R.; Caram, R.; Rodriguez, D. Improving the use of ground-based radar rainfall data for monitoring and predicting floods in the Iguaçu river basin. J. Hydrol. 2018, 567, 626–636.
3. Rozalis, S.; Morin, E.; Yair, Y.; Price, C. Flash flood prediction using an uncalibrated hydrological model and radar rainfall data in a Mediterranean watershed under changing hydrological conditions. J. Hydrol. 2010, 394, 245–255.
4. Stefanidis, S.; Stathis, D. Assessment of flood hazard based on natural and anthropogenic factors using analytic hierarchy process (AHP). Nat. Hazards 2013, 68, 569–585.
5. Muñoz, P.; Orellana-Alvear, J.; Willems, P.; Célleri, R. Flash-flood forecasting in an andean mountain catchment-development of a step-wise methodology based on the random forest algorithm. Water 2018, 10, 1519.
6. Anagnostou, M.N.; Nikolopoulos, E.I.; Kalogiros, J.; Anagnostou, E.N.; Marra, F.; Mair, E.; Bertoldi, G.; Tappeiner, U.; Borga, M. Advancing precipitation estimation and streamflow simulations in complex terrain with X-Band dual-polarization radar observations. Remote Sens. 2018, 10, 1258.
7. Bournas, A.; Baltas, E. Comparative analysis of rain gauge and radar precipitation estimates towards rainfall-runoff modelling in a peri-urban basin in Attica, Greece. Hydrology 2021, 8, 29.
8. Grek, E.; Zhuravlev, S. Simulation of rainfall-induced floods in small catchments (The polomet’ river, north-west russia) using rain gauge and radar data. Hydrology 2020, 7, 92.
9. Orellana-Alvear, J.; Celleri, R.; Rollenbeck, R.; Muñoz, P.; Contreras, P.; Bendix, J. Assessment of native radar reflectivity and radar rainfall estimates for discharge forecasting in mountain catchments with a random forest model. Remote Sens. 2020, 12, 1986.
10. Beven, K. Rainfall-Runoff Modelling; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2012.
11. Yu, P.-S.; Yang, T.-C.; Chen, S.-Y.; Kuo, C.-M.; Tseng, H.-W. Comparison of random forests and support vector machine for real-time radar-derived rainfall forecasting. J. Hydrol. 2017, 552, 92–104.
12. Lohani, A.K.; Goel, N.; Bhatia, K. Improving real time flood forecasting using fuzzy inference system. J. Hydrol. 2014, 509, 25–41.
13. Tayfur, G.; Singh, V.P.; Moramarco, T.; Barbetta, S. Flood hydrograph prediction using machine learning methods. Water 2018, 10, 968.
14. Biau, G.; Scornet, E. A random forest guided tour. TEST 2016, 25, 197–227.
15. Liu, D.; Fan, Z.; Fu, Q.; Li, M.; Faiz, M.A.; Ali, S.; Li, T.; Zhang, L.; Khan, M.I. Random forest regression evaluation model of regional flood disaster resilience based on the whale optimization algorithm. J. Clean. Prod. 2020, 250, 119468.
16. Al-Fawa’reh, M.; Hawamdeh, A.; Alrawashdeh, R.; Jafar, M.T. Intelligent Methods for flood forecasting in Wadi al Wala, Jordan. In Proceedings of the 2021 International Congress of Advanced Technology and Engineering (ICOTEN), Virtual, 4–5 July 2021.
17. Choi, C.; Kim, J.; Kim, J.; Kim, D.; Bae, Y.; Kim, H.S. Development of heavy rain damage prediction model using machine learning based on big data. Adv. Meteorol. 2018, 2018, 5024930.
18. Pollard, J.A.; Spencer, T.; Jude, S. Big Data Approaches for coastal flood risk assessment and emergency response. WIREs Clim. Chang. 2018, 9, e543.
19. Fang, Z.; Wang, Y.; Peng, L.; Hong, H. Predicting flood susceptibility using LSTM neural networks. J. Hydrol. 2020, 594, 125734.
20. Muñoz, P.; Corzo, G.; Solomatine, D.; Feyen, J.; Célleri, R. Near-real-time satellite precipitation data ingestion into peak runoff forecasting models. Environ. Model. Softw. 2022, 160, 105582.
21. Yang, Y.; Chui, T.F.M. Modeling and interpreting hydrological responses of sustainable urban drainage systems with explainable machine learning methods. Hydrol. Earth Syst. Sci. 2021, 25, 5839–5858.
22. Miao, Q.; Pan, B.; Wang, H.; Hsu, K.; Sorooshian, S. Improving monsoon precipitation prediction using combined convolutional and long short term memory neural network. Water 2019, 11, 977.
23. Kim, G.; Barros, A.P. Quantitative flood forecasting using multisensor data and neural networks. J. Hydrol. 2001, 246, 45–62.
24. Davis, C.A.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part I: Application to convective rain systems. Mon. Weather Rev. 2006, 134, 1785–1795.
25. Laverde-Barajas, M.; Perez, G.C.; Chishtie, F.; Poortinga, A.; Uijlenhoet, R.; Solomatine, D. Decomposing satellite-based rainfall errors in flood estimation: Hydrological responses using a spatiotemporal object-based verification method. J. Hydrol. 2020, 591, 125554.
26. Laverde-Barajas, M.; Corzo, G.; Bhattacharya, B.; Uijlenhoet, R.; Dimitri, P.S. Spatiotemporal Analysis of Extreme Rainfall Events Using an Object-Based Approach; Elsevier Inc.: Amsterdam, The Netherlands, 2019.
27. Contreras, P.; Orellana-Alvear, J.; Muñoz, P.; Bendix, J.; Célleri, R. Influence of random forest hyperparameterization on short-term runoff forecasting in an andean mountain catchment. Atmosphere 2021, 12, 238.
28. Pesántez, J. Propuesta de un Modelo de Gestión de la Subcuenca del Río Tomebamba, Como Herramienta de Manejo Integrado y de Conservación; Universidad del Azuay: Cuenca, Ecuador, 2015.
29. Buytaert, W.; Célleri, R.; Timbe, L. Predicting climate change impacts on water resources in the tropical Andes: Effects of GCM uncertainty. Geophys. Res. Lett. 2009, 36, L07406.
30. Nieves, J.A.; Contreras, J.; Pacheco, J.; Urgilés, J.; García, F.; Avilés, A. Assessment of drought time-frequency relationships with local atmospheric-land conditions and large-scale climatic factors in a tropical Andean basin. Remote Sens. Appl. Soc. Environ. 2022, 26, 100760.
31. Hastenrath, S. On snow line depression and atmospheric circulation in the tropical americas during the pleistocene. S. Afr. Geogr. J. 1971, 53, 53–69.
32. Orellana-Alvear, J.; Célleri, R.; Rollenbeck, R.; Bendix, J. Optimization of X-Band radar rainfall retrieval in the southern andes of ecuador using a random forest model. Remote Sens. 2019, 11, 1632.
33. Orellana-Alvear, J.; Célleri, R.; Rollenbeck, R.; Bendix, J. Analysis of rain types and their z–r relationships at different locations in the high andes of southern ecuador. J. Appl. Meteorol. Clim. 2017, 56, 3065–3080.
34. Willems, P. A time series tool to support the multi-criteria performance evaluation of rainfall-runoff models. Environ. Model. Softw. 2009, 24, 311–321.
35. Sudheer, K.P.; Gosain, A.K.; Ramasastri, K.S. A data-driven algorithm for constructing artificial neural network rainfall-runoff models. Hydrol. Process. 2002, 16, 1325–1330.
36. Li, M.; Zhang, Y.; Wallace, J.; Campbell, E. Estimating annual runoff in response to forest change: A statistical method based on random forest. J. Hydrol. 2020, 589, 125168.
37. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
39. van der Walt, S.; Schönberger, J.L.; Nunez-Iglesias, J.D.; Boulogne, F.; Warner, J.; Yager, N.; Gouillart, E.; Yu, T. scikit-image: Image processing in Python. PeerJ 2014, 2, e453.
40. Lamontagne, J.R.; Barber, C.A.; Vogel, R.M. Improved Estimators of Model Performance Efficiency for Skewed Hydrologic Data. Water Resour. Res. 2020, 56, e2020WR027101.
41. Krause, P.; Boyle, D.P.; Bäse, F. Comparison of different efficiency criteria for hydrological model assessment. Adv. Geosci. 2005, 5, 89–97.
42. Cho, Y. Application of NEXRAD Radar-Based Quantitative Precipitation Estimations for Hydrologic Simulation Using ArcPy and HEC Software. Water 2020, 12, 273.
43. Xiaoyang, L.; Jietai, M.; Yuanjing, Z.; Jiren, L. Run off Simulation Using Radar and Rain Gauge Data. Adv. Atmos. Sci. 2003, 20, 213–218.
44. Noymanee, J.; Nikitin, N.O.; Kalyuzhnaya, A.V. Urban Pluvial Flood Forecasting using Open Data with Machine Learning Techniques in Pattani Basin. Procedia Comput. Sci. 2017, 119, 288–297.
45. Massari, C.; Camici, S.; Ciabatta, L.; Brocca, L. Exploiting Satellite-Based Surface Soil Moisture for Flood Forecasting in the Mediterranean Area: State Update Versus Rainfall Correction. Remote Sens. 2018, 10, 292.
46. Tripathi, M.P.; Panda, R.K.; Pradhan, S.; Sudhakar, S. Runoff modelling of a small watershed using satellite data and GIS. J. Indian Soc. Remote Sens. 2002, 30, 39–52.
47. Asadi, H.; Shahedi, K.; Jarihani, B.; Sidle, R.C. Rainfall-Runoff Modelling Using Hydrological Connectivity Index and Artificial Neural Network Approach. Water 2019, 11, 212.
48. Schaefer, J.T. The Critical Success Index as an Indicator of Warning Skill. Weather Forecast. 1990, 5, 570–575.

Figure 1. Location and attributes of the Tomebamba catchment, located in the southern Ecuadorian Andes.

Figure 2. Hourly runoff data for the period from January 2015 to July 2021.

Figure 3. Methodology for the development and evaluation of peak runoff forecasting models. (a) Radar precipitation processing through Feature Engineering, and (b) forecast modeling and evaluation approach.

Figure 4. Precipitation detection with OBA. (a) Precipitation detection in the Tomebamba Catchment, (b) identification of five precipitation objects with the OBA, (c) one object after size filtering, and (d) one object after morphological closing.

Figure 5. Exceedance probability curve over the period January 2015–July 2021. The vertical red dashed line corresponds to the 90th percentile value (17.92 m³/s).

Figure 6. (a) Autocorrelation function (ACF) and (b) partial autocorrelation function (PACF) of the Matadero-Sayausi runoff series.

Figure 7. Representation of two precipitation objects.

Figure 8. Results for the referential and enhanced models for the 1, 3, and 6 h lead times.

Figure 9. Hydrographs of event 1 (24 May 2021, (a), left) and event 2 (18 May 2021, (b), right), with results from the referential models and the enhanced models for 1, 3, and 6 h.

Table 1. Grid of the RF hyperparameters.

Hyperparameter	Values
n_trees ^a	50; 800; 10
max_features	n_features ^b, n_features^(1/2), log₂(n_features)
max_depth ^a	5; 200; 5

Note(s): ^a domain defined by min, max, and increment. ^b n_features denotes the number of input features in the IFS.
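
The size of this search space can be sketched as follows. The snippet assumes that both endpoints of each domain are included and that the three max_features options are labelled "n_features", "sqrt", and "log2"; whether the authors evaluated the grid exhaustively or sampled from it is not stated here, so the enumeration is only illustrative.

```python
from itertools import product

# Domains from Table 1 (min; max; increment), endpoints assumed inclusive.
n_trees_values = list(range(50, 801, 10))
max_depth_values = list(range(5, 201, 5))
max_features_options = ["n_features", "sqrt", "log2"]  # assumed labels

# Every combination of the three hyperparameters.
grid = list(product(n_trees_values, max_features_options, max_depth_values))

print(len(n_trees_values), len(max_depth_values), len(grid))
```

Under these assumptions the grid holds 76 × 3 × 40 combinations per lead time, and every n_trees and max_depth value reported in Table 3 falls on it.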

Table 2. Equations for performance metrics.

Metric	Equation	Range	Ideal Value
NSE	$1 - \frac{\sum_{i = 1}^{n} {(O_{i} - P_{i})}^{2}}{\sum_{i = 1}^{n} {(O_{i} - \bar{O})}^{2}}$	−∞, 1	1
KGE	$1 - \sqrt{{(r - 1)}^{2} + {(\alpha - 1)}^{2} + {(\beta - 1)}^{2}}$	−∞, 1	1
RMSE	$\sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(O_{i} - P_{i})}^{2}}$	0, +∞	0

where $n$ represents the number of instances, $O_{i}$ denotes the observed runoff at time $i$, $P_{i}$ signifies the predicted runoff at time $i$, $\bar{O}$ is the mean observed runoff, $\bar{P}$ is the mean predicted runoff, $r$ stands for the correlation coefficient between $P$ and $O$, $\alpha = \frac{\sigma_{P}}{\sigma_{O}}$ is the variability ratio, $\beta = \frac{\bar{P}}{\bar{O}}$ is the bias ratio, and $\sigma$ stands for the standard deviation.
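
The three metrics in Table 2 can be written out directly from their definitions. The sketch below is a plain-Python illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the population standard deviation is used for the variability ratio, and the sample runoff values are invented.

```python
import math
from statistics import mean, pstdev


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    o_bar = mean(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((o - o_bar) ** 2 for o in obs)
    return 1.0 - num / den


def rmse(obs, pred):
    """Root mean square error, in the units of the runoff series."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def kge(obs, pred):
    """Kling-Gupta efficiency from correlation, variability ratio, and bias ratio."""
    o_bar, p_bar = mean(obs), mean(pred)
    s_o, s_p = pstdev(obs), pstdev(pred)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred)) / len(obs)
    r = cov / (s_o * s_p)   # linear correlation coefficient
    alpha = s_p / s_o       # variability ratio
    beta = p_bar / o_bar    # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Invented runoff values in m3/s, for illustration only.
observed = [10.0, 20.0, 40.0, 80.0, 30.0]
predicted = [12.0, 18.0, 35.0, 70.0, 33.0]
print(nse(observed, predicted), kge(observed, predicted), rmse(observed, predicted))
```

A perfect forecast gives NSE = 1, KGE = 1, and RMSE = 0, while a forecast equal to the observed mean at every time step gives NSE = 0.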

Table 3. Hyperparameters for referential and enhanced models.

Referential Models
Lead Time	n_trees	max_features	max_depth
1 h	300	9688	55
3 h	450	9688	5
6 h	400	9688	35
Enhanced Models
Lead Time	n_trees	max_features	max_depth
1 h	420	32	25
3 h	310	32	65
6 h	130	5	40

Table 4. Metrics of efficiency of the referential and enhanced models.

Lead Time	NSE Referential	NSE Enhanced	KGE Referential	KGE Enhanced	RMSE Referential (m³/s)	RMSE Enhanced (m³/s)
1 h	0.93	0.94	0.90	0.92	7.33	6.83
3 h	0.65	0.75	0.54	0.66	16.72	14.14
6 h	0.42	0.50	0.37	0.44	21.56	20.07

Table 5. Percentage of forecasting improvement across lead times.

Lead Time	NSE	KGE	RMSE
1 h	1%	2%	7%
3 h	15%	23%	15%
6 h	18%	17%	7%
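
The percentages in Table 5 can be approximated from Table 4 as relative changes with respect to the referential model. The formula below is an assumption (the text does not spell it out), with the sign flipped for RMSE because lower values are better. Since Table 4 reports rounded metrics, recomputed values may differ from Table 5 by a percentage point or two.

```python
# (referential, enhanced) pairs copied from Table 4.
table4 = {
    "1 h": {"NSE": (0.93, 0.94), "KGE": (0.90, 0.92), "RMSE": (7.33, 6.83)},
    "3 h": {"NSE": (0.65, 0.75), "KGE": (0.54, 0.66), "RMSE": (16.72, 14.14)},
    "6 h": {"NSE": (0.42, 0.50), "KGE": (0.37, 0.44), "RMSE": (21.56, 20.07)},
}


def improvement_pct(ref, enh, lower_is_better=False):
    """Relative improvement of the enhanced model over the referential one, in %."""
    change = (ref - enh) if lower_is_better else (enh - ref)
    return 100.0 * change / abs(ref)


for lead_time, metrics in table4.items():
    row = {
        name: round(improvement_pct(ref, enh, lower_is_better=(name == "RMSE")), 1)
        for name, (ref, enh) in metrics.items()
    }
    print(lead_time, row)
```

For the 3 h lead time, for example, this reproduces the 15% gains in NSE and RMSE reported in Table 5.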

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
