4.1. Overall Performance Statistics
The average performance of the different ELM ensemble models in terms of KGE is presented in Figure 7. As expected, the ELM models that use none of the QPEs presented the worst performance among all input configurations considered, with absolute KGE differences of up to 0.14 when compared with the worst-performing single-QPE models (Figure 7d). This observation illustrates the magnitude of the performance gains that the ELM models can attain by “learning”, up to a certain level, the influence that precipitation has on the discharge of the catchment for the different lead times.
Among the single-QPE ELM models, the ones using radar data presented the best performance in terms of KGE for lead times of 3 and 4 h, and a performance comparable with that of the ELM models using only gauge data for the other lead times. ELM models using CaPA data presented lower performance for lead times of 2, 3 and 4 h when compared with their radar-based counterparts. A remarkable characteristic of the KGE values of all single-QPE ELMs is their comparable performance at both the shortest (1 h) and the longest (5 h) lead times. A probable explanation for this pattern is the response time of the catchment, of approximately 3 h. The water that enters the catchment as precipitation at a given instant t tends to have limited influence on the discharge at the outlet at t + 1 h, as the majority of its volume is still in transit through the basin. Most of the runoff volume reaches the outlet between t + 2 h and t + 4 h; thus, differences in precipitation patterns are more likely to be reflected in the discharge at these instants. At t + 5 h, most of the water entering at time t is expected to have already left the catchment, leading to a condition similar to that at t + 1 h. The major decrease in performance of the models as the lead time increases is likely driven by the antecedent discharge predictors (present in all models), whose correlation with the predicted variable decays as the lead time increases (Figure 8).
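The decaying correlation between antecedent discharge and the forecast target can be illustrated with a simple lagged-correlation check. The sketch below uses a synthetic autoregressive series as a stand-in for hourly discharge (the series and its memory parameter are assumptions for illustration, not the catchment's data).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic hourly "discharge" with a few hours of memory (AR(1) process).
q = np.zeros(2000)
for t in range(1, q.size):
    q[t] = 0.7 * q[t - 1] + rng.normal()

def lag_corr(series, lead):
    """Correlation between the current value and the value `lead` steps ahead."""
    return np.corrcoef(series[:-lead], series[lead:])[0, 1]

# Correlation of the antecedent-discharge predictor with the target
# drops steadily over the 1-5 h lead times considered in the paper.
for lead in range(1, 6):
    print(f"lead {lead} h: r = {lag_corr(q, lead):.2f}")
```

For an AR(1) process the correlation decays geometrically with the lead, which mirrors the qualitative behavior reported for the discharge predictors in Figure 8.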
All ELM models that use two of the QPE products concurrently presented equal or higher KGE values than the models that use only one of the QPEs individually (Figure 7a–c). The differences in performance between the single-QPE models and the two-QPE models are almost imperceptible for the shortest lead time. For longer lead times, the gain in performance with the use of more precipitation input varied for each pair of QPE products. Gains in absolute KGE are up to 0.04 for the gauge and radar pair (Figure 7a), up to 0.11 for the gauge and CaPA pair (Figure 7b) and up to 0.12 for the radar and CaPA pair (Figure 7c).
As observed for the two-product scenarios, the models that use all QPEs presented better performance than the best single-QPE models for all lead times, except for t + 1 h, when the differences are imperceptible (Figure 7d).
Taylor diagrams created using the Python library Skill Metrics [54] summarize the statistical parameters of standard deviation, coefficient of correlation with observations, and RMSE of the ELM models that use no QPE, the ones that use only one QPE product, and the ones that use all three QPEs (Figure 9). As observed in the KGE analysis, the models using the three QPE products performed, on average, better than or as well as their best single-QPE counterparts for all lead times except 1 h, for which all QPE-aware ELMs practically coincide in all three metrics. This consistency across different metrics is a good indicator that the gains in performance with additional QPEs are neither concentrated in specific scenarios nor accompanied by significant drawbacks.
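The three quantities a Taylor diagram encodes for each model (standard deviation, correlation with observations, and centered RMSE) are linked by a law-of-cosines identity, which is what allows them to be drawn on a single polar plot. A minimal sketch of these statistics, computed with NumPy on illustrative arrays (not the study's series), is shown below; the same quantities would be passed to a plotting routine such as the one in the Skill Metrics package.

```python
import numpy as np

def taylor_stats(sim, obs):
    """Standard deviation, correlation, and centered RMSE -- the three
    quantities a Taylor diagram displays for each model."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    sdev = sim.std()
    ccoef = np.corrcoef(sim, obs)[0, 1]
    # Centered RMSE: RMSE after removing the mean bias of each series.
    crmsd = np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean()))**2))
    return sdev, ccoef, crmsd

# Illustrative series only:
obs = np.array([5.0, 12.0, 48.0, 30.0, 9.0])
sim = np.array([6.0, 11.0, 45.0, 33.0, 8.0])
sdev, ccoef, crmsd = taylor_stats(sim, obs)

# Law-of-cosines relation the diagram's geometry relies on:
check = obs.std()**2 + sdev**2 - 2 * obs.std() * sdev * ccoef
print(np.isclose(crmsd**2, check))  # → True
```

Because the identity ties the three metrics together, a model cannot improve one of them on the diagram without the change being visible in the others, which is why agreement across all three is a meaningful robustness check.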
Table 3 summarizes the mean of the main goodness-of-fit metrics of the models, including the percentage gain of using the three QPEs when compared to the best-performing single-QPE models. Out of the 15 scenarios (KGE, RMSE and r metrics at 5 lead times each), the multi-QPE models failed to present the best performance in only 1 (RMSE at 1 h lead time), had a performance comparable to the best single-QPE model in 2 other scenarios (KGE and r at 1 h lead time), and were the best in the remaining 12 scenarios, making them the “clear winner” in terms of overall performance.
As presented in Table 4, the set of models that consider only radar data presented the lowest bias among the single-QPE group of models. Nevertheless, when all three QPE products are used as input, a reduction in the overall bias is observed for all lead times. It is worth noting, though, that all model setups are biased low (negative bias values), a pattern that can be attributed to the fact that each predicted value is the mean of an ensemble of model outputs, which tends to “smooth” the predictions, especially at the peaks. As improvements were also observed in other metrics, it is possible to deduce that the use of multiple QPE products resulted in an appropriate increase of the overall predicted values (higher bias value).
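The peak-smoothing effect of ensemble averaging mentioned above can be demonstrated with a toy experiment: when members disagree slightly on the timing of a peak, their mean is flatter than any individual member. The hydrograph, noise level and timing jitter below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical hydrograph with a single peak (m3/s):
obs = np.array([10.0, 20.0, 80.0, 35.0, 12.0])

# 50 ensemble members: the "true" signal plus noise and a +/-1 h timing jitter.
members = np.stack([np.roll(obs, rng.integers(-1, 2)) + rng.normal(0, 3, obs.size)
                    for _ in range(50)])
ens_mean = members.mean(axis=0)

bias = (ens_mean - obs).mean()           # overall additive bias stays near zero...
peak_error = ens_mean.max() - obs.max()  # ...but the peak is strongly underestimated
print(f"overall bias: {bias:.2f}, peak error: {peak_error:.1f}")
```

The overall bias of the toy ensemble mean is close to zero while its peak error is strongly negative, consistent with the paper's observation that the ensemble means are biased low especially at peak values.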
The hydrographs of four distinct high-flow events presented in Figure 10 reflect the patterns identified in the performance metrics, i.e., multi-QPE models usually outperforming their single-QPE counterparts. One component to be highlighted is the timing: in all events except that of 4 June 2011, the three-QPE model was the earliest to predict the peak, even anticipating the observed maximum flow by one hour. Such anticipation may be considered a positive characteristic by first responders, as it gives them more time to act.
These results answer our first research question by evidencing that ML models indeed have the potential to “learn” patterns in rainfall data estimated by different sources concurrently and thereby improve their performance in reproducing rainfall-runoff processes. Additionally, it is possible to note that if an ML designer is constrained to use a single QPE product, the choice of data source may depend on the metrics considered. For example, models using only radar data and models using only gauge data were each considered “the best model” for 2 lead times in terms of RMSE and r, while for KGE, radar-based models outperformed gauge-based models by a narrow margin (the former outperforming the latter for 3 lead times against 2 the other way around).
4.2. Contingency Analysis
As described in Section 3.3, the threshold adopted in this study to identify high-flow events was the value representing 10 times the baseflow discharge (i.e., 50 m³/s), and the calculated contingency metrics for the identification of high-flow events are presented in Figure 11. For all lead times, the ELM models that use the three QPE products as predictors presented a median CSI value higher than that of their best-performing single-QPE counterparts, which indicates an overall better performance of the former over the latter in predicting the upcoming occurrence of a high-flow record. Reflecting the results obtained in the goodness-of-fit analysis, the single-QPE models that use CaPA products consistently presented the worst results. Models using only gauge data and only radar data presented competitive performances, with the former performing better at the 2 h lead time while the latter outperformed it at the remaining lead times.
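The contingency metrics used in this analysis follow directly from the counts of hits, misses, and false alarms at the exceedance threshold. A minimal sketch is shown below; the forecast and observed series are hypothetical, only the 50 m³/s threshold comes from the study.

```python
import numpy as np

def contingency(sim, obs, threshold=50.0):
    """CSI, sensitivity (POD) and precision for exceedances of a discharge
    threshold (here 10x the baseflow, i.e. 50 m3/s, as in the study)."""
    sim_hit = np.asarray(sim) >= threshold
    obs_hit = np.asarray(obs) >= threshold
    hits = np.sum(sim_hit & obs_hit)            # forecast and observed exceedance
    misses = np.sum(~sim_hit & obs_hit)         # observed but not forecast
    false_alarms = np.sum(sim_hit & ~obs_hit)   # forecast but not observed
    csi = hits / (hits + misses + false_alarms)
    pod = hits / (hits + misses)                # sensitivity / probability of detection
    precision = hits / (hits + false_alarms)
    return csi, pod, precision

# Hypothetical paired forecast/observed discharges (m3/s):
obs = [30, 55, 70, 40, 52, 20, 60]
sim = [35, 60, 65, 55, 45, 25, 58]
print(contingency(sim, obs))  # → (0.6, 0.75, 0.75)
```

Note that CSI penalizes both misses and false alarms, which is why it serves below as the balanced tiebreaker between sensitivity-oriented and precision-oriented model choices.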
The boxplots representing the sensitivity values of the models that use multiple QPEs show little difference when compared to those of their best-performing single-QPE counterparts. The forecast precision of all models remains high for lead times of 1 and 2 h, at about 75%, and then deteriorates significantly for subsequent lead times. If the criterion for deeming an ensemble of models useful is that both precision and sensitivity should be higher than 0.5, the maximum lead time at which they could be considered “useful” is 3 h. The resulting CSI values of the models that use multiple precipitation inputs are the highest (or equivalent to the highest single-QPE model) for all lead times, indicating the superior potential benefit of using all precipitation products.
The performance of single-QPE and three-QPE models for detecting high flows is summarized in Table 5. If only the ability to anticipate the occurrence of an upcoming high-flow event is considered, regardless of the number of false alarms issued, modelers could be inclined to select the models that use only QPEs from gauges, as their sensitivity was the highest for 2 of the lead times. On the other hand, if avoiding false alarms is the main interest of the forecasters, the single-QPE models that use only radar data outperform their gauge-based counterparts for the majority of the lead times due to their better precision. However, usually both the sensitivity and the precision of forecasting models are important, and a balanced metric such as CSI is used as a tiebreaker. In this work, however, CSI values indicate, as observed with KGE, that the radar-based ELMs outperform the gauge-based ELMs for just 3 out of the 5 lead times, which could still raise questions in the selection of the single-QPE product to be used.
The three-QPE models outperformed the three single-QPE ones in terms of precision for all lead times; in terms of sensitivity, they scored equally to the best ELMs (the ones using gauge data only) for two of the lead times and underperformed them for the remaining three. If CSI is used as a tiebreaker, the three-QPE configuration emerges as a “clear winner”, as it presents the best metrics for 4 out of 5 lead times. Taking the single-QPE models that use radar data as a reference, it is possible to interpret from these results that the addition of the other precipitation products provided useful information to skillfully reduce the number of false alarms issued, increasing the precision of the models in a way that outweighs the slight increase in the number of missed events. These results answer our second question: using concurrent QPEs as input to the ML models also improved the prediction of high-flow events, mainly by increasing their precision.
4.3. Brief Discussion on Replicability
The replicability of the results presented in Section 4.1 and Section 4.2 is yet to be assessed for other densely monitored catchments covered by multiple systems that provide QPEs. This is a scenario usually observed in urban areas of significant economic or societal relevance, but still unlikely for most basins [55].
With recent technological developments, however, different systems now provide QPEs with global or sub-global coverage. Examples of such precipitation products available in near-real time include estimates derived from satellites, such as the Integrated Multi-satellitE Retrievals for Global precipitation measurement (IMERG) [56], the Global Satellite Mapping of Precipitation (GSMaP) [57], and the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks—Cloud Classification System (PERSIANN-CCS) [30,58,59]. Individual satellite-based precipitation datasets usually present coarse spatio-temporal resolution and, as any other QPE product, different sources of uncertainty, which makes discharge forecasting for poorly gauged or ungauged basins challenging and a subject of active research. The findings of this work may also motivate developers of machine learning models for rainfall-runoff forecasting to consider using multiple QPE products from different global datasets to enhance model performance when an appropriate rain gauge monitoring network is absent.