1. Introduction
It is more than 50 years since the introduction of the multilayer perceptron model [1], whereas the first applications in hydrology started appearing almost 25 years ago, including rainfall-runoff models [2] and short-term downstream flow forecasting [3]. More particularly, Minns and Hall [2] were among the first to apply recurrent neural networks (RNN) [4] in a hydrological application, stating that “antecedent flow ordinates both perform the same function” (note: distinguish between the rising limb and the recession) “and provide additional information about the input pattern”. These early data-based models were fairly simple, with at most two hidden layers and up to a dozen hidden nodes, and seemed very promising at the dawn of the era of automatic data acquisition. A rainfall-runoff model could be created and calibrated based only on dynamic data (precipitation, evaporation, abstractions, etc.) obtained at high sampling frequencies from electronic sensors. However, these first machine learning (ML) models fell behind in performance compared with the standard hydrological models, conceptual or physically based.
The continuous increase in computational power has allowed more complex ML networks in a variety of hydrological applications, especially in situations where the classical approaches are computationally demanding (for example, prediction of maximum flood inundation [5], river flow prediction [6], water resources management [7], stochastic analysis [8], etc.). A milestone in time-series-related applications was the introduction of the long short-term memory (LSTM) unit [9]. LSTM units are used as hidden nodes in RNN and include, besides the input and output gates, a forget gate (see Figure 2 in [6]). This offers the advantage of assigning a dynamic state to each LSTM unit, which serves as a memory mechanism. The topology of an LSTM network is characterized by the number of different LSTM cells used in each time step (see Figure 1 in [6]) and the sequence length processed by the LSTM network at each step (see the figure in the section “LSTM Layer Architecture” in [10], which corresponds to the vertical direction of Figure 1 in [6]).
Machine learning with LSTM has outperformed the standard hydrological approach in many applications. For example, Ayzel et al. [11] demonstrated that an LSTM-based model with 266,497 parameters achieved a higher generalization capacity than a parsimonious model for streamflow simulation in the Barents, White, and Baltic Seas basins. Kratzert et al. [12] introduced a modification of LSTM, the Entity-Aware-LSTM, which can take into account the catchment properties; it achieved better performance not only than hydrological models that were calibrated regionally, but also than hydrological models that were calibrated for each basin individually. Lees et al. [13] applied LSTM in 669 catchments in Great Britain, where it outperformed a suite of conceptual hydrological models.
This consistent performance superiority of ML in hydrological applications has been studied by Nearing et al. [14], who concluded that it is the constraints in the structure of the traditional models that prevent them from fully capturing the information in large-scale hydrological data sets. However, the efficiency of the LSTM models is not without a cost in computational complexity. As mentioned above, the model of Ayzel et al. [11] employed 266,497 parameters. Similarly, Lees et al. [13] have reported that it took 10 h to train an LSTM ensemble on a machine with 188 GB of RAM and a single NVIDIA V100 GPU.
An indirect approach to take advantage of ML in hydrological applications is to use it as a tool for pre-processing the data or post-processing the results of the standard hydrological models. For example, Iorgulescu and Beven [15] used ML to reveal anomalies in data sets, i.e., inconsistencies concerning the principal equations of standard models (e.g., water and energy balances). Solomatine et al. [16] trained an ML model to serve as an estimator of the probability distribution of the output of a hydrological model. Althoff et al. [17] suggested an elegant method that takes advantage of the dropout technique (a common regularization strategy in ML) to obtain ensemble predictions and quantify the uncertainty of hydrological models. Li et al. [18] used a complex scheme that includes Box-Cox transformations, an LSTM network, and Bayesian inference to obtain probabilistic streamflow predictions. Aparicio et al. [19] employed machine-learning models to estimate the instantaneous peak flow from the maximum mean daily flow obtained from the SWAT model. Noymanee et al. [20] compared various alternative statistical and ML techniques for improving the flood forecasting efficiency of hydrological modeling. Yang et al. [21] combined a physically based distributed hydrological model with neural networks, computer vision, and a categorization approach.
In this study, inspired by the concepts found in the works mentioned above, we employ a simple ML network as a tool to assess the performance of hydrological models. A hydrologist always seeks to improve the performance of his/her model. The simple ML network is used to assess a hydrological model and determine whether, and by how much, it can be further improved. It should be noted that even if the optimization algorithm has achieved the best possible calibration of the hydrological model, the model may still not achieve the best feasible fit because it is limited by its structural characteristics. However, as Beven suggests [22], “If there are consistent anomalies between the conceptual structure of a hydrological model in a particular catchment and the nature of the hydrological processes in that catchment, then a deep learning (DL) model might well be able to capture that behaviour”. Therefore, by comparing the performance of a trained ML model with that of a calibrated hydrological model, the hydrologist can detect whether there is room for further improvement and how to accomplish it by tuning the configuration and/or the structure (e.g., spatial/temporal resolution, modules employed, parameters, etc.) of the hydrological model.
One could suggest using ML or deep learning approaches, like those in the previously mentioned studies, for benchmarking the performance of a hydrological model. However, this would require a significant amount of time since, as previously mentioned, these models are notoriously computationally intensive to train. On the other hand, the ML network employed in this study is minimalist. We opted for simplicity both to facilitate applicability (an existing tool, like MATLAB ntstool, can be applied directly to the available data without any need for coding) and to improve the generalization and reliability of this approach.
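To make the proposed assessment concrete, the following sketch illustrates the idea on synthetic data, with ordinary least-squares regression on lagged inputs standing in for the simple ML network (in practice, a tool such as MATLAB ntstool would be used). The rainfall-runoff relation and the deliberately crude “hydrological model” are invented for illustration only:

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(1)
T = 400
rain = rng.gamma(0.6, 4.0, size=T)                          # synthetic precipitation
obs = np.convolve(rain, [0.5, 0.3, 0.15], mode="full")[:T] + rng.normal(0, 0.2, T)
hydro_sim = np.convolve(rain, [0.6, 0.3], mode="full")[:T]  # crude "hydrological model"

# ML approximator: linear regression on lagged inputs plus the model's own output.
lags = 3
X = np.column_stack([np.roll(rain, k) for k in range(lags)] + [hydro_sim])
X, y = X[lags:], obs[lags:]
split = len(y) // 2                                         # train / test halves
coef, *_ = np.linalg.lstsq(np.c_[np.ones(split), X[:split]], y[:split], rcond=None)
ml_sim = np.c_[np.ones(len(y) - split), X[split:]] @ coef

gap = nse(ml_sim, y[split:]) - nse(hydro_sim[lags:][split:], y[split:])
print(f"NSE gain of ML approximator over the model: {gap:.3f}")
```

A clearly positive gap on the test period indicates, in the spirit of the proposed methodology, that the data contain information the assessed model does not exploit.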
4. Discussion
On closer inspection, the outputs of the ML approximators exhibit some unrealistic characteristics (e.g., large spikes, unjustified oscillations, etc.). It should be noted that similar behaviour is typical of plain RNNs. For example, Bao et al. compared various ML models to predict stock indexes and found that LSTM and RNN exhibit large variations and distances from the actual data (see Figure 8 in [32]). To improve the performance, they combined an LSTM network with wavelet transforms (WT) and stacked autoencoders. Other researchers have reported similar behaviour while employing RNN to perform sensor fusion for an inertial measurement unit [33]. The suggested solution was to apply a Fourier transformation and remove the high-frequency signal components. As mentioned in the introduction, ML can achieve superior performance in hydrological applications, but only at the price of increased computational complexity. However, the ML approximator is not intended for hydrological simulations, but for the assessment of hydrological models. To make the assessment procedure practical, the ML approximator was designed to be as simple as possible; hence the oscillations and spikes. Nevertheless, these artefacts should not be a source of concern. If the ML approximator achieves better performance despite (though not because of) any high-frequency oscillations, then a better hydrological model is achievable.
The comparison of a hydrological model against the simulation of the ML approximator can give clear indications of the performance deficiencies that can be addressed. Similar conclusions cannot be obtained by comparing the model directly with the observations, because it cannot be guaranteed that fixing an obvious failure of the model to reproduce a specific characteristic of the response will not deteriorate the reproduction of another characteristic. For example, the comparison in Figure 8 (ill-calibrated LRHM against RNN and LSTM) indicates that a better model can be obtained in Bakas to simulate both the high and the low flows more accurately; on the other hand, the comparison in Figure 9 indicates that only the performance during the low-flow periods can be improved.
In the case study of the Nyangores River, the significant difference in the performance of the hydrological models between the calibration and validation periods made the direct approach employed in the other case studies difficult. The shuffling approach, inspired by the cross-validation technique, allowed the study of the variance of the loss function of the ML approximator and of the variance of the performance metric of the hydrological model. In the Nyangores River case study, the ML approximator demonstrated lower variance for both assessed hydrological models, LRHM and HYMOD2. To further study the importance of the variance, the shuffling approach was tried in the Karveliotis case study in a second round, this time only for the calibrated LRHM, for which the direct application of the ML approximator had achieved only a marginal performance improvement. The variance of the ML approximator, obtained after the shuffling and the repeated applications, was slightly larger than the variance of the calibrated LRHM, suggesting that the performance improvement in this case is not only marginal but also does not generalize. This increases the credibility of the indication obtained with the direct application (see Section 3.3) that a better model performance than that already achieved by LRHM is not feasible.
The shuffling of the data, though introducing some additional workload, offered a twofold benefit in the application of the ML approximator. First, it facilitated the training of the ML model in situations where the training and test periods presented different response patterns. Second, it allowed the evaluation of the variance of the loss function, as mentioned in the previous paragraph. This beneficial technique is not applicable to hydrological models because it violates the continuity equations, the cornerstone of these models. Carefully shuffling the data (to preserve the important correlation characteristics) offers an important advantage of ML models over standard approaches, which should be considered in hydrological applications.
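A minimal sketch of such correlation-preserving shuffling uses contiguous blocks, so that short-range dependence survives inside each block. The block length, the synthetic series, and the mean predictor standing in for a trained approximator are all illustrative assumptions:

```python
import numpy as np

def block_shuffle_splits(n, block_len, n_splits, test_frac=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs drawn from shuffled contiguous blocks.

    Shuffling whole blocks (rather than single time steps) keeps the
    short-range correlation structure of the series intact inside each block.
    """
    rng = np.random.default_rng(seed)
    blocks = [np.arange(s, min(s + block_len, n)) for s in range(0, n, block_len)]
    for _ in range(n_splits):
        order = rng.permutation(len(blocks))
        n_test = max(1, int(test_frac * len(blocks)))
        test = np.concatenate([blocks[i] for i in order[:n_test]])
        train = np.concatenate([blocks[i] for i in order[n_test:]])
        yield train, test

# Variance of a loss across repeated shuffled splits (here: MSE of a mean predictor).
series = np.sin(np.linspace(0, 20, 365)) + np.random.default_rng(2).normal(0, 0.1, 365)
losses = []
for train, test in block_shuffle_splits(len(series), block_len=30, n_splits=20):
    pred = series[train].mean()          # stand-in for a trained ML approximator
    losses.append(np.mean((series[test] - pred) ** 2))
print(f"mean loss {np.mean(losses):.3f}, variance {np.var(losses):.4f}")
```

The spread of the losses across the repeated splits is the variance discussed above; in the actual methodology, a model would be retrained on each shuffled training set.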
As mentioned previously, the ML approximator helps to identify whether there is some information in the data that the structure or setup of the hydrological model does not take into account. In most cases, this happens because of some deficit or weakness of the hydrological model. An interesting situation arises when the observations are obtained by a sensor that introduces a systematic error. In this case, the ML approximator will yield a better fit than the hydrological model, which is restricted by the water balance equation; yet the predictions of the hydrological model will be closer to reality. Though modern sensors are reliable and tested in laboratories to ensure no systematic errors, the possibility of such an untoward event cannot be excluded. At this point, we need to acknowledge the agnostic nature of any kind of model. If the observations are found at some point to suffer from systematic errors, the model should be run again with the new, closer-to-reality data. If no error can be proven in the available data, then, for the model, the data are the reality.
Finally, it is worth noting that the ML approximator did achieve better performance than the ill-calibrated models in Nedon and the less efficient hydrological model in the Nyangores River case study, indicating the potential to improve the model performance. However, this performance improvement did not exceed the performance of the corresponding calibrated models and of the more efficient hydrological model (e.g., compare the values of the ML approximator in Table 4 and HYMOD2 in Table 5). This means that the ML approximator actually points to a better model, but not to the best feasible model. Apart from what has been mentioned at the beginning of this section, another reason is the assumption that there is a function such that the error term in Equation (2) is uncorrelated, homoscedastic, and zero-inflated. Since the complexity of this function is equivalent to the complexity of the approximator, which in the proposed methodology is kept minimal, the aforementioned assumption may not hold for every assessed model structure and configuration. The alternative to bypass this restriction would be to allow a function of arbitrary complexity, which in the end would result in ML applications similar to those appearing in various recent publications. Consequently, any advantage regarding the time required to train (less than 2 min for LSTM and less than 1 min for RNN on an AMD EPYC dual-core processor) and to apply the ML approximators would be lost.
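As a side note, whether the residuals of an assessed model are consistent with these assumptions can be probed with simple diagnostics. The following sketch (our own illustration, not part of the methodology) checks the lag-1 autocorrelation and the stability of the variance on synthetic white noise:

```python
import numpy as np

def residual_diagnostics(residuals):
    """Quick checks of the error-term assumptions (illustrative only).

    Returns the lag-1 autocorrelation (near 0 if the residuals are
    uncorrelated) and the ratio of the variances of the two halves of
    the series (near 1 if they are homoscedastic).
    """
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    rho1 = np.sum(r[1:] * r[:-1]) / np.sum(r * r)   # lag-1 autocorrelation
    half = len(r) // 2
    var_ratio = np.var(r[:half]) / np.var(r[half:])  # variance stability
    return rho1, var_ratio

# White residuals should pass both checks.
rng = np.random.default_rng(3)
rho1, var_ratio = residual_diagnostics(rng.normal(0, 1, 2000))
print(f"lag-1 autocorrelation {rho1:.3f}, variance ratio {var_ratio:.3f}")
```

Strong autocorrelation or a variance ratio far from 1 would signal that a minimal approximator cannot absorb the remaining structure.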
5. Conclusions
In this study, we developed an approach that helps a hydrologist answer the question “Can my model perform any better with the available data?”. To answer this question, we suggest employing a simple machine learning network that can be easily prepared and trained. The network inputs are the inputs and outputs of the hydrological model. If the machine learning model achieves better performance, then there is some information in the data that the structure or setup of the hydrological model is not taking into account.
The proposed methodology can be applied with simple ML tools that are straightforward and require no coding or data curation. However, this direct, simple approach is reliable only when the assessed hydrological model performs almost equivalently during the training and test periods. A large difference in performance between these two periods indicates that important response patterns are not evenly represented in the training and test data sets. In these cases, a more sophisticated approach that includes data curation (careful shuffling that preserves the statistical structure) is required for a reliable model assessment.
When assessing a model with data shuffling (a method similar to cross-validation), the variance of the loss function is a metric of how well the model generalizes. If the machine learning model achieves lower variance than the hydrological model, this is an additional indication that a better model can be prepared with the available hydrological data. This was confirmed by the application of the shuffling method in two case studies (Nyangores River and Karveliotis). In the former case study, both the mean and the variance of the loss function of the machine learning model were lower than those of the hydrological model, indicating the potential for a better model, whereas in the latter the mean loss function was marginally lower but the variance was higher, indicating that the best feasible model had already been achieved.
The suggested methodology could be used as a filter to improve the efficiency of hydrological models. However, as mentioned previously, it is not guaranteed that the improved performance after the filtering is the best that can be obtained with the available data. For this reason, it is recommended principally as an assessment tool for crafting hydrological models, either by improving the calibration or by adding/modifying physical/conceptual assumptions of the model. Finally, it should be noted that the suggested methodology is generic. It can be used with any model, e.g., a hydraulic or even a financial model, to evaluate the capacity of the assessed model to fully describe the deterministic relationship between the model inputs and outputs.