1. Introduction
The intelligent management of water resources plays a critical role in promoting social and economic development, and it must be based on a full understanding of the spatial and temporal distribution of water resources [1,2]. Hydrological models are useful tools for providing hydrological information for water resource management [3]. In the past few decades, numerous hydrological models, from lumped empirical to fully distributed physically based models, have been developed [4]. However, the best-performing model is not consistent across different basin characteristics and climatologies [5,6]. Multimodel averaging methods, defined as using the outputs from multiple models to obtain one output, have been shown by numerous studies to perform better in hydrological modeling than their individual members [7,8,9]. Since the early paper of Cavadias and Morin [10] introduced the concept of weighted averaging for streamflow simulation, various multimodel averaging approaches have been proposed to find the optimal weights for each member, so as to minimize the error between the combined and the observed streamflow time series [7,11].
Numerous studies have been conducted to compare various averaging methods in different regions [5,7,12,13,14]. For example, Diks and Vrugt [5] compared seven averaging methods by using eight conceptual watershed models, and found that the Granger–Ramanathan averaging (GRA) method is superior to other methods. Arsenault et al. [7] compared 9 averaging methods over 429 catchments in the United States, and concluded that the Granger–Ramanathan averaging methods (GRA, GRB and GRC) perform better than any individual member. These studies contributed much to the research in multimodel averaging. However, most of them used a limited number of catchments in a specific region, and therefore did not consider the merits and shortcomings of different averaging methods from a global perspective. Furthermore, they could not reveal how various climate conditions and basin attributes affect the performance of different averaging methods.
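The weighted-averaging idea behind methods such as GRA can be illustrated with a minimal sketch. It assumes the simplest least-squares formulation (no intercept, no constraints on the weights), which is how the basic Granger–Ramanathan variant is commonly described; it is not the implementation used in the cited studies. The function names and the toy series are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def ols_weights(members, observed):
    """Weights minimizing the squared error between the weighted sum and the observations."""
    k = len(members)
    # Normal equations: (X^T X) w = X^T y, with one column of X per member.
    xtx = [[sum(p * q for p, q in zip(members[i], members[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * o for p, o in zip(members[i], observed)) for i in range(k)]
    return solve_linear(xtx, xty)


def combine(members, weights):
    """Weighted sum of the member series, time step by time step."""
    return [sum(w * m[t] for w, m in zip(weights, members))
            for t in range(len(members[0]))]


def sse(simulated, observed):
    """Sum of squared errors between a simulated and an observed series."""
    return sum((o - s) ** 2 for s, o in zip(simulated, observed))


# Hypothetical toy streamflow series (arbitrary units), not data from the study.
observed = [1.0, 2.0, 4.0, 3.0]
members = [
    [1.5, 2.5, 5.0, 3.5],  # member that overestimates flow
    [0.8, 1.4, 3.0, 2.6],  # member that underestimates flow
]
weights = ols_weights(members, observed)
combined = combine(members, weights)
```

Because the weights are unconstrained, they need not sum to one and may be negative. In practice, a library least-squares routine such as `numpy.linalg.lstsq` would normally replace the hand-written solver, which is included here only to keep the sketch free of dependencies.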
In addition, with the development of multimodel averaging, many attempts have been made to find the best averaging scheme. For example, Clark et al. [15] concluded that the outputs from one model, calibrated with different objective functions, could be considered as different models and be used to improve the performance of averaging methods. Arsenault et al. [8] found that promising model averaging results could be achieved by using the outputs from one model, driven by different climate datasets. In recent years, precipitation has been considered as one of the major sources of uncertainty in water resource estimates, and it may significantly impact the performance of hydrological models in runoff simulations [16,17,18,19,20,21,22]. Gauged observations are usually considered the most accurate estimates of precipitation. However, many places have sparsely distributed rain gauges and lack accurate precipitation data for hydrological modeling [23,24,25]. Therefore, various global-scale gridded precipitation datasets have been developed in the past few decades to provide precipitation with high temporal and spatial resolution across the world, especially for ungauged regions [26,27,28,29,30,31]. However, compared to the real historic precipitation, the gridded precipitation datasets contain errors [32,33,34]. Therefore, compared to the use of a single precipitation dataset, the use of hydrological model outputs driven by various precipitation datasets as ensemble members can increase the diversity of the ensemble and may create a more precise combination for data-sparse regions [8,35,36].
Hydrological model outputs driven by various precipitation datasets have commonly been used for uncertainty analysis or climate change impact assessment in previous studies [3,35,37]. There is limited research on the application of multi-input averaging in continuous hydrological streamflow simulation [8]. Najafi and Moradkhani [36] used the Bayesian model averaging (BMA) method to estimate runoff extremes, using a single hydrologic model and multiple regional climate model outputs as forcing data, and concluded that the merged signal generally outperforms the best individual signal. Arsenault et al. [8] compared the performance of multimodel and multi-input averaging over the continental United States by using the Granger–Ramanathan C (GRC) method. They found that multi-input averaging provides higher skill than multimodel averaging. Sun et al. [35] used the BMA method to merge streamflow simulations driven by three global precipitation datasets. They concluded that the hydrologic ensemble using multiple global precipitation products can provide a promising streamflow prediction. However, each of the above studies used only one averaging method, and did not consider whether the improvement in hydrological runoff simulation from multi-input averaging is independent of the averaging method. In addition, the number of members used in the multi-input averaging and multimodel averaging was not consistent, which may affect the performance of the averaging methods [8,35].
Accordingly, the first objective of this study is to evaluate and compare the performance of different averaging schemes: multimodel, multi-input and multi-input model (i.e., members consist of multiple models driven by multiple precipitation datasets). The second objective is to quantify the performance of various averaging methods in different climate regions, to find the optimal averaging methods for global hydrological streamflow simulation. Specifically, four hydrological models driven by six gridded precipitation datasets (24 combination members), together with nine averaging methods, are used to evaluate the performance of different averaging schemes. In addition, the impact of climate conditions on the performance of the averaging methods is investigated by using 2277 watersheds distributed in different climate regions. The large sample size allows a better understanding of the usage of averaging methods, and can thus help improve the performance of hydrological runoff simulations, especially for data-sparse regions.
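The ensemble design described above can be sketched as a simple enumeration. Only SIMHYD, MSWEP, JRA55 and CPC are names that appear in the text; the remaining model and dataset labels below are hypothetical placeholders. The grouping into schemes (one multi-input scheme per model, one multimodel scheme per dataset, and one scheme with all members) is an assumption about how the schemes are composed.

```python
from itertools import product

# Only "SIMHYD" is named in the text; the other model labels are placeholders.
models = ["SIMHYD", "MODEL_B", "MODEL_C", "MODEL_D"]
# "MSWEP", "JRA55" and "CPC" appear in the text; the rest are placeholders.
datasets = ["MSWEP", "JRA55", "CPC", "DATASET_D", "DATASET_E", "DATASET_F"]

# Every model driven by every precipitation dataset: 4 x 6 = 24 members.
members = [f"{model}-{dataset}" for model, dataset in product(models, datasets)]

schemes = {}
# Multi-input: one model driven by all datasets (6 members per scheme).
for model in models:
    schemes[f"{model}-multi-input"] = [f"{model}-{d}" for d in datasets]
# Multimodel: all models driven by one dataset (4 members per scheme).
for dataset in datasets:
    schemes[f"{dataset}-multimodel"] = [f"{m}-{dataset}" for m in models]
# Multi-input model: all 24 members together.
schemes["multi-input-model"] = list(members)

for name, group in schemes.items():
    print(name, len(group))
```

Under this assumed grouping, the multi-input, multimodel and multi-input model schemes contain 6, 4 and 24 members respectively, so the number of members differs between scheme types.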
4. Discussion
This study used 6 global gridded precipitation datasets to drive 4 hydrological models for streamflow simulations over 2277 watersheds around the world, and took each of the outputs as a member for model averaging. To find the best combination of different members and improve the predictive skill in hydrological runoff modeling, eleven averaging schemes, classified as multi-input, multimodel and multi-input model, and nine averaging methods were considered for streamflow averaging. The results show that the combination of different members may largely impact the performance of the averaging methods. The performance of multimodel averaging schemes largely depends on the input data. In general, the multi-input averaging schemes perform better than multimodel averaging schemes. Global gridded precipitation datasets are laden with intrinsic and structural errors, due to the different interpolation schemes, and they are likely all different from the real climate data [32,33]. Therefore, a given model driven by different precipitation datasets performs quite differently (Table 5). For example, the median KGE value of SIMHYD-MSWEP is approximately 0.12 greater than that of SIMHYD-JRA55. The improvement in the multi-input schemes may be partly due to a reduction in the input-induced discrepancies between the simulated and observed hydrographs [7,35]. Theoretically, using real climate data may reduce the advantage of the multi-input schemes. However, real precipitation varies greatly in time and space, and is therefore extremely challenging to observe and estimate [32]. Consequently, multi-input averaging schemes can be a powerful tool for hydrograph simulations, and can provide an advantageous way to support reasonable runoff prediction and water management, especially in ungauged basins [35].
Equifinality refers to a hydrological model having multiple parameter sets that lead to equally acceptable model performance, and is considered to be one of the sources of uncertainty in hydrological modeling [71,72]. Theoretically, using the outputs from equifinal parameter sets as averaging members may improve the performance of averaging methods, by reducing the errors caused by parameter set uncertainty. This was tested by combining the outputs of 10 equifinal parameter sets: the four models driven by MSWEP were calibrated ten times by the shuffled complex evolution method (SCE-UA), with different initial random seeds. The results show that using the outputs of 10 equifinal parameter sets, calibrated from a hydrological model driven by a specific precipitation dataset, as averaging members cannot improve the performance of the averaging methods (Figure 10). This conclusion is consistent with that of Arsenault et al. [8].
The KGE was used to calibrate the models. The KGE metric is one of the most common metrics used in hydrological modeling. It puts more emphasis on the simulation of flow variability and correlation [73,74]. Compared to the best single model (SIMHYD-MSWEP), the KGE values improved for each averaging method for most schemes. For NSE and AVE, the improvement from the averaging methods is more obvious (Figure 4, Figure A1 and Figure 9). The NSE metric focuses more on the peak flows and less on the low flows [73]. Therefore, compared to the hydrograph simulated by a model calibrated with a single objective function, most aspects of the hydrograph simulated by averaging methods are improved. In addition, previous studies indicated that using the outputs from one model, calibrated with different objective functions, as averaging members can improve the performance of the averaging methods [15]. Therefore, a more comprehensive study is needed to investigate how a large ensemble containing multiple model structures, each with multiple objective functions and driving datasets, impacts the performance of averaging methods.
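For reference, the two efficiency metrics discussed above can be sketched as follows. The KGE version shown is the widely used original formulation based on correlation, variability ratio and bias ratio; whether the study uses this form or a later variant is an assumption. The function names and the toy series are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def kge(simulated, observed):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    n = len(observed)
    mean_sim, mean_obs = _mean(simulated), _mean(observed)
    std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in simulated) / n)
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / n)
    cov = sum((s - mean_sim) * (o - mean_obs)
              for s, o in zip(simulated, observed)) / n
    r = cov / (std_sim * std_obs)   # linear correlation
    alpha = std_sim / std_obs       # variability ratio
    beta = mean_sim / mean_obs      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def nse(simulated, observed):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = _mean(observed)
    sse = sum((o - s) ** 2 for s, o in zip(simulated, observed))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


# Hypothetical toy series (arbitrary units), not data from the study.
observed = [1.0, 3.0, 2.0, 5.0]
simulated = [1.2, 2.7, 2.1, 4.4]
print(kge(simulated, observed), nse(simulated, observed))
```

Both metrics equal 1 for a perfect simulation. Because NSE is built on squared errors, large errors at high flows dominate its value, which is consistent with the emphasis on peak flows noted above.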
Compared to the best individual member, even the simplest method, equal weights averaging (EWA), can improve the simulation performance for more than 40% of the watersheds. However, the performance of different methods is not consistent among climate regions. The AICA, BICA and Granger–Ramanathan methods are in the leading group in the arctic region; however, they show poor performance in other climate regions, especially in the equatorial and arid regions. In fact, AICA and BICA tend to put more weight on the best individual member and neglect others [5]. Therefore, the high performance of AICA and BICA in the arctic region could be due to the large differences in performance among the 24 members in this region. The same holds for the Granger–Ramanathan group. The Granger–Ramanathan group allows negative weights; therefore, these methods are able to hedge against the use of a bad model [5,54]. The BGA and EWA methods rank in the middle among the averaging methods for most regions. In addition, they are more robust than the AICA and BICA methods. The stable performance of these two methods may be because they distribute the weights relatively evenly among the members. The BMA method is amongst the best methods for most climate regions (except the arctic region). The sensitivity of BMA to poorly performing members may be the reason for its relatively poor performance in the arctic region [36]. In addition, the BMA method takes the longest to execute among these averaging methods, because of its iterative nature [5,7]. Therefore, the MMSE averaging method is recommended for its speed of execution, simplicity and stable performance across climate regions.
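The tendency of information-criterion weighting to concentrate on the best member can be illustrated with a minimal sketch. It assumes a generic Gaussian-error form in which each member's score is n log(MSE) plus a complexity penalty, and the weights are proportional to exp(-score/2); the exact AICA and BICA formulations used in the study may differ. The function name and the numbers are hypothetical.

```python
import math


def information_criterion_weights(mse_values, n_obs, n_params, criterion="aic"):
    """Weights proportional to exp(-score / 2), with score = n * log(MSE) + penalty."""
    # Assumed penalty per parameter: 2 for an AIC-type score, log(n) for a BIC-type score.
    per_param = 2.0 if criterion == "aic" else math.log(n_obs)
    scores = [n_obs * math.log(mse) + per_param * p
              for mse, p in zip(mse_values, n_params)]
    best = min(scores)
    # Subtract the best score before exponentiating for numerical stability.
    raw = [math.exp(-0.5 * (score - best)) for score in scores]
    total = sum(raw)
    return [value / total for value in raw]


# Hypothetical members whose mean squared errors differ by only 5% and 20%.
mse_values = [1.00, 1.05, 1.20]
n_params = [5, 5, 5]
ic_weights = information_criterion_weights(mse_values, n_obs=1000, n_params=n_params)
equal_weights = [1.0 / len(mse_values)] * len(mse_values)
print(ic_weights)
print(equal_weights)
```

With a long series, even a small difference in error translates into a large difference in score, so nearly all of the weight goes to the single best member, whereas equal weights averaging treats every member identically.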
5. Conclusions
Nine multimodel averaging methods and 11 averaging schemes have been compared, using the simulations of 4 hydrological models driven by 6 precipitation datasets, to find the most suitable multimodel averaging application under different climate regions. The study was conducted over 2277 watersheds around the globe, covering 5 main climatic groups, according to the Köppen–Geiger classification. The following paragraphs outline the results.
The performance of multimodel averaging schemes is closely related to the precipitation used in the hydrological simulation, with a 0.14 difference in the median KGE values between the worst (CPC-COMBINE) and the best (MSWEP-COMBINE) multimodel averaging schemes. Using models driven by different gridded precipitation datasets as ensemble members improves the performance of the averaging methods compared to the multimodel averaging schemes.
Merging multiple members can lead to a significant improvement in hydrological simulations for up to six members. Using more than six members improves the results only slightly, even when all 24 members are used.
Clear differences in the performance of averaging methods were displayed for different climatic regions. The warm-temperate climatic regions provided the best performance for the averaging methods, with at least 61% of the watersheds showing improvements in runoff prediction skill compared to the best single member. Equatorial and snow regions follow closely behind. Moreover, the differences in performance among the various averaging methods are more pronounced in arid and arctic regions than in the other regions.
The best-performing averaging method differs among climate regions. The MMSE method shows the best performance in most climate regions, except for the arctic region, where the Granger–Ramanathan averaging group outperforms the others. In general, the MMSE averaging method has an advantage over the other averaging methods because it is simple to implement and is always amongst the leading group.