1. Introduction
Hydrological forecasting is a pivotal non-engineering measure for flood risk reduction. Runoff is a fundamental manifestation of water resources [1], and highly precise runoff predictions are of paramount importance in the planning and management of water resource systems [2].
Data-driven models can capture the non-linear relationship between driving factors and runoff, mitigating the impact of subjective elements on model uncertainty [3]. These models exhibit strong fitting capabilities and flexible input data applicability in runoff forecasting [4]. Pour et al. [5] evaluated the performance of two machine-learning models (GMDH and GEP) in runoff prediction in the eastern coastal region of the Malaysian Peninsula, yielding favorable outcomes. Feng et al. [6] developed an LSTM model based on different data integrations, achieving record-breaking Nash efficiency coefficient values on a continental scale. Naganna et al. [7] applied deep-learning and machine-learning techniques for runoff prediction in the Cauvery River in India, revealing significant disparities in results under different model inputs. Data-driven models can be categorized into statistical models, machine-learning models, and deep-learning models [4]. However, statistical models are poorly suited to the non-linear characteristics of runoff time series [8,9,10], while machine-learning and deep-learning models are susceptible to local optima, overfitting, and slow convergence; consequently, they do not always deliver robust predictive performance [4,6,11,12,13,14]. Given these limitations of single-model forecasting, numerous studies have advocated ensemble methods that combine multiple models to forecast runoff. Such approaches have been shown to offer dependable runoff predictions for diverse geographical regions [15,16,17,18,19,20].
Multi-model ensemble approaches combine diverse models to reduce model errors and uncertainties, with the ultimate objective of improving forecasting accuracy. The simple arithmetic mean method assigns equal weights to the predictions of each base model and thereby neglects differences in forecasting capability among models. In contrast, the weighted averaging method allocates a weight to each base model, thereby harnessing the advantages of multiple models in the ensemble [18].
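The contrast between the arithmetic mean and weighted averaging can be illustrated with a minimal sketch. The forecast values, the three-model setup, the weights, and the function names `simple_mean_ensemble` and `weighted_ensemble` are all hypothetical and chosen for illustration only; they are not taken from this study.

```python
import numpy as np

# Hypothetical forecasts: three base models (rows) over four time steps (columns).
forecasts = np.array([
    [10.0, 12.0, 15.0, 11.0],
    [9.0, 13.0, 14.0, 12.0],
    [14.0, 16.0, 20.0, 15.0],
])


def simple_mean_ensemble(forecasts):
    """Arithmetic mean: every base model receives the same weight."""
    return forecasts.mean(axis=0)


def weighted_ensemble(forecasts, weights):
    """Weighted average: one non-negative weight per base model."""
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (forecasts.shape[0],):
        raise ValueError("need exactly one weight per base model")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ forecasts


mean_forecast = simple_mean_ensemble(forecasts)
# Assumed weights that favor the first two models over the third.
weighted_forecast = weighted_ensemble(forecasts, [0.5, 0.4, 0.1])
```

In this sketch the weights are fixed for the whole series, which corresponds to the static weighting discussed in the following paragraph.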
The estimation of these weights can be based on a variety of methodologies, including multivariate linear regression [21], least squares methods [16], machine-learning techniques [20,22], or Bayesian model averaging [23,24,25]. Wang et al. [26] constructed two multi-model ensemble models based on the dynamic system response curve (DSRC) and Bayesian model averaging (BMA); the integrated forecast results for three process-driven hydrological models (XAJ, HBV, and VHY models) were significantly superior to those of the baseline model. Farfán and Cea [27] built a hydrological ensemble model based on artificial neural networks, enhancing the model results in terms of linear correlation, bias, and variability. The models mentioned above allocate weights based on the performance of each individual model over the entire training phase [28]. However, in runoff forecasting, the data characteristics of model predictions at different time points may exhibit varying correlations with training data features [29]. When training data feature attributes are chosen collectively, the predicted data points at different time points may not align with those collective attributes [30]. Consequently, relying solely on a model’s performance during the training phase to determine its weight may lead to significant errors in localized predictions [31]. This approach disregards, during weight determination, the influence of the quantity and quality of training-period data features on the effectiveness of the ensemble forecast.
The K-Nearest Neighbor (KNN) algorithm finds extensive application in runoff forecasting and classification research [32,33,34,35] and can be utilized for selecting or reducing data features in the data preprocessing stage [29,34,36]. The fundamental assumption underpinning the KNN algorithm is that the objective world is regular and repeatable, so that analogous conditions yield similar outcomes [37]. By assessing the degree of similarity, the KNN algorithm retrieves analogous multi-model forecasting instances from historical collections for the current time step, thereby estimating the prediction states of the various forecasting models at the present prediction time point. This process ultimately yields the optimal estimation of model weights for the query instance [30].
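The idea of retrieving similar historical instances to set weights for the current time step can be sketched as follows. This is a minimal illustration under stated assumptions, not the weighting scheme of this study: similarity is measured by Euclidean distance between vectors of base-model forecasts, and weights are taken as inversely proportional to each model's mean absolute error over the k nearest historical instances. The function name `knn_time_varying_weights` and all data values are hypothetical.

```python
import numpy as np


def knn_time_varying_weights(hist_forecasts, hist_observed, query_forecasts,
                             k=3, eps=1e-9):
    """Estimate per-model weights for one query time step.

    hist_forecasts : (n_samples, n_models) historical base-model forecasts
    hist_observed  : (n_samples,) observed runoff for the same time steps
    query_forecasts: (n_models,) base-model forecasts at the query time step
    """
    hist_forecasts = np.asarray(hist_forecasts, dtype=float)
    hist_observed = np.asarray(hist_observed, dtype=float)
    query = np.asarray(query_forecasts, dtype=float)

    # Assumption: Euclidean distance defines similarity between instances.
    distances = np.linalg.norm(hist_forecasts - query, axis=1)
    neighbors = np.argsort(distances)[:k]

    # Mean absolute error of each model over the k most similar instances.
    abs_errors = np.abs(hist_forecasts[neighbors] - hist_observed[neighbors, None])
    mean_errors = abs_errors.mean(axis=0)

    # Assumption: weights are inversely proportional to the local error.
    inverse = 1.0 / (mean_errors + eps)
    return inverse / inverse.sum()


# Hypothetical history: model 0 is accurate at low flows, model 1 at high flows.
hist_forecasts = [[10, 14], [11, 15], [12, 16], [50, 40], [52, 41], [55, 43]]
hist_observed = [10, 11, 12, 40, 41, 43]

weights_low = knn_time_varying_weights(hist_forecasts, hist_observed, [11, 15])
weights_high = knn_time_varying_weights(hist_forecasts, hist_observed, [51, 40])

# Ensemble forecast for the high-flow query using its own local weights.
ensemble_high = weights_high @ np.array([51.0, 40.0])
```

Because the neighbors change with the query, the weights change from one time step to the next, which is the behavior that static training-phase weights cannot provide.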
Nonetheless, conventional KNN algorithms have primarily been applied to research examining the similarity of errors in single models [30,34,36,37,38] and face limitations when extended to high-dimensional (HD) multi-model datasets. In particular, calculating distances between multidimensional time series in data-driven hydrological models becomes challenging in high-dimensional spaces [35]. The Pearson Correlation Coefficient (PCC) measures the degree to which two sets of data fit a straight line and is widely employed in feature selection for runoff prediction [4,39,40,41]. It effectively characterizes the correlation between different time series [42]. However, there is limited research on its use as a distance metric between runoff sequences.
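One common way to turn the PCC into a dissimilarity is d = 1 - r, so that perfectly correlated sequences have distance 0 and perfectly anti-correlated sequences have distance 2. The sketch below uses this conversion as an assumption; the improved distance function of this study is not specified in this section and may differ. The function name `pearson_distance` and the sample sequences are hypothetical.

```python
import numpy as np


def pearson_distance(x, y):
    """Correlation-based dissimilarity d = 1 - r between two sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must be 1-D sequences of equal length >= 2")

    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denom = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    if denom == 0:
        # The PCC is undefined when either sequence is constant.
        raise ValueError("Pearson correlation is undefined for a constant sequence")

    r = (x_centered * y_centered).sum() / denom
    return 1.0 - r


# Hypothetical runoff sequences with the same shape but different magnitudes.
seq_a = [1.0, 2.0, 3.0, 4.0]
seq_b = [10.0, 20.0, 30.0, 40.0]
seq_c = [4.0, 3.0, 2.0, 1.0]

d_ab = pearson_distance(seq_a, seq_b)  # same shape, different scale
d_ac = pearson_distance(seq_a, seq_c)  # opposite trend
euclid_ab = np.linalg.norm(np.array(seq_a) - np.array(seq_b))
```

Unlike the Euclidean distance, this measure ignores differences in magnitude and responds only to the shape of the sequences, which is why `seq_a` and `seq_b` are treated as identical here despite their large Euclidean separation.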
To address the aforementioned issues, this study introduces an adaptive prediction ensemble forecasting method with time-varying weights, referred to as the Improved K-Nearest Neighbor Multi-Model Ensemble (IKNN-MME) method. Its purpose is to enhance the predictive performance of data-driven models in runoff forecasting. The primary innovations in this research are as follows:
(1) This study introduces an improved K-Nearest Neighbor (KNN) method whose distance function incorporates a correlation coefficient, making it suitable for multi-model, multi-dimensional hydrological sequences. The aim is to strengthen the data classification capabilities of the KNN algorithm within multi-dimensional runoff sequences.
(2) By leveraging the concept of error similarity, this research proposes the IKNN-MME method, an adaptive ensemble forecasting method with time-varying weights. The method capitalizes on the advantages of KNN in data mining, extracting historically similar samples and incorporating the preferred runoff information into the weight determination for ensemble models. It is integrated with dynamic ensemble forecasting models, thereby achieving adaptive updating of model weights.
Finally, this research evaluates the performance of the multi-model averaging method on watersheds with different attributes, comparing it against several multi-model ensemble forecasting methods and benchmark models. This validation demonstrates the method’s advanced nature and applicability.
5. Conclusions
To enhance the performance of data-driven models in runoff prediction, this study introduces the Improved K-Nearest Neighbor-based Multi-Model Ensemble (IKNN-MME) method. By coupling the improved KNN approach with multi-model ensemble techniques, the method strengthens runoff modeling through both feature data selection and dynamic weighting of multiple models. To validate its practicality, the model is applied to runoff prediction in four watersheds in the United States. The results indicate that the performance of individual data-driven models in the study area is limited, especially in watersheds where the correlation between input factors and runoff is weak. After the IKNN-MME model is applied for multi-model combination, the predictive Nash–Sutcliffe efficiency (NSE) values improve from around 0.55 to above 0.80. For peak flow, the combined results of the IKNN-MME model, based on the improved KNN, are closer to the observed values than those of the KNN-MME model based on traditional KNN. This underscores the superiority of the improved KNN over traditional KNN in multi-model integration and provides a novel approach for enhancing the accuracy of data-driven prediction models. Additionally, this study reveals the following:
(1) Data-driven models exhibit stronger predictive performance when there is a higher correlation between input and output features.
(2) The runoff prediction capabilities of deep-learning models are, to some extent, superior to machine-learning and linear models.
(3) Multi-model ensemble prediction models do not consistently improve the baseline model’s prediction ability.
However, the proposed method still has limitations. The IKNN-MME model has a limited capability to enhance the prediction performance of baseline models with already high accuracy, and the computational efficiency of the model is relatively low. Future research directions include improving the computational efficiency of ensemble models, enhancing the accuracy and robustness of the model, obtaining higher-precision runoff prediction results, and ultimately improving water resource management decisions.