1. Introduction
Predicting river bed load is a pivotal challenge in fluvial geomorphology, crucial for understanding river dynamics and sediment transport and for their implications across ecological and engineering applications. The task is inherently complex, driven by the stochastic and non-linear nature of sediment transport in diverse riverine systems. This complexity is compounded by factors such as variability in river discharge, sediment size distribution, channel morphology, and hydrodynamic forces, which together hinder reliable bed load prediction [1]. Additionally, the direct measurement of bed load during significant flood events or over extended periods poses substantial difficulties, further exacerbating the challenge of making reliable predictions [2].
Traditional methodologies in sediment transport, including semi-empirical and empirical models [3,4], often necessitate extensive site-specific calibration and face challenges in transferability across diverse river environments, underscoring the limitations inherent in conventional approaches [5]. Numerical models of bed load transport often rely on assumptions about grain diameter and density that may not accurately represent the mechanics of sediment moving by saltation [6,7]. There is therefore growing recognition that a single approach may not suffice to capture the complex interplay of factors influencing bed load transport. This limitation has been highlighted in multi-model evaluations, which reveal that few approaches perform consistently well across broad, multi-region samples, underscoring the challenge of resolving site-specific variation while averaging over temporal and spatial scales [8,9]. Efforts to predict bed load transport are further challenged by the inherent variability of the process, including the spatial variability of turbulent stress, grain protrusion, bed heterogeneity, and structural arrangement [10,11,12], as well as additional complicating factors such as vegetation-induced resistance [4], variability in alluvial cover thickness [13], and the aggradation and degradation of river beds [14]. These elements introduce substantial uncertainty into predictions, often requiring models to be adaptable and robust across a wide range of conditions, especially during significant flood events or over prolonged periods [15].
In this context, advanced machine learning (ML) algorithms emerge as a promising alternative [16,17]. Unlike traditional models, ML approaches offer the flexibility to learn directly from data, capturing complex, non-linear relationships without relying on predefined equations [18,19]. Notably, the application of ML methods in environmental monitoring, including the estimation of scour depth [20], precipitation bias correction [21], and rainfall-runoff modeling [22,23], underscores their adaptability to varied contexts. This adaptability is pivotal for river bed load prediction, where the interaction of numerous factors, ranging from hydrological to geomorphological, determines sediment transport rates [24]. For example, Bhattacharya et al. [25] developed two ML models, model trees and an artificial neural network (ANN), to estimate total load transport and bed load, utilizing observed samples provided by Gomez and Church [26]. Their analysis reaffirmed the advantages of ML models, particularly their ability to process complex input parameters, and produced accurate predictions with significantly lower root mean squared errors (RMSEs) than traditional models such as those of Bagnold [27] and Einstein [28]. Azamathulla et al. [29] compared an adaptive neuro-fuzzy inference system (ANFIS) with four common bed load equations using observed data gathered from four watersheds in Malaysia. This study demonstrated that the ANFIS model can estimate bed load more efficiently than regression-based equations.
In addition, Kitsikoudis et al. [30] and Kitsikoudis et al. [31] made a noteworthy advancement in applying ML to sediment transport by employing genetic algorithm (GA)-based symbolic regression (SR), ANN, and ANFIS models to analyze sediment concentration datasets from field and flume investigations. Their findings illuminated the superior performance of models trained on field data over those trained on flume data. The authors concluded that ANNs provided the best outcomes, while the equations derived from SR were simpler and clearer. Roushangar and Koosheh [32] developed a bed load transport rate prediction model for three gravel-bed rivers utilizing support vector regression (SVR). The results showed that the SVR models offered better performance than conventional equations. By contrast, the study of Roushangar and Shahnazi [33] pointed out that a wavelet kernel extreme learning machine has higher predictive potential than SVM in estimating bed load for 19 gravel-bed rivers in the USA. In 2020, Asheghi and Hosseini [34] expanded the scope of ML applications in bed load transport by developing three ANN-based prediction models (multilayer perceptron, generalized feed-forward neural network, and radial basis function network) using a dataset from Idaho, USA. Despite the limited geographic scope, their study demonstrated the potential of ANNs to outperform existing empirical equations, with coefficients of determination in the range of 0.93–0.98 for the ANNs versus 0.10–0.21 for the empirical equations. Moreover, a block-combined network with a GA to improve ANN performance was proposed by Hosseini et al. [35]. In comparison to previous ANNs and empirical models, this model displayed improved prediction accuracy (89.77%) when tested on 879 bed load records from Idaho, USA.
Despite the above advances, the application of ML to predicting river bed load across a wide range of fluvial settings remains underexplored. The studies mentioned above, while pioneering, were constrained by the limited size and scope of their datasets. This limitation underscores the need for research employing extensive, globally collected datasets to train and validate a diverse array of ML models. Hosseiny et al. [36] addressed this gap by developing an ANN model that utilizes measurements from 134 rivers to forecast bed load sediment transport rates. This comprehensive dataset enabled the development of a sediment transport model that outperforms current models and eliminates the need for site-specific calibration. Compared to traditional models such as those of Einstein [28], Wilcock and Crowe [37], and Recking [38], the ANN model demonstrated superior accuracy and reliability in reproducing the distribution of the observed bed loads. However, that work also highlights a significant gap in the literature, namely the application of a comprehensive suite of ML models to river bed load prediction. While ANNs have been the focal point of much research, the potential of other ML algorithms, such as random forest (RF), categorical boosting (CAT), extra tree regression (ETR), or K-nearest neighbors (KNNs), remains underexplored in this specific context.
The present study extends previous work by employing a multi-model ML approach, integrating both individual algorithms and state-of-the-art ensemble techniques to predict river bed load, utilizing a comprehensive dataset compiled by Recking [39]. This approach leverages the collective strengths of several well-established ML models, including RF, CAT, ETR, KNN, gradient boosting machine (GBM), and a Bayesian regression model (BRM). Each model was selected for its demonstrated efficacy in handling complex, non-linear data structures and its potential to contribute unique insights into the predictive challenges associated with river bed load estimation. Furthermore, the research extends traditional ML approaches by introducing ensemble models that combine predictions from multiple models, aiming to reduce predictive uncertainty and improve accuracy. This aspect examines whether ensemble models, namely the weighted averaging ensemble (WAE), stacking ensemble (SE), and voting ensemble (VE), achieve more precise bed load estimations than the individual models.
The principal purpose of this study is to develop, analyze, and compare several advanced ML algorithms for the prediction of river bed loads. The study also performs a comparative analysis to discern which models excel in diverse scenarios and to understand the reasons behind their performance. The primary contributions of this research are as follows:
It presents one of the first comprehensive comparisons of multiple advanced ML models alongside novel ensemble techniques in the context of river bed load prediction, contributing to the methodological advancements in fluvial geomorphology.
The study conducts a detailed analysis of model performance using a range of metrics, offering insights into the strengths and limitations of each ML approach and ensemble approach.
By utilizing feature importance methods and Shapley additive explanation (SHAP) values, the research deepens the understanding of the critical drivers in bed load dynamics.
The uncertainty quantification through Monte Carlo simulations under various data scenarios offers valuable insights into the reliability and robustness of ML predictions and explores the effect of training data size on the accuracy of river bed load predictions.
The study’s findings have practical implications, aiding in more accurate and reliable predictions of river bed loads, which are crucial for effective river management and policy-making.
Following this introduction, Section 2 outlines the methodology, detailing data collection and pre-processing, and provides an overview of the ML models utilized; this section also discusses model training and uncertainty analysis. Section 3, Results and Discussion, presents a comparative performance analysis, interprets feature importance and SHAP value insights, assesses predictive uncertainty under various training data sizes, and examines the performance of ensemble models. Section 4 summarizes the paper.
2. Materials and Methods
2.1. Data Collection and Pre-Processing
The bed load transport dataset for this study was sourced from BedloadWeb (http://en.bedloadweb.com, accessed on 15 February 2024), a publicly accessible online platform that compiles bed load measurements collected in laboratories and in the field, together with databases and official reports. According to Recking [39], this database stands out for its global compilation of bed load transport measurements, encompassing over 120 sites worldwide and offering an expansive dataset of approximately 11,000 data points. Each data point corresponds to a specific measurement of the bed load transport rate.
This dataset integrates a wide variety of physical and hydrological variables, including river discharge (Q), flow width (W), bed slope (S), section averaged velocity (U), mean flow depth (H), hydraulic radius (Rh), and grain size distributions denoted by the 16th, 50th, 84th, and 90th percentiles (D16, D50, D84, and D90). These variables were selected to comprehensively represent the key factors influencing bed load transport, thus providing a robust foundation for developing ML models with widespread applicability across different riverine and geomorphological contexts.
A significant challenge in pre-processing the dataset was the uneven availability of variables across the different river survey locations. Some measurements, such as U, H, Rh, and specific grain size percentiles (D16 and D90), were inconsistently reported. This inconsistency poses a risk of introducing bias or over-reliance on incomplete datasets in ML models, potentially compromising the models' predictive performance and generalizability. To address this challenge, the variables with high rates of missing data, specifically U, H, and Rh, were excluded from the analysis. This decision is supported theoretically by Manning's equation in river hydraulics, whereby U, H, and Rh can be expressed as functions of the consistently reported variables Q, S, and W, along with the Manning roughness coefficient (n), which is generally correlated with grain size. The exclusion therefore minimizes potential bias and ensures that the ML models rely on a consistently available set of variables, while the relationship of U, H, and Rh with channel geometry and roughness is implicitly preserved through the included variables.
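For reference, this argument rests on the standard forms of Manning's equation and flow continuity (written here for an approximately rectangular cross-section, an assumption added for illustration):
\[
U = \frac{1}{n} R_h^{2/3} S^{1/2}, \qquad Q = U\,W\,H, \qquad R_h = \frac{W H}{W + 2H},
\]
so that, for a roughness coefficient n inferred from grain size, U, H, and Rh are implicitly determined by the retained variables Q, S, and W.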
Furthermore, the survey locations lacking complete data for essential grain size distributions (D16 or D90) were omitted from the dataset. Consequently, the primary variables of interest for this study were refined to seven, including Q, W, S, and the grain size distributions (D16, D50, D84, and D90). In addition, the data underwent a screening process to identify and remove outliers, excluding extreme values that could potentially skew the model training. Specifically, transport data related to the discharge values outside the 95th percentile, extreme bed load flux values above the 95th percentile, and values below the 10th percentile were removed. After applying these steps, the dataset was reduced to 5347 samples, each encompassing a comprehensive and consistent set of variables critical for analyzing bed load transport dynamics. This refined dataset ensures a balanced representation of the physical characteristics relevant to bed load transport, improving the reliability and applicability of the resulting ML algorithms.
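A minimal sketch of this percentile-based screening, assuming the data are held in a pandas DataFrame with hypothetical column names Q (discharge) and qs (bed load flux):

```python
import pandas as pd

def screen_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with extreme discharge or bed load flux values.

    Thresholds follow the percentile rules described in the text: discharge
    above its 95th percentile, and bed load flux above the 95th or below
    the 10th percentile, are removed.
    """
    q_hi = df["Q"].quantile(0.95)
    qs_hi = df["qs"].quantile(0.95)
    qs_lo = df["qs"].quantile(0.10)
    keep = (df["Q"] <= q_hi) & (df["qs"] <= qs_hi) & (df["qs"] >= qs_lo)
    return df.loc[keep].reset_index(drop=True)
```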
Subsequently, the data were log-transformed (base 10) to normalize the distribution of each variable. This transformation helps stabilize variance and makes the dataset more model-friendly, particularly for algorithms sensitive to the scale of the input features. Following the log transformation, the dataset underwent feature scaling to ensure uniformity in the range of the data so that no single variable would dominate the model because of its scale. This scaling is crucial for models sensitive to the magnitude of the inputs, such as KNN. The scaling adjusted the values to a standard range of 0 to 1, defined by the following:
\[
X' = \log_{10}(X), \qquad X_{\text{scaled}} = \frac{X' - X'_{\min}}{X'_{\max} - X'_{\min}},
\]
where X and X′ denote the original and log-transformed values in the dataset, and X_scaled represents the scaled value.
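A minimal sketch of this two-step transformation using scikit-learn; the feature matrix X (columns Q, W, S, D16, D50, D84, D90) is an assumption for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def log_minmax_transform(X: np.ndarray) -> np.ndarray:
    """Base-10 log transform followed by per-feature min-max scaling to [0, 1].

    X has shape (n_samples, n_features) and holds strictly positive values
    of Q, W, S, D16, D50, D84, and D90.
    """
    X_log = np.log10(X)
    scaler = MinMaxScaler(feature_range=(0, 1))
    return scaler.fit_transform(X_log)
```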
2.2. Overview of ML Algorithms
2.2.1. Random Forest (RF)
The RF algorithm is a powerful ensemble learning technique utilized for both regression and classification tasks. It works by constructing a large number of decision trees during training and then outputs the mean prediction (regression) or the mode of the classes (classification) of the individual trees. RF introduces randomness into the model in two ways: by sampling the data points (bootstrap sampling) and by picking feature subsets at every split in the decision trees, thereby increasing the diversity of the models in the ensemble [40]. RF is particularly suited to river bed load estimation due to its ability to handle high-dimensional data and its robustness to overfitting. By aggregating the predictions of numerous trees, RF can identify complex, non-linear relationships between the input factors and the target qs.
Given a dataset D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} with n samples, where each sample x_i has m features, the RF algorithm constructs B decision trees. A random sample of the data, D_b, is selected for each tree T_b. At each tree node, a subset of m′ < m features is chosen randomly, and the best split on these features is used to partition the data. The procedure is repeated until a stopping requirement is satisfied, for example, a minimum number of samples per leaf. The prediction for a new sample x′ is obtained by averaging the predictions from all the individual trees T_b:
\[
\hat{y}(x') = \frac{1}{B} \sum_{b=1}^{B} T_b(x').
\]
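A minimal sketch of fitting such a model with scikit-learn; the hyperparameter values are placeholders rather than the PSO-optimized settings reported later in Table 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def fit_random_forest(X: np.ndarray, y: np.ndarray) -> RandomForestRegressor:
    """Fit an RF regressor on an 80/20 split and report the test R^2 score.

    X and y are the scaled features and log-transformed bed load targets
    from the pre-processing step; hyperparameters are illustrative only.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=500, min_samples_leaf=2, random_state=42)
    model.fit(X_tr, y_tr)
    print(f"Test R^2: {model.score(X_te, y_te):.3f}")
    return model
```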
2.2.2. Categorical Boosting (CAT)
CAT is an algorithm that builds upon the principles of gradient boosting and is particularly designed to function effectively with categorical variables alongside continuous ones [41]. It introduces innovative techniques such as ordered boosting and the combination of categorical features to enhance model performance and reduce overfitting.
Given a dataset D, CAT iteratively constructs an ensemble of decision trees. The key distinction lies in its treatment of categorical features, which are transformed using a statistics-based approach considering the target variable. The prediction model at iteration k, denoted F_k(x), is updated by fitting a tree to the pseudo-residuals calculated from the previous model F_{k−1}(x):
\[
F_k(x) = F_{k-1}(x) + \alpha\, h_k(x),
\]
where h_k(x) represents the decision tree trained on the pseudo-residuals at iteration k, and α is the learning rate.
2.2.3. Gradient Boosting Machine (GBM)
GBM is an ensemble approach that develops models consecutively, with each new model correcting the errors of the previous ones. GBM combines numerous weak prediction models, commonly decision trees, to form a robust model in a stage-wise manner. The fundamental principle of GBM is to fit a new model to the residuals of the previous models and add it to the ensemble to minimize the overall prediction error [42]. GBM has notable efficacy in predictive tasks such as river bed load estimation, where the relationship between the predictors and the response is complicated and non-linear. Its ability to sequentially focus on difficult-to-predict instances makes it a powerful tool for improving prediction accuracy.
Let F_k(x) be the predictive model at iteration k. The GBM algorithm seeks to minimize a loss function L(y, F(x)) over all the samples in the dataset, where y is the observed value and F(x) is the predicted value. The algorithm starts with an initial model F_0(x), often chosen as the mean of the target variable,
\[
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma).
\]
At each subsequent iteration k, the predictions of the current model F_{k−1}(x) are utilized to fit a decision tree h_k(x) to the negative gradient of the loss function. The model is then updated as follows:
\[
F_k(x) = F_{k-1}(x) + \alpha\, h_k(x),
\]
where γ represents the initial model parameter and α is the learning rate, which determines how much each tree contributes to the final model.
2.2.4. Extra Tree Regression (ETR)
ETR is an ensemble method that builds upon the principles of RFs by constructing many unpruned decision trees. The key differences lie in how the splits are chosen and how the data are sampled: ETR selects splits completely at random and builds the trees using the whole dataset rather than a bootstrap sample [43]. This approach of using the entire dataset and selecting splits randomly increases the diversity among the trees in the ensemble, often leading to improved robustness and generalization capability compared to standard RFs.
Given a dataset D, ETR constructs B decision trees. For each split in each tree, rather than searching for the optimal split point, ETR randomly selects candidate split points and chooses the best one among them. The ensemble prediction is obtained by aggregating the individual trees:
\[
\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} t(x;\, \theta_b),
\]
where t(x; θ_b) represents the b-th tree in the ensemble, parameterized by θ_b, and the aggregation is typically the average for regression tasks.
2.2.5. K-Nearest Neighbors (KNNs)
KNN is a non-parametric, lazy learning method for classification and regression. It operates on the simple principle of determining the k nearest training instances in the feature space and producing estimates based on the majority vote (for classification) or average (for regression) of these neighbors [44]. KNN's simplicity and effectiveness make it particularly useful for river bed load estimation when the connection between the input variables and the target may not be linear but a local similarity in the feature space can be exploited.
For a new sample x′, the KNN algorithm uses a distance metric, such as the Euclidean distance, to calculate the distance between x′ and each point x_i in the training set:
\[
d(x', x_i) = \sqrt{\sum_{j=1}^{m} \left( x'_j - x_{i,j} \right)^2 },
\]
where m is the number of features. The algorithm then selects the k nearest points N_k(x′) based on the smallest distances and, for regression tasks, predicts the output variable by averaging the target values of these nearest neighbors:
\[
\hat{y}(x') = \frac{1}{k} \sum_{x_i \in N_k(x')} y_i.
\]
2.2.6. Bayesian Regression Model (BRM)
BRM represents a probabilistic approach to regression analysis, which contrasts with traditional regression methods by incorporating prior knowledge about the model parameters through probability distributions [45]. BRM leverages computational techniques such as Markov chain Monte Carlo to determine the posterior distribution of the coefficients, from which predictions can be made. These predictions are inherently probabilistic, providing point estimates and confidence intervals that reflect the uncertainty in the model parameters. The incorporation of prior information and the explicit modeling of parameter uncertainty make BRM especially valuable in contexts where prior knowledge exists or where it is crucial to quantify the uncertainty in predictions.
In contrast to linear regression, which uses a linear function to represent the relationship between a dependent variable y and a set of independent variables X, Bayesian regression models the regression coefficients β as random variables with specified prior distributions. The noise term ε is typically assumed to be normally distributed with mean zero and variance σ². The core equation in BRM is the posterior distribution of the regression coefficients given the data, expressed as follows:
\[
p(\beta \mid X, y) = \frac{p(y \mid X, \beta)\, p(\beta)}{p(y \mid X)},
\]
where p(β | X, y) is the posterior probability of the coefficients given the data, representing our updated belief about the coefficients after observing the data; p(y | X, β) is the likelihood of observing the data given the coefficients, essentially the probability of the data under the model specified by β; p(β) is the prior probability of the coefficients, encapsulating our beliefs about the values of β before observing the data; and p(y | X) is the marginal likelihood of the data, serving as a normalizing constant that ensures the posterior distribution integrates to one.
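A minimal sketch of a Bayesian linear regression with predictive uncertainty, using scikit-learn's BayesianRidge (an analytical Gaussian-prior model rather than the MCMC approach mentioned above; the input arrays are assumed to come from the pre-processing step):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def fit_bayesian_regression(X: np.ndarray, y: np.ndarray, X_new: np.ndarray):
    """Fit a Bayesian linear model and return mean predictions with std. dev."""
    model = BayesianRidge()
    model.fit(X, y)
    # return_std=True yields the standard deviation of the predictive
    # distribution, reflecting uncertainty in the coefficients and noise.
    y_mean, y_std = model.predict(X_new, return_std=True)
    return y_mean, y_std
```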
2.2.7. Ensemble Techniques
Ensemble techniques have received substantial interest in ML due to their robustness and superior predictive performance over single-model approaches. They operate on the premise that combining the predictions of multiple models can compensate for individual weaknesses and leverage their diverse strengths, thereby enhancing prediction accuracy and reducing model variance. This subsection delineates three sophisticated ensemble techniques employed in river bed load estimation: weighted averaging ensemble (WAE), stacking ensemble (SE), and voting ensemble (VE).
The WAE approach synthesizes predictions from various models by averaging them together while assigning a different weight to each model's output. This method emphasizes the contribution of the more accurate models and downplays that of the less accurate ones. In this study, the weights w_i are inversely proportional to each model's RMSE (RMSE_i), enhancing the ensemble's predictive performance. Formally, the combined estimation ŷ for a given input vector X is expressed as follows:
\[
\hat{y}(X) = \sum_{i=1}^{n} w_i\, f_i(X), \qquad RMSE_i = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left( y_j - f_i(X_j) \right)^2 },
\]
where i represents the i-th model in the ensemble, j represents the j-th observation in the dataset, and m denotes the number of observations. The weights are given by
\[
w_i = \frac{1 / RMSE_i}{\sum_{k=1}^{n} 1 / RMSE_k},
\]
where i denotes the i-th model for which the weight is being calculated, k represents the k-th model in the total ensemble of n models, and f_i(X) denotes the prediction from the i-th model.
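A minimal sketch of the inverse-RMSE weighting; preds holds each base model's predictions and y_true the corresponding observations (in practice the weights would be estimated on training or validation data and then applied to new predictions):

```python
import numpy as np

def weighted_average_ensemble(preds: list[np.ndarray], y_true: np.ndarray) -> np.ndarray:
    """Combine base-model predictions with weights inversely proportional to RMSE."""
    rmses = np.array([np.sqrt(np.mean((y_true - p) ** 2)) for p in preds])
    weights = (1.0 / rmses) / np.sum(1.0 / rmses)  # normalized inverse-RMSE weights
    return sum(w * p for w, p in zip(weights, preds))
```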
SE, also known as stacked generalization, involves training a secondary model, known as a meta-learner, to optimally combine the predictions from multiple base models. The base-level models f_i are trained on the complete training dataset, and their predictions form a new dataset D_meta, which is used to train the meta-learner F. The final prediction is obtained by feeding the base models' predictions into the meta-learner. In the context of this study, linear regression serves as the meta-learner, chosen for its interpretability and efficiency. This technique can be mathematically formalized as follows:
\[
\hat{y}(X) = F\left( f_1(X),\, f_2(X),\, \ldots,\, f_n(X) \right).
\]
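A minimal sketch of the stacking idea with scikit-learn's StackingRegressor, using a linear meta-learner over a subset of the base models (the configurations are illustrative, not the tuned models of Table 1):

```python
from sklearn.ensemble import (
    StackingRegressor,
    RandomForestRegressor,
    GradientBoostingRegressor,
    ExtraTreesRegressor,
)
from sklearn.linear_model import LinearRegression

def build_stacking_ensemble() -> StackingRegressor:
    """Assemble a stacking ensemble with a linear-regression meta-learner."""
    base_models = [
        ("rf", RandomForestRegressor(n_estimators=300, random_state=42)),
        ("gbm", GradientBoostingRegressor(random_state=42)),
        ("etr", ExtraTreesRegressor(n_estimators=300, random_state=42)),
    ]
    return StackingRegressor(estimators=base_models, final_estimator=LinearRegression(), cv=5)
```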
VE is another ensemble method that combines predictions from multiple models; unlike WAE, it does not use weighted averages but instead applies a simple voting mechanism. This method benefits from the diverse perspectives of the constituent models, as it does not assume any prior information regarding the models' performance, thereby providing a model-agnostic approach to ensemble learning. For regression tasks, the most common form of voting is averaging, where each model in the ensemble votes for a particular value and the final prediction is the arithmetic mean of these votes:
\[
\hat{y}(X) = \frac{1}{n} \sum_{i=1}^{n} f_i(X).
\]
2.3. Model Training and Uncertainty Analysis
2.3.1. Hyperparameter Optimization
The model training and cross-validation processes are critical in developing ML models for river bed load prediction. The dataset was subjected to a random shuffling algorithm a thousand times after pre-processing. This rigorous shuffling ensures a homogeneous distribution of data points across the dataset, thereby mitigating any potential biases that could arise from the original ordering of the data. Such a strategy is instrumental in preserving the integrity of the model training and validation processes. Upon data shuffling, the dataset was partitioned into training and testing subsets, with 80% of the data (approximately 4277 samples) allocated for training purposes and the remaining 20% (about 1070 samples) reserved for testing. This division is designed to thoroughly evaluate the model’s predictive abilities and capacity to generalize across novel datasets.
Hyperparameter optimization is a critical step in configuring ML models to achieve optimal performance. For this study, particle swarm optimization (PSO) was employed, diverging from the conventional grid search methodology. PSO, inspired by the social behavior patterns of birds and fish, excels in navigating large and complex search spaces efficiently. The algorithm simulates a swarm of particles moving through the search space, where each particle represents a potential solution, in this case, a set of hyperparameters.
The PSO was initiated by randomly positioning particles within the predefined bounds of possible hyperparameter values. The position of each particle was updated iteratively based on two key aspects: the personal best position of the particle and the global best position found by any particle in the swarm. This dual influence allows the particles to explore the search space thoroughly while converging toward the most promising regions.
The objective function for the PSO was the cross-validated mean squared error, which the algorithm sought to minimize, assessed through a rigorous 5-fold cross-validation. In each iteration of the cross-validation, four segments of the training dataset were used to train the model, while the fifth segment served as the validation set to evaluate performance. This procedure was repeated five times, rotating the validation segment so that each part of the dataset was used exactly once for validation. Such a methodology provided a robust estimation of the model's predictive accuracy across varied data subsets. The update equations governing the movement of each particle are expressed as follows:
\[
v_i^{t+1} = w\, v_i^{t} + c_1 r_1 \left( p_i - x_i^{t} \right) + c_2 r_2 \left( g - x_i^{t} \right), \qquad x_i^{t+1} = x_i^{t} + v_i^{t+1},
\]
where v_i^t is the velocity of particle i at iteration t; x_i^t is its current position (the candidate hyperparameters); p_i is the personal best position of the particle; and g is the best position found by any particle in the swarm. The constants w, c_1, and c_2 represent the inertia weight and the cognitive and social coefficients, respectively, while r_1 and r_2 are random numbers between 0 and 1. These equations ensure that each particle learns both from its own experience and from the successes of its neighbors, driving convergence toward optimal hyperparameters. The optimal hyperparameters identified through PSO for each model are summarized in Table 1.
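A minimal sketch of a PSO loop over two RF hyperparameters (n_estimators, min_samples_leaf), scoring candidates by 5-fold cross-validated MSE; the bounds, swarm size, and coefficients are illustrative assumptions, not the settings used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def pso_tune_rf(X, y, n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Tune (n_estimators, min_samples_leaf) for an RF with a basic PSO loop."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array([50.0, 1.0]), np.array([500.0, 10.0])  # illustrative search bounds
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)

    def cost(p):
        model = RandomForestRegressor(
            n_estimators=int(round(p[0])), min_samples_leaf=int(round(p[1])), random_state=0
        )
        # cross_val_score returns negative MSE; negate it to obtain a cost to minimize.
        return -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

    pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
    gbest = pbest[np.argmin(pbest_cost)]

    for _ in range(n_iter):
        r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
        gbest = pbest[np.argmin(pbest_cost)]

    return {"n_estimators": int(round(gbest[0])), "min_samples_leaf": int(round(gbest[1]))}
```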
2.3.2. Feature Importance and SHAP Value Analysis
The permutation feature importance method was employed to quantify the contribution of each input variable to the predictive models. This technique involves randomly shuffling each predictor variable in the dataset and measuring the change in the model's accuracy [46]. The larger the drop in accuracy, the more influential the variable is for the model's predictive capability. This approach is particularly insightful for understanding the relative significance of different physical factors, such as grain size distributions and river discharge, in influencing bed load transport rates.

Additionally, SHAP values were calculated for each model to provide a deeper understanding of the feature contributions. SHAP values, at their core, serve to decompose the prediction of an ML algorithm into contributions from each predictor, providing a comprehensive perspective through which to interpret the model's behavior [47]. This method extends the interpretability of models through a model-agnostic approach, which means it can be used across different types of ML models, from linear models to complex tree ensembles and deep neural networks.
One remarkably versatile implementation of SHAP is kernel SHAP, which employs a weighted linear regression model to approximate SHAP values for any ML model. This approach is grounded in the idea that the explanation model (a simple linear model of feature contributions) should approximate the output of the complex model as closely as possible for any given input. The kernel SHAP method calculates the SHAP value for a feature by evaluating the change in the prediction that occurs when the feature is “included” in a subset versus when it is “excluded”. The weights for the regression are determined by the SHAP kernel, which measures the distance between the coalition of features included and the complete set of features. Kernel SHAP effectively enables the estimation of SHAP values and emphasizes the importance of understanding model predictions regarding feature contributions.
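A minimal sketch of both analyses for a fitted model, using scikit-learn's permutation_importance and the shap package's KernelExplainer (the fitted model, feature arrays, and background-sample size are assumptions for illustration):

```python
import numpy as np
import shap
from sklearn.inspection import permutation_importance

def explain_model(model, X_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray):
    """Compute normalized permutation importances and kernel SHAP values."""
    # Permutation importance: drop in score when each feature is shuffled.
    perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    norm_importance = perm.importances_mean / perm.importances_mean.sum()

    # Kernel SHAP: local additive contributions approximated with a weighted
    # linear model; a small background sample keeps the computation tractable.
    background = shap.sample(X_train, 100)
    explainer = shap.KernelExplainer(model.predict, background)
    shap_values = explainer.shap_values(X_test)
    return norm_importance, shap_values
```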
2.3.3. Uncertainty Quantification
In the context of river bed load dynamics, which are characterized by their inherent complexity and variability, quantifying uncertainty in model predictions is essential. The Monte Carlo method stands out as a fundamental approach for uncertainty assessment, taking advantage of stochastic simulations to explore a variety of outcomes under various scenarios. This technique is instrumental in evaluating the behavior of ML models under different data conditions, shedding light on their robustness and reliability.
The framework for uncertainty quantification was structured around several data scenarios, essentially varying the sample rate of the training data. By adjusting the sample rates (SR) of the original training data (X_train) from 0.3 to 1.0, the analysis captures the model’s performance across a spectrum of data completeness. This approach is designed to gauge how well each model performs under the varying degrees of data completeness, reflecting realistic situations where data limitations are a common challenge. By systematically exploring the impact of training data variability on model performance, this methodology provides a robust framework for assessing the reliability of ML predictions.
For each sample rate SR_i, where SR_i ranges from 0.3 to 1.0 in predefined increments, the study executes 1000 Monte Carlo simulations. These simulations generate a multitude of training subsets, with j denoting the simulation iteration. This process ensures the robustness of the findings by mitigating the influence of any specific data sampling. Evaluating model performance across these diverse training data scenarios builds a detailed picture of each model's adaptability to data quantity and quality. Following the simulations, the mean (µ) and standard deviation (σ) of the performance metric (here, R) for each model and sample rate are computed to quantify the model's predictive performance and its variability.
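A minimal sketch of the Monte Carlo subsampling loop for one model, assuming model_factory returns a fresh, already-configured estimator and pearsonr is used for the R score:

```python
import numpy as np
from scipy.stats import pearsonr

def monte_carlo_r_scores(model_factory, X_train, y_train, X_test, y_test,
                         sample_rate: float, n_sims: int = 1000, seed: int = 0):
    """Return the mean and std of R over n_sims random training subsets."""
    rng = np.random.default_rng(seed)
    n_sub = int(sample_rate * len(X_train))
    scores = []
    for _ in range(n_sims):
        idx = rng.choice(len(X_train), size=n_sub, replace=False)
        model = model_factory()
        model.fit(X_train[idx], y_train[idx])
        r, _ = pearsonr(y_test, model.predict(X_test))
        scores.append(r)
    return float(np.mean(scores)), float(np.std(scores))
```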
2.4. Performance Metrics
In this study, four performance metrics were employed to evaluate the effectiveness of the ML models in predicting river bed load: RMSE, mean absolute error (MAE), correlation coefficient (R), and Nash–Sutcliffe efficiency (NSE). These metrics were chosen for their widespread acceptance in hydrodynamic modeling and their ability to provide comprehensive insights into model accuracy, predictive power, and overall performance. In particular, RMSE measures the model's ability to estimate the quantity of interest; it quantifies the square root of the average squared differences between the observed and estimated values. Lower RMSE (or MAE) values indicate better model performance. NSE assesses the predictive skill of a model relative to the mean of the observed data, while R measures the strength and direction of the linear relationship between the observed and predicted values. The following table (Table 2) summarizes these metrics, providing their formulas, value ranges, and optimal values for reference.
In this table, y_i and ŷ_i represent the observed and predicted values, ȳ is the mean of the observed values, and n denotes the number of samples.
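For reference, the standard definitions of these metrics (assumed here to match the conventional forms summarized in Table 2) are:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\]
\[
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, \qquad
R = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}}.
\]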
3. Results and Discussion
3.1. Model Performance Analysis
Table 3 presents the performance statistics of six ML algorithms across four key metrics: RMSE, MAE, NSE, and R.
Figure 1 complements this analysis by providing the comparative scatter plots of model predictions versus the observed data.
Table 3 provides a concise overview of the performance statistics for the six algorithms. An immediate observation is the superior performance of the RF, GBM, CAT, and ETR models, which exhibit the lowest MAE and RMSE scores (0.009 and 0.022, respectively), indicating a closer fit to the observed data points. These models also demonstrate high NSE values (above 0.932), signifying their efficiency in capturing the variance of the observed bed load transport rates. Conversely, the BRM exhibits notably higher MAE and RMSE values (0.014 and 0.035, respectively) alongside a considerably lower NSE (0.838). Although the BRM can capture the trend of the bed load transport process, its predictions are less precise and consistent than those of the tree-based and KNN models, illustrating the difficulty of applying Bayesian methods in this problem domain. The KNN model, with an NSE of 0.925 and an RMSE of 0.024, offers a balance between the high accuracy of the tree-based models and the lower performance of the BRM. These differences underscore the variability in model performance, highlighting the strengths and limitations of each approach in capturing complex fluvial processes.
Figure 1 visually reinforces these quantitative findings by illustrating the degree of alignment between the model predictions and the observed data. The scatter plots, particularly for the tree-based models, show a tighter clustering of points along the 1:1 line, reflecting their higher accuracy and efficiency in bed load prediction, with R values exceeding 0.936. The plot for KNN, which depicts a slightly more dispersed but still closely aligned set of points, further confirms its effectiveness despite a slight decrease in accuracy, as indicated by its R score (about 0.9326). In contrast, the BRM, with its unique approach to handling uncertainty and incorporating prior knowledge, tends to exhibit variability in performance depending on the specificity and quality of the input data. This variability is evident in the broader spread of data points in the scatter plot for the BRM, resulting in a notably lower R score (0.8384) and visually depicting its lower predictive accuracy.
3.2. Feature Importance and SHAP Value Insights
This section will delve into the critical evaluation of feature importance across the different ML models employed in the study. This analysis is pivotal in understanding how each input variable influences the model’s predictive capabilities. Permutation feature importance scores are initially calculated for each variable across all the models. To enable a comparative analysis across the diverse models, these scores are normalized by dividing each variable’s raw importance score by the sum of all the scores for that model, ensuring that the total normalized importance for each model equals one. This normalization facilitates a direct comparison of the influence of each variable within and across different modeling approaches.
Figure 2 graphically represents the normalized permutation feature importance scores across the models, and Table 4 lists the underlying scores that form the basis for the normalization process resulting in the values depicted in Figure 2. The subsequent discussion then turns to the SHAP values illustrated in Figure 3.
River discharge (Q) demonstrates high normalized importance across most models, particularly in BRM and GBM, both registering a score of 0.692. This dominance emphasizes the critical role of discharge in influencing bed load transport rates, a finding that aligns with hydrological principles. The score also signifies the substantial influence of Q in these models compared to CAT and KNN, where Q's normalized importances are markedly lower, at 0.469 and 0.440, respectively.
SHAP value distributions, as presented in Figure 3, provide a deeper understanding of variable importance than permutation feature importance scores alone, as they account for interaction effects between features and directly reveal the marginal contribution of each variable to the estimation. For instance, Q, which emerged as a key variable in the permutation feature importance analysis, also consistently shows the highest impact across all the models in the SHAP value analysis. This is evidenced by the dense clustering of points with larger SHAP values, signifying that the role of Q in predicting bed load is both significant and stable. Notably, the SHAP values for Q in the BRM display a wider spread (from −0.6 to +0.6), suggesting that the BRM's predictions are more sensitive to variations in Q than those of the other models (from −0.4 to +0.2). Regarding the directionality of the SHAP values, higher values of river discharge lead to an increase in predicted bed load, which aligns with the hydrological expectation that higher flows generally carry more sediment. This trend is visually represented by a dense cluster of high-value points extending to the right of the zero line, indicating a positive correlation with the model's output. Moreover, the distinct spread of SHAP values for Q in the BRM model, as depicted in Figure 3, underscores the model's unique handling of predictive uncertainty. Unlike more deterministic models, BRM's broader SHAP value distribution reflects its probabilistic framework, which integrates a range of possible outcomes to accommodate the inherent uncertainties in the input data. This characteristic is particularly evident when compared to the tighter clusters observed in deterministic models such as RF and CAT, which are less equipped to represent uncertainty directly in their predictions.
The SHAP analysis also highlights the substantial impact of W and S (but inconsistently for models), corroborating the findings from the importance of the permutation feature. The SHAP values for W in the BRM suggest that higher values of W tend to decrease the predicted bed load (collection of points skewed to the left of 0 for higher feature values), a less clear relationship in the other models. This inverse relationship may reflect the unique way in which BRM integrates this feature, perhaps influenced by its underlying Bayesian framework, which is designed to handle uncertainty differently than other methods. The impact of S is significant in models such as KNN, RF, and ETR, where a higher slope is associated with an increase in the predicted bed load. The SHAP values cluster positively for S, which is consistent with the physical understanding that steeper slopes can often result in more vigorous sediment transport due to increased gravitational forces acting on the particles.
The grain size variables D16, D50, D84, and D90 showed different levels of impact across the models. The CAT and ETR models display a more pronounced sensitivity to changes in these features than KNN and BRM. For instance, the CAT model shows a sensitive response to changes in D50 that is not as pronounced in KNN and BRM. For RF and GBM, the strong association of larger grain sizes with increased bed load is clearly reflected in the SHAP values, with a trend that emphasizes the correlation between higher D90 values and transport rates.
3.3. Impact of Individual Variables on Model Performance
In this section, we continue to unravel the complexity of predictive modeling for river bed load estimation by examining the impact of sequential feature inclusion and exclusion on model performance. The results are systematically presented in Table 5 and visually interpreted through Figure 4 and Figure 5.
Table 5 summarizes the cumulative performance of the five ML models when additional variables are incrementally introduced into the training process. The models exhibit a consistent trend: when only Q is used, the R scores range from 0.508 to 0.554, underscoring the discharge's dominant role, as previously established by the permutation importance and SHAP analyses. Including S leads to a notable improvement across all the models, with increased R values indicating a significant contribution of slope to model accuracy. Adding W to the Q and S variables further enhances model performance, with R values rising above 0.890 in all the models. This increase highlights the relevance of W despite its lower permutation importance, suggesting a synergistic effect with the other variables, especially Q and S.

The trend continues with the inclusion of the mean grain size (D50), where the models reach R values between 0.918 and 0.927. The cumulative impact of sequential feature inclusion reaches a plateau with the addition of D90, D84, and D16, where the R values show marginal improvements or, in some cases, slight declines. This pattern indicates the diminishing returns of adding less influential variables and possibly the onset of model complexity that does not correspondingly enhance performance.
Figure 4 visually demonstrates these findings through a stacked bar representation, elucidating the incremental benefit of each added variable. The visualization emphasizes the importance of Q, S, and W as fundamental contributors to the models' predictive ability.
The line graph in Figure 5 illustrates a marked decline in the R score when Q is removed from the models, underscoring its importance. The RF model, which initially had a robust R score of 0.926 with all the variables included, witnessed a significant decrease to an R score of 0.882 without Q. This pattern is reflected across the GBM, CAT, ETR, and KNN models, although the extent of the impact varies, with KNN showing the most significant reduction. Similarly, the exclusion of the bed slope S or the flow width W yields a clear reduction in model performance, though to a lesser extent than for Q. The relatively smaller decline in R scores when excluding these variables, such as the decrease from 0.926 to 0.919 for RF when omitting S, indicates that although they are significant, the models may partially compensate for their absence by leveraging information from other correlated variables.
The grain size distribution variables D16, D50, D84, and D90 present an interesting pattern. When they are excluded individually, the slight variation in R scores, especially for D84 and D90, implies a lower reliance of the models on these variables for predicting bed loads. For instance, the RF model exhibits a negligible change in the R score when D84 is removed, confirming the marginal influence of this variable, as also suggested by the permutation feature importance and SHAP value analyses. In general, these observations reveal a hierarchy of variable importance, with Q at the top, followed by S and W, and finally the grain size distribution variables. The resilience of models such as RF and ETR to the elimination of individual variables highlights their potential utility in situations where data on specific predictors may be incomplete or challenging to collect.
3.4. Uncertainty Assessment in Predictions
This subsection explores the uncertainty in model predictions under varying data availability. As described in Section 2.3.3, the study performed 1000 Monte Carlo simulations for each sample rate SR_i. Table 6 and Figure 6 summarize the results of the Monte Carlo simulation-based sampling processes, illustrating the influence of the training sample size on the R scores of four primary ML models: RF, GBM, CAT, and ETR.
The numbers in Table 6 and Figure 6 show consistent model behavior: as the fraction of the training dataset used increases (from 0.3 to 1.0), the models exhibit a dual trend of rising average R scores and narrowing standard deviations. This highlights the direct correlation between the amount of training data and the models' prediction accuracy. The RF model, for instance, displays a commendable growth in the mean R score, starting from 0.9156 at a sample rate of 0.3 and reaching a peak of 0.9366 with the entire dataset. This upward trend in the mean R score is paralleled by a decrease in the standard deviation from 0.00535 to 0.0002, illustrating the robustness of the model's predictive accuracy.

Similarly, the CAT and ETR models exhibit comparable trends with slightly different magnitudes of variation in the mean R score and standard deviation. In particular, their mean R scores gradually increase (from 0.922 to about 0.940), while their standard deviations gradually decrease (from 0.0024 to just 0.0002), a clear indicator of enhanced performance with more data. The GBM model, though starting with a slightly higher standard deviation at a sample rate of 0.3 (0.0034), exhibits the most pronounced reduction in predictive variability, reaching 0.00003 at a sample rate of 1.0. This dramatic decrease in standard deviation, combined with a rise in the mean R score, emphasizes the enhanced robustness of the model when trained on a more extensive dataset.
Previous subsections discussed individual model performances and the importance of features. Therefore, the joint consideration of mean and variability adds another layer of understanding to model behavior in response to varying training data sizes. It provides a metric of model reliability—models with smaller standard deviations at equivalent mean R scores are considered more reliable, as their predictions are consistently closer to the observed values. In addition, ML models not only become more accurate but also demonstrate reduced variability in their estimations with adequate training data.
3.5. Comparative Analysis of Ensemble Techniques
This subsection provides an examination of the performance of ensemble methods in comparison to individual ML models. The ensemble techniques, WAE, SE, and VE, synthesize the strengths of the RF, GBM, CAT, and ETR models to enhance predictive accuracy and robustness.
Table 7 and Figure 7 summarize the main results and illustrate these comparisons.

Table 7 reveals that all three ensemble models achieve remarkably similar performance metrics, with RMSE and MAE values slightly better than those of the strongest individual models. For instance, while RF and GBM individually report an RMSE of 0.0453 (see Table 3), WAE and VE show a slightly improved RMSE of 0.0439. The ensemble models also yield an NSE metric comparable to the individual models, with SE slightly ahead at an NSE of 0.9249. Notably, the R scores for the ensemble models reached 0.9406, slightly higher than the R values of the individual models.
Figure 7 provides a visual corroboration of these findings. The scatter plots for WAE, SE, and VE all show a tight clustering of predictions around the observed data points, where R scores across the ensembles are uniformly high (0.940).
The analysis in this subsection emphasizes the value of ensemble techniques as an advanced methodological approach to refining prediction accuracy. By integrating predictions from four selected models, each of which has proven effective in its own capacity, the ensemble methods minimize individual model biases and leverage their collective predictive strength, thereby improving model reliability.
4. Conclusions
The study systematically analyzed the predictive capabilities of six distinct ML algorithms, delved into the intricate dynamics of feature importance, evaluated the impact of individual variables on model performance, assessed prediction uncertainty, and scrutinized the efficacy of advanced ensemble techniques. This investigation provides a comprehensive understanding of the capabilities of the models and the robust enhancements achieved through ensemble approaches.
Among the individual models, RF, CAT, and ETR delivered the strongest performance, establishing a high benchmark in terms of the RMSE, MAE, NSE, and R metrics. The feature importance analysis, enriched by permutation importance scores and SHAP values, revealed the dominant influence of river discharge (Q) on predictions, a finding that was corroborated across all the models and resonates with fundamental hydrological principles. Furthermore, the contributions of bed slope (S) and flow width (W) were consistently recognized, although with varying degrees of influence.
Sequential feature inclusion and exclusion analyses demonstrated the specific influence of each variable on model performance, providing valuable guidance on feature prioritization in situations constrained by data scarcity. Moreover, the uncertainty assessment through Monte Carlo simulation emphasized the important role of data volume in model training and confidence: larger training datasets correlated directly with improved predictive accuracy and reliability.
The ensemble techniques, WAE, SE, and VE, demonstrated their value by not just matching but slightly exceeding the performance of the most accurate individual models. This small improvement in RMSE and R scores is especially notable given the already high performance of the individual models, suggesting that integrating predictions from multiple models can provide a measurable advantage.
This study has comprehensively examined the capabilities and limitations of individual and ensemble models in river bed load estimation. The ensemble methods, in particular, have shown that they can offer a valuable pathway to improve predictive accuracy and reduce uncertainty. The research has contributed to advancing ML applications in hydrology and highlighted the importance of strategic data collection and model selection.