Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin

Elhallaoui Oueldkaddour, Fatima Zehrae; Wariaghli, Fatima; Brirhet, Hassane; Yahyaoui, Ahmed; Jaziri, Hassane

doi:10.3390/limnolrev25010006

Open AccessArticle

Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin

by

Fatima Zehrae Elhallaoui Oueldkaddour

^1,*

,

Fatima Wariaghli

¹

,

Hassane Brirhet

²,

Ahmed Yahyaoui

¹

and

Hassane Jaziri

¹

Laboratory of Biodiversity, Ecology and Genome, Faculty of Sciences, University Mohammed V in Rabat, Rabat 10000, Morocco

²

Department of Water Research and Planning, Ministry of Equipment, Transport, Logistics and Water, Rabat 10000, Morocco

^*

Author to whom correspondence should be addressed.

Limnol. Rev. 2025, 25(1), 6; https://doi.org/10.3390/limnolrev25010006

Submission received: 23 January 2025 / Revised: 25 February 2025 / Accepted: 3 March 2025 / Published: 5 March 2025

Download

Browse Figures

Versions Notes

Abstract

Morocco is geographically located between two distinct climatic zones: temperate in the north and tropical in the south. This situation is the reason for the temporal and spatial variability of the Moroccan climate. In recent years, the increasing scarcity of water resources, exacerbated by climate change, has underscored the critical role of dams as essential water reservoirs. These dams serve multiple purposes, including flood management, hydropower generation, irrigation, and drinking water supply. Accurate estimation of reservoir flow rates is vital for effective water resource management, particularly in the context of climate variability. The prediction of monthly runoff time series is a key component of water resources planning and development projects. In this study, we employ Machine Learning (ML) techniques—specifically, Random Forest (RF), Support Vector Regression (SVR), and XGBoost—to predict monthly river flows in the Bouregreg basin, using data collected from the Sidi Mohamed Ben Abdellah (SMBA) Dam between 2010 and 2020. The primary objective of this paper is to comparatively evaluate the applicability of these three ML models for flow forecasting in the Bouregreg River. The models’ performance was assessed using three key criteria: the correlation coefficient (R²), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). The results demonstrate that the SVR model outperformed the RF and XGBoost models, achieving high accuracy in flow prediction. These findings are highly encouraging and highlight the potential of machine learning approaches for hydrological forecasting in semi-arid regions. Notably, the models used in this study are less data-intensive compared to traditional methods, addressing a significant challenge in hydrological modeling. This research opens new avenues for the application of ML techniques in water resource management and suggests that these methods could be generalized to other basins in Morocco, promoting efficient, effective, and integrated water resource management strategies.

Keywords:

flow forecasting; time series; machine learning; Random Forest model; SVR model; XGBoost model; Bouregreg River

1. Introduction

Predicting inflows to dams is of paramount importance in water resource planning to ensure a reliable and adequate water supply. In recent decades, the combined effects of climate change and human activities have significantly altered the frequency and intensity of extreme weather events, a trend that is expected to persist in the future [1,2,3]. Hydrological models are essential tools for examining the impacts of climate change and anthropogenic activities on hydrological processes. Over the past few decades, hydrologists have increasingly focused on modeling the water cycle and forecasting its behavior. However, traditional hydrological models often rely on the relationship between precipitation and flow, employing a range of empirical, conceptual, and physical equations.

With the advent of machine learning (ML), there has been a growing need to explore its application in hydrological modeling and assess its reliability and validity, particularly in watersheds with complex climatic characteristics, such as the semi-arid basin under study. Daily flow prediction is critical for effective reservoir management and optimizing reservoir performance, while monthly flow prediction is essential for long-term reservoir planning and water allocation [4]. The selection of an appropriate flow forecasting approach depends on various factors, including the purpose of the forecast, which determines the choice of model and its level of complexity. Consequently, such decisions should be informed by historical time series data [5].

Despite the increasing interest in applying machine learning to hydrology, significant research gaps remain. While much of the existing research has focused on daily flow estimation, which is crucial for short-term reservoir management, monthly flow forecasting—essential for long-term water resource planning and allocation—has received less attention. Furthermore, the application of machine learning models in semi-arid regions, where water scarcity and climate variability pose significant challenges, remains underexplored and warrants further investigation.

This study aimed to address these gaps by focusing on monthly flow prediction in the Bouregreg watershed, a semi-arid basin in Morocco, using advanced machine learning models. Three models were selected for this research: Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost). These models were chosen based on a comprehensive literature review, which highlighted their efficiency, ease of application, and ability to deliver robust results.

Previous studies have demonstrated that SVR, in particular, outperforms RF and XGBoost in capturing non-linear hydrological relationships through structural risk minimization [6,7]. Moreover, a substantial body of research suggests that RF, SVR, and XGBoost are often preferred over other machine learning models due to their proven effectiveness in various hydrological and water resource applications [7,8]. Consequently, these models were employed in this study to simulate monthly hydrological behavior in the Bouregreg watershed at the Sidi Mohamed Ben Abdellah (SMBA) Dam between 2010 and 2020.

The primary objective of this work is to reproduce past hydrological events using these models to improve the management of the SMBA Dam reservoir. During the validation phase, the models demonstrated satisfactory performance, accurately replicating the monthly hydrological behavior of water inflows over the study period. Notably, the SVR model outperformed both RF and XGBoost, as evidenced by higher R² correlation coefficients and lower AIC and BIC values.

This study contributes to the growing body of knowledge on the application of machine learning in hydrology and provides valuable insights for water resource managers tasked with ensuring sustainable water supplies under challenging climatic conditions.

2. The Study Area

2.1. Characteristics of the Bouregreg Basin

The Bouregreg Basin features an elliptical watershed, with its long axis oriented northwest to southeast, covering an area of approximately 9600 km². The watershed is characterized by hilly terrain and impermeable plateaus, with a relatively dense vegetation cover. Geographically, it is situated in northwestern Morocco, bounded to the north by the Sebou watershed, to the south by the Oum Rbia watershed, and to the southwest by the Cherrat River, Nfifikh, and Mellah basins. The basin extends westward, where it discharges into the Atlantic Ocean (Figure 1).

2.2. Presentation of the Sidi Mohammed Ben Abdallah Dam (SMBA)

The Sidi Mohammed Ben Abdellah (SMBA) Dam, commissioned in 1974, is located downstream of the confluence between the Bouregreg and Grou rivers (Figure 2). Following elevation work completed in 2006, the dam’s storage capacity was increased to 1.025 billion cubic meters (m³). The dam is constructed at a narrow section of the Bouregreg River, where it intersects a primary geological massif composed of alternating schistose quartzite and quartz schist formations. The SMBA Dam plays a critical role in enhancing the drinking water supply for the cities of Rabat, Salé, and Casablanca, underscoring its regional importance. However, water inflows to the dam are highly irregular and exhibit significant variability from year to year. The average annual inflow recorded at the SMBA Dam station is approximately 647.4 million cubic meters (Mm³) [9].

3. Methodology

3.1. Hydrographic Network

The Bouregreg Basin spans an area of approximately 9600 km² and extends over a length of 240 km. The basin originates in the Middle Atlas massif at an elevation of 1627 m on Mount Mtourzgane and flows westward, eventually discharging into the Atlantic Ocean near Rabat-Salé [11]. Since 1971, the basin has recorded an average flow rate of 23 m³/s, though this can exceed 1500 m³/s during flood events. The average annual runoff volume is approximately 660 million m³ in a typical year (Table 1).

Near its mouth, the Bouregreg Basin is regulated by the Sidi Mohamed Ben Abdellah Dam. The hydrographic network of the basin is primarily formed by two major rivers: the Bouregreg River, which drains an area of 4000 km², and the Grou River, which drains 3600 km² [13]. Additionally, the Grou River is fed by two significant tributaries: the Korifla River (1900 km²) and the Akrech River (150 km²).

The Bouregreg watershed is equipped with seven main gauging stations, four of which are located along the Bouregreg River and its tributaries. The measurement data from these stations, which include mean annual flow, mean monthly flow, mean annual rainfall, and mean monthly rainfall, form the basis of this study (Figure 3). The available measurement series vary in duration, ranging from 19 years at the Tsalat station to 47 years at the Sidi Mohammed Ben Abdellah Dam (SMBA) station [14].

3.2. Hydro Meteorological Data

The hydro-climatological data utilized in this study consist of precipitation and flow time series [15]. These datasets, covering the period from 2010 to 2020, were collected through the measurement networks operated by the Bouregreg-Chaouia Hydrological Basin Agency (ABHBC). The Bouregreg watershed experiences a semi-arid Mediterranean climate, with an average annual precipitation of approximately 440 mm. Rainfall distribution across the basin exhibits a gradual decrease with latitude, following a gentle gradient. Additionally, altitude plays a significant role in creating spatial variability in precipitation. The northeastern regions of the watershed receive the highest rainfall, with annual totals reaching up to 671 mm, while the southwestern areas receive less than 350 mm annually [16]. Precipitation is closely linked to the basin’s water supply. While it has minimal influence during the dry season, it becomes a critical factor during flood events. The Bouregreg River records an average annual flow of 6.1 m³/s, largely due to the higher rainfall received in the northern portion of the watershed [17].

Figure 4 illustrates the variability of precipitation recorded at the Sidi Mohamed Ben Abdellah (SMBA) Dam station from September 2010 to August 2021. The graph reveals distinct fluctuations over the decade, with notable peaks indicating periods of exceptionally high precipitation, particularly in certain years. The overall trend highlights a seasonal pattern, with higher precipitation occurring in specific months. Analysis of these data provides insights into the complex climatic conditions of the study basin, particularly the influence of seasonal factors.

Similarly, Figure 5 depicts the variation in flow rates at the SMBA Dam station between 2010 and 2020. The graph shows irregular peaks, particularly in 2012, 2013, and 2017, interspersed with extended periods of low flow. These variations underscore the direct impact of rainfall on river flow and emphasize the importance of effective water resource management to mitigate the risks of both flooding and water shortages.

3.3. Machine Learning Models

Machine learning has become increasingly significant across various scientific disciplines. Fundamentally, it involves predicting outcomes based on a set of input features or characteristics [17,18]. This study employed machine learning models (Figure 6) to predict the monthly variation of water flow in the Sidi Mohamed Ben Abdellah (SMBA) Dam for the upcoming season, utilizing historical monthly climatic variables as predictors.

In this research, three machine learning methods—Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost)—were applied to develop a real-time prediction model for water supply in the SMBA Dam. The performance of these three methods is compared to assess their effectiveness in predicting water supply. A brief overview of each method is provided below.

3.4. Presentation of the Random Forest Model

The Random Forest (RF) model, first introduced by [19], is an ensemble learning method that builds upon decision trees, a widely used technique for both classification and regression tasks. Figure 7 illustrates the architecture of a Random Forest. The design of this model involves two key stages. The RF method constructs multiple regression decision trees during training and defines a regression variable as a linear combination of the variables from individual trees. For each tree in the forest, a randomly selected subset of data attributes is used. Additionally, during the construction of decision trees, the RF algorithm performs random sampling of the training data and considers a subset of all features for splitting each node in each tree [20].

The RF process begins by creating multiple individual trees based on the decision tree process. Each tree is generated by randomly selecting datasets from the training dataset, a technique known as bagging or bootstrap aggregation [21]. To evaluate the prediction error rate, each instance uses approximately two-thirds of the data for constructing the decision tree, while the remaining one-third is designated as Out-of-Bag (OOB) instances [22]. In the next step, predictions are made using a voting method for classification problems or by averaging the predictions for regression problems, based on the predictive performance of each tree [23].

Figure 7. Random Forest Model [23].

The RF model is an additive model that combines multiple base models (decision trees) to produce predictions. Mathematically, the final model is represented as the sum of individual base models, as shown in Equation (1):

g (x) = f_{0} (x) + f_{1} (x) + f_{2} (x) + f_{3} (x) + \dots f_{r} (x)

(1)

This approach, known as model assembly, enhances predictive accuracy by aggregating the results of multiple models.

Typically, the dataset is divided into two parts: a training set and a test set (Figure 6). The training set is used to fit the model, while the test set is used to evaluate its predictive performance. Although a significant portion of the data is randomly assigned to the training set at the outset, ongoing research focuses on optimizing data partitioning for linear regression analyses [24]. Maintaining a separate validation set is essential for evaluating and comparing the performance of different models without relying solely on the test set.

Prediction Method of the RF

The general prediction method of the RF model involves the following steps:

Training Data Sampling: Training samples (n × sample) are randomly selected n times to create a training set, with samples being replaced after each selection. This process is repeated multiple times to generate training sets: D₁, D₂, …, D_r.

Attribute Selection: Each training set consists of k attributes randomly chosen from the attribute set (m × attribute), where k = log₂m. Decision trees (CART trees) are then created: f₁(x), f₂(x), …, f_r(x).

Final Prediction: The final prediction value of the Random Forest is calculated using the mean method, as shown in Equation (2):

g (x) = \frac{1}{r} \sum_{i}^{r} = 1 f i (x)

(2)

In this study, the RF package in R was used to predict water flow in the SMBA Dam. Previous research has demonstrated that the RF model, which utilizes clusters of decision trees, is a powerful tool in hydrology. Its robustness, ability to handle complex datasets, and capacity to model nonlinear relationships make it particularly suitable for prediction and classification tasks.

3.5. Presentation of the SVR Model

Support Vector Machines (SVM) are widely used in machine learning for solving classification problems. However, they can also be adapted for regression tasks through Support Vector Regression (SVR) [25]. While SVR shares the same foundational principles as SVM, it is specifically designed to predict continuous outputs rather than classify data points [26].

The SVR model is used to predict continuous outcomes. Unlike traditional regression models, SVR leverages the principles of SVM to map input data into high-dimensional spaces and identify an optimal hyperplane that best represents the data [27]. This approach allows SVR to efficiently handle both linear and non-linear relationships, making it a versatile tool in various fields, including hydrological forecasting.

The SVR algorithm is employed for regression analysis. In machine learning, the SVR model aims to identify a function that captures the relationship between input variables and a continuous target variable while minimizing prediction error [28]. In the SVR model (Figure 8), the two red lines represent the decision boundaries, and the green line represents the hyperplane. The objective of SVR is to consider data points within the decision boundaries while maximizing the number of points closest to the hyperplane, which is the optimal line for prediction [29].

In this paper, we used the e1071 package in R to create a SVR model to predict the water flow in the SMBA dam.

To train SVR models effectively, it is necessary to optimize key parameters. Two critical parameters must be tuned: the kernel parameter (sigma) and the regularization constant (C). The kernel parameter is typically set to automatic mode by default, while the constant C controls the trade-off between model complexity and error tolerance. By default, C is set to 1. A high value of C imposes a greater penalty for errors, leading to a more complex model with reduced generalization capability. Conversely, a low value of C results in a lower penalty for errors, allowing for a simpler model with better generalization [30].

Allowing a higher tolerance for errors can reduce model complexity and improve generalization. However, if C is set to a high value, the model becomes more complex, as it minimizes errors at the cost of reduced flexibility. Therefore, it is essential to experiment with different values of C to determine their impact on model performance. Additionally, the parameter epsilon (ε) defines the width of the margin around the hyperplane, influencing the model’s sensitivity to errors [30].

The primary goal of SVR is to find a function f(x) that approximates the relationship between the input variables x and the target variable y, while remaining as flat as possible. The function f(x) (3) is defined as:

f (x) = 〈 w, x 〉 + b

(3)

where:

-: w is the weight vector,
-: x is the input feature vector,
-: b is the bias term

The Support Vector Regression (SVR) model employs an ϵ-insensitive loss function, which disregards errors smaller than a specified threshold ϵ. The loss function is defined as follows:

L_{ϵ} (y, f (x)) = \{\begin{matrix} 0 i f | y - f (x) | \leq ϵ \\ | y - f (x) | - ϵ o t h e r w i s e \end{matrix}

(4)

This means that the model does not penalize predictions that are within ϵ distance from the true value. The SVR optimization problem (5) can be formulated as:

M i n i m i z e \frac{1}{2} {‖ w ‖}^{2} + C \sum_{i = 1}^{n} (ξ_{i} + ξ_{i}^{*})

(5)

subject to the constraints (6) [31]:

\{\begin{matrix} y_{i} - 〈 w, x_{i} 〉 - b \leq ϵ + ξ_{i} \\ 〈 w, x_{i} 〉 + b - y_{i} \leq ϵ + ξ_{i}^{*} \\ ξ_{i}, ξ_{i}^{*} \geq 0 \end{matrix}

(6)

where:

${‖w‖}^{2}$ is the norm of the weight vector (to maximize the margin),
C is a hyperparameter that controls the trade-off between model simplicity (wide margin) and tolerance for errors,
$ξ_{i}$ and $ξ_{i}^{*}$ are slack variables that allow errors exceeding the ϵ threshold.
To solve this optimization problem, the method of Lagrange multipliers is used. The dual problem is formulated (7) as:

M a x i m i z e - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} (α_{i} - α_{i}^{*}) (α_{j} - α_{j}^{*}) ⟨x_{i}, x_{i}⟩ - ϵ \sum_{i = 1}^{n} (α_{i} + α_{i}^{*}) + \sum_{i = 1}^{n} y_{i} (α_{i} - α_{i}^{*})

(7)

\{\begin{matrix} \sum_{i = 1}^{n} (α_{i} - α_{i}^{*}) = 0 \\ 0 \leq α_{i}, α_{i}^{*} \leq C \end{matrix}

(8)

where:

α_{i}

and

α_{i}^{*}

are the Lagrange multipliers associated with the constraints.

Once the dual problem is solved, the decision function (9) (or prediction function) is given by:

f (x) = \sum_{i = 1}^{n} (α_{i} - α_{i}^{*}) 〈 x_{i}, x 〉 + b

(9)

The data points for which

α_{i}

−

α_{i}^{*} \neq

0 are referred to as support vectors. These points are critical as they influence the decision function. To handle nonlinear problems, a kernel function K(x_i, x_j) is used to transform the data into a higher-dimensional space where it becomes linearly separable. The decision function [6] then becomes (10):

f (x) = \sum_{i = 1}^{n} (α_{i} - α_{i}^{*}) K (x_{i}, x) + b

(10)

Commonly used kernels (11) include:

Linear kernel : K (x_{i}, x) = 〈 x_{i}, x 〉

(11)

3.6. Presentation of the XGBoost Model

In recent years, a novel machine learning paradigm known as Extreme Gradient Boosting (XGBoost) has gained prominence in addressing challenges related to water resources. XGBoost represents an advanced evolution of decision trees, built upon an optimized gradient boosting framework. Its performance has been demonstrated to rival, and in some cases surpass, that of traditional models in various domains, including hydrology and hydroclimatology [32,33,34]. The foundation of the XGBoost algorithm lies in the decision tree, a widely used supervised learning algorithm for both classification and regression tasks [35]. A decision tree splits data into smaller subsets based on thresholds of input features, forming a hierarchical, tree-like structure. As illustrated in Figure 9, this structure consists of:

Root Node: Represents the initial threshold for data division.

Internal Nodes (Branches): Represent intermediate splits based on feature thresholds.

Leaf Nodes: Represent the final outputs of the model [36].

Figure 9. XGBoost flowchart [37].

For regression tasks, the mean value of the instances assigned to a leaf node typically serves as the final output of that node [37].

XGBoost is a machine learning algorithm based on the boosting method, which combines multiple decision trees to create a robust and accurate model [38]. It is particularly well-suited for regression tasks, including water flow forecasting. Below are the key equations and concepts of XGBoost as applied to water flow forecasting:

The XGBoost model is a sum of k decision trees:

\hat{y} i = \sum_{k = 1}^{k} f_{k} (x_{i}), f_{k} \in F

(12)

where:

\hat{y} i

is the model’s prediction for sample i,

f_{k}

is a decision tree,

F is the space of decision trees,

And x_i is the feature vector for sample i.

The objective function (Obj) of XGBoost (13) consists of two components: a loss function (L) and a regularization term (Ω) designed to prevent overfitting [39]. The objective function is defined as:

X_{o b j} = \sum_{i = 1}^{n} l (y, \hat{y}) + \sum_{k = 1}^{K} Ω {(f}_{k})

(13)

where

\sum_{i = 1}^{n} l (y, \hat{y})

is used to describe the difference between the predicted and actual model values and

\sum_{k = 1}^{K} Ω {(f}_{k})

is the regularization term of the objective function [38].

Ω (f_{k}) = γ T + λ \frac{1}{2} ω_{j}^{2}

(14)

where T represent the number of leaf nodes in the tree, γ is the coefficient of the penalty term that controls the complexity of the tree, ω is the fraction of leaf nodes, and λ is a regularization parameter used to prevent the scores of the leaf nodes from becoming excessively large [40].

3.7. Evaluation Parameters of the Models

3.7.1. SVR Parameters

In Support Vector Regression (SVR), three key parameters must be optimized: the tolerance threshold (ε), the regularization parameter (C), and the radial-based kernel function parameter (σ). The tolerance threshold (ε) determines the precision of the model by specifying the margin within which errors are ignored. The regularization parameter (C) controls the trade-off between model complexity and error tolerance, while the kernel function parameter (σ) defines the influence of individual data points in the kernel function. In this study, the optimal values for these parameters were determined through a trial-and-error process using a grid search technique applied to the training and calibration datasets. The best parameter sets were selected to maximize the value of the objective function calculated on the calibration dataset [41]. The evaluated parameter ranges included ε values from 0.01 to 0.5 (in increments of 0.01), C values from 1 to 5 (in increments of 0.1), and σ values from 0.01 to 1 (in increments of 0.01).

3.7.2. RF Parameters

For a Random Forest (RF) model, two key parameters must be defined: the number of variables or features used in the construction of each tree (mtry) and the number of trees in the forest (ntree). In some cases, setting mtry = 1 can yield reasonable performance [42]. For classification tasks, it is generally recommended to set mtry to one-third of the total number of input variables, while for regression tasks, the square root of the total number of input variables is typically used [19,43]. In this study, various values of the mtry parameter were tested, ranging from 1 to twice the number of input variables. For the ntree parameter, it was observed that the generalization error tends to increase beyond the optimal value, with 500 being the commonly used default [42,44]. To identify the best-performing models, different values of the ntree parameter were evaluated.

3.7.3. XGBoost Parameters

XGBoost parameters play a crucial role in the model’s configuration and performance [45,46].

-: eta (Learning Rate): the learning rate. It controls the contribution of each tree to the final model by scaling the leaf weights of each tree.

A low eta (e.g., 0.01 or 0.1) makes the learning process slower but allows the model to converge to a more accurate solution.

A high eta (e.g., 0.3 or 0.5) speeds up learning but may lead to overfitting or premature convergence.

-: subsample: controls the proportion of samples (observations) used to train each tree. If subsample = 0.8, 80% of the training data is randomly selected for each tree. This parameter introduces diversity among trees and helps reduce overfitting.
-: colsample_bytree: Controls the proportion of features used to build each tree. This parameter helps reduce overfitting and improves the model’s robustness. colsample_bytree = 0.8, 80% of the features are randomly selected for each tree.
-: Max_depth: controls the maximum depth of each tree in the model. Default value: 6.

The parameters employed by the machine learning models to simulate the inflow at the SMBA Dam in this paper are presented in Table 2.

3.8. Performance Indices of the Models

To evaluate the performance of the RF, the SVR and XGBoost models in forecasting monthly discharge, several statistical indices were employed. The models’ predictive capabilities were assessed using the correlation coefficient (R²), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). These metrics are defined as follows:

3.8.1. The Coefficient of Determination (R²)

In this study, several measures were employed to assess the accuracy of the models. One key indicator is the coefficient of determination (R²), which quantifies the proportion of variance in the dependent variable that is predictable from the independent variables. The R² value ranges from 0 to 1, with higher values indicating lower error variance and better model performance. Generally, an R² value greater than 0.5 is considered satisfactory, reflecting a reasonable level of predictive accuracy.

3.8.2. Akaike’s Information Criteria (AIC)

The AIC is one of the most widely used information criteria for model evaluation. AIC aims to select the model that minimizes the risk of negative likelihood, which is influenced by the number of parameters in the model [47]. This criterion is mathematically represented by Equation (15):

A I C = - 2 l o g p (L) + 2 p

(15)

In the context of the AIC, (L) represents the likelihood of the fitted model, while p denotes the number of parameters in the model. The primary objective of AIC is to identify the most accurate model that approximates the process generating the observed data. Its applications are grounded in the principles of balancing model fit and complexity, as discussed in the works of Bozdogan [48,49].

3.8.3. Bayesian Information Criteria (BIC)

The BIC is used in a Bayesian framework to estimate the Bayes factor for comparing two competing models [50]. The BIC is defined by Equation (16):

B I C = 2 \log p (L) + p \log p (n)

(16)

At its core, the BIC differs from the Nguyen [51] AIC primarily in its second term, which incorporates the sample size (n). Models that minimize the BIC are selected as the most suitable. From a Bayesian perspective, the BIC is designed to identify the most probable model given the available data, emphasizing a balance between model fit and complexity.

4. Results and Discussion

This section compares the performance of RF, SVR, and XGBoost models in predicting inflows to the Bouregreg catchment, regulated by the Sidi Mohamed Ben Abdellah (SMBA) Dam.

4.1. Stream-Flow Prediction and Validation with Random Forest Model

The comparison between observed and simulated flow rates, as illustrated in Figure 10, demonstrates that the Random Forest (RF) model accurately replicated the observed flow rates, with the exception of slight underestimation during peak flow periods. These discrepancies can be attributed to the challenges posed by the highly complex semi-arid environment and the limited sample size available for analysis. The simulation results, as depicted in the graphs of observed and simulated flows, indicate good performance.

4.2. Stream-Flow Prediction and Validation with SVR Model

The graphs of observed and simulated hydrological flows (Figure 11) demonstrate that the SVR model achieved good results, accurately reproducing the observed flow patterns. However, the model struggled to replicate certain extreme flow events, where the simulated flow rates were underestimated compared to the observed values.

4.3. Stream-Flow Prediction and Validation with XGBoost Model

The monthly comparison of observed and predicted data from 2010 to 2020 with the XGBoost model (Figure 12) showed a more stable and consistent match between observed and simulated flows, with a good ability to align overall trends. Data analysis shows that the model better captured peaks (peak flows) and minimum river flows (low flows), although some values were underestimated or overestimated.

For the study period from January 2010 to December 2020, projections one month ahead of time were created to assess the model’s performance. Table 3, Table 4 and Table 5 present a comparison based on performance indices of the Random Forest, SVR, and XGBoost models. The hydrograph in Figure 10, Figure 11 and Figure 12 show the differences between the initial streamflow projected by the RF model, the SVR model and the XGBoost model.

Based on the results (Table 3, Table 4 and Table 5), several conclusions can be drawn regarding the suitability of these models for predicting water reserves in the semi-arid context of the SMBA Dam. Overall, the results are highly satisfactory, with the forecast graphs closely matching the observed flow patterns. While the performance of the three models is somewhat similar, the SVR model outperforms the RF and XGBoost models, as evidenced by its higher R² value (0.6539 compared to 0.5325 and 0.5319) and significantly lower AIC and BIC values (AIC = 191.7364 and BIC = 194.0925). These metrics indicate a better quality of fit and lower model complexity for the SVR model. However, the XGBoost model offers a wider range of simulation values, which may be advantageous for studying extreme hydrological events.

In conclusion, the SVR model is recommended for achieving higher overall accuracy and fit, while further validation and exploration of the RF and XGBoost models behavior could provide additional insights for specific applications.

5. Conclusions

This study evaluated the effectiveness of three machine learning models—Random Forest (RF), Support Vector Regression (SVR), and XGBoost—in predicting the monthly flow of the Bouregreg River in a semi-arid region, using data from the Sidi Mohamed Ben Abdellah Dam (SMBA) over the period from 2010 to 2020. The performance of the models was compared using statistical criteria such as the correlation coefficient (R²), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).

The results highlight the strong performance of the models, particularly the SVR model, which achieved the highest R² value (0.6539), demonstrating its good ability to reproduce observed flow patterns. Additionally, the SVR model exhibited lower AIC and BIC values, indicating better adaptability to the data and reduced model complexity.

While the RF and XGBoost models showed slightly lower overall accuracy, they demonstrated greater flexibility in simulating a wide range of flows, especially predict peak flows for the XGBoost model, making it particularly useful for analyzing extreme hydrological events. A significant advantage of the employed models is their ability to perform well without requiring extensive datasets, which is especially beneficial in regions like semi-arid zones where hydrological records are often limited. This contrasts with conventional models, which typically demand long and detailed datasets that are frequently unavailable in such areas.

The use of machine learning models such as SVR, RF, and XGBoost opens new opportunities for water resource management in Morocco. Accurate monthly flow forecasts can significantly improve dam operations and water allocation strategies. The promising results of this study encourage the application of this approach to other watersheds across the country, enhancing the capacity to manage water resources under the challenges of climate change and increasing water variability. To further improve model performance, future research could incorporate additional explanatory variables, such as evapotranspiration or climatic indices. Additionally, extending the approach to longer time periods or seasonal forecasts could better meet the needs of decision-makers in water resource planning.

Author Contributions

Conceptualization, F.Z.E.O.; Data curation, F.Z.E.O.; Formal analysis, F.Z.E.O.; Investigation, F.Z.E.O.; Methodology, F.Z.E.O.; Resources, F.Z.E.O.; Software, F.Z.E.O.; Validation, F.Z.E.O., F.W., H.B. and H.J.; Visualization, F.W., H.B. and A.Y.; Writing—original draft, F.Z.E.O.; Writing—review and editing, F.Z.E.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Due to ethical considerations and the privacy rules imposed by the agencies and research partners (HDD) [9] and (ABHBC) [12]., the data associated with this study cannot be shared publicly. These data are subject to restrictions.

Acknowledgments

We would like to thank the (HDD) [9] and (ABHBC) [12], for providing data on water flow in the Bouregreg basin.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Demirel, M.C.; Booij, M.J.; Hoekstra, A.Y. Impacts of climate change on the seasonality of low flows in 134 catchments in the river Rhine basin using an ensemble of bias-corrected regional climate simulations. Hydrol. Earth Syst. Sci. 2013, 17, 4241–4257. [Google Scholar] [CrossRef]
Halgamuge, M.N.; Nirmalathas, A. Analysis of large flood events: Based on flood data during 1985–2016 in Australia and India. Int. J. Disaster Risk Reduct. 2017, 24, 1–11. [Google Scholar] [CrossRef]
Sang, Y.F.; Wang, Z.; Changming, L. Variabilité spatiale et temporelle des extrêmes de précipitations dans le bassin de la rivière Haihe. Chine Hydrol. Process. 2014, 28, 926–932. [Google Scholar] [CrossRef]
Anaraki, M.V. Modeling of monthly rainfall–runoff using various machine learning techniques in wadi ouahrane basin, Algeria. Water 2023, 15, 3576. [Google Scholar] [CrossRef]
Salas, J.D.; Tabios, G.Q.; Bartolini, P. Approaches to multivariate modeling of water resources time series. J. Am. Water Resour. Assoc. 1985, 21, 683–708. [Google Scholar] [CrossRef]
Maity, R.; Bhagwat, P.P.; Bhatnagar, A. Potential of support vector regression for prediction of monthly streamflow using endogenous property. Hydrol. Process. 2010, 24, 917–923. [Google Scholar] [CrossRef]
Wang, W.-C.; Chau, K.W.; Cheng, C.T.; Qiu, L. A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series. J. Hydrol. 2009, 374, 294–306. [Google Scholar] [CrossRef]
Aqil, M.; Kita, I.; Yano, A.; Nishiyama, S. A comparative study of artificial neural networks and neuro-fuzzy in continuous modeling of the daily and hourly behaviour of runoff. J. Hydrol. 2007, 33, 22–34. [Google Scholar] [CrossRef]
HDD (The Hydraulic Development Department). Large dams in Morocco; Edited by the Hydraulic Administration; Rabat, Morocco, 2006. [Google Scholar]
Rahdou, M. Hydrological Modelling of the Bouregreg Watershed (SS) and Evaluation of the Siltation Rate of the Sidi Mohamed Ben Abdellah Dam (Morocco) Using the Bathymetric Method. Ph.D. Thesis, Faculty of Sciences and Techniques, Fez, Morocco, 2010. [Google Scholar]
Beaudet, G.; Ruellan, A. The Moroccan Quaternary: State of the Art: Author Files and Bibliographical Annotations Grouped Together. J. Phys. Geogr. Dyn. Geol. 1969. Available online: https://www.persee.fr/doc/medit_0025-8296_1989_num_68_2_2614 (accessed on 2 March 2025).
Bouregreg-Chaouia Hydraulic Basin Agency (ABHBC). Study on the Optimization of Surface Water Management; Bouregreg-Chaouia Hydraulic Basin Agency: Rabat, Morocco, 2012. [Google Scholar]
Lahlou, A. Updated study of dam silting in Morocco. Water Sci. 1986, 6, 337–356. [Google Scholar]
Bounouira, H. Study of the Chemical and Geochemical Qualities of the Bouregreg Watershed. Ph.D. Thesis, Pierre and Marie Curie University—Paris VI, Paris, France, 2007. Available online: https://tel.archives-ouvertes.fr/tel-00726475 (accessed on 2 March 2025).
Fröhlich, K.; Dobrynine, M.; Isensee, C. The German weather forecasting system: GCFS. J. Adv. Modèle. Système Terre 2021, 13, e2020MS002101. [Google Scholar] [CrossRef]
Oueldkaddour, F.-Z.E.; Wariaghli, F.; Brirhet, H.; Yahyaoui, A. Hydrological Modeling of Rainfall-Runoff of the Semi-Arid Aguibat Ezziar Watershed Through the GR4J Model. Limnol. Rev. 2021, 21, 119–126. [Google Scholar] [CrossRef]
Franklin, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2005. [Google Scholar] [CrossRef]
Lin, J.-Y.; Cheng, C.-T.; Chau, K.W. Using support vector machines for longterm discharge prediction. Hydrol. Sci. J. 2006, 51, 599–612. [Google Scholar] [CrossRef]
Leo, B. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Hussain, D.; Khan, A.A. Machine learning techniques for monthly river flow forecasting of Hunza River. Pak. Earth Sci. Inform. 2020, 13, 939–949. [Google Scholar] [CrossRef]
Leo, B. Bagging Predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
Rodriguez-Galiano, V.F. Modelling interannual variation in the spring and autumn land surface phenology of the European forest. Biogeosciences 2016, 13, 3305–3317. [Google Scholar] [CrossRef]
Pakorn, D. Comparative study of machine learning methods and GR2M model for monthly runoff. Ain Shams Eng. J. 2022, 14, 101941. [Google Scholar] [CrossRef]
Joseph, V.R. Optimal ratio for data splitting. Stat. Anal. Data Min. ASA Data Sci. J. 2022, 15, 531–538. [Google Scholar] [CrossRef]
Dibike, Y.B.; Velickov, S.; Solomatine, D.; Abbott, M.B. Model induction with support vector machines: Introduction and applications. J. Comput. Civ. Eng. 2001. [Google Scholar] [CrossRef]
Dastorani, M.T.; Javad, M.; Talbi, A.; Fakhar, F. Application of Machine Learning Approaches in Rainfall-Runoff Modeling. (Case Study: Zayandeh_Rood Basin in Iran). Civ. Eng. Infrastruct. J. 2018, 51, 293–310. [Google Scholar] [CrossRef]
Wu, C.L.; Chau, K.W. Rainfall-runoff modeling using artificial neural network coupled with singular spectrum analysis. J. Hydrol. 2011, 399, 394–409. [Google Scholar] [CrossRef]
Adamu, M.; Umar, I.K.; Haruna, S.I.; Ibrahim, Y.E. A soft computing technique for predicting flexural strength of concrete containing nano-silica and calcium carbide residue. Case Stud. Constr. Mater. 2022, 17, e01288. [Google Scholar] [CrossRef]
Botsis, D.; Latinopoulos, P.; Diamantaras, K. Rainfall-runoff modeling using support vector regression and artificial neural networks. In Proceedings of the 12th International Conference (CEST2011), Rhodes, Greece, 8–10 September 2011. [Google Scholar]
Santos, L.L.; Francisco, S.C. Introduction to Spatial Network Forecasting with R. 2019. Available online: https://laurentlsantos.github.io/forecasting/support-vector-regression.html (accessed on 2 March 2025).
Yoon, H.; Jun, S.C.; Hyun, Y.; Bae, G.O.; Lee, K.K. A comparative study of artificial neural networks and support vector machines for predicting groundwater levels in a coastal aquifer. J. Hydrol. 2011, 396, 128–138. [Google Scholar] [CrossRef]
Wu, L.; Zhou, H.; Ma, X.; Fan, J.; Zhang, F. Daily reference evapotranspiration prediction based on hybridized extreme learning machine model with bio-inspired optimization algorithms: Application in contrasting climates of China. J. Hydrol. 2019, 577, 123960. [Google Scholar] [CrossRef]
Wu, L.; Huang, G.; Fan, J.; Ma, X.; Zhou, H.; Zeng, W. Hybrid extreme learning machine with meta-heuristic algorithms for monthly pan evaporation prediction. Comput. Electron. Agric. 2020, 168, 105115. [Google Scholar] [CrossRef]
Niazkar, M.; Pirée, R.; Türkkan, G.E. Drought analysis using innovative trend analysis and machine learning models for Eastern Black Sea Basin. Theor. Appl. Climatol. 2023, 155, 1605–1624. [Google Scholar] [CrossRef]
Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
Bisong, E. Building Machine Learning and Deep Learning Models; Springer Nature: Berkeley, CA, USA, 2019. [Google Scholar] [CrossRef]
Niazkar, M.; Menapace, A.; Brentan, B. Applications of XGBoost in water resources engineering: A systematic literature review. Environ. Model. Softw. 2024, 174, 105971. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
Wang, S.; Peng, H. Multiple spatio-temporal scale runoff forecasting and driving mechanism exploration by K-means optimized XGBoost and SHAP. J. Hydrol. 2024, 630, 130650. [Google Scholar] [CrossRef]
Fang, H.; Liao, J.; Huang, S. Research on Status Assessment and Operation and Maintenance of Electric Vehicle DC Charging Stations Based on XGboost. Smart Cities 2024, 7, 3055–3070. [Google Scholar] [CrossRef]
Hakan, T.; Booji, M.J. Simulation and forecasting of streamflows using machine learning models coupled with base flow separation. J. Hydrol. 2018, 564, 266–282. [Google Scholar] [CrossRef]
Yu, F.; Cui, N.; Gong, D.; Zhang, Q.; Zhao, L. Evaluation of random forests and generalized regression neural networks for daily reference evapotranspiration modelling. Agric. Water Manag. 2017, 193, 163–173. [Google Scholar] [CrossRef]
Liaw, A.; Wiener, M. Classification and Regression by Randomforest. R News 2002, 2–3, 18–22. Available online: https://CRAN.R-project.org/doc/Rnews (accessed on 2 March 2025).
Booker, D.J.; Woods, R.A. Comparing and combining physically-based and empirically-based approaches for estimating the hydrology of ungauged catchments. J. Hydrol. 2014, 508, 227–239. [Google Scholar] [CrossRef]
Nagoor, B.S.; Jongkittinarukorn, K.; Bingi, K. XGBoost based enhanced predictive model for handling missing input parameters: A case study on gas turbine. Case Stud. Chem. Environ. Eng. 2024, 10, 100775. [Google Scholar] [CrossRef]
El Bilali, A.; Hadri, A.; Taleb, A.; Tanarhte, M.; El Khalki, M.; Kharrou, M. A novel hybrid modeling approach based on empirical methods, PSO, XGBoost, and multiple GCMs for forecasting long-term reference evapotranspiration in a data scarce-area. Comput. Electron. Agric. 2025, 232, 110106. [Google Scholar] [CrossRef]
Akaike, H. Information Theory and an Extension of the Maximum Likelihood Principle. In Proceedings of the 2nd International Symposium on Information Theory; Petrov, B.N., Csaki, F., Eds.; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281. [Google Scholar]
Bozdogan, H. Model Selection and Akaike’s Information Criterion (AIC): The General Theory and Its Analytical Extensions. Psychometrika 1987, 52, 345–370. [Google Scholar] [CrossRef]
Zucchini, W. An Introduction to Model Selection. J. Math. Psychol. 2000, 44, 41–61. [Google Scholar] [CrossRef]
Kass, R.; Raftery, A. Bayes factors. J. Am. Statist. Assoc. 1995, 430, 773–795. [Google Scholar] [CrossRef]
Nguyen, T.T.L.; Pandey, M.; Janizadeh, S.; Bhunia, G.S.; Norouzi, N.; Ali, S.; Pham, Q.B.; Anh, D.T.; Ahmadi, K. Flood susceptibility modeling based on new hybrid intelligence model: Optimization of XGboost model using GA metaheuristic algorithm. Adv. Space Res. 2022, 69, 3301–3318. [Google Scholar] [CrossRef]

Figure 1. DEM map of Bouregreg Watershed.

Figure 2. SMBA Dam on the Bouregreg Watershed [10].

Figure 3. Hydrographic map of Bouregreg Watershed.

Figure 4. Precipitation variability at SMBA Dam station (2010 to 2020).

Figure 5. Inflow variability at SMBA Dam station (2010 to 2020).

Figure 6. Flow chart for the development of machine learning models.

Figure 8. Structure of the SVR model [28] (ξ = Slack variable to denote the deviation from the point to the positive edge of the hyperplane, ξ* = Slack variable to denote the deviation from the point to the negative edge of the hyperplane).

Figure 10. Monthly comparison of observed and predicted data from 2010 to 2020 with RF.

Figure 11. Monthly comparison of observed and predicted data from 2010 to 2020 with SVR.

Figure 12. Monthly comparison of observed and predicted data from 2010 to 2020 with XGBoost.

Table 1. Characteristics of the watersheds at the main hydrometric stations [12].

Station	River	Area (km²)	Length (km)	Average Slope %	Concentration Time (h)
Aguibat Ezziar	Bouregreg	3710	159	1220	18.9
Ras Fathia	Grou	3543	222	1200	24.5
Sidi Med Cherif	Mechraa	609	60	780	12.1
Ain Loudah	Korifla	645	69	840	12.2

Table 2. Parameters of simulation models.

Parameters of Models	SVR			RF		XGBoost
Parameters of Models	Ε	C	$γ$	mtry	ntree	eta	Subsample	colsample_bytree	max_depth
SMBA Dam	0.1	1	1	1	100	0.1	0.8	0.8	6

Table 3. Evaluation of the Random Forest model performance.

Random Forest	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Observed input	0.00	1.70	9.14	38.81	42.00	368.99
Simulated input	0.39	4.63	12.45	44.60	38.93	223.68
R²		AIC		BIC
0.5319788		202.6252		204.9813

Table 4. Evaluation of the SVR model performance.

SVR	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Observed input	0.00	1.70	9.14	38.81	42.00	368.99
Simulated input	0.93	7.53	10.10	31.27	35.42	248.67
R²		AIC			BIC
0.6539323		191.7364			194.0925

Table 5. Evaluation of the XGBoost model performance.

XGBoost	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Observed input	0.00	1.70	9.14	38.81	42.00	368.99
Simulated input	0.75	12.03	14.60	31.19	42.47	215.60
R²		AIC			BIC
0.5325813		205.4948			212.5631

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Elhallaoui Oueldkaddour, F.Z.; Wariaghli, F.; Brirhet, H.; Yahyaoui, A.; Jaziri, H. Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin. Limnol. Rev. 2025, 25, 6. https://doi.org/10.3390/limnolrev25010006

AMA Style

Elhallaoui Oueldkaddour FZ, Wariaghli F, Brirhet H, Yahyaoui A, Jaziri H. Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin. Limnological Review. 2025; 25(1):6. https://doi.org/10.3390/limnolrev25010006

Chicago/Turabian Style

Elhallaoui Oueldkaddour, Fatima Zehrae, Fatima Wariaghli, Hassane Brirhet, Ahmed Yahyaoui, and Hassane Jaziri. 2025. "Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin" Limnological Review 25, no. 1: 6. https://doi.org/10.3390/limnolrev25010006

APA Style

Elhallaoui Oueldkaddour, F. Z., Wariaghli, F., Brirhet, H., Yahyaoui, A., & Jaziri, H. (2025). Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin. Limnological Review, 25(1), 6. https://doi.org/10.3390/limnolrev25010006

Article Menu

Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin

Abstract

1. Introduction

2. The Study Area

2.1. Characteristics of the Bouregreg Basin

2.2. Presentation of the Sidi Mohammed Ben Abdallah Dam (SMBA)

3. Methodology

3.1. Hydrographic Network

3.2. Hydro Meteorological Data

3.3. Machine Learning Models

3.4. Presentation of the Random Forest Model

Prediction Method of the RF

3.5. Presentation of the SVR Model

3.6. Presentation of the XGBoost Model

3.7. Evaluation Parameters of the Models

3.7.1. SVR Parameters

3.7.2. RF Parameters

3.7.3. XGBoost Parameters

3.8. Performance Indices of the Models

3.8.1. The Coefficient of Determination (R2)

3.8.2. Akaike’s Information Criteria (AIC)

3.8.3. Bayesian Information Criteria (BIC)

4. Results and Discussion

4.1. Stream-Flow Prediction and Validation with Random Forest Model

4.2. Stream-Flow Prediction and Validation with SVR Model

4.3. Stream-Flow Prediction and Validation with XGBoost Model

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.8.1. The Coefficient of Determination (R²)