1. Introduction
As the world increasingly recognizes the need for sustainable practices, great focus is given to energy generation and use. Among the essential services that municipalities provide, water is generally one of the most energy-intensive. Approximately 3.7 TWh of global energy use is associated with water supply, 2 TWh for distribution, and 1.7 TWh for wastewater treatment [1]. Furthermore, 30 to 60% of municipal expenses are related to the water industry [2]. It is, thus, clear that gains in water distribution energy efficiency and energy generation can lead to significant reductions in greenhouse gas (GHG) emissions and costs [3,4,5].
The safe and reliable operation of water distribution networks requires pressures to be controlled, often with pressure-reducing valves (PRVs). However, PRVs dissipate pressure through friction, wasting energy. Instead, energy can be recovered through micro-hydro turbines or pumps as turbines (PaTs) [6]. The former are generally more expensive than the latter [7]. PaTs are simply pumps operated in reverse, coupled with generators. The initial installation costs and GHG emissions are quickly offset by long-term savings and energy generation [8]. Still, one barrier to implementing PaTs remains. Manufacturers generally do not provide the characteristic attributes of pumps in reverse mode since this was not their initial intended application. Determining PaT characteristics can require expensive testing. Laboratory testing of PaTs can show different hydrodynamic and mechanical forcing scenarios and allow for tuning [9]. While lab results are reliable, they are time-consuming to achieve. Computational fluid dynamics (CFD) has also been applied in modeling PaT behavior with high accuracy [10,11], sometimes better than experimental results [9]. CFD models enable the analysis of the effect of specific scenarios, e.g., transients [11], and pump characteristics, e.g., guide vane clocking positions [12], impeller geometry modifications [13], and variable rotational speed [14,15]. CFD nevertheless requires extensive time, resources, and computing power [16].
A less costly approach to predicting PaT performance has been the development of theoretical or empirical equations. Studies seeking to select the optimal PaT for a water distribution network regularly employ these equations to estimate turbine performance based on known pump characteristics [6,17,18,19].
Stepanoff [20] was the first to relate pump and turbine characteristics, through Equation (1):
where Nst is the specific speed of the turbine, Nsp is the specific speed of the pump, and ηp is the pump efficiency. Specific speeds are calculated based on the best efficiency points (BEPs). Thus, the theoretical Equation (1) implies the power generated at the BEP in turbine mode is lower than the power used in pump mode. Sharma [21] developed another similar theoretical equation, assuming a smaller reduction in specific speed, as shown in Equation (2):
As more information became available, more accurate empirical equations were developed to estimate the flow (Q) and head (H) of PaTs, as summarized in Table 1. Alatorre-Frenk et al. [22] formulated equations based on statistical correlations by curve-fitting experimental data for pumps with specific speeds between 10.5 and 98.7. The model was reported to have a high coefficient of determination (0.9928) [22]. Yang et al. [9] also applied curve fitting in developing statistical regressions for normalized flows between 0.7 and 1.33, with low percentage errors (5.3% for head prediction and 6.2% for flow prediction). While these equations relied solely on pump efficiency, other studies have found that considering specific speeds can lead to more reliable equations. Barbarelli et al. [23] based equations on experimental data of pump and turbine modes for 12 pumps with specific speeds between 15 and 65. Audisio [24] developed a set of equations from the experimental data of 41 PaTs, all with speeds greater than 400 rpm. Fontanella et al. [25] developed unique equations based on the rotational speed of the pump and turbine. They considered the REDAWN (Reduction Energy Dependency in Atlantic area Water Networks) project dataset of 34 pumps, compiled from the literature and from data supplied by manufacturers and researchers. The key limitations of these equations are their small datasets and restricted applicability to certain ranges. This makes them harder to generalize to different pump types and characteristics.
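To make the efficiency-based conversions of Stepanoff [20] and Sharma [21] concrete, the sketch below applies them in the head and flow form in which they are commonly quoted in the PaT literature. The exponents are taken from that common form and may not match the specific-speed form of Equations (1) and (2); the `Bep` container, the function names, and the numerical inputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bep:
    flow: float  # flow at the best efficiency point, e.g. L/s
    head: float  # head at the best efficiency point, m


def stepanoff(pump: Bep, eta_p: float) -> Bep:
    """Commonly quoted Stepanoff form: Qt = Qp / sqrt(eta_p), Ht = Hp / eta_p."""
    return Bep(flow=pump.flow / eta_p ** 0.5, head=pump.head / eta_p)


def sharma(pump: Bep, eta_p: float) -> Bep:
    """Commonly quoted Sharma form: Qt = Qp / eta_p**0.8, Ht = Hp / eta_p**1.2."""
    return Bep(flow=pump.flow / eta_p ** 0.8, head=pump.head / eta_p ** 1.2)


# Hypothetical pump BEP and pump efficiency (as a fraction, not a percentage).
pump_bep = Bep(flow=50.0, head=30.0)
print("Stepanoff:", stepanoff(pump_bep, eta_p=0.75))
print("Sharma:   ", sharma(pump_bep, eta_p=0.75))
```

Both relations use pump efficiency as their only predictor, which is the simplification that the later empirical equations try to relax.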
Similar to the development of BEP equations, turbine characteristic curve equations have been developed through regressions fit to experimental data, as listed in Table 2. Derakhshan and Nourbakhsh [26] developed equations to predict turbine mode head (H) and power (P) based on flows (Q), with a library of 4 PaTs. Rossi et al. [27] developed a set of equations based on dimensionless parameters to facilitate the application of equations to pumps of various sizes, with a larger library of 32 PaTs. The dimensionless flow parameter is denoted as Φ and head as Ψ. Perez-Sanchez et al. [28] used a larger library of 181 PaTs to develop characteristic curve equations based on dimensioned characteristics normalized with BEP values. While Perez-Sanchez et al. [28] set a minimum ratio of flow to BEP flow of 0.4, Rossi et al. [27] limited their equation to a maximum flow ratio of 1.4. Perez-Sanchez et al. [28] and Rossi et al. [27] reported high coefficients of determination, with values over 0.91. However, it is not clear if dimensioned or dimensionless equations provide more accurate results in general. Even with slightly larger libraries, these equations are still restricted to the pump types and specific speed ranges of their datasets, and can hardly be generalized since these variables are not included in the equations.
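The dimensionless parameters Φ and Ψ can be illustrated with a short sketch. It assumes one widely used turbomachinery convention, Φ = Q/(ωD³) and Ψ = gH/(ω²D²), with ω the angular speed and D the impeller diameter; the cited studies may define the coefficients with different constants, and the function names and inputs here are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def angular_speed(speed_rpm: float) -> float:
    """Convert rotational speed from rev/min to rad/s."""
    return speed_rpm * 2.0 * math.pi / 60.0


def flow_coefficient(q_m3s: float, speed_rpm: float, diameter_m: float) -> float:
    """Phi = Q / (omega * D^3), under the convention assumed above."""
    return q_m3s / (angular_speed(speed_rpm) * diameter_m ** 3)


def head_coefficient(head_m: float, speed_rpm: float, diameter_m: float) -> float:
    """Psi = g * H / (omega^2 * D^2), under the convention assumed above."""
    return G * head_m / (angular_speed(speed_rpm) ** 2 * diameter_m ** 2)


# Hypothetical operating point: 0.05 m^3/s, 30 m of head, 1450 rpm, 0.25 m impeller.
print("Phi =", flow_coefficient(0.05, 1450.0, 0.25))
print("Psi =", head_coefficient(30.0, 1450.0, 0.25))
```

Because speed and diameter are absorbed into Φ and Ψ, two geometrically similar machines of different sizes map onto the same dimensionless curve, which is the motivation for this formulation.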
Recent studies highlight the opportunity in applying machine learning to predict the behavior of geometric subjects [29,30]. By accounting for multiple parameters, they can lead to general and simple predictive models. Rossi et al. [31] developed artificial neural networks (ANNs) to predict PaT performance, both BEP and characteristic curves. The models were based on a library of 32 PaTs and used specific speed, rotating speed, and efficiency as well as dimensionless flow, head and power parameters. While a relatively good fit was reached and recent studies have sought to extend these models [32], the accuracy of ANN models has been shown to be unstable with such small datasets [29]. Given that the datasets of PaT performance are inherently small, other regression models may yield better and more reliable results. Indeed, Alacco [33] attempted to develop evolutionary polynomial regressions to predict characteristic curves. However, the performance of the models was unsatisfactory and potential improvements were not provided. Accordingly, the goal of the present study is to investigate the accuracy of various types of multivariate regression models in predicting PaT performance, with the support of both a dimensioned and a dimensionless library.
3. Results
Before developing the proposed models, the target BEP turbine variables were plotted against their corresponding pump model variables. The relations between turbine and pump flow, head and efficiency are presented in Figure 2. All plots show a linear trend, strongest for flow and weakest for efficiency. It should also be noted that most flow rates lie between 0 and 150 L/s, and most heads are similarly concentrated between 0 and 90 m. These ranges are typical of PaTs; beyond them, the data are sparse and more scattered.
Existing characteristic curve equations, presented in Table 2, generally define the characteristic curves according to normalized head, flow and efficiency values. These normalized variables were also visualized for the current dataset, as shown in Figure 3. A near-linear trend is seen in Figure 3a, with a higher density in the mid-range and more scatter at the extremes. On the other hand, Figure 3b shows a parabolic trend. Thus, a linear model may be preferred for head curves, whereas a nonlinear or polynomial relationship may perform best for efficiency curves.
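The normalization described above can be sketched as follows: each measured point on a turbine curve is divided by the corresponding BEP value, so that the BEP itself maps to (1, 1, 1). The function name, tuple layout, and sample values are hypothetical and are not taken from the study's dataset.

```python
def normalize_curve(points, bep):
    """Divide each (Q, H, eta) point by the BEP values (Q_bep, H_bep, eta_bep)."""
    q_bep, h_bep, eta_bep = bep
    return [(q / q_bep, h / h_bep, eta / eta_bep) for q, h, eta in points]


# Hypothetical turbine-mode curve: flow in L/s, head in m, efficiency as a fraction.
curve = [(20.0, 12.0, 0.55), (30.0, 20.0, 0.70), (40.0, 30.0, 0.78), (50.0, 44.0, 0.72)]
bep = max(curve, key=lambda point: point[2])  # the point with the highest efficiency

for q_n, h_n, eta_n in normalize_curve(curve, bep):
    print(f"Q/Qbep={q_n:.2f}  H/Hbep={h_n:.2f}  eta/eta_bep={eta_n:.2f}")
```

Normalizing in this way puts curves from machines of different sizes on a common scale, which is what allows a single regression to be fit across the whole library.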
3.1. BEP Results
The best-performing multivariate regression models for each BEP attribute based on the dimensioned and dimensionless datasets are presented in Table 6 and Table 7, respectively. The R2 scores from the default hyperparameter models are compared to the optimized models. The reduction in R2 scores from the default to optimized parameters of the model is a consequence of the hyperparameter tuning and fitting the model better to the data. It should also be noted that the scales of the dimensioned parameters and the dimensionless parameters are different. Models applied to the dimensioned dataset performed better than the dimensionless with regard to R2. This may be explained by the fact that the dimensioned dataset contains more variables. Because dimensionless variables are normalized by impeller diameter and rotational speed, these attributes were not included separately in the dimensionless dataset.
The best BEP flow model applied the Huber Regressor to the dimensionless dataset with the following hyperparameters in scikit-learn: fit_intercept = False, epsilon = 1.523529, and alpha = 0.1. The Huber Regressor is a linear regression model that is robust to outliers. Similarly, the best BEP head model used Elastic Net, and the best efficiency model used Orthogonal Matching Pursuit. These are also linear regression models, confirming the observations of Figure 2. The best BEP head model applied the following hyperparameters: selection = cyclic, positive = True, normalize = False, l1_ratio = 10, fit_intercept = False, copy_X = True, alpha = 10. The best BEP efficiency model applied the following hyperparameters: precompute = auto, normalize = True, fit_intercept = True, n_nonzero_coefs = 0.
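The robustness of a Huber-type fit to outliers can be illustrated with a minimal, standard-library sketch. It is not the scikit-learn implementation: it fits a single slope through the origin by iteratively reweighted least squares with a fixed threshold `delta` expressed in residual units, whereas scikit-learn's Huber Regressor also estimates a scale parameter and applies regularization. The function name and the data are hypothetical.

```python
def huber_slope(x, y, delta, iterations=50):
    """Fit y ~ b * x (no intercept) under the Huber loss via reweighted least squares."""
    # Start from the ordinary least-squares slope through the origin.
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    for _ in range(iterations):
        num = den = 0.0
        for xi, yi in zip(x, y):
            residual = abs(yi - b * xi)
            # Full weight inside the threshold, down-weighted linearly outside it.
            weight = 1.0 if residual <= delta else delta / residual
            num += weight * xi * yi
            den += weight * xi * xi
        b = num / den
    return b


# Hypothetical pump flows (x) and turbine flows (y); the last point is an outlier.
pump_flow = [10.0, 20.0, 30.0, 40.0, 50.0]
turbine_flow = [15.0, 30.0, 45.0, 60.0, 200.0]

ols = sum(a * b for a, b in zip(pump_flow, turbine_flow)) / sum(a * a for a in pump_flow)
print("least-squares slope:", ols)
print("Huber slope:        ", huber_slope(pump_flow, turbine_flow, delta=5.0))
```

In this example the four consistent points follow a slope of 1.5; the Huber estimate stays close to that value, while the least-squares slope is pulled towards the outlier.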
The BEP predictions with ANN are overall less accurate than the multivariate regression model results, apart from the prediction of specific speeds. Dimensioned and dimensionless ANN model results are summarized in Table 8 and Table 9, respectively. Similar to the multivariate regression results, the highest R2 was found for flow predictions and the lowest for efficiency. However, the dimensioned dataset performed better for flow. The rectified linear unit activation function was selected through tuning for the dimensioned models, confirming the better performance of linear models.
ANNs generally require larger datasets, which are not available for the current PaT problem. Furthermore, ANN is more computationally intensive. Typically, the multivariate regression process from start to finish took around 20 min for each attribute, including training and tuning. With ANN, however, the AX optimization process took at least 25 min, up to 45 min, depending on the number of iterations required. These durations are for a laptop with a 2.1 GHz processor, in Windows 10.
Given the superiority of the multivariate regression models, their results are further explored. Figure 4 compares the models’ predicted results against actual values for all BEP attributes, for both dimensioned and dimensionless datasets. Firstly, the flow results in Figure 4a,b show that the majority of predictions are close to actual values. Only one outlier is observed in both the dimensioned and dimensionless data, due to a larger PaT. This outlier is also evident in the dimensionless head predictions (Figure 4d), but not in the dimensioned head model. The dimensioned head model fits actual values well, with an R2 of 0.9319, even though the majority of head values are slightly underpredicted. Efficiency results are more scattered and are identical for the dimensioned (Figure 4e) and dimensionless models (Figure 4f). This is because the orthogonal matching pursuit model was applied to both. This model has no parameters that can be tuned and selects the attributes most highly correlated with the target. In this case, the attribute most correlated with the turbine best efficiency is the pump best efficiency, which is also inherently dimensionless. Thus, choosing dimensioned or dimensionless attributes does not impact results in this case.
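The selection behavior described above can be sketched for the simplest case, in which orthogonal matching pursuit keeps a single attribute: the column with the largest normalized correlation with the target is chosen, and its coefficient is then obtained by least squares. The sketch assumes centered data with no intercept; the function name, attribute names, and values are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def omp_single_feature(columns, target):
    """First orthogonal-matching-pursuit step on centered data.

    columns: dict mapping attribute name -> list of values.
    Returns the selected attribute name and its least-squares coefficient.
    """
    def score(name):
        col = columns[name]
        return abs(dot(col, target)) / math.sqrt(dot(col, col))

    best = max(columns, key=score)
    coefficient = dot(columns[best], target) / dot(columns[best], columns[best])
    return best, coefficient


# Hypothetical centered attributes and centered turbine best efficiency.
attributes = {
    "pump_efficiency": [-0.10, 0.00, 0.10],
    "pump_head": [5.0, -10.0, 5.0],
}
turbine_efficiency = [-0.08, 0.00, 0.08]
print(omp_single_feature(attributes, turbine_efficiency))
```

Since the selected attribute in this toy case is already dimensionless, rescaling the other attributes would not change the outcome, mirroring the identical dimensioned and dimensionless efficiency results.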
3.2. Characteristic Curve Results
All characteristic curve multivariate regression models performed the best with the XGB Regressor, as presented in Table 10 and Table 11. The results also show very similar performances for both dimensioned and dimensionless datasets. The dimensionless dataset models perform better by a small margin when considering the R2 of the efficiency curves. For both datasets, the head curve was predicted with very high accuracy, with the same R2 of 0.997. Hyperparameters for the best XGB Regressor models are summarized in Table 12.
The ANN results for predicting characteristic curves with the dimensioned and dimensionless datasets are summarized in Table 13 and Table 14, respectively. The dropout rates are consistently very small or null. Because the models rely on small datasets, lower dropouts are preferred to ensure more information can be distributed and used in training a more accurate model. The R2 scores for the head and Ψ curves are high, 0.986 and 0.954 respectively, albeit lower than the multivariate regression models. The efficiency and η model scores are lower but nevertheless strong for both the dimensioned and dimensionless predictions. Still, the multivariate regression models performed better in predicting the efficiency and η curves as well. With more datapoints and possibly more attributes, the ANN may perform better. More research would be required to collect more data on PaTs. Nevertheless, the accuracy of the multivariate regression models is already high.
Given the superiority of the multivariate regression models, the relation between their predicted and actual results is explored in Figure 5. A very strong correlation between actual and predicted head curves is observed for all ranges of normalized head values, as shown in Figure 5a,b. Efficiency curve results are more scattered, with a better fit when normalized values are closer to 1, i.e., when turbine efficiency is close to the BEP. For efficiencies between 50 and 80% of the BEP, the models generally overestimate efficiency. There are also fewer data in this range. Thus, these models may be improved by adding more data regarding PaT performance at lower efficiencies.
4. Discussion
The performance of the models developed in the present study was also compared to those from previous research. For the BEP prediction comparison, 20 random data points were extracted from the dimensioned datasets to ensure a consistent test set. The models developed herein had different train/test splits and were thus initially tested on datasets of different sizes. Table 15 shows the current multivariate regression models outperformed all previous models. The current head BEP multivariate regression model has an R2 of 0.932, followed by the model proposed by Sharma [21] with a score of 0.827. While the ANN model performed well, with a score of 0.822, the multivariate regression model and Sharma’s equation still performed better. Other previous equations had slightly lower scores, but generally above 0.7. The exception is Barbarelli et al. [23] who developed their equations based on 4 PaTs with specific speeds ranging between 14 and 45. In the present dataset, most specific speeds were below 10. Thus, the Barbarelli et al. [23] equation is not applicable to this lower range.
The current flow BEP model has an even higher R2, of 0.972. The next best-performing model is the Yang et al. [9] equation, at an R2 of 0.965. The ANN model scored well, but the current multivariate regression model and the equations of Yang et al. [9], Sharma [21] and Stepanoff [20] were better. Efficiency results were not compared with previous studies because most authors did not develop a separate equation for efficiency. The PaT efficiency is not required to determine its BEP or create characteristic curves.
A comparison between the characteristic curves developed herein and other studies is provided in Table 16. Again, the current multivariate regression models outperformed all previous models. The multivariate regression head curve prediction had a very high R2 of 0.997, relatively higher than the Perez-Sanchez et al. [28] equation, with a value of 0.983. The ANN model scored very high as well, 0.986, which makes it the second best. The RMSE values confirm these results. The multivariate regression efficiency curve also had a high R2 of 0.909, above the Rossi et al. [27] score of 0.869. In this case, the current ANN had the lowest score of the compared efficiency models. The results of the predicted efficiency curve values also scored highly using the multivariate regression method with a coefficient of determination of 0.901 with Rossi et al. [27] as the runner-up with a score of 0.869. The ANN method had a good score of 0.766 but the multivariate regression model and the Rossi et al. [27] model both performed better. Because some of these scores are very similar, the models are comparable, and their applicability might depend more on the range of pump values.
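For reference, the two metrics used in these comparisons can be computed as in the sketch below, which follows the standard definitions of the coefficient of determination and the root-mean-square error. The function names and sample values are hypothetical and are not taken from Table 15 or Table 16.

```python
import math


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root-mean-square error, in the same units as the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical normalized head values and two sets of predictions.
actual = [0.6, 0.8, 1.0, 1.2, 1.4]
model_a = [0.62, 0.79, 1.01, 1.18, 1.41]
model_b = [0.70, 0.85, 0.95, 1.10, 1.30]

for name, predicted in (("model A", model_a), ("model B", model_b)):
    print(name, "R2 =", r2_score(actual, predicted), "RMSE =", rmse(actual, predicted))
```

R2 is unitless and is 0 for a model that always predicts the mean, whereas RMSE is expressed in the units of the predicted variable, so the two are best read together when scores are close.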
With all the scores considered, the current model is superior to the equations from the literature. Some of the models from the literature scored highly in the prediction of either the head curve or the efficiency curve, but hardly ever both at the same time. The highest scoring model for both variables is that proposed by Perez-Sanchez et al. [28], with scores of 0.955 and 0.868 for the head and efficiency curves, respectively, compared against 0.997 and 0.909 for the proposed model. As for the ANN model developed by Rossi et al. [31], fully recreating the results and model was not possible as only information on the number of hidden layers and neurons per layer was given. Information regarding the learning rate, dropout rate, activation function, etc., was unknown. Furthermore, the training and test data sizes were unclear. Assuming that the datasets are comparable, the current model is superior in predicting the head curve variables, while Rossi et al.’s [31] model is superior in the prediction of efficiency.
The higher performance of the proposed BEP models compared to previous studies can be largely explained by the amount of data compiled. Whereas previous BEP prediction studies had datasets ranging from 4 to 32 PaTs, the present study compiled data from 145 PaTs. The comparison of multiple regression algorithms also enables the selection of the specific best-performing models for each attribute, whether BEP or characteristic curves.
Overall, the results show that linear regression models (i.e., Huber regressor, elastic net, and orthogonal matching pursuit) were specifically preferred for predicting BEP, and XGB regressors were best for predicting characteristic curves. Such models can be quickly applied in practice, facilitating the selection of PaTs in real water distribution networks. Furthermore, as data-driven multivariate regression models, they can easily be updated and improved as more data becomes available.
It is also important to highlight some of the worst-performing models overall, considering that the initial library contained a total of 24 models. Reducing the number of candidate regressions can aid future studies that consider and evaluate multiple machine learning regression models. Models that should not be considered for any prediction pertaining to PaTs are the Gamma Regressor, Poisson, Gamma, Inverse Gaussian, and SVR-lin. All these models showed negative R2 scores, i.e., they performed worse than simply predicting the mean of the observed values, and therefore show no promise in predicting the attributes.
Limitations
It should be noted that the comparison of the current model against other models in the literature is slightly biased. The datasets used in training each model differed. Evaluating the fit of the model to the type and range of data for which it was originally trained and tested would lead to better results. For example, Rossi et al. [33] reported higher results in their study, i.e., R2 of 0.98429 for the head curve, compared to 0.955 reported herein. These scores are still lower than those of the current multivariate regression model, i.e., 0.997. Furthermore, the Rossi et al. [33] scores refer to the overall training, validation and testing dataset, whereas the results presented herein are specifically for the 20 randomly selected data points.
The current models are also limited in their application to ESOB pumps. Data were compiled specifically for this pump typology since it is the most common for PaTs. Nevertheless, multivariate regression models can be easily generalized with additional data, as opposed to earlier models that relied solely on pump efficiency and specific speed.
5. Conclusions
The present study developed novel multivariate regression models to predict PaT behavior. A dataset larger than previous studies, with 145 BEP data points, was compiled from previous work. While previous studies either applied dimensioned or dimensionless datasets, both approaches are compared herein. The developed models outperformed all previous statistical and ANN models. Results show linear regression models are specifically preferred for predicting BEP values given the underlying linear relation between pump and turbine values. The resulting R2 for flow and head BEP were 0.972 and 0.932, respectively. On the other hand, the best characteristic curve predictions were developed with XGB Regressors, with R2 of 0.994 and 0.919 for head and efficiency, respectively. Furthermore, the dimensionless dataset produced better characteristic curve and flow BEP models, whereas the dimensioned dataset provided slightly higher scores for head BEP models. Thus, a dimensionless dataset overall would be preferred.
The high accuracy of the developed multivariate regression models, combined with their lower computational cost compared to ANN, makes them a robust solution for selecting PaTs in practice. Future studies can explore the development of broader models. Adding information for PaTs with higher flow rates or other typologies besides centrifugal ESOB, such as multistage, axial and double suction, would be valuable in expanding the applicability of the models. Furthermore, the current efficiency curve models can be improved by adding datapoints to the dataset. The current dataset has between 7 and 12 datapoints per PaT. Thus, increasing the number of points per PaT could increase the accuracy of these models.