Machine Learning Model for Battle of Water Demand Forecasting

Pagano, Mario; Santonastaso, Giovanni Francesco; Di Nardo, Armando; Cuomo, Salvatore; Schiano Di Cola, Vincenzo

doi:10.3390/engproc2024069037


by Mario Pagano ¹, Giovanni Francesco Santonastaso ¹,*, Armando Di Nardo ¹, Salvatore Cuomo ² and Vincenzo Schiano Di Cola ³

¹ Dipartimento di Ingegneria, Università della Campania Luigi Vanvitelli, via Roma 29, 81031 Aversa, Italy

² Dipartimento di Matematica e Applicazioni “R. Caccioppoli”, Università degli Studi Federico II, via Vicinale dell’Infermeria 58, 80125 Napoli, Italy

³ Cyberneid srl, 80122 Naples, Italy

* Author to whom correspondence should be addressed.

† Presented at the 3rd International Joint Conference on Water Distribution Systems Analysis & Computing and Control for the Water Industry (WDSA/CCWI 2024), Ferrara, Italy, 1–4 July 2024.

Eng. Proc. 2024, 69(1), 37; https://doi.org/10.3390/engproc2024069037

Published: 3 September 2024

(This article belongs to the Proceedings of The 3rd International Joint Conference on Water Distribution Systems Analysis & Computing and Control for the Water Industry (WDSA/CCWI 2024))


Abstract

This article investigates the optimization of urban water distribution in the context of population growth and climate change. It highlights the use of the ExtraTreesRegressor algorithm to forecast water demand with greater accuracy. By analyzing a dataset from North-East Italy, the study demonstrates the importance of temporal dynamics over meteorological factors in predicting water consumption patterns. The findings present a novel approach to improving water management strategies, demonstrating machine learning’s potential in addressing critical urban infrastructure challenges.

Keywords: extra trees; water demand forecasting; predictive modeling

1. Introduction

Potable water resources are limited, and population growth and climate change make it essential to optimize their urban distribution. Forecasting water demand, both short-term (weeks, months) and long-term (years), helps mitigate problems such as droughts or leaks. However, forecasting is particularly challenging because the variables involved are complex and interact non-linearly, and because the available data are limited.

Recently, advances in computing power and data availability have favored the application of machine learning and artificial intelligence methods such as artificial neural networks (ANN), support vector machines (SVM), and random forests (RF). Due to their adaptability, these methods can overcome the limitations of more traditional models in adapting to different spatial conditions, allowing greater performance in short-term prediction [1,2].

2. Materials and Methods

In the context of supervised regression, we consider a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^{p}$ is a vector of $p$ features, and $y_i \in \mathbb{R}$ is the corresponding target value for observation $i$. The objective of the regression is to approximate the unknown function $f : \mathbb{R}^{p} \to \mathbb{R}$ such that $f(x_i) \approx y_i$ for each input–output pair in the training set.

Decision-tree-based algorithms use a tree structure to represent the function f. The bagging technique combines the predictions of several decision trees, each trained on a bootstrap sample of data.

Formally, if we construct $M$ independent decision trees $\{T_m\}_{m=1}^{M}$, each trained on a bootstrap sample of the original dataset, the aggregate prediction for a new input $x \in \mathbb{R}^{p}$ is computed as the average of the predictions of all trees:

$$\hat{f}(x) = \frac{1}{M} \sum_{m=1}^{M} T_m(x) \tag{1}$$

While each individual tree may have a high variance, averaging several independent trees tends to cancel out fluctuations, leading to a more stable and reliable final prediction.
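The variance-reduction argument can be made explicit with a standard result. The sketch below rests on a simplifying assumption that the article does not state: every tree prediction $T_m(x)$ has the same variance $\sigma^2$, and any two trees share the same pairwise correlation $\rho$.

```latex
\operatorname{Var}\!\left(\hat{f}(x)\right)
  = \operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} T_m(x)\right)
  = \frac{1}{M^{2}}\left[M\sigma^{2} + M(M-1)\,\rho\,\sigma^{2}\right]
  = \rho\,\sigma^{2} + \frac{1-\rho}{M}\,\sigma^{2}
```

For truly independent trees ($\rho = 0$) the variance falls as $\sigma^{2}/M$. Trees trained on overlapping data are usually correlated, so the first term acts as a floor that additional randomization in tree construction aims to lower.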

The ExtraTreesRegressor (Extremely Randomized Trees Regressor) algorithm [3] builds on this ensemble-averaging idea and adds further randomization to improve and stabilize performance:

Division Point Selection: For each candidate feature, ExtraTrees draws a division point at random and then chooses the best of these random divisions.
Tree Construction: Trees are grown to their maximum depth without pre-pruning. Random selection of split points and features avoids an exhaustive search for the optimal threshold, making the algorithm efficient even with large amounts of data.
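The division point selection step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the variance-reduction score, and the toy data are assumptions chosen for illustration, and a full tree would apply this step recursively at every node.

```python
import random
from statistics import pvariance


def split_score(column, targets, threshold):
    """Variance reduction obtained by splitting the samples at `threshold`."""
    left = [y for x, y in zip(column, targets) if x < threshold]
    right = [y for x, y in zip(column, targets) if x >= threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    n = len(targets)
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
    return pvariance(targets) - weighted


def pick_random_split(X, y, n_candidates, rng):
    """Draw ONE random threshold per candidate feature and keep the best one."""
    candidates = rng.sample(range(len(X[0])), n_candidates)
    best = None
    for j in candidates:
        column = [row[j] for row in X]
        lo, hi = min(column), max(column)
        if lo == hi:
            continue  # constant feature: no split is possible
        threshold = rng.uniform(lo, hi)  # random cut point, no exhaustive search
        score = split_score(column, y, threshold)
        if best is None or score > best[2]:
            best = (j, threshold, score)
    return best


# Toy data (hypothetical): feature 0 is informative, feature 1 is constant.
X = [[0.1, 5.0], [0.4, 5.0], [0.6, 5.0], [0.9, 5.0]]
y = [1.0, 1.0, 3.0, 3.0]
feature, threshold, score = pick_random_split(X, y, n_candidates=2, rng=random.Random(0))
```

Because only one threshold is evaluated per candidate feature, the cost of choosing a split grows with the number of candidate features rather than with the number of distinct feature values.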

To build the predictive model, we engineered a set of features derived from historical water flow and meteorological data. These features were selected to capture temporal trends and interactions with environmental variables that influence water demand in the different DMAs (District Metered Areas). They fall into several categories: temporal, statistical, meteorological, and lagged features.

Temporal features: Time of day, day of week, month, season, year, day of the year, day of the month, and week of the year. These are intended to capture individuals’ habitual dynamics as well as seasonal variations that may influence the data being analyzed.

Statistical features: Metrics such as the mean, median, maximum, and minimum on a weekly and monthly basis have been calculated. These indicators help to incorporate historical memory and capture recent trends.

Meteorological features: Environmental variables such as precipitation, air temperature, humidity, and wind speed.

Lagged hourly features: Past values of demand for each hour of the previous weeks. These features help the model place each forecast in the context of recent demand.
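As a rough illustration of how the temporal, statistical, and lagged features could be assembled for one hourly sample, consider the sketch below. Every name in it is hypothetical, and several choices are assumptions rather than details from the article: the meteorological season mapping, the ISO week numbering, the trailing 168-hour window, and the reading of "lagged" as the same hour one and two weeks earlier. Meteorological features are omitted because they would simply be joined on the timestamp.

```python
from datetime import datetime, timedelta
from statistics import mean, median

HOURS_PER_WEEK = 7 * 24

# Assumed mapping to meteorological seasons (Northern Hemisphere).
SEASON_BY_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}


def temporal_features(ts):
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),          # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "season": SEASON_BY_MONTH[ts.month],
        "year": ts.year,
        "day_of_year": ts.timetuple().tm_yday,
        "day_of_month": ts.day,
        "week_of_year": ts.isocalendar()[1],  # ISO 8601 week number
    }


def trailing_week_stats(flow, t):
    """Statistics over the week strictly before index t (target excluded)."""
    past = flow[t - HOURS_PER_WEEK:t]
    return {"week_mean": mean(past), "week_median": median(past),
            "week_max": max(past), "week_min": min(past)}


def weekly_lags(flow, t, n_weeks=2):
    """Demand at the same hour of the week, 1..n_weeks weeks before index t."""
    return {f"lag_{w}w": flow[t - w * HOURS_PER_WEEK] for w in range(1, n_weeks + 1)}


def build_features(timestamps, flow, t, n_weeks=2):
    if t < n_weeks * HOURS_PER_WEEK:
        raise ValueError("not enough history for the requested lags")
    row = temporal_features(timestamps[t])
    row.update(trailing_week_stats(flow, t))
    row.update(weekly_lags(flow, t, n_weeks))
    return row


# Synthetic hourly series (not real data): flow equals the hour of the day.
start = datetime(2022, 7, 1, 0)
timestamps = [start + timedelta(hours=h) for h in range(3 * HOURS_PER_WEEK)]
flow = [float(ts.hour) for ts in timestamps]
row = build_features(timestamps, flow, t=2 * HOURS_PER_WEEK + 8)
```

Restricting the statistics and lags to indices before `t` keeps the target value out of its own feature vector, which matters when samples are later shuffled.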

To train and validate the model robustly, we divided the dataset into two parts: a training set (80%) and a test set (20%). The samples were shuffled to avoid bias toward a given time period, while the chronological order between features and predictions was maintained. K-fold cross-validation was implemented on the training set. In this process, the set is divided into K = 10 parts of equal size; in each iteration, a different fold is used as the validation set, with the remaining K − 1 folds serving as the training set. This technique reduces the risk of overfitting to a single validation split and provides a more reliable estimate of model performance on unseen data. We also performed feature importance analysis and parameter tuning.
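The K-fold procedure on the training portion can be sketched as follows. The helper name, the fixed seed, and the sample count are hypothetical; the article does not say which library or random state was used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not a multiple of k.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


n_total = 1000                    # hypothetical number of samples
n_train = n_total * 80 // 100     # 80% kept for training, 20% held out for testing
splits = list(k_fold_indices(n_train, k=10))
```

Each sample index appears in exactly one validation fold, so every training sample is used for validation once and for fitting K − 1 times.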

The dataset provided for the Battle of Water Demand Forecasting (BWDF) competition consists of detailed time series of the net flow rates for ten District Metered Areas (DMAs) of a water network located in the north-east of Italy. The net flow rates, measured in liters per second (L/s), were recorded from 1 January 2021 to 31 March 2023 and acquired through the SCADA system in use by the water utility. The dataset also includes specific characteristics of each DMA: the type of area served, the number of users, and the average net flow rate for the years 2021 and 2022. Additional information, such as meteorological data and event calendars, is provided to support the analysis.

3. Results

To assess the model’s performance, we employed the metrics reported in Table 1. Mean RMSE is the root mean square error (the square root of the mean of the squared differences between the actual and predicted values) averaged over all cross-validation folds. RMSE Std Dev is the standard deviation of the RMSE values obtained from the different cross-validation folds and quantifies how much the prediction error varies between folds. Test RMSE is the error on the unseen test set, and the percentage error (% Error) is the average percentage difference between predicted and actual values.
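A minimal sketch of these metrics follows. Two points are assumptions, since the article gives no formulas: the percentage error is taken to be the mean absolute percentage error, and the fold-to-fold spread uses the population standard deviation. The numbers are toy values, not results from the study.

```python
from math import sqrt
from statistics import mean, pstdev


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def percentage_error(actual, predicted):
    """Mean absolute percentage error (assumed definition); needs nonzero actuals."""
    return 100.0 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


# Two toy validation folds (hypothetical values).
fold_rmse = [
    rmse([10.0, 12.0], [9.0, 13.0]),
    rmse([8.0, 8.0], [8.0, 10.0]),
]
mean_rmse = mean(fold_rmse)      # "Mean RMSE" column
rmse_std = pstdev(fold_rmse)     # "RMSE Std Dev" column
pct = percentage_error([10.0, 20.0], [9.0, 22.0])
```

RMSE is expressed in the units of the target, so it grows with the magnitude of the flow in a district, whereas a percentage error is scale-free. This is one reason the two metrics can rank districts differently.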

To help interpret the percentage error and the test error, Figure 1a,b shows scatter plots of model performance in districts A and D, where the percentage error and the Test RMSE, respectively, were highest, indicating poorer performance. The proximity of points to the diagonal line reflects model effectiveness; in district A (Figure 1a), the presence of many values far from the line indicates greater instability, as evidenced by a high % Error. In contrast, district D (Figure 1b) has fewer outliers despite a higher average distance of the points from the diagonal, as indicated by the Test RMSE.

4. Conclusions

In this study, we employed the ExtraTreesRegressor algorithm to enhance urban water distribution optimization. Our analysis, focused on a dataset from North-East Italy, underscored the superiority of temporal dynamics over meteorological data in predicting water demand. The research demonstrated not only the robustness and efficiency of machine learning algorithms in forecasting but also their potential to significantly contribute to the advancement of sustainable urban water management strategies. By prioritizing temporal and usage pattern data, our findings offer a new perspective on improving water distribution efficiency, presenting a promising avenue for future research in urban infrastructure optimization.

Author Contributions

Conceptualization, M.P. and G.F.S.; methodology, M.P., G.F.S. and V.S.D.C.; software, M.P. and G.F.S.; validation, G.F.S. and V.S.D.C.; formal analysis, M.P., G.F.S. and V.S.D.C.; investigation, M.P. and G.F.S.; resources, G.F.S.; data curation, M.P.; writing—original draft preparation, M.P.; writing—review and editing, M.P., G.F.S., A.D.N., S.C. and V.S.D.C.; project administration, G.F.S.; funding acquisition, G.F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pu, Z.; Yan, J.; Chen, L.; Li, Z.; Tian, W.; Tao, T. A hybrid Wavelet-CNN-LSTM deep learning model for short-term urban water demand forecasting. Front. Environ. Sci. Eng. 2023, 17, 22.
2. Donkor, E.A.; Mazzuchi, T.A.; Soyer, R.; Roberson, J.A. Urban Water Demand Forecasting: Review of Methods and Models. J. Water Resour. Plan. Manag. 2014, 140, 146–159.
3. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.

Figure 1. Scatter plot for district A: comparison of predicted and actual values on X and Y axes, respectively (a). Scatter plot for district D: comparison of predicted and actual values on X and Y axes, respectively (b).

Table 1. Performance of the Extra Trees Regressor model for each DMA, including Mean Root Mean Square Error (Mean RMSE), Root Mean Square Error Standard Deviation (RMSE Std Dev), RMSE on the test set (Test RMSE), and Percentage Error (% Error).

DMA	Mean RMSE	RMSE Std Dev	Test RMSE	% Error
A	1.247	0.170	1.102	9.18
B	0.474	0.030	0.477	3.11
C	0.439	0.017	0.449	6.11
D	2.449	0.098	2.423	5.78
E	2.080	0.119	2.201	1.71
F	0.949	0.027	0.942	8.91
G	1.373	0.175	1.330	3.68
H	1.026	0.252	0.891	3.18
I	1.310	0.047	1.326	4.84
J	1.347	0.071	1.333	3.68

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
