1. Introduction
Water utilities ensure a consistent supply of clean water to customers, with pipe infrastructure playing a crucial role in maintaining the security and quality of the water supply. These utilities possess a significant number of aging pipe assets that are nearing or have exceeded their intended lifespan [1]. At the same time, climate change is impacting all forms of infrastructure, including watermains, through changes in temperature fluctuations, freeze–thaw cycles, and rainfall patterns [2]. These challenges necessitate the urgent development of watermain break prediction models that incorporate climate factors, ensuring more effective and proactive watermain management.
Over the past decades, a variety of models for predicting pipe breaks have been developed. The evolution of Machine Learning (ML) algorithms stands out as one of the key technological breakthroughs of the 21st century, and water researchers have recently turned to these techniques to tackle pipe failure, a critical challenge for the security of urban water distribution systems. Commonly used models in the literature include Artificial Neural Networks (ANN), Support Vector Machines (SVM), Evolutionary Polynomial Regression (EPR), and, more recently, tree-based models. These models integrate different features to predict watermain breaks, including pipe-intrinsic, historical, and environmental data [3].
Incorporating climate data into the analysis of historical failure records allows the model to uncover patterns and connections that might remain hidden when examining failure data in isolation. Watermain breaks influenced by climatic conditions primarily arise from changes in temperature, rainfall, and wind speed [4]. Climatic covariates identified in previous studies are applied herein to explore the impact of climate change.
Climate change presents challenges for utilities in developing sustainable management and rehabilitation plans [4]. It could modify rainfall patterns, potentially leading to prolonged droughts that reduce groundwater levels. This drop in groundwater can lead to soil compaction, which might increase differential soil settlement. Consequently, such changes in the ground could pose a risk to the integrity of buried water infrastructure [5]. Climate change-driven behavioral shifts, like varying heating and cooling requirements, can also impact patterns of water usage. Elevated demands for water can lead to a rise in internal pressure within watermains, which, in turn, may contribute to the likelihood of pipe failures [6].
However, only a few studies have investigated the effect of climatic variations on watermain break prediction. This study seeks to bridge this research gap by developing ML models that forecast watermain breaks using climate-related covariates. Ultimately, the selected model is intended to identify potential watermain breaks attributable to different climate change scenarios. Such predictive capability would enable utilities to adopt proactive maintenance and repair tactics, thereby minimizing the risk of disruptions in the water supply.
2. Materials and Methods
This section outlines the data collection, processing, and modeling methodology used in this study. System-related data and historical break records were obtained from Kitchener’s watermain network [7]. The first dataset is the watermain inventory, which records pipe characteristics such as length, diameter, and material. The second dataset records watermain breaks between 1985 and 2018. As the present research investigates the impact of climate change on watermain breaks, historical weather data, including minimum, maximum, and mean temperatures and rainfall, were also collected from Environment and Climate Change Canada (ECCC) [8]. To clean the data, records with missing values, inconsistencies, or outliers were first removed to ensure data reliability. Categorical variables were then encoded using one-hot encoding as necessary. The cleaned datasets (inventory, break records, and climate data) were merged using the unique ID of each pipe and the time period.
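The encoding and merging steps described above can be sketched as follows. This is a minimal illustration using plain Python dictionaries, not the study's actual pipeline; the field names (`pipe_id`, `length_m`, `diameter_mm`, `material`, `mean_temp_c`), the period labels, and all values are hypothetical.

```python
def one_hot(records, field):
    """Replace a categorical field with 0/1 indicator columns, one per category."""
    categories = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            row[f"{field}_{c}"] = 1 if r[field] == c else 0
        encoded.append(row)
    return encoded


def merge_tables(inventory, breaks, climate):
    """Join inventory, break counts, and climate data on pipe ID and period."""
    merged = []
    for pipe in inventory:
        for period, weather in climate.items():
            row = dict(pipe)
            row["period"] = period
            # Pipes with no recorded break in a period get a count of zero.
            row["n_breaks"] = breaks.get((pipe["pipe_id"], period), 0)
            row.update(weather)
            merged.append(row)
    return merged


# Hypothetical toy data, for illustration only.
inventory = [
    {"pipe_id": "P1", "length_m": 120.0, "diameter_mm": 150, "material": "CI"},
    {"pipe_id": "P2", "length_m": 80.0, "diameter_mm": 200, "material": "DI"},
]
breaks = {("P1", "1985-1994"): 2}
climate = {
    "1985-1994": {"mean_temp_c": 7.1},
    "1995-2004": {"mean_temp_c": 7.6},
}

rows = merge_tables(one_hot(inventory, "material"), breaks, climate)
```

Each resulting row describes one pipe in one period, which is the shape a classifier would need as input.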
After the data were cleaned and prepared, they were randomly divided into a training set (70% of the records) and a test set (the remaining 30%) to evaluate the predictive performance of the ML models.
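A random 70/30 split of this kind can be sketched as below. The function name, the fixed seed, and the integer stand-in records are assumptions made for illustration; the source does not state how the split was implemented or whether it was stratified.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 merged pipe records
train, test = train_test_split(rows)
```

With imbalanced classes, a stratified split that preserves the broken/unbroken ratio in both sets is a common alternative.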
This study seeks to predict the future status of each water pipe as either broken or unbroken. Since compiling a dataset with yearly records for each pipe is cumbersome and leads to extreme data imbalance, decade-long time intervals were used. This approach facilitates a comprehensive evaluation of the influence of time-dependent variables, including cumulative failures, pipe age, and climate-related factors, within each defined interval. Even so, the available data remain highly imbalanced, meaning that one class has significantly more observations than the other. To mitigate this imbalance, this study focuses on Cast Iron (CI) pipes, which have more recorded breaks than other pipe types.
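The construction of decade-long interval records can be sketched as follows. This is an assumed layout, not the study's documented one: the interval boundaries (starting in 1985), the field names, and the example pipe are hypothetical, and the final interval (2015 to 2018) is shorter than a decade because the break records end in 2018.

```python
from collections import Counter


def decade_start(year, first_year=1985, width=10):
    """Return the first year of the interval that contains `year`."""
    return first_year + ((year - first_year) // width) * width


def interval_records(pipe_id, install_year, break_years,
                     first_year=1985, last_year=2018, width=10):
    """Build one record per interval with age, prior breaks, and a binary label."""
    counts = Counter(decade_start(y, first_year, width) for y in break_years)
    records, previous = [], 0
    for start in range(first_year, last_year + 1, width):
        n = counts.get(start, 0)
        records.append({
            "pipe_id": pipe_id,
            "interval_start": start,
            "age_at_start": start - install_year,
            "previous_breaks": previous,   # cumulative failures before this interval
            "broken": 1 if n > 0 else 0,   # target: any break within the interval
        })
        previous += n
    return records


# Hypothetical pipe installed in 1960 with three recorded breaks.
records = interval_records("P1", install_year=1960, break_years=[1988, 1991, 2007])
```

For this example pipe the labels across the four intervals are 1, 0, 1, 0, which is far less skewed than the corresponding 34 yearly records would be.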
To examine the impact of climate change on watermain failures, various climate-related variables were considered: minimum, maximum, and mean temperatures; air temperature changes and their intensities; variation in temperature; freezing and thawing indices; cumulative cold, hot, and thawing days; and total rainfall.
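A few of these covariates can be derived from daily weather series as sketched below. The definitions are common simplified forms and are assumptions here, since the source does not give its exact formulas: the freezing index is taken as the sum of degree-days below 0 °C, the thawing index as the sum of degree-days above 0 °C, and a cold day as any day with a mean temperature below 0 °C. The input values are hypothetical.

```python
def climate_covariates(daily_mean_temps_c, daily_rain_mm):
    """Summarize daily weather into interval-level climate covariates."""
    # Degree-days below and above freezing (assumed simplified definitions).
    freezing_index = sum(-t for t in daily_mean_temps_c if t < 0)
    thawing_index = sum(t for t in daily_mean_temps_c if t > 0)
    cold_days = sum(1 for t in daily_mean_temps_c if t < 0)
    # Day-to-day air temperature changes and their largest magnitude.
    changes = [b - a for a, b in zip(daily_mean_temps_c, daily_mean_temps_c[1:])]
    max_abs_change = max((abs(c) for c in changes), default=0.0)
    return {
        "freezing_index": freezing_index,
        "thawing_index": thawing_index,
        "cold_days": cold_days,
        "max_abs_temp_change": max_abs_change,
        "total_rain_mm": sum(daily_rain_mm),
    }


# Hypothetical five-day series, for illustration only.
temps = [-5.0, -2.0, 1.0, 4.0, -1.0]
rain = [0.0, 2.5, 0.0, 10.0, 1.5]
covariates = climate_covariates(temps, rain)
```

In practice the same function would be applied to all days falling within each decade-long interval.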
Four ML models were compared: Random Forest (RF), K-Nearest Neighbours (KNN), Artificial Neural Network (ANN), and Extreme Gradient Boosting (XGBoost). Random Forest is an ensemble learning method that combines multiple decision trees to improve prediction accuracy and reduce the risk of overfitting. XGBoost is a robust and effective gradient boosting algorithm that builds decision trees sequentially, with each new tree correcting the errors of the previous ones. KNN, a non-parametric method, classifies a new data point according to its distance from labeled data points; the underlying assumption is that points in close proximity to one another probably belong to the same class. The ANN used here is a feedforward neural network trained with the backpropagation algorithm. Its configuration, including the number of neurons in each layer, was optimized through trial and error. The hyperparameters of each ML algorithm were optimized using randomized search with cross-validation (Randomized Search CV). The models were evaluated using accuracy, precision, recall, and F1 score.
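To make the KNN idea concrete, the following is a minimal sketch of distance-based majority voting. It is not the implementation used in the study; the function name, the two scaled features, the value of `k`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features scaled to [0, 1]: (pipe age, previous breaks).
train_X = [(0.1, 0.0), (0.2, 0.1), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
train_y = [0, 0, 1, 1, 1]  # 0 = unbroken, 1 = broken

old_pipe = knn_predict(train_X, train_y, (0.75, 0.8))
new_pipe = knn_predict(train_X, train_y, (0.15, 0.05))
```

Because the vote depends on raw distances, features are usually scaled to comparable ranges before KNN is applied, and `k` is one of the hyperparameters a randomized search would tune.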
To evaluate the effect of climate change, future projections of temperature and precipitation for three scenarios (SSP1, SSP2, and SSP5) were taken from Environment and Climate Change Canada simulations [8]. The first scenario, SSP1, embodies a sustainable, low-emission future, potentially keeping warming below 2 °C. SSP2 represents a moderate path, with uneven development and a mid-century emissions peak, leading to moderate warming. In contrast, SSP5 depicts a high-emission, fossil-fuel-reliant world with significant warming [9].
3. Results
The performance of the ML models is shown in Table 1. Among the four models compared, RF performed best, especially in terms of F1 score. KNN showed excellent precision; however, its recall was the lowest, suggesting that it would miss a significant number of broken pipes. Conversely, ANN achieved the highest recall, but its lower precision means more false positives, implying a less reliable result. While XGBoost offers a balance between precision and recall, its performance metrics are slightly lower than those of RF. Therefore, as RF offers a balanced classification with strong performance across all metrics, it was used to make future predictions under three different climate change scenarios.
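The trade-off between precision and recall discussed above follows directly from the confusion-matrix counts. The sketch below computes the four metrics for binary labels; the label vectors are hypothetical and are not taken from Table 1.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = broken)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # broken, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unbroken, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # broken, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # unbroken, not flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: 4 broken and 6 unbroken pipes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here the model flags three pipes, two of them correctly (precision 2/3), but finds only two of the four broken pipes (recall 1/2), giving an F1 score of 4/7. A model can therefore have high precision and still miss many breaks, which is why F1 is informative on imbalanced data.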
The predicted numbers of broken pipes for SSP1, SSP2, and SSP5 are 484, 352, and 456, respectively. The highest predicted number of broken pipes occurs under the SSP1 scenario, indicating that this is the most deteriorative condition and suggesting greater vulnerability of CI pipes in colder climates. Conversely, the SSP2 scenario, which is associated with moderate future warming, has the fewest predicted broken pipes, implying that the pipes are less susceptible to breaks in moderately warm climate conditions. The higher number of broken pipes predicted under the SSP5 scenario, relative to SSP2, suggests that more extreme warming accelerates the deterioration of water pipes compared with moderate warming. This may stem from several factors, such as longer dry periods affecting soil settlement and an increase in water demand, which leads to higher internal pressure in pipes [4].