1. Introduction
Functional water supply networks are essential for a functioning urban environment and the population that inhabits it. Monitoring the quality of water in the water supply network and, in case of contamination, identifying and controlling the source and the propagation of the contamination are extremely important tasks for human health and safety. Water supply network pollution can be caused by a wide variety of incidents, including intentional contamination, biofilm formation in pipes, water aging, and chemical contamination from pipe lining and corrosion [
1,
2].
Water supply network security methodologies heavily rely on accurate water quality models and pipe network hydraulic simulators. EPANET [
3] is the most popular simulator for running the simulation experiments that are used, in conjunction with various mathematical methodologies, for finding the optimal water quality sensor placement in a water supply network ([
4,
5,
6,
7]), control of water supply networks in case of contamination events ([
8,
9,
10]) and contamination source detection ([
11,
12,
13]). A thorough and recent review of methodologies for water supply network quality modeling with contamination source detection can be found in [
14], and a general, thorough and recent review of water supply network security research and methods is given in [
15].
Simulation-optimization methods have been the most popular approach to the water supply network contamination source detection problem. The procedure couples an optimization algorithm (stochastic or deterministic) with a water supply network simulator. The objective of the optimization algorithm is to minimize the difference between the recorded water quality sensor readings and the simulated values in order to find the contamination source, the start and end times of the contamination event, and the injected contaminant concentration. Genetic algorithms (GA) and their variations have been widely used for this purpose ([
16,
17,
18]). Simulation-optimization with added hydraulic demand uncertainty and GA has also been investigated [
19]. Recently, a Poisson model for a changing water demand was coupled with an improved GA [
20].
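The objective described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Event` fields, the `simulate` function (a toy stand-in, not EPANET) and the made-up sensor layout are hypothetical, and the sum of squared differences is only one possible choice of misfit measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A candidate contamination event (all fields are illustrative)."""
    node: int             # index of the injection node (discrete variable)
    start: int            # first time step with injection
    end: int              # last time step with injection
    concentration: float  # injected contaminant concentration


def simulate(event, n_sensors=3, n_steps=12):
    """Toy stand-in for a water quality simulator run (NOT EPANET).

    Each sensor sees the injected concentration attenuated and delayed
    by a made-up 'distance' between the sensor and the source node.
    """
    readings = []
    for s in range(n_sensors):
        distance = abs(event.node - 2 * s)  # hypothetical network layout
        for t in range(n_steps):
            active = event.start + distance <= t <= event.end + distance
            readings.append(event.concentration / (1 + distance) if active else 0.0)
    return readings


def objective(event, observed):
    """Sum of squared differences between observed and simulated readings."""
    simulated = simulate(event)
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))


# Synthetic "sensor recordings" produced by a known event.
true_event = Event(node=3, start=2, end=6, concentration=50.0)
observed = simulate(true_event)

exact = objective(true_event, observed)
wrong_node = objective(Event(node=1, start=2, end=6, concentration=50.0), observed)
wrong_conc = objective(Event(node=3, start=2, end=6, concentration=40.0), observed)
```

In this toy setting the objective vanishes only for the event that generated the observations, while a wrong node or a wrong concentration yields a positive misfit; an optimizer searches for the event that minimizes this value.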
The simulation-optimization approach comes with an added computational cost and is usually parallelized, because the problem variables are both discrete (network nodes) and continuous (contamination start and end times, injected contaminant concentration). Besides the stochastic GA approach, the deterministic Nelder-Mead (NM) optimization algorithm was coupled with logistic regression to determine the potential contamination source candidates and other relevant variables [
21]. An important feature of that work is that it proposed a model-based approach for classifying the most probable contamination source nodes, thus eliminating the discrete variable from the simulation-optimization procedure, which was applied only to the remaining variables needed for the contamination event reconstruction. Recently, a search space reduction algorithm was developed that eliminates potential source nodes based on a sensor measurement comparison procedure [
22]. The simulation-optimization approach was then applied to the remaining potential source nodes. Both Particle Swarm Optimization (PSO) and GA were investigated, and PSO exhibited a better convergence rate and accuracy. Other simulation-optimization based methods include dynamic niching GA [
23], cultural algorithm [
24], hybrid encoding [
25] and a data-driven multi-strategy collaboration algorithm [
26].
Another approach to solving the problem of source identification is to use Bayesian optimization. In the work by [
27], a Bayesian framework for localizing multiple pollution sources was presented which incorporated Gaussian process emulators trained on data obtained from computational fluid dynamics simulations. A Bayesian approach was investigated for contamination source localization in a water distribution network with stochastic demands [
28], and recently, reference [
29] constructed a Bayesian framework for the same application of contamination source localization but with mobile sensor data. Additionally, a Gaussian surrogate model was implemented with a collaborative based algorithm [
30] specifically for the contamination source identification problem.
Recently, machine learning methods have been successfully applied to a wide variety of problems in environmental engineering. A Long Short-Term Memory (LSTM) Neural Network was used for the problem of flood forecasting with rainfall and discharge as input data [
31]. Additionally, Artificial Neural Networks (ANN) and Random Forests (RF) were coupled to identify chemical leaks using data obtained from monitoring [
32]. Similarly to air quality prediction, the field of groundwater flow modeling has also been actively adopting machine learning methods. A Convolutional Neural Network (CNN) coupled with a Markov Chain Monte Carlo (MCMC) method has been used to identify contaminant sources in groundwater flow [
33].
Alternatively, it is possible to use machine learning algorithms for contamination source detection in water supply networks. An Artificial Neural Network (ANN) was trained to detect the source of E. coli pollution in a small pipe network [
34]. Potential sub-zones of contamination source nodes have been predicted using a learning vector quantization neural network (LVQNN) for larger water supply pipe networks [
35]. Recently, a CNN has been used for the contamination source detection problem [
36]. The CNN was trained on water supply network user complaints rather than the usual water quality sensor recordings. Additionally, the CNN was found to perform better than a basic ANN. Recent work also includes a machine learning-based framework designed specifically for high performance systems [
37]. The algorithmic framework uses ANNs for tournament-style classification of potential contamination event source nodes and the Random Forest (RF) machine learning algorithm for regression analysis, which predicts the contamination start and end times and the injected contaminant concentrations.
Previously, Decision Trees (DT) were utilized for water network contamination source area isolation [
38] and more recently, the RF algorithm has also been successfully utilized for potential water supply network contamination source node identification [
39] and for determining the number of contamination sources in a water distribution network [
40]. The RF algorithm was trained on Monte Carlo (MC) generated data, with the sensor water quality readings over a time interval as the input and the true source nodes as the output. RF models were also trained with simulation data for the purpose of contamination source detection in river systems [
41].
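The Monte Carlo generation of training data mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `toy_sensor_readings` is a hypothetical stand-in for a simulator run (not EPANET), and the sampling ranges, the network size and the function names are invented for the example. The resulting feature rows and labels are in the shape that a classifier such as scikit-learn's `RandomForestClassifier` accepts in `fit(X, y)`; the fitting step itself is not shown.

```python
import random


def toy_sensor_readings(node, start, end, concentration, n_sensors=3, n_steps=12):
    """Toy stand-in for one simulator run (NOT EPANET); returns a flat feature list."""
    readings = []
    for s in range(n_sensors):
        distance = abs(node - 2 * s)  # hypothetical network layout
        for t in range(n_steps):
            active = start + distance <= t <= end + distance
            readings.append(concentration / (1 + distance) if active else 0.0)
    return readings


def generate_training_set(n_scenarios, n_nodes=6, n_steps=12, seed=0):
    """Sample random contamination scenarios and record (features, label) pairs."""
    rng = random.Random(seed)  # seeded for reproducibility
    features, labels = [], []
    for _ in range(n_scenarios):
        node = rng.randrange(n_nodes)                # true source node (label)
        start = rng.randrange(0, n_steps // 2)       # injection start step
        end = rng.randrange(start, n_steps // 2)     # injection end step >= start
        concentration = rng.uniform(10.0, 100.0)     # injected concentration
        features.append(
            toy_sensor_readings(node, start, end, concentration, n_steps=n_steps)
        )
        labels.append(node)
    return features, labels


# X: one row of sensor readings per scenario, y: the true source node.
X, y = generate_training_set(200)
```

Each row of `X` holds the readings of all sensors over the simulated interval, and the matching entry of `y` is the node at which the contaminant was injected.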
The coupling of machine learning and simulation-optimization has also been employed for predicting groundwater pollution sources and pollution characteristics. The non-dominated sorting genetic algorithm II (NSGA-II) has been coupled with both Probabilistic Support Vector Machines (PSVM) and Probabilistic Neural Networks (PNN) to characterize an unknown pollution source in groundwater resource systems [
42].
In this work, a novel methodology for predicting the water supply network contamination event is presented and investigated. Two algorithmic frameworks are constructed which are based on the methodology. Both frameworks utilize a machine learning approach based on the RF algorithm (implemented in the Python machine learning module scikit-learn 0.21.3 [
43]) for potential contamination source search space reduction (as presented in our previous work [
39]). The first framework couples the simulation-optimization procedure directly with the RF classifier in order to determine the contamination start time, end time and injected contaminant concentration separately for each node predicted by the RF model. For this framework, three stochastic optimization algorithms were investigated on one water distribution network benchmark: Particle Swarm Optimization (PSO) and the fireworks algorithm (FWA), both implemented in the swarm optimization Python module indago 0.1.2 [
44], and genetic algorithms (GA) implemented in the multiobjective optimization Python module pymoo 0.4.2 [
45]. The optimization algorithms were fine-tuned, and the best performing one was further investigated with the coupling framework on both benchmark networks. The second algorithmic framework differs slightly in that it includes an additional RF regression model for each potential source node predicted by the RF classifier, in order to predict that node’s start time, end time and injected contaminant concentration. The values predicted by the RF regression for each potential source node are then used as the initial point for the deterministic global search optimization algorithm Mesh Adaptive Direct Search (MADS), which is implemented in NOMAD 4.0 [
46].
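The idea of starting a direct search from regression-predicted values can be sketched as below. The sketch rests on several assumptions: `simulate` is a toy single-sensor stand-in (not EPANET), `regression_guess` is a hypothetical regression output, and a basic compass search, a much simpler member of the direct search family to which MADS belongs, is used in place of NOMAD.

```python
def simulate(start, end, concentration, n_steps=10):
    """Toy stand-in for a simulator run (NOT EPANET): one sensor whose reading
    in each time step is proportional to how long the injection overlaps it."""
    return [concentration * max(0.0, min(t + 1, end) - max(t, start))
            for t in range(n_steps)]


def misfit(x, observed):
    """Sum of squared differences for x = (start, end, concentration)."""
    simulated = simulate(*x)
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))


def compass_search(f, x0, steps, shrink=0.5, min_step=1e-6, max_sweeps=500):
    """Basic compass (coordinate) search: poll +/- step along each axis, keep
    any improvement, and shrink all steps after a sweep without improvement."""
    x, fx = list(x0), f(x0)
    steps = list(steps)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(x)):
            for direction in (1.0, -1.0):
                trial = list(x)
                trial[i] += direction * steps[i]
                f_trial = f(trial)
                if f_trial < fx:  # accept only strict improvements
                    x, fx, improved = trial, f_trial, True
                    break
        if not improved:
            steps = [s * shrink for s in steps]
            if max(steps) < min_step:
                break
    return x, fx


true_params = (2.3, 6.7, 50.0)        # start, end, concentration
observed = simulate(*true_params)     # synthetic "sensor recordings"

regression_guess = (2.0, 7.0, 45.0)   # hypothetical regression prediction
initial_misfit = misfit(regression_guess, observed)
refined, refined_misfit = compass_search(
    lambda x: misfit(x, observed), regression_guess, steps=(0.5, 0.5, 5.0)
)
```

Because the search accepts only improving points, the refined misfit can never exceed that of the starting point, which is why a good initial guess from the regression model matters for a search of this kind.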
The EPANET2 [
3] hydraulic and water quality simulator is used for water supply network contamination event simulations. EPANET2 simulates contaminant transport using a simplified complete-mixing advection model, which in most cases is not accurate enough, as previously shown by [
47]. However, for the purposes of examining the algorithm proposed in this study, the simplified EPANET2 complete-mixing model is sufficient, as the procedure does not depend on how accurately the mixing processes in the water distribution network are modeled. Monte Carlo simulations are run to generate the data for training the RF classification model (as described in [
39]), with the sensor water quality measurements as the input features and the true source node as the output. The RF classifier then predicts the most probable contamination source nodes, which are passed either to the stochastic simulation-optimization procedure (first framework) or to the RF regression (second framework, trained with previously generated MC EPANET2 data), which predicts their start and end times and injected contaminant concentration. Both algorithmic frameworks are applied to two water supply benchmark networks. The smaller benchmark network (92 nodes) was investigated with perfect sensor water quality measurements, while the larger one (865 nodes) was investigated with fuzzy sensor water quality measurements.
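The flow of the first framework, from classifier output to a per-node search, can be sketched as follows. All names and values are assumptions for illustration: `simulate` is a toy stand-in (not EPANET), the `probabilities` dictionary is a hypothetical classifier output, and an exhaustive coarse grid replaces the stochastic optimizer (PSO, FWA or GA in the text) purely to keep the sketch deterministic.

```python
from itertools import product


def simulate(node, start, end, concentration, n_sensors=3, n_steps=12):
    """Toy stand-in for one simulator run (NOT EPANET)."""
    readings = []
    for s in range(n_sensors):
        distance = abs(node - 2 * s)  # hypothetical network layout
        for t in range(n_steps):
            active = start + distance <= t <= end + distance
            readings.append(concentration / (1 + distance) if active else 0.0)
    return readings


def misfit(node, params, observed):
    """Sum of squared differences for params = (start, end, concentration)."""
    simulated = simulate(node, *params)
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))


def top_k_nodes(probabilities, k):
    """Keep the k nodes with the highest predicted class probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:k]


def best_event_for_node(node, observed):
    """Coarse grid search over (start, end, concentration) for one fixed node."""
    grid = product(range(0, 6), range(0, 6), (10.0, 30.0, 50.0, 70.0, 90.0))
    feasible = (p for p in grid if p[0] <= p[1])  # start must not exceed end
    best = min(feasible, key=lambda p: misfit(node, p, observed))
    return best, misfit(node, best, observed)


# Synthetic "sensor recordings" from a known event at node 3.
observed = simulate(3, 2, 5, 50.0)

# Hypothetical class probabilities; the top-ranked node is deliberately wrong.
probabilities = {0: 0.05, 1: 0.10, 2: 0.35, 3: 0.25, 4: 0.20, 5: 0.05}

candidate_nodes = top_k_nodes(probabilities, k=3)
results = {node: best_event_for_node(node, observed) for node in candidate_nodes}
best_node = min(results, key=lambda n: results[n][1])
```

The discrete variable is handled by the ranking step, so the search for each retained node runs only over the continuous event parameters; the node whose best event reproduces the recordings most closely is reported as the source.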
4. Conclusions
In this study, two algorithmic frameworks for water distribution network contamination event detection were presented. Both frameworks were tested on a small water distribution benchmark network with 92 potential sources and perfect sensor measurements, and on a larger benchmark network with 865 potential sources and fuzzy sensor measurements, the latter used to examine the robustness of the frameworks.
The first algorithmic framework couples an ML classification model based on the RF algorithm with a stochastic optimization algorithm. After a preliminary analysis and parameter calibration on the smaller benchmark network, the fireworks algorithm proved superior to Particle Swarm Optimization and genetic algorithms, which are the most popular optimization algorithms for the water network contamination source detection problem. The framework with the fireworks algorithm predicts the start time, end time and injected contaminant concentration with good accuracy for both benchmark networks, but lacks robustness in predicting the true source node when the sensor measurements are fuzzy.
The second algorithmic framework adds an ML regression model for each of the potential source nodes generated by the RF classifier. The regression model is trained on data pre-generated by Monte Carlo simulations run in parallel. The framework was coupled with the Mesh Adaptive Direct Search algorithm, which is well suited to this procedure because it requires an initial search point, which in this case is generated by the RF regression model. This framework proved robust and predicts the true source node with good accuracy when the contamination event incorporates fuzzy measurements.
The proposed methodology differs from other methods for contamination source node identification in that it combines the two more general approaches into a single framework. Usually, simulation-optimization methods and data-driven machine learning methods are used separately for the task of contamination source detection. With this approach, the strength of machine learning in identifying the most probable source nodes is coupled with the strength of simulation-optimization in finding the start time, end time and injected contaminant concentration. The proposed methodology is computationally efficient because the machine learning step reduces the search space.
Hydraulic demand uncertainties of the water distribution networks should be included in future studies as they were not investigated with this framework, but as shown in [
39], the RF classifier accuracy is slightly lowered when they are incorporated. In future studies other ML algorithms could be tested for the classification part of both algorithmic frameworks and the regression part of the second framework. Additionally, other optimization algorithms (stochastic and deterministic) could also be incorporated into both algorithmic frameworks and investigated.