Next Article in Journal
Francis Turbine Draft Tube Troubleshooting during Operational Conditions Using CFD Analysis
Next Article in Special Issue
Two-Leak Isolation in Water Distribution Networks Based on k-NN and Linear Discriminant Classifiers
Previous Article in Journal
Real-Time Control Operation Method of Water Diversion Project Based on River Diversion Disturbance
Previous Article in Special Issue
Application Research on Risk Assessment of Municipal Pipeline Network Based on Random Forest Machine Learning Algorithm
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Water Flow Modeling and Forecast in a Water Branch of Mexico City through ARIMA and Transfer Function Models for Anomaly Detection

by
David Barrientos-Torres
,
Erick Axel Martinez-Ríos
,
Sergio A. Navarro-Tuch
*,
Jose Luis Pablos-Hach
and
Rogelio Bustamante-Bello
Tecnologico de Monterrey, School of Engineering and Sciences, Mexico City 14380, Mexico
*
Author to whom correspondence should be addressed.
Water 2023, 15(15), 2792; https://doi.org/10.3390/w15152792
Submission received: 8 July 2023 / Revised: 26 July 2023 / Accepted: 28 July 2023 / Published: 2 August 2023

Abstract

:
Early identification of anomalies (such as leakages or sensor failures) in urban water distribution systems is critical to mitigating water scarcity in cities and is a challenge in water resource management. Several data-driven methods based on machine learning algorithms have been proposed in the literature for leakage detection in urban water distribution systems. Still, most of them are challenging to implement due to their complexity and requirements of vast amounts of reliable data for proper model generation. In addition, the required infrastructure and instrumentation to collect the data needed to train the models could be unaffordable. This paper presents the use and comparison of Autoregressive Integrated Moving Average models and Transfer Function models generated via the Box–Jenkins approach to modeling the water flow in water distribution systems for anomaly detection. The models were fit using water flow data from tanks operating in a branch of the water distribution system of Mexico City. The results showed that both methods helped select the best model type for each variable in the analyzed water branch, with Seasonal ARIMA models achieving a lower mean absolute percentage error than the fitted Transfer Function models. Furthermore, this methodology can be adjusted to different time windows to generate alerts at different rates and does not require a large sample size. The generated anomaly detection models could improve the efficiency of the water distribution system by detecting anomalies such as wrong measurements and water leakages.

Graphical Abstract

1. Introduction

The growing worldwide population, the increasing living standards, the altered water consumption habits, and the spread of irrigated agriculture are the primary causes of the rising global demand for water. Water scarcity has become a danger to the sustainable development of human society [1]. According to the World Urbanization Prospects published by the United Nations in 2018, almost 90% of Mexico’s population is projected to reside in urban areas [2]. Nevertheless, 20 million people in Mexico suffer from severe water scarcity [1]. Even though there is sufficient infrastructure in Mexico, water management could be improved, and the system needs to be appropriately maintained. According to the estimates, the distribution networks lose 40% of their water due to aging pipelines, lengthy periods without sufficient maintenance, poor building and management techniques, and ongoing land subsidence in metropolitan areas [3].
Managing water resources and preventing, identifying, and fixing leaks are essential to reduce city water scarcity. However, this requires an information system. Therefore, real-time monitoring and data collection are crucial to creating trustworthy and practical information systems. Additionally, the data must be accurate and thorough to draw reliable conclusions and models [4]. Nevertheless, data collection is hampered by several structural and physical restrictions, environmental factors, and human errors. Moreover, the complexity and breadth of the modern city’s water system necessitate a significant infrastructure investment for communications, location, and data processing [5]. The above has motivated the development of solutions that help to monitor water distribution networks. Several studies and technologies have been developed for leakage detection in water distribution systems, which can be classified into hardware- and software-based methods. Acoustic monitoring, gas injection, thermography, ground-penetrating radar, and free swimming systems are examples of hardware-based techniques. However, employing these techniques in broad regions can be time-consuming, expensive, and inappropriate for automation or long-term monitoring [6].
Software-based leakage detection techniques can be categorized into model-based and data-driven approaches. Model-based approaches define the link between the variables of the water distribution network in a mathematical model of the water distribution network while considering the network’s physical properties [7]. Model-based leak identification techniques do not require previous network data; instead, the leak diagnosis is performed by comparing the model outputs to the measured variables. Its development, however, could be challenging, confined by the accuracy of the mathematical models, and dependent on accurate model calibration [8].
On the other hand, data-driven approaches create data analysis plans using the network’s historical data as a resource. In contrast to model-based approaches, data-driven methodologies need to know the network’s structural characteristics and historical data. Recently, there has been an interest in employing machine-learning methods because of the robust capacities for pattern recognition and feature identification and the rising development and accessibility of data-collecting technology [8]. For instance, multilayer perceptrons, support vector machines, clustering algorithms, or deep learning algorithms have been proven efficient for solving leak localization problems, as discussed in [6,8].
In this regard, water leakage detection based on machine learning algorithms has been proposed to use different data modalities to train the detection models. These data modalities include flow sensor data, pressure data, vibration data, vibro-acoustic data, acoustic emission data, and satellite data [9,10,11,12,13,14]. However, satellite and acoustic emission data collection may be unaffordable [15]. Besides, using flow sensors and vibration data for water leakage detection requires the installation of several sensors across the water pipeline or its junctions, which limits their use on large water distribution networks. In addition, some studies that have developed water leakage detection systems have used simulation or laboratory tests under controlled conditions without considering the uncertainty to which data may be susceptible in real scenarios [16,17,18,19].
Moreover, the key challenge with applying machine learning techniques is choosing the suitable algorithm, building appropriate feature extractors to learn complicated features, accessing a large amount of data for training the models, and needing efficient signal processing tools [6]. In addition, the black-box nature of deep learning algorithms makes them less interpretable by humans and necessitates specialized computer hardware for their training (e.g., Graphics Processing Units) [20].
Most of the studies that have proposed water leakage detection systems based on data-driven methods either use machine learning techniques (i.e., random forest, support vector machines, Adaboost, XGBoost) or deep learning algorithms (i.e., convolutional neural networks (CNNs)), which often requires high-quality data to train them. However, gathering enough data could be expensive, time-consuming, and unrealistic. Furthermore, black-box classification techniques such as CNNs, random forests, multilayer perceptrons, or XGBoost have many parameters to be adjusted, which limits their interpretability and makes them prone to overfitting [21]. On the other hand, one of the crucial characteristics of Autoregressive Integrated Moving Average (ARIMA) and Transfer Function (TF) models is that they provide a linear estimate of the system to be modeled. Besides, the number of parameters of ARIMA and TF models tends to be lower [22], which is a strong simplification compared to the large number of parameters that non-linear machine learning and deep learning approaches often require [23].
This study presents a methodology for anomaly detection in water distribution systems by employing water flow data and two classical time series modeling techniques, the ARIMA and TF models, which were fit following the Box–Jenkins methodology [24]. This study modeled water flow data from tanks in a primary network branch of the water distribution system in Mexico City. This branch carries a significant volume of water through tanks and supplies the secondary network. Analyzing this branch is crucial due to the substantial water flows and pipeline sizes involved. A leakage occurring in this system would result in more-significant water losses than other public water network systems of Mexico City.
As previously stated, the studies in the literature have performed simulations, laboratory tests, or placed sensors (e.g., flow and vibrations sensors) across the water pipelines to collect the necessary data to develop water-leakage-detection models [10,25,26,27,28]. On the contrary, this study focused on analyzing a branch of the water distribution system of Mexico City, which supplies water to the water pipes, instead of analyzing the water pipes directly through flow or vibration sensors. The above was performed to detect anomalies in the water flow behavior that could indicate the presence of sensor malfunction or water leakages along the analyzed water branch. Thus, the data on the inlet and outlet water flow of the tanks that comprise the analyzed water branch were used to develop the ARIMA and TF models proposed in this work. Such a study has not yet been performed to the authors’ knowledge.
The principal contribution of this work was the use and comparison of the TF and ARIMA models generated through the Box–Jenkins methodology applied for anomaly detection in water flow variables of a water branch of the Mexico City water distribution system, which allowed us to:
  • Adjust the models and forecasts to different time windows of the water flow consumption in Mexico City.
  • Generate anomaly-detection models with incomplete and small datasets by employing the water flow data of a branch of the water distribution system of Mexico City.
  • Generate interpretable and sparse models of water flow for anomaly detection in a branch of the water distribution system of Mexico City.
  • Perform a comparison of the ARIMA and TF models for modeling the water flow behavior of a branch of the water distribution system of Mexico City.
The rest of this paper is structured as follows. Section 2 presents the literature review on water leakage detection based on machine learning algorithms and an analysis of the state-of-the-art. Section 3 presents the case study and describes the data collection process of the flow data of the water distribution branch analyzed in this work. Moreover, the overall methodology is explained, including the theoretical background of the ARIMA and TF models and the model-generating process. In addition, the proposed anomaly-detection methodology for the analyzed water branch, which integrates the best models of both methods, is explained. Section 4 presents the results of the present study, while Section 5 presents the analysis and discussion of the results. Section 6 shows the main limitations and areas of opportunity of this work. Finally, Section 7 presents the conclusions and future work.

2. Literature Review

This section presents an overview of the studies that have proposed methods for water leakage detection based on machine learning. In addition, an analysis and discussion of the gaps in the current state-of-the-art is shown. Table 1 summarizes the literature on recent approaches towards leakage detection in water distribution systems that used machine learning techniques.
Recent studies have suggested using data-driven methodologies to detect water leakage, primarily relying on machine learning algorithms. Islam et al. [29] presented and discussed this trend of using machine learning for water leakage detection. For instance, the study of Moulik et al. [17] proposed to detect water leakages and blockages in water pipelines by processing the vibrations of PCV pipes. Moulik’s study employed three-axis accelerometers to measure the vibration on the PCV pipes produced by water leakages; then, the vibration data were utilized as the input into a k-means clustering technique to perform the detection. Similarly, Choi et al. [30] utilized sound vibration data from water pipes to detect water leakages by employing the magnitude spectra of the sound vibration data to train a 2D CNN. Likewise, Yu et al. [10] employed vibration data collected from piezoelectric accelerometers placed in the water distribution networks of several cities in China for water leakage detection. Yu’s study tested different machine learning algorithms such as support vector machines, decision trees, the SqueezeNet CNN, and K-nearest neighbor, with the SqueezeNet achieving a higher performance when trained with the spectrograms of the Short-Time Fourier Transform of the vibration data.
Fereidooni et al. [9] installed flow sensors in the pipeline network junction to detect water leakages. The flow sensor data were processed using hydraulic equations to generate velocity and head loss features. The trained algorithms were a decision tree, a K-nearest neighbor, a random forest, and a Bayesian network. Satellite data have also been used for water leakage detection. An example of this was presented by Chen et al. [11], who utilized augmented satellite images to detect water leakages in the canal systems in Arizona. The authors employed Landsat 8 satellite images to train a CNN, used as the water-leakage-detection algorithm. Sousa et al. [12] proposed to analyze pressure data measured from pumps in district-metered areas of Stockholm, Sweden. The analyzed area corresponded to a residential area with no water tanks or reservoirs. The detection algorithms involved a comparison of unsupervised learning algorithms, such as k-means clustering, and supervised learning algorithms, such as learning vector quantization algorithms. In [31], it was proposed to detect water leakages by processing acoustic emission signals collected from the water distribution networks of Jiangsu, Zhejiang, and Shanghai. The acoustic emission signals were characterized by computing the main frequency, the spectral roll-off rate, the spectral flatness, and the Mel frequency cepstrum coefficients. Then, the authors trained tree-based algorithms such as decision trees, Adaboost, and random forests, with Adaboost achieving the highest performance. Likewise, Fares et al. [13] utilized acoustic emission signals to detect water leakages in water distribution networks. Fares’ study utilized time and frequency domain features to represent the acoustic emission signals and used them as the input to train a support vector machine, an artificial neural network, and deep learning algorithms.
Furthermore, Xue et al. [18] introduced a leakage-fault-detection approach using a hydraulic simulation model encompassing all potential leakage faults. Subsequently, XGBoost was trained, and an alert-triggering algorithm generated a leakage signal associated with the specific pipe’s name. Cody and Narasimhan [32] proposed a linear prediction model, specifically an autoregressive moving average model in conjunction with a multivariate Gaussian mixture model to perform semi-supervised leakage detection. This method utilizes data collected with hydrophone sensors and simulated leakages within a water distribution network. Additionally, the authors suggested a coarse-resolution leakage location using the average baseline root mean square of the collected data and a fine location estimation utilizing cross-correlation based on the time series data from linear prediction filter sensors. Taghlabi et al. [19] conducted experiments employing two methods for water leakage detection. Firstly, they simulated artificial leaks using the EPANET code on the MATLAB platform to establish a database of pressure values that describe the network’s behavior when leaks are present. Subsequently, these data were utilized to train a random forest algorithm, enabling it to forecast the rate and location of leaks within the network. Secondly, they simulated artificial leaks by manipulating hydrants in different locations, considering two distinct leak sizes, and comparing results.
Similarly, Pérez-Pérez et al. [16] proposed using artificial neural network (ANN) techniques and online measurements of pressure and flow rate to detect and locate water leaks in pipelines. The friction factor of the pipe was estimated and utilized as an input for computing the leak position. Fabbiano et al. [33] considered that the energy variation transmitted to the pipe walls by the radial component of vibrations induced by fluid turbulence might be related to the flow leak. Hence, Fabbiano measured the radial vibrational status of specific pipes in the network. Finally, Tornyeviadzi et al. [34] proposed a one-dimensional CNN deep autoencoder trained to locate and identify water leaks. This technique uses multivariate time series data to lessen the adverse effects of random noise. The proposed autoencoder’s input data involved flow, pressure, and tank-level data.
Table 1. Recent data-driven approaches for water-leakage-detection technologies.
Table 1. Recent data-driven approaches for water-leakage-detection technologies.
ProjectYearCountryMethodologyResults
Leakage detection in water distribution systems based on time–frequency convolutional neural network [15]2021ChinaA leakage spectrogram was employed to capture the characteristics of leakage signals, and a time–frequency convolutional neural network (TFCNN) model was compared with other classification models across various signal-to-noise ratio (SNR) conditions.The TFCNN model demonstrated superior performance with a mean accuracy of 98% across different SNR conditions. Even at a challenging −10 dB SNR, the mean detection accuracy remained high at 90%.
Water Leakage Detection in Hilly Region PVC Pipes using Wireless Sensors and Machine Learning [17]2020TaiwanWireless sensors were utilized to capture vibrations in PVC pipes during water flow. Machine learning algorithms were applied to these vibration records to identify any disruptions in the regular water flow caused by leakage or blockage.Analysis of vibration records with the help of K-means algorithm to
determine the water level and the leakages, if any.
Application of CNN Models to Detect and Classify Leakages in Water Pipelines Using Magnitude Spectra of Vibration Sound [30]2023KoreaCNN model for water leakage detection and classification using sound vibration data from sensors in water pipes.The proposed CNN model achieved an F1-score of 94.82% and a Matthew’s correlation coefficient of 94.47%.
Leak detection in water distribution systems by classifying vibration signals [10]2023ChinaSupport vector machine (SVM), decision tree (DT), and K-nearest neighbor (KNN) for leak detection models using signal data from piezoelectric accelerometers in Chinese water distribution systems (WDSs).SqueezeNet performed best with 95.15% accuracy in leak identification, while KNN excelled among the three classifiers with superior sensitivity and 88.17% accuracy.
A hybrid model-based method for leak detection in large scale water distribution networks [9]2021NetherlandsInfluential leak detection features using hydraulic equations (Hazen–Williams, Darcy–Weisbach, and pressure drop) and decision tree, KNN, random forest, and Bayesian network used to locate leaks and determine their pressure based on pipeline topology.Of the models, 80.5% consistently achieved results above 92% in all scenarios. The Naïve Bayesian Model performed the best overall, with a top result of 85.81%.
Augmenting a deep-learning algorithm with canal inspection knowledge for reliable water leak detection from multispectral satellite images [11]2020USAA deep learning approach, combined with canal inspection knowledge, enabled automated and reliable water leak detection of canal sections using Landsat 8 satellite images.The proposed approach can achieve recall at 86%, precision at 86%, and accuracy at 85%.
Leakage detection in water distribution networks using machine-learning strategies [12]2023SwedenAnalyzed pressure measurements from pumps in district-metered areas (DMAs) using unsupervised learning (K-means and cluster validation techniques) and supervised learning (learning vector quantization algorithms).The proposed learning strategies are able to obtain correct classification rates up to 93.98%.
A Tree-Based Machine Learning Method for Pipeline Leakage Detection [31]2022ChinaDistinctive features such as main frequency, spectral roll-off rate, spectral flatness, and 1D Mel frequency cepstrum coefficient (MFCC) using random forest and Adaboost models. The Adaboost model had the lowest false positive rate of 7.35%. The recall rates of the random forest and Adaboost models were 100% and 99.52%.
Leak detection in real water distribution networks based on acoustic emission and machine learning [13]2022ChinaAcoustic signals in time and frequency domains were used to develop leak-detection models, employing SVM, ANN, and deep learning (DL) techniques.Demonstrated a largely stable performance and a high accuracy, particularly for new unlabeled cases.
Machine learning-based leakage fault detection for district heating networks [18]2020ChinaHydraulic simulation model and an XGBoost-based model85.85% of mean accuracy.
Field implementation of linear prediction for leak-monitoring in water distribution networks [32]2020CanadaLinear prediction model for semi-supervised leak detection.A detection accuracy in most cases of over 70%.
Prelocalization and leak detection in drinking water distribution networks using modeling-based algorithms [19]2021MoroccoA simulation of artificial leaks and a random forest machine learning algorithmLeak position identified within a 100 m radius.
Leak diagnosis in pipelines using a combined artificial neural network approach [16]2021MexicoANN techniques and online measurements of pressure and flow rate measurementsAn average error of 0.629% for leak location.
Smart water grid: A smart methodology to detect leaks in water distribution networks [33]2020ItalyMeasuring the radial vibrational status of opportune pipes of the networkRadial vibration signals are linearly dependent only on the flow rate variations due to the leakages.
Leakage detection in water distribution networks via 1D CNN deep autoencoder for multivariate SCADA data [34]2023NorwayA one-dimensional convolutional neural network deep autoencoder (AE) using multivariate time series dataIdentified 16 of the 19 leaks in 2019.
From this literature review, it is possible to observe that machine learning techniques have already been used extensively to perform water leakage detection. To a lesser extent, satellite data have been employed. Nevertheless, satellite data could be difficult to collect and label and may not be useful for detecting leakages inside the water pipelines. On the other hand, sound vibration data may be unaffordable due to the need for specific hardware to sample the vibro-acoustic signals. In the case of vibration data collected from accelerometers, it is necessary to install multiple sensors across the water pipelines, which can be costly and require extensive maintenance. Hence, analyzing the flow behavior of the water network can be considered a cost-effective solution since flow data are frequently monitored in water distribution systems. Nevertheless, similar to using accelerometers placed along the water pipeline to measure the pipe vibration, it is necessary to install multiple flow sensors along the water pipeline.
Moreover, in some of the reviewed works, the leakage-detection algorithm was developed in laboratory conditions, such as the studies of Pérez-Pérez et al. [16], Moulik et al. [17], and Taghlabi et al. [19]. Nevertheless, as mentioned by Shen et al. [31], on-site leakage signals have greater interference and randomness than leakage signals in a laboratory. Hence, there is an opportunity to analyze flow sensor data sampled from real water distribution systems and develop algorithms that can tackle the uncertainty to which the data are susceptible when developing models for water leakage detection. Furthermore, most of the related works focused on detecting water leakages by directly measuring the vibration or flow sensor data from the water pipeline. Nevertheless, analyzing the sensor data along the complex water pipelines could be inefficient and costly.
In the case of the machine learning techniques that have been used to develop the leakage detection models, it can be appreciated that deep neural networks have been extensively used, mainly variants of CNNs [15,30]. Even so, CNNs required a sizable sample size to avoid overfitting and a lack of interpretability due to the complexity that this type of technique often requires. Other techniques frequently used in water leakage detection are non-linear classification techniques such as decision trees, random forests, support vector machines, Adaboost, and XGBoost [35]. However, these non-linear classification techniques, similar to CNNs, require a large sample size to avoid overfitting and are less interpretable than linear machine learning techniques [36].
Considering the above, there is an opportunity to develop techniques for detecting water leakage from other locations besides measuring water flow data directly from the water pipelines. Furthermore, the gathering of data and required flow sensors could be reduced if the water branch that delivers water to the water pipelines is analyzed, rather than directly measuring the flow or vibration in the water distribution pipelines. Finally, linear machine learning techniques such as the TF and ARIMA models could be tested to avoid using non-linear classification techniques, frequently employed in the literature, as presented in Table 1.

3. Materials and Methods

3.1. Case Study and Dataset

The case study examined in this work consisted of six tanks from the Mexico City water distribution system situated in the Álvaro Obregón delegation and connected by 48 in-diameter pipelines. Figure 1 shows a general schematic of the primary water distribution network in Mexico City, where the main two sources of water (Cutzamala System and Lerma System) feed several branches interconnected in cascade and fed by gravity. This study analyzed the data from Branch C of Figure 1. The branch presented in Figure 1 is instrumented to measure the water flow that is input and output to each water tank. The sensors used to sample the data were ISOMAG electromagnetic flow meters. Moreover, it is essential to highlight that the water distribution system of Mexico City is not instrumented in the sub-branch of the water pipelines that serve to supply water to the users. Due to the above, the case study was limited to analyzing the input and output of each water tank of the analyzed branch to detect anomalies and their behavior that could indicate the presence of leakages or measurement errors in the flow sensors.
The tanks are fed by gravity and connected by a leading pipeline, which separates them by 0.5 to 2 km. The tanks are located in Álvaro Obregón delegation in Mexico City, and each tank also supplies water to the local surrounding areas. Between Tank 1 and Tank 5 exists a difference of 200 m of elevation. Figure 2 shows the geographic location of the water branch analyzed in this study. The blue squares represent the tanks of Branch C presented in Figure 1. The blue line in Figure 2 indicates the connection between the tanks across Mexico City.
The water flow rate of six tanks was measured. A total of 11 water flow variables related to each tank’s input and output flow with their corresponding timestamp were recorded every 15 min (i.e., the water flow measurements had a sampling rate of 15 min). The data utilized in this study were collected during the final two weeks of August 2020. The first week was the sampling period, during which the model was developed using the available data. The second week was then designated as the forecasting period, where the model’s performance was evaluated by generating forecasts based on the learned patterns from the previous week. Each tank within the system is equipped with flow rate measurements from electromagnetic flowmeters at its entry and one or two of its exits, allowing for monitoring flow distribution within the main pipeline in liters per second (lps). Furthermore, it is important to note that certain flow rates designated for local consumption or exit flows of the tank are unavailable due to a lack of instrumentation.
A schematic representation of the water distribution branch analyzed in this work is presented in Figure 3, where the blue lines represent the main pipeline and connection with the next tank, and the gray lines correspond to the exit to the local network of the region. Furthermore, Figure 3 shows an example of the average water consumed in a week in percentage; therefore, for the first tank, all the input water corresponds to 100%, while in Tank 5, the water to the next stage of the network corresponds to 32%; this means that the region of this system consumed 68% of the total input water during the analyzed period.
Table 2 shows the variables and summary statistics from the period of water flow analyzed in this study for the tanks shown in Figure 1, Figure 2 and Figure 3. The summary statistics are minimum, maximum, and mean flow rate presented in lps and the percentage of not available (NA) or empty observations, taking as the total the entire period of each variable every 15 min. The analysis and models presented in this study were implemented in RStudio Version 2022.02.3 + 492 using R Version 4.2.0 on a 64 bit Windows 7 PC with 12 GB of RAM and an AMD A10-5800K processor.
The first step involved identifying missing values in the raw time series flow sensor data, as seen in Figure 4. This process was repeated for each water tank variable presented in Table 2. Figure 4 illustrates the Tank 4 Inflow variable time series. The second step consisted of filling in the original raw time series data missing values, as seen in Figure 5. The average of the neighborhood values around the missing data points in the time series was computed to fill in the missing values, using observations from an equal number of data points on both sides of a central missing value. The process presented in Figure 5 was repeated for each water tank variable in Table 2; the same Figure illustrates the filling of missing values for the Tank 4 Inflow variable.

3.2. Autoregressive Integrated Moving Average and Seasonal Autoregressive Integrated Moving Average

The overall methods used to model the water flow variables of the water distribution branch analyzed in this study are illustrated in Figure 6. Initially, measurements of each water tank were collected, and then, the data were pre-processed to prepare them for modeling. The best model for each variable was used to forecast over two different periods: one day and one week. In the case of the TF models, time-ahead data were also included as an input variable for estimating the forecasts. Further details on this process are explained in subsequent sections of this research work. The theoretical background of the ARIMA models is described below.
ARIMA models are composed of a dependent variable Y t , which depends on past values Y and an error term e t . Besides, these models are characterized by three elements: a moving average component, an autoregressive component, and a differencing (integration) component. The autoregressive component indicates that Y t depends on one or multiple lagged values of Y t . The moving average component shows that Y t depends on one or multiple lagged values of the error e t . Finally, the integration or differencing component indicates that the series should be stationary; computing the difference between neighboring observations in the time series accomplishes the above [37].
The notation ARIMA(p,d,q) represents the order of the ARIMA models, where p is the order of the autoregressive component, d is the order of the differencing (integration) component, and q is the order of the moving-average process [38]. The backshift operator ( B ) can be used to define an ARIMA model as follows:
ϕ p ( B ) ( 1 B ) d Y t = θ q ( B ) e t
where Y t is the value of the series observed at time t; B is the backshift operator; ϕ are the autoregressive polynomials; θ is the moving average polynomial; e t are the error terms of the model. The error terms were assumed to be independent and identically distributed with a normal distribution and zero mean [24].
However, the Seasonal Autoregressive Integrated Moving Average was considered to model the seasonal component of the time series. In this regard, Seasonal ARIMA models were selected since, as shown in Figure 4 and Figure 5, the water flow time series have a seasonal component (i.e., the series exhibits a regular fluctuation), which appears every 96 observations, corresponding to a day of water flow measurements. This seasonal term makes the water flow time series nonstationary; therefore, to consider the seasonal component of the time series and to fit an ARIMA model, the seasonal component needs to be considered for the models [24].
Seasonality implies that Y t depends on lagged values of Y t at a regular interval s . Seasonal ARIMA models consider the non-Seasonal ARIMA(p,d,q) and three additional parameters labeled as ( P , D , Q ) m to account for the seasonality presented in a time series. The m term refers to the number of time steps corresponding to a single seasonal period. On the other hand, the term P represents the order of the seasonal autoregressive component; the term Q refers to the seasonal moving average component; the term D represents the seasonal differencing component [38]. The mathematical representation of the Seasonal ARIMA models is shown in the next expression:
Φ P ( B m ) Φ P ( B ) ( 1 B m ) D ( 1 B ) d Y t = Θ Q ( B m ) θ q ( B ) w t
where Y t is a seasonal time series; w t   is the Gaussian white noise process; Φ P ( B ) is the non-seasonal autoregressive polynomials; θ q ( B ) represents the non-seasonal moving average polynomial. d is the non-seasonal differencing term. D is the seasonal differencing term. One key aspect is that, when D = 1 , this is sufficient to ensure stationarity in the time series. Φ P ( B m ) represents a seasonal autoregressive polynomial; the term Θ Q ( B m ) is a seasonal moving average polynomial. Finally, B is the backshift operator [38].
In general, the optimal ARIMA model parameters are determined by considering three criteria: (a) using Akaike’s information criterion (AIC); (b) examining the auto-correlation function (ACF) to determine the q parameter of the ARIMA model and the number of moving average (MA) coefficients and computing the partial auto-correlation function (PACF) of the residuals to determine the p parameter for the number of autoregressive coefficients; (c) by plotting the series residuals to confirm that the error term is equivalent to white noise. The following sections describe the definitions and procedures to compute the ACF, PACF, and AIC in more detail.

3.3. Auto-Correlation Function and Partial Auto-Correlation Function

Auto-correlation can be defined as the degree of similarity of a time series with a lagged version of itself. Furthermore, the plot of a time series’ auto-correlations against lags is known as the auto-correlation plot. Thus, the so-called ACF shows the linear relationship between the observation y t at time t and the observation at a previous time ( y t k ) that is separated by k lags at time [38]. Taking the above into account, the mathematical representation of the ACF for a time series y t is shown in the expression below:
A C F ( y t , y t k ) = C o v a r i a n c e ( y t , y t k ) v a r i a n c e ( y t )
where k is the lag, and it is defined as the difference in time between the observation y t and the observation y t k . The term A C F ( y t , y t k ) denotes the correlation between the observations y t and y t k that are separated by k periods. The ACF serves to know the order of the moving average component of an ARIMA model. Moreover, the ACF also allows analyzing the periodicity and detecting recurrence in a time series [24].
On the other hand, the so-called partial auto-correlation or conditional correlation removes the intermediate observations when computing the correlation between two observations at different lags. In this case, the PACF is conditional on the intermediate observation of the time series, since they are taken out from the covariance computation. For instance, consider the PACF of two observations y t and y t k (i.e., k = 2 ) [38]. The above can be expressed as shown below:
P A C F ( y t , y t 2 ) = c o v a r i a n c e ( y t , y t 2 | y t 1 ) v a r i a n c e ( y t | y t 1 ) v a r i a n c e ( y t 2 | y t 1 )
where the term P A C F ( y t , y t 2 ) is the PACF between the observations y t and y y 2 . Notice that the covariance between y t and y y 2 and the variance of y t and y y 2 are conditional on the intermediate observation y t 1 since the PACF removes the effect of the intermediate observations [38]. The computation of the PACF serves to know the order of the autoregressive component of an ARIMA model.
Figure 7 illustrates the Tank 4 Inflow ACF and PACF, from which it can be inferred that the series is not stationary. It is important to mention that the auto-correlation and partial auto-correlation functions are dimensionless; the above implies that they are independent of the scale of measurement of the analyzed time series [24].
Given that the observations were taken every 15 min and matched up with the earlier visual inspection of the residuals, it was determined that the series becomes stationary by differencing at a lag of 96. This corresponds to the 96 observations in a single day. Therefore, according to the ACF and residuals of the ARIMA(1,0,0)(1,1,1)(96) model of Tank 4 Inflow presented in Figure 8, the model is adequate, since the residuals follow a normal distribution. Similarly, the remaining water flow variables of the system depicted in Table 2 were processed, and the ACF, PACF, and residual analyses were repeated.
As previously stated, the ACF and PACF correlogram analysis is required to determine the components of an ARIMA model. If the time series is stationary or not, it can be determined by looking at the residual plots. After a few auto-correlations, the ACF for a stationary time series will zero out. However, the ACF for nonstationary time series will decline slowly or increase positively [24]. Following multiple iterations and the initial analysis, some models were suggested as the best. After identifying the best ARIMA models for the series, the best model was chosen by comparing its residuals and information criteria.
The AIC, mean absolute percentage error (MAPE), and root-mean-squared error (RMSE) criteria were used to measure the performance of each model. The residuals were then examined for the model’s diagnosis, and if the model was satisfactory, it could be used to forecast; otherwise, additional models must be tested [24]. Figure 9 shows the methodology to generate an ARIMA model via the Box–Jenkins approach. The following section describes the metrics utilized in this study in further detail.

3.4. Evaluation Metrics

The fitting procedure of the resulting ARIMA models was assessed with the aid of the AIC, as shown in Figure 9. The information criterion measures the model’s ability to explain the relationship between the variables. A common criterion is to compute the AIC, which is an information criterion that enables the assessment of the quality of the models by rewarding those with minor errors while penalizing those with too many parameters [38]. Thus, this criterion allows the selection of sparse models [39]. The mathematical representation of the AIC is shown in the following expression:
A I C = 2 l o g L ( θ ^ ) + 2 K
where l o g L ( θ ^ ) represents the likelihood function and K is the total number of parameters of the model. A lower value of the AIC represents a better model with a higher likelihood value. Compared to other metrics, such as the Bayesian information criterion, the AIC value provides a greater penalty on the number of parameters.
On the other hand, for this work, the MAPE and RMSE were used to measure the error and to have a numerical comparison of the effectiveness of the proposed models after selecting the best through the AIC. The RMSE and MAPE were used to compare the accuracy of the model’s forecast to the actual values, with a lower value indicating a better fit [24,38,40]. The equations of these indicators are shown as follows:
M A P E = 1 N f i = 1 N f | y i y ^ i | y i × 100 %
R M S E = 1 N f i = 1 N f ( y i y ^ i ) 2
where y i is the observed value and y ^ i the predicted value at time i ; N f is the number of forecast time steps.
A week was chosen to evaluate the Seasonal ARIMA models. The models were tested within different time frames and assessed on various dates. Two data transformations were considered to transform a nonstationary time series into a stationary series and use the Box–Jenkins methodology: first, differencing, and second, differencing with a transformation using the natural logarithm. By calculating the difference between two consecutive observations, differencing makes a nonstationary time series stationary. The time series’ variance can be stabilized using the natural logarithm. Some preliminary models that follow the patterns and methodology of Box–Jenkins can be provided after comparing and analyzing the resulting ACF and PACF of the water flow time series. The AIC was calculated for each fitted model to choose the optimal [41].
Moreover, to evaluate the forecast of each model, one day (equivalent to 96 observations) and one week (corresponding to 672 observations) were examined, with a confidence interval of 95%. Figure 10 shows an example of the forecast of one week in the future for the Seasonal ARIMA models for the Tank 4 Inflow time series. The forecast’s confidence interval was calculated at 95% by obtaining the standard errors of the estimates as described in the Box–Jenkins methodology [24].

3.5. Transfer Function Models

This section presents the theoretical background and methodology for developing TF models based on the Box–Jenkins approach. TFs are models that combine a causal approach and a time series approach. The time series X t affects the time series Y t through a TF, which spreads the impact X t via some period in the future. The resultant TF model connects the output series ( Y t ), the input series ( X t ), and a noise term (Nt). The addition of a noise term is considered since, in practice, the response of a system could be affected by disturbances and noise induced by the environment, which corrupts the system’s output by an amount N t . Hence, a TF is equivalent to a response function. The mathematical representation of TF models can be written in terms of the backward operator B , as shown in Equation (8) [24].
Y t = δ 1 ( B )   ω ( B ) B b X t + N t
where Y t is the output of the system at time t; X t is the input of the system at time t; B is the backshift operator; ω ( B ) is an s th -order polynomial operator; δ 1 (B) is an r th -order polynomial operator; B b is a b th order dead time operator, which indicates the number of periods before any effect is discernible; finally, N t is the amount of noise to which the system is susceptible. The terms (b, s, r) are integers greater than or equal to zero. The term ω ( B ) controls the effect of current and previous input values in the system’s response. On the other hand, the term δ 1 (B) controls the effect of previous output values in the system’s response [42].
A TF estimation of the system based on the Box–Jenkins methodology was developed, and it was motivated by the correlation between the system variables displayed in Figure 3, specifically the input and output flow of each water tank. Figure 11 illustrates the overall procedure for predicting using TFs, based on the Box–Jenkins approach for fitting, and validating TF models. The definition of the input and output ARIMA models that were used, the prewhitening of both series (i.e., the method of removing the impact of serial correlation on trend analysis), and the calculation of cross-correlation for the identification of the pre-estimates and final parameters of the model or tentative models are the key steps in this process. The models’ diagnostics were then determined, and if the model was sufficient, it could be utilized for the forecast; if not, a different model should be suggested [24].
The dependent (output) and independent (input) variables were modeled via ARIMA models. The ARIMA models used for the TF were the same as those developed for each water flow variable, as explained in Section 3.2. Then, the input and output series generated by the fitted ARIMA models were prewhitened. Consequently, the series were cross-correlated to find the relationship between the lags, or the effect of X t over Y t . The cross-correlation function is represented as shown in Equation (9) [24].
ρ x y ( k ) = γ x y ( k ) σ x σ y
where ρ x y ( k ) is the cross-correlation function of a stationary bivariate process; γ x y ( k ) is the cross-covariance coefficients between the series x t and y t at lags k = ± 0 , ± 1 , ± 2 , ; σ x is the standard deviation of the x series; σ y is the standard deviation of the y series. The cross-covariance function γ x y ( k ) of a stationary bivariate process is defined as shown in Equation (10) for lags k = ± 0 , ± 1 , ± 2 , .
γ x y ( k ) = E [ ( x t k μ x ) ( y t μ y ) ] =   E [ ( y t μ y ) [ ( x t k μ x ) ] = γ y x ( k )  
The importance of the cross-correlation function of the prewhitening input and output series is that it provides an estimate of the impulse response of the system. This impulse response estimate serves to know the order of the s and r polynomials, as well as the order of the dead time operator ( B b ) that should be used to fit the TF model. Similar to other areas such as signal processing and system analysis, the impulse response is used for the graphical or mathematical representation of the output of a system or a model in response to a brief input signal or impulse. The impulse response provides valuable insights into the system’s behavior, including its frequency response, stability, and the effect of the input signal on the output [43].
The plot pattern of the cross-correlation function determines the values of b , r , and s , which, according to the Box and Jenkins [24] notation, are the parameters for a TF (b, s, r) model. The parameters b and s determine the number of lagged terms of x that entered into the TF. The value of b is determined by the first lag significantly different from zero in the cross-correlation plot. The s term is established by how long x influences y after the first significant lag. The r value represents how long the output series ( y t ) is connected with the prior value of the output series. The value of r can be set by analyzing the plot of auto-correlation or determined by the plot pattern of lag ( b + s ) ; if it has an exponential decay, then r = 1 could provide an appropriate approximation of the TF, and if it has a sine wave plot pattern, then r = 2 could provide an approximation of the TF [24].
The input Tank 1 Inflow and the Tank 2 Inflow series are cross-correlated in Figure 12, demonstrating that the fifth lag is the most-significant. Nevertheless, Lag 0 is the first latency that deviates sufficiently from zero. It is also clear that the fifth lag (from Lag 0 to Lag 5) is the number of delays between the previous significant lag and the current lag. Finally, the plot development appears to follow a sine wave. Consequently, a TF with parameters (2,3,0) could be a first model proposal.
The fourth step used the fitted TF models to forecast one day and week in the future. The forecast’s confidence interval was also calculated, as can be seen in Figure 13.

3.6. Anomaly Detection in Water Distribution Branches Methodology

After generating and using the models for forecasting, the anomaly detection procedure involved comparing the observed values with the 95% confidence interval of the model’s forecast. This work assumed that an anomaly presented in the measured water flow’s water branch is outside the model forecast’s confidence interval. Figure 14 shows the general methodology for the data evaluation for anomaly detection in water distribution branches.
The forecast values, the confidence interval, and the following day and week observations are needed by the methodology shown in Figure 14 before any other feedback processes can begin. First, it determines whether the observed values for each model are within the confidence interval of the forecast; if they are, it goes back to the forecast stage and compares the subsequent observations with the subsequent prediction. In cases with missing data points, it is plausible that they are due to various factors, such as interruptions in sensing or communication caused by issues with the energy supply at the station, sensor malfunctions, intermittent data transmission problems, or failures in the database. When these data points are missing, the corresponding alert is labeled as “not available” to indicate the absence of data.
On the other hand, if an observation exists and falls outside the confidence interval, it can indicate two main possibilities. Firstly, it could suggest a potential measurement error where the sensor may have malfunctioned and provided an incorrect reading. Alternatively, it may indicate a genuine change in the water system’s behavior, potentially caused by external factors such as water leakages, variations in water demand, hydraulic system issues, water availability, or operational actions. This methodology suggests two key alerts: potential measurement error and potential water leakage to facilitate the detecting and understanding of different types of anomalies. When new observations significantly exceed the confidence interval’s upper limit, it indicates a potential measurement error, which is more likely than a possible water leakage, since having a higher water flow rate than what the source can supply is not feasible. However, it is also possible that the sensor briefly malfunctioned if the alarm is not persistent. Conversely, suppose the new observation of water inflow falls below the confidence interval’s lower limit. In that case, it suggests a potential water loss, indicating the possibility of leaks occurring between the water tanks. This inference is drawn from the observed water inflow being below the expected range, but a misread by the sensor cannot be completely ruled out as a possibility. Incorporating these alerts into the methodology makes identifying and categorizing anomalies easier, leading to improved system operation and early detection of potential issues.
The appearance of the warnings is next examined; if they are persistent and invalidated by the user or another qualified individual, the model must be redesigned because the old model does not account for the new observations. Additionally, the distribution of the series could have been altered, necessitating a return to the model-generation stage.

4. Results

The errors from the Seasonal ARIMA models were calculated to select the best model for each variable. On average, it required 8 to 12 iterations to generate different models and compare the AICs between them to find the best fitted model for each water flow variable. A Seasonal ARIMA model was generated for each water tank’s input and outflows. Table 3 shows the Seasonal ARIMA models selected for each water tank in Figure 2 and the resulting AIC, RMSE, and MAPE.
Then the models were utilized to forecast one day and one week ahead, and the obtained forecasts were compared with the actual observations to calculate the forecasting MAPE and RMSE. The obtained RMSEs and MAPEs that each model obtained for one-day and one-week forecasts are presented in Table 4.
On the other hand, Table 5 presents the AICs, MAPEs, and RMSEs of the best fitted TF models for each variable. Only the possible and correlated water flow variables were used to generate the TF models based on the system presented in Figure 3. In addition, Table 5 shows the order of the TF models’ polynomials and dead time operator of the obtained TF models with the corresponding RMSE and MAPE values computed with the data interval used for generating each model.
Furthermore, Table 6 presents the MAPEs and RMSEs of the one-day and one-week forecasts of the fitted TF models. The models with the lowest MAPE were chosen and utilized in the data-evaluation stage. The asterisk indicates the variable and error values lower than those obtained for the Seasonal ARIMA model.
Based on the results presented in Table 4 and Table 6, the best models for each water tank inflow and outflow were selected based on the MAPE values; these models were used to develop the anomaly-detection methodology presented in Section 3.6. The shorthand notation for the model and the corresponding mathematical models, including the parameters and coefficients, are shown in Table 7.
In the last stage, the new data were assessed, and any anomalies were found using the forecasting values that were produced. First, the top models for each variable were chosen from the previous step. These models match those displayed in Table 7. Then, the models were used to forecast over the short and medium term (i.e., one day and one week). Finally, the limits for assessing whether a new observation was an anomaly were the models forecast upper and lower 95% confidence intervals. Three categories, possible leakage, possible measurement mistake, and not available (NA) datapoint, were used to group the notifications.
Figure 15 presents the forecasting of the fitted models presented in Table 7 for each analyzed water flow variable. The orange line represents the model’s forecast. The blue lines represent the new observations. The gray zone represents each model’s 95% confidence interval. Table 8 presents the alerts generated for each variable, providing information on the final model utilized, the three types of potential alerts, and the total count of alerts. The results shown in Table 8 are based on the methodology presented in Figure 14.

5. Discussion

The AICs, RMSEs, and MAPEs shown in Table 3 were computed by comparing the fitted models with the data observations (one week) used to generate the Seasonal ARIMA models. On the other hand, Table 4 shows the RMSEs and MAPEs obtained by comparing the fitted model with the observations used for forecasting. By comparing both tables, it can be appreciated that the errors were lower when comparing the model with the observations used for generating the models than the error obtained when comparing with the observations used for forecasting. Despite the above, the difference between the fit and prediction errors was low for most models. The smaller error obtained with the fit data compared to the prediction data suggested a slight overfit.
In addition, it can be appreciated that the order of the seasonal components of the fitted models was the same and that all of them required a first-order seasonal difference and moving average component. On the other hand, only the Tank 4 Inflow sensor data modeling required a first-order seasonal autoregressive component. In the case of the non-seasonal part of the fitted ARIMA models, it can be appreciated that certain heterogeneity existed in the order and the components that each of the input and output water flow variables required. The above could be attributed to the difference in the dynamics of the studied water tanks initially presented in Figure 2.
Moreover, based on the results presented in Table 4, it can be observed that the fitted Seasonal ARIMA models had a lower MAPE and RMSE for a one-day forecast than a one-week one. Hence, the models were better for forecasting in short periods. The above was useful to define the remodeling period and forecast for future use of the methodology. The only exception in which RMSE and MAPE were lower for the one-week forecast was the Seasonal ARIMA model fit for the Tank 4 Outflow. The above variable presented for both forecasts’ periods a greater MAPE and RMSE compared to the rest of the water tanks. The new observed values for the next day were very extreme and differed from those used for the model generation. Nevertheless, the new observations were closer to the previous data during the week. This could mean that the sensor of this variable had issues and, in some periods, was not working accurately. Furthermore, Tank 5 Outflow did not contain a MAPE calculation because the nature of the variable data did not allow it; as can be seen in Table 2, the variable had minimum negative flow and positive maximum values, but some of the observed values were zero, so the division by zero in Formula (6) gets undefined.
In the case of the fitted TF models shown in Table 5, it can be appreciated that the order of the polynomials and the dead time operator that provided the lowest AIC value were different for each tank. Moreover, all TF models were influenced by past input values since they all had an s th-order polynomial. However, not all models were influenced by the past values of the output since the r th-order polynomial equaled zero, as in the models for Tanks 4 to 6, as shown in Table 5. On the other hand, by comparing the errors of the TF models presented in Table 5 with the ones shown in Table 6, it can be appreciated that the errors of Table 5 were lower. The above was expected since the errors presented in Table 5 were computed with the same data used to fit the model. Despite the above, the errors were similar. Like the Seasonal ARIMA models, the obtained MAPE and RMSE values of the TF models, when used for forecasting, were generally lower for one-day forecasting than one-week forecasting, except for the first two models (see Table 6). Furthermore, only the Tank 3 Inflow TF model obtained an AIC value lower than its corresponding Seasonal ARIMA models (5536.81 and 5744.72, respectively). The above suggested a lower error while fitting and generating sparse models. However, even though the AIC values of the Seasonal ARIMA models were lower than the TF models, the first could provide an overfitted model when selected through the AIC [44].
Based on the results reported in Table 7, it can be appreciated that the water flow data can be modeled better by a Seasonal ARIMA according to the reported AIC, RMSE, and MAPE values. Moreover, as shown in Table 7, the TF models fitted through the Box–Jenkins methodology were less helpful in modeling the analyzed water branch’s flow than the ARIMA models. In addition, it can be appreciated that, in general, the generated models had a low number of coefficients, which could facilitate a physical implementation of the proposed system since the computation that this type of model requires could be lower due to their low complexity.
Figure 15 shows the forecasting values of each final model presented in Table 7. The graphs present a one-week forecast, with the initial portion of the forecast (from Hour 168 to 172) representing the first forecast day, which served as the basis for the results reported in Table 7. Based on the forecasting values presented in Figure 15, the different dynamical behaviors of each flow variable of the analyzed branch can be appreciated visually. This visual representation enhances the understanding of and facilitates the alert process for users, enabling them to discern patterns and trends more effectively. In addition, it can be observed that the forecast values and observed values (actual measured water flow) were similar for most of the models. Furthermore, it is evident that the 95% confidence interval demonstrated a dynamic behavior across each model and did not remain constant throughout the analyzed forecasting week. In some cases, the gradual increase in the 95% confidence interval size could be attributed to either the absence of the moving average component in the model or the inherent increase in uncertainty as time progressed. These factors contributed to the widening of the confidence interval, indicating the challenges in accurately forecasting values over an extended period. In most cases, the selected model behaved similarly to the actual observations.
Finally, from Table 8, it can be seen that the variables with the worst (bigger) MAPE and RMSE errors generated the most alerts, as was the case for Tank 4 Outflow, Tank 4 Inflow, and Tank 5 Inflow, whose MAPEs were more prominent (more than 10%) from the rest of models and Tank 5 Outflow A had the biggest RMSE in both forecast periods (one day and one week). Tank 1 Inflow alerts could be considered a particular case because there were no alerts in one day, but there were many possible measurement errors in one week. The above could be due to the sensor malfunction over a long period or a change in data behavior, which might require remodeling. Although Figure 15 shows slight anomalies in the long-term behavior of Tank 3 Inflow, Tank 6 Inflow, and Tank 5 Outflow B, they were not detected, presented in Table 8, because they still fell within the 95% confidence interval. Therefore, the detection of alerts depends on the system’s tuning, such as setting a smaller confidence interval or a shorter forecast period (e.g., 1 d, 12 h, etc.). Additionally, more potential measurement errors and not available observations were identified than possible leakages in both forecast periods, indicating the need to verify the calibration and proper functioning of sensors. This information is valuable in justifying investment in equipment maintenance and highlighting the affected areas.
The resulting ARIMA and TF models could be considered less-complex models than the ones produced by other machine learning algorithms proposed in the literature, such as XGBoost [18], ANNs [16], and CNNs [34], due to their reliance on a solid mathematical and statistical background with well-defined interpretations and the lower number of parameters that they have, as shown in Table 7. In addition, contrary to the studies presented in Section 2, Table 7 shows the coefficients and the mathematical representations of the fitted ARIMA and TF models. Moreover, the steps for modeling the ARIMA and TF models are well-defined and based on the specific assumptions of data stationarity, linearity, and the independence of residuals, providing a more-transparent framework and guidance in the modeling process. Finally, the models are suitable for forecasting based on historical patterns of water flow variables, making them a practical tool for anomaly detection in water distribution systems.
A comparison of the methodology presented in this study with the related literature presented initially in Section 2 in terms of the input data, machine learning algorithm, and analyzed system is presented in Table 9. In this regard, it can be observed that heterogeneity existed between the related research and the present study. Authors have proposed to detect leakages from pressure measurements, water flow measurements, acoustic emission, and vibration data; on the other hand, this study based its analysis exclusively on water flow data. Another aspect that can be observed is the frequent use of CNNs to perform leakage detection—other approaches involved tree-based techniques such as decision trees, random forest, XGBoost, and Adaboost. Otherwise, the present study focused on using Seasonal ARIMA and TF models that produce less-complex models than CNNs and tree-based classifiers. Finally, the analyzed systems varied from study to study, with water distribution networks being frequently analyzed. Public datasets and simulation tests of water pipelines have also been considered. In the case of the present work, the water flow data came from a branch of the water distribution system of Mexico City that supplies water to the sub-branch of water pipelines (see Figure 1 and Figure 2).
Nevertheless, it is difficult to perform a homogenous comparison of the results presented in this study with the related research. This is because the systems analyzed to develop the water-leakage-detection algorithms and the data collected varied from study to study. As previously stated, the authors proposed performing water leakage detection through simulation, laboratory tests, and data collected from water distribution systems. Nonetheless, each analyzed system had different data distributions, which impacted the type of algorithms that best describe the data. Furthermore, most of the works in the literature presented in Table 9, Section 2, and the Introduction Section employed supervised learning techniques to train the detection algorithms. On the other hand, this study tackled the problem from an unsupervised point of view since access to labeled data that classify the anomalies in water leakages or measurement errors were not available when developing this work. However, the above points out an area of opportunity that needs to be addressed in future work.

6. Limitations of the Study

This work was constrained by the branch’s existing infrastructure and data availability, limiting it to a single case study focused solely on one operational variable, water flow. In future work, it is proposed to test the methodology in other cases of study and with other operational variables such as pressure, tank level, and more flow points. As shown in Figure 3, some tanks have other water exits, the flow rate of which was not available in this dataset, but are valuable variables that could help to generate a better model and understanding of water usage in the system. The use of additional variables could be performed with the help of multimodal techniques that consider flow time series data and pressure data measured in each of the tanks of the water branch. The above could be assessed through multimodal machine learning techniques such as model-agnostic (i.e., the fusion was carried out before applying the machine learning technique) and model-based (i.e., the fusion of the modalities was performed while generating the model) methods [45].
Furthermore, the methodology could not be validated with real leakages because a report of the actual leakages (detected or repaired) was unavailable when the models and the study were developed. The above implies another limitation, such as the need to develop a physical implementation of the proposed algorithms to validate anomalies and select and design an appropriate hardware platform in which the proposed algorithm can be embedded and executed. In addition, one of the crucial challenges of data-driven models such as ARIMA models is that, if the dynamics of the system changes, the fitted ARIMA models may not work as expected since the distribution of the data used for fitting could have changed. A potential solution to this problem is to fine-tune the models over time to keep their parameters updated in case of a change in the dynamics of the water distribution branch analyzed in this study.
Furthermore, due to the limited and incomplete dataset, this study focused on employing linear models such as the ARIMA and TF models; however, there might be non-linear dynamics in the water distribution systems that the proposed methods could not capture. Hence, deep learning algorithms such as recurrent neural networks or long short-term memory neural networks could be compared with the proposed ARIMA and TF models in terms of performance. However, deep learning solutions often require a sizable sample size to be trained adequately and avoid overfitting problems. The above could be mitigated using transfer learning techniques in combination with deep learning solutions.
In addition, another potential disadvantage of the fitted ARIMA models is the need to filter the time series through differentiating; despite being essential to produce a stationary time series, it can also have certain biases related to the dynamics of the analyzed system since differentiating acts as a high-pass filter on the time series. The above could be mitigated with hybrid techniques combining nonstationary time series techniques such as wavelet analysis and ARIMA models. For instance, Nury et al. [40] proposed a wavelet-ARIMA model for temperature prediction in Bangladesh to account for the nonstationary behavior of the analyzed temperature time series data.
Additionally, due to the complexity of the water branch analyzed in this study, eleven flow measurements (i.e., considering the input and output flows of each tank) led to adjusting eleven models for the case of the fitted ARIMA models. The above is a potential drawback of the proposed anomaly detection system since it requires at least two models to detect anomalies for a single water tank. Thus, validating the proposed models could be a time-consuming task. Moreover, since the methodology used in this study depends extensively on data, its implementation is limited to water flow data availability. Hence, the approach presented in this work could be combined with hardware approaches to reduce data dependency.
Moreover, another approach that could have been developed is to generate a TF of the system shown in Figure 2 by considering the water tank’s dynamics, as in the work of Li et al. [46]. The above could reduce the need for data to generate the TFs. However, certain variables, such as the height of the fluid present in the tanks, the area of the tanks, and the pressure, need to be considered to generate an adequate model that represents the system’s dynamics and, consequently, the water flow behavior. Moreover, dynamical models based on linear differential equations do not consider the disturbances and noise the system is susceptible to. Hence, future work could compare estimating a TF based on the Box–Jenkins methodology and a TF obtained by considering the linear differential equations that describe the system dynamics.
Another opportunity that could be tackled is the need for developing a publicly available dataset from which the proposed water leakage detection models can be compared homogeneously. In the present study, sensor flow data were considered to analyze a water branch of the water distribution systems of Mexico City. However, other studies described in Section 2 used other types of data. They analyzed water distribution systems of other regions whose results cannot be compared to those presented in this work since the data modalities and distributions are different. The analyzed systems differed even among works that performed similar research in Mexico, such as Pérez-Pérez et al. [16] and the present study.

7. Conclusions

This work proposed using Seasonal ARIMA and TF models fit through the Box–Jenkins approach to model the flow data of a branch of the water distribution system of Mexico City for anomaly detection in water distribution branches. The results of this study showed that ARIMA models can describe and forecast the flow variables of the analyzed water branch with low error in terms of the MAPE. The generated TF models can also explain the linear branch system’s relationship between tanks according to the reported RMSE and MAPE values. Still, in most cases, the ARIMA models achieved a higher performance in terms of the MAPE.
The models proposed in this study have the potential to make significant contributions to reducing water losses and improving the efficiency of the distribution system. These improvements were achieved by utilizing the existing instrumentation and infrastructure of Mexico City’s water distribution system, along with a clear and understandable methodology, visually representing anomalies to aid in the alert process for users. These models can facilitate early detection and localization of potential issues, enabling prompt actions and interventions for more-effective water distribution network management. Additionally, by identifying the specific sensor that triggered the alert, the search for potential issues can be narrowed down to a specific zone, enabling faster localization of the failure and more-efficient troubleshooting. The actions for each alert type (leakage or measurement error) depend on whether the water flow variable type is an inflow or outflow. For example, if less water than expected is arriving into a tank, it can be assumed that there is a leakage in the pipeline before the tank’s entry. In such a case, physical inspection would be required.
On the other hand, if more water than expected is arriving, testing the sensor and verifying its calibration are recommended. In the case of an output water flow variable, if less water than expected is outgoing from a tank, it is possible that the leakage occurred inside the tank, such as an overflow or an unauthorized intake. If more water than expected is outgoing from the tank outflow, it is possible that there is a leakage in the pipe ahead, which should be verified.
The variables with greater errors were the ones with the most alarms. Therefore, the corresponding authorities should review and provide maintenance to these variables’ sensors and communications systems. In addition, the generated alerts should be reviewed and validated by the operators responsible for the system to determine if new modeling is required or if the alerts are correct.
The current methodology’s future work will improve it into an integrated support system with close collaboration between water service providers in Mexico City and action-based research. Implementation requires capable personnel with access to tank instrumentation to monitor and validate alerts constantly. Optimizing this methodology involves developing an online monitoring and detection system that reduces false alerts, detects leaks in real-time, and even remodels the system automatically. The platform should also provide dynamic data exchange and customer information for building positive relationships and pro-environmental attitudes.

Author Contributions

Conceptualization, D.B.-T. and R.B.-B.; methodology, D.B.-T. and J.L.P.-H.; software, D.B.-T.; validation, D.B.-T., R.B.-B. and J.L.P.-H.; formal analysis, D.B.-T.; investigation, D.B.-T.; resources, D.B.-T.; data curation, D.B.-T.; writing—original draft preparation, D.B.-T. and E.A.M.-R.; writing—review and editing, D.B.-T., E.A.M.-R. and S.A.N.-T.; visualization, D.B.-T.; supervision, R.B.-B.; project administration, D.B.-T. and R.B.-B.; funding acquisition, R.B.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from Tecnológico de Monterrey and CONACYT (CVU: 1080546).

Data Availability Statement

The data were provided by a telemetry company based in Mexico City named Virtual Wave Control (VWC). The original dataset is private and the company’s property and was allowed to be used only for this work.

Acknowledgments

The authors would like to thank: CONACYT (CVU: 1080546), Tecnológico de Monterrey, VWC, and all the Reviewers for all the support and counsel.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

1D CNNOne-dimensional convolutional neural network
ACFAuto-correlation function
AEAutoencoder
AICAkaike’s information criterion
ANNArtificial neural networks
ARIMAAutoregressive Integrated Moving Average
CNNConvolutional neural network
DLDeep learning
DMAsDistrict-metered areas
DTDecision tree
KNNK-nearest neighbor
lpsLiters per second
MAMoving average
MAPEMean absolute percentage error
MFCCMel frequency cepstrum coefficient
MLMachine learning
NANot available
PACFPartial auto-correlation function
RMSERoot-mean-squared error
SNRSignal-to-noise ratio
SVMSupport vector machine
TFTransfer Function
TFCNNTime–frequency convolutional neural network
WDSsWater distribution systems
XGBoostExtreme gradient boosting

References

  1. Mekonnen, M.M.; Hoekstra, A.Y. Four billion people facing severe water scarcity. Sci. Adv. 2016, 2, e1500323. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. United Nations, Department of Economic and Social Affairs, Population Division. World Urbanization Prospects: The 2018 Revision (ST/ESA/SER.A/420); United Nations: New York, NY, USA, 2019. [Google Scholar]
  3. Aguirre, D.R.; Espinoza, V. El Gran reto del Agua en la Ciudad de México; Sistema de Aguas de La Ciudad de México: Mexico City, Mexico, 2012. [Google Scholar]
  4. Pineda-Pablos, N.; Salazar-Adams, A. Cities and drought in Mexico. Water management as a mitigation critical strategy. Tecnol. Cienc. Agua 2016, 7, 95–113. Available online: https://www.cabdirect.org/cabdirect/abstract/20183068452 (accessed on 29 January 2021).
  5. Ortega-Ballesteros, A.; Iturriaga-Bustos, F.; Perea-Moreno, A.J.; Muñoz-Rodríguez, D. Advanced Pressure Management for Sustainable Leakage Reduction and Service Optimization: A Case Study in Central Chile. Sustainability 2022, 14, 12463. [Google Scholar] [CrossRef]
  6. Mashhadi, N.; Shahrour, I.; Attoue, N.; El Khattabi, J.; Aljer, A. Use of machine learning for leak detection and localization in water distribution systems. Smart Cities 2021, 4, 1293–1315. [Google Scholar] [CrossRef]
  7. Sun, C.; Parellada, B.; Puig, V.; Cembrano, G. Leak localization in water distribution networks using pressure and data-driven classifier approach. Water 2019, 12, 54. [Google Scholar] [CrossRef] [Green Version]
  8. Ares-Milián, M.J.; Quiñones-Grueiro, M.; Verde, C.; Llanes-Santiago, O. A leak zone location approach in water distribution networks combining data-driven and model-based methods. Water 2021, 13, 2924. [Google Scholar] [CrossRef]
  9. Fereidooni, Z.; Tahayori, H.; Bahadori-Jahromi, A. A hybrid model-based method for leak detection in large scale water distribution networks. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 1613–1629. [Google Scholar] [CrossRef]
  10. Yu, T.; Chen, X.; Yan, W.; Xu, Z.; Ye, M. Leak detection in water distribution systems by classifying vibration signals. Mech. Syst. Signal Process. 2023, 185, 109810. [Google Scholar] [CrossRef]
  11. Chen, J.; Tang, P.; Rakstad, T.; Patrick, M.; Zhou, X. Augmenting a deep-learning algorithm with canal inspection knowledge for reliable water leak detection from multispectral satellite images. Adv. Eng. Inform. 2020, 46, 101161. [Google Scholar] [CrossRef]
  12. Sousa, D.P.; Du, R.; Mairton Barros da Silva, J., Jr.; Cavalcante, C.C.; Fischione, C. Leakage detection in water distribution networks using machine-learning strategies. Water Supply 2023, 23, 1115–1126. [Google Scholar] [CrossRef]
  13. Fares, A.; Tijani, I.A.; Rui, Z.; Zayed, T. Leak detection in real water distribution networks based on acoustic emission and machine learning. Environ. Technol. 2022, 1–17. [Google Scholar] [CrossRef]
  14. Bykerk, L.; Valls Miro, J. Detection of Water Leaks in Suburban Distribution Mains with Lift and Shift Vibro-Acoustic Sensors. Vibration 2022, 5, 21. [Google Scholar] [CrossRef]
  15. Guo, G.; Yu, X.; Liu, S.; Ma, Z.; Wu, Y.; Xu, X.; Wang, X.; Smith, K.; Wu, X. Leakage detection in water distribution systems based on time–frequency convolutional neural network. J. Water Resour. Plan. Manag. 2021, 147, 04020101. [Google Scholar] [CrossRef]
  16. Pérez-Pérez, E.D.J.; López-Estrada, F.R.; Valencia-Palomo, G.; Torres, L.; Puig, V.; Mina-Antonio, J.D. Leak diagnosis in pipelines using a combined artificial neural network approach. Control. Eng. Pract. 2021, 107, 104677. [Google Scholar] [CrossRef]
  17. Moulik, S.; Majumdar, S.; Pal, V.; Thakran, Y. Water Leakage Detection in Hilly Region PVC Pipes using Wireless Sensors and Machine Learning. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Taiwan (ICCE-Taiwan), Taoyuan, Taiwan, 28–30 September 2020; pp. 1–2. [Google Scholar] [CrossRef]
  18. Xue, P.; Jiang, Y.; Zhou, Z.; Chen, X.; Fang, X.; Liu, J. Machine learning-based leakage fault detection for district heating networks. Energy Build. 2020, 223, 110161. [Google Scholar] [CrossRef]
  19. Taghlabi, F.; Sour, L.; Agoumi, A. Prelocalization and leak detection in drinking water distribution networks using modeling-based algorithms: A case study for the city of Casablanca (Morocco). Drink. Water Eng. Sci. 2020, 13, 29–41. [Google Scholar] [CrossRef]
  20. Rai, A. Explainable AI: From black box to glass box. J. Acad. Mark. Sci. 2020, 48, 137–141. [Google Scholar] [CrossRef] [Green Version]
  21. Li, H.; Li, J.; Guan, X.; Liang, B.; Lai, Y.; Luo, X. Research on overfitting of deep learning. In Proceedings of the 2019 15th International Conference on Computational Intelligence and Security (CIS), Macao, China, 13–16 December 2019; pp. 78–81. [Google Scholar]
  22. Elsaraiti, M.; Merabet, A. A Comparative Analysis of the ARIMA and LSTM Predictive Models and Their Effectiveness for Predicting Wind Speed. Energies 2021, 14, 6782. [Google Scholar] [CrossRef]
  23. Viccione, G.; Guarnaccia, C.; Mancini, S.; Quartieri, J. On the use of ARIMA models for short-term water tank levels forecasting. Water Supply 2020, 20, 787–799. [Google Scholar] [CrossRef] [Green Version]
  24. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  25. Van der Walt, J.C.; Heyns, P.S.; Wilke, D.N. Pipe network leak detection: Comparison between statistical and machine learning techniques. Urban Water J. 2018, 15, 953–960. [Google Scholar] [CrossRef]
  26. Liu, Y.; Ma, X.; Li, Y.; Tie, Y.; Zhang, Y.; Gao, J. Water Pipeline Leakage Detection Based on Machine Learning and Wireless Sensor Networks. Sensors 2019, 19, 5086. [Google Scholar] [CrossRef] [Green Version]
  27. Quy, T.B.; Kim, J.M. Leakage Detection of Water-Induced Pipelines Using Hybrid Features and Support Vector Machines. In Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing; Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M., Eds.; Springer: Singapore, 2019; Volume 924. [Google Scholar] [CrossRef]
  28. Vrachimis, S.G.; Eliades, D.G.; Taormina, R.; Kapelan, Z.; Ostfeld, A.; Liu, S.; Kyriakou, M.; Pavlou, P.; Qiu, M.; Polycarpou, M.M. Battle of the leakage detection and isolation methods. J. Water Resour. Plan. Manag. 2022, 148, 04022068. [Google Scholar] [CrossRef]
  29. Islam, M.R.; Azam, S.; Shanmugam, B.; Mathur, D. A Review on Current Technologies and Future Direction of Water Leakage Detection in Water Distribution Network. IEEE Access 2022, 10, 107177–107201. [Google Scholar] [CrossRef]
  30. Choi, J.; Im, S. Application of CNN Models to Detect and Classify Leakages in Water Pipelines Using Magnitude Spectra of Vibration Sound. Appl. Sci. 2023, 13, 2845. [Google Scholar] [CrossRef]
  31. Shen, Y.; Cheng, W. A Tree-Based Machine Learning Method for Pipeline Leakage Detection. Water 2022, 14, 2833. [Google Scholar] [CrossRef]
  32. Cody, R.A.; Narasimhan, S. A field implementation of linear prediction for leak-monitoring in water distribution networks. Adv. Eng. Inform. 2020, 45, 101103. [Google Scholar] [CrossRef]
  33. Fabbiano, L.; Vacca, G.; Dinardo, G. Smart water grid: A smart methodology to detect leaks in water distribution networks. Measurement 2020, 151, 107260. [Google Scholar] [CrossRef]
  34. Tornyeviadzi, H.M.; Seidu, R. Leakage detection in water distribution networks via 1D CNN deep autoencoder for multivariate SCADA data. Eng. Appl. Artif. Intell. 2023, 122, 106062. [Google Scholar] [CrossRef]
  35. Kammoun, M.; Kammoun, A.; Abid, M. Experiments based comparative evaluations of machine learning techniques for leak detection in water distribution systems. Water Supply 2021, 22, 628–642. [Google Scholar] [CrossRef]
  36. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013; Volume 112, p. 18. [Google Scholar]
  37. Schaffer, A.L.; Dobbins, T.A.; Pearson, S.A. Interrupted time series analysis using autoregressive integrated moving average (ARIMA) models: A guide for evaluating large-scale health interventions. BMC Med. Res. Methodol. 2021, 21, 58. [Google Scholar] [CrossRef]
  38. ArunKumar, K.E.; Kalaga, D.V.; Kumar, C.M.S.; Chilkoor, G.; Kawaji, M.; Brenza, T.M. Forecasting the dynamics of cumulative COVID-19 cases (confirmed, recovered and deaths) for top-16 countries using statistical machine learning models: Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA). Appl. Soft Comput. 2021, 103, 107161. [Google Scholar] [PubMed]
  39. Mestre, G.; Portela, J.; Rice, G.; San Roque, A.M.; Alonso, E. Functional time series model identification and diagnosis by means of auto-and partial autocorrelation analysis. Comput. Stat. Data Anal. 2021, 155, 107108. [Google Scholar] [CrossRef]
  40. Nury, A.H.; Hasan, K.; Alam, M.J.B. Comparative study of wavelet-ARIMA and wavelet-ANN models for temperature time series data in northeastern Bangladesh. J. King Saud Univ.-Sci. 2017, 29, 47–61. [Google Scholar] [CrossRef] [Green Version]
  41. Katoch, R.; Sidhu, A. An application of ARIMA model to forecast the dynamics of COVID-19 epidemic in India. Glob. Bus. Rev. 2021. [Google Scholar] [CrossRef]
  42. Helmer, R.M.; Johansson, J.K. An exposition of the Box-Jenkins transfer function analysis with an application to the advertising-sales relationship. J. Mark. Res. 1977, 14, 227–239. [Google Scholar] [CrossRef]
  43. von Asmuth, J.R.; Bierkens, M.F.; Maas, K. Transfer function-noise modeling in continuous time using predefined impulse response functions. Water Resour. Res. 2002, 38, 23-1–23-12. [Google Scholar] [CrossRef]
  44. DelSole, T.; Tippett, M.K. Correcting the corrected AIC. Stat. Probab. Lett. 2021, 173, 109064. [Google Scholar] [CrossRef]
  45. Barua, A.; Ahmed, M.U.; Begum, S. A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions. IEEE Access 2023, 11, 14804–14831. [Google Scholar] [CrossRef]
  46. Li, X.; Li, Z. The application of linear and nonlinear water tanks case study in teaching of process control. IOP Conf. Ser. Earth Environ. Sci. 2018, 113, 012165. [Google Scholar] [CrossRef]
Figure 1. A schematic representation of a section of the primary water distribution system of Mexico City analyzed in this study. Filled orange squares represent trifurcations, while cyan squares represent tanks of various capacities. The blue lines represent pipelines with a diameter exceeding 126 inches, while the blue dotted lines indicate pipelines with a diameter of 48 inches.
Figure 1. A schematic representation of a section of the primary water distribution system of Mexico City analyzed in this study. Filled orange squares represent trifurcations, while cyan squares represent tanks of various capacities. The blue lines represent pipelines with a diameter exceeding 126 inches, while the blue dotted lines indicate pipelines with a diameter of 48 inches.
Water 15 02792 g001
Figure 2. Geographic locations and connections of the water tanks of the water distribution branch of Mexico City analyzed in this study. The blue squares represent the locations of the tanks along the city. The lines on the map represent the connections between the tanks across Mexico City. The blue line represents a pipeline of 48 in in diameter, and the cyan line represents a pipeline of 20 in in diameter.
Figure 2. Geographic locations and connections of the water tanks of the water distribution branch of Mexico City analyzed in this study. The blue squares represent the locations of the tanks along the city. The lines on the map represent the connections between the tanks across Mexico City. The blue line represents a pipeline of 48 in in diameter, and the cyan line represents a pipeline of 20 in in diameter.
Water 15 02792 g002
Figure 3. Schematic representation of the water distribution branch analyzed in this study. The case study comprised six tanks connected by gravity. The average water mass of one week is presented as a percentage. The measured variables are indicated with a symbol, and the variable numbers correspond with those shown in Table 2 for each water flow sensor. The orange dots indicate the positions of the water flow sensors along the water distribution branch that were measured.
Figure 3. Schematic representation of the water distribution branch analyzed in this study. The case study comprised six tanks connected by gravity. The average water mass of one week is presented as a percentage. The measured variables are indicated with a symbol, and the variable numbers correspond with those shown in Table 2 for each water flow sensor. The orange dots indicate the positions of the water flow sensors along the water distribution branch that were measured.
Water 15 02792 g003
Figure 4. Example of identifying missing values in the raw time series flow sensor raw data for Tank 4 Inflow variable during one week. The dotted blue line represents the observations of the Tank 4 Inflow variable. The red line represents the missing values in the Tank 4 Inflow variable.
Figure 4. Example of identifying missing values in the raw time series flow sensor raw data for Tank 4 Inflow variable during one week. The dotted blue line represents the observations of the Tank 4 Inflow variable. The red line represents the missing values in the Tank 4 Inflow variable.
Water 15 02792 g004
Figure 5. Example of filling the missing values for Tank 4 Inflow variable corresponding to one week. The dotted blue line represents the observations of the Tank 4 Inflow variable. The dotted red line represents the imputed values in the Tank 4 Inflow variable.
Figure 5. Example of filling the missing values for Tank 4 Inflow variable corresponding to one week. The dotted blue line represents the observations of the Tank 4 Inflow variable. The dotted red line represents the imputed values in the Tank 4 Inflow variable.
Water 15 02792 g005
Figure 6. The methodology flowchart utilized to fit the ARIMA and TF models based on the water flow data.
Figure 6. The methodology flowchart utilized to fit the ARIMA and TF models based on the water flow data.
Water 15 02792 g006
Figure 7. ACF (left) and PACF (right) plot of Tank 4 Inflow original time series.
Figure 7. ACF (left) and PACF (right) plot of Tank 4 Inflow original time series.
Water 15 02792 g007
Figure 8. ACF (left) and residuals (right) of the ARIMA(1,0,0)(1,1,1)(96) model of Tank 4 Inflow.
Figure 8. ACF (left) and residuals (right) of the ARIMA(1,0,0)(1,1,1)(96) model of Tank 4 Inflow.
Water 15 02792 g008
Figure 9. An overview of the generation process of an ARIMA model.
Figure 9. An overview of the generation process of an ARIMA model.
Water 15 02792 g009
Figure 10. Forecasts of 1 week of the model ARIMA(1,0,0)(1,1,1)(96) of Tank 4 Inflow with the corresponding model forecast in orange (in blue are the actual observations), and the grey zone indicates the 95% confidence interval of the model.
Figure 10. Forecasts of 1 week of the model ARIMA(1,0,0)(1,1,1)(96) of Tank 4 Inflow with the corresponding model forecast in orange (in blue are the actual observations), and the grey zone indicates the 95% confidence interval of the model.
Water 15 02792 g010
Figure 11. Transfer Functions’ modeling and forecasting process.
Figure 11. Transfer Functions’ modeling and forecasting process.
Water 15 02792 g011
Figure 12. Cross-correlation of prewhitened input and output for Tank 2.
Figure 12. Cross-correlation of prewhitened input and output for Tank 2.
Water 15 02792 g012
Figure 13. Forecasts of 1 week of the model TFM(0,1,0) of Tank 4 Inflow taking into consideration Tank 3 Inflow data as input data of the model. The orange line represents the TF model forecast; the blue line represents the actual observations; the gray area represents the 95% confidence interval of the fitted TF model.
Figure 13. Forecasts of 1 week of the model TFM(0,1,0) of Tank 4 Inflow taking into consideration Tank 3 Inflow data as input data of the model. The orange line represents the TF model forecast; the blue line represents the actual observations; the gray area represents the 95% confidence interval of the fitted TF model.
Water 15 02792 g013
Figure 14. Data evaluation flowchart to detect the anomalies in the water tanks of the analyzed water distribution branch by employing the fitted Seasonal ARIMA and TF models’ 95% confidence intervals.
Figure 14. Data evaluation flowchart to detect the anomalies in the water tanks of the analyzed water distribution branch by employing the fitted Seasonal ARIMA and TF models’ 95% confidence intervals.
Water 15 02792 g014
Figure 15. Water flow forecasts and alerts of all the selected models. The first row shows the model for Variables 1 to 3. The second row shows the model for Variables 4 to 6. The third row shows the models for Variables 7 to 9. The fourth row shows the models for Variables 10 and 11. The orange line represents the model’s forecast values. The blue line represents the actual or new observations of the analyzed week. The gray zone represents the 95% confidence interval of each model. The red dots represent the observations that fall out of each model’s 95% confidence interval gray area. These red dots are considered anomalies in the water flow behavior of the analyzed water distribution branch of Mexico City. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Figure 15. Water flow forecasts and alerts of all the selected models. The first row shows the model for Variables 1 to 3. The second row shows the model for Variables 4 to 6. The third row shows the models for Variables 7 to 9. The fourth row shows the models for Variables 10 and 11. The orange line represents the model’s forecast values. The blue line represents the actual or new observations of the analyzed week. The gray zone represents the 95% confidence interval of each model. The red dots represent the observations that fall out of each model’s 95% confidence interval gray area. These red dots are considered anomalies in the water flow behavior of the analyzed water distribution branch of Mexico City. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Water 15 02792 g015
Table 2. Characteristics of water flow variables in the sampling period for the six tanks of the analyzed water distribution branch. The variable number corresponds to the ones shown in Figure 3.
Table 2. Characteristics of water flow variables in the sampling period for the six tanks of the analyzed water distribution branch. The variable number corresponds to the ones shown in Figure 3.
Variable
Number
VariableMinimum (lps)Maximum (lps)Mean (lps)NA (%)
1Tank 1 Inflow0258718370%
2Tank 1 Outflow092.7365.30%
3Tank 2 Inflow0243018430%
4Tank 2 Outflow0401.4280.90%
5Tank 3 Inflow−1.862220.151554.580%
6Tank 4 Inflow02489.2994.814%
7Tank 4 Outflow−107.01117.416.3348%
8Tank 5 Inflow02465.81088.35%
9Tank 5 Outflow A−352.52912819.35%
10Tank 5 Outflow B0530.3412.90%
11Tank 6 Inflow0101.567.1670%
Table 3. Seasonal ARIMA models’ AIC, MAPE, and RMSEs from each water flow variable. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Table 3. Seasonal ARIMA models’ AIC, MAPE, and RMSEs from each water flow variable. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Variable Number VariableModel NotationAICRMSEMAPE
1Tank 1 InflowARIMA(2,0,1)(0,1,1)(96)6011.6234.877781.412087
2Tank 1 OutflowARIMA(0,1,1)(0,1,1)(96)1882.581.0453281.112624
3Tank 2 InflowARIMA(2,0,1)(0,1,1)(96)5998.834.471431.378066
4Tank 2 OutflowARIMA(0,1,0)(0,1,1)(96)3322.983.8544251.069508
5Tank 3 InflowARIMA(2,0,0)(0,1,1) (96)5744.7229.297161.069776
6Tank 4 InflowARIMA(1,0,0)(1,1,1)(96)6727.9175.563545.415305
7Tank 4 OutflowARIMA(1,0,1)(0,1,1)(96)4901.1816.7920233.91554
8Tank 5 InflowARIMA(1,1,2)(0,1,1)(96)7497.96135.5895.748487
9Tank 5 Outflow AARIMA(1,0,0)(0,1,1)(96)7435.93155.895222.79048
10Tank 5 Outflow BARIMA(1,1,0)(0,1,1)(96)3129.152.8690730.3859113
11Tank 6 InflowARIMA(0,1,1)(0,1,1)(96)1296.070.58308090.3625255
Table 4. Seasonal ARIMA models’ MAPEs and RMSEs from each model forecast for one day and one week in the future. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Table 4. Seasonal ARIMA models’ MAPEs and RMSEs from each model forecast for one day and one week in the future. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Variable Number VariableModel Notation1-Day Forecast1-Week Forecast
RMSEMAPERMSEMAPE
1Tank 1 InflowARIMA(2,0,1)(0,1,1)(96)70.336063.221687156.8526.182667
2Tank 1 OutflowARIMA(0,1,1)(0,1,1)(96)5.9728187.1927566.4542278.924136
3Tank 2 InflowARIMA(2,0,1)(0,1,1)(96)83.170523.757622142.417595.610582
4Tank 2 OutflowARIMA(0,1,0)(0,1,1)(96)14.048794.57400428.6588388.957665
5Tank 3 InflowARIMA(2,0,0)(0,1,1)(96)86.981324.963585169.630348.577175
6Tank 4 InflowARIMA(1,0,0)(1,1,1)(96)221.1230111.697919272.2630314.447808
7Tank 4 OutflowARIMA(1,0,1)(0,1,1)(96)18.52472352.694217.195477.69167
8Tank 5 InflowARIMA(1,1,2)(0,1,1)(96)303.455323.445315384.090123.786044
9Tank 5 Outflow AARIMA(1,0,0)(0,1,1)(96)367.1641-556.4129-
10Tank 5 Outflow BARIMA(1,1,0)(0,1,1)(96)15.5685363.378292728.9825225.397012
11Tank 6 InflowARIMA(0,1,1)(0,1,1)(96)0.66848050.52893681.51030291.220953
Table 5. Input and output variables were determined for the Transfer Function model and AIC, MAPE, and RMSE of each of the best-selected models for each water flow variable. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Table 5. Input and output variables were determined for the Transfer Function model and AIC, MAPE, and RMSE of each of the best-selected models for each water flow variable. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Input VariableOutput VariableTF Model(b,s,r)AICRMSEMAPE
Tank 1 Inflow
(Variable 1)
Tank 2 Inflow
(Variable 3)
TFM(4,2,3)6815.2938.478231.64836
Tank 2 Inflow
(Variable 3)
Tank 3 Inflow
(Variable 5)
TFM(2,4,2)5536.8129.469010.99219
Tank 1 Inflow
(Variable 1)
Tank 3 Inflow
(Variable 5)
TFM(6,1,2)5548.2429.760440.94921
Tank 3 Inflow
(Variable 5)
Tank 4 Inflow
(Variable 6)
TFM(0,1,0)7755.8477.317495.27644
Tank 4 Inflow
(Variable 6)
Tank 5 Inflow
(Variable 8)
TFM(1,3,0)8598.87144.667034.90388
Tank 4 Inflow
(Variable 6)
Tank 6 Inflow
(Variable 11)
TFM(0,2,0)1227.60.699270.44741
Tank 1 Inflow
(Variable 1)
Tank 5 Outflow B
(Variable 10)
TFM(0,2,2)3511.453.281810.47553
Table 6. Input variable for Transfer Function model and the best model selected for the variable, AIC, MAPE, and RMSE of each selected model forecasts one day and one week after. The asterisk denotes the variable and error values that are lower than those obtained from the Seasonal ARIMA model corresponding to the same output variable. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Table 6. Input variable for Transfer Function model and the best model selected for the variable, AIC, MAPE, and RMSE of each selected model forecasts one day and one week after. The asterisk denotes the variable and error values that are lower than those obtained from the Seasonal ARIMA model corresponding to the same output variable. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Input VariableOutput VariableTF Model
(b,s,r)
1-Day Forecast1-Week Forecast
RMSEMAPERMSEMAPE
Tank 1 Inflow
(Variable 1)
Tank 2 Inflow *
(Variable 3)
TFM(4,2,3)43.727 *1.863451 *41.114 *1.806004 *
Tank 2 Inflow
(Variable 3)
Tank 3 Inflow *
(Variable 5)
TFM(2,4,2)61.848 *3.317456 *52.368 *2.8926 *
Tank 1 Inflow
(Variable 1)
Tank 3 Inflow
(Variable 5)
TFM(6,1,2)114.8876.828815141.0187.95437
Tank 3 Inflow
(Variable 5)
Tank 4 Inflow
(Variable 6)
TFM(0,1,0)212.25818.27315259.4421.63761
Tank 4 Inflow
(Variable 6)
Tank 5 Inflow
(Variable 8)
TFM(1,3,0)373.34547.85059745.249566.45463
Tank 4 Inflow
(Variable 6)
Tank 6 Inflow
(Variable 11)
TFM(0,2,0)2.4022.1951432.2561.982629
Tank 1 Inflow
(Variable 1)
Tank 5 Outflow B
(Variable 10)
TFM(0,2,2)10.2182.02364936.3727.027344
Table 7. Mathematical models of the best ARIMA(p,d,q)(P,D,Q)(s) and TF(b, s, r) models for each variable in the analyzed water distribution branch, where: y t is the output variable at time t , x t is the input variable at time t , B is the backshift operator defined as B j Y t = Y t j , and ε t is the error term at time t , assumed to be normally distributed with mean 0 and constant variance [24]. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Table 7. Mathematical models of the best ARIMA(p,d,q)(P,D,Q)(s) and TF(b, s, r) models for each variable in the analyzed water distribution branch, where: y t is the output variable at time t , x t is the input variable at time t , B is the backshift operator defined as B j Y t = Y t j , and ε t is the error term at time t , assumed to be normally distributed with mean 0 and constant variance [24]. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Variable NumberVariableModel NotationWritten Model Including Parameters
1Tank 1 InflowARIMA(2,0,1)(0,1,1)(96) ( 1 1.1144 B + 0.1173 B 2 ) ( 1 B 96 ) y t = ( 1 0.8911 B ) ( 1 0.9995 B 96 ) ε t
2Tank 1 OutflowARIMA(0,1,1)(0,1,1)(96) ( 1 B ) ( 1 B 96 ) y t = ( 1 0.2119 B ) ( 1 0.8294 B 96 ) ε t
3Tank 2 InflowTFM(4,2,3) y t = ( 1 + 0.39 B 2 B 2 + 0.68 B 3 ) ( 1 1.87 B + 0.88 B 2 ) ( 1 B ) B 4 x t
4Tank 2 OutflowARIMA(0,1,0)(0,1,1)(96) ( 1 B ) ( 1 B 96 ) y t = ( 1 0.6201 B 96 ) ε t
5Tank 3 InflowTFM(2,4,2) y t = ( 1 0.76 B 0.94 B 2 ) ( 1 0.8 B 0.34 B 2 0.81 B 3 + 0.98 B 4 ) ( 1 B ) B 2 x t
6Tank 4 InflowARIMA(1,0,0)(1,1,1)(96) ( 1 0.8729 B ) ( 1 0.1742 B 96 ) ( 1 B 96 ) y t = ( 1 0.902 B 96 ) ε t
7Tank 4 OutflowARIMA(1,0,1)(0,1,1)(96) ( 1 + 0.3205 B ) ( 1 B 96 ) y t = ( 1 + 0.5807 B ) ( 1 0.9996 B 96 ) ε t
8Tank 5 InflowARIMA(1,1,2)(0,1,1)(96) ( 1 0.8741 B ) ( 1 B ) ( 1 B 96 ) y t = ( 1 0.984 B 0.016 B 2 ) ( 1 0.8242 B 96 ) ε t
9Tank 5 Outflow AARIMA(1,0,0)(0,1,1)(96) ( 1 0.8737 B ) ( 1 B 96 ) y t = ( 1 0.3874 B 96 ) ε t
10Tank 5 Outflow BARIMA(1,1,0)(0,1,1)(96) ( 1 0.2493 B ) ( 1 B ) ( 1 B 96 ) y t = ( 1 0.9992 B 96 ) ε t
11Tank 6 InflowARIMA(0,1,1)(0,1,1)(96) ( 1 B ) ( 1 B 96 ) y t = ( 1 0.6446 B ) ( 1 0.9999 B 96 ) ε t
Table 8. The count of three types of alerts and the total amount of alerts generated by each model in two forecast periods for the whole branch. The variable number corresponds to the ones shown in Figure 3 and Table 2.
Table 8. The count of three types of alerts and the total amount of alerts generated by each model in two forecast periods for the whole branch. The variable number corresponds to the ones shown in Figure 3 and Table 2.
nVariableModel1 Day (96 Observations)1 Week (672 Observations)
Possible LeakagePossible Measurement ErrorNATotal AlertsPossible LeakagePossible Measurement ErrorNATotal Alerts
1Tank 1 InflowARIMA(2,0,1)(0,1,1)(96)0101151016
2Tank 1 OutflowARIMA(0,1,1)(0,1,1)(96)00000000
3Tank 2 InflowTFM(4,2,3)30033003
4Tank 2 OutflowARIMA(0,1,0)(0,1,1)(96)00000000
5Tank 3 InflowTFM(2,4,2)01012103
6Tank 4 InflowARIMA(1,0,0)(1,1,1)(96)04172107069139
7Tank 4 OutflowARIMA(1,0,1)(0,1,1)(96)02125461351190326
8Tank 5 InflowARIMA(1,1,2)(0,1,1)(96)0606760269
9Tank 5 Outflow AARIMA(1,0,0)(0,1,1)(96)12602731440147
10Tank 5 Outflow BARIMA(1,1,0)(0,1,1)(96)10011001
11Tank 6 InflowARIMA(0,1,1)(0,1,1)(96)00000000
Total of the branch55942106166277261704
Table 9. Comparison of the related research approaches with the methodology presented in this study.
Table 9. Comparison of the related research approaches with the methodology presented in this study.
AuthorInput DataMachine Learning
Algorithm
Analyzed System
Guo et al. [15]Piezoelectric accelerometersTime–frequency CNNPipe networks from the city of Cheng Du, China
Moulik et al. [17]3-axis accelerometerK-means clusteringLaboratory pipeline system prototype
Choi et al. [30]Sound vibration data magnitude spectraCNN AI Hub dataset composed of water leakage data from neighborhoods in Gwangju, Korea
Yu et al. [10]Piezoelectric accelerometer dataSqueezeNet CNNPipe networks from cities in China including Shaoxing, Guangzhou, Sanya, Dalian, and Kunming
Fereidooni et al. [9]Water flow sensorComparison between
random forest,
decision tree,
Bayesian network, and
K-nearest neighbor
Vitens company dataset of the water distribution networks of Leeuwarden City in Netherland
Chen et al. [11] Landsat 8 satellite imagesCNNWater canal systems in Arizona
Sousa et al. [12] Pressure measurements from pumpsK-means clustering and
learning vector quantization
District-metered areas in Stockholm, Sweden
Shen et al. [31]Acoustic emission signalsAdaboostWater distribution networks of Jiangsu, Zhejiang, and Shanghai
Fares et al. [13]Acoustic emission signalsComparison between support vector machine, artificial neural network, and deep learning techniques.Water distribution networks from Hong Kong
Xue et al. [18]Flowmeters and pressure sensorsXGBoostHydraulic simulation model
Taghlabi et al. [19]Pressure values Random forestEPANET-Matlab simulation
Pérez-Pérez et al. [16]Flow and pressure measurementsArtificial neural networkLaboratory water pipeline
Tornyeviadzi et al. [34]Multivariate time series SCADA data1D deep CNN autoencoderL-TOWN water distribution network dataset
This studyWater flow sensor dataModeling of water flow data via
Seasonal ARIMA model
Input and output water flow of a water distribution branch of Mexico City
This studyWater flow sensor dataModeling of water flow data via
Transfer Function model
Input and output water flow of a water distribution branch of Mexico City
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Barrientos-Torres, D.; Martinez-Ríos, E.A.; Navarro-Tuch, S.A.; Pablos-Hach, J.L.; Bustamante-Bello, R. Water Flow Modeling and Forecast in a Water Branch of Mexico City through ARIMA and Transfer Function Models for Anomaly Detection. Water 2023, 15, 2792. https://doi.org/10.3390/w15152792

AMA Style

Barrientos-Torres D, Martinez-Ríos EA, Navarro-Tuch SA, Pablos-Hach JL, Bustamante-Bello R. Water Flow Modeling and Forecast in a Water Branch of Mexico City through ARIMA and Transfer Function Models for Anomaly Detection. Water. 2023; 15(15):2792. https://doi.org/10.3390/w15152792

Chicago/Turabian Style

Barrientos-Torres, David, Erick Axel Martinez-Ríos, Sergio A. Navarro-Tuch, Jose Luis Pablos-Hach, and Rogelio Bustamante-Bello. 2023. "Water Flow Modeling and Forecast in a Water Branch of Mexico City through ARIMA and Transfer Function Models for Anomaly Detection" Water 15, no. 15: 2792. https://doi.org/10.3390/w15152792

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop