1. Introduction
The growing worldwide population, the increasing living standards, the altered water consumption habits, and the spread of irrigated agriculture are the primary causes of the rising global demand for water. Water scarcity has become a danger to the sustainable development of human society [
1]. According to the World Urbanization Prospects published by the United Nations in 2018, almost 90% of Mexico’s population is projected to reside in urban areas [
2]. Nevertheless, 20 million people in Mexico suffer from severe water scarcity [
1]. Even though there is sufficient infrastructure in Mexico, water management could be improved, and the system needs to be appropriately maintained. According to estimates, the distribution networks lose 40% of their water due to aging pipelines, lengthy periods without sufficient maintenance, poor construction and management practices, and ongoing land subsidence in metropolitan areas [
3].
Managing water resources and preventing, identifying, and fixing leaks are essential to reduce city water scarcity. However, this requires an information system. Therefore, real-time monitoring and data collection are crucial to creating trustworthy and practical information systems. Additionally, the data must be accurate and thorough to draw reliable conclusions and models [
4]. Nevertheless, data collection is hampered by several structural and physical restrictions, environmental factors, and human errors. Moreover, the complexity and breadth of the modern city’s water system necessitate a significant infrastructure investment for communications, location, and data processing [
5]. The above has motivated the development of solutions that help to monitor water distribution networks. Several studies and technologies have been developed for leakage detection in water distribution systems, which can be classified into hardware- and software-based methods. Acoustic monitoring, gas injection, thermography, ground-penetrating radar, and free swimming systems are examples of hardware-based techniques. However, employing these techniques in broad regions can be time-consuming, expensive, and inappropriate for automation or long-term monitoring [
6].
Software-based leakage detection techniques can be categorized into model-based and data-driven approaches. Model-based approaches describe the relationships between the variables of the water distribution network in a mathematical model that accounts for the network’s physical properties [
7]. Model-based leak identification techniques do not require previous network data; instead, the leak diagnosis is performed by comparing the model outputs to the measured variables. Developing such models, however, can be challenging: it is limited by the accuracy of the mathematical models and depends on accurate model calibration [
8].
On the other hand, data-driven approaches build their analysis on the network’s historical data. In contrast to model-based approaches, data-driven methodologies do not need to know the network’s structural characteristics, but they do require historical data. Recently, there has been growing interest in machine-learning methods because of their robust capacities for pattern recognition and feature identification and the increasing development and accessibility of data-collection technology [
8]. For instance, multilayer perceptrons, support vector machines, clustering algorithms, or deep learning algorithms have been proven efficient for solving leak localization problems, as discussed in [
6,
8].
In this regard, water leakage detection based on machine learning algorithms has been proposed using different data modalities to train the detection models. These data modalities include flow sensor data, pressure data, vibration data, vibro-acoustic data, acoustic emission data, and satellite data [
9,
10,
11,
12,
13,
14]. However, satellite and acoustic emission data collection may be unaffordable [
15]. Besides, using flow sensors and vibration data for water leakage detection requires the installation of several sensors across the water pipeline or its junctions, which limits their use on large water distribution networks. In addition, some studies that have developed water leakage detection systems have used simulation or laboratory tests under controlled conditions without considering the uncertainty to which data may be susceptible in real scenarios [
16,
17,
18,
19].
Moreover, the key challenges in applying machine learning techniques are choosing a suitable algorithm, building appropriate feature extractors to learn complicated features, accessing a large amount of data for training the models, and having efficient signal processing tools [
6]. In addition, the black-box nature of deep learning algorithms makes them less interpretable by humans and necessitates specialized computer hardware for their training (e.g., Graphics Processing Units) [
20].
Most of the studies that have proposed water leakage detection systems based on data-driven methods use either machine learning techniques (e.g., random forests, support vector machines, Adaboost, XGBoost) or deep learning algorithms (e.g., convolutional neural networks (CNNs)), which often require high-quality data for training. However, gathering enough data could be expensive, time-consuming, and unrealistic. Furthermore, black-box classification techniques such as CNNs, random forests, multilayer perceptrons, or XGBoost have many parameters to be adjusted, which limits their interpretability and makes them prone to overfitting [
21]. On the other hand, one of the crucial characteristics of Autoregressive Integrated Moving Average (ARIMA) and Transfer Function (TF) models is that they provide a linear estimate of the system to be modeled. Besides, the number of parameters of ARIMA and TF models tends to be lower [
22], which is a strong simplification compared to the large number of parameters that non-linear machine learning and deep learning approaches often require [
23].
This study presents a methodology for anomaly detection in water distribution systems by employing water flow data and two classical time series modeling techniques, the ARIMA and TF models, which were fit following the Box–Jenkins methodology [
24]. This study modeled water flow data from tanks in a primary network branch of the water distribution system in Mexico City. This branch carries a significant volume of water through tanks and supplies the secondary network. Analyzing this branch is crucial due to the substantial water flows and pipeline sizes involved. A leakage in this branch would cause greater water losses than a leakage in other parts of the public water network of Mexico City.
As previously stated, the studies in the literature have performed simulations, conducted laboratory tests, or placed sensors (e.g., flow and vibration sensors) across the water pipelines to collect the data needed to develop water-leakage-detection models [
10,
25,
26,
27,
28]. In contrast, this study focused on analyzing a branch of the water distribution system of Mexico City, which supplies water to the water pipes, instead of analyzing the water pipes directly through flow or vibration sensors. This was done to detect anomalies in the water flow behavior that could indicate the presence of sensor malfunction or water leakages along the analyzed water branch. Thus, the data on the inlet and outlet water flow of the tanks that comprise the analyzed water branch were used to develop the ARIMA and TF models proposed in this work. To the authors’ knowledge, such a study has not yet been performed.
The principal contribution of this work was the use and comparison of the TF and ARIMA models generated through the Box–Jenkins methodology applied for anomaly detection in water flow variables of a water branch of the Mexico City water distribution system, which allowed us to:
Adjust the models and forecasts to different time windows of the water flow consumption in Mexico City.
Generate anomaly-detection models with incomplete and small datasets by employing the water flow data of a branch of the water distribution system of Mexico City.
Generate interpretable and sparse models of water flow for anomaly detection in a branch of the water distribution system of Mexico City.
Perform a comparison of the ARIMA and TF models for modeling the water flow behavior of a branch of the water distribution system of Mexico City.
The rest of this paper is structured as follows.
Section 2 presents the literature review on water leakage detection based on machine learning algorithms and an analysis of the state-of-the-art.
Section 3 presents the case study and describes the data collection process of the flow data of the water distribution branch analyzed in this work. Moreover, the overall methodology is explained, including the theoretical background of the ARIMA and TF models and the model-generating process. In addition, the proposed anomaly-detection methodology for the analyzed water branch, which integrates the best models of both methods, is explained.
Section 4 presents the results of the present study, while
Section 5 presents the analysis and discussion of the results.
Section 6 shows the main limitations and areas of opportunity of this work. Finally,
Section 7 presents the conclusions and future work.
2. Literature Review
This section presents an overview of the studies that have proposed methods for water leakage detection based on machine learning. In addition, an analysis and discussion of the gaps in the current state-of-the-art is shown.
Table 1 summarizes the literature on recent approaches towards leakage detection in water distribution systems that used machine learning techniques.
Recent studies have suggested using data-driven methodologies to detect water leakage, primarily relying on machine learning algorithms. Islam et al. [
29] presented and discussed this trend of using machine learning for water leakage detection. For instance, the study of Moulik et al. [
17] proposed to detect water leakages and blockages in water pipelines by processing the vibrations of PVC pipes. Moulik’s study employed three-axis accelerometers to measure the vibration on the PVC pipes produced by water leakages; then, the vibration data were utilized as the input into a k-means clustering technique to perform the detection. Similarly, Choi et al. [
30] utilized sound vibration data from water pipes to detect water leakages by employing the magnitude spectra of the sound vibration data to train a 2D CNN. Likewise, Yu et al. [
10] employed vibration data collected from piezoelectric accelerometers placed in the water distribution networks of several cities in China for water leakage detection. Yu’s study tested different machine learning algorithms such as support vector machines, decision trees, the SqueezeNet CNN, and K-nearest neighbor, with SqueezeNet achieving the highest performance when trained with the spectrograms of the Short-Time Fourier Transform of the vibration data.
Fereidooni et al. [
9] installed flow sensors in the pipeline network junction to detect water leakages. The flow sensor data were processed using hydraulic equations to generate velocity and head loss features. The trained algorithms were a decision tree, a K-nearest neighbor, a random forest, and a Bayesian network. Satellite data have also been used for water leakage detection. An example of this was presented by Chen et al. [
11], who utilized augmented satellite images to detect water leakages in the canal systems in Arizona. The authors employed Landsat 8 satellite images to train a CNN, used as the water-leakage-detection algorithm. Sousa et al. [
12] proposed to analyze pressure data measured from pumps in district-metered areas of Stockholm, Sweden. The analyzed area corresponded to a residential area with no water tanks or reservoirs. The detection algorithms involved a comparison of unsupervised learning algorithms, such as k-means clustering, and supervised learning algorithms, such as learning vector quantization algorithms. In [
31], it was proposed to detect water leakages by processing acoustic emission signals collected from the water distribution networks of Jiangsu, Zhejiang, and Shanghai. The acoustic emission signals were characterized by computing the main frequency, the spectral roll-off rate, the spectral flatness, and the Mel frequency cepstrum coefficients. Then, the authors trained tree-based algorithms such as decision trees, Adaboost, and random forests, with Adaboost achieving the highest performance. Likewise, Fares et al. [
13] utilized acoustic emission signals to detect water leakages in water distribution networks. Fares’ study utilized time and frequency domain features to represent the acoustic emission signals and used them as the input to train a support vector machine, an artificial neural network, and deep learning algorithms.
Furthermore, Xue et al. [
18] introduced a leakage-fault-detection approach using a hydraulic simulation model encompassing all potential leakage faults. Subsequently, XGBoost was trained, and an alert-triggering algorithm generated a leakage signal associated with the specific pipe’s name. Cody and Narasimhan [
32] proposed a linear prediction model, specifically an autoregressive moving average model in conjunction with a multivariate Gaussian mixture model to perform semi-supervised leakage detection. This method utilizes data collected with hydrophone sensors and simulated leakages within a water distribution network. Additionally, the authors suggested a coarse-resolution leakage location using the average baseline root mean square of the collected data and a fine location estimation utilizing cross-correlation based on the time series data from linear prediction filter sensors. Taghlabi et al. [
19] conducted experiments employing two methods for water leakage detection. Firstly, they simulated artificial leaks using the EPANET code on the MATLAB platform to establish a database of pressure values that describe the network’s behavior when leaks are present. Subsequently, these data were utilized to train a random forest algorithm, enabling it to forecast the rate and location of leaks within the network. Secondly, they simulated artificial leaks by manipulating hydrants in different locations, considering two distinct leak sizes, and comparing results.
Similarly, Pérez-Pérez et al. [
16] proposed using artificial neural network (ANN) techniques and online measurements of pressure and flow rate to detect and locate water leaks in pipelines. The friction factor of the pipe was estimated and utilized as an input for computing the leak position. Fabbiano et al. [
33] considered that the energy variation transmitted to the pipe walls by the radial component of vibrations induced by fluid turbulence might be related to the flow leak. Hence, Fabbiano measured the radial vibrational status of specific pipes in the network. Finally, Tornyeviadzi et al. [
34] proposed a one-dimensional CNN deep autoencoder trained to locate and identify water leaks. This technique uses multivariate time series data to lessen the adverse effects of random noise. The proposed autoencoder’s input data involved flow, pressure, and tank-level data.
Table 1.
Recent data-driven approaches for water-leakage-detection technologies.
Project | Year | Country | Methodology | Results |
---|---|---|---|---|
Leakage detection in water distribution systems based on time–frequency convolutional neural network [15] | 2021 | China | A leakage spectrogram was employed to capture the characteristics of leakage signals, and a time–frequency convolutional neural network (TFCNN) model was compared with other classification models across various signal-to-noise ratio (SNR) conditions. | The TFCNN model demonstrated superior performance with a mean accuracy of 98% across different SNR conditions. Even at a challenging −10 dB SNR, the mean detection accuracy remained high at 90%. |
Water Leakage Detection in Hilly Region PVC Pipes using Wireless Sensors and Machine Learning [17] | 2020 | Taiwan | Wireless sensors were utilized to capture vibrations in PVC pipes during water flow. Machine learning algorithms were applied to these vibration records to identify any disruptions in the regular water flow caused by leakage or blockage. | Analysis of vibration records with the help of K-means algorithm to determine the water level and the leakages, if any. |
Application of CNN Models to Detect and Classify Leakages in Water Pipelines Using Magnitude Spectra of Vibration Sound [30] | 2023 | Korea | CNN model for water leakage detection and classification using sound vibration data from sensors in water pipes. | The proposed CNN model achieved an F1-score of 94.82% and a Matthew’s correlation coefficient of 94.47%. |
Leak detection in water distribution systems by classifying vibration signals [10] | 2023 | China | Support vector machine (SVM), decision tree (DT), and K-nearest neighbor (KNN) for leak detection models using signal data from piezoelectric accelerometers in Chinese water distribution systems (WDSs). | SqueezeNet performed best with 95.15% accuracy in leak identification, while KNN excelled among the three classifiers with superior sensitivity and 88.17% accuracy. |
A hybrid model-based method for leak detection in large scale water distribution networks [9] | 2021 | Netherlands | Influential leak detection features using hydraulic equations (Hazen–Williams, Darcy–Weisbach, and pressure drop) and decision tree, KNN, random forest, and Bayesian network used to locate leaks and determine their pressure based on pipeline topology. | Of the models, 80.5% consistently achieved results above 92% in all scenarios. The Naïve Bayesian Model performed the best overall, with a top result of 85.81%. |
Augmenting a deep-learning algorithm with canal inspection knowledge for reliable water leak detection from multispectral satellite images [11] | 2020 | USA | A deep learning approach, combined with canal inspection knowledge, enabled automated and reliable water leak detection of canal sections using Landsat 8 satellite images. | The proposed approach can achieve recall at 86%, precision at 86%, and accuracy at 85%. |
Leakage detection in water distribution networks using machine-learning strategies [12] | 2023 | Sweden | Analyzed pressure measurements from pumps in district-metered areas (DMAs) using unsupervised learning (K-means and cluster validation techniques) and supervised learning (learning vector quantization algorithms). | The proposed learning strategies are able to obtain correct classification rates up to 93.98%. |
A Tree-Based Machine Learning Method for Pipeline Leakage Detection [31] | 2022 | China | Distinctive features such as main frequency, spectral roll-off rate, spectral flatness, and 1D Mel frequency cepstrum coefficient (MFCC) using random forest and Adaboost models. | The Adaboost model had the lowest false positive rate of 7.35%. The recall rates of the random forest and Adaboost models were 100% and 99.52%. |
Leak detection in real water distribution networks based on acoustic emission and machine learning [13] | 2022 | China | Acoustic signals in time and frequency domains were used to develop leak-detection models, employing SVM, ANN, and deep learning (DL) techniques. | Demonstrated a largely stable performance and a high accuracy, particularly for new unlabeled cases. |
Machine learning-based leakage fault detection for district heating networks [18] | 2020 | China | Hydraulic simulation model and an XGBoost-based model | 85.85% of mean accuracy. |
Field implementation of linear prediction for leak-monitoring in water distribution networks [32] | 2020 | Canada | Linear prediction model for semi-supervised leak detection. | A detection accuracy in most cases of over 70%. |
Prelocalization and leak detection in drinking water distribution networks using modeling-based algorithms [19] | 2021 | Morocco | A simulation of artificial leaks and a random forest machine learning algorithm | Leak position identified within a 100 m radius. |
Leak diagnosis in pipelines using a combined artificial neural network approach [16] | 2021 | Mexico | ANN techniques and online measurements of pressure and flow rate | An average error of 0.629% for leak location. |
Smart water grid: A smart methodology to detect leaks in water distribution networks [33] | 2020 | Italy | Measuring the radial vibrational status of opportune pipes of the network | Radial vibration signals are linearly dependent only on the flow rate variations due to the leakages. |
Leakage detection in water distribution networks via 1D CNN deep autoencoder for multivariate SCADA data [34] | 2023 | Norway | A one-dimensional convolutional neural network deep autoencoder (AE) using multivariate time series data | Identified 16 of the 19 leaks in 2019. |
From this literature review, it is possible to observe that machine learning techniques have already been used extensively to perform water leakage detection. To a lesser extent, satellite data have been employed. Nevertheless, satellite data could be difficult to collect and label and may not be useful for detecting leakages inside the water pipelines. On the other hand, sound vibration data may be unaffordable due to the need for specific hardware to sample the vibro-acoustic signals. In the case of vibration data collected from accelerometers, it is necessary to install multiple sensors across the water pipelines, which can be costly and require extensive maintenance. Hence, analyzing the flow behavior of the water network can be considered a cost-effective solution since flow data are frequently monitored in water distribution systems. Nevertheless, similar to using accelerometers placed along the water pipeline to measure the pipe vibration, it is necessary to install multiple flow sensors along the water pipeline.
Moreover, in some of the reviewed works, the leakage-detection algorithm was developed in laboratory conditions, such as the studies of Pérez-Pérez et al. [
16], Moulik et al. [
17], and Taghlabi et al. [
19]. Nevertheless, as mentioned by Shen et al. [
31], on-site leakage signals have greater interference and randomness than leakage signals in a laboratory. Hence, there is an opportunity to analyze flow sensor data sampled from real water distribution systems and develop algorithms that can tackle the uncertainty to which the data are susceptible when developing models for water leakage detection. Furthermore, most of the related works focused on detecting water leakages by directly measuring the vibration or flow sensor data from the water pipeline. Nevertheless, analyzing the sensor data along the complex water pipelines could be inefficient and costly.
In the case of the machine learning techniques that have been used to develop the leakage detection models, it can be appreciated that deep neural networks have been extensively used, mainly variants of CNNs [
15,
30]. Even so, CNNs require a large sample size to avoid overfitting and lack interpretability because of their complexity. Other techniques frequently used in water leakage detection are non-linear classification techniques such as decision trees, random forests, support vector machines, Adaboost, and XGBoost [
35]. However, these non-linear classification techniques, similar to CNNs, require a large sample size to avoid overfitting and are less interpretable than linear machine learning techniques [
36].
Considering the above, there is an opportunity to develop techniques for detecting water leakage from other locations besides measuring water flow data directly from the water pipelines. Furthermore, the gathering of data and required flow sensors could be reduced if the water branch that delivers water to the water pipelines is analyzed, rather than directly measuring the flow or vibration in the water distribution pipelines. Finally, linear machine learning techniques such as the TF and ARIMA models could be tested to avoid using non-linear classification techniques, frequently employed in the literature, as presented in
Table 1.
3. Materials and Methods
3.1. Case Study and Dataset
The case study examined in this work consisted of six tanks from the Mexico City water distribution system situated in the Álvaro Obregón delegation and connected by 48 in. diameter pipelines.
Figure 1 shows a general schematic of the primary water distribution network in Mexico City, where the main two sources of water (Cutzamala System and Lerma System) feed several branches interconnected in cascade and fed by gravity. This study analyzed the data from Branch C of
Figure 1. The branch presented in
Figure 1 is instrumented to measure the water flow into and out of each water tank. The sensors used to sample the data were ISOMAG electromagnetic flow meters. Moreover, it is essential to highlight that the water distribution system of Mexico City is not instrumented in the sub-branches of the water pipelines that supply water to the users. Consequently, the case study was limited to analyzing the input and output of each water tank of the analyzed branch to detect anomalies in their behavior that could indicate the presence of leakages or measurement errors in the flow sensors.
The tanks are fed by gravity and connected by a main pipeline, with 0.5 to 2 km between consecutive tanks. The tanks are located in the Álvaro Obregón delegation of Mexico City, and each tank also supplies water to its surrounding local area. There is an elevation difference of 200 m between Tank 1 and Tank 5.
Figure 2 shows the geographic location of the water branch analyzed in this study. The blue squares represent the tanks of Branch C presented in
Figure 1. The blue line in
Figure 2 indicates the connection between the tanks across Mexico City.
The water flow rate of six tanks was measured. A total of 11 water flow variables related to each tank’s input and output flow, with their corresponding timestamps, were recorded every 15 min (i.e., the water flow measurements had a sampling interval of 15 min). The data utilized in this study were collected during the final two weeks of August 2020. The first week was the sampling period, during which the model was developed using the available data. The second week was designated as the forecasting period, in which the model’s performance was evaluated by generating forecasts based on the patterns learned from the previous week. Each tank within the system is equipped with electromagnetic flowmeters that measure the flow rate at its entry and at one or two of its exits, allowing the flow distribution within the main pipeline to be monitored in liters per second (lps). Furthermore, it is important to note that certain flow rates designated for local consumption or exit flows of the tank are unavailable due to a lack of instrumentation.
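The sampling arithmetic and the week-based split described above can be sketched as follows. This is a minimal illustration, not the authors' R implementation: the start date, the placeholder flow values, and the function name `split_weeks` are assumptions introduced here.

```python
from datetime import datetime, timedelta

SAMPLING_MINUTES = 15
SAMPLES_PER_DAY = 24 * 60 // SAMPLING_MINUTES   # 96 readings per day
SAMPLES_PER_WEEK = 7 * SAMPLES_PER_DAY          # 672 readings per week


def split_weeks(timestamps, values):
    """Split two weeks of readings into a sampling week and a forecasting week."""
    if len(timestamps) != len(values):
        raise ValueError("timestamps and values must have the same length")
    cut = timestamps[0] + timedelta(days=7)
    pairs = list(zip(timestamps, values))
    sampling = [(t, v) for t, v in pairs if t < cut]
    forecasting = [(t, v) for t, v in pairs if t >= cut]
    return sampling, forecasting


# Hypothetical start date and constant placeholder flow (lps), for illustration only.
start = datetime(2020, 8, 17, 0, 0)
timestamps = [start + timedelta(minutes=SAMPLING_MINUTES * i)
              for i in range(2 * SAMPLES_PER_WEEK)]
values = [100.0] * len(timestamps)

sampling_week, forecasting_week = split_weeks(timestamps, values)
```

With a 15 min interval, each week contributes 672 observations per variable before any missing values are taken into account.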
A schematic representation of the water distribution branch analyzed in this work is presented in
Figure 3, where the blue lines represent the main pipeline and connection with the next tank, and the gray lines correspond to the exit to the local network of the region. Furthermore,
Figure 3 shows an example of the average water consumed in a week in percentage; therefore, for the first tank, all the input water corresponds to 100%, while in Tank 5, the water to the next stage of the network corresponds to 32%; this means that the region of this system consumed 68% of the total input water during the analyzed period.
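The percentage bookkeeping in this example can be written out explicitly. The sketch below assumes that all percentages are expressed relative to the input of the first tank; the flow values in lps and the function name are hypothetical and chosen only to reproduce the 32% and 68% figures.

```python
def percent_of_branch_input(flow_lps, branch_input_lps):
    """Express a flow as a percentage of the flow entering the first tank."""
    if branch_input_lps <= 0:
        raise ValueError("branch input must be positive")
    return 100.0 * flow_lps / branch_input_lps


# Hypothetical weekly mean flows in lps (not values from the study).
branch_input = 1000.0        # flow entering Tank 1, taken as 100%
forwarded_by_tank5 = 320.0   # flow leaving Tank 5 towards the next stage

forwarded_pct = percent_of_branch_input(forwarded_by_tank5, branch_input)
consumed_pct = 100.0 - forwarded_pct   # share consumed by the region
```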
Table 2 shows the variables and summary statistics from the period of water flow analyzed in this study for the tanks shown in
Figure 1,
Figure 2 and
Figure 3. The summary statistics are the minimum, maximum, and mean flow rate in lps and the percentage of not available (NA) or empty observations, relative to the total number of 15 min observations in the period of each variable. The analysis and models presented in this study were implemented in RStudio Version 2022.02.3 + 492 using R Version 4.2.0 on a 64-bit Windows 7 PC with 12 GB of RAM and an AMD A10-5800K processor.
The first step involved identifying missing values in the raw time series flow sensor data, as seen in
Figure 4. This process was repeated for each water tank variable presented in
Table 2.
Figure 4 illustrates the Tank 4 Inflow variable time series. The second step consisted of filling in the missing values of the original raw time series data, as seen in
Figure 5. The average of the neighborhood values around the missing data points in the time series was computed to fill in the missing values, using observations from an equal number of data points on both sides of a central missing value. The process presented in
Figure 5 was repeated for each water tank variable in
Table 2; the same Figure illustrates the filling of missing values for the Tank 4 Inflow variable.
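The neighborhood-average filling step can be sketched as below. This is an illustrative Python version under stated assumptions, not the authors' R code: the window size (`half_window`) is not given in the text and is a hypothetical parameter, and missing values are assumed to be encoded as `None` or NaN.

```python
import math


def is_missing(x):
    """Treat None and NaN as missing observations."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fill_missing(series, half_window=2):
    """Fill each missing value with the mean of the observed values found
    within `half_window` points on either side of it."""
    filled = list(series)
    for i, x in enumerate(series):
        if not is_missing(x):
            continue
        lo = max(0, i - half_window)
        hi = min(len(series), i + half_window + 1)
        neighbours = [series[j] for j in range(lo, hi)
                      if j != i and not is_missing(series[j])]
        if neighbours:   # gaps wider than the window are left untouched
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical inflow readings in lps with one missing observation.
raw = [10.0, 12.0, None, 14.0, 16.0]
clean = fill_missing(raw, half_window=2)
```

Only originally observed values are averaged, so a filled value never feeds into the estimate of a neighboring gap.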
3.2. Autoregressive Integrated Moving Average and Seasonal Autoregressive Integrated Moving Average
The overall methods used to model the water flow variables of the water distribution branch analyzed in this study are illustrated in
Figure 6. Initially, measurements of each water tank were collected, and then, the data were pre-processed to prepare them for modeling. The best model for each variable was used to forecast over two different periods: one day and one week. In the case of the TF models, time-ahead data were also included as an input variable for estimating the forecasts. Further details on this process are explained in subsequent sections of this research work. The theoretical background of the ARIMA models is described below.
ARIMA models are composed of a dependent variable $y_t$, which depends on past values $y_{t-1}, y_{t-2}, \ldots$ and an error term $\varepsilon_t$. Besides, these models are characterized by three elements: a moving average component, an autoregressive component, and a differencing (integration) component. The autoregressive component indicates that $y_t$ depends on one or multiple lagged values of $y_t$. The moving average component shows that $y_t$ depends on one or multiple lagged values of the error $\varepsilon_t$. Finally, the integration or differencing component indicates that the series should be stationary; computing the difference between neighboring observations in the time series accomplishes this [37].
The notation ARIMA($p$, $d$, $q$) represents the order of the ARIMA models, where $p$ is the order of the autoregressive component, $d$ is the order of the differencing (integration) component, and $q$ is the order of the moving-average process [38]. The backshift operator ($B$) can be used to define an ARIMA model as follows:

$$\phi(B)\,(1 - B)^d\, y_t = \theta(B)\,\varepsilon_t$$

where $y_t$ is the value of the series observed at time $t$; $B$ is the backshift operator; $\phi(B)$ is the autoregressive polynomial; $\theta(B)$ is the moving average polynomial; $\varepsilon_t$ are the error terms of the model. The error terms were assumed to be independent and identically distributed with a normal distribution and zero mean [24].
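To make the differencing and autoregressive components concrete, the following sketch applies the operator $(1 - B)^d$ to a short series and computes a one-step forecast for a hypothetical ARIMA(1,1,0) model. The series, the coefficient value, and the function names are assumptions for illustration; they are not the models fitted in this study.

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B)^d to a list of observations."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def forecast_arima_110(series, phi):
    """One-step forecast for ARIMA(1,1,0), i.e. (1 - phi*B)(1 - B) y_t = e_t.

    Expanding gives y_t = y_{t-1} + phi * (y_{t-1} - y_{t-2}) + e_t, so the
    forecast adds a fraction phi of the last observed change.
    """
    if len(series) < 2:
        raise ValueError("at least two observations are required")
    return series[-1] + phi * (series[-1] - series[-2])


# Hypothetical flow readings in lps with a linear trend.
flows = [100.0, 102.0, 104.0, 106.0]
first_diff = difference(flows, d=1)    # the trend becomes a constant
second_diff = difference(flows, d=2)   # the constant becomes zero
next_value = forecast_arima_110(flows, phi=0.5)
```

One round of differencing removes the linear trend, which is the sense in which the integration component helps make the series stationary.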
However, the Seasonal Autoregressive Integrated Moving Average was considered to model the seasonal component of the time series. In this regard, Seasonal ARIMA models were selected since, as shown in
Figure 4 and
Figure 5, the water flow time series have a seasonal component (i.e., the series exhibits a regular fluctuation), which appears every 96 observations, corresponding to one day of water flow measurements. This seasonal term makes the water flow time series nonstationary; therefore, the fitted models must include seasonal terms to account for it [
24].
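The effect of the 96-observation season can be illustrated with seasonal differencing, which subtracts from each observation the value recorded one day earlier. The sketch below uses a synthetic series (a repeated sinusoidal daily pattern plus a slow drift); the data and the function name are assumptions made for illustration only.

```python
import math

SEASON = 96  # observations per day at a 15 min sampling interval


def seasonal_difference(series, s=SEASON):
    """Apply (1 - B^s): subtract the observation taken one season earlier."""
    if len(series) <= s:
        raise ValueError("series must be longer than one season")
    return [series[t] - series[t - s] for t in range(s, len(series))]


# Synthetic three-day series: identical daily pattern plus a slow upward drift.
daily_pattern = [50.0 + 20.0 * math.sin(2 * math.pi * k / SEASON)
                 for k in range(SEASON)]
flows = [daily_pattern[t % SEASON] + 0.01 * t for t in range(3 * SEASON)]

deseasonalized = seasonal_difference(flows)
```

After the seasonal difference, the daily pattern cancels and only the drift accumulated over one day (0.01 × 96 = 0.96 in this synthetic example) remains.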
Seasonality implies that $y_t$ depends on lagged values of $y_t$ at a regular interval $s$. Seasonal ARIMA models consider the non-Seasonal ARIMA($p$, $d$, $q$) and three additional parameters labeled as $(P, D, Q)_s$ to account for the seasonality present in a time series. The $s$ term refers to the number of time steps corresponding to a single seasonal period. On the other hand, the term $P$ represents the order of the seasonal autoregressive component; the term $Q$ refers to the order of the seasonal moving average component; the term $D$ represents the order of the seasonal differencing component [38]. The mathematical representation of the Seasonal ARIMA models is shown in the next expression:

$$\Phi_P(B^s)\,\phi(B)\,(1 - B^s)^D (1 - B)^d\, y_t = \Theta_Q(B^s)\,\theta(B)\,\varepsilon_t$$

where $y_t$ is a seasonal time series; $\varepsilon_t$ is the Gaussian white noise process; $\phi(B)$ is the non-seasonal autoregressive polynomial; $\theta(B)$ represents the non-seasonal moving average polynomial; $(1 - B)^d$ is the non-seasonal differencing term; $(1 - B^s)^D$ is the seasonal differencing term. One key aspect is that, when $D = 1$, this is usually sufficient to ensure seasonal stationarity in the time series. $\Phi_P(B^s)$ represents a seasonal autoregressive polynomial; the term $\Theta_Q(B^s)$ is a seasonal moving average polynomial. Finally, $B$ is the backshift operator [38].
In general, the optimal ARIMA model parameters are determined by considering three criteria: (a) comparing Akaike’s information criterion (AIC) of the candidate models; (b) examining the auto-correlation function (ACF) to determine the number of moving average (MA) coefficients and the partial auto-correlation function (PACF) to determine the number of autoregressive (AR) coefficients; (c) plotting the series residuals to confirm that the error term is equivalent to white noise. The following sections describe the definitions and procedures to compute the ACF, PACF, and AIC in more detail.
3.3. Auto-Correlation Function and Partial Auto-Correlation Function
Auto-correlation can be defined as the degree of similarity of a time series with a lagged version of itself. Furthermore, the plot of a time series’ auto-correlations against lags is known as the auto-correlation plot. Thus, the so-called ACF shows the linear relationship between the observation $y_t$ at time $t$ and the observation at a previous time ($y_{t-k}$) that is separated from it by $k$ lags [38]. Taking the above into account, the mathematical representation of the ACF for a time series $y_t$ is shown in the expression below:
$$\rho_k = \frac{\operatorname{Cov}(y_t,\, y_{t-k})}{\sqrt{\operatorname{Var}(y_t)\,\operatorname{Var}(y_{t-k})}} \tag{3}$$
where $k$ is the lag, and it is defined as the difference in time between the observation $y_t$ and the observation $y_{t-k}$. The term $\rho_k$ denotes the correlation between the observations $y_t$ and $y_{t-k}$ that are separated by $k$ periods. The ACF serves to identify the order of the moving average component of an ARIMA model. Moreover, the ACF also allows analyzing the periodicity and detecting recurrence in a time series [24].
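A minimal sketch of how the sample auto-correlation could be computed is given next. It uses plain Python and the common estimator that divides the lag-k sum of products by the lag-0 sum of squares; the function name `acf` and the toy series are illustrative assumptions, not the code used in this study.

```python
def acf(series, max_lag):
    """Sample auto-correlation r_k for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    c0 = sum(d * d for d in deviations)  # lag-0 sum of squares
    result = []
    for k in range(max_lag + 1):
        # Sum of products between the series and its copy shifted by k steps.
        ck = sum(deviations[t] * deviations[t - k] for t in range(k, n))
        result.append(ck / c0)
    return result


# Toy series with a period of 4 samples (a stand-in for the 96-sample daily cycle).
toy = [0.0, 1.0, 0.0, -1.0] * 24
r = acf(toy, 8)
```

For this toy series, the auto-correlation is close to one at the seasonal lag (4) and close to minus one at half the period, which is the kind of recurrence that reveals a daily cycle in a flow series.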
On the other hand, the so-called partial auto-correlation or conditional correlation removes the effect of the intermediate observations when computing the correlation between two observations at different lags. In this case, the PACF is conditional on the intermediate observations of the time series, since their effect is taken out of the covariance computation. For instance, consider the PACF of two observations $y_t$ and $y_{t-2}$ (i.e., $k = 2$) [38]. The above can be expressed as shown below:
$$\phi_{22} = \frac{\operatorname{Cov}(y_t,\, y_{t-2} \mid y_{t-1})}{\sqrt{\operatorname{Var}(y_t \mid y_{t-1})\,\operatorname{Var}(y_{t-2} \mid y_{t-1})}} \tag{4}$$
where the term $\phi_{22}$ is the PACF between the observations $y_t$ and $y_{t-2}$. Notice that the covariance between $y_t$ and $y_{t-2}$ and the variance of $y_t$ and $y_{t-2}$ are conditional on the intermediate observation $y_{t-1}$ since the PACF removes the effect of the intermediate observations [38]. The computation of the PACF serves to identify the order of the autoregressive component of an ARIMA model.
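For lag 2, the partial auto-correlation reduces to the closed form (ρ2 − ρ1²)/(1 − ρ1²), where ρ1 and ρ2 are the auto-correlations at lags 1 and 2. For higher lags, one standard way to obtain the PACF from the ACF is the Durbin-Levinson recursion, sketched below in plain Python. The function name and the AR(1) check are illustrative assumptions; statistical packages may rely on other estimators.

```python
def pacf_from_acf(rho, max_lag):
    """Partial auto-correlations phi_kk for k = 1..max_lag (Durbin-Levinson).

    rho[k] must hold the auto-correlation at lag k, with rho[0] == 1.
    """
    pacf = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            current = [phi_kk]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.6: rho_k = 0.6 ** k.
rho_ar1 = [0.6 ** k for k in range(6)]
partial = pacf_from_acf(rho_ar1, 4)
```

In this check, only the first partial auto-correlation is different from zero, which is the cut-off pattern that points to a first-order autoregressive component.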
Figure 7 illustrates the Tank 4 Inflow ACF and PACF, from which it can be inferred that the series is not stationary. It is important to mention that the auto-correlation and partial auto-correlation functions are dimensionless; the above implies that they are independent of the scale of measurement of the analyzed time series [
24].
Given that the observations were taken every 15 min, and in agreement with the earlier visual inspection of the residuals, it was determined that the series becomes stationary by differencing at a lag of 96, which corresponds to the 96 observations in a single day. Therefore, according to the ACF and residuals of the ARIMA(1,0,0)(1,1,1)(96) model of Tank 4 Inflow presented in Figure 8, the model is adequate, since the residuals follow a normal distribution. Similarly, the remaining water flow variables of the system depicted in Table 2 were processed, and the ACF, PACF, and residual analyses were repeated.
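As an illustration, the ARIMA(1,0,0)(1,1,1)(96) structure can be written out explicitly. The expansion below is a sketch that assumes the convention in which the moving average polynomial carries a positive sign (other references, including Box and Jenkins, write it with a negative sign). Here $y_t$ is the water flow, $\varepsilon_t$ is the error term, $B$ is the backshift operator, and $\phi_1$, $\Phi_1$, and $\Theta_1$ are placeholder coefficients rather than the fitted values:

$$(1 - \Phi_1 B^{96})\,(1 - \phi_1 B)\,(1 - B^{96})\, y_t = (1 + \Theta_1 B^{96})\,\varepsilon_t$$

Writing $w_t = y_t - y_{t-96}$ for the seasonally differenced series, this is equivalent to

$$w_t = \phi_1 w_{t-1} + \Phi_1 w_{t-96} - \phi_1 \Phi_1 w_{t-97} + \varepsilon_t + \Theta_1 \varepsilon_{t-96}$$

so each day-over-day change depends on the change one step (15 min) earlier, on the change at the same time one day earlier, and on the error made one day earlier.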
As previously stated, the ACF and PACF correlogram analysis is required to determine the components of an ARIMA model. Whether the time series is stationary can be determined by looking at the residual plots. For a stationary time series, the ACF drops to zero after a few lags. However, the ACF of a nonstationary time series will decline slowly or even increase [24]. Following the initial analysis and multiple iterations, several candidate models were proposed. The best model was then chosen among these candidates by comparing their residuals and information criteria.
The AIC, mean absolute percentage error (MAPE), and root-mean-squared error (RMSE) criteria were used to measure the performance of each model. The residuals were then examined for the model’s diagnosis, and if the model was satisfactory, it could be used to forecast; otherwise, additional models must be tested [
24].
Figure 9 shows the methodology to generate an ARIMA model via the Box–Jenkins approach. The following section describes the metrics utilized in this study in further detail.
3.4. Evaluation Metrics
The fitting procedure of the resulting ARIMA models was assessed with the aid of the AIC, as shown in Figure 9. An information criterion measures how well a model explains the data relative to its complexity. The AIC is a common choice that enables the assessment of the quality of the models by rewarding those with small errors while penalizing those with too many parameters [38]. Thus, this criterion allows the selection of sparse models [39]. The mathematical representation of the AIC is shown in the following expression:
$$\mathrm{AIC} = 2m - 2\ln(L) \tag{5}$$
where $L$ represents the maximized value of the likelihood function and $m$ is the total number of parameters of the model. A lower value of the AIC represents a better model, that is, one with a higher likelihood value for a given number of parameters. Compared to other metrics, such as the Bayesian information criterion, whose penalty grows with the logarithm of the sample size, the AIC imposes a smaller penalty on the number of parameters for all but very small samples.
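The model comparison through the AIC can be sketched in a few lines. The log-likelihood values and parameter counts below are hypothetical placeholders chosen only to illustrate the trade-off between fit and complexity; they are not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2 * (number of parameters) - 2 * ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of estimated parameters).
candidates = {
    "ARIMA(1,0,0)(0,1,1)": (-2870.0, 3),
    "ARIMA(1,0,0)(1,1,1)": (-2866.0, 4),
    "ARIMA(2,0,1)(1,1,1)": (-2865.5, 6),
}

scores = {name: aic(loglik, k) for name, (loglik, k) in candidates.items()}
best = min(scores, key=scores.get)  # the lowest AIC wins
```

In this made-up comparison, the largest model has the highest likelihood but is not selected, because its small gain in fit does not offset the penalty for its two extra parameters.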
On the other hand, for this work, the MAPE and RMSE were used to measure the error and to have a numerical comparison of the effectiveness of the proposed models after selecting the best through the AIC. The RMSE and MAPE were used to compare the accuracy of the model’s forecast to the actual values, with a lower value indicating a better fit [
24,
38,
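40]. The equations of these indicators are shown as follows:
$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right| \tag{6}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^{2}} \tag{7}$$
where $y_t$ is the observed value and $\hat{y}_t$ the predicted value at time $t$; $n$ is the number of forecast time steps.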
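Both indicators are straightforward to compute. The sketch below is a plain Python illustration with made-up numbers; returning `None` when an observed value is zero is a design choice of this example, made because the percentage error is undefined in that case.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error between observations and forecasts."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute percentage error in percent; None if any observation is zero."""
    if any(o == 0 for o in observed):
        return None  # division by zero: the MAPE is undefined
    n = len(observed)
    return 100.0 / n * sum(abs((o - p) / o) for o, p in zip(observed, predicted))


# Made-up flow values and forecasts, for illustration only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
error_rmse = rmse(observed, predicted)
error_mape = mape(observed, predicted)
```

The RMSE is expressed in the units of the flow measurement, whereas the MAPE is scale-free, which makes it easier to compare across tanks as long as no observed value is zero.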
A one-week period was chosen to evaluate the Seasonal ARIMA models. The models were tested within different time frames and assessed on various dates. Two data transformations were considered to transform a nonstationary time series into a stationary series and use the Box–Jenkins methodology: first, differencing, and second, differencing combined with a natural logarithm transformation. Differencing makes a nonstationary time series stationary by calculating the difference between two consecutive observations, whereas the natural logarithm stabilizes the variance of the time series. After comparing and analyzing the resulting ACF and PACF of the water flow time series, some preliminary models that follow the patterns and methodology of Box–Jenkins were proposed. The AIC was calculated for each fitted model to choose the optimal one [41].
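The two transformations can be sketched as follows. The example is a plain Python illustration on a synthetic series with a 96-sample daily pattern; the function names and the synthetic data are assumptions made for this sketch, and the logarithm requires strictly positive values.

```python
import math


def difference(series, lag=1):
    """Difference a series at the given lag: z_t = y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def log_difference(series, lag=1):
    """Natural logarithm followed by differencing (values must be positive)."""
    return difference([math.log(value) for value in series], lag)


SEASON = 96  # observations per day at a 15 min sampling interval

# Synthetic flow: three identical days with a smooth daily pattern.
daily_shape = [10.0 + 5.0 * math.sin(2 * math.pi * i / SEASON) for i in range(SEASON)]
flow = daily_shape * 3

seasonal_diff = difference(flow, SEASON)  # removes the repeating daily pattern
growth = log_difference([1.0, 2.0, 4.0])  # constant ratio -> constant log-difference
```

Differencing at lag 1 compares consecutive observations, while differencing at lag 96 compares each observation with the one taken at the same time on the previous day, which removes a stable daily pattern.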
Moreover, to evaluate the forecast of each model, one day (equivalent to 96 observations) and one week (corresponding to 672 observations) were examined, with a confidence interval of 95%.
Figure 10 shows an example of the forecast of one week in the future for the Seasonal ARIMA models for the Tank 4 Inflow time series. The forecast’s confidence interval was calculated at 95% by obtaining the standard errors of the estimates as described in the Box–Jenkins methodology [
24].
3.5. Transfer Function Models
This section presents the theoretical background and methodology for developing TF models based on the Box–Jenkins approach. TFs are models that combine a causal approach and a time series approach. The time series $x_t$ affects the time series $y_t$ through a TF, which distributes the impact of $x_t$ over several future periods. The resultant TF model connects the output series ($y_t$), the input series ($x_t$), and a noise term ($N_t$). The noise term is included since, in practice, the response of a system can be affected by disturbances and noise induced by the environment, which corrupt the system’s output by an amount $N_t$. Hence, a TF is equivalent to a response function. The mathematical representation of TF models can be written in terms of the backward operator $B$, as shown in Equation (8) [24].
$$y_t = \frac{\omega_s(B)}{\delta_r(B)}\,B^{b}\,x_t + N_t \tag{8}$$
where $y_t$ is the output of the system at time $t$; $x_t$ is the input of the system at time $t$; $B$ is the backshift operator; $\omega_s(B)$ is an $s$-order polynomial operator; $\delta_r(B)$ is an $r$-order polynomial operator; $B^{b}$ is a $b$-order dead time operator, which indicates the number of periods before any effect is discernible; finally, $N_t$ is the amount of noise to which the system is susceptible. The terms ($b$, $s$, $r$) are integers greater than or equal to zero. The term $\omega_s(B)$ controls the effect of current and previous input values in the system’s response. On the other hand, the term $\delta_r(B)$ controls the effect of previous output values in the system’s response [42].
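To make the roles of the dead time and of the two polynomials concrete, the following sketch simulates a noise-free TF in plain Python. It assumes that the output polynomial is written as 1 − δ1·B − … − δr·B^r and, for simplicity, that all input coefficients enter with a positive sign; sign conventions differ between references, and the function name and coefficient values are hypothetical.

```python
def simulate_tf(x, omega, delta, b):
    """Noise-free transfer function response.

    y_t = sum_i delta[i] * y_{t-1-i} + sum_j omega[j] * x_{t-b-j}
    omega: s + 1 input coefficients, delta: r output coefficients, b: dead time.
    """
    y = []
    for t in range(len(x)):
        value = 0.0
        for i, d in enumerate(delta):  # effect of previous outputs
            if t - 1 - i >= 0:
                value += d * y[t - 1 - i]
        for j, w in enumerate(omega):  # effect of delayed inputs
            if t - b - j >= 0:
                value += w * x[t - b - j]
        y.append(value)
    return y


# Unit impulse at t = 0 fed to a TF with b = 2, s = 0, r = 1 (hypothetical values).
impulse = [1.0] + [0.0] * 7
response = simulate_tf(impulse, omega=[2.0], delta=[0.5], b=2)
```

The response stays at zero for two periods (the dead time), jumps to the input coefficient, and then decays geometrically at the rate set by the output coefficient.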
A TF estimation of the system based on the Box–Jenkins methodology was developed, and it was motivated by the correlation between the system variables displayed in
Figure 3, specifically the input and output flow of each water tank.
Figure 11 illustrates the overall procedure for fitting, validating, and forecasting with TF models based on the Box–Jenkins approach. The key steps in this process are the definition of the input and output ARIMA models, the prewhitening of both series (i.e., the removal of serial correlation from the series before they are compared), and the calculation of the cross-correlation to identify the preliminary and final parameters of the tentative models. The models’ diagnostics were then evaluated, and if a model was adequate, it could be used for forecasting; if not, a different model had to be proposed [24].
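The idea behind prewhitening can be sketched with the simplest possible filter. The example assumes, purely for illustration, that the input series follows an AR(1) model with a known coefficient `phi`; in this study, the filters came from the fitted Seasonal ARIMA models instead. In the usual Box–Jenkins procedure, the filter that whitens the input is also applied to the output, so that the cross-correlation between the filtered series is not distorted by the serial correlation of the input.

```python
def ar1_filter(series, phi):
    """Apply the filter (1 - phi * B) to a series; the first observation is dropped."""
    return [series[t] - phi * series[t - 1] for t in range(1, len(series))]


# Hypothetical AR(1) input built from known shocks, so the result can be checked.
phi = 0.5
shocks = [1.0, -0.5, 0.25, 2.0, -1.0]
x = [shocks[0]]
for shock in shocks[1:]:
    x.append(phi * x[-1] + shock)

# Hypothetical output: the input scaled by 2 and delayed by one step.
y = [0.0] + [2.0 * value for value in x[:-1]]

alpha = ar1_filter(x, phi)  # prewhitened input: recovers the shocks
beta = ar1_filter(y, phi)   # output passed through the same filter
```

After filtering, the input is reduced to its uncorrelated shocks and the output is reduced to the same shocks scaled by 2 and delayed by one step, so the relationship between the two series becomes directly visible.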
The dependent (output) and independent (input) variables were modeled via ARIMA models. The ARIMA models used for the TF were the same as those developed for each water flow variable, as explained in Section 3.2. Then, the input and output series generated by the fitted ARIMA models were prewhitened. Subsequently, the series were cross-correlated to find the relationship between the lags, or the effect of $x_t$ over $y_t$. The cross-correlation function is represented as shown in Equation (9) [24].
$$\rho_{xy}(k) = \frac{\gamma_{xy}(k)}{\sigma_x\,\sigma_y} \tag{9}$$
where $\rho_{xy}(k)$ is the cross-correlation function of a stationary bivariate process; $\gamma_{xy}(k)$ is the cross-covariance coefficient between the series $x_t$ and $y_t$ at lag $k$; $\sigma_x$ is the standard deviation of the $x_t$ series; $\sigma_y$ is the standard deviation of the $y_t$ series. The cross-covariance function $\gamma_{xy}(k)$ of a stationary bivariate process is defined as shown in Equation (10) for lags $k = 0, \pm 1, \pm 2, \ldots$.
$$\gamma_{xy}(k) = E\left[(x_t - \mu_x)(y_{t+k} - \mu_y)\right] \tag{10}$$
Here, $\mu_x$ and $\mu_y$ denote the means of the $x_t$ and $y_t$ series, respectively.
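A sample version of the cross-correlation function can be sketched in plain Python as shown next. The estimator pairs each input value with the output value k steps later and normalizes by the sample standard deviations; the function name and the toy series (an output that simply repeats the input three steps later) are assumptions made for illustration.

```python
import math


def ccf(x, y, max_lag):
    """Sample cross-correlation r_xy(k), k = 0..max_lag, pairing x_t with y_{t+k}."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [value - mean_x for value in x]
    dy = [value - mean_y for value in y]
    sx = math.sqrt(sum(d * d for d in dx) / n)
    sy = math.sqrt(sum(d * d for d in dy) / n)
    values = []
    for k in range(max_lag + 1):
        cxy = sum(dx[t] * dy[t + k] for t in range(n - k)) / n  # cross-covariance
        values.append(cxy / (sx * sy))
    return values


# Toy input with two pulses; the output repeats the input three steps later.
x = [0.0] * 40
x[10], x[25] = 1.0, -1.0
y = [0.0] * 3 + x[:-3]

r_xy = ccf(x, y, 8)
peak_lag = r_xy.index(max(r_xy))
```

The cross-correlation is zero at every lag except lag 3, where it reaches one, which recovers the delay that was built into the toy output.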
The importance of the cross-correlation function of the prewhitened input and output series is that it provides an estimate of the impulse response of the system. This impulse response estimate serves to identify the order of the $\omega_s(B)$ and $\delta_r(B)$ polynomials, as well as the order of the dead time operator ($B^{b}$) that should be used to fit the TF model. As in other areas such as signal processing and system analysis, the impulse response is the graphical or mathematical representation of the output of a system or a model in response to a brief input signal or impulse. The impulse response provides valuable insights into the system’s behavior, including its frequency response, stability, and the effect of the input signal on the output [43].
The plot pattern of the cross-correlation function determines the values of $b$, $s$, and $r$, which, according to the Box and Jenkins [24] notation, are the parameters for a TF ($b$, $s$, $r$) model. The parameters $b$ and $s$ determine the number of lagged terms of $x_t$ that enter into the TF. The value of $b$ is determined by the first lag significantly different from zero in the cross-correlation plot. The $s$ term is established by how long $x_t$ influences $y_t$ after the first significant lag. The $r$ value represents how long the output series ($y_t$) is connected with the prior values of the output series. The value of $r$ can be set by analyzing the auto-correlation plot or determined by the pattern that the cross-correlation plot follows after the significant lags; if it has an exponential decay, then $r = 1$ could provide an appropriate approximation of the TF, and if it has a sine wave plot pattern, then $r = 2$ could provide an approximation of the TF [24].
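Reading the dead time off a cross-correlation plot can be approximated programmatically. The sketch below assumes the common rule of thumb that a cross-correlation between prewhitened series is significant when its absolute value exceeds about 2/√n, with n the number of observations; the function name and the cross-correlation values are hypothetical and only illustrate the rule.

```python
import math


def first_significant_lag(ccf_values, n_obs):
    """Return the first lag whose |cross-correlation| exceeds 2 / sqrt(n_obs), else None."""
    bound = 2.0 / math.sqrt(n_obs)
    for lag, value in enumerate(ccf_values):
        if abs(value) > bound:
            return lag
    return None


# Hypothetical cross-correlations for lags 0..5, computed from one week of data.
example_ccf = [0.02, 0.05, 0.41, 0.30, 0.12, 0.03]
dead_time = first_significant_lag(example_ccf, n_obs=672)
```

With 672 observations, the bound is roughly 0.08, so the first significant lag in this made-up example is lag 2, which would suggest a dead time of two periods as a starting point for the tentative model.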
The input Tank 1 Inflow and the Tank 2 Inflow series are cross-correlated in Figure 12, which shows that the fifth lag is the most significant. Nevertheless, Lag 0 is the first lag that deviates sufficiently from zero. It is also clear that five lags (from Lag 0 to Lag 5) separate the first significant lag from the most significant one. Finally, the plot appears to follow a sine wave. Consequently, a TF with parameters (2,3,0) could be a first model proposal.
The fourth step used the fitted TF models to forecast one day and one week in the future. The forecast’s confidence interval was also calculated, as can be seen in
Figure 13.
3.6. Anomaly Detection in Water Distribution Branches Methodology
After generating and using the models for forecasting, the anomaly detection procedure involved comparing the observed values with the 95% confidence interval of the model’s forecast. This work assumed that an anomaly in the water branch manifests as a measured water flow that falls outside the confidence interval of the model’s forecast.
Figure 14 shows the general methodology for the data evaluation for anomaly detection in water distribution branches.
The forecast values, the confidence interval, and the following day and week observations are needed by the methodology shown in
Figure 14 before any other feedback processes can begin. First, it determines whether the observed values for each model are within the confidence interval of the forecast; if they are, it goes back to the forecast stage and compares the subsequent observations with the subsequent prediction. In cases with missing data points, it is plausible that they are due to various factors, such as interruptions in sensing or communication caused by issues with the energy supply at the station, sensor malfunctions, intermittent data transmission problems, or failures in the database. When these data points are missing, the corresponding alert is labeled as “not available” to indicate the absence of data.
On the other hand, if an observation exists and falls outside the confidence interval, it can indicate two main possibilities. Firstly, it could suggest a potential measurement error where the sensor may have malfunctioned and provided an incorrect reading. Alternatively, it may indicate a genuine change in the water system’s behavior, potentially caused by external factors such as water leakages, variations in water demand, hydraulic system issues, water availability, or operational actions. To facilitate the detection and understanding of different types of anomalies, this methodology defines two key alerts: potential measurement error and potential water leakage. When new observations significantly exceed the confidence interval’s upper limit, a potential measurement error is flagged, which is more likely than a water leakage, since a water flow rate higher than what the source can supply is not feasible. It is also possible that the sensor only malfunctioned briefly if the alarm is not persistent. Conversely, if the new observation of water inflow falls below the confidence interval’s lower limit, it suggests a potential water loss, indicating the possibility of leaks occurring between the water tanks. This inference is drawn from the observed water inflow being below the expected range, although a misread by the sensor cannot be completely ruled out. Incorporating these alerts into the methodology makes identifying and categorizing anomalies easier, leading to improved system operation and early detection of potential issues.
Next, the persistence of the warnings is examined; if they are persistent and deemed invalid by the user or another qualified individual, the model must be redesigned because the old model does not account for the new observations. Additionally, the distribution of the series could have changed, necessitating a return to the model-generation stage.
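The alert logic of this methodology can be summarized in a short sketch. The code below is a plain Python illustration, not the implementation used in this study: the function names are hypothetical, a missing observation is represented by `None`, and the 95% interval is approximated as the forecast plus or minus 1.96 standard errors, which assumes normally distributed forecast errors.

```python
def interval_95(forecast, std_error):
    """Approximate 95% interval: forecast +/- 1.96 standard errors (normal errors assumed)."""
    half_width = 1.96 * std_error
    return forecast - half_width, forecast + half_width


def classify(observed, lower, upper):
    """Map one observation to an alert label following the methodology's three cases."""
    if observed is None:
        return "not available"
    if observed > upper:
        return "possible measurement error"
    if observed < lower:
        return "possible leakage"
    return "no alert"


# Made-up forecast, standard error, and observations for a few time steps.
lower, upper = interval_95(forecast=100.0, std_error=5.0)
observations = [101.0, None, 120.0, 80.0]
alerts = [classify(value, lower, upper) for value in observations]
```

Counting how many consecutive time steps carry the same label is one simple way to judge whether an alert is persistent before deciding that the model needs to be regenerated.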
4. Results
The errors from the Seasonal ARIMA models were calculated to select the best model for each variable. On average, 8 to 12 iterations were required to generate different models and compare their AICs in order to find the best-fitted model for each water flow variable. A Seasonal ARIMA model was generated for each water tank’s inflow and outflow.
Table 3 shows the Seasonal ARIMA models selected for each water tank in
Figure 2 and the resulting AIC, RMSE, and MAPE.
Then, the models were used to forecast one day and one week ahead, and the forecasts were compared with the actual observations to calculate the forecasting MAPE and RMSE. The RMSEs and MAPEs that each model obtained for the one-day and one-week forecasts are presented in
Table 4.
On the other hand,
Table 5 presents the AICs, MAPEs, and RMSEs of the best fitted TF models for each variable. Only the possible and correlated water flow variables were used to generate the TF models based on the system presented in
Figure 3. In addition,
Table 5 shows the order of the polynomials and the dead time operator of each TF model, with the corresponding RMSE and MAPE values computed over the data interval used for generating each model.
Furthermore,
Table 6 presents the MAPEs and RMSEs of the one-day and one-week forecasts of the fitted TF models. The models with the lowest MAPE were chosen and used in the data-evaluation stage. An asterisk marks the variables and error values that were lower than those obtained for the corresponding Seasonal ARIMA model.
Based on the results presented in
Table 4 and
Table 6, the best models for each water tank inflow and outflow were selected based on the MAPE values; these models were used to develop the anomaly-detection methodology presented in
Section 3.6. The shorthand notation for the model and the corresponding mathematical models, including the parameters and coefficients, are shown in
Table 7.
In the last stage, the new data were assessed, and any anomalies were found using the forecasting values that were produced. First, the top models for each variable were chosen from the previous step. These models match those displayed in
Table 7. Then, the models were used to forecast over the short and medium term (i.e., one day and one week). Finally, the upper and lower bounds of the 95% confidence interval of each model’s forecast were the limits for assessing whether a new observation was an anomaly. The notifications were grouped into three categories: possible leakage, possible measurement error, and not available (NA) data point.
Figure 15 presents the forecasting of the fitted models presented in
Table 7 for each analyzed water flow variable. The orange line represents the model’s forecast. The blue lines represent the new observations. The gray zone represents each model’s 95% confidence interval.
Table 8 presents the alerts generated for each variable, providing information on the final model utilized, the three types of potential alerts, and the total count of alerts. The results shown in
Table 8 are based on the methodology presented in
Figure 14.
5. Discussion
The AICs, RMSEs, and MAPEs shown in
Table 3 were computed by comparing the fitted models with the data observations (one week) used to generate the Seasonal ARIMA models. On the other hand,
Table 4 shows the RMSEs and MAPEs obtained by comparing the fitted model with the observations used for forecasting. By comparing both tables, it can be appreciated that the errors were lower when comparing the model with the observations used for generating the models than the error obtained when comparing with the observations used for forecasting. Despite the above, the difference between the fit and prediction errors was low for most models. The smaller error obtained with the fit data compared to the prediction data suggested a slight overfit.
In addition, it can be appreciated that the order of the seasonal components of the fitted models was the same and that all of them required a first-order seasonal difference and moving average component. On the other hand, only the Tank 4 Inflow sensor data modeling required a first-order seasonal autoregressive component. In the case of the non-seasonal part of the fitted ARIMA models, it can be appreciated that certain heterogeneity existed in the order and the components that each of the input and output water flow variables required. The above could be attributed to the difference in the dynamics of the studied water tanks initially presented in
Figure 2.
Moreover, based on the results presented in
Table 4, it can be observed that the fitted Seasonal ARIMA models had a lower MAPE and RMSE for the one-day forecast than for the one-week forecast. Hence, the models were better at forecasting over short periods. This finding was useful to define the remodeling period and forecast horizon for future use of the methodology. The only exception, in which the RMSE and MAPE were lower for the one-week forecast, was the Seasonal ARIMA model fitted for the Tank 4 Outflow. For both forecast periods, this variable presented a greater MAPE and RMSE than the rest of the water tanks. The new observed values for the next day were very extreme and differed from those used for the model generation. Nevertheless, over the week, the new observations were closer to the previous data. This could mean that the sensor of this variable had issues and, in some periods, was not working accurately. Furthermore, no MAPE was calculated for Tank 5 Outflow because the nature of the variable data did not allow it; as can be seen in Table 2, the variable had a negative minimum flow and a positive maximum flow, and some of the observed values were zero, so the division in Formula (6) is undefined.
In the case of the fitted TF models shown in
Table 5, it can be appreciated that the order of the polynomials and the dead time operator that provided the lowest AIC value were different for each tank. Moreover, all TF models were influenced by past input values since they all had an $s$th-order polynomial. However, not all models were influenced by the past values of the output since the order $r$ of the output polynomial equaled zero in some of them, as in the models for Tanks 4 to 6, as shown in
Table 5. On the other hand, by comparing the errors of the TF models presented in
Table 5 with the ones shown in
Table 6, it can be appreciated that the errors of
Table 5 were lower. The above was expected since the errors presented in
Table 5 were computed with the same data used to fit the model. Despite the above, the errors were similar. Like the Seasonal ARIMA models, the obtained MAPE and RMSE values of the TF models, when used for forecasting, were generally lower for one-day forecasting than one-week forecasting, except for the first two models (see
Table 6). Furthermore, only the Tank 3 Inflow TF model obtained an AIC value lower than that of its corresponding Seasonal ARIMA model (5536.81 and 5744.72, respectively). The above suggested a lower fitting error and a sparser model. However, even though the AIC values of the Seasonal ARIMA models were lower than those of the TF models for the remaining variables, the former could be overfitted when selected through the AIC [
44].
Based on the results reported in
Table 7, it can be appreciated that the water flow data can be modeled better by a Seasonal ARIMA according to the reported AIC, RMSE, and MAPE values. Moreover, as shown in
Table 7, the TF models fitted through the Box–Jenkins methodology were less helpful in modeling the analyzed water branch’s flow than the ARIMA models. In addition, it can be appreciated that, in general, the generated models had a low number of coefficients, which could facilitate a physical implementation of the proposed system since the computation that this type of model requires could be lower due to their low complexity.
Figure 15 shows the forecasting values of each final model presented in
Table 7. The graphs present a one-week forecast, with the initial portion of the forecast (from Hour 168 to 172) representing the first forecast day, which served as the basis for the results reported in
Table 7. Based on the forecasting values presented in
Figure 15, the different dynamical behaviors of each flow variable of the analyzed branch can be appreciated visually. This visual representation enhances the understanding of and facilitates the alert process for users, enabling them to discern patterns and trends more effectively. In addition, it can be observed that the forecast values and observed values (actual measured water flow) were similar for most of the models. Furthermore, it is evident that the 95% confidence interval demonstrated a dynamic behavior across each model and did not remain constant throughout the analyzed forecasting week. In some cases, the gradual increase in the 95% confidence interval size could be attributed to either the absence of the moving average component in the model or the inherent increase in uncertainty as time progressed. These factors contributed to the widening of the confidence interval, indicating the challenges in accurately forecasting values over an extended period. In most cases, the selected model behaved similarly to the actual observations.
Finally, from
Table 8, it can be seen that the variables with the worst (largest) MAPE and RMSE errors generated the most alerts, as was the case for Tank 4 Outflow, Tank 4 Inflow, and Tank 5 Inflow, whose MAPEs were noticeably higher (more than 10%) than those of the rest of the models, and for Tank 5 Outflow A, which had the largest RMSE in both forecast periods (one day and one week). Tank 1 Inflow alerts could be considered a particular case because there were no alerts in one day, but there were many possible measurement errors in one week. The above could be due to a sensor malfunction over a long period or a change in data behavior, which might require remodeling. Although Figure 15 shows slight anomalies in the long-term behavior of Tank 3 Inflow, Tank 6 Inflow, and Tank 5 Outflow B, they were not detected, as presented in Table 8, because they still fell within the 95% confidence interval. Therefore, the detection of alerts depends on the system’s tuning, such as setting a narrower confidence interval or a shorter forecast period (e.g., 1 d, 12 h, etc.). Additionally, more potential measurement errors and not available observations were identified than possible leakages in both forecast periods, indicating the need to verify the calibration and proper functioning of sensors. This information is valuable in justifying investment in equipment maintenance and highlighting the affected areas.
The resulting ARIMA and TF models could be considered less-complex models than the ones produced by other machine learning algorithms proposed in the literature, such as XGBoost [
18], ANNs [
16], and CNNs [
34], due to their reliance on a solid mathematical and statistical background with well-defined interpretations and the lower number of parameters that they have, as shown in
Table 7. In addition, contrary to the studies presented in
Section 2,
Table 7 shows the coefficients and the mathematical representations of the fitted ARIMA and TF models. Moreover, the steps for modeling the ARIMA and TF models are well-defined and based on the specific assumptions of data stationarity, linearity, and the independence of residuals, providing a more-transparent framework and guidance in the modeling process. Finally, the models are suitable for forecasting based on historical patterns of water flow variables, making them a practical tool for anomaly detection in water distribution systems.
A comparison of the methodology presented in this study with the related literature presented initially in
Section 2 in terms of the input data, machine learning algorithm, and analyzed system is presented in
Table 9. In this regard, it can be observed that heterogeneity existed between the related research and the present study. Authors have proposed to detect leakages from pressure measurements, water flow measurements, acoustic emission, and vibration data; on the other hand, this study based its analysis exclusively on water flow data. Another aspect that can be observed is the frequent use of CNNs to perform leakage detection; other approaches involved tree-based techniques such as decision trees, random forest, XGBoost, and Adaboost. In contrast, the present study focused on using Seasonal ARIMA and TF models, which are less complex than CNNs and tree-based classifiers. Finally, the analyzed systems varied from study to study, with water distribution networks being frequently analyzed. Public datasets and simulation tests of water pipelines have also been considered. In the case of the present work, the water flow data came from a branch of the water distribution system of Mexico City that supplies water to the sub-branch of water pipelines (see
Figure 1 and
Figure 2).
Nevertheless, it is difficult to perform a homogenous comparison of the results presented in this study with the related research. This is because the systems analyzed to develop the water-leakage-detection algorithms and the data collected varied from study to study. As previously stated, the authors proposed performing water leakage detection through simulation, laboratory tests, and data collected from water distribution systems. Nonetheless, each analyzed system had different data distributions, which impacted the type of algorithms that best describe the data. Furthermore, most of the works in the literature presented in
Table 9,
Section 2, and the Introduction Section employed supervised learning techniques to train the detection algorithms. On the other hand, this study tackled the problem from an unsupervised point of view since labeled data that classify the anomalies as water leakages or measurement errors were not available when developing this work. However, the above points out an area of opportunity that needs to be addressed in future work.
6. Limitations of the Study
This work was constrained by the branch’s existing infrastructure and data availability, limiting it to a single case study focused solely on one operational variable, water flow. In future work, it is proposed to test the methodology in other case studies and with other operational variables such as pressure, tank level, and more flow points. As shown in Figure 3, some tanks have other water exits whose flow rates were not available in this dataset but are valuable variables that could help to generate a better model and understanding of water usage in the system. Additional variables could be incorporated with the help of multimodal techniques that consider the flow time series data and the pressure data measured in each of the tanks of the water branch. The above could be assessed through multimodal machine learning techniques such as model-agnostic (i.e., the fusion is carried out before applying the machine learning technique) and model-based (i.e., the fusion of the modalities is performed while generating the model) methods [
45].
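As a minimal sketch of the model-agnostic case, the flow and pressure series of one tank could be standardized and stacked into a joint feature vector per timestamp before any detection model is applied. The function names and the z-score scaling are assumptions made for illustration; they are not part of the methodology of this study.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a series; a constant series maps to zeros (assumed convention)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def early_fusion(flow, pressure):
    """Model-agnostic fusion: one [flow, pressure] feature row per timestamp."""
    if len(flow) != len(pressure):
        raise ValueError("flow and pressure must be sampled at the same timestamps")
    return [list(row) for row in zip(zscore(flow), zscore(pressure))]
```

Scaling each modality first keeps the variable with the larger physical units from dominating the fused representation.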
Furthermore, the methodology could not be validated with real leakages because a report of the actual leakages (detected or repaired) was unavailable when the models and the study were developed. A related limitation is that validating the detected anomalies requires a physical implementation of the proposed algorithms, including the selection and design of an appropriate hardware platform in which they can be embedded and executed. In addition, a crucial challenge of data-driven models such as ARIMA models is that, if the dynamics of the system change, the fitted models may not work as expected, since the distribution of the data used for fitting may no longer hold. A potential solution is to fine-tune the models over time so that their parameters remain updated when the dynamics of the water distribution branch analyzed in this study change.
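One way to decide when such an update is due is to track the forecast error over a recent window and flag the model once the error stays above a tolerance. The sketch below is illustrative only: the class name, window length, and threshold are hypothetical and would have to be tuned per flow variable. A flag should also be reviewed by an operator, so that a persistent leak is not absorbed into a refitted model.

```python
from collections import deque


def mape(actual, forecast):
    """Mean absolute percentage error (in %), skipping zero-valued observations."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every observation is zero")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


class RefitMonitor:
    """Flags when the recent forecast error suggests the fitted model is stale."""

    def __init__(self, window=96, threshold_pct=15.0):
        # window and threshold_pct are illustrative assumptions, not study values.
        self.actual = deque(maxlen=window)
        self.forecast = deque(maxlen=window)
        self.threshold_pct = threshold_pct

    def update(self, actual, forecast):
        """Store one observation/forecast pair; return True if a refit is advised."""
        self.actual.append(actual)
        self.forecast.append(forecast)
        # Wait for a full window, and skip windows in which no flow was recorded.
        if len(self.actual) < self.actual.maxlen or not any(self.actual):
            return False
        return mape(self.actual, self.forecast) > self.threshold_pct
```

Each new measurement and its one-step-ahead forecast would be passed to `update`; a returned `True` marks the point at which refitting should be considered.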
Furthermore, due to the limited and incomplete dataset, this study focused on linear models such as the ARIMA and TF models; however, water distribution systems might have non-linear dynamics that the proposed methods could not capture. Hence, deep learning algorithms such as recurrent neural networks or long short-term memory neural networks could be compared with the proposed ARIMA and TF models in terms of performance. However, deep learning solutions often require a large number of training samples to be trained adequately and to avoid overfitting. This could be mitigated by combining deep learning solutions with transfer learning techniques.
In addition, another potential disadvantage of the fitted ARIMA models is the need to filter the time series through differencing. Although differencing is essential to produce a stationary time series, it can also introduce certain biases related to the dynamics of the analyzed system, since it acts as a high-pass filter on the time series. This could be mitigated with hybrid techniques that combine methods for nonstationary time series, such as wavelet analysis, with ARIMA models. For instance, Nury et al. [
40] proposed a wavelet-ARIMA model for temperature prediction in Bangladesh to account for the nonstationary behavior of the analyzed temperature time series data.
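The high-pass behavior of this filtering step can be checked numerically for first-order differencing, y[t] = x[t] - x[t-1], whose frequency response is 1 - exp(-jω) with magnitude 2|sin(ω/2)|. The helper names below are introduced only for this illustration.

```python
import cmath
import math


def first_difference_gain(omega):
    """Magnitude response of y[t] = x[t] - x[t-1] at frequency omega (rad/sample)."""
    return abs(1 - cmath.exp(-1j * omega))


def difference(series, lag=1):
    """Lag-`lag` differencing, as used for the integrated part of (seasonal) ARIMA."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


if __name__ == "__main__":
    # Gain grows from 0 at omega = 0 (constant level) to 2 at omega = pi.
    for omega in (0.0, math.pi / 4, math.pi / 2, math.pi):
        print(f"omega={omega:.3f}  gain={first_difference_gain(omega):.3f}")
```

Slow components such as a constant level are removed entirely, while the fastest fluctuations are amplified by a factor of two, which is why differencing can emphasize measurement noise relative to the slower dynamics of the system.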
Additionally, due to the complexity of the water branch analyzed in this study, the eleven flow measurements (i.e., the input and output flows of each tank) required fitting eleven ARIMA models. This is a potential drawback of the proposed anomaly detection system, since it requires at least two models to detect anomalies for a single water tank. Thus, validating the proposed models could be a time-consuming task. Moreover, since the methodology used in this study depends extensively on data, its implementation is limited by the availability of water flow data. Hence, the approach presented in this work could be combined with hardware approaches to reduce data dependency.
Moreover, another approach that could have been developed is to generate a TF of the system shown in
Figure 2 by considering the water tank’s dynamics, as in the work of Li et al. [
46]. This could reduce the amount of data needed to generate the TFs. However, certain variables, such as the height of the fluid in the tanks, the area of the tanks, and the pressure, need to be considered to generate an adequate model that represents the system’s dynamics and, consequently, the water flow behavior. Moreover, dynamical models based on linear differential equations do not account for the disturbances and noise to which the system is susceptible. Hence, future work could compare a TF estimated with the Box–Jenkins methodology against a TF obtained from the linear differential equations that describe the system dynamics.
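As an illustration only, and not necessarily the formulation used by Li et al. [46], a first-principles TF for a single tank can be sketched from a mass balance. The symbols are assumptions introduced here: A is the tank cross-sectional area, h the fluid height, q_in and q_out the inflow and outflow, and R a linearized outlet resistance, with all variables taken as deviations from an operating point and zero initial conditions.

```latex
\begin{align}
A \frac{dh(t)}{dt} &= q_{\mathrm{in}}(t) - q_{\mathrm{out}}(t),
  \qquad q_{\mathrm{out}}(t) = \frac{h(t)}{R} \\
A s H(s) &= Q_{\mathrm{in}}(s) - \frac{H(s)}{R} \\
\frac{Q_{\mathrm{out}}(s)}{Q_{\mathrm{in}}(s)} &= \frac{1}{A R s + 1}
\end{align}
```

Under these assumptions, the tank behaves as a first-order lag with time constant AR, which makes explicit why the tank area and the fluid height must be known to build such a model.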
A further opportunity is the development of a publicly available dataset on which the water leakage detection models proposed in the literature can be compared homogeneously. In the present study, flow sensor data were used to analyze a branch of the water distribution system of Mexico City. However, other studies described in
Section 2 used other types of data and analyzed water distribution systems in other regions, so their results cannot be compared with those presented in this work because the data modalities and distributions differ. The analyzed systems differed even among works that performed similar research in Mexico, such as Pérez-Pérez et al. [
16] and the present study.
7. Conclusions
This work proposed using Seasonal ARIMA and TF models fit through the Box–Jenkins approach to model the flow data of a branch of the water distribution system of Mexico City for anomaly detection. The results showed that ARIMA models can describe and forecast the flow variables of the analyzed water branch with low error in terms of the MAPE. According to the reported RMSE and MAPE values, the generated TF models can also explain the linear relationship between the tanks of the branch. Still, in most cases, the ARIMA models achieved better performance in terms of the MAPE.
The models proposed in this study have the potential to contribute significantly to reducing water losses and improving the efficiency of the distribution system. They rely on the existing instrumentation and infrastructure of Mexico City’s water distribution system and on a clear and understandable methodology that represents anomalies visually to support the alert process for users. These models can facilitate early detection and localization of potential issues, enabling prompt actions and interventions for more effective management of the water distribution network. Additionally, by identifying the specific sensor that triggered the alert, the search for potential issues can be narrowed down to a specific zone, enabling faster localization of the failure and more efficient troubleshooting. The actions for each alert type (leakage or measurement error) depend on whether the water flow variable is an inflow or an outflow. For example, if less water than expected is arriving at a tank, it can be assumed that there is a leakage in the pipeline before the tank’s entry. In such a case, physical inspection would be required.
On the other hand, if more water than expected is arriving, testing the sensor and verifying its calibration are recommended. For an outflow variable, if less water than expected is leaving a tank, the loss may have occurred inside the tank, for example through an overflow or an unauthorized intake. If more water than expected is leaving through the tank outflow, there may be a leakage in the pipe ahead, which should be verified.
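The decision rules described above can be summarized as a small lookup. The function name, argument names, and sign convention (observed minus expected flow) are assumptions made for this sketch; the recommended actions follow the text.

```python
def recommended_action(flow_type, deviation):
    """Map an alert to the follow-up action suggested for it.

    flow_type: "inflow" or "outflow", relative to the tank.
    deviation: observed minus expected flow; negative means less water than expected.
    """
    if flow_type not in ("inflow", "outflow"):
        raise ValueError("flow_type must be 'inflow' or 'outflow'")
    if deviation == 0:
        return "no action"
    less_than_expected = deviation < 0
    if flow_type == "inflow":
        if less_than_expected:
            return "inspect the pipeline before the tank entry for a leakage"
        return "test the inflow sensor and verify its calibration"
    if less_than_expected:
        return "check the tank for an overflow or an unauthorized intake"
    return "verify the pipe ahead of the tank outflow for a leakage"
```

In practice, `deviation` would only be evaluated for observations that the detection model has already flagged as anomalous.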
The variables with greater errors were the ones with the most alarms. Therefore, the corresponding authorities should review and maintain the sensors and communication systems associated with these variables. In addition, the generated alerts should be reviewed and validated by the operators responsible for the system to determine whether the alerts are correct or whether new modeling is required.
Future work will develop the current methodology into an integrated support system, in close collaboration with water service providers in Mexico City and through action-based research. Implementation requires capable personnel with access to tank instrumentation to monitor and validate alerts constantly. Optimizing this methodology involves developing an online monitoring and detection system that reduces false alerts, detects leaks in real time, and even re-models the system automatically. The platform should also provide dynamic data exchange and customer information to build positive relationships and pro-environmental attitudes.