Advances in the Monitoring, Diagnosis, and Optimisation of Water Systems

Edited by Miquel À. Cugueró-Escofet and Vicenç Puig

mdpi.com/journal/sensors

## **Advances in the Monitoring, Diagnosis, and Optimisation of Water Systems**


Editors

**Miquel À. Cugueró-Escofet and Vicenç Puig**

Basel • Beijing • Wuhan • Barcelona • Belgrade • Novi Sad • Cluj • Manchester

*Editors*
Miquel À. Cugueró-Escofet
Polytechnic University of Catalonia (UPC-BarcelonaTech)
Terrassa, Spain

Vicenç Puig
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Barcelona, Spain

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Sensors* (ISSN 1424-8220) (available at: https://www.mdpi.com/journal/sensors/special_issues/Optimisation_Water_Systems).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

Lastname, A.A.; Lastname, B.B. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-9594-8 (Hbk) ISBN 978-3-0365-9595-5 (PDF) doi.org/10.3390/books978-3-0365-9595-5**

Cover image courtesy of Miquel À. Cugueró-Escofet

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license. The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license.


## **About the Editors**

#### **Miquel À. Cugueró-Escofet**

Miquel À. Cugueró-Escofet is an Assistant Professor at the Automatic Control Department of the Universitat Politècnica de Catalunya (UPC-BarcelonaTech) and a Researcher at the Advanced Control Systems Research Group (SAC-UPC). His research interests lie in the field of systems fault diagnosis and control, with a strong focus on data analysis, including system modelling and supervision, optimisation, adaptive control, robust control, sensor placement, and sensor data validation, reconstruction, and forecasting in applications such as water systems management and active noise control. He has co-authored over 45 publications, including JCR-indexed journal articles, international conference papers, and book chapters, and has participated in over 30 EU and national R&D projects in collaboration with multidisciplinary scientific teams in different countries. His current research concerns the supervision, fault diagnosis, and optimisation of critical infrastructure systems such as water sanitation systems, with a special focus on reducing the gap between theory and practice.

#### **Vicenç Puig**

Vicenç Puig is a Full Professor at the Automatic Control Department of the Universitat Politècnica de Catalunya (UPC-BarcelonaTech) and a Researcher at the Institut de Robòtica i Informàtica Industrial (IRI), CSIC-UPC. He is the Director of the Automatic Control Department and Head of the research group on Advanced Control Systems (SAC) at UPC. He has made important scientific contributions in the areas of fault diagnosis and fault-tolerant control using interval and linear parameter-varying models adopting set-based approaches. He has also made important scientific contributions in the areas of monitoring, control, and supervision of large-scale systems, with particular applicability to systems in the water cycle (drinking water networks, sewage/drainage networks, irrigation networks, and waste treatment plants). He has participated in more than 20 European and national research projects in the last decade. He has also led many private contracts with several companies and has published more than 140 journal articles, as well as over 400 contributions in international conference/workshop proceedings. He has supervised over 25 PhD dissertations and over 50 master's theses/final projects. He is currently Chair of the IFAC Safeprocess TC Committee 6.4, having served as Vice Chair from 2016 to 2019. He was the General Chair of the 3rd IEEE Conference on Control and Fault-Tolerant Systems (SysTol 2016) and IPC Chair of IFAC Safeprocess 2018.

## *Editorial* **Advances in the Monitoring, Diagnosis and Optimisation of Water Systems**

**Miquel Àngel Cugueró-Escofet 1,\* and Vicenç Puig 1,2**


In the context of global climate change, the increasing frequency and severity of extreme events—such as droughts and floods—will likely make water demand more uncertain and jeopardise its availability. Those in charge of water system management therefore face new operational challenges arising from increasing resource scarcity, intensive energy requirements, growing populations (especially in urban areas), costly and ageing infrastructures, increasingly stringent regulations, and rising attention towards the environmental impact of water use. The shift from a linear to a circular economy and the need for a transition to a low-carbon production system represent an opportunity to address these emerging challenges related to water, energy, and the efficient use of resources. These challenges impel network managers to improve their methods and techniques for the monitoring, diagnosis, prognosis, supervision, and optimisation of the performance of water-related systems to adhere to the current sustainability agenda.

In this context, the increasing number of advanced installed sensors—and the corresponding increase in available data—allow for the implementation of Industry 4.0 (I4.0) techniques, which are strongly focused on interconnectivity, automation, artificial intelligence (AI), and real-time data acquisition, and will facilitate the development of intelligent tools to tackle such challenges. Within this framework, the successful implementation of I4.0 techniques in water-cycle-management facilities may prompt a breakthrough in improving the processes involved, drastically increasing their performance.

In this Special Issue, a selection of these techniques applied to the integral water cycle—i.e., water distribution and water sanitation—is introduced to address different current water-management challenges. These challenges may be classified as water-quantity challenges and water-quality challenges. On the water-distribution side, these challenges may include fault detection—namely, leak localisation—in water-distribution networks (WDNs), e.g., in [1], where a process prior to the actual leak localisation—i.e., sensor placement—is carried out using information-theory simulation-based methodology; or in [2], where a new data-driven method for leak location considering pressure measurements and network topological information is presented; or in [3], where simultaneous leak detection and isolation is applied to real data. All these methodologies contribute to reducing water loss due to leaks, which may account for up to 65% of the total water depending on the network [3] and, hence, impact water-quantity-management challenges. WDNs are also the focus in [4], where a challenge from the water-quality side—particularly, the water-disinfection process in water distribution—is addressed, providing a water-quality model by an online chlorine-decay-model calibration method, which has a strong impact on human health, since its correct concentration is paramount to ensure safe water disinfection.

Work presented in [5–8] discusses the water-sanitation side. In this field, there is a growing interest in the adaptation and use of technologies related to the circular economy which promote environmental sustainability, where resource recovery is a key issue for industrial and environmental processes and involves a wide spectrum of study possibilities.

**Citation:** Cugueró-Escofet, M.À.; Puig, V. Advances in the Monitoring, Diagnosis and Optimisation of Water Systems. *Sensors* **2023**, *23*, 3256. https://doi.org/10.3390/s23063256

Received: 1 March 2023 Revised: 8 March 2023 Accepted: 11 March 2023 Published: 20 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

In water sanitation, wastewater treatment plants (WWTPs) offer a wide range of possibilities for resource recovery, mainly related to sludge-treatment processes such as biogas generation via the substrate codigestion process, which can be an alternative source for thermal and electrical energy production. This potential for biogas generation could become a source of renewable natural gas, which has specific composition requirements that demand high-tech sensors to assure its quality regardless of its origin. Due to their potential for resource recovery and the further implications in the water–food–energy nexus, WWTPs have been a research focus in different areas of expertise: from modelling and engineering design to process dynamics, simulation, and integration. This line of work is introduced in [6], where resource recovery—namely biogas in the latter reference—is optimised by a centralised codigestion method considering real data from a WWTP network. Different nature-inspired optimisation algorithms are compared in the performance of this task, providing a potentially dramatic improvement when compared with the current nonoptimised operation. The improved operation of WWTPs is also sought in [5,7] by means of improving the controllers involved in the operation of certain key processes of the WWTP, e.g., the aeration process of biological reactors. Classic proportional-integral (PI) controllers have traditionally been considered as the control strategy for such processes; however, improved performance may be achieved with more complex structures and techniques, e.g., model predictive control (MPC) schemes or artificial neural network (ANN) approaches. In [5], an economic MPC (EMPC) considering a linear parameter-varying (LPV) model is proposed to control dissolved oxygen concentration in the WWTP biological reactors.
Since the MPC technique requires a model of the process involved for its control, in the latter reference, a reduced model of the complex nonlinear plant is represented in a quasi-linear parameter-varying (qLPV) form to reduce the computational burden—enabling real-time operation—and applied in a real facility. This model, however, may not be available or may be difficult to obtain, since the processes involved in the WWTP include nonlinear relations. ANN schemes may provide an alternative in this case, since they are well suited to dealing with such processes. In this line of work, [7] considers transfer-learning (TL) methods to train ANNs supporting control operations in WWTPs and compares this approach with traditional control schemes, providing improved control performance while reducing control-design complexity and the time invested in the ANN training process, which can be considerable. Last but not least, in this Special Issue collection, a soft-sensing approach to predict key performance indicators (KPIs) in the water-quality monitoring and control of WWTPs—such as effluent biochemical oxygen demand (BOD) or ammonia nitrogen (NH3-N)—is presented in [8]. Water-quality KPIs in WWTPs are traditionally subject to nonautomated, lab-based, offline monitoring approaches. Instead, in the latter reference, a method to perform accurate predictions of these KPIs, aiming for online operation, is introduced.

Further work in this area is included in the Special Issue, e.g., in [9], where remote sensing (RS) image-based time series are considered to obtain mass balances and estimate the unfiltered volumes in topographic depressions which are seasonally filled with water in a real area; or in [10], where a soil-moisture monitoring technique in precision agriculture—which is becoming key to providing food sustainably in the context of the world's increasing population and natural resource scarcity—is provided using a low-cost wireless sensor network in order to help farmers optimise the irrigation process, and is tested in a real plot of land. Finally, a comprehensive review of AI and computer-vision methods for intelligent water monitoring—namely, water-body extraction and water-quality monitoring—using RS techniques is presented in [11], discussing the main challenges of using AI and RS for water-information extraction, as well as pointing out research priorities in this area. Hence, all the contributions in this Special Issue have an impact on the advances in the monitoring, diagnosis, and optimisation of water systems and, overall, cover a wide and complete sector of knowledge within this area.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Article* **Pressure Sensor Placement for Leak Localization in Water Distribution Networks Using Information Theory**

**Ildeberto Santos-Ruiz <sup>1</sup>, Francisco-Ronay López-Estrada <sup>1,\*</sup>, Vicenç Puig <sup>2</sup>, Guillermo Valencia-Palomo <sup>3</sup> and Héctor-Ricardo Hernández <sup>1</sup>**


**Abstract:** This paper presents a method for optimal pressure sensor placement in water distribution networks using information theory. The criterion for selecting the network nodes where to place the pressure sensors was that they provide the most useful information for locating leaks in the network. Considering that the node pressures measured by the sensors can be correlated (mutual information), a subset of sensor nodes in the network was chosen. The relevance of information was maximized, and information redundancy was minimized simultaneously. The selection of the nodes where to place the sensors was performed on datasets of pressure changes caused by multiple leak scenarios, which were synthetically generated by simulation using the EPANET software application. In order to select the optimal subset of nodes, the candidate nodes were ranked using a heuristic algorithm with quadratic computational cost, which made it time-efficient compared to other sensor placement algorithms. The sensor placement algorithm was implemented in MATLAB and tested on the Hanoi network. It was verified by exhaustive analysis that the selected nodes were the best combination to place the sensors and detect leaks.

**Keywords:** sensor placement; pressure monitoring; information theory; leak localization; water distribution network

#### **1. Introduction**

Finding a suitable sensor placement is a fundamental problem for monitoring water distribution networks (WDNs) because it is impossible to install sensors at each point of the geographic area covered by the distribution system. A WDN comprises hundreds of nodes; however, only a few sensors can be installed in certain carefully selected nodes. Then, the main question is how to select the optimal sensor placement. Finding an answer to this problem is not trivial because the selected nodes must capture the most relevant information to estimate hydraulic variables at non-measured points and provide essential information for different supervision algorithms, e.g., for leak localization [1,2]. Often there are pressure and flow instruments at the supplying nodes of a WDN and in some cases at critical points (e.g., at the minimum pressure node). However, these measurements are not sufficient for an accurate leak localization, so additional sensors must be installed at other sites [3]. A practical solution is to install more pressure sensors, because they are cheaper and easier to install and maintain than flow sensors. In addition, node pressures are more sensitive to leaks than flow rates, which is why many localization algorithms are based primarily on pressure measurements. The problem of sensor placement is closely related to other WDN management problems, such as the state estimation of the network [4–6],

**Citation:** Santos-Ruiz, I.; López-Estrada, F.-R.; Puig, V.; Valencia-Palomo, G.; Hernández, H.-R. Pressure Sensor Placement for Leak Localization in Water Distribution Networks Using Information Theory. *Sensors* **2022**, *22*, 443. https://doi.org/10.3390/ s22020443

Academic Editor: Hossam A. Gabbar

Received: 12 December 2021 Accepted: 5 January 2022 Published: 7 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

model calibration [7,8], and water quality monitoring, such as detection of contaminants and cyberattacks [9–15], among others. Nevertheless, the present work focuses on the context of leak detection and localization as discussed in [16,17]. Regarding techniques for optimal sensor placement for leak/burst detection and localization in water distribution systems, a comprehensive review can be found in [18].

In a mathematical/computational context, the placement of pressure sensors is a mixed-integer programming problem. In this problem, for a network with $N$ nodes, a sensor placement consists of a selection $[s_1, s_2, \ldots, s_N]$, where the $s_i$ are binary decision variables such that $s_i = 1$ indicates that a sensor will be placed on the $i$-th node, whereas $s_i = 0$ indicates that no sensor will be placed on that node.

Combinatorial analysis shows that there are $2^N - 1$ possible sensor placements when non-empty subsets with any number of sensors are considered. If the number of sensors is previously set to a fixed number $S$, then the number of possible sensor placements is reduced to $\binom{N}{S}$, which is still a very large number. Therefore, in medium-sized and large networks, it is not feasible to check all possible combinations. For example, in a network containing 500 nodes, the number of different placements for 10 sensors is $\binom{500}{10} \approx 2.5 \times 10^{20}$. That is why it is important to find an optimal placement method without analyzing all the possible combinations.
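The size of this search space is easy to check directly. A quick sketch in Python (the article's own implementation is in MATLAB; this is only an illustration of the counting argument):

```python
from math import comb

n_nodes = 500

# Non-empty sensor subsets of any size in an N-node network: 2**N - 1.
print(2**n_nodes - 1)

# Placements of exactly S = 10 sensors among the 500 candidate nodes.
print(comb(n_nodes, 10))  # ≈ 2.5e20, as stated in the text
```

Even the fixed-size case is far beyond exhaustive enumeration, which motivates the heuristic ranking proposed later in the paper.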

Usually, sensor placement focused on leak localization is addressed with an optimization approach from synthetic pressure data obtained by simulation. Some authors have focused on minimizing the number of undetectable leaks [19,20], whereas others reduce the error in the leak location [16,21]. In [22], a min-max optimization algorithm that considers the isolation of the leaks from their signatures obtained through simulation is proposed. In [23], a multi-objective approach to mitigate errors both in the detection and localization of leaks, considering minimum night flow conditions, is presented. Regarding the optimization of the objective function, two approaches are usually used: deterministic methods (e.g., branch and bound [24]) and metaheuristic methods (e.g., genetic algorithms [25–27] and particle swarm optimization [28]). Deterministic approaches guarantee an optimal solution, but the computation time increases exponentially with the number of nodes and possible leak scenarios. On the other hand, metaheuristic methods search for a near-optimal solution that only guarantees optimality when the number of candidate solutions evaluated (the "population size") tends to infinity. Furthermore, optimization-based sensor placement methods are linked to a specific leak localization method because the objective function is expressed in terms of a localization error or isolation index for that method [16,28,29]. Consequently, a sensor placement may be optimal for one specific leak localization method but not as good for others. Ideally, the sensor placement should be independent of the leak localization method, since it is not feasible to recompute it for every method. Thus, an improved leak localization method could be proposed based on an ensemble of different machine learning algorithms using the information provided by the sensors.

The huge computing time of optimization-based methods in networks with hundreds or thousands of nodes and the high dependence on the selected leak localization method have motivated the present work. In the proposed approach, no assumption is made about how a specific leak localization method will use the information provided by the sensors; instead, the sensor placement focuses only on the sensors capturing as much leak-related information as possible. The proposed method consists of a heuristic algorithm to select the subset of nodes where to place the sensors, seeking to maximize the relevance of the information captured by the sensors while minimizing the redundancy between the pressures in the selected nodes. Both metrics, relevance and redundancy, are defined in terms of information theory.

An important contribution of this work is the reduction in computing time for sensor placement, compared to methods based on metaheuristic optimization. Another relevant contribution is the nondependence of the sensor placement on the leak localization method used, which allows the use of the same sensor placement with different localization methods. Some aspects not yet covered in this work are the possible heterogeneity of the sensors (e.g., different errors and measurement ranges) and the influence of the measurement noise on the optimal placement, but they are considered as future work.

The rest of the document is organized as follows: in Section 2, the concepts of redundancy and relevance are presented in terms of mutual information, and the information quotient used as the basis of the method is also defined. In Section 3, the proposed method is formally described and some guidelines for its implementation are given. In Section 4, the results of the proposed method applied to a simplified version of the Hanoi network (case study) are presented. Finally, in Section 5, the conclusions are presented and future related works are proposed.

#### **2. Information Theory Fundamentals**

In Shannon's information theory (IT), the self-information of a random variable is defined according to the unexpectedness of its values [30]. Thus, the information contained in a constant random variable is zero. Mathematically, if an event *E* has probability *P*, its information content is defined by:

$$I(E) \triangleq -\log_b(P), \tag{1}$$

where the unit of measure of *I* is determined by the base of the logarithm, *b*; the unit is the "bit" if *b* = 2. For a discrete random variable *X* with probability function *p*(*x*) = Pr(*X* = *x*), the self-information of obtaining *x* as a result when measuring *X* is given by:

$$I(x) = -\log_b(p(x)) = \log_b(1/p(x)). \tag{2}$$

To quantify the average information that a random variable contains, considering all its possible values, the *entropy* is used:

$$H(X) \triangleq E(I(x)) = -\sum_{x} p(x) \log_b(p(x)), \tag{3}$$

which is the expected value of the information contained in the measurements of *X*, that is, the sum of the self-information of each of its possible values weighted by its probability of occurrence.
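As a concrete check of (3), a minimal Python sketch (a hypothetical `entropy` helper, not part of the article) computes the entropy of a discrete distribution:

```python
import numpy as np

def entropy(p, b=2):
    """Entropy of a discrete distribution (Eq. 3); in bits when b = 2."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0*log(0) is taken as 0 by convention
    return float(-(p * np.log(p) / np.log(b)).sum() + 0.0)

print(entropy([0.5, 0.5]))  # fair coin → 1.0 bit
print(entropy([1.0]))       # constant variable → 0.0 bits
```

As stated above, a constant random variable carries zero information, while a uniform distribution over $2^n$ outcomes attains the maximum of $n$ bits.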

The mutual information of two random variables, sometimes called "information gain", measures the amount of information obtained from one of the random variables by observing the other one. For example, in a practical application of WDN monitoring, the mutual information between two node pressures would indicate how much information about the pressure at one node is gained by knowing the pressure at the other one. In probabilistic terms, the mutual information determines how different the joint distribution of (*X*,*Y*) is from the product of the marginal distributions of *X* and *Y*.

For two discrete variables *X* and *Y*, defined over the space $\mathcal{X} \times \mathcal{Y}$, the mutual information is computed as the double sum:

$$I(X,Y) = \sum\_{y \in \mathcal{Y}} \sum\_{x \in \mathcal{X}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\tag{4}$$

where *p*(*x*, *y*) = Pr(*X* = *x*, *Y* = *y*) is the joint probability function of *X* and *Y*, whereas *p*(*x*) and *p*(*y*) are the marginal probability functions of *X* and *Y*, respectively. The mutual information (4) is related to the entropy and the conditional entropy by the following equivalences:

$$I(X,Y) \equiv H(X) - H(X \mid Y) \equiv H(Y) - H(Y \mid X). \tag{5}$$

Furthermore, *I*(*X*, *X*) = *H*(*X*), *I*(*X*,*Y*) = *I*(*Y*, *X*) and *I*(*X*,*Y*) ≥ 0, where *I*(*X*,*Y*) = 0 iff *X* and *Y* are independent.

For continuous random variables, the summations in (4) are replaced by integrals and the probability functions by probability densities:

$$I(X,Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, \mathrm{d}x \, \mathrm{d}y. \tag{6}$$

Due to the difficulty in modeling the probability densities and subsequently evaluating the double integrals in (6), a common simplification to calculate the mutual information of continuous variables is to discretize them with *n* bits, so that the domain of each variable is reduced to $2^n$ bins. For example, to compute the mutual information of two node pressures in a hydraulic network, the span of the pressure variables $[P_{\min}, P_{\max}]$ can be divided into a discrete 8-bit grid (256 different values), and then (4) is applied.
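The binning approach just described can be sketched as follows; `mutual_information` is a hypothetical helper illustrating the histogram estimate of (4) after discretization, not the authors' code:

```python
import numpy as np

def mutual_information(x, y, bits=8):
    """Histogram estimate of I(X, Y) in bits: each variable is
    discretized into 2**bits bins over its own span, then Eq. (4)
    is applied to the empirical joint distribution."""
    pxy, _, _ = np.histogram2d(x, y, bins=2 ** bits)
    pxy /= pxy.sum()                      # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty cells: 0*log(0) = 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

For identical signals this estimate returns *H*(*X*), and it is symmetric and non-negative, matching the properties listed after (5); note that with few samples and many bins the histogram estimator is biased upward for independent variables.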

#### **3. Sensor Placement Method**

The proposed sensor placement method is based on a dataset of node pressures that collects typical variations due to leaks of different sizes in all network nodes. The pressure dataset is obtained from simulations with the hydraulic model of the network in [31]. Each pressure data point is labeled with a "leak class" (the node where the leak occurs) so that the proposed method can be classified as supervised.

In the context of machine learning, the placement of pressure sensors is a *feature selection* stage. To select the features (the subset of nodes where the sensors will be placed), an algorithm is proposed that seeks to maximize the relevance of the selected features (node pressures) for the response variable (leaky node), while each of them avoids capturing information already contributed by the others, that is, minimizing redundancy.

The following definitions of relevance and redundancy, proposed in [32], are used as a basis for defining the methodology:

**Definition 1** (Relevance)**.** *A metric of the relevance of the subset of node pressures* S *for the response variable y (leak node), is given by*

$$\mathrm{Rel}(\mathcal{S}) \triangleq \frac{1}{S} \sum_{x \in \mathcal{S}} I(x, y), \tag{7}$$

*where x is any feature in* $\mathcal{S}$*, and* $S = |\mathcal{S}|$ *is the number of features in* $\mathcal{S}$ *(its cardinality).*

**Definition 2** (Redundancy)**.** *A metric for information redundancy in a feature subset* S *is given by:*

$$\mathrm{Red}(\mathcal{S}) \triangleq \frac{1}{S^2} \sum_{x, x' \in \mathcal{S}} I(x, x'), \tag{8}$$

*where x and x′ are any features in* $\mathcal{S}$*.*

To apply the above definitions to compute a pressure sensor placement, first, a dataset of node pressures is built covering different scenarios that consider leaks of different magnitude in all nodes of the network. Through simulation with the hydraulic model of the network, a series of samples of the node pressures is obtained, one sample for each different leakage scenario. In this way, if *M* different leakage scenarios are simulated in a network containing *N* nodes, the result of the simulation is a collection of *N* *M*-dimensional vectors—the *x* and *x*′ in (7) and (8)—corresponding to the *N* candidate nodes (initially, it is assumed that all nodes are potential sensing nodes). In addition, an output vector, *y* in (7), is generated containing integer labels to indicate the leaky node corresponding to each simulated scenario.

The exhaustive search for the optimal subset of sensors, $\mathcal{S}$, requires testing the $2^N - 1$ different combinations, which would require an impractical computation time in networks with many nodes. Therefore, the method proposed in [32] was considered to rank the node pressures through an iterative forward scheme that only requires $O(NS)$ computations. In fact, with this proposal, it is possible to rank all the node pressures in order of importance with a computational cost of $O(N^2)$.

Next, a heuristic algorithm is proposed, which orders the node pressures according to their importance to explain the different leak classes (leaky nodes). The first node pressures in the output list correspond to the nodes with the highest importance for explaining the leak positions according to the information contained in the dataset. The sequential selection of nodes starts from an empty subset and, at each iteration, adds the best-ranked node among those that are still available to be selected. At each iteration, the relevance of each available feature (node pressure) with respect to the output (leaky node) and its redundancy with respect to the variables that have been previously selected is evaluated using the following equations, adapted from (7) and (8):

$$\mathrm{Rel}_{y}(x) = I(x, y), \tag{9}$$

$$\mathrm{Red}_{\mathcal{S}}(x) = \frac{1}{S} \sum_{x' \in \mathcal{S}} I(x, x'). \tag{10}$$

Since maximizing relevance and simultaneously minimizing redundancy represents a multiobjective problem, a combined relevance/redundancy index (RRI) is defined that increases with increasing relevance and also with decreasing redundancy, so the problem is expressed as a single objective to be maximized:

$$\mathrm{RRI} = \mathrm{Rel}_{y}(x) / \mathrm{Red}_{\mathcal{S}}(x). \tag{11}$$

The complete node ranking process is formally expressed in Algorithm 1. When the process finishes, the nodes where to place the sensors are taken from the first positions in the list S. If it is not necessary to obtain the complete ranking of the nodes, but only to know the best-ranked positions, the process may stop prematurely when the subset S already contains the number of sensors to be placed.
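Algorithm 1 appears in the original article; the forward-selection logic it describes can be sketched as follows. Here `rank_nodes` is a hypothetical helper operating on precomputed mutual-information values, not the authors' MATLAB implementation:

```python
import numpy as np

def rank_nodes(mi_xy, mi_xx):
    """Greedy forward ranking of candidate nodes by the RRI of Eq. (11).
    mi_xy[i] is I(x_i, y); mi_xx[i, j] is I(x_i, x_j)."""
    available = list(range(len(mi_xy)))
    # Seed with the most relevant node; no redundancy term exists yet.
    first = max(available, key=lambda i: mi_xy[i])
    selected = [first]
    available.remove(first)
    while available:
        def rri(i):
            red = np.mean([mi_xx[i, j] for j in selected])  # Eq. (10)
            return mi_xy[i] / red if red > 0 else np.inf    # Eq. (11)
        best = max(available, key=rri)   # most relevant, least redundant
        selected.append(best)
        available.remove(best)
    return selected
```

Each iteration adds one node, so ranking all $N$ nodes costs $O(N^2)$ mutual-information lookups, in line with the complexity stated above.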


In most cases, the number of sensors to place for leak localization purposes is determined by the available equipment. The minimum number of sensors for a successful leak localization method will depend on how that method uses the available information, the measurement noise, and the quality, resolution, and calibration of the sensors. If there are enough resources to instrument the network intensively, it must be taken into account that increasing the number of sensors does not always lead to better performance in locating leaks. To determine how many sensors should be placed, it is suggested to start from the ranking obtained by Algorithm 1 and run a marginal analysis with the leak localization method to be used. Starting from one sensor (the best ranked), the number of sensors is progressively increased and the leak localization performance is evaluated for each new set of sensors until adding a new sensor no longer represents a significant benefit for locating leaks.
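The suggested marginal analysis reduces to a simple loop. In this sketch, `evaluate_accuracy` stands in for whatever leak localization benchmark is used and `min_gain` is an assumed stopping threshold; both names are hypothetical:

```python
def choose_sensor_count(ranking, evaluate_accuracy, min_gain=0.01):
    """Grow the sensor set along the ranking from Algorithm 1 until one
    more sensor no longer improves localization accuracy by `min_gain`.
    Returns (number of sensors, accuracy achieved)."""
    best = evaluate_accuracy(ranking[:1])  # start with the best-ranked node
    for k in range(2, len(ranking) + 1):
        acc = evaluate_accuracy(ranking[:k])
        if acc - best < min_gain:
            return k - 1, best             # previous set was already enough
        best = acc
    return len(ranking), best
```

With a localization benchmark whose accuracy saturates, the loop stops at the knee of the accuracy-versus-sensors curve rather than instrumenting every ranked node.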

It should be noted that Algorithm 1 does not take into account the geographical distribution of the nodes, since relevance and redundancy depend only on the mutual information between node pressures. This means that the network topology is what determines the amount of mutual information rather than the distance between sensors (i.e., two sensors can be geographically very close but have little mutual information).

#### **4. Results and Discussion**

Algorithm 1 was implemented in MATLAB and tested on the Hanoi network [33]. The model of the Hanoi network is composed of one reservoir, 31 consumer nodes, and 34 pipes, as shown in Figure 1. Due to its reduced topology, this network has been used as a standardized benchmark in different works [21,27,34].

In order to build the pressure dataset, leaks of different magnitudes were simulated at each junction node using the EPANET 2 simulation program [35] through the EPANET/MATLAB Toolkit [36]. The procedure to generate the dataset using EPANET, the training, and the predictive use of classifiers in locating leaks have been described in [37]. The dataset generated by simulation for this work considered leaks at all junction nodes with flow rates of up to 50 L/s. In order to simulate a leak at a node, the demand assigned to that node in the EPANET hydraulic model was increased by an amount equal to the flow of the simulated leak. Because the Hanoi network contains few nodes, the optimality of the sensor placement calculated by Algorithm 1 was exhaustively verified.

**Figure 1.** The Hanoi network.

To assess the optimality of the sensor placement obtained from Algorithm 1, leak localization tests were carried out using two machine learning methods that used the pressures in the selected nodes as features (input variables). The methods used were the *k*-nearest neighbors (*k*-NN) and quadratic discriminant analysis (QDA). These leak localization methods are based on classifiers that recognize directional patterns in pressure residuals using supervised learning techniques, as described in [38].

Through the marginal analysis suggested at the end of Section 3, it was determined that *S* = 3 is an adequate number of sensors for the Hanoi network, because adding a fourth sensor does not produce a statistically significant improvement (at the 0.95 confidence level) in leak localization (measurement noise may increase the minimum number of sensors, but this discussion is left as future work). Because the Hanoi network contains few nodes, it was possible to comprehensively analyze all 4495 possible combinations of three sensor nodes. For each triplet of nodes (three-sensor placement), 50 leak localization tests were carried out, with flow rates *q*leak = 1, 2, . . . , 50 L/s at each node of the network. Finally, the overall performance of both methods was evaluated for each candidate triplet using the classification accuracy (Acc) and the average topological distance (ATD) as performance metrics, as defined in [39]. The Acc is the fraction of exactly located leaks over all leak scenarios in the test dataset: Acc = 1 means that all leaks were correctly located, whereas Acc = 0 means that none were. The ATD measures how far from the true leaky node the classifier locates the leak, counting the number of links separating the true and estimated leaky nodes, averaged across all scenarios in the test dataset. Therefore, the best sensor placements are those with the highest Acc values and the lowest ATD values.
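The two metrics can be sketched directly from their definitions. In this hypothetical illustration, `dist` is a lookup of link counts between node pairs, which in practice would be derived from the network graph:

```python
def accuracy(true_nodes, predicted_nodes):
    """Acc: fraction of scenarios where the leak node was located exactly."""
    hits = sum(t == p for t, p in zip(true_nodes, predicted_nodes))
    return hits / len(true_nodes)

def avg_topological_distance(true_nodes, predicted_nodes, dist):
    """ATD: mean number of links between the true and estimated leak
    nodes; dist[(a, b)] holds the link count between nodes a and b."""
    total = sum(dist[(t, p)] if t != p else 0
                for t, p in zip(true_nodes, predicted_nodes))
    return total / len(true_nodes)

# Three scenarios; the third leak is mislocated two links away:
acc = accuracy([1, 2, 3], [1, 2, 4])
atd = avg_topological_distance([1, 2, 3], [1, 2, 4], {(3, 4): 2})
```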

The results in Table 1 show that the node triplet {12, 21, 28} computed by Algorithm 1 is among the best ranked, since it presents the highest accuracy and the lowest average topological distance.


**Table 1.** Best positions to place three sensors in the Hanoi network, obtained by exhaustive analysis. The shaded selection is the one obtained by Algorithm 1.

Figure 2 shows the geographic location of the three-sensor placement obtained considering the three nodes best ranked by Algorithm 1. Figure 3 shows the complete ranking considering the 31 nodes of the network.

**Figure 2.** Computed three-sensor placement in the Hanoi network.

**Figure 3.** Node ranking in the Hanoi network.

Table 2 shows the sensor placements obtained for two, three, and four sensors in the Hanoi network, compared with the results obtained by metaheuristic methods reported in the literature [28]. The nodes selected by these methods are quite similar and produce very close results in terms of leak localization accuracy based on the pressures of the selected nodes. However, there is an important difference in computation time between the IT-based method (Algorithm 1) and the metaheuristic methods. On a personal computer with an Intel 64-bit processor and 8 GB of RAM, the computation time for the IT-based method was around one second with the synthetic data from the Hanoi network, whereas it was 24 min for the genetic algorithm (possibly more, depending on the initial population size) and about one hour for the exhaustive analysis.


**Table 2.** Optimal three-sensor placement in the Hanoi network using different methods.

*<sup>a</sup>* Algorithm 1. *<sup>b</sup>* Genetic algorithm, reported in [28]. *<sup>c</sup>* Particle swarm optimization, reported in [28]. *<sup>d</sup>* Semiexhaustive search, reported in [28].

Further tests were made on larger networks, e.g., on some midsize sectors of the Madrid network. Figure 4 shows a 10-sensor placement obtained using Algorithm 1 in a sector of the Madrid network containing one reservoir, 312 junction nodes, and around 14 km of pipes. In this case, optimality was not exhaustively tested due to the vast number of possible placements to compare. However, it was found that the average leak localization accuracy with the sensor placements obtained by Algorithm 1 was better than that obtained with an existing placement (previously obtained by a genetic algorithm) for different leak scenarios.

**Figure 4.** The optimal 10-sensor placement in a sector of the Madrid network.

Figures 2 and 4 show that the computed sensor placements do not show geometric regularity (i.e., the sensors do not appear equally spaced), since geometric or spatial criteria are not used to distribute the sensors in the network. However, regardless of geometric irregularity, leak location tests with these placements demonstrated that pressure measurements at these nodes provided the most useful information for discerning between different leak scenarios. In fact, when the placement of sensors obtained by Algorithm 1 is compared with the results reported by other authors using metaheuristics, sometimes very close performances can be found even though the sensors are distributed in different nodes, because the proposed algorithm does not optimize the position of each sensor individually but the entire set of sensors. This can be explained with an informal analogy: two soccer teams can achieve similar performances using different players.

Although, as noted above, different sensor placements may lead to good leak localization performance, the placement obtained by Algorithm 1 has two advantages: it is calculated in less time than the metaheuristic-based methods, and it is not tied to a specific leak localization method, so changing the localization method does not require relocating the sensors, which would be impractical.

#### **5. Conclusions**

This paper has presented a technique for finding optimal sensor placements based on information theory, using sequential forward selection to maximize the relevance and minimize the redundancy of the selected node subset. The proposed technique is computationally less expensive than other methods reported in the literature because it operates directly on the node pressure values, without performing leak localization calculations within the algorithm. The optimality of the sensor placement obtained with the proposed method was extensively tested by simulation on the Hanoi network. It was found that selecting the nodes where sensors should be placed using information theory produced the best combination of pressure variables for locating leaks with different machine learning methods.

An implicit assumption in the proposed algorithm is that all network nodes are equally available for sensor placement. In practice, however, some nodes may have placement priority over others; for example, critical nodes (points of minimum pressure) and nodes that supply essential services (e.g., hospitals) could be monitored as a priority. It may also occur that some nodes already have a sensor installed and that this previous partial placement must be kept, or that conditions at a node are physically adverse and instrumentation is avoided. These circumstances warrant adjustments to the proposed sensor placement algorithm that may lead to future work. Another possible line of work is the combination of heterogeneous sensors with different sensing specifications (e.g., different precision) or measuring different physical magnitudes (e.g., placements combining pressure and flow sensors).

**Author Contributions:** Conceptualization, I.S.-R., F.-R.L.-E. and V.P.; methodology, I.S.-R.; software, I.S.-R.; validation, F.-R.L.-E., V.P. and G.V.-P.; formal analysis, G.V.-P. and H.-R.H.; data curation, I.S.-R. and H.-R.H.; writing—original draft preparation, I.S.-R.; writing—review and editing, F.-R.L.-E., V.P. and G.V.-P.; supervision, F.-R.L.-E. and V.P.; project administration, F.-R.L.-E. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was cofinanced by the European Regional Development Fund of the European Union in the framework of the ERDF Operational Program of Catalonia 2014–2020, under the research project 001-P-001643 Agrupació Looming Factory. Tecnológico Nacional de México (TecNM) also cofinanced this work by granting the research project 11080.21-P.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **Robust Data-Driven Leak Localization in Water Distribution Networks Using Pressure Measurements and Topological Information**

**Débora Alves 1,2,\*, Joaquim Blesa 1,3,4,\*, Eric Duviella <sup>2</sup> and Lala Rajaoarisoa <sup>2</sup>**


**Abstract:** This article presents a new data-driven method for locating leaks in water distribution networks (WDNs). It is triggered after a leak has been detected in the WDN. The proposed approach is based on the use of inlet pressure and flow measurements, other pressure measurements available at some selected inner nodes of the WDN, and the topological information of the network. A reduced-order model structure is used to calculate non-leak pressure estimations at the sensed inner nodes. Residuals are generated by comparing these estimations with leak pressure measurements. In a leak scenario, the network topology makes it possible to determine the relative incidence of a leak at a given node on each sensor, which allows the probable leaking nodes to be correlated with the available residual information. Topological and residual information can be integrated into a likelihood index used to determine the most probable leak node in the WDN at a given instant *k* or, by applying Bayes' rule, over a time horizon. The likelihood index is based on a new incidence factor that considers the most probable path of water from the reservoirs to the pressure sensors and potential leak nodes. In addition, a pressure sensor validation method based on pressure residuals that allows the detection of sensor faults is proposed.

**Keywords:** water distribution networks; leak localization; data-driven

#### **1. Introduction**

Water distribution networks are complex systems of great importance nowadays that are difficult to manage and monitor. The detection and location of leaks have become crucial for water distribution because bursts or leaks generate not only economic losses but also environmental issues, and represent a potential risk to public health through contaminated water [1]. Another concern is water scarcity: by 2025, half the world's population may lack access to safe and accessible water for their basic needs [2]. Despite all these risks, this infrastructure currently does not perform satisfactorily in practice. According to [3], the global volume of water loss, called Non-Revenue Water (NRW), has been estimated at 346 million cubic meters per day, or 126 billion cubic meters per year.

The infrastructure in a medium-sized city can have pipes spanning hundreds of kilometers, connected to hundreds of nodes (pipe junctions or customers that connect to the network). Several factors can therefore generate water loss during transport between the treatment plants and the consumers, usually attributed to causes including leaks, measurement errors, and theft. Water loss can be divided into two terms,

**Citation:** Alves, D.; Blesa, J.; Duviella, E.; Rajaoarisoa, L. Robust Data-Driven Leak Localization in Water Distribution Networks Using Pressure Measurements and Topological Information. *Sensors* **2021**, *21*, 7551. https://doi.org/10.3390/ s21227551

Academic Editor: Jason K. Levy

Received: 8 September 2021 Accepted: 6 November 2021 Published: 13 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

"real losses" and "apparent losses". Apparent losses are constituted by badly read measurements, data handling errors, and illegal water tapping. In contrast, the real losses comprise leakage from all system parts and overflow at storage tanks. Real losses are divided into "background leakage" made up of small undetectable and into detectable leaks relevant for detection as they represent significant losses for the water distribution company.

Effective leak management is vital, for all of the factors mentioned above, to save financial resources and water. Leak localization methods can be classified into two categories: hardware-based and software-based.

Hardware-based methods use hardware sensors to detect a leak directly and to help localize it. As various types of sensors and instruments are available, they can be further subclassified into acoustic [4,5] and non-acoustic [6] detection methods.

Software-based methods generally rely on an algorithm or model for detecting leaks. Unlike hardware-based methods, they do not seek to locate the leak point exactly but to narrow down the possible leakage areas. Since these methods are based on information such as pipe network pressures and flow data, they work well on any type of pipe. They can be divided into physical modeling methods and data-driven methods. Physical modeling (or model-based) methods identify the leak using a numerical model and compare the results with field data: for example, Ref. [7] uses pressure sensitivity analysis, Ref. [8] uses the leak signature space, Ref. [9] analyzes the sensitivity matrix and residuals, and Ref. [10] uses pressure and flow measurements to perform leakage detection through model invalidation. On the other hand, data-driven methods analyze the monitoring data using tools such as artificial intelligence (e.g., classifiers [11–14] or artificial neural networks [15,16]). Thus, it is possible to identify potential leak areas based on certain rules or principles without resorting to the simulation of a physical model [17]. However, these methods generally need a large number of non-leak and leak data scenarios in the training process to obtain reasonable results. As an exhaustive set of leak scenarios is generally not available, a hydraulic simulator can be used to generate leak data. This work deals with the problem of leak localization, and it is assumed that a leak detection method is available that determines whether a leak is present in the WDN. In particular, a non-numerical localization method based on a data-driven approach is proposed.

Like other recent works [18,19], it requires only the topological information of the network and leak-free historical data of the available measurements. In this work, the topological information provides the most probable paths for the extra flows produced by leaks. With this information, a new incidence factor is computed for every combination of node and sensor; each incidence factor determines how a leak at a particular node affects a specific pressure sensor. On the other hand, historical data are used to calculate non-leak pressure estimations at the sensed inner nodes, and residuals are generated by comparing these estimations with leak pressure measurements. Incidence factors are integrated with residuals into likelihood indexes that give the most probable leak node in a leak scenario. In addition, pressure residuals are used to detect sensor faults by means of a novel sensor validation algorithm.

The remainder of this paper is organized as follows: Section 2 presents graph theory applied to WDNs and explains the structure of the reduced-order model used in this work. The developed leak localization method is elaborated in Section 3. In Section 4, a sensor validation method that allows the detection of pressure sensor faults is presented. Section 5 introduces the case studies of the Hanoi and Modena WDNs. Section 6 presents the conclusions and future scope of the research work.

#### **2. Water Distribution Networks**

#### *2.1. Preliminaries*

A water distribution network is composed of $m$ pipes and $n$ nodes (internal consumer nodes and inlets) and can be described by a directed graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ [20], where $\mathcal{V} = \{v_1, \dots, v_n\}$ is the set of vertices that represent connections between the components of the network; the last $n_I$ vertices, $\{v_{n-n_I+1}, \dots, v_n\}$, represent the inlets of the system, with $n_I \geq 1$. The elements of the set $\mathcal{E} = \{e_1, \dots, e_m\}$ are the edges, which represent the $m$ pipes of the network.

The graph G can be represented by the incidence matrix **H** = [*hij*], in which the elements *hij* are defined as:

$$h_{ij} = \begin{cases} -1 & \text{if the } j\text{th edge enters the } i\text{th vertex,} \\ \;\;\,0 & \text{if the } j\text{th edge is not connected to the } i\text{th vertex,} \\ \;\;\,1 & \text{if the } j\text{th edge leaves the } i\text{th vertex.} \end{cases}$$

The direction of the edge represents a reference direction for the flow in the corresponding pipe. The incidence matrix satisfies $\mathbf{H} \in \{-1, 0, 1\}^{n \times m}$, with each row corresponding to a node and each column to a pipe.

The WDN must fulfill mass conservation law, which expresses the conservation of mass in each vertex, described by:

$$
\mathbf{H} \cdot \mathbf{q} = \mathbf{d},
\tag{1}
$$

where $\mathbf{d} \in \mathbb{R}^n$ is the vector of nodal demands, with $d_i > 0$ when the flow is into node $i$, and $\mathbf{q} \in \mathbb{R}^m$ is the vector of flows in the edges. By virtue of mass conservation, $\sum_{i=1}^{n} d_i = 0$, so only $n - 1$ nodal demands are independent; the supply flow must equal the end-user demands, as there is no storage in the network.
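As an illustration, Equation (1) can be checked on a toy two-pipe network; the incidence matrix and flows below are hypothetical and only serve to show the nodal mass balance.

```python
def node_demands(H, q):
    """Evaluate d = H . q (Eq. (1)) for an incidence matrix H (n x m)
    and a vector q of pipe flows."""
    return [sum(hij * qj for hij, qj in zip(row, q)) for row in H]

# Toy line network (hypothetical): inlet -> node1 -> node2, two pipes.
# Rows: node1, node2, inlet; columns: pipe1 (inlet->node1), pipe2 (node1->node2).
H = [[-1, 1],   # pipe1 enters node1, pipe2 leaves it
     [0, -1],   # pipe2 enters node2
     [1, 0]]    # pipe1 leaves the inlet
q = [5.0, 2.0]  # flows in the two pipes, m^3/s
d = node_demands(H, q)
# The nodal balances sum to zero, as there is no storage in the network.
```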

Let **p** be the vector of absolute pressures at the nodes and Δ**p** be the vector of differential pressures across the pipes, both in meters of water column [mwc], then the energy law for water networks gives:

$$
\Delta \mathbf{p} = \mathbf{H}^T \mathbf{p} = f(\mathbf{q}) - \mathbf{H}^T \mathbf{h},
\tag{2}
$$

where $\mathbf{p} \in \mathbb{R}^n$ and $f : \mathbb{R}^m \to \mathbb{R}^m$, $f(\mathbf{q}) = (f_1(q_1), \dots, f_m(q_m))$. The function $f_j(\cdot)$ describes the flow-dependent pressure drop due to the hydraulic resistance in the $j$th edge. The relationship between pipe flow and the energy loss caused by friction in individual pipes can be computed using the Hazen–Williams formula [21] for $f_j(\cdot)$:

$$f_j(q_j) = \frac{10.7 \cdot L_j}{\rho_j^{1.852} \cdot D_j^{4.87}} \cdot q_j^{1.852}, \tag{3}$$

where $L_j$ is the length of the pipe and $D_j$ the diameter of the pipe, both in meters [m], $q_j$ is the pipe flow in m$^3$/s, and $\rho_j$ is the pipe roughness coefficient.
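Equation (3) translates directly into code; the pipe parameters below are hypothetical examples.

```python
def hazen_williams_drop(q, L, D, rho):
    """Flow-dependent pressure drop f_j(q_j) of Eq. (3):
    f = 10.7 * L / (rho**1.852 * D**4.87) * q**1.852,
    with q in m^3/s, L and D in m, and rho the roughness coefficient."""
    return 10.7 * L / (rho ** 1.852 * D ** 4.87) * q ** 1.852

# Head loss grows monotonically with flow (hypothetical pipe data):
drop_low = hazen_williams_drop(0.05, L=1000.0, D=0.3, rho=130.0)
drop_high = hazen_williams_drop(0.10, L=1000.0, D=0.3, rho=130.0)
```

Note that for flows against the reference direction, hydraulic solvers evaluate the drop as sign(q) times the loss for |q|; the sketch above assumes q > 0.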

The term $\mathbf{H}^T \mathbf{h}$ is the pressure drop across the pipes due to the difference in geodesic level (i.e., elevation) in meters [m] between the ends of the pipes, with $\mathbf{h} \in \mathbb{R}^n$ the vector of geodesic levels at each vertex.

#### *2.2. Structure of the Reduced Order Model*

The reduced-order network model is used in this paper to calculate the nominal pressure at the measured internal nodes. The model exploits the dependence of the pressures at the network's internal nodes on the pressure and flow measurements at the inlets. The details of the model derivation can be found in [22,23].

A network can be divided into the nodes connected to reservoirs (the inlet nodes) and the internal nodes that compose the system. To facilitate the explanation, information regarding the inlet nodes will be denoted by the $(r)$ superscript and that of the internal nodes by the $(in)$ superscript. In particular, vector $\mathbf{p}^{(in)}$ contains the node pressure values $p_1, \dots, p_{n-n_I}$ and $\mathbf{p}^{(r)}$ the inlet pressure values $p_{n-n_I+1}, \dots, p_n$.

The network needs to fulfill some conditions for the proposed reduced model to be used:

Condition 1: concerns the demands of the internal nodes of the system, where Equation (1) can be rewritten as:

$$\mathbf{d}(k) = -\mathbf{v}(k)\sigma(k),\tag{4}$$

where $\sigma(k)$ denotes the total inlet flow into the network at time instant $k$, and the vector $\mathbf{v}(k)$ defines the distribution of the total demand among the internal nodes at every time $k$, with the property $\sum_{i} v_i(k) = 1$. Notice that if all consumers are residential, all node demands have the same consumption profile and, in consequence, $\mathbf{v}(k)$ will be constant, $\mathbf{v}(k) = \mathbf{v}$.

Condition 2: is a particular case in which the vector $\mathbf{p}^{(r)}$ of control inputs fulfills the following:

$$\mathbf{p}^{(r)}(k) + \mathbf{h}^{(r)} = \kappa(k)\mathbb{1},\tag{5}$$

for some $\kappa \in \mathbb{R}$, which is the total head at the inlets in [mwc], and where $\mathbb{1}$ denotes the vector of ones. The feasibility of this condition is discussed in [23]; the controllers should satisfy this premise at least in networks with low total consumption.

If these two conditions are fulfilled, the pressure at the *i*th internal node can be expressed by:

$$p_i^{(in)}(k) = \alpha_i \sigma^2(k) + \sum_{j=1}^{n_I} \beta_{ij}(k)\, p_j^{(r)}(k), \tag{6}$$

where $\alpha_i$ is a parameter dependent on the network topology and the distribution of demands in the network, and $\beta_{ij}$, with $j = 1, \dots, n_I$, depends on the network topology. The total inlet flow $\sigma$ is typically well known, since inlet flows are measured.

Since model (6) of $p_i^{(in)}$ is linear in the parameters, standard parameter identification methods can be used to identify $\alpha_i$ and $\beta_{ij}$ [24], using the measurements of $\sigma$, $\mathbf{p}^{(r)}$, and $p_i^{(in)}$ at the nodes that contain pressure sensors, denoted in the following as $p_{s_i}$, $\forall i = 1, \dots, n_s$, where $n_s$ is the number of sensors installed at the inner nodes.
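As an illustration of this calibration step, the following sketch fits $\alpha_i$ and $\beta_{ij}$ by ordinary least squares for the single-inlet case ($n_I = 1$), treating the parameters as constants; the synthetic data and parameter values are hypothetical.

```python
def calibrate(sigma, p_r, p_meas):
    """Fit alpha, beta of model (6) for one inner node and a single
    inlet (n_I = 1): p = alpha * sigma^2 + beta * p_r.
    Ordinary least squares via the 2x2 normal equations."""
    a11 = sum(s ** 4 for s in sigma)
    a12 = sum(s ** 2 * p for s, p in zip(sigma, p_r))
    a22 = sum(p ** 2 for p in p_r)
    b1 = sum(s ** 2 * y for s, y in zip(sigma, p_meas))
    b2 = sum(p * y for p, y in zip(p_r, p_meas))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - b2 * a12) / det,   # alpha
            (a11 * b2 - a12 * b1) / det)   # beta

# Synthetic leak-free data generated with alpha = -0.002, beta = 0.95:
sigma = [1.0, 2.0, 3.0, 4.0]
p_r = [50.0, 49.0, 48.0, 47.0]
p_meas = [-0.002 * s ** 2 + 0.95 * p for s, p in zip(sigma, p_r)]
alpha, beta = calibrate(sigma, p_r, p_meas)   # recovers the parameters
```

With noisy measurements, the same normal equations yield the least-squares estimate rather than the exact parameters.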

Once the inner pressure model (6) has been calibrated, its accuracy can be assessed by computing the model error, or pressure residual, defined by:

$$r_{s_i} = \hat{p}_{s_i}(c) - p_{s_i}(c), \quad \forall i = 1, \dots, n_s, \tag{7}$$

where $c$ denotes the boundary conditions (heads and inflows at the inlets) necessary to compute the pressure estimation by means of (6). For example, minimum and maximum residual bounds $\underline{\sigma}_i$ and $\bar{\sigma}_i$, considering the available data, can be computed for every sensor $i = 1, \dots, n_s$ to obtain an idea of the accuracy of model (6). Sensor noise and model errors produce residual errors; if large residual bounds are obtained, improvements to model (6) should be considered. For example, the assumption that all the nodes have the same consumption profile can lead to a large error in some networks. In this case, the error could be decreased if model (6) were calibrated using only data from the same hour on different days: it would be assumed that different users can have different profiles at a given hour, but that a particular user has the same profile at a particular hour across days. This improvement implies the calibration of 24 different models (6) (one for each hour) and requires more historical data to obtain good accuracy. Another method of obtaining an estimation of the pressure at the inner nodes, $\hat{p}_{s_i}(c)$, is to use historical data directly as a lookup table, as proposed in [18]: given particular operating conditions $c$, the inner pressures are taken from the historical data whose operating conditions $\hat{c}$ were closest to $c$. Residuals (7) considering leak pressure measurements will be used in the leak localization, as explained in detail in the next section.
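The lookup-table alternative mentioned above can be sketched as follows; the structure of `history` (pairs of boundary conditions and recorded inner pressures) and the distance measure are assumptions for illustration.

```python
def lookup_estimate(c, history):
    """Lookup-table pressure estimation: return the recorded inner-node
    pressures whose boundary conditions were closest (in squared
    Euclidean distance) to the current conditions c."""
    def sq_dist(c_hist):
        return sum((a - b) ** 2 for a, b in zip(c, c_hist))
    _, p_best = min(history, key=lambda item: sq_dist(item[0]))
    return p_best

# Hypothetical history of (boundary conditions, inner pressures) pairs,
# with conditions = (total inflow, inlet head):
history = [((1.0, 50.0), [30.0]), ((2.0, 49.0), [28.0])]
estimate = lookup_estimate((1.9, 49.2), history)
```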

#### **3. Leak Localization**

The location of a leak in a WDN typically involves two steps: leak detection and leak localization [25]. The focus of this work is leak localization, assuming that detection has already been carried out. In addition, it is assumed that leaks can only occur at the nodes of the network (as considered in [7,26] or [8]), making the number of nodes equal to the number of potential leaks. The nodes correspond to water users, pipe junctions, and other structures such as hydrants. However, if the number of nodes does not provide a representative discretization of the network, some artificial nodes could be added.

In this section, two leak localization methods are proposed. The first one uses only the available measurements, and its diagnosis points to one of the inner pressure sensors installed in the WDN; the detected leak should therefore be in an area around this sensor (cluster). The second method combines the information of the first with topological information (characteristics of the pipes and connections between the nodes of the WDN) in a likelihood index that allows leak localization at the node level.

#### *3.1. Leak Localization at Cluster Level*

As stated before, the proposed leak localization is applied after leak detection. In addition to the inlet pressure and flow sensors, it is assumed that $n_s$ pressure sensors are installed at different inner nodes. Consider a leak $l_j$ acting on node $j$ of the network; the used measurements are assumed to be captured under this leaky situation. Additionally, leak-free historical data from all the sensors are assumed to be available. The pressure residual at the internal nodes that contain a sensor, defined in (7), can be computed as:

$$r_{s_i} = \hat{p}_{s_i}(c) - p_{s_i}(c^{l_j}), \quad \forall i = 1, \ldots, n_s, \tag{8}$$

where $\hat{p}_{s_i}(c)$ is the pressure estimation considering boundary conditions $c$ in a leak-free scenario. On the other hand, $p_{s_i}(c^{l_j})$ is the pressure value measured by the inner pressure sensor $i$ under boundary conditions $c^{l_j}$ (the same heads and inflows at the inlets as in $c$, but with a leak at node $j$).

Following the ideas in [18], positive residuals can be obtained from the following transformation:

$$\bar{r}_{s_i} = r_{s_i} - \min(r_{s_1}, \dots, r_{s_{n_s}}), \quad \forall i = 1, \dots, n_s. \tag{9}$$

Then, as the leak localization can be achieved by determining the residual pressure component with maximum size (see [22,27]), leak localization can be formulated as:

$$\hat{j} = \underset{i \in \{1, \dots, n_s\}}{\arg\max} \left\{ \bar{r}_{s_i} \right\}. \tag{10}$$

Notice that the result of the leak localization method (10) is one of the *ns* pressure sensor locations.

The leak localization result $\hat{j}$ then points not only to the sensor location $s_{\hat{j}}$ but also to the nodes for which a leak has a higher incidence on this sensor than on the other sensors (cluster $\hat{j}$).
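The cluster-level procedure of Equations (8)–(10) can be sketched in a few lines; the pressure estimations and measurements below are hypothetical.

```python
def localize_cluster(p_hat, p_meas):
    """Cluster-level leak localization, Eqs. (8)-(10): residuals are
    shifted to be non-negative and the index of the sensor with the
    largest residual is returned."""
    r = [ph - pm for ph, pm in zip(p_hat, p_meas)]         # Eq. (8)
    r_min = min(r)
    r_bar = [ri - r_min for ri in r]                       # Eq. (9)
    return max(range(len(r_bar)), key=lambda i: r_bar[i])  # Eq. (10)

# Hypothetical leak-free estimations vs. measurements under a leak;
# sensor 1 shows the largest pressure drop, so its cluster is selected:
j_hat = localize_cluster([30.0, 28.0, 25.0], [29.5, 26.0, 24.8])
```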

#### *3.2. Leak Localization at Node Level*

Considering the Hazen–Williams Equation (3), for every pipe (edge $e_z$) a resistance $R_z$ can be defined:

$$R_z = \frac{10.7 \cdot L_z}{\rho_z^{1.852} \cdot D_z^{4.87}}. \tag{11}$$

Among the multiple pipe paths that can connect every pair of nodes $i$, $j$, a path $\mathcal{P}_{ij}^{min}$ with minimum total resistance $R_{ij}$ can be computed by means of:

$$\mathcal{P}_{ij}^{min} = \underset{\mathcal{P}_{ij}^{(k)} \in \mathcal{P}_{ij}}{\arg\min} \sum_{e_z \in \mathcal{P}_{ij}^{(k)}} R_z, \tag{12}$$

where $\mathcal{P}_{ij} = \{\mathcal{P}_{ij}^{(1)}, \dots, \mathcal{P}_{ij}^{(e)}\}$ denotes the set of paths connecting nodes $i$ and $j$.

On the other hand, the minimum-resistance path from the $n_I$ inlets to a node $j$, $\mathcal{I}_j^{min}$, can be obtained by computing the minimum paths from the $n_I$ inlets to node $j$ by means of (12) and selecting the one with the minimum resistance among the $n_I$ paths.
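The minimum-resistance paths of Eq. (12) can be computed with a standard Dijkstra search over the pipe graph, using the resistances $R_z$ as edge weights; the adjacency structure below is a hypothetical example, and this is one possible implementation, not necessarily the one used by the authors.

```python
import heapq

def min_resistance_path(adj, src, dst):
    """Dijkstra search for the pipe path of minimum total resistance
    between two nodes, Eq. (12). adj[u] lists (v, R) pairs, where R is
    the Hazen-Williams resistance (11) of the pipe u-v (sketch)."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, R in adj[u]:
            nd = d + R
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path by walking the predecessor chain backwards.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[dst]

# Hypothetical triangle network: the direct pipe 0-2 has resistance 5,
# but the detour through node 1 has total resistance 2:
adj = {0: [(1, 1.0), (2, 5.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(0, 5.0), (1, 1.0)]}
path, total_R = min_resistance_path(adj, 0, 2)
```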

When a leak is produced at node $j$, $\mathcal{I}_j^{min}$ is the most probable path for the extra flow produced by the leak. Hence, the effect of a leak at node $j$ on sensor $s_i$ depends on the intersection of the paths from the inlets to node $j$ and to the node where the sensor is located: $\mathcal{I}_j^{min}$ and $\mathcal{I}_{s_i}^{min}$. To quantify the degree of incidence of the leak on the sensor, an incidence factor $g_{j,s_i}$ is defined as:

$$g_{j,s_i} = R^{c}_{j,s_i}\, \bar{g}_{j,s_i}, \tag{13}$$

where $R^c_{j,s_i}$ is the resistance of the path defined by $\mathcal{I}_j^{min} \cap \mathcal{I}_{s_i}^{min}$ (the superscript $c$ refers to the common path between node and sensor), and $\bar{g}_{j,s_i}$ is a normalization factor that takes into account the inverse of the resistance from node $j$ to the different sensors:

$$\bar{g}_{j,s_i} = \begin{cases} \dfrac{1/R_{j,s_i}}{\sum_{l=1}^{n_s} 1/R_{j,s_l}} & \text{if } j \neq s_i, \\ 1 & \text{if } j = s_i. \end{cases}$$

The $n\_s$ incidence factors associated with a leak in node *j*, $g\_{j,s\_i}$, $i = 1, \dots, n\_s$, can be normalized as:

$$\lambda\_{j,s\_i} = \frac{g\_{j,s\_i}}{\sum\_{l=1}^{n\_s} g\_{j,s\_l}}, \tag{14}$$

where the coefficient $\lambda\_{j,s\_i}$ determines the relative incidence of a leak in node *j* on sensor $s\_i$ with respect to all the $n\_s$ sensors, and fulfills:

$$\sum\_{i=1}^{n\_s} \lambda\_{j, s\_i} = 1 \quad \forall j = 1, \dots, n - n\_I. \tag{15}$$

For every node $j = 1, \dots, n - n\_I$, the sensor most sensitive to a leak in this node can be computed as:

$$\hat{\imath}\_{j} = \underset{i \in \{1, \ldots, n\_s\}}{\arg\max} \{\lambda\_{j, s\_i}\}.\tag{16}$$
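The chain formed by Equations (13), (14), and (16) can be sketched in a few lines of Python. The dictionary inputs and function name below are illustrative assumptions; the paper itself does not prescribe an implementation:

```python
def incidence_factors(R_common, R_node_sensor, node, sensors):
    """Compute the incidence factors g_{j,si} (Eq. 13), their
    normalized version lambda_{j,si} (Eq. 14), and the most
    sensitive sensor (Eq. 16) for a candidate leak node.
    R_common[s]     : resistance of the common path I_j ∩ I_s
    R_node_sensor[s]: resistance of the path from node j to sensor s
    """
    # Denominator of the normalization factor g_bar (unnumbered eq.).
    inv_sum = sum(1.0 / R_node_sensor[s] for s in sensors if s != node)
    g = {}
    for s in sensors:
        g_bar = 1.0 if s == node else (1.0 / R_node_sensor[s]) / inv_sum
        g[s] = R_common[s] * g_bar                 # Eq. (13)
    total = sum(g.values())
    lam = {s: g[s] / total for s in sensors}       # Eq. (14)
    best = max(sensors, key=lambda s: lam[s])      # Eq. (16)
    return lam, best
```

By construction the $\lambda\_{j,s\_i}$ returned for a node sum to one, which is exactly the property stated in Equation (15).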

The $n\_s$ clusters used in the leak localization defined in (10) can be computed from the sets of nodes that provide the same value of $\hat{\imath}\_{j}$. The cluster associated with the sensed node *l* is defined as:

$$\mathcal{C}\_{l} = \{v\_{j} \in V \mid \underset{i \in \{1, \ldots, n\_s\}}{\arg\max}\{\lambda\_{j, s\_i}\} = l\},\tag{17}$$

where $l = 1, \ldots, n\_s$. The topological information of $\lambda\_{j,s\_i}$ and the measurement information of the residuals $\bar{r}\_{s\_i}$ can be integrated in a parameter $\theta\_j$ defined as:

$$\theta\_{j} = \frac{1}{\bar{\theta}} \sum\_{i=1}^{n\_s} \lambda\_{j, s\_i}\, \bar{r}\_{s\_i}, \tag{18}$$

where $\bar{\theta}$ is a normalization factor. Then, $\theta\_j$ can be interpreted as a likelihood index, and the leak localization at cluster level defined in (10) can be formulated at node level as:

$$\hat{j} = \underset{j \in \{1, \ldots, n-n\_I\}}{\arg\max} \left\{ \theta\_j \right\}.\tag{19}$$

In order to improve the performance of the leak localization method, the information of the residuals at different time instants *k* can be taken into account by applying Bayes' rule as:

$$P\_{j}(k) = \frac{P\_{j}(k-1)\,\theta\_{j}(k)}{\sum\_{l=1}^{n-n\_I} P\_l(k-1)\,\theta\_l(k)},\tag{20}$$

where $P\_j(k-1)$ is the prior probability, whose initial value $P\_j(0)$ has to be determined (for example, $P\_j(0) = 1/(n - n\_I)$). Then, the leak node can be estimated from the posterior leak probabilities by:

$$\hat{j}(k) = \underset{j \in \{1, \ldots, n-n\_I\}}{\arg\max} \left\{ P\_j(k) \right\}.\tag{21}$$
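The recursive Bayesian fusion of Equations (20) and (21) can be sketched as follows. This is a minimal illustration with a flat prior $P\_j(0) = 1/(n - n\_I)$; the function names are assumptions for the example:

```python
def bayes_update(prior, theta):
    """One step of Equation (20): fuse the likelihood indices
    theta_j(k) with the prior probabilities P_j(k-1)."""
    joint = [p * t for p, t in zip(prior, theta)]
    total = sum(joint)
    return [x / total for x in joint]

def localize(theta_history):
    """Apply (20) recursively over time, then pick the most
    probable leak node as in Equation (21)."""
    n = len(theta_history[0])
    post = [1.0 / n] * n          # flat prior P_j(0) = 1/(n - n_I)
    for theta in theta_history:
        post = bayes_update(post, theta)
    return max(range(n), key=lambda j: post[j]), post
```

For instance, with likelihood indices `[[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]` over two time instants, the posterior concentrates on the second candidate node.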

#### **4. Sensor Validation**

When a leak is not detected by the leak detection method, anomalous values of the pressure residuals $r\_{s\_i}(k)$, $i = 1, \dots, n\_s$, defined in (7) can be used to detect sensor faults. Under the same operating conditions, the historical data of the inner pressure sensors (leak-free data or data for a particular leak scenario) can be used, first, to calibrate a pressure estimation model as described in Section 2.2 and, second, to determine residual bounds $\underline{\sigma}\_i$ and $\bar{\sigma}\_i$ that allow the implementation of pressure sensor fault detection through checking:

$$\begin{cases} r\_{s\_i}(k) \in \left[\underline{\sigma}\_i, \overline{\sigma}\_i\right] \Rightarrow \text{No Fault}\ (\phi\_i(k) = 0) \\ r\_{s\_i}(k) \notin \left[\underline{\sigma}\_i, \overline{\sigma}\_i\right] \Rightarrow \text{Fault in sensor } s\_i\ (\phi\_i(k) = 1). \end{cases} \tag{22}$$

The accuracy of this fault detection method depends on the width of the residual bounds $\underline{\sigma}\_i$ and $\bar{\sigma}\_i$ and, therefore, on the accuracy of the pressure estimation (6). In order to increase the accuracy of the fault detection method, spatial residuals [28] between pressure residuals (7) can be computed as:

$$S r\_{s\_i s\_j}(k) = r\_{s\_i}(k) - r\_{s\_j}(k) \quad \forall i = 1, \dots, n\_s - 1 \quad \text{and} \quad j = i + 1, \dots, n\_s. \tag{23}$$

In the same way as for the pressure residuals, spatial residual bounds $\underline{\varepsilon}\_{i,j}$ and $\bar{\varepsilon}\_{i,j}$ can be computed using leak-free data, and the fault detection can be implemented as follows:

$$\begin{cases} Sr\_{s\_i s\_j}(k) \in \left[ \underline{\varepsilon}\_{i,j}, \overline{\varepsilon}\_{i,j} \right] \Rightarrow \text{No Fault}\ (\Phi\_{i,j}(k) = 0) \\ Sr\_{s\_i s\_j}(k) \notin \left[ \underline{\varepsilon}\_{i,j}, \overline{\varepsilon}\_{i,j} \right] \Rightarrow \text{Fault}\ (\Phi\_{i,j}(k) = 1). \end{cases} \tag{24}$$

As model errors affect nearby pressure sensors in a similar way, some spatial residual bounds are expected to be smaller than the pressure residual bounds. Therefore, the fault detection defined by (24) will be more sensitive to pressure sensor faults than the one defined by (22). The accuracy of the sensor fault detection can be further increased by averaging residuals over a time window, leading to smaller residual bounds.
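The two tests of Equations (22) and (24) can be sketched as below. The representation of the bounds as lists and dictionaries is an assumption made for the example:

```python
def sensor_fault_signals(r, lo, hi):
    """Equation (22): phi_i(k) = 1 when residual i leaves its bounds."""
    return [0 if lo[i] <= r[i] <= hi[i] else 1 for i in range(len(r))]

def spatial_fault_signals(r, lo, hi):
    """Equations (23)-(24): pairwise spatial residuals
    Sr_{si,sj}(k) = r_i(k) - r_j(k) checked against pair bounds."""
    phi = {}
    n = len(r)
    for i in range(n - 1):
        for j in range(i + 1, n):
            sr = r[i] - r[j]
            phi[(i, j)] = 0 if lo[(i, j)] <= sr <= hi[(i, j)] else 1
    return phi
```

This also illustrates the sensitivity argument of the paragraph above: a residual vector such as `[0.2, 0.0, 0.0]` can stay inside wide individual bounds of ±0.3 (no fault signal from Equation (22)), while the tighter spatial bounds of ±0.1 already flag the pairs involving the first sensor.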

Once a residual bound has been violated, that is, at least one of the sensor fault signals $\phi\_i(k)$, $i = 1, \dots, n\_s$, or spatial fault signals $\Phi\_{i,j}(k)$, $i = 1, \dots, n\_s - 1$ and $j = i + 1, \dots, n\_s$, is equal to one, the sensor fault isolation can be implemented in two stages as described in Algorithm 1:

#### **Algorithm 1** Sensor validation search for sensor fault

*Stage 1:* In the case of the activation of one or more sensor fault signals $\phi\_i(k)$, $i = 1, \dots, n\_s$, as these signals are uniquely related to the sensors $s\_i$, the isolation is trivial: faulty sensors must be discarded for future leak localization, and the number of available healthy sensors $n\_s$ should be updated.

*Stage 2:* This stage only considers the spatial fault signals $\Phi\_{i,j}(k)$ of the $n\_s$ non-faulty sensors remaining from *Stage 1*. As these fault signals are potentially affected by two possible sensor faults, $s\_i$ and $s\_j$, the fault isolation can be implemented iteratively by the following steps:

1: **for** *i* ← 1, *ns* − 1 **do**

2: **for** *j* ← *i* + 1, *ns* **do**

3: **if** Φ*i*,*j*(*k*) == 1 **then**

4:

$$\hat{\imath} = \underset{l \in \{1, \dots, n\_s\}}{\arg\max} \left\{ \sum\_{j=l+1}^{n\_s} \Phi\_{l,j}(k) + \sum\_{j=1}^{l-1} \Phi\_{j,l}(k) \right\}. \tag{25}$$

5: Discard sensor $s\_{\hat{\imath}}$, eliminate the fault signals related to this sensor, and update *ns*.


In the case that two or more sensors obtain the same cost-function value in (25), lower than the maximum possible value $n\_s - 1$, the computation of (25) should be performed over a time window until new spatial fault signals are activated.
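Stage 2 of Algorithm 1 essentially votes for the sensor involved in the largest number of activated spatial fault signals, as expressed by Equation (25). A minimal sketch of that counting step (the dictionary encoding of $\Phi\_{i,j}(k)$ is an assumption carried over from the examples above):

```python
def isolate_faulty_sensor(phi_spatial, n_s):
    """Stage 2 of Algorithm 1 / Equation (25): the candidate faulty
    sensor is the one appearing in the largest number of activated
    spatial fault signals Phi_{i,j}(k)."""
    counts = [0] * n_s
    for (i, j), active in phi_spatial.items():
        if active:
            # Each activated pair implicates both of its sensors.
            counts[i] += 1
            counts[j] += 1
    return max(range(n_s), key=lambda l: counts[l])
```

With the activation pattern from the previous example, where both pairs involving sensor 0 are active and the pair (1, 2) is not, sensor 0 is isolated as faulty.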

#### **5. Case Study**

#### *5.1. Hanoi WDN*

The network used for this case study is a reduced model of the WDN of Hanoi (Vietnam). It is composed of one inlet (reservoir), 34 pipes, and 31 nodes, as represented in Figure 1.

**Figure 1.** Simplified Hanoi topological WDN.

To analyze the performance of the proposed approach, data for different conditions have been generated artificially using the EPANET hydraulic simulator [29]. In order to consider realistic scenarios, some uncertainty has been added to the data [30]: the magnitude of the leak is random within a range of 25 to 75 [l/s], that is, between 1% and 2.5% of the average inlet flow of the WDN. Furthermore, white noise has been added to emulate the noise present in real measurements, and an uncertainty of 10% (uniform distribution) was added to the nominal demand value.
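The uncertainty model described above can be sketched as follows. The noise standard deviation and the function name are assumptions made for the illustration; the paper does not state the noise magnitude:

```python
import random

def noisy_leak_scenario(nominal_demands, n_samples, seed=0):
    """Sketch of the uncertainty model used for data generation:
    random leak magnitude in [25, 75] l/s, 10% uniform demand
    uncertainty, and additive white measurement noise."""
    rng = random.Random(seed)
    leak_lps = rng.uniform(25.0, 75.0)                    # leak magnitude
    demands = [d * rng.uniform(0.9, 1.1) for d in nominal_demands]
    noise = [rng.gauss(0.0, 0.01) for _ in range(n_samples)]  # assumed std
    return leak_lps, demands, noise
```

In practice, these perturbations would be applied to the demands and pressure readings of an EPANET simulation run.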

The daily water consumption pattern used for the calibration of Equation (6), covering four days of operation, is shown in Figure 2.

The sampling rate is 10 min, but hourly average measurements are calculated to reduce uncertainties in the diagnosis stage.

**Figure 2.** Flow consumption.

#### Results

The performance of the proposed leak localization method at node level defined in Equation (21) will be analyzed using the Average Topological Distance (ATD) [11]. The ATD represents the topological distance between the node predicted as leaking and the node actually containing the leak. To calculate the ATD, it is first necessary to build a matrix $A \in \mathbb{R}^{(n-n\_I) \times (n-n\_I)}$ containing the minimum topological distances (in nodes or meters).

Finally, the confusion matrix $\Gamma\_{i,j}$ of dimension $(n - n\_I) \times (n - n\_I)$, defined in [18] and depicted in Table 1, is used to assess the performance of Equation (21). The rows of this matrix correspond to the leak scenarios and the columns to the nodes where the leak is located ($\hat{l}$) by the leak localization method.


**Table 1.** Confusion matrix **Γ**.

Considering the confusion matrix Γ, the ATD can be computed as follows:

$$ATD = \frac{\sum\_{i=1}^{n-n\_I} \sum\_{j=1}^{n-n\_I} \Gamma\_{i,j} A\_{i,j}}{\sum\_{i=1}^{n-n\_I} \sum\_{j=1}^{n-n\_I} \Gamma\_{i,j}}.\tag{26}$$
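Equation (26) is simply the $\Gamma$-weighted mean of the minimum topological distances, which can be sketched as (list-of-lists matrix encoding assumed for the example):

```python
def average_topological_distance(gamma, A):
    """Equation (26): ATD as the Gamma-weighted mean of the minimum
    topological distances A_{i,j} over all evaluated leak scenarios."""
    n = len(gamma)
    num = sum(gamma[i][j] * A[i][j] for i in range(n) for j in range(n))
    den = sum(gamma[i][j] for i in range(n) for j in range(n))
    return num / den
```

For example, with two candidate nodes, if three scenarios are localized correctly (distance 0 or on-diagonal) and one is misplaced by two nodes, the ATD is 2/4 = 0.5 nodes.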

Four cases with different numbers of sensors in the network have been considered in order to analyze how this affects the final result. Table 2 presents the nodes selected to contain a sensor. As seen in [31], the positioning of the sensors produces different results. Since this work does not discuss the optimal sensor arrangement, the sensors were placed so as to obtain an improvement in the results as the number of sensors increases.

**Table 2.** Nodes with sensors.


Using the inlet flow data and non-leak historical pressure measurements of the selected sensors, the $\beta\_i$ and $\alpha\_i$ with $i = 1, \dots, 31$ in (6) have been identified (notice that $n\_I = 1$). With these parameters, the pressure estimations under non-leak conditions in the network can be calculated from inlet measurements using Equation (6) and subsequently applied to calculate the residuals (9) with the pressures measured in leak scenarios. In addition, non-leak pressure measurements and estimations are used to generate fault-free pressure residuals $r\_{s\_i}(k)$ and bounds $\underline{\sigma}\_i$, $\bar{\sigma}\_i$, $i = 1, \dots, n\_s$, as well as spatial residuals $Sr\_{s\_i,s\_j}(k)$ and bounds $\underline{\varepsilon}\_{i,j}$, $\bar{\varepsilon}\_{i,j}$, $\forall i = 1, \dots, n\_s - 1$ and $j = i + 1, \dots, n\_s$.

For every sensor configuration, the normalized incidence factors (14) have been computed with topological information: node connections and pipe characteristics (length, diameter, and roughness). Figure 3 compares the information on the incidence of single leaks on the pressure sensors obtained with a hydraulic model against that obtained by means of topological information. The nodes selected to have sensors are the ones defined in the first case of Table 2. In particular, Figure 3c shows the clustering that groups the nodes producing the highest effect on a specific pressure sensor. Nodes in yellow define the cluster of nodes where a leak produces a maximum pressure deviation from the non-leak scenario in the sensor installed in node 12, and the same holds for the nodes in violet, red, and green regarding the pressure sensors in nodes 17, 23, and 29, respectively. Finally, nodes in black are nodes that produce a similar variation of pressure (difference of variation less than 0.1 [mwc]) in at least two different pressure sensors. In order to obtain this information, a hydraulic model computing the difference between non-leak and leak pressures in all the nodes for the different leaks is required. On the other hand, Figure 3a shows the clustering that takes into account the shortest weighted pipe length (hydraulic distance), that is, the sum of $L\_z/D\_z^{4.87}$ over all edges $e\_z$ in the path to each sensor, the sensor with the smallest value being taken as the most closely related one, as used in Ref. [18]. Finally, the clustering depicted in Figure 3b is defined by Equation (17), which is based on the common resistance path explained in Section 3. These two last clusterings, which only require topological information, could be used in the leak localization method at cluster level defined in Equation (10).
It is important to emphasize that the clustering based on the common resistance path, proposed in this paper and depicted in Figure 3b, resembles the clustering based on the actual leak effect in the network (given by the model), depicted in Figure 3c, much more closely than the clustering based on the hydraulic distance, depicted in Figure 3a. Therefore, the clustering proposed in this paper provides more accurate information for leak localization purposes than that based on the hydraulic distance. For example, as shown in Figure 3c, when a leak is present in nodes 3, 4, 5, 6, 7, 8, or 9, the sensor most affected by the leak is the sensor in node 12. This information coincides with that provided by the clustering depicted in Figure 3b, which is computed only with topological information. However, according to the clustering of Figure 3a, based on the hydraulic distance between nodes and sensors, the closest sensor to these nodes is the sensor in node 17.

**Figure 3.** The $n\_s$ clusterings generated with: (**a**) the shortest weighted pipe length; (**b**) the resistance of the common path $R^{c}\_{j,s\_i}$; (**c**) the maximum residual. Nodes in yellow, violet, red, and green define the clusters related to the sensors installed in nodes 12, 17, 23, and 29, respectively.

Figure 4 shows the analysis of the relative incidence index $\lambda\_{j,s\_i}$ defined in Equation (14) for all the nodes $j = 1, \dots, 31$, with one subplot for every sensor $s\_i$, $i = 1, \dots, 4$. As this index is normalized, its values lie in the range [0,1). The nodes with a higher index (browner color) are those that produce a higher effect on the pressure sensor $s\_i$.

Figure 5a displays the evolution of the ATD (in nodes) obtained by the leak localization method based on the Kriging spatial interpolation methodology presented in [18], as a function of the time horizon (in hours) used recursively by Bayes' rule in (20). Four different sensor configurations are considered, with 4, 6, 8, and 10 sensors placed optimally in order to maximize the performance of the leak localization proposed in [18]. This performance can be compared with the one obtained by the new leak localization method proposed in this paper at node level, defined in Equation (19), with the same dataset and the same sensor configurations as in [18], depicted in Figure 5b, and with the sensor configurations shown in Table 2, depicted in Figure 5c.

Figure 5a shows that the leak localization performance of the Kriging method improves significantly from four to eight sensors and more moderately from eight to ten sensors, still achieving a good result even with noisy data, reaching an ATD equal to 2.5 nodes. As can be seen in Figure 5b,c, the new leak localization method always outperforms the Kriging method, even when using the sensor configurations proposed in [18], which were computed to optimize the performance of the Kriging method. Figure 5c shows that the sensor configurations proposed in [18] are not optimal for the proposed method, but the performance can be improved by changing the sensor configurations, in this case manually.

**Figure 4.** Relative incidence index $\lambda\_{j,s\_i}$ for all the nodes ($j = 1, \dots, 31$), corresponding to: (**a**) 1st sensor (*i* = 1), (**b**) 2nd sensor (*i* = 2), (**c**) 3rd sensor (*i* = 3), and (**d**) 4th sensor (*i* = 4).

In order to illustrate the performance of the proposed sensor validation method, Case 1 (four sensors) is considered. The four sensor residuals computed by Equation (7) have been evaluated over a time window of 24 h using leak-free data, leading to upper residual bounds equal to:

$$[\bar{\sigma}\_1, \bar{\sigma}\_2, \bar{\sigma}\_3, \bar{\sigma}\_4] = [0.11, 0.06, 0.09, 0.11]$$

and the lower residual bounds equal to:

$$[\underline{\sigma}\_1, \underline{\sigma}\_2, \underline{\sigma}\_3, \underline{\sigma}\_4] = [-0.14, -0.10, -0.10, -0.08].$$

In the same way, the six spatial residuals defined by (23) have been computed in the same conditions as sensor residuals leading to spatial residual bounds:

$$[\bar{\varepsilon}\_{1,2}, \bar{\varepsilon}\_{1,3}, \bar{\varepsilon}\_{1,4}, \bar{\varepsilon}\_{2,3}, \bar{\varepsilon}\_{2,4}, \bar{\varepsilon}\_{3,4}] = [0.06, 0.07, 0.08, 0.04, 0.05, 0.03]$$

and

$$[\underline{\varepsilon}\_{1,2}, \underline{\varepsilon}\_{1,3}, \underline{\varepsilon}\_{1,4}, \underline{\varepsilon}\_{2,3}, \underline{\varepsilon}\_{2,4}, \underline{\varepsilon}\_{3,4}] = [-0.04, -0.06, -0.08, -0.06, -0.09, -0.03].$$

Figures 6 and 7 depict the evolution of the sensor and spatial residuals with their respective bounds in a fault scenario of sensor 1, which corresponds to the pressure sensor in node 12. The fault is a drift of 0.1 [mwc] that starts on the 5th day. As shown in Figure 6, by applying (22) to the sensor residuals it is impossible to detect the fault until the end of day 9 (i.e., 4 days later), when the residual of sensor 1 violates its bounds. However, by applying (24) to the spatial residuals it is possible to detect the fault in 10 h: $Sr\_{s\_1,s\_2}$ violates its bounds after 10 h, and $Sr\_{s\_1,s\_3}$ and $Sr\_{s\_1,s\_4}$ violate their bounds after 16 h and 22 h, respectively.

**Figure 5.** Evolution of the ATD between the methods: (**a**) using the Kriging interpolation method presented in [18], (**b**) using the new leak localization method with the same sensor configurations as in [18] and (**c**) using the new localization method with sensor configurations of Table 2.

#### *5.2. Modena WDN*

The second case study selected to test the performance is the reduced model of the real water distribution network of the Italian city of Modena. This large-scale network comprises 268 junctions (nodes) connected through 317 pipes and served by four reservoirs. There are no pumps in the network, since it is entirely gravity-fed [32,33].

The EPANET hydraulic simulator was used to generate artificial data to analyze the performance of the proposed method. The following simulation conditions were considered:


**Figure 6.** Graph of the filtered residuals with a fault in sensor number 1: (**a**) 1st sensor, (**b**) 2nd sensor, (**c**) 3rd sensor, and (**d**) 4th sensor.

**Figure 7.** Graph of the spatial residual with a fault in sensor number 1.

Three types of sensor faults were considered to analyze the sensor validation method: sensor bias, sensor drift, and abrupt sensor failure. The sensor bias fault was simulated as a step change, and the drift fault as a time-varying ramp signal. In both cases, the fault magnitude was randomly chosen within a range of 0.1 to 0.2 [mwc]. The abrupt failure was simulated by setting the sensor output to zero.
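The three simulated fault types can be sketched as follows. The function name, the default magnitude of 0.15 [mwc] (inside the stated 0.1–0.2 range), and the ramp slope are assumptions for the illustration:

```python
def inject_fault(signal, kind, start, magnitude=0.15):
    """Sketch of the three simulated sensor faults: a bias step,
    a drift ramp, and an abrupt failure (output stuck at zero)."""
    out = list(signal)
    n = len(out)
    if kind == "bias":            # step change of `magnitude` [mwc]
        for k in range(start, n):
            out[k] += magnitude
    elif kind == "drift":         # ramp reaching `magnitude` at the end
        for k in range(start, n):
            out[k] += magnitude * (k - start + 1) / (n - start)
    elif kind == "abrupt":        # sensor output set to zero
        for k in range(start, n):
            out[k] = 0.0
    return out
```

Applying these to leak-free pressure traces yields the faulty scenarios evaluated in the Results Section below.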

#### Results

As in the previous case study, the Average Topological Distance (ATD) was used to assess the performance of the proposed leak localization method at node level defined in (19). Two scenarios have been considered, with five and ten pressure sensors, presented in Figure 8a,b, respectively. As emphasized in the previous section, performance in the leak localization task is highly dependent on the number of sensors installed in the network [34–36].

Figure 9 shows the evolution of the ATD defined in (26), applied with the Bayesian temporal reasoning (20), to represent the leak localization performance of the proposed method. The figure shows that the leak localization performance reached an ATD of 8 and 5.5 nodes with 5 and 10 inner pressure sensors installed in the network, respectively. Considering that the proposed leak localization method only requires topological information and non-leak historical data of the available measurements, the obtained performance is reasonably good.

On the other hand, a total of 6000 scenarios of 10 days each were simulated to evaluate the sensor validation method for the five-sensor configuration depicted in Figure 8a. Thus, 1000 scenarios were generated for each sensor, with sensor bias, sensor drift, and abrupt sensor failure applied randomly, and the remaining 1000 without faults.

To calculate the residual and spatial residual bounds, a 6-month leak-free scenario was generated. The five sensor residuals computed by Equation (8) were evaluated over a time window of 24 h, and the observed bounds were enlarged by 24%, leading to upper residual bounds equal to:

$$[\bar{\sigma}\_1, \bar{\sigma}\_2, \bar{\sigma}\_3, \bar{\sigma}\_4, \bar{\sigma}\_5] = [0.10, 0.06, 0.04, 0.01, 0.04]$$

and to lower residual bounds equal to:

$$[\underline{\sigma}\_1, \underline{\sigma}\_2, \underline{\sigma}\_3, \underline{\sigma}\_4, \underline{\sigma}\_5] = [-0.08, -0.05, -0.03, -0.01, -0.06].$$

Next, the ten spatial residuals defined by (23) were computed under the same conditions as the sensor residuals, leading to the spatial residual bounds:

$$[\bar{\varepsilon}\_{1,2}, \bar{\varepsilon}\_{1,3}, \bar{\varepsilon}\_{1,4}, \bar{\varepsilon}\_{1,5}, \bar{\varepsilon}\_{2,3}, \bar{\varepsilon}\_{2,4}, \bar{\varepsilon}\_{2,5}, \bar{\varepsilon}\_{3,4}, \bar{\varepsilon}\_{3,5}, \bar{\varepsilon}\_{4,5}] = [0.07, 0.08, 0.08, 0.10, 0.06, 0.05, 0.06, 0.03, 0.06, 0.05]$$

and

$$[\underline{\varepsilon}\_{1,2}, \underline{\varepsilon}\_{1,3}, \underline{\varepsilon}\_{1,4}, \underline{\varepsilon}\_{1,5}, \underline{\varepsilon}\_{2,3}, \underline{\varepsilon}\_{2,4}, \underline{\varepsilon}\_{2,5}, \underline{\varepsilon}\_{3,4}, \underline{\varepsilon}\_{3,5}, \underline{\varepsilon}\_{4,5}] = [-0.07, -0.09, -0.08, -0.09, -0.05, -0.04, -0.06, -0.03, -0.04, -0.04].$$

For this study, classification accuracy was the evaluation metric applied. To this purpose, the confusion matrix was used, which presents the classification accuracy and the misclassification error: the horizontal axis of the confusion matrix describes the predicted labels of the samples, while the vertical axis depicts their true labels. The right side shows the percentages of correctly and incorrectly classified observations for each true class.

**Figure 8.** Configuration of pressure sensors in Modena WDN: (**a**) 5 sensors, (**b**) 10 sensors.

**Figure 9.** Evolution of the ATD.

Figure 10 illustrates the confusion matrix for all the generated scenarios and shows that the accuracy of detecting sensor faults is very high: the lowest accuracy corresponds to faults in sensor number five (95.4%) and the highest to faults in sensor number three (100%). Regarding the no-fault scenarios, eight of the 1000 fault-free scenarios presented one false alarm among the 240 samples of the scenario, therefore providing an average interval between false detections of 240,000/8 = 30,000 h.

**Figure 10.** Confusion matrix for sensor validation method.

#### **6. Conclusions**

A new data-driven method for leak localization in WDNs, based on historical non-leak data and the topological information of the network, is proposed. The method is triggered when a leak is detected, and it is based on the evaluation of residuals generated from leak pressure measurements in some inner nodes and the estimation of leak-free pressures in these nodes by means of a reduced-order model and historical data. Topological information is used to compute a new incidence factor that considers the most probable path of water from the reservoirs to the pressure sensors and the potential leak nodes. The proposed incidence factor, combined with residual information, generates a likelihood index that allows leak localization at node level. In addition, a sensor validation method based on the sensor pressure residuals, which is able to detect and isolate pressure sensor faults, is proposed.

The general performance of the proposed method for leak localization and sensor validation is evaluated on reduced models of the Hanoi and Modena water distribution networks. The leak localization results are compared with a previously published technique, with satisfactory results. Future work can aim to improve the leak localization and sensor validation performances, including the study of an algorithm to automatically determine the optimal sensor placement required to maximize the leak localization performance.

**Author Contributions:** Conceptualization, D.A., J.B., E.D. and L.R.; methodology, D.A., J.B., E.D. and L.R.; software, D.A.; validation, D.A. and J.B.; formal analysis, D.A., J.B., E.D. and L.R.; investigation, D.A., J.B., E.D. and L.R.; data curation, D.A.; writing—original draft preparation, D.A.; writing review and editing, D.A., J.B., E.D. and L.R.; visualization, D.A. and J.B.; supervision, J.B., E.D. and L.R.; project administration, J.B., E.D. and L.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has been partially funded by SMART Project (ref.num. EFA153/16 Interreg Cooperation Program POCTEFA 2014-2020), L-BEST Project (PID2020-115905RB-C21) funded by MCIN/ AEI /10.13039/501100011033 and AGAUR ACCIO RIS3CAT UTILITIES 4.0–P1 ACTIV 4.0. ref.COMRDI-16-1-0054-03.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data can be found in https://github.com/adeboracris/Epanet_Leak_localization (accessed on 10 November 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Two Simultaneous Leak Diagnosis in Pipelines Based on Input–Output Numerical Differentiation**

**Adrián Navarro-Díaz 1, Jorge-Alejandro Delgado-Aguiñaga 2,\*, Ofelia Begovich <sup>3</sup> and Gildas Besançon <sup>4</sup>**


**Abstract:** This paper addresses the two simultaneous leak diagnosis problem in pipelines based on a state vector reconstruction, as a strategy to mitigate water shortages in large cities, by only considering the availability of flow rate and pressure head measurements at both ends of the pipeline. The proposed algorithm considers the parameters of both leaks as new state variables with constant dynamics, which results in an extended state representation. By applying a suitable persistent input, an invertible mapping in *x* can be obtained as a function of the input and output, including their time derivatives up to third order. The state vector can then be reconstructed by means of an algebraic-like observer through the computation of time derivatives using Numerical Differentiation with Annihilators, considering its inherent noise rejection properties. Experimental results showed that the leak parameters were reconstructed accurately using a test bed plant built at Cinvestav Guadalajara.

**Keywords:** fault diagnosis; pipelines; multiple leaks; numerical differentiation; experimental results

#### **1. Introduction**

In recent decades, climate change and the overuse of natural water resources have caused water scarcity in big cities. Furthermore, water distribution system operators (WDSOs) are facing major water losses, as high as 65%, due to pipeline leaks caused by lack of maintenance, illegal intrusion, or accidents.

According to a study performed by the Organisation for Economic Co-operation and Development (OECD), entitled *Water Governance in Cities* [1], aging water networks have a negative impact in terms of efficiency. One of the consequences is water loss from pipeline leaks. On average, water loss in the surveyed cities (in the referred report) was 21% in 2016. However, for Mexican cities, water loss was more than 40% (Chihuahua, Mexico City, San Luis Potosi) or even up to 65% (Tuxtla); see Figure 1.

On the other hand, to satisfy the current demand, government policies focus on bringing more water from faraway places instead of solving the water losses due to leaks. This means that the amount of water lost is currently accounted for in the water budget. Interestingly, the amount of water needed to meet the demand in deficit is very similar to what is lost through leaks. In other words, it could be possible to satisfy the current water demand by minimizing the water losses due to leaks, without the need to bring more water from faraway places.

**Citation:** Navarro-Díaz, A.; Delgado-Aguiñaga, J.-A.; Begovich, O.; Besançon, G. Two Simultaneous Leak Diagnosis in Pipelines Based on Input–Output Numerical Differentiation. *Sensors* **2021**, *21*, 8035. https://doi.org/10.3390/s21238035

Academic Editors: Miquel À. Cugueró-Escofet and Vicenç Puig

Received: 3 November 2021 Accepted: 26 November 2021 Published: 1 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Figure 1.** Proportion of water loss in surveyed cities (leakage rate).

The implementation of leak detection and isolation (*LDI*) systems has demonstrated a reduction in water losses in pipelines. *LDI* systems are algorithms that perform the following tasks: *detection* and *localization* of one or several leaks in a pipeline system. Thus, once a leak is identified, repair technicians can fix the leak and avoid the water loss. Many works that address the *LDI* problem for one leak have been reported [2–6]. For instance, in [2], a leak isolation methodology using a fitting loss coefficient calibration is presented on the basis of two stages: in the first stage, the equivalent straight length (*ESL*) is fixed by a model-based observer designed as an extended Kalman filter (*EKF*); in the second stage, an algebraic observer is started with the *ESL* value fixed by the previous observer. Finally, the estimated leak position is recovered in the original coordinates, since the observer deals with *ESL* coordinates. The authors of [3] presented a methodology for leak detection and isolation in pipelines based on data fusion using two approaches: a steady-state estimation and an *EKF*. They concluded that the solution of the *LDI* problem improves significantly when a steady-state estimation is incorporated into the estimation provided by the *EKF*; in other words, the solution provided by the *EKF* is less accurate by itself. In [4], the authors propose a bank of observers together with a Genetic Algorithm (*GA*), which is exploited to minimize the integral of the squared observation error. The minimum integral observation error is reached in the observer where the estimated leak parameters match the real values. Experimental results show an accurate leak position estimation in a test bed pilot plant. In [5], a combined artificial neural network (*ANN*) for leak diagnosis in pipelines is presented. The *ANN* scheme estimates the location and friction factor based on measurement data. An average error of 0.629% was obtained for leak location in the experiments. More recently, in [6], a new approach for solving the *LDI* problem in pipelines is introduced on the basis of a Kalman filter for linear parameter varying (*LPV*) systems. The off-line computation of the filter gain allows the computational effort to be reduced, and the authors claim that the *LPV* design outperforms the classical *EKF* design in terms of parameter-estimation accuracy.

On the other hand, the multi-leak case study has also been considered from two different perspectives: (i) for sequential leaks (non-concurrent case) [7–9] and (ii) for the simultaneous leak problem (concurrent case) [10].

In [7], a model adaptation strategy is proposed to isolate non-concurrent multiple leaks based on extended Kalman filters. Experimental results show the potential of this approach, allowing each new leak to be monitored no matter where it appears. Following this direction, a scheme is proposed in [8] for detecting and locating multiple sequential leaks based on the combination of an adaptive observer, to identify the hydraulic gradient in real time, and a leak location observer, to estimate the leak position and its outflow. Experimental results on a pilot pipeline showed a satisfactory estimation in spite of operational changes and leaks. More recently, in [9], a pressure distribution analysis is proposed to diagnose the location of leaks via an experimental study and computational fluid dynamics (*CFD*) simulation. Multiple flow rate testing is conducted to detect the locations of two leaks.

A more complex instance of the multi-leak problem arises when two or more leaks occur at the same time, also known as the concurrent case. In [10], the orthogonal collocation method (*OCM*) is used to obtain an approximate solution of the water hammer equations (*WHE*). An estimator is then designed based on the spatially discretized model to detect multiple leaks, identifying their positions and leak coefficients by applying a persistent input [11]. The results are presented via simulation.

Regarding the one-leak problem, state observer-based techniques have been proposed and successfully evaluated, since the observer convergence is guaranteed even under constant inputs. This is because the structure of the underlying state-space representation fulfills a uniform observability condition, which is independent of the input [12]. Conversely, when two or more leaks occur simultaneously, such an observability condition is no longer satisfied and the observability depends on the input. In particular, in steady state, the output produced by two or more leaks is equivalent to that produced by a single *virtual* leak, a phenomenon known as indistinguishability [11].

#### *1.1. Problem Statement*


#### *1.2. Methods*


Hereinafter, the paper is organized as follows: In Section 2, a mathematical model is derived from the well-known water hammer equations and the two simultaneous leak problem is stated. State vector reconstruction based on injection of input–output time derivatives is presented in Section 3. Experimental results are presented in Section 4 by using databases from a pilot plant built at Cinvestav Guadalajara. Finally, several conclusions and future perspectives are given in Section 5.

#### **2. Preliminaries**

#### *2.1. Pipeline Mathematical Model*

#### 2.1.1. Governing Equations

The transient flow through a pipeline can be described by the mass and momentum conservation equations, known as the water hammer equations, a pair of quasi-linear hyperbolic partial differential equations (*PDE*). These *PDE* are generally derived under the following assumptions: the fluid is slightly compressible, the duct wall is slightly deformable, and the convective velocity changes are negligible. The cross section and the fluid density are assumed to be constant [13]:

*Momentum Equation*

$$\frac{\partial Q(z,t)}{\partial t} + gA \frac{\partial H(z,t)}{\partial z} + \mu Q(z,t)\,|Q(z,t)| = 0 \tag{1}$$

*Continuity Equation*

$$\frac{\partial H(z,t)}{\partial t} + \frac{b^2}{gA} \frac{\partial Q(z,t)}{\partial z} = 0 \tag{2}$$

Here, *Q* stands for the flow rate [m³/s]; *H* is the pressure head [m]; *z* is the length coordinate [m]; *t* is the time coordinate [s]; *g* is the gravity acceleration [m/s²]; *A* is the cross-section area [m²]; *b* is the pressure wave speed in the fluid [m/s]; *μ* = *τ*/(2*φA*), where *φ* is the inner diameter [m] and *τ* is the friction factor. The dynamics in Equations (1) and (2) are fully defined by related pairs of initial and boundary conditions.

#### 2.1.2. Finite Difference Approximation

For the purpose of obtaining a finite dimensional model from (1) and (2), the partial differential equations are discretized with respect to the spatial variable *z*, as in [14,15], by using the following approximations:

$$\text{Section } i: \left\{ \begin{array}{ll} \dfrac{\partial H(z_i, t)}{\partial z} \cong \dfrac{H_{i+1} - H_i}{\Delta z_i} & \forall i = 1, \cdots, n \\[2mm] \dfrac{\partial Q(z_{i-1}, t)}{\partial z} \cong \dfrac{Q_i - Q_{i-1}}{\Delta z_{i-1}} & \forall i = 2, \cdots, n \end{array} \right. \tag{3}$$

where index *i* stands for the variable discretized in (1) and (2) at section *i*. To solve the *LDI* problem for two simultaneous leaks, Equations (1) and (2) admit a simple spatial discretization as shown in Figure 2, where sections are defined according to the two leakage positions:

**Figure 2.** Discretization of the pipeline with two arbitrarily located leaks.

Here, *Ql*1,2 represent the leaking flows that can be modeled as:

$$Q\_{l\_{1,2}} = \lambda\_{1,2} \sqrt{H\_{2,3}} \tag{4}$$

where *λ*1,2 is a constant that depends on the orifice size and the discharge coefficient, and *H*2,3 is the pressure head at the leak point.

Thus, assuming a lumped-parameter model for the flow equations and considering that the pressure head and flow rate measurements are available at both ends of the pipeline via sensors, a low order dynamical representation of the system with two leaks can be written using approximation (3) in Equations (1) and (2), as follows:

$$\begin{bmatrix} \dot{Q}_1\\ \dot{H}_2\\ \dot{Q}_2\\ \dot{H}_3\\ \dot{Q}_3 \end{bmatrix} = \begin{bmatrix} -\dfrac{gA}{\Delta z_1}(H_2 - H_1) - \dfrac{\tau}{2\phi A}Q_1|Q_1| \\[2mm] -\dfrac{b^2}{gA\,\Delta z_1}\left(Q_2 - Q_1 + \lambda_1\sqrt{H_2}\right) \\[2mm] -\dfrac{gA}{\Delta z_2}(H_3 - H_2) - \dfrac{\tau}{2\phi A}Q_2|Q_2| \\[2mm] -\dfrac{b^2}{gA\,\Delta z_2}\left(Q_3 - Q_2 + \lambda_2\sqrt{H_3}\right) \\[2mm] -\dfrac{gA}{\Delta z_3}(H_4 - H_3) - \dfrac{\tau}{2\phi A}Q_3|Q_3| \end{bmatrix} \tag{5}$$

Notice that, due to mass conservation, *Ql*1,2 must satisfy the next relation:

$$Q\_{l\_{1,2}} = Q\_{b\_{1,2}} - Q\_{a\_{1,2}} \tag{6}$$

where *Qb*1,2 and *Qa*1,2 are the flows in an infinitesimal length before and after the leak, respectively.
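To make the structure of (5) concrete, the right-hand side can be coded directly. The following Python sketch is illustrative only: the parameter values are loosely based on the pilot plant of Section 4, the friction term *μ* is held constant for simplicity (in the paper it is updated via Equations (9) and (10)), and none of the names come from the authors' software.

```python
import math

# Illustrative pipeline parameters (loosely based on Table 1)
g = 9.81                                    # gravity [m/s^2]
A = math.pi * (6.271e-2 / 2) ** 2           # cross-section area [m^2]
b = 358.0                                   # pressure wave speed [m/s]
mu = 0.0166 / (2 * 6.271e-2 * A)            # mu = tau / (2*phi*A), constant tau assumed

def pipeline_rhs(state, H1, H4, dz, lam):
    """Right-hand side of the lumped two-leak model, Equation (5).

    state = [Q1, H2, Q2, H3, Q3]; H1, H4 are the boundary pressure heads;
    dz = (dz1, dz2, dz3) section lengths; lam = (lambda1, lambda2) leak coefficients.
    """
    Q1, H2, Q2, H3, Q3 = state
    dz1, dz2, dz3 = dz
    l1, l2 = lam
    return [
        -g * A / dz1 * (H2 - H1) - mu * Q1 * abs(Q1),
        -b**2 / (g * A * dz1) * (Q2 - Q1 + l1 * math.sqrt(H2)),
        -g * A / dz2 * (H3 - H2) - mu * Q2 * abs(Q2),
        -b**2 / (g * A * dz2) * (Q3 - Q2 + l2 * math.sqrt(H3)),
        -g * A / dz3 * (H4 - H3) - mu * Q3 * abs(Q3),
    ]
```

A quick sanity check of the structure: with both leak coefficients set to zero and equal flows in all sections, the pressure-head derivatives vanish, as mass conservation requires.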

#### 2.1.3. Pipeline Equivalent Straight Length

It is worth pointing out that the mathematical model (5) assumes a straight pipe. This is not a loss of generality because even if the pipe is not straight, it is possible to obtain an equivalent straight length (*ESL*) of the pipe. The *ESL* is the straight length of a virtual pipe (with the same parameters as the original duct) that would give rise to the same pressure drop as the real pipeline. Such an equivalence is calculated by considering losses due to each "non-straight element" (i.e., fitting) in accordance with the Darcy–Weisbach formula [2,16,17]:

$$l_e = \frac{\phi}{\tau} K \tag{7}$$

where *le* is the equivalent straight length of a specific fitting, *K* is the so-called loss coefficient parameter, normally provided by the pipe manufacturer, and *τ* is the friction coefficient. Thus, the total *ESL* of the pipe, *Le*, can be calculated as follows:

$$L_e = L_r + \frac{\phi}{\tau} \sum_{i=1}^{n_f} K_i \tag{8}$$

where *Lr* stands for the sum of all pipeline straight length elements, *Ki* is the fitting loss coefficient for the *i*-th fitting, and *nf* the number of the pipeline fittings.
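A minimal sketch of Equation (8) in Python (the function and variable names are our own, not from the paper):

```python
def equivalent_straight_length(L_r, phi, tau, K_list):
    """Total equivalent straight length, Equation (8): Le = Lr + (phi/tau) * sum(Ki).

    L_r: sum of the straight length elements [m]; phi: inner diameter [m];
    tau: friction coefficient; K_list: loss coefficients of the n_f fittings.
    """
    return L_r + (phi / tau) * sum(K_list)
```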

#### 2.1.4. Friction Model

The friction factor (*τ* in Equation (1)) represents the loss of pressure of a fluid due to the interactions between the fluid and the internal surface roughness of the pipe. Thus, the friction factor is a function of the Reynolds number, Re, and the pipe's roughness [18,19].

In many practical cases, *τ* is deemed to be a constant value, commonly taken from the Moody chart. Nevertheless, in pipes with a relative roughness below about 1 × 10⁻³, the zone where the friction factor is almost constant (i.e., the complete turbulence zone) is difficult to reach. Consequently, when the *LDI* scheme is applied to a plastic pipe, it is preferable to obtain a more accurate friction value by using a formula or an algorithm to estimate it. In this work, the authors propose the use of the well-known Swamee–Jain equation to directly calculate the friction coefficient:

$$\tau_i(Q_i) = \frac{0.25}{\left[\log_{10}\left(\dfrac{\varepsilon}{3.7\phi} + \dfrac{5.74}{\operatorname{Re}_i^{0.9}}\right)\right]^2} \tag{9}$$

where subscript *i* denotes the section number of the pipeline (see Figure 2). The Reynolds number is, in turn, a function of the flow rate, *Qi*, as follows:

$$\text{Re}\_{i} = \frac{Q\_{i}\phi}{\nu A} \tag{10}$$

where *ν* is the kinematic viscosity of the water. Notice that, due to the leak occurrence, the flow rate is different in each pipeline section (see Figure 2), causing a significant deviation of the friction factor value (since a plastic pipe commonly operates in the transition zone). Therefore, it is important to introduce Equations (9) and (10) in the mathematical model to calculate, at every sampling time, the friction coefficient resulting from the flow variations in each *i*-th section.
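Equations (9) and (10) translate directly into code. The sketch below assumes water at roughly 20 °C (kinematic viscosity *ν* ≈ 1 × 10⁻⁶ m²/s) and exposes the absolute roughness *ε* as a parameter; the function names are illustrative, not from the paper.

```python
import math

def reynolds(Q, phi, A, nu=1.0e-6):
    """Reynolds number from the flow rate, Equation (10).
    nu: kinematic viscosity of water [m^2/s] (approx. 1e-6 at 20 degrees C)."""
    return Q * phi / (nu * A)

def swamee_jain(Q, phi, A, eps=0.0, nu=1.0e-6):
    """Friction factor via the Swamee-Jain formula, Equation (9).
    eps: absolute pipe roughness [m] (0 for a hydraulically smooth pipe)."""
    Re = reynolds(Q, phi, A, nu)
    return 0.25 / math.log10(eps / (3.7 * phi) + 5.74 / Re**0.9) ** 2
```

For a smooth pipe at Re = 10⁵ this yields *τ* ≈ 0.018, in line with the Moody chart for the transition zone.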

#### *2.2. Two Simultaneous Leak Problem Statement*

In this work, the two simultaneous leak case is considered, i.e., a pair of leaks can appear in a pipeline at locations $z_1 \in (0, L)$ and $z_2 \in (0, L)$, with $z_2 > z_1$. Thus, the problem reduces to estimating the lengths of the pipe sections $\Delta z_1$ ($\Delta z_1 = z_1$) and $\Delta z_2$ ($\Delta z_2 = z_2 - z_1$); see Figure 2.

Now, Equation (5) can be written in compact form as follows:

$$\begin{array}{rcl} \dot{\zeta} &=& \xi(\zeta) + \rho(\zeta)\,\gamma \\ \Psi &=& \vartheta(\zeta) \end{array} \tag{11}$$

where $\zeta = [\zeta_1\ \zeta_2\ \zeta_3\ \zeta_4\ \zeta_5]^T = [Q_1\ H_2\ Q_2\ H_3\ Q_3]^T \in \mathbb{R}^5$ is the state vector, $\gamma = [H_{in}\ H_{out}]^T \in \mathbb{R}^2$ is the input vector, and $\Psi = [Q_{in}\ Q_{out}]^T \in \mathbb{R}^2$ is the output vector, for some functions $\xi$, $\rho$, and $\vartheta$.

The leak diagnosis problem for two simultaneous leaks appearing in a pipeline then amounts to the estimation of the parameters $\Delta z_1$, $\Delta z_2$, $\lambda_1$, and $\lambda_2$ in (5). Let us consider these parameters as new state variables with constant dynamics [11], that is, if $\theta = [\Delta z_1\ \lambda_1\ \Delta z_2\ \lambda_2]^T$ then $\dot{\theta} = 0$. This results in the extended state $x = [\zeta\ \theta]^T = [Q_1\ H_2\ Q_2\ H_3\ Q_3\ \Delta z_1\ \lambda_1\ \Delta z_2\ \lambda_2]^T =: [x_1\ x_2\ x_3\ x_4\ x_5\ x_6\ x_7\ x_8\ x_9]^T \in \mathbb{R}^9$, so that the leak positions are $x_6$, $x_8$ and the leak magnitudes are $x_7$, $x_9$. Then, considering a unidirectional flow, i.e., $Q_i|Q_i| = Q_i^2$, the extended state representation of (11) takes the following form:

$$\begin{array}{rcl} \dot{x} &=& f(x, u) \\ y &=& h(x) \end{array} \tag{12}$$

where $u := [H_1\ H_4]^T =: [u_1\ u_2]^T$, and $f$, $h$ are differentiable vector fields with the following structure:

$$f(x,u) = \begin{bmatrix} -\dfrac{gA}{x_6}(x_2 - u_1) - \mu_1(x_1)\,x_1^2 \\[2mm] -\dfrac{b^2}{gA\,x_6}\left(x_3 - x_1 + x_7\sqrt{x_2}\right) \\[2mm] -\dfrac{gA}{x_8}(x_4 - x_2) - \mu_2(x_3)\,x_3^2 \\[2mm] -\dfrac{b^2}{gA\,x_8}\left(x_5 - x_3 + x_9\sqrt{x_4}\right) \\[2mm] -\dfrac{gA}{\Delta z_3}(u_2 - x_4) - \mu_3(x_5)\,x_5^2 \\[2mm] O_{4\times 1} \end{bmatrix} \tag{13}$$

$$h(\mathfrak{x}) = \left[ \begin{array}{c} y\_1(\mathfrak{x}) \\ y\_2(\mathfrak{x}) \end{array} \right] = \left[ \begin{array}{c} Q\_1 \\ Q\_3 \end{array} \right] = \left[ \begin{array}{c} \mathfrak{x}\_1 \\ \mathfrak{x}\_5 \end{array} \right]$$

where $O_{4\times 1}$ is the $4 \times 1$ zero vector, $\Delta z_3 = L - x_6 - x_8$, and $\mu_i = \frac{\tau_i}{2\phi A}$ with $\tau_i$ computed as in Equation (9), $\tau_{1,2,3}$ being functions of $x_1$, $x_3$, and $x_5$, respectively.

### **3. State Vector Reconstruction Based on Input–Output Numerical Differentiation**

#### *3.1. Observability Discussion*

For the one leak case, the so-called observability rank condition is satisfied, and such a property does not depend on the inputs [12]. However, in the case of two simultaneous leaks (or even more), this is no longer true, since the states are not distinguishable by applying constant inputs, and thus a persistent input is required [11].

In fact, to reconstruct the actual state vector *x* of (12), one can compute time derivatives of the output as functions of the state variables as well as the input and its time derivatives (using Equation (13)), such that an invertible map with respect to the full state vector *x* is obtained by applying an appropriate input. More precisely, considering the two simultaneous leak case described by (12), we have two input variables and two output variables, from which input–output time derivative vectors can be defined up to orders $p$, $p'$ for the inputs and $q$, $q'$ for the outputs, as:

$$U_{(p,p')}(t) := \begin{bmatrix} u_1\ \dot{u}_1\ \cdots\ u_1^{(p)}\ \ u_2\ \dot{u}_2\ \cdots\ u_2^{(p')} \end{bmatrix}^T \tag{14}$$

and

$$Y_{(q,q')}(t) := \begin{bmatrix} y_1\ \dot{y}_1\ \cdots\ y_1^{(q)}\ \ y_2\ \dot{y}_2\ \cdots\ y_2^{(q')} \end{bmatrix}^T \tag{15}$$

Clearly, from (13), the output time derivatives depend on the state and input time derivatives, so that we can get:

$$Y_{(q,q')}(t) = \Gamma\left(x,\ U_{(p,p')}\right) \tag{16}$$

for some $p$, $p'$, given $q$, $q'$.

On this basis, observability somehow means that this relationship is invertible and that it is possible to find elements among the components of Γ defining an invertible map with respect to *x* [20,21]. Specifically, if such an inverted map exists, and the input–output and the corresponding time derivatives are known, then, it is possible to compute each independent state in (16) as follows:

$$x = \Gamma^{-1}\left(U_{(p,p')},\ Y_{(q,q')}\right) \tag{17}$$

Notice that, in general, it can be of interest to avoid or limit time derivatives of the input; however, assuming that they are available, or can be estimated in the same way as the time derivatives of *y*, is enough for our present application.

It should be noted that for this *LDI* problem, it is even enough to obtain an expression relating only the input, output, and their time derivatives with the leak parameters. We will propose this in Section 3.3, using a procedure similar to *elimination* (which, conversely to realization, consists of deriving an externally equivalent representation not containing the state [22]). We will even see how to recover the full state vector. However, let us first discuss the way to obtain the input–output time derivatives.

#### *3.2. Numerical Differentiation with Annihilators*

The *LDI* scheme proposed in this work is based on Equation (17), that is, the injection of the inputs, outputs, and their time derivatives. Although there are multiple algorithms to numerically compute a time derivative, here we use numerical differentiation with annihilators [23] due to its inherent noise rejection properties. A brief explanation of this algorithm, whose application to the *LDI* problem is proposed for the first time in this paper, is given below; it is summarized by Equation (25).

Let $\gamma^{(m)}(t)$ denote the $m$-th order derivative of a smooth signal $\gamma(t)$ defined on an interval $I \subset \mathbb{R}_+$. The signal $\gamma(t)$ may represent a measured variable corrupted by noise, whose derivatives are not directly available.

Ignoring the noise for a moment, let *γ*(*t*) be an analytical function on I. So, without any loss of generality, it is possible to consider the truncated Taylor expansion at *t* = 0:

$$\gamma(t) = \sum\_{i=0}^{m} a\_i \frac{t^i}{i!} + \mathcal{O}(t^m) \tag{18}$$

where $a_i = \left.\frac{d^i \gamma(t)}{dt^i}\right|_{t=0}$. This implies that $\gamma(t)$ can be approximated by the polynomial $p_m(t) = \sum_{i=0}^{m} a_i \frac{t^i}{i!}$.

In this way, the *i*-th order time derivative estimation of *γ*(*t*) can be tackled as a parameter estimation problem for *pm*(*t*). Using the method described in [23], it is possible to calculate each *ai* (*i* = 0, ... , *m*) independently, reducing in this manner the sensitivity to noise and numerical computation errors which often appear in simultaneous estimation methods. Moreover, the independent calculation allows the use of higher order polynomials without the calculation of all their coefficients. The algorithm is described as follows (a complete explanation is given in [23]):

• Let *pm*(*t*) be the *m*-th order polynomial approximation of *γ*(*t*),

$$p\_m = \frac{a\_0}{0!} + \frac{a\_1}{1!}t + \frac{a\_2}{2!}t^2 + \dots \dots + \frac{a\_m}{m!}t^m \tag{19}$$

• Transforming Equation (19) into Laplace domain yields:

$$P_m(s) = \frac{a_0}{s} + \frac{a_1}{s^2} + \frac{a_2}{s^3} + \cdots + \frac{a_m}{s^{m+1}} \tag{20}$$

• In order to calculate the coefficient *ai* (and hence the *i*-th time derivative approximation), it is first necessary to annihilate every *ax* with *x* > *i* in (20), using the following operator:

$$\left[\prod_{l=0}^{m-i} \frac{d}{ds}\right] \cdot s^{m+1} \tag{21}$$

• Then, to annihilate every *ax* (*x* < *i*), the following operator is subsequently applied to Equation (20):

$$\left[\prod\_{h=0}^{i} \frac{d}{ds}\right] \cdot s^{-1} \tag{22}$$

• Finally, the resulting equation (after applying Equations (20)–(22)) is:

$$\begin{aligned} (-1)^i (m-i)!\, i!\, a_i\, s^{-(i+1)} &= \\ \sum_{h=0}^{i} \sum_{l=0}^{m-i+h} (-1)^{i-h} \binom{i}{h} &\binom{m-i+h}{l} \frac{(i-h)!\,(m+1)!}{(i+1-h+l)!}\, s^l \frac{d^l P_m(s)}{ds^l} \end{aligned} \tag{23}$$

• Now, multiplying both sides of the above equation by *s*−(*m*+1) yields a polynomial taking the following form:

$$\frac{c\_m}{s^{i+m+2}}a\_i = \frac{1}{s}\frac{d^m P\_m(s)}{ds^m} + \frac{c\_{m-1}}{s^2}\frac{d^{m-1}P\_m(s)}{ds^{m-1}} + \dots + \frac{c\_0}{s^{m+1}}P\_m(s) \tag{24}$$

• Using the Cauchy rule for iterated integrals, the time domain expression for *ai* in Equation (24) yields:

$$a_i = \frac{(m+i+1)!}{c_m T^{m+i+1}} \int_0^T \left[ t^m + \frac{c_{m-1}}{1!}(T-t)t^{m-1} + \cdots + \frac{c_0}{m!}(T-t)^m \right] \gamma(t)\, dt \tag{25}$$

It should be noted that each constant $c_j$, $j = \{0, 1, \dots, m\}$, is obtained from Equation (23), and *T* is the length of a moving window for the integral. A short time window is sufficient to obtain accurate estimations. In addition, the iterated integrals act as low-pass filters that smooth out highly fluctuating noise; therefore, no previous knowledge of the statistical properties of the noise is required to filter it out.
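As a concrete illustration of the annihilator approach, the lowest-order case ($m = 1$, $i = 1$) reduces Equation (25) to a single weighted integral over the window, $a_1 = \frac{6}{T^3}\int_0^T (2t - T)\,\gamma(t)\,dt$, where the weight integrates to zero and therefore annihilates the constant coefficient $a_0$. The following Python sketch is our own minimal implementation of this case (not the authors' code), with the integral evaluated by trapezoidal quadrature:

```python
def algebraic_first_derivative(samples, dt):
    """First-derivative estimate over a sliding window: the lowest-order case
    (m = 1, i = 1) of the annihilator approach,
        a1 = (6 / T^3) * integral_0^T (2t - T) * gamma(t) dt.
    The weighted integral acts as a low-pass filter, so no noise model is needed.
    samples: measurements on the window; dt: sampling period."""
    n = len(samples)
    T = (n - 1) * dt
    # weighted integrand, then composite trapezoidal quadrature
    integrand = [6.0 * (2 * i * dt - T) / T**3 * y for i, y in enumerate(samples)]
    return dt * (sum(integrand) - 0.5 * (integrand[0] + integrand[-1]))
```

For a noise-free linear signal γ(*t*) = 3 + 2*t* the estimator returns the slope, since the weight exactly cancels the constant term; on noisy data the integral averages the noise out.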

#### *3.3. Extended State Vector Reconstruction*

The purpose of this section is to provide a synthesis of the proposed *LDI* scheme. The method is based on the study of the structure of the input–output differential equation; thus, the problem is solved by exploiting the observability property of the system (13). First, a set of equations describing each state as a function only of the inputs, outputs, and their time derivatives is derived. Then, taking advantage of the filtering characteristic of the numerical differentiation exposed in Section 3.2, the input–output time derivatives are computed. The leak positions and magnitudes can thus be calculated from a pair of algebraic equations. The algorithm used to derive this set of equations is described next.

It is easy to check that the observability rank condition for the system (12) is satisfied (for a finite number of input–output time derivatives) only by applying a persistent input. More precisely, it is easy to see that the condition holds for an input of the form $A_p \sin(\omega t)$. This waveform is easy to generate with a variable-frequency drive, a device commonly installed in pumping stations.

The rank condition of (12) is fulfilled with $p = 3$, $p' = 2$, $q = 4$, and $q' = 3$ in Equations (14) and (15). Thus, the output derivatives as well as the inverse mapping can be computed as follows (perhaps the major difficulty of the algorithm is obtaining the inverse mapping, due to the complexity of the resulting equations):

First, by construction of Equation (13), the state variables *x*<sup>1</sup> and *x*<sup>5</sup> are taken directly from the measurements:

$$\mathbf{x}\_1 = \mathbf{y}\_1 \tag{26}$$

$$\mathbf{x\_5} = \mathbf{y\_2} \tag{27}$$

The next step is to take the time derivatives of Equations (26) and (27) using (13), replacing $x_1$ and $x_5$ by the outputs according to (26) and (27). Solving the resulting equations for $x_2$ and $x_4$, respectively, yields expressions free of $x_1$ and $x_5$:

$$x_2 = \Phi_1(x_6,\ u_1,\ y_1,\ \dot{y}_1) \tag{28}$$

$$x_4 = \Phi_2(x_6,\ x_8,\ u_2,\ y_2,\ \dot{y}_2) \tag{29}$$

In the same way, the third step is to compute the derivative of Equation (28) and substitute $x_1$, $x_5$, $x_2$, and $x_4$ according to (26), (27), (28), and (29), respectively. Solving this equation for $x_3$ yields an expression free of $x_1$, $x_2$, $x_4$, and $x_5$:

$$x_3 = \Phi_3(x_6,\ x_7,\ u_1,\ \dot{u}_1,\ y_1,\ \dots,\ y_1^{(2)}) \tag{30}$$

Following the same steps, now for (29) and (30), it is possible to find an expression for *x*<sup>7</sup> and *x*<sup>9</sup> free of *x*1, *x*2, *x*3, *x*4, and *x*5:

$$x_7 = \Phi_4(x_6,\ x_8,\ u_1,\ \dots,\ u_1^{(2)},\ u_2,\ y_1,\ \dots,\ y_1^{(3)},\ y_2,\ \dot{y}_2) \tag{31}$$

$$x_9 = \Phi_5(x_6,\ x_8,\ u_1,\ \dots,\ u_1^{(2)},\ u_2,\ \dot{u}_2,\ y_1,\ \dots,\ y_1^{(3)},\ y_2,\ \dots,\ y_2^{(2)}) \tag{32}$$

The final step is to apply the same methodology for Equations (31) and (32), but now for solving for the states *x*<sup>6</sup> and *x*8. In this way, it is feasible to obtain an expression of *x*<sup>6</sup> and *x*<sup>8</sup> just as a function depending on the input and output, and their time derivatives:

$$x_6 = \Phi_6(u_1,\ \dots,\ u_1^{(3)},\ u_2,\ \dot{u}_2,\ y_1,\ \dots,\ y_1^{(4)},\ y_2,\ \dots,\ y_2^{(3)}) \tag{33}$$

$$x_8 = \Phi_7(u_1,\ \dots,\ u_1^{(3)},\ u_2,\ \dots,\ u_2^{(2)},\ y_1,\ \dots,\ y_1^{(4)},\ y_2,\ \dots,\ y_2^{(3)}) \tag{34}$$

At this point, the whole state can be recovered from the input and output time derivatives calculated through Equation (25): first, $x_6$ and $x_8$ (the leak positions) are obtained from Equations (33) and (34); once they are available, the leak magnitudes $x_7$ and $x_9$ can be computed using (31) and (32). The rest of the state can then be recovered by going backwards through Equations (30), (29), and (28).
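The expressions $\Phi_1$–$\Phi_7$ are obtained symbolically and are too long to reproduce, but the first step of the cascade can be inverted by hand from the model structure. The Python sketch below is our own illustration of Equations (28) and (29) under the simplifying assumption of a constant friction term; the parameter values are illustrative and the function names are hypothetical.

```python
import math

# Illustrative parameters (loosely based on the pilot plant of Section 4)
g = 9.81                          # gravity [m/s^2]
A = math.pi * (6.271e-2 / 2) ** 2 # cross-section area [m^2]
L = 68.2                          # pipeline length [m]
mu = 0.0166 / (2 * 6.271e-2 * A)  # constant friction term assumed

def phi1(x6, u1, y1, y1_dot):
    """Equation (28): invert the Q1 dynamics for H2 (= x2), given a leak
    position candidate x6 = dz1 and the measured inlet data.
    From  y1_dot = -(g*A/x6)*(x2 - u1) - mu*y1**2,  solve for x2."""
    return u1 - x6 * (y1_dot + mu * y1**2) / (g * A)

def phi2(x6, x8, u2, y2, y2_dot):
    """Equation (29): invert the Q3 dynamics for H3 (= x4), with dz3 = L - x6 - x8.
    From  y2_dot = -(g*A/dz3)*(u2 - x4) - mu*y2**2,  solve for x4."""
    dz3 = L - x6 - x8
    return u2 + dz3 * (y2_dot + mu * y2**2) / (g * A)
```

A round-trip check confirms the inversion: generating $\dot{y}_1$ from the forward model for a known $x_2$ and feeding it to `phi1` recovers $x_2$ exactly.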

#### **4. Experimental Results**

In this section, experimental results are presented to evaluate the proposed *LDI* methodology. The experiments are performed using several databases from the pilot plant located at Cinvestav Guadalajara. Two different simultaneous-leak scenarios were emulated by opening electrovalves located along the pipeline. A general description of the pipeline prototype is presented below, together with a detailed description of each experiment.

#### *4.1. Pilot Pipeline Description*

The layout of the pilot pressurized water pipe at Cinvestav Guadalajara, which is 68.2 [m] long (between sensors) with an internal diameter of 6.271 × 10⁻² [m], wall thickness of 1.27 × 10⁻² [m], friction coefficient of 1.66 × 10⁻², pressure wave speed of 358 [m/s], and gravity acceleration of 9.81 [m/s²], is shown in Figure 3. The line is instrumented with water-flow (FT) and pressure-head (PT) sensors at the inlet and outlet of the pipe. To emulate the leaks, three control valves are installed at positions 16.8, 33.3, and 49.8 [m], together with electronic actuators that can set practically any valve opening.

**Figure 3.** Schematic diagram of the pipeline's prototype.

The prototype is manufactured from a plastic material known as polypropylene random copolymer, whose technical characteristics can be found in [24]. It is integrated with a storage tank of 7.5 × 10⁻¹ [m³], a 5 HP hydraulic pump, and a variable-frequency drive (*VFD*), which controls the pressure in the system through the rotational speed of the pump motor (more details about the pipeline prototype can be found in [25]).

To ensure the application of a persistent input, experiments with operating point variations are carried out by varying the pump speed via the *VFD* (specifically, a change of the form $A_p \sin(\omega t)$, where $\omega$ is the angular frequency induced via the *VFD*). Table 1 summarizes the pipeline's main parameters.



#### *4.2. LDI Results*

Hereinafter, two off-line examples of the *LDI* scheme are presented. Both were carried out using data from the pipeline prototype described above. The experiment was performed as follows: Pump 1 operates in steady state during approximately the first 65 [s]. After that, it begins to operate in an unsteady state, namely a sine-like pressure signal is introduced, acting exactly as a persistent input in the sense of [11]. This sine signal was obtained experimentally by setting up the pump controller as follows:

$$\begin{array}{rcl} VFD(t) & = & 60[Hz], \forall t \le 65[s] \\ VFD(t) & = & 50 + 5 \sin(2.7313t)[Hz], \forall t > 65[s] \end{array} \tag{35}$$

At *t* ≈ 60 [s], two leaks were induced simultaneously by opening the control valves in the pilot plant.
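The excitation profile (35) is straightforward to reproduce in software; the function name below is our own:

```python
import math

def vfd_setpoint(t):
    """Pump drive frequency [Hz] reproducing the excitation profile of Equation (35)."""
    if t <= 65.0:
        return 60.0                              # steady-state start-up phase
    return 50.0 + 5.0 * math.sin(2.7313 * t)     # persistent sinusoidal excitation
```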

The leak positions were estimated by injecting the inputs and outputs, to which a low-pass filter had previously been applied (*u*1, *u*2, *y*1, and *y*2 in (13)), together with the corresponding time derivatives, into Equations (33) and (34), respectively.

Then, the leak magnitudes were also estimated by using Equations (31) and (32). It is possible to recover the whole state going backwards using Equations (30)–(28). As stated earlier, the time derivatives were computed using the methodology discussed in Section 3.2.

#### 4.2.1. Experiment 1: Leaks Induced in Valves *n*◦1 and *n*◦3

The first experiment consists of simultaneously inducing two leaks in valve *n*◦1 and valve *n*◦3 (see Figure 3). The algorithm starts once a leak is detected, i.e., when the deviation between the upstream and downstream flows exceeds a threshold, |*Qin* − *Qout*| > *δ*, where *δ* is a constant defined by the designer (normally related to the signal-to-noise ratio); here *δ* = 1 × 10⁻⁴ [m³/s]. Immediately afterwards, a sinusoidal waveform as in (35) is applied to ensure sufficient observability of the model (13) via persistent inputs (see Section 3.1). Once the sinusoidal steady state has been reached, the *LDI* scheme starts. Figure 4 shows the time evolution of the pressure head at the inlet and outlet of the pipe (system inputs, *Hin* = *u*1 and *Hout* = *u*2). As can be seen, the *LDI* algorithm begins close to 65 [s].

**Figure 4.** Pressure heads *Hin* and *Hout*.
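The detection trigger described above can be sketched as follows (the default threshold value is taken from this section; the function itself is our illustration):

```python
def leak_detected(Q_in, Q_out, delta=1e-4):
    """Trigger for the LDI algorithm: a leak is declared when the flow-balance
    residual |Q_in - Q_out| exceeds the noise-related threshold delta [m^3/s]."""
    return abs(Q_in - Q_out) > delta
```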

Similarly, Figure 5 shows the inlet and outlet flows (system outputs *Qin* = *y*1 and *Qout* = *y*2). Due to the physical nature of the leak phenomenon, the inlet flow clearly departs from the outlet flow:

**Figure 5.** Flow rates *Qin* and *Qout*.

Once the input and output time derivatives have been computed with Equation (25), they are used in Equations (33) and (34) to reconstruct the leak positions. As Figure 6 shows, one signal cycle (*Tin* = 20 s) is enough to correctly locate both leaks despite signal noise. To quantify the accuracy of the leak position estimation, the Mean Absolute Error (MAE) index is applied: for the first leak, the estimation accuracy is 96.62% (with respect to the whole pipeline length), whereas for the second leak it is 96.33%.

**Figure 6.** Estimation of leak positions.

Similarly, once the leak positions are obtained, the leak magnitudes can be reconstructed using Equations (31) and (32). The corresponding results are shown in Figure 7, where the estimated magnitudes settle around 1.3 × 10⁻⁴ [m^(5/2)/s] and 0.95 × 10⁻⁴ [m^(5/2)/s], respectively. The outflow computed using Equation (4) for each leak is consistent with the total outflow.

**Figure 7.** Estimation of leak magnitudes.

Notice that for the *LDI* problem, estimating the parameters of both leaks is enough; however, as mentioned before, it is also possible to recover the remaining states of (13) by going backwards through Equations (30), (29), and (28).

#### 4.2.2. Experiment 2: Leaks Induced in Valve *n*◦1 and *n*◦2

To better illustrate the effectiveness of the method, a second experiment is presented. The setup was exactly the same as in Section 4.2.1, except that the two leaks are now induced in valve *n*◦1 and valve *n*◦2 (see Figure 3). Figure 8 shows the time evolution of the upstream and downstream pressure heads. As before, the *LDI* algorithm starts when the flow deviation exceeds the predefined threshold *δ* (|*Qin* − *Qout*| > *δ*).

**Figure 8.** Pressure heads *Hin* and *Hout* (2nd experiment).

Figure 9 shows the corresponding upstream and downstream flow rate evolution. As stated above, due to the physical nature of the leak phenomenon, the inlet flow departs from the outlet flow.

**Figure 9.** Flow rate *Qin* and *Qout* (2nd experiment).

As before, once the algorithm starts, the input and output time derivatives are computed following the procedure explained in Section 3.2; Equations (33) and (34) are then used to obtain the leak positions. The results are presented in Figure 10: the leak positions are well estimated despite signal noise in just one signal cycle (*Tin* = 20 s). Here, the MAE yields an estimation accuracy of 95.01% for the first leak and 97.94% for the second.

**Figure 10.** Estimation of leak positions (2nd experiment).

Continuing with the algorithm, the leak magnitudes are computed with Equations (31) and (32). These results are shown in Figure 11.

**Figure 11.** Estimation of leak magnitudes (2nd experiment).

#### **5. Conclusions**

The simultaneous leak detection and isolation problem remains open and challenging. Even though extensive research in the field is in progress, to the best of our knowledge only simulation results have been reported so far. One reason is that the distinguishability of two (or more) simultaneous leaks depends on the input.

In this work, an *LDI* methodology for the detection and isolation of two simultaneous leaks has been proposed, based on an algebraic observer that uses the injected input, the outputs of the system, and the corresponding time derivatives. The time derivatives are computed using *Numerical Differentiation with Annihilators*, and the approach has been successfully applied to real data. This methodology could be extended to more general cases of simultaneous leaks in real operational conditions, such as the *SIAPA* aqueduct in Guadalajara, Mexico.

**Author Contributions:** A.N.-D.: software, validation, data curation, writing—original draft preparation, methodology, formal analysis, and investigation; J.A.D.-A.: conceptualization, methodology, formal analysis, writing—review and editing, and investigation; O.B.: writing—review; G.B.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the Tecnológico de Monterrey through the School of Engineering and Science.

**Acknowledgments:** The authors would like to thank Tecnológico de Monterrey for the financial support and the facilities granted for the fulfillment of this research. All databases were obtained by the second author during his PhD studies at Cinvestav Guadalajara.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Chlorine Concentration Modelling and Supervision in Water Distribution Systems**

**Ramon Pérez 1,2, Albert Martínez-Torrents 2, Manuel Martínez 2,3, Sergi Grau 4,5, Laura Vinardell 2, Ricard Tomàs 4, Xavier Martínez-Lladó <sup>2</sup> and Irene Jubany 2,\***


**Abstract:** The quality of the drinking water distributed through networks has become a main concern of most operators. This work focuses on chlorine, one of the most important variables in drinking water distribution networks (WDNs) that use disinfection. This powerful disinfectant must be dosed carefully in order to reduce disinfection byproducts (DBPs). The literature demonstrates researchers' interest in modelling chlorine decay using several different approaches. Nevertheless, the full-scale application of these models is far from being a reality in the supervision of water distribution networks. This paper combines the use of validated chlorine prediction models with an intensive study of a large amount of data and its influence on the models' parameters. These parameters are estimated and validated using data coming from the Supervisory Control and Data Acquisition (SCADA) software of a full-scale water distribution system, together with off-line analytics. The result is a powerful methodology for calibrating a chlorine decay model on-line which evolves coherently over time along with the significant variables that influence it.

**Keywords:** chlorine; water distribution networks; modelling; supervision; decay model

#### **1. Introduction**

Disinfection is one of the most important steps in water treatment, as it must ensure the microbiological safety of the water generated, not only after treatment, but also throughout the transport process to the consumption point. Many countries use chlorine-based chemicals (sodium hypochlorite, chlorine dioxide, chloramines, etc.) to achieve this objective, as they guarantee the degree of residual disinfection potential that is required by their laws [1]. If required, booster disinfection stations are installed at different points in the network. Their need and best location can be optimized using models and tools based on estimates of the chlorine concentration. Chlorine concentration is precisely one of the most relevant parameters to consider for the water distribution network (WDN) quality management. Although chlorine ensures the absence of pathogens, it is the main cause of the formation of disinfection byproducts (DBPs) [2]. Most of these compounds are toxic or carcinogenic for human health and need to be controlled to ensure drinking water safety [3]. Thus, European legislation limits the concentration of some DBPs in drinking water [4].

**Citation:** Pérez, R.; Martínez-Torrents, A.; Martínez, M.; Grau, S.; Vinardell, L.; Tomàs, R.; Martínez-Lladó, X.; Jubany, I. Chlorine Concentration Modelling and Supervision in Water Distribution Systems. *Sensors* **2022**, *22*, 5578. https://doi.org/10.3390/s22155578

Academic Editors: Eduard Llobet and Angela Maria Stortini. Received: 9 May 2022; Accepted: 19 July 2022; Published: 26 July 2022.

Nowadays, given the lack of reliable and applicable models for predicting chlorine behavior, disinfection management is not optimal in most WDNs, since it is based on point-specific control rather than consideration of the whole network [5]. There are often dark points where the chlorine level may be too low, along with over-chlorination at other points (particularly in summer), with a subsequent increase in both operating costs and DBP concentrations.

The absence of robust models for predicting chlorine behavior in WDNs is fundamentally due to two aspects: (1) the complexity of modelling the hydraulics of the WDNs and (2) the need for on-line quality data. Although authors report good results in chlorine prediction in full-scale networks in some studies [6], the predictions become less accurate when the environmental conditions or the water composition change from those of the calibration. Such a situation is very common in WDNs fed with treated surface water.

Regarding the first aspect, WDNs are highly meshed and complex systems, the behavior of which is difficult to predict. The introduction of flow, level, and pressure sensors, and automated metering readers (AMRs) for consumption, has recently increased model accuracy [7,8]. Thus, the intense use of a large amount of hydraulic data, together with hydraulic models and numerical simulators, allows the prediction of the residence time, which is one of the main parameters needed for successful water quality prediction.

Regarding the second aspect, it is mandatory to obtain information on water quality in the effluent of the drinking water treatment plant and the relevant points in the WDNs in order to predict the behavior of chlorine. Several studies [6,9] base their decay models on parameters that are easily measured on-line, such as temperature, pH, redox potential, conductivity, turbidity, and chlorine concentration. Nevertheless, the calibration and maintenance of these models for their on-line use is seldom performed.

Common models for chlorine modelling in WDNs are first-order linear differential equations [10] such as:

$$\frac{dC}{dt} = -K\_b \cdot C^{\alpha} \tag{1}$$

where *Kb* is a constant that contains the different parameters and physicochemical phenomena that may affect chlorine decay, such as natural organic matter, inorganic compounds, or temperature [6], and *α* is the reaction order. More sophisticated models have also been studied, including second-order models [11].
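As a minimal numerical sketch of this first-order model (taking the physically decaying form dC/dt = −*Kb*·C with *α* = 1), a forward-Euler integration can be checked against the analytic solution C(t) = C0·e^(−*Kb*·t); all values are illustrative:

```python
import math

# First-order bulk chlorine decay, dC/dt = -Kb * C (alpha = 1), integrated
# with explicit Euler and compared with the analytic exponential solution.

def euler_decay(c0, kb, t_end, dt):
    c, t = c0, 0.0
    while t < t_end:
        c += dt * (-kb * c)   # one Euler step of the decay ODE
        t += dt
    return c

c0, kb, t_end = 1.0, 0.05, 10.0   # ppm, 1/h, h (illustrative values)
analytic = c0 * math.exp(-kb * t_end)
numeric = euler_decay(c0, kb, t_end, dt=0.001)
print(round(analytic, 4), round(numeric, 4))  # both close to 0.6065
```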

Furthermore, knowing the effect of the parameters influencing chlorine decay is also important, as this will, in turn, allow the prediction of such decay and, in some cases, the application of corrective measures to reduce its effect.

Equations of different complexity have been used to model the effect of some of these parameters. One of the most influential and studied parameters is temperature, whose effect is usually modelled with Arrhenius's equation [12] or other power models. Liu et al. [13] took pH and temperature into account in their models, differentiating the effect of the HOCl and OCl− species depending on pH. Similarly, Arevalo, in his doctoral dissertation [14], used a model that considered temperature and UV254 as an indicator of organic matter. In this case, two decay constants were used, one related to chlorine decay in bulk water and one related to chlorine decay in water close to the pipe wall.

Hassan et al. [15] studied the specific case of organic matter adsorbed onto goethite, the predominant iron oxide in pipe deposits, in order to assess how the presence of organic matter increased the decay rate. Their main conclusions were: (i) an increase in temperature increases the decay constant and therefore the decay rate, (ii) pH has not been seen to greatly affect decay, (iii) a higher initial chlorine concentration leads to a lower decay rate, (iv) a higher organic matter concentration, generally measured as dissolved organic carbon (DOC), increases the decay rate, (v) an increase in the velocity of the water flow through the pipe increases the decay rate, and (vi) the concentrations of ammonium, nitrites, iron, and manganese also seem to increase the decay rate.

Chlorine decay first-order equations can also be used in software like EPANET, which has sufficient power to simulate and predict the concentration of chlorine in the network. EPANET is a public domain software for WDN modelling developed by the United States Environmental Protection Agency (US EPA). This software can perform transient simulations of hydraulic behavior and water quality in pressurized pipe networks. In order to properly model water quality and its time evolution at consumer points, it is mandatory to have a reliable hydraulic model of the WDN. However, like any simulation software, EPANET depends on the availability and application of continuous data into robust models. The default built-in models have fixed calibration parameters that are not easily extrapolated to most real cases. The approach some authors take to overcome these barriers is to modify and recalibrate the default models included in the EPANET database based on the data of the real system to be modelled [16].

Another important aspect when modelling chemical reactions in pipes is the different behavior in wall and in bulk water. In the literature, bulk reactions are usually considered first-order and wall reactions are considered zero-order. Values for the bulk reaction coefficient are usually obtained using laboratory measurements [17]. Nevertheless, changing water characteristics in the network requires updating the model. There are a few approaches for the quality model calibration. This calibration requires a validated hydraulic model and water quality data. This is often carried out in a well-monitored part of the network and then generalized to the whole network [18,19].

This paper focuses on the chlorine decay process and variables that affect it, the models for concentration prediction, and their application within WDNs in a specific case-study. First, an on-line calibration procedure, with available data from the transport network, is adapted to the full-scale system and performed over a long period so that the evolution of the decay parameters can be studied. This model is used to predict the chlorine concentration in the distribution network and validated with discrete monitoring data. The dependence of chlorine decay on the relevant variables is also studied. Finally, this dependence is compared with the evolution of the parameters estimated using the on-line calibration method. The aim is to illustrate how the intense use of models and available data can provide a better understanding of the behavior of chlorine in a WDN and, thus, be used to support decision-making to improve water quality.

#### **2. Materials and Methods**

#### *2.1. Case Study Network*

The case study in this work is a WDN in Catalonia (Spain) (see location in Figure 1) managed by Aigües de Manresa, who provided the network configuration information and hydraulic and water quality data for a period of 14 months (2017–2018). The water supplied comes from the Llobregat River and goes through a prechlorination step with sodium hypochlorite or chlorine dioxide generated using sodium chlorite and hydrochloric acid (all reagents from Apliclor Water Solutions S.L., Sant Martí Sesgueioles, Spain), depending on the season, a sand filtration process, and a final disinfection step with sodium hypochlorite.

Two parts of the network were used in this study: the transport network and the district metered area (DMA). The transport network (Figure 2) consists of two water storage tanks (T1 and T2) equipped with sensors for chlorine concentration (input of T1 (Cl1) and output of T2 (Cl2)), flowmeters (outflows from the tanks, Q1 and Q2), and water level (H1 and H2). Water flows from T1 to T2 through a 6859 m main. T2 is a boosting station with a known sodium hypochlorite addition. The geometry of the tanks (volume) and pipes (length and diameter) is known. Water from T2 is distributed (through Q2) to the rest of the network, including the DMA.

The DMA corresponds to a residential area. The hydraulic model includes 572 nodes and 610 pipes with a total length of 31 km, providing water to 300 consumers. Water flows by gravity. There are two quality-sampling points where the chlorine concentration is measured weekly. Figure 3 presents the model of this DMA visualized in EPANET. The input tank (corresponding to T2 in Figure 2) and the two sampling points are highlighted (S1 and S2).

**Figure 1.** Study site location, a network in Catalonia supplied by the Llobregat River.

**Figure 2.** Outline of the network section used for the on-line calibration.

**Figure 3.** DMA network model in EPANET.

#### *2.2. On-Line Calibration*

A system that is very well parametrized in terms of hydraulics and chlorine concentration (at least at two points) is required to calibrate the chlorine decay constant of a WDN. The transport network often fulfils this condition, as it does in this study. Therefore, the network used for on-line calibration was the transport network shown in Figure 2. The objective was to find the decay constants for the model that best explained the chlorine concentration measured at the output of T2.

The chosen model was a first-order model. Higher-order models could be used with no fundamental changes in the methodology. Equation (2) shows that the chlorine concentration (*Cl2*) at the outflow of T2 depends on the input chlorine concentration (*Cl1*) and the residence time in the system (*t*). The solution of Equation (1) is as follows:

$$Cl\_2 = Cl\_1 \cdot e^{-K\_b \cdot t} \tag{2}$$

where *Kb* is the decay constant and *α* is taken as 1.

This dependence is defined by the decay constant *Kb*, which was calibrated on-line using the measurements available so that it was adapted throughout the year to the different water characteristics and environmental conditions. Estimations were performed on a weekly basis, since some information was only available at this frequency (chlorine dosing in T2).

The residence time (RT) in T1 was calculated from the available hydraulic information using (3). The weekly mean residence time in tank T1 and the pipe was calculated using the flowmeter data (Q1) and the volume of this subsystem.

$$
\overline{RT}\_1 = \frac{\overline{V}\_1 + V\_{pipe}}{\overline{Q}\_1} \tag{3}
$$

In order to estimate the mean water volume in T1, the level data of the tank (H1) and the geometric information were used. The residence time in T2 was calculated using the mean volume obtained from the level data (H2) and the mean value of the tank effluent (Q2), as shown in (4):

$$
\overline{RT}\_2 = \frac{\overline{V}\_2}{\overline{Q}\_2} \tag{4}
$$
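The residence-time computations in (3) and (4) reduce to ratios of weekly mean volumes and flows. The following sketch uses assumed, illustrative volumes and flows (the real values come from H1, H2, Q1, and Q2):

```python
# Hedged sketch of Equations (3) and (4): weekly mean residence times from
# mean tank volumes and mean flow rates. All numbers are illustrative only.

def mean_residence_time(volume_m3, flow_m3_per_h):
    """RT = V / Q, returned in hours."""
    return volume_m3 / flow_m3_per_h

v1_mean, v_pipe = 900.0, 120.0    # m3: mean volume of T1 and pipe volume (assumed)
q1_mean = 85.0                    # m3/h: weekly mean of flowmeter Q1 (assumed)
rt1 = mean_residence_time(v1_mean + v_pipe, q1_mean)   # Eq. (3)

v2_mean, q2_mean = 600.0, 75.0    # m3 and m3/h (assumed)
rt2 = mean_residence_time(v2_mean, q2_mean)            # Eq. (4)

print(round(rt1, 2), round(rt2, 2))  # -> 12.0 8.0
```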

The chlorine concentration increase due to rechlorination (*Cladded*) was calculated using the added volume of chlorine divided by the mean volume of water treated:

$$
\overline{Cl}\_{added} = \frac{\overline{\Delta V\_{Cl}} \cdot 143}{\int Q\_2 \, dt} \tag{5}
$$

where Δ*VCl* is the volume in liters of the concentrated chlorine added weekly to the network and 143 is the concentration of the added chlorine in g/L (value obtained from the conversion of the 15% NaClO to reactive chlorine, see Section S1 in the Supplementary Material).

Finally, a decay constant *Kb* was calculated which explained the chlorine concentration *Cl*<sup>2</sup> at the outflow of T2, given the residence times calculated above, using (6):

$$
\overline{Cl}\_2 = \overline{Cl}\_1 \cdot e^{-K\_b(\overline{RT}\_1 + \overline{RT}\_2)} + \overline{Cl}\_{added} \cdot e^{-K\_b\overline{RT}\_2} \tag{6}
$$

where *Cl*<sup>2</sup> was considered equal to 0.6 ppm, which is the set point of the chlorine control system in the boosting station. The algorithm for *Kb* calibration is shown in the Supplementary Material (Section S2).
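The weekly calibration step can be sketched as a one-dimensional root-finding problem: choose *Kb* so that the predicted outflow concentration matches the measured *Cl*2. The placement of the residence-time terms below follows the physical layout described above and, together with all numeric values, is an assumption of this sketch:

```python
import math

# Hypothetical sketch of the weekly Kb estimation: find Kb by bisection so
# that the predicted concentration at the T2 outflow matches the measured Cl2.

def predicted_cl2(kb, cl1, cl_added, rt1, rt2):
    # Source chlorine decays over the whole path (RT1 + RT2); the chlorine
    # added at the T2 boosting station is assumed to decay only during its
    # residence in T2.
    return cl1 * math.exp(-kb * (rt1 + rt2)) + cl_added * math.exp(-kb * rt2)

def calibrate_kb(cl2_meas, cl1, cl_added, rt1, rt2, lo=0.0, hi=1.0, tol=1e-9):
    # The predicted Cl2 decreases monotonically as Kb grows, so bisection works.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if predicted_cl2(mid, cl1, cl_added, rt1, rt2) > cl2_meas:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Illustrative weekly means: Cl1 = 0.8 ppm, 0.1 ppm added at T2,
# RT1 = 12 h, RT2 = 8 h, measured Cl2 at the 0.6 ppm set point.
kb = calibrate_kb(0.6, 0.8, 0.1, 12.0, 8.0)
print("Kb [1/h] =", round(kb, 4))
```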

The first-order decay model is the most used in the water industry. Its decay constant lumps together all the dependencies related to environmental and water characteristics; thus, continuously updating this constant guarantees its reliability. The limitation of this methodology is the information required: hydraulic information that allows the determination of the residence time must be available, and multiple chlorine concentration measurements, together with the exact volume of chlorine added between these measurements, are also mandatory.

#### *2.3. Chlorine Decay Calibration and Validation*

The first-order chlorine decay model was validated for the transport network using the available on-line data of the chlorine concentration at the output of T2, considering the residence time in this tank. The data used covered the period from February 2017 to April 2018. The chlorine decay model was also validated for its use in the distribution network in the section where the chlorine concentration is monitored.

Applying the calibrated decay model directly to the distribution network produced poor results. This was expected, due to the difference between the transport network and the distribution network regarding pipe size, materials, age, etc. To adjust the model, the available data period was divided into two sets: one for training the new distribution quality model and the remaining data for validation. There were 35 samples available in S1 and 11 samples in S2. Thus, the first 21 samples in S1 were used for the training and the remaining ones for the validation. The algorithm used for this calibration is shown in the Supplementary Material (Section S3).

#### *2.4. Parametrised Chlorine Decay Model*

The decay constant, determined from the available on-line data, evolved clearly throughout 2017. The question arose of whether this could be due to the effect of available variables such as the temperature, the initial chlorine concentration, the cumulative precipitation, and the turbidity at the drinking water treatment plant.

This suggested the idea of analyzing the variables that influence chlorine decay in order to generate an empirical model based on the available independent variables. The availability of considerable data (temporal and spatial) implies dealing with large amounts of data, multiple variables, and experimental noise, which hinders the direct extraction of valuable information.

Principal component analysis (PCA) is a multivariate statistical technique that describes the data according to their variance [20]. This method linearly transforms the data into new, uncorrelated variables, decreasing the data dimensions. The new data description is more condensed and can reveal patterns that are hard to identify in multivariable datasets. PCA has already been used to determine the physical and chemical parameters influencing chlorine decay [21]. Therefore, being a powerful, reliable, and widely accepted tool for dealing with big data, PCA was selected to extract the main trends, patterns, and correlations among the variables (dimensions) [11].
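To illustrate what the PCA brings out, the following toy sketch (synthetic data and a pure-Python power iteration, rather than a statistics package) standardizes three variables and extracts the leading principal direction; a decay constant driven by temperature loads together with temperature, while an unrelated variable does not:

```python
import math
import random

# Toy PCA sketch: standardize the variables, then find the leading principal
# direction by power iteration on the covariance matrix. Data are synthetic.

def standardize(col):
    m = sum(col) / len(col)
    s = math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1))
    return [(x - m) / s for x in col]

def leading_component(rows, iters=500):
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):           # power iteration
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

random.seed(0)
temp = [5 + 20 * random.random() for _ in range(50)]           # deg C
k_decay = [0.002 * t + 0.005 * random.random() for t in temp]  # driven by T
precip = [random.random() for _ in range(50)]                  # unrelated

cols = [standardize(temp), standardize(k_decay), standardize(precip)]
rows = list(zip(*cols))
v = leading_component(rows)
# Temperature and decay load together on the first component; precipitation does not.
print([round(abs(x), 2) for x in v])
```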

Based on the PCA results, the chlorine decay constant was modelled using the available variables and a multiparametric power model (7).

$$K\_b = K \cdot parameter1^a \cdot parameter2^b \cdot parameter3^c \cdot \dots \tag{7}$$

Specifically, a power model (8) and an Arrhenius model (9) [12] were calibrated using experimental data (temperature in 2017) and the *Kb* obtained from the on-line calibration, using the least-squares error fitting method implemented in the "Solver" function in Excel.

$$K\_b = K\_{power} \cdot T^a \tag{8}$$

$$K\_b = A \cdot \exp\left(-E\_a/RT\right) \tag{9}$$

where *Kpower*, *a*, and *A* are constants, *Ea* is the activation energy (J mol<sup>−1</sup>), *R* is the universal gas constant, and *T* is the temperature. Finally, the parametrized chlorine decay model was compared with that obtained in the on-line calibration to assess its coherence throughout the year.
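Both model forms become linear after a logarithmic transform, so a closed-form least-squares fit can stand in for the Excel Solver step; the sketch below is an assumption of this illustration, generating synthetic data from an Arrhenius law with constants of the same order as those reported later, and recovering them:

```python
import math

# Both calibrations reduce to ordinary least squares after a log transform:
#   power:     ln Kb = ln Kpower + a * ln T
#   Arrhenius: ln Kb = ln A - (Ea/R) * (1/T)
# Synthetic, noise-free data are used here (assumed constants), so the fit
# recovers the generating parameters.

def linfit(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
            sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope

T = [278.0, 283.0, 288.0, 293.0, 298.0]            # kelvin
Kb = [3950.0 * math.exp(-5999.0 / t) for t in T]   # assumed Arrhenius law, 1/s

# Arrhenius fit on (1/T, ln Kb)
lnA, slope = linfit([1.0 / t for t in T], [math.log(k) for k in Kb])
ea_over_r = -slope
print(round(math.exp(lnA)), round(ea_over_r))  # recovers A ~ 3950, Ea/R ~ 5999

# Power fit on (ln T, ln Kb)
lnK, a = linfit([math.log(t) for t in T], [math.log(k) for k in Kb])
```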

#### **3. Results and Discussion**

#### *3.1. On-Line Calibration*

The decay constants *Kb* obtained are presented in Figure 4. In the upper graphic, the weekly evolution between February 2017 and April 2018 can be observed. A different icon was used for the data of each trimester to clearly identify the season of the year. In the lower graphic, the obtained *Kb* values are grouped by month to observe how this parameter evolves throughout the year (some months include estimations from both years). It seems clear that there may be a seasonal variation related to temperature.

**Figure 4.** Up: Mean Kb calculated weekly indicating the season of the year (by trimester). Down: Data of mean Kb grouped by month.

#### *3.2. Chlorine Decay Validation*

The calibrated model was applied to the peak episodes observed at the output of T2 due to the rechlorination and mixing effect. The dataset used in this validation was not used for the estimation; the calibration dataset consisted of the mean values corresponding to the stationary state. Figure 5 shows the chlorine concentration data and the model prediction. It can be observed how this high-frequency dynamic is captured by the model obtained with the mean values. For this prediction, *Kb* evolves weekly.

**Figure 5.** Chlorine decay model validation at the output of T2. Up: data from January 2017. Down: data from March 2017. Due to the high sampling frequency, the data appear as a thick line.

For the distribution network simulation, a relation between the transport *Kb*, estimated by on-line calibration, and the distribution *Kb*\* was obtained by adjusting the concentration in the training set of chlorine samples. This relation was applied to the entire period, and the predicted concentration was compared with the measurements for the validation set of samples. A total of 40 days were simulated. The decay constants for both the bulk and the wall were fitted using the first 21 samples of the chlorine concentration in S1. These are the first samples of the upper graphic in Figure 6.

**Figure 6.** Simulation results for the two sampling points (**S1** and **S2**) compared with experimental samplings.

The result was that both decay constants minimized the error when the original *Kb* obtained in the transport system was divided by 2, as if the calibrated effect were distributed between the two phenomena (K\*b,bulk = K\*b,wall = Kb/2). The results obtained are compared with the available experimental data in Figure 6. The mean absolute percentage error was 16% for S1 (including calibration and validation samples), while it was 17% using only validation samples; therefore, the deviations obtained in the calibration and validation steps were not significantly different. Graphically, the fit may seem poor; however, the concentration is lower in S2 than in S1 both in the prediction and the measurements, and the mean values at both sampling points are coherent between prediction and measurement. One aspect that may justify part of the mismatch is that the exact hour of the day of the manual measurements was not available and, therefore, each experimental data point may not be in its exact position. This difficulty could be overcome with on-line chlorine sensors instead of manual analysis. Figure 7 presents the measured chlorine concentration at the source (T2) and the chlorine prediction at the two sampling points (S1 and S2). The chlorine concentration decreases with the residence time, since the concentration in S2 is lower than in S1, and both are lower than in T2.
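The error metric reported above can be computed directly; the concentrations below are illustrative, not the study's data:

```python
# Mean absolute percentage error (MAPE) between measured and predicted
# chlorine concentrations, as used to assess the fit. Values are illustrative.

def mape(measured, predicted):
    return 100.0 * sum(abs(m - p) / m
                       for m, p in zip(measured, predicted)) / len(measured)

measured = [0.50, 0.55, 0.48, 0.60]    # ppm (illustrative)
predicted = [0.42, 0.50, 0.55, 0.57]   # ppm (illustrative)
print(round(mape(measured, predicted), 1))  # -> 11.2
```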

Finally, Figure 8 shows the network nodes colored by their chlorine concentration: green, black, or red, depending on whether their concentration is too low, acceptable, or too high, respectively. In fact, no red points exist in this area and period. Such a representation is very useful in order for the network operator to make decisions. Nodes in this figure correspond to those in Figure 3 and are presented with north at the top.

**Figure 7.** Measured chlorine concentration at the source (T2) and prediction at the two sampling points (S1 and S2).

**Figure 8.** Distribution of network nodes with low concentration (<0.4 ppm, green), acceptable concentration (0.4 ppm < chlorine < 1 ppm, black), and excessive concentration (>1 ppm, red).
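This three-band coloring rule can be expressed as a small classification helper (thresholds of 0.4 and 1 ppm, as in the figure):

```python
# Node coloring rule used in Figure 8: below 0.4 ppm -> green (too low),
# between 0.4 and 1 ppm -> black (acceptable), above 1 ppm -> red (too high).

def node_color(chlorine_ppm, low=0.4, high=1.0):
    if chlorine_ppm < low:
        return "green"
    if chlorine_ppm > high:
        return "red"
    return "black"

print([node_color(c) for c in (0.2, 0.6, 1.3)])  # -> ['green', 'black', 'red']
```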

Results from the PCA applied to the decay constant (K\_Cl\_decay) determined in the distribution network and other data available (temperature as Tavg\_C, initial chlorine concentration as Initial\_Cl, cumulative precipitation as Cumulative\_Prec, and turbidity at the drinking water treatment plant) are shown in Figure 9.

**Figure 9.** Variables for the two main dimensions of the PCA.

In this case, dimension one was related to the temperature, and dimension two was related to the initial chlorine concentration. Therefore, the decay constant was closely related to the temperature. Figure 10 shows the variables that had the most influence on dimension five, which were again the temperature and the chlorine decay, demonstrating the strong influence of the temperature on the decay constant.

**Figure 10.** Contribution of variables to dimension 5. Variables with contributions below the dotted line are considered not significant for that dimension.

Thus, it was concluded that the variable with the highest effect on the decay constant was temperature. It was observed that the variables turbidity, precipitation, and initial chlorine did not substantially improve the fit between the decay constants of the model and the decay constants obtained. Therefore, only the temperature was used, since the effort required to obtain values for the rest of the parameters did not compensate for the improvement in the model adjustment.

Experimental data from 2017 and the *Kb* obtained with the on-line calibration were used to calibrate the parameters of the two equations, a power model (8) and an Arrhenius model (9), to predict the temperature effect on the decay constant. Equations (10) and (11) show the results obtained.

$$K\_b = 5.477 \cdot 10^{-8} \cdot T^{1.524} \left(\text{s}^{-1}\right) \tag{10}$$

$$K\_b = 3950 \cdot \exp\left(-49873/RT\right) \left(\text{s}^{-1}\right) \tag{11}$$

The fit obtained using the Arrhenius model and the power model were similar, although the Arrhenius one was slightly better. The Arrhenius model also determines the activation energy (J/mol), which is the minimum energy that the system needs for the reaction to take place. The ratio Ea/R obtained in this study was 5999 K, which is in accordance with other authors. For example, Courtis et al. [22] estimated 5388 K and 6701 K for two different water distribution systems, Powell et al. [23] obtained a range between 7500 and 9600 K, and Hua et al. [12] obtained a range between 8203 and 8727 K, depending on the type of water. The variability of the values for this ratio suggests that this is a water-specific parameter that might depend significantly on the natural organic matter composition [24].

Figure 11 shows the fit of the two models to data from 2017, and Figure 12 their forecast for the first days of 2018. In these figures, K Chlorine is the chlorine decay constant determined previously, K power the constant determined following the power model, and K Arrhenius the constant obtained from the Arrhenius model. As can be seen, the Arrhenius model provides good predictions while using only the temperature as an input parameter.

**Figure 11.** Chlorine decay constant fitting to temperature dependent models using data from 2017.

**Figure 12.** Chlorine decay constant forecast for the first days of 2018.

#### **4. Conclusions**

The literature review shows that only the simplest models of chlorine decay are applied to water distribution networks, where the hydraulic behavior is already complex enough. Even so, these models are seldom used due to the lack of proper calibration. In this paper, the performance of a decay model was evaluated when its parameters were calibrated using state-of-the-art techniques.

The calibration was first carried out in the transport network, where the availability of on-line data allowed an on-line calibration. The relation between the decay constants in the transport and distribution networks used analytical data and, thus, could not be established on-line.

The prediction error in the validation data was 17% and quite similar to the error obtained for the training set (16%), which meant that there was no overfitting. The decay constant obtained changed during the year following the assumed dependence of the chlorine decay on the temperature. This result suggested the possibility of using available data for predicting this decay constant.

The principal component analysis determined that temperature was the parameter with the highest effect on the decay of chlorine. The chlorine decay constant was then modelled using temperature as an independent variable. The constants obtained were compared with the data-driven model obtained in the on-line calibration, showing a high correlation. While the dominant dependence on temperature is not a novelty, it is important to confirm this unique dependency, as it guarantees that other characteristics of the water source will not be relevant. This was studied throughout one year, and the results obtained by both models are coherent.

This procedure could also be applied to other quality parameters, such as disinfection byproduct concentrations, which are currently under investigation by the authors. The variables analyzed for chlorine decay estimation are being studied for predicting trihalomethane formation.

The final aim of this study is to increase knowledge within the network in order to enable decision-making processes regarding chlorine dosing (quantity and frequency) both in the disinfection process and the boosting stations, in addition to identifying whether other control systems are required to ensure the continuous good quality of supplied water to the final user at a minimum cost.

**Supplementary Materials:** The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/s22155578/s1, Table S1: Pseudocode to estimate the decay chlorine constant; Table S2: Pseudocode to estimate the chlorine decay constant in the distribution network.

**Author Contributions:** Conceptualization, R.P., R.T. and X.M.-L.; Data curation, R.P., A.M.-T., M.M. and S.G.; Formal analysis, R.P., A.M.-T. and M.M.; Funding acquisition, R.T., X.M.-L. and I.J.; Investigation, A.M.-T., M.M. and L.V.; Methodology, R.P., A.M.-T. and M.M.; Project administration, X.M.-L. and I.J.; Resources, S.G. and R.T.; Software, R.P., A.M.-T. and M.M.; Supervision, R.P., X.M.-L. and I.J.; Validation, R.P. and X.M.-L.; Visualization, R.P., A.M.-T. and L.V.; Writing—original draft, R.P., A.M.-T. and L.V.; Writing—review and editing, R.P., L.V. and I.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available because the owners want to keep track of their use, as they constitute sensitive information from a full-scale system (hydraulic models). Nevertheless, software for data treatment and chlorine decay model estimation and validation including nonsensitive data (model\_estimation.m and model\_validation.m) is available at https://cs2ac.upc.edu/en (18 July 2022) in MATLAB format.

**Conflicts of Interest:** The authors declare no conflict of interest.



## *Article* **Economic Linear Parameter Varying Model Predictive Control of the Aeration System of a Wastewater Treatment Plant †**

**Fatiha Nejjari <sup>1</sup>, Boutrous Khoury <sup>1</sup>, Vicenç Puig <sup>1,2,</sup>\*, Joseba Quevedo <sup>1</sup>, Josep Pascual <sup>1</sup> and Sergi de Campos <sup>3</sup>**


**Abstract:** This work proposes an economic model predictive control (EMPC) strategy in the linear parameter varying (LPV) framework for the control of dissolved oxygen concentrations in the aerated reactors of a wastewater treatment plant (WWTP). A reduced model of the complex nonlinear plant is represented in a quasi-linear parameter varying (qLPV) form to reduce the computational burden, enabling real-time operation. To estimate the time-varying parameters, which are functions of the system states, as well as for feedback control purposes, a moving horizon estimator (MHE) that uses the qLPV WWTP model is proposed. The control strategy is investigated and evaluated on the ASM1 simulation benchmark for performance assessment. The results obtained by applying the EMPC strategy to the control of the aeration system of the WWTP of Girona (Spain) show its effectiveness.

**Keywords:** economic model predictive control; linear parameter varying modelling; wastewater treatment process

#### **1. Introduction**

Biological wastewater treatment plants (WWTPs) are complex nonlinear systems with large variations in their flow rates and feed concentrations. These plants have to be operated continuously taking care of strict environmental regulations. Thus, the use of advanced control strategies becomes necessary to make them more efficient.

The most widely used biological wastewater treatment is the activated sludge process (ASP). In the ASP, microorganisms are mixed with wastewater, and the pollutants of the wastewater constitute the nutrients of the microorganisms. As the organisms feed on the organic pollutants in the wastewater, the pollutants are converted into more organisms, biomass, and some by-products. Following an adequate treatment time, the mixture of microorganisms and wastewater (the mixed liquor) flows from the aeration tank to a clarifier or settler, where the sludge is separated from the treated water. Some of the settled sludge is continuously recirculated from the clarifier to the aeration tank to maintain an adequate amount of microorganisms in this tank. The microorganisms are again mixed with incoming wastewater, where they are reactivated to consume organic nutrients. Five major groups of microorganisms are generally found in the aeration basin of the activated sludge process: (i) aerobic bacteria responsible for removing the organic nutrients, (ii) protozoa to remove and digest dispersed bacteria and suspended particles, (iii) metazoa, which dominate longer-age systems and clarify the effluent, (iv) filamentous bacteria or bulking sludge, which are present when operating conditions change, and (v) algae and fungi, photosynthetic organisms that appear with pH changes and older sludge.

**Citation:** Nejjari, F.; Khoury, B.; Puig, V.; Quevedo, J.; Pascual, J.; de Campos, S. Economic Linear Parameter Varying Model Predictive Control of the Aeration System of a Wastewater Treatment Plant. *Sensors* **2022**, *22*, 6008. https://doi.org/ 10.3390/s22166008

Academic Editor: Assefa M. Melesse

Received: 10 May 2022 Accepted: 9 August 2022 Published: 11 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The majority of the culture is mixed and reused with inlet wastewater to keep high reaction rates and sludge age characteristics. In particular, nitrogen is eliminated as follows: first, ammonium is oxidized to nitrate through nitrification in the aerobic step; the nitrate produced is then converted into nitrogen gas by means of denitrification in the anoxic step. Thus, the control of aeration is very important, because a low amount of dissolved oxygen can cause biomass death, while an excess of dissolved oxygen could cause the sludge to settle insufficiently. Moreover, because aeration accounts for 60% to 80% of the total energy consumption, and hence of the operating costs, of a WWTP [1], excessive aeration is not desirable from the standpoint of economic efficiency.

The models that are usually considered for characterizing the WWTP processes are the ones developed by the International Association on Water Quality (IAWQ) known as Activated Sludge Models (ASMs) [2].

In this paper, the optimal economic operation of the aeration system is considered to improve the efficiency and reliability of an ASP with intermittent aeration, which is used for the removal of nitrogen from domestic wastewater. The objective of the control is to design an aeration strategy (air-on and air-off periods) which minimizes the energy dissipated by the aeration system while adhering to the effluent requirements and the operating constraints. The implementation of optimal operation strategies is therefore of interest because WWTPs face the challenge of treating water properly while minimizing operational costs. This has been the driving force for active research in the development of advanced control techniques and hierarchical control schemes to improve the operation of WWTPs; see, for example, [3,4].

Model predictive control (MPC) has been the most successful advanced control approach applied to WWTPs. This is because MPC controllers handle in a straightforward manner the different operational requirements and the multivariable nature of the control problem (which may even include delays), and directly enforce constraints on the control inputs, system outputs and/or internal states [5]. MPC can also include disturbance prediction, allowing the appropriate control actions to be anticipated (feedforward) to achieve optimal performance according to the criteria defined in the cost function, which can combine different quality criteria and operational costs. The MPC control strategy is adjusted by suitably prioritizing the different objectives in the performance index, which may also include the use of soft constraints. In this way, MPC has become an attractive control strategy for a considerable number of WWTP applications in the last few years. Some examples of MPC control of WWTPs can be found in [6,7]. In [6], a benchmarking of different hierarchical control structures for WWTPs that combines static and dynamic real-time optimization (RTO) and nonlinear model predictive control (NMPC) is presented. In [8], a procedure to find the best controlled variables in an economic sense for the activated sludge process in a wastewater treatment plant, despite the large load disturbances, is introduced.

Classical MPC formulations consider pre-established set points, and the objective functions related to tracking error and control effort have quadratic forms [5]. However, the determination of optimal and reachable reference set points in real time is not an easy task because of the existence of disturbances, set-point changes, time-varying parameters and model uncertainties, among others. This constitutes one of the main limitations of classical MPC. To remedy this issue, real-time optimizers (RTO) or steady-state target optimizers (SSTO) are used to pre-compute the reference set points at a supervisory layer in the control hierarchy. These pre-computed set points are then sent to the lower layer, where a classical MPC acts as a regulatory controller, forcing the process to follow the desired set points. However, even with an RTO, unreachable trajectories may be generated because of unexpected disturbances or set-point variations, among others. Moreover, there is a delay between the layers, because the lower layer receives the reference set points determined by the upper layer before its execution. These problems can be avoided by using economic MPC (EMPC), which optimizes process performance directly (e.g., by means of economic objective functions), eliminating the need to generate reachable reference set points [9]. The first results of the application of EMPC to DO concentration control in WWTPs were presented in a previous work by the authors [10] considering the nonlinear model of the plant. However, this leads to a nonlinear optimization problem.

Alternatively, this paper proposes an EMPC strategy using the linear parameter-varying (LPV) framework to optimize the effluent quality and minimize the operational cost of a WWTP under operating and physical constraints. The objective is to minimize the energy used by the aeration system through the control of the dissolved oxygen (DO) concentrations in the aerated reactors while maintaining the effluent concentrations under the required limits. The proposed approach is based on real-time dynamic optimization methods. MPC optimization with nonlinear models presents a non-convex problem which is computationally demanding, especially when dealing with large-scale plants with complex dynamics such as a WWTP. The LPV framework allows these nonlinearities to be embedded in scheduling variables, which are functions of the system states (i.e., qLPV). This yields a pseudo-linear model, linear in the state space but nonlinear in the parameter space, and leads to a less demanding convex MPC optimization problem, since convex quadratic optimization tools can be applied. The stability and recursive feasibility of MPC with LPV models have been studied (see [11] for a review of recent results). The application of dynamic optimization methods requires a sufficiently accurate mathematical model describing the wastewater treatment process. The present work uses the Activated Sludge Model No. 2 (ASM2) [12]. To illustrate the proposed approach, a WWTP located in Girona (Spain) is considered as a case study.

In Section 2, the WWTP is described and modeled using a reduced ASM2 model, which is then represented in a qLPV form. The proposed EMPC strategy is introduced and described in Section 3, while the proposed MHE approach is presented in Section 4. The results are presented in Section 5, with simulation scenarios obtained from the application of the EMPC strategy on the Girona WWTP. Finally, some conclusions are given in Section 6.

#### **2. WWTP Description and Modeling**

#### *2.1. WWTP Description*

The Girona WWTP is a biological treatment plant designed to treat the wastewater generated by 200,000 inhabitant equivalents, with an average daily inflow of 35,000 m<sup>3</sup>/d. The processes of the plant can be divided into two main treatment lines: water and sludge (see Figure 1). The water line is separated into three phases: pre-treatment, primary treatment and secondary treatment. The secondary treatment is designed to convert biodegradable organic wastewater constituents and certain inorganic fractions into new cell mass and by-products. The plant uses an activated sludge system and has three lines composed of three main reactors that are divided into various compartments, plus three clarifiers. Each line is made up of two anoxic reactors located at the beginning, three aerated tanks, and an anoxic tank followed by an aerated one. With this configuration, the plant can nitrify and denitrify with great efficiency. The anoxic and aerobic tanks have volumes of 1335, 4554, 1929, and 1929 m<sup>3</sup> for the anoxic tanks and 1929, 1276, and 1409 m<sup>3</sup> for the aerobic ones, respectively. Oxygen is supplied to the aerated tanks by the aeration system, which delivers air to each of them. The wastewater and activated sludge are separated in three parallel secondary settlers, each with a volume of approximately 5024 m<sup>3</sup>. The activated sludge is internally recirculated from the last aerobic zone to the anoxic tank (210% of influent waste). Additionally, the wastewater is recirculated from the secondary settlers to the anoxic tank (45 to 100% of influent waste).

Figure 2 shows a standard WWTP technological layout. The wastewater flow enters into the biological part after the mechanical treatment. The nutrient removal takes place in the activated sludge reactor through the biological treatment. The first zone in this treatment is anaerobic, where phosphorus is released. The mixed liquor internal recirculation originates from the anoxic zone. The denitrification occurs in the second zone. The activated sludge returned from the clarifiers bottom, and the internal recirculation from the aerobic zones end is directed toward the anoxic zone.

#### *2.2. WWTP Modeling*

The Benchmark Simulation Model (BSM1), developed within the framework of COST Actions 624 and 682 [2], has been adapted to represent the Girona WWTP (see Figure 1).

**Figure 1.** Girona Wastewater Treatment Plant.

The Activated Sludge Model No. 1 (ASM1) describes the biological phenomena that takes place in the biological reactors, and it is supposed that no biological reactions take place in the settlers. Due to the complexity of the nonlinear model describing the different complex processes in the plant, various reduced models have been proposed in the literature [13–15] to aid in the online implementation of certain modern control schemes (e.g., MPC), which would have otherwise presented ill-conditioned or stiff numerical problems due to slow and fast dynamic interactions. In [15,16], one can see some successful implementations using reduced WWTP models in various areas of control applied to WWTPs. The reduced model as suggested in [14], which primarily involves certain simplification criteria for a reduced order of the rigorous high dimensional WWTP model, has been adapted to conditions representing the Girona WWTP. This basically involves the derivation of the reactor model based on mass balances of the wastewater species, which are generally expressed as follows:

$$\text{Accumulation} = \text{Inflow} - \text{Outflow} + \text{Reaction}$$

Validation of the reduced model considering data from the ASM1 and the reduced model has been undertaken in [14]. In simplifying the complex model, a systematic reduction process of the high-dimensional model considers some assumptions, with the principal conditions given as follows:


Under these conditions, the resulting state variables of the reduced model are the chemical oxygen demand (*XCOD*), the dissolved oxygen concentration (*SO*), the heterotrophic biomass (*XBH*), the ammonia concentration (*SNH*), the nitrate concentration (*SNO*) and the autotrophic biomass (*XBA*). The oxygen concentration (*SO*) in the aerobic tanks is controlled via the manipulation of the control input, the oxygen transfer coefficient *KLa*(*t*).

The states and input vectors are thus given as:

$$x(t) = \left[ X_{COD}(t),\ S_O(t),\ X_{BH}(t),\ S_{NH}(t),\ S_{NO}(t),\ X_{BA}(t) \right]^T$$

$$u(t) = K_{La}(t)$$

The WWTP process is therefore described by the following dynamic equations of the reduced model:

$$\dot{X}_{COD}(t) = -\frac{1}{Y_h}\left[\theta_1(t) + \theta_2(t)\right] + \left(1 - f_p\right)\left(\theta_4(t) + \theta_5(t)\right) + \vartheta_1(t), \tag{1}$$

$$\dot{S}_{O}(t) = \frac{Y_h - 1}{Y_h}\,\theta_1(t) + \frac{Y_a - 4.57}{Y_a}\,\theta_3(t) + \vartheta_2(t), \tag{2}$$

$$\dot{S}_{NH}(t) = -i_{xb}\left[\theta_1(t) + \theta_2(t)\right] - \left[i_{xb} + \frac{1}{Y_a}\right]\theta_3(t) + \left(i_{xb} - f_p i_{xp}\right)\left[\theta_4(t) + \theta_5(t)\right] + \vartheta_3(t), \tag{3}$$

$$\dot{S}_{NO}(t) = \frac{Y_h - 1}{2.86\,Y_h}\,\theta_2(t) + \frac{1}{Y_a}\,\theta_3(t) + \vartheta_4(t), \tag{4}$$

$$\dot{X}_{BH}(t) = \theta_1(t) + \theta_2(t) - \theta_4(t) + \vartheta_5(t), \tag{5}$$

$$\dot{X}_{BA}(t) = \theta_3(t) - \theta_5(t) + \vartheta_6(t). \tag{6}$$

where

$$\begin{aligned} \theta_1(t) &= \mu_h \frac{X_{COD}(t)}{K_{COD} + X_{COD}(t)} \frac{S_{O}(t)}{K_{OH} + S_{O}(t)}\, X_{BH}(t) \\ \theta_2(t) &= \mu_h \eta_{NO} \frac{X_{COD}(t)}{K_{COD} + X_{COD}(t)} \frac{S_{NO}(t)}{K_{NO} + S_{NO}(t)} \frac{K_{OH}}{K_{OH} + S_{O}(t)}\, X_{BH}(t) \\ \theta_3(t) &= \mu_a \frac{S_{NH}(t)}{K_{NH,A} + S_{NH}(t)} \frac{S_{O}(t)}{K_{O,A} + S_{O}(t)}\, X_{BA}(t) \\ \theta_4(t) &= b_H X_{BH}(t) \\ \theta_5(t) &= b_A X_{BA}(t) \end{aligned}$$

Here, *Qin*(*t*) is the inflow rate, *Vo* is the volume of the aerobic tank, and *SOin*(*t*), *SNOin*(*t*) and *XBAin*(*t*) are considered equal to zero. The transport terms *ϑ*1(*t*), *ϑ*2(*t*), ··· , *ϑ*6(*t*) are given as follows:

$$\begin{aligned} \vartheta_{1}(t) &= \frac{Q_{in}(t)}{V_{o}}\left[X_{COD_{in}}(t) - X_{COD}(t)\right] \\ \vartheta_{2}(t) &= \frac{Q_{in}(t)}{V_{o}}\left[-S_{O}(t)\right] + K_{La}(t)\left[S_{O_{sat}} - S_{O}(t)\right] \\ \vartheta_{3}(t) &= \frac{Q_{in}(t)}{V_{o}}\left[S_{NH_{in}}(t) - S_{NH}(t)\right] \\ \vartheta_{4}(t) &= \frac{Q_{in}(t)}{V_{o}}\left[-S_{NO}(t)\right] \\ \vartheta_{5}(t) &= \frac{Q_{in}(t)}{V_{o}}\left[X_{BH_{in}}(t) - \frac{f_{w}(1+f_{r})}{f_{r}+f_{w}}\,X_{BH}(t)\right] \\ \vartheta_{6}(t) &= \frac{Q_{in}(t)}{V_{o}}\left[X_{BA_{in}}(t) - \frac{f_{w}(1+f_{r})}{f_{r}+f_{w}}\,X_{BA}(t)\right] \end{aligned}$$

where *YH*, *YA*, *fr*, *fw*, *bH*, *bA*, *ixb*, *ixp* and *fp* are the stoichiometric parameters, and *μh*, *ηNO*, *KCOD*, *KOH*, *KNO*, *μa*, *KNH,A* and *KO,A* are the kinetic parameters.
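The structure of the reduced model (1)–(6) can be sketched in code. The following is a minimal forward simulation with an explicit Euler scheme; all parameter values, the inflow composition, and the names `rhs` and `simulate` are illustrative placeholders, not the calibrated Girona WWTP values:

```python
# Sketch: forward simulation of the reduced activated-sludge model.
# Illustrative (assumed) stoichiometric/kinetic parameters, per-day units.
P = dict(Y_h=0.67, Y_a=0.24, mu_h=4.0, mu_a=0.5, K_COD=10.0, K_OH=0.2,
         K_NO=0.5, K_NH=1.0, K_OA=0.4, eta_NO=0.8, b_h=0.3, b_a=0.05,
         i_xb=0.086, i_xp=0.06, f_p=0.08, f_r=1.0, f_w=0.04,
         V_o=1929.0, S_Osat=8.0)

def rhs(x, u, w, p):
    """Right-hand side of the reduced model.
    x = [X_COD, S_O, X_BH, S_NH, S_NO, X_BA], u = K_La,
    w = (Q_in, X_COD_in, S_NH_in, X_BH_in)."""
    X_COD, S_O, X_BH, S_NH, S_NO, X_BA = x
    Q_in, X_COD_in, S_NH_in, X_BH_in = w
    # Kinetic rates theta_1 ... theta_5
    th1 = p['mu_h'] * X_COD/(p['K_COD']+X_COD) * S_O/(p['K_OH']+S_O) * X_BH
    th2 = (p['mu_h']*p['eta_NO'] * X_COD/(p['K_COD']+X_COD)
           * S_NO/(p['K_NO']+S_NO) * p['K_OH']/(p['K_OH']+S_O) * X_BH)
    th3 = p['mu_a'] * S_NH/(p['K_NH']+S_NH) * S_O/(p['K_OA']+S_O) * X_BA
    th4 = p['b_h'] * X_BH
    th5 = p['b_a'] * X_BA
    # Transport terms vartheta_1 ... vartheta_6
    d = Q_in / p['V_o']                          # dilution rate
    purge = p['f_w']*(1+p['f_r'])/(p['f_r']+p['f_w'])
    v1 = d*(X_COD_in - X_COD)
    v2 = -d*S_O + u*(p['S_Osat'] - S_O)          # aeration enters via K_La
    v3 = d*(S_NH_in - S_NH)
    v4 = -d*S_NO
    v5 = d*(X_BH_in - purge*X_BH)
    v6 = -d*purge*X_BA                           # X_BA_in assumed zero
    return [
        -(th1+th2)/p['Y_h'] + (1-p['f_p'])*(th4+th5) + v1,              # X_COD
        (p['Y_h']-1)/p['Y_h']*th1 + (p['Y_a']-4.57)/p['Y_a']*th3 + v2,  # S_O
        th1 + th2 - th4 + v5,                                           # X_BH
        (-p['i_xb']*(th1+th2) - (p['i_xb']+1/p['Y_a'])*th3
         + (p['i_xb']-p['f_p']*p['i_xp'])*(th4+th5) + v3),              # S_NH
        (p['Y_h']-1)/(2.86*p['Y_h'])*th2 + th3/p['Y_a'] + v4,           # S_NO
        th3 - th5 + v6,                                                 # X_BA
    ]

def simulate(x0, u, w, p, dt=1e-3, steps=1000):
    """Explicit Euler with a crude nonnegativity clamp on concentrations."""
    x = list(x0)
    for _ in range(steps):
        f = rhs(x, u, w, p)
        x = [max(xi + dt*fi, 0.0) for xi, fi in zip(x, f)]
    return x
```

In a real study the calibrated parameters and a stiffness-aware integrator would be used instead of the fixed-step Euler scheme.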

**Figure 2.** Layout of Girona WWTP.

#### *2.3. LPV Representation of the WWTP*

To reduce the computational burden, the nonlinear reduced model is represented in an LPV form, which involves embedding the nonlinearities in varying parameters, resulting in a representation that is linear in the state. Applied to MPC, this procedure offers benefits over its nonlinear MPC [15] and linear MPC [17] counterparts for the WWTP: a faster run time and the avoidance of numerical problems with respect to the former, and the ability to operate over a wide range of operating points with respect to the latter. The nonlinear model is in this case approximated by linear systems at each time instant based on some time-varying parameters $\sigma(t) \in \mathbb{R}^{n_\sigma}$, under the assumption that the parameters $\sigma(t)$ are not known a priori but can be measured or estimated online [18]. The dynamic behavior of the LPV model is therefore described as:

$$
\dot{\mathbf{x}}(t) = A(\sigma(t))\mathbf{x}(t) + B(\sigma(t))\boldsymbol{u}(t) \tag{7}
$$

$$y(t) = C(\sigma(t))x(t) + D(\sigma(t))u(t) \tag{8}$$

where $x(t) \in \mathbb{R}^{n_x}$ and $u(t) \in \mathbb{R}^{n_u}$ are the states and inputs, respectively, and $y(t) \in \mathbb{R}^{n_y}$ are the measured signals. $A(\sigma(t))$, $B(\sigma(t))$, $C(\sigma(t))$ and $D(\sigma(t))$ are time-varying matrices of appropriate dimensions that are affine in $\sigma(t) \in \mathbb{R}^{n_\sigma}$. In the quasi-LPV case, the scheduling parameters depend on measured signals $y_s(t) \in \mathbb{R}^{k}$, a subvector of $y(t) \in \mathbb{R}^{n_y}$, such that

$$
\sigma(t) = f(y\_s(t))
$$

where $f : \mathbb{R}^{k} \to \mathbb{R}^{n_\sigma}$ is a continuous mapping [19]. With observed states and exogenous inputs $w(t)$, nonlinearities involving the system states can be "hidden" in the varying parameters $\sigma(t, y_s(t), w(t))$.
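The "hiding" of state nonlinearities in scheduling variables can be illustrated on a toy system (not the WWTP model, which is only an assumption-free stand-in here): the nonlinear term is absorbed into a parameter $\sigma$ computed from the current state, after which the model is handled as if it were linear. When $\sigma$ is evaluated at the current state, the frozen-parameter step reproduces the nonlinear step exactly:

```python
# Toy quasi-LPV embedding: the nonlinear system
#   x1' = -x1/(1 + x1^2) + u,   x2' = x1 - x2
# is rewritten as x' = A(sigma) x + B u with sigma = 1/(1 + x1^2),
# mirroring Eqs. (7)-(8). The example system is illustrative only.

def A_of_sigma(sigma):
    # A is affine in the scheduling parameter sigma
    return [[-sigma, 0.0],
            [1.0, -1.0]]

B = [1.0, 0.0]

def qlpv_step(x, u, dt=0.01):
    """One Euler step of the frozen-parameter (quasi-LPV) model."""
    sigma = 1.0 / (1.0 + x[0]**2)      # scheduling parameter from the state
    A = A_of_sigma(sigma)
    dx = [A[i][0]*x[0] + A[i][1]*x[1] + B[i]*u for i in range(2)]
    return [x[i] + dt*dx[i] for i in range(2)]

def nonlinear_step(x, u, dt=0.01):
    """One Euler step of the original nonlinear model (for comparison)."""
    dx = [-x[0]/(1.0 + x[0]**2) + u, x[0] - x[1]]
    return [x[i] + dt*dx[i] for i in range(2)]
```

Over a prediction horizon, the LPV MPC freezes $\sigma$ at its last measured or estimated value, which is where the (small) approximation with respect to the nonlinear model enters.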

Therefore, from the generic nonlinear form

$$\begin{aligned} \dot{x}(t) &= f(x(t), u(t), w(t)) \\ y(t) &= g(x(t), u(t)) \end{aligned} \tag{9}$$

a linear quadruple estimate $(A(\sigma(t, y_s(t), w(t)))$, $B(\sigma(t, y_s(t), w(t)))$, $C(\sigma(t, y_s(t), w(t)))$, $D(\sigma(t, y_s(t), w(t))))$ is formulated and incorporated into the EMPC, yielding a convex optimization problem. In the following, the function $\sigma(t, y_s(t), w(t))$ will simply be written as $\sigma(t)$. The scheduling parameters, chosen according to the origin of the nonlinearities in the reduced model (1)–(6), are

$$\begin{aligned} \sigma_1(t) &= Q_{in}(t), \qquad \sigma_2(t) = \frac{X_{COD}(t)}{K_{COD} + X_{COD}(t)}\,\frac{S_O(t)}{K_{OH} + S_O(t)},\\ \sigma_3(t) &= \frac{X_{COD}(t)}{K_{COD} + X_{COD}(t)}\,\frac{S_{NO}(t)}{K_{NO} + S_{NO}(t)}\,\frac{K_{OH}}{K_{OH} + S_O(t)},\\ \sigma_4(t) &= \frac{1}{K_{O,A} + S_O(t)}\,\frac{S_{NH}(t)}{K_{NH,A} + S_{NH}(t)}\,X_{BA}(t), \qquad \sigma_5(t) = S_O(t). \end{aligned}$$

The dynamic LPV model is thus given as:

$$
\dot{\mathbf{x}} = A(\sigma(t))\mathbf{x}(t) + B(\sigma(t))u(t) + Ew(t). \tag{10}
$$

with the time-varying matrices $A(\sigma(t))$ and $B(\sigma(t))$ and the time-invariant disturbance matrix $E$ given below (in this representation, the state is ordered as $x = [X_{COD}, S_O, S_{NH}, S_{NO}, X_{BH}, X_{BA}]^T$):

$$A(\sigma(t)) = \begin{bmatrix} a\_{11}(t) & 0 & 0 & 0 & a\_{15}(t) & a\_{16} \\ 0 & a\_{22}(t) & 0 & 0 & a\_{25}(t) & 0 \\ 0 & a\_{32}(t) & 0 & 0 & a\_{35}(t) & a\_{36} \\ 0 & a\_{42}(t) & 0 & 0 & a\_{45}(t) & 0 \\ 0 & 0 & 0 & 0 & a\_{55}(t) & 0 \\ 0 & a\_{62}(t) & 0 & 0 & 0 & a\_{66}(t) \end{bmatrix},$$

$$B(\sigma(t)) = \begin{bmatrix} 0 \\ b_{12}(t) \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad\text{and}\quad E = \frac{1}{V_o}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},$$

where

$$\begin{aligned} a_{11}(t) &= -\frac{\sigma_1(t)}{V_o}, \qquad a_{15}(t) = -\frac{\mu_h}{Y_h}\sigma_2(t) + (1-f_p)b_h - \frac{\mu_h \eta_{NO}}{Y_h}\sigma_3(t), \qquad a_{16} = (1-f_p)b_a, \\ a_{22}(t) &= -\frac{\sigma_1(t)}{V_o} - \frac{4.57 - Y_a}{Y_a}\,\mu_a\sigma_4(t), \qquad a_{25}(t) = \frac{Y_h - 1}{Y_h}\,\mu_h\sigma_2(t), \\ a_{32}(t) &= -\left(i_{xb} + \frac{1}{Y_a}\right)\mu_a\sigma_4(t), \qquad a_{35}(t) = \left(i_{xb} - f_p i_{xp}\right)b_h - i_{xb}\mu_h\sigma_2(t) - i_{xb}\mu_h\eta_{NO}\sigma_3(t), \\ a_{36} &= \left(i_{xb} - f_p i_{xp}\right)b_a, \qquad a_{42}(t) = \frac{1}{Y_a}\,\mu_a\sigma_4(t), \qquad a_{45}(t) = \frac{Y_h - 1}{2.86\,Y_h}\,\mu_h\eta_{NO}\sigma_3(t), \\ a_{55}(t) &= \mu_h\sigma_2(t) + \mu_h\eta_{NO}\sigma_3(t) - b_h - \frac{\sigma_1(t)}{V_o}\,\frac{f_w(1+f_r)}{f_r+f_w}, \qquad a_{62}(t) = \mu_a\sigma_4(t), \\ a_{66}(t) &= -\frac{\sigma_1(t)}{V_o}\,\frac{f_w(1+f_r)}{f_r+f_w} - b_a, \qquad b_{12}(t) = S_{O_{sat}} - \sigma_5(t). \end{aligned}$$

The input concentrations are

$$w(t) = \begin{bmatrix} Q_{in}(t)X_{COD_{in}}(t) & Q_{in}(t)S_{NH_{in}}(t) & Q_{in}(t)X_{BH_{in}}(t) \end{bmatrix}^T.$$

**Remark 1.** *In this work, it is assumed that all the concentrations are measured online, but it must be noted that in practice not all of them (e.g., XCODin) can be measured.*

#### **3. EMPC of a WWTP**

#### *3.1. Operational Goals*

The immediate control goal of a WWTP is to meet the water quality levels established by regulators while operating efficiently at a reduced operational cost. As discussed in the introduction, predictive control techniques may be used to compute strategies which achieve this goal while at the same time optimizing the system performance in terms of different operational indices. To achieve this objective, the control of the dissolved oxygen concentration, as well as of the nitrates, within certain limits is necessary. MPC has the advantage of being a non-conservative control strategy: in periods of low influent load, with a minimal level of pollutants, the effluent quality can be achieved by regulating the levels of *SO* and *SNO* below the stipulated reference point to avoid wasting energy, whereas during periods of high influent load it is important to meet the predefined set points to reduce pollutants, avoiding the violation of the effluent quality standard set by the authorities [15]. In this work, a PI-EMPC control strategy is employed: the PI controller designed by the authors of the BSM1 for the regulation of *SNO*, and a designed EMPC for the control of *SO* in the aeration tank. In the proposed LPV EMPC, the following objectives are then considered:

• **Economic costs.** The main economic costs of a WWTP are primarily treatment and electricity costs. Moving water through the WWTP involves significant electricity costs in the pumping stations in charge of the internal and external water recirculations, as well as in the aeration of the aerobic tanks. In our case, only the aeration energy is considered, with the objective of minimizing the cost associated with the supply of oxygen for controlled culture growth. The performance index is described as follows

$$J_{eco}(k) = \frac{S_{O_{sat}}}{1800}\, V_o\, K_{La}(k) \left[\frac{\text{kWh}}{\text{day}}\right]. \tag{11}$$

• **DO concentration control.** In order to keep *SO* within some bounds during the aeration process, slack variables are introduced in the EMPC optimization problem to penalize the dissolved oxygen states so that they are maintained in a range that preserves the effluent quality. Selecting slack variables (*λ*<sup>+</sup> > 0 and *λ*<sup>−</sup> > 0), additional soft-constraint terms (see (16c) and (16d)) and a quadratic objective index are introduced, with *xsp* as the selected DO concentration value. The introduction of the slack variables ensures that the DO concentration varies within a band around *xsp*, aided by the appropriate selection of weights in the objective function. The performance index is thus given as

$$J\_{\lambda}(k) = \|\lambda(k)\|\_{2'}^{2} \tag{12}$$

where $\lambda(k) = (\lambda^-(k), \lambda^+(k))^T$.

• **Smooth set points for equipment conservation.** The operation of the WWTP's main valves and pumps usually requires smooth flow set-point variations. To obtain such a smoothing effect, the proposed MPC controller includes a third term in the objective function to penalize the control signal variation between consecutive time intervals. This term is expressed as

$$J\_{smo}(k) = \Delta u(k)^T \mathcal{W}\_u \Delta u(k). \tag{13}$$

Therefore, the performance function *J* considering the aforementioned control objectives has the form

$$J = w_1 \sum_{k=0}^{H_p - 1} J_{eco}(k) + w_2 \sum_{k=0}^{H_p - 1} J_{smo}(k) + w_3 \sum_{k=1}^{H_p} J_\lambda(k). \tag{14}$$
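A minimal sketch of how the stage costs (11)–(13) combine into (14) follows; the tank volume, saturation value, weights and the candidate trajectories below are illustrative assumptions, not the tuned controller settings:

```python
# Sketch of the EMPC stage costs and their weighted sum over the horizon.
# V_o, S_Osat, the weights w and W_u are illustrative placeholders.

V_o, S_Osat = 1929.0, 8.0

def J_eco(KLa):                        # aeration energy cost, cf. Eq. (11)
    return S_Osat / 1800.0 * V_o * KLa

def J_smo(du, W_u=1.0):                # control-move penalty, cf. Eq. (13)
    return W_u * du * du

def J_lam(lam_minus, lam_plus):        # slack (DO-band) penalty, cf. Eq. (12)
    return lam_minus**2 + lam_plus**2

def total_cost(KLa_seq, lam_seq, w=(1.0, 0.1, 100.0), u_prev=0.0):
    """Weighted sum of the three terms over the horizon, cf. Eq. (14)."""
    J = 0.0
    u_last = u_prev
    for KLa in KLa_seq:                # economic + smoothing terms
        J += w[0]*J_eco(KLa) + w[1]*J_smo(KLa - u_last)
        u_last = KLa
    for lm, lp in lam_seq:             # slack terms, one pair per step
        J += w[2]*J_lam(lm, lp)
    return J
```

The weights $w_1, w_2, w_3$ trade off energy savings against DO-band violations and actuator wear, exactly as in (14).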

#### *3.2. Control Strategy Computation*

The control strategy is determined by computing an optimal sequence of control actions over a prediction horizon *Hp*:

$$u_k = \left(u(k|j)\right)_{j=0}^{H_p - 1} = \left(u(k|0), u(k|1), \dots, u(k|H_p - 1)\right). \tag{15}$$

At each time instant *k*, we solve the following optimal control problem, with the initial state obtained from measurements (or state estimation) of the WWTP dynamics and with predictions in the MPC loop given by the qLPV plant model (10):

$$\min\_{\vec{u}\_k} J(\vec{u}\_k, k) \tag{16a}$$

subject to

$$\mathbf{x}(i+1|k) = A(\sigma(k))\mathbf{x}(i|k) + B(\sigma(k))\mathbf{u}(i|k) + E\mathbf{w}(i|k) \quad i = 0, \dots, H\_p - 1,\tag{16b}$$

$$x_{S_O}(i|k) \geq x_{sp} - \lambda^-(i|k), \qquad i = 1, \cdots, H_p, \tag{16c}$$

$$x_{S_O}(i|k) \leq x_{sp} + \lambda^+(i|k), \qquad i = 1, \cdots, H_p, \tag{16d}$$

$$u(i|k) \in \mathcal{U}, \qquad i = 0, \cdots, H_p - 1, \tag{16e}$$

$$x(i|k) \in \mathcal{X}, \qquad i = 1, \cdots, H_p, \tag{16f}$$

$$y(i|k) \in \mathcal{Y}, \qquad i = 0, \cdots, H_p, \tag{16g}$$

$$\lambda^+(i|k),\ \lambda^-(i|k) \geq 0, \tag{16h}$$

where $x_{S_O}$ is the dynamic state representing the dissolved oxygen, and (16e)–(16g) are described by the box constraints:

$$\begin{aligned} \mathcal{U} &= \left\{ u \in \mathbb{R}^{n_u} \,|\, u^{\min} \le u \le u^{\max} \right\}, \\ \mathcal{X} &= \left\{ x \in \mathbb{R}^{n_x} \,|\, x^{\min} \le x \le x^{\max} \right\}, \\ \mathcal{Y} &= \left\{ y \in \mathbb{R}^{n_y} \,|\, y^{\min} \le y \le y^{\max} \right\}, \end{aligned} \tag{17}$$

which are determined from the maximum residual concentrations imposed in order to comply with the European Union effluent standards on the chemical oxygen demand (*COD*), the suspended solids (*SS*) and the total nitrogen (*TN*):

$$COD \leq COD_{\max} = 125\ \mathrm{g\,m^{-3}},$$

$$S_S \leq S_{S,\max} = 35\ \mathrm{g\,m^{-3}},$$

$$T_N \leq T_{N,\max} = 10\ \mathrm{g\,m^{-3}}.$$

The first control action of the sequence, *u*(*k*|0), is applied to the WWTP plant to obtain the system measurements and/or the MHE-estimated states, which are then used in the succeeding optimization problem, resulting in a receding-horizon procedure. As stated earlier, not all the state variables are measured; the moving horizon estimator (MHE), which is the dual of the MPC controller, estimates the unmeasured states.
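The receding-horizon procedure can be sketched as follows. For illustration only, a toy one-state DO model and a crude grid search over candidate aeration levels replace the convex QP of (16a)–(16h); the model, candidate set, set point and weights are all assumptions:

```python
# Minimal receding-horizon sketch of the EMPC loop: at each sample,
# score candidate aeration levels on a toy one-state DO model over the
# horizon, apply only the best first action, and repeat.

def do_model(S_O, KLa, S_uptake=2.0, S_Osat=8.0, dt=0.1):
    """Toy DO dynamics: aeration transfer minus oxygen uptake."""
    return S_O + dt * (KLa * (S_Osat - S_O) - S_uptake * S_O)

def stage_cost(S_O, KLa, x_sp=2.0, w_eco=1.0, w_slack=50.0):
    """Economic term plus a soft penalty below the DO set point."""
    slack = max(0.0, x_sp - S_O)
    return w_eco * KLa + w_slack * slack**2

def empc_step(S_O, candidates=(0.0, 0.5, 1.0, 2.0, 4.0), Hp=5):
    """Pick the constant aeration level with the lowest horizon cost."""
    best_u, best_J = candidates[0], float('inf')
    for u in candidates:               # crude search instead of a QP solver
        x, J = S_O, 0.0
        for _ in range(Hp):            # predict over the horizon
            x = do_model(x, u)
            J += stage_cost(x, u)
        if J < best_J:
            best_u, best_J = u, J
    return best_u                      # only the first action is applied

def closed_loop(S_O0=0.5, steps=20):
    """Receding-horizon loop: optimize, apply first action, re-measure."""
    S_O, traj = S_O0, []
    for _ in range(steps):
        u = empc_step(S_O)
        S_O = do_model(S_O, u)         # plant update (same toy model here)
        traj.append((S_O, u))
    return traj
```

In the paper's formulation, `empc_step` corresponds to solving the convex qLPV problem (16a)–(16h), and the plant update is replaced by the real WWTP measurements and the MHE state estimates.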

#### **4. Moving Horizon Estimation**

Since some states cannot be measured online during the operation of the WWTP, an estimator, in our case the MHE, is necessary to reconstruct them, bearing in mind that, apart from feedback control purposes, the quasi-LPV formulation relies on information about the system states for the model construction. By solving a constrained optimization problem, the MHE uses a limited window of *N* past measurements, together with the system model over that window, to estimate the system states through an error minimization scheme. The optimization problem is therefore set up with the discretized plant model as:

$$\begin{aligned} \min_{\{\hat{x}(i|k)\}_{i=-N}^{0}} \quad & \left(\hat{x}(-N|k)-x_{o}\right)^{T} P_{o}\left(\hat{x}(-N|k)-x_{o}\right) + \sum_{i=-N}^{-1}\left(\epsilon(i|k)^{T}Q\,\epsilon(i|k) + s(i|k)^{T}R\,s(i|k)\right) \\ \text{s.t.} \quad & \hat{x}(i+1|k) = A(\sigma(i|k))\hat{x}(i|k) + B(\sigma(i|k))u(i|k) + Ew(i|k) + \epsilon(i|k), \quad i = -N, \cdots, -1, \\ & y(i|k) = C\hat{x}(i|k) + s(i|k), \\ & \hat{x}(i|k) \in \mathcal{X}, \end{aligned} \tag{18}$$

where $R = R^T \in \mathbb{R}^{n_y \times n_y} > 0$, $Q = Q^T \in \mathbb{R}^{n_x \times n_x} \geq 0$ and $P_o = P_o^T \in \mathbb{R}^{n_x \times n_x} \geq 0$ are the weighting matrices, defined according to the uncertainty levels induced respectively by the noise, the disturbance and the unknown initial condition ($x_o$). $\mathcal{X}$ bounds the estimated states. At every iteration, the $N$ past control inputs $\{u(i|k)\}_{i=-N}^{-1} \in \mathbb{R}^{n_u \times N}$, measurements $\{y(i|k)\}_{i=-N}^{-1} \in \mathbb{R}^{n_y \times N}$ and $N$ sets of LPV matrices $\{A_i\}_{i=-N}^{-1} \in \mathbb{R}^{(n_x \times n_x)N}$, $\{B_i\}_{i=-N}^{-1} \in \mathbb{R}^{(n_x \times n_u)N}$ are taken as inputs into the optimization problem to predict the state sequence $\{\hat{x}(i|k)\}_{i=-N}^{0} \in \mathbb{R}^{n_x \times (N+1)}$ by solving the dynamical optimization problem (18). The last element of the sequence $\{\hat{x}(i|k)\}_{i=-N}^{0}$ is subsequently chosen as the state estimate, the oldest measurements and inputs are discarded, and the procedure is repeated. The ammonia concentration ($S_{NH}$), nitrate concentration ($S_{NO}$) and dissolved oxygen ($S_O$) are assumed measurable; therefore, the MHE is designed for the estimation of $[X_{COD}, X_{BH}, X_{BA}]^T$, as shown in Figures 3–5.
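To make the window update above concrete, the following minimal sketch (not the authors' implementation; all names, dimensions and weights are illustrative assumptions) solves one MHE window of a linear time-varying model as a batch weighted least-squares problem, which is what (18) reduces to when the inequality constraints are dropped:

```python
import numpy as np

def mhe_window(A_seq, B_seq, C, u_seq, y_seq, x0_bar, P0, Q, R):
    """Estimate the state sequence x_{-N}..x_0 of one MHE window.

    Minimises the arrival cost ||x_{-N} - x0_bar||^2_{P0} plus the weighted
    process and measurement residuals of (18), without the inequality
    constraints, so the problem reduces to linear least squares.
    """
    N, nx, ny = len(A_seq), A_seq[0].shape[0], C.shape[0]
    n = (N + 1) * nx                       # stacked decision variables

    def sqrtw(M):                          # W with W.T @ W = M
        return np.linalg.cholesky(M + 1e-12 * np.eye(M.shape[0])).T

    rows, rhs = [], []
    Wp, Wq, Wr = sqrtw(P0), sqrtw(Q), sqrtw(R)
    # Arrival cost pins down the first state of the window.
    blk = np.zeros((nx, n)); blk[:, :nx] = np.eye(nx)
    rows.append(Wp @ blk); rhs.append(Wp @ x0_bar)
    for i in range(N):
        # Process residual: x_{i+1} - A_i x_i = B_i u_i + w_i.
        blk = np.zeros((nx, n))
        blk[:, i * nx:(i + 1) * nx] = -A_seq[i]
        blk[:, (i + 1) * nx:(i + 2) * nx] = np.eye(nx)
        rows.append(Wq @ blk); rhs.append(Wq @ (B_seq[i] @ u_seq[i]))
        # Measurement residual: C x_i = y_i - v_i.
        blk = np.zeros((ny, n))
        blk[:, i * nx:(i + 1) * nx] = C
        rows.append(Wr @ blk); rhs.append(Wr @ y_seq[i])
    sol, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return sol.reshape(N + 1, nx)          # x_{-N}, ..., x_0; keep the last
```

At each sample, the last row of the returned sequence would be used as the current state estimate, the oldest data discarded and the window shifted, mirroring the recursive procedure described above.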

**Figure 3.** MHE estimate of oxygen demand concentration (*XCOD*) for 7 days.

**Figure 4.** MHE estimate of heterotrophic biomass (*XBH*) for 7 days.

**Figure 5.** MHE estimate of autotrophic biomass (*XBA*) for 7 days.

#### **5. Simulation Results**

#### *5.1. LPV EMPC Implementation Details*

To illustrate the LPV EMPC approach presented in this paper, the Girona WWTP case study presented in Section 2 is used. The constituents of the influent wastewater of the Girona WWTP vary during the day between the following bounds:


The inflow of Girona WWTP is shown in Figure 6.

**Figure 6.** WWTP inflow.

With a quasi-linear approximation of the nonlinear WWTP via the LPV representation, the constrained optimization problem (16) is solved as a quadratic program using the CPLEX® solver in MATLAB® on an Intel Core i7 PC with 8 GB of RAM. A sampling time of 15 min and a prediction horizon of 6 h are chosen for simulation. The process is simulated for 7 days in a Simulink environment representing the dynamics of the Girona WWTP, as shown in Figure 1.

The weights *wi* associated with the multiobjective EMPC cost function (14) are tuned following the procedure in [20,21], with the aim of maintaining the quality of the exit water within current regulations, regardless of the inflow, at minimum cost.

Some control scenarios are selected to show different behaviors of the proposed scheme by altering *Xsp* and manipulating the weights *wi*, in order to illustrate the different aeration actions corresponding to different dissolved oxygen requirements for a quality effluent.

#### *5.2. First Scenario*

The first scenario consists of controlling the dissolved oxygen concentration at the exit of the biological treatment plant between the bounds (1.5, 2.5) mg/L. Figure 7 shows the dynamics of the DO concentration (above) and the corresponding aeration energy (below). The operation of the aeration, as stated in the preceding section, corresponds to the variation of the influents during the day; therefore, the DO concentration varies between the defined bounds in relation to the amount of pollutants in the influents at each time instant, which can be inferred from Figure 6.

**Figure 7.** (**Above**): DO concentration variation. (**Below**): Aeration flow for Scenario 1.

#### *5.3. Second Scenario*

The second scenario also consists of controlling the DO concentration, now within the range of 0.5 to 1.2 mg/L, with minimum aeration energy consumption.

From Figure 8, a behavior of the oxygen in the tanks similar to that of the first scenario is observed, with expectedly lower aeration energy, as less DO is required for treatment. The nitrates at the exit of the WWTP (Figure 9) range approximately between 5 and 7 mg/L.

**Figure 8.** (**Above**): DO concentration variation and (**Below**): Aeration flow for Scenario 2.

**Figure 9.** Nitrate concentration variation.

#### **6. Conclusions**

In this paper, an LPV EMPC strategy for the control of dissolved oxygen concentration in the aerated reactors of a WWTP is proposed and applied to the Girona (Spain) case study. The proposed approach combines two improvements with respect to existing approaches in the literature. First, differently from standard tracking MPC, the proposed EMPC strategy optimizes the economic performance of the plant instead of following pre-established set points. Second, a reduced model of the WWTP is represented in a quasi-LPV form, allowing the real-time implementation of the controller thanks to the use of quadratic programming optimization tools. If the nonlinear plant model were used instead, nonlinear programming algorithms would be required, which usually prevent real-time implementation because of the large computational time. Moreover, an LPV moving horizon state estimation scheme has also been proposed that allows the implementation of the LPV EMPC with the sensors available in the WWTP. The effectiveness of the proposed scheme has been illustrated in the considered case study with two scenarios aiming at keeping the DO within given bounds.

As future work, real testing in the WWTP will be conducted to further validate the performance of the proposed solution. Another issue to take into consideration is the application of the proposed methodology for maintaining aerobic conditions in sewer networks [22].

**Author Contributions:** Conceptualization, F.N., B.K. and V.P.; methodology, F.N., B.K. and V.P.; software, B.K. and J.P.; writing—original draft preparation, F.N.; writing—review and editing, F.N., B.K., V.P., J.P., J.Q. and S.d.C.; supervision, F.N., B.K. and V.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has been co-financed by the Spanish State Research Agency (AEI) and the European Regional Development Fund (ERDF) through the project SaCoAV (ref. MINECO PID2020-114244RB-I00), by the European Regional Development Fund of the European Union in the framework of the ERDF Operational Program of Catalonia 2014–2020 (ref. 001-P-001643 Looming Factory), and by the DGR of Generalitat de Catalunya (SAC group ref. 2017/SGR/482).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All the required data are included in the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


## *Article* **Comparison of Optimisation Algorithms for Centralised Anaerobic Co-Digestion in a Real River Basin Case Study in Catalonia**

**David Palma-Heredia 1,\*, Marta Verdaguer 1, Vicenç Puig 2,3, Manuel Poch <sup>1</sup> and Miquel Àngel Cugueró-Escofet <sup>2</sup>**


**Abstract:** Anaerobic digestion (AnD) is a process that allows the conversion of organic waste into a source of energy such as biogas, introducing sustainability and circular economy into waste treatment. AnD is an intricate process because of the multiple parameters involved, and its complexity increases when the wastes come from different types of generators. In this case, optimisation methods are key to achieving good performance. Currently, many tools have been developed to optimise a single AnD plant. However, the study of a network of AnD plants and multiple waste generators, all in different locations, remains unexplored. This novel approach requires the use of optimisation methodologies with the capacity to deal with a highly complex combinatorial problem. This paper proposes and compares the use of three evolutionary algorithms: ant colony optimisation (ACO), genetic algorithm (GA) and particle swarm optimisation (PSO), which are especially suited for this type of application. The algorithms successfully solve the problem, using an objective function that includes terms related to quality and logistics. Their application to a real case study in Catalonia (Spain) shows their usefulness (ACO and GA to achieve maximum biogas production and PSO for safer operating conditions) for AnD facilities.

**Keywords:** anaerobic co-digestion; ant colony optimisation; particle swarm optimisation; genetic algorithms; waste management; circular economy

#### **1. Introduction**

In the context of global climate change, with rising and more extreme events—such as droughts and floods—likely to increase the uncertainty in water demand and jeopardise the availability of specific resources, there is a growing interest in the adaptation and use of technologies related to the circular economy that promote environmental sustainability. In this framework, resource recovery is a key issue for industrial and environmental processes and shows a wide spectrum of study possibilities. In water sanitation, wastewater treatment plants (WWTPs) offer a wide range of possibilities for resource recovery, mainly related to sludge treatment processes [1–7], such as biogas generation via the substrate co-digestion process, which can be an alternative source for thermal and electrical energy production [8–14]. This potential for biogas generation could also be translated into a source of renewable natural gas, which has specific composition requirements that demand high-tech sensors to assure its quality regardless of its origin, such as those developed in [15,16]. Due to their potential for resource recovery and the further implications in the water–food–energy nexus, WWTPs have been a research focus from different areas of expertise: from modelling and engineering design [17–24] to process dynamics, simulation and integration [25–28].

**Citation:** Palma-Heredia, D.; Verdaguer, M.; Puig, V.; Poch, M.; Cugueró-Escofet, M.À. Comparison of Optimisation Algorithms for Centralised Anaerobic Co-Digestion in a Real River Basin Case Study in Catalonia. *Sensors* **2022**, *22*, 1857. https://doi.org/10.3390/s22051857

Academic Editor: Ernest W. Tollner

Received: 27 January 2022 Accepted: 24 February 2022 Published: 26 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Anaerobic digestion (AnD), a complex process involved in biogas production, requires a delicate balance of substrate composition: optimal performance requires avoiding process inhibition while maximising biogas generation. The optimal balance may be achieved with the correct mixture of available substrates, but this task is challenging and difficult to achieve manually due to the high number of combinatorial possibilities and the changing availability of substrates of heterogeneous nature. The complexity increases when the process is co-digestion, with the addition of residual substrates produced by agro-food and similar industries, each with its own dynamics of substrate generation and composition [29–32]. Additionally, not all WWTPs have an anaerobic digester. Therefore, optimisation must also address logistical challenges: processing a maximum volume of the available substrates in a certain geographical area and handling the travel logistics (of sludge and co-substrates) from origin to destination digester.

Hence, dealing with such complexity is a first step towards tackling optimal co-digestion in a complex network composed of many substrate sources—including WWTPs without AnD processes and industrial producers—and several co-substrate receptors, all located in different geographical places. As a result, the logistics of substrates are affected by the geographical distance between the actors involved and by the restrictions related to the receptors.

Optimisation of the individual digester feed requires optimal blending of different co-substrates in order to fulfil the volumetric and compositional requirements of the anaerobic procedure. This problem can be understood as a multidimensional knapsack problem (MKP) [33–35]. The MKP is an NP-hard problem [36] and has been widely studied in the literature. To solve this type of problem, the use of combinatorial optimisation metaheuristics is proposed in [37,38], particularly when a high number of restrictions is present [39].

Many tools have been developed to this end, either focused on modelling, control of the optimum co-substrate blending, or system operation, as shown in [37,39–43]. In [37,40], identification and modelling of critical parameters are performed; in [39] control schemes based on the composition qualities are developed; and in [42,43], optimised control strategies are implemented according to blend composition. In [41], logistics are also included to optimise the performance of a single anaerobic digester with co-digestion strategies.

However, in real-world installations, most of these systems are managed and supervised not individually but as a network. Thus, proper system management requires simultaneous consideration of the entire AnD network to select the best substrate combination for each digester and maximise the potential of the overall infrastructure. Moreover, the literature on this matter is relatively scarce due to its ad hoc nature. There is literature on the optimised placement of new AnD plants, such as [44], but it lacks optimisation of the operational part involved in the feeding of the anaerobic digesters. Very specific works can be found on the optimisation of supply-chain networks in the field of waste valorisation, such as [45], where an integrated geographical information system (GIS)-based optimisation is performed, but it requires highly detailed and tailored data, so its implementation becomes time-consuming and highly dependent on data availability; furthermore, it does not tackle process optimisation of the waste processing facilities themselves. Regarding logistics, other works can be found on path-planning optimisation, such as [46], where truck routes are traced based on GIS-oriented algorithms, or [47], where a smart waste-bin prototype is developed for sensor-based waste classification. As can be seen, there is a gap in the literature regarding network optimisation of existing waste management facilities (such as AnD plants) that includes both logistics (i.e., minimising route impact and length) and quality (i.e., improving process performance) optimisation. This gap needs to be filled since, as stated before, AnD networks are currently managed in an ad hoc, manual fashion by practitioners, which is extremely time-consuming and needs highly qualified personnel. Although there are currently different approaches that could help overcome specific parts of this challenge (i.e., those observed in [44–47]), none of them can currently accomplish the overall task successfully.

The approach presented in this study continues the work introduced in [48], where the optimisation problem of blending in anaerobic co-digestion (AnCD) is handled by an ant colony optimisation (ACO) algorithm in a synthetic case study with realistic conditions, simulating a centralised AnD single-stage reactor fed once a day. In [41], the work is extended by considering the quality, social and travel logistics of the co-substrates, analysing their importance for the overall optimisation. In addition, ACO has been implemented in real-world waste-sector case studies, e.g., [48,49]. Here, the work is extended to a similar real-world case study considering multiple receptors in the geographical area of the Besòs River basin (Catalonia, Spain). The data used correspond to the real operating conditions of this area. For these conditions, the authors have also evaluated the results obtained using different optimisation approaches, namely ACO, genetic algorithms (GA) and particle swarm optimisation (PSO).

ACO, GA and PSO algorithms were selected as convenient approaches to tackle a problem of the nature stated here, after reviewing applications of a similar nature in the literature. In [50], a review of nature-inspired algorithms for AnD modelling and optimisation is performed, including GA, ACO and PSO—among others—showing that, in the field of AnD, PSO obtains better performance in substrate feed optimisation for agricultural biogas plants than other evolutionary methods. For example, in [51], genetic algorithms are used to minimise the environmental impact caused by mine water. The main drawback pointed out for PSO in [50] is premature convergence, since particles may become trapped in local optima or suffer stagnation, but this may be solved by a partial restart of the process, introducing new particles into the search space. In [52], ACO and GA are applied to optimise the route of waste collection vehicles for municipal waste collection and transportation—the highest cost of the entire waste management system—with similar performance attained by both algorithms; however, only the travel logistics problem is considered, not the blending of municipal waste. Ref. [53] proposes a nonlinear model predictive control strategy using the MATLAB BioOptim toolbox, developed by the same authors, for optimal control of the substrate feed in the AnD operation of an agricultural biogas plant, with a graphical user interface (GUI) integrating a fitness function that includes different operating constraints and parameters such as pH, solids or methane concentration, and using evolutionary optimisers such as the covariance matrix adaptation evolution strategy (CMA-ES), differential evolution (DE) and PSO, with DE and PSO achieving the better performance, but without considering the substrate travel logistics in the optimisation. Ref. [54] presents a prediction and optimisation method using a multi-layer perceptron artificial neural network (ANN) and PSO for the maximisation of biogas generation in a real wastewater treatment facility. A similar approach is presented in [55], where modelling and optimisation of biogas production with mixed substrates are obtained with a combination of ANN and GA methods.

Additionally, ref. [55] and references therein point out how stochastic global optimisation algorithms (SGOAs), such as PSO, ACO and GA, among others, are considered efficient alternatives in the design of optimal production media and optimal process operating conditions in fermentation research and can significantly reduce the process development time. Regarding the comparison of the optimisation algorithms selected here for solving combinatorial optimisation (CO) problems, in [56], ACO and GA are compared, both achieving good performance, with GA performing slightly better than ACO. In the latter reference, it is also mentioned that tuning of specific parameters for both optimisers—e.g., the number of iterations, evaporation coefficient and number of ants for ACO, or the chromosome population, crossover and mutation probabilities for GA—is required to achieve good performance in both cases. In [57], the relationships between GA and ACO-type algorithms are detailed, presenting their similarities and showing how they use similar principles to succeed in CO problems with a globally convex structure of the solution space. Overall, SGOAs such as ACO, GA or PSO have shown good performance

in the type of applications presented, and have hence demonstrated suitability for non-convex, nonlinear, multidimensional optimisation problems such as the one considered here. Optimal blending for AnCD is considered, e.g., in [40,55], but to the knowledge of the authors, the optimisation of such blending combined with the travel logistics of co-substrates in a centralised multi-receptor co-digestion strategy has not yet been studied. Additionally, a comparison between different optimisation algorithms suitable for such applications, i.e., ACO, GA and PSO, is presented.

The application of optimisation strategies in AnD allows a significant enhancement of co-digestion strategies [30] maximising biogas production and minimising associated risks to each AnD operation (e.g., overdosing or acidification). In this work, the performance of each optimisation approach considered is evaluated on a real case study in the area of the Besòs River basin in Catalonia, including a network of substrate generators and three anaerobic digesters. Hence, the objective of this study is to develop a tool that is able to optimise the centralised digestion process of an AnD network with multiple waste sources and waste receptors by means of three evolutionary optimisation algorithms—namely, ACO, GA and PSO. Such a tool is tested in a real case study to further analyse and compare the performance of each algorithm in the overall AnD network optimisation.

#### **2. Material and Methods**

#### *2.1. Optimisation Algorithms Considered*

The optimisation algorithms presented here fall within the set of SGOAs: GA belongs to the subset of evolutionary algorithms (EA), which use mechanisms inspired by biological evolution (e.g., mutation or recombination) to achieve the optimisation goal, while ACO and PSO belong to the subset of swarm intelligence methods, based on the collective behaviour of self-organised, decentralised systems. SGOAs have been widely used to solve NP-hard combinatorial optimisation problems, such as the one presented in this study, which deterministic optimisation methods fail to handle due to their complexity.

Regarding each proposal, ACO is a metaheuristic approach that has been shown to be effective in solving a variety of NP-hard problems [58]. The algorithm is based on simulation of the behaviour of real ants in their search for food. When ants find food, they leave a pheromone trail on their path. Then, new ants follow that trail. In this way, an increasing number of ants are concentrated in places where there is food. In a similar way, the virtual ants construct a solution moving through the graph that represents the search space of solutions. Their paths are guided by a probabilistic state transition rule, which is based on pheromone trails and specific heuristic information. The algorithmic procedure is iterative. At each iteration, the pheromone trails are updated by applying an evaporation coefficient (when the value selected is not part of a feasible solution). To avoid rapid stagnation of the solution, the ACO algorithms can use several strategies [59], such as that related to the limitation of the pheromone trails between maximum and minimum values. Max–Min Ant System [60] uses this procedure.
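As an illustration of these mechanics (pheromone-guided construction, evaporation, reinforcement of the best solution and Max–Min trail limits), the following sketch applies them to a toy 0/1 knapsack, a simplified stand-in for the MKP of Section 1. It is not the implementation used in the paper, and all parameter values are assumptions:

```python
import numpy as np

def aco_knapsack(values, weights, capacity, n_ants=20, n_iter=100,
                 rho=0.1, tau_min=0.1, tau_max=5.0, seed=0):
    """Minimal Max-Min-style ant system for a 0/1 knapsack (illustrative).

    tau[i] is the pheromone on "include item i"; ants build solutions item
    by item with a probability driven by pheromone times a value/weight
    heuristic, pheromone evaporates by rho, and the best-so-far solution
    reinforces its items, with trails clipped to [tau_min, tau_max].
    """
    rng = np.random.default_rng(seed)
    n = len(values)
    tau = np.full(n, tau_max)                      # optimistic initial trails
    heur = np.asarray(values, float) / np.asarray(weights, float)
    best_val, best_sol = 0.0, np.zeros(n, dtype=int)
    for _ in range(n_iter):
        for _ in range(n_ants):
            sol, load = np.zeros(n, dtype=int), 0.0
            for i in rng.permutation(n):           # probabilistic construction
                p = (tau[i] * heur[i]) / (tau[i] * heur[i] + 1.0)
                if load + weights[i] <= capacity and rng.random() < p:
                    sol[i], load = 1, load + weights[i]
            val = float(sol @ values)
            if val > best_val:
                best_val, best_sol = val, sol.copy()
        tau *= (1.0 - rho)                         # evaporation
        tau[best_sol == 1] += rho * tau_max        # reinforce best solution
        tau = np.clip(tau, tau_min, tau_max)       # Max-Min trail limits
    return best_val, best_sol
```

The real blending problem replaces the scalar knapsack value with the quality-and-logistics objective described in Section 2.2.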

The GA is a metaheuristic approach also used in combinatorial optimisation problems. It is based on the mechanics of natural selection and natural genetics. GA applications cover a range of combinatorial optimisation problems, e.g., hydraulic model calibration [61], performance of photovoltaic systems under variable atmospheric conditions [62] and sensor placement for leak detection in water distribution networks [39,63]. The GA is based on three main operators: selection, crossover and mutation. The population matrix, consisting of the design variables, is randomly generated, and the best individuals are selected according to their fitness values. From these solutions, new solutions are produced via the crossover operator [64]. The mutation operator is finally employed to prevent the algorithm from converging to local optima (i.e., to maintain genetic diversity).

The GA cycle is repeated through a number of generations until a stopping criterion is met. It is worth noting that elitism is not generally considered an operator in the canonical GA. However, it is deemed a robust and effective operator because it leads the optimisation procedure towards the optimal solution. Accordingly, this operator stops the best solutions from being mutated. In this way, the best solutions of each generation would pass to the next, unaltered. Over the course of the algorithm and through a sufficient number of generations, the traits of these solutions would transfer to their offspring, increasing the chance of producing new solutions whose fitness function values might be better than their parents [64]. Some drawbacks of GAs are noted in [61], e.g., achieving a global optimum for large and complex systems is not guaranteed, which is also a drawback for ACO. In [57], the relation between GAs and ACO is noted.
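The operator cycle just described (selection, crossover, mutation, with elitism passing the best solutions to the next generation unaltered) can be sketched on the classic OneMax toy problem (maximise the number of 1-bits), standing in for a real blending fitness function. This is illustrative only; the function name and all parameter values are assumptions:

```python
import numpy as np

def ga_onemax(n_bits=30, pop_size=40, n_gen=120, p_cross=0.9,
              p_mut=0.02, elite=2, seed=0):
    """Canonical GA with elitism on OneMax; returns the best fitness found."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    fitness = pop.sum(axis=1)
    for _ in range(n_gen):
        def pick():                                # binary tournament selection
            i, j = rng.integers(pop_size, size=2)
            return pop[i] if fitness[i] >= fitness[j] else pop[j]
        # Elitism: the best individuals pass to the next generation unaltered.
        order = np.argsort(fitness)[::-1]
        new_pop = [pop[k].copy() for k in order[:elite]]
        while len(new_pop) < pop_size:
            a, b = pick().copy(), pick().copy()
            if rng.random() < p_cross:             # one-point crossover
                cut = rng.integers(1, n_bits)
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            for child in (a, b):
                flip = rng.random(n_bits) < p_mut  # bit-flip mutation
                child[flip] ^= 1
                new_pop.append(child)
        pop = np.array(new_pop[:pop_size])
        fitness = pop.sum(axis=1)
    return int(fitness.max())
```

Because the elite individuals are never mutated, the best fitness is monotonically non-decreasing over generations, which is exactly the stabilising effect of elitism described above.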

PSO is a recently developed EA whose main advantages include easy implementation for solving practical problems, high accuracy and fast convergence of the solution [65,66]. While PSO and GAs are similar in their iterative nature, PSO has no "crossover" or "mutation" operations. Instead, PSO is based on a population of candidate solutions, defined as particles. The set of particles composes a swarm, in which each individual moves through the parameter space. The flow of the particles is defined by trajectories, which are driven by the best performance of the particle itself and of its neighbouring particles in the parameter space. The particle swarm is initialized randomly: the initial solution of each particle represents an alternative solution; that is, each particle has its own initial position and speed and is randomly distributed across the feasible solution space to be searched. The initialization of the particle swarm therefore represents the preparation of the search. Each particle's state is determined by its speed and position, and the particle update is based on the comparison of the fitness values between each particle and its neighbouring particles, which determines whether a particle needs updating. The updated particle adjusts its speed and position according to its new flight-path model, which is based on the best results achieved by its neighbouring particles. These conditions yield different optimal experiences for different particle subgroups, which dynamically evolve according to the current positions and velocities of the particles, the distance between each particle of the subgroup and its best position, and the distance between each particle of the subgroup and the best position of the whole subgroup.

The PSO algorithm needs no cross-mutation or other genetically inspired operations, so it has fewer parameters while remaining highly efficient [65]. These properties make it suitable for both engineering applications and scientific research, and a significant number of research results have been produced in recent years [67]. For example, in [68], PSO is applied to function optimisation in eco-economics modelling and assessment; in [69], it is used as part of fuzzy systems developed to optimise the fuel consumption of hybrid vehicles; and in [70], PSO is used to train neural network models and perform real-time optimisation.
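The position/velocity update driven by each particle's own best and the swarm best, as described above, can be sketched as follows (a global-best variant minimising a toy sphere function; illustrative only, with assumed parameter values):

```python
import numpy as np

def pso(f, dim, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimal global-best PSO: returns the best position and its cost."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))    # random positions
    v = np.zeros_like(x)                                # initial velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()                # swarm (global) best
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        # Velocity: inertia + pull towards own best + pull towards swarm best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)                      # keep inside bounds
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val                       # update personal bests
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()            # update swarm best
    return g, pbest_val.min()
```

The premature-convergence issue noted in [50] would be addressed here by reinitialising a fraction of the particles when the swarm stagnates.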

#### *2.2. Centralised Co-Digestion as an Optimisation Problem*

Mathematical optimisation involves the selection of one solution amongst a set, according to some criterion and constraints (that is, the optimisation problem). This optimisation problem can be stated as:

$$\min\_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x}) \text{ subject to } \operatorname{g}(\mathbf{x}), \tag{1}$$

where *f*(*x*) is the objective or cost function, *X* is a feasible region and *g*(*x*) are the constraints that have to hold to find a minimiser *x\** of *f*(*x*) such that *f*(*x*∗) = min*x*∈*<sup>X</sup> f*(*x*). ACO, GA and PSO introduced in Section 2.1 are algorithms aimed at finding the optimal solution of the optimisation problem posed in (1)—i.e., minimise the objective function *f*(·) subject to the set of constraints *g*(*x*) that apply—which here consists of the selection of the best substrates and volumes according to a set of restrictions related to the operation of the anaerobic digester. In addition, the cost function allows quantifying each alternative potential solution according to (1), involving the calculation of a value or "cost" associated with each alternative considered to find the optimal solution.
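One common way to let population-based algorithms such as ACO, GA or PSO respect the constraints *g*(*x*) of problem (1) is to fold violations into the "cost" assigned to each candidate. The penalty form below is a generic illustration, not the paper's actual cost function; the quadratic penalty and the value of `mu` are assumptions:

```python
def penalised_cost(f, constraints, x, mu=1e6):
    """Cost of candidate x for a problem like (1): the objective f(x) plus a
    quadratic penalty for every violated constraint, each encoded g(x) <= 0."""
    violation = sum(max(0.0, g(x)) ** 2 for g in constraints)
    return f(x) + mu * violation
```

A candidate that satisfies all constraints is scored by the objective alone; infeasible candidates receive a large penalty, so the population is steered back towards the feasible region.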

The problem statement is similar to that presented in [41], although the number of waste receptors increases from one to three, thus requiring a reformulation of the optimisation problem involved. Specifically, the dimensions of all data vectors defining each generator–receptor interaction and their subsequent calculations must be increased. In addition, matrix operations are repeated for each new dimension (i.e., waste receptor) added.

This optimisation problem, which can be understood as an MKP and is of a combinatorial nature, can be represented as a matching problem. It is defined with a graph *G* = (*N*, *E*) that summarises all the possible combinations. The graph consists of *N* vertices (or nodes) and *E* edges (pairs of vertices). Specifically, for the case of AnD optimisation, a bipartite graph can be used to represent the posed optimisation problem, differentiating between the set of waste generators (*N*1), containing *W* nodes, and the set of waste receptors (*N*2), containing *R* nodes. For this specific optimisation problem, every one of the *W* generator nodes is connected to each of the *R* receptor nodes, thus resulting in a total of *E* = *W*·*R* edges. Figure 1 shows a generic representation of the defined matching problem applied to AnD co-digestion optimisation.

**Figure 1.** Generic representation of the posed matching problem.

A set of substrate generators $w \in \{1, \dots, N\}$ is considered. The volume of each substrate $V_w$ can be selected as a contribution to any of the AnD systems. The binary decision variable $y_w^s$ allows generating an array of volumetric possibilities $V_w^s$, with $s \in \{0, \dots, l_w\}$, that are determined as multiples of a number (e.g., 1000 by default) such that $1000\, l_w = V_w$. The selection of each volumetric possibility is determined by the corresponding value of the binary decision variable $y_w^s \in \{0, 1\}$, with $y_w^s = 0$ when the corresponding volumetric configuration is not selected and $y_w^s = 1$ when it is selected. Note that for each waste generator $w$ there are $l_w$ different volumetric configurations in $y_w^s$, but only one is selected at a time, i.e., $\sum_{s=0}^{l_w} y_w^s = 1 \ \forall w \in \{1, \dots, N\}$.
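The decision variables just described can be sketched as follows: each generator exposes a ladder of volumetric configurations in steps of 1000, and a valid selection activates exactly one configuration per generator. The function names are illustrative, not from the paper:

```python
def volume_options(V_w, step=1000):
    """Volumetric possibilities V_w^s = step * s for s = 0, ..., l_w,
    where step * l_w = V_w (step = 1000 by default, as in the text)."""
    l_w = V_w // step
    return [step * s for s in range(l_w + 1)]

def valid_selection(y):
    """Check the one-configuration-per-generator constraint:
    sum over s of y_w^s equals 1 for every generator w (rows of y)."""
    return all(sum(row) == 1 for row in y)
```

A candidate solution of the evolutionary algorithms is then one such 0/1 row per generator, and the optimiser searches over which single configuration each generator contributes.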

To normalise the objective function, the selected volumes $V_w^s$ are divided by the maximum volume of their corresponding waste generator ($V_w$). This approach yields cost-function values between 0 and 1, where values closer to 1 indicate better solutions. However, note that the ACO algorithm seeks a maximum of the objective function, while GA and PSO seek a minimum. This behaviour is handled using the constant $K \in \{-1, 1\}$, which depends on the algorithm considered: $K = 1$ for the ACO algorithm and $K = -1$ for the GA and PSO algorithms. Hence, the cost index $B$ takes positive values between 0 and 1 for ACO and negative values between 0 and $-1$ for GA and PSO; the absolute value is then taken for all three algorithms so that the results are comparable.

$F_w^c$ ($c = 1, \dots, 3$) and $T_w$ are the dimensionless coefficients corresponding to the substrate characterisation and the quality term $\left(\sum_{c=1}^{3} F_w^c\right) \rho_q$, already used and explained in [41,46] and defined as shown in Figure 2.

**Figure 2.** (**A**) $F_w^1$, (**B**) $F_w^2$, (**C**) $F_w^3$ and (**D**) $T_w$ equations used for dimensionless coefficient calculation.

$F_w^1$ is a coefficient related to the potential biogas production, measured as a function of the Chemical Oxygen Demand ($COD$) content. $F_w^2$ indicates the $COD/TN$ ratio (where $TN$ refers to Total Nitrogen), a useful measure to prevent situations of acidification and other undesired reactions of the AnD process, as long as it is maintained around the range of 20–60. $F_w^3$ is linked to the alkalinity (Alk) concentration, and it is associated with a restriction ranging from 2500 to 6000 mg CaCO3/L integrated within all optimisation algorithms. $T_w$ is a coefficient of the utmost importance, since it describes the toxicity level of all waste fluxes, which should be kept at the lowest level possible (specifically below 2.1 mg Pb/L).

The $N$ different substrate generators are located at different distances $d_w$ from each anaerobic digester. The conveyance of the selected volumes implies a travel distance $d_w$ (in km) with an economic cost $X_w$ (in €/km) and a social impact $I_w \in \{1, \dots, 3\}$ (dimensionless). The higher the value of $I_w$, the higher the social impact of the related route (e.g., proximity to sensitive areas due to pollution, traffic density, or pedestrian presence). Since each route is different for each generator, a value of $I_w$ is assigned to each sludge/substrate generator depending on its route to the ST, so as to approach the logistic impact of the corresponding waste generator.

The coefficient $\rho_q$ (dimensionless) weights the quality term $\left(\sum_{c=1}^{3} F_w^c\right) \rho_q$, and the coefficient $\rho_x$ (dimensionless) weights the logistics term $\rho_x / (X_w d_w I_w)$. Each weight is given a value of 0.5 to balance the quality and logistics terms in the optimisation. The selected volumes of each substrate for each receptor constitute the input to the AnD network, and the aforementioned parameters constitute the objective or cost function $f(x)$.

Additionally, the optimisation problem considers a set of restrictions $g(x)$ related to the total input to each receptor system, based on those presented in [41,48]. The first restriction is that the sum of accepted substrates, $\sum_{w=1}^{N} \sum_{s=0}^{l_w} y_w^s V_w^s$, must not exceed the maximum acceptable volume $V$ of each AnD system. Moreover, the $COD/TN$ ratio, related to the dimensionless coefficient $F_w^2$, must be kept within the range $[C^2_{min}, C^2_{max}]$. The alkalinity concentration, related to the dimensionless coefficient $F_w^3$, must also be kept within the range $[C^3_{min}, C^3_{max}]$. The toxicity level does not require a restriction, since the corresponding coefficient $T_w$ is considered restrictive enough, as shown in Figure 2.
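A per-receptor feasibility check along the lines of $g(x)$ can be sketched as below. The alkalinity range (2500–6000 mg CaCO3/L) and the approximate $COD/TN$ range (20–60) come from the coefficient definitions above; the volume cap is an illustrative placeholder:

```python
# Hedged sketch of the restrictions g(x) for one receptor: volume cap,
# COD/TN range and alkalinity range. Bounds are those stated in the text;
# V_max below is illustrative only.

def feasible(selected_volumes, V_max, cod_tn, alk):
    """selected_volumes: accepted volumes y_w^s * V_w^s for one receptor."""
    if sum(selected_volumes) > V_max:      # volume restriction
        return False
    if not (20 <= cod_tn <= 60):           # COD/TN range (coefficient F2)
        return False
    if not (2500 <= alk <= 6000):          # alkalinity, mg CaCO3/L (F3)
        return False
    return True

assert feasible([1000, 2000], V_max=5000, cod_tn=35, alk=4000)
assert not feasible([4000, 2000], V_max=5000, cod_tn=35, alk=4000)
```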

In addition, an estimation of the produced biogas is made assuming a conversion factor of 0.268 m³ biogas/kg $COD$. Finally, note that the cost function presented in this work is adapted from [41,48], where the ACO algorithm was used for waste management optimisation in a similar fashion but limited to a single AnD receptor.

The objective function $f(x)$ for the presented optimisation problem is given in (2). However, note that the performance comparison of the ACO, GA and PSO algorithms is not conducted directly on the value of the optimised objective function, $B'$, but on its absolute value, $B$, as shown in (3).

$$B' = K \left\{ \sum_{w=1}^{N} \sum_{s=0}^{l_w} y_w^s \frac{V_w^s}{V_w} T_w \left[ \left( \sum_{c=1}^{3} F_w^c \right) \rho_q + \frac{\rho_x}{X_w d_w I_w} \right] \right\} \tag{2}$$

$$B = \left| B' \right| \tag{3}$$
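Equations (2) and (3) can be transcribed directly. The following is a minimal Python sketch (the paper's implementation is in MATLAB) in which each generator's one-hot selection $y_w^s$ is already resolved into a fraction $V_w^s / V_w$, and all coefficient values are purely illustrative:

```python
# Direct transcription of Eqs. (2)-(3): B' = K * sum over generators of
# frac * T_w * [ (F1 + F2 + F3) * rho_q + rho_x / (X_w * d_w * I_w) ].

def objective(generators, rho_q=0.5, rho_x=0.5, K=1):
    """Compute B' per Eq. (2); K = 1 for ACO, K = -1 for GA/PSO."""
    total = 0.0
    for g in generators:
        quality = sum(g["F"]) * rho_q                    # quality term
        logistics = rho_x / (g["X"] * g["d"] * g["I"])   # logistics term
        total += g["frac"] * g["T"] * (quality + logistics)
    return K * total

# Illustrative data for two generators (frac = V_w^s / V_w).
gens = [
    {"frac": 0.5, "T": 0.9, "F": [0.8, 0.7, 0.6], "X": 1.2, "d": 10.0, "I": 2},
    {"frac": 1.0, "T": 1.0, "F": [0.9, 0.8, 0.7], "X": 1.0, "d": 25.0, "I": 1},
]
B_prime = objective(gens, K=-1)   # GA/PSO convention: negative values
B = abs(B_prime)                  # Eq. (3): comparison uses |B'|
assert B_prime < 0 and B > 0
```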

#### **3. Results**

*3.1. Case Study*

The case study includes a network of 19 organic waste generators and three organic waste receptors. These 22 locations (i.e., 19 generators and 3 receptors) are part of the wastewater treatment system managed by Consorci Besòs Tordera (CBT), a public local water administration composed of 64 municipalities in four different regions of Catalonia (Spain) with a population of approximately 470,000 inhabitants. This case study and its anaerobic network system were also considered in [41]. Figure 3 shows the corresponding bipartite graph of the case study.

**Figure 3.** Bipartite graph of the case study.

The three organic waste receptors (R1–R3, or nodes 1–3 of Figure 3) are three separate WWTPs that produce their own sewage sludge but also have AnD technology available. Due to oversized design, a usual practice in WWTPs [71], the AnD systems in R1–R3 have spare capacity. This excess capacity can be used to accept wastes from external sources, such as the undigested sewage sludge of W1–W12 or the industrial substrates from C1–C7.

The 19 waste generators consist of 12 WWTPs that produce undigested sewage sludge (W1–W12, or nodes 4–15 of Figure 3) and seven industrial substrate generators (C1–C7, or nodes 16–22 of Figure 3), all considered suitable sources of organic waste for the AnD network under study. Each of these locations is a separate and independent system that must manage its own waste as best as possible. Additionally, the seven industrial substrates have been previously verified as feasible for AnD by CBT technical services.

#### *3.2. Simulation Methodology*

The algorithms used in this work were implemented in the MATLAB environment. Simulations were performed on a Lenovo ThinkPad L14 Gen1-20U10016SP x64 (Lenovo Group, Ltd., Girona, Spain) running Microsoft Windows 10 Pro, with an Intel(R) Core(TM) i7-10510U CPU (1.80 GHz, 2304 MHz) with four physical cores and eight logical processors.

The main optimisation parameters of the GA and PSO algorithms were trimmed to select the most suitable set of values providing reliable results. The same procedure was previously performed for the ACO algorithm in [48], and the optimisation parameter values obtained there are used in this work.

For the ACO algorithm, an initial population of 100 individuals (or ants) and 500 iterations per repetition are set, and the values used for the algorithm optimisation parameters are $\alpha = 1$, $\beta = 2$ and $\rho = 0.98$, corresponding to the importance assigned to the pheromone trail, the importance assigned to the heuristic information, and the persistence degree or pheromone evaporation, respectively, as explained in [38,58,60]. For GA, the initial population is set to 100, the total number of iterations (or generations) to 500, the crossover fraction to 0.8 and the fraction of elite children to 5% of the total children. For PSO, the initial population is set to 100, the total number of iterations to 500, the cognitive attraction to 0.8 and the social attraction factor to 1.25. Tables 1 and 2 summarise the trimming tests for GA and PSO, respectively, where the best results correspond to higher values of the objective index $B$.
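For reference, the parameter sets above can be collected in one place (values as reported in the text; the exact option names of the MATLAB solvers are not reproduced here):

```python
# Trimmed optimisation parameters for the three algorithms, as reported.
PARAMS = {
    "ACO": {"population": 100, "iterations": 500,
            "alpha": 1, "beta": 2, "rho": 0.98},
    "GA":  {"population": 100, "iterations": 500,
            "crossover_fraction": 0.8, "elite_fraction": 0.05},
    "PSO": {"population": 100, "iterations": 500,
            "cognitive_attraction": 0.8, "social_attraction": 1.25},
}

# Population and iteration budget are fixed across algorithms for fairness.
assert all(p["population"] == 100 and p["iterations"] == 500
           for p in PARAMS.values())
```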


**Table 1.** Summary of trimming tests for the GA.

**Table 2.** Summary of trimming tests for PSO.


For the sake of performance comparison, some parameters were fixed for the three algorithms: the number of independent simulations (set to 10, of which the best result is selected), the population (set to 100 individuals), and the maximum number of iterations (set to 500). With these constraints on algorithm trimming, a performance comparison of ACO, GA and PSO was conducted.

The comparison of ACO, GA and PSO performance is based on the value of the fitness function, the execution time and an array of technical variables related to the total expected performance of the optimised AnD network: total daily biogas production (in Nm³), average organic load (in kg of COD per m³ of digestion system volume and day), average carbon-to-nitrogen ratio (C/N), and average alkalinity (in mg of CaCO3). All algorithms are tested with data from a real case study as the main simulation scenario. However, other synthetic scenarios are also tested to further compare the performance of each algorithm under different conditions.

In the approach presented here, simulated scenarios are based on the waste generator data in Table 3, alongside route distance and receptor system characterisation. For all the 19 waste generators (i.e., the 12 WWTPs without AnD and the seven substrate generators), the addition to the AnD network is optimised. For each of the three AnD systems (i.e., receptors R1, R2 and R3), different volume constraints have been determined, according to operational data and assuming a limit to the hydraulic retention time of 20 days (below that retention time, AnD efficiency is expected to greatly decrease).
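The 20-day hydraulic retention time (HRT) floor translates into a volume cap per receptor via the standard relation HRT = digester volume / daily inflow; the paper only states the 20-day limit, so the digester volume below is illustrative:

```python
# HRT = V_digester / Q_in >= 20 days  =>  Q_in <= V_digester / 20,
# i.e., the maximum acceptable daily feed volume for each AnD system.
HRT_MIN_DAYS = 20

def max_daily_feed(digester_volume_m3, hrt_min=HRT_MIN_DAYS):
    """Daily feed cap (m3/day) keeping the HRT at or above hrt_min days."""
    return digester_volume_m3 / hrt_min

assert max_daily_feed(4000) == 200.0   # 4000 m3 digester -> 200 m3/day cap
```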

**Table 3.** Waste generator dataset, including distance between waste generators and receptors and characterisation of each receptor of the case study (Baseline Scenario or Scenario 0).


Each simulation for ACO, GA and PSO is repeated 10 times, since these algorithms rely on probabilistic, iterative search methods. The best solution among these runs is selected for further analysis, although the average fitness function is also registered for discussion.
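The best-of-10 protocol is straightforward to express; in the sketch below, `run_once` is a hypothetical stand-in for one full ACO/GA/PSO execution returning its cost index:

```python
# Repeat a stochastic optimisation run, keep the best B and record the
# average for discussion, as done for all three algorithms.
import random

def run_once(seed):
    """Placeholder for one full optimisation run (returns a B value)."""
    random.seed(seed)
    return random.uniform(0.5, 1.0)

REPEATS = 10
results = [run_once(s) for s in range(REPEATS)]
best_B = max(results)                 # best solution kept for analysis
mean_B = sum(results) / REPEATS       # average also registered
assert best_B >= mean_B
```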

The data obtained for ACO, GA and PSO (each comprising 10 repetitions of the corresponding algorithm) are compared for every simulated scenario. The baseline scenario (i.e., Scenario 0) corresponds to the real case study, as described in Table 3. Additional synthetic Scenarios 1–4 are simulated, and their corresponding data are created from alterations of the baseline scenario, as described in Table 4.


**Table 4.** Synthetic scenarios created from original Scenario 0 in Table 3. Description of data alteration procedure.

While any modification is viable for Scenarios 1 and 2, for Scenario 3 an increase in volume involves a significant increase in execution time. This is because the optimisation problem works with combinations of fixed volumes, and an increase in volume involves a higher number of possible combinations for the algorithms to consider (i.e., an increase in the search space), hence the expected increase in execution time. Thus, the volume modification for Scenario 3 was limited to a threefold increase over the baseline scenario volume. Alternatively, for Scenario 4, the C/N ratios were modified while being kept below 60 to facilitate the algorithms finding a viable solution. This measure was adopted because the C/N ratio was the most limiting optimisation parameter in previous applications of a similar optimisation problem [41]. Scenario 2 was designed with both linear and nonlinear distance modifications (Scenarios 2a and 2b, respectively) to discuss the effect of the distance distribution, as pointed out in [41]. Note that trimming tests were carried out only for GA and PSO using the baseline scenario, assuming that the trimmed parameters would suffice for the simulation of synthetic scenarios similar to the baseline scenario.

The optimisation results are presented as a sequence of contributions from all the generators to each anaerobic digester. This optimised contribution sequence can be considered a suggested logistic plan for the co-substrate distribution as follows: once enough substrate has been produced and stocked at a waste generator, a truck with a capacity of 20 metric tonnes is fully loaded with substrate from the corresponding waste generator, disregarding the truck waiting time before starting each route; once fully loaded, the truck is assumed to travel to the waste receptor without further stops, always following the same route. As long as the cycle of supply routes of all involved waste generators is completed within the AnD retention time of 20 days, the properties of the resulting blending should not vary significantly, especially considering that every waste receptor would have a receiving system for these external organic substrates, where they would be stored and blended before being added to the AnD system. The specific start and finish times of each route along the day have not been considered; this does not affect the optimisation, although it has a considerable impact on real-world implementation.
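A back-of-the-envelope check of this logistic plan can be sketched as follows; the truck capacity and retention-time cycle come from the text, while the daily production figure is invented for illustration:

```python
# Full 20-tonne truckloads per generator; the whole supply cycle must fit
# within the 20-day retention time so the blending stays roughly stable.
TRUCK_T = 20          # truck capacity, metric tonnes
CYCLE_DAYS = 20       # AnD retention time bounding the supply cycle

def days_to_fill(daily_production_t):
    """Days needed to stock one full truckload at a generator."""
    return TRUCK_T / daily_production_t

# A generator producing 2 t/day fills a truck every 10 days: two routes
# per retention cycle, so its contribution arrives within the cycle.
assert days_to_fill(2) == 10.0
assert days_to_fill(2) <= CYCLE_DAYS
```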

#### *3.3. Algorithm Performance Comparison and Scenario Analysis*

The simulation results for each scenario are shown in Table 5. For every scenario and for ACO, GA and PSO, this table shows the best cost index (B) achieved, elapsed optimisation time, and additional parameters related to the performance of the AnD systems: total biogas production, average organic load, average carbon/nitrogen ratio and average alkalinity.

In the baseline scenario, ACO and GA show higher biogas production than PSO (23% and 30% higher, respectively) but a slightly lower $B$ index (4% and 11% lower). This result indicates that although one of the main goals of AnD optimisation is maximising biogas production, other parameters are also subject to optimisation. PSO finds a solution with lower biogas production but better optimises other quality parameters, such as the C/N ratio and alkalinity. It is remarkable that, of the three algorithms, ACO and GA find a "similar solution" (prioritising high biogas production), while PSO finds a significantly different solution (prioritising other quality-related parameters).

**Table 5.** Summary of algorithm performance. The best value *B* is highlighted. Scenario 2a feasible results (\*) are associated with a poor solution, so no direct comparison is conducted.


In Scenario 1, the COD concentration was increased tenfold to compare the efficiency of the algorithms in optimising substrates with high organic loads, which is especially meaningful for the maximisation of biogas production. For this scenario, ACO and GA perform better according to the best index $B$ achieved. As a natural consequence of the substantial COD increase, biogas production also increases dramatically. However, PSO is unable to achieve a solution competitive with ACO and GA in either this scenario or the baseline scenario.

In Scenario 2a, a tenfold linear increase in the geographical distances between facilities was conducted. This scenario allows comparing how well each algorithm handles situations where most substrates involve long distances. For this scenario, ACO is unable to find a feasible solution. GA and PSO find solutions, but the corresponding biogas production is far lower than that obtained in the baseline scenario (48% and 41% lower for GA and PSO, respectively).

Alternatively, Scenario 2b shows the optimisation results when the geographical distances between facilities are modified nonlinearly by taking the square root of the original distances. This modification allows understanding which algorithm would be more favoured by a more evenly distributed geographical location of plants. All the algorithms tested are able to find a feasible solution, with GA showing the best performance (both in terms of the best $B$ and biogas production), while ACO shows the worst best index $B$ achieved.

Scenario 3 was modified by a threefold increase in available volume from all sources. The presented modification allows studying the performance of each algorithm when the total number of possible solutions is much greater. ACO shows noticeably poor performance, below the best index *B* achieved by ACO in former scenarios. Although GA shows better performance than PSO in terms of the best index *B* achieved, biogas production appears similar to that in other scenarios.

In Scenario 4, an increase in the C/N ratio for waste generators W1–W12 was conducted. This modification would test the ability of each algorithm when one of the restrictions (i.e., C/N ratio) requires more adjustments. In this case, PSO shows the best index *B* achieved, although it presents the lowest biogas production. Similar to the baseline scenario, ACO shows slightly better performance than GA in terms of the best *B* achieved, but GA still has slightly better biogas production.

Geographical distance modification was performed in two alternative scenarios: Scenario 2a applies a linear (tenfold) modification of the distance matrix, and Scenario 2b uses the square root of the original distances. The relative locations of all waste generators and receptors in the case study are shown in Figure 4. In the baseline scenario, waste generators are geographically distributed homogeneously, but receptors are located in a relatively small area, so the difference in geographical distance from the emitters to each receptor might be negligible for the optimisation. This may be interpreted as a single receptor with a higher volume capacity, caused by the geographical overlapping of waste receptors, or the "big dot" effect. The linear modification of geographical distances in Scenario 2a does not alter this relative distribution, but the nonlinear modification in Scenario 2b does, avoiding the "big dot" effect by dispersing Receptors A, B, and C in the geographical space. It is important to note that for Scenario 2a, ACO was unable to find a viable solution, and both GA and PSO achieved relatively poor solutions compared to the corresponding baseline scenario solutions.
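The two distance modifications can be sketched numerically: the linear (tenfold) scaling of Scenario 2a preserves all distance ratios, while the square root of Scenario 2b compresses long distances relative to short ones, which is what disperses the overlapping receptors. The distances below are illustrative, in km:

```python
# Scenario 2a: linear scaling keeps relative positions (the "big dot"
# effect persists). Scenario 2b: square root compresses the ratio of far
# to near distances, separating the clustered receptors.
import math

d0 = [2.0, 8.0, 50.0]                      # baseline distances (km)
d_2a = [10 * d for d in d0]                # linear, tenfold
d_2b = [math.sqrt(d) for d in d0]          # nonlinear, square root

assert d_2a[2] / d_2a[0] == d0[2] / d0[0]  # 2a: relative spread unchanged
assert d_2b[2] / d_2b[0] < d0[2] / d0[0]   # 2b: relative spread reduced
```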

**Figure 4.** Map of waste generators and waste receptors R1–R3 for the Baseline Scenario (**A**), Scenario 2a (**B**) and Scenario 2b (**C**). Distance is expressed as longitudinal distance (*X*-axis) and latitudinal distance (*Y*-axis) with respect to the R1 plant.

Figure 5 shows the resulting blending profiles for the baseline scenario and Scenario 2b. For the baseline scenario, PSO tends to balance the blending of substrates with a low organic load content (i.e., from W1–W12) and selects noticeably lower amounts of high organic load substrates (i.e., from C1–C7) than ACO or GA. This behaviour is similar across the three waste receptors A, B and C. On the other hand, ACO and GA tend towards selective blending, showing similar preferences for receptors B and C. For receptor A, the GA algorithm tends towards slightly more homogeneous blending. In any case, both ACO and GA include more substrates of high organic load (i.e., from C1–C7), except for receptor C.

For Scenario 2b, the ACO blending profiles are similar to those obtained in the baseline scenario, showing a certain tendency to include particular waste generators, although the substrates selected by the algorithm vary. The GA blending profiles obtained in Scenario 2b are the most affected by the geographical distance distortion: GA appears to balance the blending of substrates from all waste generators, much like PSO in both Scenarios 2a and 2b. Additionally, the GA blending profile for Scenario 2b accounts for more industrial, high organic load wastes (i.e., from C1–C7) than the PSO blending profile, which remains relatively similar between the baseline scenario and Scenario 2b.

**Figure 5.** Blending profiles for every waste receptor and ACO, GA and PSO algorithms for the baseline scenario (**left**) and Scenario 2b (**right**).

#### **4. Discussion**

Simulations with the optimisation algorithms ACO, GA and PSO were performed, showing successful optimisation results in almost all scenarios. Data from a real case study were used to run simulations of centralised anaerobic co-digestion blending. As detailed in Section 3, the dataset comprises 19 organic waste generators and three organic waste receptors within the context of a sanitation network in an area of high industrial activity in Catalonia. This case study constitutes the baseline scenario. In previous work [41], the potential impact of optimising AnD with wastes from external sources was already demonstrated, yielding up to 77% cost savings in waste management. Different modifications were made to this dataset to compare the performance of the ACO, GA and PSO algorithms under different conditions and assess each optimisation algorithm in relevant situations. Regarding the optimisation problem, the C/N ratio is the dominant restriction, as previously seen in [41]. This is why this parameter is included in the discussion of the results, together with biogas production.

For the baseline scenario, as seen in Table 5, PSO shows the best index $B$ achieved but also the lowest biogas production. However, PSO also shows the lowest C/N ratio, which might play a role in achieving the best solution, compensating for the lower biogas production. If higher biogas production were required for a particular setup, this could be trimmed through the corresponding weight in the objective function as a trade-off among the different parameters involved. The results obtained have been considered suitable for the installation under study and dramatically improve on the performance obtained in the baseline scenario of [41]. Both ACO and GA generally show higher biogas production, but their best index $B$ values are below that of PSO, and their C/N ratios are above 50.

As shown in Figure 5, ACO and GA show similar behaviours for the baseline scenario, prioritising specific substrates. A first hypothesis suggests that the prioritised substrates are those with higher COD, since they allow higher biogas production. On the other hand, PSO follows a different strategy, blending more of the available substrates and tending to exclude industrial substrates. This trend may point to PSO performing a conservative strategy, avoiding at all costs situations where the operation of the AnD would be put at risk. Therefore, the general trend is that ACO and GA solve the presented optimisation problem by maximising biogas production and pushing restrictions to the limit, while PSO tends to balance the trade-off between biogas maximisation and the C/N ratio. In addition, note that PSO has the shortest execution times and ACO the longest, which is observed for all scenarios, indicating that PSO is more computationally efficient; in any case, with the values obtained, execution time is not a drawback for real implementation.

The similarities between ACO and GA, and their differences with respect to PSO, could be partially explained by the nature of these algorithms. Both ACO and GA tend to explore the solution search space around its borders, thus increasing the number of infeasible solutions but also the chances of finding a "rare" solution with a higher best index [72,73]. These algorithms rely on relatively independent behaviour between particles, so that each one can explore separate areas of the border search space and find different, non-redundant solutions. On the other hand, the PSO algorithm explores the search space based on dependent behaviour between neighbouring particles, which does not encourage particles to explore the limits of the search space; instead, it promotes the exploration of intermediate areas between the centre and the borders of the search space. This could help explain why the PSO algorithm attains solutions within shorter execution times but with generally lower biogas production. Hence, PSO tends towards a conservative strategy where, instead of selecting the most promising single ant or particle, it prioritises a consensus among the best neighbourhoods.

As detailed in Section 3, Scenario 1 is modified by increasing the organic load of all substrates tenfold. Thus, the dominant condition, in this case, is that organic waste valorisation is fostered, leading to higher biogas production. As observed in Table 5, ACO and GA show better performance than PSO in this scenario, but GA is more efficient since its attained biogas production is noticeably higher than that achieved with ACO.

Additionally, as detailed in Section 3, Scenarios 2a and 2b include a modification of the geographical locations of the involved facilities. As observed in Figure 4, the relative distances between waste generators and receptors (R1, R2, R3) are not modified by the linear modification of distances, as seen when comparing the map plot of the baseline scenario to that of Scenario 2a. However, the nonlinear modification of geographical distances in Scenario 2b leads to a different map plot, where waste receptors are more dispersed among themselves in relation to waste generators (Figure 4). The effect of this distortion is that waste receptors are more separated, thus avoiding the geographical overlapping of waste receptors, or "big dot" effect, i.e., the assimilation of nearby plants into a single centralised plant from a geographical perspective.

Hence, Scenario 2a increases the geographical distances between waste generators and receptors tenfold, making geographical distance a dominant condition for the optimisation. In this case, ACO is unable to find a solution, and both GA and PSO show extremely poor performance compared with the baseline scenario, as shown in Table 5. The linear modification of distances in Scenario 2a shows the performance of each algorithm under the pressure of cases with large geographical distances. This pressure case was especially relevant to test because it can significantly impact the logistics processes. On the other hand, Scenario 2b presents a different trend due to the nonlinear modification of distances. As detailed in Table 5, GA shows the best performance, and PSO performs better than ACO even in terms of biogas production.

As observed in Figure 5, ACO and PSO maintain similar blending profiles, while GA and PSO also exhibit similar blending profiles, although the GA results include a greater volume of industrial substrates. This shared behaviour between GA and PSO is exclusive to Scenario 2b, but it might indicate that GA behaves similarly to PSO in this case. However, from the operational point of view, GA solutions involve greater risks, since they tend to include more industrial substrate than PSO solutions.

In Scenario 3, a threefold increase in the volume of all waste generated was applied. This implies that the search space (i.e., the total number of combinations and possible solutions) drastically increases. In this case, Table 5 confirms the trend observed in previous scenarios, where GA obtains the best performance and ACO the worst. Again, this finding is consistent with the observation that the ACO algorithm attains weaker performance than GA and PSO for this particular case, and that GA and PSO attain similar performance in this study, although GA generally provides better performance than PSO.

Finally, Scenario 4 consists of increasing the C/N ratio of the W1–W12 substrates. These substrates originally corresponded to sewage sludge with a low nitrogen load, but in Scenario 4 the drastic increase in the C/N ratio of the sewage sludge is the dominant condition to be tested. Table 5 also summarises the results for Scenario 4, where PSO shows the best performance and GA the worst. The main observation is that the PSO algorithm is better able than GA to manage situations with high nitrogen loads or tight restrictions, while GA has more potential to maximise biogas production. However, GA is more sensitive to high nitrogen loads, since they reduce the available room to acquire industrial wastes with both high organic and high nitrogen loads.

The developed algorithms have successfully optimised the AnD network of the case study, and their performance has been tested under different conditions (i.e., Scenarios 1–4). Simultaneous logistics and quality optimisation of a network of existing waste management facilities is a gap in the current state of the art due to its ad hoc nature and its interdisciplinarity: there are specialised works on logistics optimisation, such as [45,46], but they do not include process optimisation. The present study implements this logistics optimisation by minimising a cost function designed to this end. This approach also responds to the need of the professionals managing the AnD network considered here for a decision support tool capable of integrating logistics and process performance optimisation, and of handling changing operating conditions and scenarios, as actually happens in real facilities.

It is also worth noting that there exists a variety of sensors for the determination of physical-chemical parameters that could complement the sensor network considered in this case study, such as variations of the ones presented in [47]. These sensors could provide additional insight, especially if combined with GIS and process optimisation, and could also facilitate the real-time implementation of the presented approach. Additionally, they could be used as control mechanisms in those cases where ACO and GA optimisation is applied, since the attained optimised outcomes pushed the quality restrictions of the AnD process close to their thresholds. However, there is a trade-off between information (i.e., data gathered from new sensors) and resources (e.g., implementation, maintenance) that has to be taken into account when considering new sensors.

Overall, this study presents a step forward towards the integrated optimisation of AnD networks, making an innovative attempt to couple logistics and quality optimisation of the centralised digestion process of a real AnD network.

#### **5. Conclusions**

In this study, three approaches were developed for the simultaneous optimisation of multiple AnD systems, based on ACO, GA and PSO. These methods were applied to a case study based on real data from an AnD network in the area of the Besòs River basin in Catalonia, and the performance of each optimisation approach was evaluated. All the approaches successfully optimised biogas production for the simulated scenarios while satisfying the practical restrictions considered in the optimisation.

For the baseline scenario, ACO and GA achieved maximum biogas production by pushing restrictions to the limits of safe operation. On the other hand, PSO solved the optimisation problem with a more conservative strategy, where biogas production is lower than in the ACO or GA solutions in favour of better AnD operating conditions (i.e., by adjusting the C/N ratio and alkalinity).

In those cases with high opportunities for biogas production (i.e., Scenario 1), GA and ACO would perform the best due to their capabilities of maximising biogas production over that of PSO. GA would perform as the best optimisation algorithm both for cases where distances are significantly different amongst them (i.e., Scenario 2b) and for cases where higher volumes should be handled (i.e., Scenario 3), presumably due to GA's computational potential. Finally, for those cases where other quality-related parameters are restrictions (i.e., Scenario 4), PSO would be the best performing algorithm.

The present study makes an innovative contribution to optimising the performance of centralised AnD systems, combining logistical and quality parameters. To the authors' knowledge, this optimisation has not yet been addressed in the literature for an AnD network. In addition, the framework has proven its effectiveness in minimising the total distance travelled to transport the waste and maximising biogas production, while keeping the physical-chemical parameters of the process within their operational limits.

Further work may include methodologies to improve the quantification of the social impact factor in the optimisation, which might allow a better characterisation of the logistic impact of each substrate generator. Additionally, the development of logistic route simulations would be required to enhance real-world distribution planning, considering daytime, travel frequency, dynamic waste production and consumption coupled with stocking problems, and other time-related issues key to logistic planning.

**Author Contributions:** Conceptualization, D.P.-H., M.V. and M.À.C.-E.; methodology, D.P.-H., M.V. and M.À.C.-E.; software, D.P.-H.; validation, M.V., M.À.C.-E. and V.P.; formal analysis, M.V., M.P., M.À.C.-E. and V.P.; investigation, D.P.-H. and M.V.; resources, M.P. and M.À.C.-E.; data curation, D.P.-H., M.V., V.P. and M.À.C.-E.; writing—original draft preparation, D.P.-H., M.V. and M.À.C.-E.; writing—review and editing, D.P.-H., M.V., V.P. and M.À.C.-E.; supervision, V.P., M.P. and M.À.C.-E.; project administration, M.P. and M.À.C.-E.; funding acquisition, M.P. and M.À.C.-E. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Acknowledgments:** The authors would like to acknowledge the support received from Consorci Besòs Tordera. LEQUIA-UdG and SAC-UPC have been recognized as consolidated research groups by the Catalan Government (2017-SGR-1552 and 2017-SGR-482, respectively).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


## *Article* **Transfer Learning in Wastewater Treatment Plant Control Design: From Conventional to Long Short-Term Memory-Based Controllers**

**Ivan Pisa 1,2,\*, Antoni Morell 1, Ramón Vilanova <sup>2</sup> and Jose Lopez Vicario <sup>1</sup>**


**Abstract:** In the last decade, industrial environments have been experiencing a change in their control processes. It is increasingly frequent that control strategies adopt Artificial Neural Networks (ANNs) to support control operations, or even as the main control structure. Thus, control structures can be directly obtained from input and output measurements without requiring deep knowledge of the processes under control. However, ANNs have to be designed, implemented, and trained, which can become a complex and time-demanding process. This can be alleviated by means of Transfer Learning (TL) methodologies, where the knowledge obtained from a single ANN is transferred to the remaining nets, reducing the ANN design time. From the control viewpoint, the first ANN can be easily obtained and then transferred to the remaining control loops. In this manuscript, the application of TL methodologies to design and implement the control loops of a Wastewater Treatment Plant (WWTP) is analysed. Results show that the adoption of this TL-based methodology allows the development of new control loops without requiring deep knowledge of the processes under control. Besides, a wide improvement in control performance with respect to conventional control structures is also obtained. For instance, results show that fewer oscillations in the tracking of desired set-points are produced, achieving improvements in the Integrated Absolute Error and Integrated Square Error which go from 40.17% to 94.29% and from 34.27% to 99.71%, respectively.

**Keywords:** control design; industrial control; transfer learning; WWTP

#### **1. Introduction**

Industrial environments are characterised by running complex and repetitive processes which are sometimes maintained over time. In that sense, control systems are adopted in order to ensure that these processes perform correctly [1]. Most of the time, the development of control strategies can become a complex and time-demanding task, since a deep knowledge of the process under control is required. However, the advent of the Industry 4.0 paradigm and of Artificial Neural Network (ANN) applications is changing the way we control and manage industrial environments. Their main aim is to provide industries with solutions mainly based on measurements obtained from their systems [2]. Some of these solutions go from basic forecasting systems to more complex solutions, like predictive maintenance ([3], Chapter 9). However, one of the sectors where Industry 4.0 and ANNs are making an impact is industrial control ([3], Chapter 5). There, ANNs have been adopted for a wide range of tasks, such as the design of soft-sensors or the detection of malfunctions [4–6]. Not only this, but the industrial control domain is experiencing a change in tendency: ANNs are increasingly used as the main control structure instead of conventional controllers.

**Citation:** Pisa, I.; Morell, A.; Vilanova, R.; Vicario, J.L. Transfer Learning in Wastewater Treatment Plant Control Design: From Conventional to Long Short-Term Memory-Based Controllers. *Sensors* **2021**, *21*, 6315. https://doi.org/ 10.3390/s21186315

Academic Editors: Miquel À. Cugueró-Escofet and Vicenç Puig

Received: 13 July 2021 Accepted: 18 September 2021 Published: 21 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

One of the industrial sectors where this tendency is observed is that of Wastewater Treatment Plants (WWTPs), which are characterised by running very complex processes where individual operations and actions can change the whole operation of the plant [7]. For that reason, a large number of control loops are required in order to ensure that each individual operation is correctly performed. Proportional Integral (PI) controllers have mostly been considered the default and basic control strategy able to ensure correct WWTP behaviour [8]. However, a complete reduction of the pollutants present in the residual waters cannot be ensured. For that reason, more complex structures have been proposed in order to improve the control performance. Fuzzy and Model Predictive Controllers (MPCs) were adopted in [9] as the main control strategy to avoid the effluent violations of a WWTP, whereas in [10], a hierarchical structure with fuzzy and MPC controllers was proposed to determine the control actuation, taking into account the weather and a variable set-point. In this case, the set-point adopted by the MPC controllers is determined by the fuzzy controller, whose objective is to maintain the ammonium in the fifth reactor tank of a WWTP at a desired value (see Figure 1 for the distribution of the tanks of a general-purpose WWTP). The problem observed with this kind of structure lies in the fact that it requires a model which replicates the relationships between input and output measurements. Besides, most of the time, these relationships are non-linear relations which are difficult and tedious to model. This is where ANNs come in, since they are algorithms offering good performance when dealing with these kinds of relationships ([11], Chapter 6). The first approach consists of the adoption of ANNs as elements whose predictions are used by conventional control structures.
For instance, the solution proposed in [10] has been improved in [12], where Long Short-Term Memory (LSTM) cells have been adopted to predict the WWTP effluent concentrations and determine when and which controller has to actuate. In other cases, neural networks have been considered to directly determine the optimal set-point values adopted by conventional controllers [13] or to implement a Reinforcement Learning (RL) module performing the same task [14]. Moreover, in the last few years, ANNs have been directly considered as the control strategy. In [15], neural networks have been considered to implement an Internal Model Controller (IMC) devoted to managing certain concentrations required in the pollutant reduction tasks performed in the WWTP. This also entails that the control actuation can be decoupled from the physical specifications of the environment [16,17].

**Figure 1.** Benchmark Simulation Model No. 1 layout. *Qo*, *Qa*, *Qr*, *Qe*, and *Qw* are the influent, internal recycle, the external recycle, the effluent, and the wastage flow rates, respectively. Dotted lines correspond to control signals (measured concentrations, desired set-points and actuation signals), while solid lines correspond to process media.

The incursion of ANNs into the industrial control domain presents its own drawbacks that have to be taken into account [18]. The most important one is the fact that ANNs have to be designed and trained with large amounts of data. This training process is devoted to determining the different hyperparameters of the ANNs, such as the number of hidden layers and neurons, the learning rate, or even the topology of the network. This has to be performed for each ANN considered, whether it complements a conventional controller (PI, MPC, fuzzy) or acts as the controller itself. Besides, this training process can last hours or even days, depending on the network structure, the hyperparameters, and the amount of data [19]. For that reason, transfer learning (TL) methods have been considered to alleviate these tasks.

TL was adopted from image classification tasks, where it was used to obtain a good image classifier from predesigned and pretrained structures in a source domain [20]. Then, these pretrained structures were retrained with images of the target domain in what is called a fine-tuning process ([21], Chapter 6). In industrial environments, TL has been adopted in the design process of soft-sensors, which are first designed and trained in a source domain where a large number of measurements is available. TL techniques have been adopted mainly to design and implement soft-sensors in harsh environments showing a lack of measurements. In [22], TL techniques were considered to design a soft-sensor to be deployed over a sulphur recovery unit. The problem there is that this environment shows a severe data scarcity problem; therefore, a traditional ANN training process cannot be performed. To alleviate this, the authors proposed the adoption of TL to design and implement the soft-sensor in an environment without data scarcity problems (the source domain). Then, the obtained soft-sensor was transferred into the environment with the scarcity problem (the target domain) and fine-tuned to adapt its behaviour to this environment [22]. In our case, we propose the adoption of the Transfer Learning-based Control Design approach to implement and design the complete control strategy of a general-purpose WWTP. The main idea is to substitute all the PI controllers with LSTM-based PI controllers, where only one is implemented while the others are obtained as transferred versions. The main point here is that instead of training and designing as many LSTM-based PIs as there are PI controllers, we will implement only a single LSTM-based structure which will then be transferred into the remaining control loops. In that way, the design of the control loops is eased while its complexity is reduced.
Now, efforts can be focused on designing a single data-driven controller instead of designing and tuning as many controllers as there are control loops. In this work, only two control loops have been designed and implemented following this approach. Thus, the benefit of this control approach is not as widely explored as it could be in a scenario where multiple control loops exist, like in the petrochemical industry [4,23]. However, it has to be taken into account that this approach is mainly based on the adoption of ANNs, which are trained with large amounts of data coming from the control loops. Therefore, data have to be accessible in order to adopt this approach; otherwise, the ANNs will not be properly trained, and consequently, the control loops will not act as they should.

This approach was first explored in [24], where an LSTM-based PI structure was trained with data from a single control loop and then transferred into the remaining control loop of a WWTP environment. Notwithstanding this, the structure proposed in [24] considers a single LSTM cell which requires a total of 4 h of WWTP measurements in order to achieve good control performance. Besides, neither the design of the LSTM-based PI nor its fine-tuning process is carried out there. Therefore, the control performance of the LSTM-based PI can be improved if it is fine-tuned with measurements coming from the target domain, that is, the control loop where the LSTM-based PI is transferred. For that reason, in this manuscript we continue the work started in [24]. Here, we propose the fine-tuning process and also analyse the benefits and losses of implementing the LSTM-based PI with data coming from different control loops. Moreover, a new LSTM-based PI structure able to manage the WWTP control loops without requiring 4 h of measurements is proposed, while the control performance is improved by means of the fine-tuning of this LSTM-based PI controller. Results will show that, among all the LSTM-based PIs, there exists one able to perform well in the different control loops. Thereby, the fine-tuning process of this LSTM-based controller and its control performance will also be analysed in this work. Besides, the speed-up of the design and implementation process will be explored and analysed in this manuscript as a function of the amount of time required to train the LSTM-based structures. The application where it is tested is specific, but the proposed design approach can be adopted in any kind of industrial environment where measurements are available. In summary, a TL-based design approach is proposed to implement the complete control strategy of a WWTP. The main contributions of this work can be summed up as:


The structure of the manuscript is as follows. The work presented here is introduced in Section 1. The materials and methods adopted in this work are presented in Section 2, especially the Benchmark Simulation Model No. 1 (BSM1), a digital framework which models a general-purpose WWTP. In addition, the LSTM cells as well as the TL principles are explained in this section. Then, the main contribution of this work, that is, the adoption of TL methods to design and implement the controllers of WWTP control loops, is defined and explained in Section 3. The results of the exploratory analyses carried out are presented in Section 4, while Section 5 concludes the paper.

#### **2. Materials and Methods**

#### *2.1. Benchmark Simulation Model No. 1*

The Transfer Learning-based Control Design approach proposed here is tested over the Benchmark Simulation Model No. 1 (BSM1). The BSM1 plant is a fictitious WWTP designed using the engineering principles of an activated sludge process. It characterises a medium-scale, general-purpose WWTP whose main objective is to reduce the nitrogen-derived pollutant products present in residual urban waters [25]. Besides, one of the major aims of BSM1 is to implement a digital framework where different control strategies can be designed and tested before being applied in the real environment. Thus, BSM1 is able to offer generality, easy comparison, and replicability of results in terms of the different control strategies devoted to maintaining certain pollutant components under certain levels or limits [8].

In such a context, BSM1 implements the Activated Sludge Model No. 1 (ASM1), which corresponds to a set of mathematical expressions describing the non-linear and highly complex biological and biochemical processes carried out inside the WWTP [26]. These processes mainly consist of the denitrification and nitrification processes, where the nitrate and ammonia components are transformed into nitrogen and its derivative products [27]. Notwithstanding this, there are other Activated Sludge Models whose main aim is to model not only the processes carried out to reduce the nitrogen-derived pollutants, but also the phosphorus-derived ones. This is the case for the Activated Sludge Models No. 2, 2d, and 3 [28], which require some updates to the BSM1 framework in order to consider either the phosphorus removal processes, as in the phosphorus removal BSM1 framework (BSM1-P), or the sludge treatment, as in the Benchmark Simulation Model No. 2 (BSM2) [29,30]. Nevertheless, the study of these behaviours, as well as the layout of these benchmarks, is out of the scope of this work.

#### 2.1.1. BSM1 Layout

The BSM1 layout consists of a set of five reactor tanks and a settler placed just before spilling the clean water into the receiving waters (see Figure 1). The five reactor tanks, where the biological and biochemical processes described in the ASM1 model are carried out, are characterised by their aeration conditions: the first two are anoxic tanks (working with a lack of oxygen), whereas the last three work under aerated conditions [8]. They have a total volume of 6000 m<sup>3</sup>: 1000 m<sup>3</sup> for each anoxic tank and 1333 m<sup>3</sup> for each aerated tank. The settler has a total volume of 6000 m<sup>3</sup>. Thus, the total volume of BSM1 equals 12,000 m<sup>3</sup>. Besides, the BSM1 framework has been designed to process an average influent flow rate of 18,446 m<sup>3</sup>/d and an average biodegradable chemical oxygen demand (COD) of 300 g/m<sup>3</sup>. This entails that the BSM1 retention time is equivalent to 14.4 h on average [8,31].

The influent data for a municipal WWTP consist of time series of the flow and of the concentrations of the water quality parameters. These influent flow rates depend on many factors: the size of the catchment, the type of sewer system, and the number of person equivalents, among others. For instance, influent profiles for a WWTP of 100,000 PE are available in [32]. They include dry, rainy, and stormy weather conditions. Besides, these are the usual ones considered when working with BSM1, and therefore the ones considered in the present work. More information about the BSM1 influent flow and concentrations can be obtained from the BSM1 specifications [8]. Among these 15 variables, the ones of interest in this work are those related to the BSM1 default control strategies:


In the case of the NO control loop, a Proportional Integral (PI) controller is proposed to manage the internal recycle flow rate (*Qa*) in order to ensure that the *SNO*,2 concentration is maintained at the default set-point (1 mg/L). The DO control loop considers another PI structure whose main aim is to maintain the *SO*,5 concentration at the default set-point of 2 mg/L. This is performed by varying the oxygen transfer coefficient of the fifth reactor tank (*KLa*,5) according to the measured *SO*,5. In that sense, it is worth noting that the two default PI controllers provided in the BSM1 framework have already been tuned, that is, the proportional gain and the integral time parameters are predefined by the BSM1 designers. The control performance of these PI configurations is provided as a starting point and a baseline with which a new control structure can be compared. Moreover, we have considered the default control strategies, that is, their parameters have been left as the initial configuration proposed by the BSM1 designers.
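The behaviour of such a set-point tracking loop can be illustrated with a minimal discrete-time PI controller acting on a toy oxygen model. Note that the plant model, gains, limits, and sampling time below are illustrative assumptions, not the BSM1 default tuning:

```python
class PIController:
    """Minimal discrete-time PI controller with actuator saturation."""

    def __init__(self, kp, ti, dt, u_min, u_max):
        self.kp, self.ti, self.dt = kp, ti, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt          # accumulate the error
        u = self.kp * (error + self.integral / self.ti)
        return min(max(u, self.u_min), self.u_max)  # clamp to actuator limits

# Toy analogue of the DO loop: track the 2 mg/L set-point by adjusting an
# aeration coefficient (time unit: days; all coefficients are placeholders).
pi = PIController(kp=5.0, ti=0.1, dt=0.01, u_min=0.0, u_max=50.0)
so5 = 0.5                                          # initial oxygen concentration
for _ in range(800):                               # 8 simulated days
    kla = pi.step(2.0, so5)                        # aeration command
    # Oxygen balance: aeration transfer minus a Monod-like uptake term.
    so5 += (kla * (8.0 - so5) - 100.0 * so5 / (so5 + 1.0)) * pi.dt
```

Thanks to the integral action, the loop settles at the set-point with zero steady-state error, which is the baseline behaviour the LSTM-based controllers in this article aim to replicate.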

#### 2.1.2. BSM1 Simulation and Evaluation Protocols

As previously stated, BSM1 has been widely considered as a general-purpose WWTP digital framework offering generality, easy replication, and comparison between different control strategies. In order to ensure a fair comparison of control performance, BSM1 considers two kinds of simulations: (i) a simulation where no variations are produced in the influent, and (ii) a simulation where daily influent and weather variations are produced. In that sense, four influent profiles comprising 14 days of influent measurements are provided [25]:


• Stormy influent: Influent profile showing daily variations of the influent concentrations. Two short but intense stormy perturbations are produced at days 8 and 11.

Thus, the kind of simulation is set according to the influent profile considered. However, the BSM1 model has to be initialised before performing the simulations. The initialisation process mainly consists of the stabilisation of the BSM1 reactor tanks by simulating a total of 100 days of constant influent ([8], Section 3). Once the model is stabilised, one can perform the desired simulation. From the 14 days simulated, only the last seven days, that is, day 7 to day 14, are considered in the performance computation ([8], Section 6). It is also worth noting that only the dry, rainy, and stormy influent profiles are considered in this work.

BSM1 also defines its own performance metrics, which ease the comparison among control strategies. They can be divided into two main categories: environmental metrics and control metrics. The environmental metrics are those showing the improvements achieved in terms of pollutant reduction when one control strategy is considered instead of another. Among the different metrics, the two most widely adopted ones are the Overall Cost Index (OCI) and the Effluent Quality Index (EQI). The OCI is related to the costs generated in the pollutant reduction process, while the EQI can be understood as a metric telling how clean the water is [8,30]. Nevertheless, we will focus on the control metrics, which are not related to environmental or pollutant aspects. In our case, we are going to consider the Integrated Absolute Error (*IAE*) and the Integrated Squared Error (*ISE*) between the measured variables and their corresponding set-points:

$$IAE = \int_{t=7^{\text{th}}\,\text{day}}^{t=14^{\text{th}}\,\text{day}} |r(t) - y(t)| \, dt \tag{1}$$

$$ISE = \int_{t=7^{\text{th}}\,\text{day}}^{t=14^{\text{th}}\,\text{day}} \left(r(t) - y(t)\right)^2 dt, \tag{2}$$

where *r*(*t*) corresponds to the desired set-point and *y*(*t*) to the measured concentration. In this case, *y*(*t*) = {*SNO*,2(*t*), *SO*,5(*t*)}. Notice that only the control metrics are considered, because this work is mainly focused on the adoption of transfer learning approaches and ANNs to ease and speed up the design and implementation of the control strategies.
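As a sanity check, Equations (1) and (2) can be evaluated numerically from sampled signals. The trapezoidal rule and the toy set-point/measurement signals below (a constant 2 mg/L set-point with a small daily ripple) are illustrative choices, not part of the benchmark:

```python
import numpy as np

def trapezoid(f, x):
    """Trapezoidal integration of samples f over the (possibly uneven) grid x."""
    return float(np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(x)))

def control_metrics(t, r, y, t_start=7.0, t_end=14.0):
    """IAE and ISE of Eqs. (1)-(2), evaluated over days 7-14 only."""
    mask = (t >= t_start) & (t <= t_end)
    e = r[mask] - y[mask]
    return trapezoid(np.abs(e), t[mask]), trapezoid(e**2, t[mask])

# Toy signals: 15-min sampling over the 14-day simulation horizon.
t = np.linspace(0.0, 14.0, 14 * 96 + 1)
r = np.full_like(t, 2.0)                 # constant 2 mg/L set-point
y = 2.0 + 0.1 * np.sin(2.0 * np.pi * t)  # measurement with a daily ripple
iae, ise = control_metrics(t, r, y)
```

For this ripple, the analytical values are IAE = 7 · 0.1 · 2/π ≈ 0.446 and ISE = 7 · 0.01/2 = 0.035, which the numerical evaluation reproduces; note that the first seven days are excluded, mirroring the BSM1 evaluation protocol.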

#### *2.2. Long Short-Term Memory Cells*

The ANN-based PI controller adopted in this work is mainly based on Long Short-Term Memory (LSTM) cells. They correspond to a type of gated network characterised by good performance when dealing with time-series signals ([11], Chapter 10). This is possible thanks to the gates that each LSTM cell implements: (i) three sigmoid activation layers, namely the input gate (**i**(*t*)), the forget gate (**f**(*t*)), and the output gate (**o**(*t*)), and (ii) one hyperbolic tangent layer, the state gate (**c̃**(*t*)) (see Figure 2).

**Figure 2.** LSTM cell internal structure.

In terms of data, the LSTM cell considers the input data (**x**(*t*)) and output data (**h**(*t*)) vectors. According to them, the forget gate determines the amount of cell state information that has to be deleted:

$$\mathbf{f}(t) = \sigma(\mathbf{W}_f \cdot \mathbf{x}(t) + \mathbf{U}_f \cdot \mathbf{h}(t-1) + \mathbf{b}_f). \tag{3}$$

Then, the input and state gates determine the new information to be stored in the cell state:

$$\mathbf{i}(t) = \sigma(\mathbf{W}_i \cdot \mathbf{x}(t) + \mathbf{U}_i \cdot \mathbf{h}(t-1) + \mathbf{b}_i) \tag{4}$$

$$\mathbf{\tilde{c}}(t) = \tanh(\mathbf{W}_c \cdot \mathbf{x}(t) + \mathbf{U}_c \cdot \mathbf{h}(t-1) + \mathbf{b}_c) \tag{5}$$

$$\mathbf{c}(t) = \mathbf{f}(t) \circ \mathbf{c}(t-1) + \mathbf{i}(t) \circ \mathbf{\tilde{c}}(t). \tag{6}$$

Finally, the output data of the LSTM cell is computed as a function of the input and previous output, as well as the outcome of the output gate:

$$\mathbf{o}(t) = \sigma(\mathbf{W}_o \cdot \mathbf{x}(t) + \mathbf{U}_o \cdot \mathbf{h}(t-1) + \mathbf{b}_o) \tag{7}$$

$$\mathbf{h}(t) = \mathbf{o}(t) \circ \tanh(\mathbf{c}(t)). \tag{8}$$

Notice that the **W** and **U** matrices are the weights of the different gates applied to the input and previous output data vectors, respectively, and the **b** vectors are the biases of the different gates. Finally, ◦ denotes the Hadamard (element-wise) product, and *σ* and tanh are the sigmoid and hyperbolic tangent activation functions, respectively. If more information about LSTM cells and their behaviour is required, readers are referred to ([11], Section 10.10).
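Equations (3)–(8) can be traced with a minimal NumPy forward pass over a single LSTM cell. The weights below are random and untrained, and the dimensions (three inputs, four hidden units) are purely illustrative of the algebra, not the controller sizes used in the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM forward step implementing Eqs. (3)-(8); W, U, b are dicts per gate."""
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])        # forget gate, Eq. (3)
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])        # input gate, Eq. (4)
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])  # state gate, Eq. (5)
    c = f * c_prev + i * c_tilde                              # cell state, Eq. (6)
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])        # output gate, Eq. (7)
    h = o * np.tanh(c)                                        # output, Eq. (8)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                       # e.g. x could hold set-point, measurement, error
W = {g: 0.1 * rng.standard_normal((n_hid, n_in)) for g in 'fico'}
U = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in 'fico'}
b = {g: np.zeros(n_hid) for g in 'fico'}

h, c = np.zeros(n_hid), np.zeros(n_hid)  # initial hidden and cell states
for x in rng.standard_normal((10, n_in)):  # feed a short input sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

Since **h**(*t*) = **o**(*t*) ◦ tanh(**c**(*t*)) with both factors bounded by 1 in magnitude, every component of the output stays strictly inside (−1, 1), which is why time-series signals fed to the cell are typically normalised.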

#### *2.3. Transfer Learning*

The main contributions of this work are focused on the adoption of Transfer Learning (TL) techniques to ease and speed up the control design in industrial environments, especially in WWTPs. In that sense, TL consists of transferring the knowledge obtained in the training process of one ANN structure into another one. For instance, TL techniques have been widely adopted in the design and implementation of image classifiers, among others [20]. One clear example is shown in ([21], Chapter 6), where the Inception model, a general-purpose image classifier, is adopted to develop a dog breed classifier. This new classifier is implemented with the Inception classifier without its last layer, plus three new convolutional layers connected to the output of the penultimate Inception layer. Therefore, the dog breed classification performance is derived from the Inception classification performance and a new retraining process where a set of dog breed pictures is considered ([21], Chapter 6). This shows that TL techniques not only obtain well-performing ANN models from a source model, but also speed up their design process, since the knowledge of the source model is shared with the new ones ([21], Chapter 4).
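The transfer-plus-fine-tuning mechanism can be sketched without any deep learning framework. In the toy below, with assumed tasks, learning rate, and epoch counts, a linear model (a stand-in for an ANN) is trained on a source task, its weights are reused to initialise a related target task, and a brief fine-tuning follows; training from scratch with the same short budget performs considerably worse:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit(X, y, w, lr=0.1, epochs=50):
    """Plain gradient descent on the mean squared error of a linear model."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

X = rng.standard_normal((200, 5))
w_source = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # source-domain relationship
w_target = w_source + 0.1                        # related target-domain relationship
y_source, y_target = X @ w_source, X @ w_target

w = fit(X, y_source, np.zeros(5))                # 1) full training in the source domain
w = fit(X, y_target, w, epochs=5)                # 2) transfer + short fine-tuning
loss_transfer = np.mean((X @ w - y_target) ** 2)

w_scratch = fit(X, y_target, np.zeros(5), epochs=5)   # same budget, no transfer
loss_scratch = np.mean((X @ w_scratch - y_target) ** 2)
```

Because the transferred weights start close to the target solution, a handful of fine-tuning epochs suffices, mirroring the time savings the TL-based design approach seeks for the WWTP control loops.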

In that sense, TL techniques can be categorised into three classes as a function of the data availability in the source and target domains or scenarios ([21] Chapter 4, [22]):


In our case, we are faced with an Inductive Transfer Learning task, since the source and target domains do not show a data scarcity problem. Here, the source domain consists of either the *SO*,5 control loop (DO control loop) or the *SNO*,2 control loop (NO control loop), depending on the base ANN-based controller being implemented. In that sense, if the DO control loop is considered as the source domain, the NO control loop will be the target domain, and vice versa.

#### *2.4. Modelling*

Two different tools have been considered in this work to implement and test the proposed Transfer Learning-based Control Design approach: Simulink and Python. Simulink was adopted because the BSM1 model is completely deployed over this simulator; Simulink version 10.1 running over Matlab R2020b was considered. Moreover, all the ANNs involved in the proposed approach are also deployed over the BSM1 model in order to test their behaviour; thus, they are also implemented in Simulink. In turn, the ANNs, and especially the LSTM cells, have been designed and trained using Python 3.6 with three open-source libraries and an NVIDIA GeForce RTX 2080 Titan GPU, which is used to speed up the LSTM training process:


#### **3. TL-Based Control Design**

As stated, one of the problems in industrial control is related to the conception and design of the control loop. Most of the time, the design of the controllers can become a tedious and time-consuming process, since one has to determine the topology of the controller to be used, as well as model the plant or process it is going to manage. In that sense, ANNs have arisen as a possible solution able to alleviate this: they only require pairs of input and output data of the process to be controlled [15]. However, this has its own drawbacks: ANNs have to be correctly trained and designed if good control performance is required. This can become a time-demanding and computationally expensive process if there are many control loops to design.

For that reason, and to alleviate this issue, we propose in this work the TL-based Control Design approach, which is focused on designing and implementing the control strategies of a general-purpose WWTP. In this case, the TL-based Control Design approach consists of two stages: (i) the LSTM-based controller, where the design and training of an ANN-based controller is carried out, and (ii) the Control Knowledge Transfer approach, where the transfer of the controller knowledge into the different industrial control loops is performed. The first stage is mainly based on designing an ANN able to manage the signals considered in the control of the industrial process. To achieve this, the proposed ANN-based controller predicts the corresponding actuation signal according to its input measurements, that is, the measured value and its set-point. In our case, the signals involved in the control loops correspond to either the *SO*,5 or the *SNO*,2 concentrations, and their respective actuation signals, *KLa*,5 or *Qa*. Besides, the ANN-based controller will be implemented with LSTM cells due to their good performance when dealing with time-series signals, such as the ones obtained from the BSM1 framework ([11] Section 10.10, [8]). The second stage is mainly focused on transferring the knowledge of the proposed LSTM-based controller into the other control loops. In this case, the LSTM-based controller is considered as the baseline strategy to be transferred (see Figure 3). Thus, the objective is to design only one LSTM-based controller instead of as many LSTM-based structures as there are control loops in the WWTP. Then, its knowledge will be transferred into the remaining loops.

It is also important to notice that this transfer approach can be adopted in any industrial scenario. However, there is a requirement that has to be fulfilled: the TL-based Control Design can only be applied among control loops sharing the same control objective. This is motivated by the fact that the ANN-based structure trained in the source control loop will learn how to generate an actuation signal from the controlled signals with the objective of performing a certain task, for instance, the tracking of a given set-point. Then, the knowledge of this ANN-based structure will be transferred into the target one, which should have the same objective. Otherwise, the target structure would generate actuation signals which do not fulfil the control objective. In the case of this work, the control objective is clear: both the DO and the NO control loops are designed to track the given set-points, regardless of the fact that the involved signals show different values and dynamics [8,25].

**Figure 3.** Graphical description of the TL-based Control Design approach. Notice that DO refers to the Dissolved Oxygen (*SO*,5) control loop, whilst NO refers to the nitrate and nitrite (*SNO*,2) control loop.

Once the knowledge of the LSTM-based control structure is transferred, the control performance of the LSTM-based controller can be adjusted through a fine-tuning process, which consists of retraining the LSTM-based structure. However, this fine-tuning process differs from the one performed in the usual applications of transfer learning, that is, the development of image classifiers. There, the data considered to carry out the fine-tuning process consist of a set of new images whose labels are intrinsically obtained from the images themselves. Returning to the dog breed classifier, the TL fine-tuning process is performed on the Inception structure with images of different dog breeds, where the labels are obviously clear. When dealing with industrial processes, the situation completely changes. The new measurements have to be obtained from the control loop into which the LSTM-based controller is going to be transferred. In addition, the knowledge about how to control this loop also has to be obtained. For that reason, the data required to perform the fine-tuning process have to be obtained by simulating the behaviour of the industrial process when an existing, conventional controller is applied. Otherwise, the LSTM-based controller will not be able to offer a good control performance.

In this manuscript, two LSTM-based controllers replicate the behaviour of the default WWTP PIs, since these are the controllers present in the BSM1 digital framework [8]. This is motivated by the fact that the LSTM-based PI controller can be obtained with data from the DO control loop (DO LSTM-based PI) and then transferred into the NO control loop, or it can be designed considering measurements from the NO control loop (NO LSTM-based PI) and then transferred into the DO loop. Of these two controllers, the one offering the best control performance in both control loops will be fine-tuned.
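The two stages described above can be outlined as follows. This is a minimal sketch under our own naming; the "training" calls are stubs standing in for the actual LSTM fitting, not the paper's implementation.

```python
# Hypothetical outline of the TL-based Control Design workflow (stubs only).
def train_baseline(loop_data):
    """Stage (i): train an LSTM-based PI on data from one control loop (stub)."""
    return {"source": loop_data["name"], "fine_tuned_for": None}

def transfer_and_finetune(controller, target_data):
    """Stage (ii): copy the controller, adapt the (de)normalisation stages,
    and fine-tune it on data from the target loop's existing PI (stub)."""
    adapted = dict(controller)
    adapted["fine_tuned_for"] = target_data["name"]
    return adapted

do_pi = train_baseline({"name": "DO"})                   # DO LSTM-based PI
ftdo_pi = transfer_and_finetune(do_pi, {"name": "NO"})   # FTDO LSTM-based PI
```

The point of the sketch is structural: only one baseline controller is trained from scratch, and every additional loop reuses it through the transfer step.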

#### *3.1. LSTM-Based PI*

As previously stated, the controller proposed in the TL-based Control Design approach consists of an LSTM-based controller which acts as a PI controller managing either the *SO*,5 or the *SNO*,2 concentration. Hence, two LSTM-based PI candidates are proposed, since there exist two control loops, the DO and the NO control loop. For that reason, we analyse the control performance of each one in order to determine which LSTM-based PI will be transferred and fine-tuned. The first LSTM-based PI controller corresponds to the DO LSTM-based PI, which is derived from the PI managing the *SO*,5 concentration, while the second one corresponds to the NO LSTM-based PI, derived from measurements of the default PI managing the *SNO*,2. Before designing and training the two LSTM-based PIs, one can guess which one will offer the best control performance. If the control performance of the default PI controllers is taken into account (see Figure 4), one can observe that the best PI corresponds to the one managing the *SO*,5, since it is able to maintain the *SO* concentration at the desired value (2 mg/L). On the other hand, the PI managing the *SNO*,2 is not able to maintain the desired set-point. Thereby, the control performance of each LSTM-based controller will be similar to that of the default PI controller from which its data were obtained. In other words, the better the conventional controller performance, the better the LSTM-based one.

**Figure 4.** Control performance when the default PI controllers are adopted. Notice that the worst performance is offered by the *SNO*,2 default PI controller.
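To illustrate the kind of input/output pairs the LSTM-based PI is trained on, the sketch below implements a generic discrete-time PI controller and logs (measurement, set-point, actuation) samples. The gains, sampling time, saturation limits, and the toy plant response are placeholders of our own, not the BSM1 values.

```python
# Hypothetical discrete-time PI controller used to generate
# (measurement, set-point) -> actuation training pairs.
class DiscretePI:
    def __init__(self, kp, ti, dt, u_min, u_max):
        self.kp, self.ti, self.dt = kp, ti, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        u = self.kp * (error + self.integral / self.ti)
        # Saturate to the actuator range (e.g., KLa,5 or Qa limits).
        return min(max(u, self.u_min), self.u_max)

# Logging loop: each step yields one training sample for the LSTM-based PI.
pi = DiscretePI(kp=25.0, ti=0.002, dt=1 / 96, u_min=0.0, u_max=240.0)
samples = []
measurement = 1.5
for _ in range(10):
    u = pi.step(setpoint=2.0, measurement=measurement)
    samples.append((measurement, 2.0, u))        # (S_O5, set-point, KLa5)
    measurement += 0.01 * (u - 100.0) / 100.0    # toy plant response
```

Replaying such a closed loop over a simulated year is what produces the supervised dataset: the PI's inputs become the net's inputs, and the PI's output becomes the regression target.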

The DO LSTM-based PI and the NO LSTM-based PI structures are obtained by means of a grid search method where different LSTM-based structures are trained with the same set of measurements. The grid search focuses on determining the number of LSTM cells, feedforward layers, and hidden neurons per layer of the LSTM-based structure. Then, the LSTM structure offering the best prediction performance without overfitting is the one considered as the main structure on which the LSTM-based PI is based. In that sense, the grid search is performed instead of finding the parameters characterising the PI controller, that is, the integral time and the proportional gain [36]. This means that deep knowledge of the process under control is not required; only pairs of input and output measurements of the existing default PI controllers are needed. To obtain them, a complete year of randomly distributed weather profiles has been simulated in order to achieve a good control performance regardless of the weather conditions. From all the available measurements, the input and output measurements of the LSTM-based PI controller are determined according to the control loop they will manage:


These measurements are the ones considered to carry out the grid search method devoted to determining the LSTM-based PI structures. Each one of these measurement sets is split into three different subsets: 70% of the measurements to train the different LSTM-based net configurations, 15% to validate them, and the remaining 15% to test the structures. The grid search process has been carried out adopting the Adam optimiser ([11], Sections 6.5 and 8.5.3) and a total of 500 epochs. The initial learning rate has been set to 1 × 10<sup>−3</sup>; however, it is reduced along the process. In addition, LSTM nets are also known to suffer from overfitting problems, where they memorise the input and output measurements instead of deriving a model from them ([11], Chapter 7). To avoid this problem, the L2 parameter regularisation technique and the early stopping method are considered. L2 parameter regularisation consists of adding an extra penalty to the weights of the corresponding layer ([11], Section 7.1.1). This extra penalty is controlled by the weight decay parameter, which in this case has been set to 5 × 10<sup>−4</sup>. On the other hand, early stopping is a technique which stops the training process when the validation performance changes its tendency with respect to the training one ([11], Section 7.8). Here, the important point corresponds to the early patience, which determines the number of epochs for which this change of tendency is allowed. In this work, we consider an early patience of five epochs, understanding an epoch as a complete pass over the training dataset ([37], Chapter 2). Both LSTM-based controllers consider the same LSTM-based structure (see Figure 5), which mainly consists of two LSTM cells devoted to extracting information from the time correlation between measurements and two feedforward layers which transform this information into the desired output.
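The 70/15/15 split and the patience-based early-stopping rule just described can be sketched as follows; the helper names are ours and the validation losses below are synthetic, not from the BSM1 experiments.

```python
def split_dataset(data, train=0.70, val=0.15):
    """70/15/15 chronological split into train/validation/test subsets."""
    n = len(data)
    n_train, n_val = int(n * train), int(n * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

def early_stop_epoch(val_losses, patience=5):
    """Return the epoch at which training stops: the first time the
    validation loss has failed to improve for `patience` epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1

train, val, test = split_dataset(list(range(100)))
# Loss improves up to epoch 2, then stagnates: training halts 5 epochs later.
stop = early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6])
```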
Moreover, each structure considers Normalisation and Denormalisation stages, in charge of normalising the input measurements towards zero mean and unit variance and of mapping them back to their natural range, respectively. These two stages are needed since the ranges of the measurements involved in the control loops are quite different: the means of the measurements involved in the DO control loop equal 1.9752 and 144.68 for the *SO*,5 and the *KLa*,5, respectively, while in the case of the NO loop, the mean values of the variables involved in the control equal 0.9937 and 2.1802 × 10<sup>4</sup> for the *SNO*,2 and *Qa* measurements. As a summary, the DO LSTM-based and NO LSTM-based structures are as follows:

	- **–** Input measurements: the dissolved oxygen in the fifth reactor tank (*SO*,5(*t*)) and its desired set-point (*SO*,5*set*−*point*(*t*)). Besides, the DO LSTM-based net considers the Nonlinear Autoregressive Exogenous (NARX) principle, where the output predicted by the net is considered as an extra input. This extra input provides the LSTM-based structure with information about its performance in the prediction process [38]; thus, it is able to correct its predictions as a function of this extra input. In this case, the extra input corresponds to the previously computed actuation signal (*KLa*,5(*t* − 1)).
	- **–** Normalisation Stage: stage devoted to normalising the input measurements towards zero mean and unity variance.
	- **–** Input measurements: the nitrate and nitrite nitrogen in the second reactor tank (*SNO*,2(*t*)) and its desired set-point (*SNO*,2*set*−*point*(*t*)). As with the DO LSTM-based PI, the NO LSTM-based controller also considers the NARX principle. In this case, the extra input corresponds to the previously computed actuation signal (*Qa*(*t* − 1)).
	- **–** Normalisation Stage: stage devoted to normalising the input measurements towards zero mean and unity variance.
	- **–** LSTM-based Net: main part of the LSTM-based Controller. It consists of two LSTM cells with 100 and 50 hidden neurons and two feedforward layers with 50 and 25 hidden neurons, respectively.
	- **–** Denormalisation Stage: stage devoted to denormalising the actuation signal (the LSTM-based Net output) towards its real range of values.
	- **–** Output: the actuation signal which corresponds to the WWTP internal recirculation flow rate (*Qa*(*t*)).
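The Normalisation and Denormalisation stages listed above amount to a standard z-score transform and its inverse. A minimal sketch follows; the paper only reports the signal means (e.g., 1.9752 for *SO*,5), so the standard deviation used here is an assumed, illustrative value.

```python
# Sketch of the Normalisation / Denormalisation stages (our helper names).
def normalise(x, mean, std):
    """Map a raw measurement towards zero mean and unit variance."""
    return (x - mean) / std

def denormalise(z, mean, std):
    """Map a normalised value back to its natural range."""
    return z * std + mean

# Round trip on an S_O5 reading (mg/L); std=0.25 is an assumed value.
raw = 2.0
z = normalise(raw, mean=1.9752, std=0.25)
back = denormalise(z, mean=1.9752, std=0.25)
```

Because the DO and NO signals live on such different scales (order 1 mg/L versus order 10<sup>4</sup> m<sup>3</sup>/d for *Qa*), swapping only these two stages is what makes the same net transferable between loops.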

**Figure 5.** LSTM-based net considered in the LSTM-based Controller. *l* corresponds to the number of inputs, which in this case is set to three measurements: the measured concentration of interest, *SO*,5(*t*) or *SNO*,2(*t*), its set-point, *SO*,5*set*−*point*(*t*) or *SNO*,2*set*−*point*(*t*), and the actuation variable, *KLa*,5(*t* − 1) or *Qa*(*t* − 1).

The prediction performance of both structures has been computed in terms of the difference between the predicted actuation variables and the expected ones (remember that the DO LSTM-based PI predicts the *KLa*,5, whereas the NO one predicts the *Qa*). Five metrics are adopted: the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), the determination coefficient (*R*2), and the training time [39]. The RMSE and the MAE indicate whether the predictions are close to the expected measurements or not. However, they are absolute metrics, in the sense that they do not tell us how large these errors are relative to the signal. For that reason, we consider the MAPE, which compares the errors with the expected value. *R*<sup>2</sup> is considered to determine the correlation between the predicted and expected measurements. Finally, the training time is considered to determine the time required to train a single network. Notice that all the prediction metrics are computed considering normalised values, with the exception of the MAPE (in order to avoid divisions by zero) and the training time. In that sense, the results show that the proposed LSTM-based structures are able to offer a good prediction performance (see Table 1), since both structures yield low RMSE, MAE, and MAPE values while offering an *R*<sup>2</sup> nearly equal to 1. Therefore, it is corroborated that these structures can be used to implement PI controllers which are mainly based on data.
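For reference, the four prediction metrics above can be computed as in the following sketch (standard textbook definitions; the sample values are illustrative, not the paper's data):

```python
import math

def prediction_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (%), and R^2 between expected and predicted signals."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE assumes no y_true sample is exactly zero (hence the paper's
    # choice to compute it on raw rather than normalised values).
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, mape, r2

# Perfect predictions give zero errors and R^2 = 1.
rmse, mae, mape, r2 = prediction_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```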



#### *3.2. Control Knowledge Transfer Approach*

The Control Knowledge Transfer approach corresponds to the stage of the TL-based Control Design devoted to transferring the knowledge of the LSTM-based PI structure of one WWTP control loop into the other. The adoption of this stage is motivated by the fact that it eases the controller design and speeds up the implementation process.

In this manuscript, three different TL approaches are considered to achieve the transfer of the control knowledge between control loops. Two of them are considered to determine which controller, the DO or the NO LSTM-based PI, has to be transferred and then fine-tuned. The third approach mainly consists of adopting the controller showing the best performance in the source and target domains and fine-tuning it to adapt its behaviour to the dynamics of the target domain, i.e., the control loop into which it has been transferred. In summary, the three considered control approaches are:

• Transfer Learning from DO to NO

The DO LSTM-based PI structure is transferred directly from the DO to the NO control loop. Here, it is important to notice that the structure is not fine-tuned, that is, the LSTM-based PI controller has only been trained to manage the *SO*,5 concentration. Only the normalisation and denormalisation stages are adapted to the NO control loop measurements.

• Transfer Learning from NO to DO

The NO LSTM-based PI structure is directly transferred from the NO to the DO control loop without performing any change, neither in its structure nor in its weights and biases. Thus, the knowledge on how to manage the *SNO*,2 concentration is transferred into the DO control loop. The only change performed in this transfer approach corresponds to the normalisation and denormalisation stages, which have been adapted to normalise and denormalise the measurements coming from the NO control loop instead of the DO control loop. Following this, the NO LSTM-based PI will perform, at best, similarly to the default PI managing the *SNO*,2 concentration, that is, the NO control loop PI. If Figure 4 is taken into account, one can anticipate that the NO LSTM-based PI controller will not offer as good a control performance as the DO LSTM-based PI derived from the DO control loop.

• LSTM-based controller Fine-tuning & Transfer

This transfer approach is the most important one, since it is the transfer method performing a fine-tuning process and therefore adapting the behaviour of the transferred controller to the target domain dynamics. In other words, in this transfer learning approach, the LSTM-based controller yielding the best control performance between the Transfer Learning from DO to NO and the Transfer Learning from NO to DO approaches is considered as the candidate to be fine-tuned. Results in Section 4 show that the best control performance is offered by the DO LSTM-based PI. For that reason, this is the LSTM-based controller considered in the fine-tuning process. Nevertheless, this choice can be made at the very beginning if the performance offered by the conventional PI structures is considered (see Figure 4).

In terms of the three TL classes, the LSTM-based controller Fine-tuning and Transfer approach corresponds to an inductive transfer learning task: data from the source domain, the DO control loop, are first considered to obtain the DO LSTM-based PI structure. Then, it is fine-tuned (retrained) with data coming from the PI controlling the target domain, the NO control loop. In other words, the default *SNO*,2 controller, whose performance is shown in Figure 4a, has been considered to perform the fine-tuning of the DO LSTM-based PI controller. Thus, the obtained controller, the fine-tuned DO LSTM-based PI (FTDO LSTM-based PI), knows how to correctly manage the desired variable, but adapted to the NO control loop. This clearly shows that an existing controller managing the target control loop is compulsory to obtain the measurements considered in the fine-tuning process. This differs from traditional and conventional TL applications, where labelled data are readily available.

The main point here is that in the fine-tuning process not all the layers of the DO LSTM-based PI controller are retrained with measurements of the target domain: the weights of the two LSTM cells are blocked, whilst the weights and biases of the two feedforward layers (see Figure 5) are modified in the fine-tuning process. The LSTM cells are blocked since they are the layers gathering the information about the time-dependence between measurements. The feedforward layers mainly take this information to adapt the output of the controller to the desired control loop. For that reason, these are the layers which are retrained, just to adapt the outcomes of the LSTM layers to the new domain.
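In a typical deep-learning framework, this selective retraining amounts to freezing the LSTM parameters and leaving only the feedforward ones trainable. A framework-agnostic sketch follows; the layer names and the parameter-registry layout are ours, not the paper's.

```python
# Toy parameter registry standing in for a trained DO LSTM-based PI.
params = {
    "lstm1.weights": {"trainable": True},
    "lstm2.weights": {"trainable": True},
    "ff1.weights":   {"trainable": True},
    "ff2.weights":   {"trainable": True},
}

def freeze_lstm_layers(params):
    """Block the LSTM cells so fine-tuning only updates the feedforward
    layers; return the names of the parameters that remain trainable."""
    for name, p in params.items():
        if name.startswith("lstm"):
            p["trainable"] = False
    return [n for n, p in params.items() if p["trainable"]]

retrained = freeze_lstm_layers(params)
```

In PyTorch the same effect is obtained by setting `requires_grad = False` on the LSTM parameters; in Keras, by setting `layer.trainable = False` before recompiling.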

The measurements of the target domain are again obtained by performing a whole-year simulation of the BSM1 behaviour with the three weather profiles, dry, rainy, and stormy, randomly distributed. The weights and biases of the two retrained feedforward layers are obtained considering the same training parameters as in the DO LSTM-based PI training process: an initial learning rate equal to 1 × 10<sup>−3</sup>, a weight decay equal to 5 × 10<sup>−4</sup>, and an early patience of 5 epochs.

#### **4. Results**

#### *4.1. TL-Based Control Design Results*

The performance of the TL-based Control Design approach is determined by analysing the control performance of each one of the proposed TL approaches: (i) the Transfer Learning from DO to NO, (ii) the Transfer Learning from NO to DO, and (iii) the LSTM-based controller Fine-tuning and Transfer. In that sense, the first two results determine which controller, the DO LSTM-based PI or the NO LSTM-based PI, performs better in both control loops when no fine-tuning process is carried out. Then, the one performing better is fine-tuned and its control performance is computed in the last TL approach. Results will show which is the best option not only to obtain a complete and good control approach mainly based on data, but also to speed up the design process of the complete WWTP control strategy.

The control performance has been computed in terms of fixed and variable set-points in order to determine whether the TL-based Control Design approach is suitable for both types. Fixed set-points are considered since the default control strategy uses them to ensure that the nitrification and denitrification processes, the ones performing the pollutant reduction task, are correctly performed [8,27]. They have been set to 2 mg/L and 1 mg/L for the *SO*,5 and *SNO*,2 control loops, respectively. Notwithstanding, variable set-points are the ones of most interest, since most of the time the set-points are computed by means of other control strategies or are varied in order to optimise the pollutant reduction process [10,14,40,41]. In this case, the variable DO set-point has been computed according to the Fuzzy Logic approach adopted in [10], where the Fuzzy controller determines the *SO*,5 set-point generating the lowest *SNH*,5. Moreover, the three different BSM1 weather profiles have also been simulated to determine whether the control design approach can be adopted regardless of the weather conditions.
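The *IAE* and *ISE* used throughout the results are the integrals of the absolute and squared tracking error. A rectangular-rule sketch is given below; the error samples and the sampling interval are illustrative, not BSM1 outputs.

```python
def iae_ise(errors, dt):
    """Integral of Absolute Error and Integral of Squared Error,
    approximated by the rectangular rule over samples spaced dt apart."""
    iae = sum(abs(e) for e in errors) * dt
    ise = sum(e * e for e in errors) * dt
    return iae, ise

# Example: a short S_O5 tracking-error sequence with unit sample spacing.
iae, ise = iae_ise([0.1, -0.2, 0.05, 0.0, -0.1], dt=1.0)
```

Note how the squaring makes the *ISE* dominated by the largest errors, whereas the *IAE* weights a sustained small offset just as heavily; this is the distinction the paper relies on when the two metrics disagree.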

#### *4.2. Transfer Learning from DO to NO*

The first computed control performance corresponds to the situation where the DO LSTM-based PI is obtained with data from the DO control loop and then transferred into the NO control loop without performing any fine-tuning process. Results are shown in Table 2, where the first important effect one can notice is that the control performance in the DO control loop, that is, in the management of *SO*,5, is even better than the control offered by the default PI. This effect is motivated by two factors: (i) the fact that the DO LSTM-based PI has been trained through the simulation of the control strategy when random variations in the set-point are provided, and (ii) the NARX principle, which provides the LSTM-based structure with information about the previously predicted outcomes. Thus, the LSTM-based structure has learnt how to correct variations present either in the set-point or in the measured concentration.


**Table 2.** Control performance when the DO LSTM-based PI derived from the DO control loop is transferred into the NO control loop.

In terms of the *IAE* and *ISE* metrics, one can observe that they are improved with respect to the default PI control performance when a fixed set-point is considered. In addition, these improvements are achieved regardless of the weather profile: the *IAE* and *ISE* values are improved by around 95.98% and 99.84% on average with respect to the default PI controller, respectively. For instance, the highest *IAE* improvement is achieved when the stormy influent profile is simulated. The *IAE* offered by the default *SO*,5 PI controller equals 0.158, while it is reduced to 0.006 when the DO LSTM-based PI is considered. This entails that the difference between the measured *SO*,5 and its set-point is minimal. In terms of the *ISE*, the highest improvement is obtained when the dry weather is simulated, with an improvement of 99.86% with respect to the default PI controller. Notwithstanding, this improvement equals 99.84% and 99.82% when the rainy and stormy weathers are considered, respectively. These results show that the performance of the DO control loop is highly improved when a fixed set-point is considered. However, the important results are the ones obtained when a variable set-point is considered, since this corresponds to the most frequently adopted set-point topology.

In such a context, the same effect is observed when a variable *SO*,5 set-point is considered. In this case, the average improvement equals 91.67% for the *IAE* and 97.77% for the *ISE*. Now, the highest improvement is achieved when the dry weather is considered: the *IAE* and the *ISE* are improved by 92.97% and 98.54% with respect to the default PI controller performance. This is motivated by the fact that the rainy and stormy influents are derived from the dry weather profile, into which the rainy and stormy episodes are inserted. For that reason, the LSTM-based structure has observed the effects of the PI controlling the *SO*,5 more often under dry episodes than under stormy or rainy ones. In addition, the control performance clearly shows that the DO LSTM-based PI controller can be adopted as the main controller in the DO control loop (see Figure 6). As observed, the output of the controller is much closer to the given set-point of 2 mg/L than the default PI output.

**Figure 6.** Control performance for the DO control loop when the LSTM-based PI is considered.

However, the most important point is to determine the control performance of the NO control loop, since in this case the DO LSTM-based PI is directly transferred into the NO control loop. The only changes performed in the control structure correspond to the normalisation and denormalisation stages, which have been adapted to the range of values involved in the control of *SNO*,2. Results show that the control performance can be improved by, on average, 33.07% and 42.94% in the case of the *IAE* and the *ISE*, respectively, when the DO LSTM-based PI controller is managing the *SNO*,2 and a fixed set-point is considered. For instance, the highest improvement with respect to the default PI structure is achieved when the stormy weather is considered: the *IAE* and *ISE* obtained in that situation equal 1.033 and 0.357, respectively, corresponding to improvements of 44.88% and 63.46%. At the same time, this represents a reduction of the *IAE* and *ISE* improvements of 51.32 and 36.36 percentage points with respect to the improvement achieved when the DO LSTM-based PI is managing the *SO*,5. This is clearly motivated by the fact that the DO LSTM-based PI is designed to offer its best performance when managing the DO control loop. When a variable *SO*,5 set-point is chosen, one can observe that the average improvements in the NO control loop in terms of the *IAE* and *ISE* are equal to 26.19% and 37.27%, respectively, with the dry weather showing the highest improvement (see Figure 7): the *IAE* goes from 1.792 to 1.271, while the *ISE* goes from 0.858 to 0.503.

**Figure 7.** Control performance for the NO control loop when the LSTM-based PI derived from the DO control loop is transferred into it.

These results show that the DO LSTM-based PI controller is able to improve on the default PI controllers' performance. For that reason, it is considered as a candidate to be fine-tuned in order to adapt its behaviour to the *SNO*,2 control management and, therefore, achieve a greater improvement in the management of this loop.

#### *4.3. Transfer Learning from NO to DO*

Before performing the fine-tuning process, the control performance of the NO LSTM-based PI is also computed to determine its behaviour when managing the NO control loop (its source domain) and when managing the DO loop (its target domain). Results are shown in Table 3, where at first sight it is clearly observed that the *IAE* and *ISE* metrics are improved with respect to the default *SNO*,2 PI controller. When a fixed *SO*,5 set-point is considered, the NO control loop *IAE* is improved on average by 24.32%, while the corresponding *ISE* is improved by around 39.03% on average, both with respect to the default NO control loop PI controller. The *ISE* improvement shows that the proposed NO LSTM-based PI controller, which has been derived from the NO control loop, is able to reduce the largest errors between the measured *SNO*,2 and its set-point with respect to the default PI controller. However, the control performance can still be improved, since the improvement achieved in terms of the *IAE* is still low. For instance, the best improvement is observed when the stormy weather is considered: the obtained *IAE* goes from 1.874 to 1.360, whereas the *ISE* goes from 0.977 to 0.543. These values represent improvements of around 27.43% and 44.42% when the obtained *IAE* and *ISE* values are compared to the default PI control metrics. In terms of the *SO*,5 control performance, the transferred NO LSTM-based PI shows that the *IAE* is degraded instead of improved. For instance, when the NO LSTM-based PI is adopted, the *IAE* increases from 0.148 to 0.158 when the dry weather is considered. This effect is motivated by the fact that the default PI of the NO control loop does not offer as good a control performance as the default PI of the DO control loop. Thus, the control performance will not be improved if data from the NO control loop are used to derive the NO LSTM-based PI, which is then transferred into the DO control loop.

**Table 3.** Control performance when the NO LSTM-based PI derived from the NO control loop is transferred into the DO control loop.


Visually, one can observe that the *SNO*,2 control performance is slightly improved with respect to the default PI (see Figure 8). The peaks of *SNO*,2 concentration are reduced; however, the desired set-point is not achieved. In terms of the *SO*,5, the control performance is even slightly degraded with respect to the default PI controller. As can be observed, the measured *SO*,5 does not show variations as it does with the default PI controller; however, there exists an offset which produces the *IAE* increment. Even so, the *ISE* metric in terms of the *SO*,5 is still reduced: it now equals 0.004 on average instead of 0.007. Notice that the *ISE* indicates whether there exist large differences between the measured and the desired concentration, whereas the *IAE* indicates whether a difference is maintained over time.

When a variable set-point is considered, one can observe that the control performance is only improved in terms of the NO control loop. The *IAE* and *ISE* metrics are improved on average, with respect to the default PI controller, by 27.56% and 42.94%, respectively. In terms of the DO control loop performance, results show that transferring the NO LSTM-based PI controller derived from the NO loop into the DO loop is not an option, since all the control metrics are degraded. For instance, the *IAE* and *ISE* metrics are nearly doubled with respect to the default PI controller when the stormy weather profile is simulated. These results entail that the NO LSTM-based PI cannot be considered as a candidate to be fine-tuned, since it does not improve the control performance of the target domain, while at the same time the improvement achieved in the source domain is much lower than the one achieved by the DO LSTM-based PI. In addition, this also corroborates one of the main ideas stated before: the better the conventional control performance, the better the LSTM-based one. For that reason, the DO LSTM-based PI is the one considered for the fine-tuning process. It is important to notice that the initial training of both structures, the DO LSTM-based PI and the NO LSTM-based PI, is not compulsory. In Figure 4b it is clearly observed that the control loop offering the best performance corresponds to the PI managing the DO control loop. Hence, the DO LSTM-based PI can be adopted and trained from the start, then transferred into the NO control loop and fine-tuned. As a consequence, there is no need to train or even implement the NO LSTM-based PI.

**Figure 8.** Control performance for the NO and DO control loops when the stormy weather is considered. The LSTM-based PI managing the DO control loop is derived from the NO control loop and transferred into the DO one.

#### *4.4. LSTM-Based Controller Fine-Tuning & Transfer*

Once the control performance of the DO and NO LSTM-based PI controllers is computed, one can clearly observe that the DO LSTM-based PI is the controller offering the best control performance in both control loops. For that reason, the fine-tuning of the DO LSTM-based PI controller is proposed. To perform this task, data coming from the default *SNO*,2 PI controller are considered. In that sense, information about how to control and manage the *SNO*,2 concentration is provided to the DO LSTM-based PI. Thus, the fine-tuned version of the controller, the FTDO LSTM-based PI, should be able to achieve a better control performance in terms of the *SNO*,2 management process.

Now, the prediction performance of the FTDO LSTM-based PI equals to a RMSE of 0.095 mg/L, a MAE of 0.067 mg/L, a MAPE of 6.24% and a *R*<sup>2</sup> of 0.991. Its training time equals to 20.27 s. At first sight one can observe that prediction performance is degraded with respect to the DO and NO LSTM-based PI controllers. However, this degradation is motivated by the fact that the proposed FTDO LSTM-based PI controller has learnt how to correctly manage the *SO*,5 and *SNO*,2 concentration instead of a unique one. In addition, the training time in this occasion equals to 20.27 s, which means that the time spent in the fine-tuning process is largely reduced with respect to training the LSTM-based structure from scratch. This effect is motivated by the information already present in the LSTM structure, that is, the weights and biases of the blocked LSTM cells. This corroborates that TL techniques can be adopted to simplify and speed up the control design process. Let's suppose that instead of transferring the knowledge of the DO LSTM-based PI into the NO control loop and performing a fine-tuning process, we decide to control each loop with its corresponding LSTM-based PI structure. The amount of time devoted to training the networks correspond to 69.91 and 98.60 s for the DO and NO control loops, respectively. This equals to a total time of 168.51 s only in terms of the training time. Although this time is affordable, if the DO LSTM-based PI is transferred into the NO control loop, only 69.91 s plus the time spent in the fine-tuning process, no more than 21 s is required. Thus, the total amount of time invested in the design process equals to 90.18 s, which represents a reduction of 78.33 s with respect to training two individual nets. Therefore, the reduction of the training time is clearly observed. In addition, it is important to notice that the WWTP

we are dealing with only considers two control loops; the time reduction will be higher in situations where the number of control loops to design is larger. In that sense, an estimation of the training time reduction can be made. If we assume that the training time of each baseline LSTM-based PI (trained from scratch) corresponds to *tbaseline* and that the time spent in the fine-tuning process equals *tft* on average, the time reduction (Δ*t*) provided by our approach can be computed as:

$$
\Delta t = N \cdot t\_{\text{baseline}} - \left[t\_{\text{baseline}} + (N-1) \cdot t\_{ft}\right] = (N-1)\left[t\_{\text{baseline}} - t\_{ft}\right], \tag{9}
$$

where *tft* < *tbaseline* and *N* is the total number of control loops, that is, the baseline one plus the *N* − 1 loops into which the baseline LSTM-based PI is transferred. As observed, the higher the number of control loops to design, the higher the reduction of time and the higher the benefit of the proposed methodology. Moreover, this methodology can also be applied in situations where the control of a new WWTP scenario has to be designed. In such a context, the new control structure can be derived by transferring the knowledge of the control structure of an already controlled WWTP. This would involve an even higher reduction of the complexity and time required to develop the control strategy. All these facts motivate us to consider TL methods in the design of the WWTP control loops.
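Equation (9) can be sketched numerically. The following snippet is illustrative (not the authors' code) and plugs in the training times reported in the text; note that Equation (9) assumes all baselines would cost the same *tbaseline*, so it yields a conservative estimate compared with the actual 78.33 s saving reported above:

```python
# Training-time reduction of Eq. (9):
#   delta_t = N*t_baseline - [t_baseline + (N-1)*t_ft] = (N-1)*(t_baseline - t_ft)

def time_reduction(n_loops, t_baseline, t_ft):
    """Time saved by fine-tuning N-1 transferred nets instead of training N from scratch."""
    return n_loops * t_baseline - (t_baseline + (n_loops - 1) * t_ft)

# Two loops: DO baseline trained from scratch (69.91 s), NO loop fine-tuned (20.27 s).
delta = time_reduction(n_loops=2, t_baseline=69.91, t_ft=20.27)
print(round(delta, 2))  # -> 49.64 (seconds saved under the equal-baseline assumption)
```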

In terms of the control performance, the results of the FTDO LSTM-based PI control are shown in Table 4, where the *IAE* and *ISE* values are computed for different weather profiles and set-points. It is worth noticing that the *SNO*,2 concentration is now managed by the fine-tuned and transferred DO LSTM-based PI, that is, the FTDO LSTM-based PI, whereas the *SO*,5 concentration is managed by the DO LSTM-based PI.
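The two tracking metrics reported in Table 4 are the integral absolute error, *IAE* = ∫|e(t)|dt, and the integral squared error, *ISE* = ∫e(t)²dt. A minimal numerical sketch (signal names and sampling are illustrative, not taken from the BSM1 setup):

```python
import numpy as np

def iae_ise(setpoint, measured, t):
    """IAE and ISE via a Riemann sum on a uniform time grid."""
    e = np.asarray(setpoint, float) - np.asarray(measured, float)
    dt = t[1] - t[0]
    return np.sum(np.abs(e)) * dt, np.sum(e**2) * dt

t = np.linspace(0.0, 14.0, 1345)             # 14 days, ~15 min sampling (toy grid)
sp = np.full_like(t, 2.0)                    # fixed set-point of 2 mg/L
meas = sp + 0.1 * np.sin(2 * np.pi * t)      # toy measurement with a periodic error
iae, ise = iae_ise(sp, meas, t)              # small positive values for a good tracker
```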


**Table 4.** Control performance when the DO LSTM-based PI derived from the DO control loop is transferred into the NO control loop. Then, the NO controller is fine-tuned with data from the default PI controller managing the *SNO*,2.

When a fixed set-point is considered, one can observe that the control performance is hugely improved not only in terms of the *SO*,5, but also in terms of the *SNO*,2. The improvement of the DO control loop with respect to the default PI controller translates into an average reduction of the *IAE* of around 95.94% and an average reduction of the *ISE* of around 99.78%. This means a better tracking of the *SO*,5 and, consequently, a better management of this concentration. In terms of the NO control loop, one can observe that the *IAE* and *ISE* are hugely improved as well, with the exception of the rainy weather. In this case, the *SNO*,2 *IAE* and *ISE* are only improved by 40.17% and 34.27%, respectively. This is motivated by the fact that the rainy weather profile shows two large perturbations during days 9 and 11. Besides, the fine-tuning process is performed with measurements obtained from the *SNO*,2 default PI controller when a whole year of randomly distributed weathers is simulated. Thus, most of the knowledge provided to the DO LSTM-based PI consists of the control actuations managing the *SNO*,2 concentration under dry weather (recall that the rainy and stormy weathers are equal to the dry weather with the exception of the two rainy and the two stormy episodes). On average, the NO control loop *IAE* and *ISE* are reduced by 73.47% and 72.84% with respect to the default *SNO*,2 PI control performance. The greatest improvement is observed when the dry weather is considered: the *IAE* is reduced from 1.594 to 0.091, whereas the *ISE* is decreased from 0.691 to 0.002 (see Figure 9). In the case of the rainy weather, the reduction of the *IAE* and *ISE* is lower: the *IAE* changes from 1.922 to 1.150 and the *ISE* from 0.951 to 0.625. Nevertheless, this *IAE* value corresponds to the lowest NO control loop *IAE* of the three TL approaches considered in this work.

**Figure 9.** Control performance for the NO and DO control loops when a *SO*,5 fixed set-point and dry weather are considered. The LSTM-based PI managing the NO loop is transferred from the DO control loop and fine-tuned with data from the NO control loop.

The control performance results when a variable set-point is considered show the same tendency as the fixed set-point ones. The *IAE* and *ISE* metrics are improved for all the weather profiles. Again, the most important results are the ones corresponding to the NO control loop, which is the controller whose control performance improvement is sought with the fine-tuning process. In that sense, the best improvement is now observed when the dry weather is simulated. The *IAE* is decreased from 1.792 to 0.129, which equals an improvement of 92.80%. The *ISE* is decreased from 0.858 to 0.004, which represents an improvement of 99.53%. It is important to notice that the lowest control performance is obtained when the rainy weather is considered: the *IAE* decreases from 2.132 to 0.643 while the *ISE* is reduced from 1.089 to 0.261. Although these improvements are not as high as the ones achieved with the dry weather, they are still much better than the performance obtained when the fine-tuning process is not carried out; for instance, the *IAE* is improved by 69.84% whilst the *ISE* is improved by 76.03%. The *IAE* improvement represents an increase of 48.26 and 42.63 percentage points with respect to the improvements achieved when transferring from DO to NO and from NO to DO, respectively. In terms of the *ISE*, these increments equal 45.64 and 36.82 percentage points, respectively. Visually, we can observe in Figure 10 that the *SNO*,2 desired value of 1 mg/L is obtained at the same time as the *SO*,5 variable set-point is correctly tracked. In addition, the rainy episodes are plotted to show that the FTDO LSTM-based PI controller requires some more knowledge to finally learn how to manage these events.

**Figure 10.** Control performance for the NO and DO control loops when a *SO*,5 variable set-point and rainy weather are considered. The LSTM-based PI managing the NO loop is transferred from the DO control loop and fine-tuned with data from the NO control loop.

In summary, the control performance is improved in all respects regardless of the set-point topology and the weather profile. This entails that the best option to design or improve the control strategy of an industrial plant, and especially a WWTP, is to obtain a first baseline controller, the DO LSTM-based PI, and then transfer its knowledge to the rest of the control loops. The main point is to design the baseline controller with data coming from the best-performing controller; in our case, this corresponds to the *SO*,5 default PI controller. Then, the obtained DO LSTM-based PI is transferred into the remaining control loops and fine-tuned with data coming from controllers actuating in the target domain. Moreover, this approach entails that control loops can be designed without requiring deep knowledge of the different processes carried out in the plant: only input and output measurements of a well-performing control strategy are required, and the rest of the control loops are derived from the implemented one. Thus, the higher the number of control loops, the higher the benefit offered by this design approach. In our case, this benefit is not fully exploited, since we have only transferred the LSTM-based PI between two control loops. However, this approach can be adopted in other scenarios where the number of control loops is much higher than here, in which case the benefit should be correspondingly larger.

Finally, the results observed in this manuscript motivate us to open a new research line where the transfer learning approach presented here is considered as the initial step of a reinforcement-learning-based control design. Then, instead of performing the fine-tuning process with measurements coming from a conventional control approach, it could be performed following a reinforcement learning process. In that sense, the controller would be fine-tuned over time, adapting its output to the incoming measurements.

#### **5. Conclusions**

In this work, we presented a new industrial control design process which involves the application of LSTM-based neural networks and transfer learning approaches. The main purpose was to design and implement the control loops managing a general-purpose WWTP. The application is specific; however, the design approach can be adopted in any kind of industrial environment. The main idea is that TL techniques allow us to derive new control strategies from a baseline one without performing a deep tuning process of each control structure. This reduces the control design complexity, as well as the time invested in the training process of each data-based control structure. Thus, the higher the number of control loops, the higher the improvement achieved.

In our case, the proposed control design approach consists of two main processes: (i) the design of a neural-network-based controller with data obtained from an existing control loop and (ii) the transfer of the controller knowledge into the remaining loops. To achieve this, three different design approaches were proposed, two of them consisting in the design of the LSTM-based controller with data from one control loop and its transfer to the others without retraining the net structure. In that sense, we considered the development of the LSTM-based PI controller either with measurements from the *SO*,5 control loop or from the *SNO*,2 one, both from a general-purpose WWTP. The third option considers the development of the LSTM-based PI with data from the *SO*,5 control loop and the fine-tuning of its transferred version. Results show that there exists a trade-off between deriving the LSTM-based PI with measurements from the *SNO*,2 or the *SO*,5 control loops. If the LSTM-based PI controller is derived with measurements from the *SO*,5 control loop, the *SO*,5 control performance is highly improved with respect to the default PI controller regardless of the weather influent and the considered set-point. In addition, the *SNO*,2 control performance experiences a slight improvement as well. On the other hand, the NO control performance is improved at the expense of degrading the *SO*,5 control performance when the LSTM-based PI transferred into the DO control loop is implemented with measurements from the NO control loop. To solve this trade-off, we considered the third option, where the LSTM-based PI derived from the *SO*,5 control loop was adopted and transferred into the NO control loop. Its transferred version was fine-tuned with measurements coming from the default PI controller managing the *SNO*,2 control loop.

Results show that a high improvement is achieved in the *SO*,5 control loop as well as in the *SNO*,2 one. Besides, the lowest improvements in terms of the *SNO*,2 when compared to the default *SNO*,2 PI controller equalled 69.84% and 76.03% for the *IAE* and *ISE*, respectively, which are still higher improvements than in the cases where the fine-tuning process is not considered. This clearly shows that designing an LSTM-based PI in one control loop, transferring it to a different one, and then performing a fine-tuning process is the best option if a high improvement of the control performance is sought. Besides, this also entails a speed-up and a complexity reduction of the control design process, since only one control loop has to be designed and trained from scratch. Again, the higher the number of control loops to design, the higher the benefit obtained following this design approach.

**Author Contributions:** I.P. designed and trained the artificial neural networks considered in the TL-based methodology adopted in this work. He also implemented and tested the proposed structures over the BSM1 framework. The manuscript was written by I.P.; A.M., R.V. and J.L.V. supervised the work. All authors have read and agreed to the published version of the manuscript.

**Funding:** The APC was funded by the Spanish Government under Project TEC2017-84321-C4-4-R co-funded with European Regional Development Funds of the European Union.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data considered in this study correspond to the influent and effluent data generated and available in the same BSM1 framework defined in [8,25].

**Acknowledgments:** This work has received support from the Catalan Government under Projects 2017 SGR 1202 and 2017 SGR 1670, and from the Spanish Government through the MICINN projects PID2019-105434RB-C33 and TEC2017-84321-C4-4-R.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**Meng Zhou 1, Yinyue Zhang 1, Jing Wang 1, Yuntao Shi 1,\* and Vicenç Puig <sup>2</sup>**


**Abstract:** This paper proposes a novel interval prediction method for effluent water quality indicators (including biochemical oxygen demand (BOD) and ammonia nitrogen (NH3-N)), which are key performance indices in the water quality monitoring and control of a wastewater treatment plant. Firstly, the effluent data regarding BOD/NH3-N and their necessary auxiliary variables are collected. After some basic data pre-processing techniques, the key indicators with high correlation degrees of BOD and NH3-N are analyzed and selected based on a gray correlation analysis algorithm. Next, an improved IBES-LSSVM algorithm is designed to predict the BOD/NH3-N effluent data of a wastewater treatment plant. This algorithm relies on an improved bald eagle search (IBES) optimization algorithm that is used to find the optimal parameters of least squares support vector machine (LSSVM). Then, an interval estimation method is used to analyze the uncertainty of the optimized LSSVM model. Finally, the experimental results demonstrate that the proposed approach can obtain high prediction accuracy, with reduced computational time and an easy calculation process, in predicting effluent water quality parameters compared with other existing algorithms.

**Keywords:** water quality monitoring; data pre-processing; improved IBES-LSSVM algorithm; interval prediction method

#### **1. Introduction**

Nowadays, freshwater is considered one of the most critical resources for humans, since it can ensure the availability of an acceptable quantity of water for livelihoods, health, ecosystems and production. Hence, freshwater plays a key role in poverty and disease burden reduction, economic growth and environmental sustainability [1,2]. This fact has long been acknowledged all over the world. However, due to industrial pollution, rapid population growth and farmland sewage caused by the extensive use of chemical fertilizers, pesticides and herbicides, the shortage of freshwater sources is a serious and challenging issue [3,4].

Wastewater treatment is one key technology to potentially provide additional water supplies, and it is very important for the functioning of the economy and society. Wastewater treatment has been attracting a lot of attention, since it can not only remove organic wastes to reduce the environmental burden, but also offer the advantage of producing a renewable source of water [5,6]. Wastewater treatment is a very complex process with a variety of physical and biochemical reactions since it presents nonlinear dynamic behavior, time delay and uncertainty [7]. In wastewater treatment plant processes, effluent water quality monitoring is an important task that involves measuring the evolution of the quality parameters in time.

**Citation:** Zhou, M.; Zhang, Y.; Wang, J.; Shi, Y.; Puig, V. Water Quality Indicator Interval Prediction in Wastewater Treatment Process Based on the Improved BES-LSSVM Algorithm. *Sensors* **2022**, *22*, 422. https://doi.org/10.3390/s22020422

Academic Editor: Assefa M. Melesse

Received: 30 October 2021 Accepted: 23 December 2021 Published: 6 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Note that most traditional methods of measuring these quality indicators for wastewater treatment processes are based on manual lab-based monitoring approaches, with manual sample collection, long transportation times and biological/microbial testing in a laboratory, which is cumbersome and time-consuming. Usually, the testing equipment is very expensive and cannot be used online. In addition, since the process of wastewater treatment is complex, some control strategies are necessary and required to be deployed to guarantee that effluent quality indicators behave normally. In recent decades, water quality monitoring has been evolving towards wireless sensor networks [8], such that most of the important indicators of effluent water (pressure, pH, level and so on) can be measured online by their corresponding sensors. However, there are still some parameters that cannot be measured quickly due to high costs and the limitations of sensors, such as BOD and NH3-N. Usually, the concentration of the BOD/NH3-N effluent associated with a wastewater treatment process is an important factor in measuring the water quality, since the discharge of a large amount of NH3-N and BOD wastewater will lead to water eutrophication, which can affect human health. In China's "Pollutant Discharge Standard for Urban Wastewater Treatment Plants (*GB*18918-2002)", the Class A standard stipulates that the maximum discharge for NH3-N is 5 mg/L, while for BOD, it is 10 mg/L. Thus, measuring these effluent quality indicators with high accuracy is an important issue.

Researchers have focused on soft-sensing methods to predict these effluent quality indicators, and the prediction task is addressed by combining data analytics and water quality control. Soft-sensing methods aim to find certain relationships between easy-to-measure variables and difficult-to-measure variables in the sewage treatment process. Then, a suitable model is established based on these relationships, and the difficult-to-measure variables can be predicted based on the soft-sensing models.

Machine learning approaches are usually considered a subset of artificial intelligence. They focus on some statistical models and algorithms to extract patterns from data so that useful inferences can be used to predict new data. Recently, with the development of machine learning, artificial neural network (ANN), support vector machine (SVM), decision tree, random forest, ensemble learning and many other methods have been researched in depth and have a wide range of applications, including text processing, computer vision, healthcare, finance and robotics. They can also be used for socio-economic and environmental studies [9–12]. In [12], the impacts of flood protection in Bangladesh were evaluated by machine learning methods. In [13], a gray model and ANN method were investigated to predict suspended matter and chemical oxygen demand in the wastewater treatment process. Cong et al. proposed a mixed soft sensor model based on a wavelet neural network and adaptive weighted fusion for the online prediction of effluent COD [14]. M. Hamada carried out the assessment of a wastewater treatment plant's performance based on ANN and a multiple linear regression method [15]. M. Zeinolabedini et al. proved that applying various parent wavelet functions to the neural network structure can improve the accuracy of predicting the wastewater sludge volume [16]. A. K. Kadam et al. used ANN and multiple linear regression to model and predict water quality parameters in river basins [17]. S. Heddam et al. investigated a generalized regression neural network model to predict the BOD of effluent in wastewater treatment plants [18]. Tan et al. predicted the first weighting from the working face roof in a coal mine based on a GA-BP neural network [19]. V. Nourani et al. proved that the prediction ability of a neural network ensemble is more reliable [20].

Compared with the ANN method, SVM is another important prediction technique, which can effectively solve the problem of high-dimensional data model construction under the condition of limited samples, and has strong generalization ability. Hence, many scholars have carried out a lot of research on SVM-based prediction. Cheng et al. proposed a variety of kernel single-class SVMs to monitor and predict the intake conditions of wastewater treatment plants [21]. Han et al. developed a neural network model for predicting the sludge volume index based on information transfer strength and adaptive second-order algorithms [22]. Wu et al. proposed an adaptive multi-output soft sensor model for monitoring wastewater treatment and made several simulation comparisons to prove the superiority of the algorithm [23]. K. Lotfi et al. used a linear–nonlinear hybrid method to predict the effluent index of a wastewater treatment plant, which improves the prediction ability of the single method [24]. Han et al. proposed a data-based predictive control strategy and proved its superiority through several simulations [25]. In [26], the total solid content of a wastewater treatment plant was predicted by an SVM model, which can enhance performance and durability.

Although SVM is a small-sample learning method and has been widely used to solve the wastewater prediction problem, its calculation process is involved, which makes it difficult to implement for large-scale training samples [27]. To overcome these disadvantages, the least-squares support vector machine (LSSVM) has been proposed. LSSVM improves on the SVM algorithm by solving a set of linear equations rather than a quadratic programming problem. In this way, the calculation process can be simplified and the computation speed greatly improved [28]. Zhang et al. proposed an improved LSSVM model based on SVM to predict river flow [29]. Fei Luo et al. integrated the Gustafson-Kessel algorithm and the least-squares support vector machine for online prediction [30]. D. S. Manu et al. combined SVM and an adaptive neuro-fuzzy reasoning system model to predict the effluent nitrogen content of wastewater treatment plants [31]. Liu et al. investigated the online prediction of effluent COD in an anaerobic wastewater treatment system based on principal component analysis and the LSSVM algorithm [32].

Note that there are some unknown parameters in the kernel functions of LSSVM that need to be selected in advance. Generally, these parameters are determined according to experience, which may be time-consuming, and it is difficult to find the optimal parameters this way. Nowadays, swarm intelligence optimization algorithms are researched extensively, since the optimal solution can be found through a collaborative swarm search mechanism. Combinations of swarm intelligence optimization algorithms and machine learning methods can be found in a large number of references. In [33], a hybrid model of particle swarm optimization (PSO) and support vector machine is proposed to predict the turbidity and pH value of sand-filtered water in irrigation systems. Han et al. used an adaptive PSO algorithm to design self-organizing radial basis function neural networks to improve accuracy and save time [34]. Chen et al. studied an artificial bee colony optimization back-propagation network to predict the water quality of a water diversion project [35]. Fan et al. used the LSSVM model to improve the performance of predicting the safety factor of a circular slope [36]. Mahdi Shariati et al. used the gray wolf algorithm to optimize ELM model parameters to predict the compressive strength of partially replaced cement concrete [37]. However, to the best of the authors' knowledge, these swarm intelligence methods may fall into local optima and fail to find the global optimal solution.

Most of the above-mentioned methods only focus on point prediction, without providing information regarding its accuracy. The prediction results carry strong uncertainty that affects the decision-making process, increasing the risk of making poor decisions. The prediction interval (PI) is a standard tool for quantifying prediction uncertainty: a PI not only provides the range where the target value is most likely to lie, but also indicates its accuracy. Yao et al. combined the mean variance estimation (MVE) method with a recurrent neural network to measure the uncertainty in prediction [38]. Yuan et al. combined the beta distribution with the PSO-LSTM model to obtain wind power prediction intervals with high reliability and a narrow interval width, so as to provide decision support for the safe and stable operation of power systems [39]. Liao et al. combined the bootstrap method with a long short-term memory network to realize uncertainty-aware prediction of the remaining service life of a machine [40]. Marin et al. obtained the prediction interval of power consumption by combining the delta method with a fuzzy prediction model [41]. Sun et al. constructed a high-quality prediction interval based on a two-step dual-ELM method and applied it to the scheduling of a gas system [42]. In recent years, a direct interval prediction method called lower and upper bound estimation (LUBE) has been proposed. The main idea of this method is to directly construct the lower and upper bounds of the PI by optimizing the coefficients of a neural network according to interval quality evaluation indices. This approach provides good performance and does not rely on strict data distribution assumptions, such that it can provide more information about the prediction results, which motivates the work of this paper.
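The interval quality indices optimized by LUBE-style methods typically combine a coverage term and a width term, e.g., the PI coverage probability (PICP) and a normalized average interval width. A minimal sketch of these two metrics (function and variable names are ours, not from the paper):

```python
import numpy as np

def picp(y, lower, upper):
    """PI coverage probability: fraction of targets inside [lower, upper]."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

def pinaw(y, lower, upper):
    """Average interval width, normalized by the target range."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return float(np.mean(upper - lower) / (y.max() - y.min()))

y = np.array([1.0, 2.0, 3.0, 4.0])
lo, hi = y - 0.5, y + 0.5       # candidate PI bounds
cov = picp(y, lo, hi)           # 1.0: every target is covered
width = pinaw(y, lo, hi)        # 1.0 / 3.0: width 1.0 over a range of 3.0
```

A good interval maximizes coverage while keeping the normalized width small; LUBE trains the bound-producing network against such a combined index.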

The main objective of this paper is to obtain a soft-sensor-based interval prediction method with high prediction accuracy and less computational time to predict the effluent water quality parameters, which is significant for water quality monitoring and control. Aiming at the online prediction of BOD/NH3-N effluent in a wastewater treatment plant within a smart data-driven framework, the main contributions of this paper are the following:


The structure of this paper is as follows: In Section 2, the problem description is given, including the real data collection, data pre-processing and gray-correlation-analysis-based data selection. Section 3 describes the model uncertainty analysis by using the proposed IBES-LSSVM algorithm and LUBE algorithm. In Section 4, the simulation examples are depicted, demonstrating the effectiveness of the proposed method based on the BOD and NH3-N data. Section 5 draws the main conclusions of this paper.

#### **2. Problem Description**

In this paper, a soft-sensing-based method is investigated to analyze and predict the water quality indicators, including three main aspects: data collection, data pre-processing and data interval prediction. The main steps of the approach presented in this paper are shown in Figure 1.

Under a smart data-driven framework, in order to predict water quality tendencies and analyze the mechanisms behind the considered data sources, enough relevant experimental data in real time must be collected based on the prediction quality indicators. Most collected data may present several issues, such as data sparsity and data synchronization, among others. After the data are collected, they must be pre-processed in advance by applying several procedures, such as data cleaning, abnormal data elimination or normalization. Then, correlation analysis from different dimensions of water quality indicators should be considered to extract the relations between these auxiliary variables and find the key factors.

#### *2.1. Data Collection*

Due to the complexity of the wastewater treatment process and the large number of parameters that need to be set, it is necessary to determine the characteristic variables related to the water quality of interest and use them as auxiliary variables. The data that can evaluate water quality, or its impact, in wastewater treatment plants are mainly divided into the following four categories [43]:

• **Physical data:** Physical properties are the ones that must be monitored throughout the treatment process, including total suspended solids, temperature, conductivity, transparency, total dissolved solids, etc.


**Figure 1.** Main steps of the proposed approach.

This paper focuses on a real wastewater treatment plant in Beijing, China, from August 2014 to September 2014 [7,44]. Two data sets are collected first, which are used to predict the BOD/NH3-N effluent, separately. (1) BOD data set: containing 360 batches of data with 23 variables (including the BOD effluent parameters)—the detailed information is shown in Table 1; (2) NH3-N data set: including 10 characteristic variables related to NH3-N effluent parameters, as shown in Table 2.

#### *2.2. Elimination of Abnormal Data*

Data collected from wastewater treatment plants can contain erroneous values because of improper instrument operation, human or environmental interference and other factors. As a result, we need to analyze the collected data first, and eliminate some abnormal or meaningless data.

In this paper, we use the 3*σ* criterion to handle the abnormal data of the two collected data sets. The sample data are denoted as *x*1, *x*2, ··· , *xn*, and *η<sup>i</sup>* represents the residual error of sample *i*. The standard deviation is then calculated as follows:

$$
\sigma = \sqrt{\frac{\sum\_{i=1}^{n} \eta\_i^2}{n-1}} \tag{1}
$$

$$
\eta\_i = x\_i - \bar{x} \tag{2}
$$

where *n* represents the number of elements in the data set, and *x*¯ is the data average. If the residual error of a particular data sample *xi* satisfies

$$|\eta\_i| > 3\sigma \tag{3}$$

this means that it corresponds to an abnormal sample and needs to be eliminated. Otherwise, *xi* is accepted.
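The elimination step of Equations (1)–(3) can be sketched as follows. This is illustrative code, not the authors' implementation; note that Equation (1) uses the *n* − 1 denominator, i.e., the sample standard deviation:

```python
import numpy as np

def three_sigma_filter(x):
    """Drop samples whose residual from the mean exceeds three standard deviations."""
    x = np.asarray(x, dtype=float)
    eta = x - x.mean()                              # residual errors, Eq. (2)
    sigma = np.sqrt((eta**2).sum() / (len(x) - 1))  # standard deviation, Eq. (1)
    return x[np.abs(eta) <= 3 * sigma]              # keep samples passing Eq. (3)

data = [5.0] * 19 + [50.0]        # nineteen plausible samples plus one gross outlier
clean = three_sigma_filter(data)  # the 50.0 sample is discarded
```

With small *n*, a single outlier can inflate *σ* enough to pass its own test (for *n* points, no residual can exceed (*n* − 1)/√*n* sample standard deviations), so the criterion is most reliable on reasonably long records.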



**Table 2.** Effluent NH3-N data set.


#### *2.3. Data Normalization*

Different variables often have different dimensions and dimensional units. In order to eliminate the dimensional influence between indicators, it is necessary to normalize the data to achieve uniformity among the different data indicators. There are four classes of normalization methods, i.e., rescaling, mean normalization, standardization and scaling to unit length. In this paper, the rescaling method is selected. The normalization formula is as follows:

$$
\bar{\mathbf{x}}\_{i} = \frac{\mathbf{x}\_{i} - \mathbf{x}\_{i\text{min}}}{\mathbf{x}\_{i\text{max}} - \mathbf{x}\_{i\text{min}}} \tag{4}
$$

where *xi* is any value of a variable; *xi* min and *xi* max are, respectively, the minimum and maximum value of the variable.

After this kind of normalization, all the values of the data are set in the range of [0, 1].
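The rescaling of Equation (4) applied per variable (column) can be sketched as below; this is an illustrative snippet that assumes *xi* max > *xi* min for every column:

```python
import numpy as np

def rescale(X):
    """Min-max normalization of Eq. (4): each column is mapped to [0, 1]."""
    X = np.asarray(X, dtype=float)
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)

# Two variables with very different scales (values are made up for illustration):
X = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 300.0]])
Xn = rescale(X)   # every column now lies in [0, 1]
```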

#### *2.4. Correlation Degree Analysis*

Since different characteristic variables have different influences on the predicted variables, to obtain a soft-sensing model with a simpler structure, it is necessary to choose the quality indicators with high correlations, i.e., to select *m*′ auxiliary variables out of the *m* available variables, with *m*′ < *m*. In practice, the larger *m* is, the smaller *m*′ is relative to *m*.

In this paper, the gray relational degree analysis method is used to select the characteristic variables for the BOD and NH3-N effluents. Gray relational degree analysis is a multi-factor statistical method, which describes the strength of the relationship between various factors according to the gray relational degree. This method compensates for the inconsistency between quantitative results and qualitative analysis found in traditional mathematical statistics methods and reduces the amount of calculation.

The gray correlation coefficient is formulated as follows:

$$\beta = \left| x\_0(k) - x\_j(k) \right| \tag{5}$$

$$\mu\_{\mathbf{j}}(k) = \frac{\min\_{j} \min\_{k} \beta + \rho \cdot \max\_{j} \max\_{k} \beta}{\beta + \rho \cdot \max\_{j} \max\_{k} \beta} \tag{6}$$

where *j* denotes the *j*-th variable, *k* the *k*-th sample, *x*0(*k*) is the output variable, *xj*(*k*) is the input variable, *μ*<sup>j</sup> is the gray correlation coefficient, and *ρ* is the resolution coefficient. The smaller *ρ* is, the larger the difference between correlation coefficients and the stronger the distinguishing ability.

Then, the gray correlation degree can be calculated as follows:

$$\gamma\_j = \frac{1}{n} \sum\_{k=1}^{n} \mu\_j(k) \tag{7}$$

where *n* is the number of data samples.

If the gray correlation degree is larger, the corresponding variable has a higher correlation with the effluent quality indicators. The characteristic variables are then sorted in descending order of their gray correlation degree. Usually, a threshold *h*¯ is determined in advance, and a key indicator is selected as an input of the soft-sensing model if

$$
\gamma\_j > \bar{h} \tag{8}
$$

is satisfied.
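The gray relational degree of Equations (5)–(7) can be sketched as follows (Python with NumPy; the function name and the default resolution coefficient are our assumptions, not from the paper):

```python
import numpy as np

def gray_relational_degree(x0, X, rho=0.5):
    """Gray correlation degree of each candidate variable with x0 (Eqs. (5)-(7)).

    x0  : (n,) reference (output) series, already normalized to [0, 1]
    X   : (n, m) candidate (input) series, columns are variables
    rho : resolution coefficient
    """
    x0 = np.asarray(x0, float)[:, None]
    X = np.asarray(X, float)
    beta = np.abs(x0 - X)                               # Eq. (5), shape (n, m)
    b_min, b_max = beta.min(), beta.max()               # min-min and max-max over j and k
    mu = (b_min + rho * b_max) / (beta + rho * b_max)   # Eq. (6)
    return mu.mean(axis=0)                              # Eq. (7): average over k
```

A column identical to the reference series attains the maximum degree of 1, while dissimilar columns score lower; thresholding the result then implements the selection rule of Equation (8).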

#### **3. Methodology**

In this section, a novel IBES-LSSVM method is proposed to find the optimal kernel-function parameters of the LSSVM, as illustrated in Figure 2.

**Figure 2.** Flow chart of IBES-LSSVM model.

#### *3.1. LSSVM Algorithm*

The theory of LSSVM was first proposed by Suykens in 1994. LSSVM is a kernel learning machine following the principle of structural risk minimization and is suitable for analyzing the issue of sample classification and regression estimation [45].

In LSSVM theory, firstly, the sample data are mapped to higher dimensions through nonlinear changes, and linear functions are used for fitting in this high-dimensional feature space:

$$y(\mathbf{x}) = \boldsymbol{w} \cdot \boldsymbol{\phi}(\mathbf{x}) + \boldsymbol{b} \tag{9}$$

where *y*(*x*) is the output variable, *x* is the input variable vector, and *w* and *b* are the weight and bias terms, respectively.

The optimization objectives of the LSSVM regression algorithm can be formulated as

$$\begin{aligned} \min l(w, \xi\_i) &= \frac{1}{2} w^T w + \frac{C}{2} \sum\_{i=1}^n \xi\_i^2 \\ \text{s.t.} \quad y\_i &= w \cdot \phi(x\_i) + b + \xi\_i, \quad i = 1, 2, \dots, n \end{aligned} \tag{10}$$

where *C* is the regularization coefficient, *ξ<sup>i</sup>* is the relaxation variable, and $\sum\_{i=1}^{n} \xi\_i^2$ is the empirical risk.

By means of Lagrange multipliers *αi*, (10) can be expressed as:

$$\begin{aligned} L(w, b, \xi\_i, \alpha\_i) &= \frac{1}{2} w^T w + \frac{C}{2} \sum\_{i=1}^{n} \xi\_i^2 \\ &- \sum\_{i=1}^{n} \alpha\_i \left[ w \cdot \phi(x\_i) + b + \xi\_i - y\_i \right] \end{aligned} \tag{11}$$

According to Karush–Kuhn–Tucker (KKT) optimization conditions:

$$\begin{cases} \dfrac{\partial L}{\partial b} = 0 \Rightarrow \sum\_{i=1}^{n} \alpha\_i = 0 \\ \dfrac{\partial L}{\partial w} = 0 \Rightarrow w = \sum\_{i=1}^{n} \alpha\_i \phi(x\_i) \\ \dfrac{\partial L}{\partial \xi\_i} = 0 \Rightarrow \alpha\_i = C \xi\_i \\ \dfrac{\partial L}{\partial \alpha\_i} = 0 \Rightarrow w \cdot \phi(x\_i) + b + \xi\_i - y\_i = 0 \end{cases} \tag{12}$$

By defining kernel functions, the optimization problem (11) can be transformed into a linear solution issue:

$$
\begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & K(x\_1, x\_1) + \frac{1}{C} & \cdots & K(x\_1, x\_n) \\ \vdots & \vdots & & \vdots \\ 1 & K(x\_n, x\_1) & \cdots & K(x\_n, x\_n) + \frac{1}{C} \end{pmatrix} \begin{pmatrix} b \\ \alpha\_1 \\ \vdots \\ \alpha\_n \end{pmatrix} = \begin{pmatrix} 0 \\ y\_1 \\ \vdots \\ y\_n \end{pmatrix} \tag{13}
$$

where *K*(*x*, *xi*) is the kernel function.

The Lagrange multiplier and its parameters can be obtained from (13). Therefore, the output of LSSVM can be obtained:

$$y(x) = \sum\_{i=1}^{n} \alpha\_i K(x, x\_i) + b \tag{14}$$

For the LSSVM, there are many types of kernel functions, such as the linear kernel, polynomial kernel, radial basis function (RBF), sigmoid kernel, etc. Different kernel functions produce different types of LSSVM. In this paper, the RBF is selected as the kernel function of the model:

$$K(\mathbf{x}, \mathbf{x}\_i) = \exp(-\frac{\left\|\mathbf{x} - \mathbf{x}\_i\right\|^2}{2\sigma^2})\tag{15}$$

where *σ* is the variance of RBF.

Through the aforementioned analysis, the LSSVM has two tunable parameters (the regularization coefficient *C* and the variance *σ* of the RBF kernel), which are important and need to be determined. To obtain these two optimal parameters, the next step uses an improved BES algorithm to optimize them.
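Training the LSSVM then amounts to solving the linear system (13) and evaluating the predictor (14). A minimal Python/NumPy sketch (function names and default parameter values are illustrative, not taken from the paper):

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between row-sample sets A and B, Eq. (15)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, C=10.0, sigma=1.0):
    """Solve the linear system (13) for the bias b and multipliers alpha."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                    # first row: [0, 1, ..., 1]
    A[1:, 0] = 1.0                    # first column: [0, 1, ..., 1]^T
    A[1:, 1:] = K + np.eye(n) / C     # kernel matrix plus ridge term 1/C
    rhs = np.concatenate(([0.0], np.asarray(y, float)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]            # b, alpha

def lssvm_predict(X_new, X, alpha, b, sigma=1.0):
    """Eq. (14): y(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X, sigma) @ alpha + b
```

With a large *C* the fit approaches interpolation of the training data; *C* and *σ* are exactly the two parameters the improved BES algorithm of the next subsection searches over.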

#### *3.2. IBES-LSSVM Algorithm*

The BES algorithm is an optimization algorithm that simulates the hunting strategy of bald eagles when looking for fish. It obtains a local optimal solution through multiple iterations and finally the overall optimal solution, such that the position of the optimal solution corresponds to the optimal parameter value.

BES hunting is divided into three stages. In the first stage (select space), the eagle selects the space with the largest number of prey. In the second stage (search in space), the eagle moves within the selected space to find the prey. In the third stage (swooping), the eagle swoops from the best position determined in the second stage and determines the best point for hunting.

In the selection stage, this paper first optimizes the initial prey positions by adopting the tent chaos strategy, which has the advantages of a simple structure and strong ergodicity. Then, a linear decreasing method is used to improve the control parameters of the eagles' iterative position updates. In this way, the optimal model parameters can be found, improving the quality of the fit. The tent chaotic mapping function is described as:

$$P\_{i+1} = \begin{cases} P\_i / \lambda, & P\_i \in [0, \lambda) \\ (1 - P\_i)/(1 - \lambda), & P\_i \in [\lambda, 1] \end{cases} \tag{16}$$

where *λ* ∈ (0, 1).
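The tent chaotic initialization of Equation (16) can be sketched as follows (Python; the function name is illustrative):

```python
def tent_map_sequence(p0, lam, steps):
    """Tent chaotic map, Eq. (16): ergodic sequence in [0, 1] used to
    spread the initial bald-eagle population over the search space."""
    seq = [p0]
    for _ in range(steps - 1):
        p = seq[-1]
        p = p / lam if p < lam else (1 - p) / (1 - lam)
        seq.append(p)
    return seq
```

The sequence stays within [0, 1] while visiting many distinct values, which is what gives the initial population its ergodic coverage of the search space.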

Then, the eagles search for food. The position update formula is:

$$P\_{i,\text{new}} = P\_{\text{best}} + R\_1 \cdot C\_1 \cdot (P\_{\text{mean}} - P\_i) \tag{17}$$

where *R*<sup>1</sup> is a parameter controlling the position change, and *C*<sup>1</sup> is a random number within (0, 1). *Pbest* is the current optimal location, *Pmean* is the average position of the eagles after the previous search, and *Pi* is the location of the *i*-th eagle.

In the search phase, the eagles search for prey within the selected search space, moving in different directions along a spiral to speed up the search. The position update is:

$$P\_{i, \text{new}} = P\_i + b(i) \cdot \left(P\_i - P\_{i+1}\right) + a(i) \cdot \left(P\_i - P\_{\text{mean}}\right) \tag{18}$$

where:

$$a(i) = \frac{ar(i)}{\max(|ar|)}\tag{19}$$

$$b(i) = \frac{br(i)}{\max(|br|)}\tag{20}$$

$$ar(i) = r(i) \cdot \sin[(\theta(i))] \tag{21}$$

$$\mathbf{b}\mathbf{r}(i) = r(i) \cdot \cos[(\theta(i))] \tag{22}$$

$$r(i) = \theta(i) + R\_2 \cdot C\_3 \tag{23}$$

$$
\theta(i) = \pi \cdot \omega \cdot \mathbb{C}\_2 \tag{24}
$$

$$
\omega = \left(1 - \frac{t}{t\_{\max}}\right)^2 \cdot (\omega\_{\max} - \omega\_{\min}) + \omega\_{\min} \tag{25}
$$

where *θ*(*i*) and *r*(*i*) are the polar angle and polar radius of the spiral equation, respectively; *ω* and *R*<sup>2</sup> are parameters controlling the spiral trajectory; and *C*<sup>2</sup> and *C*<sup>3</sup> are random numbers within (0, 1). *a*(*i*) and *b*(*i*) represent the position of the eagle in polar coordinates and take values in (−1, 1).

During the swooping phase, the eagles swoop from the best position in the search space towards their target prey. All points also move towards the best point according to

$$\begin{split} P\_{i,\text{new}} &= C\_4 \cdot P\_{\text{best}} + a\_1(i) \cdot (P\_i - R\_3 \cdot P\_{\text{mean}}) \\ &+ b\_1(i) \cdot (P\_i - R\_4 \cdot P\_{\text{best}}) \end{split} \tag{26}$$

where:

$$a\_1(i) = \frac{ar(i)}{\max(|ar|)}\tag{27}$$

$$b\_1(i) = \frac{br(i)}{\max(|br|)}\tag{28}$$

$$ar(i) = r(i) \cdot \sinh[(\theta(i))]\tag{29}$$

$$\text{br}(i) = r(i) \cdot \cosh[(\theta(i))] \tag{30}$$

$$r(i) = \theta(i) \tag{31}$$

$$
\theta(i) = \pi \cdot \omega \cdot \mathbb{C}\_5 \tag{32}
$$

where *R*<sup>3</sup> and *R*<sup>4</sup> control the moving speed of the eagles towards the optimal point, and *C*<sup>4</sup> and *C*<sup>5</sup> are random numbers within (0, 1).
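The swooping-phase update of Equations (26)–(32) can be sketched as one vectorized move (Python/NumPy; the random-number handling and parameter defaults are our assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dive_update(P, P_best, R3=1.5, R4=1.5, w=1.0):
    """One swooping-phase move for all eagles, Eqs. (26)-(32).

    P      : (n, d) current positions
    P_best : (d,) current best position
    """
    n = P.shape[0]
    P_mean = P.mean(axis=0)
    theta = np.pi * w * rng.random(n)           # Eq. (32): theta = pi * w * C5
    r = theta                                    # Eq. (31)
    ar = r * np.sinh(theta)                      # Eq. (29)
    br = r * np.cosh(theta)                      # Eq. (30)
    a1 = ar / np.max(np.abs(ar))                 # Eq. (27)
    b1 = br / np.max(np.abs(br))                 # Eq. (28)
    C4 = rng.random(n)                           # random numbers within (0, 1)
    return (C4[:, None] * P_best                 # Eq. (26)
            + a1[:, None] * (P - R3 * P_mean)
            + b1[:, None] * (P - R4 * P_best))
```

In the full algorithm, each candidate position encodes a (*C*, *σ*) pair for the LSSVM, and the update pulls the population towards the best solution found so far.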

#### *3.3. Interval Prediction*

Traditional point prediction cannot deal with the uncertainty in the operation of the system. In order to obtain a numerical estimation and its reliability, practical application requires the calculation of a prediction interval (PI). Interval prediction provides an estimation interval for the range of predicted values at a certain confidence level. The prediction interval is thus composed of lower and upper prediction bounds, which state its accuracy within a certain confidence level. Assuming that the confidence level is (1 − *μ*)%, with *l* and *u* the lower and upper limits, respectively, such that *P*(*l* < *y* < *u*) = (1 − *μ*)%, the PI can be expressed as [*l*, *u*]. For a given confidence level, the smaller the range of the prediction interval, the smaller the uncertainty of the prediction and the higher the accuracy.

The evaluation indexes of interval prediction are as follows [46].

*PICP*: The proportion of real values that fall within the lower and upper bounds of the prediction interval

$$PICP = \frac{1}{n} \sum\_{i=1}^{n} c\_i \tag{33}$$

If the real value is within the [*li*, *ui*] range, *ci* is 1; otherwise, *ci* is 0. If all real values are included in the prediction interval, *PICP* = 100%. *n* is the number of prediction points. In theory, *PICP* ≥ (1 − *μ*)%; otherwise, the PI is invalid or unreliable. When comparing the PIs produced by different models, the other indexes should be as small as possible under the condition that the *PICP* is as close to the confidence level as possible.

*PINAW*: The narrow PI has more information and practical value than the wide PI according to

$$PINAW = \frac{1}{nR} \sum\_{i=1}^{n} (u\_i - l\_i) \tag{34}$$

where *R* is the range of the predicted values.

*PINRW*: Represents the normalized root-mean-square width of the prediction interval. The expression is:

$$PINRW = \frac{1}{R} \sqrt{\frac{1}{n} \sum\_{i=1}^{n} (u\_i - l\_i)^2} \tag{35}$$

*CWC*: In practical application, it is often hoped that a narrow prediction interval width can still be obtained under the condition of high prediction probability, i.e., the prediction interval range probability and interval width will conflict. Therefore, the comprehensive index *CWC* is proposed:

$$CWC = PINAW \left( 1 + \varrho(PICP) \cdot e^{-\tau \cdot \left( PICP - (1 - \mu) \right)} \right) \tag{36}$$

where *τ* and *μ* are constants.

When working with training data, *ϱ*(*PICP*) is set to 1. In data verification, *ϱ*(*PICP*) is a step function:

$$\varrho = \begin{cases} 0 & PICP \ge 1 - \mu \\ 1 & PICP < 1 - \mu \end{cases} \tag{37}$$

LUBE is a neural-network-based method that directly calculates the lower and upper bounds of the prediction interval. Assuming that the two node values of the output layer of the neural network are the upper and lower limits of the interval, respectively, all the predicted values are included in this range at the confidence level (1 − *μ*)%. The training purpose of the neural network is to minimize the objective function *CWC*. In this way, the coverage probability and the width of the prediction interval are considered at the same time, and the quality of the prediction interval PI can be comprehensively evaluated.
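The interval evaluation indexes (33)–(37) can be sketched together as follows (Python/NumPy; the function name and the default *τ* are our assumptions):

```python
import numpy as np

def interval_metrics(y, lower, upper, mu=0.05, tau=50.0):
    """Prediction-interval evaluation indexes, Eqs. (33)-(37)."""
    y, l, u = map(np.asarray, (y, lower, upper))
    c = (y >= l) & (y <= u)                       # c_i = 1 if y_i inside [l_i, u_i]
    picp = c.mean()                               # Eq. (33)
    R = y.max() - y.min()                         # range of the target values
    pinaw = (u - l).mean() / R                    # Eq. (34)
    pinrw = np.sqrt(((u - l) ** 2).mean()) / R    # Eq. (35)
    rho = 0.0 if picp >= 1 - mu else 1.0          # Eq. (37): step function
    cwc = pinaw * (1 + rho * np.exp(-tau * (picp - (1 - mu))))  # Eq. (36)
    return picp, pinaw, pinrw, cwc
```

When the coverage meets the nominal level, the exponential penalty vanishes and *CWC* reduces to *PINAW*; otherwise the penalty inflates *CWC* sharply.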

The flow-chart of the proposed IBES-LSSVM algorithm is shown in Figure 2, which mainly includes the procedure presented in Algorithm 1.


**Input:** Measured data of wastewater treatment plant.

**Output:** Prediction interval of BOD/NH3-N effluent.

Step 1: Abnormal data elimination, normalization of the data according to Equations (1)–(4).

Step 2: Analyzing and selecting the key indicators with high correlation degree by Equations (5)–(8).

Step 3: The bald eagle population is initialized by tent chaos strategy based on Equation (16).

Step 4: Local optimal solution.

1: for all *Xi* do:

2: for all *Xi* do:

3: Obtain predicted value by means of Equations (9)–(15), (17).

4: end for

5: Using confidence, mean, standard deviation and other parameters, the prediction interval is obtained according to *norminv*() formula.

6: Evaluate interval fitness by means of Equations (33)–(37).

7: end for

8: Obtain the local optimal solution.

Step 5: Global optimal solution.

1: While *t* ≤ *iter* do:


6: Using confidence, mean, standard deviation and other parameters, the prediction interval is obtained according to *norminv*() formula.


9: Update parameter *X*, *C*, *σ* by using Equations (26)–(32).

10: Obtain different predictions by using Equations (9)–(15).

11: Using confidence, mean, standard deviation and other parameters, the prediction interval is obtained according to *norminv*() formula.


14: *t* = *t* + 1

15: end while

16: Obtain the global optimal solution.

Step 6: Return the global optimal prediction interval.

Step 7: Output *C*, *σ*, fitness and other index values by using Equations (33)–(37), (38)–(41).

#### **4. Simulation Results**

In this section, the data sets of BOD/NH3-N effluents are collected from a wastewater treatment plant in Beijing and are used to verify the effectiveness of the proposed approach.

The following evaluation indices of several certainty point predictions are evaluated as follows:

$$MSE = \frac{1}{n} \sum\_{i=1}^{n} \left(\hat{y}\_i - y\_i\right)^2 \tag{38}$$

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} (\hat{y}\_i - y\_i)^2} \tag{39}$$

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} |\hat{y}\_i - y\_i| \tag{40}$$

$$R^2 = 1 - \frac{\sum\_{i=1}^{n} (\hat{y}\_i - y\_i)^2}{\sum\_{i=1}^{n} (y\_i - \bar{y})^2} \tag{41}$$
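The point-prediction indexes (38)–(41) can be sketched as follows (Python/NumPy; the function name is illustrative):

```python
import numpy as np

def point_metrics(y_true, y_pred):
    """Point-prediction evaluation indexes, Eqs. (38)-(41)."""
    y, yh = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = ((yh - y) ** 2).mean()                    # Eq. (38)
    rmse = np.sqrt(mse)                             # Eq. (39)
    mae = np.abs(yh - y).mean()                     # Eq. (40)
    r2 = 1 - ((yh - y) ** 2).sum() / ((y - y.mean()) ** 2).sum()  # Eq. (41)
    return mse, rmse, mae, r2
```

A perfect prediction yields MSE = RMSE = MAE = 0 and *R*² = 1.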

#### *4.1. Experiment of Benchmark Functions*

The proposed approach is evaluated on the six benchmark functions listed in Table 3, with the corresponding ranges and parameters. The range is the boundary of the function search space.

In order to verify the superiority of the proposed approach, it is compared with the WOA, GWO, PSO and SSA algorithms. Statistical results are presented in Table 4. Moreover, the iteration process is depicted in Figures 3–8. From the results, we can see that the convergence rate of IBES is better than that of the other algorithms and the proposed IBES method is able to provide competitive results on the benchmark functions.

**Table 3.** Benchmark functions.


**Table 4.** Simulation results of algorithms.


**Figure 3.** The result of F1.

**Figure 4.** The result of F2.

**Figure 5.** The result of F3.

**Figure 6.** The result of F4.

**Figure 8.** The result of F6.

#### *4.2. Experiment of BOD Data*

BOD is one of the most important effluent quality indexes and can reflect the water pollution situation [7]. First, the key auxiliary variables are selected for the BOD effluent data set by calculating the gray correlation degree based on (7). The threshold of the gray correlation degree is chosen as 0.8. Hence, 14 auxiliary variables (as shown in Table 5) are selected as the soft measurement model inputs. Including the output effluent BOD, there are 15 key indicators; the detailed information is shown in Figure 9. Moreover, the description of each datum is given in Figure 10.

In this paper, the BOD effluent data set has 365 sets of data; among them, 335 sets of data are randomly selected as training samples, and the remaining 30 sets of data are treated as the prediction samples. In order to demonstrate the superiority of the proposed IBES-LSSVM method, it is compared with some existing results, i.e., CNN, LSTM, ELMAN, WOA-LSSVM, GWO-LSSVM, PSO-LSSVM and SSA-LSSVM. In the experiments, the initialization conditions are set as: *iter* is 50, *n* = 30, *ωmax* = 10, *ωmin* = 0, *R*<sup>1</sup> = 1.8, *R*<sup>2</sup> = 1, *R*<sup>3</sup> = 1.5, *R*<sup>4</sup> = 1.5.

**Figure 10.** Original data of BOD.



From Tables 6 and 7 and Figures 11–13, we can see that, compared with the existing CNN model, LSTM model, ELMAN model, WOA-LSSVM model, GWO-LSSVM model, PSO-LSSVM model and SSA-LSSVM model, the prediction accuracy of the proposed method is better, demonstrating its effectiveness.

**Figure 12.** 95% of BOD.

**Figure 13.** 90% of BOD.

**Table 6.** Predictive index of BOD.


**Table 7.** PI of BOD.


#### *4.3. Experiment of NH3-N Data*

In this experiment, the NH3-N effluent data set is considered, which has been described in [44]. First, the gray correlation degree is calculated from (7), and the results are presented in Figure 14. In addition, each selected auxiliary datum of the NH3-N data set is shown in Figure 15.

**Figure 14.** Auxiliary variables of NH3-N.

**Figure 15.** Original data of NH3-N.

In this example, the threshold of the gray correlation degree is also chosen as 0.8; hence, 7 auxiliary variables (as shown in Table 8) are selected as the soft measurement model input. The experimental data of effluent NH3-N used in this paper are from a sewage treatment plant in Beijing. In total, 237 sets of data were obtained, including 200 sets of data that were randomly selected as training samples, and the remaining 37 sets of data were treated as the prediction samples.


**Table 8.** Data after processing.

In order to demonstrate the superiority of the proposed IBES-LSSVM method, it is compared with some existing approaches, i.e., CNN, LSTM, ELMAN, WOA-LSSVM, GWO-LSSVM, PSO-LSSVM and SSA-LSSVM. In the experiments, the parameters are set as follows: *iter* is 50, *n* = 30, *ωmax* = 10, *ωmin* = 0, *R*<sup>1</sup> = 1.8, *R*<sup>2</sup> = 1.2, *R*<sup>3</sup> = 1.8, *R*<sup>4</sup> = 1.8.

From Tables 9 and 10 and Figures 16–18, we can see that, compared with the existing CNN model, LSTM model, ELMAN model, WOA-LSSVM model, GWO-LSSVM model, PSO-LSSVM model and SSA-LSSVM model, the prediction accuracy of the proposed method is the best, demonstrating its effectiveness.

**Table 9.** PI of NH3-N.


**Figure 16.** 99% of NH3-N.

**Figure 18.** 90% of NH3-N.

**Table 10.** Predictive index of NH3-N.


#### **5. Conclusions**

This paper investigates an improved IBES-LSSVM algorithm to predict the effluent water quality indicators of a wastewater treatment plant, in which an improved BES method is proposed to find the optimal LSSVM parameters. To deal with the uncertainties of the data, the prediction interval is generated within a certain confidence level, which provides the upper and lower bounds of the prediction results. Compared with other existing methods, the proposed approach demonstrates high prediction accuracy, with reduced computational time and an easy calculation process, in predicting effluent water quality parameters. Note that the proposed approach only predicts the water quality indicators, which is not the final goal for a wastewater treatment plant process. The application of this work to reliable decision-making and the generation of a suitable control strategy will be our future work.

**Author Contributions:** Conceptualization M.Z., V.P.; methodology M.Z., J.W., Y.S., V.P.; resources Y.S.; writing-review and editing M.Z., V.P., Y.Z.; supervision M.Z., J.W.; investigation J.W.; formal analysis M.Z.; software and data curation Y.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the National Key Research and Development Program (Grant No. 2018YFC1602704), National Natural Science Foundation of China (Grant No. 51805021,61973023), the Fundamental Research Funds for Beijing (No. 110052972027/015) and the Research foundation for Talents of NCUT (No. 213051360020XN173/017), "Science and Technology Innovation" Special Fund Project in Shijingshan District: 4010537621I7.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are available upon request.

**Acknowledgments:** This work was carried out with the help of Dingyuan Chen at Beijing University of Technology.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Estimation of Infiltration Volumes and Rates in Seasonally Water-Filled Topographic Depressions Based on Remote-Sensing Time Series**

**Pavel P Fil 1,2,\*, Alla Yu Yurova 1, Alexey Dobrokhotov <sup>3</sup> and Daniil Kozlov <sup>1</sup>**


**Abstract:** In semi-arid ecoregions of temperate zones, focused snowmelt water infiltration in topographic depressions is a key, but imperfectly understood, groundwater recharge mechanism. Routine monitoring is precluded by the abundance of depressions. We have used remote-sensing data to construct mass balances and estimate volumes of temporary ponds in the Tambov area of Russia. First, small water bodies were automatically recognized in each of a time series of high-resolution Planet Labs images taken in April and May 2021 by object-oriented supervised classification. A training set of water pixels defined in one of the latest images using a small unmanned aerial vehicle enabled high-confidence predictions of water pixels in the earlier images (Cohen's K = 0.99). A digital elevation model was used to estimate the ponds' water volumes, which decreased with time following a negative exponential equation. The power of the exponent did not systematically depend on the pond size. With adjustment for estimates of daily Penman evaporation, function-based interpolation of the water bodies' areas and volumes allowed calculation of daily infiltration into the depression beds. The infiltration was maximal (5–40 mm/day) at onset of spring and decreased with time during the study period. Use of the spatially variable infiltration rates improved steady-state shallow groundwater simulations.

**Keywords:** closed depressions; temporary water bodies; remote sensing; infiltration

#### **1. Introduction**

Shallow groundwater is present in many semi-arid landscapes across the world either intermittently or permanently, depending on the lithological profile, topography, and water balance. Unlike in wetter environments with diffuse groundwater recharge, recharge in these environments is primarily focused (local) in areas of excess water input [1]. In such environments, where moisture deficits in upland soils are high, groundwater recharge will only occur if there is sufficient infiltration of converging flow to overcome the deficits. One mechanism involved is a localized recharge process that routes surface water runoff within the landscape to topographically low areas (depressions), allowing infiltration of water through ephemeral seasonal ponds [2–4]. Moreover, depression-focused recharge driven by snowmelt is a major annual hydrological event in cold semi-arid regions such as the Pothole Prairie Region of North America. In recent decades, there has been an accelerated increase in process understanding of the contributions of prairie potholes to surface runoff [5,6] and depression-focused groundwater recharge [3] in this part of North America. The knowledge has been acquired through studies involving conceptual and mathematical modeling of hydrological processes of surface flows [5–7], subsurface flows and combinations of the two [8,9], applications of isotopic and environmental tracers [3], digital elevation model (DEM)-based delineations of depressions and their watersheds [10–14], assessments of

**Citation:** Fil, P.P.; Yurova, A.Y.; Dobrokhotov, A.; Kozlov, D. Estimation of Infiltration Volumes and Rates in Seasonally Water-Filled Topographic Depressions Based on Remote-Sensing Time Series. *Sensors* **2021**, *21*, 7403. https://doi.org/ 10.3390/s21217403

Academic Editors: Miquel Àngel Cugueró-Escofet and Vicenç Puig

Received: 14 September 2021 Accepted: 3 November 2021 Published: 7 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

hydrologic connectivity [6,10,15–17], and remote sensing with high and intermediate resolution [18,19]. Studies in various catchments have shown that both horizontal and vertical connectivity in pothole hydrological systems are very site-specific and no model can be applied to a new system without validation.

In Russia, areas rich in pothole-like systems of depressions ("zapadiny") in interfluves of forest-steppe catchments cover a larger region than in North America, extending across much of European Russia and into Siberia. However, after an initial period of intensive hydrological research in the 1960s to the 1980s there was very little study of depressionfocused groundwater recharge despite advances in GIS-facilitated simulation and remote sensing. Moreover, there is increasing societal need for such studies to enhance the understanding of key landscape functions related to water storage or movement, e.g., water capacitance, carbon sequestration, and both nutrient retention and cycling [17,20] and precision agricultural management. With some justification, early studies noted similarities between prairie potholes and forest-steppe zapadiny. However, before applying tools developed in North American research to Russian systems, there is a need for quantitative evaluation of concepts that emerged in earlier local studies.

One of the key hypotheses developed during the 1960s is that the major source of recharge for shallow groundwater in areas such as the Oka-Don Lowland of the Tambov region in European Russia is depression-focused infiltration during snowmelt [21]. In a very recent study an indirect method was used to calibrate the groundwater recharge to hydraulic conductivity ratio for application in an analytical steady-state solution of the 2D shallow groundwater flow equation using soil redoximorphic features of typical classified catenas of the Samovetc catchment in this lowland [22]. In the cited study, the same recharge rate was prescribed for all points along a topographical transect. In contrast, in the study presented here, the spatial variability of depression-focused groundwater recharge along the transect was studied in a field campaign in spring 2021, during and immediately after snowmelt, and several weeks later.

There is no single method for classifying remote-sensing data for the ponds' retrieval. The methods and materials used vary greatly depending on the region of study, season of the year, image resolution or type of the pond. In terms of the electromagnetic spectrum, the wavelengths used are visible (RGB), near-infrared (NIR), shortwave infrared (SWIR) and thermal infrared (TIR) [23,24]. In addition to optical methods, data from RADAR and LIDAR are also used [25]. Methods for extraction of small water bodies are divided into four groups. The first group is the threshold methods, the essence of which is the discretization of individual spectral channels or spectral indicators based on expert or experimental threshold values [26,27]. The second group comprises statistical methods, such as those using multivariate regression or discriminant analysis. The third group comprises classification methods, combining different approaches: pixel- or object-oriented analysis, classification with or without training, and various classification machines, for example random forests, support vector machines or neural algorithms [28–32]. There are also various special techniques (the fourth group), such as entropy-based computer vision techniques [33].

In this work, remote-sensing data were used to construct a mass balance and estimate volumes of ephemeral ponds by object-oriented supervised classification of high-resolution Planet Labs images of the Tambov area acquired from April to May 2021. The data acquired on dynamic changes in delineated ponds, in combination with a DEM, observations using an unmanned aerial vehicle (UAV), a widely accepted method for calculating evaporation, and visual hydrological observations were used to estimate infiltration volumes and rates through the depression bottoms and account for their spatial and temporal variability. Considering that groundwater recharge from the depressions' bottom is very area-focused and occurs episodically during the snowmelt, the process is not usually accounted for in the regional-scale evaluation of the groundwater resources in Tambov region. To include the impacts of spatial heterogeneity and dynamic fluctuation the depression-focused infiltration may be modeled numerically [8]. To examine the early hypothesis [21] on a

critical role of depressions in groundwater recharge in a forest-steppe region through the simplified approach for estimating recharge, this paper aims: (1) to determine the variations of the pond recession and infiltration rate in time and between the depressions due to systematic (vertically-varying hydraulic conductivity) and random factors (presence of clogging or frozen layers, pond–surface drainage network connection); (2) to determine the role of such spatial variations through numerical analysis of a shallow groundwater model for the simplified 2D case; and (3) to determine a method for calculating the volume of recharge through depressions in other catchments both with and without the requirement of numerical modeling and data assimilation. The results allowed identification of the volume of intercepted water during snowmelt and calculation of the rate of water recession and infiltration rates in closed depressions for the first time for the study region. Use of the horizontal variation in parameters obtained along the studied transect substantially improved results of the shallow groundwater model developed in the cited study [22].

#### **2. Materials and Methods**

As outlined in the Introduction, remote-sensing data, a DEM, UAV observations, a widely accepted method for calculating evaporation, and visual hydrological observations were combined to estimate infiltration volumes and rates through the depression bottoms and account for their spatial and temporal variability. The acquired time series of changes in the volume of nine temporary ponds enabled parametrization with a negative exponential curve. A time series of the infiltration rate, calculated from the water balance, was used to estimate the total amount accumulated during the event, and both the initial (maximum) and saturated (minimum) infiltration rates per unit area.

#### *2.1. Study Area*

The study area covers approximately 560 ha in the center of the Oka-Don lowland (52°37′ N, 40°2′ E) in the Petrovsky district of the Tambov region, Russia (Figure 1). The lowland is the largest in the forest-steppe biome. With elevation ranging from 120 to 180 m above sea level, on average it is 100 m lower than adjacent territories. The lowland has a semi-arid climate with long winters, pronounced spring snowmelt events, and relatively dry summers, with an annual precipitation to potential evapotranspiration ratio of 0.8. According to data recorded at a meteorological station 10 km north of the study site, during the period 2005–2020 the mean annual temperature was 6.9 °C, and average monthly temperatures in January and July were −8.6 °C and 21.0 °C, respectively [34]. Mean annual precipitation during this period amounted to 550 mm (of which 113 mm fell during periods with sub-zero temperatures), and the mean snow height before the onset of snowmelt was 320 mm, very similar to the recorded historical climatic norm for 1961–1990 (290 mm).

The soils are mainly chernozems and the area is mainly used for cultivating crops (typically wheat, corn, sunflower, soy, sugar beet), despite hindrance by water shortages. Clay and loamy deposits, generally 5–15 m (but sometimes up to 40 m) thick, with boulders of glacial origin, underlie a layer of loess-like loam with thickness ranging from 2 m in the lower parts of slopes to 30 m in the interfluve. The upper layer is porous and can both accumulate and retain moisture, while the glacial clays and loams form a local aquiclude for infiltrated surface waters. Shallow groundwater above this aquiclude is permanent and forms a continuous layer in the focal catchment. Evidence of stagnic conditions in topsoil is restricted to the presence of albic material in the lower part of the humus horizon in grey gleysols in the depression bottom. There is clear evidence of gleyic conditions in the soil morphology (Fe-Mn concretions, Fe masses, pore linings, reduced matrix) in the catchment and continuous water saturation below 2–3 m depth in the poorly drained soils and 1 m depth in the waterlogged soils. The latter was confirmed by a few cases of drilling in different years and seasons (the WTD is consistently highest after the spring flood) and automated measurements in 2019 (a year with extremely low snow accumulation).

Physical properties of the surface loams have contributed to the development of closed depressions of multifactorial genesis, which are widespread throughout the Oka-Don lowland. These depressions delay the runoff of surface waters into rivers [21] and transfer surface runoff to groundwater, thereby replenishing the groundwater and moistening the surrounding soil. The closed depressions are filled with water in the spring when the snow melts. Snow located in the catchment area of each basin melts and replenishes it, usually in mid-March to early April. At the end of spring, surface water only remains in small parts of the depressions, and in summer they usually dry up completely, in contrast to the closed depressions of the Prairie Pothole Region. Monitoring the dynamics of water volume and its filtration enables estimation of the amounts of valuable additional moisture entering the soil in this semi-arid region.

**Figure 1.** The study area. (**a**) Location of the study area in the forest-steppe biome of Eurasia. (**b**) Location of the study area in the catchment of the Matyr River-Oka-Don Lowland (SRTM). (**c**) Digital elevation model (DEM) of the interfluve of the Samovets brook; the studied territory of the depressions is marked with a rectangle. (**d**) Unmanned aerial vehicle (UAV) DEM of the study area, the numbers indicate numbers of closed depressions filled with pond water in the spring of 2021 (for details, see Table 2).

#### *2.2. Input Data*

Water dynamics in the closed depressions in spring 2021 were tracked and modeled using the following three types of data (Figure 2): ultra-high-resolution (25 cm) digital terrain and elevation models obtained using a small UAV (DJI Mavic 2 Pro); orthophotomaps of terrain in the visible range with ultra-high resolution (25 cm) from the UAV; and orthophotomaps of high resolution (3 m) in the visible range from the sensors of the RapidEye and SkySat mini-satellites of the Planet Labs system [25]. Precipitation data were obtained from the nearest weather station with daily resolution. Evaporation from the water surface was estimated using Penman's equation and meteorological input from the same station.

**Figure 2.** Sources of spatial data. (**a**) High-resolution DEM obtained photogrammetrically, with colors indicating heights. (**b**) Planet Labs' digital images of terrain in the visible spectrum with 3 m resolution. (**c**) High-resolution digital image obtained using the UAV in the visible spectral range.

In this study we used stereophotogrammetry, i.e., estimation of the three-dimensional coordinates of points on an object from two or more photographic images taken from different positions by the small UAV. A standard method was applied to construct the digital terrain model, with an accuracy of 0.03 m (hereafter, the UAV DEM).

For this, we used the Mavic 2 Pro routing app (DroneDeploy.com). Geolocation markers were located on the ground, and their positions were determined using the STONEX GNSS system (flight altitude, 150 m; image overlap, 75%). We processed the data using Agisoft Metashape and created a dense point cloud to generate a digital terrain model. We manually filtered out points associated with agroforestry areas within the fields in ArcGIS Pro using the field mask and the vegetation mask; the masks were obtained by manually interpreting the UAV materials. Orthophoto maps generated from free satellite photos obtained via Bing were used to identify trees, and points related to the heights of the trees were removed. A digital model of the territory with 25 cm resolution was created from the remaining point cloud using the kriging interpolation tool in ArcMap. An orthomosaic was created in Agisoft Metashape and exported with 25 cm resolution.

A time series of high-resolution visible orthomosaics (with 1–3 m resolution), acquired at times with no cloud cover from the beginning of spring snowmelt to the drying up of the temporary water bodies in early summer, was downloaded from Planet Labs Inc.

#### 2.2.1. Delineation of Water Bodies

Orthophotomaps generated using a UAV allow correct interpretation of water surfaces, as they can be visually inspected to delineate water/dry surface boundaries accurately. Orthomosaic maps from Planet Labs have lower resolution and higher atmospheric noise. Therefore, we used the Interactive Supervised Classification tool in the ArcGIS Pro desktop application to delineate water bodies in them. For this, we created a water-feature training set from the ultra-high-resolution orthomosaic and used it to enable automatic recognition of water bodies in the Planet Labs orthomosaics via object-oriented supervised classification, as implemented in the ArcGIS Pro raster classification tool [35]. It is well established that object-oriented classification is superior to pixel-based classification for high-resolution images [36], and it has been previously used to delineate similar depression-shaped natural systems [37,38].

The classification involved the following steps. First, the analyzed raster layer was constrained by a field cadastral border [39] buffered 15 m on each side, so that it contained only water-free bare soil surfaces (e.g., an agricultural field with no vegetation in early spring) and surfaces that might be flooded with water. Masking was applied to avoid possible classification errors by excluding unnecessary objects (trees, roads, buildings, etc.). The second step was image segmentation, based on a mean shift procedure with a criterion of minimum segment size expressed in pixels [40,41], implemented in ArcGIS Pro, to merge adjacent pixels of relative homogeneity (preferentially based on spectral color characteristics) into image objects. The unitless segmentation scale parameter, which determines the average object size by governing the degree of homogeneity allowed for pixel merging, was set to 10 on the RGB scale. The third step was creation of a training set. As summer approaches, ponds in the depressions always shrink (Figure 3). Thus, water surfaces present on the date of a UAV flight were always present on the preceding dates, and three training samples were created for the groups of dates before each UAV survey (Figure 3). Each training sample contained two categories: water and soil surface. Finally, the random forest (RF) [42,43] and support vector machine (SVM) [44] methods for supervised classification of the segmented images were applied, yielding binary (water/not water) rasters. The testing set from the next UAV survey was used to validate the resulting binary models. This enabled identification of the water surface areas in each period. However, in the classified images, the boundary of the ponds does not have a constant height relative to the UAV DEM. To avoid this unnatural variation in height, the classified raster was transformed into a vector containing only water polygons.
Along the outer boundary of each water polygon, the UAV DEM values were sampled at 25 cm intervals, and the median of the extracted values was calculated. The contour was then redrawn in accordance with this median value on the UAV DEM; thus, the outer boundaries of the ponds were forced to have a constant height value. The described procedure allowed us to avoid misclassification of the mixed Planet Labs pixels due to their relatively low resolution, as we operated with the vector area, not the raster area. For comparison, a purely pixel-based classification was also performed. The DEM and the water polygon vectors were used to calculate the volume of water in each depression on the days the images were taken. We used the Surface Volume tool from ArcGIS Pro to calculate the area and volume between the surface and the reference plane (Polygon Volume, 3D Analyst). This provided the water content of each depression in cubic meters on each of the days.
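The re-leveling of the pond outline to the median boundary height and the subsequent volume integration can be sketched with plain NumPy (a minimal illustration, not the ArcGIS Pro workflow; the function name `level_pond_and_volume` and the synthetic inputs are hypothetical):

```python
import numpy as np

def level_pond_and_volume(dem, boundary_rc, cell_size):
    """Force the pond outline to a constant elevation (the median DEM value
    sampled along the classified boundary) and integrate the water volume.

    dem         : 2D array of ground elevations (m), e.g. a UAV DEM
    boundary_rc : (N, 2) integer array of (row, col) samples along the
                  classified pond outline
    cell_size   : DEM cell edge length (m)
    Returns (water_level_m, area_m2, volume_m3).
    """
    # Median elevation along the outline defines the flat water surface.
    level = np.median(dem[boundary_rc[:, 0], boundary_rc[:, 1]])
    # Cells below the water level are flooded; depth = level - ground.
    depth = np.clip(level - dem, 0.0, None)
    flooded = depth > 0
    area = flooded.sum() * cell_size ** 2
    volume = depth.sum() * cell_size ** 2
    return level, area, volume
```

For a synthetic bowl-shaped depression, the function returns the water level set by the boundary median and the corresponding flooded area and volume.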

The maximum surface water area of a depression corresponds to the volume of water up to its overflow point, defined here as the minimum height value in the UAV DEM along the vector boundary of its drainage basin [45]. The watershed boundary was defined by the Basin tool in ArcGIS, based on the flow-direction raster derived from the UAV DEM using the Flow Direction tool in ArcGIS Pro. The water layer (mm) in the catchment area of each depression required to reach its maximum volume equals that total maximum volume of water divided by the entire catchment area.
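A minimal sketch of the overflow-point and fill-layer computation, assuming the drainage basin is given as a boolean raster mask (function and variable names are illustrative, not from the study's GIS toolchain):

```python
import numpy as np

def overflow_and_fill_layer(dem, basin_mask, cell_size, max_volume_m3):
    """Overflow elevation of a closed depression and the catchment water
    layer (mm) needed to fill it to that point.

    dem          : 2D UAV DEM (m)
    basin_mask   : boolean 2D array, True inside the depression's drainage basin
                   (assumed not to touch the array edge, since np.roll wraps)
    cell_size    : DEM cell edge (m)
    max_volume_m3: pond volume at the overflow point (m^3)
    """
    inside = basin_mask
    # Boundary cells: inside the basin but with at least one 4-neighbor outside.
    has_outside_neighbor = np.zeros_like(inside)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rolled = np.roll(np.roll(inside, dr, axis=0), dc, axis=1)
        has_outside_neighbor |= ~rolled
    boundary = inside & has_outside_neighbor
    # Overflow point: lowest DEM value on the basin boundary.
    overflow_elev = dem[boundary].min()
    # Water layer (mm) over the whole catchment needed to reach max volume.
    catchment_area_m2 = inside.sum() * cell_size ** 2
    layer_mm = 1000.0 * max_volume_m3 / catchment_area_m2
    return overflow_elev, layer_mm
```

E.g., a 9-cell basin of 10 m cells (900 m²) holding at most 90 m³ requires a 100 mm water layer from its catchment.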

**Figure 3.** Schematic illustration of the water surface classification method. The arrow at the bottom indicates the general trend of depressions drying (from left to right) in spring. The green dashed lines around the blue areas show the extent of the water as photographed by the UAV on the control dates (22 April, 15 May and 2 June). The gray color indicates the water surface resulting from object-oriented image classification trained on a subset of points from the later control surveys by the UAV. Three time intervals were used for the training. The green shading at the end is the area searched for a water surface by the UAV on a day when the pond had already disappeared (light red fill).

#### 2.2.2. Evaporation

Results of a previous comparison suggest that all three conventional methods for estimating evapotranspiration from water-filled and vegetated depressions have acceptable applicability for estimating evaporation from open water [46]. The most convenient of these methods, the classical form of the Penman equation [47,48], was used in this study to estimate potential evaporation:

$$E\_{PEN} = \frac{\Delta}{\Delta + \gamma} \frac{R\_n}{\lambda} + \frac{\gamma}{\gamma + \Delta} \frac{6.43 E\_A}{\lambda} \tag{1a}$$

Here: *EPEN* is potential (open water) evaporation (mm/d); *Rn* is net radiation at the surface (MJ/m2/d); Δ is the slope of the saturation vapor pressure curve (kPa/°C); *γ* is a psychrometric coefficient (kPa/°C); *λ* is the latent heat of vaporization (MJ/kg); and *EA* is the drying power of the air, which can be found using the following Dalton-type formulation:

$$E\_A = f(u)D = (1 + 0.536\,u)(e\_s - e\_a) \tag{1b}$$

Here: *f*(*u*) is a wind function with the linear coefficients of the original Penman equation (1948, 1963); *u* is the wind speed at 2 m height (m/s); *D* = (*es* − *ea*) is the vapor pressure deficit (kPa); *es* is saturation vapor pressure (kPa); and *ea* is actual vapor pressure (kPa).

Open-water evaporation was computed from readily available data as previously described [49] and implemented in the Evaplib Python library [50]. Input data for this were air temperature (T, ◦C), solar radiation (RS, MJ/m2/d), relative humidity (RH, %), and wind velocity (u, m/s).

In the absence of actinometric measurements of net radiation at the surface, this was calculated from amounts of cloud cover recorded at the weather station and a previous regional calibration [51].

$$R\_a = \frac{24(60)}{\pi} G\_{sc} d\_r (\omega\_s \sin(\phi) \sin(\delta) + \cos(\phi) \cos(\delta) \sin(\omega\_s)) \tag{2a}$$

Here: *Ra* is extraterrestrial radiation (MJ/m2/day), *Gsc* is the solar constant (0.0820 MJ/m2/min), *dr* is the inverse relative distance between the Earth and Sun, *ωs* is the sunset hour angle (rad), *φ* is latitude (rad), and *δ* is solar declination (rad).

$$d\_r = 1 + 0.033 \cos\left(\frac{2\pi}{365}J\right) \tag{2b}$$

where *J* is the day of the year; $\delta = 0.409 \sin\left(\frac{2\pi}{365}J - 1.39\right)$;

$$
\omega\_s = \arccos(-\tan(\phi)\tan(\delta)) \; ; \quad N = \frac{24}{\pi}\omega\_s.
$$

Solar radiation, *Rs*, can be calculated from the amount of cloud:

$$R\_s = (a\_s + b\_s(1 - N))R\_a \tag{3}$$

where *N* is the amount of cloud (ranging from 0 for clear sky to 1 for full cloud cover), and *as* and *bs* are Ångström coefficients; in the absence of regional calibration, values of 0.25 and 0.50, respectively, are recommended [52].

Net longwave radiation (*Rnl*) can be estimated from the air temperature, actual vapor pressure, and solar radiation, and is expressed via the Stefan–Boltzmann law:

$$R\_{nl} = \sigma \left(\frac{T\_{max}^4 + T\_{min}^4}{2}\right) (0.34 - 0.14\sqrt{e\_a}) \left(1.35 \frac{R\_s}{R\_{so}} - 0.35\right) \tag{4}$$

where *Tmax* is daily maximum air temperature (K), *Tmin* is daily minimum air temperature (K), and *Rso* is clear-sky radiation (MJ/m2/day) according to:

$$R\_{so} = (a\_s + b\_s)R\_a \tag{5}$$

We applied a constant albedo of 7% (0.07) for water surfaces in the calculations, based on the latitude and published mean reference values [51].

$$R\_n = (1 - \alpha)R\_s - R\_{nl} \tag{6}$$

We calculated daily evaporation values. Input data for Equations (1a), (1b) and (4) and daily precipitation were obtained from the nearest meteorological station (at Lipetsk city).
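As a rough illustration of the whole chain, Equations (1a)–(6), the following pure-Python sketch computes daily open-water evaporation from the same inputs. It is not the Evaplib implementation; the FAO-style constants, the fixed psychrometric coefficient, and the function name are assumptions:

```python
import math

SIGMA = 4.903e-9   # Stefan-Boltzmann constant, MJ K^-4 m^-2 day^-1
GSC = 0.0820       # solar constant, MJ m^-2 min^-1
A_S, B_S = 0.25, 0.50   # Angstrom coefficients, no regional calibration
ALBEDO = 0.07      # open-water albedo used in the study

def penman_open_water(t_mean, t_max, t_min, rh, u2, cloud, lat_deg, doy):
    """Daily open-water evaporation (mm/day) by the classical Penman
    equation, with net radiation estimated from cloud cover (Eqs 1a-6).

    t_mean/t_max/t_min : air temperatures (degC)
    rh    : relative humidity (%)
    u2    : wind speed at 2 m (m/s)
    cloud : cloud amount N, 0 (clear) .. 1 (overcast)
    """
    phi = math.radians(lat_deg)
    # --- extraterrestrial radiation, Eqs (2a)-(2b) ---
    dr = 1 + 0.033 * math.cos(2 * math.pi / 365 * doy)
    delta_s = 0.409 * math.sin(2 * math.pi / 365 * doy - 1.39)  # declination
    ws = math.acos(-math.tan(phi) * math.tan(delta_s))          # sunset angle
    ra = (24 * 60 / math.pi) * GSC * dr * (
        ws * math.sin(phi) * math.sin(delta_s)
        + math.cos(phi) * math.cos(delta_s) * math.sin(ws))
    # --- solar and net radiation, Eqs (3)-(6) ---
    rs = (A_S + B_S * (1 - cloud)) * ra          # solar radiation from cloud
    rso = (A_S + B_S) * ra                       # clear-sky radiation, Eq (5)
    es = 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))  # sat. vp, kPa
    ea = rh / 100.0 * es                                       # actual vp
    tmax_k, tmin_k = t_max + 273.15, t_min + 273.15
    rnl = (SIGMA * (tmax_k ** 4 + tmin_k ** 4) / 2
           * (0.34 - 0.14 * math.sqrt(ea))
           * (1.35 * rs / rso - 0.35))           # net longwave, Eq (4)
    rn = (1 - ALBEDO) * rs - rnl                 # net radiation, Eq (6)
    # --- Penman terms, Eqs (1a)-(1b) ---
    slope = 4098 * es / (t_mean + 237.3) ** 2    # slope of sat. curve, kPa/degC
    gamma = 0.0674                               # psychrometric coeff., ~101.3 kPa
    lam = 2.501 - 0.002361 * t_mean              # latent heat, MJ/kg
    drying = (1 + 0.536 * u2) * (es - ea)        # drying power E_A, Eq (1b)
    return ((slope / (slope + gamma)) * rn / lam
            + (gamma / (slope + gamma)) * 6.43 * drying / lam)
```

For a moderately cloudy spring day at the study latitude (roughly 52.6° N), this yields a few mm/day, with lower values under full cloud cover, as expected.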

#### 2.2.3. Water Balance and Groundwater Model Recalibration

Infiltration rates (mm/day) were calculated from the daily water balance equation:

$$F = 1000 \cdot \left[ -\Delta V - A\_t (E\_{PEN} - P) \right] / A\_t \tag{7}$$

where −Δ*V* is the daily rate of reduction in pond volume (m3), *At* is the current pond area (m2), and *P* is the daily precipitation (mm/d).

The volume of a pond on a given Julian day (*dayT*) was derived from the volume on the first Julian day in a series (*dayF*) and the following negative exponential equation:

$$V = a \cdot e^{-c \cdot (dayT - dayF)}\tag{8}$$

The scaling coefficient *a* and the power of the exponent *c* (the pond's approximate initial volume and decay rate, respectively) were obtained by the least-squares method, which gave fits with R2 > 0.9 for each of the studied ponds (Table 2, Figure 4d). The estimated volume on each day was used to calculate the rate of reduction in pond volume in Equation (7) with a daily time step.
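The recession fit of Equation (8) and the water-balance infiltration of Equation (7) can be sketched as follows. This is a hypothetical illustration: the paper does not specify its least-squares procedure, so a log-linear fit is used here, and Equation (7) is applied in the dimensionally consistent form with *EPEN* and *P* in mm/d:

```python
import numpy as np

def fit_recession(days, volumes):
    """Fit V = a * exp(-c * (day - day0)) by least squares on log-volumes.
    Returns (a, c, r2) in the sense of Eq (8)."""
    days = np.asarray(days, float)
    t = days - days[0]
    y = np.log(np.asarray(volumes, float))
    # Straight-line fit: ln V = ln a - c * t
    coef = np.polyfit(t, y, 1)
    c, ln_a = -coef[0], coef[1]
    resid = y - np.polyval(coef, t)
    r2 = 1 - resid.var() / y.var()
    return np.exp(ln_a), c, r2

def infiltration_mm(dV, area_m2, e_pen_mm, p_mm):
    """Daily infiltration rate (mm/day) from the pond water balance, Eq (7).
    dV: daily change in pond volume (m^3/day, negative while shrinking)."""
    return 1000.0 * (-dV) / area_m2 - (e_pen_mm - p_mm)
```

On a synthetic recession series generated from Equation (8) itself, the fit recovers the original coefficients with R2 ≈ 1.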

**Figure 4.** Temporal dynamics of selected ponds in the depressions. (**a**) Positions of Ponds 4 and 2 in a UAV photo image. (**b**,**c**) Groundwater levels in basins of the ponds on indicated dates based on the classification of images (highlighted in color by day). (**d**) Water volumes in the basins of ponds 4 and 2 on indicated dates (points) and negative exponential fits derived by Equation (8) (lines). Pond numbering as in Figure 1 and Table 2.

A limitation of the method lies in the choice of the first day of infiltration, because it is impossible to determine the water boundary in depressions while they are covered with snow. Water infiltrates the soil when the temperature is already above 0 °C, but classification of images with partial snow cover is problematic. Thus, the first Planet Labs image subjected to classification was the first acquired with no snow cover according to the nearest (Lipetsk) weather station.

A 2D profile of the steady-state shallow water table depth was obtained by the analytical form of the continuity equation, with calibration based on soil redoximorphic features. In [22], the hypothesized relationship between archived morphological properties of soils (redoximorphic features as indicators of gleyic conditions) and a current hydrological process indicator (WTD) was established based on expert knowledge of soil type and WTD co-occurrence, then verified under a hillslope flow continuity constraint expressed mathematically as a steady-state solution with two free parameters: hydraulic conductivity and recharge rate. Here, the input horizontal transect of groundwater recharge rate was taken as a time integral of Equation (7). Spatially, it varied along the transect according to the positions of the depressions in the landscape. Infiltration into the soil is not equal to the groundwater recharge rate, so relative values in the [0, 1] interval were used to describe the variability along the transect, while the formal calibration of the ratio of absolute recharge rate to hydraulic conductivity (N/k) was preserved in the method. Time-averaged infiltration was calculated based on the volume of water that infiltrated in closed depressions. We established 10 regular topographic profiles representing the generalized transect, each 3 km long and crossing the interfluve along the main slope, with regularly (5 m) spaced points. At each point, the value of the water layer (mm) that infiltrated through a closed depression was extracted. The 10 topographic lines were combined into a single profile by averaging the water-layer values at corresponding point positions. The regular placement of topographic profiles and sampling points was intended to optimize the two-dimensional characterization of additional moisture infiltration along the studied transect.
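The averaging of the 10 profiles into a generalized transect, and the scaling of the result to the [0, 1] interval used as relative recharge input, can be sketched as follows (illustrative only; `generalized_transect` is a hypothetical name):

```python
import numpy as np

def generalized_transect(profiles_mm):
    """Average per-profile infiltrated water layers (mm), sampled at
    regularly spaced points, into one generalized transect, and scale it
    to the [0, 1] interval used as relative recharge input.

    profiles_mm: array of shape (n_profiles, n_points); points with the same
    index lie at the same distance along each profile.
    """
    mean_layer = np.asarray(profiles_mm, float).mean(axis=0)
    peak = mean_layer.max()
    relative = mean_layer / peak if peak > 0 else mean_layer
    return mean_layer, relative
```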

#### **3. Results**

The proposed combination of object-oriented image classification based on a time series of Planet Labs images and an orthomosaic derived from UAV surveys to verify the satellite data enabled highly accurate identification of the water surfaces of closed depressions during their drying (Cohen's kappa = 0.99). Moreover, high-precision digital terrain models obtained using UAVs can be used to calculate volumes of water in closed depressions.

We compared different methods of pond extraction for the scene from 9 April 2021, when the reference UAV image was obtained. Two supervised pixel-based classification methods were compared: random forest (RF) and support vector machines (SVM), both providing results as a raster. The pond boundaries were then brought to a constant median DEM value to obtain vector pond polygons (again for both the RF and SVM classifications). The root mean square error (RMSE) and mean absolute percentage error (MAPE) of pond volume and area were lowest for the vector approach and notably higher for the raster approach (Table 1). SVM and RF errors were almost the same within the vector approach (Table 1), and RF was selected as the method most commonly used in such studies. In contrast to [52], the novel way to vectorize the polygons, based on the idea of a flat pond surface with constant height, brought a very notable increase in the quality of the area and volume estimates.
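For reference, the two error metrics in Table 1 follow the standard definitions (this is not code from the study):

```python
import numpy as np

def rmse(pred, true):
    """Root mean square error of predictions against reference values."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def mape(pred, true):
    """Mean absolute percentage error (%), reference values must be nonzero."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.mean(np.abs(pred - true) / np.abs(true)) * 100)
```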

**Table 1.** Root mean square error (RMSE) and mean absolute percentage error (MAPE) of the area and volume with true value taken from ultra-high resolution UAV estimate of ponds' boundaries. Random forest (RF) and support vector machine (SVM) methods are compared for the pixel-based (raster) and median DEM height-based (vector) delineation of the image taken on 9 April 2021.


Results obtained using the described procedure show that the drainage process of the focal depressions follows an exponential equation (Figure 4, Table 2), with coefficients (Table 2) that presumably depend on various factors (e.g., the depressions' source rocks and filtration areas), but we found no systematic quantitative relationships between the coefficients and considered parameters.

**Table 2.** Derived characteristics of the ponds during the decreasing-volume phase after snowmelt, from the maximum (starting day) to zero (final day). Coefficients *a* and *c* are from Equation (8), and R2 is the coefficient of determination for the negative exponential fit of the pond volume by the least-squares method.


<sup>1</sup> The numbering follows Figure 1. <sup>2</sup> dd—melt water peak event duration (days).

During the initial phase, the rate of pond recession is much higher than later in the season (Figure 5a). Notably less water is evaporated than infiltrates (Figure 5b,c), so the depression-focused replenishment of the groundwater is consistent with the previously mentioned hypothesis that the major source of recharge for shallow groundwater in the study area (and similar areas) is depression-focused infiltration during snowmelt [21]. There are two phases of infiltration: fast and slow (Figure 5c). Measurements during the fast phase enable estimation of the unsaturated soil's refill rate and capacity (Table 2). During the slow phase, the change in infiltration rate from day to day is much smaller. The saturated hydraulic conductivity decreases strongly with depth under a depression [8], reflecting the decreasing frequency of fractures with depth, and the flow is presumably limited by the lowest layer with the smallest fracture frequency. Thus, the infiltration rate estimated during the slow phase provides an approximation of the hydraulic conductivity (Table 2), corresponding to the maximum possible flux out of the soil column.

Overflow can occur from any closed depression (Figure 6). The probability of spillage depends on multiple factors, including the elevations of the lowest point in the catchment area and of the depression's overflow point. In 2018, water reached the overflow point in almost all the depressions considered here (Figure 6). Thus, the initial volumes (*a* coefficients) obtained for the nine studied ponds can be used to explore the effects of the landscape morphometry and meltwater input on initial volumes of ponds after snowmelt.

The results also indicate that the hypothesis of a quantitative linear relationship between the volume of water accumulated in a depression and the catchment area of its basin is only partly correct. The water volumes do not appear to be linearly related to the depressions' catchments, because the amount of water in a depression depends on both the catchment area and the maximum volume that can be stored in it; excess water flows out through the overflow point without replenishing the water table. Limits of the possible volume and layer of water intercepted by the focal depressions, which bound their ability to convert surface runoff into subsurface flow, were identified. The maximal layer depends on the catchment area of the depression and the height of its overflow point (Figure 6, right).

The water layer filling closed depressions during snowmelt in the forest-steppe zone is highly dynamic. From 2005 to 2021, the snowmelt water layer (snow water equivalent, SWE), reconstructed from the statistically corrected snow height and snow density data series, varied from 50 to 300 mm. The derived snowmelt water layer during this period has a bimodal distribution, with maxima at 50 and 200 mm SWE. Analysis of the meteorological data showed that closed depressions did not overflow during snowmelt in 60% of cases, on average, from 2005 to 2021. This corroborates the finding that in most cases closed depressions intercept the surface runoff and transfer it to groundwater. Frequencies of overflow were lowest for Ponds 2 and 4 (around 10%) and highest for Pond 9 (90%).

**Figure 5.** Rate of recession of the indicated ponds' water levels (**a**), daily evaporation rate (**b**), and infiltration rates of the ponds estimated from the mass balance, expressed in mm of the water layer (**c**). The numbering of the ponds in the color legend follows Figure 1 and Table 2.

**Figure 6.** (**a**) Boundaries of the maximum potential filling of closed depressions (up to their overflow heights) shown in white in a Planet Labs image from 4 April 2018, with color shading and magenta contours, and the maximum pond boundaries in 2021 from 6 April 2021 (green contours). In 2018, the investigated closed depressions were overflowing. (**b**) Shades of blue indicating the layers of water (in mm) that must enter the depressions from their catchment areas to completely fill them.

Figure 7 illustrates the simulations of the shallow groundwater level for cases with the recharge rate either constant or spatially varied along the transect. The parameter N/k was restricted by the requirement for correspondence between the simulated WTD and the range of WTD for soils of each type from expert knowledge (Table 2 in [22]) in distance intervals across the catena's whole toposequence. For example, if very poorly drained soils (under the depression bed) are present in M unit intervals, those in which WTD > 3 m (too deep) were counted with weight 0 and the others with weight 1. The same procedure was then applied for each of the intervals with the other soil types; the sums were added for all groups and scaled to the total number of unit intervals in the catena toposequence to obtain the accuracy in percent. Simulation of WTD was successful for the generalized transect, in terms of correspondence between the simulated WTD and the ranges of WTD obtained from the indirect soil indicators (redoximorphic features) and expert knowledge, in the cases of both constant and spatially variable recharge. However, the required accuracy threshold was set at 97% and was met only for the spatially variable recharge; a significantly lower threshold (80%) was satisfied in the constant-recharge case. Therefore, the method to estimate depression-focused infiltration proposed here can make the shape of the water table profile more realistic.
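The weighted accuracy scoring described above can be sketched as follows (a hypothetical illustration; the soil-group labels and WTD ranges are placeholders for the expert-knowledge ranges of Table 2 in [22]):

```python
import numpy as np

def wtd_accuracy(wtd_sim, soil_group, wtd_ranges):
    """Percent of unit intervals along the catena where the simulated water
    table depth (WTD) falls inside the expected range for the local soil group.

    wtd_sim    : simulated WTD per unit interval (m)
    soil_group : soil-group label per unit interval
    wtd_ranges : {group: (wtd_min, wtd_max)} expected WTD interval (m)
    """
    wtd_sim = np.asarray(wtd_sim, float)
    hits = 0
    for depth, group in zip(wtd_sim, soil_group):
        lo, hi = wtd_ranges[group]
        # Weight 1 if the simulated depth is plausible for this soil group.
        hits += lo <= depth <= hi
    return 100.0 * hits / len(wtd_sim)
```

E.g., with three unit intervals of which one simulated depth falls outside its group's range, the accuracy is 2/3 of 100%.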

**Figure 7.** Cross-section of the catena with water table depth (WTD, blue and red lines) adjusted to correspond to soil group [22]. The yellow and black lines indicate the position of the bedrock and DEM profile, respectively. The blue and red lines respectively indicate WTD obtained with infiltration along the transects estimated from the depressions' positions and specific infiltration rates (as indicated by the blue bars and right y axis) and constant infiltration along the transect.

#### **4. Discussion**

The word steppe is usually associated with the Russian plains, but the northern part of this ecoregion is notably similar to the North American prairies. The lithological and geomorphological similarity of the Tambov region to the Saskatchewan and Alberta provinces in Canada enables direct comparison of depression-focused infiltration into their soils through temporary ponds that are very similar in size distribution and shape. The recession rate of the ponds after snowmelt obtained in this study is similar to that derived from an artificial flooding experiment in the C24 depression, northwest of Calgary, Alberta, Canada, in 2004 [8]. As in the cited study [8] and another previous investigation [4], we found that evaporation accounts for a much smaller proportion of the pond water balance loss term than infiltration into the soil. The pre-event pore space available for filling with infiltration water was not directly measured in this study. However, data from a depression monitoring site in the study region in the years 2003–2005 show a spread of 30–400 mm of water deficit to saturation. An assumption underlying our two-stage infiltration conceptual model is that pores of the soils below the bottom of a pond are all filled to saturation during the first stage, down to the shallow groundwater depth (approximately 2 m). Thus, the inflow is restricted by the bottleneck hydraulic conductivity below this point, which is an order of magnitude lower than in the upper soil layers [8], and also by the gradual rise of the water table as the two water fronts merge. The soil refill amounts of 34 to 172 mm recorded in Table 2 fit well into this range. There is also similarity with the refill amount (148 mm) obtained in the cited Canadian study [8]. A strength of our study is that the infiltration rate was estimated for nine ponds, not just one pond such as the well-studied experimental pond C24.
The four-fold variation in infiltration between those ponds (Table 2) could not be explained by pond size or topographical settings. Thus, it is not sufficient to apply infiltration data from one pond to other ponds, as this leads to large errors. The differences are likely due to diverse factors, inter alia the physical properties of the soil associated with its lithological and textural characteristics, the thawing rate and ice content, and the abundance of root channels and other pathways for preferential flow. We conclude that there is no straightforward analytical way to characterize this spatial variability, but use of the data obtained by the methods proposed here in conjunction with appropriate hydrological models and high-resolution satellite images is highly promising.

Here, we used the steady-state continuity equation in kinematic wave form, parameterized using expert knowledge of the links between typical water table depth (WTD) and redoximorphic features of soils with different degrees of hydromorphy [22]. In this simulation, we were able to account for variation in infiltration rates in the catena using real data on the depressions' positions within the transect. However, calibration was still necessary because infiltration and recharge are separated in time by unsaturated-zone processes. In future research, we plan to develop a model conceptually similar to the VSMB Depression-Upland System (VSMB-DUS) model [8] using data acquired in investigations of the surface water–groundwater interaction in individual depressions and their catchments. The planned model will be based on the watershed hydrological WASA-SED model [53], which already discretizes focal watersheds into hierarchical levels (sub-basins, land units, terrain components, soil-vegetation components). Land units are representative catenas, and terrain components can be easily supplemented with depressions and uplands providing surface flow to them by an already incorporated horizontal flow mechanism. For the terrain components prescribed as depressions, the temporally varying fluxes obtained by the method developed here will be used as upper boundary conditions. Collection of field data is planned to obtain saturated hydraulic conductivity values for the vertical levels besides the bottom soil layer. Groundwater depth measurements will provide calibration for the drainage rates from the deepest soil layer and validation for the dynamic version of the WASA-SED shallow groundwater flow sub-model. In this manner, groundwater recharge will fully account for the spatial variability of depression density, such as the prevalence of areas with numerous depressions at the water divide.

In this study, we derived the saturated hydraulic conductivity, not for a single point, but aggregated for the area of depressions. Most grid data used represent points, but landscape-level data are essential inputs for a hydrological model. A hypothesis under test is that soil hydraulic properties are related to landscape position and topography [54]. If so, elucidation of these relationships could greatly enhance pedotransfer functions for estimating saturated hydraulic conductivities at the level of land units and terrestrial components, not just points. Our study, based on remote sensing, provides an example of such derivation because the hydraulic conductivity is based on the depressions' water balance accounting for their positions in the landscape.

A limitation of this study lies in the assumption that all snowmelt runoff from the upland was routed to the depressions before the initial day of the study, and that the water volume within each depression exceeding its maximum storage capacity overflowed directly into surface runoff without contributing to infiltration into the soil. However, it is widely acknowledged that depressions tend to form fill-spill networks, in which overflow from one depression feeds an adjacent depression [5,6]. This process can be modeled [6,10,15–17], but studies of fill-spill processes have primarily focused on the effects of depression storage on surface flow to streams rather than on depression-focused groundwater recharge. Visual observations during the hydrological phase after the most active snowmelt showed no signs of connectivity between depressions at our study site, although such connectivity was typical of the active snowmelt phase itself, which lasted about a week. We justify this restriction of our approach with the hypothesis that the non-stationary volumes of the depression ponds while snow is still present contribute little to total infiltration, partly due to the frozen state of the soil.

#### **5. Conclusions**

Estimation of infiltration through ponds is an important step toward the challenging goal of estimating depression-focused recharge of groundwater, and thus evaluating this important resource, in the forest-steppe zone of Russia. Using high-resolution Planet Labs images and widely evaluated tools for object-based image recognition, we have developed a relatively simple method to reconstruct a time series of infiltration into the soil under ponds and to estimate landscape-scale saturated hydraulic conductivity. The simulation of the steady-state groundwater profile for the topographical transect fed with data on relative water supplies through depressions along the transect was more consistent with observations (based on soil redoximorphic indicators of the water level) than the simulation fed with a uniform recharge function. Further development is needed to assimilate the generated data, with consideration of the spatial variability of pond infiltration, into a process-based model of groundwater recharge that accounts for interactions between depressions and their catchments.

**Author Contributions:** Conceptualization, A.Y.Y. and P.P.F.; methodology, P.P.F. and A.Y.Y.; software, P.P.F., A.D. and A.Y.Y.; validation, P.P.F. and A.D.; formal analysis, A.Y.Y.; investigation, P.P.F. and A.Y.Y.; resources, P.P.F.; data curation, P.P.F.; writing—original draft preparation, P.P.F., A.D. and A.Y.Y.; writing—review and editing, A.Y.Y. and D.K.; visualization, P.P.F.; supervision, D.K.; project administration, D.K.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the RFBR, grant number 19-29-05277.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are openly available in Fil, Pavel; Yurova, Alla (2021), "Water bodies recognition for depression-focused recharge. Tambov region, Russia", Mendeley Data, V1, doi:10.17632/pn3gdzhdy4.1.

**Acknowledgments:** We are grateful to three anonymous reviewers for the useful comments that have helped to improve the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. de Vries, J.J.; Simmers, I. Groundwater recharge: An overview of process and challenges. *Hydrogeol. J.* **2002**, *10*, 5–17. [CrossRef]

2. Hayashi, M.; Van Der Kamp, G.; Schmidt, R. Focused infiltration of snowmelt water in partially frozen soil under small depressions. *J. Hydrol.* **2003**, *270*, 214–229. [CrossRef]


## *Article* **A Wireless Sensor Network Deployment for Soil Moisture Monitoring in Precision Agriculture**

**Jaime Lloret \*, Sandra Sendra, Laura Garcia and Jose M. Jimenez**

Instituto de Investigación Para la Gestión Integrada de Zonas Costeras (IGIC), Universitat Politècnica de València, Paraninf 1, 46730 Valencia, Spain; sansenco@upv.es (S.S.); laugarg2@teleco.upv.es (L.G.); jojiher@dcom.upv.es (J.M.J.)

**\*** Correspondence: jlloret@dcom.upv.es; Tel.: +34-6-0954-9043

**Abstract:** The use of precision agriculture is becoming increasingly necessary to provide food for the world's growing population, as well as to reduce environmental impact and enhance the use of limited natural resources. One of the main drawbacks that hinder the adoption of precision agriculture is the cost of technological immersion in the sector. It is necessary to provide farmers with low-cost, robust, and reliable systems. Toward this end, this paper presents a wireless sensor network of low-cost soil moisture sensor nodes that can help farmers optimize irrigation processes in precision agriculture. Each wireless node comprises four soil moisture sensors that measure moisture at different depths. Each sensor consists of two coils wound onto a plastic pipe. The sensor operation is based on mutual induction between the coils, which allows monitoring the percentage of water content in the soil. Several prototypes with different features have been tested. The prototype that offered the best results has a winding ratio of 1:2, with 15 and 30 turns, working at 93 kHz. We have also developed a specific communication protocol to improve the performance of the whole system. Finally, the wireless network was tested in a real cultivated plot of citrus trees in terms of coverage and received signal strength indicator (RSSI) to check losses due to vegetation.

**Keywords:** electromagnetic induction; soil moisture; precision agriculture; low cost; water management; Internet of Things (IoT); wireless sensor network

#### **1. Introduction**

Given the basic need to provide food to the world's population, it is necessary to introduce technology into the agricultural sector to reduce the environmental impact caused by crops and to improve the conservation of natural resources, among other goals [1]. Efficient irrigation is one of the essential factors for developing sustainable agriculture, especially in arid and semi-arid regions, where water limitations are greatest. Irrigation methods can be classified into three generic categories: (1) gravity irrigation, (2) sprinkler irrigation, and (3) drip irrigation. Gravity irrigation is the oldest method and the least efficient for the conservation of natural resources. Regardless of the method used, determining the specific irrigation needs of crops requires deploying sensing devices to obtain data such as soil moisture.

Precision agriculture is a concept that appeared in the USA in the 1980s. It is a management strategy that supports decision making to improve farming productivity and achieve a more sustainable activity. It is based on managing crops by observing, measuring, and responding to the variability of the many factors that affect them. Using Internet of Things (IoT) solutions, the soil where the crops are planted can be monitored to make decisions and perform more effective irrigation. These solutions may include not only the electronic devices deployed in the fields but also the use of vehicles such as drones to support the network [2] and to manage the use of pesticides on the crops [3,4]. However, in crop monitoring tasks, especially those involving fruit trees, it is important to control soil moisture levels accurately. For the correct development of a fruit tree, it is necessary to ensure that the roots have the right moisture levels. High humidity levels can facilitate the proliferation of fungi in the roots and leaves, thus affecting production. Conversely, an extremely low soil moisture level can cause the soil to crack, breaking roots and killing the tree. This negatively affects the growth of plants and, consequently, their production.

**Citation:** Lloret, J.; Sendra, S.; Garcia, L.; Jimenez, J.M. A Wireless Sensor Network Deployment for Soil Moisture Monitoring in Precision Agriculture. *Sensors* **2021**, *21*, 7243. https://doi.org/10.3390/s21217243

Academic Editors: Vicenç Puig and Miquel À. Cugueró-Escofet

Received: 26 September 2021 Accepted: 27 October 2021 Published: 30 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

One of the main drawbacks that hinder the use of precision agriculture is the cost of the sensors and the utilized technology. For farmers that want to use technology on a massive scale, it is necessary to provide low-cost systems to make easier deployments.

The available commercial sensors for soil monitoring use different methods to assess the water content of the soil. The most relevant existing methods for obtaining soil moisture values [5] are the gravimetric, tensiometric, neutronic, gamma-ray attenuation, dielectric, Wenner (resistive), and infrared light methods [6]. Generally, low-cost sensors used to measure soil moisture are conductivity-based and rely on two electrodes [1]. These types of sensors have two fundamental disadvantages: a lack of reliability and a lack of durability. On the one hand, depending on the type of soil and its salt content, the conductivity measurement can vary even when the amount of water in the soil remains constant. On the other hand, the electrodes must be in contact with the ground and can consequently suffer rapid deterioration. Inductive sensors are also employed to measure soil moisture; however, existing designs do not integrate them into a sensor node capable of reading the parameters.

The network design is an important aspect to consider as well. Fields are usually located in remote areas that may lack access to internet infrastructure and the power grid. Therefore, PA systems should include a form of energy harvesting, such as solar panels, and the characteristics of these networks should be considered when designing the deployment of sensing devices [7]. Wireless communications are a good solution because they eliminate the cost and hindrance of deploying cabled networks over extensive areas where machinery is used. However, the foliage of the crops affects the quality of the signal, resulting in reduced coverage between devices. It is therefore necessary to determine the optimal deployment design for the area of interest according to the type of crop and the size of the field. Furthermore, the available protocols may not provide all the functionalities desired for a particular crop and the resources available in the area.

In this paper, we present a group-based wireless sensor network to efficiently irrigate cultivated lands. The network is composed of both actuator and sensor nodes that collect data from the soil and activate different irrigation systems as a function of the plot's needs. Additionally, we design a new soil moisture sensor able to measure the water content in the root ball of a tree. The design includes the sensor and the power circuit required to generate the bi-phase signals that power the coils. The paper presents the design of the operation algorithm and the message exchange for the efficient use of water. Finally, the entire system is tested in a real environment to check its correct operation in terms of soil moisture measurements and network performance.

The rest of the paper is structured as follows. Section 2 presents previous and related works where soil moisture systems are developed. Section 3 presents an overall description of our proposed sensor, the features of the different coils used to develop our soil moisture sensor, and the experimental tests performed with the coils. This section also includes the power circuit in charge of generating the required signals, as well as the integration of both the sensor and the power circuit with an ESP32 module. Section 4 explains the network operation algorithm and the message exchange between nodes. In Section 5, the tests performed in a real environment are shown. Section 6 presents the conclusions and future work.

#### **2. Related Work**

In this section, we summarize some previous works related to our proposal. The gap in current solutions for soil moisture monitoring is also identified.

Authors such as Ojha et al. [8] present a study analyzing wireless sensor network (WSN) implementations for various agricultural applications. Garcia et al. [1] present a survey summarizing the current state of the art regarding smart irrigation systems and schemes for Internet of Things (IoT) irrigation monitoring; this survey reviews more than 100 scientific works. Other authors, such as Susha Lekshmi et al. [9], review the techniques employed for soil moisture measurement, highlighting their limitations and the influence of soil parameters. Tumanski [10] describes the use of coils to develop sensors; the work compares, summarizes, and analyzes coil design methods and the frequency properties of coils, as well as coil sensor applications such as magnetic antennas. Jawad et al. [11] describe applications of WSNs in agricultural research and classify and compare wireless communication protocols, the taxonomy of energy efficiency, and energy harvesting techniques for WSNs used in agricultural monitoring systems. They also explore the challenges and limitations of WSNs in agriculture, highlighting energy reduction and agricultural management techniques for long-term monitoring. Hamami et al. [12] review the application of WSNs in the field of irrigation. Mekonnen et al. [13] review the application of different machine learning algorithms to the analysis of sensor data observed using WSNs in agriculture. In addition, they analyze a case study on a smart farm prototype, based on IoT data, as an integrated food, energy, and water (FEW) system. Nabi et al. [14] present a comparative study of different systems to provide deeper insight into their implementations. They also present a study of apple disease prognostic systems, highlighting their key characteristics and drawbacks.
The result of their study can be used to select appropriate technologies to build a WSN-based system, optimized for precision apple cultivation, which will help farmers avoid the ravages caused by disease outbreaks.

Kabashi et al. [15] present a framework to design WSNs for agricultural monitoring in developing regions, taking into account the particularities of said environments. They propose new solutions and research ideas for sensor network design, including zone-based joint topology control and power scheduling mechanism, multi-sink architecture with complementary routing associated with backlink/storage, and a task scheduling approach with parameter, energy, and environment recognition. Authors such as Kassim et al. [16] present WSNs as the best way to solve agricultural problems related to optimization of agricultural resources, decision support, and land monitoring in order to perform those functions in real time. They explain in detail the hardware architecture, network architecture, and software process control of the precision irrigation system. García et al. [7] study different WSN deployment configurations for a soil monitoring PA system, to identify the effects of the rural environment on the signal and to identify the key aspects to consider when designing a PA wireless network. The PA system is described, providing the architecture, the node design, and the algorithm that determines the irrigation requirements. The results of their testbed show high variability in densely vegetated areas. These results are analyzed to determine the theoretical maximum coverage for acceptable signal quality for each of the studied configurations. Furthermore, there are aspects of the rural environment and the deployment that affect the signal. Zervopoulos et al. [17] present the design and deployment of a WSN capable of facilitating the sensing aspects of smart and precision agriculture applications. They describe a simple synchronization scheme, which was installed in an olive grove, to provide time-correlated measurements using the receiving node's clock as a reference. 
The obtained results indicate the general effectiveness of the system, although the authors observe a difference in the time correlation of the acquired measurements. Bayrakdar [18] investigated an intelligent insect pest detection technique with underground wireless sensor nodes for precision agriculture using a mathematical simulation model. To evaluate performance, he examined the received signal strength and path loss parameters, and observed that depth-based communication in wireless underground sensor networks requires transmitting signals at different power levels.

Other authors study the application of WSNs to monitor specific crops. Khedo et al. [19] describe the implementation of the PotatoSense application for precision agriculture with WSNs, used to monitor a potato plantation field in Mauritius. They employ different energy-efficiency algorithms to prolong the life of the system. Additionally, they developed a monitoring application to process the data obtained from the simulated WSN. Rasooli et al. [20] propose using WSNs and IoT to help increase wheat and saffron production in Afghanistan, anticipating the ability to control crop condition and growth as well as to monitor soil, temperature, humidity, and other environmental parameters.

Some authors propose the observation of parameters utilizing WSNs in greenhouses. Chaudhary et al. [21] propose and discuss the use of the programmable system on chip technology (PSoC) as part of the WSN to monitor and control various greenhouse parameters. Srbinovska et al. [22] propose a WSN architecture for vegetable greenhouses, in order to achieve scientific cultivation and reduce management costs from the aspect of environmental monitoring. They have designed a practical and low-cost greenhouse monitoring system based on wireless sensor network technology to monitor key environmental parameters such as temperature, humidity, and lighting.

There are also authors studying energy savings in WSNs used for agricultural monitoring. Hamouda et al. [23] study the problem of selecting the sampling interval for precision agriculture using WSNs, motivated by the energy limitations that arise when deploying sensors in WSNs. They propose a Variable Sampling Interval Precision Agriculture (VSI-PA) system to measure and monitor agricultural parameters for appropriate agricultural activities, such as water irrigation. Compared to fixed-sampling-interval schemes, the proposed VSI-PA system provides a significant improvement in energy consumption while maintaining a small variation in soil moisture, regardless of soil temperature values. Qureshi et al. [24] propose a Gateway Clustering Energy-Efficient Centroid (GCEEC)-based routing protocol, in which the cluster head is selected from the centroid position and gateway nodes are selected from each cluster. Evaluated against state-of-the-art protocols, the proposed protocol performed better and provided more feasible WSN-based monitoring of temperature, humidity, and lighting in the agricultural sector.

Table 1 summarizes different previous studies, carried out by other authors, regarding the use of WSNs in soil monitoring for agriculture.


**Table 1.** Previous studies regarding the use of WSNs in soil monitoring for agriculture.

Regarding the available sensors for soil monitoring, there are works, such as [25], that study farmed podzolic soils, since these types of soils are under-represented in the relevant literature. In that study, the authors established the relationship between apparent electrical conductivity (ECa) and soil moisture content (SMC), and evaluated SMC estimated from ECa measurements obtained with two electromagnetic induction (EMI) sensors. They concluded that ECa measurements obtained through multi-coil or multi-frequency sensors had the potential to be successfully used for field-scale SMC mapping. Others, such as [26], designed and manufactured an integrated passive wireless sensor to monitor moisture in sand. The sensor was made of a printed spiral inductor embedded within the sand and contained an inductive-capacitive (LC) resonant circuit. The authors measured the level of internal moisture by monitoring the resonance frequency using a sensing coil. Kizito et al. [27] presented a study where ECH2O sensors were used to measure soil moisture content, bulk electrical conductivity, and temperature for a range of soils, across measurement frequencies between 5 and 150 MHz. The authors affirmed that the measurements carried out on soil were accurate enough when working at 70 MHz. Finally, Nor et al. [28] discussed the development of a low-cost sensor array based on planar electromagnetic sensors to determine nitrate and sulfate contamination levels in water sources. The authors proposed three types of sensor arrays: parallel, star, and delta. According to their experiments, the star sensor array showed the highest sensitivity.

After analyzing these and many other works not included in this paper, we conclude that our work improves upon existing systems. Few, if any, of the reviewed works present complete systems that can be easily integrated into commercial nodes such as Arduino or similar platforms, and many of them use working frequencies that are too high (on the MHz scale), which makes it difficult to develop a simple and inexpensive signal-generator circuit. Our proposal aims to take a step beyond the current state of the art by presenting a complete system, consisting of a coil-based sensor whose working frequency is around 93 kHz and a power circuit that can be easily integrated into commercial modules, for the development of a more complex wireless sensor network to monitor a large-scale crop.

#### **3. Network Nodes Description**

This section describes the proposed system and its constituent parts. Additionally, it presents the features of the different coils used to develop our soil moisture sensor, as well as the experimental tests performed to determine the best prototype.

#### *3.1. Overall System Description*

When developing complete monitoring systems for precision agriculture, it is important to take different aspects into account. On the one hand, agriculture is an essential activity for the survival and development of society; this is evidenced by the number of global- and regional-scale agricultural monitoring systems [29] that assess crop growing conditions, crop status, and the agro-climatic conditions that may impact the global production of any type of crop. Some examples are the Group on Earth Observations Global Agricultural Monitoring Initiative (GEOGLAM) [30] and CropWatch [31], among others.

On the other hand, it is necessary to know the kind of crop to be grown in order to design monitoring methods adapted to it. Considering the crop to monitor and the location of the plot, the network should use a specific wireless communication technology. Currently, it is possible to use cellular technologies by paying for a service subscription, or to use low-power technologies such as ZigBee, LoRa, LoRaWAN, Bluetooth Low Energy (BLE), or Sigfox, among others; most of these do not require payment for using their communication network infrastructure [32]. However, Wi-Fi continues to be the wireless technology par excellence for developing wireless sensor networks. Although its energy consumption is still high, it allows transmitting any type of content without the bandwidth limitations that other technologies present. In addition, it is a widely studied standard, so it is easy to develop new optimized protocols. Therefore, with a well-designed power system based on renewable energies, it is possible to develop a Wi-Fi-based agriculture monitoring network with very interesting properties.

Finally, completing the system design requires precisely defining the parameters to be monitored, since this determines the type of sampling and analysis to perform. After that, data interpretation and scoring curves help define the correct operation of our actuator network system. Lastly, the correct processing of the collected data helps assess soil health and its characteristics to determine whether they are optimal for our crop.

Therefore, considering these issues, we propose the development of a group-based wireless sensor network for soil moisture monitoring in precision agriculture. The network is composed of a set of nodes with different roles and functions. Some nodes collect data from the environment, particularly soil moisture and the other parameters required to ensure the correct development of a tree (see Figure 1). The rest of the nodes have actuators to control the activity of ditch gates and drip irrigation elements. Thus, there are three different sets of nodes that communicate among themselves. Additionally, sensor nodes provide data to the actuator nodes, performing the required computation and decision making at the edge. Edge computing is recommended in scenarios where the nodes in the network are able to analyze data and make decisions. It enables data produced by Internet of Things (IoT) devices to be processed close to where they are created rather than being sent over long distances to data centers and computing clouds. One of the fundamental advantages of this type of computing is that it allows important data to be analyzed in near real time [33]. In citrus groves, trees are commonly planted in rows separated by approximately 6 m, although denser plantations with a minimum separation of 4.5 m are possible. The minimum depth that citrus tree roots usually reach is 45 cm. Considering these facts, and taking into account that fields can have different extents, different sensor node topologies can be created. An important aspect is to ensure complete coverage between nodes to guarantee stable communication. A distributed ad hoc network is optimal for this kind of scenario.

**Figure 1.** Proposed group-based network.

One of the main characteristic aspects of this proposal is its hierarchical, layered structure, in which each layer has a series of nodes that, if necessary, can change their role. That is, all sensor and actuator nodes are wireless devices with the ability to act as packet relays. If a node fails, communications can be rerouted through other nodes of the same layer. If several nodes fail and one of them becomes isolated but remains active, it can use nodes of the upper or lower layer as an alternative path to carry out communications. However, these nodes will only forward packets to nodes of the isolated node's layer.

To deal with the failure of a sensor or actuator node, it is convenient to establish an alarm system based on keep-alive messages. This task is scheduled periodically, once per day. Such a long period is acceptable because field irrigation is not a time-critical task. If one or several nodes do not respond to these requests, the system considers them to be down.
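As an illustrative sketch only (not the authors' implementation; the daily period and the one-missed-message tolerance are assumptions for the example), the keep-alive bookkeeping described above could look like this:

```python
def nodes_down(last_seen, now, period_s=86400, missed_allowed=1):
    """Return the nodes whose last keep-alive reply is older than the
    allowed window of period_s * (missed_allowed + 1) seconds.
    last_seen maps node identifiers to reply timestamps (seconds)."""
    window = period_s * (missed_allowed + 1)
    return [node for node, t in last_seen.items() if now - t > window]
```

With a daily keep-alive and one missed reply tolerated, a node that has been silent for more than two days is flagged as down.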

Additionally, a low-cost system was required to measure moisture at different soil depths. This system consists of four coil-based sensor elements equally distributed along 60 cm. The coils are connected to a processor module in charge of collecting the data and wirelessly sharing them with the rest of the nodes of its group. Finally, considering the moisture values collected by the sensor nodes, the actuator nodes enable or disable the ditch gates or the drip irrigation.

When talking about moisture or soil humidity, we refer to the amount of water the soil contains. The gravimetric analysis method compares the mass of dry soil with the mass of watered soil (which will always be higher). The moisture percentage is obtained by dividing the difference between these two values by the mass of dry soil. If there is no difference, the moisture is 0%; conversely, when the mass of the watered soil doubles that of the dry soil, the moisture level is 100%.
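The gravimetric calculation above can be written directly as a minimal sketch (the mass values in the example are illustrative, not measurements from the study):

```python
def gravimetric_moisture_pct(wet_mass_g, dry_mass_g):
    """Gravimetric soil moisture: 100 * (m_wet - m_dry) / m_dry."""
    if dry_mass_g <= 0:
        raise ValueError("dry mass must be positive")
    return 100.0 * (wet_mass_g - dry_mass_g) / dry_mass_g
```

For instance, 240 g of watered soil that dries to 200 g gives 20% moisture, and a watered mass exactly double the dry mass gives 100%.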

The development of our coil-based soil moisture sensor is based on the principle of electromagnetic induction of the coils and how it varies as a function of the type of core the coil has inside [34–36].

The soil moisture sensor is composed of two solenoid coils wound on the same PVC pipe support. Coil 1 receives the sinusoidal signal generated by a power circuit based on the ICM7555 integrated circuit. Coil 1 induces a current in Coil 2 that is largely affected by the content of the coil core, since the magnetic field depends on the type of soil and the water content inside it. Finally, this current is measured, collected, and stored by an electronic module. In our case, an ESP32 DevKit module [37] with an integrated Wi-Fi interface has been chosen. Figure 2 shows the diagram of the proposed soil moisture sensor.

**Figure 2.** Diagram of the proposed soil moisture sensor based on coils.

Since this kind of module usually provides only one or two analog inputs, we also propose the use of a four-input analog multiplexer controlled by two digital outputs. With this, our system is able to take measurements from the four soil moisture sensors.
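The select-line logic for such a four-input multiplexer is simple binary encoding; the following is a hedged sketch (the pin naming is hypothetical, not taken from the authors' schematic):

```python
def mux_select_bits(channel):
    """Map a sensor channel (0-3) to the two digital select lines
    (S1, S0) of a four-input analog multiplexer."""
    if not 0 <= channel <= 3:
        raise ValueError("four-input multiplexer: channel must be 0-3")
    return (channel >> 1) & 1, channel & 1
```

On the ESP32, the two returned bits would be written to two GPIO pins before reading the shared analog input, cycling through the four soil moisture sensors in turn.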

#### *3.2. Soil Moisture Sensor Based on Coils*

As mentioned before, it is possible to develop soil moisture sensors based on several physical principles and chemical processes. However, we want to use a method based on a physical principle: the variation of the electromagnetic flux as a function of the nature of the coil core.

In a coil arrangement such as the one shown in Figure 3, coil 1 generates a magnetic field that affects coil 2. This effect is known as mutual inductance and refers to the electromotive force (EMF) induced in one coil due to a change of current in a nearby coupled coil. The induced EMF is described by Faraday's law, and its direction always opposes the change in the magnetic field produced by the coupled coil (Lenz's law). The EMF in coil 1 (left) is due to its own inductance L.

**Figure 3.** Principle of operation for our developed sensor.

The induced EMF in coil 2, generated by the changes of current I1, can be expressed as (see Equation (1)):

$$emf\_2 = -N\_2A\frac{\Delta B}{\Delta t} = -M\frac{\Delta I\_1}{\Delta t} \tag{1}$$

where *N*<sub>2</sub> is the number of turns of coil 2, *M* is the mutual inductance coefficient, *A* is the cross-sectional area of the coil, Δ*B*/Δ*t* is the variation of the magnetic field with time, and Δ*I*<sub>1</sub>/Δ*t* is the variation of the current in coil 1 with time. The mutual inductance (*M*) can be defined as the ratio between the electromotive force (EMF) generated in coil 2 and the change in the current in coil 1 that causes that EMF. *M* is strongly affected by the characteristics of the medium surrounding the coils, usually expressed by its magnetic permeability.
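
As a numerical illustration of Equation (1), the sketch below checks that the flux-based and mutual-inductance forms of the induced EMF agree when *M* is defined consistently; all component values are made-up examples, not measurements from the paper.

```python
# Equation (1): the EMF induced in coil 2 can be written either via the
# flux-density change (-N2*A*dB/dt) or via the mutual inductance
# (-M*dI1/dt). Illustrative values only.

N2 = 200          # turns of coil 2
A = 3.1e-4        # cross-sectional area (m^2), ~20 mm diameter
dB = 2.0e-3       # change of magnetic flux density (T)
dI1 = 0.05        # change of primary current (A)
dt = 1.0e-5       # time interval (s)

emf_flux = -N2 * A * dB / dt        # -N2 * A * dB/dt  -> -12.4 V
M = N2 * A * dB / dI1               # consistent mutual inductance (H)
emf_mutual = -M * dI1 / dt          # -M * dI1/dt, same value
```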

Since it is difficult to measure the magnetic permeability of the soil core as a function of the moisture level, two theoretical approximations for the air core are introduced [38]. Based on Equation (2) (which gives the coil inductance), we can state Equation (3), where *l* is the length of the coil, *r* is the radius to the center of the innermost layer of the conductor, and *R* is the radius of the outermost layer.

$$L = \frac{\Phi N}{I} = \frac{\mu N^2 A}{l} = \frac{\mu\_0 \mu\_r N^2 \pi r^2}{l} \text{ (H)}\tag{2}$$

$$L\_{layer} = \frac{N^2 r^2}{2.54(9r + 10l)} \ (\mu\text{H}) \tag{3}$$

where *L* is the inductance of our coil (in H), Φ is the magnetic flux (in Wb), *N* is the number of turns (dimensionless), *l* is the length of the coil (in m), *r* is the radius of the coil's inner layer (in m), *R* is the radius of the coil's outer layer (in m), *A* is the area of the coil's cross-section (in m²), *μ*<sub>0</sub> is the magnetic permeability of free space (in H/m), and *μ<sub>r</sub>* is the relative magnetic permeability of the medium (dimensionless).
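
A short sketch of Equations (2) and (3), assuming that the Wheeler-style approximation of Equation (3) takes *r* and *l* in centimetres (the usual form of that formula); the coil dimensions below are illustrative, not those of the paper's prototypes.

```python
import math

# Air-core inductance of a single-layer solenoid, computed two ways:
# the ideal-solenoid formula of Equation (2) and the Wheeler-style
# approximation of Equation (3) (r and l assumed in cm, result in uH).

MU0 = 4 * math.pi * 1e-7  # magnetic permeability of free space (H/m)

def inductance_ideal(n_turns, radius_m, length_m, mu_r=1.0):
    """Equation (2): L = mu0*mu_r*N^2*pi*r^2 / l, in henries."""
    return MU0 * mu_r * n_turns**2 * math.pi * radius_m**2 / length_m

def inductance_wheeler(n_turns, radius_cm, length_cm):
    """Equation (3): L = N^2 r^2 / (2.54*(9r + 10l)), in microhenries."""
    return n_turns**2 * radius_cm**2 / (2.54 * (9 * radius_cm + 10 * length_cm))

# Example: 100 turns, radius 2.5 cm, length 10 cm
L_ideal_uH = inductance_ideal(100, 0.025, 0.10) * 1e6    # ~246.7 uH
L_wheeler_uH = inductance_wheeler(100, 2.5, 10.0)        # ~200.9 uH
# For a long, thin coil the two estimates agree to within ~20%;
# Wheeler's form additionally accounts for end effects.
```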

This approximation allows estimating the circuit components for an air core, which would behave similarly to a core of pure water; the value will then vary depending on the type of soil, its composition, and the moisture level of the soil inside the coil. The resonance peak of our coils can be calculated with Equation (4).

$$f\_r = \frac{1}{2\pi} \sqrt{\frac{1}{LC\_d} - \frac{R\_S^2}{L^2}} \approx \frac{1}{2\pi \sqrt{LC\_d}} \text{ (Hz)}\tag{4}$$

where *fr* is the resonance frequency (in Hz), *Cd* is the coil's parasitic capacitance (in F), *L* is the coil's inductance (in H), and *Rs* is the coil's resistance (in Ω).
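
Equation (4) can be evaluated directly. With illustrative values of *L* ≈ 200 µH and *Cd* ≈ 14.6 nF (assumed here, not the paper's measured values), the resonance lands near the 93 kHz reported later for the selected prototype, and the simplified form is seen to be accurate when the series resistance is small.

```python
import math

# Resonance frequency of Equation (4) for an RLC model of the coil.

def resonance_hz(L, Cd, Rs=0.0):
    """f_r = (1/2pi) * sqrt(1/(L*Cd) - Rs^2/L^2), in Hz."""
    return math.sqrt(1.0 / (L * Cd) - (Rs / L) ** 2) / (2 * math.pi)

L, Cd = 200e-6, 14.6e-9                      # illustrative values
f_full = resonance_hz(L, Cd, Rs=5.0)         # full expression
f_approx = 1.0 / (2 * math.pi * math.sqrt(L * Cd))  # simplified form, ~93 kHz
# For a few ohms of winding resistance the two differ by well under 1%.
```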

We should take into account that the primary and secondary coils will have different resonance frequencies because they have a different number of turns. However, our sensor only needs to detect changes in the induced current due to the presence of a changing medium; finally, we relate this current value to the amount of water content in the soil.

Equations (1)–(3) are theoretical approximations that show how the coil inductance, and hence the mutual inductance, depends on the coil geometry (length, radius, and number of turns) for single-layer coils. Equation (4) helps us design a resonant circuit that achieves the maximum power transfer. It is very important to consider the possible parasitic capacitance arising from the coil geometry and the working frequency.

In previous work [39], we performed several tests with different combinations of coils, varying the number of turns, the turn ratio between coils, and the diameter. The best results were obtained for a 1:2 turn ratio with a medium number of turns and a larger diameter. For a fixed diameter, reducing the number of turns increased the working frequency; for a fixed number of turns, increasing the coil diameter decreased the working frequency. In addition, a simple and cheap electronic system was required to generate the signals, which favours a sensor with the lowest possible working frequency. Therefore, we fixed parameters such as the type of copper wire and the number of turns and varied only the diameter of the coils.

In developing our coils, 0.6 mm enameled copper wire was used. The process consists of winding the copper wire along a cylinder, forming two solenoids. The distance between the primary and secondary coils is 5 mm. Figure 4 shows the developed coils with a single layer of turns. Table 2 shows the physical features of each prototype.

**Figure 4.** Coils used in our developed sensor: (**a**) P1, coil of 50 mm; (**b**) P2, coil of 32 mm; (**c**) P3, coil of 20 mm.


**Table 2.** Prototypes to measure soil moisture.

The test procedure consists of introducing each coil model into a container filled with dry, compacted soil and observing the behavior of the output voltage as a function of the amount of water added. For each moisture level, a frequency sweep is carried out to find the frequency at which the induced voltage peaks; this is the sought resonance frequency. After that, the linearity of each model is analyzed. Each test uses 4000 g of soil, and water is added in increments of 250 mL per moisture level, up to 1000 mL. After water is added, the sample is left to rest for an hour so that the sample inside the coil becomes homogeneous. Specifically, five levels of water content in soil are measured: 0%, 6.25%, 12.5%, 18.75%, and 25% (see Table 3). For this type of soil, a water content of 25% corresponds to completely flooded land. Measurements were taken at 25 °C.
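
These levels follow from the gravimetric ratio of added water (assuming ≈1 g per mL) to the 4000 g of dry soil; a minimal check under that assumption:

```python
# Gravimetric water content for the test samples: increments of 250 mL
# of water (~250 g each) added to 4000 g of dry soil, up to 1000 mL.

soil_g = 4000
levels = [100 * (250 * i) / soil_g for i in range(5)]
# -> [0.0, 6.25, 12.5, 18.75, 25.0] percent water content
```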


**Table 3.** Samples used during the tests.

In order to determine in which types of soil our sensor can be used, we looked for the configuration presenting the greatest linearity, i.e., one in which an increase in the moisture percentage produces a proportional increase in output voltage without instabilities.

#### *3.3. Experimental Results with the Developed Coils*

This subsection presents the tests performed to determine the most suitable prototype for the system, followed by tests of the selected coil with different types of soil and different moisture levels.

Regarding the types of soil, soils are usually made up of different proportions of sand, silt, and clay, each with its own morphological characteristics:


In order to perform our tests, we have selected three different types of soils, i.e., sand from the beach, soil from cultivated land, and commercial universal substrate.

Beach sand is formed by sediments from rocks and other marine debris such as shells, corals, animal remains, and algae, as well as sand carried by rivers into the sea. Erosion by water and wind, by rain and waves, and by temperature differences tends to reduce the grain size of the sand.

The soil from cultivated land is usually made up of an organic fraction, organic matter more or less degraded into humus and humic and fulvic acids. These elements provide the fertile part of the earth. The rest of the soil is considered as physical support. Some farmlands have a high degree of clay which also intervenes in ion exchange and water retention, facilitating the release of fertile elements according to the needs of the plants.

The raw materials used in the manufacture of a commercial universal substrate are usually blonde peat from sphagnum moss, coconut fiber, compost, perlite, organic fertilizer, mineral fertilizer, algae extract, etc. In addition, this type of soil usually contains a high level of aeration.

The measurements were performed with three identically constructed sensors placed simultaneously in three samples of each soil. The results shown in our graphs are the average of the three collected measurements, which in all cases were identical.

In order to perform the test, the primary coil is powered with a 7 Vpp wave with positive and negative values; for example, sine or square waves such as the one shown in Figure 5 can be used. In Figure 5, the blue trace is the signal used to power the primary coil, while the resulting induced signal is shown in yellow.

**Figure 5.** Example of generated and obtained signals.

Figure 6 shows the preliminary results obtained with the three coils, namely the resonance frequency and the maximum voltage value of each prototype.

**Figure 6.** Resonance frequency obtained for each prototype.

Table 4 shows the resonance frequency values (in kHz) of the developed prototypes and the maximum voltage value (in mV) obtained in the induced coil.

**Table 4.** Resonance frequency and maximum induced voltage of the prototypes.


After analyzing the results shown in Figure 7, we can conclude that the prototype giving the best results is Prototype 1, with a working frequency of 93 kHz and a maximum output voltage of 1.82 V.

**Figure 7.** Behavior for prototype 1 in several types of soils.

Once the most suitable sensor for further development has been determined, this sensor is tested on different types of soil to determine its versatility. In all cases, we will look for the maximum linearity in the sensor response.

As can be seen in the previous figure, the selected model has a linear behavior in all three cases up to a moisture level of 18.75%, i.e., a total volume of 750 mL of water in 4000 g of soil.

Another important aspect to highlight is that the behavior of the sensor for a universal substrate is inverse to the behavior shown in the case of beach sand or cultivated soil. This aspect should be considered when the results are processed in a real environment.

#### *3.4. Power Circuit Design*

Locating the resonance frequency of the selected coil prototype requires a power supply and excitation circuit able to generate an alternating signal. To do this, a 555-series oscillator integrated circuit [40] is used, together with a series of components that shape the output signal to the 93 kHz resonance frequency. Our circuit is based on the ICM7555 [41], which, according to the manufacturer's specifications, can generate signals up to 3 MHz. Figure 8 shows the schematic of the entire circuit. This kind of integrated circuit is designed so that the duty cycle and frequency of the signal can be customized. In our case, R2 and C1 set the working frequency, while C3, C4, and R1 control the ripple and shape of the signal. By modifying C4 and R1, it is possible to obtain either a sine wave or a square wave, such as the one shown in Figure 9.
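
For orientation, the classic astable approximation for 555-family timers is f ≈ 1.44/((Ra + 2Rb)·C). The mapping of Ra, Rb, and C onto the R2 and C1 of Figure 8, and the component values below, are assumptions chosen only to land near the 93 kHz target, not the paper's actual values.

```python
# Approximate output frequency of a 555/ICM7555 astable oscillator.

def astable_freq_hz(ra_ohm, rb_ohm, c_farad):
    """Classic 555 astable frequency approximation: 1.44/((Ra+2Rb)*C)."""
    return 1.44 / ((ra_ohm + 2.0 * rb_ohm) * c_farad)

# e.g. Ra = 1 kOhm, Rb = 6.8 kOhm, C = 1 nF -> roughly 98.6 kHz,
# close enough to trim toward the 93 kHz coil resonance.
f = astable_freq_hz(1_000, 6_800, 1e-9)
```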

To directly read a voltage proportional to the amount of water content in the soil, we include a Graetz (full-wave rectifier) bridge followed by an RC filter, connected to the terminals of the secondary coil.

**Figure 8.** Enhanced power circuit schematic.

**Figure 9.** Example of signal obtained in Output 1.

#### *3.5. Comparison and Discussion with Existing Published Systems*

We have compared our sensor model with existing and commercial soil moisture sensors; Table 5 shows this analysis. It is important to consider that this table only contains the price of the sensor, with the exception of references [42–45], which include an electronic module. In the other cases, a microprocessor module similar to the one used in this paper must be added, which can cost approximately \$10–\$15.



It is evident that our proposed sensor, based on two coils, is among the cheapest models. Its price includes only the pipe and wire, since it can be attached to any electronic platform to gather the data.

When this type of system is designed and developed, it is extremely important to consider practical implementation problems and challenges.

One of the main problems outdoors is how to protect the electronics from adverse conditions. Waterproof protection is highly recommended, since the places where the devices are deployed can be highly changeable. Additionally, the length of time the sensor nodes must operate and their exposure to environmental temperature and humidity can cause variations in the measurements. This issue should be controlled, since a wrong reading would produce anomalous values and, consequently, wrong behavior. To mitigate this problem, artificial intelligence and redundancy mechanisms can be used.

Furthermore, there is another important issue regarding the manufacturing techniques of certain sensors and probes. In several cases, they are manufactured with copper. An improvement in the system implementation could be to replace these sensors with gold-plated ones, which helps to combat the corrosion of the probes.

Coverage estimations do not usually match practical experimentation because the emulation of environmental conditions is difficult. For this reason, we highly recommend performing practical experiments and test benching, as presented in this paper.

#### **4. Network Protocol Design and System Procedure**

This section presents the network protocol used in our topology. In addition, it also presents the algorithm designed to collect data from sensors and control the different actuators as well as the messages exchanged between devices and the algorithm designed for the system procedure.

As presented before, our network is composed of three different types of nodes: sensor nodes, actuator nodes, and a sink node placed at the engine that provides water to the plot. The sink node is in charge of starting the monitoring process of the entire network. Sensor nodes collect data from the soil and provide the required warning alarms to the actuators for enabling or disabling the irrigation systems. Figure 10 shows the diagram of our entire network deployed in the plot.

**Figure 10.** Diagram of our entire network in the plot.

In order to obtain high performance, we have developed a specific network protocol. The following subsections present the message exchange between devices, the fields of the exchanged messages, and the algorithm designed for the system procedure.

The designed network is a distributed network made of sensor nodes (each with one or several physical moisture sensors) and one or several actuator nodes, which activate the engine, the drip irrigation system, and/or the ditch gates, depending on the case. The total number of nodes (*N*) in the whole system is given by Equation (5):

$$N = n\_s + n\_a \tag{5}$$

where *ns* is the number of sensor nodes and *na* is the number of actuator nodes.

Our network will use Ad Hoc On-Demand Distance Vector Routing (AODV) since it is one of the ad hoc routing protocols that presents the best performance [53].

#### *4.1. Algorithm of the System*

In order to determine when the irrigation process should be carried out, we need to collect the data from the different *ns* nodes, which are placed and identified by zones (*i*). For each zone, we define the maximum number of nodes in the zone as *counter*. Automating the irrigation process requires the design of an operation algorithm. Figure 11 shows the operation algorithm of our soil moisture monitoring system.

**Figure 11.** Operation algorithm.

As is common in agriculture, there are periodic planned irrigations that must be performed; in this case, the drip irrigation system is enabled and covers the entire extension of trees. If an alarm is registered from a sensor, the system requests the data from all nodes of that zone. If the number of nodes registering a need for water is higher than 5, the system enables the ditch gate of that zone; if only some sensors (5 or fewer) warn about the need for water, the system enables the drip irrigation elements of that zone. The remaining zones of the plot are then analyzed to check whether irrigation is also required there. The corresponding orders are sent to the sink node by the node of that area nearest to it, and the sink node is in charge of enabling/disabling the irrigation systems.
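
The per-zone decision rule of the algorithm can be sketched as follows; the 5-node threshold comes from the text, while the function and action names are ours.

```python
# Per-zone irrigation decision, given the number of sensor nodes in the
# zone that signalled a need for water within the waiting window.

GLOBAL_THRESHOLD = 5  # more than this many alarms -> flood the whole zone

def irrigation_action(alarm_count: int) -> str:
    """Decide the action for one zone from its water-need alarms."""
    if alarm_count == 0:
        return "idle"                 # no action, stay in idle mode
    if alarm_count > GLOBAL_THRESHOLD:
        return "open_ditch_gate"      # global irrigation of the zone
    return "enable_drip_irrigation"   # partial irrigation of the zone
```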

Finally, if the plot does not require any action, the system will remain in idle mode waiting for new information.

#### *4.2. Message Flow between Nodes*

Finally, in order to send the required actions to the correct actuator nodes, it is important to design the message exchange between nodes. In this sense, we should consider three different situations (see Figure 12). Firstly, the most frequent situation is the one in which the plot does not require any type of irrigation. In this case, if the sensor nodes do not send any message in the next 30 min, the system will consider that no irrigation is required (1).

**Figure 12.** Message exchange between nodes.

The second situation (2) is when there is a global need for water in a zone of the plot. In this case, the sensor head node will wait 30 min for messages from sensor nodes. If more than 5 messages are received, the system will consider that global irrigation for this zone is required. Then, the sensor head node will send a message to the actuator node in charge of enabling the gates. After that, this node will inform the sink node to enable the engine to provide water to the ditch.

The third situation (3) occurs when there is a partial need for water in a zone of the plot. In this case, the sensor head node will wait 30 min for messages from sensor nodes. If 5 or fewer messages are received, the system will consider that partial irrigation of this zone is required. Then, the sensor head node will send a message to the actuator node in charge of enabling the drip irrigation elements of the affected trees. After that, this actuator node will signal the sink node to enable the drip irrigation system.

To simplify the forwarding of messages from sensor nodes to actuator or sink nodes, any node present in the network can act as a relay. A node may receive several packets, but if it is not the destination of a message, it relays the message without processing it. When the sensor nodes of a zone communicate with each other, an intragroup routing protocol is used. When messages are exchanged between sensor nodes and actuator nodes, or between the actuator nodes for the ditch gates and those for the drip irrigation elements, an intergroup routing protocol is used [54].

#### **5. Experimental Results in a Practical Deployment**

In this section, the results obtained in the deployments on orange groves are presented. In order to perform the tests, we used several ESP32 DevKit nodes placed at different heights. This allows us to study the coverage of the nodes at different heights, which serves as a recommendation for practical deployments. The different deployment strategies that were tested are presented in Figure 13. As can be seen, different configurations of emitter height and receiver height were tested. The emitters were deployed at heights of 0.5 m, 1 m, 1.5 m, and 2 m. The receivers were placed at 0 m for the on-ground deployment, 0.5 m for the near-ground deployment, and 1.5 m for the above-ground deployment. The separation between emitter and receiver was varied for each test. The trees are spaced at four-meter intervals, and the field is located in an area with a Mediterranean climate. The foliage of the trees affects the wireless communication among the devices. Testing different configurations of transmitter and receiver provides the knowledge to design the best deployment for optimal communication with this type of crop. The tests were performed in sunny weather at a temperature of 20 °C. The measured quantity is the received signal strength indicator (RSSI) at different measuring points. The ESP32 DevKit nodes were enclosed in a protective box.

The results for the emitter at a height of 0.5 m and the receiver at the different deployment configurations are presented in Figure 14. The positions of the trees are indicated by the bold orange numbers on the X-axis. As can be seen, the overall highest RSSI values along the tested distance were obtained for the near-ground position of the receiver, which is also the most stable configuration. Some small fluctuations occurred at tree number 1 and tree number 4; moreover, the foliage between trees 2 and 4 was denser, which led to higher fluctuations. One reason for these fluctuations may be the multipath effect. Thus, it is best to avoid deploying nodes in areas of high foliage density in order to obtain more stable signals.

**Figure 13.** Testbed.

**Figure 14.** Emitter at height of 0.5 m.

For the case of the emitter deployed at a height of 1 m, the results are presented in Figure 15. As can be seen, the near-ground receiver is the one with the best results. In this case, the signal quality is reduced between trees 2 and 4; however, for the near-ground receiver, the signal recovers somewhat after the area of high foliage density. The above-ground deployment presents similar results in the area of high foliage density but worse signal quality at the rest of the measurement points. Lastly, the on-ground receiver deployment presents the worst results.

**Figure 15.** Emitter at height of 1 m.

Figure 16 presents the results for the emitter height of 1.5 m. The near-ground deployment has the highest signal quality values at almost all measuring points. As can be seen, it experiences some fluctuations between trees 2 and 3. However, even with the fluctuations, the signal quality is better than that of the other configurations. The next best option is the above-ground receiver. In this case, the signal is more stable while remaining below the quality levels of the near-ground receiver. Lastly, the on-ground deployment presented the worst results and the highest fluctuations. Another final aspect to consider is that the average signal quality for this emitter height was lower than the signal quality obtained for lower emitter heights.

**Figure 16.** Emitter at height of 1.5 m.

Lastly, the results for the emitter height of 2 m are presented in Figure 17. This emitter height obtains the worst signal quality results compared to all the emitter heights. Regarding the receiver height, in this case as well, the near-ground deployment obtained the best results. However, as shown in the figure, all receiver configurations present similar results, while the results for this emitter height present the least fluctuations. As in the other cases, the on-ground configuration was the worst option.

**Figure 17.** Emitter at height of 2 m.

Considering the results for all the emitter heights, we can conclude that in the case of orange groves, emitter heights of 0.5 and 1 m present the best signal quality and the near-ground receiver deployment is the best option for all emitter heights. Therefore, near-ground configurations are the optimal deployment style for both emitters and receivers.

The coverage results obtained from the tests performed on the orange groves have been used to fit a heuristic signal attenuation model for each emitter height, as specified in [7]. Outlier values were discarded when fitting the model. Equations (6)–(9) show the models for emitter heights of 0.5 m, 1 m, 1.5 m, and 2 m, respectively.

$$P\_{0.5\ m} = -7.182 \ln d(m) - 45.276\tag{6}$$

$$P\_{1\text{ }m} = -7.69 \ln d(m) - 44.194\tag{7}$$

$$P\_{1.5\ m} = -9.545 \ln d(m) - 44.475\tag{8}$$

$$P\_{2\text{ }m} = -10.34\ln d(m) - 43.493\tag{9}$$

Furthermore, the models, confidence intervals, and prediction intervals are presented in Figure 18, where the dots represent the values obtained from the field tests. As can be seen, the models reflect that the configurations with emitter heights of 0.5 m and 1 m (see Figure 18a,b) present better signal quality. Lastly, Figure 18c shows the graphic representation for the emitter height of 1.5 m, and Figure 18d presents the results for the emitter height of 2 m.
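
Equations (6)–(9) can be evaluated directly to compare configurations; for example, at a distance of 20 m the models rank the emitter heights from best to worst RSSI as 0.5 m, 1 m, 1.5 m, and 2 m (at very short distances the ordering differs, since the intercepts dominate).

```python
import math

# Heuristic attenuation models of Equations (6)-(9): received power
# (RSSI, dBm) as a function of distance d (m) for each emitter height.

MODELS = {  # height (m) -> (slope a, intercept b) for P = a*ln(d) + b
    0.5: (-7.182, -45.276),
    1.0: (-7.690, -44.194),
    1.5: (-9.545, -44.475),
    2.0: (-10.340, -43.493),
}

def rssi_dbm(height_m: float, d_m: float) -> float:
    """Predicted RSSI for the given emitter height at distance d."""
    a, b = MODELS[height_m]
    return a * math.log(d_m) + b

# Rank emitter heights by predicted RSSI at 20 m (strongest first):
ranking = sorted(MODELS, key=lambda h: rssi_dbm(h, 20.0), reverse=True)
```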


**Figure 18.** Heuristic model for (**a**) emitter at height of 0.5 m, (**b**) emitter at height of 1 m, (**c**) emitter at height of 1.5 m, and (**d**) emitter at height of 2 m.

#### **6. Conclusions and Future Work**

Estimating the amount of water needed to irrigate a crop is essential for making efficient use of a scarce resource such as water. The introduction of technology in the agricultural sector is also important to improve the sustainability and competitiveness of the sector. For this reason, this paper has presented the prototype of a low-cost sensor based on coils for measuring soil moisture. Three prototypes composed of two coils with different characteristics have been presented, and these coils have been tested to analyze their behavior as a function of the moisture level of the soil. From the observed results, it was concluded that the best-performing sensor is Prototype 1, working at 93 kHz. Additionally, a power circuit based on the ICM7555 has been designed to generate the biphase signal that powers the soil moisture sensor. The sensor is able to measure the percentage of water content in the soil at the desired depth, which helps to ensure the correct irrigation of the root ball. The sensor and power supply circuit are connected to an ESP32 module for reading and storing the humidity measurements. The entire system has been tested with real samples to extract its mathematical behavior model. The results show that, by using these models, our sensor can achieve accuracies close to 95%.

Additionally, the network performance has been tested in a real cultivated plot. According to the results, and after mathematically modeling the network coverage, we can conclude that for orange groves the best results are obtained when the emitter is placed at 0.5 or 1 m and the receiver is placed near the ground. Thus, near-ground configurations are the optimal deployment style for both emitters and receivers.

In future work, we would like to perform more practical experiments with more coil models and different kinds of soil in order to design a more versatile sensor capable of working with several sorts of soil without modification. We will also study the possibility of including a system that automatically adapts the working frequency to the type of soil. Because our practical experiments only included measurements of signal amplitude, it could be interesting to measure the quadrature component and phase of the obtained signal and try to relate these parameters to changes in the pH of the water. We also want to include other sensors in a multi-parametric node to be placed in the crop field [55,56] to enhance the efficiency of water management in precision agriculture [57]. In this sense, we want to check whether soil temperature has an effect on the soil moisture measurements and, if required, obtain soil moisture values compensated for temperature. Finally, as a last step, we will study the most appropriate enclosures to protect the entire system.

**Author Contributions:** Conceptualization, S.S. and J.L.; methodology, S.S. and J.L.; validation, L.G. and J.M.J.; formal analysis, S.S. and J.L.; writing—review and editing, S.S., J.L., L.G. and J.M.J.; All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has been partially supported by the European Union through the ERANETMED (Euromediterranean Cooperation through ERANET joint activities and beyond) project ERANETMED3- 227 SMARTWATIR, by the "Programa Estatal de I+D+i Orientada a los Retos de la Sociedad, en el marco del Plan Estatal de Investigación Científica y Técnica y de Innovación 2017–2020" (Project code: PID2020-114467RR-C33) and by "proyectos de innovación de interés general por grupos operativos de la Asociación Europea para la Innovación en materia de productividad y sostenibilidad agrícolas (AEI-Agri)" in the framework "Programa Nacional de Desarrollo Rural 2014–2020", GO TECNOGAR. This work has also been partially funded by the Universitat Politècnica de València through the post-doctoral PAID-10-20 program.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** No dataset has been created.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Review* **Towards Synoptic Water Monitoring Systems: A Review of AI Methods for Automating Water Body Detection and Water Quality Monitoring Using Remote Sensing**

**Liping Yang 1,2,3,\*, Joshua Driscol 1,2, Sarigai Sarigai 1,2, Qiusheng Wu 4, Christopher D. Lippitt 1,2 and Melinda Morgan <sup>1</sup>**


**Abstract:** Water features (e.g., water quantity and water quality) are among the most important environmental factors essential to improving climate-change resilience. Remote sensing (RS) technologies empowered by artificial intelligence (AI) have become one of the most sought-after strategies for automating water information extraction and thus enabling intelligent monitoring. In this article, we provide a systematic review of the literature that incorporates artificial intelligence and computer vision methods in the water resources sector, with a focus on intelligent water body extraction and on water quality detection and monitoring through remote sensing. Based on this review, the main challenges of leveraging AI and RS for intelligent water information extraction are discussed, and research priorities are identified. An interactive web application designed to allow readers to intuitively and dynamically review the relevant literature was also developed.

**Keywords:** surface water; water body detection; surface water extraction; water quality monitoring; remote sensing; artificial intelligence; computer vision; machine learning; deep learning; convolutional neural networks

#### **1. Introduction and Motivation**

**Citation:** Yang, L.; Driscol, J.; Sarigai, S.; Wu, Q.; Lippitt, C.D.; Morgan, M. Towards Synoptic Water Monitoring Systems: A Review of AI Methods for Automating Water Body Detection and Water Quality Monitoring Using Remote Sensing. *Sensors* **2022**, *22*, 2416. https://doi.org/10.3390/s22062416

Academic Editors: Miquel À. Cugueró-Escofet and Vicenç Puig

Received: 2 January 2022; Accepted: 15 March 2022; Published: 21 March 2022

Water is fundamentally necessary to all forms of life, and it is also the primary medium through which climate change impacts Earth's ecosystem and thus the livelihood and wellbeing of societies [1]. While water covers about 71% of the Earth's surface, only approximately 3% of the Earth's water is freshwater [2]. Climate change will bring unique challenges to these water bodies. Many rivers and streams are heavily dependent on winter snowpack, which is declining with rising temperatures and changing precipitation patterns [3]. Sea level rise is also impacting the continued quality and quantity of water supplies [4]. Both the quantity and the quality of freshwater systems are critical environmental features essential to increasing resilience in the face of climate change [5,6]. Resilience is defined here as the capacity of a system to absorb disturbance and still retain its basic function and structure [7]. Climate change will bring new disturbances in many forms, including increased pollution from wildfires, saltwater intrusion, and deteriorated water quantity resulting from prolonged drought [1,8]. It is critical that we gather, ideally automatically, as much information as possible about freshwater bodies and how they function in order to increase our capacity to respond to a changing climate. Rockström and his colleagues [5,6] conceptualize freshwater use and biogeochemical flows that threaten the integrity of freshwater (via pollution) as two of seven variables key to overall Earth system function. Each of these variables, they argue, can be thought of as having a "planetary boundary", a threshold that should not be crossed if we are to maintain the Earth in its current system state [5]. In this sense, the integrity and functioning of freshwater systems are essential not only at the local scale, at which they provide critical ecosystem services; they also create a "safe operating space" for humanity as a whole, as we seek to achieve global solutions to the larger environmental challenges we face with climate change and associated stressors [6].

Responding to the impacts of climate change on water resources requires adaptation strategies at the local, regional, national, and global scales. Countries are urged to improve their water resources management systems and to identify and implement "no regrets" strategies in order to be resilient to climate change [1]. The changing spatial and temporal patterns of surface water are important, in both practical and scientific terms, for water resources management, biodiversity, emergency response, and climate change [9]. More specifically, automated monitoring of water bodies is critical for adapting to climate change and for managing water resources, ecosystem services, and the hydrological cycle, as well as for urban hydrology, where it can facilitate timely flood protection planning and water quality control for public safety and health [10–12]. Accurate water quality monitoring is essential for developing sustainable water resource management strategies and ensuring the health of communities, ecosystems, and economies [13]. However, current knowledge of water quality is often disconnected in time and space across different measurement techniques and platforms, which may fail to capture dynamic ecosystem changes. This disconnection indicates inefficiency and redundancy in research and monitoring activities. A major challenge for water resource management is how to integrate multiple sources of water quality data and indices into usable and actionable information of environmental, social, economic, and infrastructural value [13,14].

Geospatial big data are leading to transformative changes in science (with the advent of data-driven and community science) and in society (with the potential to support the economy, public health, and other advances). Artificial intelligence (AI), and especially its branches machine learning (ML), deep learning (DL), and computer vision (CV), is central to leveraging geospatial big data for applications in both domains. Remote sensing (RS) is the single largest source of geospatial big data and has increased dramatically in both spatial and temporal resolution, which poses serious challenges for effective and efficient processing and analysis [15]. Meanwhile, recent advances in DL and CV have significantly improved research in RS and the geosciences [16–18]. These advances, if integrated in creative and appropriate ways, hold the potential to enable the automated identification and monitoring of large-scale water bodies and water quality effectively and efficiently.

In this article, we argue specifically that bridging research into extracting important water information (e.g., water body extent, water quality) from RS imagery will provide an important computational foundation for the development of smart, RS-enabled water resource management systems. We review a range of recent developments in the relevant fields that can be leveraged to support intelligent automation of water body extraction and water quality detection and monitoring through RS imagery. An accompanying interactive web application allows our readers to intuitively track scholars and publications covered in this review (the web app tool URL and its brief demo video link are provided in Appendix A).

#### *1.1. Selection Criteria for Reviewed Papers and Brief Graphic Summary*

In the literature review process, we performed a systematic search on Google Scholar with the keywords and search strategy detailed in Table 1. In addition, our search was restricted to research articles published in English and in peer-reviewed journals or conference proceedings. For water body detection, we combined the water body keywords with some combination of the general keywords. The process for finding publications related to water quality was the same, only with the water quality keywords list. Beyond the keywords listed in this table, references cited by the keyword-identified papers were also retained. A total of 90 papers relevant to the topic of water body and/or water quality extraction from RS imagery using AI/ML/DL/CV algorithms were identified. Of these, 56 highly relevant articles were retained by applying the following exclusion criteria: (1) papers related to plastic pollution and sewage/water treatment plants, (2) papers on precipitation forecasting or groundwater detection (as it is not intuitive to detect groundwater from RS imagery), and (3) papers on general land use classification. Figure 1 shows the spatial distribution and a simple statistical summary of the papers covered in this review, where (d) shows the number of published papers by year on the reviewed topics from 2011 to early 2022.

**Table 1.** Keywords used for article search.


<sup>1</sup> A list of general keywords was combined with either the water body or the water quality category, respectively, to perform our search.
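The keyword-combination strategy described above can be sketched as follows. Note that this is an illustrative sketch only: the keyword lists below are hypothetical stand-ins, since Table 1's full lists are not reproduced in this reprint.

```python
from itertools import product

# Hypothetical stand-ins for the Table 1 keyword lists.
general_keywords = ["deep learning", "machine learning", "remote sensing"]
water_body_keywords = ["water body detection", "surface water extraction"]
water_quality_keywords = ["water quality monitoring", "chlorophyll-a retrieval"]

def build_queries(topic_terms, method_terms):
    """Cross each topic keyword with each general keyword to form
    quoted search strings for a literature search."""
    return [f'"{t}" "{m}"' for t, m in product(topic_terms, method_terms)]

queries = (build_queries(water_body_keywords, general_keywords)
           + build_queries(water_quality_keywords, general_keywords))
```

Each topic list is crossed with the general list, mirroring the paper's description of combining the water body (or water quality) keywords with the general keywords.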


**Figure 1.** Geospatial distribution and simple statistics of the reviewed papers. Note that a freely accessible interactive version of the charts can be accessed via our web app tool (the web app tool URL and its brief demo video are provided in Appendix A). We can easily see that the major countries are China and the United States and that the number of published papers by year (2011 to 2021) has dramatically increased since 2018 and 2019. (**a**) Spatial distribution of reviewed papers based on the first author's institution location. (**b**) Topic distribution (water body, water quality, both). (**c**) Country distribution. (**d**) Number of published papers by year from 2011 to 2021 on the relevant topics.

#### *1.2. Roadmap*

Here, we provide a roadmap for the rest of the paper. Section 2 outlines the scope of this review and our intended audience. Section 3 is the core of the paper, focused on identifying important recent developments and their implications for water body detection and water quality monitoring from RS imagery by leveraging AI/ML/DL/CV. Here, we highlight recent advances in several subfields of AI that the water domain and RS can leverage. Specifically, we provide general characteristics of the reviewed studies using word clouds (Section 3.1). We then examine and appraise key components of influential work in water body detection (Section 3.2) and water quality monitoring (Section 3.3). Section 4 starts with a brief summary (Section 4.1), followed by a discussion of key technical challenges (Section 4.2) and opportunities (Section 4.3). The paper concludes in Section 5.

To allow our readers to intuitively and dynamically review the relevant literature, we have developed a free-of-charge interactive web app tool (the web app URL and its brief demo video are provided in Appendix A). To provide background for readers (particularly those from water resources and RS) who are new to AI/ML/DL/CV, we introduce essential ML terms in Appendix B. As evaluation metrics are essential for measuring the performance of AI/ML/DL/CV models, we also provide an introduction to a set of commonly used evaluation metrics in Appendix C. In addition, as there are plenty of acronyms in this paper, we provide a full list of abbreviations right before the appendices.

#### **2. Audience and Scope**

It is important to know where water is and how its extent and quality are changing over time in a quick and accurate manner. Water quality is a key issue in water supply, agriculture, human and animal health, and many other areas [19]. Impaired water quality can be caused by natural disasters, but the most common cause is anthropogenic pollution. Pollutants, excessive nutrients from fertilizers, and sediment (e.g., from soil erosion) are carried into local lakes and rivers via runoff from urban or agricultural areas [19,20]. The quality of water varies from place to place and from time to time [19]. Affected surface waters are present in RS imagery and can be identified with the help of computational techniques such as ML. *To make near real-time intelligent water body detection and water quality monitoring possible, we need to first detect the extent of water bodies from RS imagery, from which volume can be computed, and then recognize their corresponding water quality, eventually linking the two to allow water quality monitoring*.
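As a minimal illustration of that first step, from a detected water mask to a quantity, surface area follows directly from the pixel count and the ground sampling distance. This is a generic sketch: the 10 m pixel size is an assumption (roughly Sentinel-2's visible bands), and going from area to volume additionally requires depth information such as bathymetry or a DEM.

```python
import numpy as np

def water_surface_area_m2(water_mask, pixel_size_m=10.0):
    """Surface area (m^2) implied by a binary water mask.

    pixel_size_m is the ground sampling distance of the imagery;
    the 10 m default here is an assumption for this sketch.
    """
    return float(np.count_nonzero(water_mask)) * pixel_size_m ** 2

# A 3x3 toy mask with 4 water pixels -> 4 * (10 m)^2 = 400 m^2.
mask = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]], dtype=bool)
area = water_surface_area_m2(mask)
```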

Environmental nonprofits, government agencies, and water managers need access to this type of integrated spatial time series of water body and water quality information to see how local water resources are changing and to plan for future drought conditions. Collective detection and monitoring of water bodies and their associated water quality has applications for human health, as well as for private-sector industries including timber, agriculture, recreation, and tourism. Public policy planners need to be better informed as they make environmental preservation and restoration decisions based on changing water availability, and with these data we can be better equipped to monitor water quality that can quickly change due to floods, hurricanes, or human-caused pollution. *Yet, to date, water body detection and water quality monitoring research has been historically separate and does not focus enough on producing intuitive, operational products*.

Building on the long-term interest in ML and CV within the RS community, the main goals of this review paper are to (1) survey recent advances in water body detection and water quality monitoring from RS data using AI to identify commonly cited challenges in order to provide suggestions for new research directions, and (2) move towards automated, synoptic water quantity and quality monitoring to inform more robust water resource management.

This systematic review is relevant to multiple research domains, including, but not limited to, RS, geographic information science, computer science, data science, information science, geoscience, hydrology, and water resource management. This paper does not attempt to review the application of RS to water resources and hydrology more generally; for recent reviews of these topics, see [13,21–24]. A survey of DL applications in hydrology and water resources can be found in [25]; a survey of AI in the water domain can be found in [26]; and a survey of water quality applications using satellite data solely focused on ML can be found in [27]. This review focuses on investigating recent AI methods, including its branches ML, DL, and CV, for water information extraction (specifically water body detection and/or water quality monitoring) from RS imagery. Our review has a narrower scope in water resources and hydrological research, but a wider and deeper scope in terms of the AI methods and metrics used to assess models in both water body detection and water quality research. *By integrating both domains, we hope to develop a basis for effective computational frameworks for intelligent water monitoring systems using RS and AI.*

#### **3. The State of the Art: Advances in Intelligent Water Body Information Extraction**

*3.1. General Characteristics of the Reviewed Studies*

Note that we only included and reviewed papers that use both RS and AI/ML/DL/CV for water body and/or water quality detection (that is, the number of papers cited in our reference section is much larger than the number of papers we review in this Section 3). A word cloud visualization of the titles, abstracts, and keywords of the 56 reviewed papers is provided in Figure 2, where the top figure shows the word cloud for all reviewed papers, the bottom left word cloud is for the reviewed water body papers, and the bottom right is for the reviewed water quality papers.
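At heart, word clouds like those in Figure 2 visualize term frequencies over the papers' titles, abstracts, and keywords. The counting step can be sketched with the standard library (an illustrative sketch: the input strings below are hypothetical, and the tool the authors actually used is not stated):

```python
from collections import Counter

# Hypothetical titles standing in for the reviewed papers' metadata.
texts = [
    "Deep learning for water body extraction from remote sensing imagery",
    "Water quality monitoring with machine learning and remote sensing",
]
stopwords = {"for", "from", "with", "and", "the", "of"}

def term_frequencies(docs, stopwords):
    """Count word occurrences across documents, skipping stopwords;
    a word cloud scales each term's font size by this count."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc.lower().split() if w not in stopwords)
    return counts

freq = term_frequencies(texts, stopwords)
```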

As we can see from the word cloud for both water body extraction and water quality (see the top word cloud in Figure 2), "remote sensing", "deep learning", "prediction", "classification", "extraction", "machine learning", "water body", "water quality", and "convolutional neural network" are prominent concepts and words captured by the word cloud. Our focus is on studies that use RS for water body extraction and water quality monitoring, so many of the keywords are to be expected. However, it is perhaps surprising to see DL featured so prominently given that the shift from ML to DL models is a relatively recent phenomenon.

**Figure 2.** Word cloud visualization of all the reviewed papers (**top**), water body papers (**bottom left**), and water quality papers (**bottom right**). Note that the word clouds are generated from paper titles, abstracts, and keywords. The word clouds provide an informative (general and specific) focus of each set of the papers. For example, both water body and water quality papers share the focus on RS, DL, and neural networks (NN). We can also see that water body extraction tasks tend to focus on the use of convolutional neural networks (CNN), whereas for water quality modeling the use of long short-term memory (LSTM) networks is more prevalent. We can also see that there are specific, unique keywords for water quality, such as "turbidity", "chl", and "algal bloom".

When we separate the keyword word clouds (see the bottom two word clouds in Figure 2), this trend becomes clearer. Deep learning is much more common in water body extraction, whereas in the word cloud for water quality monitoring, "neural network" and "machine learning" are about the same size. Additionally, in the water body extraction word cloud, "remote sensing" is featured much more heavily than it is in the water quality word cloud. In our review, the water quality papers often involved other types of data, including in situ sensors or smaller RS devices (not satellites), whereas the water body extraction literature is dominated by RS imagery. This is related to the scale of projects in the two domains: water body extraction is usually undertaken across large spatial scales, whereas the water quality monitoring literature is still focused on smaller, often individual, bodies of water. This points to a future research direction in the water quality literature that we touch on in this review: we need to scale up water quality estimation using RS imagery by matching it with ground-truth water quality measurements.

Tables 2 and 3 provide a brief summary of the methods used for water body detection and water quality monitoring, elaborated in Sections 3.2 and 3.3, respectively. The general characteristics summarized by machines (i.e., the word clouds in Figure 2) align with the literature; convolutional neural network (CNN) models are indeed applied much more frequently for water body detection, and long short-term memory (LSTM) models are often used for water quality monitoring. The evaluation metrics used in the reviewed articles were also summarized and are provided in Tables 2 and 3 (a brief explanation of each metric is in Appendix C).

**Table 2.** Studies targeting water body detection from RS imagery using AI (ordered chronologically to show trends in data type and model usage; see the Abbreviations for a list of the acronyms).




**Table 3.** Studies targeting water quality monitoring from RS imagery using AI ("/" means none; ordered chronologically to show trends in data type and model usage; see the Abbreviations for a full list of the acronyms).


<sup>1</sup> The authors use the abbreviation RMSELE for RMSLE in their paper (this might be a typographical error).

#### *3.2. Recent Advances in Water Body Detection Using AI*

From our systematic review (including Table 2), we provide a brief summary of the recent advances in AI-based water body detection. (1) The most common satellite platforms were Landsat, GaoFen, Zi Yuan, WorldView, and Sentinel, although there were some manually annotated datasets; the use of UAVs and DEMs was noted but was not as common. (2) Precision, recall, overall accuracy (OA), F1-score, kappa, and intersection over union (IoU) are the most popular evaluation metrics for water body detection, since it is mainly a classification task. (3) Convolutional neural networks (CNNs) are normally compared to the normalized difference water index (NDWI) or another index-based method, some form of "shallow" ML model (e.g., random forest (RF), support vector machine (SVM)), or other CNN architectures. Below, we provide a more detailed review of the methods used for water body detection. As Table 2 and the word clouds (see Figure 2) indicate, the dominant methods used in water body detection with AI are CNNs (Section 3.2.1). Beyond CNN-based methods, there are other methods, including CNN hybrids (Section 3.2.2); artificial neural networks (ANN), multilayer perceptrons (MLP), dense neural networks (DNN), and other DL methods (Section 3.2.3); and "shallow" ML-based methods (Section 3.2.4).
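For readers unfamiliar with the metrics listed in point (2), they all derive from the pixel-wise confusion matrix. A minimal NumPy sketch follows (a generic illustration, not code from any reviewed paper):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise evaluation metrics common in water body detection."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)        # water predicted as water
    fp = np.sum(pred & ~truth)       # land predicted as water
    fn = np.sum(~pred & truth)       # water predicted as land
    tn = np.sum(~pred & ~truth)      # land predicted as land
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = (tp + tn) / n               # overall accuracy
    # Chance agreement for Cohen's kappa.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "oa": oa,
        "kappa": (oa - p_chance) / (1 - p_chance),
        "iou": tp / (tp + fp + fn),
    }

# Toy example on four pixels.
m = segmentation_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```

Note that IoU penalizes both false positives and false negatives in a single ratio, which is why it is favored for segmentation tasks alongside F1.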

#### 3.2.1. CNN-Based Water Body Detection

CNN-based models are the dominant methods for water body detection, but each has addressed different challenges posed in water body detection from RS imagery. Based on our review, we identify the following five groups of use cases: (1) addressing limitations of index-based methods; (2) sharpening blurred boundaries caused by CNNs; (3) addressing spatial and spectral resolution challenges, which covers those methods that are able to recognize water bodies across scales, at multiple resolutions, from very high-resolution imagery, and/or by integrating bands beyond the RGB channels for CNN model training; (4) robust detection of small/slender/irregularly shaped water bodies; (5) others.

1. Addressing limitations of index-based methods:

Index methods (e.g., NDWI) are rule-based and fail to take advantage of context information. CNNs can overcome this, although they often blur boundaries in segmentation tasks because of the convolution operation [34]. A DenseNet was used in [43] for water feature extraction and the authors compared its performance with NDWI and several popular CNN architectures. While NDWI methods are quick, they are not as accurate as CNNs. The authors showed that DenseNet performed the best at distinguishing water from shadows and clouds. However, the authors argue that clouds often occlude optical imagery, so one way to improve their method is to combine it with microwave RS imagery.
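For reference, such a rule-based index baseline can be written in a few lines (a generic sketch; the zero threshold is the conventional default and, as the authors of [31] note, typically needs site-specific calibration):

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.0, eps=1e-9):
    """NDWI = (green - nir) / (green + nir); pixels above the
    threshold are labeled water. Being purely per-pixel, the rule
    ignores spatial context and is prone to false positives from
    snow, ice, and shadows."""
    ndwi = (green - nir) / (green + nir + eps)
    return ndwi > threshold

# Toy 2x2 scene: water reflects more green than near-infrared.
green = np.array([[0.30, 0.05], [0.28, 0.04]])
nir = np.array([[0.05, 0.40], [0.06, 0.35]])
mask = ndwi_water_mask(green, nir)
```

This per-pixel simplicity is exactly what the CNN-based methods in this group improve on by exploiting spatial context.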

The authors in [31] pointed out that index methods require careful calibration and that indices differ from place to place. They also suffer from false positives (from snow, ice, rock, shadows, etc.) and vary in different weather conditions (e.g., clouds). To overcome those limitations of index-based methods, the authors of [31] developed DeepWaterMap, which can classify water with high accuracy, even distinguishing it from snow, ice, shadow, and clouds. DeepWaterMap is able to classify land classes that are often misclassified as water (or vice versa); thus, it minimizes false positives during the classification process. Most importantly, the DeepWaterMap model also works across different terrains and in different weather conditions, although it is still affected by clouds. The same authors released a second version of the model, DeepWaterMap v2, in [40]. The major improvement from v1 is that the new version allows users to input large RS scenes without the need for tiling, and the authors made their network run efficiently with constant memory at inference time. This model should theoretically work across different sensor platforms as long as they have the visible, near-infrared, and shortwave infrared 1 and 2 bands, but will still sometimes classify clouds as water.

2. Sharpening blurred boundaries caused by CNNs:

CNN-based methods can overcome the limitations of index-based methods, as reported above in group (1) [34], but they often blur boundaries in segmentation tasks because of the convolution operation. To sharpen water body detection boundaries, in [34], a restricted receptive field deconvolution network (RRF DeconvNet) and a new loss function called edges weighting loss were proposed. However, the authors needed to retrain the entire network (which is very computationally expensive) instead of using transfer learning (TL).

Apart from blurring pixel boundaries, CNNs generally require many training parameters and very large training datasets to be successful. A novel convolution–inception block in a network called W-Net was proposed in [48] to extract water bodies from RS imagery. W-Net is able to train on fewer images compared with other CNN models and still extract water bodies accurately, and the authors pointed out that fewer computations are necessary due to the use of inception layers. W-Net outperformed other CNN architectures, although the authors still needed to go through the time- and labor-intensive process of creating a dataset of manually annotated images.

#### 3. Addressing resolution and band-related challenges

High-resolution optical RS imagery allows for much finer detail in surface water body extraction. However, clouds and their shadows are often present in optical RS images [78]. Shadows (e.g., cloud shadows and building shadows) and water bodies share a very similar appearance in optical RS images. Therefore, water body extraction is not an easy task in optical high-resolution RS images due to the limited spectral ranges (including blue, green, red, and near-infrared bands) and the complexity of low-albedo objects (cloud shadows, vegetation, and building shadows). Higher spatial resolution imagery often comes at the cost of fewer spectral channels and thus makes it difficult to extract features from complex scenes. To address this problem, a dense local feature compression (DLFC) network was proposed in [52] to extract bodies of water from RS imagery, and their DLFC outperformed other state-of-the-art (SOTA) CNNs, as well as an SVM and NDWI thresholding. Their results demonstrated that the DLFC is good at extracting slender water bodies and distinguishing water bodies from building shadows using multisensor data from multiple RS platforms.

TL and data augmentation (see Appendix B) are used in [37] to extract water bodies from satellite imagery. The authors showed that a CNN can outperform NDWI and an SVM in water body detection when the input data is very high resolution. There are tradeoffs, however, and the authors reported that the difficulty of hyperparameter tuning is one downside to using a CNN. A water body extraction NN, named WBE-NN, was proposed in [45] to extract water bodies from multispectral imagery at multiple resolutions while distinguishing water from shadows, and performed much better than NDWI, an SVM, and several CNN architectures. A self-attention capsule feature pyramid network (SA-CapsFPN) was proposed in [49] to extract water bodies from satellite imagery of different resolutions. SA-CapsFPN is able to recognize bodies of water across scales and different shapes and colors, as well as in varying surface and environmental conditions, although it is still entirely dependent on optical imagery as input to the CNN.
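As a minimal illustration of the label-preserving data augmentation mentioned above (a generic sketch, not the specific pipeline of [37]): the eight dihedral variants of a patch are cheap to generate and, when the same transform is applied to the label mask, multiply the effective training set without new annotation effort.

```python
import numpy as np

def dihedral_augment(patch):
    """Return the 8 flip/rotation variants of a square image patch.

    Applying the identical transform to the corresponding label mask
    keeps each (image, mask) pair consistent, so the augmentation is
    label-preserving for segmentation training."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants

patch = np.arange(16).reshape(4, 4)
augmented = dihedral_augment(patch)
```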

The novel MSResNet proposed in [46] learned from a large dataset of unlabeled RS imagery. MSResNet, in addition to being able to extract water bodies in an unsupervised manner, is able to recognize water bodies at multiple resolutions and of varying shapes. However, their network cannot distinguish water bodies from farms and barren areas. In addition, the CNN-based model named FYOLOv3, proposed in [51], is able to detect tidal flats at different resolutions. However, it does depend on a manually selected similarity threshold that introduces some subjectivity.

RGB band imagery is the primary focus of substantial research on water body extraction, but many more bands are available in RS imagery. A multichannel water body detection network (MC-WBDN) was created in [47], which fused the infrared and RGB channels and used them as input data for their CNN architecture. They demonstrated that when multispectral data are used, model performance for water body detection increases and the model is more robust to lighting conditions. The proposed MC-WBDN model is much more accurate than index-based methods such as NDWI, modified NDWI (MNDWI), and the normalized difference moisture index (NDMI). MC-WBDN also outperforms other SOTA architectures such as U-Net and DeepLabV3+ for water body detection tasks. However, this method still relies on preprocessing data to make sure each input image is the same shape and free of clouds.

4. Robust detection of small/slender/irregularly shaped water bodies

Small water bodies are hard to extract from RS imagery. In [33], the authors designed a CNN (named SAPCNN) that is able to extract high-level features of water bodies from input data in a complex urban background; NDWI and SVMs cannot distinguish between water and shadows in such settings, although the SAPCNN's performance assessment partly relies on visual inspection. Ref. [53] utilized a modified DeepLabv3+ architecture to extract bodies of water at different scales, focusing on extracting water bodies in urban RS images. Their network performed well on small bodies of water, but the model has problems identifying many of them because they were not properly annotated.

Mask-region-based CNNs (R-CNNs) have demonstrated success in detecting small and irregularly shaped water bodies. Song et al. (2020) [41] employed an R-CNN for water body detection from RS imagery, and their model outperforms many traditional ML models in identifying small water bodies and bodies of water with differing shapes. However, it is still difficult to deploy a trained NN model into a usable, production-ready form for water mapping applications. The authors reported that using NN output to create and update a vector map of water resources for stakeholders is challenging.

Yang et al. (2020) [42] also used a mask R-CNN to automate water body extraction. The authors argued that this allows them to avoid manual feature extraction in complex RS imagery. They segmented small water bodies and bodies of water with irregular shapes, although their method suffers from poor IoU accuracy. This is primarily due to a small training set, for which DL models are ill-suited, and it resulted in their models having problems identifying multiple bodies of water in RS images.

A self-attention capsule feature pyramid network (SA-CapsFPN) was proposed in [49] to extract water bodies from satellite imagery. SA-CapsFPN is able to recognize bodies of water across scales and of different shapes and colors, as well as to utilize different information channels. The novel MSResNet proposed in [46], which learned from a large set of unlabeled RS imagery, is also able to recognize water bodies at multiple resolutions and of varying shapes; however, their network cannot distinguish water bodies from farms and barren areas.

A dense local feature compression (DLFC) network was proposed in [52] to extract bodies of water from RS imagery, and their DLFC outperformed other SOTA CNNs, as well as an SVM and NDWI thresholding. Their results demonstrated that the DLFC is good at extracting slender water bodies and distinguishing water bodies from building shadows using multisensor data from multiple RS platforms.

#### 5. Others

Extracting water bodies from RS imagery quickly and reliably is still a difficult task. Based on U-Net, [50] developed a new model called SU-Net to distinguish between water bodies, shadows, and mixed scenes. However, the authors only focused on water body extraction in urban areas and only used RGB information during the extraction process. While SU-Net performed better than an SVM and classic U-Net, it suffered when extracting water bodies from RS imagery with high reflectivity or that contained aquatic plants.

Wetlands are important ecosystems because they can keep flooding at bay and store carbon; however, they are threatened by development, climate change, and pollution. For the task of identifying wetlands, [44] combined RS imagery with hydrological properties derived from digital elevation models (DEMs) to identify wetlands. They showed that an RF performs as well as a CNN, although both models had issues distinguishing roads and trees from wetlands. This is perhaps due to their small training set. To improve performance, the authors argued that larger datasets with finer labels should be created for wetland detection.

Substantial water body detection work has focused on water bodies in urban and inland settings; very few studies focus on tidal flat extraction, where sediment levels are high and the boundary of the water body itself is blurry. A CNN model called FYOLOv3 was proposed in [51], where the authors compared their model to NDWI, an SVM, a maximum likelihood classifier, U-Net, and YOLOv3. FYOLOv3 performed the best and is able to detect tidal flats at different resolutions; however, it depends on a manually selected similarity threshold during the training process, which is a source of subjectivity.

Large sets of unlabeled water body data are available and easy to acquire, and standard semantic segmentation networks struggle to recognize different water body shapes. A recent encoder–decoder CNN architecture named MSResNet, proposed in [46], is able to overcome these limitations. MSResNet is able to learn from unlabeled data and can also recognize water bodies of varying shapes and at multiple resolutions. However, even though their network outperforms other SOTA architectures without supervised training, it has some issues distinguishing water bodies, farms, and barren areas.

#### 3.2.2. CNN Hybrid-Based Water Body Detection

CNNs are the SOTA models for water body extraction tasks (detailed in Section 3.2.1 above); however, their outputs and the reasons for their predictions are largely a black box. Recent studies have therefore integrated CNNs with classical ML models. Interpretability was improved by using a CNN and an SVM in parallel to classify wetland water bodies [39]; wetlands are difficult to identify in high-resolution satellite imagery with any single ML model, and hybrid models have shown promise through a process called decision fusion. Here, the authors pick a decision fusion threshold value by performing cross-validation on the CNN to estimate when its predictions are reliable. They then use this threshold value for the decision fusion predictions (i.e., when the CNN is not confident, they defer to the SVM). However, the authors did not explain why they used an SVM and not some other ML model. The classifier used in [32] combines a CNN with a logistic regression (LR) model to extract water bodies. The authors emphasized that traditional ML methods for water body extraction need multispectral data and rely on substantial prior knowledge; thus, those ML-based methods would not generalize well to different tasks. The authors also argue that single-band threshold methods are subjective. Their results demonstrated that the hybrid CNN-LR model works better than an SVM, an ANN, and other CNNs. However, their method requires segmented RS images as input.
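The decision fusion rule described above can be sketched as follows (a hypothetical illustration: the threshold value, and the use of maximum softmax probability as the CNN's confidence measure, are assumptions rather than details from [39]):

```python
import numpy as np

def fuse_predictions(cnn_probs, svm_labels, tau=0.8):
    """Trust the CNN class where its softmax confidence clears the
    cross-validated threshold tau; otherwise defer to the SVM."""
    cnn_confidence = cnn_probs.max(axis=1)
    cnn_labels = cnn_probs.argmax(axis=1)
    return np.where(cnn_confidence >= tau, cnn_labels, svm_labels)

# Two samples: the CNN is confident on the first, unsure on the second.
cnn_probs = np.array([[0.95, 0.05],   # above tau -> keep CNN label 0
                      [0.55, 0.45]])  # below tau -> defer to the SVM
svm_labels = np.array([1, 1])
fused = fuse_predictions(cnn_probs, svm_labels)
```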

How to accurately extract water bodies from RS images, while continuously updating surface water maps, is an active research question. Index methods and active contour models are popular for water body detection but are sensitive to subjective threshold values and initial conditions. A deep U-Net model combined with a conditional random field (CRF) and regional restriction was proposed to classify water versus non-water in satellite images [36], while reducing the edge blurring that CNN-based image segmentation often produces. Although this network is highly accurate, it requires substantial data and computational power to train. Training ML models at a single scale on single channels can cause errors when generalizing to other scales or types of RS data. Multiscale RS imagery was used with DeepLabV3+ and a CRF for water body segmentation [38]. This approach works well for training models on data from different scales, and the authors concluded that CNNs and CRFs together extract more accurate water boundaries at both large and small scales than CNNs alone.

#### 3.2.3. ANN, MLP, DNN, and Other DL-Based Methods for Water Body Detection

An NN architecture called a locally excitatory globally inhibitory oscillator network (LEGION) was used in [28], where the authors compared LEGION trained on NDWI with LEGION trained on spectral information. In addition, they employed object-wise classification instead of the pixel-based classification used in most other work. The authors reported that the network is very computationally expensive.

Different methods of water body extraction work (or fail) in different areas and terrain types, and each needs subjective thresholds and/or hand-crafted features. In addition, generating large sets of labeled data is difficult and expensive, as high-dimensional RS data are hard to analyze, and objects such as shadows, clouds, and buildings are hard to distinguish from water bodies. In [29], the authors used an autoencoder for unsupervised training and concluded that their results are more accurate than those of an SVM and a traditional NN.

Huang et al., 2015 [30] pointed out that little work has focused on water body detection in urban settings. This is a problem because, in optical imagery, water bodies often resemble shadows cast by buildings at certain times of day. The authors employed an extreme learning machine (ELM), an SVM, a tree bagger (TB), and an RF to detect water bodies, and reported that the RF and TB performed much better than the SVM and ELM. However, their method depends on optical imagery with subjective thresholds set through trial and error; specifically, it relies on threshold values for NDWI, the normalized difference vegetation index (NDVI), and the morphological shadow index (MSI).
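The index thresholds criticized here can be made concrete with the standard NDWI formula (McFeeters 1996), NDWI = (Green − NIR)/(Green + NIR); the toy reflectance values and the zero threshold below are illustrative only, not values from [30].

```python
# Index-based water masking with NDWI; the threshold is the subjective,
# trial-and-error part discussed in the text.

def ndwi(green, nir, eps=1e-9):
    """McFeeters' normalized difference water index for one pixel."""
    return (green - nir) / (green + nir + eps)

def water_mask(green_band, nir_band, threshold=0.0):
    """Pixel-wise water mask: True where NDWI exceeds the threshold."""
    return [[ndwi(g, n) > threshold for g, n in zip(gr, nr)]
            for gr, nr in zip(green_band, nir_band)]

green = [[0.30, 0.10],
         [0.25, 0.05]]
nir   = [[0.10, 0.30],
         [0.05, 0.20]]
print(water_mask(green, nir))  # -> [[True, False], [True, False]]
```

Because water reflects green light more strongly than near-infrared, positive NDWI values typically indicate water, but the best cutoff varies by scene, which is the source of the subjectivity noted above.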

Ref. [10] compared an MLP, NDWI, and a maximum likelihood model for water body classification and showed that the MLP performed best. However, the maximum likelihood model could not recognize small bodies of water and thin rivers, whereas NDWI could not distinguish seawater from land. The MLP identified small bodies of water better, but the analysis depended on visual assessment.

#### 3.2.4. "Shallow" ML-Based Water Body Detection

Although most recent methods for water body detection use DL and/or deeper neural networks (Sections 3.2.1–3.2.3), a few studies used only "shallow" ML methods (e.g., RF and SVM). In [35], the authors used band methods (adding slope, NDVI, and NDWI as three secondary bands to integrate extra information into ML training) and then applied an SVM, a decision tree (DT), and an RF to analyze multiband RS data for water body extraction in the Himalayas. While their models worked well for flat and hilly terrain, high elevations and snow had to be masked out (which involves extra preprocessing and limits when and where the method can work with optical data). The authors ran experiments to analyze which input bands (NDWI vs. individual Landsat bands) worked best, but could only compare results visually. They concluded that adding a single secondary band is better than adding several for most ML algorithms except NNs.

Sentinel-1 data and four different methods (a K-nearest neighbors (KNN) classifier, fuzzy-rules classification, Haralick's dissimilarity textural features, and Otsu valley-emphasis thresholding) were employed to classify water bodies in [54]. The approach chained several methods in tandem (i.e., the output of one model fed into further processing steps), which complicates interpretability. It did not achieve very high accuracy and did not work well in flooded regions, near buildings, or in the presence of aquatic vegetation. However, it was an important attempt to use synthetic aperture radar (SAR) data, which is rare in the water body detection literature.

#### *3.3. Recent Advances in Water Quality Monitoring Using AI*

From Table 3, we identify the following trends in the use of AI for water quality monitoring research: (1) Water quality monitoring differs from water body detection in that it is formulated as both a classification and a regression task. Because of this, recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent units (GRUs) are much more prevalent in the water quality literature. (2) Accuracy, precision, and recall are common metrics, as are variations of mean squared error (MSE) and R2. (3) While water body detection papers describe integrating multiple data sources into one analysis, this practice is much more common in water quality monitoring research. It primarily takes the form of matching water quality parameters from time series data or water samples to optical satellite RS imagery. In water quality monitoring, it is also much more common to utilize Internet of Things (IoT) sensors, smaller probes such as unmanned aerial vehicles (UAVs) and stationary hyperspectral imagers, as well as government and private water quality time series data. (4) Some studies do not compare their model to any other models (detailed in Table 3), making it difficult to fully assess their methodologies.
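For reference, the metrics named in point (2) can be computed from scratch as below; this is a minimal sketch with toy labels, and real studies would typically use library implementations such as scikit-learn's.

```python
# From-scratch versions of the metrics common in this literature:
# precision/recall for classification, MSE and R2 for regression.

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(precision_recall([1, 0, 1, 1], [1, 1, 1, 0]))
# -> (0.6666666666666666, 0.6666666666666666)
```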

Below, we provide a more detailed review of the methods used for water quality detection and monitoring. As our manual investigation (see Table 3) and machine summary (word cloud, see Figure 1) indicate, the dominant methods used in water quality detection with AI are LSTMs (Section 3.3.1) and ANNs, MLPs, DNNs, and other DL methods (Section 3.3.5). Beyond LSTM and ANN-based methods, there are other methods including LSTM hybrids (Section 3.3.2), CNN-based methods (Section 3.3.3), and "shallow" ML-based methods (Section 3.3.4).

#### 3.3.1. LSTM-Based Water Quality Detection and Monitoring

Algal blooms cause serious harm to human and animal health and can damage both environments and economies. Many factors contribute to algal blooms, and gathering the data needed to predict them is time- and cost-intensive. ML models can provide advance warning of these events using time series of basic water quality parameters. A linear regression model was compared with an MLP, an RNN, and an LSTM for predicting harmful algal blooms in dammed pools of several rivers [57]. While the LSTM was the most accurate overall, a least-squares regression model outperformed it for several of the dammed pools tested, casting doubt on how well the LSTM generalizes and whether it is worth the added complexity.

Water pollution is a growing problem because of rapid development and urbanization. Large volumes of water quality data can be collected via IoT sensors, and DL techniques are well suited to finding patterns in such data. An LSTM was used to predict future values of different water quality parameters [60]. Notably, the authors used only single-dimensional inputs and outputs (i.e., a 1D time series of dissolved oxygen as input to predict dissolved oxygen at some future time). While the results were good, the authors noted that the architecture would benefit from training on multiple time series simultaneously, and reported that long-term predictions on the order of 6 months ahead did not work well. Beyond monitoring water for pollutant levels, it is also important to trace pollutants to their sources once they are identified. Cross-correlation was used to map pollutants to different water quality parameters [58]; an LSTM then matched pollutants to nearby polluting industries using the highly correlated parameters.
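The single-dimensional setup described for [60] amounts to framing one series as supervised input/target pairs before training the LSTM; a sketch under an assumed window length and forecast horizon (both hypothetical, not from the paper):

```python
# Frame a 1D water quality series (e.g., dissolved oxygen) as
# supervised (window, target) pairs for sequence-model training.

def make_windows(series, n_in=3, horizon=1):
    """Each sample is n_in past values; the target is the value
    `horizon` steps after the window ends."""
    X, y = [], []
    for i in range(len(series) - n_in - horizon + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in + horizon - 1])
    return X, y

do_series = [7.1, 7.3, 6.9, 6.5, 6.8, 7.0]  # toy dissolved-oxygen values
X, y = make_windows(do_series)
print(X[0], y[0])  # -> [7.1, 7.3, 6.9] 6.5
```

The paper's observation that 6-month-ahead forecasts degrade corresponds to raising `horizon` far beyond the window length, where the past values carry little signal about the target.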

Similar to LSTMs, RNNs have been demonstrated to be accurate for time series prediction but are also often criticized for being difficult to interpret. Meanwhile, process-based ecological models, although deterministic, fail to capture patterns at longer time scales. A process-based model was integrated with an RNN to better align predictions of phosphorus levels in lakes and eliminate outlier predictions; constraining NN output with physics-based models better aligns predictions with ecological principles [68].

Rapid development has led to decreased water quality. In [70], water quality parameters were used both to classify the current water quality index and to predict future water quality index states. However, the authors compared DL models for water quality prediction and ML models for water quality classification separately, making the methods not directly comparable. A nonlinear autoregressive neural network (NARNET), a type of ANN, performed better than an LSTM at predicting the water quality index, while an SVM performed better than other traditional ML models for classification.

#### 3.3.2. LSTM Hybrids for Water Quality Detection and Monitoring

To further improve model performance, a few recent studies have integrated other models with LSTMs. Water scarcity and drought are increasingly significant environmental challenges, and increased development is worsening water pollution. Predicting water quality from time series data is essential, but traditional ML models fail to capture long-term temporal patterns, causing false predictions in water quality monitoring applications. An RNN–Dempster–Shafer (RNN–DS) evidence theory hybrid model was used to make sense of multiple input time series at different time scales [63]. While evidence theory did make the predictions more stable, longer-term predictions still did not work well, even with the improvements to the model. The authors suggested one possible reason: spatial correlations between water quality parameters were not taken into account.

Economic development and urban growth have created water quality issues. Wavelet domain threshold denoising (WDTD) and wavelet mean fusion (WMF) were used to analyze the output of LSTM predictions for multiple water quality parameters [65]. While multiple wavelet basis functions were used to analyze predictions, the LSTM was not compared to any other models. The authors noted that the lack of sufficient observations was a limitation when training their LSTM model.

Mangrove wetlands provide habitats for many animal species and prevent coastal erosion. Recent research has focused on monitoring water quality in these environments to assess the health of coastal ecosystems. Using water quality and meteorological time series data, three different submodels were trained for each water quality parameter at different time intervals, and their output predictions were fused [66]. The authors tested this setup with a DNN, a gated recurrent unit (GRU), and an LSTM model. While the LSTM performed best, the authors reported that the model is not very reusable or user-friendly.

Collecting and analyzing water samples is expensive, time-consuming, and labor-intensive. Thus, many researchers use sensors to remotely monitor water quality parameters, but the number of parameters such sensors can record is often limited. Ref. [69] used a submerged multiprobe sensor to monitor several important water quality parameters over the course of 1 year. The authors found that a CNN–LSTM model outperforms standalone DL models and traditional ML methods for predicting water quality parameter values; however, they did not use a validation set during NN training, and the hybrid model was able to quickly learn the training and testing set data distributions.

#### 3.3.3. CNN-Based Water Quality Detection and Monitoring

CNNs are the dominant architecture for water body detection (Sections 3.2.1 and 3.2.2) but are not used as widely for water quality. Here, we review two interesting and effective CNN-based methods. In situ water quality measurements are accurate but very expensive. In addition, parameters such as total nitrogen and phosphorus, biological oxygen demand, and dissolved oxygen are hard to measure from satellites because they have weak optical properties. A CNN was used in [59] to show that transfer learning (TL) outperforms traditional ML models when classifying water quality from RS imagery. However, the dataset was very small and the focus narrow (only two lakes in China; no rivers or coastal waters were covered). Water bodies are often polluted, or their quality affected, from far away, making water quality difficult to identify and report on; methods for estimating water quality at scale are therefore essential. Turbidity can be a proxy for total suspended solids (TSS) and suspended sediment concentration (SSC), so [72] applied image detection and edge detectors to UAV images of water, employing CNNs to detect changes in water color and using these to approximate quality. They showed that image-based turbidity detection is as accurate as physical turbidity meters and, more importantly, represents a very promising method for monitoring water quality at greater spatial scales.
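As a loose illustration of the color-based idea (this is our own toy sketch, not the actual pipeline of [72]), per-channel color statistics over a detected water region can serve as a crude turbidity-related feature before any CNN is involved:

```python
# Toy illustration: average RGB over a water region. Turbid, sediment-
# laden water tends to shift toward brighter, browner colors than clear
# water; the pixel values here are invented for demonstration.

def mean_channel(region):
    """Average each RGB channel over a list of (r, g, b) pixels."""
    n = len(region)
    return tuple(sum(px[c] for px in region) / n for c in range(3))

clear  = [(20, 60, 90), (25, 65, 95)]      # hypothetical clear-water pixels
turbid = [(120, 110, 90), (130, 115, 95)]  # hypothetical turbid-water pixels
print(mean_channel(clear))   # -> (22.5, 62.5, 92.5)
print(mean_channel(turbid))  # -> (125.0, 112.5, 92.5)
```

A CNN as used in [72] learns far richer spatial color features than these means, but the contrast between the two averages shows why water color carries turbidity information at all.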

#### 3.3.4. "Shallow" ML-Based Water Quality Detection and Monitoring

Remote water bodies are hard to monitor for water quality. A simple NN architecture was designed to estimate several water quality parameters (i.e., chlorophyll-a, turbidity, phosphorus) both before and after an ecosystem restoration project, during both the dry and wet seasons [55]. Importantly, the predictions, obtained by training the NN on seven different input bands, were very close to the actual values.

Choosing the input data for an ML model in water quality monitoring is neither easy nor straightforward, as different indices are sensitive to different areas and to varying weather and lighting conditions. To address this problem, [71] first correlated water quality parameters with different RS bands, and then used these correlations to test four ML models on their ability to predict a water quality index. However, their R2 statistics were not high.
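The correlation step in [71] amounts to ranking bands by a correlation coefficient against the in situ parameter; a minimal Pearson-r sketch with toy values (the band and turbidity numbers below are invented):

```python
# Pearson correlation between one spectral band and one water quality
# parameter; in [71]-style band selection, bands with the strongest
# correlations are kept as ML inputs.

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

band_green = [0.10, 0.20, 0.30, 0.40]   # hypothetical band reflectances
turbidity  = [5.0, 9.0, 13.0, 17.0]     # perfectly linear toy values
print(round(pearson(band_green, turbidity), 6))  # -> 1.0
```

In practice the correlations are far weaker than this perfect toy case, which is consistent with the low R2 values the authors reported.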

#### 3.3.5. ANN, MLP, DNN, and Other DL-Based Methods for Water Quality Detection and Monitoring

Climate change is making droughts and water shortages increasingly severe in arid regions, so it is important to develop methods and systems for intelligent and efficient monitoring of water resources there. A water quality index for arid regions was proposed in [56], where the authors attempted to identify which bands and spectral indices relate to that index. In situ water quality sampling is labor- and cost-intensive and often suffers from low temporal resolution. As bodies of water around the world change rapidly due to global warming, it is more important than ever to model their spatial variation through time. A point-centered regression CNN (PSRCNN) was used in [73] to analyze lake reflectance data and model water transparency. The authors concluded that their model outperformed different band ratios and traditional ML models (KNN, RF, SVM), although at the cost of generalization: the PSRCNN did not make stable predictions due to too little data.

There is currently not enough paired RS imagery and in situ water measurement data to build robust water quality monitoring applications. The generation of a synthetic dataset of atmospheric reflectances, and its suitability for water quality monitoring, was investigated in [76]. The synthetic dataset is physics-based and attempts to capture the natural variability in inland water reflectances and chlorophyll-a concentrations. An ANN trained on the synthetic dataset outperformed several traditional ML models (KNN, RF, XGBoost) in predicting actual water quality parameter values, although only the ANN was validated against unseen data. Still, synthetic data generation is a promising research direction for water body and water quality detection. Without RS imagery, many water quality monitoring programs will suffer from lack of spatial coverage due to labor, time, and cost constraints. Yet while RS is a useful tool for monitoring water quality parameters, it has not been meaningfully integrated into operational water quality monitoring programs. The authors of [75] used existing water quality time series data to assess the effectiveness of multiple RS data platforms and ML models in estimating various water quality parameters. They showed that some sensors correlate poorly with water quality parameters, while others are more suitable for monitoring tasks, and concluded that more research is needed to assess the suitability of paired RS imagery and in situ field data.

Current water quality monitoring systems are labor-, time-, and cost-intensive to operate. IoT sensors can monitor water quality parameters in near real time, allowing much more data to be recorded at much higher temporal resolution. A wireless sensor network built partly from IoT sensors was used in [61], with an MLP to classify water quality as either good or bad. The authors used the MLP predictions to notify water quality managers via SMS when the water quality drops below a certain threshold value. However, because of the cost of deploying and running the network, the authors were not able to include additional water quality parameters from bodies of water other than rivers. Water quality data collection is expensive and time-consuming, and there are usually tradeoffs between spatial and temporal resolution when implementing data collection programs. In addition, several key water quality parameters (pH, turbidity, temperature) can be estimated directly from optical and infrared RS imagery. Randrianianina et al. [64] used RS imagery and DNNs to model water quality parameters directly and then extended their analysis to map the distributions of those parameters across an entire lake, but they only focused on one lake and did not test their methods on other bodies of water.
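The SMS alerting logic described for [61] reduces to a threshold check on the classifier's output; in the sketch below, `send_sms`, the score scale, and the 0.5 cutoff are all hypothetical stand-ins, not details from the paper:

```python
# Threshold-driven alerting: a classifier's "good quality" score below
# the cutoff triggers a notification to water quality managers.

def check_and_alert(quality_score, send_sms, threshold=0.5):
    """Send an alert when the score falls below the threshold;
    return True if an alert was sent."""
    if quality_score < threshold:
        send_sms(f"Water quality alert: score={quality_score:.2f}")
        return True
    return False

alerts = []
check_and_alert(0.31, alerts.append)   # triggers an alert
check_and_alert(0.87, alerts.append)   # no alert
print(alerts)  # -> ['Water quality alert: score=0.31']
```

Substituting a real SMS gateway call for `send_sms` is the only deployment-specific piece; the decision logic itself is this simple comparison.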

As bodies of water are exposed to increased nutrient loads, harmful algal blooms can occur, leading to eutrophication. This process can create dead zones that kill wildlife and cause negative economic impacts. It is therefore important to monitor chlorophyll-a levels in water bodies and predict algal blooms before they happen. Zhao et al. [74] addressed this need by comparing DL models to traditional ML and curve-fitting methods for predicting chlorophyll-a levels, using time series measurements paired with RS imagery. The authors had little data, as they limited collection to a single lake, so the DL models did not perform well. Additionally, the ML models used in this paper needed more data and computation than simpler models to perform well.

It is often difficult to monitor inland water bodies for quality because of low signal-to-noise ratios and limitations in resolution. A proximal hyperspectral imager was used in [77] to collect time series with high spectral and temporal resolution for continuous water quality observation. The authors found that index-based methods of water quality detection were difficult to calibrate, as thresholding values are subjective, while ML and DL models performed much better. However, the authors showed that their models do not generalize well to other water bodies with different water quality parameter distributions.

Anthropogenic activities increasingly threaten coastal ecosystems, which are complex bodies of water and important to monitor. The performance of an ANN was compared to traditional ML models in [62] for predicting various water quality parameters. In some cases, traditional ML methods outperformed the ANN. More importantly, the authors conducted a relative variable importance analysis to show which sets of input data helped the ML models learn the most. While this analysis is critically important, the authors only tested their method on cloud-free RS imagery, limiting its utility. Additionally, while biophysical and chemical water quality parameters were analyzed, little work was carried out with bio-optical data due to issues with data availability.

While recent advances in RS capabilities for water quality detection are substantial in the literature, few papers have collected and synthesized the resources available to researchers. In a paper reviewing recent trends in RS imagery, cloud computing, and ML methods, [67] combined time series data from hundreds of water quality parameters and water samples with proximal imagery, hyperspectral imagery, and two sets of data from different satellite platforms. They showed that DNNs outperform many other traditional processing and ML techniques for assessing water quality, and concluded that anomaly detection using multisensor data is the most promising method for algal bloom detection. As is sometimes the case in the water body detection and water quality monitoring literature, the authors did not use a third holdout set (necessary in DL projects to verify that the data has not been memorized).
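A third holdout set is straightforward to carve out; below is a sketch of a conventional random train/validation/test split. The 70/15/15 ratios are our assumption for illustration, not a prescription from the reviewed work:

```python
# Random three-way split so that a test set is held out and never seen
# during training or model selection.
import random

def split_dataset(samples, train=0.7, val=0.15, seed=0):
    """Shuffle indices deterministically, then cut into
    train / validation / test partitions."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_tr = int(len(samples) * train)
    n_va = int(len(samples) * val)
    return ([samples[i] for i in idx[:n_tr]],
            [samples[i] for i in idx[n_tr:n_tr + n_va]],
            [samples[i] for i in idx[n_tr + n_va:]])

tr, va, te = split_dataset(list(range(100)))
print(len(tr), len(va), len(te))  # -> 70 15 15
```

For time series data, a chronological rather than random split is usually more appropriate, since shuffling leaks future information into training.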

#### **4. Challenges and Opportunities**

In this section, we first provide a brief summary and discussion of the key themes and overall insights (Section 4.1) derived from reviewing the range of research discussed above. In Section 4.2, we provide and discuss some of the major challenges we identified through our systematic survey. Specifically, those challenges shared in both domains are detailed in Section 4.2.1, those specific only to water body extraction in Section 4.2.2, and those specific to water quality monitoring in Section 4.2.3. Finally, we discuss possible research directions and related opportunities for water body detection and water quality monitoring using RS and AI in Section 4.3.

#### *4.1. Summary and Discussion*

After introducing the essential terms in AI and RS (Appendix B) and commonly used evaluation metrics in ML and DL for classification, regression, and segmentation tasks (Appendix C), we reviewed recent and influential research for water body detection and water quality monitoring using RS and AI (Section 3).

While the research investigated in Section 3 has demonstrated the power of using RS and AI to detect water bodies and monitor water quality, very few studies thus far have performed integrative research on both water body detection and water quality monitoring. In addition, most existing RS- and AI-based work on water bodies and water quality repeats the same (or very similar) methods in a different research location or on a different (usually small) dataset. However, real intelligent water resource management applications will require serious development that goes beyond this type of research. Before operational applications can be deployed, AI models (especially DL models) need to be trained on large and representative benchmark datasets, with a focus on making models generalizable and interpretable.

We noticed that most work does not include hardware specifications (e.g., the CPU/GPU the authors used to run their models) and/or processing times. To make models comparable, and for the sake of replicability and reproducibility, it is essential to report such information. This is true even for index-based methods and more traditional ML models, so that researchers can fully evaluate the trade-offs between runtime, accuracy, and ease of implementation. We hope our review will provide a useful guide to make future research more replicable and reproducible. From our interactive web app (the URL and a brief demo video link are provided in Appendix A), we also noticed that while most papers have an open access PDF/HTML version of their manuscripts, a sizable portion (16 of the 56 reviewed articles) do not. We suggest that authors provide an open access version (e.g., by posting the accepted version to ResearchGate/arXiv) to increase the visibility of their research and thus accelerate the advancement of scientific knowledge.

#### *4.2. Identified Major Challenges*

Below, we present the challenges most commonly posed for water body and water quality research in the literature we reviewed. Those shared by both domains are outlined in Section 4.2.1, and those specific to each domain are detailed in Sections 4.2.2 and 4.2.3, respectively.

#### 4.2.1. Shared Common Challenges in Both Domains

A summary of the shared common challenges and identified problems in water body extraction and water quality monitoring using RS and AI is provided below.


• Both domains over-rely on optical RS imagery, and thus clouds and shadows are a persistent problem and heavily skew the results towards working only in cloud-free conditions.

**Table 4.** Existing datasets for waterbody extraction and water quality monitoring.


#### 4.2.2. Additional Challenges in Water Body Extraction

The specific challenges and problems identified for water body extraction are summarized below.


4.2.3. Additional Challenges in Water Quality Monitoring

The specific challenges and problems identified for water quality monitoring are summarized below.


of data at the same location is critical, as computational methods require such data to verify their model performance in order to generalize to new water bodies.


#### *4.3. Research Directions and Opportunities*

Here, we provide five research directions, each along with its promising opportunities, from our investigation and based on the posed challenges discussed in Section 4.2 above.

#### 4.3.1. Urgent Need of Large and Comprehensive Benchmark Datasets

Large, representative, balanced, and open-access benchmark datasets are critical for AI to meaningfully advance any domain [84–86]. In computer science, especially in the subfields of CV and DL, there are comprehensive, large, and open-source databases (e.g., ImageNet [87] for image classification tasks, and Microsoft COCO [88] for object detection and segmentation tasks). The availability of large, open-source image repositories has dramatically boosted recent advances in novel and robust DL and CV algorithms, as computer science researchers do not need to worry about collecting datasets and can instead focus on developing new algorithms and methods.

In our systematic review, we identified an urgent need for more curated, labeled datasets for intelligent water body extraction and water quality monitoring. We found some of the few available open-source datasets with water body boundary labels through our literature review, but also sought out additional datasets. We identified datasets that were not used in our literature review but contain water body labels, or datasets that were used for water body detection or water quality monitoring that did not use ML/DL/CV but would be useful for benchmarking tasks. Our search results are summarized in Table 4 above. Below, we list a few opportunities in this direction.

(1) *More public data and code: currently, most authors do not share their code and/or datasets.* See the two quotes below from [25]: (a) "Lack of deep learning-ready datasets within the water field [ ... ] The main problem caused by this absence of many datasets is that the research community does not build upon previous work in terms of constructing better neural network architectures and moving the state of art to the next iteration [ . . . ]"; (b) "[ . . . ] many papers are published that achieve the same task with almost identical methods but different data." Part of this issue is a replication crisis in the water body detection and water quality monitoring literature, but it stems more broadly from the lack of public codebases and datasets.

(2) Some promising ways to generate large datasets of good quality


annotation of RS imagery available on GEE. To our knowledge, no RS datasets for water body detection and water quality monitoring have been downloaded from GEE and then annotated, nor do interfaces exist for directly annotating RS imagery on GEE.

- "The first dataset was collected from the Google Earth service using the BIGEMAP software (http://www.bigemap.com, accessed on 15 December 2021). We named it as the GE-Water dataset. The GE-Water dataset contains 9000 images covering water bodies of different types, varying shapes and sizes, and diverse surface and environmental conditions all around the world. These images were mainly captured by the QuickBird and Land remote-sensing satellite (Landsat) 7 systems." [49].
- "We constructed a new water-body data set of visible spectrum Google Earth images, which consists of RGB pan-sharpened images of a 0.5 m resolution, no infrared bands, or digital elevation models are provided. All images are taken from Suzhou and Wuhan, China, with rural areas as primary. The positive annotations include lakes, reservoirs, rivers, ponds, paddies, and ditches, while all other pixels are treated as negative. These images were then divided into patches with no overlap, which provided us with 9000 images [ . . . ]" [34].

#### 4.3.2. Generalization

It is important to obtain a good accuracy score when training an ML/DL model, but perhaps more important is the model's ability to generalize to unseen data. The ultimate goal of ML/DL is to develop predictive models by finding statistical patterns in a training set that then generalize well to new, previously unseen data outside the training set [89]. Ideally, this is achieved by training on large and representative datasets that capture nearly all variations in the actual data distribution [86,89]. An ML/DL model with good generalization capability strikes the best trade-off between underfitting and overfitting, so that the trained model obtains the best performance (see the "Generalization, overfitting, underfitting and regularization" entry in Appendix B for details). Below, we outline a few ways to make AI systems more generalizable for water body detection and water quality monitoring tasks.

(1) Create robust AI methods for tiny water body detection. Depending on resolution, tiny water bodies such as ponds or small lakes in desert cities are difficult to identify yet may play a more critical role than we think.

(2) Develop NN architectures and comprehensive datasets (see Section 4.3.1) that are able to recognize water bodies not just from


(3) Utilize data from multiple sources to train ML/DL models. Our comprehensive investigation showed that most current AI methods can only handle water quality and/or water body detection data from one specific type of RS imagery. Improving on this is a promising research direction: specifically, it will be important to use data from multiple platforms and resolutions, under varying weather conditions, and from regions with different ecosystem and terrain types. Humans can recognize water bodies in different RS imagery under different weather conditions, and we expect that machines should be able to match this performance given robust AI algorithms and comprehensive datasets. See some example research below:


(4) Propose new frameworks for improving generalizability. Generalization is one of the fundamental unsolved problems in DL. The goal of a generalization theory in supervised learning is to understand when and why trained ML/DL models have small test errors [90]. The recently proposed deep bootstrap framework [90] provides a new lens for understanding generalization in DL. This new framework has the potential to advance our understanding of water domain research empowered by RS and AI by highlighting important design choices when processing RS imagery with DL.

#### 4.3.3. Addressing Interpretability

DL has achieved significant advances, with strong performance on many tasks in a variety of domains, including some water domain tasks (detailed in Section 3). In the literature we reviewed for this paper, DL models have produced results comparable to, and in some scenarios even superior to, those of human experts. Improving predictive accuracy is important; however, improving the interpretability of ML/DL models is even more important, especially through visualization of ML/DL model output for later analysis by humans [18]. Interpretability is one of the primary weaknesses of DL techniques and has drawn wide concern and attention in the DL community [91]. Due to the overparameterized, black-box nature of DL models, it is often difficult to understand their prediction results [92,93], and explaining their behavior remains challenging due to their hierarchical, nonlinear nature. The lack of interpretability raises major concerns across several domains; for example, in high-stakes prediction applications such as autonomous driving, healthcare, and financial services [94], trust in DL models is critical. While many interpretation tools (e.g., image perturbation and occlusion [95], visualizing NN activation weights and class activation mapping [96,97] or attention mechanisms [98,99], feature inversion [100], and local interpretable model-agnostic explanations, or "LIME" [101]) have been proposed to explain how DL models make decisions, whether from a scientific perspective or a social angle, explaining the behavior of DL models remains a work in progress [92]. For water domains, we list below some specific opportunities we identified in terms of interpretability.

- The authors in [62] systematically analyzed relative variable importance to show which sets of input data contributed to the ML models' performance. See the quoted text below: "Relative variable importance was also conducted to investigate the consistency between in situ reflectance data and satellite data, and results show that both datasets are similar. The red band (wavelength ≈ 0.665 μm) and the product of red and green band (wavelength ≈ 0.560 μm) were influential inputs in both reflectance data sets for estimating SS and turbidity, and the ratio between red and blue band (wavelength ≈ 0.490 μm) as well as the ratio between infrared (wavelength ≈ 0.865 μm) and blue band and green band proved to be more useful for the estimation of Chl-a concentration, due to their sensitivity to high turbidity in the coastal waters".
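The image perturbation and occlusion idea [95] mentioned above can be illustrated with a minimal occlusion-sensitivity sketch: slide a blanking patch over the input and record how much the model's score drops at each position. The `score` function below is a toy stand-in for a trained model (not any method from the reviewed papers), used only to show the mechanics:

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4, fill=0.0):
    """Slide an occluding patch over the image and record how much the
    model's score drops at each position (larger drop = more important)."""
    h, w = image.shape
    base = score_fn(image)
    heat = np.zeros((h - patch + 1, w - patch + 1))
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i, j] = base - score_fn(occluded)
    return heat

# Toy stand-in "model": scores an image by the mean brightness of its
# top-left 8x8 corner, so occlusions there should matter most.
score = lambda im: im[:8, :8].mean()
img = np.random.default_rng(0).random((16, 16))
heat = occlusion_map(img, score)
# The heat map peaks over the top-left region the toy model depends on.
```

The same loop applies unchanged to a real classifier: replace `score_fn` with the model's class probability for the class of interest.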


#### 4.3.4. Ease of Use

As emphasized in [13,14], one of the major current challenges for water resource management is the integration of water quality data and indices from multiple sources into usable and meaningful insights for actionable management decisions. Geovisualization, also known as geographic visualization, uses visual representations of geospatial data and cartographic techniques to facilitate thinking, understanding, knowledge construction, and decision support about human and physical environments at geographic scales of measurement [102,103]. Geovisualization is widely utilized in different domains (e.g., public health [104], crisis management [105,106], environmental analysis [107–109], and climate change strategies [110]) for the exploration and analysis of spatiotemporal data. To the best of our knowledge, very little research has leveraged geovisualization in this way for water resources management. The only similar work we are aware of is [111], where a web interface powered by GEE allows the authors' expert system, combined with visual analytics, to be run on any Landsat 5, 7, or 8 imagery to draw boundaries for water bodies. Geovisualization through interactive web applications thus provides a promising solution to the challenge of integrating water quality data and indices from multiple sources [112–115]. We suggest a few research opportunities in this direction below.

• Simply applying existing AI/ML/CV/DL algorithms/methods (with or without minor modifications) to RS big-data imagery problems is still very far from producing real-world applications that meet the needs of water management professionals and policymakers. As echoed in [13], "[ ... ] realizing the full application potential of emerging technologies requires solutions for merging various measurement techniques and platforms into useful information for actionable management decisions, requiring effective communication between data providers and water resource managers" [116]. Much more multidisciplinary and integrative collaboration, in both depth and breadth, is in high demand. Scholars and practitioners with interdisciplinary backgrounds will play a major role in this integration. For example, researchers who have expertise in RS and also know how to utilize AI, collaborating with domain experts such as water resources management officers, will significantly advance this research direction. Intuitive interactive web apps powered by both geovisualization and AI/ML/DL/CV will make such interdisciplinary collaboration much more seamless.


#### 4.3.5. Shifting Focus

From our investigation, it is clear that with enough annotated data and allocated computing, DL models are more accurate than traditional ML models, which are in turn more accurate than index-based methods for water body detection and water quality monitoring tasks. Increasing the accuracy of models by fractions of a percent should be given much less focus and attention moving forward. Water body detection methods are unlikely to improve upon the high rates of accuracy already reported in the literature without very high-resolution, very large, labeled datasets or the use of UAVs to detect small water bodies. Instead, we suggest that future research should focus more on reducing model parameters and making model training less computationally expensive in terms of time (e.g., designing neural networks to use constant memory at inference time [40], or by using TL [37,59]). Below, we outline some additional potential research directions we identified through our systematic review.

- Unsupervised learning methods are able to learn from large sets of *unlabeled* data, as demonstrated in [29,46].

#### **5. Conclusions**

Building intelligent and synoptic water monitoring systems requires automating water body extent detection using RS imagery, from which volume can be computed, as well as automating the corresponding water quality estimation, eventually linking the two to allow synoptic water quality monitoring. Yet, to date, water body detection and water quality monitoring research have been historically separate. Our systematic investigation indicates the following trends: deep learning is much more commonly used in water body detection, whose dominant data source is RS imagery, whereas the water quality literature often involves other types of data sources (e.g., in situ sensors and smaller, non-satellite RS devices). These trends relate to the scale of projects in the two domains: water body extraction is usually undertaken across large spatial scales, whereas the water quality monitoring literature is still focused on smaller, often individual, bodies of water. This points to one of the future research directions in the water quality literature touched on in Section 4.3: we need to scale up water quality estimation using RS imagery by matching it with ground-truth water quality measurements.

Overall, based on the systematic review above, we contend that RS integrated with AI/ML/DL/CV methods, along with geovisualization, have great potential to provide smart and intelligent support for water resources monitoring and management. Thus, this integration has considerable potential to address major scientific and societal challenges, such as climate change and natural hazards risk management.

**Author Contributions:** All authors have contributed to this review paper. L.Y. initiated the review, contributed to writing and overall organization, identified selected research to include in the review, supervised the web app design and development, and coordinated input from other authors. J.D. took the lead on identifying relevant literature, contributed to writing and editing the text, and provided the data for the accompanying interactive web app. S.S. contributed to the web app design and development, word clouds visualization, and editing. Q.W., C.D.L. and M.M. have contributed to editing and M.M. also contributed to writing part of the introduction section along with identifying some relevant literature. All authors have revised the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This material is partly based upon work supported by the funding support from the College of Arts and Sciences at University of New Mexico. The authors are also grateful to the three reviewers for their useful suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations (in alphabetical order) are used in this manuscript:




#### **Appendix A. The Accompanying Interactive Web App Tool for the Literature of Intelligent Water Information Extraction Using AI**

In Section 1.1, we provided a brief map and graphic summary of the papers covered in this review. To allow readers to obtain more useful and dynamic information and insights from the papers reviewed, we have developed an interactive web app. Through the web app, readers can keep track of the major researchers and access an up-to-date list of publications on the reviewed topics. Updated publications are accessible through (1) a researcher's public academic profile on Google Scholar or ResearchGate (see Figure A1a for an example), and (2) a continuously updated citation count of the papers reviewed in this paper (see Figure A1b for an example: the "Cited by" count was 47 on 10 November 2021, when we first entered the paper into our data file; just before submission of this paper, clicking the "Cited by" URL showed an up-to-date count of 49). The web app can be accessed publicly, *free of charge*, at https://geoair-lab.github.io/WaterFeatureAI-WebApp/index.html.


**Figure A1.** Our highly interactive web app (accessible publicly at https://geoair-lab.github.io/WaterFeatureAI-WebApp/index.html, accessed on 5 December 2021) lets readers track scholars and publications with just a few clicks; see the example pop-up. Readers can access (1) a direct link to the PDF file of the paper (note that if there is no free, publicly available version of the paper, we link directly to the journal page so readers can obtain the paper if their institution subscribes to the journal database), (2) the scholar profile (Google Scholar/ResearchGate URL) of the first author, and (3) the "Cited by" Google Scholar page. (**a**) Water body and quality AI literature map pop-up. (**b**) "Cited by" Google Scholar page corresponding to the paper shown in (**a**).

#### **Appendix B. Essential AI/ML/DL/CV Terms**

In this appendix, we provide brief definitions of some essential ML/DL/RS terms (ordered alphabetically) used in our review. For readability, we group some related concepts together.

**Ablation Studies:** In AI, particularly in ML and DL, ablation is the removal of a component of an AI system; the term is analogous to ablation in biology (removal of components of an organism). An ablation study investigates the performance of an AI system by removing certain components in order to understand each component's contribution to the overall system, and such studies are crucial for AI, especially DL, research. Note that ablation studies require that the system exhibit *graceful degradation* (i.e., that it continues to function even when certain components are missing or degraded). The motivation is that, while individual components are engineered, the contribution of an individual component to the overall system performance is often unclear; removing components makes this analysis possible. Simpler is better: if two models obtain the same performance, we prefer the simpler one.
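As an illustration of the idea (not any specific study from the reviewed literature), the following sketch "ablates" input features of a simple least-squares model one at a time on invented data, and measures the resulting error increase; the same logic applies to removing components of a larger AI system:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data: y depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.normal(size=200)

def fit_mse(cols):
    """Least-squares fit using only the listed feature columns; return MSE."""
    A = X[:, cols]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ w - y) ** 2))

full = fit_mse([0, 1, 2])
# Ablate one feature at a time: the MSE increase measures its contribution.
increase = [fit_mse([c for c in range(3) if c != k]) - full for k in range(3)]
# increase[0] is large, increase[1] small, increase[2] negligible.
```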

**Convolution, kernel (i.e., filter), and feature map** [121–123]:

Convolutional layers are the major building blocks in CNNs. A convolution is the simple application of a filter (i.e., kernel) to an input that results in an activation. Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input (e.g., an image).

Convolution: Convolution is one of the most important operations in signal and image processing: a mathematical operation that merges two sets of information. It multiplies together two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of the same dimensionality. In image processing, this can be used to implement operators whose output pixel values are simple linear combinations of certain input pixel values.

A convolutional filter (i.e., kernel) is a weight matrix (vector for one-dimensional and *cube* for three-dimensional data) which operates through a sliding window on input data. The convolution is performed by determining the value of a central pixel through adding the weighted pixel values of all its neighbors together. Specifically, it is carried out by sliding the kernel over the input image, generally starting at the top left corner, so as to move the kernel through all the positions where the kernel fits entirely within the boundaries of the input image. Each kernel position corresponds to a single output pixel, the value of which is calculated by multiplying together the kernel value and the underlying image pixel value for each of the cells in the kernel, and then adding all these numbers together. The output is a new modified filtered image. Convolution is a general purpose filter effect for images. Depending on the kernel structure, the operation enhances some features of the input data (e.g., blurring, sharpening, and edge detection).

In the context of a CNN, a convolution is a linear operation that involves the multiplication of a set of weights with the input. Given that the technique was designed for two-dimensional input, the multiplication is performed between an array of input data and a two-dimensional array of weights (i.e., a filter or a kernel). Technically, note that in CNNs, although it is referred to as a "convolution" operation, it is actually a "cross-correlation". That is, in CNNs, the filter is not flipped as is required in typical image convolutions; except for this flip, both operations are identical.

Kernel (i.e., filter): A kernel is a small matrix used in image convolution, which slides over the input image from left to right and top to bottom. Differently sized kernels, containing different patterns of numbers, produce different results through the convolution operation. The size of a kernel is arbitrary, but 3 × 3 or 5 × 5 is often used. Think of a filter as a membrane that allows only the desired qualities of the input to pass through it.

Feature map: The feature maps of a CNN capture the result of applying the filters to an input image (i.e., at each layer, the feature map is the output of that layer). Think of them as (higher-level) representations of the input. Each convolutional layer outputs one feature map per filter, so the number of feature maps equals the number of filters.
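A minimal NumPy sketch ties these three concepts together; note that, as stated above, the operation below is the cross-correlation that CNNs call "convolution" (the kernel is not flipped), and the image and kernel values are invented for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' sliding-window cross-correlation (what CNNs call convolution):
    each output pixel is the weighted sum of the kernel over the underlying
    image patch."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel applied to an image with a sharp left/right
# brightness step: the resulting feature map peaks at the edge location.
image = np.zeros((5, 6))
image[:, 3:] = 1.0                        # right half bright
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
fmap = conv2d(image, edge_kernel)         # responds only near the step
```

Here `fmap` is the feature map: it is zero over the flat regions and large where the kernel's pattern (a vertical edge) is present.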

#### **Data augmentation (DA)** [124]:

ML (especially DL) model performance often improves with an increase in the amount of data. The common case in most ML/DL applications, especially in image classification tasks, is that obtaining new training data is not easy. Thus, we need to make good use of the existing (relatively small) training set. DA is one technique to expand the training dataset from existing training data in order to improve the performance and generalizability of DL models. DA enriches (i.e., "augments") the training data by creating new examples through random transformation of existing ones. This way, we artificially boost the size of the training set, reducing overfitting. Thus, to some extent, DA can also be viewed as a regularization technique.

Image DA is perhaps the most well-known type of DA and involves creating transformed versions of images in the training dataset that belong to the same class as the original image. The ultimate goal is to expand the training dataset with new, plausible examples (i.e., variations of the training set images that are likely to be seen by DL models). For example, a horizontal flip of a bike photo makes sense, because the photo could have been taken from the left or the right. A vertical flip of a bike image does not make sense and would probably not be appropriate, as the model is very unlikely to see a picture of an upside-down bike. Transformations for image DA include a range of operations from the field of image manipulation (e.g., rotation, shifting, resizing, flipping, zooming, exposure adjustment, contrast change, and many more). This way, many new samples can be generated from a single training example.

Note that image DA is typically applied only to the training dataset, and NOT to the validation or test dataset. This is different from data preparation such as image resizing and pixel scaling, which must be performed consistently across all datasets that interact with the model. The specific DA techniques used for a training dataset must be chosen carefully, within the context of the dataset and with knowledge of the problem domain. It can be useful to experiment with DA methods in isolation and in concert to see whether they yield a measurable improvement in model performance, perhaps with a small prototype dataset, model, and training run.
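A minimal image-DA sketch in NumPy, using only label-preserving transformations that are plausible for typical photos (no vertical flip, per the bike example above); the toy image is invented, and real pipelines would typically use a DL framework's augmentation utilities:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of a training image: horizontal
    flip, 90-degree rotation, or a small horizontal shift. Applied to
    training images only, never to validation/test data."""
    choice = rng.integers(3)
    if choice == 0:
        return np.fliplr(image)                   # horizontal flip
    if choice == 1:
        return np.rot90(image)                    # 90-degree rotation
    return np.roll(image, int(rng.integers(-2, 3)), axis=1)  # small shift

rng = np.random.default_rng(0)
img = np.arange(16.0).reshape(4, 4)               # toy 4x4 "image"
batch = [augment(img, rng) for _ in range(8)]     # 8 new plausible samples
```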

**DeepLabV3+** [125]: DeepLabV3 was first proposed to enable deep CNNs to segment features in images *at multiple scales.* ResNet-50 and ResNet-101, two variations of the popular residual network (ResNet) architecture, are the tested backbones for DeepLabV3. Through the use of residual blocks, atrous convolution, and a spatial pyramid pooling module, the authors showed that their architecture achieved performance comparable to other SOTA image segmentation models without the need for further post-processing. The authors later improved DeepLabV3 and named the new version DeepLabV3+ [126], which combines atrous spatial pyramid pooling modules with an encoder–decoder module; this further improved performance while sharpening predicted feature boundaries. The DeepLabV3+ architecture is very popular in the water body extraction literature.

**Generative adversarial network (GAN):** A GAN is a class of unsupervised DL frameworks in which two neural networks compete with each other. One network, the generator, tries to create synthetic (false) images that fool the discriminator network; the discriminator, in turn, attempts to discern which images coming from the generator are actual vs. synthetic [127]. GANs learn through an adversarial zero-sum game framework. Among the many GAN variants, CycleGAN [128] trains unsupervised image translation models using the GAN architecture and unpaired collections of images from two different domains. CycleGAN has been demonstrated on a wide range of applications, including season translation, object transfiguration, style transfer, and generating photos from paintings.

**Generalization, overfitting, underfitting and regularization (referenced [123,129,130]):**

The predictions of an ML/DL model fall into one of four regimes: (a) low bias, low variance; (b) low bias, high variance; (c) high bias, low variance; or (d) high bias, high variance. A low-bias, high-variance model is called overfit, and a high-bias, low-variance model is called underfit. A trained model achieves the best performance, through generalization, when the best trade-off between underfitting and overfitting is found. Learning with good accuracy is good, but generalization is what matters most. A good model should have both low bias and low variance; overfitting and underfitting should both be avoided, and regularization may help.

Generalization: In ML/DL, generalization refers to the ability of a trained ML/DL model to react to new (i.e., previously unseen) data, drawn from the same distribution as the training data used to create the model. That is, after being trained on a training set, an ML/DL model can digest new data and make accurate predictions. The generalizability of an ML/DL model is central to the success of that model.

Overfitting vs. underfitting: Variance and bias are two important terms in ML. Variance refers to the variety of predicted values made by an ML model (target function). Bias means the distance of the predictions from the actual (true) target values. A high-biased model means its prediction values (average) are far from the actual values. In addition, high-variance prediction means the prediction values are highly varied.

If an ML/DL model has been trained too well on training data, it will be unable to generalize. It will make inaccurate predictions when given new data, making the model useless even though it is able to make accurate predictions for the training data. This is called overfitting. Underfitting happens when a model has not been trained enough on the data. Underfitting models are not useful either, as they are not capable of making accurate predictions, even with the training data.

Low training error combined with high test error is a good indicator of overfitting. To check for overfitting, part of the dataset is typically set aside as the "test set". If the training data have a low error rate and the test data have a high error rate, this signals overfitting: the model maps the training set (nearly) perfectly, and any deviation from it results in errors. An underfit model, in contrast, has high error on the training data, and therefore also on the test data and new unseen data, because it has failed to capture the underlying patterns in the training data.

Regularization (also known as shrinkage): When an ML/DL model becomes too complex, it is likely to suffer from *overfitting*. Regularization is a collection of methods that constrain a model, making it simpler and less flexible, in order to avoid high variance (i.e., overfitting) and thus increase generalization. Intuitively, the function the model represents becomes simpler and smoother, so predictions are smoother and overfitting is less likely. Different regularization approaches apply to different ML algorithms; for example, pruning for DTs, dropout for NNs, and adding a penalty term to the cost function in regression.
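These concepts can be demonstrated with a small NumPy sketch: a high-degree polynomial fitted by plain least squares overfits 15 noisy training points (very low training error, much higher test error), while an L2 (ridge) penalty, one illustrative choice of regularizer, trades a little training error for better generalization. The data and degree are invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    """Noisy samples of a smooth function."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

# A degree-12 polynomial with only 15 training points is flexible enough
# to fit the noise, i.e., to overfit.
x_tr, y_tr = make_data(15)
x_te, y_te = make_data(200)
A_tr = np.vander(x_tr, 13, increasing=True)
A_te = np.vander(x_te, 13, increasing=True)

def mse(w):
    return (float(np.mean((A_tr @ w - y_tr) ** 2)),
            float(np.mean((A_te @ w - y_te) ** 2)))

# Plain least squares minimizes training error only.
w_plain, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
# Ridge regularization: an L2 penalty shrinks the coefficients, trading a
# little training error for smoother predictions and better generalization.
lam = 1e-2
w_ridge = np.linalg.solve(A_tr.T @ A_tr + lam * np.eye(13), A_tr.T @ y_tr)

train_plain, test_plain = mse(w_plain)
train_ridge, test_ridge = mse(w_ridge)
# Overfitting signature: training error far below test error for the plain
# fit; the regularized fit generalizes better on held-out data.
```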

**Google Earth (GE):** GE is a computer program, formerly known as Keyhole EarthViewer, that renders a 3D representation of Earth based primarily on satellite imagery. It has a web version at https://earth.google.com/web/, accessed on 2 January 2022. Since GE version 4.3, Google has fully integrated Street View into Google Earth. Street View displays 360° panoramic street-level photos of select cities and their surroundings. The photos, taken by cameras mounted on automobiles, can be viewed at different scales and from many angles, and are navigable via arrow icons imposed on them.

**Google Earth Engine (GEE) and Microsoft Planetary Computer (MPC):**

GEE and MPC share similar goals (e.g., cloud storage and computing support for geospatial datasets), but each has its own primary focus. GEE is the pioneer in RS cloud computing (launched in 2010; 495 datasets in total as of 22 December 2021), whereas MPC, launched in 2020 (17 datasets in total as of 22 December 2021), focuses primarily on climate change and sustainable environmental studies.

GEE [131,132]: GEE is a cloud-based platform for planetary-scale geospatial analysis, launched in 2010 by Google. GEE combines a multipetabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities. Scientists, researchers, and developers use GEE to detect changes, map trends, and quantify differences on the Earth's surface. GEE brings Google's massive computational capabilities to bear on a variety of high-impact societal problems (e.g., deforestation, drought, disaster, disease, food security, water management, climate monitoring, and environmental protection). GEE has been available for commercial use since 2021 and remains free for academic and research use.

MPC [133,134]: The world lacks comprehensive, global environmental data. Microsoft Chief Environmental Officer (CEO), Dr. Lucas Joppa, imagines an international database that would provide the world with "information about every tree, every species, all of our natural resources". Microsoft President Brad Smith further emphasized that "it should be as easy for anyone in the world to search the state of the planet as it is to search the internet for driving directions or dining options". Microsoft believes technology and AI are the key to getting there, in the hope that this information will allow people to "come together and solve some of the greatest environmental and sustainability challenges we face today".

To support sustainability decision-making with the power of cloud computing and AI, similar to GEE, Microsoft has since December 2020 been using ML and computing power to aggregate global environmental data (contributed by individuals around the world, coupled with machinery placed in water, space, land, and air environments) into a planetary computer for a sustainable future. MPC, described as a "global portfolio of applications connecting trillions of data points", is designed to use AI to synthesize environmental data into practical information regarding the Earth's current ecosystems. For the first time, there will be a concise and comprehensive compendium of international ecosystem data. Not only will this make essential environmental information readily available to individuals across the world, but the planetary computer will also predict future environmental trends through ML. In short, MPC integrates a multipetabyte catalog of global environmental data with APIs, a flexible scientific computing environment that allows people to answer global questions about that data, and applications that place those answers in the hands of conservation stakeholders.

**Image classification:** The concept of image classification has different meanings in RS and in ML/DL settings. In RS research, image classification operates at the pixel level (this is what semantic segmentation does in CV/ML/DL settings; see the corresponding entry below). In contrast, in an ML/DL setting, image classification does not refer to assigning each individual pixel to a class (e.g., vegetation, water), but rather to assigning the entire image to a specific class (e.g., flooded vs. not flooded) [135].

**Instance segmentation:** Unlike semantic segmentation, instance segmentation identifies each object instance of each pixel for every known object within an image. Thus, labels are instance-aware. Instance segmentation is essential to tasks such as counting the number of objects and reasoning about occlusion.

**Normalized difference moisture index (NDMI)** [136,137]: Normalized difference moisture index (NDMI) is a satellite-derived index from the near-infrared (NIR) and short wave infrared (SWIR) channels of RS imagery (note that some literature used NDMI interchangeably with NDWI; check the NDWI entry in this Appendix B for clarification).

NDMI is sensitive to moisture levels in vegetation and is thus used to determine vegetation water content. It can be used to monitor droughts as well as fuel levels in fire-prone areas. NDMI uses the NIR and SWIR bands to create a ratio designed to mitigate illumination and atmospheric effects; it is calculated as a ratio between the NIR and SWIR values from RS imagery (see the formula below). For example, in Landsat 4–7, NDMI = (Band 4 − Band 5)/(Band 4 + Band 5); in Landsat 8, NDMI = (Band 5 − Band 6)/(Band 5 + Band 6). The resulting NDMI is a single-band image. Similar to NDVI, NDMI values range between −1 and 1.

$$\text{NDMI} = (\text{NIR} - \text{SWIR}) / (\text{NIR} + \text{SWIR})$$

**Normalized difference vegetation index (NDVI)** [138]: NDVI is a pixel-wise mathematical calculation rendered on an image. It is an indicator of plant health, calculated by comparing the values of absorption and reflection of red and near-infrared (NIR) light. A single NDVI value can be determined for every pixel in an image, ranging from an individual leaf to a 500-acre wheat field, depending on the RS imagery resolution.

$$\text{NDVI} = (\text{NIR} - \text{Red}) / (\text{NIR} + \text{Red})$$

NDVI values always fall between −1 and 1. Values between −1 and 0 indicate dead plants or inorganic objects (e.g., water surfaces, manmade structures such as houses, stones/rocks, roads, clouds, and snow). Bare soil usually falls within the 0.1–0.2 range, and plants always have positive values between 0.2 and 1 (1 being the healthiest plants). A healthy, dense vegetation canopy should be above 0.5, and sparse vegetation will most likely fall within 0.2 to 0.5. However, these are only rules of thumb, and the season, type of plant, and regional peculiarities should always be taken into account to interpret NDVI values meaningfully.
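A minimal NumPy sketch of the formula and the rule-of-thumb interpretation above; the reflectance values are invented for illustration:

```python
import numpy as np

def ndvi(nir, red, eps=1e-10):
    """Pixel-wise NDVI in [-1, 1]; eps guards against division by zero."""
    return (nir - red) / (nir + red + eps)

# Invented reflectance values: healthy vegetation reflects NIR strongly
# and absorbs red; water absorbs NIR.
nir = np.array([0.50, 0.30, 0.02])   # dense canopy, sparse cover, water
red = np.array([0.08, 0.15, 0.10])
v = ndvi(nir, red)

# Rule-of-thumb classes from the text: > 0.5 dense healthy canopy,
# 0.2-0.5 sparse vegetation, < 0 water/inorganic surfaces.
labels = ["dense" if x > 0.5 else "sparse" if x > 0.2 else
          "non-veg" if x < 0 else "bare" for x in v]
```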

**Normalized difference water index (NDWI) and modified NDWI (MNDWI) [139–141]**: The NDWI is an RS-based indicator sensitive to the change in the water content of leaves or water content in water bodies (detailed below). There are two versions of NDWI.

One was defined to monitor changes in water content of leaves, using near-infrared (NIR) and short-wave infrared (SWIR) wavelengths, proposed by Gao in 1996 [139] (to avoid confusion of the two versions of NDWI, this version is also called NDMI, see NDMI entry in this Appendix B).

$$\text{NDWI} = (\text{NIR} - \text{SWIR}) / (\text{NIR} + \text{SWIR})$$

The other version of NDWI, proposed by McFeeters in 1996, was defined to monitor changes related to the water content in water bodies, using the green and NIR wavelengths [140]; the calculation formula is given below. The NDWI used in the papers reviewed in this article is this water-body version. The modified normalized difference water index (MNDWI) was later proposed [141] for improved detection of open water, replacing the NIR spectral band with SWIR.

$$\text{NDWI} = (\text{Green} - \text{NIR}) / (\text{Green} + \text{NIR})$$
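A minimal NumPy sketch of the three indices may help keep the variants apart (function names are ours; each takes reflectance arrays for the named bands):

```python
import numpy as np

def ndwi_mcfeeters(green, nir):
    """McFeeters (1996) NDWI: water content in water bodies."""
    green, nir = np.asarray(green, float), np.asarray(nir, float)
    return (green - nir) / (green + nir)

def ndwi_gao(nir, swir):
    """Gao (1996) NDWI (a.k.a. NDMI): water content of leaves."""
    nir, swir = np.asarray(nir, float), np.asarray(swir, float)
    return (nir - swir) / (nir + swir)

def mndwi(green, swir):
    """MNDWI per [141]: NIR replaced by SWIR for better open-water detection."""
    green, swir = np.asarray(green, float), np.asarray(swir, float)
    return (green - swir) / (green + swir)
```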

**PyTorch [142]**: PyTorch is an open-source deep learning framework developed and maintained by Facebook Artificial Intelligence Research (FAIR). At its core, PyTorch is a mathematical library that performs efficient computation and automatic differentiation on graph-based models. While using the core library directly can be challenging, the modern PyTorch API provides classes and methods that allow you to easily develop a suite of deep learning models.

**Random forest (RF):** It is an ML (particularly, ensemble learning) algorithm that can be used for both continuous (regression) and categorical (classification) tasks [143]. RF is widely accepted as an efficient ensemble approach for land cover classification using RS data. It handles imbalanced data, missing values, and outliers well [144].

**Semantic segmentation:** In contrast to instance segmentation, semantic segmentation aims to predict categorical labels for each pixel for every known object within an image, without differentiating object instances [145]. Thus, its labels are class-aware.

**Support vector machine (SVM):** SVM is a (supervised) machine learning algorithm that provides solutions for both classification and regression problems. The support-vector clustering [146] algorithm applies the statistics of support vectors (developed in the support vector machine algorithm) to categorize unlabeled data and is one of the most widely used clustering algorithms in many applications.

**TensorFlow:** TensorFlow is an open-source deep learning framework developed and maintained by Google. Although using TensorFlow directly can be challenging, the modern tf.keras API brings the simplicity and ease of use of Keras to the TensorFlow project.

**Transfer learning (TL):** TL is a powerful technique that makes learning in (deep) ML transferable across domains. TL was initially proposed in [147] and has recently received considerable attention due to significant advances in DL [123,148–152]. Inspired by humans' capability to transfer knowledge across domains (e.g., the knowledge gained while learning the violin can help one learn the piano faster), TL aims to leverage knowledge learned in a related domain to achieve a desirable learning performance with a minimal number of labeled samples in a target domain [151]. The main idea behind TL is that it is more efficient to take a DL model trained on a massive image dataset (e.g., ImageNet [87]) in one domain and transfer its knowledge to a smaller dataset in another domain than to train a DL classifier from scratch [153], as there are universal, low-level features shared between images across different problems.

**U-Net:** CNNs gave decent results on easier image segmentation problems but made little progress on complex ones; this is where U-Net comes in. U-Net is an improved architecture first designed for biomedical image segmentation in [154]; it demonstrated such good results that it was subsequently used in many other fields. The U-Net architecture stems from the fully convolutional network (FCN) first proposed by Long and Shelhamer in [155], modified and extended to work with fewer training images and to yield more precise segmentations. The architecture resembles a "U", which justifies its name.

The U-Net architecture includes three sections: the contraction, the bottleneck, and the expansion section. The bottommost layer mediates between the contraction and expansion layers, and the number of expansion blocks is the same as the number of contraction blocks. Most importantly, U-Net uses a novel per-pixel loss weighting scheme that assigns a higher weight to pixels at the border of segmented objects. Specifically, a pixel-wise softmax applied to the resultant image is followed by a cross-entropy loss function, so each pixel is classified into one of the predefined classes; even in segmentation, every pixel has to lie in some category. The segmentation problem is thus converted into a multiclass classification, which performed very well compared to the traditional loss functions.
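The pixel-wise softmax followed by a weighted cross-entropy described above can be sketched in NumPy as follows; this is a simplified illustration, not the reference implementation of [154], and the border-emphasizing weight map is assumed to be precomputed:

```python
import numpy as np

def weighted_pixel_ce(logits, labels, weights):
    """Pixel-wise softmax + weighted cross-entropy, U-Net style.

    logits: (C, H, W) raw class scores; labels: (H, W) integer class ids;
    weights: (H, W) per-pixel weights (higher near object borders).
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=0, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    # Pick each pixel's predicted probability for its true class.
    h, w = labels.shape
    p_true = p[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float((-weights * np.log(p_true + 1e-12)).mean())

loss = weighted_pixel_ce(np.zeros((2, 4, 4)), np.zeros((4, 4), int),
                         np.ones((4, 4)))
```

With uniform logits over two classes, every pixel contributes −log(0.5) = log 2, which is a handy sanity check.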

#### **Appendix C. Common Evaluation Metrics in AI/ML/DL/CV Classification and Regression, and Segmentation Tasks**

Many evaluation criteria have been proposed and are frequently used to assess the performance of AI/ML/DL/CV models. No single evaluation metric can tell the full story of a trained model. To help select appropriate evaluation metrics for particular domain problems and tasks, in this appendix, we provide brief definitions of some commonly used evaluation metrics (ordered alphabetically; references: [123,129,130,156,157]) in AI/ML/DL/CV for the classification, regression, and segmentation tasks in our review (i.e., those listed in the field of "Evaluation metrics" in Tables 2 and 3). For readability, we group some related metrics together. In the following formulas, TP refers to true positive, FP to false positive, FN to false negative, and TN to true negative. TP samples are those that are in the positive category and are correctly predicted as positive. FPs are not annotated as the positive category but are incorrectly predicted as positive. TNs are correctly predicted as negative, while FNs are predicted as negative when they are actually labeled as positive.

**Accuracy, overall accuracy (OA), commission error (CE), omission error (OE), producer's accuracy (PA), user's accuracy (UA), and pixel accuracy (PixA)** [31,156,158–161]:

To better understand the metrics in this group, let us use the same confusion matrix shown below in Figure A2 to calculate the accuracy metrics in this group. A confusion matrix, also called an error matrix, is a table that allows us to visualize the performance of a classification algorithm by comparing the predicted value of the target variable with its actual value [162].


**Figure A2.** Example confusion matrix. The classified data indicate the ML/DL model predicted results and the reference data refer to the actual manually annotated data (image source: [161]).

(Average) Accuracy: Classification accuracy is the number of correct predictions made as a ratio of all predictions made. Accuracy with a binary classifier is measured as the following:

Accuracy (for binary classifier) = (*TP* + *TN*)/(*TP* + *TN* + *FP* + *FN*)

Note, however, that (average) accuracy for a multiclass classifier is calculated as the average of each accuracy per category (i.e., sum of accuracy for each category/number of categories) (see the definition and examples of binary classification and multiclass classification in Appendix A4 in [84]). For the example confusion matrix shown in Figure A2 (it is a multiclass classification problem), the (average) accuracy is calculated as follows:

(average) accuracy = (21/27 + 31/37 + 22/31)/3 = 77.5%

Accuracy is perhaps the most common evaluation metric for classification problems, and it is also the most misused. It is really only meaningful when there are an equal number of observations in each category and all predictions and prediction errors are equally important, which is often not the case. Accuracy alone cannot tell the full story of an ML/DL model, especially when the dataset suffers from a severe class imbalance problem (detailed in [86]); other metrics, such as the F-score, are needed to tell whether a trained model with very high accuracy is not simply overfitting.

OA: It essentially tells us what proportion of all samples was classified correctly. OA is usually expressed as a percent, with 100% accuracy being a perfect classification where all samples were classified correctly. OA is the easiest metric to calculate and understand but ultimately only provides very basic accuracy information. OA is formally defined as follows, where N is the total number of samples. The OA calculation from the example confusion matrix in Figure A2 is (21 + 31 + 22)/95 = 74/95 = 77.9%.

OA = *Number of correctly classified samples/N* = (*TP* + *TN*)/*N*
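Using the Figure A2 numbers, both accuracy variants can be reproduced in a few lines (the matrix layout, rows = classified results and columns = reference data, follows the figure; per the paper's example, the per-category accuracy divides each diagonal entry by its row total):

```python
import numpy as np

# Confusion matrix from Figure A2: rows = classified, columns = reference.
cm = np.array([[21, 6, 0],   # water
               [5, 31, 1],   # forest
               [7, 2, 22]])  # urban

oa = cm.trace() / cm.sum()                         # (21+31+22)/95 ≈ 77.9%
avg_acc = (cm.diagonal() / cm.sum(axis=1)).mean()  # (21/27+31/37+22/31)/3 ≈ 77.5%
print(f"OA = {oa:.1%}, average accuracy = {avg_acc:.1%}")
```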

OE [31]: Errors of omission refer to samples that were left out (or omitted, as the name implies) from the correct category in the classified results. An example of an OE is when pixels of a certain class (such as maple trees) are not classified as maple trees.

OE is sometimes also referred to as a type II error (false negative). An OE in one category will be counted as a CE in another category. OEs are calculated by reviewing the reference sites for incorrect classifications. In the example confusion matrix shown in Figure A2, this is carried out by going down the columns for each category and adding together the incorrect classifications and dividing them by the total number of samples for each category. A separate OE is generally calculated for each category, as this will allow us to evaluate the classification accuracy and error for each category. OE is the inverse of the PA (i.e., OE = 1 − PA).

OE example based on the confusion matrix shown in Figure A2:

Water: Incorrectly classified reference sites: 5 + 7 = 12. Total # of reference sites = 33.

$$\text{OE} = 12/33 = 36\%$$

Forest: Incorrectly classified reference sites: 6 + 2 = 8. Total # of reference sites = 39.

$$\text{OE} = 8 / 39 = 20\%$$

Urban: Incorrectly classified reference sites: 0 + 1 = 1. Total # of reference sites = 23.

$$\text{OE} = 1/23 = 4\%$$

CE [31]: Errors of commission relate to the classified results. An example of a CE is when a pixel is predicted to contain a feature (such as trees) that is, in reality, absent (no trees are actually present). CE is sometimes also referred to as a type I error (false positive). CEs are calculated by reviewing the classified sites for incorrect classifications. This is performed by going across the rows for each class, adding together the incorrect classifications, and dividing them by the total number of classified sites for each class. CE is the inverse of the UA (i.e., CE = 1 − UA). This makes sense and is easy to interpret: when the predicted results are very reliable (high UA score), the commission error is low.

CE example based on the confusion matrix shown in Figure A2:

Water: Incorrectly classified sites: 6 + 0 = 6. Total # of classified sites = 27.

$$\text{CE} = 6/27 = 22\%$$

Forest: Incorrectly classified sites: 5 + 1 = 6. Total # of classified sites = 37.

$$\text{CE} = 6/37 = 16\%$$

Urban: Incorrectly classified sites: 7 + 2 = 9. Total # of classified sites = 31.

$$\text{CE} = 9/31 = 29\%$$

PA: Similar to UA, PA is a category-level accuracy. PA is the accuracy from the point of view of the "producer". PA tells us how often real features on the ground are correctly shown in the classified results, or the probability that a certain ground truth category is classified as such. PA is formally defined as follows and is the complement of the omission error (i.e., PA = 100% − OE).

*PA = Number of correctly classified reference samples for a particular category/Number of samples from reference (i.e., annotated) data for that category = 1* − *omission error*

PA example based on the example confusion matrix in Figure A2:

PA for water category = Correctly classified reference sites for water category/Total # of reference sites for water category = 21/33 = 64%.

PA for forest category = Correctly classified reference sites for forest category/Total # of reference sites for forest category = 31/39 = 80%.

PA for urban category = Correctly classified reference sites for urban category/Total # of reference sites for urban category = 22/23 = 96%.

UA: Similar to PA, UA is category-level-based accuracy. UA is the accuracy from the point of view of a "user", not the "producer". UA essentially tells us how often the classified category will actually align with the ground truth. This is referred to as reliability (memory tip: users often care about reliability). The UA is a complement of the commission error (i.e., UA = 100% − Commission Error). UA is defined as the following:

UA = *Number of correctly classified samples for a particular category/Number of samples classified (i.e., predicted) to that category* = 1 − *commission error.*

UA example based on the example confusion matrix in Figure A2:

UA for *water* category = *Correctly classified sites for water category/Total # of classified sites for water category* = 21/27 = 78%.

UA for *forest* category = *Correctly classified sites for forest category/Total # of classified sites for forest category* = 31/37 = 84%.

UA for *urban* category = *Correctly classified sites for urban category/Total # of classified sites for urban category* = 22/31 = 70%.
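All four category-level measures follow from the same Figure A2 matrix; a short NumPy sketch (variable names are ours):

```python
import numpy as np

# Figure A2 confusion matrix: rows = classified results, columns = reference data.
cm = np.array([[21, 6, 0],
               [5, 31, 1],
               [7, 2, 22]])
classes = ["water", "forest", "urban"]

ua = cm.diagonal() / cm.sum(axis=1)  # user's accuracy, per classified row
pa = cm.diagonal() / cm.sum(axis=0)  # producer's accuracy, per reference column
ce = 1 - ua                          # commission error
oe = 1 - pa                          # omission error

for name, u, p in zip(classes, ua, pa):
    print(f"{name}: UA = {u:.0%}, PA = {p:.0%}")
```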

PixA [158]: Pixel accuracy is perhaps the easiest metric to understand conceptually: it is the percent of pixels in the image that are classified correctly, simply computing a ratio between the number of properly classified pixels and the total number of pixels. See the PixA calculation formula below, where N represents the total number of pixels in the assessment image, which equals *TP + TN + FP + FN*. *TP* denotes the number of target pixels that were correctly detected, *FN* denotes the number of target (e.g., water body) pixels not classified as such, *FP* is the number of nontarget pixels classified as target, and *TN* is the number of nontarget pixels classified as nontarget. This metric can sometimes provide misleading results when the category representation is small within the image, as the measure will be biased toward reporting how well the classifier identifies the negative category (i.e., where the category we care about, such as the water body category, is not present).

$$\text{PixA} = (TP + TN) / N$$

**Edge overall accuracy (EOA), edge commission error (ECE), and edge omission error (EOE)** [33]: The authors in [33] defined a few evaluation metrics for water edge pixel extraction accuracy. These metrics are computed as follows.


Let the total number of pixels in the buffer area be M, the number of correctly classified pixels M<sub>R</sub>, the number of missing pixels M<sub>O</sub>, and the number of false alarm pixels M<sub>C</sub>. EOA, EOE, and ECE are defined as below:

$$\text{EOA} = \text{M}\_{\text{R}}/\text{M} \times 100\%$$

$$\text{EOE} = \text{M}\_{\text{O}}/\text{M} \times 100\%$$

$$\text{ECE} = \text{M}\_{\text{C}}/\text{M} \times 100\%$$

**Intersection over union (IoU), mean intersection over union (mIoU), and frequency weighted intersection over union (FWIoU):**

In the formal definitions below, TP, TN, FP, and FN are the number of true positive, true negative, false positive, and false negative samples, respectively.

IoU [163,164]: It is the most popular and simple evaluation metric for object detection and image segmentation used to measure the overlap between any two shapes such as two bounding boxes or masks (e.g., ground-truth and predicted bounding boxes). Values of IoU lie between 0 and 1, where 0 means two boxes do not intersect and 1 indicates two boxes completely overlap. If the prediction is completely correct, IoU = 1. The lower the IoU, the worse the prediction.
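For axis-aligned bounding boxes, IoU reduces to a few max/min operations; a minimal sketch assuming (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: overlap 1, union 4 + 4 - 1
```

Identical boxes give 1.0, disjoint boxes give 0.0, matching the value range described above.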

mIoU [43]: It is a common evaluation metric for semantic image segmentation, which first computes the IoU for each semantic class and then averages over the classes. The per-class IoU is given below; mIoU is the mean of these values across all classes.

$$\text{IoU}\_i = TP\_i / (TP\_i + FP\_i + FN\_i)$$

FWIoU [46,158]: It is an improvement over mIoU. As its name implies, it weights each class by its appearance frequency. The formal definition of FWIoU is given below, where the sum runs over all categories; the first factor is the frequency of category *i* in the reference data, and the second factor is its IoU.

$$\text{FWIoU} = \sum\_{i} \frac{TP\_i + FN\_i}{TP\_i + FP\_i + TN\_i + FN\_i} \times \frac{TP\_i}{TP\_i + FP\_i + FN\_i}$$
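Both segmentation scores can be computed directly from a confusion matrix; a sketch under the row = predicted, column = reference convention used in Figure A2 (the function name is ours):

```python
import numpy as np

def miou_fwiou(cm):
    """Per-class IoU, mIoU, and FWIoU from a confusion matrix
    (rows = predicted, columns = reference)."""
    tp = cm.diagonal().astype(float)
    fp = cm.sum(axis=1) - tp     # predicted as class i but actually another class
    fn = cm.sum(axis=0) - tp     # class i pixels missed by the classifier
    iou = tp / (tp + fp + fn)
    freq = (tp + fn) / cm.sum()  # reference frequency of each class
    return iou, iou.mean(), (freq * iou).sum()

cm = np.array([[21, 6, 0], [5, 31, 1], [7, 2, 22]])
iou, miou, fwiou = miou_fwiou(cm)
```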

#### **Kappa statistic** [156,159,165–172]:

Kappa (aka Cohen's kappa) statistic, a statistic that is frequently used to measure inter-annotator reliability (i.e., agreement) and also intra-annotator reliability for qualitative (i.e., categorical) items, is a very useful but underutilized metric. Rater reliability matters because it represents the extent to which the data collected in a study are correct representations of the measured variables. Note that this measure compares labeling by different human annotators, not a classifier versus a ground truth.

Cohen's kappa statistic is a very good measure that can handle both multiclass and imbalanced class problems very well. In ML, for a multiclass classification problem (see Appendix A4.2 in [84] for the definition and other types of classification tasks), measures such as accuracy, precision, or recall do not provide the complete picture of the performance of a classifier. In addition, for imbalanced class problems (see section II.D Imbalanced data in [86] for details about data imbalance), measures such as accuracy are misleading, so measures such as precision and recall are used. There are different ways to combine the two, such as the F-measure, but the F-measure does not have a very good intuitive explanation, other than it being the harmonic mean of precision and recall.

The kappa statistic can be calculated by the following formula, where Pr(a) represents the actual observed agreement and Pr(e) represents the expected (i.e., estimated) chance agreement. Thus, Pr(a) = OA.

$$\text{Kappa Statistic} = (\Pr(\text{a}) - \Pr(\text{e}))/(1 - \Pr(\text{e})) $$

Note that the sample size consists of the number of observations made across which raters are compared. Cohen specifically discussed two raters in his papers. The kappa is based on the chi-square table, and the Pr(e) is obtained through the following formula [166], where: cm1, cm2, rm1, rm2 represent column 1 marginal, column 2 marginal, row 1 marginal, row 2 marginal, respectively, and *n* represents the number of observations (not the number of raters).

$$\text{Expected (Chance)}\,\text{Agreement} = \frac{\left(\frac{\text{cm1} \times \text{rm1}}{n}\right) + \left(\frac{\text{cm2} \times \text{rm2}}{n}\right)}{n}$$

Similar to most correlation statistics, the kappa score can range from −1 to +1. Scores above 0.8 are generally considered good agreement; zero or lower means no agreement (practically random labels). According to the scheme of [165], a value of <0 indicates no agreement, 0–0.20 is slight, 0.21–0.40 is fair, 0.41–0.60 is moderate, 0.61–0.80 is substantial, and 0.81–1 is almost perfect agreement.
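Although the text emphasizes rater agreement, the same formula applies to any square agreement matrix; as an illustration, the two-category marginal products (cm1 × rm1, cm2 × rm2) generalize to any number of categories, here applied to the Figure A2 matrix:

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa from a square agreement/confusion matrix.

    Pr(a) is the observed agreement (the diagonal fraction, i.e., OA);
    Pr(e) sums the products of matching row and column marginals over n^2.
    """
    n = cm.sum()
    p_a = cm.trace() / n
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    return (p_a - p_e) / (1 - p_e)

cm = np.array([[21, 6, 0], [5, 31, 1], [7, 2, 22]])
kappa = cohens_kappa(cm)  # ≈ 0.67 -> "substantial" on the scheme of [165]
```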

Kappa is one of the most commonly used statistics to test interrater reliability, but it has limitations. Judgments about what level of kappa is acceptable for health research have been questioned: Cohen's suggested interpretation may be too lenient for health-related studies because it implies that a score as low as 0.41 might be acceptable [166]. Additional measures have been proposed that build on the kappa framework.

For example, in [159], the authors advocate against the use of kappa and propose the alternative measures of quantity and allocation disagreement. Quantity disagreement (QD) is the disagreement between the classification and reference data resulting from a difference in the proportions of categories. Allocation disagreement (AD) assesses a difference in the spatial location of categories. The two measures (i.e., QD and AD) sum to the overall error (i.e., 1 − OA).

**Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE)** [123, 129,130,173]:

MAE: also called mean absolute deviation, MAE finds the average of the absolute differences between actual and predicted values. It gives an idea of how wrong the predictions were: MAE conveys the magnitude of the error but not its direction (e.g., over- or underpredicting). MAE is defined below [174], where *yi* is the actual true value and *yˆi* is the predicted value. The MAE value lies between 0 and ∞; a small value indicates a better model, and a value of 0 indicates no error, or perfect predictions.

$$\text{MAE} = \frac{1}{N} \sum\_{i=1}^{N} |y\_i - \hat{y}\_i|$$

MAE is more robust to outliers than MSE, and it treats larger and smaller errors equally. The main reason is that in MSE, through squaring the errors, the outliers, which usually have higher errors than other samples, obtain more attention and dominance in the final error and thus impact the model parameters more. In addition, there is an intuitive maximum likelihood estimation (MLE) interpretation behind the MSE and MAE metrics: if we assume a linear dependence between features and targets, then minimizing MSE and MAE corresponds to MLE of the model parameters under Gaussian and Laplace distributions on the model errors, respectively.

MAPE [175]: MAPE, also known as mean absolute percentage deviation (MAPD), is the mean of the absolute percentage errors of forecasts. Error is defined as the actual (i.e., observed) value minus the forecasted value. Percentage errors are summed without regard to sign to compute MAPE. It is the most common measure of forecast error and works best if there are no extremes in the data (and no zeros). Because absolute percentage errors are used, it avoids the problem of positive and negative errors canceling. The formula is given below, where *M* is the mean absolute percentage error, *n* is the number of observations, *At* is the actual value, and *Ft* is the forecast value. The smaller the MAPE, the better the forecast.

$$M = \frac{1}{n} \sum\_{t=1}^{n} \left| \frac{A\_t - F\_t}{A\_t} \right|.$$
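Both error measures are one-liners over arrays of actual and forecast values; a small sketch (function names are ours; MAPE assumes no zero actual values):

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error (mean absolute deviation)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs(actual - predicted).mean()

def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be nonzero."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs((actual - predicted) / actual).mean()

y, y_hat = [100.0, 200.0], [110.0, 180.0]
print(mae(y, y_hat))   # 15.0
print(mape(y, y_hat))  # 0.1, i.e., 10%
```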

#### **Precision, recall, sensitivity, specificity, and F-score** [156]:

Each measure in this group is a set-based measure [176]. The values of those measures are all from 0 to 1, with the best value at 1 and the worst score at 0.

Precision: The precision is mathematically defined by the following formula. Precision attempts to answer the question: What proportion of positive identifications was actually correct? Precision refers to the proportion of samples predicted to be positive that are correctly classified and is equivalent to the user's accuracy (UA) for the positive category, which is also equivalent to 1 − commission error.

$$\text{Precision} = TP/(TP + FP)$$

Recall (also called sensitivity or true positive rate): it refers to the proportion of the reference data for the positive category that is correctly classified and is equivalent to the producer's accuracy for the positive category (also equivalent to 1 − omission error). It is calculated by the following formula and attempts to answer the question: What proportion of actual positives was identified correctly?

$$\text{Recall} = TP / (TP + FN)$$

Specificity (also called true negative rate): it refers to the proportion of negative samples that is correctly predicted and is equivalent to the producer's accuracy (PA) for the negative category [177].

$$\text{Specificity} = TN/(TN + FP)$$

F-score (also called F1-score or F measure): Depending on the application domain, we may need to give a higher priority to recall or precision, but there are many applications where both are important. Thus, it is natural to combine these two metrics into a single one. One popular metric that combines precision and recall is the F1-score, which can be interpreted as a weighted harmonic mean of the precision and recall and is formally defined below. There is always a trade-off between the precision and recall of a model; pushing precision higher typically causes a drop in recall, and vice versa.

$$\text{F1-score} = \text{(2} \times \text{Precision} \times \text{Recall)} / \text{(Precision} + \text{Recall)}$$

The generalized version of the F-score is defined as follows. The F1-score is a special case of Fβ with β = 1.

$$F\_{\beta} = \left(1 + \beta^2\right) \times \frac{precision \times recall}{\beta^2 \times precision + recall}$$
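The whole group can be sketched from raw TP/FP/FN counts (the example counts below are hypothetical):

```python
def precision(tp, fp):
    """Proportion of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of actual positives that are correctly identified."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Generalized F-score; beta = 1 recovers the usual F1-score."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision(8, 2), recall(8, 4)   # 0.8 and 2/3
f1 = f_beta(p, r)                      # 2pr/(p + r) = 8/11 ≈ 0.727
```

Choosing β > 1 weights recall more heavily, and β < 1 weights precision more heavily, which is why Fβ is preferred when the two errors have unequal costs.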

**R2, mean squared error (MSE), root mean squared error (RMSE), and root mean squared logarithmic error (RMSLE)** [123,129,130,173]:

R<sup>2</sup> is based on correlation between actual and predicted value; MAE is based on absolute value of error; MSE and RMSE are both based on square of error.

**R2:** R-squared, also known as the coefficient of determination, is a value between 0 and 1 that measures how well a regression line fits the data (i.e., an indication of the goodness of fit of a set of predictions to the actual values in a regression model). The value range of R2 lies between 0 and 1 for no fit and perfect fit, respectively. Note that, being built from squared errors, R2 is affected by outliers.

The R-squared formula compares our fitted regression line to a baseline model. This baseline model is considered the "worst" model. The baseline model is a flat line that predicts that every value of y will be the mean value of y. R-squared checks to see if our fitted regression line will predict y better than the mean.

$$\mathcal{R}^2 = 1 - \frac{SS\_{RES}}{SS\_{TOT}} = 1 - \frac{\sum\_{i} \left(y\_i - \hat{y}\_i\right)^2}{\sum\_{i} \left(y\_i - \overline{y}\right)^2}$$

*SSRES* refers to the residual sum of squared errors of the regression model; *yi* is the actual value, and *yˆi* is the predicted value through the regression model. For example, if the actual y value was 58 but we had predicted it would be 47 then the residual squared error would be 121 and we would add that to the rest of the residual squared errors for the model.

*SSTOT* is the total sum of squared errors. This compares the actual y values to the baseline model (i.e., the mean). We square the difference between all the actual y values and the mean *y* and add them together.

MSE: MSE is perhaps the most popular metric used for regression problems. It essentially finds the mean (i.e., average) of the square of the difference (i.e., squared error) between actual and estimated values. Similar to MAE, MSE provides a gross idea of the magnitude of error. Let us assume we have a regression model that predicts the price of houses in the Boston area and let us say for each house we also have the actual price the house was sold for. The MSE can be calculated as the following, where *N* is the number of samples, *yi* is the actual house price, and *yˆi* is the predicted value through the regression model. MSE value lies between 0 to ∞. Small value indicates a better model. Sensitive to outliers, it punishes larger errors more. MSE incorporates both the variance and the bias of the predicting model.

$$\text{MSE} = \frac{1}{N} \sum\_{i=1}^{N} (y\_i - \hat{y}\_i)^2$$

MSE measures how far the data are from the model's predicted values, whereas R2 measures how far the data are from the model's predicted values compared to how far the data are from the mean. The difference between how far the data are from the model's predicted values and how far the data are from the mean is the improvement in prediction from the regression model.

RMSE: very straightforward, RMSE is the square root of MSE. RMSE is sometimes preferred because it is on the same scale as the target values. Taking the house pricing prediction example, RMSE essentially shows the average deviation of the model's predicted house prices from the target values (the prices the houses were sold for). Similar to MSE, the RMSE value lies between 0 and ∞, with a small value indicating a better model; it is likewise sensitive to outliers and punishes larger errors more. The value of RMSE is always greater than or equal to MAE (RMSE >= MAE); a greater difference between them indicates greater variance in the individual errors of the sample.

RMSLE: both RMSE and RMSLE are techniques to measure the difference between the actual values and the values predicted by an ML/DL model. RMSLE is the root mean squared error of the log-transformed predicted and log-transformed actual values. RMSLE is formally defined as follows, where *xi* denotes the predicted value, *yi* the actual value, and n the number of samples. Note that RMSLE adds 1 to both actual and predicted values before taking the natural logarithm to avoid taking the natural log of possible 0 (zero) values.

$$\text{RMSLE} = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( \log(x\_i + 1) - \log(y\_i + 1) \right)^2}$$

RMSLE is very robust to outliers. Comparing the formulas of RMSE and RMSLE, the only difference is the log function; basically, what changes is the variance measured. This small difference makes RMSLE much more robust to outliers than RMSE. In RMSE, outliers can explode the error term to a very high value, but in RMSLE, the outliers are drastically scaled down, nullifying their effect.

RMSLE is often used when we do not want to penalize huge differences between the predicted and the actual values when both are huge numbers. (1) If both predicted and actual values are small, RMSE and RMSLE are similar. (2) If either the predicted or the actual value is big, RMSE > RMSLE. (3) If both predicted and actual values are big, RMSE > RMSLE (RMSLE becomes almost negligible).
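The four regression metrics in this group can be sketched together (function names are ours; `np.log1p(x)` computes log(x + 1), matching the +1 guard in the RMSLE formula):

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return ((y - y_hat) ** 2).mean()

def rmse(y, y_hat):
    """Square root of MSE; same scale as the target values."""
    return np.sqrt(mse(y, y_hat))

def r2(y, y_hat):
    """1 - SS_res / SS_tot, comparing the fit against the mean baseline."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

def rmsle(y, y_hat):
    """RMSE of the log-transformed (log(x + 1)) predicted and actual values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(((np.log1p(y_hat) - np.log1p(y)) ** 2).mean())

y, y_hat = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
# Only the last prediction is off by 1: ss_res = 1, ss_tot = 2, so R^2 = 0.5.
```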

#### **References**

