### *2.2. Field Measurements*

The KF-based demand estimation method estimates real-time node group demands using field measurements (e.g., pipe flow rate and nodal pressure) received from a supervisory control and data acquisition system. The pipe flow rate and nodal pressure are the output variables, whereas the node group demands are the input variable of the WDS. Note that the nodal demands can easily be calculated from the node group demands. Therefore, the KF-based method reverse-engineers the input variable, which cannot be directly measured, using the output variables, which can be measured by meters installed throughout the system.

In this study, the field datasets of the pipe flow rate are synthetically generated to assess the accuracy of the KF-based demand estimation method for the candidate node groupings (Figure 2). Nodal pressure measurements are not used for the WDS demand estimation because low accuracy is observed when they are included [8].

The synthetic pipe flow rates are generated in five steps: (1) an identical demand pattern (Residential 1 in Figure 4) is assigned to each node in the system (Figure 5a); (2) random deviations $N(0, \sigma_q)$ (*i.e.*, white noise) are added to each demand to represent the randomness and heterogeneity of true nodal demands (Figure 5b); (3) true node group demands are calculated by summing the demands of the nodes in each group (Figure 5c); (4) the nodal demands generated in Step (2) are entered into an EPANET hydraulic model of the study network to generate true pipe flow rates; and (5) a second white noise term $N(0, \sigma_Q)$ is added to each pipe flow rate to introduce measurement errors (Figure 5d).

**Figure 5.** Synthetic field measurement generation steps: (**a**) an identical demand pattern (Residential 1) is assigned to each node (*i.e.*, time-varying demand factors are multiplied by the base demand); (**b**) random variability is added to the nodal demands; (**c**) the true node group demand (*i.e.*, the sum of the three nodes' demand generated in the previous step); and (**d**) the true pipe flow and the addition of measurement error.
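The five generation steps can be illustrated with a minimal NumPy sketch. Everything numeric here is an assumption made for illustration and is not taken from the study: the sinusoidal `pattern`, the three `base_demand` values, and the noise levels `sigma_q` and `sigma_Q` (treated as standard deviations). The function `hydraulic_model` is a hypothetical stand-in for the EPANET simulation of Step (4); it simply assumes that a single metered pipe carries the total demand of the three nodes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs (not from the study): 24 hourly demand factors,
# three nodes in one group, and illustrative noise levels.
pattern = 1.0 + 0.5 * np.sin(np.linspace(0.0, 2.0 * np.pi, 24, endpoint=False))
base_demand = np.array([10.0, 15.0, 20.0])  # base demand per node
sigma_q = 0.5   # assumed std. dev. of the nodal demand noise
sigma_Q = 0.2   # assumed std. dev. of the flow measurement noise

# Step 1: the same pattern multiplies every node's base demand -> shape (24, 3)
nodal_pattern = pattern[:, None] * base_demand[None, :]

# Step 2: white noise makes the nodal demands heterogeneous
nodal_true = nodal_pattern + rng.normal(0.0, sigma_q, size=nodal_pattern.shape)

# Step 3: the true group demand is the sum over the nodes in the group
group_true = nodal_true.sum(axis=1)


def hydraulic_model(demands):
    """Placeholder for EPANET: one pipe feeding all nodes carries their total demand."""
    return demands.sum(axis=1, keepdims=True)


# Step 4: true pipe flows from the (placeholder) hydraulic model
flow_true = hydraulic_model(nodal_true)

# Step 5: measurement error is added to each true pipe flow
flow_measured = flow_true + rng.normal(0.0, sigma_Q, size=flow_true.shape)
```

In a real application, Step (4) would be replaced by a run of the EPANET model of the study network, and the group membership would come from the node grouping under evaluation.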

While the assumption of the normal distribution for the nodal demand is common in WDS design [8,25,26], Surendran and Tota-Maharaj [27] have recently confirmed that normal and lognormal distributions are appropriate for WDS demand based on analyzing real daily water consumption data with a 15-min interval for four years in the U.K. Auto-correlated noise (e.g., a sports event lasting for 2 h) can be considered for Step (2) to simulate more realistic demand fluctuations [28,29]. Note that the effect of considering different noise types on the accuracy of the KF-based demand estimation method would be minimal because it estimates the state variable, such that the error covariance is minimized.
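As an illustration of what auto-correlated noise in Step (2) could look like, the sketch below generates a first-order autoregressive, AR(1), sequence. The AR(1) form, the function name `ar1_noise`, and the parameters `sigma` and `phi` are assumptions chosen for this example; the cited studies may use other noise models.

```python
import numpy as np


def ar1_noise(n_steps, sigma, phi, rng):
    """Zero-mean AR(1) noise with stationary std. dev. sigma and lag-1 correlation phi."""
    # Scale the innovations so the stationary variance equals sigma**2
    innovation_std = sigma * np.sqrt(1.0 - phi ** 2)
    noise = np.empty(n_steps)
    noise[0] = rng.normal(0.0, sigma)
    for k in range(1, n_steps):
        noise[k] = phi * noise[k - 1] + rng.normal(0.0, innovation_std)
    return noise


rng = np.random.default_rng(1)
correlated = ar1_noise(20000, sigma=0.5, phi=0.8, rng=rng)
lag1_corr = np.corrcoef(correlated[:-1], correlated[1:])[0, 1]
```

Setting `phi = 0` recovers the white noise used in this study, whereas values closer to 1 produce deviations that persist over several time steps.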

### *2.3. KF-Based Demand Estimation*

The KF-based demand estimation method links the field measurements with the hydraulic simulation model (EPANET) that solves the non-linear system equation comprising mass and energy conservations [2]. Before demand estimation, the nodes are grouped into a single demand (*i.e.*, nodal group demand) following the grouping from the optimization submodule (Figure 2) to overcome the limitation of low measurement redundancy. The field pipe flow data are measured in real time. First, aggregated demands are estimated using the final demand estimates of the previous time step. The estimated group demands are disaggregated to individual nodes, which are entered into the hydraulic simulation model to calculate the pipe flow rate estimates. The demand estimates are updated such that the error covariance is minimized using the gap between the field measurement and the estimate of the pipe flow rate.

The KF comprises the recursive implementation of forecast and update steps, such that the *a posteriori* error covariance is minimized. The forecast step estimates the state at the current time step using the state estimate from the previous time step. The update step refines the forecast based on measurements at the current time step for a more accurate state estimate in the current time step. When calculating the state estimate in the update stage, more weight is given to the state estimate with the smaller uncertainty between those from the forecast stage and those from the current measurement [30].

The KF has the advantage of embedding the system dynamics in the estimates, which enables it to consider system operational changes when estimating the demand for a set of nodes (node group demands) based on the measured pipe flows.

The two types of KF are linear (LKF) and non-linear (NKF). The NKF represents the full non-linear relationship between measurements and states, whereas the LKF employs the linearization of these functions. The NKF is used herein because of its higher accuracy for the state estimation of a WDS (*i.e.*, a non-linear system) than the LKF [8].

The state forecast $\mathbf{x}_k^-$ (*i.e.*, the *a priori* state estimate of the node group demands) is computed using the state equation as follows:

$$\mathbf{x}_{k}^{-} = \mathbf{A}_{k}\mathbf{x}_{k-1}^{+} + \mathbf{w}_{k}, \quad \mathbf{w}_{k} \sim N\left[0, \mathbf{Q}_{k}\right] \tag{1}$$

where $\mathbf{A}_k$ relates the state at the previous time step $k-1$ to the state at the current step $k$. This matrix is updated at each time step and calculated from the historical mean node group demand values; $\mathbf{w}_k$ is a random variable representing the process noise; and $\mathbf{Q}_k$ is the process noise covariance. Note that node grouping is provided from the optimization algorithm submodule (Figure 2) based on the methodology described in Section 2.1.

In the NKF, the non-linear system function $h$ in the measurement equation relates the *a priori* state estimate ($\mathbf{x}_k^-$) and exogenous variable ($\mathbf{u}_k$, operational information) to the measurements (pipe flows) as follows:

$$\mathbf{z}_k = h\left(\mathbf{x}_k^-, \mathbf{u}_k\right) + \mathbf{v}_k, \quad \mathbf{v}_k \sim N\left[0, \mathbf{R}_k\right] \tag{2}$$

where $\mathbf{z}_k$ denotes the measurement variables; $\mathbf{v}_k$ is a random variable representing the measurement noise; and $\mathbf{R}_k$ is the measurement noise covariance.

The updated state estimate (*i.e.*, the *a posteriori* state estimate of the demand) is given as:

$$\mathbf{x}_{k}^{+} = \mathbf{x}_{k}^{-} + \mathbf{K}_{k} \left( \mathbf{z}_{k} - h\left(\mathbf{x}_{k}^{-}, \mathbf{u}_{k}\right) \right) \tag{3}$$

where $\mathbf{K}_k$ is the Kalman gain matrix expressed as follows:

$$\mathbf{K}_{k} = \mathbf{P}_{k}^{-} \mathbf{H}_{uk}^{T} \left( \mathbf{H}_{uk} \mathbf{P}_{k}^{-} \mathbf{H}_{uk}^{T} + \mathbf{R}_{k} \right)^{-1} \tag{4}$$

where $\mathbf{H}_{uk}$ is the Jacobian matrix of the partial derivatives of $h$ with respect to $x$ and $u$, and a unique $\mathbf{H}_{uk}$ is computed for each system operational state ($\mathbf{u}_k$) around the *a priori* state estimate $\mathbf{x}_k^-$; and $\mathbf{P}_k^-$ is the *a priori* estimate error covariance calculated as follows:

$$\mathbf{P}_k^{-} = \mathbf{A}_k \mathbf{P}_{k-1}^{+} \mathbf{A}_k^{T} + \mathbf{Q}_k \tag{5}$$

Therefore, the *a posteriori* state estimate $\mathbf{x}_k^+$ is computed as a linear combination of the *a priori* estimate $\mathbf{x}_k^-$ and the weighted difference between the actual ($\mathbf{z}_k$) and predicted ($h(\mathbf{x}_k^-, \mathbf{u}_k)$) measurements. For example, a large measurement error covariance $\mathbf{R}_k$ results in a small update correction to the forecast state vector $\mathbf{x}_k^-$.

The *a posteriori* estimate error covariance is finally calculated as follows:

$$\mathbf{P}_k^+ = \left(\mathbf{I} - \mathbf{K}_k \mathbf{H}_{uk}\right) \mathbf{P}_k^- \tag{6}$$
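One forecast and update cycle built from Equations 1 and 3 to 6 can be sketched as follows. This is an illustrative toy, not the implementation used in the study: the function name `nkf_step`, the two-state measurement function `h`, its analytical Jacobian `jac`, and all numerical values are assumptions. The operational input is taken as fixed and folded into `h`, which stands in for the EPANET simulation, and the zero-mean process noise is omitted from the forecast of the mean.

```python
import numpy as np


def nkf_step(x_prev, P_prev, z, A, Q, R, h, jac):
    """One forecast/update cycle following Equations 1 and 3-6."""
    # Forecast: Eq. 1 (zero-mean process noise omitted) and Eq. 5
    x_prior = A @ x_prev
    P_prior = A @ P_prev @ A.T + Q
    # Linearize the measurement function around the a priori estimate
    H = jac(x_prior)
    # Kalman gain: Eq. 4
    K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + R)
    # Update: Eq. 3 and Eq. 6
    x_post = x_prior + K @ (z - h(x_prior))
    P_post = (np.eye(x_prior.size) - K @ H) @ P_prior
    return x_post, P_post


def h(x):
    """Hypothetical non-linear map from two group demands to two pipe flows."""
    return np.array([x[0] + x[1], 0.1 * x[1] ** 2])


def jac(x):
    """Analytical Jacobian of the toy h with respect to the state."""
    return np.array([[1.0, 1.0], [0.0, 0.2 * x[1]]])


x_prev = np.array([10.0, 20.0])   # previous a posteriori demand estimate
P_prev = np.eye(2)                # previous a posteriori error covariance
A = np.eye(2)                     # assumed persistence model for the demands
Q = 0.1 * np.eye(2)               # assumed process noise covariance
z = h(np.array([11.0, 21.0]))     # noise-free "measurement" from the true demands

x_small_R, P_small_R = nkf_step(x_prev, P_prev, z, A, Q, 0.01 * np.eye(2), h, jac)
x_large_R, P_large_R = nkf_step(x_prev, P_prev, z, A, Q, 100.0 * np.eye(2), h, jac)
```

Comparing `x_small_R` with `x_large_R` shows how the measurement noise covariance controls the size of the correction: the estimate obtained with the large covariance stays much closer to the forecast.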

The LKF uses the linear transition matrix $\mathbf{H}$ in Equation 2 ($\mathbf{z}_k = \mathbf{H}\mathbf{x}_k^- + \mathbf{v}_k$) and Equation 3 ($\mathbf{x}_k^+ = \mathbf{x}_k^- + \mathbf{K}_k(\mathbf{z}_k - \mathbf{H}\mathbf{x}_k^-)$). The LKF can be used for non-linear systems $h(x)$ with weak non-linearity, but may perform poorly as the non-linearities increase. Both the LKF and NKF use first-order approximations for the error covariance propagation ($\mathbf{H}_{uk}\mathbf{P}_k^-\mathbf{H}_{uk}^T$ in Equation 4 and $\mathbf{A}_k\mathbf{P}_{k-1}^+\mathbf{A}_k^T$ in Equation 5). A perturbation method, wherein the derivatives are approximated by the numerical forward finite differences, is used to calculate $\mathbf{H}_{uk}$.
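The perturbation method can be sketched as a forward finite-difference Jacobian. The function name `forward_difference_jacobian`, the relative step `rel_step`, and the toy measurement function `h_toy` are assumptions for illustration; in the study, each evaluation of the measurement function would correspond to a hydraulic simulation.

```python
import numpy as np


def forward_difference_jacobian(h, x, rel_step=1e-6):
    """Approximate dh/dx at x, perturbing one state variable at a time."""
    x = np.asarray(x, dtype=float)
    h0 = np.asarray(h(x), dtype=float)
    J = np.empty((h0.size, x.size))
    for j in range(x.size):
        # Scale the step with the magnitude of the perturbed variable
        step = rel_step * max(abs(x[j]), 1.0)
        x_pert = x.copy()
        x_pert[j] += step
        J[:, j] = (np.asarray(h(x_pert), dtype=float) - h0) / step
    return J


def h_toy(x):
    """Hypothetical non-linear measurement function (stand-in for EPANET)."""
    return np.array([x[0] + x[1], 0.1 * x[1] ** 2])


H_numeric = forward_difference_jacobian(h_toy, [10.0, 20.0])
H_exact = np.array([[1.0, 1.0], [0.0, 4.0]])
```

Each column costs one additional evaluation of the measurement function, so the number of model runs per time step grows with the number of node groups.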

Note that a KF is classified as linear or non-linear based on the type of underlying state and measurement equation and not on the first-order approximation. The extended KF can only be applied to an NKF that estimates $\mathbf{w}_k$ and $\mathbf{v}_k$ in Equations 1 and 2, respectively, along with the state.

### *2.4. Demand Estimation Accuracy Indicator: RMSE*

The state estimate includes errors because of measurement and model errors and the state variable randomness. Measurement errors can originate from deterioration and imperfect calibration of meters and delays in data communication (e.g., missing data). Model errors originate from model parameter uncertainties and a lack of system knowledge (e.g., erroneous system structure). This study aims to find the optimal node grouping that minimizes the model errors when estimating the WDS node group demands. The RMSE is used as an indicator of the WDS demand estimation accuracy (Figure 2). The RMSE measures the difference between values predicted by a model or estimator (KF) and the true values and is calculated as follows:

$$RMSE = \sqrt{\frac{\sum_{k=1}^{nt} \left(E_k - O_k\right)^2}{nt}} \tag{7}$$

where $nt$ is the total number of time steps; $E_k$ is the estimated group demand at time step $k$ (obtained from the KF-based demand estimation method described in Section 2.3); and $O_k$ is the true group demand at the $k$-th time step (synthetic measurements are obtained by the methodology described in Section 2.2).
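Equation 7 translates directly into a few lines of NumPy. The function name `rmse` and the sample values are illustrative assumptions, not data from the study.

```python
import numpy as np


def rmse(estimated, true):
    """Root-mean-square error between estimated and true group demands (Eq. 7)."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((estimated - true) ** 2)))


# Three time steps with a single error of 2 at the last step: sqrt(4 / 3)
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the errors are squared before averaging, a few large deviations raise the RMSE more than many small ones.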
