*2.2. Rationale and Structure of the Analysis*

In the present paper, the hourly water consumption database collected within the pilot area is used to obtain a comprehensive sample of hourly peak factors. Such a database can potentially be used to extract information enabling the prediction of fundamental statistics of the hourly peak demand, such as central values, variability, and probability distribution. This research focuses on the first issue, namely the sample mean of peak factors and the related statistics.

As mentioned in the previous sections, peak factors in water networks can be strongly influenced by the number and behaviour of consumers. Any statistical analysis should account for the fact that the peak factor values and the related statistics may be affected by the number and nature of the aggregated time series. For instance, if the network serves a small number of users, those consumers are likely to exhibit similar behaviours, resulting in higher peak factor values. Conversely, if the network serves a large number of consumers, different behaviours are expected; this translates into a global water demand that is more homogeneously distributed within the day, with smaller peak factor values.
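The smoothing effect of aggregation can be illustrated with a minimal synthetic sketch (not the paper's data or method): independent, "peaky" hourly demands are summed over a growing number of users, and the mean daily peak factor, here taken as the maximum hourly demand over the mean hourly demand, decreases toward one. The exponential demand model and all sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def daily_peak_factor(hourly_demand):
    # Hourly peak factor of one day: max hourly demand over mean hourly demand.
    return hourly_demand.max() / hourly_demand.mean()

def synthetic_user_day(n_hours=24):
    # Illustrative assumption: one user's hourly demand is i.i.d. exponential,
    # so individual profiles are strongly "peaky".
    return rng.exponential(1.0, size=n_hours)

def mean_aggregated_peak_factor(n_users, n_replications=200):
    # Aggregate n_users independent profiles and average the daily peak
    # factor over several replications.
    pfs = [
        daily_peak_factor(sum(synthetic_user_day() for _ in range(n_users)))
        for _ in range(n_replications)
    ]
    return float(np.mean(pfs))

few = mean_aggregated_peak_factor(2)     # small network: peaky aggregate
many = mean_aggregated_peak_factor(200)  # large network: flatter aggregate
print(few > many)  # aggregation flattens demand, so this prints True
```

Under independence, the relative hour-to-hour fluctuation of the aggregate shrinks roughly as 1/√N, which is why the peak factor tends to one for large N; correlated (synchronous) users would slow this decay.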

The evaluation of peak factors in water networks usually consists of understanding how peak factors change under a progressive aggregation of users, namely in finding a mathematical or statistical dependence of peak factors on the number of users *Nu*. Synchronicity of consumption behaviours is usually accounted for through the cross-correlation among consumption time series. These considerations imply that, when *Nu* is small, a dependence can be found not only on *how many* but also on *which* time series are aggregated: in other words, results may vary considerably according to the specific selection of consumers. Conversely, when *Nu* is large, results are not expected to change significantly whichever time series are selected.

To overcome this issue and to investigate the statistical structure of peak factors in a reliable, rigorous, and robust way, the following sampling design is proposed. A discrete set of aggregation levels *N* (numbers of households) is fixed and, for each level, *N* time series (each corresponding to a water meter) of length *D* (the number of monitored days) are extracted from the consumption database of the pilot area and aggregated. For each *N*, the operation is repeated *M* times; the same water meters may be extracted in different samples, whereas, within each sample, extraction is performed without replacement. In this way, an artificial population is obtained for each aggregation level, together with *M* representative samples of size *D*. Finally, for each *N*, the analysis focuses on the following quantities, assumed to be the most important when using the concept of peak factors for the design or verification of water networks:
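The sampling design above can be sketched as follows. This is a minimal illustration, not the authors' code: the database shape, the exponential placeholder data, and all parameter values are assumptions; only the sampling logic (M samples of N meters, drawn without replacement within a sample but with possible recurrence across samples, yielding D daily peak factors each) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def peak_factor(hourly_day):
    # Hourly peak factor of one day: max hourly demand over mean hourly demand.
    return hourly_day.max() / hourly_day.mean()

def sample_peak_factors(db, N, M):
    """db: array of shape (n_meters, D, 24), hourly consumption per
    meter, day, and hour. For aggregation level N, draw M samples of
    N meters (without replacement within a sample; meters may recur
    across samples), aggregate them, and return the M x D peak factors."""
    n_meters, D, _ = db.shape
    out = np.empty((M, D))
    for m in range(M):
        idx = rng.choice(n_meters, size=N, replace=False)
        agg = db[idx].sum(axis=0)  # (D, 24) aggregated hourly demand
        out[m] = [peak_factor(day) for day in agg]
    return out

# Hypothetical database: 500 meters, D = 30 monitored days.
db = rng.exponential(1.0, size=(500, 30, 24))
pf = sample_peak_factors(db, N=10, M=100)
print(pf.shape)  # (100, 30): M samples, each a series of D peak factors
```

Each row of `pf` is one of the *M* representative samples of size *D* for the chosen aggregation level.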


To correctly address the above-mentioned items, standard sampling theory (e.g., [43]) cannot be adopted straightforwardly. The first reason is that each random sample is a time series made up of *D* independent realizations of the variable of interest (the hourly peak factor), but a non-negligible cross-correlation among the *M* samples has to be taken into account. In this respect, the literature provides suggestions on including cross-correlation in the analyses [44].
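One common way to account for cross-correlation among samples is to inflate the variance of the grand mean by the average pairwise correlation, Var = (s²/M)(1 + (M − 1)ρ̄), the standard adjustment for equicorrelated means. The sketch below is illustrative only (the source does not specify which correction of [44] is used), and the synthetic data and function name are assumptions.

```python
import numpy as np

def corrected_se_of_grand_mean(samples):
    """samples: (M, D) array of M cross-correlated samples of D peak
    factors. The naive SE treats the M sample means as independent;
    here the variance is inflated by the average pairwise correlation:
        Var = (s^2 / M) * (1 + (M - 1) * rho_bar)."""
    means = samples.mean(axis=1)
    M = len(means)
    s2 = means.var(ddof=1)
    corr = np.corrcoef(samples)                   # M x M correlation matrix
    rho_bar = (corr.sum() - M) / (M * (M - 1))    # mean off-diagonal entry
    var = (s2 / M) * (1.0 + (M - 1) * rho_bar)
    return float(np.sqrt(max(var, 0.0)))

# Hypothetical illustration: samples sharing a common component are
# positively cross-correlated, so the corrected SE exceeds the naive one.
rng = np.random.default_rng(1)
common = rng.normal(0.0, 1.0, size=50)
samples = common + rng.normal(0.0, 0.5, size=(20, 50))
naive = samples.mean(axis=1).std(ddof=1) / np.sqrt(20)
corrected = corrected_se_of_grand_mean(samples)
print(corrected > naive)  # positive rho_bar inflates the SE: True
```

Ignoring a positive ρ̄ would understate the uncertainty of the estimated mean peak factor, which is why the correction matters here.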

The second reason is that the effect of a finite population must be taken into account. In this respect, the literature suggests that classic sampling theory should be adopted only when the population fraction ψ (namely the ratio of the amount of extracted data to the maximum number of available data, i.e., the ratio of the sample size to the finite population size) is small [45]. Indeed, when ψ is large, sample sizes comparable with the population size produce unnaturally small variabilities, since different samples contain the same elements as ψ → 1, with a progressive degeneration of the variance [45]. In turn, this could require very large and expensive databases to investigate large aggregation levels. For large ψ values, in the case of sampling without replacement, suitable correction factors should be applied when estimating standard errors from the population variance, whereas the effect of a finite population on central values is usually considered negligible [45]. In particular, concerning the variance of sample means, a correction factor, usually referred to as the Finite Population Correction Factor (*FPCF*) [45,46] and a function of the population fraction, should be used when relating this quantity to the population variance. In the present research, the investigated population is characterized by two dimensions, namely the number of monitored days *D* and the number of aggregated households *N*. For the adopted sampling design, *D* is the sample size, directly affecting the computations, whereas the main scientific interest lies in understanding the effect of *N*, which acts as a hidden variable with no explicit mathematical role.
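A short numerical sketch of the correction factor may help fix ideas. The textbook form of the FPCF for the variance of the sample mean under sampling without replacement is (Npop − n)/(Npop − 1); the source does not spell out which variant it uses, so this is the standard definition, not necessarily the paper's exact formula.

```python
def fpcf(sample_size, population_size):
    """Finite Population Correction Factor (standard textbook form) for
    the variance of the sample mean under sampling without replacement:
        FPCF = (Npop - n) / (Npop - 1),
    so that Var(mean) = (sigma^2 / n) * FPCF. It tends to 1 when the
    population fraction psi = n / Npop is small, and to 0 as psi -> 1,
    when all samples contain the same elements and the variance degenerates."""
    n, Npop = sample_size, population_size
    return (Npop - n) / (Npop - 1)

# Small psi: the correction is negligible and classic theory applies.
print(round(fpcf(30, 10_000), 4))   # 0.9971
# psi -> 1: the variance of the sample mean degenerates to zero.
print(fpcf(10_000, 10_000))         # 0.0
```

This makes explicit why, without the correction, large population fractions would yield "unnaturally small" standard errors.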
