**4. Dataset Spatial Scales**

To answer the first research question reported in Figure 1, we investigate here how the 92 reviewed datasets are distributed across spatial resolutions, along with the implications of this distribution for demand modeling and management.

As already reported in Figure 2, we identify only 20 datasets at the district scale. Water demand data collected at this scale relate to specific areas of a water distribution network. They are primarily used to monitor aggregate water demand patterns in the network, or to provide input information to simulation models of water distribution systems. Among these datasets, it is worth highlighting the presence of comprehensive, multi-network datasets, such as the WDSRD database for research applications [51]. This dataset includes data for over 40 different distribution networks, collected by the ASCE Task Committee on Research Databases for Water Distribution Systems so that the water distribution system community can develop and test new algorithms for network design, analysis, and operations. A typical problem that requires this type of data is optimal sensor placement in a partitioned water distribution network [52]. This problem, which consists of finding the sensor locations that minimize economic costs while maximizing the amount of information required for network operations and diagnosis, still represents an open challenge for utilities and researchers [53,54]. The datasets classified in the district spatial scale are generally gathered by water utilities for ad hoc analyses of specific case studies within the network facilities they control. As the data are owned by water utilities, they are generally not released to the public, but only shared with researchers under non-disclosure agreements. If demand data come from individual household-scale water meters, privacy-protection schemes, e.g., data anonymization, are usually required before the data are actually shared.

The majority of the reviewed datasets were collected at the household (31 datasets) or end use (41 datasets) scale. Datasets at such high spatial resolutions have been emerging in the literature over the last 20–30 years, driven by the increasing scientific interest in smart water metering technology. Smart meters can be defined as digital sensors able to measure, store, and transmit water use data at the household level with a sub-daily temporal sampling resolution, down to a few seconds [28,55]. Mining smart meter information with advanced data analytics is also opening new opportunities to develop automatic tools that estimate the water consumption of individual fixtures in a household [56,57], quantify the impact of individual and collective human behaviors on residential water consumption and water conservation [58], and acquire a better understanding of which socio-demographic determinants primarily drive residential water consumption in different geographical contexts [59,60]. Water data at the household/end use scale are of great interest for behavioral studies and provide key information for fostering water conservation, designing water tariffs, promoting more sustainable use of resources, characterizing water demand during peak hours, and improving demand forecasting and management capabilities [61]. These topics have already been extensively covered in the literature, and several comprehensive reviews have analyzed the usage and benefits of smart metering for data collection and detailed water demand modelling and management [8,21,62,63].

We report a detailed summary of the metadata of the datasets identified at the district, household, and end use scales in Tables 1 (district), 2 (household), and 3 (end use), sorted in chronological order. These metadata include the year when the dataset first appeared in the literature, its size (number of districts/households), time series length, time sampling resolution, access policy (classified as Open (O), Restricted (R), or Not Available (NA)), and the main goals and dataset applications in the related publications. When a dataset is found to be open access, we include the link to the repository where it was stored at the time of this review.
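As an illustration only, the metadata fields listed above could be held in a small record type such as the one sketched below. The class and field names (`DatasetRecord`, `Access`, `sampling_interval_s`, and so on), the units, and the three example records are assumptions made for this sketch; they are placeholders and do not reproduce any entry of Tables 1–3.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    """Access policy codes as used in the tables: O, R, NA."""
    OPEN = "O"
    RESTRICTED = "R"
    NOT_AVAILABLE = "NA"


@dataclass(frozen=True)
class DatasetRecord:
    """One table row; field names and units are hypothetical."""
    year: int                    # year the dataset first appeared in the literature
    scale: str                   # "district", "household", or "end use"
    size: int                    # number of districts or households
    series_length_days: float    # time series length
    sampling_interval_s: float   # time sampling resolution, in seconds
    access: Access
    goals: tuple[str, ...]
    repository_url: Optional[str] = None

    def __post_init__(self):
        # Open datasets are reported together with a repository link.
        if self.access is Access.OPEN and not self.repository_url:
            raise ValueError("open datasets should carry a repository link")


# Placeholder records, NOT taken from the reviewed tables.
records = [
    DatasetRecord(2019, "end use", 8, 14, 10, Access.NOT_AVAILABLE,
                  ("water end use disaggregation",)),
    DatasetRecord(2005, "district", 12, 365, 3600, Access.RESTRICTED,
                  ("leak detection",)),
    DatasetRecord(2016, "household", 1000, 730, 86400, Access.OPEN,
                  ("water demand forecasting",),
                  "https://example.org/placeholder"),
]

# Chronological ordering, as in the tables, and a tally by access policy.
chronological = sorted(records, key=lambda r: r.year)
access_counts = Counter(r.access for r in records)
```

Storing the sampling resolution as an interval in seconds is one possible convention; it makes datasets with monthly, daily, and sub-daily resolution directly comparable.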

Some common features and trends can be identified from the information reported in the three tables. First, there is an inverse correlation between the dataset size (or the time series length) and the time sampling resolution. Datasets comprising hundreds or thousands of homes (e.g., [48,49,64–66]) generally include data collected with a monthly or daily time sampling resolution, while datasets with a sub-daily time sampling resolution only include a few units or tens of homes (e.g., [67–69]). This may be attributed to the experimental scope of most high-resolution studies, their usually short duration, and the costs of deploying large-scale smart metering systems. Second, while datasets collected at the district scale have been primarily used for WDN optimization, WDN design, understanding the effects of socio-economic determinants on aggregate water demand, and leak detection, we identify four categories of state-of-the-art studies that have so far used the household-scale datasets listed in Table 2. These four categories, defined based on the scope of the listed studies, are: water demand forecasting, water demand pattern recognition, water conservation and customer awareness, and water end use disaggregation. The problem of water demand forecasting has been investigated for decades with different modelling techniques. Several recent applications exploit Artificial Neural Networks and other machine learning techniques to predict future water demands [44,66,70] and use this information to optimize water network operations or design water use efficiency programs [49,71–73]. Eight of the studies listed in Table 2 fall into this category. A second category of studies (e.g., [31,74–76]) exploited household-scale water demand data combined with pattern recognition techniques to inform effective water allocation and reduce water demand to enhance urban water service infrastructure. Nine other studies from Table 2 fall into this category. Third, 11 of the datasets in Table 2 were gathered as part of water conservation and customer awareness research efforts and projects, including [65,77–79]. These studies investigate the potential of smart meter technologies, often coupled with data analytics and digital platforms that communicate data to water consumers, to increase users' awareness of water consumption and sustainable water usage behaviors. Finally, 3 household-scale datasets were primarily used for water demand disaggregation, i.e., to estimate water use at the individual fixture level with a non-intrusive approach that couples the data from a single-point smart meter with a disaggregation algorithm and avoids installing several intrusive sensors to directly monitor the water consumption of each end use [64,80,81].
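The inverse relationship between dataset size and time sampling resolution noted above could be quantified with a rank correlation once the table metadata are in numeric form. The following is a minimal sketch, assuming a Spearman coefficient as the measure (the review does not name one) and using placeholder values that are not taken from Tables 1–3; the helper names `ranks`, `pearson`, and `spearman` are introduced here for illustration.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder metadata (hypothetical, for illustration only):
# number of households and readings per day for six imaginary datasets.
households = [5, 12, 40, 300, 1500, 10000]
samples_per_day = [17280, 8640, 1440, 24, 1, 1 / 30]

rho = spearman(households, samples_per_day)
# A negative rho indicates that larger datasets tend to be sampled less often.
```

A rank-based coefficient is a convenient choice here because both quantities span several orders of magnitude, from a handful of homes to thousands and from seconds to months.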

Water end use disaggregation can be identified as the link between WDDs at the household and the end use level. Since intrusive smart meter installations at the end use level are costly and unlikely to be accepted by water consumers, and are thus not viable for large-scale deployments, non-intrusive techniques represent a valid solution. Yet, non-intrusive end use disaggregation algorithms require ground truth data collected at the fixture level, at least for a limited time span, for algorithm training, validation, and performance assessment. For this reason, the majority of the reviewed WDDs classified in the end use spatial scale (see Table 3) have been used to develop and train different end use disaggregation algorithms, including machine learning-based algorithms (see, for instance, [67,68,82,83]). Unlike the WDDs at the household scale, end use datasets feature a short time series duration (a few days or weeks) and a high time sampling resolution, with data collected primarily at a sampling interval of 5–10 s. These datasets, mainly collected in the last 10 years, usually include samples collected in two heterogeneous periods (e.g., summer and winter) to account for the seasonal variations of some end uses, e.g., outdoor water demand for irrigation. Whereas developing and testing end use disaggregation methods remains the main purpose of collecting water demand data at the end use level, some of the WDDs listed in Table 3 have also been used to evaluate water consumer behaviors and attitudes toward individual residential water uses (e.g., [84,85]), or to test the effectiveness of water conservation strategies based on appliance retrofit and efficiency upgrades [86,87], customized tariffs [88], and awareness campaigns [89,90].







**Table 3.** Metadata of the 41 reviewed datasets at the end use scale. Different goals and applications are considered (see last column): WEUD = Water End Use




