*4.1. Datasets*

To evaluate our algorithm and perform a fair comparison with state-of-the-art methods, we choose two publicly available real-world datasets and adopt the same partition into training and test sets as the previous studies [15,17,18]. The Reference Energy Disaggregation Data Set (REDD) [26] contains data for six houses in the USA, sampled every 1 s for the aggregate power consumption and every 3 s for the appliance power consumption. Following the previous studies, we consider the three top-consuming appliances: dishwasher (DW), fridge (FR), and microwave (MW). We use the data of houses 2–6 to build the training set and house 1 as the test set. The preprocessed REDD dataset is provided by the authors of [17].

The second dataset, the Domestic Appliance-Level Electricity dataset UK-DALE [27], contains over two years of consumption profiles of five houses in the UK, at a 6 s sampling period. Here, the experiments are conducted on the five top-consuming appliances: dishwasher (DW), fridge (FR), kettle (KE), microwave (MW), and washing machine (WM). For evaluation, we use houses 1, 3, 4, and 5 for training and house 2 for testing, as in the previous works [15,17,18]. The UK-DALE dataset has been preprocessed by the authors of [13].

We stress that for both datasets we consider the *unseen* setting, in which we train and test on different households. Testing a model on a building not seen during training is the best way to assess its generalization capability, and it is a particularly desirable property for a NILM algorithm, since the unseen scenario is the most likely one in real-world deployments of a NILM service.
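The experimental setup above can be summarized as a small configuration sketch. The dictionary layout and the `is_unseen_split` helper are purely illustrative (not part of any published NILM codebase); the house numbers, appliance abbreviations, and sampling periods follow the text.

```python
# Hypothetical summary of the dataset partitions described in Section 4.1.
# Structure and helper names are illustrative assumptions.

REDD = {
    "appliances": ["DW", "FR", "MW"],          # 3 top-consuming appliances
    "sampling_period_s": {"mains": 1, "appliance": 3},
    "train_houses": [2, 3, 4, 5, 6],
    "test_houses": [1],
}

UK_DALE = {
    "appliances": ["DW", "FR", "KE", "MW", "WM"],  # 5 top-consuming appliances
    "sampling_period_s": {"mains": 6, "appliance": 6},
    "train_houses": [1, 3, 4, 5],
    "test_houses": [2],
}

def is_unseen_split(dataset):
    """True if no house appears in both the training and the test set,
    i.e. the partition realizes the 'unseen' evaluation setting."""
    return not set(dataset["train_houses"]) & set(dataset["test_houses"])

assert is_unseen_split(REDD) and is_unseen_split(UK_DALE)
```

Keeping the split disjoint at the house level is what makes the evaluation a test of cross-household generalization rather than of memorizing a particular building's load patterns.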
