## *4.2. Metrics*

To evaluate our NILM approach, we recall specific metrics that capture the performance of the disaggregation algorithm. Following previous studies [15,17,18], we use the Mean Absolute Error (MAE) and the Signal Aggregate Error (SAE). Let $y_i(t)$ and $\hat{y}_i(t)$ be the true power and the estimated power at time $t$ for appliance $i$, respectively. The MAE for appliance $i$ is defined as

$$\text{MAE}\_{i} = \frac{1}{T} \sum\_{t=1}^{T} |y\_{i}(t) - \hat{y}\_{i}(t)|. \tag{8}$$

Given a predicted output sequence of length $T$, the SAE for appliance $i$ is defined as

$$\text{SAE}\_{i} = \frac{1}{N} \sum\_{\tau=1}^{N} \frac{1}{K} |r\_{i}(\tau) - \hat{r}\_{i}(\tau)|, \tag{9}$$

where $N$ is the number of disjoint time periods of length $K$, with $T = K \cdot N$, and $r_i(\tau)$ and $\hat{r}_i(\tau)$ denote the sum of the true power and the sum of the predicted power in the $\tau$-th time period, respectively. In our experiments, we set $K = 1200$, which corresponds to a time period of approximately one hour for the REDD dataset and two hours for the UK-DALE dataset. For both metrics, lower values indicate better disaggregation performance.
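As a concrete illustration, the following sketch computes both metrics from Equations (8) and (9); it assumes the true and predicted power traces of one appliance are available as NumPy arrays, and all function and variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, Equation (8)."""
    return np.mean(np.abs(y_true - y_pred))

def sae(y_true, y_pred, k=1200):
    """Signal Aggregate Error, Equation (9), with disjoint periods of length k."""
    n = len(y_true) // k                               # number of periods N (T = K * N)
    r_true = y_true[:n * k].reshape(n, k).sum(axis=1)  # r_i(tau): true power per period
    r_pred = y_pred[:n * k].reshape(n, k).sum(axis=1)  # r_hat_i(tau): predicted power per period
    return np.mean(np.abs(r_true - r_pred)) / k        # (1/N) * sum_tau (1/K) |r - r_hat|
```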

To measure how accurately the on/off state of each appliance is detected, we use classification metrics such as the F1-score, that is, the harmonic mean of precision (*P*) and recall (*R*):

$$F\_1 = \frac{2P \cdot R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} \tag{10}$$

where *TP*, *FP*, and *FN* stand for true positives, false positives, and false negatives, respectively. An appliance is considered "on" when its active power is greater than a given threshold and "off" when it is less than or equal to that threshold. The threshold is assumed to be the same value used for extracting the activations [17,18]. In our experiments, we use a threshold of 15 W for labeling the disaggregated loads. Precision, recall, and F1-score return a value between 0 and 1, where a higher value corresponds to better classification performance.
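For completeness, the classification metrics of Equation (10) can be computed from the ground-truth and disaggregated power traces as sketched below; the 15 W threshold matches the one described above, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def classification_metrics(y_true, y_pred, threshold=15.0):
    """Precision, recall, and F1-score from Equation (10)."""
    on_true = y_true > threshold            # "on" when power exceeds the threshold
    on_pred = y_pred > threshold
    tp = np.sum(on_true & on_pred)          # true positives
    fp = np.sum(~on_true & on_pred)         # false positives
    fn = np.sum(on_true & ~on_pred)         # false negatives
    p = tp / (tp + fp) if tp + fp > 0 else 0.0   # precision
    r = tp / (tp + fn) if tp + fn > 0 else 0.0   # recall
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```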

## *4.3. Network Setup*

According to the Neural NILM approach, we train a network for each target appliance. A mini-batch of 32 examples is fed to each neural network, and mean and variance standardization is performed on the input sequences. For the target data, min-max normalization is used, where the minimum and maximum power consumption values of the related appliance are computed on the training set. The training phase is performed with a sliding-window technique over the aggregated signal, using overlapping windows of length *L* with a hop size of 1 sample. As stated in [13], the window size for the input and output pairs has to be large enough to capture an entire appliance activation, but not so large that it includes contributions from other appliances. In Table 1, we report the adopted window length *L* for each appliance, which depends on the dataset sampling rate. The state classification label is generated by using a power consumption of 15 W as threshold; a sketch of this preparation step is given below.
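The following is a minimal sketch of the input/target preparation just described, assuming the normalization statistics have been precomputed on the training set; all function and variable names are illustrative assumptions.

```python
import numpy as np

def make_windows(aggregate, appliance, L, agg_mean, agg_std, app_min, app_max,
                 threshold=15.0):
    """Build overlapping (input, target, state) windows with a hop size of 1."""
    x = (aggregate - agg_mean) / agg_std               # standardize the input
    y = (appliance - app_min) / (app_max - app_min)    # min-max normalize the target
    n = len(x) - L + 1                                 # number of hop-1 windows
    X = np.stack([x[i:i + L] for i in range(n)])       # (n, L) input windows
    Y = np.stack([y[i:i + L] for i in range(n)])       # matching target windows
    S = (np.stack([appliance[i:i + L] for i in range(n)])
         > threshold).astype(np.float32)               # on/off labels, 15 W threshold
    return X, Y, S
```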

Each network is trained with the Stochastic Gradient Descent (SGD) algorithm with Nesterov momentum [28] set to 0.9. The loss function used for the joint optimization of the two subnetworks is given by L = L*out* + L*cls*, where L*out* is the Mean Squared Error (MSE) between the overall output of the network and the ground truth of a single appliance, and L*cls* is the binary cross-entropy (BCE) that measures the classification error of the on/off state for the classification subnetwork; a minimal sketch of this joint loss is given below. The maximum number of epochs is set to 100; the initial learning rate is set to 0.01 and is reduced by a decay factor equal to $10^{-6}$ as the training progresses. Early stopping is employed as a form of regularization to avoid overfitting: it halts training when the error on the validation set starts to grow [29]. For the classification subnetwork, we adopt the hyperparameters from [17], since our focus is only on the effectiveness of the proposed components. The hyperparameter optimization of the regression subnetwork concerns the number of filters (*F*), the size of each kernel (*K*), and the number of neurons in the recurrent layer (*H*). Grid search is used, that is, an exhaustive search through a manually specified subset of points in the hyperparameter space of the neural network, with *F* = {16, 32, 64}, *K* = {4, 8, 16}, and *H* = {256, 512, 1024}. We evaluate each configuration of the hyperparameters on a held-out validation set and choose the architecture achieving the highest performance on it.

The disaggregation phase, also carried out with a sliding window over the aggregated signal with a hop size of 1 sample, generates overlapping windows of the disaggregated signal. Differently from [13], where the authors reconstruct the overlapped windows by aggregating their mean value, we adopt the strategy proposed in [16], in which the disaggregated signal is reconstructed by means of a median filter on the overlapping portions; a sketch of this reconstruction is also given below. The neural networks are implemented in Python with PyTorch, an open-source machine learning framework [30], and the experiments are conducted on a cluster of NVIDIA Tesla K80 GPUs. Training requires several hours for each architecture, depending on the network dimension and on the granularity of the dataset.
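As a concrete illustration, the following is a minimal PyTorch sketch of the joint loss L = L*out* + L*cls* described above. It assumes the classification subnetwork emits raw logits for the on/off state; tensor and function names are illustrative, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(power_pred, power_true, state_logits, state_true):
    """L = L_out + L_cls: MSE on the regression output plus BCE on the state."""
    l_out = F.mse_loss(power_pred, power_true)          # regression term (MSE)
    l_cls = F.binary_cross_entropy_with_logits(         # classification term (BCE)
        state_logits, state_true)
    return l_out + l_cls

# SGD with Nesterov momentum as described in the text; `model` is assumed
# to exist, and the 10^-6 learning-rate decay would be handled separately.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
#                             momentum=0.9, nesterov=True)
```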
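Similarly, the median-based reconstruction of the overlapping windows, following the strategy of [16], can be sketched as follows; the function name and the deliberately naive implementation are assumptions for illustration only.

```python
import numpy as np

def reconstruct_median(windows):
    """Rebuild a signal of length T = n + L - 1 from n hop-1 windows of
    length L, taking the median of all predictions covering each sample."""
    n, L = windows.shape
    T = n + L - 1
    covering = [[] for _ in range(T)]
    for i in range(n):
        for j in range(L):
            covering[i + j].append(windows[i, j])   # window i covers samples i..i+L-1
    return np.array([np.median(c) for c in covering])
```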

**Table 1.** Sequence length (*L*) for the LDwA architecture.

