*4.4. Results*

We compare our approach with the HMM implemented in [31] and with recently proposed DNNs: DAE, Seq2Point, S2SwA, SGN, and SCANet. Tables 2 and 3 report the MAE, SAE, and F1-score for the REDD and UK-DALE datasets, respectively. The results show that our approach is by far the best on both datasets. Apart from ours, the two most competitive methods are SGN and SCANet, which share the backbone we drew inspiration from. Our network is better than both SGN and SCANet, implying that the differences introduced in our approach are significantly beneficial. In particular, our network outperforms SGN, showing that including our regression network significantly improves both the estimate of the power consumption and the load classification, and thus the overall disaggregation performance.

More in detail, for the REDD dataset the improvement in terms of MAE (SAE) with respect to SGN ranges from a minimum of 24.13% (23.44%) on the fridge to a maximum of 45.15% (54.4%) on the dishwasher, with an average improvement of 32.64% (39.33%). As for the F1-score, the classification performance increases from a minimum of 6.67% on the fridge to a maximum of 24.03% on the microwave, with an average increase of 15.45%. For the UK-DALE dataset, the improvement in terms of MAE (SAE) with respect to SGN ranges from a minimum of 18.62% (8.93%) on the fridge to a maximum of 39.78% (50.25%) on the dishwasher, with an average improvement of 27.84% (30.65%). The F1-score increases from a minimum of 2.49% on the kettle to a maximum of 10.82% on the washing machine, with an average increase of 6.79%.

Moreover, our method outperforms the more recent SCANet, achieving better disaggregation performance on all appliances for both datasets and both metrics. For the REDD dataset, the improvement in terms of MAE (SAE) with respect to SCANet ranges from a minimum of 9% (5.84%) on the fridge to a maximum of 18.76% (28.59%) on the microwave, with an average improvement of 13.21% (15.03%). The improvement of the F1-score ranges from a minimum of 3.64% on the fridge to a maximum of 11.58% on the microwave, with an average increase of 6.81%. Finally, on the UK-DALE dataset, the improvement in terms of MAE (SAE) with respect to SCANet ranges from a minimum of 7.33% (7.2%) on the kettle to a maximum of 24.57% (19.55%) on the dishwasher, with an average improvement of 15.69% (14%). The F1-score increases from a minimum of 0.92% on the kettle to a maximum of 8.85% on the washing machine, with an average improvement of 4.41%.
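For reference, the three metrics can be computed as in the following minimal sketch, which uses the definitions common in the NILM literature: mean absolute error per time step, relative error of the total consumed energy, and F1 on thresholded on/off states. The on/off threshold value is an illustrative assumption, not necessarily the paper's exact setting.

```python
def mae(pred, true):
    """Mean absolute error between predicted and ground-truth power (W)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def sae(pred, true):
    """Signal aggregate error: relative error of the total consumed energy."""
    return abs(sum(pred) - sum(true)) / sum(true)

def f1(pred, true, on_threshold=15.0):
    """F1-score on the on/off state obtained by thresholding the power.

    The 15 W threshold is a hypothetical choice for illustration.
    """
    p_on = [p > on_threshold for p in pred]
    t_on = [t > on_threshold for t in true]
    tp = sum(p and t for p, t in zip(p_on, t_on))
    fp = sum(p and not t for p, t in zip(p_on, t_on))
    fn = sum(not p and t for p, t in zip(p_on, t_on))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Note that MAE and F1 are per-time-step measures, while SAE only compares total energy, which is why a method can trade accuracy between them.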

In order to evaluate the computational burden of the proposed LDwA, Tables 4 and 5 report its training time alongside that of the most accurate DNNs. As expected, LDwA is less efficient than SGN, since LSTM layers have a larger number of trainable parameters than convolutional ones. However, the efficiency of our architecture with respect to the attention-based S2SwA is remarkable. This is explained by the tailored attention mechanism, which does not require additional recurrent layers in the decoder. There is also a large improvement in training time with respect to SCANet: we achieve better performance without the need to train a Generative Adversarial Network, which requires a significant amount of computational resources and has notorious convergence issues.
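The parameter gap between recurrent and convolutional layers can be made concrete with a back-of-the-envelope count. The layer sizes below are hypothetical, chosen only to illustrate the comparison, and the LSTM count assumes the standard single-bias formulation (some frameworks keep two bias vectors per gate, making the gap slightly larger).

```python
def lstm_params(input_size, hidden_size):
    """Trainable parameters of one LSTM layer: four gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def conv1d_params(in_channels, out_channels, kernel_size):
    """Trainable parameters of a 1-D convolution with bias."""
    return in_channels * out_channels * kernel_size + out_channels

# Illustrative sizes: the recurrent weights grow quadratically in the
# hidden size, so an LSTM layer quickly dominates a comparable conv layer.
print(lstm_params(64, 128))        # 98816
print(conv1d_params(64, 128, 8))   # 65664
```

The quadratic `hidden_size * hidden_size` recurrent term is what makes LSTM-based models like LDwA and S2SwA heavier to train than purely convolutional ones like SGN.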

The profiles of the dishwasher, microwave, fridge, and kettle are shown in Figures 4–7, respectively; in each case, the LDwA successfully detects every appliance activation in the disaggregated trace.


**Table 2.** Disaggregation performance for the REDD dataset. The best approach is highlighted in boldface.


**Table 3.** Disaggregation performance for the UK-DALE dataset. The best approach is highlighted in boldface.

**Table 4.** Training time in hours for the REDD dataset.


**Table 5.** Training time in hours for the UK-DALE dataset.


The tailored attention mechanism inserted into the regression branch of the network allows us to correctly identify the relevant time steps in the signal and to generalize well to unseen houses. Furthermore, modeling attention is particularly interesting from the perspective of the interpretability of deep learning models, because it allows one to directly inspect the internal workings of the architecture. The hypothesis is that the magnitude of the attention weights correlates with how relevant a specific region of the input sequence is for the prediction of the output sequence. As shown in Figures 4–7, our network is effective at predicting the activation of an appliance, and the attention weights exhibit a peak at each state change of that appliance.
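The kind of inspection described above can be sketched as follows. The dot-product scorer and the feature values are hypothetical placeholders standing in for a learned attention layer, not the exact LDwA mechanism; the point is that the normalized weights are directly readable as a heatmap over the input window.

```python
import math

def softmax(scores):
    """Normalise raw scores into attention weights that sum to 1."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(features, v):
    """Score each time step's feature vector by a dot product with a
    (learned) vector v, then normalise. These weights are what a
    heatmap like those in Figures 4-7 visualises."""
    scores = [sum(vi * fi for vi, fi in zip(v, f)) for f in features]
    return softmax(scores)

def context_vector(features, weights):
    """Attention-weighted sum of the features, fed to the regressor."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]
```

Because the weights sum to one, a peak at a given time step can be read as the fraction of the model's "focus" placed there, which is what licenses the interpretability claim above.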

**Figure 4.** REDD dishwasher load and the heatmap of the attention weights at 3 s resolution.

**Figure 5.** REDD microwave load and the heatmap of the attention weights at 3 s resolution.

**Figure 6.** UK-DALE fridge load and the heatmap of the attention weights at 6 s resolution.

**Figure 7.** UK-DALE kettle load and the heatmap of the attention weights at 6 s resolution.

In conclusion, our approach not only predicts the disaggregation at the correct scale, but also successfully determines whether the target appliance is active in the aggregate load at a given time step.
