#### 3.3.2. Features

The active power from the aggregate measurement is the only input for most of the reviewed works. A number of papers, however, extended the input with further features: reactive and apparent power, current, the first-order difference of the active power signal, the power factor, the variant power signature, and various time-based features have been used in addition. A noteworthy case is the use of the aggregate power from multiple neighboring buildings as input, which, according to [107], led to considerable performance improvements. The input features of the reviewed publications are marked in the column 'Input' of Table 2; see also Section 4.2 for a discussion of the benefits of multiple input features.
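To make the feature set concrete, the following minimal NumPy sketch derives several of the features listed above from aggregate measurements; the variable names and sample values are illustrative assumptions, not taken from any reviewed publication.

```python
import numpy as np

# Hypothetical aggregate measurements sampled at a fixed rate:
# p = active power (W), q = reactive power (var).
p = np.array([102.0, 104.5, 350.2, 348.9, 101.7])
q = np.array([ 20.1,  21.0,  80.4,  79.8,  19.9])

# Apparent power S = sqrt(P^2 + Q^2) and power factor PF = P / S.
s = np.sqrt(p**2 + q**2)
pf = p / s

# First-order difference of the active power signal (prepending the
# first sample keeps the length T; the first entry becomes 0).
dp = np.diff(p, prepend=p[0])

# Stack the per-sample features into an input matrix for a DNN.
features = np.stack([p, q, s, pf, dp], axis=1)  # shape: (T, 5)
```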

#### *3.4. Deep Neural Networks*

#### 3.4.1. Architecture Elements

In Table 2, column 'DNN Elements', we summarize the proposed DNN architectures based on a set of DNN building blocks or elements. Naturally, such a summary can only coarsely approximate the diversity of architectures encountered, but it still provides a high-level view of what has been tried. We list only original architectures proposed by the respective authors; models taken from earlier work and baselines are not listed. As a consequence, the column is empty in several cases, e.g., where a publication compares previous works. Looking at Table 2, we observe that, starting with the year 2018, feedforward elements, in particular convolutional elements, gained in popularity; they are used roughly twice as often as recurrent elements. In the same time span, advanced DNN elements such as generative adversarial networks (GAN) and attention were also adapted to NILM.

Below, we describe the DNN elements that are used in Table 2 to characterize the proposed models:



Where different elements have been combined, they are joined with a hyphen '-'; e.g., CNN-dAE denotes a dAE that includes convolutional layers.
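As an illustration of this naming convention, the following PyTorch sketch shows what a CNN-dAE combination might look like: a denoising autoencoder whose encoder and decoder are built from 1-D convolutional layers. The layer count, channel sizes, and window length are our own assumptions, not taken from any reviewed publication.

```python
import torch
import torch.nn as nn

class CNNdAE(nn.Module):
    """Illustrative CNN-dAE: maps a window of the aggregate power
    signal to the load curve of a single target appliance."""

    def __init__(self):
        super().__init__()
        # Encoder: 1-D convolutions compress the noisy aggregate signal.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Decoder: convolutions reconstruct the appliance load curve.
        self.decoder = nn.Sequential(
            nn.Conv1d(16, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(8, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # x: (batch, 1, window_length) aggregate active power window
        return self.decoder(self.encoder(x))

# Usage: disaggregate a batch of 32 windows of 599 samples each.
model = CNNdAE()
y = model(torch.randn(32, 1, 599))  # y has shape (32, 1, 599)
```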

#### 3.4.2. Training and Loss Functions

DNN gradient descent has its own set of hyperparameters, such as the type of optimizer, the number of training epochs, or the early stopping criterion. As these elements are not specific to the NILM problem, we mention only a few specific points in this section. Several authors optimized (training) parameters of the proposed networks by automatic means: grid search [72,92], hill climbing [41], and Bayesian optimization [42,75,90,102] have been employed. The author of [118] investigated different variants of curriculum learning [168]. In this type of learning, samples are not presented to the DNN in random order but are organized in a meaningful sequence, the intuition being that humans learn by mastering concepts of increasing difficulty. Contrary to this intuition, ref. [118] finds that easy samples hinder training; the author used synthetic training data composed from sets of more than 7 appliance sub-meters.
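The following sketch illustrates the basic idea of curriculum learning in this context. The difficulty measure used here (variance of the target appliance signal) is a hypothetical proxy of our own choosing, not the criterion used in [118].

```python
import numpy as np

def curriculum_order(windows, targets):
    """Sort training windows from 'easy' to 'hard'. As a hypothetical
    difficulty proxy, we use the variance of the target appliance
    signal: flat (mostly off) windows come first, windows with many
    state changes come last."""
    difficulty = targets.var(axis=1)   # one score per window
    order = np.argsort(difficulty)     # ascending difficulty
    return windows[order], targets[order]

# Usage with random stand-in data: 100 windows of 599 samples each.
windows = np.random.rand(100, 599)
targets = np.random.rand(100, 599)
easy_first_windows, easy_first_targets = curriculum_order(windows, targets)
```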

A key element of DNN optimization is the loss function that guides the optimization process. The vast majority of works employ either the mean absolute error (MAE) or the mean squared error (MSE) for power disaggregation, and the cross-entropy loss for on/off classification. Recent works also investigate alternative loss functions: quantile regression [169] was employed by [46,59]. The authors of [59] found that their proposed loss increased the performance of two state-of-the-art models compared to the MSE loss. Some works [34,35,39,43] employ GAN loss functions (called 'adversarial loss' in [35]) that classify whether the output of the regression DNN is a real or a fake appliance load curve. Such a loss should make outputs more realistic and help especially in the case of datasets of limited size. Finally, ref. [55] introduced a loss composed of four different terms: in addition to the MSE, a Kullback–Leibler divergence loss, a soft-margin loss, and the MAE are used. To our knowledge, a systematic comparison of loss functions for DNN-NILM approaches has not been published.
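As an example of such an alternative loss, a minimal PyTorch implementation of the generic quantile (pinball) loss underlying quantile regression could look as follows; details such as the choice of quantile levels differ between the formulations in [46,59].

```python
import torch

def pinball_loss(y_pred, y_true, tau=0.5):
    """Generic quantile (pinball) loss. For tau = 0.5 it reduces to
    half the MAE; other values of tau penalize over- and
    under-estimation of the appliance power asymmetrically."""
    diff = y_true - y_pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))
```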
