3.5.1. Post-Processing

Several variants for processing the output of DNN-NILM approaches have been proposed in the literature. If the DNN is of type *s2s* or *s2sub* and disaggregation is performed by sliding the input window one time step at a time, the network delivers n predictions for each time step, where n is the length of the output window. To obtain a single prediction, many authors use the mean, e.g., [14,22,35,89,119], or the median, e.g., [21,72]. In [21], the authors find that networks underestimate the power of appliances whose activations lie only partially within the input window. As a consequence, the mean also underestimates the ground truth, and [21] proposes to use the median instead, which is less affected by this problem. As the authors of [39] use a GAN, they keep only the disaggregated signal for which the discriminator outputs the highest probability of being a true sample.
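The aggregation of overlapping output windows can be sketched as follows. This is an illustrative implementation, not code from any of the cited works; the function name and the array layout are our own assumptions:

```python
import numpy as np

def aggregate_overlapping(windows, step=1):
    """Combine overlapping s2s output windows into one prediction per
    time step by taking the median over all windows covering that step.

    windows: array of shape (num_windows, window_len); window i is
    assumed to start at time step i * step.
    """
    num_windows, window_len = windows.shape
    total_len = (num_windows - 1) * step + window_len
    # Collect every prediction made for each time step; window i
    # predicts time step i * step + j at its offset j.
    preds = np.full((total_len, window_len), np.nan)
    for i, w in enumerate(windows):
        start = i * step
        for j in range(window_len):
            preds[start + j, j] = w[j]
    # The median is less sensitive than the mean to windows that
    # only partially cover an activation.
    return np.nanmedian(preds, axis=1)
```

With `step=1` and an output window of length n, interior time steps receive n predictions, while steps near the signal boundaries receive fewer; `np.nanmedian` handles both cases uniformly.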

Some authors note that the models produce noisy output, e.g., in the form of sporadic activations that are either too short or too frequent for the target appliance. Ref. [37] filters out such events with the same approach used for activation detection in the ground truth data. Similarly, [36] removes all activations of an appliance that are shorter than those found in the ground truth data. Depending on the metric, the reported improvement ranges from 28% to 54%. Refs. [58,89] go one step further and train a second DNN to suppress spurious activations. According to [58], the additional DNN leads to "significant performance boosts".
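A minimal sketch of such a filtering step follows. It is our own illustration, not the exact procedure of the cited works; in practice, the minimum duration would be derived from ground-truth activations, and the threshold and function name here are assumptions:

```python
import numpy as np

def remove_short_activations(power, threshold, min_steps):
    """Zero out on-activations shorter than min_steps time steps.

    threshold: power level above which the appliance counts as 'on';
    min_steps: shortest plausible activation length for the appliance,
    e.g., taken from ground-truth activations in the training data.
    """
    power = power.copy()
    on = (power > threshold).astype(np.int8)
    # Pad with zeros so every on-segment has a rising and falling edge,
    # then locate those edges via the discrete difference.
    edges = np.flatnonzero(np.diff(np.concatenate(([0], on, [0]))))
    # Edges come in (start, stop) pairs, one pair per on-segment.
    for start, stop in zip(edges[::2], edges[1::2]):
        if stop - start < min_steps:
            power[start:stop] = 0.0
    return power
```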

## *3.6. Evaluation Metrics*

The performance of NILM algorithms is assessed in various ways. The interested reader is referred to [18,145]: Ref. [18] provides a comprehensive review and discussion of employed metrics, and ref. [145] proposes a set of metrics to assess the generalization ability of NILM algorithms. In the following, we only repeat the definitions of the mean absolute error (MAE) and the F1-score. In the reviewed literature, these were the most frequently encountered metrics for assessing the estimated energy consumption and the on/off status of an appliance, and we use them for our comparison in Section 4.1.

$$\text{MAE} = \frac{1}{T}\sum\_{t=1}^{T} |y\_t - \hat{y}\_t| \tag{4}$$

where the sum runs over *T* time steps, and *yt* and *ŷt* correspond to the measured and estimated power consumption, respectively. In this publication, we use Watts as the unit for the MAE.

$$\mathbf{F}\_1 = 2 \cdot \frac{P \cdot R}{P + R} \tag{5}$$

where precision *P* = *TP*/(*TP* + *FP*) and recall *R* = *TP*/(*TP* + *FN*) with *TP*, *FP*, and *FN* denoting true positives, false positives, and false negatives, respectively [18].
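Equations (4) and (5) translate directly into code. The following is a minimal sketch (function names are ours) without the zero-division guards a production implementation would need:

```python
import numpy as np

def mae(y_true, y_pred):
    # Equation (4): mean absolute deviation between measured and
    # estimated power, in Watts.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def f1_score(on_true, on_pred):
    # Equation (5): harmonic mean of precision and recall on the
    # binary on/off status.
    on_true = np.asarray(on_true, dtype=bool)
    on_pred = np.asarray(on_pred, dtype=bool)
    tp = np.sum(on_true & on_pred)   # true positives
    fp = np.sum(~on_true & on_pred)  # false positives
    fn = np.sum(on_true & ~on_pred)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```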

#### **4. Discussion and Current Research Gaps**

The following sections discuss different aspects of the reviewed literature. Each section concludes with a paragraph on current research gaps we see concerning the discussed topic.

## *4.1. Performance Comparison*

One of the basic questions that accompanied us throughout this literature review was: "What is the most promising approach or class of approaches?" As the previous section hints, there is no straightforward answer. Too many degrees of freedom (see Figure 1) make the approaches differ in so many ways that a comparison based solely on the results presented in the publications can only give indications. For that purpose, the MAE and F1-score were extracted from the reviewed publications wherever available. (The data are available on our GitHub account; the link is provided in the 'Supplementary Materials'.) These two metrics are the most widely applied performance measures for energy estimation and on/off state classification, respectively. Figures 3 and 4 each display the best reported results, split up by dataset and appliance. Only results from the *observed*, *unseen* evaluation scenario are given. This scenario was selected as it is closest to an actual application of DNN-NILM algorithms, see Section 3.2.1. The graphs only include results from approaches proposed in the corresponding publications: results from baselines, or from earlier approaches used for comparison, are not included. Appliances with a single result in the displayed range are excluded. We observe that the results for kettle and microwave are overall better and less scattered than those for the other displayed appliances. We believe this is because of their simpler nature: both kettle and microwave are appliances with only two states, whereas dishwasher, washing machine, and, to a certain degree, fridge have a more diverse load signature.

**Figure 3.** Minimal reported MAE for the corresponding dataset and appliance. Only results from the *observed*, *unseen* evaluation scenario have been included. Only approaches proposed by the authors in the corresponding publications are taken into consideration (i.e., no baselines or models from the state-of-the-art). Results have been split according to the appliance and employed dataset. Please note that appliances with a single result in the selected range are not shown.

We caution the reader *not to interpret the displayed values as the result of a direct comparison under identical conditions*. Results have been generated under broadly differing settings, see Table 2. One key difference is that the evaluation data varied strongly between publications. While the results in Figures 3 and 4 are thus not directly comparable, we try to identify common elements of successful approaches. For that purpose, we sorted the results for each appliance (irrespective of the dataset) and took the top quarter of the results. Depending on the appliance, a quarter consisted of four to six results in the case of the MAE and two to four in the case of the F1-score. We then counted the number of times each publication appears in these results. Those with *more than one count* are [34,35] (five times), [62] (four times), and [36,63,97,98] (two times) in the case of the MAE, and [63] (six times), [118] (three times), and [36,37,64] (two times) for the F1-score. These publications have been marked in column 'Best' of Table 2.
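The counting procedure described above can be sketched as follows. The tuple layout is hypothetical (the actual extracted results are provided via the Supplementary Materials), and the function name is ours:

```python
from collections import Counter

def top_quarter_counts(results, lower_is_better=True):
    """Count how often each publication lands in the best quarter
    of results for some appliance.

    results: list of (publication_id, appliance, score) tuples,
    a hypothetical layout for illustration. Set lower_is_better=True
    for the MAE, False for the F1-score.
    """
    # Group the scores by appliance, irrespective of the dataset.
    by_appliance = {}
    for pub, appliance, score in results:
        by_appliance.setdefault(appliance, []).append((score, pub))
    counts = Counter()
    for scores in by_appliance.values():
        # Best results first: ascending for MAE, descending for F1.
        scores.sort(reverse=not lower_is_better)
        quarter = max(1, len(scores) // 4)
        counts.update(pub for _, pub in scores[:quarter])
    return counts
```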

**Figure 4.** Maximal reported F1-score for the corresponding dataset and appliance. Only results from the *observed*, *unseen* evaluation scenario have been included. Only approaches proposed by the authors in the corresponding publications are taken into consideration (i.e., no baselines or models from the state-of-the-art). Results have been split according to the appliance and employed dataset. Please note that appliances with a single result in the selected range are not shown.

Based on these publications, we make the following observations:



We see the following limitations in the comparison of the MAE and F1-scores performed above: As already pointed out, the results have been obtained through widely differing procedures. It is also questionable whether these metrics are the most relevant ones; they have simply been chosen as those most commonly reported in the reviewed literature. This also means that publications reporting other metrics do not appear in the evaluation. With these caveats in mind, we still believe the previous observations provide some value.

With respect to the initial question "How does the performance of approaches compare?" and the performed literature review, we identify several challenges worth addressing by the research community: We observe that the experiments performed in the reviewed literature are not always well specified with respect to the 'degrees of freedom' mentioned in Section 3. We hope future works profit from our listing of available options and specify their decisions clearly. In addition, we see several steps at different levels that could lead to better comparability:


## *4.2. Multiple Input Features*

While the reactive power has already been used in the founding works of NILM [1,2], most authors take the active power as the only input for disaggregation, see Section 3.3.2. Therefore, we raise the question: "Can we find evidence that multiple input features benefit DNN-NILM performance?"

As can be seen in the overview Table 2, several authors employed alternative input features. Comparisons of input features are reported in [22,36,89,94,106,107,117]. Ref. [117] was the first DNN-NILM approach we are aware of that used multiple input features. Unfortunately, these results do not allow separating the influence of the input features from that of multi-task learning, because the two are always used in conjunction. Ref. [22] explicitly exploits reactive power (Q). The authors evaluate the impact of Q on the F1-score within the AMPds and UK-DALE datasets. Over the investigated appliances, they find an average improvement of around 12.5% in the seen and 8% in the unseen evaluation scenario. (The average F1-scores for P in the seen and unseen scenarios are 0.68 and 0.58, respectively.) Interestingly, the reported improvement is small or negative for purely resistive loads, such as kettle and electric oven. We hypothesize that in such cases, the reactive power provides no information, but pure noise. Refs. [89,94,106] all worked with the AMPds dataset. This dataset contains measurements from a single house; all results thus stem from seen evaluation scenarios. Ref. [94] compares two feature sets using the estimation accuracy: for the combination of P and Q versus P alone, the authors find an improvement of 6%. For the combination of P, Q, current (I), and apparent power (S), the improvement is slightly higher at 7%. (The estimation accuracy for P alone amounts to 0.83.) Ref. [89] investigates the same feature set P, Q, I, S versus P based on three different performance measures, namely the MAE, root mean square error (RMSE), and normalized RMSE. In this work, the improvements with the additional features are much larger: for all measures, the average improvement is around 40% to 50%. (The MAE, RMSE, and normalized RMSE averaged over all investigated appliances and models for P alone amount to 36.7 W, 122.8 W, and 0.75, respectively.) Temperature as a supplementary feature has been used by [106]. The authors find that the disaggregation of 'heat pump' and 'home office' works 3% and 4% better based on the F1-score and estimation accuracy, respectively. (The F1-score and estimation accuracy averaged over the two appliances for P alone amount to 0.87 and 0.91, respectively.) The authors of [107] find that providing the aggregate electrical consumption of neighbors as additional features leads to performance improvements of 17% and 31% with and without multi-task learning, respectively. (The symmetric mean absolute percentage error (SMAPE) for P alone amounts to 23% and 38% in the respective cases.) Finally, ref. [36] calculates the mutual information between P, Q, S, I, voltage, and the power factor of the aggregate measurement on the one hand and P of the appliance on the other as a feature selection step. Voltage is the least informative feature and is therefore dropped in the subsequent evaluations.

The previously mentioned improvements have been calculated from the original values reported by the authors according to the following formula:

$$\text{Improvement} = \frac{\mathit{Perf}\_{\text{addFeatures}} - \mathit{Perf}\_{\text{onlyP}}}{\mathit{Perf}\_{\text{onlyP}}} \,, \tag{6}$$

where *PerfonlyP* and *PerfaddFeatures* correspond to the performance (in any measure) of the approach based only on P and on additional features, respectively. For measures where smaller values indicate better performance (e.g., the MAE), we swapped *PerfaddFeatures* and *PerfonlyP* so that improved results always yield a positive value.
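In code, Equation (6) together with the swapping rule reads as follows (a straightforward sketch with our own function name):

```python
def improvement(perf_only_p, perf_add_features, lower_is_better=False):
    """Relative improvement from additional input features, Equation (6).

    perf_only_p: performance of the approach using only P;
    perf_add_features: performance with additional input features;
    lower_is_better: True for measures such as the MAE, where the two
    quantities are swapped so that an improvement stays positive.
    """
    if lower_is_better:
        perf_only_p, perf_add_features = perf_add_features, perf_only_p
    return (perf_add_features - perf_only_p) / perf_only_p
```

For example, an F1-score rising from 0.80 to 0.88 yields an improvement of 10%, and an MAE dropping from 40 W to 36 W yields roughly 11%.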

Based on the results presented, we conclude that features besides P can improve the disaggregation performance. No conclusions about the amount of improvement can, however, be drawn, as the spread of the results is quite broad. For the time being, we can only speculate about possible reasons. It might be worthwhile to investigate which factors, e.g., architectures, can make the most of the information from features besides P. With the exception of those in [22], all results originate from seen evaluation scenarios. This effectively means that additional features help to estimate the power usage of a particular appliance. However, it is unclear how much they help to disaggregate an appliance *type*, see Section 3.2.1. Non-DNN NILM approaches already employed a very broad feature set [13]. Compared to this breadth, DNN-NILM approaches have tested a very limited set of options. It would be interesting to see a systematic investigation of a broader feature set.
