#### **4. Results**

This section presents the experimental validation of the proposed methodology using data from real-world installations. The Building-Level fully-labeled dataset for Electricity Disaggregation (BLUED) [48] is used to test the applicability of the proposed event-detection algorithm. Energy consumption data from three household installations are then used to evaluate the overall performance of the proposed methodology; common metrics are employed and the results are compared with those of other state-of-the-art methods from the literature. Finally, the computational and memory efficiency of the proposed system is discussed.

#### *4.1. Event Detection Evaluation*

The BLUED dataset contains aggregate voltage/current and active power data, sampled at 12 kHz and 60 Hz, respectively, from a 2-phase household in Pittsburgh, USA. The recording duration is eight days. The time instants at which turn-on or turn-off events occurred are also reported in the dataset. To test the proposed event-detection algorithm, the active power measurements of phase A, at 1 Hz, from 11:58:32 on 20 October 2011 to 09:29:55 on 21 October 2011, are used. During this period, 125 events occurred, including six pairs of simultaneous events. The proposed algorithm detects each pair of simultaneous events, as well as two near-simultaneous turn-off events, as a single event. In addition, one false event is detected: a power drop of an appliance was incorrectly identified as a turn-off, although the appliance was still in operation. In summary, 118 of the 125 events are correctly detected by the proposed event-detection algorithm. Figure 4 shows the active power and the detected events for the period from 18:30:00 to 20:30:00.

**Figure 4.** Event detection in a real household on a given day.
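To illustrate why simultaneous or near-simultaneous events tend to be reported as a single event, the following minimal sketch implements a generic power-difference detector on a 1 Hz active power series. It is not the authors' implementation: the function name `detect_events` and the values of `threshold` and `merge_gap` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def detect_events(
    power: List[float],
    threshold: float = 30.0,  # hypothetical minimum step size in watts
    merge_gap: int = 2,       # hypothetical merge distance in samples
) -> List[Tuple[int, str]]:
    """Return (sample_index, "on" | "off") for each detected power step.

    A step is any sample-to-sample change whose magnitude reaches
    `threshold`. Steps that follow the previous step within `merge_gap`
    samples are merged into it, so near-simultaneous appliance events
    are reported as a single event.
    """
    events: List[Tuple[int, str]] = []
    last_index = None
    for i in range(1, len(power)):
        delta = power[i] - power[i - 1]
        if abs(delta) < threshold:
            continue  # ordinary fluctuation, not an event
        if last_index is not None and i - last_index <= merge_gap:
            last_index = i  # extend the current event instead of adding one
            continue
        events.append((i, "on" if delta > 0 else "off"))
        last_index = i
    return events


# One appliance switching on and later off: two events.
print(detect_events([100, 100, 600, 600, 600, 100, 100]))
# Two appliances switching on one second apart: reported as one event.
print(detect_events([100, 100, 600, 1100, 1100]))
```

With a detector of this kind, two appliances that switch within the merge distance cannot be separated from the aggregate signal alone, which is consistent with the merged events reported above.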

The TPR, FPR and FNR metrics are calculated and compared with those of other, more complex solutions [32,49,50] in Table 3. It can be seen that the proposed algorithm achieves good results while being simple and computationally efficient.

**Table 3.** Event detection evaluation.
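As a worked illustration, the event counts reported in the text (125 ground-truth events, 118 of them correctly detected) can be converted into a true positive rate and a false negative rate. The sketch below assumes the usual definitions TPR = TP/(TP + FN) and FNR = FN/(TP + FN); the function name `detection_rates` is hypothetical. The FPR is deliberately omitted, since it requires the number of non-event instances, which is not stated in the text. The values in Table 3 remain the reference; this only shows the arithmetic.

```python
def detection_rates(true_events: int, detected_true: int) -> dict:
    """Compute TPR and FNR from the number of ground-truth events and the
    number of those events that were correctly detected."""
    missed = true_events - detected_true  # false negatives
    return {
        "TPR": detected_true / true_events,
        "FNR": missed / true_events,
    }


# Counts taken from the text of Section 4.1.
rates = detection_rates(true_events=125, detected_true=118)
print(f"TPR = {rates['TPR']:.3f}, FNR = {rates['FNR']:.3f}")
```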


#### *4.2. Classification Evaluation*

To evaluate the classifier's performance on the three target appliances, the private testing sets mentioned in Section 3.1 are used. The resulting accuracy, precision, recall and F1-score are summarized in Table 4.

**Table 4.** Classification results.
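For reference, the four classification metrics can be computed from the entries of a binary confusion matrix as sketched below. The standard definitions are assumed; the function name `classification_metrics` and the example counts are hypothetical and do not correspond to the values in Table 4.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1-score from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(example)
```

Because true negatives dominate when an appliance is OFF most of the time, accuracy can remain high even when precision or recall is moderate, which is why all four metrics are reported together.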


It is evident that the proposed classification algorithm performs well for the microwave and the fridge. The transient responses of these appliances present specific characteristics and can therefore be identified with high confidence. However, this is not the case for the washing machine, since its turn-on transient response is a simple steep step-up waveform. Similar patterns are also produced by the heating processes of most household appliances, e.g., the dishwasher, the oven and, in general, appliances that use resistive elements for heating, as shown in Figure 5. This explains the relatively lower scores obtained for the washing machine compared to the other appliances.

**Figure 5.** Turn-on transient responses from different household appliances.

#### *4.3. Application on Residential Households*

The overall performance of the proposed methodology is tested on a private dataset. This dataset includes three 3-phase power supply households located in the Netherlands. For each household, aggregated active power per phase was measured at 100 Hz along with power consumption of selected appliances for 15 days. For evaluation purposes, the proposed NILM system is applied only when the target appliance is connected.

Figure 6 presents the results for each target appliance over an operational duration of four hours. The aggregated power is shown in blue, the actual target appliance power measured with plugwise meters in red, and the target appliance power estimated by the proposed methodology in green.

**Figure 6.** Power estimation for the selected appliances in real households. Time-series of (**a**) aggregated power, (**b**) actual target appliance power, (**c**) estimated power for fridge; (**d**) aggregated power, (**e**) actual target appliance power, (**f**) estimated power for washing machine; (**g**) aggregated power, (**h**) actual target appliance power, (**i**) estimated power for microwave.

The accuracy, precision, recall, F1-score, MAE, RMSE and RE are calculated for each household over the 15 days, together with their averages across the three households. Results for the fridge, washing machine and microwave oven are shown in Tables 5–7, respectively. In general, the proposed algorithm presents high accuracy regarding the power and energy estimates of the fridge and the microwave. In contrast, the recall for the microwave oven is low. This can be attributed to the fact that the proposed methodology treats the standby mode of this appliance as OFF. The power consumption during standby is low and therefore of trivial importance for energy consumption calculations. Regarding the washing machine, the NILM system is designed to detect only the most energy-intensive process of the operation cycle, i.e., water heating. For the remaining (non-detected) processes, i.e., water pumping, drum spinning and rinsing, the appliance status is assumed OFF. This partial detection of the washing machine is evident in Figure 6 and results in low recall scores. Moreover, in the third household, the low precision is due to the operation of appliances with similar transient response patterns, which are misclassified as washing machine end-uses.
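The power and energy estimation metrics can be sketched as follows. The definitions are assumptions, since they are not restated in this section: MAE and RMSE are taken sample-wise over the power series, and RE is taken as the absolute difference between estimated and actual total energy, normalized by the actual total energy. The function names and the example series are hypothetical.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean absolute error between two equally long power series."""
    return sum(abs(e - a) for a, e in zip(actual, estimated)) / len(actual)


def rmse(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Root mean square error between two equally long power series."""
    return math.sqrt(
        sum((e - a) ** 2 for a, e in zip(actual, estimated)) / len(actual)
    )


def relative_error(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Assumed RE definition: |E_est - E_act| / E_act, where the energies are
    proportional to the sums of the uniformly sampled power values."""
    total_actual = sum(actual)
    return abs(sum(estimated) - total_actual) / total_actual


# Hypothetical series in watts, for illustration only.
actual = [0.0, 100.0, 100.0, 0.0]
estimated = [0.0, 90.0, 110.0, 0.0]
print(mae(actual, estimated), rmse(actual, estimated),
      relative_error(actual, estimated))
```

In this example the sample-wise errors cancel in the energy total, so RE is zero while MAE and RMSE are not. Under the assumed definitions, a low RE can therefore coexist with a noticeable MAE.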


**Table 5.** Results for fridge.

**Table 6.** Results for washing machine.


**Table 7.** Results for microwave.


#### *4.4. Comparison with Other Methods*

The performance of the proposed methodology is compared to that of other NILM-based energy consumption estimation systems. The average MAE, RE, precision, recall, F1-score and accuracy obtained by the proposed method are summarized in Tables 8–10 for the fridge, washing machine and microwave, respectively. The corresponding results reported in the relevant literature (where available) are also presented, together with the associated NILM technique, sampling frequency and testing dataset. Note that most of the state-of-the-art methods in the literature have been tested on the well-known UK Domestic Appliance-Level Electricity (UK-DALE) [51] dataset. This dataset includes aggregated active power and appliance measurements at 0.167 Hz over several months, recorded for a small number of household installations. Moreover, the Reference Energy Disaggregation Data Set (REDD) [52] has been used in [21] to evaluate the performance of the LSTM algorithm; the sampling frequency is 1 Hz for the mains and 0.333 Hz for the appliances. The proposed NILM system is tested on a 100 Hz private dataset, since high-frequency data are not provided in the above-mentioned public datasets. It is important to stress that a fair comparison between the different approaches would require all metrics to be taken into consideration. However, this is not possible, since results for all metrics are not always provided in the corresponding literature. Therefore, any direct comparison should be interpreted with caution.


**Table 8.** Comparison results among existing non-intrusive load monitoring (NILM) solutions for fridge identification and energy consumption estimation.

Note: GRU stands for gated recurrent units, seq2point/seq2seq for sequence-to-point/sequence-to-sequence, WGRU for window GRU, SAEDdot/SAEDadd for self-attentive energy disaggregation with 'dot'/'additive' attention mechanism, PCNN AE for parallel CNN autoencoder.

**Table 9.** Comparison results among existing NILM solutions for washing machine identification and energy consumption estimation.



**Table 10.** Comparison results among existing NILM solutions for microwave identification and energy consumption estimation.

From the results of Table 8 it can be seen that the proposed algorithm presents high performance on most metrics. In particular, the method presents the third-best MAE, being inferior only to PCNN AE and PCNN LSTM. Regarding energy estimation, the RE metric is low (equal to 0.19), and the proposed method is outperformed only by the CNN [19] and the WGRU [22] algorithms. The proposed solution also presents the highest precision in terms of status estimation; that is, the fridge status was falsely identified as ON (while actually OFF) in fewer cases than with any other examined NILM solution. On the other hand, the proposed method presents moderate performance in terms of recall (0.80), since the Autoencoder, CNN [19] and LSTM [21] algorithms achieve better results. This is mainly attributed to the design of the proposed power estimation algorithm. The fridge status may be falsely considered OFF prior to the actual turn-off, due to similar power step-downs caused by appliances other than the target one. A possible solution would be to determine the fridge pulse duration. However, this is practically infeasible, since the fridge pulse duration varies significantly with the temperature difference between the inside and the outside of the appliance. Finally, ranking all methods in terms of F1-score and accuracy shows that the proposed method is second-best and best, respectively, among all examined solutions (where the corresponding metrics were available).

By analysing the washing machine results in Table 9, it can be observed that the proposed method presents relatively high MAE; seven out of the fourteen examined methods perform better. Regarding energy estimation, the proposed method is second-best in terms of RE, following the seq2point implementation [20]. Moreover, the proposed method presents the highest precision and the lowest recall among the examined solutions. This is due to the fact that the proposed system is specifically designed to detect the most energy-intensive and shortest process of the appliance, i.e., heating. The rest of the washing machine operation cycle, e.g., drum spinning and rinsing, is not taken into account, since these are low energy-consumption, longer-duration processes and thus of less importance. This implies that the proposed NILM system can accurately estimate the washing machine energy consumption (low RE value) but predicts the appliance status as OFF whenever no water heating takes place, resulting in low recall and high MAE. Some of the current state-of-the-art NILM systems can indeed detect these less energy-intensive processes. However, this results in an increased number of FP and consequently in low precision. Note that the low precision of the proposed method (although the highest among the examined solutions) is attributed to the fact that the transient response of the heating process is similar to that of other household appliances, which may lead to an increased number of FP predictions. Finally, the F1-score and accuracy metrics rank the proposed method third- and fourth-best, respectively, among the examined solutions (where metrics were available).

Finally, regarding the microwave oven (Table 10), the proposed method outperforms the examined NILM methods, presenting the lowest MAE and RE as well as the highest precision, F1-score and accuracy. Better results by other methods are observed only in terms of recall. This is due to the fact that the proposed system cannot detect the standby mode of the microwave oven. However, the power consumption during this period can be considered negligible. It is also important to note that, in NILM and from a user-experience point of view, precision is considered more important than recall; missing an appliance event is preferable to reporting an appliance event that has not actually occurred. In this sense, missing standby modes is preferable to predicting false microwave end-uses. The superiority of the proposed method for the microwave oven is based on the following: (a) the microwave transient response pattern is unique and can thus be easily identified, and (b) the microwave oven end-use duration is short, varying from a few seconds to minutes; thus, the number of possible turn-off events caused by other appliances that may degrade the performance of the power estimation algorithm is very limited.

#### *4.5. Computational and Memory Efficiency*

The proposed methodology is designed to be memory and computationally efficient. The first part, i.e., the event detector, calculates the power difference over time. The second part, i.e., the classifier, is triggered only when a significant power step-up is detected. If the classifier detects a target appliance, the power estimation algorithm is enabled. This event-based approach can be considered computationally efficient compared to other solutions that operate continuously, i.e., even when no turn-on event occurs. Furthermore, the transient response classifier consists of 54,377 parameters. This is a small number compared to other end-to-end deep learning models, which require parameters in the order of millions; e.g., the models proposed in [19] range from 1 million to more than 150 million parameters. Therefore, the proposed NILM system can be considered memory efficient.
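The parameter counts above can be translated into an approximate storage requirement. The sketch below assumes that each parameter is stored as a 32-bit float (4 bytes), which is a common default but is not stated in the text; the function name `model_size_bytes` is hypothetical, and activations and other runtime buffers are ignored.

```python
def model_size_bytes(n_params: int, bytes_per_param: int = 4) -> int:
    """Storage needed for the weights alone, assuming a fixed width per
    parameter (4 bytes corresponds to 32-bit floats, an assumption)."""
    return n_params * bytes_per_param


classifier_bytes = model_size_bytes(54_377)        # classifier in the text
small_ref_bytes = model_size_bytes(1_000_000)      # lower bound cited for [19]
large_ref_bytes = model_size_bytes(150_000_000)    # upper range cited for [19]

print(f"classifier: {classifier_bytes / 1024:.1f} KiB")
print(f"1 M parameters: {small_ref_bytes / 1024**2:.1f} MiB")
print(f"150 M parameters: {large_ref_bytes / 1024**2:.1f} MiB")
```

Under this assumption the classifier weights occupy roughly 212 KiB, whereas a 1-million-parameter model already needs about 3.8 MiB, i.e., more than 18 times as much.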

The only drawback is the use of 100 Hz active power data to recognize an appliance turn-on when a transient occurs. However, this feature is important for enabling the real-time application of the proposed NILM system, in contrast to other approaches that require power data over more extended periods (minutes to hours) in order to identify which appliance is operating. Moreover, it must be noted that the 100 Hz time-series is used only when an event is detected, and only a 6 s window is extracted. Based on the above, it is evident that the proposed system can operate on the edge without the need for high-end microprocessors.
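The size of the data handed to the classifier follows directly from these figures: a 6 s window at 100 Hz contains 600 samples. The sketch below shows one possible way to cut such a window around a detected event. The position of the window relative to the event is not specified in this section, so the `pre_samples` offset is a hypothetical choice, as is the function name `extract_window`.

```python
from typing import List, Optional

SAMPLE_RATE_HZ = 100   # sampling frequency stated in the text
WINDOW_SECONDS = 6     # window length stated in the text
WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_SECONDS  # 600 samples


def extract_window(
    samples: List[float],
    event_index: int,
    pre_samples: int = 100,  # hypothetical: keep 1 s before the event
) -> Optional[List[float]]:
    """Return the 600-sample window that starts `pre_samples` before the
    event, or None if the buffer does not yet hold the full window."""
    start = max(0, event_index - pre_samples)
    end = start + WINDOW_SAMPLES
    if end > len(samples):
        return None  # not enough data after the event yet
    return samples[start:end]


# Hypothetical 20 s buffer of 100 Hz samples.
buffer = [float(i) for i in range(2000)]
window = extract_window(buffer, event_index=500)
print(len(window), window[0], window[-1])
```

Stored as 32-bit floats (an assumption), one such window occupies 2400 bytes, so the high-rate data that must be retained per event are small.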

#### **5. Discussion—Towards Scalable Real-Time NILM Services**

As already reported, the proposed methodology is implemented as a real-time scalable solution with minimal hardware requirements, thus allowing utilities to perform a large-scale deployment. However, some criteria need to be met from an industry perspective before such a service is massively adopted. Finding the correct blend of characteristics is not a trivial issue. It is therefore no surprise that no real-time NILM solution based on sub-second energy data resolution has yet been rolled out at scale (>50 K end-users) anywhere in the world. In this section, four necessary criteria are presented and it is examined whether the proposed methodology meets them.

1. The first criterion, as expected, is accuracy. Accuracy usually refers to a weighted combination of (i) correctly detected events, (ii) precise energy consumption estimation for the detected appliance events and (iii) minimized FP. Energy companies and electricity consumers usually trust a NILM service when its accuracy exceeds 90% and when they do not receive reports for appliances/activities that never actually occurred.


In Figure 7, three of the criteria mentioned above are analyzed for the proposed system, i.e., accuracy, sampling frequency, and computational burden. As can be seen, scalability-related criteria #1 and #2 are met; the proposed system presents accuracy higher than 90% in all examined cases using a sampling frequency of 100 Hz (see results in Section 4). Although this frequency is high, it is still considerably lower than the resolution of several kHz used in most state-of-the-art real-time implementations [33–35,39]. In that sense, it is an excellent "do a lot with a little" design decision. Regarding criterion #3, i.e., computational and memory efficiency, as demonstrated in Section 4.5, the optimized design can run efficiently on the edge, even on low-cost chip-sets. Specifically, in Figure 7, the "High" value is assumed to refer to computationally expensive algorithms with a large number of parameters that cannot be easily integrated into a low-cost microprocessor. On the other hand, the "Very Low" value refers to algorithms of low computational complexity that can be integrated into and run on a low-cost microprocessor. The proposed system lies between the "Low" and "Very Low" areas. Criterion #4 is expected to be met as a consequence of #3. However, such an investigation falls outside the scope of this paper.

**Figure 7.** Scalability evaluation of the proposed implementation.
