**Citation:** Piccialli, V.; Sudoso, A.M. Improving Non-Intrusive Load Disaggregation through an Attention-Based Deep Neural Network. *Energies* **2021**, *14*, 847. https://doi.org/10.3390/en14040847

Received: 29 December 2020; Accepted: 2 February 2021; Published: 5 February 2021

#### **1. Introduction**

Non-Intrusive Load Monitoring (NILM) is the task of estimating the power demand of each appliance given the aggregate power demand signal recorded by a single electric meter monitoring multiple appliances [1]. In recent years, machine learning and optimization have played a significant role in NILM research [2,3]. In the literature, solutions based on k-Nearest Neighbors (k-NN), Support Vector Machines (SVMs), and Matrix Factorization have been proposed [4,5]. A practical approach to NILM has to handle real power measurements sampled at intervals of seconds or minutes. In this setting, one of the most popular approaches is based on the Hidden Markov Model (HMM) [6], because of its ability to model transitions between the consumption levels of the target appliances. Subsequent papers focused on enhancing the expressive power of this class of methods [7,8]. Recently, the energy disaggregation problem has been reformulated as a multi-label classification problem [9]. In order to detect the active appliances at each time step, the idea is to associate each value of the main power with a vector of labels of length equal to the number of appliances, whose entries are set to 1 if the corresponding appliance is active and 0 otherwise. The reformulated problem has been solved with different approaches [10–12]. However, there is no direct way to derive the power consumption of each appliance at a given time step from these labels. More recently, approaches based on deep learning have received particular attention, as they exhibit noteworthy disaggregation performance. Deep Neural Networks (DNNs) were successfully applied for the first time to NILM by Kelly and Knottenbelt in [13], who coined the term "Neural NILM". Neural NILM is a nonlinear regression problem that consists of training a neural network for each appliance in order to predict a time window of the appliance load given the corresponding time window of aggregated data. Kelly and Knottenbelt proposed three different neural networks to perform NILM with high-frequency time series data: a recurrent neural network (RNN) using Long Short-Term Memory (LSTM) units; a Denoising Autoencoder (DAE); and a regression model that predicts the start time, end time, and average power demand of each appliance. The capability of LSTMs to learn long-range temporal dependencies of time series data makes them suitable candidates for NILM. Their first approach is based on stacked layers of LSTM units combined with a Convolutional Neural Network (CNN) at the beginning of the network that automatically extracts features from the raw data. In the same paper, NILM is treated as a noise reduction problem, in which the disaggregated load represents the clean signal, and the aggregated signal is considered corrupted by the presence of the remaining appliances and by the measurement noise. For this purpose, noise reduction is performed by means of a DAE composed of convolutional layers and fully connected layers. In the experiments conducted by the authors, the DAE network outperforms the LSTM-based architecture and the other approaches frequently employed for this problem, such as HMMs and Combinatorial Optimization. In [14], an empirical investigation of deep learning methods for NILM is conducted using two types of neural network architectures. The first network solves a regression problem that estimates the transient power demand of a single appliance given the whole series of the aggregate power. The second is a multi-layer RNN using LSTM units, similar to the structure used in [13]. Zhang et al. 
[15] proposed instead a sequence-to-point learning approach for energy disaggregation, where the midpoint of an appliance window is treated as the output of a neural network whose input is the corresponding aggregate window. Bonfigli et al. [16] proposed several algorithmic and architectural improvements to the DAE for NILM, showing that the Neural NILM approach improves on the best-known NILM approaches not based on DNNs, such as the Additive Factorial Approximate Maximum A Posteriori (AFAMAP) estimation by Kolter and Jaakkola [6]. Compared to the work in [13], their DAE approach is improved by introducing pooling and upsampling layers in the architecture and a median filter in the disaggregation phase to reconstruct the output signal from the overlapped portions of the disaggregated signal. Shin et al. [17] proposed a subtask gated network (SGN) that combines two CNNs, namely, a regression subnetwork and a classification subnetwork. The building block of the two subnetworks is the sequence-to-sequence CNN proposed in [15]. In their work, the regression subnetwork is used to infer the initial power consumption, whereas the classification subnetwork focuses on the binary classification of the appliance state (on/off). The final estimate of the power consumption is obtained by multiplying the regression output by the classification probability output. In the experiments conducted by the authors, the SGN outperforms HMMs and recently proposed state-of-the-art CNN architectures [13,15]. Chen et al. [18] adopted the structure of the SGN proposed in [17] and added a Generative Adversarial Network (GAN) to the SGN backbone. In their model, the disaggregator for a given appliance is followed by a generator that produces the load pattern for that appliance. They show that adding the adversarial loss helps the model produce more accurate results with respect to the basic SGN architecture. None of these state-of-the-art deep learning models use RNNs. 
In fact, in the NILM literature, CNNs have consistently shown better performance than RNNs, even though RNNs are still widely employed for sequence modeling tasks. In [19], a CNN-based DNN has been combined with data augmentation and an effective postprocessing phase, improving its ability to correctly detect the activation of each appliance when only a small amount of data is available. The application of the attention mechanism to NILM is a relatively new idea [20]. The DNN in [20] remarkably improves over Kelly's DAE when trained and tested on the same house. On the other hand, its generalization capability on houses not seen during training is modest. Moreover, training and testing for the NILM task are time-consuming, as the authors use the same architecture proposed in [21] for machine translation, which consists of RNN layers in both the encoder and the decoder.

In this paper, we propose an RNN-based encoder–decoder model to extract appliance-specific power usage from the aggregated signal, and we enhance it with a scalable and lightweight attention mechanism designed for the energy disaggregation task. More specifically, we substantially improve the generalization capability of the SGN by Shin et al. by encapsulating our model in the regression subnetwork and combining it with the classification subnetwork. The implemented attention mechanism strengthens the representational power of the neural network by locating the positions of the input sequence where the relevant information is present. The intuition is that our attention-based model can help the energy disaggregation task by assigning importance to each position of the aggregated signal that corresponds to a state change of the target appliance. This feature is crucial for developing appliance models that generalize well to buildings not seen during training.
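As a rough numeric illustration of the subtask-gating idea that our model builds on (a sketch only, not the paper's actual implementation; all variable names and values below are invented), the final power estimate is the element-wise product of the regression subnetwork's output and the classification subnetwork's on-probability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of the two subnetworks for four time steps.
regression_out = np.array([10.0, 1200.0, 1250.0, 30.0])   # estimated power (W)
classification_logits = np.array([-4.0, 5.0, 6.0, -3.0])  # on/off logits

on_prob = sigmoid(classification_logits)   # probability the appliance is on
power_estimate = regression_out * on_prob  # gated final estimate
```

Gating suppresses spurious power predicted while the appliance is almost certainly off, while leaving predictions during activations nearly unchanged.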

The proposed DNN is tested on two publicly available datasets—REDD and UK-DALE—and the performance is evaluated using different metrics. The obtained results show that our algorithm outperforms state-of-the-art DNNs in all the addressed experimental conditions. The paper is organized as follows. Section 2 describes the NILM problem. Section 3 presents our DNN architecture. Section 4 describes the experimental procedure and the obtained results. Finally, Section 5 concludes the paper.

#### **2. NILM Problem**

Given the aggregate power consumption $(x_1, \ldots, x_T)$ from $N$ active appliances at the entry point of the meter, the task of a NILM algorithm is to deduce the contribution $(y_1^i, \ldots, y_T^i)$ of each appliance $i = 1, \ldots, N$, such that at each time $t = 1, \ldots, T$ the aggregate power consumption is given by the sum of the power consumption of all the known appliances plus a noise term. The energy disaggregation problem can be stated as

$$x_t = \sum_{i=1}^{N} y_t^i + \varepsilon_t, \tag{1}$$

where $x_t$ is the aggregated active power measured at time $t$, $y_t^i$ is the individual contribution of appliance $i$, $N$ is the number of appliances, and $\varepsilon_t$ is a noise term. In a denoised scenario, there is no noise term, whereas in a noised scenario, $\varepsilon_t$ accounts for the total contribution of the appliances not included in the model and for the measurement noise. Similarly to the work in [13], we refer to the power drawn over a complete cycle of an appliance as an appliance activation.
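Equation (1) can be sketched numerically as follows; the appliance profiles and noise level here are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 8, 3  # time steps and number of appliances

# Ground-truth appliance loads y_t^i (one row per appliance, values in W).
y = np.array([
    [0, 0, 150, 150, 150, 0, 0, 0],    # short activation (e.g., a microwave)
    [60, 60, 60, 60, 60, 60, 60, 60],  # constant load (e.g., a fridge)
    [0, 2000, 2000, 0, 0, 0, 0, 0],    # high-power activation (e.g., a kettle)
], dtype=float)

eps = rng.normal(0.0, 5.0, size=T)  # noise term epsilon_t (noised scenario)
x = y.sum(axis=0) + eps             # aggregate signal: the only observed input

# A NILM algorithm receives x alone and must recover each row of y.
```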

#### **3. Encoder–Decoder with Attention Mechanism**

In this section, we describe the adopted attention mechanism and DNN architecture for solving the NILM problem.

#### *3.1. Attention Mechanism*

In the classical setting, a sequence-to-sequence network is a model consisting of two components, called the encoder and the decoder [22]. The encoder is an RNN that takes an input sequence of vectors $(x_1, \ldots, x_T)$, where $T$ is the length of the input sequence, and encodes the information into a sequence of fixed-length hidden state vectors $(h_1, \ldots, h_T)$. This representation is expected to be a good summary of the entire input sequence. The decoder is also an RNN; it is initialized with a single context vector $c = h_T$ as its input and generates an output sequence $(y_1, \ldots, y_N)$ vector by vector, where $N$ is the length of the output sequence. At each time step, $h_t$ and $\sigma_t$ denote the hidden states of the encoder and decoder, respectively. There are two well-known challenges with this traditional encoder–decoder framework. First, a critical disadvantage of the single context vector design is the inability of the system to remember long sequences: all the intermediate states of the encoder are discarded and only the final hidden state vector is used to initialize the decoder. This technique works only for short sequences; as the length of the sequence increases, the vector becomes a bottleneck and may lead to loss of information [23]. Second, it is unable to capture the alignment between input and output sequences, which is an essential aspect of structured output tasks such as machine translation or text summarization [24]. The attention mechanism, first introduced for machine translation by Bahdanau et al. [21], was born to address these problems. The novelty of their approach is the introduction of an alignment function that finds, for each output word, the significant input words. In this way, the neural network learns to align and translate at the same time. The central idea behind attention is not to discard the intermediate encoder states but to combine and utilize all of them in order to construct the context vectors required by the decoder to generate the output sequence. This mechanism induces attention weights over the input sequence that prioritize the set of positions where relevant information is present. Following the definition from Bahdanau et al., attention-based models compute a context vector $c_t$ for each time step as the weighted sum of all hidden states of the encoder. The corresponding attention weights are calculated as

$$e_{tj} = f(\sigma_{t-1}, h_j), \quad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}, \quad c_t = \sum_{j=1}^{T} \alpha_{tj} h_j, \tag{2}$$

where $f$ is a learned function that computes a scalar importance value $e_{tj}$ for $h_j$ given the value of $h_j$ and the previous decoder state $\sigma_{t-1}$, and each attention weight $\alpha_{tj}$ is the normalized importance score for $h_j$. As shown in Figure 1, the context vectors $c_t$ are then used to compute the decoder hidden state sequence, where $\sigma_t$ depends on $\sigma_{t-1}$, $c_t$, and $y_{t-1}$. The attention weights can be learned by incorporating an additional feed-forward neural network that is jointly trained with the encoder–decoder components of the architecture.

**Figure 1.** Original graphical representation of the attention model by Bahdanau et al. in [21].
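To make Equation (2) concrete, the following sketch computes the attention weights and context vector for a single decoder step, implementing $f$ as a small additive feed-forward scorer. The weights `W_h`, `W_s`, and `v` are random placeholders here; in practice they are learned jointly with the encoder and decoder:

```python
import numpy as np

rng = np.random.default_rng(1)
T, H = 5, 4                      # input length and hidden size
h = rng.normal(size=(T, H))      # encoder hidden states h_1, ..., h_T
s_prev = rng.normal(size=H)      # previous decoder state sigma_{t-1}

# Placeholder parameters of the scoring network f (learned in practice).
W_h = rng.normal(size=(H, H))
W_s = rng.normal(size=(H, H))
v = rng.normal(size=H)

e = np.tanh(h @ W_h + s_prev @ W_s) @ v  # e_{tj} = f(sigma_{t-1}, h_j)
alpha = np.exp(e) / np.exp(e).sum()      # softmax-normalized attention weights
c_t = alpha @ h                          # context vector for this decoder step
```

Note that the scores depend on the decoder state, so this computation is repeated at every output step.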

The intuition is that an attention-based model could help in the energy disaggregation task by assigning importance to each position of the aggregated signal that corresponds to the position of an activation or, more generally, to a state change of the target appliance. This allows the neural network to focus its representational power on selected time steps of the target appliance in the aggregated signal, rather than on the activations of nontarget appliances, hopefully yielding more accurate predictions. In fact, some events (e.g., turning an appliance on or off) or signal sections (e.g., high power consumption) are more important than other parts of the input signal. For this reason, being able to correctly detect the corresponding time steps can play a key role in the disaggregation task. In neural machine translation, source and target sentences are typically not aligned, because word order differs between the source and the target language. For the NILM problem, instead, the aggregated power consumption is perfectly aligned with the load of the corresponding appliance, and the alignment is known ahead of time. For this reason, to amplify the contribution of an appliance activation in the aggregated signal, we use a simplified attention model inspired by Raffel and Ellis [25], which combines all the hidden states of the encoder using their relative importance. The attention mechanism can be formulated as

$$e_t = a(h_t), \quad \alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}, \quad c = \sum_{t=1}^{T} \alpha_t h_t, \tag{3}$$

where $a$ is a learnable function that depends only on the hidden state vector $h_t$ of the encoder. The function $a$ can be implemented with a feed-forward network that learns an attention weight $\alpha_t$ determining the normalized importance score for $h_t$. This allows the network to recognize the time steps that are most important for the desired output as the ones having the highest attention values.
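A sketch of Equation (3) along the same lines: the score $e_t$ now depends only on $h_t$, and a single context vector summarizes the whole input sequence. The one-layer scorer $a$ below uses random placeholder (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(2)
T, H = 6, 4
h = rng.normal(size=(T, H))   # encoder hidden states h_1, ..., h_T

# Placeholder parameters of the feed-forward scorer a (learned in practice).
W = rng.normal(size=(H, H))
v = rng.normal(size=H)

e = np.tanh(h @ W) @ v               # e_t = a(h_t), one scalar per time step
alpha = np.exp(e) / np.exp(e).sum()  # normalized importance scores alpha_t
c = alpha @ h                        # single fixed-length context vector
```

Because the scores ignore the decoder state, this variant is computed once per sequence rather than once per output step, which is what makes it lightweight.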
