## *3.2. Model Design*

From a practical point of view, DNNs use partial sequences obtained with a sliding-window technique. The duration of an appliance activation is used to determine the size of the window that selects the input and output sequences for NILM modeling. To be precise, let $x_{t,L} = (x_t, \ldots, x_{t+L-1})$ and $y_{it,L} = (y_{it}, \ldots, y_{i,t+L-1})$ be, respectively, the partial aggregate and appliance sequences of length $L$ starting at time $t$. In addition, we build the auxiliary state sequence $(s_{i1}, \ldots, s_{iT})$, where $s_{it} \in \{0, 1\}$ represents the on/off state of appliance $i$ at time $t$. The state of an appliance is considered "on" when its consumption is greater than some threshold and "off" when its consumption is less than or equal to that threshold. We use the notation $s_{it,L} = (s_{it}, \ldots, s_{i,t+L-1})$ for the partial state sequences of length $L$ starting at time $t$. Our idea is to exploit the structure of the SGN architecture proposed in [17] as the building block of the model. This general framework uses an auxiliary sequence-to-sequence classification subnetwork that is jointly trained with a standard sequence-to-sequence regression subnetwork. The difference here is that we generate a more accurate estimate of the power consumption by performing the regression subtask with a scalable RNN-based encoder–decoder equipped with an attention mechanism. The intuition behind the proposed model is that the tailored attention mechanism allows the regression subnetwork to implicitly detect and assign more importance to certain events (e.g., the appliance turning on or off) and to specific signal sections (e.g., high power consumption), whereas the classification subnetwork helps the disaggregation process by explicitly enforcing the on/off states.
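To make the windowing concrete, the following is a minimal sketch (not taken from the paper) of how the partial sequences $x_{t,L}$, $y_{it,L}$, and $s_{it,L}$ can be built with NumPy; the window length `L`, the stride, and the on/off threshold are illustrative placeholders.

```python
# Minimal sketch: sliding-window construction of partial aggregate, appliance,
# and on/off state sequences. Threshold and stride values are assumptions.
import numpy as np

def make_windows(aggregate, appliance, L, stride=1, threshold=15.0):
    """Return aligned aggregate windows x_{t,L}, appliance windows y_{t,L},
    and state windows s_{t,L} (1 if consumption > threshold, else 0)."""
    xs, ys, ss = [], [], []
    for t in range(0, len(aggregate) - L + 1, stride):
        x = aggregate[t:t + L]
        y = appliance[t:t + L]
        s = (y > threshold).astype(np.float32)   # "on" when above the threshold
        xs.append(x)
        ys.append(y)
        ss.append(s)
    return np.stack(xs), np.stack(ys), np.stack(ss)
```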

Unlike the DNN in [20], the scalability of the overall architecture is ensured by the regression subnetwork, where no RNN is needed in the decoder. In fact, the adopted attention mechanism allows one to decouple the input representation from the output, and the structure of the encoder from that of the decoder. We exploit these benefits and design a hybrid encoder–decoder based on a combination of convolutional and recurrent layers for the encoder and fully connected layers for the decoder.

## *3.3. Network Topology*

Let $f^i_{\mathrm{reg}} : \mathbb{R}^L_+ \to \mathbb{R}^L_+$ be the appliance power estimation model; then the regression subnetwork learns the mapping $\hat{p}_{it,L} = f^i_{\mathrm{reg}}(x_{t,L})$. The topology of the regression subnetwork is as follows.

**Encoder**: The encoder network is composed of a CNN with 4 one-dimensional convolutional layers (Conv1D) with ReLU activation functions that processes the input aggregate signal and extracts the appliance-specific signature as a set of feature maps. An RNN then takes the set of feature maps as input and produces the sequence of hidden states summarizing all the information of the aggregate signal. We use a Bidirectional LSTM (BiLSTM) in order to get hidden states $h_t$ that summarize the information from both directions. A bidirectional LSTM is made up of a forward LSTM $\overrightarrow{g}$ that reads the sequence from left to right and a backward LSTM $\overleftarrow{g}$ that reads it from right to left. The final sequence of hidden states of the encoder is obtained by concatenating the hidden state vectors from both directions, i.e., $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]^T$.
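As an illustration only, a possible PyTorch sketch of such an encoder is shown below: four Conv1D + ReLU layers followed by a BiLSTM. The filter counts, kernel sizes, and hidden size are assumptions and do not reproduce the paper's exact configuration.

```python
# Sketch of the encoder: 4 Conv1D+ReLU layers followed by a bidirectional LSTM.
# Filter counts, kernel sizes, and hidden size are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Bidirectional LSTM: forward and backward hidden states are concatenated.
        self.rnn = nn.LSTM(128, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, L) aggregate window
        f = self.cnn(x.unsqueeze(1))         # (batch, channels, L) feature maps
        h, _ = self.rnn(f.transpose(1, 2))   # (batch, L, 2*hidden_size)
        return h                             # h_t = [forward; backward]
```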

**Attention**: The attention unit between the encoder and the decoder consists of a single-layer feed-forward neural network that computes the attention weights and returns the context vector as a weighted average of the encoder outputs over time. Not all the feature maps produced by the CNN contribute equally to the identification of the activation of the target appliance. Thus, the attention mechanism captures salient activations of the appliance, giving more weight to the feature maps that are most valuable for the disaggregation. The implemented attention unit is shown in Figure 2, and it is mathematically defined as

$$e_t = V_a^T \tanh(W_a h_t + b_a), \tag{4}$$

$$a_t = \mathrm{softmax}(e_t), \tag{5}$$

$$c = \sum_{t=1}^{T} a_t h_t, \tag{6}$$

where $V_a$, $W_a$, and $b_a$ are the attention parameters jointly learned with the other components of the architecture. The output of the attention unit is the context vector $c$, which is used as the input vector for the following decoder.
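The following PyTorch sketch mirrors Equations (4)–(6): a single feed-forward layer scores each encoder hidden state $h_t$, a softmax over time yields the weights $a_t$, and the context vector $c$ is the weighted sum of the hidden states. The attention dimension is an arbitrary choice for illustration.

```python
# Sketch of the attention unit in Equations (4)-(6).
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, enc_dim, attn_dim=64):
        super().__init__()
        self.W_a = nn.Linear(enc_dim, attn_dim)        # W_a h_t + b_a
        self.V_a = nn.Linear(attn_dim, 1, bias=False)  # V_a^T tanh(.)

    def forward(self, h):                          # h: (batch, T, enc_dim)
        e = self.V_a(torch.tanh(self.W_a(h)))      # (batch, T, 1), Eq. (4)
        a = torch.softmax(e, dim=1)                # attention weights, Eq. (5)
        c = (a * h).sum(dim=1)                     # context vector, Eq. (6)
        return c, a.squeeze(-1)
```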

**Figure 2.** Graphical illustration of the implemented attention unit.

**Decoder**: The decoder network is composed of 2 fully connected layers (Dense). The second layer has the same number of units as the sequence length $L$.
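A corresponding decoder sketch, under the same illustrative assumptions as above (the hidden width and intermediate activation are not specified in the text), is:

```python
# Sketch of the decoder: two fully connected layers mapping the context vector
# to a length-L power sequence. Hidden width and ReLU activation are assumptions.
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, enc_dim, L, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, L),        # one output unit per time step in the window
        )

    def forward(self, c):                # c: (batch, enc_dim) context vector
        return self.net(c)               # (batch, L) estimated appliance power
```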

The exact configuration of the regression subnetwork is as follows:


Let $f^i_{\mathrm{cls}} : \mathbb{R}^L_+ \to [0, 1]^L$ be the appliance state estimation model; then the classification subnetwork learns the mapping $\hat{s}_{it,L} = f^i_{\mathrm{cls}}(x_{t,L})$. We use the sequence-to-sequence CNN proposed in [15], consisting of 6 convolutional layers followed by 2 fully connected layers. The exact configuration of the classification subnetwork is the following:
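A hedged sketch of such a sequence-to-sequence CNN classifier is given below; the filter counts and kernel sizes are placeholders and do not reproduce the configuration of [15].

```python
# Sketch of the classification subnetwork: 6 convolutional layers and 2 fully
# connected layers with a sigmoid output, so each of the L outputs is an on/off
# probability. All layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, L, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 30, 10, padding="same"), nn.ReLU(),
            nn.Conv1d(30, 30, 8, padding="same"), nn.ReLU(),
            nn.Conv1d(30, 40, 6, padding="same"), nn.ReLU(),
            nn.Conv1d(40, 50, 5, padding="same"), nn.ReLU(),
            nn.Conv1d(50, 50, 5, padding="same"), nn.ReLU(),
            nn.Conv1d(50, 50, 5, padding="same"), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * L, hidden), nn.ReLU(),
            nn.Linear(hidden, L), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (batch, L) aggregate window
        return self.fc(self.cnn(x.unsqueeze(1)))   # (batch, L) state probabilities
```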


The final estimate of the power consumption is obtained by multiplying the regression output with the classification probability output:

$$\hat{y}_{it,L} = f^i_{\mathrm{out}}(x_{t,L}) = \hat{p}_{it,L} \odot \hat{s}_{it,L}, \tag{7}$$

where $\odot$ is the component-wise multiplication. The overall architecture is shown in Figure 3, and we call it LDwA, that is, Load Disaggregation with Attention.
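Using the illustrative modules sketched above (not the authors' released code), the combination in Equation (7) can be expressed as:

```python
# Sketch of the final combination in Equation (7): element-wise product of the
# regression output and the classification probabilities. Module names refer to
# the earlier sketches; dimensions follow the assumptions made there.
import torch.nn as nn

class LDwA(nn.Module):
    def __init__(self, L):
        super().__init__()
        self.encoder = Encoder()
        self.attention = Attention(enc_dim=128)   # 2 * hidden_size of the BiLSTM
        self.decoder = Decoder(enc_dim=128, L=L)
        self.classifier = Classifier(L=L)

    def forward(self, x):                 # x: (batch, L) aggregate window
        h = self.encoder(x)
        c, _ = self.attention(h)
        p_hat = self.decoder(c)            # regression branch
        s_hat = self.classifier(x)         # classification branch
        return p_hat * s_hat               # Eq. (7): element-wise product
```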

**Figure 3.** Proposed Load Disaggregation with Attention (LDwA) architecture used in our experiments.
