*3.2. Feature Pre-Processing*

As discussed in the related work section, the V-I binary trajectory mapping has been one of the favored features for appliance classification in *single-label learning*. However, in this work, we consider features derived from the source current *i*(*t*) to recognize multiple running appliances from aggregate measurements. Through experimentation, it was found that the aggregated activation voltage *v*(*t*) has an almost identical pattern for most of the events, as illustrated in Figure 3. This suggests that it is the activation current *i*(*t*) that reflects the electrical properties of an appliance.

**Figure 3.** Activation voltage *v*(*t*) for different appliances in the PLAID dataset. The voltage has an almost identical pattern for all the appliances.

Therefore, we propose decomposed-current features obtained by applying the Fryze power theory [41].

The Fryze power theory decomposes the activation current into orthogonal components related to electrical energy in the time domain [41]. According to this theory, the activation current *i*(*t*) can be decomposed into an active component *i<sub>a</sub>*(*t*) and a non-active component *i<sub>f</sub>*(*t*), such that:

$$i(t) = i\_a(t) + i\_f(t) \tag{2}$$

The active current *i<sub>a</sub>*(*t*) is the current of an equivalent resistive load that draws the same active power at the same activation voltage. In Fryze's theory, the active power *p<sub>a</sub>* is calculated as the average value of *i*(*t*) · *v*(*t*) over one fundamental cycle *T<sub>s</sub>*, defined as follows:

$$p\_a = \frac{1}{T\_s} \sum\_{t=1}^{T\_s} i(t)v(t) \tag{3}$$

The active current is therefore defined as

$$i\_a(t) = \frac{p\_a}{v\_{rms}^2} v(t) \tag{4}$$

where *v<sub>rms</sub>* is the RMS voltage, expressed as follows:

$$v\_{rms} = \sqrt{\frac{1}{T\_s} \sum\_{t=1}^{T\_s} v(t)^2} \tag{5}$$

The current *i<sub>a</sub>*(*t*) carries the resistance information and is purely sinusoidal. The non-active component is then equal to

$$i\_f(t) = i(t) - i\_a(t) \tag{6}$$
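For concreteness, a minimal NumPy sketch of this decomposition, following Eqs. (3)–(6), might look as follows (the function name and array layout are our own):

```python
import numpy as np

def fryze_decompose(i, v):
    """Split one fundamental cycle of activation current into its
    active and non-active components (Eqs. (3)-(6))."""
    p_a = np.mean(i * v)              # active power, Eq. (3)
    v_rms = np.sqrt(np.mean(v ** 2))  # RMS voltage, Eq. (5)
    i_a = (p_a / v_rms ** 2) * v      # active current, Eq. (4)
    i_f = i - i_a                     # non-active current, Eq. (6)
    return i_a, i_f
```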

Figure 4 presents the source currents and the corresponding active and non-active components for the twelve appliances in the PLAID dataset. It can be observed from Figure 4 that the active component approaches a pure sine wave even for non-periodic load currents such as those of a Compact Fluorescent Lamp (CFL) and a laptop.

Once the activation current has been decomposed, Piece-wise Aggregate Approximation (PAA) is used to reduce the dimensionality of the decomposed signals **i**<sub>a</sub> and **i**<sub>f</sub> from *T<sub>s</sub>* to a predefined size *w*. PAA is a dimensionality-reduction method for high-dimensional time-series signals [42]. This is a crucial pre-processing step, as it reduces the high dimensionality of the extracted activation-current features with minimal information loss.
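As an illustration, a minimal PAA sketch, assuming *T<sub>s</sub>* is divisible by *w* (general implementations, e.g., in the pyts library, handle arbitrary lengths):

```python
import numpy as np

def paa(x, w):
    """Piece-wise Aggregate Approximation: compress a length-Ts signal
    to w values by averaging over w equally sized segments."""
    return x.reshape(w, -1).mean(axis=1)

# Continuing the sketch above; w = 50 is an example value.
i_a_reduced = paa(i_a, w=50)
```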

To further enhance the uniqueness of the decomposed-current features, a Euclidean distance function *d<sub>u,v</sub>* = ||*i*(*t*)<sub>*u*</sub> − *i*(*t*)<sub>*v*</sub>||<sup>2</sup>, which measures how similar two data points are, is applied to the active and non-active currents. Distance similarity functions are widely used as a pre-processing step in many machine-learning approaches, such as K-means clustering and K-nearest-neighbor algorithms [29,43]. The distance similarity matrix *D<sub>w,w</sub>* for points *i*(*t*)<sub>1</sub>, *i*(*t*)<sub>2</sub>, ... , *i*(*t*)<sub>*w*</sub> is the matrix of squared Euclidean distances representing the spacing of a set of *w* points in Euclidean space [29], such that

$$D\_{w,w} = \begin{bmatrix} 0 & d\_{1,2} & \cdots & d\_{1,w} \\ d\_{2,1} & 0 & \cdots & d\_{2,w} \\ \vdots & \vdots & \ddots & \vdots \\ d\_{w,1} & d\_{w,2} & \cdots & 0 \end{bmatrix} \tag{7}$$
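A sketch of this distance-matrix computation for a one-dimensional, PAA-reduced current (here each "point" is a scalar sample, so the squared Euclidean distance reduces to a squared difference):

```python
import numpy as np

def distance_matrix(x):
    """Squared-Euclidean distance matrix D_{w,w} of Eq. (7)."""
    diff = x[:, None] - x[None, :]  # pairwise differences, shape (w, w)
    return diff ** 2                # symmetric, with a zero diagonal

D_a = distance_matrix(i_a_reduced)  # (w, w) image-like input for the CNN
```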

Figure 5 depicts the activation current, its components, and their corresponding distance similarity matrices when a CFL and a laptop charger are active.

**Figure 4.** Normalized source current *i*(*t*) and the respective active *i<sub>a</sub>*(*t*) and non-active *i<sub>f</sub>*(*t*) components after applying the Fryze power theory. The currents are normalized for visualization purposes.

**Figure 5.** Currents and distance matrices when the Compact Fluorescent Lamp (CFL) and laptop charger are active. (**a**) Source current *i*(*t*). (**b**) Active current *i<sub>a</sub>*(*t*). (**c**) Non-active current *i<sub>f</sub>*(*t*). (**d**) Distance matrix for the source current. (**e**) Distance matrix for the active current. (**f**) Distance matrix for the non-active current.

## *3.3. Multi-Label Modeling*

A common approach to extending neural networks to multi-label classification is to use one neural network to learn the joint probability of multiple labels conditioned on the input feature representation. The final multi-label prediction is obtained by applying a *sigmoid* activation function [23]. This process requires an additional thresholding mechanism to transform the sigmoid probabilities into multi-label outputs. However, building such a threshold function is very challenging, so a default threshold of 0.5 is often employed [44].

To address this challenge, we propose a CNN multi-label classifier that uses softmax to implicitly capture the relations between multiple labels. As shown in Figure 6, the proposed classifier consists of four CNN layers with 16, 32, 64, and 128 feature maps, respectively, each using a stride of 2 × 2. The first two CNN layers use a 5 × 5 filter size, while the last two use a 3 × 3 filter size. Each CNN layer is followed by a batch-normalization layer and a ReLU activation function. The last CNN layer is followed by an adaptive average-pooling layer with an output size of 1 × 1. This CNN encoder takes the current-based features as input and produces a latent feature vector **z**<sub>*i*</sub>.

**Figure 6.** Block diagram of the Convolutional Neural Network (CNN) multi-label classifier. It consists of a CNN encoder to learn feature representation from the input feature, and the output layer to produce the predicted labels.
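A minimal PyTorch sketch of the described encoder (the padding choice and single-channel input are our assumptions):

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k):
    # Conv (stride 2) -> batch norm -> ReLU, as described above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=2, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

encoder = nn.Sequential(
    conv_block(1, 16, 5),     # 5 x 5 filters
    conv_block(16, 32, 5),    # 5 x 5 filters
    conv_block(32, 64, 3),    # 3 x 3 filters
    conv_block(64, 128, 3),   # 3 x 3 filters
    nn.AdaptiveAvgPool2d(1),  # 1 x 1 spatial output
    nn.Flatten(),             # latent vector z_i with 128 entries
)
```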

The output layer consists of three FC layers with hidden sizes of 512, 1024, and 2*M*, respectively, where *M* is the maximum number of appliances considered. This layer receives the latent vector produced by the CNN encoder and produces an output **O**<sub>*s*</sub> of size (2 × *M*). The final predicted multi-label states **ŝ**<sub>*t*</sub> are obtained by applying the softmax activation function, **ŝ**<sub>*t*</sub> = softmax(**O**<sub>*s*</sub>). Thus, the proposed multi-label classifier learns the joint representation of multiple appliance states conditioned on the activation-based input features.
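A corresponding sketch of the output layer; the placement of ReLU and dropout between the FC layers is our assumption:

```python
M = 12  # maximum number of appliances (example value)

head = nn.Sequential(
    nn.Linear(128, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.25),
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Dropout(0.25),
    nn.Linear(1024, 2 * M),  # logits O_s, two states per appliance
)

x = torch.randn(16, 1, 50, 50)         # batch of w x w distance matrices
O_s = head(encoder(x)).view(-1, M, 2)  # reshape to (batch, M, 2)
s_hat = torch.softmax(O_s, dim=-1)     # per-appliance state probabilities
```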

To learn the model parameters, standard backpropagation is used to optimize the cross-entropy between the predicted softmax distribution and the multi-label target of each input feature:

$$\mathcal{L}(\mathbf{\hat{s}}, \mathbf{s}) = -\frac{1}{N} \sum\_{t=1}^{N} \sum\_{i=1}^{M} s\_{ti} \cdot \log \frac{\exp(\hat{s}\_{ti})}{\sum\_{j=1}^{2} \exp(\hat{s}\_{tj})} \tag{8}$$

The joint cross-entropy loss implicitly captures the relations between labels.
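A sketch of Eq. (8) using PyTorch's built-in cross-entropy over the two state logits of each appliance (PyTorch averages over all N · M terms, which differs from Eq. (8) only by a constant factor):

```python
import torch.nn.functional as F

def multilabel_loss(O_s, s):
    """Joint cross-entropy of Eq. (8).
    O_s : (N, M, 2) state logits; s : (N, M) binary target states."""
    return F.cross_entropy(O_s.reshape(-1, 2), s.reshape(-1).long())
```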

The CNN multi-label classifier is trained for 500 iterations using the Adam optimizer with an initial learning rate of 0.001, betas of (0.9, 0.98), and a batch size of 16. The learning rate is reduced by a factor of 0.1 once learning stagnates for 20 consecutive iterations. To avoid over-fitting, early stopping with patience is used: training terminates once the validation performance has not improved for 50 iterations. The dropout rate is set to 0.25.
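A sketch of this training configuration; `train_one_epoch` and `validate` are hypothetical helpers, not part of the described method:

```python
model = nn.Sequential(encoder, head)  # or a module combining both parts
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.98))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=20)

best_loss, stale = float("inf"), 0
for iteration in range(500):
    train_one_epoch(model, optimizer)  # hypothetical helper
    val_loss = validate(model)         # hypothetical helper
    scheduler.step(val_loss)           # reduce LR once learning stagnates
    if val_loss < best_loss:
        best_loss, stale = val_loss, 0
    else:
        stale += 1
        if stale >= 50:                # early stopping with patience
            break
```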

## **4. Evaluation Methodology**
