*4.3. Experiment Description*

To benchmark our approach, we adopt multi-label stratified 10-fold cross-validation with random shuffling [46]. This evaluation approach produces stratified randomized folds for multi-label data while preserving the percentage of samples for each label in every fold. We compare the performance of the proposed CNN model against the commonly used multi-label k-nearest-neighbor (MLkNN) [47] and binary-relevance k-nearest-neighbor (BRkNN) [48] models.
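The stratification step can be sketched with a simplified greedy variant of iterative stratification. The function below and its name are illustrative only, not the implementation used in our experiments: it repeatedly picks the label with the fewest remaining positive examples and assigns those examples to the fold that still wants them most.

```python
import numpy as np

def multilabel_stratified_folds(Y, n_splits=10, seed=42):
    """Greedy multi-label stratification (simplified sketch).

    Y: (n_samples, n_labels) binary indicator matrix.
    Returns a list of n_splits index arrays that partition the samples.
    """
    rng = np.random.default_rng(seed)
    n, n_labels = Y.shape
    remaining = set(rng.permutation(n).tolist())   # random shuffle first
    # desired capacity per fold, overall and per label
    fold_cap = np.full(n_splits, n / n_splits)
    label_cap = np.tile(Y.sum(axis=0) / n_splits, (n_splits, 1)).astype(float)
    folds = [[] for _ in range(n_splits)]

    while remaining:
        # label with the fewest positive examples left to distribute
        counts = Y[list(remaining)].sum(axis=0).astype(float)
        counts[counts == 0] = np.inf
        if np.isinf(counts).all():
            lab, candidates = None, list(remaining)  # label-free samples
        else:
            lab = int(np.argmin(counts))
            candidates = [i for i in remaining if Y[i, lab]]
        for i in candidates:
            # fold that most "wants" this label (or any sample at all)
            f = int(np.argmax(fold_cap)) if lab is None \
                else int(np.argmax(label_cap[:, lab]))
            folds[f].append(i)
            fold_cap[f] -= 1
            label_cap[f] -= Y[i]
            remaining.discard(i)
    return [np.array(sorted(f)) for f in folds]
```

In practice a maintained implementation (e.g. the iterative-stratification package's `MultilabelStratifiedKFold`) would be preferable to hand-rolled code.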

To evaluate the proposed activation current feature, we first establish a baseline in which the V-I binary image is used as the appliance feature. The V-I binary image of size *w* × *w* is obtained by meshing the *V* − *I* trajectory into a grid and assigning each cell a binary value that denotes whether the trajectory traverses it, as described in [14]. This experimental setup helps us answer an essential question: whether the proposed approach is sufficient for recognizing multiple appliances from aggregated measurements. We analyze this by altering the type of input feature and comparing the resulting performance. To gain more insight into the proposed approach, we further examine per-appliance performance and misclassification errors.
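As a rough sketch of how such a binary image can be rasterised (the helper name, grid width, and min-max normalisation below are our assumptions, not the exact procedure of [14]):

```python
import numpy as np

def vi_binary_image(v, i, w=16):
    """Rasterise one steady-state cycle of the V-I trajectory
    onto a w x w binary grid.

    v, i: 1-D arrays holding one cycle of voltage and current.
    A cell is set to 1 if the normalised trajectory traverses it.
    """
    # normalise both signals to [0, 1] so they map onto grid coordinates
    v_n = (v - v.min()) / (v.max() - v.min() + 1e-12)
    i_n = (i - i.min()) / (i.max() - i.min() + 1e-12)
    img = np.zeros((w, w), dtype=np.uint8)
    rows = np.minimum((i_n * w).astype(int), w - 1)  # current -> rows
    cols = np.minimum((v_n * w).astype(int), w - 1)  # voltage -> columns
    img[rows, cols] = 1
    return img
```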

To analyze the computational complexity of the proposed approach, we also assess training and inference times as a function of the number of data samples. This was achieved by training the MLkNN baseline and the CNN-based multi-label classifier while varying the training and testing set sizes. In each run, the model is trained on a fraction *p* of the data for 100 iterations and tested on the remaining (1 − *p*) fraction, where *p* ∈ [0.1, 0.9].
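A minimal timing harness of this kind might look as follows. The function and variable names are illustrative, and the toy 1-nearest-neighbour model is a stand-in used only to exercise the harness, not MLkNN or the CNN:

```python
import time
import numpy as np

def time_train_test(model_fit, model_predict, X, Y,
                    fractions=(0.1, 0.3, 0.5, 0.7, 0.9), seed=0):
    """Measure training and inference time versus the train fraction p.

    model_fit(X_train, Y_train) -> state
    model_predict(state, X_test) -> Y_hat
    Returns a list of (p, train_seconds, test_seconds) tuples.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    results = []
    for p in fractions:
        cut = int(p * len(X))
        tr, te = idx[:cut], idx[cut:]
        t0 = time.perf_counter()
        state = model_fit(X[tr], Y[tr])
        t1 = time.perf_counter()
        model_predict(state, X[te])
        t2 = time.perf_counter()
        results.append((p, t1 - t0, t2 - t1))
    return results

# toy 1-NN multi-label "model", only to demonstrate the harness
def fit(Xtr, Ytr):
    return (Xtr, Ytr)

def predict(state, Xte):
    Xtr, Ytr = state
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return Ytr[d.argmin(axis=1)]
```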

Finally, we compare the appliance classification results with related state-of-the-art methods. However, we should emphasize that, owing to the difficulty of making fair comparisons under different experimental settings (e.g., sampling frequency, measurements, learning strategy, dataset, and metrics), these comparisons are merely illustrative of the potential of the proposed method.

#### **5. Results and Discussion**

#### *5.1. Comparison with Baseline*

The results of the comparisons between the baselines and the proposed CNN multi-label learning for the V-I binary image and activation current feature are depicted in Figure 8a.

**Figure 8.** ma*F*1 score performance comparison between the proposed CNN model and the two baselines for different input features: (**a**) Comparison between voltage-current (V-I) binary image and current activation features; (**b**) Comparison between the activation-current-based features.

From Figure 8a, we see that the proposed CNN multi-label learning outperforms the baselines for both feature types. We also observe that, compared to the current activation feature, the V-I binary feature representation yields lower ma*F*1 scores in both the proposed CNN model and the two baseline algorithms. We see a slight increase in ma*F*1 score (from 0.826 ± 0.024 to 0.849 ± 0.024 for the CNN model and from 0.779 ± 0.028 to 0.827 ± 0.021 for the MLkNN model) when the activation current is used as the input feature. This result suggests that features derived from the activation current could be useful for recognizing appliances from aggregated measurements.

We, therefore, analyzed three additional features derived from the activation current, namely decomposed current, current distance similarity matrix, and decomposed distance similarities. The results are presented in Figure 8b.

As can be observed, the three current-based features significantly improve the classification performance of the CNN model, while achieving nearly the same performance with the two baselines. For the CNN model, the decomposed current feature attains an average 9.4% increase in ma*F*1 score (from 0.849 ± 0.024 to 0.931 ± 0.015) over the activation current feature. This result is in line with the one obtained in [20], which suggested that decomposing the activation current into its active components enhances the uniqueness of the V-I trajectory. We also see an increase of about 10 percentage points in ma*F*1 (from 0.849 ± 0.024 to 0.94 ± 0.015) for the decomposed distance similarities. The decomposed current and current distance matrix features achieve comparable performance. This result indicates that the decomposed current features could help increase the performance of appliance recognition in NILM.
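The Fryze decomposition referred to above splits the current into an active component, which is in phase with the voltage and carries all of the real power, and a non-active residual. A minimal sketch (the function name is ours; this is the textbook formulation rather than our exact preprocessing code):

```python
import numpy as np

def fryze_decompose(v, i):
    """Split a current waveform into active and non-active parts (Fryze).

    v, i: 1-D arrays covering an integer number of mains cycles.
    The active current i_a(t) = (P / V_rms^2) * v(t) carries the real
    power P; the non-active current is the orthogonal residual.
    """
    P = np.mean(v * i)          # real power over the window
    v_rms_sq = np.mean(v ** 2)  # squared RMS voltage
    i_active = (P / v_rms_sq) * v
    i_nonactive = i - i_active
    return i_active, i_nonactive
```

By construction the non-active residual is orthogonal to the voltage, i.e. it contributes no real power.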

Figure 9 presents the multiple appliances predicted by the CNN-based classifier with different feature representations. We see that, compared to the activation current in Figure 9a and the V-I image in Figure 9c, the proposed Fryze current-decomposition features in Figure 9b,d are capable of detecting all of the concurrently running appliances. This shows that the Fryze current-decomposition-based feature alone is sufficient for identifying multiple running appliances.

**Figure 9.** Prediction comparison for different feature representations with the proposed CNN multilabel classifier. (**a**) Activation current. (**b**) Decomposed current. (**c**) V-I image. (**d**) Distance matrix.

To gain insight into the performance on individual appliances, we further analyze the per-appliance eb*F*1 score for the MLkNN baseline and the proposed CNN multi-label classifier, as depicted in Figure 10. The CNN model with the decomposed current distance matrix feature obtains an eb*F*1 score above 90% for every appliance except AC, ILB, and LaptopCharger. We also see that the MLkNN baseline with the same decomposed current distance matrix feature obtains an eb*F*1 score above 90% for only four appliances, namely FridgeDefroster, CoffeeMaker, Vacuum, and CFL. In both cases, we observe low scores for the V-I binary feature, except for FridgeDefroster, CoffeeMaker, and Vacuum, which score above 90% eb*F*1.

**Figure 10.** Per-appliance eb*F*1 score on PLAID dataset. AC = air conditioning, CFL = compact fluorescent lamp, ILB = incandescent light bulb. (**a**) Multi-label k-nearest-neighbor (MLkNN) (**b**) CNN.
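For reference, the two scores used in this section, ma*F*1 (macro-averaged F1, one F1 per label averaged over labels) and eb*F*1 (example-based F1, one F1 per sample averaged over samples), can be computed from binary label matrices as sketched below; the function names are ours:

```python
import numpy as np

def macro_f1(Y_true, Y_pred):
    """Macro-averaged F1 (maF1): F1 per label, then unweighted mean."""
    T, P = Y_true.astype(bool), Y_pred.astype(bool)
    tp = (T & P).sum(axis=0)   # true positives per label
    fp = (~T & P).sum(axis=0)  # false positives per label
    fn = (T & ~P).sum(axis=0)  # false negatives per label
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
    return f1.mean()

def example_f1(Y_true, Y_pred):
    """Example-based F1 (ebF1): F1 per sample, then mean over samples."""
    T, P = Y_true.astype(bool), Y_pred.astype(bool)
    tp = (T & P).sum(axis=1)                  # correct labels per sample
    denom = T.sum(axis=1) + P.sum(axis=1)     # |true| + |predicted|
    return (2 * tp / (denom + 1e-12)).mean()
```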
