5.1.1. Dataset

All training, validation and testing data are real data extracted from SGCC. The dataset contains 5000 three-phase four-wire industrial customers, of whom 461, both normal and abnormal, have been manually inspected in the field. For each inspected customer, all samples are labeled manually on the basis of the inspection report, and the labeling rule is that all samples of the same customer share the same label. The remaining, unlabeled customers were selected at random. All SM data are converted into samples following the method described in Section 3.4. Detailed information about the dataset is provided in Table 2. Although the unlabeled customers could contribute more samples, we randomly select only 500,000 of them. The meters report the electrical magnitudes listed in Table 1 every 15 min or every 1 h. This paper unifies the reporting frequency of all customers to 24 measurements/day.
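The unification to 24 measurements/day can be illustrated with a minimal sketch. The text does not state how 15-min readings are reduced to hourly values, so averaging each hour's readings is an assumption made here for illustration; the function name `unify_to_hourly` and the `agg` parameter are hypothetical. Averaging suits instantaneous magnitudes such as voltage or current, whereas an accumulated quantity would more plausibly call for a different aggregator.

```python
from statistics import fmean


def unify_to_hourly(day_readings, agg=fmean):
    """Reduce one day of evenly spaced readings to 24 hourly values.

    ASSUMPTION: each hour is summarised by `agg` (the mean by default).
    The paper does not specify the aggregation it uses.
    """
    n = len(day_readings)
    if n == 0 or n % 24 != 0:
        raise ValueError("expected a non-empty multiple of 24 readings per day")
    step = n // 24  # 4 for 15-min data, 1 for hourly data
    return [agg(day_readings[i:i + step]) for i in range(0, n, step)]


# 96 readings (15-min interval) and 24 readings (1-h interval)
# both end up as 24 values per day.
quarter_hourly = [float(i) for i in range(96)]
hourly = [float(i) for i in range(24)]
unified_a = unify_to_hourly(quarter_hourly)
unified_b = unify_to_hourly(hourly)
```

Passing a different `agg`, for example `lambda xs: xs[-1]` to keep the last reading of each hour, changes the reduction without touching the rest of the pipeline.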


**Table 2.** Brief information of dataset.

For each round of experiments, all samples are randomly split by customer into training, validation and testing sets in approximate proportions of 10%, 10% and 80%. It is worth noting that the three subsets must follow these proportions in order to cover all types of NTL. Further, to verify the generalization performance of the algorithms, the samples of the same customer are never divided across subsets.
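A customer-wise split of this kind can be sketched as follows. This is an illustrative sketch, not the authors' procedure: the function name `split_customers`, the label strings and the fixed seed are hypothetical, and stratifying by label is an assumption introduced here as one plausible way to make each subset contain every label.

```python
import random
from collections import defaultdict


def split_customers(customer_labels, fractions=(0.1, 0.1, 0.8), seed=0):
    """Assign whole customers to train/val/test subsets.

    customer_labels: dict mapping customer id -> label.
    ASSUMPTION: customers are split separately within each label so that
    every subset sees every label; the paper does not describe this step.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for cid, label in sorted(customer_labels.items()):
        by_label[label].append(cid)

    subsets = {"train": set(), "val": set(), "test": set()}
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        subsets["train"].update(ids[:n_train])
        subsets["val"].update(ids[n_train:n_train + n_val])
        subsets["test"].update(ids[n_train + n_val:])  # remainder
    return subsets


# Toy data: 60 "normal" and 40 "ntl" customers (made-up labels).
labels = {f"c{i:03d}": ("normal" if i < 60 else "ntl") for i in range(100)}
subsets = split_customers(labels)

# Every sample inherits the subset of its customer, so no customer's
# samples can leak from the training set into the testing set.
sample_customers = ["c000", "c000", "c075", "c099"]
sample_subset = [
    next(name for name, ids in subsets.items() if cid in ids)
    for cid in sample_customers
]
```

Because the assignment is made on customer ids rather than on individual samples, the test set measures performance on customers the model has never seen.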
