**2. Proposed Algorithm**

The detection of electricity theft is a binary classification problem: normal users must be distinguished from electricity theft users. If user-side electricity data are fed directly to a classifier, the class imbalance may bias the classifier toward PD and cause it to ignore the important information contained in ND, which may substantially degrade its performance.
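The effect of this bias can be made concrete with a small worked example. The sketch below is illustrative only: the class sizes (990 normal users, 10 theft users), the label encoding (0 = normal, 1 = theft), and the helper `evaluate` are assumptions, not values or code from this paper. It shows that a classifier that ignores the minority class entirely can still report a high accuracy.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the `positive` class)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    n_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / n_pos if n_pos else 0.0
    return accuracy, recall

# Hypothetical unbalanced dataset: 990 normal users (0), 10 theft users (1).
y_true = [0] * 990 + [1] * 10
# Degenerate classifier that labels every user as normal.
y_pred = [0] * 1000

accuracy, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, theft recall={recall:.3f}")
# prints: accuracy=0.990, theft recall=0.000
```

Accuracy alone therefore hides the failure on the minority class, which is why the imbalance itself has to be treated before training.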

In Figure 1, the triangles and circles represent two classes of data. The solid box represents the actual decision boundary between the two classes, while the dotted box represents the decision boundary that the classification algorithm may learn. In Figure 1a, there are fewer triangles than circles, so the data form an unbalanced dataset. Comparing Figure 1a with Figure 1b, which shows the normal dataset, it can be seen that the decision boundary learned by the classification algorithm may differ considerably from the real decision boundary if the dataset is unbalanced.

**Figure 1.** The schematic diagram of the impact of unbalanced data on the classification algorithm. (**a**) Unbalanced data, (**b**) normal data.

In an actual power consumption environment, users who steal electricity are far fewer than normal users, so the users' electricity dataset is unbalanced. Unbalanced user data bias the classification algorithm toward normal user samples, so that it ignores the important information contained in the small number of electricity theft user samples. As a result, the decision boundary of the classifier deviates from the actual decision boundary, and the performance of the classifier degrades seriously. Therefore, it is necessary to use an appropriate method to balance the dataset. The traditional SMOTE method tends to cause data marginalization problems: if PD samples lie between some ND samples, the artificial data generated around these ND samples will blur the boundary between PD and ND.
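The boundary-blurring effect follows from how SMOTE generates data: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors. The following is a minimal sketch of that interpolation step, not the implementation used in this paper; the function name `smote_sample`, the toy coordinates, and the parameter values are hypothetical.

```python
import math
import random

def smote_sample(minority, k=1, rng=None):
    """Generate one synthetic point by SMOTE-style interpolation.

    A base minority sample is chosen at random, one of its k nearest
    minority neighbors is selected, and the new point is placed at
    base + lam * (neighbor - base) with lam drawn uniformly from [0, 1).
    """
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    lam = rng.random()
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbor))

# Hypothetical toy data: two minority samples with a majority sample between them.
minority = [(0.0, 0.0), (4.0, 0.0)]
majority = [(2.0, 0.0)]

synthetic = smote_sample(minority, k=1, rng=random.Random(42))
print(synthetic)  # lies on the segment from (0, 0) to (4, 0)
```

Because the segment between the two minority samples passes through the majority sample at (2, 0), synthetic points can land in the majority region, which is the blurring described above.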

In electricity theft detection, the low detection accuracy caused by the imbalance of the user-side power consumption dataset needs to be addressed. This paper addresses the problem with K-SMOTE, an unbalanced data processing method based on K-means clustering and SMOTE.
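This section does not specify how K-means clustering and SMOTE are combined, so the following is only a sketch of one plausible combination, stated here as an assumption: the minority samples are first clustered, and interpolation is then restricted to pairs of samples in the same cluster, so that no synthetic point bridges two distant groups. The function names, the farthest-point initialization, and the toy data are all hypothetical and should not be read as the K-SMOTE procedure itself.

```python
import math
import random

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

def oversample_within_clusters(minority, k, n_new, seed=0):
    """Assumed combination: interpolate only between samples of one cluster."""
    rng = random.Random(seed)
    _, clusters = kmeans(minority, k)
    usable = [cl for cl in clusters if len(cl) >= 2]
    if not usable:
        return []
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rng.choice(usable), 2)
        lam = rng.random()
        synthetic.append(tuple(x + lam * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical toy data: two minority groups separated by a majority region near x = 2.
minority = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (4.0, 0.0), (4.2, 0.0), (4.0, 0.2)]
new_points = oversample_within_clusters(minority, k=2, n_new=10)
print(len(new_points), all(not (1.0 < x < 3.0) for x, _ in new_points))
```

In this toy setting every synthetic point stays inside one of the two minority groups, in contrast to plain interpolation across groups, which could place points in the region between them.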
