#### *2.3. K-SMOTE*

This paper combined the K-means algorithm with SMOTE to balance the electricity data on the user side. In brief, the minority-class samples are first grouped into clusters by K-means, and SMOTE is then applied within each cluster to synthesize new minority samples, so that the synthetic data follow the distribution of the original minority class.
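As a rough illustration, the snippet below balances a synthetic imbalanced dataset with the `KMeansSMOTE` sampler from the imbalanced-learn library. This is only a sketch of the K-means + SMOTE idea; the paper's K-SMOTE procedure may differ in its clustering and sampling details, and the dataset and parameters here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

# Hypothetical imbalanced user-side dataset: 90% normal users, 10% theft.
X, y = make_classification(n_samples=2000, n_features=6, weights=[0.9, 0.1],
                           random_state=0)

# Cluster the data with K-means, then apply SMOTE inside sufficiently
# minority-rich clusters; the threshold below is an illustrative assumption.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # classes are now balanced
```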


#### **3. Random Forest Classification Based on K-SMOTE**

RF, a statistical learning algorithm proposed by Breiman in 2001 [47], is essentially an ensemble classifier that combines multiple decision trees [48]. It mainly uses the bagging method to generate bootstrap training datasets and the classification and regression tree (CART) algorithm to grow unpruned decision trees. As a machine learning classification and prediction algorithm, random forest offers several advantages, such as robustness to overfitting and noise, the ability to handle high-dimensional data, and naturally parallel training.
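For reference, scikit-learn's `RandomForestClassifier` follows the same recipe described above, bagging plus CART-style trees grown without pruning; the minimal sketch below, with illustrative data and parameters, shows the corresponding settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,  # number of CART-style trees in the forest
    bootstrap=True,    # each tree is trained on a bootstrap sample (bagging)
    max_depth=None,    # trees are grown fully, without pruning
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```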


The electricity data on the grid user side include various types of information, such as voltage, current, power consumption, and user classification. Electricity theft therefore needs to be detected quickly and accurately, so that the power department or the relevant stakeholders can be promptly notified to take proper action.

On the other hand, the RF classifier handles imbalanced datasets poorly, so in this paper it was combined with K-SMOTE to detect electricity theft.
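A hedged sketch of the combined scheme follows: an imbalanced-learn `Pipeline` first oversamples with `KMeansSMOTE` and then trains a random forest. The sampler is applied only during fitting, never at prediction time. The data and parameters are illustrative assumptions, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=2000, n_features=6, weights=[0.9, 0.1],
                           random_state=0)
model = Pipeline([
    ("balance", KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)),
    ("forest", RandomForestClassifier(random_state=0)),
]).fit(X, y)
print(model.score(X, y))  # the sampler is bypassed when predicting/scoring
```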

#### *3.1. Decision Tree*

A random forest is composed of several decision trees, each of which is a single classifier. A decision tree can be regarded as a tree model with three kinds of nodes: root, intermediate, and leaf nodes. Each non-leaf node represents an attribute of the object, each branch leaving a node represents a possible value of that attribute, and each leaf node corresponds to the class of the objects described by the path from the root to that leaf. A path from the root to a leaf node therefore represents a rule, and the whole tree represents a set of rules determined by the training dataset. A decision tree has only a single output: starting from the root node, exactly one leaf node can be reached, so each input matches exactly one rule. In essence, decision tree classification is a data mining process that analyzes data with the set of generated rules.
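To make the rule view concrete, scikit-learn's `export_text` prints every root-to-leaf path of a fitted CART-style tree as an if/then rule; the example below on the Iris data is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data,
                                                               iris.target)
# Each printed if/then branch is one root-to-leaf rule of the tree.
print(export_text(tree, feature_names=list(iris.feature_names)))
```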

Node-splitting algorithms such as the concept learning system (CLS), iterative dichotomiser 3 (ID3), C4.5, and CART can be used to generate decision trees [51]. This paper selected the CART node-splitting algorithm because it can handle both continuous and discrete variables.

The principle of the CART node-splitting algorithm is as follows.

Information entropy (IE) is the most commonly used indicator of the purity of a sample set. Assume that the proportion of samples of the *k*-th class in the set *D* is *pk* (*k* = 1, 2, ... , *r*); then the information entropy of *D*, *Ent*(*D*), is defined as:

$$Ent(D) = -\sum\_{k=1}^{r} p\_k \log\_2 p\_k. \tag{4}$$

The smaller the value of *Ent*(*D*), the higher the purity of *D*. For example, a binary dataset with a 50/50 class split has *Ent*(*D*) = 1, whereas a dataset containing only one class has *Ent*(*D*) = 0.

The CART decision tree uses the Gini-index to select the partitioning attributes. Using the same notation as in Equation (4), the purity of the dataset *D* can be measured by the *Gini* value, calculated as follows:

$$Gini(D) = \sum\_{k=1}^{r} \sum\_{k' \neq k} p\_k p\_{k'} = 1 - \sum\_{k=1}^{r} p\_k^2. \tag{5}$$

Intuitively, *Gini*(*D*) reflects the probability that two samples randomly drawn from the dataset *D* have inconsistent class labels; therefore, the smaller the *Gini*(*D*), the higher the purity of the dataset *D*. For example, a 50/50 binary dataset has *Gini*(*D*) = 0.5, while a dataset containing only one class has *Gini*(*D*) = 0.

Assume that the discrete attribute *a* has *V* possible values {*a*<sup>1</sup>, *a*<sup>2</sup>, ... , *a<sup>V</sup>*}. If attribute *a* is used to partition the dataset *D*, there will be *V* branch nodes, of which the *v*-th node contains all the data in *D* that take the value *a<sup>v</sup>* on attribute *a*; this subset is denoted as *D<sup>v</sup>*. The IE of *D<sup>v</sup>* can be calculated according to Equation (4). Since different branch nodes contain different numbers of samples, each branch node is given the weight |*D<sup>v</sup>*|/|*D*|; that is, the more samples a branch node contains, the greater its influence. The *Gini-index* of the attribute *a* is then defined as:

$$Gini(D, a) = \sum\_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v) \tag{6}$$

In the candidate attribute set *A*, the attribute that minimizes the Gini-index after division is selected as the optimal division attribute, denoted *a*<sub>∗</sub>:

$$a\_\* = \underset{a \in A}{\arg\min}\, Gini(D, a). \tag{7}$$
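The sketch below is a direct NumPy translation of Equations (5)-(7) on a toy dataset: it computes *Gini*(*D*) for each subset, weights the subsets by |*D<sup>v</sup>*|/|*D*| to obtain *Gini*(*D*, *a*), and picks the attribute with the smallest Gini-index. The attribute names and data are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 (Equation (5))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(attr_values, labels):
    """Gini(D, a) = sum_v |D^v|/|D| * Gini(D^v) (Equation (6))."""
    total = len(labels)
    return sum(
        (np.sum(attr_values == v) / total) * gini(labels[attr_values == v])
        for v in np.unique(attr_values)
    )

# Toy dataset: two discrete attributes and one binary class label.
a1 = np.array(["high", "high", "low", "low", "low"])
a2 = np.array(["yes", "no", "yes", "no", "yes"])
y  = np.array([1, 1, 0, 0, 0])

attrs = {"a1": a1, "a2": a2}
best = min(attrs, key=lambda name: gini_index(attrs[name], y))  # Equation (7)
print(best, {n: round(gini_index(v, y), 3) for n, v in attrs.items()})
```

Here `a1` separates the classes perfectly, so its Gini-index is 0 and it is chosen as *a*<sub>∗</sub>.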

#### *3.2. Discretization of Continuous Variables*

Continuous attributes in a decision tree need to be discretized before node splitting, and the dichotomy (bi-partition) method is used for this purpose. Its main idea is to find the maximum and minimum values of a continuous attribute and to place multiple equally spaced candidate breakpoints between them. Each breakpoint divides the dataset into two subsets, and the splitting criterion (the information gain rate, or the Gini-index in the case of CART) is calculated for each breakpoint; the breakpoint with the best score is chosen. The CART decision tree discretizes continuous variables according to this bi-partition procedure.
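A minimal sketch of this bi-partition search is given below, assuming the Gini-index as the splitting criterion (as CART uses); the number of equally spaced candidate breakpoints and the toy data are illustrative assumptions.

```python
import numpy as np

def gini(labels):
    """Gini(D) of Equation (5)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_breakpoint(x, y, n_points=11):
    """Try equally spaced breakpoints between min(x) and max(x); keep the
    one whose two-way partition has the lowest weighted Gini value."""
    candidates = np.linspace(x.min(), x.max(), n_points)[1:-1]  # interior only
    best_t, best_g = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]  # the two subsets of the bi-partition
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])  # a continuous attribute
y = np.array([0, 0, 0, 1, 1, 1])              # class labels
print(best_breakpoint(x, y))  # a breakpoint between 1.5 and 3.0, Gini 0.0
```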


#### *3.3. Random Forest*

#### 3.3.1. Bootstrap Random Sampling

The bootstrap random sampling algorithm is used to obtain different training datasets for training the base classifiers.

The mathematical model of bootstrap is as follows: assume that there are *n* different data points {*x*1, *x*2, *x*3, ... , *xn*} in the dataset *D*. If a data point is drawn from *D* at random and put back, and this is repeated *n* times to form a new set *D*<sup>∗</sup>, then the probability that *D*<sup>∗</sup> does not contain *xi* (*i* = 1, 2, ... , *n*) is (1 - 1/*n*)<sup>*n*</sup>. When *n*→∞, it follows that:

$$\lim\_{n \to \infty} \left( 1 - \frac{1}{n} \right)^n = e^{-1} \approx 0.368. \tag{8}$$

Equation (8) indicates that approximately 36.8% of the original data are not extracted in each sampling. These data are called out-of-bag (OOB) data.
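Equation (8) is easy to verify empirically; the short simulation below draws one bootstrap sample and measures the fraction of points left out, which approaches 1/e ≈ 0.368 for large *n*.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)        # n draws with replacement from D
oob_fraction = 1.0 - np.unique(sample).size / n
print(oob_fraction)                        # close to 1/e, about 0.368
```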

#### 3.3.2. OOB Error Estimate

OOB data are not used to fit the model, so they can be used to test its generalization ability. It has been proven that the error calculated on OOB data, called the OOB error, is an unbiased estimate of the true error of the random forest [52]. Therefore, the OOB error can be used to evaluate the accuracy of the random forest algorithm.

The performance of the generated random forest can be tested with OOB data. The principle of OOB is shown in Table 1. In the first column, *xi* represents an input sample and *yi* the classification label corresponding to *xi*; in the first row, *Ti* represents the *i*-th decision tree constructed by RF. "Y" indicates that the sample participates in the construction of the corresponding decision tree, and "N" indicates that it does not. It can thus be seen from Table 1 that (*x*1, *y*1) was not used in the construction of *T*1, *T*2, and *T*3, so (*x*1, *y*1) is OOB data for the decision trees *T*1, *T*2, and *T*3. After the RF model is trained, its performance can be tested on the OOB dataset, and the test result is the OOB error. In addition, there is a relationship between the number of decision trees and the OOB error; therefore, for a given dataset, this relationship can be used to find the optimal number of decision trees in RF.


**Table 1.** The schematic of OOB.

Suppose the random forest consists of *k* decision trees, the OOB dataset is *O*, and the OOB data of the *i*-th decision tree are *Oi* (*i* = 1, 2, ... , *k*). Each OOB subset is fed into the corresponding decision tree for classification. Let *Xi* (*i* = 1, 2, ... , *k*) denote the number of misclassifications of the *i*-th decision tree; the OOB error is then calculated from:

$$\text{OOBError} = \frac{1}{k} \sum\_{i=1}^{k} \frac{X\_i}{|O\_i|}. \tag{9}$$
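As a practical note, scikit-learn exposes the same idea through `oob_score=True`: the forest is scored on the samples each tree never saw, and 1 − `oob_score_` gives an OOB error estimate. sklearn aggregates per-sample OOB votes rather than averaging per-tree error rates as in Equation (9), so the number is computed slightly differently but serves the same purpose; the data below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
print("OOB error:", 1.0 - rf.oob_score_)   # complement of the OOB accuracy
```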
