**1. Introduction**

Power losses are usually divided into technical losses (TLs) and nontechnical losses (NTLs) [1]. NTLs refer to the power loss during the transformation, transmission, and distribution process and are mainly caused by electricity theft on the user side [2]. In most countries, electricity theft losses (ETLs) account for the predominant part of the overall electricity losses [3] and mainly take place in the medium- and low-voltage power grids. ETLs can cause serious problems, such as lost revenue for power suppliers, reduced stability, security, and reliability of power grids, and unnecessary resource consumption. In India, ETLs were valued at about US\$4.5 billion [4], and this figure is still rising year by year. ETLs are reported to reach up to 40% of the total electricity losses in countries such as Brazil, Malaysia, and Lebanon [5]. ETLs in some provinces of China reached about 200 million kWh, with an overall cost of 100 million yuan. As reported in [6], the losses due to electricity theft reached about 100 million Canadian dollars every year, with a power loss that could have supplied 77,000 families for one year. The annual income loss caused by electricity theft in the United States accounted for 0.5% to 3.5% of the total income [7,8]. Therefore, research on advancing electricity theft detection techniques has become essential due to its significance for energy saving and consumption reduction [9].

In the past, conventional methods such as electric meter packaging, professional electric meters, and bidirectional metering were adopted to deal with electricity theft [10,11]. Today, electricity theft detection methods rely on classifying the data collected by smart meter measurement systems. Electricity theft and normal behaviors are distinguished through data analysis [12].

The modern methods for electricity theft detection mainly include state-based analysis, game theory, and classification [13].

State-based detection schemes employ specific devices to provide high detection accuracy. A novel hybrid intrusion detection system framework that incorporates power information and sensor placement was developed in [14] to detect malicious activities such as consumer attacks. In [15], an integrated intrusion detection solution (AMIDS) was presented to identify malicious energy theft attempts in advanced metering infrastructures. AMIDS makes use of different information sources to gather a sufficient amount of evidence about an ongoing attack before marking an activity as malicious energy theft. In [16], state estimation was used to identify electricity theft users. When the voltage given by state estimation differed from the measured node voltage, a breadth-first search was conducted from the root node of the distribution network, and the magnitudes of the differences at the same depth were compared to locate electricity theft users. In [17], to detect and localize theft in grid-tied microgrids (MGs), a stochastic Petri net (SPN) with a low sampling rate was used to first detect the random occurrence of theft and then localize it. The detection was based on determining the accurate line losses through singular value decomposition (SVD), which led to the recognition of theft in grid-tied MGs. Reference [18] investigated energy theft detection in microgrids, considering a realistic model of the microgrid's power system and the protection of users' privacy. It proposed two energy theft detection algorithms capable of successfully identifying energy thieves. The first, a centralized state estimation algorithm based on the Kalman filter (SEK), employed a centralized Kalman filter. However, it could not protect users' privacy and did not have very good numerical stability in large systems with high measurement errors. The second, a privacy-preserving bias estimation algorithm (PPBE), was based on two loosely coupled filters and could preserve users' privacy by hiding their energy measurements from the system operator, other users, and eavesdroppers. Overall, the high detection accuracy of state-based schemes comes at the price of the extra investment required for the monitoring system, including device, system implementation, software, and operating/training costs.

Another approach to theft detection is based on game theory. Reference [19] formulated the problem of theft detection as a game between an illegitimate user and a distributor. The distributor wants to maximize the probability of theft detection, while illegitimate users, or thieves, want to minimize the likelihood of being caught by changing the probability density functions (PDFs) of their electricity usage.

Classification-based methods include expert systems and machine learning. Expert systems are based on computer models trained by human experts to deal with complex problems and draw the same conclusions as experts [20]. Expert systems based on specific decision rules were the first to be used for electricity theft detection. With the rapid development of artificial intelligence technology, machine learning enables computers to learn decision rules from training data. Therefore, in recent years, machine learning has become the main research direction of electricity theft detection [21]. Reference [22] explored the possibilities of applying machine learning techniques to the detection of nontechnical losses in customers. The analysis was based on work done in collaboration with an international energy distribution company. It reported how success in detecting nontechnical losses can help the company better control the energy provided to its customers, avoiding misuse and hence improving the sustainability of the service that the company provides. Reference [23] provided a novel knowledge-embedded sample model and a deep semi-supervised learning algorithm to detect NTLs using smart meter data. It first analyzed the characteristics of realistic NTLs and designed a knowledge-embedded sample model based on the principle of electricity measurement. Next, it proposed an autoencoder-based semi-supervised learning model.

In [24], fuzzy logic and an expert system were combined to integrate human expert knowledge into the decision-making process for identifying electricity theft behavior. A grid-based local outlier algorithm was proposed in [25] to achieve unsupervised learning of the abnormal behavior of power users. This method mapped the feature variables onto a two-dimensional plane by factor analysis (FA) and principal component analysis (PCA). Grid technology reduced both the dimensionality of the data and the computational cost of the outlier factor algorithm. In [26], an electricity theft detection method based on a probabilistic neural network was employed to detect two types of illegal consumption.

In [27], clustering analysis was first carried out to reduce the amount of data to be analyzed, and the suspected users were then found through a neural network. In [28], the extreme learning machine (ELM) was used to identify the weights between the hidden and output layers, and electricity theft was detected through the measured meter data. In [29], a five-joint neural network was trained with power data from 20,000 customers and achieved considerable accuracy. An SVM-FIS method was proposed in [30], which could reduce the computational complexity and improve the detection accuracy by combining a fuzzy inference system (FIS) with a support vector machine (SVM). In [31], a data-based method was proposed to detect sources of electricity theft and other commercial losses. Prototypes of typical consumption behavior were extracted by clustering the data collected from smart meters.

For an unbalanced dataset, intelligent algorithms tend to favor positive data (PD) in the training process and ignore the important information contained in the few negative data (ND), which may reduce the detection accuracy [32]. Therefore, correcting the imbalance of the dataset plays an important role in improving the efficiency and accuracy of the algorithm. Data-oriented methods mostly rely on existing and validated cases of fraud for either training or validation. However, since frauds are scarce, these samples are difficult to obtain unless other fraud detection methods, such as unsupervised detection or a manual inspection campaign, are used [33].
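To make the effect of class imbalance concrete, the following minimal sketch (not taken from any of the cited works) evaluates a trivial detector that labels every user as normal. The 95/5 class split, the label encoding (0 for the majority class, 1 for the minority theft class), and the helper names `confusion_counts`, `accuracy`, and `recall` are assumptions chosen purely for illustration.

```python
def confusion_counts(y_true, y_pred, minority=1):
    """Return (tp, fp, fn, tn), treating `minority` as the class of interest."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == minority:
            if t == minority:
                tp += 1
            else:
                fp += 1
        else:
            if t == minority:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all samples that are labeled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of minority (theft) samples that are actually detected."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical unbalanced dataset: 95 majority samples (0), 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate detector that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2f}, minority recall = {rec:.2f}")
```

Under these assumptions the detector is correct on 95 of the 100 samples and yet finds none of the five theft cases, which is why overall accuracy alone can be misleading on unbalanced data.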

The theory of unbalanced data processing has been widely used in the fields of network fraud identification, network intrusion detection, medical diagnosis, and text classification. However, it is still rarely used in electricity theft detection. Reference [34] introduced the consumption pattern-based energy theft detector (CPBETD), a new algorithm for detecting energy theft in advanced metering infrastructure (AMI). CPBETD relies on the predictability of customers' normal and malicious usage patterns, and it addresses the problems of imbalanced data and zero-day attacks by generating a synthetic attack dataset, benefiting from the fact that theft patterns are predictable. In [35], a methodology was proposed to improve the performance and evaluation of supervised classification algorithms in the context of NTL detection with imbalanced data. The main contributions of that work lie in two aspects: (1) the strategies considered to counteract the effects of imbalanced classes, and (2) an extensive list of performance metrics detailed and tested in the experiments.

A comprehensive detection method for NTLs in unbalanced power data was proposed in [36], which contained three detection models (Boolean rule, fuzzy logic, and support vector machine). Reference [37] proposed two undersampling methods for the classification of unbalanced data: the easy ensemble (EE) algorithm and the balance cascade (BC) algorithm. Both methods exhibited high computational and implementation complexity. In [38], a one-sided selection (OSS) method was proposed for dealing with unbalanced data. In [39], a KNN-near miss method based on the K-nearest neighbor (KNN) undersampling method was proposed. In [40], an oversampling method called the synthetic minority oversampling technique (SMOTE) was adopted, which achieved excellent results in the processing of unbalanced data and effectively solved the problem of excessive random sampling. However, the algorithm had a certain blindness in the selection of neighbors, did not consider the distribution of the data when generating new data, and had strong marginality.
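The interpolation step at the core of plain SMOTE can be sketched as follows. This is a simplified illustration under stated assumptions, not the improved variant developed later in this paper and not the reference implementation of [40]: the function name `smote_sketch`, the toy minority samples, and the parameter values are hypothetical, and at least two minority samples are assumed.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])
        # The new point lies on the segment between x and the chosen neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Hypothetical two-feature minority (theft) samples, for illustration only.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 2.5], [1.5, 1.0]]
synthetic = smote_sketch(minority, n_new=6, k=2, seed=42)
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because the neighbor is drawn at random from the k nearest minority samples without regard to where the majority class lies, the sketch also makes visible the blindness in neighbor selection noted above.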

Reference [41] reported that, compared with a single strong decision tree, weak decision trees had high computational efficiency. In addition, considering the weight sparsity of weak classifiers, the recognition rate of the cluster could be further improved [42]. In [43], decision trees were used for NTL detection, and the algorithms were tested on a real database from the Endesa company. In addition, a random forest (RF) classifier can save resources and computational time because its multiple decision trees run in parallel. Moreover, each decision tree can achieve random selection of data and attributes without overfitting [44].
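The two sources of randomness mentioned above, random selection of data and of attributes, can be illustrated with a deliberately small sketch. It is a simplification rather than a full random forest: each "tree" is a one-split decision stump, the trees are trained sequentially instead of in parallel, and the function names, toy data, and parameter values are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, (feature, t, left_label, right_label))
    return best[1]


def predict_stump(stump, row):
    feature, t, left_label, right_label = stump
    return left_label if row[feature] <= t else right_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen attribute."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # random selection of data
        feature = rng.randrange(d)                  # random selection of attribute
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical features per user, e.g. mean daily load and a load-ratio index.
X = [[10.0, 1.0], [11.0, 1.1], [9.0, 0.9], [10.5, 1.0],
     [2.0, 0.2], [3.0, 0.3], [2.5, 0.1], [1.5, 0.25]]
y = ["normal"] * 4 + ["theft"] * 4

forest = fit_forest(X, y, n_trees=25, seed=0)
low_usage = predict_forest(forest, [1.0, 0.05])
high_usage = predict_forest(forest, [12.0, 1.5])
print(low_usage, high_usage)
```

Since the stumps do not depend on one another, the loop in `fit_forest` could be distributed across workers, which is the property that underlies the savings in computational time described above.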

In summary, considering the shortcomings of existing electricity theft detection methods and the imbalance of user data, this paper proposes a method for detecting electricity theft behaviors based on an improved SMOTE and a random forest classifier. The main contributions of this paper can be listed as follows.


This paper is organized as follows: Section 2 proposes the K-SMOTE. Section 3 describes the proposed detection method for electricity theft behaviors. In Section 4, simulation results are presented to verify the feasibility and superiority of the proposed method. Section 5 summarizes the main conclusions of this study. In addition, the nomenclature table is given in Appendix A.
