Enhancing Electricity Theft Detection through K-Nearest Neighbors and Logistic Regression Algorithms with Synthetic Minority Oversampling Technique: A Case Study on State Electricity Company (PLN) Customer Data

Maraden, Yan; Wibisono, Gunawan; Nugraha, I Gde Dharma; Sudiarto, Budi; Jufri, Fauzan Hanif; Kazutaka; Prabuwono, Anton Satria

doi:10.3390/en16145405



by Yan Maraden ¹,*, Gunawan Wibisono ¹, I Gde Dharma Nugraha ¹, Budi Sudiarto ¹, Fauzan Hanif Jufri ¹, Kazutaka ¹ and Anton Satria Prabuwono ²

¹ Department of Electrical Engineering, Universitas Indonesia, Depok 16424, Indonesia
² Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Rabigh 21911, Saudi Arabia
* Author to whom correspondence should be addressed.

Energies 2023, 16(14), 5405; https://doi.org/10.3390/en16145405

Submission received: 1 May 2023 / Revised: 22 June 2023 / Accepted: 1 July 2023 / Published: 16 July 2023

(This article belongs to the Section F: Electrical Engineering)


Abstract: Electricity theft causes massive losses and damage to electricity utilities. The damage degrades the quality of the electricity supply and increases the generation load. The losses fall not only on the utilities but also on legitimate users, who have to pay excessive electricity bills. A method for detecting electricity theft is therefore indispensable. Recently, machine learning algorithms have been used to develop models for detecting electricity theft. However, most of these algorithms suffer from imbalanced data, overfitting, and a lack of data. This paper therefore proposes a solution that applies an oversampling technique to the imbalanced dataset to address these problems and increase the accuracy of the developed model. The proposed method consists of a pre-processing step that removes empty values and extracts several parameters, followed by oversampling of the pre-processed data. Among the models developed for electricity theft detection on state electricity company customer data, the logistic regression model combined with oversampling shows the best performance. The experiment shows that the proposed method, logistic regression combined with the synthetic minority oversampling technique (SMOTE), achieves superior performance, with training accuracy, testing accuracy, precision, recall, and F1-score of 98.97%, 98.7%, 95%, 99%, and 97%, respectively. Moreover, the experiment shows that the proposed solution outperforms existing methods.

Keywords: machine learning; k-nearest neighbors; logistic regression; anomalies detection; electricity theft

1. Introduction

The State Electricity Company/Perusahaan Listrik Negara (PLN) was originally an electricity supply agency built by the Dutch. Between 1942 and 1945, ownership passed from the Netherlands to Japan. The transition ended in 1945, when Ir. Soekarno took over the agency under the name of the Electricity and Gas Office. On 1 January 1961, the Bureau of Electricity and Gas became BPU-PLN (Badan Pemimpin Umum Perusahaan Listrik Negara/General Governing Body of the State Electricity Company), which operated in the fields of electricity, gas, and coal. Finally, in 1965, the company was split into two, namely the state electricity company (PLN) and the state gas company (PGN) [1]. In 2020, there were 72,102,008 electricity customers, according to the Indonesian Central Bureau of Statistics [2]. Not all customers use electricity in accordance with the regulations, so PLN established a program called Electricity Consumption Regulation/Penertiban Pemakaian Tenaga Listrik (P2TL). P2TL is an activity that includes planning, inspection, enforcement, and settlement carried out by the state electricity provider on customers’ electricity installations [3]. This process is performed manually through onsite inspection of the home electricity installations and meters of suspected customers. Information regarding suspicious electricity installations is mostly based on reports from the public or other customers. Sometimes this information is inaccurate, and the resulting inspection is considered a false positive. For this reason, a new method is needed that uses customers’ monthly electricity usage data, processed with machine learning, to identify suspicious customers. This approach reduces the cost incurred by dispatching enforcement officers to perform onsite investigations.

The modeling uses two methods so that their results can be compared. The first method is the k-nearest neighbors (KNN) algorithm. KNN is used because it is simple and intuitive and is often applied to classification tasks [4]. It makes predictions based on the majority class of the k nearest neighbors in the feature space. KNN can work well when the decision boundaries are nonlinear or when there is no clear separation between classes, and it can handle multi-class classification problems [5]. KNN also makes no assumptions about the underlying data distribution. The second method is the logistic regression algorithm, which is suitable for binary classification tasks. It models the relationship between the independent variables and the probability of a certain outcome using a logistic function. Logistic regression can provide interpretable results, as it estimates the impact of each feature on the predicted probability [6].

Accurate labeling of customers’ monthly electricity usage data is essential to ensure a model’s high accuracy. This labeling is also performed manually after the onsite investigation of the customers’ premises. The labeled dataset is then used to train and validate the machine learning models. In this paper, we conducted several comparative experiments based on the methods mentioned above. The logistic regression model produced better predictions than the other models, such as k-nearest neighbors (KNN) with and without the synthetic minority oversampling technique (SMOTE). The logistic regression model achieves 99.44%, 99.46%, 99%, 98%, and 99% for training accuracy, testing accuracy, precision, recall, and F1-score, respectively. The F1-score is a single metric that combines precision and recall, offering a balanced assessment of a model’s performance in binary classification tasks.

2. Materials and Methods

In 2020, there were 72,102,008 electricity customers, according to the Indonesian Central Bureau of Statistics [2]. Not all customers complied with the established electricity regulations. Some customers tampered with the electrical wiring and meter installed on their premises to reduce their monthly electricity bill. The electricity provider therefore runs a program called Electricity Consumption Regulation (P2TL), which includes planning, inspection, action, and settlement by the provider to ensure that the electrical wiring and meter installation on customers’ premises have not been tampered with or damaged [3].

Customers generally pay electricity bills according to the energy they use, measured in watt-hours. The power drawn by a load is defined as

$P = V \times I \times \cos\phi,$ (1)

which represents the active power, or average power, corresponding to the actual energy transmitted to or consumed by the load. Here, $V$ is the RMS voltage in volts, $I$ is the RMS current in amperes, and $\cos\phi$ is the power factor. Thus, the amount of energy used over a time $t$ in hours, in units of kWh, can be defined as follows:

$\mathrm{kWh} = \frac{V \times I \times \cos\phi \times t}{1000},$ (2)

where the division by 1000 converts watt-hours to kilowatt-hours. A customer’s monthly bill is the monthly energy consumption multiplied by the tariff per kWh, which depends on the subscription plan.
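As a worked illustration of Equations (1) and (2), the arithmetic can be sketched as below. The function names, the example load, and the tariff value are assumptions chosen for illustration; they are not PLN figures.

```python
def active_power_w(voltage_v: float, current_a: float, power_factor: float) -> float:
    """Eq. (1): active power in watts, P = V x I x cos(phi)."""
    return voltage_v * current_a * power_factor


def energy_kwh(voltage_v: float, current_a: float, power_factor: float, hours: float) -> float:
    """Eq. (2): energy in kWh for a load that runs for `hours` hours."""
    return active_power_w(voltage_v, current_a, power_factor) * hours / 1000


def monthly_bill(kwh: float, tariff_per_kwh: float) -> float:
    """Monthly energy consumption multiplied by the tariff per kWh."""
    return kwh * tariff_per_kwh


# Hypothetical load: 220 V, 2 A, power factor 0.9, used 5 h/day for 30 days.
usage = energy_kwh(220, 2, 0.9, hours=5 * 30)
# Hypothetical tariff of 1,500 currency units per kWh (illustrative only).
bill = monthly_bill(usage, tariff_per_kwh=1500)
print(f"{usage:.1f} kWh, bill = {bill:,.0f}")
```

A tampered meter lowers the recorded kWh rather than the actual load, which is why the billed energy is the quantity examined for anomalies.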

2.1. Anomaly Detection on Electricity Consumption

Anomaly and violation detection for electricity consumption in Indonesia still relies on manual methods to identify unusual electricity usage on the customer side. The process starts by determining the operation targets based on reports and an initial analysis of customers’ monthly usage. A team is then dispatched to the location of each operation target to inspect the wiring, limiter, and metering device (kWh meter) installed at the customer’s premises. The device is inspected manually by verifying the integrity of its tamper-proof seal, meter reading, safety cover, and miniature circuit breaker (MCB). The team may refer the inspection results to the manufacturer’s consultant for further analysis [3]. The anomalies and violations found by the P2TL team are categorized as follows:

Category 1 (P1): A violation that affects the power consumption limit but not the energy measurement.
Category 2 (P2): A violation that affects the energy measurement but not the power consumption limit.
Category 3 (P3): A violation that affects both the power consumption limit and the energy measurement.
Category 4 (P4): A violation due to a fault for which the customer is not responsible.

In determining a violation, it is first necessary to analyze the load and the tariff used by the customer. A customer’s tariff class and monthly charge can be used to determine the average daily electricity usage, which is then compared with the usage of other, normal customers. The tariff classes are as follows [7]:

Electricity tariffs for social purposes:
– Tariff class for very small social services at low voltage, with a power of 220 VA (S-1/TR) [7].
– Tariff class for small to medium social services at low voltage, with a power of 450 VA to 200 kVA (S-2/TR).
– Tariff class for large social services at medium voltage, with a power above 200 kVA (S-3/TM).
Electricity tariffs for household purposes:
– Tariff class for small households at low voltage, with a power of up to 450 VA, 900 VA, 900 VA-RTM, 1300 VA, and 2200 VA (R-1/TR).
– Tariff class for medium households at low voltage, with a power of 3500 VA to 5500 VA (R-2/TR).
– Tariff class for large households at low voltage, with a power of 6600 VA and above (R-3/TR).
Electricity tariffs for business purposes:
– Tariff class for small businesses at low voltage, with a power of 450 VA to 5500 VA (B-1/TR).
– Tariff class for medium businesses at low voltage, with a power of 6600 VA to 200 kVA (B-2/TR).
– Tariff class for large businesses at medium voltage, with a power above 200 kVA (B-3/TM).
Electricity tariffs for industrial purposes:
– Tariff class for small industry/home industry at low voltage, with a power of 450 VA to 14 kVA (I-1/TR).
– Tariff class for medium industry at low voltage, with a power above 14 kVA up to 200 kVA (I-2/TR).
– Tariff class for medium industry at medium voltage, with a power above 200 kVA (I-3/TM).
– Tariff class for large industry at high voltage, with a power of 30,000 kVA and above (I-4/TT).
Electricity tariffs for government office and public street lighting purposes:
– Tariff class for small government offices at low voltage, with a power of 450 VA up to 5500 VA (P-1/TR) [7].
– Tariff class for medium government offices at low voltage, with a power of 6600 VA up to 200 kVA (P-1/TR).
– Tariff class for large government offices at medium voltage, with a power above 200 kVA (P-2/TM).
– Tariff class for public street lighting at low voltage (P-3/TR).
Electricity tariffs for medium-voltage traction purposes, with a power above 200 kVA (T/TM), intended for electric train companies.
Electricity tariffs for bulk sales at medium voltage, with a power above 200 kVA (C/TM), intended for holders of electricity supply business licenses.
Electricity tariffs for special services at low, medium, and high voltage (L/TR, TM, TT), intended only for electricity users who require services of special quality and who, for various reasons, are not covered by the provisions of the social, household, business, industry, government office and public street lighting, traction, and bulk tariff classes.
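The comparison of a customer’s average daily usage with that of normal customers in the same tariff class can be sketched as follows. This is only an illustration: the column names, the sample values, the 30-day month, and the 0.5 ratio threshold are all assumptions, not part of the P2TL procedure.

```python
import pandas as pd

# Hypothetical records: the column names and values are illustrative only.
customers = pd.DataFrame(
    {
        "customer_id": ["A", "B", "C", "D", "E", "F"],
        "tariff_class": ["R-1/TR", "R-1/TR", "R-1/TR", "R-1/TR", "B-1/TR", "B-1/TR"],
        "monthly_kwh": [150.0, 180.0, 165.0, 30.0, 900.0, 870.0],
        "is_normal": [True, True, True, False, True, True],
    }
)

DAYS_PER_MONTH = 30  # simplifying assumption
customers["daily_kwh"] = customers["monthly_kwh"] / DAYS_PER_MONTH

# Reference level per tariff class, computed from verified-normal customers only.
reference = (
    customers[customers["is_normal"]]
    .groupby("tariff_class")["daily_kwh"]
    .median()
    .rename("normal_daily_kwh")
)
customers = customers.join(reference, on="tariff_class")

# A ratio well below 1 means usage far under that of comparable normal customers.
customers["usage_ratio"] = customers["daily_kwh"] / customers["normal_daily_kwh"]
suspects = customers[customers["usage_ratio"] < 0.5]  # hypothetical threshold
print(suspects[["customer_id", "tariff_class", "usage_ratio"]])
```

Grouping by tariff class keeps the comparison between customers with similar contracted power, so a small household is not judged against a business customer.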

2.2. Machine Learning

Machine learning is a remarkable computational capability that enables the study of data to achieve accurate predictive outcomes, mimicking human learning. A significant milestone in the history of machine learning was reached in 1959 when Arthur Samuel developed a program that improved its ability to play checkers. Samuel’s pioneering work stands among numerous papers published on the subject. Notably, he introduced the concept of self-learning, wherein machines can acquire knowledge and improve without the need for explicit programming.

Machine learning begins by feeding a dataset into a model to generate output. The input dataset plays a vital role in determining the model’s output. In machine learning modeling, the dataset is typically divided into two parts: training data and testing data. Training uses the first part to teach the model how to recognize patterns and classify data accurately. This phase allows the model to learn from the provided examples and improve its predictive abilities. Once the model is trained, it undergoes a testing phase to evaluate its performance and accuracy. This is achieved by comparing the model’s output with the known outcomes in the testing set, which lets us assess its effectiveness and validate its capabilities.

Machine learning can be classified into three main categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning algorithms are designed to learn patterns from datasets where the variables and corresponding outputs are already labeled. By leveraging these labeled examples, the algorithm can discern relationships and make predictions based on new input. On the other hand, unsupervised learning algorithms uncover hidden patterns within datasets without any predefined labels. These algorithms autonomously identify structures and groupings within the data, generating labels or clusters as output. Reinforcement learning, the third category, involves algorithms that continually improve and strengthen their models based on feedback received from previous iterations. Through trial and error, the algorithm learns to make optimal decisions and maximize rewards within a given environment. The most suitable approach can be chosen based on the nature of the problem and the available data [5].

This paper focuses on the utilization of machine learning techniques for identifying anomalies in customer electricity consumption. There are numerous algorithms available for modeling electricity consumption behavior in the Jakarta region and its surrounding areas. In this study, we specifically employed the k-nearest neighbors and logistic regression models. We conducted a comparative analysis of their accuracy in detecting anomalies and violations in customer electricity consumption.

2.2.1. K-Nearest Neighbors (KNN)

The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm used for classification and regression. The algorithm works by finding the k closest training examples in the feature space to the new data point and predicting the class of the new data point based on the majority class among those k nearest neighbors. The distance between neighbors is calculated as the Euclidean distance, given as follows:

$\mathrm{distance}(x, y) = \sqrt{\sum_{i = 1}^{n} (x_{i} - y_{i})^{2}}$ (3)

where $x_{i}$ and $y_{i}$ are the $i$-th coordinates of the two neighboring points in the $n$-dimensional data space. The procedure for the KNN algorithm is as follows [8]:

1. Load the data: First, the data are loaded into memory, and the features and labels are split into separate arrays.
2. Normalize the data: Before running the KNN algorithm, the data may need to be normalized to ensure that each feature has equal importance. Normalizing the data involves scaling each feature to have a mean of 0 and a standard deviation of 1.
3. Choose the value of k: The value of k, which is the number of neighbors to consider, must be selected. This value is usually chosen using a cross-validation technique or by trying different values and selecting the one that performs the best.
4. Calculate distances: Once the value of k is chosen, the distances between the new data point and all the training examples are calculated. The most commonly used distance metric is the Euclidean distance.
5. Select k-nearest neighbors: The k-nearest neighbors are selected based on the calculated distances. These are the training examples that are closest to the new data point.
6. Predict the class: Finally, the class of the new data point is predicted based on the majority class among its k-nearest neighbors. For example, if the majority of the k-nearest neighbors are of class A, then the new data point is predicted to be of class A.
7. Evaluate the model: The performance of the KNN algorithm is evaluated using a test dataset. This performance evaluation involves measuring the accuracy of the predictions made by the algorithm on the test dataset.

The KNN algorithm is simple and effective for classification and regression tasks, but it can be computationally expensive for large datasets. Several factors influence KNN performance, namely the value of k, the distance metric (here the Euclidean distance), and the normalization of the features [8].
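The seven steps of the KNN procedure can be sketched from scratch with NumPy. This is a minimal illustration on made-up two-feature data, not the implementation used in the study; the function names and the fixed choice of k = 3 are assumptions.

```python
from collections import Counter

import numpy as np


def standardize(train: np.ndarray, other: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Step 2: scale features to mean 0 and std 1 using training statistics."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (train - mean) / std, (other - mean) / std


def knn_predict(X_train: np.ndarray, y_train: np.ndarray, x_new: np.ndarray, k: int) -> int:
    """Steps 4-6: distances, k nearest neighbors, majority vote."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Step 1: toy data with two features; label 0 = normal, 1 = abnormal.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.1, 0.9], [5.1, 5.0]])
y_test = np.array([0, 1])

X_train_s, X_test_s = standardize(X_train, X_test)
k = 3  # Step 3: fixed by hand here; cross-validation would be used in practice
predictions = np.array([knn_predict(X_train_s, y_train, x, k) for x in X_test_s])
accuracy = (predictions == y_test).mean()  # Step 7
print(predictions, accuracy)
```

Because every prediction computes the distance to all training rows, the cost grows with the size of the training set, which is the computational expense noted above.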

2.2.2. Logistic Regression

Logistic regression is a statistical model used to analyze and predict the relationship between a categorical dependent variable and one or more independent variables. It is commonly used in machine learning for binary classification problems [9].

The procedure for logistic regression can be broken down into the following steps:

1. Data preparation: Gather and prepare the data for analysis. This includes cleaning, transforming, and selecting the relevant features to be used in the model.
2. Model selection: Choose the appropriate logistic regression model to use. This could include binary, multinomial, or ordinal logistic regression, depending on the type of dependent variable being analyzed.
3. Model building: Use the selected model to build a logistic regression model. This involves estimating the model coefficients (also known as weights or parameters) using a training dataset.
4. Model evaluation: Evaluate the performance of the model using a validation dataset. This can be achieved by computing metrics such as accuracy, precision, recall, and F1-score.
5. Model improvement: Improve the performance of the model by making adjustments to the model or the data used to build it. This could include feature engineering, regularization, or hyper-parameter tuning.
6. Deployment: Once the model has been built and evaluated, it can be deployed to make predictions on new, unseen data.

Logistic regression is a mathematical model widely used in statistics to predict binary outcomes. It is beneficial for modeling the probabilities associated with specific classes or events. In logistic regression, the dependent variable takes on binary values (0/1, “yes”/“no”, “true”/“false”), while the independent variables can be continuous or categorical.

The main goal of logistic regression is to estimate the probability of the dependent variable based on the independent variables. This is achieved through the application of the logistic function, which is a sigmoid function that maps any real number to a value between 0 and 1. By utilizing this function, the model can compute probabilities. These probabilities are then converted into binary outcomes according to a threshold. If the probability exceeds the threshold, the model predicts a positive class (1); otherwise, it predicts a negative class (0). Logistic regression is a valuable tool for various applications where binary predictions are required, providing insights into the relationship between independent variables and the likelihood of specific outcomes.
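The logistic function and the thresholding step can be illustrated with a small from-scratch sketch. The gradient-descent fitting routine, its learning rate and iteration count, and the toy data are assumptions made for illustration; library implementations use more refined solvers.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def add_bias(X: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_logistic(X: np.ndarray, y: np.ndarray, lr: float = 0.1, n_iter: int = 2000) -> np.ndarray:
    """Estimate the coefficients by plain gradient descent on the log-loss."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w


def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Probability of the positive class for each row of X."""
    return sigmoid(add_bias(X) @ w)


def predict(X: np.ndarray, w: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Probability above the threshold -> positive class (1), otherwise negative (0)."""
    return (predict_proba(X, w) > threshold).astype(int)


# Toy data: one standardized feature; larger values belong to class 1.
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = fit_logistic(X, y)
print(predict_proba(X, w).round(3))
print(predict(X, w))
```

Raising the threshold above 0.5 trades recall for precision, which can matter when each positive prediction triggers a costly onsite inspection.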

2.3. Data Preprocessing with SMOTE

This paper focuses on utilizing the Synthetic Minority Oversampling Technique (SMOTE) to address imbalanced datasets in machine learning. SMOTE is a well-known method specifically designed for handling imbalanced datasets. A key aspect of SMOTE is its ability to generate new instances based on existing minority cases provided as input. It should be noted that, in most cases, this implementation does not simply duplicate existing instances [10]. Instead, the algorithm samples the feature space of each target class and its nearest neighbors. By combining the features of the target case with those of its neighboring instances, the SMOTE algorithm generates new synthetic instances. This approach effectively increases the available features for each class, resulting in more generalized examples [10]. The procedure of SMOTE can be described as follows:

1. Identify the minority class: The first step is to identify the minority class in the dataset, which is the class with fewer instances.
2. Choose a minority class instance: Next, SMOTE selects a minority class instance at random from the dataset.
3. Find its k-nearest neighbors: For the selected minority class instance, SMOTE then identifies its k-nearest neighbors. The value of k is a parameter specified by the user.
4. Generate synthetic instances: SMOTE then generates synthetic instances by interpolating between the minority class instance and its k-nearest neighbors. To generate a synthetic instance, SMOTE selects one of the k-nearest neighbors at random, calculates the difference between the feature values of the minority class instance and the selected neighbor, multiplies this difference by a random number between 0 and 1, and adds the result to the feature values of the minority class instance.
5. Repeat steps 2–4: The process of selecting a minority class instance, finding its k-nearest neighbors, and generating synthetic instances is repeated until the minority class is balanced with the majority class or until a specified target level of balance is achieved.
6. Evaluate the results: Finally, the balanced dataset is used for training a machine learning model, and the performance of the model is evaluated on a test set to determine if SMOTE has improved the accuracy of the model.

SMOTE is a straightforward yet highly effective technique for addressing imbalanced datasets by creating synthetic instances of the minority class.
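The interpolation at the core of the procedure (steps 2 to 5) can be sketched in a few lines of NumPy. This is a simplified illustration on made-up data; the function name `smote`, its parameters, and the random seeds are assumptions, and the neighbor search is done by brute force.

```python
import numpy as np


def smote(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate `n_new` synthetic rows from the minority-class samples `X_min`."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))                     # step 2: random minority instance
        dist = np.linalg.norm(X_min - X_min[i], axis=1)  # step 3: distances to all minority rows
        neighbors = np.argsort(dist)[1 : k + 1]          # position 0 is the instance itself
        j = rng.choice(neighbors)                        # step 4: pick one neighbor at random
        gap = rng.random()                               # random number in [0, 1)
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic


# Toy imbalanced data: 20 majority rows (label 0) and 6 minority rows (label 1).
rng = np.random.default_rng(42)
X_major = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
X_minor = rng.normal(loc=4.0, scale=0.5, size=(6, 2))

# Step 5: create enough synthetic rows to match the majority class.
X_new = smote(X_minor, n_new=len(X_major) - len(X_minor), k=3)
X_balanced = np.vstack([X_major, X_minor, X_new])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_new)))
print(np.bincount(y_balanced))  # class counts after oversampling
```

Each synthetic row lies on the line segment between two real minority rows, so no exact duplicates are produced. In practice, a maintained implementation such as `imblearn.over_sampling.SMOTE` from the imbalanced-learn package is the usual choice; the text does not state which implementation the authors used.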

2.4. Evaluation Parameters

A confusion matrix is a valuable tool for evaluating the performance of machine learning models. It is an N × N matrix that provides a comprehensive analysis of the model’s classification performance, where N represents the number of classes predicted by the model.

As shown in Figure 1, the rows and columns are labeled with the two symbols “p” and “n”, referring to “positive” and “negative”, respectively. In some cases, they are represented in binary form as “0” and “1”, or with the letter symbols T/F and P/N, which refer to true/false and positive/negative, respectively. There are four possible results in a confusion matrix:

True positive (TP): The model predicts positive, and the actual result is positive.
False positive (FP): The model predicts positive, but the actual result is negative.
False negative (FN): The model predicts negative, but the actual result is positive.
True negative (TN): The model predicts negative, and the actual result is negative.

Based on these four outcomes, the performance indicators of a predictive model can be defined as follows:

Accuracy: This measures how often the model correctly predicts the outcome. It is calculated by dividing the number of correct predictions by the total number of predictions. The accuracy formula is defined as follows:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (4)

Misclassification: This measures how often the model incorrectly predicts the outcome. It is calculated by dividing the number of incorrect predictions by the total number of predictions. The misclassification formula is defined as follows:

$\mathrm{Misclassification} = \frac{FP + FN}{TP + TN + FP + FN}$ (5)

Precision: This measures how many of the positive predictions are correct. It is calculated by dividing the number of true positives by the sum of true positives and false positives. The precision formula is defined as follows:

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (6)

Recall: This measures how many of the actual positive cases were correctly identified by the model. It is calculated by dividing the number of true positives by the sum of true positives and false negatives. The recall formula is defined as follows:

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (7)

F1-score: This is a combination of precision and recall that provides a single metric for evaluating a model’s performance. It is calculated by taking the harmonic mean of precision and recall. The F1-score formula is defined as follows:

$\text{F1-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (8)
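Equations (4) to (8) translate directly into code. The sketch below computes all five indicators from the four confusion-matrix counts; the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Eqs. (4)-(8) computed from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced test set of 1000 customers.
metrics = classification_metrics(tp=45, tn=935, fp=5, fn=15)
for name, value in metrics.items():
    print(f"{name:>17}: {value:.3f}")
```

With these counts the accuracy is 0.98 while the recall is only 0.75, which shows why accuracy alone can be misleading when normal customers dominate the data.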

2.5. ROC Curve and AUC

ROC (receiver operating characteristic) curve and AUC (area under the curve) are commonly used performance evaluation metrics in machine learning for binary classification problems. The ROC curve is a graphical representation of the performance of a binary classification model as its discrimination threshold is varied [11]. It plots the true positive rate (TPR) against the false positive rate (FPR) for different threshold values. The TPR is the proportion of actual positive samples that are correctly identified as positive by the model, while the FPR is the proportion of actual negative samples that are incorrectly identified as positive by the model. The ROC curve shows how well the model can distinguish between the positive and negative classes, and its shape can indicate the model’s overall performance.

The AUC is a scalar value that represents the model’s overall performance based on the ROC curve. It is the area under the ROC curve, and it ranges from 0 to 1. A model with an AUC of 1 indicates perfect performance, while a model with an AUC of 0.5 indicates random performance. To calculate the ROC-AUC, calculate the true positive rate (TPR) and false positive rate (FPR) at various classification thresholds [11]. TPR is determined by dividing the number of true positives by the total number of positive cases in the test set. On the other hand, FPR is calculated by dividing the number of false positive predictions by the total number of negative cases in the test set. The ROC-AUC metric provides a concise summary of the model’s performance across all classification thresholds. A high ROC-AUC score indicates that the model is a strong classifier, while a low ROC-AUC score suggests that the model’s classification performance is subpar.

In summary, the ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate of a model, while the AUC is a scalar value that summarizes the overall performance of the model based on the ROC curve.
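The threshold sweep behind the ROC curve and the area computation can be sketched with NumPy as follows. The labels and scores are hypothetical, and the trapezoidal rule is one common way to compute the area; library routines such as scikit-learn’s `roc_auc_score` would normally be used instead.

```python
import numpy as np


def roc_curve_points(y_true: np.ndarray, scores: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (FPR, TPR) arrays obtained by sweeping the decision threshold."""
    # Start above every score so the curve begins at (0, 0), then lower the threshold.
    thresholds = np.r_[np.inf, np.unique(scores)[::-1]]
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        predicted = scores >= t
        tpr.append((predicted & (y_true == 1)).sum() / positives)
        fpr.append((predicted & (y_true == 0)).sum() / negatives)
    return np.array(fpr), np.array(tpr)


def auc(fpr: np.ndarray, tpr: np.ndarray) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])
fpr, tpr = roc_curve_points(y_true, scores)
print(f"AUC = {auc(fpr, tpr):.4f}")
```

For this example, 12 of the 16 positive/negative pairs are ranked correctly, so the area is 0.75, between random (0.5) and perfect (1).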

3. System Design and Implementation

System design and implementation of a machine learning model is a critical process that involves designing and developing a practical algorithm for training and testing a model to perform a specific task. The process includes selecting the appropriate data set, deciding on the most suitable machine learning algorithm, and fine tuning the model to ensure optimal performance. In modeling anomalies and violations of customers’ electricity consumption, it is necessary to have a good-quality dataset as the input. The result’s quality entirely depends on the dataset used to train and validate the model. Several test scenarios are conducted to compare KNN and logistic regression algorithms and find the best parameters for each algorithm.

3.1. KNN and LR Testing Scenarios

In this scenario, the k-nearest neighbors (KNN) algorithm is used with the synthetic minority oversampling technique (SMOTE) to enhance the training data. The scenario begins with a preprocessing step, in which the data are adjusted and the anomaly/violation labels are transformed into binary numbers. The data are divided into a training set comprising 75% of the data and a testing set comprising 25% of the data. SMOTE oversampling is employed to balance the labels in the training data. The number of neighbors is varied from 1 to 15 to determine the optimal value through iteration. In the second stage, the model is trained, and the labels predicted by the model are validated. The newly trained model is then evaluated using various metrics, including a confusion matrix and ROC-AUC. The modeling process is repeated until the desired evaluation parameters are maximized. For a visual representation of the KNN with SMOTE procedure in this scenario, refer to Figure 2.

The logistic regression (LR) modeling follows a process similar to that of k-nearest neighbors (KNN), although it differs in how the iteration value is determined. The training data are identical to those of the KNN model, with a ratio between training and testing data of 75% and 25%, respectively. In this case, the number of iterations is varied over the range $10^{1}$ to $10^{6}$. The model is then evaluated using the same parameters as in the KNN testing scenarios, including the confusion matrix and ROC-AUC. For a more detailed procedure of LR with SMOTE in this scenario, refer to Figure 3.

3.2. Data Sets

The dataset used in this study consists of electricity usage data from January 2013 to December 2022, divided by regional code. The raw data are sourced from the Microsoft Database and processed into CSV format. The P2TL data, meanwhile, are available from January 2013 to September 2022 and consist of 164 columns and 102,804 rows. In addition to these two datasets, normal usage data verified by the data provider are included; these contain only customer IDs. They are joined with the raw data and labeled “N” for normal.
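The joining of the three sources can be sketched with pandas as follows. The in-memory CSV snippets, all column names, and the decision to drop customers that appear in neither the P2TL data nor the verified-normal list are assumptions made for illustration.

```python
import io

import pandas as pd

# In-memory stand-ins for the three CSV exports; all column names are hypothetical.
usage_csv = io.StringIO(
    "customer_id,2022-01,2022-02,2022-03\n"
    "1001,150,160,155\n"
    "1002,90,20,15\n"
    "1003,300,310,305\n"
    "1004,120,118,121\n"
)
p2tl_csv = io.StringIO("customer_id,violation\n1002,P2\n")
normal_csv = io.StringIO("customer_id\n1001\n1003\n")

usage = pd.read_csv(usage_csv)
p2tl = pd.read_csv(p2tl_csv)
normal = pd.read_csv(normal_csv)

# Verified-normal customers carry only an ID, so they receive the label "N".
normal = normal.assign(label="N")
violations = p2tl.rename(columns={"violation": "label"})
labels = pd.concat([violations, normal], ignore_index=True)

# Inner join: customers with neither an inspection result nor a verification are left out.
dataset = usage.merge(labels, on="customer_id", how="inner")
print(dataset)
```

Customer 1004 has usage data but no label, so it does not enter the labeled dataset in this sketch.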

Further exploration of the dataset is needed to understand the characteristics of the data and to select the most suitable model for prediction. Each kind of power usage has a particular kWh pattern that differs from normal usage. In addition, customer electricity usage data are classified by tariff.

As shown in Figure 4, the majority of customers in the data are households, which occupy the top three positions on the chart. Further analysis is therefore needed to determine whether the tariff category correlates with electricity usage.

3.3. Classification Preprocessing

The datasets in CSV format are then analyzed and preprocessed. Several columns must be deleted because they are not helpful for the model. Additionally, electricity usage data prior to 2017 need to be deleted because of the numerous missing values in each month. The data used in this study come from electricity customers in Jakarta and its surrounding regions. The data undergo preprocessing and cleaning that retain the essential information, such as customer ID, subscription type, monthly kWh usage, and the violation-type labels from the previous investigation. After cleaning and processing, the dataset consists of 486,306 rows and 72 columns, plus an additional column of violation-type labels. The 72 columns hold the monthly kWh consumption for six consecutive years (12 months × 6 years), from 2017 to 2022. All types of customer violations from the previous investigation are included in this data source. Each type of violation produces different variations in the electricity usage data. The distribution of violations among household customers is shown in Figure 5 and given in more detail in Table 1.

As shown in Table 1, the normal class is far larger than any P2TL category in the dataset. This imbalance between the labels can bias the results of the model. Therefore, the SMOTE oversampling technique is employed to overcome the data imbalance and further enhance both the KNN and LR algorithms. The string violation-type labels are replaced with binary “0” and “1” to represent “normal” and “abnormal”, respectively. In addition, all types of violation are treated as a single class, labeled “abnormal”. This approach is chosen because the labeling by the P2TL officers is not consistent with the violation-type definitions, especially for the P2 and P1 violation types, and the datasets contain many mismatched violation-type labels. The data distribution after grouping all violation-type labels can be seen in Figure 6, where the label “0” refers to normal data and “1” refers to abnormal data. The data are clearly unbalanced, which would bias the prediction results of the model.
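The grouping of all violation types into one “abnormal” class can be sketched as follows, using the class counts from Table 1. The function name `to_binary` and the use of the Table 1 class names as label strings are assumptions for illustration; the label strings stored in the actual dataset may differ.

```python
from collections import Counter

# Class counts copied from Table 1.
table1_counts = {
    "Normal": 396_380,
    "Abnormalities 1": 49,
    "Abnormalities 2": 36_187,
    "Violation 1": 18_430,
    "Violation 2": 8_155,
    "Violation 3": 27_105,
}

def to_binary(label):
    """Map a violation-type label to 0 (normal) or 1 (abnormal)."""
    return 0 if label == "Normal" else 1

binary_counts = Counter()
for label, count in table1_counts.items():
    binary_counts[to_binary(label)] += count

total = sum(binary_counts.values())
share_normal = binary_counts[0] / total

print(total, binary_counts[0], binary_counts[1])  # 486306 396380 89926
print(f"{share_normal:.2%} normal, {1 - share_normal:.2%} abnormal")  # 81.51% normal, 18.49% abnormal
```

The total matches the 486,306 rows of the cleaned dataset, and roughly four of every five rows carry the label “0”.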

3.4. Evaluation Scenarios

The testing is conducted twice on each model. The first test uses the data as given, changing only the labels to “0” for normal data and “1” for abnormal data. The second test uses SMOTE to oversample the data and balance the labels. The results on the unbalanced data are discussed in a later section, which contains all the additional test results. The classification model is evaluated using accuracy, precision, sensitivity, and F1-score. The accuracy score measures the share of correct predictions over the whole test set. The precision score reflects how many false positives the model produces. The sensitivity score assesses how many true positives are captured, and the F1-score supports judgments when the dataset is unbalanced. The true positive, true negative, false positive, and false negative values generated by the model can also be presented as a confusion matrix.

Python, along with the pandas and scikit-learn libraries, is utilized for implementing and evaluating machine learning models such as k-nearest neighbors and logistic regression [12,13,14]. The Visual Studio Code Editor serves as the platform for development and evaluation. Additionally, Jupyter Notebook is employed for essential tasks like data cleaning, preprocessing, and visualization. The following are software and libraries used to build and implement the model:

Python 3.9.13;
Conda 22.11.1;
Pandas 1.4.4;
Numpy 1.21.5;
Matplotlib 3.5.2;
scikit-learn 1.1.3;
mlxtend 0.21.0.

All raw datasets are converted into CSV format, then cleaned and processed in Jupyter Notebook, which is also used for data visualization after the model is trained and validated.

4. Experiment Result

The test results of the implementation in the previous section are compared between the two algorithms, namely k-nearest neighbors and logistic regression. The performance comparison between the two algorithms is based on the result evaluation of the confusion matrix parameters and ROC values.

4.1. KNN Testing Result

In the first experiment, the modeling is conducted using the k-nearest neighbors algorithm. Subsequently, the data are partitioned into a 3:1 ratio between training data and testing data. The random state parameter is set to one due to computational constraints. The optimal number of neighbors is determined by iteratively training the model with various numbers of neighbors ranging from one to fifteen. The most optimal k-value is determined based on the result shown in Figure 7, a value of 14 neighbors, which produces 93.3% accuracy. Then, the results of the optimal k-value are applied to the model to start the training and testing process with the KNN algorithm.
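The neighbor search described above can be sketched as a simple loop over k from 1 to 15. This is a minimal plain-Python illustration on synthetic two-feature data, not the scikit-learn implementation or the PLN data used in the study; the helper names (`knn_predict`, `accuracy`), the toy class sizes, and the unstratified shuffle are all assumptions of this example.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Share of test points whose predicted label equals the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Synthetic, imbalanced toy data: 150 "normal" (0) and 50 "abnormal" (1) points.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(150)] + \
    [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(50)]
y = [0] * 150 + [1] * 50

# 3:1 split between training and testing data.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = len(indices) * 3 // 4
train_idx, test_idx = indices[:cut], indices[cut:]
train_X, train_y = [X[i] for i in train_idx], [y[i] for i in train_idx]
test_X, test_y = [X[i] for i in test_idx], [y[i] for i in test_idx]

# Evaluate every candidate k from 1 to 15 and keep the most accurate one.
scores = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

When several values of k tie, `max` returns the smallest of them, which is one possible tie-breaking rule among many.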

The learning curve result of the KNN algorithm with k-value of 14 neighbors is shown in Figure 8. The percentage of error in performing the classification is very low, below 7%, with percentage training at 100%. The most significant percentage error is at the beginning of training the model, with an error percentage of 7%. The results of the training scores are as high as 93.2%, and the testing scores as high as 93.2%. So, it can be concluded that the evaluated model is an excellent fit model as shown in Figure 8, where the error of training decreases as the number of data increases.

The model obtained in the training phase is further evaluated based on the confusion matrix parameters, i.e., true positive, true negative, false positive, and false negative values. Based on Figure 9, the values obtained are fairly good, with 130,695 true positive instances, 7043 true negative instances, 9343 false positive instances, and 1432 false negative instances. These values suggest that the model can predict the results well. However, further evaluation is needed using the precision, the recall, and the F1-score. The following is the result of the evaluation:

Based on the result shown in Table 2, the macro-averaged precision in predicting normal and abnormal data is 88%. Based on these results, this model is good enough to predict normal and abnormal data. At the same time, the macro-averaged recall is 71%, which means that the model is usable for prediction but still misclassifies a considerable amount of data. The macro-averaged F1-score of 76% confirms that the model is reasonably good at predicting “normal” and “abnormal” data.
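The values in Table 2 can be recomputed from the four counts read from Figure 9. The sketch below assumes, based on the supports in Table 2 (132,127 and 16,386), that the counts reported as true positive and false negative belong to actual normal data (“0”), and those reported as true negative and false positive belong to actual abnormal data (“1”). The helper `per_class_metrics` is a name introduced here for illustration.

```python
def per_class_metrics(correct, missed, wrongly_assigned):
    """Precision, recall and F1 for one class.

    correct:          instances of the class predicted as that class
    missed:           instances of the class predicted as the other class
    wrongly_assigned: instances of the other class predicted as this class
    """
    precision = correct / (correct + wrongly_assigned)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts read from Figure 9 (KNN, 14 neighbors, no oversampling).
normal_ok, normal_missed = 130_695, 1_432      # actual normal ("0")
abnormal_ok, abnormal_missed = 7_043, 9_343    # actual abnormal ("1")

class0 = per_class_metrics(normal_ok, normal_missed, abnormal_missed)
class1 = per_class_metrics(abnormal_ok, abnormal_missed, normal_missed)
macro = [(a + b) / 2 for a, b in zip(class0, class1)]

total = normal_ok + normal_missed + abnormal_ok + abnormal_missed
accuracy = (normal_ok + abnormal_ok) / total

print([round(v, 2) for v in class0])  # [0.93, 0.99, 0.96]
print([round(v, 2) for v in class1])  # [0.83, 0.43, 0.57]
print([round(v, 2) for v in macro])   # [0.88, 0.71, 0.76]
print(round(accuracy, 2))             # 0.93
```

The low recall of 0.43 for class “1” shows that more than half of the abnormal instances are predicted as normal, which the overall accuracy alone hides.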

The next evaluation parameter is the ROC-AUC, as shown in Figure 10. It shows that the model predicts reasonably well, with an AUC score of 77%, even though it still produces some false positives. The closer the ROC curve bends toward the upper-left corner, the better the model separates normal from abnormal data; otherwise, it produces many false predictions. Taken together, Table 2, Figure 9, and Figure 10 show that the model predicts anomalous behavior with reasonable accuracy.
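One way to read the AUC is as the probability that a randomly chosen abnormal instance receives a higher score than a randomly chosen normal one. The sketch below computes it by direct pair counting on a four-point toy example; the function name `roc_auc` and the toy scores are assumptions for illustration, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, giving 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.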

4.2. Logistic Regression Testing Result

In the second experiment, logistic regression is used for modeling. Similar to the previous experiment, the data are divided into a 3:1 ratio, with 75% for training data and 25% for testing data. The random state parameter is set to one due to computational constraints. Furthermore, to determine the maximum number of iterations, the authors performed repeated training experiments with maximum iteration values ranging from $10^{1}$ to $10^{6}$. By observing Figure 11, it can be concluded that $10^{4}$ is the most suitable maximum iterations value, resulting in 99% accuracy for both training and testing. The maximum number of iterations is therefore set to $10^{4}$ for the model training and testing processes with the logistic regression algorithm.
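The effect of the iteration cap can be illustrated with a small sweep over powers of ten. The sketch below is a stand-in only: it uses plain batch gradient descent written in pure Python on synthetic data, not the scikit-learn solver used in the study, and it stops at $10^{3}$ to keep the loop fast. The function names, learning rate, and toy data are assumptions of this example.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, max_iter, lr=0.1):
    """Batch gradient descent on the log-loss, stopped after max_iter steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(max_iter):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, X, y):
    """Share of points classified correctly at a 0.5 probability threshold."""
    preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Synthetic toy data: 90 "normal" (0) and 30 "abnormal" (1) points.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)] + \
    [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)]
y = [0] * 90 + [1] * 30

results = {}
for exponent in range(0, 4):  # 10^0 .. 10^3
    max_iter = 10 ** exponent
    w, b = fit_logistic(X, y, max_iter)
    results[max_iter] = accuracy(w, b, X, y)
print(results)
```

In such a sweep, accuracy typically rises with the cap until the optimizer has converged, after which larger caps only add training time; the smallest cap on that plateau is a natural choice.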

Based on the learning curve in Figure 12, the percentage of error in classification is very low, just below 0.6%, for most of the training set size, up to 100%. The highest error percentage occurs at the beginning of the training process, with an error rate of 0.67% at 30% of the training process. Thus, the training and testing accuracy scores of the model are 99.44% and 99.46%, respectively. The model obtained in the training phase is then evaluated based on the confusion matrix parameters, i.e., true positive, true negative, false positive, and false negative values.

The confusion matrix of the logistic regression algorithm is shown in Figure 13. The values obtained through the tests are quite good: 131,902 true positive instances, 15,817 true negative instances, 516 false positive instances, and 278 false negative instances. The results show that the model can predict anomalous behavior correctly. However, further evaluation is needed using the precision, recall, and F1-score values. The following is the result of the evaluation:

Furthermore, based on the classification performance shown in Table 3, each evaluation parameter has a value up to 99%. The precision result indicates that the predictions made by the model are close to perfect. It indicates that the model has a low rate of falsely labeling negative instances as positive. Meanwhile, the recall has a value of 97%, which means that around 3% of the predictions show false negatives. It shows how well the model captures the positive instances in the dataset and how sensitive it is to identify them correctly.
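As a worked check, the headline figures can be recomputed from the Figure 13 counts. The derivation assumes, based on the supports in Table 3 (132,180 and 16,333), that the 15,817 true negative instances are the correctly identified abnormal data (class “1”), that the 516 false positives are abnormal data predicted as normal, and that the 278 false negatives are normal data predicted as abnormal.

```latex
\[
\text{Accuracy}
  = \frac{131{,}902 + 15{,}817}{131{,}902 + 15{,}817 + 516 + 278}
  = \frac{147{,}719}{148{,}513} \approx 0.9947
\]
\[
\text{Recall}_{1}
  = \frac{15{,}817}{15{,}817 + 516}
  = \frac{15{,}817}{16{,}333} \approx 0.968,
\qquad
\text{Precision}_{1}
  = \frac{15{,}817}{15{,}817 + 278}
  = \frac{15{,}817}{16{,}095} \approx 0.983
\]
```

These values are in line with the 99.46% testing accuracy and with the class “1” recall of 0.97 and precision of 0.98 in Table 3.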

The evaluation then proceeds with a further parameter, the ROC-AUC. As shown in Figure 14, the model produces very accurate predictions. The ROC curve bends towards the top-left corner and the AUC value approaches one, which indicates that the model can effectively classify “0” as normal data and “1” as abnormal data.

4.3. Model Training and Testing Using SMOTE

In this experiment, the existing data set will be oversampled using SMOTE. The purpose of oversampling is to balance the normal data with the abnormal data, specifically related to electricity usage in Indonesia. Before oversampling, the training dataset consists of 264,253 instances of normal data and 32,771 instances of abnormal data. However, after applying the oversampling technique, the ratio between these two data types becomes balanced at 264,253 instances each, with a total of 528,506 instances.
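The core idea of SMOTE is to create each synthetic minority sample on the line segment between an existing minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified plain-Python illustration on synthetic data, not the library implementation used in the study; the function name `smote`, its parameters, and the toy data are assumptions of this example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of base among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Synthetic toy data: 20 "normal" rows and 4 "abnormal" rows.
rng = random.Random(1)
normal = [[rng.gauss(100, 10), rng.gauss(100, 10)] for _ in range(20)]
abnormal = [[rng.gauss(40, 5), rng.gauss(40, 5)] for _ in range(4)]

new_rows = smote(abnormal, n_new=len(normal) - len(abnormal), k=3)
balanced_abnormal = abnormal + new_rows
print(len(normal), len(balanced_abnormal))  # 20 20

# The same bookkeeping with the training-set counts reported above.
n_normal, n_abnormal = 264_253, 32_771
print(n_normal - n_abnormal, 2 * n_normal)  # 231482 528506
```

With the reported counts, 231,482 synthetic abnormal instances are needed to reach 264,253 per class and 528,506 instances in total. Only the training data are oversampled; the testing data keep their original distribution.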

4.3.1. KNN with SMOTE

In this experiment, the training data consist of 528,506 instances, with a 1:1 ratio between normal and abnormal data. Subsequently, the training is conducted iteratively to determine the optimal number of neighbors, ranging from 1 to 15, to achieve high training accuracy and prediction results.

Based on Figure 15, it can be concluded that the highest accuracy for the training data is achieved with 13 neighbors, resulting in an accuracy of 73.53%. However, the highest accuracy for the testing data is obtained with two neighbors, with an accuracy of 91.73%. Therefore, the KNN experiment with SMOTE is conducted using two neighbors to achieve better prediction results.

The learning curve in Figure 16 shows a very low classification error, at 0.3% and below 0.1% for the complete training and testing datasets, respectively. After obtaining these results, it is necessary to evaluate the model using a confusion matrix. Initially, the authors determine the true positive, true negative, false positive, and false negative values.

The confusion matrix for KNN with oversampling is shown in Figure 17. The model shows a fairly good result, i.e., 129,199 true positive instances, 7030 true negative instances, 9356 false positive instances, and 2928 false negative instances. These values suggest that the model can predict the results quite well. However, further evaluation is needed using other parameters, i.e., the precision, recall, and F1-score values.

The evaluation of classification parameters in Table 4 shows a precision of 82% for predicting both normal and abnormal data. Based on the given results, the model performs well in predicting abnormal data. However, the recall value is 70%, indicating that there are still many false negative predictions. Furthermore, when validated using the F1-score, it produces a value of 74%, indicating that the model is only moderately effective in predicting both normal and abnormal data.

The evaluation then proceeds with the ROC-AUC result shown in Figure 18. The model reaches a reasonably good AUC score of 74%. However, it still exhibits a considerable number of false positives. The closer the ROC curve is to the top-left corner, or the closer the AUC value is to one, the more effectively the model classifies “0” as normal data and “1” as abnormal data. Thus, based on the results in Table 4, Figure 17, and Figure 18, the applied model is fairly accurate, as each of the evaluations shows.

4.3.2. Logistic Regression with SMOTE

In the second experiment, the modeling is carried out using logistic regression. The data are divided into a 3:1 ratio between the training and testing data. The data are chosen randomly by using a random state value of 6133. Then the optimal number of maximum iterations is determined by varying the maximum value of iterations from $10^{1}$ to $10^{6}$.

Based on the result of iterating logistic regression, shown in Figure 19, it is determined that the optimal iteration value is $10^{5}$, which produces a training accuracy of 98.4% and a testing accuracy of 98.6%. This value is then applied to the model, and the training process is started with a logistic regression algorithm with a maximum of $10^{5}$ iterations.

The learning curve performance of logistic regression with the SMOTE oversampling technique can be seen in Figure 20; the error percentage of classification is very low, that is, around 1.4%, with percentage training at 100%. Thus, the result of the training score model is 98.9% and for the testing score model, 98.5%. The model obtained in the training phase is further evaluated based on the confusion matrix parameter, i.e., true positive, true negative, false positive, and false negative values.

The values obtained are quite good based on the result shown in Figure 21: 131,902 true positive instances, 16,180 true negative instances, 153 false positive instances, and 1765 false negative instances. It can be inferred that the model can predict the outcome very well. However, further evaluation is needed using other evaluation parameters, i.e., the precision, recall, and F1-score values.

The evaluation of the classification parameters in Table 5 shows that the predictions made by the model are close to perfect, meaning that the predicted results closely match the actual data. At the same time, the recall has a value of 99%, meaning that around 1% of the predicted results are false negatives or false positives. The results in Table 5 show that the applied model has high accuracy, as each of the evaluations shows and as the F1-score confirms.

The evaluation then proceeds with the ROC-AUC shown in Figure 22. The model results show predictions that are close to perfect. The closer the ROC curve bends towards the upper-left corner, or the closer the AUC value is to one, the better the model can detect anomalies within the dataset.

4.4. Model Comparison Result

After evaluating each algorithm using a training data size of 297,024 and a testing data size of 148,513, along with the parameters used for each algorithm, the optimal configuration found for KNN and logistic regression without the oversampling technique is 14 neighbors and a maximum of $10^{4}$ iterations, respectively. Next, further testing is conducted using the same models by applying the synthetic minority oversampling technique to balance the data. The k-nearest neighbors algorithm uses 2 neighbors, while the logistic regression uses a maximum of $10^{5}$ iterations. After applying the oversampling technique, the training and testing datasets contain 528,506 and 148,513 instances, respectively.

Based on the accuracy results shown in Table 6, the logistic regression model is more accurate on both the testing data and the training data. The difference in accuracy between the models is relatively small. However, it should be noted that the labels had to be changed to normal and abnormal, leaving only two possible classes. Further comparison is needed by evaluating the confusion matrix parameters, i.e., each model’s recall, precision, and F1-score. This comparison checks, beyond the accuracy value, whether each model is accurate in determining each label.

As shown in Table 7, the two models produce different results. The logistic regression model has the highest recall score of 99%, with only 1% false prediction of the anomalies within the dataset. Combined with the oversampling technique (SMOTE), the logistic regression achieves the highest precision score, with 99% of its positive classifications being true positives. A high precision score means that the model is good at identifying true positive instances and does not often classify negative instances as positive. In other words, a high precision score indicates that the model has a low false positive rate. However, precision alone may not be a sufficient metric to evaluate a model’s performance, especially if the dataset is imbalanced. In such cases, other metrics, such as the recall and F1-score, may also need to be considered to understand the model’s overall performance better. Table 7 also shows that the logistic regression has the highest F1-score, at 99%.

The models are also further evaluated based on the AUC. The AUC has several advantages over other metrics, such as accuracy, precision, and recall. One of the main advantages is that it is insensitive to class imbalance, meaning that it can handle datasets where one class is much more prevalent than the other. Another advantage is that it provides a single value summarizing the model’s performance across all possible classification thresholds. As shown in Table 8, the logistic regression model and the logistic regression model with SMOTE both achieve a score of 99%. This means that, across thresholds, the true positive rate of the model is much higher than its false positive rate. The higher the AUC value, the better the model can distinguish between positive and negative classes. Thus, based on the training and testing results, both logistic regression and logistic regression with SMOTE can be used to determine P2TL labels with high accuracy.

5. Conclusions

In this study, we aimed to investigate the effects of oversampling through SMOTE for imbalanced data. Our results indicate an outstanding average accuracy of 98% for the logistic regression model with SMOTE oversampling on the data. These findings are consistent with the other model and suggest that oversampling through SMOTE improves the accuracy of the models in detecting anomalies in monthly electricity usage. It is important to note that the existing methods have problems due to imbalanced data, overfitting issues, and lack of data. The dataset we used consists only of user ID and monthly electricity usage. Therefore, we adjusted the method to detect anomalies within the dataset, and our proposed method shows an accuracy enhancement in theft detection. The proposed methods consist of a preprocessing step to remove empty values, extract several parameters, and oversample through SMOTE to address the problems.

In conclusion, our findings enhance the accuracy of theft detection on the dataset with only monthly electricity usage columns for the inputs. Moreover, the experiment result shows that the proposed solution of logistic regression with the SMOTE oversampling technique outperforms the existing method performed manually by the P2TL officer by overcoming the imbalanced issues of a dataset.

Author Contributions

Conceptualization, Y.M., I.G.D.N., G.W., B.S. and F.H.J.; methodology, Y.M., I.G.D.N., K. and F.H.J.; software, Y.M., I.G.D.N. and K.; validation, Y.M. and I.G.D.N.; formal analysis, Y.M., I.G.D.N. and F.H.J.; investigation, Y.M. and I.G.D.N.; resources, G.W. and B.S.; data curation, Y.M. and I.G.D.N.; writing—original draft preparation, Y.M., I.G.D.N. and K.; writing—review and editing, Y.M., I.G.D.N., G.W. and A.S.P.; visualization, Y.M. and I.G.D.N.; supervision, G.W. and B.S.; project administration, B.S. and F.H.J.; funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Universitas Indonesia Research Grant for International Publication Financial Year 2022/2023, contract number: NKB-1474/UN2.RST/HKP.05.00/2022.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. PLN Company Profile. Available online: https://web.pln.co.id/tentang-kami/profil-perusahaan (accessed on 15 November 2022).
2. The State Electricity Company Customer 2018–2020. Available online: https://www.bps.go.id/indicator/7/317/1/pelanggan-perusahaan-listrik-%20negara.html (accessed on 15 November 2022).
3. Information on Electricity Consumption Regulation (P2TL). Available online: https://web.pln.co.id/pelanggan/informasi-p2tl (accessed on 15 November 2022).
4. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
5. Theobald, O. Machine Learning for Absolute Beginners; Independently Published: Traverse City, MI, USA, 2018.
6. Zhang, Z. Introduction to machine learning: k-nearest neighbors. Ann. Transl. Med. 2016, 4, 218.
7. Minister of Energy and Mineral Resources Regulation of the Republic of Indonesia No. 28 of 2016 Regarding Electricity Tariffs Provided by PT Perusahaan Listrik Negara (Persero). Available online: https://jdih.esdm.go.id/peraturan/Permen%20ESDM%20No.%2028%20Th%202016.pdf (accessed on 15 November 2022).
8. Cunningham, P.; Delany, S.J. k-Nearest Neighbour Classifier—A Tutorial. ACM Comput. Surv. 2022, 54, 1–25.
9. Zou, X.; Hu, Y.; Tian, Z.; Shen, K. Logistic Regression Model Optimization and Case Analysis. In Proceedings of the 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 19–20 October 2019; pp. 135–139.
10. SMOTE. 12 April 2021. Available online: https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/smote?view=azureml-api-2 (accessed on 3 December 2022).
11. Park, S.H.; Goo, J.M.; Jo, C.-H. Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists. Korean J. Radiol. 2004, 5, 11–18.
12. Halvorsen, H.-P. Python Programming; Independently Published: Traverse City, MI, USA, 2020.
13. About Pandas. Available online: https://pandas.pydata.org/about/index.html (accessed on 20 November 2022).
14. Scikit Learn—Introduction. Available online: https://www.tutorialspoint.com/ (accessed on 1 December 2022).

Figure 1. Example of $2 \times 2$ confusion matrix.

Figure 2. Training and testing KNN model with SMOTE.

Figure 3. Training and testing LR model with SMOTE.

Figure 4. Customer distribution based on tariff categories.

Figure 5. Customer violation distribution.

Figure 6. Violation distribution of binary category.

Figure 7. KNN testing result with various number of neighbors.

Figure 8. Learning curves performance of KNN with 14 neighbors.

Figure 9. KNN testing result with 14 neighbors.

Figure 10. Receiver operating characteristic curve of KNN with 14 neighbors.

Figure 11. Logistic regression maximum iteration testing.

Figure 12. Learning curves of logistic regression with a maximum of $10^{4}$ iterations.

Figure 13. Logistic regression confusion matrix testing.

Figure 14. Receiver operating characteristic curve of logistic regression.

Figure 15. KNN with oversampling testing result.

Figure 16. KNN learning curve with 2 neighbors.

Figure 17. KNN confusion matrix with oversampling.

Figure 18. ROC curve of 2-nearest neighbors with SMOTE oversampling.

Figure 19. Logistic regression with oversampling maximum iteration testing.

Figure 20. Logistic regression maximum iteration testing with oversampling.

Figure 21. Logistic regression confusion matrix with oversampling.

Figure 22. ROC curve of logistic regression testing with oversampling.

Table 1. Electricity violation distribution.

Class	Amount	Amount in Percent
Normal	396,380	81.51%
Abnormalities 1	49	0.01%
Abnormalities 2	36,187	7.44%
Violation 1	18,430	3.79%
Violation 2	8155	1.68%
Violation 3	27,105	5.57%

Table 2. KNN classification result.

	Precision	Recall	F1-Score	Support
0	0.93	0.99	0.96	132,127
1	0.83	0.43	0.57	16,386
accuracy			0.93	148,513
Macro average	0.88	0.71	0.76	148,513
Weighted average	0.92	0.93	0.92	148,513

Table 3. Logistic regression classification result.

	Precision	Recall	F1-Score	Support
0	1	1	1	132,180
1	0.98	0.97	0.98	16,333
accuracy			0.99	148,513
Macro average	0.99	0.98	0.99	148,513
Weighted average	0.99	0.98	0.99	148,513

Table 4. KNN classification result with oversampling.

	Precision	Recall	F1-Score	Support
0	0.93	0.98	0.95	132,127
1	0.71	0.43	0.53	16,386
accuracy			0.92	148,513
Macro average	0.82	0.70	0.74	148,513
Weighted average	0.91	0.92	0.91	148,513

Table 5. Logistic regression with oversampling.

	Precision	Recall	F1-Score	Support
0	1	0.99	0.99	132,180
1	0.9	0.97	0.94	16,333
accuracy			0.92	148,513
Macro average	0.99	0.99	0.99	148,513
Weighted average	0.99	0.99	0.99	148,513

Table 6. Summary of models accuracy.

Description	KNN	LR	KNN (SMOTE)	LR (SMOTE)
Data Training	93.26%	99.44%	92.21%	98.97%
Data Testing	93.22%	99.46%	91.71%	98.7%

Table 7. Summary of parameter evaluation.

Description	KNN	LR	KNN (SMOTE)	LR (SMOTE)
Recall	88%	99%	82%	95%
Precision	71%	98.5%	70%	99%
F1-score	76.5%	99%	74%	97%

Table 8. Summary of ROC-AUC results.

Description	KNN	LR	KNN (SMOTE)	LR (SMOTE)
AUC	77%	99%	74%	99%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

