1. Introduction
Benford’s Law (BL), also known as the first-significant-digit law, states that the frequency distribution of first digits in many real-world numerical datasets follows a specific pattern: the first significant digit has a well-defined probability of occurrence. Because BL applies to a wide range of datasets, it has a correspondingly wide field of application, particularly in detecting fraud or artificial intervention, e.g., in accounting, bank transaction registers, etc.
During our previous research [1], we found that in electricity distribution networks, electric energy consumption values form datasets which, under certain circumstances and when measured at particular network nodes, follow a natural, authentic value distribution [1]. However, under specific circumstances of electricity consumption, e.g., unnatural intervention caused by macro- or micro-economic events and rules, or non-technical losses caused by electricity theft, the character of the electricity consumption changes and the dataset values become unauthentic and thus no longer follow the BL distribution [1].
We have also investigated the effectiveness of the BL-based detection method and found that specific effectiveness thresholds exist. In our experiments and simulations, we initially used only one intervention operation, the operation of value replacement. However, when using other intervention operations, we found that the effectiveness threshold is still stable, but it takes different values for different intervention operations. We therefore redefined our research goals: we conducted an intensive literature survey and found that this threshold is rarely discussed in the available articles, and we found no references focusing on how the threshold value depends on the intervention operation.
The differentiation between intervention operations is of significant importance because each effect changing the original data has its own specific character; e.g., accounting fraud usually involves value replacement, energy theft will probably raise the measured values by adding various values, and macro- or micro-economic effects will in some situations probably multiply the original dataset values.
The contribution of this paper is research into how the effectiveness threshold changes with the intervention operation. This threshold is crucial for detecting changes in the original dataset by the BL-based method. We used electricity consumption data in our experiments.
Within the reference survey, we can state the following: analyses of data from smart electric meters appear in several references, but we found analyses using Benford’s Law only rarely [1]. There are a few references focusing on conformity tests for BL-based methods [2,3,4], and there are also references on the application of Benford’s Law in other science fields [5,6,7]. However, to our current knowledge, there are no references dealing with changes in the method’s sensitivity threshold when the original dataset is altered in a non-natural way with different affecting operations.
In [1], the authors focus on how Benford’s Law can be applied to datasets measured in electric energy systems. Other case studies are described in [8,9,10,11,12]. In [13], the authors collected electricity consumption data and applied Benford’s Law to the collected dataset [8,9,10,11,14].
Furthermore, in other references [14,15,16,17,18,19,20,21,22], the authors applied Benford’s Law to the following:
- Detecting the manipulation of reported COVID-19 infection numbers;
- Analyzing lightning datasets;
- Detecting image forgery during resizing and compression;
- Detecting anomalies in the number of references per researcher and the number of researchers per publication;
- Evaluating companies’ economic data.
In [3], the authors focused on conformity tests applied in BL-based methods. The articles [2,3,4] deal with the effectiveness of Benford’s Law and the determination of the effectiveness threshold.
We also searched for references dealing with Benford’s Law effectiveness thresholds with respect to the affecting operation, e.g., altering part of the original dataset by adding a specific number, replacing values with a specific number, or applying another arithmetic operation; as far as we know, no article deals with this topic.
After collecting all available information from the references and after our verification experiments [1], we concluded that Benford’s Law is applicable to the analysis of data in electricity distribution networks. Our motivation for the experiments and simulations described further in this article was to fill the research gap by analyzing the effect of various mathematical operations used to alter the original recorded dataset and by finding how the deviation threshold in the first-digit probability changes with the operation used.
In summary, based on the reference overview, we identified the research gap in the following points:
- Finding the effectiveness thresholds of Benford’s Law-based methods according to the type of operation affecting the original dataset.
- Verifying the possibility of extracting the affecting operation type from the first-digit probability distribution.
The published article contributions with the corresponding research gaps are summarized in Table 1.
The research gap points are the basic starting points for the contribution of the research topic. In our article, we first briefly review the theory of BL. We then summarize applications of BL-based methods in electric power engineering. In the following sections, we describe the method used in the experiments and our experimental datasets; then, we present our experimental results. Finally, we discuss the presented results and provide conclusions.
2. Short Description of Benford’s Law Theory
Simon Newcomb first described this effect in 1881, and Frank Benford published it in greater depth in 1938; the effect is now known as a law. The route to this discovery can be read in [23,24]. BL expresses the probability of the first significant digit for various datasets. There are certain properties that an original dataset must have in order to conform to BL.
By the basic definition, a significant number [7] is a non-zero real number denoted by x, and the first significant decimal digit of any real number x, denoted D1(x), is the unique integer j ∈ {1, 2, …, 9} satisfying the condition in Formula (1) [24]:
j·10^k ≤ |x| < (j + 1)·10^k  (1)
for a unique k ∈ Z. In general, for every number m ≥ 2, m ∈ N, the m-th significant decimal digit of x is denoted as Dm(x). The definition is inductive for a unique integer j ∈ {0, 1, …, 9} in Formula (2) [1]:
10^k·(Σ_{i=1}^{m−1} Di(x)·10^{m−i} + j) ≤ |x| < 10^k·(Σ_{i=1}^{m−1} Di(x)·10^{m−i} + j + 1)  (2)
for a unique k ∈ Z; for convenience, Dm(0) := 0 for all m ∈ N. According to the definition, the first significant digit D1(x) of x ≠ 0 can never be zero. The second, third, fourth and further significant digits, however, may be any integer including zero; thus, they are defined on the set {0, 1, …, 9} of decimal digits.
Benford’s Law describes the probability of a specific digit being the first significant digit of a number from the examined dataset. For a dataset to follow BL, it has to comply with the following requirements:
1. The dataset should not be restricted in value range [24].
2. The dataset should not include any kind of artificial influence [24].
3. The range of the values in the dataset should be large [25], as defined by Equation (3):
max/min ≥ 100  (3)
where max and min are the maximal and minimal values of the dataset, respectively, i.e., the values should span at least two orders of magnitude.
4. The dataset should be large.
Equation (4) describes the calculation of the first significant digit probability according to BL [14]:
Fd = log10(1 + 1/d)  (4)
with Fd as the probability that the digit d appears as the first significant digit, and
Σ_{d∈D} Fd = 1  (5)
The sum calculated across all the probabilities is 1; thus, the total probability for the decimal digits d ∈ D = {1, 2, …, 9} is 100%. The distribution shows a descending trend in the probability from digit 1 to digit 9.
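To illustrate Equations (4) and (5), the theoretical first-digit probabilities can be computed in a few lines of Python (a minimal sketch for illustration, not part of our experimental code):

```python
import math

# Theoretical Benford first-digit probabilities F_d = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, f in benford.items():
    print(f"F_{d} = {f:.4f}")  # F_1 = 0.3010, F_2 = 0.1761, ..., F_9 = 0.0458

# The probabilities sum to 1 (100%), in line with Equation (5)
assert abs(sum(benford.values()) - 1.0) < 1e-12
```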
In this article, we focus on a few operations affecting the Benford’s Law distribution in particular ways. The following mathematical operations can be applied to a dataset without affecting its conformity with Benford’s Law:
Scaling operation: a scaling operation means multiplying all numbers in a dataset by a constant factor. This phenomenon is called scale invariance, and this property is the reason why Benford’s Law is robust and applicable for detecting anomalies in datasets from significantly different areas of industry and science. As mentioned above, if the dataset is not large enough, the scaling operation can break the dataset’s conformity with BL.
Multiplicative scaling: Benford’s Law is invariant to scaling transformations, which is particularly relevant for power functions, where scaling can be naturally integrated into the function by a coefficient.
Power functions: raising all numbers in a dataset to a constant power, including fractional and negative powers.
The mathematical operations described below disrupt a dataset’s conformity to Benford’s Law:
Addition/Subtraction: adding or subtracting a constant value to/from each number in a dataset can alter the distribution of the first-digit probability. This is especially true if the operation changes the magnitude of the numbers.
Limiting value range: Any operation that limits the original dataset values to a minimum or maximum value usually alters the expected distribution of the first digits.
Value replacement: the direct replacement of values by another value in a certain percentage of cases in the whole dataset; one or more replacement values can be used. This operation also causes a deviation from the theoretically expected BL distribution of the first digits.
Let us look first at the operations that do not alter the adherence of the dataset to Benford’s Law, if there was any in the original dataset. Scale invariance is an important property of Benford’s Law, meaning that the distribution of the first digits remains consistent regardless of the measurement scale of the original dataset. This property can be expressed mathematically through the transformation of variables in Equation (6) [24]:
P(D1(X) = d) = log10(1 + 1/d)  (6)
where the probability P(d) does not change if all data points are scaled by a non-zero constant k. This invariance under scaling can be expressed in Equation (7):
P(D1(k·X) = d) = P(D1(X) = d), k ≠ 0  (7)
This equation demonstrates that scaling does not affect the Benford’s Law probability distribution of first digits, which makes the law robust across different units of measure.
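The invariance in Equation (7) can also be checked numerically. The following Python sketch, our illustration with an arbitrary scaling constant and synthetic Benford-conforming data, compares the first-digit frequencies before and after scaling:

```python
import math
import random

def first_digit(x: float) -> int:
    """First significant decimal digit of a non-zero number."""
    return int(10 ** (math.log10(abs(x)) % 1))

def digit_freqs(data):
    counts = [0] * 9
    for x in data:
        counts[first_digit(x) - 1] += 1
    return [c / len(data) for c in counts]

# Synthetic Benford-conforming data: 10^U with U uniform over several orders
random.seed(1)
data = [10 ** random.uniform(0, 5) for _ in range(100_000)]

k = 3.7  # arbitrary non-zero scaling constant
print(digit_freqs(data))
print(digit_freqs([k * x for x in data]))  # nearly identical frequencies
```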
Multiplicative operations are fundamental for understanding why certain datasets adhere to Benford’s Law. A multiplicative process involves the growth or decay of values by multiplication with a constant factor. If Xn+1 = k⋅Xn, where Xn represents the value at stage n and k is a constant factor, the dataset {Xn} will increasingly show the first-digit distribution predicted by Benford’s Law as n grows. This phenomenon is due to the logarithmic nature of growth rates, which aligns with the logarithmic scale of Benford’s Law.
However, these theoretical statements are violated in some cases. The multiplication operation can affect the distribution of first digits in some practical cases and experimental conditions despite the theoretical scale invariance. We have to consider specific contexts, e.g., the range and composition of the dataset: if a finite or uniform dataset range does not span several orders of magnitude, or if the numbers are clustered close to one or a few values, then the scaling operation can disrupt the BL distribution. Furthermore, a smaller or less diverse dataset might show apparent deviations from Benford’s Law after multiplication due to insufficient data to average out the distribution across the logarithmic scale.
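The convergence of such a multiplicative process toward the BL distribution can be illustrated with a short simulation (an illustrative sketch with an arbitrarily chosen growth factor, not the authors’ experimental code):

```python
import math

# Geometric growth X_{n+1} = k * X_n; its first digits approach Benford's Law
k, x, N = 1.13, 1.0, 50_000
counts = [0] * 9
for _ in range(N):
    # first significant digit via the fractional part of log10(x)
    counts[int(10 ** (math.log10(x) % 1)) - 1] += 1
    x *= k
    if x > 1e300:   # renormalize to avoid float overflow; dividing by a
        x /= 1e300  # power of 10 does not change the first digits

observed = [c / N for c in counts]
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
print([round(o - e, 4) for o, e in zip(observed, expected)])  # all close to 0
```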
When applying a power function to a dataset, every number is raised to a fixed power. This operation affects the dataset as follows: if every number x in a dataset {X} is transformed to x^p, where p is any real number, the dataset of the results will follow Benford’s Law if the original dataset did. This is again caused by the property of logarithms [24]:
log10(x^p) = p·log10(x)  (8)
The equation above means that the distribution of the first digits remains unchanged after the application of power transformations.
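The following sketch, again our illustration on synthetic data spanning several orders of magnitude, applies several powers p to a Benford-conforming dataset and shows that the first-digit frequencies stay close to the theoretical values:

```python
import math
import random

def digit_freqs(values):
    counts = [0] * 9
    for v in values:
        counts[int(10 ** (math.log10(abs(v)) % 1)) - 1] += 1
    return [round(c / len(values), 3) for c in counts]

# Benford-conforming synthetic data spanning six orders of magnitude
random.seed(2)
data = [10 ** random.uniform(0, 6) for _ in range(100_000)]

for p in (2, 0.5, -1):
    print(p, digit_freqs([v ** p for v in data]))
# All three transformed datasets keep approximately the Benford frequencies
# (cf. the caveats above: a narrow value range can break this in practice).
```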
On the other side, there are operations that do alter the dataset’s adherence to Benford’s Law when the original dataset adhered to it. Addition or subtraction operations, e.g., adding or subtracting a constant value c to/from every number in a dataset, are the first example of such operations; in general, they disrupt the adherence to Benford’s Law because they can change the first digits of the numbers. This leads to changes in the order of magnitude for numbers that are close to power-of-ten boundaries:
x < 10^k ≤ x + c for some k ∈ Z  (9)
For example, adding c = 10 to x = 95 changes the first digit from 9 to 1. This change significantly affects the distribution of first digits, especially if the operation leads to many numbers crossing these boundaries, so the effect partially depends on the value distribution of the original dataset.
Limiting the dataset value range, e.g., imposing a minimum or maximum value on a dataset, can also disrupt the Benford distribution. Such modifications artificially constrain the range of values in the dataset, preventing the natural occurrence of numbers that span several orders of magnitude, which is essential for a dataset to follow Benford’s Law [24]:
X′ = {x ∈ X : a ≤ x ≤ b}  (10)
where a and b define the limits. Such constraints can shift the distribution of first digits away from what Benford’s Law predicts, particularly if the bounds disproportionately affect numbers with certain first digits.
Besides the first-digit Benford’s Law, the second-digit probability distribution can also provide useful information about a dataset. However, the distribution of the second digit is much more balanced: although the digits from 0 to 9 do not occur completely uniformly, the differences in the probabilities are significantly smaller in comparison with the first-digit probability distribution. For example, 0 (a valid digit in the second position) appears in about 12% of cases, 1 in approximately 11.4%, and 9 in approximately 8.5%; the probability differences are therefore relatively small. In addition, determining the probabilities of the second digit requires taking into account all possible combinations with the first digit, which leads to a less pronounced imbalance. In real-world applications, the second digit is used less frequently, but it can serve as a complementary tool in data analysis.
The second-digit probability distribution exhibits subtler differences between the values, and thus it is less suitable for detecting the effectiveness thresholds of the BL-based method; the exact second-digit probabilities can be computed as sketched below.
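For reference, the second-digit probabilities quoted above follow from the general significant-digit law, P(D2 = d) = Σ_{j=1}^{9} log10(1 + 1/(10j + d)); a minimal Python sketch (our illustration):

```python
import math

# Second-digit law: sum over all possible first digits j of the first two digits
second_digit = {
    d: sum(math.log10(1 + 1 / (10 * j + d)) for j in range(1, 10))
    for d in range(10)
}
for d, p in second_digit.items():
    print(f"P(D2={d}) = {p:.4f}")  # 0: 0.1197, 1: 0.1139, ..., 9: 0.0850
```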
3. Conformity Tests for Benford’s Law
The conformity test is a necessary step in determining whether a dataset follows the BL first-digit probability distribution. These tests calculate the actual deviation from the theoretical BL probability distribution, and the threshold for conformity is usually set using statistical tests. The method for Benford’s Law conformity verification can be found in references [4,26,27], and we can summarize it as follows:
1. Collection and preparation of the datasets.
2. Calculation of the expected theoretical distribution.
3. Calculation of the observed probability distribution for the acquired dataset.
4. Statistical testing: the Chi-square test, used in our experiments, provides a p-value indicating the likelihood of the deviation.
Equation (12) describes the calculation of the Chi-square test statistic:
χ² = Σ_{i=1}^{k} (Oi − Ei)²/Ei  (12)
In the equation, Oi is the observed frequency in category i; in our case, this is the number of occurrences of a particular digit as the first digit actually measured in the dataset. Ei is the expected theoretical frequency in category i for that digit (calculated under the null hypothesis), and k is the number of categories, i.e., of decimal digits.
5. Choosing a significance level: the threshold for the maximum acceptable probability of incorrectly rejecting the null hypothesis; this value is typically 0.05 or 0.01 [28].
6. Making the decision: if the p-value resulting from the test is less than or equal to the chosen significance level, we reject the null hypothesis, indicating that the dataset does not conform to Benford’s Law; if the p-value is greater than the significance level, we fail to reject the null hypothesis, suggesting conformity. When the dataset does not conform to Benford’s Law, there are two possible reasons for this situation:
(a) The dataset is not suitable for BL probability distribution tests. Such a dataset can have a very restricted value range, a non-natural origin, a small sample size, or an otherwise unsuitable nature of the data. This case must be excluded before any other analysis is made. For our dataset, we proved the suitability for BL probability tests in [1].
(b) The dataset is affected by some kind of non-natural intervention. This case is especially important when running research simulations and when performing real BL probability conformity tests.
A minimal sketch of this test procedure is shown below.
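The test can be implemented, for example, with scipy.stats.chisquare; this is an illustrative sketch of steps 1–6 above (the helper names are ours, and it is not the exact code used in our experiments):

```python
import math
import random
import numpy as np
from scipy.stats import chisquare

def first_digit(x: float) -> int:
    """First significant decimal digit of a non-zero number."""
    return int(10 ** (math.log10(abs(x)) % 1))

def benford_conformity_test(values, alpha=0.05):
    """Chi-square goodness-of-fit test of first digits against Benford's Law.

    Returns (chi2, p_value, conforms); `conforms` is True when we fail to
    reject the null hypothesis at significance level `alpha`.
    """
    values = [v for v in values if v != 0]                  # drop zero samples
    observed = np.bincount([first_digit(v) for v in values],
                           minlength=10)[1:]                # counts for digits 1..9
    n = observed.sum()
    expected = np.array([n * math.log10(1 + 1 / d) for d in range(1, 10)])
    chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
    return chi2, p_value, p_value > alpha

# Example: synthetic Benford-like data passes; uniformly distributed data fails
random.seed(0)
natural = [10 ** random.uniform(0, 4) for _ in range(40_000)]
uniform = [random.uniform(100, 999) for _ in range(40_000)]
print(benford_conformity_test(natural))  # high p-value -> conformity
print(benford_conformity_test(uniform))  # p-value near 0 -> rejected
```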
The significance level is a parameter in hypothesis testing that represents the probability of rejecting a true null hypothesis. In the context of testing conformity to Benford’s Law, the significance level determines the threshold beyond which the observed first-digit distribution is considered significantly different from the expected distribution under Benford’s Law. Commonly chosen significance levels are 0.05 and 0.01, corresponding to a 5% and 1% chance of such a false rejection, respectively [29,30,31,32,33].
In summary, the choice of significance level is a subjective decision that takes into account the potential implications of errors, field-specific standards, and the nature of the observed data [29,30,31,32,33,34,35,36]. For our experiments and result interpretation, we chose a significance level of 0.05 across all experiments presented in this paper.
4. Benford’s Law Application in Electric Power Engineering
Electricity theft or fraud significantly contributes to non-technical losses within distribution systems. In distribution networks that can be considered smart, smart meters are usually used to track electricity consumption at different locations and tiers of distribution. The meters are located at consumer premises and at higher-level nodes of the network. This placement strategy enables the collection of data from multiple points at multiple levels of the network. Smart meters provide data not only on electricity consumption but also on many other electricity parameters, e.g., quality parameters, availability, location, information on neighboring nodes, etc. By comparing the data from nodes at multiple levels, useful information can also be extracted, e.g., the probability of electricity theft.
In conventional networks without smart technologies, identifying electricity theft proves considerably more challenging. The management of customer information also poses a significant challenge in these scenarios. Consequently, BL-based methodology presents a viable solution for detecting electricity theft [1,37].
BL-based methods are usable as prediction and monitoring methods for electricity consumption. A BL-based method helps to identify deviations from standard electricity consumption forecasts and scenarios, as well as issues in monitoring efforts. The prediction of future electricity consumption supports the planning and management of electricity production [38]. By comparing real-time production data against the expected BL distribution, power generation companies can focus on areas requiring detailed analysis and data collection. This approach aids in reducing energy wastage [38]. However, a few restrictions exist when using BL in the field of electric power engineering [2,38,39]:
- Small datasets: in general, such datasets are not statistically significant and often do not follow the BL distribution.
- Datasets do not follow the BL distribution when their values are symmetrically centered on zero or when the first-digit probability is naturally uniformly distributed.
- Data adjustments can be natural when organizations have to follow specific operational, legal or logistical requirements in electricity production or distribution.
These restrictions apply when using BL-based methods in the field of electric power engineering and electric energy consumption. Datasets from electricity distribution networks are very sparse in the references; thus, their usability as a benchmark for our methods is very low.
In [28,40,41], methods for detecting electricity theft are presented. However, these studies and their results are not directly comparable with our experiments regarding the dataset size, the methods used, and the focus of our experiments.
The BL-based method examined in our study does not require any physical modifications to the distribution network or special preparations prior to data collection, as the dataset is already suitable for the application.
5. The Datasets Used in Our Experiments and Simulations
In summary, the criteria for datasets that are expected to adhere to the Benford’s Law distribution pattern are [6,7,42] origin in identical situations, no restrictions on the dataset’s minimum and maximum values, statistical randomness, and a span of the values of at least two orders of magnitude. For the experiments, we recorded datasets of electricity consumption values in cooperation with a local electricity distribution company in East Slovakia, Východoslovenská distribučná a.s. The measuring points, smart electricity consumption meters, were located in the Košice–Pereš locality (a part of the city of Košice) in the eastern part of the Slovak Republic, in the low-voltage electricity distribution grid. We used values from 48 smart meters.
The values of our datasets were acquired remotely and saved in a local database on a server for later evaluation. The data were recorded in the time range from 1 January 2021 0:15:00 until 1 April 2022 0:00:00. This time interval is long enough to consider the datasets statistically representative.
The data samples were recorded and stored every 15 min. In this way, we collected and stored a significant database of data samples. For the experiments presented in this paper, we chose data subsets of measured and stored values, each with a size of 43,676 samples. These datasets come from different electricity network nodes.
Each data sample in our database had the same attributes: the date and time of the measurement and the locality of the node in the electricity distribution grid (for publication reasons, the locality and even the node numbers or other node identification attributes had to be anonymized). The consumption value is expressed in W, and the identification number of the smart electricity meter identifies the node in the network.
The datasets come from real recordings acquired during normal grid operation. According to the grid owner, electricity theft in this part of the grid during the acquisition time range was improbable, because the measurements the company made in recent years did not indicate any. The choice of a specific location for our experimental data collection was influenced by the challenge of accurately characterizing data anomalies resulting from electricity theft, particularly those stemming from physical tampering with the distribution network; such incidents can vary widely in their impact, affecting the data in unpredictable ways.
Furthermore, non-physical forms of grid tampering, such as manipulation of accounting records, lead to the introduction of artificial segments within the original dataset, a scenario particularly relevant to our study. To enhance our modeling of real-world electricity theft scenarios and to facilitate comparisons across varying levels of data impact, we incorporated assumptions regarding the extent of dataset alteration.
6. Materials and Methods
In the experiments described in this paper, we used different datasets recorded as described in the previous section. We focused on finding patterns in the deviation of the leading-digit probability distribution from the standard BL distribution. For our experiments, we had to simulate artificial data injection into the original dataset, with a controlled amount of artificially injected data and a controlled type of affecting operation. The amount of affected data was chosen in the range between 1% and 100% in 1% steps for each affecting operation. Thus, we covered the states from a minimal amount of artificially changed values up to the state where the whole dataset is changed. Theoretically, a higher magnitude of the injection causes a higher deviation of the first-digit probability from the normal leading-digit distribution according to BL.
We can summarize our approach in the following steps:
1. Selecting the original dataset: random selection of different statistically significant original datasets from the data recorded by the metering devices. We selected data from different locations and different time ranges.
2. Dataset preparation: filtering out the samples with zero values.
3. Selection of the affecting operations: we selected four affecting operations: value replacement, addition of an integer to the samples, the power operation, and multiplication. For each of these affecting operations, we selected three different parameters, and each operation was applied to 0% to 100% of the data in the dataset.
4. For all datasets:
- Calculation: we calculated the BL probability distribution of the original non-affected dataset.
- Affecting: we applied the affecting operation.
- For all levels of the affected data amount, we calculated the BL probability distribution.
- We calculated the difference between the original BL distribution and all of the affected dataset distributions for all levels of change of the original dataset.
- We created appropriate tables and corresponding graphs for visualization purposes. Thus, we created the first qualification domain for finding and qualifying the deviations of the affected dataset with respect to the original dataset.
- We calculated Chi-square tests for all affected amount levels. As a result, we obtained the distribution differences quantifying the particular deviation from the BL distribution.
- We created the graph of the dependency of the Chi-square test p-value on the amount of affected data in the original dataset. Thus, we created a second qualification domain for finding and qualifying the deviations of the affected dataset with respect to the original dataset.
Prior to performing the affecting operations, we prepared all the values in the dataset by multiplying them by 1000 to perform data scaling and to ensure that all the values are greater than 1; the BL distribution should not be affected by a scaling operation. The whole simulation loop is sketched below.
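The following Python sketch summarizes the simulation loop described above; the operation table, helper names and parameters are illustrative assumptions for exposition, not our production code:

```python
import math
import random
import numpy as np
from scipy.stats import chisquare

def benford_conformity_test(values, alpha=0.05):
    """Chi-square test of first digits against Benford's Law (see Section 3)."""
    digits = [int(10 ** (math.log10(abs(v)) % 1)) for v in values if v != 0]
    observed = np.bincount(digits, minlength=10)[1:]
    expected = observed.sum() * np.log10(1 + 1 / np.arange(1, 10))
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value > alpha   # True -> conforms at the given significance level

# The four affecting operations used in the simulations (illustrative lambdas)
OPERATIONS = {
    "replace":  lambda x, v: float(v),   # value replacement
    "add":      lambda x, v: x + v,      # addition of an integer
    "multiply": lambda x, v: x * v,      # multiplication
    "power":    lambda x, v: x ** v,     # power operation
}

def effectiveness_threshold(dataset, operation, value, alpha=0.05, seed=0):
    """Smallest affected percentage at which the Chi-square test rejects BL."""
    rng = random.Random(seed)
    scaled = [1000 * x for x in dataset if x != 0]   # scaling step from the text
    for pct in range(1, 101):                        # 1% steps, as in our setup
        data = list(scaled)
        for i in rng.sample(range(len(data)), int(pct / 100 * len(data))):
            data[i] = OPERATIONS[operation](data[i], value)
        if not benford_conformity_test(data, alpha): # p-value <= alpha
            return pct                               # intervention detectable
    return None                                      # never rejected
```

For example, effectiveness_threshold(dataset, "replace", 7) would return the affected percentage at which the replacement intervention first becomes detectable, the quantity whose patterns are reported in the following sections.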
In summary, the methodology is shown in the flowchart in Figure 1.
7. The Results of Our Experiments and Simulations
In this section, we present selected results of our experimental calculations in the form of figures with histogram graphs and Chi-square test graphs. In Figure 2, the first bar is the theoretically calculated percentage probability of each leading digit according to BL. The second bar represents the percentage of first-digit appearances in the experimental dataset. The third bar represents the difference between the theoretical and the calculated first-digit appearances.
Figure 2 is presented only as an explanation of the calculation, which was performed for all datasets and all affected amount levels. In the first set of results, we processed the first dataset with the operation of value replacement; the replacement values were set to seven, eight and nine. In Figure 2, the result for dataset No. 1 is presented with the replacement value seven and 7% of the original dataset affected. This dataset was chosen as an example from all the simulated results, and it was recorded at electricity distribution network node No. 1. In Figure 2, the red line graph represents the theoretical BL first-digit probability distribution, and the blue bar graph represents the observed first-digit frequencies for the particular decimal digits.
In Figure 3, the Chi-square test results are presented for electrical energy distribution network node No. 1. The figure shows the difference between the theoretical BL probability distribution and the probability distribution in the affected datasets with intervention levels from 0% to 100% (of course, the intervention level of 100% is interesting only for theoretical mathematical simulations; this level of dataset change is very unlikely). The x-axis shows the amount of affected data in the original dataset in percent, and the y-axis shows the p-value of conformity with the theoretical values. The lower the value, the less the dataset conforms to the theoretical first-digit distribution, i.e., the dataset intervention becomes detectable.
In Figure 4, the results of the Chi-square conformity test are presented for the affecting addition operation with three addition values (one, five and nine), all for the dataset measured at electricity network node No. 1.
In Figure 5 and Figure 6, the results of the Chi-square conformity test are presented for the affecting replacement operation with three replacement values (seven, eight and nine), all for the dataset measured at electricity network node No. 1. In Figure 6, we present the same conformity values as in Figure 5, but with a logarithmic y-axis to better visualize and examine the conformity values beyond the p-value threshold set for our case.
In Figure 5 and Figure 6, the p-values of the Chi-squared test start from approximately 0.9; these are the cases in the simulation with no or a very small amount of affected data in the original dataset (approximately 1 to 3%). In these cases, the deviation from the theoretical BL distribution is very low. In cases when a large part of the original dataset is affected in the simulation, up to the theoretical value of 100%, which is used only for the completeness of the simulation, the p-values become very low, i.e., the deviation from the theoretical BL distribution becomes very high.
Usually, a p-value above a significance level of 0.05 indicates good conformity to Benford’s Law, while a p-value below this level means a significant deviation.
In Figure 7, the results of the Chi-square conformity test are presented for the affecting multiplication operation with three multiplication values (two, five and nine), all for the dataset measured at electricity network node No. 1.
In Figure 8, the results of the Chi-square conformity test are presented for the affecting power operation with three power values (two, five and nine), all for the dataset measured at electricity network node No. 1.
8. Discussion
Discussing Figure 2 with histogram graphs in detail, we show the results of the BL first-digit probability distribution for electricity consumption values acquired at one of the nodes of the electricity distribution network, randomly selected from our set of measurement nodes. In general, for each node, we applied four different operations to affect a part of the original dataset. In addition, the time interval for value acquisition was chosen randomly for each node, and it is a subset of values from a much wider time span. For node No. 1, we show the bar graph (Figure 2) for 7% of affected data in the originally measured dataset, although the BL probability distribution was calculated for all levels of affected data in the original dataset, ranging from 0% to 100% (these range limits were selected only for modeling purposes).
In the graph in Figure 2, the red bar shows the deviation of the affected dataset from the theoretically expected BL first-digit probability distribution. The deviation increases with a higher amount of affected data in the original dataset.
In Figure 4, Figure 5, Figure 7 and Figure 8, the graphs of conformity values are presented for each level of the affected data amount in the original dataset. These levels range from 0 to 100 percent. There are always three selected operands that affect the dataset. The conformity values represent the degree to which the dataset still behaves according to the theoretical BL first-digit distribution; the higher the value, the more the dataset follows this distribution.
In all cases, the conformity value decreases with a higher affected data amount, which is expected from BL theory. However, when looking at the level of affected data where the graph crosses the significance level determined for our simulations, these levels follow a common pattern for the same affecting operation regardless of the affecting value. On the contrary, these levels differ when comparing different affecting operations.
The point where the graph crosses the significance level on the y-axis is in fact the sensitivity threshold of the BL-based method for detecting the original dataset change. For our experiments, we chose the significance level of the p-value to be 0.05.
Looking at the graphs in Figure 4, the curve for affecting value 1 behaves in a strange way at first sight: it first increases from its initial value until approximately 10% and then decreases according to the theory. This may be explained by a few reasons. First, the original data might not perfectly comply with Benford’s Law; many datasets only approximate the BL first-digit distribution. If a small number of modifications happens to affect exactly those parts of the dataset that critically shape the distribution of the first digits, the distribution can be skewed more significantly. With a larger amount of changed data, various deviations may “cancel each other out”, especially if the changes are relatively evenly distributed. With a small number of changes (1%), a significant and specific shift can occur, for instance, if the modifications are concentrated in one interval or type of data.
Additionally, if the dataset does not exhibit a high degree of compliance with Benford’s Law before the change, a minor change (1%) can worsen the situation, whereas a larger intervention (20%) might randomly balance it out or bring it closer to Benford’s distribution.
The remaining affecting values follow the same graph progression, and the sensitivity threshold may be determined at approximately 25% of the affected data amount.
In Figure 5, the conformity value progression is depicted for the replacement operation. In this case, for all three affecting values, the sensitivity threshold can be determined at approximately 7% of the affected data, and the graphs follow roughly the same path.
In Figure 6, the same results of the Chi-squared tests are shown as in Figure 5, but with a logarithmic y-axis. This kind of y-axis allows focusing in more detail on the Chi-square test p-value progression at higher affected data amounts.
In both graph areas, with lower and higher affected data percentages, the curves follow a common path very closely. In addition, the breakpoint area, where the p-value curve crosses the Chi-squared test significance level, remains stable and does not change with different datasets and different replacement values. Thus, we can assume that the threshold level of method effectiveness remains stable across different datasets and different replacement values.
In Figure 7, the conformity value progression is shown for the multiplication operation. In this case, for all three affecting values, the sensitivity threshold can be determined at approximately 35% of the affected data amount, and the graphs follow roughly the same path. We can see again the same effect as in Figure 4: the curve for affecting value 9 behaves in a strange way at first sight, first increasing from its initial value until approximately 10% and then decreasing according to the theory. This again can be explained as mentioned above.
In Figure 8, the conformity value progression is shown for the power operation. In this case, for all three affecting values, the sensitivity threshold can be determined at approximately 45–50% of the affected data amount, and the graphs follow roughly the same path, except for affecting value 5. We can see again the same effect as in Figure 4: the curves for affecting values 9 and 5 increase from their initial values until approximately 10% and then decrease according to the theory. This again can be explained as mentioned in the previous cases.
Except for Figure 6, all results of the Chi-squared test conformity values from the nodes are shown with a linear y-axis. This kind of y-axis allows focusing in more detail on the Chi-square test p-value progression at lower affected data amounts.
9. Conclusions
In our experiments, Benford’s Law helped to detect artificial changes in datasets gathered from smart electricity consumption metering devices. We calculated BL’s leading-digit probability distribution for the datasets to check whether they were suitable for further experiments. Our datasets represented electricity consumption in randomly selected time intervals, locations and nodes of the electricity distribution network. After confirming that our datasets follow the BL probability distribution, we changed the original datasets in a controlled way. We made a series of calculations for each measurement node with different parameters of dataset intervention. These series correspond to different amounts of affected data in the original dataset, ranging from 0% to 100%.
Chi-squared tests for the calculated series of first-digit probability distributions were used to quantify the deviation from the BL first-digit distribution, and the p-values of the tests were plotted on a graph per node.
The graphs for selected nodes show a threshold of the data intervention amount in percent, which determines the minimum amount of data in the original dataset that has to be changed for the deviation from the BL first-digit probability distribution to be effectively detectable. This threshold was detected and, in addition, it is relatively stable across different datasets and different intervention-affecting values, as seen in the comparison graphs.
However, we detected differences in the BL-based method’s effectiveness threshold across different affecting operations: for the addition operation, the threshold was detected at 25%; for the replacement operation, at 7–9%; for the multiplication operation, at 35%; and for the power operation, at 45–50%. Thus, we can say it is necessary to know the target of the BL-based fraud or intervention detection, i.e., it is advisable to know the kind of intervention in order to determine how precise and effective this method can be in a particular deployment.
We can propose further investigation directions and ideas. It would be interesting to examine other dataset intervention operations and other datasets gathered in different science fields.
There are some references focusing on datasets in electricity distribution networks, but we did not find any that deal with this dataset origin in depth. To the best of the authors’ knowledge, a study of the BL-based method’s effectiveness and the minimal amount of affected data has never been published to this extent.
The results of our research show that Benford’s Law has an effectiveness threshold in the process of detecting data anomalies. Similar to previous studies [1,21,38], our research also shows that the more data are manipulated, the more evident the deviation of the distribution becomes and the more precise the result is.
The reference overview in the introduction showed the facts that have not been examined in the references so far.
The datasets used in the experiments in this article are unique because they were gathered from a real operating electricity distribution system. Because this kind of data is rarely freely available, this data type from this scientific field has not been explored in depth in other articles. Thus, a further contribution of our article is the validation of Benford’s Law in the electricity distribution grid domain. Reference [38] shows an analysis of smart electric meter data, but the authors use other electricity theft detection methods; moreover, in [39], the method is very sensitive to changes in the original datasets.
Previous research in other references did not focus on the BL-based method’s effectiveness as presented in our paper. We examined the overall dataset, gradually altering 0% to 100% of its values. The results of our experiments demonstrate how the amount of altered data changes the overall dataset values and show the effectiveness threshold of BL-based methods. Such research has not been published previously.
The limitations we had to consider in our experiments and simulations were the unavailability of real-world datasets with more complex and multiple affecting operations occurring at the same time. The effectiveness threshold differences should also be validated with real-world datasets, and such datasets have to be labeled, i.e., we have to know what kind of affecting situation occurred during the real data measurement in the distribution network. Such datasets acquired in real-world situations are rare and hard to obtain because of legislative restrictions at local, national and international levels.
10. Suggestions for Future Work
Although we have performed a significant number of experiments with controlled intervention operations and controlled intervention amounts in the original dataset, simulation was the main tool in these experiments. It helped us to determine the sensitivity of the BL-based fraud detection method with respect to the affected data amount.
However, in the real world, the situation can be more complex, and multiple affecting operations can occur at the same time. It could therefore be useful to examine more complex combinations of affecting operations.
In addition, the effectiveness threshold differences according to the affecting operation could be validated with real-world datasets. However, such datasets are difficult to obtain for research purposes because datasets covering every combination of affecting operations are not available from distribution companies. The datasets also have to be labeled, i.e., we have to know what kind of affecting situation occurred during the real data measurement in the distribution network. The usage of datasets from other science areas could also be considered.
Future work should address the above-mentioned challenges.