Article

Unveiling Malicious Network Flows Using Benford’s Law

by Pedro Fernandes 1,*,†, Séamus Ó Ciardhuáin 1,† and Mário Antunes 2,3,†
1 Department of Information Technology, Technological University of the Shannon, Moylish Campus, Moylish Park, V94 EC5T Limerick, Ireland
2 School of Technology and Management, Polytechnic University of Leiria, 2411-901 Leiria, Portugal
3 INESC TEC, CRACS, 4200-465 Porto, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2024, 12(15), 2299; https://doi.org/10.3390/math12152299
Submission received: 8 July 2024 / Revised: 19 July 2024 / Accepted: 20 July 2024 / Published: 23 July 2024

Abstract:
The increasing proliferation of cyber-attacks threatening the security of computer networks has driven the development of more effective methods for identifying malicious network flows. The inclusion of statistical laws, such as Benford’s Law, and distance functions, applied to the first digits of network flow metadata, such as IP addresses or packet sizes, facilitates the detection of abnormal patterns in the digits. These techniques also allow for quantifying discrepancies between expected and suspicious flows, significantly enhancing the accuracy and speed of threat detection. This paper introduces a novel method for identifying and analyzing anomalies within computer networks. It integrates Benford’s Law into the analysis process and incorporates a range of distance functions, namely the Mean Absolute Deviation (MAD), the Kolmogorov–Smirnov test (KS), and the Kullback–Leibler divergence (KL), which serve as dispersion measures for quantifying the extent of anomalies detected in network flows. Benford’s Law is recognized for its effectiveness in identifying anomalous patterns, especially in detecting irregularities in the first digit of the data. In addition, Bayes’ Theorem was implemented in conjunction with the distance functions to enhance the detection of malicious traffic flows. Bayes’ Theorem provides a probabilistic perspective on whether a traffic flow is malicious or benign. This approach is characterized by its flexibility in incorporating new evidence, allowing the model to adapt to emerging malicious behavior patterns as they arise. Meanwhile, the distance functions offer a quantitative assessment, measuring specific differences between traffic flows, such as frequency, packet size, time between packets, and other relevant metadata. Integrating these techniques has increased the model’s sensitivity in detecting malicious flows, reducing the number of false positives and negatives, and enhancing the resolution and effectiveness of traffic analysis. 
Furthermore, these techniques expedite decisions regarding the nature of traffic flows based on a solid statistical foundation and provide a better understanding of the characteristics that define these flows, contributing to the comprehension of attack vectors and aiding in preventing future intrusions. The effectiveness and applicability of this joint method have been demonstrated through experiments with the CICIDS2017 public dataset, which was explicitly designed to simulate real scenarios and provide valuable information to security professionals when analyzing computer networks. The proposed methodology opens up new perspectives in investigating and detecting anomalies and intrusions in computer networks, which are often attributed to cyber-attacks. This development culminates in creating a promising model that stands out for its effectiveness and speed, accurately identifying possible intrusions with an F1-score of nearly 80%, a recall of 99.42%, and an accuracy of 65.84%.

1. Introduction

The increase in cyber-attacks poses various problems associated with security flaws in computer networks, whether physical or cloud-based, including vulnerabilities that allow attackers to exploit weaknesses in network protocols and carry out malware attacks, such as ransomware infections, compromising data integrity and confidentiality [1,2].
Network traffic flows encapsulate essential data, such as the source and destination IP addresses, the time intervals between server communications, the timestamps of each transaction, and the communication protocols used. A common vulnerability exploited in cyberattacks is unauthorized access to the network, which results in the theft of sensitive data and compromises the integrity of systems. By knowing IP addresses, attackers can identify potential targets within the network, devise attack strategies and, using spoofing techniques, hide or falsify their locations. Statistical analysis of IP addresses in traffic logs can reveal discrepancies that suggest manipulation or fabrication, indicative of malicious activity. In addition, analysis of communication times can expose periods of lower protection or higher activity on the network, allowing attackers to determine the ideal times to launch attacks. Accurate knowledge of these times can be crucial, allowing malicious actions during windows of opportunity when detection is unlikely [3,4].
These features are candidates to be analyzed using a set of statistical laws, namely the application of Benford’s Law, a mathematical principle that describes the frequency of occurrence of the first digit (from 1 to 9) in numerical datasets and makes it possible to detect anomalies in the distribution of digits. This law states that digits tend to follow a specific distribution pattern, with the digit 1 appearing with a frequency of 30.10%, followed by the digit 2 with 17.6%, and so on, in a pattern that resembles a negative exponential. This pattern has been observed in various datasets, including financial transactions and demographic statistics [5,6]. For example, if an attacker manipulates or creates log records to hide their activities, the distribution of the first digits of these records may not adhere to Benford’s Law.
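For illustration, the expected first-digit frequencies under Benford’s Law can be computed directly from its defining formula, P(d) = log10(1 + 1/d). The short Python sketch below is illustrative only (the scripts accompanying this paper are written in Matlab):

```python
import math

# Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.4f}")

# The frequencies decay roughly like a negative exponential and sum to 1:
# digit 1 appears about 30.1% of the time, digit 2 about 17.6%, and so on.
```

Running this confirms the frequencies cited in the text (0.3010 for digit 1, 0.1761 for digit 2).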
Understanding the protocols used on the network is a crucial aspect of cybersecurity. It allows attackers to choose the best techniques and tools to exploit existing vulnerabilities in the network. The most common attacks include the abuse of TCP SYN packets, which indicate the intention to establish a connection, and PSH-ACK packets, which convey the urgency of data delivery and confirmation that it has been received by the recipient. These attacks can result in large volumes of transferred data, the analysis of which, using statistical laws such as Benford’s Law, can identify abnormal statistical distributions that deviate from the expected, suggesting illicit activity [7,8,9,10,11].
Flow analysis is a powerful network management and security tool. Each flow is identified by critical components, including the source and destination IP addresses and the protocol used, whether TCP or UDP. The real power of flow analysis lies in its ability to identify atypical traffic patterns, such as communications to suspicious destinations or excessive traffic volumes at certain times, indicating a security breach. Network administrators and security analysts can stay one step ahead of potential threats by analyzing these flows.
In network security, flow analysis plays a crucial role in identifying malware attacks in contrast to other types of attacks due to various factors. These include the rapid spread of this attack, often including methods to hide its presence in infected systems. In addition, the adaptive capacity of malware means that attackers continually develop new strategies to elude intrusion detection systems (IDSs). The complexity and diversity of malware attacks are also significant, ranging from simple viruses to sophisticated spyware or ransomware programs. Given the ability of malware to establish communications with command and control servers through unconventional ports or protocols, rapid detection of these communications is imperative to enable an agile and effective response to control the infection [12,13,14].
Although current systems, such as signature-based IDS, anomaly-based IDS, hybrid or behavior-based IDS, have high success rates in detecting intrusions, they face limitations, such as complex configurations, the need for vast computing resources, the inability to detect new or unknown threats (zero-day attacks), the need for constant updates of signature datasets, the high number of false negatives if the signature dataset is not comprehensive, and the dependence on large volumes of historical data to form a suitable basis for comparative analysis [15,16].
In contrast to traditional intrusion detection systems, Benford’s Law offers distinct advantages due to its simplicity of implementation and operational efficiency. This methodology allows for identifying digit divergences without the need for large computer resources or large volumes of historical data on which to base decisions, making it particularly useful in any attack scenario, whether previously known or unknown.
Benford’s Law can be effectively applied without resorting to statistical analysis mechanisms to detect abnormal or malicious activity in network flows by analyzing the patterns of the first digits of numerical metadata, such as inter-packet times or packet sizes. If the frequency of the first digits deviates significantly from the expectations of Benford’s Law, this can indicate malicious communications or cyber-attacks. Monitoring systems can periodically check these distributions and warn of persistent deviations. However, not all data will follow Benford’s Law, making validating and calibrating its use in security analyses essential. To improve the analysis of these deviations, studies recommend integrating Benford’s Law with statistical measures such as the calculation of Pearson, Spearman, and Kendall correlation coefficients, the Chi-square test, and the application of the Weibull distribution to assess the fit of statistical models to the observed data [17,18,19].
In addition to using such statistical methods, distance functions combined with Benford’s Law can significantly improve the detection of anomalies in network traffic flows. These functions make it possible to quantify the degree to which data deviate from what is expected by Benford’s Law, increasing sensitivity in identifying small deviations that could signal intrusion attempts or other malicious activities. In addition, they provide a standard method for comparing different datasets or periods within the same set, adapting to the specific context of the analysis. For example, the Euclidean distance may be suitable when the magnitude of the deviations is relevant. At the same time, other more subtle and non-linear patterns can be captured using other distance functions, notably when using the Kullback–Leibler divergence, which identifies small patterns that could be overlooked in simple frequency analyses. Functions such as the Chi-squared test, the mean absolute deviation (MAD), and the sum of squared deviations (SSD) are handy, providing an objective and quantitative measurement of anomalies, and are the most widely used. Compliance with Benford’s Law is usually proven when the value of a specific distance function is below a critical threshold, as indicated in Nigrini’s studies on accounting fraud. However, it is crucial to assess whether these thresholds are applicable in the context of network data, thus ensuring the effectiveness and relevance of the security analyses carried out [20,21].
This paper goes beyond analyzing network traffic data, employing an integrated approach that combines Benford’s Law with three distance metrics: the mean absolute deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence. The developed methodology examines the data and identifies unnatural patterns in the first digit that could signal network vulnerability exploitation. By applying these three distance metrics, the study aims to identify significant distances between the frequencies of occurrence of the first digit and the empirical frequencies stipulated by Benford’s Law, thus making it possible to identify possible anomalies or intrusion attempts.
This approach enables rigorous data flow analysis, contributing to the proactive detection and mitigation of security risks in network environments. Combining these techniques allows for a deeper and more efficient analysis of network traffic, overcoming challenges such as the need for vast computing resources, dependence on large volumes of historical data, and difficulties detecting new or unknown threats. This methodology allows us to detect abnormal patterns in the initial digits of flows, which can indicate malicious activity. The integration of distance functions aims to enrich network flow analysis by quantifying anomalies, providing a robust assessment of data dispersion. The approach proposed in this paper incorporates the following three distance functions:
  • Mean Absolute Deviation (MAD), where the dispersion of the data is calculated by averaging the absolute differences between the observed and expected frequencies, providing a precise measure of the variance about Benford’s Law.
  • Kolmogorov–Smirnov (KS) test compares the cumulative distributions of the observed frequencies of the digits with those predicted by Benford’s Law, identifying significant discrepancies that may indicate anomalies.
  • Kullback–Leibler (KL) divergence measures the information lost when the observed distribution is used to estimate the distribution expected by Benford’s Law. This metric quantifies the degree of divergence between the two distributions.
High values obtained from calculating the distances between the frequencies observed and those expected by Benford’s Law in any of the applied metrics may indicate substantial deviations between what is observed and what is expected by Benford’s Law, which suggests the possible occurrence of abnormal patterns or suspicious activity in the network data. This multidimensional approach allows for detailed and in-depth analysis, which is crucial for accurately identifying irregularities in network flows.
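The three dispersion measures described above can be sketched in a few lines of code. The following Python functions are an illustrative simplification (the paper’s implementation uses Matlab scripts; the function names and the sample digit list are ours):

```python
import math
from collections import Counter

# Expected Benford frequencies for first digits 1-9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def digit_frequencies(first_digits):
    """Observed relative frequencies of first digits 1-9."""
    counts = Counter(first_digits)
    n = sum(counts[d] for d in range(1, 10))
    return [counts[d] / n for d in range(1, 10)]

def mad(obs):
    """Mean Absolute Deviation: average absolute gap from Benford's frequencies."""
    return sum(abs(o - e) for o, e in zip(obs, BENFORD)) / 9

def ks(obs):
    """Kolmogorov-Smirnov statistic: maximum gap between cumulative distributions."""
    d_max, c_obs, c_exp = 0.0, 0.0, 0.0
    for o, e in zip(obs, BENFORD):
        c_obs += o
        c_exp += e
        d_max = max(d_max, abs(c_obs - c_exp))
    return d_max

def kl(obs, eps=1e-12):
    """Kullback-Leibler divergence of the observed from the Benford distribution."""
    return sum(o * math.log((o + eps) / e) for o, e in zip(obs, BENFORD))

obs = digit_frequencies([1, 1, 1, 2, 3, 1, 2, 9, 5, 1])  # hypothetical first digits
print(mad(obs), ks(obs), kl(obs))
```

All three functions return values near zero when the observed frequencies match Benford’s Law exactly, and grow as the distributions diverge.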
The research was conducted using the CIC-IDS2017 dataset, a comprehensive collection of network flows representing various types of attacks. This dataset, which includes network flows generated and analyzed by CICFlowMeter, covers many attacks, including brute force attacks on FTP and SSH, Heartbleed, web attacks, infiltrations, botnet activities and DDoS attacks.
To develop and evaluate the model based on Benford’s Law in conjunction with the three distance functions, we followed a systematic approach. This approach, similar to the one proposed by Nigrini, involved evaluating the compliance of the data with Benford’s Law for the first digit. This allowed us to implement the different distance functions, namely the Mean Absolute Deviation, the Kolmogorov–Smirnov test, and the calculation of the Kullback–Leibler divergence. By following this approach, we were able to develop a robust model for detecting anomalies in network traffic data, contributing to the proactive detection and mitigation of security risks in network environments.
The methodology developed for detecting malicious flows was structured in two main phases. Initially, each distance function was assessed individually for its ability to detect malicious flows, analyzing the discrepancy between the observed frequencies of the first digits and those expected by Benford’s Law.
In the second phase, a specific version of Bayes’ Theorem was integrated and adjusted precisely to detect malicious flows. This integration made it possible to transform the p-values obtained by the distance functions into a new joint p-value, assuming that each flow could be malicious. This new p-value was then recalculated for each distance function and combined using recognized p-value aggregation methods, such as the Fisher and Tippett methods.
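As a rough illustration of this second phase, the sketch below combines per-function p-values with Fisher’s and Tippett’s methods and applies a simple Bayesian update. The likelihood model inside `bayes_update` and the sample p-values are assumptions made for illustration, not the paper’s exact formulation:

```python
import math

def bayes_update(prior_malicious, p_value):
    """Illustrative Bayesian update: treats a small p-value as evidence of
    malicious behaviour, using (1 - p) as an assumed likelihood under the
    malicious hypothesis and p under the benign one (a simplification)."""
    like_mal, like_ben = 1 - p_value, p_value
    num = prior_malicious * like_mal
    return num / (num + (1 - prior_malicious) * like_ben)

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(ln p) follows a chi-squared distribution with
    2k degrees of freedom; for even dof the survival function has the closed
    form exp(-x/2) * sum_{j<k} (x/2)^j / j!, used here to avoid SciPy."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # x/2 with x = -2*sum(ln p)
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))

def tippett_combine(p_values):
    """Tippett's method: combined p-value based on the minimum p-value."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k

p_vals = [0.04, 0.01, 0.20]  # hypothetical p-values from MAD, KS and KL
print(fisher_combine(p_vals), tippett_combine(p_vals))
print(bayes_update(0.2, fisher_combine(p_vals)))
```

A small combined p-value pushes the posterior probability of the flow being malicious above the prior, which is the qualitative behaviour the second phase relies on.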
This approach aimed to enrich the model’s ability to make informed decisions based on the data derived from the deviations between the frequencies observed and those expected by Benford’s Law, making it possible to assess the significance and likelihood of the detected anomalies more effectively, indicating abnormal behavior, such as potential intrusions. Thus, classifying flows as benign or malicious made it possible to calculate the probability of each flow being correctly identified, leading to a new decision that enhances the accuracy and reliability of security systems in detecting potential intrusions.
Based on the adjusted probabilities and combined p-values produced in the second phase, this new decision was used to formulate an ensemble of the results generated by each distance function. This ensemble aims to provide a more comprehensive and accurate view of detecting malicious flows, significantly boosting the accuracy and reliability of the security system in network environments.
The results obtained with the experiments demonstrate an accuracy of around 65.85%, with an F1-score of approximately 80%. Although encouraging, the results emphasize the need for further studies to assess the model’s applicability in different contexts, especially in accounting crime. Furthermore, additional studies are essential to investigate the possibility of integrating the model with other fraud detection techniques, such as pattern analysis or machine learning. This integration could increase the model’s accuracy and reduce the false positive rate, providing a more robust and effective approach to identifying fraudulent activity. The wide range of existing studies on the application of Benford’s Law in this field will make it possible to consolidate and validate the proposed model. At the same time, developing the integrated model, which combines Benford’s Law with distance functions dedicated to analyzing malicious network flows, could provide indispensable information for security analysts in the fight against cybercrime in the future.
The paper describes the results obtained in the research:
  • A model based on the joint application of Benford’s Law and three distance functions, namely the Mean Absolute Deviation, the Kullback–Leibler divergence, and the Kolmogorov–Smirnov test, in analyzing and identifying anomalies in the flows obtained from a computer network.
  • The development of a set of Matlab scripts that facilitated the implementation of Benford’s Law in conjunction with three distance functions. These scripts were used to extract the first digit, calculate each digit’s frequency of occurrence, and generate an ensemble that integrates the distance functions with Benford’s Law by applying Bayes’ Theorem. They can be found at https://github.com/pacfernandes/Unvelling-Network-Malicious-flows.git (accessed on 1 July 2024).
  • A comparison between the results obtained with this model and those attained with machine learning-based methods.
This paper is structured in several different sections. Section 2 reviews the literature, highlighting the most relevant works that explore the application of Benford’s Law to detect anomalies and intrusions in computer networks. This includes a critical discussion of the methodologies employed, the results achieved, and their implications for cyber security. Section 3 deals mathematically with Benford’s Law, the distance functions used, their relevance and application in the domain under study. Section 4 sets out the general architecture of the proposed model, including the pre-processing and processing steps for extracting the features aligned with Benford’s Law, the evaluation of the model based on methods suggested by Nigrini, and the metrics used to obtain the overall evaluation results. Section 5 presents the experimental results and subsequent analysis. Finally, Section 6 discusses the research’s main conclusions and suggests directions for future work.

2. Benford’s Law and Distance Functions in the Detection of Malicious Flows

This section analyzes studies that apply Benford’s Law and other statistical techniques for detecting malicious flows in computer networks. At the end of this section, we summarize the main gaps identified in previous work and the motivation for this study.

2.1. Related Work

Computer network security has predominantly focused on using machine learning techniques to analyze and detect anomalies or intrusions. However, purely statistical approaches are often overlooked. This trend neglects valuable methods such as regression analysis, outlier detection, and Markov models, which offer complementary and usually more intuitive insights. These methods allow for the identification of hidden patterns, the prediction of security events, and a deeper understanding of attacker behavior, which are essential elements for a robust analysis of atypical behavior on the network [22,23,24,25].
However, Iorliam [26] has brought a fresh perspective by delving into the applicability of Benford’s Law in analyzing network traffic data. This unique study aimed to verify the compliance of network data with this statistical law and to differentiate the relationship between benign and malicious network traffic flows, offering a novel approach to network security. Iorliam’s study examined all the data collected, applying the Chi-squared statistical test to assess the correlation between the observed data and the expected distributions according to Benford’s and Zipf’s laws.
The Chi-squared test is widely used to test hypotheses about the independence of variables in contingency tables, allowing researchers to determine whether differences between categories are due to chance or a statistically significant relationship. In this case, it was used to assess the compliance of the network data with Benford’s Law and other natural laws, namely Zipf’s Law. The results showed that the p-values obtained by the Chi-squared test when applying Benford’s law are inversely proportional to the values obtained when applying Zipf’s law. This result suggests a variation in the effectiveness of the laws in different contexts of network traffic analysis. While Iorliam’s research laid the groundwork for applying Benford’s laws in network traffic analysis, it left a gap in addressing the practical aspects of differentiating between benign and malicious traffic flows using these statistical laws. It is crucial to note that applying these laws can face challenges in real-life scenarios, such as the need for large amounts of data and the possibility of false positives or negatives. Therefore, this study opens the door for future research that validates or improves these statistical laws as diagnostic tools in cybersecurity environments, emphasizing the potential impact of the contribution to advancing the field.
Recent studies have begun to unveil the potential of Benford’s Law as a tool for revolutionizing intrusion detection systems (IDSs) in high-volume network traffic scenarios. For instance, ref. [27] proposed a new feature extraction method based on this statistical law. The method, which extracted six features from the divergence values, focused mainly on the first three digits. The authors evaluated the model’s effectiveness using three machine learning classifiers, and the results were promising, hinting at a potential enhancement in the efficiency of IDS.
Furthermore, with the growing adoption of the Internet of Things (IoT) and the challenges associated with the limited resources of these devices, ref. [19] explored the applicability of an IDS adapted for resource-constrained environments. IoT devices are computationally limited in resources, memory space, and energy, making it challenging to implement robust security measures. The study proposed using Benford’s Law to differentiate the sizes of network flows and implemented linear regression to process this information. This approach can effectively identify abnormal traffic, even with limited resources, by taking advantage of the distribution patterns inherent in the sizes of network flows. The results showed that this approach could be practical for IoT systems, offering a viable solution that requires fewer computational resources, less memory space, and lower energy consumption.
These studies highlight the versatility and applicability of Benford’s Law in different contexts within cybersecurity, suggesting avenues for future research that could expand its use in intrusion detection systems adapted to contemporary digital security requirements. Distributed Denial of Service (DDoS) attacks, such as SYN flood or ICMP smurf, are often perpetrated using packets generated by malicious scripts or programs. In response to these challenges, Kemal Hajdarevic et al. [28] propose an innovative method based on Benford’s Law to detect abnormal network traffic packets by analyzing real-time data and focusing on packet size.
In addition, zero-day attacks, which refer to unknown vulnerabilities in software, remain a significant threat. These vulnerabilities, when exploited, can allow unauthorized access or destabilization of critical systems before patches or preventative measures can even be applied. These attacks are particularly dangerous because they are not yet known to the software vendor and can, therefore, be used by hackers to gain unauthorized access to systems. Traditionally, network traffic analysis (NTA) is performed by machine learning (ML)-based network intrusion detection systems (NIDSs), whose effectiveness is often compromised by redundant features such as IP addresses. Ref. [29] addressed this issue by using Benford’s Law to extract meaningful network features, assessing the relevance of a feature by whether it complies with or violates Benford’s Law in benign and malicious traffic, respectively. This study used a semi-supervised ML-based approach, comparing feature sets identified in the literature.
Historically, approaches that apply Benford’s Law to network intrusion detection have been restricted to limited features, often excluding negative or zero digits. The exclusion of zero digits, which is not applicable in logarithmic functions, and manipulating negative digits using the modulus are practices discussed in the literature [20]. However, these exclusions can result in the loss of critical information for detecting attacks. In addition, studies have mainly been limited to using the Chi-squared test and, occasionally, Euclidean distance for evaluation [30,31]. Considering a more comprehensive range of features and evaluation methods, these limitations emphasize the need for more comprehensive and robust research into applying Benford’s Law to network intrusion detection.
Our knowledge about the nature of flows in computer networks is limited and characterized by uncertainty. Incorporating Bayes’ Theorem, Benford’s Law, and distance functions has facilitated inference based on the available flow data in the dataset. The combination of the p-values, calculated from the discrepancies between the observed and expected frequencies according to Benford’s Law and assuming prior knowledge about the proportion of malicious flows, aimed to refine the detection model to increase the accuracy of identifying these flows and minimize the rate of false positives and negatives. The main aim of integrating Bayes’ Theorem was to calculate the probability of a flow being malicious from the p-values obtained by the distance functions, generating a new p-value for each distance function and then combining them into a single global p-value. From a statistical point of view, the fusion of Benford’s Law with Bayesian updates has added a layer of mathematical rigor, seeking to increase precision and reliability in analyzing each network flow. This innovative model stands out for its scalability, which makes it capable of managing large volumes of data without significantly increasing computing resources, making it ideal for environments with expanding network traffic. In addition, the flexibility of the statistical models makes it possible to adjust the probability thresholds and criteria for identifying malicious flows according to Bayes’ Theorem, adapting to different operational contexts or specific security requirements.

2.2. Challenges and Strengths

Studies that apply Benford’s Law to detect malicious flows highlight several shortcomings. The first of these is the complexity of new attacks, which can alter or camouflage features in network flows, compromising the effectiveness of statistical analysis. In addition, the difficulty of adapting Benford’s Law to all types of data, especially in the massive presence of zero digits, and the increased complexity of the model, with the inclusion of distance functions and Bayesian inference, can pose real challenges in validating the model and minimizing false positives and negatives. On the other hand, the adaptation of Benford’s Law, distance functions, and Bayesian inference to new attack patterns has not yet been fully explored; zero-day attacks and advanced evasion methods can be imperceptible to approaches based on traditional statistical patterns.
However, despite these shortcomings, integrating Benford’s Law with the various distance functions can strengthen the model, making it more robust in identifying malicious flows. This enhancement can result in significant benefits, such as the ability to detect subtle deviations in data patterns that may indicate malicious activity, noise filtering and accuracy in identifying malicious flows, providing a more solid basis for security decisions. Additionally, introducing Bayesian inference could allow the probabilities to be continuously updated as new data are received, making the model adaptable to new threats.
The exclusive use of the Chi-squared test can also be limited, especially in situations with subtle anomalies in network traffic. Incorporating Bayesian inference and other statistical methods could increase the model’s sensitivity to these anomalies.
Finally, Benford’s Law assumes a specific distribution of the first digits. In networks where data are manipulated or distorted by malicious activity in subtle ways, the empirical application of this law can be ineffective, resulting in false positives or negatives. Applying distance functions to detect distortions in frequencies and distances between them is crucial to overcoming this weakness.

3. Benford’s Law and Distance Functions

3.1. Benford’s Law

Benford’s Law, known as the law of the first digit, is an empirical law stating that the first digits are not uniformly distributed (i.e., each digit does not occur with frequency 1 ÷ 9 ≈ 0.11); instead, the digit 1 occurs with a frequency of 30.10%, the digit 2 with 17.6%, and so on.
Let $X$ be an independent and identically distributed (i.i.d.) random variable, such that $X = \{X_1, X_2, \ldots, X_n\}$, $i = 1, 2, 3, \ldots, n$, $n \in \mathbb{N}$, and let $D_i(X)$ represent the $i$th significant decimal digit of $X$. The probability mass function that describes Benford’s Law is given by Equation (1):
$$P(D_i(X) = d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad \text{if } d \in \{1, 2, 3, \ldots, 9\} \qquad (1)$$
Definition 1 represents the basic notion governing Benford’s Law and is implicit in the meaning of a number, i.e., the value of its mantissa. Given a decimal number, the mantissa determines the first significant digit. For example, for the number 0.014, the first significant digit given by the mantissa is 1 [32].
Definition 1
(Mantissa). The mantissa represents the decimal part in the calculation of the logarithm of a number, written $\log S(x)$. $S(x)$ is the unique number $r \in \left[\tfrac{1}{10}, 1\right)$ with $x = r \times 10^{n}$ for some integer $n$.
Benford’s Law is based on three fundamental properties:
  • The distribution of significant digits is invariant concerning the change of scale.
  • The distribution of significant digits is continuous and invariant concerning the change of base.
  • The fractional parts of the logarithms are uniformly distributed in the interval $[0, 1)$.
Theorem 1 extends Benford’s Law to negative numbers.
Theorem 1.
Given a sequence of real numbers $(x_n)$, $n \in \mathbb{N}$, the law is applied to the sequence of logarithms of absolute values, $\log|x_n| = \log|x_1|, \log|x_2|, \dots$, so that negative terms are handled through their magnitudes.
Benford’s Law is not only defined for the first digit but can be extended to two or more digits. Thus, Theorem 2 defines the general Benford’s Law that allows obtaining the occurrence frequency of one or more digits [33].
Theorem 2
(General law). Let $k \in \mathbb{Z}^{+}$, $d_1 \in \{1, 2, 3, \dots, 9\}$ and $d_j \in \{0, 1, 2, \dots, 9\}$, $j = 2, \dots, k$. Then
$$P(D_1 = d_1, \dots, D_k = d_k) = \log_{10}\left(1 + \frac{1}{\sum_{i=1}^{k} d_i \times 10^{\,k-i}}\right) \tag{2}$$
We can find proof of the general Benford’s law, described in Theorem 2, in [33,34].
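As an illustrative sketch of Theorem 2 (a hypothetical helper, not taken from the paper's implementation), the following Python function computes the Benford probability of an arbitrary leading digit block; for a single digit it reduces to Equation (1):

```python
import math

def benford_prob(digits):
    """Probability that a number begins with the given digit sequence
    (Theorem 2): log10(1 + 1 / (d1*10^(k-1) + d2*10^(k-2) + ... + dk))."""
    k = len(digits)
    block = sum(d * 10 ** (k - 1 - i) for i, d in enumerate(digits))
    return math.log10(1 + 1 / block)
```

For example, `benford_prob([1])` recovers the first-digit probability of about 0.3010, and summing `benford_prob([a, b])` over all two-digit blocks 10–99 gives 1, confirming a valid distribution.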
To meet the specifications of Benford’s Law for extracting the first digit, the absolute value (modulus) of each number was taken to eliminate negative values. In addition, the numbers were rounded to avoid decimals, thus allowing the most significant digit to be extracted from the data. The technique adopted for this extraction follows the methodology proposed by [20], detailed in the study “Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection”. Equation (3) describes the specific formula used:
$$D_{\text{collapsed}} = 10 \times \frac{|a|}{10^{\operatorname{int}(\log_{10}|a|)}} \tag{3}$$
where $D_{\text{collapsed}}$ represents the collapsed form of the number $a$ and $\operatorname{int}$ denotes the function that truncates to an integer. The modulus of the number is taken to make the value positive.
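A minimal Python sketch of this extraction step follows; it is illustrative only (the paper's implementation used Matlab) and assumes the floor of the base-10 logarithm is used, so that numbers smaller than 1, such as 0.014, are handled correctly:

```python
import math

def first_digit(a):
    """Most significant digit of a nonzero number: collapse |a| into the
    interval [1, 10) by dividing by 10^floor(log10|a|), then truncate."""
    a = abs(a)  # modulus removes the sign, as described in the text
    collapsed = a / 10 ** math.floor(math.log10(a))
    return int(collapsed)
```

For instance, `first_digit(0.014)` returns 1 and `first_digit(-512)` returns 5, matching the mantissa example given above.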

3.2. Distance Functions

3.2.1. Kolmogorov–Smirnov Test

The Kolmogorov–Smirnov test is a non-parametric goodness-of-fit test often used to check whether two samples follow the same probability distribution. Its test statistic quantifies the maximum absolute distance between the empirical distribution function and the reference distribution obtained from the reference sample, making the test sensitive to deviations between the two distributions both locally and globally [35,36].
We chose the KS test because we needed to verify whether each network flow follows the same distribution as Benford’s Law, i.e., to decide between two hypotheses:
$$H_0: P = P_0 \quad \text{vs.} \quad H_1: P \neq P_0$$
where P 0 refers to each data flow and P refers to the distribution of Benford’s Law.
The dataset, consisting of n network flows, comprises X i independent and identically distributed (i.i.d.) variables, whose empirical distribution function is usually given by Equation (4).
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x) \tag{4}$$
Contrary to the usual practice in the literature, we do not derive the reference distribution function from the original dataset; instead, we assume it is given by Benford’s Law. The aim is to check whether deviations exist between the i.i.d. variables of each flow by comparing them with Benford’s empirical law. If the deviations observed for a network flow are considerable, we can assume we are dealing with a malicious flow.
These are important ideas to retain when using the K-S test in this research. From a distribution function $F_X(x)$, we can define an empirical cumulative distribution function (c.d.f.), given by Equation (4), which accounts for the proportion of sample points below the level $x$. For each $x \in \mathbb{R}$, the law of large numbers implies that $F_n(x) \to F_X(x)$, as given by Equation (5).
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x) \longrightarrow E\left[I(X_i \le x)\right] = F_X(x). \tag{5}$$
From Equations (4) and (5), we can conclude that the empirical proportion converges for every $x \in \mathbb{R}$ and that, when there are no large deviations between the distribution function and the empirical distribution function, the difference between the two functions tends to zero.
Theorem 3.
If the distribution function $F_X(x)$ is continuous, then the distribution of
$$\sup_{x \in \mathbb{R}} |F_n(x) - F_X(x)| \tag{6}$$
does not depend on $F_X$.
In Equation (6), $\sup$ denotes the supremum, i.e., the least upper bound of the set of distances.
This investigation used the KS test to check whether the probability distributions for the network flow and Benford’s Law differed. In this sense, the equation given by Theorem 3 can be changed to Equation (7).
$$\sup_{x \in \mathbb{R}} |F_{1,n}(x) - F_{2,m}(x)| \tag{7}$$
with $F_{1,n}(x)$ and $F_{2,m}(x)$ being the empirical distribution functions of the set of network flows and of Benford’s Law, respectively. In this particular case, if $F_{1,n}(x)$ and $F_{2,m}(x)$ are the corresponding c.d.f.s, then the test statistic is given by Equation (8).
$$D_{n,m} = \sqrt{\frac{m \times n}{m+n}} \times \sup_{x \in \mathbb{R}} |F_{1,n}(x) - F_{2,m}(x)| \tag{8}$$
whose null hypothesis will be rejected at significance level α if
$$D_{n,m} > \sqrt{-\frac{1}{2} \ln\left(\frac{\alpha}{2}\right) \times \frac{m \times n}{m+n}} \tag{9}$$
Table 1 shows an example of applying the Kolmogorov–Smirnov test to a network flow, such as flow 14, following the procedure described in Table 2.
Based on the values obtained in Table 1, Figure 1 shows the highest value of the differences between the cumulative functions.
Following the procedure described in Table 2, the D t e s t value = 0.2545 . To check whether the flow is malicious or benign, it is necessary to compare the value obtained in D t e s t with a critical value. Considering a significance level of 0.05 and using Equation (9), we obtain a critical value of 4.073 . As D t e s t < D c r i t i c a l , there is not enough statistical evidence to reject H 0 , so we conclude that the flow is not malicious.
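The KS comparison against Benford's Law can be sketched as follows. This hypothetical Python fragment (not the authors' Matlab procedure) computes the supremum of Equation (7) over the nine first-digit categories; the critical-value comparison of Equation (9) is omitted:

```python
import math

# Empirical first-digit frequencies of Benford's Law, Equation (1)
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def ks_statistic(observed_freqs):
    """Supremum of the absolute difference between the cumulative observed
    first-digit frequencies and the cumulative Benford frequencies."""
    d_max = cum_obs = cum_ben = 0.0
    for obs, ben in zip(observed_freqs, BENFORD):
        cum_obs += obs
        cum_ben += ben
        d_max = max(d_max, abs(cum_obs - cum_ben))
    return d_max
```

A flow whose digit frequencies match Benford's Law exactly yields a statistic of zero, while a uniform digit distribution yields roughly 0.27.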

3.2.2. Mean Absolute Deviation

The Mean Absolute Deviation (MAD), given by Equation (10), is a measure of compliance with Benford’s Law that returns the average deviation between the frequency with which each digit occurs and the empirical frequency of each digit [37]. Usually, the Mean Absolute Percentage Error (MAPE) is used; although an adaptation of the MAD, it measures the accuracy of fitted time series values. The smaller the difference between the real and empirical frequencies, the closer the fit to the real values, producing forecasts with high certainty. The MAD makes it possible to compare graphically the average deviation between the heights of the bars, i.e., between the actual proportion of each digit and the proportion expected by Benford’s Law, in a two-dimensional chart. Figure 2 shows the MAD between Benford’s Law and flow 14.
For this reason, this research used only the MAD rather than the MAPE. Larger mean absolute deviations necessarily imply a larger mean difference between the actual and expected proportions, strongly suggesting the presence of anomalies in the data and, therefore, of possible malicious flows [38].
$$MAD = \frac{\sum_{i=1}^{N} |F_{r,i} - E_{f,i}|}{N} \tag{10}$$
where $F_{r,i}$ is the real frequency of digit $i$, $E_{f,i}$ is the empirical frequency of digit $i$ under Benford’s Law, and $N$ represents the number of bins, which equals 9 for the first digit.
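A direct Python sketch of Equation (10), comparing observed first-digit frequencies with Benford's empirical frequencies (illustrative, not the authors' Matlab code), could read:

```python
import math

# Empirical first-digit frequencies of Benford's Law, Equation (1)
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def mad(observed_freqs):
    """Mean Absolute Deviation between the observed frequency of each digit
    and the frequency expected under Benford's Law, Equation (10)."""
    return sum(abs(fr - ef) for fr, ef in zip(observed_freqs, BENFORD)) / len(BENFORD)
```

A perfectly Benford-conforming flow gives a MAD of zero; larger values indicate larger average deviations and thus possible anomalies.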

3.2.3. Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence, usually known as relative entropy, is a fundamental metric in information theory and probability theory used to measure the discrepancy between two probability distributions over the same random variable x [21]. This non-symmetric metric quantifies how a probability distribution q ( x ) , which can represent an observed empirical frequency, deviates from a model distribution p ( x ) . In the context of our research, p ( x ) corresponds to the theoretical frequency predicted by Benford’s Law for the occurrence of first digits. In line with the procedure carried out in the Kolmogorov–Smirnov test for flow 14, Figure 3 shows the Kullback–Leibler divergence between each digit’s occurrence frequency in flow 14 and the empirical frequency from Benford’s Law.
Specifically, the KL divergence from $q(x)$ to $p(x)$, denoted by $D_{KL}(p(x) \,\|\, q(x))$, provides a quantitative measure of the information lost when $q(x)$ is used to estimate $p(x)$. This analysis assumes that $p(x)$ and $q(x)$ are probability distributions of a discrete random variable $x$. Both distributions must be strictly positive, $q(x) > 0$ and $p(x) > 0$, throughout the sample space $X$, and each must sum to 1 [39].
For our research, we apply the discrete version of the KL divergence, since the observed digit frequencies are defined over the nine first-digit categories. Equation (11) defines the KL divergence from $q(x)$ to $p(x)$.
$$D_{KL}\left(p(x) \,\|\, q(x)\right) = \sum_{x \in X} p(x) \times \ln\frac{p(x)}{q(x)} \tag{11}$$
where $p(x)$ and $q(x)$ are the probabilities assigned to $x$ by the distributions $p$ and $q$, respectively. This approach provides a detailed analysis of the divergences between the frequencies of occurrence of the digits and the theoretical expectations of Benford’s Law, which is essential for understanding and quantifying the anomalies in the analyzed datasets [40].
Although the KL divergence is not a distance function, it does have several important properties.
  • Non-symmetric: $D_{KL}(p(x) \,\|\, q(x)) \neq D_{KL}(q(x) \,\|\, p(x))$;
  • Non-negative: $D_{KL}(p(x) \,\|\, q(x)) \ge 0$, with $D_{KL}(p(x) \,\|\, q(x)) = 0$ if and only if $p = q$.
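Equation (11) can be sketched in Python as follows (an illustrative fragment; both distributions are assumed strictly positive, as required above):

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q), Equation (11).
    p and q are sequences of strictly positive probabilities summing to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The sketch also exhibits the two properties listed above: the divergence of a distribution from itself is zero, and swapping the arguments generally changes the value.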
Fisher’s method was implemented to integrate the p-values derived from the mean absolute deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence. This approach calculated a more robust and sensitive p-value to minimize the number of false positives.
To summarize, the choice of the Mean Absolute Deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence was based on various aspects that distinguish them from other functions:
Kolmogorov–Smirnov test (KS):
  • Robustness of comparisons: The KS test is robust for comparisons between the empirical frequency of Benford’s Law and the frequency of occurrence of the digits.
  • Sensitivity to deviations: This test shows greater sensitivity to distribution deviations.
  • Non-Parametric Nature: Considering that data from a computer network can be irregular, the non-parametric nature of the KS test is advantageous, as it does not require the data to follow a specific distribution.
Comparison with other tests:
  • Chi-Square Test: Unlike the chi-square test, the KS test does not rely on predefined data categories, thus avoiding information loss. In addition, the KS test can be applied when the null hypothesis is well defined, which is not always possible with the chi-square test.
  • Anderson–Darling test: Although similar to the KS, the Anderson–Darling test is more complex and less intuitive, making the KS preferable for many applications.
Kullback–Leibler Divergence (KL):
  • Assessment of Proximity between Distributions: The KL divergence is widely used in data mining literature to check the closeness between two distributions. The lower the value obtained, the closer the distributions are.
  • Directed and Asymmetric Analysis: The asymmetric and directed nature of KL divergence allows for a detailed analysis of discrepancies between the frequency of digits and the empirical frequency of Benford’s Law.
  • Sensitivity to Small Differences: KL divergence is particularly sensitive to slight differences between distributions, making it helpful in detecting subtle anomalies.
Comparison with other tests:
  • Jensen–Shannon Divergence: Although the Jensen–Shannon divergence is a symmetrized version of KL, the simplicity and sensitivity of KL make it preferable for many analyses.
  • Mahalanobis distance: Although the Mahalanobis distance effectively detects multivariate anomalies, the KL is better suited to measuring differences in probability distributions.
Mean Absolute Deviation (MAD):
  • Simplicity and straightforward interpretation: The Mean Absolute Deviation is simple to calculate and interpret, directly measuring the discrepancies between the observed frequencies and those expected by Benford’s Law.
  • Less Sensitivity to Outliers: This method is less sensitive to outliers, especially in digit 1 of Benford’s Law, which makes it preferable to the mean square deviation.
Conclusion:
The choice of the Kolmogorov–Smirnov, Kullback–Leibler, and Mean Absolute Deviation distance functions considered each method’s robustness, sensitivity, and simplicity. These characteristics make them particularly useful for analyzing and evaluating network flows, providing more effective detection of flow anomalies and irregularities [41,42,43,44,45].

3.3. Fisher’s Method

The Fisher method facilitates the aggregation of multiple p-values from independent tests into a single composite value [46]. Since the p-values are independent, being derived from uncorrelated distance functions, the formula employed is given in Equation (12).
$$T = -2 \times \sum_{i=1}^{3} \ln p_i \tag{12}$$
where p i represents the p-values obtained from the three distance functions: the mean absolute deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence. Finally, to combine these p-values, the resulting statistic, T, follows a Chi-squared distribution with 2 k degrees of freedom, given by Equation (13).
$$P(X > T) = 1 - \int_0^{T} f(x; df)\, dx \tag{13}$$
where f x ; d f represents the probability density function of the Chi-squared distribution with d f = 6 degrees of freedom [47].
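Fisher's method can be sketched in Python. Because the Chi-squared distribution with an even number of degrees of freedom (here $df = 6$) has a closed-form survival function, no statistics library is needed; this is an illustrative fragment, not the authors' implementation:

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's method, Equations (12)-(13): T = -2 * sum(ln p_i) follows a
    Chi-squared distribution with 2k degrees of freedom, k = number of tests.
    For even df = 2k the survival function is
    P(X > T) = exp(-T/2) * sum_{j=0}^{k-1} (T/2)^j / j!."""
    k = len(pvalues)
    T = -2.0 * sum(math.log(p) for p in pvalues)
    half = T / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
```

Combining three p-values of 0.5 yields a global p-value of about 0.655, while three small p-values combine into an even smaller one, which is the behavior exploited here to sharpen the detection of malicious flows.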

3.4. Tippett’s Method

The Tippett test is another methodology used to generate a global set of p-values from the p-values derived from the distance functions. This test, denoted by T p , is modelled by the beta distribution and is described by Equation (14). The resulting global p-value is defined by Equation (15), where the choice for each global p-value results from Equation (16). This test was chosen because of its similarity to the Bonferroni method, which minimizes false positives [48].
$$T_p = \min(p_1, p_2, p_3) \tag{14}$$
$$p = 1 - \left(1 - p_{(1)}\right)^{n} \tag{15}$$
where
$$p_{(1)} = \min\left(p_{MAD},\, p_{KS},\, p_{KL}\right) \tag{16}$$
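Tippett's method, as given by Equations (14)-(16), reduces to a few lines. A Python sketch (illustrative only) with $n$ combined tests:

```python
def tippett_combined_pvalue(pvalues):
    """Tippett's method, Equations (14)-(16): the global p-value is
    1 - (1 - min(p_i))^n, where n is the number of combined tests."""
    n = len(pvalues)
    p_min = min(pvalues)
    return 1.0 - (1.0 - p_min) ** n
```

With the three p-values from the MAD, KS, and KL functions, $n = 3$; a single very small p-value dominates the result, mirroring the Bonferroni-like behavior mentioned above.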

4. Model Architecture

This section details the architecture used to exploit Benford’s Law, distance functions, and Bayes’ Theorem to identify intrusions in computer networks by analyzing data flows.

4.1. Natural Law-Based Method

The proposed model uses Benford’s Law combined with three specific distance measures: the Mean Absolute Deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence, described in Section 3.2. Each distance function was evaluated individually in terms of results and globally when integrated with Bayes’ Theorem, allowing for a holistic assessment of discrepancies in the analyzed data and possible improvements in minimizing false positives.
The process begins by examining the first digit of the data, which is extracted after a preliminary data reduction stage. The main application consists of using Benford’s Law on the first digit to identify abnormal patterns that may indicate the nature of the data flow. Based on these patterns, the model distinguishes between malicious and benign flows, enabling a subsequent evaluation of the model’s performance.
Figure 4 depicts the system’s architecture and comprises three main components: pre-processing, processing, and analyzing results. Each block plays a crucial role in data utilization and the overall effectiveness of network intrusion detection.
Figure 5 depicts the general architecture of the model, highlighting the three main blocks represented in Figure 4. It includes each stage of the model, based on Benford’s Law, distance functions, and Bayes’ Theorem. In addition, the ensemble developed to aggregate the p-values obtained through Bayes’ Theorem is detailed in Sections 3.3, 3.4 and 5.4.
Given that the dataset used in the research is public, the only checks required were for the presence of non-numerical data and correct labelling. We implemented a set of scripts to develop a functional model based on Benford’s Law and distance functions. These scripts facilitated not only the extraction of the first digit but also the calculation of features aligned with Benford’s Law and the measurement of the distance between the frequencies of the flow features and the empirical frequencies of the law. Subsequently, it was essential to integrate the distance functions to generate a single p-value from the individual p-values, allowing network flows to be categorized as malicious or benign.
After initially analyzing the dataset used in the research, the pre-processing phase began, which involved reducing the data using Microsoft Excel for each characteristic presented in Table 3. Initially, considering that the dataset is numerical, we chose not to carry out any data cleaning or normalization, to keep the dataset as close as possible to its original form. Subsequently, given the heterogeneity of the values in the dataset, which comprises integer and decimal values, the number-collapse procedure was applied so that the dataset consisted only of positive integer values. It is important to emphasize that the zero digit was not removed, as its extraction could result in a significant loss of information.
After transforming the numbers, the most significant digit of each characteristic was extracted for subsequent calculation of the Pearson correlation between the frequency of occurrence of each digit and the empirical frequency of Benford’s Law. This process generated a percentage indicating the degree of correlation between the variables. However, it was observed that certain characteristics with only two digits, such as 0 and 1, produced high correlation values, which could lead to erroneous conclusions about whether these characteristics follow Benford’s Law. This is a significant limitation of the Pearson correlation between Benford’s Law and the frequency of occurrence of digits for features with a narrow distribution of occurrences: high values in the first digit could lead to errors in the classification of flows and thus affect the model’s accuracy, as seen in Table 9, Section 5.1.
Given the use of Benford’s Law to detect anomalies in the distribution of digits, a significant bias in the results relating to the detection of malicious flows is unlikely. Applying the model to a predefined set of attacks could limit its effectiveness in detecting other types of anomalies; however, this limitation is not relevant in this study, since what is analyzed is possible anomalies in the digits according to Benford’s Law.
On the other hand, the heterogeneity of the data, which include integer and decimal values, can introduce additional complexity to the processing. How the data are converted must be considered carefully to mitigate potential errors that could distort the characteristics of the flows. To mitigate possible biases in the data, it is also advisable to implement additional analyses, namely analysis of variance (ANOVA) and outlier analysis using methods such as the interquartile range (IQR). Including these additional analyses helps ensure that the anomaly detection results are accurate.
Figure 6 schematically illustrates this pre-processing phase. The original dataset contained several captures of attacks that occurred on different days of the week, stored in .CSV format, with values separated by commas.
To centralize the information, a representative sample of network flows for each type of attack was selected and compiled into a new dataset called NetworkFlows. Subsequently, a Matlab script was developed to extract the first digit of each network flow, storing these data in a specific digit matrix.
After the data preparation and reduction phase, we begin the data processing process, which is divided into two main stages. The first involves calculating the frequency of occurrence of each digit, with the results stored in a frequency matrix. Using the data in the digit matrix, we developed a Matlab script to determine the frequency of each digit based on the features identified during the pre-processing phase. This process makes it possible to calculate the distance between the observed frequencies of each digit and the corresponding frequencies according to Benford’s Law. The procedure for calculating the frequency of occurrence of each digit in network flows is simple. It divides the number of occurrences of each digit by the total sum of occurrences of all flows. In the same way, we calculate the total frequency for all occurrences of the digits. The values obtained are stored in a matrix of digits and then used to calculate the p-value, which is essential for classifying each network flow as malicious or benign. In this stage, the distance functions are also implemented for the subsequent p-value calculation in the second processing stage. The second stage applies distance functions to compare these frequencies with the empirical frequencies predicted by Benford’s Law to classify each network flow. This approach is schematized in Figure 7.
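The frequency calculation described above can be sketched in Python (a simplified, illustrative stand-in for the Matlab script; the function name is hypothetical):

```python
from collections import Counter

def digit_frequencies(first_digits):
    """Frequency of occurrence of each first digit (1-9) in a flow:
    the count of each digit divided by the total number of occurrences,
    as stored in the frequency matrix."""
    counts = Counter(first_digits)
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in range(1, 10)]
```

The resulting nine-element vector sums to 1 and is what the distance functions compare against the empirical frequencies of Benford's Law.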
The second processing stage includes applying the specific distance functions (Mean Absolute Deviation, Kolmogorov–Smirnov test, and Kullback–Leibler divergence) and forming an ensemble to integrate these measures. The aim is to classify each network flow in the dataset efficiently. The classification is carried out graphically and probabilistically, making it easier to visualize the discrepancies between the observed frequencies of each digit and the empirical frequencies predicted by Benford’s Law.
In addition to calculating the p-value from each distance function, statistical inference was used for classification, using Bayes’ Theorem. This method aggregated the distance functions to calculate the overall p-value for each flow. Both this method and the subsequent one made it possible to compare the results with different degrees of statistical significance through hypothesis tests. The calculated p-value indicates the probability of obtaining a test statistic as extreme as the one observed, assuming the null hypothesis is accurate, and defines the lowest significance level for rejecting the null hypothesis based on the data analyzed.
After calculating the distances and p-values for each network flow, these data are stored in a matrix of values and subjected to a comparative analysis with different degrees of statistical significance. The methodology for classifying each flow as malicious or benign is based on the following hypotheses:
  • H 0 : “The network flow is benign”;
  • H 1 : “The network flow is malicious”.
If the p-value is less than the established significance level, there is strong statistical evidence to reject the null hypothesis, indicating that the network flow is malicious. The significance levels adopted were 0.1 , 0.01 , and 0.05 , following the standards usually recognized in the literature. Table 4 details the configurations used, including the parameter settings for the statistical tests applied and the threshold values set for anomaly detection.
The results of the classification of each flow are stored in a .txt document, labelled 1 for flows classified as malicious and 0 for benign. These data are then contrasted with the actual classifications of each flow, making it easier to draw up a confusion table to assess the model’s accuracy.
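The hypothesis-test classification described above can be sketched as follows (an illustrative Python fragment, with the significance level as a parameter; the label convention matches the one used in the results file):

```python
def classify_flow(p_value, alpha=0.05):
    """Reject H0 ('the network flow is benign') when the p-value is below
    the significance level; label 1 for malicious, 0 for benign."""
    return 1 if p_value < alpha else 0
```

The same flow may therefore be labelled differently at the significance levels 0.1, 0.05, and 0.01 adopted in the experiments, which is why the results are compared across all three.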

4.2. Dataset

The experiments analyzed 29,000 network flows, consisting of 10,000 benign and 19,000 malicious flows, extracted from the CICIDS2017 dataset. This dataset includes network flows covering various attacks and benign flows and is available for consultation at [50].
The use of the CICIDS2017 dataset instead of more recent versions was due to several factors:
  • High data quality and controlled environment: CICIDS2017 offers high-quality data captured in a controlled environment, guaranteeing the reliability and consistency of the results.
  • Well-defined variety of attack types: The dataset presents a clear diversity of attack types and precise labelling of flows as malicious or benign, allowing the results obtained by applying Benford’s Law to be compared with the original labels, facilitating the classification of flows.
  • Extensive use in previous studies: Numerous studies using CICIDS2017 allow for directly comparing the results obtained with those of other investigations. One example is Mbona’s work, which used CICIDS2017 with Benford’s Law for feature selection.
These flows were analyzed using the CICFlowMeter tool, version 3.0. This open-source software generates bidirectional records from pcap files and extracts features from them, determining the direction of packets from the first packet between source and destination [49]. The research was based on an unbalanced dataset to reproduce what happens daily in computer systems.
The flows were categorized as benign or malicious based on the type of attack, date and time, source and destination IPs, ports used, and protocols. These flows were then stored in .csv files. The dataset was captured between 9 a.m. on Monday (3 July 2017) and 5 p.m. on Friday (7 July 2017). Except for Monday, which only recorded benign traffic, the remaining days included benign and malicious flows. Table 5 summarizes the days of the week, the types of attacks, and the number of flows analyzed to make up the dataset used in the experiments.
The feature selection process was meticulously carried out in two crucial phases. In the first phase, features were extracted from the network traffic packets, while the second phase focused on selecting these features for the various studies. The initial feature extraction phase was carried out using the CICFlowMeter tool, version 3.0, generating realistic traffic for constructing the dataset. Sharafaldin [49] proposed the B-Profile system to create a profile of the abstract behavior of human interactions and generate naturalistic and benign traffic. To build the dataset, the behavior of 25 users was modelled based on the HTTP, HTTPS, FTP, SSH, and email protocols. Eleven criteria were identified: complete network configuration, traffic, labelled dataset, complete interaction, complete capture, available protocols, attack diversity, heterogeneity, resource set, and metadata.
The process of extracting characteristics in the initial phase began with capturing the packets travelling on the network, which were then grouped into flows according to criteria such as source and destination IP addresses, source and destination ports, and transport protocols (UDP, TCP, among others). For each network flow, CICFlowMeter determined a set of features that allow a detailed and differentiated description of the flow, such as:
  • Time features: flow duration, time between packets (minimum, average, maximum time, and standard deviation).
  • Size features: smallest, average, largest, and total packet size.
  • Count features: total number of packets in the flow and count of TCP, UDP, and ICMP packets.
  • Header features: number of TCP flags.
  • Statistical features include the calculation of flow entropy, packet per second rate, and byte per second rate.
After extraction, the features were organized into flow records structured in tables. Each flow record represents a single network flow and includes all the calculated features. Finally, the flow records were stored in .csv format. More details on the extracted features are available on the GitHub [51] project.
In the second phase, the selection of features was based on calculating the Pearson correlation between the frequencies of occurrence of the first (most significant) digit and the empirical frequency of Benford’s Law. The numbers in the dataset were adjusted using the digit-collapse procedure to avoid decimal numbers with leading zeros; this transforms any decimal number with leading zeros into a number whose significant digit differs from 0. After this adjustment, the most significant digit was extracted, and Pearson’s correlation was calculated. The features were then selected based on the correlation values obtained: those with values of 70% or above, 80% or above, and 90% or above. Such correlation values indicate a strong relationship between the variables, showing their natural dependence.
Two fundamental aspects justify this imbalance between benign and malicious flows. Firstly, in a real-world context, malicious and benign events are disproportionate in size and frequency. Thus, a dataset that reflects this disproportion offers a more realistic and challenging test environment for developing intrusion detection systems, ensuring that models can operate effectively in real environments. Studies such as [52] on the ROC curve in unbalanced environments show that models trained under such conditions can achieve more representative accuracy in detecting minority classes, which are often of greater interest.
On the other hand, an unbalanced dataset favors improvement in evaluating anomalous behavior. A model developed from an unbalanced dataset makes it possible to identify and analyze features potentially indicative of malicious activity. This process increases sensitivity in detecting new or rare forms of attack. In fact, ML techniques, such as those discussed by [53], which include oversampling methods such as SMOTE (Synthetic Minority Over-sampling Technique) and undersampling techniques, can be applied to adjust the class distribution without compromising the integrity of the data, thus maintaining effectiveness in detecting anomalies. Inspired by these studies, we sought to apply this principle to malicious flow detection using a purely statistical model.

4.3. Evaluation Metrics for Classification

Applying Benford’s law to the model resulted in a binary classification, assigning 1 to malicious network flows and 0 to benign ones. In this context, 1 is interpreted as a true positive and 0 as a true negative, forming two distinct classes. However, there are cases where a malicious flow can be wrongly classified as benign (false negative) or a benign flow as malicious (false positive). It is essential to evaluate the model’s performance considering these discrepancies, which is accomplished through the confusion matrix. Table 6 shows the confusion matrix customized for our analysis, as described in [32].
The relationship between True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) allows the model to be evaluated using a set of widely documented metrics used in machine learning models, including Accuracy, Precision, Recall, and F1-score, as detailed in [54].

5. Results of the Proposed Model

This section describes and discusses the results obtained using the model based on Benford’s law and the distance functions by analyzing the p-value for each distance function and using Bayes’ Theorem combined, according to the metrics defined in Section 4.3. These results were obtained by comparing the original labels of each flow in the network with those received after the data processing phase. The classification of flows into benign and malicious was obtained by comparing the p-value with the previously defined significance levels.
This research focused on identifying malicious flows in computer networks. It analyzed a dataset of 29,000 flows, of which 10,000 were benign and 19,000 malicious, covering various attacks. During the pre-processing phase, an analysis was conducted to determine which features adhered to Benford’s Law, as detailed in Section 4. Of the 78 features evaluated, 41 showed compliance with the law, as shown in Table 7. These features were correlated with Benford’s empirical distribution for each digit.
The experiment was structured in different stages, according to the two phases described in Section 1, and was based on the correlation of these features with the data:
  • First phase:
    - First Stage: Features with a correlation of 70 % or more were grouped into Cluster 1, indicating a substantial correlation between the observed frequencies.
    - Second Stage: Features with a correlation of 80 % or more were grouped into Cluster 2, reflecting a strong correlation between the frequencies.
    - Third Stage: Features with a correlation of 90 % or more were grouped into Cluster 3, highlighting a robust correlation. Each stage was planned to ensure a rigorous and detailed analysis of data trends under Benford’s Law.
    - Fourth Stage: A comparison was made between the number of features extracted by the method based on Pearson’s correlation and by methods based on distance functions. The results show that the correlation technique selects the features best suited to identifying malicious flows more effectively.
  • Second phase:
    - An ensemble was developed from the p-values to maximize the detection of malicious flows, reducing the number of false positives and improving the evaluation of the model.

5.1. First Stage: Features with a Correlation of 70 % or More

All the network flows were then processed, resulting in the graphic shown in Figure 8. Graphical analysis reveals some discrepancies in the digits, particularly digits 2, 4, and 6. It can be seen that digits 2 and 4 occur less frequently than expected, while digit 6 appears more frequently than predicted by the empirical distribution. These discrepancies suggest the presence of malicious flows in the dataset. Pearson’s correlation was calculated to determine whether the dataset adheres to Benford’s Law, resulting in a value of 98.11 % . This high percentage indicates that the dataset generally follows Benford’s Law. However, the anomalies observed in the graphics indicate flows that do not conform to this law, suggesting the existence of anomalous flows in the data.
Subsequently, the p-values of the distance functions were calculated, as detailed in Section 3, whose main objective in this initial phase of the investigation is the detection of potentially malicious flows. Table 8 presents the results of these calculations, showing the p-values between the frequency of occurrence of each digit and the empirical frequency predicted by Benford’s Law for each distance function specified.
As shown in Table 8, the MAD and KS distance functions proved the most effective in detecting malicious flows, with success rates of 90.22 % and 68.71 %, respectively, at a significance level of 0.1. The main difference between the two lies in what each measures. The MAD, a measure of dispersion, indicates how much the data deviate from a central value, usually the median. The KS test, commonly used to compare two independent samples, here compares the frequency of occurrence of the digits with the empirical frequency of Benford’s Law to check whether both samples come from the same distribution.
Regarding sensitivity, the MAD is less affected by extreme outliers due to its direct and simple calculation methodology, as long as these outliers are not dispersed among the digits, as seen in Table 9. The KS test, which compares two distributions, can be more sensitive to the presence of outliers if they are concentrated in a single digit, and it ultimately requires a deeper understanding of the test statistic. This requirement is manageable in a closed network environment but may be impractical in real environments.
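Both dispersion measures are straightforward to compute against Benford’s distribution. The sketch below is a minimal illustration (the example flow reuses the observed frequencies of flow 2 from Table 9); converting these statistics into the p-values used by the paper requires the corresponding reference distributions and is omitted here:

```python
import math

# Expected first-digit probabilities under Benford's Law.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def mad(observed, expected=BENFORD) -> float:
    """Mean Absolute Deviation between observed and Benford frequencies."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(expected)

def ks_statistic(observed, expected=BENFORD) -> float:
    """Kolmogorov-Smirnov statistic: the largest gap between the two
    cumulative first-digit distributions."""
    d, cum_obs, cum_exp = 0.0, 0.0, 0.0
    for o, e in zip(observed, expected):
        cum_obs += o
        cum_exp += e
        d = max(d, abs(cum_obs - cum_exp))
    return d

# Observed digit frequencies of flow 2 in Table 9 (digits 1..9).
flow = [0.5789, 0.2105, 0, 0, 0, 0.2105, 0, 0, 0]
print(f"MAD = {mad(flow):.4f}, KS = {ks_statistic(flow):.4f}")
```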
Table 9 shows that the Mean Absolute Deviation (MAD) provides superior results, followed by the Kolmogorov–Smirnov (KS) test. From the values shown in the table, it can be seen that MAD under-performs in detecting benign flows when the frequencies of occurrence of the numbers follow an almost uniform distribution, resulting in erroneous decisions, such as false positives. However, MAD’s performance improves significantly when detecting genuinely malicious flows. On the other hand, the KS test tends to make better decisions under conditions of almost uniform distribution. However, it fails more often when the frequencies are high in the first digit or when they are randomly dispersed across the digits, which can lead to an increase in false negatives, a potentially more damaging situation than the occurrence of false positives.
The visual representation in Figure 9 clearly illustrates the deviation from Benford’s Law in flow 30. The observed frequencies, which are almost uniformly distributed, lead to an incorrect decision by the Mean Absolute Deviation (MAD). This visual evidence underscores the importance of considering the distribution of occurrence frequencies in anomaly detection.
Although the KL distance function was less effective in detecting malicious flows, it proved highly accurate in identifying benign flows. The KL measure evaluates the amount of information lost when approximating a data distribution (here, the frequency of occurrence of the digits in each network flow) by a reference distribution (the empirical frequency of Benford’s Law). Because it is sensitive to differences, especially in the tails of the distributions, KL can produce poorer p-values when these differences are pronounced. In addition, the concentration of the data in the first two or three digits can adversely affect KL, even if the subsequent frequencies overlap in a typical way. KL’s sensitivity to slight variations in probability density, mainly where the frequency of occurrence is more prevalent, also contributes to its inferior performance in the presence of outliers.
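A minimal sketch of the KL computation; the epsilon smoothing is our assumption, added to handle digits with zero observed frequency, for which the logarithm is otherwise undefined:

```python
import math

# Expected first-digit probabilities under Benford's Law.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def kl_divergence(observed, expected=BENFORD, eps=1e-10) -> float:
    """Kullback-Leibler divergence D(observed || Benford).
    Digits with zero observed frequency are floored at `eps` (our
    assumption) so the logarithm stays defined."""
    return sum(max(o, eps) * math.log(max(o, eps) / e)
               for o, e in zip(observed, expected))
```

Because every term weights the log-ratio by the observed mass, flows whose frequencies concentrate on the first two or three digits inflate the divergence sharply, which matches the sensitivity discussed above.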
Table 9. Comparing the decisions of the different distance functions and the original data labels.
Digits and Flows |      2 |     30 | 18,342 | 18,361
1                | 0.5789 | 0.3333 | 0.2857 | 0.3429
2                | 0.2105 | 0.1389 | 0.1714 | 0.2286
3                | 0      | 0.1014 | 0.1143 | 0.0571
4                | 0      | 0.1667 | 0.1429 | 0.0286
5                | 0      | 0.0278 | 0.0571 | 0.1143
6                | 0.2105 | 0.0556 | 0.0286 | 0
7                | 0      | 0.0833 | 0.0857 | 0.1143
8                | 0      | 0.0278 | 0.0286 | 0.1143
9                | 0      | 0.0278 | 0.0857 | 0
MAD              | 0      | 1      | 1      | 1
KS               | 1      | 0      | 1      | 0
KL               | 1      | 0      | 0      | 0
Original Label   | 0      | 0      | 1      | 1
After analyzing the results, it is possible to evaluate the model’s performance considering metrics such as precision, recall, F1-score, and accuracy, whose values are shown in Table 10, corresponding to the significance levels that obtained the best results. Analyzing Table 10, it can be seen that the Mean Absolute Deviation is the distance function that best fits the model, as evidenced by the F1-Score of 77.19 % , superior to the performance of the other distance functions.

5.2. Second Stage: Features with a Correlation of 80 % or More

In the second stage, only the features with a correlation of 80 % or more were selected, reducing the initial number from 41 to 27. Table 11 shows the new features obtained from a correlation of 80 % or more.
The graphical analysis of this phase did not reveal any significant changes compared to the first phase’s graphic, as seen in Figure 10.
The results derived from the three distance functions are detailed in Table 12, where we focus on the results with the best scores. It can be seen that these values are in line with those presented in Table 8. The values obtained by the Mean Absolute Deviation consistently exceed those generated by the Kolmogorov–Smirnov (KS) and Kullback–Leibler (KL) distance functions.
Concerning the model’s performance metrics, including precision, recall, F1-score, and accuracy, Table 13 shows the values achieved for the significance levels with the best results.
As shown in Table 13, the Mean Absolute Deviation continues to be the most suitable distance function for the model, demonstrated by an F1-Score of 75.18 % , which is superior to the performance of the other distance functions. However, there is a slight reduction in the detection of malicious flows, which moderately affects the model’s performance due to the increase in false positives. This is because, with the reduction in the number of features, the frequency of occurrence of the digits tends to increase and become more widely distributed among the remaining digits.

5.3. Third Stage: Features with a Correlation of 90 % or More

In the third stage, only features with a correlation of 90 % or higher were considered, reducing the initial number from 27 to 20. The focus was to investigate whether features with robust correlations improve the detection of malicious flows, given that a correlation above 90 % makes it possible to assess the extent of adherence to Benford’s Law more accurately. Similarly to Figure 8 and Figure 10, the graphic generated from the features with almost perfect correlation does not reveal significant differences that confirm full compliance with the frequencies expected by Benford’s Law, although the distances are smaller, as can be seen from Figure 11. This observation suggests that, despite the high correlation, the data may not perfectly follow the predictions of the law, which implies the need for a more detailed analysis to understand the discrepancies observed.
Significant deviations between the frequencies observed and those expected by Benford’s Law can suggest anomalies, intrusions, or even system failures. A near-perfect correlation between observed and expected frequencies is expected to improve the accuracy of predictions, providing a clearer understanding of network activity. Meanwhile, correlations of 70 % and 80 %, although considered strong, may indicate the existence of different probability distributions in the analysis, given that the proximity between the observed and expected frequencies does not allow a clear distinction between abnormal and normal behavior. Several studies have shown that features strongly correlated with Benford’s Law tend to reflect more natural, unmanipulated data behavior. Applying these studies has allowed patterns to be identified, particularly in diverse fields such as genetics, that indicate the presence of irregularities but, in many cases, require further investigation. At this stage, the aim is to understand how the data behave under near-perfect correlations with Benford’s Law and whether such correlations contribute to better efficiency in detecting malicious flows.
Table 14 and Table 15 summarize the features that showed a correlation of 90 % or more and the results obtained by distance functions.
Analyzing Table 15 and comparing with Table 12, we see a modest reduction of less than 3.9 % in detecting malicious flows, contrasting with a considerable increase of approximately 20 % in cases where malicious flows were wrongly classified as benign. This situation is worrying in a forensic analysis context of detecting anomalies or intrusions in computer networks, suggesting that the high correlation with Benford’s Law may not necessarily translate into better model performance. A plausible explanation for this phenomenon could be the similarity between the frequencies observed in benign and malicious flows, making it difficult for the model to distinguish between them effectively. Table 16 exemplifies this situation by showing two flows, benign and malicious, respectively, with their observed frequencies compared to those expected by Benford’s Law and the decisions resulting from the model.
When we analyze the Mean Absolute Deviation distance function in more detail, we see, as shown in Table 16, that high frequencies of occurrence in the first digit often lead the model to incorrectly classify originally malicious flows as benign. This pattern is worrying and suggests a vulnerability of the model to false negatives, particularly when the frequency of the first digit exceeds 60 %. Such behavior deviates considerably from the frequencies expected by Benford’s Law, increasing the risk of the model generating numerous false positives or false negatives.
This analysis highlights the need for adjustments to the model to improve its accuracy and reliability in detecting threats. Two different approaches were implemented in this context, giving rise to the fourth and fifth stages. The first approach involved analyzing selected features following methodologies documented in the literature [6]. The second approach sought to improve the robustness of the analysis by combining the p-values derived from the distance functions, creating an ensemble of p-values. As recommended in the literature, two statistical techniques were used: the Fisher and Tippett techniques, both recognized for their effectiveness in combining statistical evidence from multiple tests. These approaches aim to identify abnormal patterns in the analyzed data, thereby improving the model’s accuracy in detecting malicious flows.
Regarding the first approach and following the research by Mbona [6], we found that the author integrated seven features that had not been considered in our study, as seen in Table 17 and Table 18. These features were excluded from our study either because they contained exclusively zeros or ones, or because the correlation between the frequency of occurrence of each digit and the frequency expected by Benford’s Law was below the 70 % threshold we established. However, to assess the relevance of including these features, we incorporated them, as Mbona suggested. Figure 12 illustrates that, although there is apparent adherence to the expected frequency for digit 1, lower results are observed for digits 2 and 6, where the differences are more marked. There are also smaller deviations for digits 3, 8, and 9. As for the results, analyzing Table 19 reveals no significant changes compared to previous evaluations, suggesting that the features proposed by Mbona may be omitted from future analyses.
Regarding the evaluation of the proposed model, Table 20 shows the results obtained with the new features, aligning closely with the initial predictions of our research.

5.4. Fourth Stage and Second Phase: Method Combining the Three Distance Functions, Benford’s Law, and Bayes’ Theorem

The second approach combined the multiple p-values obtained through the distance functions into an overall p-value. To compute it, two consolidated statistical methodologies for aggregating evidence from multiple tests were used: the Fisher and Tippett methods [46,48]. In addition, Bayes’ Theorem was used as the primary classification mechanism. After comparing the results of the model evaluation in clusters 1, 2, and 3, we chose to generate the ensemble from the data of cluster 2, which proved the most promising. In this context, 35 % of the flows analyzed were identified as benign and 65 % as malicious. Figure 13 shows the decision tree developed using Bayes’ Theorem, detailing the classification process used.
Figure 13 illustrates the use of Bayes’ Theorem to classify malicious and benign flows. A Prior variable was created, representing the proportion of malicious flows in the dataset. For each flow, the three distance functions produced individual p-values from the digit frequencies; the Fisher and Tippett methods then combined them into a global p-value, allowing the nature of each flow to be decided by comparison with standard significance thresholds. Table 21 details the results achieved after implementing this original method.
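The Bayesian update can be sketched as follows. The prior (65 % malicious) comes from cluster 2 as described above; the likelihood values are assumptions for illustration, since the paper does not publish the exact likelihoods used in its decision tree:

```python
def posterior_malicious(prior: float, p_below_alpha: bool,
                        power: float = 0.9, alpha: float = 0.05) -> float:
    """Posterior probability that a flow is malicious after observing
    whether its global p-value fell below the significance level.
    `power` = P(p < alpha | malicious) is an ASSUMED value, not taken
    from the paper; alpha approximates P(p < alpha | benign) under H0."""
    if p_below_alpha:
        like_mal, like_ben = power, alpha
    else:
        like_mal, like_ben = 1 - power, 1 - alpha
    numerator = like_mal * prior
    return numerator / (numerator + like_ben * (1 - prior))

# Cluster 2 prior from the paper: 65% of the analyzed flows were malicious.
print(posterior_malicious(prior=0.65, p_below_alpha=True))
```

A significant global p-value pushes the posterior above the prior, while a non-significant one pulls it below, which is the qualitative behavior the decision tree encodes.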
Table 21 compares the results obtained by applying the Fisher and Tippett methods to detect malicious flows. The Tippett method, which selects the lowest p-values, proved more effective in identifying malicious flows due to its less conservative nature. Table 21 shows that the method achieved high detection rates for malicious flows but a low hit rate for benign flows, with efficiencies of 99.42 % and 2.04 % , respectively. The presence of false positives is relatively high, with a rate of 97.96 % , in contrast to the presence of false negatives, with a rate of 0.57 % , at a significance level of 0.05 . This method becomes helpful in scenarios where a single significant test is enough to validate the network flow analysis, resulting in a high detection rate of malicious flows, although with less accuracy in identifying benign flows. On the other hand, Fisher’s method, which adds up the logarithms of the p-values and applies the Chi-squared distribution to calculate an overall p-value, shows greater sensitivity when all the individual p-values are low. Table 21 shows that the method achieved detection rates of 67.81 % for malicious flows and 31.34 % for benign flows, at a significance level of 0.1 , making it more balanced than the Tippett method. However, there was a significant increase in both the number of false positives and the number of false negatives. This behavior makes it more suitable for situations that require a consistent evaluation of multiple pieces of evidence but can lead to less accurate decisions if the data are not uniformly significant.
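Both combination methods admit compact implementations. The sketch below uses the closed-form Chi-squared survival function for even degrees of freedom (always the case with 2k df) to stay dependency-free; the example p-values are illustrative, not taken from the dataset:

```python
import math

def fisher_combine(pvalues) -> float:
    """Fisher's method: -2 * sum(ln p) follows a Chi-squared distribution
    with 2k degrees of freedom under H0. For even df the survival
    function has a closed form: exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    Assumes all p-values are strictly positive."""
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(k))

def tippett_combine(pvalues) -> float:
    """Tippett's method: based on the smallest p-value; under H0 the
    combined p-value is 1 - (1 - min_p)^k."""
    k = len(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** k

# Illustrative MAD, KS, and KL p-values for a single flow.
pvals = [0.03, 0.20, 0.08]
print(f"Fisher: {fisher_combine(pvals):.4f}, "
      f"Tippett: {tippett_combine(pvals):.4f}")
```

Because Tippett keys on the single smallest p-value, one strongly significant test suffices to flag a flow, matching its high malicious-flow detection rate; Fisher requires the evidence to be low across tests, matching its more balanced but noisier behavior.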
In evaluating the model, Table 22 shows the results achieved by the Fisher and Tippett methods, highlighting the most effective classifications as indicated in Table 21. Analyzing the table, it is clear that Tippett’s method outperforms Fisher’s, achieving an F1 score close to 80 % and showing slightly better accuracy. It can, therefore, be concluded that Tippett’s method is more suitable for the problem being analyzed.

6. Conclusions and Future Work

Developing faster and more efficient techniques that consume less energy and computing resources has been vital in supporting forensic teams in detecting anomalies or intrusions in computer networks. Over the last few years, this field has seen significant progress, with advances favoring purely statistical techniques over the massive use of machine-learning-based models. The method we propose, based on Benford’s Law and documented in various studies, particularly in financial auditing and accounting, aims to create a balanced, fast, and efficient model for detecting potentially malicious network flows. The model relies on advanced statistical techniques, including distance functions such as the Mean Absolute Deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence, which serve as robust measures of dispersion to quantify the magnitude of anomalies detected in the flows of a computer network. In addition, we integrated Bayes’ Theorem with the three distance functions to develop a model that generates a single global p-value. This model makes it possible to identify discrepancies in the digits, making it easier to determine the nature of the analyzed flows, whether malicious or benign. The research was conducted using the public CIC-IDS2017 dataset.
The research carried out in this study was structured into five stages, focusing on the correlation of features with the data and the implementation of an ensemble to generate a global p-value, classifying network flows as malicious or benign. In the first stage, we selected features with a correlation of at least 70 % with Benford’s Law. The results indicated that the Mean Absolute Deviation was the most effective in detecting malicious flows, identifying 17,143 of the 19,000 malicious flows. The Kolmogorov–Smirnov (KS) test also performed well, detecting 13,504 malicious flows. In contrast, the Kullback–Leibler (KL) divergence was less effective at detecting malicious flows but highly accurate in identifying benign flows. As discussed in Section 5, these results reflect the frequencies of occurrence of each digit: higher frequencies in the first digit suggest benign flows, while a more even distribution among the digits can result in false positives or negatives. This leads the model to perform worse than ML-based models.
In the second stage, we considered only features with a correlation of at least 80 %. The results obtained with the Mean Absolute Deviation remained the highest, although there was a slight reduction in the detection of malicious flows and an increase in false positives and negatives. In the third and fourth stages, we focused on features with a correlation of 90 % or more, observing results similar to those of the second stage. These results indicate that using features strongly correlated with Benford’s Law can degrade the detection of malicious flows, influenced by the proximity or dependence between the features used.
A high false positive rate was observed in many of the proposed scenarios, where the model classified certain benign flows as potentially malicious. A high false positive rate overloads network administrators, generating many alerts that result in the unnecessary allocation of resources, such as time and effort, to investigate threats that are not real. This can lead to network administrators becoming desensitized to the alerts generated, increasing the risk of overlooking genuine ones. The model proposed in this study uses a set of adjustable thresholds (significance levels) for detecting malicious flows, making it possible to calibrate the model to reduce the false positive rate without compromising its sensitivity. Regarding false negatives, the rate was low in almost all the scenarios analyzed. False negatives represent a high security risk, since real attacks can go undetected. The combination of distance functions, namely the Mean Absolute Deviation, the Kolmogorov–Smirnov test, and the Kullback–Leibler divergence, increased the model’s robustness, helping keep the false negative rate low.
Future research should explore the initial identification of features that have little dependence on each other but still show a strong correlation with Benford’s Law. In addition, compared to other studies, the correlation-based method extracted fewer features, producing more effective results when applied to the Benford’s Law-based model.
In the last phase of the study, an ensemble was developed combining the p-values to assess the effectiveness of the Benford’s Law-based model in detecting malicious and benign flows. This ensemble was based on two methods, Fisher and Tippett, with the Tippett method showing the best results. Evaluating the model based on Benford’s Law in conjunction with distance functions, it was possible to achieve an F1-score close to 80 % with a recall of 99.42 % . However, the model’s precision and accuracy were lower than expected, approximately 65 % , a result influenced by the proximity in the frequencies of occurrence of each digit.
Although this model’s results are lower than those of the usual machine learning (ML) techniques, several factors should be considered, such as the model’s speed, its low consumption of computational resources, and its high detection rates for malicious flows. These aspects underline the model’s significant potential in practical applications where efficiency and speed are crucial, even though its raw performance is lower. One possibility for improving the proposed model is to integrate it into existing security systems, such as intrusion detection systems (IDSs) and security information and event management (SIEM) systems. Whether integrated into an IDS or a SIEM, the model can be incorporated into tools like Snort or Splunk via specific plugins or modules, which can monitor and analyze network flows based on Benford’s Law, adding an extra layer of security in detecting potentially malicious flows.
The method proposed in this study detects malicious flows in a network and can be seamlessly integrated into existing security systems, significantly improving threat protection and response capabilities in various scenarios. These scenarios include:
Corporate Networks:
  • Detection of Fraudulent Financial Activities: The model will be able to identify possible fraudulent financial activities, detecting transactions whose digit frequencies do not follow those expected by Benford’s Law.
  • False Positive Reduction: Adjusting detection thresholds based on digit analysis may reduce the false positive rate, allowing network administrators to focus on real threats.
  • Integration with Accounting and ERP Systems: Integrating the model into accounting and ERP (Enterprise Resource Planning) systems will enable real-time and continuous monitoring of financial activities.
Industrial Control Systems:
  • Critical Infrastructure Protection: The method will detect malicious activity in critical infrastructures like energy and telecommunications by analyzing SCADA (Supervisory Control and Data Acquisition) data flows.
  • Analyzing industrial protocols: The proposed model makes it possible to detect flows resulting from injection attacks by analyzing the traffic obtained from the Modbus and DNP3 (Distributed Network Protocol 3) protocols.
Critical Infrastructures:
  • Anomaly Detection in Water and Sanitation Systems: The model could be used to identify possible anomalies in sensors or control systems, ensuring the safety and continuity of their operations.
In future research, we plan to reduce the dependency between the extracted features and introduce two new distance functions: the Chi-squared distance function and the Euclidean distance. In addition, we will explore the model’s applicability to Zipf’s Law to assess the coherence between the results under these two laws in contexts of the forensic analysis of computer networks. Finally, we intend to improve the model by incorporating unsupervised machine learning techniques to reduce high false positive rates.

Author Contributions

Conceptualization, methodology, writing and preparation of the original draft, P.F.; writing and revision, S.Ó.C. and M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

BL      Benford’s Law
DDoS    Distributed Denial of Service
DNP3    Distributed Network Protocol 3
ERP     Enterprise Resource Planning
ICMP    Internet Control Message Protocol
IDS     Intrusion Detection Systems
IoT     Internet of Things
IP      Internet Protocol
KL      Kullback–Leibler Divergence
KS      Kolmogorov–Smirnov test
MAD     Mean Absolute Deviation
MDPI    Multidisciplinary Digital Publishing Institute
ML      Machine Learning
NIDS    Network Intrusion Detection
NTA     Network Traffic Analysis
ROC     Receiver Operating Characteristic
SCADA   Supervisory Control and Data Acquisition
SIEM    Security Information and Event Management Systems
SSD     Sum of Squared Deviation
TCP     Transmission Control Protocol
UDP     User Datagram Protocol

References

  1. Yurtseven, I.; Bagriyanik, S. A Review of Penetration Testing and Vulnerability Assessment in Cloud Environment. In Proceedings of the 2020 Turkish National Software Engineering Symposium (UYMS), İstanbul, Turkey, 7–9 October 2020; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  2. Norton. 115 Cybersecurity Statistics + Trends to Know in 2024; Technical report; Norton: Mountain View, CA, USA, 2022. [Google Scholar]
  3. RFC. RFC 2722: Traffic Flow Measurement: Architecture. Technical Report. 1999. Available online: https://datatracker.ietf.org/doc/rfc2722/ (accessed on 27 May 2024).
  4. RFC. RFC 3697: Specification of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers; Technical Report; Internet Engineering Task Force (IETF): Fremont, CA, USA, 2004. [Google Scholar]
  5. Milano, F.; Gomez-Exposito, A. Detection of Cyber-Attacks of Power Systems Through Benford’s Law. IEEE Trans. Smart Grid 2021, 12, 2741–2744. [Google Scholar] [CrossRef]
  6. Mbona, I.; Eloff, J.H.P. Detecting Zero-Day Intrusion Attacks Using Semi-Supervised Machine Learning Approaches. IEEE Access 2022, 10, 69822–69838. [Google Scholar] [CrossRef]
  7. Erickson, J. Hacking; No Starch Press: San Francisco, CA, USA, 2007; p. 296. [Google Scholar]
  8. Stallings, W. Network Security Essentials Applications and Standards; Pearson: London, UK, 2016; p. 464. [Google Scholar]
  9. Jaswal, N. Hands-On Network Forensics; Packt Publishing Limited: Birmingham, UK, 2019; p. 358. [Google Scholar]
  10. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2. [Google Scholar] [CrossRef]
  11. Cascavilla, G.; Tamburri, D.A.; Van Den Heuvel, W.J. Cybercrime threat intelligence: A systematic multi-vocal literature review. Comput. Secur. 2021, 105, 102258. [Google Scholar] [CrossRef]
  12. Carrier, B. File System Forensic Analysis; Addison-Wesley: San Francisco, CA, USA, 2005; p. 569. [Google Scholar]
  13. Casey, E. Handbook of Digital Forensics and Investigation; Elsevier Science & Technology Books: Amsterdam, The Netherlands, 2009. [Google Scholar]
  14. Wang, F.; Tang, Y. Diverse Intrusion and Malware Detection: AI-Based and Non-AI-Based Solutions. J. Cybersecur. Priv. 2024, 4, 382–387. [Google Scholar] [CrossRef]
  15. Aljanabi, M.; Ismail, M.A.; Ali, A.H. Intrusion Detection Systems, Issues, Challenges, and Needs. Int. J. Comput. Intell. Syst. 2021, 14, 560. [Google Scholar] [CrossRef]
  16. Dini, P.; Elhanashi, A.; Begni, A.; Saponara, S.; Zheng, Q.; Gasmi, K. Overview on Intrusion Detection Systems Design Exploiting Machine Learning for Networking Cybersecurity. Appl. Sci. 2023, 13, 7507. [Google Scholar] [CrossRef]
  17. Arshadi, L.; Jahangir, A.H. Benford’s law behavior of Internet traffic. J. Netw. Comput. Appl. 2014, 40, 194–205. [Google Scholar] [CrossRef]
  18. Sun, L.; Anthony, T.S.; Xia, H.Z.; Chen, J.; Huang, X.; Zhang, Y. Detection and classification of malicious patterns in network traffic using Benford’s law. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; IEEE: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
  19. Sethi, K.; Kumar, R.; Prajapati, N.; Bera, P. A Lightweight Intrusion Detection System using Benford’s Law and Network Flow Size Difference. In Proceedings of the 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), Bengaluru, India, 7–11 January 2020; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  20. Nigrini, M.J. Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection; John Wiley & Sons: Hoboken, NJ, USA, 2012; Volume 586. [Google Scholar]
  21. Cerqueti, R.; Maggi, M. Data validity and statistical conformity with Benford’s Law. Chaos Solitons Fractals 2021, 144, 110740. [Google Scholar] [CrossRef]
  22. Thottan, M.; Ji, C. Anomaly detection in IP networks. IEEE Trans. Signal Process. 2003, 51, 2191–2204. [Google Scholar] [CrossRef]
  23. Wang, Y. Statistical Techniques for Network Security; Information Science Reference: Hershey, PA, USA, 2008; p. 476. [Google Scholar]
  24. Ahmed, M.; Naser Mahmood, A.; Hu, J. A survey of network anomaly detection techniques. J. Netw. Comput. Appl. 2016, 60, 19–31. [Google Scholar] [CrossRef]
  25. Hero, A.; Kar, S.; Moura, J.; Neil, J.; Poor, H.V.; Turcotte, M.; Xi, B. Statistics and Data Science for Cybersecurity. Harv. Data Sci. Rev. 2023, 5. [Google Scholar] [CrossRef]
  26. Iorliam, A. Natural Laws (Benford’s Law and Zipf’s Law) for Network Traffic Analysis. In Cybersecurity in Nigeria; Springer International Publishing: Cham, Switzerland, 2019; pp. 3–22. [Google Scholar] [CrossRef]
  27. Sun, L.; Ho, A.; Xia, Z.; Chen, J.; Zhang, M. Development of an Early Warning System for Network Intrusion Detection Using Benford’s Law Features. In Communications in Computer and Information Science; Springer: Singapore, 2019; pp. 57–73. [Google Scholar] [CrossRef]
  28. Hajdarevic, K.; Pattinson, C.; Besic, I. Improving Learning Skills in Detection of Denial of Service Attacks with Newcomb–Benford's Law using Interactive Data Extraction and Analysis. TEM J. 2022, 11, 527–534. [Google Scholar] [CrossRef]
  29. Mbona, I.; Eloff, J.H. Feature selection using Benford’s law to support detection of malicious social media bots. Inf. Sci. 2022, 582, 369–381. [Google Scholar] [CrossRef]
  30. Campanelli, L. On the Euclidean distance statistic of Benford’s law. Commun. Stat. Theory Methods 2022, 53, 451–474. [Google Scholar] [CrossRef]
  31. Kossovsky, A.E. On the Mistaken Use of the Chi-Square Test in Benford’s Law. Stats 2021, 4, 419–453. [Google Scholar] [CrossRef]
  32. Fernandes, P.; Antunes, M. Benford’s law applied to digital forensic analysis. Forensic Sci. Int. Digit. Investig. 2023, 45, 301515. [Google Scholar] [CrossRef]
  33. Berger, A.; Hill, T.P. The mathematics of Benford’s law: A primer. Stat. Methods Appl. 2020, 30, 779–795. [Google Scholar] [CrossRef]
  34. Wang, L.; Ma, B.Q. A concise proof of Benford’s law. Fundam. Res. 2023, in press. [CrossRef]
  35. Bunn, D.W.; Gianfreda, A.; Kermer, S. A Trading-Based Evaluation of Density Forecasts in a Real-Time Electricity Market. Energies 2018, 11, 2658. [Google Scholar] [CrossRef]
  36. Andriulli, M.; Starling, J.K.; Schwartz, B. Distributional Discrimination Using Kolmogorov-Smirnov Statistics and Kullback-Leibler Divergence for Gamma, Log-Normal, and Weibull Distributions. In Proceedings of the 2022 Winter Simulation Conference (WSC), Singapore, 11–14 December 2022; IEEE: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
  37. Pham-Gia, T.; Hung, T. The mean and median absolute deviations. Math. Comput. Model. 2001, 34, 921–936. [Google Scholar] [CrossRef]
  38. Fernandes, P.; Ciardhuáin, S.Ó.; Antunes, M. Uncovering Manipulated Files Using Mathematical Natural Laws. In Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; pp. 46–62. [Google Scholar] [CrossRef]
  39. Bulinski, A.; Dimitrov, D. Statistical Estimation of the Kullback–Leibler Divergence. Mathematics 2021, 9, 544. [Google Scholar] [CrossRef]
  40. Li, J.; Fu, H.; Hu, K.; Chen, W. Data Preprocessing and Machine Learning Modeling for Rockburst Assessment. Sustainability 2023, 15, 13282. [Google Scholar] [CrossRef]
  41. Zaidi, Z.R.; Hakami, S.; Landfeldt, B.; Moors, T. Real-time detection of traffic anomalies in wireless mesh networks. Wirel. Netw. 2009, 16, 1675–1689. [Google Scholar] [CrossRef]
  42. Zhou, W.; Lv, Z.; Li, G.; Jiao, B.; Wu, W. Detection of Spoofing Attacks on Global Navigation Satellite Systems Using Kolmogorov–Smirnov Test-Based Signal Quality Monitoring Method. IEEE Sens. J. 2024, 24, 10474–10490. [Google Scholar] [CrossRef]
  43. Bouyeddou, B.; Harrou, F.; Kadri, B.; Sun, Y. Detecting network cyber-attacks using an integrated statistical approach. Cluster Comput. 2020, 24, 1435–1453. [Google Scholar] [CrossRef]
  44. Bouyeddou, B.; Harrou, F.; Sun, Y.; Kadri, B. Detection of smurf flooding attacks using Kullback-Leibler-based scheme. In Proceedings of the 2018 4th International Conference on Computer and Technology Applications (ICCTA), Istanbul, Turkey, 3–5 May 2018; IEEE: New York, NY, USA, 2018. [Google Scholar] [CrossRef]
  45. Romo-Chavero, M.A.; Cantoral-Ceballos, J.A.; Pérez-Díaz, J.A.; Martinez-Cagnazzo, C. Median Absolute Deviation for BGP Anomaly Detection. Future Internet 2024, 16, 146. [Google Scholar] [CrossRef]
  46. Ham, H.; Park, T. Combining p-values from various statistical methods for microbiome data. Front. Microbiol. 2022, 13, 990870. [Google Scholar] [CrossRef] [PubMed]
  47. Borenstein, M.; Hedges, L.; Higgins, J.; Rothstein, H. Introduction to Meta-Analysis; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
  48. Chen, Z. Optimal Tests for Combining p-Values. Appl. Sci. 2021, 12, 322. [Google Scholar] [CrossRef]
  49. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the International Conference on Information Systems Security and Privacy, Madeira, Portugal, 22–24 January 2018. [Google Scholar]
  50. UNB. Intrusion Detection Evaluation Dataset. 2017. Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 1 July 2024).
  51. Lashkari, A.H. CICFlowMeter; Github: San Francisco, CA, USA, 2021. [Google Scholar]
  52. Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning—ICML ’06, Pittsburgh, PA, USA, 25–29 June 2006; ACM Press: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
  53. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  54. Ferreira, S.; Antunes, M.; Correia, M.E. A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing. Data 2021, 6, 87. [Google Scholar] [CrossRef]
Figure 1. Maximum value obtained from the differences between the cumulative functions.
Figure 2. Median Absolute Deviation between the frequency of occurrence of each digit in flow 14 and the empirical frequency from Benford’s Law.
Figure 3. Kullback–Leibler divergence between the frequency of occurrence of each digit in flow 14 and the empirical frequency from Benford’s Law.
Figure 4. General architecture of the model where pre-processing and processing are highlighted.
Figure 5. General architecture of the model based on Benford’s Law, distance functions, and Bayes’ Theorem.
Figure 6. Preprocessing phase architecture [49].
Figure 7. Processing phase that schematizes the two main stages.
Figure 8. Comparison of the frequencies of occurrence of all the digits in the dataset with those expected by Benford's Law for Cluster 1, where the features show a correlation of 70% or higher.
Figure 9. Comparison between the frequencies of occurrence of the flows numbered 2 and 30 in the first row and 18,342 and 18,361 in the second row, with the frequencies predicted by Benford’s Law. The discrepancies between the observed and expected frequencies are visible.
Figure 10. Comparison of the frequencies of occurrence of all the digits in the dataset with those expected by Benford’s Law for Cluster 2, where the features show a correlation of 80 % or higher.
Figure 11. Comparison of the frequencies of occurrence of all the digits in the dataset with those expected by Benford’s Law for Cluster 3, where the features show a correlation of 90 % or higher.
Figure 12. Comparison of the frequencies of occurrence of all the digits in the dataset with those expected by Benford’s Law for Cluster 5, according to Mbona.
Figure 13. Tree diagram illustrating Bayes’ Theorem, based on the values derived from the distance functions.
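The tree diagram of Figure 13 corresponds to a direct application of Bayes' Theorem to the output of a distance function. The sketch below (illustrative, not the authors' code) takes the prior from the dataset composition (19,000 malicious out of 29,000 flows, Table 5) and the likelihoods from the MAD row at the 0.1 significance level in Table 8; the resulting posterior probability that a flagged flow is truly malicious coincides with the MAD precision reported in Table 10.

```python
def posterior_malicious(prior_malicious, p_flag_given_malicious, p_flag_given_benign):
    """Bayes' Theorem: P(malicious | flagged) for a flow flagged as
    non-conforming to Benford's Law by a distance function."""
    p_flag = (p_flag_given_malicious * prior_malicious
              + p_flag_given_benign * (1.0 - prior_malicious))
    return p_flag_given_malicious * prior_malicious / p_flag

# Prior and likelihoods derived from Tables 5 and 8 (MAD, significance 0.1):
# 19,000 of 29,000 flows are malicious; TPR = 17,143/19,000; FPR = 8274/10,000.
p = posterior_malicious(19000 / 29000, 17143 / 19000, 8274 / 10000)
print(round(p, 4))  # 0.6745, the MAD precision in Table 10
```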
Table 1. Observed and empirical cumulative frequencies, and the respective deviations between frequencies.

First Digit | Cumulative Frequency for Flow 14: F_X(x) | Cumulative Frequency for Benford's Law: F_n(x) | D = sup_x |F_n(x) - F_X(x)|
1 | 0.2222 | 0.3010 | 0.0788
2 | 0.2778 | 0.4771 | 0.1993
3 | 0.3889 | 0.6021 | 0.2132
4 | 0.4444 | 0.6990 | 0.2545
5 | 0.5556 | 0.7782 | 0.2226
6 | 0.7222 | 0.8451 | 0.1229
7 | 0.8889 | 0.9031 | 0.0142
8 | 0.9444 | 0.9542 | 0.0098
9 | 1 | 1 | 0
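The D column of Table 1 can be reproduced with a short script. The sketch below (illustrative, not the authors' code) assumes first-digit counts of 4, 1, 2, 1, 2, 3, 3, 1, 1 for flow 14, which are consistent with the reported F_X(x) column, and takes the supremum of the differences between the two cumulative distributions.

```python
import math

# First-digit counts for flow 14 (assumed; consistent with Table 1's F_X(x) column)
counts = [4, 1, 2, 1, 2, 3, 3, 1, 1]
n = sum(counts)

# Benford probabilities: P(d) = log10(1 + 1/d), d = 1..9
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Empirical and theoretical cumulative distribution functions
F_X, F_n, acc_x, acc_n = [], [], 0.0, 0.0
for c, b in zip(counts, benford):
    acc_x += c / n
    acc_n += b
    F_X.append(acc_x)
    F_n.append(acc_n)

# Kolmogorov-Smirnov statistic: the largest absolute deviation
D = max(abs(fn - fx) for fn, fx in zip(F_n, F_X))
print(round(D, 4))  # 0.2545, attained at digit 4, as in Table 1
```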
Table 2. Kolmogorov–Smirnov test procedure based on the study of Benford's Law applied to any network flow.

1. Calculate the empirical cumulative distribution function for the flow under analysis;
2. Calculate the empirical cumulative distribution function for Benford's Law;
3. Calculate the Kolmogorov–Smirnov statistic using Equation (7);
4. D_test is the largest of the D_i values calculated in the previous step;
5. Compare D_test with the critical value D_critical;
6. If D_test < D_critical, there is not enough statistical evidence to reject H0.
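The procedure of Table 2 can be sketched as follows. This is an illustrative implementation (not the authors' code); the critical value uses the standard large-sample approximation D_crit ≈ c(α)/√n, and the paper's exact thresholds may differ.

```python
import math

def ks_benford_decision(digit_counts, alpha=0.05):
    """Steps of Table 2: build both cumulative distributions, take the
    largest gap, and compare it with an asymptotic critical value."""
    n = sum(digit_counts)
    benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
    acc_x = acc_n = 0.0
    D = 0.0
    for c, b in zip(digit_counts, benford):
        acc_x += c / n
        acc_n += b
        D = max(D, abs(acc_n - acc_x))
    # Large-sample approximation: D_crit = c(alpha) / sqrt(n)
    c_alpha = {0.10: 1.224, 0.05: 1.358, 0.01: 1.628}[alpha]
    d_crit = c_alpha / math.sqrt(n)
    return D, d_crit, D < d_crit  # True -> cannot reject H0 (flow conforms)

# Flow 14's first-digit counts (assumed, as in Table 1)
D, d_crit, conforms = ks_benford_decision([4, 1, 2, 1, 2, 3, 3, 1, 1], alpha=0.05)
```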
Table 3. General features present in the dataset.

Destination Port | Bwd Packet Length Max | Fwd IAT Total | Fwd PSH Flags
Flow Duration | Bwd Packet Length Min | Fwd IAT Mean | Bwd PSH Flags
Total Fwd Packets | Bwd Packet Length Mean | Fwd IAT Std | Fwd URG Flags
Total Backward Packets | Bwd Packet Length Std | Fwd IAT Max | Bwd URG Flags
Total Length of Fwd Packets | Flow Bytes/s | Fwd IAT Min | Fwd Header Length
Total Length of Bwd Packets | Flow Packets/s | Bwd IAT Total | Bwd Header Length
Fwd Packet Length Max | Flow IAT Mean | Bwd IAT Mean | Fwd Packets/s
Fwd Packet Length Min | Flow IAT Std | Bwd IAT Std | Bwd Packets/s
Fwd Packet Length Mean | Flow IAT Max | Bwd IAT Max | Min Packet Length
Fwd Packet Length Std | Flow IAT Min | Bwd IAT Min | Max Packet Length
Packet Length Mean | ECE Flag Count | Bwd Avg Packets/Bulk | Active Mean
Packet Length Std | Down/Up Ratio | Bwd Avg Bulk Rate | Active Std
Packet Length Variance | Average Packet Size | Subflow Fwd Packets | Active Max
FIN Flag Count | Avg Fwd Segment Size | Subflow Fwd Bytes | Active Min
SYN Flag Count | Avg Bwd Segment Size | Subflow Bwd Packets | Idle Mean
RST Flag Count | Fwd Header Length_1 | Subflow Bwd Bytes | Idle Std
PSH Flag Count | Fwd Avg Bytes/Bulk | Init_Win_bytes_forward | Idle Max
ACK Flag Count | Fwd Avg Packets/Bulk | Init_Win_bytes_backward | Idle Min
URG Flag Count | Fwd Avg Bulk Rate | act_data_pkt_fwd |
CWE Flag Count | Bwd Avg Bytes/Bulk | min_seg_size_forward |
Table 4. Parameter settings used for the statistical tests.

Kolmogorov–Smirnov test
- Parameter settings: significance levels of 0.05, 0.01, and 0.1; sample size of 29,000 flows.
- Procedure: the distribution of digit occurrence frequencies was calculated and compared with the empirical distribution of Benford's Law; the KS test was applied to find the largest difference between the empirical cumulative distributions of the observed data and of Benford's Law.
- Threshold setting: threshold values were established for the 1%, 5%, and 10% significance levels; flows with p-values below the critical value were considered malicious.

Kullback–Leibler divergence
- Parameter settings: ε = 10^-10 was added to all observed probabilities to avoid division by zero; the probabilities were normalized to sum to 1.
- Procedure: the probability distributions of the first digit of each feature were calculated for the dataset and for Benford's Law; the KL divergence was calculated to measure the difference between the observed distribution of digits and the distribution expected by Benford's Law.

Mean Absolute Deviation
- Parameter settings: the first digit of the dataset was considered.
- Procedure: the MAD was calculated to measure the difference between the observed distribution of digits and the distribution expected by Benford's Law.
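The KL and MAD settings above can be sketched in a few lines. This is an illustrative implementation (not the authors' code); the ε smoothing and renormalization follow the parameter settings in Table 4.

```python
import math

# Benford's Law first-digit probabilities
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def kl_divergence(observed, eps=1e-10):
    """KL divergence between an observed first-digit distribution and
    Benford's Law, with epsilon smoothing to avoid division by zero."""
    p = [o + eps for o in observed]
    s = sum(p)
    p = [v / s for v in p]  # renormalize so the probabilities sum to 1
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, BENFORD))

def mad(observed):
    """Mean Absolute Deviation between observed and Benford proportions."""
    return sum(abs(o - b) for o, b in zip(observed, BENFORD)) / 9

# A distribution identical to Benford's Law gives (near-)zero scores
assert kl_divergence(BENFORD) < 1e-6
assert mad(BENFORD) == 0.0
```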
Table 5. Quantity of flows extracted according to the type of activity for each day of the week.

Week Day | Type of Activity | Benign Flows Extracted | Malicious Flows Extracted
Monday | Only benign flows | 2000 | -
Tuesday | Benign flows | 2000 | -
Tuesday | FTP-Patator | - | 2000
Tuesday | SSH-Patator | - | 2000
Wednesday | Benign flows | 2000 | -
Wednesday | DoS/DDoS | - | 2000
Wednesday | DoS slowloris | - | 2000
Wednesday | DoS Slowhttptest | - | 2000
Wednesday | DoS Hulk | - | 2000
Wednesday | DoS GoldenEye | - | 2000
Thursday | Benign flows | 2000 | -
Thursday | Web Attack - Brute Force | - | 1000
Thursday | Web Attack - XSS | - | 1000
Thursday | Web Attack - Sql Injection | - | 1000
Thursday | Infiltration | - | 1000
Friday | Benign flows | 2000 | -
Friday | DDoS LOIT | - | 1000
Total | | 10,000 | 19,000
Table 6. Confusion matrix.

 | Predicted Positive | Predicted Negative
Real Positive | Malicious network flow: true positive (TP) | Malicious network flow rated as benign: false negative (FN)
Real Negative | Benign network flow rated as malicious: false positive (FP) | Benign network flow: true negative (TN)
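The evaluation metrics derived from this confusion matrix can be computed directly. The helper below is a sketch (not the authors' code); as a check, feeding it the MAD (0.1) counts from Table 8 reproduces the corresponding row of Table 10.

```python
def evaluate(tp, tn, fp, fn):
    """Precision, recall, F1-score and accuracy from the confusion
    matrix of Table 6."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# MAD at the 0.1 significance level (Table 8): TP=17,143, TN=1726, FP=8274, FN=1857
p, r, f1, acc = evaluate(17143, 1726, 8274, 1857)
print(round(p, 4), round(r, 4), round(f1, 4), round(acc, 4))
# 0.6745 0.9023 0.7719 0.6507 -- the MAD row of Table 10
```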
Table 7. General features that are in line with Benford's Law.

Bwd Packet Length Mean | Fwd IAT Total | Flow IAT Mean | Packet Length Std
Flow Duration | Fwd IAT Mean | Flow IAT Std | Packet Length Variance
Total Fwd Packets | Fwd IAT Std | Flow IAT Max | Down/Up Ratio
Total Backward Packets | Fwd IAT Max | Subflow Fwd Packets | Avg Fwd Segment Size
Total Length of Fwd Packets | Fwd IAT Min | Subflow Fwd Bytes | Avg Bwd Segment Size
Total Length of Bwd Packets | Bwd IAT Total | Subflow Bwd Packets | Max Packet Length
Fwd Packet Length Mean | Bwd IAT Std | Subflow Bwd Bytes | Packet Length Mean
Fwd Packet Length Std | Bwd IAT Max | act_data_pkt_fwd | Flow Packets/s
Fwd Packets/s | Bwd Packet Length Std | Active Mean | Idle Std
Bwd Packets/s | Flow Bytes/s | Active Std | Active Max
Active Min | | |
Table 8. Results of the application of distance functions with Benford's Law in Cluster 1.

Distance Function | Degree of Significance | TP | TN | FP | FN
MAD | 0.05 | 3745 | 7495 | 2505 | 15,255
MAD | 0.01 | 0 | 10,000 | 0 | 19,000
MAD | 0.1 | 17,143 | 1726 | 8274 | 1857
KS test | 0.05 | 8996 | 3826 | 6174 | 10,004
KS test | 0.01 | 4783 | 6026 | 3974 | 14,217
KS test | 0.1 | 13,504 | 2522 | 7478 | 5496
Kullback–Leibler | 0.05 | 3053 | 8447 | 1553 | 15,947
Kullback–Leibler | 0.01 | 1126 | 9304 | 696 | 17,874
Kullback–Leibler | 0.1 | 3359 | 7841 | 2159 | 15,641
Table 10. Results of the model evaluation for Cluster 1.

Distance Function | Precision | Recall | F1-Score | Accuracy
MAD (0.1) | 0.6745 | 0.9023 | 0.7719 | 0.6507
KS test (0.1) | 0.6436 | 0.7107 | 0.6755 | 0.5526
Kullback–Leibler (0.05) | 0.6628 | 0.1607 | 0.2587 | 0.3966
Table 11. Features with a correlation greater than or equal to 80%.

Flow Packets/s | Bwd IAT Max | Flow IAT Mean | Packet Length Std
Flow Duration | Fwd IAT Mean | Flow IAT Std | Packet Length Variance
Total Fwd Packets | Fwd IAT Std | Active Std | Down/Up Ratio
Total Backward Packets | Bwd IAT Total | Subflow Fwd Packets | Idle Std
Total Length of Fwd Packets | Bwd IAT Std | Subflow Fwd Bytes | Bwd Packets/s
Total Length of Bwd Packets | Bwd Packet Length Std | Subflow Bwd Packets | act_data_pkt_fwd
Fwd Packets/s | Flow Bytes/s | Subflow Bwd Bytes |
Table 12. Results of the application of distance functions with Benford's Law in Cluster 2.

Distance Function | Degree of Significance | TP | TN | FP | FN
MAD | 0.1 | 16,389 | 1788 | 8212 | 2611
KS test | 0.1 | 13,504 | 2522 | 7478 | 5496
Kullback–Leibler | 0.05 | 2715 | 8290 | 1710 | 16,285
Table 13. Results of the model evaluation for Cluster 2.

Distance Function | Precision | Recall | F1-Score | Accuracy
MAD (0.1) | 0.6662 | 0.8626 | 0.7518 | 0.6268
KS test (0.1) | 0.6436 | 0.7107 | 0.6755 | 0.5526
Kullback–Leibler (0.05) | 0.6136 | 0.1429 | 0.2318 | 0.3795
Table 14. Features with a correlation greater than or equal to 90%.

Flow Duration | Bwd Packets/s
Total Backward Packets | Packet Length Std
Total Length of Fwd Packets | Packet Length Variance
Bwd Packet Length Std | Subflow Fwd Packets
Flow Bytes/s | Subflow Bwd Packets
Flow Packets/s | act_data_pkt_fwd
Flow IAT Mean | Active Std
Flow IAT Std | Idle Std
Fwd IAT Mean | Bwd IAT Std
Fwd IAT Std | Fwd Packets/s
Table 15. Results of the application of distance functions with Benford's Law in Cluster 3.

Distance Function | Degree of Significance | TP | TN | FP | FN
MAD | 0.1 | 15,757 | 1938 | 8062 | 3243
KS test | 0.1 | 13,504 | 2522 | 7478 | 5496
Kullback–Leibler | 0.05 | 2735 | 8197 | 1803 | 16,265
Table 16. As an example, four flows (two benign and two malicious) are compared, together with the frequency of occurrence of each first digit. The table also includes the decision made by the model, which is contrasted with the original label of each flow to assess the model's effectiveness in correctly identifying the benign or malicious nature of the flows analyzed.

Flow | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Decision by MAD 0.1 | Original Label
Benign
2 | 0.6667 | 0.3333 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
81 | 0.5384 | 0 | 0.1538 | 0.0769 | 0 | 0.0769 | 0.1538 | 0 | 0 | 1 | 0
Malicious
23,777 | 0.625 | 0 | 0 | 0.125 | 0 | 0 | 0.25 | 0 | 0 | 0 | 1
28,690 | 0.5000 | 0.2500 | 0 | 0 | 0.0833 | 0 | 0.1666 | 0 | 0 | 1 | 1
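The digit-frequency rows of Table 16 are obtained by counting the leading digits of a flow's feature values. A minimal sketch follows (the sample values below are hypothetical and not taken from the dataset):

```python
from collections import Counter

def first_digit(x):
    """Leading non-zero digit of a positive number (works for ints and
    typical floats; scientific notation also starts with the lead digit)."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def digit_frequencies(values):
    """Relative frequency of each first digit (1-9), as in Table 16."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return [round(counts.get(d, 0) / total, 4) for d in range(1, 10)]

# Hypothetical feature values for one flow (not from the dataset)
freqs = digit_frequencies([2, 25, 11, 13, 61, 190])
print(freqs)  # [0.5, 0.3333, 0.0, 0.0, 0.0, 0.1667, 0.0, 0.0, 0.0]
```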
Table 17. Features suggested by Mbona and their correspondence with our research findings.

Flow Duration | Packet Length Mean
Fwd Packet Length Mean | Packet Length Std
Fwd Packet Length Std | Packet Length Variance
Bwd Packet Length Mean | Avg Fwd Segment Size
Flow Bytes/s | Avg Bwd Segment Size
Flow Packets/s | Subflow Fwd Packets
Flow IAT Mean | Subflow Fwd Bytes
Flow IAT Std | Subflow Bwd Packets
Fwd Packets/s | Avg Fwd Segment Size
Max Packet Length | Avg Bwd Segment Size
Table 18. Features suggested by Mbona that were not included in our research.

Features | Correlation
Bwd Packet Length Min | 24.91%
Flow IAT Min | 66.22%
Average Packet Size | 69.13%
Fwd Avg Bytes/Bulk | -
Bwd Avg Packets/Bulk | -
Bwd Avg Bulk Rate | -
Init_Win_bytes_backward | 44.33%
Table 19. Results achieved using the features identified by Mbona.

Distance Function | Degree of Significance | TP | TN | FP | FN
MAD | 0.1 | 15,686 | 1905 | 8095 | 3314
KS test | 0.1 | 12,745 | 3998 | 6002 | 6255
Kullback–Leibler | 0.05 | 3235 | 7978 | 2022 | 15,765
Table 20. Results of the model evaluation for the features proposed by Mbona.

Distance Function | Precision | Recall | F1-Score | Accuracy
MAD (0.1) | 0.6596 | 0.8256 | 0.7333 | 0.6066
KS test (0.1) | 0.6798 | 0.6708 | 0.6753 | 0.5773
Kullback–Leibler (0.05) | 0.6154 | 0.1703 | 0.2667 | 0.3867
Table 21. Detection of malicious and benign flows using Bayes' Theorem in conjunction with Fisher's and Tippett's methods to generate a global p-value for network flow classification.

Method | α | TP | TN | FP | FN
Fisher method | 0.05 | 8733 | 5360 | 4640 | 10,267
Fisher method | 0.01 | 4136 | 7622 | 2378 | 14,864
Fisher method | 0.1 | 12,885 | 3134 | 6866 | 6115
Tippett method | 0.05 | 18,890 | 204 | 9796 | 110
Tippett method | 0.01 | 2448 | 9211 | 789 | 16,552
Tippett method | 0.1 | 19,000 | 0 | 10,000 | 0
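Fisher's and Tippett's methods combine the per-feature p-values into a single global p-value per flow. The sketch below (illustrative, not the authors' code) uses the closed-form chi-square survival function for even degrees of freedom, so it needs only the standard library; the example p-values are hypothetical.

```python
import math

def fisher_combined_p(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom under H0."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # = X / 2
    # Survival function of a chi-square with even df 2k has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def tippett_combined_p(pvalues):
    """Tippett's method: global p-value from the smallest individual p-value."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)

ps = [0.1, 0.2, 0.3]  # hypothetical per-feature p-values
print(round(fisher_combined_p(ps), 3))   # 0.115
print(round(tippett_combined_p(ps), 3))  # 0.271
```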
Table 22. Model evaluation results obtained from the ensemble created using the Fisher and Tippett methods, applied to Cluster 3.

Ensemble Method | Precision | Recall | F1-Score | Accuracy
Fisher method (α = 0.1) | 0.6524 | 0.6782 | 0.6650 | 0.5524
Tippett method (α = 0.05) | 0.6585 | 0.9942 | 0.7923 | 0.6584
Fernandes, P.; Ciardhuáin, S.Ó.; Antunes, M. Unveiling Malicious Network Flows Using Benford’s Law. Mathematics 2024, 12, 2299. https://doi.org/10.3390/math12152299
