Detecting False Data Injection Attacks Using Machine Learning-Based Approaches for Smart Grid Networks

Abudin, MD Jainul; Thokchom, Surmila; Naayagi, R. T.; Panda, Gayadhar

doi:10.3390/app14114764

¹ Department of Computer Science & Engineering, National Institute of Technology Meghalaya, Shillong 793003, Meghalaya, India

² Department of Electrical Power Engineering, Newcastle University (Singapore), Singapore 567739, Singapore

³ Department of Electrical Engineering, National Institute of Technology Meghalaya, Shillong 793003, Meghalaya, India

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(11), 4764; https://doi.org/10.3390/app14114764

Submission received: 26 April 2024 / Revised: 26 May 2024 / Accepted: 29 May 2024 / Published: 31 May 2024

(This article belongs to the Special Issue Electric Power Applications II)

Abstract

Current electricity sectors will be unable to keep up with the increasing demand of commercial and residential customers for data-enabled power systems; therefore, next-generation power systems must be developed. The smart grid, an advanced power system of the future, can make decisions, estimate loads, and execute other data-related tasks. Customers in smart grid systems can adjust their consumption by monitoring billing information. Because they rely on data networks, smart grids are vulnerable to cyberattacks that could compromise billing data and cause power outages and other problems. A false data injection attack (FDIA) is a significant attack that aims to corrupt state estimation vectors. The primary goal of this paper is to show the impact of an FDIA on a power dataset and to detect the attack using machine learning algorithms implemented in Python. In the experiment, we used the power dataset from the IoT server of a 10 kV solar PV system (to mimic a smart grid system) in a controlled laboratory environment to test the effect of an FDIA and to detect this anomaly using a machine learning approach. Several machine learning models (random forest, isolation forest, logistic regression, decision tree, autoencoder, and feed-forward neural network) were compared in terms of their effectiveness in detecting FDIAs in order to find the most suitable approach. The highest F1 score, 0.99, was achieved by the decision tree algorithm, closely followed by logistic regression with an F1 score of 0.98. These algorithms also achieved high precision, recall, and accuracy, demonstrating their efficacy in detecting FDIAs. The research presented in this paper indicates that combining logistic regression and decision tree in an ensemble leads to significant performance enhancements: the resulting model achieves an accuracy of 0.99, a precision of 1, and an F1 score of 1.

Keywords: smart grid (SG); false data injection attack (FDIA); machine learning (ML)

1. Introduction

A smart grid is a technologically advanced and modernized electrical grid that integrates digital communication, advanced technologies, and intelligent devices to improve the efficiency, reliability, sustainability, and security of power generation, distribution, and consumption [1,2]. Unlike conventional electrical grids, which are predominantly one-way systems in which power flows from centralized generation facilities to consumers, smart grids facilitate two-way communication and real-time data exchange between the grid’s components [2]. While smart grids offer numerous advantages, such as increased efficiency and dependability, they are also susceptible to numerous cyber threats that can compromise their security and functionality. There can be different types of cyberattacks on smart grids; one such typical scenario is shown in Figure 1, which shows that an attacker from a remote location can easily send malware to the communication network [3]. Cyberattacks are developed to obtain unauthorized access to damage and disrupt or steal an IT asset, other sensitive data, etc. [1]; in the case of power data, cyberattacks are primarily focused on compromising the data’s integrity and the availability of the service. Cyberattacks can originate from trusted individuals within an organization or from unknown parties located remotely.

This paper aims to show the effect of false data injection attacks on the power dataset collected from an IoT server of a 10 kV solar PV system in a laboratory environment and to detect such attacks using machine learning approaches. To demonstrate the effects of a false data injection attack (FDIA), a simulated FDIA was conducted on a portion of the power dataset using Python (https://www.python.org/, accessed on 10 May 2024), which allowed the voltage curve to be visualized before and after the attack. Different machine learning methods were then used to detect the attack. Among these algorithms, decision tree and logistic regression showed superior performance, so they were combined into an ensemble to enhance the overall detection of the FDIA. The techniques used in this paper are common in several fields, yet research on their application to power datasets is limited. There is no definitive formula for determining which approaches to use in an ensemble; the ensemble methodology relies on the individual performance of the algorithms employed. In our research, we selected the top two algorithms based on their evaluation metrics and combined them, which resulted in improved performance. This comparative analysis will give researchers insights for choosing machine learning approaches for anomaly detection. This work focuses exclusively on FDIAs; however, additional types of attacks could be included to evaluate the performance of these algorithms and to develop an intelligent anomaly detection system.

2. Related Work

Wang et al. [4] proposed a machine learning-based approach for identifying DoS attacks on smart grids to address the intrusion problem. The approach collects real-time data from a smart meter and a data server. DoS attacks are detected using trained SVM classifier models, with feature selection and PCA dimension reduction used to select the characteristics. Experiments on a public-domain dataset show that the SVM classification model outperforms the naive Bayesian network and decision tree classification methods. The model primarily employs a machine learning (ML) algorithm to distinguish DoS attack behavior from expected behavior, thereby enhancing the smart grid's security. Esmalifalak et al. [5] first collected the state estimator's normal and stealthily attacked operating points. The learning (historical) data come from monitoring the network's active power flows. When projected into a low-dimensional space, the historical data are well separated from the data under attack, so machine learning methods can detect stealthy false data injection in the state estimator. The authors identify attacked and safe operation modes using supervised and unsupervised learning, and their numerical results show that the suggested algorithms identify stealthy false data injection. Transmission line and generator outages are examples of non-cyberattack failures; the work could be extended with more advanced machine learning and data mining methods to discover such power network problems. Omer et al. [6] demonstrated a man-in-the-middle attack scenario in which an attacker monitors the process communication between control systems and field devices, injects spoofed data, and corrupts the data, for example, by sending bogus commands to the field devices. The authors analyze the data collected under both normal and attack settings to extract domain-specific information for detection techniques, and they show that the attack scenario is applicable in a physical smart grid laboratory context. Ruobin et al. [7] created a technique to detect cyber intrusions in smart grids based on semi-supervised anomaly detection and deep representation learning using PMU measurements that span the physical and cyber realms. Since semi-supervised anomaly detection methods rely solely on examples of normal events to train detection models, they are well suited to detecting events of unknown attack types. The authors used publicly available datasets of attacks on the power grid to evaluate various techniques and found that semi-supervised algorithms are superior to standard supervised algorithms in detecting attack events. Furthermore, their findings demonstrate that deep autoencoder-based representation learning can improve the detection performance of semi-supervised algorithms. In their survey, Lui et al. [8] comprehensively analyzed the effects of false data attacks on smart grids. They summarized the state-of-the-art machine learning-based detection techniques by categorizing them according to three different types of attacks, and they established the critical need for cutting-edge machine learning methods that combine high resilience and efficiency.

The above-mentioned work emphasizes the need to use machine learning techniques to improve the security of smart grids against cyber threats. Techniques such as SVM classification, deep representation learning, and semi-supervised anomaly detection are used to detect and mitigate cyberattacks, including DoS and data injection. These techniques use real-time data from smart meters, data servers, and power flow monitoring to detect abnormal behavior and to ensure the integrity and stability of smart grid operations. Powerful machine learning approaches are widely acknowledged as necessary for efficiently addressing the growing number of cyberattacks. In particular, it is essential to investigate potential adversarial attacks and appropriate defenses. Furthermore, novel detection approaches such as collaborative and decentralized learning can address data availability and efficiency challenges.

3. False Data Injection Attacks in Smart Grids

False data injection refers to a group of malicious data attacks that target critical infrastructure managed by cyber-physical information systems. In an FDIA, the attacker interferes with sensor readings so that erroneous data are inserted, without being noticed, into the calculations and variables that define the system state [9,10]. The attacker introduces erroneous measurements or data into the power grid's sensors or communication lines and, by manipulating the data, can deceive the control systems into believing that the grid is in a different state than it actually is. Because control systems rely on precise and timely information to make crucial decisions, such as changing power generation, balancing loads, or managing system faults, this could have catastrophic repercussions. False data injection (FDI) attacks are common in power system models for numerous reasons [9,10,11,12,13]:

High impact: Industries, businesses, and families depend on power systems. Power grid disruptions can cause economic losses, societal discomfort, and security risks. FDI attacks can cause significant harm or interruptions with little effort.
Interconnectedness: Power grids span broad geographic areas and involve many power production units, transmission lines, and distribution networks. Power systems are interconnected, allowing attackers to exploit vulnerabilities in one grid sector and cause widespread disruptions or blackouts. FDI attacks can disable a system by targeting individual components or nodes.
Power system operation depends on accurate and timely data: Control systems monitor the grid and make decisions and act using sensor measurements, communication routes, and data processing algorithms. Attackers can mislead control systems and cause disruptions by inserting misleading data. Data-dependent power system models are vulnerable to FDI attacks.
FDI attacks can adversely affect the equipment: Data and control signal manipulation can cause equipment overheating, voltage instability, and failure. Power systems are tempting targets for cyberattackers looking to cause physical damage.

In an electrical power system, "state estimation" refers to estimating the status of the grid from the data reported by the meters on each bus. The bus voltage, bus active power, and bus reactive power are monitored, and the bus voltage and phase angle are the state variables. The alternating current (AC) power flow model of this system is

$a = m(b) + i$

(1)

where $b \in E^{F}$ is the state variable of the power system (the node voltages and phase angles), $a \in E^{L}$ is the measurement vector (the sensor measurement data), $i \in E^{L}$ is the measurement noise, and $m(b)$ is the nonlinear relationship between the measurement values and the state variables [9]. The direct current (DC) flow model can be obtained by applying a Taylor expansion to the AC model, which can then be approximated linearly near the operating point. This assumes that the noise follows a Gaussian distribution with a mean of 0 and a covariance matrix $\Lambda$ and that the system's state does not change quickly over time.

$a = M b + i$

(2)

where $M \in E^{L \times F}$ is the Jacobian matrix, and the state vector estimate can be calculated by weighted least squares estimation as follows:

$\hat{b} = (M^{T} \Lambda M)^{-1} M^{T} \Lambda a$

(3)

In the case of FDI attacks, the attacker aims to tamper with the measurement data at the terminal so that the devices generate an erroneous state estimate, eventually resulting in the wrong output. When the terminal is attacked, the measurement in Equation (2) is modified to

$\tilde{a} = M b + C + i$

(4)

where $C$ is the attack vector, which has the same dimension as the measurement vector. Bad data detection (BDD) is one of the most commonly used detection methods for the attack vector; it evaluates the measurement residual

$\gamma = \| \tilde{a} - M \tilde{b} \|_{2}$

(5)

where $\tilde{b}$ is the state estimate obtained from $\tilde{a}$. When the measurement residual exceeds a set threshold, γ > ε₀, the measurement is considered attacked [9,10,11]. BDD is employed to filter out erroneous measurements produced by faulty devices or malicious attacks and thus to ensure the integrity of the state estimates. However, an attacker who is fully aware of the power grid topology and transmission lines and knows the Jacobian matrix, $M$, can launch a concealed attack vector of the form $C = Mc$ (commonly written as a = Hc in the literature), where $c$ is an arbitrary vector. In this case, the system's state estimate changes even though the residual is the same as it would be without the attack, so the attack passes the BDD check.
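The interplay between Equations (3) to (5) and a concealed attack can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the Jacobian is a random stand-in matrix, the weights are equal, the sizes and attack values are hypothetical, and NumPy is assumed to be available. It is not the experiment carried out in this paper.

```python
import numpy as np

def wls_estimate(M, W, a):
    """Weighted least-squares state estimate, in the form of Equation (3)."""
    return np.linalg.solve(M.T @ W @ M, M.T @ W @ a)

def residual_norm(M, a, b_hat):
    """2-norm of the measurement residual, in the form of Equation (5)."""
    return float(np.linalg.norm(a - M @ b_hat))

rng = np.random.default_rng(0)
n_meas, n_state = 8, 3                    # hypothetical problem sizes
M = rng.normal(size=(n_meas, n_state))    # random stand-in for the Jacobian
b_true = rng.normal(size=n_state)         # true state, unknown to the operator
noise = 0.01 * rng.normal(size=n_meas)    # small Gaussian measurement noise
W = np.eye(n_meas)                        # equal weights (assumption)

# Attack-free measurement, estimate, and residual.
a_clean = M @ b_true + noise
b_clean = wls_estimate(M, W, a_clean)
r_clean = residual_norm(M, a_clean, b_clean)

# Naive attack: a large offset on a single measurement.
naive_attack = np.zeros(n_meas)
naive_attack[0] = 5.0
a_naive = a_clean + naive_attack
r_naive = residual_norm(M, a_naive, wls_estimate(M, W, a_naive))

# Concealed attack: the attack vector lies in the column space of M.
c = np.array([0.5, -0.2, 0.1])            # hypothetical state offset
a_stealth = a_clean + M @ c
b_stealth = wls_estimate(M, W, a_stealth)
r_stealth = residual_norm(M, a_stealth, b_stealth)

print(f"clean residual:     {r_clean:.4f}")
print(f"naive residual:     {r_naive:.4f}")
print(f"concealed residual: {r_stealth:.4f}")
print("state shift under concealed attack:", b_stealth - b_clean)
```

Under these assumptions, the naive attack inflates the residual, whereas the concealed attack shifts the state estimate by c and leaves the residual unchanged, which is why a residual test alone cannot reveal it.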

4. Machine Learning Approach

Machine learning (ML) techniques and algorithms allow computer systems to learn from data and improve performance without being explicitly programmed. ML algorithms analyze data patterns and trends to create predictions from the learned information [12]. ML typically follows these steps [11,12,13,14]:

Data are collected from databases, sensors, and user interactions: The data should be representative and reflect a variety of task-related scenarios. For this paper, we collected the power dataset from an IoT server of a 10 kV solar PV system.
Data are preprocessed for analysis: This stage may involve removing outliers, addressing missing values, normalizing or scaling data, and encoding categorical variables into numbers. To enhance the overall training process, data pre-processing is a crucial step in the experiment; we utilized generative adversarial networks (GANs) [15] to achieve data augmentation. The data collected from the IoT server was in CSV format, which we passed through the GAN model, and the augmented data were later used as the input for the machine learning models.
Feature selection/engineering: Features representing the data’s significant qualities and that fit the learning objective are selected and developed. Feature engineering creates new features from existing ones, while feature selection selects a subset. As per the dataset available, there are multiple features, but only one feature is used for the experiment, and that is Gp3Ph (grid power on three phases), which is used for various calculations.
Model selection: As far as the experiment is concerned, we need to classify the attacked data from non-attacked data, for which we use the supervised machine learning approach by using classification algorithms. We will see the details of the algorithms in a later section in this paper.
Model training: The selected ML model is trained with the prepared data. Throughout training, the model optimizes its performance on the input data, usually by minimizing a loss or error function that measures the model's prediction error.
Model evaluation: The trained model is evaluated using task-specific measures, including accuracy, precision, recall, and mean squared error (Evaluation Metrics) [16,17]. Evaluation determines the model’s efficacy and generalization on unknown data.
Model deployment [18,19]: The trained model is deployed to provide forecasts or decisions based on fresh data. The trained model can be used on other datasets, provided that the inputs/features used in training are unchanged; otherwise, the model must be tested again or retrained on the new dataset.
Model monitoring and maintenance [20]: The deployed ML model is regularly checked and upgraded. ML models may need periodic retraining to adapt to shifting trends or to increase performance. The model needs to be retrained if there is a change in the features/input parameters or a significant change in the dataset. For this paper, we collected daily power generation data for one year from the IoT server, which is integrated with a 10 kV PV system in a laboratory environment. If the power generation pattern changes over the next one or two years, fresh training will also be needed to obtain optimum results. A typical machine learning scenario is presented in Figure 2 below.

Machine learning models are employed in power system models for cyber threat detection because they can scrutinize vast data, identify anomalies, and adjust to emerging threats. The utilization of these models facilitates the monitoring of power systems in real time, the recognition of intricate patterns, and scalability, thereby providing a comprehensive approach to detecting potential threats. Through machine learning, entities can augment their security measures, proficiently address potential threats, and guarantee the dependability of essential power infrastructure [19,20,21].

5. Methodology

The dataset used was collected from an IoT server that hosts the data of a 10 kV solar power system for laboratory purposes. It is a time series dataset from an energy monitoring system. It regularly captures various energy consumption, production, and system performance parameters. A brief description of each column in the dataset is presented below:

DateTime: The timestamp when the data were recorded.
DlyEn (Daily Energy): The energy consumed or produced daily, measured in an appropriate unit (likely kilowatt-hours or similar).
TotEn (Total Energy): The cumulative total energy consumption or production up to that point.
ParEn (Parameter Energy): A specific measured value of energy, likely an additional or derived energy parameter.
WkE (Weekly Energy): The energy consumption or production over the past week.
MnE (Monthly Energy): The energy consumption or production over the past month.
YrE (Yearly Energy): The energy consumption or production over the past year.
Gv3Ph (Grid Voltage, 3-Phase): The voltage measured across a 3-phase grid.
Gc3Ph (Grid Current, 3-Phase): The current measured across a 3-phase grid.
Gp3Ph (Grid Power, 3-Phase): The power measured across a 3-phase grid.
Freq3Ph (Frequency, 3-Phase): The frequency of the 3-phase grid.
InPw1 (Input Power 1): The power input at point 1.
InPv1 (Input Voltage 1): The voltage input at point 1.
InPc1 (Input Current 1): The current input at point 1.
InPw2 (Input Power 2): The power input at point 2.
InPv2 (Input Voltage 2): The voltage input at point 2.
InPc2 (Input Current 2): The current input at point 2.
InvTemp (Inverter Temperature): The temperature of the inverter.
BstTemp (Boost Temperature): The temperature of the boost component.
IslRes (Isolation Resistance): The isolation resistance of the system, which is essential for safety.
WGFreq (Wind Generator Frequency): The operating frequency of a wind generator, if applicable.
Phi3Ph (Phase Angle, 3-Phase): The phase angle of the 3-phase system, which indicates the power factor and efficiency.

In our simulation, we used Gc3Ph and Gp3Ph as two features. In this study, we applied the following machine learning models to identify the FDIA on the dataset:

Isolation Forest;
Random Forest;
SVM;
Decision Tree;
Autoencoder;
Logistic Regression.

The objective of the simulation is to demonstrate the effects of a false data injection attack on the dataset and employ various machine learning algorithms to identify the attack. The overarching methodology used in the simulation is illustrated in Figure 3. In the case of other models, it is necessary to modify the training and model definitions; the remaining steps will remain unchanged.

Isolation Forest: The isolation forest algorithm detects anomalies using machine learning. It randomly partitions data into subsets using binary trees. Fewer splits in these trees isolate anomalies, which are data points that are markedly distinct from the majority. A threshold can be defined to find anomalies based on the algorithm’s anomaly scores. Isolation forest is efficient, works well with high-dimensional data, and does not use complex distance metrics, although data distributions and parameter settings affect its performance [22,23]. By dividing the dataset into separate segments and arbitrarily choosing features, the isolation forest method separates anomalies until individual data points are isolated. Anomalies are anticipated to be simpler to isolate and require fewer splits. Based on the average path length in the isolation trees, the anomaly score is calculated as follows:

$p (y, m) = 2^{- F (g (y)) / e (m)}$

(6)

where F(g(y)) is the average path length for a datapoint y in m isolation trees, and e(m) is a normalization factor.
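
Equation (6) can be evaluated directly once the normalization factor is fixed. The sketch below assumes the form commonly used in the isolation forest literature, namely the average path length of an unsuccessful binary search tree lookup, with its argument taken to be the number of points used to build each tree. The source does not spell out the normalization, so this choice, the function names, and the sample values are assumptions.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def normalization(n):
    """Assumed normalization factor: average path length of an
    unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Anomaly score in the form of Equation (6)."""
    return 2.0 ** (-avg_path_length / normalization(n))

# Hypothetical average path lengths for trees built from 256 points.
print(anomaly_score(2.0, 256))    # short path: score well above 0.5
print(anomaly_score(20.0, 256))   # long path: score well below 0.5
```

With this normalization, points that are isolated after only a few splits score close to 1, and points whose average path length equals the normalization factor score exactly 0.5.
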
Random Forest: Several decision trees are used in the random forest ensemble machine learning technique to enhance predictions. Randomly selected data and characteristics train each decision tree to minimize overfitting and maximize generalization. Because random forest is a machine learning method that can handle a wide range of data and tasks, it is frequently used for classification and regression problems [24,25]. Aggregating predictions from individual trees can help obtain the prediction of the random forest as follows:

t = mode (predictions from individual trees)

(7)
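
As a minimal illustration of Equation (7), the mode of the individual tree votes can be computed with the standard library. The votes are hypothetical, and `forest_predict` is an illustrative helper name rather than part of any library.

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Return the most common class label among the individual tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

votes = [1, 0, 1, 1, 0]        # hypothetical votes from five trees (1 = attack)
print(forest_predict(votes))   # → 1
```
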
SVM: Support Vector Machine (SVM) is a machine learning approach for anomaly identification with only the majority class of data used for training. It finds a hyperplane encapsulating the majority class and recognizes anomalies as data points beyond this boundary during testing. It is useful for fraud detection and quality control when there is a goal of finding rare events in a regular dataset [22,26]. The decision function of the One-Class SVM is used to determine if a data point is an anomaly:

$g(a) = \operatorname{sign}(\operatorname{distance}(a) - th)$

(8)

where distance(a) represents the distance of the datapoint, a, from the decision boundary, and th is a threshold, i.e., a predefined value.
Decision Tree: A decision tree is a machine learning technique that models predictions and choices as trees. It is used for classification and regression. Root nodes reflect the initial decisions or features in the tree. Leaf nodes indicate outcomes or predictions, whereas internal nodes reflect intermediate judgments. Decision trees divide data by impurity reduction or knowledge gain. They can be pruned to avoid overfitting. In classification, leaf nodes are class labels; in regression, they are numbers [27,28]. The decision tree’s prediction is made by traversing down the tree based on the feature values of the input sample until a leaf node is reached: y = prediction at the leaf node.
Autoencoder: The autoencoder is an artificial neural network variant that is commonly employed in unsupervised machine learning. The primary applications of this technique include dimensionality reduction, feature learning, and data compression. Autoencoders are composed of two main components: an encoder and a decoder. The primary objective of autoencoders is to reconstruct the input data, thereby acquiring a compressed representation of the data through unsupervised learning [29,30]. The autoencoder learns to reconstruct its input data. The reconstruction error is used as a measure of an anomaly. The reconstruction error is calculated as follows:

$RSE(c) = \frac{1}{m} \sum_{i=1}^{m} (c_{i} - c)^{2}$

(9)

where $c_{i}$ represents the input data, $c$ represents the reconstructed data, and $m$ represents the number of features. An anomaly is detected if the reconstruction error is above a predefined threshold.
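
The thresholding step can be sketched independently of any particular autoencoder. In the example below, the reconstructions are made-up numbers standing in for the output of a trained model, the threshold is arbitrary, and NumPy is assumed to be available.

```python
import numpy as np

def reconstruction_error(inputs, reconstructions):
    """Mean squared error per sample, averaged over the feature axis."""
    inputs = np.asarray(inputs, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    return np.mean((inputs - reconstructions) ** 2, axis=1)

def flag_anomalies(inputs, reconstructions, threshold):
    """Mark a sample as anomalous when its error exceeds the threshold."""
    return reconstruction_error(inputs, reconstructions) > threshold

# Three samples with two features each; the last one is poorly reconstructed.
x = np.array([[1.0, 2.0], [1.1, 2.1], [2.2, 4.0]])
x_hat = np.array([[1.0, 2.1], [1.0, 2.0], [1.1, 2.0]])  # made-up reconstructions

print(flag_anomalies(x, x_hat, threshold=0.1).tolist())  # → [False, False, True]
```

In practice the threshold would be chosen from the error distribution on attack-free data, for example a high percentile; that choice is not specified in the source.
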
Logistic Regression: Logistic regression is a supervised machine learning algorithm that represents the relationship between one or more independent variables (features) and the likelihood of a specific result occurring. It is often used to solve binary classification problems, such as when determining whether an email is spam, whether a customer would purchase a product, etc. [31]. The logistic regression model predicts the probability of the positive class using the logistic function as follows:

$Q (a = 1| C) = \frac{1}{1 + e^{- C ω}}$

(10)

where C represents the input features, and ω indicates the model parameters.
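
Equation (10) amounts to a single matrix product followed by the logistic function. The sketch below uses hypothetical feature values and parameters, and the 0.5 decision threshold is an assumption rather than a value taken from the experiment.

```python
import numpy as np

def predict_proba(C, omega):
    """Probability of the positive class, in the form of Equation (10)."""
    z = np.asarray(C, dtype=float) @ np.asarray(omega, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(C, omega, threshold=0.5):
    """Binary prediction obtained by thresholding the probability."""
    return (predict_proba(C, omega) >= threshold).astype(int)

C = [[1.0, 2.0]]          # one sample with two hypothetical features
omega = [0.5, -0.25]      # hypothetical model parameters

# Here C @ omega = 0, so the predicted probability is exactly 0.5.
print(float(predict_proba(C, omega)[0]))  # → 0.5
```
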
FFNN: A feed-forward neural network (FFNN) is an artificial neural network in which information flows in a single direction: from the input layer to one or more hidden layers and then to the output layer. In the experiment, we used three layers. The first is the input layer, which receives the input data; each node in this layer represents one of the input's features. The second is a hidden layer, an intermediate layer between the input and output layers where processing is carried out; each node in a hidden layer is linked to every node in the preceding and subsequent layers. The last is the output layer [31,32]. The prediction of the FFNN, a multilayer perceptron (MLP), is obtained through forward propagation as follows:

$b^{(1)} = C D^{(1)} + g^{(1)}$
$f^{(1)} = \mathrm{ReLU}(b^{(1)})$
$b^{(2)} = f^{(1)} D^{(2)} + g^{(2)}$
$f^{(2)} = \mathrm{ReLU}(b^{(2)})$
$b^{(L)} = f^{(L-1)} D^{(L)} + g^{(L)}$
$e = \mathrm{Softmax}(b^{(L)})$

(11)

where $C$ represents the input data, $D^{(i)}$ and $g^{(i)}$ represent the weights and biases of layer $i$, $f^{(i)}$ represents the activation of layer $i$, and $e$ represents the prediction output.
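
The forward pass in Equation (11) can be written in a few lines of NumPy. The layer sizes and the random, untrained weights below are hypothetical and serve only to show the data flow; they are not the network used in the experiment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def forward(C, weights, biases):
    """ReLU on every hidden layer, softmax on the output layer."""
    f = C
    for D, g in zip(weights[:-1], biases[:-1]):
        f = relu(f @ D + g)
    return softmax(f @ weights[-1] + biases[-1])

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 2))               # 4 samples, 2 input features
weights = [rng.normal(size=(2, 8)),       # input layer -> hidden layer
           rng.normal(size=(8, 2))]       # hidden layer -> 2 output classes
biases = [np.zeros(8), np.zeros(2)]

e = forward(C, weights, biases)
print(e.shape)         # (4, 2)
print(e.sum(axis=1))   # each row of class probabilities sums to 1
```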

Data Processing and Steps

In this study, we aimed to detect false data injection attacks by analyzing the features Gp3Ph and Gc3Ph. To simulate an attack, we artificially increased the values of these features by a factor of 2.0, resulting in Gp3Ph_a and Gc3Ph_a. After generating the labels based on the modified feature values, we performed feature selection using Recursive Feature Elimination with Cross-Validation (RFECV) [32,33] to identify the most critical features for our model.
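
The attack simulation can be sketched as follows. The source states only that the feature values were scaled by a factor of 2.0 and that labels were generated from the modified values; which rows were attacked is not specified, so the random 20% selection, the helper name `inject_fdia`, and the synthetic readings below are all assumptions.

```python
import numpy as np

def inject_fdia(values, attack_fraction=0.2, factor=2.0, seed=0):
    """Scale a random subset of readings by `factor` and label them 1."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n_attacked = int(round(attack_fraction * len(values)))
    attacked_idx = rng.choice(len(values), size=n_attacked, replace=False)

    attacked = values.copy()
    attacked[attacked_idx] *= factor          # false data injection

    labels = np.zeros(len(values), dtype=int)
    labels[attacked_idx] = 1                  # 1 = attacked, 0 = normal
    return attacked, labels

# Synthetic stand-in for a column of grid power readings.
gp3ph = np.linspace(100.0, 200.0, 50)
gp3ph_a, labels = inject_fdia(gp3ph)
print(int(labels.sum()), "of", len(labels), "readings were modified")
```

Because only a minority of rows carry the attack label in a setup like this, the resulting classes are imbalanced, which is the situation the resampling step described next is meant to address.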

To address the class imbalance in the dataset, we applied the SMOTEENN (Synthetic Minority Over-sampling Technique and Edited Nearest Neighbors) [34,35] method. SMOTEENN combines the over-sampling of the minority class with SMOTE and cleans the dataset with ENN, thereby effectively balancing the class distribution. In our code, after RFECV, we used SMOTEENN on the feature-selected data (X_rfecv and y) to produce a balanced dataset (X_resampled and y_resampled). This balanced dataset was then split into training and testing sets using train_test_split. Using SMOTEENN, we ensured that the decision tree classifier was trained on a dataset with an approximately equal representation of both classes. This helped prevent the model from being biased towards the majority class and improved its ability to detect anomalies effectively.

The overall process for detecting FDI attacks using the ML algorithms mentioned above is shown in Figure 3. Every technique undergoes the following steps: import the power dataset, simulate the FDIA, carry out data preprocessing (described in detail above), and then train the model and run it for attack detection. If the attack is detected, the overall process is successful; otherwise, the model is fine-tuned iteratively until it produces the correct result.

6. Ensemble Approach

In machine learning, ensemble techniques combine several models to enhance generalization, robustness, and predictive accuracy. By combining the predictions of several models, these techniques take advantage of the wisdom of crowds and frequently provide better outcomes than any one model could on its own. The following are some salient features of ensemble approaches:

Diversity: Having a varied set of base models helps ensemble models perform better. Ensemble approaches can capture a broader range of patterns and lower the danger of overfitting by utilizing various algorithms, feature subsets, or training data.
Voting: In ensemble approaches like majority voting or averaging, each base model's prediction is weighted equally. This simple method can be surprisingly effective, especially when the base models are diverse and make uncorrelated mistakes.
Bagging and boosting: Two popular ensemble approaches are bagging (also known as bootstrap aggregating) and boosting. To lower variance, bagging creates several models using random subsets of the training data and averages their predictions. Boosting enhances performance by iteratively training models with extra weight assigned to incorrectly classified samples.
Stacking: Stacking, also known as meta-learning, aggregates the predictions of several base models using a meta-model. The meta-model is trained to learn how to optimally combine the predictions of the base models, as opposed to simply averaging them or voting. Stacking frequently results in better performance and can capture more intricate patterns.

To summarize, ensemble techniques are effective machine learning tools that can considerably increase model performance and robustness. Ensemble methods provide a viable answer to various prediction challenges across several domains by harnessing the aggregate expertise of multiple models.

The experiment involves logistic regression and decision tree models, two standard machine learning techniques for classification. Logistic regression delivers probabilistic predictions and is interpretable, whereas decision trees are more flexible and can capture nonlinear relationships in data. Combining the two in an ensemble makes use of each model's capabilities and can improve the overall prediction performance and robustness. The ensemble combines the logistic regression and decision tree predictions with a majority vote scheme: if both models predict the same class, the ensemble prediction is that class; if the two models disagree, a majority vote decides the ensemble prediction. In this way, the ensemble may improve generalization performance and capture a more thorough representation of the underlying data patterns, providing a versatile solution for classification tasks that uses the benefits of both methods.

Let us denote the prediction of logistic regression as $C_{lr}$ and the prediction of decision trees as $E_{dt}$. In this ensemble approach, logistic regression outputs probabilities for binary classification, typically between 0 and 1. We can threshold these probabilities to obtain binary predictions (0 or 1). Decision trees directly provide binary predictions based on the features. The ensemble combines the predictions of logistic regression and decision trees using a majority voting scheme. Specifically, if logistic regression and decision trees predict the same class, the ensemble prediction is that class. If logistic regression and decision trees predict different classes, the ensemble prediction is determined by a majority vote among these two predictions.

Mathematically, the ensemble prediction $G_{En}$ can be defined as

$$G_{En} = MV(C_{lr}, E_{dt}) \qquad (12)$$

where $C_{lr}$ is the prediction of logistic regression, $E_{dt}$ is the prediction of decision trees, and $MV(\cdot)$ is the function that returns the majority class among its inputs.
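The voting rule in Equation (12) can be sketched in a few lines of Python. With only two voters, a disagreement is a tie, and the text does not state how that tie is resolved; the sketch below assumes that a tie is resolved in favor of the attack class (label 1), which is a conservative choice made here for illustration. The 0.5 cutoff and the function names are likewise assumptions.

```python
from collections import Counter


def threshold(prob, cutoff=0.5):
    """Turn a logistic regression probability into a binary label (assumed cutoff)."""
    return 1 if prob >= cutoff else 0


def majority_vote(*labels, tie_break=1):
    """Return the most frequent label; on a tie, return tie_break (an assumption)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]


def ensemble_predict(lr_prob, dt_label, tie_break=1):
    """G_En = MV(C_lr, E_dt) for one sample."""
    c_lr = threshold(lr_prob)  # thresholded logistic regression prediction
    e_dt = dt_label            # decision tree prediction, already 0 or 1
    return majority_vote(c_lr, e_dt, tie_break=tie_break)


agree_attack = ensemble_predict(0.9, 1)
agree_normal = ensemble_predict(0.1, 0)
disagree = ensemble_predict(0.9, 0)
```

Setting `tie_break=0` instead would make the ensemble flag an attack only when both models agree, trading recall for precision.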

This ensemble technique combines the advantages of decision trees (non-linear, flexible models) and logistic regression (interpretable, probabilistic predictions) to enhance the predictive performance and model robustness. Furthermore, because the two models make different kinds of errors, the ensemble can lower the risk of overfitting and capture different facets of the data. The overall result of the ensemble approach is reported in Section 7.

7. Experimental Result

To demonstrate the impact of false data injection attacks (FDIAs) on power data, we used the IEEE 14-bus model. We injected an FDIA on bus number 2 (the attack increases a generator’s active power output at bus no. 2) and conducted a power flow analysis before and after the FDIA. To perform this experiment, we used a Python tool; we carried out the following steps to perform the simulation:

We loaded the predefined power network with the help of the raw power dataset.

Then, we performed a power flow analysis, checked whether it converged successfully, and presented the resulting power network data, which contained the following parameters: vm_pu (voltage magnitude in per unit), va_degree (voltage phase angle in degrees, i.e., the angular displacement of the voltage relative to a reference phase angle), p_mw (active power in megawatts), and q_mvar (reactive power in megavolt-amperes reactive). The results are shown below in Table 1 and Table 2.

Figure 4 and Figure 5 show the voltage curve before and after the FDIA attack, which clearly shows the attack’s impact on bus no. 2.
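The before and after comparison can also be checked numerically. The sketch below copies the vm_pu columns of Table 1 and Table 2 and computes the per-bus change in voltage magnitude. The variable and function names are hypothetical, and the reading that table row 1 corresponds to bus no. 2 assumes the tables index buses from zero.

```python
# vm_pu column of Table 1 (before the FDIA), rows 0..13
vm_before = [1.060000, 1.045000, 1.010000, 1.035265, 1.040079, 1.056000,
             1.050408, 1.052931, 1.052931, 1.052885, 1.052668, 1.052423,
             1.052282, 1.052235]

# vm_pu column of Table 2 (after the FDIA), rows 0..13
vm_after = [1.060000, 1.945000, 1.010000, 1.102653, 1.242754, 1.056000,
            1.117861, 1.086644, 1.086644, 1.086600, 1.086389, 1.086152,
            1.086016, 1.085970]


def voltage_deviation(before, after):
    """Per-bus change in voltage magnitude (per unit), rounded to table precision."""
    return [round(a - b, 6) for b, a in zip(before, after)]


deviation = voltage_deviation(vm_before, vm_after)

# Row with the largest absolute change in voltage magnitude.
most_affected = max(range(len(deviation)), key=lambda i: abs(deviation[i]))

# Rows whose voltage magnitude is identical before and after the attack.
unchanged = [i for i, d in enumerate(deviation) if d == 0.0]

for bus, d in enumerate(deviation):
    print(f"row {bus:2d}: delta vm_pu = {d:+.6f}")
```

The largest change appears in row 1, while three rows keep the same voltage magnitude in both tables.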

Figure 6 shows a snapshot of the year's three-phase grid power data under normal conditions and under an FDI attack, clearly showing the attack's impact. The algorithms are compared in detail on standard anomaly detection metrics, and Table 3 reports the results. Selecting the most suitable anomaly detection method from Table 3 depends on the requirements and priorities of the user. Every method has distinct advantages and disadvantages, and the optimal choice depends on aspects such as the characteristics of the data, the relative importance of precision versus recall, and computational efficiency. The algorithms compared are isolation forest, random forest, SVM, decision tree, logistic regression, autoencoder, and feed-forward neural network (FFNN). The isolation forest and FFNN exhibit the lowest F1 score, precision, and recall. The isolation forest achieves an F1 score of 0.68, which suggests difficulty in reliably detecting anomalies, although its precision is moderate at 0.76. Similarly, the FFNN achieves an F1 score of 0.53, indicating limits in its capacity to generalize to unseen data or to distinguish anomalies from normal behavior.

The autoencoder, logistic regression, and the Support Vector Machine (SVM) all perform strongly across the evaluation metrics. The autoencoder achieves an F1 score of 0.95, indicating that it detects anomalies effectively while maintaining high precision and recall. Logistic regression and the SVM both achieve F1 scores of 0.98, showing that they detect anomalies robustly. Nevertheless, logistic regression has a lower ROC value than the SVM (0.92 versus 0.98 in Table 3), indicating possible difficulties in managing intricate anomaly patterns. For the ensemble approach, the choice of decision tree and logistic regression as base models relies on their performance in the evaluation metrics shown in Table 3.

An ROC value (the area under the ROC curve, AUC) closer to 1 means the model discriminates well between positive and negative instances, with a high true positive rate and a low false positive rate across threshold values. In practical terms, an AUC of 1, as reported for the ensemble approach in Table 3, means the model's scores rank every positive instance above every negative instance, so some threshold separates the two classes perfectly on the evaluated data. Figure 7 presents Table 3 graphically, showing the performance metrics of the algorithms used to detect false data injection attacks.
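The ranking interpretation of the AUC can be made concrete with a small Python sketch. It computes the AUC as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, counting ties as half. The function name `roc_auc` and the example scores are hypothetical and are not taken from the experiment.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Perfect ranking: every attack sample scores above every normal sample.
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])

# One attack sample (0.35) scores below one normal sample (0.4).
imperfect = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the second case three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.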

The ensemble approach (a combination of decision tree and logistic regression) demonstrates outstanding performance in all measures: the F1 score, precision, and recall are 1, whereas the Kappa score and model accuracy are 0.99. These statistics clearly show that the ensemble approach can detect the anomaly.
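For reference, the metrics named in Table 3 can be computed from the counts of a binary confusion matrix, as in the following minimal sketch. It scores the positive (attack) class only; Table 3 does not state whether its values are per-class or averaged, so this is an assumption, and the function name and sample labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, accuracy, and Cohen's kappa for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / n

    # Agreement expected by chance, from the marginal label frequencies.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "kappa": kappa}


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

On balanced data such as this example, accuracy of 0.75 corresponds to a kappa of 0.5, since half of the agreement is expected by chance.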

8. Conclusions

In conclusion, this study investigated the repercussions of false data injection attacks through simulated scenarios using Python tools. While logistic regression and decision trees demonstrated commendable performance across various metrics on the dataset, considerations such as interpretability, computational efficiency, and scalability are paramount in the selection process. An ensemble approach using logistic regression and decision trees was explored as a viable alternative. The comparative analysis conducted in this paper determined the efficacy of different machine learning methods in identifying false data injection attacks (FDIAs). Among the individual algorithms tested, the decision tree was the most successful, with the highest F1 score of 0.99. The logistic regression technique was a close second, with an F1 score of 0.98. The ensemble technique that combined the decision tree and logistic regression algorithms improved on both, achieving an F1 score of 1, a precision of 1, and a model accuracy of 0.99. These results highlight the value of machine learning methods for accurate FDIA detection. By leveraging the complementary strengths of these algorithms, ensemble methods offer the potential to enhance the overall performance and robustness, and they underscore the importance of using diverse models to mitigate the biases and errors inherent in individual algorithms.

Author Contributions

Conceptualization, S.T. and G.P.; Methodology, M.J.A.; Software, M.J.A.; Validation, S.T. and G.P.; Formal analysis, M.J.A.; Investigation, M.J.A.; Resources, S.T. and G.P.; Data curation, M.J.A.; Writing—original draft, M.J.A.; Writing—review & editing, S.T.; Visualization, M.J.A.; Supervision, S.T. and G.P.; Project administration, S.T.; Funding acquisition, R.T.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in our article was generated from our laboratory for the study purpose.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ding, J.; Qammar, A.; Zhang, Z.; Karim, A.; Ning, H. Cyber Threats to Smart Grids: Review, Taxonomy, Potential Solutions, and Future Directions. Energies 2022, 15, 6799.
2. Faquir, D.; Chouliaras, N.; Sofia, V.; Olga, K.; Maglaras, L. Cybersecurity in smart grids, challenges, and solutions. AIMS Electron. Electr. Eng. 2021, 5, 24–37.
3. Liu, J.; Xiao, Y.; Li, S.; Liang, W.; Chen, C.L. Cybersecurity and Privacy issues in smart grids. IEEE Commun. Surv. Tutor. 2012, 14, 981–997.
4. Zhe, W.; Wei, C.; Chunlin, L. DoS attack detection model of smart grid based on machine learning method. In Proceedings of the IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 28–30 July 2020.
5. Esmalifalak, M.; Liu, L.; Nguyen, N.; Zheng, R.; Han, Z. Detecting Stealthy False Data Injection Using Machine Learning in Smart Grid. IEEE Syst. J. 2017, 11, 1644–1652.
6. Sen, O.; van der Velde, D.; Linnartz, P.; Hacker, I.; Henze, M.; Andres, M.; Ulbig, A. Investigating Man-in-the-Middle-based False Data Injection in a Smart Grid Laboratory Environment. In Proceedings of the IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT Europe), Espoo, Finland, 18–21 October 2021.
7. Qi, R.; Rasband, C.; Zheng, J.; Longoria, R. Detecting Cyber Attacks in Smart Grids Using Semi-Supervised Anomaly Detection and Deep Representation Learning. Information 2021, 12, 328.
8. Cui, L.; Qu, Y.; Gao, L.; Xie, G.; Yu, S. Detecting false data attacks using machine learning techniques in smart grid: A survey. J. Netw. Comput. Appl. 2020, 170, 102808.
9. Xu, A.; Zhang, T.; Chen, L.; Li, Q.; Zhang, Y.; Lin, H.; Wang, P.; Wu, S.; Zhao, R.; Jiang, Y. Research on False Data Injection Attack in Smart Grid. In Proceedings of the IOPSCIENCE, 8th Annual International Conference on Geo-Spatial Knowledge and Intelligence, Xi’an, China, 18–19 December 2020.
10. Wang, Q.; Tai, W.; Tang, Y.; Ni, M. Review of the false data injection attack against the cyber-physical power system. IET Cyber-Phys. Syst. Theory Appl. 2019, 4, 101–107.
11. Cintuglu, M.H.; Mohammed, O.A.; Akkaya, K.; Uluagac, A.S. A Survey on Smart Grid Cyber-Physical System Testbeds. IEEE Commun. Surv. Tutor. 2016, 19, 446–464.
12. Mahesh, B. Machine Learning Algorithms - A Review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386.
13. Gyawali, S.; Beg, O. Cyber Attacks Detection using Machine Learning in Smart Grid Systems. In Proceedings of the IEEE INFOCOM 2022 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), online, 2–5 May 2022.
14. Alwageed, H.S. Detection of cyber-attacks in smart grids using SVM-boosted machine learning models. Serv. Oriented Comput. Appl. 2022, 16, 313–326.
15. Huseinović, A.; Mrdović, S.; Bicakci, K.; Uludag, S. A Survey of Denial-of-Service Attacks and Solutions in the Smart Grid. IEEE Access 2020, 8, 177447–177470.
16. Asri, S.; Pranggono, B. Impact of Distributed Denial-of-Service Attack on Advanced Metering Infrastructure. Wirel. Pers. Commun. 2015, 83, 2211–2223.
17. Wang, K.; Du, M.; Maharjan, S.; Sun, Y. Strategic Honeypot Game Model for Distributed Denial of Service Attacks in the Smart Grid. IEEE Trans. Smart Grid 2017, 8, 2474–2482.
18. Le, T.D.; Anwar, A.; Loke, S.W.; Beuran, R.; Tan, Y. GridAttackSim: A Cyber Attack Simulation Framework for Smart Grids. Electronics 2020, 9, 1218.
19. Sakhnini, J.; Karimipour, H.; Dehghantanha, A. Smart Grid Cyber Attacks Detection Using Supervised Learning and Heuristic Feature Selection. In Proceedings of the IEEE International Conference on Smart Energy Grid Engineering (SEGE), Oshawa, ON, Canada, 12–14 August 2019.
20. Song, H.; Fink, G.A.; Jeschke, S. Detecting Data Integrity Attacks in Smart Grid. In Security and Privacy in Cyber-Physical Systems: Foundations, Principles, and Applications; Wiley-IEEE Press, 2017. Available online: https://ieeexplore.ieee.org/document/8068874 (accessed on 10 May 2024).
21. El Houda, Z.A.; Hafid, A.; Khoukhi, L. Blockchain Meets AMI: Towards Secure Advanced Metering Infrastructures. In Proceedings of the ICC 2020 - 2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020.
22. Feng, C.; Wang, Y.; Chen, Q.; Ding, Y.; Strbac, G.; Kang, C. Smart grid encounters edge computing: Opportunities and applications. Adv. Appl. Energy 2021, 1, 100006.
23. Otokwala, U.; Petrovski, A.; Kalutarage, H. Improving Intrusion Detection Through Training Data Augmentation. In Proceedings of the 14th International Conference on Security of Information and Networks (SIN), Edinburgh, UK, 15–17 December 2021.
24. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008.
25. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
26. Brownlee, J. One-Class Classification Algorithms for Imbalanced Datasets. Machine Learning Mastery. 2020. Available online: https://machinelearningmastery.com/one-class-classification-algorithms-for-imbalanced-datasets/ (accessed on 10 May 2024).
27. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
28. Ross Quinlan, J. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993.
29. MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA; Volume 1, pp. 281–297. Available online: https://www.semanticscholar.org/paper/Some-methods-for-classification-and-analysis-of-MacQueen/ac8ab51a86f1a9ae74dd0e4576d1a019f5e654ed (accessed on 10 May 2024).
30. Iglewicz, B.; Hoaglin, D.C. Statistical Methods for Detecting Outliers. Technometrics 1993, 35, 1–12.
31. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
32. Maalouf, M. Logistic regression in data analysis: An overview. Int. J. Data Anal. Tech. Strateg. (IJDATS) 2011, 3, 281–299.
33. Jeon, H.; Oh, S. Hybrid-Recursive Feature Elimination for Efficient Feature Selection. Appl. Sci. 2020, 10, 3211.
34. Awad, M.; Fraihat, S. Recursive Feature Elimination with Cross-Validation with Decision Tree: Feature Selection Method for Machine Learning-Based Intrusion Detection Systems. J. Sens. Actuator Netw. 2023, 12, 67.
35. Chai, C.W.; Tan, J.; Shen, L. A Hybrid SMOTEENN-XGBoost Model for Predicting Customer Churn in the Banking Sector. PLoS ONE 2023, 18, e0289724.

Figure 1. Cyberattack scenario in smart grid.

Figure 2. Machine learning process [18,19,20].

Figure 3. The steps involved in the overall process.

Figure 4. Voltage magnitude before attack.

Figure 5. The voltage magnitude curve after the attack.

Figure 6. A snapshot (output figure from Python interface) of the Gp3Ph (grid power three-phase) curve before and after the attack.

Figure 7. Performance evaluation.

Table 1. Power flow results before FDIA.

Bus	vm_pu	va_degree	p_mw	q_mvar
0	1.060000	0.000000	−1845.104233	−6684.031922
1	1.045000	0.028608	−85.000000	2046.038644
2	1.010000	0.193523	200.000000	10,699.913750
3	1.035265	−0.179392	200.000000	0.000000
4	1.040079	−0.098594	200.000000	0.000000
5	1.056000	−0.561160	0.000000	−6610.052732
6	1.050408	−0.471364	200.000000	0.000000
7	1.052931	−0.578354	0.000000	0.000000
8	1.052931	−0.578354	200.000000	0.000000
9	1.052885	−0.588686	0.000000	0.000000
10	1.052668	−0.638291	200.000000	0.000000
11	1.052423	−0.694122	200.000000	0.000000
12	1.052282	−0.726391	200.000000	0.000000
13	1.052235	−0.737150	200.000000	0.000000

Table 2. Power flow results after FDIA.

Bus	vm_pu	va_degree	p_mw	q_mvar
0	1.060000	0.000000	−70,218.370276	221,716.020580
1	1.945000	−13.416744	−85.000000	−607,688.647079
2	1.010000	−4.510858	200.000000	62,916.456990
3	1.102653	−5.666652	200.000000	0.000000
4	1.242754	−6.981959	200.000000	0.000000
5	1.056000	−5.173081	0.000000	34,449.823189
6	1.117861	−5.899939	200.000000	0.000000
7	1.086644	−5.605040	0.000000	0.000000
8	1.086644	−5.605040	200.000000	0.000000
9	1.086600	−5.614741	0.000000	0.000000
10	1.086389	−5.661315	200.000000	0.000000
11	1.086152	5.713733	200.000000	0.000000
12	1.086016	−5.744029	200.000000	0.000000
13	1.085970	−5.754129	200.000000	0.000000

Table 3. Evaluation metrics.

SL No.	Algorithm/Technique	F1 Score	Precision	Recall	Model Accuracy	ROC	Kappa Score
1	Random Forest	0.97	0.97	0.97	0.96	0.98	0.93
2	Isolation Forest	0.68	0.76	0.70	0.69	0.70	0.39
3	SVM	0.98	0.98	0.98	0.98	0.98	0.93
4	FFNN (Feed-Forward Neural network)	0.53	0.71	0.59	0.59	0.64	0.18
5	Autoencoder	0.95	0.95	0.95	0.94	0.95	0.89
6	Logistic Regression	0.98	0.98	0.98	0.98	0.92	0.93
7	Decision Tree	0.99	0.99	0.98	0.99	0.98	0.99
8	Ensemble Approach	1	1	1	0.99	1	0.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

