#### *2.2. Methodology*

Although machine learning approaches can detect disturbances and cyber-attacks on electric grids, they still have several drawbacks. Existing references mostly discuss how to diagnose attacks in electrical grids and seldom examine the relationships within the data. Moreover, when working on multi-class problems, many algorithms decompose them into multiple two-class problems, whereas the AdaBoost algorithm can handle multi-class problems directly. It cascades weak classifiers effectively and can use various classification algorithms as its weak learners, and it is highly competitive in terms of misclassification error rate [22]. As the amount of data grows, the fitting ability is affected both by generalization problems and by the increasing computational burden, since machine learning requires a large amount of computation to find the best solution. Additionally, the accuracy of the models presented in [11,12] is only about 90% on the multiclass data sets, which leaves considerable room for improvement. Motivated by these findings, this paper constructs a model that performs more thorough feature engineering and then splits the data by the different PMUs to reduce computational overhead. It should be noted that PMU allocation in the smart grid is performed at the planning stage and may serve different purposes; although the high cost can be a limitation, a large number of PMUs is always preferred so that all areas of the smart grid are covered. PMU allocation itself is out of the scope of this work and is treated widely in other research. Finally, the AdaBoost algorithm is adopted in this paper for detecting the 37-class fault and cyber-attack case studies in electric grids.

Regarding the feature selection process, it should be noted that this experiment uses a data set that contains 128 features recorded by PMUs 1 to 4 together with relay Snort alarms and logs (relay and PMU records are combined); each PMU records 29 different features. To obtain enriched and more informative data, feature construction engineering is performed, and 16 novel features are built by analyzing the raw features and the possible relationships among them in the electrical network. Technically, combining attributes into novel features helps to exploit more types of data instances, which in turn benefits the machine learning models that consume them. The random forest method is used to create and classify the features. Finally, based on anticipation weighted voting, 37 case studies are implemented for simulation purposes.

#### *2.3. Diagnosing Attack Behavior Model Structure*

Figure 2 shows the model architecture for detecting faults and cyber-attacks in electrical grids. As depicted in Figure 2, the architecture consists of four stages: property making, data dividing, layout training, and weighted voting, described as follows:

**Figure 2.** Overview of the layout for detecting disturbances and cyber-attacks in electrical networks.

Stage 1. Property making. By manually creating novel features from the original data set, the dimensionality of the data can be enriched; a novel piece of data is generated by integrating the novel features with several original ones. The features and data determine the upper limit of the model, and the algorithm can only approach this upper limit as closely as possible. Feature construction engineering is therefore essential for achieving maximum accuracy and improving robustness. Constructing features from the original data yields more flexible features, which increases data sensitivity and improves the ability to analyze the data when it is sent to the models for training and classification. Useful features should be simple to understand and maintain. The analysis led to the construction of 16 novel features. There is also a tendency in machine learning problems to include a large number of features for the training instances, which results in excessive computational overhead and overfitting, and therefore in poor efficiency; this problem is usually referred to as the curse of dimensionality. Feature selection and feature extraction have been widely applied to mitigate the problems caused by high dimensionality in learning problems [23].
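As a rough illustration of this stage, the sketch below combines raw PMU measurement columns into derived quantities. The column names and the derived features (apparent impedance and apparent power) are hypothetical placeholders for illustration only; the actual 16 engineered features are those listed in Table 2.

```python
# Illustrative sketch of property making: derive new features from raw PMU columns.
# Column names and derived quantities are assumptions, not the features of Table 2.
import numpy as np
import pandas as pd

def add_constructed_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for pmu in range(1, 5):                                   # PMUs 1-4
        v_col, i_col = f"PMU{pmu}_V_mag", f"PMU{pmu}_I_mag"   # assumed column names
        if v_col in out.columns and i_col in out.columns:
            current = out[i_col].replace(0, np.nan)           # avoid division by zero
            out[f"PMU{pmu}_impedance"] = out[v_col] / current      # |Z| = |V| / |I|
            out[f"PMU{pmu}_app_power"] = out[v_col] * out[i_col]   # |S| = |V| * |I|
    return out
```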

Stage 2. Data dividing and training. The data splitting module divides the data into training and test sets in a 9:1 ratio. Using too many features injects too much noise into the classifier [24]; therefore, each original record is split into four parts according to the features coming from the different PMUs. A subset of the original features is selected and sent to the AdaBoost layout for training together with the novel features. This step reduces the effect of errors resulting from bad PMU measurements; as the feature dimension increases, the classifier's performance decreases, so combining several of the original features with the novel ones lowers the dimension. The original features are ranked by feature importance, and different proportions of the features are then selected, as explained in more detail in Part 3. In addition, several classifier models are developed to personalize the features after splitting. The classifiers are configured so that every section of the data has the greatest possible impact on its classifier, i.e., the training model. Using five classifiers and then obtaining five tags after passing the information to the layout reduces the effect of a single classifier's generalization error.
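A minimal sketch of this splitting step is shown below, assuming the 9:1 ratio stated above and assuming that each PMU's columns can be identified by a name prefix; the prefix convention is an assumption for illustration, not taken from the data set documentation.

```python
# Sketch of Stage 2: a 9:1 train/test split, then one feature block per PMU.
# Each block keeps that PMU's original columns plus the constructed features.
from sklearn.model_selection import train_test_split

def split_by_pmu(X, y, constructed_cols, test_size=0.1, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    blocks = {}
    for pmu in range(1, 5):
        pmu_cols = [c for c in X.columns if c.startswith(f"PMU{pmu}_")]  # assumed prefix
        cols = pmu_cols + list(constructed_cols)
        blocks[pmu] = (X_tr[cols], X_te[cols])
    return blocks, y_tr, y_te
```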

Stage 3. Weights for voting. This module assigns different weights to the tags produced by the different classifiers and votes on the final classification tag of the data. The weights are determined according to the accuracy of each classifier on the training set. After the test set passes through the trained classifiers, different tags are generated, and the weights assigned to the tags of the corresponding classifiers determine the final vote. By updating the weights in real time, the entire system becomes more robust and generalizable.

#### *2.4. In-Depth Explanation of the Attack-Diagnosing Layout*

## 2.4.1. Properties Making

During property making, 16 novel features are extracted from the measurement features of every PMU and incorporated into the original data set in preparation for the next step. The novel features are extracted mainly from the raw data via the corresponding computations. Table 2 shows the name, explanation, and extraction process of each extracted feature.


**Table 2.** Explanation of extracted characteristics.

## 2.4.2. Data Processing

It is important to process the data before sending it to the machine learning model, and normalization is an important part of this processing. Its benefit is that it speeds up and improves the accuracy of the gradient-descent iterations that search for the best solution. The most common normalization techniques are z-score standardization and min-max standardization. Min-max standardization linearly transforms the original data into the range [0, 1], as shown below:

$$X\_{scale} = \frac{x - x\_{min}}{x\_{max} - x\_{min}} \tag{1}$$

In addition, z-score standardization, also known as standard-deviation standardization, is mostly applied to characterize deviations from the average. Data processed by this technique follow the standard normal distribution, i.e., the mean equals zero and the standard deviation equals one. The transformation function is given below, where the mean of the data is denoted by *μ* and the standard deviation by *σ*. This study adopts this normalization process.

$$X\_{scale} = \frac{x - \mu}{\sigma} \tag{2}$$
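For reference, the sketch below applies both normalizations to a toy array using their scikit-learn equivalents; the z-score form of Equation (2) is the one adopted in this work.

```python
# Equations (1) and (2) via scikit-learn scalers on toy PMU-like measurements.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[230.1, 5.2], [228.7, 5.0], [120.4, 12.4]])

x_minmax = MinMaxScaler().fit_transform(X)    # Equation (1): scaled into [0, 1]
x_zscore = StandardScaler().fit_transform(X)  # Equation (2): zero mean, unit standard deviation
```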

A data set may contain not-a-number (NaN) and infinity (INF) values, which are usually replaced by the mean value or by zero. For the data set applied here, a novel replacement process is proposed to prevent the final replacement value from underflowing and the data from becoming overly discrete. The *log\_mean* value is used to replace the NaN and INF values present in the data. It is calculated as follows:

$$log\_mean = \frac{\sum \log |k\_i|}{Num(k\_i)} \cdot \left(1 - 2\,\mathbb{1}\left(\frac{\sum k\_i}{Num(k\_i)} < 0\right)\right) \tag{3}$$

Here, the number of values in a column is denoted by *Num*(*k<sub>i</sub>*), and the indicator function is represented by $\mathbb{1}(x)$, which is defined as follows:

$$\mathbb{1}(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
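A sketch of this replacement rule is given below. Whether *Num*(*k<sub>i</sub>*) counts all entries or only the finite ones is not stated in the text, so the sketch assumes the finite entries and skips zeros to keep the logarithm defined.

```python
# Sketch of the log_mean replacement in Equations (3)-(4): replace NaN/INF entries of a
# column with mean(log|k_i|), sign-flipped when the column mean is negative.
import numpy as np

def log_mean_replace(column):
    col = np.asarray(column, dtype=float).copy()
    finite = col[np.isfinite(col)]
    nonzero = np.abs(finite[finite != 0])        # log(0) is undefined; assumption of this sketch
    magnitude = np.mean(np.log(nonzero))
    sign = 1 - 2 * int(np.mean(finite) < 0)      # indicator term of Equation (4)
    col[~np.isfinite(col)] = magnitude * sign
    return col

print(log_mean_replace([1.0, 10.0, np.nan, np.inf, 100.0]))
```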

Comparative experiments on the various treatment approaches are conducted in this study; the outcomes in Section 3 show that the suggested process is effective.

#### 2.4.3. Establish Classifier Layouts

When building the classifier scheme, the features and characteristics of the SG information are considered, and different DML classification schemes are established for the data obtained from every PMU. Various experiments have shown that random forest is best for the data gathered by every PMU, while AdaBoost is the ideal layout for the combined features, i.e., a selection of the original characteristics together with the properties derived during property making. AdaBoost combines several basic classifiers into one robust classifier. The experiment proposes a new model in which random forest is applied as the basic classifier of AdaBoost, followed by anticipation weighted voting (AWV) on the prediction outcomes.

**Stage (1)** Initialize the training data's observation weights $\omega = (\omega\_1, \omega\_2, \ldots, \omega\_n)$ with $\omega\_i = 1/n$.

**Stage (2)** For $t = 1:T$:


Here, $X\_i$ denotes the $i$th input feature vector, $y\_i$ denotes its actual tag, and $RFC^{(t)}(X\_i)$ denotes the prediction of the $t$th random forest classifier.


(V) Renormalize so that $\sum\_{i=1}^{n} \omega\_i = 1$.

**Stage (3)** Output $C(x) = \arg\max\_{y} \sum\_{t=1}^{T} \alpha^{(t)} \mathbb{1}\left(RFC^{(t)}(X) = y\right)$.

Here, the $\arg\max\_x(f(x))$ function returns the value of $x$ that maximizes $f(x)$. For this 37-class classification problem, $y \in \{1, 2, \ldots, 37\}$, and $\sum\_{t=1}^{T} \alpha^{(t)} \mathbb{1}\left(RFC^{(t)}(X) = y\right)$ is a 37-dimensional vector. When different probabilities are associated with different tags for one feature vector $X\_i$, the final output is determined by the tag with the highest probability.
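A hedged sketch of this boosted-forest classifier is shown below using scikit-learn. The boosting rounds follow scikit-learn's standard AdaBoost update rather than reproducing the paper's sub-steps, and all hyperparameter values are illustrative assumptions.

```python
# AdaBoost with a random forest as its weak learner (one such model per PMU data block).
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

def build_awv_classifier(T: int = 20) -> AdaBoostClassifier:
    base = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)
    # scikit-learn >= 1.2 uses `estimator=`; older releases expect `base_estimator=`.
    return AdaBoostClassifier(estimator=base, n_estimators=T, random_state=0)

clf = build_awv_classifier()
# clf.fit(X_tr, y_tr)          # X_tr, y_tr: one PMU block from the Stage 2 split
# tags = clf.predict(X_te)     # per-block tags are later combined by weighted voting
```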

#### 2.4.4. Voting with Weights

Hard combination and soft combination are two ways of aggregating the final multiple tags [25]. In a hard combination, the same section of the data set is trained with different DML methods and the resulting tags are given equal weight in the vote; the result is the tag receiving the most votes. Similarly, a soft combination also applies different DML methods to the same section of the data set, but the tags are assigned different weights, and the result is the tag with the highest weighted score. In short, the main difference between the hard and soft combinations is whether or not the weights are equal. In a classifier, the weight represents the probability value of a tag or its confidence level. The present study sets up different machine learning models for different data blocks to address the multi-tag problem and make the model perform effectively for this data set. Lastly, different weights are assigned to the tags to determine the final results, as sketched below. Algorithm 1 describes these steps.
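The sketch below is one possible reading of this soft combination, with each per-block classifier weighted by its accuracy on the training split; the weighting rule and the 1 to 37 label range are assumptions consistent with the description above, not Algorithm 1 verbatim. The names `models`, `blocks`, and `y_train` are assumed to hold the per-PMU classifiers and data blocks from the earlier sketches.

```python
# Soft-combination voting: accumulate per-classifier weights on the predicted tags and
# return the tag with the highest weighted score for every test sample.
import numpy as np

def weighted_vote(models, blocks, y_train, n_classes=37):
    weights = {p: m.score(blocks[p][0], y_train) for p, m in models.items()}  # training accuracy
    n_test = len(blocks[next(iter(blocks))][1])
    scores = np.zeros((n_test, n_classes))
    for p, m in models.items():
        tags = np.asarray(m.predict(blocks[p][1]))         # this classifier's tag per test sample
        scores[np.arange(n_test), tags - 1] += weights[p]  # tags assumed to lie in 1..37
    return scores.argmax(axis=1) + 1
```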


#### **3. Experiment and Evaluation**

In machine learning, classification and regression are the primary learning tasks; this study clearly addresses a classification problem. The following experiments test whether the model structure described in this study can distinguish faults and disturbances in electrical systems. The model is compared with several conventional models, including the convolutional neural network (CNN), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), decision tree (DT), support vector machine (SVM), and k-nearest neighbor (KNN).

Additionally, the accuracy achieved when the data is passed to the models after property making is compared with that achieved without it.

#### *3.1. Data Set*

A multiclass classification data set for ICS cyber-attacks is used in the present study. The multiclass data set contains a total of 15 groups, each with about 5000 records; the situation of each group is shown in Table 3. The distribution of data across all tag kinds is fairly uniform. ARFF (Attribute-Relation File Format) is the main file template of the data set; an ARFF file is an ASCII text file that represents a set of attributes shared by several samples. To ease processing, the ARFF files are converted to the CSV (Comma-Separated Values) template, in which textual/numeric tabular information is stored as plain text. AUC, F1 score, the ROC curve, precision, accuracy, and recall are primarily used to evaluate classification models in machine learning, and several terms require explanation. A true positive (TP) is a positive sample that the layout predicts to be positive, a false positive (FP) is a negative sample that the layout predicts to be positive, a false negative (FN) is a positive sample that the model predicts to be negative, and a true negative (TN) is a negative sample that the model predicts to be negative. The suggested layout is evaluated using accuracy, precision, recall, and F1 score; the F1 score is the harmonic mean of precision and recall. These metrics are calculated according to the following equations:

$$accuracy = (TP + TN) / (TP + FP + FN + TN) \tag{5}$$

$$precision = TP/(TP + FP) \tag{6}$$

$$recall = TP / (TP + FN) \tag{7}$$

$$F1\text{ score} = \frac{2TP}{2TP + FN + FP} = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{8}$$
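These metrics can be computed directly with scikit-learn, as in the sketch below; the macro averaging over the 37 classes is an assumption, since the averaging mode is not stated in the text.

```python
# Equations (5)-(8) for the multi-class case via scikit-learn's metric functions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

print(evaluate([1, 2, 2, 3], [1, 2, 3, 3]))   # toy labels only
```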

**Table 3.** Multiclass instance data statistics.


## *3.2. Experiment Outcome*

#### 3.2.1. Machine Learning Model

In this experiment, KNN, SVM, GBDT, XGBoost, CNN, and other algorithms were applied as conventional baseline models.

(A) The k-nearest neighbor algorithm classifies samples based on the distance between their feature values; the distance is calculated primarily using the Euclidean or Manhattan distance formulation.

(B) The SVM [26] layout treats each sample as a point in space and applies various mapping functions to map the input into a high-dimensional feature space, in which a hyperplane or group of hyperplanes is constructed. Intuitively, the further the boundary is from the training data points, the more accurate the classification will be. The dividing hyperplane is described by $\omega^T x + b = 0$, where the normal vector $\omega$ determines the hyperplane's direction and the displacement term $b$ determines the distance between the hyperplane and the origin. The distance from a point $x$ to the hyperplane is $\gamma = |\omega^T x + b| / ||\omega||$, and $\gamma$ must be maximized under the condition that the hyperplane correctly divides the training instances, i.e.:

$$\begin{array}{c} \max\_{\omega, b} \frac{2}{||\omega||}\\ \text{subject to } y\_i(\omega^T x\_i + b) \ge 1 \end{array} \tag{9}$$

Solving this constrained problem via the Lagrange function is more efficient; the objective function below can then be derived, where $\alpha\_i$ denotes the Lagrange multiplier and $\alpha\_i \ge 0$.

$$L(\omega, b, \alpha) = \frac{1}{2}||\omega||^2 + \sum\_{i=1}^{m} \alpha\_i \left(1 - y\_i \left(\omega^T x\_i + b\right)\right) \tag{10}$$

Determine the partial derivatives of $L(\omega, b, \alpha)$ with respect to $\omega$ and $b$ and set them to zero:

$$\frac{\partial L(\omega, b, \alpha)}{\partial \omega} = 0, \quad \frac{\partial L(\omega, b, \alpha)}{\partial b} = 0 \tag{11}$$

The dual problem can then be written as follows:

$$\max\_{\alpha} \sum\_{i=1}^{m} \alpha\_i - \frac{1}{2} \sum\_{i=1}^{m} \sum\_{j=1}^{m} \alpha\_i \alpha\_j y\_i y\_j x\_i^T x\_j \quad \text{subject to } \sum\_{i=1}^{m} \alpha\_i y\_i = 0, \ \alpha\_i \ge 0 \tag{12}$$
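For completeness, a toy soft-margin SVM baseline fitted with scikit-learn is sketched below; the synthetic data, kernel, and C value are illustrative assumptions rather than the settings used in the comparison experiments.

```python
# Toy SVM baseline on synthetic multi-class data (stand-in for the grid data set).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print(svm.score(X, y))   # training accuracy of the toy model
```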

(C) The decision tree algorithm starts with a group of instances/cases and then builds a tree-shaped information structure that is applied to novel cases; every case is described by a group of numeric or symbolic values [27]. C4.5 and C5.0 use entropy in the tree-growing algorithm.

(D) The XGBoost [28] classifier is an improved boosting algorithm based on residual lifting. Based on the error function, the objective function is computed from the first and second derivatives at every data point, and the loss function is a squared loss. Its objective function is given below, where $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}\_i$ and the target $y\_i$. The second term $\Omega$ penalizes the model complexity, and $T$ is the number of leaves in the tree. The parameters $\gamma$ and $\lambda$ control the tree's complexity: the greater their values, the simpler the structure of the tree.

$$L(\phi) = \sum\_{i} l(\hat{y}\_i, y\_i) + \sum\_{k} \Omega(f\_k) \text{ where } \Omega(f) = \gamma T + \frac{1}{2}\lambda ||\omega||^2 \tag{13}$$
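A hedged XGBoost baseline sketch is shown below; `gamma` and `reg_lambda` correspond to the γ and λ complexity terms of Equation (13), and all values are illustrative assumptions rather than the settings used in the experiments.

```python
# XGBoost classifier baseline; the regularization arguments map onto Equation (13).
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                    gamma=0.1,        # gamma: penalty per additional leaf
                    reg_lambda=1.0)   # lambda: L2 penalty on leaf weights
# xgb.fit(X_tr, y_tr)                 # multi-class labels are expected to be encoded as 0..36
```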

(E) The random forest (RF) exhibits excellent efficiency and has been widely applied [29]. RF uses the decision tree as its base classifier and is an extension of Bagging. RF relies on two significant procedures: introducing random features during decision tree construction, and out-of-bag estimation. The RF method can be described as follows. First, samples are randomly selected from the data with replacement, and the chosen samples are used as the root sample to train a decision tree. Second, to split the nodes of the decision tree, *m* attributes are chosen at random from the total of *M* attributes (ensuring *m* << *M*), and one attribute is chosen as the dividing feature of the node using a strategy such as information gain. This continues until the decision tree can no longer be divided.

(F) The CNN is among the more popular deep learning networks. A CNN model usually contains input, output, latent (hidden), and max-pooling layers, and it has obtained great results in numerous areas of computer vision. Here, one-dimensional feature vectors are used as input, and a one-dimensional convolution kernel is adopted in the convolution layers. The convolution layer extracts properties from the input, and the kernel size here is three. The process of the CNN model is shown in Figure 3.

**Figure 3.** The procedure of the CNN layout.
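A minimal 1-D CNN consistent with this description (kernel size three, one-dimensional input) is sketched below with Keras; the number of filters and layers are assumptions, and the input length of 128 corresponds to the combined feature count mentioned earlier.

```python
# 1-D CNN baseline: 128 features treated as a length-128 sequence, 37 output classes.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(128, 1)),                       # each sample as a 1-D sequence
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(37, activation="softmax"),             # 37 fault / cyber-attack classes
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
cnn.summary()
```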

The main purpose of this research is to demonstrate the strong and successful role of deep learning models in reinforcing the smart grid against various cyber-attacks. In this regard, the proposed model detects and stops cyber-hacking at the installation location rather than focusing on the cyber-attack type. The localization is therefore achieved through the diverse detection models located in the smart grid, whereas detecting the cyber-attack type requires more data, which can be collected later from the recorded abnormal data.
