*Article* **Detection of Cyberattacks and Anomalies in Cyber-Physical Systems: Approaches, Data Sources, Evaluation**

**Olga Tushkanova 1,†, Diana Levshun 1,†, Alexander Branitskiy 1,†, Elena Fedorchenko 1,2,\* ,†, Evgenia Novikova 1,† and Igor Kotenko 1,2,†**


**Abstract:** Cyberattacks on cyber-physical systems (CPS) can lead to severe consequences, and therefore it is extremely important to detect them at early stages. However, several challenges remain in this area, including the ability of a security system to detect previously unknown attacks. This problem can be addressed with system behaviour analysis methods and unsupervised or semi-supervised machine learning techniques. The efficiency of an attack detection system strongly depends on the datasets used to train the machine learning models. As real-world data from CPS are mostly unavailable due to the security requirements of cyber-physical objects, there have been several attempts to create such datasets; however, their completeness and validity are questionable. This paper reviews existing approaches to attack and anomaly detection in CPS, with a particular focus on the datasets and evaluation metrics used to assess the efficiency of the proposed solutions. The analysis revealed that only two of the three selected datasets are suitable for intrusion detection tasks, because they are generated using real test beds; in addition, only one of the selected datasets contains both network and sensor data, making it preferable for intrusion detection. Moreover, there are different approaches to evaluating the efficiency of machine learning techniques, which require more analysis and research. Thus, in future research, the authors aim to develop an approach to anomaly detection for CPS using the selected datasets and to conduct experiments to select the performance metrics.

**Keywords:** anomaly detection; attack detection; cyber-physical system; machine learning; datasets; evaluation metrics

## **1. Introduction**

Cybersecurity risks are highly relevant nowadays. It is almost impossible to completely exclude security risks for modern information systems, including cyber-physical systems (CPS) and Internet of Things (IoT). Thus, it is essential to continuously detect cyberattacks and anomalies to monitor security risks and provide security awareness.

Cyberattacks against cyber-physical systems can severely affect the physical, environmental, and economic safety of the population [1]. For example, the attack on the Colonial Pipeline disrupted fuel supply on the US East Coast in 2021 [2], and the attack on the Venezuelan hydroelectric power plant led to a nationwide blackout in 2019 [3]. In 2022, Germany's internal fuel distribution system was disrupted by a cyberattack [4]. Thus, it is extremely important to detect such attacks at early stages.

There are several challenges in this area, and one of the most critical is the detection of previously unknown attacks. Another challenge relates to the availability of the datasets used to train analytical models, as the performance of attack detection strongly depends on the quality of the training datasets. The first challenge relates to the fact that machine learning models are usually trained on datasets with known attack patterns, and as a result, they are unable to detect previously unseen attacks. One possible solution is to use anomaly detection techniques based on the analysis of the cyber-physical entities' behaviour [5–7]. However, such approaches require high-quality datasets to model normal behaviour or to apply unsupervised or semi-supervised machine learning techniques. The lack of datasets close to the real world is explained by the fact that organizations do not want to share data, as they can include confidential information. There are attempts to generate such datasets using cyber-physical or software test beds, but the completeness and validity of such generated datasets are questionable. The last challenge relates to the validation of the attack and anomaly detection models. The analysis of the research papers has shown that different researchers use different approaches to calculate performance metrics, which complicates the comparison of the models.

**Citation:** Tushkanova, O.; Levshun, D.; Branitskiy, A.; Fedorchenko, E.; Novikova, E.; Kotenko, I. Detection of Cyberattacks and Anomalies in Cyber-Physical Systems: Approaches, Data Sources, Evaluation. *Algorithms* **2023**, *16*, 85. https://doi.org/10.3390/a16020085

Academic Editors: Francesco Bergadano and Giorgio Giacinto

Received: 21 December 2022 Revised: 28 January 2023 Accepted: 30 January 2023 Published: 3 February 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this paper, the authors review existing approaches to attack and anomaly detection, outline the most commonly used datasets, and evaluate the applicability of the selected datasets in the anomaly detection task. We also revealed that researchers use different approaches to calculate performance metrics to evaluate machine learning models. These metrics consider the fact that the anomalies in CPS have a certain duration, and malicious activity may result in a delayed response of the system process; however, the variety of used metrics makes the comparison of the obtained experimental results complicated.

Thus, the *contribution* of the research is as follows:


The paper is organized as follows. Section 2 reviews the approaches to anomaly and attack detection for cyber-physical systems. Section 3 analyzes the datasets used for attack and anomaly detection that contain data from cyber-physical systems. Section 4 examines the metrics for the evaluation of attack and anomaly detection models. The paper ends with a conclusion.

## **2. Approaches to the Anomaly and Attack Detection for the Cyber-Physical Systems**

Anomaly detection is the process of identifying anomalous events that do not match the expected behaviour of the system. This allows the detection of new and hidden attacks. Currently, anomaly detection approaches are often implemented using machine learning, such as shallow (or traditional) learning and deep learning [7,11,12]. In this case, the profile of normal behaviour can be built using many data sources.

Anomaly and attack detection in CPS based on shallow learning methods uses algorithms such as support vector machine (SVM) [13], Bayesian classification [14], k-nearest neighbour (kNN) [15], Random Forest (RF) [16,17], Isolation Forest [18], XGBoost [19], and artificial neural networks (ANN) [20,21]. They are based on training intelligent models to profile the normal behaviour of a cyber-physical system, and then inconsistent observations are identified as anomalies. For example, Elnour et al. [18] propose an attack detection framework based on dual isolation forest (DIF). Two isolation forest models are trained independently, one on normalized raw data and the other on a version of the data preprocessed using principal component analysis (PCA). The principle of the approach is to detect and separate anomalies using the concept of isolation after analyzing the data in the original and PCA-transformed representations. Mokhtari et al. [16] and Park and Lee [17] explore supervised learning algorithms for anomaly detection, such as k-nearest neighbours, decision tree classifier, and random forest. In both studies, the random forest shows the best detection result.

The analysis of related works has shown that the research focus has now shifted towards the use of deep neural networks to detect anomalies in technological processes. A number of authors compare classical and deep learning approaches to anomaly detection.

For instance, Inoue et al. [22] compare a one-class SVM with a radial basis function kernel against a deep dense neural network with a long short-term memory (LSTM) layer, and the experiments have shown that the deep learning model is characterized by a lower rate of false positive alarms. Gaifulina and Kotenko [23] experimentally compare several models of deep neural networks for anomaly detection. Shalyga et al. [24] propose several methods to improve the quality of anomaly detection, including exponentially weighted smoothing to reduce the false positive rate, an individual error weight for each feature, non-overlapping prediction windows, etc. The authors also propose their own anomaly detection model based on a multilayer perceptron (MLP).
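The effect of exponentially weighted smoothing on false alarms can be illustrated with a minimal sketch. It is not the implementation from [24]: the function name `ewma`, the smoothing factor `alpha`, the threshold, and the toy scores are all assumptions chosen for illustration.

```python
def ewma(scores, alpha=0.3):
    """Exponentially weighted moving average of a score sequence.

    alpha is a hypothetical smoothing factor in (0, 1]; smaller values
    smooth more strongly.
    """
    smoothed = []
    prev = None
    for s in scores:
        # The first value initialises the average; later values are blended in.
        prev = s if prev is None else alpha * s + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# Toy per-step anomaly scores: one isolated spike, then a sustained deviation.
raw_scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.9, 0.9, 0.1]
threshold = 0.5  # illustrative threshold, not taken from any cited paper

raw_alarms = [s > threshold for s in raw_scores]
smooth_alarms = [s > threshold for s in ewma(raw_scores)]
```

In this toy run the isolated spike at index 2 no longer crosses the threshold after smoothing, while the sustained deviation is still flagged, although one step later than in the raw scores. This delay is the price of the lower false positive rate.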

Traditional machine learning methods tend to be inefficient when processing large-scale data and unevenly distributed samples. Deep learning models are more productive when analyzing such data. Researchers often use autoencoders (AE) [5,6], recurrent neural networks [25], convolutional neural networks (CNN) [26–28], and generative adversarial networks (GAN) [29,30] as deep neural networks for anomaly detection in CPS. Often, the proposed approaches combine several types of neural networks. For example, Xie et al. [25] and Wu et al. [31] use CNN for data dimensionality reduction and gated recurrent units (GRU) for data prediction. GRU, like LSTM, is a type of recurrent network. Bian X. [32] also uses GRU for anomaly detection. The main idea of this anomaly detection method is to predict the value at the next time step and to report an anomaly when the actual value deviates from the predicted one.
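The prediction-based idea described above can be sketched in a few lines. The sketch deliberately replaces the GRU with a naive moving-average forecaster, so it only illustrates the decision rule (flag a point when the prediction error exceeds a threshold). The function names, the window length, and the threshold are hypothetical.

```python
def moving_average_forecast(history, window=3):
    """Stand-in predictor: mean of the last `window` observed values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def detect_by_prediction_error(series, window=3, threshold=2.0):
    """Return indices whose value deviates from the forecast by more than threshold."""
    anomalies = []
    for t in range(window, len(series)):
        predicted = moving_average_forecast(series[:t], window)
        if abs(series[t] - predicted) > threshold:
            anomalies.append(t)
    return anomalies


# Toy sensor readings with a sudden jump at index 6.
readings = [20.0, 20.1, 19.9, 20.0, 20.2, 20.1, 25.0, 20.0, 20.1]
flagged = detect_by_prediction_error(readings, window=3, threshold=2.0)
```

With a lower threshold, the steps immediately after the jump would also be flagged, because the anomalous reading enters the history used for the next forecasts. A learned predictor faces the same issue unless anomalous inputs are handled explicitly.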

The autoencoder is trained on normal data, and then the incoming events are reconstructed based on the normal model. Exceeding the reconstruction error threshold indicates an anomaly. Such an approach is used in the APAD (Autoencoder-based Payload Anomaly Detection) model by Kim et al. [5]. Wang et al. [6] propose an approach to anomaly detection using a composite model. The proposed model consists of three components: an encoder and a decoder, which are used to compute the reconstruction error, and an LSTM classifier, which takes the encoder output as input and makes predictions. To detect an anomaly, both model outputs, i.e., the reconstruction error and the prediction value, are considered together to calculate the anomaly score. The authors also compare the change ratio of each attribute between the current period and the previous one, and those attributes that have changed more are considered anomalous.

Generative adversarial networks can be used to investigate the distribution of normal data for recognizing anomalies from unknown data. The generator creates new data instances, and the discriminator evaluates them for authenticity. In the MAD-GAN (Multivariate Anomaly Detection with GAN) approach by Li et al. [29], both generator and discriminator components are represented by LSTM. The discriminator is trained to distinguish anomalies from normal data, and the anomaly score is computed as a combination of the discrimination output and reconstruction error produced by the generator component. A similar approach is proposed by Neshenko et al. [30]. The building blocks for the proposed GAN are the recurrent neural network and convolutional neural network. The authors also extended the anomaly detection approach by incorporating a module that attributes potentially attacked sensors. This task is solved by the application of various techniques starting with feature importance evaluation and finishing with KernelShap [33] and LIME [34] techniques that are model agnostic methods.

We should also mention approaches to anomaly detection using graph probabilistic models, such as Bayesian networks (BN) and Markov models. For example, Lin et al. [35] propose TABOR (Time Automata and Bayesian netwORk). Time Automata simulate the operation of the sensors and actuators, and the Bayesian network models the dependencies among random variables from the sensors and the actuators. This approach allows for the detection of timing anomalies, anomalies in sensor and actuator value ranges, and violations of their dependencies. Another popular way to represent normal behaviour is the hidden Markov model (HMM). Sukhostat L. [36] uses hierarchical HMM to detect anomalies in sensor values.

Applying the proposed techniques requires high-quality datasets that allow proper modelling of CPS functioning. Some techniques require only normal data, whereas others require samples of both normal and abnormal behaviour.

The first group of datasets contains the readings of the sensors of the cyber-physical system in the form of logs. The analysis of the research papers showed that currently, the most commonly used CPS dataset is the SWaT dataset [9]. It is used in [5,6,18,24,25,29,30,35,36]. This dataset contains records from sensors, actuators, control programmable logic controllers (PLCs), and network traffic. Another new dataset for anomaly detection is HAI [10], which is used in [16,17,32]. The dataset contains the parameters of sensors for an industrial power generation system using steam turbines and pumped storage power plants. To detect anomalies in IoT devices, the authors of [19,20,31] use the TON\_IoT dataset [8]. The TON\_IoT dataset includes telemetry from heterogeneous IoT and Industrial Internet of Things (IIoT) sensors.

Another group of datasets often used to detect anomalies and attacks in CPS consists of network traffic datasets. They include NSL-KDD [37], CICIDS2017 [38], and UNSW-NB15 [39], and are used in [23,27,31,40]. However, these datasets consist mainly of network data provided in the form of PCAP (Packet Capture) files or labelled network flows. Section 3 discusses datasets in detail.

We should note that differences in the experimental conditions affect the possibility of comparing the results of anomaly detection. For example, Elnour et al. [18] exclude the stabilization time from the SWaT dataset. The way metrics are calculated can also vary, and research papers do not always provide a way to calculate these metrics. In general, the above machine learning methods show high anomaly detection results and can be used in further developments. A promising area of research and development is the creation of hybrid machine learning models for anomaly detection. In particular, combined networks with RNNs are used to capture temporal relationships [6,29], and combined networks with CNNs are applicable for context analysis (e.g., packet order and content) [25,30].

## **3. Datasets for the Attack and Anomaly Detection**

An essential challenge of anomaly detection research is generating or finding a suitable dataset for the experiments. The authors analyzed existing datasets to select the dataset for further research.

The authors specified the following requirements of the dataset based on the research goal of anomaly detection in cyber-physical systems:

**R1:** the dataset should be gathered from the cyber-physical system;

**R2:** the dataset should contain event logs;

**R3:** the dataset should contain anomalies;

**R4:** the dataset should be labelled (what is normal and what is abnormal);

**R5:** the dataset should be close to real data (i.e., data from the real or semi-real system).

Currently, there are a lot of datasets available for various purposes and systems; they represent the functioning of the computer networks and cyber-physical systems, including the Internet of Things, Industrial Internet of Things, and Industrial Control Systems (ICS), such as SCADA (Supervisory Control And Data Acquisition) system [41].

Alsaedi et al. [8] present the comparative analysis of the available datasets for security purposes. Thus, there are datasets containing computer network traffic that was generated for attack detection purposes: KDDCUP99, NSL-KDD [37], UNSW-NB15 [39], and CICIDS2017 [38]. Such datasets do not contain sensors' data that is specific to CPS. Moreover, they do not include CPS network traffic, both normal and abnormal.

There are also datasets generated for cyber-physical systems security research purposes. Choi et al. [42] provide a comparison of the existing datasets generated for ICSs security research based on attack paths. Lemay and Fernandez [43] generate the SCADA network datasets (Modbus dataset) for intrusion detection research. The SCADA network datasets by Rodofile et al. [44] contain attacks on the S7 protocol. These datasets are SCADA specific and contain a limited set of protocol-specific attacks.

There are also multiple datasets for IoT and IIoT. Suthaharan et al. [45] propose the labelled wireless sensor network dataset (LWSNDR). It contains homogeneous data collected from a humidity-temperature sensor. The sensor is deployed in single-hop and multi-hop wireless sensor networks (WSNs). The dataset does not contain attack scenarios, but does contain anomalies introduced by the author using a hot water kettle. Sivanathan et al. [46] propose the datasets gathered from a smart home testbed. It contains network traffic characteristics of IoT devices. The dataset is generated for the IoT devices classification. The dataset does not contain attack scenarios.

There are also multiple network-based IoT datasets [37–39,46–48]. These datasets do not consider sensor data; thus, they do not allow for the detection of the attacks that manipulate sensors' data.

The datasets that are suitable considering the set requirements, i.e., that contain labelled sensors and network data, are as follows: TON\_IoT [8], SWaT [9], and HAI [10]. The authors conducted a more detailed analysis of these datasets.

## *3.1. TON\_IoT Dataset Analysis*

The TON\_IoT dataset is created by the Intelligent Security Group of the UNSW Canberra, Australia, and positioned by its authors as realistic telemetry datasets of IoT and IIoT sensors. It contains data from seven IoT devices, namely, a smart fridge, GPS tracker, smart sense motion light, remotely activated garage door, Modbus device, smart thermostat, and weather monitoring system. All the data were generated using a testbed of Industry 4.0/Industrial IoT networks developed by the authors. The data include several normal and cyber-attack events, namely, scanning, DoS, DDoS, ransomware, backdoor, data injection, cross-site scripting, password cracking attacks, and man-in-the-middle. The TON\_IoT dataset incorporates the ground truth indicating normal and attack classes for binary classification, and the feature indicating the classes of attacks for multi-classification problems. Statistics on class balance for device samples from the TON\_IoT dataset are presented in Table 1.


**Table 1.** The statistics on the TON\_IoT dataset class balance by devices.

Alsaedi et al. [8] also tried several popular machine learning methods to show that the TON\_IoT dataset may be used to train classifiers for intrusion detection purposes. To verify the results and ensure that attacks are indeed identifiable, we tried to reproduce the authors' binary classification experiment. It should be mentioned that the authors reported very high accuracy for the majority of the investigated methods (more than 0.8 for the F-measure in most cases). Following the authors, we first applied the same preprocessing procedures: we transformed categorical features with two unique values into binary ones, applied the min-max scaling technique to numeric features, and randomly split the data into train and test subsamples in an 80% to 20% stratified proportion.
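The three preprocessing steps listed above can be sketched with the standard library alone. This is an illustrative sketch, not the code used in the experiments: the helper names, the toy columns, and the random seed are assumptions.

```python
import random


def encode_binary(values):
    """Map a categorical column with exactly two unique values to 0/1."""
    categories = sorted(set(values))
    if len(categories) != 2:
        raise ValueError("expected exactly two unique values")
    return [categories.index(v) for v in values]


def min_max_scale(values):
    """Rescale numeric values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy columns; the names and values are illustrative only.
door_state = ["open", "closed", "closed", "open", "closed"]
temperature = [2.0, 4.0, 6.0, 8.0, 10.0]
labels = [0] * 80 + [1] * 20

door_binary = encode_binary(door_state)
temperature_scaled = min_max_scale(temperature)
train_idx, test_idx = stratified_split(labels)
```

Stratification keeps the share of attack records equal in both subsamples, which matters when the classes are imbalanced. In practice, scikit-learn's `MinMaxScaler` and `train_test_split` with the `stratify` argument cover the same steps.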

It should be noted that during data preprocessing, we found several artefacts in the data. For example, the '*temp\_condition*' feature for the fridge contains the values 'high', 'low', 'low', 'high', 'low', 'high', and the '*sphone\_signal*' feature for the fridge contains the values 'true', 'false', '0', '1'. As there are no special notes about this in the paper or the dataset description, we assumed that these were inaccuracies in the data and fixed them.

Figure 1 shows the correlation between features for different devices, both with each other and with the anomaly behaviour label. We can note a high correlation between the features of the dataset for a fridge, garage door, GPS tracker, and motion light. At the same time, the correlation value between these features and the label is low. The correlation of features for Modbus, thermostat, and weather is close to zero.

**Figure 1.** IoT device feature and label correlation.
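The pattern reported for Figure 1, strong correlation among features but weak correlation between features and the label, can be reproduced on toy data. The text does not state which correlation coefficient was used; the sketch below assumes the Pearson coefficient, and the function name and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant columns
    return cov / math.sqrt(var_x * var_y)


# Toy telemetry: two strongly related features and an unrelated label.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_b = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
label = [0, 1, 0, 0, 1, 0]

corr_features = pearson(feature_a, feature_b)
corr_label = pearson(feature_a, label)
```

Here `corr_features` is close to 1 while `corr_label` is essentially zero, so a classifier that sees only these features has little signal for separating the two classes.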

We applied the same machine learning models to those mentioned in the original paper, namely, Logistic Regression (LR), Linear Discriminant Analysis (LDA), k-Nearest Neighbour (kNN), Classification and Regression Trees (CART), Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM), and Long Short-Term Memory (LSTM), with the hyperparameters that authors specified, and also tried to tune those hyperparameters using 4-fold cross-validation.

We did not manage to reach the reported accuracy for most of the datasets, in either case. Table 2 shows the best values for F-measure that we received for classifiers trained on 80% of the data for each device calculated on the remaining 20% of the data.

**Table 2.** The F-measure values for the best hyperparameters of the model trained on the TON\_IoT dataset calculated for the test subsample.


The best F-measure values were obtained for the GPS Tracker dataset. We assume that this is due to the strongest correlation between features and anomaly class labels in this device dataset in comparison to the other device datasets. For other datasets, correlations are close to zero, that is, very weak. The strong correlations between features and the weak correlations between features and anomaly class labels for fridge, garage door, and motion light may explain the low F-measure values for these datasets.

Further investigation of the data showed that anomaly class labels relate only to the date and time of device events, although, according to the authors' experiment design, date and time are not taken into account. Figure 2 shows an example distribution of anomaly class labels in time for the temperature feature of the smart fridge.

**Figure 2.** Normal and attack events for temperature feature of smart fridge.

*Conclusions*. We analyzed the obtained results considering the set requirements. Requirements R1, R2, and R4 are satisfied. Requirement R3 is only partially satisfied: the dataset contains attack scenarios, but the performed experiments showed that these attacks do not affect IoT telemetry. Requirement R5 is not satisfied. The analysis demonstrated that there is no connection between the data in the network dataset and the data in the sensor dataset. Moreover, the sensors do not follow any normal behaviour scenario, and the obtained accuracy results are rather low. Thus, this dataset is not suitable for the goals of further research.

## *3.2. SWaT Dataset Analysis*

The Secure Water Treatment (SWaT) dataset [9] is generated by the Singapore University of Technology and Design (SUTD). The researchers deployed a six-stage SWaT testbed simulating a real-world industrial water treatment plant. The collected dataset contains both normal and attack traffic. It should be noted that the deployed plant was run non-stop for eleven days: during the first seven days it operated without any attacks, while during the remaining days, cyber and physical attacks were conducted against the plant. The collected dataset contains both the data from sensors and actuators of the plant (25 sensors and 26 actuators) and network traffic. Currently, there are several versions of this dataset; the researchers regularly update it by organizing cybersecurity events that use the testbed, thus generating new data with different attack types.

We conducted a series of experiments with different machine learning models for anomaly detection using the SWaT dataset 2015 to evaluate this dataset and check its compliance with the criteria proposed above. The dataset incorporates three CSV files with anomaly (or attack) and normal data: "Attack\_v0.csv", "Normal\_v0.csv", and "Normal\_v1.csv". The attacks were performed on different technological processes, and Table 3 shows the number of abnormal records for different technological processes. It should also be noted that some network attacks do not impact the readings from physical sensors.


**Table 3.** Distribution of the attacks per processes in SWaT dataset

*Experiment 1*. For this experiment series, we tried both time-based and random train-test splits on the "Attack\_v0.csv" dataset containing 449,919 rows in total, including 395,298 normal records and 54,621 anomaly records that correspond to attacks, meaning that the contamination rate is 0.138 for this subsample. For the time-based train-test split, the training sample incorporated all rows before 2 January 2016, while the testing sample contained the rows from 2 January 2016 onwards (inclusive). Due to the uneven distribution of anomalies across time, the class balance for the train and test subsamples was different: the train subsample included 344,436 normal instances and 51,483 attack instances, meaning that the contamination rate was equal to 0.149; the test subsample included 50,862 normal instances and 3138 attack instances, with a contamination rate of 0.062. The results of the experiment for the time-based train-test split and different anomaly detection machine learning models are provided in Table 4. The best results were obtained for the K Nearest Neighbors method (KNN) with *F*1-measure 0.784, AUC-ROC 0.935, and AUC-PRC 0.739 on the test subsample.

**Table 4.** The results of Experiment 1 for the time split mode for the SWaT dataset.
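The contamination rates quoted for Experiment 1 appear to correspond to the ratio of attack records to normal records rather than to the share of attack records among all records. The following sketch checks this reading against the record counts given in the text; the function name is hypothetical.

```python
def attack_to_normal_ratio(n_normal, n_attack):
    """Ratio of attack records to normal records, rounded to three decimals."""
    return round(n_attack / n_normal, 3)


# Record counts quoted in the text for "Attack_v0.csv" and its time-based split.
full = attack_to_normal_ratio(395_298, 54_621)
train = attack_to_normal_ratio(344_436, 51_483)
test = attack_to_normal_ratio(50_862, 3_138)

# For comparison: share of attack records among all 449,919 records.
share_of_total = round(54_621 / (395_298 + 54_621), 3)
```

The three ratios reproduce the reported values 0.138, 0.149, and 0.062, whereas the share of attack records among all rows of the full file is 0.121. The distinction matters when a contamination parameter is passed to an anomaly detection library, since such parameters are commonly defined as the proportion of outliers in the whole sample.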


For the random train-test split mode, we used 80% to 20% ratio so the train subsample contained 316,238 normal instances and 43,697 attack instances, while the test subsample contained 79,060 normal instances and 10,924 attack instances with a contamination rate of 0.138 for both. The results of experiment 1 for the random train-test split mode and different anomaly detection machine learning models are provided in Table 5. It can be seen that rather close results were obtained for the ECOD (F1-measure 0.743, AUC-ROC 0.878, and AUC-PRC 0.758 on the testing sample), COPOD (F1-measure 0.744, AUC-ROC 0.874, and AUC-PRC 0.768 on the testing sample), VAE (F1-measure 0.766, AUC-ROC 0.892, and AUC-PRC 0.661 on the testing sample), AutoEnc (F1-measure 0.767, AUC-ROC 0.892, and AUC-PRC 0.660 on the testing sample), and AnoGAN (F1-measure 0.750, AUC-ROC 0.864, and AUC-PRC 0.753 on the testing sample).


**Table 5.** The results of Experiment 1 for the random split mode for the SWaT dataset.

*Experiment 2.* For this experiment series, we used the data from "Attack\_v0.csv" and "Normal\_v0.csv" files to form train, test, and validation subsamples. The train and test subsamples incorporated all instances before 2 January 2016, with 672,989 normal instances and 41,186 attack instances for train and 168,247 normal instances and 10,297 attack instances for test (contamination is equal to 0.061 for both) after 80% to 20% stratified train test split. Meanwhile, the validation sample consisted of all instances after 2 January 2016 (inclusively), with 50,862 normal instances and 3138 attack instances and contamination of 0.062. The results of experiment 2 for different anomaly detection machine learning models are provided in Tables 6 and 7. It can be seen that rather close results are obtained for the ECOD (F1-measure 0.718, AUC-ROC 0.864, and AUC-PRC 0.530 on the testing sample), COPOD (F1-measure 0.729, AUC-ROC 0.867, and AUC-PRC 0.563 on the testing sample), VAE (F1-measure 0.732, AUC-ROC 0.896, and AUC-PRC 0.505 on the testing sample), AutoEnc (F1-measure 0.732, AUC-ROC 0.896, and AUC-PRC 0.505 on the testing sample), and AnoGAN (F1-measure 0.746, AUC-ROC 0.851, and AUC-PRC 0.555 on the testing sample).


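**Table 6.** The results of Experiment 2 for the SWaT dataset (for train and test data).

| Model | Precision (train) | Recall (train) | FNR (train) | F1 (train) | AUC-ROC (train) | AUC-PRC (train) | Precision (test) | Recall (test) | FNR (test) | F1 (test) | AUC-ROC (test) | AUC-PRC (test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ECOD | 0.862 | 0.623 | 0.377 | 0.724 | 0.865 | 0.540 | 0.856 | 0.619 | 0.381 | 0.718 | 0.864 | 0.530 |
| COPOD | 0.897 | 0.621 | 0.379 | 0.734 | 0.868 | 0.575 | 0.890 | 0.617 | 0.383 | 0.729 | 0.867 | 0.563 |
| KNN | 0.058 | 1.000 | 0.000 | 0.109 | 0.209 | 0.040 | 0.058 | 0.999 | 0.000 | 0.109 | 0.213 | 0.041 |
| DeepSVDD | 0.067 | 0.832 | 0.168 | 0.124 | 0.490 | 0.054 | 0.067 | 0.826 | 0.174 | 0.124 | 0.489 | 0.055 |
| VAE | 0.772 | 0.696 | 0.304 | 0.732 | 0.896 | 0.509 | 0.770 | 0.696 | 0.304 | 0.732 | 0.896 | 0.505 |
| AutoEnc | 0.772 | 0.696 | 0.304 | 0.732 | 0.896 | 0.509 | 0.770 | 0.696 | 0.304 | 0.732 | 0.896 | 0.505 |
| AnoGAN | 0.899 | 0.644 | 0.356 | 0.751 | 0.854 | 0.568 | 0.893 | 0.641 | 0.359 | 0.746 | 0.851 | 0.555 |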

*Experiment 3.* The data from "Attack\_v0.csv", "Normal\_v0.csv", and "Normal\_v1.csv" files together were used to train algorithms in novelty detection or unsupervised mode in this experiment series. The data contain 1,441,719 instances in total, including 1,387,098 normal instances and 54,621 attack instances with contamination of 0.039. To train algorithms, all instances from "Normal\_v0.csv" and "Normal\_v1.csv" files (except stabilization period of 3 hours) were used, while all instances from "Attack\_v0.csv" file were used for testing. The train sample included 972,000 normal instances and no attack instances. The test sample included 395,298 normal instances and 54,621 attack instances, that is, contamination is equal to 0.139. The results of experiment 3 for the novelty detection mode and different anomaly detection machine learning models are provided in Table 8. It can be seen that the results are rather close for different models with a rather low false positive rate on the testing sample.


**Table 7.** The results of Experiment 2 for the SWaT dataset (for validation data).

**Table 8.** The results of Experiment 3 for the SWaT dataset.


*Conclusions*. We analyzed the obtained results considering the dataset requirements listed above. All specified requirements are satisfied for this dataset. It is generated using physical devices and components, and this affects the efficiency of the network attacks: not all network attacks result in changes in the readings of the sensors. Thus, we consider this dataset to be realistic. The preliminary results of the analysis of the sensor data agree with the results obtained by other researchers [6,29,30,35,36]. Interestingly, none of the considered papers analyzes network and sensor data together, and we believe that joint analysis of such data could significantly enhance the performance of models intended to detect anomalies and network attacks.

## *3.3. HAI Dataset Analysis*

The dataset describes the parameters of an industrial control system testbed with an embedded simulator. The testbed comprises four elements: a boiler, turbine, water-treatment component, and a hardware-in-the-loop (HIL) simulator. The HIL simulation implements a simulation of the thermal power and pumped-storage hydropower generation.

When forming the dataset, several different attack scenarios were used, aimed at three types of devices: the Emerson Ovation, GE Mark-VIe, and Siemens S7-1500.

During the attack, the attacker operates with four types of variables: set points, process variables, control variables, and control parameters. The set of certain values of these variables in a given period of time determines one of two behaviours of the system: anomalous or normal. When the system is operating normally, the values of the process variables change within a predefined range. To this end, the operator adjusts the set point values, which allows for achieving stable and predictable results in the behaviour of the sensors, and the entire system as a whole.

This dataset has three versions: HAI 20.07, HAI 21.03, and HAI 22.04. Statistical information about each of them is given in Table 9.

Figure 3 shows the 10 features with the highest correlation with the class label for the test1.csv files in HAI 20.07, HAI 21.03, and HAI 22.04.
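A ranking of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses the absolute Pearson correlation coefficient, which may differ from the correlation measure behind Figure 3, and the feature names and values (`sensor_a`, `sensor_b`, `sensor_c`) are hypothetical toy data, not HAI columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no information about the label.
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_correlated(features, labels, k=10):
    """Return the k (name, |correlation|) pairs with the highest values."""
    scored = [(name, abs(pearson(values, labels)))
              for name, values in features.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: three features and a binary attack label.
labels = [0, 0, 1, 1, 0, 1]
features = {
    "sensor_a": [0.1, 0.2, 0.9, 1.0, 0.15, 0.95],
    "sensor_b": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    "sensor_c": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}
ranking = top_correlated(features, labels, k=2)
print(ranking)
```

Taking the absolute value treats strongly negative correlations as equally informative as strongly positive ones; whether that matches the ranking in Figure 3 is an assumption.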

Table 10 contains the values of F-measure (*F*1) and accuracy (ACC) in percentages for 5 classifiers: decision tree (DT), KNN, random forest, logistic regression, and neural network (NN).

*Conclusions*. We analyzed the obtained results considering the set requirements. The requirements R1, R2, R3, and R4 are satisfied. The requirement R5 is also satisfied; however, considering the existence of the simulated part of the test bed, the quality of the dataset depends on the quality of the simulated part of the test bed. The preliminary experimental results are in line with the results obtained in other research papers. Thus, this dataset is consistent and suitable for the intrusion detection task.


**Table 9.** Statistical data on the HAI dataset class balance by version.

**(a)** HAI 20.07 dataset.


**Table 10.** Result of evaluating classifiers on HAI dataset.

## **4. Performance Metrics for Anomaly and Attack Detection**

Finally, in this section, we describe performance metrics used for anomaly and attack detection. Precision, recall, and F-measure are the most used evaluation metrics. There is no specialized metric to measure the performance of anomaly detection methods. The listed metrics are classic for machine learning methods, on which most anomaly detection methods are based. However, we discovered that there are different approaches to calculating them [28,49,50]. This section reviews proposed approaches.

Let us denote the time series signal observed from *N* sensors during time *T* as

$$X = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}, \quad \mathbf{x}_t \in \mathbb{R}^N.$$

The normalized signal is divided into a number of time windows:

$$\begin{aligned} W &= \{w_1, \dots, w_{T-h+\tau}\}, \\ w_t &= \{\mathbf{x}_t, \dots, \mathbf{x}_{t+h-\tau}\}, \end{aligned}$$

where *h* is the window size and *τ* is the step length.
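The windowing step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows` and the toy signal are assumptions, and each window is taken to hold *h* consecutive observations with the window start moving by *τ*. For *τ* = 1 this yields *T* − *h* + 1 windows, which agrees with the last index *T* − *h* + *τ* above.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[Sequence[float]], h: int,
                    tau: int = 1) -> List[List[Sequence[float]]]:
    """Split a multivariate signal into windows of h consecutive observations.

    The start of each window is shifted by tau observations.
    """
    if h <= 0 or tau <= 0:
        raise ValueError("h and tau must be positive")
    last_start = len(signal) - h
    return [list(signal[t:t + h]) for t in range(0, last_start + 1, tau)]


# Toy signal: T = 6 observations from N = 2 hypothetical sensors.
X = [[float(i), 2.0 * i] for i in range(6)]
W = sliding_windows(X, h=3, tau=1)
print(len(W), W[0])
```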

The purpose of a time series anomaly detection method is to predict a binary label *ŷ<sub>t</sub>* indicating the presence of an anomaly, either for individual instances of *X* or for time windows in *W*. The labels are obtained by comparing the anomaly scores *A* with a threshold *δ*. For individual instances:

$$
\hat{y}_t = \begin{cases} 1, & \text{if } A(\mathbf{x}_t) > \delta, \\ 0, & \text{otherwise.} \end{cases}
$$

For all windows in the test dataset:

$$
\hat{y}_t = \begin{cases} 1, & \text{if } A(w_t) > \delta, \\ 0, & \text{otherwise.} \end{cases}
$$
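The thresholding rule is the same for instances and for windows, so one helper covers both cases. The sketch below assumes that the anomaly scores have already been produced by some detector; the function name `threshold_labels` and the score values are hypothetical.

```python
def threshold_labels(scores, delta):
    """Return 1 where the anomaly score strictly exceeds delta, else 0."""
    return [1 if score > delta else 0 for score in scores]


# Hypothetical anomaly scores A(x_t) (or A(w_t)) and a threshold delta.
scores = [0.1, 0.7, 0.5, 0.9]
y_hat = threshold_labels(scores, delta=0.5)
print(y_hat)
```

The comparison is strict, as in the formulas above, so a score equal to *δ* is labelled as normal.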

A set of test data may contain several sequences (segments) of anomalies within a certain period of time. Let us denote *S* as a set of *M* segments of anomalies:

$$\begin{array}{c} \mathcal{S} = \{ \mathcal{S}_1, \dots, \mathcal{S}_M \}, \\ \mathcal{S}_m = \{ \mathbf{x}_{t^{ms}}, \dots, \mathbf{x}_{t^{me}} \}, \end{array}$$

where *t<sup>ms</sup>* and *t<sup>me</sup>* are the start and end times of segment *S<sub>m</sub>*, respectively.
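In practice, the segments are usually recovered from the ground-truth label sequence. The following sketch assumes that a segment is a maximal run of consecutive points labelled 1; the function name `anomaly_segments` and the label sequence are illustrative.

```python
def anomaly_segments(y_true):
    """Return (start, end) index pairs, inclusive, of maximal runs of 1s."""
    segments = []
    start = None
    for t, label in enumerate(y_true):
        if label == 1 and start is None:
            start = t                        # a new segment begins
        elif label != 1 and start is not None:
            segments.append((start, t - 1))  # the segment ended at t - 1
            start = None
    if start is not None:                    # segment runs to the end
        segments.append((start, len(y_true) - 1))
    return segments


y_true = [0, 1, 1, 0, 0, 1, 1, 1]
segments = anomaly_segments(y_true)
print(segments)
```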

Below, several approaches to calculate the performance metrics of anomaly detection are described.

*Point-wise calculation approach.* The performance metrics are calculated over the individual records of the dataset [28,49]. Precision (*P*), recall (*R*), and F-measure (*F*1) are computed using all points in the dataset:

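$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = 2 \times \frac{P \times R}{P + R},$$

where *TP* is the number of anomalous points correctly labelled as anomalous, *FP* is the number of normal points incorrectly labelled as anomalous, and *FN* is the number of anomalous points incorrectly labelled as normal.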
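A minimal sketch of the point-wise calculation is given below. The function names and the label sequences are illustrative; returning 0 when a denominator is zero is a convention assumed here, not something stated in the text.

```python
def confusion_counts(y_true, y_pred):
    """Count point-wise true positives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute P, R and F1; a zero denominator yields 0.0 (assumed)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two true anomaly segments (indices 1-3 and 6-7); one is partly detected.
y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn = confusion_counts(y_true, y_pred)
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, round(p, 3), round(r, 3), round(f1, 3))
```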


*Point-adjusted (PA) calculation approach.* The performance metrics are calculated using corrected labels. If at least one observation of an anomalous segment is detected correctly, all other observations of that segment are also considered correctly detected, even if they were not [28,49]. Observations outside the true anomaly segments are processed as usual. This can be specified as follows:

$$\hat{y}_t^{pa} = \begin{cases} 1, & \text{if } A(\mathbf{x}_t) > \delta \text{ or } \exists\, \mathbf{x}_{t'} \in \mathcal{S}_m : A(\mathbf{x}_{t'}) > \delta \text{ with } \mathbf{x}_t \in \mathcal{S}_m, \\ 0, & \text{otherwise.} \end{cases}$$

The metrics are calculated considering the corrected labels in the dataset:

$$P_{pa} = \frac{TP_{pa}}{TP_{pa} + FP_{pa}}, \qquad R_{pa} = \frac{TP_{pa}}{TP_{pa} + FN_{pa}}, \qquad F1_{pa} = 2 \times \frac{P_{pa} \times R_{pa}}{P_{pa} + R_{pa}}.$$

This idea is represented in Figure 4.

**Figure 4.** True, corrected, and predicted labels in case of the PA approach to metrics calculation.
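The label correction of the PA approach can be sketched as follows. This is an illustrative implementation under the assumption that anomaly segments are maximal runs of consecutive true labels equal to 1; the function names and the data are not taken from the cited works.

```python
def point_adjust(y_true, y_pred):
    """Mark a whole true segment as detected if any of its points is."""
    adjusted = list(y_pred)
    n = len(y_true)
    t = 0
    while t < n:
        if y_true[t] == 1:
            end = t
            while end + 1 < n and y_true[end + 1] == 1:
                end += 1                  # extend to the end of the segment
            if any(y_pred[i] == 1 for i in range(t, end + 1)):
                for i in range(t, end + 1):
                    adjusted[i] = 1       # correct the whole segment
            t = end + 1
        else:
            t += 1                        # points outside segments unchanged
    return adjusted


def precision_recall_f1(y_true, y_pred):
    """Point-wise P, R and F1; a zero denominator yields 0.0 (assumed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
y_pa = point_adjust(y_true, y_pred)
p_pa, r_pa, f1_pa = precision_recall_f1(y_true, y_pa)
print(y_pa, round(p_pa, 3), round(r_pa, 3), round(f1_pa, 3))
```

On this toy input the point-wise *F*1 of the uncorrected predictions is 0.5, so the example also shows how the correction raises the score without any change in the detector.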

*Revised point-adjusted (RPA, event-wise) calculation approach*. The metrics are calculated over anomaly windows (segments) of records rather than individual points [50]. If any point in an anomaly window is labelled as anomalous, one true positive is counted. If no point in the window is labelled as anomalous, one false negative is counted. Any predicted anomalies outside the anomaly windows are considered false positives. This can be specified as follows:

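$$P_{rpa} = \frac{TP_{rpa}}{TP_{rpa} + FP_{rpa}}, \qquad R_{rpa} = \frac{TP_{rpa}}{TP_{rpa} + FN_{rpa}}, \qquad F1_{rpa} = 2 \times \frac{P_{rpa} \times R_{rpa}}{P_{rpa} + R_{rpa}},$$

where *TP<sub>rpa</sub>* is the number of anomaly windows in which at least one point is labelled as anomalous, *FN<sub>rpa</sub>* is the number of anomaly windows in which no point is labelled as anomalous, and *FP<sub>rpa</sub>* is the number of predicted anomalies outside the anomaly windows.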


This idea is represented in Figure 5.

**Figure 5.** True, predicted, and corrected labels in case of the RPA approach to the metrics calculation.
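The event-wise counting can be sketched as below. Two points are assumptions of this sketch rather than statements of the source: anomaly windows are taken as maximal runs of consecutive true labels equal to 1, and each predicted anomalous point outside these windows is counted as one false positive. The function name `rpa_counts` is hypothetical.

```python
def rpa_counts(y_true, y_pred):
    """Count segment-level TP and FN, and point-level FP outside segments."""
    tp = fp = fn = 0
    n = len(y_true)
    t = 0
    while t < n:
        if y_true[t] == 1:
            end = t
            while end + 1 < n and y_true[end + 1] == 1:
                end += 1                  # extend to the end of the segment
            if any(y_pred[i] == 1 for i in range(t, end + 1)):
                tp += 1                   # segment detected at least once
            else:
                fn += 1                   # segment missed entirely
            t = end + 1
        else:
            if y_pred[t] == 1:
                fp += 1                   # alarm outside any true segment
            t += 1
    return tp, fp, fn


y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
tp_rpa, fp_rpa, fn_rpa = rpa_counts(y_true, y_pred)
p_rpa = tp_rpa / (tp_rpa + fp_rpa) if tp_rpa + fp_rpa else 0.0
r_rpa = tp_rpa / (tp_rpa + fn_rpa) if tp_rpa + fn_rpa else 0.0
f1_rpa = 2 * p_rpa * r_rpa / (p_rpa + r_rpa) if p_rpa + r_rpa else 0.0
print(tp_rpa, fp_rpa, fn_rpa, p_rpa, r_rpa, f1_rpa)
```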

Another metric is the composite *F*1 score [50]. For this metric, precision is taken as *P* (by the number of points), and recall is calculated as *R<sub>rpa</sub>* (by the number of segments):

$$F1_c = 2 \times \frac{P \times R_{rpa}}{P + R_{rpa}}.$$

This idea is represented in Figure 6.

**Figure 6.** True, predicted, and corrected labels in case of the composite *F*1 score approach to the metrics calculation.
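A sketch of the composite score combines the two previous calculations: precision is counted over points and recall over segments. As before, segments are assumed to be maximal runs of true labels equal to 1, a zero denominator is mapped to 0, and the function name `composite_f1` is illustrative.

```python
def composite_f1(y_true, y_pred):
    """Harmonic mean of point-wise precision and segment-wise recall."""
    # Point-wise precision P.
    tp_points = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp_points = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    predicted = tp_points + fp_points
    precision = tp_points / predicted if predicted else 0.0

    # Segment-wise recall R_rpa.
    detected = total = 0
    n = len(y_true)
    t = 0
    while t < n:
        if y_true[t] == 1:
            end = t
            while end + 1 < n and y_true[end + 1] == 1:
                end += 1
            total += 1
            if any(y_pred[i] == 1 for i in range(t, end + 1)):
                detected += 1
            t = end + 1
        else:
            t += 1
    recall = detected / total if total else 0.0

    denominator = precision + recall
    return 2 * precision * recall / denominator if denominator else 0.0


y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
f1_c = composite_f1(y_true, y_pred)
print(round(f1_c, 4))
```

Here *P* = 2/3 and *R<sub>rpa</sub>* = 1/2, so the composite score is 4/7.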

*Conclusions.* There are various approaches to the calculation of metrics for performance evaluation of the machine learning models. In addition to the classical way of calculating through *TP*, *TN*, *FN*, and *FP*, researchers present options with adjusted indicators. This is aimed at improving the quality of anomaly detection in a large amount of data, or at reducing the number of false positives. In this case, the choice of metrics strongly depends on the detection problem being solved. To select the appropriate approach to calculation, additional experiments are required: the authors plan to implement and compare all the described metrics in future experiments with anomaly detection methods.

## **5. Conclusions**

In the paper, the authors considered existing approaches in the anomaly detection area, existing datasets that can be used for the experiments, and existing performance metrics. The analysis of the related works showed that the research focus has shifted to the application of deep neural networks to anomaly detection in technological processes; however, there are still solutions based on classical anomaly detection techniques. The application of machine learning techniques requires high-quality datasets, that is, datasets that are relevant to the subject domain, meaningful, and reliable. We formulated five requirements for the datasets that consider these properties and evaluated three different datasets that are currently proposed for testing and evaluation of cybersecurity applications. The selected datasets are SWaT, HAI, and TON\_IoT. Our experiments revealed that TON\_IoT is not suitable for the intrusion detection task, as we did not discover any relations between sensor data and network data. We consider that the SWaT and HAI datasets are more relevant for cybersecurity tasks, primarily because they were generated using real physical test beds. The SWaT dataset contains both network and sensor data; this makes it preferable for intrusion detection, as the authors believe that joint analysis of the network and sensor data could substantially benefit the early detection of attacks, including multi-step attacks.

Another interesting finding relates to the performance evaluation of the machine learning techniques proposed to detect anomalies. The evaluation approaches consider the specifics of anomalies in CPS, namely their duration and the delayed response of the system. Although these features could significantly enhance the evaluation process of the proposed cybersecurity solutions, they require more analysis and research.

Finally, in future research, the authors plan to develop an approach to anomaly detection in cyber-physical systems that will provide accurate and explainable results, and will conduct experiments to select the performance metrics.

**Author Contributions:** Conceptualization, E.N., E.F. and I.K.; methodology, E.N. and E.F.; software, O.T., D.L. and A.B.; validation, E.N., E.F., O.T. and I.K.; investigation, E.F., E.N., O.T., D.L. and A.B.; writing—original draft preparation, E.F.; writing—review and editing, E.N., O.T., D.L. and A.B.; visualization, D.L. and A.B.; supervision, I.K., E.N. and E.F.; funding acquisition, I.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is being supported by the grant of RSF #21-71-20078 in SPC RAS.

**Data Availability Statement:** Not applicable

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


