Article

Mutation-Based Multivariate Time-Series Anomaly Generation on Latent Space with an Attention-Based Variational Recurrent Neural Network for Robust Anomaly Detection in an Industrial Control System

1 Department of Computer Engineering (Smart Security), Gachon University, Seongnam-daero 1342, Seongnam-si 13119, Republic of Korea
2 Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7714; https://doi.org/10.3390/app14177714
Submission received: 29 July 2024 / Revised: 27 August 2024 / Accepted: 29 August 2024 / Published: 1 September 2024
(This article belongs to the Special Issue Methods and Applications of Data Management and Analytics)

Abstract: Anomaly detection involves identifying data that deviate from normal patterns. Two primary strategies are used: one-class classification and binary classification. In industrial control systems (ICS), where anomalies can cause significant damage, timely and accurate detection is essential, often requiring analysis of time-series data. One-class classification is commonly used but tends to have a high false alarm rate. To address this, binary classification is explored, which can better differentiate between normal and anomalous data, though it struggles with class imbalance in ICS datasets. This paper proposes a mutation-based technique for generating ICS time-series anomalies. The method maps ICS time-series data into a latent space using a variational recurrent autoencoder, applies mutation operations, and reconstructs the time-series, introducing plausible anomalies that reflect multivariate correlations. Evaluations on ICS datasets show that these synthetic anomalies are visually and statistically credible. Training a binary classifier on data augmented with these anomalies effectively mitigates the class imbalance problem.

1. Introduction

Anomaly detection is the task of predicting anomalous data that exhibit different patterns from normal data. It plays a crucial role in various fields such as finance [1], industrial control systems (ICS) [2], and cybersecurity [3]. There are two main approaches to classifying anomalies from normal data: one-class classification and binary classification. (1) One-class classification involves training a classifier using only one type of data (typically normal data) to predict anomalous data [4,5]. This strategy is commonly adopted in most domains because anomalous data are often rare and difficult to collect compared to normal data. (2) On the other hand, if anomalous data are available, the anomaly detection problem can be reduced to binary classification [6,7]. In this setting, a prediction model is trained with both normal and anomalous data.
One-class classification-based anomaly detection models are generally known to have a high false alarm rate [8]. The primary reason for this drawback is that the prediction model struggles to learn a sophisticated decision boundary due to the absence of anomalous instances. Consequently, researchers attempt to address the anomaly detection problem through binary classification, aiming to accurately separate anomalous instances from normal data. Most of this research concentrates on independently and identically distributed (i.i.d) data, with only limited studies addressing anomaly generation for time-series data.
The lack of research on time-series anomaly generation exacerbates the difficulty of building robust anomaly detection models in ICS environments. ICSs are core to various domains such as manufacturing, electrical grids, and transportation systems. These systems share a closed nature, which makes data collection challenging, and the data they produce are inherently time-series. The resulting data imbalance makes it difficult to train robust anomaly detection models and consequently reduces the reliability of performance evaluation. Therefore, in this paper, we focus on generating pseudo-time-series anomalies in the ICS environment.
Based on our observation, there are three major challenges in generating time-series anomalies. (1) Unlike i.i.d data, time-series data exhibit dependencies between data samples over time. In other words, when generating anomalies for time-series data, it is crucial to reflect this temporal dependency; (2) In the case of multivariate data, an anomaly in one feature can affect other features due to the correlations between them. However, systematically calculating these correlations and assigning values manually is not scalable [9]; (3) Anomalous data are inherently rare compared to normal data, making it difficult to determine an accurate probability distribution for these data. This scarcity complicates the process of generating data through sampling from a probability distribution.
To overcome the identified challenges, our insights are as follows. (1) By using variational inference, we map normal data into a latent space that follows a well-known probability distribution [10]. Variational inference is a method for training generative models, where the model maximizes the evidence lower bound (ELBO); (2) By ensuring that the latent space follows a well-known probability distribution, we can easily sample latent vectors and generate new data from them. Additionally, since we already know the statistical properties of the probability distribution, we can appropriately tamper with the latent vectors to generate anomalous data; (3) Inspired by dynamic software testing methods such as fuzzing, we adopt mutation to modify latent vectors [11,12]. These insights guide our approach to effectively generate time-series anomalies while addressing the challenges of temporal dependency, multivariate correlation, and data scarcity.
In this paper, we propose a method for generating pseudo anomalous data for industrial process data by synthesizing the aforementioned insights. The proposed method consists of two main parts. First, we map industrial time-series data to latent variables using a neural network. The learned latent variables encapsulate not only the characteristics of each time step of the data but also the correlations between features. Next, we tamper with the latent variables using mutation operations inspired by fuzz testing. We define mutation types to reproduce several known types of time-series anomaly patterns. In evaluations using several well-known ICS time-series datasets, the proposed method successfully generated time-series anomalies with various patterns. The contributions of this paper are as follows:
  • We define patterns of time-series anomalies by analyzing several time-series datasets and existing studies.
  • We propose a robust time-series anomaly generation algorithm using mutations and a variational recurrent autoencoder.
  • We generate time-series anomalies from well-known ICS datasets using the proposed algorithm and comprehensively evaluate the quality of the synthetic anomalies.
The remainder of this paper is organized as follows. Section 2 presents existing studies on anomaly detection from the perspectives of one-class classification and binary classification; Section 3 analyzes the patterns of time-series anomalies; Section 4 proposes a method for generating time-series anomalies using AVRAE and mutation; Section 5 thoroughly evaluates the quality of the time-series anomalies generated by the proposed method using widely used anomaly detection ICS time-series datasets; Section 6 discusses several limitations of the proposed method; Finally, in Section 7, we provide conclusions and suggest directions for future research.

2. Related Work

Anomaly detection is a major research topic in various application domains, including cybersecurity. Most studies in this field adopt either one-class classification or binary classification detection strategies, depending on the availability of anomalous data during the training of detection models. In this section, we present existing research related to each approach (Section 2.1, Section 2.2 and Section 2.3). Furthermore, since this paper focuses on anomalies observed in time-series data, we discuss recent studies on generating time-series data and anomalous data in Section 2.4 and Section 2.5, respectively.

2.1. Statistical Anomaly Detection

The most traditional method to detect an anomaly is to utilize statistical techniques such as calculating the mean and standard deviation of a dataset. By identifying data points that fall outside of a certain threshold, these methods can flag potential outliers.
W. Yu et al. [13] introduces a generalized probabilistic monitoring model (GPMM) designed to handle both random and sequential data for process monitoring. It unifies various probabilistic linear models and establishes connections between different monitoring methods. Using the expectation-maximization (EM) algorithm, the model estimates parameters, derives monitoring statistics, and investigates the equivalence between these statistics and those from classical multivariate methods. The model’s effectiveness is demonstrated through numerical examples and application to the Tennessee Eastman process.
W. Yu et al. [14] presents an unsupervised fault detection and diagnosis method called Sparse Distribution Dissimilarity Analytics (SDDA), which combines distribution dissimilarity with a lasso penalty. This method addresses shortcomings in existing techniques by maximizing the dissimilarity between normal and abnormal data distributions, enabling accurate detection and diagnosis of process faults, including those with small magnitudes. The method is validated through both simulations and real industrial processes, showing superior performance compared to traditional methods.
W. Yu et al. [15] presents a novel fault detection method designed for complex industrial systems. The proposed MoniNet framework integrates both temporal and spatial information using a cascaded monitoring network, which enhances the detection accuracy of process anomalies. The method is validated using real industrial data, demonstrating its effectiveness in identifying faults more accurately compared to traditional methods.

2.2. Anomaly Detection in One-Class Classification

As mentioned earlier, anomalous behaviors occurring in enterprise networks or systems are typically very rare. Consequently, most research interprets anomaly detection as a one-class classification problem. L. Shen et al. [4] proposed a temporal hierarchical one-class model and an end-to-end learning method for time-series anomaly detection. This model is designed with a dilated recurrent neural network and incorporates multi-scale clustering to better capture temporal dynamics. The cluster centers are encouraged to be orthogonal.
H. Xu et al. [5] developed a calibrated one-class classifier that learns a more refined normality boundary. The calibration of this model involves penalizing uncertain predictions and discriminating normal samples from simulated abnormal behaviors.
S. Mauceri et al. [16] focused on the representation of time-series data rather than designing a novel classifier. This study represented given time-series data based on their dissimilarities to a set of so-called prototypes. The study evaluated the Cartesian product of 12 dissimilarities and 8 prototypes and used a one-class nearest neighbor classifier to detect anomaly samples.
L. Gjorgiev and S. Gievska [17] explore the use of variational autoencoders (VAEs) combined with Mahalanobis distance for anomaly detection in time-series data. The study evaluates various deep learning architectures on the BATADAL challenge dataset, which focuses on detecting cyber-attacks in water distribution systems. The results indicate that simpler VAE models using Mahalanobis distance can effectively detect anomalies, demonstrating significant promise in time-to-detection performance.

2.3. Anomaly Detection in Binary Classification

While most studies on time-series anomaly detection focus on the one-class classification strategy using only normal data, there have been recent attempts to identify anomalous data through binary classification. Z. Ghrib et al. [6] proposed a hybrid approach to recognize fraudulent credit card transactions. This study generated latent representations of the given data using a long short-term memory (LSTM) [18] based autoencoder pretrained on normal data, and then classified these representations using a support vector machine (SVM) [19].
P. Primus et al. [7] proposed a method for detecting anomalous sounds. Instead of presenting a novel detection model, this study focused on collecting anomalous samples from a given dataset by careful selection of unrelated data to serve as proxy outliers. Then, a detection model based on ResNet [20] was trained as a binary classifier using normal samples and proxy outliers.
I. Ullah et al. [21] proposed an RNN-based model for detecting intrusions in Internet of Things (IoT) networks. The proposed model, designed with LSTM, bidirectional LSTM, or gated recurrent unit (GRU) [22], was evaluated on various IoT datasets. This study trained the detection model as a binary classifier and employed techniques such as weighted loss and borderline SMOTE [23] to overcome the limitations of imbalanced datasets.
K. Gundersen et al. [24] proposed a binary time-series classification model for detecting gas emissions from the ocean. This study adopted a Bayesian convolutional neural network to detect gas leaks and introduced Monte Carlo dropout [25] to maximize generalization.
F. Liu et al. [26] proposed a model for detecting anomalous data in quasi-periodic time-series. This study first split the quasi-time-series into successive quasi-periods through two-level clustering and then detected anomalies using a hybrid attentional LSTM-CNN model.

2.4. Time-Series Data Generation

Many real-world datasets are time-series, which exhibit higher dynamicity and complexity compared to i.i.d. data. Consequently, time-series generation models should address issues arising from temporal dependencies. G. Forestier et al. [27] proposed a time-series generation method based on dynamic time warping (DTW) barycenter averaging (DBA). This method generates new time-series by calculating a weighted average of the time-series within a given dataset. By assigning different weights to each time-series, a variety of rich patterns can be generated. Additionally, this study presented several weight selection methods to ensure diversity in time-series synthesis.
Recently, deep neural network (DNN)-based models have been employed to more effectively capture the temporal dependency and dynamicity in data synthesis. Powerful generative models based on variational inference [10] and generative adversarial networks (GAN) [28] are among these approaches. C. Zhang et al. [29] proposed a GAN-based model for synthesizing power consumption data in smart grids. This study represented power consumption data using two attributes (level and pattern) and conditionally modeled the data probabilistically over users, days, and months, enabling the model to learn temporal attributes.
L. Zhou et al. [30] presents a novel model called LS4, a deep latent state space model for time-series generation. It addresses the limitations of existing ordinary differential equations (ODE)-based models, particularly their struggles with sharp transitions in time-series data and high computational costs. LS4 leverages a convolutional representation to enhance speed and efficiency, significantly outperforming previous models in accuracy and computational efficiency, especially on datasets with irregular sampling and long sequences.

2.5. Pseudo Anomaly Generation

Synthetic data generation plays a crucial role in various domains for augmenting imbalanced datasets and ensuring privacy protection. It is well known that building robust anomaly detection models requires balanced datasets containing both normal and anomalous data [31]. However, to construct high-quality datasets, it is essential to secure relatively rare anomalous data. M. Salem et al. [32] proposed a strategy to transform normal data into anomalous data using Cycle-GAN [33]. Cycle-GAN employs a training method that enables mutual transformation between images from different domains. This study utilized this characteristic to convert data into images and transform template data (normal data) into anomalies. The synthetic anomalies are then appended to the dataset for training detection models.
M. Pourreza et al. [34] introduced G2D, a GAN-based general framework for generating anomalies. This study categorizes synthetic data into three stages based on the learning progress of the generator: random samples, outliers surrounding the normal samples, and samples following the data distribution. Random samples and outliers are considered anomalies and used for training anomaly detection models. As a result, the detection models trained with G2D demonstrate stable performance as the proportion of outliers in the dataset increases.
H. Shen et al. [35] presented a novel method for unsupervised anomaly detection in industrial images. This method involves generating high-quality pseudo-anomaly images and enhancing normal image features. Experiments on various real datasets show performance improvements, specifically by enhancing normal image features to boost the model’s prediction accuracy. The method also reduces the uncertainty of single models by integrating various anomaly scores through ensemble detection.
Y. Lin et al. [36] introduced the FastLogAD system, designed for rapid anomaly detection in log data. It uses a mask-guided anomaly generation (MGAG) technique to generate pseudo-abnormal logs and a discriminative abnormality separation (DAS) model for efficient anomaly identification. FastLogAD significantly outperforms existing methods by achieving anomaly detection speeds at least ten times faster and competitive performance metrics, representing a significant advancement in log anomaly detection with an emphasis on speed and efficiency in handling security-related data.
T. Hu et al. [37] introduced AnomalyDiffusion, a new anomaly generation model that enables accurate anomaly detection with limited data. AnomalyDiffusion significantly improves generation accuracy and diversity by integrating anomaly location and appearance information. Experimental results demonstrate that this model outperforms existing methods in both generation accuracy and diversity, showing high performance in downstream anomaly detection tasks.
While numerous studies have proposed methods for generating anomalous data, only a limited number of these focus on time-series data. Most synthetic anomaly generation techniques are confined to i.i.d data, such as images. In contrast, data collected from industrial processes often exhibit time-series characteristics, making i.i.d-based anomaly generation techniques unsuitable for generating time-series anomalous data.

3. Patterns of Time-Series Anomaly

Understanding the characteristics and patterns of anomalous data is as crucial as the techniques for generating time-series or anomalies. Several studies have explored the patterns of time-series anomalies [38,39,40,41]. We synthesize the findings of these studies to comprehensively analyze anomaly patterns. Based on this analysis, we design mutation operations in Section 4 to reproduce these patterns. Specifically, we focus on continuous data collected from industrial processes [42,43], rather than discrete data such as log data.
P. Boniol et al. [41] analyzed various publicly available time-series datasets and categorized the anomalous data within these datasets into several patterns. Broadly, this study distinguishes between anomaly patterns occurring at single data points and interval anomaly patterns (collective anomalies) that span multiple data points.
Anomaly patterns for single data points are further divided into point anomalies, which exceed expected value ranges, and contextual anomalies, which do not. Additionally, the concept of multiple anomalies, where several anomalous data points are present, is mentioned. Y. Bao et al. [40] and Z. Tang et al. [39] proposed anomaly detection models for health monitoring data and, as part of their research, analyzed the anomaly patterns within the data. These studies fundamentally consider data that deviate statistically from the norm as anomalies.
In addition to conducting a literature review, we analyzed well-known ICS datasets to examine the anomaly patterns contained within these datasets. Specifically, we analyzed the SWaT [42] and the HIL-based augmented ICS security dataset (HAI) [43]. The secure water treatment (SWaT) [42] dataset is derived from a water treatment testbed designed for ICS and security research. This testbed consists of six processes: raw water intake, chemical dosing, ultrafiltration (UF), reverse osmosis (RO), RO filtration, and UF backwash. In this study, three types of attackers were modeled, performing various attacks such as network packet sniffing and physical access.
The HAI [43] dataset also originates from a testbed primarily for water treatment, consisting of four major processes: the boiler process, turbine process, water treatment process, and hardware-in-the-loop simulator. The dataset was collected over more than ten days, capturing three types of attacks: process variable response prevention, setpoint attack, and control output attack.
Figure 1 illustrates the anomaly patterns found in the HAI and SWaT datasets. The blue and red lines represent normal data and anomalies, respectively. As seen in the figure, the anomaly patterns in both ICS datasets mostly correspond to point anomalies or contextual anomalies as defined by P. Boniol et al. [41]. The SWaT dataset includes some collective anomalies.
Furthermore, according to Figure 1, the collective anomalies in the SWaT dataset can be categorized as minor types, as described by Y. Bao et al. [40] and Z. Tang et al. [39]. Based on our literature review and analysis of the anomaly patterns in ICS datasets, we focus on generating point anomalies, contextual anomalies, and collective anomalies from given normal ICS time-series data.

4. Mutation-Based Multivariate Time-Series Anomaly Generation in ICS

This section describes the mutation-based ICS time-series anomaly generation model. Figure 2 illustrates the ICS time-series anomaly generation process. To effectively generate anomalies for ICS time-series data, we adopt an attention-based variational recurrent autoencoder (AVRAE) [44]. AVRAE is fundamentally designed with an RNN architecture and incorporates variational inference and an attention mechanism to better learn the characteristics of the time-series data. In the figure, the blue and red rectangles represent the RNN cells of the encoder and decoder, respectively, while the green circles denote the hidden states produced by the encoder or decoder.
Our anomalous data generation process is divided into two main phases: the training phase (left side of Figure 2) and the generation phase (right side of Figure 2). The training phase aims to train AVRAE as a robust time-series generation model. Each timestep’s latent vector produced by AVRAE is forced to follow a well-known probability distribution. Additionally, the attention layers of AVRAE effectively learn the relationships between the hidden states (latent vectors) produced by the encoder and those produced by the decoder. Once AVRAE training is completed, the process moves to the generation phase. In the generation phase, the hidden states generated by the trained encoder of AVRAE are mutated and passed through the attention layers. AVRAE then generates time-series anomalies using the mutated hidden states from the encoder and the hidden states from the decoder, where we define several mutation operators specifically designed to generate point anomalies, contextual anomalies, or collective anomalies.
The remainder of this section is organized as follows. In Section 4.1, we describe the details of AVRAE, including the objective function and attention mechanisms. Section 4.2 presents the mutation operators for generating ICS time-series anomalies. Finally, Section 4.3 proposes an algorithm for ICS time-series anomaly generation that combines AVRAE with the mutation operators.

4.1. Attention-Based Variational Recurrent Autoencoder

AVRAE is central to our ICS time-series anomaly generation. There are two main reasons for adopting AVRAE for anomalous data generation. First, AVRAE extends variational inference to time-series data, allowing control over the distribution of hidden states produced at each timestep by the RNN. Second, AVRAE’s attention mechanism enables the generation of highly plausible time-series by leveraging the relationships between the encoder’s and decoder’s hidden states.
Since our approach involves applying mutation operations to the hidden states to generate anomalous data, AVRAE, which maps actual data to controllable hidden states, is a highly suitable model. This section formally describes AVRAE’s objective function, the time-series evidence lower bound, and the inference process.

4.1.1. Evidence Lower Bound

Variational inference is an approximation method that uses relatively simple distributions to handle complex probability distributions. It is used to optimize a latent variable model for given data, replacing complex distributions with simpler, more computationally feasible ones, thereby reducing computational cost and increasing efficiency. ELBO is a typical objective used in variational inference.
$$
\log p(x) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) + \mathbb{E}_{q(z)}[\log p(z,x)] - \mathbb{E}_{q(z)}[\log q(z)] \geq \mathbb{E}_{q(z)}[\log p(z,x)] - \mathbb{E}_{q(z)}[\log q(z)] = \mathrm{ELBO}(q) \tag{1}
$$
Equation (1) is the derivation of the ELBO. Originally, we aim to train the model to maximize the log-likelihood $\log p(x)$, where $x$ is the observable data and $z$ is the latent variable; $p(\cdot)$ denotes the true distribution and $q(\cdot)$ the variational distribution. $\log p(x)$ decomposes into the Kullback–Leibler divergence (KLD) $\mathrm{KL}(q(z)\,\|\,p(z \mid x))$ and $\mathrm{ELBO}(q)$. Since the KLD is always non-negative, $\log p(x)$ is always greater than or equal to $\mathrm{ELBO}(q)$.
Therefore, variational inference maximizes $\mathrm{ELBO}(q)$ instead of directly maximizing $\log p(x)$. This ELBO fundamentally assumes that the data $x$ are i.i.d. Hence, to handle probability distributions over time-series data with variational inference, the ELBO must be extended to time-series.
$$
\mathrm{ELBO}_{ts}(q) = \mathbb{E}_{q(z, h_{1:T} \mid x_{1:T})}[\log p(z, h_{1:T}, x_{1:T})] - \mathbb{E}_{q(z, h_{1:T} \mid x_{1:T})}[\log q(z, h_{1:T} \mid x_{1:T})] = \mathbb{E}_{q(z, h_{1:T} \mid x_{1:T})}[\log p(x_{1:T} \mid z)] - \mathrm{KL}\big(q(z \mid h_{1:T})\,\|\,p(z \mid h_{1:T})\big) - \mathrm{KL}\big(q(h_{1:T} \mid x_{1:T})\,\|\,p(h_{1:T})\big) \tag{2}
$$
Equation (2) formulates $\mathrm{ELBO}_{ts}(q)$, the extension of the ELBO to time-series data. $\mathrm{ELBO}_{ts}(q)$ not only extends the data to a time-series but also introduces an additional sequential latent variable $h_{1:T}$ to account for the sequential nature of the data.
Consequently, maximizing $\mathrm{ELBO}_{ts}(q)$ is equivalent to maximizing the log-likelihood of the time-series data $x_{1:T}$ given the latent variable $z$, while minimizing the KLD between the true distribution and the variational distribution for both the latent variable $z$ and the sequential latent variable $h_{1:T}$.

4.1.2. Inference

AVRAE follows an encoder–decoder architecture composed of RNNs. At each timestep, the RNN processes the input data $x_t$ and the hidden state $h_{t-1}$ from the previous timestep to produce a new hidden state $h_t$. This hidden state $h_t$ is propagated to the next timestep, allowing the RNN to reflect the sequential nature of the data. However, basic RNNs perform all inference operations deterministically. Therefore, AVRAE introduces stochasticity into the inference process to enable variational inference.
$$
h_t \sim q_\theta(h_t \mid h_{t-1}, x_t) \tag{3}
$$
Equation (3) represents the stochastic process of the RNN-based encoder at timestep $t$, where $q_\theta$ is the variational distribution parameterized by $\theta$. The hidden state $h_t$ is sampled from $q_\theta$, conditioned on the hidden state $h_{t-1}$ from the previous timestep and the input data $x_t$ at the current timestep.
$$
h_t \sim \mathcal{N}\big(\mu_{h_t}, \mathrm{diag}(\sigma_{h_t}^2)\big), \quad \text{where } [\mu_{h_t}, \sigma_{h_t}] = \phi(h_{t-1}, x_t) \tag{4}
$$
Equation (4) describes the sampling process assuming $q_\theta$ is a Gaussian distribution. The function $\phi(\cdot)$ uses $h_{t-1}$ and $x_t$ to produce the mean $\mu_{h_t}$ and standard deviation $\sigma_{h_t}$ of the Gaussian distribution, from which $h_t$ is then sampled. However, as is well known, sampling is a non-differentiable operation, so the reparameterization trick is typically used in variational inference [10,45,46]. We use a location-scale transformation to ensure that the reparameterized hidden state still follows a Gaussian distribution: $h_t = \mu_{h_t} + \sigma_{h_t} \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
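For concreteness, the stochastic update in Equations (3) and (4) can be sketched in PyTorch as follows. This is a minimal illustration rather than the released implementation; the class name StochasticRNNCell and the log-$\sigma$ parameterization (used to keep $\sigma$ positive) are our assumptions.
    import torch
    import torch.nn as nn

    class StochasticRNNCell(nn.Module):
        """One variational RNN step: [mu, sigma] = phi(h_{t-1}, x_t), then reparameterize."""
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            # phi(.) maps (h_{t-1}, x_t) to the Gaussian parameters of q_theta
            self.phi = nn.Linear(input_dim + hidden_dim, 2 * hidden_dim)

        def forward(self, x_t, h_prev):
            stats = self.phi(torch.cat([x_t, h_prev], dim=-1))
            mu, log_sigma = stats.chunk(2, dim=-1)
            sigma = log_sigma.exp()                 # assumption: predict log-sigma for positivity
            eps = torch.randn_like(sigma)           # eps ~ N(0, I)
            h_t = mu + sigma * eps                  # location-scale transform; differentiable
            return h_t, mu, sigma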
The decoder fundamentally mirrors the encoder and thus performs similar inference, with three key exceptions. First, the decoder requires the latent vector $z$ (referred to as the context) produced by the encoder; in AVRAE, the hidden state $h_T$ from the encoder's final timestep is used as $z$. Second, unlike the encoder, the decoder uses its output from the previous timestep as the input data for the current timestep, with the start symbol <start> as the initial input. Third, AVRAE processes the decoder's outputs through attention layers to reconstruct the encoder's input data $x_{1:T}$.
These attention layers comprise cross-attention and self-attention. Cross-attention combines the encoder's hidden states $h^e_{1:T}$ and the decoder's hidden states $h^d_{1:T}$ to calculate attention weights, which are then used to produce the output $o^c_{1:T}$. Additionally, cross-attention employs a look-ahead mask to prevent future information from being referenced during sequence generation. Next, self-attention takes $o^c_{1:T}$ as input, calculates attention weights, and produces the output $o^s_{1:T}$. Finally, AVRAE uses a dense layer to reconstruct $x_{1:T}$ from $o^s_{1:T}$, yielding $\hat{x}_{1:T}$.
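A minimal sketch of the masked cross-attention step, assuming standard scaled dot-product attention (the paper does not specify the exact attention variant, so the details below are illustrative):
    import torch
    import torch.nn.functional as F

    def causal_cross_attention(h_dec, h_enc):
        """Cross-attention with a look-ahead mask.
        h_dec: (B, T, p) decoder hidden states (queries);
        h_enc: (B, T, p) encoder hidden states (keys/values)."""
        B, T, p = h_dec.shape
        scores = h_dec @ h_enc.transpose(1, 2) / p ** 0.5          # (B, T, T)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))           # hide future encoder states
        weights = F.softmax(scores, dim=-1)
        return weights @ h_enc                                     # o^c_{1:T}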

4.1.3. Training

The training of AVRAE is relatively straightforward: AVRAE is trained to maximize Equation (2). More specifically, $\mathbb{E}_{q(z, h_{1:T} \mid x_{1:T})}[\log p(x_{1:T} \mid z)]$ is maximized by minimizing the error between the encoder's input $x_{1:T}$ and the decoder's output $\hat{x}_{1:T}$. $\mathrm{KL}(q(z \mid h_{1:T})\,\|\,p(z \mid h_{1:T}))$ and $\mathrm{KL}(q(h_{1:T} \mid x_{1:T})\,\|\,p(h_{1:T}))$ are minimized as the variational distribution approaches the true distribution.
AVRAE adopts the standard normal distribution as the true distribution, so the KLD decreases as the hidden states of the encoder $h^e_{1:T}$ and the decoder $h^d_{1:T}$ more closely follow the standard normal distribution.
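Under these Gaussian assumptions, the negative $\mathrm{ELBO}_{ts}$ reduces to a reconstruction term plus a closed-form KLD against $\mathcal{N}(0, I)$. A minimal sketch follows; the function name and the unweighted sum of the two terms are our assumptions:
    import torch.nn.functional as F

    def avrae_loss(x, x_hat, mu, sigma):
        """Negative ELBO_ts sketch: reconstruction error plus
        closed-form KL(N(mu, sigma^2) || N(0, I))."""
        recon = F.mse_loss(x_hat, x)          # maximizes E[log p(x_{1:T} | z)]
        kld = 0.5 * (mu.pow(2) + sigma.pow(2) - 2 * sigma.log() - 1).mean()
        return recon + kld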

4.2. Mutation Operators

We define three mutation operators to generate ICS time-series anomalies of the types analyzed in Section 3, utilizing the aforementioned characteristics. Algorithms 1–3 are the mutation operators for generating point anomalies, contextual anomalies, and collective anomalies, respectively. All three algorithms are applied to the hidden state sequence $h^e_{1:T}$ produced by the encoder of AVRAE and return the mutated sequence $\hat{h}^e_{1:T}$.
Algorithm 1: Mutation operator $M_p$ for point anomaly generation.
    input: the encoder's hidden state sequence $h^e_{1:T}$
    output: the mutated hidden state sequence $\hat{h}^e_{1:T}$
    $t \sim U(1, T)$;
    $v \sim \mathcal{N}(0, I)$;
    $h^e_{1:T}(t) \leftarrow v$;
    $\hat{h}^e_{1:T} \leftarrow h^e_{1:T}$;
Algorithm 2: Mutation operator $M_{ctx}$ for contextual anomaly generation.
    input: the encoder's hidden state sequence $h^e_{1:T}$
    output: the mutated hidden state sequence $\hat{h}^e_{1:T}$
    $t_s \sim U(1, T)$;
    $t_d \sim U(t_s, T)$;
    $v \leftarrow h^e_{1:T}(t_s)$;
    $h^e_{1:T}(t_d) \leftarrow v$;
    $\hat{h}^e_{1:T} \leftarrow h^e_{1:T}$;
Algorithm 3: Mutation operator $M_{col}$ for collective anomaly generation.
    input: the encoder's hidden state sequence $h^e_{1:T}$
    output: the mutated hidden state sequence $\hat{h}^e_{1:T}$
    $t_s \sim U(1, T)$;
    $t_d \sim U(t_s, T)$;
    $v_{t_s:t_d} \leftarrow$ a vector sequence of arbitrary values with length $(t_d - t_s)$;
    $h^e_{1:T}(t_s:t_d) \leftarrow v_{t_s:t_d}$;
    $\hat{h}^e_{1:T} \leftarrow h^e_{1:T}$;
Algorithm 1 introduces a point anomaly into the given hidden state sequence. First, an index $t$ is randomly selected to specify the element of the hidden state sequence to be altered. Then, a random value $v$ is sampled from the standard normal distribution to replace the $t$-th element of the hidden state sequence.
Algorithm 2 is the mutation operator $M_{ctx}$ for generating contextual anomalies. This algorithm randomly selects two indices $t_s$ and $t_d$ that are smaller than the total sequence length $T$. Then, it replaces the value at the $t_d$-th position in the hidden state sequence with the value $v$ from the $t_s$-th position. The rationale behind this design is that contextual anomalies inherently have values within the normal range but are considered abnormal within a given context. Therefore, Algorithm 2 generates anomalies by altering the context of normal values within the hidden state sequence.
Lastly, Algorithm 3 is the mutation operator $M_{col}$ for generating collective anomalies. This algorithm first randomly selects two indices $t_s$ and $t_d$, both less than the total sequence length $T$, to specify the segment to mutate. Then, it replaces the values in this segment of the hidden state sequence with a sequence $v_{t_s:t_d}$ composed of arbitrary values, generating the mutated sequence $\hat{h}^e_{1:T}$.
Depending on the method used to generate $v_{t_s:t_d}$, various types of collective anomalies can be produced. While there are multiple approaches, this study recommends replacing the segment with a sequence of constant vectors, a sequence of random vectors, or a sequence taken from another segment. The first two are the simplest ways to create $v_{t_s:t_d}$, while the latter extends contextual anomalies into collective anomalies. It is noteworthy that the three algorithms presented so far only mutate the hidden state sequence and do not directly generate actual time-series anomalies.
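A minimal PyTorch sketch of Algorithms 1–3, operating on a single hidden state sequence of shape (T, p); the function names are illustrative, and the swap-based variant of $M_{col}$ (copying another segment) is omitted for brevity:
    import torch

    def mutate_point(h):                            # Algorithm 1: M_p
        """Replace one timestep's hidden state with a draw from N(0, I)."""
        h = h.clone()                               # h: (T, p)
        t = torch.randint(0, h.size(0), (1,)).item()
        h[t] = torch.randn(h.size(1))
        return h

    def mutate_contextual(h):                       # Algorithm 2: M_ctx
        """Copy a normal hidden state to a later position, breaking its context."""
        h = h.clone()
        t_s = torch.randint(0, h.size(0), (1,)).item()
        t_d = torch.randint(t_s, h.size(0), (1,)).item()
        h[t_d] = h[t_s]
        return h

    def mutate_collective(h, mode='random'):        # Algorithm 3: M_col
        """Overwrite a random interval with constant or random vectors."""
        h = h.clone()
        t_s = torch.randint(0, h.size(0) - 1, (1,)).item()
        t_d = torch.randint(t_s + 1, h.size(0), (1,)).item()
        if mode == 'constant':
            h[t_s:t_d] = torch.randn(h.size(1))     # one vector broadcast over the interval
        else:
            h[t_s:t_d] = torch.randn(t_d - t_s, h.size(1))
        return h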

4.3. Anomaly Generation

As previously mentioned, we generate anomalies by altering the latent space rather than the actual data space. More specifically, we use AVRAE to map ICS time-series data into the latent space. In particular, the encoder of AVRAE produces hidden states at each timestep while processing the time-series. Additionally, since these hidden states follow a well-known probability distribution (in this study, the standard normal distribution), it is easy for us to predict or control the values of the hidden states.
We apply mutations to the hidden states produced by the encoder of AVRAE. This approach has two major advantages. First, the hidden states (the latent vectors) effectively encapsulate the correlations among the features of the observations. Therefore, changing only a part of the latent vector affects all the features of the observations. Second, the hidden states from the encoder are used by the decoder’s attention layer (cross-attention) for time-series generation. The attention layer learns the relevance between the elements of the hidden state sequences produced by the encoder and the decoder. Thus, if a segment of the hidden state sequence from the encoder is altered, the subsequent data generated by the decoder are also influenced.
Algorithm 4 outlines the process for generating anomalous ICS time-series proposed in this study. The algorithm is straightforward. First, the encoder of the trained AVRAE processes $x_{1:T}$ to produce the hidden state sequence $h^e_{1:T}$. As mentioned in Section 4.1, we assume the standard normal distribution as the true distribution, so each $h^e_t$ that makes up $h^e_{1:T}$ follows the standard normal distribution. Then, one of the three mutation operators $\{M_p, M_{ctx}, M_{col}\}$ presented in Section 4.2 is randomly selected as $M$. Although $M$ is chosen randomly in this study, a particular mutator may be specified if the type of generated anomaly needs to be controlled. Using $M$, we generate the mutated sequence $\hat{h}^e_{1:T}$ from $h^e_{1:T}$. Finally, the decoder of AVRAE generates the ICS time-series $\hat{x}_{1:T}$ from $\hat{h}^e_{1:T}$. Since $\hat{x}_{1:T}$ is generated from the mutated sequence $\hat{h}^e_{1:T}$, it contains anomalies. The generated anomalous ICS time-series not only considers the correlations among the features of the actual data but also reflects temporal dependencies through the attention layers of AVRAE.
Algorithm 4: Mutation-based ICS time-series anomaly generation.
    input: the trained AVRAE, the ICS time-series $x_{1:T}$
    output: the anomalous ICS time-series $\hat{x}_{1:T}$
    $h^e_{1:T} \leftarrow$ produce the hidden state sequence from $x_{1:T}$ with the AVRAE's encoder;
    $M \leftarrow$ randomly select a mutation operator from $\{M_p, M_{ctx}, M_{col}\}$;
    $\hat{h}^e_{1:T} \leftarrow M(h^e_{1:T})$;
    $\hat{x}_{1:T} \leftarrow$ produce the anomalous ICS time-series from $\hat{h}^e_{1:T}$ with the AVRAE's decoder;
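Putting the pieces together, Algorithm 4 amounts to the following sketch, assuming an AVRAE object that exposes encode and decode methods (these names are illustrative) together with the mutation functions sketched in Section 4.2:
    import random

    def generate_anomaly(avrae, x,
                         operators=(mutate_point, mutate_contextual, mutate_collective)):
        """Encode -> mutate hidden states -> decode (Algorithm 4 sketch)."""
        h_enc = avrae.encode(x)            # h^e_{1:T}; each step ~ N(0, I) after training
        mutate = random.choice(operators)  # M drawn from {M_p, M_ctx, M_col}
        h_mut = mutate(h_enc)              # mutated sequence
        return avrae.decode(h_mut)         # anomalous time-series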

5. Evaluation

In this section, a series of experiments is conducted to evaluate the mutation-based ICS time-series anomaly generation proposed in this paper on ICS datasets. To the best of our knowledge, there is no prior research on the generation of anomalous time-series data. Therefore, instead of comparing performance with other studies, the plausibility of the generated anomalies is assessed, and binary classification is performed using them. From this, two research questions (RQs) are defined as follows:
  • RQ1: Are the synthetic ICS time-series anomalies visually and in the embedding space similar to real data?
  • RQ2: Does a binary classifier trained on the synthetic ICS time-series anomalies perform better than a one-class classifier?

5.1. Dataset Description

To evaluate the mutation-based ICS time-series anomaly generation proposed in this study, the HAI [43] and SWaT [42] datasets were used. HAI is a dataset collected from a testbed that combines turbines, boilers, and a water treatment system with a hardware-in-the-loop (HIL) simulator. In this testbed, normal and attack scenarios were repeatedly executed in an unmanned supervisory control and data acquisition (SCADA) operating environment, and data were collected accordingly. The testbed has been continuously enhanced since 2017, and the dataset has been updated accordingly.
For our experiments, we used HAI 22.04. This version of the dataset provides six training datasets (each file ranging from 45 MB to 136 MB) and four test datasets (each file ranging from 33 MB to 69 MB). Each dataset consists of 86 features. The training datasets comprise only normal data, while the test datasets include anomaly data along with label information.
The SWaT dataset was collected from a six-stage water treatment process, where each stage is autonomously controlled by PLCs. This testbed is designed to closely mimic a real water treatment system, ensuring that the collected data can be applied to actual systems. In SWaT, communication among sensors, actuators, and PLCs is realized using a combination of wired and wireless channels, allowing for extensive experiments in realistic environments. The SWaT dataset has also been enhanced over time, with corresponding updates to the dataset. For this experiment, datasets collected in 2015 were used. Although the training and test datasets are not explicitly separated, the dataset provides two normal datasets (each file sized 127 MB) and one attack dataset (file sized 113 MB). Excluding the labels, this dataset consists of 51 features.
Furthermore, before presenting the analysis of the experimental results, it is worth noting that various attack methods can be included in actual attack scenarios against ICS. Consequently, the types of anomalies can also be diversified. However, the HAI and SWaT datasets used in this evaluation only categorize the data as either normal or attack. This study does not aim to present a model with high anomaly detection performance. Nonetheless, we conducted anomaly detection experiments to verify that synthetic anomalies help train detection models. Additionally, the primary purpose of these experiments is to distinguish attack data from normal data, rather than categorizing the types of attacks in detail.

5.2. Experimental Settings

All experiments were conducted on the same machine with the following specifications: Intel(R) Core(TM) i9-11900 2.50 GHz, 32 GB RAM, 64-bit Ubuntu 20.04 LTS, and NVIDIA GTX 3080 Titan. AVRAE was implemented in Python with the deep learning library PyTorch 2.3.0.
Table 1 shows the architecture of AVRAE used in this study. AVRAE has a nearly symmetrical encoder and decoder. The encoder consists of two LSTM layers, with the first layer taking data of size d at each timestep (sequence length of 100) and producing an output of size 100 × p . For HAI, p is set to 1024, and for SWaT, it is set to 256. The second layer takes the output of the first layer as input and produces an output of size 100 × p .
The decoder uses the hidden state of the last timestep of the encoder as its initial hidden state, taking a vector of size p at each timestep as input and producing an output of size 100 × p. Similarly, the second layer of the decoder takes the output of the first layer as input and produces an output of size 100 × p. The input size of the first layer of the decoder is p because it receives the timestep-wise output of the second layer as feedback.
Then, the cross-attention layer takes the outputs of both the encoder and the decoder as input and produces a sequence of size 100 × p . The self-attention layer takes the output of the cross-attention layer as input and outputs a vector sequence of size 100 × p . Finally, the output of the self-attention layer is restored to a vector sequence of size 100 × d , the same shape as the input to the encoder, by the dense layer.
AVRAE is trained using the Adam optimizer [47], an adaptive learning rate optimization algorithm that combines the advantages of two other extensions of stochastic gradient descent, AdaGrad [48] and RMSProp [49], to compute individual adaptive learning rates for different parameters. The learning rate was set to 5 × 10⁻⁴, the mini-batch size to 64, and AVRAE was trained for 1024 epochs.
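As a rough sketch, the reported hyperparameters translate into the training setup below. The AVRAE class, the avrae_loss function from the sketch in Section 4.1.3, and the loader of 100-step windows are assumptions for illustration, not released code:
    import torch

    model = AVRAE(input_dim=86, hidden_dim=1024)       # hypothetical class; HAI: d = 86, p = 1024
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

    for epoch in range(1024):                          # 1024 epochs
        for x in loader:                               # mini-batches of 64 windows, each 100 x d
            optimizer.zero_grad()
            x_hat, mu, sigma = model(x)
            loss = avrae_loss(x, x_hat, mu, sigma)     # reconstruction + KLD (see Section 4.1.3)
            loss.backward()
            optimizer.step()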

5.3. Assessment on Quality of Synthetic Anomaly

This section assesses the quality of the synthetic ICS time-series anomaly both visually and statistically. Visually inspecting the quality of generated time-series anomaly data is crucial because it allows evaluation of whether an anomaly detection model accurately captures real anomaly situations and whether the data patterns are realistic. This helps in intuitively understanding the detection model’s performance and identifying potential areas for improvement.
Figure 3 shows the ICS time-series anomalies generated by the proposed method. In the figure, the blue line represents the real data used as the source for AVRAE to generate anomalous data, and the red line represents the generated anomaly. Each subplot depicts the changes in the values of a variable over time in the dataset (the y-axis represents the variable’s value, and the x-axis represents the time sequence). The first row of Figure 3 represents point anomaly generation, the second row represents context anomaly generation, the third row represents collective anomaly generation with a constant value, the fourth row represents collective anomaly generation with a random value, and the fifth row represents collective anomaly generation with swap. The sections where the mutation operation is applied are marked with green circles.
The results of this experiment were extremely interesting. In the case of point anomaly generation, the anomalous values significantly differed from the surrounding context. Additionally, the direction of the outliers varied depending on the variables, which indicates that AVRAE learned the multivariate relationships of the HAI dataset.
The context anomaly generation creates anomalies by swapping the hidden states at two arbitrary timesteps in the embedding space. As a result, the range of anomaly values occurring early did not exceed the surrounding context. In fact, the anomaly values matched the values at the swapped location (i.e., the normal values of the later time). Conversely, the values at the later time remained unchanged, and the exact cause of this phenomenon has not been determined yet, although it is suspected to be due to the influence of the attention layer.
The collective anomaly generations with constant value and random value both exhibited similar effects. Anomalous data ignoring the pattern of the data were placed in a randomly selected interval. Lastly, the collective anomaly generation with swap selects two intervals of the same length at random and swaps their hidden states in the embedding space. Except for the anomaly occurring within the interval, the effect is similar to the context anomaly generation. In this method as well, the anomaly pattern occurring early matched the pattern at the swapped location, while the pattern at the later time remained unchanged. This phenomenon is also suspected to be due to the influence of the attention layer.
Figure 4 shows the results of ICS time-series anomaly generation using the SWaT dataset. Overall, results similar to the experiments using HAI were observed. However, for SWaT, the quality of the generated time-series was lower compared to HAI. This issue was analyzed as a limitation of AVRAE rather than a problem with the anomaly generation method. The SWaT dataset exhibited much more subtle variations in the values of each variable compared to HAI. This issue appears to have caused a decrease in the data reconstruction performance of the decoder in AVRAE.
Figure 5 visualizes the original source data given to AVRAE and the synthetic time-series anomalies in a lower dimension using principal component analysis (PCA) [50] and t-distributed stochastic neighbor embedding (t-SNE) [51]. Although PCA is not ideally suited for visualizing high-dimensional data, it can be used to examine the data's variance. On the other hand, t-SNE excels at projecting high-dimensional data into a low-dimensional space to reveal local similarities within the data. In the figure, the two subplots on the left represent the HAI dataset, while the two subplots on the right depict the SWaT dataset. In each subplot, blue dots represent the original source data, and red dots represent anomalies. Each dot corresponds to a 100 × d vector, i.e., one input to AVRAE's encoder. Additionally, the anomalies used in this figure were generated by the collective anomaly generation method, which produced the greatest variations compared to the original source data.
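Projections in the style of Figure 5 can be reproduced along the following lines with scikit-learn; X_norm and X_anom are placeholders for arrays holding one flattened 100 × d window per row:
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X = np.vstack([X_norm, X_anom])                    # sources first, anomalies second
    labels = np.r_[np.zeros(len(X_norm)), np.ones(len(X_anom))]
    pca_2d = PCA(n_components=2).fit_transform(X)      # variance-oriented view
    tsne_2d = TSNE(n_components=2, init='pca').fit_transform(X)  # local-similarity view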
In the HAI dataset, it was confirmed that the original data and the synthetic anomalies had similar distributions in both the PCA and t-SNE plots. This indicates that the synthetic anomalies share structural similarities and variance characteristics with the original data. However, it is evident that the distributions of the two datasets did not completely overlap. This suggests that new variance was introduced during the generation of synthetic anomalies through mutation.
This observation is even more pronounced in the SWaT dataset. The PCA plot shows that most of the data in the SWaT dataset has nearly similar variance characteristics. However, some synthetic anomalies appeared somewhat distinguishable from the original data. Additionally, in the t-SNE plot, the synthetic anomalies exhibited a distribution similar to the original data but were relatively more clustered. These two observations could imply that the anomaly generation method proposed in this study might be less effective for the SWaT dataset compared to the HAI dataset. We interpret this as the differences in the values of the features in the SWaT dataset being very subtle compared to the HAI dataset, which resulted in the mutation-induced changes in hidden states ultimately producing anomalies that differ from the original data.
Figure 6 shows the kernel density estimation (KDE) between the original source data and the synthetic anomaly used as input for AVRAE. In the figure, the first row displays the results for the HAI data, while the second row presents the results for the SWaT dataset. Additionally, the blue area represents the distribution of the original data, and the red area indicates the distribution of the synthetic anomaly. The synthetic anomaly used in Figure 6 was generated through collective anomaly generation, similar to Figure 5. As seen in the figure, the distributions of the original data and the anomaly are almost identical. Table 2 shows the Jensen–Shannon divergence (JSD) for each variable measured using KDE. The upper part of the table presents the results for the HAI dataset, and the lower part shows the results for the SWaT dataset. Furthermore, six variables with the largest JSD values were selected from each dataset.
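The per-variable JSD values in Table 2 can be computed roughly as follows with SciPy; the grid size and the function name are our choices, and note that SciPy's jensenshannon returns the JS distance, i.e., the square root of the divergence:
    import numpy as np
    from scipy.stats import gaussian_kde
    from scipy.spatial.distance import jensenshannon

    def kde_jsd(real, synth, n_grid=512):
        """Fit a Gaussian KDE to each 1-D series and measure the JSD between them."""
        grid = np.linspace(min(real.min(), synth.min()),
                           max(real.max(), synth.max()), n_grid)
        p = gaussian_kde(real)(grid)
        q = gaussian_kde(synth)(grid)
        return jensenshannon(p, q) ** 2    # square the distance to obtain the divergence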
As shown in the table, even the variable with the largest JSD value had a value close to 0. This indicates that the ICS time-series anomaly generated by the proposed method is not only visually similar to the actual anomaly (as shown in Figure 1) but also statistically does not deviate significantly from the original source data.
Nevertheless, it is worth noting from Figure 6 and Table 2 that there is a difference in similarity between the original source time-series and the anomaly depending on the features. For example, in Figure 6, the KDE between the source data and the synthetic anomaly for HAI’s P1_B2016 is almost similar, whereas a significant difference is observed in P1_B4022. This observation suggests that the level of realism and indistinguishability may vary depending on the features.
The answer to RQ1. Various experimental results presented in this section confirm that the ICS time-series anomalies generated by mutation-based anomaly generation are visually and statistically quite similar to actual anomalies. However, as shown in the experiments using the SWaT dataset, the proposed method somewhat depends on the reconstruction performance of AVRAE. Therefore, to improve the performance of the proposed method, it is necessary to enhance the model that maps the original data into the embedding space (such as AVRAE).

5.4. Comparison between One-Class and Binary Classification

The utility of techniques for artificially generating ICS time-series anomalies is clear. When given a dataset containing only a small amount of anomalous data, or even only normal data, augmenting it with synthetic anomalies can mitigate class imbalance to some extent. In the absence of such augmentation techniques, anomaly detection often has to rely on a one-class classification strategy. Therefore, in this section, synthetic ICS anomalies generated by the proposed method are used to augment the given ICS dataset, and binary classification models trained on it are compared with several one-class classification models.
Table 3 shows the anomaly detection performance of several one-class classification models (OC) and binary classification models (BIN) for the HAI and SWaT datasets. The one-class classification models adopted are one-class support vector machine (OCSVM), isolation forest (IF), local outlier factor (LOF), and LSTM-based autoencoder (LSTM-AE). The binary classification models adopted are kernel SVM (k-SVM), random forest (RF), k-nearest neighbor (KNN), and LSTM-based binary classifier (LSTM-BIN). The performance indicators used are accuracy (ACC), recall (REC), precision (PRE), F1 (F1 score), and ROC (ROC-AUC). These models are trained to take the ICS time series of length 100 as input and determine whether an anomaly is present. Specifically, any sequence containing non-normal data for at least one timestep is considered an anomaly.
Note that no model selection process was conducted to enhance performance, aiming to confirm the differences in performance among models based on the dataset differences. Additionally, the LSTM-AE and LSTM-BIN were designed to have as similar complexity as possible, specifically the same number of parameters. The LSTM-AE consists of an encoder and decoder, each composed of a single LSTM layer with a hidden state size of 256, with a dense layer added at the end of the decoder for reconstruction of the original input sequence. The LSTM-BIN consists of two LSTM layers with a hidden state size of 256, applying a dense layer to the average of the hidden states from the last LSTM layer to classify the given sequence. Both the LSTM-AE and LSTM-BIN were trained for 2048 epochs.
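For reference, the described LSTM-BIN corresponds to roughly the following sketch (the class name and the sigmoid output head are our assumptions):
    import torch
    import torch.nn as nn

    class LSTMBin(nn.Module):
        """Two LSTM layers (hidden size 256); a dense layer is applied to the
        mean of the last layer's hidden states to classify the sequence."""
        def __init__(self, input_dim, hidden_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)

        def forward(self, x):                   # x: (B, 100, d)
            out, _ = self.lstm(x)               # (B, 100, 256)
            logit = self.head(out.mean(dim=1))  # average over timesteps
            return torch.sigmoid(logit)         # probability the window is anomalous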
Last but not least, all one-class classification models were trained on a training set composed only of normal data, while the binary classification models were trained on a dataset where the original normal data, synthetic normal data (no mutation) by AVRAE, and synthetic anomalies were mixed in a 1:1:1 ratio. In other words, the OC models were trained on the standard dataset, while the BIN models were trained on the dataset augmented with synthetic anomalies. Additionally, the standard training sets for HAI and SWaT do not inherently include anomaly (attack) data. The performance of each model was measured using a test set that did not include any training data and did not contain the synthetic anomalies. Therefore, this experiment not only demonstrates the validity of the binary classifier in anomaly detection but also highlights the effectiveness of the synthetic anomaly.
The results of this experiment were quite intriguing. Accuracy was generally higher for the one-class classification models on HAI and for the binary classification models on SWaT. However, since both HAI and SWaT are imbalanced datasets that overwhelmingly contain normal data, accuracy is not a reliable performance metric for these models. On the other hand, for both HAI and SWaT, the model with the highest F1 score was LSTM-BIN. In fact, most models except LSTM-BIN exhibited higher recall than precision. This discrepancy is primarily due to the imbalance in the ICS datasets.
Generally, a high recall indicates that the model correctly identifies actual positive samples (anomalies), whereas low precision means that the proportion of actual positives among the samples predicted as positive by the model is low. This suggests that, given the relative scarcity of positive class (anomalies) compared to the negative class (normal data), many samples predicted as positive by the model are likely to be false positives.
Another possible interpretation is that the models, while relatively effectively identifying samples from the positive class, generate many false positives because the differentiation between the negative and positive classes is not pronounced. In other words, while the models successfully detect positive class samples, they frequently confuse them with the negative class, leading to lower precision.
Despite this, LSTM-BIN demonstrated a relatively high F1 score because its recall and precision were not significantly different from each other. This indicates that LSTM-BIN, by using data from both classes, learned an optimal decision boundary that better distinguishes between positive and negative classes. Additionally, LSTM-BIN, being based on a neural network, can learn complex patterns in the data, including non-linearities, allowing it to form a more sophisticated decision boundary compared to other binary classification models. In contrast, LSTM-AE, which is trained only on normal data, learns a less refined decision boundary compared to LSTM-BIN.
The answer to RQ2. Comparing the performance of one-class classification models and binary classification models using the pure ICS dataset and the anomaly-augmented ICS dataset revealed that the neural network-based binary classifier exhibited slightly higher performance. In other words, dataset augmentation through the generation of ICS time-series anomalies indeed aids in training sophisticated anomaly detection models.

6. Limitations

The mutation-based ICS time-series anomaly generation proposed in this paper was demonstrated to be effective, through various experiments, in terms of visual/statistical plausibility and dataset augmentation. The technique is novel, particularly in its application to time-series data, and can be used to address imbalance in time-series datasets beyond ICS. Despite these advantages, the proposed anomaly generation technique has several clear limitations.
  • Currently, this study is limited to anomaly generation for numerical time-series data. From a broader perspective, sequential data encompass various forms of data, including text. Although there is potential for this technique to be applied to other types of sequential data, further investigation is required.
  • The quality of the synthetic anomaly heavily depends on the performance of the generation model. In this study, AVRAE was adopted as the generation model. While AVRAE is a robust generation model, it still falls short of perfectly capturing the temporal dynamics inherent in time-series datasets.
  • The most significant limitation of this technique is its inability to generate scenario-based anomalies. The mutation operation fundamentally relies on randomness. In other words, the generated ICS anomalies do not reflect the causes and consequences of specific cyberattacks. Nevertheless, the proposed technique has sufficient utility because it generates anomalies that consider the correlations between variables in the data through AVRAE.

7. Conclusions

In this paper, a mutation-based ICS time-series anomaly generation method is proposed. This technique first utilizes the encoder of the powerful generation model, AVRAE, to map observations into latent space. Then, it applies carefully designed mutation operations to alter the latent representation. The decoder of AVRAE reconstructs the observations from the mutated latent representations. To design appropriate mutation operations, commonly used ICS datasets were analyzed to derive the types of anomalies they contain. Based on this analysis, mutation operations were proposed to generate point anomalies, contextual anomalies, and collective anomalies. Evaluations using the HAI and SWaT datasets confirmed that the proposed method could generate ICS time-series anomalies that are both visually and statistically plausible.
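As a conceptual recap, the generation pipeline can be sketched as follows. The avrae.encoder/avrae.decoder calls and the span_noise operator are placeholders: the actual method uses AVRAE's encoder/decoder and the dedicated mutation operations for point, contextual, and collective anomalies.

```python
# Conceptual sketch of latent-space mutation (placeholders noted above).
import torch

def generate_anomaly(avrae, x, mutate):
    """x: (batch, seq_len, d) normal window; mutate: latent-space operator."""
    with torch.no_grad():
        z = avrae.encoder(x)          # map observations into latent space
        z_mut = mutate(z)             # alter the latent representation
        return avrae.decoder(z_mut)   # reconstruct a plausible anomaly

def span_noise(z, sigma=0.5):
    """One hypothetical mutation: perturb a contiguous latent span,
    loosely corresponding to a collective anomaly."""
    z = z.clone()
    t0 = int(torch.randint(0, z.size(1) // 2, (1,)))
    t1 = t0 + z.size(1) // 4
    z[:, t0:t1, :] += sigma * torch.randn_like(z[:, t0:t1, :])
    return z
```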
Furthermore, by using the augmented ICS dataset with synthetic anomalies, various one-class classification models and binary classification models were compared. The results empirically demonstrated that a robust binary classifier could be trained using the augmented dataset. This study presents an effective means to augment ICS datasets, which are often plagued by imbalance issues. The proposed technique is transferable to other time-series datasets with similar characteristics, although its application to sequential datasets of a different nature, such as text, requires further research. Additionally, the method’s dependence on the performance of the underlying generation model and its inability to generate scenario-based anomalies were identified as limitations. Consequently, future research will naturally extend to applying the proposed method to various datasets and exploring more powerful anomaly detection models.

Author Contributions

Conceptualization, S.J.; methodology, S.J.; software, S.J.; validation, K.K.; formal analysis, S.J.; investigation, K.K.; resources, J.T.S.; data curation, D.M.; writing—original draft preparation, S.J.; writing—review and editing, J.T.S.; visualization, S.J.; supervision, J.T.S.; project administration, D.M.; funding acquisition, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) (No. 2022-0-00961).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ICS	Industrial control systems
i.i.d	Independently and identically distributed
ELBO	Evidence lower bound
GPMM	Generalized probabilistic monitoring model
EM	Expectation-maximization
SDDA	Sparse distribution dissimilarity analytics
VAE	Variational autoencoder
RNN	Recurrent neural network
LSTM	Long short-term memory
SVM	Support vector machine
IoT	Internet of Things
GRU	Gated recurrent unit
DTW	Dynamic time warping
DBA	DTW barycenter averaging
DNN	Deep neural network
GAN	Generative adversarial networks
ODE	Ordinary differential equations
MGAG	Mask-guided anomaly generation
DAS	Discriminative abnormality separation
HAI	HIL-based augmented ICS
SWaT	Secure water treatment
UF	Ultrafiltration
RO	Reverse osmosis
AVRAE	Attention-based variational recurrent autoencoder
KLD	Kullback–Leibler divergence
RQ	Research questions
HIL	Hardware-in-the-loop
SCADA	Supervisory control and data acquisition
PCA	Principal component analysis
t-SNE	t-distributed stochastic neighbor embedding
KDE	Kernel density estimation
JSD	Jensen–Shannon divergence

References

  1. Huang, D.; Mu, D.; Yang, L.; Cai, X. CoDetect: Financial Fraud Detection with Anomaly Feature Detection. IEEE Access 2018, 6, 19161–19174. [Google Scholar] [CrossRef]
  2. Kravchik, M.; Shabtai, A. Efficient Cyber Attack Detection in Industrial Control Systems Using Lightweight Neural Networks and PCA. IEEE Trans. Dependable Secur. Comput. 2022, 19, 2179–2197. [Google Scholar] [CrossRef]
  3. Elsayed, M.S.; Le-Khac, N.A.; Dev, S.; Jurcut, A.D. Network Anomaly Detection Using LSTM Based Autoencoder. In Proceedings of the ACM Symposium on QoS and Security for Wireless and Mobile Networks, Alicante, Spain, 16–20 November 2020; pp. 37–45. [Google Scholar] [CrossRef]
  4. Shen, L.; Li, Z.; Kwok, J.T. Timeseries anomaly detection using temporal hierarchical one-class network. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 2020. [Google Scholar]
  5. Xu, H.; Wang, Y.; Jian, S.; Liao, Q.; Wang, Y.; Pang, G. Calibrated One-class Classification for Unsupervised Time Series Anomaly Detection. IEEE Trans. Knowl. Data Eng. 2024, 1–14. [Google Scholar] [CrossRef]
  6. Ghrib, Z.; Jaziri, R.; Romdhane, R. Hybrid approach for Anomaly Detection in Time Series Data. In Proceedings of the International Joint Conference on Neural Networks, Glasgow, UK, 19–24 July 2020. [Google Scholar] [CrossRef]
  7. Primus, P.; Haunschmid, V.; Praher, P.; Widmer, G. Anomalous Sound Detection as a Simple Binary Classification Problem with Careful Selection of Proxy Outlier Examples. arXiv 2020, arXiv:2011.02949. [Google Scholar] [CrossRef]
  8. Luca, S.; Clifton, D.A.; Vanrumste, B. One-class classification of point patterns of extremes. J. Mach. Learn. Res. 2016, 17, 1–21. [Google Scholar]
  9. Lee, J.H.; Ji, I.H.; Jeon, S.H.; Seo, J.T. Generating ICS Anomaly Data Reflecting Cyber-Attack Based on Systematic Sampling and Linear Regression. Sensors 2023, 23, 9855. [Google Scholar] [CrossRef] [PubMed]
  10. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
  11. Zalewski, M. American Fuzzy Lop. 2017. Available online: http://lcamtuf.coredump.cx/afl (accessed on 31 August 2024).
  12. Fioraldi, A.; Maier, D.; Eißfeldt, H.; Heuse, M. AFL++: Combining incremental steps of fuzzing research. In Proceedings of the WOOT 2020—14th USENIX Workshop on Offensive Technologies, Online, 11 August 2020. [Google Scholar]
  13. Yu, W.; Wu, M.; Huang, B.; Lu, C. A generalized probabilistic monitoring model with both random and sequential data. Automatica 2022, 144, 110468. [Google Scholar] [CrossRef]
  14. Yu, W.; Zhao, C.; Huang, B.; Xie, M. An Unsupervised Fault Detection and Diagnosis with Distribution Dissimilarity and Lasso Penalty. IEEE Trans. Control. Syst. Technol. 2024, 32, 767–779. [Google Scholar] [CrossRef]
  15. Yu, W.; Zhao, C.; Huang, B. MoniNet With Concurrent Analytics of Temporal and Spatial Information for Fault Detection in Industrial Processes. IEEE Trans. Cybern. 2022, 52, 8340–8351. [Google Scholar] [CrossRef] [PubMed]
  16. Mauceri, S.; Sweeney, J.; McDermott, J. Dissimilarity-based representations for one-class classification on time series. Pattern Recognit. 2020, 100, 107122. [Google Scholar] [CrossRef]
  17. Gjorgiev, L.; Gievska, S. Time Series Anomaly Detection with Variational Autoencoder Using Mahalanobis Distance. In Proceedings of the Communications in Computer and Information Science, Virtual Event, 8–10 July 2020; Volume 1316. [Google Scholar] [CrossRef]
  18. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  19. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  21. Ullah, I.; Mahmoud, Q.H. Design and Development of RNN Anomaly Detection Model for IoT Networks. IEEE Access 2022, 10, 62722–62750. [Google Scholar] [CrossRef]
  22. Cho, K.; Merriënboer, B.V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the EMNLP 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014. [Google Scholar] [CrossRef]
  23. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the Lecture Notes in Computer Science, Istanbul, Turkey, 26–28 October 2005; Volume 3644. [Google Scholar] [CrossRef]
  24. Gundersen, K.; Alendal, G.; Oleynik, A.; Blaser, N. Binary time series classification with bayesian convolutional neural networks when monitoring for marine gas discharges. Algorithms 2020, 13, 145. [Google Scholar] [CrossRef]
  25. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York, NY, USA, 19–24 June 2016; Volume 3. [Google Scholar]
  26. Liu, F.; Zhou, X.; Cao, J.; Wang, Z.; Wang, T.; Wang, H.; Zhang, Y. Anomaly Detection in Quasi-Periodic Time Series Based on Automatic Data Segmentation and Attentional LSTM-CNN. IEEE Trans. Knowl. Data Eng. 2022, 34, 2626–2640. [Google Scholar] [CrossRef]
  27. Forestier, G.; Petitjean, F.; Dau, H.A.; Webb, G.I.; Keogh, E. Generating synthetic time series to augment sparse datasets. In Proceedings of the IEEE International Conference on Data Mining, ICDM, New Orleans, LA, USA, 18–21 November 2017; Volume 2017. [Google Scholar] [CrossRef]
  28. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  29. Zhang, C.; Kuppannagari, S.R.; Kannan, R.; Prasanna, V.K. Generative Adversarial Network for Synthetic Time Series Data Generation in Smart Grids. In Proceedings of the 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, SmartGridComm 2018, Aalborg, Denmark, 29–31 October 2018. [Google Scholar] [CrossRef]
  30. Zhou, L.; Poli, M.; Xu, W.; Massaroli, S.; Ermon, S. Deep Latent State Space Models for Time-Series Generation. In Proceedings of the Machine Learning Research, Seattle, WA, USA, 30 November–1 December 2023; Volume 202. [Google Scholar]
  31. Chen, Z.; Duan, J.; Kang, L.; Qiu, G. Supervised Anomaly Detection via Conditional Generative Adversarial Network and Ensemble Active Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7781–7798. [Google Scholar] [CrossRef]
  32. Salem, M.; Taheri, S.; Yuan, J.S. Anomaly Generation Using Generative Adversarial Networks in Host-Based Intrusion Detection. In Proceedings of the 2018 9th IEEE Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, UEMCON 2018, New York, NY, USA, 8–10 November 2018. [Google Scholar] [CrossRef]
  33. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; Volume 2017. [Google Scholar] [CrossRef]
  34. Pourreza, M.; Mohammadi, B.; Khaki, M.; Bouindour, S.; Snoussi, H.; Sabokrou, M. G2D: Generate to detect anomaly. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, 5–9 January 2021. [Google Scholar] [CrossRef]
  35. Shen, H.; Wei, B.; Ma, Y.; Gu, X. Unsupervised industrial image ensemble anomaly detection based on object pseudo-anomaly generation and normal image feature combination enhancement. Comput. Ind. Eng. 2023, 182, 109337. [Google Scholar] [CrossRef]
  36. Lin, Y.; Deng, H.; Li, X. FastLogAD: Log Anomaly Detection with Mask-Guided Pseudo Anomaly Generation and Discrimination. arXiv 2024, arXiv:2404.08750. [Google Scholar]
  37. Hu, T.; Zhang, J.; Yi, R.; Du, Y.; Chen, X.; Liu, L.; Wang, Y.; Wang, C. AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 8526–8534. [Google Scholar] [CrossRef]
  38. Choi, K.; Yi, J.; Park, C.; Yoon, S. Deep Learning for Anomaly Detection in Time-Series Data: Review, Analysis, and Guidelines. IEEE Access 2021, 9, 120043–120065. [Google Scholar] [CrossRef]
  39. Tang, Z.; Chen, Z.; Bao, Y.; Li, H. Convolutional neural network-based data anomaly detection method using multiple information for structural health monitoring. Struct. Control Health Monit. 2019, 26, e2296. [Google Scholar] [CrossRef]
  40. Bao, Y.; Tang, Z.; Li, H.; Zhang, Y. Computer vision and deep learning–based data anomaly detection method for structural health monitoring. Struct. Health Monit. 2019, 18, 401–421. [Google Scholar] [CrossRef]
  41. Boniol, P.; Paparrizos, J.; Palpanas, T. New Trends in Time-Series Anomaly Detection. In Proceedings of the Advances in Database Technology-EDBT, Ioannina, Greece, 28–31 March 2023; Volume 26. [Google Scholar] [CrossRef]
  42. Mathur, A.P.; Tippenhauer, N.O. SWaT: A water treatment testbed for research and training on ICS security. In Proceedings of the 2016 International Workshop on Cyber-physical Systems for Smart Water Networks, CySWater 2016, Vienna, Austria, 11 April 2016. [Google Scholar] [CrossRef]
  43. Shin, H.K.; Lee, W.; Yun, J.H.; Kim, H.C. HAI 1.0: HIL-based augmented ICS security dataset. In Proceedings of the CSET 2020-13th USENIX Workshop on Cyber Security Experimentation and Test, Online, 10 August 2020. [Google Scholar]
  44. Jeon, S.; Seo, J.T. A Synthetic Time-Series Generation Using a Variational Recurrent Autoencoder with an Attention Mechanism in an Industrial Control System. Sensors 2023, 24, 128. [Google Scholar] [CrossRef]
  45. Fabius, O.; van Amersfoort, J.R. Variational recurrent auto-encoders. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  46. Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A.; Bengio, Y. A recurrent latent variable model for sequential data. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 2015. [Google Scholar]
  47. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  48. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  49. Hinton, G.E.; Srivastava, N.; Swersky, K. Neural Networks for Machine Learning Lecture 6a Overview of Mini-Batch Gradient Descent. 2012. Available online: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf (accessed on 31 August 2024).
  50. Bro, R.; Smilde, A.K. Principal Component Analysis. 2014. Available online: https://doi.org/10.1039/c3ay41907j (accessed on 31 August 2024).
  51. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. Available online: https://jmlr.org/papers/v9/vandermaaten08a.html (accessed on 31 August 2024). [Google Scholar]
Figure 1. Anomaly patterns in ICS datasets. The leftmost two columns: HAI dataset. The rightmost two columns: SWaT dataset. The title of the subplot consists of the dataset and variable names.
Figure 2. Overview of mutation-based ICS time-series anomaly generation.
Figure 3. Visual comparison of the original source data for AVRAE and the synthetic anomaly on HAI.
Figure 4. Visual comparison of the original source data for AVRAE and the synthetic anomaly on SWaT.
Figure 5. Visual comparison between the original source data and the synthetic anomaly using PCA and t-SNE.
Figure 6. Kernel density estimation between the original source data and the synthetic anomaly.
Table 1. The architecture of AVRAE.

Layer     | Encoder                        | Decoder
          | Type   Input     Output        | Type             Input              Output
Layer 1   | LSTM   100 × d   100 × p       | LSTM             100 × p            100 × p
Layer 2   | LSTM   100 × p   100 × p       | LSTM             100 × p            100 × p
Layer 3   | –                              | Cross-Attention  100 × p, 100 × p   100 × p
Layer 4   | –                              | Self-Attention   100 × p, 100 × p   100 × p
Layer 5   | –                              | Dense            100 × p            100 × d
Table 2. Jensen–Shannon divergence between the original source data and the synthetic anomaly.

Variable (HAI)    P1_B4022   P1_PP04   P2_SIT01   P1_PIT02   P1_FCV03D   P1_FCV01Z
JSD               0.13       0.075     0.065      0.037      0.022       0.02

Variable (SWaT)   FIT503     PIT503    PIT501     FIT501     FIT502      LIT401
JSD               0.138      0.08      0.077      0.034      0.015       0.004
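The per-variable values in Table 2 can be computed, for example, by estimating each variable's density with Gaussian KDE and comparing the densities with the Jensen–Shannon divergence. The sketch below assumes 1-D NumPy columns and illustrative names, and squares SciPy's Jensen–Shannon distance to obtain the divergence.

```python
# Sketch of a per-variable JSD between real and synthetic data via KDE.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import jensenshannon

def jsd_per_variable(real_col, synth_col, grid_size=512):
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    grid = np.linspace(lo, hi, grid_size)
    p = gaussian_kde(real_col)(grid)   # estimated density of the real data
    q = gaussian_kde(synth_col)(grid)  # estimated density of the synthetic data
    return jensenshannon(p, q, base=2) ** 2  # divergence in [0, 1]
```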
Table 3. Performance comparison between one-class and binary classification.

Type   Model      | HAI                                | SWaT
                  | ACC    REC    PRE    F1     ROC    | ACC    REC    PRE    F1     ROC
OC     OCSVM      | 0.77   0.5    0.04   0.07   0.32   | 0.61   0.83   0.22   0.35   0.14
       IF         | 0.94   0.25   0.11   0.14   0.36   | 0.13   1.0    0.13   0.23   0.18
       LOF        | 0.97   0.31   0.29   0.3    0.35   | 0.97   0.31   0.28   0.29   0.35
       LSTM-AE    | 0.93   0.22   0.31   0.26   0.69   | 0.94   0.55   0.99   0.71   0.8
BIN    k-SVM      | 0.37   0.79   0.06   0.11   0.54   | 0.86   0.71   0.47   0.57   0.85
       RF         | 0.05   1.0    0.05   0.1    0.47   | 0.62   0.74   0.22   0.34   0.78
       KNN        | 0.67   0.35   0.05   0.1    0.53   | 0.85   0.1    0.24   0.1    0.72
       LSTM-BIN   | 0.94   0.56   0.44   0.49   0.73   | 0.95   0.81   0.78   0.79   0.82
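The metrics reported in Table 3 can be reproduced with standard scikit-learn calls, as in the sketch below; y_true and y_score are placeholders for the test labels and the model's anomaly scores, and the 0.5 threshold is an assumption.

```python
# Sketch of the Table 3 metrics; names and threshold are placeholders.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    y_pred = (y_score >= threshold).astype(int)
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "REC": recall_score(y_true, y_pred),
        "PRE": precision_score(y_true, y_pred),
        "F1":  f1_score(y_true, y_pred),
        "ROC": roc_auc_score(y_true, y_score),  # threshold-free ROC AUC
    }
```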
