1. Introduction
Anomaly detection aims to identify data points or segments that do not conform to expected behavior patterns in rapidly changing data. Depending on the application field, these non-conforming patterns are also called anomalies, outliers, discordant observations, or exceptions. Anomaly detection is widely used in commercial applications across major industries, such as power, finance, network security, industrial systems, system health monitoring, e-commerce, and ecological disaster monitoring [1]. Moreover, anomaly detection plays an elementary and important role in Artificial Intelligence for IT Operations (AIOps) systems [2], where it provides the basis for decision making in subsequent alarming, automatic stop loss, and root cause analysis.
Time series data are the typical data type in IT operations systems, especially in the scenario of anomaly detection in AIOps [3]. Anomalies in time series data can be divided into three categories [4]: point anomalies, contextual anomalies, and collective anomalies. Due to the multivariate heterogeneity of operation and maintenance data, it is challenging to improve detection accuracy by automatically analyzing and summarizing abnormal patterns in the data. Among existing methods, traditional threshold-based statistical methods must assume that the data follow a certain form of distribution and cannot cope well with increasingly dynamic and diverse data. In addition, statistical methods have difficulty identifying anomalies in time series since they have limited ability to extract contextual information from the data [5]. Due to the increasing volume and complexity of data streams, researchers have turned to machine learning methods to process large amounts of data. However, data annotation and abnormal data collection are time-consuming, and unsupervised machine learning methods for anomaly detection offer a way to avoid this heavy manual cost. Most unsupervised anomaly detection methods [6,7] monitor time series data by predicting or reconstructing the series and calculating the deviation between the true value and the predicted value. Our model also detects anomalies by reconstruction and comparison and can better learn the distribution of normal data; in addition, we use a discriminator to identify anomalies directly.
Deep learning methods for anomaly detection can automatically learn the complex correlations of time series without complicated feature engineering, so they are commonly used for anomaly detection on time series data. Generative Adversarial Networks (GANs) [8] are a typical deep learning model that has achieved great success in image processing tasks, and GANs have also proven very successful in anomaly detection [9]. However, with their existing generators and discriminators, classical GANs are weak at capturing the complex contextual features of time series data. Deep learning models for sequential data, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs) [10], can be used in the generators and discriminators to capture the implicit relationships of time series data, but they cannot work in a parallel way because of their sequential dependence. Transformer-based models [11] have superior performance in time series forecasting, since they can learn complex patterns from time series data in parallel using the self-attention mechanism.
Hence, in this work, we propose a novel model, Transformer-based GAN for Anomaly Detection of time series data (TGAN-AD), which learns the anomaly patterns of time series data with Transformer in an adversarial way. Our method involves three main components: (1) a generator component, which uses the Transformer to simulate the normal patterns of time series data; (2) a discriminator component, which captures the intrinsic characteristics of the real time series data to learn the boundary between normal patterns and anomalous patterns; and (3) an anomaly detection component, which identifies anomalous data with the trained generator and discriminator.
The main contributions of our work are summarized as follows:
We propose a new model, TGAN-AD, that incorporates Transformer into the GAN framework. TGAN-AD makes use of Transformer to capture the contextual information of the time series data for the subsequent GAN framework.
We use three public datasets and six baseline methods to comprehensively evaluate the anomaly detection performance of TGAN-AD. Compared with the baseline methods, TGAN-AD shows the best performance. We further provide some insights on the use of GANs for anomaly detection on time series data.
2. Related Work
In most practical scenarios, labels for anomaly detection are missing, or the labels for anomalous and normal data are unreliable. A model cannot function properly without accurately labeled data, and high-quality annotation is time-consuming. This intractable situation prevents us from using supervised machine learning methods, and unsupervised anomaly detection provides a way to alleviate these problems. Unsupervised anomaly detection methods can be divided into three categories: statistic-based methods, clustering-based methods, and deep learning methods.
2.1. Statistic-Based Methods
Statistical anomaly detection techniques are based on the following key assumption: normal data instances occur in high-probability regions of a stochastic model, while anomalies occur in its low-probability regions [12]. Based on the assumed distribution, statistical methods can be further classified as follows:
Gaussian-based models: These assume that the data are generated by a Gaussian distribution. The distance between a data instance and the expected value of the distribution is the anomaly score of that instance, and a threshold over the anomaly score discriminates between normal and anomalous data. For example, a simple anomaly detection technique, the three-sigma rule of thumb, declares as anomalous all data instances that are more than $3\sigma$ away from the distribution mean $\mu$, where $\sigma$ is the standard deviation of the distribution. The $\mu \pm 3\sigma$ region contains 99.7% of the data instances. This technique is mostly applicable to univariate and multivariate continuous data. A box plot is a way of summarizing data measured on an interval scale and is often used for exploratory data analysis; the box plot rule has been applied to detect outliers in univariate and multivariate medical data by Laurikkala et al. [13]. However, these methods are mostly used for univariate time series and can generally only handle local outliers with simple distributions and large anomalous deviations.
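For illustration, the three-sigma rule can be written in a few lines of NumPy (a minimal sketch, not from the paper; the factor of 3 and the synthetic data are illustrative):

```python
import numpy as np

def three_sigma_anomalies(x, k=3.0):
    """Return a boolean mask of points more than k standard deviations from the mean."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > k * sigma

# Usage: a mostly Gaussian series with one injected spike.
rng = np.random.default_rng(0)
series = rng.normal(loc=0.0, scale=1.0, size=1000)
series[500] = 8.0                      # injected point anomaly
print(np.flatnonzero(three_sigma_anomalies(series)))  # e.g., [500]
```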
Regression-based models: There are two main steps in this anomaly detection technique: first, a regression model is fitted to the data; then, for each test instance, the residual is used to determine the anomaly score. For example, Eduardo et al. [14] proposed a method for traffic characterization and detection of traffic anomalies using sFlow analysis, incorporating two different models, the Autoregressive Integrated Moving Average (ARIMA) model and Holt-Winters, into a behavior-based system. Another variant, which detects anomalies in multivariate time series data generated by an Autoregressive Moving Average (ARMA) model, was proposed by Galeano et al. [15]. In this technique, the authors transform the multivariate time series into a univariate time series by linearly combining its components.
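A minimal sketch of the residual idea with statsmodels' ARIMA (the model order (1, 1, 1) and the z-score threshold are our illustrative assumptions, not choices from [14,15]):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_residual_anomalies(series, order=(1, 1, 1), k=3.0):
    """Fit an ARIMA model and flag points whose residuals exceed k residual std devs."""
    fitted = ARIMA(series, order=order).fit()
    resid = fitted.resid
    return np.abs(resid) > k * resid.std()

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=500))    # a random-walk-like series
y[250] += 15.0                         # injected point anomaly (spike)
print(np.flatnonzero(arima_residual_anomalies(y)))
```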
The disadvantage of statistical techniques is that they rely on the assumption that the data are generated from a particular distribution. This assumption often does not hold, especially for high-dimensional real-world datasets.
2.2. Clustering-Based Methods
Clustering is mainly an unsupervised technique. Although clustering and anomaly detection seem fundamentally different, several clustering-based anomaly detection techniques have been developed. These methods rest on the following assumption: normal data instances lie close to their nearest cluster centroid, while anomalous instances lie far from it. After clustering the data, the distance of each data instance from its nearest cluster centroid is taken as its anomaly score. For example, Smith et al. [16] applied Self-Organizing Maps (SOM), K-means clustering, and Expectation Maximization to cluster training data and then used the clusters to classify test data.
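The centroid-distance scheme can be sketched with scikit-learn's KMeans (the cluster count and the threshold quantile are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomaly_scores(X, n_clusters=5, random_state=0):
    """Score each row of X by its distance to the nearest k-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # transform() returns distances to every centroid; take the smallest per point.
    return km.transform(X).min(axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 8))
X[0] += 10.0                            # injected outlier
scores = kmeans_anomaly_scores(X)
threshold = np.quantile(scores, 0.99)   # flag the top 1% as anomalous
print(np.flatnonzero(scores > threshold))
```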
However, if the anomalies in the data themselves form a cluster, the above methods will fail to detect them. To solve this problem, density-based anomaly detection methods have been proposed. For example, Breunig et al. [17] assigned each data instance an anomaly score known as the Local Outlier Factor (LOF). For any given data instance, the LOF score is the ratio of the average local density of the instance's k nearest neighbors to the local density of the instance itself, so anomalous instances obtain higher LOF scores. Mahoney et al. [18] proposed the CLAD algorithm, which obtains the cluster width from the data by randomly sampling and calculating the average distance between the nearest points. Clusters whose density is lower than a threshold are declared "local" outliers, while clusters that are far away from other clusters are declared "global" outliers. He et al. [19] proposed the FindCBLOF algorithm, which assigns each data instance an anomaly score called the cluster-based local outlier factor (CBLOF). The CBLOF score captures the size of the cluster to which the data instance belongs and the distance between the data instance and its cluster centroid.
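A minimal LOF sketch using scikit-learn (the neighbor count is an illustrative assumption): fit_predict returns -1 for outliers, and negative_outlier_factor_ exposes the negated LOF scores.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X[:3] += 8.0                              # a small, shifted group of outliers

lof = LocalOutlierFactor(n_neighbors=20)  # k nearest neighbors for density estimation
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
print(np.flatnonzero(labels == -1))       # likely the three shifted points
print(-lof.negative_outlier_factor_[:3])  # LOF scores (larger = more anomalous)
```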
Such techniques can often be adapted to other complex data types by simply plugging in a clustering algorithm that can handle the particular data type. However, these methods are unable to capture temporal correlations.
2.3. Deep Learning Methods
Among unsupervised approaches, deep-learning-based methods can better represent the hidden information in a dataset and achieve better anomaly detection results, which has attracted more researchers to this topic [20,21]. For example, the most commonly used deep learning method for anomaly detection is the AutoEncoder [22]. However, this model has no strong regularization, which makes it easy to overfit: when there are many abnormal points, it will learn the abnormal patterns. An optimized variational autoencoder (VAE) [23] was proposed by An et al. Unlike the AutoEncoder, the VAE learns the distribution of the hidden variables generated from the data, which acts similarly to regularization and prevents overfitting. Both use the reconstruction error between the encoder and the decoder for anomaly detection. However, they still cannot capture the temporal correlations and some hidden behaviors of multivariate time series. Our model uses the more powerful GAN-based model as the reconstruction framework.
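To make the reconstruction-error idea concrete, the following is a minimal PyTorch autoencoder sketch (layer sizes, training data, and the threshold quantile are illustrative assumptions, not the cited models): after training on normal data, samples with large reconstruction error are flagged.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=8, latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X_train = torch.randn(2048, 8)              # stand-in for normal training data
for _ in range(200):                        # train to reconstruct normal data
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_train), X_train)
    loss.backward()
    opt.step()

X_test = torch.randn(100, 8)
with torch.no_grad():
    err = ((model(X_test) - X_test) ** 2).mean(dim=1)  # per-sample reconstruction error
flags = err > err.quantile(0.99)            # largest errors flagged as anomalies
```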
Among deep-learning-based methods, generative adversarial networks have great application potential and have shown excellent results on images, text, and time series. Several GAN-based time series anomaly detection methods have been studied, such as MAD-GAN, TAnoGAN, and TadGAN [24,25,26]. All three are reconstruction-based deep learning methods; their core idea is to learn a model that can encode data points and then decode the encoded vector (that is, reconstruct the sequence over a period of time). A well-trained model cannot reconstruct anomalies, as anomalies lose information in the encoding process due to their low frequency of occurrence. Moreover, to capture the temporal correlations and hidden timing behaviors in the time series, MAD-GAN, TAnoGAN, and TadGAN all use LSTM, which cannot run in a parallel way, as the generator and discriminator model. Our model enhances the capability of GANs to learn sequence-to-context correlations more efficiently.
Recently, Transformer [11] has been successfully applied in the NLP field, and its major success there demonstrates its powerful modeling capability for sequential data. Several works have built on Transformer for time series data. Shaw et al. [27] proposed relative position encoding so that Transformer can adapt to sequences of different lengths. Dai et al. [28] proposed Transformer-XL, introducing a segment-level recurrence mechanism that establishes connections between text segments so that the model can capture longer-range dependencies. Dehghani et al. [29] proposed the Universal Transformer, which introduces time-step and position encodings and replaces the feed-forward layer with a transition function, improving the versatility of the Transformer. Wu et al. [30] proposed using a self-attention mechanism to learn complex patterns and dynamics from time series data, applicable to both univariate and multivariate time series; our model learns in a way very close to this method. Wu et al. [31] proposed a new time series prediction model, the Adversarial Sparse Transformer (AST), which uses a Sparse Transformer as a generator to learn sparse attention maps for time series prediction and uses a discriminator to improve prediction performance at the sequence level; our model also uses a discriminator to aid anomaly detection. Zhou et al. [32] designed an efficient Transformer-based structure suitable for long sequence time series forecasting (LSTF). Li et al. [33] proposed the LogSparse self-attention structure to reduce the computational cost to $O(L(\log L)^2)$.
In short, Transformer has shown clear advantages in tasks such as time series forecasting, which provides a practical basis for incorporating Transformer into anomaly detection in our work.
4. Proposed Model
4.1. Problem Definition
Anomaly detection of time series data is the process of identifying abnormal events or behaviors in otherwise normal time series. The time series data are divided into training data $X_{train} \in \mathbb{R}^{n \times m}$ and test data $X_{test} \in \mathbb{R}^{k \times m}$, where $n$ and $k$ are the maximum lengths of the timestamps and $m$ is the feature dimension of each time series. Our goal is to build an unsupervised model that can efficiently capture the non-linear patterns and multivariate distribution of multivariate time series, accurately discriminate the abnormal data in the test data $X_{test}$, and generate a probability vector $Y = (y_1, \dots, y_k)$, where $y_i \in \{0, 1\}$ indicates whether the $i$-th timestamp is abnormal or not. Our model is based on a generative adversarial network, which conducts anomaly detection with its discriminator and generator, and Transformer [11], which uses a multi-head attention mechanism to capture feature correlations and time dependencies in the time series.
4.2. TGAN-AD Architecture
As shown in Figure 3, TGAN-AD uses the Transformer as both the generator and the discriminator of the GAN framework to capture the temporal correlations and other hidden behaviors in multivariate time series. To improve the efficiency of our framework, the multivariate time series data are divided into sub-sequences using sliding windows during pre-processing, as shown in Figure 4, and the window size $w$ is set empirically.
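The pre-processing step can be sketched in NumPy as follows (a minimal sketch; the stride of 1 and the example shapes are illustrative assumptions):

```python
import numpy as np

def sliding_windows(series, w, stride=1):
    """Split a (T, m) multivariate series into (N, w, m) overlapping sub-sequences."""
    T = series.shape[0]
    starts = range(0, T - w + 1, stride)
    return np.stack([series[s:s + w] for s in starts])

X = np.random.randn(1000, 51)        # e.g., 51 sensor channels as in SWaT
subs = sliding_windows(X, w=60)      # window size w set empirically
print(subs.shape)                    # (941, 60, 51)
```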
The real time series $X$ and random hidden variables $Z$ are both input into the generator ($G$) to generate a fake time series $X'$: $X' = G(X, Z)$. Then the fake time series $X'$ and the real time series $X$ are input into the discriminator ($D$), where the discriminator learns parameters to distinguish real and fake time series data. The parameters of $G$ and $D$ are both updated according to the output of $D$: $G$ generates fake samples approaching the existing normal samples $X$, while $D$ improves its discrimination ability to distinguish fake time series $X'$ from normal time series $X$. Through iterative training, $D$ becomes able to accurately distinguish the normal sequence $X$ from the fake sequence $X'$, and $G$ captures the hidden temporal correlations in the normal time series $X$ and generates fake time series that can deceive $D$.
After offline training, all the parameters of $G$ and $D$ are fixed. The test time series $X_{test}$ is encoded into the hidden space, and the optimal latent vector $Z'$ is found by gradient descent and input into $G$ to reconstruct the test sample: $X'_{test} = G(Z')$. The reconstruction loss $L_R$ is calculated from the deviation between $X_{test}$ and $X'_{test}$. Meanwhile, the test time series is input into the trained $D$ to calculate the discrimination loss $L_D$. Finally, the anomaly score (AD-Score) of the test data is calculated from the two losses: $S = \lambda L_R + (1 - \lambda) L_D$. The anomaly state of the test data is then determined according to the anomaly score.
4.3. Transformer Components for TGAN-AD
The generator and discriminator of our model are trained in an adversarial way, and the core part is the encoder–decoder module based on Transformer. TGAN-AD contains two core components:
4.3.1. Generator Training Process
As shown in Figure 5a, $X$ is input into the Transformer encoder to learn the hidden representation of the normal time series, which helps the generator produce data approaching the real time series. The hidden representation of $X$ is fed into the Transformer decoder. To generate rich similar samples, the hidden space $Z$ is also input into the Transformer decoder, and the fake samples are generated from the hidden representation of $X$ and $Z$ with the Transformer.
The goal of the generator is to continuously train the Transformer to generate time series data approaching the real time series data, making it difficult for the discriminator to distinguish the generated data from the training data.
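The following PyTorch sketch illustrates one plausible form of such a Transformer-based generator (all module sizes and the exact way the encoding of $X$ conditions the decoding of $Z$ are our assumptions for illustration; the paper does not publish its implementation):

```python
import torch
import torch.nn as nn

class TransformerGenerator(nn.Module):
    """Encode the real window X, then decode random latents Z into a fake window."""
    def __init__(self, n_features, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        self.embed_x = nn.Linear(n_features, d_model)
        self.embed_z = nn.Linear(n_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, n_features)

    def forward(self, x, z):
        memory = self.encoder(self.embed_x(x))     # hidden representation of X
        h = self.decoder(self.embed_z(z), memory)  # decode Z conditioned on X
        return self.out(h)                         # fake window X'

G = TransformerGenerator(n_features=51)
x = torch.randn(32, 60, 51)   # batch of real sub-sequences (B, w, m)
z = torch.randn(32, 60, 51)   # random latent sequences of the same shape
x_fake = G(x, z)              # (32, 60, 51)
```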
4.3.2. Transformer-Based Discriminator
As shown in Figure 5b, a sample, drawn from the training data or the generated data, is input into the Transformer encoder to obtain its hidden representation; the sample and its representation are then sent to the decoder, and the Transformer decoder outputs the class distribution of the sample.
The goal of the discriminator is to accurately distinguish the generated fake data from the training data, so that the model can be well trained through the $G$-$D$ game.
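A matching discriminator sketch (again, the architectural details are illustrative assumptions): a Transformer encoder over the window, pooled into a single probability that the window is real.

```python
import torch
import torch.nn as nn

class TransformerDiscriminator(nn.Module):
    """Score a window of shape (B, w, m) with the probability that it is real."""
    def __init__(self, n_features, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        h = self.encoder(self.embed(x)).mean(dim=1)  # pool over the time axis
        return torch.sigmoid(self.head(h)).squeeze(-1)

D = TransformerDiscriminator(n_features=51)
print(D(torch.randn(32, 60, 51)).shape)  # (32,) probabilities in [0, 1]
```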
4.4. Detection of Anomalies
Given the training dataset $X_{train} \in \mathbb{R}^{A \times C}$ and the test time series $X_{test} \in \mathbb{R}^{B \times C}$, $A$ and $B$ are the maximum lengths of the timestamps and $C$ is the feature dimension. To better capture the hidden information in the time series, a sliding window of width $w$ is used to divide the multivariate time series into sub-sequences; that is, $X = \{x_1, \dots, x_N\}$ with $x_i \in \mathbb{R}^{w \times C}$, and analogously for $X_{test}$. In the same way, $Z$ is the random hidden space with $z_i \in \mathbb{R}^{w \times C}$. Then $X$ and $Z$ are put into TGAN-AD, and the model trains the generator and discriminator in the $G$-$D$ game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{x \sim p_{data},\, z \sim p_z}[\log(1 - D(G(x, z)))]$$
After many iterations, the fake samples generated by the TGAN-AD generator can fool the discriminator, indicating that training is complete. Then the reconstruction loss and discrimination loss of each test sample are calculated, and their weighted sum gives the AD-Score.
4.4.1. Discrimination Loss
The discriminator has the ability to identify whether an input sample is anomalous, so the discrimination loss can be used as part of the AD-Score. For the discrimination loss, the test time series is input into the trained discriminator, and TGAN-AD directly outputs the discrimination loss. Intuitively, the discrimination loss represents the probability that the input sample is anomalous.
4.4.2. Reconstruction Loss
For the reconstruction loss, we first search for the optimal latent vector $Z'$ of the test dataset in the latent space, i.e., the one that generates the sequence most similar to $X_{test}$: $Z' = \arg\min_{Z} \mathrm{dis}(X_{test}, G(Z))$. TGAN-AD uses covariance as the similarity reference between the generated sequence and the test sequence to update $Z'$; the gradient descent method can also be used to find the optimal sequence. After finding the optimal $Z'$, we calculate the reconstruction loss:

$$L_R = \sum \left| X_{test} - G(Z') \right|$$
Since the generator learns the distribution pattern of the normal data, the degree of dissimilarity between the two samples can be calculated by comparing the original sample with the reconstructed sample; that is, the difference between the abnormal data and the normal data can be obtained.
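The latent search can be sketched as follows (the learning rate, the step count, and the use of MSE as the similarity are illustrative assumptions; the paper mentions covariance as the similarity reference), reusing the TransformerGenerator sketch above:

```python
import torch

def find_best_latent(G, x_test, steps=100, lr=0.01):
    """Optimize Z so that G(x_test, Z) best reconstructs x_test; return Z and L_R."""
    z = torch.randn_like(x_test, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((G(x_test, z) - x_test) ** 2)  # similarity via MSE
        loss.backward()
        opt.step()
    with torch.no_grad():
        l_r = torch.abs(G(x_test, z) - x_test).sum(dim=(1, 2))  # per-window L_R
    return z.detach(), l_r

# Usage (assumes G and a batch of test windows from the earlier sketches):
# z_best, L_R = find_best_latent(G, x_test_windows)
```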
4.4.3. Anomaly Detection Score
The anomaly detection result is determined by the two losses above. The anomaly detection loss (AD-Score) is calculated from $L_R$ and $L_D$:

$$S = \lambda L_R + (1 - \lambda) L_D \quad (7)$$

where $\lambda$ represents the relative weight of the reconstruction loss and the discrimination loss and can be set empirically. Each timestamp is then marked according to its obtained score $s_i$: if $s_i$ exceeds a threshold $\tau$, the timestamp represents abnormal data; otherwise it represents normal data.
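Equation (7) and the thresholding step translate directly into code ($\lambda = 0.2$ follows the paper's experiments; the quantile-based threshold is an illustrative assumption):

```python
import numpy as np

def ad_score(l_r, l_d, lam=0.2):
    """AD-Score: weighted sum of reconstruction and discrimination losses."""
    return lam * l_r + (1.0 - lam) * l_d

def label_anomalies(scores, quantile=0.99):
    """Mark the highest-scoring windows as anomalous (threshold set empirically)."""
    tau = np.quantile(scores, quantile)
    return (scores > tau).astype(int)

# Usage with per-window losses from the earlier sketches:
# y = label_anomalies(ad_score(L_R, L_D))
```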
The overall algorithm is summarized in Algorithm 1.
Algorithm 1 Transformer-GAN Anomaly Detection Algorithm
1: epoch ← 0; initialize network parameters
2: while epoch is within the number of training iterations do
3:   for each batch in the epoch do
4:     Generate samples from the random space: X' = G(X, Z)
5:     Discrimination: D(X), D(X')
6:     Update the Transformer discriminator parameters
7:     Update the Transformer generator parameters
8:     Record the parameters of the current iteration
9:   end for
10: end while
11: for the number of search iterations do
12:   Find the best generated sample: Z' ← arg min_Z dis(X_test, G(Z))
13: end for
14: Reconstruction loss: L_R = Σ |X_test − G(Z')|
15: Discrimination loss: L_D ← output of the trained discriminator on X_test
16: Anomaly Detection Score: S = λ·L_R + (1 − λ)·L_D
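For concreteness, one plausible PyTorch transcription of the adversarial training step in Algorithm 1 (the binary cross-entropy losses and optimizer settings are our assumptions), reusing the generator and discriminator sketches above:

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, x_real):
    """One adversarial update: D learns real vs. fake, then G learns to fool D."""
    z = torch.randn_like(x_real)

    # Update the discriminator on real and (detached) fake windows.
    opt_d.zero_grad()
    x_fake = G(x_real, z).detach()
    d_loss = F.binary_cross_entropy(D(x_real), torch.ones(len(x_real))) + \
             F.binary_cross_entropy(D(x_fake), torch.zeros(len(x_fake)))
    d_loss.backward()
    opt_d.step()

    # Update the generator to make D label its output as real.
    opt_g.zero_grad()
    g_loss = F.binary_cross_entropy(D(G(x_real, z)), torch.ones(len(x_real)))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
# opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
# for x_real in loader: train_step(G, D, opt_g, opt_d, x_real)
```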
5. Datasets
5.1. Secure Water Treatment (SWaT)
Secure Water Treatment (SWaT) is a water treatment testbed for research in the area of cybersecurity. The SWaT dataset contains a total of 264 h of numerical data and network traffic data collected from 51 sensors and actuators over 11 consecutive days. It includes 7 days of normal data obtained while the system operated normally without being attacked, and 4 days of abnormal data obtained while the system was attacked in different scenarios.
5.2. Water Distribution (WADI)
As an extension of SWaT, Water Distribution (WADI) is a distribution system comprising a larger number of water distribution pipelines. WADI is more vulnerable than the SWaT system and has more features than SWaT. It contains data collected over 16 consecutive days from networks, sensors, and actuators: 14 days of normal data and 2 days of abnormal data. The abnormal data contains 15 attacks from the same attack model. The abnormal ratio is also lower than in the other datasets, so WADI is more unbalanced.
5.3. KDD Cup 1999
The KDD99 dataset is the dataset from the Third International Knowledge Discovery and Data Mining Tools Competition in 1999. The requirement of the competition was to design a network intrusion detector to detect if the network connection was under attack or intrusion. Each network connection is marked as “normal” or “attack”. There are 39 types of abnormalities, 22 of which occur in the training set, and the remaining 17 occur only in the test set.
Table 1 shows the information about the datasets.
6. Experiment
To demonstrate the performance of the proposed model, the following two problems need to be experimentally verified:
Q1: Does the proposed model perform better than the baseline methods in key metrics, especially Recall and F1-score?
Q2: How can we determine the most appropriate hyperparameter settings for the model in real-world engineering?
6.1. Data Preprocessing
To better capture the time correlation and other hidden behaviors of the time series, the dataset was divided into sub-sequences. To determine the optimal sub-sequence length, experiments were carried out with different window sizes. The initial sub-sequence length was set at 10 empirically, i.e., $w = 10$.
6.2. Baselines
We compared the performance of TGAN-AD with six popular anomaly detection methods, including:
PCA: The method is based on Principal Component Analysis [34];
Random Forest: The method is based on a completely random forest [35];
LSTM: The method is based on a Long Short-Term Memory neural network [36];
FNN: The method is based on a Feed-forward Neural Network [37];
MAD-GAN: The method is based on Generative Adversarial Networks [24];
GDN: The method is based on Graph Neural Networks [38].
6.3. Evaluation Metric
Five metrics were used in our work to evaluate anomaly detection performance: Precision, Recall, F1-score, ROC-AUC, and PR-AUC. Precision describes the proportion of the positive examples predicted by the classifier that are real positive examples. Recall describes the proportion of the real positive examples in the test set that are selected by the classifier. Precision and Recall provide a diagnostic tool for binary classification models. Moreover, the area under the Receiver Operating Characteristic curve (ROC-AUC) and the area under the Precision–Recall curve (PR-AUC) are used to evaluate the performance of anomaly detection; they characterize performance on severely imbalanced classification problems with few samples of the minority class, i.e., anomaly classification. The ROC-AUC and PR-AUC are obtained by varying the hyperparameters, while Precision, Recall, and F1-score report the model performance under the best set of selected parameters.
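These metrics can be computed with scikit-learn (a minimal sketch; the labels, scores, and threshold are placeholders):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])          # ground-truth anomaly labels
scores = np.array([.1, .2, .9, .3, .7, .8, .4, .2])  # AD-Scores from the model
y_pred = (scores > 0.5).astype(int)                  # thresholded predictions

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, scores))   # from the raw scores
print("PR-AUC:   ", average_precision_score(y_true, scores))
```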
6.4. Performance and Analysis
To answer Q1, the proposed model was compared with the six baseline methods on three publicly available real-world datasets with ground-truth labeled anomalies, and their performance on the metrics was recorded. The goal of anomaly detection is to detect anomalies as completely as possible, so we place more emphasis on the recall metric when conducting experiments and model evaluation. Therefore, while ensuring higher F1 scores, this paper primarily uses Recall as the performance measure for the model.
Table 2 shows the anomaly detection performance of the seven methods, including the six baseline methods, on the three datasets. Bolded items in the table indicate the highest value of each metric on each dataset.
For the SWaT dataset, TGAN-AD has excellent performance in Precision, Recall, and F1-score, and its ROC-AUC and PR-AUC are second only to FNN. With Recall reaching an impressive 99%, the comprehensive F1-score shows a significant improvement over the other methods, scoring 10.6% higher than the second-best baseline method, including an improvement over MAD-GAN.
For the WADI dataset, most of the metrics are the best among all methods. This dataset, as we know, is unbalanced and has higher dimensionality than other datasets. However, TGAN-AD performs as well on this dataset as on any other dataset. In addition, the anomaly detection performance of TGAN-AD is significantly higher than that of FNN, LSTM, MAD-GAN, and GDN, all of which are deep learning methods.
For the KDDCUP99 dataset, TGAN-AD has excellent performance in all five metrics, each reaching the highest value among the compared methods. The F1-score is 7% higher than that of the second-best baseline method. Moreover, TGAN-AD maintains the highest levels on several metrics across all datasets.
These results demonstrate that Transformer can effectively represent the data for generator G and discriminator D and provides a direct path to the optimal representation of a testing sample, which separates anomalous data from normal data. We implemented our method and its variants on an NVIDIA Tesla T4 graphics card. The models were trained for up to 50 epochs with early stopping with a patience of 10, and we recorded the training time of each model. In particular, our model took 53 s to train on the SWaT dataset, 1 min 11 s on the higher-dimensional WADI dataset, and converged in 41 s on the KDDCUP99 dataset. Our model requires less training time than the classical deep learning framework LSTM (1 min 9 s / 1 min 41 s / 51 s) and the latest multivariate anomaly detection model GDN (2 min 25 s / 6 min 42 s / 1 min 45 s).
6.5. Model Variations
For Q2, to evaluate the importance of different hyperparameters in TGAN-AD, we tried different settings of the hyperparameters of our model, i.e., the weight $\lambda$, the number of Transformer layers, and the sliding window length. All other settings remained unchanged while the comparative experiments were conducted.
6.5.1. The Effect of Hyperparameter $\lambda$ on the Model
In our experiment, we chose the representative dataset, SWaT, to analyze the impact of $\lambda$ in the model. In Equation (7), $\lambda$ controls the proportions of the reconstruction loss $L_R$ and the discrimination loss $L_D$ in the anomaly score. For this experiment, we kept the other hyperparameters constant, setting the number of Transformer layers to four and the sliding window length to 60. On SWaT, when $\lambda$ was set to 0.2, all the metrics showed the best performance. As $\lambda$ increased further, three of the metrics decreased dramatically, and the other two also decreased.
As shown in Figure 6, with $\lambda = 0$, the anomaly score depended only on $L_D$. Almost all the results with $\lambda = 0$ were higher than the average performance across all the experiments and ranked second only to the results with $\lambda = 0.2$; only one metric value was slightly higher with $\lambda = 0$. This indicates that the anomaly score depends more on the discrimination loss, while a small proportion of reconstruction loss can improve the overall performance. With $\lambda = 1$, the anomaly score depended only on $L_R$; the results with $\lambda = 1$ were much lower than those with the other $\lambda$ settings, which shows that the reconstruction loss alone has a weak impact on the anomaly score of TGAN-AD.
6.5.2. Role of Transformer Layers
Different numbers of Transformer layers, namely 2, 4, 6, and 8, were set in the generator and discriminator, with $\lambda$ set to 0.2 and the sliding window length set to 60. Figure 7 shows our experiments on the SWaT and KDDCUP99 datasets. Anomaly detection performed best on the SWaT dataset with four encoder–decoder layers, on the WADI dataset with four layers, and on the KDDCUP99 dataset with four or six layers.
6.5.3. The Influence of Different Sliding Window Widths
The width of the sliding window (i.e., the length of a sub-sequence) is a sensitive factor in the model's ability to capture hidden information in time series. In this experiment, different window widths were used to observe their influence on anomaly detection performance. Here, $\lambda$ is set to 0.2, and the number of Transformer layers is set to four. For each sub-sequence length, the TGAN-AD model was trained recursively for 100 iterations. We depict box plots of the metric values of TGAN-AD at each training iteration over the different sub-sequence lengths in Figure 8.
As shown in Figure 8, the impact of sequence length on our model is as follows:
- 1. SWaT dataset: When the sequence length was set to 60, TGAN-AD achieved the best performance on two of the metrics, and a third metric value was close to 0.9. When the sequence length was 100 or 40, the model showed relatively poor performance.
- 2. WADI dataset: In this experiment, Precision and F1-score showed poor performance, since the model predicted some false positives. However, in the scenario of anomaly detection, false alarms on non-anomalous samples are permissible. When the window length was 20, the model performed best.
- 3. KDDCUP99 dataset: TGAN-AD was not stable in testing on the KDD dataset but had good overall average metric values. When the sequence length was 50, it achieved excellent metric values close to 1.
6.5.4. Discussion
In summary, the hyperparameters, i.e., $\lambda$, the number of Transformer layers, and the sliding window length, are important for learning the optimal parameters to detect anomalous data in large-scale online time series. Automatic parameter selection remains a challenge in real-world application scenarios. Our model shows stable performance under different settings, which is significant for AIOps applications and other similar scenarios. In real-world applications, time series data frequently lack contextual information, which is critical for time series anomaly detection.
7. Conclusions
Anomaly detection is one of the most popular applications of time series data analysis, and it is also an important research branch in the field of AIOps. In this paper, we propose a novel framework called TGAN-AD for anomaly detection of multivariate time series. We use Transformer to train the generator and discriminator of GANs and finally use reconstruction loss and discrimination loss to measure the anomaly. We tested TGAN-AD on three public datasets and compared it with the state-of-the-art methods of time series anomaly detection. TGAN-AD showed the best performance.
The performance of anomaly detection is sensitive to the sliding window length. Hence, in future work, automatic selection of the optimal sliding window length is a promising direction for further improving anomaly detection. In this work, only the elementary Transformer was used; variants of Transformer could also be applied to modeling time series data.