1. Introduction
Industrial equipment and service systems, such as servers, cybersecurity and robotic systems, are often monitored using multiple time series (telemetry or sensor data) to keep track of equipment operation and promptly detect system anomalies. Since anomalies in the data carry important information, their timely and efficient detection by operational engineers helps to improve efficiency, reduce costs and increase safety [1]. Anomalies are system behaviour patterns at certain time steps that do not conform to a well-defined notion of normal behaviour [2]. The purpose of anomaly detection is to provide the opportunity to identify and resolve potential problems before they cause disasters. The difficulty of label acquisition and the extreme imbalance between normal and abnormal categories have led to anomaly detection being treated as an unsupervised machine learning task [3,4,5]. Traditionally, domain experts set thresholds for normal events, and the system is considered abnormal if its measurements exceed the expert-defined thresholds. Given the dramatic increase in the number of sensors required to monitor modern systems, and the hidden correlations and temporal information in the data becoming too cumbersome to handle manually, the traditional threshold approach is no longer applicable and automatic anomaly detection methods have become a necessity.
Currently, anomaly detection has become an active research topic in the field of data mining and is widely used in areas such as healthcare, aerospace and industrial production [6,7,8,9,10,11]. Although numerous anomaly detection methods have been developed for univariate time series [1,12,13,14], where anomalies are detected mainly based on one specific metric, in a complex real-world system there are intrinsic correlations between different sensors, and a single univariate time series does not represent the overall state of the system well. Formally, a multivariate time series consists of a set of univariate time series, each of which describes an attribute of a complex entity. Multivariate time series have higher spatial and temporal complexity, noisier data and more severe disturbances. In addition, there is often synergistic variation among individual features. Dividing a multivariate time series into several univariate time series therefore leads to poor anomaly detection performance [15,16]. All of this makes anomaly detection for multivariate time series difficult.
In the past few years, many classical algorithms have been proposed for automatic anomaly detection in multivariate time series. These include distance-based methods [17,18], clustering-based methods [19,20], similarity-based methods [21,22] and classification-based methods [23,24], which have driven the development of anomaly detection techniques. However, the failure to consider the temporal dependence of time series limits the performance of these techniques. To address this problem, time series prediction models, such as the autoregressive moving average (ARMA) [25] and autoregressive integrated moving average (ARIMA) [26] models, are used for time-dependent anomaly detection. However, these methods do not consider the correlation between time series and are sensitive to noise, which affects the robustness of the models. We argue that modeling both the temporal dependence and the correlation between different sequences is beneficial to model performance.
Recently, deep learning has gained significant attention in computer vision and natural language processing and has also been applied to the task of anomaly detection. Deep learning anomaly detection algorithms can be broadly classified into two categories: prediction-based and reconstruction-based. Reconstruction-based algorithms, e.g., TAnoGAN [27], EncDec-AD [28], TCN-AE [11] and OmniAnomaly [29], reconstruct the input by learning the data distribution of the normal state of the time series. Reconstruction errors are used for anomaly detection, which avoids the difficulty of time series prediction; however, minor anomalies in the data are not easily detected with this approach. Prediction-based models, such as LSTM-NDT [10] and GDN [30], learn historical patterns to predict the future and perform anomaly detection via prediction errors. Lastly, hybrid models, such as NSIBF [31] and MTAD-GAT [32], use the prediction and reconstruction errors of all dimensions for detection. This multi-task optimization problem becomes extremely complicated when there are numerous features in the data. We believe that combining the advantages of both approaches can improve detection performance without increasing the complexity of the task.
To address the above problems, we propose a new framework, namely, the graph attention network and temporal convolutional network for multivariate time series anomaly detection (GTAD). Specifically, GTAD uses temporal convolutional networks to capture the temporal dependence of the sequences, where causal convolution maintains the causality of the sequences and dilated convolution allows for flexible receptive field sizes. Subsequently, graph attention networks are adopted to model the correlation between different time series, which is naturally obtained through the properties of the edges in the graph structure. Finally, we utilize a joint approach based on the prediction and reconstruction of one feature to optimize the model, simplifying the optimization objective.
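To make the temporal module concrete, the following is a minimal sketch of a dilated causal convolution stack of the kind a TCN uses, written in PyTorch (the framework used in Section 4.2); the layer sizes, depth and kernel size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution padded only on the left, so the output at time t
    depends only on inputs at times <= t (causality is preserved)."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                               # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))  # pad the past only

class TCNBlock(nn.Module):
    """Stack of dilated causal convolutions; the dilation doubles per layer,
    so the receptive field grows exponentially with depth."""
    def __init__(self, channels, kernel_size=3, levels=4):
        super().__init__()
        layers = []
        for i in range(levels):
            layers += [CausalConv1d(channels, kernel_size, dilation=2 ** i),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

x = torch.randn(8, 25, 100)    # (batch, sensors/features, window length)
print(TCNBlock(25)(x).shape)   # torch.Size([8, 25, 100]) -- length preserved
```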
We summarize the study’s main contributions as follows:
We propose GTAD, a new framework for unsupervised multivariate time series anomaly detection that combines the advantages of prediction-based approaches, which focus on feature engineering at the next time step, and reconstruction-based approaches, which emphasize capturing the overall distribution of the data.
GTAD uses parallel operations instead of RNN frameworks such as LSTM and GRU, which enhances its ability to extract contextual information and results in a model with low sensitivity to the sliding window size.
GTAD specifies the optimization objective by using the prediction and reconstruction error of one dimension as the loss function, rather than of all dimensions, leading to better detection performance (a sketch of this objective follows the list).
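As a rough illustration of this last contribution, the sketch below combines the prediction and reconstruction errors of a single target dimension into one loss; the tensor shapes and the helper name `joint_loss` are hypothetical, and the exact formulation is defined in Section 3.

```python
import torch

def joint_loss(pred, recon, window, next_val, target_dim=0):
    """Sketch of a joint objective on one dimension: RMSE of the one-step
    prediction plus RMSE of the reconstruction, both restricted to the
    chosen target dimension (equal weights, per Section 4.6)."""
    # pred: (batch,) predicted next value of the target dimension
    # recon: (batch, time) reconstruction of the target dimension
    # window: (batch, time, features) input; next_val: (batch, features)
    pred_err = torch.sqrt(torch.mean((pred - next_val[:, target_dim]) ** 2))
    recon_err = torch.sqrt(torch.mean((recon - window[:, :, target_dim]) ** 2))
    return pred_err + recon_err
```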
The structure of the rest of the paper is as follows. The related work on time series anomaly detection is described in Section 2. An overview of the method is introduced and briefly analyzed in Section 3. In Section 4, the effectiveness of the proposed method is experimentally verified. Finally, we summarize the whole paper and suggest possible future work in Section 5.
2. Related Works
In recent years, many methods have advanced multivariate time series anomaly detection and can achieve effective detection. Here, we group these anomaly detection methods into classical methods and deep-learning-based methods.
Classical methods: The K-nearest neighbor (KNN) [17] algorithm calculates the average distance of the K nearest neighbors of each sample as the anomaly score. However, the computational complexity of this method is high when the feature dimension of the data is high. Principal component analysis (PCA) [33] and partial least squares (PLS) [34] are two linear-model-based approaches for anomaly detection. However, these models assume that the data are Gaussian-distributed and are only feasible for highly correlated data. The find-CBLOF algorithm [35] assigns each data instance a predefined anomaly score known as the cluster-based local outlier factor (CBLOF) to cluster different data samples. The extended fuzzy C-means clustering technique [20] can detect anomalies more effectively. However, the computational overhead of clustering methods is high, and performance depends heavily on how well the clustering structure of normal instances is captured. ARMA [25] and ARIMA [26] are two common statistical techniques used for anomaly detection, which rely on the assumption that the data are generated from a particular distribution. This assumption often does not hold, especially for multivariate time series. Other machine learning methods, such as support vector domain description (SVDD) [36], the one-class support vector machine (OC-SVM) [37] and the isolation forest (IF) [38], are effective for anomaly detection, but these methods do not consider the temporal dependence, resulting in poor detection of contextual anomalies.
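For concreteness, here is a minimal sketch of two of the classical detectors mentioned above (KNN distance scores and the isolation forest), using scikit-learn on synthetic data; the parameters are illustrative. Note that, as the text points out, neither score uses any temporal ordering.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))            # 1000 timestamps, 8 sensors
X[-10:] += 6.0                            # inject a block of point anomalies

# Isolation forest: samples that are easier to isolate score higher.
iforest = IsolationForest(n_estimators=100, random_state=0).fit(X)
if_scores = -iforest.score_samples(X)     # higher = more anomalous

# KNN: average distance to the K nearest neighbours as the anomaly score
# (ask for K+1 neighbours and drop the first, which is the point itself).
knn = NearestNeighbors(n_neighbors=6).fit(X)
dist, _ = knn.kneighbors(X)
knn_scores = dist[:, 1:].mean(axis=1)

print(if_scores[-10:].mean() > if_scores[:-10].mean())    # True
print(knn_scores[-10:].mean() > knn_scores[:-10].mean())  # True
```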
Deep-learning-based methods: Most contemporary state-of-the-art techniques employ some form of deep learning. The DAGMM [39] uses a deep autoencoder to generate a low-dimensional representation and reconstruction error for each input sample, which are further fed into a Gaussian mixture model. However, this method cannot exploit temporal information. The TCN-AE [11], combining the temporal convolutional network (TCN) [40] and an autoencoder (AE), was designed to learn a compressed representation of normal time series. This approach can capture temporal information but ignores the correlation between time series. The MSCRED [41] constructs multi-scale signature matrices, reconstructs them using an attention-based Conv-LSTM and a convolutional encoder–decoder, and detects anomalies using the residual signature matrices. Although this method captures both correlations and temporal information in multivariate time series, it consumes a lot of training time and works poorly when data are insufficient.
Models with LSTM or GRU require long training times. The NSIBF [31] designs an LSTM-based neural network framework for system identification and uses Bayesian filtering for robust anomaly detection by recursively tracking the uncertainty of the hidden states of the system over time. Ergen et al. [42] proposed an algorithm that turns a variable-length data sequence into a fixed-length sequence using LSTM, followed by an anomaly detector decision function based on a one-class support vector machine or the support vector data description algorithm. The LSTM-NDT [10] method is an LSTM-based neural network model that makes a prediction for each input timestamp. This work also proposes a nonparametric dynamic error thresholding strategy that uses the moving average of the prediction error sequence to set the threshold for anomaly marking. The MAD-GAN [43] uses an LSTM-RNN as the basic framework of a GAN to capture potential interrelationships between time series and determines whether a sample is anomalous using the trained discriminator and reconstruction errors. The MTAD-GAT [32] uses graph attention networks (GATs) [44] in both the time dimension and the feature dimension to better represent the complex dependencies of the time series, and then captures the dependency information of the input data through a GRU, which is used for prediction and reconstruction. However, these RNN-based models tend to be inefficient in modeling long time series, especially when the data are noisy.
More recent methods, such as USAD [45], GDN [30] and TranAD [46], do not use resource-hungry recurrent models, relying instead on attention-based network architectures [47] to improve training speed. The USAD [45], which is based on generative adversarial networks and an autoencoder, is used for unsupervised anomaly detection, where the reconstruction errors of anomalous inputs are amplified by an adversarially trained autoencoder. This is one of the first works to focus on low overhead, allowing a several-fold reduction in training time compared with existing techniques. The GDN [30] combines structure learning with GATs, applying attention mechanisms to adjacent sensors on the graph to produce a prediction for each timestamp and detect anomalies using prediction errors. However, existing graph-neural-network-based anomaly detection methods still have difficulties in dealing with lengthy and highly correlated multivariate time series. The deep-transformer-network-based anomaly detection and diagnosis model (TranAD) [46] employs an attention-based sequence encoder to rapidly make inferences using extensive temporal trend information in the data.
4. Experimental Evaluation
In this section, we describe the experimental datasets, baseline methods and evaluation metrics. Then, we present the results of the many experiments conducted to show the effectiveness of GTAD in unsupervised anomaly detection.
4.1. Datasets
We employed three publicly available datasets in our experiments. The Soil Moisture Active Passive (SMAP) satellite and the Mars Science Laboratory (MSL) rover datasets are two real-world public datasets collected by NASA [46]. The Server Machine Dataset (SMD) is a five-week dataset collected and made publicly available by a large Internet company [29]. It contains data from 28 server machines. SMD is divided into two subsets of equal size: the first half of the data from each machine is the training set and the second half is the test set. Detailed information about the datasets is shown in Table 1.
4.2. Experimental Setup
In the experiments, the machine learning library Scikit-learn, the deep learning framework PyTorch 1.7.1 and Python 3.6 were used. The operating system was Ubuntu 16.04, the CPU was an Intel(R) Core(TM) i7-6850K @ 3.60 GHz and the GPU was an NVIDIA GTX 1080Ti. The Adam optimizer was used to train the anomaly detection model and the root-mean-squared error was used as the loss function during training. We employed a learning rate schedule of exponential decay with fixed steps and set the initial learning rate to 2.5 × 10⁻⁵.
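A minimal sketch of this training setup in PyTorch is shown below; the model, the decay interval and the decay factor are placeholders, since only the optimizer, the loss and the initial learning rate are specified above.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder for the GTAD network

optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-5)
# Exponential decay at fixed steps: every `step_size` epochs the learning
# rate is multiplied by `gamma` (both values are illustrative).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
mse = torch.nn.MSELoss()

for epoch in range(30):
    optimizer.zero_grad()
    out = model(torch.randn(32, 10))
    loss = torch.sqrt(mse(out, torch.randn(32, 1)))  # root-mean-squared error
    loss.backward()
    optimizer.step()
    scheduler.step()
```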
4.3. Baseline Methods and Indicators Evaluation
Baseline methods: These state-of-the-art multivariate time series anomaly detection models, including DAGMM [
39], MSCRED [
41], USAD [
45], MTAD-GAT [
32], OmniAnomaly [
29], GDN [
30] and MAD-GAN [
43] were used as baseline models in this paper for comparison with GTAD.
Evaluation indicators: Anomaly detection is a binary classification problem. Precision, recall and the F1 score were used to evaluate the detection performance of GTAD and the baseline methods, as shown in Equation (7). Anomalous observations usually occur in the form of contiguous anomaly segments. If at least one observation of an anomalous segment is correctly detected, all the other observations of the segment are also considered correctly detected, even if they were not. This approach is known as the point adjust method [5], which we adopted.
Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1 = (2 × Precision × Recall)/(Precision + Recall), (7)

with TP referring to true positives, FP referring to false positives and FN referring to false negatives. Precision indicates how many of the anomalous events predicted by the algorithm are actual anomalous events. Recall denotes the percentage of predicted abnormal behavior versus all abnormal behavior. The F1 score is a better measure of model performance since it considers both precision and recall. We expect the highest possible F1 score.
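The point adjust protocol and the resulting metrics are straightforward to implement; the sketch below, over binary label and prediction arrays, is one plausible realization.

```python
import numpy as np

def point_adjust(labels, preds):
    """Point adjust [5]: if any point inside a ground-truth anomaly
    segment is flagged, mark the whole segment as detected."""
    adjusted = preds.copy()
    in_seg, start = False, 0
    for i, lab in enumerate(labels):
        if lab == 1 and not in_seg:
            in_seg, start = True, i
        if in_seg and (lab == 0 or i == len(labels) - 1):
            end = i if lab == 0 else i + 1
            if adjusted[start:end].any():
                adjusted[start:end] = 1   # whole segment counts as detected
            in_seg = False
    return adjusted

def f1_metrics(labels, preds):
    """Precision, recall and F1 as in Equation (7)."""
    tp = np.sum((labels == 1) & (preds == 1))
    fp = np.sum((labels == 0) & (preds == 1))
    fn = np.sum((labels == 1) & (preds == 0))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return precision, recall, 2 * precision * recall / (precision + recall + 1e-12)

labels = np.array([0, 1, 1, 1, 0, 0, 1, 1])
preds  = np.array([0, 0, 1, 0, 0, 0, 0, 0])   # one hit inside the first segment
print(f1_metrics(labels, point_adjust(labels, preds)))  # recall rises to 0.6
```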
4.4. Results
We conducted experiments on the three publicly available datasets. We used the hyperparameters of the baseline models as presented in their respective papers. Table 2 shows the precision, recall and F1 scores of GTAD and all baseline methods.
The experimental results show that our model outperformed the other models overall across the three datasets. On the SMAP and SMD datasets, the F1 score of GTAD exceeded that of all benchmark models, and on the MSL dataset, it ranked second, only slightly behind the GDN; the GDN, however, did not work well on the SMAP and SMD datasets. Overall, compared with the best results of the baseline models, GTAD improved the F1 scores by about 7.4% on SMAP and 3.4% on SMD, which is significant for anomaly detection. DAGMM achieved good detection performance on SMD but did not work well on MSL and SMAP because DAGMM only considers the correlation between different sequences and ignores the temporal dependence. GTAD achieved better anomaly detection by considering the temporal dependence using a TCN. This shows that in multivariate time series anomaly detection, capturing the temporal dependence helps to improve performance.
MSCRED, USAD and OmniAnomaly discriminate anomalies only through reconstruction-based methods, which can leave some mild anomalies undetected. The GDN performs anomaly detection via prediction only, focusing on feature engineering for the next timestamp, but it is well known that some time series are not predictable. Different data have different attributes, ranges and feature dimensions, resulting in different performances for the same algorithm on different datasets. Among the baselines, MSCRED, GDN and OmniAnomaly performed best on MSL and average on SMD and SMAP, while USAD worked well on SMD but was only moderate on SMAP. GTAD performs anomaly detection by blending prediction and reconstruction methods, with excellent results.
The MTAD-GAT, based on a graph neural network (GNN), models time series as graph structures, which takes correlations between time series into consideration. However, dividing the time series into small sliding windows restricts the model from acquiring contextual information, makes it more sensitive to the data and less effective on datasets with long-term anomalies, like SMD, and reduces its robustness. The MAD-GAN performed well on the SMD dataset, with its large number of collective anomalies, by considering the relationships between features through the autoencoder and adversarial training, but it ignores temporal information, resulting in moderate performance on the MSL and SMAP datasets, which contain many contextual anomalies. GTAD applies a TCN and multi-head attention to obtain more contextual information, which improves the capability of the algorithm.
4.5. Ablation Analysis
In this section, we present the results from analyzing the impact of six main components on the model performance: GATv2, the TCN for feature extraction, the attention mechanism, using the prediction and reconstruction errors of one feature, using the prediction method and employing the reconstruction method. On the three datasets, we observed the F1 score of the model after removing each principal component to measure its impact. Specifically, in the first variant of the model, GATv2 was replaced by an identity mapping. Second, an identity mapping was chosen to replace the TCN in the feature extraction. Third, the multi-head attention layers were removed. Fourth, the errors of predicting and reconstructing all features were chosen as the optimization objective and anomaly score, rather than those of one dimension. Fifth, we eliminated the reconstruction-based method from optimization and detection. Finally, we abandoned the prediction-based approach to optimization and detection. The following conclusions were obtained based on the results shown in Figure 6.
Using the prediction and reconstruction errors of all dimensions as the loss function and anomaly score resulted in an average decrease of about 23% in the F1 score. The most notable drop was 26% on the SMAP dataset, implying that selecting a single dimension for the loss matters significantly.
When we removed GATv2 from GTAD, the F1 scores decreased by about 6%, indicating that GATv2, by taking the correlations between the time series into account, helps GTAD work well (a minimal GATv2 example is given after this list).
Without the attention mechanism, the F1 scores were reduced by 10% on average. This suggested that adding the attention mechanism allowed for more contextual information and facilitated reconstruction.
The absence of the TCN caused a decrease of about 2% in the F1 score, indicating that the TCN captures temporal dependence and local features that steadily improve the model performance.
Both the prediction-based and reconstruction-based methods were less effective on their own than the integration of the two methods, demonstrating that GTAD could combine their advantages.
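For reference, the GATv2 component removed in the first ablation can be used roughly as follows, here with PyTorch Geometric's GATv2Conv over a fully connected sensor graph; the node features, graph construction and layer sizes are illustrative assumptions, not the paper's exact design.

```python
import torch
from torch_geometric.nn import GATv2Conv

n_sensors, feat_dim = 25, 64
x = torch.randn(n_sensors, feat_dim)   # one embedding per sensor/series

# Fully connected graph (no self-loops): every sensor attends to every
# other, so pairwise correlations are learned via the attention weights.
src, dst = zip(*[(i, j) for i in range(n_sensors)
                 for j in range(n_sensors) if i != j])
edge_index = torch.tensor([src, dst], dtype=torch.long)

gat = GATv2Conv(feat_dim, feat_dim, heads=4, concat=False)
out = gat(x, edge_index)               # (25, 64) correlation-aware features
print(out.shape)
```

Replacing this layer with an identity mapping, as in the ablation, simply returns `x` unchanged and discards the learned inter-sensor correlations.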
4.6. Sensitivity Analysis of Hyperparameters
Sensitivity to window size: We used the SMD dataset to compare the F1 scores of GTAD and the baseline approaches under different sliding window sizes, as shown in Figure 7. Since DAGMM does not use sliding windows for data preprocessing, it was not included in this experiment. A small sliding window limits the ability to obtain contextual information, whereas with a large sliding window, short-term subtle anomalies are hidden in long sequences; as a result, most baseline models, e.g., GDN, MSCRED and USAD, are sensitive to the sliding window size. Although MTAD-GAT and OmniAnomaly are also insensitive to the sliding window size, their overall results were not as good as those of GTAD. Long-term memory is preserved by the TCN, while contextual information is captured by the multi-head attention mechanism, resulting in the low sensitivity of GTAD to the sliding window size.
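The sliding-window preprocessing whose size this experiment varies can be sketched as follows; the window length is illustrative, and the 38-metric shape matches SMD's dimensionality.

```python
import numpy as np

def sliding_windows(series, window_size):
    """Split a multivariate series of shape (T, n_features) into overlapping
    windows of shape (T - window_size, window_size, n_features); the point
    right after each window is the one-step prediction target."""
    windows = np.stack([series[i:i + window_size]
                        for i in range(len(series) - window_size)])
    targets = series[window_size:]
    return windows, targets

series = np.random.randn(5000, 38)      # e.g., one SMD machine: 38 metrics
w, t = sliding_windows(series, window_size=100)
print(w.shape, t.shape)                 # (4900, 100, 38) (4900, 38)
```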
Analysis of θ: Empirically, adjusting the weights of the prediction and reconstruction errors during training brought no significant improvement in the performance of our model; therefore, we gave the prediction and reconstruction parts the same weight in the loss function. We performed an additional experiment to evaluate the recall, precision and F1 score of the algorithm for different values of θ on the three datasets, and the results are shown in Figure 8. The results show that GTAD achieved the largest F1 scores on the SMAP and SMD datasets when θ was equal to two, and on the MSL dataset when θ was equal to one. The recall was low when θ was less than one only on the SMAP dataset. Overall, GTAD achieved excellent anomaly detection performance at different values of θ, which indicates that our algorithm is robust with respect to θ.
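As a rough illustration, the sketch below assumes θ weights the reconstruction error against the prediction error when the two are combined into a per-timestamp anomaly score; the exact combination is defined in Section 3, so treat this only as a reading of the experiment above.

```python
import numpy as np

def anomaly_score(pred_err, recon_err, theta=1.0):
    """Sketch: combined per-timestamp score, assuming theta weights the
    reconstruction error relative to the prediction error."""
    return pred_err + theta * recon_err

pred_err = np.abs(np.random.randn(1000))   # |x_t - x̂_t| on the target dimension
recon_err = np.abs(np.random.randn(1000))
scores = anomaly_score(pred_err, recon_err, theta=2.0)  # best on SMAP/SMD
anomalies = scores > np.percentile(scores, 99)          # illustrative threshold
```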
4.7. Overhead Analysis
In this section, we present the computational performance of GTAD by comparing it with all baseline models. Table 3 shows the average training time for all models on the three datasets in seconds per epoch. The training times of DAGMM and USAD, which consist of fully connected layers, were shorter than that of GTAD, and the time consumption of MSCRED was the largest. Models with LSTM or GRU, such as MSCRED, MTAD-GAT and OmniAnomaly, had longer training times than GTAD with its parallel computing. MSCRED first increases the amount of data by constructing 2D feature matrices and then applies Conv-LSTM to process the data sequentially, leading to a tremendous time overhead. To ensure a fair experimental comparison, we set the batch size and sequence length to be the same for all models except DAGMM. Because DAGMM does not consider the temporal relationship, its sequence length was set to 1, which also explains its short training time.
4.8. The Effectiveness of Automatic Threshold Selection
An efficient automatic thresholding method is essential for time series anomaly detection. On the three datasets, we compared the F1 scores obtained using the NDT method with the best F1 scores, which were obtained by iterating over all thresholds in small steps between 0 and 2. The results shown in Table 4 indicate that the F1 scores of the NDT method were only slightly lower than the best F1 scores (by 0.005 to 0.131), demonstrating the validity of the thresholding method used by GTAD.
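The brute-force sweep that produces the best F1 reference can be sketched as follows; the step size is an assumption ("small steps"), and for brevity the sketch uses scikit-learn's f1_score without the point adjustment of Section 4.3.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_by_sweep(scores, labels, lo=0.0, hi=2.0, step=0.005):
    """Try every threshold in [lo, hi) in small steps and keep the best F1."""
    best_f1, best_thr = 0.0, lo
    for thr in np.arange(lo, hi, step):
        preds = (scores > thr).astype(int)
        f1 = f1_score(labels, preds, zero_division=0)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_f1, best_thr

scores = np.random.rand(1000) * 2       # illustrative anomaly scores
labels = (scores > 1.5).astype(int)     # toy ground truth
print(best_f1_by_sweep(scores, labels))
```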
4.9. Discussion
In this subsection, we discuss the advantages and disadvantages of our model. The advantages lie mainly in three aspects. First, unlike supervised learning for multivariate time series anomaly detection, GTAD does not require a label for each timestamp during model training. Labels for time series are often generated manually by experts, which is inefficient, time-consuming and costly. Moreover, unlike other joint reconstruction- and prediction-based methods, we only reconstruct and predict one sequence of the multivariate time series, without increasing the complexity of the multi-task optimization objective. This advantage was experimentally demonstrated in Section 4.5 to benefit anomaly detection. Finally, we implemented parallel operations using causal convolution and attention mechanisms to model the temporal dependence and sequence correlation, respectively. Compared with models that process data sequentially, the experiments in Section 4.6 and Section 4.7 revealed that our model has a short training time and low sensitivity to the time window size.
Although the proposed method shows very good performance, it has some limitations. One limitation is the model's inability to explain an anomaly; we cannot provide the root cause of an anomaly's occurrence, whereas GDN, TranAD, MSCRED and OmniAnomaly enable the diagnosis and root cause analysis of anomalies. Another limitation is that the training data must consist entirely of normal data and cannot be mixed with abnormal data, whereas DAGMM and many classical methods [17,19,23] do not require the training set to be all normal data.