Article

TiTAD: Time-Invariant Transformer for Multivariate Time Series Anomaly Detection

School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1401; https://doi.org/10.3390/electronics14071401
Submission received: 7 February 2025 / Revised: 15 March 2025 / Accepted: 26 March 2025 / Published: 31 March 2025

Abstract

Anomaly detection in multivariate time series data is critical for industrial sectors such as manufacturing and aerospace. While existing methods have achieved notable success in specific scenarios, they often narrowly focus on either the temporal or spatial dimensions while overlooking their complex interdependencies. Furthermore, these approaches tend to neglect the time-invariant characteristics that are crucial for accurately capturing the spatio-temporal dynamics of the time series. To address these limitations, this paper introduces the Time-invariant Transformer for Multivariate Time Series Anomaly Detection (TiTAD), a novel framework that synergizes temporal invariance with spatio-temporal modeling. TiTAD leverages the Time-invariant Transformer, a component that excels at extracting both spatio-temporal and time-invariant features by incorporating an augmented memory mechanism. This mechanism enhances anomaly identification robustness through synergistic integration of heterogeneous feature sets. Additionally, TiTAD mitigates the Transformer’s tendency to lose temporal sequence information through the use of the Gated Recurrent Unit (GRU), thereby further enhancing the model’s capability to discern spatio-temporal patterns. The inclusion of a Feature Fusion module within TiTAD serves to refine the extracted features by adjusting their weights and minimizing redundancy, ensuring that the most relevant information is utilized for prediction and anomaly detection. Empirical evaluation on three industrial-scale benchmarks (SWaT, WADI, and SMD) demonstrates TiTAD’s superior performance compared to other methods.

1. Introduction

With the advancement of sensor technology, sensor deployment in cyber–physical systems—including vehicles, water treatment plants, and space probes—is experiencing rapid growth. These sensors generate vast amounts of multivariate time series data while continuously monitoring the operational health of these systems [1]. The effective utilization of these time series data has become increasingly important for ensuring the seamless operation of various facilities, particularly critical infrastructures such as transportation, power, and communications [2,3,4,5]. The primary focus is on time series anomaly detection, which aims to identify and alert security teams to anomalies within these multivariate time series datasets. By doing so, potential irregularities can be addressed promptly, thereby minimizing potential risks and losses.
Multivariate time series exhibit dynamic characteristics over distinct time periods, encompassing spatial features such as evolving variable correlations and temporal features like inherent noise fluctuations [6]. Existing approaches [7,8,9,10,11,12,13,14] predominantly leverage Recurrent Neural Networks (RNNs) to capture the temporal features of multivariate time series. Another major category of methods [15,16,17,18,19] adopts Convolutional Neural Networks (CNNs) or Graph Neural Networks (GNNs) to capture the spatial features among variables within multivariate time series. Multivariate time series data generated by real-world systems exhibit complex patterns and large volumes, making the process of labeling anomalies costly [20,21]. Consequently, the field of anomaly detection predominantly relies on unsupervised methods [22]. Unsupervised methods typically employ prediction or reconstruction models, which improve the accuracy of predicting or reconstructing the time series to learn the standard patterns during the training process [23,24]. Since anomalous data patterns deviate from the standard patterns, the models generate higher bias in predicting or reconstructing anomalies, allowing these anomalies to be inferred from these biases.
While these methods have demonstrated remarkable performance in anomaly detection, they predominantly concentrate on either temporal or spatial dimensions of the data, failing to adequately capture the complex spatio-temporal interactions inherent in multivariate time series. Such data naturally embody nonlinear spatial dependencies between variables and intricate temporal dependencies within variables. The unilateral modeling approach inevitably leads to incomplete pattern learning, causing failures in capturing either spatial structural anomalies or temporal contextual deviations, thereby constraining detection capabilities [25]. Moreover, these methods often overlook the importance of time-invariant properties when extracting spatio-temporal features. Notably, multivariate time series preserve essential time-invariant attributes including stable inter-variable correlations and persistent periodic patterns or trend components. Neglecting these fundamental characteristics may severely compromise model effectiveness. In this work, “invariant” refers to the robustness of our model to specific transformations (e.g., temporal shifts or scaling) within the multivariate time series. This does not assume stationarity of the underlying process, as trends, cycles, or seasonalities may still exist. Following [1], we define anomalies as deviations from invariant relational patterns across variables, rather than static statistical properties.
Recent advances have demonstrated the successful application of Transformers to multivariate time series analysis, harnessing their exceptional global representation and long-range relationship modeling capabilities [26]. These architectures have shown the capacity to extract time-invariant features across both spatial and temporal dimensions. The latest methods leverage Transformers to decode temporal patterns within individual variables, achieving significantly improved performance [12,13]. However, current methods fail to explicitly model the time-invariant characteristics of time series, limiting the full utilization of these properties. Additionally, Transformers tend to lose the temporal order that is critical to time series data, which further hinders their ability to extract intrinsic temporal features of each variable [27,28].
To address these limitations, this paper proposes the Time-invariant Transformer for Multivariate Time Series Anomaly Detection (TiTAD). TiTAD harnesses a Time-invariant Transformer to holistically model the spatio-temporal dependencies in multivariate time series. We capitalize on the Transformer’s dual strengths in long-range dependency modeling and relational representation to extract spatio-temporal features. Meanwhile, TiTAD integrates the Gated Recurrent Unit (GRU) [29] to counteract the Transformer’s temporal order degradation, thereby refining spatio-temporal feature extraction. The Time-invariant Transformer further incorporates an augmented memory mechanism comprising learnable embedding vectors and an attention module. During training, embeddings are continuously optimized to fit the time-invariant characteristics per variable, while the attention mechanism precisely quantifies inter-variable relationships. Through the coordinated operation of these components, the mechanism discovers latent time-invariant attributes in the data. TiTAD strategically combines time-invariant and spatio-temporal features to amplify anomaly discrimination. Crucially, time-invariant features achieve superior prediction fidelity for normal data with stable patterns, while failing to generalize anomalous instances with irregular patterns. This amplifies the disparity between normal and abnormal prediction errors, sharpening anomaly detectability. The architecture culminates in a Feature Fusion (FF) module employing NeXtVLAD [30], which dynamically recalibrates time-invariant feature weights while eliminating redundancy. Fused features are processed through transformation layers for prediction-based anomaly inference.
The main contributions of this study are summarized below:
  • We propose the Time-invariant Transformer, which includes an augmented memory mechanism to extract spatio-temporal features inherent in time series. Specifically, it integrates the spatio-temporal features extracted by the Transformer and the time-invariant features obtained through the augmented memory mechanism. The different effects of time-invariant features on the prediction of anomalous and normal data can highlight the distinguishability of anomalous data, thereby improving the performance of anomaly detection.
  • We utilize GRU, which focuses on extracting intra-variable dynamic features from a temporal perspective, to compensate for the Transformer’s loss of temporal order. This ensures that the model retains the critical temporal information necessary for accurate anomaly detection.
  • We propose a novel Feature Fusion module with adaptive gating mechanisms that dynamically balances the contributions of spatio-temporal and time-invariant features, keeping the model’s overall performance robust and reliable.
  • We conduct extensive empirical research on three widely used datasets from real-world sources, demonstrating the superiority of TiTAD. Our experiments also showcase the model’s capability in anomaly diagnosis, highlighting its practical value in real-world applications.
The remainder of this paper is organized as follows: Section 2 reviews related work in multivariate time series anomaly detection. Section 3 details the proposed TiTAD framework, including the Time-invariant Transformer architecture, enhanced temporal modeling, and Feature Fusion module. Section 4 presents experimental results on three industrial datasets, with comparisons to state-of-the-art baselines. Section 5 discusses the implications and limitations of our approach. Finally, Section 6 concludes the paper.

2. Related Work

Anomalies exhibit multifaceted data patterns, whereas normal data demonstrate relatively stable characteristics. Therefore, existing anomaly detection methods often learn standard patterns from normal data and then perform prediction or reconstruction from learned standard patterns, using the prediction or reconstruction error as an important basis for anomaly inference. These methods are thus categorized into prediction-based and reconstruction-based models.
Prediction Methods: LSTM-NDT [7] uses Long Short-Term Memory (LSTM) and nonparametric dynamic thresholding to achieve superior prediction performance while preserving interpretability in the system. When model forecasts are obtained, it also provides an unsupervised thresholding method to evaluate residuals. Deep Autoencoding Gaussian Mixture Model (DAGMM) [10] consists of an estimation network and a compression network. The compression network includes an autoencoder for feature extraction of the input data, while the estimation network uses a Gaussian Mixture Model to predict the next data point and uses the prediction error as an important basis for anomaly inference. The Graph Deviation Network (GDN) [17] is a prediction model that integrates graph structure learning and Graph Neural Networks (GNNs) to extract spatial features and generate anomaly scores using prediction errors.
Reconstruction Methods: UnSupervised Anomaly Detection (USAD) [31] leverages adversarial autoencoders for unsupervised learning, enabling efficient anomaly identification through adversarial training. OmniAnomaly [9] employs stochastic variable connections and planar normalizing flows to characterize normal time series patterns, detecting anomalies via reconstruction probabilities. LSTM-VAE [32] integrates LSTM and the Variational Autoencoder (VAE) by substituting the VAE’s linear layer with LSTM to better capture the temporal dynamics. The Multi-Scale Convolutional Recurrent Encoder–Decoder (MSCRED) [33] creates multi-scale signature matrices representing temporal states, using convolutional encoders to encode inter-sensor correlations and ConvLSTM networks to model temporal dependencies. While advancing the field, these methods lack the capacity for comprehensive modeling of temporal feature interactions and neglect the intrinsic time-invariant properties of time series, ultimately limiting anomaly detection efficacy.
Transformer Models: AnomalyTransformer [12] adopts Transformer architectures for temporal feature extraction. It computes association discrepancy via the Transformer’s attention weights and combines this metric with reconstruction errors to derive anomaly scores. TranAD [13] establishes a Transformer-based framework for anomaly detection and diagnosis, enabling few-shot learning through model-agnostic meta-learning. GTA [11] integrates structure learning with graph convolution and Transformer-based temporal modeling. Its graph structure learning employs Gumbel softmax sampling to learn bidirectional links between sensors, and it develops an influence propagation convolution to model the flow of anomaly information between nodes. Despite this progress, these methods exhibit limitations in spatio-temporal feature extraction and fail to explicitly model the time-invariant properties of time series. Additionally, Transformers suffer from temporal order degradation in sequential data.
Prior works assume temporal stationarity or overlook persistent patterns. TiTAD explicitly disentangles invariant contextual features (learned via contrastive phase alignment) from time-variant residuals. Most methods prioritize either spatial or temporal features. TiTAD introduces attention gates that dynamically weight spatial and temporal embeddings based on their anomaly saliency. The Transformers’ permutation invariance weakens sequential causality. Our TiTAD enforces order-aware attention distributions, preserving causal dependencies critical for anomaly localization. By unifying these innovations, TiTAD achieves outstanding performance while addressing foundational gaps in anomaly detection.

3. Method

A multivariate time series consists of observations of multiple variables over a period of time, denoted by $S \in \mathbb{R}^{T \times N}$, where $T$ is the time length and $N$ is the number of variables. Specifically, $S = \{o_1, o_2, \ldots, o_T\}$, where $o_t$, with $t \in \{1, 2, \ldots, T\}$, represents the observations of the $N$ variables at time point $t$: $o_t = \{v_1, v_2, \ldots, v_N\}$, with $v_i$, $i \in \{1, 2, \ldots, N\}$, representing the observation of the $i$th variable at time point $t$. Given that multivariate time series can be long, we usually divide them into sliding windows as inputs to the model.
Time series anomaly detection poses three foundational challenges, each of which shapes our design: (1) temporal dependency modeling, where the GRU layer is tailored to capture short-term local patterns (e.g., sensor drifts) while the Transformer attends to global context (e.g., diurnal cycles), explicitly addressing the multi-scale nature of industrial time series data; (2) non-stationarity mitigation, achieved through adversarial training to disentangle time-invariant features (e.g., equipment degradation signatures) from transient noise, as motivated by the failure of traditional thresholding under concept drift; and (3) anomaly sparsity and temporal locality, handled via the hybrid loss function that prioritizes reconstruction fidelity around abrupt deviations. We further contextualize these choices with domain-specific temporal constraints, such as the use of exponential smoothing in preprocessing to suppress high-frequency instrumentation noise without blurring step-change anomalies, and the dynamic thresholding mechanism that adapts to seasonal baseline shifts. Anomaly detection in multivariate time series focuses on continuously monitoring the time series and promptly detecting and reporting anomalous events.
Given a multivariate time series $S \in \mathbb{R}^{T \times N}$, the detector outputs a set of anomaly labels $L = \{l_1, l_2, \ldots, l_T\}$, with $l_t \in \{0, 1\}$ for $t \in \{1, 2, \ldots, T\}$, where $l_t = 0$ means that $S$ is not anomalous at time point $t$ and $l_t = 1$ means that $S$ is anomalous at time point $t$. Due to the complexity and variability of anomalies, which differ from normal data patterns, the model generates higher prediction deviations for anomalies. The prediction deviations are further amplified by the distinct effects of time-invariant features on normal versus anomalous data [34]. The model uses a fixed window of length $w$ of historical data, $x_t = \{o_{t-w+1}, o_{t-w+2}, \ldots, o_{t-1}\}$ with $o_t \in \mathbb{R}^{1 \times N}$, to predict the observation $\tilde{x}_t \in \mathbb{R}^{N}$ at time point $t$.
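The windowing step described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `make_windows` and the parameter `hist` (the number of past observations fed to the model) are hypothetical, and the toy series is made up.

```python
import numpy as np

def make_windows(series, hist):
    """Split a (T, N) series into (inputs, targets) pairs.

    inputs[j]  holds the `hist` observations immediately before a target;
    targets[j] is the observation the model should predict.
    `hist` is a hypothetical parameter name, not taken from the paper.
    """
    series = np.asarray(series, dtype=float)
    T = series.shape[0]
    if hist < 1 or hist >= T:
        raise ValueError("hist must satisfy 1 <= hist < T")
    # One window per predictable time point t = hist, ..., T - 1.
    inputs = np.stack([series[t - hist:t] for t in range(hist, T)])
    targets = series[hist:]
    return inputs, targets

S = np.arange(20, dtype=float).reshape(10, 2)   # toy series: T = 10, N = 2
X, y = make_windows(S, hist=4)
print(X.shape, y.shape)   # (6, 4, 2) (6, 2)
```

Each row of `X` is the history for the matching row of `y`, so the prediction error at every time point after the first `hist` steps can later be turned into an anomaly score.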
This study proposes TiTAD, a novel prediction framework that capitalizes on the Transformer’s attention mechanism to explicitly model complex variable interactions and efficiently extract spatio-temporal features. A critical innovation lies in the Transformer’s augmented memory mechanism, specifically engineered to learn time-invariant characteristics. The integration of GRUs preserves temporal order information, thereby enhancing intrinsic temporal feature extraction. TiTAD addresses feature redundancy by incorporating a Feature Fusion module that dynamically calibrates time-invariant feature weights. The FF module’s output undergoes linear transformation for temporal prediction, while the framework culminates in anomaly detection through prediction error analysis. Figure 1 illustrates TiTAD’s architectural overview.

3.1. Time-Invariant Transformer

The prediction models learn standard patterns from multivariate time series by improving prediction accuracy during training. The prediction accuracy at time $t$ depends strongly on the spatio-temporal properties of the corresponding sliding window, and these properties change dynamically. To help the model explore standard patterns, we use the Transformer to mine the dynamic spatio-temporal features within the sliding windows. During the anomaly inference phase, the prediction model generates higher prediction errors when predicting anomalous points that deviate from the standard patterns. In particular, we improve the discrimination of anomalies by leveraging time-invariant features. To ensure robustness against temporal distribution shifts, we formally define time-invariant features. A feature $f \in \mathcal{F}$ is considered time-invariant if it satisfies the conditions defined in Equation (1).
$$\begin{aligned}
(\mathrm{C1})\quad & \left| \frac{d f(X_t)}{dt} \right| \approx 0 && \text{(Continuous systems)}, \\
(\mathrm{C2})\quad & \left| \rho(f, t) \right| < \delta_\rho && \text{(Discrete systems, Spearman's rank test)},
\end{aligned} \tag{1}$$
where $\delta_\rho = 0.1$ denotes the correlation threshold and $X_t$ represents the input data at time $t$. These features, including spectral energy ratios and conserved physical quantities, are selected based on signal stationarity and domain-specific invariance principles. Time-invariant features complement the spatio-temporal features and thereby improve prediction precision on normal data. Anomalies, by contrast, correlate less with the spatio-temporal features, so the time-invariant features do not help, and may even harm, the prediction of anomalies. This makes the prediction deviation of anomalies more pronounced and easier to detect.
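Condition (C2) can be checked numerically as in the following sketch. It is an illustration under stated assumptions, not the authors' procedure: the helper names `spearman_rho` and `is_time_invariant` are hypothetical, the rank computation assumes no tied values, the test uses the absolute value of the correlation, and the two toy feature series are made up.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation of two 1-D arrays (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)   # ranks 0..n-1
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def is_time_invariant(feature_values, delta_rho=0.1):
    """Condition (C2): |rho(f, t)| < delta_rho for a sampled feature."""
    f = np.asarray(feature_values, dtype=float)
    t = np.arange(len(f))
    return abs(spearman_rho(f, t)) < delta_rho

# Toy features: one fluctuates without any trend, one drifts upward.
stable = np.array([3.0, 7.0, 0.0, 4.0, 2.0, 6.0, 1.0, 5.0])
drifting = np.linspace(0.0, 1.0, 8)

print(is_time_invariant(stable))     # True: rank correlation with time is 0
print(is_time_invariant(drifting))   # False: rank correlation with time is 1
```

A monotone drift is exactly what the rank test penalizes, whereas a feature that fluctuates around a fixed level passes it.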

3.1.1. Memory-Augmented Transformer

The integration of time-invariant features induces structural differentiation in prediction error distributions. We formalize the error amplification mechanism through error decomposition in Equation (2).
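$$\varepsilon = \varepsilon_s + \lambda \varepsilon_t, \tag{2}$$
where $\varepsilon \triangleq \lVert \hat{y} - y \rVert^2$ denotes the total prediction error, $\varepsilon_s$ represents the spatio-temporal component, and $\varepsilon_t$ quantifies the mismatch in time-invariant features. The coupling coefficient $\lambda$ governs the interaction intensity between the two components. For a normal sample $\mathcal{N}$, temporal consistency in invariant features ($C_t^{\mathcal{N}} > \tau$) leads to Equation (3).
$$\varepsilon_t^{\mathcal{N}} \propto \exp\left(-\alpha C_t^{\mathcal{N}}\right) \to 0 \quad (\alpha > 0), \tag{3}$$
yielding $\varepsilon^{\mathcal{N}} \approx \varepsilon_s^{\mathcal{N}}$. For an anomalous sample $\mathcal{A}$, temporal inconsistency ($C_t^{\mathcal{A}} < \tau$) triggers error amplification, as defined in Equation (4).
$$\varepsilon^{\mathcal{A}} = \varepsilon_s^{\mathcal{A}} + \lambda \underbrace{\Delta_t^{\mathcal{A}}}_{\text{Invariance Violation}}, \tag{4}$$
where $\Delta_t^{\mathcal{A}} \triangleq \left\lVert f_t(x^{\mathcal{A}}) - \mathbb{E}\left[f_t(x^{\mathcal{N}})\right] \right\rVert^2$ measures the deviation from invariant patterns.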
The characteristics of each variable and the relationships among variables are dynamic, but some characteristics and relationships are time-invariant. We use embedding vectors $E \in \mathbb{R}^{N \times d}$ to capture these time-invariant features, where $d$ is the embedding dimension. These vectors are continuously updated during training. The attention mechanism is a powerful tool for learning relationships between items and has been extensively applied in various domains, including link prediction and natural language processing. In this study, we apply the attention mechanism to the embedding vectors to extract the time-invariant relationships among variables, as defined in Equation (5).
$$Q = E W_q, \qquad K = E W_k, \qquad \omega = \frac{Q K^{\top}}{\sqrt{d_k}}, \tag{5}$$
where $Q$ and $K$ denote the query and key, respectively, $W_q$ and $W_k$ are learnable parameters, and $d_k$ is the query and key dimension.
The attention score $\omega \in \mathbb{R}^{N \times N}$ generated by the attention mechanism represents the pairwise relationships among the variables, which may be strong or weak. Weak correlations may be due to errors and are probably closer to noise. To reduce the influence of this noise, we apply a threshold to $\omega$. If the attention score between two variables is greater than the threshold, the relationship between them is considered stable; otherwise, it is considered unstable. This yields an adjacency matrix $G \in \mathbb{R}^{N \times N}$. We set the threshold to 0 for better training. $E$ and $G$ represent the time-invariant features from different perspectives: $E$ captures the long-term behavioral characteristics of each variable over $T$, and $G$ captures the stable inter-variable relationships throughout $T$.
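The memory mechanism above (Equation (5) followed by thresholding) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `time_invariant_graph` is hypothetical, random matrices stand in for the learned embeddings and projections, and encoding $G$ as a binary 0/1 matrix is an assumption, since the text only states that scores above the threshold mark stable relationships.

```python
import numpy as np

def time_invariant_graph(E, W_q, W_k, threshold=0.0):
    """Equation (5) followed by thresholding.

    E:   (N, d)   variable embeddings (learned in the real model)
    W_q: (d, d_k) query projection
    W_k: (d, d_k) key projection
    Returns omega (N, N) raw attention scores and G (N, N) adjacency.
    """
    Q = E @ W_q
    K = E @ W_k
    d_k = K.shape[1]
    omega = (Q @ K.T) / np.sqrt(d_k)
    # Assumption: the adjacency is binary, 1 where the score exceeds the threshold.
    G = (omega > threshold).astype(float)
    return omega, G

rng = np.random.default_rng(0)
N, d, d_k = 5, 8, 4
E = rng.normal(size=(N, d))        # stand-in for learned embeddings
W_q = rng.normal(size=(d, d_k))    # stand-in for learned projections
W_k = rng.normal(size=(d, d_k))
omega, G = time_invariant_graph(E, W_q, W_k)
print(omega.shape, G.shape)        # (5, 5) (5, 5)
```

Because `E` does not depend on the input window, `G` is the same for every window, which is what makes it a time-invariant description of inter-variable relationships.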

3.1.2. Enhanced Temporal Modeling

Figure 2 depicts the Transformer’s architecture. Given the dynamic evolution of variable characteristics and inter-variable relationships, we employ the Transformer to compute attention scores through variable similarity comparisons, enabling effective spatio-temporal feature aggregation within temporal windows. To amplify prediction deviations between normal and anomalous data, we implement dual time-invariant mechanisms. First, E is dynamically updated during training to capture stable behavioral patterns (periodicity/trends) per variable. E and X t are jointly processed to incorporate time-invariant temporal context in predictions. Second, we multiply the attention scores generated by the Transformer element-wise with G, which represents the long-term stable correlation between variables. This operation ensures predictions consider both transient and time-invariant correlations, effectively overcoming limitations caused by local temporal smoothness while maintaining sensitivity to persistent system-wide dependencies.
The time-invariant features exhibit selective efficacy, analogous to the proverb “One man’s meat is another man’s poison”. For normal data, these features suppress noise and provide long-term contextual cues, thereby improving prediction accuracy. Conversely, their mismatch with anomalous patterns exacerbates prediction errors, enhancing anomaly discriminability. Anomalous data, characterized by sporadic and statistically divergent temporal patterns, inherently conflict with time-invariant feature representations. This incongruity manifests in the prediction phase and the detection phase. In the prediction phase, time-invariant features, optimized for stable normal patterns, fail to guide predictions for anomaly-driven irregularities. This results in amplified reconstruction/prediction errors due to the mismatch between invariant feature priors and anomalous dynamics. In the detection phase, the exaggerated prediction errors act as discriminative signals, creating a stronger separation between normal and anomalous distributions in the error space. This discriminative effect lowers the decision boundary for anomaly identification, effectively transforming the model’s prediction weakness into a detection strength. Our Transformer’s attention scores are calculated by Equation (6).
$$Q = (X_t + E) W_Q, \qquad K = (X_t + E) W_K, \qquad V = X_t + E, \qquad \alpha = \psi\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \odot G. \tag{6}$$
Here $Q$, $K$, $V \in \mathbb{R}^{N \times d}$, and $\psi$ denote the query, key, value, and normalization operation, respectively. $W_Q$, $W_K \in \mathbb{R}^{d \times d}$ are learnable matrices, and $d_k$ is the dimension of the query, key, and value. $\alpha \in \mathbb{R}^{N \times N}$ represents the relationships between the variables. Finally, we aggregate the spatial features $S_t \in \mathbb{R}^{N \times d}$ through Equation (7).
$$S_t = \alpha V. \tag{7}$$
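The attention computation of Equations (6) and (7) can be sketched as below. Several details are assumptions made for illustration only: the normalization $\psi$ is taken to be a row-wise softmax, the adjacency matrix $G$ is applied element-wise after normalization, the window $X_t$ is assumed to be already projected to shape $(N, d)$, and the function name `time_invariant_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def time_invariant_attention(X_t, E, W_Q, W_K, G):
    """Sketch of Equations (6) and (7).

    X_t: (N, d) window features, E: (N, d) time-invariant embeddings,
    W_Q, W_K: (d, d) projections, G: (N, N) time-invariant adjacency.
    """
    H = X_t + E                      # inject time-invariant context
    Q = H @ W_Q
    K = H @ W_K
    V = H
    d_k = Q.shape[1]
    # Assumption: psi is a row-wise softmax, and G masks the result.
    alpha = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) * G
    S_t = alpha @ V                  # Equation (7)
    return alpha, S_t

rng = np.random.default_rng(0)
N, d = 5, 8
X_t = rng.normal(size=(N, d))
E = rng.normal(size=(N, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
G = (rng.random((N, N)) > 0.5).astype(float)   # toy adjacency
alpha, S_t = time_invariant_attention(X_t, E, W_Q, W_K, G)
print(alpha.shape, S_t.shape)   # (5, 5) (5, 8)
```

With this ordering, a variable pair that $G$ marks as unstable contributes nothing to the aggregation, no matter how similar the two variables look inside the current window.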

3.1.3. Feature Interaction Mechanism

Among existing methods, Transformers have been predominantly employed for temporal feature extraction due to their exceptional capability in capturing inter-temporal relationships, analogous to word interaction modeling in NLP. However, the native attention mechanism does not inherently preserve sequential order, a critical property for time series analysis [27]. GRU, as a streamlined RNN variant, matches LSTM’s performance with reduced parameterization while effectively retaining temporal order information and preserving long-term dependencies. We therefore integrate GRUs to complement the Transformers’ temporal modeling capabilities through sequential information retention. Specifically, the GRU acts as a local temporal encoder, processing sequential data stepwise to retain fine-grained short-term dynamics. Its hidden states preserve local temporal order, providing context-rich inputs to the Transformer. The Transformer leverages global self-attention to model long-range dependencies and variable correlations, enhanced by time-invariant embeddings that stabilize relationships across time. When the GRU detects a short-term spike in a sensor, the Transformer evaluates whether this anomaly contradicts the long-term behavioral patterns of other variables (e.g., stable correlations learned via time-invariant embeddings). This collaboration distinguishes true faults from noise, as the GRU’s local sensitivity and the Transformer’s global reasoning mutually validate anomalies.
Specifically, we first calculate the update gate state $Z_t \in \mathbb{R}^{N \times h}$ and the reset gate state $R_t \in \mathbb{R}^{N \times h}$, where $h$ is the hidden dimension of the GRU. The reset gate determines how much information from the previous hidden state $H_{t-1}$ is retained when constructing the candidate hidden state $\tilde{H}_t$. The update gate decides how much information from $H_{t-1}$ is retained and how much information from $\tilde{H}_t$ is accepted when constructing the current hidden state $H_t$, as defined in Equation (8).
$$R_t = \sigma\left(X_t W_{xr} + H_{t-1} W_{hr} + b_r\right), \qquad Z_t = \sigma\left(X_t W_{xz} + H_{t-1} W_{hz} + b_z\right), \tag{8}$$
where $W_{xr}, W_{xz} \in \mathbb{R}^{d \times h}$ and $W_{hr}, W_{hz} \in \mathbb{R}^{h \times h}$ are weight matrices, $b_r, b_z \in \mathbb{R}^{1 \times h}$ are bias vectors, and $\sigma$ is the sigmoid function. After that, we use $R_t$ to compute the candidate hidden state $\tilde{H}_t \in \mathbb{R}^{N \times h}$, as defined in Equation (9).
$$\tilde{H}_t = \tanh\left(X_t W_{xh} + \left(R_t \odot H_{t-1}\right) W_{hh} + b_h\right), \tag{9}$$
where $W_{xh} \in \mathbb{R}^{d \times h}$ and $W_{hh} \in \mathbb{R}^{h \times h}$ are weight matrices, and $b_h \in \mathbb{R}^{1 \times h}$ is a bias vector. Then, we combine $\tilde{H}_t$ and the update gate $Z_t$ to calculate $H_t \in \mathbb{R}^{N \times h}$. Finally, $H_t$ is taken as the final output $T_t \in \mathbb{R}^{N \times h}$, as defined in Equation (10).
$$H_t = Z_t \odot H_{t-1} + \left(1 - Z_t\right) \odot \tilde{H}_t, \qquad T_t = H_t. \tag{10}$$
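The GRU update of Equations (8) to (10) translates directly into NumPy. The sketch below is illustrative only: the helper `init_gru_params`, the weight scale, the zero initial hidden state, and the way the window is fed step by step as a `(w, N, d)` array are assumptions, not details given in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(d, h, rng, scale=0.1):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    p = {}
    for gate in ("r", "z", "h"):
        p["W_x" + gate] = rng.normal(scale=scale, size=(d, h))
        p["W_h" + gate] = rng.normal(scale=scale, size=(h, h))
        p["b_" + gate] = np.zeros((1, h))
    return p

def gru_step(X_t, H_prev, p):
    """One GRU update following Equations (8), (9), and (10)."""
    R = sigmoid(X_t @ p["W_xr"] + H_prev @ p["W_hr"] + p["b_r"])   # reset gate
    Z = sigmoid(X_t @ p["W_xz"] + H_prev @ p["W_hz"] + p["b_z"])   # update gate
    H_cand = np.tanh(X_t @ p["W_xh"] + (R * H_prev) @ p["W_hh"] + p["b_h"])
    return Z * H_prev + (1.0 - Z) * H_cand

rng = np.random.default_rng(0)
N, d, h, w = 5, 8, 6, 10
params = init_gru_params(d, h, rng)
X_seq = rng.normal(size=(w, N, d))   # toy window: w steps of (N, d) inputs
H = np.zeros((N, h))                 # assumed initial hidden state
for X_t in X_seq:
    H = gru_step(X_t, H, params)
T_t = H                              # temporal features, Equation (10)
print(T_t.shape)                     # (5, 6)
```

Because each step consumes the previous hidden state, swapping two time steps changes `T_t`, which is the order sensitivity that plain self-attention lacks.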

3.2. Feature Fusion

The FF module incorporates NeXtVLAD, originally developed for video processing. To mitigate model overemphasis on time-invariant features, this module recalibrates spatio-temporal feature weights for optimal balance. Video data, conceptualized as multivariate time series, exhibit inherent spatio-temporal correlations. Each pixel represents a variable, while frame sequences capture temporal observations [35]. NeXtVLAD’s success in video feature fusion stems from its ability to aggregate intra-frame (spatial) and inter-frame (temporal) dependencies [30,36]. As a VLAD extension [37], NeXtVLAD reduces parameterization while enhancing performance over NetVLAD [38]. Its proficiency in spatio-temporal dependency aggregation makes it ideal for fusing temporal patterns and invariant characteristics in our framework.
Figure 3 shows the architecture of NeXtVLAD. Before Feature Fusion, our pipeline conducts feature decomposition followed by clustering operations. The process computes residuals between decomposed features and cluster centroids based on grouping results. This methodology achieves dual objectives: mitigating noise contamination through efficient feature representation while amplifying spatio-temporal feature saliency to counteract the disproportionate emphasis on time-invariant characteristics. Specifically, we first feed the concatenated spatio-temporal features $(T_t \,\Vert\, S_t) \in \mathbb{R}^{N \times 2d}$ to a linear layer that expands their dimension to $x \in \mathbb{R}^{N \times 2\lambda d}$. Then, we divide $x$ into $G$ groups, $\bar{x} \in \mathbb{R}^{N \times G \times \frac{2\lambda d}{G}}$ (here $G$ denotes the number of groups, not the adjacency matrix of Section 3.1.1); this step can be seen as decomposing $x$ into $G$ low-dimensional feature vectors $\bar{x}$. We then cluster $\bar{x}$; the cluster centers $\mu \in \mathbb{R}^{K \times \frac{2\lambda d}{G}}$ are learned during model training and fit the time-invariant features. Next, we compute the probability $\alpha_{gk}$ that each group feature $\bar{x}_n^g$ belongs to the cluster center $\mu_k$, using a linear layer and softmax, as defined in Equation (11).
$$\begin{aligned}
\operatorname{softmax}(z)_i &= \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, \quad i = 1, \ldots, K, \\
\alpha_{gk}(x_n) &= \operatorname{softmax}\left(w_{kg}^{\top} x_n + b_{kg}\right) = \frac{\exp\left(w_{kg}^{\top} x_n + b_{kg}\right)}{\sum_{k'=1}^{K} \exp\left(w_{k'g}^{\top} x_n + b_{k'g}\right)}, \\
& n \in \{1, \ldots, N\}, \quad k \in \{1, \ldots, K\}, \quad g \in \{1, \ldots, G\}.
\end{aligned} \tag{11}$$
We then aggregate the residuals between each group of features $\bar{x}$ and the cluster centers $\mu$, weighted by the probability $\alpha_{gk}$, to eliminate redundant information and noise; the residuals represent dynamic features, as defined in Equation (12).
$$\tilde{z}_k^g = \sum_{n=1}^{N} \alpha_{gk}(x_n) \left(\bar{x}_n^g - \mu_k\right). \tag{12}$$
Next, we use a linear layer and a sigmoid operation (denoted by $\sigma$) to calculate the weight $\alpha_g$ of each group of features $\bar{x}$ and construct the global features $z_k$ of the input data based on $\alpha_g$, as defined in Equation (13).
$$\alpha_g(x_n) = \sigma\left(w_g^{\top} x_n + b_g\right), \qquad z_k = \sum_{g=1}^{G} \sum_{n=1}^{N} \alpha_g(x_n)\, \alpha_{gk}(x_n) \left(\bar{x}_n^g - \mu_k\right). \tag{13}$$
Finally, a linear layer is applied to mine high-level features and to perform the dimensional transformation required by the prediction task. The final output is represented as $A \in \mathbb{R}^{N \times d}$.
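The grouping, soft assignment, and residual aggregation of Equations (11) to (13) can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' module: the function name `nextvlad_aggregate`, the parameter names, and the random stand-ins for learned weights are hypothetical. The final linear layer that produces $A$ is omitted because the text does not specify how the aggregated descriptor is mapped back to shape $(N, d)$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nextvlad_aggregate(F, p, groups, clusters):
    """Sketch of Equations (11)-(13) for features F of shape (N, 2d)."""
    x = F @ p["W_expand"] + p["b_expand"]                 # (N, D), D = 2*lam*d
    N, D = x.shape
    x_bar = x.reshape(N, groups, D // groups)             # (N, G, D/G)
    logits = (x @ p["W_gk"] + p["b_gk"]).reshape(N, groups, clusters)
    a_gk = softmax(logits, axis=-1)                       # Eq. (11), (N, G, K)
    a_g = sigmoid(x @ p["W_g"] + p["b_g"])                # group gate, (N, G)
    resid = x_bar[:, :, None, :] - p["mu"][None, None]    # (N, G, K, D/G)
    weights = (a_g[:, :, None] * a_gk)[..., None]         # (N, G, K, 1)
    return (weights * resid).sum(axis=(0, 1))             # Eq. (13), (K, D/G)

rng = np.random.default_rng(0)
N, d, lam, n_groups, K = 5, 4, 2, 4, 3
D = 2 * lam * d
params = {   # random stand-ins for learned parameters
    "W_expand": rng.normal(size=(2 * d, D)), "b_expand": np.zeros(D),
    "W_gk": rng.normal(size=(D, n_groups * K)), "b_gk": np.zeros(n_groups * K),
    "W_g": rng.normal(size=(D, n_groups)), "b_g": np.zeros(n_groups),
    "mu": rng.normal(size=(K, D // n_groups)),
}
F = rng.normal(size=(N, 2 * d))      # stand-in for concatenated T_t and S_t
z = nextvlad_aggregate(F, params, groups=n_groups, clusters=K)
print(z.shape)                       # (3, 4)
```

The sigmoid gate `a_g` is what allows the module to down-weight whole groups of features, while the softmax assignment `a_gk` decides which cluster center each group is compared against.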

3.3. Optimization and Anomaly Inference

3.3.1. Prediction

The model uses the standard patterns learned during training to generate predictions for test data. Because anomalous data follow complex and variable patterns that differ from those of normal data, and because time-invariant features affect anomalous and normal data differently, the prediction deviation is pronounced for anomalies, which enables detection through error analysis. We pass the features fused by FF through a linear layer, yielding a prediction $\tilde{x}_t \in \mathbb{R}^{N}$.

3.3.2. Loss Function

Compared with MAE, MSE is more sensitive to outliers and prioritizes them at the expense of accuracy on normal data. However, anomaly detection models learn the normal patterns during training, and the training set rarely contains anomalies, which are the usual source of outliers, so the impact of outliers is small. We therefore adopt the Mean Squared Error (MSE) as the loss function to accelerate model convergence. It averages the squared differences between the prediction $\tilde{x}_t$ and the ground truth $x_t$, a common choice for regression problems, as defined in Equation (14).
$$
Loss = \frac{1}{N} \left\| \tilde{x}_t - x_t \right\|_2^2.
$$
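The difference in outlier sensitivity between MSE and MAE can be seen in a small worked example. The helper names `mse_loss` and `mae_loss` and the toy vectors are hypothetical and serve only to illustrate Equation (14).

```python
import numpy as np

def mse_loss(pred, target):
    # Equation (14): mean of squared differences over the N variables
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def mae_loss(pred, target):
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.abs(diff)))

target = np.array([1.0, 2.0, 3.0, 4.0])
pred_uniform = target + 0.5                    # small error on every variable
pred_outlier = np.array([1.0, 2.0, 3.0, 8.0])  # one large error, others exact

print(mse_loss(pred_uniform, target), mae_loss(pred_uniform, target))  # 0.25 0.5
print(mse_loss(pred_outlier, target), mae_loss(pred_outlier, target))  # 4.0 1.0
```

The single large deviation raises MSE sixteenfold but MAE only twofold, which is the sensitivity the text refers to.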

3.3.3. Anomaly Inference

We compute an anomaly score for each variable from the distance between the predicted value and the ground truth, then aggregate the variable-wise scores into a score for each time point. The variable-level scores are diagnostically valuable: they facilitate root cause localization and improve model interpretability. To calculate the anomaly scores, we measure the absolute error (the per-point term of the Mean Absolute Error, MAE) between $\tilde{x}_t^i$ and $x_t^i$ of the $i$th variable at time $t$, as defined in Equation (15).
$$
s_t^i = \left| \tilde{x}_t^i - x_t^i \right|.
$$
Because variables differ in how difficult they are to predict (for example, discrete versus continuous data), certain variables may have a significantly higher average prediction error than others. To neutralize this bias in the final anomaly score, we adopt the normalization strategy employed by GDN. We first compute the median ($\mu_i$) and the interquartile range (IQR) ($\sigma_i$) over the time series of the $i$th variable’s scores and normalize the scores as defined in Equation (16); the resulting scores are then smoothed with a simple moving average.
$$
S_t^i = \frac{s_t^i - \mu_i}{\sigma_i}.
$$
From these normalized prediction deviations we obtain an anomaly score for each variable at each time point and take the maximum over the variables as the overall score of that time point, as defined in Equation (17).
$$
A_t = \max_{i} S_t^i.
$$
Finally, we set a threshold on the anomaly score: time steps with scores above the threshold are detected as anomalous, and time steps with scores below it are detected as normal. Threshold selection critically influences anomaly inference [34], and numerous methods exist for this purpose (e.g., Peaks-Over-Threshold (POT) [39]). A threshold could be obtained by tuning the parameters of POT; for simplicity, we instead exhaustively evaluate candidate thresholds and select the one that yields the highest $F_1$ score.
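The scoring pipeline of Equations (15)–(17) and the exhaustive $F_1$-driven threshold search can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors’ code: the function names, the small `eps` added to the IQR to avoid division by zero, the use of every observed score as a candidate threshold, and the `>=` comparison are choices made for the example, and the moving-average smoothing step is omitted for brevity.

```python
import numpy as np

def anomaly_scores(pred, truth, eps=1e-8):
    """pred, truth: arrays of shape (T, N). Returns one score per time step."""
    err = np.abs(pred - truth)                        # Eq. (15): per-variable absolute error
    median = np.median(err, axis=0)
    q75, q25 = np.percentile(err, [75, 25], axis=0)
    normalized = (err - median) / (q75 - q25 + eps)   # Eq. (16): median/IQR normalization
    return normalized.max(axis=1)                     # Eq. (17): max over variables

def f1_score(pred_labels, labels):
    pred_labels = np.asarray(pred_labels, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = np.sum(pred_labels & labels)
    fp = np.sum(pred_labels & ~labels)
    fn = np.sum(~pred_labels & labels)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return float(2 * prec * rec / (prec + rec))

def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best F1."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        f1 = f1_score(scores >= t, labels)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Synthetic data: small prediction noise everywhere, one injected deviation on variable 1.
rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 3))
pred = truth + 0.1 * rng.normal(size=(200, 3))
pred[50:55, 1] += 5.0
labels = np.zeros(200, dtype=bool)
labels[50:55] = True

scores = anomaly_scores(pred, truth)
threshold, best_f1 = best_f1_threshold(scores, labels)
print(threshold, best_f1)
```

Note that a search of this kind relies on the test labels, so the selected threshold reflects the best achievable $F_1$ rather than a value available at deployment time.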

4. Experiment

In the experimental section, we evaluate our method using three publicly available anomaly detection datasets. These datasets are selected to encompass a variety of real-world scenarios, ensuring a thorough assessment of the model’s performance. Our experiments compare TiTAD with several state-of-the-art baseline methods, and the results consistently show that our model outperforms these baselines across all datasets.

4.1. Experimental Setup

4.1.1. Datasets

In this study, we evaluate our method using three publicly available anomaly detection datasets: Secure Water Treatment (SWaT), Water Distribution (WADI), and Server Machine Dataset (SMD).
  • Secure Water Treatment [40,41]: The SWaT dataset originates from an operational water treatment facility, capturing eleven days of continuous multivariate operational data encompassing actuator states, sensor measurements, and network telemetry. This dataset includes seven days of data collected during non-attack operations and four days of data collected under attack conditions. The seven days of operational data without attacks are used for training, while the four days of operational data under attack are used for testing.
  • Water Distribution [42]: The WADI dataset consists of sixteen days of actuator and sensor data from a city Water Distribution System (WDS). It includes two weeks of data collected during operation without attacks and two days of data collected under attack conditions. The two weeks of operation data without attacks are used for training, and the two days of operation data in the presence of attack behavior are used for testing.
  • Server Machine Dataset [9]: The SMD dataset is provided by a large internet company and is composed of monitoring data from twenty-eight server machines over a period of five weeks, all with the same variables. The SMD dataset has twenty-eight sub-datasets, each of which needs to be independently trained and tested.
The statistics of the three datasets are shown in Table 1. Their training sets contain no anomalies. For SWaT and WADI, we adopt a downsampling strategy to speed up training. For SMD, we train and test the twenty-eight sub-datasets separately and then average the results. Since TranAD and AnomalyTransformer [12] use different evaluation methods for SMD, we also use their respective evaluation methods to ensure fair comparisons.

4.1.2. Data Preprocessing

Because some variables have much larger orders of magnitude than others, we use StandardScaler to normalize each variable to zero mean and unit variance, as defined in Equation (18).
$$
x' = \frac{x - \mu}{\sigma}.
$$
Specifically, we compute the mean $\mu$ and standard deviation $\sigma$ of each variable on the training set and then use the same $\mu$ and $\sigma$ to normalize the test sets. While z-score normalization assumes stationarity, our sliding window design inherently mitigates moderate local trends by normalizing relative variations within each window. For datasets with strong global trends, we recommend preliminary detrending (e.g., differencing). The current approach remains valid for our benchmark datasets, where operational phase alignment ensures quasi-stationarity. Since multivariate time series are long, we divide them into sliding windows that can overlap as inputs: $x_t = [o_{t-w+1}, o_{t-w+2}, \ldots, o_{t-1}]$, where $o_t \in \mathbb{R}^{1 \times N}$, $w$ represents the sliding window length, and $N$ represents the number of variables. To extract high-dimensional features from the input data, we first apply a projection layer, as defined in Equation (19).
$$
X_t = x_t^{T} W_{line},
$$
where $W_{line} \in \mathbb{R}^{w \times d}$ represents the trainable weight matrix and $X_t \in \mathbb{R}^{N \times d}$ represents the extracted high-dimensional features. The projection layer serves as a critical preprocessing step that transforms raw sensor readings into a latent space that better captures discriminative spatio-temporal patterns. Specifically, it applies a learnable linear transformation to each sliding window of normalized input data, compressing the temporal dimension while preserving cross-variable correlations.
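The preprocessing steps above (training-set standardization, sliding windows, and the linear projection of Equation (19)) can be sketched in NumPy. The sketch makes several assumptions: plain NumPy replaces StandardScaler, each window holds exactly $w$ consecutive observations, the stride is 1, and the weight matrix is random rather than trained. All function names are hypothetical.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    # Statistics come from the training set only and are reused for the test set.
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / (sigma + eps), (test - mu) / (sigma + eps)

def sliding_windows(series, w, stride=1):
    # series: (T, N) -> windows: (num_windows, w, N); windows overlap when stride < w
    starts = range(0, series.shape[0] - w + 1, stride)
    return np.stack([series[s:s + w] for s in starts])

def project(windows, W_line):
    # Eq. (19) applied per window: (N, w) @ (w, d) -> (N, d)
    return np.transpose(windows, (0, 2, 1)) @ W_line

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(100, 3))
test = rng.normal(loc=10.0, scale=3.0, size=(40, 3))

train_n, test_n = standardize(train, test)
windows = sliding_windows(test_n, w=8)
X = project(windows, rng.normal(size=(8, 16)))
print(windows.shape, X.shape)  # (33, 8, 3) (33, 3, 16)
```

After the projection, every variable in a window is represented by a single $d$-dimensional vector, which is the sense in which the temporal dimension is compressed while the variable axis is kept intact.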

4.1.3. Implementation Details

We used PyTorch [43] version 1.12.1, the PyTorch Geometric library [44] version 1.5.0, and CUDA 11.6, and all experiments were performed on an NVIDIA RTX 4080. For all experiments, we fixed the time window size to 64, the number of groups $G$ to 16, the number of clustering centers $K$ to 4, and $\lambda$ to 2. The embedding vector dimensionality is fixed at 128 for the WADI dataset and at 512 for the SWaT dataset; the settings for the 28 SMD sub-datasets are not presented owing to limited space. The linear layer and GRU dimensionality are set equal to the embedding vector dimensionality. The number of layers for both the Transformer and the GRU is set to 3. We adopted the Adam [45] optimizer with a fixed learning rate of 0.001 and a batch size of 256. The number of epochs is set to 70 for the SWaT and WADI datasets and to 250 for the SMD dataset. We randomly held out 30% of the training data for validation, except for the WADI dataset, where 20% was held out. To avoid overfitting, we applied an early-stopping criterion and stopped training when the validation accuracy began to drop continuously.
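For reference, the reported settings can be collected in a small configuration object, shown here together with one possible reading of the early-stopping rule. The class and field names are hypothetical, and the `patience` value is an assumption: the text states only that training stops once validation accuracy declines continuously. No SMD configuration is given because its embedding dimensionality is not reported.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TiTADConfig:
    dataset: str
    embed_dim: int            # also used for the linear layer and GRU width
    epochs: int
    val_fraction: float
    window_size: int = 64
    num_groups: int = 16      # G
    num_clusters: int = 4     # K
    lam: int = 2              # lambda, as reported
    num_layers: int = 3       # Transformer and GRU layers
    learning_rate: float = 1e-3
    batch_size: int = 256

SWAT = TiTADConfig("SWaT", embed_dim=512, epochs=70, val_fraction=0.3)
WADI = TiTADConfig("WADI", embed_dim=128, epochs=70, val_fraction=0.2)

class EarlyStopping:
    """Signals a stop after `patience` consecutive epochs of declining validation accuracy."""

    def __init__(self, patience: int = 5):  # patience is an assumed value
        self.patience = patience
        self.previous = None
        self.declines = 0

    def should_stop(self, val_accuracy: float) -> bool:
        if self.previous is not None and val_accuracy < self.previous:
            self.declines += 1
        else:
            self.declines = 0
        self.previous = val_accuracy
        return self.declines >= self.patience

stopper = EarlyStopping(patience=2)
history = [0.80, 0.85, 0.84, 0.83, 0.82]
flags = [stopper.should_stop(v) for v in history]
print(flags)  # [False, False, False, True, True]
```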

4.1.4. Baselines

We select DAGMM [10], LSTM-VAE [32], OmniAnomaly [9], MSCRED [33], USAD [31], THOC [46], MTAD-GAT [47], CAE-M [48], GDN [17], TranAD [13], AnomalyTransformer [12], and GRN [18] as our baseline methods. MSCRED captures multi-scale temporal patterns and inter-variable dependencies using a hybrid CNN-RNN architecture. The Temporal Hierarchical One-Class network (THOC) leverages hierarchical temporal representations for one-class anomaly detection. The Deep Convolutional Autoencoder Memory network (CAE-M) enhances autoencoder robustness by integrating a memory module for prototypical normal patterns. The Multivariate Time Series Anomaly Detection with Graph Attention Network (MTAD-GAT) combines graph attention layers with temporal convolutions to explicitly model variable interactions. In particular, because 15 variables are removed during the training of GDN on the WADI dataset, we re-evaluated GDN on the WADI dataset. The original papers of TranAD and AnomalyTransformer report results only with point adjustment, so we re-evaluated these methods without point adjustment.

4.1.5. Evaluation Metrics

We employ the $F_1$ score, a commonly applied evaluation metric in multivariate time series anomaly detection, which summarizes both the precision and the recall of anomaly detection. The $F_1$ score is computed as defined in Equation (20).
$$
F_1 = \frac{2 \times Prec \times Rec}{Prec + Rec},
$$
$$
Prec = \frac{TP}{TP + FP},
$$
$$
Rec = \frac{TP}{TP + FN},
$$
where $TN$, $TP$, $FN$, and $FP$ are the counts of true negatives, true positives, false negatives, and false positives, respectively.

4.2. Main Results

4.2.1. Performance Comparison

To ensure reproducibility, we provide complete implementation details, including hyperparameter configurations and training protocols. All experiments were conducted under identical hardware and software conditions so that results are directly comparable. Table 2 presents the anomaly detection performance of TiTAD compared with the baseline models. Note that TranAD and AnomalyTransformer do not evaluate SMD’s 28 sub-datasets individually: TranAD tests only four sub-datasets (machine-1-1/2-1/3-2/3-7), while AnomalyTransformer concatenates all sub-datasets before evaluation. Our replication of their methodologies yields $F_1$ scores of 0.437 (TranAD) and 0.2286 (AnomalyTransformer) on SMD. GDN achieves an $F_1$ of 0.569 on WADI’s 15-variable subset, whereas TiTAD surpasses this with an $F_1$ of 0.658. TiTAD consistently outperforms all baselines across the three benchmark datasets. Effective anomaly detection necessitates robust spatio-temporal feature extraction. LSTM-VAE, OmniAnomaly, USAD, and AnomalyTransformer underperform due to inadequate inter-variable relationship modeling. DAGMM’s neglect of temporal dynamics severely compromises its effectiveness. GDN’s graph-based spatial modeling achieves notable performance, particularly on the high-dimensional WADI dataset, yet its disregard for temporal patterns limits its overall efficacy. MSCRED’s sequential processing of spatial and temporal features fails to achieve effective spatio-temporal fusion. Critically, existing methods overlook the use of time-invariant features for anomaly amplification. TiTAD’s dual exploitation of spatio-temporal dynamics and time-invariant characteristics drives its superior detection performance.

4.2.2. Anomaly Diagnosis

TiTAD provides anomaly diagnostic capabilities, which we evaluate using the widely used metrics HitRate@P% [9] and NDCG@P% [49]. HitRate@P% is defined as $\frac{\text{Hit}@\lfloor P\% \times |GT_t| \rfloor}{|GT_t|}$, where $|GT_t|$ denotes the number of ground truth dimensions (i.e., labeled anomalous features) and $\lfloor P\% \times |GT_t| \rfloor$ represents the number of top candidates (i.e., highest-ranked predicted anomalies). HitRate@P% measures the proportion of ground truth dimensions found within the top candidates that the model inferred. Unlike HitRate@P%, Normalized Discounted Cumulative Gain (NDCG) rewards models for ranking true anomalies higher within the same top-k candidates. To explore TiTAD’s diagnostic performance, we used the SMD dataset, which provides ground truth dimensions and published baselines. As shown in Table 3, our model achieves the second-best HitRate@100% and the best HitRate@150%. In particular, our model outperforms the others in NDCG@P%, which further demonstrates its ranking accuracy and robustness to overprediction.
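The two diagnosis metrics can be illustrated for a single time point as follows. This is a sketch under assumptions: relevance is binary (a dimension is either in the ground truth or not), the discount is the usual base-2 logarithm of the rank, and the ideal ranking places all ground truth dimensions first. The function names and toy scores are hypothetical, and `p` is given as a fraction (1.0 for 100%, 1.5 for 150%).

```python
import math

def top_candidates(scores, gt_dims, p):
    # Number of candidates is floor(P% * |GT_t|); rank dimensions by descending score.
    k = math.floor(p * len(gt_dims))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def hit_rate_at_p(scores, gt_dims, p):
    top = top_candidates(scores, gt_dims, p)
    return len(set(top) & set(gt_dims)) / len(gt_dims)

def ndcg_at_p(scores, gt_dims, p):
    gt = set(gt_dims)
    top = top_candidates(scores, gt_dims, p)
    dcg = sum(1.0 / math.log2(rank + 2) for rank, dim in enumerate(top) if dim in gt)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(top), len(gt))))
    return dcg / ideal if ideal > 0 else 0.0

# Per-variable anomaly scores at one time point; dimensions 1 and 2 are truly anomalous.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
gt_dims = [1, 2]

print(hit_rate_at_p(scores, gt_dims, 1.0))  # 0.5: only dimension 1 is in the top 2
print(hit_rate_at_p(scores, gt_dims, 1.5))  # 1.0: both are in the top 3
print(round(ndcg_at_p(scores, gt_dims, 1.5), 4))  # 0.9197
```

At 150% both ground truth dimensions are found, yet NDCG stays below 1 because dimension 2 is ranked third rather than second, which is the ranking sensitivity that HitRate lacks.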

4.3. Model Analysis

4.3.1. Ablation Analysis

“Ablation analysis” refers to systematically removing or modifying components of a model to quantify their individual impact on performance. This method is a cornerstone of interpretable model design in complex systems. We conducted an ablation analysis to isolate the contributions of the GRU module, the Transformer module, and the fusion layer. Table 4 demonstrates the critical importance of each component. The original Transformer variant (“w/o:TiF”) shows performance degradation ($F_1$: 0.785 vs. 0.842 on SWaT) due to its inability to leverage time-invariant features, which are particularly crucial for detecting subtle anomalies through long-term pattern analysis. The Transformer captures global dependencies via self-attention. Removing this spatial modeling capability (“w/o:Transformer”) reduces $F_1$ by 11.5% on SWaT, confirming the necessity of inter-variable relationship analysis for distinguishing local fluctuations from system-level anomalies. Removing the GRU (“w/o:GRU”) causes the most severe performance drop ($F_1$: 0.528 vs. 0.842 on SWaT), highlighting the essential role of sequential processing in capturing temporal dynamics. The GRU specializes in modeling local temporal dependencies through its gated recurrent structure. The GRU’s output (local features) serves as input to the Transformer, which then refines these features by incorporating global context. Simplified Feature Fusion (“w/o:FF”) degrades performance by 30.9% on WADI, emphasizing the need for sophisticated integration of spatio-temporal and invariant features. The memory-augmented time-invariant module stabilizes training by providing persistent normal patterns to both the GRU and the Transformer. These results (as shown in Table 4), combined with theoretical insights about the GRU’s gate mechanisms filtering irrelevant temporal contexts for the Transformer’s global modeling, collectively suggest that their interaction balances local–global trade-offs rather than merely stacking functions.
These findings collectively validate our architectural design choices, demonstrating that the synergistic combination of temporal invariance modeling, spatial attention, sequential processing, and adaptive fusion is fundamental to TiTAD’s superior anomaly detection capability.

4.3.2. Point-Adjustment Analysis

Point adjustment treats all anomalies within a continuous abnormal segment as correctly identified if any single time point within that segment is detected. The approach is justified by the observation that one detected anomalous time point triggers an alert and draws attention to the entire segment, so detecting a single point effectively implies recognizing the whole segment. This strategy is standard practice in time series anomaly detection frameworks [9,12,13,46] and enables certain algorithms to report $F_1$ scores exceeding 0.90. However, recent studies [50,51,52] demonstrate that point adjustment significantly overstates the actual performance of existing methods. As evidenced in Table 5, randomly generated scores outperform state-of-the-art methods under point adjustment. This phenomenon arises because longer anomaly segments statistically increase the probability of random hits [53]. More critically, we identify a negligible correlation between raw and adjusted scores. To ensure rigorous evaluation, we exclusively report non-adjusted $F_1$ scores.
To illustrate how point adjustment artificially inflates performance metrics and induces methodological bias, we analyze the flaws of this strategy using AnomalyTransformer as an example. Figure 4 reveals frequent false positives in normal regions for AnomalyTransformer on the SWaT dataset. For comparison, Figure 5 presents the anomaly inference results of our proposed method on the same dataset. Figure 4a illustrates the anomaly scores generated by AnomalyTransformer on the SWaT dataset. Our analysis (Figure 5a) reveals partial detection capability: specific anomaly segments exhibit elevated scores, yet non-salient scores for numerous anomalies impede their distinction from normal data. Figure 4b and Figure 5b display the anomaly label distribution, showing a homogeneous distribution with frequent label occurrences in anomaly segments. This pattern induces excessive false positives, resulting in significantly sub-optimal detection accuracy. Figure 4c and Figure 5c illustrate the point-adjusted results. The presence of sporadic labels within anomaly segments enables artificial true positive inflation while leaving the false positive count unchanged [50]. Crucially, AnomalyTransformer’s threshold selection (upper quartile of scores) reduces the total number of anomaly labels, partially mitigating the impact of false positives and enabling artificial $F_1$ score inflation.
Our analysis demonstrates that point-adjusted scores constitute statistically biased estimators that fail to reflect true algorithmic performance. This adjustment biases algorithms toward generating anomaly labels that are spread homogeneously and kept few in number. Such methodological artifacts explain the observed discrepancy between adjusted and raw $F_1$ scores, where artificially inflated $F_1$ metrics emerge from statistical gaming rather than detection capability. In operational contexts, these uniform anomaly labels are analogous to stochastic sampling protocols, which are theoretically valid for security audits yet deficient in practical detection scenarios. Operational anomaly detection necessitates capturing genuine anomaly patterns with authentic temporal frequencies and spatial distributions, rather than optimizing for statistical artifacts.
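The inflation caused by point adjustment can be reproduced on a toy sequence. The sketch below is an illustration of the general strategy as described in this section, not the evaluation code of any cited work; the function names and the example labels are hypothetical.

```python
import numpy as np

def point_adjust(pred, label):
    """Mark a whole ground truth segment as detected if any point in it is flagged."""
    pred = np.asarray(pred, dtype=bool).copy()
    label = np.asarray(label, dtype=bool)
    n, start = len(label), None
    for t in range(n + 1):
        inside = t < n and label[t]
        if inside and start is None:
            start = t                      # a ground truth segment begins
        elif not inside and start is not None:
            if pred[start:t].any():        # at least one hit inside the segment
                pred[start:t] = True
            start = None
    return pred

def f1_score(pred, label):
    pred, label = np.asarray(pred, dtype=bool), np.asarray(label, dtype=bool)
    tp = np.sum(pred & label)
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    return float(2 * tp / (2 * tp + fp + fn)) if tp > 0 else 0.0

label = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]   # two anomaly segments
pred = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]    # one false alarm, one hit in the first segment

raw_f1 = f1_score(pred, label)
adjusted_f1 = f1_score(point_adjust(pred, label), label)
print(raw_f1, round(adjusted_f1, 4))  # 0.25 0.7273
```

A single correct point in a four-point segment raises $F_1$ from 0.25 to about 0.73 while the false alarm and the missed second segment remain, and the effect grows with segment length.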

4.3.3. Sensitivity Analysis

Figure 6 illustrates the impact of different window sizes on anomaly detection performance. $F_1$ scores initially increase with window size, as larger windows capture richer temporal context with extended spatio-temporal dependencies. Inadequate temporal context (window size < 32) induces homogeneous prediction errors across normal and anomalous data: both exhibit large prediction biases, which makes it difficult for the model to distinguish between them and leads to poor performance. Optimal window ranges (64–128) enable effective synergy between temporal dynamics and time-invariant patterns. Normal data typically have more stable patterns and stronger correlations with time-invariant features, which provide periodicity and trend information that improves prediction accuracy. In contrast, anomalous data exhibit rare and unstable patterns with weaker correlations to time-invariant features, limiting or even deteriorating their prediction accuracy. This discrepancy enhances the distinction between normal and anomalous data, contributing to improved anomaly detection performance. Excessive window sizes (>128) introduce feature distribution skew between invariant and dynamic patterns, compromising $F_1$ scores. Therefore, we chose a window size of 64 as the optimal compromise.

4.3.4. Distinguishability Analysis

“Distinguishability” refers to the capability of the model to differentiate between normal and abnormal time points in time series data based on their association patterns. Normal points typically exhibit broad temporal associations (e.g., long-range dependencies, periodic patterns) that reflect stable system behavior. Abnormal points, due to their rarity and contextual isolation, tend to form adjacent-concentrated associations (focusing on neighboring points). To examine whether time-invariant features enhance anomaly discriminability, we conducted a comparative analysis of prediction deviation-based anomaly scores across three categories: anomalies, normal points, and all points. We quantified relative prominence metrics through normalized scoring as shown in Table 6. The average anomaly scores for all points, normal points, and abnormal points were scaled from 1 to 100. Specifically, the outstanding degree (Normal/All) represents the prominence of normal points relative to all points, (Anomaly/All) represents the prominence of anomalies relative to all points, and (Anomaly/Normal) represents the prominence of anomalies relative to normal points. The average anomaly scores of normal points across all three datasets decreased after incorporating time-invariant features, indicating that these features aid in predicting normal points more accurately, thereby reducing their anomaly scores. In contrast, the average anomaly scores of anomalies either decreased minimally or increased, suggesting that anomalies are less associated with time-invariant features. This divergence substantiates our hypothesis: while time-invariant features optimize normal pattern modeling, they deliberately decouple from anomalous patterns, creating amplified contrast between normal and anomaly predictions.
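The three prominence ratios can be computed directly from per-point anomaly scores and labels. The sketch below is a simplified illustration: the function name and toy values are hypothetical, and the rescaling of average scores to the range 1 to 100 mentioned above is omitted because its exact form is not specified.

```python
import numpy as np

def prominence(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mean_all = scores.mean()
    mean_normal = scores[~labels].mean()
    mean_anomaly = scores[labels].mean()
    return {
        "Normal/All": mean_normal / mean_all,
        "Anomaly/All": mean_anomaly / mean_all,
        "Anomaly/Normal": mean_anomaly / mean_normal,
    }

scores = [1.0, 1.0, 2.0, 2.0, 6.0, 8.0]
labels = [0, 0, 0, 0, 1, 1]
ratios = prominence(scores, labels)
print({name: round(value, 3) for name, value in ratios.items()})
# {'Normal/All': 0.45, 'Anomaly/All': 2.1, 'Anomaly/Normal': 4.667}
```

A lower Normal/All ratio together with a higher Anomaly/Normal ratio corresponds to the amplified contrast that the time-invariant features are intended to produce.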

4.3.5. Anomaly Threshold Determination

The anomaly threshold is dynamically optimized through a two-stage statistical process to ensure robustness against label sparsity and temporal noise. The thresholding interfaces with the error smoothing layer (exponential moving average) to suppress transient noise, ensuring that thresholding responds to sustained anomalies rather than sporadic fluctuations. In deployment, thresholding is periodically updated using an incremental algorithm, enabling adaptation to concept drift without full retraining. We acknowledge that fully adaptive thresholding (e.g., online Bayesian updates) could enhance generalizability, which will be explored in future work.

5. Discussion and Limitations

The development and evaluation of TiTAD yield foundational insights for time series anomaly detection. The hybrid GRU–Transformer architecture demonstrates superior efficacy in balancing local–global temporal dependencies compared to monolithic designs. Unlike AnomalyTransformer, TiTAD mitigates temporal order degradation via GRU, improving the detection of transient anomalies. In comparison to USAD, time-invariant features in TiTAD reduce false positives caused by normal pattern shifts. TiTAD underperforms compared with GDN in detecting interaction-driven anomalies. TiTAD may exhibit limitations in environments with abrupt concept drift and could miss subtle anomalies induced by latent variable interactions. These limitations delineate clear pathways for improvement. Future work should investigate dynamic feature adaptation mechanisms, such as online meta-learning, to handle non-stationary environments without retraining. Architectural innovations like neural architecture search could automate the GRU–Transformer balance per application domain, while graph-enhanced fusion modules may better capture multivariate causality.

6. Conclusions

This paper presents the Time-invariant Transformer for Multivariate Time Series Anomaly Detection (TiTAD), a novel unsupervised framework that synergizes Transformer architectures with temporal invariance modeling. TiTAD employs hybrid Transformer and GRU architectures to learn spatio-temporal features in time series data through a Feature Fusion module. The integrated features are subsequently fed into a linear layer for forecasting, with the forecasting results utilized to infer anomalies. Specifically, TiTAD innovatively integrates time-invariant features into the Transformer architecture and introduces a Feature Fusion module for enhanced integration. By leveraging the distinct impacts of time-invariant features on the prediction accuracy of normal versus anomalous data, TiTAD improves anomaly detection performance. Our experimental results demonstrate that our model outperforms baseline methods across all three real-world datasets.

Author Contributions

Conceptualization, Y.L. and W.W.; methodology, Y.L. and Y.W.; software, Y.L. and W.W.; validation, W.W.; formal analysis, Y.W. and Y.L.; investigation, Y.L.; resources, Y.L. and W.W.; data curation, Y.L. and W.W.; writing—original draft preparation, Y.L.; writing—review and editing, W.W. and Y.W.; visualization, W.W.; supervision, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant Nos. 62002330 and 62176239.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TiTAD: Time-invariant Transformer for Multivariate Time Series Anomaly Detection
GRU: Gated Recurrent Unit
RNNs: Recurrent Neural Networks
CNNs: Convolutional Neural Networks
GNNs: Graph Neural Networks
FF: Feature Fusion
LSTM: Long Short-Term Memory
DAGMM: Deep Autoencoding Gaussian Mixture Model
GDN: Graph Deviation Network
USAD: UnSupervised Anomaly Detection
VAE: Variational Autoencoder
MSCRED: Multi-Scale Convolutional Recurrent Encoder–Decoder
MSE: Mean Squared Error
MAE: Mean Absolute Error
IQR: Interquartile Range
POT: Peaks-Over-Threshold
SWaT: Secure Water Treatment
WADI: Water Distribution
SMD: Server Machine Dataset
WDS: Water Distribution System
THOC: Temporal Hierarchical One-Class network
CAE-M: Deep Convolutional Autoencoder Memory network
MTAD-GAT: Multivariate Time Series Anomaly Detection with Graph Attention Network
NDCG: Normalized Discounted Cumulative Gain

References

  1. Blázquez-García, A.; Conde, A.; Mori, U.; Lozano, J.A. A Review on Outlier/Anomaly Detection in Time Series Data. ACM Comput. Surv. 2022, 54, 56:1–56:33.
  2. Cook, A.A.; Misirli, G.; Fan, Z. Anomaly Detection for IoT Time-Series Data: A Survey. IEEE Internet Things J. 2020, 7, 6481–6494.
  3. Wan, R.; Mei, S.; Wang, J.; Liu, M.; Yang, F. Multivariate Temporal Convolutional Network: A Deep Neural Networks Approach for Multivariate Time Series Forecasting. Electronics 2019, 8, 876.
  4. Huang, Y.; Liu, W.; Li, S.; Guo, Y.; Chen, W. MGAD: Mutual Information and Graph Embedding Based Anomaly Detection in Multivariate Time Series. Electronics 2024, 13, 1326.
  5. Peña, D.; Yohai, V.J. A review of outlier detection and robust estimation methods for high dimensional time series data. Econom. Stat. 2023; in press.
  6. Xia, F.; Chen, X.; Yu, S.; Hou, M.; Liu, M.; You, L. Coupled Attention Networks for Multivariate Time Series Anomaly Detection. IEEE Trans. Emerg. Top. Comput. 2024, 12, 240–253.
  7. Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Söderström, T. Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, 19–23 August 2018; ACM: New York, NY, USA, 2018; pp. 387–395.
  8. Li, D.; Chen, D.; Jin, B.; Shi, L.; Goh, J.; Ng, S. MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks. In Proceedings of the Artificial Neural Networks and Machine Learning—ICANN 2019: Text and Time Series—28th International Conference on Artificial Neural Networks, Munich, Germany, 17–19 September 2019; Proceedings, Part IV; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2019; Volume 11730, pp. 703–716.
  9. Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; Pei, D. Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 2828–2837.
  10. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
  11. Chen, Z.; Chen, D.; Zhang, X.; Yuan, Z.; Cheng, X. Learning Graph Structures with Transformer for Multivariate Time-Series Anomaly Detection in IoT. IEEE Internet Things J. 2022, 9, 9179–9189.
  12. Xu, J.; Wu, H.; Wang, J.; Long, M. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022.
  13. Tuli, S.; Casale, G.; Jennings, N.R. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data. Proc. VLDB Endow. 2022, 15, 1201–1214.
  14. Zhou, H.; Yu, K.; Zhang, X.; Wu, G.; Yazidi, A. Contrastive autoencoder for anomaly detection in multivariate time series. Inf. Sci. 2022, 610, 266–280.
  15. Munir, M.; Siddiqui, S.A.; Dengel, A.; Ahmed, S. DeepAnT: A Deep Learning Approach for Unsupervised Anomaly Detection in Time Series. IEEE Access 2019, 7, 1991–2005.
  16. Ding, C.; Sun, S.; Zhao, J. MST-GAT: A multimodal spatial-temporal graph attention network for time series anomaly detection. Inf. Fusion 2023, 89, 527–536.
  17. Deng, A.; Hooi, B. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, 2–9 February 2021; AAAI Press: Washington, DC, USA, 2021; pp. 4027–4035.
  18. Tang, C.; Xu, L.; Yang, B.; Tang, Y.; Zhao, D. GRU-Based Interpretable Multivariate Time Series Anomaly Detection in Industrial Control System. Comput. Secur. 2023, 127, 103094.
  19. Yang, Q.; Zhang, J.; Zhang, J.; Sun, C.; Xie, S.; Liu, S.; Ji, Y. Graph Transformer Network Incorporating Sparse Representation for Multivariate Time Series Anomaly Detection. Electronics 2024, 13, 2032.
  20. Shen, G.; Zhang, L. A Complex Empirical Mode Decomposition for Multivariant Traffic Time Series. Electronics 2023, 12, 2476.
  21. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 2009, 41, 15:1–15:58.
  22. Hong, K.; Ren, Y.; Li, F.; Mao, W.; Liu, Y. Unsupervised Anomaly Detection of Intermittent Demand for Spare Parts Based on Dual-Tailed Probability. Electronics 2024, 13, 195.
  23. Pang, G.; Shen, C.; Cao, L.; van den Hengel, A. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 2022, 54, 38:1–38:38.
  24. Ruff, L.; Kauffmann, J.R.; Vandermeulen, R.A.; Montavon, G.; Samek, W.; Kloft, M.; Dietterich, T.G.; Müller, K. A Unifying Review of Deep and Shallow Anomaly Detection. Proc. IEEE 2021, 109, 756–795.
  25. Chalapathy, R.; Chawla, S. Deep Learning for Anomaly Detection: A Survey. arXiv 2019, arXiv:1901.03407.
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  27. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, 7–14 February 2023; AAAI Press: Washington, DC, USA, 2023; pp. 11121–11128.
  28. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Online, 6–14 December 2021; pp. 22419–22430.
  29. Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014; A Meeting of SIGDAT, a Special Interest Group of the ACL. ACL: Doha, Qatar, 2014; pp. 1724–1734.
  30. Lin, R.; Xiao, J.; Fan, J. NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Proceedings, Part IV; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2018; Volume 11132, pp. 206–218.
  31. Audibert, J.; Michiardi, P.; Guyard, F.; Marti, S.; Zuluaga, M.A. USAD: UnSupervised Anomaly Detection on Multivariate Time Series. In Proceedings of the KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, 23–27 August 2020; ACM: New York, NY, USA, 2020; pp. 3395–3404.
  32. Park, D.; Hoshi, Y.; Kemp, C.C. A Multimodal Anomaly Detector for Robot-Assisted Feeding Using an LSTM-Based Variational Autoencoder. IEEE Robot. Autom. Lett. 2018, 3, 1544–1551.
  33. Zhang, C.; Song, D.; Chen, Y.; Feng, X.; Lumezanu, C.; Cheng, W.; Ni, J.; Zong, B.; Chen, H.; Chawla, N.V. A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Washington, DC, USA, 2019; pp. 1409–1416.
  34. Li, G.; Jung, J.J. Deep learning for anomaly detection in multivariate time series: Approaches, applications, and challenges. Inf. Fusion 2023, 91, 93–102. [Google Scholar] [CrossRef]
  35. Wang, S.; Cao, J.; Yu, P.S. Deep Learning for Spatio-Temporal Data Mining: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 3681–3700. [Google Scholar] [CrossRef]
  36. Girdhar, R.; Ramanan, D.; Gupta, A.; Sivic, J.; Russell, B.C. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 3165–3174. [Google Scholar] [CrossRef]
  37. Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311. [Google Scholar] [CrossRef]
  38. Arandjelovic, R.; Gronát, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1437–1451. [Google Scholar] [CrossRef] [PubMed]
  39. Siffer, A.; Fouque, P.; Termier, A.; Largouët, C. Anomaly Detection in Streams with Extreme Value Theory. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; ACM: New York, NY, USA, 2017; pp. 1067–1075. [Google Scholar] [CrossRef]
  40. Goh, J.; Adepu, S.; Junejo, K.N.; Mathur, A. A Dataset to Support Research in the Design of Secure Water Treatment Systems. In Proceedings of the Critical Information Infrastructures Security—11th International Conference, CRITIS 2016, Paris, France, 10–12 October 2016; Revised Selected Papers; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2016; Volume 10242, pp. 88–99. [Google Scholar] [CrossRef]
  41. Mathur, A.P.; Tippenhauer, N.O. SWaT: A water treatment testbed for research and training on ICS security. In Proceedings of the 2016 International Workshop on Cyber-Physical Systems for Smart Water Networks, CySWater@CPSWeek 2016, Vienna, Austria, 11 April 2016; pp. 31–36. [Google Scholar] [CrossRef]
  42. Ahmed, C.M.; Palleti, V.R.; Mathur, A.P. WADI: A water distribution testbed for research in the design of secure cyber physical systems. In Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks, CySWATER@CPSWeek 2017, Pittsburgh, PA, USA, 21 April 2017; ACM: New York, NY, USA, 2017; pp. 25–28. [Google Scholar] [CrossRef]
  43. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
  44. Fey, M.; Lenssen, J.E. Fast Graph Representation Learning with PyTorch Geometric. arXiv 2019, arXiv:1903.02428. [Google Scholar]
  45. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  46. Shen, L.; Li, Z.; Kwok, J.T. Timeseries Anomaly Detection using Temporal Hierarchical One-Class Network. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  47. Zhao, H.; Wang, Y.; Duan, J.; Huang, C.; Cao, D.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; Zhang, Q. Multivariate Time-series Anomaly Detection via Graph Attention Network. In Proceedings of the 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, 17–20 November 2020; pp. 841–850. [Google Scholar] [CrossRef]
  48. Zhang, Y.; Chen, Y.; Wang, J.; Pan, Z. Unsupervised Deep Anomaly Detection for Multi-Sensor Time-Series Signals. IEEE Trans. Knowl. Data Eng. 2023, 35, 2118–2132. [Google Scholar] [CrossRef]
  49. Järvelin, K.; Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 2002, 20, 422–446. [Google Scholar] [CrossRef]
  50. Kim, S.; Choi, K.; Choi, H.; Lee, B.; Yoon, S. Towards a Rigorous Evaluation of Time-Series Anomaly Detection. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual Event, 22 February–1 March 2022; AAAI Press: Washington, DC, USA, 2022; pp. 7194–7201. [Google Scholar] [CrossRef]
  51. Doshi, K.; Abudalou, S.; Yilmaz, Y. Reward Once, Penalize Once: Rectifying Time Series Anomaly Detection. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar] [CrossRef]
  52. Hwang, W.; Yun, J.; Kim, J.; Min, B. Do you know existing accuracy metrics overrate time-series anomaly detections? In Proceedings of the SAC ’22: The 37th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, 25–29 April 2022; ACM: New York, NY, USA, 2022; pp. 403–412. [Google Scholar] [CrossRef]
  53. Wagner, D.; Michels, T.; Schulz, F.C.F.; Nair, A.; Rudolph, M.; Kloft, M. TimeSeAD: Benchmarking Deep Multivariate Time-Series Anomaly Detection. Trans. Mach. Learn. Res. 2023, 2023, 735. [Google Scholar]
Figure 1. Overview of TiTAD. TiTAD primarily comprises a Transformer, GRU, and Feature Fusion module. First, input data pass through a linear projection layer to extract high-dimensional features. This output is then processed by both the Transformer and GRU to capture spatio-temporal features while accounting for time-invariant properties. Subsequently, the outputs from these components are fused using the FF module. Finally, a prediction head is applied for making predictions. The prediction results are utilized for anomaly detection.
Figure 2. The network architecture of the Transformer. Due to the specificity of the forecasting task, we only use the Transformer encoder. In particular, we use embedding and attention to learn the time-invariant properties of the time series and integrate the time-invariant properties by adding embedding to the input of the Transformer and element-wise multiplying the attention scores generated by the Transformer with the adjacency matrix G. Add represents the addition of two vectors. Threshold represents filtering on a matrix by a threshold.
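The masking step described in this caption can be illustrated with a minimal sketch in plain Python. It only shows the element-wise product of attention scores with a thresholded adjacency matrix G. The function names (`threshold`, `mask_attention`), the cutoff `tau`, the toy matrices, and the choice to apply the product after a row-wise softmax are all assumptions made for illustration; how G is learned from the embeddings is not shown.

```python
import math

def threshold(matrix, tau):
    """Zero out entries below tau (assumed meaning of the Threshold step)."""
    return [[v if v >= tau else 0.0 for v in row] for row in matrix]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def mask_attention(raw_scores, G):
    """Element-wise product of row-normalized attention scores with G."""
    attn = [softmax(row) for row in raw_scores]
    return [[a * g for a, g in zip(a_row, g_row)]
            for a_row, g_row in zip(attn, G)]

# Toy example with three sensors; every value here is hypothetical.
similarity = [[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]]
G = threshold(similarity, tau=0.5)      # weak links (0.1, 0.3) are removed
raw_scores = [[0.2, 0.5, 0.9],
              [0.4, 0.1, 0.7],
              [0.3, 0.3, 0.3]]
masked = mask_attention(raw_scores, G)  # attention on removed links becomes 0
```

In this toy case the third sensor keeps attention only on itself, because both of its links to the other sensors fall below the cutoff.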
Figure 3. The network architecture of NeXtVLAD. NeXtVLAD first feeds the input into a linear layer to expand its dimensionality, divides the result into multiple groups of low-dimensional vectors, and then calculates the residuals between these vectors and the learned clustering centers. Finally, the residuals are aggregated using $\alpha_{gk}$, the probability that each low-dimensional vector belongs to each clustering center, and $\alpha_g$, the weight of each group of low-dimensional vectors, and are fed into a linear layer to further extract features and adjust dimensionality.
Figure 4. Anomaly inference with AnomalyTransformer: (a) Anomaly scores generated by AnomalyTransformer. (b) Anomaly labels inferred from the anomaly scores. (c) Anomaly labels after point adjustment. The blue area represents the anomaly segment.
Figure 5. Anomaly inference with TiTAD: (a) Anomaly scores generated by TiTAD. (b) Anomaly labels inferred from the anomaly scores. (c) Anomaly labels after point adjustment. The blue area represents the anomaly segment.
Figure 6. The effect of different window sizes on the result. (a) SWaT; (b) WADI; (c) SMD.
Table 1. The statistical profile of the datasets in the experiment.

| Dataset | Train | Test | Dimensions | Entities | Anomaly Ratio (%) | Year |
| --- | --- | --- | --- | --- | --- | --- |
| SWaT | 496,800 | 449,919 | 51 | 1 | 11.98 | 2016 |
| WADI | 1,048,571 | 172,801 | 123 | 1 | 5.99 | 2017 |
| SMD | 708,405 | 708,420 | 38 | 28 | 4.16 | 2019 |
Table 2. Comparison of F1 scores between our model and the baselines. For reference, precision and recall are also listed for each dataset except SMD, which consists of multiple sub-datasets. The best results are in bold and the next best results are underlined.

| Method | SWaT F1 | SWaT Prec | SWaT Rec | WADI F1 | WADI Prec | WADI Rec | SMD F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DAGMM | 0.550 | 0.469 | 0.665 | 0.121 | 0.065 | 0.913 | 0.238 |
| LSTM-VAE | 0.775 | 0.989 | 0.637 | 0.227 | 0.994 | 0.128 | 0.435 |
| OmniAnomaly | 0.782 | 0.982 | 0.649 | 0.223 | 0.994 | 0.129 | 0.474 |
| MSCRED | 0.662 | - | - | 0.087 | - | - | 0.097 |
| USAD | 0.791 | 0.985 | 0.661 | 0.232 | 0.994 | 0.131 | 0.426 |
| THOC | 0.612 | - | - | 0.130 | - | - | 0.168 |
| MTAD-GAT | 0.810 | 0.971 | 0.695 | 0.416 | 0.281 | 0.801 | - |
| CAE-M | 0.810 | 0.969 | 0.695 | 0.411 | 0.278 | 0.791 | - |
| GDN | 0.808 | 0.993 | 0.681 | 0.426 | 0.291 | 0.793 | 0.529 |
| TranAD | 0.607 | 0.996 | 0.436 | 0.251 | 0.217 | 0.298 | 0.288 |
| AnomalyTransformer | 0.246 | 0.444 | 0.170 | 0.054 | 0.037 | 0.100 | 0.021 |
| GRN | 0.749 | 0.998 | 0.590 | 0.482 | 0.358 | 0.739 | - |
| Ours | 0.842 | 0.966 | 0.745 | 0.512 | 0.468 | 0.565 | 0.557 |
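As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so an F1 entry can be recomputed from the precision and recall in the same row. Taking the DAGMM row on SWaT as an example:

$$F_1 = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} = \frac{2 \times 0.469 \times 0.665}{0.469 + 0.665} \approx 0.550$$

Because the listed precision and recall are rounded to three decimals, a recomputed F1 may differ from the reported one in the last digit for some rows.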
Table 3. Anomaly diagnosis on SMD. The best results are in bold and the next best results are underlined.

| Method | HitRate@100% | HitRate@150% | NDCG@100% | NDCG@150% |
| --- | --- | --- | --- | --- |
| MERLIN | 0.5907 | 0.6177 | 0.4150 | 0.4912 |
| LSTM-NDT | 0.3808 | 0.5225 | 0.3603 | 0.4451 |
| DAGMM | 0.4927 | 0.6091 | 0.5169 | 0.5845 |
| OmniAnomaly | 0.4567 | 0.5652 | 0.4545 | 0.5125 |
| MSCRED | 0.4272 | 0.5180 | 0.4609 | 0.5164 |
| MAD-GAN | 0.4630 | 0.5785 | 0.4681 | 0.5522 |
| USAD | 0.4925 | 0.6055 | 0.5179 | 0.5781 |
| MTAD-GAT | 0.3493 | 0.4777 | 0.3759 | 0.4530 |
| CAE-M | 0.4707 | 0.5878 | 0.5474 | 0.6178 |
| GDN | 0.3143 | 0.4386 | 0.2980 | 0.3724 |
| TranAD | 0.4981 | 0.6401 | 0.4941 | 0.6178 |
| Ours | 0.5328 | 0.6460 | 0.5703 | 0.6379 |
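For readers unfamiliar with the HitRate@P% columns, the following is a minimal sketch of how such a metric is commonly computed for a single timestamp: the top candidates are the highest-scoring dimensions, and their number is P% of the count of ground-truth anomalous dimensions. The function name `hit_rate`, the use of `math.ceil` for the cutoff, and the toy scores are assumptions for illustration, not the evaluation code used in the paper.

```python
import math

def hit_rate(scores, true_dims, p=1.0):
    """HitRate@P% for one timestamp (illustrative definition).

    scores: per-dimension anomaly scores.
    true_dims: set of indices of ground-truth anomalous dimensions.
    p: 1.0 for @100%, 1.5 for @150%.
    """
    if not true_dims:
        return None  # undefined when no dimension is labeled anomalous
    k = math.ceil(p * len(true_dims))  # number of top candidates to inspect
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & true_dims) / len(true_dims)

# Hypothetical scores for five dimensions; dimensions 1 and 4 are truly anomalous.
scores = [0.1, 0.9, 0.3, 0.8, 0.7]
true_dims = {1, 4}
at_100 = hit_rate(scores, true_dims, p=1.0)  # top 2 = {1, 3}: one of two found
at_150 = hit_rate(scores, true_dims, p=1.5)  # top 3 = {1, 3, 4}: both found
```

A dataset-level figure would then average this value over the timestamps that contain at least one labeled anomalous dimension.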
Table 4. Ablation results. The best results are in bold.

| Variant | SWaT | WADI | SMD |
| --- | --- | --- | --- |
| w/o:TiF | 0.785 | 0.438 | 0.521 |
| w/o:Transformer | 0.745 | 0.389 | 0.517 |
| w/o:GRU | 0.528 | 0.408 | 0.510 |
| w/o:FF | 0.777 | 0.354 | 0.521 |
| Ours | 0.842 | 0.512 | 0.557 |
Table 5. The F1 score comparison under the point-adjustment strategy. The best results are in bold.

| Method | SWaT | WADI | SMD |
| --- | --- | --- | --- |
| DAGMM | 0.853 | 0.209 | 0.723 |
| LSTM-VAE | 0.805 | 0.380 | 0.808 |
| OmniAnomaly | 0.866 | 0.417 | 0.944 |
| MSCRED | 0.868 | 0.346 | 0.389 |
| USAD | 0.846 | 0.429 | 0.938 |
| THOC | 0.880 | 0.506 | 0.541 |
| GDN | 0.935 | 0.855 | 0.716 |
| TranAD | 0.815 | 0.495 | 0.961 |
| AnomalyTransformer | 0.941 | 0.698 | 0.923 |
| Random Anomaly Score | 0.969 | 0.965 | 0.804 |
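To make the point-adjustment strategy concrete, here is a small sketch in plain Python of the commonly used version: if any point inside a ground-truth anomaly segment is flagged, every point of that segment is counted as detected. The helper names (`point_adjust`, `f1_score`) and the toy labels are assumptions for illustration, not the code behind the table.

```python
def point_adjust(pred, label):
    """Mark a whole ground-truth segment as detected if any point in it is flagged."""
    adjusted = list(pred)
    n = len(label)
    i = 0
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:  # find the end of this anomaly segment
                j += 1
            if any(pred[i:j]):              # at least one hit inside the segment
                for t in range(i, j):
                    adjusted[t] = 1
            i = j
        else:
            i += 1
    return adjusted

def f1_score(pred, label):
    """Point-wise F1 from binary predictions and labels."""
    tp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, label) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical series: one 5-point anomaly segment, one true hit, one false alarm.
label = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
pred = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
adjusted = point_adjust(pred, label)
raw_f1 = f1_score(pred, label)      # 2/7, about 0.286
adj_f1 = f1_score(adjusted, label)  # 10/11, about 0.909
```

A single hit inside a long segment is enough to lift F1 from about 0.29 to about 0.91 in this toy case, which is consistent with the high scores the Random Anomaly Score row reaches under this strategy.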
Table 6. Average anomaly scores of all points, normal points, and abnormal points, together with the outstanding degree (score ratio) of normal points and anomalies relative to all points, and of anomalies relative to normal points. The best results are in bold.

| Metric | SWaT (w/o:TiF) | SWaT (Ours) | WADI (w/o:TiF) | WADI (Ours) | SMD (w/o:TiF) | SMD (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| All points | 10.1357 | 9.0329 | 0.7799 | 0.7773 | 0.5847 | 0.6040 |
| Normal points | 3.0915 | 2.1087 | 0.134 | 0.1238 | 0.3798 | 0.3706 |
| Abnormal points | 60.7057 | 58.7411 | 11.1879 | 11.3072 | 7.5069 | 8.3227 |
| Normal/All | 0.305 | 0.2334 | 0.1718 | 0.1593 | 0.6495 | 0.6135 |
| Anomaly/All | 5.9892 | 6.5031 | 14.3441 | 14.5455 | 12.8369 | 13.7774 |
| Anomaly/Normal | 19.6359 | 27.8556 | 83.466 | 91.2767 | 19.7615 | 22.4550 |
Anomaly/Normal19.635927.855683.46691.276719.761522.4550

Share and Cite

MDPI and ACS Style

Liu, Y.; Wang, W.; Wu, Y. TiTAD: Time-Invariant Transformer for Multivariate Time Series Anomaly Detection. Electronics 2025, 14, 1401. https://doi.org/10.3390/electronics14071401

Chicago/Turabian Style

Liu, Yuehan, Wenhao Wang, and Yunpeng Wu. 2025. "TiTAD: Time-Invariant Transformer for Multivariate Time Series Anomaly Detection" Electronics 14, no. 7: 1401. https://doi.org/10.3390/electronics14071401
