1. Introduction
With the advancement of sensor technology, sensor deployment in cyber-physical systems (including vehicles, water treatment plants, and space probes) is experiencing rapid growth. These sensors generate vast amounts of multivariate time series data while continuously monitoring the operational health of these systems [1]. The effective utilization of these time series data has become increasingly important for ensuring the seamless operation of various facilities, particularly critical infrastructures such as transportation, power, and communications [2,3,4,5]. The primary focus is on time series anomaly detection, which aims to identify and alert security teams to anomalies within these multivariate time series datasets. By doing so, potential irregularities can be addressed promptly, thereby minimizing potential risks and losses.
Multivariate time series exhibit dynamic characteristics over distinct time periods, encompassing spatial features such as evolving variable correlations and temporal features like inherent noise fluctuations [6]. Existing approaches [7,8,9,10,11,12,13,14] predominantly leverage Recurrent Neural Networks (RNNs) to capture the temporal features of multivariate time series. Another major category of methods [15,16,17,18,19] adopts Convolutional Neural Networks (CNNs) or Graph Neural Networks (GNNs) to capture the spatial features among variables within multivariate time series. Multivariate time series data generated by real-world systems exhibit complex patterns and large volumes, making the process of labeling anomalies costly [20,21]. Consequently, the field of anomaly detection predominantly relies on unsupervised methods [22]. Unsupervised methods typically employ prediction or reconstruction models, which learn the standard patterns by improving the accuracy of predicting or reconstructing the time series during training [23,24]. Since anomalous data patterns deviate from the standard patterns, the models produce larger errors when predicting or reconstructing anomalies, allowing these anomalies to be inferred from the errors.
While these methods have demonstrated remarkable performance in anomaly detection, they predominantly concentrate on either temporal or spatial dimensions of the data, failing to adequately capture the complex spatio-temporal interactions inherent in multivariate time series. Such data naturally embody nonlinear spatial dependencies between variables and intricate temporal dependencies within variables. The unilateral modeling approach inevitably leads to incomplete pattern learning, causing failures in capturing either spatial structural anomalies or temporal contextual deviations, thereby constraining detection capabilities [25]. Moreover, these methods often overlook the importance of time-invariant properties when extracting spatio-temporal features. Notably, multivariate time series preserve essential time-invariant attributes including stable inter-variable correlations and persistent periodic patterns or trend components. Neglecting these fundamental characteristics may severely compromise model effectiveness. In this work, "invariant" refers to the robustness of our model to specific transformations (e.g., temporal shifts or scaling) within the multivariate time series. This does not assume stationarity of the underlying process, as trends, cycles, or seasonalities may still exist. Following [1], we define anomalies as deviations from invariant relational patterns across variables, rather than static statistical properties.
Recent advances have demonstrated the successful application of Transformers to multivariate time series analysis, harnessing their exceptional global representation and long-range relationship modeling capabilities [26]. These architectures have shown the capacity to extract time-invariant features across both spatial and temporal dimensions. The latest methods leverage Transformers to decode temporal patterns within individual variables, achieving significantly improved performance [12,13]. However, current methods fail to explicitly model the time-invariant characteristics of time series, limiting the full utilization of these properties. Additionally, Transformers tend to lose the temporal order that is critical to time series data, which further hinders their ability to extract intrinsic temporal features of each variable [27,28].
To address these limitations, this paper proposes the Time-invariant Transformer for Multivariate Time Series Anomaly Detection (TiTAD). TiTAD harnesses a Time-invariant Transformer to holistically model the spatio-temporal dependencies in multivariate time series. We capitalize on the Transformer’s dual strengths in long-range dependency modeling and relational representation to extract spatio-temporal features. Meanwhile, TiTAD integrates the Gated Recurrent Unit (GRU) [29] to counteract the Transformer’s temporal order degradation, thereby refining spatio-temporal feature extraction. The Time-invariant Transformer further incorporates an augmented memory mechanism comprising learnable embedding vectors and an attention module. During training, embeddings are continuously optimized to fit the time-invariant characteristics per variable, while the attention mechanism precisely quantifies inter-variable relationships. Through the coordinated operation of these components, the mechanism discovers latent time-invariant attributes in the data. TiTAD strategically combines time-invariant and spatio-temporal features to amplify anomaly discrimination. Crucially, time-invariant features achieve superior prediction fidelity for normal data with stable patterns, while failing to generalize to anomalous instances with irregular patterns. This amplifies the disparity between normal and abnormal prediction errors, sharpening anomaly detectability. The architecture culminates in a Feature Fusion (FF) module employing NeXtVLAD [30], which dynamically recalibrates time-invariant feature weights while eliminating redundancy. Fused features are processed through transformation layers for prediction-based anomaly inference.
The main contributions of this study are summarized below:
We propose the Time-invariant Transformer, which includes an augmented memory mechanism to extract the time-invariant features inherent in time series. Specifically, it integrates the spatio-temporal features extracted by the Transformer and the time-invariant features obtained through the augmented memory mechanism. Because time-invariant features affect the prediction of anomalous and normal data differently, they make anomalous data more distinguishable, thereby improving anomaly detection performance.
We utilize GRU, which focuses on extracting intra-variable dynamic features from a temporal perspective, to compensate for the Transformer’s loss of temporal order. This ensures that the model retains the critical temporal information necessary for accurate anomaly detection.
We propose a novel Feature Fusion module with adaptive gating mechanisms that dynamically balances the contributions of spatio-temporal and time-invariant features, keeping the model’s overall performance robust and reliable.
We conduct extensive empirical research on three widely used datasets from real-world sources, demonstrating the superiority of TiTAD. Our experiments also showcase the model’s capability in anomaly diagnosis, highlighting its practical value in real-world applications.
The remainder of this paper is organized as follows: Section 2 reviews related work in multivariate time series anomaly detection. Section 3 details the proposed TiTAD framework, including the Time-invariant Transformer architecture, enhanced temporal modeling, and Feature Fusion module. Section 4 presents experimental results on three industrial datasets, with comparisons to state-of-the-art baselines. Section 5 discusses the implications and limitations of our approach. Finally, Section 6 concludes the paper.
2. Related Work
Anomalies exhibit multifaceted data patterns, whereas normal data demonstrate relatively stable characteristics. Therefore, existing anomaly detection methods often learn standard patterns from normal data and then perform prediction or reconstruction from learned standard patterns, using the prediction or reconstruction error as an important basis for anomaly inference. These methods are thus categorized into prediction-based and reconstruction-based models.
Prediction Methods: LSTM-NDT [7] uses Long Short-Term Memory (LSTM) networks for prediction, achieving strong prediction performance while preserving interpretability in the system. Once model forecasts are obtained, it applies nonparametric dynamic thresholding, an unsupervised method, to evaluate the residuals. Deep Autoencoding Gaussian Mixture Model (DAGMM) [10] consists of a compression network and an estimation network. The compression network includes an autoencoder for feature extraction of the input data, while the estimation network fits a Gaussian Mixture Model to the compressed representation and uses the resulting sample energy as the basis for anomaly inference. The Graph Deviation Network (GDN) [17] is a prediction model that integrates graph structure learning and Graph Neural Networks (GNNs) to extract spatial features and generate anomaly scores using prediction errors.
Reconstruction Methods: UnSupervised Anomaly Detection (USAD) [31] leverages adversarially trained autoencoders for unsupervised learning, enabling efficient anomaly identification. OmniAnomaly [9] employs stochastic variable connections and planar normalizing flows to characterize normal time series patterns, detecting anomalies via reconstruction probabilities. LSTM-VAE [32] integrates LSTM and the Variational Autoencoder (VAE) by substituting the VAE’s linear layer with LSTM to better capture the temporal dynamics. The Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) [33] creates multi-scale signature matrices representing temporal states, using convolutional encoders to encode inter-sensor correlations and ConvLSTM networks to model temporal dependencies. While advancing the field, these methods lack the capacity for comprehensive modeling of temporal feature interactions and neglect the intrinsic time-invariant properties of time series, ultimately limiting anomaly detection efficacy.
Transformer Models: AnomalyTransformer [12] adopts Transformer architectures for temporal feature extraction. It computes association discrepancy via the Transformer’s attention weights and combines this metric with reconstruction errors to derive anomaly scores. TranAD [13] establishes a Transformer-based framework for anomaly detection and diagnosis, enabling few-shot learning through model-agnostic meta-learning. GTA [11] integrates structure learning with graph convolution and Transformer-based temporal modeling. Its graph structure learning employs Gumbel softmax sampling to learn bi-directed associations among sensors, and it develops an influence propagation convolution to model anomaly information flow between nodes. Though achieving progress, these methods exhibit limitations in spatio-temporal feature extraction and fail to explicitly model time-invariant time series properties. Additionally, Transformers suffer from temporal order degradation in sequential data.
Prior works assume temporal stationarity or overlook persistent patterns. TiTAD explicitly disentangles invariant contextual features (learned via contrastive phase alignment) from time-variant residuals. Most methods prioritize either spatial or temporal features. TiTAD introduces attention gates that dynamically weight spatial and temporal embeddings based on their anomaly saliency. The Transformers’ permutation invariance weakens sequential causality. Our TiTAD enforces order-aware attention distributions, preserving causal dependencies critical for anomaly localization. By unifying these innovations, TiTAD achieves outstanding performance while addressing foundational gaps in anomaly detection.
3. Method
The multivariate time series consists of the observations of multiple variables over a period of time, denoted by $S \in \mathbb{R}^{T \times N}$, where $T$ represents the time length and $N$ represents the number of variables. Specifically, $S = [s_1, s_2, \dots, s_T]$, where $t \in \{1, \dots, T\}$ and $s_t \in \mathbb{R}^{N}$ represents the observations of the $N$ variables at time point $t$: $s_t = [s_t^1, s_t^2, \dots, s_t^N]$, with $i \in \{1, \dots, N\}$ and $s_t^i$ representing the observation of the $i$th variable at time point $t$. Given that multivariate time series can be long, we usually divide them into sliding windows as inputs to the model.
Time series anomaly detection poses three foundational challenges: (1) temporal dependency modeling, where the GRU layer is tailored to capture short-term local patterns (e.g., sensor drifts) while the Transformer attends to global context (e.g., diurnal cycles), explicitly addressing the multi-scale nature of industrial time series data; (2) non-stationarity mitigation, achieved through adversarial training to disentangle time-invariant features (e.g., equipment degradation signatures) from transient noise, as motivated by the failure of traditional thresholding under concept drift; and (3) anomaly sparsity and temporal locality, handled via the hybrid loss function that prioritizes reconstruction fidelity around abrupt deviations. We further contextualize these choices with domain-specific temporal constraints, such as the use of exponential smoothing in preprocessing to suppress high-frequency instrumentation noise without blurring step-change anomalies, and the dynamic thresholding mechanism that adapts to seasonal baseline shifts. Anomaly detection in multivariate time series focuses on continuously monitoring the time series and promptly detecting and reporting anomalous events.
Given a multivariate time series $S$, the detector outputs a set of anomaly labels denoted by $Y = [y_1, y_2, \dots, y_T]$, with $y_t \in \{0, 1\}$ and $t \in \{1, \dots, T\}$, where 0 means that $S$ is not anomalous at time point $t$ and 1 means that $S$ is anomalous at time point $t$. Due to the complexity and variability of anomalies, which differ from normal data patterns, the model generates higher prediction deviations for anomalies. The prediction deviations are further amplified by the distinct effects of time-invariant features on normal versus anomalous data [34]. The model takes a fixed window of length $w$ of historical data $[s_{t-w}, \dots, s_{t-1}]$, where $t > w$, to predict the observation $s_t$ at time point $t$.
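The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper name `make_windows` and the toy series are hypothetical, and the sketch assumes that each window of `w` consecutive observations is paired with the single observation that immediately follows it.

```python
from typing import List, Sequence, Tuple

Window = List[List[float]]


def make_windows(series: Sequence[Sequence[float]], w: int) -> List[Tuple[Window, List[float]]]:
    """Pair each window of w consecutive observations with the next observation."""
    if w < 1 or w >= len(series):
        raise ValueError("w must satisfy 1 <= w < len(series)")
    pairs = []
    for t in range(w, len(series)):
        history = [list(row) for row in series[t - w:t]]  # the w observations before t
        target = list(series[t])                          # the observation to predict
        pairs.append((history, target))
    return pairs


# Toy series with T = 5 time points and N = 2 variables (illustrative values).
series = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 13.0], [4.0, 14.0]]
pairs = make_windows(series, w=3)
```

With `T = 5` and `w = 3` the sketch yields two training pairs, since only time points that have a full history of `w` observations can be predicted.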
This study proposes TiTAD, a novel prediction framework that capitalizes on the Transformer’s attention mechanism to explicitly model complex variable interactions and efficiently extract spatio-temporal features. A critical innovation lies in the Transformer’s augmented memory mechanism, specifically engineered to learn time-invariant characteristics. The integration of GRUs preserves temporal order information, thereby enhancing intrinsic temporal feature extraction. TiTAD addresses feature redundancy by incorporating a Feature Fusion module that dynamically calibrates time-invariant feature weights. The FF module’s output undergoes linear transformation for temporal prediction, while the framework culminates in anomaly detection through prediction error analysis.
Figure 1 illustrates TiTAD’s architectural overview.
3.1. Time-Invariant Transformer
The prediction models learn standard patterns from multivariate time series by improving prediction accuracy during training. The prediction accuracy at a time point $t$ depends strongly on the spatio-temporal properties of the corresponding sliding window, and these properties change dynamically. To help the model explore standard patterns, we use the Transformer to mine the dynamic spatio-temporal features within the sliding windows. During the anomaly inference phase, the prediction model generates higher prediction errors when predicting anomalous points that deviate from the standard patterns. In particular, we improve the discrimination of anomalies by leveraging time-invariant features. To ensure robustness against temporal distribution shifts, we formally define time-invariant features. A feature is considered time-invariant if it satisfies the conditions defined in Equation (1).
where
denotes the correlation threshold and
represents input data at time
t. These features, including spectral energy ratios and conserved physical quantities, are selected based on signal stationarity and domain-specific invariance principles. Time-invariant features complement the spatio-temporal features and thereby improve prediction precision on normal data. For anomalies, whose correlation with the spatio-temporal features is lower, the time-invariant features do not help and may even harm prediction accuracy. As a result, the prediction deviation of anomalies becomes more pronounced and easier to detect.
3.1.1. Memory-Augmented Transformer
The integration of time-invariant features induces structural differentiation in prediction error distributions. We formalize the error amplification mechanism through error decomposition in Equation (2).
where
denotes the total prediction error,
represents spatio-temporal components, and
quantifies the mismatch in time-invariant features. The coupling coefficient
governs interaction intensity between modalities. For a normal sample
, temporal consistency in invariant features (
) is induced by Equation (
3).
yielding
. For an anomalous sample
, temporal inconsistency (
) triggers error amplification, as defined in Equation (
4).
where
measures the deviation from invariant patterns.
The characteristics of each variable and the relationships among variables are dynamic, but there are also some time-invariant characteristics and relationships. We use embedding vectors denoted by $E \in \mathbb{R}^{N \times d}$ to capture these time-invariant features, where $d$ represents the embedding vector dimension. These vectors are continuously updated through the training process. The attention mechanism is a powerful tool for learning relationships between items and has been extensively applied in various domains, including link prediction and natural language processing. In this study, we apply the attention mechanism on the embedding vectors to extract the time-invariant relationships among variables, as defined in Equation (5).
where
and
denote query and key, respectively,
and
are learnable parameters, and
represents the query and key dimension.
The attention score generated by the attention mechanism represents the relationship between each pair of variables, and this relationship may be strong or weak. Weak correlations may be due to errors and are probably closer to noise. To reduce the influence of this noise, we introduce a threshold for the attention scores. If the attention score between two variables is greater than the threshold, the relationship between them is considered stable; otherwise, it is considered unstable. Finally, we obtain an adjacency matrix $G$. We set the threshold to 0 for better training. $E$ and $G$ represent the time-invariant features from different perspectives: $E$ captures the long-term behavioral characteristics of its corresponding variables over $T$, and $G$ captures the stable inter-variable relationships throughout $T$.
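A minimal NumPy sketch of this memory step is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name `invariant_adjacency`, the projection matrices `W_q` and `W_k`, and all dimensions are hypothetical, and the scores are assumed to be raw scaled dot products (without softmax), since a threshold of 0 would keep every entry of a softmax output.

```python
import numpy as np


def invariant_adjacency(E, W_q, W_k, threshold=0.0):
    """Score variable pairs from their embeddings and keep the pairs above the threshold."""
    Q = E @ W_q                                   # queries, shape (N, d_k)
    K = E @ W_k                                   # keys, shape (N, d_k)
    scores = (Q @ K.T) / np.sqrt(Q.shape[1])      # scaled dot-product scores, shape (N, N)
    G = (scores > threshold).astype(float)        # 1 = stable relation, 0 = treated as noise
    return scores, G


rng = np.random.default_rng(0)
N, d, d_k = 4, 8, 5                               # illustrative sizes only
E = rng.normal(size=(N, d))                       # stands in for the learned embeddings
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
scores, G = invariant_adjacency(E, W_q, W_k)
```

In training, `E`, `W_q`, and `W_k` would be learned jointly with the rest of the model, so the binary matrix changes until the embeddings settle; the sketch only shows the forward computation.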
3.1.2. Enhanced Temporal Modeling
Figure 2 depicts the Transformer’s architecture. Given the dynamic evolution of variable characteristics and inter-variable relationships, we employ the Transformer to compute attention scores through variable similarity comparisons, enabling effective spatio-temporal feature aggregation within temporal windows. To amplify prediction deviations between normal and anomalous data, we implement dual time-invariant mechanisms. First, $E$ is dynamically updated during training to capture stable behavioral patterns (periodicity/trends) per variable. $E$ and the window input are jointly processed to incorporate time-invariant temporal context in predictions. Second, we multiply the attention scores generated by the Transformer element-wise with $G$, which represents the long-term stable correlation between variables. This operation ensures predictions consider both transient and time-invariant correlations, effectively overcoming limitations caused by local temporal smoothness while maintaining sensitivity to persistent system-wide dependencies.
The time-invariant features exhibit selective efficacy, in the spirit of the saying "one man’s meat is another man’s poison". For normal data, these features suppress noise and provide long-term contextual cues, thereby improving prediction accuracy. Conversely, their mismatch with anomalous patterns exacerbates prediction errors, enhancing anomaly discriminability. Anomalous data, characterized by sporadic and statistically divergent temporal patterns, inherently conflict with time-invariant feature representations. This incongruity manifests in both the prediction phase and the detection phase. In the prediction phase, time-invariant features, optimized for stable normal patterns, fail to guide predictions for anomaly-driven irregularities. This results in amplified prediction errors due to the mismatch between invariant feature priors and anomalous dynamics. In the detection phase, the exaggerated prediction errors act as discriminative signals, creating a stronger separation between normal and anomalous distributions in the error space. This separation makes the decision boundary for anomaly identification easier to place, effectively transforming the model’s prediction weakness into a detection strength. Our Transformer’s attention scores are calculated by Equation (6).
Let
Q,
K,
denote query, key, value, and normalization operations, respectively.
and
represent the learnable matrix, and
is the dimension of query, key, and value.
represents the relationship between the variables. Finally, we aggregate the spatial features
through Equation (
7).
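The element-wise combination of transient attention scores with the stable relation matrix can be sketched as below. This is a hedged illustration only: the function name `gated_attention`, the token matrix `H` (one row per variable), the projection matrices, and the row re-normalization after masking are all assumptions made for the sketch, not details confirmed by the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(H, G, W_q, W_k, W_v):
    """Attention over variables, restricted to the pairs marked stable in G."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))               # window-specific scores
    A = A * G                                                # element-wise product with G
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # assumed re-normalization
    return A, A @ V                                          # scores and aggregated features


rng = np.random.default_rng(0)
N, d_model, d_k = 3, 6, 4                                    # illustrative sizes only
H = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))
G = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
A, Z = gated_attention(H, G, W_q, W_k, W_v)
```

In this toy setting the third variable has no stable relation to the other two, so its aggregated feature reduces to its own value vector, while the first two variables exchange information only with each other.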
3.1.3. Feature Interaction Mechanism
Among existing methods, Transformers have been predominantly employed for temporal feature extraction due to their exceptional capability in capturing inter-temporal relationships, analogous to word interaction modeling in NLP. However, the native attention mechanism does not inherently preserve sequential order, a critical property for time series analysis [27]. GRU, as a streamlined RNN variant, matches LSTM’s performance with fewer parameters while effectively retaining temporal order information and preserving long-term dependencies. We therefore integrate GRUs to complement the Transformers’ temporal modeling capabilities through sequential information retention. Specifically, the GRU acts as a local temporal encoder, processing sequential data stepwise to retain fine-grained short-term dynamics. Its hidden states preserve local temporal order, providing context-rich inputs to the Transformer. The Transformer leverages global self-attention to model long-range dependencies and variable correlations, enhanced by time-invariant embeddings that stabilize relationships across time. When the GRU detects a short-term spike in a sensor, the Transformer evaluates whether this anomaly contradicts the long-term behavioral patterns of other variables (e.g., stable correlations learned via time-invariant embeddings). This collaboration distinguishes true faults from noise, as the GRU’s local sensitivity and the Transformer’s global reasoning mutually validate anomalies.
Specifically, we first calculate the update gate state denoted by $z_t \in \mathbb{R}^{h}$ and the reset gate state denoted by $r_t \in \mathbb{R}^{h}$, where $h$ represents the dimension of the GRU. The reset gate determines the amount of information from the previous hidden state $h_{t-1}$ that should be retained when constructing the candidate hidden state $\tilde{h}_t$. The update gate decides the amount of information from $h_{t-1}$ that should be retained and the amount of information from $\tilde{h}_t$ that should be accepted when constructing the current hidden state $h_t$, as defined in Equation (8). Here, $W_z$ and $W_r$ represent the weight matrices, $b_z$ and $b_r$ represent the biases, and $\sigma$ is the sigmoid operation. After that, we use $r_t$ to compute the candidate hidden state denoted by $\tilde{h}_t$, as defined in Equation (9). Here, $W_h$ and $U_h$ represent the weight matrices, and $b_h$ represents the bias. Then, we combine $\tilde{h}_t$ and the update gate $z_t$ to calculate $h_t$. Finally, $h_t$ is taken as the final output of the GRU, as defined in Equation (10).
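The gate computations described above follow the standard GRU recurrence, which can be sketched in NumPy as below. This is an illustration under assumptions, not the authors' code: the parameter names (`W_z`, `W_r`, `W_h`, `U_h`, `b_z`, `b_r`, `b_h`), the concatenated-input form of the gates, and the convention that the update gate weights the candidate state are choices made for this sketch; other GRU formulations swap the roles of `z` and `1 - z`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU step: gates, candidate state, then the new hidden state."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])                    # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])                    # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_cand                   # blend old and candidate state


def gru_encode(window, p, hidden):
    """Run the GRU over a window in temporal order and return the last hidden state."""
    h = np.zeros(hidden)
    for x_t in window:
        h = gru_step(x_t, h, p)
    return h


rng = np.random.default_rng(0)
n_in, hidden, w = 3, 4, 5                                    # illustrative sizes only
params = {
    "W_z": rng.normal(scale=0.5, size=(hidden, hidden + n_in)),
    "W_r": rng.normal(scale=0.5, size=(hidden, hidden + n_in)),
    "W_h": rng.normal(scale=0.5, size=(hidden, n_in)),
    "U_h": rng.normal(scale=0.5, size=(hidden, hidden)),
    "b_z": np.zeros(hidden),
    "b_r": np.zeros(hidden),
    "b_h": np.zeros(hidden),
}
window = rng.normal(size=(w, n_in))
h_last = gru_encode(window, params, hidden)
h_reversed = gru_encode(window[::-1], params, hidden)
```

Feeding the same observations in reversed order produces a different final state, which is the order sensitivity that plain self-attention lacks and that the GRU is meant to contribute.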
3.2. Feature Fusion
The FF module incorporates NeXtVLAD, originally developed for video processing. To mitigate model overemphasis on time-invariant features, this module recalibrates spatio-temporal feature weights for optimal balance. Video data, conceptualized as multivariate time series, exhibit inherent spatio-temporal correlations. Each pixel represents a variable, while frame sequences capture temporal observations [35]. NeXtVLAD’s success in video feature fusion stems from its ability to aggregate intra-frame (spatial) and inter-frame (temporal) dependencies [30,36]. As a VLAD extension [37], NeXtVLAD reduces parameterization while enhancing performance over NetVLAD [38]. Its proficiency in spatio-temporal dependency aggregation makes it ideal for fusing temporal patterns and invariant characteristics in our framework.
Figure 3 shows the architecture of NeXtVLAD. Before Feature Fusion, our pipeline conducts feature decomposition followed by clustering operations. The process computes residuals between decomposed features and cluster centroids based on grouping results. This methodology achieves dual objectives: mitigating noise contamination through efficient feature representation while amplifying spatio-temporal feature saliency to counteract the disproportionate emphasis on time-invariant characteristics. Specifically, we first feed the spatio-temporal features to a linear layer in order to expand their dimension. Then, we divide the expanded feature x into G groups; this step can be considered as decomposing x into G low-dimensional feature vectors. We then cluster these grouped features; the cluster centers u are learned during model training and fit the time-invariant features. Next, we compute the probability that each group of features belongs to each cluster center; this probability is computed by a linear layer followed by a softmax, as defined in Equation (11). We then aggregate the residuals between each group of features and the cluster centers u, weighted by these probabilities, to eliminate redundant information and noise; the residuals represent dynamic features, as defined in Equation (12). Next, we use a linear layer and a sigmoid operation to calculate the weights for each group of features and construct the global features of the input data based on these weights, as defined in Equation (13). Finally, a linear layer is applied to mine high-level features and to perform the dimensional transformation required by the prediction task, yielding the final output.
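The grouping, soft assignment, residual aggregation, and gating steps above can be sketched in NumPy as follows. This is a simplified illustration of a NeXtVLAD-style aggregation under stated assumptions, not the authors' module: the function name `nextvlad_fuse`, the parameter names, the expansion factor `lam`, and all sizes are hypothetical, and normalization layers used in practical implementations are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def nextvlad_fuse(features, params, groups):
    """features: (M, D) array with one row per item (for example, per variable)."""
    M = features.shape[0]
    centers = params["centers"]                       # (K, D_g), learned in training
    K = centers.shape[0]
    x = features @ params["W_expand"]                 # expand to (M, lam * D)
    x_groups = x.reshape(M, groups, -1)               # split into (M, G, D_g)
    # Soft assignment of every group to every cluster: linear layer + softmax.
    assign = softmax((x @ params["W_assign"]).reshape(M, groups, K), axis=-1)
    # Sigmoid gate giving one weight per group.
    gate = sigmoid(x @ params["W_gate"])              # (M, G)
    residuals = x_groups[:, :, None, :] - centers[None, None, :, :]  # (M, G, K, D_g)
    weights = (assign * gate[:, :, None])[..., None]  # (M, G, K, 1)
    vlad = (weights * residuals).sum(axis=(0, 1))     # aggregated residuals, (K, D_g)
    return vlad.reshape(-1) @ params["W_out"]         # final linear projection


rng = np.random.default_rng(0)
M, D, lam, groups, K, out_dim = 5, 4, 2, 2, 3, 6      # illustrative sizes only
D_g = lam * D // groups
params = {
    "W_expand": rng.normal(scale=0.3, size=(D, lam * D)),
    "W_assign": rng.normal(scale=0.3, size=(lam * D, groups * K)),
    "W_gate": rng.normal(scale=0.3, size=(lam * D, groups)),
    "centers": rng.normal(scale=0.3, size=(K, D_g)),
    "W_out": rng.normal(scale=0.3, size=(K * D_g, out_dim)),
}
fused = nextvlad_fuse(rng.normal(size=(M, D)), params, groups)
```

Because only residuals relative to the cluster centers are aggregated, a grouped feature that coincides with its center contributes nothing to the fused vector, which is one way to read the claim that the residuals carry the dynamic part of the signal.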
3.3. Optimization and Anomaly Inference
3.3.1. Prediction
The model utilizes the standard patterns learned during training to generate predictions for test data. Because anomalous data follow complex and variable patterns that differ from those of normal data, and because time-invariant features affect anomalous and normal data differently, the prediction deviation becomes pronounced for anomalies, enabling detection through error analysis. We process the features fused by FF through a linear layer, yielding the prediction.
3.3.2. Loss Function
Compared to MAE, MSE exhibits greater sensitivity to outliers, prioritizing them at the expense of normal data accuracy. However, anomaly detection models learn the standard patterns during training, and the training set tends not to contain anomalies, which is what outliers often are, so the impact of outliers is small. Therefore, we adopt Mean Squared Error (MSE) as the loss function to accelerate model convergence; it averages the squared differences between the predictions and the ground truth. This is a common choice for regression problems, as defined in Equation (14).
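The difference in outlier sensitivity between the two losses can be checked with a small worked example. The numbers below are invented for illustration: both prediction vectors have the same MAE, but the one that concentrates its error in a single point has a four times larger MSE.

```python
def mse(pred, truth):
    """Mean of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def mae(pred, truth):
    """Mean of the absolute differences."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


truth = [1.0, 2.0, 3.0, 4.0]
spread = [1.5, 2.5, 3.5, 4.5]   # every point is off by 0.5
spiky = [1.0, 2.0, 3.0, 6.0]    # a single point is off by 2.0

mae_spread, mae_spiky = mae(spread, truth), mae(spiky, truth)   # 0.5 and 0.5
mse_spread, mse_spiky = mse(spread, truth), mse(spiky, truth)   # 0.25 and 1.0
```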
3.3.3. Anomaly Inference
We compute anomaly scores per variable based on the distance of the predicted value from the ground truth value, then aggregate variable-wise scores into temporal point scores for holistic anomaly detection. These variable-level scores provide diagnostically valuable insights, facilitating root cause localization while enhancing model interpretability. To calculate the anomaly scores, we measure the Mean Absolute Error (MAE) between the prediction and the ground truth of the $i$th variable at time $t$, computed using Equation (15).
Due to the varying difficulty of predicting each variable, such as with discrete and continuous data, certain variables may have a significantly higher average prediction error than others. To neutralize this bias in the final anomaly scoring, we implement the normalization strategy employed by GDN. We first compute the median and the interquartile range (IQR) of the prediction errors of the $i$th variable over the time series and use them to normalize the errors. We then smooth the anomaly scores with a simple moving average, as defined in Equation (16).
Based on the prediction deviation, we compute the anomaly score of each variable at each time point and take the maximum over the variables as the overall score for that time point, as defined in Equation (17).
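The scoring pipeline described in this subsection (absolute error, median/IQR normalization, moving-average smoothing, maximum over variables) can be sketched as below. The function name `anomaly_scores`, the smoothing width, the small constant `eps`, and the centered zero-padded moving average are assumptions made for illustration; the toy data are invented.

```python
import numpy as np


def anomaly_scores(pred, truth, smooth=3, eps=1e-8):
    """Return per-variable smoothed scores of shape (T, N) and per-time-point scores of shape (T,)."""
    err = np.abs(pred - truth)                          # absolute error per variable
    median = np.median(err, axis=0)
    q75, q25 = np.percentile(err, [75, 25], axis=0)
    normalized = (err - median) / (q75 - q25 + eps)     # robust normalization
    kernel = np.ones(smooth) / smooth                   # simple moving average
    smoothed = np.column_stack(
        [np.convolve(normalized[:, i], kernel, mode="same") for i in range(err.shape[1])]
    )
    return smoothed, smoothed.max(axis=1)               # max over variables per time point


# Toy data: 12 time points, 2 variables, one large error on variable 1 at t = 6.
truth = np.zeros((12, 2))
pred = np.zeros((12, 2))
pred[:, 0] = [0.1, 0.2] * 6
pred[:, 1] = [0.1, 0.2] * 6
pred[6, 1] = 5.0
per_variable, per_time = anomaly_scores(pred, truth)
```

The per-time-point score peaks at the injected error, and the per-variable scores at that time point identify which variable caused it, which is the diagnostic use mentioned above.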
Finally, we set a threshold for the anomaly score; time steps that have scores above the threshold are detected as anomalous, and time steps that have scores below the threshold are detected as normal. Threshold selection critically influences anomaly inference [34], with numerous methodologies existing for this purpose (e.g., Peaks-Over-Threshold (POT) [39]). The optimal threshold is determined through parametric optimization of POT. For simplicity, we exhaustively evaluate candidate thresholds via F1-score-driven optimization to identify the most discriminative value.
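An exhaustive threshold search of this kind can be sketched as follows. The sketch assumes that the score being optimized is F1 and that ground-truth labels are available, so it describes an evaluation protocol rather than a deployable procedure; the function names and the toy scores are hypothetical.

```python
def f1_score(labels, flags):
    """F1 of binary predictions (flags) against binary ground truth (labels)."""
    tp = sum(1 for y, f in zip(labels, flags) if y == 1 and f)
    fp = sum(1 for y, f in zip(labels, flags) if y == 0 and f)
    fn = sum(1 for y, f in zip(labels, flags) if y == 1 and not f)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every distinct score as a threshold and keep the one with the highest F1."""
    best_value, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        flags = [s > candidate for s in scores]   # scores above the threshold are anomalous
        current = f1_score(labels, flags)
        if current > best_f1:                     # ties keep the lowest threshold
            best_value, best_f1 = candidate, current
    return best_value, best_f1


scores = [0.1, 0.2, 0.15, 0.9, 0.8, 0.3]          # toy anomaly scores
labels = [0, 0, 0, 1, 1, 0]                       # toy ground truth
threshold, f1 = best_threshold(scores, labels)
```

On the toy data the search selects 0.3, the lowest threshold that separates the two anomalous points from the rest, with an F1 of 1.0.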