1. Introduction
Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior [1]. Anomaly detection involves identifying rare events, items, or observations that significantly differ from expected behaviors or established patterns [2]. It is widely applied across numerous domains, including network monitoring, healthcare, smart devices, smart cities, the Internet of Things, fraud detection, cloud computing, and beyond [3]. In real-world applications, data reported by sensors often take the form of multivariate time series. Such data may exhibit correlations among different variables as well as autocorrelation across individual observations, and accurately identifying anomalies within such complex datasets is a challenging task. With advances in artificial intelligence, deep learning and related techniques have been widely adopted in the field of anomaly detection, providing a variety of effective solutions for detecting anomalies in multivariate time series data.
Traditional methods for anomaly detection predominantly rely on statistical measures and data density. The earliest work in this domain was introduced by Shewhart, who proposed algorithms for detecting anomalies in process quality control applications [4]. Subsequently, more advanced statistical techniques, such as Grubbs's test [5], Student's t-test [6], Hotelling's T-squared test [7], and the chi-squared test [8], were developed to enhance anomaly detection capabilities. The core principle of statistical anomaly detection is that anomalies tend to appear in low-probability regions, whereas normal data instances are typically located in high-probability regions. These algorithms are among the earliest approaches proposed for anomaly detection. Their methodology involves first fitting a statistical model to a given dataset and then performing statistical tests on new data to evaluate its conformity to the model. Data points consistent with the model are classified as normal, while those that deviate are identified as anomalies [9]. However, statistical methods are limited by the assumption that the data conform to a specific distribution, such as a Gaussian distribution; when this assumption is not met, their effectiveness diminishes significantly. Furthermore, these methods struggle with high-dimensional data, as conventional statistical approaches are inherently unsuitable for such scenarios. Although dimensionality reduction techniques like Principal Component Analysis (PCA) can mitigate this issue, they are often inadequate for high-dimensional data with nonlinear relationships. These limitations restrict the applicability of statistical methods in complex anomaly detection tasks.
With the advent of Industry 4.0, researchers and practitioners face an explosion in the volume of data that must be processed, often accompanied by nonlinear relations between variables in large datasets. Concurrently, rapid advances in computational hardware have relaxed computational complexity constraints, allowing data-driven approaches to become mainstream. Among these, machine learning-based anomaly detection has emerged as a prominent research area. Various machine learning algorithms implement anomaly detection through different methodologies. The distance-based k-Nearest Neighbor (k-NN) algorithm [10] operates on the core idea that an anomalous data point lies far from its nearest neighbors, with the anomaly score computed as the distance between the data instance and its k nearest neighbors. Local Outlier Factor (LOF) [11] is another well-known and widely used anomaly detection algorithm, which evaluates the relative density of a data point with respect to its k-nearest neighborhood. Mika et al. [12] combined statistical foundations with machine learning kernel methods to propose Kernel Principal Component Analysis (KPCA), addressing the limitations of PCA in handling high-dimensional data with nonlinear relations. Sheridan et al. [13] successfully applied a clustering-based algorithm, density-based spatial clustering of applications with noise (DBSCAN), to flight anomaly detection during the aircraft approach phase. Isolation-based algorithms, subspace-based algorithms, and others have also proven effective in the domain of anomaly detection. However, while these machine learning methods have partly addressed the problem of anomaly detection in multivariate data with nonlinear relations, they still exhibit varying degrees of failure when confronted with higher-dimensional data and data with autocorrelation.
With the development of artificial intelligence and neural networks, deep learning-based anomaly detection has gradually become mainstream [14]. This approach can handle massive amounts of data and has relatively lenient requirements for training data. Depending on the labels available in the dataset, one can choose unsupervised, semi-supervised, or supervised learning [1]. Among neural network-based methods, Autoencoder (AE)-based approaches are predominantly used due to their powerful capability for high-level data representation. These methods first encode and compress the input into latent variables and then decode it back to a state close to the original data, ensuring that the AE learns only the essential information in the data [15]. Since AEs lack explicit targets, they adopt an unsupervised training approach to learn data representations [16]. However, the massive influx of high-dimensional time series data generated in Internet of Things (IoT) and smart manufacturing environments poses challenges that conventional AEs cannot address because of the inherent temporal properties of such data.
Inspired by Cho et al. [17], researchers have begun integrating Recurrent Neural Networks (RNNs) with AEs, proposing a series of hybrid models. By leveraging RNNs' ability to remember past information, these models aim to handle multivariate time series data. To mitigate the vanishing and exploding gradient issues in RNNs, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are used as substitutes. Experiments in skeleton-based gait anomaly recognition have demonstrated the capabilities of these models, with the Long Short-Term Memory Autoencoder (LSTM-AE) combination outperforming the Gated Recurrent Unit Autoencoder (GRU-AE) [18]. LSTM-AE is widely used as a foundational approach for time series anomaly detection and has spawned numerous variants. Park et al. [19] proposed a Variational Autoencoder (VAE) based on LSTM, successfully detecting anomalies in robot-assisted feeding with performance superior to baseline models. Lee et al. [16] applied Bi-directional Long Short-Term Memory (Bi-LSTM) networks to the AE, where one unidirectional LSTM processes data from past to future and the other from future to past, enhancing anomaly detection accuracy in smart metering systems. Raihan and Ahmed [20] demonstrated that Bi-LSTM-AE outperformed conventional LSTM-AE on wind power datasets. However, these studies all build on the standard LSTM and therefore inherit its inherent limitations.
Gers and Schmidhuber [21] argued that the gate connectivity of conventional LSTMs can lead to the loss of necessary information and a decline in network performance. Landi et al. [22] proposed Working Memory Connections (WMCs) for LSTM, providing an efficient way to utilize intra-cell knowledge within the network. This design significantly outperformed the vanilla LSTM and addressed key issues in previous formulations (LSTM with peephole connections). It has been validated on two different toy problems, as well as in tasks such as digit recognition, language modeling, and image captioning. However, it has not yet been applied in the field of anomaly detection. We believe this variant can enhance existing LSTM-based anomaly detection algorithms by allowing the memory cell state to influence the gates, thereby improving information flow control and making anomaly detection algorithms more sensitive.
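To make this mechanism concrete, the sketch below contrasts a standard LSTM forget gate with one augmented by a working memory connection, following our reading of Landi et al. [22]; the exact formulation, including which cell state each gate receives, should be taken from the original paper:
\[
f_t = \sigma\bigl(W_f x_t + U_f h_{t-1} + b_f\bigr) \quad \text{(standard LSTM)}
\]
\[
f_t = \sigma\bigl(W_f x_t + U_f h_{t-1} + V_f \tanh(c_{t-1}) + b_f\bigr) \quad \text{(with a working memory connection)}
\]
Here, the previous cell state is squashed by a tanh and injected into the gate through a learned projection, in contrast to peephole connections, which feed the unbounded cell state to the gates through diagonal weights.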
Recently, several studies have introduced time series anomaly detection methods that do not rely on LSTM. Jeong et al. [23] proposed AnomalyBERT, a self-supervised Transformer-based approach for time series anomaly detection, which leverages a novel data degradation scheme with synthetic outliers to effectively capture temporal contexts and inter-variable relationships. Yang et al. [24] proposed DCdetector, a dual-attention contrastive representation learning algorithm for time series anomaly detection, which enhances representation differences between normal and anomalous points without relying on reconstruction error. Xu et al. [25] proposed a calibrated one-class classification method for unsupervised time series anomaly detection (COUTA), leveraging Temporal Convolutional Networks and dual calibration techniques to achieve contamination-tolerant, anomaly-informed normality learning with state-of-the-art performance. However, it is worth noting that no universally optimal model or approach currently exists in the academic community.
In this paper, we propose an anomaly detection model called Bi-directional Long Short-Term Memory with Working Memory Connections-Autoencoder (Bi-LSTM-WMCs-AE), which applies Bi-LSTM with WMCs to an AE network. While the Bi-LSTM-WMCs network effectively learns the temporal information inherent in multivariate time series data, the AE part acquires the ability to reconstruct normal data by learning from normal data. The difference between the input data and the output data, known as the reconstruction error, is used for anomaly detection.
The main contributions of our study are as follows:
- (1)
We propose a novel anomaly detection model for multivariate time series data. Building on the widely adopted Bi-LSTM-AE framework, we introduce a variant of LSTM, referred to as Working Memory Connections for LSTM, to replace the traditional LSTM units. This variant enables the memory cell to influence the gate values, thereby facilitating more effective processing of time series data. To the best of our knowledge, our study is the first to apply this novel LSTM variant to the field of anomaly detection.
- (2)
Our model leverages a reconstruction-based framework, making it inherently suited for semi-supervised learning. Unlike approaches that rely on extensive labeled anomaly data, our method requires only normal data for training, significantly reducing the dependency on scarce and resource-intensive expert-labeled anomaly datasets. This advantage addresses a critical challenge in real-world applications, where labeled anomaly data are often limited or unavailable.
- (3)
We evaluated the performance of our model on diverse datasets that include both continuous sensor data and mixed data containing sensor and actuator reports. Our model demonstrated consistently strong performance across these scenarios, highlighting its robustness and scalability in handling various types of multivariate time series data.
The rest of this paper unfolds as follows.
Section 2 discusses the related work.
Section 3 describes the proposed model for multivariate time series anomaly detection.
Section 4 presents a case study using the proposed model.
Section 5 consists of our concluding remarks.
3. Proposed Method
In this section, we introduce our proposed model, which combines Bi-LSTM-WMCs and an AE to detect anomalies through the analysis of multivariate time series data. We first provide an overview of the model in terms of three steps: data preprocessing, the Bi-LSTM-WMCs-AE model, and anomaly detection. We then give a detailed description of the algorithm in terms of its training and testing phases.
Step 1: Data Preprocessing
Similar to models based on traditional LSTM, our proposed model requires transforming the original input data into a format suitable for processing by LSTM units, specifically multiple time windows. Typically, a complete multivariate time series is represented as a 2D array comprising n m-dimensional vectors, where n represents the total number of time points, and m denotes the number of variables in the time series. Since LSTM models process data in multiple time windows, each represented as a 2D array, the number of time points in a single window, referred to as the window size (t), determines the number of time steps the model considers for decision-making.
To preprocess the data, a sliding window approach with a fixed size t is applied, shifting the window by one time point at each step. This process transforms the original 2D array into (n − t + 1) stacked 2D arrays, each with dimensions t × m, representing the sequence of time windows.
Figure 3 illustrates the details of the data preprocessing workflow.
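As a concrete illustration of this step, the short sketch below builds the stacked windows with NumPy; the array shapes and the function name are ours, not taken from the paper's implementation.

import numpy as np

def make_windows(series, t):
    # series: (n, m) multivariate time series; returns (n - t + 1, t, m) windows,
    # shifting the window by one time point at each step.
    n, m = series.shape
    if t > n:
        raise ValueError("window size t must not exceed the series length n")
    return np.stack([series[i:i + t] for i in range(n - t + 1)])

# Example: n = 100 time points, m = 8 variables, window size t = 10 -> shape (91, 10, 8)
windows = make_windows(np.random.rand(100, 8), t=10)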
Step 2: Offline Training
In this step, the traditional LSTM units in the Bi-LSTM-AE network, a model widely employed in existing studies, are replaced with bidirectional LSTM-WMC units to enhance the ability to capture temporal dependencies in multivariate time series data. The Autoencoder comprises an encoder and a decoder, each consisting of two layers. The outer layer contains 64 Bi-LSTM-WMC units, while the inner layer includes half as many units as the outer layer. The decoder reconstructs the input data, and the model is trained by minimizing the reconstruction error between the original input and its reconstruction. Upon completing the training, the reconstruction errors of the training data are calculated and used to establish the threshold for anomaly detection in subsequent steps.
Figure 4 illustrates the general flow of our proposed model.
Figure 5 illustrates the details of Bi-LSTM-WMCs cell in the model.
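For reference, the sketch below reproduces the encoder-decoder layout described above in Keras, using standard Bidirectional LSTM layers as stand-ins for the Bi-LSTM-WMC units (the working-memory-connection cells would require a custom RNN cell, which is not shown). The 64- and 32-unit layer widths follow the text, while the optimizer, loss, and training settings are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(t, m):
    # Input: one time window of shape (t, m)
    inputs = layers.Input(shape=(t, m))
    # Encoder: outer layer with 64 units, inner layer with half as many
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(32))(x)
    # Decoder mirrors the encoder and reconstructs the input window
    x = layers.RepeatVector(t)(x)
    x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(m))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")  # absolute reconstruction error
    return model

# Trained on normal windows only, e.g.:
# model = build_autoencoder(t=10, m=8)
# model.fit(windows, windows, epochs=50, batch_size=64)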
Step 3: Online Detection
In this study, “normal” events are defined as patterns or behaviors in the data that reflect expected or typical operational conditions, while “anomalies” are deviations from these patterns, potentially indicating faults, errors, or unexpected behaviors. The definitions of “normal” and “anomalies” are dataset-specific. A detailed explanation of what constitutes “normal” and “anomalies” will be provided when each dataset is introduced.
An observation that deviates from the majority of the data might be referred to as an anomaly [29]. Prior to the anomaly detection phase, a threshold for distinguishing between normal and anomalous instances must first be established. Similar to traditional AE-based approaches, we use the absolute difference between the input data and the reconstructed data as the reconstruction error. However, it is important to note that since we are dealing with multivariate time series data, both the input and the reconstructed output of our model are three-dimensional arrays, where each time window is represented as a two-dimensional array. Each time point in the original dataset appears in multiple time windows. Therefore, to calculate the final reconstruction error for each time point, we sum all reconstruction errors for that time point and divide by the number of time windows in which that time point appears; this average is used as the final reconstruction error for the time point. Our model is trained on a training dataset that consists exclusively of normal data, so it provides reconstruction error values for normal data points only. After completing the model training and calculating all reconstruction errors, the threshold for anomaly detection is set as the 100(1 − α)th percentile of the reconstruction errors over the entire training dataset. Here, α plays the role of the significance level commonly used in statistical methods, indicating the degree of error tolerance in our model. In the case study below, we plot Performance Curves for α ranging from 0 to 0.1, in intervals of 0.01, to compare the model's performance at different α values. After determining the threshold, the test dataset, which includes both normal and anomalous data, is fed into the trained model for anomaly detection. For each sample in the test dataset, the reconstruction error is calculated; if it exceeds the threshold, the sample is classified as an anomaly.
4. Case Study
Our experiments were conducted on Google Colab, utilizing the default CPU environment with 12.7 GB of system RAM. The implementation was developed in Python 3.10, with key libraries including TensorFlow 2.15, Keras 2.15, Scikit-learn 1.2.2, and pandas 2.0.3, along with other relevant libraries.
To evaluate the performance of our proposed model, we applied three open-source datasets. We express our sincere gratitude to the authors who provided the datasets free of charge. The first one is the Skoltech Anomaly Benchmark (SKAB) dataset, which can be found on Kaggle [30]. The dataset contains multivariate time series collected from the sensors installed on the testbed. The columns in each data file are as follows:
datetime—Represents dates and times of the moment when the value is written to the database (YYYY-MM-DD hh:mm:ss);
Accelerometer1RMS—Shows a vibration acceleration (amount of g units);
Accelerometer2RMS—Shows a vibration acceleration (amount of g units);
Current—Shows the amperage on the electric motor (ampere);
Pressure—Represents the pressure in the loop after the water pump (bar);
Temperature—Shows the temperature of the engine body (degrees Celsius);
Thermocouple—Represents the temperature of the fluid in the circulation loop (degrees Celsius);
Voltage—Shows the voltage on the electric motor (volt);
RateRMS—Represents the circulation flow rate of the fluid inside the loop (liter per minute);
anomaly—Shows if the point is anomalous (0 or 1);
changepoint—Shows if the point is a changepoint for collective anomalies (0 or 1).
In the SKAB dataset, “normal” corresponds to system operation under expected conditions, where sensor readings remain within predefined ranges, whereas “anomalies” are manually induced faults, such as sensor malfunctions or component failures, intentionally introduced to cause deviations from expected values. The training data of the dataset contain normal data only. Also, since our study addresses only anomaly detection and not changepoint detection, the last column (changepoint) is removed in the data preprocessing stage. The size and details of the dataset are summarized in Table 1.
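For illustration, one SKAB data file can be prepared as follows; the file path and the ';' separator are assumptions about the Kaggle distribution rather than details taken from the paper.

import pandas as pd

# Hypothetical path to one SKAB test file; SKAB CSVs are assumed ';'-separated
df = pd.read_csv("valve1/0.csv", sep=";", parse_dates=["datetime"])
labels = df["anomaly"].values                                        # 0 = normal, 1 = anomaly
features = df.drop(columns=["datetime", "anomaly", "changepoint"])   # drop changepoint as described
print(features.columns.tolist())                                     # the eight sensor variables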
Figure 6 illustrates a subset of the training data and one of the test datasets from the SKAB dataset, with the red regions in the figure labeled as anomalies.
The second and third datasets are the Soil Moisture Active Passive (SMAP) and Mars Science Laboratory (MSL) datasets from NASA [31]. These datasets are also publicly available and contain multivariate time series collected from satellite and space exploration missions. They are widely used for benchmarking anomaly detection models due to their diverse set of anomaly types and realistic operational scenarios. For the SMAP dataset, “normal” represents the normal operation of the satellite’s soil moisture monitoring system, where telemetry readings align with expected operational conditions. “Anomalies,” on the other hand, are events derived from expert-labeled Incident Surprise Anomaly (ISA) reports, including system failures such as unexpected sensor readings or communication errors. Similarly, in the MSL dataset, “normal” denotes the proper functioning of the Mars Science Laboratory during its mission, as indicated by telemetry data that reflect standard operational behavior. “Anomalies” are identified through expert analysis of ISA reports and include irregularities such as sensor faults, communication disruptions, or mechanical malfunctions. Since the SMAP and MSL datasets contain multiple entities, each with corresponding training and testing data, we train and test each entity separately.
Figure 7 shows the visualization of a subset of variables from the SMAP dataset (D-1).
Figure 8 shows the visualization of a subset of variables from the MSL dataset (F-5). As in Figure 6, the red regions in the figure are labeled as anomalies.
We employ the commonly used F1-score as the benchmark for evaluating model performance. To assess the efficacy of our proposed model, we compared it with state-of-the-art models, including Recurrent Neural Networks–Autoencoder (RNN-AE), LSTM-AE, LSTM-WMCs-AE, and Bi-LSTM-AE. The construction and parameters of all models are kept consistent with our proposed model. Additionally, we conducted extensive performance comparisons using different threshold values, with α ranging from 0 to 0.1 in increments of 0.01. To further assess the suitability of our proposed model for diverse requirements, we experimented with various time window sizes. This approach aims to determine the scenarios in which our model demonstrates significant advantages.
For the SKAB dataset, we performed these comparisons as described above. However, for the other two datasets, due to space constraints, we selected the optimal time window size after multiple adjustments and compared the model's performance at α values of 0.01, 0.05, and 0.1. The calculation of the F1-score is expressed using the following equations:
Precision = TP / (TP + FP), (13)
Recall = TP / (TP + FN), (14)
F1-score = (2 × Precision × Recall) / (Precision + Recall). (15)
Here, TP (True Positive) refers to the count of accurately identified anomalies, TN (True Negative) refers to the count of correctly identified normal events, FP (False Positive) indicates the count of normal events that were inaccurately diagnosed as anomalies, and FN (False Negative) denotes the count of anomalies that were incorrectly identified as normal events. Using Equations (13)–(15), we computed the performance of each model under various α values and time window sizes.
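As a small worked example of Equations (13)–(15), the snippet below computes the three metrics with scikit-learn (listed among the libraries used above); the labels are illustrative, with 1 marking an anomalous time point.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0]   # ground truth: anomalies at positions 2, 3, 5
y_pred = [0, 1, 1, 1, 0, 0, 0, 0]   # model output: TP = 2, FP = 1, FN = 1
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))         # 2PR / (P + R) = 2/3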
Figure 9 illustrates the comparison of model performance across different time window sizes on the SKAB dataset.
Table 2 presents the comparisons of model performance for four entities from the SMAP dataset.
Table 3 shows the performance of the model on the MSL dataset. In this context, the varying α values determine the sensitivity thresholds for anomaly detection.
To assess the feasibility of the proposed anomaly detection model in real-world scenarios, we conducted comparative experiments to evaluate the training time and detection time of various models across different datasets.
Figure 10 illustrates the training time of each model for different datasets, while Figure 11 presents the time required for anomaly detection. Notably, our model utilizes a time window of size t × m as input, meaning that an increase in the number of time points within a time window (t) or the number of variables in the data (m) inevitably increases computational complexity. This, in turn, necessitates longer training and fitting times. For example, in the SKAB dataset, when the time window size (t) is increased to 40, the training time for our proposed model is considerably longer than that of the baseline models. Although this increases complexity, it results in superior performance compared to those baselines.
Although the training time of our proposed model is longer than that of the baseline models (in particular, models using Working Memory Connections for LSTM take more time to fit than those based on conventional LSTM units), we argue that, with advances in hardware, such computational resource consumption is acceptable, especially in applications where accuracy is critical, such as nuclear power plant scenarios. Moreover, the difference in computational resource requirements is predominantly observed during the model training and fitting phase, with only negligible differences in the anomaly detection phase. Additionally, in real-time application scenarios, where each input consists of a single time window, the decision-making time for all compared models converges to nearly zero.
5. Conclusions
In this study, we developed a novel anomaly detection framework and demonstrated its application using several publicly available multivariate time series datasets. Our proposed framework combines the advantages of an AE and a new variant of LSTM, namely LSTM with Working Memory Connections. The AE part is responsible for learning the latent features of the normal training data so that it acquires the ability to reconstruct normal data; when anomalous data are input into the trained model during the anomaly detection phase, the reconstruction error increases significantly. The Bi-LSTM-WMCs part is responsible for learning the temporal dependencies of the data. Through comparisons with a series of baseline models, we found that as the time window size (t) increases, the performance advantage of our proposed model, Bi-LSTM-WMCs-AE, becomes increasingly evident. For smaller time windows (t = 5), the bidirectional architectures do not exhibit a significant advantage, likely because unidirectional LSTM models can already effectively utilize the temporal information within such short windows. However, models using LSTM-WMCs (LSTM-WMCs-AE and Bi-LSTM-WMCs-AE) almost consistently outperform those without such connections. We therefore have reason to believe that our proposed model, incorporating a bidirectional architecture and LSTM-WMCs, demonstrates greater robustness under various conditions.
The practical significance of our study lies in its ability to address the challenges of anomaly detection in multivariate time series data through a reconstruction-based approach. Central to our contribution is the integration of LSTM-WMCs, which, while not proposed by us, is applied for the first time in the anomaly detection domain in this study. This innovative LSTM variant enhances the model’s ability to effectively process time series data by allowing memory cells to influence gate values, leading to more accurate reconstructions and improved anomaly detection reliability. This also demonstrates the feasibility of applying such a novel LSTM variant to other domains. Furthermore, our model significantly reduces the reliance on labeled anomaly data, addressing a critical limitation in real-world applications where such data are often scarce and expensive to obtain. The robust performance of our approach across datasets with varying characteristics, including continuous sensor data and mixed sensor-actuator data, further highlights its scalability and adaptability. These advancements validate the effectiveness of our approach and underscore its potential to provide a reliable and flexible solution for anomaly detection across diverse fields, such as industrial monitoring and space exploration.
Our study has some limitations that should be acknowledged. Firstly, while replacing the traditional LSTM units with the LSTM-WMCs has undoubtedly improved performance, this enhancement comes at the cost of increased computational complexity. The more intricate connection mechanism of the LSTM-WMCs demands greater computational resources, especially when combined with a bidirectional structure and integrated into the AE framework. This could pose challenges for deploying the model in resource-constrained environments or real-time applications, where computational efficiency is critical. Secondly, our study is primarily focused on detecting point anomalies, and the model’s performance on other types of anomalies remains to be validated.
We have demonstrated the exceptional capability of the Working Memory Connections for LSTM in the domain of anomaly detection. However, LSTM has been widely employed in various real-world applications, such as logistics and inventory management, energy management, and intelligent production scheduling. An interesting direction for future research would be to investigate whether the Working Memory Connections for LSTM remains an effective tool in these application domains.