Article

BiLSTM-MLAM: A Multi-Scale Time Series Prediction Model for Sensor Data Based on Bi-LSTM and Local Attention Mechanisms

1 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
2 School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(12), 3962; https://doi.org/10.3390/s24123962
Submission received: 8 May 2024 / Revised: 2 June 2024 / Accepted: 17 June 2024 / Published: 19 June 2024
(This article belongs to the Special Issue Feature Papers in the 'Sensor Networks' Section 2024)

Abstract

This paper introduces BiLSTM-MLAM, a novel multi-scale time series prediction model. Initially, the approach utilizes bidirectional long short-term memory to capture information from both forward and backward directions in time series data. Subsequently, a multi-scale patch segmentation module generates various long sequences composed of equal-length segments, enabling the model to capture data patterns across multiple time scales by adjusting segment lengths. Finally, the local attention mechanism enhances feature extraction by accurately identifying and weighting important time segments, thereby strengthening the model’s understanding of the local features of the time series, followed by feature fusion. The model demonstrates outstanding performance in time series prediction tasks by effectively capturing sequence information across various time scales. Experimental validation illustrates the superior performance of BiLSTM-MLAM compared to six baseline methods across multiple datasets. When predicting the remaining life of aircraft engines, BiLSTM-MLAM outperforms the best baseline model by 6.66% in RMSE and 11.50% in MAE. In the LTE dataset, it achieves RMSE improvements of 12.77% and MAE enhancements of 3.06%, while in the load dataset, it demonstrates RMSE enhancements of 17.96% and MAE improvements of 30.39%. Additionally, ablation experiments confirm the positive impact of each module on prediction accuracy. Through segment length parameter tuning experiments, combining different segment lengths has resulted in lower prediction errors, affirming the effectiveness of the multi-scale fusion strategy in enhancing prediction accuracy by integrating information from multiple time scales.

1. Introduction

With the rapid advancement of the Internet of Things (IoT) and 5G technologies, sensor networks have found widespread applications across various domains, including industrial production, environmental monitoring, intelligent transportation, and energy management. Sensor networks play a crucial role by continuously monitoring and transmitting diverse data such as environmental parameters, device statuses, and user behaviors in real time, thus providing a reliable data foundation for numerous applications. Leveraging the collected data for prediction purposes has become increasingly crucial in this context, facilitating efficient decision-making and management strategies across different fields. For instance, with the rapid digital transformation and the expanding user base of the internet, effective management and prediction of network traffic have gained significant importance. According to the latest report from the China Internet Network Information Center (CNNIC) [1], as of June 2023, the number of internet users in China surged to 1.079 billion, with an internet penetration rate of 76.4%. Internet traffic witnessed substantial growth, reaching a total of 1423 billion GB, marking a 14.6% year-on-year increase. Predicting network traffic can aid in optimizing network resource allocation and enhancing operational efficiency and user experience.
The rapid development of artificial intelligence (AI) technologies offers novel methods and tools to improve the performance of time series prediction in sensor networks [2]. Several studies [3,4,5] have modeled it as a general time series prediction problem and focused on improving prediction performance. The variability of time series data is driven by various factors, including seasonal variations, long-term trends, periodic fluctuations, and random events. Seasonal factors encompass natural seasons, holidays, and differences between working days and weekends, while trend factors involve technological advancements, population growth, and economic development. Additionally, cyclical factors stem from economic cycles and industry-specific fluctuations, while random factors, such as natural disasters and equipment failures, introduce uncertainty and anomalies. Policy changes, management decisions, technological advancements, and societal behaviors also significantly influence data patterns. These factors interact, resulting in data with high dimensionality, nonlinearity, and non-stationarity, making traditional linear prediction models inadequate. Furthermore, the spatial and temporal correlations between data from different sensor nodes further complicate data analysis. For instance, in smart grids, power load data from different regions influence each other, while in intelligent transportation systems, traffic flow in adjacent sections is interrelated.
Addressing these challenges requires researchers to delve into complex data analysis techniques, particularly leveraging the latest advancements in machine learning and deep learning domains. Machine learning, characterized by computational methods that enhance performance or make accurate predictions based on experiences [6], can effectively identify latent patterns and trends from extensive time series datasets, significantly improving prediction accuracy. However, despite the remarkable performance of emerging methods in certain sequence prediction challenges, their effectiveness may not surpass that of traditional methods in specific scenarios. Therefore, to overcome the limitations of traditional models and address the shortcomings of emerging methods, this study aims to conduct an in-depth investigation into sensor network data time series prediction techniques, aiming to achieve higher accuracy and reliability across various application environments, thereby enhancing management and decision-making capabilities in different fields.
The primary contributions of this study include the following:
  • We propose a multi-scale feature fusion prediction method named BiLSTM-MLAM for time series prediction. Experimental validation on multiple datasets demonstrates its outstanding performance.
  • We utilize the Bi-LSTM structure to enable the model to consider past and future contextual information at each time step. Compared to traditional LSTM, Bi-LSTM comprehensively utilizes information in the sequence, aiding in capturing long-term dependencies.
  • We introduce a multi-scale feature fusion mechanism, design the multi-scale patch segmentation module, and obtain long sequences composed of equal-length segments representing multiple time scales by setting different segment lengths. This facilitates the capture of patterns and features at different temporal resolutions.
  • We introduce a local attention mechanism for effective modeling of temporal dependencies within multiple sub-sequence time series, thereby enhancing feature extraction. The local attention mechanism allows explicit extraction of local features from the temporal relationships of sub-sequence time series within long sequences.

2. Related Works

Time series prediction is widely applied across various fields. In finance [7], it aids in predicting stock prices, currency exchange rates, and interest rates, empowering investors to make more informed decisions. In meteorology [8], forecasting weather changes, including temperature and rainfall, offers critical insights for daily travel, agricultural production, and emergency management. Traffic prediction [9] supports urban planning and traffic management, while in the medical field [10], it contributes to disease prevention and public health management. In the energy sector [11], forecasting energy demand and prices assists in effective energy planning and management for enterprises and dispatch departments. In practical applications, time series generated by complex systems often exhibit non-stationarity and non-linear characteristics, encompassing both deterministic and stochastic components. Previous research has predominantly relied on mathematical-statistical models, such as autoregressive integrated moving average [12], vector autoregression [13], and generalized autoregressive conditional heteroskedasticity [14]. However, these models, constrained by fixed mathematical formulas, struggle to adequately express the complex features of time series, posing challenges in accurately predicting them. While classical methods like ARIMA models [15,16] and exponential smoothing [17,18] have been utilized in many studies with certain achievements, they often focus on a single time series, overlooking the correlations between sequences and encountering limitations in handling complex time patterns, such as long-term dependencies and non-linear relationships.
Propelled by rapid advancements in machine learning and deep learning, time series prediction algorithms leverage the potential of these cutting-edge technologies, demonstrating remarkable performance. Machine learning methods transform time series problems into supervised learning, utilizing feature engineering and advanced algorithms for prediction, effectively addressing complex time series data. As machine learning applications in research expand, notable models like the random forest model, support vector regression model, and the Bayesian network model have emerged. While these methods excel with straightforward datasets, they face challenges in capturing intricate non-linear relationships among multiple variables with extensive datasets, leading to suboptimal performance.
In recent years, researchers have increasingly relied on artificial neural networks (ANNs) to tackle complex time series prediction problems, owing to their capabilities for self-learning, self-organization, adaptability, and robust approximation of non-linear functions. The development of prediction algorithms based on deep neural networks indicates a growing trend [19,20,21,22,23,24,25,26]. Among these, the multi-layer perceptron (MLP) [27] and the extreme learning machine (ELM) [28] are frequently utilized in time series prediction. Indrastanti et al. [29] employed a multi-layer perceptron to develop a precise flood prediction system. Sven F. Crone [30] attained second place in a synthetic time series competition and on the ESTSP 2008 dataset by employing an MLP. Alexander Grigorievskiy et al. [31] utilized an optimally pruned extreme learning machine (OP-ELM) for long-term time series prediction. Min Han et al. [32] introduced an innovative method combining a hybrid variable selection algorithm with an enhanced extreme learning machine for predicting multivariate chaotic time series. In domains such as renewable energy and the power market, MLP and ELM demonstrate superior performance compared to traditional statistical approaches and machine learning methodologies, effectively processing intricate energy data, power demand, and market behavior to provide accurate predictions and decision support [33,34,35,36]. However, due to their simplistic structure, ANNs have specific limitations in temporal feature extraction. Consequently, many studies have shifted to recurrent neural networks (RNNs). These models, specifically crafted for addressing time series problems, employ gated unit structures to manage information: they pass information layer by layer through concatenation and recursively update themselves for predictions, finding widespread application in prediction problems [24,37,38,39,40,41,42,43].
Among various variants of RNN, long short-term memory (LSTM) has become the most popular model for addressing the exploding and vanishing gradient problems during RNN training [44]. For example, Vinayakumar et al. [45] successfully employed LSTM in backbone networks for traffic prediction. Despite the impressive performance of LSTM in many aspects, it has some limitations, especially in dealing with very long sequences, where a lack of long-term dependencies may arise. As sequences lengthen, LSTM’s memory units may lose vital information, compromising its predictive accuracy. Furthermore, longer sequences exacerbate LSTM’s computational complexity, placing a burden on computational resources and constraining its applicability for handling extended sequences.
In 2017, the Transformer architecture proposed by Google gradually found applications in time series prediction [46]. The Transformer model handles time series data by introducing self-attention mechanisms, effectively capturing repetitive patterns of long-term dependencies. The self-attention mechanism enables the model to assign greater attention weights to important information at different positions in the sequence, thereby enhancing modeling capabilities. Architectures based on attention mechanisms have shown outstanding performance in time series prediction tasks [21,22]. Bryan Lim et al. [47] introduce the temporal fusion transformer (TFT)—a novel attention-based architecture that combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. Over time, efficient mixed time series prediction models have started to emerge. For instance, the encoder–decoder model, comprising two LSTMs acting as an encoder and decoder, can proficiently extract features and produce accurate predictions. Liang et al. [42] employed encoder–decoder architecture to investigate two environmental quality datasets in the experiment, demonstrating the satisfactory performance of the method. Recently, time series prediction models that combine convolutional neural networks (CNNs) and graph neural networks (GNNs) have made significant progress. These advanced models leverage CNNs’ strengths in capturing local patterns and GNNs’ abilities to handle relational data structures, offering superior performance for complex time series data. Zhao et al. [48] demonstrated that STGCN-HO, which integrates graph convolution and gated linear units, significantly improves cellular traffic prediction accuracy compared to existing RNN and CNN-grid methods.
Although significant progress has been made in time series prediction methods, there are still some challenges. One major issue is the difficulty in capturing long-term dependencies and complex patterns in non-stationary time series data, which often exhibit sudden changes and trends that traditional models struggle to adapt to. Additionally, existing models may lack the robustness required to handle the inherent noise and randomness in real-world data, leading to suboptimal predictions. Models like LSTM, while proficient at handling certain sequence dependencies, may encounter limitations when dealing with very long sequences, posing problems for maintaining long-term memory. Therefore, the demand for innovative approaches, such as the self-attention mechanisms in Transformer models, becomes crucial. These models offer enhanced capabilities in recognizing and modeling complex patterns within time series data, emphasizing the urgency and necessity for continued advancements in this field to achieve more accurate and reliable predictions.

3. Methods

3.1. Overall Framework

In predicting future time series based on known time series data, this paper introduces a novel time series prediction model. This model framework, which is termed the multi-scale time series prediction model and is based on Bi-LSTM and local attention mechanism (BiLSTM-MLAM), incorporates three essential modules: bidirectional long short-term memory (Bi-LSTM), multi-scale segmentation, and local attention mechanism (LAM). The schematic diagram illustrating the BiLSTM-MLAM framework is presented in Figure 1.
Assume an input raw sequence $I = \{x_1, x_2, x_3, \dots, x_t\}$ is used to predict $y_{t+1}$, where $x_i \in \mathbb{R}^m$ and $m$ denotes the feature dimension. Initially, we input the raw sequence $I$ in chronological order into a bidirectional long short-term memory network. Bi-LSTM extracts information in both the forward and backward directions, enabling the model to thoroughly consider, at each time step, the current input along with contextual information from the past and future. This approach enhances the model’s ability to capture long-term dependencies and to learn patterns and features within the sequence.
Simultaneously, to enable the model to capture and extract features at different time scales, enhancing its perception of various scale features within the time series, we introduce the multi-scale patch segmentation module. This module, based on a set hyperparameter specifying the segment length, divides a long time series into multiple equal-length time sequence segments, each considered as a local region. By setting different segment lengths, we obtain long sequences composed of equal-length segments representing multiple time scales.
Furthermore, the model incorporates the local attention mechanism, performing segment-based attention computations within each long sequence. This allows the model to explicitly extract local features from the temporal relationships of sub-sequence time series, capturing the temporal dependencies of multiple sub-sequence time series more effectively. The local attention mechanism facilitates interaction and integration of information within long sequences, enhancing the model’s expressive power. By applying the local attention mechanism to equal-length segments of various segment lengths constituting different long sequences, we efficiently acquire local features at different time scales.
Finally, all features are fused, and the prediction result is obtained through fully connected layers. Overall, the BiLSTM-MLAM model, by considering temporal information at different time scales and local dependencies comprehensively, enhances its accuracy in predicting future time series.
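As a rough orientation, the end-to-end data flow described above can be sketched as a composition of the three modules. All callables and names below (`bilstm`, `segment`, `attend`, `fuse`, `head`) are hypothetical placeholders standing in for the trained modules, not the authors’ implementation:

```python
def bilstm_mlam_forecast(series, bilstm, segment, attend, fuse, head,
                         seg_lengths=(2, 4, 8)):
    """Sketch of the BiLSTM-MLAM pipeline: Bi-LSTM encoding, then
    per-scale patch segmentation + local attention, then feature fusion
    and a fully connected prediction head."""
    encoded = bilstm(series)                                   # Bi-LSTM features
    per_scale = [attend(segment(encoded, L)) for L in seg_lengths]  # one view per scale
    return head(fuse(per_scale))                               # fuse + FC head
```

The segment lengths (2, 4, 8) mirror the multi-scale example in Figure 4; any concrete module implementations would be plugged in for the placeholder callables.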

3.2. Bi-LSTM

Recurrent neural networks (RNNs) can capture dependencies within sequences and have shown good results when training on short sequences in the past. However, they suffer from issues such as exploding or vanishing gradients when dealing with long sequences due to having only one hidden state. Long short-term memory (LSTM) was invented by Hochreiter [44] to solve this issue. LSTM networks incorporate memory cells and gate mechanisms, allowing them to better capture and retain long-term dependencies within sequential data. The structure of the LSTM network unit is illustrated in Figure 2.
In addition to the ordinary output $h_t$ of each unit, LSTM introduces a second output, the memory cell $C_t$, and adds three gates: a forget gate, an input gate, and an output gate. Under the control of these three gates, LSTM can regulate the flow of information, enabling it to better handle and store long-term dependent information and thereby improving its memory capacity. The gating structure, hidden-layer output, and cell-state transfer process of the LSTM unit are shown in Equations (1)–(6).
$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
$\tilde{C}_t = \tanh(w_C \cdot [h_{t-1}, x_t] + b_C)$ (3)
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (4)
$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
$h_t = o_t \odot \tanh(C_t)$ (6)
The forget gate ($f_t$) decides what information to discard from the previous cell state. The input ($x_t$) provides the current information, while $h_{t-1}$ represents the previous hidden state. The sigmoid activation function ($\sigma$) determines the probability of forgetting. The input gate ($i_t$) selects new information to store, combining it with the candidate memory ($\tilde{C}_t$). $C_t$ is the current cell state, with $f_t \odot C_{t-1}$ representing the retained portion of the old state and $i_t \odot \tilde{C}_t$ the newly admitted input. The output gate ($o_t$) controls what is passed to the hidden layer using the sigmoid and tanh activations, yielding the final output $h_t$.
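To make Equations (1)–(6) concrete, the following is a minimal NumPy sketch of a single LSTM step. The dict-based weight layout and key names are illustrative assumptions, not the paper’s implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equations (1)-(6).

    W maps the concatenation [h_prev, x_t] to each gate's pre-activation;
    b holds the corresponding bias vectors (keys 'f', 'i', 'C', 'o' are an
    assumed layout)."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate,  Eq. (2)
    c_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate memory, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde        # new cell state,   Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate,      Eq. (5)
    h_t = o_t * np.tanh(c_t)                  # hidden output,    Eq. (6)
    return h_t, c_t
```

Because $o_t \in (0, 1)$ and $\tanh(C_t) \in (-1, 1)$, the hidden output is always bounded in magnitude by 1.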
However, LSTM relies solely on the previous time step’s data and past information for predictions, potentially overlooking future contextual information. To address this shortcoming, the LSTM network in this research adopts a bidirectional topology that processes data over the full time range, enabling the utilization of both preceding features and forthcoming information. The schematic representation of the Bi-LSTM concept is illustrated in Figure 3.
Bi-LSTM employs two distinct hidden layers: the forward hidden layer transmits information from the past to the future, while the backward hidden layer conveys information from the future to the past. In deep learning architectures, Bi-LSTM shows enhanced data representation capabilities compared to conventional LSTM. The output in Bi-LSTM is elucidated as follows:
$h_t^f = \mathrm{LSTM}(x_t, h_{t-1}^f)$ (7)
$h_t^b = \mathrm{LSTM}(x_t, h_{t+1}^b)$ (8)
$y_t = W_{hy}^f h_t^f + W_{hy}^b h_t^b + b_o$ (9)
$W_{hy}^f$ represents the weights from the forward layer to the output layer, $W_{hy}^b$ represents the weights from the backward layer to the output layer, and $b_o$ is the bias vector of the output layer. $h_t$ is formed by integrating $h_t^f$ and $h_t^b$. At time $t$, Bi-LSTM simultaneously utilizes both past and future data, combining the information from both directions for learning.
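The forward/backward scheme above can be sketched with a generic recurrent step. Here `step(x, h) -> h` is a hypothetical stand-in for a full LSTM cell; the sketch only illustrates how the two passes are run and their hidden states concatenated per time step:

```python
import numpy as np

def bidirectional_pass(xs, step, h0):
    """Run a recurrent `step` over `xs` in both directions and
    concatenate the forward and backward hidden states at each time step."""
    h = h0
    forward = []
    for x in xs:                              # past -> future
        h = step(x, h)
        forward.append(h)
    h = h0
    backward = [None] * len(xs)
    for t in range(len(xs) - 1, -1, -1):      # future -> past
        h = step(xs[t], h)
        backward[t] = h
    return [np.concatenate([f, bwd]) for f, bwd in zip(forward, backward)]
```

Each output element thus carries both past and future context, which is the representational advantage Bi-LSTM has over a unidirectional LSTM.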

3.3. Multi-Scale Patch Segmentation

Considering the strong local nature of time series, characterized by continuity between adjacent points, we conduct segment-wise processing and maintain continuity within each segment during aggregation. Given a time series of length $T$, setting the patch size equal to the patch step, both equal to the segment length $L_{seg}$, generates non-overlapping patches that divide the time series into different segments. Let $X_i$ represent the $i$-th segment.
$X_i = X_{[\,i \cdot L_{seg} \,:\, (i+1) \cdot L_{seg}\,]}, \quad 1 \le i \le \lfloor T / L_{seg} \rfloor$ (10)
$\lfloor \cdot \rfloor$ represents the floor function, and this yields the following:
$X = \{X_1, X_2, \dots\}$ (11)
By configuring different patch sizes, sequences of varying lengths corresponding to different time scales can be obtained, as shown in Equation (12), allowing the capture of patterns at different temporal resolutions.
$X \xrightarrow{\,L_{seg}^1,\, L_{seg}^2,\, \dots\,} X^1, X^2, \dots$ (12)
where $L_{seg}^1, L_{seg}^2$ represent the segment lengths at different scales, and $X^1, X^2$ are sequences composed of segments from the corresponding time scales. $L_{seg}$ determines the resolution of the minimum unit involved in the segment-correlation calculation. A larger $L_{seg}$ captures coarse-grained time dependencies in the time series, while a smaller $L_{seg}$ captures fine-grained dependencies. Through comprehensive learning, multi-granularity features are obtained, utilizing multi-scale information to aid predictions. Figure 4 provides examples of multi-scale segmented sequences obtained with segment lengths of 2, 4, and 8, respectively.
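A minimal sketch of the multi-scale patch segmentation follows. The function names are our own; since the patch size equals the patch step, segments do not overlap, and any trailing remainder shorter than $L_{seg}$ is dropped (an assumption consistent with the floor bound in Equation (10)):

```python
import numpy as np

def patch_segment(x, l_seg):
    """Split a series of length T into floor(T / l_seg) equal-length,
    non-overlapping segments (patch size == patch step == l_seg)."""
    n = len(x) // l_seg                  # trailing remainder is dropped
    return [x[i * l_seg:(i + 1) * l_seg] for i in range(n)]

def multi_scale_segments(x, seg_lengths=(2, 4, 8)):
    """One segmented view of the series per scale, as in Figure 4."""
    return {L: patch_segment(x, L) for L in seg_lengths}
```

For a series of length 16, this yields 8, 4, and 2 segments at scales 2, 4, and 8, respectively.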

3.4. Local Attention Mechanism

The attention mechanism focuses on key positions, reduces the weighting of non-key positions, and highlights the more relevant influencing factors, aiding the model in making accurate judgments. After segmenting the time series, the attention mechanism is applied on a segment-by-segment basis. For a long sequence $X = \{x_1, x_2, \dots\}$ at a specific time scale, the local attention mechanism module is applied to effectively capture the correlations between different segments. The structure of the local attention mechanism is illustrated in Figure 5.
For the input time series segment $x_i$, three learnable mapping matrices map it to the query matrix $q_i$, key matrix $k_i$, and value matrix $v_i$ for that time segment. The specific calculation process is as follows:
$q_i = W^q x_i$ (13)
$k_i = W^k x_i$ (14)
$v_i = W^v x_i$ (15)
The correlation between different time segments is as follows:
$\alpha_{i,j} = q_i \odot k_j$ (16)
where $\odot$ denotes element-wise multiplication; the correlation between time segments is measured by the element-wise product of the query matrix $q$ and the key matrix $k$. Once the correlations between multiple time segments have been computed from the query and key matrices, the softmax function is applied for normalization. The calculation formula is as follows:
$\tilde{x}_i = \sum_{j=1}^{n} \alpha_{i,j} v_j$ (17)
The outputs $\tilde{x}_i$ of the multiple time segments are concatenated along the time dimension. The original sequence $X$ is thus transformed into a time series $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n\}$, representing the features learned through the local attention mechanism module. Here, $n$ denotes the number of segments into which the long sequence $X$ is divided.
For the segment sequences $X^1, X^2, \dots$ at different time scales, the local attention mechanism module is applied to obtain $\tilde{X}^1, \tilde{X}^2, \dots$; these results are concatenated and fed into a fully connected layer to obtain the final prediction result.
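A NumPy sketch of the segment-wise attention in Section 3.4 follows. One interpretive assumption: the element-wise correlation $q_i \odot k_j$ is reduced to a scalar score by summing over features (i.e., a dot product) before the softmax; the matrix names `Wq`, `Wk`, `Wv` are ours:

```python
import numpy as np

def local_attention(segments, Wq, Wk, Wv):
    """Attention over a list of equal-length segments at one time scale."""
    X = np.stack(segments)                    # (n, L): n segments of length L
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T    # q_i, k_i, v_i per segment
    scores = Q @ K.T                          # q_i ⊙ k_j summed over features
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)         # softmax over segments j
    return A @ V                              # row i gives the weighted sum over v_j
```

Running this per scale and concatenating the results corresponds to the fusion step before the fully connected layer.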

4. Experiments and Simulation Results

In this section, we present a comparative analysis of our approach against six baseline methods on three public datasets, showcasing its superior performance in time series prediction. Furthermore, an ablation experiment is performed to affirm the efficacy of each component of the proposed method. Lastly, we explore the influence of different patch-size parameter configurations, and of their combinations, on prediction performance.

4.1. Data Description

This study validates and compares results using three publicly available datasets: the C-MAPSS dataset, the LTE dataset, and the load dataset. Detailed information for each dataset is summarized in Table 1.
C-MAPSS: This dataset is designed for predicting the remaining lifespan of aircraft engines and is publicly released by the National Aeronautics and Space Administration (NASA) of the United States. The dataset is divided into four subsets, namely FD001 to FD004, each comprising both training and testing datasets. Each training and testing dataset includes engine ID, operational cycles, and three operational conditions (flight altitude, Mach number, and throttle resolver angle), along with 21 additional sensor values, totaling 26 columns. Since previous studies on time series prediction have predominantly concentrated on the FD001 subset, we chose to employ the FD001 dataset in our experiments for ease of comparison. Specifically, our training set includes data from 100 engines, while the testing set also encompasses data from 100 engines. The training set provides data on turbofan engine samples from operation to failure, whereas the testing set comprises turbofan engine samples under identical operating conditions, covering only the front half of the engine’s operational cycles, and includes the true remaining useful life (RUL) values for each sample. The C-MAPSS dataset is available at https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/, accessed on 18 June 2023.
To process the C-MAPSS dataset, we followed these steps:
  • Data loading: We imported the dataset from the provided source, incorporating both the training and testing datasets for the FD001 subset.
  • Feature selection: We identified relevant features such as engine ID, operational cycles, operational conditions, and sensor values.
  • Data preprocessing:
    Handling missing values: We checked for missing values and applied appropriate techniques such as imputation or removal.
    Standardization: We standardized numerical features to have a mean of 0 and a standard deviation of 1.
  • Data splitting: We segmented the dataset into training and testing sets, ensuring a representative distribution of engine samples.
  • Target generation: We derived the target variable, remaining useful life (RUL), for the training set by calculating the remaining operational cycles until failure.
  • Final dataset preparation: We formatted the datasets for model training and evaluation, ensuring proper alignment and compatibility.
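The target-generation step above can be sketched as follows. Array names such as `engine_ids` are our own; the rule is that the RUL for a training row is the engine’s final recorded cycle minus the current cycle:

```python
import numpy as np

def rul_targets(engine_ids, cycles):
    """RUL per row = that engine's maximum cycle minus the current cycle,
    so the last recorded cycle of each run-to-failure engine gets RUL = 0."""
    engine_ids = np.asarray(engine_ids)
    cycles = np.asarray(cycles)
    rul = np.empty_like(cycles)
    for eid in np.unique(engine_ids):
        mask = engine_ids == eid
        rul[mask] = cycles[mask].max() - cycles[mask]
    return rul
```

This mirrors the run-to-failure structure of the FD001 training set, where each engine’s trajectory ends at failure.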
LTE: This dataset, sourced from Kaggle, is designed for predicting LTE 4G network traffic. The dataset encompasses traffic data from 57 base stations in a specific region, covering the period from 23 October 2017 to 22 October 2018. Hourly sampling was conducted, resulting in 24 samples per day. For our experiments, we focused on a three-month subset of the data to assess the proposed model. The LTE dataset can be accessed at https://www.kaggle.com/naebolo/predict-traffic-of-lte-network, accessed on 28 June 2022.
Load dataset: This dataset is intended for load prediction and originates from historical data on electrical, cooling, and heating loads at the Tempe campus, provided by the Campus Metabolism system at Arizona State University. The data spans from 1 January 2019, at 0:00, to 31 July 2023, at 24:00, with a sampling interval of 1 h. The load dataset is available at https://cm.asu.edu, accessed on 8 June 2022.
The LTE and load datasets underwent similar data processing steps to the C-MAPSS dataset, encompassing data loading, feature selection, data preprocessing, data splitting, target generation, and other relevant procedures.

4.2. Model Evaluation Criteria

The root mean square error (RMSE) and mean absolute error (MAE) serve as evaluation criteria. These metrics are employed to quantify the disparity between the observed and predicted data, and are described as follows:
$E_{\mathrm{RMSE}} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (\hat{y}_t - y_t)^2}$ (18)
$E_{\mathrm{MAE}} = \frac{1}{n} \sum_{t=1}^{n} |\hat{y}_t - y_t|$ (19)
where $y_t$ represents the observed responses, $\hat{y}_t$ represents the estimated responses, and $n$ is the total number of observations.
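Both criteria are straightforward to compute; a reference sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted responses."""
    d = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted responses."""
    d = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.mean(np.abs(d)))
```

RMSE penalizes large errors more heavily than MAE, which is why both are reported together in the experiments.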

4.3. Experiments and Discussions

4.3.1. Model Comparison

To demonstrate that the proposed BiLSTM-MLAM approach is preferable, we carried out a short-term prediction experiment comparing BiLSTM-MLAM with six other methods. The C-MAPSS dataset is designed to predict the RUL of aircraft engines, focusing primarily on the engine’s life status, to forecast the life status within the next operating cycle. The LTE and load datasets are collected at an hourly frequency and used to predict the network traffic in a specific region and the load on the Arizona State University Tempe campus for the next time interval. In this experiment, the models compared are as follows:
  • SVR: In time series regression, SVR can adapt flexibly to different data distributions by selecting the optimal kernel function and relevant parameters, thereby providing accurate regression results.
  • RNN: Compared to traditional feedforward neural networks, RNN introduces recurrent connections, enabling it to model sequential information and capture temporal dependencies within sequences.
  • LSTM: This is a variant of RNN designed to overcome the vanishing gradient problem in traditional RNNs. It excels at handling long sequences and capturing long-term dependencies.
  • GRU: Compared to LSTM, GRU has a more concise structure, with one fewer gate unit, resulting in fewer parameters and easier convergence.
  • Bi-LSTM: Bi-LSTM enhances traditional LSTM networks by analyzing input sequences bidirectionally. This enables the network to capture dependencies from both the past and future in the sequence, improving its understanding of temporal relationships in time series data.
  • Encoder–decoder: This is a sequence-to-sequence model widely used in tasks such as machine translation. The model processes the source sequence step by step through an encoder, mapping it to a vector of fixed length.
We employed a grid search method to tune the hyperparameters of each model. For SVR, we settled on a linear kernel, with the regularization parameter and gamma left at their default values. For the neural baselines, including RNN, LSTM, GRU, Bi-LSTM, and encoder–decoder, we conducted a grid search over the sizes of the recurrent and dense layers, considering 16, 32, 64, 128, and 256 units. For BiLSTM-MLAM, we chose the patch sizes from 2, 3, and 4, balancing efficiency and performance.
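The grid search over layer sizes can be sketched as follows. Here `validation_rmse` is a hypothetical stand-in for training a model with the given layer sizes and returning its validation error; in the actual experiments this would involve fitting the corresponding network:

```python
from itertools import product

def validation_rmse(recurrent_units, dense_units):
    # Placeholder scoring function; a real run would train the model
    # with these sizes and evaluate it on a held-out validation set.
    return 0.09 + abs(recurrent_units - 64) * 1e-4 + abs(dense_units - 32) * 1e-4

def grid_search(sizes=(16, 32, 64, 128, 256)):
    # Exhaustively evaluate every (recurrent, dense) size combination
    # and keep the configuration with the lowest validation RMSE.
    best_cfg, best_score = None, float("inf")
    for rec, dense in product(sizes, sizes):
        score = validation_rmse(rec, dense)
        if score < best_score:
            best_cfg, best_score = (rec, dense), score
    return best_cfg, best_score
```

With five candidate sizes per layer, this evaluates 25 configurations per model, which is tractable for the network sizes considered here.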
The prediction results of all approaches on the three datasets are summarized in Table 2, where the best results for each dataset are bolded. Across all datasets, the proposed BiLSTM-MLAM outperforms the other methods in both metrics. Specifically, compared to the encoder–decoder baseline, BiLSTM-MLAM improves RMSE by 6.66% and MAE by 11.50% in predicting the remaining life of aircraft engines, by 12.77% and 3.06% in predicting base station network traffic, and by 17.96% and 30.39% in load prediction. Figure 6, Figure 7 and Figure 8 visualize the prediction results of BiLSTM-MLAM on the three datasets. It can be seen that BiLSTM-MLAM fits the actual values well. These results strongly attest to the excellence of BiLSTM-MLAM in time series forecasting.

4.3.2. Ablation Study

To illustrate the usefulness of the proposed approach and investigate how each component affects prediction performance, we conducted an ablation study on the C-MAPSS dataset. By modifying or removing individual components of BiLSTM-MLAM one at a time, we obtained three variants:
  • Bi-LSTM: Without performing multi-scale segmentation fusion and local attention mechanism processing.
  • BiLSTM-AM: Utilizing the attention mechanism but without the multi-scale fusion mechanism.
  • LSTM-MLAM: Replacing the bidirectional LSTM with a single-layer unidirectional LSTM.
Table 3 presents the experimental results, from which the following conclusions can be drawn:
  • BiLSTM-MLAM achieved the best results across all metrics;
  • BiLSTM-AM outperformed Bi-LSTM, demonstrating that the attention mechanism effectively improves prediction performance;
  • Removing multi-scale patch segmentation significantly reduced the model’s prediction accuracy, highlighting the effectiveness of multi-scale fusion;
  • Compared to LSTM-MLAM, BiLSTM-MLAM showed significant improvements, revealing the crucial role of the bidirectional structure.
In conclusion, each component of the BiLSTM-MLAM model effectively enhances prediction performance.
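As a rough, self-contained illustration of the attention-based weighting that BiLSTM-AM and BiLSTM-MLAM apply to time segments, the following sketch computes softmax weights from per-segment relevance scores and fuses the segment features. The scoring step is a placeholder assumption, not the paper's exact LAM formulation:

```python
import numpy as np

def attention_fuse(segment_features, scores):
    # segment_features: (num_segments, feature_dim)
    # scores: (num_segments,) raw relevance scores for each segment
    # Numerically stable softmax turns scores into weights summing to 1.
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    # Weighted sum emphasises the most relevant time segments.
    fused = (w[:, None] * segment_features).sum(axis=0)
    return fused, w
```

With uniform scores this reduces to a plain average over segments; informative scores shift the weight toward the segments that matter most for the prediction.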

4.3.3. Multi-Scale Feature Fusion

In this section, we delve into the impacts of different segment length parameters on prediction performance within the multi-scale feature fusion mechanism. Specifically, we selected the C-MAPSS dataset and fixed the input data length (T) at 24.
The patch size parameter represents the length of segments in a long time series and governs the time scale at which the model extracts features. In this experiment, we set the patch size to 2, 3, or 4, representing different scales of feature extraction. To confirm that the multi-scale feature fusion technique works as intended, we also conducted experiments combining features extracted from two or three scales. The summarized experimental results are presented in Table 4.
Observing Table 4, we note the significant variations in prediction errors with different patch sizes, attributed to the inconsistency in feature granularity. As the variety of patch sizes in the fusion increases, the model’s prediction error gradually decreases. This is because the multi-scale fusion extracts more comprehensive features, enabling better capturing of changes across different scales in the time series.
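The multi-scale segmentation discussed above can be sketched as follows: an input window of length T = 24 is split into equal-length patches at each scale, yielding 12, 8, and 6 segments for patch sizes 2, 3, and 4, respectively. The helper names are our own, illustrative only:

```python
import numpy as np

def segment(series, patch_size):
    # Split a 1-D series of length T into T // patch_size equal-length
    # patches; T is assumed divisible by the patch size (T = 24 here).
    n = len(series) // patch_size
    return np.asarray(series[:n * patch_size]).reshape(n, patch_size)

def multi_scale_patches(series, patch_sizes=(2, 3, 4)):
    # One segmentation per scale; downstream, features extracted from
    # each scale are fused (e.g., the 2&3&4 configuration in Table 4).
    return {p: segment(series, p) for p in patch_sizes}
```

Fusing the features from all three segmentations exposes the model to both fine-grained and coarser temporal patterns of the same window, which matches the observation that the 2&3&4 configuration achieves the lowest error.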

5. Conclusions

This paper presents a multi-scale time series prediction model named BiLSTM-MLAM, which integrates bidirectional long short-term memory (Bi-LSTM) and local attention mechanism (LAM). By capturing sequence information at different time scales and enhancing feature extraction through the local attention mechanism, BiLSTM-MLAM demonstrates exceptional performance in time series prediction tasks. Experimental results indicate that, compared to alternative methods, BiLSTM-MLAM achieves superior results across multiple metrics, validating its outstanding capabilities on diverse datasets such as C-MAPSS, LTE, and Load. Additionally, the ablation experiments emphasize the effectiveness of the local attention mechanism and multi-scale fusion strategy, as well as the crucial role of the bidirectional LSTM architecture in capturing dependencies in time series data. Segment length parameter tuning experiments further confirm the significant impact of multi-scale fusion on improving prediction accuracy, particularly by optimizing combinations of different segment lengths to effectively enhance model performance.

For the remaining useful life prediction of aircraft engines, BiLSTM-MLAM effectively captures the time series characteristics of engine operational data, enabling more accurate life predictions. This aids in optimizing maintenance and operational planning, thereby enhancing safety and economic efficiency. In LTE network traffic prediction, the model can handle large and complex traffic data, providing high-precision traffic forecasts. This is of significant practical value for network operators in resource allocation, congestion management, and ensuring service quality. In energy load forecasting, BiLSTM-MLAM can help energy management systems more accurately predict load demand and optimize energy distribution and scheduling, thus improving energy utilization efficiency and reducing operational costs.
In the future, we plan to conduct in-depth research building upon the foundation of our proposed model. The focus will primarily be on methods and strategies related to feature extraction and multi-scale feature fusion. Our intention is to explore diverse approaches for extracting features at different granularities and optimizing the fusion of these features. Specifically, we aim to acquire temporal information from multiple perspectives, which may involve finer time partitioning and more flexible feature selection. By delving deeper into the intrinsic patterns of time series data, we aim to devise a more precise predictive model, further enhancing the accuracy of time series forecasting.

Author Contributions

Data curation, Y.F.; methodology, Y.F.; validation, Y.F. and Q.T.; writing—original draft, Y.F. and Q.T.; writing—review and editing, Y.G. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data utilized in this paper were sourced from publicly available resources, including the C-MAPSS dataset from https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/, accessed on 18 June 2023, the LTE dataset from https://www.kaggle.com/naebolo/predict-traffic-of-lte-network, accessed on 28 June 2022, and the load dataset from https://cm.asu.edu, accessed on 8 June 2022.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall framework of the model.
Figure 2. The basic structure of the LSTM neural network.
Figure 3. The Bi-LSTM network structure.
Figure 4. Multi-scale patch segmentation.
Figure 5. The structure and computational process of the LAM.
Figure 6. Prediction results of BiLSTM-MLAM on C-MAPSS.
Figure 7. Prediction results of BiLSTM-MLAM on LTE.
Figure 8. Prediction results of BiLSTM-MLAM on Load.
Table 1. Dataset details.

Datasets   Total Size   Sample Rate
C-MAPSS    17,731       1 cycle
LTE        26,214       1 h
Load       17,520       1 h
Table 2. Prediction results of all models.

                  C-MAPSS            LTE                Load
Models            RMSE      MAE      RMSE      MAE      RMSE      MAE
SVR               0.1212    0.0950   0.1132    0.0766   0.0141    0.0112
RNN               0.1049    0.0763   0.1003    0.0685   0.0124    0.0091
LSTM              0.1007    0.0761   0.0983    0.0707   0.0111    0.0086
GRU               0.1278    0.0966   0.0956    0.0657   0.0116    0.0088
Bi-LSTM           0.1028    0.0756   0.0873    0.0633   0.0108    0.0079
Encoder–Decoder   0.1005    0.0817   0.0971    0.0620   0.0128    0.0102
BiLSTM-MLAM       0.0938    0.0723   0.0847    0.0601   0.0105    0.0071

Note: The bold numbers indicate the best results for each dataset.
Table 3. Results of ablation study on the C-MAPSS dataset.

Models         RMSE     MAE
Bi-LSTM        0.1028   0.0756
BiLSTM-AM      0.0998   0.0741
LSTM-MLAM      0.0984   0.0788
BiLSTM-MLAM    0.0938   0.0723
Table 4. Prediction results of BiLSTM-MLAM with different patch sizes on the C-MAPSS dataset.

Patch Configuration   RMSE     MAE
2                     0.0984   0.0788
3                     0.0981   0.0725
4                     0.1038   0.0731
2 & 3                 0.0977   0.0724
2 & 3 & 4             0.0938   0.0723
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Fan, Y.; Tang, Q.; Guo, Y.; Wei, Y. BiLSTM-MLAM: A Multi-Scale Time Series Prediction Model for Sensor Data Based on Bi-LSTM and Local Attention Mechanisms. Sensors 2024, 24, 3962. https://doi.org/10.3390/s24123962

