1. Introduction
The goal of time series forecasting is to predict future values based on patterns observed in historical data. It has been an active area of research with applications in many diverse fields such as weather, financial markets, electricity consumption, health care, and market demand, among others. Over the last few decades, different approaches have been developed for time series prediction involving classical statistics, mathematical regression, machine learning, and deep learning-based models. Both univariate and multivariate models have been developed for different application domains. The classical statistics- and mathematics-based approaches include moving average filters, exponential smoothing, Autoregressive Integrated Moving Average (ARIMA), SARIMA [1], and TBATS [2]. SARIMA improves upon ARIMA by also taking seasonality patterns into account and usually performs better in forecasting complex data containing cycles. TBATS further refines SARIMA by modeling multiple seasonal periods.
With the advent of machine learning, where the foundational concept is to develop a model that learns from data, several approaches to time series forecasting have been explored, including linear regression, XGBoost, and random forests. Using random forests or XGBoost for time series forecasting requires the data to be transformed into a supervised learning problem using a sliding window approach. When the training data are relatively small, the statistical approaches tend to yield better results; however, it has been shown that for larger data, machine learning approaches tend to outperform the classical techniques such as SARIMA and TBATS [2,3].
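The sliding window transformation mentioned above can be sketched as follows. This is a minimal illustration; the function name, the number of lags, and the toy series are arbitrary choices rather than values from any of the cited works:

```python
def sliding_window(series, n_lags):
    """Turn a univariate series into (features, target) pairs suitable for
    supervised learners such as random forests or XGBoost."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # the last n_lags observations
        y.append(series[t])             # the value to predict
    return X, y

series = [10, 12, 13, 12, 15, 16, 18]
X, y = sliding_window(series, n_lags=3)
# first sample: features [10, 12, 13] predict target 12
```

Each row of `X` then becomes one training example for the regressor, and forecasting proceeds one step at a time by feeding the most recent window back in.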
In the last decade, deep learning-based approaches [
4] to time series forecasting have drawn considerable research interest starting from designs based on Recurrent Neural Networks (RNNs) [
5,
6]. A detailed study comparing the ARIMA-based architectures and RNNs [
6] concluded that RNNs can model seasonality patterns directly if the data have homogeneous seasonal patterns; otherwise, a deseasonalization step was recommended. It was also concluded that (semi-) automatic RNN models are no silver bullets but can be competitive in some situations. The work in [
6] compared different RNN designs and indicated that a Long Short-Term Memory (LSTM) cell with peephole connections performed best, the Elman Recurrent Neural Network (ERNN) cell performed worst, and the performance of the Gated Recurrent Unit (GRU) fell in between.
LSTM and Convolutional Neural Networks (CNNs) [
7] have been combined to address the long-term and short-term patterns arising in data. One notable design was proposed in [
8], termed by the authors as Long- and Short-term Time-series network (LSTNet). It uses a CNN to extract short-term local dependency patterns among variables and an RNN to discover long-term patterns in time series trends. Recently, RNNs and CNNs are being replaced by transformer-based architectures in many applications, such as Natural Language Processing (NLP) and Computer Vision. Transformers [
9], which use an attention mechanism to determine the similarity in the input sequence, are one of the best models for NLP applications, as demonstrated by the success of large language models such as ChatGPT. Some time series forecasting implementations using transformers have achieved good performance [
10,
11,
12,
13]; however, the transformer has some inherent challenges and limitations with respect to time series forecasting in current implementations due to the following reasons:
Temporal dynamics vs. semantic correlations: Transformers excel in identifying semantic correlations but struggle with the complex, non-linear temporal dynamics crucial in time series forecasting [
14,
15]. To address this, an auto-correlation mechanism is used in Autoformer [
11];
Order insensitivity: The self-attention mechanism in transformers treats inputs as an unordered collection, which is problematic for time series prediction where order is important. The positional encodings used in transformers partially address this but may not fully incorporate the temporal information. Some transformer-based models try to solve this problem using architectural enhancements; e.g., Autoformer [
11] uses series decomposition blocks that enhance the system’s ability to learn from intricate temporal patterns [
11,
13,
15];
Complexity trade-offs: The attention mechanism in transformers has high computational costs for long sequences due to its quadratic $O(L^2)$ complexity. Modifications using sparse attention mechanisms, e.g., Informer [10], reduce this to $O(L \log L)$ by using a ProbSparse technique. Some models reduce this complexity to $O(L)$, e.g., FEDformer [12], which uses a Fourier-enhanced structure, and Pyraformer [16], which incorporates a pyramidal attention module with inter-scale and intra-scale connections to accomplish the linear complexity. These reductions in complexity come at the cost of some information loss in the time series prediction;
Noise susceptibility: Transformers with many parameters are prone to overfitting noise, a significant issue in volatile data such as financial time series where the actual signal is often subtle [
15];
Long-term dependency challenge: Transformers, despite their theoretical potential, often find it challenging to handle the very long sequences typical in time series forecasting, largely due to training complexities and gradient dilution. For example, PatchTST [14] addresses this issue by disassembling a time series into smaller segments and using them as patches; this may cause segment fragmentation issues at the patch boundaries in the input data;
Interpretation challenge: Transformers’ complex architecture, with layers of self-attention and feed-forward networks, complicates understanding their decision-making, a notable limitation in time series forecasting where rationale clarity is crucial. An attempt has been made in LTSF-Linear [15] to address this by using a simple linear network instead of a complex architecture; however, such a network may be unable to exploit the intricate multivariate relationships in the data.
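To make the order-insensitivity point above concrete, the standard sinusoidal positional encoding from the original transformer paper is the mechanism that only partially injects temporal order. A minimal sketch (the function name and the chosen dimensions are illustrative):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al.): even dimensions
    get sin, odd dimensions get cos, at geometrically spaced frequencies."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
```

These vectors are added to the input embeddings; note that they encode only position within the window, not calendar structure such as hour-of-day or seasonality, which is why they capture temporal information only partially.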
In summary, different approaches for time series forecasting have been explored. These include classical approaches based on mathematics and statistics, neural network approaches (including linear networks, LSTMs, and CNNs), and, more recently, transformer-based approaches. Even though transformer-based models have been claimed to outperform previous approaches, the recent work in [15] questions the use of complex models including transformers, and shows that a simple linear neural network yields better results than transformer-based models. It seems counter-intuitive not to utilize the attention capabilities of the transformer, which has revolutionized text generation in large language models. We investigate this paradox further to see whether better models for time series can be created using either linear network or transformer-based approaches. We review the related work in the next section before elaborating on our enhanced models.
2. Related Work
Some of the recent works related to time series forecasting include models based on simple linear networks, transformers, and state-space models. One of the important works related to Long-Term Time Series Forecasting (LTSF), termed LTSF-Linear, was presented in [
15]. It uses the most fundamental Direct Multi-Step (DMS) [
17] model through a temporal linear layer. The core approach of LTSF-Linear involves predicting future time series data by directly applying a weighted sum to historical data, as shown in
Figure 1.
The output of LTSF-Linear is described as $\hat{X}_i = W X_i$, where $W \in \mathbb{R}^{T \times L}$ is a temporal linear layer and $X_i$ is the input for the $i$th variable. This model applies uniform weights across various variables without considering spatial correlations between the variates. Besides LTSF-Linear, a few variations termed NLinear and DLinear were also introduced in [
15]. NLinear processes the input sequence through a linear layer with normalization by subtracting and re-adding the last sequence value before predicting. DLinear decomposes raw data into trend and seasonal components using a moving average kernel, processes each with a linear layer, and sums the outputs for the final prediction [
15]. This concept has been borrowed from the Autoformer and FEDformer models [
11,
12].
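The DLinear decomposition described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the weight matrices `w_trend` and `w_seasonal` are placeholder values that would be learned in practice, and padding the moving average with edge values is one possible choice:

```python
def moving_average(x, kernel):
    """Trend extraction: average over a sliding window, padding both ends
    so the trend series has the same length as the input."""
    half = kernel // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    return [sum(padded[i:i + kernel]) / kernel for i in range(len(x))]

def dlinear_forecast(x, w_trend, w_seasonal, kernel=3):
    """Decompose the input into trend + seasonal components, apply one
    linear map (a weighted sum over history) to each, and add the results."""
    trend = moving_average(x, kernel)
    seasonal = [xi - ti for xi, ti in zip(x, trend)]
    pred_t = [sum(w * t for w, t in zip(row, trend)) for row in w_trend]
    pred_s = [sum(w * s for w, s in zip(row, seasonal)) for row in w_seasonal]
    return [a + b for a, b in zip(pred_t, pred_s)]

# Toy example: look-back L = 4, horizon T = 2, uniform placeholder weights.
x = [1.0, 2.0, 3.0, 4.0]
w = [[0.25] * 4, [0.25] * 4]
forecast = dlinear_forecast(x, w_trend=w, w_seasonal=w)
```

With uniform weights each predicted step is simply the mean of the trend plus the mean of the seasonal residual; the learned weights in the actual model instead emphasize the most predictive lags.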
Although some research indicates the success of the transformer-based models for time series forecasting, e.g., [
10,
11,
12,
16], the LTSF-Linear work in [
15] questions the use of transformers due to the fact that the permutation-invariant self-attention mechanism may result in temporal information loss. The work in [
15] also presented better forecasting results than the previous transformer-based approaches. However, important research later presented in [
14] proposed a transformer-based architecture called PatchTST, showing better results than [
15] in some cases. PatchTST segments the time series into subseries-level patches and maintains channel independence between variates. Each channel contains a single univariate time series that shares the same embedding and transformer weights across all the series.
Figure 2 depicts the architecture of PatchTST.
In PatchTST, the $i$th series over $L$ time steps is treated as a univariate input $x^{(i)} \in \mathbb{R}^{1 \times L}$. Each of these is fed independently to the transformer backbone after conversion to patches, which provides prediction results for $T$ future steps. For a patch length $P$ and stride $S$, the patching process generates a sequence of $N$ patches $x_p^{(i)} \in \mathbb{R}^{P \times N}$, where $N = \lfloor (L-P)/S \rfloor + 2$. With the use of patches, the number of input tokens reduces from $L$ to approximately $L/S$.
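The patching arithmetic can be illustrated as follows. This is a sketch, not the PatchTST code; it assumes the end of the series is padded by repeating the last value, one common convention:

```python
def make_patches(series, patch_len, stride):
    """Split one univariate series into (possibly overlapping) patches.
    The end is padded with the last value, giving N = (L - P)//S + 2."""
    padded = list(series) + [series[-1]] * stride
    n = (len(series) - patch_len) // stride + 2
    return [padded[i * stride:i * stride + patch_len] for i in range(n)]

# L = 336, P = 16, S = 8: the transformer sees 42 patch tokens
# instead of 336 raw time-step tokens.
patches = make_patches(list(range(336)), patch_len=16, stride=8)
```

Because attention cost grows quadratically in the number of tokens, this roughly `L/S` reduction is what makes long look-back windows affordable.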
Recently, state-space models (SSMs) have received considerable attention in the NLP and Computer Vision domains [
18,
19]. For time series forecasting, it has been reported that SSM representations cannot express autoregressive processes effectively. An important recent work using SSM is presented in [
20] (termed SpaceTimeSSM), which enhances the traditional SSM by employing a companion matrix that enables SpaceTime's SSM layers to learn desirable autoregressive processes. An autoregressive process of order $p$ represents the input series in terms of its $p$ past samples as follows:
$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \cdots + \phi_p u_{t-p} + \varepsilon_t$
Then the state-space formulation is given as follows:
$x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k + D u_k$
The SpaceTimeSSM composes the companion matrix $A$ as a $d \times d$ square matrix:
$A = \begin{bmatrix} 0 & 0 & \cdots & 0 & a_0 \\ 1 & 0 & \cdots & 0 & a_1 \\ 0 & 1 & \cdots & 0 & a_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & a_{d-1} \end{bmatrix}$
where $a = [a_0, a_1, \ldots, a_{d-1}]^\top$ is a learnable coefficient vector. We provide a comparison of different time series benchmarks on the SpaceTimeSSM approach in the results in Section 4.
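The companion matrix structure can be written out directly. A minimal pure-Python sketch, assuming the convention described in [20] of ones on the subdiagonal and the coefficient vector in the last column (the coefficients here are arbitrary; SpaceTime's layers learn them):

```python
def companion_matrix(a):
    """Build the d x d companion matrix: ones on the subdiagonal, the
    coefficient vector a in the last column, zeros elsewhere."""
    d = len(a)
    A = [[0.0] * d for _ in range(d)]
    for i in range(1, d):
        A[i][i - 1] = 1.0          # shift-by-one structure
    for i in range(d):
        A[i][d - 1] = a[i]         # autoregressive coefficients
    return A

A = companion_matrix([0.5, -0.2, 0.1])
```

Multiplying the state by this matrix shifts past values down by one position while the last column mixes them with the learned coefficients, which is exactly how the companion form realizes an autoregressive recurrence.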
4. Results
We tested our architectures and performed analyses on nine widely used datasets from real-world applications. These datasets consist of the Electricity Transformer Temperature (ETT) series, which include ETTh1 and ETTh2 (hourly intervals), and ETTm1 and ETTm2 (15-minute intervals), along with datasets pertaining to Traffic (hourly), Electricity (hourly), Weather (10-minute intervals), Influenza-like Illness (ILI) (weekly), and Exchange rate (daily). The characteristics of the different datasets used are summarized in
Table 1.
The architecture type of models that we compare to our approach are listed in
Table 2.
Table 3 shows the detailed results for our Enhanced Linear Model (ELM) on different datasets and compares it with other recent popular models.
As can be seen from
Table 3, our ELM model surpasses most established baseline methods in the majority of the test cases (indicated by bold values). The underlined values in
Table 3 indicate the second-best results for a given category. Our model is either the best or the second-best in most categories. Note that each model in
Table 3 follows a consistent experimental setup, with prediction lengths T of {96, 192, 336, 720} for all datasets except for the ILI dataset. For the ILI dataset, we use prediction lengths of {24, 36, 48, 60}. For our ELM model, the look-back window
L is 512 for all datasets except Exchange and Illness, which use
L = 96. For the other models that we compare to, we select their best prediction over look-back window sizes in {96, 192, 336, 720} [
14,
15]. Metrics used for evaluation are MSE (Mean Squared Error) and MAE (Mean Absolute Error).
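The two evaluation metrics can be stated concretely; a minimal sketch with toy values:

```python
def mse(actual, predicted):
    """Mean Squared Error: average of squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]
err_mse = mse(actual, predicted)   # (0.25 + 0.0 + 1.0) / 3
err_mae = mae(actual, predicted)   # (0.5 + 0.0 + 1.0) / 3
```

MSE penalizes large errors more heavily than MAE, which is why the two metrics can rank models differently on datasets with occasional large deviations.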
Table 4 provides the quantitative improvement over two recent best-performing time series prediction models of PatchTST [
14] and DLinear [
15]. The values presented are the average of the percent improvement over the four prediction lengths of 96, 192, 336, and 720. With respect to PatchTST, our model lags in performance on the traffic and illness datasets under the MSE metric but is competitive with or exceeds it on the MSE and MAE metrics for the other benchmarks. The percentage improvement with respect to DLinear is more significant than for PatchTST, and our ELM model exceeds DLinear in almost all dataset categories.
Figure 5 and
Figure 6 show the graphs of predicted vs. actual data for two of the datasets with different prediction lengths using a context length of 512 for our ELM model for the first channel (pressure for the weather dataset, and HUFL—high useful load for the ETTm1 dataset). As can be seen, if the data are more cyclical in nature (e.g., HUFL in ETTm1), our model is able to learn the patterns nicely, as shown in
Figure 6. For complex data such as the pressure feature in weather, the prediction is less accurate, as indicated in
Figure 5.
Table 5 presents our results on the Swin transformer-based implementation for time series. As explained earlier, we divide the input multivariate time series data into 16 × 16, i.e., 256 patches, before feeding it to a Swin model with three transformer layers. The embeddings used in the three layers are [128, 128, 256]. As can be seen, the Swin transformer-based approach has the inherent capability to combine information between different channels as well as between different time steps, but it does not perform as well as our linear model (ELM); only on the traffic dataset does it produce the best result. This could be attributed to the fact that this dataset has the largest number of features, which Swin can effectively use for more cross-channel information. Comparing our Swin transformer-based model to the PatchTST model [
14] (also transformer-based), the PatchTST that uses channel independence performs better than our Swin-based model. Note that the PatchTST performs worse than our ELM model, which is based on a linear network.
We also compare our ELM model to the newly proposed state-space model-based time series prediction [
20]. State-space models such as Mamba [
18], VMamba [
19], Vision Mamba [
23], and Time Machine Mamba [
24] are drawing significant attention for modeling temporal data such as time series, and therefore we compare our ELM model with the recently published work of [
20] and [
24,
25], which are based on state-space models.
Table 6 shows the results of our ELM model with the work in [
20,
24]. In one case, the SpaceTime model is better, but in most cases our ELM model outperforms both the state-space models and the earlier DLinear model. The context length in
Table 6 is 720, and the prediction is also 720 time steps.
5. Discussion
One of the recent unanswered questions in time series forecasting has been which architecture is best suited for this task. Some earlier research papers have indicated better results with transformer-based models than previous approaches, e.g., Informer [
10], Autoformer [
11], Fedformer [
12], and Pyraformer [
16]. Of these models, FEDformer demonstrated much better results, as it uses Fourier-enhanced blocks and wavelet-enhanced blocks in the transformer structure that can learn important patterns in a series through frequency-domain mapping. A simpler transformer-based architecture yielding even better results was proposed in [
14]. This architecture, termed PatchTST, uses independent channels where an input channel is divided into patches. All channels share the same embedding and transformer weights. Since PatchTST is a simple transformer design with a simple independent channel architecture, we explored replacing this design with a Swin transformer with patching across channels. The Swin transformer has the capability to combine information across patches due to its hierarchical overlapping window design. Our detailed experimental results on the Swin architecture-based design did not produce better results as compared to the channel-independent design of PatchTST; however, compared with other transformer-based designs, it yielded improved results in many cases.
To answer the question of the best architecture for time series forecasting, we improve the recently proposed simple linear network-based model in [
15] by creating dual pipelines with batch and reversible instance normalizations. We maintain channel independence and our results on the benchmarks show the best results obtained so far as compared to existing approaches in the majority of the standard datasets used in time series forecasting.
6. Conclusions
We have performed a detailed investigation into the best architecture for time series forecasting. We implemented time series forecasting on the Swin transformer to see if aggregated channel information is useful. We also analyzed and improved an existing simpler model based on linear networks. Our study highlights the significant potential of simpler models, challenging the prevailing emphasis on complex transformer-based architectures. The ELM model developed in this work, with its straightforward design, has demonstrated superior performance across various datasets, underscoring the importance of re-evaluating the effectiveness of simpler models in time series analysis. Compared to the recent transformer-based PatchTST model, our ELM model achieves a percentage improvement of approximately 1–5% on most benchmarks. With respect to the recent linear network-based models, the improvement is more significant, ranging between 1 and 25% across datasets. It is only when the number of variates in the dataset is large that the Swin transformer-based design we adapted for time series prediction seems to be effective.
Future work involves the development of hybrid models that leverage both linear and transformer elements such that each contributes to the effective learning of the time series behavior. For example, the frequency-domain component used in FEDformer could aid a linear model when the past periodicity pattern is more complex. The recent developments in state-space models and their applications to time series forecasting, such as TimeMachine [
24,
25] (based on Mamba) also deserve further research in optimizing these models for better prediction.