Article

An Adaptive Learning Time Series Forecasting Model Based on Decoder Framework

School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(3), 490; https://doi.org/10.3390/math13030490
Submission received: 11 December 2024 / Revised: 24 January 2025 / Accepted: 29 January 2025 / Published: 31 January 2025

Abstract
Time series forecasting is a fundamental technique for analyzing dynamic changes in temporal datasets and predicting future trends in various domains. Nevertheless, effective modeling is challenged by complex factors such as accurately capturing the relationships among temporally distant data points and accommodating rapid shifts in data distributions over time. While Transformer-based models have recently demonstrated remarkable capabilities in handling long-range dependencies, directly applying them to the evolving distributions within temporal datasets remains challenging. To tackle these issues, this paper presents a sequence-to-sequence adaptive learning approach centered on a decoder framework for temporal modeling tasks. An end-to-end Transformer decoder architecture is introduced that adaptively discerns the interdependencies within temporal datasets. Experiments carried out on multiple datasets indicate that the decoder-based adaptive learning model achieves an overall reduction of 2.6% in MSE (Mean Squared Error) loss and 1.8% in MAE (Mean Absolute Error) loss when compared with the most advanced Transformer-based time series forecasting models.

1. Introduction

Time series forecasting plays a crucial role in diverse fields such as meteorological monitoring [1], stock price forecasting [2], traffic flow monitoring [3], and energy demand forecasting [4]. In recent years, with advances in algorithms, computational tools, and big data, deep learning models have made significant progress in numerous domains. Among these models, the Transformer [5] has attained remarkable success in natural language processing, computer vision, image and video analysis, speech recognition, and other areas, mainly owing to its attention mechanism. Consequently, the Transformer has emerged as an ideal option for sequence modeling tasks.
One significant sequence modeling task is long sequence time series forecasting (LSTF). LSTF faces several crucial challenges: the difficulty of effectively capturing long-range dependencies in the data, the limited quantity of training data, and the rapid evolution of data distributions over time. In contrast to other methods, Transformer-based models can leverage their powerful attention mechanisms to capture precise long-range dependencies between inputs and outputs, rendering them highly suitable for LSTF tasks. Nevertheless, LSTF tasks are frequently restricted by the quantity of available data, which makes it difficult to construct deep Transformer models. Furthermore, owing to the non-stationarity of real-world environments, data distributions frequently shift over time, a phenomenon referred to as concept drift [6]. In LSTF tasks, concept drift can result in disparities between the training and testing data distributions, and merely removing the non-stationary components of time series data may lead to information loss and lower forecasting accuracy. The current mainstream approaches to concept drift therefore focus on enabling the model to learn the variations in the data distribution and adjust its forecasts accordingly.
To address the aforementioned issues, this paper puts forward an adaptive learning model based on a decoder framework (ALD) for time series forecasting, featuring two main innovations:
  • The ALD model employs a decoder-only architecture. In LSTF tasks, the masking operation in the decoder-only architecture converts the attention matrix from a non-full-rank matrix to a full-rank one, which theoretically offers greater expressive ability [7]. In contrast to other architectures, the decoder-only structure proves to be more efficacious in extracting, learning, and representing useful information from a given quantity of data, addressing the challenge of deep modeling in LSTF tasks while enhancing data utilization efficiency. Owing to the decoder-only design, the ALD model demonstrates considerable advantages in forecasting accuracy.
  • An adaptive statistical forecasting layer is designed. This layer models the evolving trends in the initial time series data dynamically. Specifically, the model acquires and predicts the key statistical features of the time series independently through this layer. When concept drift takes place, the model is capable of adjusting its outputs adaptively in accordance with the learned trends. This mechanism efficiently captures alterations in the data distribution and promptly adjusts forecasting when abrupt changes arise, significantly enhancing the model’s adaptability and robustness in complex dynamic environments. This design guarantees that the model retains high predictive performance even in the event of concept drift.

2. Related Work

In deep learning models, the Transformer architecture has gained widespread adoption across various tasks owing to its superior performance. The cornerstone of the Transformer is its encoder-decoder framework, in which the encoder maps input data into low-dimensional latent representations to capture essential features, while the decoder reconstructs these latent vectors into the original data space. Models such as Informer [8] and Pyraformer [9] utilize both the encoder and the decoder of the Transformer, enhancing the self-attention mechanism to mitigate computational complexity and expedite inference. The BERT [10] model employs only the Transformer encoder and is trained on copious amounts of unlabeled text data in an unsupervised fashion. Presently, decoder-only architectures are utilized in large-scale natural language processing models such as GPT [11], PaLM [12], and LLaMA [13].
In the Transformer model, position embedding is employed to address the absence of positional information in sequence data. To represent the position of elements within the input sequence, two main techniques are used: absolute position embedding and relative position embedding. Absolute position embedding is simple in design and computationally efficient, while relative position embedding better reflects relative positional information and tends to yield superior performance in experiments. The vanilla Transformer [5] employs absolute position embedding based on periodic functions, so the resulting embedding is tied to the absolute positions within the input sequence. To reflect relative positional information within the sequence, Transformer-XL [14] adopts relative position embedding by adding positional information to the embedding, likewise using periodic functions. Roformer [15] encodes absolute positions through a rotation matrix, naturally integrating relative positional dependencies into the self-attention mechanism; this method is referred to as RoPE (Rotary Position Embedding). Unlike traditional additive position embedding methods, RoPE incorporates positional information via rotation operations, enabling the model to capture relative dependencies between different positions more effectively.
In the field of time series prediction, the distribution of the data can evolve over time, a phenomenon known as “concept drift”. This shift in data distribution can significantly impact prediction accuracy. To address this challenge, the RevIN [16] model normalizes the input data to a fixed distribution according to their mean and variance and restores the output to the original distribution through inverse normalization. The DDG-DA [17] model tackles concept drift in time series forecasting by optimizing the training process via inner and outer layers, with the aim of predicting future data distributions and enhancing forecasting accuracy. The DRAIN [18] framework models the changes in model parameters as a dynamic edge-weighted graph problem, where each neuron in the neural network is represented as a node in the graph and the changes in network parameters over time are reflected as dynamic alterations in edge weights; this enables the model to learn how network parameters evolve under concept drift. The SAN [19] model improves the accuracy of time series forecasting by means of more flexible normalization and inverse normalization: firstly, SAN eliminates the non-stationarity of time series by operating on local time segments (that is, subsequences); secondly, SAN models the evolving trends in the statistical characteristics of the original time series separately.

3. Model Construction

The expressive power of neural networks increases with depth; however, time series forecasting models are frequently constrained by the size of the training set, making it difficult to employ deep models, so augmenting the model's expressive capacity is an effective alternative. The ALD model proposed in this paper builds upon the core concepts of decoder-only models and the SAN model, effectively capitalizing on the advantages of both to offer a new approach to time series forecasting. The fundamental observation behind the ALD model is that the Softmax operation in the attention mechanism projects the scores of the input sequence onto the range [0, 1], with the values in each row summing to 1. The attention matrix of a decoder-only model is a lower triangular matrix, and after the Softmax operation the diagonal of this triangular matrix contains only positive values, so its determinant (the product of the diagonal entries) is positive. Consequently, the attention matrix of the decoder-only model is a full-rank matrix, endowing the ALD model with strong expressive capability.
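The full-rank claim above can be checked numerically. The following NumPy sketch (the sequence length and random scores are illustrative choices, not taken from the paper) applies a causal mask and a row-wise Softmax and confirms that the resulting attention matrix is lower triangular with a positive diagonal, and therefore has full rank.

```python
# Minimal numerical check of the full-rank argument: with a causal (decoder-only)
# mask, the row-wise softmax yields a lower-triangular matrix whose diagonal entries
# are strictly positive, so its determinant is nonzero and the matrix has full rank.
import numpy as np

rng = np.random.default_rng(0)
N = 6                                               # illustrative sequence length
scores = rng.normal(size=(N, N))                    # raw attention scores
mask = np.triu(np.ones((N, N), dtype=bool), k=1)    # positions j > i are masked
scores[mask] = -np.inf                              # causal masking

attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)            # row-wise softmax

print(np.allclose(attn, np.tril(attn)))             # True: lower triangular
print(np.all(np.diag(attn) > 0))                    # True: positive diagonal
print(np.linalg.matrix_rank(attn) == N)             # True: full rank
```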
Furthermore, time series models are influenced by concept drift. The ALD model makes improvements to the idea from the SAN model, where a separate network is employed to learn changes in the data distribution. Through the incorporation of low-rank decomposition, the ALD model decreases the space needed for storing and transmitting data. Low-rank decomposition decomposes a large matrix into two smaller matrices of lower ranks, enabling the original data to be approximately represented with less storage space and enhancing computational efficiency. The fundamental structure of the ALD model mainly comprises a decoder layer and a statistical forecasting layer, as depicted in Figure 1.
During the forward propagation process, suppose the input time series to the model is $X = \{x_i\}_{i=1}^{N}$. In order to capture distributional changes caused by concept drift in a timely fashion, the input data need to be further segmented into finer-grained data. Firstly, given a time span $T$, $x_i$ is partitioned into $M$ smaller time slices $x_{1:N}^{(i)} = \{x_1^{(i)}, \ldots, x_N^{(i)}\}$, $i = 1, \ldots, M$. Each slice $s \in \{x_j^{(i)}\}_{j=1}^{M}$ undergoes normalization to eliminate non-stationary factors prior to being independently fed into the backbone network. Finally, the backbone network yields the forecast $\hat{x}^{(i)} = (\hat{x}_{N+1}^{(i)}, \ldots, \hat{x}_{N+T}^{(i)})$.
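As a concrete illustration of this slicing-and-normalization step, the following PyTorch sketch splits a multivariate window into non-overlapping slices and normalizes each slice independently; the tensor layout, slice length, and helper name are assumptions made for illustration, not the authors' implementation.

```python
# A minimal sketch of the slicing and per-slice normalization described above.
# The (batch, length, channels) layout, the slice length, and the function name
# are illustrative assumptions.
import torch

def slice_and_normalize(x: torch.Tensor, slice_len: int, eps: float = 1e-5):
    """Split a (batch, length, channels) series into non-overlapping slices and
    normalize each slice independently to remove local non-stationarity."""
    B, L, C = x.shape
    M = L // slice_len                                  # number of slices
    slices = x[:, : M * slice_len].reshape(B, M, slice_len, C)
    mu = slices.mean(dim=2, keepdim=True)               # per-slice mean
    sigma = slices.std(dim=2, keepdim=True) + eps       # per-slice standard deviation
    normalized = (slices - mu) / sigma                  # stationarized slices
    # The statistics are returned so the statistical forecasting layer can use them
    # and so the final outputs can be de-normalized.
    return normalized, mu.squeeze(2), sigma.squeeze(2)

x = torch.randn(4, 96, 7)                               # e.g. a 7-variable input window
norm, mu, sigma = slice_and_normalize(x, slice_len=12)
print(norm.shape, mu.shape, sigma.shape)                # (4, 8, 12, 7) (4, 8, 7) (4, 8, 7)
```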
Rotary Position Embedding (RoPE) integrates the merits of both absolute and relative position embedding. Assume that the slice $s$, when input into the backbone network, is transformed into the query vector $q_m$ and the key vector $k_n$ by adding position embedding as follows: $q_m = f(s_m, m)$ and $k_n = f(s_n, n)$, where $q_m$ and $k_n$ represent the vectors at positions $m$ and $n$, respectively. The essence of the attention mechanism lies in computing the similarity between vectors through the inner product, which determines the weight matrix, and the final features are extracted based on these weights. In order to incorporate relative positional information into the inner product, there must be an operation $G$ that satisfies $\langle q_m, k_n \rangle = G(s_m, s_n, m-n)$. The Roformer [15] model provides functions $f$ and $G$ that fulfill this relationship, taking the following forms:
$$f(s_m, m) = (W_q s_m)\, e^{i m \theta},$$
$$f(s_n, n) = (W_k s_n)\, e^{i n \theta},$$
$$G(s_m, s_n, m-n) = \mathrm{Re}\left[(W_q s_m)(W_k s_n)^{*} e^{i(m-n)\theta}\right],$$
where $\mathrm{Re}[\cdot]$ denotes the real part of a complex number and $(\cdot)^{*}$ denotes the complex conjugate. In the two-dimensional case, this can further be expressed as follows:
$$f(s_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W_q^{(1,1)} & W_q^{(1,2)} \\ W_q^{(2,1)} & W_q^{(2,2)} \end{pmatrix} \begin{pmatrix} s_m^{(1)} \\ s_m^{(2)} \end{pmatrix} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix},$$
$$f(s_n, n) = \begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix} \begin{pmatrix} W_k^{(1,1)} & W_k^{(1,2)} \\ W_k^{(2,1)} & W_k^{(2,2)} \end{pmatrix} \begin{pmatrix} s_n^{(1)} \\ s_n^{(2)} \end{pmatrix} = \begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix} \begin{pmatrix} k_n^{(1)} \\ k_n^{(2)} \end{pmatrix}.$$
Finally, $G(s_m, s_n, m-n)$ can be expressed as follows:
$$G(s_m, s_n, m-n) = \begin{pmatrix} q_m^{(1)} & q_m^{(2)} \end{pmatrix} \begin{pmatrix} \cos (m-n)\theta & \sin (m-n)\theta \\ -\sin (m-n)\theta & \cos (m-n)\theta \end{pmatrix} \begin{pmatrix} k_n^{(1)} \\ k_n^{(2)} \end{pmatrix}.$$
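A quick numerical check of these identities is given below: rotating a query by $m\theta$ and a key by $n\theta$ makes their inner product depend only on the offset $m-n$. The vectors, the angle $\theta$, and the position pairs are arbitrary illustrative values.

```python
# Numerical check of the RoPE property: rotating q by m*theta and k by n*theta
# makes their inner product depend only on (m - n).
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2-D rotation matrix."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.3
q, k = np.array([1.0, 2.0]), np.array([-0.5, 0.7])

for m, n in [(5, 2), (9, 6), (12, 9)]:      # every pair shares the offset m - n = 3
    fq = rot(m * theta) @ q                 # f(s_m, m)
    fk = rot(n * theta) @ k                 # f(s_n, n)
    print(round(float(fq @ fk), 6))         # the same value is printed for each pair
```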
The training task of the adaptive learning model based on the decoder for time series is a non-stationary data forecasting task, which can be decomposed into a statistical data forecasting task and a stationary data forecasting task. The training process is bifurcated into two stages. Firstly, the original data are fed into the statistical forecasting layer for training, and the parameters of the statistical forecasting layer are frozen upon the completion of the training. Subsequently, the normalized data are fed into the decoder layer. Employing the potent fitting capability of the decoder, the stationary data are utilized for training, and ultimately, the output of the decoder is restored through inverse normalization to retrieve the missing statistical information.
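The following PyTorch skeleton sketches this two-stage procedure under stated assumptions: `stat_layer`, `decoder`, and the `loader` interface (`stats_batches`, `series_batches`) are hypothetical placeholders, and the optimizer settings are illustrative. It is meant only to show the ordering described above (train the statistical layer, freeze it, then train the decoder on normalized slices and de-normalize its outputs), not the authors' exact code.

```python
# Schematic two-stage training sketch. `stat_layer`, `decoder`, and the `loader`
# interface are hypothetical placeholders; only the staging follows the text.
import torch

def train_two_stage(stat_layer, decoder, loader, epochs_stat=10, epochs_dec=50):
    mse = torch.nn.MSELoss()

    # Stage 1: fit the statistical forecasting layer on the slice statistics.
    opt = torch.optim.Adam(stat_layer.parameters(), lr=1e-4)
    for _ in range(epochs_stat):
        for slice_stats, future_stats in loader.stats_batches():
            loss = mse(stat_layer(slice_stats), future_stats)
            opt.zero_grad()
            loss.backward()
            opt.step()
    for p in stat_layer.parameters():       # freeze the layer after stage 1
        p.requires_grad_(False)

    # Stage 2: train the decoder on normalized (stationary) slices; the outputs are
    # de-normalized with the predicted statistics before the loss is computed.
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
    for _ in range(epochs_dec):
        for norm_slices, target, mu_hat, sigma_hat in loader.series_batches():
            pred = decoder(norm_slices) * sigma_hat + mu_hat   # inverse normalization
            loss = mse(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```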
In existing studies, it is widely acknowledged that the overall mean of the input sequence is the maximum likelihood estimate of the mean of the target sequence [19]. Based on this, the statistical forecasting layer employs residual learning so that the module learns the difference between the future slice means and the input sequence mean $\rho^{(i)}$, rather than predicting specific future values directly. Furthermore, two learnable weights, $W_1$ and $W_2$, are utilized to predict the statistics in the form of a weighted sum. The statistical forecasting process is as follows:
$$\hat{\mu}^{(i)} = W_1 \times \mathrm{MLP}\left(\mu^{(i)} - \rho^{(i)},\ \bar{s}^{(i)} - \rho^{(i)}\right) + W_2 \times \rho^{(i)}, \qquad \hat{\sigma}^{(i)} = \mathrm{MLP}\left(\sigma^{(i)},\ \bar{s}^{(i)}\right),$$
where $\mu^{(i)} = [\mu_1^{(i)}, \mu_2^{(i)}, \ldots, \mu_M^{(i)}]$ denotes the means of all slices of a time series fed into the model and $\hat{\mu}^{(i)}$ denotes the predicted future means of these slices; $s^{(i)} = [s_1^{(i)}, s_2^{(i)}, \ldots, s_M^{(i)}]$ denotes all time slices fed into the model and $\bar{s}^{(i)}$ denotes the predicted values of these slices; and $\sigma^{(i)}$ denotes the standard deviations of the input slices and $\hat{\sigma}^{(i)}$ their predicted future standard deviations. Finally, the MSE between the predicted statistics and the true values is employed as the loss function to train the network through backpropagation.
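A minimal sketch of such a residual statistics-forecasting layer is shown below, assuming per-slice statistics of shape (batch, M); the hidden size, the MLP depth, and the treatment of $\bar{s}^{(i)}$ as a per-slice feature are assumptions made for illustration rather than the authors' implementation.

```python
# A minimal sketch of a residual statistics-forecasting layer in the spirit of the
# description above. Hidden size, MLP depth, and treating s_bar as a per-slice
# feature of shape (batch, M) are illustrative assumptions.
import torch
import torch.nn as nn

class StatForecastLayer(nn.Module):
    def __init__(self, num_in_slices: int, num_out_slices: int, hidden: int = 64):
        super().__init__()
        self.mean_mlp = nn.Sequential(
            nn.Linear(2 * num_in_slices, hidden), nn.ReLU(),
            nn.Linear(hidden, num_out_slices))
        self.std_mlp = nn.Sequential(
            nn.Linear(2 * num_in_slices, hidden), nn.ReLU(),
            nn.Linear(hidden, num_out_slices))
        # learnable weights W1, W2 that combine the residual term and the overall mean
        self.w1 = nn.Parameter(torch.ones(num_out_slices))
        self.w2 = nn.Parameter(torch.ones(num_out_slices))

    def forward(self, mu, sigma, s_bar, rho):
        # mu, sigma, s_bar: (batch, M) per-slice statistics; rho: (batch, 1) overall mean
        residual = self.mean_mlp(torch.cat([mu - rho, s_bar - rho], dim=-1))
        mu_hat = self.w1 * residual + self.w2 * rho          # weighted-sum prediction
        sigma_hat = self.std_mlp(torch.cat([sigma, s_bar], dim=-1))
        return mu_hat, sigma_hat

layer = StatForecastLayer(num_in_slices=8, num_out_slices=8)
mu_hat, sigma_hat = layer(torch.randn(4, 8), torch.rand(4, 8),
                          torch.randn(4, 8), torch.randn(4, 1))
print(mu_hat.shape, sigma_hat.shape)                          # (4, 8) (4, 8)
```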
This paper employs a standard Transformer decoder to generate output sequences based on time series data represented with rotary position embedding. The decoder comprises eight identical multi-head attention layers that are stacked together. For each input slice $s \in \{x_j^{(i)}\}_{j=1}^{M}$, a trainable $D$-dimensional linear layer $W_p \in \mathbb{R}^{D \times p}$ projects the slice into a higher-dimensional space to extract features and enhance the expressive ability of the model. Subsequently, a learnable rotary position embedding layer $W_{pos} \in \mathbb{R}^{D \times N}$ adds relative positional information to the high-dimensionally mapped slices. Finally, the slice $s^{(i)}$ is mapped as $s_d^{(i)} = W_p s_p^{(i)} + W_{pos}$, where $s_d^{(i)} \in \mathbb{R}^{D \times N}$, and is fed into the decoder. Each head $h = 1, \ldots, H$ in the multi-head attention layer converts $s_d^{(i)}$ into a query matrix $Q_h^{(i)} = (s_d^{(i)})^T W_h^Q$, a key matrix $K_h^{(i)} = (s_d^{(i)})^T W_h^K$, and a value matrix $V_h^{(i)} = (s_d^{(i)})^T W_h^V$, where $W_h^Q, W_h^K \in \mathbb{R}^{D \times d_k}$ and $W_h^V \in \mathbb{R}^{D \times D}$. Following the attention computation, the final output $O_h^{(i)} \in \mathbb{R}^{D \times N}$ is derived as follows:
$$O_h^{(i)T} = \mathrm{Attention}\left(Q_h^{(i)}, K_h^{(i)}, V_h^{(i)}\right) = \mathrm{softmax}\left(\frac{Q_h^{(i)} (K_h^{(i)})^{T}}{\sqrt{d_k}}\right) V_h^{(i)}.$$
The decoder layer ultimately generates the forecast $\hat{x}^{(i)} = (\hat{x}_{N+1}^{(i)}, \ldots, \hat{x}_{N+T}^{(i)})$.
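The sketch below illustrates the slice projection, the additive position term, and a single causally masked attention head in a batch-first layout (the transpose-equivalent of the matrix form above); the dimensions and the use of a single head and layer are illustrative simplifications rather than the full ALD configuration.

```python
# A sketch of slice projection, position term, and one causally masked attention
# head in a batch-first layout. All dimensions are illustrative choices.
import torch
import torch.nn.functional as F

B, N, P, D, d_k = 4, 8, 12, 512, 64            # batch, slices, slice length, model dim, head dim

s = torch.randn(B, N, P)                       # normalized slices of one variable
W_p = torch.nn.Linear(P, D)                    # projection of each slice into D dimensions
W_pos = torch.nn.Parameter(torch.zeros(N, D))  # learnable position embedding term
s_d = W_p(s) + W_pos                           # (B, N, D) slice representations

W_q, W_k = torch.nn.Linear(D, d_k), torch.nn.Linear(D, d_k)
W_v = torch.nn.Linear(D, D)
Q, K, V = W_q(s_d), W_k(s_d), W_v(s_d)         # query, key, value matrices of one head

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5             # (B, N, N) attention scores
causal = torch.triu(torch.ones(N, N, dtype=torch.bool), 1)
scores = scores.masked_fill(causal, float("-inf"))        # decoder-only masking
O = F.softmax(scores, dim=-1) @ V                         # (B, N, D) head output
print(O.shape)
```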
Low-rank decomposition is capable of reducing model complexity and lowering computational requirements, thereby enhancing the efficiency of neural networks in deep learning. Following the attention computation, the output $O_h^{(i)} \in \mathbb{R}^{D \times N}$ passes through a feed-forward neural network layer to yield the final output. The feed-forward network comprises two linear layers, which can be represented by two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$. The matrix multiplication $C = A \times B$ has a computational complexity of $O(mnp)$, which can be time-consuming for large matrices since it entails $n$ multiplications and additions for each element of $C$. To mitigate this complexity, matrix $A$ can be approximated by a low-rank matrix of rank $r$, namely $A \approx U \Sigma V^T$, where $U \in \mathbb{R}^{m \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$, and $V^T \in \mathbb{R}^{r \times n}$. This decomposition condenses the original representation of matrix $A$ into the product of three smaller matrices, where $r$ is significantly smaller than $m$ and $n$. The matrix multiplication then becomes $C = (U \Sigma V^T) \times B$, and the computational complexity is as follows:
  • Compute $V^T \times B$, with $V^T \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{n \times p}$, at a cost of $O(rnp)$.
  • Compute $\Sigma \times (V^T B)$, with $\Sigma \in \mathbb{R}^{r \times r}$, at a cost of $O(r^2 p)$.
  • Compute $U \times (\Sigma V^T B)$, with $U \in \mathbb{R}^{m \times r}$, at a cost of $O(mrp)$.
The total complexity thereby becomes $O((n + r + m)rp)$, and given that $r \ll n$ and $r \ll m$, the overall complexity is markedly reduced compared with the original $O(mnp)$. Both the statistical forecasting layer and the feed-forward neural network layer within the decoder of the ALD model employ low-rank decomposition, thereby effectively enhancing computational efficiency.
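The sketch below illustrates the cost comparison numerically: it multiplies a nearly rank-$r$ matrix by $B$ both directly and through its rank-$r$ SVD factors, and prints the corresponding multiply-add counts ($mnp$ versus $(n+r+m)rp$). The matrix sizes, the chosen rank, and the synthetic low-rank construction are illustrative assumptions.

```python
# Numerical illustration of dense vs low-rank matrix multiplication costs.
import numpy as np

m, n, p, r = 512, 512, 256, 16
rng = np.random.default_rng(0)
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))    # an approximately rank-r matrix
A += 0.01 * rng.normal(size=(m, n))                      # plus a little noise
B = rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]     # keep the top-r factors

C_dense = A @ B                                          # ~ m*n*p multiply-adds
C_lowrank = U_r @ (S_r @ (Vt_r @ B))                     # ~ (n + r + m)*r*p multiply-adds

print(m * n * p, (n + r + m) * r * p)                    # 67108864 vs 4259840
rel_err = np.linalg.norm(C_dense - C_lowrank) / np.linalg.norm(C_dense)
print(round(float(rel_err), 4))                          # small, since A is close to rank r
```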

4. Experimental Setup and Analysis of Results

4.1. Datasets

To verify the performance of the adaptive learning model based on the decoder (ALD model), its prediction accuracy was tested on five datasets drawn from three categories: electricity transformer temperature data (ETT) [8] (ETTh1, ETTh2, ETTm1), exchange rate data (Exchange), and customer electricity consumption data (ECL). The ETT datasets contain transformer load and oil temperature data collected from July 2016 to July 2018; ETTm1 is sampled at 15 min intervals, while ETTh1 and ETTh2 are sampled at 1 h intervals. The exchange rate dataset comprises daily exchange rates for eight countries (Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) spanning 1990 to 2016. The electricity dataset contains the hourly electricity consumption of 321 customers from 2012 to 2014. Table 1 summarizes the statistical information of these five commonly used datasets, all of which are publicly available and widely used as benchmarks for time series prediction models.

4.2. Benchmark Models and Experimental Setup

The benchmark models chosen for this study encompass iTransformer (ICLR 2024) [20], PatchTST (ICLR 2023) [21], Crossformer (ICLR 2023) [22], TiDE (Google 2023) [23], TimesNet (ICLR 2023) [24], DLinear (AAAI 2023) [25], SCINet (NeurIPS 2022) [26], and FEDformer (ICML 2022) [27]. To compare the performance of ALD over different forecasting horizons, the input length was set to 96, and the model was evaluated with forecasting steps of {96, 192, 336, 720}. Two evaluation metrics were employed for each forecasting window: the Mean Squared Error $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and the Mean Absolute Error $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$.
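For completeness, the two metrics are written out in code below; the arrays are placeholders standing in for a target window and its forecast.

```python
# MSE and MAE as used for evaluation; y and y_hat are illustrative placeholders.
import numpy as np

def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean((y - y_hat) ** 2))

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean(np.abs(y - y_hat)))

y, y_hat = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.3])
print(mse(y, y_hat), mae(y, y_hat))          # ~0.0467 and ~0.2
```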

4.3. Model Parameters

By default, ALD comprises three decoder layers, each featuring eight attention heads and a hidden layer dimension of 512. The feed-forward network in the decoder layer comprises four linear layers, with ReLU (Rectified Linear Unit) [28] serving as the activation function; the first three linear layers jointly implement the low-rank decomposition. In all experiments, dropout with a probability of 0.1 was applied in the decoder layers.

4.4. Results and Analysis

Table 2 summarizes the experimental results of nine models across five datasets. The lower the MSE and MAE, the more accurate the forecasting. Compared to other baseline models, the proposed ALD model demonstrates superior performance in terms of prediction accuracy. Specifically, compared to the best results of the other baseline models, ALD achieved an overall 2.6% reduction in the MSE and a 1.8% reduction in the MAE. These findings validate the effectiveness of the ALD model in learning deep nonlinear correlations in time series data.

4.5. Ablation Study

To validate the rationality and efficacy of the statistical forecasting layer in the ALD model, a component removal experiment was carried out on the financial (exchange rate) dataset. As shown in Table 3, the statistical forecasting layer is of crucial importance in alleviating the impact of concept drift on the model.

4.6. Hyperparameter Sensitivity

We assessed the hyperparameter sensitivity of the ALD model with respect to the following factors: the learning rate, the number of decoder layers, and the hidden layer dimension. The results, obtained with an input length of 96 and a forecasting window of 96, are presented in Figure 2 and Figure 3. On both the ETT and exchange rate datasets, the model achieved its best performance with a learning rate of 0.0001, three decoder layers, and a hidden layer dimension of 512. In the ALD model, increasing the number of decoder layers and the hidden dimension did not necessarily enhance performance.

5. Conclusions and Future Work

To overcome the constraints of existing Transformer-based models in time series modeling, this paper presents an adaptive learning model based on a decoder framework that can be applied to real-world time series forecasting tasks. By employing a decoder-only structure with a full-rank attention matrix, the ALD model augments the expressiveness of the model, while the statistical forecasting module addresses the concept drift inherent in time series data; together, these allow the model to retain the strong fitting capability of Transformer models while adapting to shifting data distributions. The experimental results indicate that the ALD model surpasses existing benchmark models and is well suited for time series forecasting.
In the future, we intend to integrate the statistical forecasting layer with the Transformer decoder layer and explore parameter sharing between them to further streamline the model structure without sacrificing forecasting accuracy.

Author Contributions

Conceptualization, writing—review and editing, and funding acquisition, J.H.; methodology and writing—original draft preparation, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Graduate Student Research Innovation Program of Shanxi Province (Grant No. 2023KY523).

Data Availability Statement

The original data presented in the study are openly available at https://github.com/juyongjiang/TimeSeriesDatasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, J.; Cao, Z.; Li, H.; Qian, S.; Wang, X.; Yan, L.; Xue, W. Application of artificial intelligence technology to numerical weather forecasting. J. Appl. Meteorol. Sci. 2021, 32, 1–11.
  2. Ding, F.; Jiang, M.Y. Housing price forecasting based on improved lion swarm algorithm and BP neural network model. J. Shandong Univ. (Eng. Sci.) 2021, 51, 8–16.
  3. Zhao, W.Z.; Yuan, G.; Zhang, Y.M.; Qiao, S.; Wang, S.; Zhang, L. Multi-view Fused Spatial-temporal Dynamic GCN for Urban Traffic Flow Forecasting. J. Softw. 2024, 35, 1751–1773.
  4. Wang, C.; Wang, Y.; Zheng, T.; Dai, Z.M.; Zhang, K.F. Multi-Energy Load Forecasting in Integrated Energy System Based on ResNet-LSTM Network and Attention Mechanism. Trans. China Electrotech. Soc. 2022, 37, 1789–1799.
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  6. Tang, S.Q.; Feng, C.; Gao, K. Recurrent concept drift data stream classification based on online transfer learning. J. Comput. Res. Dev. 2016, 53, 1781–1791.
  7. Dong, Y.H.; Cordonnier, J.B.; Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 21–27 July 2021; pp. 2793–2803.
  8. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 11106–11115.
  9. Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A.X.; Dustdar, S. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In Proceedings of the International Conference on Learning Representations, Virtual Conference, 3–7 May 2021.
  10. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
  11. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. Preprint, 2018. Available online: https://scholar.google.com/citations?view_op=view_citation&hl=zh-CN&user=dOad5HoAAAAJ&citation_for_view=dOad5HoAAAAJ:W7OEmFMy1HYC (accessed on 10 December 2024).
  12. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113.
  13. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
  14. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
  15. Su, J.; Lu, Y.; Pan, S.; Murtadha, A.; Wen, B.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063.
  16. Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.H.; Choo, J. Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
  17. Li, W.; Yang, X.; Liu, W.; Xia, Y.; Bian, J. DDG-DA: Data distribution generation for predictable concept drift adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Arlington, VA, USA, 17–19 November 2022; Volume 36, pp. 4092–4100.
  18. Bai, G.J.; Ling, C.; Zhao, L. Temporal Domain Generalization with Drift-Aware Dynamic Neural Network. arXiv 2022, arXiv:2205.10664.
  19. Liu, Z.; Cheng, M.; Li, Z.; Huang, Z.; Liu, Q.; Xie, Y.; Chen, E. Adaptive normalization for non-stationary time series forecasting: A temporal slice perspective. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
  20. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted transformers are effective for time series forecasting. arXiv 2023, arXiv:2310.06625.
  21. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  22. Zhang, Y.; Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  23. Das, A.; Kong, W.; Leach, A.; Mathur, S.; Sen, R.; Yu, R. Long-term forecasting with TiDE: Time-series dense encoder. arXiv 2023, arXiv:2304.08424.
  24. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-variation modeling for general time series analysis. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  25. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 11121–11128.
  26. Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SCINet: Time series modeling and forecasting with sample convolution and interaction. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022.
  27. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286.
  28. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
Figure 1. The framework of the adaptive learning time series forecasting model based on decoder framework.
Figure 2. Hyperparameter sensitivity on the ETTh1 dataset. (a) learning rate; (b) decoder layers; (c) hidden dimension.
Figure 3. Hyperparameter sensitivity on the exchange rate dataset. (a) learning rate; (b) decoder layers; (c) hidden dimension.
Table 1. Statistical information of five common datasets.
Dataset | Variables | Forecasting Length | Dataset Size | Frequency | Information
ETTh1, ETTh2 | 7 | {96, 192, 336, 720} | (8545, 2881, 2881) | Hourly | Electricity
ETTm1 | 7 | {96, 192, 336, 720} | (34,465, 11,521, 11,521) | 15 min | Electricity
Exchange | 8 | {96, 192, 336, 720} | (5120, 665, 1422) | Daily | Economy
ECL | 321 | {96, 192, 336, 720} | (18,317, 2633, 5261) | Hourly | Electricity
Table 2. Comparison of experimental results for nine forecasting models on five datasets.
Dataset | Horizon | ALD | iTransformer | PatchTST | Crossformer | TiDE | TimesNet | DLinear | SCINet | FEDformer
(each model column reports MSE/MAE)
ETTh1 | 96 | 0.387/0.397 | 0.386/0.405 | 0.414/0.419 | 0.423/0.448 | 0.479/0.464 | 0.384/0.402 | 0.386/0.400 | 0.654/0.599 | 0.376/0.419
ETTh1 | 192 | 0.452/0.437 | 0.441/0.436 | 0.460/0.445 | 0.471/0.474 | 0.525/0.492 | 0.436/0.429 | 0.437/0.432 | 0.719/0.631 | 0.420/0.448
ETTh1 | 336 | 0.509/0.469 | 0.487/0.458 | 0.501/0.446 | 0.570/0.546 | 0.565/0.515 | 0.491/0.469 | 0.481/0.459 | 0.778/0.659 | 0.459/0.465
ETTh1 | 720 | 0.565/0.522 | 0.503/0.491 | 0.500/0.488 | 0.653/0.621 | 0.594/0.558 | 0.521/0.500 | 0.519/0.516 | 0.836/0.699 | 0.506/0.507
ETTh2 | 96 | 0.304/0.358 | 0.297/0.349 | 0.302/0.348 | 0.745/0.584 | 0.400/0.440 | 0.340/0.374 | 0.333/0.387 | 0.707/0.621 | 0.358/0.397
ETTh2 | 192 | 0.377/0.403 | 0.380/0.400 | 0.388/0.400 | 0.877/0.656 | 0.528/0.509 | 0.402/0.414 | 0.477/0.476 | 0.860/0.689 | 0.429/0.439
ETTh2 | 336 | 0.418/0.437 | 0.428/0.432 | 0.426/0.433 | 1.043/0.731 | 0.643/0.571 | 0.452/0.452 | 0.594/0.541 | 1.000/0.744 | 0.496/0.487
ETTh2 | 720 | 0.434/0.456 | 0.427/0.445 | 0.431/0.446 | 1.104/0.763 | 0.874/0.679 | 0.462/0.468 | 0.831/0.657 | 1.249/0.838 | 0.463/0.474
ETTm1 | 96 | 0.184/0.277 | 0.334/0.368 | 0.329/0.367 | 0.404/0.426 | 0.364/0.387 | 0.338/0.375 | 0.345/0.372 | 0.418/0.438 | 0.379/0.419
ETTm1 | 192 | 0.250/0.322 | 0.377/0.391 | 0.367/0.385 | 0.450/0.451 | 0.398/0.404 | 0.374/0.387 | 0.380/0.389 | 0.439/0.450 | 0.426/0.411
ETTm1 | 336 | 0.314/0.364 | 0.426/0.420 | 0.399/0.410 | 0.532/0.515 | 0.428/0.425 | 0.410/0.411 | 0.413/0.413 | 0.490/0.485 | 0.445/0.459
ETTm1 | 720 | 0.415/0.425 | 0.491/0.459 | 0.454/0.439 | 0.666/0.589 | 0.487/0.461 | 0.478/0.450 | 0.474/0.453 | 0.595/0.550 | 0.543/0.490
Exchange | 96 | 0.077/0.198 | 0.086/0.206 | 0.088/0.205 | 0.256/0.367 | 0.094/0.218 | 0.107/0.234 | 0.088/0.218 | 0.267/0.396 | 0.148/0.278
Exchange | 192 | 0.164/0.290 | 0.177/0.299 | 0.176/0.299 | 0.470/0.509 | 0.184/0.307 | 0.266/0.344 | 0.176/0.315 | 0.351/0.459 | 0.271/0.315
Exchange | 336 | 0.317/0.409 | 0.331/0.417 | 0.301/0.397 | 1.268/0.883 | 0.349/0.431 | 0.367/0.448 | 0.313/0.427 | 1.324/0.853 | 0.460/0.427
Exchange | 720 | 0.843/0.688 | 0.847/0.691 | 0.901/0.714 | 1.767/1.068 | 0.852/0.698 | 0.964/0.746 | 0.839/0.695 | 1.058/0.797 | 1.195/0.695
ECL | 96 | 0.176/0.261 | 0.148/0.240 | 0.195/0.285 | 0.219/0.314 | 0.237/0.329 | 0.168/0.272 | 0.197/0.282 | 0.247/0.345 | 0.193/0.308
ECL | 192 | 0.175/0.262 | 0.162/0.253 | 0.199/0.289 | 0.231/0.322 | 0.236/0.330 | 0.184/0.289 | 0.196/0.285 | 0.257/0.355 | 0.201/0.315
ECL | 336 | 0.185/0.270 | 0.178/0.269 | 0.215/0.305 | 0.246/0.337 | 0.249/0.344 | 0.198/0.300 | 0.209/0.301 | 0.269/0.369 | 0.214/0.329
ECL | 720 | 0.218/0.294 | 0.225/0.317 | 0.256/0.337 | 0.280/0.363 | 0.284/0.373 | 0.220/0.320 | 0.245/0.333 | 0.299/0.390 | 0.246/0.355
Note: the best results are shown in bold, and the second best results are underlined.
Table 3. Ablation study of ALD model.
Dataset | Horizon | Transformer Decoder Layer + Statistical Forecasting Layer | Transformer Decoder Layer | FEDformer
(each model column reports MSE/MAE)
Exchange | 96 | 0.077/0.198 | 0.084/0.203 | 0.148/0.278
Exchange | 192 | 0.164/0.290 | 0.184/0.305 | 0.271/0.315
Exchange | 336 | 0.317/0.409 | 0.334/0.418 | 0.460/0.427
Exchange | 720 | 0.843/0.688 | 0.873/0.702 | 1.195/0.695
Note: the best results are highlighted in bold, while the second best results are underlined.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
