1. Introduction
Time series forecasting (TSF) plays a crucial role in various real-life applications, including transportation [1], health care [2,3], and energy management [4]. In such TSF-dependent fields, accurate forecasts over extended horizons contribute to long-term planning and resilience to future challenges. For instance, in sales forecasting, historical sales data can inform inventory management to anticipate future demands.
Interest in accurate TSF methods is long-standing. Traditional approaches such as Kalman filters [5], hidden Markov models [6], and statistical models such as autoregressive integrated moving average (ARIMA) [7] have been commonly used. However, these models often require external input and struggle with long-sequence time series forecasting (LSTF). In parallel, a variety of deep learning architectures have emerged for LSTF. Temporal convolutional networks (TCNs) [8,9] and their variants have demonstrated success in LSTF [10,11,12,13], achieving a wide receptive field with reduced computational complexity through dilated convolutions. However, TCNs are limited in their ability to capture temporal relationships. Recently, transformer architectures have garnered attention for their capacity to model one-to-one relationships in sequential data [14], proving effective in LSTF [4,15]. Nonetheless, transformers require massive computational resources, and they also appear to struggle to capture the inter- and intra-series patterns in time series data.
Despite the abundance of architectures designed to address TSF and LSTF, many of these models overlook key challenges inherent in real-life time series data. First, a significant portion of time series data are non-stationary, with statistical properties such as mean and variance changing over time [16]. Second, complex patterns exist within the inter- and intra-series relationships of these time series data, posing difficulties for modeling [17]. Overlooking these issues can lead to inaccurate forecasts [17,18]. Recent efforts using modern deep learning techniques have begun to address these challenges. For instance, an adaptive recurrent neural network model was proposed to handle time series non-stationarity [19], while another work suggested a simple normalization technique to mitigate distribution shifts, applying normalization and later restoring the distribution to recover the withdrawn information [18]. Additionally, techniques such as the disentanglement of time series into trend and seasonality components have been beneficial for improving forecasts [20]. Moreover, learning features of time series across multiple temporal scales has shown promise [21].
In our approach, we draw inspiration from traditional methods to tackle non-stationary time series and unravel complex patterns. Initially, we consider the widely used statistical ARIMA model, which uses weighted averages of past data point differences to predict future values [7]. We suggest using a separate block to compute these differences, ensuring that only past data points in phase positions similar to each input data point are used. Additionally, prior to forecasting, we restore the non-stationarity information to enhance forecast accuracy. Next, we perform spectral decomposition of the time series data to extract suitable features before learning the features for forecasts. Spectral decomposition, commonly employed to decompose an input into its frequency components for detailed analysis, is leveraged to simplify intricate time series patterns into spectral components, enabling the learning of hidden features for long-term forecasts. Finally, inspired by the success of mixer architectures such as the multilayer perceptron mixer [22] and ConvMixer [23] for mixing spatial and channel-wise data points, we deploy ConvMixer to capture the intra- and inter-series dependencies from the spectral representation of the multivariate time series signals. The choice of ConvMixer is motivated by its effectiveness in challenging transformers in computer vision tasks, as well as its unambiguous and consistent architecture.
This paper introduces a deep learning architecture tailored for LSTF tasks, specifically addressing the challenges posed by non-stationary time series and aiming to unravel the inter-series and intra-series dependencies. The proposed architecture comprises two novel components, a “weak-stationarizing” block and a “non-stationarity restoring” block, designed to handle non-stationary input time series. The “weak-stationarizing” block operates by duplicating the original time series and aligning each data point to similar phase positions as the original series. Subsequently, the resulting time series undergoes differencing to render it weak-stationary. This block utilizes the power spectral density (PSD) of the time series at individual frequencies to determine the appropriate number of roll backs needed before differencing, aligning the time series based on the dominant frequency. The weak-stationary time series is then transformed using the fast Fourier transform (FFT) [24] to obtain its spectral decomposition, which serves as the basis for feature learning using the ConvMixer architecture for forecasting. To account for the loss of information due to the “weak-stationarizing” block, the “non-stationarity restoring” block is employed to restore the non-stationary properties of the time series. The resulting architecture, formed by integrating these methods, surpasses previous state-of-the-art (SOTA) results across six benchmark datasets. Our contributions are summarized as follows:
We propose a generalized deep learning model capable of addressing both univariate and multivariate forecasting problems.
We present a novel “weak-stationarizing” block, which utilizes PSD values at different frequency levels to determine the appropriate number of roll backs before differencing, effectively rendering the time series weak-stationary. The “non-stationarity restoring” block is employed to restore non-stationarity, ensuring information preservation for the final predictions. Ablation studies demonstrate the significant performance improvement achieved with these blocks.
We modify the ConvMixer architecture for use in TSF, which operates on the spectral decompositions of the time series to produce high-quality forecasts.
The proposed overall architecture achieves an average of 21% and up to 64.6% relative improvement over previous state-of-the-art methods on six real-world datasets (ETT, electricity, traffic, weather, ILI, and exchange) in various settings.
1.1. Related Works
TSF has been extensively studied, with traditional methods such as hidden Markov models [6], Kalman filters [5], and statistical models such as ARIMA [25], exponentially weighted moving averages [26,27], and vector autoregressors [28] demonstrating notable performance. In the field of deep learning, RNNs were initially prominent for TSF due to their effectiveness in modeling sequential data [29,30,31,32]. Subsequently, TCN architectures gained popularity [8,9,10,11,12,13], with TCNs and RNNs often used in conjunction with graph neural networks (GNNs) to capture both spatial and temporal patterns [13,33,34,35,36]. Transformer architectures have emerged as dominant players in sequential modeling tasks, largely replacing RNNs [14]. The success of transformers is attributed to their self-attention mechanism; however, the quadratic computation and memory complexity of self-attention pose challenges for handling long sequences. Consequently, recent efforts in transformer-based LSTF models have focused on developing more efficient architectures, often by proposing sparser query matrices for computing self-attention [4,15]. Incorporating classical concepts alongside modern deep learning techniques has also performed reasonably well [20,21]. For instance, Autoformer [20] decomposes the original time series into seasonality and trend components, extracting trend information through multiple decomposition steps using average pooling and treating the difference between the original signal and the trend as seasonality; it also uses an autocorrelation block to extract dependencies [20]. SCINet [21], in contrast, leverages multi-resolution analysis in deep learning, employing a unique downsampling technique alongside convolutions and an interaction block [21].
1.1.1. Distribution Shift and Non-Stationary Time Series
Domain adaptation (DA) [37,38,39,40,41] and domain generalization (DG) [42,43,44,45] address distribution shifts in machine learning when predefining the domain is feasible. DA pertains to scenarios where the distributions of the training and test sets differ, while DG involves training with multiple domain sources. However, in non-stationary time series data, the domain shift occurs gradually over time, making a predefined domain specification impractical. Recently, the adaptive RNN architecture was proposed to handle distribution shifts in time series [19]; it splits the training data into periods to adapt the model to distribution shifts. In contrast, reversible instance normalization [18], which is compatible with various forecasting models, employs normalization along with additional learnable parameters to obtain forecasts and restore the prior distribution.
1.1.2. Spectral Decomposition
Spectral decomposition of time series data finds applications in speech and music recognition [46,47], machine health and mechanical vibration monitoring [48], river and oceanographic tide modeling [49], and power demand prediction [50]. In TSF, spectral decomposition is gaining traction. A new decomposition method based on Koopman theory [51] and comparable to the Fourier transform [24,52,53] has been suggested for long-term forecasts [54]. Furthermore, methods such as StemGNN utilize the graph Fourier transform and the discrete Fourier transform to exploit spectral dependencies for intra- and inter-series correlations in time series [17]. However, StemGNN’s complexity, combining a gated RNN, self-attention, and a GNN, may hinder its application to real-world datasets. Moreover, StemGNN has not been evaluated in LSTF settings.
2. Methodology
The proposed framework addresses both univariate and multivariate TSF problems, focusing on the non-stationarity of time series and simplifying the time series data. This section outlines the concept of spectral decomposition, the blocks for addressing non-stationarity (the “weak-stationarizing” block and the “non-stationarity restoring” block), and the ConvMixer architecture, and provides an overview of the overall architecture.
2.1. Problem Formulation
LSTF can be categorized into two main problem settings: multivariate TSF and univariate TSF. In the multivariate setting, we are given N different time series $X = [X^{1}, X^{2}, \ldots, X^{N}]$, which are interdependent. If the current time is denoted as $t$, the values of the N time series at time $t$ are represented as $x_t = [x_t^{1}, x_t^{2}, \ldots, x_t^{N}]$. Given a look-back window of length $T$ and aiming to forecast future data points up to a horizon length of $\tau$, the input time series can be expressed as $X_{in} = \{x_{t-T+1}, \ldots, x_t\}$, and the forecasts and actual values can be expressed as $\hat{Y} = \{\hat{x}_{t+1}, \ldots, \hat{x}_{t+\tau}\}$ and $Y = \{x_{t+1}, \ldots, x_{t+\tau}\}$, respectively.
The univariate setting is the case where the number of dependent time series, N, is 1. In this case, the input time series can be denoted as $X_{in} = [x_{t-T+1}, \ldots, x_t]$, and the forecasts and actual future values can be represented as $\hat{Y} = [\hat{x}_{t+1}, \ldots, \hat{x}_{t+\tau}]$ and $Y = [x_{t+1}, \ldots, x_{t+\tau}]$, respectively.
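For concreteness, the following PyTorch-style sketch shows the tensor shapes implied by this formulation; the variable names and the dimensions (N = 7, T = 96, $\tau$ = 192) are illustrative, not values taken from the paper.

```python
import torch

# Hypothetical dimensions: N series, look-back window T, forecast horizon tau.
N, T, tau = 7, 96, 192
B = 32  # batch of look-back windows

# Channels-first layout (series as channels), matching the 1D convolutions used later.
x_in = torch.randn(B, N, T)      # X_in: the past T observations of all N series
y_true = torch.randn(B, N, tau)  # Y: the tau future values to be forecast

# A forecasting model maps (B, N, T) -> (B, N, tau); in the univariate setting
# N is simply 1 and the same shapes apply.
```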
2.2. Spectral Decomposition
Spectral analysis is widely used for time series analysis, enabling the identification of frequency components present in the original signal [16,55]. Spectral decomposition involves breaking down the original signal into constituent components for further analysis. According to spectral decomposition principles, a time series $X$ of length $T$ starting at time $t$ can be decomposed into a linear combination of sines and cosines with different frequencies $f$:
$$X_t = \sum_{f} \big[ A(f) \sin(2\pi f t) + B(f) \cos(2\pi f t) \big].$$
Here, $A(f)$ and $B(f)$ are the amplitudes of the sine and cosine components at frequency $f$. Traditionally, when forecasting using spectral decomposition, the extracted spectral components are repeated out to the desired horizon length and then merged. We adapt the concept of spectral decomposition to clarify the frequency components upon which the input signal depends. We use the DFFT [24] to obtain the spectral representation of the signal, which is used to learn relevant features for LSTF. Since the output of the DFFT is complex-valued, its real and imaginary parts are treated as separate channels.
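As a minimal illustration of this step, the sketch below applies an FFT along the time axis and stacks the real and imaginary parts as separate channels; the function name and the choice of `torch.fft.rfft` (the real FFT, rather than the full complex FFT) are our assumptions.

```python
import torch

def to_spectral_channels(x: torch.Tensor) -> torch.Tensor:
    """Minimal sketch: DFFT along time, real and imaginary parts as separate channels.

    x: (batch, N, T) time-domain signal.
    returns: (batch, 2 * N, T // 2 + 1) spectral representation.
    """
    spec = torch.fft.rfft(x, dim=-1)                 # complex-valued spectrum
    return torch.cat([spec.real, spec.imag], dim=1)  # channel count doubles, as noted above
```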
2.3. Weak-Stationarizing Block and Non-Stationarity Restoring Block
Spectral analysis requires the input time series to be weakly stationary [16], meaning that the first- and second-order joint moments are consistent across equal-length segments. While achieving perfect stationarity is difficult, our objective is to reduce the non-stationary properties of the time series.
ARIMA models use differencing to obtain weak-stationary signals [7]. Building on this concept, the “weak-stationarizing” block is designed to yield a weak-stationary signal with a single differencing step. This involves aligning two copies of the input, where each data point in the copies lies in similar phase positions but is separated by a single period of the dominant frequency component in the time series. One copy remains unchanged, while the other is rolled back by a value we term the “optimum roll back value” (ORV), representing a single period of the dominant frequency.
The ORV is determined using the PSD of the time series [56]. The PSD is calculated by multiplying the FFT of the input with its complex conjugate, with the dominant frequency components having the highest PSD values. Since there are multiple dependent time series, the ORV is chosen as the most frequent position of the highest PSD value across the different dependent time series. Then, one copy of the input signal is rolled back by the ORV and subtracted from the original input signal. However, since the ORV may not be optimal for all the dependent time series, two additional learnable parameters are used as weights and biases to adjust the effect of the differencing for each input series. The learnable weights and biases are one-dimensional arrays of length N, where, as in previous sections, N is the number of dependent time series. The output of the “weak-stationarizing” block is a weak-stationary output (WSO).
The PSD can be obtained as
$$PSD = \mathcal{F}(X) \odot Conj\big(\mathcal{F}(X)\big),$$
where $Conj(\cdot)$ gives the complex conjugate value of the input and $\mathcal{F}(\cdot)$ denotes the FFT. Using the PSD, the ORV is then obtained as
$$ORV = Mode\Big(\underset{f}{\arg\max}\, PSD^{i}(f)\Big), \quad i = 1, \ldots, N.$$
Finally, the WSO is given by
$$WSO = W \odot \big(X - Roll(X, ORV)\big) + B,$$
where $Roll(X, ORV)$ denotes the input rolled back by the ORV, and $W = [w^{1}, w^{2}, \ldots, w^{N}]$ and $B = [b^{1}, b^{2}, \ldots, b^{N}]$ are the weights and biases for the N dependent time series.
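The following is a hedged PyTorch sketch of the “weak-stationarizing” block as we read it from the description above; the module name, the exclusion of the DC component when locating the PSD peak, and the conversion of the dominant-frequency index into a one-period roll-back are our own interpretations rather than details given in the paper.

```python
import torch
import torch.nn as nn

class WeakStationarizing(nn.Module):
    """Sketch of the "weak-stationarizing" block; names and details are illustrative."""

    def __init__(self, n_series: int):
        super().__init__()
        # Per-series learnable weight and bias that adjust the differencing effect.
        self.weight = nn.Parameter(torch.ones(n_series))
        self.bias = nn.Parameter(torch.zeros(n_series))

    def forward(self, x: torch.Tensor):
        # x: (batch, N, T)
        spec = torch.fft.fft(x, dim=-1)
        psd = (spec * torch.conj(spec)).real            # power spectral density per series
        peaks = psd[..., 1:].argmax(dim=-1) + 1         # dominant (non-DC) frequency index per series
        freq = int(torch.mode(peaks.flatten()).values)  # most frequent peak position across series
        orv = max(1, x.shape[-1] // freq)               # one period of that frequency (our interpretation)
        rolled = torch.roll(x, shifts=orv, dims=-1)     # copy aligned to a similar phase position
        wso = self.weight[:, None] * (x - rolled) + self.bias[:, None]
        return wso, rolled                              # rolled copy kept as NSIR for later restoration
```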
Transforming the time series using the “weak-stationarizing” block causes the loss of important statistical information. To compensate for this loss, the “non-stationarity restoring” block is introduced before the projection layer to restore the necessary details. The output of this block is a non-stationary output (NSO), where the rolled-back portion is denoted as the non-stationary information representation (NSIR), i.e., $NSIR = Roll(X, ORV)$, for simplicity. The weak-stationary output then goes through the ConvMixer layers. If $F$ represents the output of the ConvMixer architecture (discussed in Section 2.4), then the process in the “non-stationarity restoring” block can be represented as
$$NSO = F + NSIR.$$
The detailed architectures of the “weak-stationarizing” block and the “non-stationarity restoring” block can be seen in the lower half of Figure 1.
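A minimal sketch of the restoring step under the reading above: the rolled-back copy (NSIR) kept from the “weak-stationarizing” block is simply added back onto the learned features before projection. Whether any additional rescaling is applied at this point is not specified in the text, so none is shown.

```python
import torch

def restore_non_stationarity(feats: torch.Tensor, nsir: torch.Tensor) -> torch.Tensor:
    # feats: ConvMixer features mapped back to the time domain, shape (batch, N, T)
    # nsir:  Roll(X, ORV) kept from the weak-stationarizing step, same shape
    return feats + nsir  # NSO = F + NSIR
```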
2.4. ConvMixer
Extraction of prominent features is critical in deep learning tasks. Leveraging the success of the ConvMixer architecture in computer vision applications, we adopt it with some modifications to suit our purpose. The fundamental concept of the mixer architecture is to shuffle the data both spatially and channel-wise using depthwise and pointwise convolutions, respectively. Additionally, ConvMixer maintains the input structure throughout the mixer layers.
Unlike the original ConvMixer, which operates on image data and includes an embedding layer to deal with image patches, we omit the embedding layer, as our data can be processed directly in its existing form. However, as mentioned in Section 2.2, to obtain spectral features, the real and imaginary parts of the DFFT output are treated as separate channels; hence, the number of input channels becomes twice the number of dependent time series N. Additionally, we use 1D convolutions due to the nature of our data, and we use layer normalization, contrary to the original ConvMixer architecture. For an input X, the pointwise convolution (PW) is represented as
$$PW(X) = LayerNorm\big(\sigma(Conv1D_{pointwise}(X))\big),$$
and the depthwise convolution (DW) is represented as
$$DW(X) = LayerNorm\big(\sigma(Conv1D_{depthwise}(X))\big),$$
where $\sigma$ denotes the activation function. Then, the function $F$ representing the working of the ConvMixer architecture can be represented as
$$F(X) = \big(PW \circ Residual(DW)\big)^{L}\big(\mathcal{F}(X)\big),$$
where $Residual$ signifies the presence of a residual connection in the block, i.e., $Residual(DW)(Z) = DW(Z) + Z$, $\mathcal{F}$ represents the DFFT operation, and L is the number of repetitions of the mixer layer.
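Below is a sketch of one 1D ConvMixer layer following this description; the GELU activation, the kernel size, and the placement of the residual connection around the depthwise convolution mirror the original ConvMixer and are assumptions where the text does not spell them out.

```python
import torch
import torch.nn as nn

class ConvMixer1dLayer(nn.Module):
    """Sketch of one 1D ConvMixer layer; activation and kernel size are assumptions."""

    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        # Depthwise convolution mixes data points along the sequence within each channel.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   groups=channels, padding="same")
        # Pointwise (1x1) convolution mixes information across channels (series).
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.GELU()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    @staticmethod
    def _channel_norm(x: torch.Tensor, norm: nn.LayerNorm) -> torch.Tensor:
        # LayerNorm over the channel dimension of a (batch, C, L) tensor.
        return norm(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self._channel_norm(self.act(self.depthwise(x)), self.norm1)  # residual over depthwise mixing
        return self._channel_norm(self.act(self.pointwise(x)), self.norm2)
```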
2.5. Architecture Overview
The suggested framework, as illustrated in Figure 1, begins by transforming the non-stationary input into a weakly stationary form using the “weak-stationarizing” block. The resulting weak-stationary output then undergoes a DFFT operation to obtain its spectral representation. Since the DFFT output is complex-valued, the real and imaginary parts are concatenated as separate channels for further processing. To acquire suitable features from the input containing spectral information, we employ the ConvMixer architecture. The learned features are then transformed back into the time domain. To address the information loss caused by the “weak-stationarizing” block, we revert the process via the “non-stationarity restoring” block. The resulting combination of learned features and restored non-stationarity information is fed to a projection layer to generate the forecasts.
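Putting the pieces together, a hedged end-to-end forward pass might look as follows, reusing the hypothetical `WeakStationarizing` and `ConvMixer1dLayer` modules sketched earlier; the inverse-FFT step and the final linear projection over the time dimension are our assumptions about details not stated in the text.

```python
import torch
import torch.nn as nn

class SpectralMixerForecaster(nn.Module):
    """Hedged end-to-end sketch; builds on the hypothetical modules defined above."""

    def __init__(self, n_series: int, lookback: int, horizon: int, n_layers: int = 4):
        super().__init__()
        self.ws = WeakStationarizing(n_series)            # weak-stationarizing block (sketched earlier)
        self.mixer = nn.Sequential(
            *[ConvMixer1dLayer(2 * n_series) for _ in range(n_layers)]  # real + imaginary channels
        )
        self.proj = nn.Linear(lookback, horizon)          # projection from look-back window to horizon

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, N, T)
        wso, nsir = self.ws(x)                             # weak-stationary output and rolled copy (NSIR)
        spec = torch.fft.rfft(wso, dim=-1)
        feats = self.mixer(torch.cat([spec.real, spec.imag], dim=1))  # learn features in the spectral domain
        n = x.shape[1]
        feats = torch.fft.irfft(torch.complex(feats[:, :n], feats[:, n:]),
                                n=x.shape[-1], dim=-1)     # back to the time domain
        nso = feats + nsir                                 # non-stationarity restoring step
        return self.proj(nso)                              # forecasts: (batch, N, horizon)
```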
3. Experiments
We conducted comprehensive qualitative and quantitative evaluations of the proposed architecture on six real-world datasets. We also performed ablation studies to evaluate the different components of our suggested model.
3.1. Datasets
ETT [4]: It consists of oil temperature readings of electrical transformers and six other factors affecting the temperature, collected between July 2016 and July 2018.
Exchange [57]: It includes daily exchange rates of eight different currencies collected from 1990 to 2016.
Weather (https://www.bgc-jena.mpg.de/wetter/, accessed on 22 September 2021): A collection of measurements of 21 different meteorological indicators, such as air temperature and humidity, collected every 10 min throughout 2020.
Traffic (https://pems.dot.ca.gov/, accessed on 22 September 2021): Records of readings collected hourly from sensors on San Francisco Bay area freeways, indicating the occupancy rate of roads; provided by the California Department of Transportation.
ILI (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html, accessed on 22 September 2021): This dataset was collected by the Centers for Disease Control and Prevention of the United States; it consists of the weekly counts of patients displaying influenza-like illness symptoms between 2002 and 2021.
3.2. Implementation Details
We trained our model using the L2 loss and the Adam [58] optimizer. Hyperparameters such as the initial learning rate, the number of ConvMixer layers, and the batch size were determined via grid search on held-out validation sets. Early stopping was used to prevent overfitting. The code was implemented in PyTorch (version 1.11, https://pytorch.org, accessed on 18 April 2022), and the experiments were performed on a single NVIDIA TITAN RTX 24 GB GPU. To ensure fair comparisons with the other baselines, we set the look-back window lengths (LWLs) to match those of Autoformer [20].
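An illustrative training loop matching this description is sketched below; the learning rate, epoch count, and patience are placeholders rather than the grid-searched values, the random tensors stand in for real train/validation splits, and `SpectralMixerForecaster` is the hypothetical model sketched in Section 2.5.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

N, T, H = 7, 96, 96  # illustrative dimensions
train_loader = DataLoader(TensorDataset(torch.randn(256, N, T), torch.randn(256, N, H)), batch_size=32)
val_loader = DataLoader(TensorDataset(torch.randn(64, N, T), torch.randn(64, N, H)), batch_size=32)

model = SpectralMixerForecaster(n_series=N, lookback=T, horizon=H)  # hypothetical model from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()  # L2 loss

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

    if val_loss < best_val:        # early stopping on the validation loss
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```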
To gain insight into the performance of our model, we compared it with nine baseline methods: SCINet [21], Autoformer [20], Informer [4], Reformer [59], LogTrans [15], LSTNet [57], N-BEATS [60], DeepAR [32], and ARIMA [7].
3.3. Results
To evaluate the suggested model’s performance against existing solutions, we fix the LWL while varying the horizon length. For datasets other than ILI, the LWL was set to 96, with horizon lengths of 96, 192, 336, and 720. Similarly, for the ILI dataset in the multivariate setting, the LWL was set to 36, with horizon lengths of 24, 36, 48, and 60.
3.3.1. Results for Multivariate Setting
As presented in Table 1, our method outperforms the existing baselines in all settings except for the weather dataset at horizon lengths of 96 and 192. For that dataset and those horizons, our model is second only to SCINet in terms of mean squared error (MSE), while still outperforming all other models in terms of mean absolute error (MAE). This indicates that, for the weather dataset and shorter forecast horizons (96 and 192 time steps), our model exhibits a more consistent error distribution but is more susceptible to outliers than SCINet. The average relative improvements in MSE compared to the previous SOTA are 13.7% on ETTm2, 20.26% on electricity, 56.45% on exchange, 16.5% on traffic, 0.3% on weather, and 21.7% on ILI. The overall average improvement in MSE is 21.5%, with the most significant improvement on the exchange dataset. The best result in any single setting is also obtained on the exchange dataset, at a horizon length of 720, with a 64.82% relative improvement. In contrast to other methods, our model shows no significant drop-off in performance as the prediction horizon lengthens. The forecasts for these datasets in the multivariate setting are plotted in Figure 2, which illustrates the superiority of our results compared to those of the SOTA Autoformer. Our model forecasts seasonal data such as traffic and electricity very well. For data that do not exhibit seasonality, the plots are less striking; however, our model still outperforms the existing SOTA Autoformer. We attribute this success to the model’s careful treatment of the non-stationary properties of the time series and its better generalization of the complex inter- and intra-series relationships in the multivariate time series.
3.3.2. Results for Univariate Setting
The results of the experiments in the univariate setting are presented in Table 2. Except for two instances, ETTm2 with a forecast horizon of 96 and exchange with a forecast horizon of 720, our method surpasses the existing baselines in all settings. The average improvement in the overall MSE is 4% on the ETTm2 dataset and 12.5% on the exchange dataset. Among individual settings, the best improvement is achieved on the exchange dataset at a forecast horizon length (FHL) of 96, with a 61.8% relative improvement over the previous best. As in the multivariate setting, there is no striking degradation in performance as the forecasting horizon lengthens.
3.4. Ablation Study
To interpret the effects of the individual elements of our model, we observe its performance under several modifications. The datasets used in these experiments are ETTm1, ECL, and exchange. For convenient comparison, the batch size and initial learning rate were fixed at 32 and 0.003, respectively, throughout all experiments. The study is divided into two parts: the impact of learning features in the spectral domain together with the use of the “weak-stationarizing” and “non-stationarity restoring” blocks, and the impact of varying the number of mixer layers.
3.4.1. Impact of Processing the Time Series in the Spectral Domain and the Usage of “Weak-Stationarizing” and “Non-Stationarity Restoring” Blocks
The results for this study can be seen in Table 3. We set up four variations of our model. The first is the model suggested in the methodology section without any alteration (results represented by the “spectral domain” sub-column under the “with WS and NSR blocks” column in Table 3). In the second variation, we omit the transformation of the input into the spectral domain; instead, the output of the “weak-stationarizing” block is fed directly to the ConvMixer block for further processing (the “time domain” sub-column under the “with WS and NSR blocks” column in Table 3). The third and fourth variations study the effects of the “weak-stationarizing” and “non-stationarity restoring” blocks; both omit these blocks. The third variation has a simple skip connection in the same position as the connection between the “weak-stationarizing” block and the “non-stationarity restoring” block, while the fourth variation does not have this additional skip connection. The third variation checks whether the benefits of the “weak-stationarizing” and “non-stationarity restoring” blocks stem from the removal and restoration of the non-stationary properties or merely from the skip-connection-like behavior of the connection between the two blocks.
The results presented in Table 3 suggest that, except for ETTm1 with an FHL of 24 and exchange with an FHL of 720, the suggested model outperforms the other variations in every instance. While the results of the second and third variations are comparable to those of our model, the fourth variation performs significantly worse. This shows that, while transforming the input into the spectral domain and transforming the non-stationary input into a weak-stationary form and back via the “weak-stationarizing” and “non-stationarity restoring” blocks are important, the role of the additional connection between these two blocks is even more noteworthy.
3.4.2. Impact of Varying the Number of ConvMixer Layers
The results of varying the number of mixer layers on the ETTm1, ECL, and exchange datasets are shown in Table 4. No single best choice for the number of mixer layers holds across all datasets. However, it appears beneficial to increase the number of mixer layers for more complex datasets and longer forecast horizons. In particular, the benefit of a higher number of mixer layers is evident on the ECL dataset, where the best performance in every setting is obtained with 4 or 5 mixer layers. On ETTm1, the best results are obtained with a different number of layers for different forecast horizons, and on the exchange dataset, 1 or 2 mixer layers appear to be best.
3.4.3. Analysis of the Generalization Capabilities
To analyze the generalization capabilities of the model, we performed five-fold cross-validation on the ETTm1, ECL, and exchange datasets. The results are shown in Table 5, from which we can see that the performance of the model is consistent across the different test sets.
3.5. Efficiency Analysis
We compare the memory and run-time requirements of recent transformer architectures that report SOTA results with those of our suggested method for varying numbers of mixer layers. We use the ETTm2 dataset in the univariate setting with a batch size of 8, and the memory and run-time data were collected during the training phase for all compared architectures. The results are shown in Figure 3. Figure 3a,b compare the memory requirements for varying horizon lengths and look-back window lengths, and Figure 3c,d compare the run-time for varying horizon lengths and LWLs, respectively. From Figure 3 we can conclude that the suggested architecture is considerably more efficient than the recent transformer-based architectures. The use of resource-intensive self-attention, prob-sparse self-attention, and autocorrelation accounts for the inefficiency of the Transformer, Informer, and Autoformer architectures, respectively.
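For reference, peak GPU memory and wall-clock run-time during training can be measured as in the sketch below, reusing the objects from the training sketch in Section 3.2; the exact measurement protocol used for Figure 3 is not described in the text, so this is only one plausible setup and assumes a CUDA device is available.

```python
import time
import torch

# Reuses model/optimizer/criterion/train_loader from the earlier training sketch.
model.cuda()
torch.cuda.reset_peak_memory_stats()
start = time.time()

for x, y in train_loader:  # one training pass at the chosen batch size
    optimizer.zero_grad()
    loss = criterion(model(x.cuda()), y.cuda())
    loss.backward()
    optimizer.step()

elapsed = time.time() - start
peak_mb = torch.cuda.max_memory_allocated() / 2**20
print(f"run-time: {elapsed:.1f} s, peak GPU memory: {peak_mb:.0f} MiB")
```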
4. Conclusions
This study proposed a deep learning architecture designed to address the challenges of TSF in both multivariate and univariate settings, focusing on the non-stationary nature of real-world time series data as well as the complexities of intra- and inter-series relationships. The suggested architecture comprises novel components, the “weak-stationarizing” block and the “non-stationarity restoring” block, to handle non-stationarity, while also leveraging spectral decomposition and a ConvMixer architecture to capture complex relations within the data. The experimental results demonstrate the effectiveness of the proposed model across six real-world datasets, achieving superior or comparable performance to SOTA methods in most cases. Additionally, the proposed model requires significantly less memory and execution time than transformer-based models, making it suitable for scenarios where computational resources are limited. Although the focus is on long-sequence time series, the model is adaptable to short-sequence scenarios as well. Moving forward, it would be interesting to explore how the concepts introduced in the proposed model can complement existing forecasting models, such as transformers, TCNs, and RNNs. Additionally, there is room for improvement in handling datasets without seasonality, suggesting avenues for further research to enhance the model’s performance in such scenarios.