Submit to this Journal Review for this Journal Propose a Special Issue

Article Menu

Share Help Cite Discuss in SciProfiles

Open AccessArticle

Peer-Review Record

Data-Centric Benchmarking of Neural Network Architectures for the Univariate Time Series Forecasting Task

Forecasting 2024, 6(3), 718-747; https://doi.org/10.3390/forecast6030037

by Philipp Schlieper^1,*

, Mischa Dombrowski¹

, An Nguyen¹, Dario Zanca¹

and Bjoern Eskofier^1,2

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Forecasting 2024, 6(3), 718-747; https://doi.org/10.3390/forecast6030037

Submission received: 22 July 2024 / Revised: 15 August 2024 / Accepted: 22 August 2024 / Published: 26 August 2024

(This article belongs to the Section Forecasting in Computer Science)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors The paper explores the performance of various neural network architectures—LSTM, CNN, and Transformers—on time series forecasting using synthetically generated datasets based on sinusoidal functions. The study adopts a data-centric approach, focusing on how different dataset characteristics (like delay length, noise, frequency, and sequence length) impact model performance. Findings indicate that Transformers excel with varying delay lengths, while LSTMs outperform in scenarios with different noise levels, frequencies, and sequence lengths. Here are some comments to address: 1) It is recommended to enlist the article's contributions in the Introduction, so that the interested reader can locate them quickly. 2) Appropriately reducing the length of the Related Work and Conclusions sections will make the article more concise and focused. 3) Representative experimental datasets need to be provided. The synthetic data used in the article is overly simplistic and does not capture the complexity of real-world data. This simplification may lead to a lack of applicability and reliability of the experimental results in practical applications, making it ineffective in guiding actual time series prediction tasks. 4) Advanced models need to be considered. The article only compares basic LSTM, CNN, and Transformer architectures without considering more complex or advanced variants, and these models are all from before 2018. This limits the general applicability of the conclusions, as more complex and optimized models are typically used in practical applications to handle time series data. It is recommended to experiment with models from the recent years, such as Non-stationary Transformers (Yong Liu et al., NeurIPS 2022). 5) Each experiment has different hyperparameter settings (as shown in Table 1), but the article does not explain in detail how these settings affect the performance of each model. The performance results of each model are only provided in Section 3 without an in-depth analysis of the specific impact of different hyperparameters on the results. 6) The hyperparameter optimization algorithm in the article only considered the Tree-Structured Parzen Estimator (TPE). However, as far as I know, your model has fewer hyperparameters, and Bayesian optimization methods may usually achieve better results. Comments on the Quality of English Language

None.

Author Response

Dear Reviewer,

Thank you for the opportunity to revise our manuscript "Data-Centric Benchmarking of Neural Network Architectures for the Univariate Time Series Prediction Task”. We appreciate the careful review and constructive suggestions. In the following, we catalog our responses to the comments and the adjustments in the paper which are highlighted with blue text color.

Comment 1: It is recommended to enlist the article's contributions in the Introduction, so that the interested reader can locate them quickly.

Response 1: We agree with the reviewer that the contributions were not described with enough emphasis in the introduction. Therefore, we added an additional paragraph to this section:

"In this research paper, we introduce three major contributions to the field of time series forecasting. Firstly, we present a framework for synthetically generating time series data, which allows for precise control over various data characteristics. This framework is designed to aid practitioners in creating customized datasets tailored to specific research needs. Secondly, we conduct an in-depth analysis of the learning phase and performance of the most common basic neural network architectures for univariate time series forecasting. This analysis provides valuable insights into the strengths and weaknesses of the architectures' different time series processing structures. Lastly, we establish a causal connection between the performance of these architectures and specific characteristics of the data. This connection helps to understand how different data characteristics influence model effectiveness, offering guidance for selecting appropriate models based on the inherent properties of the time series data."

Comment 2: Appropriately reducing the length of the Related Work and Conclusions sections will make the article more concise and focused.

Response 2: We understand that the paper present a considerable text volume. Therefore, we tried to be more concise in the mentioned sections by reducing the length where we could see fit. The "Related Work" subsection is now elevated to its own Section 2.

Comment 3: Representative experimental datasets need to be provided. The synthetic data used in the article is overly simplistic and does not capture the complexity of real-world data. This simplification may lead to a lack of applicability and reliability of the experimental results in practical applications, making it ineffective in guiding actual time series prediction tasks.

Response 3: Our research aims to offer a novel perspective on univariate time series prediction by connecting network architectures (recurrence, convolution, attention) to specific data characteristics. We isolated both elements in our experiments, by using vanilla versions of the models and synthetic data. This enabled us to explore if one architecture performs better than the others on certain characteristics.

Using real-world data would hinder these connections because extracting data characteristics is non-trivial. While we acknowledge limitations for direct applications, we believe our findings provide valuable insights for the forecasting community. In scenarios like industry and healthcare, where resources are limited but data expertise is high, our work can guide informed decisions on architecture selection. We thank the reviewer for their comment and incorporated a clear description of current study limitations in Section 5.2 and refined our scope in the Introduction.

Comment 4: Advanced models need to be considered. The article only compares basic LSTM, CNN, and Transformer architectures without considering more complex or advanced variants, and these models are all from before 2018. This limits the general applicability of the conclusions, as more complex and optimized models are typically used in practical applications to handle time series data. It is recommended to experiment with models from the recent years, such as Non-stationary Transformers (Yong Liu et al., NeurIPS 2022).

Response 4: We understand this comment in connection with Comment 3 and we agree that both items are limitations that we are aware of and which we tried to highlight in the discussion section. While we recognize the limited scope of our study, we believe that examining these fundamental networks allows us to establish a direct connection between performance outcomes and the intrinsic processing capabilities of these models.

Our intent is not to identify the sole best-performing model for certain data characteristics. But rather to establish intuition which temporal processing paradigm is more suited for certain data properties.

We acknowledge that our introduction of the topic can benefit from a more precise description of our research goal. Therefore, we add further information in the revised version of the paper:

"Our main goal is to develop a deeper intuition on how different neural network architectures — specifically recurrence-based, convolution-based, and attention-based models — process key characteristics in time series data, such as delay length, frequency and noise, and sequence length. By systematically analyzing these architectures, we aim to uncover how each architecture type handles these distinct properties in the data, providing a clearer understanding of their internal mechanisms and performance. Additionally, we seek to detect any significant differences between these architectures in their ability to manage and learn from these characteristics, thereby offering insights into their suitability for various time series forecasting tasks."

Comment 5: Each experiment has different hyperparameter settings (as shown in Table 1), but the article does not explain in detail how these settings affect the performance of each model. The performance results of each model are only provided in Section 3 without an in-depth analysis of the specific impact of different hyperparameters on the results.

Response 5: When investigating the learning phases of our experiments, we were mainly interested in the convergences of the architectures. Therefore, we equipped each model with the same training resource to achieve a fair comparison. For the hyperparameter optimization, Optuna was employed to perform 25 optimization trials using the Tree-Structure Parzen Estimator (TPE). Then, the best-performing configuration is used for the 100-epoch training. Since the number of hyperparameters and their search-space is rather large, we abstained from an in-depth analysis of each individual hyperparameter. Our aim was to provide the same budget for each model and evaluate their performance after training on comparable resources. In the revised manuscript, we tried to make the description of the training procedure more explicit in Section 3.3.1.

Comment 6: The hyperparameter optimization algorithm in the article only considered the Tree-Structured Parzen Estimator (TPE). However, as far as I know, your model has fewer hyperparameters, and Bayesian optimization methods may usually achieve better results.

Response 6: We chose the TPE due to its reported ability to outperform random search (Bergstra, J., Yamins, D., & Cox, D. (2013, February). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning (pp. 115-123). PMLR.). To our knowledge, TPE is a Bayesian optimization method that employs probabilistic modeling to distinguish good and bad hyperparameter settings. Other Bayesian methods would, of course, also be feasible solutions in this case.

We would again like to thank the reviewer for evaluating our manuscript. We have tried to address all concerns thoroughly and believe our manuscript has improved after its revision. We hope that you find the revised manuscript suitable for publication or share further suggestions for improvement with us.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents an interesting perspective by using the data generator to generate a benchmark time series dataset and compare the performance of 3 well-known predictive models on the generated data.

There are several comments:

1. The synthesis data are generated in a controlled environment; therefore, it might not reflect the characteristics of time-series data in real life. Hence, the predicted result could be biased and moreover, the predicted result could be manipulated. How do the authors avoid these problems?

2. The authors claim that their data generator is based on a previous work of [13]. But there is no (short) description of the work [13]; moreover, the work [13] can be considered as a natural competitor, but there is no comparison of predictive models on the data generated by [13] with the results on the modified version (authors’). Please explain.

3. In the experiments:

- Why only the MAE measure is used to evaluate the performance of predictive models while there are many other measures used, such as RMSE, MAPE, and SMAPE?

- The authors do not explain why the maximum of epochs is 100. In the manuscript, the authors claim that “After 100 epochs the Transformer presents the lowest validation loss”, but actually, the MAE value of the Transformer model fluctuates and tends to increase at epoch 100 (Figure 8). Please clarify.

Comments on the Quality of English Language

The manuscript needs to be carefully proofread. A reference error in subsection 3.1.2. “the description in Section ??.”

Author Response

Dear Reviewer,

Comment 1: The synthesis data are generated in a controlled environment; therefore, it might not reflect the characteristics of time-series data in real life. Hence, the predicted result could be biased and moreover, the predicted result could be manipulated. How do the authors avoid these problems?

Response 1: Thank you for highlighting this important matter. The intended contribution of our research is to provide a novel perspective on univariate time series prediction by drawing direct connections between network architectures and data characteristics. We presume that different time series processing structures - being recurrence, convolutions and attention - of the architectures are more or less suitable for different characteristics in the data. Therefore, we tried to isolate both, the characteristics and the processing structure, in our experiment which allows us to draw said connections. We were interested in the question if there are any advantages of using one architecture over the other. Incorporating real-world data would prevent us from drawing said connections between network architectures and certain data characteristics.

Nonetheless, we understand the limitations of our approach for direct applications as discussed in Section 5.2. However, we are still providing useful insights for the forecasting community. In many real-world scenarios, for example, in the industry or in healthcare, there is a high domain expertise regarding the datasets which ought to be processed with neural networks. At the same time, resources for model development can be limited. Based on that, our findings can lead to an informed decision on which architecture family is the most promising given the dataset's characteristics.

Our intention is not to provide a survey on best model variants, but rather a novel, data-centric perspective on benchmarking network architectures. This approach can certainly benefit from an investigation of real-world data which is an interesting direction for follow-up work. In the current work, we are mainly interested in discovering which architecture type or processing strategy of temporal information is more suitable for which data characteristics. As elaborated in Section 3.1, the three architectures establish distinct approaches on how temporal information is incorporated in the output calculation. With our experiments we are testing if there even are differences.

We acknowledge that our introduction of the topic can benefit from more a more precise description of our research goal. Therefore, we add further information in the revised version of the paper:

Comment 2: The authors claim that their data generator is based on a previous work of [13]. But there is no (short) description of the work [13]; moreover, the work [13] can be considered as a natural competitor, but there is no comparison of predictive models on the data generated by [13] with the results on the modified version (authors’). Please explain.

Response 2: Thank you for highlighting this item. We agree that a short description and connection to [13] was missing in our manuscript. Therefore, we added a short paragraph to the Related Work section:

"Li et al. [13] providing the LogFormer to counter the memory bottleneck for time series forecasting, the authors also present a technique to generate synthetic data based on sinusoidal signals. By altering the amplitudes of the signals, they create variation in the data. This approach is the foundation of our data synthesis as described in Section 3.2. For our purposes, we extended the data generation to create signals that vary in different parameters and to obtain control over the varying characteristics."

We also acknowledge the topical connection to [13] and that comparing their results seems logical. However, the data generated in [13] is much simpler than the synthetic data in our approach and the experimental design is different. [13] aims to identify a "better" transformer variant while we are looking for connections between data characteristics and neural network structures.

Comment 3: Why only the MAE measure is used to evaluate the performance of predictive models while there are many other measures used, such as RMSE, MAPE, and SMAPE?

Response 3: The MAE was the error metric in the model trainings of our experiments. Nonetheless, we agree that other measures are necessary to provide a more complete picture. Therefore, we added the RMSE and MAPE metrics to the Tables 8, 9, and 10 in our Result section. In this specific case, these metrics do not provide a considerable amount of new information as the best-performing architecture presents the lowest error in each metric.

Comment 4: The authors do not explain why the maximum of epochs is 100. In the manuscript, the authors claim that “After 100 epochs the Transformer presents the lowest validation loss”, but actually, the MAE value of the Transformer model fluctuates and tends to increase at epoch 100 (Figure 8). Please clarify.

Response 4: Thank you for highlighting this issue. We agree that the maximum of 100 epochs should be explained in more detail which we did in the revised version of the paper. In the design of our experiments, we had to limit the training times to a feasible amount while still achieving convergence of the models. 100 epochs were chosen as a cap in each of our experiments to provide each model architecture with a similar training budget. Applying an early stopping criteria is also possible. But in our experiments, we wanted to prevent the architectures obtaining any advantage from prolonged training times and rather investigate their performances under similar training conditions.

In Figure 8, the MAE of the Transformer tends to increase but the overall trend of the loss is still declining. In comparison with the LSTM and CNN, the loss of the Transformer fluctuates less.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper adopts a data-centric approach by synthesizing datasets with controlled characteristics such as sequence length, noise levels, and delay lengths. The authors compare the performance of three neural network-based architectures (LSTM, CNN, and Transformers), providing insights into the suitability of each architecture based on specific data characteristics.

The topic of the article is of obvious interest to the journal Forecasting. It is current and presents a good review of related works. Scientifically, it is well-constructed and presents interesting results.

However, the article presents some aspects that can and should be improved for it to be accepted for publication. I list these points below:

Line 7: change to "architectures" in “…the most popular architecture for time series…”

Line 58: A space is missing to separate (CNN) from the expression.

Personally, I do not think it is appropriate for Section 1. Introduction to have only one subsection (1.1. Related Work). I think the best solution would be to create a new section 2. Related Work with the content of subsection 1.1. If the authors opt for this change, the article summary at the end of section 1 should be conveniently updated. In this summary, it makes sense to present only a summary at the level of sections and not at the level of subsections.

Figure 2 is referenced before Figure 1. Consequently, I suggest that Figure 2 appear first and the caption of this figure be changed to Figure 1.

Lines 83-84: The sentence should be changed to: “…impacts the performance of these models in face of data with different long- and short-term memories, and frequency…”

In the caption of Figure 1, there is some confusion between singular and plural. It is not clear whether one or several networks of each type (LSTM, CNN, Transformer) are trained.

Line 226: In section 2.1. Preliminaries, although an LSTM is a variant of RNNs, for the article to be more consistent, I suggest that section 2.1.1. make direct reference to LSTMs.

In formula (5), it is not appropriate for the function name f and the frequency f to have the same notation. I suggest changing the name of the function f. The same applies to formula (3). The filter f should have a different designation.

Line 304: The reference to section 5 is incorrect. It should be Section 2.3.

In line 320, at the beginning of section 2.3, a brief description of this section should be made to avoid having two headings in a row (2.3. and 2.3.1.).

In Table 1, there is a column designated as RNN. In my opinion, I think it would be better if the column were called LSTM, since the entire article always refers to the three models: LSTM, CNN, and Transformer. For example, this is particularly relevant in Table 2. In that case, the order in which the columns are arranged in Table 1 should remain consistent with the way the models are presented, that is, in the order: LSTM, CNN, and Transformer.

The reference to Optuna is only made in the caption of Table 1. I think that in line 336, where Adam Optimizer is mentioned, there should be a reference to Optuna (including a citation).

The font size of Tables 5, 6, and 7 is much larger compared to the previous tables.

The reference to Section 4 in line 368 seems incorrect. The same goes for lines 403, 524, 589, and 751. It should be Section 2.2? There is also a similar problem in line 450.

Comments on the Quality of English Language

In terms of English, the article is well-written. The level of English allows for easy reading without any issues.

Author Response

Dear Reviewer,

Comment 1: Line 7: change to "architectures" in “…the most popular architecture for time series…”

Response 1: Adjusted in the revised manuscript.

Comment 2: Line 58: A space is missing to separate (CNN) from the expression.

Response 2: Adjusted in the revised manuscript.

Comment 3: Personally, I do not think it is appropriate for Section 1. Introduction to have only one subsection (1.1. Related Work). I think the best solution would be to create a new section 2. Related Work with the content of subsection 1.1. If the authors opt for this change, the article summary at the end of section 1 should be conveniently updated. In this summary, it makes sense to present only a summary at the level of sections and not at the level of subsections

Response 3: Thank you for mentioning this issue. In the revised version of the paper we have adjusted the sections by creating a new Related Work section. We have also adjusted the summary at the end of the Introduction section.

Comment 4: Figure 2 is referenced before Figure 1. Consequently, I suggest that Figure 2 appear first and the caption of this figure be changed to Figure 1.

Response 4: We have imagined Figure 1 to serve as some kind of visual abstract. Therefore, we placed Figure 1 before Figure 2. To stick to this intention, we adjusted the revised paper such that Figure 2 is not referenced before Figure 1.

Comment 5: Lines 83-84: The sentence should be changed to: “…impacts the performance of these models in face of data with different long- and short-term memories, and frequency…”

Response 5: Adjusted in the revised manuscript.

Comment 6: In the caption of Figure 1, there is some confusion between singular and plural. It is not clear whether one or several networks of each type (LSTM, CNN, Transformer) are trained.

Response 6: Adjusted in the revised manuscript.

Comment 7: Line 226: In section 2.1. Preliminaries, although an LSTM is a variant of RNNs, for the article to be more consistent, I suggest that section 2.1.1. make direct reference to LSTMs.

Response 7: Adjusted in the revised manuscript.

Comment 8: In formula (5), it is not appropriate for the function name f and the frequency f to have the same notation. I suggest changing the name of the function f. The same applies to formula (3). The filter f should have a different designation.

Response 8: Adjusted in the revised manuscript.

Comment 9: Line 304: The reference to section 5 is incorrect. It should be Section 2.3.

Response 9: Adjusted in the revised manuscript.

Comment 10: In line 320, at the beginning of section 2.3, a brief description of this section should be made to avoid having two headings in a row (2.3. and 2.3.1.).

Response 10: Adjusted in the revised manuscript.

Comment 11: In Table 1, there is a column designated as RNN. In my opinion, I think it would be better if the column were called LSTM, since the entire article always refers to the three models: LSTM, CNN, and Transformer. For example, this is particularly relevant in Table 2. In that case, the order in which the columns are arranged in Table 1 should remain consistent with the way the models are presented, that is, in the order: LSTM, CNN, and Transformer.

Response 11: Adjusted in the revised manuscript.

Comment 12: The reference to Optuna is only made in the caption of Table 1. I think that in line 336, where Adam Optimizer is mentioned, there should be a reference to Optuna (including a citation).

Response 12: Adjusted in the revised manuscript.

Comment 13: The font size of Tables 5, 6, and 7 is much larger compared to the previous tables.

Response 13: Adjusted in the revised manuscript.

Comment 14: The reference to Section 4 in line 368 seems incorrect. The same goes for lines 403, 524, 589, and 751. It should be Section 2.2? There is also a similar problem in line 450.

Response 14: Adjusted in the revised manuscript.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed the comments from reviewers well. Therefore I would suggest accepting the paper at this stage.

Article Menu

Data-Centric Benchmarking of Neural Network Architectures for the Univariate Time Series Forecasting Task

Further Information

Guidelines

MDPI Initiatives

Follow MDPI