FFTNet: Fusing Frequency and Temporal Awareness in Long-Term Time Series Forecasting

Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
Thank you for the opportunity to read this interesting paper. It is, however, challenging to follow in places. I have included my points below for your consideration.
The bullet points in the introduction do not follow the structure of an academic paper. It would be better if you provided a paragraph with your contributions.
Although the model's architecture is well-detailed, the paper does not provide a theoretical rationale for why the integration of frequency and temporal domains would be superior to models that concentrate solely on a single domain.
The paper lacks a comprehensive explanation of how the frequency-domain MLP functions. Specifically, it is unclear how the MLP processes the frequency components and how it differs from the traditional MLPs used in the time domain.
Although you are discussing hybrid time-frequency methods, the experimental comparison does not include any hybrid models. I think you should have a direct comparison with hybrid models to evaluate the performance.
Figure 1, which visualizes the feature maps from the MLP and CNN, is unclear. The differences between the two feature maps are not obvious. I think it would be better if you added some labeling.
Table 2 presents the performance metrics (MSE and MAE) for FFTNet and other models. The results are impressive but difficult to follow. Additionally, you should include a more detailed discussion on why FFTNet performs exceptionally well on specific datasets (e.g., weather and traffic) compared to others.
In Table 4, you provide results for the ablation study on patch size and stride, but the reasoning behind choosing p16, s8 as optimal is not clearly stated.
In addition, the datasets used (ETT, Electricity, Traffic, Weather) are stationary or semi-stationary. It is unclear how FFTNet would perform on highly irregular or financial time series, such as volatility modeling or high-frequency stock prices. This will help demonstrate generalizability.
In general, the notation used in some of the equations is difficult to follow and is not well explained.
Thank you.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude to the reviewer for taking the time and effort to review our manuscript. Your detailed and constructive comments have been extremely valuable in helping us to improve the quality of our work. We have carefully considered each of your comments and have made the following responses and revisions.
Comment 1: The bullet points in the introduction do not follow the structure of an academic paper. It would be better if you provided a paragraph with your contributions.
Response 1: Thank you for this feedback. We agree that the introduction’s structure required clarification. We have revised the introduction to replace bullet points with a cohesive paragraph summarizing our contributions (Page 3).
Comment 2: Although the model's architecture is well-detailed, the paper does not provide a theoretical rationale for why the integration of frequency and temporal domains would be superior to models that concentrate solely on a single domain.
Response 2: This is an extremely valuable comment. We sincerely appreciate your insight, as it has significantly enhanced the depth of our manuscript. In response, we have incorporated comprehensive theoretical justifications in Sections 3.1, 3.2, and 3.3. We have analyzed the theoretical basis of our method by leveraging Takens' theorem, the convolution theorem, and Parseval's theorem. Additionally, we have compared the mathematical theories of the 2D CNN and 1D CNN, as well as those of the frequency-domain MLP and time-domain MLP. This not only elucidates the advantages of integrating the frequency and temporal domains but also provides a solid theoretical foundation for our proposed model.
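For the reviewer's convenience, the two classical identities we invoke can be stated compactly (standard textbook forms; the notation here is illustrative, with X_k the DFT of x_t):

```latex
% Convolution theorem: circular convolution in time equals
% pointwise multiplication in frequency
\mathcal{F}\{x \circledast h\}_k = X_k \, H_k

% Parseval's theorem: energy is preserved across domains
\sum_{t=0}^{T-1} |x_t|^2 = \frac{1}{T} \sum_{k=0}^{T-1} |X_k|^2
```

The first identity motivates learning frequency-domain weights as global filters; the second guarantees that operating on the spectrum preserves the signal's energy, so no information is discarded by the change of domain.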
Comment 3: The paper lacks a comprehensive explanation of how the frequency-domain MLP functions. Specifically, it is unclear how the MLP processes the frequency components and how it differs from the traditional MLPs used in the time domain.
Response 3: We agree that this aspect needed elaboration. In Section 3.3, we provide a detailed mathematical analysis of the advantages of the frequency-domain MLP over the time-domain MLP.
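As a concrete illustration of the difference, below is a minimal, hypothetical sketch of a frequency-domain MLP (illustrative only; the class and variable names are ours, not the FFTNet code). Each learned weight acts on an rFFT coefficient and therefore influences the entire sequence globally, whereas a time-domain MLP weight mixes individual time steps:

```python
import torch
import torch.nn as nn

class FrequencyMLP(nn.Module):
    """Hypothetical sketch of an MLP applied in the frequency domain."""
    def __init__(self, seq_len: int, hidden: int):
        super().__init__()
        self.seq_len = seq_len
        n_freq = seq_len // 2 + 1          # number of rFFT bins
        # real and imaginary parts are packed into one real-valued vector
        self.w1 = nn.Linear(n_freq * 2, hidden)
        self.w2 = nn.Linear(hidden, n_freq * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) real-valued series
        spec = torch.fft.rfft(x, dim=-1)                 # (batch, n_freq), complex
        z = torch.cat([spec.real, spec.imag], dim=-1)    # (batch, 2 * n_freq)
        z = self.w2(torch.relu(self.w1(z)))              # ordinary MLP on the spectrum
        re, im = z.chunk(2, dim=-1)
        # back to the time domain; each frequency weight touched every time step
        return torch.fft.irfft(torch.complex(re, im), n=self.seq_len, dim=-1)
```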
Comment 4: Although you are discussing hybrid time-frequency methods, the experimental comparison does not include any hybrid models. I think you should have a direct comparison with hybrid models to evaluate the performance.
Response 4: We did not adequately introduce the Pathformer baseline before; it is in fact a hybrid time-frequency method. We have supplemented the introduction of Pathformer ("Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting") in the Related Work section.
Comment 5: Figure 1, which visualizes the feature maps from the MLP and CNN, is unclear. The differences between the two feature maps are not obvious. I think it would be better if you added some labeling.
Response 5: Thank you for pointing out this issue. We agree that the current Figure 1 lacks clarity. We have revised Figure 1 to add clear labels that distinguish between the feature maps of the MLP and CNN. We believe these changes will make the figure much more understandable and effectively convey the differences between the two feature maps.
Comment 6: Table 2 presents the performance metrics (MSE and MAE) for FFTNet and other models. The results are impressive but difficult to follow. Additionally, you should include a more detailed discussion on why FFTNet performs exceptionally well on specific datasets (e.g., weather and traffic) compared to others.
Response 6: We agree with this suggestion. We have added a new Section 4.5, Visualization, in which we discuss the performance of FFTNet in detail. We analyze the underlying reasons why FFTNet outperforms other models on specific datasets such as ETTh1 and Traffic.
Comment 7: In Table 4, you provide results for the ablation study on patch size and stride, but the reasoning behind choosing p16, s8 as optimal is not clearly stated.
Response 7: Thank you for raising this. We have supplemented the discussion on the reasons for choosing p16, s8 in the updated content. Theoretically, patch size and stride are related to Shannon's sampling theorem, as they determine how the time series is "sampled". The [16, 8] combination provided the best performance, balancing local feature capture with efficient data processing, and proved robust across different datasets for time series prediction.
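To illustrate the [16, 8] setting, overlapping patching can be written in a few lines (a hypothetical PyTorch sketch; function and variable names are ours, not the paper's):

```python
import torch

def patchify(x: torch.Tensor, patch_size: int = 16, stride: int = 8) -> torch.Tensor:
    """Stack a series of shape (batch, channels, length) into overlapping patches.

    With patch_size=16 and stride=8, consecutive patches overlap by half,
    so each time point is observed twice: once near a patch boundary and
    once near a patch center.
    """
    # result: (batch, channels, num_patches, patch_size)
    return x.unfold(-1, patch_size, stride)

x = torch.randn(32, 7, 96)   # e.g. 7 variables, look-back window 96
patches = patchify(x)        # (32, 7, 11, 16): (96 - 16) / 8 + 1 = 11 patches
```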
Comment 8: The datasets used (ETT, Electricity, Traffic, Weather) are stationary or semi-stationary. It is unclear how FFTNet would perform on highly irregular or financial time series, such as volatility modeling or high-frequency stock prices. This will help demonstrate generalizability.
Response 8: We appreciate this feedback. It is quite evident from the newly added visualization results that the non-stationarity and volatility of these datasets are prominent. In particular, the ETTh1 dataset in Figure 6 demonstrates these characteristics vividly. The data points in ETTh1 show significant fluctuations over time, indicating its non-stationary nature.
In addition, we have also supplemented the visualization results of the ILI dataset in Figure 5. The ILI dataset, with its own unique data distribution and trends, further enriches our understanding of the performance of our method on different types of datasets. By presenting these visualizations, we aim to provide a more comprehensive view of how our approach handles datasets with varying degrees of complexity, non-stationarity, and volatility.
Comment 9: In general, the notation used in some of the equations is difficult to follow and is not well explained.
Response 9: Thank you for pointing out this issue. We have carefully checked and revised some equations.
Thank you again for your valuable feedback. We believe these revisions address your concerns and strengthen the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
In time series analysis, frequency-domain representations and time-domain methods are the two main classes of methods. Frequency-domain representations are particularly effective for identifying periodic features, while time-domain methods are better suited for detecting local abrupt changes. Typical applications usually deploy only one type of method.
The authors present an innovative hybrid model that simultaneously extracts features from both the frequency and time domains, so the system can combine the advantages of both types of methods.
In the paper, the power transformer temperature dataset and its four subsets, as well as the weather, electricity, and traffic datasets, were tested. In total, seven real-world datasets have been evaluated, and the experimental results are promising.
- If we count weather and temperature as similar, only three types of data are tested. It would be better if more types of data were available.
- In the results section, there is only one figure. I would recommend using more appropriate figures.
- The section on “4.4. Long-term Time Series forecasting” may not be deep enough. The authors need to explain in more detail.
Comments on the Quality of English Language
The authors should check the English mistakes very carefully, for example:
“in the LTSF domain.These included Transformer-based methods PathFormer (2024)[39]and PatchTST
(2023)[22]; CNN-based methods TimesNet (2023)[7]and MICN (2023)[6]; and MLP-based”
"Informer[[19]]”
Authors must also pay attention to regular spacing between words.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude to the reviewer for taking the time and effort to review our manuscript. Your detailed and constructive comments have been extremely valuable in helping us to improve the quality of our work.
We have carefully considered each of your comments and have made the following responses and revisions.
Comments 1: "If we count weather and temperature as similar, there are only three types of data are tested. It would be better if more types of data are available."
Response 1: Thank you for your comments. We have added Section 4.5 "Visualization" and included the visualization results of the performance of FFTNet on the ILI dataset. You can find this new content on page 12, Figure 5.
Comments 2: "In the result section, there is only one figure. It would be recommended if more appropriate figures are used."
Response 2: We agree with this comment. We have added Section 4.5 "Visualization", in which we present our results intuitively. You can check the new section on pages 14-16. In this section, we validated the performance of FFTNet by comparing different methods, various datasets, and different prediction horizons.
Comments 3: "The section on “4.4. Long - term Time Series forecasting” may not be deep enough. The authors need to explain in more details."
Response 3: Thank you for your feedback. We have conducted a more in - depth discussion in Section 4.4, and further discussed the performance of FFTNet in Section 4.5 "Visualization".
Comments 4: "The authors should check the English mistakes very carefully, for example: 'in the LTSF domain.These included Transformer - based methods PathFormer (2024)[39]and PatchTST (2023)[22]; CNN - based methods TimesNet (2023)[7]and MICN (2023)[6]; and MLP - based' 'Informer[[19]]'"
Response 4: Thank you for pointing out this issue. We have carefully checked and updated the English grammar throughout the full text. You can find the corrected text in the revised manuscript.
Comments 5: "Authors must also pay attention to regular spacing between the word."
Response 5: We agree with this comment. We have gone through the entire manuscript and ensured that there is regular spacing between words. We have carefully checked all paragraphs and sentences to correct any inconsistent spacing issues. This review and correction were done throughout the manuscript, and you can verify the improved spacing in any part of the revised document.
Reviewer 3 Report
Comments and Suggestions for Authors
The article deals with time series forecasting. Its innovative approach involves using features from both the frequency and time domains. The article's advantage is its method of appraising real data from the literature and comparing the prediction results. Its disadvantage is the small number of method visualizations presented. The article's research is relevant to the Electronics Journal scope.
I have the following notes:
- In my opinion, the words historical data (see lines 21 and 25, 107, 354, and 356) are not correct. Historical means that the data belongs to the subject of history. In the case of time series, it would be better to use the words "collected data" or "previously measured data", etc.
- The description of formulas (1)-(3) contains variable D (see line 187), but in formulas (1)-(3) there is no variable D.
- In the description of formulas (1)-(3) it would be better to add descriptions of the variables C, N, and P.
- In the description of formula (5) it would be better to add a description of the variable beta.
- Formula (16) has to be changed because the left part of the formula should contain the index I.
- It would be better to add the article title to the reference description for all numbers in the list of references in lines 415-475.
After minor revision, the article can be published.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude to the reviewer for taking the time and effort to review our manuscript. Your detailed and constructive comments have been extremely valuable in helping us to improve the quality of our work. We have carefully considered each of your comments and have made the following responses and revisions.
Comment 1: In my opinion, the words historical data (see lines 21 and 25, 107, 354, and 356) are not correct. Historical means that the data belongs to the subject of history. In the case of time series, it would be better to use the words "collected data" or "previously measured data", etc.
Response 1: We wholeheartedly agree with your observation. We have thoroughly gone through the manuscript and replaced "historical data" with "collected data" at lines 21, 25, 107, 354, and 356.
Comment 2: The description of formulas (1)-(3) contains variable D (see line 187), but in formulas (1)-(3) there is no variable D.
Response 2: We sincerely apologize for this error. This is a typing mistake. We have corrected variable D to variable C.
Comment 3: In the description of formulas (1)-(3) it would be better to add description variables C, N, and P.
Response 3: Thank you for this excellent suggestion. We have added clear descriptions for variables C, N, and P in the description of formulas (1)-(3). “C denotes the feature dimension, N signifies the number of patches, and P indicates the patch size.”
Comment 4: In the description of formula (5) it would be better to add the description variable beta.
Response 4: We appreciate your feedback. The symbol in question is actually \pi rather than \beta: it was incorrectly wrapped in \mathit{}, which caused it to render as an italic π that could be mistaken for β. This problem has been resolved in the revised manuscript.
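Schematically, the correction looks as follows (illustrative LaTeX; the actual manuscript source may differ slightly):

```latex
% before: \pi wrapped in a math-alphabet command, rendering as an
% italic symbol that could be misread as beta
e^{-j 2 \mathit{\pi} f t}
% after: plain \pi
e^{-j 2 \pi f t}
```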
Comment 5: Formula (16) has to be changed because the left part of the formula should contain the index I.
Response 5: We have made the necessary adjustment to formula (16) as per your instruction. The left part of formula (16) now correctly includes the index I. The revised formula now appears as Formula (21) on page 12, paragraph 2.
Comment 6: It would be better to add the article title to the reference description for all numbers in the list of references in lines 415 - 475.
Response 6: We agree with this comment. We have carefully added the article titles to the reference descriptions for all entries in the reference list from lines 415-475. Each reference now includes the article title.
Reviewer 4 Report
Comments and Suggestions for Authors
The authors present the results using the FFTNet model. One of the advantages of the FFTNet model is its innovative hybrid architecture, which effectively combines frequency- and time-domain features and makes it excellent at long-term time series prediction. This integration allows the model to capture both periodic trends and sudden local changes, thereby improving prediction accuracy. However, a potential weakness is the complexity of the model, which may require significant computational resources and training time; the paper does not address the issues that limit its applicability in real-time scenarios or on devices with limited processing power.
Several scientific improvements can be considered to improve the performance and applicability of the FFTNet model.
Advanced techniques to more effectively integrate frequency- and time-domain features, such as attention mechanisms or adaptive weighting schemes that prioritize the most relevant features based on input data characteristics, should be investigated. In addition, noise-tolerant learning techniques that simulate various noise conditions, or data augmentation strategies, should be integrated to improve the robustness of the model to noisy data, and stable performance should be evaluated in real applications. The results section lacks objective performance evaluation evidence; this should be supplemented.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude to the reviewer for taking the time and effort to review our manuscript. Your detailed and constructive comments have been extremely valuable in helping us to improve the quality of our work. We have carefully considered each of your comments and have made the following responses and revisions.
Comment 1: The FFTNet model's hybrid architecture for integrating frequency- and time-domain features is innovative and beneficial for long-term time series prediction, but its complexity poses challenges in terms of computational resources and training time, limiting its real-time applicability.
Response 1: We appreciate this assessment. Our approach is in fact a fundamentally lightweight model; the relatively higher parameter count and MACs (multiply-accumulate operations) compared to DLinear arise because DLinear is a channel-independent single-linear-layer architecture. To demonstrate FFTNet's effectiveness, we have added Section 4.5: Visualization, which compares prediction results across different datasets, forecast horizons, and methods. We believe this addition significantly enhances the persuasiveness of FFTNet's performance.
Comment 2: Investigating advanced techniques like attention mechanisms or adaptive weighting schemes to better integrate frequency and time domain features is recommended.
Response 2: This is an excellent suggestion. We acknowledge the potential of attention mechanisms for enhancing feature integration and have incorporated this direction into our future work section. While we are unable to implement these changes immediately due to time constraints, we plan to systematically explore adaptive weighting schemes and attention-based approaches in upcoming studies. Thank you for your insightful recommendation; we hope this plan aligns with your expectations.
Comment 3: Integrating noise-tolerant learning techniques and data augmentation strategies to enhance the model's robustness to noisy data is proposed.
Response 3: We sincerely appreciate this valuable suggestion. We also recognize that this is a highly worthwhile area of research. However, based on our preliminary experiments, applying data augmentation operations to time series data often leads to a decrease in the network's accuracy.
We suspect this may be related to the unique nature of time series data. The sequential order and temporal dependencies between data points are critical for model training, and traditional augmentation techniques like random scaling or shifting may disrupt these inherent patterns. For example, scaling operations can distort the relative magnitudes between time steps, making it difficult for the model to capture accurate trends; shifting operations may break temporal continuity, hindering the model's ability to learn periodic patterns.
Additionally, time series data often contains complex non-stationary characteristics, and augmentation may inadvertently introduce noise patterns inconsistent with the original distribution. We hypothesize that this disrupts the model's ability to learn underlying patterns, thereby reducing prediction accuracy.
We will explore more suitable augmentation strategies tailored to time series data in future research.
Comment 4: There is a lack of objective performance evaluation evidence in the results section.
Response 4: Thank you for your constructive feedback on the lack of objective performance evaluation evidence in the results section. We have addressed this concern by adding theoretical justifications in Sections 3.1-3.3 and a new visualization section (Section 4.5) that compares prediction results across datasets, forecast horizons, and methods. These revisions not only solidify the theoretical framework but also enhance the objectivity and interpretability of our results through visual validation.
Thank you again for your insightful suggestions!
Reviewer 5 Report
Comments and Suggestions for Authors
This is an interesting and well-written paper on the advantages of considering both time and frequency features when predicting time series. The proposed method is thoroughly tested on several benchmark data sets and shows superior results over many state-of-the-art methods for time series prediction.
There are, however, a few things that should be clarified and enhanced in the paper. In addition, I have a number of minor comments on the text.
- Many things are unclear with Figure 1 on page 2. The more I look at it, the less I understand. First, you simultaneously change both the architecture and the input domain (MLP with frequency, versus CNN with time domain), so it is hard to know what causes the difference. You should have used either two MLPs or two CNNs (with different input domains) instead to show your point. Secondly, you observe that one representation is sparse and one is dense, and conclude from this that one is better for the long term and the other for the short term. I do not agree. Denseness in itself has nothing to do with long- or short-term storage. It only reflects that different NN architectures are differently prone to sparse representations. Thirdly, what exactly do the pictures show: the input features or some internal hidden-layer activation (embedding)? The latter, I presume, since you talk about "feature maps". But what then is on the axes? Is it just a random 2D arrangement of the 1D vector of internal features, or is there time (in the right plot) and frequency (in the left plot) on one axis? Please clarify. And finally, the reason I ask is that there is something very strange with the figures, which makes me suspect that maybe you used the wrong plots: indeed, the pictures are suspiciously similar apart from the magnitude of values, in that both exhibit a similar 2D structure with bands of similar texture. How can they possibly come from different architectures and different input domains? If you train two neural networks (even of the same architecture) with different random starting weights, the hidden units may become arbitrary permutations of each other, and you would not be able to see any similarity in patterns between them. Unless, of course, you somehow connect the hidden units with the inputs (e.g. using residual weights). But then, instead, one plot would show structural traces of the time-domain input and the other plot of the frequency input, and again you would not be able to see any similar patterns. So in conclusion: unless you can explain exactly why they look like this, I think the plots do not show what you claim they do and may even be the wrong plots.
- Regarding related works, my reflection is that you mainly focus on some recent works. However, the dichotomy between frequency and time domains has been discussed for decades, and when wavelets were popular in the 1980s the motivation was that they combined time and frequency in a natural way. The Related Works section lacks a discussion of the early attempts to combine these two domains; since this discussion has clearly been going on for a long time, I suspect there are several older proposed methods combining them that would be relevant to mention.
- Section 3.3, LCCB: Why are you using a 2D CNN rather than a 1D CNN (possibly with multiple channels)? One dimension is time, but what is the other? If there is no known translation invariance across the other dimension, it makes no sense to make it 2D.
- Section 3.6, normalization: You describe "instance normalization" as normalizing each time series sample individually, such that the feature values in that sample have zero mean and unit variance. However, this is not what equations 14, 15 and 16 describe. In eq 14, X_i is a sample of feature length M, and you sum the samples over i, producing an M-length vector mu. Eq 15 does the same for sigma (although you are squaring a vector, so I suppose you intend it to be applied element-wise on each feature). That is, these are the feature averages over all samples, not the averages over all features within a sample. This does not have the nice properties regarding data distribution discrepancies that you describe. Indeed, the test set may be normalized completely differently from the training set, causing problems.
- Table 5 on page 12: The results for DLinear appear to be the best in this table. You need to at least comment on this and put your result in perspective of it. Maybe that method fares less well in the other tables, but it still deserves mention. By the way, what is the difference between DLinear (mentioned in Table 5 and in Related Works) and NLinear (mentioned in relation to the experimental results in and around Table 2)? Is this just a typo, and should they represent the same method?
And then a couple of minor comments:
- Line 142, "The LCCB thus provides rich contextual information and enhances the model's ability to represent time series." This sentence does not contribute anything and can safely be removed. If you don't agree, here are my concerns with the sentence in more detail: "rich contextual information" - no, it's just local context, which is kind of the opposite of "rich", so this is misleading; "enhances ability to represent" - no, it is its representation, so "enhance" is wrong. It is easier to just scrap the sentence than to rewrite it, since it contributes nothing.
- On line 156, the expression for matrix X contains the letters L and C, but the explanation for the expression instead describes S and P, which occurs later on line 158. You need to explain L and C here, but S and P can wait until they are mentioned.
- Line 161, "This design increases data redundancy but allows the model to observe the same time points from different perspectives and at varying scales, thereby enhancing feature representation and improving prediction accuracy." Again I propose to remove this sentence, which does not say anything useful. Using a stride smaller than the patch size is standard procedure and needs no motivation. Again, my detailed concerns: "increases data redundancy" - no, not in the usual sense of redundant features, rather related to data augmentation; "from different perspectives and at varying scales" - no, not different scales, just translation, and "perspectives"... come on, a pretentious name for what is just translation; "improving prediction accuracy" - well, that is speculation and something you would need to show in the paper (by comparing different S) if that is what you claim (but this is not the point of your paper). Again, just scrap it.
- Line 187, you define T twice.
- Line 242, "Through meticulously designed feature extraction processes, these modules..." Avoid value words in scientific writing. It is not an advertisement. Just start the sentence at "These modules...".
- Line 246, "The Frequency-Time Aggregation Block (FTAB) integrates these two feature types for effective prediction." More value words: "effective prediction". The sentence on line 248 says exactly the same thing in a more objective manner, so you can again safely remove this sentence (possibly spelling out "Frequency-Time Aggregation Block (FTAB)" in the line 248 sentence).
- Line 310, "Furthermore, we also found that for datasets such as Weather, Electricity, and Traffic, marked by strong non-stationary characteristics, U-Mixer combines the Unet and Mixer architectures with a stationarity correction method." Remove "we also found that". I suppose it is a fact that "U-Mixer combines the Unet and Mixer architectures with a stationarity correction method" and not something you "found". Potentially the next sentence can be started with "We found that [this allows it to recover...]", if that is what you tried to say.
- Line 361, is it supposed to be a new subsection "Performance Analysis", in analogy with "Model Analysis" and "Robustness Analysis"? It has been accidentally merged with the previous subsection.
- References, all paper titles are missing.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude to the reviewer for taking the time and effort to review our manuscript. Your detailed and constructive comments have been extremely valuable in helping us to improve the quality of our work. We have carefully considered each of your comments and have made the following responses and revisions.
Comments 1:
Many things are unclear with Figure 1 on page 2. The more I look at it, the less I understand. First, you simultaneously change both the architecture and the input domain (MLP with frequency, versus CNN with time domain), so it is hard to know what causes the difference. You should have used either two MLPs or two CNNs (with different input domains) instead to show your point. Secondly, you observe that one representation is sparse and one is dense, and conclude from this that one is better for the long term and the other for the short term. I do not agree. Denseness in itself has nothing to do with long- or short-term storage. It only reflects that different NN architectures are differently prone to sparse representations. Thirdly, what exactly do the pictures show: the input features or some internal hidden-layer activation (embedding)? The latter, I presume, since you talk about "feature maps". But what then is on the axes? Is it just a random 2D arrangement of the 1D vector of internal features, or is there time (in the right plot) and frequency (in the left plot) on one axis? Please clarify. And finally, the reason I ask is that there is something very strange with the figures, which makes me suspect that maybe you used the wrong plots: indeed, the pictures are suspiciously similar apart from the magnitude of values, in that both exhibit a similar 2D structure with bands of similar texture. How can they possibly come from different architectures and different input domains? If you train two neural networks (even of the same architecture) with different random starting weights, the hidden units may become arbitrary permutations of each other, and you would not be able to see any similarity in patterns between them. Unless, of course, you somehow connect the hidden units with the inputs (e.g. using residual weights). But then, instead, one plot would show structural traces of the time-domain input and the other plot of the frequency input, and again you would not be able to see any similar patterns. So in conclusion: unless you can explain exactly why they look like this, I think the plots do not show what you claim they do and may even be the wrong plots.
Response 1:
In our network, we first perform a patching operation, which stacks the one-dimensional time series into two dimensions. At this point there are two branches: one is an MLP in the frequency domain and the other is a CNN in the time domain. Through the network design, we ensure that the output dimensions of the frequency-domain and time-domain branches are the same. After the data is processed and learned by each branch, the corresponding features are output. The output of the frequency-domain MLP is transformed back to the time domain through the inverse Fourier transform, and we visualize the feature matrix output at this step. For this feature, we sum over the multi-variable dimension and normalize, which is shown as the shade of the hue on the feature map. The same procedure is applied to the output of the time-domain CNN. The horizontal and vertical axes of the two figures represent positions within the same patch and across different patches, respectively. The similarity of the bands is also due to this patching operation. Although the two figures share this similarity, one exhibits sparsity and the other denseness. Based on this phenomenon, we conjecture that the difference is caused by the different mechanisms of the frequency-domain MLP and the time-domain CNN, which motivated this research. In the updated manuscript, we introduce Takens' theorem, Parseval's theorem, and the convolution theorem to explain the performance of FFTNet. Thank you again for your careful review comments.
Comments 2:
Regarding related works, my reflection is that you mainly focus on some recent works. However, the dichotomy between frequency and time domains has been discussed for decades, and when wavelets were popular in the 1980s the motivation was that they combined time and frequency in a natural way. The Related Works section lacks a discussion of the early attempts to combine these two domains; since this discussion has clearly been going on for a long time, I suspect there are several older proposed methods combining them that would be relevant to mention.
Response 2:
Thank you for your insightful feedback. We acknowledge the oversight in our initial coverage of historical time-frequency integration methods. In the revised manuscript, we have incorporated citations to foundational works such as Gabor's short-time Fourier transform (1947) and Ville's analytic signal theory (1948), which established the theoretical basis for combining temporal and spectral analysis. These additions can be found in Section 2.3 (page 4), where we now explicitly situate our hybrid architecture within the broader trajectory of time-frequency research. This revision strengthens the paper's contextualization and aligns with your suggestion to highlight long-standing discussions in this field.
Comments 3:
Section 3.3, LCCB: Why are you using a 2D CNN rather than a 1D CNN (possibly with multiple channels)? One dimension is time, but what is the other? If there is no known translation invariance across the other dimension, it makes no sense to make it 2D.
Response 3:
Thank you for raising this question. Our method first stacks the one-dimensional time series into a two-dimensional form through a patching operation, which is in line with Takens' theorem. Performing 2D CNN operations in this way helps capture the features within and between patches. A detailed introduction can be found in Sections 3.1 and 3.2 of the updated manuscript.
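Concretely, after patching, each channel becomes a 2D map whose rows index patches (coarse time) and whose columns index positions within a patch (fine time), so a 2D kernel mixes both scales simultaneously. A schematic sketch (our own illustration, not the LCCB code):

```python
import torch
import torch.nn as nn

# After patching, one channel of the series is a 2D map:
# rows = patches (coarse time), columns = positions within a patch (fine time).
patches = torch.randn(32, 1, 11, 16)   # (batch, channels, num_patches, patch_size)

# A 3x3 kernel therefore mixes neighboring fine-grained time steps (columns)
# and neighboring patches (rows) in a single operation.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv(patches)               # (32, 8, 11, 16)
```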
Comments 4:
Section 3.6, normalization: You describe "instance normalization" as normalizing each time series sample individually, such that the feature values in that sample have zero mean and unit variance. However, this is not what equations 14, 15 and 16 describe. In eq 14, X_i is a sample of feature length M, and you sum the samples over i, producing an M-length vector mu. Eq 15 does the same for sigma (although you are squaring a vector, so I suppose you intend it to be applied element-wise on each feature). That is, these are the feature averages over all samples, not the averages over all features within a sample. This does not have the nice properties regarding data distribution discrepancies that you describe. Indeed, the test set may be normalized completely differently from the training set, causing problems.
Response 4:
Thank you for pointing out this issue. Our method actually employs instance normalization. We've identified the problem with the formulas and made revisions. The updated formulas can be found as Formula (20) on page 12 of the manuscript.
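For reference, standard instance normalization computes the statistics per sample $i$ of length $M$ (the form our revised Formula (20) is intended to express; notation follows the reviewer's):

```latex
\mu_i = \frac{1}{M}\sum_{m=1}^{M} x_{i,m}, \qquad
\sigma_i^2 = \frac{1}{M}\sum_{m=1}^{M} (x_{i,m} - \mu_i)^2, \qquad
\hat{x}_{i,m} = \frac{x_{i,m} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
```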
Comments 5:
Table 5 on page 12: The results for DLinear appear to be the best in this table. You need to at least comment on this and put your result in perspective of it. Maybe that method fares less well in the other tables, but it still deserves mention. By the way, what is the difference between DLinear (mentioned in Table 5 and in Related Works) and NLinear (mentioned in relation to the experimental results in and around Table 2)? Is this just a typo, and should they represent the same method?
Response 5:
Thank you for raising this important point. We clarify that DLinear and NLinear are indeed variants of the same method, differing primarily in their input decomposition strategies. DLinear's high efficiency stems from its channel-wise independent linear layers and unusually simple architecture, which we now explicitly highlight in the manuscript. Specifically, we have added a discussion in Section 4.7, paragraph 3 (page 18).
Comments 6:
The minor comments listed above.
Response 6:
Thank you for your meticulous review and constructive feedback. We have carefully addressed all of the minor comments and technical issues in the revised manuscript. These changes improve clarity and adherence to academic writing standards; the specific revisions can be tracked via the manuscript's tracked-changes version.
Round 2
Reviewer 4 Report
Comments and Suggestions for Authors
The revision has been reviewed. Some minor English editing is still required.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Dear Reviewer,
Thank you for your feedback on improving the English in our manuscript. We wholeheartedly agree that clear language is crucial for effectively communicating our research.
To enhance the English in the manuscript, we utilized QuillBot, a well-regarded grammar-checking and text-enhancing tool. QuillBot carefully scanned the entire manuscript, identifying and rectifying a wide range of grammar, spelling, and syntactic errors. It also provided suggestions to improve sentence structure and word choice, aiming to make our writing more coherent and engaging.
We have meticulously gone through all the changes recommended by QuillBot, ensuring that each modification aligns with the intended meaning of our research. We are confident that these efforts will result in a substantial improvement in the language quality of the manuscript, making it more accessible and understandable for you and other readers.
We sincerely hope that you will notice the positive changes in the revised version. Thank you again for your invaluable comment.
Reviewer 5 Report
Comments and Suggestions for AuthorsThanks for responding to all my concerns. This version of the paper is much enhanced.
I have just a few minor comments:
Regarding references: There is no need to exaggerate: I asked for some references earlier than the last five years, and you added two from the mid-1900s... I had expected something from the 1990s or early 2000s, when there was much ML research. However, I see that you have replaced many arXiv references and corrected the format.
Line 147: "The method [7] ..." is not a good way to reference a paper. Use something like "The method proposed by <Authors> [7] ...".
Lines 575 - 579: You now mention DLinear below table 6, but without commenting on its superior results in the table. You need to add something like: "Although it has the best efficiency of the compared models, its prediction performance is not competitive with the transformer-based methods." It is a consistency problem, though, that you compare with DLinear in this table and NLinear in table 3 - it would be better if you could focus on one of them in both tables, to substantiate this claim. Otherwise you may have to formulate a more accurate statement than my example.
Lines 508- and 621- : A reflection: Completely random events in the future with no visible clues in the previous time series can of course not be predicted a long time in advance (other than on a very high level, in terms of the expected frequency of such events, but not in detail as a time series), no matter what methods you use (e.g. GANs or anomaly detection). Such methods may certainly detect that something unexpected has just started, but long-term prediction will never work perfectly in a non-deterministic universe. I do not see why you even need to pretend that it can be solved. Surely no method would fare better in this case.
Author Response
Dear Reviewer,
We appreciate the time and effort you spent reviewing our manuscript. Your thorough and useful comments have helped us improve our work. We have given your recommendations careful thought and respond with the changes below.
Comment 1: Regarding references: There is no need to exaggerate: I asked for some references earlier than the last five years, and you added two from the mid-1900s... I had expected something from the 1990s or early 2000s, when there was much ML research. However, I see that you have replaced many arXiv references and corrected the format.
Response 1: We have carefully searched and added two more papers from the early 2010s in Section 2.3, both of which are centered around the combination of the time domain and the frequency domain.
Comment 2: Line 147: "The method [7] ..." is not a good way to reference a paper. Use something like "The method proposed by <Authors> [7] ...".
Response 2: We have made the change as required. "The method [7] ..." has been revised to "The method proposed by Wu [7]".
Comment 3: Lines 575 - 579: You now mention DLinear below table 6, but without commenting on its superior results in the table. You need to add something like: "Although it has the best efficiency of the compared models, its prediction performance is not competitive with the transformer-based methods." It is a consistency problem, though, that you compare with DLinear in this table and NLinear in table 3 - it would be better if you could focus on one of them in both tables, to substantiate this claim. Otherwise you may have to formulate a more accurate statement than my example.
Response 3: To ensure consistency, we have re-conducted the performance analysis experiment on NLinear and replaced the DLinear data in the table with NLinear data.
Comment 4: Lines 508- and 621- : A reflection: Completely random events in the future with no visible clues in the previous time series can of course not be predicted a long time in advance (other than on a very high level, in terms of the expected frequency of such events, but not in detail as a time series), no matter what methods you use (e.g. GANs or anomaly detection). Such methods may certainly detect that something unexpected has just started, but long-term prediction will never work perfectly in a non-deterministic universe. I do not see why you even need to pretend that it can be solved. Surely no method would fare better in this case.