Review
Peer-Review Record

Machine Learning Based Restaurant Sales Forecasting

Mach. Learn. Knowl. Extr. 2022, 4(1), 105-130; https://doi.org/10.3390/make4010006
by Austin Schmidt, Md Wasi Ul Kabir and Md Tamjidul Hoque *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 19 December 2021 / Revised: 25 January 2022 / Accepted: 26 January 2022 / Published: 30 January 2022
(This article belongs to the Section Learning)

Round 1

Reviewer 1 Report

Being a review work, it is ok. 

There is an error in the name of the authors: there is an "and" without other names. Moreover, the affiliations seem written wrongly: please check.

Author Response

Response to First Reviewer

 

Reviewer’s assessment, comment and suggestions:

 

  1. Being a review work, it is ok. There is an error in the name of the authors: there is an "and" without other names. Moreover, the affiliations seem written wrongly: please check.

Response:

          We appreciate the reviewer’s comments and positive remarks. The names and affiliations have been updated to reflect the template received.

 

FROM:

TO:

TEMPLATE:
References:

 

  1. Stergiou, K. and T.E. Karakasidis, Application of deep learning and chaos theory for load forecasting in Greece. Neural Computing and Applications, 2021. 33: p. 16713–16731.
  2. Holmberg, M. and P. Halldén, Machine Learning for Restaurant Sales Forecast, in Department of Information Technology. 2018, Uppsala University. p. 80.
  3. Tanizaki, T., et al., Demand forecasting in restaurants using machine learning and statistical analysis. Procedia CIRP, 2019. 79: p. 679-683.
  4. G.P., S.R., et al., Machine Learning based Restaurant Revenue Prediction. Lecture Notes on Data Engineering and Communications Technologies, 2021. 53.
  5. Sakib, S.N., Restaurant Sales Prediction Using Machine Learning. 2021.
  6. X., L. and I. R., Food Sales Prediction with Meteorological Data - A Case Study of a Japanese Chain Supermarket. Data Mining and Big Data, 2017. 10387.

 

Reviewer 2 Report

This is an interesting article concerning an interesting domain of applications. However, there are several amendments that the authors should perform to the manuscript before publication.

  1. When discussing the various models, some equations for the models should be added.
  2. I believe that more error metrics should be considered, such as MAPE, which represents the percentage error.
  3. Since the authors study time series, it would be of interest to examine if there is a safe horizon for predictions (see for example the paper “Application of deep learning and chaos theory for load forecasting in Greece”. Neural Computing and Applications, 33(23), 16713-16731 (2021)), where methods for ameliorating longer-time predictions were also introduced. This would make for better long-term predictions.

 

 

Author Response

Response to Second Reviewer

 

Reviewer’s assessment, comment and suggestions:

 

  1. This is an interesting article concerning an interesting domain of applications. However, there are several amendments that the authors should perform to the manuscript before publication.

When discussing the various models, some equations for the models should be added.

Response:

          We are greatly thankful to the reviewer for their positive feedback. We subjected a large population of models to training and testing. The more well-known models are given in a table with descriptive resources as a survey – we leave those as they are, in favor of interested readers following the additional resources.

 

However, we have included relevant formulae to aid in the discussion of recurrent neural networks, as they play a larger part in the final results. Specifically, we show formulae for the gates in the RNN, LSTM, and GRU models. This should better show the relationships between the more complicated RNN models and aid in discussion. Forward and backward propagation, although still important, were left out of the conversation in favor of discussing the gates. As a final note to the reviewer, we refrained from adding the architecture definitions for the TFT model, as the calculations are long and perhaps too complicated for this review work. Instead, we named the original paper for readers to find themselves. If the reviewer disagrees, we are happy to add formulations for the architecture of the TFT model as well. Formulae for baselines were already included but were cleaned up and placed more prominently in the text. The section on RNNs has been updated.

 

FROM:

 

“We focus on methods of deep learning, which do not have a simple feed-forward prediction style. Instead, recurrent neural networks (RNNs) create short-range time-specific dependencies by feeding the output prediction to the input layer for the next prediction in the series. This design philosophy has been popular since the late ’90s and has been used for financial predictions, music synthesis, load forecasting, language processing, and more [11, 13, 14]. These recurrent layers can be supplemented with convolutional layers to capture non-linear dependencies [2]. To further improve performance, special logic gates are added to the activation functions of the recurrent layers, which have mechanisms to alter the data flow. Encoding/decoding transformer layers, which take variable-length inputs and convert them to a fixed-length representation [15], have also shown good results. These layers integrate information from static metadata to be used to include context for variable selection, processing of temporal features, and enrichment of those features with the static information [1]. The final main augmentation is a special self-attention block layer, which finds long-term dependencies not easily captured by the base RNN architecture [1].

      This work highlights four models using a recurrent architecture. The first is a simple RNN model without any gating enhancements. While RNN models have proven successful in many disciplines, the models struggle to store information long-term [12], and the model is known to be chaotic [16] due to vanishing or exploding gradients [17]. To aid in the gradient problem, the LSTM and GRU neural networks have been developed. Both architectures are specific cases [11] or otherwise modified from the base RNN model by the addition of gating units in the recurrent layers.

      The long short-term memory (LSTM) network was developed with the idea of improving the RNN’s vanishing/exploding gradient problems in 1997 [18] and is better at storing and accessing past information [12]. Some studies have been completed, suggesting that LSTMs significantly improve retail sales forecasting [5]. The LSTM memory cell is comprised of three gates, an in-gate, an out-gate, and an update-gate. These gates all have trainable parameters for the purpose of controlling specific memory access (in-gate), memory strength (out-gate), and whether a memory needs to be changed (update-gate).

      The gated recurrent unit (GRU) model is another type of recurrent neural network implementing memory cells to improve the gradient problem and was proposed in 2014 by Cho et al. as a convolutional recursive neural network with special gating units [15]. The GRU model is very similar to the LSTM model but removes the out-gate with the idea that fewer parameters to train will make the training task easier – although we lack fine control over the strength of the memory being fed into the model [19].

      In the year 2020, authors Lim et al. released the temporal fusion transformer (TFT) network, a novel, attention-based architecture that combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics [1]. The TFT architecture uses a direct forecasting method where any prediction has access to all available inputs [3] and achieves this uniquely by not assuming that all time-varying variables are appropriate to use [1]. Variable selection layers ensure relevant input variables are captured for each individual time step. Static variables, such as the date or a holiday, are integrated into the network through encoding layers to train for temporal dynamics properly. A static covariant encoder integrates information from static metadata to be used to include context for variable selection, processing of temporal features, and enrichment of those features with the static information [1]. Short-term dependencies are found with LSTM layers, and long-term dependencies are captured with multi-headed self-attention block layers. The practical difference is that the RNN, LSTM, and GRU models must make long-term forecasting decisions from a single instance, while the TFT model can use future instances while hiding unknown information – such as the forecasting target.”

 

TO:

 

“We focus our discussion on neural networks that do not only use a simple feed-forward prediction method. Instead, recurrent neural networks (RNNs) create short-range time-specific dependencies by feeding the output prediction to the input layer for the next prediction in the series. This design philosophy has been popular since the late ’90s and has been used for financial predictions, music synthesis, load forecasting, language processing, and more [12, 64, 65]. The most basic RNN form may be modeled with the idea that the current hidden state of the neural network depends on the previous hidden state, the current input, and any additional parameters of some function, which we show as (3). RNN models are trained with forward and backward propagation [66].

$$ h^{(t)} = f\big( h^{(t-1)}, x^{(t)}; \theta \big) \tag{3} $$

By building on the ideas of (3), we allow for many flavors of the RNN model. These recurrent layers can be supplemented with convolutional layers to capture non-linear dependencies [6]. To further improve performance, special logic gates are added to the activation functions of the recurrent layers, which have mechanisms to alter the data flow. Encoding/decoding transformer layers, which take variable-length inputs and convert them to a fixed-length representation [61], have also shown good results. These layers integrate information from static metadata to provide context for variable selection, processing of temporal features, and enrichment of those features with the static information [5]. Another augmentation is a special self-attention block layer, which finds long-term dependencies not easily captured by the base RNN architecture [5].

      This work highlights four models using a recurrent architecture. The first is a simple RNN model without any gating enhancements. While RNN models have proven successful in many disciplines, the models struggle to store information long-term [13], and the model is known to be chaotic [58] due to vanishing or exploding gradients [59]. To aid in the gradient problem, the LSTM and GRU neural networks have been developed. Both architectures are specific cases [12] or otherwise modified from the base RNN model by the addition of gating units in the recurrent layers. Some studies have been completed, suggesting that LSTMs significantly improve retail sales forecasting [9].

      The long short-term memory (LSTM) network was developed in 1997 with the idea of improving the RNN’s vanishing/exploding gradient problems [60] and is better at storing and accessing past information [13]. The mechanism used is a self-looping memory cell, which produces longer paths without exploding or vanishing gradients. The LSTM memory cell is comprised of three gates: an in-gate, an out-gate, and a forget-gate. These gates all have trainable parameters for the purpose of controlling specific memory access (in-gate), memory strength (out-gate), and whether a memory needs to be changed (forget-gate) [66]. The main state unit, $s_i^{(t)}$, is a linear self-loop weighted directly by the forget unit, $f_i^{(t)}$, shown as:

$$ f_i^{(t)} = \sigma\Big( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \Big) \tag{4} $$

Where $x^{(t)}$ is the current input, $b^f$ the bias, $W^f$ the recurrent weights, and $U^f$ the input weights. This value is scaled between 0 and 1 by the $\sigma$ unit. The internal state $s_i^{(t)}$ is then updated as:

$$ s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \, \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \Big) \tag{5} $$

The external gated unit, $g_i^{(t)}$, is calculated in the same manner as (4), although the unit uses its own set of parameters for biases, input weights, and forget-unit gates. Finally, we concern ourselves with the output of the LSTM hidden layer as:

$$ h_i^{(t)} = \tanh\big( s_i^{(t)} \big) \, q_i^{(t)} \tag{6} $$

$$ q_i^{(t)} = \sigma\Big( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \Big) \tag{7} $$

The output gate $q_i^{(t)}$ is influential and may even cut off all output signals from the memory unit; it has its own parameters for the biases, input weights, and recurrent weights, defined as in the other gates. Although LSTM models have shown success in forecasting, the next extension to RNNs questions whether all three gates are necessary for training a strong model.

      Next, the gated recurrent unit (GRU) model is another type of recurrent neural network implementing memory cells to improve the gradient problem; it was proposed in 2014 by Cho et al. as a convolutional recursive neural network with special gating units [61]. The GRU model is very similar to the LSTM model but combines the forget-gate with the update-gate [66], with the idea that fewer parameters to train will make the training task easier – although we lack fine control over the strength of the memory being fed into the model [62]. The states for a GRU cell can be updated as follows:

$$ h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \big( 1 - u_i^{(t-1)} \big) \, \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \Big) \tag{8} $$

Where $u$ and $r$ are the update and reset gates, whose values are calculated similarly to the LSTM gates as:

$$ u_i^{(t)} = \sigma\Big( b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)} \Big) \tag{9} $$

and

$$ r_i^{(t)} = \sigma\Big( b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)} \Big) \tag{10} $$

      In the year 2019, authors Lim et al. released the temporal fusion transformer (TFT) network, a novel attention-based learning architecture that combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics [5]. We give only a brief introductory description of the TFT model, as the architecture is laid out in sufficient detail in the original paper, Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.

The TFT architecture uses a direct forecasting method [7] in which any prediction has access to all available inputs, and achieves this uniquely by not assuming that all time-varying variables are appropriate to use [5]. The main components of the architecture are gating mechanisms, variable selection networks, static covariate encoders, a multi-head attention layer, and the temporal fusion decoder. The authors propose a gating mechanism called the Gated Residual Network (GRN), which may skip over unneeded or unused parts of the model architecture. The variable selection networks ensure relevant input variables are captured for each individual time step by transforming each input variable into a vector matching the dimensions of subsequent layers. Static, past, and future inputs each get their own network for instance-wise variable selection. Static variables, such as the date or a holiday, are integrated into the network through static enrichment layers to train for temporal dynamics properly. The static covariate encoder integrates information from static metadata to inform context for variable selection, processing of temporal features, and enrichment of those features with the static information [5]. Short-term dependencies are found with LSTM layers, and long-term dependencies are captured with multi-headed self-attention block layers. An additional layer of processing is completed on the self-attention output in the position-wise feed-forward layer, which is designed like the static enrichment layer. The practical difference gained from the advanced architecture is that the RNN, LSTM, and GRU models must make long-term forecasting decisions from a single instance, while the TFT model can use future instances while hiding unknown information – such as the forecasting target. The drawback of generating this context with such a large architecture, making use of many layers, is the abundance of hyperparameters to be tuned.”
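To make the quoted gate formulae (4)-(7) concrete, the sketch below implements one LSTM cell step in NumPy. This is an illustrative sketch only, not the manuscript's code: the layer sizes, parameter names, and random initialization are arbitrary assumptions, and the internal state update uses a sigmoid as in the quoted equation (5) rather than the tanh found in many library implementations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM cell step following gate equations (4)-(7)."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)  # forget gate, Eq. (4)
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)  # external input gate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)  # state, Eq. (5)
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)  # output gate, Eq. (7)
    h = np.tanh(s) * q                                     # hidden output, Eq. (6)
    return h, s

# Arbitrary illustrative sizes and randomly initialized parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {b: rng.normal(size=n_hid) for b in ("bf", "bg", "b", "bo")}
p.update({u: rng.normal(size=(n_hid, n_in)) for u in ("Uf", "Ug", "U", "Uo")})
p.update({w: rng.normal(size=(n_hid, n_hid)) for w in ("Wf", "Wg", "W", "Wo")})

h, s = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):  # unroll over a short random input sequence
    h, s = lstm_step(rng.normal(size=n_in), h, s, p)
print(h.shape)  # (4,)
```

Removing the out-gate and merging the forgetting logic into a single update gate, as in equations (8)-(10), yields the GRU variant discussed above.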

 

 

 

 

  2. I believe that more error metrics should be considered, such as MAPE, which represents the percentage error.

Response:

          We appreciate the reviewer’s meticulous review. The base MAPE formula does not work well with instances that have a true value of 0. We have one dataset, the actual, where MAPE gives inaccurate results due to zero-sales days. So, we include the sMAPE metric, which only gives incorrect results in the case of zero error, which can happen with the TFT model. As this affects only two instances on one model, the metric is sufficient. We use the same justification for adding the gMAE metric. We added a discussion of the new metrics to the metrics section, the results discussion, and the table holding the results. Since we needed new metrics, all of our test results were re-entered with these changes.

 

Changes in tables:

 

FROM:

 

Model                    Test MAE   Feature Test MAE   Feature Number   Dataset
use-last-week-enhanced   239        N/A                2                Any
TFT Less                 215        N/A                17+Window        Actual
Stack                    220        206                25               Actual

Continued in manuscript…

 

 

TO:

 

Model               Type   MAE   sMAPE   gMAE   Dataset
Stacking            NR     220   0.195   142    Actual
TFT Less Features   R      220   0.196   133    Actual
Bayesian Ridge      NR     221   0.195   144    Actual

Continued in manuscript…

 

 

Changes in Metrics section:

 

FROM:

 

“The metric officially used to evaluate our models’ accuracy is the mean absolute error (MAE), defined below in (1).

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{1} $$

Where n is the number of instances being predicted, and y is each instance. When listing the MAE, we always cut off the decimal and keep the whole number for ease of notation. We take the MAE because positive errors and negative errors have similarly negative effects. The MAE also provides a simple way to discuss the results because the target directly relates to the sale values. When we have an MAE of 50, we may directly say that our prediction was incorrect by $50.00. Next, it is worthwhile to consider the average meal cost per customer. A single order grosses approximately $12.00, which includes the standard sandwich, side, and drink. The price can gross $17.00+ if a customer orders any premium meal upgrade options, sandwich add-ons, or extra sides. There are options to dine cheaper, but the most affordable option of just a sandwich and a side still grosses around $9.00 for the restaurant. Also, there are cases where more than one customer is ordering at a time, such as in families. To measure prediction accuracy, we define a method to estimate the performance. The time period used for training our models is between 2:00 PM and 5:59 PM, a total of four hours of sales. A daily MAE of $200 could be read as an hourly error of $50, or between three and five customers poorly accounted for per hour. Extended, this also represents as few as 12 and as many as 20 unaccounted customers between 2:00 PM and 5:59 PM. Thus, the goal is to reach an MAE score that is as low as possible, and we aim to find which models will outperform our baselines.”

 

TO:

 

“The metrics officially used to evaluate our models’ accuracy are the mean absolute error (MAE), defined below in (11), the symmetric mean absolute percentage error (sMAPE) in (12), and the geometric mean absolute error (gMAE) in (13).

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{11} $$

$$ \mathrm{sMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left( \left| y_i \right| + \left| \hat{y}_i \right| \right) / 2} \tag{12} $$

$$ \mathrm{gMAE} = \left( \prod_{i=1}^{n} \left| y_i - \hat{y}_i \right| \right)^{1/n} \tag{13} $$

Where n is the number of instances being predicted, and y is each instance. When listing the MAE, we always cut off the decimal and keep the whole number for ease of notation. We take the MAE because positive errors and negative errors have similarly negative effects. The MAE also provides a simple way to discuss the results because the target directly relates to the sale values. When we have an MAE of 50, we may directly say that our prediction was incorrect by $50.00. sMAPE is a useful, balanced way to verify the mean percentage of error for our forecast result. We use symmetric MAPE because the basal MAPE formula places only the absolute value of y_true into the denominator, which gives the restriction that no instance may be 0 or the result will be undefined. Different implementations of the metric may ignore the label entirely or give the largest possible error. Either way, this creates an inconsistency in testing, and this is a problem because there are exactly three instances in our dataset with an actual value of 0. sMAPE is slightly improved because the formula will only be undefined in the case that y_true = y_pred = 0. Although rare, we will point out situations where this may occur, and the sMAPE value may be slightly smaller than actually recorded for those instances. The gMAE is another metric often used when researching time series and tends to be most appropriate for series with repeated serial correlation, which makes it suitable for our needs. gMAE has similar problems to sMAPE, and 0-valued errors are handled by adding a small amount of noise to the returned value whenever the calculated error is 0.”
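The three metrics and the zero-handling described in the quoted passage can be sketched as follows. This is a minimal illustration under stated assumptions: the sample sales values are invented, and the small noise constant substituted for zero errors in gMAE is an arbitrary choice, not the value used in the study.

```python
import numpy as np

def mae(y_true, y_pred):
    # Eq. (11): mean absolute error
    return np.mean(np.abs(y_true - y_pred))

def smape(y_true, y_pred):
    # Eq. (12): symmetric MAPE; only undefined when y_true = y_pred = 0,
    # so those instances are skipped rather than breaking the metric
    num = np.abs(y_pred - y_true)
    den = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    mask = den != 0
    return np.mean(num[mask] / den[mask])

def gmae(y_true, y_pred, eps=1e-6):
    # Eq. (13): geometric mean of absolute errors; zero errors are replaced
    # with a small noise value so the geometric mean stays defined
    err = np.abs(y_true - y_pred)
    err = np.where(err == 0, eps, err)
    return float(np.exp(np.mean(np.log(err))))

# A zero-sales day (y_true = 0) breaks plain MAPE but not sMAPE.
y_true = np.array([100.0, 0.0, 250.0, 300.0])
y_pred = np.array([120.0, 10.0, 250.0, 280.0])
print(mae(y_true, y_pred))  # 12.5
```

Note how the third instance has zero error, the case the passage flags for sMAPE and gMAE: here it passes through sMAPE harmlessly (the denominator is nonzero) and is replaced by the noise floor in gMAE.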

  3. Since the authors study time series it would be of interest to examine if there is a safe horizon for predictions (see for example the paper “Application of deep learning and chaos theory for load forecasting in Greece”. Neural Computing and Applications, 33(23), 16713-16731 (2021)), where also methods for ameliorating longer time predictions were introduced. This would make better long-time predictions.

Response:

          We appreciate the reviewer’s comments. The attached paper is very relevant to this work, and its authors have made points that complement some of our points nicely. Some of their material will also make for good sources of future work, which we outline in our paper. The paper is specifically considered in the new Literature Review section, with smaller mentions in the Introduction and Conclusion. Please see the additions made to the literature review as citation 11, where the resource is highlighted in yellow.

 

“The comparison of ML techniques to forecast curated restaurant sales is a common research question and can be seen in several recent works [14-17]. Two additional recent non-restaurant ML forecasting problems are also examined [11, 18]. Although similar in model training and feature engineering techniques, our methodology differs from other recent forecasting papers in a few key areas, which we outline. The first important difference in researched methods is the forecasting horizon window used. Many papers either used an unclear horizon window or made forecasts of only one time step at a time [14-18]. Only one paper increased the forecast horizon beyond one time step [11], so we consider forecasting one week of results a main contribution of this research. Another point of departure from the reviewed papers is the importance of stationary data. Traditionally, it is important to have a stationary dataset when working with time series so that no trend or seasonality is learned – instead, each instance is separated and can be forecasted based on its own merit instead of implicit similarity. However, only one paper [11] even mentions this stationary condition. Instead of exploring it further, the authors simply trained models using data that did not seem to have any associated trend. As an extension to these works, we consider the stationary condition and test multiple datasets to gauge its importance. A main departure in this study from the other literature is the engineering of weather features in the feature-set: all but one paper [11] include the modeling of weather as feature labels for forecasting, although we refrained from including this in our own research. Even though not present in the work, we may consider weather as a potential future enhancement to try and improve results further. The premiere difference, which we aim to highlight in this work, is the models selected for testing.
Some papers test only traditional models like linear regression, decision trees, or support vector machines [15-17], with no mention of recurrent neural networks (RNNs). One paper includes only RNN models, without any other family of statistical or ML models for comparison [11]. The final two papers include a mix of models; however, the collection of RNN and traditional models is small [14, 18]. Not one paper includes results for the new TFT model, first published in 2020, which is featured heavily in this work. Compared to these papers, we test a larger number of RNN and ML models for a more robust comparison.”

 

 

 

 

 

 

 

References:

 

  1. Stergiou, K. and T.E. Karakasidis, Application of deep learning and chaos theory for load forecasting in Greece. Neural Computing and Applications, 2021. 33: p. 16713–16731.
  2. Holmberg, M. and P. Halldén, Machine Learning for Restaurant Sales Forecast, in Department of Information Technology. 2018, Uppsala University. p. 80.
  3. Tanizaki, T., et al., Demand forecasting in restaurants using machine learning and statistical analysis. Procedia CIRP, 2019. 79: p. 679-683.
  4. G.P., S.R., et al., Machine Learning based Restaurant Revenue Prediction. Lecture Notes on Data Engineering and Communications Technologies, 2021. 53.
  5. Sakib, S.N., Restaurant Sales Prediction Using Machine Learning. 2021.
  6. X., L. and I. R., Food Sales Prediction with Meteorological Data - A Case Study of a Japanese Chain Supermarket. Data Mining and Big Data, 2017. 10387.

Reviewer 3 Report

The publication is undoubtedly interesting and contains new elements in the methodological approach.

Unfortunately, the approach to the investigated issue, the method of presentation, and the arrangement of information in the submitted publication are far from the expected form for a scientific paper.

The basic remark is that in the paper the authors do not refer to other papers directly related to the subject of the reviewed publication. The authors use the following expression in the text:

"To the best of our knowledge, we are the first to propose this new dataset to study forecasting techniques, and the dataset has been processed and defined such that there is some suitable groundwork for further studies."

Which, without further explanation of this thesis, gives the impression that the subject of the publication is avant-garde. Unfortunately, we can easily find relatively new publications with almost identical topics, e.g.

1. Takashi Tanizaki, Tomohiro Hoshino, Takeshi Shimmura, Takeshi Takenaka; Demand forecasting in restaurants using machine learning and statistical analysis

DOI: 10.1016/j.procir.2019.02.042

2. Mikael Holmberg, Pontus Halldén, Machine Learning for Restaurant Sales Forecast

https://uu.diva-portal.org/smash/get/diva2:1216397/FULLTEXT01.pdf

In addition, the reviewer identified the actual source of the results: it is one of the authors' Master's projects, unfortunately also not cited in the references.

https://scholarworks.uno.edu/td/2876/

Another remark concerns the inadequate Abstract. For example, it omits the necessary information about the main results of the research and includes elements of the methodological approach that are not necessary in this section. The authors do not use the form of a scientific report but use phrases referring to their beliefs ("we ... benchmarking this study to the best of our knowledge").

Another comment concerns in particular the mixed content of the chapters. For example, the Introduction contains elements of the methodology related to the collection and processing of data.

It is difficult to know what the authors' intention was in the case of the Background chapter - whether it is a literature review or a description of the tools used to solve the basic problem discussed in the paper. Moreover, the authors again weave the methodological thread, recalling the research approach they used.

Authors often use abbreviations without their explanation, some of them (RNN) can be identified, while others (POS) not necessarily. This creates some trouble while reading the paper.

Analyzing further, the authors define the aim of the paper only in chapter 3.1 Data acquisition (rows 192-193). This confirms a large inconsistency in the layout of the publication, which absolutely requires ordering.

More similar remarks can be found by reading the following chapters.

The proposed layout of the publication supplemented with an additional file definitely does not make it easier to read and analyze.

The reviewer, while analyzing the supplementary file, encountered, in his opinion, a critical error. Figures S5-S10 show graphs with p-values that reach values greater than 1. This definitely requires clarification: what are the actual values presented there? Moreover, the description of these figures is again not clear - what is a p-score? And how is the ranking determined?

With regard to the methodology, the reviewer asks for an explanation of why weather factors, which are considered important in other publications on the issue, were omitted from the analysis.

The last remark concerns placing the elements of the reviewed manuscript on websites before its publication (a document identical to the "supplementary file" in the reviewer's panel), but this issue is left to the Publishing House.

https://zenodo.org/record/5791480#.YdmuYGjMK00

Summing up, the peer-reviewed publication has significant potential.

In order for the publication to be published, the following conditions must be met.

The information layout should be organized in the classical form for a scientific paper. Typical parts should be included - introduction, literature review and research methodology, edited to uniform content.

The literature review should be supplemented, as publications on the subject matter have been omitted.

Authors must be clear about the aim of their work and what is new in their research compared to already published work.

The authors should refer to the remaining comments, especially the p-value.

Author Response

Response to Third Reviewer

Reviewer’s assessment, comment and suggestions:

 

  1. The publication is undoubtedly interesting and contains new elements in the methodological approach.

Unfortunately, the approach to the investigated issue, the method of presentation, and the arrangement of information in the submitted publication are far from the expected form for a scientific paper.

Response:

          We are greatly thankful to the reviewer for their positive feedback. The entire paper has been reworked with these and the upcoming comments in mind. A full organizational overhaul was completed to make the paper easier to read and understand. Every section has been streamlined and adjusted to clarify the presentation of information. This involved separating the paper into the clear sections required for scientific work. Many of the reviewer’s further comments were detailed enough to respond to with specific sections of changes, but it should be noted that the entire paper, from Abstract to Conclusions, has been updated, reordered, or rewritten to better reflect the methodology and conclusions of this work. Please see the responses to the further comments for the specific changes, for ease of reference.

 

 

  2. The basic remark is that in the paper the authors do not refer to other papers directly related to the subject of the reviewed publication. The authors use the following expression in the text:

"To the best of our knowledge, we are the first to propose this new dataset to study forecasting techniques, and the dataset has been processed and defined such that there is some suitable groundwork for further studies."

Which, without further explanation of this thesis, gives the impression that the subject of the publication is avant-garde. Unfortunately, we can easily find relatively new publications with almost identical topics, e.g.

1. Takashi Tanizaki, Tomohiro Hoshino, Takeshi Shimmura, Takeshi Takenaka; Demand forecasting in restaurants using machine learning and statistical analysis

DOI: 10.1016/j.procir.2019.02.042

2. Mikael Holmberg, Pontus Halldén; Machine Learning for Restaurant Sales Forecast

https://uu.diva-portal.org/smash/get/diva2:1216397/FULLTEXT01.pdf

Response:

 

We appreciate the reviewer’s comments. Both suggested papers have been included in the work, and we have looked for additional similar recent research which may add to the discussion. Any inappropriate impressions that this is an entirely new research concern have been eliminated. The literature has been completely revamped to compare and contrast our work with six other recent publications instead of implying uniqueness. Here is the updated Literature Review:

“The comparison of ML techniques to forecast curated restaurant sales is a common research question and can be seen in several recent works [14-17]. Two additional recent ML forecasting problems outside the restaurant domain are also examined [11, 18]. Although similar in model training and feature engineering techniques, our methodology differs from other recent forecasting papers in a few key areas, which we outline. The first important difference in the researched methods is the forecasting horizon window used. Many papers either used an unclear horizon window or made forecasts of only one time step at a time [14-18]. Only one paper increased the forecast horizon beyond one time step [11], so we consider forecasting one week of results the main contribution of this research. Another point of departure from the reviewed papers is the importance of stationary data. Traditionally, it is important to have a stationary dataset when working with time series so that no trend or seasonality is learned; instead, each instance is separated and can be forecasted on its own merit rather than by implicit similarity. However, only one paper [11] even mentions this stationary condition. Instead of exploring it further, the authors simply trained models using data that did not seem to have any associated trend. As an extension to these works, we consider the stationary condition and test multiple datasets to gauge its importance. The main departure in this study from the other literature is the engineering of weather as a feature in the feature-set. All but one paper [11] include the modeling of weather as feature labels for forecasting, although we refrained from including this in our own research. Even though not present in this work, we may consider weather as a potential future enhancement to try to improve results further. The premier difference, which we aim to highlight in this work, is the models selected for testing.
Some papers test only traditional models like linear regression, decision trees, or support vector machines [15-17] with no mention of recurrent neural networks (RNN). One paper includes only RNN models without any other family of statistical or ML models for comparison [11]. The final two papers include a mix of models; however, the collection of RNN and traditional models is small [14, 18]. No paper includes results for the new TFT model first published in 2020, which is featured heavily in this work. Compared to these papers, we test a larger number of RNN and ML models for a more robust comparison.”
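As a concrete illustration of the stationarity point discussed above, the sketch below (plain Python, with toy values rather than the paper's restaurant data) shows how same-weekday differencing removes both trend and 7-day seasonality, while daily differencing removes only the trend:

```python
# Illustrative sketch only: toy sales values, not the paper's dataset.

def difference(series, lag):
    """Return the differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# A toy series with a linear trend (+2/day) plus a weekend bump (+30).
sales = [100 + 2 * t + (30 if t % 7 in (5, 6) else 0) for t in range(28)]

daily_diff = difference(sales, 1)   # removes the linear trend, keeps seasonality
weekly_diff = difference(sales, 7)  # removes the weekly pattern as well

# Same-weekday differencing cancels the seasonal bump entirely,
# leaving only the trend accumulated over 7 days (7 * 2 = 14).
print(set(weekly_diff))  # → {14}
```

A forecast made on the differenced scale is inverted by adding back the value at the chosen lag, which is why models trained on differenced data can still produce sales-scale predictions.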

References:

 

  1. Stergiou, K. and T.E. Karakasidis, Application of deep learning and chaos theory for load forecasting in Greece. Neural Computing and Applications, 2021. 33: p. 16713–16731.
  2. Holmberg, M. and P. Halldén, Machine Learning for Restaurant Sales Forecast, in Department of Information Technology. 2018, UPPSALA University. p. 80.
  3. Tanizaki, T., et al., Demand forecasting in restaurants using machine learning and statistical analysis. Procedia CIRP, 2019. 79: p. 679-683.
  4. G.P., S.R., et al., Machine Learning based Restaurant Revenue Prediction. Lecture Notes on Data Engineering and Communications Technologies, 2021. 53.
  5. Sakib, S.N., Restaurant Sales Prediction Using Machine Learning. 2021.
  6. X., L. and I. R., Food Sales Prediction with Meteorological Data - A Case Study of a Japanese Chain Supermarket. Data Mining and Big Data, 2017. 10387.
  3. In addition, the reviewer identified the actual source of the results: it is one of the authors' Master's projects, unfortunately also not cited in the references.

https://scholarworks.uno.edu/td/2876/

 

   Response:

 

We appreciate the reviewer’s suggestion. The author’s Master's work has been added as a proper citation to the manuscript and is now cited at the beginning of the Materials and Methodologies section.

 

 

 

  4. Another remark concerns the inadequate Abstract. For example, it omits the necessary information about the main results of research and includes elements of the methodological approach that are not necessary in this section. The authors do not use the form of a scientific report but use phrases referring to their beliefs ("we ... benchmarking this study to the best of our knowledge").

Response:

 

          We appreciate the reviewer’s suggestion. The Abstract has been rewritten to a more scientific standard. Methodological approaches were removed from the abstract in favor of discussing the main results of the work. Here is the updated Abstract.

 

FROM:

 

“To encourage proper employee scheduling for managing crew load, restaurants need accurate sales forecasting. We collect real-world restaurant sales data to build a plethora of Machine Learning (ML) models to review their performances on such data for Sales forecasting. Two additional datasets are added to test methods of removing trend and seasonality by differencing and modeling training. Thus, we have collected three datasets and benchmarking this study to the best of our knowledge, and we are the first to collect restaurant sales data and review the performances of the plethora of ML models. To reduce forecasting error, we optimize the number of features per model through an exhaustive feature testing step. For the one-day forecasting, the best results are derived from the daily differenced dataset with simple linear models. Among these ML models, the RNN models perform comparably using the actual dataset, although they provide poor results on both others. On all accounts, the weekly differenced dataset performed poorly. When increasing to a one-week forecasting horizon, the temporal fusion transformer (TFT) model has unique advantages in one-week forecasting and achieves the best using the actual dataset showing results comparable to the best performing one-day forecast results. All other combinations of models and datasets performed worse than one-day forecasting when extending to full one-week forecasting.”

 

 

TO:

 

“To encourage proper employee scheduling for managing crew load, restaurants need accurate sales forecasting. This paper proposes a case study on many machine learning (ML) models using real-world sales data from a mid-sized restaurant. Trendy recurrent neural network (RNN) models are included for direct comparison to many methods. To test the effects of trend and seasonality, we generate three different datasets to train our models with and to compare our results. To aid in forecasting, we engineer many features and demonstrate good methods to select an optimal subset of highly correlated features. We compare the models based on their performance for forecasting time steps of one-day and one-week over a curated test dataset. The best results seen in one-day forecasting come from linear models with a sMAPE of only 19.6%. Two RNN models, LSTM and TFT, and ensemble models also performed well with errors of less than 20%. When forecasting one-week, non-RNN models performed poorly, giving results worse than 20% error. RNN models extended better with good sMAPE scores giving 19.5% in the best result. The RNN models performed worse overall on datasets with trend and seasonality removed, however many simpler ML models performed well when linearly separating each training instance.”
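Since the revised abstract reports sMAPE percentages, a short sketch of the metric may help. We assume the common symmetric variant that divides each error by the mean of |actual| and |forecast| (definitions vary across papers, so this is an illustrative assumption, not necessarily the exact formula used in the manuscript):

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    Assumes the common variant: mean over t of
    |F_t - A_t| / ((|A_t| + |F_t|) / 2).
    """
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        total += abs(f - a) / denom if denom else 0.0
    return 100.0 * total / len(actual)

# A forecast within ~5% of the actual sales yields a small sMAPE.
print(round(smape([100, 200, 300], [110, 190, 300]), 2))  # → 4.88
```

Unlike plain MAPE, this form is bounded (at most 200%) and does not explode when actual sales are near zero, which matters for slow sales days.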

 

 

  5. Another comment concerns in particular the mixed content of the chapters. For example, the Introduction contains elements of the methodology related to the collection and processing of data.

It is difficult to know what the authors' intention was in the case of the Background chapter - whether it is a literature review or a description of the tools used to solve the basic problem discussed in the paper. Moreover, the authors again weave the methodological thread, recalling the research approach they used.

Response:

         

          The reviewer has brought up a very important point which we thank them for. Firstly, we have completely separated the main sections. To avoid any “mixed content”, we updated the chapters in the following ways:

 

Introduction-> The ‘narrative’ of the research. First, we discuss the problem we are concerned with, some history of the problem, recent advances in recurrent methods, and the aim of the work. Any mention of tools has been relocated to Materials and Methodologies. Here is the updated Introduction section.

 

“Small and medium-sized restaurants often have trouble forecasting sales due to a lack of data or funds for data analysis. The motivation for forecasting sales is that every restaurant has time-sensitive tasks which need to be completed. For example, a local restaurant wants to make sales predictions on any given day to schedule employees. The idea is that a proper sales prediction will allow the restaurant to be more cost-effective with employee scheduling. Traditionally, this forecasting task is done intuitively by whoever is creating the schedule, and sales averages commonly aid in the prediction. Managers do not need to know the minute-to-minute sales amounts to schedule employees. So, we focus on finding partitions of times employees are working, such as dayshift, middle shift, and nightshift. No restaurant schedules employees one day at a time, so predictions need to be made one-week into the future to be useful in the real world. Empirical evidence by interviewing retail managers has pointed to the most important forecasted criteria to be guest counts and sales dollars and that these should be forecasted with high accuracy [1]. Restaurants tend to conduct these types of predictions in one of three ways: (1) through a manager’s good judgment, (2) through economic modeling, or (3) through time series analysis [2]. A similar restaurant literature review on several models/restaurants [3] shows how the data is prepared will highly influence the method used. Good results can be found using many statistical models, machine learning models, or deep learning models, but they all have some drawbacks [3], expected by the ‘No Free Lunch’ theorem. A qualitative study was conducted in 2008 on seven well-established restaurant chains in the same area as the restaurant in our case study. The chains had between 23 and 654 restaurants and did between $75 million and $2 billion in sales. 
Most used some sort of regression or statistical method as the forecasting technique, while none of them used ARIMA or neural networks [4]. ARIMA models have fallen out of favor for modeling complex time series problems, providing a good basis for this work to verify whether neural network research has improved enough to be relevant in the restaurant forecasting environment.

In the modern landscape, neural networks and other machine learning methods have been suggested as powerful alternatives to traditional statistical analysis [5-9]. There are hundreds [10] of new methods and models being surveyed and tested, many of which are deep learning neural networks, and progress is being seen in image classification, language processing, and reinforcement learning [5]. Even convolutional neural networks have been shown to provide better results than some of the ARIMA models [6]. Traditionally, critics have stated that many of these studies do not forecast far enough into the future, nor do they compare against enough old statistical models instead of trendy machine learning algorithms. Moreover, machine learning techniques can take a long time to train and tend to be ‘black boxes’ of information [10]. Although some skepticism has been seen towards neural network methods, recurrent networks are showing improvements over ARIMA and other notable statistical methods. Especially when considering the now popular recurrent LSTM model, we see improvements when comparing with ARIMA models [8, 9], although these works do not compare the results with a larger subset of machine learning methods. Researchers have recently begun improving the accuracy of deep learning forecasts over larger multi-horizon windows and are also beginning to incorporate hybrid deep learning-ARIMA models [7]. Safe lengths of forecast horizons and techniques for increasing the forecasting window for recurrent networks are of particular interest [11]. Likewise, methods for injecting static features as long-term context have resulted in new architectures which implement transformer layers for short-term dependencies and special self-attention layers to capture long-range dependencies [5].

While there has been much research using the old ARIMA method [12, 13] for restaurant analysis, in this paper, we aim to broaden modern research by studying a wide array of machine learning (ML) techniques on a curated, real-world dataset. We conducted an extensive survey of models to compare, together with a clear methodology for reproduction and extension. An ML model is ideally trained using an optimal number of features and will capture fine details in the prediction task, like holidays, without underperforming when the forecast window increases from one day to one week. We describe the methodology required to go from a raw dataset to a collection of well-performing forecasting models. The purpose is to find which combination of methodologies and models provides high scores and to verify whether state-of-the-art recurrent neural network architectures like LSTM, GRU, or TFT outcompete traditional forecasting ML models like decision trees or simple linear regression. The results add to the landscape of forecasting methodologies tested over the past 100 years, and suggestions for the best methods for the data are given in the conclusion.”

 

 

Literature Review-> A short dive into recently published papers on the same or similar subjects. We compare the methods of our work to the others to justify the novelty of our research.

 

Materials and Methodologies-> The only place where methodology is discussed, other than to contrast other papers in the Literature Review. Any mention or discussion of results from any sub-section of the methodology has been moved. For example, the Baselines section was split to describe the baselines in Methodologies and then discuss them in Results.

 

FROM:

 

            “For fair testing, we designate a testing dataset. One-day forecasting uses the final 221 days of instances, and one-week forecasting uses the final 224 days of testing. One-week forecasting uses a sliding window to predict seven days at a time, so we can test different starting points to show which models are stable when beginning predictions on different days of the week. The test set is not sorted or otherwise altered, and we create testing sets for the actual, daily differenced, and weekly differenced sales datasets. Respectively, the shapes can be seen as Supplementary Figures S1, S2, and S3, and an example one-week forecast test is seen in S4. The simplest method of predicting sales is by using the previous instance as the prediction. The instance window may be daily, weekly, or any real value, but there is no additional logic. Naïve solutions are easy to implement and are often discussed in introductory forecasting texts [48]. We analyze the MAE scores using a previous window of one day and one week using the actual sales. In Figure 8, we see the actual value in blue with our prediction line in orange. The MAE score for use-yesterday prediction is 403. Our data is correlated weekly instead of daily, making sense that we get poor results here. This does show the upper bounds prediction error, so it is a simple goal to achieve better results. In Figure 9, we see the result of the use-last-week prediction on the test dataset. The MAE score for use-last-week prediction is 278. As expected, we see a large increase over the previous baseline due to the weekly seasonality, and we consider this to be a well-reasoned prediction. There are issues regarding holidays as they propagate errors forward.

Fig removed.

Figure 8. Use-Yesterday Prediction. The most basic possible prediction model assumes that predicted day D(t) is exactly the previous day D(t-1). The MAE baseline generated is 403, and the prediction shape does not fit the test set well.

Fig removed.

Figure 9. Use-Last-Week Prediction. Using the weekly seasonality, the next prediction baseline expects day D(t) is exactly the previous weekday D(t-7). The MAE baseline generated is 278, and the prediction shape fails when experiencing extreme values.

            To improve our baseline, consider the average sales found for a particular sales day. The lag used is the previous weekday D(t-7), and the moving average window is the entire dataset preceding D(t) that belongs to the same weekday (i.e., only Mondays) seen in (4).

 

 

ŷ(t) = ( y(t-7) + (1/n) Σ_{i=1..n} y(t-7i) ) / 2        (4)

            Where ŷ is the current prediction, y is actual, and n represents the total number of weeks preceding t. As an example of (4), a prediction for Saturday would be a mean result between last Saturday’s sales and historical average Saturday sales. The dataset’s full average values are used, but using a rolling moving average window to contextualize current trends is another approach. In Figure 10, we see the result of the enhanced use-last-week prediction. The MAE score for the prediction is 239 showing a large improvement over simpler baselines and is even sensitive to change over time as short-term increases or decreases will be caught by the next week.

Fig removed.

Figure 10. Enhanced Use-Last-Week Actual Prediction. Using the weekly seasonality and the mean weekday average, the final prediction baseline implements a simple history. The MAE baseline generated is 239

 

 

TO:

 

            “A simple method of predicting sales is by using the previous instance as the prediction. The instance window may be daily, weekly, or any real value to use as ‘lag,’ but there is no additional logic. Naïve solutions are easy to implement and are often discussed in introductory forecasting texts [23], and may be easily adopted in real-world practice. So, we analyze the MAE and MAPE scores of Use-Yesterday (14) and Use-Last-Week (15) for use as a baseline result. Naïve solutions are not always reliable, so to improve our baseline, consider the average sales found for a particular sales day. The lag used is the previous weekday D(t-7), and the moving average window is the entire dataset preceding D(t) that belongs to the same weekday (i.e., only Mondays) seen in (16).

 

 

ŷ(t) = y(t-1)        (14)

ŷ(t) = y(t-7)        (15)

ŷ(t) = ( y(t-7) + (1/n) Σ_{i=1..n} y(t-7i) ) / 2        (16)

Where ŷ is the current prediction, y is actual, and n represents the total number of weeks preceding t. As an example of (16), a prediction for Saturday would be a mean result between last Saturday’s sales and historical average Saturday sales. The dataset’s full average values are used, but using a rolling moving average window to contextualize current trends is another common approach.”
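The baselines described in (14)-(16) can be sketched in a few lines of Python. The data below is a hypothetical repeating-week toy series, not the restaurant dataset, and the function names are illustrative:

```python
def use_yesterday(sales, t):
    # (14): predict day t with day t-1
    return sales[t - 1]

def use_last_week(sales, t):
    # (15): predict day t with the same weekday last week
    return sales[t - 7]

def enhanced_last_week(sales, t):
    # (16): mean of last week's same weekday and the historical
    # average over all preceding same weekdays.
    history = sales[t - 7::-7]  # same weekday, walking backwards in time
    weekday_avg = sum(history) / len(history)
    return (sales[t - 7] + weekday_avg) / 2

def mae(sales, predictor, start):
    """Mean absolute error of a baseline over sales[start:]."""
    errors = [abs(sales[t] - predictor(sales, t)) for t in range(start, len(sales))]
    return sum(errors) / len(errors)

# Toy data: four identical weeks, so the weekly baselines are exact
# while the daily baseline still pays for the weekday-to-weekday jumps.
week = [100, 120, 130, 140, 150, 200, 220]
sales = week * 4
print(mae(sales, use_last_week, 7))       # → 0.0
print(mae(sales, use_yesterday, 7) > 0)   # → True
```

On weekly-correlated data like this, (15) and (16) dominate (14), which matches the MAE ordering (403 vs. 278 vs. 239) reported for the real dataset.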

 

Results-> The only place where results are examined and discussed.

 

Conclusions-> We have added final thoughts, best models, and future work.

 

 

  6. Authors often use abbreviations without their explanation, some of them (RNN) can be identified, while others (POS) not necessarily. This creates some trouble while reading the paper.

Response:

            To improve readability, all abbreviations are now defined the first time they are used. To allow easier reading for those with less domain experience, we spell out common abbreviations at the start of any section where they are used extensively, mainly recurrent neural network (RNN) and temporal fusion transformer (TFT). This should be more readable without being too ‘wordy’. The main offender (the POS abbreviation) can be seen corrected as,

 

FROM:

 

“The raw data was collected in three chunks ranging from 9/8/2016-12/31/2016, 1/1/2017-12/31/2017, and 3/1/2018-12/5/2019. The POS system catalogs a wide array of information including the time, items sold, gross sales, location, customer name, payment information, and so on. While there is plenty of interesting information to mine from the raw data, we have narrowed down our interest to just the gross sales, date, and time. … “

 

TO:

 

“The raw data was collected in three chunks ranging from 9/8/2016-12/31/2016, 1/1/2017-12/31/2017, and 3/1/2018-12/5/2019. Restaurants gather customer information through a point of sale (POS) system, which catalogs a wide array of information, including the time, items sold, gross sales, location, customer name, and payment information, such that each sale made provides a unique data point. While there is plenty of interesting information to mine from the raw data, we have narrowed down our interest to just the gross sales, date, and time. …”

 

 

  7. Analyzing further, the Authors only in chapter 3.1 Data acquisition (rows 192-193) define the aim of the paper. This confirms a large inconsistency in the layout of the publication, which absolutely requires ordering.

More similar remarks can be found by reading the following chapters.

The proposed layout of the publication supplemented with an additional file definitely does not make it easier to read and analyze.

Response:

 

           We appreciate the reviewer’s suggestion. Three main points require attention here.

 

  a) The aim of the paper has been rewritten to better reflect the results and work we have done. To clarify the aim, we put the following passage at the end of the Introduction:

 

“While there has been much research using the old ARIMA method [12, 13] for restaurant analysis, in this paper, we aim to broaden modern research by studying a wide array of machine learning (ML) techniques on a curated, real-world dataset. We conducted an extensive survey of models to compare, together with a clear methodology for reproduction and extension. An ML model is ideally trained using an optimal number of features and will capture fine details in the prediction task, like holidays, without underperforming when the forecast window increases from one day to one week. We describe the methodology required to go from a raw dataset to a collection of well-performing forecasting models. The purpose is to find which combination of methodologies and models provides high scores and to verify whether state-of-the-art recurrent neural network architectures like LSTM, GRU, or TFT outcompete traditional forecasting ML models like decision trees or simple linear regression. The results add to the landscape of forecasting methodologies tested over the past 100 years, and suggestions for the best methods for the data are given in the conclusion.”

 

  b) Reordering of the paper has been completed as identified in the prior review responses.

 

  c) The supplemental file has been completely reworked to better aid in understanding the main research. The document length was cut by more than half as we incorporated more results into the same graph or table. The sprawling back-and-forth nature of the document has been fixed in two ways. First, any time a supplement is mentioned in the main work, the supplement comes after the previously mentioned figures. For example, figure S3 is mentioned before S4 in the main text, so S3 comes before S4 in the supplement. This was not necessarily the case before and should make the file easier to read, as the next figure in the supplement is always the next figure mentioned in the text. Second, the results were previously spread out among the supplement sections. Now, all one-day results are put together in one table and all one-week results in another, regardless of the dataset or model type. This way, all results are easy to see and parse, and we simply put interesting forecasts after the results are given.

 

 

  8. The reviewer, while analyzing the supplementary file, encountered, in his opinion, a critical error. Figures S5-S10 show graphs with p-value, the value of which reaches values greater than 1. This definitely requires clarification: what are the actual values presented there? Moreover, the very description of these figures is again not clear - what is a p-score? And how is the ranking determined?

With regard to the methodology, the reviewer asks for an explanation why it was omitted in the analysis as weather factors, which is considered important in other publications on the issue.

Response:

          We appreciate the reviewer’s meticulous review. We address the first point, then the second.

 

  a) This certainly was a large error that we had made by misreading the documentation for the implementation of the algorithm:

 https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression.

 

The output provides both the F-score and the p-value. The F-score, which may take values greater than 1, was originally mislabeled as the p-value, which is a probability bounded in [0, 1]. The proper names have been swapped in. We also scaled the results by the highest value to provide graphs bounded by (0, 1], as this more easily shows which features are important. We describe the F-score with additional formulae and descriptions in the paper.
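A small pure-Python sketch may clarify the corrected computation. It mirrors the F-score definition used by scikit-learn's f_regression, F = r²/(1−r²)·(n−2), plus the max-scaling described above; the feature names and data are made up for illustration:

```python
import math

def f_score(feature, target):
    """Univariate F-score from Pearson's r: F = r^2 / (1 - r^2) * (n - 2)."""
    n = len(feature)
    mx, my = sum(feature) / n, sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(feature, target))
    vx = sum((x - mx) ** 2 for x in feature)
    vy = sum((y - my) ** 2 for y in target)
    r = cov / math.sqrt(vx * vy)  # Pearson's correlation coefficient
    return r * r / (1.0 - r * r) * (n - 2)

def normalized_scores(features, target):
    """Scale every F-score by the maximum so the plot is bounded by (0, 1]."""
    scores = {name: f_score(col, target) for name, col in features.items()}
    top = max(scores.values())
    return {name: s / top for name, s in scores.items()}

# Made-up example: one strongly predictive feature, one shuffled one.
target = [2.1, 3.9, 6.2, 7.8, 10.1]
features = {"trend": [1, 2, 3, 4, 5], "noise": [5, 1, 4, 2, 3]}
print(normalized_scores(features, target)["trend"])  # → 1.0
```

Note that a perfectly correlated feature (r = ±1) makes the F-score diverge, so production implementations such as scikit-learn's handle that edge case; this sketch does not.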

 

FROM:

 

“… This test ranks the features from greatest importance to least important and then generates an MAE score for an increasing number of features. From this test, we find the optimal number of features per model and estimate the final MAE score. For one-day forecasting, it is easy to use univariant testing techniques to estimate feature importance. For one-week forecasting, we use the same importance found for the one-day forecast. Features are ranked using a linear model finding the P-value to tell whether a feature is statistically relevant or not. …”

 

TO:

            “First, we rank the features from greatest importance to least important by scoring each feature individually. The feature ranking is conducted using a quick univariate linear regression model to find Pearson’s correlation coefficient between each regressor and the target, as seen in (17). The correlation is then converted to an F-score, a test of statistical relevancy where the higher the score, the more relevant the feature is to the forecast, shown in (18), and is bounded by (0, infinity).

 

 

 

r = Σ_t (x_t − x̄)(y_t − ȳ) / sqrt( Σ_t (x_t − x̄)² · Σ_t (y_t − ȳ)² )        (17)

F = ( r² / (1 − r²) ) · (n − 2)        (18)

And we updated all tables and captions that mention the p-value, for example:

FROM:

Figure 11. P-values for Top Features (Actual). The top 25 features as ranked by their p-scores. Weekly sales average is the highest scoring feature by far with other statistical metrics and days of the week following. Numbers 0-13 mark how many days until removal from the dataset, so 13 is yesterday, 7 is one week ago, and 0 is two weeks ago.

 

TO:

Figure 11. F-score for Top Features (Actual). The top 25 features as ranked by their F-scores. Weekly sales average is the highest scoring feature by far with other statistical metrics and days of the week following. Numbers 0-13 mark how many days until removal from the dataset, so 13 is yesterday, 7 is one week ago, and 0 is two weeks ago.

  b) Weather has been shown in many works to be an important part of time series forecasting, especially for retail environments. Mainly, we were interested to see whether good results could be achieved using other methods. We mention weather in the following places in the paper, since it is such a large part of the literature:

 

  i) Literature Review: “ … The main departure in this study from the other literature is the engineering of weather as a feature in the feature-set. All but one paper [11] includes the modeling of weather as feature labels for forecasting, although we refrained from including this in our own research. Even though not present in the work, we may consider weather as a potential future enhancement to try and improve results further. … ”

 

  ii) Feature Definitions: “… It is of note that we do not use weather metrics as feature labels in favor of testing holidays and summary statistics on their own merit, although this is generally accepted as a very good method for increasing forecast accuracy [14-16] [18]. …”

 

               iii) Conclusions: “... Weather is also well-known to improve forecasting results, especially in retail and restaurant domains, which gives a clear path to try for future improvements in our feature engineering pipeline. Including weather, any non-static data could be used in training and then filtered out, which may allow for interesting and novel combinations of features and techniques. …”

 

  9. The last remark concerns placing elements of the reviewed manuscript on websites before its publication (a document identical to the "supplementary file" in the reviewer's panel), but this issue is left to the Publishing House.

https://zenodo.org/record/5791480#.YdmuYGjMK00

Response:

            We appreciate the reviewer’s comments. We found a new instruction from MDPI to upload the supplementary material to the website (https://zenodo.org) during submission, so we have done so.

 

 

 

 

  10. Summing up, the peer-reviewed publication has significant potential. In order for the publication to be published, the following conditions must be met:
  a) The information layout should be organized in the classical form for a scientific paper. Typical parts should be included - introduction, literature review and research methodology - edited to uniform content.
  b) The literature review should be supplemented, as publications on the subject matter have been omitted.
  c) Authors must be clear about the aim of their work and what is new in their research compared to already published work.
  d) The authors should refer to the remaining comments, especially the p-value.

 

Response:

            We are greatly thankful to the reviewer for their excellent review notes. We respond to the summary of changes below, acknowledging we have already described the changes in detail above. 

 

  a) Here is the updated table of contents in the improved form.

 

 

 

  b) The six references at the end of this document were used in the Literature Review section to compare and contrast similar, recent studies on restaurant, retail, or load-demand forecasting. We identify several topics of difference between our paper and the reviewed research.

 

 

  c) The general aim of the work is described at the end of the Introduction section as,

 

“While there has been much research using the old ARIMA method [12, 13] for restaurant analysis, in this paper, we aim to broaden modern research by studying a wide array of machine learning (ML) techniques on a curated, real-world dataset. We conducted an extensive survey of models to compare, together with a clear methodology for reproduction and extension. An ML model is ideally trained using an optimal number of features and will capture fine details in the prediction task, like holidays, without underperforming when the forecast window increases from one day to one week. We describe the methodology required to go from a raw dataset to a collection of well-performing forecasting models. The purpose is to find which combination of methodologies and models provides high scores and to verify whether state-of-the-art recurrent neural network architectures like LSTM, GRU, or TFT outcompete traditional forecasting ML models like decision trees or simple linear regression. The results add to the landscape of forecasting methodologies tested over the past 100 years, and suggestions for the best methods for the data are given in the conclusion.”

 

And we reinforce the aim by briefly discussing the differences in methodologies between our and similar research in the Literature Review section, which may be seen in the response for the second reviewer.

 

  d) The p-value issue has been resolved, as mentioned in the previous comment. The description should now adequately describe the F-score instead.

 

References:

 

  1. Stergiou, K. and T.E. Karakasidis, Application of deep learning and chaos theory for load forecasting in Greece. Neural Computing and Applications, 2021. 33: p. 16713–16731.

 

  2. Holmberg, M. and P. Halldén, Machine Learning for Restaurant Sales Forecast, in Department of Information Technology. 2018, UPPSALA University. p. 80.

 

  3. Tanizaki, T., et al., Demand forecasting in restaurants using machine learning and statistical analysis. Procedia CIRP, 2019. 79: p. 679-683.

 

  4. G.P., S.R., et al., Machine Learning based Restaurant Revenue Prediction. Lecture Notes on Data Engineering and Communications Technologies, 2021. 53.

 

  5. Sakib, S.N., Restaurant Sales Prediction Using Machine Learning. 2021.

 

  6. X., L. and I. R., Food Sales Prediction with Meteorological Data - A Case Study of a Japanese Chain Supermarket. Data Mining and Big Data, 2017. 10387.

 

Round 2

Reviewer 2 Report

The authors have addressed the points raised by the reviewers in a very extensive and concise way. Thus I suggest its acceptance.

Reviewer 3 Report

I accept all amendments and accept the clarifications.
