*4.3. Results*

Table 3 presents the AvgRelMSE results for the series at each hierarchical level, while Table 4 presents the AvgRelMSE results for the complete hierarchy. BU refers to the bottom-up method, TDGSa to top-down method "a" of Gross and Sohl [8], TDGSf to top-down method "f" of Gross and Sohl [8], TDfp to top-down with forecast proportions, OLS to Ordinary Least Squares, MinT-VarScale to the Minimum Trace Variance Scaling estimator, MinT-StructScale to the Minimum Trace Structural Scaling estimator, MinT-Shrink to the Minimum Trace Shrinkage estimator, and Base to the base forecasts. The left side of these tables shows the results using ARIMA base forecasts, while the right side shows the results using ETS base forecasts. As the base forecasts were used to scale the errors, the AvgRelMSE in the rows labelled Base is equal to 1 across all columns. We provide forecast results for 1 week, 2 weeks, 4 weeks (about one month), 8 weeks (about two months), and 12 weeks (about three months). The column labelled Rank gives the mean rank of each method across all forecast horizons: a method with a rank of 1 is best on all horizons, while one with a rank of 9 is always the worst. To support comparisons between the methods expected to perform better, Figure 5 visualises the AvgRelMSE results for the MinT-VarScale, MinT-StructScale, MinT-Shrink, and Base methods presented in Tables 3 and 4. The results for the complete hierarchy are highlighted with a light grey background.
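For reference, the AvgRelMSE reported in these tables is the geometric mean, across series, of each method's MSE divided by the MSE of the corresponding base forecast, which is why the Base rows are identically 1. The sketch below illustrates this computation; the function name and array interface are our own, not taken from the paper's code:

```python
import numpy as np

def avg_rel_mse(mse_method, mse_base):
    """Average Relative MSE: the geometric mean, across series, of the
    ratio of a method's MSE to the MSE of the base forecasts.
    Values below 1 indicate an improvement over the base forecasts;
    the base forecasts themselves score exactly 1."""
    rel = np.asarray(mse_method, dtype=float) / np.asarray(mse_base, dtype=float)
    return float(np.exp(np.mean(np.log(rel))))
```

The geometric mean is used because ratios are multiplicative: halving the MSE on one series and doubling it on another cancel out to exactly 1.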

It is immediately clear that the MinT-Shrink forecasts improved on the accuracy of the ARIMA base forecasts for all levels and for the complete hierarchy, across all forecast horizons. The only exception was the bottom level at the short-term horizons *h* = 1 and 1–2 (*h* = 2), albeit with marginal differences. The gains in forecast accuracy were more substantial at the higher levels of aggregation. This was not the case for all other reconciliation methods, attesting to the difficulty of producing reconciled forecasts that are (at least) as accurate as the base forecasts. Furthermore, the MinT-Shrink method using ARIMA base forecasts returned the most accurate coherent forecasts for all levels, the only exceptions being the Store level, for which MinT-VarScale returned the most accurate forecasts, and the Area level, where MinT-StructScale performed best. The improvements in the accuracy of the MinT-Shrink forecasts, across all forecast horizons, are more pronounced with the ARIMA base forecasts than with the ETS base forecasts (with the exception of horizon *h* = 1 at the bottom level), although the former were almost always more accurate than the latter (see Table 5). This could have been due to a limitation of the ets() function in the forecast package, which restricts the seasonal period to a maximum of 24; ARIMA, which is not subject to this restriction, can potentially capture seasonalities of higher order than ETS.

Clearly, the least accurate method was OLS, for both ETS and ARIMA forecasts and across all forecast horizons; it only improved forecast accuracy over the base forecasts at the top level. This was due to it ignoring both the differences in scale between the levels of the hierarchy and any relationships between the series. A major drawback of the TDGSa and TDGSf methods was that they only considered information from the top level. Interestingly, their forecasts only improved on the accuracy of the ARIMA base forecasts at the Area level, never improving over the ETS base forecasts (the forecasts at the top level are equal to the base forecasts). The TDfp proportions were based on forecasts from all disaggregated levels of the hierarchy, but the method still performed poorly, never improving forecast accuracy over the ARIMA base forecasts at any forecast horizon. This could be expected, since top-down approaches never give unbiased reconciled forecasts, even if the base forecasts are unbiased. BU provided poor forecasts for all aggregate levels of the hierarchy, showing average increases in the MSE relative to the base forecasts at all levels of aggregation and all forecast horizons (the forecasts at the bottom level are equal to the base forecasts). These losses in forecast accuracy were more substantial at the higher levels of aggregation.
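The mechanics of these benchmark methods can be sketched with the summing matrix *S* of the hierarchy. The toy two-series hierarchy below and the function names are ours, for illustration only; in particular, the Gross–Sohl "a" proportions are taken as the average of the historical bottom-level shares:

```python
import numpy as np

# Toy hierarchy: Total = A + B. Rows of S: Total, A, B; columns: bottom series A, B.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

def bottom_up(S, b_hat):
    """Bottom-up: aggregate the bottom-level base forecasts up the hierarchy."""
    return S @ b_hat

def top_down_gsa(S, top_hat, bottom_history):
    """Gross-Sohl method 'a': disaggregate the top-level base forecast using
    the average of the historical proportions of each bottom-level series.
    bottom_history: (n_periods, n_bottom) matrix of past bottom-level values."""
    shares = bottom_history / bottom_history.sum(axis=1, keepdims=True)
    p = shares.mean(axis=0)   # one average proportion per bottom series
    return S @ (p * top_hat)  # coherent forecasts for the whole hierarchy
```

Both methods produce coherent forecasts by construction, but each relies on a single level of the hierarchy: BU on the (noisy) bottom series, TDGSa on the top series alone.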

**Table 3.** Average Relative Mean Squared Error (AvgRelMSE) for each level of the hierarchy obtained with ARIMA and ETS base forecasts.


Like OLS, MinT-StructScale depends only on the structure of the aggregations and not on the actual data, resulting in poor forecasts, especially at the lower levels of aggregation; in our case, at the Category, Sub-category, and SKU levels, which comprised about 94% of the time series in the complete hierarchy (see Figure 5). On the other hand, by accommodating the differences in scale between the levels of the hierarchy, MinT-VarScale almost always performed well, generally improving forecast accuracy over the base forecasts. MinT-Shrink additionally accounted for the inter-relationships between the series in the hierarchy, performing better than MinT-VarScale across both ETS and ARIMA forecasts and all forecast horizons; the only exception was the Store level (which comprises only one time series).
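These estimators share the same trace-minimising reconciliation formula, ỹ = S(SᵀW⁻¹S)⁻¹SᵀW⁻¹ŷ, and differ only in the choice of W: the identity (OLS), a diagonal matrix of base-error variances (VarScale), a diagonal matrix of the row sums of S (StructScale), or a shrunk sample covariance of the base errors (Shrink). A minimal sketch with our own naming, using a dense solve rather than the efficient implementations used in practice:

```python
import numpy as np

def mint_reconcile(S, y_hat, W):
    """Trace-minimising reconciliation:
    y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat.
    W = I reproduces OLS; diagonal base-error variances give variance
    scaling; the row sums of S on the diagonal give structural scaling;
    a shrunk sample covariance of the base errors gives MinT-Shrink."""
    Winv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)  # reconciliation weights
    return S @ (G @ y_hat)

# Toy hierarchy Total = A + B with incoherent base forecasts.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([10.0, 4.0, 5.0])  # 10 != 4 + 5: base forecasts disagree
y_tilde = mint_reconcile(S, y_hat, np.eye(3))
```

Whatever W is chosen, the reconciled forecasts are coherent: the reconciled total equals the sum of the reconciled bottom-level forecasts.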


**Table 4.** AvgRelMSE for the complete hierarchy obtained with ARIMA and ETS base forecasts.

**Table 5.** AvgRelMSE results of ARIMA base forecasts with ETS base forecasts used as benchmark.


To improve on the accuracy of the base forecasts, the reconciliation methods have to take advantage of the combination of informative signals from all levels of aggregation. It is clear that MinT-Shrink was able to do this, attaining improvements in forecast accuracy over the base forecasts. For the complete hierarchy, the accuracy gains generally increased with the forecast horizon, varying between 1.7% and 3.7%. It is also evident that the gains in forecast accuracy were more substantial at the higher levels of aggregation, which means that information about the individual dynamics of the series that was lost through aggregation was brought back from the lower levels to the higher levels by the reconciliation process, substantially improving forecast accuracy over the base forecasts.

These results are in accordance with those obtained by Kourentzes and Athanasopoulos [45], who compared MinT-Shrink and MinT-VarScale forecasts with base forecasts in the context of generating coherent cross-temporal forecasts for Australian tourism. Both MinT-Shrink and MinT-VarScale improved forecast accuracy over the base ETS and ARIMA forecasts at the bottom level and for the complete hierarchy, and MinT-Shrink performed better than MinT-VarScale across both ETS and ARIMA forecasts.

In order to determine whether the forecast error differences between the competing methods were statistically significant, we conducted a Nemenyi test [46]. The results of this test are shown in Figure 6. The panels on the left show the results for the complete hierarchy using ARIMA base forecasts, for each forecast horizon, while the panels on the right show the respective results using ETS base forecasts. On the vertical axis, the methods are sorted by MSE mean rank; on the horizontal axis, they are ordered as in Tables 3 and 4. In each row, the black cell represents the method being tested; a blue cell indicates a method for which there is no evidence of a statistically significant difference at the 5% level, while a white cell indicates a method whose difference is statistically significant. We used the Nemenyi test implementation available in the tsutils [47] package for R.
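The computation behind such a plot can be sketched as follows: methods are ranked within each evaluation window, mean ranks are computed, and two methods are grouped together whenever their mean ranks differ by less than the critical distance q·√(k(k+1)/(6N)). The function below is our own illustration, not the tsutils code; it assumes no ties and takes the tabulated critical value `q_crit` as an argument:

```python
import numpy as np

def nemenyi_groups(errors, q_crit):
    """errors: (N, k) matrix of forecast errors for k methods over N windows.
    q_crit: tabulated critical value of the Nemenyi statistic for k methods
    at the chosen significance level (supplied by the caller).
    Returns the mean rank of each method and a boolean matrix that is True
    where two methods' mean ranks are NOT significantly different."""
    N, k = errors.shape
    ranks = np.argsort(np.argsort(errors, axis=1), axis=1) + 1.0  # 1 = best; assumes no ties
    mean_ranks = ranks.mean(axis=0)
    cd = q_crit * np.sqrt(k * (k + 1) / (6.0 * N))                # critical distance
    diff = np.abs(mean_ranks[:, None] - mean_ranks[None, :])
    return mean_ranks, diff <= cd
```

Because the test operates on ranks rather than raw errors, it makes no distributional assumptions about the forecast errors themselves.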


**Figure 5.** AvgRelMSE for the MinT-VarScale, MinT-StructScale, MinT-Shrink, and Base methods with ARIMA and ETS.

**Figure 6.** Nemenyi test results, at a 5% significance level, for the complete hierarchy.

Analysing the results for ARIMA presented in the panels on the left side, we observe that, for *h* = 1, BU and Base are grouped together as the top-performing methods. They are immediately followed by MinT-Shrink and MinT-VarScale, which are statistically indistinguishable. For forecast horizon 1–2 (*h* = 2), BU, Base, MinT-Shrink, and MinT-VarScale are grouped together as the top-performing methods. For forecast horizon 1–4 (*h* = 4), MinT-Shrink and MinT-VarScale belong to the top-performing group, while BU and Base perform significantly worse. For the long-term forecasts, MinT-Shrink performs significantly better than MinT-VarScale, BU, and Base. The TDfp and MinT-StructScale methods perform significantly worse than MinT-Shrink, MinT-VarScale, BU, and Base across all forecast horizons, and are statistically indistinguishable from each other, outperforming only TDGSa, TDGSf, and OLS.

Analysing the results for ETS presented in the panels on the right side, we observe that, for *h* = 1, BU and Base are again grouped together as the top-performing methods, followed by MinT-VarScale and MinT-Shrink. For forecast horizon 1–2 (*h* = 2), MinT-VarScale and Base are grouped together as the top-performing methods, immediately followed by MinT-Shrink and BU, which are statistically indistinguishable. For the remaining forecast horizons, MinT-VarScale performs best, always followed by MinT-Shrink. Overall, for both ETS and ARIMA, the MinT approach outperforms the other competing methods, with the exception of the short horizon *h* = 1.
