**4. Empirical Study**

## *4.1. Case Study Data*

The Jerónimo Martins Group is an international company, based in Portugal, with 225 years of accumulated experience in the retail sector. Food distribution is its main business and represents more than 95% of their consolidated sales. In Portugal, it leads the supermarket segmen<sup>t</sup> through a supply chain called Pingo Doce. This empirical study was performed using a real database of product sales from one of the largest stores of Pingo Doce. The data were aggregated on a weekly basis and span the period between 3 January 2012 and 27 April 2015, comprising a total of 173 weeks. Only the products that have at least one sale every week were considered, since these are the most challenging for inventory planning. The hierarchical structure of products adopted by the retailer, from the top level to the bottom level, is: Store > Area > Division > Family > Category > Sub-category > SKU. The total number of time-series considered is 1751 (aggregated and disaggregated) and their split in the six levels of the hierarchy is summarised in Table 1. The most aggregated level, referred to as the top level, comprises the total sales at the store level. Level 1 comprises these sales disaggregated by the six main areas: Grocery, specialized perishables, non-specialized perishables, beverages, detergents and cleaning, and personal care. These are further disaggregated, at level 2, into 21 divisions; at level 3, into 73 families; at level 4, into 203 categories; at level 5, into 459 subcategories; and, at the bottom level, into 988 SKUs (Stock Keeping Units).


**Table 1.** Number of series in each hierarchical level by area.

Figure 2 plots the sales at the top level and at level 1 of the hierarchy, aggregating these by the store and by each of the 6 main areas. The scale on the *y* axis was removed due to confidentiality reasons. The strong peak in sales in 2012, observed in all series, is relative to a promotional event carried out at a national level by Pingo Doce on 1 May (Labour day), after which the company shifted from an Every Day Low Price strategy to a continuous promotional cycle.

All the series show local upward and downward trends, although less prominent in the detergents/cleaning and personal care time-series. The store time-series shows a similar behaviour to the perishables time-series, as the later represent the major proportion of the total sales. These aggregate series do not show any seasonal variation.

 **Figure 2.** Total sales (top level, or store) and sales aggregated by area (level 1).

For a better understanding of the hierarchical structure of the data, we show, in Table 2, the complete hierarchy for the milk division (level 2). The total sales of the milk division are disaggregated, at level 3, into 2 families: Raw and UHT. The raw family is disaggregated into the Pasteurized category at level 4, which is further disaggregated into the Brik sub-category at level 5, which comprises 5 SKUs. The UHT family is disaggregated into the Current and Special categories. The Current category is disaggregated into the Semi-skimmed and Skimmed sub-categories, which comprise 2 and 3 SKUs, respectively. The Special category is disaggregated into the Semi-skimmed, Skimmed, and Flavored sub-categories, which comprise 10, 10, and 3 SKUs, respectively. The plots in Figure 3 show the sales of the SKUs within each subcategory of the milk division. These help us to visualise the diverse individual dynamics within each sub-category and the relative importance of each SKU. As we move down the hierarchy, the signal-to-noise ratio of the series decreases. Therefore, the series at the bottom level shows a lot more random variation, compared to the higher levels.

**Table 2.** Hierarchical structure of the milk division.

 **Figure 3.** Sales of the SKUs within each sub-category of the milk division.

## *4.2. Experimental Setup*

Generating accurate forecasts for each of the 1751 time-series within the hierarchical structure is crucial for the planning operations of the store. We can always forecast the series at each level of the hierarchy independently (we refer to these as base forecasts), based on forecasting models fitted individually for each series. However, by ignoring the aggregation constraints, it is very unlikely that the resulting forecasts will be coherent. To ensure aligned decision-making across the various levels of management, it is essential that these forecasts are reconciled across all levels of the hierarchy.

We consider two alternative forecasting model families for generating the base forecasts; namely, ETS and ARIMA, as discussed in Section 2. The appropriate ETS model for each time-series is chosen from the 18 potential models by minimising AICc, and the smoothing parameters and initial states are estimated by maximising the likelihood L [19], as implemented in the forecast package in the R software [36]. The ARIMA model is chosen following the algorithm proposed by Hyndman and Khandakar [37], also implemented in the forecast package. First, the number of seasonal and ordinary differences *D* and *d* required for stationarity are selected, and then the orders of *p*, *q*, *P*, and *Q* are identified, based on AICc. ETS and ARIMA models are the two most widely-used approaches to time-series forecasting. They are based on different perspectives to the problem and often, but not always, perform differently, although they share some mathematically equivalent models [21,22,38–40]. ARIMA can potentially capture higher-order time-series dynamics than ETS [34]. Therefore, we use both approaches to generate base forecasts, in order to evaluate how these can influence the performance of each reconciliation process. To make incoherent ETS and ARIMA forecasts coherent, we use the implementations of the hierarchical forecasting approaches, as discussed in Section 3.2, available in the hts package [41] for R.

We evaluate the forecasting accuracies of several competing methods using a rolling origin, as illustrated in Figure 4. By increasing the number of forecast errors available, we increase the confidence in our results.

**Figure 4.** Cross-validation procedure, based on a rolling forecast origin with 1- to 12-week ahead forecasts.

We start with the training set containing the first 139 weeks and generate 1- to 12-week ahead base forecasts for each of the 1751 series using ETS and ARIMA. These base forecasts are then reconciled, using the alternative hierarchical methods. The training set is then expanded by one week, and the process is repeated until week 161. This gives a total of 23 forecast origins for each of the 1751 series. For each forecast origin, new ETS and ARIMA models based on the updated training data are specified, from which we generate new base forecasts which are again reconciled using the corresponding errors for both calculated. The performance of the hierarchical forecasting methods was evaluated by using the Average Relative Mean Squared Error (AvgRelMSE) [42]. As we are comparing forecast accuracy across time-series with different units, it is important to use a scale-independent error measure. For each time-series *i*, we calculate the Relative Mean Squared Error (RelMSE) [43]

$$\text{RelMSE}\_{i,h} = \frac{\text{MSE}\_{i,h}}{\text{MSE}\_{i,h}^{\text{base}}}, \quad i = 1, \dots, 1751; \ h = 1, 2, 4, 8, 12, \dots \tag{42}$$

where MSE*i*,*<sup>h</sup>* is the mean squared error of the forecast of interest averaged across all forecast origins and forecast horizons *h*, and MSEbase *i*,*h* is the mean squared error of the base forecast averaged across all forecast origins and forecast horizons *h*, which is used as a benchmark. If the hierarchical forecasting method reconciles with ARIMA (ETS) base forecasts, then the ARIMA (ETS) base forecasts are taken as the benchmark. For each forecast horizon *h*, we averaged (42) across the time-series of the hierarchy using the following geometric mean

$$\text{AvgRelMSE}\_{L,h} = \left(\prod\_{i \in L} \text{RelMSE}\_{i,h}\right)^{\#}, \quad h = 1,2,4,8,12. \tag{43}$$

where *L* is the level (i.e., Top level, Level 1, ..., Level 5, Bottom level, All). The geometric mean should be used for averaging benchmark ratios, since it gives equal weight to reciprocal relative changes [44]. An advantage of AvgRelMSE is its interpretability. When it is smaller than 1, (1-AvgRelMSE)100% is the average percentage of improvement in MSE of the evaluated forecast over the benchmark.
