**1. Introduction**

Retailers need demand forecasts at different levels of aggregation to support decision-making at operational and short-term strategic levels [1]. Consider a retailer warehouse storing inventory that is used to replenish multiple retail stores: Store-level forecasts at different product levels are needed to manage inventory in the store or to allocate shelf space, but aggregate forecasts are also required for the inventory decisions of the retailer warehouse [2]. Understanding whether these forecasts should be generated independently at each level of the hierarchy, based on the aggregated demand, or obtained using a hierarchical forecasting method, which exploits the aggregation constraints of the hierarchy and ensures coherent decision-making at the different levels, is the gap we seek to address in this paper.

SKUs (Stock Keeping Units) are naturally grouped together in hierarchies, with the individual sales of each product at the bottom level of the hierarchy, sales for groups of related products (such as categories, families, or areas) at increasing aggregation levels, and the total sales at the top level [3]. Generating accurate forecasts for hierarchical time-series can be particularly difficult. Time-series at different levels of the hierarchical structure have different scales and can exhibit very different patterns. The time-series at the most disaggregated level can be very noisy and are often intermittent, making them more challenging to model and forecast. Aggregated series at higher levels are usually much smoother and, therefore, easier to forecast. Additionally, in order to ensure coherent decision-making at the different levels of the hierarchy, it is essential that the forecast of each aggregated series be equal to the sum of the forecasts of the corresponding disaggregated series. However, it is very unlikely that these aggregation constraints will be satisfied if the forecasts for each series in the hierarchical structure are generated independently. Finally, hierarchical forecasting methods should take advantage of the interrelations between the series at each level of the hierarchy.

The most traditional approaches to hierarchical forecasting are bottom-up and top-down methods. The bottom-up method involves forecasting each series at the bottom level, and then summing these to obtain forecasts at the higher levels of the hierarchy [4–7]. The main advantage of this approach is that, since forecasts are obtained at the bottom level, no information is lost due to aggregation. However, it ignores the interrelations between the series and usually performs poorly on highly aggregated data. The top-down method involves forecasting the most aggregated series at the top level, and then disaggregating it, using either historical [8] or forecasted proportions [9], to obtain bottom level forecasts. Top-down approaches based on historical proportions tend to produce less accurate forecasts at lower levels of the hierarchy. The middle-out approach combines both bottom-up and top-down methods. First, forecasts are obtained for each series at a previously chosen intermediate level of the hierarchy. The forecasts for the series above the intermediate level are produced using the bottom-up approach, while the forecasts for the series below the intermediate level are produced using the top-down approach. Empirical studies comparing the performance of bottom-up and top-down methods have produced mixed results, with no clear preference for either approach [4,6,10–12].

Recent work in the area tackles the problem using a two-stage approach: In the first step, forecasts for all series at all levels of the hierarchy, rather than at a single level, are produced independently (these are called base forecasts). Then, a regression model is used to combine these into coherent forecasts (these are called reconciled forecasts). Athanasopoulos et al. [9] and Hyndman et al. [13] used the Ordinary Least Squares (OLS) estimator and showed that their approach worked well, compared to most traditional methods. Hyndman et al. [14] suggested the Weighted Least Squares (WLS) estimator, proposing the variances of the base forecast errors as a proxy for the diagonal of the error covariance matrix, with null off-diagonal elements. They also introduced several algorithms to make the computations involved more efficient for a very large number of series. Extending the work of Hyndman et al. [14], Wickramasuriya et al. [15] proposed a closed-form solution, based on the Generalised Least Squares (GLS) estimator, that minimised the sum of the variances of the reconciled forecast errors, incorporating information from a full covariance matrix of the base forecast errors. The authors evaluated the performance of their method against the most commonly-used methods, and the results showed that it worked well with both artificial and real data.

Erven and Cugliari [16] proposed a Game-Theoretically OPtimal (GTOP) reconciliation method that selects the set of reconciled predictions such that their total weighted quadratic loss is never greater than the total weighted quadratic loss of the base predictions. The authors illustrated the benefits of their approach on both simulated data and real electricity consumption data. This approach requires fewer assumptions about the forecasts and forecast errors, but it does not have a closed-form solution and does not scale well to very large collections of time-series.

Mircetic et al. [17] proposed a top-down approach for hierarchical forecasting in a beverage supply chain, based on projecting the ratio of bottom and top level series into the future. Forecast projections were then used to disaggregate the base forecasts of the top level series. The disadvantage of all top-down approaches, including this one, is that they do not produce unbiased coherent forecasts [13].

The remainder of the paper is organized as follows. The next section presents a brief description of the two most widely-used approaches to time-series forecasting: State space models and ARIMA models. The procedure for using information criteria in model selection is also discussed. Section 3 describes the methods more commonly used to forecast hierarchical time-series. Section 4 presents the case study of a Portuguese retailer, explains the evaluation setup implemented and error measures used, and discusses the results obtained. Finally, Section 5 offers the concluding remarks.

## **2. Pure Forecasting Models**

We consider two alternative forecasting methods for generating the base forecasts used by hierarchical forecasting approaches, namely state space models and ARIMA models. These are briefly described in this section, with a special focus on the use of information criteria for model selection.

## *2.1. State Space Models*

Forecasts generated by exponential smoothing methods are weighted averages of past observations, where the weights decrease exponentially as the observations get older. The component form representation of these methods comprises the forecast equation and one smoothing equation for each of the components considered, which can be the level, the trend, and the seasonality. The possibilities for each of these components are: Trend = {N, A, Ad} and Seasonality = {N, A, M}, where N, A, Ad, and M mean, respectively, none, additive, additive damped, and multiplicative. By considering all combinations of the trend and seasonal components, nine exponential smoothing methods are possible. Each method is usually labelled by a pair of letters, (T, S), specifying the type of trend and seasonal components. Denoting the time-series by *yt*, *t* = 1, 2, ... , *n*, and the forecast of *yt*+*h*, based on all data up to time *t*, by *y*ˆ*t*+*h*|*t*, the component form of the additive Holt-Winters' method, (A, A), is

$$\hat{y}\_{t+h|t} = l\_t + hb\_t + s\_{t+h-m(k+1)}\tag{1}$$

$$l\_t = \alpha \left( y\_t - s\_{t-m} \right) + (1 - \alpha) \left( l\_{t-1} + b\_{t-1} \right) \tag{2}$$

$$b\_t = \beta^\* \left(l\_t - l\_{t-1}\right) + \left(1 - \beta^\*\right) b\_{t-1} \tag{3}$$

$$s\_t = \gamma \left( y\_t - l\_{t-1} - b\_{t-1} \right) + (1 - \gamma) \, s\_{t-m} \tag{4}$$

$$0 \le \alpha \le 1, \qquad 0 \le \beta^\* \le 1, \qquad 0 \le \gamma \le 1 - \alpha,$$

where *lt*, *bt*, and *st* denote, respectively, the estimates of the series level, trend (slope), and seasonality at time *t*; *m* denotes the period of seasonality; and *k* is the integer part of (*h* − 1)/*m*. The smoothing parameters *α*, *β*<sup>∗</sup>, and *γ* are constrained, to ensure that the smoothing equations can be interpreted as weighted averages. Fitted values are calculated by setting *h* = 1 with *t* = 0, 1, ... , *n* − 1. *H*-step ahead forecasts, for *h* = 1, 2, ..., can then be obtained using the last estimated values of the level, trend, and seasonality (*t* = *n*). Details about all the other methods may be found in Hyndman and Athanasopoulos [18]. To be able to produce forecast intervals and use model selection criteria, Hyndman et al. [19] (amongst others) developed a statistical framework where an innovation state space model can be written for each of the exponential smoothing methods. Each state space model comprises a measurement equation, which describes the observed data, and state equations, which describe how the unobserved components (level, trend, and seasonality) change with time. For each exponential smoothing method, two possible state space models are considered, one with additive errors and one with multiplicative errors, giving a total of 18 models. To distinguish state space models with additive and multiplicative errors, an extra letter E was added: The triplet (E, T, S) identifies the type of error, trend, and seasonality. The general state space model is

$$y\_t = w(\mathbf{x}\_{t-1}) + r(\mathbf{x}\_{t-1})\varepsilon\_t \tag{5a}$$

$$\mathbf{x}\_{t} = f(\mathbf{x}\_{t-1}) + \mathbf{g}(\mathbf{x}\_{t-1})\varepsilon\_t, \tag{5b}$$

where *yt* denotes the observation at time *t*, *xt* is the state vector, {*εt*} is a white noise process with variance *σ*<sup>2</sup>, referred to as the innovation (new and unpredictable), *w*(.) is the measurement function, *r*(.) is the error term function, *f*(.) is the transition function, and *g*(.) is the persistence function. Equation (5a) is the measurement equation and Equation (5b) gives the state equations. The measurement equation shows the relationship between the observations and the unobserved states. The state equations show the evolution of the states through time. The equations of the ETS(A, A, A) model (underlying the additive Holt-Winters' method with additive errors) are [18]

$$y\_t = l\_{t-1} + b\_{t-1} + s\_{t-m} + \varepsilon\_t \tag{6a}$$

$$l\_t = l\_{t-1} + b\_{t-1} + \alpha\varepsilon\_t \tag{6b}$$

$$b\_t = b\_{t-1} + \beta \varepsilon\_t \tag{6c}$$

$$s\_t = s\_{t-m} + \gamma \varepsilon\_t, \tag{6d}$$

and the equations of the ETS(M, A, A) model (underlying the additive Holt-Winters' method with multiplicative errors) are [19]

$$y\_t = \left(l\_{t-1} + b\_{t-1} + s\_{t-m}\right) \left(1 + \varepsilon\_t\right) \tag{7a}$$

$$l\_t = l\_{t-1} + b\_{t-1} + \alpha \left(l\_{t-1} + b\_{t-1} + s\_{t-m}\right) \varepsilon\_t \tag{7b}$$

$$b\_t = b\_{t-1} + \beta \left( l\_{t-1} + b\_{t-1} + s\_{t-m} \right) \varepsilon\_t \tag{7c}$$

$$s\_t = s\_{t-m} + \gamma \left(l\_{t-1} + b\_{t-1} + s\_{t-m}\right) \varepsilon\_t. \tag{7d}$$
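To make the recursions above concrete, the following Python sketch is a simplified illustration of the component-form Equations (1)–(4) of the additive Holt-Winters' method; the fixed smoothing parameters, the heuristic initial states, and the toy series are assumptions, not the maximum likelihood estimation described next.

```python
import numpy as np

def holt_winters_additive(y, m, alpha=0.2, beta_star=0.1, gamma=0.1, h=6):
    """Additive Holt-Winters' method, Equations (1)-(4), with fixed smoothing
    parameters and simple heuristic initial states (both assumptions here)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    level = y[:m].mean()                    # initial level: mean of the first season
    trend = 0.0                             # initial trend
    season = list(y[:m] - level)            # initial seasonal indices s_{1-m}, ..., s_0

    for t in range(n):
        s_tm = season[-m]                                                      # s_{t-m}
        new_level = alpha * (y[t] - s_tm) + (1 - alpha) * (level + trend)      # Eq. (2)
        new_trend = beta_star * (new_level - level) + (1 - beta_star) * trend  # Eq. (3)
        new_season = gamma * (y[t] - level - trend) + (1 - gamma) * s_tm       # Eq. (4)
        level, trend = new_level, new_trend
        season.append(new_season)

    # h-step ahead forecasts from the last estimated states, Equation (1).
    forecasts = []
    for step in range(1, h + 1):
        k = (step - 1) // m                 # integer part of (h - 1)/m
        forecasts.append(level + step * trend + season[n + step - m * k - 1])
    return np.array(forecasts)

# Illustrative quarterly series (m = 4).
y = [10, 14, 8, 25, 12, 16, 10, 28, 14, 18, 12, 31]
print(holt_winters_additive(y, m=4, h=8))
```

In practice, the smoothing parameters and initial states would be estimated as described in Section 2.1.1, rather than fixed in advance.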

## 2.1.1. Estimation of State Space Models

Maximum likelihood estimates of the parameters and initial states of the state space model (5) can be obtained by maximizing its likelihood. The probability density function for *y* = (*y*1, ... , *yn*) is given by [19]

$$p(y \mid \theta, \mathbf{x}\_{0}, \sigma^2) = \prod\_{t=1}^{n} p(y\_t \mid \mathbf{x}\_{t-1}) = \prod\_{t=1}^{n} p(\varepsilon\_t) / |r(\mathbf{x}\_{t-1})|, \tag{8}$$

where *θ* is the parameters vector, *x*0 is the initial states vector, and *σ*<sup>2</sup> is the innovation variance. By assuming that the distribution of {*εt*} is Gaussian, this likelihood has the form

$$\mathcal{L}(\boldsymbol{\theta}, \mathbf{x}\_{0}, \sigma^{2} \mid \mathbf{y}) = (2\pi\sigma^{2})^{-n/2} \left| \prod\_{t=1}^{n} r(\mathbf{x}\_{t-1}) \right|^{-1} \exp\left( -\frac{1}{2} \sum\_{t=1}^{n} \varepsilon\_{t}^{2} / \sigma^{2} \right),\tag{9}$$

and its logarithm is

$$\log \mathcal{L} = -\frac{n}{2} \log(2\pi\sigma^2) - \sum\_{t=1}^{n} \log|r(\mathbf{x}\_{t-1})| - \frac{1}{2} \sum\_{t=1}^{n} \varepsilon\_t^2 / \sigma^2. \tag{10}$$

The maximum likelihood estimate of *σ*<sup>2</sup> can be obtained by taking the partial derivative of (10) with respect to *σ*<sup>2</sup> and setting it to zero:

$$\hat{\sigma}^2 = n^{-1} \sum\_{t=1}^n \varepsilon\_t^2. \tag{11}$$

This estimate can be used to eliminate *σ*<sup>2</sup> from the likelihood (9), which becomes

$$\mathcal{L}(\boldsymbol{\theta}, \mathbf{x}\_0 \mid \boldsymbol{y}) = (2\,\pi \, e \, \hat{\sigma}^2)^{-n/2} \left| \prod\_{t=1}^n r(\mathbf{x}\_{t-1}) \right|^{-1}. \tag{12}$$

Hence, twice the negative logarithm of this likelihood is

$$-2\log \mathcal{L}(\theta, \mathbf{x}\_0 \mid \mathbf{y}) = c\_n + n \log \left(\sum\_{t=1}^n \varepsilon\_t^2\right) + 2\sum\_{t=1}^n \log |r(\mathbf{x}\_{t-1})|, \tag{13}$$

where *cn* = *n* log(2 *π e*) − *n* log(*n*). Thus, maximum likelihood estimates for the parameters *θ* and the initial states *x*0 can be obtained by minimizing

$$\mathcal{L}^\*(\theta, \mathbf{x}\_0) = n \log \left( \sum\_{t=1}^n \epsilon\_t^2 \right) + 2 \sum\_{t=1}^n \log |r(\mathbf{x}\_{t-1})|. \tag{14}$$

The innovations can be computed recursively, using the relationships

$$\varepsilon\_{t} = \left[y\_{t} - w(\mathbf{x}\_{t-1})\right] / r(\mathbf{x}\_{t-1}) \tag{15}$$

$$\mathbf{x}\_{t} = f(\mathbf{x}\_{t-1}) + \mathbf{g}(\mathbf{x}\_{t-1})\varepsilon\_t. \tag{16}$$
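The recursions (15) and (16) and the criterion (14) map directly into code. The sketch below is a minimal illustration for the ETS(A,A,N) model (additive errors, additive trend, no seasonality), where *w*(*x*) = *l* + *b*, *r*(*x*) = 1, *f*(*x*) = (*l* + *b*, *b*), and *g*(*x*) = (*α*, *β*); the toy series and parameter values are assumptions, and a complete implementation would minimise this criterion numerically over *θ* and *x*0.

```python
import numpy as np

def ets_aan_criterion(y, alpha, beta, l0, b0):
    """Innovations via Equations (15)-(16) and the criterion L* of Equation (14),
    for the ETS(A,A,N) model: w(x) = l + b, r(x) = 1, f(x) = (l + b, b), g(x) = (alpha, beta)."""
    level, trend = l0, b0                  # initial state x_0
    eps = []
    sum_log_r = 0.0
    for obs in y:
        w_x = level + trend                # w(x_{t-1})
        r_x = 1.0                          # r(x_{t-1}) = 1 for additive errors
        e = (obs - w_x) / r_x              # Equation (15)
        level, trend = (level + trend) + alpha * e, trend + beta * e   # Equation (16)
        eps.append(e)
        sum_log_r += np.log(abs(r_x))
    eps = np.asarray(eps)
    n = len(eps)
    return n * np.log(np.sum(eps ** 2)) + 2.0 * sum_log_r   # Equation (14)

# Illustrative use: in practice this function would be minimised numerically
# (e.g., with scipy.optimize.minimize) over the parameters and initial states.
y = [10.0, 12.1, 13.4, 15.2, 17.0, 18.7, 21.1, 22.6]
print(ets_aan_criterion(y, alpha=0.3, beta=0.1, l0=10.0, b0=1.5))
```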

## 2.1.2. Information Criteria for Model Selection

Forecast accuracy measures can be used to select a model for a given time-series, as long as the errors are computed from a test set and not from the training set used to estimate the model. However, the test-set errors available are usually too few to draw reliable conclusions. One possible solution is to use an information criterion (IC), based on the likelihood L(*θ*, *x*0 | *y*), that includes a penalty term to compensate for potential overfitting. The Akaike Information Criterion (AIC) for state space models is defined as [18]

$$\text{AIC} = -2\log \mathcal{L}(\theta, \mathbf{x}\_0 \mid \mathbf{y}) + 2k,\tag{17}$$

where L(*θ*, *x*0 | *y*) is the likelihood and *k* is the number of parameters and initial states of the estimated model. Akaike based his model selection criterion on the Kullback-Leibler (K-L) discrimination information, also known as negative entropy, defined by

$$I(f, \mathbf{g}) = \int f(\mathbf{x}) \log \left( \frac{f(\mathbf{x})}{g(\mathbf{x}|\boldsymbol{\theta})} \right) d\mathbf{x} \,\tag{18}$$

which measures the information lost when the model *g* is used to approximate the true model *f*. He found that he could estimate the expectation of the K-L information by the maximized log-likelihood corrected for bias, and that this bias can be approximated by the number of estimated parameters in the approximating model. Thus, the model selection procedure is to choose, amongst the candidates, the model having the minimum value of the AIC. The Bayesian Information Criterion (BIC) is defined as [20]

$$\text{BIC} = \text{AIC} + k[\log(n) - 2]. \tag{19}$$

The BIC is order-consistent, but is not asymptotically efficient like the AIC. The AIC corrected for small-sample bias, denoted by AICc, is defined as [19]

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}. \tag{20}$$

Appropriate models can be selected by minimizing the AIC, the BIC, or the AICc.
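As an illustration of selection by information criteria, the following sketch computes the AIC, AICc, and BIC of Equations (17), (20), and (19) from a maximised log-likelihood and selects the candidate with the smallest AICc; the candidate models, their log-likelihoods, and parameter counts are hypothetical.

```python
import numpy as np

def information_criteria(log_lik, k, n):
    """AIC (Eq. (17)), AICc (Eq. (20)) and BIC (Eq. (19)) from the maximised
    log-likelihood, with k parameters plus initial states and n observations."""
    aic = -2.0 * log_lik + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)
    bic = aic + k * (np.log(n) - 2.0)
    return aic, aicc, bic

# Hypothetical candidates: (label, maximised log-likelihood, k), all fitted to n = 48 observations.
n = 48
candidates = [("ETS(A,N,N)", -152.3, 2), ("ETS(A,A,N)", -148.9, 4), ("ETS(A,A,A)", -140.1, 8)]
aicc_scores = {label: information_criteria(ll, k, n)[1] for label, ll, k in candidates}
print(aicc_scores, "-> selected:", min(aicc_scores, key=aicc_scores.get))
```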

## *2.2. ARIMA Models*

ARIMA models are generally accepted as one of the most versatile classes of models for forecasting time-series [21,22]. Many different types of stochastic seasonal and non-seasonal time-series can be represented by them. These include pure autoregressive (AR), pure moving average (MA), and mixed AR and MA processes, all of which require stationary data. Although many time-series are non-stationary, they can be transformed to stationary time-series by taking proper degrees of differencing (regular and/or seasonal). The multiplicative seasonal ARIMA model, denoted as ARIMA(*p*, *d*, *q*) × (*P*, *D*, *Q*)*m*, has the following form [23]:

$$\phi\_p(B)\, \Phi\_P(B^m)\, (1 - B)^d (1 - B^m)^D y\_t = c + \theta\_q(B)\, \Theta\_Q(B^m)\, \varepsilon\_t, \tag{21}$$

where

$$\begin{aligned} \phi\_p(B) &= 1 - \phi\_1 B - \dots - \phi\_p B^p, & \Phi\_P(B^m) &= 1 - \Phi\_1 B^m - \dots - \Phi\_P B^{Pm}, \\ \theta\_q(B) &= 1 + \theta\_1 B + \dots + \theta\_q B^q, & \Theta\_Q(B^m) &= 1 + \Theta\_1 B^m + \dots + \Theta\_Q B^{Qm}, \end{aligned}$$

*m* is the period of seasonality, *D* is the degree of seasonal differencing, *d* is the degree of ordinary differencing, *B* is the backward shift operator, *φp*(*B*) and *θq*(*B*) are the regular autoregressive and moving average polynomials of orders *p* and *q*, respectively, Φ*P*(*Bm*) and Θ*Q*(*Bm*) are the seasonal autoregressive and moving average polynomials of orders *P* and *Q*, respectively, *c* = *μ*(1 − *φ*1 − ··· − *φp*)(1 − Φ1 − ··· − Φ*P*), where *μ* is the mean of (1 − *B*)*d*(1 − *Bm*)*D* *yt*, and *εt* is a zero-mean Gaussian white noise process with variance *σ*<sup>2</sup>. To ensure causality and invertibility, the roots of the polynomials *φp*(*B*), Φ*P*(*Bm*), *θq*(*B*), and Θ*Q*(*Bm*) should lie outside the unit circle. One of the main tasks in ARIMA forecasting is selecting the values of *p*, *q*, *P*, *Q*, *d*, and *D*. Usually, the following steps are used [23]: Plot the series, identify outliers, and choose a proper variance-stabilizing transformation. For that purpose, a Box-Cox transformation may be applied [24]:

$$y\_t' = \begin{cases} \ln(y\_t), & \lambda = 0\\ (y\_t^{\lambda} - 1) / \lambda, & \lambda \neq 0 \end{cases} \tag{22}$$

where the parameter *λ* is a real number, often between −1 and 2. Then, the sample ACF (Auto-Correlation Function) and sample PACF (Partial Auto-Correlation Function) can be computed to decide appropriate degrees of differencing (*d* and *D*). Alternatively, unit-root tests may be applied. The Canova–Hansen test [25] can be used to choose *D*. After *D* is selected, *d* can be chosen by applying successive KPSS (Kwiatkowski, Phillips, Schmidt & Shin) tests [26]. Finally, the sample ACF and sample PACF are matched with the theoretical patterns of known models, to identify the orders of *p*, *q*, *P*, and *Q*.
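The Box-Cox transformation of Equation (22) and its inverse, which is needed to back-transform forecasts to the original scale, can be implemented in a few lines; the series and the value of *λ* below are illustrative.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transformation, Equation (22)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inv_box_cox(z, lam):
    """Inverse transformation, used to map forecasts back to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

# Illustrative use with an assumed lambda of 0.5.
y = np.array([4.0, 9.0, 16.0, 25.0])
print(np.allclose(inv_box_cox(box_cox(y, 0.5), 0.5), y))   # True
```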

## Information Criteria for Model Selection

As for state space models, the values of *p*, *q*, *P*, and *Q* may be selected by an information criterion, such as the Akaike Information Criterion [18]:

$$\text{AIC} = -2\log \mathcal{L}(\theta, \sigma^2 | \, y) + 2(p + q + P + Q + k + 1), \tag{23}$$

where *k* = 1 if *c* ≠ 0 and 0 otherwise, and log L(*θ*, *σ*<sup>2</sup> | *y*) is the log-likelihood of the model fitted to the properly transformed and differenced data, given by [27]

$$\log \mathcal{L}(\theta, \sigma^2 \mid y) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \sum\_{t=1}^{n} \frac{\varepsilon\_t^2}{2\sigma^2}, \tag{24}$$

where *θ* is the parameter vector of the model and *σ*<sup>2</sup> is the innovation variance (the last term in parentheses in (23) is the total number of parameters that have been estimated, including the innovation variance). Note that the AIC is defined by considering the same principles of maximum likelihood and negative entropy discussed in Section 2.1. The AIC corrected for small sample sizes, AICc, is defined as

$$\text{AICc} = \text{AIC} + \frac{2(p+q+P+Q+k+1)(p+q+P+Q+k+2)}{n-p-q-P-Q-k-2}.\tag{25}$$

The Bayesian Information Criterion is defined as

$$\text{BIC} = \text{AIC} + [\log(n) - 2](p + q + P + Q + k + 1). \tag{26}$$

As for the state space models, appropriate ARIMA models may be obtained by minimizing either the AIC, AICc, or BIC.
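The only difference with respect to the state space case is how the estimated parameters are counted; the sketch below applies Equations (23), (25), and (26) to a hypothetical candidate specification (the orders, log-likelihood, and sample size are assumptions for illustration).

```python
import numpy as np

def arima_information_criteria(log_lik, p, q, P, Q, with_constant, n):
    """AIC, AICc and BIC for an ARIMA(p,d,q)x(P,D,Q)_m model, Equations (23), (25) and (26).
    n is the number of observations of the transformed and differenced series."""
    k = 1 if with_constant else 0
    n_par = p + q + P + Q + k + 1                       # + 1 for the innovation variance
    aic = -2.0 * log_lik + 2.0 * n_par
    aicc = aic + 2.0 * n_par * (n_par + 1) / (n - n_par - 1)
    bic = aic + (np.log(n) - 2.0) * n_par
    return aic, aicc, bic

# Hypothetical example: an ARIMA(1,1,1)x(0,1,1)_12 model with a constant, n = 120.
print(arima_information_criteria(-310.4, p=1, q=1, P=0, Q=1, with_constant=True, n=120))
```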

## **3. Hierarchical Forecasting**

## *3.1. Hierarchical Time-Series*

For the purpose of illustration, consider the example of the hierarchical structure shown in Figure 1. At the top of the hierarchy (level 0) is the most aggregated time-series, denoted by *Total*. The observation at time *t* of the *Total* series is denoted by *yTotal*,*t*. The *Total* series is disaggregated into series *A* and series *B*, at level 1. The *t*-th observation of series *A* is denoted as *yA*,*t* and the *t*-th observation of series *B* is denoted as *yB*,*t*. The series *A* and *B* are disaggregated, respectively, into two and three series that are at the bottom level (level 2). For example, *yAA*,*t* denotes the *t*-th observation of series *AA*. In this case, the total number of series is *n* = 8 and the number of series at the bottom level is *m* = 5. For any time *t*, the observations at the bottom level will sum to the observations of the series above. Hence, in this case, we have

$$y\_{\text{Total},t} = y\_{\text{AA},t} + y\_{\text{AB},t} + y\_{\text{BA},t} + y\_{\text{BB},t} + y\_{\text{BC},t}, \quad y\_{\text{A},t} = y\_{\text{AA},t} + y\_{\text{AB},t}, \quad y\_{\text{B},t} = y\_{\text{BA},t} + y\_{\text{BB},t} + y\_{\text{BC},t}.\tag{27}$$

**Figure 1.** Example of a two-level hierarchical structure.

These aggregation constraints can be easily represented using matrix notation

$$y\_t = \mathbf{S}\mathbf{b}\_t, \tag{28}$$

where *yt* = (*yTotal*,*t*, *yA*,*t*, *yB*,*t*, *yAA*,*t*, *yAB*,*t*, *yBA*,*t*, *yBB*,*t*, *yBC*,*t*) is an *n*-dimensional vector, *bt* = (*yAA*,*t*, *yAB*,*t*, *yBA*,*t*, *yBB*,*t*, *yBC*,*t*) is an *m*-dimensional vector, and *S* is the summing matrix of order *n* × *m*, given by

$$\mathbf{S} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ & & \mathbf{I}\_5 & & \end{bmatrix}. \tag{29}$$

Note that the first three rows of *S* correspond, respectively, to the three aggregation constraints in (27). The identity matrix *I*5 below guarantees that each bottom level observation on the left-hand side of the equation is equal to itself on the right-hand side. These concepts can be applied to an arbitrary set of *n* time-series that are subject to an aggregation structure, with *m* series at the bottom level [18]. The goal is to produce coherent forecasts for each series in the hierarchy; that is, forecasts that add up according to the aggregation constraints of the hierarchical structure.
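For the hierarchy of Figure 1, the summing matrix and the coherence identity (28) can be written down directly; the sketch below builds *S* and verifies the aggregation constraints (27) for illustrative bottom level observations.

```python
import numpy as np

# Summing matrix S for the hierarchy in Figure 1 (Equation (29)); rows are ordered as
# (Total, A, B, AA, AB, BA, BB, BC) and columns as (AA, AB, BA, BB, BC).
S = np.vstack([
    np.array([[1, 1, 1, 1, 1],      # Total = AA + AB + BA + BB + BC
              [1, 1, 0, 0, 0],      # A     = AA + AB
              [0, 0, 1, 1, 1]]),    # B     = BA + BB + BC
    np.eye(5, dtype=int),           # I_5: each bottom level series equals itself
])

b_t = np.array([3, 5, 2, 4, 1])     # illustrative bottom level observations (AA, AB, BA, BB, BC)
y_t = S @ b_t                       # Equation (28): the full coherent vector of observations
print(y_t)                          # [15  8  7  3  5  2  4  1]
```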

## *3.2. Hierarchical Forecasting Methods*

Let *<sup>y</sup>*<sup>ˆ</sup>*t*+*h*|*t* be an *n*-dimensional vector containing the forecasts of the values of all series in the hierarchy at time *t* + *h* (with *h* = 1, 2, ...), obtained using observations up to and including time *t*, and stacked in the same order as *yt*. These are usually called base forecasts. They are calculated independently for each time-series, not taking into account any relationship that might exist between them due to the aggregation constraints. Any forecasting method, such as ETS or ARIMA, can be used to generate these forecasts. The issue is that it is very unlikely that these will be coherent forecasts, hence some reconciliation method should be further applied. All existing reconciliation methods can be expressed as

$$\tilde{y}\_{t+h|t} = \mathbf{S}\mathbf{P}\hat{y}\_{t+h|t}, \tag{30}$$

where *y*˜*t*+*h*|*t* is an *n*-dimensional vector of reconciled forecasts, which are now coherent, and *P* is a matrix of dimension *m* × *n*, which maps the base forecasts *y*ˆ*t*+*h*|*t* into reconciled bottom level forecasts, which are then aggregated by the summing matrix *S*. If the bottom-up (BU) approach is used, then *P* = [**0***m*×(*n*−*m*) | *Im*], where **0***m*×(*n*−*m*) is the null matrix of order *m* × (*n* − *m*) and *Im* is the identity matrix of order *m* [4–6,9,10,28,29]. For the hierarchy shown in Figure 1, *P* is given by

$$P = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} . \tag{31}$$

This approach is computationally very efficient, since it only requires summing the bottom level base forecasts. It also has the advantage of forecasting the series at the most disaggregated level and, although these series are more difficult to model, no information about their dynamics is lost due to aggregation. However, it usually provides very poor forecasts for the upper levels of the hierarchy [13]. If a top-down (TD) approach is used, then *P* = [*p* | **0***m*×(*n*−1)], where *p* = [*p*1, ... , *pm*] is an *m*-dimensional vector containing the disaggregation proportions, which indicate how the top level base forecast at time *t* + *h* is to be distributed to obtain forecasts for the bottom level series, which are then summed by *S* [8,17,30–33]. For the hierarchy shown in Figure 1, *P* is given by

$$P = \begin{bmatrix} p\_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ p\_2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ p\_3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ p\_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ p\_5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} . \tag{32}$$
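Both mappings are special cases of Equation (30); the sketch below builds the bottom-up and top-down *P* matrices of Equations (31) and (32) for the Figure 1 hierarchy and applies the reconciliation *y*˜ = *SPy*ˆ (the base forecasts and the disaggregation proportions are illustrative assumptions).

```python
import numpy as np

n, m = 8, 5                                              # series in total / at the bottom level
S = np.vstack([[1, 1, 1, 1, 1],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 1],
               np.eye(m)])                               # summing matrix of Equation (29)

# Bottom-up: P = [0_{m x (n-m)} | I_m], Equation (31).
P_bu = np.hstack([np.zeros((m, n - m)), np.eye(m)])

# Top-down: P = [p | 0_{m x (n-1)}], Equation (32), with illustrative proportions.
p = np.array([0.25, 0.30, 0.15, 0.20, 0.10])
P_td = np.hstack([p.reshape(-1, 1), np.zeros((m, n - 1))])

# Illustrative (incoherent) base forecasts, stacked as (Total, A, B, AA, AB, BA, BB, BC).
y_hat = np.array([100.0, 55.0, 43.0, 26.0, 31.0, 14.0, 19.0, 9.0])
print(S @ P_bu @ y_hat)   # coherent forecasts built from the bottom level
print(S @ P_td @ y_hat)   # coherent forecasts obtained by disaggregating the top level
```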

The most common top-down methods performed quite well in Gross and Sohl [8]. In method "a" of Gross and Sohl [8] (referred to in the results that follow as TDGSa), each proportion *pi* is the average of the historical proportions of bottom level series *yi*,*j*, relative to top level series *yT*,*j*, over the time period *j* = 1, . . . , *t*:

$$p\_i = \frac{1}{t} \sum\_{j=1}^{t} \frac{y\_{i,j}}{y\_{T,j}}, \qquad i = 1, \ldots, m. \tag{33}$$

In method "f" (referred to in the results that follow as TDGSf), each proportion *pi* is the average value of the historical data of bottom level series *yi*,*j*, relative to the average value of the historical data of top level series *yT*,*j*, over the time period *j* = 1, . . . , *t*:

$$p\_i = \sum\_{j=1}^{t} \frac{y\_{i,j}}{t} \Big/ \sum\_{j=1}^{t} \frac{y\_{T,j}}{t}, \qquad i = 1, \ldots, m. \tag{34}$$
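Equations (33) and (34) require only the historical bottom level and top level series; a minimal sketch, with illustrative data, is:

```python
import numpy as np

def tdgsa_proportions(bottom, total):
    """Average of the historical proportions, Equation (33); bottom is (m, t), total is (t,)."""
    return (bottom / total).mean(axis=1)

def tdgsf_proportions(bottom, total):
    """Ratio of the average historical values, Equation (34)."""
    return bottom.mean(axis=1) / total.mean()

# Illustrative history: 5 bottom level series observed over 4 periods.
bottom = np.array([[3., 4., 5., 4.],
                   [5., 6., 5., 6.],
                   [2., 2., 3., 3.],
                   [4., 5., 4., 5.],
                   [1., 1., 2., 2.]])
total = bottom.sum(axis=0)
print(tdgsa_proportions(bottom, total))   # TDGSa proportions (sum approximately to one)
print(tdgsf_proportions(bottom, total))   # TDGSf proportions (sum exactly to one)
```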

These two methods are very simple to implement, since they only require forecasts for the most aggregated series in the hierarchy. They seem to provide reliable forecasts for the aggregate levels. However, they are not able to capture the individual dynamics of the bottom level series, which are lost due to aggregation. Moreover, since they are based on historical proportions, they tend to produce less accurate forecasts than the bottom-up approach at lower levels of the hierarchy, as they do not take into account how these proportions may change over time. To address this issue, Athanasopoulos et al. [9] proposed to obtain proportions based on forecasts rather than historical data:

$$p\_i = \prod\_{l=0}^{k-1} \frac{\hat{y}\_{i,t+h|t}^{(l)}}{\hat{S}\_{i,t+h|t}^{(l+1)}}, \qquad i = 1, \dots, m, \tag{35}$$

where *k* is the number of levels of the hierarchy below the top, *y*ˆ(*l*)*i*,*t*+*h*|*t* is the base forecast at time *t* + *h* of the series corresponding to the node that is *l* levels above node *i*, and *S*ˆ(*l*+1)*i*,*t*+*h*|*t* is the sum of the base forecasts at time *t* + *h* of the series corresponding to the nodes that are directly below the node that is *l* levels above node *i*. In the results that follow, this top-down method is referred to as TDfp. In the methods discussed so far, no real reconciliation has been performed, because they are based on base forecasts from a single level of the hierarchy. However, approaches that reconcile the base forecasts from the whole hierarchical structure in order to produce coherent forecasts can also be considered. Hyndman et al. [13] proposed an approach based on the regression model

$$\hat{y}\_{t+h|t} = \mathbf{S}\boldsymbol{\beta}\_{t+h|t} + \boldsymbol{\varepsilon}\_h, \tag{36}$$

where *βt*+*h*|*t* is the unknown conditional mean of the most disaggregated series and *εh* is the coherency error, assumed to have zero mean and covariance matrix **Σ***h*. If **Σ***h* were known, the generalised least squares (GLS) estimator of *βt*+*h*|*t* would lead to the following reconciled forecasts

$$\tilde{y}\_{t+h|t} = \mathbf{S}\hat{\boldsymbol{\beta}}\_{t+h|t} = \mathbf{S}(\mathbf{S}'\boldsymbol{\Sigma}\_h^{-1}\mathbf{S})^{-1}\mathbf{S}'\boldsymbol{\Sigma}\_h^{-1}\hat{y}\_{t+h|t} = \mathbf{S}\mathbf{P}\hat{y}\_{t+h|t}, \tag{37}$$

where *P* = (*S*′**Σ***h*<sup>−1</sup>*S*)<sup>−1</sup>*S*′**Σ***h*<sup>−1</sup>. Hyndman et al. [13] also showed that, if the base forecasts *y*ˆ*t*+*h*|*t* are unbiased, then the reconciled forecasts *y*˜*t*+*h*|*t* will be unbiased, provided that *SPS* = *S*. This condition holds for this reconciliation approach and for the bottom-up method, but not for top-down methods. So, the top-down approaches will never give unbiased reconciled forecasts, even if the base forecasts are unbiased. Recently, Wickramasuriya et al. [15] showed that, in general, **Σ***h* is not identifiable. They showed that the covariance matrix of the *h*-step ahead reconciled forecast errors is given by

$$\text{Var}(y\_{t+h} - \tilde{y}\_{t+h|t}) = \mathbf{S}\mathbf{P}\mathbf{W}\_h \mathbf{P}' \mathbf{S}',\tag{38}$$

for any *P* such that *SPS* = *S*, where *Wh* = Var(*yt*+*h* − *y*ˆ*t*+*h*|*t*) = E(*e*ˆ*t*+*h*|*t* *e*ˆ′*t*+*h*|*t*) is the covariance matrix of the corresponding *h*-step ahead base forecast errors. The goal is to find the matrix *P* that minimises the error variances of the reconciled forecasts, which are on the diagonal of the covariance matrix Var(*yt*+*h* − *y*˜*t*+*h*|*t*). Wickramasuriya et al. [15] showed that the optimal reconciliation matrix *P*, which minimises the trace of *SPWhP*′*S*′ subject to *SPS* = *S*, is

$$P = (\mathbf{S'W}\_h^{-1}\mathbf{S})^{-1}\mathbf{S'W}\_h^{-1}.\tag{39}$$

Therefore, the optimal reconciled forecasts are given by

$$\tilde{y}\_{t+h|t} = \mathbf{S} (\mathbf{S}' \mathbf{W}\_h^{-1} \mathbf{S})^{-1} \mathbf{S}' \mathbf{W}\_h^{-1} \hat{y}\_{t+h|t}, \tag{40}$$

which is referred to as the MinT (Minimum Trace) estimator. Note that the MinT and GLS estimators only differ in the covariance matrix. We still need to estimate *Wh*, which is a matrix of order *n* that can be quite large. The following simplifying approximations were considered by Wickramasuriya et al. [15]:

(1) *Wh* = *kh In* for all *h*, with *kh* > 0. In this case, the MinT estimator corresponds to the ordinary least squares (OLS) estimator of *βt*+*h*|*t*. It is the simplest approximation considered, with *P* independent of the data (it depends only on *S*), which means that this method accounts neither for differences in scale between the levels of the hierarchy (captured by the error variances of the base forecasts), nor for the relationships between the series (captured by the error covariances of the base forecasts). It is optimal only when the base forecast errors are uncorrelated and equivariant, which are unrealistic assumptions for a hierarchical time-series. In the results that follow, this method is referred to as OLS.

(2) *Wh* = *kh* diag(*W*ˆ1) for all *h*, with *kh* > 0, where *W*ˆ1 is the sample covariance estimator of the in-sample 1-step ahead base forecast errors. Then, *Wh* is a diagonal matrix with the diagonal entries of *W*ˆ1, which are the variances of the in-sample 1-step ahead base forecast errors, stacked in the same order as *yt*. This approximation scales the base forecasts using the variances of the residuals. In the results that follow, this specification is referred to as MinT-VarScale.

(3) *Wh* = *kh***Λ** for all *h*, with *kh* > 0, and **Λ** = diag(*S***1**), where **1** is a unit vector of dimension *m*. This method was proposed by Athanasopoulos et al. [34] for temporal hierarchies, and assumes that the bottom level base forecast errors are uncorrelated between nodes and have variance *kh*. Hence, the diagonal entries of **Λ** are the numbers of forecast error variances contributing to each node, stacked in the same order as *yt*. This estimator depends only on the aggregation constraints and is independent of the data. Therefore, it is usually referred to as structural scaling, and we label it as MinT-StructScale. Notice that this specification only assumes equivariant base forecast errors at the bottom level, which is an advantage over OLS. It is particularly useful when the residuals are not available, which is the case when the base forecasts are generated by judgmental forecasting.

(4) *Wh* = *kh* *W*ˆ∗1,*D* for all *h*, with *kh* > 0, where *W*ˆ∗1,*D* = *λ* *W*ˆ1,*D* + (1 − *λ*) *W*ˆ1 is a shrinkage estimator that shrinks the off-diagonal elements of *W*ˆ1 towards zero (while the diagonal elements remain unchanged), *W*ˆ1,*D* is a diagonal matrix with the diagonal entries of *W*ˆ1, and *λ* is the shrinkage intensity parameter. By parameterizing the shrinkage in terms of variances and correlations, rather than variances and covariances, and assuming that the variances are constant, Schäfer and Strimmer [35] proposed the following shrinkage intensity parameter

$$\lambda = \frac{\sum\_{i \neq j} \text{Var}(\hat{r}\_{ij})}{\sum\_{i \neq j} \hat{r}\_{ij}^2}, \tag{41}$$

where *r*ˆ*ij* is the *ij*-th element of *R*ˆ1, the sample correlation matrix of the in-sample 1-step ahead base forecast errors. In contrast to the variance and structural scaling estimators, which are diagonal covariance estimators accommodating only differences in scale between the levels of the hierarchy, this shrinkage estimator is a full covariance estimator that also accounts for the relationships between the series, while the shrinkage parameter regulates the complexity of the matrix *Wh*. In the results that follow, this method is referred to as MinT-Shrink. In all estimators, *kh* is a proportionality constant that needs to be estimated only to obtain prediction intervals.
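To make these four specifications concrete, the following sketch forms each approximation of *Wh* from illustrative in-sample base forecast errors and computes the reconciled forecasts of Equation (40); the residuals and base forecasts are simulated, and the shrinkage intensity is simply fixed here rather than estimated with Equation (41).

```python
import numpy as np

def mint_reconcile(S, W, y_hat):
    """Reconciled forecasts of Equation (40): S (S' W^-1 S)^-1 S' W^-1 y_hat."""
    W_inv = np.linalg.inv(W)
    P = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)   # optimal P of Equation (39)
    return S @ P @ y_hat

# Hierarchy of Figure 1 (n = 8 series, m = 5 at the bottom level).
S = np.vstack([[1, 1, 1, 1, 1], [1, 1, 0, 0, 0], [0, 0, 1, 1, 1], np.eye(5)])
n = S.shape[0]

# Simulated in-sample 1-step ahead base forecast errors (rows = time, columns = series).
rng = np.random.default_rng(0)
E = rng.normal(size=(60, n)) * np.array([5.0, 3.0, 2.5, 1.5, 1.2, 1.0, 0.8, 0.9])
W1 = np.cov(E, rowvar=False)                            # sample covariance W_hat_1

W_ols = np.eye(n)                                       # (1) OLS: W_h proportional to I_n
W_var = np.diag(np.diag(W1))                            # (2) variance scaling: diag(W_hat_1)
W_struct = np.diag(S.sum(axis=1))                       # (3) structural scaling: diag(S 1)
lam = 0.5                                               # assumed shrinkage intensity (not Eq. (41))
W_shrink = lam * np.diag(np.diag(W1)) + (1 - lam) * W1  # (4) shrink off-diagonals towards zero

y_hat = np.array([100.0, 55.0, 43.0, 26.0, 31.0, 14.0, 19.0, 9.0])   # incoherent base forecasts
for label, W in [("OLS", W_ols), ("MinT-VarScale", W_var),
                 ("MinT-StructScale", W_struct), ("MinT-Shrink", W_shrink)]:
    print(label, np.round(mint_reconcile(S, W, y_hat), 2))
```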
