*Article* **Using the Entire Yield Curve in Forecasting Output and Inflation**

#### **Eric Hillebrand 1, Huiyu Huang 2, Tae-Hwy Lee 3,\* and Canlin Li 4**


Received: 17 June 2018; Accepted: 21 August 2018; Published: 29 August 2018

**Abstract:** In forecasting a variable (forecast target) using many predictors, a factor model with principal components (PC) is often used. When the predictors are the yield curve (a set of many yields), the Nelson–Siegel (NS) factor model is used in place of the PC factors. These PC or NS factors are combining information (CI) in the predictors (yields). However, these CI factors are not "supervised" for a specific forecast target in that they are constructed by using only the predictors but not using a particular forecast target. In order to "supervise" factors for a forecast target, we follow Chan et al. (1999) and Stock and Watson (2004) to compute PC or NS factors of many forecasts (not of the predictors), with each of the many forecasts being computed using one predictor at a time. These PC or NS factors of forecasts are combining forecasts (CF). The CF factors are supervised for a specific forecast target. We demonstrate the advantage of the supervised CF factor models over the unsupervised CI factor models via simple numerical examples and Monte Carlo simulation. In out-of-sample forecasting of monthly US output growth and inflation, it is found that the CF factor models outperform the CI factor models especially at longer forecast horizons.

**Keywords:** level, slope, and curvature of the yield curve; Nelson–Siegel factors; supervised factor models; combining forecasts; principal components

**JEL Classification:** C5; E4; G1

#### **1. Introduction**

The predictive power of the yield curve for macroeconomic variables has been documented in the literature for a long time. Many different points on the yield curve have been used and various methodologies have been examined. For example, Stock and Watson (1989) find that two interest rate spreads, the difference between the six-month commercial paper rate and the six-month Treasury bill rate, and the difference between the ten-year and one-year Treasury bond rates, are good predictors of real activity, thus contributing to their index of leading indicators. Bernanke (1990), Friedman and Kuttner (1993), Estrella and Hardouvelis (1991), and Kozicki (1997), among many others, have investigated a variety of yields and yield spreads individually on their ability to forecast macroeconomic variables. Hamilton and Kim (2002) as well as Diebold et al. (2005) provide a brief summary of this line of research and the link between the yield curve and macroeconomic variables.

Various macroeconomic models for exploiting the yield curve information to predict real activity have been proposed. Ang and Piazzesi (2003) and Piazzesi (2005) study the role of macroeconomic variables in an arbitrage-free affine yield curve model. Estrella (2005) constructs an analytical rational expectations model to investigate the reasons for the success of the slope of the yield curve (the spread between long-term and short-term government bond rates) in predicting real economic activity and inflation. The model in Ang et al. (2006) is an arbitrage-free dynamic model (using lags of GDP growth and yields as regressors) that characterizes expectations of GDP growth. Rudebusch and Wu (2008) provide an example of a macro-finance specification that employs more macroeconomic structure and includes both rational expectations and inertial elements.

Stock and Watson (1999, 2002) investigate forecasts of output growth and inflation using over a hundred economic indicators, including many interest rates and yield spreads. Stock and Watson (2002, 2012) advocate methods that aim at solving the large-*N* predictor problem, particularly those using principal components (PC). Ang et al. (2006) suggest the use of the short rate, the five-year to three-month yield spread, and lagged GDP growth in forecasting GDP growth out-of-sample. These two yield curve characteristics, as they argue, are chosen because they have an almost one-to-one correspondence with the first two principal components of the short rate and five yield spreads, which account for 99.7% of quarterly yield curve variation.

As an alternative to the PC factor approach for the large-*N* predictor information set, Diebold and Li (2006) propose the Nelson and Siegel (1987) (NS) factors for the large-*N* yields. They use a modified three-factor NS model to capture the dynamics of the yield curve and show that the three NS factors may be interpreted as level, slope, and curvature. Diebold et al. (2006) examine the correlations between NS yield factors and macroeconomic variables. They find that the level factor is highly correlated with inflation and that the slope factor is highly correlated with real activity. For more on the yield curve background and the three characteristics of the yield curve, see Litterman and Scheinkman (1991) and Diebold and Li (2006).

In this paper, we utilize the yield curve information for the prediction of macroeconomic variables. Using a large number of yield curve points with different maturities creates a large-*N* problem in the predictive regression. The PC factors or the NS factors of the yield curve may be used to reduce the large dimension of the predictors. However, the PC and NS factors of the yield curve are not supervised for a specific variable to forecast. These factors simply combine information (CI) in many predictors (yields) without looking at the forecast target. Hence, the conventional CI factor models (using factors of the predictors) are unsupervised for any forecast target.

Our goal in this paper is to consider factor models where the factors are computed with a particular forecast target in mind. Specifically, we consider the PC or NS factors of forecasts (not of predictors), with each of the forecasts formed using one predictor at a time. (This could be generalized to forming each forecast from more than one predictor, e.g., a subset of the *N* predictors, in which case there can be as many as 2*<sup>N</sup>* forecasts to combine.) These factors combine the forecasts (CF). The PC factors of forecasts are combined forecasts whose combining weights solve a singular value problem for a set of forecasts, while the NS factors of forecasts are combined forecasts whose combining weights are obtained from orthogonal polynomials that emulate the shape of a yield curve (in level, slope, and curvature). The PC or NS factors of the many forecasts are supervised for a forecast target. The main idea of the CF-factor model is to focus on the space spanned by *forecasts* rather than the space spanned by *predictors*. The factorization of forecasts (CF-factor model) can substantially improve forecasting performance compared to the factorization of predictors (CI-factor model). This is because the CF-factor model takes the forecast target into the factorization, while the conventional CI-factor model is blind to the forecast target because the factorization uses only information on the predictors.

For both the CI and CF schemes, the NS factor model is relevant only when the yield curve is used as the predictors, while the PC factor model can be used in general. The NS factors are specific to yield curve factors such as the level, slope, and curvature factors. When the predictors are points on the yield curve, the NS factor model proposed here is nearly the same as the PC factor model. Given the similarity of NS and PC and the generality of PC, we begin the paper with the PC models to understand the mechanism of supervision in CF-factor models. We demonstrate how the supervised CF factor models outperform the unsupervised CI factor models in the presence of many predictors (50 points on the yield curve at each time). The empirical work shows that there are potentially big gains from the CF-factor models. In out-of-sample forecasting of U.S. monthly output growth and inflation, it is found that the CF factor models (CF-NS and CF-PC) are substantially better than the conventional CI factor models (CI-NS and CI-PC). The advantage of supervised factors is even greater at longer forecast horizons.

The paper is organized as follows: in Section 2, we describe the CI and CF frameworks and principal component approaches for their estimation, present theoretical results on supervision, and give an example to provide intuition. Section 3 provides simulations of supervision under different noise, predictor correlation, and predictor persistence conditions. In Section 4, we introduce the NS component approaches for the CI and CF frameworks. In Section 5, we show the out-of-sample performance of the proposed methods in forecasting U.S. monthly output growth and inflation. Section 6 presents the conclusions.

#### **2. Supervising Factors**

#### *2.1. Factor Models*

Let $y_{t+h}$ denote the variable to be forecast (output growth or inflation) using yield curve information stamped at time $t$, where $h$ denotes the forecast horizon. The predictor vector $\mathbf{x}_t$ contains information about the yield curve at various maturities: $\mathbf{x}_t := (x_{1t}, x_{2t}, \dots, x_{Nt})'$, where $x_{it} := x_t(\tau_i)$ denotes the yield at time $t$ with maturity $\tau_i$ $(i = 1, 2, \dots, N)$.

Consider the CI model when *N* is large

$$y\_{t+h} = (1\ \mathbf{x}\_t')\alpha + \varepsilon\_{t+h}, \qquad (t = 1, 2, \dots, T) \tag{1}$$

for which the forecast at time *T* is

$$\hat{y}\_{T+h}^{\text{CI-OLS}} = (1\ \mathbf{x}\_T')\hat{\alpha},\tag{2}$$

with $\hat{\alpha}$ estimated by OLS using the information up to time $T$. A problem is that the mean-squared forecast error (MSFE) is of order $O(N/T)$, increasing with $N$.<sup>1</sup> A solution to this problem is to reduce the dimension, either by selecting a subset of the $N$ predictors, e.g., by Lasso-type regression (Tibshirani 1996), or by using factor models as in, e.g., Stock and Watson (2002). In this paper, we focus on using the factor model rather than selecting a subset of the $N$ predictors.<sup>2</sup>

#### 2.1.1. CI-Factor Model

The conventional factor model is the CI factor model for *xt* of the form

$$\mathbf{x}\_t = \Lambda\_{\text{CI}} f\_{\text{CI},t} + v\_{\text{CI},t} \qquad (t = 1, \dots, T), \tag{3}$$

where $\Lambda_{\text{CI}}$ is $N \times k_{\text{CI}}$ and $f_{\text{CI},t}$ is $k_{\text{CI}} \times 1$. The estimated factor loadings $\hat{\Lambda}_{\text{CI}}$ are obtained either by following Stock and Watson (2002) and Bai (2003), or by following Nelson and Siegel (1987) and Diebold and Li (2006). The latter approach is discussed in Section 4. The factors are then estimated by

$$
\hat{f}\_{\rm CI,t} = \hat{\Lambda}\_{\rm CI}' \mathbf{x}\_t. \tag{4}
$$

As this model computes the factors from all $N$ predictors in $\mathbf{x}_t$ directly, it will be called "CI-factor". The forecast $\hat{y}_{T+h} = (1\ \hat{f}_{\text{CI},T}')\hat{\alpha}_{\text{CI}}$ can be formed using $\hat{\alpha}_{\text{CI}}$ estimated at time $T$ from the regression

$$y\_t = (1\ \hat{f}\_{\text{CI},t-h}')\alpha\_{\text{CI}} + u\_{\text{CI},t}, \qquad (t = h+1, \ldots, T). \tag{5}$$

<sup>1</sup> This is explained in Bai and Ng 2008; Huang and Lee 2010; Stock and Watson 2002.

<sup>2</sup> Bai and Ng (2008) consider CI factor models with a selected subset (targeted predictors).
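
Before turning to the matrix form, the CI-PC steps in (3)-(5) can be made concrete in a minimal numpy sketch; the function name `ci_pc_forecast` and its arguments (a centered predictor matrix `X`, target `y`, horizon `h`, and `k` factors) are illustrative, not part of the paper.

```python
# A minimal sketch of a CI-PC forecast, assuming centered numpy data;
# all names are illustrative.
import numpy as np

def ci_pc_forecast(X, y, h, k):
    T, N = X.shape
    # Loadings: eigenvectors of X'X, i.e., right singular vectors of X.
    _, _, Wt = np.linalg.svd(X, full_matrices=False)
    F = X @ Wt[:k].T                   # T x k estimated factors, Eq. (4)
    # Regress y_t on the h-lagged factors with an intercept, Eq. (5).
    Z = np.column_stack([np.ones(T - h), F[:-h]])
    alpha = np.linalg.lstsq(Z, y[h:], rcond=None)[0]
    # Forecast from the factors at the end of the sample, as in the text.
    return np.concatenate([[1.0], F[-1]]) @ alpha
```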

In matrix form, we write the factor model (3) and the forecast equation (5) for the vector of forecast target observations $y$ and the $T \times N$ matrix of predictors $X$ as follows:<sup>3</sup>

$$X = F\_{\text{CI}}\Lambda\_{\text{CI}}' + v\_{\text{CI}}, \tag{6}$$

$$y = F\_{\text{CI}}\alpha\_{\text{CI}} + u\_{\text{CI}}, \tag{7}$$

where *y* is the *T* × 1 vector of observations, *F*CI is a *T* × *k*CI matrix of factors, ΛCI is an *N* × *k*CI matrix of factor loadings, *α*CI is a *k*CI × 1 parameter vector, *v*CI is a *T* × *N* random matrix, and *u*CI is a *T* × 1 vector of random errors.

**Remark 1.** *(No supervision in CI-factor model): Consider the joint density of* $(y_{t+h}, \mathbf{x}_t)$

$$D(y\_{t+h}, \mathbf{x}\_t; \theta) = D\_1(y\_{t+h}|\mathbf{x}\_t; \theta)D\_2(\mathbf{x}\_t; \theta),\tag{8}$$

*where $D_1$ is the conditional density of $y_{t+h}$ given $\mathbf{x}_t$, and $D_2$ is the marginal density of $\mathbf{x}_t$. The CI-factor model assumes a situation where the joint density operates a "cut" in the terminology of Barndorff-Nielsen (1978) and Engle et al. (1983), such that*

$$D(y\_{t+h}, \mathbf{x}\_t; \theta) = D\_1(y\_{t+h}|\mathbf{x}\_t; \theta\_1) D\_2(\mathbf{x}\_t; \theta\_2), \tag{9}$$

*where $\theta = (\theta_1'\ \theta_2')'$, and $\theta_1 = \alpha$, $\theta_2 = (F, \Lambda)$ are "variation-free". Under this situation, the forecasting equation in (5) is obtained from the conditional model $D_1$, and the factor equation in (3) is obtained solely from the marginal model $D_2$ of the predictors. The computation of the factors is entirely from the marginal model $D_2$, which is blind to the forecast target $y_{t+h}$.*

While the CI factor analysis of a large predictor matrix *X* solves the dimensionality problem, it computes the factors using information in *X* only, without accounting for the variable *y* to be forecast, and therefore the factors are not supervised for the forecast target. Our goal in this paper is to improve this approach by accounting for the forecast target in the computation of the factors. The procedure will be called *supervision*.

There have been several attempts in the literature to supervise factor computation for a given forecast target. For example, Bair et al. (2006) and Bai and Ng (2008) consider factors of selected predictors that are informative for a specified forecast target; Zou et al. (2006) consider sparse loadings of principal components; De Jong (1993) and Groen and Kapetanios (2016) consider partial least squares regression; De Jong and Kiers (1992) consider principal covariate regression; Armah and Swanson (2010) select variables for factor proxies that have the maximum predictive power for the variable being forecast; and some weighted principal components have been used to downweight noisier series.

In this paper, we consider the CF-factor model that computes factors from forecasts rather than from predictors. This approach has been proposed in Chan et al. (1999) and in Stock and Watson (2004), there labeled "principal component forecast combination". We will refer to this approach as CF-PC (combining forecasts principal components). The details are as follows.

<sup>3</sup> The suppressed time stamp of *y* and *X* captures the *h*-lag relation for the forecast horizon, and we treat the data as centered so that, for notational simplicity, we do not include a constant term explicitly in the regression.

#### 2.1.2. CF-Factor Model

The forecasts from a CF-factor model are computed in two steps. The first step is to estimate the factors of the individual forecasts. Let the individual forecasts be formed by regressing the forecast target $y_{t+h}$ on the $i$th individual predictor $x_{it}$:

$$\hat{y}\_{T+h}^{(i)} := \hat{a}\_{i,T} + \hat{b}\_{i,T}x\_{iT} \qquad (i = 1, 2, \dots, N). \tag{10}$$

Stack the $N$ individual forecasts into a vector $\hat{\mathbf{y}}_{t+h} := (\hat{y}_{t+h}^{(1)}, \hat{y}_{t+h}^{(2)}, \dots, \hat{y}_{t+h}^{(N)})'$ and consider a factor model of $\hat{\mathbf{y}}_{t+h}$:

$$
\hat{\mathbf{y}}\_{t+h} = \Lambda\_{\text{CF}} f\_{\text{CF},t+h} + v\_{\text{CF},t+h}.\tag{11}
$$

The CF-factor is estimated from

$$\hat{f}\_{\text{CF},t+h} := \hat{\Lambda}\_{\text{CF}}'\hat{\mathbf{y}}\_{t+h}. \tag{12}$$

The second step is to estimate the forecasting equation (for which the estimated CF-factors from the first step are used as regressors)<sup>4</sup>

$$y\_{t+h} = \hat{f}\_{\text{CF},t+h}'\alpha\_{\text{CF}} + u\_{\text{CF},t+h}.\tag{13}$$

Then, the CF-factor forecast at time *T* is

$$\hat{y}\_{T+h}^{\text{CF}} = \hat{f}\_{\text{CF},T+h}'\hat{\alpha}\_{\text{CF}}, \tag{14}$$

where $\hat{\alpha}_{\text{CF}}$ is estimated from (13) using the information up to time $T$. See (Chan et al. 1999; Huang and Lee 2010; Stock and Watson 2004).
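
To make the two-step construction concrete, here is a minimal numpy sketch of (10)-(14), assuming centered data as in footnote 3; `cf_pc_forecast` and its arguments are hypothetical names for illustration.

```python
# A minimal sketch of a CF-PC forecast, assuming centered numpy data;
# all names are illustrative.
import numpy as np

def cf_pc_forecast(X, y, h, k):
    T, N = X.shape
    # Step 1a: one bivariate regression per predictor, Eq. (10)/(15); the
    # slopes form the diagonal of the supervision matrix B of Remark 2.
    b = np.array([(X[:-h, i] @ y[h:]) / (X[:-h, i] @ X[:-h, i])
                  for i in range(N)])
    Yhat = X * b                       # row t holds the N forecasts of y_{t+h}
    # Step 1b: principal components of the forecasts, Eqs. (11)-(12).
    _, _, Vt = np.linalg.svd(Yhat, full_matrices=False)
    F = Yhat @ Vt[:k].T                # T x k CF factors
    # Step 2: forecasting equation (13), then the forecast (14); by (19)-(20),
    # this is a combined forecast with weights w = Lambda * alpha.
    alpha = np.linalg.lstsq(F[:-h], y[h:], rcond=None)[0]
    return F[-1] @ alpha
```

Setting `b` to a vector of ones reproduces the CI-PC factors, which is the no-supervision special case discussed in Remark 2 below.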

To write the CF-factor model in matrix form, we assume for notational simplicity that the data has been centered so that we do not include a constant term. We regress *y* on the columns *xi* of *X*, *i* = 1, . . . , *N*, one at a time, and write the fitted values in (10) as

$$\hat{y}^{(i)} = \mathbf{x}\_i(\mathbf{x}\_i'\mathbf{x}\_i)^{-1}\mathbf{x}\_i'y =: \mathbf{x}\_i b\_i. \tag{15}$$

Collect the fitted values in the matrix

$$\hat{Y} = [\hat{\mathbf{y}}^{(1)}\ \hat{\mathbf{y}}^{(2)}\ \cdots\ \hat{\mathbf{y}}^{(N)}] := XB \in \mathbb{R}^{T \times N},\tag{16}$$

where $B = \text{diag}(b_1, \dots, b_N) \in \mathbb{R}^{N \times N}$ is a diagonal matrix containing the regression coefficients. We call $B$ the *supervision matrix*. Then, the CF-factor model is

$$
\hat{Y} = F\_{\text{CF}}\Lambda\_{\text{CF}}' + v\_{\text{CF}}, \tag{17}
$$

$$
y = F\_{\text{CF}}\alpha\_{\text{CF}} + u\_{\text{CF}}, \tag{18}
$$

where $F_{\text{CF}}$ is a $T \times k_{\text{CF}}$ matrix of factors of $\hat{Y} = XB$, $\Lambda_{\text{CF}}$ is an $N \times k_{\text{CF}}$ matrix of factor loadings, $\alpha_{\text{CF}}$ is a $k_{\text{CF}} \times 1$ parameter vector, $v_{\text{CF}}$ is a $T \times N$ random matrix, and $u_{\text{CF}}$ is a $T \times 1$ vector of random errors. In the rest of the paper, the subscripts CI and CF may be omitted for simplicity.

We use principal components (PC) as discussed in Stock and Watson (2002), Bai (2003), and Bai and Ng (2006). For the specific case of yield curve data, we use NS components as discussed in Nelson and Siegel (1987) and Diebold and Li (2006). We use both CF and CI approaches together with PC factors and NS factors. Our goal is to show that forecasts using supervised factor models (CF-PC and CF-NS) are better than forecasts from conventional unsupervised factor models (CI-PC and CI-NS).

<sup>4</sup> Given the dependent nature of macroeconomic and financial time series, the forecasting equation can be extended to allow the supervision to be based on the relation between $y_t$ and some predictors after controlling for lagged dependent variables, and to allow a dynamic factor structure; we leave this for future work.

We show analytically and in simulations how supervision works to improve factor computation with respect to a specified forecast target. In Section 5, we present empirical evidence.

**Remark 2.** *(Estimation of B): The CF-factor model in (17) and (18) with $B = I_N$ (identity matrix) is a special case with no supervision. In this case, the CF-factor model collapses to the CI-factor model. If B were consistently estimated by minimizing the forecast error loss, then the CF-factor model with the "optimal" B would outperform the CI-factor model. However, as the dimension of the supervision matrix B grows with $N^2$, B is an "incidental parameter" matrix and cannot be estimated consistently. See Neyman and Scott (1948) and Lancaster (2000). Any estimation error in B translates into forecast error in the CF-factor model. Whether there is any virtue in considering Bayesian methods of estimating B, while still avoiding this problem, is left for future research. Instead, in this paper, we circumvent this difficulty by imposing that $B = \text{diag}(b_1, \dots, b_N)$ be a diagonal matrix and by estimating the diagonal elements $b_i$ from the ordinary least squares regression in (10) or (15) with one predictor $x_i$ at a time. The supervision matrix B can be non-diagonal in general. As imposing diagonality on B may be restrictive, it is an interesting empirical question whether the CF-factor forecast with this restriction and this estimation strategy for B can still outperform the CI-factor forecast with $B = I_N$. Our empirical results in Section 5 (Table 1) support this simple estimation strategy for the diagonal matrix B, in favor of the CF-factor model.*

**Remark 3.** *(Combining forecasts with many predictors): It is generally believed that it is difficult to estimate the forecast combination weights when N is large. Therefore, the equal weights $\frac{1}{N}$ have been widely used instead of estimated weights.*<sup>5</sup> *It is often found in the literature that equally weighted combined forecasts are the best. Stock and Watson (2004) call this the "forecast combination puzzle". See also Timmermann (2006). Smith and Wallis (2009) explore a possible explanation of the forecast combination puzzle and conclude that it is due to estimation error in the combining weights.*

*Now, we note that, in the CF-factor model described above, we can* consistently *estimate the combining weights. From the CF-factor forecast (14) and the estimated factor (12),*

$$\hat{y}\_{T+h} = \hat{f}\_{\text{CF},T+h}'\hat{\alpha}\_{\text{CF}} = \left(\hat{\mathbf{y}}\_{T+h}'\hat{\Lambda}\_{\text{CF}}\right)\hat{\alpha}\_{\text{CF}} =: \hat{\mathbf{y}}\_{T+h}'\hat{w}, \tag{19}$$

*where*

$$
\hat{w} := \hat{\Lambda}\_{\text{CF}}\hat{\alpha}\_{\text{CF}} \tag{20}
$$

*is estimated consistently as long as $\hat{\Lambda}_{\text{CF}}$ and $\hat{\alpha}_{\text{CF}}$ are estimated consistently.*

#### *2.2. Singular Value Decomposition*

In this section, we formalize the concept of supervision and explain how it improves factor extraction. We compare the two different approaches CI-PC (Combining Information—Principal Components) and CF-PC (Combining Forecasts—Principal Components) in a linear forecast problem of the time series *y* given predictor data *X*. We explain the advantage of the CF-PC approach over CI-PC in Section 2.3 and give some examples in Section 2.4. We explore the advantage of supervision in simulations in Section 3.2. As an alternative to PC factors, we propose the use of NS factors in Section 4.

**Principal components of predictors** $X$ **(CI-PC):** Let $X \in \mathbb{R}^{T \times N}$ be a matrix of regressors and let

$$X = R\Sigma W' \in \mathbb{R}^{T \times N} \tag{21}$$

<sup>5</sup> An exception is Wright (2009), who uses Bayesian model averaging (BMA) for pseudo out-of-sample prediction of U.S. inflation, and finds that it generally gives more accurate forecasts than simple equal-weighted averaging. He uses *N* = 107 predictors.

be the singular value decomposition of $X$, where $\Sigma \in \mathbb{R}^{T \times N}$ is diagonal rectangular, that is, a diagonal square matrix padded with zero rows below the square if $\min(T, N) = N$ or padded with zero columns next to the square if $\min(T, N) = T$, and $R \in \mathbb{R}^{T \times T}$ and $W \in \mathbb{R}^{N \times N}$ are unitary. Write

$$X'X = W\Sigma'R'R\Sigma W' = W\Sigma'\Sigma W',\tag{22}$$

where $\Sigma'\Sigma := \text{diag}(\sigma_1^2, \dots, \sigma_N^2)$ is diagonal and square. Therefore, $W$ contains the eigenvectors of $X'X$. For a matrix $A \in \mathbb{R}^{T \times N}$, denote by $A_k \in \mathbb{R}^{T \times k}$ the matrix consisting of the first $k \leq N$ columns of $A$. Then, $W_k$ is the matrix containing the singular vectors corresponding to the $k = k_{\text{CI}}$ largest singular values $(\sigma_1, \dots, \sigma_k)$. The first $k$ principal components are given by

$$F\_{\text{CI}} := XW\_k = R\Sigma W'W\_k = R\Sigma \begin{bmatrix} I\_k \\ \mathbf{0} \end{bmatrix} = R\Sigma\_k = R\_k\Sigma\_{kk}, \tag{23}$$

where $I_k$ is the $k \times k$ identity matrix, $\mathbf{0}$ is an $(N - k) \times k$ matrix of zeros, and $\Sigma_{kk}$ is the $k \times k$ upper-left diagonal block of $\Sigma$. Note that the first $k$ principal components $F_{\text{CI}}$ of $X$ are constant multiples of the columns of $R_k$, as $\Sigma_{kk}$ is diagonal. The projection (forecast) of $y$ onto $F_{\text{CI}}$ is given by

$$\begin{split} \hat{y}\_{\text{CI-PC}} &:= F\_{\text{CI}}(F\_{\text{CI}}'F\_{\text{CI}})^{-1}F\_{\text{CI}}'y = XW\_k(W\_k'X'XW\_k)^{-1}W\_k'X'y \\ &= R\_k\Sigma\_{kk}(\Sigma\_{kk}'R\_k'R\_k\Sigma\_{kk})^{-1}\Sigma\_{kk}'R\_k'y = R\_k(R\_k'R\_k)^{-1}R\_k'y = R\_kR\_k'y, \end{split} \tag{24}$$

as $R_k'R_k = I_k$. Therefore, the CI forecast, $\hat{y}_{\text{CI-PC}}$, is the projection of $y$ onto $R_k$. The CI forecast error and the CI sum of squared errors (SSE) are

$$y - \hat{y}\_{\text{CI-PC}} = y - R\_kR\_k'y = (I\_T - R\_kR\_k')y, \tag{25}$$

$$SSE\_{\text{CI-PC}} = \|y - \hat{y}\_{\text{CI-PC}}\|^2 = y'(I\_T - R\_kR\_k')y,\tag{26}$$

as $(I_T - R_kR_k')$ is symmetric and idempotent.

Bai (2003) shows that, under general assumptions on the factor and error structure, $\hat{F}_{\text{CI}}$ is a consistent and asymptotically normal estimator of $F_{\text{CI}}H$, where $H$ is an invertible $k \times k$ matrix.<sup>6</sup> This identification problem is also clear from Equation (24), and it conveniently allows us to identify the principal components $F_{\text{CI}} = R_k\Sigma_{kk}$ as $F_{\text{CI}} = R_k$ since $\Sigma_{kk}$ is diagonal. The principal components are scalar multiples of the first $k$ columns of $R$. Bai's result shows that principal components can be estimated consistently only up to linear combinations. Bai and Ng (2006) show that the parameter vector $\alpha$ in the forecast equation can be estimated consistently for $\alpha'H^{-1}$ with an asymptotically normal distribution.

**Principal components of forecasts** $\hat{Y}$ **(CF-PC):** To generate forecasts in a CF-factor scheme, we regress $y$ on the columns $\mathbf{x}_i$ of $X$, $i = 1, \dots, N$, one at a time, and calculate the fitted values as in (15). Collect the fitted values in the matrix as in (16), with $B = \text{diag}(b_1, \dots, b_N)$ containing the regression coefficients on its diagonal. Compute the singular value decomposition of $\hat{Y}$:

$$
\hat{Y} = S\Theta V', \tag{27}
$$

<sup>6</sup> In order for the objects in Bai's (2003) analysis to converge, he introduces scaling such that the singular values are the eigenvalues of the matrix $X'X/T$. The singular vectors are then multiplied by $\sqrt{T}$. In our notation, the singular value decomposition becomes $X = \sqrt{T}R\frac{\Sigma}{\sqrt{T}}W'$.

where $\Theta \in \mathbb{R}^{T \times N}$ is diagonal rectangular, and $S \in \mathbb{R}^{T \times T}$ and $V \in \mathbb{R}^{N \times N}$ are unitary. Pick the first $k = k_{\text{CF}}$ principal components of $\hat{Y}$,

$$F\_{\text{CF}} := \hat{Y}V\_k = S\Theta V'V\_k = S\Theta \begin{bmatrix} I\_k \\ \mathbf{0} \end{bmatrix} = S\Theta\_k = S\_k\Theta\_{kk}, \tag{28}$$

where $V_k$ is the $N \times k$ matrix of the singular vectors corresponding to the $k$ largest singular values $(\theta_1, \dots, \theta_k)$ and $\Theta_{kk}$ is the $k \times k$ upper-left diagonal block of $\Theta$. Again, we can identify the estimated $k$ principal components of $\hat{Y}$ with $F_{\text{CF}} = S_k$, where $F_{\text{CF}}$ is the $T \times k_{\text{CF}}$ matrix of factors of $\hat{Y}$. The projection (forecast) of $y$ onto $F_{\text{CF}}$ is given by:

$$\begin{split} \hat{y}\_{\text{CF-PC}} &:= F\_{\text{CF}}(F\_{\text{CF}}'F\_{\text{CF}})^{-1}F\_{\text{CF}}'y = \hat{Y}V\_k(V\_k'\hat{Y}'\hat{Y}V\_k)^{-1}V\_k'\hat{Y}'y \\ &= S\_k\Theta\_{kk}(\Theta\_{kk}'S\_k'S\_k\Theta\_{kk})^{-1}\Theta\_{kk}'S\_k'y = S\_k(S\_k'S\_k)^{-1}S\_k'y = S\_kS\_k'y, \end{split} \tag{29}$$

as $S_k'S_k = I_k$. The CF forecast, $\hat{y}_{\text{CF-PC}}$, is the projection of $y$ onto $S_k$. The CF forecast error and the CF SSE are

$$y - \hat{y}\_{\text{CF-PC}} = y - S\_kS\_k'y = (I\_T - S\_kS\_k')y,\tag{30}$$

$$SSE\_{\text{CF-PC}} = \|y - \hat{y}\_{\text{CF-PC}}\|^2 = y'(I\_T - S\_kS\_k')y,\tag{31}$$

as $(I_T - S_kS_k')$ is symmetric and idempotent.
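
The algebra in (24)-(26) and (29)-(31) is easy to verify numerically. The following sketch, on arbitrary simulated data, checks that the quadratic-form and squared-norm expressions for the SSE agree for both CI-PC and CF-PC; all names and dimensions are illustrative.

```python
# Numerical check of the projection identities (24)-(26) and (29)-(31).
import numpy as np

rng = np.random.default_rng(0)
T, N, k = 60, 8, 3
X = rng.standard_normal((T, N))
y = rng.standard_normal(T)

# CI-PC: project y onto the first k left singular vectors R_k of X, Eq. (24).
R = np.linalg.svd(X, full_matrices=False)[0]
y_ci = R[:, :k] @ (R[:, :k].T @ y)
sse_ci = y @ (y - y_ci)                          # y'(I - R_k R_k')y, Eq. (26)

# CF-PC: the same with the left singular vectors S_k of Y-hat = XB, Eq. (29).
b = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
S = np.linalg.svd(X * b, full_matrices=False)[0]
y_cf = S[:, :k] @ (S[:, :k].T @ y)
sse_cf = y @ (y - y_cf)                          # y'(I - S_k S_k')y, Eq. (31)

# The quadratic-form and squared-norm expressions coincide (idempotence):
print(np.isclose(sse_ci, np.sum((y - y_ci) ** 2)),
      np.isclose(sse_cf, np.sum((y - y_cf) ** 2)))   # True True
```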

#### *2.3. Supervision*

In this sub-section, we explain the advantage of CF-PC over CI-PC in factor computation. We call the advantage "supervision", which is defined as follows:

**Definition 1.** *(Supervision). The advantage of CF-PC over CI-PC, called* supervision*, is the selection of principal components according to their contribution to variation in y, as opposed to selection of principal components according to their contribution to variation in the columns of X. This is achieved by selecting principal components from a matrix of forecasts of y.*

We use the following measures of supervision of CF-PC in comparison with CI-PC.

**Definition 2.** *(Absolute Supervision). Absolute supervision is the difference of the sums of squared errors (SSE) of CI-PC and CF-PC:*

$$s\_{\text{abs}}(X, y, k\_{\text{CI}}, k\_{\text{CF}}) := \|y - \hat{y}\_{\text{CI-PC}}\|^2 - \|y - \hat{y}\_{\text{CF-PC}}\|^2 = y'(S\_{k\_{\text{CF}}}S\_{k\_{\text{CF}}}' - R\_{k\_{\text{CI}}}R\_{k\_{\text{CI}}}')y. \tag{32}$$

**Definition 3.** *(Relative Supervision). Relative supervision is the ratio of the sums of squared errors of CI-PC over CF-PC:*

$$s\_{\rm rel}(X, y, k\_{\rm CI}, k\_{\rm CF}) := \frac{||y - \hat{y}\_{\rm CI \cdot PC}||^2}{||y - \hat{y}\_{\rm CF \cdot PC}||^2} = \frac{y'(I\_T - R\_{k\_{\rm CI}} R\_{k\_{\rm CI}}')y}{y'(I\_T - S\_{k\_{\rm CF}} S\_{k\_{\rm CF}}')y}.\tag{33}$$

**Remark 4.** *When $k_{\text{CI}} = k_{\text{CF}} = N$, there is no room for supervision:*

$$s\_{\rm abs}(X, y, N, N) = y'(SS' - RR')y = y'(I\_T - I\_T)y = 0\tag{34}$$

*because $SS' = RR' = I_T$. Relative supervision is defined only for $k_{\text{CF}} < N$.*

For the sake of simplifying the notation and presentation, we consider the same number of factors in the CI and CF factor models, with $k_{\text{CI}} = k_{\text{CF}} = k$, for the rest of the paper.

**Remark 5.** *$S_k$ is a block of a basis change matrix that in the expression $y'S_k$ returns the first $k$ coordinates of $y$ with respect to the new basis. This new basis is the one with respect to which the mapping $\hat{Y}\hat{Y}' = XBB'X' = S\Theta\Theta'S'$ becomes diagonal, with singular values in descending order such that the first $k$ columns of $S$ correspond to the $k$ largest singular values. Therefore, $y'S_kS_k'y$ is the sum of the squares of these coordinates. Broadly speaking, the $S_k$ are the $k$ largest components of $y$ in the sense of $\hat{Y}$ and its construction from the single regression coefficients. Thus, $y'S_kS_k'y$ is the sum of the squares of the $k$ coordinates of $y$ that contribute most to the variation in the columns of $\hat{Y}$.*

*Analogously, $R_k$ is a block of a basis change matrix that for $y'R_k$ returns the first $k$ coordinates of $y$ with respect to the basis that diagonalizes the mapping $XX' = R\Sigma\Sigma'R'$. Therefore, $y'R_kR_k'y$ is the sum of squares of the $k$ coordinates of $y$ selected according to their contribution to variation in the columns of $X$.*

We emphasize that *the factors that explain most of the variation of the columns of X, i.e., the eigenvectors associated with the largest eigenvalues of $XX'$, which are selected in the principal component analysis of X, may have little to do with the factors that explain most of the variation of y.* The relation between $X$ and $y$ in the data-generating process can, at worst, completely reverse the order of principal components in the columns of $X$ and in $y$. We demonstrate this in the following Example 1.

#### *2.4. Example 1*

In this subsection, we give a small example to facilitate intuition for the supervision mechanics of CF-PC. Example 1 illustrates how the supervision of factor computation defined in Definition 1 operates. In Example 2 in the next section, we add randomness to Example 1 to explore the effect of stochasticity in a well-understood problem.

Let

$$X = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 \\ 0 & 1/3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1/4 \\ 0 & 0 & 0 & 1/5 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix},\tag{35}$$

with $T = 6$ and $N = 5$. The singular value decomposition $X = R\Sigma W'$ is

$$
\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{3} & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{4} & 0 \\ 0 & 0 & 0 & 0 & \frac{1}{5} \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix} . \tag{36}
$$

Let

$$y = (1, 2, 3, 4, 5, 0)'. \tag{37}$$

Then, the diagonal matrix *B* that contains the coefficients of *y* w.r.t. each column of *X* is

$$B = \text{diag}(4, 9, 1, 25, 16), \tag{38}$$

and

$$\hat{Y} := XB = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 2 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 4 \\ 0 & 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}. \tag{39}$$

The singular value decomposition $\hat{Y} = XB = S\Theta V'$ is

$$
\begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 5 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix} . \tag{40}
$$

We set $k_{\text{CI}} = k_{\text{CF}} = k$ and compare CI-PC and CF-PC with the same number of principal components. Recall from (23) that $F_{\text{CI}} = R\Sigma_k$ and from (28) that $F_{\text{CF}} = S\Theta_k$. The absolute supervision and relative supervision, defined in (32) and (33), are computed for each $k$:


| $k$ | $s_{\text{abs}}(X, y, k, k)$ | $s_{\text{rel}}(X, y, k, k)$ |
|:---:|:---:|:---:|
| 1 | 24 | 1.80 |
| 2 | 36 | 3.57 |
| 3 | 36 | 8.20 |
| 4 | 24 | 25.00 |

See Appendix A for the calculation. The absolute supervision is positive and the relative supervision is larger than 1 for all $k < N$.

As noted in Remarks 1 and 5, the relation between $X$ and $y$ is crucial. In this example, the magnitude of the components in $y$ is reversed from the order in $X$. For $X$, the ordering of the columns of $X$ with respect to the largest eigenvalues of $XX'$ is {3, 1, 2, 5, 4}. For $y$, the ordering of the columns of $X$ with respect to the largest eigenvalues of $\hat{Y}\hat{Y}'$ is {4, 5, 2, 1, 3}. For example, consider the case $k = 2$, i.e., we choose two out of five factors in the principal component analysis. CI-PC, the analysis of $X$, will pick columns 3 and 1 of $X$, that is, the vectors $(1, 0, 0, 0, 0, 0)'$ and $(0, 1/2, 0, 0, 0, 0)'$. These correspond to the two largest singular values 1 and 1/2 of $X$. CF-PC, the analysis of $\hat{Y}$, will pick columns 4 and 5 of $X$, that is, the vectors $(0, 0, 0, 0, 1/5, 0)'$ and $(0, 0, 0, 1/4, 0, 0)'$. These correspond to the two largest singular values 5 and 4 of $\hat{Y}$. The regression coefficients in $B = \text{diag}(4, 9, 1, 25, 16)$ de-emphasize columns 3 and 1 of $X$ and emphasize columns 4 and 5 of $X$.
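
A short numpy sketch reproduces Example 1, assuming nothing beyond the matrices defined above; it recovers $B = \text{diag}(4, 9, 1, 25, 16)$ and the supervision measures tabulated above.

```python
# Reproducing Example 1: X, y, the supervision matrix B, and the
# supervision measures of Definitions 2 and 3 for k = 1, ..., 4.
import numpy as np

X = np.zeros((6, 5))
X[0, 2], X[1, 0], X[2, 1], X[3, 4], X[4, 3] = 1, 1/2, 1/3, 1/4, 1/5
y = np.array([1., 2., 3., 4., 5., 0.])

b = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)  # diag of B: (4, 9, 1, 25, 16)
R = np.linalg.svd(X, full_matrices=False)[0]              # left singular vectors of X
S = np.linalg.svd(X * b, full_matrices=False)[0]          # left singular vectors of Y-hat

for k in range(1, 5):
    sse_ci = y @ y - y @ R[:, :k] @ R[:, :k].T @ y        # Eq. (26)
    sse_cf = y @ y - y @ S[:, :k] @ S[:, :k].T @ y        # Eq. (31)
    print(k, sse_ci - sse_cf, sse_ci / sse_cf)            # s_abs, s_rel as tabulated
```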

#### **3. Monte Carlo**

There are several simplifications in the construction of Example 1, which we relax by the following extensions:

(a) Adding randomness makes the estimation of the regression coefficients in $B$ a statistical problem. The sampling errors influence the selection of the components of $\hat{Y}$.

(b) Adding correlation among regressors (columns of $X$) introduces correlation among individual forecasts (columns of $\hat{Y}$), increasing the effect of sampling error in the selection of the components of $\hat{Y}$.

(c) Increasing $N$ to realistic magnitudes, in particular in the presence of highly correlated regressors, will increase estimation error in the principal components due to collinearity.

We address the first extension (a) in Example 2. All three extensions (a), (b), (c) are addressed in Example 3 of Section 3.2.

#### *3.1. Example 2*

Consider adding some noise to the $X$ and $y$ of Example 1. Let $v$ be a $T \times N$ matrix of independent random numbers, each entry distributed as $N(0, \sigma_v^2)$, and $u$ be a vector of independent random numbers, each distributed as $N(0, \sigma_u^2)$. In this example, the new regressor matrix $X$ is the sum of the $X$ in Example 1 and the noise term $v$, and the new $y$ is the sum of the $y$ in Example 1 and the noise term $u$. For simplicity, we set $\sigma_v = \sigma_u$ in the simulations and let both range from 0.01 to 3. This covers a substantial range of randomness given the magnitude of the numbers in $X$ and $y$. For each scenario of $\sigma_v = \sigma_u$, we generate 1000 random matrices $v$ and random vectors $u$ and calculate the Monte Carlo average of the sums of squared errors (SSE).

Figure 1 plots the Monte Carlo average of the SSEs for selections of $k = 1$ to $k = 4$ components. For standard deviations $\sigma_v = \sigma_u$ close to zero, the sums of squared errors are as calculated in Example 1. As the noise increases, the advantage of CF over CI decreases but remains substantial, in particular for smaller numbers of principal components. For $k = 5$ estimated components (not shown), the SSEs of CI-PC and CF-PC coincide because $k = N$.

**Figure 1.** For Example 2. Monte Carlo averages of the sum of squared errors (SSE) against a grid of standard deviations *σ<sup>u</sup>* = *σ<sup>v</sup>* ranging from 0.01 to 3 in factor and forecast equations, for a selection of *k* = 1 to *k* = 4 components. When the standard deviation is close to zero, the SSE are close to the ones reported in Example 1. With increasing noise, the advantage of CF over CI decreases but remains substantial, in particular for few components. For *k* = 5 = *N* (not shown), the SSE of CI-PC and CF-PC coincide, as shown in Remark 4.
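
A sketch of the Example 2 design follows, with `X0` and `y0` standing for Example 1's $X$ and $y$, and `sse()` a hypothetical helper implementing (26)/(31); the function name `example2` is illustrative.

```python
# A sketch of the Example 2 Monte Carlo: Gaussian noise with
# sigma_u = sigma_v is added to the X and y of Example 1, and the
# SSEs of CI-PC and CF-PC are averaged over draws.
import numpy as np

def sse(M, y, k):
    # y'(I - U_k U_k')y with U_k the first k left singular vectors of M.
    U = np.linalg.svd(M, full_matrices=False)[0][:, :k]
    return y @ y - y @ U @ U.T @ y

def example2(X0, y0, sigma, k, n_sim=1000, seed=0):
    rng = np.random.default_rng(seed)
    avg_ci, avg_cf = 0.0, 0.0
    for _ in range(n_sim):
        X = X0 + sigma * rng.standard_normal(X0.shape)
        y = y0 + sigma * rng.standard_normal(y0.shape)
        b = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
        avg_ci += sse(X, y, k) / n_sim        # CI-PC
        avg_cf += sse(X * b, y, k) / n_sim    # CF-PC
    return avg_ci, avg_cf
```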

#### *3.2. Example 3*

We consider the data-generating process (DGP)

$$X = F\Lambda' + v, \tag{41}$$

$$y = F\alpha + u,\tag{42}$$

where *y* is the *T* × 1 vector of observations, *F* is a *T* × *r* matrix of factors, Λ is an *N* × *r* matrix of factor loadings, *α* is an *r* × 1 parameter vector, *v* is a *T* × *N* random matrix, and *u* is a *T* × 1 vector of random errors. We set *T* = 200, *N* = 50 and consider *r* = 3 data-generating factors.

Note that, under this DGP, the CI-PC model in Equations (6) and (7) is correctly specified if the correct number of factors is identified, i.e., *k*CI = *r*. Even under this DGP, however, an insufficient number of factors, *k*CI < *r*, can still result in an advantage of the CF-PC model over the CI-PC model. We will explore this question in this section.

**Factors and persistence:** For each run in the simulation, we generate the *r* factors in *F* as independent AR(1) processes with zero mean and a normally distributed error with mean zero and variance one:

$$F\_{t,i} = \phi F\_{t-1,i} + \varepsilon\_{t,i}, \ t = 2, \dots, T, \ i = 1, \dots, r. \tag{43}$$

We consider a grid of 19 different AR(1) coefficients *φ*, equidistant between 0 and 0.90. We consider *r* = 3 data-generating factors and *k* ∈ {1, 2, 3, 4} estimated factors.

**Contemporaneous factor correlation:** Given a correlation coefficient *ρ* for adjacent regressors, the *N* × *r* matrix Λ of factor loadings is obtained from the first *r* columns of an upper triangular matrix from a Cholesky decomposition of

$$
\begin{bmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\
\rho & 1 & \rho & \cdots & \rho^{N-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{N-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & 1
\end{bmatrix}.
\tag{44}
$$

We consider a grid of 19 different values for *ρ*, equidistant between the points −0.998 and 0.998. In this setup, the 10th value is very close to *ρ* = 0. Then, the covariance matrix of the regressors is given by

$$\mathbb{E}X'X = \mathbb{E}[(\Lambda F' + v')(F\Lambda' + v)] = \Omega\_F + \Omega\_v, \tag{45}$$

where $\Omega_F = \Lambda\Lambda'$ and $\Omega_v = \mathbb{E}\,v'v$ is given by the identity matrix in our simulations. The relation $\mathbb{E}\,F'F = I$ is due to the independence of the factors, but may be subject to substantial finite-sample error, in particular for $\phi$ close to one, for well-known reasons.

**Relation of** *X* **and** *y***:** The *r* × 1 parameter vector *α* is drawn randomly from a standard normal distribution for each run in the simulation. This allows *α* to randomly shuffle which factors are important for *y*.

**Noise level:** We set *σ<sup>u</sup>* = *σ<sup>v</sup>* and let it range between 0.1 and 3 in steps of 0.1. We add the case of 0.01 that essentially corresponds to a deterministic factor model.

For a given number $r = 3$ of data-generating factors, the simulation setup varies along the dimensions $\phi$ (19 points), $k$ (4 points), $\rho$ (19 points), and $\sigma_u = \sigma_v$ (31 points). For every single scenario, we run 1000 simulations and calculate the SSEs of CI-PC and CF-PC and the relative supervision $s_{\text{rel}}(X, y, k, k)$. Then, we take the Monte Carlo average of the SSEs and of $s_{\text{rel}}(X, y, k, k)$ over the 1000 simulations.<sup>7</sup>
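
One draw from this DGP can be sketched as follows, assuming numpy; `example3_draw` is an illustrative name. Note that `np.linalg.cholesky` returns the lower triangular factor, whereas the text refers to an upper triangular one, a transposition we gloss over here.

```python
# A sketch of one draw from the Example 3 DGP (41)-(44); parameter
# defaults follow the text (T = 200, N = 50, r = 3).
import numpy as np

def example3_draw(phi, rho, sigma, T=200, N=50, r=3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # AR(1) factors with unit-variance Gaussian innovations, Eq. (43).
    F = np.zeros((T, r))
    eps = rng.standard_normal((T, r))
    for t in range(1, T):
        F[t] = phi * F[t - 1] + eps[t]
    # Loadings from a Cholesky factor of the Toeplitz matrix in Eq. (44).
    C = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    Lam = np.linalg.cholesky(C)[:, :r]                    # N x r loadings
    alpha = rng.standard_normal(r)                        # random factor weights
    X = F @ Lam.T + sigma * rng.standard_normal((T, N))   # Eq. (41)
    y = F @ alpha + sigma * rng.standard_normal(T)        # Eq. (42)
    return X, y
```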

The Monte Carlo results are presented in Figures 2–4. Each figure contains four panels that plot the situation for *k* = 1, 2, 3, 4 estimated number of factors. The main findings from the figures can be summarized as follows:

1. Figure 2: If the number of estimated factors $k$ is below the true number $r = 3$, as shown in the top panels, the supervision becomes smaller with increasing noise. If the correct number of factors or more are estimated ($k \geq r$), as in the bottom panels, the advantage of supervision increases with the noise level $\sigma_u = \sigma_v$. That is, even when CI-PC is the correct model ($k \geq r$), supervision becomes larger as the noise increases.

<sup>7</sup> In relation to the empirical application using the yield data in Section 5, we could have calibrated the simulation design to make the Monte Carlo more realistic for that application. Nevertheless, our Monte Carlo design covers wide ranges of the parameter values for the noise levels and correlation structures ($\rho$ and $\phi$) in the yield data. Figure 2 shows that the supervision is smaller with larger noise levels, which may be rather obvious intuitively. Figure 4 shows the advantage of supervision when the factors are persistent, which depends on the number of factors $k$ relative to the true number of factors $r$. Particularly interesting is Figure 3, which shows that the advantage of supervision is smaller when the contemporaneous correlation $\rho$ between predictors is larger; this may be relevant for the yield data because yields with different maturities may be moderately contemporaneously correlated. We thank a referee for pointing this out.


**Figure 2.** Supervision dependent on noise. Relative supervision against a grid of standard deviations in factor and forecast equation *σ<sup>u</sup>* = *σv*, ranging from 0.01 to 3, while the factor serial correlation is fixed at *φ* = 0 and the contemporaneous factor correlation is *ρ* = 0.

**Figure 3.** Supervision dependent on contemporaneous factor correlation *ρ*. Relative supervision against a grid of contemporaneous correlation coefficients *ρ* ranging from −0.998 to 0.998, while the factor serial correlation *φ* is fixed at zero and the noise level is fixed at *σ<sup>u</sup>* = *σ<sup>v</sup>* = 1.

**Figure 4.** Supervision dependent on factor persistence *φ*. Relative supervision against a grid of AR(1) coefficients *φ* ranging from 0 to 0.9, while the noise level is fixed at *σ<sup>u</sup>* = *σ<sup>v</sup>* = 1 and the contemporaneous regressor correlation is *ρ* = 0.

#### **4. Supervising Nelson–Siegel Factors**

In the previous section, we have examined the factor model based on principal components. When the predictors are points on the yield curve, an alternative factor model can be constructed based on Nelson–Siegel (NS) components. We introduce two new factor models, CF-NS and CI-NS, by replacing principal components with NS components in CF-PC and CI-PC models. Like CI-PC, CI-NS is unsupervised. Like CF-PC, CF-NS is supervised for the particular forecast target of interest.

#### *4.1. Nelson–Siegel Components of the Yield Curve*

As an alternative to using principal components in the factor model, one can apply the modified Nelson–Siegel (NS) three-factor framework of Diebold and Li (2006) to factorize the yield curve. Nelson and Siegel (1987) propose Laguerre polynomials $L_n(z) = \frac{e^z}{n!}\frac{d^n}{dz^n}(z^n e^{-z})$ with weight function $w(z) = e^{-z}$ to model the instantaneous nominal forward rate (forward rate curve)

$$\begin{split} f\_t(\tau) &= \beta\_1 + (\beta\_2 + \beta\_3)\left(L\_0(z)e^{-\theta\tau}\right) - \beta\_3\left(L\_1(z)e^{-\theta\tau}\right) \\ &= \beta\_1 + (\beta\_2 + \beta\_3)e^{-\theta\tau} - \beta\_3(1 - \theta\tau)e^{-\theta\tau} \\ &= \beta\_1 + \beta\_2 e^{-\theta\tau} + \beta\_3\theta\tau e^{-\theta\tau}, \end{split} \tag{46}$$

where $z = \theta\tau$, $L_0(z) = 1$, $L_1(z) = 1 - \theta\tau$, and $\beta_j \in \mathbb{R}$ for all $j$. The decay parameter $\theta$ may change over time, but we fix $\theta = 0.0609$ for all $t$ following Diebold and Li (2006).<sup>8</sup>

Then, the continuously compounded zero-coupon nominal yield *xt*(*τ*) of the bond with maturity *τ* months at time *t* is

$$x\_t(\tau) = \frac{1}{\tau}\int\_0^\tau f\_t(s)\,ds = \beta\_1 + \beta\_2\left(\frac{1 - e^{-\theta\tau}}{\theta\tau}\right) + \beta\_3\left(\frac{1 - e^{-\theta\tau}}{\theta\tau} - e^{-\theta\tau}\right). \tag{47}$$

Allowing the *βj*'s to change over time and adding the approximation error *vit*, we obtain the following approximate NS factor model for the yield curve for *i* = 1, . . . , *N*:

$$\begin{split} x\_t(\tau\_i) &= \beta\_{1t} + \beta\_{2t}\left(\frac{1 - e^{-\theta\tau\_i}}{\theta\tau\_i}\right) + \beta\_{3t}\left(\frac{1 - e^{-\theta\tau\_i}}{\theta\tau\_i} - e^{-\theta\tau\_i}\right) + v\_{it} \\ &= \left[1 \;\; \left(\frac{1 - e^{-\theta\tau\_i}}{\theta\tau\_i}\right) \;\; \left(\frac{1 - e^{-\theta\tau\_i}}{\theta\tau\_i} - e^{-\theta\tau\_i}\right)\right]\begin{bmatrix} \beta\_{1t} \\ \beta\_{2t} \\ \beta\_{3t} \end{bmatrix} + v\_{it} \\ &= \lambda\_i' f\_t + v\_{it}, \end{split} \tag{48}$$

where $f_t = (\beta_{1t}, \beta_{2t}, \beta_{3t})'$ are the three NS factors and $\lambda_i' = \left[1 \;\; \frac{1-e^{-\theta\tau_i}}{\theta\tau_i} \;\; \left(\frac{1-e^{-\theta\tau_i}}{\theta\tau_i} - e^{-\theta\tau_i}\right)\right]$ are the factor loadings. Because $x_t(\infty) = \beta_{1t}$, $x_t(\infty) - x_t(0) = -\beta_{2t}$, and $[x_t(0) + x_t(\infty)] - 2x_t(\tau_m)$ with $\tau_m = 24$ (say) is proportional to $-\beta_{3t}$, the three NS factors $(\beta_{1t}, \beta_{2t}, \beta_{3t})$ are associated with the level, slope, and curvature of the yield curve.

<sup>8</sup> Diebold and Li (2006) show that fixing the Nelson–Siegel decay parameter at $\theta = 0.0609$ maximizes the curvature loading at the two-year bond maturity and allows better identification of the three NS factors. They also show that allowing $\theta$ to be a free parameter does not improve the forecasting performance. Therefore, following their advice, we fix $\theta = 0.0609$ and do not estimate it. A small $\theta$ (a slowly decaying curve) fits the curve for long maturities better, and a large $\theta$ (a fast decaying curve) fits the curve for short maturities better.

#### *4.2. CI-NS and CF-NS*

#### 4.2.1. NS Components of Predictors *X* (CI-NS)

We have $N$ predictors of yields $\mathbf{x}_t = (x_{1t}, x_{2t}, \dots, x_{Nt})'$, where $x_{it} = x_t(\tau_i)$ denotes the yield with maturity $\tau_i$ months at time $t$ $(i = 1, 2, \dots, N)$. Stacking $x_{it}$ for $i = 1, 2, \dots, N$, (48) can be written as

$$
\mathbf{x}\_t = \Lambda\_{\text{CI}} f\_{\text{CI},t} + v\_{\text{CI},t}, \tag{49}
$$

or

$$x\_{it} = \lambda\_{\text{CI},i}'f\_{\text{CI},t} + v\_{\text{CI},it}, \tag{50}$$

where $\lambda_{\text{CI},i}'$ denotes the $i$-th row of

$$
\Lambda\_{\rm CI} = \begin{pmatrix} 1 & \frac{1 - e^{-\theta \tau\_1}}{\theta \tau\_1} & \left(\frac{1 - e^{-\theta \tau\_1}}{\theta \tau\_1} - e^{-\theta \tau\_1}\right) \\ \vdots & \vdots & \vdots \\ 1 & \frac{1 - e^{-\theta \tau\_N}}{\theta \tau\_N} & \left(\frac{1 - e^{-\theta \tau\_N}}{\theta \tau\_N} - e^{-\theta \tau\_N}\right) \end{pmatrix},\tag{51}
$$

which is the $N \times 3$ matrix of *known* factor loadings because we fix $\theta = 0.0609$ following Diebold and Li (2006). The NS factors $\hat{f}_{\text{CI},t} = (\hat{\beta}_{1t}, \hat{\beta}_{2t}, \hat{\beta}_{3t})'$ are estimated by regressing $x_{it}$ on $\lambda_{\text{CI},i}$ (over $i = 1, \dots, N$), fitting the yield curve period by period for each $t$.
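
This period-by-period estimation can be sketched in numpy; `ns_loadings` and `ci_ns_factors` are hypothetical helper names, with `tau` a vector of maturities in months.

```python
# A sketch of the CI-NS factor estimation in Eqs. (49)-(51): known loadings
# with theta fixed at 0.0609, and one cross-sectional OLS fit per period.
import numpy as np

def ns_loadings(tau, theta=0.0609):
    """N x 3 matrix of the known NS loadings in Eq. (51); tau in months."""
    z = theta * np.asarray(tau, dtype=float)
    slope = (1.0 - np.exp(-z)) / z
    return np.column_stack([np.ones_like(z), slope, slope - np.exp(-z)])

def ci_ns_factors(X, tau):
    """X: T x N matrix of yields; returns the T x 3 matrix whose row t is
    (beta1_t, beta2_t, beta3_t), fitted by OLS across maturities."""
    Lam = ns_loadings(tau)
    return np.linalg.lstsq(Lam, X.T, rcond=None)[0].T
```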

Then, we consider a linear forecast equation

$$y\_t = (1\ \hat{f}\_{\text{CI},t-h}')\alpha\_{\text{CI}} + u\_{\text{CI},t}, \qquad t = h+1, \ldots, T,\tag{52}$$

in order to forecast $y_{t+h}$ (such as output growth or inflation). We first estimate $\hat{\alpha}_{\text{CI}}$ using the information up to time $T$ and then form the forecast, which we call CI-NS, as

$$
\hat{y}\_{T+h}^{\text{CI-NS}} = (1\ \hat{f}\_{\text{CI},T}')\hat{\alpha}\_{\text{CI}}.\tag{53}
$$

This method is comparable to CI-PC with the number of factors fixed at $k = 3$. It differs from CI-PC, however, in that the three NS factors $(\hat{\beta}_{1t}, \hat{\beta}_{2t}, \hat{\beta}_{3t})$ have intuitive interpretations as the level, slope, and curvature of the yield curve, while the first three principal components may not have a clear interpretation. In the empirical section, we also consider two alternative CI-NS forecasts, including only the level factor $\hat{\beta}_{1t}$ (denoted CI-NS ($k = 1$)) or only the level and slope factors $(\hat{\beta}_{1t}, \hat{\beta}_{2t})$ (denoted CI-NS ($k = 2$)), to see whether the level factor or the combination of the level and slope factors makes the dominant contribution in forecasting output growth and inflation.

#### 4.2.2. NS Components of Forecasts *Y*ˆ (CF-NS)

While CI-NS solves the large-$N$ dimensionality problem by reducing the $N$ yields to the three factors $\hat{f}_{\text{CI},t} = (\hat{\beta}_{1t}, \hat{\beta}_{2t}, \hat{\beta}_{3t})'$, it computes the factors entirely from the yield curve information $\mathbf{x}_t$, without accounting for the variable $y_{t+h}$ to be forecast. Similar in spirit to CF-PC, we can improve CI-NS by supervising the factor computation, which we term CF-NS.

The CF-NS forecast is based on the NS factors of $\hat{\mathbf{y}}_{t+h} := (\hat{y}_{t+h}^{(1)}, \hat{y}_{t+h}^{(2)}, \dots, \hat{y}_{t+h}^{(N)})'$, the vector of the $N$ individual forecasts as in (10) and (11),

$$
\hat{\mathbf{y}}\_{t+h} = \Lambda\_{\text{CF}} f\_{\text{CF},t+h} + v\_{\text{CF},t+h}, \tag{54}
$$

with $\Lambda_{\text{CF}} = \Lambda_{\text{CI}}$ in (51). Hence, $\Lambda_{\text{CI}} = \Lambda_{\text{CF}} = \Lambda$ for the NS factor models. Note that, when the NS factor loadings are normalized to sum to one, the three CF-NS factors

$$\begin{split} \hat{f}\_{\text{CF},t+h} &= \Lambda'\hat{\mathbf{y}}\_{t+h} \\ &= \left(\frac{1}{s\_1}\sum\_{i=1}^N \hat{y}\_{t+h}^{(i)} \;\;\; \frac{1}{s\_2}\sum\_{i=1}^N \left(\frac{1-e^{-\theta\tau\_i}}{\theta\tau\_i}\right)\hat{y}\_{t+h}^{(i)} \;\;\; \frac{1}{s\_3}\sum\_{i=1}^N \left(\frac{1-e^{-\theta\tau\_i}}{\theta\tau\_i} - e^{-\theta\tau\_i}\right)\hat{y}\_{t+h}^{(i)}\right)' \end{split} \tag{55}$$

are weighted individual forecasts with the three *normalized* NS loadings, where $s_1 = N$, $s_2 = \sum_{i=1}^N \left(\frac{1-e^{-\theta\tau_i}}{\theta\tau_i}\right)$, and $s_3 = \sum_{i=1}^N \left(\frac{1-e^{-\theta\tau_i}}{\theta\tau_i} - e^{-\theta\tau_i}\right)$. The CF-NS forecast can be obtained from the forecasting equation

$$y\_{t+h} = \hat{f}\_{\text{CF},t+h}'\alpha\_{\text{CF}} + u\_{\text{CF},t+h}, \tag{56}$$

$$\hat{y}\_{T+h}^{\text{CF-NS}} = \hat{f}\_{\text{CF},T+h}'\hat{\alpha}\_{\text{CF}},$$

which is denoted CF-NS($k = 3$). The parameter vector $\hat{\alpha}_{\text{CF}}$ is estimated using information up to time $T$. Using only the first factor or the first two factors, one can obtain the forecasts CF-NS($k = 1$) and CF-NS($k = 2$).

Note that, while the CF-PC method can be used for data of many kinds, the CF-NS method we propose is tailored to forecasting using the yield curve. It uses fixed factor loadings in Λ that are the NS exponential factor loadings for yield curve modeling, and hence avoids the estimation of factor loadings. In contrast, CF-PC needs to estimate Λ.

Also note that, by construction, CF-NS($k = 1$) is the equally weighted combined forecast $\frac{1}{N}\sum_{i=1}^N \hat{y}_{T+h}^{(i)}$.
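
A sketch of the CF-NS factor computation in (55), reusing the hypothetical `ns_loadings` helper sketched in Section 4.2.1: normalizing the columns of $\Lambda$ implements the divisors $s_1$, $s_2$, $s_3$, and the first column delivers the equal-weighted combination noted above.

```python
# A sketch of the CF-NS factors of Eq. (55); relies on the ns_loadings()
# helper sketched earlier (known loadings, theta = 0.0609).
import numpy as np

def cf_ns_factors(Yhat, tau):
    """Yhat: T x N matrix of individual forecasts as in Eq. (10)/(16);
    returns the T x 3 CF-NS factors."""
    Lam = ns_loadings(tau)
    Lam = Lam / Lam.sum(axis=0)      # normalize columns: divisors s1, s2, s3
    return Yhat @ Lam                # first column = equal-weighted combination
```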

#### **5. Forecasting Output Growth and Inflation**

This section presents the empirical analysis where we describe the data, implement forecasting methods introduced in the previous sections on forecasting output growth and inflation, and analyze out-of-sample forecasting performances. This allows us to analyze the differences between output growth and inflation forecasting using the same yield curve information and to compare the strengths of different methods.

#### *5.1. Data*

Let $y_{t+h}$ denote the variable to be forecast (output growth or inflation) using yield information up to time $t$, where $h$ denotes the forecast horizon. The predictor vector $\mathbf{x}_t = (x_t(\tau_1), x_t(\tau_2), \dots, x_t(\tau_N))'$ contains the information about the yield curve at various maturities: $x_t(\tau_i)$ denotes the zero-coupon yield with maturity $\tau_i$ months at time $t$ $(i = 1, 2, \dots, N)$.

Two forecast targets, output growth and inflation, are constructed as the monthly growth rate of Personal Income (PI, seasonally adjusted annual rate) and the monthly change in the CPI (Consumer Price Index for all urban consumers: all items, seasonally adjusted), respectively, from 1970:01 to 2010:01. The PI and CPI data are obtained from the website of the Federal Reserve Bank of St. Louis (FRED2).

We apply the following data transformations. For the monthly growth rate of PI, we set $y_{t+h} = 1200[(1/h)\ln(\text{PI}_{t+h}/\text{PI}_t)]$ as the forecast target (as used in Ang et al. (2006)). For the consumer price index (CPI), we set $y_{t+h} = 1200[(1/h)\ln(\text{CPI}_{t+h}/\text{CPI}_t)]$ as the forecast target (as used in Stock and Watson (2007)).<sup>9</sup>

<sup>9</sup> $y_{t+h} = 1200[(1/h)\ln(\text{CPI}_{t+h}/\text{CPI}_t) - \ln(\text{CPI}_t/\text{CPI}_{t-1})]$ is used in Bai and Ng (2008).
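
A sketch of these target transformations, assuming a numpy array of monthly index levels (PI or CPI); `growth_target` is an illustrative name.

```python
# y_{t+h} = 1200 * (1/h) * ln(P_{t+h} / P_t) for a monthly index P;
# entry t of the result is the target dated t+h.
import numpy as np

def growth_target(levels, h):
    p = np.log(np.asarray(levels, dtype=float))
    return (1200.0 / h) * (p[h:] - p[:-h])
```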

Our yield curve data consist of U.S. government bond prices, coupon rates, and coupon structures, as well as issue and redemption dates, from 1970:01 to 2009:12.<sup>10</sup> We calculate zero-coupon bond yields using the unsmoothed Fama and Bliss (1987) approach. We measure bond yields on the second day of each month. We also apply several data filters designed to enhance data quality and focus attention on maturities with good liquidity. First, we exclude floating-rate bonds, callable bonds, and bonds extended beyond the original redemption date. Second, we exclude outlying bond prices less than 50 or greater than 130, because their price discounts/premiums are too large and imply thin trading, and we exclude yields that differ greatly from yields at nearby maturities. Third, we use only bonds with maturity greater than one month and less than fifteen years, because other bonds are not actively traded. Finally, to simplify our subsequent estimation, we pool the bond yields, using linear interpolation, into fixed maturities of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 72, 78, 84, 90, 96, 102, 108, and 120 months, where a month is defined as 30.4375 days.<sup>11</sup>

We examine some descriptive statistics (not reported to save space) of the two forecast targets and of the yield curve level, slope, and curvature (empirical measures), over the full sample from 1970:01 to 2009:12 and the out-of-sample evaluation period from 1995:02 to 2010:01. We observe that both PI growth and CPI inflation become more moderate and less volatile from around the mid-1980s; this has become a stylized fact known as the "Great Moderation". In particular, there is a substantial drop in the persistence of CPI inflation. The volatility and persistence of the yield curve slope and curvature do not change much. The yield curve level, however, decreases and stabilizes.

In predicting macroeconomic variables using the term structure, yield spreads between yields of various maturities and the short rate are commonly used in the literature. One possible reason for this practice is that yield levels are treated as I(1) processes, so yield spreads are likely I(0). Similarly, macroeconomic variables are typically assumed to be I(1) and transformed into I(0), so that, in using yield spreads to forecast macro targets, issues such as spurious regression are avoided. In this paper, however, we use yield levels (not spreads) to predict PI growth and CPI inflation (not the change in inflation), for the following reasons. First, whether yields and inflation are I(1) or I(0) remains debatable. Stock and Watson (1999, 2012) use yield spreads and treat inflation as I(1), so they forecast the change in inflation. Inoue and Kilian (2008), however, treat inflation as I(0). Since our target is inflation, not the change in inflation, we treat CPI inflation as well as yields as I(0) in our empirical analysis. Second, we emphasize real-time, out-of-sample forecasting performance more than in-sample concerns. As long as out-of-sample forecast performance is unaltered or even improved, the choice of treating the variables as I(1) or I(0) does not matter much.<sup>12</sup> Third, using yield levels allows us to provide clearer interpretations for questions such as which part of the yield curve contributes the most to predicting PI growth or CPI inflation, and how different parts of the yield curve interact in the prediction.

#### *5.2. Out-of-Sample Forecasting*

All forecasting models are estimated in a rolling window scheme with window size *R* = 300 months ending at month *t* (starting at *t* − *R* + 1). In the evaluation period from *t* = 1995:02

<sup>10</sup> As a robustness check, we apply our method to the original yield data of Diebold and Li (2006) and also to the sub-samples in our data set. The results are essentially the same as those summarized at the end of Section 5.

<sup>11</sup> It may be interesting to explore whether yields of different maturities have different effects on the forecast outcome. However, the present paper focuses on the comparison between CF and CI rather than on a detailed CI-only analysis, e.g., finding the best maturity yield for the forecast outcome. Nevertheless, our CI-NS model reflects such effects, as the three NS factors (level, slope, and curvature) are different combinations of bond maturities, as shown in Equation (55). The different coefficients on the NS factors suggest that different bond maturities have different effects on the forecast outcome, as Gogas et al. (2015) found.

<sup>12</sup> While not reported to save space, we also tried forecasting the change in inflation and found that forecasting inflation directly using all yield levels improves the out-of-sample performance of most forecasting methods by a large margin.

to *t* = 2010:01 (180 months), the first rolling sample used to estimate the models begins at 1970:02 and ends at 1995:01, the second rolling sample covers 1970:03–1995:02, the third 1970:04–1995:03, and so on. The out-of-sample evaluation period is from 1995:02 to 2010:01 (hence the out-of-sample size is *P* = 180).<sup>13</sup> In all NS-related methods (CI and CF), we set *θ*, the parameter that governs the exponential decay rate, to 0.0609 for reasons discussed in Diebold and Li (2006).<sup>14</sup> We compare *h*-months-ahead out-of-sample forecasting results of the methods introduced so far for *h* = 1, 3, 6, 12, 18, 24, 30, 36 months ahead.

Figure 5 illustrates what economic content the factors in CF-PC may carry. It shows that the first PC assigns roughly equal weights to all $N = 50$ individual forecasts that use yields at various maturities (in months), so that it may be interpreted as the factor capturing the level of the yield curve; the second PC assigns roughly increasing weights, so that it may be interpreted as the factor capturing the slope; and the third PC assigns weights that first decrease and then increase, so that it may be interpreted as the factor capturing curvature.
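To make the CF-PC construction concrete, the following is a minimal sketch under our own simplifications (one OLS forecast per yield maturity, principal components of the in-sample forecasts extracted via SVD, and a final factor regression; all names are ours and details such as lag choices are omitted):

```python
import numpy as np

def cf_pc_forecast(X, y, h, k):
    """CF-PC sketch: the factors are principal components of N individual forecasts.

    X: (T, N) yields at N maturities; y: (T,) forecast target; h: horizon.
    """
    T, N = X.shape
    Xt, yh = X[:-h], y[h:]                      # align x_t with y_{t+h}
    fc_in = np.empty((T - h, N))                # in-sample individual forecasts
    fc_out = np.empty(N)                        # individual forecasts at the origin T
    for i in range(N):
        Z = np.column_stack([np.ones(T - h), Xt[:, i]])
        b = np.linalg.lstsq(Z, yh, rcond=None)[0]
        fc_in[:, i] = Z @ b
        fc_out[i] = b[0] + b[1] * X[-1, i]
    mu = fc_in.mean(axis=0)
    _, _, Vt = np.linalg.svd(fc_in - mu, full_matrices=False)
    F = (fc_in - mu) @ Vt[:k].T                 # k supervised (CF) factors
    G = np.column_stack([np.ones(T - h), F])
    g = np.linalg.lstsq(G, yh, rcond=None)[0]   # factor forecasting regression
    return g[0] + ((fc_out - mu) @ Vt[:k].T) @ g[1:]
```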

Tables 1 and 2 present the root mean squared forecast errors (RMSFE) of the PC methods with *k* = 1, 2, 3, 4, 5 and of the NS methods with *k* = 1, 2, 3, for PI growth (Table 1) and for CPI inflation (Table 2) forecasts using all 50 yield levels.<sup>15</sup> In Panel A of Tables 1 and 2, we report the Root Mean Squared Forecast Error (RMSFE, the square root of a model's MSFE).<sup>16</sup> In Panel B of Tables 1 and 2, we report the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS according to Definition 3, i.e., the ratio of the MSFEs of the CI and CF models. The relative supervision in Panel B can be obtained from the RMSFEs in Panel A. For simplicity of presentation, Panel B reports the relative supervision only for the same number of factors (*k*<sub>CI</sub> = *k*<sub>CF</sub> for both PC and NS).

<sup>13</sup> As a robustness check, we have also tried different sample splits for the estimation and prediction periods, i.e., the number of in-sample regression observations and out-of-sample evaluation observations. We find that the results are similar.

<sup>14</sup> For different values of *θ*, the performances of CI-NS and CF-NS change only marginally.

<sup>15</sup> While we report results for *k* = 4, 5 for CF-PC, we do not report them for CF-NS. Svensson (1995) and Christensen et al. (2009) (CDR 2009) extend the three-factor NS model to four- or five-factor NS models. CDR's dynamic generalized NS model has five factors: one level factor, two slope factors, and two curvature factors. The Svensson and CDR extensions are useful for fitting the yield curve at longer maturities (>10 years); CDR use yields up to 30 years. Because we use only yields with maturities ≤10 years, the second curvature factor loadings would look similar to the slope factor loadings, and we would have a collinearity problem. In addition, the fourth and fifth factors have no clear economic interpretation and are hard to explain. For these reasons, we report results for *k* = 1, 2, 3 for the CF-NS model.

<sup>16</sup> For the statistical significance of the loss-difference (see Definition 2), the asymptotic *p*-values of the Diebold–Mariano statistics are all very close to zero, especially for larger values of the forecast horizon *h*.

**Figure 5.** Factor loadings of principal components and Nelson–Siegel factors. The first two panels: factor loadings of the first three principal components in CF-PC ($k = 3$), averaged over the out-of-sample period (02/1995–01/2010), for PI growth (first panel) and CPI inflation (second panel). The abscissa refers to the 50 individual forecasts that use yields at the 50 maturities (in months). The loading of the first principal component has the circle symbol, the second the cross symbol, and the third the square symbol. The third panel: the three normalized Nelson–Siegel (NS) exponential loadings in CF-NS that correspond to the three NS factors. The abscissa refers to the 50 individual forecasts that use yields at the 50 maturities (in months). The circled line denotes the first normalized NS factor loading, $1/N$; the crossed line denotes the second normalized NS factor loading, $(1 - e^{-\theta\tau})/(\theta\tau)$, divided by its sum; and the squared line denotes the third normalized NS factor loading, $(1 - e^{-\theta\tau})/(\theta\tau) - e^{-\theta\tau}$, divided by its sum, where $\tau$ denotes maturity and $\theta$ is fixed at 0.0609.

We find that, in general, supervised factorization performs better. The CF schemes (CF-PC and CF-NS) perform substantially better than the CI schemes (CI-PC and CI-NS). Within the same CF or CI schemes, two alternative factorizations work similarly: CF-PC and CF-NS are about the same, and CI-PC and CI-NS are about the same. We summarize our findings from Figure 5 and Tables 1 and 2 as follows.


<sup>17</sup> We conducted a Monte Carlo simulation (not reported), whose results are consistent with the empirical finding that the supervision is stronger for a longer forecast horizon *h*.

4. *We often get the best supervised predictions with a single factor* (*k* = 1) *with the CF-factor models.*<sup>18</sup> Since CF-NS(*k* = 1) is the equally weighted combined forecast, as noted in Section 4.2.2, this is another instance of the forecast combination puzzle discussed in Remark 3: the equal-weighted forecast combination is hard to beat. Since CF-PC(*k* = 1) is numerically identical to CF-NS(*k* = 1), as shown in Figure 5, CF-PC(*k* = 1) is also effectively an equally weighted forecast average.<sup>19</sup>


**Table 1.** Out-of-sample forecasting of personal income growth.

<sup>18</sup> Figlewski and Urich (1983) discussed various constrained models for forming a combination of forecasts and examined when more than the simple average combined forecast is needed. They gave a sufficient condition under which the simple average of forecasts is the optimal forecast combination: "Under the most extensive set of constraints, forecast errors are assumed to have zero mean and to be independent and identically distributed. In this case the optimal forecast is the simple average." This corresponds to CF-PC(*k* = 1) and CF-NS(*k* = 1) when the first factor (*k* = 1) in PC or NS is sufficient for the CF factor model. This is clearly the case in CF-NS, as shown in Equation (55). One can show that the first PC (corresponding to the largest singular value) would also be the simple average. Hence, in terms of the CF-factor model, the forecast combination puzzle amounts to the fact that we often do not need the second PC factor. Interestingly, Figlewski and Urich (1983, p. 696) went on to note the cases when the simple average is not optimal: "However, the hypothesis of independence among forecast errors is overwhelmingly rejected for our data: errors are highly positively correlated with one another." On the other hand, they also noted other reasons why the simple average may still be preferred: "Because the estimated error structure was not completely stable over time, the models which adjusted for correlation did not achieve lower mean squared forecast error than the simple average in out-of-sample tests. Even so, we find...that forecasts from these models, while less accurate than the simple mean, do contain information which is not fully reflected in prices in the money market, and is therefore economically valuable." We thank a referee for bringing this discussion in Figlewski and Urich (1983) to our attention.

<sup>19</sup> While the simple equally weighted forecast combination can be implemented without the use of PCA or without making reference to the NS model, it is important to note that the simple average combined forecast corresponds to the first CF-PC factor (CF-PC(*k* = 1)) or the first CF-NS factor (CF-NS(*k* = 1)). In view of Figlewski and Urich (1983), it would be useful to know when the first factor (*k* = 1) is enough, so that the simple average suffices, and when the higher-order factors (*k* > 1) may be necessary because they contain information in addition to the first CF-factor. This is important for understanding the forecast combination puzzle, which is about whether to include only the first CF factor or more.



The forecast target is output growth $y_{t+h} = 1200 \times \log(PI_{t+h}/PI_t)/h$. The out-of-sample forecasting period is 02/1995–01/2010. Panel A reports the Root Mean Squared Forecast Errors (the square root of a model's MSFE). Panel B reports the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS according to Definition 3, i.e., the ratio of the MSFEs of the two models. For simplicity of presentation, Panel B reports the relative supervision only for the same number of factors ($k_{CI} = k_{CF} = k$ for both PC and NS).

**Table 2.** Out-of-sample forecasting of CPI inflation.


The forecast target is inflation $y_{t+h} = 1200 \times \log(CPI_{t+h}/CPI_t)/h$. The out-of-sample forecasting period is 02/1995–01/2010. Panel A reports the Root Mean Squared Forecast Errors (the square root of a model's MSFE). Panel B reports the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS according to Definition 3, i.e., the ratio of the MSFEs of the two models. For simplicity of presentation, Panel B reports the relative supervision only for the same number of factors ($k_{CI} = k_{CF} = k$ for both PC and NS).

| Panel B (Relative Supervision) | *h* = 1 | *h* = 3 | *h* = 6 | *h* = 12 | *h* = 18 | *h* = 24 | *h* = 30 | *h* = 36 |
|---|---|---|---|---|---|---|---|---|
| CI-NS(*k* = 2) vs. CF-NS(*k* = 2) | 1.33 | 1.64 | 2.38 | 4.36 | 6.85 | 8.05 | 7.89 | 7.14 |
| CI-NS(*k* = 3) vs. CF-NS(*k* = 3) | 1.33 | 1.64 | 2.33 | 3.71 | 5.02 | 6.21 | 6.83 | 7.11 |

#### **6. Conclusions**

For forecasting in the presence of many predictors, it is often useful to reduce the dimension by a factor model (in a dense case) or by variable selection (in a sparse case). In this paper, we consider a factor model. In particular, we examine the supervised principal component analysis of Chan et al. (1999). We call the model CF-PC, as it combines forecasts (CF) via the principal components (PC) of many individual forecasts.

The CF-PC extracts factors from the space spanned by forecasts rather than from the space spanned by predictors. This factorization of the forecasts improves forecast performance compared to factor analysis of the predictors. We extend CF-PC to CF-NS, which uses the NS factor model in place of the PC factor model, for the application where the predictors are the yield curve. While the yield curve is functional data, consisting of many maturity points on a curve at each point in time, the NS factors parsimoniously capture the shapes of the curve.

We have applied the CF-PC and CF-NS models in forecasting output growth and inflation using a large number of bond yields to examine if the supervised factorization improves forecast performance. In general, we have found that CF-PC and CF-NS perform substantially better than CI-PC and CI-NS, that the advantage of supervised factor models is even larger for longer forecast horizons, and that the two alternative factor models based on PC and NS factors are similar and perform similarly.

**Author Contributions:** All authors contributed equally to the paper.

**Acknowledgments:** We would like to thank the two referees for helpful comments. We also thank Jonathan Wright and seminar participants at FRB of San Francisco, FRB of St Louis, Federal Reserve Board (Washington DC), Bank of Korea, SETA meeting, Stanford Institute of Theoretical Economics (SITE), University of Cambridge, NCSU, UC Davis, UCR, UCSB, UCSD, USC, Purdue, LSU, Indiana, Drexel, OCC, WMU, and SNU, for useful discussions and comments. All errors are our own. E.H. acknowledges support from the Danish National Research Foundation. The views presented in this paper are solely those of the authors and do not necessarily represent those of ICBCCS, the Federal Reserve Board or their staff.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Calculation of Absolute and Relative Supervision in Example 1**

Using $R$ and $\Sigma_k$ obtained from the SVD for CI in (36), and $S$ and $\Theta_k$ obtained from the SVD for CF in (40), we calculate the absolute supervision and relative supervision for each $k$. The CI factors are $F_{CI} = R\Sigma_k$ from (23), and the CF factors are $F_{CF} = S\Theta_k$ from (28).

For *k* = 1,

$$F_{\rm CI} = R\,\Sigma_1 = (1, 0, 0, 0, 0, 0)', \quad F_{\rm CF} = S\,\Theta_1 = (0, 0, 0, 0, 5, 0)',\tag{A1}$$

$$\hat{y}_{\text{CI-PC}} = R_1 R_1' y = (1, 0, 0, 0, 0, 0)',\tag{A2}$$

$$\hat{y}_{\text{CF-PC}} = S_1 S_1' y = (0, 0, 0, 0, 5, 0)',\tag{A3}$$

$$||y - \hat{y}_{\text{CI-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (1, 0, 0, 0, 0, 0)'||^2 = 54,\tag{A4}$$

$$||y - \hat{y}_{\text{CF-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (0, 0, 0, 0, 5, 0)'||^2 = 30. \tag{A5}$$

Hence, $s_{abs}(X, y, 1, 1) = ||y - \hat{y}_{\text{CI-PC}}||^2 - ||y - \hat{y}_{\text{CF-PC}}||^2 = 54 - 30 = 24$, and $s_{rel}(X, y, 1, 1) = ||y - \hat{y}_{\text{CI-PC}}||^2 / ||y - \hat{y}_{\text{CF-PC}}||^2 = 54/30 = 1.8$.

For *k* = 2,

$$F\_{\rm CI} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad F\_{\rm CF} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 5 & 0 \\ 0 & 4 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 4 \\ 5 & 0 \\ 0 & 0 \end{bmatrix}, \text{ (A6)}$$

$$\hat{y}_{\text{CI-PC}} = R_2 R_2' y = (1, 2, 0, 0, 0, 0)',\tag{A7}$$

$$\hat{y}_{\text{CF-PC}} = S_2 S_2' y = (0, 0, 0, 4, 5, 0)',\tag{A8}$$

$$||y - \hat{y}_{\text{CI-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (1, 2, 0, 0, 0, 0)'||^2 = 50,\tag{A9}$$

$$||y - \hat{y}_{\text{CF-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (0, 0, 0, 4, 5, 0)'||^2 = 14.\tag{A10}$$

Hence, $s_{abs}(X, y, 2, 2) = ||y - \hat{y}_{\text{CI-PC}}||^2 - ||y - \hat{y}_{\text{CF-PC}}||^2 = 50 - 14 = 36$, and $s_{rel}(X, y, 2, 2) = ||y - \hat{y}_{\text{CI-PC}}||^2 / ||y - \hat{y}_{\text{CF-PC}}||^2 = 50/14 \approx 3.6$. For $k = 3$,

$$F_{\rm CI} = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{bmatrix} \begin{bmatrix} 1&0&0 \\ 0&\frac{1}{2}&0 \\ 0&0&\frac{1}{3} \\ 0&0&0 \\ 0&0&0 \\ 0&0&0 \end{bmatrix} = \begin{bmatrix} 1&0&0 \\ 0&\frac{1}{2}&0 \\ 0&0&\frac{1}{3} \\ 0&0&0 \\ 0&0&0 \\ 0&0&0 \end{bmatrix}, \quad F_{\rm CF} = \begin{bmatrix} 0&0&0&0&1&0 \\ 0&0&0&1&0&0 \\ 0&0&1&0&0&0 \\ 0&1&0&0&0&0 \\ 1&0&0&0&0&0 \\ 0&0&0&0&0&1 \end{bmatrix} \begin{bmatrix} 5&0&0 \\ 0&4&0 \\ 0&0&3 \\ 0&0&0 \\ 0&0&0 \\ 0&0&0 \end{bmatrix} = \begin{bmatrix} 0&0&0 \\ 0&0&0 \\ 0&0&3 \\ 0&4&0 \\ 5&0&0 \\ 0&0&0 \end{bmatrix}, \tag{A11}$$

$$\hat{y}_{\text{CI-PC}} = R_3 R_3' y = (1, 2, 3, 0, 0, 0)',\tag{A12}$$

$$\hat{y}_{\text{CF-PC}} = S_3 S_3' y = (0, 0, 3, 4, 5, 0)',\tag{A13}$$

$$||y - \hat{y}_{\text{CI-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (1, 2, 3, 0, 0, 0)'||^2 = 41,\tag{A14}$$

$$||y - \hat{y}_{\text{CF-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (0, 0, 3, 4, 5, 0)'||^2 = 5. \tag{A15}$$

Hence, $s_{abs}(X, y, 3, 3) = ||y - \hat{y}_{\text{CI-PC}}||^2 - ||y - \hat{y}_{\text{CF-PC}}||^2 = 41 - 5 = 36$, and $s_{rel}(X, y, 3, 3) = ||y - \hat{y}_{\text{CI-PC}}||^2 / ||y - \hat{y}_{\text{CF-PC}}||^2 = 41/5 = 8.2$.

For *k* = 4,

$$F_{\rm CI} = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{bmatrix} \begin{bmatrix} 1&0&0&0 \\ 0&\frac{1}{2}&0&0 \\ 0&0&\frac{1}{3}&0 \\ 0&0&0&\frac{1}{4} \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix} = \begin{bmatrix} 1&0&0&0 \\ 0&\frac{1}{2}&0&0 \\ 0&0&\frac{1}{3}&0 \\ 0&0&0&\frac{1}{4} \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix}, \quad F_{\rm CF} = \begin{bmatrix} 0&0&0&0&1&0 \\ 0&0&0&1&0&0 \\ 0&0&1&0&0&0 \\ 0&1&0&0&0&0 \\ 1&0&0&0&0&0 \\ 0&0&0&0&0&1 \end{bmatrix} \begin{bmatrix} 5&0&0&0 \\ 0&4&0&0 \\ 0&0&3&0 \\ 0&0&0&2 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix} = \begin{bmatrix} 0&0&0&0 \\ 0&0&0&2 \\ 0&0&3&0 \\ 0&4&0&0 \\ 5&0&0&0 \\ 0&0&0&0 \end{bmatrix}, \tag{A16}$$

$$\hat{y}_{\text{CI-PC}} = R_4 R_4' y = (1, 2, 3, 4, 0, 0)',\tag{A17}$$

$$\hat{y}_{\text{CF-PC}} = S_4 S_4' y = (0, 2, 3, 4, 5, 0)',\tag{A18}$$

$$||y - \hat{y}_{\text{CI-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (1, 2, 3, 4, 0, 0)'||^2 = 25,\tag{A19}$$

$$||y - \hat{y}_{\text{CF-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (0, 2, 3, 4, 5, 0)'||^2 = 1. \tag{A20}$$

Hence, $s_{abs}(X, y, 4, 4) = ||y - \hat{y}_{\text{CI-PC}}||^2 - ||y - \hat{y}_{\text{CF-PC}}||^2 = 25 - 1 = 24$, and $s_{rel}(X, y, 4, 4) = ||y - \hat{y}_{\text{CI-PC}}||^2 / ||y - \hat{y}_{\text{CF-PC}}||^2 = 25/1 = 25$.

For $k = 5$, $s_{abs}(X, y, 5, 5) = y'(SS' - RR')y = 0$ because

$$||y - \hat{y}_{\text{CI-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (1, 2, 3, 4, 5, 0)'||^2 = 0,\tag{A21}$$

$$||y - \hat{y}_{\text{CF-PC}}||^2 = ||(1, 2, 3, 4, 5, 0)' - (1, 2, 3, 4, 5, 0)'||^2 = 0. \tag{A22}$$

Hence, as noted in Remark 4, $s_{abs}(X, y, 5, 5) = ||y - \hat{y}_{\text{CI-PC}}||^2 - ||y - \hat{y}_{\text{CF-PC}}||^2 = 0 - 0 = 0$, and $s_{rel}(X, y, 5, 5)$ is not defined for $k = N = 5$.
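The arithmetic in this appendix is easy to verify numerically. A minimal check, assuming only the $R$ and $S$ matrices implied by the displays above ($R$ the $6 \times 6$ identity and $S$ the permutation selecting components 5, 4, 3, 2, 1, 6):

```python
import numpy as np

y = np.array([1., 2., 3., 4., 5., 0.])
R = np.eye(6)                          # CI left singular vectors (Example 1)
S = np.eye(6)[:, [4, 3, 2, 1, 0, 5]]   # CF left singular vectors

for k in range(1, 5):
    Rk, Sk = R[:, :k], S[:, :k]
    e_ci = np.sum((y - Rk @ Rk.T @ y) ** 2)   # ||y - yhat_CI-PC||^2
    e_cf = np.sum((y - Sk @ Sk.T @ y) ** 2)   # ||y - yhat_CF-PC||^2
    print(k, e_ci, e_cf, e_ci - e_cf, e_ci / e_cf)
# prints 54/30, 50/14, 41/5 and 25/1, matching (A4)-(A20)
```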

#### **References**


Christensen, Jens, Francis Diebold, and Glenn Rudebusch. 2009. An Arbitrage-free Generalized Nelson–Siegel Term Structure Model. *Econometrics Journal* 12: C33–C64.


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Covariance Prediction in Large Portfolio Allocation**

#### **Carlos Trucíos 1,\*, Mauricio Zevallos 2, Luiz K. Hotta <sup>2</sup> and André A. P. Santos 3,4**


Received: 12 November 2018; Accepted: 2 May 2019; Published: 9 May 2019

**Abstract:** Many financial decisions, such as portfolio allocation, risk management, option pricing and hedging strategies, are based on forecasts of the conditional variances, covariances and correlations of financial returns. The paper presents an empirical comparison of several methods to predict one-step-ahead conditional covariance matrices. These matrices are used as inputs to obtain out-of-sample minimum variance portfolios based on stocks belonging to the S&P500 index from 2000 to 2017 and sub-periods. The analysis is done through several metrics, including standard deviation, turnover, net average return, information ratio and Sortino's ratio. We find that no method is the best in all scenarios, and the performance depends on the criterion, the period of analysis and the rebalancing strategy.

**Keywords:** Minimum variance portfolio; risk; shrinkage; S&P 500

**JEL Classification:** C13; C53; C58; G11

#### **1. Introduction**

Forecasting returns, volatilities and conditional correlations has attracted the interest of researchers and practitioners in finance since these factors are crucial, for example, in portfolio allocation, risk management, option pricing and hedging strategies; see, for instance, Engle (2009), Hlouskova et al. (2009) and Boudt et al. (2013) for some references.

A well-known stylised fact in multivariate time series of financial returns is that not only conditional variances but also conditional covariances and correlations evolve over time. To describe this evolution, several methods have been proposed in the literature. In general, these methods involve different ways to circumvent the issue of dimensionality. The treatment of this problem is vital for the estimation of large portfolios (composed of hundreds or thousands of assets). As noted by Engle et al. (2017), when dealing with portfolios composed of a thousand time series, many multivariate GARCH models present unsatisfactory performance or computational problems in their estimation. For some multivariate GARCH models, estimation problems arise even for smaller dimensions; see, for instance, Laurent et al. (2012), Caporin and McAleer (2014), Caporin and Paruolo (2015) and de Almeida et al. (2018).

Our empirical application is based on an investor who adopts the minimum variance criterion in order to decide on portfolio allocations. A very large body of literature in portfolio optimization considers this particular policy; see, for instance, Clarke et al. (2006, 2011) for extensive practitioner-oriented studies on the performance and composition of minimum variance portfolios. This policy can be seen as a particular case of traditional mean-variance optimisation. The mean-variance problem, however, is known to be very sensitive to the estimation of mean returns (Frahm 2010; Jagannathan and Ma 2003).<sup>1</sup> Very often, the estimation error in the mean returns degrades the overall portfolio performance and introduces an undesirable level of portfolio turnover. In fact, existing evidence suggests that the performance of optimal portfolios that do not rely on estimated mean returns is usually better; see DeMiguel et al. (2009).

To obtain the minimum variance portfolio, the key input is the estimate of the conditional covariance matrix. As far as we know, there are few works in the literature comparing estimators of this matrix for large portfolios, with Creal et al. (2011), Hafner and Reznikova (2012), Engle et al. (2017), Nakagawa et al. (2018) and Moura and Santos (2018) being especially relevant. Given the myriad of models and methods in the literature to estimate the covariance matrix, empirical studies comparing estimators in large portfolios are most welcome.

The paper is intended to assess the performance of several methods to predict one-step-ahead conditional covariance matrices in large portfolios. This is done empirically, by comparing the out-of-sample performance of minimum variance portfolios based on S&P500 stocks traded from 2 January 2000 to 30 November 2017, using measures such as the average return (AV), standard deviation (SD), information ratio (IR), Sortino's ratio (SR) (Sortino and van der Meer 1991), turnover (TO) and average portfolio return net of transaction costs (AV*net*). Since not all stocks of the index were traded during the whole period, we consider portfolios of dimension *N* = 174 stocks. To assess the robustness of the results, we also analyse three sub-periods: the pre-crisis period (January 2004 to December 2007), the subprime crisis period (January 2008 to June 2009), and the post-crisis period (July 2009 to November 2017).

We consider several attractive methods and models, including recent proposals used by practitioners and academics to predict one-step-ahead conditional covariance matrices. They are selected mainly because they use different approaches to overcome the dimensionality problem. Specifically, the paper compares the DCC model as used in Engle et al. (2017), the DECO model of Engle and Kelly (2012), the OGARCH model of Alexander and Chibumba (1996), the RiskMetrics 1994 and RiskMetrics 2006 (Zumbach 2007) methods, and the generalised principal volatility components analysis (GPVC) proposed by Li et al. (2016) as a generalisation of the procedure of Hu and Tsay (2014); we also apply the robust version of the GPVC method proposed by Trucíos et al. (2019). DCC models are estimated using the composite likelihood, as advocated in Pakel et al. (2014). In addition, the linear shrinkage (LS) of Ledoit and Wolf (2004a) and the non-linear shrinkage (NLS) of Ledoit and Wolf (2012) are applied to all the previous methods. Therefore, compared to Engle et al. (2017), Hafner and Reznikova (2012) and Nakagawa et al. (2018), the set of competing methods is much larger, and the device of shrinkage is assessed for all the compared methods. We consider a total of 47 methods, including the equal-weighted portfolio strategy. This constitutes the main contribution of the paper.

The rest of the paper is organised as follows: Section 2 presents the methods and models used to predict the one-step-ahead conditional covariance matrix. It also presents the composite likelihood of Pakel et al. (2014) used to estimate the DCC model, as well as the shrinkage methods. The empirical application is given in Section 3. Section 4 concludes, and the list of estimation methods is given in Appendix A.

#### **2. The Forecast Methods**

Denote by $r_{i,t}$, $i = 1, \ldots, N$, $t = 1, \ldots, T$, the return of the $i$-th asset at time $t$, where $N$ is the number of assets under consideration to construct the portfolio and $T$ denotes the sample size. For simplicity, assume that $E(r_{i,t}|\mathcal{F}_{t-1}) = 0$, where $\mathcal{F}_{t-1}$ denotes the information available at time $t-1$. Let $\mathbf{r}_t = (r_{1,t}, \ldots, r_{N,t})'$; the conditional covariance matrix is defined as $\mathbf{H}_t = \mathrm{Cov}(\mathbf{r}_t|\mathcal{F}_{t-1})$

<sup>1</sup> See Wied et al. (2013) for a test for the presence of structural breaks in minimum variance portfolios.

with elements $h_{i,j,t} = \mathrm{Cov}(r_{i,t}, r_{j,t}|\mathcal{F}_{t-1})$. At time $t-1$, we are interested in estimating $\mathbf{H}_t$ in order to select a portfolio for the period $(t-1, t]$. In the following, we present some methods to estimate it.

#### *2.1. The RiskMetrics Methods*

One of the most popular methods used in risk analysis is the RiskMetrics method developed by the RiskMetrics Group at JP Morgan; we call this the RiskMetrics 1994 (RM1994) method. The main feature of the RiskMetrics method is that the predicted volatility is a linear function of the present and past squared returns. Although it has been widely used, it has some problems, and in order to overcome some of them, the same group developed the RM2006 method. Like the RM1994 method, the RM2006 method is data-oriented, in the sense that it was calibrated and tested to have good performance on the majority of the target empirical data, and it was developed to take into account some of the stylised facts and weaknesses detected in the RM1994 method. The main modifications are of three types. First, considering that volatility has a long memory feature, the weights decay logarithmically instead of exponentially, as happens in the RM1994 method. Second, the weights depend on the forecast horizon. Third, the conditional distribution of the returns is not multivariate Gaussian; the distribution is based on the estimated devolatilised residuals and can be roughly described as a Student-*t* distribution with scale correction. Finally, the return levels are modelled taking into account the lagged correlation between returns.
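For illustration, the RM1994 predictor is the familiar EWMA recursion $\mathbf{H}_t = \lambda\,\mathbf{H}_{t-1} + (1-\lambda)\,\mathbf{r}_{t-1}\mathbf{r}_{t-1}'$. A minimal sketch follows ($\lambda = 0.94$ is the standard RiskMetrics choice for daily data; the RM2006 weighting scheme is more elaborate and is not reproduced here):

```python
import numpy as np

def rm1994_cov(returns, lam=0.94):
    """One-step-ahead RM1994 (EWMA) covariance prediction."""
    r = np.asarray(returns, dtype=float)   # shape (T, N)
    H = np.cov(r.T)                        # initialize with the sample covariance
    for rt in r:                           # H_t = lam*H_{t-1} + (1-lam)*r_t r_t'
        H = lam * H + (1.0 - lam) * np.outer(rt, rt)
    return H                               # prediction of H_{T+1}
```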

#### *2.2. The CCC Model*

The constant conditional correlation model (Bollerslev 1990) is one of the simplest MGARCH models to estimate, since the variances are modelled independently and the covariances are obtained using the conditional standard deviations and a constant conditional correlation matrix. The conditional covariance matrix $\mathbf{H}_t$ evolves according to:

$$\mathbf{H}_t = \mathbf{D}_t \mathbf{R} \mathbf{D}_t, \tag{1}$$

$$\mathbf{D}_t = \mathrm{Diag}(d_{1,t}, \ldots, d_{N,t}), \tag{2}$$

$$\mathbf{R} = \mathrm{Diag}(\mathbf{H})^{-1/2}\,\mathbf{H}\,\mathrm{Diag}(\mathbf{H})^{-1/2}, \tag{3}$$

$$\mathbf{H} = \mathrm{Cov}(\mathbf{r}_t), \tag{4}$$

with $d^2_{i,t} = Var(r_{i,t}|\mathcal{F}_{t-1})$ (the marginal univariate conditional variances). The advantage of the CCC model is its easy estimation; the main disadvantage is the strong assumption that the conditional correlations are time-invariant. Engle (2002) extended this idea to dynamic conditional correlations, as detailed in the next section.
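A minimal sketch of the CCC construction (for brevity, an EWMA filter is used as a stand-in for the univariate GARCH fits of the marginal variances; this substitution and all names are ours):

```python
import numpy as np

def ccc_cov(returns, lam=0.94):
    """CCC sketch: H_t = D_t R D_t with a constant correlation matrix R."""
    r = np.asarray(returns, dtype=float)   # shape (T, N)
    R = np.corrcoef(r.T)                   # constant conditional correlation (Eq. (3))
    var = r.var(axis=0)
    for rt in r:                           # EWMA stand-in for univariate GARCH(1,1)
        var = lam * var + (1.0 - lam) * rt ** 2
    D = np.diag(np.sqrt(var))              # Eq. (2)
    return D @ R @ D                       # Eq. (1)
```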

#### *2.3. The DCC Model*

In this section, we describe the scalar DCC model of Engle (2002) as used in Pakel et al. (2014) and Engle et al. (2017), and the composite likelihood. The non-linear shrinkage method, which is also used to estimate the DCC model, is presented in Section 2.8. In the DCC model, the marginal univariate conditional variances $d^2_{i,t} = Var(r_{i,t}|\mathcal{F}_{t-1})$ are modelled first. Define the devolatilised residuals as $\mathbf{s}_t = (r_{1,t}/d_{1,t}, \ldots, r_{N,t}/d_{N,t})'$. We use the DCC model with correlation targeting as in Engle et al. (2017). The conditional covariance matrix $\mathbf{H}_t$ evolves according to:

$$\mathbf{H}_t = \mathbf{D}_t \mathbf{R}_t \mathbf{D}_t, \tag{5}$$

$$\mathbf{R}_t = \mathrm{Diag}(\mathbf{Q}_t)^{-1/2}\,\mathbf{Q}_t\,\mathrm{Diag}(\mathbf{Q}_t)^{-1/2}, \tag{6}$$

$$\mathbf{Q}_t = (1 - \alpha - \beta)\,\mathbf{C} + \alpha\,\mathbf{s}_{t-1}\mathbf{s}_{t-1}' + \beta\,\mathbf{Q}_{t-1}, \tag{7}$$

where $\mathbf{D}_t$ is a diagonal matrix with $i$-th diagonal element $d_{i,t}$, $\mathbf{C} = Corr(\mathbf{r}_t) = Cov(\mathbf{s}_t)$ is the unconditional correlation matrix, and $\mathbf{R}_t = Corr(\mathbf{r}_t|\mathcal{F}_{t-1}) = Cov(\mathbf{s}_t|\mathcal{F}_{t-1})$ is the conditional correlation matrix at time $t$. The parameters $\alpha$ and $\beta$ are non-negative with $\alpha + \beta < 1$. We have

$$\mathbf{r}_t | \mathcal{F}_{t-1} \sim WS(\mathbf{0}, \mathbf{H}_t), \tag{8}$$

where *WS*(0, **H***t*) means a multivariate distribution with mean zero and covariance matrix **H***t*.

The model is usually estimated in three stages, with each stage conditional on the estimates found in the previous stages. The stages are: (1) estimate $\mathbf{D}_t$, usually assuming a GARCH(1,1) model, for each $t = 1, \ldots, T$, and evaluate the devolatilised residuals; (2) select an estimator of the correlation target matrix $\mathbf{C}$ using the devolatilised residuals; and (3) estimate the parameters $\alpha$ and $\beta$. We comment on stage one in the application section and on stage two in Section 2.8. In the third stage, even with only two parameters, one may face estimation problems with a large number of assets, because it is necessary to invert the conditional covariance matrix $\mathbf{H}_t$ for each $t = 1, \ldots, T$. One way to overcome this problem is to use the composite (log-)likelihood.<sup>2</sup> This method was proposed in the 2008 version of Pakel et al. (2014); in the 2014 version, they showed that the estimators of $\alpha$ and $\beta$ obtained by maximizing the composite likelihood are consistent, although not efficient. They evaluate the composite likelihood by summing the likelihoods of all contiguous pairs. Thus, there are only $N - 1$ bivariate terms, and for any contiguous pair it is only necessary to invert a matrix of order two. For instance, let $\mathbf{r}^{(i)} = (r_{i,1}, \ldots, r_{i,T})'$, $i = 1, \ldots, N$, be the series of returns of the $i$-th asset, and denote by $l_i(\alpha, \beta; \mathbf{r}^{(i)}, \mathbf{r}^{(i+1)})$ the likelihood of the pair $(\mathbf{r}^{(i)}, \mathbf{r}^{(i+1)})$, $i = 1, \ldots, N-1$, assuming that each pair comes from a bivariate DCC model defined as in Equations (5)–(7). Then, the composite likelihood is given by:

$$CL(\alpha, \beta; \mathbf{r}^{(i)}, i = 1, \ldots, N) = \sum_{i=1}^{N-1} l_i(\alpha, \beta; \mathbf{r}^{(i)}, \mathbf{r}^{(i+1)}). \tag{9}$$

Engle et al. (2017) argue that the estimator of the conditional covariance matrix given by the DCC model using composite likelihood in stage three with the estimation of the unconditional correlation matrix using non-linear shrinkage in stage two is robust against model misspecification in large dimensions (*N*).
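A sketch of the composite likelihood in Equation (9) for the scalar DCC (our simplified illustration: Gaussian bivariate terms up to a constant, correlation targeting with the sample correlation of each contiguous pair, and the recursion initialised at $\mathbf{Q}_1 = \mathbf{C}$):

```python
import numpy as np

def composite_loglik(alpha, beta, s):
    """Composite log-likelihood for the scalar DCC over contiguous pairs.

    s: (T, N) devolatilised residuals from the first estimation stage.
    """
    T, N = s.shape
    cl = 0.0
    for i in range(N - 1):
        p = s[:, [i, i + 1]]                  # contiguous pair (i, i+1)
        C = np.corrcoef(p.T)                  # 2x2 correlation target
        Q = C.copy()
        for t in range(T):
            d = np.sqrt(np.diag(Q))
            rho = (Q / np.outer(d, d))[0, 1]  # R_t from Q_t, Eq. (6)
            det = 1.0 - rho ** 2
            a, b = p[t]
            cl -= 0.5 * (np.log(det) + (a * a - 2 * rho * a * b + b * b) / det)
            Q = (1 - alpha - beta) * C + alpha * np.outer(p[t], p[t]) + beta * Q
    return cl
```

The estimates of $\alpha$ and $\beta$ would then be obtained by maximising this function (e.g., by minimising its negative with a numerical optimiser) subject to $\alpha, \beta \ge 0$ and $\alpha + \beta < 1$.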

#### *2.4. The DECO Model*

Engle and Kelly (2012) propose the dynamic equicorrelation (DECO) model as a trade-off between a model that imposes many restrictions on the covariance matrix and a less structured model. They contend that imposing too much structure can lead to efficient estimation when the restrictions are correct, but can break down in the presence of misspecification; on the other hand, the lack of restrictions may lead to the dimensionality problem. Considering this trade-off, they propose a model in which the cross-correlations between any pair of returns are equal on the same day but can vary over time. In addition, as in the CCC and DCC models, the DECO model assumes that the marginals are modelled by a univariate volatility model. Using the same notation, we have $d^2_{i,t} = Var(r_{i,t}|\mathcal{F}_{t-1})$, and the covariance matrix is written as $\mathbf{H}_t = \mathbf{D}_t\mathbf{R}_t\mathbf{D}_t$ as in Equation (5). The equicorrelation matrix is given by:

$$\mathbf{R}_t = (1 - \rho_t)\mathbf{I}_N + \rho_t\mathbf{J}_N, \tag{10}$$

where $\rho_t$ is the equicorrelation, $\mathbf{I}_N$ denotes the $N$-dimensional identity matrix and $\mathbf{J}_N$ is the $N \times N$ matrix of ones. According to Engle and Kelly (2012), $\mathbf{R}_t^{-1}$ exists if and only if $\rho_t \neq 1$ and

<sup>2</sup> From now on, we refer to the log-likelihood simply as the likelihood.

$\rho_t \neq -1/(N-1)$, and $\mathbf{R}_t$ is positive definite if and only if $\rho_t \in (-1/(N-1), 1)$. The evaluation of the likelihood is easy because we have closed forms for $\mathbf{R}_t^{-1}$ and $\det(\mathbf{R}_t)$, given by:

$$\mathbf{R}_t^{-1} = \frac{1}{1 - \rho_t}\mathbf{I}_N - \frac{\rho_t}{(1 - \rho_t)(1 + [N - 1]\rho_t)}\mathbf{J}_N, \tag{11}$$

and

$$\det(\mathbf{R}_t) = (1 - \rho_t)^{N-1}\left[1 + (N - 1)\rho_t\right], \tag{12}$$

respectively. This description of the DECO model corresponds to a single block. The DECO model can also be used considering many blocks, as described in Engle and Kelly (2012).
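The closed forms (11) and (12) are what make the DECO likelihood cheap to evaluate; they are easily verified against dense linear algebra, as in this small check (names are ours):

```python
import numpy as np

def deco_inv_det(rho, N):
    """Closed-form inverse and determinant of R_t = (1-rho)I + rho*J, Eqs. (11)-(12)."""
    I, J = np.eye(N), np.ones((N, N))
    Rinv = I / (1 - rho) - rho * J / ((1 - rho) * (1 + (N - 1) * rho))
    det = (1 - rho) ** (N - 1) * (1 + (N - 1) * rho)
    return Rinv, det

N, rho = 5, 0.3
R = (1 - rho) * np.eye(N) + rho * np.ones((N, N))
Rinv, det = deco_inv_det(rho, N)
assert np.allclose(Rinv, np.linalg.inv(R)) and np.isclose(det, np.linalg.det(R))
```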

#### *2.5. The OGARCH Model*

Alexander and Chibumba (1996) propose the Orthogonal GARCH (OGARCH) model, a dimension reduction technique to model the conditional covariance matrix. The model simplifies the problem of modelling an $N$-dimensional system into modelling $k$ orthogonal components obtained through principal component analysis ($k \le N$). Since the components are orthogonal, the conditional covariance matrix of the whole system can be obtained as:

$$\mathbf{H}_t = \mathbf{A}\mathbf{D}_t\mathbf{A}' + \mathbf{V}_{\varepsilon}, \tag{13}$$

where $\mathbf{A}$ is an $N \times k$ matrix whose columns are the normalised eigenvectors of the unconditional covariance matrix, $\mathbf{D}_t$ is a diagonal matrix whose elements are the conditional variances of the $k$ principal orthogonal components associated with the $k$ largest eigenvalues, and $\mathbf{V}_{\varepsilon}$ is the covariance matrix of the errors, which can be ignored. The conditional variances of each component can be modelled by a GARCH-type model.

Alexander and Chibumba (1996) and Alexander (2002) emphasise the importance of using a number of components $k$ much smaller than $N$. However, Bauwens et al. (2006) and Becker et al. (2015) suggest using $k = N$ to avoid problems related to the inverse of $\mathbf{H}_t$. The OGARCH model with $k = N$ is a particular case of the GO-GARCH model (Van der Weide 2002).
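A minimal sketch of the OGARCH recipe in Equation (13), again with an EWMA stand-in for the component GARCH fits and ignoring $\mathbf{V}_{\varepsilon}$, as the text allows (names are ours):

```python
import numpy as np

def ogarch_cov(returns, k, lam=0.94):
    """OGARCH sketch: H_t ~ A D_t A' with A the top-k eigenvectors."""
    r = np.asarray(returns, dtype=float)        # shape (T, N)
    r = r - r.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(r.T))
    A = vecs[:, np.argsort(vals)[::-1][:k]]     # eigenvectors of the k largest eigenvalues
    f = r @ A                                   # orthogonal principal components
    var = f.var(axis=0)
    for ft in f:                                # EWMA stand-in for component GARCH(1,1)
        var = lam * var + (1.0 - lam) * ft ** 2
    return A @ np.diag(var) @ A.T               # V_eps ignored, as in Eq. (13)
```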

#### *2.6. The Generalised Principal Volatility Components Model*

The generalised principal volatility components (GPVC) procedure is a dimension reduction technique recently proposed by Li et al. (2016), which decomposes a series into two groups of volatility components. The first group corresponds to a small number of components whose volatility evolves over time, while the second corresponds to components whose volatility is constant over time. The GPVC procedure considers an orthogonal matrix $\mathbf{M} = [\mathbf{A} : \mathbf{B}]$ and decomposes an $N$-dimensional vector $\mathbf{y}_t = (y_{1t}, \ldots, y_{Nt})'$ with $E(\mathbf{y}_t|\mathcal{F}_{t-1}) = \mathbf{0}$ into:

$$\mathbf{y}_t = \mathbf{M}\mathbf{M}'\mathbf{y}_t = (\mathbf{A}\mathbf{A}' + \mathbf{B}\mathbf{B}')\mathbf{y}_t = \mathbf{A}\mathbf{f}_t + \boldsymbol{\varepsilon}_t, \tag{14}$$

with $\mathbf{f}_t = \mathbf{A}'\mathbf{y}_t$ and $\boldsymbol{\varepsilon}_t = \mathbf{B}\mathbf{B}'\mathbf{y}_t$. The matrix $\mathbf{M}$ is obtained through the eigenvalue decomposition $\mathbf{G}\mathbf{M} = \mathbf{M}\boldsymbol{\Lambda}$, where $\boldsymbol{\Lambda}$ is a diagonal matrix whose elements are the eigenvalues in decreasing order and $\mathbf{M}$ is the associated matrix of normalised eigenvectors. The columns of the matrices $\mathbf{A}$ and $\mathbf{B}$ are the eigenvectors associated with the non-zero and zero eigenvalues, respectively, obtained from the eigenvalue decomposition of the matrix $\mathbf{G}$. In practice, $\mathbf{G}$ is given by:

$$\mathbf{G} = \sum_{k=1}^{g} \sum_{\tau=1}^{T} \omega(\mathbf{y}_\tau)\, E^2\!\left[\left(\mathbf{y}_t\mathbf{y}_t' - \boldsymbol{\Sigma}\right) I(\|\mathbf{y}_{t-k}\| \le \|\mathbf{y}_\tau\|)\right], \tag{15}$$

where $g$ is a positive integer that gives the maximum lag order considered, $\omega(\cdot)$ is a weight function, $\boldsymbol{\Sigma}$ is the unconditional covariance matrix and $\|\cdot\|$ is the $L_1$ norm. Then, after some calculations, the conditional covariance matrix can be obtained as:

$$\mathbf{H}_t = \mathbf{A}\mathbf{H}_t^{f}\mathbf{A}' + \mathbf{A}\mathbf{A}'\boldsymbol{\Sigma}\mathbf{B}\mathbf{B}' + \mathbf{B}\mathbf{B}'\boldsymbol{\Sigma}, \tag{16}$$

where $\mathbf{H}_t^f$ is the conditional covariance matrix of the volatility components whose volatility evolves over time, and the remaining terms are as defined previously.<sup>3</sup> The matrix $\mathbf{G}$ is estimated as:

$$\hat{\mathbf{G}} = \sum_{k=1}^{g} \sum_{\tau=1}^{T} \omega(\mathbf{y}_\tau) \left[\frac{1}{T-k}\sum_{t=k+1}^{T} \left(\mathbf{y}_t\mathbf{y}_t' - \hat{\boldsymbol{\Sigma}}\right) I(\|\mathbf{y}_{t-k}\| \le \|\mathbf{y}_\tau\|)\right]^2. \tag{17}$$

The estimated version of Equation (16) is obtained by replacing the true values with the estimated ones.
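A sketch of the estimator $\hat{\mathbf{G}}$ in Equation (17) follows; since the weight function $\omega(\cdot)$ is not specified above, a constant $1/T$ is used purely for illustration, and all names are ours:

```python
import numpy as np

def gpvc_G(y, g):
    """Sample matrix G-hat of Equation (17), with omega(.) = 1/T for illustration."""
    y = np.asarray(y, dtype=float)
    T, N = y.shape
    Sigma = np.cov(y.T, bias=True)              # unconditional covariance estimate
    norms = np.abs(y).sum(axis=1)               # L1 norms ||y_t||
    G = np.zeros((N, N))
    for k in range(1, g + 1):
        # (y_t y_t' - Sigma) for t = k+1, ..., T
        outer = np.einsum('ti,tj->tij', y[k:], y[k:]) - Sigma
        for tau in range(T):
            ind = norms[:T - k] <= norms[tau]   # I(||y_{t-k}|| <= ||y_tau||)
            M = outer[ind].sum(axis=0) / (T - k)
            G += (1.0 / T) * (M @ M)            # matrix square of the symmetric M
    return G
```

The columns of $\mathbf{A}$ and $\mathbf{B}$ would then be taken as the eigenvectors of $\hat{\mathbf{G}}$ associated with its largest and (near-)zero eigenvalues, respectively.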

#### *2.7. The Robust GPVC Model*

Trucíos et al. (2019) show the non-robustness of the GPVC procedure of Li et al. (2016) and propose an alternative procedure to obtain volatility components that is robust to outliers. This procedure is based on a robust estimator of the unconditional covariance matrix, a weighted estimator of $E\left[(\mathbf{y}_t\mathbf{y}_t' - \boldsymbol{\Sigma})\, I(\|\mathbf{y}_{t-k}\| \le \|\mathbf{y}_t\|)\right]$, and robustified filters. The matrix in (17) is replaced by a less outlier-sensitive matrix, defined as:

$$\hat{\mathbf{G}}^{R} = \sum_{k=1}^{g} \sum_{\tau=1}^{T} \omega(\mathbf{y}_\tau) \left[\sum_{t=k+1}^{T} \delta^*(d_t^2) \left\{\left(\mathbf{y}_t\mathbf{y}_t' - \hat{\boldsymbol{\Sigma}}^R\right) I(\|\mathbf{y}_{t-k}\| \le \|\mathbf{y}_\tau\|)\right\}\right]^2, \tag{18}$$

where $d_t^2$ is the robust squared Mahalanobis distance given by $d_t^2 = (\mathbf{y}_t - \hat{\boldsymbol{\mu}}^R)'\,\hat{\boldsymbol{\Sigma}}_t^{-1}(\mathbf{y}_t - \hat{\boldsymbol{\mu}}^R)$ with $\hat{\boldsymbol{\Sigma}}_t = 0.01\,\rho(\mathbf{y}_{t-1}\mathbf{y}_{t-1}') + 0.99\,\hat{\boldsymbol{\Sigma}}_{t-1}$, $\hat{\boldsymbol{\Sigma}}_1 = \hat{\boldsymbol{\Sigma}}^R$, and $\hat{\boldsymbol{\mu}}^R$, $\hat{\boldsymbol{\Sigma}}^R$ being robust estimates of the unconditional mean and covariance matrix. Trucíos et al. (2019) use the minimum covariance determinant (MCD) estimator of Rousseeuw (1984), implemented by the algorithm of Hubert et al. (2012). The robust filters $\rho(\cdot)$ and $\delta(\cdot)$ are given by $\rho(x_t) = x_t$ if $d_t^2 \le c$ and $\rho(x_t) = \hat{\boldsymbol{\Sigma}}^R$ if $d_t^2 > c$; $\delta(x) = 1$ if $x \le c$ and $\delta(x) = 1/x$ if $x > c$; and $\delta^*(\cdot) = \delta(\cdot)/\|\delta(\cdot)\|$, where $\|\cdot\|$ is the $L_1$ norm. For details, see Trucíos et al. (2019).

To avoid returns corresponding to periods of high volatility being treated as possible outliers, the robust procedure incorporates into the squared Mahalanobis distance a covariance matrix that evolves over time, which can be seen as a robust RM1994 method with $\lambda = 0.99$.

Finally, the conditional covariance matrix **H***t* is obtained as in Equation (16).

#### *2.8. Linear and Non-Linear Shrinkage*

Besides the estimation of the covariance matrix (**H***t*), in some of the aforementioned models, we have to estimate the unconditional covariance or correlation matrix; for instance, the matrix **C** in Equation (7) of the DCC model. Generally, the estimation of the unconditional correlation (covariance) matrix is done using the sample correlation (covariance) matrix. However, this is inefficient in the large dimensional case because we could end up with a number of parameters with the same order of magnitude as the dataset, or even larger (see, for instance, the simulation study in the Appendix of Engle et al. (2017)). In general, comparing the eigenvalues of the true correlation matrix with the eigenvalues of the sample correlation matrix, there is a tendency to underestimate the smaller eigenvalues and overestimate the larger ones. A natural way to reduce this bias is to increase the smaller eigenvalues and decrease the larger sample eigenvalues and then reconstruct the estimate of the

<sup>3</sup> Note that when $\boldsymbol{\Sigma} = \mathbf{I}$, $\mathbf{H}_t = \mathbf{A}\mathbf{H}_t^f\mathbf{A}' + \mathbf{B}\mathbf{B}'\boldsymbol{\Sigma} = \mathbf{A}\mathbf{H}_t^f\mathbf{A}' + \boldsymbol{\Sigma}_{\varepsilon}$, as presented in Li et al. (2016).

correlation matrix. This is the main idea behind the shrinkage method. Engle et al. (2017) analyse the use of three types of shrinkage: the linear shrinkage of Ledoit and Wolf (2004b), with shrinkage target given by (a multiple of) the identity matrix; the linear shrinkage of Ledoit and Wolf (2004a), with shrinkage target given by the equicorrelation matrix; and the non-linear shrinkage of Ledoit and Wolf (2012), for the estimation of the unconditional correlation matrix in Equation (7). Using simulation, they conclude that all three types of shrinkage perform better than the sample correlation matrix in the estimation of $\mathbf{H}_t$, with the best performance obtained by the non-linear shrinkage; the improvement generally increases with the number of assets. In their application, they also apply the non-linear shrinkage to the estimated one-step-ahead conditional covariance matrix, which is not done in their simulation study. In the empirical application, they construct global minimum variance portfolios of sizes 100, 500 and 1000, updated monthly. As in the simulation study, they construct portfolios with $\mathbf{H}_t$ modelled by the DCC and CCC models and the RiskMetrics 2006 method. However, besides applying the linear and non-linear shrinkage to the target correlation matrix, they also apply the shrinkages to the one-step-ahead prediction of the volatility matrix. The best performance is achieved by the DCC model with the non-linear shrinkage applied only to the estimation of the intercept matrix, followed by the non-linear shrinkage applied both to the intercept matrix and to the one-step-ahead prediction matrix. We use the linear shrinkage towards the equicorrelation matrix because, in Engle et al. (2017), it presented slightly better performance than the shrinkage towards the identity matrix, although the estimator does not belong to the class of rotation-equivariant estimators.

For a light introduction to the main idea behind shrinkage, suppose we want to estimate the covariance matrix **Σ** and we have an estimate **C**ˆ based on a sample of size *T*. For instance, **C**ˆ could be the sample covariance matrix and **Σ**, the population matrix (unconditional covariance matrix). This is the case of the estimation of the DCC, where **Σ** is the intercept matrix. When the ratio *N*/*T*, called concentration ratio, becomes large, we have in-sample overfitting due to the excessive number of parameters, introducing a bias in the estimation of the eigenvalues. One way to correct this problem is through the shrinkage method.

For the linear shrinkage towards the equicorrelation matrix, denote by $\hat{c}_{ij}$ the elements of the estimate $\hat{\mathbf{C}}$. The mean of the estimated correlations is given by:

$$\bar{r} = \frac{2}{(N-1)N} \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \frac{\hat{c}_{i,j}}{\sqrt{\hat{c}_{i,i}\,\hat{c}_{j,j}}}, \tag{19}$$

such that for the target matrix $F$ we have $f_{i,i} = \hat{c}_{i,i}$ and $f_{i,j} = \bar{r}\sqrt{\hat{c}_{i,i}\,\hat{c}_{j,j}}$. The shrinkage estimate is given by:

$$\hat{\boldsymbol{\Sigma}}_{Shrink} = \delta F + (1 - \delta)\hat{\mathbf{C}}, \tag{20}$$

where the shrinkage intensity $\delta$ minimizes the expected quadratic loss, as in Ledoit and Wolf (2004a). Define the quadratic loss function

$$L(\delta) = ||\delta F + (1 - \delta)\hat{\mathbf{C}} - \boldsymbol{\Sigma}||^2.$$

Ledoit and Wolf (2004a) propose to use the shrinkage intensity that minimizes the risk function $R(\delta) = E(L(\delta))$. The formulae and the derivation of the estimated shrinkage intensity can be found in Appendix B of Ledoit and Wolf (2004a).
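A sketch of the linear shrinkage towards the equicorrelation target, Equations (19) and (20); the optimal intensity $\delta$ involves estimated asymptotic moments, as in Ledoit and Wolf (2004a), and is not reproduced here, so it is passed in by the user (names are ours):

```python
import numpy as np

def ls_equicorr(C_hat, delta):
    """Linear shrinkage delta*F + (1-delta)*C_hat towards the equicorrelation target."""
    C_hat = np.asarray(C_hat, dtype=float)
    N = C_hat.shape[0]
    d = np.sqrt(np.diag(C_hat))
    corr = C_hat / np.outer(d, d)
    r_bar = (corr.sum() - N) / (N * (N - 1))   # mean off-diagonal correlation, Eq. (19)
    F = r_bar * np.outer(d, d)                 # f_ij = r_bar * sqrt(c_ii * c_jj)
    np.fill_diagonal(F, np.diag(C_hat))        # f_ii = c_ii
    return delta * F + (1 - delta) * C_hat     # Eq. (20)
```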

Regarding the non-linear shrinkage, let $\hat{\mathbf{C}}$ have dimension $N \times N$, let $(\hat{\lambda}_1, \ldots, \hat{\lambda}_N)$, sorted in descending order, be its eigenvalues, and let $(\hat{\mathbf{u}}_1, \ldots, \hat{\mathbf{u}}_N)$ be the corresponding eigenvectors, such that:

$$
\hat{\mathbf{C}} = \sum\_{i=1}^{N} \hat{\lambda}\_i \hat{\mathbf{u}}\_i \hat{\mathbf{u}}\_i'. \tag{21}
$$

For an investor holding a portfolio with weights $\omega$, the estimated variance is given by $\omega'\hat{\mathbf{C}}\omega$. The non-linear shrinkage of Ledoit and Wolf (2012) is a transformation from $(\hat{\lambda}_1, \ldots, \hat{\lambda}_N)$ to $\tilde{\lambda} = (\tilde{\lambda}_1, \ldots, \tilde{\lambda}_N)$ such that replacing $\hat{\lambda}_i$ with $\tilde{\lambda}_i$ in Equation (21) gives a consistent estimator of the out-of-sample variance $\omega'\boldsymbol{\Sigma}\omega$. Denote by $\lambda = (\lambda_1, \ldots, \lambda_N)$ the eigenvalues of $\boldsymbol{\Sigma}$ in descending order. Ledoit and Wolf (2012) define QuEST functions $(q_1(\lambda), \ldots, q_N(\lambda))$ such that $\tilde{\lambda}$ minimizes the Euclidean distance between the QuEST functions and the sample eigenvalues, i.e.:

$$\tilde{\lambda} = \arg\min\_{\lambda \in [0,\infty)^N} \sum\_{i=1}^N [q\_i(\lambda) - \hat{\lambda}\_i]^2. \tag{22}$$

A definition of the QuEST functions and a rigorous exposition of non-linear shrinkage can be found in Ledoit and Wolf (2012), while a lighter presentation can be found in the Supplementary Material of Engle et al. (2017).

#### **3. Empirical Application**

#### *3.1. Data and Methods*

In this section, we implement the procedures described in Section 2 and use the predicted one-step-ahead conditional covariance matrix to construct the minimum variance portfolio (MVP) of the stocks used in the composition of the S&P 500 index, traded from 2 January 2000 to 30 November 2017. Because not all stocks of the index were traded during the whole period, we ended up with *N* = 174 stocks.

To evaluate the out-of-sample portfolio performance, we consider a rolling window scheme. The out-of-sample portfolio performance is evaluated in four different periods, namely: pre-crisis period (January 2004 to December 2007, 1008 days), subprime crisis period (January 2008 to June 2009, 378 days), post-crisis period (July 2009 to November 2017, 2218 days), and full period (January 2004 to November 2017, 3503 days). In each window, the one-step-ahead covariance matrix is estimated and the MVP values with and without short-sale constraints are obtained. The weights in the MVP portfolio are rebalanced with both daily and monthly frequencies. In the latter case, we follow Engle et al. (2017), that is, we obtain the portfolio returns daily but update the weights monthly (following the common convention we use 21 consecutive trading days as a month). Monthly updating is common in practice to reduce transaction costs.
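For reference, given a covariance prediction $\hat{\mathbf{H}}_{t+1}$, the unconstrained minimum variance weights have the closed form $w = \hat{\mathbf{H}}^{-1}\mathbf{1}/(\mathbf{1}'\hat{\mathbf{H}}^{-1}\mathbf{1})$, while the short-sale-constrained problem is a small quadratic program. A sketch (our illustration, delegating the constrained case to a generic solver):

```python
import numpy as np
from scipy.optimize import minimize

def mvp_weights(H, short_sales=True):
    """Minimum variance portfolio weights from a covariance prediction H."""
    N = H.shape[0]
    if short_sales:                            # closed form; weights may be negative
        w = np.linalg.solve(H, np.ones(N))
        return w / w.sum()
    res = minimize(lambda w: w @ H @ w,        # min w'Hw s.t. sum(w)=1, w>=0
                   np.full(N, 1.0 / N),
                   jac=lambda w: 2 * H @ w,
                   bounds=[(0.0, None)] * N,
                   constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}])
    return res.x
```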

The procedures described in Section 2 are combined with the linear and non-linear shrinkage estimators described in Section 2.8. The linear and non-linear shrinkage are applied at the beginning and/or at the end of the estimation procedure. A detailed description of each combination of the estimation procedures is given in Appendix A. In addition, for the sake of comparison, we also implement the naive equal-weighted portfolio. In line with Engle et al. (2017), Gambacciani and Paolella (2017), and Trucíos et al. (2018), among others, we consider the following annualised out-of-sample performance measures. Denote by $R_p = \{r_{p,1}, \ldots, r_{p,k}\}$ the observed out-of-sample returns from a given method, where $k$ is the length of the out-of-sample period. The measures considered in this paper, namely the annualised average portfolio return (AV), the standard deviation of portfolio returns (SD), the information ratio (IR), Sortino's ratio (SR) and the average turnover (TO), are computed as follows:

AV: equal to $252 \times \bar{R}_p$, where $\bar{R}_p$ is the average of the elements of $R_p$.

SD: equal to $\sqrt{252} \times S_p$, where $S_p$ is the standard deviation of the elements of $R_p$.

IR: AV/SD.

SR: equal to $AV/\sqrt{252 \times S^{*2}}$, where $S^{*2}$ is the mean of $r^*_{p,i}$, $i = 1, \ldots, k$, with $r^*_{p,i} = r^2_{p,i}$ if $r_{p,i}$ is less than the minimal acceptable return, which is taken to be zero, and $r^*_{p,i} = 0$ otherwise.

TO: equal to $k^{-1}\sum_{t=2}^{k}\sum_{j=1}^{N}|\omega_{j,t} - \omega_{j,t-1}|$, where $\omega_{j,t}$ is the portfolio weight at time $t$ for the $j$-th asset and $k$ is the number of out-of-sample portfolio returns.

As pointed out by Kirby and Ostdiek (2012), Santos and Ferreira (2017), and Olivares-Nadal and DeMiguel (2018), among others, transaction costs $c$ can have an impact on the portfolio's performance. In order to take those costs into account, we also compute the portfolio returns net of transaction costs. For a given $c$, the portfolio return net of transaction costs at time $t$ is given by $r^{net}_{p,t} = (1 - c \times turnover_t)(1 + r_{p,t}) - 1$, and the annualised average portfolio return net of transaction costs is $AV^{net} = 252 \times \bar{R}^{net}_p$, where $\bar{R}^{net}_p$ is the average of the portfolio returns net of transaction costs $r^{net}_{p,1}, \ldots, r^{net}_{p,k}$. We consider $c = 20$ bp (intermediate) and $c = 50$ bp (high) transaction costs, where a basis point (bp) is a unit of measure commonly used in finance, equivalent to 0.01%. The annualised average portfolio returns net of transaction costs for $c = 20$ bp and $c = 50$ bp are denoted by $AV^{net}_{20bp}$ and $AV^{net}_{50bp}$, respectively.
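A sketch of these performance measures, computed from an out-of-sample portfolio return series and the corresponding weight paths (our illustration; the minimal acceptable return in SR is taken to be zero, as in the text, and `c = 0.002` corresponds to the 20 bp cost level):

```python
import numpy as np

def performance(rp, weights, c=0.0):
    """Annualised AV, SD, IR, SR, TO and AV_net for out-of-sample MVP returns.

    rp: (k,) portfolio returns; weights: (k, N) weight paths; c: cost per unit turnover.
    """
    rp = np.asarray(rp, dtype=float)
    k = len(rp)
    AV = 252 * rp.mean()
    SD = np.sqrt(252) * rp.std(ddof=1)
    IR = AV / SD
    S_star2 = np.where(rp < 0.0, rp ** 2, 0.0).mean()     # mean of r*_{p,i}
    SR = AV / np.sqrt(252 * S_star2)
    turnover = np.abs(np.diff(weights, axis=0)).sum(axis=1)
    TO = turnover.sum() / k                               # k^{-1} sum over t = 2..k
    r_net = (1 - c * turnover) * (1 + rp[1:]) - 1         # returns net of costs
    AV_net = 252 * r_net.mean()
    return AV, SD, IR, SR, TO, AV_net
```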

#### *3.2. Results*

Tables 1–8 report annualised out-of-sample performance measures for the MVP in the pre-crisis, crisis, post-crisis and full periods. Tables 1–4 report the results for daily rebalanced portfolios, whereas Tables 5–8 report the results for monthly rebalanced portfolios. We also have results for the MVP with no short-sale constraints; however, in this paper we focus on the results for the MVP with short-sale constraints and give a short summary of the main findings for the case without short-sale constraints. A detailed analysis of the case without short-sale constraints is given in the Supplementary Material.

In Tables 1–8 we report (in parentheses) the rank of the methods according to the SD criterion in the second column. Moreover, for each criterion, the best five methods are highlighted in shaded cells. The equal-weighted portfolio strategy is denoted by 1/$N$.

Since the portfolios are chosen to have minimum variance, the analysis is first carried out according to the SD criterion. Whether portfolios are rebalanced daily or monthly, the largest SD is reported by the equal-weighted portfolio strategy. For portfolios rebalanced daily (Tables 1–4), the five smallest SDs are obtained by the DCC-based methods, except in the crisis period, in which case the five smallest SDs are spread among the DCC-, OGARCH- and GPVC-based methods. In the crisis period, the smallest SD is obtained by the GPVC procedure with non-linear shrinkage applied to the one-step-ahead conditional covariance matrix. For portfolios rebalanced monthly (Tables 5–8), the smallest SDs are obtained by the RM2006-LS<sup>4</sup>, NLS-DCC, NLS-GPVC and RM2006-LS procedures for the full, pre-crisis, crisis and post-crisis periods, respectively.

The best performance in terms of the AV criterion differs depending on the period and the rebalancing strategy. For instance, for daily rebalancing the best performance in the full period is achieved by RPVC, followed by RPVC with non-linear shrinkage applied to the one-step-ahead conditional covariance matrix. For the pre-crisis, crisis and post-crisis periods, the best performance is achieved by OGARCH with non-linear shrinkage applied to the unconditional covariance matrix (NLS-OGARCH), RPVC with linear shrinkage applied to the one-step-ahead conditional covariance matrix (RPVC-LS), and the RiskMetrics method with linear shrinkage applied to the one-step-ahead conditional covariance matrix (RM1994-LS), respectively. For monthly rebalancing, the best performances in the full, pre-crisis, crisis and post-crisis periods are achieved by RPVC, OGARCH-NLS, GPVC-LS and the equal-weighted portfolio strategy, respectively.

In terms of average turnover, the five smallest average turnovers are in the OGARCH and GPVC groups, with the best performance being achieved by the OGARCH with non-linear shrinkage applied to the one-step-ahead conditional covariance matrix in almost all cases. The only two exceptions are observed in the crisis period, in which case the best performance is achieved by the GPVC procedure with non-linear shrinkage applied to the one-step-ahead conditional covariance matrix.

<sup>4</sup> The acronyms are described in Appendix A.

Additionally, note that, regardless of whether the portfolio is rebalanced daily or monthly, the average turnover reported by all dimension-reduction techniques is smaller than that reported by the procedures without dimension reduction.

**Table 1.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period January 2004 to November 2017. The shaded cells denote the top five for each criterion. Weights are rebalanced on a daily basis considering short-selling constraints.

**Table 2.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period January 2004 to December 2007. The shaded cells denote the top five for each criterion. Weights are rebalanced on a daily basis considering short-selling constraints.

**Table 3.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period January 2008 to June 2009. The shaded cells denote the top five for each criterion. Weights are rebalanced on a daily basis considering short-selling constraints.

**Table 4.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period July 2009 to November 2017. The shaded cells denote the top five for each criterion. Weights are rebalanced on a daily basis considering short-selling constraints.

**Table 5.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period January 2004 to November 2017. The shaded cells denote the top five for each criterion. Weights are rebalanced on a monthly basis considering short-selling constraints.

**Table 6.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period January 2004 to December 2007. The shaded cells denote the top five for each criterion. Weights are rebalanced on a monthly basis considering short-selling constraints.

**Table 7.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period January 2008 to June 2009. The shaded cells denote the top five for each criterion. Weights are rebalanced on a monthly basis considering short-selling constraints.

**Table 8.** Annualised performance measures: AV, SD, IR, SR and TO stand for the average, standard deviation, information ratio, Sortino ratio and turnover of the out-of-sample MVP returns. AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ stand for the average out-of-sample MVP return net of transaction costs considering 20 and 50 basis points, respectively. Period July 2009 to November 2017. The shaded cells denote the top five for each criterion. Weights are rebalanced on a monthly basis considering short-selling constraints.


As for the annualised average portfolio returns taking transaction costs into account, the procedures with the five largest values of AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ are the same as those with the largest AV, except in some cases in the pre-crisis period, where one of the five largest AV$^{net}_{50bp}$ values is obtained by the NLS-OGARCH-NLS procedure.

For each period, the five best methods in terms of the IR and SR criteria are the same (except in Table 8, where four methods are the same). We omit the analysis for the crisis period because these criteria take negative values there. Overall, for daily rebalancing, RiskMetrics-based methods are among the best in the full and post-crisis periods, RPVC and RPVC-NLS are among the best in the full and pre-crisis periods, and NLS-OGARCH and LS-OGARCH are among the best in the pre-crisis period. For monthly rebalancing, some OGARCH-based methods are among the best in the pre-crisis and full periods, some CCC-based methods are among the best in the post-crisis and full periods, RM1994-LS is among the best for the post-crisis period, and RPVC is among the best for the full period.

The analysis of Tables 1–8 reveals that none of the methods is the best in all scenarios: performance depends on the criterion, the period and the rebalancing strategy. For this reason, the analysis focuses on the full period (Tables 1 and 5) in order to account for periods with different volatility levels. When portfolios are rebalanced on a daily basis, we find that DCC-based methods are the best in terms of SD; RM2006-LS, RM2006-NLS, RPVC and RPVC-NLS are the best in terms of {AV, AV$^{net}_{20bp}$, AV$^{net}_{50bp}$} and {IR, SR}; and some OGARCH-based methods are the best regarding TO. For monthly rebalanced portfolios, the best methods in terms of SD are DCC, LS-DCC, NLS-DCC, RM2006 and RM2006-LS, whereas the best performances in terms of {AV, AV$^{net}_{20bp}$, AV$^{net}_{50bp}$} and {IR, SR} are given by (RPVC, RPVC-NLS), (OGARCH-NLS, NLS-OGARCH-NLS) and CCC. In addition, the equal-weighted strategy is the second best in terms of AV but the worst regarding the SD, IR and SR criteria.

To show when the shrinkage method improves performance in terms of SD, the analysis again focuses on the full period (Tables 1 and 5). For daily and monthly portfolio rebalancing: shrinkage always improves the performance of the RM1994 and GPVC methods (except LS-GPVC for monthly rebalancing), whereas it always worsens the performance of the DCC method; linear shrinkage at the end improves RM2006; only linear/non-linear shrinkage at the beginning improves DECO; OGARCH-NLS and NLS-OGARCH-NLS improve OGARCH; and LS-CCC improves CCC (as does NLS-DCC for daily rebalancing). Additionally, for daily rebalancing, shrinkage always improves the performance of RPVC (except LS-RPVC), whereas for monthly rebalancing, linear shrinkage applied at the beginning and/or end improves RPVC. Nakagawa et al. (2018) also report that in some cases the use of non-linear shrinkage on the unconditional covariance matrix of the devolatilised returns in the DCC model increases the standard deviation of the out-of-sample portfolio returns.

We now discuss the effect of shrinkage in terms of AV$^{net}_{50bp}$. For daily rebalancing, shrinkage improves the performance of the RM2006 and DECO methods, and worsens the performance of the DCC and RPVC methods. In addition, CCC-NLS is better than CCC, RM1994-NLS is better than RM1994, and LS-GPVC is better than GPVC. For monthly rebalancing, shrinkage does not improve the performance of the CCC, DCC, GPVC and RPVC methods. In addition, RM2006-LS is better than RM2006, RM1994-NLS is better than RM1994, DECO-NLS and NLS-DECO-NLS are better than DECO, and OGARCH-NLS and NLS-OGARCH-NLS are better than OGARCH.

Finally, we list the main findings when short-selling is allowed in the optimisation of the portfolio variance. A detailed analysis of these cases is given in the Supplementary Material. First, none of the methods is the best in all scenarios, and the performance depends on the criterion, the sample period and the portfolio rebalancing scheme. Second, the analysis of the full period reveals that, for daily rebalancing, DCC methods are the best regarding SD and are among the best in terms of IR and SR, while RM1994-LS and RM2006-LS are the best according to AV, AV$^{net}_{20bp}$, AV$^{net}_{50bp}$, IR and SR. For monthly rebalancing, DCC-LS and LS-DCC-LS are among the best in terms of SD; RM2006-NLS is the best in terms of SD and is among the best regarding IR and SR; RM1994 and RM1994-LS are the first and second best in terms of AV, AV$^{net}_{20bp}$ and AV$^{net}_{50bp}$ but are among the worst in terms of SD. Third, the analysis of the turnover and average net returns in the case without short-sale constraints must be done carefully: since no limits are imposed on the portfolio weights, large turnover values can be obtained, and consequently we can have a large loss (average return) but a huge net gain (average net portfolio return taking transaction costs into account). Fourth, in many cases shrinkage improves the performance of the methods in terms of SD, and this improvement can be substantial. Fifth, for daily rebalancing the top-five models in terms of SD are the same in both the restricted and unrestricted minimum variance portfolios, except in the crisis period.

#### **4. Conclusions**

The main conclusion of the paper is that none of the methods is the best in all scenarios and the performance depends on the criterion, the sample period, the portfolio rebalancing scheme and whether or not short-selling constraints are included in the portfolio optimisation process.

When short-selling constraints are included in the portfolio optimisation process, the main results can be summarised as follows. First, none of the methods is the best in all scenarios, and the performance depends on the criterion, the sample period and the portfolio rebalancing scheme. Second, when considering the SD criterion, the five smallest SDs are obtained by the DCC-based methods, except in the crisis period, in which case the five smallest SDs are spread among the DCC-, OGARCH- and GPVC-based methods. In the crisis period, the smallest SD is obtained by the GPVC procedure with non-linear shrinkage applied to the one-step-ahead conditional covariance matrix. For portfolios rebalanced monthly, the smallest SDs are obtained by the RM2006-LS, NLS-DCC, NLS-GPVC and RM2006-LS procedures for the full, pre-crisis, crisis and post-crisis periods, respectively. Third, unlike Engle et al. (2017) and Nakagawa et al. (2018), we do not find that applying non-linear shrinkage to the unconditional correlation matrix of the devolatilised returns improves the performance of the portfolio in terms of SD when the DCC model is used, and the same holds when it is applied in other methods. It is important to point out that Engle et al. (2017) use a portfolio of 1000 assets, Nakagawa et al. (2018) use portfolios of 100, 500 and 1000 assets, and we use a portfolio of 174 assets.

When short-selling is allowed in the optimisation of the portfolio variance, the main conclusions are as follows: none of the methods is the best in all scenarios, and the performance depends on the criterion, the sample period and the portfolio rebalancing scheme; in many cases shrinkage improves the performance of the methods in terms of SD, and this improvement can be substantial; and for daily rebalancing the top-five models in terms of SD are the same as those obtained when short-selling constraints are imposed, except in the crisis period. Finally, focusing on the analysis of the full period, we can say that overall the DCC- and RiskMetrics-based methods are the best, and that the analysis of the turnover and average net returns in the case without short-selling constraints should be done carefully.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2225-1146/7/2/19/s1, File: Covariance Prediction in Large Portfolio Allocation: Supplementary Material.

**Author Contributions:** This paper has been a collaborative effort, with all authors contributing equally to this work. This includes conceptualization and investigation of the main ideas in the manuscript, methodology proposals, and formal analysis, as well as all aspects of the writing process.

**Funding:** The first three authors acknowledge financial support from the São Paulo Research Foundation (FAPESP), grants 2016/18599-4, 2018/03012-3, 2013/00506-1 and 2018/04654-9. The fourth author is grateful to the National Council for Scientific and Technological Development (CNPq) for grant 303688/2016-5. The third author is also grateful to CNPq for grant 313035/2017-2.

**Acknowledgments:** The first three authors acknowledge support of the Centre for Applied Research on Econometrics, Finance and Statistics (CAREFS) and the Centre of Quantitative Studies in Economics and Finance (CEQEF). The authors are also grateful to two anonymous referees and the academic editor for providing useful comments and suggestions on an earlier version of the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Estimation Methods**

Here we present the detailed list of the estimation methods implemented in the paper. The marginal variances in the CCC, DCC and DECO models were modelled by the GJR(1,1) model (Glosten et al. 1993), and the parameters were estimated by quasi-maximum likelihood assuming a Student-$t$ distribution. The volatility components in the GPVC and RPVC procedures were modelled by the GJR(1,1)-cDCC(1,1) model and its robust version, proposed by Boudt et al. (2013) and Laurent et al. (2016), respectively. The univariate variances in the OGARCH model were also modelled by the GJR(1,1).

In the GPVC and RPVC procedures, the number of volatility components was selected using the criteria of Ahn and Horenstein (2013) and Bai and Ng (2002), the Kaiser–Guttman criterion (Guttman 1954), and the ratio estimator proposed by Lam and Yao (2012), which is sketched below. Following these criteria and the suggestions in Trucíos et al. (2019), we use one volatility component in the GPVC procedure and four volatility components in the RPVC procedure.
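For illustration, the ratio estimator can be sketched as follows; this is our example, with the hypothetical input `S` standing for the matrix whose eigenvalues are ranked (for instance, a sample covariance matrix of suitably transformed returns). It selects the number of components by maximising the ratio of consecutive eigenvalues.

```r
# Our sketch of the eigenvalue-ratio estimator, in the spirit of
# Lam and Yao (2012) and Ahn and Horenstein (2013):
# pick k maximising lambda_k / lambda_{k+1}.
n_components <- function(S, kmax = 10) {
  lambda <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
  ratios <- lambda[1:kmax] / lambda[2:(kmax + 1)]
  which.max(ratios)
}
```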

The CCC, DCC, DECO, RM1994 and RM2006 procedures were implemented using the MFE Matlab Toolbox of Kevin Sheppard. The OGARCH, GPVC and RPVC procedures were implemented in R (R Core Team 2017) using the packages *rugarch* (Ghalanos 2017), *Rcpp* (Eddelbuettel and François 2011) and *covRobust* (Wang et al. 2017). For the shrinkage procedures, we used the R packages *RiskPortfolios* (Ardia et al. 2018) and *nlshrink* (Ramprasad 2016) for the linear and non-linear shrinkage, respectively, coupled with the MATLAB toolbox QuEST (Ledoit and Wolf 2017) for the non-linear shrinkage and the MATLAB function *covCor*<sup>5</sup>. Whenever a program offered additional options, we used the defaults.
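As a concrete illustration of the univariate building block, a GJR(1,1) marginal with Student-$t$ innovations can be specified in *rugarch* along the following lines. This is a sketch of a typical call rather than the authors' estimation code; `y` is a hypothetical return series for a single asset.

```r
library(rugarch)

# GJR(1,1) with Student-t innovations, fitted by (quasi-)maximum likelihood.
spec <- ugarchspec(
  variance.model     = list(model = "gjrGARCH", garchOrder = c(1, 1)),
  mean.model         = list(armaOrder = c(0, 0), include.mean = TRUE),
  distribution.model = "std"      # Student-t
)
fit  <- ugarchfit(spec, data = y)             # y: one asset's returns (assumed)
fcst <- ugarchforecast(fit, n.ahead = 1)      # one-step-ahead forecast
sigma(fcst)                                   # predicted conditional volatility
```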

#### CCC-based methods


#### DCC-based methods


#### DECO-based methods


<sup>5</sup> Available at www.econ.uzh.ch/en/people/faculty/wolf/publications.


Because in the DECO model the estimated unconditional covariance matrix and $\mathbf{H}_{T+1}$ are already equicorrelated, there is no point in using linear shrinkage towards the equicorrelation matrix, since it would have no effect.
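To see why, note that linear shrinkage forms the convex combination $(1 - \alpha)\mathbf{H} + \alpha \mathbf{T}$ of an estimate $\mathbf{H}$ and a target $\mathbf{T}$; when $\mathbf{H}$ is already equicorrelated, the equicorrelation target built from it coincides with $\mathbf{H}$, so the combination leaves $\mathbf{H}$ unchanged. A small numerical check (our sketch, not the paper's code):

```r
# Linear shrinkage towards the equicorrelation target is a no-op
# when the input matrix is already equicorrelated.
shrink_equicorr <- function(H, alpha) {
  s    <- sqrt(diag(H))
  R    <- cov2cor(H)
  rbar <- mean(R[lower.tri(R)])   # average pairwise correlation
  Req  <- matrix(rbar, nrow(H), ncol(H))
  diag(Req) <- 1
  Tgt  <- outer(s, s) * Req       # equicorrelation target
  (1 - alpha) * H + alpha * Tgt
}

H <- matrix(0.3, 3, 3); diag(H) <- 1          # already equicorrelated
all.equal(shrink_equicorr(H, 0.5), H)         # TRUE: no effect
```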

#### RiskMetrics-based methods


#### OGARCH-based methods


#### GPVC-based methods


<sup>6</sup> This method was implemented using the MFE Matlab Toolbox of Kevin Sheppard with the default options. An R implementation of the same procedure can be found in Trucíos (2017).

• NLS-GPVC-NLS: Estimated as in NLS-GPVC, with non-linear shrinkage applied to the predicted one-step-ahead conditional covariance matrix $\mathbf{H}_{T+1}$.

#### RPVC-based methods


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
