1. Introduction
Forecasting is one of the most important and widely studied areas in time series econometrics, and forecast evaluation is a key topic within it. A central challenge in this literature is the development of adequate tests to conduct inference about predictive ability. In what follows, we review some advances in this area and address some of the remaining challenges.
“Mighty oaks from little acorns grow.” This is probably the best way to describe the forecast evaluation literature since the mid-1990s. The seminal works of Diebold and Mariano (1995) [1] and West (1996) [2] (DMW) have flourished in many directions, attracting the attention of both scholars and practitioners in the quest for proper evaluation techniques. See West (2006) [3], Clark and McCracken (2013) [4], and Giacomini and Rossi (2013) [5] for excellent reviews of forecast evaluation.
Considering forecasts as primitives, Diebold and Mariano (1995) [1] showed that under mild conditions on forecast errors and loss functions, standard time-series versions of the central limit theorem apply, ensuring asymptotic normality for tests evaluating predictive performance. West (1996) [2] considered the case in which forecasts are constructed with estimated econometric models. This is a critical difference with respect to Diebold and Mariano (1995) [1], since forecasts are now polluted by estimation error.
Building on this insight, West (1996) [2] developed a theory for testing population-level predictive ability (i.e., using estimated models to learn something about the true models). Two fundamental issues arise from West’s contribution. Firstly, in some specific cases, parameter uncertainty is “asymptotically irrelevant”; hence, it is possible to proceed as proposed by Diebold and Mariano (1995) [1]. Secondly, although West’s theory is quite general, it requires a full rank condition on the long-run variance of the objective function when parameters are set at their true values. A leading case in which this assumption is violated is the standard comparison of mean squared prediction errors (MSPE) in nested environments.
As pointed out by West (2006) [3]: “A rule of thumb is: if the rank of the data becomes degenerate when regression parameters are set at their population values, then a rank condition assumed in the previous sections likely is violated. When only two models are being compared, ‘degenerate’ means identically zero.” (West (2006) [3], page 117). Clearly, in the context of two nested models, the null hypothesis of equal MSPE means that both models are exactly the same, which generates the violation of the rank condition in West (1996) [2].
Forecast evaluations in nested models are extremely relevant in economics and finance for at least two reasons. Firstly, it is standard practice in financial econometrics to compare the predictive accuracy of a given model A with a simple benchmark generated from a model B that is nested in A (e.g., the “no change” forecast). Some of the most influential empirical works, like Welch and Goyal (2008) [6] and Meese and Rogoff (1983, 1988) [7,8], have shown that outperforming naïve models is an extremely difficult task. Secondly, comparisons within the context of nested models provide an easy and intuitive way to evaluate and identify the predictive content of a given variable $X_t$: suppose the only difference between two competing models is that one of them uses the predictor $X_t$, while the other one does not. If the former outperforms the latter, then $X_t$ has relevant information to predict the target variable.
Due to its relevance, many efforts have been undertaken to deal with this issue. Some key contributions are those of Clark and McCracken (2001, 2005) [9,10] and McCracken (2007) [11], who used a different approach that allows for comparisons at the population level between nested models. Although, in general, the derived asymptotic distributions are not standard, for some specific cases (e.g., no autocorrelation, conditional homoskedasticity of forecast errors, and one-step-ahead forecasts), the limiting distributions of the relevant statistics are free of nuisance parameters, and their critical values are provided in Clark and McCracken (2001) [9].
While the contributions of many authors in the last 25 years have been important, our reading of the state of the art in forecast evaluation coincides with the view of Diebold (2015) [12]:
“[…] one must carefully tiptoe across a minefield of assumptions depending on the situation. Such assumptions include but are not limited to: (1) Nesting structure and nuisance parameters. Are the models nested, non-nested, or partially overlapping? (2) Functional form. Are the models linear or nonlinear? (3) Model disturbance properties. Are the disturbances Gaussian? Martingale differences? Something else? (4) Estimation sample. Is the pseudo-in-sample estimation period fixed? Recursively expanding? Something else? (5) Estimation method. Are the models estimated by OLS? MLE? GMM? Something else? And crucially: Does the loss function embedded in the estimation method match the loss function used for pseudo-out-of-sample forecast accuracy comparisons? (6) Asymptotics. What asymptotics are invoked?” (Diebold (2015) [12], pages 3–4). Notably, the relevant limiting distribution generally depends on some of these assumptions.
In this context, there is a demand for straightforward tests that simplify the discussion in nested model comparisons. Of course, there have been some attempts in the literature. For instance, one of the most widely used approaches in this direction is the test outlined in Clark and West (2007) [13]. The authors showed, via simulations, that standard normal critical values tend to work well with their test, even though Clark and McCracken (2001) [9] demonstrated that this statistic has a non-standard distribution. Moreover, when the null model is a martingale difference and parameters are estimated with rolling regressions, Clark and West (2006) [14] showed that their test is indeed asymptotically normal. Despite this and other particular cases, as stated in the conclusions of the West (2006) [3] review: “One of the highest priorities for future work is the development of asymptotically normal or otherwise nuisance parameter-free tests for equal MSPE or mean absolute error in a pair of nested models. At present only special case results are available.” (West (2006) [3], page 131). Our paper addresses this issue.
Our WCW test can be viewed as a simple modification of the CW test. As noticed by West (1996) [2], in the context of nested models, the CW core statistic becomes degenerate under the null hypothesis of equal predictive ability. Our suggestion is to introduce an independent random variable with a “small” variance into the core statistic. This random variable prevents our test from becoming degenerate under the null hypothesis, keeps the asymptotic distribution centered around zero, and eliminates the autocorrelation structure of the core statistic at the population level. While West’s (1996) [2] asymptotic theory does not apply to CW (as it does not meet the full rank condition), it does apply to our test (as the variance of our test statistic remains positive under the null hypothesis). In this sense, our approach not only prevents our test from becoming degenerate, but also ensures asymptotic normality by relying on West’s (1996) [2] results. In a nutshell, there are two key differences between CW and our test. Firstly, our test is asymptotically normal, while CW is not. Secondly, our simulations reveal that WCW is better sized than CW, especially at long forecasting horizons.
We have also demonstrated that “asymptotic irrelevance” applies; hence the effects of parameter uncertainty can be ignored. As asymptotic normality and “asymptotic irrelevance” apply, our test is extremely user friendly and easy to implement. Finally, one possible concern about our test is that it depends on one realization of one independent random variable. To partially overcome this issue, we have also provided a smoothed version of our test that relies on multiple realizations of this random variable.
Most of the asymptotic theory for the CW test and other statistics developed in Clark and McCracken (2001, 2005) [9,10] and McCracken (2007) [11] focused almost exclusively on direct multi-step-ahead forecasts. However, with some exceptions (e.g., Clark and McCracken (2013) [15] and Pincheira and West (2016) [16]), iterated multi-step-ahead forecasts have received much less attention. In part for this reason, we evaluated the performance of our test (relative to CW) focusing on iterated multi-step-ahead forecasts. Our simulations reveal that our approach is reasonably well-sized, even at long horizons, when CW may present severe size distortions. In terms of power, results were rather mixed, although CW frequently exhibited somewhat more power. All in all, our simulations reveal that asymptotic normality and size corrections come at a cost: the introduction of a random variable erodes some of the power of WCW. Nevertheless, we also show that the power of our test improves with a smaller variance of our random variable and with an average of multiple realizations of our test.
Finally, based on the commodity currencies literature, we provide an empirical illustration of our test. Following Chen, Rogoff, and Rossi (2010, 2011) [17,18]; Pincheira and Hardy (2018, 2019, 2021) [19,20,21]; and Pincheira and Jarsun (2020) [22], we evaluated the performance of the exchange rates of three major commodity exporters (Australia, Chile, and South Africa) when predicting commodity prices. Consistent with previous literature, we found evidence of predictability for some of the commodities considered in this exercise. Particularly strong results were found when predicting the London Metal Exchange Index, aluminum, and tin. Fairly interesting results were also found for oil and the S&P GSCI. The South African rand and the Australian dollar have a strong ability to predict these two series. We compared our results using both CW and WCW. At short horizons, both tests led to similar results. The main differences appeared at long horizons, where CW tended to reject the null hypothesis of no predictability more frequently. From the lessons learned from our simulations, we can think of two possible explanations for these differences: Firstly, they might be the result of CW displaying more power than WCW. Secondly, they might be the result of CW displaying a higher false discovery rate relative to WCW. Let us recall that CW may be severely oversized at long horizons, while WCW is better sized. These conflicting results between CW and WCW might act as a warning of a potential false discovery of predictability. As a consequence, our test brings good news to careful researchers who seriously wish to avoid spurious findings.
The rest of this paper is organized as follows. Section 2 establishes the econometric setup and forecast evaluation framework, and presents the WCW test. Section 3 addresses the asymptotic distribution of the WCW, showing that “asymptotic irrelevance” applies. Section 4 describes our DGPs and simulation setups. Section 5 discusses the simulation results. Section 6 provides an empirical illustration. Finally, Section 7 concludes.
2. Econometric Setup
Consider the following two competing nested models for a target scalar variable $y_{t+1}$:

Model 1: $y_{t+1} = X_{1t}'\beta_1 + e_{1,t+1}$
Model 2: $y_{t+1} = X_{1t}'\beta_1 + X_{2t}'\beta_2 + e_{2,t+1}$

where $e_{1,t+1}$ and $e_{2,t+1}$ are both zero-mean martingale difference processes, meaning that $E[e_{i,t+1}\mid\mathcal{F}_t]=0$ for $i=1,2$, and $\mathcal{F}_t$ stands for the sigma field generated by current and past values of the regressors and $y_t$. We will assume that $e_{1,t+1}$ and $e_{2,t+1}$ have finite and positive fourth moments.
When the econometrician wants to test the null using an out-of-sample approach in this econometric context, Clark and McCracken (2001) [9] derived the asymptotic distribution of a traditional encompassing statistic used, for instance, by Harvey, Leybourne, and Newbold (1998) [23] (other examples of encompassing tests include Chong and Hendry (1986) [24] and Clements and Hendry (1993) [25], to name a few). In essence, the ENC-t statistic proposed by Clark and McCracken (2001) [9] studies the covariance between $\hat{e}_{1,t+1}$ and $(\hat{e}_{1,t+1}-\hat{e}_{2,t+1})$. Accordingly, with core statistic $\hat{f}_{t+1}=\hat{e}_{1,t+1}(\hat{e}_{1,t+1}-\hat{e}_{2,t+1})$, this test statistic takes the form:

$$\text{ENC-}t=\sqrt{P}\,\frac{\bar{f}}{\sqrt{\hat{S}_{ff}}},\qquad \bar{f}=\frac{1}{P}\sum_{t}\hat{f}_{t+1}$$

where $\hat{S}_{ff}$ is the usual variance estimator for $\hat{f}_{t+1}$ and P is the number of out-of-sample forecasts under evaluation (as pointed out by Clark and McCracken (2001) [9], the HLN test is usually computed with regression-based methods; for this reason, we use the statistic above rather than its regression-based counterpart). See Appendix A.1 for two intuitive interpretations of the ENC-t test.
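For concreteness, the ENC-t statistic can be computed from the two series of out-of-sample forecast errors along the following lines (a minimal sketch in Python; variable names are ours, and we use the simple sample-variance estimator of the core statistic rather than a regression-based or HAC alternative):

```python
import numpy as np

def enc_t(e1, e2):
    """ENC-t statistic from out-of-sample forecast errors.

    e1 : forecast errors of the parsimonious (null) model
    e2 : forecast errors of the larger (alternative) model
    Core statistic: f_t = e1_t * (e1_t - e2_t), i.e., the covariance
    between e1 and (e1 - e2) emphasized in the text.
    """
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    f = e1 * (e1 - e2)
    P = f.size                          # number of out-of-sample forecasts
    f_bar = f.mean()
    var_f = ((f - f_bar) ** 2).mean()   # simple variance estimator of f
    return np.sqrt(P) * f_bar / np.sqrt(var_f)
```

Large positive values of the statistic indicate that the benchmark's errors co-move with the error reduction achieved by the larger model, i.e., evidence against the null.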
The null hypothesis of interest is that the additional parameters of the larger model are equal to zero. This implies that the population forecast errors of both models coincide. This null hypothesis is also equivalent to equality in MSPE.
Even though West (1996) [2] showed that the ENC-t is asymptotically normal for non-nested models, this is not the case in nested environments. Note that one of the main assumptions in West’s (1996) [2] theory is that the population counterpart of the long-run variance of the core statistic is strictly positive. This assumption is clearly violated when models are nested. To see this, recall that under the null of equal predictive ability, $e_{1,t+1}=e_{2,t+1}$ for all t. In other words, the population prediction errors from both models are identical under the null and, therefore, the core statistic $e_{1,t+1}(e_{1,t+1}-e_{2,t+1})$ is exactly zero. Consequently, its variance is zero as well. More precisely, notice that under the null:

$$e_{1,t+1}(e_{1,t+1}-e_{2,t+1})=e_{1,t+1}(e_{1,t+1}-e_{1,t+1})=0\quad\text{for all }t.$$

It follows that the rank condition in West (1996) [2] cannot be met, as the relevant long-run variance is exactly zero.
The main aim of our paper was to modify this ENC-t test to make it asymptotically normal under the null. Our strategy required the introduction of a sequence of independent random variables $b_{t+1}$ with variance $\sigma_b^2$ and expected value equal to 1. It is critical to notice that $b_{t+1}$ is not only i.i.d., but also independent from $e_{1,t+1}$ and $e_{2,t+1}$.
With this sequence in mind, we define our “Wild Clark and West” (WCW-t) statistic as

$$\text{WCW-}t=\sqrt{P}\,\frac{\bar{f}}{\sqrt{\hat{S}_{ff}}},\qquad \hat{f}_{t+1}=\hat{e}_{1,t+1}(\hat{e}_{1,t+1}-b_{t+1}\hat{e}_{2,t+1}),\quad \bar{f}=\frac{1}{P}\sum_{t}\hat{f}_{t+1}$$

where $\hat{S}_{ff}$ is a consistent estimate of the long-run variance of $\hat{f}_{t+1}$ (e.g., Newey and West (1987, 1994) [26,27] or Andrews (1991) [28]).
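The long-run variance of the core statistic can be estimated with a Bartlett-kernel (Newey–West) estimator. A minimal sketch follows (the function name and lag-choice argument are ours):

```python
import numpy as np

def newey_west_lrv(f, lags):
    """Newey-West (Bartlett kernel) long-run variance of a scalar series."""
    f = np.asarray(f, float)
    P = f.size
    u = f - f.mean()
    lrv = (u @ u) / P                    # gamma_0: sample variance
    for j in range(1, lags + 1):
        gamma_j = (u[j:] @ u[:-j]) / P   # j-th sample autocovariance
        lrv += 2.0 * (1.0 - j / (lags + 1)) * gamma_j
    return lrv
```

For one-step-ahead forecasts, the core statistic is serially uncorrelated under the null, so `lags = 0` (the plain sample variance) is a natural choice; at longer horizons, a wider lag window is the usual practice.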
In this case, under the null we have $e_{1,t+1}=e_{2,t+1}$; therefore:

$$f_{t+1}=e_{1,t+1}(e_{1,t+1}-b_{t+1}e_{2,t+1})=e_{1,t+1}^{2}(1-b_{t+1}).$$

Besides, we have that under the null

$$E[f_{t+1}]=E[e_{1,t+1}^{2}]\,E[1-b_{t+1}]=0.$$

The last result follows from the fact that $E[b_{t+1}]=1$ and that $b_{t+1}$ is independent of the forecast errors. Notice that this transformation is important: under the null hypothesis, even if $e_{1,t+1}(e_{1,t+1}-e_{2,t+1})$ is identically zero for all t, the inclusion of $b_{t+1}$ prevents the core statistic from becoming degenerate, preserving a positive variance (it is also possible to show that the term $e_{1,t+1}^{2}(1-b_{t+1})$ has no autocorrelation under the null).
Additionally, under the alternative:

$$E[f_{t+1}]=E[e_{1,t+1}(e_{1,t+1}-e_{2,t+1})]>0.$$

Consequently, our test is one-sided.
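Putting the pieces together, a possible implementation of the WCW-t statistic reads as follows. This is a sketch reflecting our reading of the construction: $b_t$ is drawn i.i.d. with mean one and is independent of both error series, with its standard deviation set to a small fraction of the sample standard deviation of the forecast errors; the Gaussian choice for $b_t$ and the function names are our assumptions:

```python
import numpy as np

def wcw_t(e1, e2, frac=0.02, rng=None):
    """Wild Clark and West t-statistic (sketch).

    frac : std of b_t as a fraction of the sample std of the
           estimated forecast errors (e.g., 0.01, 0.02, or 0.04).
    """
    rng = np.random.default_rng() if rng is None else rng
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    P = e1.size
    sigma_b = frac * e2.std()                    # "small" std, scaled to the data
    b = 1.0 + sigma_b * rng.standard_normal(P)   # i.i.d., E[b] = 1 (normality is our choice)
    f = e1 * (e1 - b * e2)                       # perturbed core statistic
    f_bar = f.mean()
    var_f = ((f - f_bar) ** 2).mean()            # b i.i.d. => no autocorrelation under the null
    return np.sqrt(P) * f_bar / np.sqrt(var_f)
```

Under the null (identical errors), the core becomes $e_{1,t+1}^2(1-b_{t+1})$, which has mean zero but strictly positive variance, so the statistic stays well defined where the CW core would be identically zero.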
Finally, there are two possible concerns with the implementation of our WCW-t statistic. The first one is about the choice of $\sigma_b$. Even though this decision is arbitrary, we give the following recommendation: $\sigma_b$ should be “small”; the idea of our test is to recover asymptotic normality under the null hypothesis, something that could be achieved for any value of $\sigma_b>0$. However, if $\sigma_b$ is “too big”, it may simply erode the predictive content under the alternative hypothesis, deteriorating the power of our test. Notice that a “small” variance for some DGPs could be a “big” one for others; for this reason, we propose to take $\sigma_b$ as a small percentage of the sample standard deviation of the estimated forecast errors. As we discuss later in Section 4, we considered three different standard deviations with reasonable size and power results: 1 percent, 2 percent, and 4 percent of the standard deviation of the estimated forecast errors. Obviously, our test tends to be better sized as $\sigma_b$ grows, at the cost of some power.
Secondly, notice that our test depends on one realization of the sequence $b_{t+1}$. One reasonable concern is that this randomness could strongly affect our WCW-t statistic (even for “small” values of the $\sigma_b$ parameter). In other words, we would like to avoid significant changes in our statistic generated by the randomness of $b_{t+1}$. Additionally, as we report in Section 4, our simulations suggest that using just one realization of the sequence $b_{t+1}$ may sometimes significantly reduce the power of our test relative to CW. To tackle both issues, we propose to smooth the randomness of our approach by considering K different WCW-t statistics constructed with different and independent sequences of $b_{t+1}$. Our proposed test is the simple average of these K standard normal WCW-t statistics, adjusted by the correct variance of the average as follows:

$$\overline{\text{WCW-}t}=\frac{\frac{1}{K}\sum_{k=1}^{K}\text{WCW-}t^{(k)}}{\sqrt{\frac{1}{K^{2}}\left(K+2\sum_{i<j}\hat{\rho}_{ij}\right)}} \qquad (1)$$

where $\text{WCW-}t^{(k)}$ is the k-th realization of our statistic and $\hat{\rho}_{ij}$ is the sample correlation between the i-th and j-th realizations of the WCW-t statistics. Interestingly, as we discuss in Section 4, when using K = 2, the size of our test is usually stable, but averaging significantly improves its power.
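The adjusted average can be implemented by computing K independent perturbed core series, their t-statistics, and the sample correlations between the core series. A sketch under the same assumptions as above (variable names are ours):

```python
import numpy as np

def wcw_t_avg(e1, e2, frac=0.02, K=2, seed=0):
    """Average of K WCW-t statistics, rescaled by the std of the average."""
    rng = np.random.default_rng(seed)
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    P = e1.size
    sigma_b = frac * e2.std()
    cores, stats = [], []
    for _ in range(K):
        b = 1.0 + sigma_b * rng.standard_normal(P)   # independent sequence per realization
        f = e1 * (e1 - b * e2)
        cores.append(f)
        stats.append(np.sqrt(P) * f.mean() / np.sqrt(((f - f.mean()) ** 2).mean()))
    # variance of the average of K (asymptotically standard normal) statistics
    rho_sum = sum(np.corrcoef(cores[i], cores[j])[0, 1]
                  for i in range(K) for j in range(i + 1, K))
    var_avg = (K + 2.0 * rho_sum) / K ** 2
    return np.mean(stats) / np.sqrt(var_avg)
```

With K = 1 this collapses back to the single-realization statistic; larger K smooths out the randomness introduced by the perturbation.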
3. Asymptotic Distribution
Since most of our results rely on West (1996) [2], here we introduce some of his results and notation. For clarity of exposition, we focus on one-step-ahead forecasts. The generalization to multi-step-ahead forecasts is cumbersome in notation but straightforward.
Let $f_{t+1}^{*}$ be our loss function. We use “*” to emphasize that $f_{t+1}^{*}$ depends on the true population parameters; hence $f_{t+1}^{*}=f_{t+1}(\beta^{*})$, where $\beta^{*}$ denotes the vector of true population parameters. Additionally, let $\hat{f}_{t+1}=f_{t+1}(\hat{\beta}_{t})$ be the sample counterpart of $f_{t+1}^{*}$. Notice that $\hat{f}_{t+1}$ relies on estimates of $\beta^{*}$ and, as a consequence, is polluted by estimation error. Moreover, notice the subindex in $\hat{\beta}_{t}$: the out-of-sample forecast errors ($\hat{e}_{1,t+1}$ and $\hat{e}_{2,t+1}$) depend on the estimates $\hat{\beta}_{t}$ constructed with the relevant information available up to time t. These estimates can be constructed using either rolling, recursive, or fixed windows. See West (1996, 2006) [2,3] and Clark and McCracken (2013) [4] for more details about out-of-sample evaluations.
Let $Ef^{*}$ be the expected value of our loss function. As considered in Diebold and Mariano (1995) [1], if predictions do not depend on estimated parameters, then under weak conditions, we can apply the central limit theorem:

$$\sqrt{P}\left(\bar{f}^{*}-Ef^{*}\right)\xrightarrow{d}N\left(0,S_{ff}\right) \qquad (2)$$

where $S_{ff}$ stands for the long-run variance of the scalar $f_{t+1}^{*}$. However, one key technical contribution of West (1996) [2] was the observation that when forecasts are constructed with estimated rather than true, unknown, population parameters, some terms in expression (2) must be adjusted. We remark here that we observe $\hat{f}_{t+1}$ rather than $f_{t+1}^{*}$. To see how parameter uncertainty may play an important role, under Assumptions A.1, A.2, A.3 and A.4 in Appendix A, West (1996) [2] showed that a second-order expansion of $f_{t+1}(\hat{\beta}_{t})$ around $\beta^{*}$ yields

$$\sqrt{P}\left(\bar{\hat{f}}-Ef^{*}\right)=\sqrt{P}\left(\bar{f}^{*}-Ef^{*}\right)+F\left[\frac{1}{\sqrt{P}}\sum_{t=R}^{T-1}B(t)H(t)\right]+o_{p}(1) \qquad (3)$$

where $F=E\left[\partial f_{t+1}/\partial\beta\right]$ evaluated at $\beta^{*}$, R denotes the length of the initial estimation window, and T is the total sample size (T = R + P), while $B(t)$ and $H(t)$ will be defined shortly.
Recall that in our case, under the null hypothesis, $Ef^{*}=0$; hence, expression (3) is equivalent to

$$\sqrt{P}\,\bar{\hat{f}}=\sqrt{P}\,\bar{f}^{*}+F\left[\frac{1}{\sqrt{P}}\sum_{t=R}^{T-1}B(t)H(t)\right]+o_{p}(1).$$
Note that according to West (2006) [3], p. 112, and in line with Assumption 2 in West (1996) [2], pp. 1070–1071, the estimator of the regression parameters satisfies

$$\hat{\beta}_{t}=\beta^{*}+B(t)H(t)$$

where $B(t)$ is $k\times q$ and $H(t)$ is $q\times 1$, with (a) $B(t)\rightarrow B$ almost surely, with B a matrix of rank k; (b) $H(t)=t^{-1}\sum_{s=1}^{t}h_{s}(\beta^{*})$ if the estimation method is recursive, $H(t)=R^{-1}\sum_{s=t-R+1}^{t}h_{s}(\beta^{*})$ if it is rolling, or $H(t)=R^{-1}\sum_{s=1}^{R}h_{s}(\beta^{*})$ if it is fixed, where $h_{s}(\beta^{*})$ is a $q\times 1$ orthogonality condition; (c) $E\,h_{s}(\beta^{*})=0$.
As explained in West (2006) [3]: “Here, $h_{s}$ can be considered as the score if the estimation method is ML, or the GMM orthogonality condition if GMM is the estimator. The matrix $B(t)$ is the inverse of the Hessian if the estimation method is ML or a linear combination of orthogonality conditions when using GMM, with large sample counterpart $B$.” (West (2006) [3], p. 112).
Notice that Equation (3) clearly illustrates that $\sqrt{P}(\bar{\hat{f}}-Ef^{*})$ can be decomposed into two parts. The first term of the RHS is the population counterpart, whereas the second term captures the sequence of estimates of $\beta^{*}$ (in other words, terms arising because of parameter uncertainty). Then, as $P,R\to\infty$, we can apply the expansion in West (1996) [2] as long as Assumptions A.1, A.2, A.3 and A.4 hold. The key point is that a proper estimation of the variance in Equation (3) must account for: (i) the variance of the first term of the RHS ($S_{ff}$, i.e., the variance when there is no uncertainty about the population parameters); (ii) the variance of the second term of the RHS, associated with parameter uncertainty; and (iii) the covariance between both terms. Notice, however, that parameter uncertainty may be “asymptotically irrelevant” (hence (ii) and (iii) may be ignored) in the following cases: (1) $P/R\to 0$ as $T\to\infty$; (2) a fortunate cancellation between (ii) and (iii); or (3) $F=0$.
Note that under the null, $f_{t+1}=e_{1,t+1}^{2}(1-b_{t+1})$, and recall that $E[b_{t+1}]=1$; therefore

$$F=E\left[\frac{\partial f_{t+1}}{\partial\beta}\right]=0.$$

With a similar argument, it is easy to show that the covariance between the two terms of the RHS of Equation (3) is also zero. This result follows from the fact that we define $e_{1,t+1}$ as a martingale difference with respect to $\mathcal{F}_{t}$ and $b_{t+1}$ as independent of the forecast errors. Hence, in our case, “asymptotic irrelevance” applies as $F=0$, and Equation (3) reduces simply to

$$\sqrt{P}\,\bar{\hat{f}}=\sqrt{P}\,\bar{f}^{*}+o_{p}(1).$$

In other words, we could simply replace true errors with estimated out-of-sample errors and forget about parameter uncertainty, at least asymptotically.
4. Monte Carlo Simulations
In order to capture features from different economic/financial time series and different modeling situations that might induce different behavior in the tests under evaluation, we considered three DGPs. The first DGP (DGP1) relates to the Meese–Rogoff puzzle and matches exchange rate data (Meese and Rogoff (1983, 1988) [7,8] found that, in terms of predictive accuracy, many exchange rate models perform poorly against a simple random walk). In this DGP, under the null hypothesis, the target variable is simply white noise. In this sense, DGP1 mimics the low persistence of high frequency exchange rate returns. While in the null model there are no parameters to estimate, under the alternative model there is only one parameter that requires estimation. Our second DGP matches quarterly GDP growth in the US. In this DGP, under the null hypothesis, the target variable follows an AR(1) process with two parameters requiring estimation. In addition, the alternative model has four extra parameters to estimate. Differing from DGP1, in DGP2, parameter uncertainty may play an important role in the behavior of the tests under evaluation. DGP1 and DGP2 model stationary variables with low persistence, such as exchange rate returns and quarterly GDP growth. To explore the behavior of our tests with series displaying more persistence, we considered DGP3. This DGP is characterized by a VAR(1) model in which both the predictor and the predictand are stationary variables that display relatively high levels of persistence. In a nutshell, there are three key differences in our DGPs: the persistence of the variables, the number of parameters in the null model, and the number of excess parameters in the alternative model (according to Clark and McCracken (2001) [9], the asymptotic distribution of the ENC-t under the null hypothesis depends on the excess of parameters in the alternative model; as a consequence, the number of parameters in both the null and alternative models are key features of these DGPs).
To save space, we only report here results for recursive windows; in general terms, results with rolling windows were similar, and they are available upon request. For large sample exercises, we considered an initial estimation window of R = 450 and a prediction window of P = 450 (T = 900), while for small sample exercises, we considered R = 90 and P = 90 (T = 180). For each DGP, we ran 2000 independent replications. We evaluated the CW test and our test, computing iterated multi-step-ahead forecasts at several forecasting horizons from h = 1 up to h = 30. As discussed at the end of Section 2, we computed our test using K = 1 and K = 2 realizations of our WCW-t statistic. Additionally, for each simulation, we considered three different standard deviations of $b_{t+1}$: 1 percent, 2 percent, and 4 percent of the standard deviation of the out-of-sample forecast errors, where the corresponding sample variance of the out-of-sample forecast errors was calculated for each simulation.
Finally, we evaluated the usefulness of our approach using the iterated multi-step-ahead method for the three DGPs under evaluation (notice that the iterated method uses an auxiliary equation for the construction of the multi-step-ahead forecasts; here, we stretched the argument of “asymptotic irrelevance” and assumed that parameter uncertainty in the auxiliary equation plays no role). We report our results comparing the CW and WCW-t tests using one-sided standard normal critical values at the 10% and 5% significance levels (a summary of the results considering a 5% significance level can be found in the Appendix section). For simplicity, in each simulation we considered only homoscedastic, i.i.d., normally distributed shocks.
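A DGP1-style size experiment can be organized as below. This is a compact sketch, not the paper's full design: a white-noise target, one irrelevant predictor, recursive estimation, h = 1 only, and a reduced number of replications; the 10% one-sided normal critical value is 1.282:

```python
import numpy as np

def cw_core_t(e1, e2):
    """t-statistic on the CW-type core f_t = e1_t * (e1_t - e2_t)."""
    f = e1 * (e1 - e2)
    f_bar = f.mean()
    return np.sqrt(f.size) * f_bar / np.sqrt(((f - f_bar) ** 2).mean())

def size_experiment(R=90, P=90, reps=500, cv=1.282, seed=0):
    """Empirical rejection rate of the CW-type test under the null of DGP1."""
    rng = np.random.default_rng(seed)
    T = R + P
    rejections = 0
    for _ in range(reps):
        y = rng.standard_normal(T)      # target: white noise under the null
        x = rng.standard_normal(T)      # predictor with no true content
        e1 = np.empty(P)
        e2 = np.empty(P)
        for i, t in enumerate(range(R, T)):
            xs, ys = x[:t - 1], y[1:t]  # recursive sample of (x_{s-1}, y_s) pairs
            beta = (xs @ ys) / (xs @ xs)
            e1[i] = y[t]                # null model: no-change (zero) forecast
            e2[i] = y[t] - beta * x[t - 1]
        rejections += cw_core_t(e1, e2) > cv
    return rejections / reps
```

With a nominal 10% one-sided test, the returned rate should hover near 0.10 at h = 1; the distortions discussed in the text become visible mainly at longer horizons, which this one-step sketch does not cover.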
DGP 1
Our first DGP assumes a white noise process for the null model. We considered a case like this given its relevance in finance and macroeconomics. Our setup is very similar to the simulation experiments in Pincheira and West (2016) [16], Stambaugh (1999) [29], Nelson and Kim (1993) [30], and Mankiw and Shapiro (1986) [31].
We set our parameters as follows:
| | | | | under | under |
1.19 | −0.25 | (1.75)2 | (0.075)2 | 0 | 0 | −2 |
The null hypothesis posits that $y_{t+1}$ follows a no-change martingale difference. Additionally, the alternative forecast for multi-step-ahead horizons was constructed iteratively through an AR(p) process for the predictor. This is the same parametrization considered in Pincheira and West (2016) [16], and it is based on a monthly exchange rate application in Clark and West (2006) [14]. Therefore, $y_{t+1}$ represents the monthly return of a U.S. dollar bilateral exchange rate and the predictor is the corresponding interest rate differential.
DGP 2
Our second DGP is mainly inspired by macroeconomic data, and it was also considered in Pincheira and West (2016) [16] and Clark and West (2007) [13]. This DGP is based on models exploring the relationship between U.S. GDP growth and the Federal Reserve Bank of Chicago’s factor index of economic activity.
We set our parameters as follows:
under | under | under | under |
0 | 0 | 0 | 0 |
under | under | under | under |
3.363 | −0.633 | −0.377 | −0.529 |
| | | |
0.261 | 10.505 | 0.366 | 0.528 |
DGP 3
Our last DGP follows Busetti and Marcucci (2013) [32] and considers a very simple VAR(1) process:
We set our parameters as follows:
| | | | | c under | c under |
0.8 | 0.8 | 1 | 1 | 0 | 0 | 0.5 |
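DGP3 can be simulated along the following lines. This is a sketch: we read the tabulated values as giving persistence 0.8 for both variables, unit innovation variances, and a feedback coefficient of 0 under the null and 0.5 under the alternative; this mapping is our assumption:

```python
import numpy as np

def simulate_dgp3(T, c=0.0, a=0.8, seed=0):
    """Bivariate VAR(1): x predicts y through c (c = 0 under the null)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.standard_normal()           # persistent predictor
        y[t] = a * y[t - 1] + c * x[t - 1] + rng.standard_normal()  # predictand
    return y, x
```

Setting `c=0.5` generates data under the alternative, where lagged x carries genuine predictive content for y.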
6. Empirical Illustration
Our empirical illustration was inspired by the commodity currencies literature. Relying on the present value model for exchange rate determination (Campbell and Shiller (1987) [33] and Engel and West (2005) [34]), Chen, Rogoff, and Rossi (2010, 2011) [17,18]; Pincheira and Hardy (2018, 2019, 2021) [19,20,21]; and many others showed that the exchange rates of some commodity-exporting countries have the ability to predict the prices of the commodities being exported, as well as other closely related commodities.
Based on this evidence, we studied the predictive ability of three major commodity-producer’s economies frequently studied by this literature: Australia, Chile, and South Africa. To this end, we considered the following nine commodities/commodity indices: (1) WTI oil, (2) copper, (3) S&P GSCI: Goldman Sachs Commodity Price Index, (4) aluminum, (5) zinc, (6) LMEX: London Metal Exchange Index, (7) lead, (8) nickel, and (9) tin.
The source of our data was Thomson Reuters Datastream, from which we downloaded the daily close price of each asset. Our series were converted to monthly frequency by sampling the last day of each month. Our database spans September 1999 through June 2019 (the starting point of our sample period was determined by the date on which monetary authorities in Chile decided to pursue a pure flotation exchange rate regime).
Our econometric specifications were mainly inspired by Chen, Rogoff, and Rossi (2010) [17] and Pincheira and Hardy (2018, 2019, 2021) [19,20,21]. Our null model was an autoregression for commodity returns:

$$\Delta cp_{t+1}=\alpha_{0}+\alpha_{1}\Delta cp_{t}+u_{t+1}$$

while the alternative model augments it with the exchange rate:

$$\Delta cp_{t+1}=\gamma_{0}+\gamma_{1}\Delta cp_{t}+\gamma_{2}\Delta er_{t}+v_{t+1}$$

where $\Delta cp_{t+1}$ denotes the log-difference of a commodity price at time t + 1, $\Delta er_{t}$ stands for the log-difference of an exchange rate at time t, $(\alpha_{0},\alpha_{1})$ are the regression parameters for the null model, and $(\gamma_{0},\gamma_{1},\gamma_{2})$ are the regression parameters for the alternative model. Finally, $u_{t+1}$ and $v_{t+1}$ are error terms.
One-step-ahead forecasts are constructed in an obvious fashion through both models. Multi-step-ahead forecasts are constructed iteratively for the cumulative returns from t through t + h. To illustrate, the one-step-ahead forecast from t to t + 1 and the iterated one-step-ahead forecast from t + 1 to t + 2 are computed; the two-steps-ahead forecast of the cumulative return is simply their sum.
Under the null hypothesis of equal predictive ability, the exchange rate has no role in predicting commodity prices, i.e., its coefficient in the alternative model is zero. For the construction of our iterated multi-step-ahead forecasts, we assumed that the predictor follows an AR(1) process. Finally, for our out-of-sample evaluations, we considered P/R = 4 and a rolling scheme.
Following Equation (1), we took the adjusted average of K = 2 WCW statistics. Additional results using a recursive scheme, other splitting decisions (P and R), and different values of $\sigma_b$ and K are available upon request.
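The iterated construction of cumulative-return forecasts can be sketched as follows (assuming, as in the text, an AR(1) law of motion for the series being iterated; the parameter names are ours):

```python
def iterated_cumulative_forecast(alpha, beta, y_t, h):
    """h-step cumulative forecast from an AR(1): sum of iterated one-step forecasts."""
    total = 0.0
    level = y_t
    for _ in range(h):
        level = alpha + beta * level  # one-step-ahead forecast, fed back in
        total += level                # accumulate the forecasted returns
    return total
```

For h = 2, this returns exactly the sum of the one-step forecast from t to t + 1 and the iterated one-step forecast from t + 1 to t + 2, as in the text.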
Table 14 and Table 15 show our results for Chile and Australia, respectively. Table A3 in the Appendix section reports our results for South Africa.
Table 14 and Table 15 show interesting results for the LMEX. In particular, the alternative model outperformed the AR(1) for almost every forecasting horizon, using either the Australian dollar or the Chilean peso. A similar result was found for aluminum prices. These results seem to be consistent with previous findings. For instance, Pincheira and Hardy (2018, 2019, 2021) [19,20,21], using the ENCNEW test of Clark and McCracken (2001) [9], showed that models using exchange rates as predictors generally outperformed simple AR(1) processes when predicting some base metal prices via one-step-ahead forecasts.
Interestingly, using the Chilean exchange rate, Pincheira and Hardy (2019) [20] reported very unstable results for the monthly frequencies of nickel and zinc; moreover, they reported some exercises in which they could not outperform an AR(1). This is again consistent with our results reported in Table 14.
The results of the CW and WCW tests were similar. Most of the exercises tended to have the same sign, and the statistics had similar “magnitudes”. However, there are some important differences worth mentioning. In particular, CW tended to reject the null hypothesis more frequently. There are two possible explanations for this result: on the one hand, our simulations reveal that CW frequently had higher power; on the other hand, CW tended to be more oversized than our test at long forecasting horizons.
Table 14 can be understood in light of these two points. Both tests tended to be very similar at short forecast horizons; however, some discrepancies became apparent at longer horizons. CW rejected the null hypothesis at the 10% significance level in 54 out of 81 exercises (67%), while the WCW rejected the null only 42 times (52%). Table 15 has a similar message: CW rejected the null hypothesis at the 5% significance level in 49 out of 81 exercises (60%), while WCW rejected the null only 41 times (51%). The results for oil (C1) in Table 15 emphasize this fact: CW rejected the null hypothesis at the 5% significance level for most of the exercises, but our test only rejected at the 10% level. In summary, CW showed a higher rate of rejections at long horizons. The question here is whether this higher rate is due to higher size-adjusted power, or due to a false discovery rate induced by an empirical size higher than the nominal size. While the answer to this question cannot be known for certain, a conservative approach, one that protects the null hypothesis, would suggest looking at these extra CW rejections with caution.
7. Concluding Remarks
In this paper, we have presented a new test for out-of-sample evaluation in the context of nested models. We labelled this statistic as “Wild Clark and West (WCW)”. In essence, we propose a simple modification of the CW (Clark and McCracken (2001) [
9] and Clark and West (2006, 2007) [
13,
14]) core statistic that ensures asymptotic normality: basically, this paper can be viewed as a “non-normal distribution problem”, becoming “a normal distribution” one, which significantly simplifies the discussion. The key point of our strategy was to introduce a random variable that prevents the CW core statistic from becoming degenerate under the null hypothesis of equal predictive accuracy. Using West’s (1996) [
2] asymptotic theory, we showed that “asymptotic irrelevance” applies, hence our test can ignore the effects of parameter uncertainty. As a consequence, our test is extremely simple and easy to implement. This is important, since most of the characterizations of the limiting distributions of out-of-sample tests for nested models are non-standard. Additionally, they tend to rely, arguably, on a very specific set of assumptions, that, in general, are very difficult to follow by practitioners and scholars. In this context, our test greatly simplifies the discussion when comparing nested models.
We evaluated the performance of our test (relative to CW), focusing on iterated multi-step-ahead forecasts. Our Monte Carlo simulations suggest that our test is reasonably well-sized in large samples, with mixed results in power compared to CW. Importantly, when CW shows important size distortions at long horizons, our test seems to be less prone to these distortions and, therefore, it offers a better protection to the null hypothesis.
Finally, based on the commodity currencies literature, we provided an empirical illustration of our test. Following Chen, Rogoff, and Rossi (2010, 2011) [17,18] and Pincheira and Hardy (2018, 2019, 2021) [19,20,21], we evaluated the predictive performance of the exchange rates of three major commodity exporters (Australia, Chile, and South Africa) when forecasting commodity prices. Consistent with the previous literature, we found evidence of predictability for some of our sets of commodities. Although both tests tend to be similar, we did find some differences between CW and WCW. As our test tends to “better protect the null hypothesis”, some of these differences may be explained by size distortions in the CW test at long horizons, but others are most likely explained by the fact that CW may sometimes be more powerful.
Extensions for future research include the evaluation of our test using the direct method to construct multi-step-ahead forecasts. Similarly, our approach seems to be flexible enough to be used in the modification of other tests. It would be interesting to explore, via simulations, its potential when applied to other traditional out-of-sample tests of predictive ability in nested environments.