We now turn to some empirical examples, focusing on how the different implementations of the GRS statistic, as well as the asymptotic statistic, compare to the correct calculation, based on model testing and ranking outcomes, borrowing from Fama and French (2015, 2016) the choice of asset pricing models and the choice of test assets. The models we consider include the CAPM, the Fama–French three-factor model, two variations of a four-factor model, the Fama–French five-factor model, and a six-factor model that includes momentum. The test assets we explore include 5 × 5 sortings based on market capitalization and various anomaly variables including operating profitability, return volatility, residual volatility, accruals and so on, up to as many as 32 (2 × 4 × 4) portfolio sortings. We also explore decile portfolio sortings based on size, operating profitability, momentum, book-to-market, and investment. The number of test assets used in empirical work is commonly as large as 25, as we see in Fama and French (2015, 2016), though many studies use 30 to over 50 test assets. See, for instance, Lewellen et al. (2010), Kroencke (2017), Demaj et al. (2018), and Kleibergen and Zhan (2020). Recently, asset pricing models have typically contained at least four or five factors, though six are also commonly seen. See, for instance, Barillas and Shanken (2018), Fama and French (2018), Kan et al. (2024), or Hanauer (2020). Given the state of the literature, our choice of test assets and factors sits comfortably amidst typical empirical asset pricing applications.
We use data retrieved from the French data library, and we consider five-, ten-, fifteen-, twenty-, twenty-five-, forty-, and fifty-year periods drawn from 1963–2019 when considering up to six factors in the competing asset pricing models, and from 1926–2019 when considering up to four factors.
13. We limit our sample window to no less than five years of monthly data because few studies use fewer than 60 observations. Gibbons et al. (1989) note that issues of stationarity can reasonably constrain the length of a time series used, so that “it is not uncommon to see published work where T is around 60”; Affleck-Graves and McDonald (1989) limited their analysis and simulations to 60-month periods; Ferson and Foerster (1994) studied 60, 120, and 720 monthly observations in their simulation study; Rouwenhorst (1999) used five years in sub-sample analysis; and among recent work that exploited as few as four or five years of data are Belimam et al. (2018) and Qin (2019).
Leite et al. (2018) used as few as 98 months of data, Lewellen, Nagel, and Shanken (2010) used 168 observations of quarterly data, Choi et al. (2020) performed sub-sample stability tests using eight years of monthly data, and many studies of emerging economy markets have used ten to fifteen years of monthly data. See, for instance, Alhomaidi et al. (2019), Alshammari and Goto (2022), Merdad et al. (2015), and Sha and Gao (2019).
Our primary results, found in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, and Table A7, make use of the full sample available to us by partitioning the data sample into overlapping periods. For instance, at the five-year horizon over 1963–2019, we form five-year windows starting in 1963 and every year following, so that the first window extends from July 1963 to June 1968, the second from January 1964 to December 1968, the third from January 1965 to December 1969, and so on, resulting in 53 five-year overlapping windows. For each of these windows over the period 1963–2019, we use 19 sets of test assets, listed in the first column of Table A1, and six competing asset pricing models. These models are the CAPM, the Fama–French three-factor model, four- and five-factor models, as well as a six-factor model including momentum, all as considered in Fama and French (2015, 2016).
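The overlapping-window scheme described above can be sketched in a few lines. This is an illustrative reconstruction of the window-date logic (ours, not the authors' code), assuming monthly data beginning in July 1963 and five-year (60-month) windows advancing one calendar year at a time:

```python
def five_year_windows():
    """Enumerate the 53 overlapping five-year (60-month) windows over
    1963-2019, as (start, end) pairs of (year, month) tuples.

    The first window runs July 1963 - June 1968 (monthly data begin in
    July 1963); each subsequent window starts in January of the next
    calendar year, so window starts span 1963-2015.
    """
    windows = [((1963, 7), (1968, 6))]
    for year in range(1964, 2016):
        # January of `year` through December of `year + 4` is 60 months.
        windows.append(((year, 1), (year + 4, 12)))
    return windows

wins = five_year_windows()
```

The same logic extends to the 10- through 50-year horizons by changing the window length while keeping the one-year step.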
Appendix B.1. Results for Five-Year Windows
We first present a small subset of our empirical findings in Table A1, Table A2, and Table A3. For convenience, Table A1 and Table A2 replicate Table 1 and Table 2 from the main text, and here we discuss them in greater depth. In these tables, we consider five-year windows, the minimum span of data to which the GRS statistic is commonly applied. This short span of data shows the most serious over-rejection of asset pricing models from the application of the alternative formulations of the GRS test, as well as the highest frequency of misrankings relative to the correct formulation of the GRS statistic. We present evidence for longer spans of data, up to 50-year windows, in Table A4, Table A5, Table A6, and Table A7, and discuss them in Appendix B.2.
In Table A1, we present the number of excess test rejections at the 1%, 5%, and 10% levels, relative to the correct GRS statistic, for each of the alternative test statistics: the asymptotic statistic and the two alternative formulations. The results for the largest number of test assets we considered, 32, are displayed in the first three rows of the table; the results for cases with 25 test assets follow on rows four through thirteen; the 17 test assets of the industry portfolios follow on row fourteen; and the remaining five rows present results for sets of 10 test assets. In Table A1, we use a total of 53 five-year overlapping windows starting from 1963 and consider six different asset pricing models, meaning that we have 318 cases for which an asset pricing model might be rejected, for each of the 19 sets of test assets.
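For concreteness, the correct statistic referenced throughout can be computed as follows. This is a sketch of the standard finite-sample GRS formula of Gibbons, Ross and Shanken (1989), not the authors' own code:

```python
import numpy as np
from scipy import stats

def grs_test(excess_returns, factors):
    """Finite-sample GRS test of the hypothesis that all alphas are zero.

    excess_returns : T x N array of test-asset excess returns
    factors        : T x K array of factor returns
    Returns (statistic, p_value); under the null the statistic follows an
    exact F(N, T - N - K) distribution.  Residual and factor covariance
    matrices use the maximum-likelihood (divide-by-T) estimators.
    """
    R = np.asarray(excess_returns, dtype=float)
    F = np.asarray(factors, dtype=float)
    T, N = R.shape
    K = F.shape[1]
    X = np.column_stack([np.ones(T), F])        # regressors: intercept + factors
    B, *_ = np.linalg.lstsq(X, R, rcond=None)   # OLS for all N assets at once
    alpha = B[0]                                # estimated intercepts (alphas)
    resid = R - X @ B
    Sigma = resid.T @ resid / T                 # MLE residual covariance (N x N)
    fbar = F.mean(axis=0)
    Omega = (F - fbar).T @ (F - fbar) / T       # MLE factor covariance (K x K)
    quad = alpha @ np.linalg.solve(Sigma, alpha)
    stat = (T - N - K) / N * quad / (1.0 + fbar @ np.linalg.solve(Omega, fbar))
    return stat, stats.f.sf(stat, N, T - N - K)
```

The alternative formulations discussed in the text differ from this calculation in the degrees-of-freedom multipliers and covariance estimators used, which is the source of the excess rejections tabulated below.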
Table A1.
Number and proportion of subsamples with more rejections relative to the GRS test, for five-year windows sampled during 1963–2019, across 19 sets of test assets.
Test Assets (Nb of Subsamples = 53) | Asymptotic 1% | 5% | 10% | 1% | 5% | 10% | 1% | 5% | 10% |
---|---|---|---|---|---|---|---|---|---|
2 × 4 × 4 MExMEBExINV | 289 | 275 | 247 | 4 | 17 | 21 | 0 | 1 | 0 |
2 × 4 × 4 MExMEBExOP | 262 | 246 | 220 | 4 | 14 | 18 | 0 | 0 | 0 |
2 × 4 × 4 MExOPxINV | 282 | 238 | 194 | 10 | 27 | 28 | 0 | 3 | 1 |
5 × 5 AccrualsxME | 222 | 218 | 207 | 7 | 17 | 31 | 0 | 0 | 1 |
5 × 5 BExME | 212 | 191 | 166 | 8 | 21 | 20 | 1 | 0 | 1 |
5 × 5 BetaxME | 229 | 221 | 204 | 11 | 16 | 25 | 0 | 1 | 0 |
5 × 5 MExOP | 238 | 227 | 188 | 9 | 27 | 42 | 0 | 0 | 0 |
5 × 5 MomentumxME | 210 | 166 | 124 | 14 | 29 | 27 | 0 | 0 | 0 |
5 × 5 NetIssuexME | 216 | 176 | 147 | 9 | 29 | 27 | 0 | 2 | 2 |
5 × 5 RVariancexME | 147 | 94 | 65 | 18 | 28 | 12 | 0 | 2 | 0 |
5 × 5 VariancexME | 137 | 81 | 45 | 26 | 27 | 12 | 0 | 3 | 1 |
5 × 5 BExInv | 248 | 259 | 245 | 4 | 18 | 16 | 0 | 2 | 1 |
5 × 5 MExInv | 214 | 189 | 171 | 14 | 17 | 29 | 0 | 0 | 0 |
Industry | 126 | 153 | 152 | 12 | 9 | 22 | 0 | 1 | 0 |
Book-to-Market Deciles | 48 | 70 | 67 | 5 | 22 | 21 | 0 | 1 | 1 |
Investment Deciles | 22 | 30 | 56 | 4 | 4 | 6 | 0 | 0 | 0 |
Momentum Deciles | 56 | 60 | 66 | 18 | 14 | 20 | 1 | 1 | 1 |
Size Deciles | 49 | 90 | 68 | 6 | 23 | 27 | 0 | 0 | 2 |
Operating Profitability Deciles | 48 | 70 | 65 | 4 | 24 | 15 | 0 | 1 | 0 |
Average | 171.3 | 160.7 | 142 | 9.8 | 20.2 | 22.1 | 0.11 | 0.95 | 0.58 |
Proportion (%) | 53.9 | 50.5 | 44.6 | 3.1 | 6.3 | 6.9 | 0.0 | 0.3 | 0.2 |
The asymptotic statistic fares the worst relative to the correct GRS statistic among the alternatives, over-rejecting roughly 50% of the time on average, across the common significance levels of 1%, 5%, and 10%. This over-rejection is worse when we consider more test assets. The two alternative formulations do not display patterns related to the number of test assets or significance level, with the first (second) over-rejecting relative to the correct GRS statistic about 5% (0.2%) of the time on average, across the common significance levels of 1%, 5%, and 10%. The over-rejection of the latter is unstable across test assets, however, with a few cases displaying close to 6% over-rejection (three times over 53 subsamples), and some with no over-rejection.
Table A2.
Number and proportion of subsamples with different ranking outcomes from the GRS statistic, for five-year windows sampled during 1963–2019, across 19 sets of test assets.
Test Assets (Nb of Subsamples = 53) | Any Model Mis-Ranked | | Top Model Mis-Ranked | |
---|---|---|---|---|
2 × 4 × 4 MExMEBExINV | 26 | 2 | 5 | 0 |
2 × 4 × 4 MExMEBExOP | 37 | 1 | 6 | 0 |
2 × 4 × 4 MExOPxINV | 31 | 1 | 12 | 0 |
5 × 5 AccrualsxME | 36 | 1 | 12 | 0 |
5 × 5 BExME | 26 | 1 | 7 | 0 |
5 × 5 BetaxME | 31 | 1 | 10 | 0 |
5 × 5 MExOP | 33 | 0 | 10 | 0 |
5 × 5 MomentumxME | 33 | 0 | 14 | 0 |
5 × 5 NetIssuexME | 27 | 4 | 9 | 2 |
5 × 5 RVariancexME | 32 | 0 | 14 | 0 |
5 × 5 VariancexME | 34 | 1 | 10 | 0 |
5 × 5 BExInv | 35 | 3 | 6 | 0 |
5 × 5 MExInv | 30 | 3 | 9 | 0 |
Industry | 29 | 2 | 4 | 0 |
Book-to-Market Deciles | 28 | 0 | 4 | 0 |
Investment Deciles | 24 | 2 | 4 | 0 |
Momentum Deciles | 30 | 2 | 5 | 0 |
Size Deciles | 27 | 3 | 5 | 0 |
Operating Profitability Deciles | 30 | 0 | 6 | 0 |
Average | 30.47 | 1.42 | 8.00 | 0.11 |
Proportion (%) | 57.5 | 2.7 | 15.1 | 0.2 |
Table A3.
Summary statistics on factor models and test statistics for five-year windows over the 2007–2009 financial crisis sample period, for the 5 × 5 NetIssuexME test assets.
Date / Factor Model | Average Annualized/Raw % Alpha | GRS Statistic/Rank | Statistic/Rank | Statistic/Rank |
---|---|---|---|---|
JAN 2005-DEC 2009 | | | | |
Mkt | 2.63/0.219 | 0.688/1 | 0.712/1 | 0.688/1 |
Mkt SMB HML | 2.39/0.199 | 0.819/2 | 0.878/2 | 0.819/2 |
Mkt SMB HML UMD | 2.37/0.198 | 0.837/3 | 0.913/3 | 0.837/3 |
Mkt SMB RMW CMA | 2.54/0.212 | 0.910/4 | 0.992/4 | 0.912/4 |
Mkt SMB HML RMW CMA | 2.52/0.210 | 0.997/6 | 1.108/6 | 0.999/6 |
Mkt SMB HML RMW CMA UMD | 2.57/0.214 | 0.972/5 | 1.100/5 | 0.974/5 |
JAN 2006-DEC 2010 ‡ | | | ⇓ | |
Mkt † | 3.23/0.269 | 1.114/3 | 1.152/**1** | 1.114/3 |
Mkt SMB HML † | 2.32/0.193 | 1.078/2 | 1.155/2 | 1.078/**1** |
Mkt SMB HML UMD † | 2.41/0.201 | 1.256/6 | 1.371/**5** | 1.257/6 |
**Mkt SMB RMW CMA** *,† | 2.64/0.220 | 1.077/1 | 1.174/**3** | 1.080/**2** |
Mkt SMB HML RMW CMA | 2.65/0.221 | 1.130/4 | 1.256/4 | 1.134/4 |
Mkt SMB HML RMW CMA UMD † | 2.81/0.234 | 1.253/5 | 1.418/**6** | 1.257/5 |
JAN 2007-DEC 2011 ‡ | | | | ⇓ |
Mkt † | 3.39/0.283 | 1.329/2 | 1.375/**1** | 1.329/**1** |
Mkt SMB HML † | 3.11/0.259 | 1.503/4 | 1.610/**3** | 1.504/4 |
Mkt SMB HML UMD | 3.26/0.272 | 1.888/5 | 2.060/5 | 1.890/5 |
**Mkt SMB RMW CMA** *,† | 2.89/0.240 | 1.327/1 | 1.448/**2** | 1.332/**2** |
Mkt SMB HML RMW CMA † | 2.92/0.244 | 1.478/3 | 1.643/**4** | 1.484/3 |
Mkt SMB HML RMW CMA UMD | 3.08/0.257 | 1.947/6 | 2.204/6 | 1.955/6 |
JAN 2008-DEC 2012 | | | | |
Mkt | 4.34/0.361 | 1.299/2 | 1.344/2 | 1.300/2 |
Mkt SMB HML | 3.64/0.303 | 1.500/3 | 1.607/3 | 1.500/3 |
Mkt SMB HML UMD † | 3.77/0.314 | 1.906/6 | 2.079/**5** | 1.907/6 |
Mkt SMB RMW CMA | 2.98/0.248 | 1.161/1 | 1.267/1 | 1.165/1 |
Mkt SMB HML RMW CMA | 3.15/0.263 | 1.577/4 | 1.753/4 | 1.583/4 |
Mkt SMB HML RMW CMA UMD † | 3.39/0.282 | 1.896/5 | 2.146/**6** | 1.904/5 |
JAN 2009-DEC 2013 ‡ | | ⇓ | | ⇓ |
Mkt | 2.74/0.228 | 1.678/2 | 1.736/2 | 1.681/2 |
Mkt SMB HML | 3.05/0.254 | 2.938/5 | 3.148/5 | 2.946/5 |
Mkt SMB HML UMD | 3.12/0.260 | 3.324/6 | 3.626/6 | 3.333/6 |
Mkt SMB RMW CMA | 2.46/0.205 | 1.402/1 | 1.530/1 | 1.406/1 |
Mkt SMB HML RMW CMA | 3.05/0.254 | 2.723/3 | 3.026/3 | 2.734/3 |
Mkt SMB HML RMW CMA UMD | 2.72/0.227 | 2.733/4 | 3.093/4 | 2.745/4 |
Table A4.
Percentage of subsamples/models with different decision outcomes from the correct GRS test.
Window (Months) | Asymptotic 1% | 5% | 10% | 1% | 5% | 10% | 1% | 5% | 10% |
---|---|---|---|---|---|---|---|---|---|
Panel A: 1963–2019 |
60 | 53.9 | 50.5 | 44.6 | 3.1 | 6.3 | 6.9 | 0.0 | 0.3 | 0.2 |
120 | 27.4 | 24.8 | 20.0 | 3.2 | 3.7 | 3.9 | 0.1 | 0.0 | 0.1 |
180 | 16.4 | 13.0 | 11.4 | 2.1 | 2.1 | 2.2 | 0.1 | 0.0 | 0.0 |
240 | 9.7 | 8.6 | 6.3 | 1.4 | 1.7 | 1.2 | 0.0 | 0.0 | 0.0 |
300 | 6.5 | 5.7 | 4.5 | 0.9 | 1.3 | 1.1 | 0.1 | 0.1 | 0.0 |
480 | 3.0 | 2.9 | 1.9 | 0.3 | 0.7 | 0.4 | 0.0 | 0.0 | 0.0 |
600 | 2.9 | 0.8 | 1.6 | 0.3 | 0.2 | 0.3 | 0.0 | 0.0 | 0.0 |
Panel B: 1926–2019 |
60 | 34.6 | 36.1 | 33.2 | 2.6 | 4.3 | 5.3 | 0.0 | 0.1 | 0.1 |
120 | 15.4 | 15.5 | 12.6 | 2.2 | 3.1 | 2.5 | 0.0 | 0.0 | 0.0 |
180 | 8.8 | 10.3 | 9.2 | 1.0 | 1.4 | 1.7 | 0.1 | 0.0 | 0.0 |
240 | 8.6 | 7.7 | 4.7 | 1.3 | 2.0 | 1.3 | 0.0 | 0.0 | 0.0 |
300 | 6.0 | 4.4 | 3.1 | 1.0 | 0.8 | 0.6 | 0.0 | 0.0 | 0.0 |
480 | 3.9 | 2.8 | 0.9 | 0.7 | 0.7 | 0.2 | 0.0 | 0.0 | 0.0 |
600 | 2.2 | 1.7 | 0.6 | 0.6 | 0.1 | 0.2 | 0.0 | 0.0 | 0.0 |
Table A5.
Percentage of subsamples/models with different model rank outcomes from the correct GRS statistic, ranked by test statistic value.
Window (Months) | Any Model Misranked | | Top Model Misranked | |
---|---|---|---|---|
Panel A: 1963–2019 |
60 | 57.5 | 2.7 | 15.1 | 0.2 |
120 | 37.6 | 0.9 | 9.4 | 0.0 |
180 | 21.4 | 0.0 | 4.4 | 0.0 |
240 | 12.5 | 0.0 | 3.0 | 0.0 |
300 | 9.7 | 0.2 | 2.4 | 0.2 |
480 | 14.3 | 0.3 | 1.8 | 0.0 |
600 | 15.1 | 0.0 | 0.0 | 0.0 |
Panel B: 1926–2019 |
60 | 17.4 | 0.7 | 9.6 | 0.4 |
120 | 10.6 | 0.2 | 5.5 | 0.0 |
180 | 5.2 | 0.2 | 1.9 | 0.0 |
240 | 3.1 | 0.0 | 0.9 | 0.0 |
300 | 1.9 | 0.0 | 1.2 | 0.0 |
480 | 1.5 | 0.3 | 0.6 | 0.0 |
600 | 1.1 | 0.0 | 0.7 | 0.0 |
In Table A2, we present the number of cases for which each statistic misranks factor models relative to the correct GRS statistic's ranking. Here, we rank the six models for each of the 53 five-year windows considered in Table A1. Misranking at the five-year horizon is far more common than excess test rejections: the first alternative statistic misranks models in close to 60% of cases, and even the second alternative, which rarely over-rejects, misranks in close to 3% of cases. If we restrict our attention to cases in which the top model is misranked, the first alternative statistic misranks between 10% and 15% of the time, while the second misranks the top model less than 0.5% of the time.
In Table A3, we present detailed results for the 5 × 5 net share issuance crossed with size (NetIssuexME) test asset set, for five overlapping five-year windows spanning the 2007–2009 financial crisis, in order to give the reader a finer sense of the results in Table A1 and Table A2. We present the average annualized and raw percentage alpha for each of the six asset pricing models and each window, as well as values of the correct GRS statistic and the two alternative statistics. Beside each test statistic value, we report the ranks of the six models, from one to six. A model ranked differently by an alternative statistic than by the correct GRS statistic is indicated by a † next to the factor label and further identified by boldfacing the rank number in the appropriate column. A misranked top model is indicated by an * next to the factor label and further identified by bolding the factor label in the appropriate row. A test method producing different ranks using p-values rather than test statistics is indicated by a ‡ next to the window period and further identified with a ⇓ in the appropriate column. This issue of different rankings using the test statistic versus the p-value will be drawn out below.
Table A6.
Percentage of subsamples/models with different model rank outcomes from the correct GRS statistic, ranked by test p-value.
Window (Months) | Any Model Mis-Ranked | | Top Model Mis-Ranked | |
---|---|---|---|---|
Panel A: 1963–2019 |
60 | 54.6 | 1.9 | 13.1 | 0.4 |
120 | 37.4 | 0.5 | 9.1 | 0.0 |
180 | 21.9 | 0.4 | 4.4 | 0.1 |
240 | 11.5 | 0.1 | 2.8 | 0.0 |
300 | 9.4 | 0.2 | 2.1 | 0.0 |
480 | 13.2 | 0.3 | 1.8 | 0.0 |
600 | 13.2 | 0.0 | 0.0 | 0.0 |
Panel B: 1926–2019 |
60 | 15.0 | 0.7 | 8.1 | 0.7 |
120 | 10.2 | 0.0 | 5.5 | 0.0 |
180 | 5.4 | 0.0 | 2.1 | 0.0 |
240 | 3.3 | 0.2 | 0.9 | 0.0 |
300 | 1.9 | 0.0 | 1.2 | 0.0 |
480 | 1.2 | 0.0 | 0.6 | 0.0 |
600 | 1.1 | 0.0 | 0.7 | 0.0 |
What we see in Table A3 is fairly typical across the full set of empirical findings on which Table A1 and Table A2 are based; the first alternative statistic displays many misrankings, while misrankings of the top model are rare. Although not tabulated, only the asymptotic statistic typically rejects factor models in this particular small set of examples, so the two alternative statistics and the correct GRS test statistic are all consistent with each other. Studies that seek to rank models commonly rank them by the average absolute alpha, whether or not all models are rejected by the GRS test; see, for instance, Fama and French (2015). As we can see from Table A3, the model with the smallest average absolute alpha is often not top-ranked.
Appendix B.2. Summary Results for Longer Windows
We now present evidence for longer spans of data in Table A4, Table A5, Table A6, and Table A7, using two sets of date windows of up to 50 years. In addition to the 1963–2019 period that we considered in Appendix B.1, we now add the period 1926–1962. The pre-1963 period lacks data for factors and test assets built using operating profitability, accruals, and similar variables, so we are left with six sets of test assets constructed from book-to-market, size, industry classification, and momentum, and three factor models: the CAPM, the Fama–French three-factor model, and a four-factor model including momentum, all as considered in Fama and French (2015, 2016). For this restricted set of test assets and factors, we estimate test rejections and rankings using the entire 1926–2019 sample, and we report these results separately from those constructed using the larger set of test assets and models on the 1963–2019 data alone. This contrast is informative, as the chance of a misranking declines when fewer asset pricing models are considered.
In Table A4, we present the percentage of excess test rejections relative to the correct GRS statistic at each of the 1%, 5%, and 10% levels for each of the alternative test methods: the asymptotic statistic and the two alternative formulations. This is performed for overlapping windows of 5, 10, 15, 20, 25, 40, and 50 years, using data that span either 1963–2019 or 1926–2019. Panel A displays the percentage of over-rejection rates averaged over six asset pricing models and 19 sets of test assets for the period 1963–2019; Panel B displays the same percentages for the period 1926–2019 using the smaller set of three factor models and six sets of test assets.
For both Panels A and B, we see a virtually monotonic decline in over-rejections as sample size increases, albeit at a fairly slow rate. The asymptotic statistic over-rejects roughly 5% of the time relative to the correct GRS statistic even with a 25-year window. The first alternative over-rejects roughly 5% of the time with 5 years of data, and over 1% of the time even with a 25-year window. The second alternative, which is fairly close to the correct GRS statistic, does not appear to over-reject with more than 25 years of data, and over-rejects less than 0.1% of the time with 10 or more years of data. The simulations based on models and calibrations described in Appendix C.1 reveal that the asymptotic statistic always over-rejects, at least in this experimental design. The over-rejection declines with sample size, but at a decreasing (non-linear) rate, increases with the number of factors, and is largely unrelated to the number of test assets.
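The over-rejection of the asymptotic version at short horizons can be illustrated with a small Monte Carlo under the null of zero alphas. The i.i.d. normal design below is ours for illustration only, not the calibration of Appendix C.1:

```python
import numpy as np
from scipy import stats

def rejection_rates(T=60, N=25, K=3, reps=400, level=0.05, seed=0):
    """Compare null rejection rates of the exact F-based GRS test and
    its asymptotic chi-square version.  True alphas and betas are zero,
    with i.i.d. normal factors and residuals (illustrative design)."""
    rng = np.random.default_rng(seed)
    rej_f = rej_chi2 = 0
    for _ in range(reps):
        F = rng.standard_normal((T, K))          # factor returns
        R = rng.standard_normal((T, N))          # asset returns under the null
        X = np.column_stack([np.ones(T), F])
        B, *_ = np.linalg.lstsq(X, R, rcond=None)
        resid = R - X @ B
        Sigma = resid.T @ resid / T              # MLE residual covariance
        fbar = F.mean(axis=0)
        Omega = (F - fbar).T @ (F - fbar) / T    # MLE factor covariance
        w = (B[0] @ np.linalg.solve(Sigma, B[0])) / \
            (1.0 + fbar @ np.linalg.solve(Omega, fbar))
        rej_f += stats.f.sf((T - N - K) / N * w, N, T - N - K) < level
        rej_chi2 += stats.chi2.sf(T * w, N) < level   # asymptotic chi-square
    return rej_f / reps, rej_chi2 / reps
```

With five years of monthly data (T = 60) and 25 test assets, the chi-square version rejects far more often than the nominal 5% level, while the exact F version stays near it, consistent with the pattern in Table A4.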
In Table A5 and Table A6, we present the percentage of cases for which each statistic misranks factor models relative to the correct GRS statistic's ranking, by test statistic value and by p-value, respectively. Misranking of models by the first alternative statistic is remarkably common, even at a 50-year data horizon, if the set of models is as large as six, for either ranking method (statistic value or p-value). Replacing the correct form of the GRS test with this statistic scrambles rankings over 15% of the time even at the 50-year horizon, and over 50% of the time at the 5-year horizon, for cases with six models (Panel A). When there are only three models (Panel B), misrankings are naturally fewer, and for data horizons over 15 years they mostly occur less than 5% of the time across methods. The second alternative is typically consistent with the correct GRS statistic, though misrankings occur even at the 40-year window length. If we restrict our attention to cases in which the top model is misranked, the second alternative is consistent with the correct statistic once we have 40 or more years of data, but the first alternative misranks even at 40 years of data. Again, when there are only three models (Panel B), misrankings are fewer, and for data horizons over 10 years they occur less than 5% of the time across methods.
Finally, in Table A7, we present evidence on rankings from these different test statistics compared with rankings based on the p-values of those statistics, to see whether they are consistent. We can think of ranking by the test statistic as a sort of mean squared error ranking of models, perhaps helpful if we are interested in minimizing model prediction error even when models are false (see, for instance, Teräsvirta and Mellin 1986). Here, we see that all the rankings by test statistics are fragile, even at a 50-year horizon, averaging close to 10% misranked whether we compare three or six asset pricing models to each other. Table A7 highlights that the statistically sound p-values, rather than the raw GRS statistics, should be used in model ranking.
Consider the case of a true model that includes a subset of the available factors. An untabulated analysis of the simulated test rankings formed using the magnitude of the GRS statistic confirms that, when the GRS statistic ranking differs from the p-value ranking, the probability that an incorrect model larger than the true model achieves a high rank versus other models increases with the number of factors in the model. This bias is stronger when using an incorrect GRS statistic. In cases for which the GRS statistic ranking and the p-value ranking agree, there is no such tilt toward models larger than the true model.
The important insight to take away from these results is that error in calculating the GRS statistic can have a material impact on empirical results, particularly when twenty or fewer years of data are used, which is not uncommon in empirical asset pricing studies. For example, Barillas and Shanken (2018) performed model comparisons on a little less than 15 years of monthly data. Harvey and Liu (2021) considered tests of asset pricing models and reported simulations for 20 and 40 years of monthly data. Sha and Gao (2019) used 144 months of data and exploited six metrics to evaluate factor model performance, including the GRS statistic. Baek and Bilson (2015) considered 234 months of data in a subsample estimation. Chiah et al. (2016) used 23 years of data when comparing models using the GRS statistic. One takeaway from these papers is that many situations involving specialized data (like Sha and Gao 2019, exploring mutual fund returns in China) or sub-sample robustness checks (like Baek and Bilson 2015) are necessarily constrained to samples shorter than fifty or even twenty years, so that the bias from an incorrectly calculated GRS statistic becomes large.
Table A7.
Percentage of subsamples/models with different model rankings if ranked by p-values rather than test statistics.
Window (Months) | | | |
---|---|---|---|
Panel A: 1963–2019 |
60 | 16.3 | 21.8 | 16.8 |
120 | 4.9 | 5.8 | 5.3 |
180 | 4.3 | 3.9 | 3.9 |
240 | 4.6 | 5.4 | 4.4 |
300 | 4.8 | 4.8 | 4.8 |
480 | 2.6 | 3.8 | 2.6 |
600 | 6.6 | 7.9 | 6.6 |
Panel B: 1926–2019 |
60 | 2.2 | 5.0 | 2.6 |
120 | 0.6 | 1.2 | 0.8 |
180 | 0.4 | 0.2 | 0.6 |
240 | 0.4 | 0.2 | 0.2 |
300 | 0.7 | 0.7 | 0.7 |
480 | 5.5 | 5.8 | 5.8 |
600 | 7.4 | 7.4 | 7.4 |
We do not recommend ranking models by the magnitude of the GRS statistic; instead, we suggest using the p-value of the statistic from the exact F-distribution, since the p-value internalizes the different degrees of freedom of GRS statistics computed for models with different numbers of factors. We recognize that ranking models by the GRS statistic has a desirable economic intuition: it is a direct, easy-to-understand metric tied to a model's factors spanning the test asset returns. But researchers should at least understand that this ranking can have undesirable statistical properties. The detailed results on which these tables are based, and additional summary statistics, are available on request.
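The recommendation can be made concrete with a short sketch: ranking by p-value from the exact F(N, T − N − K) distribution differs from ranking by raw statistic precisely because models with different numbers of factors K are judged against different null distributions. The model names and statistic values below are hypothetical, chosen only to illustrate the mechanics:

```python
from scipy import stats

def rank_by_stat_and_pvalue(results):
    """Rank candidate models two ways: by raw GRS statistic (smaller is
    better) and by exact-F p-value (larger is better).

    `results` maps a model name to a tuple (grs_stat, N, T, K).  Because
    the null distribution is F(N, T - N - K), models with different K are
    judged against different distributions, so the two orderings need not
    coincide.
    """
    pval = {m: stats.f.sf(s, n, t - n - k) for m, (s, n, t, k) in results.items()}
    by_stat = sorted(results, key=lambda m: results[m][0])
    by_pval = sorted(results, key=lambda m: -pval[m])
    return by_stat, by_pval

# Hypothetical statistics for three nested models on N = 25 assets, T = 60.
models = {
    "CAPM (K=1)": (1.40, 25, 60, 1),
    "FF3 (K=3)":  (1.42, 25, 60, 3),
    "FF5 (K=5)":  (1.45, 25, 60, 5),
}
by_stat, by_pval = rank_by_stat_and_pvalue(models)
```

Because adding factors shrinks the denominator degrees of freedom T − N − K, a larger model with a slightly larger statistic can still carry a larger p-value, which is exactly the divergence between statistic-based and p-value-based rankings tabulated in Table A7.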