Article

Testing and Ranking of Asset Pricing Models Using the GRS Statistic

1 Schulich School of Business, Room N204-C, York University, 4700 Keele St., Toronto, ON M3J 1P3, Canada
2 Department of Economics, University of California Riverside, 900 University Avenue, Riverside, CA 92521, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Risk Financial Manag. 2024, 17(4), 168; https://doi.org/10.3390/jrfm17040168
Submission received: 21 March 2024 / Revised: 4 April 2024 / Accepted: 8 April 2024 / Published: 19 April 2024

Abstract

We clear up an ambiguity in the statement of the GRS statistic by providing the correct formula of the GRS statistic and the first proof of its F-distribution in the general multiple-factor case. Casual generalization of the Sharpe-ratio-based interpretation of the single-factor GRS statistic to the multiple-portfolio case makes experts in asset pricing studies susceptible to an incorrect formula. We illustrate the consequences of using the incorrect formulas that the ambiguity in GRS leads to—over-rejecting and misranking asset pricing models. In addition, we suggest a new approach to ranking models using the GRS statistic p-value.

1. Introduction

In an influential paper, Gibbons et al. (1989) developed and analyzed a test of the ex ante mean-variance efficiency of portfolios. This test statistic is now widely used to evaluate asset pricing models and has also been exploited to rank competing models. For the single-factor case, Gibbons et al. (1989) carefully developed the statistic in a linear regression model (hereafter referred to as the GRS statistic or test), derived its small-sample F distribution, investigated its power properties, and highlighted its significance in asset pricing theory by providing an alternative interpretation involving the Sharpe ratio (Sharpe 1966), the excess return to a portfolio per unit of risk (or volatility, measured by standard deviation), which is a key measure of portfolio efficiency. For the multiple-factor case, however, Gibbons et al. (1989, sec. 7) were ambiguous on how the statistic should be constructed.
The solution to the portfolio optimization problem that yields the Sharpe ratio has us estimate a variance–covariance matrix of the portfolio excess returns, but the equivalence of the GRS statistic and the F test statistic relies on this matrix arising in the projection of the test asset returns on the column space of the asset pricing factors, not as a variance–covariance matrix. Unfortunately, Gibbons et al. (1989) used equivocal language to describe this matrix, referring to it as a “variance-covariance matrix”, and this has apparently caused confusion about the function of the GRS statistic, which is further exacerbated by the fact that the small-sample F distribution wrongfully conjures up a degrees-of-freedom (d.f. hereafter) adjustment that is improper in this case. This has led to the application of a very common incorrect formula that, paradoxically, is more likely to be used by financial economists, the experts in the field, than by someone who focuses only on the statistical aspects of the problem.1 We find that using the incorrect formula, which we will refer to for conciseness as $\hat{W}$ below, leads to (i) a test statistic that does not follow the F distribution as prescribed and over-rejects the null hypothesis of portfolio efficiency; and (ii) smaller models often being favored over larger ones when the statistic is used to rank asset pricing models. This error comes from mixing terms that fall out of portfolio optimization with a statistical object that comes from the small-sample F test derivation.
The main contribution of our paper is to clear up the ambiguity in the calculation of the GRS statistic and highlight (both theoretically and empirically) issues that arise from the use of $\hat{W}$ and two related and popularly used statistics, one which folds in a second degree-of-freedom error (we will refer to this as $\breve{W}$), and the asymptotic $\chi^2$ version often used to replace the GRS test.2 The asymptotic $\chi^2$ and $\breve{W}$ implementations result in much higher model rejection rates than the correct GRS statistic, most notably for the asymptotic $\chi^2$ test, even with 50 years of monthly data. The use of an incorrect implementation of the GRS statistic also results in inconsistent model rankings across $\hat{W}$, $\breve{W}$ and the correct calculation of the GRS statistic, with 40 or even 50 years of data. Finally, we propose a new methodology for the ranking of competing asset pricing models, making use of test p-values rather than the raw GRS statistic values, meant to properly internalize the model sizes. While a determination of statistically significantly different model performance is valuable, often researchers are simply attempting to rank models. Our approach is a computationally straightforward approach to answering this question.
We will adopt the notation in Gibbons et al. (1989, sec. 7) whenever possible. The proofs of the theoretical results and the details of the empirical results are in Appendix A and Appendix B, respectively.

2. The GRS Test for Multiple Factors

The proofs of all the claims in this section can be found in Appendix A. The problem is to test the mean-variance efficiency of $L$ portfolios using another set of $N$ assets (known as test assets).
We start with a linear regression model:
$$\tilde{r}_{it} = \delta_{i0} + \delta_i' \tilde{r}_{pt} + \tilde{\eta}_{it}, \quad i = 1, \ldots, N, \ \text{and} \ t = 1, \ldots, T, \tag{1}$$
where $\tilde{r}_{it}$ denotes the excess return on test asset $i$ in period $t$, the $L$-vector of portfolio excess returns $\tilde{r}_{pt}$ serves as factors, and $\tilde{\eta}_{it}$ denotes the disturbance. Mean-variance efficiency of the $L$ portfolios implies (Sharpe 1964)
$$H_0: \delta_{i0} = 0, \quad i = 1, \ldots, N. \tag{2}$$
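Lemma 1 (Joint F test). Let $\tilde{r}_p \equiv (\tilde{r}_{p1}, \ldots, \tilde{r}_{pT})'$, $\bar{r}_p \equiv T^{-1} \sum_{t=1}^{T} \tilde{r}_{pt}$, and let $\hat{\delta}_0$ be the ordinary least squares (OLS) estimator of $\delta_0 \equiv (\delta_{10}, \ldots, \delta_{N0})'$; also let $\hat{\eta}_t \equiv (\hat{\eta}_{1t}, \ldots, \hat{\eta}_{Nt})'$ be the OLS residuals of model (1). We follow Gibbons et al. (1989) to assume that the disturbance $\tilde{\eta}_t \equiv (\tilde{\eta}_{1t}, \ldots, \tilde{\eta}_{Nt})'$ is independent from the factors $\tilde{r}_{pt}$ and has a joint normal distribution3 with mean zero and nonsingular variance–covariance matrix $\Sigma$ and is iid over $t$. Define
$$\tilde{\Omega}^{*} \equiv \frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt} \tilde{r}_{pt}', \tag{3}$$
$$\hat{\Sigma} \equiv \frac{1}{T - L - 1} \sum_{t=1}^{T} \hat{\eta}_t \hat{\eta}_t'. \tag{4}$$
Then, the F statistic
$$\tilde{W}^{*} \equiv \frac{T (T - N - L)}{N (T - L - 1)} \left( 1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p \right) \hat{\delta}_0' \hat{\Sigma}^{-1} \hat{\delta}_0 \tag{5}$$
follows the $F_{N, T-N-L}$ distribution under $H_0$.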
From a purely statistical perspective, Lemma 1 is all we need for testing the implication (2) of mean-variance efficiency, which is just the usual joint F test of zero intercepts in a linear regression system.4
The economic interpretation of the GRS test, however, is better understood via another implication of mean-variance efficiency: $\theta_{N+L}$, the Sharpe ratio of the optimal portfolio consisting of the $L$ portfolios and the $N$ test assets, equals $\theta_p$, the Sharpe ratio of the $L$ portfolios alone (Gibbons et al. 1989).
We consider a general portfolio optimization problem that yields the Sharpe ratio. Let $\tilde{r}$ denote a vector of excess returns of $K$ assets ($K \geq 1$), and let $\mu_{\tilde{r}}$ and $\Omega_{\tilde{r}}$ be their ex ante mean vector and variance–covariance matrix, respectively. Let $m$ be the target mean excess return and $\omega$ be a vector of $K$ asset weights. The optimal portfolio weights $\omega^{*}$ solve
$$\min_{\omega} \ \omega' \Omega_{\tilde{r}} \omega, \quad \text{subject to} \quad \omega' \mu_{\tilde{r}} = m.$$
The square of the Sharpe ratio of the optimal portfolio composed of these $K$ assets, therefore, is
$$\theta^2 \equiv \left( \frac{m}{\sqrt{\omega^{*\prime} \Omega_{\tilde{r}} \omega^{*}}} \right)^2 = \mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}},$$
in which the variance–covariance matrix $\Omega_{\tilde{r}}$ of the $K$ assets plays a central role. Applying this general result twice, we obtain that
$$W \equiv \left( \frac{\sqrt{1 + \theta_{N+L}^2}}{\sqrt{1 + \theta_p^2}} \right)^2 - 1 = \left( 1 + \mu_{\tilde{r}_p}' \Omega^{-1} \mu_{\tilde{r}_p} \right)^{-1} \delta_0' \Sigma^{-1} \delta_0, \tag{6}$$
where $\Omega$ is the variance–covariance matrix of the $L$ portfolio excess returns $\tilde{r}_{pt}$ and $\Sigma$ is that of the disturbances $\tilde{\eta}_t$. So, $W = 0$ if the $L$ portfolios are efficient, and this is the basis of the GRS test in asset pricing theory.
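Theorem 1 (Generalized GRS statistic). Define
$$\tilde{\Omega} \equiv \frac{1}{T} \sum_{t=1}^{T} \left( \tilde{r}_{pt} - \bar{r}_p \right) \left( \tilde{r}_{pt} - \bar{r}_p \right)' = \frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt} \tilde{r}_{pt}' - \bar{r}_p \bar{r}_p' \tag{7}$$
and the generalized GRS statistic
$$\tilde{W} \equiv \frac{T (T - N - L)}{N (T - L - 1)} \left( 1 + \bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p \right)^{-1} \hat{\delta}_0' \hat{\Sigma}^{-1} \hat{\delta}_0. \tag{8}$$
Then, $\tilde{W} = \tilde{W}^{*}$, where $\tilde{W}^{*}$ is the F statistic of Lemma 1 in Equation (5), and therefore under the conditions of Lemma 1, $\tilde{W}$ follows the $F_{N, T-N-L}$ distribution under $H_0$.
Theorem 1 connects the statistical perspective to the economic interpretation of the GRS test, because $\tilde{W}$ equals the F statistic $\tilde{W}^{*}$ and can be regarded as a sample analog of $W$: replacing $\Omega$ in Equation (6) with its maximum likelihood estimator (MLE) $\tilde{\Omega}$, $\Sigma$ with its unbiased estimator $\hat{\Sigma}$, $\mu_{\tilde{r}_p}$ with $\bar{r}_p$, and $\delta_0$ with $\hat{\delta}_0$, and pre-multiplying by the ratio $\frac{T (T - N - L)}{N (T - L - 1)}$, yields $\tilde{W}$ in Equation (8).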
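As an illustration of Lemma 1 and Theorem 1, the following Python sketch computes both the joint F statistic of Equation (5), built from the uncentered second-moment matrix of the factors in Equation (3), and the generalized GRS statistic of Equation (8), built from the centered matrix in Equation (7), so that their equality can be checked numerically. It is a minimal sketch rather than the authors' code: the function name `grs_statistics`, the array layout (rows are periods), and the simulated data are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def grs_statistics(test_returns, factor_returns):
    """Return (w_star, w_tilde) for a T x N matrix of test-asset excess
    returns and a T x L matrix of factor excess returns.

    w_star  : joint F statistic of Equation (5), uncentered factor moments.
    w_tilde : generalized GRS statistic of Equation (8), centered moments.
    """
    R = np.asarray(test_returns, dtype=float)
    F = np.asarray(factor_returns, dtype=float)
    T, N = R.shape
    L = F.shape[1]
    if T - N - L <= 0:
        raise ValueError("need T > N + L for the F distribution to exist")

    # OLS of every test asset on an intercept and the L factors, model (1).
    X = np.column_stack([np.ones(T), F])
    coef, *_ = np.linalg.lstsq(X, R, rcond=None)  # shape (L + 1, N)
    delta0_hat = coef[0]                          # N estimated intercepts
    resid = R - X @ coef

    # Unbiased residual covariance, Equation (4): divisor T - L - 1.
    sigma_hat = resid.T @ resid / (T - L - 1)

    r_bar = F.mean(axis=0)
    omega_star = F.T @ F / T                              # Equation (3)
    omega_tilde = omega_star - np.outer(r_bar, r_bar)     # Equation (7)

    scale = T * (T - N - L) / (N * (T - L - 1))
    quad = delta0_hat @ np.linalg.solve(sigma_hat, delta0_hat)

    w_star = scale * (1.0 - r_bar @ np.linalg.solve(omega_star, r_bar)) * quad
    w_tilde = scale * quad / (1.0 + r_bar @ np.linalg.solve(omega_tilde, r_bar))
    return w_star, w_tilde


if __name__ == "__main__":
    # Simulated data, for illustration only (not the paper's data).
    rng = np.random.default_rng(0)
    T, N, L = 60, 10, 3
    factors = 0.5 + rng.standard_normal((T, L))
    betas = rng.standard_normal((L, N))
    tests = factors @ betas + rng.standard_normal((T, N))
    w_star, w_tilde = grs_statistics(tests, factors)
    print(f"W* = {w_star:.6f}, W = {w_tilde:.6f}")
```

The two returned values should agree up to floating-point error, which is the content of Theorem 1; the divisors $T$ in the factor moments and $T - L - 1$ in the residual covariance are the ones the text identifies as correct.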

Common Mistakes and Consequences

$\tilde{W}$ equals the original GRS statistic when $L = 1$. For the $L > 1$ case, however, Gibbons et al. (1989, p. 1146) gave a statistic $\hat{W}$, almost identical to $\tilde{W}$, but instead of $\tilde{\Omega}$, they prescribed the “sample variance-covariance matrix” $\hat{\Omega}$ without giving its explicit formula. Since the sample variance–covariance matrix customarily entails a d.f. adjustment, i.e.,
$$\hat{\Omega} \equiv \frac{1}{T - 1} \sum_{t=1}^{T} \left( \tilde{r}_{pt} - \bar{r}_p \right) \left( \tilde{r}_{pt} - \bar{r}_p \right)' = \frac{T}{T - 1} \tilde{\Omega}, \tag{9}$$
this would cause $\hat{W}$ to differ from $\tilde{W}$, and therefore Theorem 1 implies that $\hat{W}$ does not follow the $F_{N, T-N-L}$ distribution as prescribed.
This incorrect GRS statistic $\hat{W}$ inflicts two consequences on empirical asset pricing studies. First, it over-rejects mean-variance efficiency when gauged against the $F_{N, T-N-L}$ distribution, because the ratio between $\hat{W}$ and $\tilde{W}$ is always larger than 1. Second, it misranks competing asset pricing models, because the ratio between $\hat{W}$ and $\tilde{W}$ tends to be disproportionally larger for models with more factors.
The significance of the GRS statistic in recent financial studies, as Fama and French (2015) advocate, resides in the ranking of competing asset pricing models, rather than testing them. Using even the correct GRS statistic to rank models, despite its portfolio optimization interpretation, is subject to a familiar critique akin to the use of $R^2$ for linear regression model comparison. Instead, the p-values associated with $\tilde{W}$ in the respective $F_{N, T-N-L}$ distributions are a statistically sound metric for this purpose, as they internalize the difference in the second d.f.
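To make the p-value ranking concrete, the sketch below converts a GRS statistic into its upper-tail probability under the $F_{N, T-N-L}$ distribution and sorts models by that probability. It is only a sketch under stated assumptions: the function names, the convention that a larger p-value ranks better, and the numbers in the usage example are hypothetical and not taken from the paper; SciPy's `scipy.stats.f.sf` is used for the F survival function.

```python
from scipy.stats import f as f_dist


def grs_p_value(w_tilde, n_assets, n_periods, n_factors):
    """Upper-tail p-value of a GRS statistic under F(N, T - N - L)."""
    dfd = n_periods - n_assets - n_factors
    if dfd <= 0:
        raise ValueError("need T > N + L")
    return float(f_dist.sf(w_tilde, n_assets, dfd))


def rank_by_p_value(models, n_assets, n_periods):
    """Rank models by p-value, largest first (assumed ranking convention).

    `models` maps a model name to a pair (GRS statistic, number of factors).
    """
    scored = [
        (name, grs_p_value(w, n_assets, n_periods, n_factors))
        for name, (w, n_factors) in models.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical statistics for three models on the same 25 test assets.
    hypothetical = {
        "one-factor": (2.10, 1),
        "three-factor": (1.95, 3),
        "five-factor": (1.90, 5),
    }
    for name, p in rank_by_p_value(hypothetical, n_assets=25, n_periods=60):
        print(f"{name:>12s}: p = {p:.4f}")
```

Because the second d.f. $T - N - L$ shrinks as $L$ grows, the same raw statistic maps to a different p-value for models of different size, which is how this metric accounts for model size.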
The implementations of the GRS test found in popular user-defined software packages, such as GRS.test in R and grstest and grstest2 in Stata, not only use $\hat{\Omega}$ when computing the GRS statistic, but also fold in additional errors. Results for the R formula are labeled with $\breve{W}$ in the rest of this paper.
The asymptotic $\chi^2$ test is frequently recommended as an alternative to the F test. One might think that when $T$ is large, the d.f. issue we point out here can be circumvented by using the asymptotic $\chi^2$ test. Unfortunately, we find that the commonly used $\chi^2$ test statistics also over-reject for any sample size, especially if the number of test assets $N$ or the number of factors $L$ is large. In addition, they erroneously favor smaller models to an extent worse than $\hat{W}$.

3. Empirical Results

We use portfolios of test assets borrowed from Fama and French (2015, 2016) to show that over-rejection and misranking of $\hat{W}$, $\breve{W}$ and $\chi^2$ relative to $\tilde{W}$ is empirically significant, even remarkable in many cases. To summarize our empirical findings (which are detailed in Appendix B and below), the $\breve{W}$ misapplication of the GRS statistic and the asymptotic $\chi^2$ both result in much higher model rejection rates than the correct GRS statistic, most notably for the asymptotic $\chi^2$ test, even with 50 years of monthly data, and also result in scrambled model rankings, with 40 or even 50 years of data, most notably for the $\breve{W}$ version of the GRS statistic. While the F test is asymptotically equivalent to $\chi^2$, typical sample sizes available in financial markets research are not large enough to make this approximation innocuous. The exact F test construction is also the most conservative test, resulting in less over-rejection of the null hypothesis when the null is correct, even with highly non-normal return data, by measure of the bootstrap resampling experiments we perform.
In Table 1, we highlight the over-rejection issue, with five-year windows. This span of data shows serious over-rejection of asset pricing models from the application of the alternative formulations of the GRS test. Results for the largest number of test assets we considered, 32, are displayed in the first three rows of the table, the results for cases with 25 test assets follow in rows four through thirteen, the 17 test assets of the industry portfolios follow in row fourteen, and the remaining six rows present results for sets of 10 test assets. In Table 1, we use a total of 53 five-year overlapping windows starting from 1963 and consider six different asset pricing models, meaning that we have 318 cases for which an asset pricing model might be rejected, for each of the 19 sets of test assets.
The asymptotic $\chi^2$ statistic fares the worst relative to the correct GRS statistic among the alternatives; it over-rejects dramatically relative to the correct GRS statistic, 50% of the time on average, faring worse when we consider more test assets. $\breve{W}$ and $\hat{W}$ over-reject relative to the correct GRS statistic about 10% and 1% of the time on average. $\hat{W}$ is unstable across test assets, however, with a few cases displaying close to 4% over-rejection, and some with no over-rejection.
In Table 2, we consider the model misranking among six models for each of the 53 five-year windows of Table 1. Misranking of models shows similar problems for the $\chi^2$ statistic as we saw for test rejections, with over 40% of the cases displaying misranking of at least one asset pricing model. The $\breve{W}$ statistic displays much worse performance than test rejections, with close to 60% of the cases displaying misranking, and we see worse performance even for $\hat{W}$, with close to 3% of the cases misranked.
While the typical Fama and French paper uses 40 or 50 years of data, it is also true that much empirical work uses far less data. Gibbons et al. (1989) noted that issues of stationarity can reasonably constrain the length of a time series used, so that “it is not uncommon to see published work where T is around 60”; Affleck-Graves and McDonald (1989) limited their analysis and simulations to 60-month periods; Ferson and Foerster (1994) studied 60, 120, and 720 monthly observations in their simulation; Rouwenhorst (1999) used five years in subsample analysis; and among recent works that exploited as little as four or five years of data are Belimam et al. (2018) and Qin (2019). Leite et al. (2018) used as few as 98 months of data, Lewellen et al. (2010) used 168 observations of quarterly data, Choi et al. (2020) performed subsample stability tests using eight years of monthly data, and many studies of emerging economy markets have used 10 to 15 years of monthly data. See, for instance, Alhomaidi et al. (2019), Alshammari and Goto (2022), Merdad et al. (2015), and Sha and Gao (2019).
One takeaway from these papers is that many situations involving specialized data (like Sha and Gao (2019) and their exploration of mutual fund returns in China) or subsample robustness checks (like Baek and Bilson 2015) are necessarily constrained to shorter samples than fifty or even twenty years, so that the bias from an incorrectly calculated GRS statistic becomes large.

4. Concluding Remarks

The GRS statistic of Gibbons et al. (1989), developed to provide a test of the ex ante mean-variance efficiency of portfolios and more recently exploited to rank competing models, can be easily implemented incorrectly due to an ambiguity in the presentation of the multivariate form of the test in Gibbons et al. (1989). This presentation suggests a degree-of-freedom-adjusted unbiased variance–covariance matrix estimator $\hat{\Omega}$ of the portfolio excess returns used in the small-sample GRS F test. Indeed, the portfolio optimization problem naturally has us estimate the variance–covariance matrix $\Omega$ of the portfolio excess returns, but the equivalence of the GRS statistic to the F test relies on $\tilde{\Omega}$, which arises in the projection matrix of the test asset returns on the column space of the asset pricing factors, not as a variance–covariance matrix. Paradoxically, this error is clearly visible when turning a blind eye to the economic interpretation of the GRS statistic and taking a purely statistical approach. Although an unbiased estimator $\hat{\Omega}$ appears intuitive in the context of portfolio optimization, it does not yield a correct small-sample exact F test.
Further complicating this ambiguity, Cochrane (2005) presented the GRS statistic omitting a degree-of-freedom adjustment in the calculation of the variance–covariance matrix of the regression residuals, $\Sigma$.5 Perhaps an outcome of Cochrane (2005), there is an implementation of the GRS statistic in an R package, which we label $\breve{W}$, that omits the d.f. adjustment when estimating $\Sigma$ but fails to pre-multiply the correct ratio.6 It has also become common in the field to ignore the F distribution completely and employ an asymptotic $\chi^2$ approximation in place of the F test.
The main result for both the asymptotic $\chi^2$ and $\breve{W}$ implementations is much higher model rejection rates than the correct GRS statistic, most notably for the asymptotic $\chi^2$ test, and they also result in scrambled model rankings. Further, the F distribution is inherently pertinent to small-sample exact tests, where one should make a point of computing the d.f. correctly. For this reason, we recommend the exact F test construction with its attendant F distribution, for both testing and ranking of asset pricing models. The exact F test construction is also the most conservative test, resulting in less over-rejection of the null hypothesis when the null is correct, even with highly non-normal return data.
Another result of this research inquiry is that we provided the first proof of the F-distribution of this test for the general multi-factor case and we recommended a new ranking method, making use of the p-value rather than the raw GRS statistic value. Although ranking by the values of the GRS statistic has a desirable economic intuition attached to it, the applied researcher taking advantage of this must recognize that this ranking is statistically as unsound as favoring a regression model with the highest $R^2$.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jrfm17040168/s1.

Author Contributions

Conceptualization, methodology, review, editing, software, validation, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization, supervision, project administration, funding acquisition were all shared equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Social Sciences and Humanities Research Council of Canada, grant number 510991.

Data Availability Statement

Data used is available through Ken French’s data library.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs and Details for the Results in Section 2

The following two lemmas are used in the proof of Lemma 1.
Lemma A1. Suppose a random vector $Y$ and a random matrix $W$ satisfy: (i) $Y \sim N_d(\mu, \Sigma)$, the $d$ dimensional normal distribution; (ii) $W \sim W_d(f, \Sigma)$, the $d \times d$ dimensional Wishart distribution; and (iii) $Y \perp W$. Then, given Hotelling’s T-squared defined as $T^2 \equiv f (Y - \mu)' W^{-1} (Y - \mu)$, we have $F \equiv \frac{f - d + 1}{f d} T^2 \sim F_{d, f - d + 1}$.
Lemma A2 (Sherman–Morrison formula). Suppose $A$ is an invertible $L \times L$ matrix and $u$ and $v$ are $L \times 1$ vectors. If $A + u v'$ is invertible, then $(A + u v')^{-1} = A^{-1} - \frac{A^{-1} u v' A^{-1}}{1 + v' A^{-1} u}$.
Lemma A1 is a standard result in multivariate statistics (see, e.g., Anderson 2003, Theorem 5.2.2), and Lemma A2 is a standard result in linear algebra (see, e.g., Bartlett 1951, p. 107).
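Proof of Lemma 1.
The proof proceeds in three steps.
Step 1. In this step, we show that under the null hypothesis (2),
$$\sqrt{T \left( 1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p \right)} \ \hat{\delta}_0 \sim N_N(0, \Sigma), \tag{A1}$$
where $\tilde{\Omega}^{*}$ is defined in Equation (3).
Let $\mathbf{1}_T$ denote a $T \times 1$ vector with every element being one, and let $I_T$ denote the $T \times T$ identity matrix. Define $P_{p,T} = \tilde{r}_p (\tilde{r}_p' \tilde{r}_p)^{-1} \tilde{r}_p'$ as the $T \times T$ projection matrix (onto the column space of $\tilde{r}_p$) and its $T \times T$ complement matrix $Q_{p,T} = I_T - P_{p,T}$. It is a standard result (e.g., Hayashi 2000, pp. 18–19) that the OLS estimator of $\delta_{i0}$ satisfies $\hat{\delta}_{i0} - \delta_{i0} = (\mathbf{1}_T' Q_{p,T} \mathbf{1}_T)^{-1} \mathbf{1}_T' Q_{p,T} \tilde{\eta}_i$, where $\tilde{\eta}_i \equiv (\tilde{\eta}_{i1}, \ldots, \tilde{\eta}_{iT})'$ for $i = 1, \ldots, N$. Since $\tilde{\eta}_{it}$ has a normal distribution, letting $\sigma_{ii}^2$ denote the $(i, i)$ entry of $\Sigma$, it is a standard result (e.g., Hayashi 2000, Sec. 1.3) that $\sqrt{\mathbf{1}_T' Q_{p,T} \mathbf{1}_T} \, (\hat{\delta}_{i0} - \delta_{i0}) \sim N_1(0, \sigma_{ii}^2)$. It then only takes some algebra to show that
$$\sqrt{\mathbf{1}_T' Q_{p,T} \mathbf{1}_T} \, (\hat{\delta}_0 - \delta_0) \sim N_N(0, \Sigma). \tag{A2}$$
Now, let us take a closer look at $\mathbf{1}_T' Q_{p,T} \mathbf{1}_T$:
$$\begin{aligned}
\mathbf{1}_T' Q_{p,T} \mathbf{1}_T &= \mathbf{1}_T' \mathbf{1}_T - \mathbf{1}_T' \tilde{r}_p (\tilde{r}_p' \tilde{r}_p)^{-1} \tilde{r}_p' \mathbf{1}_T \\
&= T - \left( \sum_{t=1}^{T} \tilde{r}_{pt} \right)' \left( \sum_{t=1}^{T} \tilde{r}_{pt} \tilde{r}_{pt}' \right)^{-1} \left( \sum_{t=1}^{T} \tilde{r}_{pt} \right) \\
&= T - T \left( \frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt} \right)' \left( \frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt} \tilde{r}_{pt}' \right)^{-1} \left( \frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt} \right) \\
&= T \left( 1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p \right).
\end{aligned} \tag{A3}$$
Recall that $\delta_0 = 0$ under the null hypothesis (2), so Equations (A2) and (A3) together imply (A1), the claim of Step 1.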
Step 2. In this step, we will show that $\hat{\delta}_0 \perp \hat{\Sigma}$ and
$$(T - L - 1) \hat{\Sigma} \sim W_N(T - L - 1, \Sigma).$$
Let $X = [\mathbf{1}_T, \tilde{r}_p]$ denote the $T \times (L + 1)$ design matrix of Equation (1), where $\mathbf{1}_T$ is the $T \times 1$ vector of ones. Define the $T \times T$ projection matrix $P = X (X' X)^{-1} X'$ and its complement $Q = I_T - P$. Let $\tilde{\eta} = [\tilde{\eta}_1, \ldots, \tilde{\eta}_N]$ denote the $T \times N$ matrix of all disturbances in Equation (1). Then, by the standard results of the OLS estimators with normal disturbances (e.g., Hayashi 2000, Sec. 1.3), we have $\hat{\delta}_0 \perp \hat{\Sigma}$ and $(T - L - 1) \hat{\Sigma} = \sum_{t=1}^{T} \hat{\eta}_t \hat{\eta}_t' = \tilde{\eta}' Q \tilde{\eta} = \tilde{\eta}' U D U' \tilde{\eta}$, where the last equality holds by the singular value decomposition of $Q$, in which $U$ is a $T \times T$ unitary matrix, and $D$ is a $T \times T$ diagonal matrix with $T - L - 1$ diagonal entries being ones and the rest being zeros. Since we assume that the rows of $\tilde{\eta}$ are mutually independent and follow the $N_N(0, \Sigma)$ distribution, the rows of $U' \tilde{\eta}$ are also mutually independent and follow the $N_N(0, \Sigma)$ distribution. This further implies that $\tilde{\eta}' U D U' \tilde{\eta}$ has the same distribution as the sum $S = \sum_{j=1}^{T-L-1} \xi_j \xi_j'$, where the $\xi_j$ are mutually independent and $\xi_j \sim N_N(0, \Sigma)$ ($j = 1, \ldots, T - L - 1$). By construction, the distribution of $S$ is the Wishart distribution $W_N(T - L - 1, \Sigma)$. This proves the claim of Step 2.
Step 3. In this step, we apply Lemma A1 to the results of Steps 1 and 2. After some simple algebra, we obtain $\tilde{W}^{*} \sim F_{N, T-N-L}$, with $\tilde{W}^{*}$ the F statistic defined in Equation (5). This completes the proof of Lemma 1. □
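The algebra behind Step 3 can be spelled out as follows. This is a sketch of the substitution into Lemma A1, not a quotation from the original proof: the shorthand $c$, $M$, $V$ and $\mathcal{T}^2$ (Hotelling's statistic, written $T^2$ in Lemma A1) is introduced here only for illustration, with $M$ standing for the uncentered second-moment matrix of Equation (3).

```latex
% Lemma A1 applied with d = N and f = T - L - 1, so that f - d + 1 = T - N - L.
\begin{align*}
c &\equiv 1 - \bar{r}_p' M^{-1} \bar{r}_p, \qquad
Y \equiv \sqrt{T c}\,\hat{\delta}_0 \sim N_N(0, \Sigma), \qquad
V \equiv (T - L - 1)\,\hat{\Sigma} \sim W_N(T - L - 1, \Sigma), \qquad Y \perp V, \\
\mathcal{T}^2 &= f\, Y' V^{-1} Y
  = (T - L - 1)\, T c\; \hat{\delta}_0' \left[ (T - L - 1)\,\hat{\Sigma} \right]^{-1} \hat{\delta}_0
  = T c\; \hat{\delta}_0' \hat{\Sigma}^{-1} \hat{\delta}_0, \\
\frac{f - d + 1}{f d}\, \mathcal{T}^2
  &= \frac{T (T - N - L)}{N (T - L - 1)}\; c\; \hat{\delta}_0' \hat{\Sigma}^{-1} \hat{\delta}_0
  \sim F_{N,\, T - N - L}.
\end{align*}
```

The last line is the F statistic of Equation (5), which shows where both degrees of freedom of the F distribution come from.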
Derivation of Equation (6).
Gibbons et al. (1989) derive this, in Section 6, for the $L = 1$ case, and here we provide the derivation for the general $L \geq 1$ case. We start by considering a general portfolio optimization problem that yields the Sharpe ratio (mean excess return to a portfolio per unit of volatility, i.e., standard deviation) of the optimal portfolio consisting of given assets. Let $\tilde{r}$ denote a vector of excess returns of $K$ assets ($K \geq 1$), and let $\mu_{\tilde{r}}$ and $\Omega_{\tilde{r}}$ be their ex ante mean vector and variance–covariance matrix, respectively. Let $m$ be the target mean excess return and $\omega$ be a vector of $K$ asset weights. The optimal portfolio weights $\omega^{*}$ solve
$$\min_{\omega} \ \omega' \Omega_{\tilde{r}} \omega, \quad \text{subject to} \quad \omega' \mu_{\tilde{r}} = m.$$
The first-order conditions for this problem are $\omega^{*} = \varphi \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}}$ and $\varphi = m / (\mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}})$, where $\varphi$ is the Lagrange multiplier. The squared Sharpe ratio of the optimal portfolio consisting of these $K$ assets is, therefore,
$$\theta^2 \equiv \left( \frac{m}{\sqrt{\omega^{*\prime} \Omega_{\tilde{r}} \omega^{*}}} \right)^2 = \mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}},$$
in which the variance–covariance matrix $\Omega_{\tilde{r}}$ plays a central role.
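For completeness, the step from the first-order conditions to the squared Sharpe ratio is a direct substitution, sketched below with $\omega^{*}$ denoting the optimal weights and $\varphi$ the Lagrange multiplier from the preceding paragraph.

```latex
\begin{align*}
\omega^{*\prime} \Omega_{\tilde{r}} \omega^{*}
  &= \varphi^2\, \mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \Omega_{\tilde{r}} \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}}
   = \varphi^2\, \mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}}
   = \frac{m^2}{\mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}}}, \\
\theta^2
  &= \frac{m^2}{\omega^{*\prime} \Omega_{\tilde{r}} \omega^{*}}
   = \mu_{\tilde{r}}' \Omega_{\tilde{r}}^{-1} \mu_{\tilde{r}}.
\end{align*}
```

The target mean $m$ cancels, so the squared Sharpe ratio depends only on the mean vector and the variance–covariance matrix of the constituent assets.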
Applying this general result, we know that when the constituent assets are the $L$ portfolios, the squared Sharpe ratio is
$$\theta_p^2 = \mu_{\tilde{r}_p}' \Omega^{-1} \mu_{\tilde{r}_p}. \tag{A5}$$
When the constituent assets include both the $N$ test assets and the $L$ portfolios, the squared Sharpe ratio is $\theta_{N+L}^2 = \mu_{\tilde{r}_{N+L}}' \Omega_{\tilde{r}_{N+L}}^{-1} \mu_{\tilde{r}_{N+L}}$, where $\mu_{\tilde{r}_{N+L}} \equiv (\mu_{\tilde{r}_N}', \mu_{\tilde{r}_p}')'$,
$$\Omega_{\tilde{r}_{N+L}} \equiv \begin{pmatrix} \delta \Omega \delta' + \Sigma & \delta \Omega \\ \Omega \delta' & \Omega \end{pmatrix}, \tag{A6}$$
and $\delta \equiv (\delta_1, \ldots, \delta_N)'$ with $\delta_i$ being the slope coefficient in model (1). Equation (A6) holds because we can rewrite the variance–covariance matrix of the $N$ test assets and their covariance matrix with the $L$ portfolios using $\Omega$, $\Sigma$ and $\delta$ (in the same way as $\hat{V}$ on p. 1143 and eq. (24) in Gibbons et al. 1989). Applying the inverse formula for a block matrix and noticing the relationship between $\mu_{\tilde{r}_N}$ and $\mu_{\tilde{r}_p}$ implied by model (1), we obtain
$$\theta_{N+L}^2 = \theta_p^2 + \delta_0' \Sigma^{-1} \delta_0, \tag{A7}$$
which is essentially the same as Equations (22) and (23) in MacKinlay and Richardson (1991). This, together with Equation (A5) and simple algebra, further implies Equation (6).
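The block-matrix step can be sketched as follows, using the standard partitioned form of a quadratic form; the block labels $A$, $B$, $D$ and the Schur complement $S$ are shorthand introduced here for illustration, and $\mu_N$, $\mu_p$ abbreviate $\mu_{\tilde{r}_N}$, $\mu_{\tilde{r}_p}$.

```latex
% Blocks of the joint covariance matrix: A = \delta\Omega\delta' + \Sigma, B = \delta\Omega, D = \Omega.
\begin{align*}
S &\equiv A - B D^{-1} B'
   = \delta \Omega \delta' + \Sigma - \delta \Omega\, \Omega^{-1}\, \Omega \delta'
   = \Sigma, \\
\mu_N &= \delta_0 + \delta \mu_p
   \quad \text{(taking expectations in model (1))}, \\
\theta_{N+L}^2
  &= \mu_p' D^{-1} \mu_p
   + \left( \mu_N - B D^{-1} \mu_p \right)' S^{-1} \left( \mu_N - B D^{-1} \mu_p \right) \\
  &= \theta_p^2 + \left( \mu_N - \delta \mu_p \right)' \Sigma^{-1} \left( \mu_N - \delta \mu_p \right)
   = \theta_p^2 + \delta_0' \Sigma^{-1} \delta_0 .
\end{align*}
```

Dividing $\theta_{N+L}^2 - \theta_p^2$ by $1 + \theta_p^2$ then gives the right-hand side of Equation (6).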
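Proof of Theorem 1.
Based on Lemma 1, we only need to show that $\tilde{W}^{*}$ defined in Equation (5) equals $\tilde{W}$ in Equation (8). By comparing Equations (3) and (7), we see that $\tilde{\Omega} = \tilde{\Omega}^{*} - \bar{r}_p \bar{r}_p'$, so it suffices to show that
$$1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p = \left( 1 + \bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p \right)^{-1} = \left( 1 + \bar{r}_p' \left( \tilde{\Omega}^{*} - \bar{r}_p \bar{r}_p' \right)^{-1} \bar{r}_p \right)^{-1}. \tag{A8}$$
Applying Lemma A2 with $A = \tilde{\Omega}^{*}$, $u = -\bar{r}_p$ and $v = \bar{r}_p$, we get $\left( \tilde{\Omega}^{*} - \bar{r}_p \bar{r}_p' \right)^{-1} = (\tilde{\Omega}^{*})^{-1} + \frac{(\tilde{\Omega}^{*})^{-1} \bar{r}_p \bar{r}_p' (\tilde{\Omega}^{*})^{-1}}{1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p}$, which implies that $1 + \bar{r}_p' \left( \tilde{\Omega}^{*} - \bar{r}_p \bar{r}_p' \right)^{-1} \bar{r}_p = 1 + \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p + \frac{\left( \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p \right)^2}{1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p} = \left( 1 - \bar{r}_p' (\tilde{\Omega}^{*})^{-1} \bar{r}_p \right)^{-1}$, which immediately implies Equation (A8). This completes the proof of Theorem 1. □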
Original GRS statistic when $L = 1$. When $L = 1$, $\tilde{\Omega}$ equals $\frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt}^2 - \bar{r}_p^2 = s_p^2$, the sample variance of $\tilde{r}_{pt}$ without d.f. adjustment, as defined by Gibbons et al. (1989, p. 1124). So, $\tilde{W}$ equals the original GRS statistic when $L = 1$.
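As a worked special case, the identity behind Theorem 1 can be verified by hand when $L = 1$, where every matrix is a scalar. The lines below are an illustrative sketch that uses only the definitions in Equations (3), (7) and (8).

```latex
\begin{align*}
\frac{1}{T} \sum_{t=1}^{T} \tilde{r}_{pt}^2 &= s_p^2 + \bar{r}_p^2
  \quad \text{(uncentered second moment, Equation (3) with } L = 1\text{)}, \\
1 - \frac{\bar{r}_p^2}{s_p^2 + \bar{r}_p^2}
  &= \frac{s_p^2}{s_p^2 + \bar{r}_p^2}
   = \left( 1 + \frac{\bar{r}_p^2}{s_p^2} \right)^{-1}, \\
\tilde{W} &= \frac{T (T - N - 1)}{N (T - 2)} \left( 1 + \frac{\bar{r}_p^2}{s_p^2} \right)^{-1}
  \hat{\delta}_0' \hat{\Sigma}^{-1} \hat{\delta}_0 .
\end{align*}
```

Here $\bar{r}_p / s_p$ is the sample Sharpe ratio of the single factor, computed with the variance that divides by $T$ rather than $T - 1$.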
Over-rejection of $\hat{W}$. Take the ratio between $\hat{W}$ and $\tilde{W}$; then, by the relationship between $\hat{\Omega}$ and $\tilde{\Omega}$ in Equation (9), we get
$$\frac{\hat{W}}{\tilde{W}} = \frac{1 + \bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p}{1 + \frac{T - 1}{T} \bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p}, \tag{A9}$$
which measures how much the incorrect formula inflates the GRS statistic. Define a function $g(x) = \frac{1 + x}{1 + \frac{T - 1}{T} x}$. Since the first-order derivative of this function is $g'(x) = \frac{1 / T}{\left( 1 + \frac{T - 1}{T} x \right)^2} > 0$, we know that $g(x)$ is a monotonically increasing function of $x$. This, combined with the facts that $g(0) = 1$ and $\bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p > 0$, implies that $\hat{W} / \tilde{W} > 1$. As a result, when $\hat{W}$ is gauged against the $F_{N, T-N-L}$, the distribution of $\tilde{W}$, it will over-reject the null hypothesis of mean-variance efficiency of the $L$ portfolios.
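The size of this inflation is easy to tabulate. The short Python sketch below evaluates the ratio in Equation (A9) as a function of $x = \bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p$ and $T$; the function name `inflation_ratio` and the grid of $x$ values are illustrative assumptions, not values from the paper.

```python
def inflation_ratio(x: float, T: int) -> float:
    """Ratio W_hat / W_tilde from Equation (A9).

    x is the quadratic form r_bar' Omega_tilde^{-1} r_bar (non-negative),
    T is the number of time-series observations.
    """
    if T < 2:
        raise ValueError("T must be at least 2")
    if x < 0:
        raise ValueError("x is a quadratic form and cannot be negative")
    return (1.0 + x) / (1.0 + (T - 1) / T * x)


if __name__ == "__main__":
    # Illustrative grid: the ratio is 1 at x = 0, increases with x,
    # and stays below its limit T / (T - 1) as x grows.
    for T in (60, 120, 600):
        row = ", ".join(
            f"g({x})={inflation_ratio(x, T):.5f}" for x in (0.0, 0.05, 0.25, 1.0)
        )
        print(f"T={T:4d}: {row}  (limit {T / (T - 1):.5f})")
```

The ratio exceeds one whenever $x > 0$ and shrinks toward one as $T$ grows, consistent with the over-rejection of $\hat{W}$ being most visible in short samples.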
Model misranking by $\hat{W}$. Some back-of-the-envelope calculation shows that $\bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p$ tends to be larger for models with more factors. To see this, let $\mu_{\tilde{r}_p}$ denote the mean vector of $\tilde{r}_{pt}$ as above; then, by the central limit theorem, we have $\sqrt{T} (\bar{r}_p - \mu_{\tilde{r}_p}) \xrightarrow{d} N(0, \Omega)$; and by the law of large numbers, we have $\tilde{\Omega} \xrightarrow{p} \Omega$. These two results imply that $T (\bar{r}_p - \mu_{\tilde{r}_p})' \tilde{\Omega}^{-1} (\bar{r}_p - \mu_{\tilde{r}_p}) \xrightarrow{d} \chi_L^2$. Note that $E(\chi_L^2) = L$, so this in turn implies that for fixed $T$, the mean of $\bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p$ is approximately $E(\bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p) \approx \frac{L}{T} + \mu_{\tilde{r}_p}' \Omega^{-1} \mu_{\tilde{r}_p}$, where $\mu_{\tilde{r}_p}' \Omega^{-1} \mu_{\tilde{r}_p}$ is expected to increase with $L$ since the dimensions of both $\mu_{\tilde{r}_p}$ and $\Omega$ increase with $L$. As a result, the random variable $\bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p$ tends to increase with $L$ on average.7
Combined with Equation (A9), this means that the ratio $\hat{W} / \tilde{W}$ tends to be larger for larger models; that is, smaller models tend to be disproportionally favored if the incorrect GRS statistic $\hat{W}$ is used to rank models, compared to the ranking based on the correct GRS statistic $\tilde{W}$.
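The $L / T$ term in this approximation can be illustrated with a small Monte Carlo experiment. The sketch below is an assumption-laden illustration rather than part of the paper: it draws factors with zero mean and identity covariance (so the term $\mu_{\tilde{r}_p}' \Omega^{-1} \mu_{\tilde{r}_p}$ vanishes), and the function name, seed, and replication count are arbitrary choices; NumPy is assumed to be available.

```python
import numpy as np


def mean_quadratic_form(n_factors, n_periods, reps=2000, seed=0):
    """Monte Carlo mean of r_bar' Omega_tilde^{-1} r_bar.

    Assumption for illustration: factors are iid standard normal, so their
    true mean is zero and only the L / T sampling term remains.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(reps):
        r = rng.standard_normal((n_periods, n_factors))
        r_bar = r.mean(axis=0)
        centered = r - r_bar
        omega_tilde = centered.T @ centered / n_periods  # divisor T, Equation (7)
        total += r_bar @ np.linalg.solve(omega_tilde, r_bar)
    return total / reps


if __name__ == "__main__":
    T = 120
    for L in (1, 3, 6):
        print(f"L={L}: simulated mean={mean_quadratic_form(L, T):.4f}, L/T={L / T:.4f}")
```

Even with zero true means the simulated average rises roughly in proportion to $L$, which is the mechanism that makes the inflation of $\hat{W}$ larger for larger models.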
Additional errors in software packages.
The R package GRS.test computes two different GRS statistics; see Kim (2022). One (function GRS.test) uses the unbiased estimators $\hat{\Omega}$ and $\hat{\Sigma}$ at the same time; the other (function GRS.MLtest) uses the MLEs $\tilde{\Omega}$ and $\tilde{\Sigma} \equiv \frac{1}{T} \sum_{t=1}^{T} \hat{\eta}_t \hat{\eta}_t'$ at the same time. The former is just $\hat{W}$, and we denote the latter as
$$\breve{W} \equiv \frac{T (T - N - L)}{N (T - L - 1)} \left( 1 + \bar{r}_p' \tilde{\Omega}^{-1} \bar{r}_p \right)^{-1} \hat{\delta}_0' \tilde{\Sigma}^{-1} \hat{\delta}_0, \quad \text{and note} \quad \breve{W} = \frac{T}{T - L - 1} \tilde{W}. \tag{A10}$$
These two statistics are both incorrect and clearly stem from the interpretation of $\Omega$ and $\Sigma$ as variance–covariance matrices in Gibbons et al. (1989).
Because of the relationship between $\tilde{W}$ and $\breve{W}$ in Equation (A10), an analysis similar to that for $\hat{W}$ indicates the same over-rejection and misranking problems for $\breve{W}$ as well, even to a worse extent than $\hat{W}$ for typical data in empirical asset pricing studies.
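The inflation factor in Equation (A10) depends only on $T$ and $L$, so it can be tabulated directly. The sketch below does so for a 60-month sample; the function name `breve_inflation` and the choice of $T = 60$ are illustrative assumptions.

```python
def breve_inflation(n_periods: int, n_factors: int) -> float:
    """Ratio W_breve / W_tilde = T / (T - L - 1) from Equation (A10)."""
    denom = n_periods - n_factors - 1
    if denom <= 0:
        raise ValueError("need T > L + 1")
    return n_periods / denom


if __name__ == "__main__":
    T = 60  # five years of monthly data, an illustrative choice
    for L in range(1, 7):
        print(f"L={L}: W_breve / W_tilde = {breve_inflation(T, L):.4f}")
```

Unlike the ratio for $\hat{W}$, which is bounded above by $T / (T - 1)$, this factor grows with every added factor, which is consistent with the text's remark that $\breve{W}$ fares worse than $\hat{W}$ in typical data.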
The Stata packages grstest and grstest2, composed by different contributors,8 make use of $\hat{\Omega}$ and further compound this error by estimating $\Sigma$ as $\frac{1}{T - 1} \sum_{t=1}^{T} \hat{\eta}_t \hat{\eta}_t'$ and pre-multiplying the ratio $\frac{T - N - L}{N}$ instead of $\frac{T (T - N - L)}{N (T - L - 1)}$. The result is a statistic that is difficult to justify and different from all those we discussed above.9
Asymptotic $\chi^2$ test. First note that the distribution of the correct GRS statistic $\tilde{W}$, when multiplied by $N$, converges to the $\chi_N^2$ distribution as $T \to \infty$.10 So, comparing $N \tilde{W}$ with the critical value from the $\chi_N^2$ distribution, rather than comparing $\tilde{W}$ with the $F_{N, T-N-L}$ critical value, is by itself an asymptotically valid $\chi^2$ test. The commonly used $\chi^2$ test statistics in empirical asset pricing research deviate from $N \tilde{W}$, and the deviations are all positive,11 so they will over-reject compared to $N \tilde{W}$.12 We find that the $\chi^2$ statistics misrank models more often even than $\breve{W}$ in our empirical studies, but we do not report the model ranking results for $\chi^2$, because they do not have an intuitive economic interpretation in the model ranking context, and therefore are not commonly used for this purpose. The misrankings of these $\chi^2$ statistics can be easily shown by a similar analysis as for $\hat{W}$ and $\breve{W}$ and are therefore skipped here.
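To see how far apart the two reference distributions are in finite samples, the sketch below compares the 5% critical value for $N \tilde{W}$ implied by the exact $F_{N, T-N-L}$ distribution with the $\chi_N^2$ critical value. It only illustrates the gap in critical values for the correct statistic, not the additional positive deviations of the commonly used $\chi^2$ statistics; the function name and the sample sizes are assumptions for illustration, and SciPy's `scipy.stats.f.ppf` and `scipy.stats.chi2.ppf` are used for the quantiles.

```python
from scipy.stats import chi2, f as f_dist


def critical_values(n_assets, n_periods, n_factors, level=0.05):
    """Return (N * F critical value, chi-square critical value).

    Both are thresholds for N * W_tilde: the first from the exact
    F(N, T - N - L) distribution, the second from the chi2(N) limit.
    """
    dfd = n_periods - n_assets - n_factors
    if dfd <= 0:
        raise ValueError("need T > N + L")
    exact = n_assets * f_dist.ppf(1.0 - level, n_assets, dfd)
    asymptotic = chi2.ppf(1.0 - level, n_assets)
    return float(exact), float(asymptotic)


if __name__ == "__main__":
    for T in (60, 120, 600):
        exact, asymptotic = critical_values(n_assets=25, n_periods=T, n_factors=5)
        print(f"T={T:4d}: N*F crit = {exact:7.3f}, chi2 crit = {asymptotic:7.3f}")
```

The exact threshold sits above the asymptotic one and approaches it only as $T$ grows, so referring even the correct statistic to the $\chi_N^2$ table rejects more often than the exact F test in short samples.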

Appendix B. Details of the Empirical Results

We now turn to some empirical examples, focusing on how the different implementations of the GRS statistic, as well as the asymptotic χ 2 statistic, compare to the correct calculation, based on model testing and ranking outcomes, borrowing from Fama and French (2015, 2016) the choice of asset pricing models and the choice of test assets. The models we consider include the CAPM, the Fama–French three-factor model, two variations of a four-factor model, the Fama–French five-factor model and a six-factor model that includes momentum. The test assets we explore include 5 × 5 sortings based on market capitalization and various anomaly variables including operating profitability, return volatility, residual volatility, accruals and so on, up to as many as 32 (2 × 4 × 4) portfolio sortings. We also explore decile portfolio sortings based on size, operating profitability, momentum, book-to-market and investment. The number of test assets used in empirical work is commonly as large as 25, as we see in Fama and French (2015, 2016), though many studies use 30 to over 50 test assets. See, for instance, Lewellen et al. (2010), Kroencke (2017), Demaj et al. (2018), and Kleibergen and Zhan (2020). Recently, asset pricing models have typically contained at least four or five factors, though six are also commonly seen. See, for instance, Barillas and Shanken (2018), Fama and and French (2018), Kan et al. (2024) or Hanauer (2020). Given the state of the literature, our choice of test assets and factors sits comfortably amidst the typical empirical asset pricing applications.
We use data retrieved from the French data library, and we consider five, ten, fifteen, twenty, twenty-five, forty, and fifty-year periods drawn from 1963–2019 for our consideration of up to six factors in the competing asset pricing models, and from 1926–2019 for our consideration of up to four factors in the competing asset pricing models.13 We limit our sample window to no less than five years of monthly data because few studies use fewer than 60 observations. Gibbons et al. (1989) note that issues of stationarity can reasonably constrain the length of a time series used, so that “it is not uncommon to published work where T is around 60”. Affleck-Graves and McDonald (1989) limited their analysis and simulations to 60-month periods; Ferson and Foerster (1994) studied 60, 120, and 720 monthly observations in their simulation study; Rouwenhorst (1999) used five years in sub-sample analysis; and recent work that exploited as few as four or five years of data includes Belimam et al. (2018) and Qin (2019). Leite et al. (2018) used as few as 98 months of data, Lewellen et al. (2010) used 168 observations of quarterly data, Choi et al. (2020) performed sub-sample stability tests using eight years of monthly data, and many studies of emerging economy markets have used ten to fifteen years of monthly data. See, for instance, Alhomaidi et al. (2019), Alshammari and Goto (2022), Merdad et al. (2015), and Sha and Gao (2019).
Our primary results, found in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7, make use of the full sample available to us by partitioning the data sample into overlapping periods. For instance, at the five-year horizon over 1963–2019, we form five-year windows starting in 1963 and every year following, so that the first window extends from July 1963 to June 1968, the second from January 1964 to December 1968, the third from January 1965 to December 1969, and so on, resulting in 53 five-year overlapping windows. For each of these windows over the period 1963–2019, we use 19 sets of test assets, listed in the first column of Table A1, and six competing asset pricing models. These models are the CAPM, the Fama–French three-factor model, two four-factor models, the five-factor model, and a six-factor model including momentum, all as considered in Fama and French (2015, 2016).
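The window construction described above can be sketched as follows. This is a minimal illustration that assumes every window spans exactly 12 × (years) months, that the first window starts in July 1963, and that all later windows start in January; the function name and its arguments are hypothetical.

```python
def rolling_windows(years, first_start=(1963, 7), last_year=2019):
    """Return (start, end) pairs of (year, month); windows advance one calendar year at a time."""
    months = 12 * years
    windows = []
    # First window: starts at the first available month (assumed July 1963).
    y0, m0 = first_start
    end_index = y0 * 12 + (m0 - 1) + months - 1  # month index counted from year 0
    windows.append((first_start, (end_index // 12, end_index % 12 + 1)))
    # Later windows: assumed to start each January and to end in December.
    year = y0 + 1
    while year + years - 1 <= last_year:
        windows.append(((year, 1), (year + years - 1, 12)))
        year += 1
    return windows

five_year = rolling_windows(5)
print(len(five_year), five_year[0], five_year[-1])
```

Under these assumptions the five-year case yields 53 windows, matching the count reported in the text.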

Appendix B.1. Results for Five-Year Windows

We first present a small subset of our empirical findings in Table A1, Table A2 and Table A3. For convenience, Table A1 and Table A2 replicate Table 1 and Table 2 from the main text, and here we discuss them in greater depth. In these tables, we consider five-year windows, the minimum span of data to which the GRS statistic is commonly applied. This short span of data shows the most serious over-rejection of asset pricing models from the application of the alternative formulations of the GRS test, as well as the highest frequency of misrankings relative to the correct formulation of the GRS statistic. We present evidence for longer spans of data, up to 50-year windows, in Table A4, Table A5, Table A6 and Table A7, and discuss them in Appendix B.2.
In Table A1, we present the number of excess test rejections at the 1%, 5%, and 10% levels, relative to the correct GRS statistic $\tilde{W}$, for each of the alternative test statistics: the asymptotic $\chi^2$, $\hat{W}$ and $\breve{W}$. The results for the largest number of test assets we considered, 32, are displayed in the first three rows of the table; the results for cases with 25 test assets follow on rows four through thirteen; the 17 test assets of the industry portfolios follow on row fourteen; and the remaining five rows present results for sets of 10 test assets. In Table A1, we use a total of 53 five-year overlapping windows starting from 1963 and consider six different asset pricing models, meaning that we have 318 cases for which an asset pricing model might be rejected, for each of the 19 sets of test assets.
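The counting rule behind an "excess rejection" can be sketched as below: a case counts when the alternative statistic rejects at the stated level while the correct statistic does not. The p-values are hypothetical placeholders for six models in a single window, and the function name is illustrative.

```python
def excess_rejections(p_alternative, p_correct, level):
    """Count cases the alternative test rejects at `level` while the correct test does not."""
    return sum(
        1
        for pa, pc in zip(p_alternative, p_correct)
        if pa < level and pc >= level
    )

# Hypothetical p-values for six models in one window (illustrative only, not from the paper).
p_correct = [0.12, 0.07, 0.04, 0.30, 0.009, 0.06]
p_alternative = [0.08, 0.04, 0.02, 0.11, 0.004, 0.03]

counts = {level: excess_rejections(p_alternative, p_correct, level)
          for level in (0.01, 0.05, 0.10)}

# 53 windows times 6 models gives the number of possible cases per set of test assets.
cases_per_test_asset_set = 53 * 6
print(counts, cases_per_test_asset_set)
```

Summing such counts over all 53 windows gives one cell of Table A1, out of 318 possible cases.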
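Table A1. Number and proportion of subsamples with more rejections relative to the GRS test for five year windows sampled during 1963–2019, across 19 sets of test assets.

| Test Assets (Nb of Subsamples = 53) | Asymptotic $\chi^2$ 1% | Asymptotic $\chi^2$ 5% | Asymptotic $\chi^2$ 10% | $\breve{W}$ 1% | $\breve{W}$ 5% | $\breve{W}$ 10% | $\hat{W}$ 1% | $\hat{W}$ 5% | $\hat{W}$ 10% |
|---|---|---|---|---|---|---|---|---|---|
| 2 × 4 × 4 MExMEBExINV | 289 | 275 | 247 | 4 | 17 | 21 | 0 | 1 | 0 |
| 2 × 4 × 4 MExMEBExOP | 262 | 246 | 220 | 4 | 14 | 18 | 0 | 0 | 0 |
| 2 × 4 × 4 MExOPxINV | 282 | 238 | 194 | 10 | 27 | 28 | 0 | 3 | 1 |
| 5 × 5 AccrualsxME | 222 | 218 | 207 | 7 | 17 | 31 | 0 | 0 | 1 |
| 5 × 5 BExME | 212 | 191 | 166 | 8 | 21 | 20 | 1 | 0 | 1 |
| 5 × 5 BetaxME | 229 | 221 | 204 | 11 | 16 | 25 | 0 | 1 | 0 |
| 5 × 5 MExOP | 238 | 227 | 188 | 9 | 27 | 42 | 0 | 0 | 0 |
| 5 × 5 MomentumxME | 210 | 166 | 124 | 14 | 29 | 27 | 0 | 0 | 0 |
| 5 × 5 NetIssuexME | 216 | 176 | 147 | 9 | 29 | 27 | 0 | 2 | 2 |
| 5 × 5 RVariancexME | 147 | 94 | 65 | 18 | 28 | 12 | 0 | 2 | 0 |
| 5 × 5 VariancexME | 137 | 81 | 45 | 26 | 27 | 12 | 0 | 3 | 1 |
| 5 × 5 BExInv | 248 | 259 | 245 | 4 | 18 | 16 | 0 | 2 | 1 |
| 5 × 5 MExInv | 214 | 189 | 171 | 14 | 17 | 29 | 0 | 0 | 0 |
| Industry | 126 | 153 | 152 | 12 | 9 | 22 | 0 | 1 | 0 |
| Book-to-Market Deciles | 48 | 70 | 67 | 5 | 22 | 21 | 0 | 1 | 1 |
| Investment Deciles | 22 | 30 | 56 | 4 | 4 | 6 | 0 | 0 | 0 |
| Momentum Deciles | 56 | 60 | 66 | 18 | 14 | 20 | 1 | 1 | 1 |
| Size Deciles | 49 | 90 | 68 | 6 | 23 | 27 | 0 | 0 | 2 |
| Operating Profitability Deciles | 48 | 70 | 65 | 4 | 24 | 15 | 0 | 1 | 0 |
| Average | 171.3 | 160.7 | 142 | 9.8 | 20.2 | 22.1 | 0.11 | 0.95 | 0.58 |
| Proportion (%) | 53.9 | 50.5 | 44.6 | 3.1 | 6.3 | 6.9 | 0.0 | 0.3 | 0.2 |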
Notes: (1) The figures are the number of all decisions at the stated significance level for which the test statistic rejects the model when the correct GRS statistic does not reject, out of a total of 318 possible, with the exception of the last row for which the proportion is given. The sample periods of five years are sampled over July 1963 to December 2019 and the number of models tested are 6 for each window, which we sum over to obtain the total number of over-rejections. These windows overlap, adjusted in a rolling window so that all but 1 year of data overlaps with the next sample window. This means that there are 53 samples for the 5 year window. (2) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
The asymptotic $\chi^2$ statistic fares the worst relative to the correct GRS statistic among the alternatives, over-rejecting roughly 50% of the time on average relative to $\tilde{W}$, across the common significance levels of 1%, 5%, and 10%. This over-rejection is worse when we consider more test assets.
The $\breve{W}$ and $\hat{W}$ do not display patterns related to the number of test assets or significance level, with the $\breve{W}$ ($\hat{W}$) over-rejecting relative to the correct GRS statistic about 5% (0.2%) of the time on average, across the common significance levels of 1%, 5%, and 10%. The over-rejection of the $\hat{W}$ is unstable across test assets, however, with a few cases displaying close to 6% over-rejection (three times over 53 subsamples), and some with no over-rejection.
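As a quick arithmetic check on how the last two rows of Table A1 relate to the rows above them, the sketch below recomputes the average and the proportion for the asymptotic $\chi^2$ column at the 1% level from the 19 counts reported in the table. The variable names are illustrative.

```python
# Counts of excess rejections at the 1% level for the asymptotic chi-square test,
# one entry per set of test assets, as reported in Table A1.
chi2_1pct_counts = [289, 262, 282, 222, 212, 229, 238, 210, 216, 147,
                    137, 248, 214, 126, 48, 22, 56, 49, 48]

possible_cases = 53 * 6  # 53 windows times 6 models per set of test assets

average = sum(chi2_1pct_counts) / len(chi2_1pct_counts)
proportion_pct = 100.0 * average / possible_cases

print(round(average, 1), round(proportion_pct, 1))
```

The same calculation applied to any other column reproduces its "Average" and "Proportion (%)" entries up to rounding.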
Table A2. Number and proportion of subsamples with different ranking outcomes from the GRS statistic for five year windows sampled during 1963–2019, across 19 sets of test assets.

| Test Assets (Nb of Subsamples = 53) | Any Model Mis-Ranked: $\breve{W}$ | Any Model Mis-Ranked: $\hat{W}$ | Top Model Mis-Ranked: $\breve{W}$ | Top Model Mis-Ranked: $\hat{W}$ |
|---|---|---|---|---|
| 2 × 4 × 4 MExMEBExINV | 26 | 2 | 5 | 0 |
| 2 × 4 × 4 MExMEBExOP | 37 | 1 | 6 | 0 |
| 2 × 4 × 4 MExOPxINV | 31 | 1 | 12 | 0 |
| 5 × 5 AccrualsxME | 36 | 1 | 12 | 0 |
| 5 × 5 BExME | 26 | 1 | 7 | 0 |
| 5 × 5 BetaxME | 31 | 1 | 10 | 0 |
| 5 × 5 MExOP | 33 | 0 | 10 | 0 |
| 5 × 5 MomentumxME | 33 | 0 | 14 | 0 |
| 5 × 5 NetIssuexME | 27 | 4 | 9 | 2 |
| 5 × 5 RVariancexME | 32 | 0 | 14 | 0 |
| 5 × 5 VariancexME | 34 | 1 | 10 | 0 |
| 5 × 5 BExInv | 35 | 3 | 6 | 0 |
| 5 × 5 MExInv | 30 | 3 | 9 | 0 |
| Industry | 29 | 2 | 4 | 0 |
| Book-to-Market Deciles | 28 | 0 | 4 | 0 |
| Investment Deciles | 24 | 2 | 4 | 0 |
| Momentum Deciles | 30 | 2 | 5 | 0 |
| Size Deciles | 27 | 3 | 5 | 0 |
| Operating Profitability Deciles | 30 | 0 | 6 | 0 |
| Average | 30.47 | 1.42 | 8.00 | 0.11 |
| Proportion (%) | 57.5 | 2.7 | 15.1 | 0.2 |
Notes: (1) The figures are the number of all misrankings from a particular test statistic value across models relative to the GRS statistic ranking, out of a total of 53 possible, with the exception of the last row for which the proportion is given. The sample periods of five years are sampled over July 1963 to December 2019 and the number of models ranked are 6 for each window. These windows overlap, adjusted in a rolling window so that all but 1 year of data overlaps with the next sample window. This means that there are 53 samples for the 5 year window. (2) For detailed description of the factor and test asset construction see Fama and French (2015, 2016).
Table A3. Summary statistics on factor models and test statistics for five year windows over 2007–2009 financial crisis sample period, for investment 5 × 5 NetIssuexME test assets.

| Date / Factor Model | Average Annualized/Raw % $\alpha$ | $\tilde{W}$ Statistic/Rank | $\breve{W}$ Statistic/Rank | $\hat{W}$ Statistic/Rank |
|---|---|---|---|---|
| JAN 2005-DEC 2009 | | | | |
| Mkt | 2.63/0.219 | 0.688/1 | 0.712/1 | 0.688/1 |
| Mkt SMB HML | 2.39/0.199 | 0.819/2 | 0.878/2 | 0.819/2 |
| Mkt SMB HML UMD | 2.37/0.198 | 0.837/3 | 0.913/3 | 0.837/3 |
| Mkt SMB RMW CMA | 2.54/0.212 | 0.910/4 | 0.992/4 | 0.912/4 |
| Mkt SMB HML RMW CMA | 2.52/0.210 | 0.997/6 | 1.108/6 | 0.999/6 |
| Mkt SMB HML RMW CMA UMD | 2.57/0.214 | 0.972/5 | 1.100/5 | 0.974/5 |
| JAN 2006-DEC 2010 | | | | |
| Mkt † | 3.23/0.269 | 1.114/3 | 1.152/**1** | 1.114/3 |
| Mkt SMB HML † | 2.32/0.193 | 1.078/2 | 1.155/2 | 1.078/**1** |
| Mkt SMB HML UMD † | 2.41/0.201 | 1.256/6 | 1.371/**5** | 1.257/6 |
| **Mkt SMB RMW CMA** \*,† | 2.64/0.220 | 1.077/1 | 1.174/**3** | 1.080/**2** |
| Mkt SMB HML RMW CMA | 2.65/0.221 | 1.130/4 | 1.256/4 | 1.134/4 |
| Mkt SMB HML RMW CMA UMD † | 2.81/0.234 | 1.253/5 | 1.418/**6** | 1.257/5 |
| JAN 2007-DEC 2011 | | | | |
| Mkt † | 3.39/0.283 | 1.329/2 | 1.375/**1** | 1.329/**1** |
| Mkt SMB HML † | 3.11/0.259 | 1.503/4 | 1.610/**3** | 1.504/4 |
| Mkt SMB HML UMD | 3.26/0.272 | 1.888/5 | 2.060/5 | 1.890/5 |
| **Mkt SMB RMW CMA** \*,† | 2.89/0.240 | 1.327/1 | 1.448/**2** | 1.332/**2** |
| Mkt SMB HML RMW CMA † | 2.92/0.244 | 1.478/3 | 1.643/**4** | 1.484/3 |
| Mkt SMB HML RMW CMA UMD | 3.08/0.257 | 1.947/6 | 2.204/6 | 1.955/6 |
| JAN 2008-DEC 2012 | | | | |
| Mkt | 4.34/0.361 | 1.299/2 | 1.344/2 | 1.300/2 |
| Mkt SMB HML | 3.64/0.303 | 1.500/3 | 1.607/3 | 1.500/3 |
| Mkt SMB HML UMD † | 3.77/0.314 | 1.906/6 | 2.079/**5** | 1.907/6 |
| Mkt SMB RMW CMA | 2.98/0.248 | 1.161/1 | 1.267/1 | 1.165/1 |
| Mkt SMB HML RMW CMA | 3.15/0.263 | 1.577/4 | 1.753/4 | 1.583/4 |
| Mkt SMB HML RMW CMA UMD † | 3.39/0.282 | 1.896/5 | 2.146/**6** | 1.904/5 |
| JAN 2009-DEC 2013 | | | | |
| Mkt | 2.74/0.228 | 1.678/2 | 1.736/2 | 1.681/2 |
| Mkt SMB HML | 3.05/0.254 | 2.938/5 | 3.148/5 | 2.946/5 |
| Mkt SMB HML UMD | 3.12/0.260 | 3.324/6 | 3.626/6 | 3.333/6 |
| Mkt SMB RMW CMA | 2.46/0.205 | 1.402/1 | 1.530/1 | 1.406/1 |
| Mkt SMB HML RMW CMA | 3.05/0.254 | 2.723/3 | 3.026/3 | 2.734/3 |
| Mkt SMB HML RMW CMA UMD | 2.72/0.227 | 2.733/4 | 3.093/4 | 2.745/4 |
Notes: (1) A test method producing different ranks using p-values versus the test statistic is indicated by a ‡ in the assets label and further identified with a ⇓ in the appropriate column. A model whose rank differs from its rank under $\tilde{W}$ is indicated by a † on the factor model label, with the rank in the appropriate column bolded. A top-ranked model under $\tilde{W}$ that is not top-ranked under another statistic is indicated by a * on the factor model label, with the model in the appropriate row bolded. (2) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
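Table A4. Percentage of subsamples/models with different decision outcomes from the correct GRS test $\tilde{W}$.

| Window (Months) | $\chi^2$ 1% | $\chi^2$ 5% | $\chi^2$ 10% | $\breve{W}$ 1% | $\breve{W}$ 5% | $\breve{W}$ 10% | $\hat{W}$ 1% | $\hat{W}$ 5% | $\hat{W}$ 10% |
|---|---|---|---|---|---|---|---|---|---|
| Panel A: 1963–2019 | | | | | | | | | |
| 60 | 53.9 | 50.5 | 44.6 | 3.1 | 6.3 | 6.9 | 0.0 | 0.3 | 0.2 |
| 120 | 27.4 | 24.8 | 20.0 | 3.2 | 3.7 | 3.9 | 0.1 | 0.0 | 0.1 |
| 180 | 16.4 | 13.0 | 11.4 | 2.1 | 2.1 | 2.2 | 0.1 | 0.0 | 0.0 |
| 240 | 9.7 | 8.6 | 6.3 | 1.4 | 1.7 | 1.2 | 0.0 | 0.0 | 0.0 |
| 300 | 6.5 | 5.7 | 4.5 | 0.9 | 1.3 | 1.1 | 0.1 | 0.1 | 0.0 |
| 480 | 3.0 | 2.9 | 1.9 | 0.3 | 0.7 | 0.4 | 0.0 | 0.0 | 0.0 |
| 600 | 2.9 | 0.8 | 1.6 | 0.3 | 0.2 | 0.3 | 0.0 | 0.0 | 0.0 |
| Panel B: 1926–2019 | | | | | | | | | |
| 60 | 34.6 | 36.1 | 33.2 | 2.6 | 4.3 | 5.3 | 0.0 | 0.1 | 0.1 |
| 120 | 15.4 | 15.5 | 12.6 | 2.2 | 3.1 | 2.5 | 0.0 | 0.0 | 0.0 |
| 180 | 8.8 | 10.3 | 9.2 | 1.0 | 1.4 | 1.7 | 0.1 | 0.0 | 0.0 |
| 240 | 8.6 | 7.7 | 4.7 | 1.3 | 2.0 | 1.3 | 0.0 | 0.0 | 0.0 |
| 300 | 6.0 | 4.4 | 3.1 | 1.0 | 0.8 | 0.6 | 0.0 | 0.0 | 0.0 |
| 480 | 3.9 | 2.8 | 0.9 | 0.7 | 0.7 | 0.2 | 0.0 | 0.0 | 0.0 |
| 600 | 2.2 | 1.7 | 0.6 | 0.6 | 0.1 | 0.2 | 0.0 | 0.0 | 0.0 |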
Notes: The figures are the percentage of all decisions at the stated significance level for which the test statistic rejects the model when the correct GRS statistic does not reject. (1) For Panel A, the sample periods cover July 1963 to December 2019, the sample window for the test is either 5, 10, 15, 20, 25, 40, or 50 years, the number of models tested are 6 for each window, and we average over 19 different sets of test assets, listed in Table A1. These windows overlap, adjusted in a rolling window, so that all but 1 year of data overlaps with the next sample window. This means that there are 53 samples for the 5-year window. The rejection rates are aggregated over the 6 models employed, so that a 54% value in the top leftmost cell corresponds to roughly 171 incorrect rejections on average. The factor models considered are the CAPM, the Fama–French 3 factor model, 4 and 5 factor models, as well as a six-factor model including momentum, all as considered in Fama and French (2016). (2) For Panel B, the sample periods cover January 1926 to December 2019. The models considered here number three, as factors for size, book-to-market, momentum and the market are all that are available. The test assets are constructed from industry classification, book-to-market crossed with size, momentum crossed with size, book-to-market, size and momentum decile portfolios. (3) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
Table A5. Percentage of subsamples/models with different model rank outcomes from the correct GRS statistic $\tilde{W}$, ranked by test statistic value.

| Window (Months) | Any Model Misranked: $\breve{W}$ | Any Model Misranked: $\hat{W}$ | Top Model Misranked: $\breve{W}$ | Top Model Misranked: $\hat{W}$ |
|---|---|---|---|---|
| Panel A: 1963–2019 | | | | |
| 60 | 57.5 | 2.7 | 15.1 | 0.2 |
| 120 | 37.6 | 0.9 | 9.4 | 0.0 |
| 180 | 21.4 | 0.0 | 4.4 | 0.0 |
| 240 | 12.5 | 0.0 | 3.0 | 0.0 |
| 300 | 9.7 | 0.2 | 2.4 | 0.2 |
| 480 | 14.3 | 0.3 | 1.8 | 0.0 |
| 600 | 15.1 | 0.0 | 0.0 | 0.0 |
| Panel B: 1926–2019 | | | | |
| 60 | 17.4 | 0.7 | 9.6 | 0.4 |
| 120 | 10.6 | 0.2 | 5.5 | 0.0 |
| 180 | 5.2 | 0.2 | 1.9 | 0.0 |
| 240 | 3.1 | 0.0 | 0.9 | 0.0 |
| 300 | 1.9 | 0.0 | 1.2 | 0.0 |
| 480 | 1.5 | 0.3 | 0.6 | 0.0 |
| 600 | 1.1 | 0.0 | 0.7 | 0.0 |
Notes: The figures are the percentage of misrankings from a particular test statistic value across models relative to the GRS statistic ranking. (1) For Panel A, the sample periods cover July 1963 to December 2019, the sample window for the test is either 5, 10, 15, 20, 25, 40, or 50 years, the number of models tested is 6 for each window, and we aggregate over 19 different sets of test assets, listed in Table A1. These windows overlap, adjusted in a rolling window so that all but 1 year of data overlaps with the next sample window. The factor models considered are the CAPM, the Fama–French 3-factor model, 4 and 5-factor models, as well as a 6-factor model including momentum, all as considered in Fama and French (2016). (2) For Panel B, the sample periods cover January 1926 to December 2019. The models considered here number three, as factors for size, book-to-market, momentum and the market are all that are available. The test assets are constructed from industry classification, book-to-market crossed with size, momentum crossed with size, book-to-market, size and momentum decile portfolios. (3) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
In Table A2, we present the number of cases for which each statistic misranks factor models relative to the correct GRS statistic ranking. Here, we rank the six models for each of the 53 five-year windows considered in Table A1. At the five-year horizon, the $\breve{W}$ statistic performs much worse in ranking than it did for test rejections, with close to 60% of the cases displaying a misranking. The $\hat{W}$ also performs worse than it did for test rejections, with close to 3% of the cases misranked. If we restrict our attention to cases for which the top model is misranked, the $\breve{W}$ statistic misranks between 10% and 15% of the time, while the $\hat{W}$ misranks the top model less than 0.5% of the time.
In Table A3, we present detailed results for the 5 × 5 net share issuance crossed with size test asset set, for five overlapping five-year windows near the end of the 1963–2019 sample period, in order to give the reader a finer sense of the results in Table A1 and Table A2. We present the average annualized and raw percentage alpha for each of the six asset pricing models and each window, as well as values of the correct GRS statistic $\tilde{W}$, and the $\breve{W}$ and $\hat{W}$ statistics. Beside each test statistic value, we report the ranks of the six models, from one to six. A model ranked differently by $\hat{W}$ or $\breve{W}$ than by $\tilde{W}$ is indicated by a † next to the factor label and further identified with the appropriate column’s rank number being boldfaced. A misranked top model is indicated by an * next to the factor label and further identified with the appropriate row’s factor label being bolded. A test method producing different ranks using p-values than using the test statistic is indicated by a ‡ next to the window period and further identified with a ⇓ in the appropriate column. This issue of different ranking using the test statistic versus the p-value will be drawn out below.
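The two misranking flags used in Table A2 and Table A3 can be sketched as follows, using the January 2006 to December 2010 values of the correct statistic and of $\breve{W}$ reported in Table A3. The helper names are illustrative, and rank one is assigned to the smallest statistic, as in the table.

```python
def ranks(values):
    """Rank 1 for the smallest value, rank len(values) for the largest."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

# Table A3, window JAN 2006-DEC 2010, models in the order listed in the table.
w_correct = [1.114, 1.078, 1.256, 1.077, 1.130, 1.253]
w_breve = [1.152, 1.155, 1.371, 1.174, 1.256, 1.418]

rank_correct = ranks(w_correct)
rank_breve = ranks(w_breve)

# "Any model misranked": at least one model receives a different rank.
any_misranked = rank_correct != rank_breve
# "Top model misranked": the rank-one model differs between the two statistics.
top_misranked = rank_correct.index(1) != rank_breve.index(1)

print(rank_correct, rank_breve, any_misranked, top_misranked)
```

For this window both flags are set: the four-factor model with RMW and CMA is ranked first by the correct statistic but third by $\breve{W}$.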
Table A6. Percentage of subsamples/models with different model rank outcomes from the correct GRS statistic $\tilde{W}$, ranked by test p-value.

| Window (Months) | Any Model Mis-Ranked: $\breve{W}$ | Any Model Mis-Ranked: $\hat{W}$ | Top Model Mis-Ranked: $\breve{W}$ | Top Model Mis-Ranked: $\hat{W}$ |
|---|---|---|---|---|
| Panel A: 1963–2019 | | | | |
| 60 | 54.6 | 1.9 | 13.1 | 0.4 |
| 120 | 37.4 | 0.5 | 9.1 | 0.0 |
| 180 | 21.9 | 0.4 | 4.4 | 0.1 |
| 240 | 11.5 | 0.1 | 2.8 | 0.0 |
| 300 | 9.4 | 0.2 | 2.1 | 0.0 |
| 480 | 13.2 | 0.3 | 1.8 | 0.0 |
| 600 | 13.2 | 0.0 | 0.0 | 0.0 |
| Panel B: 1926–2019 | | | | |
| 60 | 15.0 | 0.7 | 8.1 | 0.7 |
| 120 | 10.2 | 0.0 | 5.5 | 0.0 |
| 180 | 5.4 | 0.0 | 2.1 | 0.0 |
| 240 | 3.3 | 0.2 | 0.9 | 0.0 |
| 300 | 1.9 | 0.0 | 1.2 | 0.0 |
| 480 | 1.2 | 0.0 | 0.6 | 0.0 |
| 600 | 1.1 | 0.0 | 0.7 | 0.0 |
Notes: The figures are the percentage of misrankings from a particular test statistic p-value across models relative to the GRS statistic ranking. (1) For Panel A the sample periods cover July 1963 to December 2019, the sample window for the test is either 5, 10, 15, 20, 25, 40, or 50 years, the number of models tested is 6 for each window, and we aggregate over 19 different sets of test assets, listed in Table A1. These windows overlap, adjusted in a rolling window so that all but 1 year of data overlaps with the next sample window. The factor models considered are the CAPM, the Fama–French 3-factor model, 4 and 5-factor models, as well as a 6-factor model including momentum, all as considered in Fama and French (2016). (2) For Panel B, the sample periods cover January 1926 to December 2019. The models considered here number three, as factors for size, book-to-market, momentum and the market are all that are available. The test assets are constructed from industry classification, book-to-market crossed with size, momentum crossed with size, book-to-market, size and momentum decile portfolios. (3) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
What we see in Table A3 is fairly typical across the full set of empirical findings that Table A1 and Table A2 are based on: the $\breve{W}$ displays many misrankings, and misrankings of the top model are rare. Although not tabulated, only the asymptotic $\chi^2$ typically rejects factor models in this particular small set of examples, so that the $\breve{W}$, $\hat{W}$, and correct $\tilde{W}$ test statistics are all consistent with each other in their test decisions. Studies that seek to rank models commonly rank by the average absolute alpha, whether or not all models are rejected by the GRS test. See, for instance, Fama and French (2015). As we can see from Table A3, the model with the smallest average absolute alpha is often not top-ranked.

Appendix B.2. Summary Results for Longer Windows

We now present evidence for a longer span of data in Table A4, Table A5, Table A6 and Table A7, using two sets of date windows up to 50 years. In addition to the period 1963–2019 that we considered in Appendix B.1, we now add the period 1926–1962. The pre-1963 period lacks data for factors and test assets built using operating profitability, accruals, etc., so we are left with six sets of test assets constructed from book-to-market, size, industry classification, and momentum, and three different factor models: the CAPM, the Fama–French three-factor model, and a four-factor model including momentum, all as considered in Fama and French (2015, 2016). For this restricted set of test assets and factors, we estimate test rejections and rankings using the entire 1926–2019 sample, and we report these results separately from those constructed using the larger set of test assets and models on the 1963–2019 data alone. This separation is informative because the chance of a misranking declines when fewer asset pricing models are considered.
In Table A4, we present the percentage of excess test rejections relative to the correct GRS statistic at each of the 1%, 5%, and 10% levels for each of the alternative test methods: the asymptotic $\chi^2$, the $\breve{W}$, and the $\hat{W}$. This is performed for overlapping windows of 5, 10, 15, 20, 25, 40, and 50 years, using data that span either 1963–2019 or 1926–2019. Panel A displays the percentage of over-rejection rates averaged over six asset pricing models and 19 sets of test assets for the period 1963–2019; Panel B displays the same percentages for the period 1926–2019 using the smaller set of three factor models and six sets of test assets.
For both Panels A and B, we see a virtually monotonic decline in over-rejections as sample size increases, albeit at a fairly slow rate. The asymptotic $\chi^2$ statistic over-rejects roughly 5% of the time relative to the $\tilde{W}$ statistic even with a 25-year window. The $\breve{W}$ over-rejects roughly 5% of the time with 5 years of data, and about 1% of the time even with a 25-year window. The $\hat{W}$, which is fairly close to the correct $\tilde{W}$ statistic, does not appear to over-reject with more than 25 years of data, and over-rejects no more than about 0.1% of the time with 10 or more years of data. The simulations based on models and calibrations described in Appendix C.1 reveal that $\hat{W}$ always over-rejects, at least in this experimental design. The over-rejection declines with sample size, but at a decreasing (non-linear) rate, increases with the number of factors, and is largely unrelated to the number of test assets.
In Table A5 and Table A6, we present the percentage of cases for which each statistic misranks factor models relative to the correct $\tilde{W}$ statistic’s ranking, by test statistic value and by p-value, respectively. Misranking of models by the $\breve{W}$ statistic is remarkably frequent, even at a 50-year data horizon if the set of models is as large as six, for either ranking method (statistic value or p-value). Replacing the correct form of the GRS test with $\breve{W}$ scrambles rankings roughly 13% to 15% of the time even at the 50-year horizon, and over 50% of the time at the 5-year horizon, for cases with six models (Panel A). When there are only three models (Panel B), misrankings are naturally fewer, and for data horizons over 15 years misrankings occur less than 5% of the time across methods. The $\hat{W}$ is typically consistent with the correct $\tilde{W}$ statistic, though misrankings occur even at the 40-year window length.
If we restrict our attention to cases for which the top model is misranked, the $\hat{W}$ statistic is consistent with the correct $\tilde{W}$ statistic once we have 40 or more years of data, but $\breve{W}$ misranks even at 40 years of data. Again, when there are only three models (Panel B), misrankings are fewer, and for data horizons over 10 years misrankings occur less than 5% of the time across methods.
Finally, in Table A7, we present evidence on rankings from these different test statistics compared to rankings based on the p-values of these test statistics, to see if they are consistent. We can think of ranking by the test statistic as a sort of mean squared error or model $R^2$ ranking, which is perhaps helpful if we are interested in minimizing model prediction error even if models are false (see, for instance, Teräsvirta and Mellin 1986). Here, we see that all the rankings by test statistics are fragile, even at a 50-year horizon, where they differ from the p-value rankings in roughly 7% of cases whether we compare three or six asset pricing models to each other. Table A7 highlights that the statistically sound p-values, instead of the raw GRS statistics, should be used in model ranking.
Considering the case of a true model that includes a subset of the available factors, an untabulated analysis of the simulated test rankings formed using the magnitude of the GRS statistic confirms that the probability of an incorrect model, larger than the true model, achieving a high rank versus other models increases with the number of factors in the model, when the GRS statistic ranking differs from the p-value ranking. This bias is stronger when using an incorrect GRS statistic. In cases for which both the GRS statistic ranking and the p-value ranking agree, there is no such pattern to tilt to larger models than the true model.
The important insight to take away from these results is that the error in calculating the GRS statistic can have a material impact on empirical results, particularly when twenty or fewer years of data are used, which is not uncommon in empirical asset pricing studies. For example, Barillas and Shanken (2018) performed model comparisons on a little less than 15 years of monthly data. Harvey and Liu (2021) considered tests of asset pricing models and report simulations for 20 and 40 years of monthly data. Sha and Gao (2019) used 144 months of data and exploited 6 metrics to evaluate factor model performance, including the GRS statistic. Baek and Bilson (2015) considered 234 months of data in a subsample estimation. Chiah et al. (2016) used 23 years of data when comparing models using the GRS statistic. One takeaway from these papers is that many situations involving specialized data (like Sha and Gao 2019, exploring mutual fund returns in China) or sub-sample robustness checks (like Baek and Bilson 2015) are necessarily constrained to shorter samples than fifty or even twenty years, so that the bias from an incorrectly calculated GRS statistic becomes large.
Table A7. Percentage of subsamples/models with different model rankings if ranked by p-values rather than test statistics.

| Window (Months) | $\tilde{W}$ | $\breve{W}$ | $\hat{W}$ |
|---|---|---|---|
| Panel A: 1963–2019 | | | |
| 60 | 16.3 | 21.8 | 16.8 |
| 120 | 4.9 | 5.8 | 5.3 |
| 180 | 4.3 | 3.9 | 3.9 |
| 240 | 4.6 | 5.4 | 4.4 |
| 300 | 4.8 | 4.8 | 4.8 |
| 480 | 2.6 | 3.8 | 2.6 |
| 600 | 6.6 | 7.9 | 6.6 |
| Panel B: 1926–2019 | | | |
| 60 | 2.2 | 5.0 | 2.6 |
| 120 | 0.6 | 1.2 | 0.8 |
| 180 | 0.4 | 0.2 | 0.6 |
| 240 | 0.4 | 0.2 | 0.2 |
| 300 | 0.7 | 0.7 | 0.7 |
| 480 | 5.5 | 5.8 | 5.8 |
| 600 | 7.4 | 7.4 | 7.4 |
Notes: The figures are the percentage of misrankings from a particular test statistic value across models relative to the GRS statistic ranking. (1) For Panel A, the sample periods cover July 1963 to December 2019, the sample window for the test is either 5, 10, 15, 20, 25, 40, or 50 years; the number of models tested are 6 for each window; and we average over 19 different sets of test assets, listed in Table A1. These windows overlap, adjusted in a rolling window so that all but 1 year of data overlaps with the next sample window. The factor models considered are the CAPM, the Fama–French 3-factor model, 4 and 5-factor models, as well as a 6-factor model including momentum, all as considered in Fama and French (2016). (2) For Panel B, the sample periods cover January 1926 to December 2019. The models considered here number three, as factors for size, book-to-market, momentum and the market are all that are available. The test assets are constructed from industry classification, book-to-market crossed with size, momentum crossed with size, book-to-market, size and momentum decile portfolios. (3) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
We do not recommend ranking models with the magnitude of the GRS statistic, and instead suggest the use of the p-value of the statistic from the exact F-distribution, since the p-value internalizes the different degrees of freedom of the GRS statistics computed for models with a different number of factors. We recognize that ranking of models by the GRS statistic has a desirable economic intuition—this is a direct, easy-to-understand metric tied to a model’s factors spanning the test asset returns. But researchers need to at least understand that this ranking might have undesirable statistical properties. The detailed results that these tables are based on, and additional summary statistics are available on request.
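To make the degrees-of-freedom point concrete, the p-value attached to model $m$ can be written as below. This is a sketch in notation introduced here for illustration only ($p_m$, $L_m$, $M$, and $F_{a,b}(\cdot)$ for the cumulative distribution function are not symbols taken from the text), and it assumes the exact null distribution $F_{N,\,T-N-L}$ of the correct statistic.

$$
p_m \;=\; 1 - F_{N,\;T-N-L_m}\!\left(\tilde{W}_m\right), \qquad m = 1, \dots, M .
$$

Because the denominator degrees of freedom $T-N-L_m$ change with the number of factors, models with the same $N$ and $T$ but different $L_m$ are measured against different reference distributions, so the ordering of the $\tilde{W}_m$ need not coincide with the ordering of the $p_m$. Ranking by $p_m$, with the largest p-value ranked first, places all models on a common probability scale.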

Appendix C. Simulation Results

Here, we present simulation results to evaluate the size performance of the alternative tests $\tilde{W}$, $\hat{W}$, $\breve{W}$ and $\chi^2$. We conducted three sets of simulations, differentiated by the error generating process. Within each of these three sets of simulations, we look at two different factor models, one with three factors and one with six factors, and we look at two different groupings of test assets, one grouping being decile portfolios and one being a five by five set of twenty-five portfolios. For each of these, we consider monthly samples of lengths 5, 10, 15, 20, 25, 40, and 50 years.
The first two sets of simulation results exploit the normal and t-distributions for the errors, calibrating return moments (mean, volatility, covariance) to French’s portfolio returns. For the decile returns, we calibrate to the size-sorted decile monthly returns over 1963/7–2019/12, and for the 5 by 5 test asset case, we calibrate to the size by book-to-market returns, again over 1963/7–2019/12. The last set of simulations employed bootstrap methods and the French portfolio returns, described fully below.

Appendix C.1. Simulated Normal and t-Distributed Returns

Following the literature, we generate portfolio excess returns $\tilde{r}_{pt}$ as normal, independent and identically distributed, calibrated to monthly U.S. stock returns. We also consider the t-distribution with eight degrees of freedom and a bootstrap simulation.
The excess return for test asset $i$ and time $t$ is generated based on model (1), which is rewritten here:
$$\tilde{r}_{it} = \delta_{i0} + \sum_{j=1}^{L} \delta_{ij}\,\tilde{r}_{jt} + \tilde{\eta}_{it},$$
where $\tilde{\eta}_{it}$ is i.i.d. normal across $t$ with mean 0 and volatility $\sigma_{ii}$, and $\tilde{r}_{jt}$ is i.i.d. normal across $t$ with mean $\mu_j / L$, volatility $\sigma_j$, and $E\left[\tilde{\eta}_{it}\,\tilde{r}_{jt}\right] = 0$. We set $\mu_j = 0.01$, $\sigma_j = 0.02$, $\sigma_{ii} = 0.08$, $\delta_{ij} = 1$ for all $i, j$, and we explore only the case of $\delta_{i0} = 0$ for all $i$.
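The data-generating process above can be sketched with the standard library as follows. The parameter values are those stated in the text; the function name, the seed, and the assumption that the errors are independent across test assets (the text specifies independence across time only) are illustrative choices.

```python
import random

def simulate_excess_returns(T, N, L, mu=0.01, sigma_f=0.02, sigma_e=0.08, seed=0):
    """Simulate T periods of L factor returns and N test-asset excess returns under the null."""
    rng = random.Random(seed)
    factors, assets = [], []
    for _ in range(T):
        # Each factor is i.i.d. normal with mean mu / L and volatility sigma_f.
        f_t = [rng.gauss(mu / L, sigma_f) for _ in range(L)]
        # delta_ij = 1 for all i, j and delta_i0 = 0, so every asset loads on the factor sum.
        common = sum(f_t)
        # Errors are drawn independently across assets (an assumption of this sketch).
        r_t = [common + rng.gauss(0.0, sigma_e) for _ in range(N)]
        factors.append(f_t)
        assets.append(r_t)
    return factors, assets

factors, assets = simulate_excess_returns(T=600, N=10, L=3)
grand_mean = sum(sum(row) for row in assets) / (len(assets) * len(assets[0]))
print(len(factors), len(assets[0]), round(grand_mean, 4))
```

Since each of the $L$ factors has mean $\mu_j / L$ and every loading is one, the expected excess return of each test asset is $\mu_j = 0.01$ regardless of $L$, which keeps designs with three and six factors comparable.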
We explore the size properties of the correct and incorrect formulas of the GRS statistic for numbers of factor portfolios ($L$) from 3 to 6, test assets ($N$) from 10 to 25, and sample sizes ($T$) from 60 (months) to 600. This spans typical applications of the GRS statistic. Our simulations show that the performance of the incorrect GRS formula generally deteriorates as the number of test assets and factors increases, as one might expect. We present simulation results for the normal case in Table A8 and for the t-distribution case in Table A9.
The correct formula of the GRS statistic generally presents no evidence of incorrect size in Table A8, as our simulation setting is one in which it should have the correct small-sample exact size. Even with the t-distribution, the GRS performs well. The $\hat{W}$ formula shows some evidence of over-rejection with the t-distribution and a small sample size. The $\breve{W}$ formula of the GRS statistic and the $\chi^2$ show strong over-rejection under 20 years of data, and the $\chi^2$ shows evidence of over-rejection even with 50 years of data.
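Table A8. Null rejection rates.

| Test Assets/Portfolios (N/L) | Years | $\tilde{W}$ 1% | $\tilde{W}$ 10% | $\hat{W}$ 1% | $\hat{W}$ 10% | $\breve{W}$ 1% | $\breve{W}$ 10% | $\chi^2$ 1% | $\chi^2$ 10% |
|---|---|---|---|---|---|---|---|---|---|
| 10/3 | 5 | 0.010 | 0.104 | 0.011 | 0.106 | 0.016 | 0.132 | 0.062 | 0.240 |
| | 10 | 0.011 | 0.100 | 0.011 | 0.102 | 0.013 | 0.119 | 0.026 | 0.162 |
| | 15 | 0.010 | 0.099 | 0.010 | 0.100 | 0.012 | 0.109 | 0.020 | 0.135 |
| | 20 | 0.008 | 0.094 | 0.008 | 0.096 | 0.010 | 0.101 | 0.016 | 0.120 |
| | 25 | 0.010 | 0.100 | 0.010 | 0.101 | 0.011 | 0.107 | 0.017 | 0.122 |
| | 40 | 0.009 | 0.102 | 0.009 | 0.103 | 0.010 | 0.106 | 0.012 | 0.113 |
| | 50 | 0.009 | 0.096 | 0.009 | 0.096 | 0.009 | 0.099 | 0.012 | 0.105 |
| 10/6 | 5 | 0.009 | 0.104 | 0.010 | 0.110 | 0.021 | 0.158 | 0.065 | 0.246 |
| | 10 | 0.010 | 0.095 | 0.010 | 0.097 | 0.015 | 0.124 | 0.026 | 0.156 |
| | 15 | 0.011 | 0.105 | 0.011 | 0.107 | 0.014 | 0.125 | 0.020 | 0.143 |
| | 20 | 0.009 | 0.094 | 0.009 | 0.096 | 0.011 | 0.111 | 0.015 | 0.124 |
| | 25 | 0.010 | 0.103 | 0.010 | 0.104 | 0.012 | 0.114 | 0.015 | 0.125 |
| | 40 | 0.009 | 0.097 | 0.010 | 0.098 | 0.011 | 0.104 | 0.012 | 0.111 |
| | 50 | 0.009 | 0.098 | 0.009 | 0.099 | 0.010 | 0.103 | 0.011 | 0.109 |
| 25/3 | 5 | 0.011 | 0.103 | 0.011 | 0.107 | 0.016 | 0.141 | 0.484 | 0.730 |
| | 10 | 0.008 | 0.101 | 0.008 | 0.104 | 0.011 | 0.122 | 0.124 | 0.367 |
| | 15 | 0.010 | 0.099 | 0.010 | 0.101 | 0.012 | 0.115 | 0.062 | 0.260 |
| | 20 | 0.009 | 0.099 | 0.010 | 0.101 | 0.011 | 0.110 | 0.043 | 0.206 |
| | 25 | 0.010 | 0.097 | 0.011 | 0.098 | 0.012 | 0.106 | 0.034 | 0.180 |
| | 40 | 0.009 | 0.100 | 0.009 | 0.101 | 0.009 | 0.105 | 0.020 | 0.147 |
| | 50 | 0.008 | 0.093 | 0.008 | 0.094 | 0.009 | 0.098 | 0.018 | 0.136 |
| 25/6 | 5 | 0.008 | 0.097 | 0.009 | 0.104 | 0.020 | 0.163 | 0.524 | 0.763 |
| | 10 | 0.010 | 0.102 | 0.011 | 0.106 | 0.018 | 0.140 | 0.130 | 0.378 |
| | 15 | 0.011 | 0.103 | 0.012 | 0.107 | 0.016 | 0.131 | 0.069 | 0.261 |
| | 20 | 0.012 | 0.103 | 0.012 | 0.105 | 0.016 | 0.124 | 0.046 | 0.212 |
| | 25 | 0.009 | 0.100 | 0.010 | 0.102 | 0.013 | 0.116 | 0.031 | 0.181 |
| | 40 | 0.009 | 0.099 | 0.009 | 0.100 | 0.011 | 0.111 | 0.020 | 0.149 |
| | 50 | 0.011 | 0.099 | 0.011 | 0.100 | 0.012 | 0.107 | 0.019 | 0.135 |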
Notes: (1) Bold-faced numbers are rejection rates three standard deviations larger than the nominal values. (2) These results are based on 10,000 simulations. (3) The models are r ˜ i t = δ i 0 + j = 1 L δ i j r ˜ j t + η ˜ i t , where r ˜ j t i i d N μ j / L , σ j 2 , η ˜ i t i i d N 0 , σ i i 2 and E η ˜ i t r ˜ j t = 0 . For j , μ j = 0.01 , σ j = 0.02 , σ i i = 0.08 , and δ i j = 1 .
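As a rough guide to note (1), the flagging thresholds can be worked out by hand. The calculation below assumes that the standard deviation in question is the binomial Monte Carlo standard error of a rejection rate at nominal level $p$ with $R = 10{,}000$ replications; this reading is an assumption and is not spelled out in the note.

```latex
\mathrm{se}(p) = \sqrt{\frac{p(1-p)}{R}}, \qquad R = 10{,}000
% 1% level:
\mathrm{se}(0.01) = \sqrt{\frac{0.01 \times 0.99}{10{,}000}} \approx 0.000995,
\qquad 0.01 + 3\,\mathrm{se}(0.01) \approx 0.0130
% 10% level:
\mathrm{se}(0.10) = \sqrt{\frac{0.10 \times 0.90}{10{,}000}} = 0.003,
\qquad 0.10 + 3\,\mathrm{se}(0.10) = 0.109
```

Under this reading, an entry above roughly 0.013 in a 1% column, or above 0.109 in a 10% column, would be flagged as over-rejection.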
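Table A9. Null rejection rates.

| Test Assets/Portfolios (N/L) | Years | $\tilde{W}$ 1% | $\tilde{W}$ 10% | $\hat{W}$ 1% | $\hat{W}$ 10% | $\breve{W}$ 1% | $\breve{W}$ 10% | $\chi^2$ 1% | $\chi^2$ 10% |
|---|---|---|---|---|---|---|---|---|---|
| 10/3 | 5 | 0.009 | 0.104 | 0.009 | 0.107 | 0.015 | 0.137 | 0.061 | 0.236 |
| 10/3 | 10 | 0.011 | 0.098 | 0.011 | 0.100 | 0.012 | 0.113 | 0.026 | 0.158 |
| 10/3 | 15 | 0.009 | 0.100 | 0.009 | 0.101 | 0.011 | 0.109 | 0.018 | 0.135 |
| 10/3 | 20 | 0.010 | 0.100 | 0.010 | 0.101 | 0.011 | 0.107 | 0.017 | 0.125 |
| 10/3 | 25 | 0.010 | 0.099 | 0.010 | 0.099 | 0.011 | 0.106 | 0.014 | 0.122 |
| 10/3 | 40 | 0.010 | 0.106 | 0.010 | 0.106 | 0.010 | 0.109 | 0.013 | 0.120 |
| 10/3 | 50 | 0.009 | 0.100 | 0.009 | 0.101 | 0.010 | 0.105 | 0.012 | 0.112 |
| 10/6 | 5 | 0.010 | 0.104 | 0.012 | 0.110 | 0.023 | 0.163 | 0.068 | 0.256 |
| 10/6 | 10 | 0.010 | 0.095 | 0.010 | 0.098 | 0.015 | 0.122 | 0.027 | 0.155 |
| 10/6 | 15 | 0.010 | 0.099 | 0.011 | 0.101 | 0.014 | 0.117 | 0.020 | 0.136 |
| 10/6 | 20 | 0.009 | 0.096 | 0.009 | 0.098 | 0.012 | 0.111 | 0.016 | 0.123 |
| 10/6 | 25 | 0.012 | 0.097 | 0.012 | 0.098 | 0.014 | 0.109 | 0.017 | 0.120 |
| 10/6 | 40 | 0.010 | 0.099 | 0.010 | 0.099 | 0.011 | 0.105 | 0.014 | 0.111 |
| 10/6 | 50 | 0.010 | 0.102 | 0.011 | 0.103 | 0.011 | 0.109 | 0.012 | 0.114 |
| 25/3 | 5 | 0.009 | 0.095 | 0.009 | 0.098 | 0.014 | 0.131 | 0.477 | 0.725 |
| 25/3 | 10 | 0.011 | 0.105 | 0.011 | 0.107 | 0.014 | 0.127 | 0.129 | 0.370 |
| 25/3 | 15 | 0.011 | 0.103 | 0.011 | 0.105 | 0.013 | 0.117 | 0.064 | 0.254 |
| 25/3 | 20 | 0.011 | 0.094 | 0.011 | 0.096 | 0.012 | 0.106 | 0.041 | 0.206 |
| 25/3 | 25 | 0.009 | 0.095 | 0.009 | 0.096 | 0.010 | 0.104 | 0.031 | 0.180 |
| 25/3 | 40 | 0.009 | 0.098 | 0.009 | 0.099 | 0.010 | 0.104 | 0.021 | 0.145 |
| 25/3 | 50 | 0.010 | 0.097 | 0.010 | 0.098 | 0.011 | 0.102 | 0.017 | 0.137 |
| 25/6 | 5 | 0.009 | 0.100 | 0.010 | 0.106 | 0.022 | 0.169 | 0.533 | 0.772 |
| 25/6 | 10 | 0.010 | 0.103 | 0.011 | 0.108 | 0.020 | 0.140 | 0.131 | 0.380 |
| 25/6 | 15 | 0.011 | 0.104 | 0.012 | 0.108 | 0.015 | 0.132 | 0.069 | 0.266 |
| 25/6 | 20 | 0.010 | 0.101 | 0.010 | 0.103 | 0.015 | 0.117 | 0.044 | 0.209 |
| 25/6 | 25 | 0.010 | 0.099 | 0.010 | 0.100 | 0.012 | 0.114 | 0.032 | 0.180 |
| 25/6 | 40 | 0.009 | 0.102 | 0.010 | 0.104 | 0.011 | 0.114 | 0.021 | 0.150 |
| 25/6 | 50 | 0.009 | 0.101 | 0.010 | 0.101 | 0.011 | 0.109 | 0.019 | 0.141 |

Notes: (1) Bold-faced numbers are rejection rates three standard deviations larger than the nominal values. (2) The results are based on 10,000 simulations. (3) The models are $\tilde{r}_{it} = \delta_{i0} + \sum_{j=1}^{L} \delta_{ij} \tilde{r}_{jt} + \tilde{\eta}_{it}$, where $\tilde{r}_{jt} \overset{iid}{\sim} (\mu_j / L, \sigma_j^2)$, t with 8 df, $\tilde{\eta}_{it} \overset{iid}{\sim} (0, \sigma_{ii}^2)$, t with 8 df, and $E[\tilde{\eta}_{it} \tilde{r}_{jt}] = 0$. For all $j$, $\mu_j = 0.01$, $\sigma_j = 0.02$, $\sigma_{ii} = 0.08$, and $\delta_{ij} = 1$.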
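Table A10. Null rejection rates.

| Test Assets/Portfolios (N/L) | Years | $\tilde{W}$ 1% | $\tilde{W}$ 10% | $\hat{W}$ 1% | $\hat{W}$ 10% | $\breve{W}$ 1% | $\breve{W}$ 10% | $\chi^2$ 1% | $\chi^2$ 10% |
|---|---|---|---|---|---|---|---|---|---|
| 10/3 | 5 | 0.097 | 0.290 | 0.097 | 0.292 | 0.115 | 0.333 | 0.228 | 0.458 |
| 10/3 | 10 | 0.140 | 0.344 | 0.140 | 0.344 | 0.153 | 0.367 | 0.204 | 0.425 |
| 10/3 | 15 | 0.167 | 0.372 | 0.167 | 0.372 | 0.176 | 0.385 | 0.210 | 0.417 |
| 10/3 | 20 | 0.189 | 0.395 | 0.189 | 0.395 | 0.198 | 0.405 | 0.222 | 0.433 |
| 10/3 | 25 | 0.202 | 0.408 | 0.202 | 0.408 | 0.207 | 0.417 | 0.230 | 0.436 |
| 10/3 | 40 | 0.234 | 0.431 | 0.234 | 0.431 | 0.237 | 0.435 | 0.250 | 0.447 |
| 10/3 | 50 | 0.249 | 0.450 | 0.249 | 0.450 | 0.251 | 0.455 | 0.261 | 0.465 |
| 10/6 | 5 | 0.080 | 0.266 | 0.080 | 0.268 | 0.114 | 0.342 | 0.213 | 0.451 |
| 10/6 | 10 | 0.124 | 0.323 | 0.124 | 0.324 | 0.146 | 0.361 | 0.183 | 0.402 |
| 10/6 | 15 | 0.148 | 0.356 | 0.149 | 0.357 | 0.164 | 0.381 | 0.192 | 0.406 |
| 10/6 | 20 | 0.173 | 0.374 | 0.173 | 0.374 | 0.184 | 0.394 | 0.203 | 0.415 |
| 10/6 | 25 | 0.186 | 0.387 | 0.186 | 0.387 | 0.194 | 0.403 | 0.208 | 0.418 |
| 10/6 | 40 | 0.213 | 0.417 | 0.213 | 0.417 | 0.218 | 0.425 | 0.227 | 0.434 |
| 10/6 | 50 | 0.234 | 0.437 | 0.234 | 0.437 | 0.239 | 0.444 | 0.246 | 0.450 |
| 25/3 | 5 | 0.021 | 0.144 | 0.021 | 0.145 | 0.031 | 0.184 | 0.546 | 0.776 |
| 25/3 | 10 | 0.023 | 0.156 | 0.024 | 0.157 | 0.031 | 0.182 | 0.185 | 0.452 |
| 25/3 | 15 | 0.026 | 0.167 | 0.026 | 0.167 | 0.031 | 0.183 | 0.113 | 0.349 |
| 25/3 | 20 | 0.032 | 0.175 | 0.032 | 0.176 | 0.036 | 0.188 | 0.091 | 0.304 |
| 25/3 | 25 | 0.037 | 0.177 | 0.037 | 0.177 | 0.039 | 0.188 | 0.082 | 0.279 |
| 25/3 | 40 | 0.042 | 0.188 | 0.042 | 0.188 | 0.044 | 0.196 | 0.069 | 0.248 |
| 25/3 | 50 | 0.044 | 0.192 | 0.044 | 0.192 | 0.047 | 0.197 | 0.067 | 0.235 |
| 25/6 | 5 | 0.018 | 0.132 | 0.018 | 0.134 | 0.034 | 0.208 | 0.583 | 0.793 |
| 25/6 | 10 | 0.022 | 0.145 | 0.023 | 0.146 | 0.032 | 0.194 | 0.185 | 0.449 |
| 25/6 | 15 | 0.023 | 0.150 | 0.023 | 0.151 | 0.033 | 0.183 | 0.103 | 0.329 |
| 25/6 | 20 | 0.025 | 0.158 | 0.025 | 0.158 | 0.032 | 0.182 | 0.078 | 0.288 |
| 25/6 | 25 | 0.027 | 0.160 | 0.027 | 0.161 | 0.031 | 0.179 | 0.067 | 0.255 |
| 25/6 | 40 | 0.030 | 0.167 | 0.030 | 0.167 | 0.033 | 0.181 | 0.056 | 0.228 |
| 25/6 | 50 | 0.034 | 0.167 | 0.034 | 0.167 | 0.036 | 0.176 | 0.052 | 0.209 |

Notes: (1) Bold-faced numbers are rejection rates three standard deviations larger than the nominal values. (2) The results are based on 10,000 simulations. (3) The models are $\tilde{r}_{it} = \delta_{i0} + \sum_{j=1}^{L} \delta_{ij} \tilde{r}_{jt} + \tilde{\eta}_{it}$, where the data were generated through a block-bootstrap approach.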

Appendix C.2. Bootstrap Simulation

There are two main categories of bootstrapping in the regression context, the random X approach, which resamples the complete set of variables including the dependent variable for each observation, and the fixed X approach, which resamples residuals and explanatory variable values and forms simulated dependent variable values. That is, the fixed X approach builds simulated dependent variable values from the explanatory variables and either simulated or resampled regression residuals. The choice of using either simulated or resampled residuals is what distinguishes the major variations of the fixed X bootstrap approach. The non-parametric fixed X bootstrap approach, which we employ, uses resampled regression residuals.
Suppose we have a sample of $T$ observations of a dependent variable $r_{i,t}$ ($i = 1, \ldots, N$), a $K \times 1$ vector of factor portfolios $r_{p,t}$, and a regression model $E[r_{i,t}] = \alpha_i + \beta_i^{\prime} r_{p,t}$. Define $\hat{\alpha}_i$ and $\hat{\beta}_i$ as the OLS estimates of $\alpha_i$ and $\beta_i$ and, noting that we wish to explore the null hypothesis that $\alpha_i = 0$, define $\hat{r}_{i,t} = \hat{\beta}_i^{\prime} r_{p,t}$ and $\hat{e}_{i,t} = r_{i,t} - (\hat{\alpha}_i + \hat{\beta}_i^{\prime} r_{p,t})$. We then form $R$ resamples of $\hat{r}_{i,t}$ ($i = 1, \ldots, N$) and $r_{p,t}$, with each resample containing $T$ observations. Separately and independently, we form $R$ resamples of $\hat{e}_{i,t}$, with each resample also containing $T$ observations, and finally we form $r^{*}_{i,t} = \hat{r}^{*}_{i,t} + \hat{e}^{*}_{i,t}$ for each of the $R$ resamples, where an asterisk marks a resampled quantity. Using $r^{*}_{i,t}$ and $r^{*}_{p,t}$, we fit the model $E[r^{*}_{i,t}] = \alpha_i + \beta_i^{\prime} r^{*}_{p,t}$ on each of these resampled datasets, and retrieve the various GRS test statistics for each resampled dataset.
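One resample of the fixed X procedure just described might look as follows. This is a minimal sketch under stated assumptions, not the authors' SAS or R code: the function name `fixed_x_null_resample` is hypothetical, rows are drawn i.i.d. with replacement (no blocking), and fitted values and factors are resampled with the same row indices while residuals use an independent set of indices.

```python
import numpy as np

def fixed_x_null_resample(R, F, rng):
    """Build one bootstrap dataset that imposes alpha_i = 0.

    R: T x N test-asset returns, F: T x K factor returns.
    """
    T = R.shape[0]
    X = np.column_stack([np.ones(T), F])
    B, *_ = np.linalg.lstsq(X, R, rcond=None)   # row 0: alpha-hat, rows 1..K: beta-hat
    fitted_null = F @ B[1:]                     # r-hat = beta-hat' r_p (intercept dropped)
    resid = R - X @ B                           # e-hat = r - (alpha-hat + beta-hat' r_p)
    rows_x = rng.integers(T, size=T)            # rows for fitted values and factors
    rows_e = rng.integers(T, size=T)            # independent rows for residuals
    F_star = F[rows_x]
    R_star = fitted_null[rows_x] + resid[rows_e]
    return R_star, F_star

rng = np.random.default_rng(0)
F_demo = rng.normal(0.01, 0.02, size=(60, 3))
R_demo = 0.002 + F_demo @ np.ones((3, 5)) + rng.normal(0.0, 0.08, size=(60, 5))
R_star, F_star = fixed_x_null_resample(R_demo, F_demo, rng)
print(R_star.shape, F_star.shape)
```

Repeating the call $R$ times and computing each test statistic on every `(R_star, F_star)` pair gives the bootstrap distribution of rejection decisions under the null.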
To deal with a well-documented property of financial returns, lack of independence across time, we also employ block bootstrap resampling, which allows for data dependence. See, for instance, Politis and Romano (1994), White (2000), and Gonçalves and White (2002, 2005). It is the resampling in (random-length) blocks from the original data that produces results incorporating data dependence. Politis and Romano (1994) used blocks of data with lengths distributed according to the geometric distribution. The mean block length $b$ is data-dependent. Politis and Romano (1994) recommended a length proportional to $T^{1/3}$, where $T$ is the sample size, which is what we use.
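The random-length block scheme can be illustrated with a short index generator. This is a sketch of the stationary bootstrap of Politis and Romano (1994) as commonly implemented, with wrap-around at the end of the sample; the function names are hypothetical, and the proportionality constant `c = 1.0` in the mean block length is an assumption, since the text states only that the length is proportional to $T^{1/3}$.

```python
import numpy as np

def mean_block_length(T, c=1.0):
    """Mean block length b = c * T^(1/3); the constant c is an assumed value."""
    return c * T ** (1.0 / 3.0)

def stationary_bootstrap_indices(T, mean_block, rng):
    """Row indices for one resample with geometrically distributed block lengths."""
    p = 1.0 / mean_block                      # probability of starting a new block
    idx = np.empty(T, dtype=int)
    idx[0] = rng.integers(T)                  # first block starts at a random row
    for t in range(1, T):
        if rng.random() < p:
            idx[t] = rng.integers(T)          # start a new block at a random row
        else:
            idx[t] = (idx[t - 1] + 1) % T     # continue the block, wrapping around
    return idx

rng = np.random.default_rng(0)
T = 600
idx = stationary_bootstrap_indices(T, mean_block_length(T), rng)
print(idx[:12])
```

Applying `idx` to the rows of the return matrix keeps runs of consecutive months together, which is what preserves the time dependence within each block.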
Again, we exploit French’s portfolio returns. For the decile returns, we resample from the size-sorted decile monthly returns over 1963/7–2019/12, and for the 5 by 5 test asset case we resample from the size by book-to-market returns, over 1963/7–2019/12. Bootstrap simulations show persistent over-rejection of the null hypothesis in all these tests, though the correct GRS F test shows the smallest over-rejection. Similarly to Harvey and Liu (2021), we find little or no over-rejection when we evaluate t-tests on the intercept with bootstrapped data, rather than a joint test across intercepts of the test assets; joint tests appear much more fragile than the one-at-a-time t-tests on intercepts.

Appendix D. Software Packages

The SAS and R packages used to implement our generalized GRS test can be found at the authors’ websites: http://markkamstra.com/data.html (accessed on 31 August 2023) (SAS) and https://ruoyaoshi.github.io/ (accessed on 31 August 2023) (R). A Stata package grsftest coded by Mengnan (Cliff) Zhu can be found at https://ideas.repec.org/c/boc/bocode/s458828.html (accessed on 31 August 2023). See Zhu (2020).

Appendix E. Detailed Model Statistics

Appendix E.1. 5 Year Windows over 1963–2019

Appendix E.2. 10 Year Windows over 1963–2019

Appendix E.3. 15 Year Windows over 1963–2019

Appendix E.4. 20 Year Windows over 1963–2019

Appendix E.5. 25 Year Windows over 1963–2019

Appendix E.6. 40 Year Windows over 1963–2019

Appendix E.7. 50 Year Windows over 1963–2019

Appendix E.8. 5 Year Windows over 1926–2019

Appendix E.9. 10 Year Windows over 1926–2019

Appendix E.10. 15 Year Windows over 1926–2019

Appendix E.11. 20 Year Windows over 1926–2019

Appendix E.12. 25 Year Windows over 1926–2019

Appendix E.13. 40 Year Windows over 1926–2019

Appendix E.14. 50 Year Windows over 1926–2019

Appendix F. Summary Statistics

Appendix F.1. 5 Year Windows over 1963–2019

Appendix F.2. 10 Year Windows over 1963–2019

Appendix F.3. 15 Year Windows over 1963–2019

Appendix F.4. 20 Year Windows over 1963–2019

Appendix F.5. 25 Year Windows over 1963–2019

Appendix F.6. 40 Year Windows over 1963–2019

Appendix F.7. 50 Year Windows over 1963–2019

Appendix F.8. 5 Year Windows over 1926–2019

Appendix F.9. 10 Year Windows over 1926–2019

Appendix F.10. 15 Year Windows over 1926–2019

Appendix F.11. 20 Year Windows over 1926–2019

Appendix F.12. 25 Year Windows over 1926–2019

Appendix F.13. 40 Year Windows over 1926–2019

Appendix F.14. 50 Year Windows over 1926–2019

Notes

  1. See, for instance, Cakici et al. (2013, eq. (4)) and Mosoeu and Kodongo (2022, eq. (2)).
  2. Asymptotic versions are commonly employed or promoted. See, for instance, MacKinlay and Richardson (1991), Cochrane (2005, p. 234), Zaremba and Czapkiewicz (2017), Demaj et al. (2018), Belimam et al. (2018), Qin (2019), and Verbeek (2021, Sect. 2.3).
  3. We acknowledge that it is difficult to take seriously the assumption of normality of returns: returns are bounded below by −100% due to limited liability in financial markets for publicly traded assets, and returns are known to be heteroskedastic and dependent over time. Here we adopt the Gibbons et al. (1989) setting for comparison purposes and to develop small sample results. Knez and Ready (1997) develop some interesting approaches for factor model estimation allowing for non-normality.
  4. For related analysis on an extension to the GRS test, see Kleibergen and Zhan (2020) and Kleibergen et al. (2023).
  5. Cochrane (2005, eq. (12.6)) uses $\tilde{\Omega}$ for $\Omega$ and $\tilde{\Sigma}$ for $\Sigma$, but pre-multiplies by $\frac{T-N-L}{N}$, so that the resulting GRS statistic equals $\tilde{W}$ in this paper. The d.f. adjustment (or lack of it) in the estimators of $\Sigma$ can be easily offset by pre-multiplying by an appropriate factor, but this is not the case for $\Omega$.
  6. Perhaps another outcome of Cochrane (2005), the Stata packages pre-multiply by the ratio $\frac{T-N-L}{N}$ used by Cochrane (2005), but fail to use the corresponding $\tilde{\Sigma}$.
  7. We need to point out that the argument here is based on an approximation, as $E(\bar{r}_p^{\prime} \tilde{\Omega}^{-1} \bar{r}_p)$ is a non-linear function of $\bar{r}_p$ and $\tilde{\Omega}$. Moreover, $\bar{r}_p^{\prime} \tilde{\Omega}^{-1} \bar{r}_p$ may deviate from its mean for a particular sample. Therefore, it is entirely possible that the incorrect formula of the GRS statistic favors larger models in some cases.
  8. See Tharyan (2009) and Ibert (2014).
  9. For comparison, the generalized GRS statistic formula given in Cochrane (2005, p. 230) uses $\tilde{\Omega}$ and $\tilde{\Sigma}$, but Cochrane (2005) is careful to properly pre-multiply by the ratio $\frac{T-N-L}{N}$ so that the statistic equals $\tilde{W}$ in this paper.
  10. This is because the $F_{d_1, d_2}$ and the $\chi^2_{d_1}$ distributions are related through $d_1 F_{d_1, d_2} \xrightarrow{d} \chi^2_{d_1}$ as $d_2 \to \infty$, where $d_1$ and $d_2$ are the degrees of freedom.
  11. Several different statistics are all commonly used in empirical research. For example, the usual Wald statistic equals $T \left(1 + \bar{r}_p^{\prime} \tilde{\Omega}^{-1} \bar{r}_p\right)^{-1} \hat{\delta}_0^{\prime} \hat{\Sigma}^{-1} \hat{\delta}_0$, which deviates from $N \tilde{W}$ by a factor of $\frac{T-L-1}{T-N-L}$. Another example is the $\chi^2$ statistic formula in Cochrane (2005, p. 230), which deviates from $N \tilde{W}$ by a factor of $\frac{T}{T-N-L}$. Both factors are larger than 1, meaning that both $\chi^2$ statistics are larger than $N \tilde{W}$ for any sample size, with Cochrane’s (2005, p. 230) formula being the largest. Note that Cochrane (2005, p. 230), like Gibbons et al. (1989), only gives the formula for the $L = 1$ case, and here we refer to its generalized version for the $L \geq 1$ case.
  12. In Section 3, we only report empirical results using the Wald statistic for the asymptotic $\chi^2$ test. Since the formula in Cochrane (2005, p. 230) is even larger, it will lead to worse over-rejection.
  13. We thank Ken French for making this valuable resource freely available.
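To give a sense of magnitude for the two scaling factors in note 11, they can be evaluated at a few sample sizes. This is plain arithmetic on the factors stated there; the particular $(T, N, L)$ combinations are chosen here for illustration, with $T$ in months.

```latex
% Wald factor (T-L-1)/(T-N-L) and Cochrane-type factor T/(T-N-L)
T = 60,\; N = 10,\; L = 3: \quad \frac{T-L-1}{T-N-L} = \frac{56}{47} \approx 1.191,
  \qquad \frac{T}{T-N-L} = \frac{60}{47} \approx 1.277
T = 60,\; N = 25,\; L = 3: \quad \frac{T-L-1}{T-N-L} = \frac{56}{32} = 1.750,
  \qquad \frac{T}{T-N-L} = \frac{60}{32} = 1.875
T = 600,\; N = 25,\; L = 3: \quad \frac{T-L-1}{T-N-L} = \frac{596}{572} \approx 1.042,
  \qquad \frac{T}{T-N-L} = \frac{600}{572} \approx 1.049
```

The inflation is largest when $N$ is large relative to $T$ and shrinks toward 1 as $T$ grows, which matches the direction of the limit in note 10.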

References

  1. Affleck-Graves, John, and Bill McDonald. 1989. Nonnormalities and tests of asset pricing theories. The Journal of Finance 44: 889–908.
  2. Alhomaidi, Asem, M. Kabir Hassan, William J. Hippler, and Abdullah Mamun. 2019. The impact of religious certification on market segmentation and investor recognition. Journal of Corporate Finance 55: 28–48.
  3. Alshammari, Saad, and Shingo Goto. 2022. What factors drive Saudi stock markets? Firm characteristics that attract retail trades. International Review of Economics and Finance 80: 994–1011.
  4. Anderson, Theodore Wilbur. 2003. An Introduction to Multivariate Statistical Analysis, 3rd ed. Hoboken: Wiley-Interscience.
  5. Baek, Seungho, and John F. O. Bilson. 2015. Size and value risk in financial firms. Journal of Banking and Finance 55: 295–326.
  6. Barillas, Francisco, and Jay Shanken. 2018. Comparing asset pricing models. The Journal of Finance 73: 715–54.
  7. Bartlett, Maurice. 1951. An Inverse Matrix Adjustment Arising in Discriminant Analysis. The Annals of Mathematical Statistics 22: 107–11.
  8. Belimam, Doha, Yong Tan, and Ghizlane Lakhnati. 2018. An empirical comparison of asset-pricing models in the Shanghai a-share exchange market. Asia-Pacific Financial Markets 25: 249–65.
  9. Cakici, Nusret, Frank J. Fabozzi, and Sinan Tan. 2013. Size, value, and momentum in emerging market stock returns. Emerging Markets Review 16: 46–65.
  10. Chiah, Mardy, Daniel Chai, Angel Zhong, and Song Li. 2016. A Better Model? An empirical investigation of the Fama-French five-factor model in Australia. International Review of Finance 16: 595–638.
  11. Choi, Seo Joon, Kanghyun Kim, and Sunyoung Park. 2020. Is systemic risk systematic? Evidence from the US stock markets. International Journal of Finance and Economics 25: 642–63.
  12. Cochrane, John. 2005. Asset Pricing: Revised Edition. Princeton: Princeton University Press.
  13. Demaj, Arber, Bora Oskay, Betim Lushtaku, and Thedo Linssen. 2018. Asset Pricing Models and Anomalies: An Empirical Analysis. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3128562 (accessed on 4 April 2022).
  14. Fama, Eugene F., and Kenneth R. French. 2015. A Five-Factor Asset Pricing Model. Journal of Financial Economics 116: 1–22.
  15. Fama, Eugene F., and Kenneth R. French. 2016. Dissecting Anomalies with a Five-Factor Model. The Review of Financial Studies 29: 69–103.
  16. Fama, Eugene F., and Kenneth R. French. 2018. Choosing factors. Journal of Financial Economics 128: 234–52.
  17. Ferson, Wayne E., and Stephen R. Foerster. 1994. Finite sample properties of the generalized method of moments in tests of conditional asset pricing models. Journal of Financial Economics 36: 29–55.
  18. Gibbons, Michael R., Stephen A. Ross, and Jay Shanken. 1989. A Test of the Efficiency of a Given Portfolio. Econometrica 57: 1121–52.
  19. Gonçalves, Sílvia, and Halbert White. 2002. The Bootstrap of the Mean for Dependent Heterogenous Arrays. Econometric Theory 18: 1367–84.
  20. Gonçalves, Sílvia, and Halbert White. 2005. Bootstrap Standard Error Estimates for Linear Regressions. Journal of the American Statistical Association 100: 970–79.
  21. Hanauer, Matthias X. 2020. A Comparison of Global Factor Models. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3546295 (accessed on 19 April 2022).
  22. Harvey, Campbell R., and Yan Liu. 2021. Lucky factors. Journal of Financial Economics 141: 413–35.
  23. Hayashi, Fumio. 2000. Econometrics. Princeton: Princeton University Press.
  24. Ibert, Markus. 2014. GRSTEST2: Stata Module to Implement the Gibbons, Ross, Shanken (1989) Test. Statistical Software Components S457786. Boston: Boston College Department of Economics.
  25. Kan, Raymond, Xiaolu Wang, and Xinghua Zheng. 2024. In-sample and out-of-sample Sharpe ratios of multi-factor asset pricing models. Journal of Financial Economics 155: 103837.
  26. Kim, Jae H. 2022. GRS.test: GRS Test for Portfolio Efficiency, Its Statistical Power Analysis, and Optimal Significance Level Calculation. R Package Version 1.2. Available online: https://CRAN.R-project.org/package=GRS.test (accessed on 1 April 2022).
  27. Kleibergen, Frank, and Zhaoguo Zhan. 2020. Robust inference for consumption-based asset pricing. The Journal of Finance 75: 507–50.
  28. Kleibergen, Frank, Lingwei Kong, and Zhaoguo Zhan. 2023. Identification robust testing of risk premia in finite samples. Journal of Financial Econometrics 21: 263–97.
  29. Knez, Peter J., and Mark J. Ready. 1997. On the robustness of size and book-to-market in cross-sectional regressions. The Journal of Finance 52: 1355–82.
  30. Kroencke, Tim A. 2017. Asset pricing without garbage. The Journal of Finance 72: 47–98.
  31. Leite, André Luis, Marcelo Cabus Klotzle, Antonio Carlos Figueiredo Pinto, and Aldo Ferreira da Silva. 2018. Size, value, profitability, and investment: Evidence from emerging markets. Emerging Markets Review 36: 45–59.
  32. Lewellen, Jonathan, Stefan Nagel, and Jay Shanken. 2010. A skeptical appraisal of asset pricing tests. Journal of Financial Economics 96: 175–94.
  33. MacKinlay, A. Craig, and Matthew P. Richardson. 1991. Using generalized method of moments to test mean-variance efficiency. The Journal of Finance 46: 511–27.
  34. Merdad, Hesham Jamil, M. Kabir Hassan, and William J. Hippler, III. 2015. The Islamic risk factor in expected stock returns: An empirical study in Saudi Arabia. Pacific-Basin Finance Journal 34: 293–314.
  35. Mosoeu, Selebogo, and Odongo Kodongo. 2022. The Fama-French five-factor model and emerging market equity returns. The Quarterly Review of Economics and Finance 85: 55–76.
  36. Politis, Dimitris N., and Joseph P. Romano. 1994. The Stationary Bootstrap. Journal of the American Statistical Association 89: 1303–13.
  37. Qin, Rui. 2019. Study on Applicability of Fama-French Five-Factor Model in Chinese A-Share Market. Paper presented at the 2nd International Symposium on Social Science and Management Innovation (SSMI 2019), Xi’an, China, November 29–30; Amsterdam: Atlantis Press, pp. 491–500.
  38. Rouwenhorst, K. Geert. 1999. Local return factors and turnover in emerging stock markets. The Journal of Finance 54: 1439–64.
  39. Sha, Yezhou, and Ran Gao. 2019. Which is the best: A comparison of asset pricing factor models in Chinese mutual fund industry. Economic Modelling 83: 8–16.
  40. Sharpe, William F. 1964. Capital Asset Prices: A Theory of Market Equilibrium Under Conditions of Risk. Journal of Finance 19: 425–42.
  41. Sharpe, William F. 1966. Mutual fund performance. The Journal of Business 39: 119–38.
  42. Teräsvirta, Timo, and Ilkka Mellin. 1986. Model selection criteria and model selection tests in regression models. Scandinavian Journal of Statistics, 159–71.
  43. Tharyan, Rajesh. 2014. GRSTEST: Stata Module to Implement the Gibbons et al. (1989) Test in a Single-Factor or Multi-Factor Setting. Statistical Software Components S457069. Boston: Boston College Department of Economics.
  44. Verbeek, Marno. 2021. Panel Methods for Finance. In Panel Methods for Finance. Berlin: De Gruyter.
  45. White, H. 2000. A Reality Check for Data Snooping. Econometrica 68: 1097–26.
  46. Zaremba, Adam, and Anna Czapkiewicz. 2017. Digesting anomalies in emerging European markets: A comparison of factor pricing models. Emerging Markets Review 31: 1–15.
  47. Zhu, Mengnan. 2020. GRSFTEST: Stata Module to Perform the Gibbons, Ross, Shanken Test of Mean-Variance Efficiency of Asset Returns. Statistical Software Components S458828. Boston: Boston College Department of Economics.
Table 1. Number and proportion of cases with more rejections relative to the GRS test for five-year windows sampled during 1963–2019, across 19 sets of test assets.

| Test Assets (Nb of Subsamples = 53) | Asymptotic $\chi^2$ 1% | Asymptotic $\chi^2$ 5% | Asymptotic $\chi^2$ 10% | $\breve{W}$ 1% | $\breve{W}$ 5% | $\breve{W}$ 10% | $\hat{W}$ 1% | $\hat{W}$ 5% | $\hat{W}$ 10% |
|---|---|---|---|---|---|---|---|---|---|
| 2 × 4 × 4 MExMEBExINV | 289 | 275 | 247 | 4 | 17 | 21 | 0 | 1 | 0 |
| 2 × 4 × 4 MExMEBExOP | 262 | 246 | 220 | 4 | 14 | 18 | 0 | 0 | 0 |
| 2 × 4 × 4 MExOPxINV | 282 | 238 | 194 | 10 | 27 | 28 | 0 | 3 | 1 |
| 5 × 5 AccrualsxME | 222 | 218 | 207 | 7 | 17 | 31 | 0 | 0 | 1 |
| 5 × 5 BExME | 212 | 191 | 166 | 8 | 21 | 20 | 1 | 0 | 1 |
| 5 × 5 BetaxME | 229 | 221 | 204 | 11 | 16 | 25 | 0 | 1 | 0 |
| 5 × 5 MExOP | 238 | 227 | 188 | 9 | 27 | 42 | 0 | 0 | 0 |
| 5 × 5 MomentumxME | 210 | 166 | 124 | 14 | 29 | 27 | 0 | 0 | 0 |
| 5 × 5 NetIssuexME | 216 | 176 | 147 | 9 | 29 | 27 | 0 | 2 | 2 |
| 5 × 5 RVariancexME | 147 | 94 | 65 | 18 | 28 | 12 | 0 | 2 | 0 |
| 5 × 5 VariancexME | 137 | 81 | 45 | 26 | 27 | 12 | 0 | 3 | 1 |
| 5 × 5 BExInv | 248 | 259 | 245 | 4 | 18 | 16 | 0 | 2 | 1 |
| 5 × 5 MExInv | 214 | 189 | 171 | 14 | 17 | 29 | 0 | 0 | 0 |
| Industry | 126 | 153 | 152 | 12 | 9 | 22 | 0 | 1 | 0 |
| Book-to-Market Deciles | 48 | 70 | 67 | 5 | 22 | 21 | 0 | 1 | 1 |
| Investment Deciles | 22 | 30 | 56 | 4 | 4 | 6 | 0 | 0 | 0 |
| Momentum Deciles | 56 | 60 | 66 | 18 | 14 | 20 | 1 | 1 | 1 |
| Size Deciles | 49 | 90 | 68 | 6 | 23 | 27 | 0 | 0 | 2 |
| Operating Profitability Deciles | 48 | 70 | 65 | 4 | 24 | 15 | 0 | 1 | 0 |
| Average | 171.3 | 160.7 | 142 | 9.8 | 20.2 | 22.1 | 0.11 | 0.95 | 0.58 |
| Proportion (%) | 53.9 | 50.5 | 44.6 | 3.1 | 6.3 | 6.9 | 0.0 | 0.3 | 0.2 |

Notes: (1) The figures are the number of all decisions at the stated significance level for which the test statistic rejects the model when the correct GRS statistic $\tilde{W}$ does not reject, out of a total of 318 possible cases, with the exception of the last row in which the proportion is given. The sample periods of five years are sampled from July 1963 to December 2019, and the number of models tested is 6 for each window, which we sum over to obtain the total number of over-rejections. These windows overlap, adjusted in a rolling window so that all but 1 year of data overlaps with the next sample window. This means that there are 53 samples for the 5-year window. (2) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
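The counting rule in note (1) can be written down compactly. The sketch below is illustrative only: the function name and the p-values are made up, and it assumes each case (one model in one window) is summarized by a pair of p-values, one from the alternative statistic and one from the correct GRS statistic.

```python
def count_excess_rejections(p_alt, p_grs, level):
    """Count cases where the alternative test rejects but the GRS test does not."""
    return sum(1 for pa, pg in zip(p_alt, p_grs) if pa < level and pg >= level)

# Hypothetical p-values for three model-window cases (not taken from the paper).
p_alt = [0.005, 0.04, 0.20]
p_grs = [0.020, 0.03, 0.30]
for level in (0.01, 0.05, 0.10):
    print(level, count_excess_rejections(p_alt, p_grs, level))
```

With 53 windows and 6 models per window, applying this count to all 318 cases for one set of test assets would give one cell of the table.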
Table 2. Number and proportion of cases with different ranking outcomes from the GRS statistic for five-year windows sampled during 1963–2019, across 19 sets of test assets.

| Test Assets (Nb of Subsamples = 53) | Any Model Mis-Ranked, $\breve{W}$ | Any Model Mis-Ranked, $\hat{W}$ | Top Model Mis-Ranked, $\breve{W}$ | Top Model Mis-Ranked, $\hat{W}$ |
|---|---|---|---|---|
| 2 × 4 × 4 MExMEBExINV | 26 | 2 | 5 | 0 |
| 2 × 4 × 4 MExMEBExOP | 37 | 1 | 6 | 0 |
| 2 × 4 × 4 MExOPxINV | 31 | 1 | 12 | 0 |
| 5 × 5 AccrualsxME | 36 | 1 | 12 | 0 |
| 5 × 5 BExME | 26 | 1 | 7 | 0 |
| 5 × 5 BetaxME | 31 | 1 | 10 | 0 |
| 5 × 5 MExOP | 33 | 0 | 10 | 0 |
| 5 × 5 MomentumxME | 33 | 0 | 14 | 0 |
| 5 × 5 NetIssuexME | 27 | 4 | 9 | 2 |
| 5 × 5 RVariancexME | 32 | 0 | 14 | 0 |
| 5 × 5 VariancexME | 34 | 1 | 10 | 0 |
| 5 × 5 BExInv | 35 | 3 | 6 | 0 |
| 5 × 5 MExInv | 30 | 3 | 9 | 0 |
| Industry | 29 | 2 | 4 | 0 |
| Book-to-Market Deciles | 28 | 0 | 4 | 0 |
| Investment Deciles | 24 | 2 | 4 | 0 |
| Momentum Deciles | 30 | 2 | 5 | 0 |
| Size Deciles | 27 | 3 | 5 | 0 |
| Operating Profitability Deciles | 30 | 0 | 6 | 0 |
| Average | 30.47 | 1.42 | 8.00 | 0.11 |
| Proportion (%) | 57.5 | 2.7 | 15.1 | 0.2 |

Notes: (1) The figures are the number of all misrankings by the test statistic value across models relative to the correct GRS statistic ranking, out of a total of 53 possible cases, with the exception of the last row in which the proportion is given. The sample periods of five years are sampled over July 1963 to December 2019, and the number of models ranked is 6 for each window. These windows overlap, adjusted in a rolling window, so that all but 1 year of data overlaps with the next sample window. This means that there are 53 samples for the 5-year window. (2) For a detailed description of the factor and test asset construction see Fama and French (2015, 2016).
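The two columns of this table correspond to two comparisons of rankings within a window, which can be sketched as below. The function names and the statistic values are hypothetical, and the sketch assumes that a smaller statistic ranks a model higher; that ordering convention is an assumption made here for illustration.

```python
def ranking(stats):
    """Model indices ordered from smallest to largest statistic (best first)."""
    return sorted(range(len(stats)), key=lambda m: stats[m])

def misranking_flags(stats_alt, stats_grs):
    """Return (any model mis-ranked, top model mis-ranked) for one window."""
    r_alt, r_grs = ranking(stats_alt), ranking(stats_grs)
    return r_alt != r_grs, r_alt[0] != r_grs[0]

# Hypothetical statistic values for three models in one window.
any_flag, top_flag = misranking_flags([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
print(any_flag, top_flag)
```

Summing each flag over the 53 windows for one set of test assets would produce one row of counts.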
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kamstra, M.J.; Shi, R. Testing and Ranking of Asset Pricing Models Using the GRS Statistic. J. Risk Financial Manag. 2024, 17, 168. https://doi.org/10.3390/jrfm17040168

