### **1. Introduction**

The paradigm of point null hypothesis testing has been almost exclusively adopted in all areas of empirical research in business, including accounting, economics, finance, management, and marketing. The procedure involves forming a sharp null hypothesis (typically the value of a parameter equal to zero, to represent no effect) and using the "*p*-value less than *α*" criterion to reject or fail to reject the null hypothesis, or in the Neyman–Pearson tradition, determining whether the test statistic lies in a region defined by *α*, the test size. Although the alternative hypothesis is often unspecified, the rejection of a null hypothesis of no effect is frequently taken as evidence for the existence of a non-zero effect.

As a hybrid of Fisher's approach to significance testing and the Neyman–Pearson decision-theoretic approach, the procedure is often conducted in an automatic manner without considering the key factors of statistical research, such as effect size, statistical power, relative loss, and prior beliefs (see, for example, Kim and Choi 2019). This practice has been criticized by many authors. For example, Gigerenzer (2004) calls it the "null ritual", while McCloskey and Ziliak (1996) warn against the widespread practice of "asterisk econometrics" and "sign econometrics". Despite numerous calls for change over many years, little improvement has been made in the practice of "mindless statistics" (Gigerenzer 2004). The consequences include serious distortion of the scientific process (Wasserstein and Lazar 2016), an embarrassing number of false positives (Kim and Ji 2015; Harvey 2017; Kim et al. 2018), replication crises in many fields of science (see, for example, Open Science Collaboration 2015), and publication bias (Basu and Park 2014; Kim and Ji 2015).

With increasing availability of large or massive data sets in the business disciplines in recent years, the current paradigm has become even more problematic, and arguably deficient. This is because, in reality, any null hypothesis is violated even when it is (practically or economically) true (see, for example, De Long and Lang 1992). Rao and Lovric (2016) call this phenomenon the *zero-probability paradox*, providing a mathematical proof for a simple case. Its consequence is that the *p*-value is a decreasing function of sample size, even when the null hypothesis is violated by an economically or scientifically negligible margin (see Kim and Ji 2015). As a result, the probability of a false positive increases with sample size, as also noted by Ohlson (2018). As Spanos (2017) points out, there is nothing paradoxical about this, since it is a reflection of the consistency property of a test. As Kim and Ji (2015) and Kim et al. (2018) report from their respective meta-analytic surveys, many empirical researchers routinely adopt large or massive samples under the current paradigm, with a high chance that their scientific findings represent false positives. It is also problematic in the context of model specification testing, since any model may be judged to be mis-specified when the sample size is large enough (Spanos 2017).

In view of the above points, Rao and Lovric (2016) argue that "in the 21st century, statisticians will deal with large data sets and complex questions, it is clear that the current point-null paradigm is inadequate" and that "next generation of statisticians must construct new tools for massive data sets since the current ones are severely limited" (see also van der Laan and Rose 2010). They call for a paradigm shift in statistical hypothesis testing and suggest the Hodges and Lehmann (1954) paradigm as a possible alternative, arguing that this will substantially improve the credibility of scientific research based on statistical testing. Under the Hodges and Lehmann (1954) paradigm, the null and alternative hypotheses are formulated as *intervals*. The focus of testing is whether the parameter value belongs to an interval of no practical (or economic) significance, with its limits set by the researcher based on substantive importance. In this way, the researcher's economic reasoning or judgment can be incorporated into hypothesis testing.

In fact, tests for interval-based hypotheses have long existed and are used in biostatistics and psychology under the names of equivalence tests, non-inferiority tests, and minimum-effect tests: for comprehensive and in-depth reviews, see Wellek (2010), Murphy et al. (2014), and Lehmann and Romano (2005, Section 13.5.2). However, researchers in the business disciplines, especially those engaged in empirical or applied research, have little knowledge of these tests. The purpose of this paper is to present a brief review of these tests to researchers in business, discussing their merits and shortcomings. The tests are also presented for parameter restrictions and model specification in the linear regression context, incorporating the bootstrap method, and are illustrated with three empirical applications in economics and finance. We propose that these tests be routinely employed in business research as an alternative to point null hypothesis testing. We hope that this will contribute to a paradigm shift in statistical inference, which will restore credibility and integrity in statistical research in the business disciplines.

In the next section, we briefly discuss the current paradigm of point null hypothesis and its problems and consequences. In Section 3, we present a review of equivalence, non-inferiority, and minimum-effect tests for the simple *t*-test and regression *F*-test. Section 4 provides empirical applications, and Section 5 concludes the paper.

### **2. Current Paradigm and Its Deficiencies**

We begin by presenting the current (frequentist) paradigm of hypothesis testing, which is widely adopted in many areas of statistical research, in the context of a simple *t*-test for a point null hypothesis. This is followed by a review of its deficiencies as a criterion of statistical evidence. We also review problems and malpractices such as *p*-hacking and data-mining, and how they are related to the current paradigm of statistical inference.

### *2.1. A Simple t-Test for a Point Null Hypothesis*

Consider the case of a simple one-sample *t*-test for the population mean *θ*, where *Xi* (*i* = 1, ... , *n*) is independently generated from a normal distribution with mean *θ* and standard deviation *σ*. Applying the point null hypothesis paradigm, we test (assuming two-tailed alternative) for

$$H\_0: \theta = 0 ; H\_1: \theta \neq 0.$$

The null hypothesis most often represents the claim of "no effect". When *H*0 is true (hereafter, *under H*0), the *t*-statistic follows a *t*-distribution, while under *H*1 it follows a non-central *t*-distribution with the non-centrality parameter √*n* *θ*/*σ*. The decision to reject or fail to reject depends on the "*p*-value less than *α*" criterion, where *p*-value ≡ *Prob*(|*t*| > |*t*obs| | *H*0) and *t*obs is the observed value of the statistic. Equivalently, *H*0 is rejected when |*t*obs| exceeds *tc*,1−0.5*α*, the critical value from a central *t*-distribution at the *α* level of significance. The value of *α* conventionally adopted is 0.05, although values such as 0.01 or 0.10 are often used. When the *p*-value satisfies the criterion, the effect is said to be statistically significant at the *α* level of significance. This is what Gigerenzer (2004) calls the "null ritual", which is a hybrid of the proposal of Fisher and that of Neyman and Pearson. In practical applications, a small *p*-value is often interpreted as strong evidence against *H*0, and its strength is marked with the number of asterisks indicating significance at the 0.10, 0.05, or 0.01 level. More seriously, many researchers do not pay attention to the magnitude of the *θ* estimate, making their decisions based only on its sign and statistical significance. This practice has been branded as "asterisk econometrics" and "sign econometrics" by Ziliak and McCloskey (2008), who correctly argue that it only shows whether the effect exists or not, and says nothing about the economic significance or substantive importance of the effect (see also Kleijnen 1995).

### *2.2. Shortcomings of the p-Value Criterion*

It is well-known that the *p*-value is not a good measure of evidence for a hypothesis. For example, Berger and Sellke (1987) show that the *p*-value provides a measure of evidence against *H*0 that can differ from the actual value by an order of magnitude. Johnstone and Lindley (1995) demonstrate that a *p*-value less than 0.05 may represent evidence in favor of the null, not against it, especially when the sample size is large (see also Kim et al. 2018). This is largely because the *p*-value does not take account of the probability under *H*1, nor does it represent the probability that the null is true given the data. On this basis, the American Statistical Association expressed grave concerns about the misuse or abuse of the *p*-value criterion in empirical research, stating that this practice has led to a considerable distortion of the scientific process (Wasserstein and Lazar 2016).

Another problem of the *p*-value criterion is that the choice of its threshold *α* is arbitrary (Keuzenkamp and Magnus 1995; Lehmann and Romano 2005, p. 57). As Arrow (1960) and Leamer (1978) argue, it should be chosen in consideration of key factors such as sample size, statistical power, and relative loss from Type I and II errors. For example, the level of significance should be set in the range of 0.3 to 0.4 when the power is low (Winer 1962), while it should be set at a small value (such as 0.001) when the sample size is large (McCloskey and Ziliak 1996, p. 102). This is to balance the probabilities of Type I and II errors when the losses from the two errors are (almost) equal. Kim and Choi (2017, 2019) provide a review of a decision-theoretic approach to the optimal level of significance with applications.
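The balancing argument can be illustrated with a small numerical sketch. It is not the procedure of Kim and Choi; it assumes a one-sided *z*-test of a simple null against a simple alternative, equal losses from the two errors, and a hypothetical non-centrality `delta` = √*n* *θ*/*σ* that summarizes sample size and effect size. The function names are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def error_sum(alpha: float, delta: float) -> float:
    """Type I error plus Type II error of a one-sided z-test.

    delta is the assumed non-centrality sqrt(n) * theta / sigma under the
    alternative; equal losses from both errors are assumed.
    """
    critical_value = STD_NORMAL.inv_cdf(1.0 - alpha)
    beta = STD_NORMAL.cdf(critical_value - delta)  # P(fail to reject | H1)
    return alpha + beta


def balanced_alpha(delta: float, grid_size: int = 10_000) -> float:
    """Grid search for the alpha that minimises alpha + beta."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return min(grid, key=lambda a: error_sum(a, delta))


if __name__ == "__main__":
    # Small delta: low power. Large delta: large sample or large effect.
    for delta in (1.0, 2.0, 4.0, 6.0):
        print(f"delta = {delta:.1f}: balanced alpha = {balanced_alpha(delta):.4f}")
```

Under these assumptions the balanced level is roughly 0.31 when `delta` is 1 and roughly 0.001 when `delta` is 6, which is in line with the two recommendations quoted above.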

### *2.3. Zero-Probability Paradox*

In practice, the null hypothesis cannot hold exactly, as shown by Rao and Lovric (2016). As Leamer (1988), De Long and Lang (1992), and Startz (2014) point out, an economic hypothesis should not be formulated as a point, but as a neighborhood or an interval, since an economic effect (or parameter) cannot take a numerically exact value such as 0. The consequence is that, with observational data, the distribution under a point *H*0 is never observed or realized; instead, the *t*-statistic is always generated from the distribution under *H*1, which is a non-central *t*-distribution. This is another reason why the *p*-value criterion is deficient: the critical value *tc*,*α* is obtained from a central *t*-distribution, which is never observed in practice.

The problem is exacerbated as the sample size increases, because the non-centrality parameter of the *t*-distribution (√*n* *θ*/*σ*) also increases, meaning that the *p*-value approaches 0. This occurs even when the true value of *θ* is practically or economically no different from 0. When the sample size is large, the distribution under *H*1 lies far away from the central *t*-distribution. Hence, when *H*0 is numerically violated but holds practically, rejection of *H*0 occurs with near certainty in large samples, as long as the level of significance *α* is maintained at a conventional value such as 0.05. In practice, many empirical researchers take an economically negligible violation of *H*0 as evidence for a particular alternative hypothesis, committing what is called the "fallacy of rejection" (Spanos 2017). A natural solution in this context is to obtain the critical value from a non-central distribution under *H*1, which increases with sample size. In fact, this is the proposal of interval-based hypothesis testing, as we shall see in the next section.
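The effect of sample size can be seen in a short numerical sketch. It assumes the large-sample normal approximation to the *t*-statistic and a hypothetical, economically negligible effect of *θ* = 0.01 with *σ* = 1; the function names are illustrative.

```python
from math import erfc, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def typical_p_value(theta: float, sigma: float, n: int) -> float:
    """Two-sided p-value when the statistic equals its expected value.

    Uses the normal approximation; the expected statistic under the true
    theta is the non-centrality sqrt(n) * theta / sigma.
    """
    z = sqrt(n) * theta / sigma
    return erfc(abs(z) / sqrt(2.0))  # equals 2 * P(Z > |z|), accurate in the tail


def rejection_probability(theta: float, sigma: float, n: int,
                          alpha: float = 0.05) -> float:
    """Probability that the point null theta = 0 is rejected at level alpha."""
    delta = sqrt(n) * theta / sigma
    critical_value = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (STD_NORMAL.cdf(delta - critical_value)
            + STD_NORMAL.cdf(-delta - critical_value))


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        p = typical_p_value(0.01, 1.0, n)
        power = rejection_probability(0.01, 1.0, n)
        print(f"n = {n:>9,}: typical p-value = {p:.3g}, P(reject) = {power:.3f}")
```

With the effect held fixed, the typical *p*-value falls and the rejection probability rises towards one as *n* grows, although the effect itself remains negligible.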

### *2.4. Problems and Consequences*

The deficiencies of the *p*-value criterion discussed above have created a number of problems and malpractices, namely *p*-hacking (Harvey 2017), data mining (Black 1993), and data snooping (Lo and MacKinlay 1990). These terms generally refer to the practice of cherry-picking results in order to achieve a statistically significant outcome. Black (1993, p. 75) provides a good description of data mining:

When a researcher tries many ways to do a study, including various combinations of explanatory factors, various periods, and various models, we often say, he is "data mining." If he reports only the more successful runs, we have a hard time interpreting any statistical analysis he does. We worry that he selected, from the many models tried, only the ones that seem to support his conclusions. With enough data mining, all the results that seem significant could be just accidental.

A consequence is an embarrassing number of false positives, as Harvey (2017) puts it. As Kim and Ji (2015), Kim et al. (2018) and Kim (2019) report, the use of alternative criteria for statistical significance (such as Bayes factors, adaptive or optimal levels of significance, or posterior probabilities for null hypotheses) gives different inferential outcomes from the *p*-value criterion in a large number of published results. This may have led to the accumulation of many false stylized facts in empirical studies. For example, Black (1993) argues that most of the investment anomalies identified in finance are likely to be the result of data-mining, while Kandel and Stambaugh (1996) argue that the *p*-value as a measure of evidence often conflicts with economic significance in asset-allocation decisions. Kim and Choi (2017) report that many economically puzzling research outcomes based on unit root testing (such as the empirical invalidity of purchasing power parity) may be the result of incorrectly maintaining the conventional level of significance, despite the extremely low power of the test. In behavioral finance, it is a stylized fact that the weather affects the stock market (see, for example, Saunders 1993; Hirshleifer and Shumway 2003). However, as Kim (2017) argues, this statistical significance is the result of the test having power of practically one due to a massive sample size. In a similar context, Kamstra et al. (2003) report a statistically significant effect of winter blues on the stock market (discussed in Section 4.1 as an application), while they find statistically insignificant effects of weather variables in the same equation. This conflicting result may be the outcome of data-mining, where statistical significance is purely accidental.

Abuse and misuse of the *p*-value criterion for statistical significance have also contributed to other serious problems which undermine research integrity and credibility in science: namely, publication bias and the replication crisis. The practice of *p*-hacking and data-mining is closely related to publication bias, where statistically significant results are favored in the publication process. The meta-analytic evaluations of Kim and Ji (2015) and Kim et al. (2018) reveal that unreasonably high proportions of studies published in accounting and finance journals are statistically significant. Harvey (2017) also recognizes that the practice of *p*-hacking can contribute to publication bias. This is partly because many journal editors and referees favor statistically significant results, and often judge statistically insignificant studies with skepticism and suspicion. As a consequence, many studies with statistically insignificant results (at a conventional significance level) may not have been published, even though they are economically important and statistically sound. This practice can push many researchers to the malpractice of *p*-hacking or data-mining to gain a higher chance of publication. "Replication crisis" refers to the problem that a high proportion of published results are not reproducible by replication exercises (Peng 2015). For example, in psychology, only 36% of the replications are found to be statistically significant, compared to 97% of the original studies that reported significance (Open Science Collaboration 2015).

As discussed in this section, the current paradigm of statistical inference has a number of problems, and has contributed to a range of serious issues that undermine research integrity and credibility. On this basis, Rao and Lovric (2016) call for a new paradigm for statistical inference, especially needed in the big data era where the *p*-value fails as a measure of statistical evidence and the conventional level of significance is inappropriate. They suggest an interval-based test as a possible alternative, which will be discussed in the next section.

### **3. Tests for Minimum-Effect, Equivalence, and Non-Inferiority**

We now present the three types of interval tests, namely the equivalence, minimum-effect, and non-inferiority tests, based on the well-known one-sample *t*-test or the test for linear restrictions in the linear regression. Loosely speaking, the difference between the equivalence and minimum-effect tests comes down to the condition for which proof is being sought. If the status quo conjecture is characterized by equality, that is, the conjecture against which we wish to assess evidence is that one thing equals another, then we falsify the conjecture by the minimum-effect test. On the other hand, if the status quo conjecture is characterized by inequality, so the conjecture against which we wish to assess evidence is that things are unequal, then we falsify using an equivalence test. A non-inferiority test may be used when the hypothesis is formulated as an open interval.

### *3.1. Test for Minimum Effect*

The minimum-effect test, originally put forward by Hodges and Lehmann (1954), has the null and alternative hypotheses of the following form:

$$H\_0: \theta\_l \le \theta \le \theta\_u; H\_1: (\theta < \theta\_l) \cup (\theta\_u < \theta),\tag{1}$$

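where *θl* and *θu* denote the limits of practical or economic importance. Hodges and Lehmann (1954, p. 254) propose conducting separate one-tailed *t*-tests of the two one-sided hypotheses. That is,

$$H\_{01}: \theta \ge \theta\_l; H\_{11}: \theta < \theta\_l \quad \text{and} \quad H\_{02}: \theta \le \theta\_u; H\_{12}: \theta > \theta\_u.$$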

According to Hodges and Lehmann (1954, p. 254), we then reject *H*0 given in (1) if either of these separate tests rejects. The size of this composite *t*-test is the sum of their separate sizes, and its power depends on the power of the individual one-tailed tests. The decision can also be made by using the confidence interval: the null hypothesis of minimum-effect given in (1) cannot be rejected at the *α* level of significance if a two-sided (1 − 2*α*) confidence interval for *θ* lies entirely within the interval [*θl*, *θu*].
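A minimal sketch of this decision rule follows. It assumes that an estimate of *θ* and its standard error are already available and, because the Python standard library has no *t*-distribution, it uses the large-sample normal approximation in place of the *t*-distribution. The two one-sided nulls are taken to be *θ* ≥ *θl* and *θ* ≤ *θu*, each tested at level *α*; the function name, arguments, and example numbers are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def minimum_effect_test(estimate: float, std_error: float,
                        lower: float, upper: float, alpha: float = 0.05):
    """Two one-sided tests of H0: lower <= theta <= upper.

    Large-sample sketch: the normal distribution stands in for the t.
    Returns both one-sided p-values and the rejection decision.
    """
    # H0: theta >= lower against H1: theta < lower
    p_below = STD_NORMAL.cdf((estimate - lower) / std_error)
    # H0: theta <= upper against H1: theta > upper
    p_above = 1.0 - STD_NORMAL.cdf((estimate - upper) / std_error)
    # Reject the interval null if either one-sided test rejects
    reject = min(p_below, p_above) < alpha
    return p_below, p_above, reject


if __name__ == "__main__":
    # Precisely estimated but small effect, limits of importance +/- 0.1
    print(minimum_effect_test(0.05, 0.01, -0.1, 0.1))
    # Precisely estimated effect beyond the upper limit
    print(minimum_effect_test(0.15, 0.01, -0.1, 0.1))
```

In the first call the estimate is five standard errors away from zero, so a point-null test would reject, while the minimum-effect null is retained because the estimate lies well inside the limits of importance.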

Even though the interval test extends the simple *t*-test, the intention is the same: to detect a statistically significant and important difference. Failure to reject the null hypothesis is interpreted as a failure to detect such a difference, that is, a failure to split. We now review tests with the opposite aim: to detect a statistically significant and important similarity. These tests are equivalence tests. Failure to reject their null hypothesis is interpreted as a failure to detect such a similarity, that is, a failure to lump.

### *3.2. Test for Equivalence*

If we switch the null and alternative hypotheses, we have what is called an equivalence test (e.g., Wellek 2010). That is,

$$H\_0: (\theta \le \theta\_l) \cup (\theta\_u \le \theta); H\_1: \theta\_l < \theta < \theta\_u. \tag{2}$$

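The decision rule for the equivalence test can be developed by conducting two one-sided tests similarly to the above, a procedure referred to as TOST:

$$H\_{01}: \theta \le \theta\_l; H\_{11}: \theta > \theta\_l \quad \text{and} \quad H\_{02}: \theta \ge \theta\_u; H\_{12}: \theta < \theta\_u.$$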

Let *p*1 be the one-sided *p*-value for the test of *H*01 against *H*11, and *p*2 be the same for the test of *H*02 against *H*12. For the equivalence test, the null hypothesis of no equivalence given in (2) is rejected at the *α* level of significance if *max*(*p*1, *p*2) < *α*. Equivalently, it is rejected at the *α* level of significance if a two-sided (1 − 2*α*) confidence interval for *θ* lies entirely within the interval [*θl*, *θu*]. The power of the test depends on the power of the individual one-tailed tests.
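The TOST rule and its confidence-interval form can be sketched as follows, again assuming a given estimate and standard error and using the large-sample normal approximation in place of the *t*-distribution. Here *p*1 is taken to test *θ* ≤ *θl* against *θ* > *θl*, and *p*2 to test *θ* ≥ *θu* against *θ* < *θu*; the function name and example numbers are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tost_equivalence(estimate: float, std_error: float,
                     lower: float, upper: float, alpha: float = 0.05):
    """Two one-sided tests (TOST) of H0: theta <= lower or theta >= upper."""
    # H01: theta <= lower against H11: theta > lower
    p1 = 1.0 - STD_NORMAL.cdf((estimate - lower) / std_error)
    # H02: theta >= upper against H12: theta < upper
    p2 = STD_NORMAL.cdf((estimate - upper) / std_error)
    reject_by_p = max(p1, p2) < alpha

    # Same decision from the two-sided (1 - 2 * alpha) confidence interval
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    interval = (estimate - z * std_error, estimate + z * std_error)
    reject_by_ci = lower < interval[0] and interval[1] < upper
    return p1, p2, interval, reject_by_p, reject_by_ci


if __name__ == "__main__":
    # Same estimate, two levels of precision, limits of importance +/- 0.1
    for std_error in (0.1, 0.01):
        print(std_error, tost_equivalence(0.02, std_error, -0.1, 0.1))
```

As the standard error falls, the confidence interval shrinks while the limits stay fixed, so equivalence is established only once the estimate is precise enough.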

Note that the researcher should choose between the minimum-effect test and equivalence test by considering whether the evidence being sought is against similarity (minimum effect test) or difference (equivalence test). It is worth mentioning that, as the sample size increases, the confidence interval shrinks but the limits of economic significance do not change. For interval-based tests, this can be interpreted as the critical values increasing with sample size, relative to the test statistic, which is a feature not shared by point-null hypothesis testing. It is also worth mentioning that the minimum effect and equivalence tests give mutually exclusive results in that one always rejects and the other always does not, as long as the two tests share the same limits of economic importance.

### *3.3. Test for Non-Inferiority*

It is often the case that testing for a one-sided (open) interval may be appropriate. The test is called the non-inferiority test or superiority test, whose null and alternative hypotheses can be written as

$$H\_0: \theta \ge \theta\_l; H\_1: \theta < \theta\_l, \tag{3}$$

where *θl* denotes the smallest effect size of economic importance. The non-inferiority test examines whether the null hypothesis that an effect is at least as large as *θl* can be rejected. The actual direction of the hypothesis depends on whether a higher value of the response is desirable or not. The above test can be conducted as a usual one-tailed test.

### *3.4. Interval Tests in the Linear Regression Model*

Following Hodges and Lehmann (1954), Murphy and Myors (1999) approach the minimum-effect test using the *F*-test, which can be presented in a regression context. In this subsection, we review their proposal and extend it to a more general setting.

Consider a regression model of the form

$$Y = \gamma\_0 + \gamma\_1 X\_1 + \dots + \gamma\_K X\_K + u, \tag{4}$$

where *Y* is a dependent variable and *X*'s are independent variables. Suppose the researcher tests for a linear restriction such as *H*0 : *γ*1 = ... = *γJ* = 0, where *J* ≤ *K*. The *F*-statistic can be written as

$$F = \frac{(R\_1^2 - R\_0^2)/J}{(1 - R\_1^2)/(T - K - 1)},\tag{5}$$

where *R*<sup>2</sup><sub>*j*</sub> represents the coefficient of determination under *Hj* (*j* = 0, 1). Under *H*0, the *F*-statistic follows the *F*-distribution with *J* and *T* − *K* − 1 degrees of freedom, denoted as *F*(*J*, *T* − *K* − 1). Under *H*1, the *F*-statistic follows *F*(*J*, *T* − *K* − 1; *λ*), which denotes the non-central *F*-distribution with degrees of freedom (*J*, *T* − *K* − 1) and non-centrality parameter *λ*. Note that

$$
\lambda = T \frac{R\_{p1}^2 - R\_{p0}^2}{1 - R\_{p1}^2} \equiv T \eta, \tag{6}
$$

where *R*<sup>2</sup><sub>*pj*</sub> denotes the population or desired coefficient of determination under *Hj*, following from Peracchi (2001, Theorem 9.2). Note that *η* ≡ (*R*<sup>2</sup><sub>*p*1</sub> − *R*<sup>2</sup><sub>*p*0</sub>)/(1 − *R*<sup>2</sup><sub>*p*1</sub>) may be called the population signal-to-noise ratio, measuring the incremental contribution of (*X*1, ... , *XJ*) relative to the noise in the model. The degree of non-centrality is determined as the product of sample size and signal-to-noise ratio, with the former playing a dominant role.

Hodges and Lehmann (1954, p. 253) and Murphy and Myors (1999) propose that the above non-central distribution be used for the minimum-effect test. As an example, consider a simple regression model *Y* = *γ*0 + *γ*1*X*1 + *u* with *H*0 : *γ*1 = 0. Here, *R*<sup>2</sup><sub>*p*1</sub> measures the incremental contribution of *X*1 to *Y* (note that *R*<sup>2</sup><sub>*p*0</sub> = 0). Suppose the researcher wishes to test *H*0 : 0 ≤ *γ*1 ≤ *γu*, where *γu* represents the limit for the minimum effect. The researcher can also specify the value of *R*<sup>2</sup><sub>*p*1</sub> corresponding to the value of *γu*, which is the minimum desired value of *R*<sup>2</sup> for *X*1 to be economically significant (see Section 3.8 for the details as to how this value may be chosen, with applications in Section 4.1). As an alternative to *H*0 : 0 ≤ *γ*1 ≤ *γu*, one can formulate the null hypothesis in terms of *R*<sup>2</sup><sub>*p*1</sub>, namely *H*0 : 0 ≤ *R*<sup>2</sup><sub>*p*1</sub> ≤ *R*<sup>2</sup><sub>*max*</sub>, where *R*<sup>2</sup><sub>*max*</sub> is the maximum value of *R*<sup>2</sup><sub>*p*1</sub> for 0 ≤ *γ*1 ≤ *γu*, given (*Y*, *X*1); and also 0 < *λ* ≤ *λmax* corresponding to 0 ≤ *R*<sup>2</sup><sub>*p*1</sub> ≤ *R*<sup>2</sup><sub>*max*</sub>.

If the *F*-statistic is greater than *F*<sub>*α*,*λmax*</sub>, the *α*-level critical value from *F*(*J*, *T* − *K* − 1; *λmax*), then the null hypothesis of the minimum-effect is rejected at the *α*-level of significance. An interesting feature of the decision rule for the minimum-effect test is that its critical value and sampling distribution change with sample size. This is in stark contrast with those of the point-null hypothesis, which do not change with sample size. The latter property is the root cause of the "large-n problem" associated with the point-null hypothesis, as Rao and Lovric (2016) point out.

As an illustration, consider a regression where *K* = 1. For simplicity, we assume that *Var*(*X*1) = *Var*(*Y*) and let the sample size *T* take the values 2000 and 4000. Consider first the case of the point-null hypothesis *H*0 : *γ*1 = 0. The black curves in Figure 1 plot the density *F*(*J*, *T* − *K* − 1), which is the distribution of the *F*-statistic under *H*0 : *γ*1 = 0, for each sample size of 2000 and 4000. It is clear that the 5% critical value does not change with increasing sample size. Since the *F*-statistic is an increasing function of sample size, rejection of *H*0 : *γ*1 = 0 will eventually occur (except of course for the rare case that the true value of *γ*1 is numerically identical to zero).

Suppose the researcher tests for a minimum effect: *H*0 : 0 ≤ *γ*1 ≤ 0.1 against *H*1 : *γ*1 > 0.1. Since *R*<sup>2</sup><sub>*p*1</sub> = *γ*<sup>2</sup><sub>1</sub>*Var*(*X*1)/*Var*(*Y*) and *R*<sup>2</sup><sub>*p*0</sub> = 0, the null and alternative hypotheses can be formulated as *H*0 : *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.01 against *H*1 : *R*<sup>2</sup><sub>*p*1</sub> > 0.01. The red curves in Figure 1 plot the density *F*(*J*, *T* − *K* − 1; *λmax*) associated with *R*<sup>2</sup><sub>*p*1</sub> = 0.01 for each sample size. The gray area under the red curve represents 5%, as indicated by the critical value, which is the 95th percentile of the red curve. This critical value increases with sample size. The blue curve plots the density *F*(*J*, *T* − *K* − 1; *λ*) associated with *H*1 : *R*<sup>2</sup><sub>*p*1</sub> = 0.02, and the red shaded area represents the power of the test for *H*0 : *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.01. It shows that the power increases with sample size.

Note: The black curve plots the density *F*(*J*, *T* − *K* − 1), which is the distribution of the *F*-statistic under *H*0 : *γ*1 = 0. The gray area under it represents 5% associated with the corresponding critical value. The red curve plots the density *F*(*J*, *T* − *K* − 1; *λmax*), which is the distribution of the *F*-statistic under *H*0 : *γ*1 ≤ 0.1 or *H*0 : *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.01. The gray area under it represents 5% associated with the corresponding critical value. The blue curve plots the density *F*(*J*, *T* − *K* − 1; *λ*) for *H*1 : *R*<sup>2</sup><sub>*p*1</sub> = 0.02. The red-shaded area represents the power of the test for *H*0 : *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.01.

**Figure 1.** Test for minimum-effect: An illustration.
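The pattern in this illustration can be reproduced approximately with the standard library alone. The sketch below assumes *J* = 1 and a denominator degrees of freedom large enough that the non-central *F*-distribution is close to a non-central chi-squared distribution with one degree of freedom, whose distribution function has a closed form in terms of the normal distribution. This is an approximation rather than the exact *F*(*J*, *T* − *K* − 1; *λ*) used in the figure, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentrality(T: int, r2_p1: float, r2_p0: float = 0.0) -> float:
    """lambda = T * (R2_p1 - R2_p0) / (1 - R2_p1), as in equation (6)."""
    return T * (r2_p1 - r2_p0) / (1.0 - r2_p1)


def nc_chi2_1_cdf(x: float, lam: float) -> float:
    """CDF of (Z + sqrt(lam))**2, a non-central chi-squared with 1 d.f."""
    if x <= 0.0:
        return 0.0
    root_x, root_lam = sqrt(x), sqrt(lam)
    return STD_NORMAL.cdf(root_x - root_lam) - STD_NORMAL.cdf(-root_x - root_lam)


def critical_value(lam: float, alpha: float = 0.05) -> float:
    """Upper-alpha critical value, found by bisection on the CDF."""
    lo, hi = 0.0, (sqrt(lam) + 10.0) ** 2
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if nc_chi2_1_cdf(mid, lam) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power(T: int, r2_null: float, r2_alt: float, alpha: float = 0.05) -> float:
    """P(reject H0: R2_p1 <= r2_null) when the population R2_p1 is r2_alt."""
    crit = critical_value(noncentrality(T, r2_null), alpha)
    return 1.0 - nc_chi2_1_cdf(crit, noncentrality(T, r2_alt))


if __name__ == "__main__":
    print(f"point-null critical value: {critical_value(0.0):.2f}")
    for T in (2000, 4000):
        crit = critical_value(noncentrality(T, 0.01))
        print(f"T = {T}: minimum-effect critical value = {crit:.2f}, "
              f"power at R2 = 0.02: {power(T, 0.01, 0.02):.3f}")
```

The point-null critical value is the same for both sample sizes, whereas the minimum-effect critical value and the power both increase when *T* doubles.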

It is often the case in economics and finance that a test for linear restrictions is conducted involving a number of regression parameters. For example, under the point-null paradigm, the null hypothesis can be formulated as *H*0 : *γ*1 = *γ*2 = 0 for a regression of *Y* on *X*1 and *X*2. In the context of minimum-effect test, the null hypothesis can be written as

$$H\_0: (\gamma\_{1l} \le \gamma\_1 \le \gamma\_{1u}) \cup (\gamma\_{2l} \le \gamma\_2 \le \gamma\_{2u}),$$

where *γil* and *γiu* (*i* = 1, 2) denote the boundaries of economic significance. In this case, the null hypothesis can be formulated in terms of *R*<sup>2</sup><sub>*pj*</sub>. That is,

$$H\_0: \eta \le \eta\_{\max},$$

where *ηmax* is the maximum population signal-to-noise ratio implied by *R*<sup>2</sup><sub>*pj*</sub>. The researcher can formulate the value of *R*<sup>2</sup><sub>*p*1</sub> − *R*<sup>2</sup><sub>*p*0</sub> as the economically significant incremental contribution of (*X*1, *X*2) to *Y*, where the value of *R*<sup>2</sup><sub>*p*0</sub> can be estimated from the regression with the restriction *γ*1 = *γ*2 = 0. Let *λmax* = *Tηmax*. If the *F*-statistic is greater than *F*<sub>*α*,*λmax*</sub>, the *α*-level critical value from *F*(*J*, *T* − *K* − 1; *λmax*), the null hypothesis of the minimum-effect is rejected at the *α*-level of significance. An example in a more general setting can be found in Section 4.1.2.

### *3.5. Bootstrap Implementation*

The tests introduced so far are valid under the assumption of normality. When the assumption of normality is questionable, the one-tailed tests, confidence intervals and the distribution *F*(*J*, *T* − *K* − 1; *λ*) can be implemented using the bootstrap (Efron and Tibshirani 1994). Since there are extensive references available for bootstrapping the *p*-value and confidence intervals for a one-sample *t*-test, the details are not given here.

For the minimum-effect test in the linear regression model, the researcher may want to obtain the bootstrap counterpart of the red curve in Figure 1 when the underlying normality is questionable. Consider a simple case of *H*0 : 0 ≤ *γ*1 ≤ 0.1 against *H*1 : *γ*1 > 0.1. Since *γ*1 = 0.1 is associated with the maximum value of *R*<sup>2</sup><sub>*p*1</sub> of 0.01, we consider the regression model under the restriction *γ*1 = 0.1. That is,

$$Y = \hat{\gamma}\_0 + 0.1X\_1 + e,$$

where *γ̂*0 is the estimator for *γ*0 under the restriction *γ*1 = 0.1 and *e* represents the associated residuals. Generate the artificial data *Y*<sup>∗</sup> given *X*1 as

$$Y^\* = \hat{\gamma}\_0 + 0.1X\_1 + e^\*,$$

where *e*<sup>∗</sup> is a random resample of *e* with replacement. Calculate the *F*-statistic from {*Y*<sup>∗</sup>, *X*1}, denoted as *F*<sup>∗</sup>. Repeat the above process sufficiently many times, say *B*, to obtain {*F*<sup>∗</sup>(*i*)} for *i* = 1, ... , *B*, which represents the bootstrap distribution for *F*(*J*, *T* − *K* − 1; *λ*).
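The steps above can be sketched for the simple regression case as follows. The sketch uses only the standard library, takes the *F*-statistic to be the usual statistic for a zero slope, and is applied to simulated data that are purely hypothetical; the function names, the default of *B* = 500, and the simple percentile rule are illustrative choices.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def slope_f_stat(y, x):
    """F-statistic for H0: slope = 0 in a simple regression of y on x."""
    n = len(y)
    x_bar, y_bar = _mean(x), _mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)
    return r2 / ((1.0 - r2) / (n - 2))


def bootstrap_f_distribution(y, x, gamma1_restricted=0.1, B=500, seed=0):
    """Sorted bootstrap draws F* generated under gamma_1 = gamma1_restricted."""
    rng = random.Random(seed)
    # Intercept estimate and residuals under the restriction on the slope
    gamma0_hat = _mean([yi - gamma1_restricted * xi for xi, yi in zip(x, y)])
    residuals = [yi - gamma0_hat - gamma1_restricted * xi for xi, yi in zip(x, y)]
    f_stars = []
    for _ in range(B):
        e_star = rng.choices(residuals, k=len(residuals))  # with replacement
        y_star = [gamma0_hat + gamma1_restricted * xi + ei
                  for xi, ei in zip(x, e_star)]
        f_stars.append(slope_f_stat(y_star, x))
    return sorted(f_stars)


def percentile(sorted_values, q):
    """Simple empirical percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    data_rng = random.Random(1)  # hypothetical data for illustration only
    x = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
    y = [1.0 + 0.12 * xi + data_rng.gauss(0.0, 1.0) for xi in x]
    f_stars = bootstrap_f_distribution(y, x)
    critical = percentile(f_stars, 0.95)
    print(f"observed F = {slope_f_stat(y, x):.2f}, "
          f"bootstrap 5% critical value = {critical:.2f}")
```

The minimum-effect null is rejected when the observed *F*-statistic exceeds the bootstrap critical value, which plays the role of the 95th percentile of the red curve in Figure 1.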

When a number of parameters are involved with the linear restrictions being tested, the bootstrap can be conducted at the parameter values which maximize the value *λ*. As an example, consider the minimum-effect test

$$H\_0: (\gamma\_{1l} \le \gamma\_1 \le \gamma\_{1u}) \cup (\gamma\_{2l} \le \gamma\_2 \le \gamma\_{2u}).$$

Let *γ̂*1 and *γ̂*2 denote the values under the above *H*0 which jointly imply the largest economic impact on *Y*. The bootstrap is conducted with the restrictions *γ*1 = *γ̂*1 and *γ*2 = *γ̂*2.

### *3.6. Model Equivalence Test*

Lavergne (2014) proposes a general framework based on the Kullback-Leibler information to assess the approximate validity of multivariate restrictions in parametric models, which is labeled as model equivalence testing. Consider a random sample *Xt* (*t* = 1, ... , *T*) whose probability density function is denoted as *f*(*X*|*θ*0), where *θ*0 ∈ Θ, the parameter space. Let *g*(*θ*0) = 0 denote multivariate restrictions on *θ*0, with *r* being the number of restrictions. As a measure of closeness to the true distribution, Lavergne (2014) adopts the Kullback-Leibler information criterion, which is defined as

$$KLIC = E\_{\theta\_0} \left[ \log \frac{f(X|\theta\_0)}{f(X|\theta\_0^c)} \right],$$

where *E*<sub>*θ*0</sub> denotes the expectation when *θ*0 is the parameter value and *θ*<sup>*c*</sup><sub>0</sub> is the value of *θ* which maximizes *E*<sub>*θ*0</sub> log *f*(*X*|*θ*) subject to *g*(*θ*) = 0. Noting that *KLIC* ≥ 0 and that it equals 0 when the restriction *g*(*θ*0) = 0 holds exactly, Lavergne (2014) considers the null and alternative hypotheses of the form

$$H\_0: 2\text{KLIC} \ge \delta^2/T; H\_1: 2\text{KLIC} < \delta^2/T,\tag{7}$$

where *δ*<sup>2</sup> ≡ *T*Δ<sup>2</sup>, with Δ<sup>2</sup> being the tolerance of substantive importance. Rejection of *H*0 implies that the restriction *g*(*θ*0) = 0 is close to being valid.

According to Lavergne (2014), the above model equivalence test can be conducted using the log-likelihood ratio (LR) test, which can be written as

$$LR = 2\left[L(\hat{\theta}) - L(\hat{\theta}^c)\right],\tag{8}$$

where *θ̂* denotes the unrestricted (quasi) maximum likelihood estimator for *θ* and *θ̂*<sup>*c*</sup> the restricted (quasi) maximum likelihood estimator. The LR statistic follows a non-central chi-squared distribution with *r* degrees of freedom and non-centrality parameter *δ*<sup>2</sup>, denoted as *χ*<sup>2</sup><sub>*r*,*δ*<sup>2</sup></sub>. The null hypothesis is rejected in favor of model equivalence if the LR statistic is less than *χ*<sup>2</sup><sub>*r*,*δ*<sup>2</sup></sub>(*α*), which is the 100*α*th percentile of *χ*<sup>2</sup><sub>*r*,*δ*<sup>2</sup></sub>.

Note that the vanishing tolerance *δ*<sup>2</sup>/*T* is based on a theoretical consideration, as Lavergne (2014, p. 416) points out. In practical applications, a fixed tolerance Δ<sup>2</sup> is chosen so that *δ*<sup>2</sup> = *T*Δ<sup>2</sup>. This means that the degree of non-centrality of *χ*<sup>2</sup><sub>*r*,*δ*<sup>2</sup></sub> increases with sample size, and so does the critical value of the test. This feature differs from point-null hypothesis testing, where the critical value is obtained from a central distribution regardless of sample size. Lavergne (2014) has shown that, in the regression context, 2*KLIC* measures the loss in explanatory power that comes from imposing the constraint, relative to the error variance. Hence, if the researcher sets Δ<sup>2</sup> = 0.1, the models under *H*0 and *H*1 are considered to be equivalent if the loss of explanatory power due to imposing the restriction is no more than 10%. Lavergne (2014) provides further asymptotic theories of the test, along with empirical applications.
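To make the dependence of the critical value on sample size concrete, the sketch below computes the *α*-quantile of the non-central chi-squared distribution with *δ*<sup>2</sup> = *T*Δ<sup>2</sup> in plain Python. It is an illustrative implementation under stated assumptions: the distribution function is evaluated as a Poisson mixture of central chi-squared distributions and inverted by bisection, and all function names are hypothetical. A statistical library would normally be used instead.

```python
import math

def chi2_cdf(x, k):
    """CDF of a central chi-squared with k degrees of freedom, via the power
    series of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def ncx2_cdf(x, k, nc):
    """CDF of a non-central chi-squared: Poisson(nc/2) mixture of central CDFs."""
    if nc == 0:
        return chi2_cdf(x, k)
    h = nc / 2.0
    j_max = int(h + 10.0 * math.sqrt(h) + 30)  # truncation point of the mixture
    total = 0.0
    for j in range(j_max + 1):
        weight = math.exp(-h + j * math.log(h) - math.lgamma(j + 1))
        total += weight * chi2_cdf(x, k + 2 * j)
    return total

def ncx2_quantile(p, k, nc):
    """p-quantile of the non-central chi-squared, found by bisection."""
    lo = 0.0
    hi = k + nc + 20.0 * math.sqrt(2.0 * (k + 2.0 * nc)) + 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ncx2_cdf(mid, k, nc) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equivalence_critical_value(T, r, Delta2, alpha=0.05):
    """Critical value of the model equivalence test with delta^2 = T * Delta2."""
    return ncx2_quantile(alpha, r, T * Delta2)
```

Model equivalence is concluded when the LR statistic falls below `equivalence_critical_value(T, r, Delta2)`. With *T* = 630, *r* = 25 and Δ<sup>2</sup> = 0.1, the value returned should be close to the 61.12 quoted in Section 4.2.2, and it grows as *T* increases.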

### *3.7. Equivalence Test for Model Validation*

Model validation or specification tests are often performed based on the paradigm of point null hypothesis testing, for which the null hypothesis is that the model is valid, and the alternative hypothesis is that the model is not valid. Such tests inherit the problems associated with conventional statistical testing. As Box (1976) points out, all models are wrong: they are approximations to the true data-generation process. Consequently, a test based on a sharp null hypothesis is not suitable. It is possible that, in small samples, the tests commit Type II errors due to low power, whereas in large samples all models are rejected due to extreme power (see, for example, Spanos 2017).

As a consequence, Robinson and Froese (2004) recommended the use of equivalence tests for model validation, arguing that using traditional point-null hypothesis testing, as commonly done, enabled the rejection of good models when the data were too many and the failure to reject poor models when the data were too few. Furthermore, equivalence tests permit the expression of a 'region of equivalence', within which model predictions could be close enough to reality to be useful, without necessarily being exactly identical (see, e.g., Kleijnen 1995; Robinson 2019). The principle was further extended by Robinson et al. (2005), who produced an equivalence-based variant of a regression-style test originally proposed by Cohen and Cyert (1961). We now summarize Robinson et al.'s (2005) approach.

Assume that we have computer simulation results *xi*, *i* = 1, ..., *n*, that are intended to represent process observations *yi*. For example, *y* could be the heights of a sample of trees selected from a forest, and *x* the predicted heights for the same trees, computed using the tree diameter and some mathematical function that we wish to validate: *ŷ* = *x* = *f*(*d*; *β*). Centre the predictions: *x*<sup>∗</sup><sub>*i*</sub> = *xi* − *x̄*. Fit the linear regression model *yi* = *β*0 + *β*1*x*<sup>∗</sup><sub>*i*</sub> + *εi*, with *εi* ∼ *N*(0, *σ*<sup>2</sup>). Then, perform a TOST on the null hypothesis that *β*0 = 0 as a test of model *bias* and a TOST on the null hypothesis that *β*1 = 1 as a test of the model *fidelity*, where fidelity is taken to mean both the spread of the predictions compared to the observations and the order of the predictions compared to the observations. The estimate of the slope will reflect how well the predictions match the spread of the observations (close to 1 is good), and the standard error of the slope will reflect how well the quantiles of the predictions match the quantiles of the observations (small is good). In this way, several interpretable characteristics of model performance can be distilled from the omnibus test. Robinson (2019) provides a more detailed explanation with examples, and Robinson (2016) provides an R (R Core Team 2017) package that runs such tests.<sup>1</sup>

### *3.8. Choosing the Limits of Economic Significance*

The choice of the limits of economic significance is the most critical step for interval-based tests. Detailed discussions in the contexts of psychology and medical research appear in Murphy and Myors (1999), Walker and Nowacki (2011) and Lakens et al. (2018), among others. These limits affect the outcomes of the test, and also provide scientific credibility to the research outcome. The limits should be determined by the researcher, in consideration of economic theories and meaningful effect size. In so doing, economic reasoning or theory can be incorporated into statistical decision-making.

As Murphy and Myors (1999, p. 237) point out, the choice of limits requires "value judgment". The choice can also be "context-dependent", since it may depend on the type of dependent variable involved; and can also depend on the likelihood or seriousness of Type I and II errors. It would be desirable to have a set convention or a consensus of expert opinions in the related field as to the extent of "negligible effects" that could be economically ignored. One may also use meta-analytic evidence from past studies.

The researcher can be guided by estimation-based measures to further justify their choice. For example, one may choose the limits so that they imply the smallest effect size guided by the value of Cohen's *d* (Cohen 1977), which is a measure of effect size (the mean difference divided by the standard deviation of the data). In the regression context, the limits may be determined so that the implied economic impact provides a certain value of the (incremental) signal-to-noise ratio *η* given in (6) (which is also called Cohen's *f*<sup>2</sup>) or a desired coefficient of determination *R*<sup>2</sup><sub>*pj*</sub>. For example, if *Y* is stock return and *X* is a proposed factor, the interval can be formulated so that *X* can explain at least 5% of the total variation of stock return (*R*<sup>2</sup><sub>*p*1</sub> = 0.05 and *R*<sup>2</sup><sub>*p*0</sub> = 0). This is based on the judgment that an economically meaningful factor should explain at least 5% of the stock return variation, in the absence of other factors. Again, this choice requires value judgment that can be context-dependent. For example, the choice may differ across markets depending on market conditions such as the trading cost, regulatory framework, and development of market structure, among others. The researcher may consider a number of different values or possible candidates of this value, and make a decision considering the inferential outcomes and their economic significance. Ideally, however, the choice of the limits should be made before the researcher observes the data.

The proposed interval can be indicative of the decision when the point estimate is available. However, the point estimate is subject to sampling variability and it is necessary to conduct the test to make a more informed decision under sampling variability. Proposing such an interval may be equivalent to providing a prior distribution for the Bayesian inference. It is well known that the outcome of the Bayesian inference in large part depends on the choice of prior. But if the choice is made based on concrete economic reasoning and evidence, the Bayesian inference can provide an informed decision. Similarly, if the interval of economic significance is proposed with concrete economic rationale, then it can help the researcher make a correct decision.

<sup>1</sup> There are two other R packages for equivalence and non-inferiority tests. One is EQUIVNONINF (Wellek and Ziegler 2017) which accompanies the book by Wellek (2010), and the other is PowerTOST (Labes et al. 2018), which contains functions to calculate power and sample size for various study designs used for bio-equivalence studies.

Furthermore, it is important for the researcher to include the key components of the test in reporting, such as the interval of equivalence. Doing so serves two purposes: first, it enables the reader to apply different intervals for different applications, and second, it provides a check against unscrupulous researchers choosing intervals that suit their narrative.

### **4. Empirical Applications**

In this section, we provide empirical applications of the interval-based tests discussed in Section 3 to economics and finance. We present two cases where large sample size is used; and one case of a small sample.

### *4.1. A SAD Stock Market Cycle*

In empirical finance, a large number of market anomalies have been identified, where it is claimed that a stock market is systematically influenced by factors unrelated to the market fundamentals. The evidence is at odds with the efficient market hypothesis, which is a cornerstone of modern finance theories. Central to this are the findings that investors' mood systematically and negatively affects stock return. For example, it is hypothesized that less sunlight or more cloudiness negatively affects investors' mood, which in turn exerts a negative impact on stock market return. The seminal papers in this area of literature include Saunders (1993), Hirshleifer and Shumway (2003), and Kamstra et al. (2003). However, as Kim (2017) reports, the studies in this area typically show negligible effects with high statistical significance, accompanied by large sample size and negligible *R*<sup>2</sup> values.

Kamstra et al. (2003) study the effect of depression linked with seasonal affective disorder (SAD) on stock return. They claim that, through the link between SAD and depression, and the link between depression and risk aversion, seasonal variation in length of day can translate into seasonal variation in equity return. They consider the regression model of the following form:

$$R\_{t} = \gamma\_{0} + \sum\_{i=1}^{2} \gamma\_{i} R\_{t-i} + \gamma\_{3} M\_{t} + \gamma\_{4} T\_{t} + \gamma\_{5} SAD\_{t} + \gamma\_{6} F\_{t} + \gamma\_{7} C\_{t} + \gamma\_{8} P\_{t} + \gamma\_{9} G\_{t} + \varepsilon\_{t} \tag{9}$$

where *Rt* denotes the stock return in percentage on day *t*; *M* a dummy variable for Monday; *T* a dummy for the last trading day or the first five trading days of the tax year; *F* a dummy for fall; *C* cloud cover; *P* precipitation; and *G* temperature. *SADt* is a measure of seasonal depression, which takes the value of *Ht* − 12, where *Ht* represents the time from sunset to sunrise, if the day *t* is in the fall or winter; 0 otherwise.

Kamstra et al. (2003; p. 326) argue that lower returns should commence with autumn because depressed investors shun risk and rebalance their portfolios in favor of safer assets (i.e., *γ*6 < 0). This is followed by abnormally higher returns when days begin to lengthen and SAD-affected investors begin resuming their risky holdings (i.e., *γ*5 > 0). They use the daily index return data from markets around the world: U.S. (S&P 500, NYSE, NASDAQ, AMEX), Sweden, U.K., Germany, Canada, New Zealand, Japan, Australia, and South Africa. They report, for nearly all markets, that the parameter estimate of *γ*5 is positive and statistically significant at a conventional level of significance; and that of *γ*6 is negative and statistically significant. These results are the basis of their evidence for the existence of the SAD effects around the world. However, the results are based on point-null hypothesis testing at a conventional level of significance under large sample sizes, a practice about which Rao and Lovric (2016), among others, raise concerns. In this section, we evaluate the regression results of Kamstra et al. (2003) using the interval-based tests.

### 4.1.1. Evaluating the Results of Kamstra et al.

We first conduct the interval tests using the regression results reported in Kamstra et al. (2003). Table 1 reports the sample size (*T*) and *R*<sup>2</sup> values of the regression (9), reproduced from Kamstra et al. (2003; Tables 2 and 4A–C). From these values, we calculate the *F*-statistic for the null hypothesis that all slope coefficients are jointly zero (*H*0 : *γ*1 = ··· = *γ*9 = 0), as reported in Table 1. The *CR* column reports the 5% critical values from the central *F* distributions, which are around 1.88 regardless of sample size. The null hypothesis of the *F*-test is clearly rejected for all markets at a conventional significance level, which indicates that the slope coefficients of regression (9) are jointly statistically significant. However, this is at odds with the negligible *R*<sup>2</sup> values reported in Table 1, which indicate little predictive power for all markets.

Suppose that, for a regression model for stock return to be economically significant, it should explain at least 5% of the return variation. That is, we test for *H*0 : 0 ≤ *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.05 against *H*1 : *R*<sup>2</sup><sub>*p*1</sub> > 0.05. The column labeled *CR*2 reports the 5% critical values associated with *F*(*J*, *T* − *K* − 1; *λ*<sub>max</sub>), where the value of *λ*<sub>max</sub> is associated with *R*<sup>2</sup><sub>*p*1</sub> = 0.05 (and *R*<sup>2</sup><sub>*p*0</sub> = 0). According to these critical values, the null hypothesis of economically negligible effect cannot be rejected for any market index except US4. The critical values listed in column *CR*1 are those associated with *H*0 : 0 ≤ *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.01, which delivers rejection in four markets only. If we test for *H*0 : 0 ≤ *R*<sup>2</sup><sub>*p*1</sub> ≤ 0.1, the critical values in the column labeled *CR*3 indicate that the predictive power of the estimated models is economically negligible for all markets.
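The *F*-statistic for joint significance can be recovered from a reported *R*<sup>2</sup> and sample size alone. The short Python sketch below assumes the textbook identity *F* = (*R*<sup>2</sup>/*K*) / ((1 − *R*<sup>2</sup>)/(*T* − *K* − 1)) with *K* slope coefficients; the numerical inputs are hypothetical and are not taken from Table 1.

```python
def f_from_r2(r2, T, K):
    """F-statistic for H0: all K slope coefficients are zero, computed from
    the regression R^2 and the sample size T."""
    return (r2 / K) / ((1.0 - r2) / (T - K - 1))

# Hypothetical inputs: a large sample with a very small R^2.
T, K, r2 = 5000, 9, 0.02
F = f_from_r2(r2, T, K)

# Central 5% critical value of about 1.88, as quoted in the text for K = 9.
rejects_point_null = F > 1.88
print(round(F, 2), rejects_point_null)
```

Even with an *R*<sup>2</sup> of only 2%, the statistic comfortably exceeds the central critical value, which is the large-sample behavior that the interval-based critical values are designed to correct.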

**Table 1.** Testing for the SAD effect.


US1: United States, S&P500, from 1928-01-04 to 2000-12-29; US2: United States, NYSE, from 1962-07-05 to 2000-12-29; US3: United States, NASDAQ, from 1972-12-18 to 2000-12-29; SWE: Sweden from 1982-09-15 to 2001-12-18; UK: Britain from 1984-01-04 to 2001-12-06; GER: Germany from 1965-01-05 to 2001-12-12; CAN: Canada from 1969-01-03 to 2001-12-18; NZ: New Zealand from 1991-07-02 to 2001-12-18; JAP: Japan from 1950-04-05 to 2001-12-06; AUS: Australia from 1980-01-03 to 2001-12-18; SA: South Africa from 1973-01-03 to 2001-12-06; *T*: sample size, calculated using R package "bizdays" (Freitas 2018) from the sample ranges reported in Kamstra et al. (2003; Table 2); *R*<sup>2</sup>: *R*<sup>2</sup> values reported in Kamstra et al. (2003; Table 4A–C); *F*: *F*-statistic for the joint significance of regression slope coefficients; *CR*: 5% critical values from a central *F* distribution for *H*0 : *R*<sup>2</sup> = 0; *CR*1: 5% critical values for *H*0 : 0 ≤ *R*<sup>2</sup> ≤ 0.01; *CR*2: 5% critical values for *H*0 : 0 ≤ *R*<sup>2</sup> ≤ 0.05; *CR*3: 5% critical values for *H*0 : 0 ≤ *R*<sup>2</sup> ≤ 0.10.

Economic significance of the magnitude of regression coefficients reported in Kamstra et al. (2003) is also questionable. For example, for the U.S. market with S&P500 index (US1), *γ̂*6 = −0.058 and its 90% confidence interval is [−0.10, −0.01]. The point estimate means that the stock return is on average lower by 0.058% during the autumn period. Suppose, for a factor to have an economically meaningful impact on stock return, its marginal effect should be at least 0.5% (either positive or negative) to justify transaction cost. Then, one can formulate the null hypothesis of economically negligible effect as *H*0 : −0.5 ≤ *γ*6 ≤ 0.5. The 90% confidence interval is clearly within this bound, so we do not reject *H*0 at the 5% level of significance. The same inferential outcomes apply to all the other regression coefficients of (9) reported in Kamstra et al. (2003). Note that, depending on the attitude of the researcher, one can formulate the null hypothesis as *H*0 : (*γ*6 < −0.5) ∪ (*γ*6 > 0.5), but it is also clearly rejected at the 5% level in favor of a negligible effect. Although Kamstra et al. (2003) justify their effect size using the annualized return, this annualized return does not take account of the underlying volatility of stock return or trading costs involved.
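The confidence-interval reading of these two formulations can be written as a simple rule. The Python sketch below is illustrative only: the function name and the three labels are assumptions, and the rule relies on the standard correspondence between a 90% confidence interval and two one-sided tests at the 5% level.

```python
def interval_decision(ci_low, ci_high, limit_low, limit_high):
    """Classify a confidence interval against the limits of negligible effect."""
    if limit_low <= ci_low and ci_high <= limit_high:
        # Interval lies entirely inside the limits: the effect is judged negligible.
        return "negligible"
    if ci_high < limit_low or ci_low > limit_high:
        # Interval lies entirely outside the limits: the effect is judged material.
        return "non-negligible"
    # Interval straddles a limit: the data do not settle the question.
    return "inconclusive"

# 90% confidence interval for the fall dummy (US1) against limits of +/-0.5.
print(interval_decision(-0.10, -0.01, -0.5, 0.5))
```

For the interval [−0.10, −0.01] and limits of ±0.5 the rule returns `negligible`, matching the conclusion drawn in the text.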

### 4.1.2. Replicating the Results of Kamstra et al.

We now replicate the model (9) using the value-weighted daily returns from the NYSE composite index (CRSP). The SAD variable and other dummy variables are generated following Kamstra et al. (2003), using the R programming language (R Core Team 2017). The data for weather variables (*C*, *P*, and *G*) are collected from the National Center for Environmental Information.<sup>2</sup> Our data for the regression range from January 1965 to April 1996 (7886 observations), due to the limited availability of the weather data (*C*) for New York. We have the following estimated values for the key coefficients: *γ̂*5 = 0.032 with *t*-statistic of 2.29; *γ̂*6 = −0.055 with *t*-statistic of −2.17; and *R*<sup>2</sup> = 0.05. These values are fairly close to those reported in Table 4A of Kamstra et al. (2003).

We first pay attention to the point null hypothesis *H*0 : *γ*5 = *γ*6 = 0 for joint significance of the SAD effects. The *F*-statistic is 3.18 with the *p*-value of 0.04, rejecting *H*0 at the 5% significance level. This is despite the observation that the incremental contribution of these two variables is negligible, measured by *R*<sup>2</sup><sub>1</sub> − *R*<sup>2</sup><sub>0</sub> = 0.0008 with *R*<sup>2</sup><sub>1</sub> = 0.0501 and *R*<sup>2</sup><sub>0</sub> = 0.0493. Next, we consider an interval hypothesis of minimum-effect. Suppose that the incremental contribution of these variables should be at least 0.01 to be economically significant. That is,

$$H\_0: (R\_{p1}^2 - R\_{p0}^2) \le 0.01.$$

Assuming *R*<sup>2</sup><sub>*p*0</sub> = 0.05, *λ*<sub>max</sub> = 83.87 and the corresponding 5% critical value is 58.97, obtained from *F*(*J*, *T* − *K* − 1; *λ*<sub>max</sub>). With this critical value being much larger than the *F*-statistic of 3.18, the above interval null hypothesis of minimum-effect cannot be rejected at the 5% level, providing evidence that the SAD cycle is economically negligible in the U.S. stock market.
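The value of *λ*<sub>max</sub> quoted above can be checked by hand. The line below is a sketch resting on two assumptions made here for illustration: that *λ* equals *T* times the incremental signal-to-noise ratio *η* referred to in (6), and that the effective sample size is *T* = 7884 (the 7886 observations less the two lost to the lagged returns). With *R*<sup>2</sup><sub>*p*0</sub> = 0.05 and an incremental contribution of 0.01, so that *R*<sup>2</sup><sub>*p*1</sub> = 0.06,

$$\lambda\_{max} = T\,\frac{R\_{p1}^2 - R\_{p0}^2}{1 - R\_{p1}^2} = 7884 \times \frac{0.06 - 0.05}{1 - 0.06} \approx 83.87.$$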

### *4.2. Empirical Validity of an Asset-Pricing Model*

An asset-pricing model explains the variation of asset return as a function of a range of risk factors. The most fundamental is the capital asset pricing model (CAPM) which stipulates that an asset (excess) return is a linear function of market (excess) return. The slope coefficient (often called beta) measures the sensitivity of an asset return to the market risk. While the CAPM is theoretically motivated, the market risk alone cannot fully explain the variation of asset return. In response to this, several multi-factor models have been proposed, which augment the CAPM with a number of empirically motivated risk factors such as the size premium or value premium (see, for example, Fama and French 1993). The most recently proposed multi-factor model is the five-factor model of Fama and French (2015), which can be written as

$$R\_{it} - R\_{ft} = a\_i + b\_i(R\_{Mt} - R\_{ft}) + s\_i SMB\_t + h\_i HML\_t + r\_i RMW\_t + c\_i CMA\_t + e\_{it} \tag{10}$$

where *Rit* is the return on an asset or portfolio *i* at time *t* (*i* = 1, ..., *N*; *t* = 1, ..., *T*), *Rft* is the risk-free rate, *RMt* is the return on a (value-weighted) market portfolio at time *t*, *SMBt* is the return on a diversified portfolio of small stocks minus the return on a diversified portfolio of big stocks, *HMLt* is the spread in returns between diversified portfolios of high book-to-market stocks and low book-to-market stocks, *RMWt* is the spread in returns between diversified portfolios of stocks with robust and weak profitability, and *CMAt* is the spread in returns between diversified portfolios of low and high investment firms. The precursors to this 5-factor model include the 3-factor model of Fama and French (1993), which includes (*RMt* − *Rft*), *SMB*, and *HML*; and the 4-factor model of Carhart (1997), which adds a momentum factor (*MOM*) to the 3-factor model. If these factors fully or adequately capture the variation of asset return, then the intercept terms *ai* (which may be interpreted as the risk-adjusted return) should be zero or sufficiently close to it. On this basis, the model's empirical validity is evaluated by testing for *H*0 : *a*1 = ... = *aN* = 0, which is a point-null hypothesis.

<sup>2</sup> https://www.ncdc.noaa.gov/data-access.

### 4.2.1. GRS Test: Minimum-Effect

The *F*-test for *H*0 is widely called the GRS test, proposed by Gibbons et al. (1989). Let *a* = (*a*1, ..., *aN*) be the vector of *N* intercept terms, and Σ be the *N* × *N* covariance matrix of error terms. The model (10) is estimated using ordinary least squares: *â* denotes the estimator for *a* and Σ̂ the estimator for Σ. The *F*-test statistic is written as

$$F = \frac{T(T - N - K)}{N(T - K - 1)} \frac{\hat{a}' \hat{\Sigma}^{-1} \hat{a}}{1 + \hat{\mu}' \hat{\Omega}^{-1} \hat{\mu}} , \tag{11}$$

where *T* is the sample size, *K* = 5 is the number of risk factors, Ω̂ is the estimator for the *K* × *K* covariance matrix of risk factors, and *μ̂* is the *K* × 1 mean vector. Under the assumption that the error terms follow a multivariate normal distribution, the statistic follows the *F*(*N*, *T* − *N* − *K*; *λ*) distribution, with the non-centrality parameter

$$
\lambda = \left(\frac{T}{1+\hat{\theta}^2}\right) a' \Sigma^{-1} a = \left(\frac{T}{1+\hat{\theta}^2}\right) (\theta^{\*2} - \theta^2),
\tag{12}
$$

where *θ̂* is the *ex-post* maximum Sharpe ratio of the *K*-factor portfolio, *θ* is the *ex-ante* maximum Sharpe ratio of the *K*-factor portfolio, and *θ*<sup>∗</sup> is the slope of the *ex-ante* efficient frontier based on all assets. Gibbons et al. (1989) call *θ*/*θ*<sup>∗</sup> the proportion of the potential efficiency. Note that, under *H*0, this ratio is equal to one and *λ* = 0.

However, perfect efficiency cannot exist in practice. It is unrealistic that all of *a* values are jointly and exactly zero. On this point, it is sensible to consider an interval-based hypothesis testing. For example, consider *H*0 : 0.75 < *θ*/*θ*∗ ≤ 1 against *H*1 : *θ*/*θ*∗ < 0.75. This is on the basis of judgment that the factors with the proportion of potential efficiency of 0.75 or higher provide practically efficient asset-pricing.

The data are available from French's data library at a monthly frequency from 1963 to 2015 (*T* = 630).<sup>3</sup> We use 25 portfolio returns (*N* = 25) sorted by size and book-to-market ratio, extensively analyzed by Fama and French (1993, 2015). Table 2 reports the test results. The GRS test for *H*0 : *a*1 = ... = *aN* = 0 is clearly rejected for all models considered, with the *p*-value (not reported) practically 0 in all cases. The critical values of this test (from the central *F* distributions) are listed in the column labeled *CR*. These results suggest that none of the asset pricing models are able to fully capture asset return variations. This is at odds with the high values of *R*<sup>2</sup> and small values of |*a*|, especially for the multi-factor models. For the 4-factor and 5-factor models, the estimated ratio of potential efficiency is much higher than for the other models, close to 0.7.

<sup>3</sup> http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/index.html.


**Table 2.** GRS test for asset-pricing models.

CAPM: the model with single factor (*RMt* − *Rft*); 3-factor: CAPM plus *SMB* and *HML* (Fama and French 1993); 4-factor: 3-factor plus *MOM* (Carhart 1997); 5-factor: 3-factor plus *RMW* and *CMA* (Fama and French 2015); GRS: GRS test statistic for *H*0 : *a*1 = ... = *aN* = 0; *R*<sup>2</sup>: average *R*<sup>2</sup> values over *N* = 25 equations; |*a*|: average intercept estimates over *N* = 25 equations; *CR*: 5% critical value from *F*(*N*, *T* − *N* − *K*); *CR*1: 5% critical value from *F*(*N*, *T* − *N* − *K*, *λ*<sub>max</sub>); ratio: sample estimate of *θ*/*θ*<sup>∗</sup>.

Table 2 also reports the critical values (*CR*2) for *H*0 : 0.75 < *θ*/*θ*<sup>∗</sup> ≤ 1, which are calculated from the *F*(*N*, *T* − *N* − *K*, *λ*<sub>max</sub>) distribution with the value of *λ*<sub>max</sub> implied by *θ*/*θ*<sup>∗</sup> = 0.75. It is found that, for the 4-factor and 5-factor models, *H*0 : 0.75 < *θ*/*θ*<sup>∗</sup> ≤ 1 cannot be rejected at the 5% level of significance. This suggests that these multi-factor models have captured the variation of asset returns adequately, with economically negligible deviation from perfect efficiency. For the CAPM and 3-factor models, the interval-based *H*0 is rejected at the 5% level, which is consistent with the estimated values of potential efficiency being less than 0.5 in both cases. It is worth noting that the critical values *CR* for the point-null hypothesis (based on the central *F*-distribution) are nearly identical for all cases, regardless of estimation results such as *R*<sup>2</sup> and |*a*|. However, those for the interval-based tests differ, depending on the model estimation results.
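The value of *λ*<sub>max</sub> implied by a given proportion of potential efficiency follows directly from (12). The Python sketch below is a hedged illustration: it assumes that the same Sharpe ratio value is plugged in for both the ex-post and ex-ante quantities in (12), and the input *θ* = 0.3 is hypothetical rather than an estimate from Table 2.

```python
def grs_lambda_max(T, theta, ratio):
    """Non-centrality parameter of (12) when theta / theta_star equals `ratio`.

    Assumption of this sketch: the Sharpe ratio in the prefactor of (12) is
    set equal to `theta`, the maximum Sharpe ratio of the factor portfolio.
    """
    theta_star_sq = (theta / ratio) ** 2  # squared frontier slope implied by the ratio
    return T / (1.0 + theta ** 2) * (theta_star_sq - theta ** 2)

# Hypothetical monthly Sharpe ratio of 0.3 with T = 630 and a ratio of 0.75.
print(grs_lambda_max(630, 0.3, 0.75))
```

Because *λ*<sub>max</sub> depends on the Sharpe ratio of each factor set, the resulting critical values differ across models, unlike those from the central *F* distribution.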

### 4.2.2. LR Test: Model Equivalence

We now test for the validity of the asset-pricing models using the model equivalence test discussed in Section 3.6. We calculate the LR statistic given in (8) for *H*0 : *a*1 = ... = *aN* = 0, which is written as

$$LR = T(\log \{ \det[\hat{\Sigma}(H\_0)] \} - \log \{ \det[\hat{\Sigma}(H\_1)] \}),$$

where Σ̂(*Hi*) denotes the maximum likelihood estimator for Σ under *Hi*. For the model equivalence test given in (7), the above LR statistic follows the *χ*<sup>2</sup><sub>*N*,*δ*<sup>2</sup></sub> distribution with *δ*<sup>2</sup> = *T*Δ<sup>2</sup>. Using the same data set as in Section 4.2.1, the LR statistic is 105.67, 88.05, 75.82, and 69.26 for the CAPM, 3-factor model, 4-factor model, and 5-factor model, respectively. If we set Δ<sup>2</sup> to 0.1, the 5% critical value is 61.12, indicating that the restriction *a*1 = ... = *aN* = 0 is not approximately valid for any of the models. If we set Δ<sup>2</sup> to 0.15, the 5% critical value is 87.19, indicating that the restriction is approximately valid only for the 4-factor and 5-factor models. If we set Δ<sup>2</sup> to 0.20, the 5% critical value is 114.00, indicating that the restriction is approximately valid for all models. It appears that the results are sensitive to the choice of Δ<sup>2</sup> values. However, at a reasonable value of Δ<sup>2</sup> = 0.15, the results are consistent with the minimum-effect test based on the GRS test conducted above.
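The decision rule behind these statements is that equivalence is concluded when the LR statistic falls below the critical value. The short Python sketch below applies that rule to the LR statistics and critical values reported in the text; the function and variable names are illustrative.

```python
# LR statistics and 5% critical values as reported in the text.
lr_stats = {"CAPM": 105.67, "3-factor": 88.05, "4-factor": 75.82, "5-factor": 69.26}
critical_values = {0.10: 61.12, 0.15: 87.19, 0.20: 114.00}

def equivalent_models(stats, critical_value):
    """Models for which the null of non-equivalence in (7) is rejected,
    i.e. whose LR statistic is below the critical value."""
    return [name for name, lr in stats.items() if lr < critical_value]

for delta2, cv in critical_values.items():
    print(delta2, equivalent_models(lr_stats, cv))
```

The loop lists no model for Δ<sup>2</sup> = 0.10, the 4-factor and 5-factor models for Δ<sup>2</sup> = 0.15, and all four models for Δ<sup>2</sup> = 0.20.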

### *4.3. Testing for Persistence of a Time Series*

The presence of a unit root in economic and financial time series has strong implications to many economic theories and their empirical validity (see Choi 2015). For example, a unit root in the real exchange rate is evidence that the purchasing power parity does not hold (Lothian and Taylor 1996); and a unit root in the real GNP supports the view that a shock to the economy has a permanent effect, which is not consistent with the traditional (or Keynesian) view of business cycle (Campbell and Mankiw 1987). To test for the hypothesis, the unit root test proposed by Dickey and Fuller (1979) has been widely used, while a large number of its extensions and improvement have been proposed. The augmented Dickey–Fuller (ADF) test for a time series *Y* is based on the regression of the form

$$
\Delta Y\_t = \delta\_0 + \delta\_1 t + \theta Y\_{t-1} + \sum\_{j=1}^{m-1} \rho\_j \Delta Y\_{t-j} + u\_t, \tag{13}
$$

where Δ*Yt* = *Yt* − *Y*<sub>*t*−1</sub>; *m* is the autoregressive (AR) order of *Y*; and *ut* is an *i*.*i*.*d*. error term with zero mean and fixed variance. Note that *θ* ≡ *τ* − 1 where *τ* is the sum of all AR(*m*) coefficients of *Y* in levels, measuring the degree of persistence. The test for a unit root is based on the point-null hypothesis of *H*0 : *θ* = 0 against *H*1 : *θ* < 0. Under *H*0, the *t*-test statistic asymptotically follows the Dickey–Fuller distribution, from which the critical values of the test are obtained. Under *H*1, the *t*-test statistic asymptotically follows the standard normal distribution.

The problems of the unit root test are well documented (see, for example, Choi 2015). The most well-known is its low power (at a conventional significance level), which means that there is a high chance of committing Type II error (failure to reject a false null hypothesis). On this point, Kim and Choi (2017) propose the unit root test at the optimal level of significance, which is obtained by minimizing the expected loss from hypothesis testing. They find that the optimal level is in the range of 0.3 to 0.4 for many economic time series, arguing that the exclusive use of the 0.05 level has led to an accumulation of false stylized facts. The other problem of the test is the discontinuity of the sampling distributions of the test statistic under *H*0 and *H*1. This makes the decision highly sensitive to the value specified under *H*0.

More importantly, as discussed in Section 2.3, it is unrealistic to assume that an economic time series such as the real GNP or real exchange rate has an autoregressive root exactly equal to one. An economist may wish to test whether a time series shows a degree of persistence practically different from that of a unit root time series. The test can be conducted in the context of non-inferiority test discussed in the previous section. To do this, we need to find the value of *τ* or *θ* under which a time series shows a practically different degree of persistence from a unit root time series. According to DeJong et al. (1992), a plausible value of *τ* under *H*1 : *θ* < 0 is 0.85, 0.95, 0.99 for annual, quarterly and monthly data respectively, which translate to the *θ* values of −0.15, −0.05, and −0.01. On this basis, we test for the persistence of a time series using the following interval hypotheses:

$$H\_0: \theta \le \theta\_1; H\_1: \theta > \theta\_1,$$

where *θ*1 ∈ {−0.15, −0.05, −0.01} depending on the data frequency. The time series is practically trend-stationary under this *H*0. This test is a standard one-sample *t*-test whose statistic asymptotically follows the standard normal distribution. However, we note that the least-squares estimator for *τ* or *θ* is biased in small samples, which may adversely affect the small sample properties of the test. As an alternative to the non-inferiority test, we also use the bias-corrected bootstrap confidence interval for *θ* for improved statistical inference, similar to those of Kilian (1998a, 1998b) and Kim (2004).
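The one-sided *t*-test described above reduces to a single comparison. The Python sketch below is illustrative: the function name is hypothetical, the estimates and standard errors in the usage lines are made-up numbers rather than results from Table 3, and the asymptotic 5% critical value of 1.645 is taken from the standard normal distribution.

```python
def noninferiority_t(theta_hat, se, theta1, z_crit=1.645):
    """One-sided t-test of H0: theta <= theta1 against H1: theta > theta1.

    Returns the t-statistic and whether H0 is rejected at the level implied
    by z_crit (1.645 corresponds to 5% under the standard normal).
    """
    t_stat = (theta_hat - theta1) / se
    return t_stat, t_stat > z_crit

# Hypothetical annual series, so theta1 = -0.15.
print(noninferiority_t(-0.05, 0.04, -0.15))  # estimate far above theta1
print(noninferiority_t(-0.12, 0.05, -0.15))  # estimate close to theta1
```

Rejection indicates persistence practically indistinguishable from that of a unit root process, whereas failure to reject leaves the series classified as practically trend-stationary.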

For a set of time series (*<sup>Y</sup>*1, ... .*YT*), we first estimate the parameters of model (13) using the bias-corrected estimators. Let (ˆ *δ*0, ˆ *δ*1, ˆ *θ*, *ρ*ˆ1, ... , *ρ*<sup>ˆ</sup>*m*−<sup>1</sup>) be the bias-corrected estimators; and let {*et*} denote the corresponding residual. Generate the artificial data set as

$$Y\_t^\* = \hat{\delta}\_0 + \hat{\delta}\_1 t + \hat{\beta}\_1 Y\_{t-1}^\* + \dots + \hat{\beta}\_m Y\_{t-m}^\* + e\_t^\*,$$

using $(Y\_1, \ldots, Y\_m)$ as the starting values, where $e\_t^\*$ is a random draw with replacement from the residuals $e\_{m+1}, \ldots, e\_T$ and $(\hat{\beta}\_1, \ldots, \hat{\beta}\_m)$ are the AR coefficients in level associated with $(\hat{\theta}, \hat{\rho}\_1, \ldots, \hat{\rho}\_{m-1})$. Using the artificial data $Y\_1^\*, \ldots, Y\_T^\*$, estimate the *AR*(*m*) coefficients, again with bias correction, to obtain $(\hat{\delta}\_0^\*, \hat{\delta}\_1^\*, \hat{\beta}\_1^\*, \ldots, \hat{\beta}\_m^\*)$. For bias correction, we use the asymptotic formula of Shaman and Stine (1988) with stationarity correction, following Kilian (1998b) and Kim (2004). We obtain $\hat{\theta}^\* = \hat{\tau}^\* - 1$, where $\hat{\tau}^\* = \sum\_{j=1}^{m} \hat{\beta}\_j^\*$. Repeat this process *B* times to obtain the bootstrap distribution $\hat{\theta}^{\*(1)}, \ldots, \hat{\theta}^{\*(B)}$, which can be used as an approximation to the sampling distribution of $\hat{\theta}$. If the confidence interval for *θ* obtained from this bootstrap distribution covers *θ*1, then this is evidence that the time series shows a degree of persistence practically no different from that of a trend-stationary time series.
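The resampling steps above can be illustrated with the following minimal sketch. It is a simplification under stated assumptions, not the procedure used for Table 3: it uses plain least squares in place of the Shaman and Stine (1988) bias correction, omits the stationarity correction, and forms a simple percentile interval. The function names `fit_ar_trend` and `bootstrap_theta`, the choice of `B`, and the simulated series are hypothetical.

```python
import numpy as np


def fit_ar_trend(y, m):
    """Least-squares fit of Y_t = d0 + d1*t + b1*Y_{t-1} + ... + bm*Y_{t-m} + e_t.

    Assumption: plain OLS, with no bias correction.
    """
    T = len(y)
    t = np.arange(m + 1, T + 1)                       # time index of Y_{m+1}, ..., Y_T
    lags = np.column_stack([y[m - j:T - j] for j in range(1, m + 1)])
    X = np.column_stack([np.ones(T - m), t, lags])
    coef, *_ = np.linalg.lstsq(X, y[m:], rcond=None)
    resid = y[m:] - X @ coef
    return coef, resid


def bootstrap_theta(y, m, B=999, level=0.95, seed=0):
    """Residual bootstrap for theta = (b1 + ... + bm) - 1 with a percentile interval."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    T = len(y)
    coef, resid = fit_ar_trend(y, m)
    d0, d1, beta = coef[0], coef[1], coef[2:]
    theta_hat = beta.sum() - 1.0
    theta_star = np.empty(B)
    for b in range(B):
        # Draw residuals with replacement and rebuild the series recursively,
        # using the first m observations as starting values.
        e_star = rng.choice(resid, size=T - m, replace=True)
        y_star = np.empty(T)
        y_star[:m] = y[:m]
        for i in range(m, T):
            y_lags = y_star[i - m:i][::-1]            # Y*_{t-1}, ..., Y*_{t-m}
            y_star[i] = d0 + d1 * (i + 1) + beta @ y_lags + e_star[i - m]
        coef_star, _ = fit_ar_trend(y_star, m)
        theta_star[b] = coef_star[2:].sum() - 1.0
    lo, hi = np.quantile(theta_star, [(1 - level) / 2, (1 + level) / 2])
    return theta_hat, (lo, hi), theta_star


# Hypothetical trend-stationary AR(1) series, simulated for illustration only.
sim_rng = np.random.default_rng(1)
T = 200
y = np.zeros(T)
for i in range(1, T):
    y[i] = 0.5 + 0.02 * (i + 1) + 0.5 * y[i - 1] + sim_rng.normal()

theta_hat, (lo, hi), draws = bootstrap_theta(y, m=1, B=199)
covers_theta1 = lo <= -0.15 <= hi                     # does the interval cover theta_1?
```

Replacing `fit_ar_trend` with a bias-corrected estimator, in both the initial fit and each bootstrap replication, would bring the sketch closer to the procedure described in the text.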

Table 3 reports the results from the extended Nelson and Plosser (1982) data for a set of annual U.S. macroeconomic time series, setting *θ*1 = −0.15. Firstly, the ADF test (a point-null hypothesis test) yields *p*-values larger than 0.05 for most of the time series, providing evidence that many macroeconomic time series have a unit root. In contrast, the *t*-test (non-inferiority test) results for *H*0 : *θ* ≤ −0.15 against *H*1 : *θ* > −0.15 show that we clearly cannot reject this *H*0 at the 5% level of significance (asymptotic critical value 1.645) for the real GNP, real per capita GNP, industrial production, employment, and unemployment rate, providing evidence that these time series are practically trend-stationary. As for the bootstrap inference, the 95% bias-corrected bootstrap confidence interval for *θ* covers −0.15 for the real GNP, real per capita GNP, industrial production, employment, unemployment rate, real wage, and interest rate, indicating that these time series show a degree of persistence practically no different from that of a trend-stationary time series. The two alternative methods agree in their inferential outcomes, except for real wage and interest rate.


**Table 3.** Test for persistence: Extended Nelson–Plosser Data.

R.GNP: Real GNP; N.GNP: Nominal GNP; P.GNP: Real per capita GNP; IP: Industrial Production; Emp: Employment; Uemp: Unemployment Rate; Def: GNP deflator; CPI: Consumer Price Index; Wages: Wages; Rwages: Real Wages; MS: Money Stock; Vel: Velocity; Rate: Interest rate; S&P: Common Stock Price. *T*: sample size; *p*-value: *p*-value of the ADF test for *H*0 : *θ* = 0; $\hat{\theta}$: bias-corrected estimator for *θ*; *t*-stat: *t*-statistic for *H*0 : *θ* ≤ −0.15 against *H*1 : *θ* > −0.15 based on equation (13), with a 5% critical value of 1.645; (*CI*1, *CI*2): lower and upper bounds of the 95% bootstrap bias-corrected confidence interval for *θ*. The AR orders used are the same as those of Nelson and Plosser (1982).

The results for the test of persistence based on the non-inferiority test are largely consistent with those of Kim and Choi (2017), who re-evaluate the ADF test results at the optimal level of significance and report evidence that the real GNP, real per capita GNP, employment, and money stock do not have a unit root. These results are also largely consistent with the Bayesian evidence of Schotman and van Dijk (1991).
