Article

Hypothesis Tests for Bernoulli Experiments: Ordering the Sample Space by Bayes Factors and Using Adaptive Significance Levels for Decisions

by Carlos A. de B. Pereira 1,*, Eduardo Y. Nakano 2, Victor Fossaluza 1, Luís Gustavo Esteves 1, Mark A. Gannon 1 and Adriano Polpo 3
1 Institute of Mathematics and Statistics, University of São Paulo, São Paulo 05508-090, Brazil
2 Department of Statistics, University of Brasília, Brasília 70910-900, Brazil
3 Department of Statistics, Federal University of São Carlos, São Carlos 13565-905, Brazil
* Author to whom correspondence should be addressed.
Entropy 2017, 19(12), 696; https://doi.org/10.3390/e19120696
Submission received: 31 August 2017 / Revised: 18 December 2017 / Accepted: 18 December 2017 / Published: 20 December 2017
(This article belongs to the Special Issue Maximum Entropy and Bayesian Methods)

Abstract

The main objective of this paper is to find the relation between the adaptive significance level presented here and the sample size. We statisticians know of the inconsistency, or paradox, in the current classical tests of significance that are based on p-value statistics that are compared to the canonical significance levels (10%, 5%, and 1%): “Raise the sample to reject the null hypothesis” is the recommendation of some ill-advised scientists! This paper will show that it is possible to eliminate this problem of significance tests. We present here the beginning of a larger research project. The intention is to extend its use to more complex applications such as survival analysis, reliability tests, and other areas. The main tools used here are the Bayes factor and the extended Neyman–Pearson Lemma.

1. Introduction

Recently, the use of p-values in tests of significance has been criticized. The question posed in [1] and discussed in [2,3,4] concerns the misuse of canonical values of significance level (0.10, 0.05, 0.01, and 0.005). More recently, a publication by the American Statistical Association [5] makes recommendations for scientists to be concerned with choosing the appropriate level of significance. Pericchi and Pereira [6] consider the calculation of adaptive levels of significance in an apparently successful solution for the correction of significance level choices. This suggestion eliminates the risk of a breach of the likelihood principle. However, that article deals only with simple null hypotheses, although the alternative may be composite. Another constraint is the dimensionality of the parameter space; the article was only about one-dimensional spaces. More recent is the article by 72 prominent scientists [7], as described on the website of Nature Human Behavior [8]. In a genuinely Bayesian context, the authors of [9] introduced the index e (e-value, e for evidence) as an alternative to the classical p-value, which we write with a lower-case “p”. A correction to make the null hypothesis invariant under transformations was presented in [10], and a more theoretical review can be seen in [11,12]. The e-value was the basis of the solution of an astrophysical problem described in [13]. The relationship between p-values and e-values is discussed in [14]. However, while the e-value works for hypotheses of any dimensionality without needing assignment of “point mass” probabilities to hypotheses of lower dimensionality than the parameter space, setting its significance level is not an easy task. This has made us look for a way to obtain a significance index that allows us to better understand how to obtain the optimal (in the sense we explain later) significance level of a problem of any finite dimensionality. This work is based on four previous works [15,16,17,18].
It has taken a long time to see the possibility of using them in combination and with reasonable adjustments: the Bayes factor takes the place of the likelihood ratio and the average value of the likelihood function replaces its maximum value. The mean of the likelihood function under the null hypothesis will be the density used in the calculation of the new index, the P-value, which we represent with a capital “P” to differentiate it from the classical p-value. The basis of all our work is the extended Neyman–Pearson Lemma in its Bayesian form (see [19], sections “Optimal Tests” (Theorem 1) and “Bayes Test Procedures” (pp. 451–452)).
We present here a new hypothesis testing procedure that can eliminate some of the major problems associated with currently used hypothesis tests. For example, the new tests do not tend to reject all hypotheses in the many-data limit like Neyman–Pearson tests do, nor do they tend to fail to reject all hypotheses in the same limit, like Jeffreys’s Bayesian (Bayes factor) hypothesis tests do.

2. Blending Bayesian and Classical Concepts

2.1. Statistical Model

As usual, let $x$ and $\theta$ be random vectors (possibly scalars), with $x \in \mathcal{X} \subset \mathbb{R}^s$, $\mathcal{X}$ being the sample space, and $\theta \in \Theta \subset \mathbb{R}^k$, $\Theta$ being the parameter space, where $s$ and $k$ are positive integers. To state the relation between the two random vectors, the statistician considers the following: a family of probability density functions indexed by the conditioning parameter $\theta$, $\{f(x \mid \theta);\ \theta \in \Theta\}$; a prior probability density function $g(\theta)$ on the entire parameter space $\Theta$; and the posterior density function $g(\theta \mid x)$. In order to be appropriate, the family of likelihood functions indexed by $x$, $\{L(\theta \mid x) = f(x \mid \theta);\ x \in \mathcal{X}\}$, must be measurable in the prior σ-algebra.
With the statistical model defined, a partition of the parameter space is defined by the consideration of a null hypothesis that is to be compared to its alternative:
$$H: \theta \in \Theta_H \quad\text{and}\quad A: \theta \in \Theta_A, \quad\text{where}\quad \Theta_H \cup \Theta_A = \Theta \quad\text{and}\quad \Theta_H \cap \Theta_A = \emptyset. \tag{1}$$
In the case of composite hypotheses with the partition elements having the same dimensionality, the model would then be complete. Such cases would not involve partitions for which there are components of zero Lebesgue measure. In the case of precise or “sharp” hypotheses, that is, the partition components having different dimensionalities, other elements must be added:
  • positive probabilities of the hypotheses, $\pi(H) > 0$ and $\pi(A) = 1 - \pi(H) > 0$; and
  • a density on the subset that has the smaller dimension. The choice of this density should be coherent with the original prior density over the global parameter space $\Theta$.
Consider the common case for which the null hypothesis is the one defined by a subset of lower dimensionality. In this case, we use a surface integral to normalize the values of the prior density in the null set so that the sum or integral of these values is equal to unity. Figure 1 illustrates how this procedure is carried out. Recall that a prior density can be seen as a preference system in the parameter space. That preference system must be kept even within the null hypothesis; coherence in access to prior distributions is crucial. Further details on this procedure can be found in [16,17,18]. Later, Dawid and Lauritzen [20] considered multiple ways of obtaining compatible priors under alternative models (hypotheses). The “conditioning” approach described by Dawid and Lauritzen is equivalent to the technique presented here. Dickey [21] used a similar approach previously, but in a more parameterization-dependent way.

2.2. Significance Index

By “significance index”, we mean a real function over the sample space that is used as an evidence measure for decision-making with respect to accepting or rejecting the null hypothesis, H. We begin this section by stating a generalization of the Neyman–Pearson Lemma, as presented by DeGroot [19]. Cox [22,23] also considers the classical p-value as an evidence measure, and Evans [24] considers evidence measures in general, outlines the relative belief theory developed in the references of that paper, and suggests that the associated evidence measure could have advantages over other measures of evidence and be the basis of a complete approach to estimation and hypothesis-assessment problems. The classical p-value is the most widely used significance index across diverse fields of study, including almost all scientific areas. In the present work, we present a replacement for the classical p-value that has a number of advantages, which will be described here and in future work. The conceptual and operational similarity between classical hypothesis tests as currently used and the new tests could potentially help researchers accept and use the new tests.
Let $f_H(x)$ and $f_A(x)$ be probability density functions over the sample space $\mathcal{X}$. The decision problem is to choose one of these densities as being the true generator of the observed data. Consider now a binary function $\delta(x)$ used to define the decision procedure. Define a partition of the sample space with $\mathcal{X}_H \cup \mathcal{X}_A = \mathcal{X}$ and $\mathcal{X}_H \cap \mathcal{X}_A = \emptyset$, where $\mathcal{X}_H$ is the non-rejection region for H. The test function is
$$\delta(x) = \begin{cases} 0, & \text{if } x \in \mathcal{X}_H \\ 1, & \text{if } x \in \mathcal{X}_A. \end{cases} \tag{2}$$
To choose between a hypothesis and its alternative, one should first choose two positive real numbers, say $A$ and $B$, with $A > B$, $A = B$ and $A < B$ meaning, respectively, preference for the null hypothesis, indifference, and preference for the alternative. The decision rule is then to reject the null hypothesis, H, whenever $\delta(x) = 1$, and not to reject otherwise. The following theorem, a generalized version of the Neyman–Pearson Lemma presented in the textbook by DeGroot [19], provides a test that is optimal in the sense of minimizing a linear combination of the probabilities of the two types of errors: Type I, which is the rejection of a true hypothesis, and Type II, the non-rejection of a false hypothesis. These probabilities are
$$\alpha(\delta) = \Pr\{\text{rejecting } H \mid H \text{ is true}\} = \Pr\{\delta(x) = 1 \mid H\} \tag{3}$$
and
$$\beta(\delta) = \Pr\{\text{not rejecting } H \mid H \text{ is false}\} = \Pr\{\delta(x) = 0 \mid A\}. \tag{4}$$
Generalized Neyman–Pearson Lemma:
Let $\delta^*$ be a test that rejects H in favor of A if $A f_H(x) < B f_A(x)$, does not reject H if $A f_H(x) > B f_A(x)$, and is indifferent if $A f_H(x) = B f_A(x)$. Then, for any other test $\delta$,
$$A\,\alpha(\delta) + B\,\beta(\delta) \ge A\,\alpha(\delta^*) + B\,\beta(\delta^*). \tag{5}$$
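The reasoning behind the lemma can be sketched in a few lines. The derivation below assumes a discrete sample space (for continuous data, sums become integrals) and writes $\mathcal{X}_A(\delta)$ for the rejection region of a test $\delta$, a notation introduced only for this sketch.

```latex
\begin{aligned}
A\,\alpha(\delta) + B\,\beta(\delta)
  &= A \sum_{x \in \mathcal{X}_A(\delta)} f_H(x)
     + B \sum_{x \notin \mathcal{X}_A(\delta)} f_A(x) \\
  &= A \sum_{x \in \mathcal{X}_A(\delta)} f_H(x)
     + B \Bigl( 1 - \sum_{x \in \mathcal{X}_A(\delta)} f_A(x) \Bigr) \\
  &= B + \sum_{x \in \mathcal{X}_A(\delta)} \bigl[ A f_H(x) - B f_A(x) \bigr].
\end{aligned}
```

The last sum is smallest when the rejection region collects exactly the points whose term is negative, that is, the points with $A f_H(x) < B f_A(x)$. Points with a zero term can be placed on either side without changing the total, which is why the lemma is indifferent in the case of equality.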
In 1957, both Lindley [25] and Bartlett [26] recognized that fixing a significance level was a major cause of problems with hypothesis tests. In 1966, Cornfield [27] advocated hypothesis tests that minimize a linear combination of error probabilities like Equation (5) rather than fixing a canonical $\alpha$ and minimizing $\beta$, as in the Neyman–Pearson approach [28].
To see that Bayesian hypothesis tests minimize a linear combination of error probabilities of the form $A\,\alpha(\delta) + B\,\beta(\delta)$, consider a loss function that is zero if the decision is correct and $w_A$ ($w_H$) if the decision favors A (H) when H (A) is the true state of nature. In addition, if $\pi$ is the prior probability of H and $\delta$ the test function, the risk function is
$$r(\delta) = w_A\,\pi\,\alpha(\delta) + w_H\,(1 - \pi)\,\beta(\delta). \tag{6}$$
Consequently, identifying $\pi w_A$ and $(1 - \pi) w_H$ as $A$ and $B$, respectively, and recalling that risk functions are to be minimized, we see that Bayesian tests should minimize a linear combination of the form $A\,\alpha(\delta) + B\,\beta(\delta)$. Both the classical and the Bayesian applications of the theorem are stated in terms of the comparison of the ratio $f_H / f_A$ to the constant $K$, given by
$$K = \frac{B}{A} = \frac{(1 - \pi)\,w_H}{\pi\,w_A}. \tag{7}$$
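As a small illustration of Equation (7), the sketch below computes the cutoff $K$ from the prior probability of H and the two loss weights, and applies the resulting rule to a Bayes factor. The function names `cutoff` and `reject_null` are hypothetical helpers, not part of the original work, and rejecting on equality is one of the two choices the lemma allows.

```python
def cutoff(prior_h: float, w_h: float, w_a: float) -> float:
    """Return K = (1 - pi) * w_H / (pi * w_A) from Equation (7).

    prior_h is pi, the prior probability of H; w_a is the loss for wrongly
    favoring A (Type I) and w_h the loss for wrongly favoring H (Type II).
    """
    if not 0.0 < prior_h < 1.0:
        raise ValueError("prior_h must lie strictly between 0 and 1")
    if w_h <= 0.0 or w_a <= 0.0:
        raise ValueError("loss weights must be positive")
    return (1.0 - prior_h) * w_h / (prior_h * w_a)


def reject_null(bayes_factor: float, k: float) -> bool:
    """Reject H when f_H / f_A <= K (equality may be resolved either way)."""
    return bayes_factor <= k


# Equal prior probabilities and equal losses give K = 1.
k_equal = cutoff(0.5, 1.0, 1.0)
# Doubling the Type-I loss halves K, so rejection needs stronger evidence.
k_cautious = cutoff(0.5, 1.0, 2.0)
```

Making a Type-I error more costly (larger `w_a`) or making H more probable a priori (larger `prior_h`) both lower $K$, so a smaller Bayes factor is required before H is rejected.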
It is important to remember that this generalized version of the Neyman–Pearson Lemma, from the classical point of view, will only apply to simple-versus-simple hypotheses. It is not common in classical inference to consider a density function under a composite hypothesis. However, some classical methods use optimization by considering the maximum of the likelihood function both under H and under A. Recall that the likelihood function can be represented as $I_x = \{L(\theta \mid x) = f(x \mid \theta);\ \theta \in \Theta\}$.
In the Bayesian paradigm, the likelihood function $L$ plays an important role, which is not at all surprising, because it is the only mathematical object considered that defines an association between a sample $x$ and a parameter $\theta$. Rather than optimization, integration is the Bayesian tool applied here. With the prior densities defined, the following conditional expectations are calculated:
$$f_H(x) = E\{L(\theta \mid x) \mid x, \theta \in \Theta_H\} \quad\text{and}\quad f_A(x) = E\{L(\theta \mid x) \mid x, \theta \in \Theta_A\}. \tag{8}$$
These functions are the Bayesian predictive densities under the respective hypotheses. Both are probability density functions over the sample space $\mathcal{X}$. The ratio between the two functions is known as the Bayes factor,
$$BF(x) = \frac{f_H(x)}{f_A(x)}. \tag{9}$$
To define a confidence index, an alternative to the usual p-value, it is necessary to establish an ordering of all the points in the sample space. Montoya-Delgado et al. [17] suggest the use of the Bayes factor values of all sample points to induce the necessary order. García-Donato and Chen [29] use a similar ordering of the sample space on the way to calculating Type-I and Type-II error probabilities for Bayes factor tests like those of Jeffreys [30] under a specific symmetry condition on the sampling distribution of the Bayes factor. Gu, Hoijtink, and Mulder [31] apply a similar condition, essentially holding the probabilities of the two types of error to be equal via tuning of the Bayes factor for a “Bayesian t-test” using a specific kind of prior. Both of these approaches continue to use the comparison of a Bayes factor to fixed values, such as those in the table presented by Jeffreys [30] and the updated table presented by Kass and Raftery [32], to choose from competing hypotheses. The new hypothesis tests presented here adopt a criterion for choosing which hypothesis to reject that is more like the one used in familiar Neyman–Pearson testing, but with the advantage that the significance level is adaptive, that is, depends on the sample size.
The steps to perform a hypothesis test are as follows:
1. Define a prior density $g(\theta)$ over the entire parameter space $\Theta$. This function can be chosen either objectively or subjectively.
2. Clearly define the hypotheses to be tested, H and A.
3. Obtain the predictive functions under the two alternative hypotheses. In the case for which the parametric subspaces defined by the hypotheses are of different dimensionalities, the definition of a prior density under the subset of smaller dimension, say H, is obtained from the following expression, subject to the condition (on the parameter space as a whole and the hypotheses) that the integral in the denominator can be defined:
$$g(\theta \mid H) = \begin{cases} 0 & \text{if } \theta \in \Theta_A \\ \dfrac{g(\theta)}{\int_{\Theta_H} g(y)\,dy} & \text{if } \theta \in \Theta_H. \end{cases} \tag{10}$$
The denominator is the surface integral over the subspace $\Theta_H$. When $\Theta_H$ consists of a single point, there is no need to perform the integral. In the case of $\Theta_H$ and $\Theta_A$ of different dimensionalities, define an additional positive probability $\pi$ that H is the true hypothesis. Figure 1 illustrates how $g(\theta \mid H)$ is obtained from the prior $g(\theta)$ over the full parameter space $\Theta$.
4. Define the loss function, considering mainly the relative importance of the hypotheses and of the two types of error. Consider, for example, a governor who is concerned more with the budget than with public health and who will strongly prefer the hypothesis that the apparent wave of meningitis cases in his state does not represent an epidemic.
5. Use the Bayes factor to order the sample space: $\{BF(x) : x \in \mathcal{X}\} \subset \mathbb{R}$ establishes the order of each $x \in \mathcal{X}$. This ordering can be used independently of the dimensionalities of the spaces $\mathcal{X}$ and $\Theta$.
6. Using the theorem above, compute the optimal averaged error probabilities and use the value of $\alpha(\delta^*)$ as the adaptive level of significance, which will depend on the loss function, the probability densities, the prior probability $\pi$, and especially on the sample size.
7. Calculate the significance index, the P-value, as follows: if $x_0$ is the observed value of a statistic and $C_0 = \{x;\ BF(x) \le BF(x_0)\}$ is the observed tail under the new ordering, the P-value is calculated using the expression $P_0 = \int_{C_0} f_H(x)\,dx$. Clearly, this may be a single or a multiple integral or sum.
8. Compare the value $P_0$ with the value of $\alpha(\delta^*)$. Reject (do not reject) H if $P_0 < (>)\ \alpha(\delta^*)$. In the case of equality, take either decision without prejudice to optimality.
9. Finally, if a value of $\alpha(\delta^*)$ is specified a priori, calculate the sample size needed to make this fixed value as close as possible to optimal according to the generalized Neyman–Pearson Lemma.
We emphasize that it does not matter how the prior over the entire parameter space is chosen. The present work is concerned with how to perform the new hypothesis tests once an overall prior has been chosen.
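Steps 5 to 8 can be sketched for a finite sample space as follows. This is a minimal illustration, not the authors' implementation: the function `adaptive_test`, its dictionary-based inputs, and the four-point sample space are all assumptions made for the example, and the predictive probabilities are taken as already computed (steps 1 to 4).

```python
from fractions import Fraction


def adaptive_test(f_h, f_a, x_obs, k=Fraction(1)):
    """Run the test on a finite sample space.

    f_h and f_a map every sample point to its predictive probability under
    H and under A (f_a must be positive everywhere); k is the cutoff K.
    """
    # Step 5: the Bayes factor orders the sample space.
    bf = {x: f_h[x] / f_a[x] for x in f_h}
    # Step 6: optimal rejection region from the lemma and its error rates.
    rejection = {x for x in bf if bf[x] <= k}
    alpha = sum(f_h[x] for x in rejection)
    beta = sum(f_a[x] for x in bf if x not in rejection)
    # Step 7: observed tail C_0 and the P-value.
    tail = {x for x in bf if bf[x] <= bf[x_obs]}
    p_value = sum(f_h[x] for x in tail)
    # Step 8: equality may go either way; this sketch rejects on ties.
    return {"alpha": alpha, "beta": beta, "p_value": p_value,
            "reject": p_value <= alpha}


# Hypothetical predictive probabilities, for illustration only.
f_h = {0: Fraction(10, 20), 1: Fraction(6, 20), 2: Fraction(3, 20), 3: Fraction(1, 20)}
f_a = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

extreme = adaptive_test(f_h, f_a, x_obs=3)   # point with the smallest Bayes factor
central = adaptive_test(f_h, f_a, x_obs=1)   # point favoring H
```

Exact rational arithmetic is used so that sample points with equal Bayes factors are treated as ties rather than being separated by floating-point rounding.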

3. Illustrative Examples

This section introduces four simple examples to illustrate the use of the new P-value and how the adaptive significance level varies with sample size.

3.1. Example 1—Comparing Two Proportions

A doctor wants to show that the incorporation of a new technology in a treatment can produce better results than the conventional treatment. He plans a clinical trial with two arms, case and control, each with eight patients. The case arm receives the new treatment and the control arm receives the conventional one. Details of a clinical trial of this kind are shown in [33]. The observed results in this example are that only one of the patients in the control arm responded positively, but in the case arm there were four positive outcomes.
The most common classical significance tests result in the following p-values: the Pearson χ² p-value is 0.106, changed to 0.281 with the Yates continuity correction applied, and Fisher’s exact p-value is 0.282. Traditional analysts would conclude that there were no statistically significant differences between the two treatments, using any of the canonical significance levels. Note that these procedures were for testing a sharp hypothesis against a composite alternative: $H: \theta_0 = \theta_1$ and $A: \theta_0 \ne \theta_1$, comparing the proportion of success of the two treatments. In what follows, we calculate the proposed P-value and use the optimal significance level $\alpha(\delta^*)$ to make the decision of choosing one of the hypotheses.
To be fair in our comparisons, we consider independent uniform (non-informative) prior distributions for $\theta_0$ and $\theta_1$. With these suppositions and the likelihoods being binomials with sample sizes n = 8, the predictive probability functions under the two hypotheses are
$$f_H(x, y) = \frac{\binom{8}{x}\binom{8}{y}}{17\binom{16}{x+y}} \quad\text{and}\quad f_A(x, y) = \frac{1}{81}, \qquad (x, y) \in \{0, 1, \ldots, 8\} \times \{0, 1, \ldots, 8\}. \tag{11}$$
The variables $x$ and $y$ represent the possible observed values of the number of positive outcomes in the two arms. Table 1 and Figure 2 present the Bayes factors for all possible results.
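The two predictive functions follow from integrating the binomial likelihoods against the priors. The sketch below assumes, in line with the normalization described in Section 2, that restricting the uniform joint prior to the line $\theta_0 = \theta_1 = \theta$ gives a uniform density for the common value $\theta$.

```latex
\begin{aligned}
f_A(x, y) &= \int_0^1 \binom{8}{x}\theta_0^{x}(1-\theta_0)^{8-x}\,d\theta_0
             \int_0^1 \binom{8}{y}\theta_1^{y}(1-\theta_1)^{8-y}\,d\theta_1
           = \frac{1}{9}\cdot\frac{1}{9} = \frac{1}{81}, \\[4pt]
f_H(x, y) &= \binom{8}{x}\binom{8}{y}\int_0^1 \theta^{x+y}(1-\theta)^{16-x-y}\,d\theta
           = \binom{8}{x}\binom{8}{y}\,\frac{(x+y)!\,(16-x-y)!}{17!}
           = \frac{\binom{8}{x}\binom{8}{y}}{17\binom{16}{x+y}}.
\end{aligned}
```

Both steps use the Beta integral $\int_0^1 \theta^{a}(1-\theta)^{b}\,d\theta = a!\,b!/(a+b+1)!$ for non-negative integers $a$ and $b$.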
To obtain the proposed P-value, define the set $\Psi_{obs}$ of sample points $(x, y)$ for which the Bayes factors are smaller than or equal to the Bayes factor of the observed sample point; i.e.,
$$\Psi_{obs} = \{(x, y) \in \{0, 1, \ldots, 8\} \times \{0, 1, \ldots, 8\} : BF \le BF_{obs}\}. \tag{12}$$
Thus, the significance index, P-value, is the sum of all predictive probabilities (under H) in $\Psi_{obs}$:
$$P\text{-value} = \sum_{(x, y) \in \Psi_{obs}} f_H(x, y) = \sum_{(x, y) \in \Psi_{obs}} \frac{\binom{8}{x}\binom{8}{y}}{17\binom{16}{x+y}}. \tag{13}$$
Recalling the observed result of the clinical trial, $(x, y) = (1, 4)$, the observed Bayes factor is $BF_{obs} = 0.611$. The italic-bold cells in Table 1 identify the set of possible values of the Bayes factor. Thus, according to Equation (13), the P-value is $P = 0.0923$.
To obtain the optimal solution we minimize the sum of the error probabilities, $\alpha(\delta) + \beta(\delta)$. The two error types are considered to be of the same severity in this example. The optimal solution is the result of comparing the Bayes factor with the constant $K$ as defined in Equation (7) to make the choice according to the extended Neyman–Pearson Lemma. Defining the set of sample space points $(x, y)$ with Bayes factors smaller than or equal to $K$, i.e., $\Psi^* = \{(x, y) \in \{0, 1, \ldots, 8\} \times \{0, 1, \ldots, 8\} : BF \le K\}$, the optimal Type-I and Type-II error probabilities are given by
$$\alpha(\delta^*) = \sum_{(x, y) \in \Psi^*} f_H(x, y) = \sum_{(x, y) \in \Psi^*} \frac{\binom{8}{x}\binom{8}{y}}{17\binom{16}{x+y}} \tag{14}$$
and
$$\beta(\delta^*) = \sum_{(x, y) \notin \Psi^*} f_A(x, y) = \sum_{(x, y) \notin \Psi^*} \frac{1}{81}. \tag{15}$$
In this example, we consider the two hypotheses to be equally probable a priori, $\pi = 0.5$, and represent the equal severity of Type-I and Type-II errors by $w_H = w_A = 1$, resulting in $K = 1$. The set $\Psi^*$ is identified by red cells in Table 1. From Equations (14) and (15), we obtain the optimal adaptive level of significance $\alpha(\delta^*) = 0.1245$ and the probability of a Type-II error $\beta(\delta^*) = 0.4815$. The high value of the probability of the second kind of error is expected whenever the sample sizes are small. Contrary to the classical results, the conclusion now is the most intuitive one; the null hypothesis is rejected since $P < \alpha(\delta^*)$.
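The calculations of Example 1 can be reproduced with a short script. The sketch below enumerates the 81 sample points, using the predictive functions of Equation (11) and $K = 1$; the names `f_h`, `f_a` and `bayes_factor` are illustrative choices, and exact fractions are used so that symmetric sample points tie exactly.

```python
from fractions import Fraction
from math import comb

N = 8   # patients per arm
K = 1   # cutoff for pi = 0.5 and w_H = w_A = 1


def f_h(x, y, n=N):
    """Predictive probability under H (common success probability)."""
    return Fraction(comb(n, x) * comb(n, y), (2 * n + 1) * comb(2 * n, x + y))


def f_a(x, y, n=N):
    """Predictive probability under A (independent uniform priors)."""
    return Fraction(1, (n + 1) ** 2)


def bayes_factor(x, y, n=N):
    return f_h(x, y, n) / f_a(x, y, n)


points = [(x, y) for x in range(N + 1) for y in range(N + 1)]
bf_obs = bayes_factor(1, 4)   # observed result (x, y) = (1, 4)

# Adaptive significance level and Type-II error probability.
alpha = sum(f_h(x, y) for x, y in points if bayes_factor(x, y) <= K)
beta = sum(f_a(x, y) for x, y in points if bayes_factor(x, y) > K)
# P-value: total probability under H of points ordered at or below the observed one.
p_value = sum(f_h(x, y) for x, y in points if bayes_factor(x, y) <= bf_obs)

print(f"BF_obs = {float(bf_obs):.3f}")
print(f"alpha  = {float(alpha):.4f}")    # 0.1245
print(f"beta   = {float(beta):.4f}")     # 0.4815
print(f"P      = {float(p_value):.4f}")  # 0.0923
print("reject H" if p_value <= alpha else "do not reject H")
```

The rejection region found this way consists of the 42 points with $|x - y| \ge 3$, which is why $\beta(\delta^*) = 39/81$.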
The physician who owns the data in Example 1, looking at our analysis, asked about the sample size needed to obtain at most a 10% level of significance for our procedure. The answer can be obtained from the next example, which shows the case of two arms with 20 patients each.

3.2. Example 2—Two Proportions, Varying Sample Sizes

Consider now a clinical trial as in Example 1, but with an arm size of $n = 20$. The observed result is $(x, y) = (4, 10)$. We leave to the reader the simple exercise of repeating the calculations of Example 1 with different samples. Consider independent uniform (non-informative) prior distributions for $\theta_0$ and $\theta_1$ and take the two hypotheses to have equal prior probabilities and the two types of error to have the same relative severity, $\pi = 0.5$ and $w_H = w_A = 1$. The predictive probability functions under hypotheses $H: \theta_0 = \theta_1$ and $A: \theta_0 \ne \theta_1$ are
$$f_H(x, y) = \frac{\binom{20}{x}\binom{20}{y}}{41\binom{40}{x+y}} \quad\text{and}\quad f_A(x, y) = \frac{1}{441}, \qquad (x, y) \in \{0, 1, \ldots, 20\} \times \{0, 1, \ldots, 20\}, \tag{16}$$
and the observed Bayes factor is $BF_{obs} = 0.415$, which leads to the following results: significance index $P = 0.02901$; optimal adaptive level of significance $\alpha(\delta^*) = 0.0995$; and the probability of a Type-II error $\beta(\delta^*) = 0.3651$. The classical χ² p-value is $p = 0.0467$, indicating rejection of the null hypothesis at the canonical 5% level of significance. This agrees with our decision of rejecting the null hypothesis since again $P < \alpha(\delta^*)$. It is interesting to see the relative distance between the index and the level of significance. For the χ² test, we have $1 - 0.0467/0.05 = 0.07$ and the adaptive case obtains $1 - 0.029/0.0995 = 0.71$.
Figure 3 presents the optimal adaptive level of significance and the Type-II error by sample size. As expected, the probabilities of both kinds of errors decrease when the sample size increases.
The response to the question about the sample size needed to obtain a significance level of at most 10% is $n = 20$ in each arm. For a level of at most 5%, we need a sample size of $n = 90$ in each arm.
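The dependence of the adaptive level on the arm size can be explored numerically. The sketch below generalizes the predictive functions of Examples 1 and 2 to balanced arms of size $n$, assuming the same uniform priors and $K = 1$; `optimal_errors` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction
from math import comb


def optimal_errors(n, k=1):
    """Return (alpha, beta) of the optimal test for two arms of size n.

    Uses f_H = C(n,x) C(n,y) / ((2n+1) C(2n,x+y)) and f_A = 1/(n+1)^2,
    which reduce to the formulas of the examples for n = 8 and n = 20.
    """
    f_a = Fraction(1, (n + 1) ** 2)
    alpha = Fraction(0)
    beta = Fraction(0)
    for x in range(n + 1):
        for y in range(n + 1):
            f_h = Fraction(comb(n, x) * comb(n, y),
                           (2 * n + 1) * comb(2 * n, x + y))
            if f_h <= k * f_a:      # Bayes factor at or below the cutoff: reject H
                alpha += f_h
            else:                   # otherwise H is not rejected
                beta += f_a
    return float(alpha), float(beta)


for n in (8, 20):
    alpha, beta = optimal_errors(n)
    print(f"n = {n:3d}  alpha = {alpha:.4f}  beta = {beta:.4f}")
```

Calling `optimal_errors` over a range of arm sizes gives curves of the kind described for Figure 3; because the sample space is discrete, the level need not decrease strictly at every step, so a search for the arm size meeting a target level should scan candidate values of $n$ rather than assume monotonicity.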
Optimal adaptive significance levels and Type-II error probabilities for different arm sizes, $n_1$ and $n_2$, are presented in Table 2. With a fixed total sample size, an unbalanced sample can have larger (both Type-I and Type-II) errors than a balanced sample. The greater the imbalance of the sample, the greater the averaged error probabilities are. For example, the error probabilities of an unbalanced sample with $n_1 = 60$ and $n_2 = 10$ are larger than those of a balanced sample with $n_1 = n_2 = 20$ (Table 2), despite the unbalanced sample having a total size of 70 and the balanced sample just 40.
Pericchi and Pereira [6] present a closed asymptotic formula that relates sample size and significance level in the simple case of testing $H: \theta = \theta_0$ vs. $A: \theta \ne \theta_0$, in a binomial with parameters $\theta$ and $n$. A natural future project is to find this type of relation in other complex statistical problems such as the one presented in the above examples.
The following example is an attempt to show that our P-value should not violate the likelihood principle. Recall that violation of this principle has produced some of the Bayesian community’s main criticisms of the classical p-values.

3.3. Example 3—Test for One Proportion and the Likelihood Principle

A common example in which the likelihood principle can be violated is the case of binomials compared to negative binomials. For the same values of $x$, the number of successes in $n$ independent Bernoulli trials, the two distributions produce different p-values that can lead to different decisions if compared with the same level of significance. The present example shows that the new test introduced here will produce identical decisions if the observed sample size and the number of successes are the same. The proof that this is the case in general for the new tests is presented as Appendix A to this article. The reason the decisions end up being the same for different models is that, although the P-values for the different models are different from each other, they are compared to different significance levels. The decision about the null hypothesis ends up being the same, so there is no violation of the likelihood principle. Changing the notation, let the sample vector be composed of the number of successes and the number of failures, $(x, y)$, and the corresponding vector of probabilities be $(\theta_0, \theta_1)$ with $\theta_0 = 1 - \theta_1$. Take $H: \theta_1 = 0.5$ and $A: \theta_1 \ne 0.5$ as the hypotheses to be tested. Taking a uniform (non-informative) prior distribution for $\theta_1$ and taking the two hypotheses to be equally probable a priori and the two types of error to have equal relative severity, $\pi = 0.5$ with $w_H = w_A = 1$, the predictive densities needed for the significance tests are as follows:
  • for a (positive) binomial,
    $$f_H(x) = \binom{x+y}{x}\left(\frac{1}{2}\right)^{x+y} \quad\text{and}\quad f_A(x) = (x + y + 1)^{-1};$$
  • for a negative binomial,
    $$f_H(x) = \binom{x+y-1}{x}\left(\frac{1}{2}\right)^{x+y} \quad\text{and}\quad f_A(x) = y\,[(x + y)(x + y + 1)]^{-1}.$$
Clearly, the Bayes factors, as defined by Equation (9), are equal for the two models, and since using the lemma will lead to comparing them to the same constant, the decisions about the null hypothesis end up being the same. Note that both the P-values and the significance levels are different for the two models. For instance, if we consider the observations $(x, y) = (3, 10)$ and $(x, y) = (10, 3)$ for a positive binomial, we obtain the same results for both samples: $\alpha = 0.09$, $\beta = 0.43$, and $P = 0.02$. For the negative binomial, the two observed points will produce different significance levels and probabilities of both kinds of errors. For the first (second) sample, one stops observing whenever the number of successes reaches 3 (10). For the first result, we have $\alpha = 0.18$, $\beta = 0.4$ and $P = 0.0$; for the second, $\alpha = 0.12$, $\beta = 0.33$, and $P = 0.01$. Therefore, the decisions based on positive binomials are the same as the ones based on negative binomials for the same $(x, y)$.
Table 3 presents the predictive densities under several kinds of hypotheses for one proportion. For all kinds of hypotheses, positive and negative binomial models, for the same $(x, y)$, produce equal Bayes factors.
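The equality of the Bayes factors under the two sampling models can be checked directly from the predictive densities given above for $H: \theta_1 = 0.5$. The sketch below uses exact fractions; the function names are illustrative, and the negative binomial formulas require $y \ge 1$.

```python
from fractions import Fraction
from math import comb


def bf_binomial(x, y):
    """Bayes factor f_H / f_A for the (positive) binomial model."""
    n = x + y
    f_h = Fraction(comb(n, x), 2 ** n)
    f_a = Fraction(1, n + 1)
    return f_h / f_a


def bf_negative_binomial(x, y):
    """Bayes factor f_H / f_A for the negative binomial model (y >= 1)."""
    n = x + y
    f_h = Fraction(comb(n - 1, x), 2 ** n)
    f_a = Fraction(y, n * (n + 1))
    return f_h / f_a


for x, y in [(3, 10), (10, 3)]:
    same = bf_binomial(x, y) == bf_negative_binomial(x, y)
    print((x, y), float(bf_binomial(x, y)), "equal" if same else "different")
```

The equality holds for every $(x, y)$ with $y \ge 1$ because $\binom{x+y-1}{x}\,(x+y)/y = \binom{x+y}{x}$, so the model-specific factors cancel in the ratio.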

3.4. Example 4

This is an example used by Pereira and Wechsler [15], showing that the critical region is not always the tails of the null distribution; it can be a union of disjoint intervals. In such cases, it can be impossible to calculate a classical p-value, but the ordering of the entire sample space by Bayes factors allows for an unambiguous definition and calculation of the new index, a P-value.
Let $x$ be a normal random variable with zero mean and unknown variance $\sigma^2$. The hypotheses are $H: \sigma^2 = 2$ vs. $A: \sigma^2 \ne 2$. A $\chi^2_1$ (chi-squared distribution with one degree of freedom) is taken as a prior density for $\sigma^2$. After an integration exercise, we can establish the predictive densities for our significance test as
$$f_A(x) = \{\pi(1 + x^2)\}^{-1} \quad\text{and}\quad f_H(x) = (2\sqrt{\pi})^{-1}\exp\left(-\frac{x^2}{4}\right).$$
These are, respectively, a Cauchy density and a normal density with zero mean and variance 2. Figure 4 shows the Bayes factor for all sample points, using the constant 1.1 as a cutoff for the decision about the null hypothesis. The sample points that do not favor the null hypothesis are a central region together with the heavy tails of the Cauchy density. The set that favors H does not include the central region:
$$\mathcal{X}_H = \{x \mid x \in (-2.8;\ -0.6) \cup (0.6;\ 2.8)\}.$$
The set favoring the alternative hypothesis A includes the interval $(-0.6;\ 0.6)$, a considerable central region.
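The endpoints of the region favoring H can be recovered numerically from the two predictive densities and the cutoff 1.1. The sketch below uses a plain bisection, bracketing the two positive crossings around $x = \sqrt{3}$, where the Bayes factor $(\sqrt{\pi}/2)(1 + x^2)\exp(-x^2/4)$ attains its maximum; the helper `bisect_root` and the search limit 6.0 are choices made for this illustration.

```python
import math

CUTOFF = 1.1


def f_h(x):
    """Normal density with mean 0 and variance 2."""
    return math.exp(-x * x / 4.0) / (2.0 * math.sqrt(math.pi))


def f_a(x):
    """Standard Cauchy density."""
    return 1.0 / (math.pi * (1.0 + x * x))


def bayes_factor(x):
    return f_h(x) / f_a(x)


def bisect_root(func, lo, hi, tol=1e-10):
    """Find a root of func in [lo, hi], assuming a sign change on the interval."""
    sign_lo = func(lo) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (func(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


peak = math.sqrt(3.0)  # maximizer of the Bayes factor for x >= 0
inner = bisect_root(lambda x: bayes_factor(x) - CUTOFF, 0.0, peak)
outer = bisect_root(lambda x: bayes_factor(x) - CUTOFF, peak, 6.0)

print(f"H is favored for {inner:.2f} < |x| < {outer:.2f}")
```

Since the Bayes factor is symmetric in $x$, the same two endpoints with negative sign bound the left-hand interval.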

4. Final Remarks

It is worth noting that there are multiple ways to understand our new test, and we would like to present a specific vision. Consider a statistical model, with a family of probability functions indexed by $\theta$, denoted by $f(x \mid \theta)$, with all necessary conditions imposed for all relevant mathematical objects to be well-defined. If $\lambda$ is a function of $\theta$, one can simply write $f(x \mid \theta) = f(x \mid \theta, \lambda)$, because the sub-σ-algebra defined by the new parameter $\lambda$ is contained in the one defined by the original parameter $\theta$. Given a prior density $g(\theta)$ for the original parameter $\theta$,
$$f(x \mid \lambda) = E_\theta\{L(\lambda, \theta \mid x)\} = E_\theta\{g(\theta \mid \lambda)\,f(x \mid \theta, \lambda)\}.$$
If the new parameter $\lambda$ is a binary function (produces only values 0 and 1), then the two predictive probability functions are $f_0(x) = f(x \mid \lambda = 0)$ and $f_1(x) = f(x \mid \lambda = 1)$. These functions are averages, weighted by $g(\theta \mid \lambda)$, of the likelihood function. The original parameter has been removed as a “nuisance”, leaving only the new parameter representing the decision. Because the new parameter is binary, hypotheses involving it are simple-versus-simple, so the generalized Neyman–Pearson Lemma applies. Our procedure can be seen as elimination of a nuisance parameter for the application of optimization. We refer to Basu [34] for elimination of nuisance parameters when the parameter spaces are variation dependent.
For decades, and increasingly in recent years, users of statistics have been questioning the logic of using the canonical significance levels, or indeed, any fixed significance level, for hypothesis testing. We believe that there are no formal reasons for using the established numbers, and that there are in fact good reasons not to fix significance levels a priori. We use the natural logic of optimization to define an adaptive significance level, that is, one that depends on the sample size. Our test using the new index (P-value) and the adaptive significance level is compatible with the likelihood principle, as proved in Appendix A of the present article.
There is still much work to be done, testing different kinds of hypotheses in the parameter spaces of different models, including multivariate problems. We are not aware of any complex model that prevents the use of the hypothesis tests discussed in the present paper. It is hoped that the similarity of the apparatus of the new tests to that of existing Neyman–Pearson tests, plus favorable characteristics of the new tests, will make the new testing procedure useful and popular among investigators in the many fields in which statistical hypothesis testing can be useful.
There is certainly a one-to-one relation between $P$ and $BF$! Hence, once a cut-off for $P$ is defined automatically, we have a corresponding cut-off for $BF$, and the pair of error probabilities (Type I and Type II) is then the same for the two methods. Those who prefer to use Bayes factors directly can certainly do so, but they can also take advantage of the cut-off provided by our method.

Acknowledgments

The first and sixth authors are grateful to the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for financial support. CABP grant number 308776/2014-3; AP grant number 304025/2013-5. Our research group, GIS—group of inductive statistics, contributed to this work by discussing and making suggestions. We are very grateful for all the collaboration from these colleagues, especially Fernando Corrêa Filho, Julio Michael Stern, and Sergio Wechsler. The editor and four reviewers of this article engaged in lengthy discussion that helped in sharpening our work. This work is dedicated to the memory of the late Oscar Kempthorne.

Author Contributions

The authors contributed equally to this work. It would be difficult for us to identify what any one author did not contribute.

Conflicts of Interest

The six authors declare no conflict of interest.

Appendix A

It is proved here that the new tests are compatible with the likelihood principle in general.
Imagine two different possible experiments $\mathcal{E}_1 = (\mathcal{X}_1, \Theta, \mathcal{P}^{(1)})$ and $\mathcal{E}_2 = (\mathcal{X}_2, \Theta, \mathcal{P}^{(2)})$, where $\mathcal{X}_i$, $i \in \{1, 2\}$, is the discrete sample space for the observable $Z_i$ in experiment $\mathcal{E}_i$, and $\mathcal{P}^{(i)}$ is a parametric family of probability functions indexed by the common parameter $\theta \in \Theta$, that is, $\mathcal{P}^{(i)} = \{f^{(i)}(\cdot \mid \theta) : \theta \in \Theta\}$, $i \in \{1, 2\}$. Let $g(\theta)$ be a prior for $\theta$.
Consider the hypotheses $H: \theta \in \Theta_H$ and $A: \theta \in \Theta_A$, with $\Theta_H \cap \Theta_A = \emptyset$ and $\Theta_H \cup \Theta_A = \Theta$. Let the risks for the two types of errors in making a decision be $A = \pi w_A$ and $B = (1 - \pi) w_H$, both positive.
For $i \in \{1, 2\}$ and $x_i \in \mathcal{X}_i$, let
$$f_H^{(i)}(x_i) = \int_{\Theta_H} f^{(i)}(x_i \mid \theta)\, g(\theta \mid H)\, d\theta$$
be the prior predictive probability function for $Z_i$ under $H$, where $g(\theta \mid H)$ is the conditional measure of $\theta$ given $H$, i.e., given $\theta \in \Theta_H$.
In the same way,
$$f_A^{(i)}(x_i) = \int_{\Theta_A} f^{(i)}(x_i \mid \theta)\, g(\theta \mid A)\, d\theta$$
is the prior predictive under the alternative hypothesis $A$. Define the Bayes factor in favor of $H$ by
$$BF^{(i)}(x_i) = \frac{f_H^{(i)}(x_i)}{f_A^{(i)}(x_i)}.$$
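For $i \in \{1, 2\}$, let
$$\alpha^{(i)} = \mathbb{P}_H^{(i)}\left(BF^{(i)}(Z_i) \le \frac{B}{A}\right) = \sum_{x_i \in \mathcal{X}_A} f_H^{(i)}(x_i),$$
where $\mathbb{P}_H^{(i)}$ is the probability measure associated with the probability mass function $f_H^{(i)}$, and $\mathcal{X}_A = \{x_i \in \mathcal{X}_i : BF^{(i)}(x_i) \le B/A\}$ is the set of sample points at which the Bayes factor does not exceed $B/A$.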
Define
$$K^{(i)} = \max\left\{BF^{(i)}(x_i) : x_i \in \mathcal{X}_i \text{ and } BF^{(i)}(x_i) \le \frac{B}{A}\right\}$$
and if the set in this expression is empty, take $K^{(i)} = 0$. Note that
$$\alpha^{(i)} = \mathbb{P}_H^{(i)}\left(BF^{(i)}(Z_i) \le K^{(i)}\right)$$
and that, for $r_1^{(i)}, r_2^{(i)} \in \{BF^{(i)}(x) : x \in \mathcal{X}_i\}$,
$$r_1^{(i)} \le r_2^{(i)} \Leftrightarrow \mathbb{P}_H^{(i)}\left(BF^{(i)}(Z_i) \le r_1^{(i)}\right) \le \mathbb{P}_H^{(i)}\left(BF^{(i)}(Z_i) \le r_2^{(i)}\right).$$
Finally, define the test function $\varphi_i^* : \mathcal{X}_i \to \{0, 1\}$ by
$$\varphi_i^*(x) = 1 \Leftrightarrow P_H^{(i)}(x) \le \alpha^{(i)},$$
where $P_H^{(i)}(x)$ is the “P-value”, the significance index used in the new test, at sample point $x$:
$$P_H^{(i)}(x) = \mathbb{P}_H^{(i)}\left(\left\{BF^{(i)}(Z_i) \le BF^{(i)}(x)\right\}\right).$$
The conditions for rejection of $H$ in each experiment can be rewritten:
$$\varphi_i^*(x) = 1 \Leftrightarrow P_H^{(i)}(x) \le \mathbb{P}_H^{(i)}\left(BF^{(i)}(Z_i) \le K^{(i)}\right) \Leftrightarrow BF^{(i)}(x) \le K^{(i)}.$$
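The quantities defined above can be computed directly on a small finite sample space. The following is a minimal sketch, assuming a hypothetical example that is not in the article: $x$ successes in $n = 8$ Bernoulli trials, $H: \theta = 1/2$ against $A: \theta \ne 1/2$ with a uniform prior under $A$, and a cut-off $B/A = 1$. The function name `adaptive_test` and the dictionary layout are illustrative choices. Exact fractions are used so that ties in the Bayes factor are compared without rounding error.

```python
from fractions import Fraction
from math import comb

def adaptive_test(f_H, f_A, ratio):
    """Build the test from predictive mass functions on a finite sample space.

    f_H, f_A: dicts mapping each sample point to its predictive probability
    (f_A is assumed positive everywhere). ratio: the cut-off B/A.
    """
    bf = {x: f_H[x] / f_A[x] for x in f_H}
    reject = {x for x in bf if bf[x] <= ratio}
    alpha = sum((f_H[x] for x in reject), Fraction(0))
    beta = sum((f_A[x] for x in bf if x not in reject), Fraction(0))
    # K is the largest Bayes factor not exceeding the cut-off (0 if none).
    K = max((bf[x] for x in reject), default=Fraction(0))
    # P-value: probability under H of a Bayes factor no larger than observed.
    p_value = {x: sum((f_H[z] for z in bf if bf[z] <= bf[x]), Fraction(0))
               for x in bf}
    phi = {x: int(p_value[x] <= alpha) for x in bf}
    return {"bf": bf, "alpha": alpha, "beta": beta, "K": K,
            "p_value": p_value, "phi": phi}

# Hypothetical example: H: theta = 1/2 against A: theta != 1/2 (uniform prior).
n = 8
f_H = {x: Fraction(comb(n, x), 2**n) for x in range(n + 1)}
f_A = {x: Fraction(1, n + 1) for x in range(n + 1)}
result = adaptive_test(f_H, f_A, ratio=Fraction(1))

print("alpha =", result["alpha"], " beta =", result["beta"], " K =", result["K"])
for x in range(n + 1):
    print(x, float(result["bf"][x]), float(result["p_value"][x]), result["phi"][x])
```

Under these assumptions the hypothesis is rejected for $x \in \{0, 1, 2, 6, 7, 8\}$, with $\alpha = 37/128$, $\beta = 1/3$ and $K = 63/64$, and the decision based on the P-value coincides with the decision based on comparing the Bayes factor with $K$.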
Now consider a single observation that could be produced by either experiment, expressed in the respective sample spaces as $x_1^* \in \mathcal{X}_1$, $x_2^* \in \mathcal{X}_2$, such that $f^{(1)}(x_1^* \mid \theta) = C(x_1^*, x_2^*)\, f^{(2)}(x_2^* \mid \theta)$, with $C(x_1^*, x_2^*) > 0$, $\forall \theta \in \Theta$. That is, the likelihood generated by data $x_1^*$ in experiment $\mathcal{E}_1$ differs by a constant (not a function of $\theta$) multiplicative factor from the likelihood generated by data $x_2^*$ in experiment $\mathcal{E}_2$. We will prove that $\varphi_1^*(x_1^*) = \varphi_2^*(x_2^*)$, that is, that the decision whether or not to reject the hypothesis $H: \theta \in \Theta_H$ is the same, regardless of the details of the experiment that produced the observation and considering $K^{(1)} = K^{(2)} = B/A$.
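$$\begin{aligned}
\varphi_1^*(x_1^*) = 1 &\Rightarrow BF^{(1)}(x_1^*) \le K^{(1)} \\
&\Rightarrow BF^{(1)}(x_1^*) \le \frac{B}{A} \\
&\Rightarrow \frac{f_H^{(1)}(x_1^*)}{f_A^{(1)}(x_1^*)} \le \frac{B}{A} \\
&\Rightarrow \frac{\int_{\Theta_H} f^{(1)}(x_1^* \mid \theta)\, g(\theta \mid H)\, d\theta}{\int_{\Theta_A} f^{(1)}(x_1^* \mid \theta)\, g(\theta \mid A)\, d\theta} \le \frac{B}{A} \\
&\Rightarrow \frac{\int_{\Theta_H} C(x_1^*, x_2^*)\, f^{(2)}(x_2^* \mid \theta)\, g(\theta \mid H)\, d\theta}{\int_{\Theta_A} C(x_1^*, x_2^*)\, f^{(2)}(x_2^* \mid \theta)\, g(\theta \mid A)\, d\theta} \le \frac{B}{A} \\
&\Rightarrow \frac{\int_{\Theta_H} f^{(2)}(x_2^* \mid \theta)\, g(\theta \mid H)\, d\theta}{\int_{\Theta_A} f^{(2)}(x_2^* \mid \theta)\, g(\theta \mid A)\, d\theta} \le \frac{B}{A} \\
&\Rightarrow \frac{f_H^{(2)}(x_2^*)}{f_A^{(2)}(x_2^*)} \le \frac{B}{A} \\
&\Rightarrow BF^{(2)}(x_2^*) \le \frac{B}{A} \\
&\Rightarrow \mathbb{P}_H^{(2)}\left(BF^{(2)}(Z_2) \le BF^{(2)}(x_2^*)\right) \le \mathbb{P}_H^{(2)}\left(BF^{(2)}(Z_2) \le \frac{B}{A}\right) \\
&\Rightarrow P_H^{(2)}(x_2^*) \le \alpha^{(2)} \Rightarrow \varphi_2^*(x_2^*) = 1.
\end{aligned}$$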
Thus, it has been proven that $\varphi_1^*(x_1^*) = 1 \Rightarrow \varphi_2^*(x_2^*) = 1$. The proof of $\varphi_2^*(x_2^*) = 1 \Rightarrow \varphi_1^*(x_1^*) = 1$ is analogous and is omitted.
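The cancellation of the constant $C$ in the proof can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the article: hypothetical data of 9 successes and 3 failures, $H: \theta = 1/2$ against $A: \theta \ne 1/2$ with a $Beta(1, 1)$ prior, a cut-off $B/A = 1$, and the binomial and negative binomial constants given in the footnote of Table 3. The function names are hypothetical.

```python
from math import comb, exp, isclose, lgamma

def beta_fn(r, s):
    """Complete beta function B(r, s), computed through log-gamma."""
    return exp(lgamma(r) + lgamma(s) - lgamma(r + s))

def bayes_factor(x, y, theta0, u, v, C):
    """BF in favor of H: theta = theta0 against A: theta != theta0.

    x successes, y failures; theta ~ Beta(u, v) under A; C is the
    combinatorial constant of the sampling scheme.
    """
    f_H = C * theta0**x * (1.0 - theta0)**y
    f_A = C * beta_fn(u + x, v + y) / beta_fn(u, v)
    return f_H / f_A

x, y = 9, 3                    # hypothetical data
theta0, u, v = 0.5, 1.0, 1.0   # assumed null value and prior parameters
cutoff = 1.0                   # assumed ratio B/A

# Same data, two sampling schemes that differ only in the constant C.
bf_binomial = bayes_factor(x, y, theta0, u, v, C=comb(x + y, x))
bf_neg_binomial = bayes_factor(x, y, theta0, u, v, C=comb(x + y - 1, x))

print(bf_binomial, bf_neg_binomial)
print("reject H (binomial):", bf_binomial <= cutoff)
print("reject H (negative binomial):", bf_neg_binomial <= cutoff)
```

Both calls return the same Bayes factor, $2860/4096 \approx 0.698$, so the comparison with $B/A$, and hence the decision, does not depend on which of the two experiments produced the data.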

References

  1. Johnson, V.E. Revised standards for statistical evidence. Proc. Natl. Acad. Sci. USA 2013, 110, 19313–19317.
  2. Gaudart, J.; Huiart, L.; Milligan, P.J.; Thiebaut, R.; Giorgi, R. Reproducibility issues in science, is P value really the only answer? Proc. Natl. Acad. Sci. USA 2014, 111, E1934.
  3. Gelman, A.; Robert, C.P. Revised evidence for statistical standards. Proc. Natl. Acad. Sci. USA 2014, 111, E1933.
  4. Pericchi, L.; Pereira, C.A.B.; Pérez, M.E. Adaptive revised evidence for statistical standards. Proc. Natl. Acad. Sci. USA 2014, 111, E1935.
  5. Wasserstein, R.L.; Lazar, N.A. The ASA’s statement on p-values: Context, process, and purpose. Am. Stat. 2016, 70, 129–133.
  6. Pericchi, L.R.; Pereira, C.A.B. Adaptive significance levels using optimal decision rules: Balancing by weighting the error probabilities. Braz. J. Probab. Stat. 2016, 30, 70–90.
  7. Benjamin, D.; Berger, J.; Johannesson, M.; Nosek, B.A.; Wagenmakers, E.-J.; Berk, R.; Bollen, K.A.; Brembs, B.; Brown, L.; Camerer, C.; et al. Redefine statistical significance. Nat. Hum. Behav. 2017.
  8. Nature News. Big Names in Statistics Want to Shake up Much-Maligned P Value. Available online: https://www.nature.com/articles/d41586-017-02190-5?WT.mc_id=TWT_NatureNews&sf101140733=1 (accessed on 28 August 2017).
  9. Pereira, C.A.B.; Stern, J.M. Evidence and credibility: A full Bayesian test of precise hypotheses. Entropy 1999, 1, 104–115.
  10. Madruga, M.R.; Pereira, C.A.B.; Stern, J.M. Bayesian evidence test for precise hypotheses. J. Stat. Plan. Inference 2002, 117, 185–198.
  11. Pereira, C.A.B.; Stern, J.M.; Wechsler, S. Can a significance test be genuinely Bayesian? Bayesian Anal. 2008, 3, 79–100.
  12. Stern, J.M.; Pereira, C.A.B. Bayesian epistemic values: Focus on surprise, measure probability! Log. J. IGPL 2013, 22, 236–254.
  13. Chakrabarty, D. A New Bayesian Test to Test for the Intractability-Countering Hypothesis. J. Am. Stat. Assoc. 2017, 112, 561–577.
  14. Diniz, M.A.; Pereira, C.A.B.; Polpo, A.; Stern, J.M.; Wechsler, S. Relationship between Bayesian and frequentist significance indices. Int. J. Uncertain. Quantif. 2012, 2, 161–172.
  15. Pereira, C.A.B.; Wechsler, S. On the concept of p-value. Braz. J. Probab. Stat. 1993, 7, 159–177.
  16. Pereira, C.A.B. Testing Hypotheses of Different Dimensions: Bayesian View and Classical Interpretation. Professor Thesis, Institute Mathematics & Statistics, USP, Sao Paulo, Brazil, 1985. (In Portuguese).
  17. Irony, T.Z.; Pereira, C.A.B. Bayesian hypothesis test: Using surface integrals to distribute prior information among the hypotheses. Resenhas 1995, 2, 27–46.
  18. Montoya-Delgado, L.E.; Irony, T.Z.; Pereira, C.A.B.; Whittle, M.R. An unconditional exact test for the Hardy-Weinberg equilibrium law: Sample space ordering using the Bayes factor. Genetics 2001, 158, 875–883.
  19. DeGroot, M.H. Probability and Statistics; Addison-Wesley: Boston, MA, USA, 1986.
  20. Dawid, A.P.; Lauritzen, S.L. Compatible Prior Distributions. In Bayesian Methods with Applications to Science Policy and Official Statistics; Monographs of Official Statistics; EUROSTAT: Luxembourg, 2001; pp. 109–118.
  21. Dickey, J.M. The weighted likelihood ratio, linear hypotheses on normal location parameters. Ann. Math. Stat. 1971, 42, 204–223.
  22. Cox, D.R. The role of significance tests (with discussions). Scand. J. Stat. 1977, 4, 49–70.
  23. Cox, D.R. Principles of Statistical Inference; Cambridge University Press: New York, NY, USA, 2006.
  24. Evans, M. Measuring statistical evidence using relative belief. Comput. Struct. Biotechnol. J. 2016, 14, 91–96.
  25. Lindley, D.V. A Statistical Paradox. Biometrika 1957, 44, 187–192.
  26. Bartlett, M.S. A comment on D.V. Lindley’s statistical paradox. Biometrika 1957, 44, 533–534.
  27. Cornfield, J. Sequential trials, sequential analysis and the likelihood principle. Am. Stat. 1966, 20, 18–23.
  28. Neyman, J.; Pearson, E.S. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. A Math. Phys. Charact. 1933, 231, 289–337.
  29. García-Donato, G.; Chen, M.-H. Calibrating Bayes factor under prior predictive distributions. Stat. Sin. 2005, 15, 359–380.
  30. Jeffreys, H. The Theory of Probability; The Clarendon Press: Oxford, UK, 1935.
  31. Gu, X.; Hoijtink, H.; Mulder, J. Error probabilities in default Bayesian hypothesis testing. J. Math. Psychol. 2016, 72, 140–143.
  32. Kass, R.E.; Raftery, A.E. Bayes Factors. JASA 1995, 90, 773–795.
  33. Lopes, A.C.; Greenberg, B.D.; Canteras, M.M.; Batistuzzo, M.C.; Hoexter, M.Q.; Gentil, A.F.; Pereira, C.A.B.; Joaquim, M.A.; de Mathis, M.E.; D’Alcante, C.C.; et al. Gamma Ventral Capsulotomy for Obsessive-Compulsive Disorder: A Randomized Clinical Trial. JAMA Psych. 2014, 71, 1066–1076.
  34. Basu, D. On the elimination of nuisance parameters. JASA 1977, 72, 355–366.
Figure 1. A prior $g(p, q)$ made of independent $Beta(2, 4)$ and $Beta(4, 2)$ distributions in a two-dimensional parameter space is cut along the line $p = q$ and one of the pieces moved away to show the resulting prior on the lower-dimensional set $p = q$.
Figure 2. Bayes factors of all possible results in a clinical trial with arms size of n = 8 each.
Figure 3. Type-I and Type-II error probabilities as functions of the sample size n in each arm.
Figure 4. Bayes factor for $N(0; 2)$ vs. Cauchy.
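Table 1. Bayes factor for all possible results in a clinical trial with arms size of n = 8.

| x \ y | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Sum |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.765 | 2.382 | 1.112 | 0.476 | 0.183 | 0.061 | 0.017 | 0.003 | 4e-04 | 9 |
| 1 | 2.382 | 2.541 | 1.906 | 1.173 | 0.611 | 0.267 | 0.093 | 0.024 | 0.003 | 9 |
| 2 | 1.112 | 1.906 | 2.052 | 1.710 | 1.166 | 0.653 | 0.290 | 0.093 | 0.017 | 9 |
| 3 | 0.476 | 1.173 | 1.710 | 1.866 | 1.633 | 1.161 | 0.653 | 0.267 | 0.061 | 9 |
| 4 | 0.183 | 0.611 | 1.166 | 1.633 | 1.814 | 1.633 | 1.166 | 0.611 | 0.183 | 9 |
| 5 | 0.061 | 0.267 | 0.653 | 1.161 | 1.633 | 1.866 | 1.710 | 1.173 | 0.476 | 9 |
| 6 | 0.017 | 0.093 | 0.290 | 0.653 | 1.166 | 1.710 | 2.052 | 1.906 | 1.112 | 9 |
| 7 | 0.003 | 0.024 | 0.093 | 0.267 | 0.611 | 1.173 | 1.906 | 2.541 | 2.382 | 9 |
| 8 | 4e-04 | 0.003 | 0.017 | 0.061 | 0.183 | 0.476 | 1.112 | 2.382 | 4.765 | 9 |
| Sum | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 81 |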
Note: Cells with red numbers form the region Ψ * and bold-italic cells are the observed value of the Bayes factor.
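The entries of Table 1 can be recomputed with a few lines of code. The sketch below rests on assumptions that this section does not state explicitly: $x$ and $y$ are independent $Binomial(8, p)$ and $Binomial(8, q)$ counts, $H: p = q$ is tested against $A: p \ne q$, and the priors are uniform both on $(p, q)$ under $A$ and on the common value under $H$. With these choices the computed values agree with the tabulated ones to three decimals, for example $BF(0, 0) = 81/17 \approx 4.765$, and every row sums to 9. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def bayes_factor_table(n):
    """Bayes factors BF(x, y) for H: p = q against A: p != q.

    Assumptions: independent Binomial(n, p) and Binomial(n, q) counts,
    uniform priors on p and q under A, uniform prior on the common
    proportion under H.
    """
    table = {}
    for x in range(n + 1):
        for y in range(n + 1):
            s = x + y
            # Under H: C(n,x) C(n,y) times the integral of t^s (1-t)^(2n-s),
            # which equals s! (2n-s)! / (2n+1)!.
            f_H = Fraction(
                comb(n, x) * comb(n, y) * factorial(s) * factorial(2 * n - s),
                factorial(2 * n + 1),
            )
            # Under A: each count is uniform on {0, ..., n}.
            f_A = Fraction(1, (n + 1) ** 2)
            table[(x, y)] = f_H / f_A
    return table

bf = bayes_factor_table(8)
for x in range(9):
    print(" ".join(f"{float(bf[(x, y)]):6.3f}" for y in range(9)))
```

Exact fractions are used here only to make the row sums exact; floating-point arithmetic would serve equally well for display purposes.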
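Table 2. Optimal levels of significance ($\alpha$) and Type-II error probabilities ($\beta$) for two proportions: Two independent binomial likelihoods and various sample sizes.

| $n_1$ | $n_2$ | $\alpha$ | $\beta$ | $n_1$ | $n_2$ | $\alpha$ | $\beta$ | $n_1$ | $n_2$ | $\alpha$ | $\beta$ | $n_1$ | $n_2$ | $\alpha$ | $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | 0.1639 | 0.4050 | 50 | 50 | 0.0667 | 0.2718 | 80 | 10 | 0.1130 | 0.3648 | 90 | 70 | 0.0529 | 0.2323 |
| 20 | 10 | 0.1318 | 0.3939 | 60 | 10 | 0.1097 | 0.3741 | 80 | 20 | 0.0834 | 0.3122 | 90 | 80 | 0.0493 | 0.2281 |
| 20 | 20 | 0.0995 | 0.3651 | 60 | 20 | 0.0860 | 0.3193 | 80 | 30 | 0.0704 | 0.2847 | 90 | 90 | 0.0468 | 0.2240 |
| 30 | 10 | 0.1159 | 0.3900 | 60 | 30 | 0.0765 | 0.2903 | 80 | 40 | 0.0634 | 0.2671 | 100 | 10 | 0.1111 | 0.3627 |
| 30 | 20 | 0.1045 | 0.3333 | 60 | 40 | 0.0689 | 0.2747 | 80 | 50 | 0.0603 | 0.2530 | 100 | 20 | 0.0818 | 0.3079 |
| 30 | 30 | 0.0997 | 0.3070 | 60 | 50 | 0.0626 | 0.2652 | 80 | 60 | 0.0553 | 0.2455 | 100 | 30 | 0.0684 | 0.2795 |
| 40 | 10 | 0.1250 | 0.3703 | 60 | 60 | 0.0591 | 0.2572 | 80 | 70 | 0.0531 | 0.2380 | 100 | 40 | 0.0617 | 0.2601 |
| 40 | 20 | 0.0868 | 0.3357 | 70 | 10 | 0.1130 | 0.3675 | 80 | 80 | 0.0508 | 0.2327 | 100 | 50 | 0.0559 | 0.2479 |
| 40 | 30 | 0.0850 | 0.3029 | 70 | 20 | 0.0865 | 0.3132 | 90 | 10 | 0.1131 | 0.3626 | 100 | 60 | 0.0538 | 0.2368 |
| 40 | 40 | 0.0706 | 0.2968 | 70 | 30 | 0.0727 | 0.2876 | 90 | 20 | 0.0810 | 0.3114 | 100 | 70 | 0.0512 | 0.2291 |
| 50 | 10 | 0.1126 | 0.3761 | 70 | 40 | 0.0645 | 0.2717 | 90 | 30 | 0.0707 | 0.2804 | 100 | 80 | 0.0483 | 0.2238 |
| 50 | 20 | 0.0883 | 0.3240 | 70 | 50 | 0.0603 | 0.2593 | 90 | 40 | 0.0648 | 0.2608 | 100 | 90 | 0.0467 | 0.2188 |
| 50 | 30 | 0.0767 | 0.2992 | 70 | 60 | 0.0575 | 0.2501 | 90 | 50 | 0.0575 | 0.2506 | 100 | 100 | 0.0449 | 0.2150 |
| 50 | 40 | 0.0718 | 0.2817 | 70 | 70 | 0.0539 | 0.2446 | 90 | 60 | 0.0550 | 0.2401 | | | | |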
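The dependence of the optimal error probabilities on the sample sizes can be explored with the sketch below. It is not the authors' code and relies on assumptions labeled here: independent binomial counts in the two arms, uniform priors under both hypotheses, and rejection of $H: p = q$ whenever the Bayes factor is at most a cut-off of 1 (equal weights for the two errors). Under these assumptions, $n_1 = n_2 = 10$ gives $\alpha \approx 0.164$ and $\beta = 49/121 \approx 0.405$, in line with the first row of Table 2. The function names are hypothetical.

```python
from math import exp, lgamma

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def optimal_errors(n1, n2, cutoff=1.0):
    """Return (alpha, beta) when H is rejected for BF <= cutoff.

    Assumptions: independent binomial counts, uniform priors on both
    proportions under A, uniform prior on the common proportion under H.
    """
    f_A = 1.0 / ((n1 + 1) * (n2 + 1))
    n = n1 + n2
    alpha = 0.0
    beta = 0.0
    for x in range(n1 + 1):
        for y in range(n2 + 1):
            s = x + y
            # f_H = C(n1,x) C(n2,y) B(s+1, n-s+1)
            #     = C(n1,x) C(n2,y) / ((n+1) C(n,s))
            f_H = exp(log_comb(n1, x) + log_comb(n2, y) - log_comb(n, s)) / (n + 1)
            if f_H / f_A <= cutoff:
                alpha += f_H   # H true but rejected
            else:
                beta += f_A    # A true but H not rejected
    return alpha, beta

for size in (10, 20, 50, 100):
    a, b = optimal_errors(size, size)
    print(size, round(a, 4), round(b, 4))
```

Both error probabilities shrink as the arms grow, which is the adaptive behavior of the significance level that the table documents.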
Table 3. Predictive densities under several hypotheses for one proportion.

| Hypotheses | Predictive Densities under H ¹ |
|---|---|
| H: $\theta = \theta_0$ | $C(x, y)\, \theta_0^{x} (1 - \theta_0)^{y}$ |
| H: $\theta \ne \theta_0$ | $C(x, y)\, \dfrac{B(U, V)}{B(u, v)}$ |
| H: $\theta \le \theta_0$ | $C(x, y)\, \dfrac{B(\theta_0; U, V)}{B(\theta_0; u, v)}$ |
| H: $\theta > \theta_0$ | $C(x, y)\, \dfrac{B(U, V) - B(\theta_0; U, V)}{B(u, v) - B(\theta_0; u, v)}$ |
| H: $\theta_1 \le \theta \le \theta_2$ | $C(x, y)\, \dfrac{B(\theta_2; U, V) - B(\theta_1; U, V)}{B(\theta_2; u, v) - B(\theta_1; u, v)}$ |
| H: $(\theta < \theta_1) \cup (\theta > \theta_2)$ | $C(x, y)\, \dfrac{B(U, V) - B(\theta_2; U, V) + B(\theta_1; U, V)}{B(u, v) - B(\theta_2; u, v) + B(\theta_1; u, v)}$ |
| H: $(\theta_1 \le \theta \le \theta_2) \cup (\theta_3 \le \theta \le \theta_4)$ | $C(x, y)\, \dfrac{B(\theta_2; U, V) - B(\theta_1; U, V) + B(\theta_4; U, V) - B(\theta_3; U, V)}{B(\theta_2; u, v) - B(\theta_1; u, v) + B(\theta_4; u, v) - B(\theta_3; u, v)}$ |
| H: $(\theta < \theta_1) \cup (\theta_2 < \theta < \theta_3) \cup (\theta > \theta_4)$ | $C(x, y)\, \dfrac{B(U, V) - B(\theta_2; U, V) + B(\theta_1; U, V) - B(\theta_4; U, V) + B(\theta_3; U, V)}{B(u, v) - B(\theta_2; u, v) + B(\theta_1; u, v) - B(\theta_4; u, v) + B(\theta_3; u, v)}$ |

¹ Prior distribution for $\theta$: $\theta \sim Beta(u, v)$; $U = u + x$; $V = v + y$; $C(x, y) = \binom{x + y}{x}$ for positive binomial or $C(x, y) = \binom{x + y - 1}{x}$ for negative binomial; $B(r, s) = \int_0^1 z^{r - 1} (1 - z)^{s - 1}\, dz$ is the beta function; and $B(p; r, s) = \int_0^p z^{r - 1} (1 - z)^{s - 1}\, dz$ is the incomplete beta function.

MDPI and ACS Style

Pereira, C.A.d.B.; Nakano, E.Y.; Fossaluza, V.; Esteves, L.G.; Gannon, M.A.; Polpo, A. Hypothesis Tests for Bernoulli Experiments: Ordering the Sample Space by Bayes Factors and Using Adaptive Significance Levels for Decisions. Entropy 2017, 19, 696. https://doi.org/10.3390/e19120696
