Review

Two P or Not Two P: Mendel Random Variables in Combining Fake and Genuine p-Values

1 Departamento de Matemática e Estatística, Faculdade de Ciências e Tecnologia, Universidade dos Açores, Rua da Mãe de Deus, 9500-321 Ponta Delgada, Portugal
2 Centro de Estatística e Aplicações, Universidade de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal
3 Departamento de Estatística e Investigação Operacional, Faculdade de Ciências da Universidade de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal
4 Academia das Ciências de Lisboa, Rua da Academia das Ciências 19, 1249-122 Lisboa, Portugal
5 Instituto de Investigação Científica Bento da Rocha Cabral, Calçada Bento da Rocha Cabral 14, 1250-012 Lisboa, Portugal
6 Departamento de Matemática—FCEE, Campus Universitário da Penteada, Universidade da Madeira, 9020-105 Funchal, Portugal
7 Escola Superior de Tecnologia e Gestão, Instituto Politécnico de Leiria, Apartado 4133, 2411-901 Leiria, Portugal
* Author to whom correspondence should be addressed.
AppliedMath 2024, 4(3), 1128-1142; https://doi.org/10.3390/appliedmath4030060
Submission received: 1 August 2024 / Revised: 27 August 2024 / Accepted: 3 September 2024 / Published: 5 September 2024

Abstract

The classical tests for combining p-values use suitable statistics $T(P_1, \ldots, P_n)$, which are based on the assumption that the observed p-values are genuine, i.e., under null hypotheses, are observations from independent and identically distributed $\mathrm{Uniform}(0,1)$ random variables $P_1, \ldots, P_n$. However, the phenomenon known as publication bias, which generally results from the publication of studies that reject null hypotheses of no effect or no difference, can tempt researchers to replicate their experiments, generally no more than once, with the aim of obtaining “better” p-values and reporting the smaller of the two observed p-values, to increase the chances of their work being published. When such “fake p-values” exist, they tamper with the statistic $T(P_1, \ldots, P_n)$ because they are observations from a $\mathrm{Beta}(1,2)$ distribution. If present, the right model for the random variables $P_k$ is described as a tilted Uniform distribution, also called a Mendel distribution, since it was underlying Fisher’s critique of Mendel’s work. Therefore, methods for combining genuine p-values are reviewed, and it is shown how quantiles of classical combining test statistics, allowing a small number of fake p-values, can be used to make an informed decision when jointly combining fake (from Two P) and genuine (from not Two P) p-values.

1. Introduction

The concept of p-value is generally credited to Pearson [1], although it was implicitly used much earlier by Arbuthnot [2] in 1710. Defined as the probability of obtaining, under a null hypothesis, a result that is as extreme or more extreme than the one observed, it was considered to be an informal index to assess the discrepancy between the data and the hypothesis under investigation. The use of p-values gained popularity with Sir Ronald Fisher [3,4], and about their use, Fisher [5] states that “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this [P = 0.05] level of significance”. Therefore, the question of reproducibility of results was naturally raised (cf. Greenwald et al. [6], or Colquhoun [7]), which in turn demanded the p-values collected from replicated experiments to be summarized into a combined p-value. In 1931, Tippett [8], a co-worker of Fisher, performed the first meta-analysis of p-values, and in 1932, Fisher himself [9] suggested a method for combining p-values.
The classical combined test procedures assume that the observed p-values, $p_1, \ldots, p_n$, are, under null hypotheses $H_{0k}$, $k = 1, \ldots, n$, of no difference or no effect, observations from independent random variables $P_k \sim \mathrm{Uniform}(0,1)$, which is an immediate consequence of the probability integral transform theorem. It is then said that a $p_k$ from $P_k \sim \mathrm{Uniform}(0,1)$ is a genuine (or a true) p-value.
Section 2 describes some classical methods for combining p-values, using either their values directly, for example through order statistics or Pythagorean means, or using basic transformations of standard uniform random variables, such as $\ln P_k$ and $\Phi^{-1}(P_k)$, where $\Phi^{-1}$ is the inverse of the standard Gaussian cumulative distribution function, or the logit function $\ln\frac{P_k}{1-P_k}$. For additional p-values combinations, see Brilhante et al. [10].
Today there is an intense debate on whether significance testing, and therefore the use of p-values, is an acceptable scientific research tool; see, for instance, the editorials in The American Statistician vol. 70 (Wasserstein and Lazar [11]) and vol. 73 (Wasserstein et al. [12]) on the topic. Traditionally, however, low p-values were a valid passport for being published. This has created a so-called file drawer problem due to publication bias. As with other techniques used in meta-analysis, publication bias can easily lead to false conclusions. In fact, the set of available p-values comes mainly from studies considered worthy of publication because the observed p-values were small, presumably indicating significant results. Thus, the assumption that the $p_k$’s are observations from independent $\mathrm{Uniform}(0,1)$ random variables is quite questionable since generally they are a set of low order statistics, given that p-values greater than 0.05 have a lower chance of being published.
One way of assessing publication bias is by computing the number of non-significant p-values that would be needed to reverse the decision to reject the overall hypothesis based on a set of available p-values. For example, Jin et al. [13] and Lin and Chu [14] give interesting overviews of how to deal with publication bias. Givens et al. [15] also provide a deep insight into publication bias in meta-analysis, namely using data-augmentation techniques.
Publication bias is also the cause of poor scientific practices, in some cases even fraud, especially when the replication of experiments is carried out with the intent of, hopefully, obtaining more favorable p-values to increase the chances of publishing. While replicating experiments is legitimate and recommended to establish consistent results, replicating with the purpose of reporting the smallest of the observed p-values is an unacceptable scientific practice. If this is indeed the case, the reported “fake” p-value, being the minimum of $\nu > 1$ independent standard uniform random variables, is $\mathrm{Beta}(1,\nu)$-distributed. However, replicating experiments has a cost, either monetary or timewise, and if, after replicating an experiment only once, both p-values obtained are greater than 0.05, then the wisest decision appears to be not to continue replicating the experiment; otherwise, the smaller of the two p-values is reported, or none at all. In fact, what seems realistic to consider is either $\nu = 2$, and therefore a nuisance “fake two p-value” is reported, or $\nu = 1$, i.e., a “genuine”, not the minimum of “two p-value”, is disclosed.
In Fisher’s [16] comments about Mendel’s work, he conjectured that “the data of most, if not all, of the experiments have been falsified to agree closely with Mendel’s expectations”. Fisher made it quite clear that he suspected that Mendel’s “too good to be true” results were carefully chosen to support the hereditary theory that Mendel wanted to prove. Due to this historical background, in Section 3, we shall call Mendel distribution the model that is a mixture of a $\mathrm{Beta}(1,2)$ (or $\mathrm{Beta}(2,1)$) distribution and a $\mathrm{Uniform}(0,1)$ distribution, thus representing a mixture of “fake two p-value” and “genuine not two p-value”. We briefly explain how an extension of Deng and George’s [17] characterization of the standard uniform distribution using a Mendel random variable instead of a uniform random variable can be considered to test the uniformity of a set of p-values or determine if it is contaminated with fake p-values.
In Section 4, an example is given to illustrate how to use the critical values from the tables in Brilhante et al.’s [10] supplementary materials for jointly combining genuine and fake p-values using classical combining methods. The example shows that a thorough comparison should always be made, since most likely there is no reliable information that rules out the existence of fake p-values that have resulted from bad scientific practices, and therefore it is important to acknowledge their potential effects when performing a meta-analysis of p-values.
In Section 5, further developments for combining p-values are reviewed, with a very brief reference to the recent research field on e-values. Finally, Section 6 reinforces the recommendation that when extending the usual combined tests to include genuine and fake p-values, they should be compared with each other in terms of the conclusions drawn for an informed final decision.

2. An Overview of Classical Combined Tests for p-Values

Let us assume that the p-values $p_k$ are known for testing $H_{0k}$ versus $H_{Ak}$, $k = 1, \ldots, n$, in $n$ independent studies on some common issue, and that the objective is to decide on the overall hypothesis $H_0^*$: all the $H_{0k}$ are true versus $H_A^*$: some of the $H_{Ak}$ are true. As there are many different ways in which $H_0^*$ can be false, selecting the right test is generally unfeasible. On the other hand, combining the available $p_k$’s so that a function $T(p_1, \ldots, p_n)$ is the observed value of a random variable with a known sampling distribution under $H_0^*$ is a simple problem, since under $H_0^*$, $(p_1, \ldots, p_n)$ is the observed value of a random sample $(P_1, \ldots, P_n)$ from a $\mathrm{Uniform}(0,1)$ distribution. In fact, several different and reasonable combined testing procedures are often used with suitable functions of the $p_k$’s. Moreover, it should be guaranteed that a combined procedure is monotone, in the sense that if one set of p-values $(p_1, \ldots, p_n)$ leads to the rejection of the overall null hypothesis $H_0^*$, then any set of component-wise smaller p-values $(p_1', \ldots, p_n')$, i.e., $p_k' \le p_k$, $k = 1, \ldots, n$, must also lead to its rejection.
Tippett [8] used the statistic
$$T_T(P_1, \ldots, P_n) = \min\{P_1, \ldots, P_n\} = P_{1:n}.$$
From the fact that $P_{1:n} \mid H_0^* \sim \mathrm{Beta}(1,n)$, the criterion for rejecting $H_0^*$ at a significance level $\alpha$ is $p_{1:n} < 1 - (1-\alpha)^{1/n}$. Tippett’s method is a special case of Wilkinson’s method [18], which recommends that $H_0^*$ should be rejected when some observed order statistic $p_{k:n} < c$. As $P_{k:n} \mid H_0^* \sim \mathrm{Beta}(k, n+1-k)$, the cut-off point $c$ to reject $H_0^*$ is the solution of
$$\int_0^c x^{k-1}(1-x)^{n-k}\,dx = \alpha\, B(k, n+1-k),$$
where $B(p,q) = \int_0^1 x^{p-1}(1-x)^{q-1}\,dx$, $p, q > 0$, is the Beta function.
Simes [19], on the other hand, gives an interesting development of Wilkinson’s method: Let $P_{1:n}, \ldots, P_{n:n}$ be the ordered p-values for testing the overall hypothesis $H_0^*$, which should be rejected at a significance level $\alpha$ if $P_{j:n} \le j\alpha/n$ for any $j = 1, \ldots, n$.
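The Tippett and Simes decision rules described above can be sketched in a few lines of Python. This is only an illustration under the independence assumption; the function names and the sample values are hypothetical and do not come from the source.

```python
def tippett_critical_value(n, alpha):
    """Cut-off for the smallest of n genuine p-values: 1 - (1 - alpha)^(1/n)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)


def tippett_rejects(pvalues, alpha=0.05):
    """Reject the overall null when the minimum falls below the Beta(1, n) cut-off."""
    return min(pvalues) < tippett_critical_value(len(pvalues), alpha)


def simes_rejects(pvalues, alpha=0.05):
    """Reject the overall null when p_(j:n) <= j * alpha / n for at least one j."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    return any(p <= j * alpha / n for j, p in enumerate(ordered, start=1))


# Illustrative values only: Simes rejects through the second order statistic,
# while Tippett, which looks at the minimum alone, does not.
sample = [0.02, 0.02, 0.30, 0.45]
print(tippett_rejects(sample), simes_rejects(sample))
```

The example shows why Simes' rule can be seen as a development of the order-statistic idea: it checks every ordered p-value against its own threshold rather than a single one.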
Another way of constructing combined p-values is to use functions of standard uniform random variables. Fisher [9] suggested the use of the statistic
$$T_F(P_1, \ldots, P_n) = -2\sum_{k=1}^{n} \ln P_k,$$
since $-2\ln P_k \sim \chi^2_2$ when $P_k \sim \mathrm{Uniform}(0,1)$, $k = 1, \ldots, n$. As $-2\sum_{k=1}^{n} \ln P_k \mid H_0^* \sim \chi^2_{2n}$, the criterion for rejecting $H_0^*$ at a significance level $\alpha$ is $-2\sum_{k=1}^{n} \ln p_k > \chi^2_{2n,1-\alpha}$, with $\chi^2_{m,p}$ denoting the p-th quantile of the chi-square distribution with $m$ degrees of freedom.
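As a minimal sketch of Fisher's method, the combined p-value can be computed with the Python standard library alone, because the chi-square distribution with an even number of degrees of freedom has a closed-form upper tail (a finite Poisson-type sum). The function names are hypothetical.

```python
import math


def fisher_statistic(pvalues):
    """T_F = -2 * sum(ln p_k)."""
    return -2.0 * sum(math.log(p) for p in pvalues)


def chi2_sf_even(x, df):
    """P(chi-square with even df > x) = exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total


def fisher_combined_pvalue(pvalues):
    """Upper tail of chi-square with 2n degrees of freedom at the observed T_F."""
    return chi2_sf_even(fisher_statistic(pvalues), 2 * len(pvalues))


print(fisher_combined_pvalue([0.04, 0.20, 0.11]))  # illustrative p-values
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check of the transformation.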
Tippett’s method illustrates the direct use of standard uniform random variables, while Fisher’s method shows the use of transformed standard uniform random variables. Moreover, Fisher’s method is often the most efficient way of making use of all the information available, whereas Tippett’s method disregards almost all available information. Therefore, these two methods can be viewed as two extreme cases.
Combining p-values using functions of their sums or products, namely their arithmetic mean or their geometric mean, is also feasible but less appealing than Fisher’s chi-square transformation method. Edgington [20] suggested the use of the arithmetic mean as a test statistic, i.e.,
$$T_E(P_1, \ldots, P_n) = \bar{P}_n = \frac{1}{n}\sum_{k=1}^{n} P_k,$$
but it has a very cumbersome probability density function, defined as
$$f_{\bar{P}_n}(x) = \frac{n}{\Gamma(n)} \sum_{j=0}^{\lfloor nx \rfloor} (-1)^j \binom{n}{j} \left(\max\{0, nx - j\}\right)^{n-1} I_{[0,1)}(x),$$
with $\lfloor x \rfloor$ being the largest integer not greater than $x$ and $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$, $\alpha > 0$, Euler’s Gamma function. However, if $n$ is large, an approximation based on the central limit theorem can be used to perform an overall test on $H_0^*$ versus $H_A^*$, but it is not consistent, in the sense that it can fail to reject the overall test’s null hypothesis, even though the results of some of the individual tests are extremely significant.
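The density of the arithmetic mean, although cumbersome on paper, is short to evaluate numerically. The sketch below uses the Irwin-Hall form of the distribution of a sum of independent standard uniforms; the function names are hypothetical, and the alternating sum may lose accuracy for large n.

```python
import math


def edgington_pdf(x, n):
    """Density of the mean of n independent Uniform(0, 1) variables."""
    if not 0.0 <= x < 1.0:
        return 0.0
    total = 0.0
    for j in range(math.floor(n * x) + 1):
        total += (-1) ** j * math.comb(n, j) * (n * x - j) ** (n - 1)
    return n * total / math.factorial(n - 1)


def edgington_cdf(x, n):
    """P(mean <= x), obtained by integrating the density term by term."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    total = 0.0
    for j in range(math.floor(n * x) + 1):
        total += (-1) ** j * math.comb(n, j) * (n * x - j) ** n
    return total / math.factorial(n)


# Small observed means are evidence against the overall null,
# so the lower tail plays the role of the combined p-value.
print(edgington_cdf(0.3280, 13))
```

For n = 2 the density reduces to the familiar triangular shape on (0, 1), which gives an easy check of the formula.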
Pearson’s [21] proposal for combining p-values is based on their product, i.e., on the statistic
$$T_P(P_1, \ldots, P_n) = \prod_{k=1}^{n} P_k,$$
which under $H_0^*$ has a probability density function
$$f_{T_P}(x) = \frac{(-\ln x)^{n-1}}{\Gamma(n)}\, I_{(0,1)}(x).$$
In other words, $\prod_{k=1}^{n} P_k \mid H_0^* \sim \mathrm{BetaBoop}(1,1,1,n)$ (see Brilhante et al. [22] for more details on BetaBoop random variables). Consequently, the geometric mean
$$G_n = T_{G_n}(P_1, \ldots, P_n) = \prod_{k=1}^{n} P_k^{1/n}$$
has a cumulative distribution function
$$F_{G_n}(x) = \frac{\Gamma(n, -n\ln x)}{\Gamma(n)}\, I_{(0,1)}(x) + I_{[1,\infty)}(x),$$
where $\Gamma(\alpha, z) = \int_z^\infty x^{\alpha-1} e^{-x}\,dx$, $\alpha, z > 0$, is the upper incomplete Gamma function. The critical quantiles $g_{n,1-\alpha}$ of $G_n$ can easily be computed from the critical quantiles $g^*_{n,1-\alpha}$ of $G_n^n = T_P(P_1, \ldots, P_n)$, where $\int_0^{g^*_{n,1-\alpha}} \frac{(-\ln x)^{n-1}}{\Gamma(n)}\,dx = 1-\alpha$, since $g_{n,1-\alpha} = (g^*_{n,1-\alpha})^{1/n}$.
Note, however, that using products of standard uniform random variables or adding their logarithms provides essentially the same information, as recognized by Pearson [21] in his final remark, and hence, it is more convenient to use Fisher’s statistic.
In 1934, Pearson [23] considered that in a bilateral framework, it would be more appropriate to use the statistic
$$T_P^*(P_1, \ldots, P_n) = \min\left\{\prod_{k=1}^{n} P_k,\ \prod_{k=1}^{n} (1-P_k)\right\}.$$
Owen [24] suggested a simple modified version of $T_P^*(P_1, \ldots, P_n)$, namely the statistic
$$T_O(P_1, \ldots, P_n) = \max\left\{-2\sum_{k=1}^{n} \ln P_k,\ -2\sum_{k=1}^{n} \ln(1-P_k)\right\},$$
for which he recommends a Bonferroni correction to establish lower and upper bounds for the computation of probabilities. Another alternative to $T_P^*(P_1, \ldots, P_n)$ is Pearson’s [23] minimum of geometric means statistic,
$$T_{\min\{G_n, G_n^*\}}(P_1, \ldots, P_n) = \min\left\{\prod_{k=1}^{n} P_k^{1/n},\ \prod_{k=1}^{n} (1-P_k)^{1/n}\right\}.$$
Also, concerning the use of transformed p-values, Stouffer et al. [25] used as a test statistic
$$T_S(P_1, \ldots, P_n) = \sum_{k=1}^{n} \frac{\Phi^{-1}(P_k)}{\sqrt{n}}.$$
Since $T_S(P_1, \ldots, P_n) \mid H_0^* \sim \mathrm{Gaussian}(0,1)$, the criterion for rejecting $H_0^*$ at a significance level $\alpha$ is $\sum_{k=1}^{n} \frac{\Phi^{-1}(P_k)}{\sqrt{n}} > z_{1-\alpha}$, with $z_p$ denoting the p-th quantile of the standard Gaussian distribution.
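A short sketch of the Gaussian-score combination follows, using `statistics.NormalDist` from the Python standard library. The function names are hypothetical. Because small p-values map to negative Gaussian scores, the sketch reports both tail probabilities and leaves the choice of rejection tail to the sign convention in use.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian: mean 0, standard deviation 1


def stouffer_statistic(pvalues):
    """Sum of the Gaussian scores Phi^{-1}(p_k), divided by sqrt(n)."""
    n = len(pvalues)
    return sum(STD_NORMAL.inv_cdf(p) for p in pvalues) / math.sqrt(n)


def stouffer_tails(pvalues):
    """Return (statistic, lower-tail probability, upper-tail probability)."""
    t = stouffer_statistic(pvalues)
    lower = STD_NORMAL.cdf(t)
    return t, lower, 1.0 - lower


print(stouffer_tails([0.04, 0.20, 0.11]))  # illustrative p-values
```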
A further simple transformation based on the standard uniform random variables $P_k$ and $1-P_k$ is the logit transformation $\ln\frac{P_k}{1-P_k} \sim \mathrm{Logistic}(0,1)$, which was used by Mudholkar and George [26] to construct the combined test statistic
$$T_{MG}(P_1, \ldots, P_n) = -\sum_{k=1}^{n} \ln\frac{P_k}{1-P_k}.$$
Using the approximation
$$-\left(\frac{n\pi^2(5n+2)}{3(5n+4)}\right)^{-1/2} \sum_{k=1}^{n} \ln\frac{P_k}{1-P_k} \approx t_{5n+4},$$
$H_0^*$ should be rejected at a significance level $\alpha$ if
$$-\left(\frac{n\pi^2(5n+2)}{3(5n+4)}\right)^{-1/2} \sum_{k=1}^{n} \ln\frac{p_k}{1-p_k} > t_{5n+4,1-\alpha},$$
with $t_{m,p}$ denoting the p-th quantile of Student’s t distribution with $m$ degrees of freedom.
On the other hand, Birnbaum [27] has shown that every monotone combined test procedure is admissible, i.e., provides a most powerful test against some alternative hypothesis for combining a collection of tests, and therefore is optimal for a combined testing situation whose goal is to harmonize possibly conflicting evidence or to pool inconclusive evidence. In the context of social sciences, Mosteller and Bush [28] recommend Stouffer’s method, but Littell and Folks [29,30] have shown that under mild conditions, Fisher’s method is optimal for combining independent tests.
The thorough comparison performed by Loughin [31] shows that the normal combining function performs quite well in problems where the evidence against the combined null hypothesis is spread among more than a small fraction of the individual tests. However, when the total evidence is weak, Fisher’s method is the best choice, especially when the evidence is at least moderately strong, and it is concentrated in a relatively small fraction of the individual tests. Mudholkar and George’s [26] logistic combined test manages to provide a compromise between the two previous cases. Additionally, when the total evidence against the combined null hypothesis is concentrated on one or on a few tests to be combined, Tippett’s combining function is useful.

3. Fake p-Values and Mendel Random Variables

An important issue that should be addressed before combining p-values is whether they are genuine or not. The overall alternative hypothesis $H_A^*$ states that some of the individual $H_{Ak}$ are true, and so a meta-decision on $H_0^*$ implicitly assumes that some of the $P_k$’s may not have a uniform distribution, cf. Hartung et al. [32] (pp. 81–84), and Kulinskaya et al. [33] (pp. 117–119). In fact, the uniformity of the $P_k$’s is solely the consequence of assuming that the null hypothesis is true, but this questionable assumption led Tsui and Weerahandi [34] to introduce the concept of generalized p-values. See Weerahandi [35], Hung et al. [36] and Brilhante [37], and references therein, on the concepts of generalized and random p-values.
Moreover, the assumption $P_k \mid H_0 \sim \mathrm{Uniform}(0,1)$, $k = 1, \ldots, n$, can be unrealistic. As a matter of fact, when an observed p-value is not highly significant or significant, there is a possibility that the experiment will be repeated in the hope of obtaining a “better” p-value to increase the likelihood of the research being published. However, the scientific malpractice of trying to obtain better p-values to comply with research teams’ expectations, which in some cases can be labeled as a fraudulent practice, can lead to disclosing results that are “too good to be true”, as Fisher [16] observed in his appraisal of Mendel’s work. Consult Pires and Branco [38] and Franklin [39] for more information on the famous Mendel-Fisher controversy.
If a reported $p_k$ is the “best” of $\nu_k$ observed p-values from $\nu_k$ independent replications of an experiment, i.e., is the minimum of $\nu_k$ independent $\mathrm{Uniform}(0,1)$ random variables, then $P_k \sim \mathrm{Beta}(1, \nu_k)$, which has a probability density function $f_{P_k}(x; \nu_k) = \nu_k (1-x)^{\nu_k - 1} I_{(0,1)}(x)$. Therefore, $-2\nu_k \ln(1-P_k) \sim \chi^2_2$. This also holds true for the case $\nu_k = 1$, i.e., for genuine p-values, since $-2\ln P_k \overset{d}{=} -2\ln(1-P_k) \sim \chi^2_2$ when $P_k \sim \mathrm{Uniform}(0,1)$. So, the changes needed in Fisher’s statistic are $T_F^*(P_1, \ldots, P_n) = -2\sum_{k=1}^{n} \nu_k \ln(1-P_k)$, which under $H_0^*$ is also $\chi^2_{2n}$-distributed. However, the main problem here is that there is no information on whether some of the p-values are “fake ones”, and if they do exist, which ones they are and what the corresponding values of $\nu_k$ are.
Please note that what makes the most sense is to consider either $\nu_k = 1$ or $\nu_k = 2$ because it would be a complete waste of time and resources to continue replicating an experiment if non-significant p-values keep showing up, especially if there is the (wrong) belief that a p-value is only “a good one” if it is significant. It is, therefore, assumed that $\nu_k = 1$ when a genuine p-value is reported, regardless of whether it is significant or not. However, when some researchers are dissatisfied with obtaining non-significant p-values for their (first) results, they may decide not to report them and abandon their research, or repeat the experiment once ($\nu_k = 2$). In the latter case, one of the following scenarios takes place:
(a) the second p-value is significant, and hence it is the one reported (fake p-value);
(b) the second p-value is also not significant and consequently, either the smallest of the two observed p-values is reported (fake p-value), or none is reported and the research stops.
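The modified Fisher statistic discussed above, in which each term ln(1 - p) is weighted by the number of replications behind the reported p-value, can be checked by simulation. The sketch below is an illustration only: the replication pattern is hypothetical, and in practice these counts are exactly what is unknown.

```python
import math
import random


def modified_fisher_statistic(pvalues, replications):
    """-2 * sum(r_k * ln(1 - p_k)); r_k = 1 marks a genuine p-value."""
    return -2.0 * sum(r * math.log(1.0 - p) for p, r in zip(pvalues, replications))


def reported_pvalue(rng, r):
    """Smallest of r independent Uniform(0, 1) draws, i.e. a Beta(1, r) variate."""
    return min(rng.random() for _ in range(r))


rng = random.Random(2024)
replications = [1, 2, 1, 1, 2]   # hypothetical mix of genuine and fake reports
draws = [
    modified_fisher_statistic(
        [reported_pvalue(rng, r) for r in replications], replications
    )
    for _ in range(20000)
]
# A chi-square variable with 2n = 10 degrees of freedom has mean 10.
mean_stat = sum(draws) / len(draws)
print(round(mean_stat, 2))
```

The simulated mean staying close to 2n only checks the first moment, but it illustrates that the weighting restores the chi-square reference distribution when the replication counts are known.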
From the above, if $\nu_k = 2$ (the experiment may have been run twice), then clearly the right model for $P_k$ is a mixture of the minimum of two independent $\mathrm{Uniform}(0,1)$ random variables (or a $\mathrm{Beta}(1,2)$ random variable) and a $\mathrm{Uniform}(0,1)$ random variable, i.e., with probability density function
$$f_{P_k}(x; p) = \left[\,2p(1-x) + (1-p)\,\right] I_{(0,1)}(x),$$
where $0 \le p \le 1$, and which can be reparameterized as
$$f_{P_k}(x; m) = \left[\,m(1-x) + 1 - \frac{m}{2}\,\right] I_{(0,1)}(x), \qquad (1)$$
with $m = 2p$, $m \in [0, 2]$. Therefore, in Equation (1), $\frac{m}{2}$ is the probability of a p-value being a fake p-value.
What is interesting to notice is that if the probability density function of the standard uniform distribution is tilted using the point $\left(\frac{1}{2}, 1\right)$ as a pole, then for $m \in [-2, 2]$, the right-hand side of Equation (1) is still a probability density function, more specifically, the probability density function of a Mendel random variable $X_m \sim \mathrm{Mendel}(m)$.
From Equation (1), it is straightforward to see that $X_0 \sim \mathrm{Uniform}(0,1)$, $X_2 \sim \mathrm{Beta}(1,2)$, and $X_{-2} \sim \mathrm{Beta}(2,1)$, i.e., the maximum of two independent standard uniform random variables. Moreover, if $m \in (-2, 0)$, then the Mendel distribution is a mixture of a standard uniform distribution, with weight $1 + \frac{m}{2}$, and a $\mathrm{Beta}(2,1)$ distribution, while if $m \in (0, 2)$, it is a mixture of a standard uniform distribution, with weight $1 - \frac{m}{2}$, and a $\mathrm{Beta}(1,2)$ distribution. So, the probability density function of $X_m \sim \mathrm{Mendel}(m)$, $m \in [-2, 2]$, can be expressed in the form
$$f_{X_m}(x; m) = \frac{|m|}{2}\, f_{P_{i:2}}(x) + \left(1 - \frac{|m|}{2}\right) f_P(x),$$
with $i = 1$ if $m \in (0, 2]$, or $i = 2$ if $m \in [-2, 0)$, and where $P_{1:2}$ and $P_{2:2}$ denote, respectively, the minimum and maximum of two independent standard uniform random variables, and $P \sim \mathrm{Uniform}(0,1)$.
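The Mendel model can be sketched directly from its two equivalent descriptions: the tilted density, which is linear in x and pivots around the point (1/2, 1), and the mixture of a standard uniform with the minimum or maximum of two uniforms. The function names are hypothetical; the distribution function is obtained by integrating the density.

```python
import random


def mendel_pdf(x, m):
    """Tilted uniform density m*(1 - x) + 1 - m/2 on (0, 1), for -2 <= m <= 2."""
    if not -2.0 <= m <= 2.0:
        raise ValueError("m must lie in [-2, 2]")
    return m * (1.0 - x) + 1.0 - m / 2.0 if 0.0 < x < 1.0 else 0.0


def mendel_cdf(x, m):
    """Integral of the density: x + (m/2) * x * (1 - x) on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x + 0.5 * m * x * (1.0 - x)


def mendel_sample(rng, m):
    """Mixture draw: with probability |m|/2 the min (m > 0) or max (m < 0)
    of two uniforms, otherwise a single uniform."""
    u = rng.random()
    if rng.random() < abs(m) / 2.0:
        v = rng.random()
        return min(u, v) if m > 0 else max(u, v)
    return u


rng = random.Random(11)
draws = [mendel_sample(rng, 2.0) for _ in range(50000)]
print(sum(draws) / len(draws))  # Mendel(2) is Beta(1, 2), whose mean is 1/3
```

Sampling through the mixture avoids inverting the quadratic distribution function and makes the interpretation of |m|/2 as the share of fake p-values explicit.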
An interesting fact related to the Mendel distribution is that if $X$ and $Y$ are independent random variables, both with support $[0, 1]$, and with $X \sim \mathrm{Mendel}(m)$, then
$$V = \min\left\{\frac{X}{Y}, \frac{1-X}{1-Y}\right\} \sim \mathrm{Mendel}\left((2E[Y] - 1)\,m\right),$$
which generalizes Deng and George’s [17] characterization of the standard uniform distribution when $X \sim \mathrm{Uniform}(0,1)$ (see Theorem 1 in Brilhante et al. [10]). Furthermore, if $X \sim \mathrm{Uniform}(0,1)$, then $V$ and $Y$ are independent random variables.
In particular, if $X$ and $Y$ are independent such that $X \sim \mathrm{Mendel}(m_1)$ and $Y \sim \mathrm{Mendel}(m_2)$, then
$$V = \min\left\{\frac{X}{Y}, \frac{1-X}{1-Y}\right\} \sim \mathrm{Mendel}\left(-\frac{m_1 m_2}{6}\right).$$
On the other hand, if $X \sim \mathrm{Mendel}(m)$ and $Y \sim \mathrm{Beta}(n, 1)$ are independent, then
$$V = \min\left\{\frac{X}{Y}, \frac{1-X}{1-Y}\right\} \sim \mathrm{Mendel}\left(m\,\frac{n-1}{n+1}\right),$$
while if $X \sim \mathrm{Mendel}(m)$ and $Y \sim \mathrm{Beta}(1, n)$ are independent, then
$$V = \min\left\{\frac{X}{Y}, \frac{1-X}{1-Y}\right\} \sim \mathrm{Mendel}\left(m\,\frac{1-n}{1+n}\right).$$
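The characterization above lends itself to a quick Monte Carlo check. The sketch below draws X and Y from two Mendel distributions through the mixture representation, forms V, and compares the empirical distribution function of V at one point with the Mendel value predicted by the parameter -m1*m2/6. All function names, the seed, and the simulation size are hypothetical choices.

```python
import random


def mendel_sample(rng, m):
    """Mixture draw: with probability |m|/2 the min (m > 0) or max (m < 0)
    of two uniforms, otherwise a single uniform."""
    u = rng.random()
    if rng.random() < abs(m) / 2.0:
        v = rng.random()
        return min(u, v) if m > 0 else max(u, v)
    return u


def mendel_cdf(x, m):
    """Mendel(m) distribution function on [0, 1]: x + (m/2) * x * (1 - x)."""
    return x + 0.5 * m * x * (1.0 - x)


def simulate_v(m1, m2, n_sim=100000, seed=7):
    """Draw V = min{X/Y, (1-X)/(1-Y)} with X ~ Mendel(m1), Y ~ Mendel(m2)."""
    rng = random.Random(seed)
    values = []
    while len(values) < n_sim:
        x, y = mendel_sample(rng, m1), mendel_sample(rng, m2)
        if y == 0.0:      # guard against division by zero (practically never hit)
            continue
        values.append(min(x / y, (1.0 - x) / (1.0 - y)))
    return values


m1, m2 = 2.0, 2.0
v = simulate_v(m1, m2)
empirical = sum(1 for t in v if t <= 0.5) / len(v)
theoretical = mendel_cdf(0.5, -m1 * m2 / 6.0)
print(round(empirical, 3), round(theoretical, 3))
```

Agreement at a single point is of course not a proof, but it is a cheap way to catch a sign error in the parameter of the resulting Mendel distribution.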
Please note that Equations (2) or (3) can be used to test whether a sample of p-values $(p_1, \ldots, p_n)$ are observations from a $\mathrm{Uniform}(0,1)$, a $\mathrm{Mendel}(-2)$, or a $\mathrm{Mendel}(2)$ distribution, which is very useful to increase the test’s power when the sample size is small (see Gomes et al. [40] and Brilhante et al. [41] for more details). For this purpose, setting $x_k = p_k$ and generating $y_k$, the values $v_k = \min\left\{\frac{x_k}{y_k}, \frac{1-x_k}{1-y_k}\right\}$ are obtained, and therefore to test, for instance, the uniformity of the sample $(p_1, \ldots, p_n)$, one tests the uniformity of the pseudo-random sample $(p_1, \ldots, p_n, v_1, \ldots, v_n)$.

4. Combining Genuine and Fake p-Values

It is generally impossible to know whether or not there are fake p-values among the set of p-values to be combined. Therefore, a realistic approach is to examine possible scenarios and assess how the probable existence of fake p-values in a sample can affect the decision on the overall hypothesis $H_0^*$. For this purpose, tables with critical quantiles for p-values’ combination methods that take into account the existence of fake p-values in a sample, most likely in a very small number, can be useful to give an overall picture.
Such tables are given in Brilhante et al.’s [10] supplementary materials for the most commonly used combined test statistics, where it is assumed that among the $n$ ($n = 3, \ldots, 28$) p-values to be combined there are at most $n_f = 0, 1, \ldots, \max\{3, \lfloor n/3 \rfloor\}$ fake ones. The usefulness of the tables is illustrated with Example 1.
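Critical quantiles of this kind can be approximated by simulation. The following sketch is not the procedure used to build the published tables; it assumes that each fake p-value is the smaller of two independent standard uniforms, as described in Section 3, and estimates a quantile of Fisher's statistic by plain Monte Carlo. The function name, seed, and simulation size are hypothetical.

```python
import math
import random


def simulate_fisher_quantile(n, n_fake, prob, n_sim=40000, seed=1):
    """Monte Carlo quantile of T_F = -2 * sum(ln p_k) when n_fake of the n
    p-values are minima of two uniforms and the others are genuine."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_sim):
        total = 0.0
        for k in range(n):
            p = 1.0 - rng.random()              # uniform on (0, 1], avoids log(0)
            if k < n_fake:                      # fake: keep the smaller of two runs
                p = min(p, 1.0 - rng.random())
            total += math.log(p)
        stats.append(-2.0 * total)
    stats.sort()
    return stats[min(n_sim - 1, int(prob * n_sim))]


# Estimated 0.95 quantiles for n = 13 with no fake and with one fake p-value.
print(simulate_fisher_quantile(13, 0, 0.95), simulate_fisher_quantile(13, 1, 0.95))
```

With no fake p-values the estimate should approach the chi-square quantile with 26 degrees of freedom; each additional fake p-value shifts the quantile upward because minima of two uniforms are stochastically smaller.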
Example 1. 
For the set of $n = 13$ p-values obtained in studies on depressive effects of a weekly 1 mg dose of semaglutide
0.0571 0.5954 0.0249 0.4793 0.1792 0.2917 0.6783 0.0554 0.2805 0.8137 0.2824 0.3338 0.1923
the observed values for the combined test statistics are:
$T_F(0.0571, \ldots, 0.1923) = 39.0602$
$T_S(0.0571, \ldots, 0.1923) = 2.0842$
$T_{MG}(0.0571, \ldots, 0.1923) = 13.1940$
$T_{G_{13}}(0.0571, \ldots, 0.1923) = 0.2226$
$T_{\min\{G_{13}, G_{13}^*\}}(0.0571, \ldots, 0.1923) = 0.2226$
$T_E(0.0571, \ldots, 0.1923) = 0.3280$
$T_T(0.0571, \ldots, 0.1923) = 0.0249$
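The observed values listed above can be recomputed from the 13 p-values with the Python standard library. In this sketch the dictionary keys are hypothetical labels, and two sign conventions are assumptions: the logit statistic is taken as minus the sum of the logits (which gives a positive value for this sample), and the Gaussian-score sum is used as written in Section 2, so its value comes out negative with the reported magnitude.

```python
import math
from statistics import NormalDist

P_VALUES = [0.0571, 0.5954, 0.0249, 0.4793, 0.1792, 0.2917, 0.6783,
            0.0554, 0.2805, 0.8137, 0.2824, 0.3338, 0.1923]


def combined_statistics(pvalues):
    """Observed values of the classical combining statistics."""
    n = len(pvalues)
    log_p = [math.log(p) for p in pvalues]
    log_q = [math.log(1.0 - p) for p in pvalues]
    inv = NormalDist().inv_cdf
    g = math.exp(sum(log_p) / n)          # geometric mean of the p-values
    g_star = math.exp(sum(log_q) / n)     # geometric mean of 1 - p
    return {
        "fisher": -2.0 * sum(log_p),
        "stouffer": sum(inv(p) for p in pvalues) / math.sqrt(n),
        "logit": -(sum(log_p) - sum(log_q)),   # assumed sign convention
        "geometric_mean": g,
        "min_geometric_means": min(g, g_star),
        "arithmetic_mean": sum(pvalues) / n,
        "tippett": min(pvalues),
    }


for name, value in combined_statistics(P_VALUES).items():
    print(f"{name:>20s} {value: .4f}")
```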
The quantiles for n = 13 are extracted from the tables in [10] (without the standard errors) for the following methods: Fisher (Table 1), Stouffer (Table 2), Mudholkar and George (Table 3), Pearson’s geometric mean (Table 4), Pearson’s minimum of geometric means (Table 5), Edgington’s arithmetic mean (Table 6) and Tippett (Table 7).
The quantiles that lead to the rejection of $H_0^*$ are highlighted for each method, thus showing for which significance level $\alpha \in \{0.005, 0.01, 0.025, 0.05, 0.1\}$ this happens.
  • Fisher’s Statistic $T_F = 39.0602$
Table 1. Estimated quantiles of $T_F$ with $n_f$ fake p-values.
 n   n_f     0.900     0.950     0.975     0.990     0.995
13    0    35.5632   38.8851   41.9232   45.6417   48.2899
13    1    36.6548   40.0294   43.0852   46.7821   49.4752
13    2    37.7241   41.1053   44.1240   47.9461   50.6022
13    3    38.8119   42.1576   45.2735   49.0533   51.7268
13    4    39.9069   43.2759   46.2994   50.1729   52.9071
  • Stouffer et al.’s Statistic $T_S = 2.0842$
Table 2. Estimated quantiles of $T_S$ with $n_f$ fake p-values.
 n   n_f    0.900    0.950    0.975    0.990    0.995
13    0    1.2815   1.6448   1.9600   2.3264   2.5758
13    1    1.1087   1.4720   1.7844   2.1391   2.3767
13    2    0.9350   1.2924   1.6009   1.9524   2.1995
13    3    0.7620   1.1079   1.4117   1.7670   2.0049
13    4    0.5908   0.9312   1.2345   1.5756   1.8255
  • Mudholkar and George’s Statistic $T_{MG} = 13.1940$
Table 3. Estimated quantiles of $T_{MG}$ with $n_f$ fake p-values.
 n   n_f     0.900     0.950     0.975     0.990     0.995
13    0     8.3859   10.7850   12.8627   15.3892   17.1365
13    1     9.2840   11.6589   13.7337   16.2667   17.9027
13    2    10.1682   12.5187   14.5952   17.1134   18.8075
13    3    11.0523   13.3512   15.4344   17.9587   19.5983
13    4    11.9532   14.2848   16.2954   18.7587   20.4252
  • Pearson’s Geometric Mean Statistic $T_{G_n} = 0.2226$
Table 4. Estimated quantiles of $T_{G_n}$ with $n_f$ fake p-values.
 n   n_f     0.005     0.010     0.025     0.050     0.100
13    0    0.15609   0.17283   0.19940   0.22412   0.25466
13    1    0.14919   0.16544   0.19070   0.21448   0.24420
13    2    0.14287   0.15820   0.18323   0.20578   0.23436
13    3    0.13684   0.15162   0.17531   0.19762   0.22476
13    4    0.13075   0.14522   0.16853   0.18930   0.21549
  • Pearson’s Minimum of Geometric Means Statistic $T_{\min\{G_n, G_n^*\}} = 0.2226$
Table 5. Estimated quantiles of $T_{\min\{G_n, G_n^*\}}$ with $n_f$ fake p-values.
 n   n_f     0.005     0.010     0.025     0.050     0.100
13    0    0.14144   0.15578   0.17882   0.19940   0.22388
13    1    0.14177   0.15608   0.17876   0.19939   0.22400
13    2    0.13916   0.15370   0.17667   0.19710   0.22212
13    3    0.13536   0.14960   0.17235   0.19326   0.21799
13    4    0.13019   0.14438   0.16717   0.18746   0.21216
  • Edgington’s Arithmetic Mean Statistic $T_E = 0.3280$
Table 6. Estimated quantiles of $T_E$ with $n_f$ fake p-values.
 n   n_f     0.005     0.010     0.025     0.050     0.100
13    0    0.29609   0.31496   0.34333   0.36774   0.39629
13    1    0.28659   0.30475   0.33258   0.35682   0.38486
13    2    0.27780   0.29548   0.32267   0.34616   0.37356
13    3    0.26796   0.28591   0.31225   0.33557   0.36253
13    4    0.25868   0.27644   0.30163   0.32471   0.35112
  • Tippett’s Minimum Statistic $T_T = 0.0249$
Table 7. Quantiles of $T_T$ with $n_f$ fake p-values.
 n   n_f     0.005     0.010     0.025     0.050     0.100
13    0    0.00039   0.00077   0.00195   0.00394   0.00807
13    1    0.00036   0.00072   0.00181   0.00366   0.00750
13    2    0.00033   0.00067   0.00169   0.00341   0.00700
13    3    0.00031   0.00063   0.00158   0.00320   0.00656
13    4    0.00029   0.00059   0.00149   0.00301   0.00618
For this example, Fisher’s method shows some stability when it comes to deciding on $H_0^*$, even when a small number of fake p-values can exist in the sample, and thus it seems robust to a prior choice of a significance level $\alpha$ (usually 0.05). The same can be said of Pearson’s geometric mean method, which is, in fact, equivalent to Fisher’s method. The runner-up is Mudholkar and George’s method, which in traditional contexts has been shown to be a compromise between Fisher’s and Stouffer’s methods. Please note that Stouffer’s method, recommended in the social sciences, looks less reliable in this case. Clearly, Tippett’s method should be avoided, despite being the simplest of them all and having a very uncomplicated sampling distribution for its statistic, even when $n_f$ fake p-values exist, since $P_{1:n} \mid H_0^* \sim \mathrm{Beta}(1, n + n_f)$.
This example reinforces, to some extent, the general belief that Fisher’s combined test (or Pearson’s equivalent geometric mean test) should be used, even in a wider context of jointly combining genuine and fake p-values. However, a more in-depth study is needed to support such a conclusion, but this is beyond the scope of this review paper.
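The Beta(1, n + n_f) distribution of the minimum mentioned in the example gives closed-form quantiles for Tippett's statistic, since the smallest of n - n_f genuine p-values and n_f minima of two uniforms is the smallest of n + n_f independent uniforms. A minimal sketch, with a hypothetical function name, is:

```python
def tippett_quantile(n, n_fake, alpha):
    """alpha-quantile of Beta(1, n + n_fake): 1 - (1 - alpha)^(1 / (n + n_fake))."""
    return 1.0 - (1.0 - alpha) ** (1.0 / (n + n_fake))


# Quantiles for n = 13 and up to four fake p-values, rounded to five decimals.
levels = (0.005, 0.010, 0.025, 0.050, 0.100)
for n_fake in range(5):
    print(n_fake, [round(tippett_quantile(13, n_fake, a), 5) for a in levels])
```

The printed rows can be compared with the entries of Table 7; the observed minimum 0.0249 lies above every one of these cut-offs, so Tippett's rule does not reject at any of the listed levels.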

5. Further Developments in Combining p-Values

There are many other modifications and generalizations of the classical test statistics for combining genuine p-values than those discussed in Section 2.
Fisher’s statistic is the most widely used for combining p-values and has therefore been the subject of several generalizations, namely weighted versions. The discussion of conceptual advantages of weighting p-values, for instance, to improve the power of the combination method, goes back as far as Good [42]. In regard to the weighted combination of independent probabilities, see also Bhoj [43]. The combination of dependent p-values and the combination of weighted p-values are intertwined topics; see Chuang and Shih [44], Hou [45], Makambi [46], and Yang [47], and also, for instance, Alves and Yu [48].
Lancaster [49] generalized Fisher’s method by transforming p-values using the chi-squared distribution with $d_k$ degrees of freedom,
$$T_L(P_1, \ldots, P_n) = \sum_{k=1}^{n} F^{-1}_{\chi^2_{d_k}}(1 - P_k),$$
where $F^{-1}_{\chi^2_{d_k}}$ is the inverse of the chi-square cumulative distribution function with $d_k$ degrees of freedom, so that in an independent setup, $T_L \mid H_0^* \sim \chi^2_{\sum_{k=1}^{n} d_k}$. Chen’s [50] numerical comparisons indicate that Lancaster’s statistic $T_L$ has a higher power than the traditional combination rules described in Section 2. Dai et al. [51] combined dependent p-values using approximations to the distribution of $T_L$, obtaining higher Bahadur efficiency than with a weighted version of the z-test.
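As a worked check (not taken from the source) of why Lancaster's statistic generalizes Fisher's, consider the special case in which every transformation uses two degrees of freedom. The chi-square distribution function with two degrees of freedom can be inverted explicitly:

```latex
\[
F_{\chi^2_2}(x) = 1 - e^{-x/2}, \quad x > 0
\quad\Longrightarrow\quad
F^{-1}_{\chi^2_2}(1 - p) = -2 \ln p ,
\]
\[
T_L(P_1, \ldots, P_n) = \sum_{k=1}^{n} F^{-1}_{\chi^2_2}(1 - P_k)
= -2 \sum_{k=1}^{n} \ln P_k = T_F(P_1, \ldots, P_n),
\qquad \sum_{k=1}^{n} d_k = 2n .
\]
```

So with all degrees of freedom equal to 2, the statistic and its chi-square reference distribution with 2n degrees of freedom coincide with those of Fisher's method.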
Hou and Yang [52] developed a weighted version of Lancaster’s statistic, namely
$$T_{HY}(P_1, \dots, P_n) = \sum_{k=1}^{n} w_k\, F^{-1}_{\chi^2_{d_k}}(1 - P_k).$$
Regardless of whether $P_1, \dots, P_n$ are independent or not, $T_{HY}$ is approximately distributed as $c\,\chi^2_f$, and by equating expectations and variances, i.e., $E(T_{HY}) = E(c\,\chi^2_f) = c f$ and $\mathrm{Var}(T_{HY}) = \mathrm{Var}(c\,\chi^2_f) = 2 c^2 f$, the parameter $c$ can be estimated considering that
$$c = \frac{\mathrm{Var}(T_{HY})}{2\,E(T_{HY})} = \frac{\sum_{k=1}^{n} w_k^2 d_k + \sum_{k=1}^{n} \sum_{j<k} w_k w_j\, \mathrm{Cov}\!\left(F^{-1}_{\chi^2_{d_k}}(1 - P_k),\, F^{-1}_{\chi^2_{d_j}}(1 - P_j)\right)}{\sum_{k=1}^{n} w_k d_k},$$
and the parameter $f$ by considering
$$f = \frac{2\,E(T_{HY})^2}{\mathrm{Var}(T_{HY})} = \frac{\left(\sum_{k=1}^{n} w_k d_k\right)^2}{\sum_{k=1}^{n} w_k^2 d_k + \sum_{k=1}^{n} \sum_{j<k} w_k w_j\, \mathrm{Cov}\!\left(F^{-1}_{\chi^2_{d_k}}(1 - P_k),\, F^{-1}_{\chi^2_{d_j}}(1 - P_j)\right)}.$$
It then follows that the $(1-\alpha)100$-th percentile of the distribution of $T_{HY}(P_1, \dots, P_n)$ can be approximated by $c\, F^{-1}_{\chi^2_f}(1-\alpha)$.
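The moment-matching step above can be sketched numerically as follows. This is only an illustration, assuming NumPy and SciPy; the function names are hypothetical, and the matrix `cov` is assumed to hold, off the diagonal, the covariances between the chi-square-transformed p-values, which in practice would have to be known or estimated separately.

```python
import numpy as np
from scipy.stats import chi2


def scaled_chi2_parameters(weights, dofs, cov=None):
    """Return (c, f) such that T_HY is approximated by c * chi-square(f).

    weights: w_1, ..., w_n
    dofs:    d_1, ..., d_n
    cov:     n x n matrix whose (k, j) entry, k != j, is the covariance between
             the k-th and j-th transformed p-values (assumed known);
             None means independence (all covariances zero).
    """
    w = np.asarray(weights, dtype=float)
    d = np.asarray(dofs, dtype=float)
    n = w.size
    cov = np.zeros((n, n)) if cov is None else np.asarray(cov, dtype=float)
    mean = np.sum(w * d)                                 # E(T_HY)
    # sum over k and j < k of w_k * w_j * Cov(., .): strictly lower triangle
    cross = np.tril(np.outer(w, w) * cov, k=-1).sum()
    half_var = np.sum(w ** 2 * d) + cross                # Var(T_HY) / 2
    return half_var / mean, mean ** 2 / half_var


def approximate_percentile(c, f, alpha):
    """Approximate (1 - alpha)100-th percentile of T_HY: c * F^{-1}_{chi2_f}(1 - alpha)."""
    return c * chi2.ppf(1.0 - alpha, f)
```

With unit weights, two degrees of freedom per p-value and no dependence, the sketch returns c = 1 and f = 2n, i.e., the usual chi-square reference distribution of Fisher’s statistic.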
Zhang and Wu [53] investigated a general family of Fisher’s type of statistics referred to as the GFisher, which covers many classical statistics. Systematic simulations show that new p-value calculation methods based on moment-ratio matching and joint distribution surrogating are more accurate under the multivariate Gaussian and more robust under the generalized linear model and the multivariate t distribution. Relevant computation has been implemented in the R package GFisher, which is available in the Comprehensive R Archive Network.
The poolr package (Cinar and Viechtbauer [54]) provides an implementation of a variety of methods for combining p-values, including the inverse chi-square method (Liu [55]), a binomial test (Wilkinson [18]) and a Bonferroni/Holm method [56], which is an alternative to Simes’ test [19]. Using an empirically derived null distribution based on pseudo-replicates that mimics a proper permutation test, an adjustment to account for dependence among the tests from which the p-values have been derived is made assuming multivariate normality among the test statistics. The poolr package has been compared with several other packages that can be used to combine p-values. Dewey’s [57] metap v1.9 package provides an implementation of a wide variety of methods for combining independent p-values described in Becker [58].
Liu and Xie [59] suggested a statistic defined as a weighted sum of the Cauchy transformation of individual p-values, implying that the tail of the null distribution can be well approximated by a Cauchy distribution under arbitrary dependency structures. The p-value calculation of the test is accurate and as simple as the classical z-test or t-test, making it well suited for analyzing massive data. In addition, Ham and Park [60] reported that the Cauchy combination test provided the best combined p-value, in the sense that it had the best performance among the methods they examined while controlling type I error rates.
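For concreteness, a minimal sketch of the Cauchy combination test is given below. It assumes the usual form of the statistic, a weighted sum of tan((0.5 − p_k)π) with nonnegative weights summing to one, and the standard Cauchy tail approximation for the combined p-value; the function name and the default of equal weights are choices made here for illustration.

```python
import numpy as np


def cauchy_combination(p_values, weights=None):
    """Cauchy combination statistic and its approximate combined p-value."""
    p = np.asarray(p_values, dtype=float)
    if weights is None:
        w = np.full(p.size, 1.0 / p.size)   # equal weights (assumption)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                      # normalize so the weights sum to 1
    # Each p-value is mapped to a standard Cauchy variate under the null
    t = float(np.sum(w * np.tan((0.5 - p) * np.pi)))
    # Upper tail probability of a standard Cauchy distribution at t
    combined_p = 0.5 - np.arctan(t) / np.pi
    return t, float(combined_p)
```

When all the p-values are equal, the combined p-value equals that common value, which is a convenient sanity check for the transformation.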
As the independence assumption is clearly a strong limitation when it comes to combining p-values, in 1975, Brown [61] discussed a method for combining non-independent tests of significance. The combination of p-values in correlated setups, for instance, in genome research requiring the analysis of Big Data, is currently a very active field of research, cf. Makambi [46], Hou [45], Yang [62], and Chuang and Shih [44]. In 2002, Kost and McDermott [63] derived an approximation to the null distribution of Fisher’s statistic for combining p-values when the underlying test statistics are jointly distributed as a multivariate t with a common denominator.
As already mentioned, Fisher’s statistic is the most used for combining p-values, and generalizing it for dependence contexts has also been a constantly revisited research topic (see, for instance, Yang [47], Dai et al. [51] or Li et al. [64]). Chen [65] investigated a new Gamma-based combination of p-values, based on the test statistic
$$T_{G(\alpha, 1/\delta)}(P_1, \dots, P_n) = \sum_{k=1}^{n} F^{-1}_{G(\alpha, 1/\delta)}(1 - P_k),$$
where $F^{-1}_{G(\alpha, 1/\delta)}$ denotes the inverse of the Gamma cumulative distribution function with shape parameter $\alpha$ and scale parameter $1/\delta$, and showed that in many situations it provides an asymptotically Uniformly Most Powerful test.
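A small sketch of the Gamma-based statistic under independence follows, assuming SciPy. The function name and the argument name `rate` (for δ, so that the scale is 1/δ) are hypothetical. The null reference distribution used here relies on the standard fact that a sum of n independent Gamma(α, 1/δ) variables is Gamma(nα, 1/δ).

```python
from scipy.stats import gamma


def gamma_combine(p_values, shape, rate):
    """Gamma-based combination of independent p-values.

    shape: the Gamma shape parameter (alpha in the text).
    rate:  delta in the text, so the Gamma scale parameter is 1 / rate.
    """
    scale = 1.0 / rate
    # gamma.isf(p, ...) is the inverse Gamma CDF evaluated at 1 - p
    t_g = float(sum(gamma.isf(p, shape, scale=scale) for p in p_values))
    # Sum of n independent Gamma(shape, scale) variables is Gamma(n * shape, scale)
    combined_p = float(gamma.sf(t_g, shape * len(p_values), scale=scale))
    return t_g, combined_p
```

Choosing shape 1 and rate 1/2 gives the chi-square distribution with 2 degrees of freedom, so the sketch then reproduces Fisher’s method.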
Wilson [66] recommends the use of the harmonic mean p-value, i.e.,
$$T_{H_n}(P_1, \dots, P_n) = \frac{n}{\sum_{k=1}^{n} 1/P_k},$$
for combining dependent p-values, since it controls the overall type I error, i.e., the probability of falsely rejecting the overall null hypothesis $H_0^*$ in favor of at least one alternative hypothesis $H_{A_k}$. It is complementary to Fisher’s method, averaging only valid p-values when these are mutually exclusive but not necessarily independent. The sampling distribution of $T_{H_n}(P_1, \dots, P_n)$ is known to be in the domain of attraction of the heavy-tailed Landau skewed additive (1,1)-stable law. The statistic is robust to positive dependency between p-values and also to the distribution of the weights $w$ used in its weighted form. Furthermore, it is insensitive to the number of tests and is mainly influenced by the smallest p-values.
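The statistic itself is straightforward to compute, as the following sketch shows. The function name is hypothetical, the optional weights are an assumption about the weighted form (they are normalized to sum to one), and the returned value is the raw harmonic mean, not a calibrated p-value obtained from the Landau approximation.

```python
def harmonic_mean_p(p_values, weights=None):
    """Raw (optionally weighted) harmonic mean of a list of p-values."""
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 for p in p_values):
        raise ValueError("p-values must be strictly positive")
    if weights is None:
        weights = [1.0 / n] * n              # equal weights give n / sum(1 / p_k)
    total = sum(weights)
    # Weighted harmonic mean: sum(w_k) / sum(w_k / p_k)
    return total / sum(w / p for w, p in zip(weights, p_values))
```

For example, for the two p-values 0.01 and 0.5 the unweighted statistic is 2/(100 + 2), which is close to twice the smallest p-value and illustrates how the smallest p-values dominate.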
Chien [67] compared the performances of Wilson’s [66] harmonic mean method and of Kost and McDermott’s [63] method to the performance of an empirical method based on the gamma distribution for combining dependent p-values from multiple hypothesis testing, which robustly controls the type I error and keeps a good power rate.
Building on recent developments in robust risk aggregation techniques, Vovk and Wang [68] studied the combination of a number of p-values without making any assumption about their dependence structure, extended those results to generalized means, and showed that $n$ p-values can be combined by scaling up their harmonic mean by a factor of $\ln n$.
E-values, defined as expectations, in contrast to p-values, defined as probabilities, are nonnegative random variables whose expected values under the null hypothesis are bounded by 1 (Shafer et al. [69]), as in Bayes factors and likelihood ratios in the case of a simple null hypothesis (Grünwald et al. [70]; Shafer et al. [69]). The combination of e-values via e-merging functions is a more recent and active field of research (cf. Grünwald et al. [70], Shafer [71], Vovk et al. [72,73], and Vuursteen et al. [74]). For instance, the product of independent e-values is clearly an e-value. However, so far, little is known about the power of these combination procedures, although this is now the main focus of research in this field.
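The statement about products can be checked in one line. As a sketch, write $E_1$ and $E_2$ for two independent e-values and $\mathbb{E}$ for expectation under the null hypothesis (notation introduced here only for illustration), so that $E_i \geq 0$ and $\mathbb{E}(E_i) \leq 1$, $i = 1, 2$. Independence then gives
$$\mathbb{E}(E_1 E_2) = \mathbb{E}(E_1)\,\mathbb{E}(E_2) \leq 1,$$
and since $E_1 E_2 \geq 0$, the product again satisfies the definition of an e-value. The same argument extends by induction to any finite number of independent e-values.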

6. Conclusions

The meta-analysis of p-values poses some challenges, especially in today’s world in which academic and scientific achievements are largely measured (and funded) by the number of papers published, thus putting much pressure on researchers. For this reason, possibly some (but almost certainly very few) of the $P_k$’s, $k = 1, \dots, n$, to be used in a statistic $T(P_1, \dots, P_n)$ are fake p-values (minimum of Two P), when in an honest world, they should all be genuine p-values (not Two P). Therefore, it is a good idea to compare the conclusions drawn from different combined tests, assuming that among the observed $p_k$’s there are $n_f = 0, 1, \dots, j \ll n$ fake p-values, to ensure a more informed decision on the overall hypothesis.
Tables of quantiles for the most used methods for combining p-values that take into consideration the existence of a small number of fake p-values in a sample, obtained by the authors and provided in Brilhante et al. [10], can be a useful tool for assessing the reliability of the conclusions drawn from meta-analyses of p-values when fake p-values may be present but cannot be identified.
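To give a feeling for how such quantiles behave, the following Monte Carlo sketch approximates a quantile of Fisher’s statistic, $-2\sum_{k=1}^{n} \ln P_k$, when $n_f$ of the $n$ p-values are fake. It is only an illustration under stated assumptions, not the procedure used to produce the tables in [10]: each fake p-value is simulated as the minimum of two independent Uniform(0,1) variables, i.e., a Beta(1,2) variable, the function name and defaults are hypothetical, and NumPy is assumed.

```python
import numpy as np


def fisher_quantile_with_fakes(n, n_fake, prob=0.95, n_sim=100_000, seed=0):
    """Monte Carlo quantile of Fisher's statistic with n_fake fake p-values.

    Genuine p-values are Uniform(0,1); each fake p-value is the minimum of
    two independent Uniform(0,1) variables, hence Beta(1,2).
    """
    if not 0 <= n_fake <= n:
        raise ValueError("n_fake must be between 0 and n")
    rng = np.random.default_rng(seed)
    # 1 - random() lies in (0, 1], which avoids taking log(0)
    genuine = 1.0 - rng.random((n_sim, n - n_fake))
    fake = (1.0 - rng.random((n_sim, n_fake, 2))).min(axis=2)
    p = np.concatenate([genuine, fake], axis=1)
    t = -2.0 * np.log(p).sum(axis=1)         # Fisher's statistic per replicate
    return float(np.quantile(t, prob))
```

Comparing, say, `fisher_quantile_with_fakes(n, 0)` with `fisher_quantile_with_fakes(n, 1)` shows how the critical value moves upward once a single fake p-value is allowed for, which is the kind of comparison recommended above.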

Author Contributions

Conceptualization, M.F.B., M.I.G., S.M., D.P. and R.S.; Funding acquisition, M.I.G.; Investigation, M.F.B., M.I.G., S.M., D.P. and R.S.; Methodology, M.F.B., M.I.G., S.M., D.P. and R.S.; Project administration, M.F.B., D.P. and R.S.; Resources, M.F.B., M.I.G. and R.S.; Software, M.F.B. and R.S.; Supervision, M.F.B., M.I.G. and D.P.; Writing—original draft, M.F.B., M.I.G., S.M., D.P. and R.S.; Writing—review and editing, M.F.B., M.I.G., S.M., D.P. and R.S. All authors have read and agreed to the published version of the manuscript.

Funding

Research partially financed by national funds through FCT—Fundação para a Ciência e a Tecnologia, Portugal, under the project UIDB/00006/2020 (https://doi.org/10.54499/UIDB/00006/2020).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

References

  1. Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1900, 50, 157–175.
  2. Arbuthnot, J. An argument for divine providence, taken from the constant regularity observ’d in the births of both sexes. Philos. Trans. R. Soc. Lond. 1710, 27, 186–190.
  3. Fisher, R.A. On the interpretation of χ2 from contingency tables, and the calculation of P. J. R. Stat. Soc. 1922, 85, 87–94.
  4. Fisher, R.A. Statistical Methods for Research Workers; Oliver and Boyd: Edinburgh, UK, 1925.
  5. Fisher, R.A. The arrangement of field experiments. J. Minist. Agric. 1926, 33, 503–515.
  6. Greenwald, A.G.; Gonzalez, R.; Harris, R.J.; Guthrie, D. Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology 1996, 33, 175–183.
  7. Colquhoun, D. The reproducibility of research and the misinterpretation of p-values. R. Soc. Open Sci. 2017, 4, 171085.
  8. Tippett, L.H.C. The Methods of Statistics; Williams & Norgate: London, UK, 1931.
  9. Fisher, R.A. Statistical Methods for Research Workers, 4th ed.; Oliver and Boyd: London, UK, 1932.
  10. Brilhante, M.F.; Gomes, M.I.; Mendonça, S.; Pestana, D.; Santos, R. Meta-analysis of genuine and fake p-values. Preprints 2024.
  11. Wasserstein, R.L.; Lazar, N.A. The ASA statement on p-values: Context, process, and purpose. Am. Stat. 2016, 70, 129–133.
  12. Wasserstein, R.L.; Schirm, A.L.; Lazar, N.A. Moving to a world beyond “p < 0.05”. Am. Stat. 2019, 73, 1–19.
  13. Jin, Z.C.; Zhou, X.H.; He, J. Statistical methods for dealing with publication bias in meta-analysis. Stat. Med. 2015, 34, 343–360.
  14. Lin, L.; Chu, H. Quantifying publication bias in meta-analysis. Biometrics 2018, 74, 785–794.
  15. Givens, G.H.; Smith, D.D.; Tweedie, R.L. Publication bias in meta-analysis: A Bayesian data-augmentation approach to account for issues exemplified in the passive smoking debate. Stat. Sci. 1997, 12, 221–250.
  16. Fisher, R.A. Has Mendel’s work been rediscovered? Ann. Sci. 1936, 1, 115–137.
  17. Deng, L.Y.; George, E.O. Some characterizations of the uniform distribution with applications to random number generation. Ann. Inst. Stat. Math. 1992, 44, 379–385.
  18. Wilkinson, B. A statistical consideration in psychological research. Psychol. Bull. 1951, 48, 156–158.
  19. Simes, R.J. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986, 73, 751–754.
  20. Edgington, E.S. An additive method for combining probability values from independent experiments. J. Psychol. 1972, 80, 351–363.
  21. Pearson, K. On a method of determining whether a sample of size n supposed to have been drawn from a parent population having a known probability integral has probably been drawn at random. Biometrika 1933, 25, 379–410.
  22. Brilhante, M.F.; Gomes, M.I.; Mendonça, S.; Pestana, D.; Pestana, P. Generalized Beta models and population growth, so many routes to chaos. Fractal Fract. 2023, 7, 194.
  23. Pearson, K. On a new method of determining “goodness of fit”. Biometrika 1934, 26, 425–442.
  24. Owen, A.B. Karl Pearson’s meta-analysis revisited. Ann. Stat. 2009, 37, 3867–3892.
  25. Stouffer, S.A.; Schuman, E.A.; DeVinney, L.C.; Star, S.; Williams, R.M. The American Soldier: Adjustment during Army Life; Princeton University Press: Princeton, NJ, USA, 1949; Volume I.
  26. Mudholkar, G.S.; George, E.O. The logit method for combining probabilities. In Symposium on Optimizing Methods in Statistics; Rustagi, J., Ed.; Academic Press: New York, NY, USA, 1979; pp. 345–366.
  27. Birnbaum, A. Combining independent tests of significance. J. Am. Stat. Assoc. 1954, 49, 559–574.
  28. Mosteller, F.; Bush, R. Selected quantitative techniques. In Handbook of Social Psychology: Theory and Methods; Lindzey, G., Ed.; Addison-Wesley: Cambridge, MA, USA, 1954.
  29. Littell, R.C.; Folks, L.J. Asymptotic optimality of Fisher’s method of combining independent tests, I. J. Am. Stat. Assoc. 1971, 66, 802–806.
  30. Littell, R.C.; Folks, L.J. Asymptotic optimality of Fisher’s method of combining independent tests, II. J. Am. Stat. Assoc. 1973, 68, 193–194.
  31. Loughin, T.M. A systematic comparison of methods for combining p-values from independent tests. Comput. Stat. Data Anal. 2004, 47, 467–485.
  32. Hartung, J.; Knapp, G.; Sinha, B.K. Statistical Meta-Analysis with Applications; Wiley: Hoboken, NJ, USA, 2008.
  33. Kulinskaya, E.; Morgenthaler, S.; Staudte, R.G. Meta Analysis. A Guide to Calibrating and Combining Statistical Evidence; Wiley: Chichester, UK, 2008.
  34. Tsui, K.; Weerahandi, S. Generalized p-values in significance testing of hypothesis in the presence of nuisance parameters. J. Am. Stat. Assoc. 1989, 84, 602–607.
  35. Weerahandi, S. Exact Statistical Methods for Data Analysis; Springer: New York, NY, USA, 1995.
  36. Hung, H.; O’Neill, R.; Bauer, P.; Kohn, K. The behavior of the p-value when the alternative is true. Biometrics 1997, 53, 11–22.
  37. Brilhante, M.F. Generalized p-values and random p-values when the alternative to uniformity is a mixture of a Beta(1,2) and uniform. In Recent Developments in Modeling and Applications in Statistics; Oliveira, P., Temido, M., Henriques, C., Vichi, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 159–167.
  38. Pires, A.M.; Branco, J.A. A statistical model to explain the Mendel-Fisher controversy. Stat. Sci. 2010, 25, 545–565.
  39. Franklin, A.; Edwards, A.W.; Fairbanks, D.J.; Hartl, D.L. Ending the Mendel-Fisher Controversy; University of Pittsburgh Press: Pittsburgh, PA, USA, 2008.
  40. Gomes, M.I.; Pestana, D.; Sequeira, F.; Mendonça, S.; Velosa, S. Uniformity of offsprings from uniform and non-uniform parents. In Proceedings of the ITI 2009, 31st International Conference on Information Technology Interfaces, Cavtat/Dubrovnik, Croatia, 22–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 243–248.
  41. Brilhante, M.; Pestana, D.; Sequeira, F. Combining p-values and random p-values. In Proceedings of the ITI 2010, 32nd International Conference on Information Technology Interfaces, Cavtat/Dubrovnik, Croatia, 21–24 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 515–520.
  42. Good, I.J. On the weighted combination of significance tests. J. R. Stat. Soc. Ser. B Stat. Methodol. 1955, 17, 264–265.
  43. Bhoj, D.S. On the distribution of the weighted combination of independent probabilities. Stat. Probab. Lett. 1992, 15, 37–40.
  44. Chuang, L.L.; Shih, Y.S. Approximated distributions of the weighted sum of correlated chi-squared random variables. J. Stat. Plan. Inference 2012, 142, 457–472.
  45. Hou, C.D. A simple approximation for the distribution of the weighted combination of non-independent or independent probabilities. Stat. Probab. Lett. 2005, 73, 179–187.
  46. Makambi, K.H. Weighted inverse chi-square method for correlated significance tests. J. Appl. Stat. 2003, 30, 225–234.
  47. Yang, T.S. A New Weighted Combination Procedure. Master’s Thesis, Fu Jen Catholic University, Taipei, Taiwan, 2012.
  48. Alves, G.; Yu, Y.K. Combining independent weighted P-values: Achieving computational stability by a systematic expansion with controllable accuracy. PLoS ONE 2011, 6, e22647.
  49. Lancaster, H. The combination of probabilities: An application of orthonormal functions. Aust. J. Stat. 1961, 3, 20–33.
  50. Chen, Z. Is the weighted z-test the best method for combining probabilities from independent tests? J. Evol. Biol. 2011, 24, 926–930.
  51. Dai, H.; Leeder, J.S.; Cui, Y. A modified generalized Fisher method for combining probabilities from dependent tests. Front. Genet. 2014, 5, 32.
  52. Hou, C.D.; Yang, T.S. Distribution of weighted Lancaster’s statistic for combining independent or dependent P-values, with applications to human genetic studies. Commun. Stat. Theory Methods 2023, 52, 7442–7454.
  53. Zhang, H.; Wu, Z. The generalized Fisher’s combination and accurate p-value calculation under dependence. Biometrics 2022, 79, 1159–1172.
  54. Cinar, O.; Viechtbauer, W. The poolr package for combining independent and dependent p values. J. Stat. Softw. 2022, 101, 1–42.
  55. Liu, J.Z.; Mcrae, A.F.; Nyholt, D.R.; Medland, S.E.; Wray, N.R.; Brown, K.M.; AMFS Investigators; Hayward, N.K.; Montgomery, G.W.; Visscher, P.M.; et al. A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet. 2010, 87, 139–145.
  56. Holm, S. A simple sequentially multiple test procedure. Scand. J. Stat. 1979, 6, 65–70.
  57. Dewey, M. metap: Meta-Analysis of Significance Values, R Package Version 1.9; RDocumentation. Available online: https://www.rdocumentation.org/packages/metap/versions/1.9 (accessed on 2 September 2024).
  58. Becker, B.J. Combining significance levels. In The Handbook of Research Synthesis; Cooper, H., Hedges, L.V., Eds.; Russell Sage Foundation: New York, NY, USA, 1994; pp. 215–230.
  59. Liu, Y.; Xie, J. Cauchy combination test: A powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 2020, 115, 393–402.
  60. Ham, H.; Park, T. Combining p-values from various statistical methods for microbiome data. Front. Microbiol. 2022, 13, 990870.
  61. Brown, M.B. A method for combining non-independent, one-sided tests of significance. Biometrics 1975, 31, 987–992.
  62. Yang, J.J. Distribution of Fisher’s combination statistic when the tests are dependent. J. Stat. Comput. Simul. 2010, 80, 1–12.
  63. Kost, J.T.; McDermott, M.P. Combining dependent p-values. Stat. Probab. Lett. 2002, 60, 183–190.
  64. Li, Q.; Hu, J.; Ding, J.; Zheng, G. Fisher’s method of combining dependent statistics using generalizations of the gamma distribution with applications to genetic pleiotropic associations. Biostatistics 2014, 15, 284–295.
  65. Chen, Z. Optimal tests for combining p-values. Appl. Sci. 2022, 12, 322.
  66. Wilson, D.J. The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. USA 2019, 116, 1195–1200.
  67. Chien, L.C. Combining dependent p-values by gamma distributions. Stat. Appl. Genet. Mol. Biol. 2020, 19, 20190057.
  68. Vovk, V.; Wang, R. Combining p-values via averaging. Biometrika 2020, 107, 791–808.
  69. Shafer, G.; Shen, A.; Vereshchagin, N.; Vovk, V. Test martingales, Bayes factors and p-values. Stat. Sci. 2011, 26, 84–101.
  70. Grünwald, P.; De Heide, R.; Koolen, W.M. Safe testing. Information Theory and Applications Workshop (ITA). J. R. Stat. Soc. Ser. B 2020, 1–54.
  71. Shafer, G. Testing by betting: A strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A (Stat. Soc.) 2021, 184, 407–431.
  72. Vovk, V.; Wang, R. E-values: Calibration, combination and applications. Ann. Stat. 2021, 49, 1736–1754.
  73. Vovk, V.; Wang, B.; Wang, R. Admissible ways of merging p-values under arbitrary dependence. Ann. Stat. 2022, 50, 351–375.
  74. Vuursteen, L.; Szabó, B.; van der Vaart, A.; van Zanten, H. Optimal testing using combined test statistics across independent studies. In Proceedings of the Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Brilhante, M.F.; Gomes, M.I.; Mendonça, S.; Pestana, D.; Santos, R. Two P or Not Two P: Mendel Random Variables in Combining Fake and Genuine p-Values. AppliedMath 2024, 4, 1128-1142. https://doi.org/10.3390/appliedmath4030060
